# SEC EDGAR Data Analysis

## Introduction

In this notebook, we analyze SEC EDGAR data: the financial reports that companies file with the U.S. Securities and Exchange Commission (SEC). The dataset contains the following columns:

- `cik`: The Central Index Key for the filing entity.
- `name`: The name of the entity.
- `ticker`: The ticker symbol of the entity.
- `sic`: The Standard Industrial Classification code for the filing.
- `adsh`: The Accession Number for the submission.
- `countryba`: The ISO country code for the filing's business address.
- `stprba`: The region for the filing's business address.
- `cityba`: The city for the filing's business address.
- `zipba`: The zip code for the filing's business address.
- `bas1`: The street address for the filing's business address.
- `form`: The submission type of the filing.
- `period`: The period end date.
- `fy`: The fiscal year focus, i.e. the fiscal year the report relates to.
- `fp`: The fiscal period focus (Q1, Q2, Q3, FY).
- `filed`: The date the report was filed.
- `accepted`: The date the report was accepted.
- `prevrpt`: The Accession Number for the previous report.
- `detail`: The file name of the primary financial statements and notes.
- `instance`: The file name of the XBRL instance document.
- `nciks`: The number of Central Index Keys (registrants) included in the filing.
- `aciks`: The additional Central Index Keys of any co-registrants included in the submission.
- `year`: The year of the filing.
- `quarter`: The quarter of the filing.
- `month`: The month of the filing.
- `day`: The day of the filing.
- `hour`: The hour of the filing.

The goal is to look up a company by its ticker, retrieve its filings and XBRL facts from the SEC's public endpoints, and export the results for further analysis.

## Libraries

We will use the following libraries in this notebook:

- `pandas` for data manipulation.
- `requests` for making HTTP requests.
- `json` for reading and writing JSON.
- `csv` for writing CSV files.
- `zipfile` for reading the bulk data archive.
- `logging` for logging operations.
- `os` for file operations.

## Custom Functions

Custom helper functions for this analysis are kept in the following file:

- `edgar_functions.py`: This file contains custom functions for analyzing the SEC EDGAR data.

Let's load the data and take a look at the first few rows.



In [None]:
# Initialize the environment

import pandas as pd
import requests
import json
import os
import csv
import zipfile
import logging

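# The SEC expects a User-Agent that identifies the requester; replace this with your own email address
headers = {"User-Agent": "amr@bashconsultants.com"}

def cik_ticker(ticker, headers=headers):
    """Return the 10-digit, zero-padded CIK for a ticker symbol."""
    # The SEC ticker list uses "-" where some sources use "."
    ticker = ticker.upper().replace(".", "-")
    response = requests.get(
        "https://www.sec.gov/files/company_tickers.json", headers=headers
    )
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    ticker_json = response.json()

    for company in ticker_json.values():
        if company["ticker"] == ticker:
            return str(company["cik_str"]).zfill(10)

    raise ValueError(f"Ticker {ticker} not found in SEC database")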
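`cik_ticker` downloads the full ticker list on every call and scans it linearly. When many tickers are needed, it may be cheaper to fetch the list once and build a lookup table. The sketch below shows that idea offline; `build_ticker_index` is a hypothetical helper, and the sample payload is a single hand-written entry in the shape the loop above iterates over.

```python
def build_ticker_index(ticker_json):
    """Map upper-case ticker symbols to 10-digit, zero-padded CIK strings."""
    return {
        company["ticker"].upper(): str(company["cik_str"]).zfill(10)
        for company in ticker_json.values()
    }

# Hand-written sample in the shape of company_tickers.json (one entry only)
sample_payload = {
    "0": {"cik_str": 1576940, "ticker": "CCS", "title": "Century Communities, Inc."}
}

ticker_index = build_ticker_index(sample_payload)
print(ticker_index["CCS"])  # 0001576940
```

With the real payload, the index would be built once from `requests.get(...).json()` and reused for every lookup.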

In [11]:
# Select the ticker you want to get the CIK for and run the function

ticker = "ccs"
cik_id = cik_ticker(ticker)
print(cik_id)

0001576940


In [29]:
# Get the submissions JSON for the company, using the CIK looked up from the ticker

def get_submission_data_for_ticker(ticker, headers=headers, only_filings_df=False):
    """
    Get the SEC submissions data for a given ticker.

    The JSON response includes keys such as: 'cik', 'entityType', 'sic', 'sicDescription', 'insiderTransactionForOwnerExists', 'insiderTransactionForIssuerExists', 'name', 'tickers', 'exchanges', 'ein', 'description', 'website', 'investorWebsite', 'category', 'fiscalYearEnd', 'stateOfIncorporation', 'stateOfIncorporationDescription', 'addresses', 'phone', 'flags', 'formerNames', 'filings'

    Args:
        ticker (str): The ticker symbol of the company.
        headers (dict): HTTP headers sent with each request.
        only_filings_df (bool): If True, return only the recent filings as a DataFrame.

    Returns:
        dict or pandas.DataFrame: The full submissions JSON, or the recent filings table.

    Raises:
        ValueError: If the ticker is not found in the SEC database.
    """
    cik = cik_ticker(ticker, headers=headers)  # pass the caller's headers through
    url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    company_json = requests.get(url, headers=headers).json()
    if only_filings_df:
        return pd.DataFrame(company_json["filings"]["recent"])
    return company_json

In [20]:
# Fetch the full submissions JSON for the selected ticker
submission_data = get_submission_data_for_ticker(ticker, only_filings_df=False)
print(submission_data)


{'cik': '1576940', 'entityType': 'operating', 'sic': '1531', 'sicDescription': 'Operative Builders', 'insiderTransactionForOwnerExists': 0, 'insiderTransactionForIssuerExists': 1, 'name': 'Century Communities, Inc.', 'tickers': ['CCS'], 'exchanges': ['NYSE'], 'ein': '680521411', 'description': '', 'website': '', 'investorWebsite': '', 'category': 'Large accelerated filer', 'fiscalYearEnd': '1231', 'stateOfIncorporation': 'DE', 'stateOfIncorporationDescription': 'DE', 'addresses': {'mailing': {'street1': '8390 E. CRESCENT PKWY., SUITE 650', 'street2': None, 'city': 'GREENWOOD VILLAGE', 'stateOrCountry': 'CO', 'zipCode': '80111', 'stateOrCountryDescription': 'CO'}, 'business': {'street1': '8390 E. CRESCENT PKWY., SUITE 650', 'street2': None, 'city': 'GREENWOOD VILLAGE', 'stateOrCountry': 'CO', 'zipCode': '80111', 'stateOrCountryDescription': 'CO'}}, 'phone': '303.770.8300', 'flags': '', 'formerNames': [], 'filings': {'recent': {'accessionNumber': ['0001140361-24-015429', '0001140361-24-0
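The `filings` → `recent` part of this response is columnar: each key holds a list, and position `i` in every list describes the same filing. That is why `pd.DataFrame(...)` can turn it into a table directly. A minimal offline sketch, using a hand-trimmed subset with only three of the keys:

```python
import pandas as pd

# Illustrative subset of company_json["filings"]["recent"]: parallel lists, one position per filing
recent = {
    "accessionNumber": ["0001576940-24-000005", "0001576940-23-000005"],
    "form": ["10-K", "10-K"],
    "reportDate": ["2023-12-31", "2022-12-31"],
}

recent_df = pd.DataFrame(recent)
# Dates arrive as strings; convert them so they sort and filter as dates
recent_df["reportDate"] = pd.to_datetime(recent_df["reportDate"])
print(recent_df)
```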

In [21]:
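def export_to_json(data, cik_id, filename):
    """Write a dict or DataFrame to company-<filename>-<cik_id>.json."""
    if isinstance(data, pd.DataFrame):
        # to_json() returns a string; parse it so json.dump writes an object, not a quoted string
        data = json.loads(data.to_json())
    with open(f'company-{filename}-{cik_id}.json', 'w') as json_file:
        json.dump(data, json_file, indent=3)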

In [23]:
data_dict = submission_data  # replace with your actual data
filename = "submissions"
export_to_json(data_dict, cik_id, filename)
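One thing to watch when exporting a DataFrame this way: `DataFrame.to_json()` already returns a JSON string, so passing that string to `json.dump` encodes it a second time and the file ends up containing one long quoted string. The small check below illustrates the difference with a throwaway DataFrame.

```python
import json
import pandas as pd

df = pd.DataFrame({"form": ["10-K", "10-Q"]})

# Dumping the string returned by to_json() encodes it a second time
double_encoded = json.dumps(df.to_json())
# Parsing it first gives a plain dict, which is written as a JSON object
as_object = json.dumps(json.loads(df.to_json()))

print(type(json.loads(double_encoded)).__name__)  # str
print(type(json.loads(as_object)).__name__)       # dict
```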

In [30]:
def get_filtered_filings(
    ticker, ten_k=True, just_accession_numbers=False, headers=headers
):
    """Return the company's recent 10-K filings (or 10-Q filings if ten_k is False).

    With just_accession_numbers=True, return a Series of accession numbers
    indexed by report date instead of the full DataFrame.
    """
    company_filings_df = get_submission_data_for_ticker(
        ticker, only_filings_df=True, headers=headers
    )
    form_type = "10-K" if ten_k else "10-Q"
    df = company_filings_df[company_filings_df["form"] == form_type]
    if just_accession_numbers:
        return df.set_index("reportDate")["accessionNumber"]
    return df

In [31]:
filings = get_filtered_filings(ticker, ten_k=True, just_accession_numbers=True, headers=headers)

filings

reportDate
2023-12-31    0001576940-24-000005
2022-12-31    0001576940-23-000005
2021-12-31    0001576940-22-000006
2020-12-31    0001576940-21-000009
2019-12-31    0001562762-20-000031
2018-12-31    0001576940-19-000023
2017-12-31    0001576940-18-000052
2016-12-31    0001576940-17-000023
2015-12-31    0001576940-16-000025
2014-12-31    0001562762-15-000061
Name: accessionNumber, dtype: object
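An accession number together with the CIK is usually enough to locate a filing's folder in the EDGAR archive. The sketch below assumes the commonly used layout `Archives/edgar/data/<CIK without leading zeros>/<accession number without dashes>/`; `filing_index_url` is a hypothetical helper, not part of this notebook, and the layout should be checked against the SEC's documentation before relying on it.

```python
def filing_index_url(cik, accession_number):
    """Build the EDGAR archive folder URL for one filing (assumed URL layout)."""
    cik_int = int(cik)                          # archive paths drop the CIK's leading zeros
    folder = accession_number.replace("-", "")  # and the dashes in the accession number
    return f"https://www.sec.gov/Archives/edgar/data/{cik_int}/{folder}/"

url = filing_index_url("0001576940", "0001576940-24-000005")
print(url)  # https://www.sec.gov/Archives/edgar/data/1576940/000157694024000005/
```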

In [32]:
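# Get the XBRL company facts, using the CIK looked up from the ticker

def get_facts(ticker, headers=headers):
    """Return the XBRL company facts JSON for a ticker."""
    cik = cik_ticker(ticker, headers=headers)  # pass the caller's headers through
    url = f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json"
    company_facts = requests.get(url, headers=headers).json()
    return company_facts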

In [33]:
# Get the facts for the company
facts = get_facts(ticker)
facts

{'cik': 1576940,
 'entityName': 'Century Communities, Inc.',
 'facts': {'dei': {'EntityCommonStockSharesOutstanding': {'label': 'Entity Common Stock, Shares Outstanding',
    'description': "Indicate number of shares or other units outstanding of each of registrant's classes of capital or common stock or other ownership interests, if and as stated on cover of related periodic report. Where multiple classes or units exist define each class/interest by adding class of stock items such as Common Class A [Member], Common Class B [Member] or Partnership Interest [Member] onto the Instrument [Domain] of the Entity Listings, Instrument.",
    'units': {'shares': [{'end': '2014-08-05',
       'val': 21504704,
       'accn': '0001562762-14-000228',
       'fy': 2014,
       'fp': 'Q2',
       'form': '10-Q',
       'filed': '2014-08-13',
       'frame': 'CY2014Q2I'},
      {'end': '2014-11-10',
       'val': 21483528,
       'accn': '0001562762-14-000325',
       'fy': 2014,
       'fp': 'Q3',


In [34]:
# get the account facts for the company for us-gaap

facts["facts"]["us-gaap"]

{'AccountsPayableCurrentAndNoncurrent': {'label': 'Accounts Payable',
  'description': "Carrying value as of the balance sheet date of liabilities incurred (and for which invoices have typically been received) and payable to vendors for goods and services received that are used in an entity's business.",
  'units': {'USD': [{'end': '2013-12-31',
     'val': 8313000,
     'accn': '0001562762-14-000228',
     'fy': 2014,
     'fp': 'Q2',
     'form': '10-Q',
     'filed': '2014-08-13'},
    {'end': '2013-12-31',
     'val': 8313000,
     'accn': '0001562762-14-000325',
     'fy': 2014,
     'fp': 'Q3',
     'form': '10-Q',
     'filed': '2014-11-14'},
    {'end': '2013-12-31',
     'val': 8313000,
     'accn': '0001562762-15-000061',
     'fy': 2014,
     'fp': 'FY',
     'form': '10-K',
     'filed': '2015-03-06',
     'frame': 'CY2013Q4I'},
    {'end': '2014-06-30',
     'val': 11267000,
     'accn': '0001562762-14-000228',
     'fy': 2014,
     'fp': 'Q2',
     'form': '10-Q',
     'f
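The output shows that the same value (for example, accounts payable at `2013-12-31`) is repeated in every later filing that reports it. To get one value per period end, one option is to keep only the most recently filed row for each `end` date. The sketch below does this on a trimmed copy of the structure printed above; `concept_series` is a hypothetical helper, and keeping the latest filing is an assumption about which duplicate is preferred.

```python
import pandas as pd

# Trimmed copy of the us-gaap structure: one concept, unit "USD", a list of reported values
sample_us_gaap = {
    "AccountsPayableCurrentAndNoncurrent": {
        "label": "Accounts Payable",
        "units": {
            "USD": [
                {"end": "2013-12-31", "val": 8313000, "form": "10-Q", "filed": "2014-08-13"},
                {"end": "2013-12-31", "val": 8313000, "form": "10-K", "filed": "2015-03-06"},
                {"end": "2014-06-30", "val": 11267000, "form": "10-Q", "filed": "2014-08-13"},
            ]
        },
    }
}

def concept_series(us_gaap, concept, unit="USD"):
    """Return one value per period end, keeping the most recently filed report."""
    df = pd.DataFrame(us_gaap[concept]["units"][unit])
    df["end"] = pd.to_datetime(df["end"])
    df["filed"] = pd.to_datetime(df["filed"])
    df = df.sort_values("filed").drop_duplicates(subset="end", keep="last")
    return df.set_index("end")["val"].sort_index()

series = concept_series(sample_us_gaap, "AccountsPayableCurrentAndNoncurrent")
print(series)
```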

In [35]:
us_gaap_levels = facts["facts"]["us-gaap"].keys()
us_gaap_levels


dict_keys(['AccountsPayableCurrentAndNoncurrent', 'AccountsPayableOtherCurrentAndNoncurrent', 'AccountsReceivableNet', 'AccrualForTaxesOtherThanIncomeTaxesCurrentAndNoncurrent', 'AccruedIncomeTaxes', 'AccruedLiabilitiesAndOtherLiabilities', 'AccumulatedDepreciationDepletionAndAmortizationPropertyPlantAndEquipment', 'AcquisitionCosts', 'AdditionalPaidInCapitalCommonStock', 'AdjustmentsToAdditionalPaidInCapitalOther', 'AdjustmentsToAdditionalPaidInCapitalSharebasedCompensationRequisiteServicePeriodRecognitionValue', 'AdjustmentsToAdditionalPaidInCapitalTaxEffectFromShareBasedCompensation', 'AllocatedShareBasedCompensationExpense', 'AllowanceForDoubtfulAccountsReceivable', 'AmortizationOfIntangibleAssets', 'AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount', 'AssetImpairmentCharges', 'Assets', 'AssetsHeldForSaleNotPartOfDisposalGroupCurrent', 'BilledContractReceivables', 'BillingsInExcessOfCost', 'BusinessAcquisitionCostOfAcquiredEntityTransactionCosts', 'BusinessAcqui

In [36]:
# export the account facts to a csv file

import csv

with open('acct_facts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Write headers
    writer.writerow(["us_gaap_list", "acct_label", "acct_description"])
    
    for us_gaap_list in facts["facts"]["us-gaap"]:
        acct_label = facts["facts"]["us-gaap"][us_gaap_list]["label"]
        acct_description = facts["facts"]["us-gaap"][us_gaap_list]["description"]
        print(f"{us_gaap_list}, {acct_label}, {acct_description}")
        writer.writerow([us_gaap_list, acct_label, acct_description])

AccountsPayableCurrentAndNoncurrent, Accounts Payable, Carrying value as of the balance sheet date of liabilities incurred (and for which invoices have typically been received) and payable to vendors for goods and services received that are used in an entity's business.
AccountsPayableOtherCurrentAndNoncurrent, Accounts Payable, Other, Amount of obligations incurred and payable classified as other.
AccountsReceivableNet, Accounts Receivable, after Allowance for Credit Loss, Amount, after allowance for credit loss, of right to consideration from customer for product sold and service rendered in normal course of business.
AccrualForTaxesOtherThanIncomeTaxesCurrentAndNoncurrent, Accrual for Taxes Other than Income Taxes, Carrying value as of the balance sheet date of obligations incurred and payable for real and property taxes.
AccruedIncomeTaxes, Accrued Income Taxes, Carrying amount as of the balance sheet date of the unpaid sum of the known and estimated amounts payable to satisfy all 
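Note that the printed lines above are ambiguous, because labels such as `Accounts Payable, Other` contain commas themselves. The CSV file does not have this problem: `csv.writer` quotes any field that contains the delimiter. A small in-memory check, using one row from the output above:

```python
import csv
import io

row = [
    "AccountsPayableOtherCurrentAndNoncurrent",
    "Accounts Payable, Other",
    "Amount of obligations incurred and payable classified as other.",
]

# Write the row to an in-memory buffer instead of a file
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line)

# Reading the line back recovers exactly three fields
parsed = next(csv.reader(io.StringIO(line)))
print(len(parsed))  # 3
```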

## Processing bulk data

In [None]:
import csv
import zipfile
import json
import logging

# Set up logging
logging.basicConfig(filename='error_log.txt', level=logging.ERROR)

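def facts_DF():
    """Yield (fact, label, description) for every us-gaap fact in companyfacts.zip.

    Despite the name, this is a generator of tuples, not a DataFrame.
    """
    with zipfile.ZipFile('companyfacts.zip', 'r') as z:
        for filename in z.namelist():
            try:
                # Members are read one at a time, so the archive is never fully extracted
                with z.open(filename) as f:
                    facts = json.load(f)
                    if 'us-gaap' in facts["facts"]:
                        us_gaap_data = facts["facts"]["us-gaap"]
                        for fact, details in us_gaap_data.items():
                            acct_label = details["label"]
                            acct_description = details["description"]
                            yield fact, acct_label, acct_description
            except Exception as e:
                # Report the problem and continue with the next file
                print(f"Error processing file {filename}: {e}")
                logging.error(f"Error processing file {filename}: {e}")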

seen = set()

with open('acct_facts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Write headers
    writer.writerow(["us_gaap_list", "acct_label", "acct_description"])
    
    json_output = []
    
    for us_gaap_list, acct_label, acct_description in facts_DF():
        # Create a tuple of the row
        row = (us_gaap_list, acct_label, acct_description)
        # If we've already seen this row, skip it
        if row in seen:
            continue
        # Add the row to the set of seen rows
        seen.add(row)
        print(f"{us_gaap_list}, {acct_label}, {acct_description}")
        writer.writerow([us_gaap_list, acct_label, acct_description])
        json_output.append({
            "us_gaap_list": us_gaap_list,
            "acct_label": acct_label,
            "acct_description": acct_description
        })
    
    # Write to JSON file
    with open('acct_facts.json', 'w') as json_file:
        json.dump(json_output, json_file, indent=3)
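The streaming pattern used here (open the archive, read one member at a time, skip files that fail to parse) can be tried without downloading `companyfacts.zip`. The sketch below builds a tiny archive in memory; the member names `broken.json` and `empty.json` and the helper `iter_labels` are made up for the demonstration, and the one valid entry reuses a label and description from the output earlier in this notebook.

```python
import io
import json
import zipfile

# Build a small in-memory archive: one valid file, one corrupt file, one without us-gaap facts
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as z:
    z.writestr("CIK0001576940.json", json.dumps({"facts": {"us-gaap": {
        "AccountsPayableOtherCurrentAndNoncurrent": {
            "label": "Accounts Payable, Other",
            "description": "Amount of obligations incurred and payable classified as other.",
        }}}}))
    z.writestr("broken.json", "not valid json")
    z.writestr("empty.json", json.dumps({"facts": {}}))
archive.seek(0)

def iter_labels(zip_source):
    """Yield (concept, label, description) from each readable member of the archive."""
    with zipfile.ZipFile(zip_source) as z:
        for name in z.namelist():
            try:
                with z.open(name) as f:
                    facts = json.load(f)
            except json.JSONDecodeError:
                continue  # skip corrupt members instead of stopping the whole run
            for concept, details in facts.get("facts", {}).get("us-gaap", {}).items():
                yield concept, details.get("label"), details.get("description")

rows = list(iter_labels(archive))
print(rows)
```

Using `.get()` here means a fact with a missing label yields `None` rather than raising, which is a design choice that differs from the cell above.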

In [None]:
data_dict = facts  # replace with your actual data
filename = "facts"
export_to_json(data_dict, cik_id, filename)

In [None]:
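# Note: this redefines the facts_DF generator from the bulk-processing cell above
def facts_DF(ticker, headers=headers):
    """Return (df, labels_dict) for a ticker's us-gaap facts.

    df has one row per reported value, indexed by period end date;
    labels_dict maps each fact name to its human-readable label.
    """
    facts = get_facts(ticker, headers)
    us_gaap_data = facts["facts"]["us-gaap"]
    df_data = []
    # Flatten fact -> unit -> list of values into one row per value
    for fact, details in us_gaap_data.items():
        for unit in details["units"]:
            for item in details["units"][unit]:
                row = item.copy()
                row["fact"] = fact
                df_data.append(row)

    df = pd.DataFrame(df_data)
    df["end"] = pd.to_datetime(df["end"])
    df["start"] = pd.to_datetime(df["start"])
    # The same value is repeated across filings; keep the first occurrence only
    df = df.drop_duplicates(subset=["fact", "end", "val"])
    df.set_index("end", inplace=True)
    labels_dict = {fact: details["label"] for fact, details in us_gaap_data.items()}
    return df, labels_dict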
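The date conversion in this function relies on a `start` column existing. Point-in-time facts such as accounts payable carry only an `end` date, as the output earlier in the notebook shows, so a company whose facts are all of that kind would produce a DataFrame with no `start` column at all. A defensive variant, sketched here on hand-built rows, creates the column when it is missing:

```python
import pandas as pd

# Hand-built rows shaped like the flattened facts; point-in-time values have no "start" key
rows = [
    {"end": "2013-12-31", "val": 8313000, "fact": "AccountsPayableCurrentAndNoncurrent"},
    {"end": "2014-06-30", "val": 11267000, "fact": "AccountsPayableCurrentAndNoncurrent"},
]
df = pd.DataFrame(rows)

for column in ("start", "end"):
    if column in df.columns:
        df[column] = pd.to_datetime(df[column])
    else:
        df[column] = pd.NaT  # column missing entirely: fill with "not a time"

print(df[["start", "end", "val"]])
```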