# PHASE 1: DATA ACQUISITION AND INGESTION
## DATA ACQUISITION

The purpose of this module is to:  
1. Download the Financial Statement Data Sets from the SEC website: https://www.sec.gov/dera/data/financial-statement-data-sets.html  
2. Unzip the data and store it locally.
3. Prepare the data for merging into a dictionary of pandas DataFrames (the merge itself is performed in the DATA INGESTION module).

In [4]:
# to scrape the SEC page
from bs4 import BeautifulSoup

# to download the zip file data
import requests

# to extract the zip files
import zipfile

# to interpret the zip file stream from requests
# https://stackoverflow.com/questions/9419162/download-returned-zip-file-from-url
import io

# to work with paths
import os

In [5]:
# Open the SEC website
page_url = 'https://www.sec.gov/dera/data/financial-statement-data-sets.html'
r = requests.get(page_url)
html_doc = r.text

# To keep track of all year-quarters (eg 2014q3) discovered
list_of_year_quarters = []

# Find all hyperlinks pointing to .zip files
soup = BeautifulSoup(html_doc, 'html.parser')
for a in soup.find_all('a', href=True):
    if '.zip' in a['href']:
        # The hrefs are relative to the domain: add domain.
        zip_file_url = 'https://www.sec.gov'+a['href']
        year_quarter = zip_file_url[-10:-4]
        list_of_year_quarters.append(year_quarter)
        
        # Avoid duplicate downloading: skip quarters that were already extracted.
        os.makedirs('./SEC_Datasets', exist_ok=True)
        if year_quarter not in os.listdir('./SEC_Datasets'):
            # Download the zip file into memory, then extract it and print a status update.
            zip_data = requests.get(zip_file_url)
            zip_data.raise_for_status()
            with zipfile.ZipFile(io.BytesIO(zip_data.content), 'r') as zip_data_file:
                zip_data_file.extractall('./SEC_Datasets/'+year_quarter)
            print('downloaded and extracted '+year_quarter)
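
The skip-if-present logic above can be tried without touching the network. The following is a minimal sketch, not the notebook's actual code: the helper names `year_quarter_from_url` and `extract_if_missing` are hypothetical, the URL is a made-up placeholder, and a tiny in-memory archive stands in for a real SEC download.

```python
import io
import os
import tempfile
import zipfile


def year_quarter_from_url(zip_url):
    """Return the archive's base name without '.zip', e.g. '2018q4'."""
    return os.path.splitext(os.path.basename(zip_url))[0]


def extract_if_missing(zip_bytes, year_quarter, root_dir):
    """Extract zip_bytes into root_dir/year_quarter unless that folder exists.

    Returns True if files were extracted, False if the step was skipped.
    """
    target_dir = os.path.join(root_dir, year_quarter)
    if os.path.isdir(target_dir):
        return False
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as archive:
        archive.extractall(target_dir)
    return True


# Build a tiny stand-in archive in memory so the example runs offline.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('sub.txt', 'adsh\tcik\n')

demo_root = tempfile.mkdtemp()
demo_url = 'https://example.com/files/2018q4.zip'  # hypothetical URL
quarter = year_quarter_from_url(demo_url)
first_run = extract_if_missing(buffer.getvalue(), quarter, demo_root)
second_run = extract_if_missing(buffer.getvalue(), quarter, demo_root)
```

Deriving the quarter from the file name rather than from fixed slice positions is a design choice of this sketch; it gives the same result as `zip_file_url[-10:-4]` whenever the file name has the form `2018q4.zip`.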

## DATA INGESTION

The purpose of this module is to:  
1. Read the raw .txt data files, acquired from SEC.gov, into pandas DataFrames.
2. Merge all quarterly files into a dictionary of DataFrames.
3. Pickle this dictionary for future use.
4. Generate and pickle a list of the company names and states present in this dataset. This will form the input for the data gathering module.
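
Step 4 is not shown in the cells of this module, so here is one way it might look. This is a sketch under assumptions: it takes the `name` and `stprba` columns of the SUB data as the company name and state, and it uses a small hand-built DataFrame in place of `DataFrames['SUB']`. The variable names `sub`, `companies`, and `company_list` are illustrative.

```python
import pickle

import pandas as pd

# Stand-in for DataFrames['SUB']; the second company appears twice on purpose
# to mimic one company filing in several quarters.
sub = pd.DataFrame({
    'name': ['ADVANCED MICRO DEVICES INC',
             'ADAMS RESOURCES & ENERGY, INC.',
             'ADVANCED MICRO DEVICES INC'],
    'stprba': ['CA', 'TX', 'CA'],
})

# Keep one row per (name, state) pair, sorted by company name.
companies = (sub[['name', 'stprba']]
             .drop_duplicates()
             .sort_values('name')
             .reset_index(drop=True))

# Convert to a plain list of (name, state) tuples, which pickles compactly.
company_list = list(companies.itertuples(index=False, name=None))

# Round-trip through pickle in memory; writing to a file works the same way
# with pickle.dump and an open file handle.
restored = pickle.loads(pickle.dumps(company_list))
```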

In [6]:
# to work with data
import pandas as pd

# to write dictionary to disk
import pickle

In [7]:
# Read files into Pandas
# The desired object is a DICTIONARY of DATAFRAMES. Each DATAFRAME consists of all quarterly
# files appended. eg. DataFrames['NUM'] yields a large dataframe with all NUM data across
# all datasets downloaded.

data_file_names = ['NUM','PRE','SUB']

# to avoid duplicate effort during editing.
if 'dict_of_dfs_num_pre_sub_tag.p' in os.listdir('.'):
    with open('dict_of_dfs_num_pre_sub_tag.p', 'rb') as picklefile:
        DataFrames = pickle.load(picklefile)
else:    
    DataFrames = {}

# to avoid duplicate effort during editing        
if not DataFrames:
    # process all files
    results = {}
    for name in data_file_names:
        results[name] = []
        for qtr in list_of_year_quarters:
            filepath = "./SEC_Datasets/" + qtr + "/" + name + '.txt'
            df = pd.read_csv(filepath, 
                     sep='\t', 
                     # This analyzes the entire file in one pass, improving the accuracy of formatting etc.
                     low_memory=False, 
                     header=0,
                     # Although the docs indicate utf-8 encoding, there exist non-compliant characters.
                     # Stack Overflow recommended trying latin-1, which works.
                     encoding='latin-1',
                     # To avoid pandas guessing different dtypes in different files.
                     dtype='str')
            results[name].append(df)
            print('appended '+qtr+' to '+name)
        DataFrames[name] = pd.concat(results[name])
        print('merged '+name)
            
else: print('Files already processed')

Files already processed
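
To see what reading everything as strings and concatenating does, the following minimal sketch parses two small tab-separated snippets from memory instead of the real quarterly files. The rows are illustrative stand-ins (the zip code `07030` is made up to show a leading zero), and `ignore_index=True` is an optional choice of this sketch that the cell above does not use.

```python
import io

import pandas as pd

# Two tiny tab-separated "quarterly files" held in memory.
q1 = "adsh\tcik\tzipba\n0000002178-18-000067\t2178\t07030\n"
q2 = "adsh\tcik\tzipba\n0000002488-18-000189\t2488\t95054\n"

# dtype=str keeps every column as text, so identifiers and zip codes keep
# their leading zeros and every file yields the same column types.
frames = [pd.read_csv(io.StringIO(text), sep='\t', dtype=str)
          for text in (q1, q2)]

# ignore_index=True renumbers the rows 0..n-1 across all files instead of
# repeating each file's own 0-based index.
merged = pd.concat(frames, ignore_index=True)
```

Without `ignore_index=True`, each appended file contributes its own index starting at 0, so index labels repeat in the merged frame.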


In [12]:
DataFrames['SUB'].head(3)

Unnamed: 0,adsh,cik,name,sic,countryba,stprba,cityba,zipba,bas1,bas2,...,period,fy,fp,filed,accepted,prevrpt,detail,instance,nciks,aciks
0,0000002178-18-000067,2178,"ADAMS RESOURCES & ENERGY, INC.",5172,US,TX,HOUSTON,77027,17 S. BRIAR HOLLOW LN.,,...,20180930,2018,Q3,20181107,2018-11-07 16:28:00.0,0,1,ae-20180930_htm.xml,1,
1,0000002488-18-000189,2488,ADVANCED MICRO DEVICES INC,3674,US,CA,SANTA CLARA,95054,2485 AUGUSTINE DRIVE,,...,20180930,2018,Q3,20181031,2018-10-31 16:15:00.0,0,1,amd-20180929.xml,1,
2,0000002969-18-000044,2969,AIR PRODUCTS & CHEMICALS INC /DE/,2810,US,PA,ALLENTOWN,18195-1501,7201 HAMILTON BLVD,,...,20180930,2018,FY,20181120,2018-11-20 14:48:00.0,0,1,apd-10xkx30sep2018_htm.xml,1,


In [18]:
if 'dict_of_dfs_num_pre_sub_tag.p' not in os.listdir('.'):
    with open('dict_of_dfs_num_pre_sub_tag.p', 'wb') as picklefile:
        pickle.dump(DataFrames, picklefile)
else:
    print('file already pickled')

file already pickled
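
The load-if-present, otherwise build-and-pickle pattern used in the cells above could be folded into one helper. This is only a sketch: `load_or_build` and `build_demo` are hypothetical names, and a small dictionary of lists stands in for the dictionary of DataFrames.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return (object, loaded_from_cache).

    If cache_path exists, unpickle and return its contents. Otherwise call
    build(), pickle the result to cache_path, and return it.
    """
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as picklefile:
            return pickle.load(picklefile), True
    result = build()
    with open(cache_path, 'wb') as picklefile:
        pickle.dump(result, picklefile)
    return result, False


calls = []


def build_demo():
    """Stand-in for the expensive read-and-merge step; records each call."""
    calls.append(1)
    return {'NUM': [1, 2], 'SUB': [3]}


cache_path = os.path.join(tempfile.mkdtemp(), 'demo_cache.p')
first, first_from_cache = load_or_build(cache_path, build_demo)
second, second_from_cache = load_or_build(cache_path, build_demo)
```

Note that unpickling can execute arbitrary code, so a cache like this should only be loaded from files you created yourself.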


## Outcome

The working directory now contains the following data files:
1. A folder 'SEC_Datasets' containing the .txt data files acquired from the SEC website.
2. A pickle file 'dict_of_dfs_num_pre_sub_tag.p' containing the dictionary of three DataFrames (NUM, PRE, SUB) of SEC data.  
This will form the input for PHASE II - financial analysis.