# Download and Merge all SEC Files

We download SEC company filing data from this website: https://www.sec.gov/dera/data/financial-statement-and-notes-data-set.html 


This program will generate the following directory (folder) structure:
- inside current directory (where you run this notebook): folder "data"
- inside data folder: folder "sec"
- inside sec folder: folder "downloads", folder "merged" and (created further below) folder "dates"
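
The folder layout above can be sketched as a small helper. This is only an illustration: the function name `make_sec_folders` is hypothetical (the notebook itself creates the folders inside the cells below), and the demo runs in a temporary directory so that nothing in your project is touched. The "dates" folder is the one this notebook creates later for the filing dates.

```python
import tempfile
from pathlib import Path

def make_sec_folders(base="."):
    """Create data/sec/{downloads,merged,dates} under `base` and return the paths."""
    root = Path(base) / "data" / "sec"
    folders = [root / name for name in ("downloads", "merged", "dates")]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)   # No error if the folder already exists.
    return folders

# Try it in a throwaway directory (hypothetical demo, not part of the notebook).
with tempfile.TemporaryDirectory() as tmp:
    created = make_sec_folders(tmp)
    names = [f.name for f in created]
    all_exist = all(f.is_dir() for f in created)

print(names, all_exist)
```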

Import these libraries:

In [1]:
import pandas as pd
import requests, zipfile, io
import os
from pathlib import Path

Our download function:

In [2]:
def download_file(period):
    url = 'https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/'+period+'_notes.zip'
    
    unzip_folder_name = 'data/sec/downloads/' + period                           # Where to put contents of unzipped file  
    
    r = requests.get(url)
    if r.ok:                                                                     # If download worked
        print('Downloaded:', url, 'to:', unzip_folder_name)
        Path(unzip_folder_name).mkdir(parents=True, exist_ok=True)               # Make the folder where we unzip the file to   
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall(members=['sub.tsv','num.tsv'], path=unzip_folder_name)      # Unzip file to the folder we just made
    else:
        print('File not found:', period)

If you previously had trouble downloading the files and are worried about running out of disk space:
1. delete all files in your "downloads" folder
1. empty your trash to make sure that the disk space is actually freed
1. check the available disk space; you need about 40 GB (you can delete the files later)
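
If you prefer to check the free disk space from Python, here is a minimal sketch using the standard library. The helper name `free_space_gb` is an assumption for illustration, and the 40 GB threshold is simply the figure quoted above.

```python
import shutil

def free_space_gb(path="."):
    """Return the free disk space (in GB, counting 10**9 bytes per GB) on the drive holding `path`."""
    return shutil.disk_usage(path).free / 1e9

needed_gb = 40                      # Rough requirement quoted in the text above.
available = free_space_gb(".")      # Check the drive of the current directory.

print(f"Free disk space: {available:.1f} GB")
if available < needed_gb:
    print(f"Warning: less than {needed_gb} GB free, the downloads may not fit.")
```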

Now download the **most recent file** (you can do this every month; for example, in May, download the April file with '2021_04'):

In [8]:
download_file('2021_03')

Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2021_03_notes.zip to: data/sec/downloads/2021_03
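
If you do not want to type the period label by hand each month, it can be derived from today's date. The sketch below assumes the monthly naming pattern `YYYY_MM` seen above; the function name `previous_month_period` is hypothetical.

```python
from datetime import date

def previous_month_period(today=None):
    """Return the 'YYYY_MM' label of the month before `today` (default: the current date)."""
    today = today or date.today()
    year, month = today.year, today.month - 1
    if month == 0:                       # January: go back to December of the previous year.
        year, month = year - 1, 12
    return f"{year}_{month:02d}"

print(previous_month_period(date(2021, 5, 3)))    # 2021_04
print(previous_month_period(date(2021, 1, 15)))   # 2020_12
```

You could then pass the result to `download_file`, keeping in mind that the SEC may not have published the file yet.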


Now download **all previous files** from 2009 Q2 to February 2021:  
(later years have larger file sizes, so the downloads will start fast and then slow down)

In [4]:
for year in [2009]:                          # Get the quarterly files for 2009
    for quarter in [2,3,4]:
        period = str(year)+'q'+str(quarter)
        download_file(period)

for year in range(2010,2020):                # Get the quarterly files 2010 - 2019
    for quarter in [1,2,3,4]:
        period = str(year)+'q'+str(quarter)
        download_file(period)
        
for year in [2020]:                          # Get the quarterly files for 2020
    for quarter in [1,2,3]:
        period = str(year)+'q'+str(quarter)
        download_file(period)        
        
for year in [2020]:                          # Get the monthly files for 2020
    for month in [10,11,12]:
        period = str(year)+'_'+str(month)
        download_file(period)
                
for year in [2021]:                          # Get the monthly files for 2021
    for month in [1,2]:
        period = str(year)+'_0'+str(month) if month<10 else str(year)+'_'+str(month)
        download_file(period)      
        
print('Download finished!')

Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2009q2_notes.zip to: data/sec/downloads/2009q2
Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2009q3_notes.zip to: data/sec/downloads/2009q3
Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2009q4_notes.zip to: data/sec/downloads/2009q4
Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2010q1_notes.zip to: data/sec/downloads/2010q1
Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2010q2_notes.zip to: data/sec/downloads/2010q2
Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2010q3_notes.zip to: data/sec/downloads/2010q3
Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2010q4_notes.zip to: data/sec/downloads/2010q4
Downloaded: https://www.sec.gov/files/dera/data/financi
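
The five loops above can also be written as one list of period labels, which makes it easy to check how many files to expect. This is just a compact sketch of the same ranges; the helper name `all_periods` is hypothetical.

```python
def all_periods():
    """Return the period labels requested by the loops above, in the same order."""
    periods  = [f"2009q{q}" for q in (2, 3, 4)]                                # 2009: Q2 to Q4
    periods += [f"{y}q{q}" for y in range(2010, 2020) for q in (1, 2, 3, 4)]   # 2010 to 2019: all quarters
    periods += [f"2020q{q}" for q in (1, 2, 3)]                                # 2020: Q1 to Q3
    periods += [f"2020_{m:02d}" for m in (10, 11, 12)]                         # 2020: monthly files
    periods += [f"2021_{m:02d}" for m in (1, 2)]                               # 2021: January and February
    return periods

periods = all_periods()
print(len(periods), periods[0], periods[-1])   # 51 2009q2 2021_02
```

A single loop such as `for period in all_periods(): download_file(period)` would then request the same files.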

Now use this function to merge all the files:

In [9]:
def merge_sec_files(folder):
    
    keep_these_columns = ['cik','sic','countryinc','tag','filed','ddate','qtrs','value']
    keep_these_tags_with_segments = ['EntityCommonStockSharesOutstanding','CommonStockSharesOutstanding']

    filings = pd.read_table('data/sec/downloads/'+folder+'/sub.tsv')
    numbers = pd.read_table('data/sec/downloads/'+folder+'/num.tsv', encoding='ISO-8859-1', on_bad_lines='skip')   # Skip malformed lines (pandas >= 1.3; older versions used error_bad_lines=False).

    filings = filings[filings.form.isin(['10-Q','10-K']) & filings.cik.notnull()]
    numbers = numbers[(numbers.dimh=='0x00000000') | numbers.tag.isin(keep_these_tags_with_segments)]

    merged = numbers.merge(filings, on='adsh', how='inner').set_index('adsh')[keep_these_columns]

    merged['filed'] = pd.to_datetime(merged.filed, format='%Y%m%d', errors='coerce')   
    merged['ddate'] = pd.to_datetime(merged.ddate, format='%Y%m%d', errors='coerce')    
    
    return merged[merged.filed.notnull() & merged.ddate.notnull()].drop_duplicates()
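
To see what the two filters and the inner merge do, here is a toy example with made-up data (the values, `adsh` labels and the non-zero `dimh` hash are invented for illustration). It assumes, as the function above suggests, that `dimh == '0x00000000'` marks values without segments.

```python
import pandas as pd

# Made-up filings: one 10-K, one 8-K, one 10-Q.
filings = pd.DataFrame({
    "adsh": ["a1", "a2", "a3"],
    "cik":  [100, 200, 300],
    "form": ["10-K", "8-K", "10-Q"],
})

# Made-up numbers: two of them carry a segment hash.
numbers = pd.DataFrame({
    "adsh":  ["a1", "a1", "a2", "a3"],
    "tag":   ["Assets", "Revenues", "Assets", "CommonStockSharesOutstanding"],
    "dimh":  ["0x00000000", "0x12345678", "0x00000000", "0x12345678"],
    "value": [10.0, 20.0, 30.0, 40.0],
})

keep_tags = ["EntityCommonStockSharesOutstanding", "CommonStockSharesOutstanding"]

filings = filings[filings.form.isin(["10-Q", "10-K"]) & filings.cik.notnull()]    # Drops the 8-K.
numbers = numbers[(numbers.dimh == "0x00000000") | numbers.tag.isin(keep_tags)]   # Drops the segmented Revenues row.

merged = numbers.merge(filings, on="adsh", how="inner")   # Keeps only numbers whose filing survived.
print(merged[["adsh", "cik", "tag", "value"]])
```

Only two rows remain: the unsegmented `Assets` value of the 10-K, and the shares-outstanding value of the 10-Q, which is kept despite its segment because of its tag.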

Merge **all files** in your downloads folder (ignore the warning messages):     
(this will take a while)

In [12]:
Path('data/sec/merged').mkdir(parents=True, exist_ok=True)              # Make the folder where we save the merged file.  
  
for folder in sorted(os.listdir('data/sec/downloads/')):                # Loop over all folders in directory "downloads".  
    if not folder.startswith("."):                                      # Exclude hidden files from file list.
        print('Merge:',folder)
        merged = merge_sec_files(folder)                                # Generate the merged table.
        merged.to_csv('data/sec/merged/'+folder+'.csv', index=False)    # Save the merged table.    
    
print('Merging finished!')

Merge: 2009q1
Merge: 2009q2
Merge: 2009q3
Merge: 2009q4
Merge: 2010q1
Merge: 2010q2
Merge: 2010q3
Merge: 2010q4
Merge: 2011q1
Merge: 2011q2
Merge: 2011q3
Merge: 2011q4
Merge: 2012q1
Merge: 2012q2
Merge: 2012q3
Merge: 2012q4
Merge: 2013q1
Merge: 2013q2
Merge: 2013q3
Merge: 2013q4
Merge: 2014q1
Merge: 2014q2
Merge: 2014q3
Merge: 2014q4
Merge: 2015q1
Merge: 2015q2
Merge: 2015q3
Merge: 2015q4
Merge: 2016q1
Merge: 2016q2
Merge: 2016q3
Merge: 2016q4
Merge: 2017q1
Merge: 2017q2
Merge: 2017q3
Merge: 2017q4
Merge: 2018q1
Merge: 2018q2
Merge: 2018q3
Merge: 2018q4
Merge: 2019q1
Merge: 2019q2
Merge: 2019q3
Merge: 2019q4
Merge: 2020_10
Merge: 2020_11
Merge: 2020_12
Merge: 2020q1
Merge: 2020q2
Merge: 2020q3
Merge: 2021_01
Merge: 2021_02
Merge: 2021_03
Merging finished!
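
In the listing above, `2020_10` appears before `2020q1` because `sorted` compares the names as plain strings and `_` sorts before `q`. This does not change the merged files, since each folder is processed on its own. If you would like a chronological order anyway, a sort key along these lines could be used; `period_sort_key` is a hypothetical helper that assumes the two naming patterns `YYYYqN` and `YYYY_MM`.

```python
def period_sort_key(period):
    """Map '2020q3' or '2020_10' to (year, last month covered) for chronological sorting."""
    year = int(period[:4])
    if "q" in period:
        month = 3 * int(period.split("q")[1])   # A quarter ends in month 3, 6, 9 or 12.
    else:
        month = int(period.split("_")[1])       # Monthly files carry the month directly.
    return (year, month)

folders = ["2020_10", "2020_12", "2020q1", "2020q3", "2021_01"]
ordered = sorted(folders, key=period_sort_key)
print(ordered)
```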


Now you have all the merged files in the "merged" folder. To free up disk space, you can now:
1. delete all files in the "downloads" folder
1. empty the trash

You need to keep all the merged files (they only take about 8 GB).

Now collect the dates of all filings from the merged files:

In [13]:
def get_all_filing_dates(filename=None):                          # Function input: optional filename.

    directory = 'data/sec/merged/'                                # Read data from here.
    filenames = [filename] if filename else os.listdir(directory) # Supplied filename or all files in "merged" directory.
    filenames = [f for f in filenames if not f.startswith(".")]   # Exclude hidden files from file list.
    
    results   = []                                                # Per-file tables are collected in this list.

    for filename in filenames:                                    # Loop over all files.
        data    = pd.read_csv(directory+filename, parse_dates=['filed','ddate'])  # Read the file.        
        results.append( data.groupby(['cik','filed'],as_index=False).first()[['cik','filed']] )
    
    # Combine once at the end (DataFrame.append was removed in pandas 2.0).
    return pd.concat(results).sort_values(['cik','filed']).set_index('cik')



Path('data/sec/dates').mkdir(parents=True, exist_ok=True)          # Make the folder where we save the filing dates.  

filing_dates = get_all_filing_dates()
filing_dates.to_csv('data/sec/dates/filing_dates.csv')             # Save the file.
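
The `groupby(...).first()` step keeps one row per company and filing date. The toy example below, with made-up data, shows the effect and compares it with `drop_duplicates`, which is an alternative way to get the unique (cik, filed) pairs; it is shown for comparison only and is not what the function above uses.

```python
import pandas as pd

# Made-up rows: company 1 reports two values on the same filing date.
data = pd.DataFrame({
    "cik":   [1, 1, 1, 2],
    "filed": pd.to_datetime(["2021-02-01", "2021-02-01", "2021-05-01", "2021-03-15"]),
    "tag":   ["Assets", "Revenues", "Assets", "Assets"],
    "value": [10.0, 20.0, 11.0, 5.0],
})

# Same expression as in get_all_filing_dates: one row per (cik, filed).
via_groupby = data.groupby(["cik", "filed"], as_index=False).first()[["cik", "filed"]]

# Alternative for comparison: drop repeated (cik, filed) pairs directly.
via_dedup = (data[["cik", "filed"]].drop_duplicates()
             .sort_values(["cik", "filed"]).reset_index(drop=True))

print(via_groupby)
print(via_dedup)
```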

Now every month:
1. download the latest folder (once available on the SEC website)
1. merge the files
1. (delete the downloaded folder if you need space)
1. update the filing dates

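For example, do this in early May 2021 (check the SEC website first to see if the file is available; if it is not, the download fails and the merge step raises a FileNotFoundError, as in the output below):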

In [14]:
folder = '2021_04'

download_file(folder)                                           # Download the newest data from SEC.

merged = merge_sec_files(folder)                                # Generate the merged table.
merged.to_csv('data/sec/merged/'+folder+'.csv', index=False)    # Save the merged table. 

filing_dates = get_all_filing_dates()                           # Get all filing dates.
filing_dates.to_csv('data/sec/dates/filing_dates.csv')          # Save the filing dates.

File not found: 2021_04


FileNotFoundError: [Errno 2] File data/sec/downloads/2021_04/sub.tsv does not exist: 'data/sec/downloads/2021_04/sub.tsv'
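
The error above appears because the merge step runs even though the download failed. One way to avoid it is to check that the unzipped files exist before merging. This is a minimal sketch; `period_is_downloaded` is a hypothetical helper that only looks at the folder layout used in this notebook.

```python
from pathlib import Path

def period_is_downloaded(period, downloads_dir="data/sec/downloads"):
    """Return True if both sub.tsv and num.tsv exist for this period."""
    folder = Path(downloads_dir) / period
    return (folder / "sub.tsv").is_file() and (folder / "num.tsv").is_file()

folder = "2021_04"

if period_is_downloaded(folder):
    print("Ready to merge:", folder)                  # Here you would call merge_sec_files(folder).
else:
    print("Not available yet, skipping:", folder)     # Try again once the SEC has published the file.
```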

That's it!           