# Download and merge all SEC files

We download SEC company filing data from this website: https://www.sec.gov/dera/data/financial-statement-and-notes-data-set.html 


You need to generate the following directory (folder) structure:
- inside current directory (where you run this notebook): folder "data"
- inside data folder: folder "sec"
- inside sec folder: folder "downloads" and folder "merged"
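
If you prefer not to create the folders by hand, the structure above can be sketched in a few lines. This assumes the notebook's working directory is where the "data" folder should live:

```python
from pathlib import Path

# Create data/sec/downloads and data/sec/merged below the current directory.
# parents=True also creates "data" and "data/sec"; exist_ok=True makes reruns harmless.
for folder in ['data/sec/downloads', 'data/sec/merged']:
    Path(folder).mkdir(parents=True, exist_ok=True)
```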

Now import these libraries

In [1]:
import pandas as pd
import requests, zipfile, io
import os
from pathlib import Path

Our download function:

In [2]:
def download_file(period):
    url = 'https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/'+period+'_notes.zip'
    
    unzip_folder_name = 'data/sec/downloads/' + period                           # Where to put contents of unzipped file  
    
    r = requests.get(url)
    if r.ok:                                                                     # If download worked
        print('Downloaded:', url, 'to:', unzip_folder_name)
        Path(unzip_folder_name).mkdir(parents=True, exist_ok=True)               # Make the folder where we unzip the file to   
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall(members=['sub.tsv','num.tsv'], path=unzip_folder_name)      # Unzip file to the folder we just made
    else:
        print('File not found:', period)
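
If downloads fail or you have to restart the loop, a variant along the following lines may help. It is only a sketch, not a replacement for the function above: `download_file_once`, its `root`, `get` and `user_agent` parameters, and the placeholder contact string are all assumptions made for illustration. It skips periods that are already unzipped and sends a `User-Agent` header, which SEC.gov asks automated clients to declare.

```python
import io
import zipfile
from pathlib import Path

BASE_URL = 'https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/'
WANTED = ['sub.tsv', 'num.tsv']          # the only two files the merge step reads


def download_file_once(period, root='data/sec/downloads', get=None,
                       user_agent='Your Name your.email@example.com'):  # placeholder, replace with your own
    """Download and unzip one period unless its files are already on disk."""
    target = Path(root) / period
    if all((target / name).exists() for name in WANTED):
        print('Already present, skipping:', period)
        return True

    if get is None:                      # fall back to requests.get when no stand-in is passed
        import requests
        get = requests.get

    r = get(BASE_URL + period + '_notes.zip',
            headers={'User-Agent': user_agent}, timeout=120)
    if not r.ok:
        print('File not found:', period)
        return False

    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(r.content)) as z:
        z.extractall(path=target, members=WANTED)
    print('Downloaded:', period, 'to:', target)
    return True
```

Calling `download_file_once('2021_01')` a second time returns immediately without touching the network, so the download loop can be rerun safely after an interruption.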

If you previously had trouble downloading the files, and you are worried about running out of disk space:
1. delete all files in your "downloads" folder
1. empty your trash to make sure that the space is actually freed
1. check the available disk space; you need about 40 GB
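
The last check can also be done from the notebook. A minimal sketch, where `free_gb` and `has_enough_space` are hypothetical helper names and the 40 GB default is taken from the estimate above:

```python
import shutil


def free_gb(path='.'):
    """Free space, in gigabytes, on the drive that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path='.', needed_gb=40):
    """True if the drive holding `path` has at least `needed_gb` GB free."""
    return free_gb(path) >= needed_gb


print('Free space (GB):', round(free_gb('.'), 1))
print('Enough for the downloads:', has_enough_space('.'))
```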

Now download **all files** from 2010 to 2020 (the quarterly files, plus the monthly files for October to December 2020):

In [4]:
for year in range(2010,2021):                # Get the quarterly files
    for quarter in [1,2,3,4]:
        period = str(year)+'q'+str(quarter)
        download_file(period)
        
for year in range(2020,2021):                # Get the monthly files
    for month in [10,11,12]:
        period = str(year)+'_'+str(month)
        download_file(period)

Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2010q1_notes.zip to: data/sec/downloads/2010q1
Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2010q2_notes.zip to: data/sec/downloads/2010q2
Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2010q3_notes.zip to: data/sec/downloads/2010q3
Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2010q4_notes.zip to: data/sec/downloads/2010q4
Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2011q1_notes.zip to: data/sec/downloads/2011q1
Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2011q2_notes.zip to: data/sec/downloads/2011q2
Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2011q3_notes.zip to: data/sec/downloads/2011q3
Downloaded: https://www.sec.gov/files/dera/data/financi

Now download the **most recent file** (you can do this every month; for example, in March, download '2021_02'):

In [6]:
download_file('2021_01')

Downloaded: https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2021_01_notes.zip to: data/sec/downloads/2021_01
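
To avoid typing the period by hand each month, the "previous month" rule can be sketched as below. `latest_period` is a hypothetical helper; it assumes the monthly files keep the `YYYY_MM` naming used above and that last month's file is already online:

```python
from datetime import date


def latest_period(today=None):
    """Return last month's period string, e.g. '2021_02' when called in March 2021."""
    today = today or date.today()
    year, month = today.year, today.month - 1
    if month == 0:                        # January: step back to December of the previous year
        year, month = year - 1, 12
    return f'{year}_{month:02d}'          # zero-padded month, matching '2021_01'


print(latest_period(date(2021, 3, 5)))    # 2021_02
```

You could then call `download_file(latest_period())`.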


Now use this function to merge all the files:

In [7]:
def merge_sec_files(folder):

    keep_these_columns = ['cik','sic','countryinc','tag','filed','ddate','qtrs','value']

    filings = pd.read_table('data/sec/downloads/'+folder+'/sub.tsv')
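    # Skip malformed lines. on_bad_lines needs pandas >= 1.3; older versions used error_bad_lines=False instead.
    numbers = pd.read_table('data/sec/downloads/'+folder+'/num.tsv', encoding='ISO-8859-1', on_bad_lines='skip')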

    filings = filings[filings.form.isin(['10-Q','10-K']) & filings.cik.notnull()]
    numbers = numbers[(numbers.dimh=='0x00000000')]                                     # keep only non-segment data

    merged = numbers.merge(filings, on='adsh', how='inner')[keep_these_columns]

    merged['filed'] = pd.to_datetime(merged.filed, format='%Y%m%d', errors='coerce')    #  ‘coerce’: invalid parsing set as NaT.
    merged['ddate'] = pd.to_datetime(merged.ddate, format='%Y%m%d', errors='coerce')    

    merged = merged[merged.filed.notnull() & merged.ddate.notnull()].drop_duplicates()

    merged.to_csv('data/sec/merged/'+folder+'.csv', index=False)
    
    return merged   
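
To see what the filter and merge steps do, here is a toy run on two tiny hand-made tables. The rows are invented purely for illustration and only a subset of the real columns is used:

```python
import pandas as pd

# Three made-up filings: only the 10-K and the 10-Q should survive the form filter.
filings = pd.DataFrame({
    'adsh':  ['a1', 'a2', 'a3'],
    'cik':   [100, 200, 300],
    'form':  ['10-K', '8-K', '10-Q'],
    'filed': [20210106, 20210107, 20210108],
})

# Four made-up numbers: the second row is segment data (dimh is not '0x00000000').
numbers = pd.DataFrame({
    'adsh':  ['a1', 'a1', 'a2', 'a3'],
    'tag':   ['Assets', 'Assets', 'Assets', 'Revenues'],
    'ddate': [20201130, 20201130, 20201130, 20201231],
    'dimh':  ['0x00000000', '0xabcdef01', '0x00000000', '0x00000000'],
    'value': [10.0, 4.0, 7.0, 3.0],
})

filings = filings[filings.form.isin(['10-Q', '10-K']) & filings.cik.notnull()]
numbers = numbers[numbers.dimh == '0x00000000']            # keep only non-segment data

# The inner join keeps only numbers whose filing passed the filter above.
merged = numbers.merge(filings, on='adsh', how='inner')[['cik', 'tag', 'filed', 'ddate', 'value']]
merged['filed'] = pd.to_datetime(merged.filed, format='%Y%m%d', errors='coerce')
merged['ddate'] = pd.to_datetime(merged.ddate, format='%Y%m%d', errors='coerce')

print(merged)
```

Two rows remain: the non-segment "Assets" value from the 10-K and the "Revenues" value from the 10-Q. The 8-K filing and the segment row are dropped.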

Merge **all files** in your downloads folder. This will take a while, and you can ignore the warnings:

In [8]:
for folder in os.listdir('data/sec/downloads/'):  # Loop over all folders in "downloads" directory
    print(folder)
    merge_sec_files(folder)

2020_11
2020_10
2014q3
2014q4
2018q1
2010q2
2012q1
2016q1
2014q2
2010q3
2010q4
2020q1
2009q3
2009q4
2009q2
2013q2
2011q1
2019q2
2017q3
2017q4
2019q3
2019q4
2013q3
2013q4
2015q1
2017q2
2020_12
2018q2
2010q1
2012q2
2016q4
2016q3
2012q4
2012q3
2018q4
2018q3
2016q2
2014q1
2021_01
2020q2
2020q3
2009q1
2020q4
2015q4
2015q3
2013q1
2011q2
2019q1
2015q2
2017q1
2011q4
2011q3
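
The loop above passes everything that `os.listdir` returns to `merge_sec_files`, including stray files such as `.DS_Store` on a Mac, and it redoes folders that were merged earlier. A more defensive selection could look like the sketch below; `folders_to_merge` is a hypothetical helper, not part of the original notebook:

```python
from pathlib import Path


def folders_to_merge(downloads='data/sec/downloads', merged='data/sec/merged'):
    """List download folders that contain both TSV files and have no merged CSV yet."""
    todo = []
    for entry in sorted(Path(downloads).iterdir()):
        if not entry.is_dir():
            continue                                   # skip stray files such as .DS_Store
        if not ((entry / 'sub.tsv').exists() and (entry / 'num.tsv').exists()):
            continue                                   # skip incomplete downloads
        if (Path(merged) / (entry.name + '.csv')).exists():
            continue                                   # skip folders that are already merged
        todo.append(entry.name)
    return todo
```

The merge loop then becomes `for folder in folders_to_merge(): merge_sec_files(folder)`, which also lets you resume after an interruption.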


Or if you only want to merge **one** particular file:

In [10]:
merge_sec_files('2021_01')

Unnamed: 0,cik,sic,countryinc,tag,filed,ddate,qtrs,value
0,1517389,7371.0,US,AccountsPayableAndAccruedLiabilitiesCurrent,2021-01-06,2020-11-30,0,7094.0
1,1517389,7371.0,US,AccountsPayableAndAccruedLiabilitiesCurrent,2021-01-06,2020-02-29,0,8698.0
2,1517389,7371.0,US,AccountsReceivableNetCurrent,2021-01-06,2020-11-30,0,9384.0
3,1517389,7371.0,US,AccountsReceivableNetCurrent,2021-01-06,2020-02-29,0,9402.0
4,1517389,7371.0,US,AdditionalPaidInCapital,2021-01-06,2020-11-30,0,2449733.0
...,...,...,...,...,...,...,...,...
118256,1593812,6221.0,US,RedemptionsCostBasis,2021-01-29,2019-10-31,4,-6713.0
118257,1593812,6221.0,US,RedemptionsCostBasis,2021-01-29,2020-10-31,4,-203243901.0
118258,1593812,6221.0,US,WeightedAverageNumberOfGoldReceipts,2021-01-29,2018-10-31,4,194219.0
118259,1593812,6221.0,US,WeightedAverageNumberOfGoldReceipts,2021-01-29,2019-10-31,4,153340.0


Now you have all the merged files in the "merged" folder. To free up disk space, you can now:
1. delete all files in the "downloads" folder
1. empty the trash

You need to keep all the merged files (but they only take about 8 GB).

Now every month:
1. download the latest folder (once it is available on the SEC website)
1. merge the files
1. (delete the downloaded folder if you need space)

For example do this in early March 2021:

In [None]:
download_file('2021_02')

merge_sec_files('2021_02')

# delete folder 'data/sec/downloads/2021_02'
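
The deletion step in the comment above can also be done in code. A cautious sketch, where `remove_download` is a hypothetical helper that deletes a download folder only when its merged CSV already exists. Note that `shutil.rmtree` deletes permanently and does not go through the trash:

```python
import shutil
from pathlib import Path


def remove_download(period, downloads='data/sec/downloads', merged='data/sec/merged'):
    """Delete downloads/<period>, but only if merged/<period>.csv is already there."""
    merged_csv = Path(merged) / (period + '.csv')
    if not merged_csv.exists():
        print('Not deleting, no merged file yet:', merged_csv)
        return False
    shutil.rmtree(Path(downloads) / period, ignore_errors=True)   # permanent, bypasses the trash
    return True
```

For example, `remove_download('2021_02')` after the merge has finished.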

That's it!           

Anyone who has problems with this: email me and tell me what is not working. We will need these data.