# Web Scraping the SEC
The Securities and Exchange Commission (SEC) is a regulatory agency that houses numerous financial documents related to public companies. These companies are required to file financial disclosures with the SEC so that investors can quickly evaluate their business performance. In this tutorial, we will explore how to scrape SEC filings from the agency's public database.

If you would like more information about the SEC and their data sources, I encourage you to visit the SEC's website directly. Here is the link for your reference: **https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm**

Additionally, I am in the process of building a video series on YouTube that covers this topic. If you're interested in watching those videos, please follow this link **https://www.youtube.com/playlist?list=PLcFcktZ0wnNkOo9FQ2wrDcsV0jYqEYu1z**

In [1]:
# import our libraries
import requests
import urllib
from bs4 import BeautifulSoup

***
We will be building a lot of URLs to request the data we need. To make this quick, we will create a function that takes a base URL and a list of path components and returns the full URL.

In [2]:
# let's first make a function that will make the process of building a url easy.
def make_url(base_url , comp):
    
    url = base_url
    
    # add each component to the base url
    for r in comp:
        url = '{}/{}'.format(url, r)
        
    return url

base_url = r"https://www.sec.gov/Archives/edgar/data"
components = ['886982','000156459019011378', '0001564590-19-011378-index-headers.html']
make_url(base_url, components)

'https://www.sec.gov/Archives/edgar/data/886982/000156459019011378/0001564590-19-011378-index-headers.html'
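The same URL can also be built without an explicit loop. This is only a minimal sketch, assuming every component is a plain path segment; the name `join_url` is illustrative and not part of the notebook above.

```python
def join_url(base_url, components):
    """Join a base URL and a list of path components with single slashes."""
    # Strip stray slashes so the result never contains '//' inside the path.
    parts = [base_url.rstrip('/')] + [str(c).strip('/') for c in components]
    return '/'.join(parts)


base_url = "https://www.sec.gov/Archives/edgar/data"
components = ['886982', '000156459019011378', '0001564590-19-011378-index-headers.html']
print(join_url(base_url, components))
```

Stripping the slashes makes the helper tolerant of a base URL that ends in `/`, which the loop-based version would turn into a double slash.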

***
## Pulling the documents for a single filing for a single company
If we want to pull all the records for a single filing, the process is simple. We pass through the company's CIK number; this will define the company we want to search. Once we do this, we can request the filings for that company.

Remember that when we request the filings, we will get all the filings for that company. If we need to, we can narrow the filings to a specific time range, but this requires filtering the URLs down to the ones that contain those dates. Unfortunately, we will only be able to select a particular year.

If you look at the accession number at the end of the URL **(0001564590-19-011378)**, the two digits in the middle are the year of the submission (here, 19 for 2019).
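Filtering by year can be sketched as follows. The function names are hypothetical, and the century rule (two-digit years of 90 and above are treated as 19xx) is my own assumption rather than something the SEC documents here. The second sample name is made up purely to show a non-matching year.

```python
def accession_year(accession_number):
    """Return the four-digit year encoded in an accession number.

    Accepts the dashed form ('0001564590-19-011378') as well as the
    18-digit folder form ('000156459019011378').
    """
    digits = accession_number.replace('-', '')
    if len(digits) != 18 or not digits.isdigit():
        raise ValueError('unexpected accession number: {!r}'.format(accession_number))
    two_digit_year = int(digits[10:12])
    # Assumption: two-digit years of 90 and above belong to the 1900s.
    century = 1900 if two_digit_year >= 90 else 2000
    return century + two_digit_year


def filings_for_year(filing_names, year):
    """Keep only the filing folder names whose accession number falls in `year`."""
    return [name for name in filing_names if accession_year(name) == year]


# The first name appears in this tutorial; the second is a made-up example.
names = ['000156459019011408', '000156459018000001']
print(filings_for_year(names, 2019))
```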

In [3]:
# define a base url, this would be the EDGAR data Archives
base_url = r"https://www.sec.gov/Archives/edgar/data"

# define a company to search (GOLDMAN SACHS), this requires a CIK number that is defined by the SEC.
cik_num = '886982'

# let's get all the filings for Goldman Sachs in a json format.
# Alternative is .html & .xml
filings_url = make_url(base_url, [cik_num, 'index.json'])

# Get the filings and then decode it into a dictionary object.
content = requests.get(filings_url)
decoded_content = content.json()

# Get a single filing number, this way we can request all the documents that were submitted.
filing_number = decoded_content['directory']['item'][0]['name']

# define the filing url, again I want all the data back as JSON.
filing_url = make_url(base_url, [cik_num, filing_number, 'index.json'])

# Get the documents submitted for that filing.
content = requests.get(filing_url)
document_content = content.json()

# get a document name
for document in document_content['directory']['item']:
    if document['type'] != 'image2.gif':
        document_name = document['name']
        filing_url = make_url(base_url, [cik_num, filing_number, document_name])
        print(filing_url)

https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/0001564590-19-011408-index-headers.html
https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/0001564590-19-011408-index.html
https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/0001564590-19-011408.txt
https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/gs-424b2.htm
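If only certain file types are of interest, the document list can be filtered by file extension instead. A rough sketch, assuming each item is a dictionary with a `name` key as in the loop above; `pick_documents` is an illustrative name.

```python
def pick_documents(items, extensions=('.htm', '.html', '.txt')):
    """Return the names of directory items that end with one of `extensions`."""
    wanted = tuple(extensions)
    return [item['name'] for item in items
            if item['name'].lower().endswith(wanted)]


# Sample shaped like document_content['directory']['item']; only 'name' is used.
sample_items = [
    {'name': '0001564590-19-011408-index-headers.html'},
    {'name': '0001564590-19-011408.txt'},
    {'name': 'gs-424b2.htm'},
    {'name': 'g3jv1geevabj000001.jpg'},
]
print(pick_documents(sample_items, extensions=('.htm',)))
```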


***
## Pulling all the documents for all the filings for a single company
If we want to pull all the records for all the filings, the process is very similar. All we are going to do is loop through all the filings instead of just grabbing one. Remember that there can be many filings for a single company, so you may get back more than you intend.
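Looping over many filings means many requests. As far as I know, the SEC asks automated clients to identify themselves with a descriptive User-Agent header and to keep their request rate modest; the access page linked at the top describes the current policy. The sketch below is one way to prepare for that. The names `sec_headers` and `Throttle`, the contact string, and the 0.2 second interval are all my own assumptions.

```python
import time


def sec_headers(contact):
    """Build request headers that identify the caller."""
    return {
        'User-Agent': contact,
        'Accept-Encoding': 'gzip, deflate',
    }


class Throttle:
    """Sleep as needed so that calls to wait() are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=0.2):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Placeholder contact details; replace with your own.
headers = sec_headers('Sample Company admin@example.com')
throttle = Throttle(min_interval=0.2)
print(headers['User-Agent'])
```

Inside the loop, this would amount to calling `throttle.wait()` before each request and passing `headers=headers` to `requests.get`.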

In [5]:
# define a base url, this would be the EDGAR data Archives
base_url = r"https://www.sec.gov/Archives/edgar/data"

# define a company to search (GOLDMAN SACHS), this requires a CIK number that is defined by the SEC.
cik_num = '886982'

# let's get all the filings for Goldman Sachs in a json format.
# Alternative is .html & .xml
filings_url = make_url(base_url, [cik_num, 'index.json'])

# Get the filings and then decode it into a dictionary object.
content = requests.get(filings_url)
decoded_content = content.json()

# Get a filing number, this way we can request all the documents that were submitted.
# Here I am just grabbing the first three filings; remove [0:3] to grab all of them.
for filing_number in decoded_content['directory']['item'][0:3]:    
    
    filing_num = filing_number['name']
    print('-'*100)
    print('Grabbing filing : {}'.format(filing_num))
    
    # define the filing url, again I want all the data back as JSON.
    filing_url = make_url(base_url, [cik_num, filing_num, 'index.json'])

    # Get the documents submitted for that filing.
    content = requests.get(filing_url)
    document_content = content.json()

    # get a document name
    for document in document_content['directory']['item']:
        document_name = document['name']
        filing_url = make_url(base_url, [cik_num, filing_num, document_name])
        print(filing_url)

----------------------------------------------------------------------------------------------------
Grabbing filing : 000156459019011408
https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/0001564590-19-011408-index-headers.html
https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/0001564590-19-011408-index.html
https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/0001564590-19-011408.txt
https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/g3jv1geevabj000001.jpg
https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/g3jv1geevabj000002.jpg
https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/g3jv1geevabj000003.jpg
https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/g3jv1geevabj000004.jpg
https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/g3jv1geevabj000005.jpg
https://www.sec.gov/Archives/edgar/data/886982/000156459019011408/g3jv1geevabj000006.jpg
https://www.sec.gov/Archives/edga

***
## Pulling the daily index filings
The daily-index endpoint lists all the index files for a given year. Once a year is selected, we choose a quarter, and then we can grab the index files for that quarter. Each quarter has the following four indexes available:

 - **Company** — sorted by company name
 - **Form** — sorted by form type
 - **Master** — sorted by CIK number
 - **XBRL** — list of submissions containing XBRL financial files, sorted by CIK number; these include Voluntary Filer Program submissions
 
The company, form, and master indexes contain the same information sorted differently. For more information, please visit the documentation provided by the SEC. https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm
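The URL of a single daily index file can also be built directly from a date. This is a sketch based only on the 2019 naming pattern that appears in this tutorial (`<type>.<YYYYMMDD>.idx` inside a `QTR<n>` folder); the function name is illustrative, and other years may follow a different pattern.

```python
from datetime import date

DAILY_INDEX_URL = "https://www.sec.gov/Archives/edgar/daily-index"


def daily_index_url(day, index_type='master'):
    """Build the URL of one daily index file for the given date."""
    # January to March is QTR1, April to June is QTR2, and so on.
    quarter = (day.month - 1) // 3 + 1
    file_name = '{}.{}.idx'.format(index_type, day.strftime('%Y%m%d'))
    return '/'.join([DAILY_INDEX_URL, str(day.year), 'QTR{}'.format(quarter), file_name])


print(daily_index_url(date(2019, 4, 1)))
```

Keep in mind that not every calendar date has a file; the 2019 listing, for example, skips weekends.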

In [19]:
# define the urls needed to make the request, let's start with all the daily filings
base_url = r"https://www.sec.gov/Archives/edgar/daily-index"

# The daily-index filings require a year and content type (html, json, or xml).
year_url = make_url(base_url, ['2019', 'index.json'])

# request the content for 2019, remember that a JSON structure will be sent back so we need to decode it.
content = requests.get(year_url)
decoded_content = content.json()

# the structure is almost identical to other json requests we've made. Go to the item list.
# AGAIN ONLY GRABBING A SUBSET OF THE FULL DATASET 
for item in decoded_content['directory']['item'][0:1]:
    
    # get the name of the folder
    print('-'*100)
    print('Pulling url for Quarter: {}'.format(item['name']))
    
    # The daily-index filings require a year, a quarter and a content type (html, json, or xml).
    qtr_url = make_url(base_url, ['2019', item['name'], 'index.json'])
    
    # print out the url.
    print("URL Link: " + qtr_url)
    
    # Request the new url; again it will be a JSON structure.
    file_content = requests.get(qtr_url)
    quarter_content = file_content.json()
    
    print('-'*100)
    print('Pulling files')

    # for each file in the directory items list, print the file url.
    # AGAIN DOING A SUBSET
    for file in quarter_content['directory']['item'][0:10]:
        
        file_url = make_url(base_url, ['2019', item['name'], file['name']])
        print("File URL Link: " + file_url)

----------------------------------------------------------------------------------------------------
Pulling url for Quarter: QTR1
URL Link: https://www.sec.gov/Archives/edgar/daily-index/2019/QTR1/index.json
----------------------------------------------------------------------------------------------------
Pulling files
File URL Link: https://www.sec.gov/Archives/edgar/daily-index/2019/QTR1/company.20190102.idx
File URL Link: https://www.sec.gov/Archives/edgar/daily-index/2019/QTR1/company.20190103.idx
File URL Link: https://www.sec.gov/Archives/edgar/daily-index/2019/QTR1/company.20190104.idx
File URL Link: https://www.sec.gov/Archives/edgar/daily-index/2019/QTR1/company.20190107.idx
File URL Link: https://www.sec.gov/Archives/edgar/daily-index/2019/QTR1/company.20190108.idx
File URL Link: https://www.sec.gov/Archives/edgar/daily-index/2019/QTR1/company.20190109.idx
File URL Link: https://www.sec.gov/Archives/edgar/daily-index/2019/QTR1/company.20190110.idx
File URL Link: https://ww

***
## Parsing the master IDX file
Out of all the files to parse, I find the master.idx file the easiest because each field is separated by a delimiter. The other files do not offer such a delimiter, which makes parsing them harder than it needs to be.

The first step is to save the content to a text file. That way you don't have to make a second request, and you avoid burdening the server.
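The save-once, read-many idea can be wrapped in a small helper. This is a sketch with a hypothetical name, `load_or_fetch`; the download step is passed in as a function so the example below can run with a stand-in instead of a real request.

```python
import os
import tempfile


def load_or_fetch(path, fetch):
    """Return the bytes stored at `path`, calling `fetch()` only if the file is missing."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return f.read()
    content = fetch()
    with open(path, 'wb') as f:
        f.write(content)
    return content


# Demonstration with a stand-in fetcher, so no network access is needed.
calls = []


def fake_fetch():
    calls.append(1)
    return b'placeholder index content\n'


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, 'master_sample.idx')
    first = load_or_fetch(cache_path, fake_fetch)   # calls fake_fetch and writes the file
    second = load_or_fetch(cache_path, fake_fetch)  # reads the file, no second call

print(len(calls))
```

With the real data, the `fetch` argument could be something like `lambda: requests.get(file_url).content`.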

In [9]:
# define a url, in this case I'll just take one of the urls up above.
file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2019/QTR2/master.20190401.idx"

# request that new content, this will not be a JSON STRUCTURE!
content = requests.get(file_url).content

# we can always write the content to a file, so we don't need to request it again.
with open('master_20190401.txt', 'wb') as f:
    f.write(content)

In [10]:
# let's open it and we will now have a byte stream to play with.
with open('master_20190401.txt', 'rb') as f:
    byte_data = f.read()

# Now that we loaded the data, we have a byte stream that needs to be decoded and then split by double spaces.
data = byte_data.decode("utf-8").split('  ')

# We need to remove the headers, so look for the end of the header and grab its index.
for index, item in enumerate(data):
    if "ftp://ftp.sec.gov/edgar/" in item:
        start_ind = index

# define a new dataset without the header info.
data_format = data[start_ind + 1:]

master_data = []

# now we need to break the data into sections, this way we can move to the final step of getting each row value.
for index, item in enumerate(data_format):
    
    # if it's the first index, it won't be even so treat it differently
    if index == 0:
        clean_item_data = item.replace('\n','|').split('|')
        clean_item_data = clean_item_data[8:]
    else:
        clean_item_data = item.replace('\n','|').split('|')
        
    for row_index, row in enumerate(clean_item_data):
        
        # when you find the text file.
        if '.txt' in row:

            # grab the values that belong to that row: the 4 values before it plus the file path itself.
            mini_list = clean_item_data[(row_index - 4): row_index + 1]
            
            if len(mini_list) != 0:
                mini_list[4] = "https://www.sec.gov/Archives/" + mini_list[4]
                master_data.append(mini_list)
                
# grab the first three items
master_data[:3]

[['1236397',
  'BRADBURY DANIEL',
  '4',
  '20190401',
  'https://www.sec.gov/Archives/edgar/data/1236397/0000886744-19-000047.txt'],
 ['1236458',
  'WILLIAMS PAUL S',
  '4',
  '20190401',
  'https://www.sec.gov/Archives/edgar/data/1236458/0001227654-19-000074.txt'],
 ['1237789',
  'BLAIR DONALD W',
  '4',
  '20190401',
  'https://www.sec.gov/Archives/edgar/data/1237789/0001127602-19-013788.txt']]
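Because the master file is pipe-delimited, a line-by-line parse is another option. The sketch below is not the approach used above; it assumes every data row has exactly five `|`-separated fields beginning with a numeric CIK, and that company names never contain a `|`. The sample text is abbreviated, and its column header line reflects my understanding of the file layout.

```python
def parse_master_index(text):
    """Parse the text of a master index file into rows of five values each."""
    rows = []
    for line in text.splitlines():
        fields = line.split('|')
        # Data rows have five fields and start with a numeric CIK; this skips
        # the descriptive header, the column names and the dashed separator.
        if len(fields) == 5 and fields[0].strip().isdigit():
            cik, company, form, filed, path = (field.strip() for field in fields)
            rows.append([cik, company, form, filed,
                         'https://www.sec.gov/Archives/' + path])
    return rows


# Abbreviated sample; the two data rows come from the output above.
sample = """CIK|Company Name|Form Type|Date Filed|File Name
--------------------------------------------------------------------------------
1236397|BRADBURY DANIEL|4|20190401|edgar/data/1236397/0000886744-19-000047.txt
1236458|WILLIAMS PAUL S|4|20190401|edgar/data/1236458/0001227654-19-000074.txt
"""
print(parse_master_index(sample)[0])
```

On the real file this would be called as `parse_master_index(byte_data.decode("utf-8"))`.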

***
An extra step we can take is converting our master list of data into a list of dictionaries, where each dictionary represents a single filing document. This way we can easily iterate over the master list to grab the data we need.

In [11]:
# loop through each document in the master list.
for index, document in enumerate(master_data):
    
    # create a dictionary for each document in the master list
    document_dict = {}
    document_dict['cik_number'] = document[0]
    document_dict['company_name'] = document[1]
    document_dict['form_id'] = document[2]
    document_dict['date'] = document[3]
    document_dict['file_url'] = document[4]
    
    master_data[index] = document_dict
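The same conversion can be written more compactly with `zip`. A minimal sketch, assuming each row holds the five values in the order shown above; unlike the loop, it returns a new list instead of modifying the original in place. `FIELDS` and `rows_to_dicts` are illustrative names.

```python
FIELDS = ('cik_number', 'company_name', 'form_id', 'date', 'file_url')


def rows_to_dicts(rows):
    """Pair each row's values with the field names, producing one dictionary per row."""
    return [dict(zip(FIELDS, row)) for row in rows]


sample_rows = [
    ['1236397', 'BRADBURY DANIEL', '4', '20190401',
     'https://www.sec.gov/Archives/edgar/data/1236397/0000886744-19-000047.txt'],
]
print(rows_to_dicts(sample_rows)[0]['company_name'])
```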

Let's grab the 10-K filings from the first 150 entries of the dataset. For more information on the different financial forms that we have access to, I encourage you to visit **https://www.sec.gov/forms**.

In [18]:
# by being in a dictionary format, it'll be easier to get the items we need.
for document_dict in master_data[0:150]:

    # if it's a 10-K document pull the url and the name.
    if document_dict['form_id'] == '10-K':
        print(document_dict['company_name'])
        print(document_dict['file_url'])

GENERAL STEEL HOLDINGS INC
https://www.sec.gov/Archives/edgar/data/1239188/0001144204-19-017485.txt
COMMONWEALTH INCOME & GROWTH FUND V
https://www.sec.gov/Archives/edgar/data/1253347/0001654954-19-003881.txt
Joway Health Industries Group Inc
https://www.sec.gov/Archives/edgar/data/1263364/0001213900-19-005388.txt
TRANSAKT LTD.
https://www.sec.gov/Archives/edgar/data/1263872/0001062993-19-001480.txt
MONITRONICS INTERNATIONAL INC
https://www.sec.gov/Archives/edgar/data/1265107/0001265107-19-000004.txt
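To collect the matches rather than print them, the filter can be wrapped in a small function that accepts several form types at once. This is a sketch with a hypothetical name, `filter_by_form`; the sample entries are taken from the outputs above.

```python
def filter_by_form(documents, form_ids):
    """Return the document dictionaries whose 'form_id' is one of `form_ids`."""
    wanted = set(form_ids)
    return [doc for doc in documents if doc['form_id'] in wanted]


sample_documents = [
    {'cik_number': '1236397', 'company_name': 'BRADBURY DANIEL', 'form_id': '4',
     'date': '20190401',
     'file_url': 'https://www.sec.gov/Archives/edgar/data/1236397/0000886744-19-000047.txt'},
    {'cik_number': '1239188', 'company_name': 'GENERAL STEEL HOLDINGS INC', 'form_id': '10-K',
     'date': '20190401',
     'file_url': 'https://www.sec.gov/Archives/edgar/data/1239188/0001144204-19-017485.txt'},
]

for doc in filter_by_form(sample_documents, ['10-K']):
    print(doc['company_name'], doc['file_url'])
```

Passing a list such as `['10-K', '10-K/A']` would also pick up amended annual reports, if those are of interest.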
