Here we extract filing information from the EDGAR daily index:
https://www.sec.gov/Archives/edgar/daily-index

In [2]:
import requests
import urllib
from bs4 import BeautifulSoup

In [3]:
# let's first make a function that will make the process of building a url easy.
def make_url(base_url , comp):
    
    url = base_url
    
    # add each component to the base url
    for r in comp:
        url = '{}/{}'.format(url, r)
        
    return url

# EXAMPLE
base_url = r"https://www.sec.gov/Archives/edgar/data"
components = ['886982','000156459019011378', '0001564590-19-011378-index-headers.html']
make_url(base_url, components)

'https://www.sec.gov/Archives/edgar/data/886982/000156459019011378/0001564590-19-011378-index-headers.html'

Get all documents of a year

Here we collect the links to the daily index files for a specific year, quarter by quarter.
Those index files in turn list the documents filed by every company.

In [4]:
# base url for the daily index files.
base_url = r"https://www.sec.gov/Archives/edgar/daily-index"

# create the URL of daily index for 2020
year_url = make_url(base_url, ['2020', 'index.json'])

# request year_url
content = requests.get(year_url)

# convert data from json to dict
decoded_content = content.json()
decoded_content

{'directory': {'item': [{'last-modified': '03/31/2020 10:08:05 PM',
    'name': 'QTR1',
    'type': 'dir',
    'href': 'QTR1/',
    'size': '20 KB'},
   {'last-modified': '06/26/2020 10:07:19 PM',
    'name': 'QTR2',
    'type': 'dir',
    'href': 'QTR2/',
    'size': '20 KB'},
   {'last-modified': '07/01/2020 12:20:09 AM',
    'name': 'QTR3',
    'type': 'dir',
    'href': 'QTR3/',
    'size': '4 KB'}],
  'name': 'daily-index/2020/',
  'parent-dir': '../'}}
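
The quarter folders can be pulled out of this dictionary with a small helper. This is only a minimal sketch: the function name `directory_names` and the variable `sample_index` are made up for illustration, and the sample is a trimmed copy of the output above.

```python
def directory_names(index_json, item_type="dir"):
    """Return the 'name' of every entry of the given type in an index.json dict."""
    items = index_json["directory"]["item"]
    return [item["name"] for item in items if item["type"] == item_type]


# trimmed copy of the 2020 index shown above (only the keys the helper reads)
sample_index = {
    "directory": {
        "item": [
            {"name": "QTR1", "type": "dir", "href": "QTR1/"},
            {"name": "QTR2", "type": "dir", "href": "QTR2/"},
            {"name": "QTR3", "type": "dir", "href": "QTR3/"},
        ],
        "name": "daily-index/2020/",
        "parent-dir": "../",
    }
}

quarters = directory_names(sample_index)
print(quarters)  # -> ['QTR1', 'QTR2', 'QTR3']
```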

In [None]:
# loop through the quarter folders in the dict
for item in decoded_content['directory']['item']:
    
    # get the name of the folder
    print("-" * 100)
    print("pulling url for quarter {}".format(item["name"]))
    
    # create the qtr url
    qtr_url = make_url(base_url, ['2020',item["name"], 'index.json'])
    print(qtr_url)
    
    # request qtr_url
    file_content = requests.get(qtr_url)

    # convert data from json to dict; a separate name keeps the
    # year-level dict in decoded_content from being overwritten
    qtr_content = file_content.json()
    
    print("-" * 100)
    print("Pulling files")
    
    for file in qtr_content['directory']['item']:
        file_url = make_url(base_url, ['2020', item["name"], file['name']])
        print(file_url)
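
Printing the URLs is fine for exploring, but for the next step it is handier to keep them in a list. The sketch below is one possible way to do that and keeps only the master files. The helper name `master_file_urls` and the `sample_quarter` dict are hypothetical; the sample is hand-written to mimic the structure of a quarter-level index.json, not copied from the server.

```python
BASE_URL = "https://www.sec.gov/Archives/edgar/daily-index"


def master_file_urls(quarter_json, year, quarter, base_url=BASE_URL):
    """Build the URL of every file whose name starts with 'master.'."""
    urls = []
    for entry in quarter_json["directory"]["item"]:
        name = entry["name"]
        if name.startswith("master."):
            # same result as make_url(base_url, [year, quarter, name])
            urls.append("/".join([base_url, year, quarter, name]))
    return urls


# hand-written sample with the same shape as a quarter-level index.json
sample_quarter = {
    "directory": {
        "item": [
            {"name": "form.20200401.idx", "type": "file"},
            {"name": "master.20200401.idx", "type": "file"},
            {"name": "master.20200402.idx", "type": "file"},
        ]
    }
}

urls = master_file_urls(sample_quarter, "2020", "QTR2")
for url in urls:
    print(url)
```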


Parsing the master IDX file

Out of all the files to parse, I find the master.idx file the easiest because its fields are separated by a delimiter (the pipe character, `|`). The other files either lack such a delimiter or lack the additional detail that the master file provides. That said, if I had to choose a second file to parse, it would probably be the sitemap file because of its structure.

The first step is to save the downloaded content to a text file, so that you don't have to make a second request and burden the server. Once the file exists, we can reload the data simply by opening it. From here, I usually encourage people to explore the data before they perform any parsing. We will notice right away that getting the info may be a little challenging, but it can be done.

The approach laid out below worked for most files I encountered, but I cannot guarantee it will work for all of them. As time goes on and you see more of the data, parsing the dataset will become more comfortable.

In [None]:
# define a url, in this case I'll just take one of the urls up above.
file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2020/QTR2/master.20200401.idx"

# request that new content, this will not be a JSON STRUCTURE!
# this will download the file with all data of that day 2020-04-01
# and store the content into content var
content = requests.get(file_url).content

# we can always write the content to a file, so we don't need to request it again.
with open('master_20200401.txt', 'wb') as f:
     f.write(content)
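
The "download once, then read from disk" idea can also be wrapped in a small helper. This is a sketch under my own assumptions, not part of the original workflow: the name `load_or_fetch` is hypothetical, and `fetch` is any zero-argument function that returns bytes.

```python
import os


def load_or_fetch(path, fetch):
    """Return the bytes stored at `path`; call `fetch()` only if the file is missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()

    # nothing cached yet: fetch once and keep a copy on disk
    data = fetch()
    with open(path, "wb") as f:
        f.write(data)
    return data


# possible usage with the variables of the cell above:
# content = load_or_fetch('master_20200401.txt',
#                         lambda: requests.get(file_url).content)
```

Because the download is passed in as a function, the server is only contacted when the local copy does not exist yet.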

In [22]:
# let's open it and we will now have a byte stream to play with.
with open('master_20200401.txt','rb') as f:
     byte_data = f.read()

# Now that we loaded the data, we have a byte stream that needs to be decoded and then split by double spaces.
data = byte_data.decode("utf-8").split('  ')

# removing the headers, so look for the end of the header and grab its index
for index, item in enumerate(data):
    if "ftp://ftp.sec.gov/edgar/" in item:
        start_ind = index

# define a new dataset without the headers.
data_format = data[start_ind + 1:]

master_data = []

# loop through the data list
for index, item in enumerate(data_format):
    # turn line breaks into delimiters so every field can be split on '|'
    clean_item_data = item.replace('\n','|').split('|')

    # the first chunk still carries leftover header fields, so drop its first 8 entries
    if index == 0:
        clean_item_data = clean_item_data[8:]
    
    # use a separate name for the inner index so the outer one is not shadowed
    for row_index, row in enumerate(clean_item_data):

        if ".txt" in row:
            # the file name is the last of five fields, so grab it and the four before it
            mini_list = clean_item_data[(row_index - 4): row_index + 1]
            
            # keep complete records only; a shorter slice would fail on mini_list[4]
            if len(mini_list) == 5:
                mini_list[4] = "https://www.sec.gov/Archives/" + mini_list[4]
                master_data.append(mini_list)
master_data[:3]

[['1189229',
  'PAISLEY CHRISTOPHER B',
  '4',
  '20200401',
  'https://www.sec.gov/Archives/edgar/data/1189229/0001209191-20-022250.txt'],
 ['1189425',
  'POTTER ROBERT L',
  '4',
  '20200401',
  'https://www.sec.gov/Archives/edgar/data/1189425/0001121484-20-000050.txt'],
 ['1189703',
  'TELLEZ CORA M',
  '4',
  '20200401',
  'https://www.sec.gov/Archives/edgar/data/1189703/0001225208-20-005755.txt']]
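
For comparison, here is a line-based variant of the same parsing step. It is a different technique from the cell above (it splits on line breaks instead of double spaces) and rests on one assumption: every record sits on its own line with exactly five pipe-separated fields, the last one ending in `.txt`. The name `parse_master_lines` is hypothetical, and `sample_text` is hand-written; its header lines are placeholders, while the two records are taken from the output above.

```python
ARCHIVES_URL = "https://www.sec.gov/Archives/"


def parse_master_lines(text):
    """Return one 5-item list per record line; header lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.strip().split("|")
        # header and separator lines do not have five fields ending in '.txt'
        if len(fields) == 5 and fields[4].endswith(".txt"):
            fields[4] = ARCHIVES_URL + fields[4]
            rows.append(fields)
    return rows


# hand-written sample: placeholder header, two records from the output above
sample_text = "\n".join([
    "Header text (abbreviated for this example)",
    "",
    "CIK|Company Name|Form Type|Date Filed|File Name",
    "-" * 80,
    "1189229|PAISLEY CHRISTOPHER B|4|20200401|edgar/data/1189229/0001209191-20-022250.txt",
    "1189425|POTTER ROBERT L|4|20200401|edgar/data/1189425/0001121484-20-000050.txt",
])

rows = parse_master_lines(sample_text)
print(rows[0])
```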

Creating our Document Dictionary

An extra step we can take is converting our master list into a list of dictionaries, where each dictionary represents a single filing document. This lets us iterate over the master list quickly and grab just the fields we need, which helps down the road when we only want some of the information. I encourage you to invest the time up front to keep the bulk of the parsing quick and easy.

In [24]:
# loop through each document in the master list.
for index, document in enumerate(master_data):
    
    # create a dictionary for each document in the master list
    document_dict = {}
    document_dict['cik_number'] = document[0]
    document_dict['company_name'] = document[1]
    document_dict['form_id'] = document[2]
    document_dict['date'] = document[3]
    document_dict['file_url'] = document[4]
    
    master_data[index] = document_dict

In [26]:
master_data[:3]

[{'cik_number': '1189229',
  'company_name': 'PAISLEY CHRISTOPHER B',
  'form_id': '4',
  'date': '20200401',
  'file_url': 'https://www.sec.gov/Archives/edgar/data/1189229/0001209191-20-022250.txt'},
 {'cik_number': '1189425',
  'company_name': 'POTTER ROBERT L',
  'form_id': '4',
  'date': '20200401',
  'file_url': 'https://www.sec.gov/Archives/edgar/data/1189425/0001121484-20-000050.txt'},
 {'cik_number': '1189703',
  'company_name': 'TELLEZ CORA M',
  'form_id': '4',
  'date': '20200401',
  'file_url': 'https://www.sec.gov/Archives/edgar/data/1189703/0001225208-20-005755.txt'}]
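
The same conversion can be written more compactly with `zip`. The sketch below is an alternative, not the code that produced the output above; `FIELD_NAMES`, `rows_to_dicts`, and `sample_rows` are names I made up for illustration.

```python
FIELD_NAMES = ["cik_number", "company_name", "form_id", "date", "file_url"]


def rows_to_dicts(rows):
    """Pair each 5-item row with the field names and return a new list of dicts."""
    return [dict(zip(FIELD_NAMES, row)) for row in rows]


# one record from the output above, in list form
sample_rows = [
    ["1189229", "PAISLEY CHRISTOPHER B", "4", "20200401",
     "https://www.sec.gov/Archives/edgar/data/1189229/0001209191-20-022250.txt"],
]

documents = rows_to_dicts(sample_rows)
print(documents[0]["company_name"])  # -> PAISLEY CHRISTOPHER B
```

Because this version returns a new list instead of overwriting the rows in place, the cell can be run again without failing on data that has already been converted.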

Filtering by File Type

Naturally, we might not need all the files that we have scraped, so let's explore how to filter the data. If we want to grab all the 10-K filings from the dataset, we loop through our master_data list and print only the entries whose form_id is 10-K. The example below loops through the whole list; to keep the output short, you can slice it first, for example master_data[:100].

The document URLs also set the stage for the next tutorial. The beautiful thing about them is that, with a few transformations, we can get to that particular company's filing archive.

If you would like more info on the different financial forms that we have access to, I encourage you to visit https://www.sec.gov/forms.
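
One possible version of that URL transformation is sketched below. It assumes the folder layout seen in the `make_url` example at the top of this notebook, where the filing folder is named after the accession number with the dashes removed. The function name `filing_folder_url` is hypothetical.

```python
def filing_folder_url(file_url):
    """Turn .../data/<cik>/<accession>.txt into .../data/<cik>/<accession without dashes>/"""
    # split off the file name, e.g. '0001731122-20-000338.txt'
    folder, file_name = file_url.rsplit("/", 1)

    # drop the extension and the dashes to get the folder name
    accession = file_name.rsplit(".", 1)[0].replace("-", "")

    return "{}/{}/".format(folder, accession)


example_url = "https://www.sec.gov/Archives/edgar/data/1394638/0001731122-20-000338.txt"
folder_url = filing_folder_url(example_url)
print(folder_url)
# -> https://www.sec.gov/Archives/edgar/data/1394638/000173112220000338/
```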

In [33]:
# by being in a dictionary format, it'll be easier to get the items we need.
for document_dict in master_data:
#     print(document_dict['form_id'] )

    # if it's a 10-K document pull the url and the name.
    if document_dict['form_id'] == '10-K':
        
        # get the components
        comp_name = document_dict['company_name']
        docu_url = document_dict['file_url']
        
        print('-'*100)
        print(comp_name)
        print(docu_url)

----------------------------------------------------------------------------------------------------
CREATIVE LEARNING Corp
https://www.sec.gov/Archives/edgar/data/1394638/0001731122-20-000338.txt
----------------------------------------------------------------------------------------------------
Predictive Oncology Inc.
https://www.sec.gov/Archives/edgar/data/1446159/0001171843-20-002217.txt
----------------------------------------------------------------------------------------------------
MARIMED INC.
https://www.sec.gov/Archives/edgar/data/1522767/0001493152-20-005546.txt
----------------------------------------------------------------------------------------------------
TILLY'S, INC.
https://www.sec.gov/Archives/edgar/data/1524025/0001628280-20-004436.txt
----------------------------------------------------------------------------------------------------
Vislink Technologies, Inc.
https://www.sec.gov/Archives/edgar/data/1565228/0001493152-20-005537.txt
----------------------------