## Preliminaries
Initial setup

In [None]:
import requests   # library for making HTTP requests
import csv # library to read/write/parse CSV files
from bs4 import BeautifulSoup # web-scraping library

acceptMime = 'text/html'
cikList = []
cikPath = 'cik.txt'


Open the file containing the CIK codes, read it line by line, and build a list of the codes with surrounding whitespace stripped.

In [None]:
with open(cikPath, newline='') as cikFileObject:  # the with block closes the file automatically
    cikRows = cikFileObject.readlines()

for cik in cikRows:
    cikList.append(cik.strip())
print(cikList)
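
A minimal sketch of the same read step wrapped in a function, assuming blank lines in the file should be skipped. The name `read_cik_list` and the sample file contents are made up for illustration and are not part of the notebook.

```python
import os
import tempfile

def read_cik_list(path):
    """Return the non-empty lines of a text file with whitespace stripped."""
    with open(path, newline='') as cik_file:
        return [line.strip() for line in cik_file if line.strip()]

# Demonstrate with a throwaway file holding made-up CIK codes
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('0000000001\n  0000000002  \n\n')
    sample_path = tmp.name

sample_ciks = read_cik_list(sample_path)
os.remove(sample_path)
print(sample_ciks)
```

Skipping blank lines keeps an empty string from ending up in the list and later producing a search URL with no CIK in it.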

## Searching for 10-K forms
Create an empty list to hold one dictionary per matching search result

In [None]:
resultsList = []

Build the search URL from a pattern worked out by experimenting with the online EDGAR search

In [None]:
cik = cikList[2] # in the final script, this will loop through all of the CIK codes. (elements 0 and 1 don't produce any results)
# this query string selects 10-K forms, but also retrieves any form whose code starts with 10-K
baseUri = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK='+cik+'&type=10-K&dateb=&owner=exclude&start=0&count=40&output=atom'
print(baseUri)
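
As an alternative to string concatenation, the query string could be assembled from a dictionary with the standard library's `urlencode`. This is only a sketch: the function name `build_search_uri` and the CIK value are made up for illustration, while the parameter names and values are copied from the URL above.

```python
from urllib.parse import urlencode

def build_search_uri(cik, form_type='10-K', count=40):
    """Assemble the company-browse URL from its query parameters."""
    params = {
        'action': 'getcompany',
        'CIK': cik,
        'type': form_type,
        'dateb': '',
        'owner': 'exclude',
        'start': 0,
        'count': count,
        'output': 'atom',
    }
    return 'https://www.sec.gov/cgi-bin/browse-edgar?' + urlencode(params)

example_uri = build_search_uri('0000000001')  # made-up CIK for illustration
print(example_uri)
```

Keeping the parameters in a dictionary makes it easier to change one of them (for example `start`, to page through results) without editing a long string.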


Retrieve the XML document and parse it into a Beautiful Soup object (a cleaned-up tree that can be searched by tag name, attribute, or CSS selector)

In [None]:
r = requests.get(baseUri, headers={'Accept' : 'application/xml'})
soup = BeautifulSoup(r.text,features="html5lib")
print(soup)

The search string (term="10-K") limits results to category elements whose term attribute is exactly equal to "10-K".

The select function returns a list of soup objects, each of which can be searched in turn.

In [None]:
for cat in soup.select('category[term="10-K"]'):
    # can't use cat.filing-href because hyphen in tag is interpreted by Python as a minus
    # also, couldn't get .strings to work, so used first child element (the string content of the tag)
    date = cat.find('filing-date').contents[0]
    year = date[:4] # the year is the first four characters of the date string
    print(year)
    # create a dictionary of an individual result
    searchResults = {'cik':cik,'year':year,'uri':cat.find('filing-href').contents[0]}
    if year == "2016" or year == "2014":
        # append the dictionary to the list of results
        resultsList.append(searchResults)
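
The year test in the loop can also be written as a set membership check, which scales better if more years are added later. A minimal sketch, assuming result dictionaries shaped like the ones built above; the function name `keep_years` and the sample data are made up for illustration.

```python
def keep_years(results, wanted_years):
    """Return only the result dictionaries whose 'year' is in wanted_years."""
    return [result for result in results if result['year'] in wanted_years]

# Made-up search results shaped like the dictionaries built in the loop
sample_results = [
    {'cik': '0000000001', 'year': '2016', 'uri': 'https://example.org/a'},
    {'cik': '0000000001', 'year': '2015', 'uri': 'https://example.org/b'},
    {'cik': '0000000001', 'year': '2014', 'uri': 'https://example.org/c'},
]
kept = keep_years(sample_results, {'2014', '2016'})
print([result['year'] for result in kept])
```

Note that the years are compared as strings, because they are sliced out of the date string rather than converted to integers.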

The loop is done; now show the results

In [None]:
print(resultsList)

## Searching for the components of an individual 10-K filing

Start by showing the URL to be retrieved

In [None]:
form10kList = [] # create an empty list to put the results in
hitNumber = 0  # in the final script, loop through the resultsList.  Here, just do the first result.
# for hitNumber in range(0,len(resultsList)):
print(resultsList[hitNumber]['uri'])

Retrieve the HTML and turn it into a cleaned-up soup object

In [None]:
r = requests.get(resultsList[hitNumber]['uri'], headers={'Accept' : 'text/html'})
soup = BeautifulSoup(r.text,features="html5lib")
print(soup)

Select the tr elements, producing a list with one soup object per table row

In [None]:
trArray = soup.select('tr')
print(trArray)
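
To see what select returns without fetching a page, it can be tried on a small inline table. This is a sketch only: the HTML fragment is made up and is not the actual layout of a filing index page, and it uses Python's built-in `html.parser` rather than html5lib to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up fragment loosely shaped like a table of filing documents
sample_html = """
<table>
  <tr><th>Seq</th><th>Document</th><th>Type</th></tr>
  <tr><td>1</td><td><a href="/Archives/example/form10k.htm">form10k.htm</a></td><td>10-K</td></tr>
  <tr><td>2</td><td><a href="/Archives/example/ex21.htm">ex21.htm</a></td><td>EX-21</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, features='html.parser')
sample_rows = sample_soup.select('tr')

# Each row is itself searchable, so its td cells can be selected in turn
cell_texts = [[cell.get_text(strip=True) for cell in row.select('td')]
              for row in sample_rows]
print(len(sample_rows))
print(cell_texts)
```

The header row yields an empty list because it contains th elements rather than td elements, which is why code that scans the cells has to cope with rows that have nothing to test.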

Loop through each of the tr elements and check whether it has a td element that contains "10-K".  If so, add the value of the row's href attribute to the results list.  Note: the href values are relative, so 'http://www.sec.gov' must be prepended to make each one an absolute URL.

In [None]:
for row in trArray:
    is10k = False
    for cell in row.select('td'):
        try:
            if cell.contents[0] == "10-K":
                is10k = True
        except IndexError:  # handle error cases where the cell doesn't have contents
            pass
    if is10k:
        form10kList.append('http://www.sec.gov' + row.a.get('href'))
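
Instead of concatenating strings, the standard library's `urljoin` can resolve a relative href against the site root, and it leaves an href that is already absolute unchanged. A minimal sketch; the paths below are made up for illustration.

```python
from urllib.parse import urljoin

base = 'https://www.sec.gov'
relative_href = '/Archives/example/form10k.htm'  # made-up relative path

absolute = urljoin(base, relative_href)
# An href that is already absolute is returned as-is
already_absolute = urljoin(base, 'https://example.org/doc.htm')

print(absolute)
print(already_absolute)
```

This guards against building a malformed URL if a link on the page ever turns out to be absolute rather than relative.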


Print the resulting list

In [None]:
print(form10kList)

## Retrieve the actual 10-K page and pull out the signatory names
This will eventually be a loop, but for now, just do the first result

In [None]:
form10kNumber = 0
# for form10kNumber in range(0,len(form10kList)):
print(form10kList[form10kNumber])

Retrieve the HTML for the web page and turn it into a Beautiful Soup object