## Preliminaries
Initial setup

In [5]:
import http_library
import csv # library to read/write/parse CSV files
from bs4 import BeautifulSoup # web-scraping library

acceptMime = 'text/html'
cikList = []
cikPath = 'cik.txt'


Open the file containing the list of CIK codes, read its lines, and build a list of the codes with surrounding whitespace stripped.

In [6]:
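# Use a context manager so the file is closed once the lines are read
with open(cikPath, newline='') as cikFileObject:
    cikRows = cikFileObject.readlines()

# Strip the trailing newline and any other surrounding whitespace from each code
for cik in cikRows:
    cikList.append(cik.strip())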

In [7]:
print(cikList)

['0001085917', '0000105598', '0000034088']
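
If the file might contain blank lines, the same read can be wrapped in a small helper that skips them. This is only a sketch: the name readCikList is hypothetical, and the temporary file below stands in for cik.txt.

```python
import os
import tempfile

def readCikList(path):
    # Hypothetical helper, not part of the original notebook:
    # return the non-blank lines of the file with whitespace stripped
    with open(path) as fileObject:
        return [line.strip() for line in fileObject if line.strip()]

# Write a small sample file (made-up layout, with a blank line and a trailing space)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('0001085917\n0000105598 \n\n0000034088\n')
    tmpPath = tmp.name

sampleCiks = readCikList(tmpPath)
os.remove(tmpPath)  # clean up the temporary file
print(sampleCiks)
```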


## Searching for 10-K forms
Create a list that will hold one dictionary for each matching result

In [8]:
resultsList = []

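The following query string asks for 10-K forms, but the search also returns forms whose type code merely starts with "10-K". It was adapted from a URL that returned the search data in Atom format. Because this cell runs after the loop above has finished, cik still holds the last code in the list.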

In [9]:
    baseUri = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK='+cik+'&type=10-K&dateb=&owner=exclude&start=0&count=40&output=atom'
    print(baseUri)

https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000034088&type=10-K&dateb=&owner=exclude&start=0&count=40&output=atom
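
As a possible alternative to concatenating the string by hand, the same URL can be assembled with urlencode from Python's standard library, which keeps the parameters readable and escapes any special characters. This is a sketch only: the helper name buildSearchUri and its keyword arguments are made up for illustration, and the parameter values are copied from the URL above.

```python
from urllib.parse import urlencode

def buildSearchUri(cik, formType='10-K', count=40):
    # Hypothetical helper, not part of the original notebook.
    # The parameter names and values mirror the hand-built URL.
    params = {
        'action': 'getcompany',
        'CIK': cik,
        'type': formType,
        'dateb': '',
        'owner': 'exclude',
        'start': 0,
        'count': count,
        'output': 'atom',
    }
    return 'https://www.sec.gov/cgi-bin/browse-edgar?' + urlencode(params)

print(buildSearchUri('0000034088'))
```

For the CIK shown, this should produce the same string as the concatenation above.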


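When a BeautifulSoup object is created, it cleans up dirty HTML and builds a hierarchical, XML-like structure that can be searched. In this case, the Atom output is already fairly clean XML, so the "html5lib" parser isn't really needed to interpret the input; naming it explicitly mainly suppresses the warning that BeautifulSoup issues when no parser is specified. Note that html5lib treats the feed as HTML, which is why the printed soup below shows the XML declaration turned into a comment and the feed wrapped in html, head, and body elements.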

In [10]:
    soup = BeautifulSoup(http_library.httpGet(baseUri,acceptMime)[1],features="html5lib")

In [11]:
    print(soup)

<!--?xml version="1.0" encoding="ISO-8859-1" ?--><html><head></head><body><feed xmlns="http://www.w3.org/2005/Atom">
    <author>
      <email>webmaster@sec.gov</email>
      <name>Webmaster</name>
    </author>
    <company-info>
      <addresses>
        <address type="mailing">
          <city>IRVING</city>
          <state>TX</state>
          <street1>5959 LAS COLINAS BLVD</street1>
          <zip>75039-2298</zip>
        </address>
        <address type="business">
          <city>IRVING</city>
          <phone>9729406000</phone>
          <state>TX</state>
          <street1>5959 LAS COLINAS BLVD</street1>
          <zip>75039-2298</zip>
        </address>
      </addresses>
      <assigned-sic>2911</assigned-sic>
      <assigned-sic-desc>PETROLEUM REFINING</assigned-sic-desc>
      <assigned-sic-href>http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&amp;SIC=2911&amp;owner=exclude&amp;count=40</assigned-sic-href>
      <assitant-director>4</assitant-director>
      <cik>

The following selector string passed to the select function limits the results to category elements whose term attribute is exactly equal to "10-K" (as opposed to ones that just start with "10-K"). The select function returns a list of elements, each of which can be searched using the same functions as the originally created soup object. "cat" stands for a category tag.

In [None]:
    for cat in soup.select('category[term="10-K"]'):
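
To make the difference between the two filters concrete, here is a small sketch in plain Python that mimics both matching rules. The list of form type codes is a made-up sample chosen for illustration; the comments give the corresponding CSS attribute selectors.

```python
# Made-up sample of form type codes, for illustration only
formTypes = ['10-K', '10-K/A', '10-K405', '10-Q', '8-K']

# Prefix match: everything that starts with "10-K"
# (CSS equivalent: category[term^="10-K"])
prefixMatches = [t for t in formTypes if t.startswith('10-K')]

# Exact match: only the plain "10-K" code
# (CSS equivalent: category[term="10-K"], as used in the loop)
exactMatches = [t for t in formTypes if t == '10-K']

print(prefixMatches)  # ['10-K', '10-K/A', '10-K405']
print(exactMatches)   # ['10-K']
```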

We can't use the normal, simple way to traverse the tree (cat.filing-href) because Python interprets the hyphen in the tag name as a minus sign. Also, I couldn't get .strings to work, so I used the tag's first child element (its string content) instead.

In [None]:
        date = cat.find('filing-date').contents[0]
        year = date[:4]
        print(year)
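
The hyphen problem can be checked without any web data by asking Python how it parses the expression. The sketch below uses the standard ast module and only parses the text; it never evaluates it, so no cat object is needed.

```python
import ast

# Parse the expression text without evaluating it
tree = ast.parse('cat.filing-href', mode='eval')
node = tree.body

# The top-level node is a subtraction: (cat.filing) - (href)
print(type(node).__name__)     # BinOp
print(type(node.op).__name__)  # Sub
print(node.left.attr)          # filing
print(node.right.id)           # href
```

Passing the tag name as a string, as in cat.find('filing-date'), avoids this because the hyphen is then just a character inside the string.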

At this point, I take the individual result that was found and create a dictionary for it.  

In [None]:
        searchResults = {'cik':cik,'year':year,'uri':cat.find('filing-href').contents[0]}

I've added a filter here to include only results from particular years (it can be changed or commented out)

In [None]:
        if year == "2016" or year == "2014":

Now I append the dictionary to the growing list of results

In [None]:
            resultsList.append(searchResults)
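
If the list of years grows, a set membership test may be easier to maintain than a chain of "or" comparisons. The following is a minimal sketch under that assumption; wantedYears and the sample dictionaries are made up, with the same keys as searchResults and placeholder URIs.

```python
# Years to keep; edit this set to change the filter
wantedYears = {'2014', '2016'}

# Made-up sample results with the same keys as searchResults
sampleResults = [
    {'cik': '0000034088', 'year': '2016', 'uri': 'https://example.org/a'},
    {'cik': '0000034088', 'year': '2015', 'uri': 'https://example.org/b'},
    {'cik': '0000034088', 'year': '2014', 'uri': 'https://example.org/c'},
]

# Keep only the results whose year is in the wanted set
filteredResults = [r for r in sampleResults if r['year'] in wantedYears]
print([r['year'] for r in filteredResults])  # ['2016', '2014']
```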

The loop is done; now show the results

In [None]:
print(resultsList)
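
Since the csv library is imported at the top, one plausible next step is to save the list of dictionaries as a CSV table. The sketch below is an assumption about how that could look, not something taken from this notebook: the sample rows are made up, and an in-memory buffer stands in for a real output file.

```python
import csv
import io

# Made-up sample rows with the same keys as the dictionaries in resultsList
sampleResults = [
    {'cik': '0000034088', 'year': '2016', 'uri': 'https://example.org/a'},
    {'cik': '0000034088', 'year': '2014', 'uri': 'https://example.org/c'},
]

# In-memory buffer; for a real file one would use open(path, 'w', newline='')
outputBuffer = io.StringIO()

writer = csv.DictWriter(outputBuffer, fieldnames=['cik', 'year', 'uri'])
writer.writeheader()             # first row: column names
writer.writerows(sampleResults)  # one row per dictionary

csvText = outputBuffer.getvalue()
print(csvText)
```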