<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#pyEDGAR" data-toc-modified-id="pyEDGAR-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>pyEDGAR</a></span></li><li><span><a href="#EDGAR" data-toc-modified-id="EDGAR-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>EDGAR</a></span></li><li><span><a href="#EDGAR-Indices" data-toc-modified-id="EDGAR-Indices-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>EDGAR Indices</a></span><ul class="toc-item"><li><span><a href="#Downloading-indices-(optional)" data-toc-modified-id="Downloading-indices-(optional)-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Downloading indices (optional)</a></span></li><li><span><a href="#Indices-as-dataframes" data-toc-modified-id="Indices-as-dataframes-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Indices as dataframes</a></span></li></ul></li><li><span><a href="#EDGAR-Filings" data-toc-modified-id="EDGAR-Filings-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>EDGAR Filings</a></span></li><li><span><a href="#Working-with-Filings" data-toc-modified-id="Working-with-Filings-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Working with Filings</a></span></li><li><span><a href="#Filings-as-Plaintext" data-toc-modified-id="Filings-as-Plaintext-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Filings as Plaintext</a></span></li><li><span><a href="#Filings-as-HTML" data-toc-modified-id="Filings-as-HTML-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Filings as HTML</a></span></li><li><span><a href="#Homework" data-toc-modified-id="Homework-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Homework</a></span></li></ul></div>

So now you know the basics of Python, how to read files and some simple regular expressions.

Now let's turn to EDGAR, and all the joys that lie therein.

We're going to start by looking at the index of all filings, so we can find which one we want to analyze.
Then, we're going to look at specific filings and get comfortable with their format, extracting documents, and pulling info out of HTML.

Once again, we approach it from a project/goal driven mindset.
With that in mind, here are some things we might want to determine from the document:

  1. Word or page count of the document.
  1. How many images do firms include in their proxy statements?
  1. What are the different sections of a proxy statement?
  1. How might we extract a specific section?
  1. How might we extract numbers, like CEO's compensation?
 
These are all simple questions with very hard answers.
Luckily, there are quite a few libraries or tricks we can employ to try and make this easier.

# pyEDGAR

Over the course of my PhD, I put together a library I called [pyEDGAR](https://github.com/gaulinmp/pyedgar), which tries to facilitate the most common tasks of interacting with SEC filings on the EDGAR website.

You don't have to use pyEDGAR by any means. In fact, if you look [here](https://github.com/gaulinmp/pyedgar/blob/master/pyedgar/utilities/edgarweb.py#L130) you can see that it's pretty darn easy to download EDGAR documents with just the [requests](https://2.python-requests.org/en/master/) library.
But I'm going to use `pyedgar` because I'm lazy, and don't want to re-write all the parsing code I've already written and packaged up.

So, to get pyEDGAR, you can install it in a one-liner:

```bash
$ pip install git+https://github.com/gaulinmp/pyedgar#egg=pyedgar
```

Or if you want to pull any future updates:

```bash
$ cd ~/wherever_you_want_the_library_to_go/
$ git clone https://github.com/gaulinmp/pyedgar
$ cd pyedgar
$ pip install -e ./
```

Either should work; for now, the former is probably easier.

In [None]:
import pyedgar

pyEDGAR looks for a config file, which you probably don't have.
The thing we really care about for now is where it will put files if you want to download them locally:

In [None]:
from pyedgar import config

In [None]:
# dir() looks at all the functions and variables that are attached to an object.
[x for x in dir(config) if x.isupper()]

The thing we're interested in for now is `INDEX_CACHE_ROOT`. I don't expect you to download all the filings (but you can, with that `CACHE_FEED`):

In [None]:
config.INDEX_CACHE_ROOT

By default, that will point to a temp folder, which may not exist yet. If you want to download the indices below, you should create it:

In [None]:
import os
# makedirs also creates any missing parent folders (e.g. /tmp/pyedgar),
# and exist_ok=True means nothing happens if the folder already exists.
os.makedirs(config.INDEX_CACHE_ROOT, exist_ok=True)

In [None]:
os.path.exists(config.INDEX_CACHE_ROOT)

# EDGAR

Documentation: [https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm](https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm)

EDGAR hosts the public filings of all public companies (and some private) that were submitted since 1995.
The filings are associated with one or more firms (identified by CIKs), and each filing has a unique identifier (Accession).

Note: One Accession can have multiple CIKs associated with it, so sometimes your UID could just be the Accession, but when matching to Compustat you need the CIK as well.
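
Since the CIK and Accession together identify a filing, they are also enough to build its URL. Here's a minimal sketch of the two URL shapes that show up in this tutorial (the full-filing `.txt` path from the index `Filename` column, and the per-document path of the Google proxy we look at later). The function names `filing_url` and `document_url` are made up for illustration; they are not part of pyEDGAR.

```python
ARCHIVES_ROOT = 'https://www.sec.gov/Archives/'


def filing_url(cik, accession):
    """URL of the full submission text file (header plus all documents)."""
    # int() drops any zero padding, e.g. '0001000045' -> 1000045
    return f'{ARCHIVES_ROOT}edgar/data/{int(cik)}/{accession}.txt'


def document_url(cik, accession, filename):
    """URL of one document inside a filing; the folder is the accession without dashes."""
    folder = accession.replace('-', '')
    return f'{ARCHIVES_ROOT}edgar/data/{int(cik)}/{folder}/{filename}'


print(filing_url('0001000045', '0001193125-19-039489'))
print(document_url(1652044, '0001308179-17-000170', 'lgoog2017_def14a.htm'))
```

If you do download from these URLs yourself, be gentle with the SEC's servers.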

There are two primary things we care about from EDGAR.
The first is the filings, obviously.
The second is the index of all filings.
This is necessary because we need to know what filings exist so we can look them up.
Let's start with the index:

# EDGAR Indices

EDGAR indices reside at: [https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/](https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/).

Each one contains a list of all filings filed in a given quarter (so it's huge in later years!).
We need this list to know what filings we might want to look at, for example all 10-Ks.

Here's what the top few lines of those files look like:

In [None]:
_ = """
Description:           Master Index of EDGAR Dissemination Feed
Last Data Received:    March 31, 2019
Comments:              webmaster@sec.gov
Anonymous FTP:         ftp://ftp.sec.gov/edgar/
Cloud HTTP:            https://www.sec.gov/Archives/

 
 
 
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000045|NICHOLAS FINANCIAL INC|10-Q|2019-02-14|edgar/data/1000045/0001193125-19-039489.txt
1000045|NICHOLAS FINANCIAL INC|4|2019-01-15|edgar/data/1000045/0001357521-19-000001.txt
1000045|NICHOLAS FINANCIAL INC|4|2019-02-19|edgar/data/1000045/0001357521-19-000002.txt
"""
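
To make that layout concrete, here's a rough sketch of parsing it by hand with nothing but string methods. It assumes the format is exactly what's shown above (a `CIK|...` header line, a dashed separator, then pipe-delimited rows); `parse_master_index` and `accession_from_filename` are hypothetical helper names, not pyEDGAR functions.

```python
def parse_master_index(text):
    """Turn the text of a master index file into a list of row dictionaries."""
    lines = text.splitlines()
    # The column header is the first line that starts with "CIK|".
    header_at = next(i for i, line in enumerate(lines) if line.startswith('CIK|'))
    columns = [c.strip().lower().replace(' ', '_') for c in lines[header_at].split('|')]

    rows = []
    for line in lines[header_at + 1:]:
        fields = line.split('|')
        if len(fields) != len(columns):
            continue  # skips the dashed separator and any blank lines
        rows.append(dict(zip(columns, fields)))
    return rows


def accession_from_filename(filename):
    """'edgar/data/1000045/0001193125-19-039489.txt' -> '0001193125-19-039489'"""
    return filename.rsplit('/', 1)[-1].rsplit('.', 1)[0]


sample = """CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000045|NICHOLAS FINANCIAL INC|10-Q|2019-02-14|edgar/data/1000045/0001193125-19-039489.txt
1000045|NICHOLAS FINANCIAL INC|4|2019-01-15|edgar/data/1000045/0001357521-19-000001.txt
"""
rows = parse_master_index(sample)
ten_qs = [r for r in rows if r['form_type'] == '10-Q']
print(ten_qs[0]['company_name'], accession_from_filename(ten_qs[0]['filename']))
```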

We can get these indexes using EDGARIndex, which just downloads all the quarters since 1995 and puts them into one big table.
It also makes separate tables for different form types for convenience.

In [None]:
from pyedgar import EDGARIndex
idx = EDGARIndex()

Now we can look at what indices pyEDGAR has found:

In [None]:
idx.indices

Well that makes sense, we haven't downloaded anything yet.
How do you download indices?

## Downloading indices (optional)

In [None]:
from pyedgar.utilities.indices import IndexMaker

In [None]:
idxm = IndexMaker()
idxm._get_index_cache_path('2014Q1')

In [None]:
# Newer tqdm versions keep the notebook progress bar in tqdm.notebook
from tqdm.notebook import tqdm as tqdm_notebook
idxm._tq = tqdm_notebook
idxm._downloader._tq = tqdm_notebook

In [None]:
idxm.extract_indexes(start_year=2019)

## Indices as dataframes

Now that we've downloaded some indices, we can take a look at them:

In [None]:
idx.indices

Let's look at what's in the Def-14A, proxy statement filings:

In [None]:
d = idx['DEF14A']
d[d.name.str.contains("Google")].head(5)

What is this beautiful table?
It's a [pandas](https://pandas.pydata.org/) [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html).

We'll use a lot more dataframes next class, for now let's just use it to show us `cik` and `accession` and move on to the actual filings.

# EDGAR Filings

EDGAR filings are the full text of what a company filed with the SEC.
The filing is a weird hybrid of text, HTML, SGML, and attachments which could be anything, like Excel, PDF, Zip, PNG/JPG, etc.

What we care about is the general format. It goes like this:

```html
<SEC-DOCUMENT>
    <SEC-HEADER>
        <!-- format: HEADER NAME: HEADER VALUE -->
        FILER:
            COMPANY INFO:
                Street Address: etc....
    </SEC-HEADER>
    <DOCUMENT>
        <TYPE>DEFA14A
        <SEQUENCE>1
        <FILENAME>d935572ddefa14a.htm
        <DESCRIPTION>DEFA14A
        <TEXT>
            <!-- The actual filed document here -->
        </TEXT>
    </DOCUMENT>
    <DOCUMENT>
        <TYPE>IMAGE
        <SEQUENCE>2
        <FILENAME>logo.png
        <DESCRIPTION>Logo image file
        <TEXT>
            <!-- The actual image, encoded as ASCII text (uuencoded) -->
        </TEXT>
    </DOCUMENT>
</SEC-DOCUMENT>
```

So what we care about is in those `<TEXT>` tags, or in that `<SEC-HEADER>` tag. 
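
To see what the manual route looks like, here's a small regex sketch that splits a filing into its `<DOCUMENT>` blocks and pulls out the tag values and the `<TEXT>` body. It assumes the simplified layout shown above, and `split_documents` is a made-up name; it is not how pyEDGAR does it.

```python
import re

DOC_RE = re.compile(r'<DOCUMENT>(.*?)</DOCUMENT>', re.DOTALL)
TEXT_RE = re.compile(r'<TEXT>(.*?)</TEXT>', re.DOTALL)
# These tags have no closing tag: the value runs to the end of the line.
TAG_RE = re.compile(r'^[ \t]*<(TYPE|SEQUENCE|FILENAME|DESCRIPTION)>(.*)$', re.MULTILINE)


def split_documents(filing_text):
    """Return one dictionary per <DOCUMENT> block in the filing."""
    docs = []
    for chunk in DOC_RE.findall(filing_text):
        # Only look for the tags before <TEXT>, so the body can't confuse us.
        preamble = chunk.split('<TEXT>', 1)[0]
        doc = {name.lower(): value.strip() for name, value in TAG_RE.findall(preamble)}
        body = TEXT_RE.search(chunk)
        doc['full_text'] = body.group(1).strip() if body else ''
        docs.append(doc)
    return docs


sample_filing = """<SEC-DOCUMENT>
<DOCUMENT>
<TYPE>DEFA14A
<SEQUENCE>1
<FILENAME>d935572ddefa14a.htm
<DESCRIPTION>DEFA14A
<TEXT>
<html><body>Proxy text</body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>IMAGE
<SEQUENCE>2
<FILENAME>logo.png
<DESCRIPTION>Logo image file
<TEXT>
(encoded image data)
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>"""

docs = split_documents(sample_filing)
print([(d['type'], d['filename']) for d in docs])
```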

This is where you could manually extract that information, or you could let someone else waste their time doing that for you, so you can just jump straight to the documents:

In [None]:
# Import the filing
from pyedgar import Filing

one_def14a = Filing(1288776, '0001193125-05-072803')
one_def14a

When you first create a `Filing` object, it doesn't actually read in the filing.
That only happens when you actually access the filing's data:

In [None]:
print(one_def14a.full_text[:2000])

In [None]:
one_def14a

We can see that the filing now has `Text:True`, meaning the text is loaded into the filing.
But headers and documents haven't been loaded. 
What does that mean?

As we saw above, the structure of a filing has the header section, and then a bunch of documents sequentially listed in the file.
The `Filing` object knows about these, but doesn't waste CPU time parsing them until you explicitly ask for it:

In [None]:
one_def14a.headers

In [None]:
one_def14a

So we've loaded the headers.

There can be multiple filers in an accession, for example Google and Alphabet.
They each have an address (which is often the same), and filer information.
So how to read in this header?

The `Filing` object reads the headers in two ways: flat and hierarchical.
  * **Flat**: All header entries are put in one dictionary, ignoring keys if they already exist.
  * **Hierarchical**: Header entries are entered into the dictionary like Flat, but when indentation is found, those indented entries are put in a sub-dictionary.
  
So for a hierarchical header example:

```
EFFECTIVENESS DATE:		20150603

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:		Google Inc.
		CENTRAL INDEX KEY:			0001288776

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:		Alphabet Inc.
		CENTRAL INDEX KEY:			0001652044
```

would result in a dictionary looking like:

```python
{'effectiveness-date': 20150603,
 'filer': {
     'company-data': {
         'company-conformed-name': 'Google Inc.',
         'central-index-key': '0001288776',
     }
 },
 'filer_0': {
     'company-data': {
         'company-conformed-name': 'Alphabet Inc.',
         'central-index-key': '0001652044',
     }
 }
}
```
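
If you're curious how indentation can turn into nesting, here's a toy sketch of the hierarchical idea. It assumes tab indentation as in the example header, keeps every value as a string, and copies the `filer_0` naming for repeated keys from the example; `parse_header` is a hypothetical name, and this is an illustration, not pyEDGAR's actual parser.

```python
def parse_header(text):
    """Parse 'KEY: VALUE' lines into nested dictionaries based on tab indentation."""
    root = {}
    stack = [(-1, root)]  # (indent level, dictionary that owns that level)
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip('\t'))
        key, _, value = line.strip().partition(':')
        key = key.strip().lower().replace(' ', '-')
        value = value.strip()

        # Climb back out of any section that is indented as deep or deeper.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]

        if value:
            parent.setdefault(key, value)  # first value wins, like Flat
        else:
            # A key with no value opens a sub-dictionary; repeats get a suffix.
            name, n = key, 0
            while name in parent:
                name = f'{key}_{n}'
                n += 1
            parent[name] = {}
            stack.append((indent, parent[name]))
    return root


header = (
    "EFFECTIVENESS DATE:\t\t20150603\n"
    "\n"
    "FILER:\n"
    "\n"
    "\tCOMPANY DATA:\t\n"
    "\t\tCOMPANY CONFORMED NAME:\t\tGoogle Inc.\n"
    "\t\tCENTRAL INDEX KEY:\t\t\t0001288776\n"
    "\n"
    "FILER:\n"
    "\n"
    "\tCOMPANY DATA:\t\n"
    "\t\tCOMPANY CONFORMED NAME:\t\tAlphabet Inc.\n"
    "\t\tCENTRAL INDEX KEY:\t\t\t0001652044\n"
)
parsed = parse_header(header)
print(parsed['filer_0']['company-data']['company-conformed-name'])
```

The Flat version is the same loop without the stack: just `setdefault` every key that has a value into one dictionary.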

So our headers are loaded as a dictionary, meaning we can easily extract information from them:

In [None]:
one_def14a.headers['conformed-submission-type']

The last part of the filing is the documents.
This is the main Def-14A text.

We can access it like this:

In [None]:
# one_def14a.documents

But let's not actually do that, it's a big file.
Instead, we'll use python's pretty-print to format it, and just display the first bit:

In [None]:
import pprint
print(pprint.pformat(one_def14a.documents, width=110)[:1000])

So `documents` is a list of documents, each of which is a dictionary containing the `<DESCRIPTION>`, `<FILENAME>`, and `<TEXT>` (in `full_text`).
So to get the text of our document:

In [None]:
print(one_def14a.documents[0]['full_text'][:1000])

And there we have it, our filing.
Well, the first document of the filing.
`Filing`s are always guaranteed to have at least one document (except in error cases), and it's usually the main document (8-K or 10-K, for example).

# Working with Filings

HTML Documentation: [https://www.w3schools.com/html/html_intro.asp](https://www.w3schools.com/html/html_intro.asp)

Filings are typically submitted in either HTML form (seen above), or plain text form.
The latter is pretty simple to read, but lacks a lot of the contextual clues that we might use to extract data (for example, bold headings are probably the start of sections).

We'll mostly deal with HTML filings here, because that's what most companies file now (text was popular at EDGAR's beginnings, but not so much any more).
But our solution to HTML is sometimes just to convert to plain text, so you should be comfortable with both filing types.

So let's load up an HTML file and start playing with it:

In [None]:
# Import the filing
from pyedgar import Filing

filing = Filing(1652044, '0001308179-17-000170')

In [None]:
html = filing.documents[0]['full_text']
len(html)

That's pretty long for a proxy document, about 1MB.
What does it look like?

`Filing`s know that the filings come from EDGAR, so they have urls associated with them:

In [None]:
from IPython.display import display_html, HTML
# Display it as a link
HTML(f"<a href='{filing.urls[-1]}' target=_blank>{filing.urls[-1]}</a>")

And here in the notebook, we can peek at the start of the raw HTML:

In [None]:
print(html[:200])

It might be nice to see the actual formatted version:

In [None]:
HTML(html)

As we can see, that document has lots of formatting and stuff, most of which we don't really care about.

Parsing HTML documents is its own tutorial, and takes another lifetime to learn (this is at least the second one needed, after learning Regexes).
We'll only touch on the basics here, but as always there's a lot of practice to get comfortable with it.

To parse HTML, we could use regular expressions, but that's a bit confounded by the fact that we really want to search the displayed text, not all those html tags like `<div>` or `<div style="font-family:times">`.
So we largely have three options:

  1. Convert the HTML to plain-text, and then search/parse that plain text like we've done before.
  2. Use a library to parse the HTML for us, and use the contextual information we get from the HTML syntax to help us extract data more reliably.
  3. Middle-ground: use something like Markdown to convert HTML into plain-text with some context preserved, like headers, bold, italic, etc.
  
The first way is easier, but sometimes less robust.
The second way is harder, but sometimes the only way we can get a reliable extract.
The last option, the Markdown approach, is what I used to parse Risk Factor sections, and seemed to work pretty well.

There's no one right answer, so your approach should be customized to exactly the data you want to get.
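
To make option 1 concrete, here's a bare-bones sketch that uses only the standard library's `html.parser` to keep the displayed text and throw away the tags. The names `TextExtractor` and `html_to_text` are made up for this example, and it's deliberately crude: it joins text pieces with spaces, so it loses all layout and can split a word that is styled mid-word.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces, then collapse every run of whitespace to one space.
    return ' '.join(' '.join(parser.chunks).split())


snippet = '<div style="font-family:times"><b>Item 1A</b> Risk Factors</div>'
print(html_to_text(snippet))
```

Once the tags are gone, the regular expressions from before can search what a reader would actually see.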

# Filings as Plaintext

The first way to deal with HTML is just to strip all HTML from it, and get the plain-text.
There are two ways to do that.

First, if we're on linux, and have [w3m](http://w3m.sourceforge.net/) installed, we can use pyedgar:

In [None]:
from pyedgar.utilities import htmlparse

print(htmlparse.convert_html_to_text(html)[:1000])

As you probably saw, if you don't have w3m installed, that doesn't work. 
I only mention it because it's the fastest way to convert html to text that I've found (tested a few different methods), so if you're doing a big project, consider finding a linux server and using it.

A second way to convert is using [html2text](http://alir3z4.github.io/html2text/), which is actually solution 3 from above (the middle ground).
First, we have to install it:

In [None]:
!pip install html2text

In [None]:
import html2text

h = html2text.HTML2Text()
h.ignore_links = False

In [None]:
text = h.handle(html)

In [None]:
print(text[:1000])

Well that's not that pretty either.
Now you might see why I use w3m (or you can't see, but trust me it's beautifully formatted :).

As an explanation of what's happening, the html2text software takes in HTML, and converts it to [Markdown](https://daringfireball.net/projects/markdown/syntax), which is a plain-text format that converts into relatively simple HTML.
It uses things like \*italic\* for *italic*, or # Heading 1 for a `<h1>` tag.
In fact it's what these comments in the notebooks use!

This is convenient for things like defining a heading, because if we're looking for `Item 1A: Risk Factor`, we can search for `**Item 1A: Risk Factor**`.
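
As a sketch of that idea, the regex below lists every line that consists of nothing but bold text, which makes for a decent first guess at section headings. It assumes headings come out of the conversion as a single `**...**` line, which won't hold for every filing; the sample text and the name `bold_headings` are made up for illustration.

```python
import re

# A whole line that is just **some text**, with optional surrounding spaces.
BOLD_LINE = re.compile(r'^[ \t]*\*\*([^*\n]+)\*\*[ \t]*$', re.MULTILINE)


def bold_headings(markdown_text):
    """Return the text of every line that is entirely bold."""
    return [m.group(1).strip() for m in BOLD_LINE.finditer(markdown_text)]


sample_md = """**Item 1A: Risk Factors**

Some paragraph with a **bold** word in the middle.

**Item 1B: Unresolved Staff Comments**
"""
print(bold_headings(sample_md))
```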

As with all scraping work, the best way to figure this out is to try, fail, try again, and repeat ad nauseam until you get your 95% accuracy target.

So now we could look for things like **Compensation Discussion and Analysis**:

In [None]:
'Compensation Discussion and Analysis' in text

Wahoo! 
Obviously a regex solution would be better, but you learned how to do that in the first lecture, so we don't have to repeat ourselves here :)

# Filings as HTML

We're looking at this filing: [https://www.sec.gov/Archives/edgar/data/1652044/000130817917000170/lgoog2017_def14a.htm](https://www.sec.gov/Archives/edgar/data/1652044/000130817917000170/lgoog2017_def14a.htm)

To look at a document as HTML, we need (once again) a library.
The standard one is called [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

Let's install it:

In [None]:
!conda install beautifulsoup4 lxml -y

In [None]:
# Don't ask why it's bs4, just gotta memorize it
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

In [None]:
print(soup.prettify()[:1000])

Time for a crash course in BeautifulSoup (more complete explanation above, in that link).

BeautifulSoup (BS from here on out) takes in some HTML, and converts it into a big Python object.
That object lets you search for specific tags, extract text, as well as a bunch of functionality we probably don't care about.

Let's play around with it:

In [None]:
soup.fi # <- put cursor after the i and hit tab

In [None]:
soup.find('p')

`soup.find` looks for HTML tags, like the `<p>` tag, which is a paragraph tag.
So if we wanted to count all the paragraphs:

In [None]:
len(soup.find_all('p'))

Let's search for **Compensation Discussion and Analysis** again:

In [None]:
soup.find_all(string='Compensation Discussion and Analysis')

Well that's neat! What's around those things?

In [None]:
for x in soup.find_all(string='Compensation Discussion and Analysis'):
    print(x)
    break

# Because of Python magic, we now have an x to work with
x

So we ran a loop, and broke out of it. 
Python leaves all those variables intact, meaning we can now play with x:

In [None]:
x.parent # <-- hit tab

In [None]:
x.parent

What if we wanted to get all parents up until body?

In [None]:
newx = x
while newx.parent.name != "body":
    newx = newx.parent

newx.name

Ooh, a table? 
Let's look at it!

In [None]:
HTML(newx.prettify())

That looks like a table of contents.
Let's look at the second instance we found:

In [None]:
for x in soup.find_all(string='Compensation Discussion and Analysis'):
    newx = x
    while newx.parent.name != "body":
        newx = newx.parent
    
    print(x, ':', newx.name)

In [None]:
HTML(newx.prettify())

Ooh, that looks like a header.
I bet it's the start of the CD&A section.
Let's look at what comes next:

In [None]:
newx.attrs

In [None]:
newx.find_next(attrs=newx.attrs)

Aww bummer. 
We want to know why that didn't work, right?

Let's brute-force it, by going over to the full filing and searching for that style.

  1. Open [this](https://www.sec.gov/Archives/edgar/data/1652044/000130817917000170/lgoog2017_def14a.htm), right click, view source.
  1. Search for "font: 18pt Arial, Helvetica, Sans-Serif; margin: 25pt 0 20pt"
  1. Notice there's only one match. Sad.
  1. Think to yourself "Right above it is font 22pt, right below is font 14pt, this should work"
  1. Notice that the margin: changes in those other ones.
  1. Think to yourself "What if we just try the font part, and omit margins?"

In [None]:
newx.find_next(attrs={'style': 'font: 18pt Arial, Helvetica, Sans-Serif;'})

Still no. 
Maybe the problem is that we're searching for an exact match, and we want a partial match.

REGEX to the rescue!

In [None]:
import re

newx.find_next(attrs={'style': re.compile('font: 18pt Arial, Helvetica, Sans-Serif;')})

Aaaaand [boom goes the dynamite](https://youtu.be/W45DRy7M1no?t=144).
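
The difference between those two attempts can be shown without BeautifulSoup at all. As far as I understand it, a plain string passed for `style` has to equal the whole attribute value, while a compiled regex only has to be found somewhere inside it. Here's a plain-Python sketch of that comparison; `next_style` is a hypothetical style string with a different margin, not one copied from the filing.

```python
import re

# Style string quoted in the steps above (the CD&A heading).
cda_style = 'font: 18pt Arial, Helvetica, Sans-Serif; margin: 25pt 0 20pt'
# Hypothetical style for a later heading: same font, different margin.
next_style = 'font: 18pt Arial, Helvetica, Sans-Serif; margin: 0 0 12pt'

font_only = 'font: 18pt Arial, Helvetica, Sans-Serif;'
# re.escape keeps any regex special characters in the string literal.
font_re = re.compile(re.escape(font_only))


def exact_match(style):
    """Whole-value comparison: the margin part makes this fail."""
    return style == font_only


def partial_match(style):
    """Regex search: true if the font part appears anywhere in the style."""
    return font_re.search(style) is not None


for style in (cda_style, next_style):
    print(exact_match(style), partial_match(style))
```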

Okay, so we have a CD&A tag beginning, and we have the next tag that follows it.
Let's grab everything in between:

In [None]:
begin_tag = newx
end_tag = newx.find_next(attrs={'style': re.compile('font: 18pt Arial, Helvetica, Sans-Serif;')})

In [None]:
gather_the_html = []
for tag in begin_tag.find_next_siblings():
    gather_the_html.append(tag.prettify())
    if tag == end_tag or tag.find(attrs={'style': re.compile('font: 18pt Arial, Helvetica, Sans-Serif;')}):
        break

I often don't know if something works until I run it. 
I wrote the above, then ran it and hoped.
Let's see if it worked:

In [None]:
len(gather_the_html)

In [None]:
gather_the_html[0], gather_the_html[-1]

Okay... maybe? 
Let's display:

In [None]:
# Escape dollar signs so the notebook doesn't treat them as math delimiters
HTML('\n'.join(gather_the_html).replace('$', r'\$'))

We got lucky here, because there's no nested hierarchy (thanks Google!).
That's not always the case. Often companies will wrap each page in a `<div>` tag, so we would have to use `begin_tag.find_all_next()` instead, and then somehow rule out duplicate tags (e.g. because of finding children).

As you might have seen, scraping HTML is a bear of a task, and I do it very iteratively.
But it lets you do things like we just did, saying find a heading, then find the next heading at the same level.
This wouldn't have been possible in plain text.

# Homework

Using the filing from above ([this one](https://www.sec.gov/Archives/edgar/data/1652044/000130817917000170/lgoog2017_def14a.htm)), read in the HTML and answer the following questions:



  1. How many words are in the filing?
  1. How many pages are in the filing?
  1. How many images are included in the filing?
  1. What are the different sections of the proxy statement? (hint: see the table of contents we found above).
  1. What are the top 5 people (by salary) paid at Google?
  1. *WITHOUT CODING*: Describe how you would go about extracting this information programmatically.
     1. What format is this information in?
     1. Do you think it's repeatable for not-Google?
     1. Would you use HTML or plain text to get this information?
     1. Extra credit: Write out some [pseudo-code](https://www.vikingcodeschool.com/software-engineering-basics/what-is-pseudo-coding) for your approach, or just the steps in plain English.