# Parsing SEC Documents - New Filings

## Introduction:
In this notebook, we will go over how to parse SEC filing documents so you can extract different content from them and logically organize the information, which makes additional parsing more natural.

At this point, it's safe to say that there is a remarkable amount of data available to individuals who seek high-quality financial data across a multitude of companies. This data can be used in a range of activities, from competition analysis to sentiment analysis. All it requires is that the individual define the company they want, the filing that belongs to that company, and the file that contains the information.

Right now, we've defined a few ways to search for information depending on the type of search you want to conduct:

1. **Broad Search:**
You don't care what is returned, and you want all the filings for a given period. This search is best achieved by parsing the different index directories, which list all the filings made for a given period. While probably the simplest to wrap your head around, this search returns the largest amount of data, making data management a priority.
2. **Company Specific Search:**
You are looking for a specific company and all their filings. There are currently two ways of doing this: the CIK method or the EDGAR Search Method. If you used the CIK method, you followed a technique similar to the Broad Search method. Instead of looking at a specific period, you provide the CIK number, which takes you to a directory that contains all the filings the company has made. Inside each filing folder, you would find all the files for that particular filing. If you didn't know the CIK number, you either looked it up and then used the CIK method, or you leveraged the EDGAR Search Method. The latter meant requesting a URL that directs you to a table containing all the filings for that company. However, this method required a more complex script that parsed XML strings, paginated if the company had multiple pages of results, and reconstructed links to other directories.
3. **Filing Specific Search:**
Here you want only a specific filing for a particular company. Again this can be achieved either using the CIK Search Method or the EDGAR Search Method. It solely depends on whether you have the information upfront or not. The only additional piece of information you needed was the accession number.
4. **Criteria Specific Search:**
Here you were looking for a filing that has specific criteria, like the form being a 10K, a particular SIC number, or a particular date. This could only be achieved using the EDGAR Search Method, where you built a URL that contains the criteria you were searching on. In this case, your search was relatively simple because you only provided one rule. This search returns either an HTML or XML version of the data, which you would then have to parse.
5. **Multi-Criteria Search:**
This is similar to a criteria-specific search, but instead of having only one criterion, you were searching on multiple criteria. The only difference is that you had to build and request a URL that contained multiple parameters. Once you got the data, it was parsed the same way.
6. **Text Search:**
This type of search was done if you wanted to search on information found in the HEADER file for each filing. This allowed us to do the most complex searches as we could use boolean operators, wildcards, stemming, the order of evaluation, and phrase searching. While a powerful tool that allowed your searches to become very customized, it did require a decent working knowledge of both the HEADER file and search functionality. This was challenging as documentation is limited, requiring a fair amount of experimentation.

Regardless of your particular search method, each can give you access to filings that match the criteria you specified. Once you've landed on the filing you want to scrape, you can begin the process of extracting the information.
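As a concrete illustration of the CIK method, the path to a filing's text file can be assembled from the CIK and the accession number. The following is a minimal sketch, assuming the directory layout seen in the example URLs used later in this notebook (the CIK without leading zeros, then the accession number without dashes, then the dashed accession number plus `.txt`). `build_filing_text_url` is a hypothetical helper name, not part of any library.

```python
def build_filing_text_url(cik, accession_number):
    """Build the URL of a filing's full text file (hypothetical helper)."""
    base_url = "https://www.sec.gov/Archives/edgar/data"
    # Assumption: the directory uses the CIK as an integer (no leading zeros).
    cik_part = str(int(cik))
    # Assumption: the folder name is the accession number with dashes removed.
    folder_part = accession_number.replace("-", "")
    return f"{base_url}/{cik_part}/{folder_part}/{accession_number}.txt"


filing_url = build_filing_text_url("0001166036", "0001104659-04-027382")
print(filing_url)
```

If the layout assumption does not hold for a filing you care about, the safest fallback is to copy the link directly from the filing's directory listing.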

## Libraries
Surprisingly, we don't need that many libraries to scrape the SEC filing. In this particular tutorial, we will use the following libraries:

1. BeautifulSoup - This will be used to parse the actual text file content.
2. Requests - This will be used to request the text file from the URL provided.
3. Unicodedata - The text that is sent back is messy and will need to be cleaned up. We can use unicodedata to do that.
4. Re - This is the RegEx library for Python and will make looking for keywords smooth and quick.

Depending on your specific needs, you may need extra libraries, but at this point, for simplicity, we will keep it to these four.

In [35]:
import re
import requests
import unicodedata
from bs4 import BeautifulSoup

## Define Text Normalization Function
The text is a mess, so we will need to normalize it. However, we can't rely on just the unicodedata library to help us, because some Windows-1252 characters pose a challenge. I found a solution on Stack Overflow that'll help us normalize the remaining portions of the data. All this function does is take a string, find any qualifying matches, decode those matches, and then replace them in the string. In essence, it's just cleaning up the characters that weren't decoded by the unicodedata library.

In [36]:
def restore_windows_1252_characters(restore_string):
    """
        Replace C1 control characters in the Unicode string s by the
        characters at the corresponding code points in Windows-1252,
        where possible.
    """

    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
        
    # the C1 control block runs from U+0080 through U+009F, so scan the whole range.
    return re.sub(r'[\u0080-\u009f]', to_windows_1252, restore_string)
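To see the two cleaning steps working together, here is a minimal sketch. It assumes NFKD is the normalization form you want from unicodedata, and it scans the full C1 range (U+0080 through U+009F). `normalize_text` and the sample string are hypothetical, made up for illustration only.

```python
import re
import unicodedata


def normalize_text(raw_text):
    """Hypothetical helper: apply NFKD normalization, then repair stray C1 characters."""

    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No Windows-1252 character at this code point: drop it.
            return ''

    # NFKD replaces compatibility characters, e.g. a non-breaking space becomes a plain space.
    normalized = unicodedata.normalize('NFKD', raw_text)
    # Assumption: this sketch scans the full C1 range, U+0080 through U+009F.
    return re.sub(r'[\u0080-\u009f]', to_windows_1252, normalized)


# Made-up sample: a non-breaking space plus two C1 characters that stand in for curly quotes.
sample = 'Item\xa01.01 \x93Material Agreement\x94'
cleaned = normalize_text(sample)
print(cleaned)
```

The order matters in this sketch: the C1 repair runs after the Unicode normalization, so the restored curly quotes are left as they are.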

## Grab the Document Content
Let's grab the document first. In my case, I have a URL that'll direct me to a text file found on the SEC website. I take the URL, request the content using a GET request, store the response in a variable called response, and then pass through the content into our BeautifulSoup parser object. Make sure to specify the lxml parser.

## Old Vs. New
This is an excellent time to mention document structure and how it changes depending on the filing's age. In newer filings, the HTML code is very well structured, meaning the tags are correct and allow us to go into extraordinary detail when we parse the info. In older filings, the HTML code is not well structured. To give you an idea of what you'll encounter, entire sections of content will be stored in tags called <PAGE>, and tables won't have normal td and tr elements. These are just a few examples.

Unfortunately, the scraping strategy we leverage will depend on the age of the filing. Additionally, I can't say with confidence at what point in time a filing is considered "old" as I haven't explored enough filings to find a pattern. I'm assuming most of us don't need to go far back in time to collect data, so this series focuses on the "newer" filings. In other words, those filings that have well-defined HTML code inside the text file.

If you're curious as to what an example of an older and newer filing would look like, I've provided a link to two different text files for two separate filings.

#### Newer Filing:

https://www.sec.gov/Archives/edgar/data/1166036/000110465904027382/0001104659-04-027382.txt

#### Older Filing:

https://www.sec.gov/Archives/edgar/data/1750/0000912057-94-002818.txt

In [37]:
# define the url to specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/1166036/000110465904027382/0001104659-04-027382.txt"

# grab the response
response = requests.get(new_html_text)

# pass it through the parser, in this case let's just use lxml because the tags seem to follow xml.
soup = BeautifulSoup(response.content, 'lxml')
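The SEC asks automated clients to declare a User-Agent, and requests without one may be refused. Below is a minimal sketch of attaching one. It uses `urllib` from the standard library rather than `requests` so that it runs without extra installs; the same headers dictionary can be passed as `requests.get(url, headers=headers)`. `build_sec_request` and the contact string are hypothetical placeholders.

```python
import urllib.request


def build_sec_request(url, contact):
    """Hypothetical helper: attach a descriptive User-Agent to a request."""
    # Assumption: `contact` is your own name and email address, used to identify the client.
    headers = {"User-Agent": contact}
    return urllib.request.Request(url, headers=headers)


request = build_sec_request(
    "https://www.sec.gov/Archives/edgar/data/1166036/000110465904027382/0001104659-04-027382.txt",
    "Jane Doe jane.doe@example.com",  # placeholder identity, replace with your own
)
print(request.get_header("User-agent"))

# To actually download the filing (network access required):
# with urllib.request.urlopen(request) as response:
#     raw_bytes = response.read()
```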

## Defining Our Master Dictionary To House Filings
Assuming you want to parse more than one filing, we will need to create a structure that allows for a natural hierarchy. This hierarchy will provide a defined path to each component of our filing, while still being flexible enough to grow as you need to add different pieces.

At the highest level, we have our master_filings_dict, which will contain all the filings we scrape. For this to work, we need a mechanism that provides a unique identifier to serve as our dictionary key. In this case, the accession_number will work perfectly as it's unique for every filing.

From here, we will have an accession dictionary, which will contain two parts:

1. One for the SEC Header content, which is found at the top of every filing.
2. One for the filing documents, which will contain all the documents for a filing.

To be clear, a filing can contain multiple documents. For example, a filing can include a 10-K document and an exhibit document. Keep this in mind, as it will help you understand the upcoming sections.

In [38]:
# define a dictionary that will house all filings.
master_filings_dict = {}

# let's use the accession number as the key. This is unique for every filing.
accession_number = '0001104659-04-027382'

# add a new level to our master_filings_dict, this will also be a dictionary.
master_filings_dict[accession_number] = {}

# this dictionary will contain two keys, the sec header content, and a documents key.
master_filings_dict[accession_number]['sec_header_content'] = {}
master_filings_dict[accession_number]['filing_documents'] = None
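If you plan to store several filings, the same two-part entry can be created by a small function. This is only a sketch: `add_filing` is a hypothetical helper, and the second accession number is taken from the older example filing linked earlier.

```python
def add_filing(master_dict, accession_number):
    """Hypothetical helper: create the two-part entry for one filing."""
    # setdefault leaves an existing entry untouched, so re-adding a filing is harmless.
    return master_dict.setdefault(
        accession_number,
        {'sec_header_content': {}, 'filing_documents': None},
    )


filings = {}
add_filing(filings, '0001104659-04-027382')
add_filing(filings, '0000912057-94-002818')
print(list(filings))
```

Because every filing gets the same keys, code further down the pipeline can rely on `sec_header_content` and `filing_documents` being present.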

## Examining the SEC-Header Tag
Honestly, I would not want to scrape this portion of the filing, because it's a bunch of text that isn't structured well. You can also probably get this information from somewhere else in the filing in a more structured form. However, there may be some information in this part that people want. Here's a simple solution: grab first, and parse later.

In [39]:
# grab the sec-header tag, so we can store it in the master filing dictionary.
sec_header_tag = soup.find('sec-header')

# store the tag in the dictionary just as is.
master_filings_dict[accession_number]['sec_header_content']['sec_header_code'] = sec_header_tag

# display the sec header tag, so you can see how it looks.
display(sec_header_tag)

<sec-header>0001104659-04-027382.hdr.sgml : 20040913
<acceptance-datetime>20040913074905
ACCESSION NUMBER:		0001104659-04-027382
CONFORMED SUBMISSION TYPE:	8-K/A
PUBLIC DOCUMENT COUNT:		7
CONFORMED PERIOD OF REPORT:	20040730
ITEM INFORMATION:		Completion of Acquisition or Disposition of Assets
ITEM INFORMATION:		Financial Statements and Exhibits
FILED AS OF DATE:		20040913
DATE AS OF CHANGE:		20040913

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			MARKWEST ENERGY PARTNERS L P
		CENTRAL INDEX KEY:			0001166036
		STANDARD INDUSTRIAL CLASSIFICATION:	CRUDE PETROLEUM &amp; NATURAL GAS [1311]
		IRS NUMBER:				270005456
		FISCAL YEAR END:			1231

	FILING VALUES:
		FORM TYPE:		8-K/A
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-31239
		FILM NUMBER:		041026639

	BUSINESS ADDRESS:	
		STREET 1:		155 INVERNESS DR WEST
		STREET 2:		STE 200
		CITY:			ENGLEWOOD
		STATE:			CO
		ZIP:			80112
		BUSINESS PHONE:		303-925-9275

	MAIL ADDRESS:	
		STREET 1:		155 INVERNESS DR WEST
		STREET 2:		STE 200
		C

In [40]:
master_filings_dict

{'0001104659-04-027382': {'sec_header_content': {'sec_header_code': <sec-header>0001104659-04-027382.hdr.sgml : 20040913
   <acceptance-datetime>20040913074905
   ACCESSION NUMBER:		0001104659-04-027382
   CONFORMED SUBMISSION TYPE:	8-K/A
   PUBLIC DOCUMENT COUNT:		7
   CONFORMED PERIOD OF REPORT:	20040730
   ITEM INFORMATION:		Completion of Acquisition or Disposition of Assets
   ITEM INFORMATION:		Financial Statements and Exhibits
   FILED AS OF DATE:		20040913
   DATE AS OF CHANGE:		20040913
   
   FILER:
   
   	COMPANY DATA:	
   		COMPANY CONFORMED NAME:			MARKWEST ENERGY PARTNERS L P
   		CENTRAL INDEX KEY:			0001166036
   		STANDARD INDUSTRIAL CLASSIFICATION:	CRUDE PETROLEUM &amp; NATURAL GAS [1311]
   		IRS NUMBER:				270005456
   		FISCAL YEAR END:			1231
   
   	FILING VALUES:
   		FORM TYPE:		8-K/A
   		SEC ACT:		1934 Act
   		SEC FILE NUMBER:	001-31239
   		FILM NUMBER:		041026639
   
   	BUSINESS ADDRESS:	
   		STREET 1:		155 INVERNESS DR WEST
   		STREET 2:		STE 200
   		

## Parsing the documents
Now the fun part, grabbing all the documents. This isn't too bad, just do a find all with the document tag and loop through the results. However, I want to take some time and give you an idea of the natural structure of the text file.

At this point, we know that a filing contains two main elements:

1. A header
2. A document collection.

We've seen above that the header, while challenging to parse, contains additional information such as the time of filing, company info, and other contextual data. Some of this data requires going through multiple levels to reach it. For example, the company zip code requires going into the header, then the filer, then the business address, and finally the zip field.
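That traversal can be sketched in a few lines. This is a rough sketch, assuming the header expresses nesting through leading tab characters and `KEY:` prefixes, as the header output shown earlier suggests. `parse_header_hierarchy` is a hypothetical helper and the sample is a trimmed-down excerpt of that header. Repeated keys (such as multiple ITEM INFORMATION lines) would overwrite each other in this simple version.

```python
def parse_header_hierarchy(header_text):
    """Hypothetical helper: turn tab-indented 'KEY: value' lines into nested dicts."""
    root = {}
    # Stack of (indent_level, dict) pairs; the root sits below every real indent level.
    stack = [(-1, root)]
    for raw_line in header_text.splitlines():
        if ':' not in raw_line:
            continue  # skip blank lines and lines without a key
        indent = len(raw_line) - len(raw_line.lstrip('\t'))
        key, _, value = raw_line.strip().partition(':')
        key, value = key.strip(), value.strip()
        # Leave any section that is at the same depth as this line, or deeper.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        if value:
            parent[key] = value
        else:
            # A key without a value opens a new section, e.g. "FILER:".
            section = {}
            parent[key] = section
            stack.append((indent, section))
    return root


sample_header = (
    "ACCESSION NUMBER:\t\t0001104659-04-027382\n"
    "FILER:\n"
    "\n"
    "\tCOMPANY DATA:\t\n"
    "\t\tCOMPANY CONFORMED NAME:\t\t\tMARKWEST ENERGY PARTNERS L P\n"
    "\tBUSINESS ADDRESS:\t\n"
    "\t\tSTREET 1:\t\t155 INVERNESS DR WEST\n"
    "\t\tZIP:\t\t\t80112\n"
)
parsed = parse_header_hierarchy(sample_header)
print(parsed['FILER']['BUSINESS ADDRESS']['ZIP'])
```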

The reason I mention this is because it helps you understand that some of the info exists deep in the document hierarchy. To get this info means traversing the hierarchy, in some cases, multiple times for each document in a filing. Continuing, a document has another hierarchy:

1. type
2. sequence
3. filename
4. description
5. text

Additionally, the text tag contains all the HTML code for our document, which has its own structure that can change depending on the document you're looking at. The vital point to take away is that a hierarchy is there and can be beneficial for storage purposes and searching.
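To make the document hierarchy concrete, here is a small sketch that pulls the four metadata fields out of the raw text. Note that it swaps in regular expressions on the raw string rather than BeautifulSoup, purely so that it runs with the standard library. It assumes each field tag is unclosed and its value runs to the end of the line. `extract_document_fields` is a hypothetical helper, and the sample filing (including the file name) is made up.

```python
import re

DOCUMENT_PATTERN = re.compile(r'<DOCUMENT>(.*?)</DOCUMENT>', re.DOTALL | re.IGNORECASE)
FIELD_NAMES = ('type', 'sequence', 'filename', 'description')


def extract_document_fields(raw_filing_text):
    """Hypothetical helper: list the metadata fields of each document in a filing."""
    documents = []
    for document_body in DOCUMENT_PATTERN.findall(raw_filing_text):
        fields = {}
        for name in FIELD_NAMES:
            # Assumption: each field tag is unclosed and its value runs to the end of the line.
            match = re.search(rf'<{name}>([^\r\n<]*)', document_body, re.IGNORECASE)
            fields[name] = match.group(1).strip() if match else None
        documents.append(fields)
    return documents


# Made-up sample that mimics the layout of one document inside a filing.
sample_filing = (
    "<DOCUMENT>\n<TYPE>8-K/A\n<SEQUENCE>1\n<FILENAME>example.htm\n"
    "<DESCRIPTION>8-K/A\n<TEXT>\n<html><body>Body</body></html>\n</TEXT>\n</DOCUMENT>\n"
)
print(extract_document_fields(sample_filing))
```

A field that is absent from a document comes back as `None` in this sketch, which is worth handling before using the value as a dictionary key.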

In the code I present below, I try to maintain the hierarchy as much as possible, since there's no sense in wasting something that is already there. However, I do modify it in some instances to meet my requirements. For example, I break each document into pages and store those pages in a document dictionary. I also have search results stored at a page level and, in some cases, at a document level.

It's not to say that one is better than the other, but more to show that you can approach this task in multiple ways. The way you decide to go with will depend on the problem you're trying to solve.
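As one possible way to break a document into pages, the sketch below splits the HTML wherever a full-width horizontal rule appears. It assumes page breaks are `<hr>` tags carrying `width="100%"`, and it uses a regular expression on the raw string instead of BeautifulSoup to stay within the standard library. `split_into_pages`, the 1-based page index, and the sample HTML are all hypothetical.

```python
import re

# Assumption: page breaks are <hr> tags carrying width="100%".
PAGE_BREAK_PATTERN = re.compile(r'<hr\b[^>]*\bwidth="100%"[^>]*>', re.IGNORECASE)


def split_into_pages(document_html):
    """Hypothetical helper: map a 1-based page index to the HTML of that page."""
    pages = PAGE_BREAK_PATTERN.split(document_html)
    return {index: page_html for index, page_html in enumerate(pages, start=1)}


# Made-up sample with two page breaks, which yields three pages.
sample_html = (
    '<p>First page</p><hr size="2" width="100%"/>'
    '<p>Second page</p><hr size="2" width="100%"/>'
    '<p>Third page</p>'
)
pages = split_into_pages(sample_html)
print(len(pages))
```

Documents without any matching rule end up as a single page in this sketch, so nothing is lost when the assumption fails.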

In [41]:
# initialize the dictionary that will house all of our documents
master_document_dict = {}

# find all the documents in the filing.
for filing_document in soup.find_all('document'):
    
    # define the document type, found under the <type> tag, this will serve as our key for the dictionary.
    document_id = filing_document.type.find(text=True, recursive=False).strip()
    
    # here are the other parts if you want them.
    document_sequence = filing_document.sequence.find(text=True, recursive=False).strip()
    document_filename = filing_document.filename.find(text=True, recursive=False).strip()
    document_description = filing_document.description.find(text=True, recursive=False).strip()
    
    # initialize our document dictionary
    master_document_dict[document_id] = {}
    
    # add the different parts, we parsed up above.
    master_document_dict[document_id]['document_sequence'] = document_sequence
    master_document_dict[document_id]['document_filename'] = document_filename
    master_document_dict[document_id]['document_description'] = document_description
    
    # store the document itself, this portion extracts the HTML code. We will have to reparse it later.
    master_document_dict[document_id]['document_code'] = filing_document.extract()
    
    # grab the text portion of the document, this will be used to split the document into pages.
    filing_doc_text = filing_document.find('text').extract()
    
    # find all the thematic breaks, these help define page numbers and page breaks.
    all_thematic_breaks = filing_doc_text.find_all('hr',{'width':'100%'})
    
    '''
        THE FOLLOWING CODE IS OPTIONAL:
        -------------------------------
        
        This portion will demonstrate how to parse the page number from each "page". Now I would only do this if you
        want the ACTUAL page number on the document, if you don't need it then forget about it and just wait till the
        next section.
        
        Additionally, some of the documents appear not to have page numbers when they should so there is no guarantee
        that all the documents will be nice and organized.
    
    '''
    
    
    
    # grab all the page numbers, first one is usually blank
    all_page_numbers = [thematic_break.parent.parent.previous_sibling.previous_sibling.get_text(strip=True) 
                        for thematic_break in all_thematic_breaks]
    
    
    '''
    
        If the above list comprehension doesn't make sense to you, here is how it would look as a regular loop.
    
        # define a list to house all the page numbers
        all_page_numbers = []

        # loop through all the thematic breaks.
        for thematic_break in all_thematic_breaks:

           # this would grab the page number tag.
           page_number = thematic_break.parent.parent.previous_sibling.previous_sibling

           # this would grab the page number text
           page_number = page_number.get_text(strip=True)
           
           # store it in the list.
           all_page_numbers.append(page_number)

    '''
    
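    # count the page numbers we collected; this counter is decremented as we walk the list in reverse.
    length_of_page_numbers = len(all_page_numbers)
    
    # as long as there are numbers to change then proceed.
    if length_of_page_numbers > 0: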
        
        # grab the last number
        previous_number = all_page_numbers[-1]
        
        # initialize a new list
        all_page_numbers_cleaned = []
        
        # loop through the old list in reverse order.
        for number in reversed(all_page_numbers):
            
            # if it's blank proceed to cleaning.
            if number == '':
                
                # the tricky part, there are three scenarios.

                # the previous one we looped was 0 or 1.
                if previous_number == '1' or previous_number == '0':
                    
                    # in this case, it means this is a "new section", so restart at 0.
                    all_page_numbers_cleaned.append(str(0))
                    
                    # reset the page number and the previous number.
                    length_of_page_numbers = length_of_page_numbers - 1
                    previous_number = '0'
                
                # the previous one we looped it wasn't either of those.
                else:
                    
                    # if it was blank, take the current length, subtract 1, and add it to the list.
                    all_page_numbers_cleaned.append(str(length_of_page_numbers - 1))
                    
                    # reset the page number and the previous number.
                    length_of_page_numbers = length_of_page_numbers - 1
                    previous_number = number

            else:
                
                # add the number to the list.
                all_page_numbers_cleaned.append(number)
                
                # reset the page number and the previous number.
                length_of_page_numbers = length_of_page_numbers - 1
                previous_number = number
    else:
        
        # make sure that it has a page number even if there are none, just have it equal 0
        all_page_numbers_cleaned = ['0']
    
    # have the page numbers be the cleaned ones, in reversed order.
    all_page_numbers = list(reversed(all_page_numbers_cleaned))
    
    # store the page_numbers
    master_document_dict[document_id]['page_numbers'] = all_page_numbers
    
    
    '''
        -------------------------------
          THE OPTIONAL CODE HAS ENDED
        -------------------------------
    
        This next portion of code is really what made this all possible. Up above you saw I grabbed all the thematic
        breaks from our document because they serve as natural page breaks. Without those thematic breaks I'm not sure
        if this would be such an easy process. It's not to say we couldn't break it into pages, but I would bet the code
        would be more complex.
    
    '''
    

[<hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align="left" color="gray" noshade="" size="2" width="100%"/>, <hr align