# Introduction

In this notebook we are going to analyze Form 10-K files of S&P 500 companies. "A Form 10-K is an annual report required by the U.S. Securities and Exchange Commission (SEC), that gives a comprehensive summary of a company's financial performance." - <a href="https://en.wikipedia.org/wiki/Form_10-K">Wikipedia</a>.  
The main goal of this project is to find out which topics, besides financial statistics, matter most to companies. The U.S. Securities and Exchange Commission <a href="https://www.sec.gov/">website</a> publishes all kinds of company filings, and we can use it to obtain the Form 10-K reports.  
Because of time and resource constraints, we will analyze only S&P 500 companies, which account for roughly 80% of the American equity market by capitalization.

# Getting Data

The list of S&P 500 companies was retrieved from this <a href="https://en.wikipedia.org/wiki/List_of_S%26P_500_companies">Wikipedia article</a>, and the corresponding Form 10-K files (annual reports) were scraped from the <a href="https://www.sec.gov/edgar.shtml">SEC EDGAR website</a>. For the scraper's source code, please refer to the **sandp500_scraper.py** file. After the scraper runs, there will be a folder named **sandp500/** with a **companies_list.csv** file inside it, whose content is self-explanatory.

# Data

In [1]:
import pandas as pd
data_path = 'data/sandp500/'
df = pd.read_csv(data_path + "companies_list.csv")
df.head()

| Unnamed: 0 | Ticker symbol | Security | SEC filings | GICS Sector | GICS Sub Industry | Location | Date first added | CIK | Founded | File |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MMM | 3M Company | reports | Industrials | Industrial Conglomerates | St. Paul, Minnesota | | 66740 | 1902 | 0000066740.txt |
| 1 | ABT | Abbott Laboratories | reports | Health Care | Health Care Equipment | North Chicago, Illinois | 3/31/64 | 1800 | 1888 | 0000001800.txt |
| 2 | ABBV | AbbVie Inc. | reports | Health Care | Pharmaceuticals | North Chicago, Illinois | 12/31/12 | 1551152 | 2013 (1888) | 0001551152.txt |
| 3 | ABMD | ABIOMED Inc | reports | Health Care | Health Care Equipment | Danvers, Massachusetts | 5/31/18 | 815094 | 1981 | 0000815094.txt |
| 4 | ACN | Accenture plc | reports | Information Technology | IT Consulting & Other Services | Dublin, Ireland | 7/6/11 | 1467373 | 1989 | 0001467373.txt |
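
Judging only from the five rows shown above, the File column looks like the CIK zero-padded to ten digits with a `.txt` extension. The helper below is a hypothetical illustration of that pattern; the name `cik_to_filename` is not taken from the scraper, and **sandp500_scraper.py** may build the names differently.

```python
def cik_to_filename(cik: int) -> str:
    """Zero-pad a CIK to ten digits and append '.txt' (assumed naming pattern)."""
    return f"{int(cik):010d}.txt"


print(cik_to_filename(66740))    # 0000066740.txt
print(cik_to_filename(1551152))  # 0001551152.txt
```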


## Data preprocessing

Each .txt file contains the original Form 10-K submission, including HTML tags. It is worth stripping the tags and saving the plain text back to the same file. Because the files are large, this step takes a while to finish (~1 hour):

In [2]:
# for progress bars; tqdm.pandas() registers progress_apply on pandas objects
from tqdm import tqdm
tqdm.pandas()

# for parsing
from bs4 import BeautifulSoup

for file_name in tqdm(list(df.File.values)):
    try:
        # read the raw submission
        with open(data_path + file_name, "r") as f:
            data = f.read()

        # strip the HTML first, so a parsing error cannot truncate the file
        text = BeautifulSoup(data, 'lxml').get_text()

        # write the cleaned text back
        with open(data_path + file_name, "w") as f:
            f.write(text)

    # some companies do not have an annual report
    except FileNotFoundError as e:
        print(e)
        # drop=True keeps the old index from being added as a column
        df = df[df['File'] != file_name].reset_index(drop=True)
        print("Entry from the dataframe was removed.")

  0%|          | 0/500 [00:00<?, ?it/s]




 36%|███▌      | 179/500 [21:23<38:21,  7.17s/it]

[Errno 2] No such file or directory: 'data/sandp500/0001711269.txt'
Entry from the dataframe was removed.


100%|██████████| 500/500 [1:00:57<00:00,  7.31s/it]
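
To see what `get_text()` does on a small input, here is a minimal sketch on a made-up HTML fragment. It uses Python's built-in `html.parser` instead of lxml so that it runs without the extra dependency; the fragment and the variable names are illustrative only. With no separator, text from adjacent tags is concatenated directly, which can glue words together; passing `separator=" "` is one way to avoid that.

```python
from bs4 import BeautifulSoup

# Made-up fragment, not taken from a real filing.
html = "<html><body><p>Net sales</p><p>increased</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

glued = soup.get_text()                            # default: no separator
spaced = soup.get_text(separator=" ", strip=True)  # one space between text nodes

print(repr(glued))   # 'Net salesincreased'
print(repr(spaced))  # 'Net sales increased'
```

Whether the separator matters for the real reports depends on how much whitespace the original markup already contains between tags.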


For convenient analysis, it is worth lemmatizing the annual reports and storing the resulting tokens in an extra column of the dataframe. Besides lemmatization, we will also remove punctuation, stop words, and numbers. Do not forget to save the dataframe to a file, because this step also takes a while to finish (~2 hours).

In [13]:
from unicodedata import normalize
import re
import spacy

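# the 'en' shortcut and the '-PRON-' lemma used below are spaCy 2.x behavior
nlp = spacy.load('en', disable=['tagger', 'parser', 'ner'])
# raise the default 1,000,000-character limit; the reports are very long
nlp.max_length = 100000000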

def pre_process(text):
    """Lemmatize given text.
    
    Args:
        text: `string` text to lemmatize
    Returns:
        `list` of `string`
    """
    # unicode normalize
    text = normalize('NFKD', text)
    
    # remove numbers
    text = re.sub(r'[0-9]', '', text)
    
    doc = nlp(text)
    
    # lemmatize, remove punctuation, stop words, -PRON-
    return [t.lemma_.lower() for t in doc 
                if (not t.is_punct and 
                    not t.is_stop and 
                    t.lemma_ != '-PRON-')
           ]

def get_tokens(file_name):
    with open(data_path + file_name, "r") as f:
        text = f.read()
    
    # collapse newlines, tabs and repeated spaces into single spaces:
    # this shortens the text for nlp.max_length without gluing together
    # words that were separated only by a line break
    text = re.sub(r'\s+', ' ', text)
    
    return pre_process(text)
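
The two string-level steps in `pre_process` can be tried without loading spaCy. The sketch below runs them on a made-up sentence containing a non-breaking space (`\u00a0`) and an "fi" ligature (`\ufb01`), two characters that NFKD normalization maps to plain ASCII equivalents; the sample text and variable names are illustrative only.

```python
import re
from unicodedata import normalize

# Made-up sample with a non-breaking space and an "fi" ligature.
sample = "Net\u00a0sales in \ufb01scal 2018: 32,765"

normalized = normalize("NFKD", sample)        # compatibility decomposition
no_digits = re.sub(r"[0-9]", "", normalized)  # drop every digit

print(normalized)  # Net sales in fiscal 2018: 32,765
print(no_digits)   # Net sales in fiscal : ,
```

The stray punctuation left behind by the digit removal is what the `is_punct` filter in `pre_process` then discards.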

In [None]:
df['tokens'] = df['File'].progress_apply(get_tokens)

In [None]:
df.to_csv(data_path + "sandp500.csv", encoding='utf-8', index=False)
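
One thing to keep in mind when reloading this file: CSV has no list type, so each `tokens` cell is written as the string form of the Python list. The sketch below shows the round trip on a tiny made-up dataframe, using an in-memory buffer instead of a file; parsing the column with `ast.literal_eval` is one possible way to get the lists back, not necessarily what the rest of this project does.

```python
import ast
import io

import pandas as pd

# Tiny stand-in for the real dataframe.
demo = pd.DataFrame({"File": ["0000066740.txt"],
                     "tokens": [["annual", "report"]]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# A plain read gives the string "['annual', 'report']", not a list.
as_text = pd.read_csv(io.StringIO(csv_text))

# literal_eval parses that string back into a Python list.
as_list = pd.read_csv(io.StringIO(csv_text),
                      converters={"tokens": ast.literal_eval})

print(type(as_text["tokens"][0]).__name__)  # str
print(as_list["tokens"][0])                 # ['annual', 'report']
```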