# Objective
Goal here is to apply NLP to perform analysis on the published financial reports, more specifically on the annual 10-K filings(and later 10-Q i.e quaterly filings) for the listed companies.

## What are 10-K filings?
A 10-K is a comprehensive report filed annually by a publicly-traded company about its financial performance and is required by the U.S. Securities and Exchange Commission (SEC).  
Source: [Investopedia](https://www.investopedia.com/terms/1/10-k.asp) 

## Where can one get 10-K filings in a machine readable format?
10-K filings are available on sec website and can be easily searched via their tool [EDGAR](https://www.sec.gov/edgar/searchedgar/companysearch.html). EDGAR is the Electronic Data Gathering, Analysis, and Retrieval system used at the U.S. Securities and Exchange Commission (SEC). These are can be also accessed via their apis as will be done below.

## Which sections of 10-K filings will be used in analyis?
The details of different sections of 10-K filings can be found [here](https://www.sec.gov/fast-answers/answersreada10khtm.html).
The sections that will be used for analysis are:
- Item 1A - “Risk Factors” 
- Item 3 - “Legal Proceedings”
- Item 7 - “Management’s Discussion and Analysis of Financial Condition and Results of Operations”

## Exactly what NLP analysis will be done here?
- Topic Modelling
- Higlighting Trends
- TBC

## Which libraries are going to be used for NLP analysis
- [Spacy](https://spacy.io/)
- [Gensim](https://radimrehurek.com/gensim/)
- BeautifulSoup and other python libraries

## Inspiration
I learnt about EDGAR during my Udacity course `AI for Trading`. I highly recommend this course for the large practical knowldge the course provided. In the course sentiment analysis was performed on 10-K statements and then further used to create factors into a model.

### Install Packages
[installing-python-packages-from-jupyter](https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/)

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt



In [15]:
# download the spacy model

## enable this for downloading & using smaller spacy model
!{sys.executable} -m spacy download en 
spacy_model = 'en'

## enable this for downloading & using medium spacy model
# !{sys.executable} -m spacy download en_core_web_md
# spacy_model = 'en_core_web_md'

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
[38;5;2m✔ Linking successful[0m
/Users/anirudh/opt/anaconda3/envs/nlp10k/lib/python3.7/site-packages/en_core_web_sm
-->
/Users/anirudh/opt/anaconda3/envs/nlp10k/lib/python3.7/site-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [16]:
import spacy
nlp = spacy.load(spacy_model)

In [54]:
import requests
from ratelimit import limits, sleep_and_retry
# ratelimit prevents large number of requests being sent to the server
# @sleep_and_retry sleeps and then retries when number of requests are many
# @limits defines the limits

class SecAPI(object):
    SEC_CALL_LIMIT = {'calls': 10, 'seconds': 1}

    @staticmethod
    @sleep_and_retry
    # Dividing the call limit by half to avoid coming close to the limit
    @limits(calls=SEC_CALL_LIMIT['calls'] / 2, period=SEC_CALL_LIMIT['seconds'])
    def _call_sec(url):
        return requests.get(url)

    def get(self, url):
        return self._call_sec(url).text

sec_api = SecAPI()

In [55]:
# you need cik_map to download form EDGAR
cik_map = {
    'tech_cik' : {
        'AMZN': '0001018724',
        'AAPL': '0000320193',
        'MSFT': '0000789019',
        'GOOG' : '0001652044'
    },
    'fin_cik' : {
        'UBS' : '0001610520',
        'CS'  : '0001159510',
        'GSBD': '0001572694',
        'JPM' : '0000019617'
    }
}

In [56]:
cik_lookup = cik_map['fin_cik']
print(cik_lookup)

{'UBS': '0001610520', 'CS': '0001159510', 'GSBD': '0001572694', 'JPM': '0000019617'}


In [281]:
from bs4 import BeautifulSoup
from datetime import datetime

def get_sec_data(sec_api, ticker, cik, doc_type='10-K', max_num=3):
    start=0
    count=60
    rss_url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany' \
        '&CIK={}&type={}&start={}&count={}&owner=exclude&output=atom' \
        .format(cik, doc_type, start, count)
    print(f'rss_url: {rss_url}')
    sec_data = sec_api.get(rss_url)
    # print(type(sec_data))
    # print(f'sec_data: {sec_data}')
    feed = BeautifulSoup(sec_data.encode()).feed
    # print('feed: ', type(feed))
    # print(feed)
    entries=[]
    cnt=0
    for entry in feed.find_all('entry', recursive=False):
        if entry.content.find('filing-type').getText() == doc_type and cnt < max_num:
            dt_str = entry.content.find('filing-date').getText()
            entries.append((ticker, 
                            entry.content.find('filing-date').getText()
                            ,entry.content.find('filing-href').getText()
                           ))
            # entries[datetime.strptime(dt_str,'%Y-%m-%d').date()] = entry.content.find('filing-href').getText()
            cnt += 1
    return entries

In [282]:
ticker = 'GOOG'
cik = cik_map['tech_cik'][ticker]
# cik = cik_map['fin_cik']['UBS']

sec_data = get_sec_data(sec_api, ticker, cik,max_num=5)
# print(sec_data)

rss_url: https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001652044&type=10-K&start=0&count=60&owner=exclude&output=atom
[('GOOG', '2020-02-04', 'https://www.sec.gov/Archives/edgar/data/1652044/000165204420000008/0001652044-20-000008-index.htm'), ('GOOG', '2019-02-05', 'https://www.sec.gov/Archives/edgar/data/1652044/000165204419000004/0001652044-19-000004-index.htm'), ('GOOG', '2018-02-06', 'https://www.sec.gov/Archives/edgar/data/1652044/000165204418000007/0001652044-18-000007-index.htm'), ('GOOG', '2017-02-03', 'https://www.sec.gov/Archives/edgar/data/1652044/000165204417000008/0001652044-17-000008-index.htm'), ('GOOG', '2016-02-11', 'https://www.sec.gov/Archives/edgar/data/1652044/000165204416000012/0001652044-16-000012-index.htm')]


In [283]:
df = pd.DataFrame()
df = df.append(sec_data)

In [292]:
df.columns=['ticker','filing_date','href']
df

Unnamed: 0,ticker,filing_date,href
0,GOOG,2020-02-04,https://www.sec.gov/Archives/edgar/data/165204...
1,GOOG,2019-02-05,https://www.sec.gov/Archives/edgar/data/165204...
2,GOOG,2018-02-06,https://www.sec.gov/Archives/edgar/data/165204...
3,GOOG,2017-02-03,https://www.sec.gov/Archives/edgar/data/165204...
4,GOOG,2016-02-11,https://www.sec.gov/Archives/edgar/data/165204...


In [315]:
def download_filings(df_row):
    print(f"Downloading {df_row['ticker']}, for date: {df_row['filing_date']}")
    file_url = df_row['href'].replace('-index.htm', '.txt').replace('.txtl', '.txt')            
    return sec_api.get(file_url)

In [316]:
df['filing'] = df.apply(download_filings,axis=1)

Dowloading GOOG, for date: 2020-02-04
Dowloading GOOG, for date: 2019-02-05
Dowloading GOOG, for date: 2018-02-06
Dowloading GOOG, for date: 2017-02-03
Dowloading GOOG, for date: 2016-02-11


In [320]:
print(df.loc[(df['ticker']=='GOOG') & (df['filing_date'] == '2020-02-04'),'filing'][0][:2000])

<SEC-DOCUMENT>0001652044-20-000008.txt : 20200204
<SEC-HEADER>0001652044-20-000008.hdr.sgml : 20200204
<ACCEPTANCE-DATETIME>20200203210359
ACCESSION NUMBER:		0001652044-20-000008
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		112
CONFORMED PERIOD OF REPORT:	20191231
FILED AS OF DATE:		20200204
DATE AS OF CHANGE:		20200203

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			Alphabet Inc.
		CENTRAL INDEX KEY:			0001652044
		STANDARD INDUSTRIAL CLASSIFICATION:	SERVICES-COMPUTER PROGRAMMING, DATA PROCESSING, ETC. [7370]
		IRS NUMBER:				611767919
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			1231

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-37580
		FILM NUMBER:		20570407

	BUSINESS ADDRESS:	
		STREET 1:		1600 AMPHITHEATRE PARKWAY
		CITY:			MOUNTAIN VIEW
		STATE:			CA
		ZIP:			94043
		BUSINESS PHONE:		650-253-0000

	MAIL ADDRESS:	
		STREET 1:		1600 AMPHITHEATRE PARKWAY
		CITY:			MOUNTAIN VIEW
		STATE:			CA
		ZIP:			94043
</SEC-HEADER>
<DOCU

In [104]:
from tqdm import tqdm
raw_fillings_by_ticker = {}
for dt, index_url in tqdm(sec_data.items(), desc=f'Downloading {ticker} Filings', unit='filling'):
    file_url = index_url.replace('-index.htm', '.txt').replace('.txtl', '.txt')            
    raw_fillings_by_ticker[dt] = sec_api.get(file_url)


Downloading GOOG Filings: 100%|██████████| 3/3 [00:00<00:00,  4.06filling/s]


In [109]:
print(raw_fillings_by_ticker['2020-02-04'][:2000])

<SEC-DOCUMENT>0001652044-20-000008.txt : 20200204
<SEC-HEADER>0001652044-20-000008.hdr.sgml : 20200204
<ACCEPTANCE-DATETIME>20200203210359
ACCESSION NUMBER:		0001652044-20-000008
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		112
CONFORMED PERIOD OF REPORT:	20191231
FILED AS OF DATE:		20200204
DATE AS OF CHANGE:		20200203

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			Alphabet Inc.
		CENTRAL INDEX KEY:			0001652044
		STANDARD INDUSTRIAL CLASSIFICATION:	SERVICES-COMPUTER PROGRAMMING, DATA PROCESSING, ETC. [7370]
		IRS NUMBER:				611767919
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			1231

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-37580
		FILM NUMBER:		20570407

	BUSINESS ADDRESS:	
		STREET 1:		1600 AMPHITHEATRE PARKWAY
		CITY:			MOUNTAIN VIEW
		STATE:			CA
		ZIP:			94043
		BUSINESS PHONE:		650-253-0000

	MAIL ADDRESS:	
		STREET 1:		1600 AMPHITHEATRE PARKWAY
		CITY:			MOUNTAIN VIEW
		STATE:			CA
		ZIP:			94043
</SEC-HEADER>
<DOCU

In [108]:
with open("2020-02-04.html",'w') as fh:
    fh.write(raw_fillings_by_ticker['2020-02-04'])

### Extract the 10-K filing from the 10-K form

This information is extracted by looking for pattern  
`
<DOCUMENT>
<TYPE>10-K
...
...
</DOCUMENT>
`


In [237]:
test_str = """<DOCUMENT>
<TYPE>10-K
<p>line 1</p>
<br>line 2
</DOCUMENT>
<DOCUMENT>
line 3
line 4
</DOCUMENT>
"""

In [238]:
import re
def get_10K_filing(text):
    """
    Extract the documents from the text

    Parameters
    ----------
    text : str
        The text with the document strings inside

    Returns
    -------
    extracted_docs : list of str
        The document strings found in `text`
    """

    regex_start = re.compile(r'<DOCUMENT>')
    matches = regex_start.finditer(text)
    start_pos =[match.span()[1] for match in matches]

    regex_end = re.compile(r'</DOCUMENT>')
    matches_end = regex_end.finditer(text)
    end_pos = [match.span()[0] for match in matches_end]
    
    #print('start: ', start_pos, 'end: ', end_pos)
    extracted_docs = [text[start:end] for start, end in zip(start_pos,end_pos)]
    print(extracted_docs)
    for doc in extracted_docs:
        regex = re.compile(r'<TYPE>10-K\s*\n(.*)', re.DOTALL)
        matches = regex.finditer(doc)
        for match in matches:
            print('match: ', match)
            
    return match.group(1)

print(get_10K_filing(test_str))

['\n<TYPE>10-K\n<p>line 1</p>\n<br>line 2\n', '\nline 3\nline 4\n']
match:  <re.Match object; span=(1, 37), match='<TYPE>10-K\n<p>line 1</p>\n<br>line 2\n'>
<p>line 1</p>
<br>line 2



In [213]:
len(get_10K_filing(raw_fillings_by_ticker['2020-02-04']))

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



3446213

In [160]:
with open("10k-2020-02-04.html",'w') as fh:
    fh.write(get_10K_filing(raw_fillings_by_ticker['2020-02-04']))


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [161]:
bs4_obj = BeautifulSoup(raw_fillings_by_ticker['2020-02-04'])
bs4_obj.title

<title>Document</title>

In [200]:
for i, docs in enumerate(bs4_obj.find_all('document')):
    for result in docs.find_all('type'):
        print(i, len(result))

0 2
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2


In [210]:
for i, docs in enumerate(bs4_obj.find_all('type')):
        print(i, len(docs.get_text()), docs.p, ">>>", docs.get_text()[:10])

0 354067 None >>> 10-K
1
goo
1 32786 None >>> EX-4.14
2

2 19335 None >>> EX-10.08.1
3 31587 None >>> EX-10.08.2
4 520 None >>> EX-21.01
5
5 1686 None >>> EX-23.01
6
6 3475 None >>> EX-31.01
7
7 3491 None >>> EX-31.02
8
8 1971 None >>> EX-32.01
9
9 13555 None >>> EX-101.SCH
10 871 None >>> EX-101.CAL
11 1679 None >>> EX-101.DEF
12 84691 None >>> EX-101.LAB
13 2478 None >>> EX-101.PRE
14 10336390 None >>> GRAPHIC
15


In [212]:
for i, docs in enumerate(bs4_obj.find_all('type')):
    if docs.find(text=re.compile('10-K', flags=re.DOTALL)):
        print(i, "found 10-k")
    else:
        print(i, "not found")

0 found 10-k
1 found 10-k
2 not found
3 not found
4 not found
5 found 10-k
6 found 10-k
7 found 10-k
8 found 10-k
9 not found
10 not found
11 not found
12 not found
13 not found
14 found 10-k


In [244]:
def get_10K_filing2(text):
    """
    Extract the documents from the text

    Parameters
    ----------
    text : str
        The text with the document strings inside

    Returns
    -------
    extracted_docs : list of str
        The document strings found in `text`
    """
    extracted_docs = BeautifulSoup(text).find_all('document')
    plain_text=""
    for i, doc in enumerate(extracted_docs):
        
        #print('processing ', i, ">>\n",str(doc))
        regex = re.compile(r'.*<type>10-K\s*\n(.*)</type>', re.DOTALL)
        matches = regex.finditer(str(doc))
        for match in matches:
            # print('match: ', match)
            plain_text = BeautifulSoup(match.group(1), 'html.parser').get_text() # remove all the html
            return plain_text
    return plain_text
print('return val: \n', get_10K_filing2(test_str))

return val: 
 line 1
line 2



In [245]:
with open("10k-cleaned-2020-02-04.html",'w') as fh:
    fh.write(get_10K_filing2(raw_fillings_by_ticker['2020-02-04']))


In [234]:
len(get_10K_filing2(raw_fillings_by_ticker['2020-02-04']))

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



match:  <re.Match object; span=(0, 3417472), match='<document>\n<type>10-K\n<sequence>1\n<filename>go>


3417443

In [235]:
len(get_10K_filing(raw_fillings_by_ticker['2020-02-04']))

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



3446213

In [236]:
3446213-3417443

28770