# Text Mining for Stock Prediction

## Data extraction:

With Seppe's code we can download any type of filing from the SEC. He included a function to get a summary for 424B5 filings, but for 8-Ks and 10-Ks we have to define our own function to parse the main contents.

We should join the extracted information into one dataframe that contains:
* Publication date
* Main extracts from the publication
* Closing stock price before the publication
* Closing (or opening?) stock price after the publication was filed
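
A minimal sketch of how the joined dataframe could be built, assuming pandas and two hypothetical inputs: a `filings` frame (one row per publication) and a `prices` frame of daily closes. All column names, tickers, and prices are made up for illustration, and using the close strictly after the publication date is only one possible answer to the closing-or-opening question in the list above.

```python
import pandas as pd

# Hypothetical filings: one row per publication (names and values are made up).
filings = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "date": pd.to_datetime(["2020-11-20", "2020-11-23"]),
    "extract": ["risk factor text ...", "offering summary ..."],
})

# Hypothetical daily closing prices for the same tickers.
prices = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "date": pd.to_datetime(["2020-11-19", "2020-11-20", "2020-11-23",
                            "2020-11-20", "2020-11-23", "2020-11-24"]),
    "close": [10.0, 11.0, 12.0, 50.0, 49.0, 48.0],
})

# merge_asof needs both frames sorted by the key it matches on.
filings = filings.sort_values("date")
prices = prices.sort_values("date")

# Last close strictly before the publication date, matched per ticker.
before = pd.merge_asof(
    filings, prices.rename(columns={"close": "close_before"}),
    on="date", by="ticker", direction="backward", allow_exact_matches=False)

# First close strictly after the publication date, matched per ticker.
joined = pd.merge_asof(
    before, prices.rename(columns={"close": "close_after"}),
    on="date", by="ticker", direction="forward", allow_exact_matches=False)

print(joined[["ticker", "date", "close_before", "close_after"]])
```

`merge_asof` is used here because publication dates can fall on non-trading days, where an exact date join would find no price.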

### To do: 
* Define how to parse the main contents of 8-Ks and 10-Ks
* Define a function that applies the parsing functions to the respective form types
* Scrape financial data (Melek & Anna)
* Join scraped info for fine-tuning


## BERT 
Original repository: https://github.com/google-research/bert

### Pre-training
* We have to determine the corpus on which we are going to pre-train BERT
* We need separate sentences for the NSP task, and we need to tokenize the sentences for the MLM task (see https://github.com/google-research/bert/blob/master/create_pretraining_data.py)
* In the text file we feed into BERT, there should be one sentence per line, with empty lines separating different documents.
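
The input format described above can be sketched as follows. The helper name `to_pretraining_text` and the regex-based sentence splitter are assumptions for illustration; a real corpus of filings would likely need a proper sentence tokenizer, since this naive split breaks on abbreviations such as "Inc.".

```python
import re

def to_pretraining_text(documents):
    """Format documents as one sentence per line, blank line between documents."""
    blocks = []
    for doc in documents:
        # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
        if sentences:
            blocks.append("\n".join(sentences))
    # A blank line separates documents; the file ends with a single newline.
    return "\n\n".join(blocks) + "\n"

# Made-up example documents.
docs = [
    "We may need additional capital. Our stock price is volatile.",
    "The offering closes on Friday. Proceeds will repay debt.",
]
text = to_pretraining_text(docs)
print(text)
```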

### Fine-tuning
To fine-tune BERT for our classification task, we are going to set up an additional layer on top of the model. It takes in the text of a publication and predicts whether the stock price of the company that filed it is going to increase, decrease, or remain stable. Which company filed which publication is not really that important for the prediction itself.

#### Requirements:
* Publication date
* Sentences regarding the risk factors
* Closing stock price before publication date
* Opening stock price after publication date
* Label (e.g. increase, decrease, stable)
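
A minimal sketch of how the label could be derived from the two prices. The function name `label_price_move` and the 1% threshold for "stable" are assumptions, not values we have agreed on.

```python
def label_price_move(price_before, price_after, threshold=0.01):
    """Return 'increase', 'decrease' or 'stable' for a relative price change."""
    # threshold is a hypothetical cut-off: moves within +/-1% count as stable.
    change = (price_after - price_before) / price_before
    if change > threshold:
        return "increase"
    if change < -threshold:
        return "decrease"
    return "stable"

print(label_price_move(10.0, 10.5))   # relative change of +5%
print(label_price_move(10.0, 9.0))    # relative change of -10%
print(label_price_move(10.0, 10.05))  # relative change of +0.5%
```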

## Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString
from htmllaundry import sanitize
from htmllaundry.cleaners import LaundryCleaner
import htmllaundry.utils
import xmltodict
import re
import json
from pprint import pprint
import pandas as pd
from glob import glob
from cachecontrol import CacheControl
from IPython.display import HTML
import unicodedata

In [2]:
sess = requests.session()
cach = CacheControl(sess)

In [43]:
from IPython.display import HTML

def window(html):
    s = '<script type="text/javascript">'
    s += 'var win = window.open("", "", "toolbar=no, location=no, directories=no, status=no, menubar=no, scrollbars=yes, resizable=yes, width=780, height=200, top="+(screen.height-400)+", left="+(screen.width-840));'
    s += 'win.document.body.innerHTML = \'' + html.replace("\n",'\\n').replace("'", "\\'") + '\';'
    s += '</script>'
    return HTML(s)

## Cleaning SEC encoding

In [3]:
CustomCleaner = LaundryCleaner(
            page_structure=False,
            remove_unknown_tags=False,
            allow_tags=['blockquote', 'a', 'i', 'em', 'p', 'b', 'strong',
                        'h1', 'h2', 'h3', 'h4', 'h5', 
                        'ul', 'ol', 'li', 
                        'sub', 'sup',
                        'abbr', 'acronym', 'dl', 'dt', 'dd', 'cite',
                        'dft', 'br', 
                        'table', 'tr', 'td', 'th', 'thead', 'tbody', 'tfoot'],
            safe_attrs_only=True,
            add_nofollow=True,
            scripts=True,
            javascript=True,
            comments=True,
            style=True,
            links=False,
            meta=True,
            processing_instructions=False,
            frames=True,
            annoying_tags=False)

In [4]:
## SEC filings are encoded in CP1252; it is recommended to always use UTF-8.
##see: https://www.w3.org/International/questions/qa-what-is-encoding
###### https://www.w3.org/International/articles/definitions-characters/#unicode
###### https://www.w3.org/International/questions/qa-choosing-encodings

def reformat_cp1252(match):
    codePoint = int(match.group(1))
    if 128 <= codePoint <= 159:
        return bytes([codePoint])
    else:
        return match.group()

def clean_sec_content(binary):
    # raw bytes pattern avoids the invalid-escape warning; decoding CP1252 already yields a str
    return re.sub(rb'&#(\d+);', reformat_cp1252, binary).decode('windows-1252')
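
To illustrate what the cleaning step does, here is a self-contained sketch on a made-up byte string. It repeats the two functions so it can run on its own. Numeric references in the range 128 to 159 are turned into the raw CP1252 byte before decoding; other references are left untouched.

```python
import re

def reformat_cp1252(match):
    code_point = int(match.group(1))
    # 128-159 is where CP1252 keeps its typographic characters.
    if 128 <= code_point <= 159:
        return bytes([code_point])
    return match.group()

def clean_sec_content(binary):
    return re.sub(rb'&#(\d+);', reformat_cp1252, binary).decode('windows-1252')

# Made-up sample: &#146; is the CP1252 right single quote, &#38; is outside the range.
raw = b'the Company&#146;s shares &#38; warrants'
cleaned = clean_sec_content(raw)
print(cleaned)
```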

In [5]:
## this is to normalize urls, making them more human friendly
def slugify(value):
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s.\-]', '-', value).strip().lower()
    value = re.sub(r'[-\s]+', '-', value)
    return value
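
A quick, self-contained check of what `slugify` produces. The document path is an assumption, inferred from the file name that is read back later in this notebook.

```python
import re
import unicodedata

def slugify(value):
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    # Replace everything except word characters, whitespace, dots and hyphens.
    value = re.sub(r'[^\w\s.\-]', '-', value).strip().lower()
    # Collapse runs of hyphens and whitespace into a single hyphen.
    return re.sub(r'[-\s]+', '-', value)

# Assumed document path (inferred, not scraped).
name = slugify('/Archives/edgar/data/1174940/000149315220022121/form424b5.htm')
print(name)
```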

## Cleaning html

In [6]:
def read_html(file):
    # the downloaded files hold UTF-8 text, so read them back as UTF-8
    with open(file, 'r', encoding='utf-8') as f:
        return f.read()

In [7]:
def clean_html(html):
    soup = BeautifulSoup(html)
    if not soup.find('p'):
        for div in soup.find_all('div'):
            div.name = 'p'
    for b in soup.find_all('b'):
        b.name = 'strong'
    for f in soup.find_all('font', style=re.compile(r'font-weight:\s*bold')):
        f.name = 'strong'
    for footer in soup.find_all(class_=['header', 'footer']): 
        try: footer.decompose()
        except: pass
    san = sanitize(str(soup), CustomCleaner)
    soup = BeautifulSoup(san)
    def decompose_parent(el, parent='p', not_grandparent='table'):
        # remove the enclosing `parent` tag, unless it sits inside a `not_grandparent` tag
        try:
            parent = el.find_parent(parent)
        except Exception:
            parent = None
        if not parent: return
        grandparent = parent.find_parent(not_grandparent)
        if grandparent: return
        parent.decompose()
    for el in soup.find_all(text=lambda x: 'table of contents' == str(x).lower().strip()):
        decompose_parent(el, 'a')
    for el in soup.find_all(text=re.compile(r'^\s*S\-(\d+|[ivxlcdm]+)\s*$')): 
        decompose_parent(el, 'p')
    for el in soup.find_all(text=re.compile(r'^\s*\d+\s*$')): 
        decompose_parent(el, 'p')
    return soup

## Defining helper functions

In [8]:
def pagination_provider_by_element_start_count(find_args, find_kwargs):
    def pagination_provider_by_element_start_count_wrapped(soup, params):
        if soup.find(*find_args, **find_kwargs) is None: 
            return None
        params['start'] += params['count'] 
        return params
    return pagination_provider_by_element_start_count_wrapped

In [9]:
def params_provider_by_dict(params):
    return lambda : params

In [10]:
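## looks for the table with the given summary attribute, adds a newline after each <br>
## so that lines within a cell stay separated, and parses the table into a dataframe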
def table_provider_by_summary(summary, header=0, index_col=0):
    return lambda soup: pd.read_html(
        str(soup.find('table', summary=summary)).replace('<br>', '<br>\n'), header=header, index_col=index_col)[0]

#### Breaking down table_provider_by_summary

In [129]:
url = base_url.format('/Archives/edgar/data/1035443/0001047469-19-001263-index.html')
url

In [128]:
soup = BeautifulSoup(cach.get(url).text)
soup

In [130]:
var= soup.find('table')
var

In [131]:
var = str(soup.find('table'))
var

In [132]:
# <br> tags denote line breaks; here we add a real newline after each one so the text stays separated
var = str(soup.find('table')).replace('<br>', '<br>\n')
var

In [133]:
var = pd.read_html(str(soup.find('table')).replace('<br>', '<br>\n'), header=0, index_col =0)
var

In [134]:
var = pd.read_html(str(soup.find('table')).replace('<br>', '<br>\n'), header=0, index_col =0)[0]
var

In [11]:
def get_sec_table(url,
                  table_provider=None,
                  base_params={}, 
                  params_provider=None,
                  pagination_provider=None,
                  replace_links=True,
                  session=None):
    ### this function returns a tuple of a df with the respective soup element
    def return_data_frame(session, url, params, provider):
        request = session.get(url, params=params)
        soup = BeautifulSoup(request.text)
        if replace_links:
            for a in soup.find_all('a'):
                parent = a.find_parent('td')
                if parent: parent.string = a['href']
        df = provider(soup)
        return df, soup
    ### if no session is given, fall back to the cached session;
    ### relative urls are prefixed with base_url
    if session is None:
        session = cach
    if not url.startswith('http://') and not url.startswith('https://'):
        url = base_url.format(url)
    ### start from base_params, then add params_provider,
    ### which is either a dict or a callable returning a dict
    params = dict(base_params)
    if params_provider:
        if isinstance(params_provider, dict):
            params.update(params_provider)
        else:
            params.update(params_provider())
    ### in case you only scrape one page, it will just return the df of the respective page###
    if not pagination_provider:
        df, soup = return_data_frame(session, url, params, table_provider)
        return df
    ### in case you want to scrape multiple pages, create an empty list of dfs and add each df from each 
    ### page to the empty list, at the end you just concatenate all of the dfs
    else:
        data_frames = []
        page_params = dict(params)
        while True:
            df, soup = return_data_frame(session, url, page_params, table_provider)
            data_frames.append(df)
            # Make sure columns retain their names
            data_frames[-1].columns = data_frames[0].columns
            new_params = pagination_provider(soup, page_params)
            if not new_params:
                break
            else:
                page_params.update(new_params)
        return pd.concat(data_frames, sort=False, ignore_index=True)

## Function to get the documents

This is a function to get the documents in the filing details page for each filing. See below for an example page.


In [12]:
def get_filing_documents(url, summary='Document Format Files'):
    return get_sec_table(
        url,
        table_provider=table_provider_by_summary(summary, index_col=None),
        pagination_provider=pagination_provider_by_element_start_count(
            ('input',), {'value': 'Next 100'}))

## Scraping most recent filings

For the previous 5 days

In [13]:
base_url = 'https://www.sec.gov{}'

In [14]:
def get_current_events(days_before=0, form_type=''):
    soup = BeautifulSoup(cach.get(base_url.format('/cgi-bin/current'), 
                            params={'q1': days_before, 'q2': 0, 'q3': form_type}).text)
    pre = soup.find('pre')
    ls = []
    for line in str(pre).replace('<hr>', '\n').replace('<hr/>', '\n').split('\n'):
        bs_line = BeautifulSoup(line)
        clean_line = '  '.join(item.strip() for item in bs_line.find_all(text=True))
        split_line = [ x.strip() for x in clean_line.split('  ') if x.strip() ]
        split_line += [ a.get('href') for a in bs_line.find_all('a') ]
        if not all(x is None for x in split_line): ls.append(split_line)
    colnames = ls[0] + [ 'link_{}'.format(i) for i in range(max(len(l) for l in ls) - len(ls[0])) ]
    return pd.DataFrame(ls[1:], columns=colnames)

In [17]:
get_current_events(form_type='8-K').head()

| | Date Filed | Form | CIK Code | Company Name | link_0 | link_1 |
|---|---|---|---|---|---|---|
| 0 | 11-20-2020 | 8-K | 1800 | ABBOTT LABORATORIES | /Archives/edgar/data/1800/0001104659-20-127945... | browse-edgar?action=getcompany&CIK=1800 |
| 1 | 11-20-2020 | 8-K | 1820191 | AEA-Bridges Impact Corp. | /Archives/edgar/data/1820191/0001193125-20-299... | browse-edgar?action=getcompany&CIK=1820191 |
| 2 | 11-20-2020 | 8-K | 868857 | AECOM | /Archives/edgar/data/868857/0001104659-20-1279... | browse-edgar?action=getcompany&CIK=868857 |
| 3 | 11-20-2020 | 8-K | 946644 | AIM ImmunoTech Inc. | /Archives/edgar/data/946644/0001493152-20-0221... | browse-edgar?action=getcompany&CIK=946644 |
| 4 | 11-20-2020 | 8-K | 926660 | AIMCO PROPERTIES L.P. | /Archives/edgar/data/926660/0001193125-20-2987... | browse-edgar?action=getcompany&CIK=926660 |


#### Breaking down get_current_events

Questions: 
* What exactly is the goal of splitting every word in the list of rows? 
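
A possible answer, sketched on a single row whose values are taken from the output above (the exact spacing is an assumption): the text nodes are joined with two spaces, so splitting on two spaces again separates the columns while keeping the single spaces inside a company name. Using `strip` in place of `split` returns a string, and iterating over a string yields single characters.

```python
# One row in the shape produced by '  '.join(...) over the text nodes.
clean_line = '11-20-2020  8-K  1800  ABBOTT LABORATORIES'

# Splitting on the two-space separator yields one item per column;
# the single space inside the company name is preserved.
columns = [x.strip() for x in clean_line.split('  ') if x.strip()]
print(columns)

# strip() only trims the ends and returns a string, so iterating over it
# yields single characters instead of columns.
characters = [x.strip() for x in clean_line.strip('  ') if x.strip()]
print(characters[:5])
```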


In [121]:
# this code goes to the current events page and scrapes the html of the list of documents
soup = BeautifulSoup(cach.get(base_url.format('/cgi-bin/current'), params = {'q1': 0, 'q2':0, 'q3': '8-K'}).text)
soup

The HTML is messy: company names and dates are mixed in the same text, so we have to split the content into separate lines.

We are only interested in the table, which is why we use soup.find('pre'); look at the HTML source of the current-events page to see this.

In [122]:
pre = soup.find('pre')
lines = str(pre).replace('<hr>', '\n').replace('<hr/>', '\n').split('\n')
lines

In [123]:
# here instead of having each row of the table between single HTML tags, we make each row separate
for line in lines:
    bs_line = BeautifulSoup(line)
    print(bs_line)

In [124]:
for line in lines:
    bs_line = BeautifulSoup(line)
    clean_line = '  '.join(item.strip() for item in bs_line.find_all(text=True))
    print(clean_line)

In [125]:
# split on the two-space separator so each line becomes a list of column values
# (using strip() here would iterate over the string and separate every letter)
for line in lines:
    bs_line = BeautifulSoup(line)
    clean_line = '  '.join(item.strip() for item in bs_line.find_all(text=True))
    split_line = [x.strip() for x in clean_line.split('  ') if x.strip()]
    print(split_line)

In [126]:
## we then add the links for the forms into each list if they have a link in bs_line
for line in lines:
    bs_line = BeautifulSoup(line)
    clean_line = '  '.join(item.strip() for item in bs_line.find_all(text=True))
    split_line = [x.strip() for x in clean_line.split('  ') if x.strip()]
    split_line += [a.get('href') for a in bs_line.find_all('a')]
    print(split_line)

In [127]:
ls=[]
for line in lines:
    bs_line = BeautifulSoup(line)
    clean_line = '  '.join(item.strip() for item in bs_line.find_all(text=True))
    split_line = [x.strip() for x in clean_line.split('  ') if x.strip()]
    split_line += [a.get('href') for a in bs_line.find_all('a')]
    if not all(x is None for x in split_line): ls.append(split_line)
ls

## Downloading SEC documents

Questions: 
* What is the purpose of defining a directory? It does not seem to work when I use it as a parameter for download_sec_documents

* What does the error "index 0 is out of bounds for axis 0 with size 0" mean? I still manage to download the files.

In [15]:
def download_sec_documents(doc_link):
    contents = clean_sec_content(cach.get(base_url.format(doc_link)).content)
    name = slugify(doc_link)
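    # write as UTF-8 explicitly so the result does not depend on the platform default
    with open(name, 'w', encoding='utf-8') as f:
        f.write(contents)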

#### Downloading 424B5s

In [29]:
forms = get_current_events(0, '424B5')
for link in forms['link_0']:
    docs = get_filing_documents(base_url.format(link))
    doc_link = docs.loc[docs.Type == '424B5', 'Document'].values[0]
    download_sec_documents(doc_link)

#### Downloading 8-Ks

In [34]:
num_days = 2

for p in range(0, num_days):
    print('Scraping day-page:', p)
    forms = get_current_events(p, '8-K')
    for link in forms['link_0']:
        docs = get_filing_documents(base_url.format(link))
        doc_link = docs.loc[docs.Type == '8-K', 'Document'].values[0]
        download_sec_documents(doc_link)

Scraping day-page: 0


IndexError: index 0 is out of bounds for axis 0 with size 0

Downloading the 8-Ks always raises the index error above, whereas downloading 424B5 filings does not. The error occurs while reading the document table on the filing details page:

8-K example: https://www.sec.gov/Archives/edgar/data/926660/0001193125-20-298746-index.html

424B5 example: https://www.sec.gov/Archives/edgar/data/1035443/0001047469-19-001263-index.html
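
The `IndexError` is raised by `.values[0]` whenever the filter `docs.Type == '8-K'` matches no row, which leaves an empty array. One possible cause, stated here as an assumption and not verified against the pages above, is a filing whose document table lists a different type string such as `8-K/A`. A defensive sketch, using a made-up `docs` frame and a hypothetical helper `first_document_link`:

```python
import pandas as pd

def first_document_link(docs, form_type):
    """Return the first 'Document' link whose Type matches form_type, else None."""
    # Hypothetical helper: compare trimmed strings and tolerate missing matches.
    matches = docs.loc[docs['Type'].astype(str).str.strip() == form_type, 'Document']
    if matches.empty:
        return None
    return matches.iloc[0]

# Made-up document table in the shape returned by get_filing_documents.
docs = pd.DataFrame({
    'Type': ['8-K/A', 'EX-99.1'],
    'Document': ['/Archives/edgar/data/0/example8ka.htm',
                 '/Archives/edgar/data/0/exampleex991.htm'],
})

print(first_document_link(docs, '8-K'))      # no exact match, so None
print(first_document_link(docs, 'EX-99.1'))
```

In the download loop, a `None` result could be skipped with `continue`, so that one unmatched filing does not stop the whole run. This would also be consistent with some files being downloaded before the error appears.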


## Summary extraction

There is a difference between 424B5 and 8-K forms. The 424B5 forms have summary tables, which is what the extract_dual_tables function extracts. 8-K forms are entirely text, so we need to find a way to select and extract the relevant information from them.

In [41]:
def extract_dual_tables(soup):
    dualrows = []
    for tr in soup.select("table tr"):
        row = [td.text.strip() for td in tr.find_all('td')]
        if len(row) != 2:
            continue
        if row[1].strip() == '':
            continue
        if all([row[x] == '' for x in range(0, len(row)-1)]):
            if len(dualrows) > 0 and len(row) == len(dualrows[-1]):
                dualrows[-1][-1] += ' ' + row[-1]
        else:
            dualrows.append(row)
    return dualrows

In [23]:
def match_by_name_and_regex(name, regex, lowercase=True):
    return lambda el: el.name == name and re.search(regex, el.text.lower() if lowercase else el.text) is not None

In [24]:
def get_offering_header_candidates(soup):
    return soup.find_all(match_by_name_and_regex('p', r'\s*offering\s*$'))

def get_after_offering_header_tables(header):
    tables = ''
    nextSibling = header.nextSibling
    table_seen = False
    while True:
        if nextSibling is None:
            break
        if isinstance(nextSibling, NavigableString):
            if table_seen and str(nextSibling).strip() != '': break
            nextSibling = nextSibling.nextSibling
            continue
        if nextSibling.name != 'table':
            if table_seen and nextSibling.get_text(strip=True) != '': break
            nextSibling = nextSibling.nextSibling
            continue
        table_seen = True
        tables += str(nextSibling)
        if not nextSibling.nextSibling:
            print(nextSibling)
            print(nextSibling.nextSibling)
        nextSibling = nextSibling.nextSibling
    return tables

def extract_offering(soup):
    for header in get_offering_header_candidates(soup):
        tables = get_after_offering_header_tables(header)
        if tables:
            return extract_dual_tables(BeautifulSoup(tables))

The methods above concatenate the summary tables in a 424B5 prospectus supplement. I therefore plan to define a function that applies extract_offering when the document is a 424B5. If the document is an 8-K, it has to be processed differently (to be defined).
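
One way such a dispatching function could look is sketched below. The names `PARSERS`, `parse_filing`, `parse_424b5`, and `parse_8k` are hypothetical, and both parsers are placeholders: the first stands in for `clean_html` plus `extract_offering`, the second for the 8-K logic that is still to be defined.

```python
def parse_424b5(html):
    # Placeholder standing in for clean_html + extract_offering.
    return {'form': '424B5', 'length': len(html)}

def parse_8k(html):
    # Placeholder: the 8-K parsing logic is still to be defined.
    raise NotImplementedError('8-K parsing is not defined yet')

# Hypothetical registry mapping a form type to its parsing function.
PARSERS = {'424B5': parse_424b5, '8-K': parse_8k}

def parse_filing(form_type, html):
    """Apply the parser registered for form_type, or return None if there is none."""
    parser = PARSERS.get(form_type.upper())
    if parser is None:
        return None
    return parser(html)

print(parse_filing('424b5', '<p>Offering</p>'))
print(parse_filing('10-K', '<p>Annual report</p>'))
```

Keeping the parsers in a dictionary means a 10-K parser can be added later without touching `parse_filing`.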

In addition, we need to pair each publication with its company, both to feed it into BERT and to associate it with a stock price. We also need to pull stock prices before and after each publication, and possibly join all of that information in one dataframe.

In [118]:
content = read_html('-archives-edgar-data-1174940-000149315220022121-form424b5.htm')
cleaned_content = clean_html(content)

In [119]:
window(str(cleaned_content))

In [116]:
names = ['-archives-edgar-data-1174940-000149315220022121-form424b5.htm', '-archives-edgar-data-1636651-000119312520299425-d947308d424b5.htm']
        
def process_424B5(names):
    summary = []  # one extracted offering per 424B5 file
    for name in names:
        if re.search(r'.+424b5.+', name):
            contents = read_html(name)
            cleaned_x = clean_html(contents)
            offering = extract_offering(cleaned_x)
            summary.append(offering)
        ## else: here we should call the function that parses the 8-K filings
    return summary

Let's define a function that reads each summary, extracts the ticker, and creates a dictionary with the tickers as keys and the summary of each filing as values.
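
A rough sketch of one way to do this, under the assumption that the ticker is looked up through the CIK embedded in the slugified file name instead of being read from the summary text. The helper `ticker_from_filename`, the file names, and the summaries are made up; the `cik_str` and `ticker` fields follow the shape of the SEC `company_tickers.json` listing.

```python
import re

# Sample entries in the shape of the SEC company_tickers.json listing.
listing = [
    {'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'},
    {'cik_str': 789019, 'ticker': 'MSFT', 'title': 'MICROSOFT CORP'},
]
cik_to_ticker = {entry['cik_str']: entry['ticker'] for entry in listing}

def ticker_from_filename(name):
    """Look up the ticker for a slugified file name containing '-data-<CIK>-'."""
    match = re.search(r'-data-(\d+)-', name)
    if match is None:
        return None
    return cik_to_ticker.get(int(match.group(1)))

# Hypothetical file names and summaries, for illustration only.
summaries = {
    '-archives-edgar-data-320193-000000000000000000-example424b5.htm':
        [['Securities offered', '...']],
    '-archives-edgar-data-999999999-000000000000000000-other424b5.htm':
        [['Use of proceeds', '...']],
}

by_ticker = {}
for name, summary in summaries.items():
    ticker = ticker_from_filename(name)
    if ticker is not None:  # skip filings whose CIK has no known ticker
        by_ticker[ticker] = summary

print(by_ticker)
```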

In [145]:
base_url = 'https://www.sec.gov{}'
tickers = cach.get(base_url.format('/files/company_tickers.json')).json()
listing = list(tickers.values())[:3]
listing

[{'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'},
 {'cik_str': 789019, 'ticker': 'MSFT', 'title': 'MICROSOFT CORP'},
 {'cik_str': 1018724, 'ticker': 'AMZN', 'title': 'AMAZON COM INC'}]

In [148]:
filings = get_company_filings(listing[0]['cik_str'])
filings.head(3)

NameError: name 'get_company_filings' is not defined