# Text Mining for Stock Prediction

## Data extraction:

With Seppe's code we can download any type of filing from the SEC. He included a function to get a summary for 424B5 filings, but we still have to define the function that parses the main contents.

We should consolidate: 
* Publication date
* Main extracts from the publication
* Closing stock price before the publication
* Closing (or opening?) stock price after the publication was filed
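
One way to hold the consolidated fields listed above is a small record type. This is only a sketch: the class name, the field names and the sample values are hypothetical placeholders, and the "price after" field is left optional because the close-versus-open question is still undecided.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FilingRecord:
    """One consolidated row per filing (hypothetical field names)."""
    publication_date: date
    extract: str                          # main text pulled from the filing
    close_before: float                   # last closing price before the filing
    price_after: Optional[float] = None   # close or open after the filing, still undecided


# Placeholder values, for illustration only
record = FilingRecord(
    publication_date=date(2020, 12, 11),
    extract="Investing in our common stock involves risk.",
    close_before=10.0,
)
print(record)
```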

### To do: 
* Define how to parse the main contents 
* Define a function that applies the parsing functions to the respective form types
* Scrape financial data (Melek & Anna)
* Join scraped info for fine-tuning


## BERT 
Original repository: https://github.com/google-research/bert

### Pre-training
* We have to determine the corpus on which we are going to pre-train BERT
* We need separate sentences for the NSP task, and we need to tokenize the sentences for the MLM task (see https://github.com/google-research/bert/blob/master/create_pretraining_data.py)
* The text file we feed into BERT should contain one sentence per line, with empty lines separating documents.
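
The input layout described in the last bullet could be produced roughly as follows. This is a sketch under assumptions: `to_pretraining_text` is a hypothetical helper, and the regex-based sentence split is deliberately naive (a proper sentence tokenizer would cope better with abbreviations such as "Inc.").

```python
import re


def to_pretraining_text(documents):
    """Format documents as one sentence per line, with a blank line between documents."""
    blocks = []
    for doc in documents:
        # Naive split: break after ., ! or ? when followed by whitespace
        sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', doc) if s.strip()]
        if sentences:
            blocks.append('\n'.join(sentences))
    return '\n\n'.join(blocks) + '\n'


sample = to_pretraining_text([
    "Investing involves risk. Prices may decline.",
    "We filed a prospectus supplement.",
])
print(sample)
```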

### Fine-tuning
To fine-tune BERT for our classification task, we are going to set up an additional layer that takes in the text files and predicts whether the stock price of the company that filed a particular publication will increase, decrease or remain stable. (Should we denote which company filed what?)

#### Requirements:
* Publication date
* Sentences regarding the risk factors
* Closing stock price before publication date
* Opening stock price after publication date
* Label (e.g. increase, decrease, stable)
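
The label in the last requirement could be derived from the two prices along these lines. The function name and the 1% threshold are assumptions made for illustration; the actual cut-off for "stable" still has to be decided.

```python
def label_move(close_before, open_after, threshold=0.01):
    """Classify the relative price change around a filing.

    threshold is an illustrative assumption: moves smaller than
    +/- 1% are treated as 'stable'.
    """
    change = (open_after - close_before) / close_before
    if change > threshold:
        return 'increase'
    if change < -threshold:
        return 'decrease'
    return 'stable'


labels = [label_move(100.0, 103.0), label_move(100.0, 97.0), label_move(100.0, 100.5)]
print(labels)
```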

## Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString
from htmllaundry import sanitize
from htmllaundry.cleaners import LaundryCleaner
import htmllaundry.utils
import xmltodict
import re
import json
from pprint import pprint
import pandas as pd
from glob import glob
from cachecontrol import CacheControl
from IPython.display import HTML
import unicodedata
from bs4 import Comment

# Scraping SEC filings

In [2]:
sess = requests.session()
cach = CacheControl(sess)

In [3]:
from IPython.display import HTML

def window(html):
    s = '<script type="text/javascript">'
    s += 'var win = window.open("", "", "toolbar=no, location=no, directories=no, status=no, menubar=no, scrollbars=yes, resizable=yes, width=780, height=200, top="+(screen.height-400)+", left="+(screen.width-840));'
    s += 'win.document.body.innerHTML = \'' + html.replace("\n",'\\n').replace("'", "\\'") + '\';'
    s += '</script>'
    return HTML(s)

## Cleaning SEC encoding

In [4]:
CustomCleaner = LaundryCleaner(
            page_structure=False,
            remove_unknown_tags=False,
            allow_tags=['blockquote', 'a', 'i', 'em', 'p', 'b', 'strong',
                        'h1', 'h2', 'h3', 'h4', 'h5', 
                        'ul', 'ol', 'li', 
                        'sub', 'sup',
                        'abbr', 'acronym', 'dl', 'dt', 'dd', 'cite',
                        'dft', 'br', 
                        'table', 'tr', 'td', 'th', 'thead', 'tbody', 'tfoot'],
            safe_attrs_only=True,
            add_nofollow=True,
            scripts=True,
            javascript=True,
            comments=True,
            style=True,
            links=False,
            meta=True,
            processing_instructions=False,
            frames=True,
            annoying_tags=False)

In [5]:
## SEC filings are encoded in CP1252 (Windows-1252); we convert them to UTF-8 text.
## See: https://www.w3.org/International/questions/qa-what-is-encoding
##      https://www.w3.org/International/articles/definitions-characters/#unicode
##      https://www.w3.org/International/questions/qa-choosing-encodings

def reformat_cp1252(match):
    # Numeric character references 128-159 are CP1252 byte values rather than
    # Unicode code points, so turn them back into raw bytes before decoding.
    codePoint = int(match.group(1))
    if 128 <= codePoint <= 159:
        return bytes([codePoint])
    else:
        return match.group()

def clean_sec_content(binary):
    # A raw bytes pattern avoids the invalid "\d" escape warning
    return re.sub(rb'&#(\d+);', reformat_cp1252, binary).decode("windows-1252")
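
To see what the conversion does, here is a small self-contained check. The two functions are restated compactly so the cell runs on its own, and the sample bytes are made up for illustration. Note that a few values in the 128-159 range (for example 129) are undefined in Python's CP1252 codec and would raise a `UnicodeDecodeError`.

```python
import re


def reformat_cp1252(match):
    code_point = int(match.group(1))
    # 128-159 are control codes in Unicode but printable characters in CP1252
    return bytes([code_point]) if 128 <= code_point <= 159 else match.group()


def clean_sec_content(binary):
    return re.sub(rb'&#(\d+);', reformat_cp1252, binary).decode('windows-1252')


# &#146; is the CP1252 right single quote; &#38; is outside the range and stays as-is
cleaned = clean_sec_content(b'the company&#146;s stock &#38; warrants')
print(cleaned)
```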

In [6]:
## Normalize a URL path into a flat, filesystem-safe file name
def slugify(value):
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s.\-]', '-', value).strip().lower()
    value = re.sub(r'[-\s]+', '-', value)
    return value
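
As a quick check, the same logic (restated here so the cell runs on its own) turns a document path into the kind of file name that the downloaded filings end up with:

```python
import re
import unicodedata


def slugify(value):
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s.\-]', '-', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


name = slugify('/Archives/edgar/data/1174940/000149315220022121/form424b5.htm')
print(name)  # -archives-edgar-data-1174940-000149315220022121-form424b5.htm
```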

## Cleaning html

In [7]:
def read_html(file):
    with open(file, 'r') as f: return f.read()

In [8]:
def clean_html(html):
    # lxml is already required by htmllaundry and is what BeautifulSoup picks by default
    soup = BeautifulSoup(html, 'lxml')
    # Some filings use <div> instead of <p> for paragraphs
    if not soup.find('p'):
        for div in soup.find_all('div'):
            div.name = 'p'
    # Normalize bold markup to <strong>
    for b in soup.find_all('b'):
        b.name = 'strong'
    for f in soup.find_all('font', style=re.compile(r'font-weight:\s*bold')):
        f.name = 'strong'
    # Drop running page headers and footers
    for footer in soup.find_all(class_=['header', 'footer']): 
        try: footer.decompose()
        except Exception: pass
    san = sanitize(str(soup), CustomCleaner)
    soup = BeautifulSoup(san, 'lxml')
    def decompose_parent(el, parent='p', not_grandparent='table'):
        # Remove the enclosing <parent> of el, unless it sits inside <not_grandparent>
        try:
            parent = el.find_parent(parent)
        except Exception: parent = None
        if not parent: return
        grandparent = parent.find_parent(not_grandparent)
        if grandparent: return
        parent.decompose()
    # "Table of contents" back-links
    for el in soup.find_all(text=lambda x: 'table of contents' == str(x).lower().strip()):
        decompose_parent(el, 'a')
    # Page numbers such as "S-12" or "S-iv"
    for el in soup.find_all(text=re.compile(r'^\s*S\-(\d+|[ivxlcdm]+)\s*$')): 
        decompose_parent(el, 'p')
    # Page numbers that are plain digits
    for el in soup.find_all(text=re.compile(r'^\s*\d+\s*$')): 
        decompose_parent(el, 'p')
    return soup
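
The two page-number patterns used at the end of `clean_html` can be tried in isolation. The helper name and the sample strings below are made up for illustration. One thing worth keeping in mind is that any paragraph outside a table that consists only of digits is treated as a page number, even if it is real content.

```python
import re

PROSPECTUS_PAGE = re.compile(r'^\s*S\-(\d+|[ivxlcdm]+)\s*$')  # e.g. "S-12", "S-iv"
PLAIN_PAGE = re.compile(r'^\s*\d+\s*$')                        # e.g. "27"


def looks_like_page_number(text):
    """True when the whole text node is nothing but a page number."""
    return bool(PROSPECTUS_PAGE.match(text) or PLAIN_PAGE.match(text))


samples = ['S-12', ' S-iv ', '27', 'S-1 registration', 'Item 12']
flags = [looks_like_page_number(s) for s in samples]
print(dict(zip(samples, flags)))
```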

## Defining helper functions to download files

In [9]:
def pagination_provider_by_element_start_count(find_args, find_kwargs):
    def pagination_provider_by_element_start_count_wrapped(soup, params):
        if soup.find(*find_args, **find_kwargs) is None: 
            return None
        params['start'] += params['count'] 
        return params
    return pagination_provider_by_element_start_count_wrapped
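
To make the pagination logic easier to follow, the provider can be exercised with a stand-in page object. `FakePage` is a hypothetical stub that only mimics `soup.find`. The sketch also shows that the provider expects `params` to already contain `start` and `count`; with an empty parameter dict it would raise a `KeyError` as soon as a "next" element is found.

```python
def pagination_provider_by_element_start_count(find_args, find_kwargs):
    def wrapped(soup, params):
        # No "next" element on the page means there is nothing left to fetch
        if soup.find(*find_args, **find_kwargs) is None:
            return None
        params['start'] += params['count']  # updates the dict in place
        return params
    return wrapped


class FakePage:
    """Stand-in for a BeautifulSoup page (hypothetical, for illustration only)."""
    def __init__(self, has_next):
        self.has_next = has_next

    def find(self, *args, **kwargs):
        return object() if self.has_next else None


provider = pagination_provider_by_element_start_count(('input',), {'value': 'Next 100'})
params = {'start': 0, 'count': 100}
after_first = dict(provider(FakePage(True), params))   # page with a "Next 100" button
after_last = provider(FakePage(False), params)         # last page
print(after_first, after_last)
```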

In [10]:
def params_provider_by_dict(params):
    return lambda : params

In [11]:
## Find the table with the given summary attribute, add a newline after each <br>, and parse it into a DataFrame
def table_provider_by_summary(summary, header=0, index_col=0):
    return lambda soup: pd.read_html(
        str(soup.find('table', summary=summary)).replace('<br>', '<br>\n'), header=header, index_col=index_col)[0]

In [12]:
def get_sec_table(url,
                  table_provider=None,
                  base_params={}, 
                  params_provider=None,
                  pagination_provider=None,
                  replace_links=True,
                  session=None):
    ### this function returns a tuple of a df with the respective soup element
    def return_data_frame(session, url, params, provider):
        request = session.get(url, params=params)
        soup = BeautifulSoup(request.text)
        if replace_links:
            for a in soup.find_all('a'):
                parent = a.find_parent('td')
                if parent: parent.string = a['href']
        df = provider(soup)
        return df, soup
    # Default to the cached session; treat non-absolute URLs as paths on sec.gov
    if session is None:
        session = cach
    if not url.startswith('http://') and not url.startswith('https://'):
        url = base_url.format(url)
    # Merge extra query parameters, given either as a dict or as a callable returning one
    params = dict(base_params)
    if params_provider:
        if isinstance(params_provider, dict):
            params.update(params_provider)
        else:
            params.update(params_provider())
    ### in case you only scrape one page, it will just return the df of the respective page###
    if not pagination_provider:
        df, soup = return_data_frame(session, url, params, table_provider)
        return df
    ### in case you want to scrape multiple pages, create an empty list of dfs and add each df from each 
    ### page to the empty list, at the end you just concatenate all of the dfs
    else:
        data_frames = []
        page_params = dict(params)
        while True:
            df, soup = return_data_frame(session, url, page_params, table_provider)
            data_frames.append(df)
            # Make sure columns retain their names
            data_frames[-1].columns = data_frames[0].columns
            new_params = pagination_provider(soup, page_params)
            if not new_params:
                break
            else:
                page_params.update(new_params)
        return pd.concat(data_frames, sort=False, ignore_index=True)

## Function to get the documents

This is a function to get the documents in the filing details page for each filing. See below for an example page.


In [13]:
def get_filing_documents(url, summary='Document Format Files'):
    return get_sec_table(
        url,
        table_provider=table_provider_by_summary(summary, index_col=None),
        pagination_provider=pagination_provider_by_element_start_count(('input',), {'value': 'Next 100'}))

## Scraping most recent filings

For the previous 5 days

In [14]:
base_url = 'https://www.sec.gov{}'

In [15]:
def get_current_events(days_before=0, form_type=''):
    soup = BeautifulSoup(cach.get(base_url.format('/cgi-bin/current'), 
                            params={'q1': days_before, 'q2': 0, 'q3': form_type}).text)
    pre = soup.find('pre')
    ls = []
    for line in str(pre).replace('<hr>', '\n').replace('<hr/>', '\n').split('\n'):
        bs_line = BeautifulSoup(line)
        clean_line = '  '.join(item.strip() for item in bs_line.find_all(text=True))
        split_line = [ x.strip() for x in clean_line.split('  ') if x.strip() ]
        split_line += [ a.get('href') for a in bs_line.find_all('a') ]
        if not all(x is None for x in split_line): ls.append(split_line)
    colnames = ls[0] + [ 'link_{}'.format(i) for i in range(max(len(l) for l in ls) - len(ls[0])) ]
    return pd.DataFrame(ls[1:], columns=colnames)
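
The column naming at the end of `get_current_events` can be seen in isolation: the header line has no links, so every extra value collected from the `<a>` tags of a row gets a generated `link_i` name. The row below uses placeholder link values, not real paths.

```python
rows = [
    ['Date Filed', 'Form', 'CIK Code', 'Company Name'],
    # Placeholder links; real rows carry the hrefs found on the line
    ['12-11-2020', '8-K', '1599407', '1847 Holdings LLC', '<filing link>', '<company link>'],
]

header = rows[0]
widest = max(len(r) for r in rows)
# One generated name per column that the header does not cover
colnames = header + ['link_{}'.format(i) for i in range(widest - len(header))]
print(colnames)
```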

In [16]:
get_current_events(form_type='8-K').head()

| | Date Filed | Form | CIK Code | Company Name | link_0 | link_1 |
|---|---|---|---|---|---|---|
| 0 | 12-11-2020 | 8-K | 1599407 | 1847 Holdings LLC | /Archives/edgar/data/1599407/0001213900-20-042... | browse-edgar?action=getcompany&CIK=1599407 |
| 1 | 12-11-2020 | 8-K | 946644 | AIM ImmunoTech Inc. | /Archives/edgar/data/946644/0001493152-20-0233... | browse-edgar?action=getcompany&CIK=946644 |
| 2 | 12-11-2020 | 8-K | 353184 | AIR T INC | /Archives/edgar/data/353184/0000353184-20-0000... | browse-edgar?action=getcompany&CIK=353184 |
| 3 | 12-11-2020 | 8-K | 1626199 | ALPINE IMMUNE SCIENCES, INC. | /Archives/edgar/data/1626199/0001626199-20-000... | browse-edgar?action=getcompany&CIK=1626199 |
| 4 | 12-11-2020 | 8-K | 1823945 | ALTIMAR ACQUISITION CORP. | /Archives/edgar/data/1823945/0001104659-20-134... | browse-edgar?action=getcompany&CIK=1823945 |


## Downloading SEC documents

Questions: 
* What is the purpose of defining a directory? It does not seem to work when I use it as a parameter for download_sec_documents

* What does the error "index 0 is out of bounds for axis 0 with size 0" mean? I still manage to download the files.

In [16]:
def download_sec_documents(doc_link):
    contents = clean_sec_content(cach.get(base_url.format(doc_link)).content)
    name = slugify(doc_link)
    with open(name, 'w') as f: f.write(contents)

#### Downloading 424B5s

In [17]:
forms = get_current_events(0, '424B5')
for link in forms['link_0']:
    docs = get_filing_documents(base_url.format(link))
    doc_link = docs.loc[docs.Type == '424B5', 'Document'].values[0]
    download_sec_documents(doc_link)

#### Downloading 8-Ks

In [34]:
num_days = 2

for p in range(0, num_days):
    print('Scraping day-page:', p)
    forms = get_current_events(p, '8-K')
    for link in forms['link_0']:
        docs = get_filing_documents(base_url.format(link))
        doc_link = docs.loc[docs.Type == '8-K', 'Document'].values[0]
        download_sec_documents(doc_link)

Scraping day-page: 0


IndexError: index 0 is out of bounds for axis 0 with size 0

When downloading 8-K filings I always get the IndexError above, whereas downloading 424B5 filings works without an issue. The error arises on the filing details page: 

8-K example: https://www.sec.gov/Archives/edgar/data/926660/0001193125-20-298746-index.html

424B5 example: https://www.sec.gov/Archives/edgar/data/1035443/0001047469-19-001263-index.html
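
The message itself is what `.values[0]` raises when the selection is empty, that is, when no row of the documents table has a `Type` exactly equal to the requested form. Files from earlier filings in the loop are already written by then, which would explain why some downloads succeed. Why a given 8-K page has no matching row still needs checking. In the meantime, a guarded lookup avoids the crash. The sketch below works on plain row dictionaries (with a DataFrame, `docs.to_dict('records')` yields that shape); `pick_document` and the sample rows are hypothetical.

```python
def pick_document(rows, form_type):
    """Return the first 'Document' whose 'Type' matches form_type, else None."""
    wanted = form_type.strip().upper()
    for row in rows:
        if str(row.get('Type', '')).strip().upper() == wanted:
            return row.get('Document')
    return None


# Placeholder rows standing in for one filing details table
rows = [
    {'Type': 'EX-99.1', 'Document': '<exhibit link>'},
    {'Type': '424B5', 'Document': '<main document link>'},
]
found = pick_document(rows, '424B5')
missing = pick_document(rows, '8-K')   # no matching row: None instead of IndexError
print(found, missing)
```

In the download loop, a filing could then be skipped (and its link logged) whenever the lookup returns `None`.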


## Summary extraction

There is a difference between 424B5 and 8-K forms. The 424B5 forms have summary tables, but we are interested in extracting the text from both forms, not necessarily the tables.

File that I am going to be working with: https://www.sec.gov/Archives/edgar/data/1174940/000149315220022121/form424b5.htm

### Parsing 424B5 filings

In [12]:
content1 = read_html('-archives-edgar-data-1174940-000149315220022121-form424b5.htm')
cleaned_content1 = clean_html(content1)

First we need to get the headers that delimit the section that includes the description of the risk factors. Once we have the headers, we start looping through every "p" tag that is in between the two headers.

We still have to generalize how we find the delimiting headers and pass them to get_risk_info, so that parsing stops before the next section begins. 

In [9]:
def get_header(soup):
    for header in soup.find_all():
        match = re.match(r'RISK\s*FACTORS\s*', header.text, re.M) 
        if match:
            if header.name == 'p':
                parent = header.parent
                if parent.name == 'body': ## this is to make it easier to get the delimiting header, since it must be a sibling of header
                    return header

In [18]:
def get_delimiter_header(content):
    soup = BeautifulSoup(content, 'html.parser') #we use the html parser to list the headers with the sourceline
    positions = []
    for tag in soup.find_all('a'):
        positions.append(tag.string)
    limit = len(positions)
    listing = []
    for i in range(0, limit-1):
        if positions[i] is not None:
            match = re.match(r'RISK\s*FACTORS\s*', str(positions[i]), re.M) #we spot the target header and select the header that follows it
            if match:
                listing.append(positions[i+1])
    return listing
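
The selection step in `get_delimiter_header` boils down to "take the anchor text that follows each RISK FACTORS entry". Here is a sketch on a plain list, with made-up section names and a hypothetical function name. Since `re.match` only anchors at the start, longer titles that begin with "RISK FACTORS" match as well, and a `None` entry right after the target would be returned as-is.

```python
import re


def headers_after_risk_factors(anchor_texts):
    """Return the anchor text that follows each 'RISK FACTORS' entry."""
    following = []
    # Pair every entry with its successor; the last entry has none
    for current, nxt in zip(anchor_texts, anchor_texts[1:]):
        if current is not None and re.match(r'RISK\s*FACTORS\s*', str(current)):
            following.append(nxt)
    return following


# Illustrative table-of-contents entries, including an anchor without text
toc = ['PROSPECTUS SUPPLEMENT SUMMARY', 'RISK FACTORS', 'USE OF PROCEEDS', None, 'DILUTION']
result = headers_after_risk_factors(toc)
print(result)
```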

In [19]:
def get_risk_info(header,content):
    paragraphs = ''
    brother = header.next_sibling
    already_seen = False
    delimiters = get_delimiter_header(content) #we call the delimiting headers to use them as stop criteria
    while True:
        if brother is None:
            break
        if type(brother) == NavigableString:
            if already_seen and str(brother).strip() != '': break
            brother = brother.next_sibling
            continue
        if brother.name == 'p':
            if already_seen and brother.get_text(strip=True) in delimiters: break
        already_seen = True
        paragraphs += str(brother)
        brother = brother.next_sibling
    return str(paragraphs)

In [20]:
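def extract_paragraphs(soup, content): 
    # soup is the cleaned document; content is the raw HTML, which
    # get_risk_info needs in order to look up the delimiting headers
    header = get_header(soup)
    if header is None:
        return ''
    return get_risk_info(header, content)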

Find out a way to extract the content through extract_paragraphs. 

Now I am going to try to parse all the downloaded documents.

In [41]:
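filings_424b5 = []

for html_file in glob('*424b5*.htm'):
    print(html_file)
    content = read_html(html_file)
    cleaned_content = clean_html(content)
    header = get_header(cleaned_content)
    if header is None:  # no risk factors header found in this filing
        continue
    paragraphs = get_risk_info(header, content)
    if paragraphs:
        # append keeps one entry per filing; += would add the string one character at a time
        filings_424b5.append(paragraphs)
        with open("4245B5", "a") as f:
            f.write(paragraphs)
        print('Great success')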
        print('Great success')
                

-archives-edgar-data-1674930-000156459020054810-flgt-424b5.htm
Great success
-archives-edgar-data-1558583-000121390020038382-ea130283-424b5_arcimoto.htm
Great success
-archives-edgar-data-864270-000119312520298619-d97515d424b5.htm
Great success
-archives-edgar-data-1692412-000119312520298480-d83042d424b5.htm
Great success
-archives-edgar-data-1410098-000121390020039655-ea130437-424b5_cormedix.htm
Great success
-archives-edgar-data-310764-000119312520299402-d30647d424b5.htm
Great success
-archives-edgar-data-1636651-000119312520299425-d947308d424b5.htm
Great success
-archives-edgar-data-1623526-000119312520299388-d42479d424b5.htm
Great success
-archives-edgar-data-1696396-000119312520299465-d93087d424b5.htm
Great success
-archives-edgar-data-1174940-000149315220022121-form424b5.htm
Great success
-archives-edgar-data-1316517-000121390020038504-ea130316-424b5_kanditechno.htm
Great success
-archives-edgar-data-1182534-000119312520304490-d63319d424b5.htm
Great success
-archives-edgar-data-1

In [53]:
with open("4245B5", "r") as f:
    test = f.read()

The code works, but the text still includes the HTML tags.

In [55]:
def cleanhtml(raw_html):
    # Remove tags and character entities (entities are deleted, not decoded)
    cleanr = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
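
Because the regex deletes entities instead of decoding them (`&amp;` simply disappears), a parser-based variant may be worth considering. The sketch below uses only the standard library; `TextExtractor` and `strip_tags` are hypothetical names. BeautifulSoup's `get_text()` would be another option, since we already parse the files with it.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes; entities are decoded because convert_charrefs is on."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(raw_html):
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return ''.join(parser.parts)


text = strip_tags('<p>Risk &amp; uncertainty in our <strong>common stock</strong>.</p>')
print(text)
```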

In [56]:
cleanhtml(test)

"Investing in our common stock involves risk. Before deciding whether to invest in our common stock, you should consider carefully the risks and uncertainties described below. You should also consider the risks, uncertainties and assumptions discussed under the heading “Risk Factors” included in our most recent annual report on Form 10-K, as revised or supplemented by our most recent quarterly report on Form 10-Q, each of which are on file with the SEC and are incorporated herein by reference, and which may be amended, supplemented or superseded from time to time by other reports we file with the SEC in the future. There may be other unknown or unpredictable economic, business, competitive, regulatory or other factors that could have material adverse effects on our future results. If any of these risks actually occurs, our business, business prospects, financial condition or results of operations could be seriously harmed. This could cause the trading price of our common stock to decli