# Text Mining for Stock Prediction

## Data extraction:

With Seppe's code we can download any type of filing from the SEC. He included a function to get a summary for 424B5 filings, but we still have to define a function that parses the main contents.

We should consolidate: 
* Publication date
* Main extracts from the publication
* Closing stock price before the publication
* Closing (or opening?) stock price after the publication was filed

### To do: 
* Define how to parse the main contents 
* Define a function that applies the parsing functions to the respective form types
* Scrape financial data (Melek & Anna)
* Join scraped info for fine-tuning


## BERT 
Original repository: https://github.com/google-research/bert

### Pre-training
* We have to determine the corpus on which we are going to pre-train BERT
* We need separate sentences for the NSP task, and we need to tokenize the sentences for the MLM task (see https://github.com/google-research/bert/blob/master/create_pretraining_data.py)
* The text file we feed into BERT should contain one sentence per line, with empty lines separating different documents.
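The input format described in the last bullet could be produced roughly as follows. This is only a sketch: the function name `to_pretraining_text` is hypothetical, and the regex-based sentence splitter is a naive assumption that will mis-split abbreviations such as "No. 333-235763", so a proper sentence tokenizer would likely be needed for real filings.

```python
import re

def to_pretraining_text(documents):
    """Turn a list of raw document strings into the text layout described above:
    one sentence per line, one empty line between documents."""
    blocks = []
    for doc in documents:
        # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace
        sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', doc) if s.strip()]
        if sentences:
            blocks.append('\n'.join(sentences))
    # Blank line between documents, trailing newline at end of file
    return '\n\n'.join(blocks) + '\n'
```

The returned string can be written to a single text file and passed to `create_pretraining_data.py` as its input file.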

### Fine-tuning
To fine-tune BERT for our classification task, we are going to set up an additional layer that takes in the text files and predicts whether the stock price of the company that filed a particular publication is going to increase, decrease or remain stable. (Open question: should we denote which company filed what?)

#### Requirements:
* Publication date
* Sentences regarding the risk factors
* Closing stock price before publication date
* Opening stock price after publication date
* Label (e.g. increase, decrease, stable)
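One way the label could be derived from the two prices listed above is sketched here. The function name `label_price_move` and the 1% `threshold` that separates "stable" from a real move are assumptions for illustration; the project has not fixed either.

```python
def label_price_move(close_before, open_after, threshold=0.01):
    """Label the relative price change around a publication.

    threshold is a hypothetical cut-off: changes within +/- threshold count as 'stable'.
    """
    if close_before <= 0:
        raise ValueError('close_before must be positive')
    change = (open_after - close_before) / close_before
    if change > threshold:
        return 'increase'
    if change < -threshold:
        return 'decrease'
    return 'stable'
```

The threshold directly controls the class balance of the fine-tuning set, so it is worth checking the label distribution before settling on a value.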

## Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString
from htmllaundry import sanitize
from htmllaundry.cleaners import LaundryCleaner
import htmllaundry.utils
import xmltodict
import re
import json
from pprint import pprint
import pandas as pd
from glob import glob
from cachecontrol import CacheControl
from IPython.display import HTML
import unicodedata
from bs4 import Comment

# Scraping SEC filings

In [2]:
sess = requests.session()
cach = CacheControl(sess)

In [3]:
from IPython.display import HTML

def window(html):
    s = '<script type="text/javascript">'
    s += 'var win = window.open("", "", "toolbar=no, location=no, directories=no, status=no, menubar=no, scrollbars=yes, resizable=yes, width=780, height=200, top="+(screen.height-400)+", left="+(screen.width-840));'
    s += 'win.document.body.innerHTML = \'' + html.replace("\n",'\\n').replace("'", "\\'") + '\';'
    s += '</script>'
    return HTML(s)

## Cleaning SEC encoding

In [4]:
CustomCleaner = LaundryCleaner(
            page_structure=False,
            remove_unknown_tags=False,
            allow_tags=['blockquote', 'a', 'i', 'em', 'p', 'b', 'strong',
                        'h1', 'h2', 'h3', 'h4', 'h5', 
                        'ul', 'ol', 'li', 
                        'sub', 'sup',
                        'abbr', 'acronym', 'dl', 'dt', 'dd', 'cite',
                        'dft', 'br', 
                        'table', 'tr', 'td', 'th', 'thead', 'tbody', 'tfoot'],
            safe_attrs_only=True,
            add_nofollow=True,
            scripts=True,
            javascript=True,
            comments=True,
            style=True,
            links=False,
            meta=True,
            processing_instructions=False,
            frames=True,
            annoying_tags=False)

In [5]:
## SEC filings are encoded in CP1252; it is recommended to always work in UTF-8.
## See: https://www.w3.org/International/questions/qa-what-is-encoding
##      https://www.w3.org/International/articles/definitions-characters/#unicode
##      https://www.w3.org/International/questions/qa-choosing-encodings

def reformat_cp1252(match):
    codePoint = int(match.group(1))
    if 128 <= codePoint <= 159:
        return bytes([codePoint])
    else:
        return match.group()

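def clean_sec_content(binary):
    # Turn numeric references in the CP1252-only range (128-159) into raw bytes,
    # then decode the whole document as CP1252 to get a regular (Unicode) string
    return re.sub(rb'&#(\d+);', reformat_cp1252, binary).decode('windows-1252')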
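The 128-159 check matters because those values are printable punctuation in CP1252 but control characters in Latin-1/Unicode. The small sketch below illustrates this with the bytes that `&#147;` and `&#148;` become; the sample string is made up for the example.

```python
import unicodedata

# Bytes 147 and 148, i.e. what &#147; and &#148; are turned into before decoding
raw = b'\x93Risk Factors\x94'

as_cp1252 = raw.decode('windows-1252')  # interprets 147/148 as curly quotes
as_latin1 = raw.decode('latin-1')       # interprets 147/148 as C1 control characters

first_cp1252_name = unicodedata.name(as_cp1252[0])
first_latin1_category = unicodedata.category(as_latin1[0])
```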

In [6]:
## this is to normalize urls
def slugify(value):
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s\.\-]', '-', value).strip().lower()
    value = re.sub(r'[-\s]+', '-', value)
    return value

## Cleaning html

In [7]:
def read_html(file):
    with open(file, 'r', encoding='utf-8') as f: return f.read()

In [8]:
def clean_html(html):
    soup = BeautifulSoup(html)
    if not soup.find('p'):
        for div in soup.find_all('div'):
            div.name = 'p'
    for b in soup.find_all('b'):
        b.name = 'strong'
    for f in soup.find_all('font', style=re.compile(r'font-weight:\s*bold')):
        f.name = 'strong'
    for footer in soup.find_all(class_=['header', 'footer']): 
        try: footer.decompose()
        except: pass
    san = sanitize(str(soup), CustomCleaner)
    soup = BeautifulSoup(san)
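    def decompose_parent(el, parent='p', not_grandparent='table'):
        # Remove the enclosing `parent` tag of `el`, unless it sits inside a `not_grandparent` tag
        try:
            parent_tag = el.find_parent(parent)
        except Exception: parent_tag = None
        if not parent_tag: return
        grandparent = parent_tag.find_parent(not_grandparent)
        if grandparent: return
        parent_tag.decompose()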
    for el in soup.find_all(text=lambda x: 'table of contents' == str(x).lower().strip()):
        decompose_parent(el, 'a')
    for el in soup.find_all(text=re.compile(r'^\s*S\-(\d+|[ivxlcdm]+)\s*$')): 
        decompose_parent(el, 'p')
    for el in soup.find_all(text=re.compile(r'^\s*\d+\s*$')): 
        decompose_parent(el, 'p')
    return soup

## Defining helper functions to download files

In [9]:
def pagination_provider_by_element_start_count(find_args, find_kwargs):
    def pagination_provider_by_element_start_count_wrapped(soup, params):
        if soup.find(*find_args, **find_kwargs) is None: 
            return None
        params['start'] += params['count'] 
        return params
    return pagination_provider_by_element_start_count_wrapped

In [10]:
def params_provider_by_dict(params):
    return lambda : params

In [11]:
## Find the table with the given summary attribute, add a newline after each <br>, and parse it into a DataFrame
def table_provider_by_summary(summary, header=0, index_col=0):
    return lambda soup: pd.read_html(
        str(soup.find('table', summary=summary)).replace('<br>', '<br>\n'), header=header, index_col=index_col)[0]

In [12]:
def get_sec_table(url,
                  table_provider=None,
                  base_params={}, 
                  params_provider=None,
                  pagination_provider=None,
                  replace_links=True,
                  session=None):
    ### this function returns a tuple of a df with the respective soup element
    def return_data_frame(session, url, params, provider):
        request = session.get(url, params=params)
        soup = BeautifulSoup(request.text)
        if replace_links:
            for a in soup.find_all('a'):
                parent = a.find_parent('td')
                if parent: parent.string = a['href']
        df = provider(soup)
        return df, soup
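    ####################################################################################
    ### If no session is given, fall back to the cached session `cach`;              ###
    ### URLs without a scheme are treated as paths and resolved against base_url     ###
    ####################################################################################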
    if session is None:
        session = cach
    if not url.startswith('http://') and not url.startswith('https://'):
        url = base_url.format(url)
    ###############################################################################################    
    ###if the specified parameters are a dictionary, update params with the specified parameters###
    ###############################################################################################
    params = dict(base_params)
    if params_provider:
        if isinstance(params_provider, dict):
            params.update(params_provider)
        else:
            params.update(params_provider())
    ### in case you only scrape one page, it will just return the df of the respective page###
    if not pagination_provider:
        df, soup = return_data_frame(session, url, params, table_provider)
        return df
    ### in case you want to scrape multiple pages, create an empty list of dfs and add each df from each 
    ### page to the empty list, at the end you just concatenate all of the dfs
    else:
        data_frames = []
        page_params = dict(params)
        while True:
            df, soup = return_data_frame(session, url, page_params, table_provider)
            data_frames.append(df)
            # Make sure columns retain their names
            data_frames[-1].columns = data_frames[0].columns
            new_params = pagination_provider(soup, page_params)
            if not new_params:
                break
            else:
                page_params.update(new_params)
        return pd.concat(data_frames, sort=False, ignore_index=True)

## Function to get the documents

This is a function to get the documents in the filing details page for each filing. See below for an example page.


In [13]:
def get_filing_documents(url, summary='Document Format Files'):
    return get_sec_table(url,
                         table_provider=table_provider_by_summary(summary, index_col=None),
                         pagination_provider=pagination_provider_by_element_start_count(('input',), {'value': 'Next 100'}))

## Scraping most recent filings

For the previous 5 days

In [14]:
base_url = 'https://www.sec.gov{}'

In [15]:
def get_current_events(days_before=0, form_type=''):
    soup = BeautifulSoup(cach.get(base_url.format('/cgi-bin/current'), 
                            params={'q1': days_before, 'q2': 0, 'q3': form_type}).text)
    pre = soup.find('pre')
    ls = []
    for line in str(pre).replace('<hr>', '\n').replace('<hr/>', '\n').split('\n'):
        bs_line = BeautifulSoup(line)
        clean_line = '  '.join(item.strip() for item in bs_line.find_all(text=True))
        split_line = [ x.strip() for x in clean_line.split('  ') if x.strip() ]
        split_line += [ a.get('href') for a in bs_line.find_all('a') ]
        if not all(x is None for x in split_line): ls.append(split_line)
    colnames = ls[0] + [ 'link_{}'.format(i) for i in range(max(len(l) for l in ls) - len(ls[0])) ]
    return pd.DataFrame(ls[1:], columns=colnames)

In [40]:
get_current_events(form_type='8-K').head()

## Downloading SEC documents

Questions: 
* What is the purpose of defining a directory? It does not seem to work when I use it as a parameter for download_sec_documents

* What does the error "index 0 is out of bounds for axis 0 with size 0" mean? I still manage to download the files.

In [16]:
def download_sec_documents(doc_link):
    contents = clean_sec_content(cach.get(base_url.format(doc_link)).content)
    name = slugify(doc_link)
    with open(name, 'w', encoding='utf-8') as f: f.write(contents)

#### Downloading 424B5s

In [29]:
forms = get_current_events(0, '424B5')
for link in forms['link_0']:
    docs = get_filing_documents(base_url.format(link))
    doc_link = docs.loc[docs.Type == '424B5', 'Document'].values[0]
    download_sec_documents(doc_link)

#### Downloading 8-Ks

In [34]:
num_days = 2

for p in range(0, num_days):
    print('Scraping day-page:', p)
    forms = get_current_events(p, '8-K')
    for link in forms['link_0']:
        docs = get_filing_documents(base_url.format(link))
        doc_link = docs.loc[docs.Type == '8-K', 'Document'].values[0]
        download_sec_documents(doc_link)

Scraping day-page: 0


IndexError: index 0 is out of bounds for axis 0 with size 0

Downloading 8-K filings always raises the index error above, whereas downloading 424B5 filings does not. It happens when processing the filing details page:

8-K example: https://www.sec.gov/Archives/edgar/data/926660/0001193125-20-298746-index.html

424B5 example: https://www.sec.gov/Archives/edgar/data/1035443/0001047469-19-001263-index.html
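The `IndexError` itself means that `docs.loc[docs.Type == '8-K', 'Document'].values` is empty, so `[0]` has nothing to return: no row of the documents table had a `Type` exactly equal to `'8-K'` for that filing. Why that happens has not been verified here; one guess is that some filings list a variant such as `8-K/A`. A minimal sketch of a guard that skips such filings instead of crashing is shown below. The helper name `pick_document_link` is hypothetical, and it takes plain `(type, document)` pairs so it can be tried without the scraper.

```python
def pick_document_link(rows, form_type):
    """Return the first document link whose type equals form_type, or None.

    rows is an iterable of (type, document) pairs,
    e.g. zip(docs.Type, docs.Document) for the DataFrame used above.
    """
    for doc_type, document in rows:
        if str(doc_type).strip() == form_type:
            return document
    return None


# Example with made-up rows: the second filing has no plain '8-K' row
filing_a = [('EX-99.1', '/a-ex991.htm'), ('8-K', '/a-8k.htm')]
filing_b = [('8-K/A', '/b-8ka.htm')]
link_a = pick_document_link(filing_a, '8-K')
link_b = pick_document_link(filing_b, '8-K')
```

In the download loop, a `None` result can be logged together with the filing link and skipped with `continue`, which also makes it easy to inspect which filings fail to match.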


## Summary extraction

There is a difference between 424B5 and 8-K forms: 424B5 forms have summary tables, which is what the extract_dual_tables function extracts. We are interested in extracting the text from both forms, not necessarily the tables.

File that I am going to be working with: https://www.sec.gov/Archives/edgar/data/1174940/000149315220022121/form424b5.htm

In [17]:
def match_by_name_and_regex(name, regex, lowercase=True):
    return lambda el: el.name == name and re.search(regex, el.text.lower() if lowercase else el.text) is not None

In [18]:
# this function is going to give us all the headers that relate to risk factors in a filing
def get_offering_header_candidates(soup):
    return soup.find_all(match_by_name_and_regex('a', r'\s*factors\s*'))

In [19]:
content = read_html('-archives-edgar-data-1174940-000149315220022121-form424b5.htm')
cleaned_content = clean_html(content)

In [20]:
get_offering_header_candidates(cleaned_content)

[<a href="#v_005">RISK
     FACTORS</a>,
 <a href="#cj_001">RISK FACTORS</a>,
 <a href="#cj_001">RISK FACTORS</a>]

## Approach 1: 
List the titles in the document that correspond to risk factors (let's call them target_titles) --> get_offering_header_candidates

Then we need to parse everything between each target title and the next title that does not match match_by_name_and_regex('a', r'\s*factors\s*').

We need to loop through each item: if the item is null or not a string, we skip it; otherwise we append it to a list.

In [41]:
cleaned_content

In [22]:
# In order to find the paragraph under the header:
get_offering_header_candidates(cleaned_content)[0]

# we need to first find the p tag that is parent to this a tag, and then loop through every sibling
# of the parent tag and add it to an empty list

<a href="#v_005">RISK
    FACTORS</a>

In [36]:
for tag in cleaned_content.find_all('a'):
    parent = tag.parent
    if parent is not None:
        print(parent)

<td><a href="#v_001">ABOUT
    THIS PROSPECTUS SUPPLEMENT</a></td>
<td><a href="#v_002">CAUTIONARY
    STATEMENT CONCERNING FORWARD-LOOKING STATEMENTS</a></td>
<td><a href="#v_003">PROSPECTUS
    SUPPLEMENT SUMMARY</a></td>
<td><a href="#v_004">THE
    OFFERING</a></td>
<td><a href="#v_005">RISK
    FACTORS</a></td>
<td><a href="#g_001">CAPITALIZATION</a></td>
<td><a href="#g_002">USE
    OF PROCEEDS</a></td>
<td><a href="#g_003">DIVIDEND
    POLICY</a></td>
<td><a href="#g_004">DILUTION</a></td>
<td><a href="#g_005">UNDERWRITING</a></td>
<td><a href="#g_006">LEGAL
    MATTERS</a></td>
<td><a href="#g_007">EXPERTS</a></td>
<td><a href="#g_008">WHERE
    YOU CAN FIND MORE INFORMATION</a></td>
<td><a href="#g_009">INFORMATION
    INCORPORATED BY REFERENCE</a></td>
<td><a href="#g_010">ABOUT
    THIS PROSPECTUS</a></td>
<td><a href="#g_011">PROSPECTUS
    SUMMARY</a></td>
<td><a href="#g_012">SECURITIES
    WE MAY OFFER</a></td>
<td><a href="#cj_001">RISK FACTORS</a></td>
<td><a href="#cj

In [24]:
##This is the tag we are interested in 
<p><a name="v_005"></a>RISK FACTORS</p>


In [35]:
# here I am trying to find the paragraphs in the section of RISK FACTORS
for tag in cleaned_content.find_all('a'):
    parent = tag.parent
    if parent is not None and parent.name == 'p':
        uncle = parent.next_sibling
        if str(uncle).strip() == '':
            uncle = uncle.next_sibling
            if uncle is not None: 
                print(uncle)
## this code works for only one paragraph after the header title, we should loop through every 
# p tag after the header

<p>This
prospectus supplement and the accompanying prospectus form part of a registration statement on Form S-3 that we filed with the
Securities and Exchange Commission, or the “SEC,” using a “shelf” registration process. This document
contains two parts. The first part consists of this prospectus supplement, which provides you with specific information about
this offering. The second part, the accompanying prospectus, provides more general information, some of which may not apply to
this offering. Generally, when we refer only to the “prospectus,” we are referring to both parts combined. This prospectus
supplement, and the information incorporated herein by reference, may add, update or change information in the accompanying prospectus.
You should read the entire prospectus supplement as well as the accompanying prospectus and the documents incorporated by reference
herein that are described under the headings “Where You Can Find More Information” and “Incorporation of Certain
Docume

In [31]:
def get_risk_info(header):
    # Collect the <p> tags that follow the <p> enclosing `header`.
    # If the header's parent is not a <p> (e.g. a table-of-contents link inside a <td>),
    # there is nothing to collect and we return an empty string.
    parent = header.parent
    if parent is None or parent.name != 'p':
        return ''
    paragraphs = ''
    for uncle in parent.next_siblings:
        if isinstance(uncle, NavigableString):
            continue  # skip the whitespace between two tags
        if uncle.name != 'p' or uncle.find('a', attrs={'name': True}):
            break  # heuristic (assumption): stop at the next anchored title or non-<p> tag
        paragraphs += str(uncle)
    return paragraphs
    

In [32]:
def extract_paragraphs(soup):
    for header in get_offering_header_candidates(soup):
        paragraphs = get_risk_info(header)
        if paragraphs:
            return paragraphs
        else:
            print('try again')

In [33]:
extract_paragraphs(cleaned_content)

try again
try again
try again


## Approach 2: 

We can get the sourceline and source position of each title in a document by using an html parser. Then we identify the target_titles, their coordinates, and store the coordinates of target_titles and target_titles + 1 in a dataframe.

After creating the dataframe, we can parse everything with coordinates in between target_titles and target_titles + 1.

In [20]:
soup = BeautifulSoup(content, 'html.parser')
soup.find_all('a')

[<a href="#v_001"><font style="font-size: 10pt">ABOUT
     THIS PROSPECTUS SUPPLEMENT</font></a>,
 <a href="#v_002"><font style="font-size: 10pt">CAUTIONARY
     STATEMENT CONCERNING FORWARD-LOOKING STATEMENTS</font></a>,
 <a href="#v_003"><font style="font-size: 10pt">PROSPECTUS
     SUPPLEMENT SUMMARY</font></a>,
 <a href="#v_004"><font style="font-size: 10pt">THE
     OFFERING</font></a>,
 <a href="#v_005"><font style="font-size: 10pt">RISK
     FACTORS</font></a>,
 <a href="#g_001"><font style="font-size: 10pt">CAPITALIZATION</font></a>,
 <a href="#g_002"><font style="font-size: 10pt">USE
     OF PROCEEDS</font></a>,
 <a href="#g_003"><font style="font-size: 10pt">DIVIDEND
     POLICY</font></a>,
 <a href="#g_004"><font style="font-size: 10pt">DILUTION</font></a>,
 <a href="#g_005"><font style="font-size: 10pt">UNDERWRITING</font></a>,
 <a href="#g_006"><font style="font-size: 10pt">LEGAL
     MATTERS</font></a>,
 <a href="#g_007"><font style="font-size: 10pt">EXPERTS</font></a>,
 

In [21]:
positions = []
for tag in soup.find_all('a'):
    positions.append((tag.sourceline, tag.sourcepos,tag.string))
positions

[(157, 114, 'ABOUT\n    THIS PROSPECTUS SUPPLEMENT'),
 (161, 114, 'CAUTIONARY\n    STATEMENT CONCERNING FORWARD-LOOKING STATEMENTS'),
 (165, 114, 'PROSPECTUS\n    SUPPLEMENT SUMMARY'),
 (169, 114, 'THE\n    OFFERING'),
 (173, 114, 'RISK\n    FACTORS'),
 (177, 114, 'CAPITALIZATION'),
 (180, 114, 'USE\n    OF PROCEEDS'),
 (184, 114, 'DIVIDEND\n    POLICY'),
 (188, 114, 'DILUTION'),
 (191, 114, 'UNDERWRITING'),
 (194, 114, 'LEGAL\n    MATTERS'),
 (198, 114, 'EXPERTS'),
 (201, 114, 'WHERE\n    YOU CAN FIND MORE INFORMATION'),
 (205, 114, 'INFORMATION\n    INCORPORATED BY REFERENCE'),
 (219, 114, 'ABOUT\n    THIS PROSPECTUS'),
 (223, 114, 'PROSPECTUS\n    SUMMARY'),
 (227, 114, 'SECURITIES\n    WE MAY OFFER'),
 (231, 57, 'RISK FACTORS'),
 (234, 57, 'SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS'),
 (237, 57, 'USE OF PROCEEDS'),
 (240, 57, 'DIVIDEND POLICY'),
 (243, 57, 'DESCRIPTION OF CAPITAL STOCK'),
 (246, 57, 'DESCRIPTION OF WARRANTS'),
 (249, 57, 'DESCRIPTION OF UNITS'),
 (252, 57, 

Now I just have to store the coordinates and names of the target_titles and target_titles+1 in a dataframe.

In [22]:
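limit = len(positions)
# r'\s+' collapses runs of whitespace; r'\s*' also matches the empty string
# and would insert a space between every character
pattern = re.compile(r'\s+')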
listing = []

for i in range(0, limit-1):
    if positions[i][2] is not None:
        target = positions[i][2].replace('\n', '')
        target = re.sub(pattern, ' ', target)
        if target == 'RISK FACTORS':
            listing.append(positions[i])
            listing.append(positions[i+1])

In [23]:
listing

[(173, 114, 'RISK\n    FACTORS'),
 (177, 114, 'CAPITALIZATION'),
 (231, 57, 'RISK FACTORS'),
 (234, 57, 'SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS'),
 (2717, 75, 'RISK FACTORS'),
 (2720, 75, 'SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS')]

In [24]:
df = pd.DataFrame(listing)
df.columns = ['source-line', 'source-position', 'tag']
df

| | source-line | source-position | tag |
|---|---|---|---|
| 0 | 173 | 114 | RISK\n FACTORS |
| 1 | 177 | 114 | CAPITALIZATION |
| 2 | 231 | 57 | RISK FACTORS |
| 3 | 234 | 57 | SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS |
| 4 | 2717 | 75 | RISK FACTORS |
| 5 | 2720 | 75 | SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS |


Now that we have the coordinates that delimit the content we are interested in, we can set up a function that parses everything that is in between.
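A function of that kind could look roughly like the sketch below, which only uses the standard library. The names `text_between_lines` and `_TextCollector` are hypothetical. It assumes the line numbers are 1-based and count `\n` characters, which is how `sourceline` behaves with `html.parser`, and it ignores `sourcepos`, so it works on whole lines only.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects the non-empty text chunks of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            # Collapse internal whitespace and line breaks
            self.chunks.append(' '.join(data.split()))

def text_between_lines(html, start_line, end_line):
    """Return the text found from start_line up to (not including) end_line."""
    lines = html.split('\n')
    fragment = '\n'.join(lines[start_line - 1:end_line - 1])
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return collector.chunks
```

Note that the function returns whatever lies between the two coordinates it is given: if both coordinates come from table-of-contents links, the result is just the table-of-contents entries, not the section text.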

In [25]:
for line in soup.find_all('i'):
    print(line.sourceline, line.sourcepos, line.string)

702 31 Risk Factors
735 179  
737 158 This
summary highlights selected information contained elsewhere or incorporated by reference in this prospectus supplement and the
accompanying prospectus. This summary may not contain all of the information that may be important to you. You should read this
prospectus supplement, the accompanying prospectus, the information incorporated by reference in each, and any related free writing
prospectus before making an investment decision. You should pay special attention to the “Risk Factors” section beginning
on page S-9 of this prospectus supplement and “Risk Factors” set forth in our most recent annual report on Form 10-K
for the year ended December 31, 2019, as updated by our recent Form 8-K Report filed with the Securities and Exchange Commission
on May 8, 2020 and in our Quarterly Reports on Form 10-Q for the quarters ended March 31, 2020, June 30, 2020, and September 30,
2020, respectively and in the other documents which are incorporated by r

In [29]:
for line in soup.find_all('p'):
    print(line.sourceline, line.sourcepos,line.string)

12 0  
14 0 None
16 0 Filed
Pursuant to Rule 424(b)(5)
19 0 Registration
No. 333-235763
22 0 None
24 0  
26 0 PROSPECTUS
SUPPLEMENT
29 0 (To
Prospectus dated January 13, 2020)
32 0  
34 0 None
36 0  
38 0 None
41 0  
43 0 Pursuant
to this prospectus supplement and the accompanying prospectus, we are offering shares of our common stock, par value $0.001 per
share, (the “Common Stock”).
47 0  
49 0 Our
common stock is listed on the NYSE American under the symbol “OGEN.” The last reported sale price of our common stock
on the NYSE American on November 19, 2020 was $0.526 per share. We have applied to list the shares being sold in
this offering on the NYSE American. There can be no assurances that the NYSE American will grant the application.
54 0  
56 0 Investing
in our securities involves a high degree of risk. Before buying any securities, you should carefully read the discussion of material
risks of investing in our common stock under the heading “Risk Factors” beginning on page S-9 of

Apparently sections in the document that are next to each other have very different sourcelines and positions. Maybe all 'a' tags are grouped, and the same with all other types of tags.

In [55]:
## maybe try this again but using re.finditer?
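A possible explanation for the observation above is that every `a` tag collected so far is a table-of-contents link (`href="#v_005"`), while the section itself starts at the matching named anchor (`<a name="v_005"></a>`), which has no text and therefore shows up with `string` equal to `None`. If that is right, another direction is to follow the `href` to its named anchor and collect text until the next named anchor. The sketch below does this with the standard library only; the class and function names are hypothetical, and it assumes sections are delimited by `<a name=...>` anchors, which may not hold for every filing.

```python
from html.parser import HTMLParser

class SectionText(HTMLParser):
    """Collects text between <a name=target> and the next <a name=...>."""
    def __init__(self, target):
        super().__init__()
        self.target = target
        self.active = False   # True while we are inside the target section
        self.done = False     # True once the next named anchor was reached
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a' or self.done:
            return
        name = dict(attrs).get('name')
        if name is None:
            return  # ordinary link (href only), not a section anchor
        if name == self.target:
            self.active = True
        elif self.active:
            self.active = False
            self.done = True

    def handle_data(self, data):
        if self.active and data.strip():
            self.chunks.append(' '.join(data.split()))

def section_text(html, href):
    """href is the value of a table-of-contents link, e.g. '#v_005'."""
    parser = SectionText(href.lstrip('#'))
    parser.feed(html)
    parser.close()
    return parser.chunks
```

The first returned chunk is the section title itself, so it can be dropped or used as a sanity check against the text of the table-of-contents link.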