# Text Mining for Stock Prediction

## Data extraction:


We should consolidate: 
* Publication date
* Main extracts from the publication
* Closing stock price before the publication
* Opening stock price of day after the publication was filed

To do:
* Generalize 424B5 functions more. 
* Solve 8-K get_filing_documents problem. 

## BERT 
Original repository: https://github.com/google-research/bert

### Pre-training
* We have to determine the corpus on which we are going to pre-train BERT
* We need separate sentences for the NSP task, and we need to tokenize the sentences for the MLM task (see https://github.com/google-research/bert/blob/master/create_pretraining_data.py)
* See methodology section for detailed requirements. 

To do:
* Mock-train BERT model to work through the data processing stage.
* Pre-process training data

### Fine-tuning
For fine-tuning Bert for our particular classification task, we are going to set-up an additional layer that is going to take in the text files and predicts whether the stock price of the company (should we denote which company filed what?) that filed a particular publication is going to increase, decrease or remain stable.

#### Requirements:
* Publication date
* Sentences regarding the risk factors
* Closing stock price before publication date
* Opening stock price after publication date
* Label (e.g. increase, decrease, stable)

To do:
* What is the format the training data should have?

## Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString
from htmllaundry import sanitize
from htmllaundry.cleaners import LaundryCleaner
import htmllaundry.utils
import xmltodict
import re
import json
from pprint import pprint
import pandas as pd
from glob import glob
from cachecontrol import CacheControl
from IPython.display import HTML
import unicodedata
from bs4 import Comment

 # Scraping SEC filings

In [2]:
sess = requests.session()
cach = CacheControl(sess)

In [3]:
from IPython.display import HTML

def window(html):
    s = '<script type="text/javascript">'
    s += 'var win = window.open("", "", "toolbar=no, location=no, directories=no, status=no, menubar=no, scrollbars=yes, resizable=yes, width=780, height=200, top="+(screen.height-400)+", left="+(screen.width-840));'
    s += 'win.document.body.innerHTML = \'' + html.replace("\n",'\\n').replace("'", "\\'") + '\';'
    s += '</script>'
    return HTML(s)

## Cleaning SEC encoding

In [4]:
CustomCleaner = LaundryCleaner(
            page_structure=False,
            remove_unknown_tags=False,
            allow_tags=['blockquote', 'a', 'i', 'em', 'p', 'b', 'strong',
                        'h1', 'h2', 'h3', 'h4', 'h5', 
                        'ul', 'ol', 'li', 
                        'sub', 'sup',
                        'abbr', 'acronym', 'dl', 'dt', 'dd', 'cite',
                        'dft', 'br', 
                        'table', 'tr', 'td', 'th', 'thead', 'tbody', 'tfoot'],
            safe_attrs_only=True,
            add_nofollow=True,
            scripts=True,
            javascript=True,
            comments=True,
            style=True,
            links=False,
            meta=True,
            processing_instructions=False,
            frames=True,
            annoying_tags=False)

In [5]:
## The SEC is encoded in CP1252, and it is recommended to use UTF-8 always.
##see: https://www.w3.org/International/questions/qa-what-is-encoding
###### https://www.w3.org/International/articles/definitions-characters/#unicode
###### https://www.w3.org/International/questions/qa-choosing-encodings

def reformat_cp1252(match):
    codePoint = int(match.group(1))
    if 128 <= codePoint <= 159:
        return bytes([codePoint])
    else:
        return match.group()

def clean_sec_content(binary):
    return re.sub(b'&#(\d+);', reformat_cp1252, binary, flags=re.I).decode("windows-1252").encode('utf-8').decode('utf-8')

In [6]:
## this is to normalize urls
def slugify(value):
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub('[^\w\s\.\-]', '-', value).strip().lower()
    value = re.sub('[-\s]+', '-', value)
    return value

## Cleaning html

In [7]:
def read_html(file):
    with open(file, 'r') as f: return f.read()

In [8]:
def clean_html(html):
    soup = BeautifulSoup(html)
    if not soup.find('p'):
        for div in soup.find_all('div'):
            div.name = 'p'
    for b in soup.find_all('b'):
        b.name = 'strong'
    for f in soup.find_all('font', style=re.compile('font-weight:\s*bold')):
        f.name = 'strong'
    for footer in soup.find_all(class_=['header', 'footer']): 
        try: footer.decompose()
        except: pass
    san = sanitize(str(soup), CustomCleaner)
    soup = BeautifulSoup(san)
    def decompose_parent(el, parent='p', not_grandparent='table'):
        try:
            parent = el.find_parent(parent)
        except: parent = None
        if not parent: return
        grandparent = parent.find_parent('table')
        if grandparent: return
        parent.decompose()
    for el in soup.find_all(text=lambda x: 'table of contents' == str(x).lower().strip()):
        decompose_parent(el, 'a')
    for el in soup.find_all(text=re.compile(r'^\s*S\-(\d+|[ivxlcdm]+)\s*$')): 
        decompose_parent(el, 'p')
    for el in soup.find_all(text=re.compile(r'^\s*\d+\s*$')): 
        decompose_parent(el, 'p')
    return soup

## Defining helper functions to download files

In [9]:
def pagination_provider_by_element_start_count(find_args, find_kwargs):
    def pagination_provider_by_element_start_count_wrapped(soup, params):
        if soup.find(*find_args, **find_kwargs) is None: 
            return None
        params['start'] += params['count'] 
        return params
    return pagination_provider_by_element_start_count_wrapped

In [10]:
def params_provider_by_dict(params):
    return lambda : params

In [11]:
## look for a table, converts it to text and removes line breaks
def table_provider_by_summary(summary, header=0, index_col=0):
    return lambda soup: pd.read_html(
        str(soup.find('table', summary=summary)).replace('<br>', '<br>\n'), header=header, index_col=index_col)[0]

In [12]:
def get_sec_table(url,
                  table_provider=None,
                  base_params={}, 
                  params_provider=None,
                  pagination_provider=None,
                  replace_links=True,
                  session=None):
    ### this function returns a tuple of a df with the respective soup element
    def return_data_frame(session, url, params, provider):
        request = session.get(url, params=params)
        soup = BeautifulSoup(request.text)
        if replace_links:
            for a in soup.find_all('a'):
                parent = a.find_parent('td')
                if parent: parent.string = a['href']
        df = provider(soup)
        return df, soup
    ####################################################################
    ###if no Session, then we use the base_url to do the pull request###
    ####################################################################
    if session is None:
        session = cach
    if not url.startswith('http://') and not url.startswith('https://'):
        url = base_url.format(url)
    ###############################################################################################    
    ###if the specified parameters are a dictionary, update params with the specified parameters###
    ###############################################################################################
    params = dict(base_params)
    if params_provider:
        if isinstance(params_provider, dict):
            params.update(params_provider)
        else:
            params.update(params_provider())
    ### in case you only scrape one page, it will just return the df of the respective page###
    if not pagination_provider:
        df, soup = return_data_frame(session, url, params, table_provider)
        return df
    ### in case you want to scrape multiple pages, create an empty list of dfs and add each df from each 
    ### page to the empty list, at the end you just concatenate all of the dfs
    else:
        data_frames = []
        page_params = dict(params)
        while True:
            df, soup = return_data_frame(session, url, page_params, table_provider)
            data_frames.append(df)
            # Make sure columns retain their names
            data_frames[-1].columns = data_frames[0].columns
            new_params = pagination_provider(soup, page_params)
            if not new_params:
                break
            else:
                page_params.update(new_params)
        return pd.concat(data_frames, sort=False, ignore_index=True)

## Function to get the documents

This is a function to get the documents in the filing details page for each filing. See below for an example page.


In [13]:
get_filing_documents = lambda url, summary = 'Document Format Files' : get_sec_table(url,
                                                                                    table_provider = table_provider_by_summary(summary, index_col=None),
                                                                                    pagination_provider = pagination_provider_by_element_start_count(('input',), {'value': 'Next 100'}))

## Scraping most recent filings

For the previous 5 days

In [14]:
base_url = 'https://www.sec.gov{}'

In [15]:
def get_current_events(days_before=0, form_type=''):
    soup = BeautifulSoup(cach.get(base_url.format('/cgi-bin/current'), 
                            params={'q1': days_before, 'q2': 0, 'q3': form_type}).text)
    pre = soup.find('pre')
    ls = []
    for line in str(pre).replace('<hr>', '\n').replace('<hr/>', '\n').split('\n'):
        bs_line = BeautifulSoup(line)
        clean_line = '  '.join(item.strip() for item in bs_line.find_all(text=True))
        split_line = [ x.strip() for x in clean_line.split('  ') if x.strip() ]
        split_line += [ a.get('href') for a in bs_line.find_all('a') ]
        if not all(x is None for x in split_line): ls.append(split_line)
    colnames = ls[0] + [ 'link_{}'.format(i) for i in range(max(len(l) for l in ls) - len(ls[0])) ]
    return pd.DataFrame(ls[1:], columns=colnames)

In [16]:
get_current_events(form_type='8-K').head()

Unnamed: 0,Date Filed,Form,CIK Code,Company Name,link_0,link_1
0,01-08-2021,8-K/A,1599407,1847 Holdings LLC,/Archives/edgar/data/1599407/0001213900-21-001...,browse-edgar?action=getcompany&CIK=1599407
1,01-08-2021,8-K,1634452,AB Private Credit Investors Corp,/Archives/edgar/data/1634452/0001193125-21-005...,browse-edgar?action=getcompany&CIK=1634452
2,01-08-2021,8-K,1300938,"ABCO Energy, Inc.",/Archives/edgar/data/1300938/0001185185-21-000...,browse-edgar?action=getcompany&CIK=1300938
3,01-08-2021,8-K/A,1813658,ACE Convergence Acquisition Corp.,/Archives/edgar/data/1813658/0001104659-21-002...,browse-edgar?action=getcompany&CIK=1813658
4,01-08-2021,8-K,1004434,"AFFILIATED MANAGERS GROUP, INC.",/Archives/edgar/data/1004434/0001193125-21-005...,browse-edgar?action=getcompany&CIK=1004434


## Downloading SEC documents

Questions: 
* What is the purpose of defining a directory? It does not seem to work when I use it as a parameter for download_sec_documents

* What does the error "index 0 is out of bounds for axis 0 with size 0" mean? I still manage to download the files.

In [50]:
def download_sec_documents(doc_link):
    contents = clean_sec_content(cach.get(base_url.format(doc_link)).content)
    name = slugify(doc_link)
    with open(name, 'w') as f: f.write(contents)
        

#### Downloading 424B5s

In [17]:
forms = get_current_events(0, '424B5')
for link in forms['link_0']:
    docs = get_filing_documents(base_url.format(link))
    doc_link = docs.loc[docs.Type == '424B5', 'Document'].values[0]
    download_sec_documents(doc_link)

#### Downloading 8-Ks

In [34]:
num_days = 2

for p in range(0, num_days):
    print('Scraping day-page:', p)
    forms = get_current_events(p, '8-K')
    for link in forms['link_0']:
        docs = get_filing_documents(base_url.format(link))
        doc_link = docs.loc[docs.Type == '8-K', 'Document'].values[0]
        download_sec_documents(doc_link)

Scraping day-page: 0


IndexError: index 0 is out of bounds for axis 0 with size 0

To download the 8-K I always get the index error above, however when I try to download 424B5 filings this is not a issue. This happens in the filing details page: 

8-K example: https://www.sec.gov/Archives/edgar/data/926660/0001193125-20-298746-index.html

424B5 example: https://www.sec.gov/Archives/edgar/data/1035443/0001047469-19-001263-index.html


## Extracting risk factors

There is a difference between 424B5 and 8-K forms. The 424B5 forms have summary tables, but we are interested in extracting the text from both forms, not necessarily the tables.

File that I am going to be working with: https://www.sec.gov/Archives/edgar/data/1174940/000149315220022121/form424b5.htm

### Parsing 424B5 filings

In [12]:
content1 = read_html('-archives-edgar-data-1174940-000149315220022121-form424b5.htm')
cleaned_content1 = clean_html(content1)

First we need to get the headers that delimit the section that includes the description of the risk factors. Once we have the headers, we start looping through every "p" tag that is in between the two headers.

We have to generalize how we find the delimiting headers and insert them inside the function get_risk_info to stop parsing the text on other sections. 

In [17]:
def get_header(soup):
    for header in soup.find_all():
        match = re.match(r'RISK\s*FACTORS\s*', header.text.upper(), re.M) 
        if match:
            if header.name == 'p':
                parent = header.parent
                if parent.name == 'body': ## this is to make it easier to get the delimiting header, since it must be a sibling of header
                    return header

In [18]:
def get_delimiter_header(content):
    soup = BeautifulSoup(content, 'html.parser') #we use the html parser to list the headers with the sourceline
    positions = []
    for tag in soup.find_all('a'):
        positions.append(tag.string)
    limit = len(positions)
    listing = []
    for i in range(0, limit-1):
        if positions[i] is not None:
            match = re.match(r'RISK\s*FACTORS\s*', str(positions[i]).upper(), re.M) #we spot the target header and select the header that follows it
            if match:
                if positions[i+1] is not None:
                    listing.append(positions[i+1].upper())
                if positions[i+2] is not None:
                    listing.append(positions[i+2].upper())
    return listing

In [293]:
def get_risk_info(header, content):
    paragraphs = ''
    brother = header.next_sibling
    limit = False
    delimiters = get_delimiter_header(content)
    while brother and limit == False:
        if brother.name == 'p' or brother.name == 'a': # we have to select the tag to use the .get_text attr
            if str(brother.get_text(strip = True)).replace('\n', ' ') in delimiters:
                limit = True
        paragraphs += str(brother)
        brother = brother.next_sibling
    return paragraphs

In [20]:
def extract_paragraphs(soup, content): 
    header = get_header(soup)
    content_s = get_risk_info(header, content)
    return content_s

Find out a way to extract the content through extract_paragraphs. 

Now I am going to try to parse all the downloaded documents.

In [14]:
filings_424b5 = []

for html_file in glob('*424b5*.htm'):
    print(html_file)
    content = read_html(html_file)
    cleaned_content = clean_html(content)
    paragraphs = get_risk_info(get_header(cleaned_content),content)
    if paragraphs:
        filings_424b5 += str(paragraphs)
        f = open("4245B5","a")
        f.write(str(paragraphs))
        f.close()
        print('Great success')
                

-archives-edgar-data-1674930-000156459020054810-flgt-424b5.htm
Great success
-archives-edgar-data-1558583-000121390020038382-ea130283-424b5_arcimoto.htm
Great success
-archives-edgar-data-864270-000119312520298619-d97515d424b5.htm
Great success
-archives-edgar-data-1692412-000119312520298480-d83042d424b5.htm
Great success
-archives-edgar-data-1410098-000121390020039655-ea130437-424b5_cormedix.htm
Great success
-archives-edgar-data-310764-000119312520299402-d30647d424b5.htm
Great success
-archives-edgar-data-1636651-000119312520299425-d947308d424b5.htm
Great success
-archives-edgar-data-1623526-000119312520299388-d42479d424b5.htm
Great success
-archives-edgar-data-1696396-000119312520299465-d93087d424b5.htm
Great success
-archives-edgar-data-1174940-000149315220022121-form424b5.htm
Great success
-archives-edgar-data-1316517-000121390020038504-ea130316-424b5_kanditechno.htm
Great success
-archives-edgar-data-1182534-000119312520304490-d63319d424b5.htm
Great success
-archives-edgar-data-1

In [15]:
f = open("4245B5", "r")
test = f.read()

The code does work, but the text still includes the html tags.

In [21]:
def cleanhtml(raw_html):
    cleanr = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});|\n')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

In [22]:
texty = cleanhtml(test)

In [23]:
window(str(texty))

## Parsing 8-K files

The 8-K filings have a different structure so we are just going to refine the functions to work with the 8-K filing structure. 

In [55]:
test = read_html("-archives-edgar-data-1577552-000104746921000247-a2242840z424b5.htm")
testc = clean_html(test)

In [26]:
file2 = read_html("-archives-edgar-data-1641751-000164175120000033-form-8k.htm")
file2c = clean_html(file2)

In [22]:
def delim_8k(soup):
    delim = []
    for header in soup.find_all():
        match = re.match(r'Exhibit.*', header.text)
        if match:
            delim.append(header.text)
    return delim


def get_8k_info1(soup):
    paragraphs = ''
    delimiters = delim_8k(soup)
    for header in soup.find_all():
        match = re.match(r'Item.*', header.text)
        if match:
            if header.name == 'p':
                brother = header.next_sibling
                stop = False
                while brother and stop == False:
                    if brother.name == 'table':
                        stop = True
                    if str(brother).strip() in delimiters:
                        stop = True
                    paragraphs += str(brother)
                    brother = brother.next_sibling
                return paragraphs



def get_8k_info(soup):
    paragraphs = []
    for header in soup.find_all():
        match = re.match(r'Item.*', header.text)
        if match:
            if header.name == 'p':
                brother = header.next_sibling
                already_seen = False
                while True:
                    if brother is None:
                        break
                    if type(brother) == NavigableString:
                        if already_seen and str(brother).strip() != '': break
                        brother = brother.next_sibling
                        continue
                    if brother.name == 'table': break
                    already_seen = True
                    paragraphs.append(str(brother))
                    brother = brother.next_sibling
                return(str(paragraphs))

In [23]:
filings_8k = []

for html_file in glob('*8k*.htm'):
    print(html_file)
    content = read_html(html_file)
    cleaned_content = clean_html(content)
    paragraphs = get_8k_info(cleaned_content)
    if paragraphs:
        filings_8k.append(paragraphs)
        f = open("8ks","a")
        f.write(paragraphs)
        f.close()
        print('Great success')
                

In [45]:
f = open("8ks", "r")
test = f.read()

# Consolidating training data

Since we will be doing a supervised classification task for the fine-tuning stage of the project, we need to identify each filing with the ticker of the respective company, as well as the prices before and after the publication. 

This section is merely to organize the training data in order to see that it is complete, and working.

In [48]:
# First we are going to get the information about the companies that filed the filings that
# we are going scrape.

num_days = 5
for i in range(0,num_days):
    print('Scraping day-page:', i)
    forms = get_current_events(p, '424B5')

forms.insert(6, 'Text', 'NA')
forms

Scraping day-page: 0
Scraping day-page: 1
Scraping day-page: 2
Scraping day-page: 3
Scraping day-page: 4


Unnamed: 0,Date Filed,Form,CIK Code,Company Name,link_0,link_1,Text
0,02-04-2021,424B5,1697862,ARGENX SE,/Archives/edgar/data/1697862/0001047469-21-000...,browse-edgar?action=getcompany&CIK=1697862,
1,02-04-2021,424B5,1787306,"Arcutis Biotherapeutics, Inc.",/Archives/edgar/data/1787306/0001628280-21-001...,browse-edgar?action=getcompany&CIK=1787306,
2,02-04-2021,424B5,11544,BERKLEY W R CORP,/Archives/edgar/data/11544/0001193125-21-02858...,browse-edgar?action=getcompany&CIK=11544,
3,02-04-2021,424B5,1167583,BP CAPITAL MARKETS AMERICA INC,/Archives/edgar/data/1167583/0001193125-21-028...,browse-edgar?action=getcompany&CIK=1167583,
4,02-04-2021,424B5,313807,BP PLC,/Archives/edgar/data/313807/0001193125-21-0285...,browse-edgar?action=getcompany&CIK=313807,
5,02-04-2021,424B5,1173204,Cinedigm Corp.,/Archives/edgar/data/1173204/0001104659-21-012...,browse-edgar?action=getcompany&CIK=1173204,
6,02-04-2021,424B5,771999,DOCUMENT SECURITY SYSTEMS INC,/Archives/edgar/data/771999/0001493152-21-0026...,browse-edgar?action=getcompany&CIK=771999,
7,02-04-2021,424B5,1082038,DURECT CORP,/Archives/edgar/data/1082038/0001193125-21-027...,browse-edgar?action=getcompany&CIK=1082038,
8,02-04-2021,424B5,1746466,"Equillium, Inc.",/Archives/edgar/data/1746466/0001193125-21-028...,browse-edgar?action=getcompany&CIK=1746466,
9,02-04-2021,424B5,1713832,HyreCar Inc.,/Archives/edgar/data/1713832/0001213900-21-006...,browse-edgar?action=getcompany&CIK=1713832,


Now we are going to download the filings in the urls extracted, get the risk factors section and add it to the dataframe. The functions to extract the risk factors work for about 75% of the filings, I am having some issues with files that have odd formatting.

In [52]:
for link in forms['link_0']:
    docs = get_filing_documents(base_url.format(link))
    doc_link = docs.loc[docs.Type == '424B5', 'Document'].values[0]
    download_sec_documents(doc_link)

In [250]:
files = []

for html_file in glob('*424b5*.htm'):
    files.append(html_file)
    
for i in range(0, len(forms)):
    CIK = forms['CIK Code'][i]
    for file in files:
        match = re.search(CIK, file)
        if match:
            content = read_html(file)
            cleaned_content = clean_html(content)
            paragraphs = get_risk_info1(get_header(cleaned_content), content)
            if paragraphs:
                forms['Text'][i]= cleanhtml(paragraphs)

In [251]:
forms

Unnamed: 0,Date Filed,Form,CIK Code,Company Name,link_0,link_1,Text
0,02-04-2021,424B5,1697862,ARGENX SE,/Archives/edgar/data/1697862/0001047469-21-000...,browse-edgar?action=getcompany&CIK=1697862,
1,02-04-2021,424B5,1787306,"Arcutis Biotherapeutics, Inc.",/Archives/edgar/data/1787306/0001628280-21-001...,browse-edgar?action=getcompany&CIK=1787306,You should consider carefully the risks descri...
2,02-04-2021,424B5,11544,BERKLEY W R CORP,/Archives/edgar/data/11544/0001193125-21-02858...,browse-edgar?action=getcompany&CIK=11544,"Before you invest in the debentures, you shoul..."
3,02-04-2021,424B5,1167583,BP CAPITAL MARKETS AMERICA INC,/Archives/edgar/data/1167583/0001193125-21-028...,browse-edgar?action=getcompany&CIK=1167583,
4,02-04-2021,424B5,313807,BP PLC,/Archives/edgar/data/313807/0001193125-21-0285...,browse-edgar?action=getcompany&CIK=313807,Investing in the securities offered using this...
5,02-04-2021,424B5,1173204,Cinedigm Corp.,/Archives/edgar/data/1173204/0001104659-21-012...,browse-edgar?action=getcompany&CIK=1173204,Before you invest in shares of our ClassA comm...
6,02-04-2021,424B5,771999,DOCUMENT SECURITY SYSTEMS INC,/Archives/edgar/data/771999/0001493152-21-0026...,browse-edgar?action=getcompany&CIK=771999,Ourbusiness is influenced by many factors that...
7,02-04-2021,424B5,1082038,DURECT CORP,/Archives/edgar/data/1082038/0001193125-21-027...,browse-edgar?action=getcompany&CIK=1082038,"Before you invest in our common stock, in addi..."
8,02-04-2021,424B5,1746466,"Equillium, Inc.",/Archives/edgar/data/1746466/0001193125-21-028...,browse-edgar?action=getcompany&CIK=1746466,Investing in our securities involves a high de...
9,02-04-2021,424B5,1713832,HyreCar Inc.,/Archives/edgar/data/1713832/0001213900-21-006...,browse-edgar?action=getcompany&CIK=1713832,Investing in ourcommon stock involves a high d...


Now I am going to get the tickers of these companies based on their CIK Code, which was what we got from the SEC.

In [252]:
tickers = cach.get(base_url.format('/files/company_tickers.json')).json()
tickers = pd.DataFrame(tickers)
tickers

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,10896,10897,10898,10899,10900,10901,10902,10903,10904,10905
cik_str,320193,789019,1018724,1652044,1293451,1318605,1326801,1577552,1046179,1067983,...,1819794,1819615,1819608,1819608,1819516,1819516,1819574,1819574,1819584,1819584
ticker,AAPL,MSFT,AMZN,GOOG,TCEHY,TSLA,FB,BABA,TSM,BRK-A,...,HTOOW,CLVRW,AVAN-UN,AVAN-WT,ASPL-UN,ASPL-WT,STIC-UN,STIC-WT,SNPR-UN,SNPR-WT
title,Apple Inc.,MICROSOFT CORP,AMAZON COM INC,Alphabet Inc.,Tencent Holdings Ltd,"Tesla, Inc.",Facebook Inc,Alibaba Group Holding Ltd,TAIWAN SEMICONDUCTOR MANUFACTURING CO LTD,BERKSHIRE HATHAWAY INC,...,Fusion Fuel Green PLC,Clever Leaves Holdings Inc.,Avanti Acquisition Corp.,Avanti Acquisition Corp.,Aspirational Consumer Lifestyle Corp.,Aspirational Consumer Lifestyle Corp.,Northern Star Acquisition Corp.,Northern Star Acquisition Corp.,Tortoise Acquisition Corp. II,Tortoise Acquisition Corp. II


In [253]:
tickers_t = tickers.transpose()
tickers_t.head()

Unnamed: 0,cik_str,ticker,title
0,320193,AAPL,Apple Inc.
1,789019,MSFT,MICROSOFT CORP
2,1018724,AMZN,AMAZON COM INC
3,1652044,GOOG,Alphabet Inc.
4,1293451,TCEHY,Tencent Holdings Ltd


In [254]:
tickers_t = tickers_t.drop(['title'], axis = 1)
tickers_t.columns = ['CIK Code', 'Ticker']
tickers_t.head()

Unnamed: 0,CIK Code,Ticker
0,320193,AAPL
1,789019,MSFT
2,1018724,AMZN
3,1652044,GOOG
4,1293451,TCEHY


In [255]:
forms['CIK Code'] = forms['CIK Code'].astype(int)
tickers_t['CIK Code'] = tickers_t['CIK Code'].astype(int)

In [256]:
exp = forms.merge(tickers_t, on = 'CIK Code', how = 'left')
exp

Unnamed: 0,Date Filed,Form,CIK Code,Company Name,link_0,link_1,Text,Ticker
0,02-04-2021,424B5,1697862,ARGENX SE,/Archives/edgar/data/1697862/0001047469-21-000...,browse-edgar?action=getcompany&CIK=1697862,,ARGX
1,02-04-2021,424B5,1787306,"Arcutis Biotherapeutics, Inc.",/Archives/edgar/data/1787306/0001628280-21-001...,browse-edgar?action=getcompany&CIK=1787306,You should consider carefully the risks descri...,ARQT
2,02-04-2021,424B5,11544,BERKLEY W R CORP,/Archives/edgar/data/11544/0001193125-21-02858...,browse-edgar?action=getcompany&CIK=11544,"Before you invest in the debentures, you shoul...",WRB
3,02-04-2021,424B5,11544,BERKLEY W R CORP,/Archives/edgar/data/11544/0001193125-21-02858...,browse-edgar?action=getcompany&CIK=11544,"Before you invest in the debentures, you shoul...",WRB-PC
4,02-04-2021,424B5,11544,BERKLEY W R CORP,/Archives/edgar/data/11544/0001193125-21-02858...,browse-edgar?action=getcompany&CIK=11544,"Before you invest in the debentures, you shoul...",WRB-PE
5,02-04-2021,424B5,11544,BERKLEY W R CORP,/Archives/edgar/data/11544/0001193125-21-02858...,browse-edgar?action=getcompany&CIK=11544,"Before you invest in the debentures, you shoul...",WRB-PD
6,02-04-2021,424B5,11544,BERKLEY W R CORP,/Archives/edgar/data/11544/0001193125-21-02858...,browse-edgar?action=getcompany&CIK=11544,"Before you invest in the debentures, you shoul...",WRB-PF
7,02-04-2021,424B5,11544,BERKLEY W R CORP,/Archives/edgar/data/11544/0001193125-21-02858...,browse-edgar?action=getcompany&CIK=11544,"Before you invest in the debentures, you shoul...",WRB-PG
8,02-04-2021,424B5,1167583,BP CAPITAL MARKETS AMERICA INC,/Archives/edgar/data/1167583/0001193125-21-028...,browse-edgar?action=getcompany&CIK=1167583,,
9,02-04-2021,424B5,313807,BP PLC,/Archives/edgar/data/313807/0001193125-21-0285...,browse-edgar?action=getcompany&CIK=313807,Investing in the securities offered using this...,BP


In [257]:
exp = exp.drop(['link_0', 'link_1'], axis = 1)
exp

Unnamed: 0,Date Filed,Form,CIK Code,Company Name,Text,Ticker
0,02-04-2021,424B5,1697862,ARGENX SE,,ARGX
1,02-04-2021,424B5,1787306,"Arcutis Biotherapeutics, Inc.",You should consider carefully the risks descri...,ARQT
2,02-04-2021,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB
3,02-04-2021,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PC
4,02-04-2021,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PE
5,02-04-2021,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PD
6,02-04-2021,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PF
7,02-04-2021,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PG
8,02-04-2021,424B5,1167583,BP CAPITAL MARKETS AMERICA INC,,
9,02-04-2021,424B5,313807,BP PLC,Investing in the securities offered using this...,BP


## Merging with financial data
<br> </br>

So now the dataframe with the textual data is complete. Now we are going to use the tickers and the date of the publication to get the stock prices we are interested in. 

In [258]:
import yfinance as yf
from datetime import datetime, timedelta

In [259]:
def previous_close_and_next_open(tickers, dates):
    """This function obtains, for each pair of ticker and date, the closing price of the ticker during the 
    day before the given date and the opening price of the ticker for the day next to the reference date.
    
    For the inputs:
    tickers: List of tickers, each represented by a string
    dates: List of dates, each represented in the format %Y-%m-%d (2010-01-24)
    
    The output is a pandas dataframe, with as many rows as specified tickers, and columns Reference Date, Previous Close,
    and Next Open."""
    
    results = pd.DataFrame(columns=['Ticker', "Reference Date", "Previous Close", "Next Open"]).set_index('Ticker')
    for i, t in enumerate(tickers):
        #If date falls in weekends, take Friday and Monday
        extra_add = extra_sub = 0
        if datetime.strptime(dates[i], '%Y-%m-%d').isoweekday() == 6:
            extra_add = 1
        elif datetime.strptime(dates[i], '%Y-%m-%d').isoweekday() == 7:
            extra_sub = 1
                
        yesterday = datetime.strptime(dates[i], '%Y-%m-%d') - timedelta(days=1+ extra_sub)
        tomorrow = datetime.strptime(dates[i], '%Y-%m-%d') + timedelta(days=1 + extra_add)
        
        data = yf.download(t, start=yesterday + timedelta(days=1), end=tomorrow + timedelta(days=1))
        
        previous_close = data.iloc[0]['Close']
        next_open = data.iloc[-1]['Open']

        single = pd.DataFrame({"Reference Date":dates[i], "Previous Close":previous_close, "Next Open":next_open}, index=[t])
        results = results.append(single)
    return results

In [260]:
exp['Date Filed'] = pd.to_datetime(exp['Date Filed'], format = '%m-%d-%Y')
exp

Unnamed: 0,Date Filed,Form,CIK Code,Company Name,Text,Ticker
0,2021-02-04,424B5,1697862,ARGENX SE,,ARGX
1,2021-02-04,424B5,1787306,"Arcutis Biotherapeutics, Inc.",You should consider carefully the risks descri...,ARQT
2,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB
3,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PC
4,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PE
5,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PD
6,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PF
7,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PG
8,2021-02-04,424B5,1167583,BP CAPITAL MARKETS AMERICA INC,,
9,2021-02-04,424B5,313807,BP PLC,Investing in the securities offered using this...,BP


In [261]:
exp['Date Filed'] = exp['Date Filed'].dt.strftime('%Y-%m-%d')

In [262]:
exp

Unnamed: 0,Date Filed,Form,CIK Code,Company Name,Text,Ticker
0,2021-02-04,424B5,1697862,ARGENX SE,,ARGX
1,2021-02-04,424B5,1787306,"Arcutis Biotherapeutics, Inc.",You should consider carefully the risks descri...,ARQT
2,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB
3,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PC
4,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PE
5,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PD
6,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PF
7,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PG
8,2021-02-04,424B5,1167583,BP CAPITAL MARKETS AMERICA INC,,
9,2021-02-04,424B5,313807,BP PLC,Investing in the securities offered using this...,BP


In [269]:
tickers = list(exp['Ticker'])
dates = list(exp['Date Filed'])

In [270]:
tickers

['ARGX',
 'ARQT',
 'WRB',
 'WRB-PC',
 'WRB-PE',
 'WRB-PD',
 'WRB-PF',
 'WRB-PG',
 nan,
 'BP',
 'CIDM',
 'DSS',
 'DRRX',
 'EQ',
 'HYRE',
 'ISR',
 nan,
 'MBRX',
 'OCX',
 'PEB',
 'PEB-PE',
 'PEB-PC',
 'PEB-PF',
 'PEB-PD',
 'PLSE',
 'QTRX',
 'REKR',
 'SYN',
 nan,
 'TGI',
 nan,
 'X',
 'URG']

In [280]:
tickers_cleaned = [i for i in tickers if (str(i) != 'nan' and len(str(i))<5)]
dates_cleaned = exp['Date Filed'][:-(len(tickers)-len(tickers_cleaned))]

In [282]:
len(dates_cleaned)

20

In [283]:
df = previous_close_and_next_open(tickers_cleaned, dates_cleaned)
df.head()

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%********

Unnamed: 0,Reference Date,Previous Close,Next Open
ARGX,2021-02-04,362.440002,370.5
ARQT,2021-02-04,34.630001,35.5
WRB,2021-02-04,63.599998,65.360001
BP,2021-02-04,21.27,21.0
CIDM,2021-02-04,1.63,1.94


In [284]:
df = df.reset_index()
df.columns = ['Ticker', 'Date Filed', 'Previous Close', 'Next Open']

In [285]:
df

Unnamed: 0,Ticker,Date Filed,Previous Close,Next Open
0,ARGX,2021-02-04,362.440002,370.5
1,ARQT,2021-02-04,34.630001,35.5
2,WRB,2021-02-04,63.599998,65.360001
3,BP,2021-02-04,21.27,21.0
4,CIDM,2021-02-04,1.63,1.94
5,DSS,2021-02-04,4.38,3.02
6,DRRX,2021-02-04,2.87,2.57
7,EQ,2021-02-04,6.62,7.88
8,HYRE,2021-02-04,15.62,14.99
9,ISR,2021-02-04,1.89,1.6


In [286]:
exp1 = exp.merge(df, on ='Ticker', how = 'left')
exp1

Unnamed: 0,Date Filed_x,Form,CIK Code,Company Name,Text,Ticker,Date Filed_y,Previous Close,Next Open
0,2021-02-04,424B5,1697862,ARGENX SE,,ARGX,2021-02-04,362.440002,370.5
1,2021-02-04,424B5,1787306,"Arcutis Biotherapeutics, Inc.",You should consider carefully the risks descri...,ARQT,2021-02-04,34.630001,35.5
2,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB,2021-02-04,63.599998,65.360001
3,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PC,,,
4,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PE,,,
5,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PD,,,
6,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PF,,,
7,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB-PG,,,
8,2021-02-04,424B5,1167583,BP CAPITAL MARKETS AMERICA INC,,,,,
9,2021-02-04,424B5,313807,BP PLC,Investing in the securities offered using this...,BP,2021-02-04,21.27,21.0


In [292]:
exp1.dropna(axis = 0, how = 'any', inplace = True)

Unnamed: 0,Date Filed_x,Form,CIK Code,Company Name,Text,Ticker,Date Filed_y,Previous Close,Next Open
0,2021-02-04,424B5,1697862,ARGENX SE,,ARGX,2021-02-04,362.440002,370.5
1,2021-02-04,424B5,1787306,"Arcutis Biotherapeutics, Inc.",You should consider carefully the risks descri...,ARQT,2021-02-04,34.630001,35.5
2,2021-02-04,424B5,11544,BERKLEY W R CORP,"Before you invest in the debentures, you shoul...",WRB,2021-02-04,63.599998,65.360001
9,2021-02-04,424B5,313807,BP PLC,Investing in the securities offered using this...,BP,2021-02-04,21.27,21.0
10,2021-02-04,424B5,1173204,Cinedigm Corp.,Before you invest in shares of our ClassA comm...,CIDM,2021-02-04,1.63,1.94
11,2021-02-04,424B5,771999,DOCUMENT SECURITY SYSTEMS INC,Ourbusiness is influenced by many factors that...,DSS,2021-02-04,4.38,3.02
12,2021-02-04,424B5,1082038,DURECT CORP,"Before you invest in our common stock, in addi...",DRRX,2021-02-04,2.87,2.57
13,2021-02-04,424B5,1746466,"Equillium, Inc.",Investing in our securities involves a high de...,EQ,2021-02-04,6.62,7.88
14,2021-02-04,424B5,1713832,HyreCar Inc.,Investing in ourcommon stock involves a high d...,HYRE,2021-02-04,15.62,14.99
15,2021-02-04,424B5,728387,"Isoray, Inc.",Investinginoursecuritiesinvolvesahighdegreeofr...,ISR,2021-02-04,1.89,1.6
