# Machine learning for finance
---
## Lecture 11: Topic and sequence modeling
---

**Damien Ackerer**

Fall 2019

École Polytechnique Fédérale de Lausanne


## Table of contents

  * web scraping
  * topic modeling
      * $n$-grams
  * sequence modeling
      * word2vec

# Web scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

The procedure is as follows:
  * fetch a web page
  * extract information from it
  
The first step is *in general* straightforward and uses HTTP. The second step typically involves parsing, searching, and reformatting the downloaded unstructured content to produce a structured dataset.

<img src="img/webscrapping-banner.png" alt="drawing" width="800"/>

## The U.S. Securities and Exchange Commission (SEC)

The SEC is an independent agency of the United States federal government that is responsible for enforcing the federal securities laws, proposing securities rules, and regulating the securities industry.

The SEC has a three-part mission: to protect investors; maintain fair, orderly, and efficient markets; and facilitate capital formation.

Public companies, funds, and large shareholders must publish and/or notify the SEC of major changes possibly affecting investors. In addition, public companies periodically publish reports about their activities.

These reports can be accessed online, for example at: https://sec.report/ (a third-party site that mirrors SEC filings).
An overview of other interesting SEC forms: https://www.investopedia.com/articles/fundamental-analysis/08/sec-forms.asp

Today we work on the N-1A form, which is the registration form for open-end management investment companies, such as mutual funds and most ETFs.

### Forms available via RSS

First, we collect all the documents available in the RSS feed using `feedparser`.

Only a fraction of forms are available this way, but this will be sufficient for our goals. The rest can be accessed via the EDGAR database: https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm

In [None]:
!pip install feedparser

In [None]:
import feedparser

## URL of the RSS feed
rss_url = "https://sec.report/Form/N-1A.rss"

## Parse the feed
feed = ## TODO

## Check if there was some error
if feed.status != 200:
    print("Some connection error", feed.status)

print("Number of documents in the RSS feed: " + str(len(feed.entries)))

In [None]:
## What an entry looks like
## TODO

### Scraping the webpages

We make simple HTTP requests below. If you need more advanced requests, for example with a login page, you may want to have a look at the `selenium` library.

In [None]:
!pip install requests

In [None]:
import re
from urllib.request import urlopen, Request

In [None]:
url = feed.entries[0].links[0].href
print(url)

## Scrape the page content
request = Request(url)
html = urlopen(request).read().decode()

print(html)

Do you have a problem downloading the webpage?

In [None]:
import requests

r = requests.get('http://httpbin.org/user-agent')
my_agent = r.text

print(my_agent)

In [None]:
## Define user-agent
headers = ## TODO

## Scrape the page content
request = Request(url, headers=headers)
html = urlopen(request).read().decode()

print(html)
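
Some servers reject requests that carry the default Python user agent. The sketch below shows one way to attach a user agent with `urllib`; the agent string and the URL are placeholders (assumptions, not values from this lecture), and no request is actually sent.

```python
from urllib.request import Request

# Hypothetical user-agent string; replace it with one that identifies you.
example_headers = {"User-Agent": "Mozilla/5.0 (compatible; course-notebook)"}

# Placeholder URL, used only to build the request object (nothing is fetched).
example_request = Request("https://example.com/", headers=example_headers)

# urllib stores header names capitalized, hence the lookup key "User-agent".
print(example_request.get_header("User-agent"))
```

Passing the resulting request object to `urlopen` would then send the header along with the request.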

### But... is it legal?

<img src="img/Is-web-scraping-legal-2.jpg" alt="drawing" width="800"/>

### Parsing the webpage

We use the `BeautifulSoup` library, which offers powerful parsing tools for various document types.

See: https://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [None]:
!pip install bs4

In [None]:
from bs4 import BeautifulSoup, NavigableString, Tag

def extract_paragraph(html, pattern='Principal Investment Strategies', verbose=False):
    soup = BeautifulSoup(html, 'html.parser')
    pattern = re.compile(pattern)
    funds = []
    ## Find the tags for bold text
    for tag in soup.find_all(re.compile("^b$")):
        ## If the text content matches the pattern
        if pattern.match(tag.text):
            if(verbose):
                print("START TAG:", tag)
            fund_info = ""
            ## Collect all the text content until it reaches another tag for bold text
            for c in tag.next_elements:
                if isinstance(c, NavigableString):
                    fund_info += c.lower()
                    if(verbose):
                        print("*" *5 + "\n" + c.lower())
                if isinstance(c, Tag) and c.name == "b":
                    if(verbose):
                        print("*" *5 + "\n" + "TAG:", c)
                    if len(c.text.strip()) > 0:
                        if(verbose):
                            print("END HERE\n\n")
                        break
            funds.append(fund_info)
    return funds

In [None]:
html = urlopen(request).read().decode()
funds = extract_paragraph(html, verbose=True)

### Text normalization

This is similar to what was done in week 10.

In [None]:
import nltk

## stopwords
cachedStopWords = nltk.corpus.stopwords.words('english')
## stemmer
porter = nltk.PorterStemmer()

def tokenize(text, min_length=2):
    """
    A tokenizer typically used for classification
    """
    
    ## remove some characters
    chars = ['i.e.', '-', '\n', '\xa0', '"', '(', ')', ';', ',', '. ', '“', '”', '·', ':', '\t', "’"]
    ## TODO
        
    ## remove extra spaces and tokenize
    words = re.sub(r'\s+', ' ', text).split(" ")
    
    ## remove stopwords
    words = ## TODO
    
    ## stem words
    tokens = ## TODO
    
    ## remove any token with anything but letters
    p = re.compile('[a-zA-Z]+')
    
    ## keep only tokens longer than min_length
    filtered_tokens = ## TODO
    
    return filtered_tokens

In [None]:
print(tokenize(funds[0]))
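
For orientation, here is a minimal standard-library sketch of the same normalization idea. It is not the solution to the exercise above: the stopword list is a toy set (an assumption; the notebook uses NLTK's English list), stemming is omitted, and every non-letter character is simply replaced by a space.

```python
import re

# Toy stopword list (assumption); NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def simple_tokenize(text, min_length=2):
    """Lowercase, drop non-letters, stopwords and tokens of length <= min_length."""
    text = text.lower()
    # replace every character that is not a letter or whitespace by a space
    text = re.sub(r'[^a-z\s]', ' ', text)
    # split on any run of whitespace
    words = text.split()
    # remove stopwords
    words = [w for w in words if w not in TOY_STOPWORDS]
    # keep only tokens longer than min_length (no stemming in this sketch)
    return [w for w in words if len(w) > min_length]

sample = "The fund invests in high-yield bonds (rated BB), and U.S. equities."
print(simple_tokenize(sample))
```

Note how "U.S." disappears entirely: aggressive character stripping is simple but can destroy abbreviations, which is one reason to inspect the tokens before modeling.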

### Put everything together

Perform the above operations on all the files available. This may take a little while.

In [None]:
import time

def scrape_and_parse(feed, headers=None):
    docs = []
    texts = []
    for entry in feed.entries:
        time.sleep(0.1) ## short pause between requests to avoid overloading the server
        ## Get the url for the entry
        url = entry.links[0].href
        print(url)
        ## Make the request (an empty dict is used when no headers are given)
        request = Request(url, headers=headers or {})
        html = urlopen(request).read().decode()
        ## parse the webpage
        funds = extract_paragraph(html)
        ## tokenize
        for f in funds:
            texts.append(f)
            docs.append(tokenize(f))
    return docs, texts

In [None]:
## run
docs, texts = scrape_and_parse(feed, headers)

### Backup the files

In case you want to skip the above steps next time you work on this notebook.

The `pickle` library serializes Python objects in a binary format, which is efficient in terms of both speed and size.

In [None]:
import pickle 

filename = "sec_n1a_backup.pickle"

## Save
# filehandler = open(filename, 'wb') 
# pickle.dump([docs, texts], filehandler)

## Load
# filehandler = open(filename, 'rb') 
# docs, texts = pickle.load(filehandler)
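
A small round-trip sketch of the save/load pattern using context managers, which close the file even if an error occurs. The variables `toy_docs` and `toy_texts` are stand-ins (assumptions) for the notebook's `docs` and `texts`, and a temporary directory is used so that no existing backup is overwritten.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the notebook's `docs` and `texts` (assumption).
toy_docs = [["fund", "invest"], ["bond", "yield"]]
toy_texts = ["the fund invests", "bond yields"]

# Work in a temporary directory so nothing in the working folder is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "backup.pickle")
    # Save: the file is closed automatically when the block ends.
    with open(path, "wb") as f:
        pickle.dump([toy_docs, toy_texts], f)
    # Load: unpack the list in the same order it was saved.
    with open(path, "rb") as f:
        loaded_docs, loaded_texts = pickle.load(f)

print(loaded_docs == toy_docs, loaded_texts == toy_texts)
```

Only unpickle files that you trust: loading a pickle can execute arbitrary code.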

In [None]:
len(docs)

### Basic data checks

Let's see what the data looks like. In practice, you should probably spend some more time on this step.

In [None]:
import matplotlib.pyplot as plt

n_tokens = [len(d) for d in docs]

plt.figure()
plt.hist(n_tokens, bins=50)
plt.show()

In [None]:
long_docs = [d for d in docs if len(d) > 800]
long_docs

## Unsupervised topic modeling

In week 10, the news articles were annotated and we trained multiple algorithms to learn the mapping from words to classes.

This week, we have no annotations and we want to group the texts by common and meaningful topics.

A popular approach is to use LDA models, as described below.

### Latent Dirichlet Allocation (LDA)

This approach has become a standard in its own right: http://www.jmlr.org/papers/v3/blei03a.html

The intuition is as follows:
  * each **topic** is a distribution over words
  * each **document** is a mixture of corpus-wide topics
  * each **word** is drawn from one of those topics

Notes:
  * a word can belong to multiple topics
  * the vocabulary is fixed
  * this is a *bag of words* approach.

<img src="img/lda1.png" alt="drawing" width="800"/>

The Dirichlet distribution is a family of continuous multivariate probability distributions parameterized by a vector $\alpha$ with positive entries. In dimension $K$ its support is the $(K-1)$-simplex, that is, the set of vectors with $x_i \ge 0$ and $\sum_{i=1}^K x_i = 1$.

Below are examples of 3-dim Dirichlet distributions for different $\alpha$ parameters.

<img src="img/dirdist.png" alt="drawing" width="800"/>
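
To get a feel for the role of $\alpha$, the following sketch samples from three 3-dimensional Dirichlet distributions with NumPy. The parameter values are illustrative assumptions: small entries push the mass towards the corners of the simplex (sparse samples), while large entries concentrate it around the center.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three illustrative parameter vectors (assumptions): sparse, uniform, concentrated.
for alpha in ([0.1, 0.1, 0.1], [1.0, 1.0, 1.0], [10.0, 10.0, 10.0]):
    samples = rng.dirichlet(alpha, size=1000)
    # every sample is a probability vector: its entries sum to one
    assert np.allclose(samples.sum(axis=1), 1.0)
    # share of samples that put more than 90% of the mass on one component
    peaked = (samples.max(axis=1) > 0.9).mean()
    print(alpha, "share of peaked samples:", round(float(peaked), 3))
```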


**Definitions and notations:**
  * $M$ denotes the number of documents
  * $N$ is the number of words in a given document (document $i$ has $N_i$ words)
  * $\alpha$ is the parameter of the Dirichlet prior on the per-document topic distributions
  * $\beta$ is the parameter of the Dirichlet prior on the per-topic word distribution
  * $\theta_i$ is the topic distribution for document $i$
  * $\varphi_k$ is the word distribution for topic $k$
  * $z_{i,j}$ is the topic for the $j$-th word in document $i$
  * $w_{i,j}$ is the $j$-th word in document $i$
  * $K$ is the number of topics
  
**The generative process:**
  * Give each topic a word distribution: draw $\varphi_k \sim \operatorname{Dir}(\beta)$, where $k \in \{ 1,\dots,K \}$ and $\beta$ typically has small entries (below 1), which favors sparse distributions
  * Give each document a topic distribution: draw $\theta_i \sim \operatorname{Dir}(\alpha)$, where $i \in \{ 1,\dots,M \}$ and $\operatorname{Dir}(\alpha)$ is a (sparse) Dirichlet distribution
  * Give each document words: for each of the word positions $i, j$, where $i \in \{ 1,\dots,M \}$, and $j \in \{ 1,\dots,N_i \}$
    * Choose a topic $z_{i,j} \sim\operatorname{Multinomial}(\theta_i)$
    * Choose a word $w_{i,j} \sim\operatorname{Multinomial}( \varphi_{z_{i,j}})$
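
The generative process above can be simulated in a few lines of NumPy. This is only a sketch with toy sizes; the values of $K$, $M$, the vocabulary size `V`, the priors, and the uniformly drawn document lengths are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

K, V, M = 3, 8, 4                   # topics, vocabulary size, documents (toy values)
alpha = np.full(K, 0.5)             # prior on per-document topic distributions
beta = np.full(V, 0.1)              # prior on per-topic word distributions
N = rng.integers(5, 10, size=M)     # document lengths N_i (assumption: uniform on 5..9)

phi = rng.dirichlet(beta, size=K)     # phi[k]: word distribution of topic k
theta = rng.dirichlet(alpha, size=M)  # theta[i]: topic distribution of document i

corpus = []
for i in range(M):
    # topic z_{i,j} for each word position j of document i
    z = rng.choice(K, size=N[i], p=theta[i])
    # word w_{i,j} drawn from the word distribution of its topic
    w = np.array([rng.choice(V, p=phi[k]) for k in z])
    corpus.append(w)

print(corpus[0])  # word ids of the first simulated document
```

Learning an LDA model amounts to inverting this simulation: given only the words, infer plausible `phi` and `theta`.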

<img src="img/Smoothed_LDA.png" alt="drawing" width="400"/>

**The learning algorithm** is beyond the scope of this lecture :-) If you are interested: https://arxiv.org/abs/1711.04305

**LDA models typically use only the most frequent words!** E.g. 1,000 words for relatively small texts and corpora.

More details at: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

In [None]:
!pip install gensim

In [None]:
import gensim
from pprint import pprint

In [None]:
dictionary = gensim.corpora.Dictionary(docs)

## Optional filters for the tokens to keep
# dictionary.filter_extremes(no_below=10, no_above=0.5, keep_n=100000)

print(dictionary)

# for k, v in dictionary.iteritems():
#     print(k, v)

In [None]:
## documents to bag-of-words
bow_corpus = ## TODO

pprint(bow_corpus[42])

In [None]:
print(sorted(docs[42]))

dictionary[5]
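
To make the bag-of-words format concrete, here is a standard-library sketch that maps tokens to integer ids and counts them. The documents are toy examples, and the ids are assigned in order of first appearance, which is an assumption of this sketch and not necessarily how `gensim` numbers tokens. The output has the same shape as what `doc2bow` returns: a list of `(token_id, count)` pairs per document.

```python
from collections import Counter

# Toy tokenized documents (assumption).
toy_docs = [["fund", "invest", "bond", "fund"], ["bond", "yield", "yield"]]

# Assign an integer id to each distinct token, in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_bow = [to_bow(doc) for doc in toy_docs]
print(toy_bow)  # [[(0, 2), (1, 1), (2, 1)], [(2, 1), (3, 2)]]
```

Word order is discarded in this representation, which is exactly the bag-of-words assumption made by LDA.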

In [None]:
## transform BOW in TF-IDF
tfidf = gensim.models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

pprint(corpus_tfidf[0])
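
The idea behind TF-IDF can be sketched without any library. The variant below is one common choice and an assumption of this sketch: term count times $\log_2(N / \mathrm{df})$, followed by scaling each document vector to unit length. The toy corpus is made up for illustration.

```python
import math

# Toy bag-of-words corpus: lists of (token_id, count) pairs (assumption).
toy_bow = [[(0, 2), (1, 1), (2, 1)], [(2, 1), (3, 2)], [(0, 1), (2, 3)]]
n_docs = len(toy_bow)

# Document frequency: in how many documents does each token appear?
doc_freq = {}
for doc in toy_bow:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def to_tfidf(doc):
    """Weight counts by log2(n_docs / df) and scale the vector to unit length."""
    weights = [(t, c * math.log2(n_docs / doc_freq[t])) for t, c in doc]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    # tokens present in every document get weight zero and are dropped
    return [(t, w / norm) for t, w in weights if w > 0]

toy_tfidf = [to_tfidf(doc) for doc in toy_bow]
print(toy_tfidf[0])
```

Token 2 occurs in all three documents, so it carries no discriminating information and vanishes from every TF-IDF vector.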

### LDA using raw BOW

In [None]:
lda_model = gensim.models.LdaMulticore(## TODO, 
                                       num_topics=## TODO, 
                                       id2word=## TODO, 
                                       passes=## TODO, 
                                       workers=## TODO)

for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

### LDA using TF-IDF

In [None]:
lda_model_tfidf = gensim.models.LdaMulticore(## TODO, 
                                             num_topics=## TODO, 
                                             id2word=## TODO, 
                                             passes=## TODO, 
                                             workers=## TODO)

for idx, topic in lda_model_tfidf.print_topics(-1):
    print('*' * 5 + '\n', 'Topic: {} Word: {}'.format(idx, topic))

In [None]:
i = 42

print(docs[i])

for index, score in sorted(lda_model_tfidf[bow_corpus[i]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))

## Extending BOW methods with $n$-grams

An $n$-gram is simply a sequence of $n$ items. Standard BOW models are built from *unigrams*, that is, $n$-grams of length 1.

<img src="img/ngrams.jpeg" alt="drawing" width="400"/>
  
The motivation for $n$-grams is that they may capture more precise information than unigrams, for example negation (e.g. "not good").
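
Extracting $n$-grams from a token list is a one-liner. The helper name `ngrams` and the example sentence are illustrative choices, not part of the lecture code.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the outlook is not good".split()
print(ngrams(tokens, 1))  # unigrams: one tuple per token
print(ngrams(tokens, 2))  # bigrams, including ('not', 'good')
```

The bigram `('not', 'good')` keeps the negation together, whereas the unigrams "not" and "good" lose it.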


Going back to topic modeling, one may observe that important expressions are lost with unigrams. For example:
  * "short term" -> ["short", "term"]
  * "long term" -> ["long", "term"]
  * "high yield" -> ["high", "yield"]
  * "interest rate" -> ["interest", "rate"]
  
We now train an LDA model on bags of unigrams and bigrams. The code barely changes.

First, we create a `Phraser` that will create and select a subset of $n$-grams.

In [None]:
from gensim.models import Phrases
from gensim.models.phrases import Phraser

chars = ['-']

documents = ["the mayor of new york was there, the mayor of new york was there\n\n", "machine learning can be useful sometimes","new york mayor was present"]

sentence_stream = [doc.split(" ") for doc in documents]
print(sentence_stream)

bigram = Phrases(sentence_stream, 
                 min_count=1, ## ignore all words and bigrams with total collected count lower than this.
                 threshold=2, ## higher means fewer phrases
                 delimiter=b' ')

bigram_phraser = Phraser(bigram)


print(bigram_phraser)

for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]

    print(tokens_)

In [None]:
sentence_stream = [tokenize(t) for t in texts]

bigram = Phrases(sentence_stream, 
                 min_count=1, ## ignore all words and bigrams with total collected count lower than this.
                 threshold=5, ## higher means fewer phrases
                 delimiter=b' ')

bigram_phraser = Phraser(bigram)


print(bigram_phraser[sentence_stream[0]])

In [None]:
## dictionary of unigrams and bigrams
sentence_stream_bigram = [bigram_phraser[sent] for sent in sentence_stream]
dictionary = gensim.corpora.Dictionary(sentence_stream_bigram)


In [None]:
print(dictionary)

In [None]:
## BOW corpus
## TODO

## TFIDF
## TODO

In [None]:
## train LDA model
lda_model = ## TODO

for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

## Delete all variables

Clean the environment before moving to the next part. Make sure that you saved everything you wanted to keep.

In [None]:
%reset

## Word embeddings

[Word embeddings](https://towardsdatascience.com/word-embeddings-exploration-explanation-and-exploitation-with-code-in-python-5dac99d5d795) are a set of language modeling and feature learning techniques in NLP where words/phrases/sentences/paragraphs/documents are mapped to vectors of real numbers (usually of lower dimensionality).

The idea behind this form of dimensionality reduction is to capture as much semantic/morphological/contextual/hierarchical/etc. information as possible from the original text. While training the models to find the embeddings, several directions can be taken:
  * preserving the [morphological structure](https://arxiv.org/pdf/1607.04606.pdf) (subword information, etc.);
  * [word context](https://arxiv.org/pdf/1411.2738.pdf) representation;
  * [global corpus statistics](https://nlp.stanford.edu/pubs/glove.pdf);
  * [word hierarchy](https://arxiv.org/pdf/1705.08039.pdf) as in WordNet;
  * [relationship between documents](https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html) and the terms they contain.

## [GloVe](https://nlp.stanford.edu/pubs/glove.pdf)

Global Vectors for Word Representation: https://nlp.stanford.edu/projects/glove
* [glove.6B.zip](http://nlp.stanford.edu/data/glove.6B.zip): Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download).
* [glove.840B.300d.zip](http://nlp.stanford.edu/data/glove.840B.300d.zip): Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)

This approach tries to capture the meaning of a word in its embedding by using the structure of the whole observed corpus.

This model is trained on the global co-occurrence counts and uses the word statistics.
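
For reference, the weighted least-squares objective of the GloVe paper can be written as follows, where $X_{ij}$ counts how often word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\ 1 & \text{otherwise.} \end{cases}
$$

The weighting function $f$ limits the influence of very frequent co-occurrences and gives zero weight to pairs that never co-occur. The paper uses $x_{\max} = 100$ and $\alpha = 3/4$; this $\alpha$ is unrelated to the Dirichlet parameter used earlier for LDA.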

More details and explanations are either in the paper or in [this tutorial](https://towardsdatascience.com/word-embeddings-exploration-explanation-and-exploitation-with-code-in-python-5dac99d5d795).

<img src="img/w2v.jpeg" alt="drawing" width="600"/>

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

## transform GloVe to word2vec format
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

## load the Stanford GloVe model in word2vec format
filename = 'glove.6B.100d.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)

In [None]:
model

In [None]:
## vector representation of a given word
## TODO

In [None]:
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=## TODO, 
                            negative=## TODO, 
                            topn=1)
print(result)

In [None]:
## distance between two words
## TODO

In [None]:
## similarity between two words
## TODO
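
The geometry behind these queries can be illustrated with NumPy on hand-made vectors. Everything in this sketch is an assumption made for illustration: the 3-dimensional vectors, the tiny vocabulary, and the helper names. Similarity is the cosine of the angle between two vectors, and a common convention defines the distance as one minus that similarity.

```python
import numpy as np

# Hand-made 3-d vectors (assumption); real embeddings have 50 to 300 dimensions.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "bank":  np.array([0.5, 0.5, 0.0]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(target, exclude):
    """Word whose vector has the highest cosine similarity with `target`."""
    candidates = [w for w in toy_vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(toy_vectors[w], target))

# (king - man) + woman = ?
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
print(most_similar(target, exclude={"king", "man", "woman"}))

# similarity and distance between two words
sim = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
print("similarity:", sim, "distance:", 1 - sim)
```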

Note that you can train a word2vec model for your specific application easily with existing libraries.

## Supervised score learning

Given financial news headlines, we aim to build a model that can predict the annotated score as precisely as possible. This score is a measure of sentiment.

This application is taken from this workshop: https://bitbucket.org/ssix-project/semeval-2017-task-5-subtask-2/src/master/

The modeling part is a simplified version of: https://github.com/apmoore1/semeval

### Read and check the data

In [None]:
import json
import numpy as np

def read_data(file_name):
    """
    Read and parse the news headlines and the sentiment scores
    """
    with open(file_name, 'r') as f:
        all_data = json.load(f)
    text = []
    sentiment = []
    company = []
    for data in all_data:
        text.append(data['title'].lower())
        company.append(data['company'].lower())
        if 'sentiment' in data:
            sentiment.append(data['sentiment'])
        elif 'sentiment score' in data:
            sentiment.append(data['sentiment score'])
    return text, np.asarray(sentiment), company

## run
train_texts, train_sentiments, train_companies = read_data('Headline_Trainingdata.json')

In [None]:
train_sentiments

In [None]:
train_texts

## RNN models for NLP

The typical approach is sequence to *something*. But why use sequences?

### Sequence to sequence

Sequence-to-sequence models are typically used for text generation tasks such as machine translation and chatbots. There can be some *random generator* between the encoder and the decoder.

<img src="img/seq2seq.jpg" alt="drawing" width="600"/>

<img src="img/chatbot.png" alt="drawing" width="600"/>

### Sequence to class / score

Replace the decoder with an output layer in the above architecture.

For example:
<img src="img/seq2vec.png" alt="drawing" width="250"/>

### Bidirectional RNNs

Bidirectional models read the sequence in both directions and combine the two passes to make predictions.

<img src="img/bidirectional.png" alt="drawing" width="600"/>
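
The following NumPy sketch shows the mechanics of a bidirectional sequence-to-score model with a plain tanh RNN. It is a toy forward pass only: the sizes are arbitrary, the weights are random and untrained, and the helper names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 6, 4, 3       # toy sizes (assumptions)
x = rng.normal(size=(seq_len, input_dim))      # one sequence of word vectors

def rnn_last_state(inputs, W_x, W_h, b):
    """Run a simple tanh RNN over `inputs` and return the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

def init_params():
    """Random input weights, recurrent weights and zero bias for one direction."""
    return (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
            np.zeros(hidden_dim))

forward_params, backward_params = init_params(), init_params()

h_forward = rnn_last_state(x, *forward_params)          # reads left to right
h_backward = rnn_last_state(x[::-1], *backward_params)  # reads right to left
h = np.concatenate([h_forward, h_backward])

# Sequence to score: a linear output layer on top of the concatenated states.
w_out = rng.normal(size=2 * hidden_dim)
score = float(w_out @ h)
print(h.shape, score)
```

Each direction has its own parameters, so the concatenated state summarizes the sentence both from its beginning and from its end.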

### Tokenize and word2vec texts

In [None]:
from nltk.tokenize import word_tokenize 

def prepare_data(train_texts, word2vec_model, tokenizer):
    
    ## max number of tokens for headlines, will be used when creating the model
    max_token_length = 0

    ## tokenization
    train_tokens = []
    for text in train_texts:
        ## tokenize
        tokens = tokenizer(text)
        tokens = [token for token in tokens if token.strip()]
        ## is it the longest sentence so far?
        if len(tokens) > max_token_length:
            max_token_length = len(tokens)
        ## save tokens
        train_tokens.append(tokens) 

    ## word2vec-ization    
    vector_length = word2vec_model.vector_size
    all_vectors = []
    for tokens in train_tokens:
        vector_format = []
        for token in tokens:
            ## word2vec
            if token in word2vec_model:
                ## word embedding
                ## TODO: reshape(1,vector_length)
                vector_format.append(None)
            else:
                ## word not found
                ## TODO
        while len(vector_format) != max_token_length:
            ## padding
            ## TODO
        ## stack all the vector for this sequence
        all_vectors.append(np.vstack(vector_format))
    
    ## stack all the sequences
    return np.asarray(all_vectors), max_token_length

## run
train_vectors, max_length = prepare_data(train_texts, model, word_tokenize)
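
The target data layout is a 3-d array of shape (number of sentences, maximum length, vector length). The sketch below builds it for toy inputs; the 2-d embedding table and the sentences are made up, and representing both unknown words and padding by zero vectors is an assumption of this sketch, one of several possible conventions.

```python
import numpy as np

# Toy 2-d embedding table (assumption) standing in for the GloVe model.
toy_embeddings = {"profit": np.array([1.0, 0.5]),
                  "falls": np.array([-0.8, 0.2]),
                  "shares": np.array([0.3, 0.3])}
vector_length = 2

toy_sentences = [["profit", "falls"], ["shares", "of", "acme", "falls"]]
max_len = max(len(s) for s in toy_sentences)

def sentence_to_matrix(tokens):
    """Map tokens to vectors; unknown words and padding become zero vectors."""
    rows = [toy_embeddings.get(t, np.zeros(vector_length)) for t in tokens]
    rows += [np.zeros(vector_length)] * (max_len - len(rows))
    return np.vstack(rows)

toy_tensor = np.asarray([sentence_to_matrix(s) for s in toy_sentences])
print(toy_tensor.shape)  # (2, 4, 2)
```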

In [None]:
train_vectors.shape

In [None]:
train_vectors[1,:,:]

Do you see reasons why LSTM/GRU may be a better choice than a simple RNN for this type of application?

<img src="img/.png" alt="drawing" width="250"/>

### Model builder(s)

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Bidirectional, LSTM

def build_lstm_model(max_length, vector_length):
    
    model = ## TODO
    
    ## output layer: TODO

    ## compile model with loss function and optimizer: TODO

    return model


### Build and display model

In [None]:
!pip3 install graphviz
!pip3 install pydotplus

In [None]:
from keras.utils import plot_model
import pydot

lstm_model = build_lstm_model(max_length, 100)

plot_model(lstm_model)

In [None]:
hist = lstm_model.fit(train_vectors, train_sentiments, epochs=25)

### Train model with $k$-fold cross-validation

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def cross_validate(train_text, train_sentiments, n_folds=5, nb_epoch=25,
                   shuffle=True, score_function=mean_absolute_error):
    
    all_results = []
    train_text_array = np.asarray(train_text)
    train_sentiments_array = np.asarray(train_sentiments)

    kfold = KFold(n_splits=n_folds, shuffle=shuffle)
    
    max_length = train_text_array.shape[1]
    vector_length = train_text_array.shape[2]
    
    for train, test in kfold.split(train_text_array, train_sentiments_array):
        
        lstm_model = build_lstm_model(max_length, vector_length)
        lstm_model.fit(train_text_array[train], train_sentiments_array[train], epochs=nb_epoch)
        
        predicted_sentiments = lstm_model.predict(train_text_array[test])
        ## scikit-learn metrics expect (y_true, y_pred)
        result = score_function(train_sentiments_array[test], predicted_sentiments)
        
        all_results.append(result)
        
    return all_results

In [None]:
res_cv = cross_validate(train_vectors, train_sentiments)
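
To see what $k$-fold cross-validation does independently of the LSTM, here is a NumPy-only sketch with a trivial baseline that always predicts the training mean. The scores are random toy data and the helper `kfold_indices` is a hypothetical stand-in for scikit-learn's `KFold`.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(-1, 1, size=20)      # toy sentiment scores (assumption)

def kfold_indices(n_samples, n_folds, rng):
    """Yield (train_idx, test_idx) pairs covering a shuffled index set."""
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for k in range(n_folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        yield train_idx, folds[k]

fold_mae = []
for train_idx, test_idx in kfold_indices(len(y), 5, rng):
    prediction = y[train_idx].mean()      # baseline "model": the training mean
    fold_mae.append(float(np.mean(np.abs(y[test_idx] - prediction))))

print("MAE per fold:", np.round(fold_mae, 3))
print("mean MAE:", np.mean(fold_mae))
```

A constant baseline like this gives a reference error: a trained model is only useful if its cross-validated MAE is clearly lower.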

## Data augmentation

The dataset that we used is fairly small. A common technique to improve generalization in machine learning is to augment the data. In the current situation, this may work as follows:
  * random swap of company names
  * synonym replacement of positive and negative words
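
A minimal sketch of these two augmentations using only the standard library. The company list, the synonym table, and the function name `augment` are toy assumptions; the sketch also assumes single-word company names and that the sentiment score of the augmented headline stays unchanged.

```python
import random

# Toy resources (assumptions): a company list and a tiny synonym table.
companies = ["acme", "globex", "initech"]
synonyms = {"rises": ["climbs", "gains"], "falls": ["drops", "declines"]}

def augment(headline, company, rng):
    """Return a new (headline, company) pair with the company swapped
    and sentiment words replaced by synonyms of the same polarity."""
    new_company = rng.choice([c for c in companies if c != company])
    tokens = []
    for token in headline.split():
        if token == company:
            tokens.append(new_company)                   # random company swap
        elif token in synonyms:
            tokens.append(rng.choice(synonyms[token]))   # synonym replacement
        else:
            tokens.append(token)
    return " ".join(tokens), new_company

rng = random.Random(0)
print(augment("acme profit falls after recall", "acme", rng))
```

Replacing a word only by synonyms of the same polarity is what justifies reusing the original sentiment score as the label.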
  
  
More reading on data augmentation for NLP: https://arxiv.org/abs/1901.11196

<img src="img/nlpeda.png" alt="drawing" width="400"/>