# SESTM - Sentiment Extraction via Screening and Topic Modelling
This is a novel sentiment extraction algorithm that gleans sentiment from realised stock returns. This notebook walks you through the nitty-gritty of exactly how it works.

First things first, import this absolute monstrosity of imports; we'll be needing all of them.

In [135]:
import json
import sys
import re
import os
import sklearn
from sklearn import preprocessing
# import scikit-learn
import numpy as np
from numpy.linalg import inv
from pathlib import Path
from bs4 import BeautifulSoup as bs
# from textblob import TextBlob as tb
import datetime
import math
from nbformat import read
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
import unittest
import sympy as sym

In [14]:
# install nltk stuff
nltk.download('stopwords')
nltk.download('words')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /home/josh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to /home/josh/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to /home/josh/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

Now that that's out of the way, let's define some sample articles to demonstrate how it works. Normally, I would pull articles from the Refinitiv Eikon API, which returns the webpage with the content inside. I've simulated this somewhat by using `<p>` tags, but the general idea for real articles is the same as for these sample ones.

Here are three sample articles that I made up:

In [39]:
article_1 = {
    'date': '2021-12-23 12:58:45.061000+00:00',
    'ticker': 'ABDN',
    'mrkt_info': {
        'open': 233.7,
        'close': 200.3
    },
    'html': '<p>John likes to watch films and eat pizza.\nMary likes films too.</p>'
}

article_2 = {
    'date': '2022-01-26 07:11:46.774000+00:00',
    'ticker': 'ABDN', 
    'mrkt_info': {
        'open': 229.2,
        'close': 241.0
    },
    'html': '<p>Mary also likes to watch football games.</p>'
}

article_3 = {
    'date': '2021-10-25 13:22:07.985000+00:00',
    'ticker': 'ABDN',
    'mrkt_info': {
        'open': 250.3,
        'close': 258.5
    },
    'html': '<p>Carl likes to play football. He finds films boring.</p>'
}

art_list = [article_1, article_2, article_3]

## Preprocessing
We now need to define some things about these articles, namely:
1. The bag of words representation
2. The sign of the article (`sgn`)
3. The realised return of article $i$ ($y_i$)

Steps 2 and 3 are the easiest, so let's do those first.

In [107]:
sgn = []    # list of article signs
y = []      # list of realised returns
i = 1
for a in art_list:
    returns = a['mrkt_info']['close'] - a['mrkt_info']['open']
    y.append(returns)
    sgn_a = -1
    if (returns > 0): # add -1 if returns are 0 or less, 1 otherwise
        sgn_a = 1
    sgn.append(sgn_a)
    print("Returns for article " + str(i) + ": " + str(returns))
    print("Sign for article " + str(i) + ": " + str(sgn_a))
    i +=1

Returns for article 1: -33.39999999999998
Sign for article 1: -1
Returns for article 2: 11.800000000000011
Sign for article 2: 1
Returns for article 3: 8.199999999999989
Sign for article 3: 1
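The same step can be written a little more compactly. This is only a sketch: `returns_and_signs` is a hypothetical helper that isn't part of the notebook, and it keeps the convention used above of treating the raw close-minus-open price difference as the realised return.

```python
def returns_and_signs(articles):
    """Return (y, sgn): realised returns and their signs, one entry per article."""
    y = [a['mrkt_info']['close'] - a['mrkt_info']['open'] for a in articles]
    # Zero or negative returns are labelled -1, strictly positive returns +1
    sgn = [1 if r > 0 else -1 for r in y]
    return y, sgn

# Open/close prices copied from the three sample articles
sample = [
    {'mrkt_info': {'open': 233.7, 'close': 200.3}},
    {'mrkt_info': {'open': 229.2, 'close': 241.0}},
    {'mrkt_info': {'open': 250.3, 'close': 258.5}},
]
y_demo, sgn_demo = returns_and_signs(sample)
print(sgn_demo)  # [-1, 1, 1]
```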


Now we need to prepare the bag of words (BOW) representation for each of the articles. Before we can do that we need to *extract* the actual text content from the HTML, and then *normalise* the text content. Normalising consists of:
- Converting to lower case
- Removing non-alphabetic chars
- Removing stop words (pronouns, connectives, etc.)
- Removing non-English words
- Lemmatising (converting likes to like etc.)
- Tokenising (converting to list of words)

A bit of notation here:
- We have a collection of $n$ news articles and a dictionary of $m$ words
- We record the word counts in the $i^{th}$ article in vector $d_i$
    - $d_{i,j}$ is the number of times word $j$ appears in article $i$
    - $D = [d_{1}, ..., d_{n}]$ is an $m \times n$ doc-term matrix.
    - We occasionally work with the sub-vector of $d_i$ restricted to the sentiment-charged words; we denote this $d_{[S],i}$

In [108]:
STOP_WORDS = set(stopwords.words('english'))
global_bow = {} # global bag of words
d = []          # word count vector d, where di is the word count vector for article i

i = 1
for a in art_list:
    raw_html = a['html']
    if (raw_html): # when extracting articles from eikon, some articles are just pdfs, so this might not exist
        readable_text = bs(raw_html, 'lxml').get_text().lower()
        print("Text for article " + str(i) + ": '" + readable_text + "'")
        # substitute non alphabet chars (new lines become spaces)
        readable_text = re.sub(r'\n', ' ', readable_text)
        readable_text = re.sub(r'[^a-z ]', '', readable_text)
        # sub multiple spaces with one space
        readable_text = re.sub(r'\s+', ' ', readable_text)
        # tokenise text
        words = nltk.wordpunct_tokenize(readable_text)
        if len(words) > 0:
            # lemmatise, remove non-english, and remove stopwords
            lemmatizer = WordNetLemmatizer()
            lemmatised_words = []
            en_words = set(nltk.corpus.words.words())
            for w in words:
                rootword = lemmatizer.lemmatize(w, pos="v")
                if rootword not in STOP_WORDS and (rootword in en_words or not rootword.isalpha()):
                    lemmatised_words.append(rootword)
            # convert to bag of words
            bow_art = {}
            # global_bow = {l: val+1 for l in lemmatised_words for val in global_bow.get(l, 0)}
            # bow_art = {l: val+1 for l in lemmatised_words for val in global_bow.get(l, 0)}
            for l in lemmatised_words:
                if l in global_bow:
                    global_bow[l] += 1
                else:
                    global_bow[l] = 1
                if l in bow_art:
                    bow_art[l] += 1
                else:
                    bow_art[l] = 1
            print("BOW for article " + str(i) + ": " + str(bow_art))
            d.append(bow_art)
    i += 1

print(d)
print('\n')
print("Global BOW: " + str(global_bow))


Text for article 1: 'john likes to watch films and eat pizza.
mary likes films too.'
BOW for article 1: {'like': 2, 'watch': 1, 'film': 2, 'eat': 1, 'pizza': 1, 'mary': 1}
Text for article 2: 'mary also likes to watch football games.'
BOW for article 2: {'mary': 1, 'also': 1, 'like': 1, 'watch': 1, 'football': 1, 'game': 1}
Text for article 3: 'carl likes to play football. he finds films boring.'
BOW for article 3: {'carl': 1, 'like': 1, 'play': 1, 'football': 1, 'find': 1, 'film': 1, 'bore': 1}
[{'like': 2, 'watch': 1, 'film': 2, 'eat': 1, 'pizza': 1, 'mary': 1}, {'mary': 1, 'also': 1, 'like': 1, 'watch': 1, 'football': 1, 'game': 1}, {'carl': 1, 'like': 1, 'play': 1, 'football': 1, 'find': 1, 'film': 1, 'bore': 1}]


Global BOW: {'like': 4, 'watch': 2, 'film': 3, 'eat': 1, 'pizza': 1, 'mary': 2, 'also': 1, 'football': 2, 'game': 1, 'carl': 1, 'play': 1, 'find': 1, 'bore': 1}
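To tie the output back to the notation, here is a rough sketch of how the per-article BOWs could be stacked into the $m \times n$ doc-term matrix $D$. The names `bows`, `vocab` and `D` are my own for illustration, and the matrix is just a plain list of lists.

```python
# Per-article bags of words, copied from the output above
bows = [
    {'like': 2, 'watch': 1, 'film': 2, 'eat': 1, 'pizza': 1, 'mary': 1},
    {'mary': 1, 'also': 1, 'like': 1, 'watch': 1, 'football': 1, 'game': 1},
    {'carl': 1, 'like': 1, 'play': 1, 'football': 1, 'find': 1, 'film': 1, 'bore': 1},
]

# Dictionary of m words, in order of first appearance
vocab = []
for bow in bows:
    for word in bow:
        if word not in vocab:
            vocab.append(word)

# Row j holds the counts of word j, column i is the count vector d_i of article i,
# so D has m rows and n columns
D = [[bow.get(word, 0) for bow in bows] for word in vocab]

print(len(D), len(D[0]))       # 13 3
print(D[vocab.index('film')])  # [2, 0, 1]
```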


## Screening for sentiment-charged words
With our BOWs now defined, we can move onto the maths.

Our first step calculates the frequency with which word $j$ co-occurs with a positive return:

$$f_j = \frac{\text{count of word } j \text{ in article with } sgn(y) = + 1}{\text{ count of word } j \text{ in all articles}}$$

In [109]:
pos_j = {}    # count of word j in articles with a positive sign
total_j = {}  # count of word j in all articles
for i in range(len(d)):
    for w in d[i]:
        pos_sent = 0
        if (sgn[i] == 1):
            pos_sent = 1
        if w in total_j:
            total_j[w] += d[i][w]
            pos_j[w] += d[i][w]*pos_sent
        else:
            total_j[w] = d[i][w]
            pos_j[w] = d[i][w]*pos_sent

f = {}
print("f_j = ")
for w in total_j:
    f[w] = pos_j[w]/total_j[w]
    print(w + ": " + str(pos_j[w]) + "/" + str(total_j[w]))

f_j = 
like: 2/4
watch: 1/2
film: 1/3
eat: 0/1
pizza: 0/1
mary: 1/2
also: 1/1
football: 2/2
game: 1/1
carl: 1/1
play: 1/1
find: 1/1
bore: 1/1
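For comparison, the same $f_j$ can be computed with `collections.Counter`. This is just a sketch on the sample data, with `total`, `positive` and `f_demo` as illustrative names rather than anything the rest of the notebook depends on.

```python
from collections import Counter

bows = [
    {'like': 2, 'watch': 1, 'film': 2, 'eat': 1, 'pizza': 1, 'mary': 1},
    {'mary': 1, 'also': 1, 'like': 1, 'watch': 1, 'football': 1, 'game': 1},
    {'carl': 1, 'like': 1, 'play': 1, 'football': 1, 'find': 1, 'film': 1, 'bore': 1},
]
sgn = [-1, 1, 1]

total = Counter()     # count of word j in all articles
positive = Counter()  # count of word j in articles with sgn = +1
for bow, sign in zip(bows, sgn):
    total.update(bow)
    if sign == 1:
        positive.update(bow)

f_demo = {w: positive[w] / total[w] for w in total}
print(f_demo['film'], f_demo['football'])  # 0.3333333333333333 1.0
```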


Now we can generate our list of sentiment-charged words. Also, be aware that $\hat \pi$ is the fraction of articles marked with $sgn(y) = +1$. Normally, we would calculate this, but since there are only three articles and two of them have positive returns, I'll just say it's $\hat \pi = 2/3$. In practice, $\hat \pi \approx 1/2$.

At this point, we introduce some **hyper-parameters** that we can tune for an optimal model:
- $\alpha_{+}, \alpha_{-}$ are the upper and lower thresholds respectively that determine whether a word is sentiment-charged. For this set of data, so that any word $j$ with $0.47 < f_j < 0.67$ (that is, $\hat \pi - \alpha_- < f_j < \hat \pi + \alpha_+$) is sentiment neutral, we set:
    - $\alpha_- = 0.2$
    - $\alpha_+ = 0$
- $\kappa$ is the minimum count $k_j$ that word $j$ must reach across all articles, so that low-frequency words don't mess up the data.
    - Because our example dataset is very small, we will set $\kappa = 1$, which effectively removes this constraint for now.

This gives us the set of sentiment-charged words $\hat S$:
$$\hat S = \{j: f_j \ge \hat \pi + \alpha_+ \text{ or } f_j \le \hat \pi - \alpha_-\} \cap \{j:k_j \ge \kappa\}$$

In [110]:
ALPHA_PLUS  = 0
ALPHA_MINUS = 0.2
KAPPA       = 1
pi = 2/3
sentiment_words = [] # S
neutral_words = []   # N
for i in total_j:
    if ((pos_j[i]/total_j[i] >= pi + ALPHA_PLUS or pos_j[i]/total_j[i] <= pi - ALPHA_MINUS) and total_j[i] >= KAPPA):
        sentiment_words.append(i)
    else:
        neutral_words.append(i)

print("S = " + str(sentiment_words))

S = ['film', 'eat', 'pizza', 'also', 'football', 'game', 'carl', 'play', 'find', 'bore']
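To see how the hyper-parameters interact, here is a small sketch that wraps the screening rule in a function and reruns it with a stricter $\kappa$. `screen_words` is a hypothetical helper of mine, and I'm assuming $k_j$ is the total count of word $j$, as in the cell above.

```python
def screen_words(f, counts, pi_hat, alpha_plus, alpha_minus, kappa):
    """Return words whose f_j is far enough from pi_hat and whose count is >= kappa."""
    charged = []
    for word, f_j in f.items():
        extreme = f_j >= pi_hat + alpha_plus or f_j <= pi_hat - alpha_minus
        if extreme and counts[word] >= kappa:
            charged.append(word)
    return charged

# f_j and total counts copied from the outputs above
f_demo = {'like': 2/4, 'watch': 1/2, 'film': 1/3, 'eat': 0.0, 'pizza': 0.0,
          'mary': 1/2, 'also': 1.0, 'football': 1.0, 'game': 1.0, 'carl': 1.0,
          'play': 1.0, 'find': 1.0, 'bore': 1.0}
counts_demo = {'like': 4, 'watch': 2, 'film': 3, 'eat': 1, 'pizza': 1,
               'mary': 2, 'also': 1, 'football': 2, 'game': 1, 'carl': 1,
               'play': 1, 'find': 1, 'bore': 1}

S_kappa_1 = screen_words(f_demo, counts_demo, 2/3, 0, 0.2, kappa=1)
S_kappa_2 = screen_words(f_demo, counts_demo, 2/3, 0, 0.2, kappa=2)
print(S_kappa_1)  # same list as S above
print(S_kappa_2)  # ['film', 'football']
```

With $\kappa = 2$, every word that appears only once drops out, which is exactly the filtering that $\kappa$ is meant to provide on a realistic corpus.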


## Learning Sentiment Topics
Ok, so we have the word list $\hat S$; now we just need to fit a two-topic model to the counts of these words. We gather the two topics in a matrix $O = [O_+,O_-]$, which determines the expected counts of sentiment-charged words in each article.

To measure how much article $i$ tilts in favour of a certain sentiment, we say the estimated sentiment of article $i$ is $\hat p_i$ such that:
$$\hat p_i = \frac{\text{rank of } y_i \text{ in } \{y_l\}_{l=1}^n}{n}$$

In [111]:
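# Calculates p_i
# sorted(...) lists article indices in ascending order of return, so i is the
# 0-based ascending rank of article x. Note that 1 - i/n gives the lowest return
# the highest value (1.0), which is the reverse ordering of the rank/n formula above.
p = [0] * len(y)
for i, x in enumerate(sorted(range(len(y)), key=lambda y_lam: y[y_lam])):
    p[x] = float(1 - i/len(y))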
# p = [((rank+1)/len(y)) for (rank,x) in enumerate(sorted(range(len(y)), key=lambda y_temp: y[y_temp]))]
for i in range(len(p)):
    print("p_" + str(i+1) + ": " + str(p[i]))

p_1: 1.0
p_2: 0.33333333333333337
p_3: 0.6666666666666667
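For reference, here is a sketch of the $\text{rank}/n$ formula taken literally. I'm assuming "rank" means ascending rank (1 for the lowest return, $n$ for the highest) and ignoring ties; `rank_scores` is a hypothetical helper. Under that reading the article with the worst return gets the smallest $\hat p_i$, which is the opposite ordering to the values printed above.

```python
def rank_scores(y):
    """p_i = (ascending rank of y_i) / n, with rank 1 for the smallest return."""
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])  # article indices, lowest return first
    p = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        p[idx] = rank / n
    return p

print(rank_scores([-33.4, 11.8, 8.2]))
# [0.3333333333333333, 1.0, 0.6666666666666666]
```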


We also require the estimated $|S| \times 1$ vector of word frequencies for article $i$, denoted $\hat h_i$:
$$\hat h_i = \frac{d_{[S],i}}{\hat s_i} \quad \text{where } \hat s_i = \sum_{j \in \hat S} d_{i,j}$$

In [112]:
s = []                                          # ith element corresponds to total count of sentiment charged words for document i
d_s = []                                        # ith element corresponds to list of word counts for each of the sentiment charged words for document i
h = np.zeros((len(d), len(sentiment_words)))    # ith element corresponds to |S|x1 vector of word frequencies divided by total sentiment words in doc i

# Calculates s_i
for doc in d:
    s.append(sum(doc.get(val,0) for val in sentiment_words))
    d_s.append([doc.get(val,0) for val in sentiment_words])

for i in range(len(s)):
    print("s_" + str(i+1) + ":" + str(s[i]))
    print("d_(s," + str(i+1) + "):" + str(d_s[i]))

s_1:4
d_(s,1):[2, 1, 1, 0, 0, 0, 0, 0, 0, 0]
s_2:3
d_(s,2):[0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
s_3:6
d_(s,3):[1, 0, 0, 0, 1, 0, 1, 1, 1, 1]


In [113]:
# Calculates h_i
for i in range(len(d)):
    # subvector of sentiment words in d_i
    if (s[i] == 0) :
        h[i] = np.zeros(len(sentiment_words)).transpose()
    else:
        h[i] = np.array([(j/s[i]) for j in d_s[i]]).transpose()

for i in range(len(h)):
    print("h_" + str(i+1) + ":" + str(h[i]))

h_1:[0.5  0.25 0.25 0.   0.   0.   0.   0.   0.   0.  ]
h_2:[0.         0.         0.         0.33333333 0.33333333 0.33333333
 0.         0.         0.         0.        ]
h_3:[0.16666667 0.         0.         0.         0.16666667 0.
 0.16666667 0.16666667 0.16666667 0.16666667]
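The two cells above boil down to "count the sentiment words, then divide by their total". A plain-Python sketch of that, with `word_frequencies` as a hypothetical helper name:

```python
sentiment_words = ['film', 'eat', 'pizza', 'also', 'football',
                   'game', 'carl', 'play', 'find', 'bore']
bows = [
    {'like': 2, 'watch': 1, 'film': 2, 'eat': 1, 'pizza': 1, 'mary': 1},
    {'mary': 1, 'also': 1, 'like': 1, 'watch': 1, 'football': 1, 'game': 1},
    {'carl': 1, 'like': 1, 'play': 1, 'football': 1, 'find': 1, 'film': 1, 'bore': 1},
]

def word_frequencies(bow, words):
    """Return h_i: counts of the sentiment words divided by their total s_i."""
    counts = [bow.get(w, 0) for w in words]
    s_i = sum(counts)
    if s_i == 0:
        # No sentiment words in this article, so h_i is all zeros
        return [0.0] * len(words)
    return [c / s_i for c in counts]

h_demo = [word_frequencies(bow, sentiment_words) for bow in bows]
print(h_demo[0][:3])  # [0.5, 0.25, 0.25]
```

Each non-empty $\hat h_i$ sums to 1, so it can be read as the article's distribution over sentiment-charged words.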



We now have all the tools we need to estimate $O$ ($\hat O$):
$$\hat O = [\hat h_1, \hat h_2, ..., \hat h_n] \widehat W' (\widehat W \widehat W')^{-1} \quad \text{where } \widehat W =
\begin{bmatrix}
\hat p_1 & \hat p_2 & \cdots & \hat p_n \\
1- \hat p_1 & 1-\hat p_2 & \cdots & 1-\hat p_n
\end{bmatrix}$$

Finally, we set all negative entries of $\hat O$ to 0 and re-normalise each column to have unit $\ell_1$ norm.

In [137]:
# Calculates O = H W' (W W')^{-1}
p_inv = [(1-val) for val in p]
W = np.column_stack((p, p_inv))
W = W.transpose()                       # W is 2 x n
ww = np.matmul(W,W.transpose())         # W W'
w2 = np.matmul(W.transpose(), inv(ww))  # W' (W W')^{-1}
O = np.matmul(h.transpose(),w2)         # |S| x 2
print("O =\n" + str(O))
# Normalise O columns to have unit l1 norm
# (negative entries are not zeroed here, so they remain in the output below)
O = O.transpose()
O = sklearn.preprocessing.normalize(O,norm='l1')
O = O.transpose()

print("O =\n" + str(O))

O =
[[ 4.72222222e-01 -2.77777778e-01]
 [ 2.08333333e-01 -1.66666667e-01]
 [ 2.08333333e-01 -1.66666667e-01]
 [-5.55555556e-02  4.44444444e-01]
 [ 2.77555756e-17  5.00000000e-01]
 [-5.55555556e-02  4.44444444e-01]
 [ 5.55555556e-02  5.55555556e-02]
 [ 5.55555556e-02  5.55555556e-02]
 [ 5.55555556e-02  5.55555556e-02]
 [ 5.55555556e-02  5.55555556e-02]]
O =
[[ 3.86363636e-01 -1.25000000e-01]
 [ 1.70454545e-01 -7.50000000e-02]
 [ 1.70454545e-01 -7.50000000e-02]
 [-4.54545455e-02  2.00000000e-01]
 [ 2.27091073e-17  2.25000000e-01]
 [-4.54545455e-02  2.00000000e-01]
 [ 4.54545455e-02  2.50000000e-02]
 [ 4.54545455e-02  2.50000000e-02]
 [ 4.54545455e-02  2.50000000e-02]
 [ 4.54545455e-02  2.50000000e-02]]


You might notice here that the last four rows of the un-normalised $\hat O$ (the first matrix printed above) have equal entries in both columns. This is because these rows correspond to sentiment-charged words that only appear in article 3, whose $\hat p_3 = 2/3$ puts it directly in the middle of the ranking. This makes those words neutral overall in this context, evidenced by the two values being equal.
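As a cross-check, here is a plain-Python sketch of the estimator that also zeroes the negative entries before re-normalising each column. It is not the notebook's cell: `estimate_O` and `clean_O` are hypothetical helpers, the $2 \times 2$ inverse is written out in closed form, and the inputs are the $\hat h_i$ and $\hat p_i$ values printed earlier.

```python
def estimate_O(h, p):
    """Raw O = H W' (W W')^{-1}; h is a list of n vectors, p a list of n scores."""
    # Entries of the symmetric 2x2 matrix W W' = [[a, b], [b, c]]
    a = sum(pi * pi for pi in p)
    b = sum(pi * (1 - pi) for pi in p)
    c = sum((1 - pi) ** 2 for pi in p)
    det = a * c - b * b
    # Row i of W' (W W')^{-1}, using inverse = [[c, -b], [-b, a]] / det
    weights = [((pi * c - (1 - pi) * b) / det, ((1 - pi) * a - pi * b) / det) for pi in p]
    return [[sum(h[i][j] * weights[i][k] for i in range(len(h))) for k in range(2)]
            for j in range(len(h[0]))]

def clean_O(O):
    """Set negative entries to 0, then scale each column to sum to 1."""
    clipped = [[max(v, 0.0) for v in row] for row in O]
    col_sums = [sum(row[k] for row in clipped) for k in range(2)]
    return [[row[k] / col_sums[k] if col_sums[k] > 0 else 0.0 for k in range(2)]
            for row in clipped]

h_demo = [
    [0.5, 0.25, 0.25, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1/3, 1/3, 1/3, 0, 0, 0, 0],
    [1/6, 0, 0, 0, 1/6, 0, 1/6, 1/6, 1/6, 1/6],
]
p_demo = [1.0, 1/3, 2/3]  # the p values printed by the notebook

raw_O = estimate_O(h_demo, p_demo)
O_hat = clean_O(raw_O)
print(round(raw_O[0][0], 4), round(raw_O[0][1], 4))  # 0.4722 -0.2778
print(round(O_hat[0][0], 4), O_hat[0][1])            # 0.425 0.0
```

The first printed row matches the un-normalised matrix above, while the cleaned row differs from the notebook's second matrix because the negative entry is removed before normalising.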

## Scoring new articles
With our shiny new estimates for sentiment-charged words and our $\hat O$, we can now estimate sentiment of articles not in our dataset. So, let's define a new article we want to estimate.

In [81]:
article_4 = {
    'date': '2022-02-15 16:23:17.923000+00:00',
    'ticker': 'ABDN',
    # 'mrkt_info': {
    #     'open': 250.3,
    #     'close': 258.5
    # },
    'html': '<p>Josh likes to play football. He finds films tiring.</p>'
}

It's basically the same as article 3, so we would expect the predicted outcome to be neutral.

Let's turn it into a bag of words like before:

In [84]:
new_art_list = [article_4]
new_bow = {}
new_d = []
i = 1
for a in new_art_list:
    raw_html = a['html']
    if (raw_html): # when extracting articles from eikon, some articles are just pdfs, so this might not exist
        readable_text = bs(raw_html, 'lxml').get_text().lower()
        print("Text for article " + str(i) + ": '" + readable_text + "'")
        # substitute non alphabet chars (new lines become spaces)
        readable_text = re.sub(r'\n', ' ', readable_text)
        readable_text = re.sub(r'[^a-z ]', '', readable_text)
        # sub multiple spaces with one space
        readable_text = re.sub(r'\s+', ' ', readable_text)
        # tokenise text
        words = nltk.wordpunct_tokenize(readable_text)
        if len(words) > 0:
            # lemmatise, remove non-english, and remove stopwords
            lemmatizer = WordNetLemmatizer()
            lemmatised_words = []
            en_words = set(nltk.corpus.words.words())
            for w in words:
                rootword = lemmatizer.lemmatize(w, pos="v")
                if rootword not in STOP_WORDS and (rootword in en_words or not rootword.isalpha()):
                    lemmatised_words.append(rootword)
            # convert to bag of words
            bow_art = {}
            # global_bow = {l: val+1 for l in lemmatised_words for val in global_bow.get(l, 0)}
            # bow_art = {l: val+1 for l in lemmatised_words for val in global_bow.get(l, 0)}
            for l in lemmatised_words:
                if l in bow_art:
                    bow_art[l] += 1
                else:
                    bow_art[l] = 1
            new_bow = bow_art
            print("BOW for article " + str(i) + ": " + str(bow_art))
    i += 1

Text for article 1: 'josh likes to play football. he finds films tiring.'
BOW for article 1: {'josh': 1, 'like': 1, 'play': 1, 'football': 1, 'find': 1, 'film': 1, 'tire': 1}


Before we get to the estimating, we need to calculate $\hat s$ for the new article (total count of words from $\hat S$).

In [86]:
new_s = 0

for w in sentiment_words:
    new_s += new_bow.get(w, 0)

print("new s: " + str(new_s))

new s: 4


Now, we just estimate $p_4$ using **maximum likelihood estimation** (MLE) because it is statistically efficient. We also add a penalty term $\lambda \log(p(1-p))$ where $\lambda > 0$ is a tuning parameter. This is to help with limited numbers of observations and low signal-to-noise ratios, to give us:

$$\hat p = \arg \max_{p \in [0,1]} \left\{ \hat s^{-1} \sum_{j\in \hat S} d_j \log \left( p \hat O_{+,j} + (1-p) \hat O_{-,j} \right) + \lambda \log(p(1-p))\right\}$$


In [138]:
new_p = sym.symbols('p')  # symbolic p, kept for building the likelihood expression
lam = 1
terms = np.zeros(len(sentiment_words))
# enumerate keeps row i of O aligned with word j of sentiment_words
for i, j in enumerate(sentiment_words):
    # a = (new_bow.get(j,0) * math.log(new_p*O[i][0] + (1-new_p)*O[i][1]))
    d_j = new_bow.get(j,0)
    if d_j > 0:
        print("Word: " + j + " // " + str(f[j]))
        print("d_j:" + str(d_j))
        print("O+:" + str(round(O[i][0], 4)))
        print("O-:" + str(round(O[i][1], 4)))

# /new_s + lam * (new_p*(1-new_p))


print(terms)

Word: film // 0.3333333333333333
d_j:1
O+:0.3864
O-:-0.125
Word: football // 1.0
d_j:1
O+:0.0
O-:0.225
Word: play // 1.0
d_j:1
O+:0.0455
O-:0.025
Word: find // 1.0
d_j:1
O+:0.0455
O-:0.025
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


Hurrah! The $p$ value is approximately 0.68, which is really close to that of article 3 ($\hat p_3 \approx 0.67$), with which it shares virtually all of its sentiment-charged words. So even with our very sparse dataset, it's done well.
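The maximisation in the formula above can be sketched with a simple grid search over $p \in (0,1)$. Treat this as an illustration under assumptions rather than the notebook's scoring code: `score_article` is a hypothetical helper, the grid search stands in for a proper optimiser, and the `O_plus` / `O_minus` values are made-up non-negative toy numbers, not the $\hat O$ estimated earlier.

```python
import math

def score_article(counts, O_plus, O_minus, lam=1.0, grid_size=999):
    """Penalised MLE of p by grid search; counts[j] is d_j for sentiment word j."""
    s_hat = sum(counts)
    best_p, best_val = 0.5, -math.inf
    for k in range(1, grid_size + 1):
        p = k / (grid_size + 1)                # grid strictly inside (0, 1)
        val = lam * math.log(p * (1 - p))      # penalty term
        for d_j, o_plus, o_minus in zip(counts, O_plus, O_minus):
            if d_j == 0:
                continue
            mix = p * o_plus + (1 - p) * o_minus
            if mix <= 0:
                val = -math.inf                # log undefined, so rule this p out
                break
            val += d_j * math.log(mix) / s_hat
        if val > best_val:
            best_p, best_val = p, val
    return best_p

# Toy two-word vocabulary: word 0 leans positive, word 1 leans negative
O_plus_toy = [0.7, 0.3]
O_minus_toy = [0.3, 0.7]

p_balanced = score_article([1, 1], O_plus_toy, O_minus_toy)
p_positive = score_article([2, 0], O_plus_toy, O_minus_toy)
print(p_balanced)        # 0.5
print(p_positive > 0.5)  # True
```

The penalty $\lambda \log(p(1-p))$ pulls the estimate towards 0.5, so larger values of `lam` give more conservative scores.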