<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-News-Data" data-toc-modified-id="Import-News-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import News Data</a></span></li><li><span><a href="#Sentiment-Analysis" data-toc-modified-id="Sentiment-Analysis-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Sentiment Analysis</a></span><ul class="toc-item"><li><span><a href="#VADER-Dictionary" data-toc-modified-id="VADER-Dictionary-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>VADER Dictionary</a></span></li><li><span><a href="#WordNet-dictionary" data-toc-modified-id="WordNet-dictionary-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>WordNet dictionary</a></span></li><li><span><a href="#Multinomial-Naive-Bayes" data-toc-modified-id="Multinomial-Naive-Bayes-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Multinomial Naive Bayes</a></span><ul class="toc-item"><li><span><a href="#Import-training-and-validation-data" data-toc-modified-id="Import-training-and-validation-data-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Import training and validation data</a></span></li><li><span><a href="#Validation-using-articles'-content" data-toc-modified-id="Validation-using-articles'-content-2.3.2"><span class="toc-item-num">2.3.2&nbsp;&nbsp;</span>Validation using articles' content</a></span></li><li><span><a href="#Validation-using-articles'-headlines" data-toc-modified-id="Validation-using-articles'-headlines-2.3.3"><span class="toc-item-num">2.3.3&nbsp;&nbsp;</span>Validation using articles' headlines</a></span></li><li><span><a href="#Prediction" data-toc-modified-id="Prediction-2.3.4"><span class="toc-item-num">2.3.4&nbsp;&nbsp;</span>Prediction</a></span></li></ul></li><li><span><a href="#Bi-LSTM" data-toc-modified-id="Bi-LSTM-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Bi-LSTM</a></span></li><li><span><a href="#Notes" data-toc-modified-id="Notes-2.5"><span 
class="toc-item-num">2.5&nbsp;&nbsp;</span>Notes</a></span></li></ul></li></ul></div>

In [1]:
# Import libraries 
import numpy as np
import pandas as pd
import os

from datetime import datetime

# Import News Data

In [2]:
zip_lst = ['Data1.zip', 'Data2.zip', 'Data3.zip']
# Read every archive and stack the rows into one DataFrame
df_news = pd.concat([pd.read_csv(zip_dt) for zip_dt in zip_lst], axis=0)

# Remove entries without content
df_news['content'] = df_news['content'].replace('empty string', np.nan)
df_news = df_news.dropna(subset=['content'])

# Reset index
df_news.reset_index(drop=True, inplace=True)

df_news.shape

(38515, 7)

In [3]:
df_news.head()

Unnamed: 0,ticker,date,time,headline,news,content,site
0,MMM,Feb-19-21,08:11AM,"Did You Acquire 3M (MMM) Before February 9, 20...",PR Newswire,"SAN DIEGO, Feb. 19, 2021 /PRNewswire/ -- Johns...",finance.yahoo.com
1,MMM,Feb-18-21,11:00AM,3M launches New 3M Clean & Protect Certified B...,PR Newswire,The new four-part program promotes cleanliness...,finance.yahoo.com
2,MMM,Feb-17-21,09:37AM,ROCE Insights For 3M,Benzinga,3M (NYSE:MMM) posted a 3.14% decrease in earni...,finance.yahoo.com
3,MMM,Feb-16-21,04:18PM,Why 3M is spending $1 billion to help improve ...,Yahoo Finance,3M Chairman and CEO Mike Roman said it's time ...,finance.yahoo.com
4,MMM,Feb-16-21,03:58PM,3M CEO on the company's plans to invest $1B to...,Yahoo Finance Video,"Mike Roman, 3M Chairman and CEO, joined Yahoo ...",finance.yahoo.com


In [4]:
# Random sampling for manual labelling 
# df_sample = df_news.sample(100)
# df_sample.to_csv('data/news_sample.csv')

In [5]:
# Import labelled validation data 
valid = pd.read_csv('data/news_sample.csv', index_col=0)
valid.head()

Unnamed: 0,ticker,date,time,headline,news,content,site,sent
3174,PM,Feb-04-21,10:33AM,Philip Morris (PM) Q4 Earnings Beat Estimates ...,Zacks,Philip Morris International Inc. PM reported f...,finance.yahoo.com,1
8657,CTLT,Sep-09-20,10:14AM,Radius Completes Enrollment in Phase III Osteo...,Zacks,"Radius Health, Inc. RDUS announced the complet...",finance.yahoo.com,1
16599,ZTS,Feb-12-21,08:38AM,Is a Surprise Coming for Zoetis (ZTS) This Ear...,Zacks,Investors are always looking for stocks that a...,finance.yahoo.com,1
5695,AZO,Sep-18-20,06:25AM,If You Had Bought AutoZone (NYSE:AZO) Shares T...,Simply Wall St.,"The worst result, after buying shares in a com...",finance.yahoo.com,1
7327,SEE,May-28-20,09:08AM,Sealed Air (SEE) Stock Down 18% YTD: Will it B...,Zacks,Sealed Air Corporation’s SEE focus on innovati...,finance.yahoo.com,0


# Sentiment Analysis

## VADER Dictionary 

- Notes 
    - Removing stopwords and punctuation is not required: VADER uses punctuation (e.g. exclamation marks) and capitalization as intensity cues.  
    - Source: [Link](https://stackoverflow.com/questions/45296897/is-there-a-way-to-improve-performance-of-nltk-sentiment-vader-sentiment-analyser)

In [6]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

In [7]:
# Predict sentiment score using VADER
sid = SentimentIntensityAnalyzer()
sent = valid['content'].apply(sid.polarity_scores)
sent = pd.Series([x['compound'] for x in sent]) # get only compound score
sent.describe()

count    100.000000
mean       0.874052
std        0.370407
min       -0.947700
25%        0.967550
50%        0.995700
75%        0.997925
max        0.999900
dtype: float64

In [8]:
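# Calculate correlation between pred and true 
# Returns NaN, most likely because of index alignment: `sent` was rebuilt with a
# fresh 0..99 index while valid['sent'] keeps the original row labels, so pandas
# finds too few matching pairs to correlate
sent.corr(valid['sent'], method='spearman')

# Check that neither series has zero variance (which would also give NaN)
from statistics import variance 
import math
math.sqrt(variance(sent))
math.sqrt(variance(valid['sent']))

# Carry out same execution using scipy, which pairs values by position 
from scipy.stats import spearmanr
spearmanr(sent, valid['sent'])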

SpearmanrResult(correlation=0.3206755922037266, pvalue=0.0011427719959698362)
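
A likely explanation for the NaN returned by `Series.corr` in the cell above is index alignment: `sent` is built from a plain list and gets a fresh 0..99 index, whereas `valid['sent']` keeps the row labels of the sampled articles, and pandas pairs values by label before correlating. The sketch below reproduces the effect on made-up numbers (the values and row labels are illustrative, not taken from the data).

```python
import math
import pandas as pd

# Toy predictions with a default RangeIndex (0, 1, 2, 3)
pred = pd.Series([0.1, 0.5, 0.9, 0.3])
# Toy labels that keep non-contiguous row labels, as a sampled DataFrame would
true = pd.Series([0, 1, 1, 0], index=[3174, 8657, 16599, 5695])

# pandas aligns on index labels first; no labels match, so the result is NaN
misaligned = pred.corr(true, method='spearman')

# Pairing by position (dropping the row labels) gives a defined correlation
aligned = pred.corr(true.reset_index(drop=True), method='spearman')

print(math.isnan(misaligned), round(aligned, 3))
```

`scipy.stats.spearmanr` does not align on labels, which would explain why it returns a value for the same inputs.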

## WordNet dictionary

- A large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets)   
- Details: https://www.notion.so/WordNet-SentiWordNet-e783be8f64484b899726ab5026ba63f3

In [9]:
import nltk
import re

from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [10]:
# Define functions for text preprocessing 

def remove_punct(x):
    '''Remove punctuation from a string'''
    # Raw string avoids the invalid "\-" escape warning in recent Python versions
    special_chars_p = r"""[.,;®'&$’"\-()#@!?/:]"""
    s1 = re.sub(special_chars_p, '', x)
    return s1

# NLTK stopwords are lowercase, so matching is case-sensitive
stopword = set(stopwords.words('english'))
def remove_stopwords(x):
    '''Remove stopwords in a string'''
    s1 = [word for word in x.split() if word not in stopword]
    return ' '.join(s1)

def get_wordnet_pos(treebank_tag):
    '''Map a Penn Treebank POS tag to the matching WordNet POS constant'''
    # nltk.pos_tag returns Treebank tags, which WordNet does not accept,
    # so a conversion of tags is necessary
    # Source: https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python
    
    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('N'):
        return wn.NOUN
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        return None     

In [11]:
# Define a function for sentiment analysis

lemmatizer = WordNetLemmatizer()

def analyze_sentiment(news):
    '''Calculate sentiment score of articles using WordNet'''   
    sentiments = []
    
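    # NOTE: word_tokenize splits the article into words, not sentences, so each
    # "sentence" below is a single token that gets POS-tagged without context.
    # nltk.sent_tokenize(news) would yield sentences; the scores reported below
    # were produced with word_tokenize.
    sentences = nltk.word_tokenize(news)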
          
    for sentence in sentences: 
        # Remove punctuations 
        sentence = remove_punct(sentence)
        # not removing stopwords to keep the sentence structure 
        
        # Tokenize the sentence into words 
        sentence = nltk.word_tokenize(sentence)
        
        # Get pos tags
        tags = nltk.pos_tag(sentence)
                
        for word, tag in tags:
            # Convert nltk pos tag into wordnet tag 
            wn_tag = get_wordnet_pos(tag)
            if not wn_tag: # Skip if tag is None 
                continue 
            
            # Lemmatize the word 
            lemma = lemmatizer.lemmatize(word, wn_tag)
            if not lemma: # Skip if lemma is None 
                continue 
            
            # Get synset for the word from wordnet 
            synsets = wn.synsets(lemma, wn_tag)
            if not synsets: # Skip if no synset is found 
                continue 
            synset = synsets[0] # take the first listed sense of the word 
            
            # Get sentiment score from sentiwordnet
            swn_synset = swn.senti_synset(synset.name())
            sentiments.append(swn_synset.pos_score() - swn_synset.neg_score())
    
    # Average the word-level scores; return 0 when no word could be scored
    return sum(sentiments) / len(sentiments) if sentiments else 0.0

# Predict sentiment score using WordNet
sent = valid['content'].apply(analyze_sentiment)
print(sent.describe())

count    100.000000
mean       0.025060
std        0.019332
min       -0.029639
25%        0.015455
50%        0.023944
75%        0.035819
max        0.089286
Name: content, dtype: float64
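
The article score computed by `analyze_sentiment` is the mean of (positive score minus negative score) over all words that have a synset. A minimal sketch of that averaging step, using a made-up three-word lexicon in place of SentiWordNet (the words and scores are hypothetical), suggests why the scores above cluster near zero: words with no polarity still count in the denominator.

```python
# Hypothetical mini-lexicon of (positive, negative) scores; real values
# would come from SentiWordNet via swn.senti_synset(...)
toy_lexicon = {
    "beat": (0.5, 0.0),
    "decline": (0.0, 0.625),
    "report": (0.0, 0.0),
}

def toy_article_score(words):
    """Mean of (pos - neg) over the words found in the lexicon."""
    scores = [toy_lexicon[w][0] - toy_lexicon[w][1]
              for w in words if w in toy_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

# "earnings" is not in the lexicon and is skipped, like a word without synsets
score = toy_article_score(["earnings", "beat", "report", "decline"])
print(score)
```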


In [12]:
# Calculate correlation between pred and true 
sent.corr(valid['sent'], method='spearman')

0.30821886857806274

In [13]:
# Same execution using scipy 
spearmanr(sent, valid['sent'])

SpearmanrResult(correlation=0.30821886857806274, pvalue=0.001809982844963061)

## Multinomial Naive Bayes 

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import f1_score, mean_squared_error

### Import training and validation data

In [15]:
# Import training data 
# source: https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news
train = pd.read_csv('data/news_classified_training.csv', 
                    header=None, 
                    names=['sent', 'text'])

# Change str to numerical for sent
dict_map = {'neutral':0, 'negative':-1, 'positive':1}
train['sent'] = train['sent'].map(dict_map)
print(train['sent'].value_counts())

# Import validation data 
valid = pd.read_csv('data/news_sample.csv', index_col=0)
print(valid['sent'].value_counts())

 0    2879
 1    1363
-1     604
Name: sent, dtype: int64
 1    61
 0    27
-1    12
Name: sent, dtype: int64
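
The two label distributions differ: neutral is the largest class in the training data, positive in the validation sample. As a rough reference point for the F1 scores that follow, the sketch below computes the share of the largest class in each set from the counts printed above; a classifier that always predicted that class would reach this accuracy on that set.

```python
# Class counts copied from the value_counts() output above
train_counts = {0: 2879, 1: 1363, -1: 604}
valid_counts = {1: 61, 0: 27, -1: 12}

def majority_share(counts):
    """Return the most frequent class and its share of all samples."""
    label = max(counts, key=counts.get)
    return label, counts[label] / sum(counts.values())

train_label, train_share = majority_share(train_counts)
valid_label, valid_share = majority_share(valid_counts)

print(train_label, round(train_share, 3))
print(valid_label, round(valid_share, 3))
```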


### Validation using articles' content

In [16]:
# Vectorize text
# Way 1: Word count vector
vec_count = CountVectorizer(stop_words='english')
vec_count.fit(train['text'])
X_train_count = vec_count.transform(train['text'])

# Train MNB model 
y_train = train['sent']
clf = MultinomialNB()
clf.fit(X_train_count, y_train)

# Assess model performance using f1 score
X_valid_count = vec_count.transform(valid['content'])
y_pred = clf.predict(X_valid_count)
y_valid = valid['sent']
print('F1 score for word count vectorizer:', f1_score(y_valid, y_pred, average='micro'))

# Way 2: TF-IDF vector
vec_tfidf = TfidfVectorizer(stop_words='english')
X_train_tfidf = vec_tfidf.fit_transform(train['text'])

# Train MNB model 
y_train = train['sent']
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

# Assess model performance using f1 score
X_valid_tfidf = vec_tfidf.transform(valid['content'])
y_pred = clf.predict(X_valid_tfidf)
print('F1 score for tfidf vectorizer:', f1_score(y_valid, y_pred, average='micro'))

F1 score for word count vectorizer: 0.47
F1 score for tfidf vectorizer: 0.35
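
For a single-label multiclass problem, `average='micro'` pools all decisions, so the micro-averaged F1 equals plain accuracy. The sketch below checks this on made-up labels and also prints the macro average, which weights the three classes equally and may be more informative when classes are imbalanced.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration only
y_true = [1, 1, 1, 0, 0, -1, 1, 0, -1, 1]
y_hat = [1, 0, 1, 0, 1, 1, 1, 0, -1, 0]

micro = f1_score(y_true, y_hat, average='micro')
macro = f1_score(y_true, y_hat, average='macro')
acc = accuracy_score(y_true, y_hat)

print(micro, acc, macro)
```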


### Validation using articles' headlines

In [17]:
# Vectorize text
# Way 1: Word count vector
vec_count = CountVectorizer(stop_words='english')
vec_count.fit(train['text'])
X_train_count = vec_count.transform(train['text'])

# Train MNB model 
y_train = train['sent']
clf = MultinomialNB()
clf.fit(X_train_count, y_train)

# Assess model performance using f1 score
X_valid_count = vec_count.transform(valid['headline'])
y_pred = clf.predict(X_valid_count)
y_valid = valid['sent']
print('F1 score for word count vectorizer:', f1_score(y_valid, y_pred, average='micro'))

# Way 2: TF-IDF vector
vec_tfidf = TfidfVectorizer(stop_words='english')
X_train_tfidf = vec_tfidf.fit_transform(train['text'])

# Train MNB model 
y_train = train['sent']
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

# Assess model performance using f1 score
X_valid_tfidf = vec_tfidf.transform(valid['headline'])
y_pred = clf.predict(X_valid_tfidf)
print('F1 score for tfidf vectorizer:', f1_score(y_valid, y_pred, average='micro'))

F1 score for word count vectorizer: 0.37
F1 score for tfidf vectorizer: 0.35
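
The two validation cells repeat the same vectorize, fit, and predict steps. One way to avoid the repetition is scikit-learn's `make_pipeline`, which fits the vectorizer and the classifier together. This is a minimal sketch on made-up sentences, not the notebook's data; `stop_words` is left out so that no toy word is filtered away.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training sentences (made up), labelled like the training data
texts = ["profit rises strongly", "loss widens sharply", "board holds meeting"]
labels = [1, -1, 0]

# Vectorizer and classifier are fitted together and applied in order
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(texts, labels)

# Raw strings go in; the pipeline vectorizes them before predicting
pred = pipe.predict(["profit rises", "loss widens"])
print(pred)
```

Swapping `CountVectorizer()` for `TfidfVectorizer()` would give the second variant with a one-line change.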


### Prediction 

In [18]:
# Retrain the model with both training and validation data 

# Concatenate train and validation data 
# (Series.append was removed in pandas 2.0, so pd.concat is used instead)
X_train = pd.concat([train['text'], valid['content']], ignore_index=True)
y_train = pd.concat([train['sent'], valid['sent']], ignore_index=True)

# Use word count vectorizer (higher f1 score)
vec_count = CountVectorizer(stop_words='english')
vec_count.fit(X_train)
X_train = vec_count.transform(X_train)

# Train MNB model
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Make predictions 
X_test = vec_count.transform(df_news['content'])
y_pred = clf.predict(X_test)
df_news['sent'] = y_pred

In [19]:
# Check distribution of predictions 
print(df_news['sent'].value_counts())

# Examine the accuracy of negative articles 
df_news.loc[df_news['sent']==-1, ['headline', 'content', 'sent']]

 1    27187
 0    11072
-1      256
Name: sent, dtype: int64


Unnamed: 0,headline,content,sent
1042,FAA chief vows tough line after some Trump sup...,"By David ShepardsonWASHINGTON, Jan 9 (Reuters)...",-1
1043,RPT-Alaska Airlines puts 14 people on no-fly l...,(Repeating to fix formatting)Jan 8 (Reuters) -...,-1
1044,Alaska Airlines puts 14 people on no-fly list ...,Jan 8 (Reuters) - Alaska Airlines said on Frid...,-1
1106,GameStop fizzles as stock falls below $50 a share,GameStop (GME) shares continue to decline as t...,-1
1632,Michael Burry's Top 5 Trades of the 3rd Quarter,"- By James LiMichael Burry, the investor famou...",-1
...,...,...,...
37346,Investment Note: WestRock Looks Like a Bargain,This article first appeared on GuruFocus.Warni...,-1
37780,Sheldon Adelsons Last Roll of the Dice May Be ...,(Bloomberg Opinion) -- No one ever said that t...,-1
37918,Carl Icahn Discards HP Stake in the 2nd Quarter,"Carl Icahn (Trades, Portfolio), board chairman...",-1
37919,MarineMax and 3 Other Stocks Shine on Price-Ca...,This article first appeared on GuruFocus.HZO 3...,-1


Notes:

1. Noise in the articles' content:
    - advertisement text (at the end of every article)
    - information about other stocks (e.g. analysis for 5 stocks)
2. Nevertheless, using the articles' content instead of the headlines improves the word count model's F1 score (0.47 vs 0.37); the TF-IDF model scores 0.35 on both.

## Bi-LSTM 


In [27]:
import tensorflow
from tensorflow import keras

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import OneHotEncoder

In [21]:
# One hot encode variable "sentiment"
enc = OneHotEncoder()
y = enc.fit_transform(np.array(train['sent']).reshape(-1, 1)).toarray()
cols = enc.get_feature_names_out(['sent'])  # get_feature_names was removed in scikit-learn 1.2

y = pd.DataFrame(y, columns=cols)
y.head()

Unnamed: 0,sent_-1,sent_0,sent_1
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,1.0,0.0,0.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0


In [22]:
# Split train into train and validation data
X_train, X_test, y_train, y_test = train_test_split(train['text'], y, 
                                                    test_size=0.2, random_state=42, 
                                                    stratify=train['sent'])

# Preprocess text data 
X_train = X_train.apply(remove_punct)
X_train = X_train.apply(remove_stopwords)
X_train = X_train.str.lower()

X_test = X_test.apply(remove_punct)
X_test = X_test.apply(remove_stopwords)
X_test = X_test.str.lower()

In [25]:
max_features = 5000  # Only consider the top 5k words
maxlen = 1000

inputs = keras.Input(shape=(None,), dtype="int32")

# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)

# Add 2 bidirectional LSTMs
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)

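# Add a classifier
# (sigmoid scores each class independently; softmax is the usual choice for
# single-label 3-class output trained with categorical_crossentropy)
outputs = layers.Dense(3, activation="sigmoid")(x)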
model = keras.Model(inputs, outputs)
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding (Embedding)        (None, None, 128)         640000    
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         98816     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               98816     
_________________________________________________________________
dense (Dense)                (None, 3)                 387       
Total params: 838,019
Trainable params: 838,019
Non-trainable params: 0
_________________________________________________________________
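
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and doubles it for the forward and backward directions.

```python
max_features, embed_dim, units, n_classes = 5000, 128, 64, 3

def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

embedding = max_features * embed_dim
bilstm_1 = 2 * lstm_params(embed_dim, units)   # forward + backward
bilstm_2 = 2 * lstm_params(2 * units, units)   # input is the 128-d concatenation
dense = 2 * units * n_classes + n_classes      # weights + biases

total = embedding + bilstm_1 + bilstm_2 + dense
print(embedding, bilstm_1, bilstm_2, dense, total)
```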


In [26]:
# Transform X for model training and prediction 

# Tokenization 
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(X_train)

# Convert into sequences
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

# pad_sequences is already imported above
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)
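
By default `pad_sequences` pads and truncates at the front (`padding='pre'`, `truncating='pre'`). With `maxlen=1000`, short training sentences therefore become mostly leading zeros, while a long article keeps only its last 1000 tokens. The helper below is a pure-Python imitation of that default behaviour for illustration (`pad_pre` is a hypothetical name, not a Keras function).

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults: pad and truncate at the front."""
    seq = seq[-maxlen:]                       # keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([4, 7, 2], maxlen=5)              # shorter than maxlen: zeros in front
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)   # longer than maxlen: head is dropped

print(short, long)
```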

In [None]:
# Train the model (took 7 mins to run)

# model.compile("adam", "categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=32, epochs=2, validation_data=(X_test, y_test))
# model.save('bilstm_model.h5')  

In [28]:
# Load model 
model = load_model('bilstm_model.h5')

In [29]:
# Transform validation data 
X_valid = valid['content'].apply(remove_punct)
X_valid = X_valid.apply(remove_stopwords)
X_valid = X_valid.str.lower()
X_valid = tokenizer.texts_to_sequences(X_valid)
X_valid = pad_sequences(X_valid, maxlen=maxlen)

In [30]:
# Assess model performance 
preds = model.predict(X_valid) # each prediction has three values, one per class 
labels = np.array([-1, 0, 1]) # same order as the one-hot columns sent_-1, sent_0, sent_1
y_pred = labels[np.argmax(preds, axis=1)] # class with the highest score
    
y_valid = valid['sent']
f1_score(y_valid, y_pred, average='micro')

0.45
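
Predictions are decoded by taking the class with the highest output. Because sigmoid and softmax are both increasing in the underlying logit, the decoded class for a given set of logits is the same under either activation; only softmax yields scores that sum to one. A small numpy sketch on made-up logits:

```python
import numpy as np

labels = np.array([-1, 0, 1])  # same order as the one-hot columns

# Made-up pre-activation outputs (logits) for two articles
logits = np.array([[0.2, -1.0, 1.5],
                   [2.0, 0.3, -0.7]])

sigmoid = 1.0 / (1.0 + np.exp(-logits))
softmax = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# argmax picks the same column under both activations
pred_sigmoid = labels[np.argmax(sigmoid, axis=1)]
pred_softmax = labels[np.argmax(softmax, axis=1)]

print(pred_sigmoid, pred_softmax, softmax.sum(axis=1))
```

The choice of activation still matters during training, since it changes the loss and the gradients.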

## Notes
1. Could try BERT
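2. Compare the models on a standardized metric (VADER and WordNet were evaluated with Spearman correlation, MNB and Bi-LSTM with micro-averaged F1)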
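
One possible way to put the dictionary-based scores on the same footing as the classifiers is to map each continuous score to -1, 0 or 1 and then compute the same accuracy-style metric. The sketch below uses a cutoff of 0.05, the value commonly suggested for VADER's compound score; this is an assumption here, and the much smaller WordNet scores would need a different cutoff. The scores and labels are made up.

```python
def to_class(score, threshold=0.05):
    """Map a continuous sentiment score to -1, 0 or 1."""
    if score >= threshold:
        return 1
    if score <= -threshold:
        return -1
    return 0

# Made-up scores and labels for illustration only
scores = [0.99, -0.40, 0.02, 0.75, -0.01]
y_true = [1, -1, 0, 0, 0]

y_hat = [to_class(s) for s in scores]
# Share of correct predictions, equal to micro-averaged F1 for single-label data
accuracy = sum(p == t for p, t in zip(y_hat, y_true)) / len(y_true)

print(y_hat, accuracy)
```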