## Changes from default:

## Effect of changes:
N/A

## 0. Overview

"Quantifying unstructured and noisy textual content is complex and involves numerous methodological issues related to the preprocessing of the data and the optimization of the algorithm used to quantify textual content. The number of **text preprocessing** that can be implemented is numerous (**lowercase**, **stemming**, **lemmatization**, part-of-speech tagging, **stopwords removal**, **punctuation removal**, etc.) and it is not easy to identify which transformation increases (decreases) the accuracy of the classification. The same is true for the choice of the algorithm: the large number of algorithms (Naive Bayes, SVM, logistic regression, random forest, multilayer perceptron, etc.) and the even greater number of hyperparameters for each algorithm lead to an immense number of combinations. Furthermore, the answers relative to those methodological issues strongly depend on the type of data used (informal or formal content, short or long text), on the size of the dataset (few hundreds or millions of documents), on the availability of pre-classified messages (supervised or unsupervised learning), and on the type of documents (domain-specific or generic documents). While there is no one-fits-all solution, we nonetheless believe that some guidance and tips can help researchers to avoid common mistakes."

Renault, Thomas (2020). Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages. Digital Finance, 2. doi:10.1007/s42521-019-00014-x, p. 2.

## 1. Import

In [1]:
#import nltk
#nltk.download('all-corpora')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns

from nltk.stem.porter import PorterStemmer
from nltk import word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words_set = set(stopwords.words('english'))

## 2. Load data
Current status: Default (~5000 tweets)

In [2]:
# Change this in testing.
data = pd.read_csv('stock_data.csv')


## 3. Define Clean and Lower

In [3]:
def clean_and_lower(text):
    '''
    Input:
        A string of messy data
    Output:
        A string of clean lowercase data
    
    1) Replacing everything that isn't a letter with a space (special characters, numbers)
    2) Sending all text to lowercase
    '''
    clean_text = re.sub('[^a-zA-Z]'," ", text)
    clean_text = clean_text.lower()
    return clean_text
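
As a quick check of what this step keeps and discards, the sketch below reapplies the same regular expression to two strings. The first string comes from the note in section 6; the second is made up for illustration. It suggests that "over" survives cleaning, so any later loss of that word would come from a different step.

```python
import re

def clean_and_lower(text):
    # Same logic as the cell above: non-letters become spaces, then lowercase.
    return re.sub('[^a-zA-Z]', ' ', text).lower()

sample = "MNTA Over 12"
cleaned = clean_and_lower(sample)
print(repr(cleaned))      # digits are replaced by spaces, not deleted
print(cleaned.split())    # ['mnta', 'over']

# Made-up example: cashtags, percentages and punctuation all collapse to spaces.
print(clean_and_lower("$AAPL up 3% today!").split())   # ['aapl', 'up', 'today']
```

Note that numbers and symbols such as `$` and `%` are dropped entirely, which may matter for financial text.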



## 4. Define Tokenize
Current status: Default

In [4]:
def tokenize_data(text):
    '''
    Input:
        A string of cleaned data
    Output:
        A list of the words in the string
        
    Will split the string into tokens using NLTK's recommended word tokenizer.
    Not sure how different it is compared to string.split(' ')
    
    There is also a TweetTokenizer in nltk.tokenize, maybe something to look into if we want to deal with 
        emojis instead of replacing them with spaces.
    '''
    return word_tokenize(text, language='english')

## 5. Define Lemmatize and Stopword removal
Current status: Lemmatization on

Stopword removal off (`remove_stop_words=False` is the default, and section 6 calls the pipeline with the default)

In [5]:
def lemma_and_stopwords(text, remove_stop_words=False):
    '''
    Input:
        List of words that have been cleaned and tokenized
    Output:
        A string that is supposed to represent the meaning of the original sentence, as reduced as possible
        
        
    From https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
    "
        Lemmatization is the process of converting a word to its base form. The difference 
            between stemming and lemmatization is, lemmatization considers the context and 
            converts the word to its meaningful base form, whereas stemming just removes the 
            last few characters, often leading to incorrect meanings and spelling errors.

        For example, lemmatization would correctly identify the base form of ‘caring’ to ‘care’, 
            whereas, stemming would cutoff the ‘ing’ part and convert it to car.

        ‘Caring’ -> Lemmatization -> ‘Care’
        ‘Caring’ -> Stemming -> ‘Car’
    "
    
    
    Since the original paper also mentioned stop words, I made sure the word is not a stop word
    before trying to convert it. I ran it with and without the stop word and I didn't notice much difference
    but maybe it's something to fiddle with later.
    
    
    EDIT 1: The paper also mentioned 'part-of-speech' tagging, which seems to attach context to each word to help
    convert its meaning properly. Since this seemed more complicated, and this isn't a text processing project,
    I didn't do it. 
    However there is an entire coded example (example 3) here: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
    so it might be quick to implement and see if it improves our classification scores after
    
    EDIT 2:
    About stop words, from the paper: 
        We also find that removing stopwords using the NLTK stopwords corpus significantly decreases the accuracy 
        of the classification. We believe that this result is due to the fact that the stopwords corpus from NLTK includes 
        words that could be very useful for sentiment analysis in finance such as “up”, “down”, “below” or “above”. Thus, 
        researchers should not use the standard NLTK list and should consider a more restrictive list of stopwords for 
        sentiment analysis (“a”, “an”, “the”...). This result is consistent with Saif et al (2014) who show that Naive 
        Bayes classifiers are more sensitive to stopword removal and that using pre-existing lists of stopwords negatively 
        impacts the performance of sentiment classification for short-messages posted on social media.
    '''
    if remove_stop_words:
        list_of_words = [lemmatizer.lemmatize(word) for word in text if word not in stop_words_set]
    else:
        list_of_words = [lemmatizer.lemmatize(word) for word in text]
        
    string_of_words = " ".join(list_of_words)
    return string_of_words
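
The passage quoted in EDIT 2 recommends a more restrictive stopword list than the NLTK default. A minimal sketch of that idea follows. `MINIMAL_STOP_WORDS` and `remove_minimal_stopwords` are hypothetical names, only "a", "an" and "the" come from the quoted passage, and lemmatization is left out so the example has no dependencies.

```python
# Hypothetical minimal stopword list; only "a", "an", "the" come from the quoted passage.
MINIMAL_STOP_WORDS = {"a", "an", "the"}

def remove_minimal_stopwords(tokens, stop_words=MINIMAL_STOP_WORDS):
    """Drop only the tokens found in a small, hand-picked stopword set."""
    return [tok for tok in tokens if tok not in stop_words]

# Made-up token list: directional words such as "up" and "over" are kept.
tokens = ["the", "stock", "is", "up", "over", "a", "key", "level"]
filtered = remove_minimal_stopwords(tokens)
print(filtered)   # ['stock', 'is', 'up', 'over', 'key', 'level']
```

Whether such a list actually improves the scores here would need to be checked on this dataset.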

## 6. Final preprocessing

In [6]:
def prep_all_data(data, remove_stop_words=False):
    '''
    Input:
        The column of the dataframe that contains text
    Output:
        A list of strings that has been through:
            clean_data = clean_and_lower(data)
            token_data = tokenize_data(clean_data)
            lemma_data = lemma_and_stopwords(token_data)
    '''
    list_of_nice_strings = []
    for i in data:
        clean_string = clean_and_lower(i)
        token_list = tokenize_data(clean_string)
        lemma_string = lemma_and_stopwords(token_list, remove_stop_words=remove_stop_words)
        list_of_nice_strings.append(lemma_string)
        
    return list_of_nice_strings

# Note: Just eyeballing it, it looks like some words that are useful are being thrown out
# I.e in fourth row (index 3) it goes from "MNTA Over 12" --> "mnta". It seems like that is a bullish tweet, but something
# is throwing away the "over".
# EDIT: I just tried without "stop words" and it recovered the "Over"... Best to try both

clean_data = prep_all_data(data['Text'])


## 7. Vectorization
Current status:

Vectorization: CountVectorizer

n-grams: 1 and 2

In [7]:
def text_to_num(clean_data):
    '''
    Input: 
        cleaned list of strings
    Output:
        Document-term matrix of n-gram counts (non-negative integers, not just 0's and 1's)
    
    From: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
    More info: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    
    Text Analysis is a major application field for machine learning algorithms. However,
        the raw data, a sequence of symbols cannot be fed directly to the algorithms 
        themselves as most of them expect numerical feature vectors with a fixed size 
        rather than the raw text documents with variable length.
    ...       
    We call vectorization the general process of turning a collection of text documents into numerical 
        feature vectors. This specific strategy (tokenization, counting and normalization) is called 
        the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences 
        while completely ignoring the relative position information of the words in the document.
    '''
    from sklearn.feature_extraction.text import CountVectorizer
    
    # As usual, we may need to cross validate this to determine the optimal representation
    # Cross validation link: https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py
    # max_features = ?
    # ngram_range = ?
    # fit_transform: Learn the vocabulary dictionary and return document-term matrix.
    # EDIT: He has results in his paper about choosing good parameters for ngrams

    vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=3)
    document_term_matrix = vectorizer.fit_transform(clean_data).toarray()
    return document_term_matrix
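
To make the `ngram_range=(1, 2)` setting concrete, here is a small sketch on a made-up two-document corpus. It leaves out `min_df=3` (the toy corpus is too small for it) and keeps the result sparse until printing. The corpus and variable names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["aap up up", "aap down"]  # toy corpus, made up for illustration

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
dtm = vectorizer.fit_transform(docs)              # SciPy sparse matrix of counts

# Column order follows the indices stored in vocabulary_.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)          # ['aap', 'aap down', 'aap up', 'down', 'up', 'up up']
print(dtm.toarray())  # [[1 0 1 0 2 1], [1 1 0 1 0 0]]
```

The `2` in the first row shows that the entries are counts. If binary indicators are wanted instead, `CountVectorizer` accepts `binary=True`.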
    

## 8. Dimension reduction
Current status:

No dimension reduction.

In [8]:
# None by default.
# When we add this, we'll do it off the X matrix here.

X = pd.DataFrame(clean_data)
# X = text_to_num(clean_data)
# for the word2vec model, X is transformed later (after the train/test split)
y = data["Sentiment"]
y = y.replace(-1, 0)


## 8.5 Word2Vec Steps

In [38]:
def word2vec_transform_features(clean_data2, model, size, same_origin=False):
    # clean_data2: list of lists; each inner list holds the words of one sentence
    # model: a gensim word2vec model that can be indexed by word
    # size: length of each word vector
    # Some words may be missing from the model's vocabulary (e.g. test-set words when the
    # model was trained on the training set); these raise KeyError and are skipped.
    if same_origin:
        print('model comes from the data')
    else:
        print('model is a pretrained model')

    vec_list = []
    for sentence in clean_data2:
        curr_sentence_vecs = []
        in_count = 0
        notin_count = 0
        for word in sentence:
            try:
                vec = model[word].reshape(-1,1)
                curr_sentence_vecs.append(vec)
                in_count += 1
            except KeyError:
                # print(f"Not in dictionary: {word}")
                notin_count += 1
        if len(curr_sentence_vecs)>0:
            vec_list.append(curr_sentence_vecs)
        else:
            vec_list.append([np.zeros(size).reshape(-1,1)])
    vec_list2 = [np.concatenate(x, axis=1) for x in vec_list] #combine vectors to a matrix
    vec_list3 = [np.sum(x, axis=1) for x in vec_list2] # sum the vectors as a vector to represent the sentence
    vec_list4 = [x.reshape(1,-1) for x in vec_list3]
    feature_mat = np.concatenate(vec_list4, axis=0)
    
    return feature_mat
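
The function above represents each sentence as the sum of its word vectors, with a zero vector when no word is known. The sketch below shows the same idea in compact form. A plain dict stands in for the gensim model, and `sum_word_vectors` and the 3-dimensional toy vectors are hypothetical. It uses `in` instead of catching `KeyError`, on the assumption that the real model also supports membership tests.

```python
import numpy as np

def sum_word_vectors(sentences, vectors, size):
    """Sum the vectors of known words per sentence; unknown words are skipped."""
    rows = []
    for sentence in sentences:
        known = [vectors[w] for w in sentence if w in vectors]
        # A sentence with no known words becomes a zero vector.
        rows.append(np.sum(known, axis=0) if known else np.zeros(size))
    return np.vstack(rows)

# Toy 3-dimensional "embeddings"; a dict stands in for the word2vec model.
toy_vectors = {
    "aap": np.array([1.0, 0.0, 0.0]),
    "up": np.array([0.0, 1.0, 0.0]),
}
features = sum_word_vectors([["aap", "up", "up"], ["unknownword"]], toy_vectors, size=3)
print(features)   # [[1. 2. 0.], [0. 0. 0.]]
```

Because the vectors are summed rather than averaged, longer tweets produce larger feature values. Averaging is a common alternative that might be worth comparing.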

In [10]:
#functions


## 9. Training/Testing split

In [23]:
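import gensim.downloader as api  # needed here; the notebook only imports it in a later cell (In [54])

size = 300
#simple_model = gensim.models.Word2Vec(X_train_l2,size=size)
model = api.load("word2vec-google-news-300")  # download the model and return as object ready for use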

In [49]:
from sklearn.model_selection import train_test_split


print("X Shape: {}".format(X.shape))
# Should we do PCA on this? With CountVectorizer features X is (5791, 8330)....
# from sklearn.decomposition import PCA
# n_components=250
# pca = PCA(n_components=n_components) 
# X_reduced = pca.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2021)


X Shape: (5791, 1)


In [50]:
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [51]:
X_train_word =X_train 
X_test_word = X_test 

In [52]:
# convert the dataframe of sentences back to a list for use with the word2vec model
def df2list(X):
    result = [X.iloc[i,0] for i in range(len(X))]
    return result

X_train_l = df2list(X_train)
X_test_l = df2list(X_test)

In [53]:
def sent2list(clean_data):
    clean_data2 = [x.split(' ') for x in clean_data]
    return clean_data2

X_train_l2 = sent2list(X_train_l)
X_test_l2 = sent2list(X_test_l)
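
One corner case of `split(' ')` is worth knowing: a tweet that is empty after cleaning becomes a list containing one empty string, not an empty list. The short sketch below illustrates this with made-up strings. In the pipeline above that empty token would presumably be treated as an unknown word, so the sentence would still end up as a zero vector.

```python
# split(' ') splits on every single space; split() splits on runs of whitespace.
normal = "aap up".split(' ')
empty_with_sep = "".split(' ')
empty_no_sep = "".split()
double_space = "a  b".split(' ')

print(normal)          # ['aap', 'up']
print(empty_with_sep)  # ['']
print(empty_no_sep)    # []
print(double_space)    # ['a', '', 'b']
```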

In [54]:
import gensim.downloader as api

# info = api.info()  # show info about available models/datasets

In [55]:
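size=300
# Note: `model` is the pretrained Google News model loaded above, so same_origin=False would be
# the accurate flag. The flag only changes the printed message, not the features.
X_train = word2vec_transform_features(X_train_l2, model, size, same_origin=True)
X_test = word2vec_transform_features(X_test_l2, model, size, same_origin=True)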

model comes from the data
model comes from the data


## 10. Model list definition:
Current status: Gaussian NB, Logistic Regression, SVM, Random Forest

In [56]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models_to_try = []

# Gaussian Naive Bayes
# Fitting Gaussian naive Bayes (like in class) instead of multinomial naive Bayes.
# Note: MultinomialNB raises an error on negative feature values, which word2vec features contain.
models_to_try.append(("GaussianNB", GaussianNB()))

# Logistic Regression
models_to_try.append(("LogisticReg", LogisticRegression()))

# Support Vector Machine
models_to_try.append(("SVC", SVC()))
# models_to_try.append(("SVC", SVC(probability=True))) # Need this to use .predict_proba()

# Random Forest
models_to_try.append(("RandomForest", RandomForestClassifier()))

# Multilayer Perceptron 
# Not going to fit a multilayer perceptron, here's XGB instead
#models_to_try.append(("XGB", XGBClassifier()))
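
The results in section 11 include a convergence warning for `LogisticRegression`, which suggests raising `max_iter` or scaling the data. A minimal sketch of both suggestions follows, using synthetic data in place of the word2vec features. `X_demo`, `y_demo` and `scaled_logreg` are hypothetical names, and `max_iter=1000` is an arbitrary choice, not a tuned value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Scale the features, then fit logistic regression with a higher iteration cap.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)
print(scaled_logreg.predict(X_demo[:5]))
```

A pipeline like this could be appended to `models_to_try` in place of the plain `LogisticRegression()`, though the reported scores would then need to be re-run.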

## 11. Results

In [60]:
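from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
def prediction_results(models_to_try, X_train, X_test, y_train, y_test, X_train_word, X_test_word):
    '''
    Input:
        models_to_try: a list of tuples ("Name as a string", object)
        X_train, X_test, y_train, y_test: numeric features and labels
        X_train_word, X_test_word: the corresponding text, used to collect misclassified samples
    Output:
        Prints accuracy, AUC (except for SVC) and MCC for each model.
        Returns a dictionary of confusion-matrix counts [tn, fp, fn, tp] and
        a dictionary of misclassified (train, test) samples, both keyed by model name.
    
    '''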
    confused_dict=dict()
    sample_incorrect = dict()
    for name, classifier in models_to_try:

        classifier.fit(X_train, y_train)
            
        preds = classifier.predict(X_test)
        
        preds_train = classifier.predict(X_train)
        X_trained_incorrect = X_train_word[preds_train!=y_train.values]
        X_test_incorrect = X_test_word[preds!=y_test.values]
        
        sample_incorrect[name] = (X_trained_incorrect, X_test_incorrect)
        
        print("{} Accuracy: {}".format(name, accuracy_score(y_test, preds)))
        
        if name != "SVC":
            probs = classifier.predict_proba(X_test)[:, 1]
            print("{} AUC: {}".format(name, roc_auc_score(y_test, probs)))
            
        tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
        
        MCC = (tp*tn - fp*fn)/np.sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))
        print("{} MCC: {}".format(name, MCC))
        confused_dict[name] = [tn, fp, fn, tp]
        
        
    return confused_dict, sample_incorrect

from sklearn.metrics import confusion_matrix, accuracy_score
confused_dict,sample_incorrect = prediction_results(models_to_try, X_train, X_test, y_train, y_test, X_train_word, X_test_word)

def dict_to_output(confused_dict):
    for key, value in confused_dict.items():
        print("{} Confusion Matrix".format(key))
        print(pd.DataFrame({"True Y=1":[value[3],value[2]],"True Y=0":[value[1],value[0]]},index=["Guess Y=1","Guess Y=0"]))
        print("\n")
        
dict_to_output(confused_dict)

GaussianNB Accuracy: 0.6065573770491803
GaussianNB AUC: 0.6191755928385867
GaussianNB MCC: 0.15374202404461323
LogisticReg Accuracy: 0.7385677308024159
LogisticReg AUC: 0.7878322519547998
LogisticReg MCC: 0.421657678879287


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


SVC Accuracy: 0.7575496117342536
SVC MCC: 0.4627771252460948
RandomForest Accuracy: 0.7109577221742882
RandomForest AUC: 0.765152096850565
RandomForest MCC: 0.34196785653215794
GaussianNB Confusion Matrix
           True Y=1  True Y=0
Guess Y=1       505       229
Guess Y=0       227       198


LogisticReg Confusion Matrix
           True Y=1  True Y=0
Guess Y=1       617       188
Guess Y=0       115       239


SVC Confusion Matrix
           True Y=1  True Y=0
Guess Y=1       634       183
Guess Y=0        98       244


RandomForest Confusion Matrix
           True Y=1  True Y=0
Guess Y=1       664       267
Guess Y=0        68       160
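
As a sanity check, the accuracy and MCC printed above can be recomputed from a confusion matrix alone. The sketch below does this for the LogisticReg counts, using the same MCC formula as `prediction_results`.

```python
import math

# Counts read off the LogisticReg confusion matrix printed above.
tp, fp, fn, tn = 617, 188, 115, 239

accuracy = (tp + tn) / (tp + fp + fn + tn)
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(round(accuracy, 4), round(mcc, 4))   # 0.7386 0.4217
```

Both values agree with the printed LogisticReg accuracy and MCC.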




In [65]:
for i in range(sample_incorrect['LogisticReg'][1].shape[0]):
    print(sample_incorrect['LogisticReg'][1].iloc[i,0])

aap with billion in cash that is crazy
jpm not looking pretty here
sensex end point lower nifty settle at a market take hit after a day s breather amid coronavi http t co dzp suya v
goog igv google software firing on all cylinder compq ndx
selling bvsn here from stop b e even
glad that the easy money trade is over every joker thought that making money in the market is synonym with getting long aap
emn increasing selling volume next support is the ascending trend line
aap want to buy more around
af watching spx high and relative strength weakness af at ma
spy ncle ben needed to spice up thing a little aap people were bored wake up now
state government scramble for fund a coronavirus take toll on coffer http t co kztfrvxhci
nvda nvidia show off phoenix reference phone at mwc
a the coronavirus pandemic intensifies adherent of the financial independence retire early movement are doublin http t co klzyro r
aap about to go red amazing how many time that pattern can repeat over and over
jpmor