## Changes from default:

## Effect of changes:
N/A

## 0. Overview

"Quantifying unstructured and noisy textual content is complex and involves numerous methodological issues related to the preprocessing of the data and the optimization of the algorithm used to quantify textual content. The number of **text preprocessing** that can be implemented is numerous (**lowercase**, **stemming**, **lemma-tization**, part-of-speech tagging, **stopwords removal**, **punctuation removal**, etc.) and it is not easy to identify which transformation increases (decreases) the accuracy of the classiﬁcation. The same is true for the choice of the algorithm: the large number of algorithms (Naive Bayes, SVM, logistic regression, random forest, multilayer perceptron, etc.) and the even greater number of hyperparameters for each algorithm lead to an immense number of combinations.Furthermore, the answers relative to those methodological issues strongly depend on the type of data used (informal or formal content, short or long text), on the size of the dataset (few hundreds or mil-lions of documents), on the availability of pre-classiﬁed messages (supervised or unsupervised learning), and on the type of documents (domain-speciﬁc or generic documents). While there is no one-ﬁts-all solution, we nonetheless believe that some guidance and tips can help researchers to avoid common mistakes."

Renault, Thomas. (2020). Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages. Digital Finance, 2. doi:10.1007/s42521-019-00014-x, p. 2.

## 1. Import

In [1]:
#import nltk
#nltk.download('all-corpora')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns

from nltk.stem.porter import PorterStemmer
from nltk import word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words_set = set(stopwords.words('english'))

## 2. Load data
Current status: Default (~5000 tweets)

In [2]:
# Change this in testing.
data = pd.read_csv('stock_data.csv')
data_additional = pd.read_csv("additional_data_preprocessed.csv")

In [3]:
data.groupby(by="Sentiment").count()

| Sentiment | Text |
|---|---|
| -1 | 2106 |
| 1 | 3685 |


In [4]:
data_additional.groupby(by="Sentiment").count()

| Sentiment | Text |
|---|---|
| -1 | 604 |
| 1 | 1363 |


## 3. Define Clean and Lower

In [5]:
def clean_and_lower(text):
    '''
    Input:
        A string of messy data
    Output:
        A string of clean lowercase data
    
    1) Replacing everything that isn't a letter with a space (special characters, numbers)
    2) Sending all text to lowercase
    '''
    clean_text = re.sub('[^a-zA-Z]'," ", text)
    clean_text = clean_text.lower()
    return clean_text
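To see what the regex above does to a typical message, the sketch below applies the same two steps to a sample string. It also shows a hypothetical variant, `clean_lower_collapse` (not part of this notebook), which collapses each run of non-letters into a single space; whether that is preferable here is an assumption, not something the notebook tests.

```python
import re

def clean_and_lower(text):
    # Same logic as the notebook: every non-letter becomes a space, then lowercase.
    return re.sub('[^a-zA-Z]', " ", text).lower()

def clean_lower_collapse(text):
    # Hypothetical variant: collapse each run of non-letters into one space and trim the ends.
    return re.sub(r'[^a-zA-Z]+', " ", text).lower().strip()

sample = "MNTA Over 12.00"
padded = clean_and_lower(sample)
compact = clean_lower_collapse(sample)

print(repr(padded))   # digits and the dot are replaced one-for-one by spaces
print(repr(compact))  # no trailing padding
```

Note that both versions drop the price entirely, so any signal carried by numbers is lost before tokenization.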



## 4. Define Tokenize
Current status: Default

In [6]:
def tokenize_data(text):
    '''
    Input:
        A string of cleaned data
    Output:
        A list of the words in the string
        
    Splits the string into tokens using NLTK's word_tokenize.
    Not sure how different it is compared to string.split(' ')
    
    There is also a TweetTokenizer in NLTK, maybe something to look into if we want to deal with 
        emojis instead of replacing them with spaces.
    '''
    return word_tokenize(text, language='english')
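On the question of how this differs from `string.split(' ')`: after `clean_and_lower` the text contains only letters and spaces, so one visible difference is how runs of spaces are handled. The sketch below uses only built-in string methods on a hand-written sample string (an assumption about what cleaned text looks like); it does not show how `word_tokenize` treats individual words.

```python
# Shape of clean_and_lower output: letters and spaces only, with padding where digits were.
cleaned = "mnta over      "

tokens_single_space = cleaned.split(' ')  # splits on every single space, keeps empty strings
tokens_whitespace = cleaned.split()       # splits on runs of whitespace, drops empty strings

print(tokens_single_space)
print(tokens_whitespace)
```

Splitting on a single space leaves empty tokens wherever characters were replaced, which would then flow into lemmatization and vectorization.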

## 5. Define Lemmatize and Stopword removal
Current status: Lemmatization on

Stopword removal off (`remove_stop_words = False` in section 6)

In [7]:
def lemma_and_stopwords(text, remove_stop_words=False):
    '''
    Input:
        List of words that have been cleaned and tokenized
    Output:
        A string that is supposed to represent the meaning of the original sentence, as reduced as possible
        
        
    From https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
    "
        Lemmatization is the process of converting a word to its base form. The difference 
            between stemming and lemmatization is, lemmatization considers the context and 
            converts the word to its meaningful base form, whereas stemming just removes the 
            last few characters, often leading to incorrect meanings and spelling errors.

        For example, lemmatization would correctly identify the base form of ‘caring’ to ‘care’, 
            whereas, stemming would cutoff the ‘ing’ part and convert it to car.

        ‘Caring’ -> Lemmatization -> ‘Care’
        ‘Caring’ -> Stemming -> ‘Car’
    "
    
    
    Since the original paper also mentioned stop words, I made sure the word is not a stop word
    before trying to convert it. I ran it with and without the stop word and I didn't notice much difference
    but maybe it's something to fiddle with later.
    
    
    EDIT 1: The paper also mentioned 'part-of-speech' tagging, which seems to attach context to each word to help
    convert its meaning properly. Since this seemed more complicated, and this isn't a text processing project,
    I didn't do it. 
    However there is an entire coded example (example 3) here: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
    so it might be quick to implement and see if it improves our classification scores after
    
    EDIT 2:
    About stop words, from the paper: 
        We also find that removing stopwords using the NLTK stopwords corpus significantly decreases the accuracy 
        of the classification. We believe that this result is due to the fact that the stopwords corpus from NLTK includes 
        words that could be very useful for sentiment analysis in finance such as “up”, “down”, “below” or “above”. Thus, 
        researchers should not use the standard NLTK list and should consider a more restrictive list of stopwords for 
        sentiment analysis (“a”, “an”, “the”...). This result is consistent with Saif et al (2014) who show that Naive 
        Bayes classifiers are more sensitive to stopword removal and that using pre-existing lists of stopwords negatively 
        impacts the performance of sentiment classification for short-messages posted on social media.
    '''
    if remove_stop_words:
        list_of_words = [lemmatizer.lemmatize(word) for word in text if word not in stop_words_set]
    else:
        list_of_words = [lemmatizer.lemmatize(word) for word in text]
        
    string_of_words = " ".join(list_of_words)
    return string_of_words
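The paper's suggestion of a more restrictive stopword list could be tried with a small custom set instead of the full NLTK list. The sketch below is only an illustration: `RESTRICTIVE_STOP_WORDS` and `remove_restrictive_stopwords` are hypothetical names, and the set contains just the three examples the quoted passage gives.

```python
# Hypothetical restrictive list; the quoted paper only names "a", "an", "the" as examples.
RESTRICTIVE_STOP_WORDS = {"a", "an", "the"}

def remove_restrictive_stopwords(tokens, stop_words=RESTRICTIVE_STOP_WORDS):
    # Keep every token that is not in the small stopword set.
    return [tok for tok in tokens if tok not in stop_words]

tokens = ["the", "stock", "is", "up", "over", "a", "dollar"]
kept = remove_restrictive_stopwords(tokens)
print(kept)
```

Directional words such as "up" and "over" survive this filter, which is the behaviour the paper argues for.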

## 6. Final preprocessing

In [8]:
def prep_all_data(data, remove_stop_words=False):
    '''
    Input:
        The column of the dataframe that contains text
    Output:
        A list of strings that has been through:
            clean_data = clean_and_lower(data)
            token_data = tokenize_data(clean_data)
            lemma_data = lemma_and_stopwords(token_data)
    '''
    list_of_nice_strings = []
    for i in data:
        clean_string = clean_and_lower(i)
        token_list = tokenize_data(clean_string)
        lemma_string = lemma_and_stopwords(token_list, remove_stop_words=remove_stop_words)
        list_of_nice_strings.append(lemma_string)
        
    return list_of_nice_strings

# Note: Just eyeballing it, it looks like some useful words are being thrown out.
# E.g., in the fourth row (index 3) it goes from "MNTA Over 12" --> "mnta". That seems like a bullish tweet, but something
# is throwing away the "over".
# EDIT: I just tried without stop word removal and it recovered the "Over"... Best to try both


# Preprocess both test and original data
remove_stop_words = False
clean_data = prep_all_data(data['Text'],remove_stop_words)
clean_data_additional = prep_all_data(data_additional['Text'],remove_stop_words)

In [9]:
print("train data len : ", len(clean_data))
print("test data len  : ", len(clean_data_additional))

train data len :  5791
test data len  :  1967


## 7. Vectorization
Current status:

Vectorization: CountVectorizer

n-grams: 1 and 2

In [10]:
def text_to_num(clean_data):
    '''
    Input: 
        cleaned list of strings
    Output:
        Document-term matrix of n-gram counts (dense NumPy array, one row per document)
    
    From: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
    More info: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    
    Text Analysis is a major application field for machine learning algorithms. However,
        the raw data, a sequence of symbols cannot be fed directly to the algorithms 
        themselves as most of them expect numerical feature vectors with a fixed size 
        rather than the raw text documents with variable length.
    ...       
    We call vectorization the general process of turning a collection of text documents into numerical 
        feature vectors. This specific strategy (tokenization, counting and normalization) is called 
        the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences 
        while completely ignoring the relative position information of the words in the document.
    '''
    from sklearn.feature_extraction.text import CountVectorizer
    
    # As usual, we may need to cross-validate this to determine the optimal representation
    # Cross validation link: https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py
    # max_features = ?
    # ngram_range = ?
    # fit_transform: Learn the vocabulary dictionary and return document-term matrix.
    # EDIT: He has results in his paper about choosing good parameters for ngrams

    vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=3)
    document_term_matrix = vectorizer.fit_transform(clean_data).toarray()
    return document_term_matrix

# Vectorize train (original data) and test (additional data) together so both share one vocabulary.
# Fitting a separate vectorizer on each would give train and test matrices with different
# feature dimensions, and classification would fail.

train_data_length = len(clean_data)
clean_data_all = clean_data + clean_data_additional
X_dataset = text_to_num(clean_data_all)
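An alternative way to keep the feature dimensions aligned is to fit the vectorizer on the training documents only and reuse it on the test documents. The sketch below uses toy documents rather than the notebook's data, and omits `min_df=3` because the toy corpus is too small for it; it is a minimal illustration, not a drop-in replacement for `text_to_num`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the cleaned train and test strings.
train_docs = ["stock going up", "stock going down", "buy the dip"]
test_docs = ["stock going sideways", "sell the rip"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_counts = vectorizer.fit_transform(train_docs)  # learns the vocabulary from train only
X_test_counts = vectorizer.transform(test_docs)        # reuses it; unseen n-grams are ignored

print(X_train_counts.shape, X_test_counts.shape)
```

Both matrices have the same number of columns, and n-grams that occur only in the test documents never enter the vocabulary. Both results stay sparse unless `.toarray()` is called.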

## 8. Dimension reduction
Current status:

No dimension reduction.

In [11]:
# None by default.
# When we add this, we'll do it off the X matrix here.
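If dimension reduction is added later, one option for count matrices is scikit-learn's `TruncatedSVD`, which accepts sparse input directly. The sketch below runs on toy documents, and `n_components=2` is purely illustrative; the right number of components for this dataset is an open question.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned tweets.
docs = ["stock going up", "stock going down", "buy the dip", "sell the rip"]
X_counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)  # sparse count matrix

svd = TruncatedSVD(n_components=2, random_state=2021)  # n_components is illustrative only
X_reduced = svd.fit_transform(X_counts)

print(X_counts.shape, "->", X_reduced.shape)
```

One caveat: the reduced features can be negative, so `MultinomialNB` would not accept them as input.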


## 9. Training/Testing split

In [12]:
from sklearn.model_selection import train_test_split



# Should we do PCA on this? X is (5791, 8330)....
# from sklearn.decomposition import PCA
# n_components=250
# pca = PCA(n_components=n_components) 
# X_reduced = pca.fit_transform(X)

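# Goal: train on a set the size of 80% of the original data, with a fraction alpha of it drawn
# from the additional data, then compare performance on the held-out 20% of the original data
# against performance on the remaining additional data.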

X_train = X_dataset[:train_data_length]
y_train = data["Sentiment"]
y_train = y_train.replace(-1, 0)

X_additional = X_dataset[train_data_length:]
y_additional = data_additional["Sentiment"]
y_additional = y_additional.replace(-1, 0)

alpha = 0.2  # fraction of the training set drawn from the additional data
X_o_train, X_o_test, y_o_train, y_o_test = train_test_split(X_train, y_train, 
                                                            train_size = int(train_data_length*0.8*(1-alpha)),
                                                           test_size = int(train_data_length*0.2),
                                                           random_state=2021)
add_len = len(y_additional)
X_h_train, X_h_test, y_h_train, y_h_test = train_test_split(X_additional, y_additional,
                                                           train_size = int(train_data_length*0.8*alpha),
                                                           random_state=2021)

X_train = np.vstack([X_o_train, X_h_train])
y_train = np.concatenate([y_o_train, y_h_train])
X_original = X_o_test
y_original = y_o_test
X_additional = X_h_test
y_additional = y_h_test


#X_train, X_original, y_train, y_original = train_test_split(X_train, y_train, test_size=0.2, random_state=2021)
print("X train Shape: {}".format(X_train.shape))
print("X original Shape: {}".format(X_original.shape))
print("X additional Shape: {}".format(X_additional.shape))
print(f"y train shape{y_train.shape}")

X train Shape: (4632, 9280)
X original Shape: (1158, 9280)
X additional Shape: (1041, 9280)
y train shape(4632,)
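The row counts printed above follow directly from the `train_size` and `test_size` expressions. The sketch below redoes that arithmetic from the dataset lengths printed in section 6; the variable names are chosen here for illustration.

```python
n_original = 5791    # rows in the original data (printed in section 6)
n_additional = 1967  # rows in the additional data (printed in section 6)
alpha = 0.2

n_o_train = int(n_original * 0.8 * (1 - alpha))  # original rows used for training
n_o_test = int(n_original * 0.2)                 # original rows held out for testing
n_h_train = int(n_original * 0.8 * alpha)        # additional rows used for training
n_h_test = n_additional - n_h_train              # remaining additional rows, used for testing
n_o_unused = n_original - n_o_train - n_o_test   # original rows in neither split

print(n_o_train, n_h_train, n_o_train + n_h_train)
print(n_o_test, n_h_test, n_o_unused)
```

Because both `train_size` and `test_size` are given for the original data and sum to less than its length, some original rows end up in neither split.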


In [13]:
X_o_train.shape

(3706, 9280)

mixed data split

## 10. Model list definition:
Current status: Multinomial NB, Logistic Regression, SVM, Random Forest

In [14]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models_to_try = []

# Multinomial Naive Bayes
models_to_try.append(("MultinomialNB",MultinomialNB()))

# Gaussian Naive Bayes
# Fitting Gaussian naive Bayes (like in class) instead
# models_to_try.append(("GaussianNB", GaussianNB()))

# Logistic Regression
models_to_try.append(("LogisticReg", LogisticRegression()))

# Support Vector Machine
models_to_try.append(("SVC", SVC()))
# models_to_try.append(("SVC", SVC(probability=True))) # Need this to use .predict_proba()

# Random Forest
models_to_try.append(("RandomForest", RandomForestClassifier()))

# Multilayer Perceptron 
# Not going to fit a multilayer perceptron, here's XGB instead
#models_to_try.append(("XGB", XGBClassifier()))

## 11. Results

In [15]:
aucs = []
mccs = []
mcrs = []
model_names = []

aucs_add = []
mccs_add = []
mcrs_add = []
model_names_add = []

from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
def prediction_results(models_to_try, X_train, X_original, X_additional, y_train, y_original, y_additional):
    '''
    Input:
        models_to_try: a list of tuples ("Name as a string", object)
        Data as usual
        X, y : train / original / additional
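    Output:
        Prints accuracy, AUC and MCC per model and returns two dictionaries of confusion-matrix
        counts [tn, fp, fn, tp]: one for the original test data, one for the additional test data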
    
    '''
    
    def predict(classifier, X_test, y_test, name, confused_dict, model_names, aucs, mccs, mcrs) :
        '''
        Input :
            classifier, test dataset, and the result containers to fill
            
        Computes accuracy, AUC and MCC for one model on one test set and appends them to
        model_names, aucs, mccs, mcrs
        '''
        preds = classifier.predict(X_test)
        print("{} Accuracy: {}".format(name, accuracy_score(y_test, preds)))
        
        if name != "SVC":
            probs = classifier.predict_proba(X_test)[:, 1]
            print("{} AUC: {}".format(name, roc_auc_score(y_test, probs)))
            aucs.append(roc_auc_score(y_test, probs))
            
        tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
        
        MCC = (tp*tn - fp*fn)/np.sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))
        print("{} MCC: {}".format(name, MCC))
        confused_dict[name] = [tn, fp, fn, tp]
         
        model_names.append(name)
        mccs.append(MCC)
        mcrs.append(accuracy_score(y_test, preds))  # note: this stores accuracy, not the misclassification rate (1 - accuracy)
        
    confused_dict=dict()
    confused_dict_add=dict()
    for name, classifier in models_to_try:
        # train 
        classifier.fit(X_train, y_train)

        # predict on original data
        print("[ Original ]")
        predict(classifier, X_original, y_original, name, confused_dict, model_names, aucs, mccs, mcrs)
        
        # predict on additional data
        print("[ Additional ]")
        predict(classifier, X_additional, y_additional, name, confused_dict_add, model_names_add, aucs_add, mccs_add, mcrs_add)
        
    return confused_dict, confused_dict_add

from sklearn.metrics import confusion_matrix, accuracy_score
confused_dict, confused_dict_add = prediction_results(models_to_try, X_train, X_original, X_additional, y_train, y_original, y_additional)

def dict_to_output(confused_dict):
    for key, value in confused_dict.items():
        print("{} Confusion Matrix".format(key))
        print(pd.DataFrame({"True Y=1":[value[3],value[2]],"True Y=0":[value[1],value[0]]},index=["Guess Y=1","Guess Y=0"]))
        print("\n")

[ Original ]
MultinomialNB Accuracy: 0.7841105354058722
MultinomialNB AUC: 0.8471120053053627
MultinomialNB MCC: 0.5381197041620264
[ Additional ]
MultinomialNB Accuracy: 0.8318924111431316
MultinomialNB AUC: 0.8888251380535331
MultinomialNB MCC: 0.6013900517116358
[ Original ]
LogisticReg Accuracy: 0.7884283246977547
LogisticReg AUC: 0.8541730073653556
LogisticReg MCC: 0.5378997051545973
[ Additional ]
LogisticReg Accuracy: 0.8491834774255523
LogisticReg AUC: 0.909326527382083
LogisticReg MCC: 0.6265886816719821
[ Original ]
SVC Accuracy: 0.7728842832469776
SVC MCC: 0.49588283950593653
[ Additional ]
SVC Accuracy: 0.7848222862632085
SVC MCC: 0.4461195698357654
[ Original ]
RandomForest Accuracy: 0.7962003454231433
RandomForest AUC: 0.85350983702669
RandomForest MCC: 0.5597978451184612
[ Additional ]
RandomForest Accuracy: 0.8626320845341018
RandomForest AUC: 0.8858354437058141
RandomForest MCC: 0.6607263046152583
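No AUC is printed for SVC because `SVC()` has no `predict_proba` unless `probability=True` is set. `roc_auc_score` also accepts non-thresholded decision values, so one possible workaround is `decision_function`. The sketch below uses synthetic data from `make_classification` as a stand-in, not the notebook's tweets.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; not the notebook's document-term matrix.
X, y = make_classification(n_samples=200, n_features=10, random_state=2021)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2021)

svc = SVC().fit(X_tr, y_tr)
scores = svc.decision_function(X_te)  # signed decision values, larger means more class 1
svc_auc = roc_auc_score(y_te, scores)

print("SVC AUC: {}".format(svc_auc))
```

This avoids the extra internal cross-validation that `probability=True` triggers during fitting.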


In [16]:
print("[ Original Data ] \n")
dict_to_output(confused_dict)

df = pd.DataFrame()
df['Misclassification'] = mcrs
df['MCC'] = mccs
df.index = model_names

[ Original Data ] 

MultinomialNB Confusion Matrix
           True Y=1  True Y=0
Guess Y=1       602       121
Guess Y=0       129       306


LogisticReg Confusion Matrix
           True Y=1  True Y=0
Guess Y=1       629       143
Guess Y=0       102       284


SVC Confusion Matrix
           True Y=1  True Y=0
Guess Y=1       664       196
Guess Y=0        67       231


RandomForest Confusion Matrix
           True Y=1  True Y=0
Guess Y=1       619       124
Guess Y=0       112       303
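As a worked check, the accuracy and MCC printed in section 11 can be recomputed by hand from these counts. The sketch below uses the MultinomialNB matrix on the original test set and also derives the misclassification rate, which is one minus the accuracy.

```python
import math

# MultinomialNB counts on the original test set, read off the confusion matrix above.
tp, fp, fn, tn = 602, 121, 129, 306

accuracy = (tp + tn) / (tp + fp + fn + tn)
misclassification_rate = 1 - accuracy
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(round(accuracy, 6), round(misclassification_rate, 6), round(mcc, 6))
```

The accuracy and MCC agree with the values printed for MultinomialNB under "[ Original ]". scikit-learn also provides `sklearn.metrics.matthews_corrcoef(y_true, y_pred)`, which computes the MCC from labels directly.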




In [17]:
print("[ Additional Data ] \n")
dict_to_output(confused_dict_add)

df1 = pd.DataFrame()
df1['Misclassification'] = mcrs_add
df1['MCC'] = mccs_add
df1.index = model_names_add

[ Additional Data ] 

MultinomialNB Confusion Matrix
           True Y=1  True Y=0
Guess Y=1       639        85
Guess Y=0        90       227


LogisticReg Confusion Matrix
           True Y=1  True Y=0
Guess Y=1       680       108
Guess Y=0        49       204


SVC Confusion Matrix
           True Y=1  True Y=0
Guess Y=1       712       207
Guess Y=0        17       105


RandomForest Confusion Matrix
           True Y=1  True Y=0
Guess Y=1       706       120
Guess Y=0        23       192




In [20]:
df["dataType"] = "original"
df.to_csv("result/original_data_results_mix0.2.csv")
df

| Model | Misclassification | MCC | dataType |
|---|---|---|---|
| MultinomialNB | 0.784111 | 0.53812 | original |
| LogisticReg | 0.788428 | 0.5379 | original |
| SVC | 0.772884 | 0.495883 | original |
| RandomForest | 0.7962 | 0.559798 | original |


In [21]:
df1["dataType"] = "additional"
df1.to_csv("result/additional_data_results_mix0.2.csv")
df1

| Model | Misclassification | MCC | dataType |
|---|---|---|---|
| MultinomialNB | 0.831892 | 0.60139 | additional |
| LogisticReg | 0.849183 | 0.626589 | additional |
| SVC | 0.784822 | 0.44612 | additional |
| RandomForest | 0.862632 | 0.660726 | additional |
