# Predicting Tweet Sentiment
- by Chee-Foong
- Apr 2020

## Summary

This analysis trains several machine learning models to predict the sentiment of tweets and compares their performance.

## Data

The data is downloaded from the Kaggle competition [Tweet Sentiment Extraction](https://www.kaggle.com/c/tweet-sentiment-extraction/data).

## Scope

Note that the objective of the Kaggle competition is to predict the list of keywords that are relevant for the sentiment of each tweet.  The work here is <span style="color:red">**NOT**</span> an analysis for the competition.  I am just using the tweets and the respective sentiments given in the train and test data sets to explore various machine learning models and NLP libraries / techniques.

---
## Download data files from Kaggle
Reference code for downloading and extracting the files from Kaggle.  Not required once the files are downloaded.  Files are stored in the directory <span style="color:red">**../data**</span>

In [1]:
# !pip install kaggle --upgrade
# !cp ../utility/kaggle.json ~/.kaggle/
# !chmod 600 ~/.kaggle/kaggle.json
# !kaggle competitions download -c tweet-sentiment-extraction
# !mv *.zip ../data/

In [2]:
# import zipfile
# import os

# datafolder = '../data/'

# file_list = []
# dir_list = []
# for root, dirs, files in os.walk(datafolder, topdown=False):
#     for name in files:
#         if (name != '.DS_Store'):
#             file_list.append(os.path.join(root, name))
#     for name in dirs:
#         dir_list.append(os.path.join(root, name))

# for file in file_list:
#     if file.endswith('.zip'):
#         zip_ref = zipfile.ZipFile(file, 'r')
#         zip_ref.extractall(datafolder)
#         zip_ref.close()
#         os.remove(file)

---
## Libraries

In [3]:
# !python -m spacy download en_core_web_sm
# !python -m spacy download en_core_web_lg

### Public

In [4]:
import numpy as np
import pandas as pd
import copy
import matplotlib.pyplot as plt
%matplotlib inline

## Loading Language Model
import spacy

# Load the en_core_web_sm model
# nlp = spacy.load('en_core_web_lg')
nlp = spacy.load('en_core_web_sm')

In [5]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7fb113f45400>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7fb113f473a0>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7fb113f478e0>)]

### Private

In [6]:
import sys  
sys.path.append('../src') 

from cleandata import *
from memory import *

---
## Data Ingestion and Exploration

Notes
- sample_submission.csv is not relevant for the analysis and will not be loaded.
- The 'selected_text' column in train.csv is not relevant here and will be dropped.
- 'textID' will be kept for later reference if required.
- test.csv will serve as the out-of-sample data set.
- The 'sentiment' column in test.csv will be used to measure model performance.



In [7]:
datafolder = '../data/'

In [8]:
# sample_submission = pd.read_csv(datafolder + "sample_submission.csv")
test = pd.read_csv(datafolder + "test.csv")
train = pd.read_csv(datafolder + "train.csv")

In [9]:
train.shape

(27481, 4)

In [10]:
# train = train.head(1000)

In [11]:
train = train.drop(['selected_text'], axis=1)

In [12]:
print(train.columns)
print(test.columns)

Index(['textID', 'text', 'sentiment'], dtype='object')
Index(['textID', 'text', 'sentiment'], dtype='object')


Dropping records with missing text from the training dataset

In [13]:
train_cat = train.select_dtypes(include=['object']).copy()
print(train_cat.isnull().values.sum())
print(train_cat.isnull().sum())

1
textID       0
text         1
sentiment    0
dtype: int64


In [14]:
train[train.text.isna()]

| | textID | text | sentiment |
|---|---|---|---|
| 314 | fdb77c3752 | NaN | neutral |


In [15]:
train.dropna(inplace=True)

In [16]:
test_cat = test.select_dtypes(include=['object']).copy()
print(test_cat.isnull().values.sum())
print(test_cat.isnull().sum())

0
textID       0
text         0
sentiment    0
dtype: int64


In [17]:
del train_cat
del test_cat

---
## Data Cleansing and Feature Engineering

Generate relevant social media features
- add number of URLs in tweets
- add number of mentions in tweets
- add number of hashtags in tweets

Clean up the 'text' column
- remove URL links
- clean up short forms and abbreviations
- remove punctuation

Apply all of the above functions to both the train and test datasets

In [18]:
from spacy.matcher import Matcher

def add_socialinfo(df, ref_column):
    df[['url','mention','hashtag']] = df[ref_column].apply(socialinfo)
    return df


def socialinfo(text):
    matcher = Matcher(nlp.vocab)

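    like_url = [{"LIKE_URL": True}]
    # A mention is a single token such as @user_name.  The original class [A-z]
    # also matched the ASCII symbols between 'Z' and 'a', e.g. '[' and '^'.
    like_mention = [{"TEXT": {"REGEX": r"^@[A-Za-z0-9_]+$"}}]
    like_hashtag = [{"ORTH": "#"}, {"IS_ASCII": True}]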

    matcher.add("url", None, like_url)
    matcher.add("mention", None, like_mention)
    matcher.add("hashtag", None, like_hashtag)

    doc = nlp(text)
    matches = matcher(doc)

    url_count = 0
    mention_count = 0
    hashtag_count = 0
    
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        
        if string_id == 'url':
            url_count += 1
        elif string_id == 'mention':
            mention_count += 1
        elif string_id == 'hashtag':
            hashtag_count += 1
            
    return pd.Series([url_count, mention_count, hashtag_count])

In [19]:
import string
# Create our list of punctuation marks
punctuations = string.punctuation

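# Function to remove URLs (both http:// and https:// links)
def remove_http(tweet):
    words = tweet.split()
    out = [word for word in words if not word.startswith(('http://', 'https://'))]
    return ' '.join(out)

# Function to remove standalone punctuation tokens (e.g. a lone '!').
# Punctuation attached to a word is left in place.
def remove_punc(tweet):
    punctuations = string.punctuation
    words = tweet.split()
    out = [ word for word in words if word not in punctuations ]
    return ' '.join(out)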

In [20]:
def text_feature_engineering(df_train, df_test):
    to_drop = [x for x in df_train.columns if x not in df_test.columns]
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    full = pd.concat([df_train, df_test], sort=False)

    full = add_socialinfo(full, 'text')

    full['text'] = full['text'].apply(remove_http)
    full['text'] = full['text'].apply(clean)
    full['text'] = full['text'].apply(remove_punc)

    full['text'] = full['text'].str.lower()
    
    train = full.iloc[0:len(df_train)]
    test = full.iloc[len(df_train):].drop(to_drop, axis=1)
    return train, test

<span style="color:red">**WARNING**</span>: The following cell may take a while to run.

In [21]:
%%time
train, test = text_feature_engineering(train, test)

CPU times: user 35min 25s, sys: 9.3 s, total: 35min 34s
Wall time: 35min 47s
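
The cell above runs the full spaCy pipeline and builds a new `Matcher` for every tweet, which is probably where most of the 35 minutes go. As a rough alternative that was not used in this notebook, the three counts could be approximated with regular expressions. The function name `socialinfo_regex` and the three patterns are assumptions of this sketch, and they will not agree exactly with spaCy's `LIKE_URL` and tokenisation rules.

```python
import re

# Hypothetical patterns: approximations of the spaCy matcher rules, not equivalents
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"(?<!\w)@[A-Za-z0-9_]+")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")


def socialinfo_regex(text):
    """Return (url_count, mention_count, hashtag_count) for one tweet."""
    urls = URL_RE.findall(text)
    # Remove URLs first so that '#' or '@' inside a link is not counted
    stripped = URL_RE.sub(" ", text)
    mentions = MENTION_RE.findall(stripped)
    hashtags = HASHTAG_RE.findall(stripped)
    return len(urls), len(mentions), len(hashtags)


sample = "@bob check http://t.co/abc#top and www.x.com #nlp #fun"
print(socialinfo_regex(sample))
```

With pandas, the result could be expanded into the `url`, `mention` and `hashtag` columns in the same way as `add_socialinfo`, for example by wrapping the returned tuple in a `pd.Series`.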


Selecting relevant columns

In [22]:
target = 'sentiment'
categorical = []
numerical = ['url', 'mention', 'hashtag']
to_drop = ['textID']

In [23]:
train = train.drop(to_drop, axis=1)
test = test.drop(to_drop, axis=1)

train = reduce_mem_usage(train)
test = reduce_mem_usage(test)

train_wide_x = train.drop([target], axis=1)
train_wide_y = train[target]

Memory usage of dataframe is 1.26 MB
Memory usage after optimization is: 0.71 MB
Decreased by 43.8%
Memory usage of dataframe is 0.16 MB
Memory usage after optimization is: 0.09 MB
Decreased by 43.8%


---
## Data Processing for Modeling
- Train-test split with a ratio of 9:1
- Tokenising and vectorising the 'text' column
- Replacing 'text' with the token columns
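
To make the vectorising step concrete, here is a minimal pure-Python sketch of how a TF-IDF weight is computed, assuming scikit-learn's default `TfidfVectorizer` settings (smoothed IDF and L2 normalisation of each row). The helper name `tfidf`, the whitespace tokenisation and the toy documents are made up for illustration; the notebook itself tokenises with spaCy.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with idf = ln((1 + n) / (1 + df)) + 1 and L2-normalised rows."""
    n = len(docs)
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    # Document frequency: number of documents containing each token
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        weights = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, rows = tfidf(["good day", "bad day", "good good mood"])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the second toy document, the rarer word "bad" receives a higher weight than "day", which appears in two of the three documents.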

In [24]:
import gc
import time
import scipy.sparse

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


Building a tokeniser with spacy

In [25]:
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# nlp = spacy.load('en')

# Create our list of stopwords
stop_words = spacy.lang.en.stop_words.STOP_WORDS

# Load English tokenizer, tagger, parser, NER and word vectors
# parser = English()

def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = nlp(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
#     mytokens = [ word.lower_.strip() for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens

<span style='color:red'>Due to memory issues, I was not able to use n-grams of size 2 or more here.</span>
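
One possible way around the memory limit, not tried in this notebook, is to cap the vocabulary with `min_df` and `max_features` and to keep the vectoriser output as a sparse matrix. The sketch below uses toy tweets and scikit-learn's default tokeniser instead of `spacy_tokenizer`; the parameter values are illustrative assumptions, not tuned settings.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy tweets standing in for the cleaned 'text' column
docs = [
    "i love this song",
    "i hate this song",
    "love this so much",
    "hate mondays so much",
]

# Unigrams and bigrams, keeping only terms seen in at least 2 documents
# and at most 1000 terms overall (both values are assumptions)
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_features=1000)
X = vec.fit_transform(docs)

print(type(X).__name__, X.shape)
print(sorted(vec.vocabulary_))
```

Bigrams that occur only once, such as "hate mondays", are dropped by `min_df=2`, which is what keeps the feature count from growing with every new word pair.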

In [26]:
NGRAM = 1

# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(train_wide_x, train_wide_y, test_size=0.1, 
                                                    random_state=42, stratify=train_wide_y)

print(train_X.shape)

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,NGRAM))
# vectorizer = CountVectorizer(ngram_range=(1,3))

train_vec_X = vectorizer.fit_transform(train_X['text'])
test_vec_X = vectorizer.transform(test_X['text'])

print(train_vec_X.shape)

(24732, 4)
(24732, 20079)


In [27]:
## Prepare input dataframe for model.  Combining all transformed dataframes together
def prepare_df_final(df, mat_vec, drop_col):
    df_dropped = df.drop(drop_col, axis = 1)
    df_vec = pd.DataFrame(mat_vec.toarray()).set_index(df_dropped.index)
    final = pd.concat([df_dropped, df_vec], axis = 1)
    return final

In [28]:
%%time
train_X = prepare_df_final(train_X, train_vec_X, ['text'])
test_X = prepare_df_final(test_X, test_vec_X, ['text'])

del train_wide_x
del train_wide_y
del train_vec_X
del test_vec_X

gc.collect()

## Important to manage memory usage
train_X = scipy.sparse.csr_matrix(train_X.values)
test_X = scipy.sparse.csr_matrix(test_X.values)

CPU times: user 2min 28s, sys: 1min 26s, total: 3min 55s
Wall time: 4min 3s
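
`prepare_df_final` expands the sparse TF-IDF matrix into a dense DataFrame before it is converted back to a sparse matrix, which is likely a large part of the memory pressure. A possible alternative, sketched below, stacks the numeric count columns onto the sparse matrix directly with `scipy.sparse.hstack`. The helper name `combine_sparse` and the toy data are hypothetical; the column order (counts first, then text features) follows `prepare_df_final`.

```python
import numpy as np
from scipy import sparse


def combine_sparse(count_features, text_matrix):
    """Put dense count columns in front of a sparse text matrix without densifying it."""
    counts = sparse.csr_matrix(np.asarray(count_features, dtype=np.float64))
    return sparse.hstack([counts, text_matrix], format="csr")


# Toy stand-ins: 3 tweets, 3 count columns (url, mention, hashtag), 5 TF-IDF columns
counts = [[1, 0, 2], [0, 1, 0], [0, 0, 0]]
tfidf = sparse.csr_matrix(
    [
        [0.0, 0.5, 0.0, 0.0, 0.5],
        [0.7, 0.0, 0.0, 0.3, 0.0],
        [0.0, 0.0, 1.0, 0.0, 0.0],
    ]
)

combined = combine_sparse(counts, tfidf)
print(combined.shape)
```

In this notebook the first argument would be something like `train_X[['url', 'mention', 'hashtag']].values` and the second the output of `vectorizer.fit_transform`.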


---
## Model Training

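Check that the shapes of the datasets are consistent and as expected.  The feature count should equal the TF-IDF vocabulary size plus the three count columns (20079 + 3 = 20082).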

In [29]:
print(train_X.shape)
print(test_X.shape)
print(train_y.shape)
print(test_y.shape)

(24732, 20082)
(2748, 20082)
(24732,)
(2748,)


Build a generic function to train models

In [30]:
from sklearn.model_selection import cross_val_score

def fitModel(model, train_X, test_X, train_y, test_y):
    start_time = time.time()
    
    print("Model: {}".format(type(model).__name__))

    if type(model).__name__ == 'XGBClassifier':
        model.fit(train_X.toarray(), train_y)
    else:
        model.fit(train_X, train_y)
    
    # Print the training time
    print("The program took %.3f seconds to complete." % (time.time() - start_time))

    # Cross validation score (weighted F1, because the target has three classes)
#     f1score = cross_val_score(model, train_X, train_y, cv=10, scoring="f1_weighted")
#     print(f1score)
#     print("The 10-fold cross validation f1 score is %.3f" % f1score.mean())    
    
    # Measure the accuracy
    accuracy = model.score(test_X, test_y)
    print("The accuracy of the classifier on the test set is %.3f" % accuracy)
    
    return model, accuracy

models = []
ACCURACY_THRESHOLD = 0.65

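The models listed below are trained with their default settings.  No tuning at this stage.  The objective is to compare how the models perform against one another.  Only models whose accuracy on the 10% hold-out split exceeds ACCURACY_THRESHOLD (0.65) are kept for prediction.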

   1. Naive Bayes Model
   2. Ridge Classifier
   3. Logistic Regression
   4. XGBoost
   5. Random Forest
   6. Decision Tree
   7. Gradient Boosting
   8. Support Vector Machine

### Naive Bayes Model

In [31]:
from sklearn.naive_bayes import MultinomialNB
nb, accuracy = fitModel(MultinomialNB(), train_X, test_X, train_y, test_y)

if accuracy > ACCURACY_THRESHOLD:
    models.append(nb)

Model: MultinomialNB
The program took 0.948 seconds to complete.
The accuracy of the classifier on the test set is 0.624


### Ridge Classifier

In [32]:
from sklearn.linear_model import RidgeClassifier
rc, accuracy = fitModel(RidgeClassifier(), train_X, test_X, train_y, test_y)

if accuracy > ACCURACY_THRESHOLD:
    models.append(rc)

Model: RidgeClassifier
The program took 1.178 seconds to complete.
The accuracy of the classifier on the test set is 0.674


### Logistic Regression

In [33]:
from sklearn.linear_model import LogisticRegression
lr, accuracy = fitModel(LogisticRegression(random_state = 1), train_X, test_X, train_y, test_y)

if accuracy > ACCURACY_THRESHOLD:
    models.append(lr)

Model: LogisticRegression
The program took 4.053 seconds to complete.
The accuracy of the classifier on the test set is 0.694


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
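
The warning above reports that the solver stopped at its iteration limit; the model listing further down shows `max_iter=100` with the `lbfgs` solver. One option, following the warning's own suggestion, is to raise `max_iter`. The value 1000 and the random toy data below are assumptions for illustration, and whether this changes the accuracy on the tweet data was not tested here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random toy features and three sentiment labels, for illustration only
rng = np.random.RandomState(0)
X = rng.rand(60, 5)
y = np.array(["negative", "neutral", "positive"] * 20)

# Same model as above, but with a higher iteration limit (assumed value)
clf = LogisticRegression(max_iter=1000, random_state=1)
clf.fit(X, y)

# n_iter_ reports how many iterations the solver actually used
print(clf.n_iter_)
```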


### XGBoost
<span style="color:red">This may take hours to run</span>

In [34]:
from xgboost import XGBClassifier
xgb, accuracy = fitModel(XGBClassifier(), train_X, test_X, train_y, test_y)

if accuracy > ACCURACY_THRESHOLD:
    models.append(xgb)

Model: XGBClassifier
The program took 13205.883 seconds to complete.
The accuracy of the classifier on the test set is 0.692


### Random Forest Classifier

In [35]:
from sklearn.ensemble import RandomForestClassifier
rfc, accuracy = fitModel(RandomForestClassifier(), train_X, test_X, train_y, test_y)

if accuracy > ACCURACY_THRESHOLD:
    models.append(rfc)

Model: RandomForestClassifier
The program took 65.399 seconds to complete.
The accuracy of the classifier on the test set is 0.706


### Decision Tree Classifier

In [36]:
from sklearn.tree import DecisionTreeClassifier
dtc, accuracy = fitModel(DecisionTreeClassifier(), train_X, test_X, train_y, test_y)

if accuracy > ACCURACY_THRESHOLD:
    models.append(dtc)

Model: DecisionTreeClassifier
The program took 9.126 seconds to complete.
The accuracy of the classifier on the test set is 0.656


### Gradient Boosting Classifier

In [37]:
from sklearn.ensemble import GradientBoostingClassifier
gbc, accuracy = fitModel(GradientBoostingClassifier(), train_X, test_X, train_y, test_y)

if accuracy > ACCURACY_THRESHOLD:
    models.append(gbc)

Model: GradientBoostingClassifier
The program took 283.471 seconds to complete.
The accuracy of the classifier on the test set is 0.661


### Support Vector Machine

In [38]:
from sklearn.svm import SVC
svc, accuracy = fitModel(SVC(), train_X, test_X, train_y, test_y)

if accuracy > ACCURACY_THRESHOLD:
    models.append(svc)

Model: SVC
The program took 251.189 seconds to complete.
The accuracy of the classifier on the test set is 0.697


In [39]:
print('Number of models that meet minimum accuracy threshold: {}'.format(len(models)))

Number of models that meet minimum accuracy threshold: 7


---
## Data Transformation for Test Dataset
- Tokenising and vectorising the 'text' column
- Replacing 'text' with the token columns

In [40]:
test_vec_text = vectorizer.transform(test['text'])
test_final = prepare_df_final(test.drop(['sentiment'], axis=1), test_vec_text, ['text'])

In [41]:
test_final = scipy.sparse.csr_matrix(test_final.values)
test_final.shape

(3534, 20082)

---
## Model Prediction and Results
Show details of all models that will be used for prediction

In [42]:
models

[RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                 max_iter=None, normalize=False, random_state=None,
                 solver='auto', tol=0.001),
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=1, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
 XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
               colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
               importance_type='gain', interaction_constraints=None,
               learning_rate=0.300000012, max_delta_step=0, max_depth=6,
               min_child_weight=1, missing=nan, monotone_constraints=None,
               n_estimators=100, n_jobs=0, num_parallel_tree=1,
               objective='multi:softprob', random_state=0

In [43]:
import sys

predictions = []
for model in models:
    try:
        if type(model).__name__ == 'XGBClassifier':
            prediction = model.predict(test_final.toarray())
        else:
            prediction = model.predict(test_final)      
        predictions.append(prediction)
    except Exception:
        print(model)
        print("Unexpected error:", sys.exc_info()[0])
        continue

### Respective Model Results - Weighted F1 Score
Random Forest Classifier performs best on the out-of-sample data (weighted F1 of 0.712).

In [44]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

results = list(zip(models, predictions))
labels = np.unique(test.sentiment)

f1 = {}
for result in results:
    model = type(result[0]).__name__
    score = f1_score(test.sentiment, result[1], labels=labels, average='weighted')
    f1[model] = score
    
pd.DataFrame([f1], index=['Weighted_F1']).T

| | Weighted_F1 |
|---|---|
| DecisionTreeClassifier | 0.667039 |
| GradientBoostingClassifier | 0.648697 |
| LogisticRegression | 0.703477 |
| RandomForestClassifier | 0.711881 |
| RidgeClassifier | 0.679198 |
| SVC | 0.696089 |
| XGBClassifier | 0.696794 |


### Ensembling the models
Aggregating the results by majority vote: for each tweet, take the label predicted by the most models.

In [45]:
predict_results = pd.DataFrame(predictions)
predict_results = pd.Series(predict_results.mode().iloc[0])
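
The two lines above amount to a majority vote: each row of `predict_results` holds one model's predictions, and `mode()` picks the most frequent label per tweet. A plain-Python sketch of the same idea, with made-up predictions, is shown below. It assumes ties are broken by taking the alphabetically first label, which I believe mirrors taking the first row of the sorted output of `mode()`.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions: list of per-model label lists, all of the same length."""
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # Tie-break: alphabetically first label among the most frequent ones
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Hypothetical predictions from three models for three tweets
model_a = ["positive", "neutral", "negative"]
model_b = ["positive", "negative", "neutral"]
model_c = ["neutral", "negative", "positive"]

print(majority_vote([model_a, model_b, model_c]))
```

With seven models and three labels, ties are possible (for example 3-3-1), so the tie-breaking rule does affect some predictions.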

In [46]:
compare = test.copy()
compare['predicted'] = predict_results

In [47]:
pd.DataFrame([f1_score(compare.sentiment, compare.predicted, average='weighted')], index=['Weighted_F1'], columns=['Ensemble'])

| | Ensemble |
|---|---|
| Weighted_F1 | 0.709599 |


Confusion matrix with true labels in rows and predicted labels in columns

In [48]:
cm =  confusion_matrix(compare.sentiment, compare.predicted, labels=labels)

pd.DataFrame(cm, index=labels, columns=labels)

| | negative | neutral | positive |
|---|---|---|---|
| negative | 615 | 341 | 45 |
| neutral | 170 | 1103 | 157 |
| positive | 37 | 277 | 789 |


F1 Score of respective labels

In [49]:
pd.DataFrame([f1_score(compare.sentiment, compare.predicted, average=None, labels=labels)], index=['F1'], columns=labels)

| | negative | neutral | positive |
|---|---|---|---|
| F1 | 0.674712 | 0.700095 | 0.753582 |
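
The per-label and weighted F1 scores can be reproduced from the confusion matrix above. The sketch below does this in plain Python, using F1 = 2·TP / (row total + column total) for each label and weighting each label by its number of true examples (the row total). The function name `f1_from_confusion` is made up for this illustration.

```python
labels = ["negative", "neutral", "positive"]

# Rows are true labels, columns are predicted labels (values from the matrix above)
cm = [
    [615, 341, 45],
    [170, 1103, 157],
    [37, 277, 789],
]


def f1_from_confusion(cm):
    """Return (per-label F1 scores, support-weighted F1) for a square confusion matrix."""
    scores, supports = [], []
    for i in range(len(cm)):
        true_total = sum(cm[i])                 # TP + FN
        pred_total = sum(row[i] for row in cm)  # TP + FP
        scores.append(2 * cm[i][i] / (true_total + pred_total))
        supports.append(true_total)
    weighted = sum(s * n for s, n in zip(scores, supports)) / sum(supports)
    return scores, weighted


scores, weighted = f1_from_confusion(cm)
for label, score in zip(labels, scores):
    print(f"{label}: {score:.6f}")
print(f"weighted: {weighted:.6f}")
```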


## Conclusion

Random Forest Classifier has the best out-of-sample performance, with a weighted F1 score of 0.712, marginally ahead of the majority-vote ensemble (0.710).  The next analysis will attempt to fine tune the parameters of the RF model to see whether this can be improved.

In addition, due to memory issues, I was not able to create more features by increasing the n-gram range beyond 1.  I believe model accuracy will improve with higher-order n-grams.  I will explore ways to overcome this issue; running this notebook on Google Colab may work.