## Sentiment Analysis - Tweets

I have downloaded a dataset of tweets from Analytics Vidhya. I'll be building my own sentiment analysis trainer using this dataset and a bunch of tools that I learnt recently.

In [39]:
import pandas as pd
import spacy
import numpy as np

nlp = spacy.load('en_core_web_md')

In [26]:
dataset = 'datasets/tweets.csv'
dataframe = pd.read_csv(dataset)
dataframe.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


Let's use spaCy to tokenize each tweet, remove stop words, and generate vectors for the remaining tokens.

In [142]:
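def tokenize(text):
    """Return the informative spaCy tokens of a tweet, or None if none remain."""
    doc = nlp(text)
    tokens = [
        token for token in doc
        # Drop stop words, punctuation, digits, whitespace and
        # out-of-vocabulary tokens (which have no word vector).
        if not (token.is_stop or token.is_punct or token.is_digit
                or token.is_space or token.is_oov)
    ]
    # None marks tweets with nothing left, so they can be filtered out with notna().
    return tokens or None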

dataframe["tokens"] = dataframe["text"].apply(tokenize)
dataframe["tokens"]

0                                                   [said]
1            [plus, added, commercials, experience, tacky]
2                                [today, mean, need, trip]
3        [aggressive, blast, obnoxious, entertainment, ...
4                                        [big, bad, thing]
                               ...                        
14635             [thank, got, different, flight, Chicago]
14637                          [bring, American, Airlines]
14638    [money, change, flight, answer, phones, sugges...
14639    [ppl, need, know, seats, flight, Plz, standby,...
Name: tokens, Length: 14640, dtype: object

In [143]:
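# Drop tweets with no usable tokens; copy() avoids a SettingWithCopyWarning when columns are added later.
dataframe = dataframe[dataframe["tokens"].notna()].copy()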
dataframe

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,tokens,vectors,mv
0,570306133677760513,neutral,1.0000,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),[said],"[-0.061115395, 0.304086, -0.097632006, -0.2018...","[-0.27903, 0.7704, -0.14395, -0.22742, 0.04295..."
1,570301130888122368,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),"[plus, added, commercials, experience, tacky]","[-0.016589679, 0.14309001, -0.13764508, -0.164...","[-0.13081339, 0.13338801, -0.0367288, -0.22699..."
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),"[today, mean, need, trip]","[0.035537396, 0.17360125, -0.17913023, -0.0846...","[0.0019049942, 0.1551125, -0.13563375, -0.0625..."
3,570301031407624196,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),"[aggressive, blast, obnoxious, entertainment, ...","[-0.105681576, 0.24877128, -0.08282183, -0.085...","[-0.21348687, 0.14160678, 0.003382216, -0.0696..."
4,570300817074462722,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),"[big, bad, thing]","[-0.17423238, 0.28028718, -0.15822393, -0.0262...","[-0.42118666, 0.19221734, -0.206463, -0.15501,..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14635,569587686496825344,positive,0.3487,,0.0000,American,,KristenReenders,,0,@AmericanAir thank you we got on a different f...,,2015-02-22 12:01:01 -0800,,,"[thank, got, different, flight, Chicago]","[0.021060742, 0.08695117, -0.19565582, -0.0582...","[0.0071934015, 0.0038350075, -0.14336602, -0.1..."
14636,569587371693355008,negative,1.0000,Customer Service Issue,1.0000,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,,"[leaving, minutes, Late, Flight, warnings, com...","[-0.019926567, 0.18126783, -0.04629069, -0.089...","[0.036526017, 0.12320561, -0.068907924, -0.069..."
14637,569587242672398336,neutral,1.0000,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",,"[bring, American, Airlines]","[-0.07359338, 0.1143195, 0.077381626, 0.021280...","[-0.11785234, 0.034711335, 0.33031765, -0.1876..."
14638,569587188687634433,negative,1.0000,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada),"[money, change, flight, answer, phones, sugges...","[-0.07736141, 0.19389972, -0.24701768, -0.0365...","[-0.10838828, 0.13719143, -0.046205997, -0.050..."


In [144]:
from sklearn.preprocessing import LabelEncoder
target = LabelEncoder().fit_transform(dataframe["airline_sentiment"])
target.shape

(14552,)
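
To interpret the integer labels later, it helps to know how `LabelEncoder` maps classes. Here is a minimal sketch on a toy list of labels (not the real column). The encoder sorts the class names, so for these three sentiments the mapping comes out as negative = 0, neutral = 1, positive = 2.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for dataframe["airline_sentiment"] (illustrative only).
labels = ["neutral", "positive", "negative", "negative", "neutral"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original names, indexed by their integer code.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

Keeping a reference to the fitted encoder, rather than creating it inline, also makes it possible to call `encoder.inverse_transform` on predictions later.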

## spaCy Vectors

I am going to try two approaches to generating vectors. The first is a lazy approach: I'll just assume the tweet is a valid English sentence (which it certainly is not) and let spaCy generate a vector for the whole text. In the second, I will clean up the tweet and take the mean of the vectors of the remaining tokens.

In [45]:
def vectorise(text):
    doc = nlp(text)
    return doc.vector

dataframe["vectors"] = dataframe["text"].apply(vectorise)

In [150]:
def mean_vector_for_tokens(list_of_tokens):
    vectors = []
    for token in list_of_tokens:
        vectors.append(token.vector)
    if len(vectors) == 0:
        return None
    mv = np.mean(vectors, axis=0)
    return mv

dataframe["mv"] = dataframe["tokens"].apply(mean_vector_for_tokens)
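
The difference between the two approaches is easier to see on toy numbers. spaCy's `doc.vector` defaults to the average of all token vectors, so the second approach differs only in which tokens take part in the average. The sketch below uses made-up 3-dimensional embeddings and a hypothetical stop-word set rather than the real `en_core_web_md` vectors.

```python
import numpy as np

# Made-up 3-d embeddings; the real model uses 300 dimensions.
toy_vectors = {
    "the":    np.array([0.0, 0.0, 3.0]),
    "flight": np.array([1.0, 0.0, 0.0]),
    "was":    np.array([0.0, 0.0, 3.0]),
    "bad":    np.array([0.0, 1.0, 0.0]),
}
toy_stop_words = {"the", "was"}  # hypothetical stop-word list

words = ["the", "flight", "was", "bad"]

# Approach 1: average over every token, like doc.vector.
lazy = np.mean([toy_vectors[w] for w in words], axis=0)

# Approach 2: average over the tokens that survive cleaning, like the "mv" column.
kept = [w for w in words if w not in toy_stop_words]
cleaned = np.mean([toy_vectors[w] for w in kept], axis=0)

print(lazy)     # the stop words pull the average towards their direction
print(cleaned)  # only "flight" and "bad" contribute
```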


In [146]:
train_set = dataframe["mv"].apply(pd.Series)
train_set.shape

(14552, 300)

In [158]:
train_set_2 = dataframe["vectors"].apply(pd.Series)
train_set_2.shape

(14552, 300)

In [168]:
from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(train_set, target, random_state=20, stratify=target)
print(x_train.shape, y_train.shape)
print(x_valid.shape, y_valid.shape)

x_train2, x_valid2, y_train2, y_valid2 = train_test_split(train_set_2, target, random_state=20, stratify=target)
print(x_train2.shape, y_train2.shape)
print(x_valid2.shape, y_valid2.shape)

(10914, 300) (10914,)
(3638, 300) (3638,)
(10914, 300) (10914,)
(3638, 300) (3638,)


## Classifier Approaches

I am going to try three different classifier models with each of the two vector sets above and see how they perform:
1. Logistic Regression
2. Decision Tree Classifier
3. Simple SVM Classifier
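
All three experiments repeat the same fit/predict/score pattern, so a small helper could cut the duplication. The sketch below is one way to do it. `evaluate` is a hypothetical name, and the data is synthetic (from `make_classification`) rather than the tweet vectors.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate(model, x_train, x_valid, y_train, y_valid):
    """Fit the model and return train/validation accuracy (hypothetical helper)."""
    model.fit(x_train, y_train)
    return {
        "train": accuracy_score(y_train, model.predict(x_train)),
        "valid": accuracy_score(y_valid, model.predict(x_valid)),
    }


# Synthetic 3-class data standing in for the 300-d tweet vectors.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=20
)
x_tr, x_va, y_tr, y_va = train_test_split(X, y, random_state=20, stratify=y)

scores = evaluate(LogisticRegression(max_iter=1000), x_tr, x_va, y_tr, y_va)
print(scores)
```

The same call would work with a `DecisionTreeClassifier()` or `svm.SVC()` instance, since all three expose `fit` and `predict`.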


In [173]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lgmodel1 = LogisticRegression(max_iter=1000)
lgmodel1.fit(x_train, y_train)

lgmodel2 = LogisticRegression(max_iter=1000)
lgmodel2.fit(x_train2, y_train2)

predictions = {
    "train1": lgmodel1.predict(x_train),
    "valid1": lgmodel1.predict(x_valid),
    "train2": lgmodel2.predict(x_train2),
    "valid2": lgmodel2.predict(x_valid2),
}

accuracy_lg = {
    "train1": accuracy_score(y_train, predictions["train1"]),
    "valid1": accuracy_score(y_valid, predictions["valid1"]),
    "train2": accuracy_score(y_train2, predictions["train2"]),
    "valid2": accuracy_score(y_valid2, predictions["valid2"])
}

accuracy_lg

{'train1': 0.7915521348726406,
 'valid1': 0.765805387575591,
 'train2': 0.817482133040132,
 'valid2': 0.7976910390324354}

In [175]:
from sklearn.tree import DecisionTreeClassifier

dtcmodel1 = DecisionTreeClassifier()
dtcmodel1.fit(x_train, y_train)

dtcmodel2 = DecisionTreeClassifier()
dtcmodel2.fit(x_train2, y_train2)


predictions = {
    "train1": dtcmodel1.predict(x_train),
    "valid1": dtcmodel1.predict(x_valid),
    "train2": dtcmodel2.predict(x_train2),
    "valid2": dtcmodel2.predict(x_valid2),
}

accuracy_dtc = {
    "train1": accuracy_score(y_train, predictions["train1"]),
    "valid1": accuracy_score(y_valid, predictions["valid1"]),
    "train2": accuracy_score(y_train2, predictions["train2"]),
    "valid2": accuracy_score(y_valid2, predictions["valid2"])
}

accuracy_dtc

{'train1': 0.9937694704049844,
 'valid1': 0.6388125343595382,
 'train2': 0.9974344878138172,
 'valid2': 0.6390874106652007}

In [174]:
from sklearn import svm

svcmodel1 = svm.SVC()
svcmodel1.fit(x_train, y_train)

svcmodel2 = svm.SVC()
svcmodel2.fit(x_train2, y_train2)

predictions = {
    "train1": svcmodel1.predict(x_train),
    "valid1": svcmodel1.predict(x_valid),
    "train2": svcmodel2.predict(x_train2),
    "valid2": svcmodel2.predict(x_valid2),
}

accuracy_svc = {
    "train1": accuracy_score(y_train, predictions["train1"]),
    "valid1": accuracy_score(y_valid, predictions["valid1"]),
    "train2": accuracy_score(y_train2, predictions["train2"]),
    "valid2": accuracy_score(y_valid2, predictions["valid2"])
}

accuracy_svc

{'train1': 0.8373648524830493,
 'valid1': 0.7792743265530512,
 'train2': 0.8233461608942643,
 'valid2': 0.8007146783947223}

## Results

From the above runs, we see that the best-case validation accuracy is only about 80%. The Decision Tree clearly overfits: it scores over 99% on the training data but only about 64% on the validation data. I don't yet know which parameters I would need to pass to improve its performance, so I'll park that for later.
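
One possible starting point for the decision tree: by default, scikit-learn's `DecisionTreeClassifier` keeps splitting until its leaves are pure, which tends to memorise the training set. Parameters such as `max_depth` and `min_samples_leaf` limit that growth. The sketch below uses synthetic data and arbitrary parameter values (assumptions, not tuned for the tweets), so it only illustrates the mechanism.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy 3-class data; not the tweet vectors.
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=6, n_classes=3,
    flip_y=0.2, random_state=20,
)
x_tr, x_va, y_tr, y_va = train_test_split(X, y, random_state=20, stratify=y)

# Default tree: keeps splitting until every leaf is pure.
full = DecisionTreeClassifier(random_state=20).fit(x_tr, y_tr)

# Constrained tree: the depth and leaf-size values here are arbitrary examples.
pruned = DecisionTreeClassifier(
    max_depth=5, min_samples_leaf=10, random_state=20
).fit(x_tr, y_tr)

# Print depth, training accuracy and validation accuracy for each tree.
for name, model in [("full", full), ("pruned", pruned)]:
    print(name, model.get_depth(),
          round(model.score(x_tr, y_tr), 3), round(model.score(x_va, y_va), 3))
```

Whether the constrained tree actually does better on validation data depends on the dataset, so these values would need to be tuned, for example with a grid search.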

We know that we have taken an extremely simple approach here. The vector generation is quite naive: it discards a good deal of information from the tweet text itself and doesn't take the metadata into account. But before exploring the metadata, I am going to repeat this with a simple TF-IDF vectoriser and then again with Word2Vec to see how they perform given just the above information.
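
As a preview of the TF-IDF step, here is a minimal sketch using scikit-learn's `TfidfVectorizer` on three made-up tweets. The example texts and the `stop_words="english"` setting are assumptions for illustration. The real experiment would fit on the `text` column instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up tweets for illustration only.
tweets = [
    "the flight was late again",
    "great crew and a smooth flight",
    "lost my bag and nobody answers the phone",
]

vectoriser = TfidfVectorizer(stop_words="english")
features = vectoriser.fit_transform(tweets)  # sparse matrix: one row per tweet

print(features.shape)
print(sorted(vectoriser.vocabulary_)[:5])
```

Unlike the dense 300-dimensional spaCy vectors, the result is a sparse matrix with one column per vocabulary term.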

### Summary of the experiment so far


Approach (validation accuracy)       | Logistic Regression | Decision Tree | SVM
-------------------------------------|---------------------|---------------|-------
Tweet as Vector (`vectors`, set 2)   | 79.8%               | 63.9%         | 80.1%
Tokens Vector (`mv`, set 1)          | 76.6%               | 63.9%         | 77.9%
