## Classification Experiments Continued Some More

In this notebook, I am going to add more features for training to see if I can push the accuracy score higher. I also want to see if I can add the Naive Bayes model back. Finally, I'll add F-score scoring to compare performance.

In [51]:
import pandas as pd
import numpy as np
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack, csr_matrix
from nltk.corpus import stopwords
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

In [2]:
dataset = 'datasets/tweets.csv'
dataframe = pd.read_csv(dataset)
dataframe.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [3]:
glove_embeddings = {}
def load_glove(glove_file):
    with open(glove_file, 'r', encoding='utf-8') as gf:
        for line in gf:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], "float32")
            glove_embeddings[word] = vector

load_glove('models/glove/glove.twitter.27B.200d.txt')

In [121]:
stop_words = set(stopwords.words('english'))

mention = re.compile(r'^@.*')
hashtag = re.compile(r'^#.*')
punctuation = re.compile(r'[!#,*$%^&.\-+]')
digits = re.compile(r'^[0-9,.].*')
urls = re.compile(r'^http.*')
def cleanup(text:str) -> [str]:
    # lowercase the text first
    ltext = text.lower()
    words = ltext.split(' ')
    clean_words = []
    # since these are tweets, replace @ mentions (and hashtags, numbers, urls) with placeholder tokens. the raw strings can't really map meaningfully to lang vectors anyway.
    for word in words:
        # replace mentions with <user>
        # replace hashtags with <hashtag>
        # replace numbers with <number>
        # replace urls with <url>
        clean = re.sub(mention, '<user>', word)
        clean = re.sub(hashtag, '<hashtag>', clean)
        clean = re.sub(digits, '<number>', clean)
        clean = re.sub(urls, '<url>', clean)
        clean = re.sub(punctuation, '', clean)
        if clean and clean not in stop_words:
            clean_words.append(clean)
    return clean_words if len(clean_words) > 0 else None


dataframe["glove_clean"] = dataframe["text"].apply(cleanup)

In [122]:
def glove_vector(list_of_tokens: [str]):
    vectors = []
    for token in list_of_tokens:
        vec = glove_embeddings.get(token)
        if vec is not None:
            vectors.append(vec)

    if len(vectors) > 0:
        return np.mean(vectors, axis=0)
    return None

dataframe["glove_vectors"] = dataframe["glove_clean"].apply(glove_vector)

In [123]:
def predict(model, x_train, x_valid):
    return {
        "train": model.predict(x_train),
        "valid": model.predict(x_valid),
    }

def accuracy(predictions, y_train, y_valid):
    return {
        "train": accuracy_score(y_train, predictions["train"]),
        "valid": accuracy_score(y_valid, predictions["valid"])
    }

def compute_f1_score(predictions, y_train, y_valid):
    return {
        "train": f1_score(y_train, predictions["train"], average='weighted'),
        "valid": f1_score(y_valid, predictions["valid"], average='weighted')
    }

def fit_and_predict_returning_metrics(model, x_train, y_train, x_valid, y_valid):
    model.fit(x_train, y_train)
    predictions = predict(model, x_train, x_valid)
    acc = accuracy(predictions, y_train, y_valid)
    #return acc
    f1score = compute_f1_score(predictions, y_train, y_valid)
    #return f1score
    return model, {
        "accuracy": acc,
        "f1score": f1score
    }

def logistic_regression(x_train, y_train, x_valid, y_valid):
    lgmodel = LogisticRegression(max_iter=5000)
    return fit_and_predict_returning_metrics(lgmodel, x_train, y_train, x_valid, y_valid)

def decision_tree_classifier(x_train, y_train, x_valid, y_valid):
    dtcmodel = DecisionTreeClassifier()
    return fit_and_predict_returning_metrics(dtcmodel, x_train, y_train, x_valid, y_valid)

def svc(x_train, y_train, x_valid, y_valid):
    svcmodel = svm.SVC()
    return fit_and_predict_returning_metrics(svcmodel, x_train, y_train, x_valid, y_valid)

def naive_bayes(x_train, y_train, x_valid, y_valid):
    nbmodel = MultinomialNB()
    return fit_and_predict_returning_metrics(nbmodel, x_train, y_train, x_valid, y_valid)


### Features

All of the above is the same as what we used in the previous notebooks, so I am just copy-pasting it to get to the real thing fast. In the next set of cells, we will define the metadata features that we want to use as part of the training.

The features we wish to capture going forward are:
1. mean glove embedding vectors for the cleaned up text
2. tfidf vectors for the cleaned up text
3. number of mentions in the tweet
4. number of hashtags in the tweet
5. number of urls in the tweet
6. length of the tweet
7. retweet count
8. whether the tweet is a RT or a tweet
9. number of cleaned up tokens in the tweet
10. count of numbers in the tweet

It is possible that some of these overlap. Overlaps might negatively contribute to the overall accuracy. But let's start somewhere first.
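
One way to check for overlap before training is to look at pairwise correlations between the metadata columns. The sketch below uses a small made-up frame (the values are hypothetical; only the column names mirror the features listed above). Strongly correlated pairs, such as `length` and `num_tokens`, would be candidates for dropping.

```python
import pandas as pd

# Hypothetical toy values; only the column names mirror the notebook's features.
toy = pd.DataFrame({
    "length":       [35, 72, 140, 98, 51, 120],
    "num_tokens":   [4, 9, 17, 12, 6, 15],
    "num_mentions": [1, 1, 2, 1, 1, 3],
    "num_hashtags": [0, 1, 0, 2, 0, 1],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

# Keep each pair once (upper triangle) and flag the strongly correlated ones.
threshold = 0.9  # arbitrary cut-off chosen for illustration
overlapping = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]
print(overlapping)
```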

In [124]:
def is_retweet(text: str):
    # a retweet starts with the token "RT"; compare case-insensitively
    tokens = text.split()
    return 1 if tokens and tokens[0].upper() == 'RT' else 0

def count_entity(ent, list_of_tokens):
    count = 0
    for token in list_of_tokens:
        if token == ent:
            count += 1
    return count

def count_mentions(list_of_tokens):
    return count_entity('<user>', list_of_tokens)

def count_hashtags(list_of_tokens):
    return count_entity('<hashtag>', list_of_tokens)

def count_urls(list_of_tokens):
    return count_entity('<url>', list_of_tokens)

def count_numbers(list_of_tokens):
    return count_entity('<number>', list_of_tokens)

dataframe["num_tokens"] =  dataframe["glove_clean"].apply(lambda x: len(x))
dataframe["length"] = dataframe["text"].apply(lambda x: len(x))
dataframe["num_mentions"] = dataframe["glove_clean"].apply(count_mentions)
dataframe["num_hashtags"] = dataframe["glove_clean"].apply(count_hashtags)
dataframe["num_urls"] = dataframe["glove_clean"].apply(count_urls)
dataframe["is_rt"] = dataframe["text"].apply(is_retweet)
dataframe["numbers"] = dataframe["glove_clean"].apply(count_numbers)
dataframe.shape

(14640, 24)

In [125]:
# drop all rows where the glove vectors are absent for whatever reason
pruned_df = dataframe[dataframe["glove_vectors"].notna()]
print(pruned_df.shape)

tfidf = TfidfVectorizer()
tfidf_vectors = tfidf.fit_transform(pruned_df["glove_clean"].apply(lambda x: ' '.join((map(str, x)))))
print(tfidf_vectors.shape)

(14638, 24)
(14638, 11321)


Now let's form the training set by combining all the features.
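
As a minimal sketch of the combining step, the toy example below stacks a sparse TF-IDF matrix, a dense block of metadata counts, and a dense block standing in for the averaged embeddings. The documents and numbers are made up. The point is only that `scipy.sparse.hstack` concatenates column-wise, so the final width is the sum of the three widths, and the rows must be in the same order in every block.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for the cleaned tweets.
docs = ["flight delayed again", "great crew great flight", "lost my bag"]

tfidf_block = TfidfVectorizer().fit_transform(docs)    # sparse, one column per term
meta_block = np.array([[3, 20], [4, 23], [3, 11]])     # e.g. num_tokens, length
embed_block = np.random.default_rng(0).random((3, 5))  # stand-in for 200-d GloVe means

# Column-wise concatenation; dense blocks are wrapped so everything is sparse.
combined = hstack([tfidf_block, csr_matrix(meta_block), csr_matrix(embed_block)]).tocsr()

print(tfidf_block.shape, meta_block.shape, embed_block.shape, combined.shape)
```

The same column order has to be used again when building a vector for prediction, otherwise the model's coefficients line up with the wrong features.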

In [140]:
features = ["retweet_count", "num_tokens", "length", "num_mentions", "num_hashtags", "num_urls", "is_rt", "numbers"]
features_set = pruned_df[features]
glove_set = pruned_df["glove_vectors"].apply(pd.Series)
scaler = MinMaxScaler()
glove_set_scaled = scaler.fit_transform(glove_set)

train_set = hstack([tfidf_vectors, csr_matrix(features_set), csr_matrix(glove_set_scaled )])
train_set.shape

(14638, 11529)

In [141]:
le = LabelEncoder()
target = le.fit_transform(pruned_df["airline_sentiment"])
target.shape

(14638,)

In [142]:
x_train, x_valid, y_train, y_valid = train_test_split(train_set, target, random_state=20, stratify=target)
print(x_train.shape, y_train.shape)
print(x_valid.shape, y_valid.shape)

(10978, 11529) (10978,)
(3660, 11529) (3660,)


In [144]:
lgmodel, results = logistic_regression(x_train, y_train, x_valid, y_valid)
print(results)

{'accuracy': {'train': 0.873656403716524, 'valid': 0.7969945355191257}, 'f1score': {'train': 0.8701546693055096, 'valid': 0.7902594262037741}}


In [145]:
dtcmodel, results = decision_tree_classifier(x_train, y_train, x_valid, y_valid)
print(results)

{'accuracy': {'train': 0.9972672617963199, 'valid': 0.6418032786885246}, 'f1score': {'train': 0.9972660160562759, 'valid': 0.6431935532726364}}


In [146]:
svcmodel, results = svc(x_train, y_train, x_valid, y_valid)
print(results)

{'accuracy': {'train': 0.6386409182000364, 'valid': 0.6371584699453552}, 'f1score': {'train': 0.5349562001519731, 'valid': 0.5334870735614848}}


In [143]:
nbmodel, results = naive_bayes(x_train, y_train, x_valid, y_valid)
print(results)

{'accuracy': {'train': 0.7095099289488067, 'valid': 0.6773224043715848}, 'f1score': {'train': 0.6910913881277492, 'valid': 0.652410232472699}}


### Results

We have an overall validation accuracy of around 80% with logistic regression, which performs best in our case. Whether the metric is accuracy or F1 score, Logistic Regression simply works best for this dataset and these features. DTC always seems to overfit: here it reaches 99.7% training accuracy but only 64.2% on the validation set. This is the one exercise I need to repeat after learning a bit more about what DTC really tries to do.
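
A rough illustration of the overfitting pattern, on synthetic data rather than the tweets: scikit-learn's `DecisionTreeClassifier` with default settings keeps splitting until its leaves are pure, so training accuracy approaches 100% regardless of how well the tree generalises. Capping `max_depth` (the value 5 here is an arbitrary choice, not a tuned one) typically trades training accuracy for a smaller gap between training and validation scores.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem with some label noise (flip_y); not the tweet data.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=8,
                           n_classes=3, flip_y=0.1, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0, stratify=y)

scores = {}
for name, depth in [("unconstrained", None), ("max_depth=5", 5)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[name] = {
        "train": accuracy_score(y_tr, tree.predict(X_tr)),
        "valid": accuracy_score(y_va, tree.predict(X_va)),
    }
print(scores)
```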

Going forward, I'll use F1 Score as the metric of choice. Both Andrew Ng's course and my mentor have suggested this metric. Besides, the accuracy metric is not all that far off either.
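
To see why the two metrics can diverge, here is a toy case with made-up labels: a classifier that always predicts the majority class. Accuracy only counts correct predictions, whereas the weighted F1 used in `compute_f1_score` averages the per-class F1 by class frequency, so classes that are never predicted pull it down.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 6 examples of class 0, 2 each of classes 1 and 2.
y_true = [0] * 6 + [1] * 2 + [2] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted classes as 0 without warnings.
f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy={acc:.2f} weighted_f1={f1:.2f}")  # accuracy=0.60 weighted_f1=0.45
```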

Lastly, I snuck in a scaler for the GloVe vectors in order to be able to run a Naive Bayes classifier: `MultinomialNB` rejects negative feature values, and the raw GloVe vectors contain them. I guess this set of experiments concludes here. I'll revisit classification with a different dataset in a later notebook.
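
The reason for the scaler can be reproduced in a few lines. This sketch uses random numbers in place of the GloVe vectors: `MultinomialNB` raises a `ValueError` when the input contains negative values, and `MinMaxScaler` maps each column into [0, 1] so that the fit goes through.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))   # stand-in for embeddings; contains negatives
y = np.array([0, 1] * 10)      # made-up binary labels

try:
    MultinomialNB().fit(X, y)
    raised = False
except ValueError as err:
    raised = True
    print("raw features rejected:", err)

# Rescale every column into [0, 1]; no negative values remain.
X_scaled = MinMaxScaler().fit_transform(X)
model = MultinomialNB().fit(X_scaled, y)
print(model.predict(X_scaled[:3]))
```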

The next section implements a simple pipeline that takes a tweet and tries to predict its sentiment based on the above learning. We will use the Logistic Regression model for this purpose.

In [147]:
sample_tweet = "@USAirways flt 419. 2+ hrs Late Flight, baggage + 1 more hr. Now I see they delivered my suitcase wet inside &amp; out. #NotHappy"

clean_tokens = cleanup(sample_tweet)
word_vector = scaler.transform(glove_vector(clean_tokens).reshape(1,-1))
tfidf_vector = tfidf.transform([' '.join((map(str, clean_tokens)))])
# metadata features, in the same order as the training `features` list
rtcount = 0
num_tokens = len(clean_tokens)
length = len(sample_tweet)
num_mentions = count_mentions(clean_tokens)
num_hashtags = count_hashtags(clean_tokens)
num_urls = count_urls(clean_tokens)
is_rt = is_retweet(sample_tweet)
numbers = count_numbers(clean_tokens)
extra_features = np.array([[rtcount, num_tokens, length, num_mentions, num_hashtags, num_urls, is_rt, numbers]], dtype='float32')
# stack in the same column order as the training set: tfidf, metadata features, glove
vector = hstack([tfidf_vector, csr_matrix(extra_features), csr_matrix(word_vector)])
res = lgmodel.predict(vector)
le.inverse_transform(np.array(res))

array(['negative'], dtype=object)