## Classification Experiments Continued

In this notebook, I plan to explore the same tweets dataset and the same three classification algorithms, but using word2vec and TF-IDF vector models.
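
For the word2vec features, each tweet ends up represented by the average of its word vectors, with out-of-vocabulary words skipped. Here is a minimal sketch of that idea, assuming a made-up three-word vocabulary of 2-d vectors. The names `toy_vectors` and `mean_vector` are illustrative only and are not part of gensim.

```python
import numpy as np

# Hypothetical toy embeddings; the real word2vec vectors used below have 300 dimensions.
toy_vectors = {
    "bad": np.array([-1.0, 0.0]),
    "flight": np.array([0.0, 1.0]),
    "great": np.array([1.0, 0.0]),
}

def mean_vector(words):
    """Average the vectors of known words; return None if none are known."""
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    return np.mean(known, axis=0) if known else None

print(mean_vector(["great", "flight", "virginamerica"]))  # the unknown word is skipped
print(mean_vector(["virginamerica"]))                      # nothing to average
```

A tweet with no known words has no vector at all, which is why such rows have to be dropped before training.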

In [82]:
import pandas as pd
import numpy as np
import gensim

# get the google news dataset model binary from here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g
model = gensim.models.KeyedVectors.load_word2vec_format('./models/GoogleNews-vectors-negative300.bin', binary=True)

dataset = 'datasets/tweets.csv'
dataframe = pd.read_csv(dataset)
dataframe.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [84]:
from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import preprocess_string

    
dataframe["clean_words"] = dataframe["text"].apply(preprocess_string)
dataframe.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,clean_words
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),"[virginamerica, dhepburn, said]"
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),"[virginamerica, plu, ad, commerci, experi, tacki]"
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),"[virginamerica, todai, mean, need, trip]"
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),"[virginamerica, aggress, blast, obnoxi, entert..."
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),"[virginamerica, big, bad, thing]"


In [99]:
def generate_mean_vector(words: [str]):
    vectors = []
    for word in words:
        try:
            vectors.append(model.get_vector(word, norm=True))
        except KeyError:
            pass
    # np.nan works across NumPy versions (the np.NaN alias was removed in NumPy 2.0)
    return np.mean(vectors, axis=0) if len(vectors) > 0 else np.nan

dataframe["mv"] = dataframe["clean_words"].apply(generate_mean_vector)

In [100]:
# remove rows that have a nan in mv column
dataframe = dataframe[dataframe["mv"].notna()]
dataframe.shape

(14508, 17)

In [87]:
from sklearn.preprocessing import LabelEncoder
target = LabelEncoder().fit_transform(dataframe["airline_sentiment"])
target.shape

(14508,)

In [88]:
train_set = dataframe["mv"].apply(pd.Series)
train_set.shape

(14508, 300)

In [89]:
from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(train_set, target, random_state=20, stratify=target)
print(x_train.shape, y_train.shape)
print(x_valid.shape, y_valid.shape)

(10881, 300) (10881,)
(3627, 300) (3627,)


In [90]:
from sklearn.metrics import accuracy_score

def predict(model, x_train, x_valid):
    return {
        "train": model.predict(x_train),
        "valid": model.predict(x_valid),
    }

def accuracy(predictions, y_train, y_valid):
    return {
        "train": accuracy_score(y_train, predictions["train"]),
        "valid": accuracy_score(y_valid, predictions["valid"])
    }


In [91]:
from sklearn.linear_model import LogisticRegression

def logistic_regression(x_train, y_train, x_valid, y_valid):
    lgmodel = LogisticRegression(max_iter=1000)
    lgmodel.fit(x_train, y_train)

    predictions = predict(lgmodel, x_train, x_valid)
    return accuracy(predictions, y_train, y_valid)

print(logistic_regression(x_train, y_train, x_valid, y_valid))

{'train': 0.7463468431210367, 'valid': 0.7419354838709677}


In [92]:
from sklearn.tree import DecisionTreeClassifier

def decision_tree_classifier(x_train, y_train, x_valid, y_valid):
    dtcmodel = DecisionTreeClassifier()
    dtcmodel.fit(x_train, y_train)

    predictions = predict(dtcmodel, x_train, x_valid)
    return accuracy(predictions, y_train, y_valid)

print(decision_tree_classifier(x_train, y_train, x_valid, y_valid))

{'train': 0.9885120852862789, 'valid': 0.5985663082437276}


In [93]:
from sklearn import svm

def svc(x_train, y_train, x_valid, y_valid):
    svcmodel = svm.SVC()
    svcmodel.fit(x_train, y_train)

    predictions = predict(svcmodel, x_train, x_valid)
    return accuracy(predictions, y_train, y_valid)

print(svc(x_train, y_train, x_valid, y_valid))

{'train': 0.8312655086848635, 'valid': 0.7620623104494072}


## Results so far

Performance with the gensim word2vec vectors and with plain spaCy vectors is quite similar. Perhaps that just shows that both capture much the same picture of the English language, and that we haven't really done anything all that different. The next step is to repeat this with TF-IDF vectors and see if they differ from the above two approaches.
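
Before running TF-IDF on the tweets, here is a small sketch of what the vectorizer produces. It assumes a made-up three-document corpus (`docs` is a stand-in, not the tweets data). A term that appears in fewer documents gets a larger weight than a term that appears in many.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the cleaned tweets.
docs = ["late flight bad", "great flight", "bad service late"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = vectorizer.vocabulary_           # maps each term to its column index
row = matrix[1].toarray().ravel()        # weights for "great flight"

print(matrix.shape)
print("great:", row[vocab["great"]], "flight:", row[vocab["flight"]])
```

In the second document, "great" occurs in only one document of the corpus while "flight" occurs in two, so "great" gets the larger weight.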

In [94]:
# re-read the dataset. we dropped a few rows earlier, so let's restart from scratch for this.
dataframe = pd.read_csv(dataset)
dataframe.head()
dataframe["clean_words"] = dataframe["text"].apply(preprocess_string)

In [102]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(dataframe["clean_words"].apply(lambda x: ' '.join((map(str, x)))))
tfidf_matrix.shape

(14508, 10150)

In [103]:
target = LabelEncoder().fit_transform(dataframe["airline_sentiment"])
target.shape

x_train, x_valid, y_train, y_valid = train_test_split(tfidf_matrix, target, random_state=20, stratify=target)
print(x_train.shape, y_train.shape)
print(x_valid.shape, y_valid.shape)

(10881, 10150) (10881,)
(3627, 10150) (3627,)


In [104]:
print(logistic_regression(x_train, y_train, x_valid, y_valid))
print(decision_tree_classifier(x_train, y_train, x_valid, y_valid))
print(svc(x_train, y_train, x_valid, y_valid))

{'train': 0.8652697362374782, 'valid': 0.7741935483870968}
{'train': 0.9949453175259627, 'valid': 0.6845878136200717}
{'train': 0.9518426615200809, 'valid': 0.775296388199614}


This is slightly better than the above results, but still nowhere near where I want it to be. I think it should be possible to get to 90% accuracy in a pretty straightforward fashion. Also, I think I'll drop DTC going forward. Its training accuracy is close to 100% while its validation accuracy is the lowest of the three, so I need to read up and understand where the algorithm really works. Maybe I am just using it all wrong here.
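
One thing I could look at for the decision tree is how deep it is allowed to grow. The sketch below is not run on the tweets. It assumes synthetic noisy data and compares an unrestricted tree with one capped at `max_depth=3`, a value picked arbitrarily for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 points, 20 features, noisy binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

x_train, x_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# An unrestricted tree keeps splitting until every training point is classified correctly.
full = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
# A depth cap stops the tree from memorizing the noise in the training set.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(x_train, y_train)

for name, tree in [("unrestricted", full), ("max_depth=3", shallow)]:
    print(name, tree.score(x_train, y_train), tree.score(x_valid, y_valid))
```

Whether a depth cap helps on the tweet vectors is something I would still have to check. The sketch only shows the knob.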

The next step is to try combining the above vectors into a single feature matrix and training on that.
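
Combining here means placing the sparse TF-IDF columns and the dense embedding columns side by side, then splitting that combined matrix. A minimal sketch, assuming small random stand-ins for both blocks (`sparse_block`, `dense_block` and `labels` are hypothetical):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows = 12

# Stand-ins: a sparse "TF-IDF" block (5 columns) and a dense "embedding" block (3 columns).
sparse_block = csr_matrix(rng.integers(0, 2, size=(n_rows, 5)).astype(float))
dense_block = rng.normal(size=(n_rows, 3))
labels = np.array([0, 1, 2] * 4)

# The result format of hstack can vary, so convert to CSR explicitly for row indexing.
combined = hstack([sparse_block, csr_matrix(dense_block)]).tocsr()

# The combined matrix, not one of its parts, is what gets split.
x_train, x_valid, y_train, y_valid = train_test_split(
    combined, labels, random_state=20, stratify=labels
)
print(combined.shape, x_train.shape, x_valid.shape)
```

The column count of the training split is a quick check that both feature blocks made it into the model input.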

In [105]:
from scipy.sparse import hstack, csr_matrix

combined_train_set = hstack([tfidf_matrix, csr_matrix(train_set)])
combined_train_set.shape

(14508, 10450)

In [106]:
target = LabelEncoder().fit_transform(dataframe["airline_sentiment"])
target.shape

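# NOTE: this splits tfidf_matrix, not combined_train_set, so the scores below are for TF-IDF features only.
# Pass combined_train_set here to evaluate the combined features.
x_train, x_valid, y_train, y_valid = train_test_split(tfidf_matrix, target, random_state=20, stratify=target)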
print(x_train.shape, y_train.shape)
print(x_valid.shape, y_valid.shape)

(10881, 10150) (10881,)
(3627, 10150) (3627,)


In [107]:
print(logistic_regression(x_train, y_train, x_valid, y_valid))
print(decision_tree_classifier(x_train, y_train, x_valid, y_valid))
print(svc(x_train, y_train, x_valid, y_valid))

{'train': 0.8652697362374782, 'valid': 0.7741935483870968}
{'train': 0.9949453175259627, 'valid': 0.6873449131513648}
{'train': 0.9518426615200809, 'valid': 0.775296388199614}


One last thing to try before moving on to metadata is the GloVe embeddings. We are going to use their pre-trained model vectors. You can get a copy from [here](https://github.com/stanfordnlp/GloVe). Since this is Twitter data, I am going to use their Twitter-trained model.

In [108]:
# I am going to start from scratch for GloVe here.
dataframe = pd.read_csv(dataset)
dataframe.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


Let's try to load GloVe vectors instead of the word2vec vectors loaded through gensim. GloVe has a vector set trained on Twitter data.
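
The GloVe files are plain text, with one word per line followed by its vector components separated by spaces. A small sketch of parsing that layout, assuming two made-up lines held in memory instead of the real file:

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout (the real Twitter vectors have 25 to 200 dimensions).
sample = io.StringIO("<user> 0.1 -0.2 0.3\nflight 0.5 0.0 -0.5\n")

embeddings = {}
for line in sample:
    # The first field is the word; the remaining fields are the vector components.
    word, *values = line.split()
    embeddings[word] = np.asarray(values, dtype="float32")

print(embeddings["flight"])
```

Holding everything in a plain dict is simple but memory hungry for a vocabulary of over a million words.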

In [109]:
glove_embeddings = {}
def load_glove(glove_file):
    # be explicit about the encoding; the vocabulary contains non-ASCII tokens
    with open(glove_file, 'r', encoding='utf-8') as gf:
        for line in gf:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], "float32")
            glove_embeddings[word] = vector

load_glove('models/glove/glove.twitter.27B.200d.txt')

In [110]:
len(glove_embeddings.keys())

1193514

In [126]:
import re
#import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

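# raw strings keep the backslash in the digits pattern from being read as an invalid string escape
mention = re.compile(r'^@.*')
hashtag = re.compile(r'^#.*')
punctuation = re.compile(r'[!#,*$%^&.-]')
digits = re.compile(r'^[0-9,\.].*')
urls = re.compile(r'^http.*')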
def cleanup(text:str) -> [str]:
    # lowercase the text first
    ltext = remove_stopwords(text.lower())
    words = ltext.split(' ')
    clean_words = []
    # since these are tweets, raw @ mentions can't really map meaningfully to lang vectors,
    # so swap them (and hashtags, numbers, urls) for placeholder tokens.
    for word in words:
        # replace mentions with <user>
        # replace hashtags with <hashtag>
        # replace numbers with <number>
        # replace urls with <url>
        clean = re.sub(mention, '<user>', word)
        clean = re.sub(hashtag, '<hashtag>', clean)
        clean = re.sub(digits, '<number>', clean)
        clean = re.sub(urls, '<url>', clean)
        clean = re.sub(punctuation, '', clean)
        if clean and clean not in stop_words:
            clean_words.append(clean)
    return clean_words if len(clean_words) > 0 else None


dataframe["glove_clean"] = dataframe["text"].apply(cleanup)
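
To see what the placeholder replacement does to a single tweet, here is a simplified, standalone sketch. `normalize_token` is a hypothetical helper that mirrors the idea of `cleanup` without the stop-word handling, and its number check is narrower than the `digits` pattern above.

```python
import re

def normalize_token(word):
    """Map tweet-specific tokens to placeholders and strip punctuation from the rest."""
    if word.startswith("@"):
        return "<user>"
    if word.startswith("#"):
        return "<hashtag>"
    if word.startswith("http"):
        return "<url>"
    if re.match(r"[0-9]", word):
        return "<number>"
    return re.sub(r"[!#,*$%^&.-]", "", word)

sample = "@united flight 42 delayed! #fail http://t.co/abc"
tokens = [normalize_token(w) for w in sample.split()]
print(tokens)
```

Each placeholder is a single token, so every mention, hashtag, number or URL collapses to one lookup key.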

In [148]:
def glove_vector(list_of_tokens: [str]):
    vectors = []
    # cleanup() returns None for tweets with no usable tokens, so guard against that here
    for token in list_of_tokens or []:
        vec = glove_embeddings.get(token)
        if vec is not None:
            vectors.append(vec)

    if len(vectors) > 0:
        return np.mean(vectors, axis=0)
    return None

dataframe["glove_vectors"] = dataframe["glove_clean"].apply(glove_vector)

#glove_vector(dataframe["glove_clean"][3])
#print(dataframe["glove_clean"][3])


In [149]:
# remove rows that have no vector in the glove_vectors column
dataframe = dataframe[dataframe["glove_vectors"].notna()]
dataframe.shape

(14640, 17)

In [150]:
target = LabelEncoder().fit_transform(dataframe["airline_sentiment"])
print(target.shape)

train_set = dataframe["glove_vectors"].apply(pd.Series)
print(train_set.shape)

x_train, x_valid, y_train, y_valid = train_test_split(train_set, target, random_state=20, stratify=target)
print(x_train.shape, y_train.shape)
print(x_valid.shape, y_valid.shape)

(14640,)
(14640, 200)
(10980, 200) (10980,)
(3660, 200) (3660,)


In [151]:
print(logistic_regression(x_train, y_train, x_valid, y_valid))
print(decision_tree_classifier(x_train, y_train, x_valid, y_valid))
print(svc(x_train, y_train, x_valid, y_valid))

{'train': 0.7828779599271403, 'valid': 0.7612021857923498}
{'train': 0.9906193078324226, 'valid': 0.6232240437158469}
{'train': 0.7960837887067396, 'valid': 0.773224043715847}


Let's combine these with TF-IDF vectors and see if that gives us any edge.

In [152]:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(dataframe["glove_clean"].apply(lambda x: ' '.join((map(str, x)))))
tfidf_matrix.shape

(14640, 11238)

In [153]:
combined_train_set = hstack([tfidf_matrix, csr_matrix(train_set)])
combined_train_set.shape

(14640, 11438)

In [154]:
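# NOTE: this splits tfidf_matrix, not combined_train_set, so the scores below are for TF-IDF features only.
# Pass combined_train_set here to evaluate the combined features.
x_train, x_valid, y_train, y_valid = train_test_split(tfidf_matrix, target, random_state=20, stratify=target)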
print(x_train.shape, y_train.shape)
print(x_valid.shape, y_valid.shape)

(10980, 11238) (10980,)
(3660, 11238) (3660,)


In [155]:
print(logistic_regression(x_train, y_train, x_valid, y_valid))
print(decision_tree_classifier(x_train, y_train, x_valid, y_valid))
print(svc(x_train, y_train, x_valid, y_valid))

{'train': 0.8612932604735883, 'valid': 0.7734972677595628}
{'train': 0.9939890710382514, 'valid': 0.6822404371584699}
{'train': 0.9481785063752277, 'valid': 0.7601092896174864}


So from all these exhaustive experiments, it seems that using just word vectors or just the TF-IDF weights doesn't really get us over the 80% mark. In the next notebook, we'll continue using the GloVe model but we'l