## 1 - Introduction
In this notebook I explore a different approach to sentiment analysis, based on word2vec.

word2vec is a group of shallow neural network models developed by Google that aim to capture the context of words while offering a very efficient way of preprocessing raw text data. It takes as input a large corpus of documents, such as tweets or news articles, and generates a vector space of typically several hundred dimensions. Each word in the corpus is assigned a unique vector in that space.

The powerful concept behind word2vec is that word vectors that are close to each other in the vector space represent words that tend to appear in similar contexts, which often means they have related meanings as well.
What I find interesting about the vector representation of words is that it automatically embeds several features that we would normally have to handcraft ourselves. Because word2vec learns its vectors from patterns of word co-occurrence, we can rely on it to pick up many of these features without manual feature engineering.
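
To make "close to each other in the vector space" concrete, here is a minimal sketch using cosine similarity, a common way to compare word vectors. The three-dimensional vectors are made-up toy values chosen for illustration, not output from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (real word2vec vectors have hundreds of dimensions).
toy_vectors = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.6, 0.1],
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_awful = cosine_similarity(toy_vectors["good"], toy_vectors["awful"])

# Words used in similar contexts should score higher than unrelated ones.
print(sim_good_great > sim_good_awful)
```

A value near 1 means the two vectors point in almost the same direction, while a value near 0 or below means they are unrelated or opposed.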

## 2 - Environment set-up and data preparation

# Required modules

In [104]:
import pandas as pd # provide sql-like data manipulation tools. very handy.
pd.options.mode.chained_assignment = None
import numpy as np # high dimensional vector computing library.
from copy import deepcopy
from string import punctuation
from random import shuffle

import gensim
from gensim.models.word2vec import Word2Vec # the word2vec model gensim class
LabeledSentence = gensim.models.doc2vec.LabeledSentence # pairs a token list with a tag; used in labelizeTweets below

from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

from nltk.tokenize import word_tokenize # nltk's general-purpose word tokenizer.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Loading the training and test data

In [55]:
train = pd.read_csv("kaggle/labeledTrainData.tsv",header=0, \
                    delimiter="\t", quoting=3)
test_data = pd.read_csv("kaggle/testData.tsv", header = 0, delimiter="\t", quoting=3)

In [56]:
# test_data.drop(['id'], axis=1, inplace=True)
test_data.head()

Unnamed: 0,id,review
0,"""12311_10""","""Naturally in a film who's main themes are of ..."
1,"""8348_2""","""This movie is a disaster within a disaster fi..."
2,"""5828_4""","""All in all, this is a movie for kids. We saw ..."
3,"""7186_2""","""Afraid of the Dark left me with the impressio..."
4,"""12128_7""","""A very accurate depiction of small time mob l..."


In [57]:
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [58]:
train.describe()
train["sentiment"].value_counts() # the two classes are balanced

1    12500
0    12500
Name: sentiment, dtype: int64

In [59]:
len(train)

25000

In [60]:
train['id'].nunique() # no duplicate values

25000

In [68]:
def ingest():
    data = pd.read_csv("kaggle/labeledTrainData.tsv", header=0,
                       delimiter="\t", quoting=3)
    data.drop(['id'], axis=1, inplace=True)
    # keep only rows that have both a label and a review
    data = data[data['sentiment'].notnull() & data['review'].notnull()]
    data['sentiment'] = data['sentiment'].map(int)
    data.reset_index(drop=True, inplace=True)
    print('dataset loaded with shape %s' % (data.shape,))
    return data

data = ingest()
data.head(5)

dataset loaded with shape (25000, 2)


Unnamed: 0,sentiment,review
0,1,"""With all this stuff going down at the moment ..."
1,1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,0,"""The film starts with a manager (Nicholas Bell..."
3,0,"""It must be assumed that those who praised thi..."
4,1,"""Superbly trashy and wondrously unpretentious ..."


# Data Preprocessing

In [69]:
def tokenize(tweet):
    try:
        if isinstance(tweet, bytes):  # Python 2 str or raw bytes: decode first
            tweet = tweet.decode('utf-8')
        tokens = word_tokenize(tweet.lower())
        # drop mentions, hashtags and URLs
        return [t for t in tokens if not t.startswith(('@', '#', 'http'))]
    except Exception:
        return 'NC'  # marker for rows that could not be tokenized
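
The filtering step in `tokenize` can be tried in isolation. The sketch below assumes whitespace splitting as a stand-in for `word_tokenize` so that it runs without nltk; the helper name `drop_social_tokens` and the sample sentence are hypothetical.

```python
def drop_social_tokens(tokens):
    """Remove tokens that start with '@', '#' or 'http'."""
    return [t for t in tokens if not t.startswith(("@", "#", "http"))]

# Whitespace splitting stands in for nltk's word_tokenize here (an assumption
# made to keep the example dependency-free).
sample = "@bob loved it #mustsee http://example.com great film".split()
filtered = drop_social_tokens(sample)
print(filtered)
```

Passing a tuple to `str.startswith` checks all three prefixes in one pass, which replaces the three chained `filter` calls.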

In [70]:
def postprocess(data, n=1000000):
    data = data.head(n).copy()  # work on a copy so the caller's DataFrame is untouched
    # progress_map is tqdm's variant of map that also shows a progress bar.
    data['tokens'] = data['review'].progress_map(tokenize)
    data = data[data.tokens != 'NC']  # drop rows that failed to tokenize
    data.reset_index(drop=True, inplace=True)
    return data

data = postprocess(data)

progress-bar: 100%|██████████| 25000/25000 [02:14<00:00, 186.38it/s]


In [71]:
data.head()

Unnamed: 0,sentiment,review,tokens
0,1,"""With all this stuff going down at the moment ...","[``, with, all, this, stuff, going, down, at, ..."
1,1,"""\""The Classic War of the Worlds\"" by Timothy ...","[``, \, '', the, classic, war, of, the, worlds..."
2,0,"""The film starts with a manager (Nicholas Bell...","[``, the, film, starts, with, a, manager, (, n..."
3,0,"""It must be assumed that those who praised thi...","[``, it, must, be, assumed, that, those, who, ..."
4,1,"""Superbly trashy and wondrously unpretentious ...","[``, superbly, trashy, and, wondrously, unpret..."


# Building word2vec model

In [72]:
n = 1000000
x_train, x_test, y_train, y_test = train_test_split(np.array(data.head(n).tokens),
                                                    np.array(data.head(n).sentiment), test_size=0.2)

In [73]:
print(len(y_train))
y_test

20000


array([1, 1, 0, ..., 1, 0, 1])

In [74]:
# x_train[0]

In [75]:
def labelizeTweets(tweets, label_type):
    labelized = []
    for i,v in tqdm(enumerate(tweets)):
        label = '%s_%s'%(label_type,i)
        labelized.append(LabeledSentence(v, [label]))
    return labelized

x_train = labelizeTweets(x_train, 'TRAIN')
x_test = labelizeTweets(x_test, 'TEST')

20000it [00:00, 33551.53it/s]
5000it [00:00, 99336.48it/s]


In [76]:
# x_train[0]

In [77]:
n_dim = 200
tweet_w2v = Word2Vec(size=n_dim, min_count=10)
tweet_w2v.build_vocab([x.words for x in tqdm(x_train)])
tweet_w2v.train([x.words for x in tqdm(x_train)],total_examples=tweet_w2v.corpus_count, epochs=tweet_w2v.iter)

100%|██████████| 20000/20000 [00:00<00:00, 60375.37it/s]
100%|██████████| 20000/20000 [00:00<00:00, 625726.01it/s]


(18963780, 28502955)
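
For intuition about what the model learns from, here is a small sketch of how (target, context) training pairs can be generated from a token list. It illustrates the skip-gram formulation with a hypothetical window of 1; as far as I know, the `Word2Vec` call above uses gensim's defaults (CBOW and a wider window), so this is not the exact procedure run in the cell.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "film", "was", "great"], window=1)
print(pairs)
```

Words that keep appearing with the same neighbours end up with similar vectors, which is the property the classifier relies on later.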

In [78]:
# tweet_w2v['good']

In [79]:
# tweet_w2v.most_similar('happy')

In [80]:
# tweet_w2v.most_similar('bar')

# Building a sentiment classifier

In [82]:
print('building tf-idf matrix ...')
# the reviews are already tokenized, so the analyzer just passes the token list through
vectorizer = TfidfVectorizer(analyzer=lambda x: x, min_df=10)
matrix = vectorizer.fit_transform([x.words for x in x_train])
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))  # token -> idf weight
print('vocab size : %d' % len(tfidf))

building tf-idf matrix ...
vocab size : 17221
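
The `tfidf` dictionary maps each token to its inverse document frequency. As a rough sketch of where those numbers come from, the function below computes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, which is my understanding of scikit-learn's default formula. The toy documents are invented, and `min_df` filtering is left out.

```python
import math

def smoothed_idf(docs):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1 for each term t."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {t: math.log((1.0 + n) / (1.0 + d)) + 1.0 for t, d in df.items()}

# Hypothetical tokenized reviews.
toy_docs = [["great", "film"], ["bad", "film"], ["great", "cast"]]
idf = smoothed_idf(toy_docs)

# Rarer terms receive a larger weight than common ones.
print(idf["film"] < idf["cast"])
```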


In [83]:
def buildWordVector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += tweet_w2v[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError: # token missing from the word2vec vocabulary or the
                         # tf-idf dictionary (e.g. too rare); skip it.
            continue
    if count != 0:
        vec /= count
    return vec
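
To see what `buildWordVector` computes, here is a dependency-free sketch of the same idea: each known token's vector is multiplied by its weight, the results are summed, and the sum is divided by the number of tokens that were found. The two-dimensional vectors and the weights are toy values, and the name `weighted_average` is hypothetical.

```python
def weighted_average(tokens, vectors, weights, size):
    """Average of weight * vector over tokens present in both lookups."""
    total = [0.0] * size
    count = 0
    for word in tokens:
        if word not in vectors or word not in weights:
            continue  # skip out-of-vocabulary tokens
        total = [t + weights[word] * x for t, x in zip(total, vectors[word])]
        count += 1
    if count:
        total = [t / count for t in total]
    return total

# Hypothetical embeddings and idf-style weights.
toy_vectors = {"great": [1.0, 0.0], "film": [0.0, 1.0]}
toy_weights = {"great": 2.0, "film": 1.0}
doc_vec = weighted_average(["great", "film", "unseen"], toy_vectors, toy_weights, 2)
print(doc_vec)
```

A document with no known tokens comes back as the zero vector, which matches the behaviour of the function above.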

In [84]:
from sklearn.preprocessing import scale
train_vecs_w2v = np.concatenate([buildWordVector(z, n_dim) for z in tqdm(map(lambda x: x.words, x_train))])
train_vecs_w2v = scale(train_vecs_w2v)

test_vecs_w2v = np.concatenate([buildWordVector(z, n_dim) for z in tqdm(map(lambda x: x.words, x_test))])
test_vecs_w2v = scale(test_vecs_w2v)

  
100%|██████████| 20000/20000 [02:16<00:00, 146.51it/s]
100%|██████████| 5000/5000 [00:33<00:00, 148.23it/s]
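
`scale` standardizes each column to zero mean and unit variance, and the cell above applies it to the train and test matrices separately, so each set uses its own statistics. A common alternative, which is not what this notebook does, is to compute the statistics on the training set only and reuse them for the test set. A minimal sketch of that alternative in plain Python, with toy numbers and hypothetical function names:

```python
import math

def fit_standardizer(rows):
    """Return per-column means and population standard deviations."""
    n = float(len(rows))
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n)
            for col, m in zip(columns, means)]
    return means, stds

def apply_standardizer(rows, means, stds):
    """Standardize rows with previously fitted statistics."""
    return [[(x - m) / s if s else 0.0 for x, m, s in zip(row, means, stds)]
            for row in rows]

toy_train = [[1.0, 10.0], [3.0, 30.0]]
toy_test = [[2.0, 40.0]]

means, stds = fit_standardizer(toy_train)            # statistics from train only
train_scaled = apply_standardizer(toy_train, means, stds)
test_scaled = apply_standardizer(toy_test, means, stds)
print(train_scaled)
print(test_scaled)
```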


In [190]:
import tensorflow as tf


def f2_score(y_true, y_pred):
    # Cast to float32 rather than int32: tf.is_nan below only accepts
    # floating-point tensors.
    y_true = tf.cast(y_true, "float32")
    y_pred = tf.cast(tf.round(y_pred), "float32") # implicit 0.5 threshold via tf.round
    y_correct = y_true * y_pred
    sum_true = tf.reduce_sum(y_true, axis=1)
    sum_pred = tf.reduce_sum(y_pred, axis=1)
    sum_correct = tf.reduce_sum(y_correct, axis=1)
    precision = sum_correct / sum_pred
    recall = sum_correct / sum_true
    f_score = 5 * precision * recall / (4 * precision + recall)
    # 0/0 gives NaN when a row has no positives; count those rows as 0
    f_score = tf.where(tf.is_nan(f_score), tf.zeros_like(f_score), f_score)
    return tf.reduce_mean(f_score)

In [191]:
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=n_dim))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=[f2_score])

model.fit(train_vecs_w2v, y_train, epochs=9, batch_size=32, verbose=2)
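
The expression `5 * precision * recall / (4 * precision + recall)` in `f2_score` is the F-beta score, `(1 + beta^2) * P * R / (beta^2 * P + R)`, with `beta = 2`, which weights recall more heavily than precision. The sketch below computes the same quantity from aggregate counts in plain Python. The counts are hypothetical, and it works on totals rather than per row as the Keras metric does.

```python
def fbeta(tp, fp, fn, beta=2.0):
    """F-beta from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        return 0.0  # precision and recall are both 0 (or undefined)
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
score = fbeta(tp=8, fp=2, fn=4)
print(round(score, 4))
```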

# Saving model

In [86]:
model.save("md1")

In [87]:
score = model.evaluate(test_vecs_w2v, y_test, batch_size=128, verbose=2)
print(score[1])

0.8428


# Working on test data

In [88]:
test = postprocess(test_data)

progress-bar: 100%|██████████| 25000/25000 [02:23<00:00, 174.79it/s]


In [89]:
test.head()

Unnamed: 0,id,review,tokens
0,"""12311_10""","""Naturally in a film who's main themes are of ...","[``, naturally, in, a, film, who, 's, main, th..."
1,"""8348_2""","""This movie is a disaster within a disaster fi...","[``, this, movie, is, a, disaster, within, a, ..."
2,"""5828_4""","""All in all, this is a movie for kids. We saw ...","[``, all, in, all, ,, this, is, a, movie, for,..."
3,"""7186_2""","""Afraid of the Dark left me with the impressio...","[``, afraid, of, the, dark, left, me, with, th..."
4,"""12128_7""","""A very accurate depiction of small time mob l...","[``, a, very, accurate, depiction, of, small, ..."


In [90]:
n_test = np.array(test.head(n).tokens)

In [91]:
# n_test[1]

In [92]:
test1 = labelizeTweets(n_test, 'TEST')

25000it [01:13, 342.45it/s]  


In [93]:
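n_dim = 200
# NOTE: this fits a fresh word2vec model on the test reviews and rebinds `tweet_w2v`.
# The classifier was trained on vectors from the model fitted on the training split,
# so reusing that earlier model here would keep both feature sets in the same space.
tweet_w2v = Word2Vec(size=n_dim, min_count=10)
tweet_w2v.build_vocab([x.words for x in tqdm(test1)])
tweet_w2v.train([x.words for x in tqdm(test1)],total_examples=tweet_w2v.corpus_count, epochs=tweet_w2v.iter)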

100%|██████████| 25000/25000 [00:00<00:00, 490773.10it/s]
100%|██████████| 25000/25000 [00:00<00:00, 638017.88it/s]


(23337567, 34910265)

In [94]:
print('building tf-idf matrix ...')
# this rebinds `vectorizer` and `tfidf`, now fitted on the test reviews
vectorizer = TfidfVectorizer(analyzer=lambda x: x, min_df=10)
matrix = vectorizer.fit_transform([x.words for x in test1])
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
print('vocab size : %d' % len(tfidf))

building tf-idf matrix ...
vocab size : 19293


In [95]:
from sklearn.preprocessing import scale
test_w2v = np.concatenate([buildWordVector(z, n_dim) for z in tqdm(map(lambda x: x.words, test1))])
test_w2v = scale(test_w2v)

  
100%|██████████| 25000/25000 [02:47<00:00, 149.37it/s]


In [105]:
# score = model.evaluate(test_w2v, y_test, batch_size=128, verbose=2)
# print score[1]

In [98]:
np.argmax(model.predict_proba(test_w2v))

24916

In [99]:
final_preds=model.predict(test_w2v)

In [110]:
final_preds.shape

(25000, 1)

In [101]:
sample_data = pd.read_csv("kaggle/sampleSubmission.csv")
sample_data.head(3)

Unnamed: 0,id,sentiment
0,12311_10,0
1,8348_2,0
2,5828_4,0


In [116]:
test_data['sentiment1'] = final_preds

In [117]:
test_data.head()

Unnamed: 0,id,sentiment,sentiment1
0,"""12311_10""",0.941234,0.941234
1,"""8348_2""",0.000559,0.000559
2,"""5828_4""",0.986575,0.986575
3,"""7186_2""",0.876342,0.876342
4,"""12128_7""",0.823426,0.823426


In [118]:
test_data.drop(["sentiment"], axis=1, inplace=True)

In [119]:
test_data.head()

Unnamed: 0,id,sentiment1
0,"""12311_10""",0.941234
1,"""8348_2""",0.000559
2,"""5828_4""",0.986575
3,"""7186_2""",0.876342
4,"""12128_7""",0.823426


In [153]:
mask = test_data.sentiment1 > 0.5
column_name = 'my_channel'
test_data.loc[mask, column_name] = 0

In [162]:
test_data['my_channel'] = test_data['my_channel'].replace("NaN",1)

In [163]:
test_data.head()

Unnamed: 0,id,sentiment1,new,my_channel
0,"""12311_10""",0.941234,1,0.0
1,"""8348_2""",0.000559,1,
2,"""5828_4""",0.986575,1,0.0
3,"""7186_2""",0.876342,1,0.0
4,"""12128_7""",0.823426,1,0.0


In [164]:
# fill the remaining NaN values (rows where sentiment1 <= 0.5) with 1
test_data['my_channel'].fillna("1", inplace = True)

In [166]:
test_data.head()
# here 1 represents a negative review and 0 a positive one; note that this is
# the reverse of the training labels, where sentiment 1 marks a positive review

Unnamed: 0,id,sentiment1,new,my_channel
0,"""12311_10""",0.941234,1,0
1,"""8348_2""",0.000559,1,1
2,"""5828_4""",0.986575,1,0
3,"""7186_2""",0.876342,1,0
4,"""12128_7""",0.823426,1,0
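
The cells above turn the predicted probabilities into class labels with a 0.5 cut-off, mapping values above the threshold to 0. For comparison, here is a sketch of the opposite convention, where a probability above the threshold becomes 1. This assumes the sigmoid output estimates the probability of class 1, as it would for a model trained on the `sentiment` column. The helper name `to_labels` is hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 if it exceeds `threshold`, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]

# The first five predicted probabilities shown in the tables above.
probs = [0.941234, 0.000559, 0.986575, 0.876342, 0.823426]
labels = to_labels(probs)
print(labels)
```

Doing the mapping in one step also avoids the intermediate NaN values that the mask-then-fill approach produces.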


In [167]:
test_data.drop(['sentiment1', 'new'], axis=1, inplace=True)

In [168]:
test_data.head()

Unnamed: 0,id,my_channel
0,"""12311_10""",0
1,"""8348_2""",1
2,"""5828_4""",0
3,"""7186_2""",0
4,"""12128_7""",0


In [169]:
test_data.rename(columns={'my_channel': 'sentiment'}, inplace=True)

In [170]:
test_data.head()

Unnamed: 0,id,sentiment
0,"""12311_10""",0
1,"""8348_2""",1
2,"""5828_4""",0
3,"""7186_2""",0
4,"""12128_7""",0


In [171]:
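# index=False leaves out the row index; quoting=3 (QUOTE_NONE) writes the ids
# exactly as they were read, without adding another layer of quotes
test_data.to_csv("sampleSubmission.csv", index=False, quoting=3)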

# F-1 score

In [186]:
print (metrics.val_f1s)

AttributeError: 'Metrics' object has no attribute 'val_f1s'