Continued from Part 3 Word2Vec

This time we turn each review into a vector representation using the Doc2Vec algorithm (https://radimrehurek.com/gensim/models/doc2vec.html).

In [1]:
import pickle

import numpy as np
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, LabeledSentence

First, load the processed data from Part 1

In [10]:
train = pd.read_csv("labeledTrainData.tsv",
                    delimiter="\t",
                    header=0,
                    quoting=3)

unlabeled = pd.read_csv("unlabeledTrainData.tsv",
                        delimiter="\t",
                        header=0,
                        quoting=3)

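# Open the pickles in binary mode and close the files once loaded
with open("train_sents.pickle", "rb") as f:
    train_sents = pickle.load(f)
with open("unlabeled_sents.pickle", "rb") as f:
    unlabeled_sents = pickle.load(f)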

The corpus on which the Doc2Vec model is trained must be an iterable of **LabeledSentence** objects (i.e. a list of words plus a unique sentence ID):

In [11]:
class IMDBReview(object):
    def __init__(self, sents, prefix="TRAIN"):
        self.sents = sents
        self.prefix = prefix
    def __iter__(self):
        for index, sent in enumerate(self.sents):
            yield LabeledSentence(sent.split(), ["%s_%d" % (self.prefix, index)])
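
To make the shape of the yielded items concrete, here is a small sketch that does not need gensim. `Labeled` is a hypothetical stand-in namedtuple (not gensim's class), `ReviewCorpus` mirrors `IMDBReview` above, and the two toy reviews are made up. It also illustrates one benefit of a class with `__iter__` over a plain generator: the corpus can be iterated more than once.

```python
from collections import namedtuple

# Hypothetical stand-in for LabeledSentence: a word list plus a list of tags.
Labeled = namedtuple("Labeled", ["words", "tags"])


class ReviewCorpus(object):
    """Re-iterable corpus: every pass yields one labeled item per review."""

    def __init__(self, sents, prefix="TRAIN"):
        self.sents = sents
        self.prefix = prefix

    def __iter__(self):
        for index, sent in enumerate(self.sents):
            # The tag is the prefix plus the position of the review.
            yield Labeled(sent.split(), ["%s_%d" % (self.prefix, index)])


toy_reviews = ["a great movie", "not worth watching"]  # made-up examples
corpus = ReviewCorpus(toy_reviews)

first_pass = list(corpus)
second_pass = list(corpus)  # a plain generator would already be exhausted here

for item in first_pass:
    print(item.tags, item.words)
```

Each review gets a tag such as `TRAIN_0`, which is the key later used to look up its vector.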

Initialize the model and build the vocabulary:

In [14]:
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(IMDBReview(train_sents))

Note that Doc2Vec is unsupervised, so it can also be trained on the **unlabeled** reviews:

In [12]:
for epoch in range(10):
    print(epoch)
    corpus = IMDBReview(train_sents)
    model.train(corpus)
    corpus = IMDBReview(unlabeled_sents, prefix="UNLABELED")
    model.train(corpus)

X = np.array([model.docvecs["TRAIN_%d" % i] for i in range(len(train_sents))])
y = train["sentiment"].values

0
1
2
3
4
5
6
7
8
9


***train_sents*** has now been turned into a real-valued matrix ***X***, with one 100-dimensional row per review:

In [16]:
X.shape

(25000, 100)

In [15]:
X

array([[ 1.51470816, -0.14695954, -0.38569   , ...,  0.54711825,
        -0.38615489, -0.6846233 ],
       [ 0.58430964,  0.22183745, -1.08024156, ..., -0.14015758,
        -1.03678322, -0.08200566],
       [ 2.27951765, -0.63483417, -1.43968844, ...,  1.99514902,
        -1.36144769, -2.15728498],
       ..., 
       [ 0.61271161,  1.44076931, -0.78728318, ...,  0.67236358,
         0.55372548,  0.20077997],
       [-0.0238454 ,  0.18018231, -0.45349646, ..., -0.7789551 ,
         0.11597048,  0.1945295 ],
       [ 0.44540593,  0.58089393, -0.95667773, ...,  0.57978415,
         0.46543002,  0.42108217]], dtype=float32)

Let's check the quality of the vector representation using a simple linear model:

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
lr = LogisticRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lr.fit(X_train, y_train).score(X_test, y_test)

0.86173333333333335
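
A single 70/30 split gives only one accuracy estimate. As a rough sketch of how k-fold cross-validation would give a mean and a spread instead, the block below runs `cross_val_score` on synthetic data. `X_demo` and `y_demo` are made-up stand-ins (two Gaussian blobs), not the review vectors, so the printed numbers say nothing about the Doc2Vec features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)

# Synthetic stand-in for the document vectors: two shifted Gaussian blobs.
X_demo = np.vstack([
    rng.normal(-1.0, 1.0, size=(100, 5)),  # class 0
    rng.normal(1.0, 1.0, size=(100, 5)),   # class 1
])
y_demo = np.array([0] * 100 + [1] * 100)

# 5-fold cross-validation: one accuracy score per held-out fold.
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```

Replacing `X_demo` and `y_demo` with `X` and `y` would apply the same check to the review vectors.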

In [26]:
str(model)

'Doc2Vec(dm/m,d100,n5,w10,s0.0001,t8)'
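
The summary string can be read segment by segment. The sketch below is not part of gensim; the `names` mapping is an assumption inferred by matching each segment against the constructor arguments used above (`size=100`, `negative=5`, `window=10`, `sample=1e-4`, `workers=8`). The first segment, `dm/m`, presumably denotes the distributed-memory mode with averaged context vectors, which is the default.

```python
summary = "Doc2Vec(dm/m,d100,n5,w10,s0.0001,t8)"

# Strip the "Doc2Vec(" wrapper and the closing parenthesis, then split.
segments = summary[len("Doc2Vec("):-1].split(",")

# Assumed meaning of each one-letter prefix, inferred from the constructor call.
names = {"d": "size", "n": "negative", "w": "window", "s": "sample", "t": "workers"}

decoded = {"mode": segments[0]}
for seg in segments[1:]:
    # First character selects the parameter, the rest is its numeric value.
    decoded[names[seg[0]]] = float(seg[1:])

for key in sorted(decoded):
    print(key, decoded[key])
```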

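Now train a second model with **dm=1** and **dbow_words=1** set explicitly, and repeat the same evaluation:

In [27]: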
model = Doc2Vec(dm=1, dbow_words=1, min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(IMDBReview(train_sents))

In [28]:
for epoch in range(10):
    print(epoch)
    corpus = IMDBReview(train_sents)
    model.train(corpus)
    corpus = IMDBReview(unlabeled_sents, prefix="UNLABELED")
    model.train(corpus)

X = np.array([model.docvecs["TRAIN_%d" % i] for i in range(len(train_sents))])
y = train["sentiment"].values

0
1
2
3
4
5
6
7
8
9


In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
lr = LogisticRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lr.fit(X_train, y_train).score(X_test, y_test)

0.86160000000000003