In this part, we'll try out the most straightforward ways to represent a document as a fixed-length vector:

* CountVectorizer: vector length = vocabulary size; each entry is the number of times the word with that ID/index occurs in the document
* TfidfVectorizer: term frequency multiplied by inverse document frequency
* HashingVectorizer: converts a document into a vector by applying a hashing function to its tokens
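
To make the three representations concrete, here is a minimal from-scratch sketch on a toy corpus. It is a simplification and not scikit-learn's implementation: the unsmoothed `log(N / df)` weighting, the use of `zlib.crc32` as the hash, the 8-slot width, and all names (`docs`, `count_vector`, `tfidf_vector`, `hash_vector`) are assumptions chosen for illustration.

```python
import math
import zlib
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the movie was great great fun",
    "the movie was dull",
    "great cast dull plot",
]
tokenized = [d.split() for d in docs]

# Count vector: one slot per vocabulary word, holding its count in the document.
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}

def count_vector(tokens):
    vec = [0] * len(vocab)
    for w, c in Counter(tokens).items():
        vec[index[w]] = c
    return vec

# TF-IDF: scale each count by log(N / df), so words that appear in many
# documents contribute less than words that appear in few.
n_docs = len(tokenized)
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf_vector(tokens):
    counts = count_vector(tokens)
    return [c * math.log(float(n_docs) / df[w]) for w, c in zip(vocab, counts)]

# Hashing trick: no vocabulary is stored; a hash of the word picks the slot.
def hash_vector(tokens, n_features=8):
    vec = [0] * n_features
    for w in tokens:
        vec[zlib.crc32(w.encode("utf-8")) % n_features] += 1
    return vec

counts = count_vector(tokenized[0])
tfidf = tfidf_vector(tokenized[0])
hashed = hash_vector(tokenized[0])
```

The count and TF-IDF vectors need the vocabulary to be built first, whereas the hashed vector has a fixed width chosen up front, at the cost of unrelated words sometimes sharing a slot.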

In [1]:
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

Load the processed data

In [2]:
train = pd.read_csv("labeledTrainData.tsv",
                    delimiter="\t",
                    header=0,
                    quoting=3)

# Pickle files must be opened in binary mode
with open("train_sents.pickle", "rb") as f:
    train_sents = pickle.load(f)

Initialize the vectorizers as transformers

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

count_vectorizer = CountVectorizer(analyzer="word", stop_words="english", max_features=200)
tfidf_vectorizer = TfidfVectorizer(analyzer="word", stop_words="english", max_features=200)
hash_vectorizer = HashingVectorizer(analyzer="word", stop_words="english", n_features=200)

vectorizers = [count_vectorizer, tfidf_vectorizer, hash_vectorizer]

We'll use a simple linear model, LogisticRegression, to test the quality of each vector representation on a binary classification task.

In [5]:
y = train["sentiment"].values
for vectorizer, name in zip(vectorizers, ["count", "tfidf", "hash"]):
    # Vectorize the documents, densify, and standardize each feature column
    X = vectorizer.fit_transform(train_sents).toarray().astype(np.float64)
    X = StandardScaler().fit_transform(X)
    lr = LogisticRegression(random_state=42)

    # Hold out 30% of the documents for evaluation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    accuracy = lr.fit(X_train, y_train).score(X_test, y_test)
    print("vectorizer = %s, accuracy = %f\n" % (name, accuracy))

vectorizer = count, accuracy = 0.788000

vectorizer = tfidf, accuracy = 0.788133

vectorizer = hash, accuracy = 0.721600



Increase the size of the feature space from 200 to 2056:

In [6]:
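vectorizers[0].max_features = 2056
vectorizers[1].max_features = 2056
# HashingVectorizer has no max_features parameter; its width is set by n_features.
# This assignment therefore leaves the hash vectorizer at 200 features, which is
# why its accuracy below is unchanged. To actually widen it, set instead:
#     vectorizers[2].n_features = 2056
vectorizers[2].max_features = 2056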

for vectorizer, name in zip(vectorizers, ["count", "tfidf", "hash"]):
    # Same pipeline as before, refit with the updated settings
    X = vectorizer.fit_transform(train_sents).toarray().astype(np.float64)
    X = StandardScaler().fit_transform(X)
    lr = LogisticRegression(random_state=42)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    accuracy = lr.fit(X_train, y_train).score(X_test, y_test)
    print("vectorizer = %s, accuracy = %f\n" % (name, accuracy))

vectorizer = count, accuracy = 0.844667

vectorizer = tfidf, accuracy = 0.852400

vectorizer = hash, accuracy = 0.721600
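
For the hashing representation, one way to see why the size of the feature space matters is to count how many distinct words end up sharing a slot. The sketch below is an illustration only: `zlib.crc32` stands in for the hash function, the vocabulary is synthetic, and `bucket` and `colliding_words` are hypothetical helper names, not part of scikit-learn.

```python
import zlib

def bucket(word, n_features):
    # Map a word to a slot index in a vector of width n_features
    return zlib.crc32(word.encode("utf-8")) % n_features

def colliding_words(words, n_features):
    """Count words that land in a slot already taken by an earlier word."""
    seen = set()
    collisions = 0
    for w in words:
        b = bucket(w, n_features)
        if b in seen:
            collisions += 1
        seen.add(b)
    return collisions

# Synthetic vocabulary of 1000 distinct words (assumption for illustration)
words = ["word%d" % i for i in range(1000)]

small = colliding_words(words, 200)    # narrow feature space
large = colliding_words(words, 2056)   # wider feature space
print("collisions with 200 slots: %d, with 2056 slots: %d" % (small, large))
```

With 1000 words and only 200 slots, at least 800 words must share a slot with another word, so many unrelated words become indistinguishable to the classifier; a wider vector leaves room for far fewer collisions.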

