In this part, we'll try out the most straightforward way to represent a document as a fixed length vector

* CountVectorizer: length = size of vocabulary, entry = the number of occurrence of words with the given ID/index
* TfidfVectorizer: Term-frequency multiplied by inverse document frequency
* HashingVectorizer: Convert document into vector using hashing function

We'll also pipeline the vectorizer with two standardization methods:
* StandardScaler: normalize data with zero mean and unit standar deviation
* Normalizer: normalize each data point with unit l2 norm

Finally we'll try reducing the dimensionality using topic modeling approach Latent Semantic Index (LSI, aka LSA)

In [1]:
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import Normalizer

Load the processed data

In [2]:
train = pd.read_csv("labeledTrainData.tsv",
                    delimiter="\t",
                    header=0,
                    quoting=3)

train_sents = pickle.load(open("train_sents.pickle"))
y = train["sentiment"].values

Initialize vectorizers as transformer

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

count_vectorizer = CountVectorizer(analyzer="word", stop_words="english", max_features=200) 
tfidf_vectorizer = TfidfVectorizer(analyzer='word', stop_words="english", max_features=200)
hash_vectorizer = HashingVectorizer(analyzer='word', stop_words="english", n_features=200)

vectorizers = [count_vectorizer, tfidf_vectorizer, hash_vectorizer]
standardizers = [StandardScaler(), Normalizer()]

We'll use a simple linear model LogisticRegression to test out the quality of the vector representation in a binary classification task.

In [4]:
for vectorizer, v_name in zip(vectorizers, ["count", "tfidf", "hash"]):
    for standardizer, s_name in zip(standardizers, ["standarscaler", "normalizer"]):  
        X = vectorizer.fit_transform(train_sents).toarray().astype(np.float64)
        X = standardizer.fit_transform(X)
        lr = LogisticRegression(random_state=42)
    
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
        print "vectorizer = %s, standardizer = %s, accuracy = %f\n" % (v_name, s_name, lr.fit(X_train, y_train).score(X_test, y_test))

vectorizer = count, standardizer = standarscaler, accuracy = 0.788000

vectorizer = count, standardizer = normalizer, accuracy = 0.788267

vectorizer = tfidf, standardizer = standarscaler, accuracy = 0.788133

vectorizer = tfidf, standardizer = normalizer, accuracy = 0.788133

vectorizer = hash, standardizer = standarscaler, accuracy = 0.721600

vectorizer = hash, standardizer = normalizer, accuracy = 0.722933



Increase the size of the feature space from 200 to 2048:

In [5]:
vectorizers[0].max_features = 2048
vectorizers[1].max_features = 2048
vectorizers[2].max_features = 2048

for vectorizer, v_name in zip(vectorizers, ["count", "tfidf", "hash"]):
    for standardizer, s_name in zip(standardizers, ["standarscaler", "normalizer"]): 
        X = vectorizer.fit_transform(train_sents).toarray().astype(np.float64)
        X = standardizer.fit_transform(X)
        lr = LogisticRegression(random_state=42)
    
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
        print "vectorizer = %s, standardizer = %s, accuracy = %f\n" % (v_name, s_name, lr.fit(X_train, y_train).score(X_test, y_test))

vectorizer = count, standardizer = standarscaler, accuracy = 0.844667

vectorizer = count, standardizer = normalizer, accuracy = 0.865333

vectorizer = tfidf, standardizer = standarscaler, accuracy = 0.852400

vectorizer = tfidf, standardizer = normalizer, accuracy = 0.871067

vectorizer = hash, standardizer = standarscaler, accuracy = 0.721600

vectorizer = hash, standardizer = normalizer, accuracy = 0.722933



It appears that Tfidf has a slight edge over the other two vectorizers, and normatlizing vector with unit l2 norm always resulted in improved performance over the standard-scaler (0.0 mean, 1.0 std) 

Finally we'll try out the LSI method that represent documents as vectors in topic space. It's implemented as applying Singular Value Decomposition upon the bag of words representation of documents (with or without tfidf transformation). We could alternatively try the gensim implementation https://radimrehurek.com/gensim/models/lsimodel.html

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD

n_topics = [32, 64, 128, 256, 512, 1024]
lsi = Pipeline([("tfidf", TfidfVectorizer(analyzer='word', stop_words="english", max_features=4096)),
                    ("svd", TruncatedSVD(n_components=100)), 
                    ("normalizer", Normalizer())])

for n in n_topics:
    lsi.set_params(svd__n_components=n)
    X = lsi.fit_transform(train_sents)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    lr.fit(X_train, y_train).score(X_test, y_test)
    
    print "Num of Topics = %d, accuracy = %f" % (n, lr.fit(X_train, y_train).score(X_test, y_test))

Num of Topics = 32, accuracy = 0.830800
Num of Topics = 64, accuracy = 0.844267
Num of Topics = 128, accuracy = 0.862267
Num of Topics = 256, accuracy = 0.872133
Num of Topics = 512, accuracy = 0.873467
Num of Topics = 1024, accuracy = 0.875600


The accuracy increases with the number of topics, which suggests that more topics make the vector representation more expressive (but at the cost of increased computation)