# Assignment 3: Word2Vec

In this assignment, we will see how Word2Vec (or any similar word embedding) lets us exploit information from unlabelled data to classify better!

You will be using the sentiment data from last week, either the yelps or movies, whichever you wish. 

Your goal will be to simulate the following situation: you have a **small** set of labelled data and a large set of unlabelled data. Show how the following two techniques compare as the amount of labelled data increases. You should train them on the small labelled subset and test their performance on the rest of the data.

In other words, train on 1k, test on 99k. Then train on 2k, test on 98k. Then train on 4k, test on 96k. Etc.
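The doubling schedule above can be sketched in a few lines. This is a minimal illustration, assuming a pool of 100,000 documents (the 1k/99k figures above); `split_schedule` is a hypothetical helper that only computes the sizes, not the actual split.

```python
def split_schedule(n_total, start=1000):
    """Yield (train_size, test_size) pairs, doubling the train size each step.

    Hypothetical helper: stops before the training set would take up
    the whole pool, so there is always something left to test on.
    """
    size = start
    while size < n_total:
        yield size, n_total - size
        size *= 2


# Assumed pool of 100,000 documents, as in the 1k/99k example.
schedule = list(split_schedule(100_000))
for train_size, test_size in schedule:
    print(f"train on {train_size:>6}, test on {test_size:>6}")
```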

1. Logistic regression trained on labelled data, documents represented as term-frequency matrix of your choice. You can learn the vocabulary from the entire dataset or only the labelled data.

2. Logistic regression trained on the labelled data, documents represented as word2vec vectors where you train word2vec using the entire dataset. Play around with different settings of word2vec (training window size, K-negative, skip-gram vs. CBOW, etc.). Note: we didn't go over the options in detail in class, so you will need to read about them a bit!

You can read about the gensim word2vec implementation [here](https://radimrehurek.com/gensim/models/word2vec.html).
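Two of the options mentioned above, the window size and skip-gram vs. CBOW, determine which training examples word2vec sees. The sketch below is a simplified illustration, not gensim's implementation: real implementations add details such as negative sampling and subsampling of frequent words, which are left out here. The helper names `context_window`, `skipgram_pairs` and `cbow_examples` are hypothetical.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_pairs(tokens, window):
    """Skip-gram: one (center, context) example per context word."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]


def cbow_examples(tokens, window):
    """CBOW: one (context words, center) example per position."""
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]


tokens = "the food was great".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

A larger window produces more skip-gram pairs per sentence and wider CBOW contexts, which is one reason the window size affects both training time and what the vectors capture.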

In [1]:
import re
import spacy
import pandas as pd
import numpy as np
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
import multiprocessing
from gensim.models import Word2Vec

cores = multiprocessing.cpu_count() # Count the number of cores in a computer
SEED = 1234


## 1 - Logistic regression with term-frequency matrix

For the first model I chose tf-idf vectors. As expected, accuracy improves as the training size increases, from 0.795 with 1,000 labelled reviews to 0.820 with 64,000, although the model already does a reasonable job on the smallest sample. The results are summarized in a dataframe at the end of this section.
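As a reminder of what a tf-idf weight is, here is a small pure-Python sketch. It assumes the smoothed idf variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation, which is scikit-learn's default for `TfidfVectorizer`. The helper name `tfidf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n = len(docs)
    # document frequency: number of docs containing each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # guard: an empty document has norm 0 and yields an empty dict
        weighted.append({t: w / norm for t, w in raw.items()} if norm else {})
    return weighted


docs = [["good", "food"], ["bad", "food"], ["good", "service"]]
weights = tfidf(docs)
# "food" appears in two of three docs, so it weighs less than "bad" in doc 1
print(weights[1])
```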

In [2]:
# define preprocessing and vectorizer
# inside a character class '(', '|' and ')' are literals, so list \w and \s directly
not_alphanumeric_or_space = re.compile(r'[^\w\s]')
nlp = spacy.load('en_core_web_sm')

def preprocess(doc):
    doc = re.sub(not_alphanumeric_or_space, '', doc)
    words = [t.lemma_ for t in nlp(doc) if t.lemma_ != '-PRON-']
    return ' '.join(words).lower()

vectorizer = TfidfVectorizer(min_df=.1, 
                             max_df=.7, 
                             max_features=300,
                             preprocessor=preprocess,
                             use_idf=True, 
                             stop_words='english')
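A note on the punctuation-stripping pattern: inside a regex character class, `(`, `|` and `)` are ordinary characters rather than grouping or alternation. The quick check below, using only the standard `re` module and a made-up sentence, shows that a class written as `[^(\w|\s|\d)]` leaves parentheses and pipes in the text, while `[^\w\s]` removes them.

```python
import re

grouped = re.compile(r'[^(\w|\s|\d)]')   # '(', '|', ')' are literals here
plain = re.compile(r'[^\w\s]')           # \d is already covered by \w

text = "Great (really great) food | 5 stars!!"
print(grouped.sub('', text))
print(plain.sub('', text))
```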

In [5]:
# load data 
yelps = pd.read_csv('sentiment/yelps.csv').sample(frac=1., random_state=SEED) # seeded shuffle for reproducibility
yelps = yelps[yelps.positive != 'positive'].copy()

In [6]:
train_size = [1000,2000,4000,8000,16000,32000,64000]
lr_acc = []

for size in train_size:
    
    X_train, X_test, y_train, y_test = train_test_split(yelps['text'],
                                                        yelps['positive'],
                                                        train_size=size, 
                                                        random_state=SEED)

    X_train_m = vectorizer.fit_transform(X_train)
    X_test_m = vectorizer.transform(X_test)

    lr = LogisticRegression(solver="lbfgs")
    lr.fit(X_train_m,y_train)
    y_hat = lr.predict(X_test_m)

    # accuracy_score expects (y_true, y_pred); compute it once and reuse it
    acc = accuracy_score(y_test, y_hat)
    print("Accuracy on {} entries: {}".format(size, acc))
    lr_acc.append((size, acc))

cols=['train_size','accuracy']
result_tf = pd.DataFrame(lr_acc, columns=cols)

Accuracy on 1000 entries: 0.7948282828282828
Accuracy on 2000 entries: 0.809265306122449
Accuracy on 4000 entries: 0.8090833333333334
Accuracy on 8000 entries: 0.8130326086956522
Accuracy on 16000 entries: 0.8142142857142857
Accuracy on 32000 entries: 0.8150735294117647
Accuracy on 64000 entries: 0.8195555555555556


In [7]:
result_tf.sort_values('accuracy', ascending=False)

| | train_size | accuracy |
|---|---|---|
| 6 | 64000 | 0.819556 |
| 5 | 32000 | 0.815074 |
| 4 | 16000 | 0.814214 |
| 3 | 8000 | 0.813033 |
| 1 | 2000 | 0.809265 |
| 2 | 4000 | 0.809083 |
| 0 | 1000 | 0.794828 |


## 2 - Logistic regression with word2vec

For the second part I trained word2vec embeddings on the full corpus and used the averaged word vectors of each document as input to the logistic regression. Because it ran fairly fast, I experimented with a few different parameters: vector size, window size, k-negative and skip-gram vs. CBOW. The results are also summarised in a dataframe at the end of this section. Accuracy was higher with the larger vector size (all of the top ten runs use 100), with more training data and, in most of the top runs, with skip-gram; the best run reached 0.954.

In [66]:
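# functions to convert docs to vectors using a trained model
def doc2vec(doc, w2vmodel):
    # use the model passed in as argument, not the global w2v_model
    vectors = np.array([w2vmodel.wv[w] for w in doc])
    # note: an empty doc gives an empty array, whose mean is nan
    return np.nanmean(vectors,axis=0)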

def corpus2vec(doc_split, w2vmodel):
    vectors = np.empty((len(doc_split),len(doc2vec(doc_split[0], w2vmodel))))
    for i in range(len(doc_split)):
        vectors[i] = doc2vec(doc_split[i], w2vmodel)
    return vectors

# define and train model
def train_model(docs, size = 20, window_size = 5, sg = 0, k_neg = 5):
    
    w2v_model = Word2Vec(workers=cores-1,
                         min_count=0,
                         size= size, # vector size (gensim < 4.0; renamed vector_size in gensim 4.0+)
                         window= window_size,
                         sg = sg,
                         negative= k_neg) #define model
    
    w2v_model.build_vocab(docs, progress_per=100) #build vocab
    w2v_model.train(docs, total_examples=w2v_model.corpus_count, epochs=20, word_count=0) #train embeddings
    
    return w2v_model
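The averaging done in `doc2vec` can be tried on a toy lookup table. The sketch below is an illustration under assumptions, not the notebook's code: `toy_vectors` stands in for `w2v_model.wv` with made-up 3-dimensional vectors, and `mean_vector` is a hypothetical variant that skips unknown words and returns zeros for an empty document instead of nan.

```python
import numpy as np

# Stand-in for a trained embedding table (made-up 3-dimensional vectors).
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "food": np.array([3.0, 2.0, 0.0]),
}


def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


print(mean_vector(["good", "food"], toy_vectors, dim=3))
print(mean_vector([], toy_vectors, dim=3))
```

Handling the empty case inside the function is one possible design; replacing nan values afterwards gives the same document vectors for empty documents.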

In [9]:
# preprocess corpus
yelps_df = [preprocess(doc) for doc in yelps.text]
doc_split = [doc.split() for doc in yelps_df]

In [None]:
import warnings
warnings.filterwarnings(action='once')

# set param grid_search
window_size = [2,5,8]
k_negative = [1,7]
sg = [0,1] #bow, skip-grams
train_size = [1000,4000,16000,32000,64000]
size = [10,100] #vector size
store = []

# run model for different parameters and store results
for sz in size:
    for wsize in window_size:
        for k in k_negative:
            for s in sg:

                # the embeddings do not depend on the labelled train size,
                # so train and embed once per setting, outside the train-size loop
                w2v_model = train_model(doc_split,
                                        size = sz,
                                        window_size = wsize,
                                        sg = s,
                                        k_neg = k) #train model

                docs_embed = corpus2vec(doc_split, w2v_model) # transform docs into vectors

                # there were very few vectors with nan values (0.03%), most likely docs that are
                # empty after preprocessing (np.nanmean of an empty array is nan);
                # for this exercise i just replace them with zeros
                docs_embed = np.nan_to_num(docs_embed, nan=0.0)

                for tsize in train_size:

                    X_train, X_test, y_train, y_test = train_test_split(docs_embed,
                                                                        yelps['positive'],
                                                                        train_size=tsize,
                                                                        random_state=1)

                    lr = LogisticRegression(solver="lbfgs")
                    lr.fit(X_train, y_train)
                    y_hat = lr.predict(X_test)
                    store.append((sz, wsize, k, s, tsize, accuracy_score(y_test, y_hat)))

cols=['vec_size','window_size','k_negative','skip-grams','train_size','accuracy']
result_w2v = pd.DataFrame(store, columns=cols)

In [76]:
pd.set_option("display.max_rows", 200)
result_w2v.sort_values('accuracy', ascending=False)

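| | vec_size | window_size | k_negative | skip-grams | train_size | accuracy |
|---|---|---|---|---|---|---|
| 109 | 100 | 8 | 1 | 1 | 64000 | 0.953528 |
| 108 | 100 | 8 | 1 | 1 | 32000 | 0.952824 |
| 118 | 100 | 8 | 7 | 1 | 32000 | 0.951397 |
| 88 | 100 | 5 | 1 | 1 | 32000 | 0.951206 |
| 89 | 100 | 5 | 1 | 1 | 64000 | 0.951167 |
| 107 | 100 | 8 | 1 | 1 | 16000 | 0.950571 |
| 119 | 100 | 8 | 7 | 1 | 64000 | 0.950528 |
| 103 | 100 | 8 | 1 | 0 | 32000 | 0.950397 |
| 104 | 100 | 8 | 1 | 0 | 64000 | 0.950361 |
| 114 | 100 | 8 | 7 | 0 | 64000 | 0.950194 |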
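
To put the two sections side by side, the sketch below compares the tf-idf accuracy with the best word2vec accuracy at each training size. The numbers are copied from the two result tables (rounded to four decimals); only the three largest training sizes appear among the word2vec rows shown, so smaller sizes are left out. The variable names are illustrative, and the two experiments use different `random_state` values, so the splits are not identical.

```python
# tf-idf accuracies from section 1
tfidf_acc = {16000: 0.8142, 32000: 0.8151, 64000: 0.8196}

# (train_size, accuracy) for the word2vec rows shown in section 2
w2v_rows = [
    (64000, 0.9535), (32000, 0.9528), (32000, 0.9514), (32000, 0.9512),
    (64000, 0.9512), (16000, 0.9506), (64000, 0.9505), (32000, 0.9504),
    (64000, 0.9504), (64000, 0.9502),
]

# keep the best word2vec accuracy per training size
best_w2v = {}
for size, acc in w2v_rows:
    best_w2v[size] = max(acc, best_w2v.get(size, 0.0))

for size in sorted(tfidf_acc):
    gain = best_w2v[size] - tfidf_acc[size]
    print(f"{size:>6}: tf-idf {tfidf_acc[size]:.4f}, "
          f"word2vec {best_w2v[size]:.4f}, gain {gain:+.4f}")
```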
