# Assignment 3: Word2Vec

In this assignment, we will see how Word2Vec (or any similar word embedding) lets us use information from unlabelled data to classify better!

You will be using the sentiment data from last week, either the Yelp reviews or the movie reviews, whichever you wish.

Your goal will be to simulate the following situation: you have a **small** set of labelled data and a large set of unlabelled data. Show how the following two techniques compare as the amount of labelled data increases. You should train them on the small labelled subset and test their performance on the rest of the data.

In other words, train on 1k, test on 99k. Then train on 2k, test on 98k. Then train on 4k, test on 96k. Etc.
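
The doubling schedule described above can be sketched in a few lines. This is only an illustration: the helper name `doubling_schedule` is hypothetical, and the corpus size of 100,000 is an assumption taken from the "1k / 99k" example.

```python
def doubling_schedule(total, start=1000):
    """Return (train_size, test_size) pairs, doubling train_size each step."""
    schedule = []
    size = start
    while size < total:
        # everything that is not used for training is held out for testing
        schedule.append((size, total - size))
        size *= 2
    return schedule


# Assumed corpus size of 100,000 documents, matching the "1k / 99k" example.
for train_n, test_n in doubling_schedule(100_000):
    print(f"train on {train_n:>6}, test on {test_n:>6}")
```

The train sizes this produces can be passed directly as `train_size` to `train_test_split`.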

1. Logistic regression trained on the labelled data, documents represented as a term-frequency matrix of your choice. You can learn the vocabulary from the entire dataset or only the labelled data.

2. Logistic regression trained on the labelled data, documents represented as word2vec vectors where you train word2vec using the entire dataset. Play around with different settings of word2vec (training window size, K-negative, skip-gram vs CBOW, etc.). Note: we didn't go over the options in detail in class, so you will need to read about them a bit!
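
To make the window size and the skip-gram vs CBOW distinction concrete, here is a rough sketch of the training examples each variant builds from a tokenized sentence. The function names are hypothetical, and real implementations add details such as negative sampling and subsampling of frequent words that are left out here.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_pairs(tokens, window):
    """Skip-gram: one (center, context) example per neighbouring word."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]


def cbow_examples(tokens, window):
    """CBOW: one (context_words, center) example per position."""
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]


tokens = ["great", "sushi", "slow", "delivery"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram predicts each neighbour from the center word, while CBOW predicts the center word from all of its neighbours at once; a larger window yields more, and more distant, context words in both cases.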

You can read about the gensim word2vec implementation [here](https://radimrehurek.com/gensim/models/word2vec.html).
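
One way to organize the exploration is to enumerate the settings up front. The sketch below only builds the list of configurations; the keys mirror gensim's `Word2Vec` keyword arguments `window`, `negative` and `sg`, while the values and the helper name `expand` are illustrative assumptions.

```python
from itertools import product

# Illustrative values only; the keys mirror gensim's Word2Vec keyword arguments.
grid = {
    "window": [2, 5, 10],   # maximum distance between target and context word
    "negative": [5, 15],    # number of negative samples per positive example
    "sg": [0, 1],           # 0 = CBOW, 1 = skip-gram
}


def expand(grid):
    """Return one dict per combination of the values in `grid`."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


settings = expand(grid)
print(len(settings), "configurations")
print(settings[0])
```

Each dictionary could then be unpacked into the constructor, for example `Word2Vec(sentences, **params)`, and the downstream classifier accuracy recorded per configuration.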

In [1]:
import re
import spacy
import nltk
import pandas as pd
import numpy as np
from time import time
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from gensim.models import Word2Vec
from gensim.models.word2vec import FAST_VERSION
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")
import multiprocessing
cores = multiprocessing.cpu_count()
seed = 13

In [2]:
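# error_bad_lines skips malformed rows; pandas >= 1.3 spells this on_bad_lines='skip'
# (error_bad_lines was removed in pandas 2.0)
yelps = pd.read_csv('yelps.csv', error_bad_lines=False)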

## 1. Logistic regression, documents represented as term-frequency

In [None]:
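# preprocessing and vectorizer
# inside a character class, ( | ) are literals, so plain [^\w\s] is what is meant
not_alphanumeric_or_space = re.compile(r'[^\w\s]')
# 'en' is the spaCy 2.x shortcut; spaCy 3 needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')

def preprocess(doc):
    # strip punctuation, lemmatize, and drop spaCy 2.x pronoun placeholders
    doc = not_alphanumeric_or_space.sub('', doc)
    words = [t.lemma_ for t in nlp(doc) if t.lemma_ != '-PRON-']
    return ' '.join(words).lower()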

vectorizer = TfidfVectorizer(min_df=.1, 
                             max_df=.7, 
                             max_features=500,
                             preprocessor=preprocess,
                             use_idf=True, 
                             stop_words='english')



In [None]:
# load data 
yelps = pd.read_csv('yelps.csv', error_bad_lines=False).sample(frac=1.)

train_size = [1000,2000,4000,8000,16000,32000,64000]
lr_acc = []
for size in train_size:
    X_train, X_test, y_train, y_test = train_test_split(yelps['text'],
                                                        yelps['positive'],
                                                        train_size=size, 
                                                        random_state=seed)
    
    # fit the vocabulary on the labelled subset only, then reuse it for the test set
    X_train_m = vectorizer.fit_transform(X_train)
    X_test_m = vectorizer.transform(X_test)
    lr = LogisticRegression(solver="lbfgs")
    lr.fit(X_train_m, y_train)
    y_pred = lr.predict(X_test_m)
    acc = accuracy_score(y_test, y_pred)  # (y_true, y_pred) order
    print("Accuracy on {} entries: {}".format(size, acc))
    lr_acc.append((size, acc))

## 2. Logistic Regression using Word2Vec

In [74]:
yelps = pd.read_csv('yelps.csv', error_bad_lines=False)

In [75]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are reassigned later
yelps_test = yelps[:1000].copy()

In [76]:
yelps_test

Unnamed: 0,business_id,positive,text
0,usmGI198mrIsZXtTzkXa3A,True,We have a vacation rental in Las Vegas that we...
1,r1vcpe1gZ7XHg2sgveoQ8A,True,Went for a couple of gin cocktails and ended u...
2,qx6WhZ42eDKmBchZDax4dQ,True,Such a great place. You walk in and they treat...
3,2aIgbnGUg8VC0u9iXO-wnQ,True,Great sushi however do not order it for delive...
4,8cr7Kdx1bT51CnKsrWABbw,True,"In lieu of a birthday cake, my friend brought ..."
...,...,...,...
995,gsftQKOI-Kj2buIUI-YV5Q,False,I have been to Mirage Nails and Spa a few time...
996,-1xuC540Nycht_iWFeJ-dw,False,So disappointed in my experience today. I've b...
997,YRyYbOSwvHkZsZOLv98oQg,True,Went for lunch based on Yelp high ratings; and...
998,TSGBM2z5BTeJvYQAznz8Fg,False,We arrived for diner at 3:45 pm today. The hos...


In [77]:
not_alphanumeric_or_space = re.compile(r'[^\w\s]')

def cleanText(text):
    # drop punctuation, then lowercase and split on any whitespace
    text = not_alphanumeric_or_space.sub('', text)
    text = text.replace("\n", " ")  # a space keeps words on adjacent lines apart
    text = text.lower()
    text = text.split()
    return text


yelps_test['text'] = yelps_test['text'].apply(cleanText)
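
Two details of this kind of cleaning step are easy to miss. First, inside a character class the characters `(`, `|` and `)` are literals, so a pattern such as `[^(\w|\s|\d)]` keeps parentheses and pipes instead of removing them. Second, replacing `"\n"` with an empty string glues together the words on either side of a line break. The sample strings below are made up for illustration.

```python
import re

grouped = re.compile(r'[^(\w|\s|\d)]')  # parentheses and | are literals here
plain = re.compile(r'[^\w\s]')          # \d is already covered by \w

sample = "Great (really) | good!\nWould return."

kept_by_grouped = grouped.sub('', sample)
kept_by_plain = plain.sub('', sample)

print(repr(kept_by_grouped))   # parentheses and pipe survive
print(repr(kept_by_plain))     # only word characters and whitespace survive
print(kept_by_plain.lower().split())

# Deleting newlines outright merges the neighbouring words into one token.
joined = plain.sub('', "you're using.\nI called").replace("\n", "")
print(joined.lower().split())
```

Since `str.split()` already splits on newlines, the newline replacement can either insert a space or be dropped entirely.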

In [78]:
yelps_test

Unnamed: 0,business_id,positive,text
0,usmGI198mrIsZXtTzkXa3A,True,"[we, have, a, vacation, rental, in, las, vegas..."
1,r1vcpe1gZ7XHg2sgveoQ8A,True,"[went, for, a, couple, of, gin, cocktails, and..."
2,qx6WhZ42eDKmBchZDax4dQ,True,"[such, a, great, place, you, walk, in, and, th..."
3,2aIgbnGUg8VC0u9iXO-wnQ,True,"[great, sushi, however, do, not, order, it, fo..."
4,8cr7Kdx1bT51CnKsrWABbw,True,"[in, lieu, of, a, birthday, cake, my, friend, ..."
...,...,...,...
995,gsftQKOI-Kj2buIUI-YV5Q,False,"[i, have, been, to, mirage, nails, and, spa, a..."
996,-1xuC540Nycht_iWFeJ-dw,False,"[so, disappointed, in, my, experience, today, ..."
997,YRyYbOSwvHkZsZOLv98oQg,True,"[went, for, lunch, based, on, yelp, high, rati..."
998,TSGBM2z5BTeJvYQAznz8Fg,False,"[we, arrived, for, diner, at, 345, pm, today, ..."


In [79]:
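# convert labels to 0/1; astype(str) makes this work whether the column holds booleans or strings
yelps_test.positive = yelps_test.positive.astype(str).eq('True').astype(int)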
y = yelps_test.positive.values
print(y)
len(y)

[1 1 1 1 1 0 1 1 0 0 0 0 1 0 0 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 1 0 0 1 1 1
 1 1 0 0 0 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 1
 0 0 1 1 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0
 0 0 0 1 0 1 0 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0
 0 1 1 0 1 0 0 0 1 0 0 0 0 1 1 0 0 1 1 0 0 1 0 0 0 0 1 0 0 1 1 0 1 1 1 0 1
 1 1 1 1 1 0 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0
 0 0 1 1 0 1 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 1
 1 1 1 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0
 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 1 0
 0 1 0 1 1 0 0 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 0 0
 1 1 1 0 0 1 0 1 1 1 0 1 0 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 1
 0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0 1 0 1 0 1 1 0 0 0 1 1 1 0 1 0 0 1 1 1 1
 1 0 1 0 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 1 0
 0 1 1 1 0 1 0 1 0 0 1 1 

1000

In [80]:
sentences = yelps_test.text.tolist()


In [81]:
sentences

[['we',
  'have',
  'a',
  'vacation',
  'rental',
  'in',
  'las',
  'vegas',
  'that',
  'we',
  'rent',
  'out',
  'so',
  'the',
  'level',
  'of',
  'cleanliness',
  'that',
  'we',
  'expect',
  'is',
  'exponentially',
  'higher',
  'than',
  'a',
  'regular',
  'residential',
  'customer',
  'after',
  'all',
  'guests',
  'love',
  'the',
  'convenience',
  'of',
  'a',
  'vacation',
  'rental',
  'but',
  'nobody',
  'likes',
  'to',
  'be',
  'reminded',
  'that',
  'some',
  'other',
  'strangers',
  'butt',
  'has',
  'been',
  'on',
  'the',
  'toilet',
  'youre',
  'usingi',
  'called',
  'rhinos',
  'based',
  'entirely',
  'on',
  'the',
  'yelp',
  'reviews',
  'and',
  'i',
  'was',
  'not',
  'disappointed',
  'the',
  'booking',
  'process',
  'was',
  'exceptionally',
  'simple',
  'i',
  'called',
  'the',
  'person',
  'on',
  'the',
  'other',
  'end',
  'answered',
  'my',
  'questions',
  'and',
  'then',
  'i',
  'went',
  'online',
  'to',
  'book',
  'my',

In [82]:
t = time()
# sg=1 selects skip-gram; gensim >= 4.0 renames `size` to `vector_size`
model = Word2Vec(yelps_test['text'], min_count=10, size=300, workers=cores, window=4, sg=1)
print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

Time to train the model: 0.02 mins


In [83]:
X_train, X_test, y_train, y_test = train_test_split(yelps_test['text'],
                                                    y,
                                                    test_size=0.25, random_state=13)

In [84]:
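def turn_doc_into_vec(doc):
    # keep only in-vocabulary words; `word in model.wv` works across gensim versions
    doc = [word for word in doc if word in model.wv]
    if not doc:
        # averaging an empty list is the most likely cause of
        # "ValueError: need at least one array to concatenate"
        return np.zeros(model.wv.vector_size)
    return np.mean(model.wv[doc], axis=0)

In [85]:
type(X_train)

pandas.core.series.Series

In [86]:
# Reviews with no in-vocabulary words (possible with min_count=10) used to raise a ValueError here.
# With the guard in turn_doc_into_vec, stack the per-document vectors into 2-D feature matrices.

X_trainw = np.vstack(X_train.apply(turn_doc_into_vec).tolist())
X_testw = np.vstack(X_test.apply(turn_doc_into_vec).tolist())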
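
Averaging word vectors breaks down for a document that has no in-vocabulary tokens, because there is nothing to average. A minimal sketch of the guard, using a hypothetical three-dimensional embedding table in place of a trained model (the names `toy_vectors` and `doc_to_vec` are assumptions for illustration):

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained model.
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "sushi": np.array([3.0, 2.0, 0.0]),
}
DIM = 3


def doc_to_vec(tokens, vectors, dim):
    """Mean of the known word vectors, or a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


docs = [["great", "sushi", "unknownword"], ["zzz"], []]

# One row per document, so the result can be fed to a scikit-learn classifier.
X = np.vstack([doc_to_vec(d, toy_vectors, DIM) for d in docs])
print(X.shape)
print(X)
```

A zero vector is one simple fallback; dropping such documents from both the features and the labels is another option.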

In [None]:
# X_trainw / X_testw must be 2-D arrays with one row per document
lr = LogisticRegression(solver="lbfgs")

lr.fit(X_trainw, y_train)
y_pred = lr.predict(X_testw)

print("Accuracy: ", accuracy_score(y_test, y_pred))