# Assignment 3: Word2Vec

In this assignment, we will see how Word2Vec (or any similar word embedding) lets us use information from unlabelled data to classify better!

You will be using the sentiment data from last week, either the yelps or movies, whichever you wish. 

Your goal will be to simulate the following situation: you have a **small** set of labelled data and a large set of unlabelled data. Show how the following two techniques compare as the amount of labelled data increases. You should train them on the small labelled subset and test their performance on the rest of the data.

In other words, train on 1k, test on 99k. Then train on 2k, test on 98k. Then train on 4k, test on 96k. Etc.

1. Logistic regression trained on the labelled data, documents represented as a term-frequency matrix of your choice. You can learn the vocabulary from the entire dataset or only the labelled data.

2. Logistic regression trained on the labelled data, documents represented as word2vec vectors where you train word2vec using the entire dataset. Play around with different settings of word2vec (training window size, K-negative, skip-gram vs. CBOW, etc.). Note: we didn't go over the options in detail in class, so you will need to read about them a bit!

You can read about the gensim word2vec implementation [here](https://radimrehurek.com/gensim/models/word2vec.html).
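One common way to turn per-word vectors into a single document vector for technique 2 is to average the vectors of the words in the document. The sketch below illustrates that idea only; `toy_vectors` and `document_vector` are hypothetical names, and the 3-dimensional vectors are made up to stand in for a trained model.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained model's vectors.
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0]),
    "food": np.array([0.0, 1.0, 0.0]),
    "bad": np.array([-1.0, 0.0, 0.0]),
}

def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Three toy documents; "service" and "unknown" have no vector and are skipped.
docs = [["good", "food"], ["bad", "food", "service"], ["unknown"]]
X = np.vstack([document_vector(d, toy_vectors, dim=3) for d in docs])
print(X.shape)
```

Each row of `X` can then be fed to a classifier such as logistic regression. Averaging ignores word order, so it is a simple baseline rather than the only option.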

In [None]:
#Question 1: What do unlabelled and labelled data mean? What are their differences in our context?


In [None]:
#Small set of labelled data
#Large set of unlabelled data

#Compare as the amount of labelled data increases

In [None]:
#train on 1k, test on 99k. Then train on 2k, test on 98k. Then train on 4k, test on 96k
#THIS WILL NEED A FOR LOOP!!!
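The doubling schedule in the comment above could be sketched roughly as follows. This assumes a dataset of 100,000 documents (implied by the "1k / 99k" example) and splits an index array rather than the real data; `n_docs`, `indices` and `split_sizes` are names introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_docs = 100_000  # assumed dataset size, matching the "train on 1k, test on 99k" example
indices = np.arange(n_docs)

# 1k, 2k, 4k, ..., 64k labelled documents
train_sizes = [1000 * 2 ** k for k in range(7)]

split_sizes = []
for size in train_sizes:
    # An integer train_size is an absolute number of samples;
    # with test_size left unset, the test set is everything else.
    train_idx, test_idx = train_test_split(indices, train_size=size, random_state=12)
    split_sizes.append((len(train_idx), len(test_idx)))

print(split_sizes)
```

The index arrays can be used with `.iloc` on a DataFrame, so the same split can be reused for both techniques.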

In [None]:
import re
import spacy
import seaborn as sns
import pandas as pd
import numpy as np
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import utils

In [2]:
yelps = pd.read_csv('sentiment/yelps.csv')
movies = pd.read_csv('sentiment/movies.csv')

In [3]:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
from gensim.models import Doc2Vec

In [4]:
from nltk.corpus import brown, movie_reviews, treebank

In [None]:
#QUESTION 1- Logistic regression trained on labelled data, documents represented as 
#term-frequency matrix of your choice. 
#You can learn the vocabulary from the entire dataset or only the labelled data.

In [None]:
#Step 1-Preprocessing of the text

In [5]:
# preprocess and vectorizer
stemmer = SnowballStemmer("english")
# inside a character class, ( | ) are literal characters, so they are left out here
not_alphanumeric_or_space = re.compile(r'[^\w\s]')
nlp = spacy.load('en')

def preprocess(doc):
    doc = re.sub(not_alphanumeric_or_space, '', doc)
    words = [t.lemma_ for t in nlp(doc) if t.lemma_ != '-PRON-']
    return ' '.join(words).lower()

vectorizer = TfidfVectorizer(min_df=.1, 
                             max_df=.7, 
                             max_features=500,
                             preprocessor=preprocess,
                             use_idf=True, 
                             stop_words='english')

In [None]:
train_size = [1000,2000,4000,8000,16000,32000,64000]
logreg_acc = []
for size in train_size:
    X_train, X_test, y_train, y_test = train_test_split(yelps['text'],
                                                        yelps['positive'],
                                                        train_size=size, 
                                                        random_state=12)
    
    X_train_fitted = vectorizer.fit_transform(X_train)
    X_test_fitted = vectorizer.transform(X_test)
    lr = LogisticRegression(solver="lbfgs")
    lr.fit(X_train_fitted,y_train)
    y_hat = lr.predict(X_test_fitted)
    acc = accuracy_score(y_test, y_hat)
    print("Accuracy on {} entries: {}".format(size, acc))
    logreg_acc.append((size, acc))

In [None]:
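#stop_words triggers a warning that stopped me from proceeding.
#The warning says: Your stop_words may be inconsistent with your preprocessing.
#Tokenizing the stop words generated tokens ['make'] not in stop_words.
#(This is a warning, not an error: the lemmatizing preprocessor maps a stop word to 'make',
#which is not in the built-in stop word list.)
#So I could not get the accuracy scores as desired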
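One possible way to address the warning quoted above is to run the stop word list through the same preprocessing and tokenization as the documents, so both sides agree. The sketch below uses a tiny hand-written lemma map (`TOY_LEMMAS`, hypothetical) instead of spaCy, and a three-word stop list instead of the built-in English one, purely to keep the example small.

```python
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for a lemmatizer: maps a few word forms to a lemma.
TOY_LEMMAS = {"made": "make", "makes": "make"}

def toy_preprocess(doc):
    return " ".join(TOY_LEMMAS.get(w, w) for w in doc.lower().split())

raw_stop_words = ["made", "the", "and"]

# Pass every stop word through the same preprocessing + tokenization
# as the documents, and keep the original forms as well.
tokenize = TfidfVectorizer().build_tokenizer()
consistent_stop_words = sorted(
    {tok for w in raw_stop_words for tok in tokenize(toy_preprocess(w))}
    | set(raw_stop_words)
)

docs = ["She made the soup and the bread", "He makes great soup"]

def fit_and_collect_warnings(stop_words):
    """Fit a vectorizer and return it with any stop_words warnings raised."""
    vec = TfidfVectorizer(preprocessor=toy_preprocess, stop_words=stop_words)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        vec.fit(docs)
    messages = [str(w.message) for w in caught if "stop_words" in str(w.message)]
    return vec, messages

vec_raw, raw_warnings = fit_and_collect_warnings(raw_stop_words)
vec_fixed, fixed_warnings = fit_and_collect_warnings(consistent_stop_words)
print(len(raw_warnings), len(fixed_warnings))
```

With the raw list, the lemma `make` survives as a feature even though `made` was meant to be removed; with the consistent list it is filtered out. This only works cleanly when the preprocessor maps its own output to itself.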

In [None]:
#EXERCISE 2

In [None]:
#Logistic regression trained on the labelled data, documents represented as word2vec vectors where you train 
#word2vec using the entire dataset. Play around with different settings of word2vec 
#(training window size, K-negative, skip-gram vs. CBOW, etc.). 
#Note: we didn't go over the options in detail in class, so you will need to read about them a bit!

In [6]:
import matplotlib.pyplot as plt

In [7]:
from bs4 import BeautifulSoup
def cleanText(text):
    text = BeautifulSoup(text, "lxml").text
    text = re.sub(r'\|\|\|', r' ', text) 
    text = re.sub(r'http\S+', r'<URL>', text)
    text = text.lower()
    return text
yelps['text'] = yelps['text'].apply(cleanText)



In [9]:
from gensim.models.doc2vec import TaggedDocument

In [None]:
import nltk
nltk.download('punkt')

In [10]:
#Instead of repeating the growing splits from the 1st exercise
#(train on 1k, test on 99k; then train on 2k, test on 98k; then train on 4k, test on 96k; etc.),
#I am just using a single split with a test size of 0.16 (i.e. 16k).
#That's because the other splits take too much time to load and do not seem to work properly
train, test = train_test_split(yelps, test_size=0.16, random_state=42)
import nltk
from nltk.corpus import stopwords
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens
train_tagged = train.apply(lambda r: TaggedDocument(words=tokenize_text(r['text']), tags=[r.positive]), axis=1)
test_tagged = test.apply(lambda r: TaggedDocument(words=tokenize_text(r['text']), tags=[r.positive]), axis=1)

In [11]:
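#Train on the tokenized text of the entire dataset; passing the string 'text' would train on its characters only
sentences = [tokenize_text(doc) for doc in yelps['text']]
#size is the gensim 3.x argument name (renamed to vector_size in gensim 4)
model1 = Word2Vec(sentences, min_count=1, size=50, workers=3, window=3, sg=1)
#sg=1 selects the skip-gram training algorithm;
#to get the CBOW training algorithm instead, omit sg=1 (CBOW, sg=0, is the default)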

In [None]:
#model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
#model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])

In [12]:
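#Building the final Vector Feature for the Classifier
#Word2Vec has no infer_vector (a Doc2Vec method), so average the word vectors instead
def vec_for_learning(model1, tagged_docs):
    def doc_vector(words):
        vecs = [model1.wv[w] for w in words if w in model1.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model1.wv.vector_size)
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], doc_vector(doc.words)) for doc in sents])
    return targets, regressors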

In [13]:
y_train, X_train = vec_for_learning(model1, train_tagged)
y_test, X_test = vec_for_learning(model1, test_tagged)
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

from sklearn.metrics import accuracy_score, f1_score

print('Testing accuracy %s' % accuracy_score(y_test, y_pred))
print('Testing F1 score: {}'.format(f1_score(y_test, y_pred, average='weighted')))

In [None]:
#With infer_vector this does not work, so I can't get accuracy results: it raises
#AttributeError: 'Word2Vec' object has no attribute 'infer_vector'
#(infer_vector is a Doc2Vec method; a Word2Vec model only provides per-word vectors via model.wv)
