#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2019


# Homework 2:   word2vec + SVM + Evaluation

### 100 points [6% of your final grade]

### Due: Tuesday, February 26, 2019 by 11:59pm

*Goals of this homework:* Understand word2vec-like term embeddings,  explore real-world challenges with SVM-based classifiers, understand and implement several evaluation metrics.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw2.ipynb`. For example, my homework submission would be something like `555001234_hw2.ipynb`. Submit this notebook via eCampus (look for the homework 2 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the 5 total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

# Part 1: Term embeddings + SVM (80 points)

### Dataset


For this homework, we will still play with Yelp reviews from the [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge). As in Homework 1, you'll see that each line corresponds to a review on a particular business. Each review has a unique "ID" and the text content is in the "review" field. Additionally, this time, we also offer you the "label". If `label=1`, it means that this review is `Food-relevant`. If `label=0`, it means that this review is `Food-irrelevant`. Similarly, we have already done some basic preprocessing on the reviews, so you can just tokenize each review using whitespace.

There are about 40,000 reviews in total, in which about 20,000 reviews are "Food-irrelevant". We split the review data into two sets. *review_train.json* is the training set. *review_test.json* is the testing set. 

In [33]:
# Please load the dataset

import json
import zipfile

reviewsTest = []
reviewsTrain = []

zip_ref = zipfile.ZipFile("hw2_data.zip", 'r')
zip_ref.extractall()
zip_ref.close()

for line in open('hw2_data/review_test.json', 'r'):
    reviewsTest.append(json.loads(line))
    
for line in open('hw2_data/review_train.json', 'r'):
    reviewsTrain.append(json.loads(line))


###  Pre-trained term embeddings

To save your time, you can make use of  pre-trained term embeddings. In this homework, we are using one of the great pre-trained models from [GloVe](https://nlp.stanford.edu/projects/glove/) based on 2 billion tweets. GloVe is quite similar to word2vec. Unzip the *glove.6B.50d.txt.zip* file and run the code below. You will be able to load the term embeddings model, with which each word can be represented with a 50-dimension vector.

In [38]:
# reload the pre-trained term embeddings
import numpy as np

zip_ref = zipfile.ZipFile("hw2_data/glove.6B.50d.txt.zip", 'r')
zip_ref.extractall()
zip_ref.close()

with open("glove.6B.50d.txt", "rb") as lines:
    model = {line.split()[0]: np.array( list( map(float, line.split( )[1:]) ))
           for line in lines} 

Now, you have a vector representation for each word. First, we use the simple (arithmetic) **mean** of these vectors of words in a review to represent the review. *Note: Just ignore those words which are not in the corpus of this pre-trained model.*

In [39]:
# Please figure out the vector representation for each review in the training data and testing data.
testReviewVector  =[]
trainReviewVector =[]

def makeReviewVector(review):
    x=[]
    x=review.split()
    sumReviewVector = 0
    countWords = 0
    for word in x:
        if word.encode() in model:
            encodedWord = word.encode()
            sumReviewVector+= model[encodedWord]
        countWords+=1
    return (sumReviewVector/countWords)

for i in range(0,len(reviewsTest)):
    vect = makeReviewVector(reviewsTest[i]['review'])
    testReviewVector.append({'id': reviewsTest[i]['id'], 'review': vect,'label':reviewsTest[i]['label']})

for i in range(0,len(reviewsTrain)):
    vect = makeReviewVector(reviewsTrain[i]['review'])
    trainReviewVector.append({'id': reviewsTrain[i]['id'], 'review': vect,'label':reviewsTrain[i]['label']})


### SVM

With the vector representations you get for each review, please train an SVM model to predict whether a given review is food-relevant or not. **You do not need to implement any classifier from scratch. You may use scikit-learn's built-in capabilities.** You can only train your model with reviews in *review_train.json*.

In [40]:
# SVM model training
from sklearn import svm

xTrain =[]
xTest =[]
yTrain =[]
yTest=[]

for i in range(0,len(trainReviewVector)):
    xTrain.append(trainReviewVector[i]['review'])
    yTrain.append(trainReviewVector[i]['label'])

for i in range(0,len(testReviewVector)):
    xTest.append(testReviewVector[i]['review'])
    yTest.append(testReviewVector[i]['label'])
    
clf = svm.SVC(kernel='linear')
clf.fit(xTrain, yTrain)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [41]:
yPredict = clf.predict(xTest)

Your goal is to predict whether a given review is food-relevant or not. Please report the overall accuracy, precision and recall of your model on the **testing data**. You should **implement the functions for accuracy, precision, and recall**.

In [42]:
tp = 0 #true positive
fp = 0 #false positive
tn = 0 #true negative
fn = 0  #false negative

for i in range(0,len(yTest)):
    if(yTest[i] == 1 and  yPredict[i]== 1):
        tp+=1
    elif(yTest[i] ==0 and  yPredict[i] ==1):
        fp+=1
    elif(yTest[i] ==0 and  yPredict[i] ==0):
        tn+=1
    elif(yTest[i] ==1 and  yPredict[i] ==0):
        fn+=1
        
print("tp:  ",tp)
print("fp:  ",fp)
print("tn:  ",tn)
print("fn:  ",fn)


accuracy = (tp + tn)/len(yTest)
precision = tp/(tp+fp)
recall = tp/(tp+fn)
f1Score = 2*((precision*recall)/(precision+recall))
print()
print("accuracy:  ",accuracy*100)
print("precision:  ",precision*100)
print("recall:  ",recall*100)
print("F1 Score:  ",f1Score*100)


tp:   5487
fp:   650
tn:   5325
fn:   458

accuracy:   90.70469798657717
precision:   89.4085057845853
recall:   92.29604709840203
F1 Score:   90.82933289190531


### Document-based embeddings

Instead of taking the mean of term embeddings, you can directly train a **doc2vec** model for paragraph or document embeddings. You can refer to the paper [Distributed Representations of Sentences and Documents](https://arxiv.org/pdf/1405.4053v2.pdf) for more details. And in this homework, you can make use of the implementation in [gensim](https://radimrehurek.com/gensim/models/doc2vec.html).

Now, you need to:
* Train a doc2vec model based on all reviews you have (training + testing sets).
* Use the embeddings from your doc2vec model to represent each review and train a new SVM model.
* Report the overall accuracy, precision and recall of your model on the testing data.

In [34]:
# Train a doc2vec

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import get_tmpfile

allReviews = []

for i in range(0,len(reviewsTrain)):
    allReviews.append(reviewsTrain[i]['review'].split())

for i in range(0,len(reviewsTest)):
    allReviews.append(reviewsTest[i]['review'].split())
 

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(allReviews)]
model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, workers=4)


fname = get_tmpfile("my_doc2vec_model")
model.save(fname)
model = Doc2Vec.load(fname)


In [53]:
max_epochs = 40
vec_size = 20
alpha = 0.025

new_model = Doc2Vec(size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)
  
new_model.build_vocab(documents)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    new_model.train(documents,
                total_examples=new_model.corpus_count,
                epochs=new_model.iter)
    # decrease the learning rate
    new_model.alpha -= 0.0002
    # fix the learning rate, no decay
    new_model.min_alpha = model.alpha

new_model.save("d2v.model")
print("Model Saved")



iteration 0




iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
iteration 10
iteration 11
iteration 12
iteration 13
iteration 14
iteration 15
iteration 16
iteration 17
iteration 18
iteration 19
iteration 20
iteration 21
iteration 22
iteration 23
iteration 24
iteration 25
iteration 26
iteration 27
iteration 28
iteration 29
iteration 30
iteration 31
iteration 32
iteration 33
iteration 34
iteration 35
iteration 36
iteration 37
iteration 38
iteration 39
Model Saved


In [54]:
# Train a SVM
xTrain = []
xTest = []

for i in range(0,len(reviewsTrain)):
    xTrain.append(new_model.infer_vector(reviewsTrain[i]['review'].split()))
    
for i in range(0,len(reviewsTest)):
    xTest.append(new_model.infer_vector(reviewsTest[i]['review'].split()))

clf = svm.SVC(kernel='linear')
clf.fit(xTrain, yTrain)
yPredict = clf.predict(xTest)

In [55]:
# Report the performance
tp = 0 #true positive
fp = 0 #false positive
tn = 0 #true negative
fn = 0  #false negative

for i in range(0,len(yTest)):
    if(yTest[i] == 1 and  yPredict[i]== 1):
        tp+=1
    elif(yTest[i] ==0 and  yPredict[i] ==1):
        fp+=1
    elif(yTest[i] ==0 and  yPredict[i] ==0):
        tn+=1
    elif(yTest[i] ==1 and  yPredict[i] ==0):
        fn+=1
        
print("tp:  ",tp)
print("fp:  ",fp)
print("tn:  ",tn)
print("fn:  ",fn)


accuracy = (tp + tn)/len(yTest)
precision = tp/(tp+fp)
recall = tp/(tp+fn)
f1Score = 2*((precision*recall)/(precision+recall))

print()
print("accuracy:  ", accuracy*100)
print("precision:  ", precision*100)
print("recall:  ",recall*100)
print("F1 Score:  ",f1Score*100)


tp:   5469
fp:   606
tn:   5369
fn:   476

accuracy:   90.92281879194631
precision:   90.02469135802468
recall:   91.9932716568545
F1 Score:   90.99833610648919


What do you observe? How different are your results for the term-based average approach vs. the doc2vec approach? Why do you think this is?

The values for accuracy and F1 scores for both my approaches (Glove (term based approach) and doc2vec approach) are quite close. The doc2vec (trained over 40 epochs) model has performed only slightly better than my term-based approach.

*Results:*
1. For term based approach(vector size:50): <br>
    a. Accuracy:  90.704   
    b. F1 Score:  90.704 <br>
    
2. For doc2vec (vector size:20) trained over 40 epochs: <br>
    a. Accuracy:  90.704    
    b. F1 Score:  90.704 <br>
   

*Reasoning:*
Each Doc2Vec vector captures the semantic meaning of all the words in the context. But the same can be achieved with term based embeddings as well since each word in Glove preserves its own semantic meaning. Summing up all the vectors and averaging them will result in a vector which could have all the semantics preserved. For example, when we add the vectors (transport+water) the result nearly equals ship or boat, which means summing the vectors sums up the semantics. 

Summing up all the individual word vectors and averaging them to represent each document as vector isn't very computationally strenuos because our document lengths are relatively small. Therefore, both approaches are equally good.

However, it's debatable whether both of these approaches should be compared directly because:
1. The Glove model has been trained over 6 billion words whereas the Doc2Vec model was trained exclusively with a vocabulary containing words all the available reviews (test + train). 
2. The Glove model couldn't handle "out of vocabulary" words. In such cases, I had to assume the word vector to be 0. 

### Can you do better?

Finally, see if you can do better than either the word- or doc- based embeddings approach for classification. You may explore new features, new classifiers, etc. Whatever you like. Just provide your code and a justification.

In [29]:
import io
import numpy as np
def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = np.array(list(map(float, tokens[1:])))
    return data

vect_fastText= load_vectors("crawl-300d-2M-subword/crawl-300d-2M-subword.vec")

In [32]:
testReviewVector  =[]
trainReviewVector =[]

def makeReviewVector(review):
    x=[]
    x=review.split()
    sumReviewVector = 0
    countWords = 0
    for word in x:
        if word in vect_fastText:
            sumReviewVector+= vect_fastText[word]
        countWords+=1
    return (sumReviewVector/countWords)

for i in range(0,len(reviewsTest)):
    vect = makeReviewVector(reviewsTest[i]['review'])
    testReviewVector.append({'id': reviewsTest[i]['id'], 'review': vect,'label':reviewsTest[i]['label']})

for i in range(0,len(reviewsTrain)):
    vect = makeReviewVector(reviewsTrain[i]['review'])
    trainReviewVector.append({'id': reviewsTrain[i]['id'], 'review': vect,'label':reviewsTrain[i]['label']})
    
    
from sklearn import svm
xTrain =[]
xTest =[]
yTrain =[]
yTest=[]

for i in range(0,len(trainReviewVector)):
    xTrain.append(trainReviewVector[i]['review'])
    yTrain.append(trainReviewVector[i]['label'])

for i in range(0,len(testReviewVector)):
    xTest.append(testReviewVector[i]['review'])
    yTest.append(testReviewVector[i]['label'])
    
clf = svm.SVC(kernel='linear')
clf.fit(xTrain, yTrain)

yPredict = clf.predict(xTest)

tp = 0 #true positive
fp = 0 #false positive
tn = 0 #true negative
fn = 0  #false negative

for i in range(0,len(yTest)):
    if(yTest[i] == 1 and  yPredict[i]== 1):
        tp+=1
    elif(yTest[i] ==0 and  yPredict[i] ==1):
        fp+=1
    elif(yTest[i] ==0 and  yPredict[i] ==0):
        tn+=1
    elif(yTest[i] ==1 and  yPredict[i] ==0):
        fn+=1
        
print("tp:  ",tp)
print("fp:  ",fp)
print("tn:  ",tn)
print("fn:  ",fn)


accuracy = (tp + tn)/len(yTest)
precision = tp/(tp+fp)
recall = tp/(tp+fn)
f1Score = 2*((precision*recall)/(precision+recall))

print()
print("accuracy:  ", accuracy*100)
print("precision:  ", precision*100)
print("recall:  ",recall*100)
print("F1 Score:  ",f1Score*100)

<class 'dict'>
tp:   5684
fp:   546
tn:   5429
fn:   261

accuracy:   93.22986577181209
precision:   91.23595505617978
recall:   95.60975609756098
F1 Score:   93.37166324435317


*Justification:*

Glove and Doc2Vec learn vectors for complete words in the training corpus whereas FastText learns vectors for character n-grams that are found within each word as well as each complete word. The word vector that we get in FastText already contains embedded sub-word information. 
This is advantageous because:
1. Character n-grams are capable of handling out of vocabulary words. The vector for a rare word can be formed from its character n-grams even if word doesn't appear in training corpus. 
2. Rare words have better word embeddings because their character n-grams are still shared with other words. 

# Part 2: NDCG (20 points)

You calculated the recall and precision in Part 1 and now you get a chance to implement NDCG. 

Assume that Amy searches for "food-relevant" reviews in the **testing set** on two search engines `A` and `B`. Since the ground-truth labels for the reviews are unknown to A and B, they need to make a prediction for each review and then return a ranked list of results based on their probabilities. The results from A are in *search_result_A.json*, and the results from B are in *search_result_B.json*. Each line contains the id of a review and its corresponding ranking.

You can check their labels in *review_test.json* while calculating the NDCG scores. If a review is "food-relevant", the relevance score is 1. Otherwise, the relevance score is 0.

In [50]:
searchResultA = []
searchResultB = []

for line in open('hw2_data/search_result_A.json', 'r'):
    searchResultA.append(json.loads(line))
    
for line in open('hw2_data/search_result_B.json', 'r'):
    searchResultB.append(json.loads(line))
     
relevanceTest = {}

for i in range(0,len(reviewsTest)):
    relevanceTest[reviewsTest[i]['id']] = reviewsTest[i]['label']

In [51]:
# NDCG for search_result_A.json

import math
from operator import itemgetter 

dcg_A = 0

for i in range(0,len(searchResultA)): 
    num = 0
    den = 0
    num = (pow(2,relevanceTest[searchResultA[i]['id']]) - 1)
    den = math.log2(i+2)
    dcg_A += (num/den)

In [52]:
# NDCG for search_result_B.json

dcg_B = 0

for i in range(0,len(searchResultB)): 
    num = 0
    den = 0
    num = (pow(2,relevanceTest[searchResultB[i]['id']]) - 1)
    den = math.log2(i+2)
    dcg_B+= (num/den)

In [53]:
relevanceReviewTest = {}
#monotonically decreasing sort of all known relevance judgments
relevanceReviewTest = sorted(relevanceTest.items(), key = itemgetter(1), reverse = True)

lengthSearchA = len(searchResultA) #to find relevant results for search engine A
lengthSearchB = len(searchResultB) #to find relevant results for search engine B 

idcg_a = 0
idcg_b = 0

i1 = 1
i2 = 1

for key,value in relevanceReviewTest: 
    if i1<lengthSearchA:
        num = 0
        den = 0
        num = (pow(2,value) - 1)
        den = math.log2(i1+1)
        i1+=1
        idcg_a+= (num/den)
    if i2<lengthSearchB:
        num = 0
        den = 0
        num = (pow(2,value) - 1)
        den = math.log2(i2+1)
        i2+=1
        idcg_b+= (num/den)
    
ndcg_a = dcg_A/idcg_a
ndcg_b = dcg_B/idcg_b


print("NDCG for search engine A:",ndcg_a)
print("NDCG for search engine B:",ndcg_b)

NDCG for search engine A: 0.9182854705835336
NDCG for search engine B: 0.9866215613328989


## Collaboration declarations

*If you collaborated with anyone (see Collaboration policy at the top of this homework), you can put your collaboration declarations here.*

In [None]:
 I found helpful code on Medium at https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5 for doc2vec