# Amazon Fine Food Reviews Analysis


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews <br>

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/


The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unqiue identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use Score/Rating. A rating of 4 or 5 can be cosnidered as a positive review. A rating of 1 or 2 can be considered as negative one. A review of rating 3 is considered nuetral and such reviews are ignored from our analysis. This is an approximate and proxy way of determining the polarity (positivity/negativity) of a review.




# [1]. Reading Data

## [1.1] Loading the data

The dataset is available in two forms
1. .csv file
2. SQLite Database

In order to load the data, We have used the SQLITE dataset as it is easier to query the data and visualise the data efficiently.
<br> 

Here as we only want to get the global sentiment of the recommendations (positive or negative), we will purposefully ignore all Scores equal to 3. If the score is above 3, then the recommendation wil be set to "positive". Otherwise, it will be set to "negative".

In [174]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

In [175]:
# using SQLite Table to read data.
con = sqlite3.connect('database.sqlite') 

# filtering only positive and negative reviews i.e. 
# not taking into consideration those reviews with Score=3
# SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000, will give top 500000 data points
# you can change the number to any other number based on your computing power

# filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000""", con) 
# for tsne assignment you can take 5k data points

filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3""", con) 

# Give reviews with Score>3 a positive rating(1), and reviews with a score<3 a negative rating(0).
def partition(x):
    if x < 3:
        return 0
    return 1

#changing reviews with score less than 3 to be positive and vice-versa
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition) 
filtered_data['Score'] = positiveNegative
print("Number of data points in our data", filtered_data.shape)
filtered_data.head(3)

Number of data points in our data (525814, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


In [176]:
display = pd.read_sql_query("""
SELECT UserId, ProductId, ProfileName, Time, Score, Text, COUNT(*)
FROM Reviews
GROUP BY UserId
HAVING COUNT(*)>1
""", con)

In [177]:
print(display.shape)
display.head()

(80668, 7)


Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
0,#oc-R115TNMSPFT9I7,B007Y59HVM,Breyton,1331510400,2,Overall its just OK when considering the price...,2
1,#oc-R11D9D7SHXIJB9,B005HG9ET0,"Louis E. Emory ""hoppy""",1342396800,5,"My wife has recurring extreme muscle spasms, u...",3
2,#oc-R11DNU2NBKQ23Z,B007Y59HVM,Kim Cieszykowski,1348531200,1,This coffee is horrible and unfortunately not ...,2
3,#oc-R11O5J5ZVQE25C,B005HG9ET0,Penguin Chick,1346889600,5,This will be the bottle that you grab from the...,3
4,#oc-R12KPBODL2B5ZD,B007OSBE1U,Christopher P. Presta,1348617600,1,I didnt like this coffee. Instead of telling y...,2


In [178]:
display[display['UserId']=='AZY10LLTJ71NX']

Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
80638,AZY10LLTJ71NX,B006P7E5ZI,"undertheshrine ""undertheshrine""",1334707200,5,I was recommended to try green tea extract to ...,5


In [179]:
display['COUNT(*)'].sum()

393063

#  [2] Exploratory Data Analysis

## [2.1] Data Cleaning: Deduplication

It is observed (as shown in the table below) that the reviews data had many duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data.  Following is an example:

In [180]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductID
""", con)
display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


As it can be seen above that same user has multiple reviews with same values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text and on doing analysis it was found that <br>
<br> 
ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)<br>
<br> 
ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on<br>

It was inferred after analysis that reviews with same parameters other than ProductId belonged to the same product just having different flavour or quantity. Hence in order to reduce redundancy it was decided to eliminate the rows having same parameters.<br>

The method used for the same was that we first sort the data according to ProductId and then just keep the first similar product review and delelte the others. for eg. in the above just the review for ProductId=B000HDL1RQ remains. This method ensures that there is only one representative for each product and deduplication without sorting would lead to possibility of different representatives still existing for the same product.

In [181]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [182]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape

(364173, 10)

In [183]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

69.25890143662969

<b>Observation:-</b> It was also seen that in two rows given below the value of HelpfulnessNumerator is greater than HelpfulnessDenominator which is not practically possible hence these two rows too are removed from calcualtions

In [184]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND Id=44737 OR Id=64422
ORDER BY ProductID
""", con)

display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [185]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

In [186]:
#Before starting the next phase of preprocessing lets see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(364171, 10)


1    307061
0     57110
Name: Score, dtype: int64

#  [3] Preprocessing

## [3.1].  Preprocessing Review Text

Now that we have finished deduplication our data requires some preprocessing before we go on further with analysis and making the prediction model.

Hence in the Preprocessing phase we do the following in the order below:-

1. Begin by removing the html tags
2. Remove any punctuations or limited set of special characters like , or . or # etc.
3. Check if the word is made up of english letters and is not alpha-numeric
4. Check to see if the length of the word is greater than 2 (as it was researched that there is no adjective in 2-letters)
5. Convert the word to lowercase
6. Remove Stopwords
7. Finally Snowball Stemming the word (it was obsereved to be better than Porter Stemming)<br>

After which we collect the words used to describe positive and negative reviews

In [187]:
# printing some random reviews
sent_0 = final['Text'].values[0]
print(sent_0)
print("="*50)

sent_1000 = final['Text'].values[1000]
print(sent_1000)
print("="*50)

sent_1500 = final['Text'].values[1500]
print(sent_1500)
print("="*50)

sent_4900 = final['Text'].values[4900]
print(sent_4900)
print("="*50)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
I was really looking forward to these pods based on the reviews.  Starbucks is good, but I prefer bolder taste.... imagine my surprise when I ordered 2 boxes - both were expired! One expired back in 2005 for gosh sakes.  I admit that Amazon agreed to credit me for cost plus part of shipping, but geez, 2 years expired!!!  I'm hoping to find local San Diego area shoppe that carries pods so that I can try something different than starbucks.
Great ingredients although, chicken should have been 1st rather than chicken broth, the only thing I do not think belongs in it is Canola oil. Canola or rapeseed is not someting a do

In [188]:
# remove urls from text python: https://stackoverflow.com/a/40823105/4084039
sent_0 = re.sub(r"http\S+", "", sent_0)
sent_1000 = re.sub(r"http\S+", "", sent_1000)
sent_150 = re.sub(r"http\S+", "", sent_1500)
sent_4900 = re.sub(r"http\S+", "", sent_4900)

print(sent_0)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college


In [189]:
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
from bs4 import BeautifulSoup

soup = BeautifulSoup(sent_0, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1000, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1500, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_4900, 'lxml')
text = soup.get_text()
print(text)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
I was really looking forward to these pods based on the reviews.  Starbucks is good, but I prefer bolder taste.... imagine my surprise when I ordered 2 boxes - both were expired! One expired back in 2005 for gosh sakes.  I admit that Amazon agreed to credit me for cost plus part of shipping, but geez, 2 years expired!!!  I'm hoping to find local San Diego area shoppe that carries pods so that I can try something different than starbucks.
Great ingredients although, chicken should have been 1st rather than chicken broth, the only thing I do not think belongs in it is Canola oil. Canola or rapeseed is not someting a do

In [190]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [191]:
sent_1500 = decontracted(sent_1500)
print(sent_1500)
print("="*50)

Great ingredients although, chicken should have been 1st rather than chicken broth, the only thing I do not think belongs in it is Canola oil. Canola or rapeseed is not someting a dog would ever find in nature and if it did find rapeseed in nature and eat it, it would poison them. Today is Food industries have convinced the masses that Canola oil is a safe and even better oil than olive or virgin coconut, facts though say otherwise. Until the late 70 is it was poisonous until they figured out a way to fix that. I still like it but it could be better.


In [192]:
#remove words with numbers python: https://stackoverflow.com/a/18082370/4084039
sent_0 = re.sub("\S*\d\S*", "", sent_0).strip()
print(sent_0)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college


In [193]:
#remove spacial character: https://stackoverflow.com/a/5843547/4084039
sent_1500 = re.sub('[^A-Za-z0-9]+', ' ', sent_1500)
print(sent_1500)

Great ingredients although chicken should have been 1st rather than chicken broth the only thing I do not think belongs in it is Canola oil Canola or rapeseed is not someting a dog would ever find in nature and if it did find rapeseed in nature and eat it it would poison them Today is Food industries have convinced the masses that Canola oil is a safe and even better oil than olive or virgin coconut facts though say otherwise Until the late 70 is it was poisonous until they figured out a way to fix that I still like it but it could be better 


In [194]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# <br /><br /> ==> after the above steps, we are getting "br br"
# we are including them into stop words list
# instead of <br /> if we have <br/> these tags would have revmoved in the 1st step

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [195]:
# Combining all the above stundents 
from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentance in tqdm(final['Text'].values):
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'lxml').get_text()
    sentance = decontracted(sentance)
    sentance = re.sub("\S*\d\S*", "", sentance).strip()
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)
    # https://gist.github.com/sebleier/554280
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentance.strip())

100%|████████████████████████████████████████████████████████████████████████| 364171/364171 [02:55<00:00, 2073.27it/s]


In [196]:
preprocessed_reviews[1500]

'great ingredients although chicken rather chicken broth thing not think belongs canola oil canola rapeseed not someting dog would ever find nature find rapeseed nature eat would poison today food industries convinced masses canola oil safe even better oil olive virgin coconut facts though say otherwise late poisonous figured way fix still like could better'

<h2><font color='red'>[3.2] Preprocessing Review Summary</font></h2>

In [197]:
## Similartly you can do preprocessing for review summary also.

preprocessed_summary = []
# tqdm is for printing the status bar
for sentance in tqdm(final['Summary'].values):
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'lxml').get_text()
    sentance = decontracted(sentance)
    sentance = re.sub("\S*\d\S*", "", sentance).strip()
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)
    # https://gist.github.com/sebleier/554280
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_summary.append(sentance.strip())

100%|████████████████████████████████████████████████████████████████████████| 364171/364171 [02:16<00:00, 2674.23it/s]


In [198]:
#Splitting the data into train,CV and test

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(preprocessed_reviews,final['Score'],random_state=100,test_size=0.30,shuffle=False)

X_train,X_CV,y_train,y_CV = train_test_split(X_train,y_train,random_state=100,test_size=0.30,shuffle=False)

# [4] Featurization

## [4.1] BAG OF WORDS

In [199]:
#BoW
count_vect = CountVectorizer() #in scikit-learn
count_vect.fit(X_train)
print("some feature names ", count_vect.get_feature_names()[:10])
print('='*50)

bow_features = count_vect.get_feature_names()
print("the type of count vectorizer ",type(final_counts))
print("the shape of out text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])

#Vectorizing the train2 (splitted from X_train to be used with CV data)
final_countsTrain = count_vect.transform(X_train)

#Vectorizing the CV data
final_countsCV = count_vect.transform(X_CV)

#Vectorizing the test data

final_countsTEST = count_vect.transform(X_test)

some feature names  ['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaaaaaa', 'aaaaaaaaaaaa', 'aaaaaaaaaaaaa', 'aaaaaaaaaaaaaaa', 'aaaaaaaaagghh']
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (2443, 9468)
the number of unique words  9468


## [4.2] Bi-Grams and n-Grams.

In [200]:
#bi-gram, tri-gram and n-gram

#removing stop words like "not" should be avoided before building n-grams
# count_vect = CountVectorizer(ngram_range=(1,2))
# please do read the CountVectorizer documentation http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

# you can choose these numebrs min_df=10, max_features=5000, of your choice
count_vect_bi = CountVectorizer(ngram_range=(2,2), min_df=10, max_features=5000)
count_vect_bi.fit(X_train)
features_bigrams = count_vect_bi.get_feature_names()
print("some feature names ",features_bigrams[:10])
final_bigram_countsTrain =count_vect_bi.transform(X_train)
print("the type of count vectorizer ",type(final_bigram_countsTrain))
print("the shape of out text BOW vectorizer ",final_bigram_countsTrain.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_countsTrain.get_shape()[1])

#vectorizing train,test and cv data

final_bigram_countsCV = count_vect_bi.transform(X_CV)
final_bigram_countsTEST = count_vect_bi.transform(X_test)

some feature names  ['able buy', 'able eat', 'able enjoy', 'able find', 'able get', 'able make', 'able order', 'able purchase', 'able use', 'absolute best']
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (178443, 5000)
the number of unique words including both unigrams and bigrams  5000


## [4.3] TF-IDF

In [201]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(X_train)
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_names()[0:10])
print('='*50)

features_tf_idf = tf_idf_vect.get_feature_names()

final_tf_idfTrain = tf_idf_vect.transform(X_train)
print("the type of count vectorizer ",type(final_tf_idf_train))
print("the shape of out text TFIDF vectorizer ",final_tf_idf_train.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tf_idf_train.get_shape()[1])

#vectorizing train,test and cv data

final_tf_idfCV = tf_idf_vect.transform(X_CV)
final_tf_idfTEST = tf_idf_vect.transform(X_test)

some sample features(unique words in the corpus) ['aa', 'aaa', 'aafco', 'ab', 'aback', 'abandon', 'abandoned', 'abby', 'abc', 'abdomen']
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text TFIDF vectorizer  (2443, 1699)
the number of unique words including both unigrams and bigrams  1699


## [4.4] Word2Vec

In [202]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

In [203]:
# Train your own Word2Vec model using your own text corpus
i=0
list_of_sentence_train=[]
for sentence in X_train:
    list_of_sentence_train.append(sentence.split())

In [None]:
w2v_model=Word2Vec(list_of_sentence_train,min_count=5,size=50, workers=4)
w2v_words = list(w2v_model.wv.vocab)
print("number of words that occured minimum 5 times ",len(w2v_words))
print("sample words ", w2v_words[0:50])

## [4.4.1] Converting text into vectors using Avg W2V, TFIDF-W2V

#### [4.4.1.1] Avg W2v

In [None]:
# average Word2Vec
# compute average word2vec for each review.
sent_vectors_train = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentence_train): # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length 50, you might need to change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors_train.append(sent_vec)
sent_vectors_train = np.array(sent_vectors_train)
print(sent_vectors_train.shape)
print(sent_vectors_train[0])

# Converting CV data text

In [None]:
i=0
list_of_sentence_cv=[]
for sentence in X_CV:
    list_of_sentence_cv.append(sentence.split())

# average Word2Vec
# compute average word2vec for each review.
sent_vectors_cv = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentence_cv): # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length 50, you might need to change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors_cv.append(sent_vec)
sent_vectors_cv = np.array(sent_vectors_cv)
print(sent_vectors_cv.shape)
print(sent_vectors_cv[0])

# Converting Test data text

In [None]:
i=0
list_of_sentence_test=[]
for sentence in X_test:
    list_of_sentence_test.append(sentence.split())

# average Word2Vec
# compute average word2vec for each review.
sent_vectors_test = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentence_test): # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length 50, you might need to change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors_test.append(sent_vec)
sent_vectors_test = np.array(sent_vectors_test)
print(sent_vectors_test.shape)
print(sent_vectors_test[0])

#### [4.4.1.2] TFIDF weighted W2v

In [None]:
# S = ["abc def pqr", "def def def abc", "pqr pqr def"]
model_tf = TfidfVectorizer()
tf_idf_matrix = model_tf.fit_transform(X_train)
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(model_tf.get_feature_names(), list(model_tf.idf_)))

### TFIDF W2V vectorization of  train data

In [None]:
# TF-IDF weighted Word2Vec
tfidf_feat = model_tf.get_feature_names() # tfidf words/col-names
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

tfidf_sent_vectors_train = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm(list_of_sentence_train): # for each review/sentence 
    sent_vec = np.zeros(50) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
#             tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors_train.append(sent_vec)
    row += 1

# TFIDF W2V vectorization of  CV data

In [None]:

tfidf_sent_vectors_CV = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm(list_of_sentence_cv): # for each review/sentence 
    sent_vec = np.zeros(50) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
#             tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors_CV.append(sent_vec)
    row += 1

# TFIDF W2V vectorization of  test data

In [None]:
tfidf_sent_vectors_test = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm(list_of_sentence_test): # for each review/sentence 
    sent_vec = np.zeros(50) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
#             tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors_test.append(sent_vec)
    row += 1

In [None]:
tfidf_sent_vectors_train = np.array(tfidf_sent_vectors_train)
tfidf_sent_vectors_CV = np.array(tfidf_sent_vectors_CV)
tfidf_sent_vectors_test = np.array(tfidf_sent_vectors_test)

# [5] Assignment 3: KNN

<ol>
    <li><strong>Apply Knn(brute force version) on these feature sets</strong>
        <ul>
            <li><font color='red'>SET 1:</font>Review text, preprocessed one converted into vectors using (BOW)</li>
            <li><font color='red'>SET 2:</font>Review text, preprocessed one converted into vectors using (TFIDF)</li>
            <li><font color='red'>SET 3:</font>Review text, preprocessed one converted into vectors using (AVG W2v)</li>
            <li><font color='red'>SET 4:</font>Review text, preprocessed one converted into vectors using (TFIDF W2v)</li>
        </ul>
    </li>
    <br>
    <li><strong>Apply Knn(kd tree version) on these feature sets</strong>
        <br><font color='red'>NOTE: </font>sklearn implementation of kd-tree accepts only dense matrices, you need to convert the sparse matrices of CountVectorizer/TfidfVectorizer into dense matices. You can convert sparse matrices to dense using .toarray() attribute. For more information please visit this <a href='https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csr_matrix.toarray.html'>link</a>
        <ul>
            <li><font color='red'>SET 5:</font>Review text, preprocessed one converted into vectors using (BOW) but with restriction on maximum features generated.
            <pre>
            count_vect = CountVectorizer(min_df=10, max_features=500) 
            count_vect.fit(preprocessed_reviews)
            </pre>
            </li>
            <li><font color='red'>SET 6:</font>Review text, preprocessed one converted into vectors using (TFIDF) but with restriction on maximum features generated.
            <pre>
                tf_idf_vect = TfidfVectorizer(min_df=10, max_features=500)
                tf_idf_vect.fit(preprocessed_reviews)
            </pre>
            </li>
            <li><font color='red'>SET 3:</font>Review text, preprocessed one converted into vectors using (AVG W2v)</li>
            <li><font color='red'>SET 4:</font>Review text, preprocessed one converted into vectors using (TFIDF W2v)</li>
        </ul>
    </li>
    <br>
    <li><strong>The hyper paramter tuning(find best K)</strong>
        <ul>
    <li>Find the best hyper parameter which will give the maximum <a href='https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/receiver-operating-characteristic-curve-roc-curve-and-auc-1/'>AUC</a> value</li>
    <li>Find the best hyper paramter using k-fold cross validation or simple cross validation data</li>
    <li>Use gridsearch cv or randomsearch cv or you can also write your own for loops to do this task of hyperparameter tuning</li>
        </ul>
    </li>
    <br>
    <li>
    <strong>Representation of results</strong>
        <ul>
    <li>You need to plot the performance of model both on train data and cross validation data for each hyper parameter, like shown in the figure
    <img src='train_cv_auc.JPG' width=300px></li>
    <li>Once after you found the best hyper parameter, you need to train your model with it, and find the AUC on test data and plot the ROC curve on both train and test.
    <img src='train_test_auc.JPG' width=300px></li>
    <li>Along with plotting ROC curve, you need to print the <a href='https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/confusion-matrix-tpr-fpr-fnr-tnr-1/'>confusion matrix</a> with predicted and original labels of test data points
    <img src='confusion_matrix.png' width=300px></li>
        </ul>
    </li>
    <br>
    <li><strong>Conclusion</strong>
        <ul>
    <li>You need to summarize the results at the end of the notebook, summarize it in the table format. To print out a table please refer to this prettytable library<a href='http://zetcode.com/python/prettytable/'> link</a> 
        <img src='summary.JPG' width=400px>
    </li>
        </ul>
</ol>

<h4><font color='red'>Note: Data Leakage</font></h4>

1. There will be an issue of data-leakage if you vectorize the entire data and then split it into train/cv/test.
2. To avoid the issue of data-leakag, make sure to split your data first and then vectorize it. 
3. While vectorizing your data, apply the method fit_transform() on you train data, and apply the method transform() on cv/test data.
4. For more details please go through this <a href='https://soundcloud.com/applied-ai-course/leakage-bow-and-tfidf'>link.</a>

## [5.1] Applying KNN brute force

### [5.1.1] Applying KNN brute force on BOW,<font color='red'> SET 1</font>

In [None]:
# Please write all the code with proper documentation

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
                 
hyperparameters = {'n_neighbors':list(filter(lambda x : x % 2 != 0 , list(range(2,30))))}

knn_brute_bow = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc')
knn_brute_bow.fit(final_countsTrain,y_train)

In [None]:
print('Best Estimator : ',knn_brute_bow.best_estimator_)
print('\nClasses : ',knn_brute_bow.classes_)
print('\nBest Score : ',knn_brute_bow.best_score_)
print('\nNo of Splits : ',knn_brute_bow.n_splits_)
print('\nBest params : ',knn_brute_bow.best_params_)

In [None]:
optimal_k = knn_brute_bow.best_params_['n_neighbors']
AUC_max_brute_bow = knn_brute_bow.best_score_
print('optimal_k ',optimal_k)
print('AUC_max_brute_bow ',AUC_max_brute_bow)

In [None]:
#storing the AUC scores for each grid
AUC_train_bow = knn_brute_bow.cv_results_['mean_test_score']

In [None]:
#storing AUC for CV data for each neighbor
AUC_cv_bow = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc').fit(final_countsCV,y_CV).cv_results_['mean_test_score']

In [None]:
#plotting AUC for train and cv data

plt.plot(AUC_train_bow,hyperparameters['n_neighbors'],label='train_curve')
plt.plot(AUC_cv_bow,hyperparameters['n_neighbors'],label='cv_curve')
plt.title('ROC curve for train and cv data')
plt.xlabel('Hyperparameters : neighbors')
plt.ylabel('AUC Score')
plt.legend()
plt.show()


In [None]:
#plotting ROC curve for train and test data tuned with best hyperparameter
knn_bow_train = KNeighborsClassifier(n_neighbors=optimal_k)
knn_bow_train = knn_bow_train.fit(final_countsTrain,y_train)

y_pred_knn_bow_train = knn_bow_train.predict_proba(final_countsTrain)[:,1]
fpr_bi,tpr_bi,t = metrics.roc_curve(y_train,y_pred_knn_bow_train)

y_pred_knn_bow_test = knn_bow_train.predict_proba(final_countsTEST)[:,1]
fpr_bi2,tpr_bi2,t = metrics.roc_curve(y_test,y_pred_knn_bow_test)

plt.plot(fpr_bi,tpr_bi,label='Train ROC Curve')
plt.plot(fpr_bi2,tpr_bi2,label='Test ROC Curve')
plt.legend()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve of train and test data')
plt.show()

In [None]:
#Confusion matrix with default value of threshold i.e. 0.5

pred_knn_brute = knn_bow_train.predict(final_countsTEST)

cm = metrics.confusion_matrix(y_test,pred_knn_brute)

ax2 = plt.subplot()
sns.heatmap(cm,annot=True,ax=ax2,fmt='g')

In [None]:
#finding the best threshold for probability to be considered for confusion matrix to classify reviews
#decided on the basis of ratio of correct against wrong prediction

from sklearn.preprocessing import binarize

t = list(np.arange(0.1,1.1,0.1))
ratio=[]
for x in t:
    pred_class = binarize(pred_knn_brute.reshape(-1,1),threshold = x)
    tn, fp, fn, tp = metrics.confusion_matrix(y_test,pred_class).ravel()
    ratio.append((tp+tn)/(fp+fn))
#print(ratio)
optimal_t = t[ratio.index(max(ratio))]
print(optimal_t)



pred_class = binarize(pred_knn_brute.reshape(-1,1),threshold = optimal_t)
cm2 = metrics.confusion_matrix(y_test,pred_class)
plt.figure()
ax2 = plt.subplot()
sns.heatmap(cm2,annot=True,ax=ax2,fmt='g')

### [5.1.2] Applying KNN brute force on TFIDF,<font color='red'> SET 2</font>

In [None]:
# Please write all the code with proper documentation
                 
hyperparameters = {'n_neighbors':list(filter(lambda x : x % 2 != 0 , list(range(2,30))))}

knn_brute_tfidf = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc')
knn_brute_tfidf.fit(final_tf_idfTrain,y_train)

In [None]:
print('Best Estimator : ',knn_brute_tfidf.best_estimator_)
print('\nClasses : ',knn_brute_tfidf.classes_)
print('\nBest Score : ',knn_brute_tfidf.best_score_)
print('\nNo of Splits : ',knn_brute_tfidf.n_splits_)
print('\nBest params : ',knn_brute_tfidf.best_params_)

optimal_k_tfidf = knn_brute_tfidf.best_params_['n_neighbors']
AUC_max_brute_tfidf = knn_brute_tfidf.best_score_

print('\noptimal_k_tfidf : ',optimal_k_tfidf)
print('\nAUC_max_brute_tfidf : ',AUC_max_brute_tfidf)

In [None]:
#storing the AUC scores for each grid
AUC_train_tfidf = knn_brute_tfidf.cv_results_['mean_test_score']

In [None]:
#storing AUC for CV data for each neighbor
AUC_cv_tfidf = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc').fit(final_tf_idfCV,y_CV).cv_results_['mean_test_score']

#plotting AUC for train and cv data

plt.plot(AUC_train_tfidf,hyperparameters['n_neighbors'],label='train_curve')
plt.plot(AUC_cv_tfidf,hyperparameters['n_neighbors'],label='cv_curve')
plt.title('ROC curve for train and cv data')
plt.xlabel('Hyperparameters : neighbors')
plt.ylabel('AUC Score')
plt.legend()
plt.show()

In [None]:
#plotting ROC curve for train and test data tuned with best hyperparameter
knn_tfidf_train = KNeighborsClassifier(n_neighbors=optimal_k_tfidf)
knn_tfidf_train = knn_tfidf_train.fit(final_tf_idfTrain,y_train)

y_pred_knn_tfidf_train = knn_tfidf_train.predict_proba(final_tf_idfTrain)[:,1]
fpr_bi,tpr_bi,t = metrics.roc_curve(y_train,y_pred_knn_tfidf_train)

y_pred_knn_tfidf_test = knn_tfidf_train.predict_proba(final_tf_idfTEST)[:,1]
fpr_bi2,tpr_bi2,t = metrics.roc_curve(y_test,y_pred_knn_tfidf_test)

plt.plot(fpr_bi,tpr_bi,label='Train ROC Curve')
plt.plot(fpr_bi2,tpr_bi2,label='Test ROC Curve')
plt.legend()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve of train and test data')
plt.show()



In [None]:
#Confusion matrix with default value of threshold i.e. 0.5

pred_knn_brute_tfidf = knn_tfidf_train.predict(final_tf_idfTEST)

cm = metrics.confusion_matrix(y_test,pred_knn_brute_tfidf)

ax2 = plt.subplot()
sns.heatmap(cm,annot=True,ax=ax2,fmt='g')

In [None]:
#finding the best threshold for probability to be considered for confusion matrix to classify reviews
#decided on the basis of ratio of correct against wrong prediction

from sklearn.preprocessing import binarize

t = list(np.arange(0.1,1.1,0.1))
ratio=[]
for x in t:
    pred_class = binarize(pred_knn_brute_tfidf.reshape(-1,1),threshold = x)
    tn, fp, fn, tp = metrics.confusion_matrix(y_test,pred_class).ravel()
    ratio.append((tp+tn)/(fp+fn))
#print(ratio)
optimal_t_tfidf = t[ratio.index(max(ratio))]
print(optimal_t_tfidf)

pred_class = binarize(pred_knn_brute_tfidf.reshape(-1,1),threshold = optimal_t_tfidf)
cm2 = metrics.confusion_matrix(y_test,pred_class)
plt.figure()
ax2 = plt.subplot()
sns.heatmap(cm2,annot=True,ax=ax2,fmt='g')

### [5.1.3] Applying KNN brute force on AVG W2V,<font color='red'> SET 3</font>

In [None]:
hyperparameters = {'n_neighbors':list(filter(lambda x : x % 2 != 0 , list(range(2,30))))}

knn_brute_avgw2v = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc')
knn_brute_avgw2v.fit(sent_vectors_train,y_train)

print('Best Estimator : ',knn_brute_avgw2v.best_estimator_)
print('\nClasses : ',knn_brute_avgw2v.classes_)
print('\nBest Score : ',knn_brute_avgw2v.best_score_)
print('\nNo of Splits : ',knn_brute_avgw2v.n_splits_)
print('\nBest params : ',knn_brute_avgw2v.best_params_)

optimal_k_avgw2v = knn_brute_avgw2v.best_params_['n_neighbors']
AUC_max_brute_avgw2v = knn_brute_avgw2v.best_score_

print('optimal_k_avgw2v ',optimal_k_avgw2v)

#storing the AUC scores for each grid
AUC_train_avgw2v = knn_brute_avgw2v.cv_results_['mean_test_score']


In [None]:
#storing AUC for CV data for each neighbor
AUC_cv_avgw2v = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc').fit(sent_vectors_cv,y_CV).cv_results_['mean_test_score']


In [None]:
#plotting AUC for train and cv data

plt.plot(AUC_train_avgw2v,hyperparameters['n_neighbors'],label='train_curve')
plt.plot(AUC_cv_avgw2v,hyperparameters['n_neighbors'],label='cv_curve')
plt.title('ROC curve for train and cv data')
plt.xlabel('Hyperparameters : neighbors')
plt.ylabel('AUC Score')
plt.legend()
plt.show()

In [None]:
#plotting ROC curve for train and test data tuned with best hyperparameter
knn_avgw2v_train = KNeighborsClassifier(n_neighbors=optimal_k_avgw2v)
knn_avgw2v_train = knn_avgw2v_train.fit(sent_vectors_train,y_train)

y_pred_knn_avgw2v_train = knn_avgw2v_train.predict_proba(sent_vectors_train)[:,1]
fpr_bi,tpr_bi,t = metrics.roc_curve(y_train,y_pred_knn_avgw2v_train)

y_pred_knn_avgw2v_test = knn_avgw2v_train.predict_proba(sent_vectors_test)[:,1]
fpr_bi2,tpr_bi2,t = metrics.roc_curve(y_test,y_pred_knn_avgw2v_test)

plt.plot(fpr_bi,tpr_bi,label='Train ROC Curve')
plt.plot(fpr_bi2,tpr_bi2,label='Test ROC Curve')
plt.legend()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve of train and test data')
plt.show()


In [None]:
#Confusion matrix with default value of threshold i.e. 0.5

pred_knn_brute_avgw2v = knn_avgw2v_train.predict(sent_vectors_test)

cm = metrics.confusion_matrix(y_test,pred_knn_brute_avgw2v)

ax2 = plt.subplot()
sns.heatmap(cm,annot=True,ax=ax2,fmt='g')


In [None]:
#finding the best threshold for probability to be considered for confusion matrix to classify reviews
#decided on the basis of ratio of correct against wrong prediction

from sklearn.preprocessing import binarize

t = list(np.arange(0.1,1.1,0.1))
ratio=[]
for x in t:
    pred_class = binarize(pred_knn_brute_avgw2v.reshape(-1,1),threshold = x)
    tn, fp, fn, tp = metrics.confusion_matrix(y_test,pred_class).ravel()
    ratio.append((tp+tn)/(fp+fn))
#print(ratio)
optimal_t_avgw2v = t[ratio.index(max(ratio))]
print(optimal_t_avgw2v)

pred_class = binarize(pred_knn_brute_avgw2v.reshape(-1,1),threshold = optimal_t_avgw2v)
cm2 = metrics.confusion_matrix(y_test,pred_class)
plt.figure()
ax2 = plt.subplot()
sns.heatmap(cm2,annot=True,ax=ax2,fmt='g')

### [5.1.4] Applying KNN brute force on TFIDF W2V,<font color='red'> SET 4</font>

In [None]:
# Please write all the code with proper documentation

hyperparameters = {'n_neighbors':list(filter(lambda x : x % 2 != 0 , list(range(2,30))))}

knn_brute_tfw2v = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc')
knn_brute_tfw2v.fit(tfidf_sent_vectors_train,y_train)

print('Best Estimator : ',knn_brute_tfw2v.best_estimator_)
print('\nClasses : ',knn_brute_tfw2v.classes_)
print('\nBest Score : ',knn_brute_tfw2v.best_score_)
print('\nNo of Splits : ',knn_brute_tfw2v.n_splits_)
print('\nBest params : ',knn_brute_tfw2v.best_params_)

optimal_k_tfw2v = knn_brute_tfw2v.best_params_['n_neighbors']
print(optimal_k_tfw2v)

In [None]:
AUC_max_brute_tfw2v = knn_brute_tfw2v.best_score_

In [None]:
#storing the AUC scores for train data for each grid
AUC_train_tfw2v = knn_brute_tfw2v.cv_results_['mean_test_score']

#storing AUC for CV data for each neighbor
AUC_cv_tfw2v = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc').fit(tfidf_sent_vectors_CV,y_CV).cv_results_['mean_test_score']


In [None]:
#plotting AUC for train and cv data

plt.plot(AUC_train_tfw2v,hyperparameters['n_neighbors'],label='train_curve')
plt.plot(AUC_cv_tfw2v,hyperparameters['n_neighbors'],label='cv_curve')
plt.title('ROC curve for train and cv data')
plt.xlabel('Hyperparameters : neighbors')
plt.ylabel('AUC Score')
plt.legend()
plt.show()


In [None]:
#plotting ROC curve for train and test data tuned with best hyperparameter
knn_tfw2v_train = KNeighborsClassifier(n_neighbors=optimal_k_tfw2v)
knn_tfw2v_train = knn_tfw2v_train.fit(tfidf_sent_vectors_train,y_train)

y_pred_knn_tfw2v_train = knn_tfw2v_train.predict_proba(tfidf_sent_vectors_train)[:,1]
fpr_bi,tpr_bi,t = metrics.roc_curve(y_train,y_pred_knn_tfw2v_train)

y_pred_knn_tfw2v_test = knn_tfw2v_train.predict_proba(tfidf_sent_vectors_test)[:,1]
fpr_bi2,tpr_bi2,t = metrics.roc_curve(y_test,y_pred_knn_tfw2v_test)

plt.plot(fpr_bi,tpr_bi,label='Train ROC Curve')
plt.plot(fpr_bi2,tpr_bi2,label='Test ROC Curve')
plt.legend()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve of train and test data')
plt.show()

In [None]:
#Confusion matrix with default value of threshold i.e. 0.5

pred_knn_brute_tfw2v = knn_tfw2v_train.predict(tfidf_sent_vectors_test)

cm = metrics.confusion_matrix(y_test,pred_knn_brute_tfw2v)

ax2 = plt.subplot()
sns.heatmap(cm,annot=True,ax=ax2,fmt='g')


In [None]:
#finding the best threshold for probability to be considered for confusion matrix to classify reviews
#decided on the basis of ratio of correct against wrong prediction

from sklearn.preprocessing import binarize

t = list(np.arange(0.1,1.1,0.1))
ratio=[]
for x in t:
    pred_class = binarize(pred_knn_brute_tfw2v.reshape(-1,1),threshold = x)
    tn, fp, fn, tp = metrics.confusion_matrix(y_test,pred_class).ravel()
    ratio.append((tp+tn)/(fp+fn))
#print(ratio)
optimal_t_tfw2v = t[ratio.index(max(ratio))]
print(optimal_t_tfw2v)

pred_class = binarize(pred_knn_brute_tfw2v.reshape(-1,1),threshold = optimal_t_tfw2v)
cm2 = metrics.confusion_matrix(y_test,pred_class)
plt.figure()
ax2 = plt.subplot()
sns.heatmap(cm2,annot=True,ax=ax2,fmt='g')

## [5.2] Applying KNN kd-tree

# Vectorizing data with different parameters

In [None]:
#BoW
count_vect_kd = CountVectorizer(min_df=10,max_features=500) #in scikit-learn
count_vect_kd.fit(X_train)
print("some feature names ", count_vect_kd.get_feature_names()[:10])
print('='*50)

#Vectorizing the train 
final_countsTrain_kd = count_vect_kd.transform(X_train)

bow_features_kd = count_vect_kd.get_feature_names()

print("the type of count vectorizer ",type(final_countsTrain))
print("the shape of out text BOW vectorizer ",final_countsTrain.get_shape())
print("the number of unique words ", final_countsTrain.get_shape()[1])

#Vectorizing the CV data
final_countsCV_kd = count_vect_kd.transform(X_CV)

#Vectorizing the test data

final_countsTEST_kd = count_vect_kd.transform(X_test)

print('\n...........................TFIDF Vectorization starts.............................\n')
## [4.3] TF-IDF

tf_idf_vect_kd = TfidfVectorizer(ngram_range=(1,2), min_df=10,max_features=500)
tf_idf_vect_kd.fit(X_train)
print("some sample features(unique words in the corpus)",tf_idf_vect_kd.get_feature_names()[0:10])
print('='*50)

features_tf_idf_kd = tf_idf_vect_kd.get_feature_names()

final_tf_idfTrain_kd = tf_idf_vect_kd.transform(X_train)
print("the type of count vectorizer ",type(final_tf_idfTrain_kd))
print("the shape of out text TFIDF vectorizer ",final_tf_idfTrain_kd.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tf_idfTrain_kd.get_shape()[1])

#vectorizing train,test and cv data

final_tf_idfCV_kd = tf_idf_vect_kd.transform(X_CV)
final_tf_idfTEST_kd = tf_idf_vect_kd.transform(X_test)

### [5.2.1] Applying KNN kd-tree on BOW,<font color='red'> SET 5</font>

In [None]:
#converting bow sparse matrix to dense

final_countsTrain_kd = final_countsTrain_kd.toarray()
final_countsCV_kd = final_countsCV_kd.toarray()
final_countsTEST_kd = final_countsTEST_kd.toarray()

In [None]:
# Please write all the code with proper documentation

hyperparameters = {'n_neighbors':list(filter(lambda x : x % 2 != 0 , list(range(2,30)))),'algorithm':['kd_tree']}

knn_kd_bow = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc')
knn_kd_bow.fit(final_countsTrain_kd,y_train)

print('Best Estimator : ',knn_kd_bow.best_estimator_)
print('\nClasses : ',knn_kd_bow.classes_)
print('\nBest Score : ',knn_kd_bow.best_score_)
print('\nNo of Splits : ',knn_kd_bow.n_splits_)
print('\nBest params : ',knn_kd_bow.best_params_)

optimal_k_bow_kd = knn_kd_bow.best_params_['n_neighbors']
print(optimal_k_bow_kd)

In [None]:
AUC_max_bow_kd = knn_kd_bow.best_score_

In [None]:
#storing the AUC scores for each grid
AUC_train_bow_kd = knn_kd_bow.cv_results_['mean_test_score']

#storing AUC for CV data for each neighbor
AUC_cv_bow_kd = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc').fit(final_countsCV_kd,y_CV).cv_results_['mean_test_score']


In [None]:
#plotting AUC for train and cv data

plt.plot(AUC_train_bow_kd,hyperparameters['n_neighbors'],label='train_curve')
plt.plot(AUC_cv_bow_kd,hyperparameters['n_neighbors'],label='cv_curve')
plt.title('ROC curve for train and cv data')
plt.xlabel('Hyperparameters : neighbors')
plt.ylabel('AUC Score')
plt.legend()
plt.show()

In [None]:
#plotting ROC curve for train and test data tuned with best hyperparameter
knn_bow_train_kd = KNeighborsClassifier(n_neighbors=optimal_k_bow_kd)
knn_bow_train_kd = knn_bow_train_kd.fit(final_countsTrain_kd,y_train)

y_pred_knn_bow_train_kd = knn_bow_train_kd.predict_proba(final_countsTrain_kd)[:,1]
fpr_bi,tpr_bi,t = metrics.roc_curve(y_train,y_pred_knn_bow_train_kd)

y_pred_knn_bow_test = knn_bow_train_kd.predict_proba(final_countsTEST_kd)[:,1]
fpr_bi2,tpr_bi2,t = metrics.roc_curve(y_test,y_pred_knn_bow_test)

plt.plot(fpr_bi,tpr_bi,label='Train ROC Curve')
plt.plot(fpr_bi2,tpr_bi2,label='Test ROC Curve')
plt.legend()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve of train and test data')
plt.show()

In [None]:
#Confusion matrix with default value of threshold i.e. 0.5

pred_knn_kd_bow = knn_bow_train_kd.predict(final_countsTEST_kd)

cm = metrics.confusion_matrix(y_test,pred_knn_kd_bow)

ax2 = plt.subplot()
sns.heatmap(cm,annot=True,ax=ax2,fmt='g')

In [None]:
#finding the best threshold for probability to be considered for confusion matrix to classify reviews
#decided on the basis of ratio of correct against wrong prediction

from sklearn.preprocessing import binarize

t = list(np.arange(0.1,1.1,0.1))
ratio=[]
for x in t:
    pred_class = binarize(pred_knn_kd_bow.reshape(-1,1),threshold = x)
    tn, fp, fn, tp = metrics.confusion_matrix(y_test,pred_class).ravel()
    ratio.append((tp+tn)/(fp+fn))
#print(ratio)
optimal_t_bow_kd = t[ratio.index(max(ratio))]
print(optimal_t_bow_kd)

pred_class = binarize(pred_knn_kd_bow.reshape(-1,1),threshold = optimal_t_bow_kd)
cm2 = metrics.confusion_matrix(y_test,pred_class)
plt.figure()
ax2 = plt.subplot()
sns.heatmap(cm2,annot=True,ax=ax2,fmt='g')

### [5.2.2] Applying KNN kd-tree on TFIDF,<font color='red'> SET 6</font>

In [None]:
#converting bow sparse matrix to dense

final_tf_idfTrain_kd = final_tf_idfTrain_kd.toarray()
final_tf_idfCV_kd = final_tf_idfCV_kd.toarray()
final_tf_idfTEST_kd = final_tf_idfTEST_kd.toarray()

In [None]:
# Please write all the code with proper documentation
hyperparameters = {'n_neighbors':list(filter(lambda x : x % 2 != 0 , list(range(2,30)))),'algorithm':['kd_tree']}

knn_kd_tfidf = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc')
knn_kd_tfidf.fit(final_tf_idfTrain_kd,y_train)

print('Best Estimator : ',knn_kd_tfidf.best_estimator_)
print('\nClasses : ',knn_kd_tfidf.classes_)
print('\nBest Score : ',knn_kd_tfidf.best_score_)
print('\nNo of Splits : ',knn_kd_tfidf.n_splits_)
print('\nBest params : ',knn_kd_tfidf.best_params_)

optimal_k_tfidf_kd = knn_kd_tfidf.best_params_['n_neighbors']
print(optimal_k_tfidf_kd)


In [None]:
AUC_max_tfidf_kd = knn_kd_tfidf.best_score_

In [None]:
#storing the AUC scores for each grid
AUC_train_tfidf_kd = knn_kd_tfidf.cv_results_['mean_test_score']

#storing AUC for CV data for each neighbor
AUC_cv_tfidf_kd = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc').fit(final_tf_idfCV_kd,y_CV).cv_results_['mean_test_score']


In [None]:
#plotting AUC for train and cv data

plt.plot(AUC_train_tfidf_kd,hyperparameters['n_neighbors'],label='train_curve')
plt.plot(AUC_cv_tfidf_kd,hyperparameters['n_neighbors'],label='cv_curve')
plt.title('ROC curve for train and cv data')
plt.xlabel('Hyperparameters : neighbors')
plt.ylabel('AUC Score')
plt.legend()
plt.show()

In [None]:
#plotting ROC curve for train and test data tuned with best hyperparameter
knn_tfidf_train_kd = KNeighborsClassifier(n_neighbors=optimal_k_tfidf_kd)
knn_tfidf_train_kd = knn_tfidf_train_kd.fit(final_tf_idfTrain_kd,y_train)

y_pred_knn_tfidf_train_kd = knn_tfidf_train_kd.predict_proba(final_tf_idfTrain_kd)[:,1]
fpr_bi,tpr_bi,t = metrics.roc_curve(y_train,y_pred_knn_tfidf_train_kd)

y_pred_knn_tfidf_test = knn_tfidf_train_kd.predict_proba(final_tf_idfTEST_kd)[:,1]
fpr_bi2,tpr_bi2,t = metrics.roc_curve(y_test,y_pred_knn_tfidf_test)

plt.plot(fpr_bi,tpr_bi,label='Train ROC Curve')
plt.plot(fpr_bi2,tpr_bi2,label='Test ROC Curve')
plt.legend()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve of train and test data')
plt.show()


In [None]:
#Confusion matrix with default value of threshold i.e. 0.5

pred_knn_kd_tfidf = knn_tfidf_train_kd.predict(final_tf_idfTEST_kd)

cm = metrics.confusion_matrix(y_test,pred_knn_kd_tfidf)

ax2 = plt.subplot()
sns.heatmap(cm,annot=True,ax=ax2,fmt='g')

In [None]:
#finding the best threshold for probability to be considered for confusion matrix to classify reviews
#decided on the basis of ratio of correct against wrong prediction

from sklearn.preprocessing import binarize

t = list(np.arange(0.1,1.1,0.1))
ratio=[]
for x in t:
    pred_class = binarize(pred_knn_kd_tfidf.reshape(-1,1),threshold = x)
    tn, fp, fn, tp = metrics.confusion_matrix(y_test,pred_class).ravel()
    ratio.append((tp+tn)/(fp+fn))
#print(ratio)
optimal_t_tfidf_kd = t[ratio.index(max(ratio))]
print(optimal_t_tfidf_kd)

pred_class = binarize(pred_knn_kd_tfidf.reshape(-1,1),threshold = optimal_t_tfidf_kd)
cm2 = metrics.confusion_matrix(y_test,pred_class)
plt.figure()
ax2 = plt.subplot()
sns.heatmap(cm2,annot=True,ax=ax2,fmt='g')

### [5.2.3] Applying KNN kd-tree on AVG W2V,<font color='red'> SET 3</font>

In [None]:
# Please write all the code with proper documentation

# Please write all the code with proper documentation
                 
hyperparameters = {'n_neighbors':list(filter(lambda x : x % 2 != 0 , list(range(2,30)))),'algorithm':['kd_tree']}

knn_kd_avgw2v = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc')
knn_kd_avgw2v.fit(sent_vectors_train,y_train)

print('Best Estimator : ',knn_kd_avgw2v.best_estimator_)
print('\nClasses : ',knn_kd_avgw2v.classes_)
print('\nBest Score : ',knn_kd_avgw2v.best_score_)
print('\nNo of Splits : ',knn_kd_avgw2v.n_splits_)
print('\nBest params : ',knn_kd_avgw2v.best_params_)

optimal_k_avgw2v_kd = knn_kd_avgw2v.best_params_['n_neighbors']
print('optimal_k for avgw2v kd tree version ',optimal_k_avgw2v_kd)


In [None]:
AUC_max_avgw2v_kd = knn_kd_avgw2v.best_score_

In [None]:
#storing the AUC scores for each grid
AUC_train_avgw2v_kd = knn_kd_avgw2v.cv_results_['mean_test_score']

#storing AUC for CV data for each neighbor
AUC_cv_avgw2v_kd = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc').fit(sent_vectors_cv,y_CV).cv_results_['mean_test_score']


In [None]:
#plotting AUC for train and cv data

plt.plot(AUC_train_avgw2v_kd,hyperparameters['n_neighbors'],label='train_curve')
plt.plot(AUC_cv_avgw2v_kd,hyperparameters['n_neighbors'],label='cv_curve')
plt.title('ROC curve for train and cv data')
plt.xlabel('Hyperparameters : neighbors')
plt.ylabel('AUC Score')
plt.legend()
plt.show()

In [None]:
#plotting ROC curve for train and test data tuned with best hyperparameter
knn_avgw2v_train_kd = KNeighborsClassifier(n_neighbors=optimal_k_avgw2v_kd)
knn_avgw2v_train_kd = knn_avgw2v_train_kd.fit(sent_vectors_train,y_train)

y_pred_knn_avgw2v_train_kd = knn_avgw2v_train_kd.predict_proba(sent_vectors_train)[:,1]
fpr_bi,tpr_bi,t = metrics.roc_curve(y_train,y_pred_knn_avgw2v_train_kd)

y_pred_knn_avgw2v_test = knn_avgw2v_train_kd.predict_proba(sent_vectors_test)[:,1]
fpr_bi2,tpr_bi2,t = metrics.roc_curve(y_test,y_pred_knn_avgw2v_test)

plt.plot(fpr_bi,tpr_bi,label='Train ROC Curve')
plt.plot(fpr_bi2,tpr_bi2,label='Test ROC Curve')
plt.legend()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve of train and test data')
plt.show()

In [None]:
#Confusion matrix with default value of threshold i.e. 0.5

pred_knn_kd_avgw2v = knn_avgw2v_train_kd.predict(sent_vectors_test)

cm = metrics.confusion_matrix(y_test,pred_knn_kd_avgw2v)

ax2 = plt.subplot()
sns.heatmap(cm,annot=True,ax=ax2,fmt='g')

In [None]:
#finding the best threshold for probability to be considered for confusion matrix to classify reviews
#decided on the basis of ratio of correct against wrong prediction

from sklearn.preprocessing import binarize

t = list(np.arange(0.1,1.1,0.1))
ratio=[]
for x in t:
    pred_class = binarize(pred_knn_kd_avgw2v.reshape(-1,1),threshold = x)
    tn, fp, fn, tp = metrics.confusion_matrix(y_test,pred_class).ravel()
    ratio.append((tp+tn)/(fp+fn))
#print(ratio)
optimal_t_avgw2v_kd = t[ratio.index(max(ratio))]
print(optimal_t_avgw2v_kd)

pred_class = binarize(pred_knn_kd_avgw2v.reshape(-1,1),threshold = optimal_t_avgw2v_kd)
cm2 = metrics.confusion_matrix(y_test,pred_class)
plt.figure()
ax2 = plt.subplot()
sns.heatmap(cm2,annot=True,ax=ax2,fmt='g')

### [5.2.4] Applying KNN kd-tree on TFIDF W2V,<font color='red'> SET 4</font>

In [None]:
# Please write all the code with proper documentation

hyperparameters = {'n_neighbors':list(filter(lambda x : x % 2 != 0 , list(range(2,30)))),'algorithm':['kd_tree']}

knn_kd_tfw2v = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc')
knn_kd_tfw2v.fit(tfidf_sent_vectors_train,y_train)

print('Best Estimator : ',knn_kd_tfw2v.best_estimator_)
print('\nClasses : ',knn_kd_tfw2v.classes_)
print('\nBest Score : ',knn_kd_tfw2v.best_score_)
print('\nNo of Splits : ',knn_kd_tfw2v.n_splits_)
print('\nBest params : ',knn_kd_tfw2v.best_params_)

optimal_k_tfw2v_kd = knn_kd_tfw2v.best_params_['n_neighbors']
print(optimal_k_tfw2v_kd)

#storing the AUC scores for each grid
AUC_train_tfw2v_kd = knn_kd_tfw2v.cv_results_['mean_test_score']


In [None]:
AUC_max_tfw2v_kd = knn_kd_tfw2v.best_score_

In [None]:
#storing AUC for CV data for each neighbor
AUC_cv_tfw2v_kd = GridSearchCV(KNeighborsClassifier(),param_grid=hyperparameters,cv=5,scoring='roc_auc').fit(tfidf_sent_vectors_CV,y_CV).cv_results_['mean_test_score']


In [None]:
#plotting AUC for train and cv data

plt.plot(AUC_train_tfw2v_kd,hyperparameters['n_neighbors'],label='train_curve')
plt.plot(AUC_cv_tfw2v_kd,hyperparameters['n_neighbors'],label='cv_curve')
plt.title('ROC curve for train and cv data')
plt.xlabel('Hyperparameters : neighbors')
plt.ylabel('AUC Score')
plt.legend()
plt.show()

In [None]:
#plotting ROC curve for train and test data tuned with best hyperparameter
knn_tfw2v_train_kd = KNeighborsClassifier(n_neighbors=optimal_k_tfw2v_kd)
knn_tfw2v_train_kd = knn_tfw2v_train_kd.fit(tfidf_sent_vectors_train,y_train)

y_pred_knn_tfw2v_train_kd = knn_tfw2v_train_kd.predict_proba(tfidf_sent_vectors_train)[:,1]
fpr_bi,tpr_bi,t = metrics.roc_curve(y_train,y_pred_knn_tfw2v_train_kd)

y_pred_knn_tfw2v_test = knn_tfw2v_train_kd.predict_proba(tfidf_sent_vectors_test)[:,1]
fpr_bi2,tpr_bi2,t = metrics.roc_curve(y_test,y_pred_knn_tfw2v_test)

plt.plot(fpr_bi,tpr_bi,label='Train ROC Curve')
plt.plot(fpr_bi2,tpr_bi2,label='Test ROC Curve')
plt.legend()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve of train and test data')
plt.show()

In [None]:
#Confusion matrix with default value of threshold i.e. 0.5

pred_knn_kd_tfw2v = knn_tfw2v_train_kd.predict(tfidf_sent_vectors_test)

cm = metrics.confusion_matrix(y_test,pred_knn_kd_tfw2v)

ax2 = plt.subplot()
sns.heatmap(cm,annot=True,ax=ax2,fmt='g')


In [None]:
#finding the best threshold for probability to be considered for confusion matrix to classify reviews
#decided on the basis of ratio of correct against wrong prediction

from sklearn.preprocessing import binarize

t = list(np.arange(0.1,1.1,0.1))
ratio=[]
for x in t:
    pred_class = binarize(pred_knn_kd_tfw2v.reshape(-1,1),threshold = x)
    tn, fp, fn, tp = metrics.confusion_matrix(y_test,pred_class).ravel()
    ratio.append((tp+tn)/(fp+fn))
#print(ratio)
optimal_t_tfw2v_kd = t[ratio.index(max(ratio))]
print(optimal_t_tfw2v_kd)

pred_class = binarize(pred_knn_kd_tfw2v.reshape(-1,1),threshold = optimal_t_tfw2v_kd)
cm2 = metrics.confusion_matrix(y_test,pred_class)
plt.figure()
ax2 = plt.subplot()
sns.heatmap(cm2,annot=True,ax=ax2,fmt='g')

# [6] Conclusions

In [None]:
# Please compare all your models using Prettytable library

from prettytable import PrettyTable

x = PrettyTable(border=True)

x.field_names = ['Vectorizer','Model','Version','Hyper Parameter','AUC','Threshold(CM)']

x.add_row(['BOW','KNN','Brute Force',optimal_k,AUC_max_brute_bow,optimal_t])
x.add_row(['TF IDF','KNN','Brute Force',optimal_k_tfidf,AUC_max_brute_tfidf,optimal_t_tfidf])
x.add_row(['AVG W2V','KNN','Brute Force',optimal_k_avgw2v,AUC_max_brute_avgw2v,optimal_t_avgw2v])
x.add_row(['TF-IDF W2V','KNN','Brute Force',optimal_k_tfw2v,AUC_max_brute_tfw2v,optimal_t_avgw2v])

x.add_row(['----------------','----------------','---------------','-------------------','---------------','------------------'])

x.add_row(['BOW','KNN','KD Tree',optimal_k_bow_kd,AUC_max_bow_kd,optimal_t_bow_kd])
x.add_row(['TF IDF','KNN','KD Tree',optimal_k_tfidf_kd,AUC_max_tfidf_kd,optimal_t_tfidf_kd])
x.add_row(['AVG W2V','KNN','KD Tree',optimal_k_avgw2v_kd,AUC_max_avgw2v_kd,optimal_t_avgw2v_kd])
x.add_row(['TF-IDF W2V','KNN','KD Tree',optimal_k_tfw2v_kd,AUC_max_tfidf_kd,optimal_t_avgw2v_kd])


print(x)