## Amazon Fine Food Reviews Analysis
Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:

1 Id

2 ProductId - unique identifier for the product

3 UserId - unique identifier for the user

4 ProfileName

5 HelpfulnessNumerator - number of users who found the review helpful

6 HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not

7 Score - rating between 1 and 5

8 Time - timestamp for the review

9 Summary - brief summary of the review

10  Text - text of the review


Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).


[Q] How to determine if a review is positive or negative?

[Ans] We could use Score/Rating. A rating of 4 or 5 can be considered a positive review, and a rating of 1 or 2 a negative one. Reviews with a rating of 3 are considered neutral and are excluded from our analysis. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.

## [1] Reading Data
[1.1] Loading the data
The dataset is available in two forms:

* .csv file
* SQLite database

To load the data, we use the SQLite database because it makes the data easier to query and visualise.

Since we only want the overall sentiment of each review (positive or negative), we deliberately ignore all reviews with a Score equal to 3. If the score is above 3, the review is labeled "positive". Otherwise, it is labeled "negative".
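The labeling rule described above can be sketched in plain Python. This is a minimal illustration rather than the notebook's own code; `label_from_score` is a hypothetical helper name and the scores are made up.

```python
from typing import Optional

def label_from_score(score: int) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative) or None (neutral, dropped)."""
    if score == 3:
        return None
    return 1 if score > 3 else 0

scores = [5, 1, 3, 4, 2]                      # made-up ratings
labels = [label_from_score(s) for s in scores]
kept = [l for l in labels if l is not None]   # neutral reviews are excluded
print(labels)
print(kept)
```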

In [2]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os



In [3]:
# using SQLite Table to read data.
con = sqlite3.connect('database.sqlite') 

# filtering only positive and negative reviews i.e. 
# not taking into consideration those reviews with Score=3
# SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000, will give top 500000 data points
# you can change the number to any other number based on your computing power

# filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000""", con) 
# for tsne assignment you can take 5k data points

filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 5000""", con) 

In [4]:
# Give reviews with Score>3 a positive rating(1), and reviews with a score<3 a negative rating(0).
def partition(x):
    if x < 3:
        return 0
    return 1

# map scores above 3 to positive (1) and scores below 3 to negative (0)
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition) 
filtered_data['Score'] = positiveNegative
print("Number of data points in our data", filtered_data.shape)
filtered_data.head(3)

Number of data points in our data (5000, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


In [5]:
display = pd.read_sql_query("""
SELECT UserId, ProductId, ProfileName, Time, Score, Text, COUNT(*)
FROM Reviews
GROUP BY UserId
HAVING COUNT(*)>1
""", con)

In [6]:
print(display.shape)
display.head()

(80668, 7)


Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
0,#oc-R115TNMSPFT9I7,B005ZBZLT4,Breyton,1331510400,2,Overall its just OK when considering the price...,2
1,#oc-R11D9D7SHXIJB9,B005HG9ESG,"Louis E. Emory ""hoppy""",1342396800,5,"My wife has recurring extreme muscle spasms, u...",3
2,#oc-R11DNU2NBKQ23Z,B005ZBZLT4,Kim Cieszykowski,1348531200,1,This coffee is horrible and unfortunately not ...,2
3,#oc-R11O5J5ZVQE25C,B005HG9ESG,Penguin Chick,1346889600,5,This will be the bottle that you grab from the...,3
4,#oc-R12KPBODL2B5ZD,B007OSBEV0,Christopher P. Presta,1348617600,1,I didnt like this coffee. Instead of telling y...,2


In [7]:

display[display['UserId']=='AZY10LLTJ71NX']

Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
80638,AZY10LLTJ71NX,B001ATMQK2,"undertheshrine ""undertheshrine""",1296691200,5,I bought this 6 pack because for the price tha...,5


In [8]:
display['COUNT(*)'].sum()

393063
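The GROUP BY / HAVING pattern used above can be tried on a tiny in-memory SQLite table. The rows below are invented for illustration. Note that when a grouped query in SQLite also selects non-aggregated columns (such as ProductId or Text above), their values come from one arbitrary row of each group, so only UserId and the count are selected in this sketch.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (UserId TEXT, ProductId TEXT, Score INTEGER)")
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [("u1", "p1", 5), ("u1", "p2", 4), ("u2", "p1", 1),
     ("u3", "p3", 2), ("u3", "p4", 5), ("u3", "p5", 5)],  # made-up rows
)

# Keep only users who wrote more than one review, with their review count.
rows = con.execute(
    """
    SELECT UserId, COUNT(*) AS n_reviews
    FROM Reviews
    GROUP BY UserId
    HAVING COUNT(*) > 1
    ORDER BY UserId
    """
).fetchall()
total = sum(n for _, n in rows)  # total reviews written by repeat reviewers
con.close()
print(rows, total)
```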

## [2] Exploratory Data Analysis
[2.1] Data Cleaning: Deduplication
It is observed (as shown in the table below) that the reviews data had many duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data. Following is an example:

In [9]:

display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductID
""", con)
display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


As can be seen above, the same user has multiple reviews with identical values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On further analysis it was found that:

ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)

ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on

It was inferred after analysis that reviews with same parameters other than ProductId belonged to the same product just having different flavour or quantity. Hence in order to reduce redundancy it was decided to eliminate the rows having same parameters.

The method used was to first sort the data by ProductId and then keep only the first review in each group of duplicates, deleting the others. For example, in the table above only the review for ProductId=B000HDL1RQ remains. Sorting first ensures that there is only one representative for each product; deduplicating without sorting could leave different representatives for the same product.
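The sort-then-keep-first idea can be sketched without pandas. This is an illustration under assumed, made-up rows; `deduplicate` is a hypothetical helper, and the key columns mirror the ones used for deduplication in this notebook (UserId, ProfileName, Time, Text).

```python
reviews = [  # made-up rows; the first two are the same review under two ProductIds
    {"ProductId": "B2", "UserId": "A1", "ProfileName": "G", "Time": 100, "Text": "tasty wafers"},
    {"ProductId": "B1", "UserId": "A1", "ProfileName": "G", "Time": 100, "Text": "tasty wafers"},
    {"ProductId": "B3", "UserId": "A2", "ProfileName": "H", "Time": 200, "Text": "too sweet"},
]

def deduplicate(rows):
    """Sort by ProductId, then keep the first row for each duplicate key."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: r["ProductId"]):
        key = (row["UserId"], row["ProfileName"], row["Time"], row["Text"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

unique_reviews = deduplicate(reviews)
print([r["ProductId"] for r in unique_reviews])
```

Because the rows are sorted before the scan, the surviving representative is always the one with the smallest ProductId, regardless of the input order.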

In [10]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [11]:

#Deduplication of entries
final=sorted_data.drop_duplicates(subset=["UserId","ProfileName","Time","Text"], keep='first', inplace=False)
final.shape

(4986, 10)

In [12]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

99.72

Observation: In the two rows shown below, the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not possible in practice, so these two rows are also removed from the calculations.

In [13]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND (Id=44737 OR Id=64422)
ORDER BY ProductID
""", con)

display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...
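In SQL, AND binds more tightly than OR, so a WHERE clause that mixes the two without parentheses may select more rows than intended. A small sketch on a made-up in-memory table (the Ids and Scores are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER)")
con.executemany("INSERT INTO Reviews VALUES (?, ?)", [(1, 5), (2, 3), (3, 4)])

# Parsed as (Score != 3 AND Id = 1) OR Id = 2, so the neutral row Id=2 slips through.
unparenthesized = con.execute(
    "SELECT Id FROM Reviews WHERE Score != 3 AND Id = 1 OR Id = 2 ORDER BY Id"
).fetchall()

# The parentheses apply the Score filter to both Ids.
parenthesized = con.execute(
    "SELECT Id FROM Reviews WHERE Score != 3 AND (Id = 1 OR Id = 2) ORDER BY Id"
).fetchall()
con.close()
print(unparenthesized, parenthesized)
```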


In [14]:

final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]


#Before starting the next phase of preprocessing lets see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(4986, 10)


1    4178
0     808
Name: Score, dtype: int64


## [3] Preprocessing
[3.1]. Preprocessing Review Text
Now that we have finished deduplication our data requires some preprocessing before we go on further with analysis and making the prediction model.

Hence in the Preprocessing phase we do the following in the order below:-

1 Begin by removing the html tags

2 Remove any punctuations or limited set of special characters like , or . or # etc.

3 Check if the word is made up of english letters and is not alpha-numeric

4 Check to see if the length of the word is greater than 2 (as it was researched that there is no adjective in 2-letters)

5 Convert the word to lowercase

6 Remove Stopwords

7 Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming)

8 After which we collect the words used to describe positive and negative reviews


In [15]:
# printing some random reviews
sent_0 = final['Text'].values[0]
print(sent_0)
print("="*50)

sent_1000 = final['Text'].values[1000]
print(sent_1000)
print("="*50)

sent_1500 = final['Text'].values[1500]
print(sent_1500)
print("="*50)

sent_4900 = final['Text'].values[4900]
print(sent_4900)
print("="*50)

Why is this $[...] when the same product is available for $[...] here?<br />http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
I recently tried this flavor/brand and was surprised at how delicious these chips are.  The best thing was that there were a lot of "brown" chips in the bsg (my favorite), so I bought some more through amazon and shared with family and friends.  I am a little disappointed that there are not, so far, very many brown chips in these bags, but the flavor is still very good.  I like them better than the yogurt and green onion flavor because they do not seem to be as salty, and the onion flavor is better.  If you haven't eaten Kettle chips before, I recommend that you try a bag before buying bulk.  They are thicker and crunchier than Lays but just as fresh out of the bag.
Wow.  So far, two two-star reviews.  One obviously had no 

In [16]:
# remove urls from text python: https://stackoverflow.com/a/40823105/4084039
sent_0 = re.sub(r"http\S+", "", sent_0)
sent_1000 = re.sub(r"http\S+", "", sent_1000)
sent_1500 = re.sub(r"http\S+", "", sent_1500)
sent_4900 = re.sub(r"http\S+", "", sent_4900)

print(sent_0)

Why is this $[...] when the same product is available for $[...] here?<br /> /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
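The output above still contains a stray ` />`. The pattern `http\S+` keeps matching until the next whitespace, so it also swallows the `<br` that immediately follows the URL. One possible alternative, shown here only as a sketch, stops the match at `<` as well as at whitespace; the sample string is a shortened version of the review above.

```python
import re

text = ("here?<br />http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY"
        "<br /><br />The traps")

greedy = re.sub(r"http\S+", "", text)        # also removes the following "<br"
bounded = re.sub(r"http[^\s<]+", "", text)   # stops at whitespace or "<"

print(greedy)
print(bounded)
```

Another option is to strip the HTML tags first and remove URLs afterwards.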


In [17]:
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
from bs4 import BeautifulSoup

soup = BeautifulSoup(sent_0, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1000, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1500, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_4900, 'lxml')
text = soup.get_text()
print(text)

Why is this $[...] when the same product is available for $[...] here? />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
I recently tried this flavor/brand and was surprised at how delicious these chips are.  The best thing was that there were a lot of "brown" chips in the bsg (my favorite), so I bought some more through amazon and shared with family and friends.  I am a little disappointed that there are not, so far, very many brown chips in these bags, but the flavor is still very good.  I like them better than the yogurt and green onion flavor because they do not seem to be as salty, and the onion flavor is better.  If you haven't eaten Kettle chips before, I recommend that you try a bag before buying bulk.  They are thicker and crunchier than Lays but just as fresh out of the bag.
Wow.  So far, two two-star reviews.  One obviously had no idea what they were ordering; the other wants crispy cookies.  Hey, I'm sorry; b

In [18]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
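The order of the substitutions matters: the specific rules for "won't" and "can't" must run before the general `n't` rule. A reduced sketch with only three rules shows the difference; `expand` is a hypothetical helper, not part of the function above.

```python
import re

def expand(phrase, rules):
    """Apply (pattern, replacement) pairs in the given order."""
    for pattern, replacement in rules:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

specific = [(r"won't", "will not"), (r"can't", "can not")]
general = [(r"n't", " not")]

good = expand("I won't and can't", specific + general)  # specific rules first
bad = expand("I won't and can't", general + specific)   # general rule first

print(good)
print(bad)
```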

In [19]:
sent_1500 = decontracted(sent_1500)
print(sent_1500)
print("="*50)

Wow.  So far, two two-star reviews.  One obviously had no idea what they were ordering; the other wants crispy cookies.  Hey, I am sorry; but these reviews do nobody any good beyond reminding us to look  before ordering.<br /><br />These are chocolate-oatmeal cookies.  If you do not like that combination, do not order this type of cookie.  I find the combo quite nice, really.  The oatmeal sort of "calms" the rich chocolate flavor and gives the cookie sort of a coconut-type consistency.  Now let is also remember that tastes differ; so, I have given my opinion.<br /><br />Then, these are soft, chewy cookies -- as advertised.  They are not "crispy" cookies, or the blurb would say "crispy," rather than "chewy."  I happen to like raw cookie dough; however, I do not see where these taste like raw cookie dough.  Both are soft, however, so is this the confusion?  And, yes, they stick together.  Soft cookies tend to do that.  They are not individually wrapped, which would add to the cost.  Oh y

In [20]:
#remove words with numbers python: https://stackoverflow.com/a/18082370/4084039
sent_0 = re.sub(r"\S*\d\S*", "", sent_0).strip()
print(sent_0)

Why is this $[...] when the same product is available for $[...] here?<br /> /><br />The Victor  and  traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.


In [21]:
#remove special characters: https://stackoverflow.com/a/5843547/4084039
sent_1500 = re.sub('[^A-Za-z0-9]+', ' ', sent_1500)
print(sent_1500)

Wow So far two two star reviews One obviously had no idea what they were ordering the other wants crispy cookies Hey I am sorry but these reviews do nobody any good beyond reminding us to look before ordering br br These are chocolate oatmeal cookies If you do not like that combination do not order this type of cookie I find the combo quite nice really The oatmeal sort of calms the rich chocolate flavor and gives the cookie sort of a coconut type consistency Now let is also remember that tastes differ so I have given my opinion br br Then these are soft chewy cookies as advertised They are not crispy cookies or the blurb would say crispy rather than chewy I happen to like raw cookie dough however I do not see where these taste like raw cookie dough Both are soft however so is this the confusion And yes they stick together Soft cookies tend to do that They are not individually wrapped which would add to the cost Oh yeah chocolate chip cookies tend to be somewhat sweet br br So if you wa

In [22]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# <br /><br /> ==> after the above steps, we are getting "br br"
# we are including them into stop words list
# if the tags had been <br/> instead of <br />, they would have been removed in the 1st step

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [23]:
# Combining all the above statements
from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentance in tqdm(final['Text'].values):
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'lxml').get_text()
    sentance = decontracted(sentance)
    sentance = re.sub(r"\S*\d\S*", "", sentance).strip()
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)
    # https://gist.github.com/sebleier/554280
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentance.strip())

100%|██████████| 4986/4986 [00:09<00:00, 529.37it/s]


In [24]:
preprocessed_reviews[1500]

'wow far two two star reviews one obviously no idea ordering wants crispy cookies hey sorry reviews nobody good beyond reminding us look ordering chocolate oatmeal cookies not like combination not order type cookie find combo quite nice really oatmeal sort calms rich chocolate flavor gives cookie sort coconut type consistency let also remember tastes differ given opinion soft chewy cookies advertised not crispy cookies blurb would say crispy rather chewy happen like raw cookie dough however not see taste like raw cookie dough soft however confusion yes stick together soft cookies tend not individually wrapped would add cost oh yeah chocolate chip cookies tend somewhat sweet want something hard crisp suggest nabiso ginger snaps want cookie soft chewy tastes like combination chocolate oatmeal give try place second order'

## [4] Featurization
## [4.1] BAG OF WORDS

In [25]:
#BoW
count_vect = CountVectorizer() #in scikit-learn
count_vect.fit(preprocessed_reviews)
print("some feature names ", count_vect.get_feature_names()[:10])
print('='*50)

final_counts = count_vect.transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_counts))
print("the shape of out text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])

some feature names  ['aa', 'aahhhs', 'aback', 'abandon', 'abates', 'abbott', 'abby', 'abdominal', 'abiding', 'ability']
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (4986, 12997)
the number of unique words  12997
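To make the bag-of-words representation concrete, here is a dependency-free sketch on two made-up documents. It only mimics the idea (one column per vocabulary word, each cell holding a count) and is not a reimplementation of CountVectorizer, which among other things tokenizes with its own pattern and returns a sparse matrix.

```python
from collections import Counter

docs = ["good chips good price", "bad chips"]  # made-up documents

# Vocabulary: every distinct word, in sorted order (one column per word).
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]

matrix = [bow_vector(d) for d in docs]
print(vocab)
print(matrix)
```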


## [4.2] Bi-Grams and n-Grams.

In [26]:
#bi-gram, tri-gram and n-gram

#removing stop words like "not" should be avoided before building n-grams
# count_vect = CountVectorizer(ngram_range=(1,2))
# please do read the CountVectorizer documentation http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

# you can choose these numbers (min_df=10, max_features=5000) as you see fit
count_vect = CountVectorizer(ngram_range=(1,2), min_df=10, max_features=5000)
final_bigram_counts = count_vect.fit_transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_bigram_counts))
print("the shape of out text BOW vectorizer ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (4986, 3144)
the number of unique words including both unigrams and bigrams  3144
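With `ngram_range=(1,2)` every single word and every pair of adjacent words becomes a candidate feature; `min_df=10` then drops candidates that occur in fewer than 10 reviews. The feature generation step can be sketched as follows (`ngrams` is a hypothetical helper and the sentence is made up):

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

features = ngrams("not good at all".split())
print(features)
```

The bigram "not good" keeps the negation attached to the word it modifies, which is why "not" was left out of the stop word list.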


## [4.3] TF-IDF

In [27]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(preprocessed_reviews)
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_names()[0:10])
print('='*50)

final_tf_idf = tf_idf_vect.transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_tf_idf))
print("the shape of out text TFIDF vectorizer ",final_tf_idf.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tf_idf.get_shape()[1])

some sample features(unique words in the corpus) ['ability', 'able', 'able find', 'able get', 'absolute', 'absolutely', 'absolutely delicious', 'absolutely love', 'absolutely no', 'according']
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text TFIDF vectorizer  (4986, 3144)
the number of unique words including both unigrams and bigrams  3144
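For reference, the weighting that TfidfVectorizer applies with its default settings can be sketched by hand: the smoothed idf is ln((1 + n) / (1 + df)) + 1, the term frequency is the raw count, and each row is L2-normalised. The documents below are made up, and the sketch ignores tokenization and n-grams.

```python
import math
from collections import Counter

docs = [["good", "chips"], ["bad", "chips"], ["good", "good", "price"]]  # made-up
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = Counter(w for d in docs for w in set(d))
idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}

def tfidf(doc):
    """Raw count times smoothed idf, followed by L2 normalisation."""
    tf = Counter(doc)
    raw = {w: tf[w] * idf[w] for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

vec = tfidf(docs[2])
print(idf)
print(vec)
```

Rare words ("bad", "price") get a higher idf than words that appear in several documents ("chips", "good").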


## [4.4] Word2Vec

In [28]:
# Train your own Word2Vec model using your own text corpus
i=0
list_of_sentance=[]
for sentance in preprocessed_reviews:
    list_of_sentance.append(sentance.split())

In [29]:
# Using Google News Word2Vectors

# in this project we are using a pretrained model by google
# it is a 3.3G file; once you load it into memory
# it occupies ~9GB, so please do this step only if you have >12G of RAM
# we will provide a pickle file which contains a dict,
# with all our corpus words as keys and model[word] as values
# To use this code-snippet, download "GoogleNews-vectors-negative300.bin" 
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# it's 1.9GB in size.


# http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.W17SRFAzZPY
# you can comment this whole cell
# or change these variables according to your needs

is_your_ram_gt_16g=False
want_to_use_google_w2v = False
want_to_train_w2v = True

if want_to_train_w2v:
    # min_count = 5 considers only words that occurred at least 5 times
    w2v_model=Word2Vec(list_of_sentance,min_count=5,size=50, workers=4)
    print(w2v_model.wv.most_similar('great'))
    print('='*50)
    print(w2v_model.wv.most_similar('worst'))
    
elif want_to_use_google_w2v and is_your_ram_gt_16g:
    if os.path.isfile('GoogleNews-vectors-negative300.bin'):
        w2v_model=KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
        print(w2v_model.wv.most_similar('great'))
        print(w2v_model.wv.most_similar('worst'))
    else:
        print("you don't have google's word2vec file, keep want_to_train_w2v = True, to train your own w2v ")

[('alternative', 0.9953393340110779), ('regular', 0.9945716857910156), ('snack', 0.9943894147872925), ('popchips', 0.9937492609024048), ('especially', 0.9936256408691406), ('healthier', 0.9933803081512451), ('cravings', 0.9931631088256836), ('crispy', 0.9931365847587585), ('licorice', 0.99297034740448), ('incredible', 0.9929677248001099)]
[('beef', 0.9994416236877441), ('choice', 0.9993866086006165), ('de', 0.9993864297866821), ('popcorn', 0.9993237257003784), ('become', 0.9992977976799011), ('wow', 0.9992910623550415), ('wife', 0.9992722272872925), ('varieties', 0.9992690682411194), ('helps', 0.9992531538009644), ('opinion', 0.9992381930351257)]


In [30]:
w2v_words = list(w2v_model.wv.vocab)
print("number of words that occured minimum 5 times ",len(w2v_words))
print("sample words ", w2v_words[0:50])

number of words that occured minimum 5 times  3817
sample words  ['product', 'available', 'course', 'total', 'pretty', 'stinky', 'right', 'nearby', 'used', 'ca', 'not', 'beat', 'great', 'received', 'shipment', 'could', 'hardly', 'wait', 'try', 'love', 'call', 'instead', 'removed', 'easily', 'daughter', 'designed', 'printed', 'use', 'car', 'windows', 'beautifully', 'shop', 'program', 'going', 'lot', 'fun', 'everywhere', 'like', 'tv', 'computer', 'really', 'good', 'idea', 'final', 'outstanding', 'window', 'everybody', 'asks', 'bought', 'made']


## [4.4.1] Converting text into vectors using Avg W2V, TFIDF-W2V
[4.4.1.1] Avg W2v

In [31]:
# average Word2Vec
# compute average word2vec for each review.
sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentance): # for each review/sentence
    sent_vec = np.zeros(50) # word vectors have length 50; change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))

100%|██████████| 4986/4986 [00:20<00:00, 237.67it/s]


4986
50
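The averaging step can be illustrated with toy two-dimensional vectors. This is a sketch under made-up data; `average_vector` is a hypothetical helper. It looks words up in a dict, which is a constant-time operation, whereas a membership test against a list (such as `word in w2v_words`) scans the list each time.

```python
toy_vectors = {"good": [1.0, 0.0], "chips": [0.0, 1.0]}  # made-up word vectors

def average_vector(tokens, vectors, dim):
    """Average the vectors of all tokens that have one; zeros if none do."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:          # out-of-vocabulary word: skip it
            continue
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    return [t / count for t in total] if count else total

avg = average_vector(["good", "chips", "zzz"], toy_vectors, 2)
empty = average_vector(["zzz"], toy_vectors, 2)
print(avg, empty)
```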


## [4.4.1.2] TFIDF weighted W2v

In [32]:
# S = ["abc def pqr", "def def def abc", "pqr pqr def"]
model = TfidfVectorizer()
tf_idf_matrix = model.fit_transform(preprocessed_reviews)
# we are building a dictionary with the word as key and its idf as value
dictionary = dict(zip(model.get_feature_names(), list(model.idf_)))

In [33]:
# TF-IDF weighted Word2Vec
tfidf_feat = model.get_feature_names() # tfidf words/col-names
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

tfidf_sent_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm(list_of_sentance): # for each review/sentence 
    sent_vec = np.zeros(50) # word vectors have length 50
    weight_sum =0; # sum of the tf-idf weights of the words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
            # tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we compute the tf-idf value directly:
            # dictionary[word] = idf value of word in the whole corpus
            # sent.count(word)/len(sent) = tf value of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
    row += 1

100%|██████████| 4986/4986 [02:07<00:00, 38.98it/s]
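In the loop above, a word that occurs c times in a review is visited c times, and each visit already uses the weight idf × c / len(sent), so its total weight grows with c². One possible variant, shown here as an assumption rather than as the notebook's method, visits each unique word once. The vectors and idf values are made up, and `tfidf_weighted_vector` is a hypothetical helper.

```python
from collections import Counter

def tfidf_weighted_vector(tokens, vectors, idf, dim):
    """Weighted average of word vectors, with weight = idf * (count / len(tokens))."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, c in Counter(tokens).items():   # each unique word once
        if tok not in vectors or tok not in idf:
            continue
        w = idf[tok] * c / len(tokens)
        total = [t + w * v for t, v in zip(total, vectors[tok])]
        weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total

toy_vectors = {"good": [1.0, 0.0], "chips": [0.0, 1.0]}  # made-up vectors
toy_idf = {"good": 1.0, "chips": 3.0}                    # made-up idf values
vec = tfidf_weighted_vector(["good", "chips"], toy_vectors, toy_idf, 2)
print(vec)
```

The rarer word ("chips", with the higher idf) pulls the review vector towards its own direction.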


## [5] Assignment 3: KNN

1. Apply Knn(brute force version) on these feature sets

    * Review text, preprocessed one converted into vectors using (BOW)
    * Review text, preprocessed one converted into vectors using (TFIDF)
    * Review text, preprocessed one converted into vectors using (AVG W2v)
    * Review text, preprocessed one converted into vectors using (TFIDF W2v)

2. Apply Knn(kd tree version) on these feature sets
* The sklearn implementation of kd-tree accepts only dense matrices, so you need to convert the sparse matrices produced by CountVectorizer/TfidfVectorizer into dense matrices. You can convert a sparse matrix to a dense one using the .toarray() method.
* Review text, preprocessed one converted into vectors using (BOW) but with restriction on maximum features generated.
            count_vect = CountVectorizer(min_df=10, max_features=500) 
            count_vect.fit(preprocessed_reviews)
            
* Review text, preprocessed one converted into vectors using (TFIDF) but with restriction on maximum features generated.
                tf_idf_vect = TfidfVectorizer(min_df=10, max_features=500)
                tf_idf_vect.fit(preprocessed_reviews)
            
* Review text, preprocessed one converted into vectors using (AVG W2v)
* Review text, preprocessed one converted into vectors using (TFIDF W2v)

3. Hyperparameter tuning (find the best K)
* Find the best hyperparameter, i.e. the one that gives the maximum AUC value
* Find the best hyperparameter using k-fold cross validation or simple cross validation data
* Use gridsearch cv or randomsearch cv or you can also write your own for loops to do this task of hyperparameter tuning

4. Representation of results
* You need to plot the performance of the model on both train data and cross validation data for each hyperparameter, as shown in the figure
* Once you have found the best hyperparameter, train your model with it, find the AUC on test data, and plot the ROC curve for both train and test.
* Along with plotting ROC curve, you need to print the confusion matrix with predicted and original labels of test data points

5. Conclusion
* You need to summarize the results at the end of the notebook in a table format. To print out a table you can use the prettytable library.
* There will be an issue of data leakage if you vectorize the entire data and then split it into train/cv/test.
* To avoid data leakage, make sure to split your data first and then vectorize it.
* While vectorizing your data, apply the method fit_transform() on your train data, and apply the method transform() on cv/test data.
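The fit-on-train, transform-on-test rule can be illustrated with a toy vectorizer. `TinyCountVectorizer` is a hypothetical stand-in written for this sketch, not a scikit-learn class, and the documents are made up. The point is that the vocabulary is learned from the training documents only, so words that appear only in the test set contribute nothing.

```python
class TinyCountVectorizer:
    """Toy stand-in for a count vectorizer (illustration only)."""

    def fit(self, docs):
        # Learn the vocabulary from the given documents only.
        self.vocab_ = sorted({w for d in docs for w in d.split()})
        return self

    def transform(self, docs):
        # Count vocabulary words; unseen words are silently ignored.
        return [[d.split().count(w) for w in self.vocab_] for d in docs]

train_docs = ["good chips", "bad chips"]
test_docs = ["good unseen word"]

vec = TinyCountVectorizer().fit(train_docs)   # fit on train only
X_train = vec.transform(train_docs)
X_test = vec.transform(test_docs)             # reuse the train vocabulary
print(vec.vocab_, X_train, X_test)
```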
[5.1] Applying KNN brute force
[5.1.1] Applying KNN brute force on BOW

##  [1] BoW for KNN algorithm

The vectorizer should be fit on the training set only, not on the entire dataset, so we first split the data into train and test sets.

In [41]:
## Splitting the dataset into train and test data
## train_test_split is used to split the data into train and test sets
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(preprocessed_reviews,final['Score'],test_size =0.2)
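The call above splits at random with no fixed seed, so each run produces a different split, and with roughly five positive reviews for every negative one the class ratio in the test set can drift. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments for this. The idea behind a seeded, stratified split can be sketched in plain Python; `stratified_split` is a hypothetical helper and the data is made up.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_fraction=0.2, seed=0):
    """Hold out the same fraction of every class, reproducibly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    test_idx = set()
    for idxs in by_label.values():
        rng.shuffle(idxs)
        test_idx.update(idxs[: round(len(idxs) * test_fraction)])
    train = [(items[i], labels[i]) for i in range(len(items)) if i not in test_idx]
    test = [(items[i], labels[i]) for i in range(len(items)) if i in test_idx]
    return train, test

labels = [1] * 80 + [0] * 20                  # made-up, imbalanced labels
items = [f"review {i}" for i in range(100)]
train, test = stratified_split(items, labels)
print(len(train), len(test), sum(1 for _, y in test if y == 0))
```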

In [43]:
import os
import pickle
from sklearn.feature_extraction.text import CountVectorizer

### Load the fitted vectorizer and the document_term_matrix if they were computed and saved earlier.
### Otherwise fit a CountVectorizer on the training data, transform it, and save both.
### The vectorizer is cached too, because it is needed later to transform the test data.
### Note: the cache is only valid if x_train is the same split that produced it.

bow_cache = 'document_term_matrix_pickle_bow.pkl'
if os.path.exists(bow_cache):
    with open(bow_cache, 'rb') as document_term_matrix_pickle:
        count_vectorizer, document_term_matrix = pickle.load(document_term_matrix_pickle)
else:
    count_vectorizer = CountVectorizer()
    document_term_matrix = count_vectorizer.fit_transform(x_train)
    with open(bow_cache, 'wb') as document_term_matrix_pickle:
        pickle.dump((count_vectorizer, document_term_matrix), document_term_matrix_pickle)

In [44]:
y_train.shape

(3988,)

In [46]:
document_term_matrix.shape

(3988, 11777)

In [47]:
type(document_term_matrix)

scipy.sparse.csr.csr_matrix

In [48]:
x_train_bow = document_term_matrix
y_train_bow = y_train

## [1.1] Splitting the data into Train and cross validation

In [50]:
## train_test_split shuffles the rows randomly, so it is not suitable when the data must be split by time.
x_train_bow,x_train_cv,y_train_bow,y_train_cv = train_test_split(x_train_bow,y_train_bow,test_size=0.2)
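
For comparison, a time-based split can be sketched as below. This is a minimal pure-Python illustration, not part of the notebook's pipeline: `time_based_split` and the toy `reviews` records are hypothetical, and only the `Time` column name is taken from the dataset description. The idea is to sort by timestamp and cut once, so every test review is at least as recent as every train review.

```python
def time_based_split(records, time_key, test_fraction=0.2):
    """Sort records by time and keep the most recent fraction as the test set."""
    ordered = sorted(records, key=lambda r: r[time_key])
    n_test = int(round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Toy records (made up for illustration).
reviews = [
    {"Time": 1346889600, "Text": "d"},
    {"Time": 1303862400, "Text": "b"},
    {"Time": 1350777600, "Text": "e"},
    {"Time": 1219017600, "Text": "a"},
    {"Time": 1307923200, "Text": "c"},
]

older, newer = time_based_split(reviews, "Time", test_fraction=0.2)
```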

## [1.2] Train and Test

In [51]:
from sklearn.neighbors import KNeighborsClassifier

In [52]:
from sklearn.metrics import f1_score
## The F1 score is the harmonic mean of precision and recall; it is useful when you want a balance between the two
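
How the F1 score combines precision and recall can be worked through by hand. The sketch below is a pure-Python illustration with made-up labels; `precision_recall_f1` is a hypothetical helper, not scikit-learn's `f1_score`, although it follows the same definition for a binary problem with 1 as the positive class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels (made up): four positive and two negative reviews.
y_true_toy = [1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0]
p, r, f1 = precision_recall_f1(y_true_toy, y_pred_toy)

# A predictor that always answers "positive" still gets a fairly high F1
# when most reviews are positive.
p_all, r_all, f1_all = precision_recall_f1(y_true_toy, [1] * len(y_true_toy))
```

The second call is a reminder that, on a dataset dominated by positive reviews, a high F1 for the positive class does not by itself show that the negative class is being recognised.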

## [1.3] Brute Force KNN (Euclidean distance)

In [54]:
def knn_trainer_and_cross_validator(k, X_train, y_train, X_cv, y_cv, algorithm, save_name):
    # Side note: X_train and X_cv may be scipy sparse matrices rather than numpy arrays.
    # Load the fitted model from save_name + '.pkl' if it exists; otherwise fit and save it.
    # save_name must be unique per k and per dataset, or a stale model will be reloaded.
    model_path = save_name + '.pkl'
    if os.path.exists(model_path):
        with open(model_path, 'rb') as trained_knn_pkl:
            knn = pickle.load(trained_knn_pkl)
    else:
        knn = KNeighborsClassifier(n_neighbors = k, algorithm = algorithm)
        knn.fit(X_train, y_train)
        with open(model_path, 'wb') as trained_knn_pkl:
            pickle.dump(knn, trained_knn_pkl)

    # Predict on the cross-validation split and return the F1 score as a percentage.
    y_pred_cv = knn.predict(X_cv)
    f1score = f1_score(y_cv, y_pred_cv) * 100
    return f1score

In [57]:
save_name = 'bow_brute_knn'
f1scores_for_diff_k = []
for i in range(1, 30, 2):
    f1score = knn_trainer_and_cross_validator(i, x_train_bow, y_train_bow, x_train_cv, y_train_cv, algorithm = 'brute', save_name = save_name + str(i))
    f1scores_for_diff_k.append(f1score)
    print('F1-score for k = ' + str(i) + ' is ' + str(f1score))

F1-score for k = 1 is 86.94063926940639
F1-score for k = 3 is 90.37294015611448
F1-score for k = 5 is 91.50214592274679
F1-score for k = 7 is 91.50214592274679
F1-score for k = 9 is 91.76672384219555
F1-score for k = 11 is 91.71648163962426
F1-score for k = 13 is 91.71648163962426
F1-score for k = 15 is 91.88727583262168
F1-score for k = 17 is 91.90110826939471
F1-score for k = 19 is 91.96581196581197
F1-score for k = 21 is 91.88727583262168
F1-score for k = 23 is 91.88727583262168
F1-score for k = 25 is 91.88727583262168
F1-score for k = 27 is 91.88727583262168
F1-score for k = 29 is 91.97952218430035


In [59]:
max_f1score = max(f1scores_for_diff_k)
print(max_f1score, f1scores_for_diff_k.index(max_f1score))

91.97952218430035 14
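
The second number printed above is a position in `f1scores_for_diff_k`, not a value of k. Because the loop runs over `range(1, 30, 2)`, position `i` corresponds to k = 1 + 2 * i. A short sketch of the lookup, using the scores printed above (the names `ks`, `f1_by_k`, `best_index` and `best_k` are introduced here for illustration):

```python
# Candidate values of k, in the same order as the loop above.
ks = list(range(1, 30, 2))

# F1 scores copied from the printed output above.
f1_by_k = [
    86.94063926940639, 90.37294015611448, 91.50214592274679,
    91.50214592274679, 91.76672384219555, 91.71648163962426,
    91.71648163962426, 91.88727583262168, 91.90110826939471,
    91.96581196581197, 91.88727583262168, 91.88727583262168,
    91.88727583262168, 91.88727583262168, 91.97952218430035,
]

best_index = f1_by_k.index(max(f1_by_k))  # position in the list
best_k = ks[best_index]                   # the k that produced the best score
```

Here the best position is 14, which maps to k = 29.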


## [1.4] Testing the data

In [61]:
## The list index 14 printed above corresponds to k = 29 (k = 1 + 2 * index); k = 11 is used here.
knn = KNeighborsClassifier(n_neighbors = 11, algorithm = 'brute')

In [64]:
knn.fit(x_train_bow,y_train_bow)

KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=11, p=2,
           weights='uniform')

In [65]:
x_test_bow_repr = count_vectorizer.transform(x_test)

In [70]:
y_pred_test = knn.predict(x_test_bow_repr)

In [72]:
print(knn.predict_proba(x_test_bow_repr))

[[0.27272727 0.72727273]
 [0.45454545 0.54545455]
 [0.09090909 0.90909091]
 ...
 [0.36363636 0.63636364]
 [0.18181818 0.81818182]
 [0.18181818 0.81818182]]
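
With `weights='uniform'` and `n_neighbors=11`, each probability above is the fraction of the 11 nearest neighbours that carry that label, which is why every value is a multiple of 1/11. A minimal sketch of that vote count (`vote_proba` and the toy neighbour labels are hypothetical, written for this example):

```python
from collections import Counter
from fractions import Fraction


def vote_proba(neighbor_labels, classes=(0, 1)):
    """Return the share of neighbours voting for each class, in class order."""
    counts = Counter(neighbor_labels)
    k = len(neighbor_labels)
    return [Fraction(counts[c], k) for c in classes]


# Toy example: 8 of the 11 nearest neighbours are positive, 3 are negative.
neighbors = [1] * 8 + [0] * 3
proba = vote_proba(neighbors)
proba_rounded = [round(float(p), 8) for p in proba]
```

This reproduces the first row of the output, `[0.27272727 0.72727273]`, as 3/11 and 8/11.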


In [73]:
print(knn.score(x_test_bow_repr,y_test))

0.8266533066132264


In [91]:
f1_score(y_test, y_pred_test) * 100

90.41551246537396

## [2] Tf-Idf for KNN-Algorithm

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [75]:
### Load the fitted vectorizer and the document_term_matrix if they were computed and saved earlier.
### Otherwise fit a TfidfVectorizer on the training data, transform it, and save both.
### The vectorizer is cached too, because it is needed later to transform the test data.

tfidf_cache = 'document_term_matrix_pickle_tfidf.pkl'
if os.path.exists(tfidf_cache):
    with open(tfidf_cache, 'rb') as document_term_matrix_pickle:
        tfidf_vectorizer, document_term_matrix = pickle.load(document_term_matrix_pickle)
else:
    tfidf_vectorizer = TfidfVectorizer(ngram_range = (1,1))
    document_term_matrix = tfidf_vectorizer.fit_transform(x_train)
    with open(tfidf_cache, 'wb') as document_term_matrix_pickle:
        pickle.dump((tfidf_vectorizer, document_term_matrix), document_term_matrix_pickle)

In [76]:
type(document_term_matrix)
document_term_matrix.shape

(3988, 11777)

In [77]:
x_train_tfidf_repr = document_term_matrix

y_train_tfidf_repr = y_train

In [79]:
x_train_tfidf_repr, x_cv_tfidf_repr, y_train_tfidf_repr, y_cv_tfidf_repr = train_test_split(x_train_tfidf_repr, y_train_tfidf_repr, test_size = 0.2)

## [2.1] Brute-force KNN (Euclidean distance)

In [80]:
save_name = 'tfidf_brute_knn'
f1scores_for_diff_k = []
for i in range(1, 30, 2):
    f1score = knn_trainer_and_cross_validator(i, x_train_tfidf_repr, y_train_tfidf_repr, x_cv_tfidf_repr, y_cv_tfidf_repr, algorithm = 'brute', save_name = save_name + str(i))
    f1scores_for_diff_k.append(f1score)
    print('F1-score for k = ' + str(i) + ' is ' + str(f1score))

F1-score for k = 1 is 91.41689373297002
F1-score for k = 3 is 91.72320217096338
F1-score for k = 5 is 91.72320217096338
F1-score for k = 7 is 91.72320217096338
F1-score for k = 9 is 91.72320217096338
F1-score for k = 11 is 91.72320217096338
F1-score for k = 13 is 91.72320217096338
F1-score for k = 15 is 91.72320217096338
F1-score for k = 17 is 91.72320217096338
F1-score for k = 19 is 91.72320217096338
F1-score for k = 21 is 91.72320217096338
F1-score for k = 23 is 91.72320217096338
F1-score for k = 25 is 91.72320217096338
F1-score for k = 27 is 91.72320217096338
F1-score for k = 29 is 91.72320217096338


In [81]:
print(max(f1scores_for_diff_k), f1scores_for_diff_k.index(max(f1scores_for_diff_k)), sep = '\t')

91.72320217096338	1


In [85]:
## The list index 1 printed above corresponds to k = 3 (k = 1 + 2 * index); k = 2 is used here.
knn = KNeighborsClassifier(n_neighbors = 2, algorithm = 'brute')

In [86]:
knn = knn.fit(x_train_tfidf_repr, y_train_tfidf_repr)

In [106]:
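## NOTE: fit_transform() refits the vectorizer on the test data. To score the model trained on
## the training split, use tfidf_vectorizer.transform(x_test) instead.
x_test_tfidf_repr = tfidf_vectorizer.fit_transform(x_test)

In [108]:
knn = KNeighborsClassifier(n_neighbors = 2, algorithm = 'brute')

In [109]:
## NOTE: the model is refit on the test data here, so the predictions below are made on the
## same points the model was fitted on. The resulting F1 is not a held-out test score.
knn = knn.fit(x_test_tfidf_repr, y_test)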

In [110]:
y_pred = knn.predict(x_test_tfidf_repr)
print(y_pred)

[1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1
 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1
 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0
 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1
 1 0 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1
 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0
 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1
 0 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 0 1 1 0
 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 

In [111]:
### f1_score score

f1_score(y_test, y_pred) * 100

99.93920972644378
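
A score this close to 100 is what one would expect when a nearest-neighbour model is scored on the very points it was fitted on, as in the cells above where `knn.fit` and `knn.predict` both receive `x_test_tfidf_repr`. The sketch below illustrates the effect with a toy 1-nearest-neighbour classifier in pure Python. All names and data points are made up for this example, and it uses accuracy and k = 1 rather than F1 and k = 2 to keep the arithmetic simple.

```python
def nearest_label(query, points, labels):
    """Label of the fitted point closest to `query` (squared Euclidean distance)."""
    best = min(range(len(points)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(query, points[i])))
    return labels[best]


def accuracy(points_fit, labels_fit, points_eval, labels_eval):
    """Share of evaluation points whose nearest fitted neighbour has the right label."""
    hits = sum(nearest_label(q, points_fit, labels_fit) == y
               for q, y in zip(points_eval, labels_eval))
    return hits / len(points_eval)


# Toy data (made up for illustration).
train_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_lbl = [0, 1, 0, 1]
test_pts = [(0.9, 0.2), (0.1, 0.9), (4.0, 4.0), (0.4, 0.1)]
test_lbl = [0, 0, 1, 1]

# Scoring on the fitted points: every point is its own nearest neighbour.
self_score = accuracy(train_pts, train_lbl, train_pts, train_lbl)

# Scoring on points the model has never seen.
held_out_score = accuracy(train_pts, train_lbl, test_pts, test_lbl)
```

On this toy data the self score is perfect while the held-out score is much lower, so only the held-out number says anything about generalisation.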

## [2.2] Kd-tree KNN Algorithm

In [100]:
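save_name = 'tfidf_kd_knn'
f1scores_for_diff_k = []
## NOTE: x_train_cv and y_train_cv are the BoW cross-validation split from section [1.1], not the
## TF-IDF split; the TF-IDF counterparts are x_cv_tfidf_repr.toarray() and y_cv_tfidf_repr.
for i in range(1, 30, 2):
    f1score = knn_trainer_and_cross_validator(i, x_train_tfidf_repr.toarray(), y_train_tfidf_repr, x_train_cv, y_train_cv, algorithm = 'kd_tree', save_name = save_name + str(i))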
    f1scores_for_diff_k.append(f1score)
    print('F1-score for k = ' + str(i) + ' is ' + str(f1score))

F1-score for k = 1 is 97.41219963031423
F1-score for k = 3 is 94.3800178412132
F1-score for k = 5 is 93.32161687170475
F1-score for k = 7 is 92.96264118158123
F1-score for k = 9 is 92.73356401384082
F1-score for k = 11 is 92.56055363321799
F1-score for k = 13 is 92.42685025817556
F1-score for k = 15 is 92.1097770154374
F1-score for k = 17 is 92.04448246364413
F1-score for k = 19 is 91.88727583262168
F1-score for k = 21 is 91.97952218430035
F1-score for k = 23 is 91.90110826939471
F1-score for k = 25 is 91.97952218430035
F1-score for k = 27 is 91.80887372013652
F1-score for k = 29 is 91.73060528559252


In [99]:
print(max(f1scores_for_diff_k), f1scores_for_diff_k.index(max(f1scores_for_diff_k)), sep = '\t')

97.41219963031423	0


In [113]:
## The list index 0 printed above corresponds to k = 1; k = 2 is used here.
knn = KNeighborsClassifier(n_neighbors = 2, algorithm = 'kd_tree')

In [115]:
knn.fit(x_train_tfidf_repr.toarray(),y_train_tfidf_repr)

KNeighborsClassifier(algorithm='kd_tree', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=2, p=2,
           weights='uniform')

In [121]:
## Predict the class label for each point of the training split (the data the model was fitted on)
y_pred = knn.predict(x_train_tfidf_repr.toarray())

In [125]:
### Refit the model on the test data; y_pred computed above is not affected by this refit
knn.fit(x_test_tfidf_repr.toarray(),y_test)


KNeighborsClassifier(algorithm='kd_tree', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=2, p=2,
           weights='uniform')

In [127]:
### F1 score of the training-split predictions (not a held-out test score)
f1_score(y_train_tfidf_repr, y_pred) * 100

99.7380239520958

## [3] Average W2V

In [128]:
import gensim

## Tokenize each document: gensim Word2Vec expects the corpus as a list of tokenized sentences (a list of lists of words).

In [129]:
stemmed_filtered_sorted_list_of_tokenized_sentences = []

for sentence in x_train:
    tokenized_sentence = sentence.split()
    stemmed_filtered_sorted_list_of_tokenized_sentences.append(tokenized_sentence)

In [130]:
len(stemmed_filtered_sorted_list_of_tokenized_sentences)

3988

In [137]:
### Load the Word2Vec model if it was trained and saved earlier;
### otherwise train a Word2Vec model on the training sentences and save it.
### Note: gensim >= 4.0 renames the `size` argument to `vector_size`.

if os.path.exists('word2vec_trained_model.pkl'):
    with open('word2vec_trained_model.pkl', 'rb') as word2vec_pkl:
        w2v = pickle.load(word2vec_pkl)
else:
    w2v = gensim.models.Word2Vec(stemmed_filtered_sorted_list_of_tokenized_sentences, min_count = 1, size = 50, workers = 4)
    with open('word2vec_trained_model.pkl', 'wb') as word2vec_pkl:
        pickle.dump(w2v, word2vec_pkl)

In [153]:

### Load the avg_w2v representation of each sentence if it was computed and saved earlier;
### otherwise compute it for all the sentences and save the list.
### Empty sentences are skipped, so avg_w2v can be shorter than the list of sentences.

if os.path.exists('avg_word2vec_1.pkl'):
    with open('avg_word2vec_1.pkl', 'rb') as avg_w2v_pkl:
        avg_w2v = pickle.load(avg_w2v_pkl)
else:
    avg_w2v = []
    for tokenized_sentence in stemmed_filtered_sorted_list_of_tokenized_sentences:
        sum_of_vectors_for_each_word = 0
        for word in tokenized_sentence:
            sum_of_vectors_for_each_word += w2v.wv[word]
        if len(tokenized_sentence)!=0:
            avg_w2v.append(sum_of_vectors_for_each_word / len(tokenized_sentence))
    with open('avg_word2vec_1.pkl', 'wb') as avg_w2v_pkl:
        pickle.dump(avg_w2v, avg_w2v_pkl)

In [154]:
type(avg_w2v)
len(avg_w2v)

3977

In [157]:
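x_train_avgw2v_repr = avg_w2v
## NOTE: 11 empty reviews were skipped when averaging (3988 - 3977), so 11 labels must be dropped too.
## y_train[11:] drops the first 11 labels, which lines up with avg_w2v only if the empty reviews
## were the first 11 rows of x_train; otherwise the labels are misaligned with the vectors.
y_train_avgw2v_repr = y_train[11:]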
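
Since empty reviews are skipped when the average vectors are built, the labels have to be dropped at the same positions. One way to keep the two in step is to record the indices of the reviews that were kept. The sketch below is a pure-Python illustration with a made-up embedding table; `average_vectors`, `toy_embeddings`, `docs` and `labels` are hypothetical names, and the toy dictionary stands in for `w2v.wv`.

```python
def average_vectors(tokenized_docs, embeddings):
    """Return (avg_vectors, kept_indices); documents with no known words are skipped."""
    avg_vectors, kept = [], []
    for idx, tokens in enumerate(tokenized_docs):
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        if not vecs:
            continue  # nothing to average for this document
        dim = len(vecs[0])
        avg_vectors.append([sum(v[d] for v in vecs) / len(vecs) for d in range(dim)])
        kept.append(idx)
    return avg_vectors, kept


# Toy 2-dimensional embeddings (made up for illustration).
toy_embeddings = {"good": [1.0, 0.0], "tea": [0.0, 1.0], "bad": [-1.0, 0.0]}
docs = [["good", "tea"], [], ["bad"], ["good", "good", "tea"]]
labels = [1, 0, 0, 1]

vectors, kept_idx = average_vectors(docs, toy_embeddings)
aligned_labels = [labels[i] for i in kept_idx]  # labels of the kept documents only
```

In this toy case the empty document sits at position 1, so slicing off the first label (`labels[1:]`) would pair the vectors with the wrong labels, while indexing by `kept_idx` keeps them aligned. With a pandas Series, the equivalent selection would be `y_train.iloc[kept_idx]`.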

In [158]:
x_train_avgw2v_repr, x_cv_avgw2v_repr, y_train_avgw2v_repr, y_cv_avgw2v_repr = train_test_split(x_train_avgw2v_repr, y_train_avgw2v_repr, test_size = 0.2)

## [3.1] Brute Force K-NN

In [160]:
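f1scores_for_diff_k = []
## NOTE: save_name is the same for every k here, so knn_trainer_and_cross_validator reloads the
## model pickled on the first iteration for all later values of k. Passing
## save_name = 'avgw2v_brute_knn' + str(i) would train a separate model per k.
for i in range(1, 30, 2):
    f1score = knn_trainer_and_cross_validator(i, x_train_avgw2v_repr, y_train_avgw2v_repr, x_cv_avgw2v_repr, y_cv_avgw2v_repr, algorithm = 'brute', save_name = 'avgw2v_brute_knn')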
    f1scores_for_diff_k.append(f1score)
    print('F1-score for k = ' + str(i) + ' is ' + str(f1score))

F1-score for k = 1 is 84.61538461538461
F1-score for k = 3 is 84.61538461538461
F1-score for k = 5 is 84.61538461538461
F1-score for k = 7 is 84.61538461538461
F1-score for k = 9 is 84.61538461538461
F1-score for k = 11 is 84.61538461538461
F1-score for k = 13 is 84.61538461538461
F1-score for k = 15 is 84.61538461538461
F1-score for k = 17 is 84.61538461538461
F1-score for k = 19 is 84.61538461538461
F1-score for k = 21 is 84.61538461538461
F1-score for k = 23 is 84.61538461538461
F1-score for k = 25 is 84.61538461538461
F1-score for k = 27 is 84.61538461538461
F1-score for k = 29 is 84.61538461538461


In [162]:
max_f1score = max(f1scores_for_diff_k)
print(max_f1score, f1scores_for_diff_k.index(max_f1score))

84.61538461538461 0
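
The scores above are identical for every k. One likely cause, visible in the loop, is that `save_name` is the same string on every iteration, so the load-or-train helper finds the pickle written on the first iteration and reuses it. A minimal sketch of a helper that builds the cache file name from k itself is shown below. `load_or_train` and `toy_train` are hypothetical names, and a plain dictionary stands in for the fitted classifier.

```python
import os
import pickle
import tempfile


def load_or_train(cache_dir, base_name, k, train_fn):
    """Load the cached model for this k, or train it and cache it."""
    path = os.path.join(cache_dir, f"{base_name}_k{k}.pkl")  # k is part of the key
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    model = train_fn(k)
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    return model


def toy_train(k):
    # Stand-in for fitting a KNeighborsClassifier with n_neighbors = k.
    return {"n_neighbors": k}


with tempfile.TemporaryDirectory() as tmp:
    models = [load_or_train(tmp, "avgw2v_brute_knn", k, toy_train)
              for k in range(1, 8, 2)]
    cached_files = sorted(os.listdir(tmp))
```

Each k now gets its own file, so a model trained for one k can never be returned for another. The same idea extends to including the dataset or split identifier in the file name.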
