# Applying Word2Vec (W2V) to the Amazon Fine Food Reviews Dataset

### Word2Vec (Word to Vector):
Word2vec is a group of related models used to produce word embeddings. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. Word2vec takes a large corpus of text as input and produces a vector space in which each unique word is assigned a vector, positioned so that words sharing common contexts lie close together.
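
As a rough illustration of what "reconstructing linguistic contexts" means, the sketch below lists the (target, context) pairs that a skip-gram style model would learn from for one tokenized sentence. Skip-gram is one of the two Word2vec architectures (the other is CBOW). The helper name `skipgram_pairs`, the sample sentence, and the window size of 2 are assumptions made for this example and are not part of the notebook.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the context window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["great", "taffy", "at", "a", "great", "price"]
pairs = skipgram_pairs(sentence, window=2)
print(pairs[:4])
print(len(pairs))
```

A real implementation trains a network to predict the context word from the target word over millions of such pairs; the learned weights become the word vectors.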

#### Dataset source: https://www.kaggle.com/snap/amazon-fine-food-reviews

In [1]:
# Import the required libraries
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns

import re

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

In [2]:
# Connect to the SQLite database and load every review except the neutral ones (Score == 3)
con = sqlite3.connect('./database.sqlite')

f_data = pd.read_sql_query('''
SELECT * FROM Reviews WHERE Score != 3 ''', con)

In [3]:
# Map the numeric score to a label: 1-2 -> 'negative', 4-5 -> 'positive'
# (score 3 was already excluded by the SQL query)
def change(x):
    if x < 3:
        return 'negative'
    else:
        return 'positive'

f_data['Score'] = f_data['Score'].map(change)
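
The filter-then-label step can be tried without the real database. The following is a minimal sketch using an in-memory SQLite table; the table contents, the connection name `demo_con`, and the helper name `to_label` are hypothetical and exist only for this illustration.

```python
import sqlite3

# Hypothetical miniature Reviews table with just two of the real columns.
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER)")
demo_con.executemany(
    "INSERT INTO Reviews VALUES (?, ?)",
    [(1, 5), (2, 1), (3, 3), (4, 4), (5, 2)],
)


def to_label(score):
    """Scores below 3 are negative, scores above 3 positive; 3 is filtered out in SQL."""
    return "negative" if score < 3 else "positive"


rows = demo_con.execute(
    "SELECT Id, Score FROM Reviews WHERE Score != 3 ORDER BY Id"
).fetchall()
labels = [(review_id, to_label(score)) for review_id, score in rows]
print(labels)
demo_con.close()
```

Dropping the neutral score in the query is what makes the simple `x < 3` test sufficient for a two-class label.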

In [4]:
f_data.shape

(525814, 10)

In [5]:
f_data.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | positive | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | negative | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | positive | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | negative | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | positive | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [6]:
# Sort the entries by ProductId (ascending, the default) so that keep='first' below is deterministic
sort_data = f_data.sort_values('ProductId', axis=0, ascending=True, na_position='last')


In [7]:
# Remove duplicates: the same user posting the same text at the same time counts as one review
after_dup = sort_data.drop_duplicates(subset=['UserId', 'ProfileName', 'Time', 'Text'], keep='first')


In [8]:
after_dup.shape

(364173, 10)

In [9]:
# Percentage of the original reviews that remain after removing duplicates
after_dup['Id'].size / f_data['Id'].size * 100

69.25890143662969
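
To see how the sort, de-duplication, and retention check fit together, here is a minimal sketch on a toy DataFrame. The rows, the variable names `toy`, `key_cols`, and `deduped`, and the product IDs are all hypothetical; only the column names and the `drop_duplicates` call mirror the notebook.

```python
import pandas as pd

# Hypothetical rows: the first two are the same review attached to two product IDs.
toy = pd.DataFrame({
    "ProductId": ["B0001", "B0002", "B0003"],
    "UserId": ["U1", "U1", "U2"],
    "ProfileName": ["ann", "ann", "bob"],
    "Time": [100, 100, 200],
    "Text": ["great tea", "great tea", "too sweet"],
})

# Rows that agree on all of these columns are treated as one review.
key_cols = ["UserId", "ProfileName", "Time", "Text"]
deduped = (
    toy.sort_values("ProductId")
       .drop_duplicates(subset=key_cols, keep="first")
)

retained_pct = len(deduped) / len(toy) * 100
print(deduped["ProductId"].tolist(), round(retained_pct, 1))
```

Because the frame is sorted by `ProductId` first, `keep="first"` retains the copy with the smallest product ID.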

In [10]:
# Keep only rows where HelpfulnessNumerator <= HelpfulnessDenominator (the reverse is invalid data)
after_dup = after_dup[after_dup.HelpfulnessNumerator <= after_dup.HelpfulnessDenominator]

In [11]:
after_dup.shape

(364171, 10)

In [12]:
after_dup['Score'].value_counts()

positive    307061
negative     57110
Name: Score, dtype: int64

## Review text pre-processing

In [13]:
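def cleanhtmltag(sentence):
    # Replace every HTML tag with a single space
    cleaner = re.compile('<.*?>')
    cleantext = re.sub(cleaner, ' ', sentence)
    return cleantext


def cleanpunc(sentence):
    # Inside [...] a '|' is a literal pipe, not alternation, so each character is listed once
    cleaned = re.sub(r'[?!\'"#|]', r'', sentence)  # delete these characters
    cleaned = re.sub(r'[.,)(|/]', r' ', cleaned)   # replace these with a space
    return cleaned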
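
A quick check of the two helpers on a made-up review string shows what each step does. The helpers are restated in an equivalent form so the snippet runs on its own; the sample text `raw` is hypothetical.

```python
import re


def cleanhtmltag(sentence):
    """Replace every HTML tag with a single space."""
    return re.sub(r'<.*?>', ' ', sentence)


def cleanpunc(sentence):
    """Delete ? ! ' " # | and turn . , ) ( / into spaces."""
    cleaned = re.sub(r'[?!\'"#|]', '', sentence)
    return re.sub(r'[.,)(|/]', ' ', cleaned)


raw = 'Loved it!<br />Crisp,fresh (and cheap).'
no_tags = cleanhtmltag(raw)
cleaned = cleanpunc(no_tags)
print(repr(no_tags))
print(cleaned.split())
```

Replacing tags and separators with a space (rather than deleting them) keeps words such as `Crisp,fresh` from being fused into one token.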

## Training the W2V model

In [14]:
# Turn every review into a list of lowercase, purely alphabetic tokens
list_of_sent = []
for sent in after_dup['Text'].values:
    filtered_sentence = []
    sent = cleanhtmltag(sent)
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            # Tokens that still contain digits or punctuation are skipped
            if cleaned_words.isalpha():
                filtered_sentence.append(cleaned_words.lower())
    list_of_sent.append(filtered_sentence)

In [15]:
print(after_dup['Text'].values[0])
print('--------------------------------------------')
print(list_of_sent[0])

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
--------------------------------------------
['this', 'witty', 'little', 'book', 'makes', 'my', 'son', 'laugh', 'at', 'loud', 'i', 'recite', 'it', 'in', 'the', 'car', 'as', 'were', 'driving', 'along', 'and', 'he', 'always', 'can', 'sing', 'the', 'refrain', 'hes', 'learned', 'about', 'whales', 'india', 'drooping', 'i', 'love', 'all', 'the', 'new', 'words', 'this', 'book', 'introduces', 'and', 'the', 'silliness', 'of', 'it', 'all', 'this', 'is', 'a', 'classic', 'book', 'i', 'am', 'willing', 'to', 'bet', 'my', 'son', 'will', 'still', 'be', 'able', 'to', 'recite', 'from', 'memory', 'when', 'he', 'is', 'in', 'college']
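
Note that `roses:` from the review is missing in the token list above: the colon is not one of the characters handled by `cleanpunc`, so the token fails the `isalpha()` test and is dropped. The sketch below reproduces that effect on a few tokens and, purely as an assumption about what one might prefer, shows a regex-based alternative that keeps the letters of such tokens. The variable names and the `[a-z]+` pattern are illustrative and not part of the notebook.

```python
import re

tokens = ["drooping", "roses:", "i", "love", "100%", "it"]

# The notebook's rule: keep a token only if every character is a letter.
kept = [t.lower() for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]
print(kept)
print(dropped)

# Hypothetical alternative: extract runs of letters, so "roses:" yields "roses".
text = "drooping roses:  i love all the new words"
alt = re.findall(r"[a-z]+", text.lower())
print(alt)
```

The alternative has its own trade-off: a contraction such as `we're` would be split into `we` and `re`, whereas the notebook's approach turns it into `were`.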


In [None]:
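import gensim
# Train 50-dimensional vectors; words occurring fewer than 5 times are ignored.
# gensim >= 4.0 names the dimensionality parameter `vector_size`; older releases call it `size`.
w2v_model = gensim.models.Word2Vec(list_of_sent, min_count=5, vector_size=50, workers=4)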

In [None]:
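# Vocabulary learned by the model (gensim >= 4.0; older releases expose it as `w2v_model.wv.vocab`)
words = list(w2v_model.wv.key_to_index)
print(len(words))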
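
The vocabulary size depends directly on `min_count`: words whose total frequency in the corpus falls below it are discarded before training. The sketch below imitates only that frequency cut-off in plain Python; the helper name `build_vocab` and the toy sentences are assumptions for illustration, and gensim's actual vocabulary building does more than this.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep only the words whose total frequency is at least `min_count`."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, n in counts.items() if n >= min_count}


toy_sentences = [
    ["tasty", "tea"],
    ["tasty", "coffee"],
    ["tasty", "tea", "bags"],
]
vocab = build_vocab(toy_sentences, min_count=2)
print(sorted(vocab))
```

Raising `min_count` shrinks the vocabulary and removes rare words (often typos), at the cost of having no vector for them.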

In [None]:
# List the words whose vectors are most similar to that of 'tasty'
w2v_model.wv.most_similar('tasty')
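
`most_similar` ranks words by the cosine similarity between their vectors. A minimal sketch of that ranking with hand-made 3-dimensional vectors follows; the vectors, the word list, and the helper names `cosine` and `rank_by_cosine` are hypothetical and stand in for the trained 50-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(word, vectors, topn=2):
    """Return the `topn` other words with the highest cosine similarity to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy embeddings, not output of the trained model.
toy_vectors = {
    "tasty": [1.0, 1.0, 0.0],
    "delicious": [2.0, 2.0, 0.0],
    "bland": [1.0, 0.0, 1.0],
    "box": [0.0, 0.0, 1.0],
}
ranked = rank_by_cosine("tasty", toy_vectors)
print([word for word, _ in ranked])
```

Cosine similarity ignores vector length, which is why `delicious` (twice as long as `tasty` but pointing the same way) ranks first.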