# Amazon Fine Food Reviews Analysis


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews <br>

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/


The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review. A rating of 1 or 2 could be considered negative. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.




## Loading the data

The dataset is available in two forms
1. .csv file
2. SQLite Database

To load the data, we have used the SQLite database, as it is easier to query and visualise the data efficiently.
<br> 

Here, as we only want the global sentiment of the recommendations (positive or negative), we will purposefully ignore all scores equal to 3. If the score is above 3, the recommendation will be set to "positive". Otherwise, it will be set to "negative".
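A minimal sketch of this filtering and labelling step, using an in-memory SQLite database with a few made-up rows. The toy table, its contents and the name `con_demo` are assumptions for illustration, not the real dataset:

```python
import sqlite3

# Toy in-memory database standing in for database.sqlite (hypothetical rows).
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con_demo.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "great"), (2, 3, "okay"), (3, 1, "awful"), (4, 4, "good")],
)

# Drop neutral reviews (Score = 3) in the query itself.
rows = con_demo.execute(
    "SELECT Id, Score FROM Reviews WHERE Score != 3 ORDER BY Id"
).fetchall()

# Map the remaining scores to labels: above 3 is positive, below 3 is negative.
labels = {rid: ("positive" if score > 3 else "negative") for rid, score in rows}
print(labels)
con_demo.close()
```

The neutral review (Id 2) never reaches the labelling step, which is why the mapping only needs two branches.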

In [None]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")



import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

# [1]. Reading Data

In [None]:
# Use sqlite3 to connect to the SQLite dump database.sqlite
con = sqlite3.connect("./database.sqlite")
# We now have the connection object, con, which we will use for all further queries

In [None]:
# Let's take 500K examples first for faster computation.
# Neutral reviews (Score = 3) are excluded in the query because they can lead to confusion.
filtered_data = pd.read_sql_query("SELECT * FROM REVIEWS WHERE SCORE != 3 LIMIT 500000",con)

# Function to categorise the class attribute, i.e. Score: a score < 3 is a negative review (0)
# and a score > 3 is a positive review (1).

def distinguish_ratings(x):
    if x < 3:
        return 0
    elif x > 3:
        return 1

#changing the Score column values to the decision values, i.e. positive (1) and negative (0)
filtered_data["Score"] = filtered_data["Score"].map(distinguish_ratings)
print("The shape of filtered data is ",filtered_data.shape)
filtered_data.head(5)

# [2] Data Cleaning 
### Hunch 1 : Deduplication

In [None]:
display = pd.read_sql_query(""" SELECT USERID,PRODUCTID,PROFILENAME,TIME,SCORE,TEXT,COUNT(*)
FROM REVIEWS GROUP BY USERID HAVING COUNT(*) >1""",con)
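# Note: grouping by USERID alone counts users with more than one review, not exact duplicate rows
print("There are ", display.shape[0], "users with more than one review in our dataset")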
display.head(10)

In [None]:
#Let's look at the reviews by Christopher P. Presta, as we can see that they have been repeated twice.
filtered_data[filtered_data["UserId"]=="#oc-R12KPBODL2B5ZD"]

Analysis showed that reviews with identical parameters other than ProductId belonged to the same product in a different flavour or quantity. Hence, in order to reduce redundancy, it was decided to eliminate the rows having the same parameters.<br>

To do this, we first sort the data by ProductId (so that NaN values stay at the bottom of the data set), then keep only the first review in each group of duplicates and delete the others.
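The sort-then-keep-first idea can be sketched without pandas. The rows below are hypothetical, and the key mirrors the subset (UserId, ProfileName, Time, Text) used for de-duplication in this notebook:

```python
# Hypothetical rows: (ProductId, UserId, ProfileName, Time, Text)
reviews = [
    ("B002", "U1", "Ann", 100, "Tasty wafers"),
    ("B001", "U1", "Ann", 100, "Tasty wafers"),  # same review, different flavour
    ("B003", "U2", "Bob", 200, "Too sweet"),
]

seen = set()
deduped = []
# Sort by ProductId, then keep only the first row for each
# (UserId, ProfileName, Time, Text) combination.
for row in sorted(reviews, key=lambda r: r[0]):
    key = row[1:]
    if key not in seen:
        seen.add(key)
        deduped.append(row)

print(deduped)
```

Sorting first makes the choice of survivor deterministic: the copy with the smallest ProductId is the one that is kept.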

In [None]:
# sorting the data frame first
# DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
sorted_df = filtered_data.sort_values(by='ProductId')
sorted_df.head(2)

In [None]:
# de-duplication of entries 
# DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
final_df = sorted_df.drop_duplicates(subset = ['UserId','ProfileName','Time','Text'])
final_df.shape

In [None]:
# Percentage of rows retained after de-duplication (100 minus this value is the share dropped)
final_df.shape[0] / filtered_data.shape[0] * 100

The value above is the share of rows retained, so we have dropped almost 30% of our data.

### Hunch 2 : How about Helpfulness Numerator and Denominator 
We know that HelpfulnessNumerator <= HelpfulnessDenominator must hold, because the numerator is the number of users who found the review helpful and the denominator is the number of users who indicated whether they found it helpful or not. <br>

Hence, we will drop the few rows that violate this condition.
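As a small illustration of this consistency check, here is a sketch on made-up vote counts. The pairs and the helpfulness ratio are assumptions added for illustration; the notebook itself only applies the filter:

```python
# Hypothetical (HelpfulnessNumerator, HelpfulnessDenominator) pairs.
votes = [(3, 5), (0, 0), (7, 4), (2, 2)]

# A row is consistent only if helpful votes do not exceed total votes.
valid = [(num, den) for num, den in votes if num <= den]

# Helpfulness ratio, leaving reviews with no votes as None.
ratios = [num / den if den else None for num, den in valid]
print(valid)
print(ratios)
```

The pair (7, 4) is rejected because a review cannot have more helpful votes than total votes.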


In [None]:
final_df = final_df[final_df.HelpfulnessNumerator <= final_df.HelpfulnessDenominator]
final_df.shape

In [None]:
# Now we will check if dataset is balanced or not
# Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
final_df["Score"].value_counts()

So, we can clearly see that positive reviews are almost five times as numerous as negative reviews. That means we are dealing with an imbalanced dataset.

### Hunch 3: Does this dataset contain only rows about food? Outliers?
Let's find out whether some rows are unrelated to food, because the dataset can be corrupted. While going through the dataset I found that it also contains some book reviews, which we have to drop.
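The problem is that a simple keyword filter can also remove genuine food reviews, so the filter in the cell below is left disabled for now.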

In [None]:
"""import re
def apply_mask_summary(final_data,regex_string):
    mask = final_data.Summary.str.lower().str.contains(regex_string)
    final_data.drop(final_data[mask].index, inplace=True)

def apply_mask_text(final_data,regex_string):
    mask = final_data.Text.str.lower().str.contains(regex_string)
    final_data.drop(final_data[mask].index, inplace=True)

apply_mask_summary(final_df,re.compile(r"\bbook\b"))
apply_mask_summary(final_df,re.compile(r"\bread\b"))
apply_mask_text(final_df,re.compile(r"\bbook\b"))
apply_mask_text(final_df,re.compile(r"\bread\b"))
apply_mask_summary(final_df,re.compile(r"\bbooks\b"))
apply_mask_summary(final_df,re.compile(r"\breads\b"))
apply_mask_text(final_df,re.compile(r"\bbooks\b"))
apply_mask_text(final_df,re.compile(r"\breads\b"))
apply_mask_summary(final_df,re.compile(r"\breading\b"))
apply_mask_text(final_df,re.compile(r"\breading\b"))

final_df.shape """
# We will deal with this later: this filter would also remove food reviews containing phrases like "reading while eating x food".
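The false-positive problem can be seen on a few made-up sentences. This sketch combines the individual word-boundary masks from the disabled cell into one pattern; the sample sentences are hypothetical:

```python
import re

# One pattern covering the same words as the individual masks:
# book, books, read, reads, reading.
pattern = re.compile(r"\b(books?|reads?|reading)\b")

samples = [
    "a great book about the history of tea",          # genuinely a book review
    "i enjoy these cookies while reading the paper",  # food review that mentions reading
    "best bread i have ever tasted",                  # "bread" is not matched by \bread\b
]
flagged = [bool(pattern.search(s)) for s in samples]
print(flagged)
```

The second sentence is a food review, yet it would be dropped along with the book review, which is the reason the filter is not applied here.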

# [3] Text Pre-Processing
Now that most of the redundant data has been removed, we proceed with text preprocessing, which follows these steps:
1. Removal of HTML tags.
2. Removal of special characters, as they contribute nothing to a machine learning model.
3. Checking that words are alphabetic (words containing digits are dropped), as we want only English words.
4. Converting every word to lower case, so that there is no difference between Pasta and pasta.
5. Stop word removal.
6. Stemming of words; we will be using the Snowball stemmer, as it is more powerful than the Porter stemmer.

   So let's begin.
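Steps 1 to 5 can be sketched on a single made-up sentence using only the standard library. The regex tag stripper and the tiny stop-word set are stand-ins chosen for illustration (the notebook uses BeautifulSoup and a much longer list), and stemming is left out:

```python
import re

STOP = {"the", "is", "a", "this", "and"}  # tiny illustrative stop-word list

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)     # 1. strip HTML tags (regex stand-in)
    text = re.sub(r"[^A-Za-z]+", " ", text)  # 2-3. keep alphabetic characters only
    words = text.lower().split()             # 4. lower-case and tokenise
    return " ".join(w for w in words if w not in STOP)  # 5. stop-word removal

print(clean("This Pasta is <b>GREAT</b>, and the sauce is not bad!!"))
```

Note that "not" survives because it is deliberately absent from the stop-word set, so the negation in "not bad" is preserved.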

### Techniques for Text Pre-Processing 

In [None]:
# Printing some sample reviews to check for further redundancy in the data
for count in [0,2000,4000,6000,80000,100000]:
    view = final_df['Text'].values[count]
    print(view)
    print("_"*50)

As we can see, punctuation marks such as commas, dashes and forward slashes are redundant because they don't hold any information, so we will have to get rid of them. Let's start with the first step: removing URLs and HTML tags.

In [None]:
# remove urls from text python: https://stackoverflow.com/a/40823105/4084039
for count in [0,2000,4000,6000,80000,100000]:
    view = re.sub(r"http\S+","",final_df['Text'].values[count])
    print(view)
    print("-"*50)
# None of these sample reviews contains a URL, so the output shows no visible change here.

In [None]:
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
# Python web scraping library for getting texts from HTML tags
from bs4 import BeautifulSoup

for count in [0,2000,4000,6000,80000,100000]:
    view = BeautifulSoup(final_df['Text'].values[count],'lxml')
    text = view.get_text()
    print(text)
    print("-"*80)

In [None]:
# https://stackoverflow.com/a/47091490/4084039
# re.sub(pattern,repl,string)
def decontracted(phrase):
    #specific
    phrase = re.sub(r"won't","will not",phrase)
    phrase = re.sub(r"can\'t","can not",phrase)
    
    #general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [None]:
for count in [0,2000,4000,6000,80000,100000]:
    view = decontracted(phrase = final_df['Text'].values[count])
    print(view)
    print("-"*70)

Now we can see that words like won't and can't have been converted into will not, can not, etc.
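The order of the substitutions matters: the specific rules must run before the generic `n't` rule. A minimal sketch of why, using two hypothetical helper functions that are not part of the notebook:

```python
import re

def general_only(phrase):
    # Only the generic "n't" rule, without the special cases.
    return re.sub(r"n\'t", " not", phrase)

def specific_then_general(phrase):
    # Special cases first, then the generic rule (same order as decontracted).
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    return re.sub(r"n\'t", " not", phrase)

print(general_only("I won't buy it"))
print(specific_then_general("I won't buy it"))
```

Without the special cases, "won't" degrades to the non-word "wo not", which would then pollute the vocabulary.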

In [None]:
#remove words with numbers python: https://stackoverflow.com/a/18082370/4084039
for count in [0,2000,4000,6000,80000,100000]:
    # raw string avoids invalid-escape warnings for \S and \d
    view = re.sub(r"\S*\d\S*","",final_df['Text'].values[count]).strip()
    print(view)
    print("-"*70)

In [None]:
#remove special character: https://stackoverflow.com/a/5843547/4084039
for count in [0,2000,4000,6000,80000,100000]:
    view = re.sub("[^A-Za-z0-9]+"," ",final_df['Text'].values[count])
    print(view)
    print("-"*80)
    
#Here is a regex that matches a run of characters that are not letters or numbers:
#[^A-Za-z0-9]+ -- ^ inside the brackets [] negates the set, i.e. it matches any single character not listed

In [None]:
# https://gist.github.com/sebleier/554280
# we are removing these words from the stop words list: 'no', 'nor', 'not'
# <br /><br /> ==> after the above steps, we are getting "br br"
# so we are including 'br' in the stop words list
# if we had <br/> instead of <br />, these tags would have been removed in the 1st step

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [None]:
# We will combine all the above steps that we have learnt here -
from tqdm import tqdm
preprocessed_reviews=[]
for sentences in tqdm(final_df['Text'].values):
    sentence = re.sub(r"http\S+", "", sentences)
    sentence = BeautifulSoup(sentence, 'lxml').get_text()
    sentence = decontracted(sentence)
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()  # raw string avoids invalid-escape warnings
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)
    # https://gist.github.com/sebleier/554280
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentence)
preprocessed_reviews[1:5]

In [None]:
# Now we will pre-process the review summaries in exactly the same way

from tqdm import tqdm
preprocessed_summary=[]
for sentences in tqdm(final_df['Summary'].values):
    sentence = re.sub(r"http\S+", "", sentences)
    sentence = BeautifulSoup(sentence, 'lxml').get_text()
    sentence = decontracted(sentence)
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()  # raw string avoids invalid-escape warnings
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)
    # https://gist.github.com/sebleier/554280
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
    preprocessed_summary.append(sentence)
preprocessed_summary[1:5]

Now we are done with the text cleaning. It's time to convert all the relevant text into feature vectors.

# [4] Featurization

## [4.1] Bag of Words

In [None]:
#BOW
# CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
count_vect = CountVectorizer()
count_vect.fit(preprocessed_reviews)
# On scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
print("Some of the features are ",count_vect.get_feature_names()[:10])
print("-"*100)

final_vect = count_vect.transform(preprocessed_reviews)
print("The Type of final vectors created by BOW is ",type(final_vect))
print("The dimensions of final vectors is ",final_vect.get_shape())
print("The number of unique words are ",final_vect.get_shape()[1])


As we can see, we have now created 113898 dimensions; each dimension represents a unique word present in the corpus.
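To make the "one dimension per unique word" idea concrete, here is a standard-library sketch on two made-up documents. It illustrates the idea only; it is not how `CountVectorizer` is implemented, and it produces dense lists rather than a sparse matrix:

```python
from collections import Counter

docs = ["tasty pasta tasty sauce", "bad pasta"]

# Vocabulary: sorted unique words, one dimension per word.
vocab = sorted({w for d in docs for w in d.split()})

# Each document becomes a vector of word counts over the vocabulary.
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]
print(vocab)
print(vectors)
```

With a real corpus most entries are zero, which is why the vectorizer returns a sparse matrix.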

In [None]:
# let's check the elements of the sparse matrix, e.g. how many non-zero columns the 1st row has
a=final_vect[0,:].toarray()
np.count_nonzero(a,axis=1)

## [4.2] Bi-Grams and n-Grams

In [None]:
#bi-grams, tri-grams and n-grams

#avoid removing stop words like "not" before building n-grams || very important

count_vect = CountVectorizer(ngram_range=(1,2), min_df=10, max_features=10000)
final_bigram_counts = count_vect.fit_transform(preprocessed_reviews)
print("the type of count vectorizer output ",type(final_bigram_counts))
print("the shape of our text BOW vectorizer output ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])
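A short sketch of why negations should survive until the n-grams are built. The helper `unigrams_and_bigrams` is hypothetical and only mimics what `ngram_range=(1,2)` produces for whitespace-separated tokens:

```python
def unigrams_and_bigrams(text):
    words = text.split()
    # Pair each word with its successor to form bigrams.
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

print(unigrams_and_bigrams("not good"))
print(unigrams_and_bigrams("good"))  # what remains if "not" were removed as a stop word
```

If "not" is stripped first, a negative phrase and a positive one collapse into the same single feature.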

## [4.3] TF - IDF

In [None]:
"""
TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict',
strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, 
analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), 
max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, 
norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
"""

# `input` must be 'content' (the default), 'file' or 'filename'; the corpus itself is passed to fit()
tfid_vect = TfidfVectorizer(ngram_range=(1,2),min_df=5)
tfid_vect.fit(preprocessed_reviews)
# On scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
print("Some samples features are ",tfid_vect.get_feature_names()[:10])

final_tfid_vect = tfid_vect.transform(preprocessed_reviews)
print("the type of TF-IDF vectorizer output ",type(final_tfid_vect))
print("the shape of our text TF-IDF vectorizer output ",final_tfid_vect.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tfid_vect.get_shape()[1])
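For intuition, the weighting can be reproduced by hand on a toy corpus. This sketch assumes the settings listed in the signature above (`smooth_idf=True`, `norm='l2'`, raw term counts), for which the IDF is ln((1 + n) / (1 + df)) + 1. The documents and helper names are made up:

```python
import math

docs = [["tasty", "pasta"], ["bad", "pasta"], ["tasty", "tasty", "sauce"]]
n_docs = len(docs)

def idf(term):
    df = sum(term in d for d in docs)             # number of documents containing the term
    return math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF

def tfidf_vector(doc):
    weights = {t: doc.count(t) * idf(t) for t in set(doc)}  # raw count times IDF
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 normalisation
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(docs[0])
print(vec)
```

Rare terms such as "sauce" receive a larger IDF than common ones such as "pasta", and each document vector ends up with unit length.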

## [4.4] Word2Vec

In [None]:
# Train your own Word2Vec model using your own text corpus
# Tokenise each cleaned review into a list of words
list_of_sentance = [sentance.split() for sentance in preprocessed_reviews]

In [None]:
# The pre-trained Google Word2Vec model could not be used because of RAM limits on my Mac: it needs around 9 GB, but only 8 GB is available.
# Training on all 500K data points takes a long time; restricting the corpus to a few thousand reviews would shorten it.
# Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
# Note: gensim >= 4.0 renames the `size` argument to `vector_size`
w2vmodel = Word2Vec(list_of_sentance,size=50,min_count=3,workers=4)


In [None]:
print(w2vmodel.wv.most_similar('great'))
print('='*50)
print(w2vmodel.wv.most_similar('king'))

In [None]:
# Note: gensim >= 4.0 removes wv.vocab; use w2vmodel.wv.key_to_index there instead
w2v_words = list(w2vmodel.wv.vocab)
print("number of words that occurred at least 3 times ",len(w2v_words))
print("sample words ", w2v_words[0:20])

Let's check whether similar words have vectors that are nearly parallel. For unit vectors a and b, the basic math is:
* If a.b = 0 --> Orthogonal
* If a.b = 1 --> Parallel
* If a.b = -1 --> Anti-parallel
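The three cases above can be checked on small hand-made vectors. This is a standard-library sketch of cosine similarity (the dot product of the two unit vectors); the example vectors are arbitrary:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Dividing by both norms is the same as taking the dot product of unit vectors.
    return dot / (norm_a * norm_b)

print(cosine([1, 0], [0, 3]))    # orthogonal
print(cosine([2, 2], [5, 5]))    # parallel
print(cosine([1, 2], [-1, -2]))  # anti-parallel
```

Because of floating-point rounding the parallel and anti-parallel results may differ from 1 and -1 in the last digits.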

In [None]:
# checking for great and terrific
grt_w2v = w2vmodel.wv['great'] 
a = np.copy(grt_w2v)
a /= np.sqrt(grt_w2v.dot(grt_w2v)) # Calculating the unit vector 
terrific_w2v = w2vmodel.wv['terrific']
b = np.copy(terrific_w2v)
b /= np.sqrt(terrific_w2v.dot(terrific_w2v))
print("The dot product is ",np.dot(a,b))

We can clearly see that the dot product of the unit vectors is almost equal to 1.

In [None]:
# checking for great and library
grt_w2v = w2vmodel.wv['great'] 
a = np.copy(grt_w2v)
a /= np.sqrt(grt_w2v.dot(grt_w2v)) # Calculating the unit vector 
library = w2vmodel.wv['library']
b = np.copy(library)
b /= np.sqrt(library.dot(library))
print("The dot product is ",np.dot(a,b))

Here the dot product is closer to 0, which makes sense, as library and great are not related at all.

### [4.4.1] Avg w2v

In [None]:
# Let's create the sentence vectors: the average of the word vectors in each review
sent_vector = []
for sentence in tqdm(list_of_sentance):
    sent_vec = np.zeros(50) #50 dimensions we have mentioned above
    count = 0
    for word in sentence:
        if word in w2v_words:
            sent_vec += w2vmodel.wv[word]
            count +=1
    if count !=0:
        sent_vec /= count
    sent_vector.append(sent_vec)
        
print(len(sent_vector))
print(len(sent_vector[0]))
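The averaging logic can be illustrated on a toy example. The 2-dimensional vectors and the helper `average_vector` are hypothetical stand-ins for the trained model and the loop above:

```python
# Hypothetical 2-dimensional word vectors standing in for the trained model.
toy_vectors = {"tasty": [1.0, 0.0], "pasta": [0.0, 1.0]}

def average_vector(words, vectors, dim=2):
    total = [0.0] * dim
    count = 0
    for w in words:
        if w in vectors:  # skip out-of-vocabulary words
            total = [t + v for t, v in zip(total, vectors[w])]
            count += 1
    # A sentence with no known words keeps the zero vector.
    return [t / count for t in total] if count else total

print(average_vector(["tasty", "pasta", "zzz"], toy_vectors))
```

Only words present in the vocabulary contribute to the sum and to the divisor, so unknown words do not dilute the average.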
            

### [4.4.2]  TFIDF weighted W2V

In [None]:
# we already have the fitted TF-IDF object, i.e. tfid_vect
# we need to create a dictionary for faster lookup of idf values
# (on scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names())
idf_dict = dict(zip(tfid_vect.get_feature_names(),list(tfid_vect.idf_)))

In [None]:
idf_dict['terrific']

In [None]:
w2v_tfidf_sentvec = []
w2v_vocab = set(w2v_words)  # set membership is O(1); scanning a list for every word is very slow
for sentence in tqdm(list_of_sentance):
    sent_vec = np.zeros(50)  # same dimensionality as the Word2Vec model
    weighted_sum = 0
    for word in sentence:
        # idf_dict is keyed by the TF-IDF feature names, so this is the same check as before
        if word in idf_dict and word in w2v_vocab:
            tfidf = idf_dict[word] * (sentence.count(word)/len(sentence)) #idf * tf
            sent_vec += tfidf *  w2vmodel.wv[word] #calculating the weighted sum
            weighted_sum +=tfidf 
    if weighted_sum!=0:
        sent_vec /= weighted_sum # sentence vector = sum(tfidf * w2v) / sum(tfidf) for a particular document in the corpus
    w2v_tfidf_sentvec.append(sent_vec)
print(len(w2v_tfidf_sentvec))
print(len(w2v_tfidf_sentvec[0]))
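The weighting formula, sentence vector = sum(tfidf * w2v) / sum(tfidf), can be verified by hand on a toy example. The vectors, IDF values and helper name below are made up for illustration:

```python
toy_vectors = {"tasty": [1.0, 0.0], "pasta": [0.0, 1.0]}
toy_idf = {"tasty": 3.0, "pasta": 1.0}

def weighted_vector(words, vectors, idf, dim=2):
    total = [0.0] * dim
    weight_sum = 0.0
    for w in words:
        if w in vectors and w in idf:
            weight = idf[w] * words.count(w) / len(words)  # idf * tf
            total = [t + weight * v for t, v in zip(total, vectors[w])]
            weight_sum += weight
    # Normalise by the total weight; keep the zero vector if nothing matched.
    return [t / weight_sum for t in total] if weight_sum else total

print(weighted_vector(["tasty", "pasta"], toy_vectors, toy_idf))
```

Here "tasty" has three times the IDF of "pasta", so the sentence vector is pulled three times as strongly towards its direction.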

This could not be trained on all 500K examples because of RAM limits.

# [5] Applying TSNE