# Amazon Fine Food Reviews Analysis 


<b> The Amazon Fine Food reviews dataset consists of reviews of fine foods from Amazon </b>

| Attributes | Counts |
| --- | --- | 
| Number of reviews: | 568,454 | 
| Number of users: | 256,059 | 
| Number of products: | 74,258 |
| Timespan: | Oct 1999 - Oct 2012 | 

<b> Attribute Information: </b>

| | Attributes | Information |
| --- | --- | --- | 
|1.|Id |  | 
|2.| Product Id | unique identifier for the product | 
|3. | User Id | unique identifier for the user |
|4.| Profile Name |  | 
|5. | HelpfulnessNumerator |number of users who found the review helpful |
|6. | HelpfulnessDenominator | number of users who indicated whether they found the review helpful or not | 
|7. |  Score | rating between 1 and 5|
|8. | Time | timestamp for the review | 
|9. | Summary | brief summary of the review |
|10. | Text | text of the review | 



<b> Objective </b>

Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<b> How to determine if a review is positive or negative? </b>

<b> Ans: </b> We can use the score (rating). A rating of 4 or 5 is considered positive and a rating of 1 or 2 negative. A rating of 3 is treated as neutral and ignored. This is an approximate, proxy way of determining the polarity (positive/negative) of a review.
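
The labelling rule can be sketched as a small function. This is only an illustration: the name `label_polarity` is hypothetical and does not appear in the notebook.

```python
from typing import Optional


def label_polarity(score: int) -> Optional[str]:
    """Map a 1-5 star rating to a polarity label (hypothetical helper)."""
    if score in (4, 5):
        return "positive"
    if score in (1, 2):
        return "negative"
    # A rating of 3 is treated as neutral and left out of the analysis.
    return None


labels = [label_polarity(s) for s in (1, 2, 3, 4, 5)]
print(labels)
```

Returning `None` for a rating of 3 keeps the neutral case explicit, so the caller decides how to drop it.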

## Loading the data

The dataset is available in two forms:
1. .csv file
2. SQLite Database

To load the data, we use the SQLite database, as it makes the data easier to query and visualise.
<br> 

Since we only want the overall sentiment of each review (positive or negative), we deliberately ignore all reviews with a Score of 3. If the score is above 3, the review is labelled "positive"; otherwise it is labelled "negative".
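
As a rough sketch of how this filtering could be done directly in SQL, the snippet below builds a tiny in-memory table and derives the label with a `CASE` expression. The rows are made up and the `Label` alias is an assumption for illustration; the notebook itself reads `database.sqlite` and maps the score in pandas.

```python
import sqlite3

# In-memory database with a few made-up rows; the notebook itself
# connects to 'database.sqlite' instead.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "great"), (2, 3, "okay"), (3, 1, "awful"), (4, 4, "nice")],
)

# Skip neutral reviews (Score = 3) and derive a label inside the query.
rows = con.execute(
    """
    SELECT Id,
           CASE WHEN Score > 3 THEN 'positive' ELSE 'negative' END AS Label
    FROM Reviews
    WHERE Score != 3
    ORDER BY Id
    """
).fetchall()
con.close()

print(rows)
```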

## Importing the required packages

In [2]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re  # Python regular expressions tutorial: https://pymotw.com/2/re/
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os


# [1]. Reading Data

In [7]:
# Read the data from the SQLite database
con = sqlite3.connect('database.sqlite')

# Keep only positive and negative reviews, i.e. skip reviews with Score = 3.
# LIMIT 5000 returns the first 5000 matching rows; we restrict the sample
# to 5000 data points to keep the computation manageable.
filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 5000 """, con)


# Reviews with a score > 3 get a positive label (1); reviews with a score < 3 get a negative label (0)

def partition(x):
    if x < 3:
        return 0
    return 1

# Replace the 1-5 score with the binary label

actualScore = filtered_data["Score"]
positiveNegative = actualScore.map(partition)
filtered_data["Score"] = positiveNegative
print("The Number of data points in our data ", filtered_data.shape)
filtered_data.head()

The Number of data points in our data  (5000, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [8]:
display = pd.read_sql_query(""" 
SELECT UserId, ProductId, ProfileName, Time, Score, Text , COUNT(*)
FROM Reviews 
GROUP BY UserId 
HAVING COUNT(*)>1

""", con)

In [9]:
print(display.shape)
display.head()

(80668, 7)


Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
0,#oc-R115TNMSPFT9I7,B007Y59HVM,Breyton,1331510400,2,Overall its just OK when considering the price...,2
1,#oc-R11D9D7SHXIJB9,B005HG9ET0,"Louis E. Emory ""hoppy""",1342396800,5,"My wife has recurring extreme muscle spasms, u...",3
2,#oc-R11DNU2NBKQ23Z,B007Y59HVM,Kim Cieszykowski,1348531200,1,This coffee is horrible and unfortunately not ...,2
3,#oc-R11O5J5ZVQE25C,B005HG9ET0,Penguin Chick,1346889600,5,This will be the bottle that you grab from the...,3
4,#oc-R12KPBODL2B5ZD,B007OSBE1U,Christopher P. Presta,1348617600,1,I didnt like this coffee. Instead of telling y...,2


In [10]:
display[display["UserId"] == "AZY10LLTJ71NX"]

Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
80638,AZY10LLTJ71NX,B006P7E5ZI,"undertheshrine ""undertheshrine""",1334707200,5,I was recommended to try green tea extract to ...,5


In [11]:
display['COUNT(*)'].sum()

393063

# Exploratory Data Analysis

## [2]. Data Cleaning : Deduplication

In [12]:
display = pd.read_sql_query("""
SELECT * FROM Reviews 
WHERE Score !=3 AND UserId = "AR5J8UI46CURR"
ORDER BY ProductId
""", con)
display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


The table above shows that the reviews data contains many duplicate entries. These need to be removed to get unbiased results from the analysis.

As can be seen above, the same user has multiple reviews with identical values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, and Text. On closer analysis:

ProductId = B000HDL1RQ is LOACKER QUADRATINI VANILLA WAFERS, 8.82-ounce packages (pack of 8).

ProductId = B000HDL1RQ is LOACKER QUADRATINI LEMON WAFERS cookies, 8.82-ounce packages (pack of 8), and so on.

The analysis suggested that reviews identical in everything except ProductId belong to the same product in a different flavour or quantity. To reduce redundancy, we therefore eliminate rows that share these values.

To do this, we first sort the data by ProductId and then keep only the first of each group of identical reviews, deleting the others. In the example above, only the review for ProductId = B000HDL1RQ remains. This ensures that one representative review survives for each such product.
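
The sort-then-keep-first idea can be sketched in plain Python on a few made-up rows. The user ids, names, and review texts here are hypothetical; the notebook does the same thing with pandas on the real data.

```python
# Made-up rows: (ProductId, UserId, ProfileName, Time, Text)
reviews = [
    ("B000HDOPYC", "U1", "Alice", 1199577600, "DELICIOUS WAFERS"),
    ("B000HDL1RQ", "U1", "Alice", 1199577600, "DELICIOUS WAFERS"),
    ("B000PAQ75C", "U1", "Alice", 1199577600, "DELICIOUS WAFERS"),
    ("B000HDL1RQ", "U2", "Bob", 1300000000, "Too sweet for me"),
]

seen = set()
deduplicated = []
# Sort by ProductId so the smallest id is met first, then keep only the
# first row for each (UserId, ProfileName, Time, Text) combination.
for row in sorted(reviews, key=lambda r: r[0]):
    key = row[1:]  # everything except ProductId
    if key not in seen:
        seen.add(key)
        deduplicated.append(row)

print(deduplicated)
```

The three identical reviews by user `U1` collapse to the one with the smallest ProductId, while the unrelated review by `U2` is kept.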



In [13]:
# Sorting the data according to the ProductId in ascending order
sorted_data = filtered_data.sort_values('ProductId', axis = 0, ascending = True, inplace = False,
                                        kind = 'quicksort', na_position = "last")

In [14]:
# Deduplication of entries (a list keeps the column order explicit)
final = sorted_data.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep = "first", inplace = False)
final.shape

(4986, 10)

In [15]:
# Checking what percentage of the data remains

final["Id"].size / filtered_data["Id"].size * 100

99.72

### Observation :

In the two rows shown below, the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not possible in practice, so such rows are also removed from the analysis.


In [16]:
display = pd.read_sql_query(""" 
SELECT * 
FROM Reviews 
WHERE Score != 3 AND (Id = 44737 OR Id = 64422)
ORDER BY ProductId
""", con)

display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [17]:
final = final[final.HelpfulnessNumerator <= final.HelpfulnessDenominator]

In [18]:
# Before starting the next phase of preprocessing, let's see the number of entries left

print(final.shape)

# How many positive and negative reviews are present in our dataset?
final["Score"].value_counts()


(4986, 10)


1    4178
0     808
Name: Score, dtype: int64

# [3]. Text Preprocessing

Now that deduplication is finished, the data needs some preprocessing before we go further with the analysis and build the prediction model.

In the preprocessing phase we do the following, in this order:

1. Remove the HTML tags
2. Remove punctuation and a limited set of special characters such as , or . or #
3. Check that the word is made up of English letters and is not alphanumeric
4. Check that the word is longer than 2 characters (on the assumption that no adjective has only two letters)
5. Convert the word to lowercase
6. Remove stop words
7. Finally, apply Snowball stemming to the word (it was observed to work better than Porter stemming)<br>

After that, we collect the words used to describe positive and negative reviews.
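
A minimal sketch of steps 1 to 6 using only the standard library is shown below. The function name `clean_review`, the tiny stop word list, and the sample sentence are assumptions for illustration. Step 7 is left out because Snowball stemming needs a library such as NLTK.

```python
import re

# Tiny illustrative stop word list; the notebook uses a much longer one.
STOP_WORDS = {"the", "this", "is", "a", "and", "was", "it", "for", "my"}


def clean_review(text: str) -> list:
    """Apply steps 1-6 to one review and return the surviving words."""
    text = re.sub(r"<[^>]+>", " ", text)         # 1. strip HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # 2. drop punctuation
    words = []
    for word in text.split():
        if not word.isalpha():                   # 3. letters only
            continue
        if len(word) <= 2:                       # 4. longer than 2 characters
            continue
        word = word.lower()                      # 5. lowercase
        if word in STOP_WORDS:                   # 6. remove stop words
            continue
        words.append(word)
    return words


cleaned = clean_review("This is GREAT!<br />Bought 2 packs for my son, he loved it.")
print(cleaned)
```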

In [21]:
# Printing a few sample reviews (rows 0, 500 and 1000):
sent_0 = final["Text"].values[0]
print(sent_0)
print("\n","*"*50)

sent_500 = final["Text"].values[500]
print(sent_500)
print("\n", "*"*50)

sent_1000 = final["Text"].values[1000]
print(sent_1000)
print("\n", "*"*50)



Why is this $[...] when the same product is available for $[...] here?<br />http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.

 **************************************************
This is a very good snack that I feel great about offering my 10 month old.  She can easily  pick up the small pieces and definitely enjoys the apple cinnamon taste.  My 5 year old loves them as well!  I would definitely recommend these to anyone with small (or larger) kids.  :)

 **************************************************
I recently tried this flavor/brand and was surprised at how delicious these chips are.  The best thing was that there were a lot of "brown" chips in the bsg (my favorite), so I bought some more through amazon and shared with family and friends.  I am a little disappointed that there are not, so far, very many brown chips in these bags, but the f

In [22]:
# Removing urls from text 
print(sent_0)
sent_0 = re.sub(r"http\S+","", sent_0)
sent_500 = re.sub(r"http\S+","",sent_500)
sent_1000 = re.sub(r"http\S+","",sent_1000)

print(sent_0)

Why is this $[...] when the same product is available for $[...] here?<br />http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
Why is this $[...] when the same product is available for $[...] here?<br /> /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
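
Note in the output above that `http\S+` also swallows the `<br` that follows the URL, leaving a stray ` />` behind. One possible variant, sketched here on a made-up string with a placeholder URL, stops the match at whitespace or `<` so the tag stays intact for the HTML-stripping step. This is a suggestion, not what the notebook does.

```python
import re

text = "available here?<br />http://www.example.com/item<br /><br />The traps work."

# Pattern used in the notebook: also removes the '<br' that follows the URL.
greedy = re.sub(r"http\S+", "", text)

# Variant that stops at whitespace or '<', leaving the tag intact.
bounded = re.sub(r"http[^\s<]+", "", text)

print(greedy)
print(bounded)
```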


In [24]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(sent_0, "lxml")
text = soup.get_text()
print(text)
print("\n","*"*50)

soup = BeautifulSoup(sent_500, "lxml")
text = soup.get_text()
print(text)
print("\n","*"*50)

soup = BeautifulSoup(sent_1000, "lxml")
text = soup.get_text()
print(text)
print("\n", "*"*50)

Why is this $[...] when the same product is available for $[...] here? />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.

 **************************************************
This is a very good snack that I feel great about offering my 10 month old.  She can easily  pick up the small pieces and definitely enjoys the apple cinnamon taste.  My 5 year old loves them as well!  I would definitely recommend these to anyone with small (or larger) kids.  :)

 **************************************************
I recently tried this flavor/brand and was surprised at how delicious these chips are.  The best thing was that there were a lot of "brown" chips in the bsg (my favorite), so I bought some more through amazon and shared with family and friends.  I am a little disappointed that there are not, so far, very many brown chips in these bags, but the flavor is still very good.  I like them better than the yogurt and green onion fl

In [25]:
import re

def decontracted(phrase):
    # Specific 
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    
    # General
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    
    return phrase



In [28]:
print("Before calling Decontracted function")
print(sent_500)
print("\n", "*"*50)
sent_500 = decontracted(sent_500)
print("\nAfter Calling Decontracted function")
print("\n",sent_500)


Before calling Decontracted function
This is a very good snack that I feel great about offering my 10 month old.  She can easily  pick up the small pieces and definitely enjoys the apple cinnamon taste.  My 5 year old loves them as well!  I would definitely recommend these to anyone with small (or larger) kids.  :)

 **************************************************

After Calling Decontracted function

 This is a very good snack that I feel great about offering my 10 month old.  She can easily  pick up the small pieces and definitely enjoys the apple cinnamon taste.  My 5 year old loves them as well!  I would definitely recommend these to anyone with small (or larger) kids.  :)
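
The sample review above contains no contractions, so the text is unchanged. To see the expansion at work, here is a compact, table-driven sketch with the general negation written as `n't`, applied to a made-up sentence. The name `expand_contractions` is hypothetical.

```python
import re

# (pattern, replacement) pairs, applied in order; specific cases come first.
CONTRACTIONS = [
    (r"won\'t", "will not"),
    (r"can\'t", "can not"),
    (r"n\'t", " not"),
    (r"\'re", " are"),
    (r"\'s", " is"),
    (r"\'d", " would"),
    (r"\'ll", " will"),
    (r"\'t", " not"),
    (r"\'ve", " have"),
    (r"\'m", " am"),
]


def expand_contractions(phrase: str) -> str:
    for pattern, replacement in CONTRACTIONS:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase


example = "I won't buy it again, they're stale and it's overpriced."
expanded = expand_contractions(example)
print(expanded)
```

One caveat of this rule set is that `'s` is always expanded to " is", so possessives such as "Amazon's" are expanded as well.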


In [29]:
# removing words with numbers

sent_0 = re.sub(r"\S*\d\S*", "", sent_0).strip()
print(sent_0)

Why is this $[...] when the same product is available for $[...] here?<br /> /><br />The Victor  and  traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.


In [33]:
# removing special character

sent_0 = re.sub('[^A-Za-z0-9]+', ' ', sent_0)
print(sent_0)

Why is this when the same product is available for here br br The Victor and traps are unreal of course total fly genocide Pretty stinky but only right nearby 


In [34]:
# https://gist.github.com/sebleier/554280
# We remove 'no', 'nor' and 'not' from the stop words list so that negations are preserved.
# <br /><br /> ==> after the above steps, we are left with "br br",
# so we add 'br' to the stop words list.
# Had the tags been written as <br/> instead of <br />, they would have been removed in the first step.

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [35]:
## Combining all the steps above

from tqdm import tqdm

preprocessed_reviews = []

# tqdm prints a progress bar
for sentence in tqdm(final['Text'].values):
    sentence = re.sub(r"http\S+", "", sentence)
    sentence = BeautifulSoup(sentence, "lxml").get_text()
    sentence = decontracted(sentence)
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()
    sentence = re.sub('[^A-Za-z0-9]+', ' ', sentence)
    
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentence.strip())

In [None]:
preprocessed_reviews