# Amazon Home Kitchen Product Reviews Analysis


Data Source: http://jmcauley.ucsd.edu/data/amazon/index_2014.html

The Amazon Home Kitchen Product Reviews dataset consists of reviews of home and kitchen products from the Amazon website.<br>

Number of reviews: 551,682<br>
Timespan: May 1996 - July 2014<br>
Number of Attributes/Columns in data: 9

#### Attribute Information:

1. reviewerID - unique identifier of the reviewer
2. asin - unique identifier of the product
3. reviewerName - name of the reviewer
4. helpful - helpfulness votes, stored as [numerator, denominator]
   numerator - number of users who found the review helpful
   denominator - number of users who indicated whether they found the review helpful or not
5. reviewText - text of the review
6. overall - the overall rating given by the reviewer
7. summary - brief summary of the review
8. unixReviewTime - Unix timestamp of the review
9. reviewTime - date of the review

#### Objective
* Determine the polarity of each review, positive (rating of 4 or 5) or negative (rating of 1 or 2), from the review text written by the user.

#### Ground truth 
* We use the overall score as the ground truth for each review. A score of 4 or 5 marks the review as positive, and a score of 1 or 2 marks it as negative. Reviews with a rating of 3 are ignored.

# Loading the data

The data is available as a gzipped .json file at the data source link. We converted it to a .csv file using the code below.

import pandas as pd
import gzip
import json

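def parse(path):
  # yield one review (a dict) per line of the gzip-compressed JSON file
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield json.loads(l)

def getDF(path):
  # collect the reviews in a dict keyed by row number, then build the DataFrame
  df = {}
  for i, d in enumerate(parse(path)):
    df[i] = d
  return pd.DataFrame.from_dict(df, orient='index')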

df = getDF('reviews_Home_and_Kitchen_5.json.gz')

df.to_csv('amazon_home_kitchen_product_data', encoding='utf-8', index=False)
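As an alternative sketch, pandas can read gzip-compressed, line-delimited JSON directly, assuming every line of the file is strict JSON (the `parse` helper above makes the same assumption by calling `json.loads`). The file name `sample_reviews.json.gz` and its two records are made up for illustration, and pandas may infer column types differently on this route, so the dtypes are worth checking.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Two made-up records in the same one-JSON-object-per-line layout.
records = [
    {"reviewerID": "R1", "asin": "P1", "overall": 5.0, "reviewText": "works well"},
    {"reviewerID": "R2", "asin": "P1", "overall": 2.0, "reviewText": "broke quickly"},
]

# Write them to a temporary gzip file (hypothetical name, for illustration only).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_reviews.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

# lines=True treats each line as one JSON object; the compression is stated explicitly.
sample_df = pd.read_json(sample_path, lines=True, compression="gzip")
print(sample_df.shape)
```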

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('amazon_home_kitchen_product_data')

In [2]:
# displaying the first few rows of data
data.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,APYOBQE6M18AA,615391206,Martin Schwartz,"[0, 0]",My daughter wanted this book and the price on ...,5.0,Best Price,1382140800,"10 19, 2013"
1,A1JVQTAGHYOL7F,615391206,Michelle Dinh,"[0, 0]",I bought this zoku quick pop for my daughterr ...,5.0,zoku,1403049600,"06 18, 2014"
2,A3UPYGJKZ0XTU4,615391206,mirasreviews,"[26, 27]",There is no shortage of pop recipes available ...,4.0,"Excels at Sweet Dessert Pops, but Falls Short ...",1367712000,"05 5, 2013"
3,A2MHCTX43MIMDZ,615391206,"M. Johnson ""Tea Lover""","[14, 18]",This book is a must have if you get a Zoku (wh...,5.0,Creative Combos,1312416000,"08 4, 2011"
4,AHAI85T5C2DH3,615391206,PugLover,"[0, 0]",This cookbook is great. I have really enjoyed...,4.0,A must own if you own the Zoku maker...,1402099200,"06 7, 2014"


In [3]:
# The shape of the data before filtering the score rating 3
data.shape

(551682, 9)

In [4]:
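# filtering the data by removing reviews with an overall rating of 3
# .copy() gives an independent frame, so later column assignments do not act on a view of data
filtered_data = data[data.overall != 3].copy()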

In [5]:
# The shape of the data after filtering the score rating 3
filtered_data.shape

(506623, 9)

In [6]:
# changing the overall rating column to positive and negative categories
import warnings
warnings.filterwarnings("ignore")

def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'

actualScore = filtered_data['overall']
positiveNegative = actualScore.map(partition) 
filtered_data['overall'] = positiveNegative

# changing the overall column name to score for better understanding
filtered_data = filtered_data.rename(columns={"overall": "score", "asin":"productId"})
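A vectorised sketch of the same labelling rule, assuming a toy frame with made-up ratings. Taking an explicit `.copy()` after filtering keeps the new column from being written to a view of the original frame, and `np.where` applies the "below 3 is negative" rule to the whole column at once. The names `toy` and `toy_filtered` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up ratings standing in for the `overall` column.
toy = pd.DataFrame({"overall": [5.0, 4.0, 3.0, 2.0, 1.0]})

# Drop the neutral rating and take an explicit copy of the result.
toy_filtered = toy[toy["overall"] != 3].copy()

# Ratings below 3 become 'negative'; the remaining ones (4 or 5) become 'positive'.
toy_filtered["score"] = np.where(toy_filtered["overall"] < 3, "negative", "positive")
print(toy_filtered["score"].tolist())
```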

In [7]:
# displaying the first few rows of filtered data
filtered_data.head()


Unnamed: 0,reviewerID,productId,reviewerName,helpful,reviewText,score,summary,unixReviewTime,reviewTime
0,APYOBQE6M18AA,615391206,Martin Schwartz,"[0, 0]",My daughter wanted this book and the price on ...,positive,Best Price,1382140800,"10 19, 2013"
1,A1JVQTAGHYOL7F,615391206,Michelle Dinh,"[0, 0]",I bought this zoku quick pop for my daughterr ...,positive,zoku,1403049600,"06 18, 2014"
2,A3UPYGJKZ0XTU4,615391206,mirasreviews,"[26, 27]",There is no shortage of pop recipes available ...,positive,"Excels at Sweet Dessert Pops, but Falls Short ...",1367712000,"05 5, 2013"
3,A2MHCTX43MIMDZ,615391206,"M. Johnson ""Tea Lover""","[14, 18]",This book is a must have if you get a Zoku (wh...,positive,Creative Combos,1312416000,"08 4, 2011"
4,AHAI85T5C2DH3,615391206,PugLover,"[0, 0]",This cookbook is great. I have really enjoyed...,positive,A must own if you own the Zoku maker...,1402099200,"06 7, 2014"


# Exploratory Data Analysis

In [8]:
# splitting the helpful column into helpfulness numerator and helpfulness denominator
helpfulness_num = []
helpfulness_denom = []
for i in filtered_data['helpful']:
    m = i[1:-1]              # drop the surrounding square brackets
    k = m.split(',')         # "26, 27" -> ["26", " 27"]
    helpfulness_num.append(k[0])
    helpfulness_denom.append(k[1])

filtered_data['helpfulnessNumerator'] = helpfulness_num          # adding helpfulness numerator column
filtered_data['helpfulnessDenominator'] = helpfulness_denom      # adding helpfulness denominator column

filtered_data['helpfulnessNumerator'] = pd.to_numeric(filtered_data["helpfulnessNumerator"])
filtered_data['helpfulnessDenominator'] = pd.to_numeric(filtered_data["helpfulnessDenominator"])

del filtered_data['helpful']                                     # deleting helpful column
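The same split can be sketched without an explicit loop, assuming every value in `helpful` is a string of the form `[numerator, denominator]` as in the rows shown earlier. The toy frame below is made up, and `str.extract` with two capture groups is one possible way to pull out the two integers.

```python
import pandas as pd

# Made-up rows in the same "[numerator, denominator]" string format.
toy = pd.DataFrame({"helpful": ["[0, 0]", "[26, 27]", "[14, 18]"]})

# Capture the two integers inside the brackets and convert them to int.
parts = toy["helpful"].str.extract(r"\[(\d+),\s*(\d+)\]").astype(int)
toy["helpfulnessNumerator"] = parts[0]
toy["helpfulnessDenominator"] = parts[1]
toy = toy.drop(columns="helpful")
print(toy)
```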

In [9]:
# displaying the first few rows of filtered data
filtered_data.head()

Unnamed: 0,reviewerID,productId,reviewerName,reviewText,score,summary,unixReviewTime,reviewTime,helpfulnessNumerator,helpfulnessDenominator
0,APYOBQE6M18AA,615391206,Martin Schwartz,My daughter wanted this book and the price on ...,positive,Best Price,1382140800,"10 19, 2013",0,0
1,A1JVQTAGHYOL7F,615391206,Michelle Dinh,I bought this zoku quick pop for my daughterr ...,positive,zoku,1403049600,"06 18, 2014",0,0
2,A3UPYGJKZ0XTU4,615391206,mirasreviews,There is no shortage of pop recipes available ...,positive,"Excels at Sweet Dessert Pops, but Falls Short ...",1367712000,"05 5, 2013",26,27
3,A2MHCTX43MIMDZ,615391206,"M. Johnson ""Tea Lover""",This book is a must have if you get a Zoku (wh...,positive,Creative Combos,1312416000,"08 4, 2011",14,18
4,AHAI85T5C2DH3,615391206,PugLover,This cookbook is great. I have really enjoyed...,positive,A must own if you own the Zoku maker...,1402099200,"06 7, 2014",0,0


## Data cleaning 

### Check for duplicate values

We need to remove duplicate reviews (if any), because repeated reviews would bias the results of the analysis.

In [10]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('productId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [11]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset=["reviewerID","reviewerName","unixReviewTime","reviewText"], keep='first', inplace=False)
final.shape

(506623, 10)

The number of rows is unchanged after deduplication, which shows that there are no duplicate reviews.
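A small sketch of the same check on made-up data, where one review is repeated on purpose. `duplicated` counts the repeats directly, which avoids comparing shapes by eye. The column names mirror the notebook's deduplication key and the values are invented.

```python
import pandas as pd

# Made-up reviews; the last row repeats the first one exactly.
toy = pd.DataFrame({
    "reviewerID": ["R1", "R2", "R1"],
    "reviewerName": ["Ann", "Bob", "Ann"],
    "unixReviewTime": [100, 200, 100],
    "reviewText": ["great pan", "too small", "great pan"],
})

key_columns = ["reviewerID", "reviewerName", "unixReviewTime", "reviewText"]

# duplicated() flags every repeat after the first occurrence of a key.
n_duplicates = int(toy.duplicated(subset=key_columns).sum())
deduplicated = toy.drop_duplicates(subset=key_columns, keep="first")
print(n_duplicates, len(toy), len(deduplicated))
```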

### Check whether helpfulnessNumerator is higher than helpfulnessDenominator for any data point

In [12]:
final[final.helpfulnessNumerator>final.helpfulnessDenominator]

Unnamed: 0,reviewerID,productId,reviewerName,reviewText,score,summary,unixReviewTime,reviewTime,helpfulnessNumerator,helpfulnessDenominator


### Observation 
There are no data points where helpfulnessNumerator is higher than helpfulnessDenominator

In [13]:
# finding the number of positive and negative reviews
final['score'].value_counts()

positive    455204
negative     51419
Name: score, dtype: int64

In [14]:
final.dtypes

reviewerID                object
productId                 object
reviewerName              object
reviewText                object
score                     object
summary                   object
unixReviewTime             int64
reviewTime                object
helpfulnessNumerator       int64
helpfulnessDenominator     int64
dtype: object

 # Text Preprocessing: Stemming, stop-word removal and Lemmatization.

In [15]:
# print the first review that contains a '>' character (a rough check for leftover HTML tags)
final['reviewText'] = final["reviewText"].astype('str')
import re
i = 0
for sent in final['reviewText'].values:
    if len(re.findall('.*?>', sent)):
        print(i)
        print(sent)
        break
    i += 1

6581
About the temperatures it can handle (which for some reason aren't on the product description; you'd think this would be a key bullet point):Max recommended temp: 392F.  400F seems to be the extreme max according to the packaging, but > 392F will deteriorate the life span faster, says the instructions.  I've seen it go as low as 39F for refridgerated water; it could probably handle lower.  There's a switch on the bottom to measure in Celcius.  392F may be low for some people, so be warned.  I wonder if other reviewers with problems have been pushing this limit?  If the probe is off by as much as 10 degrees as some claim, that might explain the short lifespans.I've liked this probe so far, although admittedly I haven't used it yet for its primary purpose (measuring internal meat temperatures).  Instead I've used it as a kitchen timer and oven thermometer to monitor the oven air temps when proofing dough in it (don't want to kill my yeast!) by lodging the probe in a rack.  I haven't
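The pattern `.*?>` matches any text that ends in `>`, so it also fires on a comparison such as `> 392F` in the review printed above. The sketch below contrasts it with `<.*?>`, the pattern used by the notebook's `cleanhtml` function, on two made-up strings.

```python
import re

loose_pattern = re.compile(r".*?>")   # any text followed by '>'
tag_pattern = re.compile(r"<.*?>")    # '<', then the shortest run up to '>'

plain = "400F is the max, but > 392F shortens the life span"
with_tag = "Works well.<br />Easy to clean."

# The loose pattern matches the plain sentence; the tag pattern does not.
print(bool(loose_pattern.search(plain)), bool(tag_pattern.search(plain)))
# Only the tag pattern isolates the actual tag.
print(tag_pattern.findall(with_tag))
```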

In [16]:
!pip install nltk
import nltk
nltk.download('stopwords') # fetches the stop-word list if it is not already present
from nltk.corpus import stopwords
stop = set(stopwords.words('english')) # set of English stop words
sno = nltk.stem.SnowballStemmer('english') # initialising the Snowball stemmer

def cleanhtml(sentence): # replace anything that looks like an HTML tag with a space
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext
def cleanpunc(sentence): # strip punctuation and special characters
    # inside [...] the '|' is a literal character, so it is listed only once
    cleaned = re.sub(r'[?!\'"#|]', r'', sentence)   # delete ? ! ' " # |
    cleaned = re.sub(r'[.,)(|/]', r' ', cleaned)    # turn . , ) ( | / into spaces
    return cleaned
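A short usage sketch of the two helpers on a made-up review. The functions are restated so the block runs on its own. Because apostrophes are deleted rather than replaced by a space, `Don't` becomes `Dont`, which is why a token such as `dont` can appear among the most frequent words later on.

```python
import re

def cleanhtml(sentence):
    # Replace anything that looks like an HTML tag with a space.
    return re.sub(r"<.*?>", " ", sentence)

def cleanpunc(sentence):
    # Delete ? ! ' " # and |, then turn . , ) ( / into spaces.
    cleaned = re.sub(r"[?!'\"#|]", "", sentence)
    return re.sub(r"[.,)(/]", " ", cleaned)

sample = "Don't buy!<br/>Poor quality (really)."
step1 = cleanhtml(sample)
step2 = cleanpunc(step1)
print(step1)
print(step2.split())
```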



In [17]:
# Apply the pre-processing steps to every review: strip HTML tags and punctuation,
# drop non-alphabetic words, words of one or two characters and stop words, then stem

i=0
str1=' '
final_string=[]
all_positive_words=[] # store words from +ve reviews here
all_negative_words=[] # store words from -ve reviews here.
s=''
for sent in final['reviewText'].values:
    filtered_sentence=[]
    #print(sent);
    sent=cleanhtml(sent) # remove HTML tags using the cleanhtml function
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if cleaned_words.isalpha() and len(cleaned_words) > 2:
                if(cleaned_words.lower() not in stop):
                    s=(sno.stem(cleaned_words.lower())).encode('utf8')
                    filtered_sentence.append(s)
                    if (final['score'].values)[i] == 'positive': 
                        all_positive_words.append(s) #list of all words used to describe positive reviews
                    if(final['score'].values)[i] == 'negative':
                        all_negative_words.append(s) #list of all words used to describe negative reviews
                else:
                    continue
            else:
                continue 
    #print(filtered_sentence)
    str1 = b" ".join(filtered_sentence) #final string of cleaned words
    #print("***********************************************************************")
    
    final_string.append(str1)
    i+=1
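The per-review part of the loop above can be sketched as one function. This is a simplified restatement under assumptions, not the notebook's exact code: the `stem` argument defaults to an identity function so the block runs without NLTK (in the notebook one would pass `sno.stem`), and `toy_stop_words` is a tiny made-up stop list standing in for `stop`.

```python
import re

def clean_review(text, stop_words, stem=lambda word: word):
    """Return the cleaned tokens of one review (sketch of the loop above)."""
    text = re.sub(r"<.*?>", " ", text)        # drop HTML-like tags
    text = re.sub(r"[?!'\"#|]", "", text)     # delete these characters
    text = re.sub(r"[.,)(/]", " ", text)      # turn these into spaces
    tokens = []
    for word in text.split():
        lowered = word.lower()
        # Keep alphabetic words longer than two characters that are not stop words.
        if word.isalpha() and len(word) > 2 and lowered not in stop_words:
            tokens.append(stem(lowered))
    return tokens

toy_stop_words = {"the", "this", "and", "for"}   # tiny made-up stop list
tokens = clean_review("This pan is GREAT for eggs!<br/>Love it.", toy_stop_words)
print(tokens)
```

Joining the returned tokens with a space gives the kind of string stored per review in `final_string`.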

In [18]:
final['CleanedText']=final_string #adding a column of CleanedText which displays the data after pre-processing of the review 
final['CleanedText']=final['CleanedText'].str.decode("utf-8")

In [19]:
# displaying the first few rows of the column after adding cleanedText column
final.head()

Unnamed: 0,reviewerID,productId,reviewerName,reviewText,score,summary,unixReviewTime,reviewTime,helpfulnessNumerator,helpfulnessDenominator,CleanedText
0,APYOBQE6M18AA,615391206,Martin Schwartz,My daughter wanted this book and the price on ...,positive,Best Price,1382140800,"10 19, 2013",0,0,daughter want book price amazon best alreadi t...
1,A1JVQTAGHYOL7F,615391206,Michelle Dinh,I bought this zoku quick pop for my daughterr ...,positive,zoku,1403049600,"06 18, 2014",0,0,bought zoku quick pop daughterr zoku quick mak...
2,A3UPYGJKZ0XTU4,615391206,mirasreviews,There is no shortage of pop recipes available ...,positive,"Excels at Sweet Dessert Pops, but Falls Short ...",1367712000,"05 5, 2013",26,27,shortag pop recip avail free web purchas zoku ...
3,A2MHCTX43MIMDZ,615391206,"M. Johnson ""Tea Lover""",This book is a must have if you get a Zoku (wh...,positive,Creative Combos,1312416000,"08 4, 2011",14,18,book must get zoku also high recommend larg va...
4,AHAI85T5C2DH3,615391206,PugLover,This cookbook is great. I have really enjoyed...,positive,A must own if you own the Zoku maker...,1402099200,"06 7, 2014",0,0,cookbook great realli enjoy review recip sure ...


# Bag of Words (BoW)

In [20]:
#BoW
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer() #in scikit-learn
final_counts = count_vect.fit_transform(final['CleanedText'].values)
print("the type of count vectorizer ",type(final_counts))
print("the shape of out text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (506623, 109473)
the number of unique words  109473
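To make the shape above concrete, here is a minimal sketch on a made-up three-document corpus. Each row of the matrix is one document, each column is one distinct word, and each entry is the number of times that word occurs in that document.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["zoku quick pop", "quick pop maker", "love zoku zoku"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_corpus)

# One row per document, one column per distinct word.
print(toy_counts.shape)
# vocabulary_ maps each word to its column index.
print(sorted(toy_vect.vocabulary_.items(), key=lambda kv: kv[1]))
print(toy_counts.toarray())
```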


# Bi-Grams and n-Grams.

In [21]:
freq_dist_positive=nltk.FreqDist(all_positive_words)
freq_dist_negative=nltk.FreqDist(all_negative_words)
print("Most Common Positive Words : ",freq_dist_positive.most_common(20))
print("Most Common Negative Words : ",freq_dist_negative.most_common(20))

Most Common Positive Words :  [(b'use', 376632), (b'one', 240634), (b'like', 180411), (b'great', 167067), (b'make', 151994), (b'work', 148127), (b'get', 144973), (b'well', 142686), (b'time', 122941), (b'good', 122636), (b'easi', 121049), (b'would', 120577), (b'love', 119267), (b'clean', 114567), (b'look', 111218), (b'need', 94779), (b'realli', 93427), (b'nice', 92397), (b'littl', 90573), (b'set', 90205)]
Most Common Negative Words :  [(b'use', 41371), (b'one', 33417), (b'get', 23222), (b'would', 22322), (b'like', 21090), (b'work', 20304), (b'time', 19202), (b'product', 15966), (b'make', 15301), (b'dont', 13662), (b'look', 13294), (b'even', 13170), (b'good', 12995), (b'tri', 12955), (b'coffe', 12286), (b'water', 12190), (b'well', 11831), (b'back', 11769), (b'review', 11240), (b'thing', 11187)]


In [22]:
# unigrams and bigrams via ngram_range=(1,2); larger n-grams are requested the same way

# note: stop words such as "not" should ideally be kept when building n-grams, since removing them loses negations
count_vect = CountVectorizer(ngram_range=(1,2) ) #in scikit-learn
final_bigram_counts = count_vect.fit_transform(final['CleanedText'].values)
print("the type of count vectorizer ",type(final_bigram_counts))
print("the shape of out text BOW vectorizer ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (506623, 4098533)
the number of unique words including both unigrams and bigrams  4098533
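A minimal sketch of why the feature count grows so sharply, and why the comment in the cell above cautions against dropping "not". With unigrams alone, the two made-up reviews below differ only by the word `not`. With `ngram_range=(1, 2)` the pair `not good` becomes a feature of its own, provided `not` has survived stop-word removal.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["not good", "good"]

unigram_vect = CountVectorizer(ngram_range=(1, 1)).fit(toy_corpus)
bigram_vect = CountVectorizer(ngram_range=(1, 2)).fit(toy_corpus)

# Unigram features only.
print(sorted(unigram_vect.vocabulary_))
# Unigram features plus every pair of adjacent words.
print(sorted(bigram_vect.vocabulary_))
```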


# TF-IDF

In [28]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tf_idf = tf_idf_vect.fit_transform(final['CleanedText'].values)
print("the type of count vectorizer ",type(final_tf_idf))
print('**************************************************************************')
print("the shape of out text TFIDF vectorizer ",final_tf_idf.get_shape())
print('**************************************************************************')
print("the number of unique words including both unigrams and bigrams ", final_tf_idf.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
**************************************************************************
the shape of out text TFIDF vectorizer  (506623, 4098533)
**************************************************************************
the number of unique words including both unigrams and bigrams  4098533
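The matrix has the same shape as the bigram count matrix, but the entries are weights rather than raw counts. A small sketch on a made-up two-document corpus, assuming scikit-learn's default `TfidfVectorizer` settings (smoothed IDF and L2 row normalisation): a word that occurs in fewer documents receives a larger weight, and every row has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["zoku pop", "zoku maker"]

toy_tfidf = TfidfVectorizer().fit(toy_corpus)
weights = toy_tfidf.transform(toy_corpus).toarray()
vocab = toy_tfidf.vocabulary_

# 'zoku' is in both documents and 'pop' in only one, so 'pop' gets the larger weight.
print(weights[0, vocab["pop"]] > weights[0, vocab["zoku"]])
# With the default norm='l2', every row has unit Euclidean length.
print(np.linalg.norm(weights, axis=1))
```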


In [29]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
features = tf_idf_vect.get_feature_names()
print("some sample features(unique words in the corpus)",features[100000:100010])

some sample features(unique words in the corpus) ['alway findpiec', 'alway fine', 'alway finest', 'alway finger', 'alway fingertip', 'alway finicki', 'alway finish', 'alway fire', 'alway firm', 'alway first']


In [30]:
# source: https://buhrmann.github.io/tfidf-analysis.html
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

top_tfidf = top_tfidf_feats(final_tf_idf[1,:].toarray()[0],features,25)

In [31]:
top_tfidf

Unnamed: 0,feature,tfidf
0,zoku quick,0.478449
1,zoku,0.409519
2,pop daughterr,0.271762
3,daughterr,0.263565
4,daughterr zoku,0.263565
5,quick maker,0.257749
6,bought zoku,0.249552
7,quick pop,0.203686
8,love fun,0.199884
9,maker love,0.180019
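To show what `top_tfidf_feats` does step by step, here is a sketch on a made-up row of four weights. `np.argsort` returns indices in ascending order of value, `[::-1]` reverses them to descending order, and `[:top_n]` keeps the largest ones. The function is restated so the block runs on its own, and the feature names are invented.

```python
import numpy as np
import pandas as pd

def top_tfidf_feats(row, features, top_n=25):
    """Return the top_n largest values in `row` with their feature names."""
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    return pd.DataFrame(top_feats, columns=["feature", "tfidf"])

toy_row = np.array([0.1, 0.7, 0.0, 0.4])          # made-up TF-IDF weights
toy_features = ["pan", "zoku", "lid", "maker"]    # made-up feature names
toy_top = top_tfidf_feats(toy_row, toy_features, top_n=2)
print(toy_top)
```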


# Word2Vec

In [32]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

# we load the pretrained Google News word vectors (300 dimensions)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300-002.bin', binary=True)

In [33]:
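# model is already a KeyedVectors object; gensim 4 dropped its .wv attribute, so there use model['computer'], model.similarity(...), model.most_similar(...)
print("the vector representation of word 'computer'",model.wv['computer'])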
print("the similarity between the words 'woman' and 'man'",model.wv.similarity('woman', 'man'))
print("the most similar words to the word 'woman'",model.wv.most_similar('woman'))

the vector representation of word 'computer' [ 1.07421875e-01 -2.01171875e-01  1.23046875e-01  2.11914062e-01
 -9.13085938e-02  2.16796875e-01 -1.31835938e-01  8.30078125e-02
  2.02148438e-01  4.78515625e-02  3.66210938e-02 -2.45361328e-02
  2.39257812e-02 -1.60156250e-01 -2.61230469e-02  9.71679688e-02
 -6.34765625e-02  1.84570312e-01  1.70898438e-01 -1.63085938e-01
 -1.09375000e-01  1.49414062e-01 -4.65393066e-04  9.61914062e-02
  1.68945312e-01  2.60925293e-03  8.93554688e-02  6.49414062e-02
  3.56445312e-02 -6.93359375e-02 -1.46484375e-01 -1.21093750e-01
 -2.27539062e-01  2.45361328e-02 -1.24511719e-01 -3.18359375e-01
 -2.20703125e-01  1.30859375e-01  3.66210938e-02 -3.63769531e-02
 -1.13281250e-01  1.95312500e-01  9.76562500e-02  1.26953125e-01
  6.59179688e-02  6.93359375e-02  1.02539062e-02  1.75781250e-01
 -1.68945312e-01  1.21307373e-03 -2.98828125e-01 -1.15234375e-01
  5.66406250e-02 -1.77734375e-01 -2.08984375e-01  1.76757812e-01
  2.38037109e-02 -2.57812500e-01 -4.46777344e
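The `similarity` call above returns the cosine similarity of the two word vectors. A minimal NumPy sketch of that measure on made-up three-dimensional vectors (real Word2Vec vectors have far more dimensions): vectors pointing the same way score about 1, and orthogonal vectors score 0.

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])   # made-up "word vectors"
b = np.array([2.0, 4.0, 0.0])   # same direction as a
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

print(cosine_similarity(a, b), cosine_similarity(a, c))
```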

In [34]:
# Train our own Word2Vec model on the review corpus:
# first split each cleaned review into a list of tokens
list_of_sent=[]
for sent in final['CleanedText'].values:
    list_of_sent.append(sent.split())

In [35]:
print(final['CleanedText'].values[0])
print("*****************************************************************")
print(list_of_sent[0])

daughter want book price amazon best alreadi tri one recip day receiv book seem happi
*****************************************************************
['daughter', 'want', 'book', 'price', 'amazon', 'best', 'alreadi', 'tri', 'one', 'recip', 'day', 'receiv', 'book', 'seem', 'happi']


In [36]:
# min_count = 5 keeps only words that occurred at least 5 times
# (gensim 4 renamed the size parameter to vector_size)
w2v_model=Word2Vec(list_of_sent,min_count=5,size=50, workers=4)

In [37]:
# (gensim 4 replaced wv.vocab with wv.key_to_index)
w2v_words = list(w2v_model.wv.vocab)
print("number of words that occured minimum 5 times ",len(w2v_words))
print("sample words ", w2v_words[0:50])

number of words that occured minimum 5 times  23019
sample words  ['daughter', 'want', 'book', 'price', 'amazon', 'best', 'alreadi', 'tri', 'one', 'recip', 'day', 'receiv', 'seem', 'happi', 'bought', 'zoku', 'quick', 'pop', 'maker', 'love', 'fun', 'make', 'ice', 'cream', 'shortag', 'avail', 'free', 'web', 'purchas', 'good', 'fruit', 'blog', 'hope', 'came', 'emphas', 'sweet', 'dessert', 'howev', 'total', 'fresh', 'fruiti', 'chapter', 'follow', 'three', 'entitl', 'scream', 'bake', 'shop', 'coco', 'might']


In [41]:
w2v_model.wv.most_similar('daughter')

[('son', 0.9587377309799194),
 ('granddaught', 0.9308164715766907),
 ('grandson', 0.9132821559906006),
 ('niec', 0.8836356997489929),
 ('sister', 0.8833993077278137),
 ('mom', 0.8804718255996704),
 ('boyfriend', 0.8591934442520142),
 ('nephew', 0.8586499094963074),
 ('girlfriend', 0.8392627239227295),
 ('mother', 0.8379369378089905)]

In [43]:
w2v_model.wv.most_similar('price')

[('valu', 0.6238385438919067),
 ('pricewis', 0.6231014728546143),
 ('cost', 0.6230157613754272),
 ('bargain', 0.600115180015564),
 ('pricepoint', 0.5674911737442017),
 ('expens', 0.5652279853820801),
 ('thepric', 0.5578619241714478),
 ('inexpens', 0.5563632249832153),
 ('dollar', 0.5561596155166626),
 ('buck', 0.5493758916854858)]

In [45]:
w2v_model.wv.most_similar('home')

[('vacat', 0.7221050262451172),
 ('hous', 0.7096303105354309),
 ('cabin', 0.6979228854179382),
 ('condo', 0.6637091636657715),
 ('visit', 0.6442372798919678),
 ('offic', 0.6426874399185181),
 ('motorhom', 0.6325644254684448),
 ('town', 0.6314110159873962),
 ('workplac', 0.6265959739685059),
 ('kitchenett', 0.6188104152679443)]

In [46]:
w2v_model.wv.most_similar('girl')

[('teenag', 0.8579732179641724),
 ('nephew', 0.8420219421386719),
 ('grandchildren', 0.8169942498207092),
 ('granddaught', 0.8049652576446533),
 ('teen', 0.7999324798583984),
 ('boyfriend', 0.7955741882324219),
 ('grandson', 0.791936993598938),
 ('niec', 0.787968635559082),
 ('preschool', 0.7823643684387207),
 ('youngest', 0.7819018363952637)]