# Amazon Home Kitchen Product Reviews Analysis


Data Source: http://jmcauley.ucsd.edu/data/amazon/index_2014.html

The Amazon Home Kitchen Product Reviews dataset consists of reviews of home and kitchen products from the Amazon website.<br>

Number of reviews: 551,682<br>
Timespan: May 1996 - July 2014<br>
Number of Attributes/Columns in data: 9

#### Attribute Information:

1. reviewerID - unique identifier of the reviewer
2. asin - unique identifier of the product
3. reviewerName - name of the reviewer
4. helpful - helpfulness votes, stored as [numerator, denominator]
   HelpfulnessNumerator - number of users who found the review helpful
   HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
5. reviewText - text of the review
6. overall - the overall rating given by the reviewer (1 to 5)
7. summary - brief summary of the review
8. unixReviewTime - Unix timestamp of the review
9. reviewTime - date of the review
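
As a quick illustration of how `unixReviewTime` relates to the review date, the sketch below converts the timestamp of the first sample row displayed later in this notebook into a calendar date. It assumes the value counts seconds since the Unix epoch in UTC, which the attribute list does not state explicitly.

```python
from datetime import datetime, timezone

# unixReviewTime of the first sample row (its reviewTime reads "10 19, 2013")
unix_review_time = 1382140800

# assumption: the value is seconds since 1970-01-01 in UTC
review_date = datetime.fromtimestamp(unix_review_time, tz=timezone.utc).date()
print(review_date.isoformat())
```

If the assumption holds, the printed date should agree with the `reviewTime` column of the same row.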

#### Objective
* Determine the polarity of each review, positive (rating of 4 or 5) or negative (rating of 1 or 2), from the review text written by the user.

#### Ground truth 
* We will use the overall score as the ground truth for each review. If the score is 4 or 5, we consider the review positive. If the score is 1 or 2, we consider the review negative. We ignore reviews with a rating of 3.

# Loading the data

The data is available in .json format at the data source link. We converted it to a .csv file using the code below.

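import pandas as pd
import gzip
import json

def parse(path):
  # stream the gzipped file and decode one JSON record per line
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield json.loads(l)

def getDF(path):
  # collect the records in a dict keyed by row number, then build the DataFrame
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Home_and_Kitchen_5.json.gz')

# the output name has no extension, but the file is plain CSV
df.to_csv('amazon_home_kitchen_product_data', encoding='utf-8', index=False)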
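
As an alternative to the manual loop, pandas can read line-delimited JSON directly. The sketch below is not the code used for this notebook; it writes a tiny, made-up two-record file (the records and the file name `sample_reviews.json.gz` are hypothetical) and reads it back with `pd.read_json`, assuming the real file also holds one JSON object per line.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# hypothetical records that mimic the layout of the review file
records = [
    {"reviewerID": "A1", "asin": "B0001", "overall": 5.0, "helpful": [0, 0]},
    {"reviewerID": "A2", "asin": "B0002", "overall": 2.0, "helpful": [3, 4]},
]

# write them as gzipped, line-delimited JSON
path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

# lines=True tells pandas to parse one JSON object per line
sample_df = pd.read_json(path, lines=True, compression="gzip")
print(sample_df.shape)
```

Note that `pd.read_json` applies its own dtype inference, so column types may differ from those produced by `json.loads`; check `sample_df.dtypes` before relying on them.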

In [356]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('amazon_home_kitchen_product_data')

In [357]:
# displaying the first few rows of data
data.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,APYOBQE6M18AA,615391206,Martin Schwartz,"[0, 0]",My daughter wanted this book and the price on ...,5.0,Best Price,1382140800,"10 19, 2013"
1,A1JVQTAGHYOL7F,615391206,Michelle Dinh,"[0, 0]",I bought this zoku quick pop for my daughterr ...,5.0,zoku,1403049600,"06 18, 2014"
2,A3UPYGJKZ0XTU4,615391206,mirasreviews,"[26, 27]",There is no shortage of pop recipes available ...,4.0,"Excels at Sweet Dessert Pops, but Falls Short ...",1367712000,"05 5, 2013"
3,A2MHCTX43MIMDZ,615391206,"M. Johnson ""Tea Lover""","[14, 18]",This book is a must have if you get a Zoku (wh...,5.0,Creative Combos,1312416000,"08 4, 2011"
4,AHAI85T5C2DH3,615391206,PugLover,"[0, 0]",This cookbook is great. I have really enjoyed...,4.0,A must own if you own the Zoku maker...,1402099200,"06 7, 2014"


In [358]:
# The shape of the data before filtering the score rating 3
data.shape

(551682, 9)

In [359]:
# removing the reviews with an overall rating of 3 (neutral);
# .copy() gives an independent DataFrame so later column assignments are unambiguous
filtered_data = data[data.overall != 3].copy()

In [360]:
# The shape of the data after filtering the score rating 3
filtered_data.shape

(506623, 9)

In [361]:
# changing the overall rating column to positive and negative categories
import warnings
warnings.filterwarnings("ignore")

def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'

actualScore = filtered_data['overall']
positiveNegative = actualScore.map(partition) 
filtered_data['overall'] = positiveNegative

# renaming overall to score and asin to productId for readability
filtered_data = filtered_data.rename(columns={"overall": "score", "asin":"productId"})

In [362]:
# displaying the first few rows of filtered data
filtered_data.head()


Unnamed: 0,reviewerID,productId,reviewerName,helpful,reviewText,score,summary,unixReviewTime,reviewTime
0,APYOBQE6M18AA,615391206,Martin Schwartz,"[0, 0]",My daughter wanted this book and the price on ...,positive,Best Price,1382140800,"10 19, 2013"
1,A1JVQTAGHYOL7F,615391206,Michelle Dinh,"[0, 0]",I bought this zoku quick pop for my daughterr ...,positive,zoku,1403049600,"06 18, 2014"
2,A3UPYGJKZ0XTU4,615391206,mirasreviews,"[26, 27]",There is no shortage of pop recipes available ...,positive,"Excels at Sweet Dessert Pops, but Falls Short ...",1367712000,"05 5, 2013"
3,A2MHCTX43MIMDZ,615391206,"M. Johnson ""Tea Lover""","[14, 18]",This book is a must have if you get a Zoku (wh...,positive,Creative Combos,1312416000,"08 4, 2011"
4,AHAI85T5C2DH3,615391206,PugLover,"[0, 0]",This cookbook is great. I have really enjoyed...,positive,A must own if you own the Zoku maker...,1402099200,"06 7, 2014"


# Exploratory Data Analysis

In [363]:
# splitting the helpful column into helpfulness numerator and denominator
helpfulness_num = []
helpfulness_denom = []
for i in filtered_data['helpful']:
    m = i[1:-1]                      # strip the brackets, e.g. "[26, 27]" -> "26, 27"
    k = m.split(',')
    helpfulness_num.append(k[0])
    helpfulness_denom.append(k[1])

filtered_data['helpfulnessNumerator'] = helpfulness_num          # adding helpfulness numerator column
filtered_data['helpfulnessDenominator'] = helpfulness_denom      # adding helpfulness denominator column

filtered_data['helpfulnessNumerator'] = pd.to_numeric(filtered_data["helpfulnessNumerator"])
filtered_data['helpfulnessDenominator'] = pd.to_numeric(filtered_data["helpfulnessDenominator"])

del filtered_data['helpful']                                     # deleting helpful column
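
The same split can be written without an explicit loop. The following is a minimal sketch on a made-up three-row frame, assuming every `helpful` value is a string of the form `"[a, b]"` with non-negative integers, as in the rows displayed above.

```python
import pandas as pd

# hypothetical sample in the same string format as the helpful column
sample = pd.DataFrame({"helpful": ["[0, 0]", "[26, 27]", "[14, 18]"]})

# capture the two integers inside the brackets as separate columns
votes = sample["helpful"].str.extract(r"\[(\d+),\s*(\d+)\]").astype(int)
votes.columns = ["helpfulnessNumerator", "helpfulnessDenominator"]

sample = pd.concat([sample.drop(columns="helpful"), votes], axis=1)
print(sample)
```

A value that does not match the pattern would produce missing entries and make `astype(int)` fail, which is a useful early warning on malformed rows.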

In [364]:
# displaying the first few rows of filtered data
filtered_data.head()

Unnamed: 0,reviewerID,productId,reviewerName,reviewText,score,summary,unixReviewTime,reviewTime,helpfulnessNumerator,helpfulnessDenominator
0,APYOBQE6M18AA,615391206,Martin Schwartz,My daughter wanted this book and the price on ...,positive,Best Price,1382140800,"10 19, 2013",0,0
1,A1JVQTAGHYOL7F,615391206,Michelle Dinh,I bought this zoku quick pop for my daughterr ...,positive,zoku,1403049600,"06 18, 2014",0,0
2,A3UPYGJKZ0XTU4,615391206,mirasreviews,There is no shortage of pop recipes available ...,positive,"Excels at Sweet Dessert Pops, but Falls Short ...",1367712000,"05 5, 2013",26,27
3,A2MHCTX43MIMDZ,615391206,"M. Johnson ""Tea Lover""",This book is a must have if you get a Zoku (wh...,positive,Creative Combos,1312416000,"08 4, 2011",14,18
4,AHAI85T5C2DH3,615391206,PugLover,This cookbook is great. I have really enjoyed...,positive,A must own if you own the Zoku maker...,1402099200,"06 7, 2014",0,0


## Data cleaning 

### Check for duplicate values

We need to remove duplicate reviews (if any) so that repeated reviews do not bias the analysis.

In [365]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('productId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [366]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset=["reviewerID","reviewerName","unixReviewTime","reviewText"], keep='first', inplace=False)
final.shape

(506623, 10)

The number of rows is unchanged (506,623) after deduplication, so the dataset contains no duplicate reviews.
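
To count duplicates explicitly rather than comparing shapes, `DataFrame.duplicated` can be used with the same key columns. The sketch below runs on a small made-up frame (all values are hypothetical) in which the second row repeats the first on every key column.

```python
import pandas as pd

# hypothetical reviews; rows 0 and 1 differ only in productId
sample = pd.DataFrame({
    "reviewerID": ["A1", "A1", "A2"],
    "reviewerName": ["Ann", "Ann", "Bob"],
    "unixReviewTime": [100, 100, 200],
    "reviewText": ["Great pan", "Great pan", "Too small"],
    "productId": ["P1", "P2", "P1"],
})

key_cols = ["reviewerID", "reviewerName", "unixReviewTime", "reviewText"]

# True for every row that repeats an earlier row on the key columns
dup_mask = sample.duplicated(subset=key_cols, keep="first")
n_duplicates = int(dup_mask.sum())
deduplicated = sample[~dup_mask]
print(n_duplicates, deduplicated.shape)
```

On the real data, a count of zero would confirm the observation above.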

### Check whether helpfulnessNumerator is higher than helpfulnessDenominator for any data point

In [367]:
final[final.helpfulnessNumerator>final.helpfulnessDenominator]

Unnamed: 0,reviewerID,productId,reviewerName,reviewText,score,summary,unixReviewTime,reviewTime,helpfulnessNumerator,helpfulnessDenominator


### Observation 
There are no data points where helpfulnessNumerator is higher than helpfulnessDenominator

In [368]:
# finding the number of positive and negative reviews
final['score'].value_counts()

positive    455204
negative     51419
Name: score, dtype: int64
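
The counts above are heavily skewed toward positive reviews. As a small worked example, the sketch below turns the two counts into class shares; the share of the larger class is also the accuracy a model would reach by always predicting "positive", which is a useful baseline to keep in mind.

```python
# counts copied from the value_counts() output above
counts = {"positive": 455204, "negative": 51419}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.2%}")
```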

In [369]:
final.dtypes

reviewerID                object
productId                 object
reviewerName              object
reviewText                object
score                     object
summary                   object
unixReviewTime             int64
reviewTime                object
helpfulnessNumerator       int64
helpfulnessDenominator     int64
dtype: object

# Text Preprocessing: Stemming, stop-word removal and Lemmatization.

In [385]:
# printing the first review that contains a '>' character, to check for HTML tags and special characters
final['reviewText'] = final["reviewText"].astype('str')
import re
i = 0
for sent in final['reviewText'].values:
    if re.findall('.*>', sent):
        print(i)
        print(sent)
        break
    i += 1

6581
About the temperatures it can handle (which for some reason aren't on the product description; you'd think this would be a key bullet point):Max recommended temp: 392F.  400F seems to be the extreme max according to the packaging, but > 392F will deteriorate the life span faster, says the instructions.  I've seen it go as low as 39F for refridgerated water; it could probably handle lower.  There's a switch on the bottom to measure in Celcius.  392F may be low for some people, so be warned.  I wonder if other reviewers with problems have been pushing this limit?  If the probe is off by as much as 10 degrees as some claim, that might explain the short lifespans.I've liked this probe so far, although admittedly I haven't used it yet for its primary purpose (measuring internal meat temperatures).  Instead I've used it as a kitchen timer and oven thermometer to monitor the oven air temps when proofing dough in it (don't want to kill my yeast!) by lodging the probe in a rack.  I haven't

In [390]:
!pip install nltk
import nltk
nltk.download('stopwords')  # fetches the stop-word corpus if it is not already present
from nltk.corpus import stopwords
stop = set(stopwords.words('english')) # set of English stop words
sno = nltk.stem.SnowballStemmer('english') # initialising the Snowball stemmer

def cleanhtml(sentence): # replace HTML tags in the sentence with a space
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext

def cleanpunc(sentence): # strip punctuation and special characters from the sentence
    cleaned = re.sub(r'[?!\'"#|]', r'', sentence)  # delete these characters outright
    cleaned = re.sub(r'[.,()|/]', r' ', cleaned)   # replace these characters with a space
    return cleaned
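
To see what the two helpers do, the sketch below re-declares them as written in the cell above (so the block runs on its own) and applies them to two made-up strings.

```python
import re

def cleanhtml(sentence):
    # replace anything that looks like an HTML tag with a space
    return re.sub(re.compile('<.*?>'), ' ', sentence)

def cleanpunc(sentence):
    # first pass deletes ? | ! ' " #, second pass turns . | , ) ( / into spaces
    cleaned = re.sub(r'[?|!|\'|"|#]', r'', sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]', r' ', cleaned)
    return cleaned

html_example = cleanhtml('Great<br />pan')
punc_example = cleanpunc("don't stop, (really)!").split()
print(html_example)
print(punc_example)
```

Note that the apostrophe is deleted rather than replaced by a space, so "don't" becomes the single token "dont".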



In [392]:
# Pre-processing each review: remove HTML tags and punctuation, drop stop words and
# tokens that are non-alphabetic or shorter than 3 characters, then stem what remains

i=0
str1=' '
final_string=[]
all_positive_words=[] # store words from +ve reviews here
all_negative_words=[] # store words from -ve reviews here
s=''
scores=final['score'].values # fetch the labels once instead of on every token
for sent in final['reviewText'].values:
    filtered_sentence=[]
    sent=cleanhtml(sent) # remove HTML tags
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            # keep alphabetic tokens longer than 2 characters that are not stop words
            if cleaned_words.isalpha() and len(cleaned_words)>2:
                if cleaned_words.lower() not in stop:
                    s=(sno.stem(cleaned_words.lower())).encode('utf8')
                    filtered_sentence.append(s)
                    if scores[i] == 'positive':
                        all_positive_words.append(s) # words used in positive reviews
                    if scores[i] == 'negative':
                        all_negative_words.append(s) # words used in negative reviews
    str1 = b" ".join(filtered_sentence) # final (bytes) string of cleaned words

    final_string.append(str1)
    i+=1

In [393]:
final['CleanedText']=final_string #adding a column of CleanedText which displays the data after pre-processing of the review 
final['CleanedText']=final['CleanedText'].str.decode("utf-8")
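
The same cleaning steps can also be packaged as a single function that returns a regular string, which would avoid the encode/decode round trip. The sketch below is an illustration, not the code used above: `clean_review`, the tiny stop-word set, and the identity default for `stem` are all hypothetical, chosen so the block runs without NLTK. In the notebook one would pass `stop` and `sno.stem` instead.

```python
import re

def clean_review(text, stop_words, stem=lambda w: w):
    """Return the cleaned, space-joined tokens of one review."""
    text = re.sub(r"<.*?>", " ", text)       # HTML tags -> space
    text = re.sub(r"[?!'\"#|]", "", text)    # delete these characters
    text = re.sub(r"[.,()/]", " ", text)     # replace these with a space
    tokens = []
    for word in text.split():
        word = word.lower()
        # keep alphabetic tokens longer than 2 characters that are not stop words
        if word.isalpha() and len(word) > 2 and word not in stop_words:
            tokens.append(stem(word))
    return " ".join(tokens)

# hypothetical stop-word set and review text, for illustration only
demo_stop = {"this", "it", "is"}
cleaned = clean_review("I <b>LOVE</b> this pan, it is great!", demo_stop)
print(cleaned)
```

With such a function, the new column could be built with `final['reviewText'].apply(...)`, assuming the per-class word lists are collected separately.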