# Amazon Home Kitchen Product Reviews Analysis


Data Source: http://jmcauley.ucsd.edu/data/amazon/index_2014.html

The Amazon Home Kitchen Product Reviews dataset consists of reviews of home and kitchen products from the Amazon website.<br>

Number of reviews: 551,682<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: May 1996 - July 2014<br>
Number of Attributes/Columns in data: 9

#### Attribute Information:

1. reviewerID - unique identifier of the reviewer
2. asin - unique identifier for the product
3. reviewerName - name of the reviewer
4. helpful - helpfulness votes, stored as a pair [HelpfulnessNumerator, HelpfulnessDenominator]
   HelpfulnessNumerator - number of users who found the review helpful
   HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
5. reviewText - text of the review
6. overall - the overall rating given by the reviewer
7. summary - brief summary of the review
8. unixReviewTime - timestamp for the review
9. reviewTime - date of the review
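
In the CSV output shown later in this notebook, the two helpfulness counts arrive as a single `helpful` string such as `"[26, 27]"`. Below is a minimal sketch of how that value could be split into a numerator, a denominator and a ratio, assuming every row follows that string format. `parse_helpful` and `helpful_ratio` are hypothetical helper names, not part of the original notebook.

```python
import ast

def parse_helpful(value):
    """Turn a string like "[26, 27]" (or an actual list) into (numerator, denominator)."""
    pair = ast.literal_eval(value) if isinstance(value, str) else value
    numerator, denominator = int(pair[0]), int(pair[1])
    return numerator, denominator

def helpful_ratio(value):
    """Fraction of voters who found the review helpful; None when nobody voted."""
    numerator, denominator = parse_helpful(value)
    if denominator == 0:
        return None
    return numerator / denominator

print(parse_helpful("[26, 27]"))
print(helpful_ratio("[14, 18]"))
print(helpful_ratio("[0, 0]"))
```

On the DataFrame, this could be applied with something like `data['helpful'].map(helpful_ratio)`.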

#### Objective
* Determine the polarity of each review, i.e. whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2), from the review written by the user.

#### Ground truth 
* We will use the overall score as the ground truth for each review. A score of 4 or 5 marks the review as positive, and a score of 1 or 2 marks it as negative. Reviews with a rating of 3 are ignored.

# Loading the data

The data is available as a gzipped .json file at the data source link; we converted it into a .csv file using the code below.

import pandas as pd
import gzip
import json

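def parse(path):
    # Yield one review (a dict) per line of the gzipped JSON file
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield json.loads(l)

def getDF(path):
    # Collect every review keyed by its row number, then build the DataFrame
    df = {}
    for i, d in enumerate(parse(path)):
        df[i] = d
    return pd.DataFrame.from_dict(df, orient='index')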

df = getDF('reviews_Home_and_Kitchen_5.json.gz')

df.to_csv('amazon_home_kitchen_product_data', encoding='utf-8', index=False)
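
`getDF` holds every review in a dictionary before building the DataFrame. If memory is tight, a streaming conversion is one alternative: the sketch below writes each record to the CSV as it is read, using only the standard library. The function name `json_gz_to_csv` and the explicit `fieldnames` argument are assumptions for illustration, and the sketch assumes each line of the file is one valid JSON object.

```python
import csv
import gzip
import json

def json_gz_to_csv(src_path, dst_path, fieldnames):
    """Stream a gzipped JSON-lines file into a CSV file; returns the number of rows written."""
    rows = 0
    with gzip.open(src_path, 'rt', encoding='utf-8') as src, \
         open(dst_path, 'w', newline='', encoding='utf-8') as dst:
        # restval fills keys missing from a record (e.g. no reviewerName) with '';
        # extrasaction='ignore' skips keys that are not listed in fieldnames
        writer = csv.DictWriter(dst, fieldnames=fieldnames, restval='',
                                extrasaction='ignore')
        writer.writeheader()
        for line in src:
            writer.writerow(json.loads(line))
            rows += 1
    return rows
```

For the file used here, `fieldnames` would be the nine column names listed under Attribute Information.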

In [152]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('amazon_home_kitchen_product_data')
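
Some product IDs in this dataset, such as `0615391206`, consist only of digits, so `read_csv` may parse them as numbers and drop the leading zero. One possible guard is to pass `dtype={'asin': str}`. The sketch below demonstrates the effect on a two-row in-memory sample rather than the real file; the sample rows are taken from the output shown in this notebook, and the variable names are illustrative.

```python
import io
import pandas as pd

csv_text = (
    "reviewerID,asin,overall\n"
    "APYOBQE6M18AA,0615391206,5.0\n"
    "A1JVQTAGHYOL7F,0615391206,5.0\n"
)

# Default parsing treats the digit-only IDs as integers
default = pd.read_csv(io.StringIO(csv_text))
# Forcing a string dtype keeps the IDs exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'asin': str})

print(default['asin'].iloc[0])   # leading zero lost
print(as_text['asin'].iloc[0])   # leading zero kept
```

For the real file, the same argument would be added to the `pd.read_csv('amazon_home_kitchen_product_data')` call.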

In [153]:
# displaying the first few rows of data
data.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,APYOBQE6M18AA,615391206,Martin Schwartz,"[0, 0]",My daughter wanted this book and the price on ...,5.0,Best Price,1382140800,"10 19, 2013"
1,A1JVQTAGHYOL7F,615391206,Michelle Dinh,"[0, 0]",I bought this zoku quick pop for my daughterr ...,5.0,zoku,1403049600,"06 18, 2014"
2,A3UPYGJKZ0XTU4,615391206,mirasreviews,"[26, 27]",There is no shortage of pop recipes available ...,4.0,"Excels at Sweet Dessert Pops, but Falls Short ...",1367712000,"05 5, 2013"
3,A2MHCTX43MIMDZ,615391206,"M. Johnson ""Tea Lover""","[14, 18]",This book is a must have if you get a Zoku (wh...,5.0,Creative Combos,1312416000,"08 4, 2011"
4,AHAI85T5C2DH3,615391206,PugLover,"[0, 0]",This cookbook is great. I have really enjoyed...,4.0,A must own if you own the Zoku maker...,1402099200,"06 7, 2014"


In [154]:
# The shape of the data before filtering the score rating 3
data.shape

(551682, 9)

In [155]:
# removing the neutral reviews (overall rating of 3)
# .copy() makes filtered_data independent of data, so later column assignments are safe
filtered_data = data[data.overall != 3].copy()

In [156]:
# The shape of the data after filtering the score rating 3
filtered_data.shape

(506623, 9)

In [157]:
filtered_data

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,APYOBQE6M18AA,0615391206,Martin Schwartz,"[0, 0]",My daughter wanted this book and the price on ...,5.0,Best Price,1382140800,"10 19, 2013"
1,A1JVQTAGHYOL7F,0615391206,Michelle Dinh,"[0, 0]",I bought this zoku quick pop for my daughterr ...,5.0,zoku,1403049600,"06 18, 2014"
2,A3UPYGJKZ0XTU4,0615391206,mirasreviews,"[26, 27]",There is no shortage of pop recipes available ...,4.0,"Excels at Sweet Dessert Pops, but Falls Short ...",1367712000,"05 5, 2013"
3,A2MHCTX43MIMDZ,0615391206,"M. Johnson ""Tea Lover""","[14, 18]",This book is a must have if you get a Zoku (wh...,5.0,Creative Combos,1312416000,"08 4, 2011"
4,AHAI85T5C2DH3,0615391206,PugLover,"[0, 0]",This cookbook is great. I have really enjoyed...,4.0,A must own if you own the Zoku maker...,1402099200,"06 7, 2014"
...,...,...,...,...,...,...,...,...,...
551677,A11J1FHCK5U06J,B00LBFUU12,Karinna Ball,"[0, 0]",These ice pop molds are awesome! Bright kid-ha...,5.0,Summer fun for everyone!,1404950400,"07 10, 2014"
551678,A537XC69FAD3J,B00LBFUU12,L Green,"[0, 0]",great popsicle molds - very nice quality - and...,5.0,Five Stars,1405382400,"07 15, 2014"
551679,AWHZOUIQ0VO7M,B00LBFUU12,Richard N,"[0, 0]",My kids and I are loving these - putting our c...,5.0,... these - putting our creativity to the test...,1405468800,"07 16, 2014"
551680,A1KQNP8MOJDJKC,B00LBFUU12,RS,"[1, 1]","I love these ice pop makers. First off, I love...",5.0,love them,1405209600,"07 13, 2014"


In [158]:
# changing the overall rating column to positive and negative categories
import warnings
warnings.filterwarnings("ignore")

def partition(x):
    # ratings of 3 were already filtered out, so x is 1, 2, 4 or 5
    if x < 3:
        return 'negative'
    return 'positive'

actualScore = filtered_data['overall']
positiveNegative = actualScore.map(partition)
# assign returns a new DataFrame rather than writing into a filtered slice
filtered_data = filtered_data.assign(overall=positiveNegative)

In [159]:
filtered_data['overall'].value_counts()

positive    455204
negative     51419
Name: overall, dtype: int64
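
The counts above show that the two classes are far from balanced. The following is a quick check of the arithmetic, using only the figures printed in this notebook; the variable names are illustrative.

```python
total_reviews = 551682      # rows before filtering
kept_reviews = 506623       # rows after dropping rating 3
positive, negative = 455204, 51419

neutral_removed = total_reviews - kept_reviews
positive_share = positive / kept_reviews
negative_share = negative / kept_reviews

print(neutral_removed)
print(f"{positive_share:.2%} positive, {negative_share:.2%} negative")
print(f"{positive / negative:.2f} positive reviews per negative review")
```

With roughly nine positive reviews for every negative one, a model that always predicts positive would already score a high accuracy, so class-aware metrics may be worth considering when evaluating a classifier on this data.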