# User review classifier: predicting whether a product review will be useful to other users (Amazon data set)


**Use Case**: As a user prepares and submits a review, how can a company proactively identify reviews that are unlikely to be helpful to other users, so that they are not posted against the item?

**Target Variable**: Helpful response from other reviewers. This target is constructed from the ratings that other users gave to existing reviews, i.e. the helpful votes recorded in the `vote` field.
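One possible way to construct such a target is sketched below. It assumes that `vote` is stored as text (possibly with thousands separators) and is missing when a review received no votes. The threshold `MIN_HELPFUL_VOTES`, the helper `add_helpful_target` and the toy rows are hypothetical choices for illustration, not something defined by the data set.

```python
import pandas as pd

# Toy rows standing in for the real reviews (made-up values).
reviews = pd.DataFrame({
    "reviewText": ["Great tea", "ok", "Detailed comparison of three brands", "bad"],
    "vote": ["2", None, "1,024", None],
})

# Hypothetical cut-off: a review counts as helpful with at least this many votes.
MIN_HELPFUL_VOTES = 1


def add_helpful_target(df, min_votes=MIN_HELPFUL_VOTES):
    """Return a copy of df with a numeric vote_count and a binary helpful column."""
    out = df.copy()
    # Cast to str so missing values become text that to_numeric coerces to NaN,
    # then drop thousands separators such as "1,024".
    votes = out["vote"].astype(str).str.replace(",", "", regex=False)
    out["vote_count"] = pd.to_numeric(votes, errors="coerce").fillna(0).astype(int)
    out["helpful"] = (out["vote_count"] >= min_votes).astype(int)
    return out


labelled = add_helpful_target(reviews)
print(labelled[["vote", "vote_count", "helpful"]])
```

Treating a missing `vote` as zero votes is itself an assumption; a stricter threshold would trade fewer positive labels for a cleaner signal.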

**Data Source**: https://nijianmo.github.io/amazon/index.html



In [5]:
import os
import json
import gzip
import wget
import pandas as pd
from urllib.request import urlopen

# tested links
# http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Arts_Crafts_and_Sewing_5.json.gz - works
# http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Grocery_and_Gourmet_Food_5.json.gz - works


In [12]:
# download data from the URL
# (this category file was selected at random for modelling)
url = 'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Grocery_and_Gourmet_Food_5.json.gz'
filename = wget.download(url)


100% [......................................................................] 146631394 / 146631394

In [13]:
# load the review data (one JSON object per line)
data = []
with gzip.open('Grocery_and_Gourmet_Food_5.json.gz') as f:
    for line in f:
        data.append(json.loads(line.strip()))

# total length of the list; this number equals the total number of reviews
print(len(data))

# first row of the list
print(data[0])

1143860
{'overall': 5.0, 'verified': True, 'reviewTime': '11 19, 2014', 'reviewerID': 'A1QVBUH9E1V6I8', 'asin': '4639725183', 'reviewerName': 'Jamshed Mathur', 'reviewText': 'No adverse comment.', 'summary': 'Five Stars', 'unixReviewTime': 1416355200}
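Loading all 1,143,860 reviews into a list takes a while, so for quick experiments it can help to read only the first few rows. The sketch below is one way to do that with a generator. The helper names `iter_reviews` and `load_reviews`, the `max_rows` parameter and the tiny stand-in file are assumptions made for illustration.

```python
import gzip
import itertools
import json
import os
import tempfile

import pandas as pd


def iter_reviews(path):
    """Yield one parsed review (a dict) per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_reviews(path, max_rows=None):
    """Load up to max_rows reviews into a DataFrame (all rows when max_rows is None)."""
    rows = itertools.islice(iter_reviews(path), max_rows)
    return pd.DataFrame(list(rows))


# Tiny stand-in file so the sketch runs without downloading the full data set.
records = [
    {"overall": 5.0, "asin": "4639725183", "reviewText": "No adverse comment."},
    {"overall": 5.0, "asin": "4639725183", "reviewText": "Gift for college student."},
    {"overall": 4.0, "asin": "0000013714", "reviewText": "Made-up row", "vote": "2"},
]
tmp_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

sample_df = load_reviews(tmp_path, max_rows=2)
print(sample_df.shape)
```

Parsing with `json.loads` keeps `asin` as a string, so product IDs with leading zeros are preserved.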


In [14]:
# convert the list of review dicts into a pandas dataframe

df = pd.DataFrame(data)

print(len(df))


1143860


In [15]:
#look at dataframe
df.info()
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143860 entries, 0 to 1143859
Data columns (total 12 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   overall         1143860 non-null  float64
 1   verified        1143860 non-null  bool   
 2   reviewTime      1143860 non-null  object 
 3   reviewerID      1143860 non-null  object 
 4   asin            1143860 non-null  object 
 5   reviewerName    1143722 non-null  object 
 6   reviewText      1143470 non-null  object 
 7   summary         1143641 non-null  object 
 8   unixReviewTime  1143860 non-null  int64  
 9   vote            158202 non-null   object 
 10  style           592086 non-null   object 
 11  image           9510 non-null     object 
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 97.1+ MB


| | overall | verified | reviewTime | reviewerID | asin | reviewerName | reviewText | summary | unixReviewTime | vote | style | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | True | 11 19, 2014 | A1QVBUH9E1V6I8 | 4639725183 | Jamshed Mathur | No adverse comment. | Five Stars | 1416355200 | | | |
| 1 | 5.0 | True | 10 13, 2016 | A3GEOILWLK86XM | 4639725183 | itsjustme | Gift for college student. | Great product. | 1476316800 | | | |
| 2 | 5.0 | True | 11 21, 2015 | A32RD6L701BIGP | 4639725183 | Krystal Clifton | If you like strong tea, this is for you. It mi... | Strong | 1448064000 | | | |
| 3 | 5.0 | True | 08 12, 2015 | A2UY1O1FBGKIE6 | 4639725183 | U. Kane | Love the tea. The flavor is way better than th... | Great tea | 1439337600 | | | |
| 4 | 5.0 | True | 05 28, 2015 | A3QHVBQYDV7Z6U | 4639725183 | The Nana | I have searched everywhere until I browsed Ama... | This is the tea I remembered! | 1432771200 | | | |
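The `df.info()` output shows that `vote` is populated for only 158,202 of the 1,143,860 reviews, which matters for any target built from it. A minimal sketch for checking column coverage follows. The toy frame `sample` is made up; only the two counts are taken from the output above.

```python
import pandas as pd

# Toy frame with missing values (made-up rows, same column names as the data).
sample = pd.DataFrame({
    "overall": [5.0, 4.0, 1.0, 5.0],
    "vote": [None, "3", None, None],
    "image": [None, None, None, None],
})

# Share of non-null values per column, least populated first.
coverage = sample.notna().mean().sort_values()
print(coverage)

# The same ratio for `vote`, using the counts reported by df.info().
vote_share = 158202 / 1143860
print(f"{vote_share:.1%}")  # 13.8%
```

On the full dataframe the same check would be `df.notna().mean()`.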


### Data catalogue

- __overall:__- Rating of the Product
- __reviewTime:__- Time of the review (raw)
- __reviewerID:__- ID of the reviewer, e.g. A2SUAM1J3GNN3B
- __asin:__- ID of the product, e.g. 0000013714
- __style:__- A dictionary of the product metadata, e.g., "Format" is "Hardcover"
- __reviewerName:__- Name of the reviewer
- __summary:__- Summary of the review
- __vote:__- Helpful votes of the review
- __unixReviewTime:__- Time of the review (unix time)
- __reviewText:__- Text of the review
- __image:__- Images that users post after they have received the product
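The catalogue lists the review time twice, as raw text (`reviewTime`) and as unix time (`unixReviewTime`). A short sketch of turning either into a proper datetime column is shown below, using the first two rows of the dataframe above. The column name `review_date` and the format string are assumptions based on those sample values.

```python
import pandas as pd

# First two rows of the dataframe shown earlier.
sample = pd.DataFrame({
    "reviewTime": ["11 19, 2014", "10 13, 2016"],
    "unixReviewTime": [1416355200, 1476316800],
})

# Unix time is in seconds since 1970-01-01 (UTC).
sample["review_date"] = pd.to_datetime(sample["unixReviewTime"], unit="s")

# The raw string appears to follow a "month day, year" layout.
parsed_raw = pd.to_datetime(sample["reviewTime"], format="%m %d, %Y")

# Both representations should describe the same calendar day.
print((sample["review_date"] == parsed_raw).all())
print(sample[["reviewTime", "review_date"]])
```

Since the two columns carry the same information, keeping only the parsed date is usually enough for modelling.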