# Helpful Reviews

### Preliminary Data Wrangling

### Corey J Wade, WM^3

## Introduction

This Jupyter Notebook performs preliminary data wrangling on the following dataset, courtesy of Julian McAuley, UCSD:

http://jmcauley.ucsd.edu/data/amazon/

Since Amazon started out as a bookseller, we have chosen to restrict our focus to the Books subset, using only the review data. Our goal is to create a new metric, Helpful Rating, that can be generated immediately after a review is written.

The reason for preliminary data wrangling is the size of the file. Converting the JSON file into a pandas DataFrame takes a long time, so it only needs to be done once.

## Open JSON File

In [1]:
import pandas as pd
import json

with open('reviews_Books_5.json', 'r') as f:
    reviews = f.readlines()

I took a break due to the size of the file.

In [2]:
reviews[0:5]

['{"reviewerID": "A10000012B7CGYKOMPQ4L", "asin": "000100039X", "reviewerName": "Adam", "helpful": [0, 0], "reviewText": "Spiritually and mentally inspiring! A book that allows you to question your morals and will help you discover who you really are!", "overall": 5.0, "summary": "Wonderful!", "unixReviewTime": 1355616000, "reviewTime": "12 16, 2012"}\n',
 '{"reviewerID": "A2S166WSCFIFP5", "asin": "000100039X", "reviewerName": "adead_poet@hotmail.com \\"adead_poet@hotmail.com\\"", "helpful": [0, 2], "reviewText": "This is one my must have books. It is a masterpiece of spirituality. I\'ll be the first to admit, its literary quality isn\'t much. It is rather simplistically written, but the message behind it is so powerful that you have to read it. It will take you to enlightenment.", "overall": 5.0, "summary": "close to god", "unixReviewTime": 1071100800, "reviewTime": "12 11, 2003"}\n',
 '{"reviewerID": "A1BM81XB4QHOA3", "asin": "000100039X", "reviewerName": "Ahoro Blethends \\"Seriousl

Each line of the file is a JSON dictionary terminated by '\n'.

### Eliminate \n

In [3]:
data = [json.loads(item.strip('\n')) for item in reviews]
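
Because each line is an independent JSON document, the file can also be parsed one line at a time instead of loading every line with `readlines()` first. The following is a minimal sketch of that idea, not the approach used in this notebook; the helper name `iter_json_lines` and the two sample lines are assumptions made for illustration.

```python
import json


def iter_json_lines(lines):
    """Yield one dict per non-empty line of line-delimited JSON (hypothetical helper)."""
    for line in lines:
        line = line.strip()  # drops the trailing '\n' and surrounding whitespace
        if line:             # skip blank lines, such as one at the end of the file
            yield json.loads(line)


# With the real file, a file object can be passed directly, since it iterates by line:
# with open('reviews_Books_5.json', 'r') as f:
#     data = list(iter_json_lines(f))

# Shortened sample lines, for illustration only.
sample = [
    '{"asin": "000100039X", "helpful": [0, 0], "overall": 5.0}\n',
    '{"asin": "000100039X", "helpful": [0, 2], "overall": 5.0}\n',
]
records = list(iter_json_lines(sample))
print(records[1]["helpful"])
```

Parsing lazily avoids holding both the raw strings and the parsed dictionaries in memory at once, which may matter for a file of this size.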

In [4]:
data[0:5]

[{'asin': '000100039X',
  'helpful': [0, 0],
  'overall': 5.0,
  'reviewText': 'Spiritually and mentally inspiring! A book that allows you to question your morals and will help you discover who you really are!',
  'reviewTime': '12 16, 2012',
  'reviewerID': 'A10000012B7CGYKOMPQ4L',
  'reviewerName': 'Adam',
  'summary': 'Wonderful!',
  'unixReviewTime': 1355616000},
 {'asin': '000100039X',
  'helpful': [0, 2],
  'overall': 5.0,
  'reviewText': "This is one my must have books. It is a masterpiece of spirituality. I'll be the first to admit, its literary quality isn't much. It is rather simplistically written, but the message behind it is so powerful that you have to read it. It will take you to enlightenment.",
  'reviewTime': '12 11, 2003',
  'reviewerID': 'A2S166WSCFIFP5',
  'reviewerName': 'adead_poet@hotmail.com "adead_poet@hotmail.com"',
  'summary': 'close to god',
  'unixReviewTime': 1071100800},
 {'asin': '000100039X',
  'helpful': [0, 0],
  'overall': 5.0,
  'reviewText': 'Thi

## Convert to DataFrame

In [5]:
df = pd.DataFrame(data)

In [6]:
df.head()

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 000100039X | [0, 0] | 5.0 | Spiritually and mentally inspiring! A book tha... | 12 16, 2012 | A10000012B7CGYKOMPQ4L | Adam | Wonderful! | 1355616000 |
| 1 | 000100039X | [0, 2] | 5.0 | This is one my must have books. It is a master... | 12 11, 2003 | A2S166WSCFIFP5 | adead_poet@hotmail.com "adead_poet@hotmail.com" | close to god | 1071100800 |
| 2 | 000100039X | [0, 0] | 5.0 | This book provides a reflection that you can a... | 01 18, 2014 | A1BM81XB4QHOA3 | Ahoro Blethends "Seriously" | Must Read for Life Afficianados | 1390003200 |
| 3 | 000100039X | [0, 0] | 5.0 | I first read THE PROPHET in college back in th... | 09 27, 2011 | A1MOSTXNIO5MPJ | Alan Krug | Timeless for every good and bad time in your l... | 1317081600 |
| 4 | 000100039X | [7, 9] | 5.0 | A timeless classic. It is a very demanding an... | 10 7, 2002 | A2XQ5LZHTD4AFT | Alaturka | A Modern Rumi | 1033948800 |


Initial columns of interest include 'helpful', 'overall', and 'reviewText'.

In [7]:
df.shape

(8898041, 9)

In [8]:
df.to_csv('Amazon_Data_Frame.csv')

I save the DataFrame to a CSV file for future reference.
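
One thing to keep in mind when reloading the saved file: CSV stores every cell as text, so list-valued cells in the helpful column come back as strings such as "[7, 9]", and an asin made only of digits can lose its leading zeros if it is parsed as a number. The sketch below shows one possible way to handle both on reload. It uses a tiny in-memory DataFrame rather than the real file, and the second asin value is made up for illustration.

```python
import ast
import io

import pandas as pd

# Tiny stand-in for the real DataFrame; the second asin is a made-up, digits-only value.
small = pd.DataFrame({
    "asin": ["000100039X", "0001000390"],
    "helpful": [[0, 0], [7, 9]],
})

buffer = io.StringIO()
small.to_csv(buffer)  # same kind of call as the cell above, written to memory instead of disk
buffer.seek(0)

restored = pd.read_csv(
    buffer,
    index_col=0,                               # the unnamed first column is the saved index
    dtype={"asin": str},                       # keep product IDs as text, leading zeros intact
    converters={"helpful": ast.literal_eval},  # turn the string "[7, 9]" back into the list [7, 9]
)
print(type(restored["helpful"][1]))
```

Without the converter, `restored["helpful"][1]` would be the string "[7, 9]" rather than a list.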

## Explore Original Columns

In [9]:
df.describe()

| | overall | unixReviewTime |
|---|---|---|
| count | 8898041.0 | 8898041.0 |
| mean | 4.249932 | 1320212000.0 |
| std | 1.057733 | 101851600.0 |
| min | 1.0 | 832550400.0 |
| 25% | 4.0 | 1296864000.0 |
| 50% | 5.0 | 1362182000.0 |
| 75% | 5.0 | 1385942000.0 |
| max | 5.0 | 1406074000.0 |


A median review score of 5.0 and a mean review score of 4.25 indicate a left-skewed distribution: the mean sits below the median because a tail of low ratings pulls it down.
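
As a rough check of that reading, the summary statistics can be plugged into the nonparametric skew, (mean - median) / standard deviation, which is negative for a left-skewed distribution. This is only a quick arithmetic sketch based on the values printed by `describe()`, not a computation over the full column.

```python
# Values copied from the describe() output for the 'overall' column.
mean_rating = 4.249932
median_rating = 5.0
std_rating = 1.057733

# Nonparametric skew: negative when the mean lies below the median.
skew = (mean_rating - median_rating) / std_rating
print(round(skew, 3))  # -0.709
```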

### Column Types

In [10]:
type(df.helpful[0])

list

In [11]:
type(df.helpful[0][0])

int

The helpful column is a list of ints, presumably [helpfulVotes, totalVotes]. Amazon only shows helpfulVotes on its website.
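
Assuming that interpretation of the pair is right, a fraction of helpful votes can be derived from each entry. The sketch below is one possible way to do so and is not necessarily how the Helpful Rating metric is defined; the function name `helpful_fraction` and the choice to return None for reviews without votes are assumptions.

```python
def helpful_fraction(helpful):
    """Return helpfulVotes / totalVotes, or None when the review has no votes.

    Assumes `helpful` is a two-item list [helpfulVotes, totalVotes].
    """
    helpful_votes, total_votes = helpful
    if total_votes == 0:
        return None  # avoid dividing by zero for reviews nobody voted on
    return helpful_votes / total_votes


# Pairs taken from the first rows shown earlier.
for pair in ([0, 0], [0, 2], [7, 9]):
    print(pair, helpful_fraction(pair))

# On the full DataFrame this could be applied as:
# df['helpful'].apply(helpful_fraction)
```

Returning None for [0, 0] keeps reviews with no votes distinguishable from reviews that received votes but no helpful ones, such as [0, 2].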

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8898041 entries, 0 to 8898040
Data columns (total 9 columns):
asin              object
helpful           object
overall           float64
reviewText        object
reviewTime        object
reviewerID        object
reviewerName      object
summary           object
unixReviewTime    int64
dtypes: float64(1), int64(1), object(7)
memory usage: 611.0+ MB
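
The output above shows unixReviewTime stored as int64 while reviewTime is a plain string. As a small sketch of how the two relate, a Unix timestamp can be converted to a calendar date with the standard library; the helper name `unix_to_date` is an assumption, and the timestamps are taken from the first two rows shown earlier.

```python
from datetime import datetime, timezone


def unix_to_date(seconds):
    """Convert seconds since the Unix epoch to a UTC calendar date (hypothetical helper)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


print(unix_to_date(1355616000))  # 2012-12-16, matching reviewTime '12 16, 2012'
print(unix_to_date(1071100800))  # 2003-12-11, matching reviewTime '12 11, 2003'

# A pandas equivalent for the whole column would be:
# pd.to_datetime(df['unixReviewTime'], unit='s')
```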
