# Amazon Product Data Exercises

In [1]:
## import statements
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import bokeh
from bokeh.io import output_notebook, show
output_notebook()

<hr/>

## Read data and explore

## Loading some data

We have some data from an [Amazon Product](http://jmcauley.ucsd.edu/data/amazon/) scrape by [Julian McAuley](http://cseweb.ucsd.edu/~jmcauley/). If you use this data, please cite Julian!

Let's see what we have. Note that we can run shell commands in Jupyter by starting a line with `!`:

In [2]:
! ls -al data

ls: data: No such file or directory


In practice, you don't need the `!` for some commands, such as `ls` and `cd`, because IPython provides them as magic commands:

In [3]:
ls -alh data/reviews/

ls: data/reviews/: No such file or directory


The complete dataset is pretty large, but the individual departments are manageable. Let's look at the first two records to get a feel for how the data is structured:

In [4]:
!gzcat ../data/reviews/reviews_Clothing_Shoes_and_Jewelry_5.json.gz | head -2

{"reviewerID": "A1KLRMWW2FWPL4", "asin": "0000031887", "reviewerName": "Amazon Customer \"cameramom\"", "helpful": [0, 0], "reviewText": "This is a great tutu and at a really great price. It doesn't look cheap at all. I'm so glad I looked on Amazon and found such an affordable tutu that isn't made poorly. A++", "overall": 5.0, "summary": "Great tutu-  not cheaply made", "unixReviewTime": 1297468800, "reviewTime": "02 12, 2011"}
{"reviewerID": "A2G5TCU2WDFZ65", "asin": "0000031887", "reviewerName": "Amazon Customer", "helpful": [0, 0], "reviewText": "I bought this for my 4 yr old daughter for dance class, she wore it today for the first time and the teacher thought it was adorable. I bought this to go with a light blue long sleeve leotard and was happy the colors matched up great. Price was very good too since some of these go for over $15.00 dollars.", "overall": 5.0, "summary": "Very Cute!!", "unixReviewTime": 1358553600, "reviewTime": "01 19, 2013"}
gzcat: error writing to output: 

Oops! This is JSON Lines (one JSON object per line), not a single JSON document! Let's read the file as a list of lines:

In [5]:
import gzip
# Read the gzipped file in text mode; the with-block closes it when done
with gzip.open('../data/reviews/reviews_Clothing_Shoes_and_Jewelry_5.json.gz', 'rt') as f:
    review_lines = f.readlines()
len(review_lines)

278677

Now we have something, but it's a lot of lines. So we load it into a `DataFrame`, the data hacker's go-to object for manipulating structured data.

Unfortunately, it's not as straightforward as a plain `pd.read_json` call, which by default expects a single JSON document. First we have to turn all those JSON strings into Python objects, so let's parse each line into a dict and then build a `DataFrame`:

In [6]:
import json
df_reviews = pd.DataFrame(list(map(json.loads, review_lines)))
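As an alternative to parsing each line by hand, `pd.read_json` also accepts `lines=True` for JSON Lines input. The sketch below is only an illustration under stated assumptions: `sample_records`, `sample_path`, and `df_sample` are hypothetical names, and a two-record temporary file stands in for the real reviews file.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Hypothetical two-record sample standing in for the real reviews file.
sample_records = [
    {"reviewerID": "A1KLRMWW2FWPL4", "asin": "0000031887", "helpful": [0, 0], "overall": 5.0},
    {"reviewerID": "A2G5TCU2WDFZ65", "asin": "0000031887", "helpful": [0, 0], "overall": 5.0},
]

# Write the sample as gzipped JSON Lines: one JSON object per line.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(sample_path, "wt") as f:
    for record in sample_records:
        f.write(json.dumps(record) + "\n")

# lines=True parses one object per line; gzip compression is inferred from ".gz".
# dtype={"asin": str} keeps the IDs as strings so their leading zeros survive.
df_sample = pd.read_json(sample_path, lines=True, dtype={"asin": str})
```

Both routes should produce the same columns. The explicit `json.loads` version makes each parsing step visible, while `lines=True` is shorter.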

Now let's see what we have:

In [7]:
df_reviews.head()

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 31887 | [0, 0] | 5.0 | This is a great tutu and at a really great pri... | 02 12, 2011 | A1KLRMWW2FWPL4 | Amazon Customer "cameramom" | Great tutu- not cheaply made | 1297468800 |
| 1 | 31887 | [0, 0] | 5.0 | I bought this for my 4 yr old daughter for dan... | 01 19, 2013 | A2G5TCU2WDFZ65 | Amazon Customer | Very Cute!! | 1358553600 |
| 2 | 31887 | [0, 0] | 5.0 | What can I say... my daughters have it in oran... | 01 4, 2013 | A1RLQXYNCMWRWN | Carola | I have buy more than one | 1357257600 |
| 3 | 31887 | [0, 0] | 5.0 | We bought several tutus at once, and they are ... | 04 27, 2014 | A8U3FAMSJVHS5 | Caromcg | Adorable, Sturdy | 1398556800 |
| 4 | 31887 | [0, 0] | 5.0 | Thank you Halo Heaven great product for Little... | 03 15, 2014 | A3GEOILWLK86XM | CJ | Grammy's Angels Love it | 1394841600 |


Where to go from here:

- split the `helpful` column into `helpful_votes` and `overall_votes`
- add a computed column `percent_helpful`
- add a computed column `helpful_review` with True if helpful and False otherwise
- remove the reviews with no votes
- plot the distribution of helpful votes
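One possible way to work through the first four steps is sketched below. It is not the only solution and rests on assumptions. Each `helpful` entry is taken to be `[helpful_votes, overall_votes]`. `percent_helpful` is stored as a fraction between 0 and 1. The 0.5 cutoff in `HELPFUL_THRESHOLD` is an arbitrary choice, since the exercise does not define what counts as helpful. The sample frame `df` is hypothetical.

```python
import pandas as pd

# Hypothetical sample; each `helpful` entry is assumed to be
# [helpful_votes, overall_votes].
df = pd.DataFrame({
    "reviewText": ["Great tutu", "Too small", "Fits well and looks good", "Okay"],
    "helpful": [[0, 0], [3, 4], [1, 5], [2, 2]],
})

# Split the two-element lists into two separate integer columns.
votes = pd.DataFrame(
    df["helpful"].tolist(),
    columns=["helpful_votes", "overall_votes"],
    index=df.index,
)
df = df.join(votes)

# Remove reviews with no votes before dividing, which avoids 0/0.
df_voted = df[df["overall_votes"] > 0].copy()

# Fraction of votes that marked the review as helpful (0.0 to 1.0).
df_voted["percent_helpful"] = df_voted["helpful_votes"] / df_voted["overall_votes"]

# Assumed cutoff: helpful when more than half of the votes are helpful.
HELPFUL_THRESHOLD = 0.5
df_voted["helpful_review"] = df_voted["percent_helpful"] > HELPFUL_THRESHOLD
```

For the last step, `df_voted["helpful_votes"].plot.hist()` is one quick way to look at the distribution of helpful votes.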

<hr/>

## Build some exploratory plots

Build a Bokeh plot that shows review length versus review helpfulness.
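A minimal starting point might look like the sketch below. It assumes a `percent_helpful` column already exists (as a fraction of helpful votes) and measures length in characters of `reviewText`. The helper name `build_length_plot` and the sample data are hypothetical.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure


def build_length_plot(df):
    """Scatter review length (characters) against the fraction of helpful votes."""
    source = ColumnDataSource(data={
        "review_length": df["reviewText"].str.len().tolist(),
        "percent_helpful": df["percent_helpful"].tolist(),
    })
    p = figure(
        title="Review length vs. helpfulness",
        x_axis_label="Review length (characters)",
        y_axis_label="Fraction of helpful votes",
    )
    # One marker per review; alpha helps when many points overlap.
    p.scatter(x="review_length", y="percent_helpful", source=source, size=6, alpha=0.4)
    return p


# Hypothetical sample standing in for the filtered reviews.
sample = pd.DataFrame({
    "reviewText": ["Too small", "Fits well and looks good", "Okay"],
    "percent_helpful": [0.75, 0.2, 1.0],
})
plot = build_length_plot(sample)
```

In the notebook, calling `show(plot)` after `output_notebook()` renders the figure inline.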

<hr/>

## Model the review helpfulness

Build a model that predicts whether a review is helpful from the word counts of its text. See `sklearn.feature_extraction.text.CountVectorizer`.
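As a rough sketch of one approach, the pipeline below feeds `CountVectorizer` word counts into a logistic regression. The classifier choice is an assumption; the exercise only names the vectorizer. The toy `texts` and `labels` are made up for illustration and are far too small to say anything about real performance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = helpful review, 0 = not helpful.
texts = [
    "Runs small, order one size up. Fabric is thick and the stitching held up after washing.",
    "Detailed sizing info: the waist is elastic and fits a four year old comfortably.",
    "The color matches the photo and the material did not fade after several washes.",
    "ok",
    "bad",
    "nice",
]
labels = [1, 1, 1, 0, 0, 0]

# CountVectorizer turns each text into a sparse vector of word counts;
# LogisticRegression (an assumed choice) learns a weight per word.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

predictions = model.predict([
    "Sizing is accurate and the fabric is thick",
    "meh",
])
```

On the real data, holding out a test set with `sklearn.model_selection.train_test_split` before fitting would give an honest estimate of accuracy.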