In [1]:
from __future__ import print_function  # kept first: a plain .py script requires __future__ imports before any other statement
import json
import pandas as pd

# Load Amazon review data

This notebook shows the basic Python required to load Amazon product review data. If you're only interested in topic modelling, feel free to ignore this!

The data is the public, cleaned Amazon product review data from [http://jmcauley.ucsd.edu/data/amazon/](http://jmcauley.ucsd.edu/data/amazon/).

This notebook uses the 5-core Pet Supplies data, available at [http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Pet_Supplies_5.json.gz](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Pet_Supplies_5.json.gz).

In case uncompressing this file is awkward on your system (I don't know how easy it is on Windows), I have included the first 100 MB of it, decompressed, as the file `reviews.json`.
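
If you would rather work from the full download, Python's standard-library `gzip` module can read the compressed file directly on any platform, so no separate decompression tool is needed. A minimal sketch, in which the helper name `read_gzipped_lines` and the UTF-8 encoding are my assumptions rather than anything specified by the dataset:

```python
import gzip


def read_gzipped_lines(path):
    """Return the lines of a gzip-compressed text file as a list of strings."""
    # "rt" opens the archive in text mode, so each line is a str rather than bytes.
    # The UTF-8 encoding is an assumption about the file.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.readlines()
```

Assuming the archive has been saved next to the notebook, `lines = read_gzipped_lines("reviews_Pet_Supplies_5.json.gz")` would give the same kind of list of strings as the cell below, only for the complete file.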

Let's read that file into a list of strings:

In [2]:
with open("reviews.json", 'r') as f:
    lines = f.readlines()

And take a look at the first item in the list (i.e. the first line in the file):

In [3]:
print(lines[0], type(lines[0]))

{"reviewerID": "A14CK12J7C7JRK", "asin": "1223000893", "reviewerName": "Consumer in NorCal", "helpful": [0, 0], "reviewText": "I purchased the Trilogy with hoping my two cats, age 3 and 5 would be interested.  The 3 yr old cat was fascinated for about 15 minutes but when the same pictures came on, she got bored.  The 5 year old watched for about a few minutes but then walked away. It is possible that because we have a wonderful courtyard full of greenery and trees and one of my neighbors has a bird feeder, that there is enough going on outside that they prefer real life versus a taped version.  I will more than likely pass this on to a friend who has cats that don't have as much wildlife to watch as mine do.", "overall": 3.0, "summary": "Nice Distraction for my cats for about 15 minutes", "unixReviewTime": 1294790400, "reviewTime": "01 12, 2011"}
 <class 'str'>


So we have something that looks like a Python dictionary. If you've not seen these before, they're structured collections of data. Each dictionary contains items, but unlike a regular list, each item is referred to by a name (a "key"), rather than a number.
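
To make the contrast between a list and a dictionary concrete, here is a tiny sketch that uses two values from the review above. The variable names are made up for illustration:

```python
# A list item is looked up by its position...
review_as_list = ["A14CK12J7C7JRK", 3.0]
rating_from_list = review_as_list[1]

# ...whereas a dictionary item is looked up by its key.
review_as_dict = {"reviewerID": "A14CK12J7C7JRK", "overall": 3.0}
rating_from_dict = review_as_dict["overall"]

print(rating_from_list, rating_from_dict)
print(sorted(review_as_dict.keys()))
```

With the list you have to remember that the rating lives at position 1. With the dictionary the key `"overall"` says what the value is.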

But `lines[0]` isn't _exactly_ a Python dictionary. It's currently a string, which reflects its origin as JSON data. We need to parse it into a native Python dictionary to work with it.

We'll do that in one line with a list comprehension and the `loads` ("load string") function in `json`.

In [4]:
records = [json.loads(l) for l in lines]

If you're not comfortable with the syntax of list comprehensions, this one is equivalent to the following loop:

    records = []
    for l in lines:
        records.append(json.loads(l))
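
For a much larger file, it may be worth parsing each line as it is read, rather than first holding every raw line in memory with `readlines`. A minimal sketch, assuming one JSON object per line; the function name `load_records`, the UTF-8 encoding, and the blank-line handling are my additions:

```python
import json


def load_records(path):
    """Parse a file containing one JSON object per line into a list of dicts."""
    records = []
    # The UTF-8 encoding is an assumption about the file.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # iterate over the file lazily instead of calling readlines()
            line = line.strip()
            if not line:  # skip blank lines, which json.loads would reject
                continue
            records.append(json.loads(line))
    return records
```

Calling `records = load_records("reviews.json")` should build the same list as the loop above.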
        
We now have a list of Python dictionaries:

In [5]:
first_record = records[0]
print(first_record, type(first_record))

{'overall': 3.0, 'summary': 'Nice Distraction for my cats for about 15 minutes', 'reviewerName': 'Consumer in NorCal', 'helpful': [0, 0], 'unixReviewTime': 1294790400, 'reviewText': "I purchased the Trilogy with hoping my two cats, age 3 and 5 would be interested.  The 3 yr old cat was fascinated for about 15 minutes but when the same pictures came on, she got bored.  The 5 year old watched for about a few minutes but then walked away. It is possible that because we have a wonderful courtyard full of greenery and trees and one of my neighbors has a bird feeder, that there is enough going on outside that they prefer real life versus a taped version.  I will more than likely pass this on to a friend who has cats that don't have as much wildlife to watch as mine do.", 'reviewTime': '01 12, 2011', 'reviewerID': 'A14CK12J7C7JRK', 'asin': '1223000893'} <class 'dict'>


We can extract individual pieces of data using the keys:

In [6]:
first_record['reviewText']

"I purchased the Trilogy with hoping my two cats, age 3 and 5 would be interested.  The 3 yr old cat was fascinated for about 15 minutes but when the same pictures came on, she got bored.  The 5 year old watched for about a few minutes but then walked away. It is possible that because we have a wonderful courtyard full of greenery and trees and one of my neighbors has a bird feeder, that there is enough going on outside that they prefer real life versus a taped version.  I will more than likely pass this on to a friend who has cats that don't have as much wildlife to watch as mine do."

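Square-bracket lookup raises a `KeyError` if the key is absent. I haven't checked whether any record in this file lacks a field, but if one did, the dictionary's `get` method and the `in` operator offer a gentler way to look things up. A small sketch on a toy record (not taken from the data) that deliberately has no `reviewerName`:

```python
# Toy record for illustration only; it deliberately omits 'reviewerName'.
record = {"reviewerID": "A14CK12J7C7JRK", "overall": 3.0}

# get() returns the supplied default instead of raising KeyError.
name = record.get("reviewerName", "unknown")

# 'in' tests whether a key is present without fetching its value.
has_text = "reviewText" in record

print(name, has_text)
```

Here `name` falls back to the default and `has_text` is `False`, because neither key exists in the toy record.
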
Let's go one step further and just stick the whole list of dictionaries in a pandas DataFrame, and use `head` to look at the first few records.

In [7]:
reviews = pd.DataFrame(records)
reviews.head()

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1223000893 | [0, 0] | 3.0 | I purchased the Trilogy with hoping my two cat... | 01 12, 2011 | A14CK12J7C7JRK | Consumer in NorCal | Nice Distraction for my cats for about 15 minutes | 1294790400 |
| 1 | 1223000893 | [0, 0] | 5.0 | There are usually one or more of my cats watch... | 09 14, 2013 | A39QHP5WLON5HV | Melodee Placial | Entertaining for my cats | 1379116800 |
| 2 | 1223000893 | [0, 0] | 4.0 | I bought the triliogy and have tested out all ... | 12 19, 2012 | A2CR37UY3VR7BN | Michelle Ashbery | Entertaining | 1355875200 |
| 3 | 1223000893 | [2, 2] | 4.0 | My female kitty could care less about these vi... | 05 12, 2011 | A2A4COGL9VW2HY | Michelle P | Happy to have them | 1305158400 |
| 4 | 1223000893 | [6, 7] | 3.0 | If I had gotten just volume two, I would have ... | 03 5, 2012 | A2UBQA85NIGLHA | Tim Isenhour "Timbo" | You really only need vol 2 | 1330905600 |
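
The `unixReviewTime` column holds each review's date as an integer. Read as seconds since 1 January 1970, the first value, 1294790400, is 12 January 2011, which matches that row's `reviewTime`. Assuming every row follows the same convention, `pd.to_datetime` with `unit="s"` can convert the column into proper timestamps. The sketch below uses a two-row toy frame built from values in the table, and `reviewDate` is a hypothetical column name:

```python
import pandas as pd

# Toy frame built from the first two rows shown in the table above.
toy = pd.DataFrame({
    "overall": [3.0, 5.0],
    "unixReviewTime": [1294790400, 1379116800],
})

# Interpret the integers as seconds since the Unix epoch (an assumption about the data).
toy["reviewDate"] = pd.to_datetime(toy["unixReviewTime"], unit="s")

print(toy)
```

The same line applied to `reviews` would add a datetime column that pandas can sort, filter and group by date.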
