### Useful links:
- [Dataset documentation](https://nijianmo.github.io/amazon/index.html)
- [Complete Metadata files](http://deepyeti.ucsd.edu/jianmo/amazon/index.html)
- [Pandas reference sheet](https://ds100.org/sp21/resources/assets/exams/sp20/sp20_checkpoint_reference_sheet.pdf)
- [Data-200 Google Doc](https://docs.google.com/document/d/19HWODy5kpWoUB7BEKEmKLbRnK8MC1fBmRat_WP7vfNc/edit)
- [Grad Project Guidelines](https://ds100.org/sp21/grad_proj/gradproject/)

In [None]:
import os
import numpy as np
import pandas as pd
import json
import gzip
import urllib.request
from urllib.request import urlopen

In [None]:
# Download the 5-core Gift Cards reviews once; skip the download if the file is already cached
url = "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Gift_Cards_5.json.gz"
filename = 'Gift_Cards.json.gz'
if not os.path.exists(filename):
    urllib.request.urlretrieve(url, filename)


In [None]:
### load the data

# Each line of the gzipped file is one JSON object (one review)
data = []
with gzip.open(filename) as f:
    for line in f:
        data.append(json.loads(line.strip()))

# length of the list, which equals the total number of reviews
print(len(data))

# first element of the list (one review as a dict)
print(data[0])

#### Convert to dataframe:

In [None]:
reviews = pd.DataFrame(data)
reviews.head()

In [None]:
# Inspect the reviews with a given rating (here: 5 stars)
reviews[reviews['overall'] == 5]

#### Column labels:
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- vote - number of helpful votes the review received
- style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)
- image - images that users post after they have received the product
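
Two of these columns usually need a type conversion before analysis. The sketch below uses a tiny made-up frame rather than the real `reviews` data, and it assumes that `vote` arrives as a string (possibly with thousands separators) and is missing when a review has no votes; check both assumptions against the actual file.

```python
import pandas as pd

# Made-up sample rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "unixReviewTime": [1514764800, 1420070400],
    "vote": ["1,024", None],  # assumed format: string, missing when there are no votes
})

# Unix time is in seconds since 1970-01-01 UTC.
sample["review_date"] = pd.to_datetime(sample["unixReviewTime"], unit="s")

# Strip thousands separators, convert to numbers, and treat missing votes as 0.
votes = sample["vote"].str.replace(",", "", regex=False)
sample["vote"] = pd.to_numeric(votes, errors="coerce").fillna(0).astype(int)

print(sample)
```

Treating a missing `vote` as 0 is a modelling choice, not something the dataset documentation states.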

#### Checking 5-core:
A 5-core dataset contains only those users and products with at least 5 reviews each. The check below covers the user side: the smallest count should be 5 or more.

In [None]:
reviews.groupby(by="reviewerID").size().sort_values()
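
The same check can be made programmatic and extended to products, since the dataset documentation describes the k-core as applying to both users and items. This is a minimal sketch on made-up data; `is_k_core` is a hypothetical helper name, and duplicate reviews (if any) would inflate the counts.

```python
from itertools import product

import pandas as pd


def is_k_core(df, k=5, user_col="reviewerID", item_col="asin"):
    """Return True if every user and every item has at least k rows."""
    user_counts = df.groupby(user_col).size()
    item_counts = df.groupby(item_col).size()
    return bool(user_counts.min() >= k and item_counts.min() >= k)


# Made-up data: 5 users each review the same 5 products (25 rows).
pairs = list(product(["u1", "u2", "u3", "u4", "u5"], ["p1", "p2", "p3", "p4", "p5"]))
toy = pd.DataFrame(pairs, columns=["reviewerID", "asin"])

full_ok = is_k_core(toy)              # every user and product has 5 rows
reduced_ok = is_k_core(toy.iloc[1:])  # dropping one row breaks the 5-core

print(full_ok, reduced_ok)
```

On the real data the call would be `is_k_core(reviews)`.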

#### Import metadata:

In [None]:
# Download the Gift Cards product metadata once; skip the download if the file is already cached
url = "http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Gift_Cards.json.gz"
filename = 'Meta_Gift_Cards.json.gz'
if not os.path.exists(filename):
    urllib.request.urlretrieve(url, filename)

In [None]:
### load the data

# Each line of the gzipped file is one JSON object (one product)
data = []
with gzip.open(filename) as f:
    for line in f:
        data.append(json.loads(line.strip()))

# length of the list, which equals the total number of products
print(len(data))

# first element of the list (one product as a dict)
print(data[0])

#### Convert to dataframe:

In [None]:
metadata = pd.DataFrame(data)
metadata.head()
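
The reviews and metadata files are loaded with the same steps, so they could be wrapped in one helper. The sketch below is an assumption about how that might look, not part of the original notebook: `load_json_lines` is a hypothetical name, and the sample file is made up so the cell runs without a download.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def load_json_lines(path):
    """Read a gzipped JSON-lines file (one JSON object per line) into a DataFrame."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                rows.append(json.loads(line))
    return pd.DataFrame(rows)


# Tiny made-up file standing in for the real download.
sample_rows = [
    {"asin": "B000000001", "overall": 5.0},
    {"asin": "B000000002", "overall": 3.0},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for row in sample_rows:
        f.write(json.dumps(row) + "\n")

sample_df = load_json_lines(sample_path)
print(sample_df.shape)
```

With the real files this would be called as `load_json_lines('Gift_Cards.json.gz')` and `load_json_lines('Meta_Gift_Cards.json.gz')`.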

In [None]:
# Count rows per product ID; any count above 1 means the metadata has duplicate entries
metadata['asin'].value_counts().sort_values()

#### Merging the reviews and metadata on `asin`:

In [None]:
df = reviews.merge(metadata, how="left", on="asin")

In [None]:
df.shape

In [None]:
df.head()
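
A left merge silently duplicates review rows if the metadata contains the same `asin` more than once, and both tables have an `image` column that pandas will rename with suffixes. The sketch below shows one way to guard against both on made-up data; the frame names and suffixes are illustrative assumptions, and dropping duplicates by `asin` assumes the duplicate rows carry the same information.

```python
import pandas as pd

# Made-up stand-ins for the reviews and metadata frames.
toy_reviews = pd.DataFrame({
    "asin": ["A1", "A1", "A2", "A3"],
    "overall": [5.0, 4.0, 3.0, 5.0],
    "image": [None, None, None, None],
})
toy_metadata = pd.DataFrame({
    "asin": ["A1", "A2", "A2"],  # A2 appears twice, A3 is missing
    "title": ["Card one", "Card two", "Card two"],
    "image": ["url1", "url2", "url2"],
})

# Keep one metadata row per product so the merge cannot multiply reviews.
meta_unique = toy_metadata.drop_duplicates(subset="asin")

merged = toy_reviews.merge(
    meta_unique,
    how="left",
    on="asin",
    suffixes=("_review", "_product"),  # disambiguate the two `image` columns
    validate="many_to_one",            # raises if metadata still has duplicate keys
    indicator=True,                    # adds a `_merge` column
)

# Reviews whose product has no metadata row.
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

Comparing `len(merged)` with the number of reviews is a quick sanity check that no rows were added by the merge.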

#### Metadata column labels:
- asin - ID of the product, e.g. 0000031852
- title - name of the product
- feature - bullet-point format features of the product
- description - description of the product
- price - price in US dollars (at time of crawl)
- image - url of the product image
- related - related products (also bought, also viewed, bought together, buy after viewing)
- salesRank - sales rank information
- brand - brand name
- categories - list of categories the product belongs to
- tech1 - the first technical detail table of the product
- tech2 - the second technical detail table of the product
- similar - similar product table

#### We can clean the data a little:
- Rename the `overall` column to `rating`
- Rename `asin` to `productid`
- Extract `gift_amount` from `style`
- Extract `rank#` from `rank`
- Remove rows with no price information in either the `price` column or `gift_amount`, as both should be the same
- Open question: is `price` missing? If so, should we change the dataset?
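
The steps above can be sketched on a small made-up frame. The value formats here are assumptions that should be checked against the real data: that `style` is a dict with a `"Gift Amount:"` key, that `rank` is a string beginning with a number, and that `price` is a string such as `"$50.00"` or empty. The column name `rank_number` is a hypothetical stand-in for `rank#`.

```python
import pandas as pd

# Made-up rows; the value formats are assumptions, not taken from the dataset.
toy = pd.DataFrame({
    "asin": ["A1", "A2", "A3"],
    "overall": [5.0, 4.0, 1.0],
    "style": [{"Gift Amount:": " 50"}, None, {"Gift Amount:": " 25"}],
    "rank": ["1,234 in Gift Cards (", "56 in Gift Cards (", None],
    "price": ["$50.00", "", ""],
})

# Rename columns.
cleaned = toy.rename(columns={"overall": "rating", "asin": "productid"})

# Pull the gift amount out of the style dict; rows without a dict become missing.
raw_amount = cleaned["style"].apply(
    lambda s: s.get("Gift Amount:") if isinstance(s, dict) else None
)
cleaned["gift_amount"] = pd.to_numeric(raw_amount.str.strip(), errors="coerce")

# Take the first run of digits and commas in `rank` as the rank number.
rank_digits = cleaned["rank"].str.extract(r"([\d,]+)")[0].str.replace(",", "", regex=False)
cleaned["rank_number"] = pd.to_numeric(rank_digits, errors="coerce")

# Strip currency symbols; empty or unparseable prices become missing.
cleaned["price"] = pd.to_numeric(
    cleaned["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Drop rows that have neither a price nor a gift amount.
cleaned = cleaned.dropna(subset=["price", "gift_amount"], how="all")

print(cleaned[["productid", "rating", "gift_amount", "rank_number", "price"]])
```

Counting the missing values in `price` after the conversion, for example with `cleaned["price"].isna().mean()`, gives a number to inform the open question about changing the dataset.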