# Prepare Dataset for Model Training and Evaluation

# Amazon Customer Reviews Dataset

https://s3.amazonaws.com/amazon-reviews-pds/readme.html

## Schema

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title description of the product.
- `product_category`: Broad product category that can be used to group reviews (in this case digital software).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.
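Because every later step selects columns by name, it can help to check a file's header against the schema above before loading it. The following is a minimal sketch using only the standard library; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration and are not part of the dataset tooling.

```python
import csv
import io

# Column names copied from the schema list above.
EXPECTED_COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id", "product_parent",
    "product_title", "product_category", "star_rating", "helpful_votes",
    "total_votes", "vine", "verified_purchase", "review_headline",
    "review_body", "review_date",
]


def missing_columns(tsv_text):
    """Return the expected columns that are absent from the TSV header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, [])
    return [name for name in EXPECTED_COLUMNS if name not in header]


# Toy header that lacks the final column, review_date.
sample = "\t".join(EXPECTED_COLUMNS[:-1]) + "\n"
print(missing_columns(sample))
```

An empty list means the header carries every column the schema names.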

In [None]:
import boto3
import sagemaker
import pandas as pd

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

## Download

Let's start by retrieving the Digital Software subset of the Amazon Customer Reviews dataset.

In [None]:
!aws s3 cp 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz' ./data/

In [None]:
import csv

df = pd.read_csv('./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz', 
                 delimiter='\t', 
                 quoting=csv.QUOTE_NONE,
                 compression='gzip')
df.shape
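The `quoting=csv.QUOTE_NONE` argument matters because review text can contain stray double quotes. The toy example below, a sketch using the standard library `csv` module rather than pandas, shows how an unbalanced quote can swallow the following rows when quote handling is left on; the sample rows are made up for illustration.

```python
import csv
import io

# Made-up TSV: the headline on row R1 starts with a quote that is never closed.
raw = (
    "review_id\treview_headline\tstar_rating\n"
    'R1\t"Great buy\t5\n'
    "R2\tFine\t4\n"
)


def parse(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting))


rows_no_quoting = parse(raw, csv.QUOTE_NONE)    # quotes treated as ordinary characters
rows_default = parse(raw, csv.QUOTE_MINIMAL)    # quotes start a quoted field

print(len(rows_no_quoting))
print(len(rows_default))
```

With `QUOTE_NONE` each line stays a separate row and the quote character is kept as part of the headline; with the default mode the unclosed quote merges later lines into one field, so fewer rows come back.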

In [None]:
df.head(5)

# Select `customer_id`, `product_id`, and `review_date` for Training

In [None]:
df_train = df[['customer_id', 'product_id', 'review_date']].rename(columns={"customer_id": "user_id", 
                                                                            "product_id": "item_id", 
                                                                            "review_date": "timestamp"})

df_train.shape

In [None]:
df_train.head(5)

# Convert YYYY-MM-DD to unix epoch time (number of seconds since 1970)

In [None]:
df_user_item = df[['customer_id', 'product_id', 'review_date']].rename(
    columns={'customer_id': 'user_id',
             'product_id': 'item_id',
             'review_date': 'timestamp'})

df_user_item.shape
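The conversion named in the heading above can be cross-checked on single values without pandas. This is a minimal sketch, assuming each `review_date` denotes midnight UTC on that day (an assumption; the dataset does not state a timezone). The helper name `date_to_epoch_seconds` is hypothetical.

```python
import calendar
from datetime import datetime


def date_to_epoch_seconds(date_string):
    """Convert a YYYY-MM-DD string to whole seconds since 1970-01-01 UTC."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    # timegm interprets the time tuple as UTC, so the local timezone is ignored.
    return calendar.timegm(parsed.timetuple())


print(date_to_epoch_seconds("1970-01-02"))  # 86400, one day after the epoch
print(date_to_epoch_seconds("2015-08-31"))
```

Spot-checking a few rows of the converted column against this helper is a quick way to confirm the values are in seconds rather than milliseconds or nanoseconds.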

In [None]:
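# Parse the YYYY-MM-DD strings; values that cannot be parsed become NaT
dates = pd.to_datetime(df_user_item['timestamp'], format='%Y-%m-%d', errors='coerce')

# Drop rows without a valid date, then store whole seconds since 1970-01-01
df_user_item = df_user_item[dates.notna()].copy()
df_user_item['timestamp'] = (dates[dates.notna()] - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

In [None]:
df_user_item

# Write a Train CSV with Header for Amazon Personalize

In [None]:
personalize_user_item_path = './amazon_reviews_us_Digital_Software_v1_00_personalize_user_item.csv'
df_user_item.to_csv(personalize_user_item_path, index=False, header=True)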

# Upload Train Data to S3

In [None]:
train_s3_prefix = 'data'
personalize_user_item_s3_uri = sess.upload_data(path=personalize_user_item_path, key_prefix=train_s3_prefix)
personalize_user_item_s3_uri

In [None]:
!aws s3 ls $personalize_user_item_s3_uri
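`upload_data` returns the S3 URI of the uploaded object, which is what the listing above checks. As a sketch of the layout to expect, assuming the object key is the key prefix joined with the local file's base name (worth confirming against the URI printed by the upload cell), the hypothetical helper below builds that URI from its parts. The bucket and file names are placeholders.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build s3://<bucket>/<key_prefix>/<file name> for a local file path."""
    file_name = posixpath.basename(local_path)
    return "s3://{}/{}".format(bucket, posixpath.join(key_prefix, file_name))


# Placeholder bucket and file names, for illustration only.
print(expected_s3_uri("my-example-bucket", "data", "./reviews_user_item.csv"))
```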

# User Features

In [None]:
# Keep one row per user; the raw data has one row per review
df_user = df[['customer_id']].drop_duplicates().rename(columns={"customer_id": "user_id"})
df_user.shape

In [None]:
df_user.head(5)

In [None]:
personalize_user_path = './amazon_reviews_us_Digital_Software_v1_00_personalize_user.csv'
df_user.to_csv(personalize_user_path, index=False, header=True)

In [None]:
train_s3_prefix = 'data'
personalize_user_s3_uri = sess.upload_data(path=personalize_user_path, key_prefix=train_s3_prefix)
personalize_user_s3_uri

In [None]:
!aws s3 ls $personalize_user_s3_uri

# Item Features

In [None]:
# Keep one row per item; the raw data has one row per review
df_item = df[['product_id', 'product_title', 'product_category']].drop_duplicates(subset='product_id').rename(columns={"product_id": "item_id"})
df_item.shape

In [None]:
df_item.head(5)

In [None]:
personalize_item_path = './amazon_reviews_us_Digital_Software_v1_00_personalize_item.csv'
df_item.to_csv(personalize_item_path, index=False, header=True)

In [None]:
train_s3_prefix = 'data'
personalize_item_s3_uri = sess.upload_data(path=personalize_item_path, key_prefix=train_s3_prefix)
personalize_item_s3_uri

In [None]:
!aws s3 ls $personalize_item_s3_uri

# TODO:  Prepare Personalize Datasets
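One piece of that preparation is describing each CSV with a schema. The following is a minimal sketch for the user-item interactions file, assuming Amazon Personalize's Avro-style JSON schema format with `USER_ID`, `ITEM_ID`, and `TIMESTAMP` fields. The function name is hypothetical, and whether the lowercase CSV headers written earlier need to be renamed to match these field names should be checked against the Personalize documentation.

```python
import json


def interactions_schema():
    """Return an Avro-style schema dict for user-item interaction records."""
    return {
        "type": "record",
        "name": "Interactions",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": [
            {"name": "USER_ID", "type": "string"},
            {"name": "ITEM_ID", "type": "string"},
            # Epoch seconds, matching the converted timestamp column.
            {"name": "TIMESTAMP", "type": "long"},
        ],
        "version": "1.0",
    }


# Serialize to the JSON string form that a schema definition would be submitted in.
schema_json = json.dumps(interactions_schema(), indent=2)
print(schema_json)
```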

# Store the S3 Locations of the Training Data for Use in the Next Notebook

In [None]:
%store personalize_user_item_s3_uri

In [None]:
%store personalize_item_s3_uri

In [None]:
%store personalize_user_s3_uri

In [None]:
%store 

In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();