# Prepare Dataset for Model Training and Evaluation

# Amazon Customer Reviews Dataset

https://s3.amazonaws.com/amazon-reviews-pds/readme.html

## Schema

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title description of the product.
- `product_category`: Broad product category that can be used to group reviews (in this case digital software).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.

In [None]:
import boto3
import sagemaker
import pandas as pd

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

## Download

Let's start by retrieving a subset of the Amazon Customer Reviews dataset.

In [None]:
!aws s3 cp 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz' ./data/

In [None]:
import csv

df = pd.read_csv('./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz', 
                 delimiter='\t', 
                 quoting=csv.QUOTE_NONE,
                 compression='gzip')
df.shape
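
The `quoting=csv.QUOTE_NONE` argument matters because free-text review fields can contain unbalanced quote characters. With the default quoting, a quote at the start of a field opens a quoted field, so a stray one can merge rows or raise a parse error. The sketch below illustrates the `QUOTE_NONE` behaviour on a small in-memory TSV; the two rows and the `toy_` names are made up for illustration and are not taken from the dataset.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row TSV; the first review body starts with an unbalanced quote.
toy_tsv = 'review_id\treview_body\nR1\t"Great product\nR2\tWorks fine\n'

# QUOTE_NONE tells the parser to treat the quote as an ordinary character,
# so each physical line stays one row.
toy_df = pd.read_csv(io.StringIO(toy_tsv),
                     delimiter='\t',
                     quoting=csv.QUOTE_NONE)

print(toy_df.shape)
print(toy_df.loc[0, 'review_body'])
```

The quote character is kept as part of the text, so it may need to be stripped later if the downstream model should not see it.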

In [None]:
# df = pd.read_csv('amazon_reviews_us_Digital_Software_v1_00_train.csv')
# df.shape

In [None]:
df.head(5)

### Enrich the data with `is_positive_sentiment` label
* Positive (`1`): `star_rating >= 4`
* Negative (`0`): `star_rating <= 3`

In [None]:
df['is_positive_sentiment'] = (df['star_rating'] >= 4).astype(int)

df.head()
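
One thing worth checking before relying on the label: a comparison against `NaN` evaluates to `False`, so any row with a missing `star_rating` would silently be labelled `0`. The following is a minimal sketch on a hypothetical toy frame (the `toy` data is invented) that shows the labelling rule and the missing-value case.

```python
import pandas as pd

# Hypothetical ratings, including one missing value.
toy = pd.DataFrame({'star_rating': [1, 2, 3, 4, 5, None]})

# Same rule as above: 4 and 5 stars are positive, everything else is negative.
# NaN >= 4 is False, so the missing rating ends up with label 0.
toy['is_positive_sentiment'] = (toy['star_rating'] >= 4).astype(int)

labels = toy['is_positive_sentiment'].tolist()
missing = int(toy['star_rating'].isna().sum())

print(labels)
print(missing)

# Share of each class, useful for spotting class imbalance.
print(toy['is_positive_sentiment'].value_counts(normalize=True))
```

If `df['star_rating'].isna().sum()` is non-zero on the real data, dropping those rows before labelling is one option.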

# Split the data into `train` (90%) and `test` (10%) datasets

In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.10, stratify=df['is_positive_sentiment'])
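
The `stratify` argument keeps the share of positive and negative labels the same in both splits. As a rough illustration, the sketch below runs the same call on a hypothetical 100-row frame with an 80/20 class balance and compares the ratios. The `random_state=42` value is an addition made here for reproducibility and is not part of the original call.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80 positive, 20 negative.
toy = pd.DataFrame({'is_positive_sentiment': [1] * 80 + [0] * 20})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.10,
    stratify=toy['is_positive_sentiment'],
    random_state=42,  # assumption: fixed seed so the split is repeatable
)

# Fraction of positive labels in each split.
train_ratio = toy_train['is_positive_sentiment'].mean()
test_ratio = toy_test['is_positive_sentiment'].mean()

print(len(toy_train), len(toy_test))
print(train_ratio, test_ratio)
```

Running the same comparison on `df_train` and `df_test` is a quick sanity check that the split preserved the label distribution.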

In [None]:
train_path = 'amazon_reviews_us_Digital_Software_v1_00_train.csv'

df_train.to_csv(train_path, index=False, header=True)

train_s3_prefix = 'data'

train_s3_uri = sess.upload_data(path=train_path, key_prefix=train_s3_prefix)

In [None]:
test_path = 'amazon_reviews_us_Digital_Software_v1_00_test.csv'

df_test.to_csv(test_path, index=False, header=True)

test_s3_prefix = 'data'

test_s3_uri = sess.upload_data(path=test_path, key_prefix=test_s3_prefix)

# Upload Train Data to S3

In [None]:
train_s3_prefix = 'data'
train_s3_uri = sess.upload_data(path='amazon_reviews_us_Digital_Software_v1_00_train.csv', key_prefix=train_s3_prefix)

In [None]:
print(train_s3_uri)

!aws s3 ls $train_s3_uri
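
The value returned by `upload_data` is an `s3://` URI. If a later step needs the bucket and key separately (for example, for a `boto3` call), one possible approach is to split the URI with the standard library. In this sketch, `split_s3_uri` is a hypothetical helper and `example-bucket` is a placeholder, not the actual default bucket.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split an s3:// URI into (bucket, key). Hypothetical helper."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(f'Not an S3 URI: {s3_uri}')
    # netloc holds the bucket; the path holds the key with a leading slash.
    return parsed.netloc, parsed.path.lstrip('/')


# Placeholder URI in the same shape as the uploaded train file.
example_uri = 's3://example-bucket/data/amazon_reviews_us_Digital_Software_v1_00_train.csv'
bucket_name, key = split_s3_uri(example_uri)

print(bucket_name)
print(key)
```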

# Store the location of the train data for use in the next notebook

In [None]:
%store train_s3_uri