# Prepare Dataset for Model Training and Evaluation

# Amazon Customer Reviews Dataset

https://s3.amazonaws.com/amazon-reviews-pds/readme.html

## Schema

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title description of the product.
- `product_category`: Broad product category that can be used to group reviews (in this case digital software).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.

In [None]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

## Download

Let's start by retrieving a subset of the Amazon Customer Reviews dataset.

In [None]:
!aws s3 cp 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz' ./data/

In [None]:
import csv

df = pd.read_csv('./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz', 
                 delimiter='\t', 
                 quoting=csv.QUOTE_NONE,
                 compression='gzip')
df.shape
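
The `quoting=csv.QUOTE_NONE` argument matters because review text can contain stray double quotes. As a minimal sketch, using Python's standard `csv` reader on a made-up three-line sample (pandas may differ in the details), here is what happens to a row whose review starts with an unclosed quote:

```python
import csv
import io

# Hypothetical sample: the first review starts with a quote that is never closed.
sample = (
    "review_id\tstar_rating\treview_body\n"
    'R1\t5\t"Great product\n'
    "R2\t1\tStopped working after a week\n"
)

def parse(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting))

# Default quoting treats the stray quote as the start of a quoted field,
# so the following line is swallowed into the same field.
default_rows = parse(sample, csv.QUOTE_MINIMAL)

# QUOTE_NONE treats the quote as an ordinary character and keeps rows apart.
raw_rows = parse(sample, csv.QUOTE_NONE)

print(len(default_rows), len(raw_rows))
```

With `QUOTE_NONE` every line stays its own row, which is the behavior we want for free-form review text.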

In [None]:
df.head(5)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

df[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='Breakdown by Star Rating')
plt.xlabel('Star Rating')
plt.ylabel('Review Count')

# Balance the dataset

In [None]:
from sklearn.utils import resample

five_star_df = df.query('star_rating == 5')
four_star_df = df.query('star_rating == 4')
three_star_df = df.query('star_rating == 3')
two_star_df = df.query('star_rating == 2')
one_star_df = df.query('star_rating == 1')

# Downsample every rating to the size of the smallest class instead of
# assuming that 2-star reviews are the rarest.
star_dfs = [five_star_df, four_star_df, three_star_df, two_star_df, one_star_df]
n_samples = min(len(star_df) for star_df in star_dfs)

df_balanced = pd.concat([resample(star_df,
                                  replace=False,
                                  n_samples=n_samples,
                                  random_state=27)
                         for star_df in star_dfs])
df_balanced = df_balanced.reset_index(drop=True)
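
The idea behind this cell, downsampling every rating to the size of the rarest one, can be checked on toy data. The sketch below uses only the standard library; the `undersample` helper and the toy rows are hypothetical and are not part of the notebook's pipeline.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, label_key, seed=27):
    """Downsample every class to the size of the smallest one."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    n_samples = min(len(group) for group in by_label.values())
    rng = random.Random(seed)

    balanced = []
    for label in sorted(by_label, reverse=True):
        # Sample without replacement, like resample(..., replace=False).
        balanced.extend(rng.sample(by_label[label], n_samples))
    return balanced

# Toy data: six 5-star, three 3-star and two 1-star reviews.
toy_rows = (
    [{"star_rating": 5, "review_body": f"great {i}"} for i in range(6)]
    + [{"star_rating": 3, "review_body": f"okay {i}"} for i in range(3)]
    + [{"star_rating": 1, "review_body": f"bad {i}"} for i in range(2)]
)

balanced_rows = undersample(toy_rows, "star_rating")
counts = Counter(row["star_rating"] for row in balanced_rows)
print(counts)
```

Every rating ends up with as many rows as the smallest class had, at the cost of discarding data from the larger classes.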

In [None]:
df_balanced[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='Breakdown by Star Rating')
plt.xlabel('Star Rating')
plt.ylabel('Review Count')

In [None]:
df_train = df_balanced[['star_rating', 'review_body']]
df_train.shape

In [None]:
df_train.head(5)

# Write a CSV with Header for AutoPilot 

In [None]:
autopilot_train_path = './amazon_reviews_us_Digital_Software_v1_00_autopilot_header.csv'
df_train.to_csv(autopilot_train_path, index=False, header=True)

# Upload Train Data to S3 for AutoPilot

In [None]:
train_s3_prefix = 'data'
autopilot_train_s3_uri = sess.upload_data(path=autopilot_train_path, key_prefix=train_s3_prefix)
autopilot_train_s3_uri
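
For a single file, the URI returned by `upload_data` should have the shape `s3://<bucket>/<key_prefix>/<file name>`. The sketch below only rebuilds that string so we know what to expect. `expected_upload_uri` is a hypothetical helper, the bucket name is a placeholder, and nothing is uploaded.

```python
import posixpath

def expected_upload_uri(bucket, key_prefix, local_path):
    """Hypothetical helper: URI shape expected for a single uploaded file."""
    file_name = posixpath.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{file_name}"

uri = expected_upload_uri(
    bucket="sagemaker-us-east-1-111122223333",  # placeholder bucket name
    key_prefix="data",
    local_path="./amazon_reviews_us_Digital_Software_v1_00_autopilot_header.csv",
)
print(uri)
```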

# Write a CSV without Header for Comprehend 

In [None]:
comprehend_train_path = './amazon_reviews_us_Digital_Software_v1_00_comprehend_noheader.csv'
df_train.to_csv(comprehend_train_path, index=False, header=False)
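
The difference between a CSV with and without a header row is easy to see on two made-up reviews. This is a minimal standard-library sketch. `to_csv_text` is a hypothetical helper whose `header` flag plays the same role as the `header` argument of `DataFrame.to_csv`.

```python
import csv
import io

def to_csv_text(rows, fieldnames, header):
    """Serialize dict rows to CSV text, optionally with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    if header:
        writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up rows with the same two columns as df_train.
rows = [
    {"star_rating": 5, "review_body": "Works great"},
    {"star_rating": 1, "review_body": "Would not install"},
]
columns = ["star_rating", "review_body"]

with_header = to_csv_text(rows, columns, header=True).splitlines()
without_header = to_csv_text(rows, columns, header=False).splitlines()

print(with_header[0])     # column names
print(without_header[0])  # first data row
```

In the headerless file the first line is already a training example, so a consumer that expects no header will not mistake the column names for data.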

In [None]:
!aws s3 ls $autopilot_train_s3_uri

# Upload Train Data to S3 for Comprehend

In [None]:
train_s3_prefix = 'data'
comprehend_train_s3_uri = sess.upload_data(path=comprehend_train_path, key_prefix=train_s3_prefix)
comprehend_train_s3_uri

# Store the location of our train data in our notebook server to be used next

In [None]:
%store autopilot_train_s3_uri

In [None]:
%store comprehend_train_s3_uri