# Prepare Dataset for Model Training and Evaluation

# Amazon Customer Reviews Dataset

https://s3.amazonaws.com/amazon-reviews-pds/readme.html

## Schema

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title description of the product.
- `product_category`: Broad product category that can be used to group reviews (in this case digital software).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.
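As a quick sanity check, the schema above can be compared against the columns of a loaded DataFrame. This is a minimal sketch: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the sample frame is synthetic.

```python
import pandas as pd

# Column names copied from the schema list above.
EXPECTED_COLUMNS = [
    'marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent',
    'product_title', 'product_category', 'star_rating', 'helpful_votes',
    'total_votes', 'vine', 'verified_purchase', 'review_headline',
    'review_body', 'review_date',
]

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Hypothetical helper: return the expected column names absent from `frame`."""
    return [name for name in expected if name not in frame.columns]

# Synthetic, empty frame that deliberately lacks the last column.
sample = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1])
print(missing_columns(sample))  # prints ['review_date']
```

Once the dataset is loaded later in this notebook, the same helper could be called as `missing_columns(df)`.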

In [None]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

## Download

Let's start by retrieving a subset of the Amazon Customer Reviews dataset: the reviews in the Digital Software category.

In [None]:
!aws s3 cp 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz' ./data/

In [None]:
import csv

df = pd.read_csv('./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz', 
                 delimiter='\t', 
                 quoting=csv.QUOTE_NONE,
                 compression='gzip')
df.shape
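The `quoting=csv.QUOTE_NONE` argument tells pandas to treat quotation marks as ordinary characters. Review text often contains unbalanced quotes, which the default quoting rules can misread as the start of a multi-line field. A minimal sketch on a synthetic two-row TSV (the sample text is made up for illustration):

```python
import csv
import io

import pandas as pd

# Synthetic TSV: the first review body starts with a quote that is never closed.
raw = 'review_id\treview_body\nR1\t"Great tool\nR2\tWorks fine\n'

df_sample = pd.read_csv(io.StringIO(raw),
                        delimiter='\t',
                        quoting=csv.QUOTE_NONE)

# Both rows survive, and the stray quote is kept as part of the text.
print(df_sample.shape)
print(df_sample.loc[0, 'review_body'])
```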

In [None]:
df.head(5)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

df[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='Breakdown by Star Rating')
plt.xlabel('Star Rating')
plt.ylabel('Review Count')
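The same breakdown can be read as numbers rather than bars, which makes the class imbalance easier to quantify. A minimal sketch on synthetic ratings; on the real data, `df['star_rating']` would take the place of the sample column:

```python
import pandas as pd

# Synthetic ratings standing in for the real `star_rating` column.
ratings = pd.DataFrame({'star_rating': [5, 5, 5, 4, 3, 1, 1],
                        'review_id': list('abcdefg')})

# Count reviews per rating, ordered from 1 to 5 stars.
counts = ratings['star_rating'].value_counts().sort_index()

# Share of the dataset that each rating represents.
shares = (counts / counts.sum()).round(3)

print(counts)
print(shares)
```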

# Balance the Dataset

In [None]:
from sklearn.utils import resample

five_star_df = df.query('star_rating == 5')
four_star_df = df.query('star_rating == 4')
three_star_df = df.query('star_rating == 3')
two_star_df = df.query('star_rating == 2')
one_star_df = df.query('star_rating == 1')

# Find the star rating with the fewest samples
minority_count = min(five_star_df.shape[0], 
                     four_star_df.shape[0], 
                     three_star_df.shape[0], 
                     two_star_df.shape[0], 
                     one_star_df.shape[0]) 

five_star_df = resample(five_star_df,
                        replace = False,
                        n_samples = minority_count,
                        random_state = 27)

four_star_df = resample(four_star_df,
                        replace = False,
                        n_samples = minority_count,
                        random_state = 27)

three_star_df = resample(three_star_df,
                        replace = False,
                        n_samples = minority_count,
                        random_state = 27)

two_star_df = resample(two_star_df,
                        replace = False,
                        n_samples = minority_count,
                        random_state = 27)

one_star_df = resample(one_star_df,
                        replace = False,
                        n_samples = minority_count,
                        random_state = 27)

df_balanced = pd.concat([five_star_df, four_star_df, three_star_df, two_star_df, one_star_df])
df_balanced = df_balanced.reset_index(drop=True)

df_balanced.shape
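The per-rating downsampling above can also be written more compactly with `DataFrameGroupBy.sample` (available in pandas 1.1 and later). This is an alternative sketch on synthetic data, not the approach used in the cell above; the rows it selects may differ from those picked by `resample`, even with the same seed.

```python
import pandas as pd

# Synthetic, imbalanced data: six 5-star, three 4-star, and two 1-star reviews.
reviews = pd.DataFrame({
    'star_rating': [5] * 6 + [4] * 3 + [1] * 2,
    'review_id': [f'R{i}' for i in range(11)],
})

# Size of the smallest class.
minority_count = int(reviews['star_rating'].value_counts().min())

# Draw `minority_count` rows without replacement from every rating group.
balanced = (reviews.groupby('star_rating')
                   .sample(n=minority_count, random_state=27)
                   .reset_index(drop=True))

print(balanced['star_rating'].value_counts().sort_index())
```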

In [None]:
df_balanced[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='Breakdown by Star Rating')
plt.xlabel('Star Rating')
plt.ylabel('Review Count')

In [None]:
df_balanced.head(5)

# Split the Data into Train, Validation, and Test Sets

In [None]:
from sklearn.model_selection import train_test_split

# Split all data into 90% train and 10% holdout.
# random_state makes the split reproducible, like the resampling step above.
df_train, df_holdout = train_test_split(df_balanced, 
                                        test_size=0.10,
                                        stratify=df_balanced['star_rating'],
                                        random_state=27)

# Split holdout data into 50% validation and 50% test (5% each of the full dataset)
df_validation, df_test = train_test_split(df_holdout,
                                          test_size=0.50, 
                                          stratify=df_holdout['star_rating'],
                                          random_state=27)
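To see what the two-stage stratified split produces, here is a minimal sketch on synthetic data with 200 rows per rating. The frame and variable names are made up for illustration, and `random_state=27` is an assumption added so the result is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, already balanced data: 200 reviews for each of the 5 ratings.
data = pd.DataFrame({'star_rating': [1, 2, 3, 4, 5] * 200,
                     'review_body': ['text'] * 1000})

# Stage 1: 90% train, 10% holdout, keeping the rating proportions.
train_part, holdout_part = train_test_split(
    data, test_size=0.10, stratify=data['star_rating'], random_state=27)

# Stage 2: split the holdout evenly into validation and test.
validation_part, test_part = train_test_split(
    holdout_part, test_size=0.50, stratify=holdout_part['star_rating'],
    random_state=27)

for name, part in [('train', train_part),
                   ('validation', validation_part),
                   ('test', test_part)]:
    # Row count and per-rating counts for each split.
    print(name, len(part), part['star_rating'].value_counts().sort_index().tolist())
```

Stratifying the second split on the holdout's own `star_rating` column keeps the validation and test sets balanced as well.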


In [None]:
# Pie chart of the split sizes; slices are ordered and plotted counter-clockwise.
labels = ['Train', 'Validation', 'Test']
sizes = [len(df_train.index), len(df_validation.index), len(df_test.index)]
explode = (0.1, 0, 0)  # Offset the first slice (Train) to highlight it

fig1, ax1 = plt.subplots()

ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', startangle=90)

# Equal aspect ratio ensures that pie is drawn as a circle.
ax1.axis('equal')  

plt.show()

# Show 90% Train Data Split

In [None]:
df_train.shape

In [None]:
df_train[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='90% Train Breakdown by Star Rating')

# Show 5% Validation Data Split

In [None]:
df_validation.shape

In [None]:
df_validation[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='5% Validation Breakdown by Star Rating')

# Show 5% Test Data Split

In [None]:
df_test.shape

In [None]:
df_test[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='5% Test Breakdown by Star Rating')

# Select `star_rating` and `review_body` for Training

In [None]:
df_train = df_train[['star_rating', 'review_body']]
df_train.shape

In [None]:
df_train.head(5)

# Write a Train CSV with Header for Autopilot

In [None]:
header_train_path = './amazon_reviews_us_Digital_Software_v1_00_header.csv'
df_train.to_csv(header_train_path, index=False, header=True)

# Upload Train Data to S3 for Autopilot

In [None]:
train_s3_prefix = 'data'
header_train_s3_uri = sess.upload_data(path=header_train_path, key_prefix=train_s3_prefix)
header_train_s3_uri

In [None]:
!aws s3 ls $header_train_s3_uri
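For a single file, `upload_data` should return a URI of the form `s3://<bucket>/<key_prefix>/<filename>`. The sketch below rebuilds that shape in plain Python so the result can be checked by eye; `expected_s3_uri` is a hypothetical helper, `example-bucket` is a placeholder, and no AWS call is made.

```python
import posixpath
from pathlib import Path

def expected_s3_uri(bucket, key_prefix, local_path):
    """Hypothetical helper: build the URI expected for one uploaded file."""
    filename = Path(local_path).name
    return 's3://' + posixpath.join(bucket, key_prefix, filename)

uri = expected_s3_uri('example-bucket',  # placeholder bucket name
                      'data',
                      './amazon_reviews_us_Digital_Software_v1_00_header.csv')
print(uri)
# prints s3://example-bucket/data/amazon_reviews_us_Digital_Software_v1_00_header.csv
```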

# Write a CSV With No Header for Comprehend 

In [None]:
noheader_train_path = './amazon_reviews_us_Digital_Software_v1_00_noheader.csv'
df_train.to_csv(noheader_train_path, index=False, header=False)
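The only difference between the two files is the first line. A minimal sketch with a synthetic two-row frame shows what `header=True` and `header=False` produce; calling `to_csv` without a path returns the CSV text as a string.

```python
import pandas as pd

# Synthetic stand-in for df_train with the same two columns.
sample_train = pd.DataFrame({'star_rating': [5, 1],
                             'review_body': ['Great', 'Bad']})

with_header = sample_train.to_csv(index=False, header=True)
without_header = sample_train.to_csv(index=False, header=False)

print(with_header.splitlines()[0])     # prints star_rating,review_body
print(without_header.splitlines()[0])  # prints 5,Great
```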

# Upload Train Data to S3 for Comprehend

In [None]:
train_s3_prefix = 'data'
noheader_train_s3_uri = sess.upload_data(path=noheader_train_path, key_prefix=train_s3_prefix)
noheader_train_s3_uri

In [None]:
!aws s3 ls $noheader_train_s3_uri

# Store the Train Data Locations for Use in the Next Notebook

In [None]:
%store header_train_s3_uri

In [None]:
%store noheader_train_s3_uri

In [None]:
%store
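In a later notebook, `%store -r header_train_s3_uri` restores a stored value. Before relying on it, a defensive check can confirm that the variables actually exist. This is a minimal sketch; `missing_variables` is a hypothetical helper, not part of IPython.

```python
def missing_variables(names, namespace):
    """Hypothetical helper: list the names that are not defined in `namespace`."""
    return [name for name in names if name not in namespace]

required = ['header_train_s3_uri', 'noheader_train_s3_uri']

# globals() holds the notebook's top-level variables.
missing = missing_variables(required, globals())
if missing:
    print('Run the data preparation notebook first; missing:', missing)
```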

In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();