# Prepare Dataset for Model Training and Evaluation

# Amazon Customer Reviews Dataset

https://s3.amazonaws.com/amazon-reviews-pds/readme.html

## Schema

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title description of the product.
- `product_category`: Broad product category that can be used to group reviews (in this case digital software).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.
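
As a quick sanity check, the schema above can be turned into a list of expected column names and compared against a loaded DataFrame. This is a minimal sketch: the helper `missing_columns` and the tiny `sample` frame are hypothetical and are not part of the dataset or of pandas.

```python
import pandas as pd

# Column names copied from the schema list above.
EXPECTED_COLUMNS = [
    'marketplace', 'customer_id', 'review_id', 'product_id',
    'product_parent', 'product_title', 'product_category',
    'star_rating', 'helpful_votes', 'total_votes', 'vine',
    'verified_purchase', 'review_headline', 'review_body', 'review_date',
]


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `frame`."""
    return [name for name in expected if name not in frame.columns]


# Hypothetical two-column frame, used only to illustrate the check.
sample = pd.DataFrame({'marketplace': ['US'], 'star_rating': [5]})
missing = missing_columns(sample)
print(missing)
```

Running `missing_columns(df)` on the full dataset after it is loaded should return an empty list if the file matches the schema.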



In [None]:
!pip install -q scikit-learn==0.20.3

In [4]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

## Download

Let's start by retrieving a subset of the Amazon Customer Reviews dataset.

In [5]:
!aws s3 cp 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz' ./data/

download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz to data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz


In [6]:
import csv

df = pd.read_csv('./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz', 
                 delimiter='\t', 
                 quoting=csv.QUOTE_NONE,
                 compression='gzip')
df.shape

(102084, 15)

In [12]:
df.head(5)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,is_positive_sentiment
0,US,17747349,R2EI7QLPK4LF7U,B00U7LCE6A,106182406,CCleaner Free [Download],Digital_Software,4,0,0,N,Y,Four Stars,So far so good,2015-08-31,1
1,US,10956619,R1W5OMFK1Q3I3O,B00HRJMOM4,162269768,ResumeMaker Professional Deluxe 18,Digital_Software,3,0,0,N,Y,Three Stars,Needs a little more work.....,2015-08-31,0
2,US,13132245,RPZWSYWRP92GI,B00P31G9PQ,831433899,Amazon Drive Desktop [PC],Digital_Software,1,1,2,N,Y,One Star,Please cancel.,2015-08-31,0
3,US,35717248,R2WQWM04XHD9US,B00FGDEPDY,991059534,Norton Internet Security 1 User 3 Licenses,Digital_Software,5,0,0,N,Y,Works as Expected!,Works as Expected!,2015-08-31,1
4,US,17710652,R1WSPK2RA2PDEF,B00FZ0FK0U,574904556,SecureAnywhere Intermet Security Complete 5 De...,Digital_Software,4,1,2,N,Y,Great antivirus. Worthless customer support,I've had Webroot for a few years. It expired a...,2015-08-31,1


### Enrich the data with `is_positive_sentiment` label
* Positive (`1`): `star_rating >= 4`
* Negative (`0`): `star_rating <= 3`
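
To make the boundary explicit, here is a small sketch of the rule applied to one review per star value. The `toy` frame is made-up illustration data, not rows from the dataset.

```python
import pandas as pd

# Hypothetical ratings covering every star value, including the 3/4 boundary.
toy = pd.DataFrame({'star_rating': [1, 2, 3, 4, 5]})

# Same rule as above: 4 and 5 stars map to 1, everything else maps to 0.
toy['is_positive_sentiment'] = (toy['star_rating'] >= 4).astype(int)

labels = toy['is_positive_sentiment'].tolist()
print(labels)
```

Note that a 3-star review is treated as negative under this rule.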

In [8]:
df['is_positive_sentiment'] = (df['star_rating'] >= 4).astype(int)

df.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,is_positive_sentiment
0,US,17747349,R2EI7QLPK4LF7U,B00U7LCE6A,106182406,CCleaner Free [Download],Digital_Software,4,0,0,N,Y,Four Stars,So far so good,2015-08-31,1
1,US,10956619,R1W5OMFK1Q3I3O,B00HRJMOM4,162269768,ResumeMaker Professional Deluxe 18,Digital_Software,3,0,0,N,Y,Three Stars,Needs a little more work.....,2015-08-31,0
2,US,13132245,RPZWSYWRP92GI,B00P31G9PQ,831433899,Amazon Drive Desktop [PC],Digital_Software,1,1,2,N,Y,One Star,Please cancel.,2015-08-31,0
3,US,35717248,R2WQWM04XHD9US,B00FGDEPDY,991059534,Norton Internet Security 1 User 3 Licenses,Digital_Software,5,0,0,N,Y,Works as Expected!,Works as Expected!,2015-08-31,1
4,US,17710652,R1WSPK2RA2PDEF,B00FZ0FK0U,574904556,SecureAnywhere Intermet Security Complete 5 De...,Digital_Software,4,1,2,N,Y,Great antivirus. Worthless customer support,I've had Webroot for a few years. It expired a...,2015-08-31,1


In [9]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.10, stratify=df['is_positive_sentiment'])

In [10]:
train_path = 'amazon_reviews_us_Digital_Software_v1_00_train.csv'

df_train.to_csv(train_path, index=False, header=True)

train_s3_prefix = 'data'

train_s3_uri = sess.upload_data(path=train_path, key_prefix=train_s3_prefix)

In [11]:
test_path = 'amazon_reviews_us_Digital_Software_v1_00_test.csv'

df_test.to_csv(test_path, index=False, header=True)

test_s3_prefix = 'data'

test_s3_uri = sess.upload_data(path=test_path, key_prefix=test_s3_prefix)

# Compare Positive to Negative Sentiment

AutoPilot will automatically balance the data during feature engineering, so we do not need to balance the classes manually.

_Note:  You may need to run this next cell twice to see the chart._


In [None]:
import seaborn as sns

sns.countplot(x='is_positive_sentiment', data=df)

In [None]:
is_positive_sentiment_count = len(df.query('is_positive_sentiment == 1'))
is_negative_sentiment_count = len(df.query('is_positive_sentiment == 0'))

print('Positive count: {}'.format(is_positive_sentiment_count))
print('Negative count: {}'.format(is_negative_sentiment_count))
print('Ratio of Positive to Negative: {}'.format(is_positive_sentiment_count / is_negative_sentiment_count))
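
The same comparison can be expressed as class proportions, which stay well defined even if one class happens to be absent. The sketch below uses a made-up four-row frame (`toy`) rather than the real data.

```python
import pandas as pd

# Hypothetical labels: three positive reviews and one negative review.
toy = pd.DataFrame({'is_positive_sentiment': [1, 1, 1, 0]})

# normalize=True returns the fraction of rows per label instead of raw counts.
proportions = toy['is_positive_sentiment'].value_counts(normalize=True)

# .get() falls back to 0.0 when a label does not occur at all.
positive_share = proportions.get(1, 0.0)
negative_share = proportions.get(0, 0.0)

print('Positive share: {:.2f}'.format(positive_share))
print('Negative share: {:.2f}'.format(negative_share))
```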

# Reduce the dataset to just `is_positive_sentiment` and `review_body`
For now, we will train the model using only the `review_body` feature and the `is_positive_sentiment` target.

In [None]:
df = df[['is_positive_sentiment', 'review_body']]

# Split the data into `train` and `test` datasets

Split into `90% train` data and `10% test` data using `is_positive_sentiment` to stratify the split.

Note that AutoPilot will automatically split the train data into `train` and `validation` datasets, so we only need to set aside the `10% test` dataset on our end.

Also note that TF-IDF requires us to split before we compute the TF-IDF features. Otherwise, vocabulary and document-frequency statistics from the test and validation data will leak into the train dataset.
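
The leakage point can be illustrated with scikit-learn's `TfidfVectorizer`: fit it on the train texts only, then reuse the fitted vocabulary and IDF weights on the test texts. This is a sketch with made-up sentences. It is not how AutoPilot builds its features internally, which this notebook does not show.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical review snippets standing in for the train and test splits.
train_texts = ['works great', 'does not work', 'great value']
test_texts = ['great installer', 'never works']

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the train split only.
X_train = vectorizer.fit_transform(train_texts)

# Reuse the fitted vectorizer on the test split; do not call fit again.
X_test = vectorizer.transform(test_texts)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X_train.shape, X_test.shape)
```

Words that appear only in the test texts (here `installer` and `never`) are simply ignored, because the vectorizer never saw them during fitting.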

In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.10, stratify=df['is_positive_sentiment'])

### Show the split details

In [None]:
print('df_train.shape: {}'.format(df_train.shape))
print('df_test.shape: {}'.format(df_test.shape))
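
One way to confirm that stratification worked is to compare the share of positive labels in each split. The sketch below does this on a made-up 100-row frame. The `random_state=0` argument is an addition for reproducibility and is not used in the cells above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 80 positive and 20 negative reviews.
toy = pd.DataFrame({
    'review_body': ['review {}'.format(i) for i in range(100)],
    'is_positive_sentiment': [1] * 80 + [0] * 20,
})

# Same call shape as above, plus a fixed seed (an assumption for this sketch).
toy_train, toy_test = train_test_split(
    toy,
    test_size=0.10,
    stratify=toy['is_positive_sentiment'],
    random_state=0,
)

# The mean of a 0/1 column is the fraction of positive rows.
train_positive_rate = toy_train['is_positive_sentiment'].mean()
test_positive_rate = toy_test['is_positive_sentiment'].mean()

print('train positive rate: {:.2f}'.format(train_positive_rate))
print('test positive rate: {:.2f}'.format(test_positive_rate))
```

The same two `.mean()` calls can be run on `df_train` and `df_test` to check the real split.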

In [None]:
df_train.head(5)

In [None]:
df_test.head(5)

# Save the `Train` Dataset Locally and Upload to S3 for AutoPilot
_Note:  AutoPilot requires a header, so we use `header=True`._

In [None]:
train_path = 'data/train.csv'
df_train.to_csv(train_path, index=False, header=True)

train_s3_prefix = 'data'

train_s3_uri = sess.upload_data(path=train_path, key_prefix=train_s3_prefix)
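
Because the header matters, it can be worth reading the first line of the written file back before uploading. The following is a minimal sketch that writes a made-up two-row frame to a temporary directory. The file name and contents are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame with the same two columns as the train data.
toy = pd.DataFrame({
    'is_positive_sentiment': [1, 0],
    'review_body': ['So far so good', 'Please cancel.'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'train.csv')

    # header=True writes the column names as the first line of the file.
    toy.to_csv(toy_path, index=False, header=True)

    # Read back only the first line to confirm the header is present.
    with open(toy_path) as csv_file:
        header_line = csv_file.readline().strip()

print(header_line)
```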

In [None]:
print(train_s3_uri)

!aws s3 ls $train_s3_uri

# Store the location of our train data in our notebook server to be used next

In [None]:
%store train_s3_uri

# Save the `Test` Dataset Locally to Use Later to Evaluate the AutoPilot Model

In [None]:
test_path = 'data/test.csv'

df_test.to_csv(test_path, index=False, header=True)

# Summary

## We have uploaded our `train` dataset to S3 to be used next

In [None]:
print(train_s3_uri)

!aws s3 ls $train_s3_uri

## We have saved the S3 location of our `train` dataset in Jupyter to be used later

In [None]:
%store -r train_s3_uri

print(train_s3_uri)

## We have our local `train` dataset to be used later

In [None]:
!ls -al ./data/train.csv

## We have our local `test` dataset to be used later

In [None]:
!ls -al ./data/test.csv