# Ingest Dataset

We have chosen the [Amazon Customer Reviews Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) as our main dataset.

The dataset is shared in a public Amazon S3 bucket and is available in two file formats:

* Tab-separated values (TSV), a text format - `s3://amazon-reviews-pds/tsv/`
* Parquet, an optimized columnar binary format - `s3://amazon-reviews-pds/parquet/`

The Parquet dataset is partitioned (divided into subfolders) by the column `product_category` to further improve query performance. Because of this partitioning, a `WHERE` clause on `product_category` in your SQL queries reads only the data for that category instead of scanning the whole dataset.
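To make the effect of partitioning concrete, here is a minimal sketch of how a category filter narrows the set of subfolders a query has to read. It assumes Hive-style `column=value` subfolder names under the Parquet path; the helper names `partition_prefix` and `prune` are hypothetical and exist only for illustration.

```python
# Assumption: partitions are laid out as Hive-style "column=value" subfolders.
PARQUET_ROOT = "s3://amazon-reviews-pds/parquet"


def partition_prefix(category, root=PARQUET_ROOT):
    """Return the S3 prefix holding the rows for one product category."""
    return "{}/product_category={}/".format(root, category)


def prune(prefixes, category):
    """Keep only the prefixes a filter on product_category would need to read."""
    wanted = "product_category={}/".format(category)
    return [p for p in prefixes if wanted in p]


# Three example categories; a real query engine discovers these from S3.
all_prefixes = [
    partition_prefix(c)
    for c in ("Digital_Software", "Digital_Video_Games", "Books")
]

# Equivalent in spirit to: WHERE product_category = 'Digital_Software'
selected = prune(all_prefixes, "Digital_Software")
print(selected)
```

Only one of the three prefixes survives the filter, which is why a query restricted to a single category touches far less data.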

We can use the AWS Command Line Interface (CLI) to list the contents of the S3 bucket with the following command:


In [None]:
!aws s3 ls s3://amazon-reviews-pds/tsv/

In [None]:
import boto3
import sagemaker
import pandas as pd

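# SageMaker session, default bucket, and execution role for this account
sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()

# Region and account ID of the current AWS credentials
region = boto3.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

# Low-level SageMaker client
sm = boto3.Session().client(service_name='sagemaker', region_name=region)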

In [None]:
print('Our S3 bucket: {}'.format(bucket))

# Set S3 Source Location (Public S3 Bucket)

In [None]:
s3_public_path_tsv = 's3://amazon-reviews-pds/tsv'

In [None]:
%store s3_public_path_tsv

# Set S3 Destination Location (Our Private S3 Bucket)

In [None]:
s3_private_path_tsv = 's3://{}/pytorch/amazon-reviews-pds/tsv'.format(bucket)
print(s3_private_path_tsv)
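Many boto3 calls take a bucket name and a key prefix as separate arguments rather than a single `s3://` URI. The sketch below shows one way to split a URI like the one printed above; `split_s3_uri` is a hypothetical helper and `my-example-bucket` is a placeholder bucket name, not the real default bucket.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: {}".format(uri))
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI with the same shape as s3_private_path_tsv.
example_bucket, example_prefix = split_s3_uri(
    "s3://my-example-bucket/pytorch/amazon-reviews-pds/tsv"
)
print(example_bucket)
print(example_prefix)
```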

# Copy Data From the Public S3 Bucket to our Private S3 Bucket in this Account
The full dataset is large, so we copy only two files (Digital Software and Digital Video Games) into our bucket to keep later steps fast.

In [None]:
!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Software_v1_00.tsv.gz"
!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz"
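The same copy could also be done from Python with boto3 instead of the CLI. The following is a minimal sketch, not the method this notebook uses: `plan_copies` and `run_copies` are hypothetical helpers, `my-example-bucket` is a placeholder for our private bucket, and `run_copies` is defined but not called here because it needs AWS credentials and read access to the source bucket.

```python
FILES = [
    "amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
    "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz",
]


def plan_copies(file_names, src_bucket, src_prefix, dst_bucket, dst_prefix):
    """Build (copy_source, destination_bucket, destination_key) triples."""
    return [
        (
            {"Bucket": src_bucket, "Key": "{}/{}".format(src_prefix, name)},
            dst_bucket,
            "{}/{}".format(dst_prefix, name),
        )
        for name in file_names
    ]


def run_copies(plan):
    """Execute the planned copies with the boto3 S3 client."""
    import boto3  # imported lazily so planning works without AWS access

    s3 = boto3.client("s3")
    for copy_source, dst_bucket, dst_key in plan:
        s3.copy(copy_source, dst_bucket, dst_key)


# Placeholder destination; substitute the real bucket before running copies.
plan = plan_copies(
    FILES,
    src_bucket="amazon-reviews-pds",
    src_prefix="tsv",
    dst_bucket="my-example-bucket",
    dst_prefix="pytorch/amazon-reviews-pds/tsv",
)
for copy_source, dst_bucket, dst_key in plan:
    print(copy_source["Key"], "->", "s3://{}/{}".format(dst_bucket, dst_key))
```

Separating the plan from the execution makes it easy to inspect exactly which objects would be copied before any data moves.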

# List Files in our Private S3 Bucket in this Account

In [None]:
print(s3_private_path_tsv)

In [None]:
!aws s3 ls $s3_private_path_tsv/

In [None]:
from IPython.display import display, HTML

# Link to the same prefix we copied the files into (see s3_private_path_tsv)
display(HTML('<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/pytorch/amazon-reviews-pds/?region={}&tab=overview">S3 Bucket</a></b>'.format(bucket, region)))


# Pass Variables to the Next Notebook(s)

In [None]:
raw_input_data_s3_uri = s3_private_path_tsv

In [None]:
%store raw_input_data_s3_uri

In [None]:
%store

# Release Resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [None]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}