# Write TSV Data To S3

<img src="img/write_tsv_to_s3.png" width="45%" align="left">

#### We have chosen the [Amazon Customer Reviews Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) as our main dataset.

The dataset is shared in a public Amazon S3 bucket and is available in two file formats:

* Tab separated value (TSV), a text format - `s3://amazon-reviews-pds/tsv/`
* Parquet, an optimized columnar binary format - `s3://amazon-reviews-pds/parquet/`

The Parquet dataset is partitioned (divided into subfolders) by the `product_category` column to improve query performance. A `WHERE` clause on `product_category` in a SQL query then reads only the data for that category.
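
To make the effect of partitioning concrete, here is a minimal sketch of how a query engine can skip whole subfolders when a filter targets the partition column. It assumes a Hive-style `column=value` folder layout under the Parquet prefix. The helpers `partition_prefix` and `prune_partitions` are hypothetical names introduced for illustration, and the two category values are taken from the file names copied later in this notebook.

```python
# Hypothetical helpers illustrating partition pruning; they only build and
# filter strings and never contact S3.

def partition_prefix(base, column, value):
    """Return the prefix holding the rows where `column == value`.

    Assumes a Hive-style layout: <base>/<column>=<value>/
    """
    return f"{base.rstrip('/')}/{column}={value}/"


def prune_partitions(prefixes, column, wanted):
    """Keep only the prefixes that a `WHERE column = wanted` filter must read."""
    token = f"/{column}={wanted}/"
    return [p for p in prefixes if token in p]


base = "s3://amazon-reviews-pds/parquet"
categories = ["Digital_Software", "Digital_Video_Games"]

# One prefix per partition value
all_prefixes = [partition_prefix(base, "product_category", c) for c in categories]

# Equivalent in spirit to: WHERE product_category = 'Digital_Software'
selected = prune_partitions(all_prefixes, "product_category", "Digital_Software")
print(selected)
```

Only the prefixes in `selected` would need to be scanned, which is why filtering on the partition column can be much cheaper than filtering on any other column.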

We can list the contents of the bucket with the AWS Command Line Interface (CLI):


In [None]:
!aws s3 ls s3://amazon-reviews-pds/tsv/

In [None]:
!aws s3 ls s3://amazon-reviews-pds/parquet/

# To Simulate an Application Writing Into Our Data Lake, We Copy the Public TSV Dataset to a Private S3 Bucket in Our Account

<img src="img/copy_data_to_s3.png" width="60%" align="left">

In [None]:
import boto3
import sagemaker

# Get the AWS region of the current boto3 session
session = boto3.session.Session()
region_name = session.region_name

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

# Set S3 Source Location

In [None]:
s3_source_path_tsv = 's3://amazon-reviews-pds/tsv'

# Set S3 Destination Location

In [None]:
s3_destination_path_tsv = 's3://{}/amazon-reviews-pds/tsv'.format(bucket)
print(s3_destination_path_tsv)

#### The full dataset is large, so we copy only two files into our bucket.

In [None]:
!aws s3 cp --recursive $s3_source_path_tsv/ $s3_destination_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Software_v1_00.tsv.gz"
!aws s3 cp --recursive $s3_source_path_tsv/ $s3_destination_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz"
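
The same copy step could also be done from Python rather than through the CLI. The sketch below is one possible approach, not the notebook's own method: `build_copy_plan` and `copy_files` are hypothetical helper names, and the sketch assumes that your credentials allow reading the public bucket and writing to the destination bucket. It uses the boto3 S3 client's managed `copy` call.

```python
SOURCE_BUCKET = "amazon-reviews-pds"
SOURCE_PREFIX = "tsv"
FILES = [
    "amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
    "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz",
]


def build_copy_plan(dest_prefix="amazon-reviews-pds/tsv"):
    """Return (copy_source, destination_key) pairs for the files to copy."""
    plan = []
    for name in FILES:
        copy_source = {"Bucket": SOURCE_BUCKET, "Key": f"{SOURCE_PREFIX}/{name}"}
        plan.append((copy_source, f"{dest_prefix}/{name}"))
    return plan


def copy_files(dest_bucket):
    """Copy each planned object into `dest_bucket` (requires AWS credentials)."""
    # Imported here so the plan can be inspected without boto3 or AWS access.
    import boto3

    s3 = boto3.client("s3")
    for copy_source, dest_key in build_copy_plan():
        # Managed server-side copy; the data does not pass through this machine.
        s3.copy(copy_source, dest_bucket, dest_key)


# Inspect what would be copied, without touching S3
for copy_source, dest_key in build_copy_plan():
    print(copy_source["Key"], "->", dest_key)
```

Calling `copy_files(bucket)` with the `bucket` variable defined earlier would perform the actual copy.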

#### List the copied files

In [None]:
!aws s3 ls $s3_destination_path_tsv/

In [None]:
%%javascript
// Save the notebook, then shut down its kernel session to release resources
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();