# Write TSV Data To S3

<img src="img/write_tsv_to_s3.png" width="45%" align="left">

#### We have chosen the [Amazon Customer Reviews Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) as our main dataset.

The dataset is shared in a public Amazon S3 bucket and is available in two file formats:

* Tab-separated values (TSV), a text format - `s3://amazon-reviews-pds/tsv/`
* Parquet, an optimized columnar binary format - `s3://amazon-reviews-pds/parquet/`

The Parquet dataset is partitioned (divided into subfolders) by the column `product_category` to further improve query performance. Because of this, a `WHERE` clause on `product_category` in your SQL queries lets the query engine read only the subfolder for that category instead of scanning the whole dataset.
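
To make the partition layout concrete, here is a minimal sketch of how a category value maps to a subfolder. The helper name `partition_prefix` is hypothetical and used only for illustration; the `product_category=<value>/` folder naming is taken from the bucket listing in this notebook.

```python
# Each value of the partition column becomes a "column=value/" subfolder
# under the dataset root (naming taken from the bucket listing).
PARQUET_ROOT = "s3://amazon-reviews-pds/parquet/"


def partition_prefix(category: str, root: str = PARQUET_ROOT) -> str:
    """Return the S3 prefix holding the Parquet files for one category.

    Hypothetical helper for illustration; not part of any AWS library.
    """
    return f"{root.rstrip('/')}/product_category={category}/"


print(partition_prefix("Books"))
```

A query filtered on `product_category = 'Books'` only needs the files under that one prefix.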

We can use the AWS Command Line Interface (CLI) to list the contents of the S3 bucket:


In [1]:
!aws s3 ls s3://amazon-reviews-pds/tsv/

2017-11-24 13:22:50          0 
2017-11-24 13:48:03  241896005 amazon_reviews_multilingual_DE_v1_00.tsv.gz
2017-11-24 13:48:17   70583516 amazon_reviews_multilingual_FR_v1_00.tsv.gz
2017-11-24 13:48:34   94688992 amazon_reviews_multilingual_JP_v1_00.tsv.gz
2017-11-24 13:49:14  349370868 amazon_reviews_multilingual_UK_v1_00.tsv.gz
2017-11-24 13:48:47 1466965039 amazon_reviews_multilingual_US_v1_00.tsv.gz
2017-11-24 13:49:53  648641286 amazon_reviews_us_Apparel_v1_00.tsv.gz
2017-11-24 13:56:36  582145299 amazon_reviews_us_Automotive_v1_00.tsv.gz
2017-11-24 14:04:02  357392893 amazon_reviews_us_Baby_v1_00.tsv.gz
2017-11-24 14:08:11  914070021 amazon_reviews_us_Beauty_v1_00.tsv.gz
2017-11-24 14:17:41 2740337188 amazon_reviews_us_Books_v1_00.tsv.gz
2017-11-24 14:45:50 2692708591 amazon_reviews_us_Books_v1_01.tsv.gz
2017-11-24 15:10:21 1329539135 amazon_reviews_us_Books_v1_02.tsv.gz
2017-11-24 15:22:13  442653086 amazon_reviews_us_Camera_v1_00.tsv.gz
2017-11-24 15:27:13 2689739

In [2]:
!aws s3 ls s3://amazon-reviews-pds/parquet/

                           PRE product_category=Apparel/
                           PRE product_category=Automotive/
                           PRE product_category=Baby/
                           PRE product_category=Beauty/
                           PRE product_category=Books/
                           PRE product_category=Camera/
                           PRE product_category=Digital_Ebook_Purchase/
                           PRE product_category=Digital_Music_Purchase/
                           PRE product_category=Digital_Software/
                           PRE product_category=Digital_Video_Download/
                           PRE product_category=Digital_Video_Games/
                           PRE product_category=Electronics/
                           PRE product_category=Furniture/
                           PRE product_category=Gift_Card/
                           PRE product_category=Grocery/
                           PRE product_category=Health_&_Per
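
The same kind of listing can also be produced from Python. The following is a rough sketch, assuming a boto3 S3 client is passed in; the function name `list_prefix` is hypothetical. It uses the `list_objects_v2` paginator with `Delimiter="/"`, so subfolders are returned as common prefixes, similar to the `PRE` lines above.

```python
def list_prefix(s3_client, bucket: str, prefix: str):
    """Return (subfolders, files) directly under a prefix, like `aws s3 ls`.

    Hypothetical helper: `s3_client` is assumed to be a boto3 S3 client,
    for example boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    subfolders, files = [], []
    # Delimiter="/" groups deeper keys into CommonPrefixes instead of
    # listing every object below the prefix.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        subfolders += [p["Prefix"] for p in page.get("CommonPrefixes", [])]
        files += [(o["Key"], o["Size"]) for o in page.get("Contents", [])]
    return subfolders, files
```

In this notebook it might be called as `list_prefix(boto3.client("s3"), "amazon-reviews-pds", "parquet/")`, assuming the notebook's credentials are allowed to read the bucket.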

# To Simulate an Application Writing Into Our Data Lake, We Copy the Public TSV Dataset to a Private S3 Bucket in our Account

<img src="img/copy_data_to_s3.png" width="60%" align="left">

In [3]:
import boto3
import sagemaker

# Get region 
session = boto3.session.Session()
region_name = session.region_name

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

# Set S3 Source Location

In [4]:
s3_source_path_tsv = 's3://amazon-reviews-pds/tsv'

# Set S3 Destination Location

In [5]:
s3_destination_path_tsv = 's3://{}/amazon-reviews-pds/tsv'.format(bucket)
print(s3_destination_path_tsv)

s3://sagemaker-us-west-2-393371431575/amazon-reviews-pds/tsv


#### Because the full dataset is large, we copy only two files into our bucket.

In [6]:
!aws s3 cp --recursive $s3_source_path_tsv/ $s3_destination_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Software_v1_00.tsv.gz" --include "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz"

copy: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz to s3://sagemaker-us-west-2-393371431575/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz to s3://sagemaker-us-west-2-393371431575/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
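
For readers who prefer to stay in Python, the copy could be sketched with boto3 instead of the CLI. This is an illustrative alternative, not what the notebook runs: the helper name `copy_files` and its parameters are hypothetical, and it assumes a boto3 S3 client whose credentials can read the source bucket and write to the destination bucket.

```python
def copy_files(s3_client, src_bucket, src_prefix, dst_bucket, dst_prefix, filenames):
    """Copy the named objects between buckets and return the destination URIs.

    Hypothetical helper: `s3_client` is assumed to be a boto3 S3 client.
    """
    copied = []
    for name in filenames:
        src_key = f"{src_prefix.strip('/')}/{name}"
        dst_key = f"{dst_prefix.strip('/')}/{name}"
        # The client's managed copy() takes the source as a dict and the
        # destination as a bucket name plus key.
        s3_client.copy(
            CopySource={"Bucket": src_bucket, "Key": src_key},
            Bucket=dst_bucket,
            Key=dst_key,
        )
        copied.append(f"s3://{dst_bucket}/{dst_key}")
    return copied
```

With the variables defined earlier, a call might look like `copy_files(boto3.client("s3"), "amazon-reviews-pds", "tsv", bucket, "amazon-reviews-pds/tsv", ["amazon_reviews_us_Digital_Software_v1_00.tsv.gz"])`.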


#### List files

In [7]:
!aws s3 ls $s3_destination_path_tsv/

2020-07-25 17:13:26   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-07-25 17:13:29   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
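
As a quick sanity check, the listing can be turned into numbers. The sketch below is one possible way to parse the text printed by `aws s3 ls` and total the file sizes; `parse_s3_ls` is a hypothetical helper, and the sample text is the output shown above.

```python
def parse_s3_ls(listing: str):
    """Parse `aws s3 ls` file lines into (name, size_in_bytes) tuples."""
    entries = []
    for line in listing.splitlines():
        parts = line.split(maxsplit=3)
        # File lines have four fields: date, time, size, name.
        # Blank lines and "PRE ..." folder lines are skipped.
        if len(parts) == 4 and parts[2].isdigit():
            entries.append((parts[3], int(parts[2])))
    return entries


listing = """\
2020-07-25 17:13:26   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-07-25 17:13:29   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
"""
files = parse_s3_ls(listing)
total_bytes = sum(size for _, size in files)
print(len(files), total_bytes)  # prints: 2 46440207
```

Two files totalling roughly 46 MB compressed confirms that only the selected subset was copied.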


In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();