# Ingesting Data Into The Cloud

## TODOs

## Step 0
* COPY TSV data from the public S3 bucket to our default bucket/prefix (`!aws s3 cp...`); check whether the TSV data includes YEAR
* CONVERT TSV to Parquet (with Athena)

## Step 2

BI
* LOAD TSV data from S3 to Redshift
* QUERY Redshift (include YEAR to check if TSV has year)
* UNLOAD Parquet data from Redshift to S3
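The BI steps above could be expressed as Redshift `COPY` and `UNLOAD` statements along these lines. This is only a sketch: the helper names, the table name `amazon_reviews_tsv`, the bucket, and the IAM role ARN are all hypothetical, and the statements are built as strings rather than executed against a cluster.

```python
def build_copy_sql(table, s3_path, iam_role):
    """Build a Redshift COPY statement for gzip-compressed, tab-delimited files."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "DELIMITER '\\t'\n"   # literal backslash-t, interpreted by Redshift as a tab
        "IGNOREHEADER 1\n"    # assumes each file starts with a header row
        "GZIP;"
    )


def build_unload_sql(query, s3_path, iam_role):
    """Build a Redshift UNLOAD statement that writes query results as Parquet."""
    # UNLOAD takes the query as a quoted string, so inner single quotes are doubled.
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS PARQUET;"
    )


# Hypothetical values, for illustration only
role = "arn:aws:iam::123456789012:role/HypotheticalRedshiftRole"
copy_sql = build_copy_sql(
    "amazon_reviews_tsv", "s3://my-bucket/amazon-reviews-pds/tsv/", role
)
unload_sql = build_unload_sql(
    "SELECT * FROM amazon_reviews_tsv WHERE marketplace = 'US'",
    "s3://my-bucket/amazon-reviews-pds/parquet/",
    role,
)
print(copy_sql)
print(unload_sql)
```

The generated SQL would still need to be run through a Redshift connection, and the target table must exist before `COPY` is issued.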

Data Analytics
* CTAS (Athena -- creating pointer to S3 location)
* Query Athena
* Check 01_create_database.sql and 02_create_table.sql from the Athena repo
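As a rough illustration of the CTAS item above, the following sketch builds an Athena `CREATE TABLE ... AS SELECT` statement that writes Parquet to an S3 location. The source table name `amazon_reviews_tsv`, the bucket, and the column list are assumptions chosen for illustration; `dsoaws` and `amazon_reviews_parquet` match the names used in this notebook.

```python
def build_ctas_sql(database, source_table, target_table, s3_location,
                   columns, partition_col):
    """Build an Athena CTAS statement that writes partitioned Parquet."""
    # Athena requires partition columns to come last in the SELECT list.
    select_cols = [c for c in columns if c != partition_col] + [partition_col]
    return (
        f"CREATE TABLE {database}.{target_table}\n"
        f"WITH (format = 'PARQUET', "
        f"external_location = '{s3_location}', "
        f"partitioned_by = ARRAY['{partition_col}']) AS\n"
        f"SELECT {', '.join(select_cols)}\n"
        f"FROM {database}.{source_table}"
    )


# Illustrative column subset and bucket; not taken from the notebook
ctas_sql = build_ctas_sql(
    database="dsoaws",
    source_table="amazon_reviews_tsv",
    target_table="amazon_reviews_parquet",
    s3_location="s3://my-bucket/amazon-reviews-pds/parquet/",
    columns=["product_category", "star_rating", "review_body"],
    partition_col="product_category",
)
print(ctas_sql)
```

Athena expects the `external_location` of a CTAS query to be empty, so a fresh prefix is the safer choice when re-running the conversion.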


## Step 2/3 -- BATCH Automation
Ingest TSV through Glue and/or Step Functions & DataWrangler (i.e., an S3 trigger + Lambda, or perhaps S3 has a native integration to invoke Step Functions or Glue, or a Glue cron job).
* Try to trigger on simulated arrival of new data in S3, i.e., simulate adding a new partition for each year.
* Convert TSV to Parquet as it arrives
* LOAD TSV data from S3 to Redshift as it arrives
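One possible shape for the "S3 trigger + Lambda" option is sketched below: a Lambda handler that reads an S3 event notification and collects the newly arrived `.tsv.gz` objects. The function names and the `year=2015` key layout in the sample event are hypothetical; what happens next (starting a Glue job, a Step Functions execution, or a Redshift load) is left as a comment because that choice is still open in the list above.

```python
from urllib.parse import unquote_plus


def extract_new_tsv_objects(event):
    """Return s3:// URIs for .tsv.gz objects referenced in an S3 event."""
    uris = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith(".tsv.gz"):
            uris.append(f"s3://{bucket}/{key}")
    return uris


def lambda_handler(event, context):
    """Lambda entry point: report which new TSV objects need processing."""
    uris = extract_new_tsv_objects(event)
    # A follow-up step (not shown) could start a Glue job or a
    # Step Functions execution for each URI.
    return {"new_objects": uris}


# Simulated event for a new yearly partition (hypothetical bucket and key)
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"},
                "object": {"key": "amazon-reviews-pds/tsv/year%3D2015/part-0.tsv.gz"}}},
        {"s3": {"bucket": {"name": "my-bucket"},
                "object": {"key": "amazon-reviews-pds/tsv/index.txt"}}},
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```

Keeping the event parsing in its own function makes it easy to simulate new data arrival locally before wiring up the real trigger.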


## Step 1 -- STREAMING Automation
* Try to trigger on simulated arrival of new data in Kinesis, i.e., simulate adding a new partition for each year (or a smaller time frame)
* Convert TSV to Parquet as it arrives
* LOAD TSV data from S3 to Redshift as it arrives
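For the streaming path, a consumer first has to decode what Kinesis delivers. The sketch below assumes a Lambda function subscribed to a Kinesis data stream whose records carry tab-separated lines; the function name and the sample fields are hypothetical. Kinesis event payloads are base64-encoded, so each record is decoded before being split into columns.

```python
import base64


def decode_kinesis_records(event):
    """Decode a Kinesis Lambda event into a list of TSV rows (lists of fields)."""
    rows = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        for line in payload.splitlines():
            if line:  # skip blank lines
                rows.append(line.split("\t"))
    return rows


# Simulated event with one record holding two TSV lines (made-up values)
raw = "US\t5\tGreat product\nUS\t1\tBroke quickly\n"
sample_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(raw.encode("utf-8")).decode("ascii")}}
    ]
}
rows = decode_kinesis_records(sample_event)
print(rows)
```

The decoded rows could then be buffered and written out as Parquet or staged in S3 for a Redshift load, depending on which of the items above is pursued.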


## Imports

In [15]:
# Imports & Settings
import boto3
# import botocore
import sagemaker

# import numpy as np
# import pandas as pd

# Get region 
session = boto3.session.Session()
region_name = session.region_name

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

# Set S3 prefixes
tsv_prefix = 'amazon-reviews-pds/tsv'
parquet_prefix = 'amazon-reviews-pds/parquet'

# Set Redshift params 
database_name = 'dsoaws'
table_name = 'amazon_reviews_parquet'

## Step 0 -- Copy Dataset To S3 Bucket & Convert To Parquet

In [17]:
# Set S3 source paths
s3_source_path_tsv = 's3://amazon-reviews-pds/tsv'
s3_source_path_parquet = 's3://amazon-reviews-pds/parquet'

# Set S3 destination paths
s3_destination_path_tsv = 's3://{}/{}'.format(bucket, tsv_prefix)
s3_destination_path_parquet = 's3://{}/{}'.format(bucket, parquet_prefix)

In [10]:
# Copy the dataset from the public source bucket into our bucket
!aws s3 cp --recursive $s3_source_path_tsv/ $s3_destination_path_tsv/

copy: s3://amazon-reviews-pds/tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz to s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz to s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz to s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz to s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Baby_v1_00.tsv.gz to s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_us_Baby_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Automotive_v1_00.tsv.gz to s3://sa

In [11]:
!aws s3 ls $s3_destination_path_tsv/

2020-02-20 22:41:38  241896005 amazon_reviews_multilingual_DE_v1_00.tsv.gz
2020-02-20 22:41:38   70583516 amazon_reviews_multilingual_FR_v1_00.tsv.gz
2020-02-20 22:41:38   94688992 amazon_reviews_multilingual_JP_v1_00.tsv.gz
2020-02-20 22:41:38  349370868 amazon_reviews_multilingual_UK_v1_00.tsv.gz
2020-02-20 22:41:38 1466965039 amazon_reviews_multilingual_US_v1_00.tsv.gz
2020-02-20 22:41:40  648641286 amazon_reviews_us_Apparel_v1_00.tsv.gz
2020-02-20 22:41:40  582145299 amazon_reviews_us_Automotive_v1_00.tsv.gz
2020-02-20 22:41:43  357392893 amazon_reviews_us_Baby_v1_00.tsv.gz
2020-02-20 22:41:45  914070021 amazon_reviews_us_Beauty_v1_00.tsv.gz
2020-02-20 22:41:50 2740337188 amazon_reviews_us_Books_v1_00.tsv.gz
2020-02-20 22:41:53 2692708591 amazon_reviews_us_Books_v1_01.tsv.gz
2020-02-20 22:41:53 1329539135 amazon_reviews_us_Books_v1_02.tsv.gz
2020-02-20 22:42:06  442653086 amazon_reviews_us_Camera_v1_00.tsv.gz
2020-02-20 22:42:11 2689739299 amazon_reviews_us_Digital_Ebook_Purchase_v

In [20]:
# Remove the index and sample files
!aws s3 rm $s3_destination_path_tsv/index.txt
!aws s3 rm $s3_destination_path_tsv/sample_fr.tsv
!aws s3 rm $s3_destination_path_tsv/sample_us.tsv

# Remove the multilingual files
!aws s3 rm $s3_destination_path_tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
!aws s3 rm $s3_destination_path_tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz
!aws s3 rm $s3_destination_path_tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
!aws s3 rm $s3_destination_path_tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
!aws s3 rm $s3_destination_path_tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz

delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/index.txt
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/sample_fr.tsv
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/sample_us.tsv
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz


### CONVERT TSV to Parquet (with Athena)

In [12]:
# Create Athena database first
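The database creation step might look roughly like the following. This is a sketch under assumptions: the helper names and the staging location `s3://my-bucket/athena/staging` are hypothetical, and the code only prepares the request rather than calling AWS. The database name `dsoaws` matches `database_name` defined earlier in this notebook.

```python
def build_create_database_sql(database_name):
    """Build the Athena DDL statement that creates the database if missing."""
    return f"CREATE DATABASE IF NOT EXISTS {database_name}"


def build_query_request(sql, staging_dir):
    """Build keyword arguments for the Athena start_query_execution call."""
    return {
        "QueryString": sql,
        # Athena writes query results and metadata to this S3 location.
        "ResultConfiguration": {"OutputLocation": staging_dir},
    }


# Hypothetical staging location, for illustration only
request = build_query_request(
    build_create_database_sql("dsoaws"),
    "s3://my-bucket/athena/staging",
)
print(request)
```

With credentials in place, the request could be submitted with `boto3.client('athena').start_query_execution(**request)`.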
