# Ingesting Data Into The Cloud

## TODOs

## Step 0
* COPY TSV data from S3 to our default bucket /prefix (!aws s3 cp...) (check if YEAR data was in TSV)
* CONVERT TSV to Parquet (with Athena)

## Step 2

BI
* LOAD TSV data from S3 to Redshift
* QUERY Redshift (include YEAR to check if TSV has year)
* UNLOAD Parquet data from Redshift to S3
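
The LOAD and UNLOAD items above might translate into Redshift SQL along these lines. This is a minimal sketch, not a tested pipeline: the helper functions are hypothetical, and the table name, S3 paths and IAM role ARN are placeholders. The helpers only build SQL text; running it requires a Redshift connection.

```python
# Hypothetical helpers that only build SQL text for Redshift.

def build_copy_statement(table, s3_path, iam_role):
    """COPY gzip-compressed, tab-delimited files with a header row into a table."""
    return (
        "COPY {table}\n"
        "FROM '{s3_path}'\n"
        "IAM_ROLE '{iam_role}'\n"
        "DELIMITER '\\t'\n"
        "GZIP\n"
        "IGNOREHEADER 1;"
    ).format(table=table, s3_path=s3_path, iam_role=iam_role)


def build_unload_statement(query, s3_path, iam_role):
    """UNLOAD the result of a query to S3 as Parquet files."""
    # UNLOAD takes the query as a quoted string, so inner single quotes are doubled.
    escaped_query = query.replace("'", "''")
    return (
        "UNLOAD ('{query}')\n"
        "TO '{s3_path}'\n"
        "IAM_ROLE '{iam_role}'\n"
        "FORMAT AS PARQUET;"
    ).format(query=escaped_query, s3_path=s3_path, iam_role=iam_role)


# Placeholder values, for illustration only
example_role = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"
copy_sql = build_copy_statement(
    "amazon_reviews_tsv", "s3://example-bucket/amazon-reviews-pds/tsv/", example_role
)
unload_sql = build_unload_statement(
    "SELECT * FROM amazon_reviews_tsv WHERE marketplace = 'US'",
    "s3://example-bucket/amazon-reviews-pds/parquet/",
    example_role,
)
print(copy_sql)
print(unload_sql)
```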

Data Analytics
* CTAS (Athena -- creating pointer to S3 location)
* Query Athena
* Check 01_create_database.sql && 02_create_table.sql from Athena Repo


## Step 2/3 -- BATCH Automation
Ingest TSV through Glue and/or Step Functions & DataWrangler (i.e., S3 trigger + Lambda; or perhaps S3 has a native integration to invoke Step Functions or Glue; or a Glue cron job).
* Try triggering by simulating new data arriving in S3, i.e., we simulate adding a new partition for each year.
* Convert TSV to Parquet as it arrives
* LOAD TSV data from S3 to Redshift as it arrives
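
One way to explore the "S3 trigger + Lambda" option above is a handler that reads the S3 event notification and picks out the newly arrived TSV files. This is only a sketch: `extract_s3_objects` is a hypothetical helper, the sample bucket and key are made up, and the step that would actually start a Glue job or a Step Functions execution is left as a placeholder.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs for every object named in an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket_name = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (e.g. '=' becomes '%3D').
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket_name, key))
    return objects


def lambda_handler(event, context):
    """Collect newly arrived gzip-compressed TSV files from the event."""
    new_tsv_files = [
        (bucket_name, key)
        for bucket_name, key in extract_s3_objects(event)
        if key.endswith(".tsv.gz")
    ]
    # Placeholder: start the TSV-to-Parquet conversion or the Redshift load here.
    return {"new_tsv_files": new_tsv_files}


# Simulated event for a new yearly partition (made-up bucket and key)
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "amazon-reviews-pds/tsv/year%3D2015/reviews.tsv.gz"},
            }
        }
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```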


## Step 1 -- STREAMING Automation
* Try triggering by simulating new data arriving in Kinesis, i.e., we simulate adding a new partition for each year (or a smaller time frame)
* Convert TSV to Parquet as it arrives
* LOAD TSV data from S3 to Redshift as it arrives
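
For the streaming variant, a Lambda function subscribed to a Kinesis stream receives base64-encoded record payloads. The sketch below assumes each payload holds one or more tab-separated review lines in the column order used by the TSV table later in this notebook; that payload layout and the helper name `decode_kinesis_records` are assumptions for illustration.

```python
import base64
import csv
import io

# Column order of the TSV dataset, as used in the Athena table definition
COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id", "product_parent",
    "product_title", "product_category", "star_rating", "helpful_votes",
    "total_votes", "vine", "verified_purchase", "review_headline",
    "review_body", "review_date",
]


def decode_kinesis_records(event):
    """Decode Kinesis payloads and parse each TSV line into a dict keyed by column."""
    rows = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        reader = csv.reader(io.StringIO(payload), delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            # Skip malformed lines that do not match the expected column count.
            if len(fields) == len(COLUMNS):
                rows.append(dict(zip(COLUMNS, fields)))
    return rows


# Simulated event with one made-up review line
sample_line = "\t".join([
    "US", "12345", "R1EXAMPLE", "B00EXAMPLE", "111", "Example Title", "Books",
    "5", "0", "0", "N", "Y", "Great", "Loved it", "2015-08-31",
])
sample_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(sample_line.encode("utf-8")).decode("ascii")}}
    ]
}
rows = decode_kinesis_records(sample_event)
print(rows[0]["product_category"], rows[0]["star_rating"])
```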


## Imports

In [1]:
# Imports & Settings
import boto3
# import botocore
import sagemaker

# import numpy as np
# import pandas as pd

# Get region 
session = boto3.session.Session()
region_name = session.region_name

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

# Set S3 prefixes
tsv_prefix = 'amazon-reviews-pds/tsv'
parquet_prefix = 'amazon-reviews-pds/parquet'

# Set Redshift params 
database_name = 'dsoaws'
table_name = 'amazon_reviews_parquet'

## Step 0 -- Copy Dataset To S3 Bucket & Convert To Parquet

In [2]:
# Set S3 source paths
s3_source_path_tsv = 's3://amazon-reviews-pds/tsv'
s3_source_path_parquet = 's3://amazon-reviews-pds/parquet'

# Set S3 destination paths
s3_destination_path_tsv = 's3://{}/{}'.format(bucket, tsv_prefix)
s3_destination_path_parquet = 's3://{}/{}'.format(bucket, parquet_prefix)

In [3]:
# Copy the dataset from the public source bucket to our bucket
!aws s3 cp --recursive $s3_source_path_tsv/ $s3_destination_path_tsv/

copy: s3://amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz to s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz to s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz to s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz to s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Baby_v1_00.tsv.gz to s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_us_Baby_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Automotive_v1_00.tsv.gz to s3://sa

In [4]:
# List files
!aws s3 ls $s3_destination_path_tsv/

2020-03-01 20:12:48  241896005 amazon_reviews_multilingual_DE_v1_00.tsv.gz
2020-03-01 20:12:48   70583516 amazon_reviews_multilingual_FR_v1_00.tsv.gz
2020-03-01 20:12:48   94688992 amazon_reviews_multilingual_JP_v1_00.tsv.gz
2020-03-01 20:12:48  349370868 amazon_reviews_multilingual_UK_v1_00.tsv.gz
2020-03-01 20:12:48 1466965039 amazon_reviews_multilingual_US_v1_00.tsv.gz
2020-03-01 20:12:49  648641286 amazon_reviews_us_Apparel_v1_00.tsv.gz
2020-03-01 20:12:50  582145299 amazon_reviews_us_Automotive_v1_00.tsv.gz
2020-03-01 20:12:54  357392893 amazon_reviews_us_Baby_v1_00.tsv.gz
2020-03-01 20:12:57  914070021 amazon_reviews_us_Beauty_v1_00.tsv.gz
2020-03-01 20:13:04 2740337188 amazon_reviews_us_Books_v1_00.tsv.gz
2020-03-01 20:13:06 2692708591 amazon_reviews_us_Books_v1_01.tsv.gz
2020-03-01 20:13:06 1329539135 amazon_reviews_us_Books_v1_02.tsv.gz
2020-03-01 20:13:17  442653086 amazon_reviews_us_Camera_v1_00.tsv.gz
2020-03-01 20:13:21 2689739299 amazon_reviews_us_Digital_Ebook_Purchase_v

#### Remove index.txt, sample and multilingual dataset files

In [5]:
!aws s3 rm $s3_destination_path_tsv/index.txt
!aws s3 rm $s3_destination_path_tsv/sample_fr.tsv
!aws s3 rm $s3_destination_path_tsv/sample_us.tsv

!aws s3 rm $s3_destination_path_tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
!aws s3 rm $s3_destination_path_tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz
!aws s3 rm $s3_destination_path_tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
!aws s3 rm $s3_destination_path_tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
!aws s3 rm $s3_destination_path_tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz

delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/index.txt
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/sample_fr.tsv
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/sample_us.tsv
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
delete: s3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz
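
The removals above follow three filename patterns, so the same selection could be expressed as a filter over object keys. This is a sketch only: `should_remove` is a hypothetical helper, the example keys are illustrative, and nothing here calls S3.

```python
from fnmatch import fnmatch

# Patterns mirroring the files removed above
EXCLUDE_PATTERNS = [
    "index.txt",
    "sample_*.tsv",
    "amazon_reviews_multilingual_*.tsv.gz",
]


def should_remove(key):
    """Return True if the object's filename matches one of the exclude patterns."""
    filename = key.rsplit("/", 1)[-1]
    return any(fnmatch(filename, pattern) for pattern in EXCLUDE_PATTERNS)


example_keys = [
    "amazon-reviews-pds/tsv/index.txt",
    "amazon-reviews-pds/tsv/sample_us.tsv",
    "amazon-reviews-pds/tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz",
    "amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_00.tsv.gz",
]
keys_to_remove = [key for key in example_keys if should_remove(key)]
keys_to_keep = [key for key in example_keys if not should_remove(key)]
print(keys_to_keep)
```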


### CONVERT TSV to Parquet (with Athena)

In [6]:
# Install PyAthena
!pip install -q PyAthena==1.8.0



In [26]:
# Imports
from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
from pyathena.util import as_pandas

In [27]:
# Set Athena database and table names (the database must exist before the table is created)
database_name = 'dsoaws'
table_name_tsv = 'amazon_reviews_tsv2'



In [28]:
# Set S3 staging directory 
s3_staging_dir = 's3://{0}/staging/athena'.format(bucket)
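
The table statement in the next section assumes the `dsoaws` database already exists in Athena. A minimal sketch of creating it is shown here; `build_create_database_statement` is a hypothetical helper, and the commented-out lines that run the statement through PyAthena need AWS credentials plus the `region_name` and `s3_staging_dir` values defined in this notebook.

```python
def build_create_database_statement(name):
    """Build an Athena DDL statement that creates the database only if it is missing."""
    return "CREATE DATABASE IF NOT EXISTS {}".format(name)


create_db_statement = build_create_database_statement("dsoaws")
print(create_db_statement)

# To run it with PyAthena (requires AWS access):
# from pyathena import connect
# cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
# cursor.execute(create_db_statement)
```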

### Create Athena Table from downloaded TSV Dataset Files

In [42]:
statement = """CREATE EXTERNAL TABLE {}.{}(
         marketplace string,
         customer_id string,
         review_id string,
         product_id string,
         product_parent string,
         product_title string,
         product_category string,
         star_rating int,
         helpful_votes int,
         total_votes int,
         vine string,
         verified_purchase string,
         review_headline string,
         review_body string,
         review_date string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' ESCAPED BY '\\\\' LINES TERMINATED BY '\\n' LOCATION '{}'
TBLPROPERTIES ( 'compressionType'='gzip', 'skip.header.line.count'='1')""".format(database_name, table_name_tsv, s3_destination_path_tsv)

print(statement)

CREATE EXTERNAL TABLE dsoaws.amazon_reviews_tsv2(
         marketplace string,
         customer_id string,
         review_id string,
         product_id string,
         product_parent string,
         product_title string,
         product_category string,
         star_rating int,
         helpful_votes int,
         total_votes int,
         vine string,
         verified_purchase string,
         review_headline string,
         review_body string,
         review_date string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\\' LINES TERMINATED BY '\n' LOCATION 's3://sagemaker-us-east-1-806570384721/amazon-reviews-pds/tsv'
TBLPROPERTIES ( 'compressionType'='gzip', 'skip.header.line.count'='1')


In [43]:
# Execute query using connection cursor
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()

cursor.execute(statement)


<pyathena.cursor.Cursor at 0x7f6b8f260ef0>

In [44]:
# Load the statement result into a Pandas DataFrame (a CREATE TABLE statement returns no data rows)
df  = as_pandas(cursor)
df.head(5)

### Create Parquet Files from the TSV Table

In [None]:
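# CTAS statement: write the TSV table out as Parquet files in S3.
# A triple-quoted string avoids clashing with the single quotes in the SQL,
# and s3_destination_path_parquet already starts with 's3://'.
table_name_parquet = 'amazon_reviews_parquet_from_tsv'

statement = """CREATE TABLE {}.{}
WITH ( format = 'PARQUET', external_location = '{}' ) AS
SELECT marketplace,
         customer_id,
         review_id,
         product_id,
         product_parent,
         product_title,
         product_category,
         star_rating,
         helpful_votes,
         total_votes,
         vine,
         verified_purchase,
         review_headline,
         review_body,
         review_date
FROM {}.{}""".format(database_name, table_name_parquet, s3_destination_path_parquet,
                     database_name, table_name_tsv)

print(statement)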