# Building a Data Lake using AWS Glue <a name="top"></a>

## Table of Contents:

1. [Introduction](#Introduction)
2. [Activity 1 : CSV to Parquet conversion](#Activity-1-:-CSV-to-Parquet-conversion)
3. [Activity 2 : Building a Star Schema in your Datalake](#Activity-2-:-Building-a-Star-Schema-in-your-Datalake)
4. [Activity 3 : Building an AWS Glue Workflow](#Activity-3-:-Building-an-AWS-Glue-Workflow)
5. [Wrap-up](#Wrap-up)

## Introduction
[(Back to the top)](#top)

In this notebook, we will use AWS Glue to perform 3 activities:
    
- Convert a CSV Dataset to Parquet partitioned out by key fields.
- Build a Star (Denormalized) Schema from an OLTP 3NF (3rd Normal Form) Schema.
- Finally, deploy the pipeline as an AWS Glue Workflow.

Let's start by connecting to our AWS Glue Dev Endpoint, a persistent AWS Glue Spark development environment.

In [None]:
spark.version

In [None]:
spark.sql("show databases").show()

In [None]:
spark.sql("show tables").show()

Note that regular Spark SQL commands work here because we have enabled the feature 'Use Glue Data Catalog as the Hive metastore' for our AWS Glue Dev Endpoint.

You can read more in [AWS Glue Data Catalog Support for Spark SQL Jobs](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-data-catalog-hive.html).

## Activity 1 : CSV to Parquet conversion
[(Back to the top)](#top)

The 1st dataset we will be using is the NYC Taxi Trips CSV dataset, which has 1.2B records covering 5 vendors and 8 years of data. We will partition the data in the analytics tier by vendor name, year and month, and catalog it in the AWS Glue Data Catalog.
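
To make the partitioning concrete, here is a minimal sketch of the Hive-style folder layout (`key=value/`) that a partitioned Parquet write produces. The helper `partition_prefix` and the sample record values are illustrative assumptions, not part of the lab code; the bucket placeholder is the one used throughout this notebook.

```python
def partition_prefix(base_path, record, partition_keys):
    """Build the Hive-style folder prefix under which one record is stored."""
    parts = [f"{key}={record[key]}" for key in partition_keys]
    return base_path.rstrip("/") + "/" + "/".join(parts) + "/"


# Made-up sample record; only the partition columns matter here.
sample = {"vendor_name": "VTS", "year": 2014, "month": 6, "fare_amount": 12.5}

prefix = partition_prefix(
    "s3://###s3_bucket###/data/nyc_trips_parquet/",
    sample,
    ["vendor_name", "year", "month"],
)
print(prefix)

# Upper bound on partition folders: 5 vendors x 8 years x 12 months.
max_partitions = 5 * 8 * 12
print(max_partitions)
```

Queries that filter on `vendor_name`, `year` or `month` can then skip every folder that does not match, which is the main reason for partitioning by these fields.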

We will perform the following 3 steps to make the final Parquet converted data available as an AWS Glue Table.

<img src="../resources/activity_flow_1.png" alt="Module1 Flow" style="width: 350px;"/>


### Crawl the Source Data

The 1st step is to run the AWS Glue Crawler on the raw dataset to create the table in the AWS Glue Catalog.

Create and run an AWS Glue Crawler on the source data in S3:

- Navigate to the AWS Glue console at Services -> AWS Glue
- From the left-hand panel menu, navigate to Data Catalog -> Crawlers.
- Click on the button ‘Add Crawler’ to create a new AWS Glue Crawler.
- Fields to fill in:
    - Page: Add information about your crawler
        - Crawler name: **nyc_trips_csv_crawler**
    - Page: Add a data store
        - Choose a data store: S3
        - Include path: **s3://###s3_bucket###/data/nyc_trips_csv/**
    - Page: Choose an IAM role
        - IAM Role: Choose an existing IAM role **###iam_role###**
    - Page: Configure the crawler's output
        - Database: Click on ‘Add database’ and enter database name as **nyc_trips**.
- Click on the button ‘Finish’ to create the crawler.
- Select the new Crawler and click on 'Run crawler' to run the Crawler.
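
The console steps above can also be scripted. The following is a minimal sketch using the boto3 Glue client calls `create_crawler` and `start_crawler`; the helper names `crawler_request` and `create_and_start_crawler` are hypothetical, and the sketch assumes the `nyc_trips` database already exists (or that the crawler role is allowed to create it).

```python
def crawler_request(name, role, database, s3_path):
    """Assemble keyword arguments for the Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue, request):
    """Create the crawler, then start it. `glue` is a boto3 Glue client."""
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


request = crawler_request(
    name="nyc_trips_csv_crawler",
    role="###iam_role###",
    database="nyc_trips",
    s3_path="s3://###s3_bucket###/data/nyc_trips_csv/",
)
print(request)

# To run this against AWS (requires boto3 and credentials):
# import boto3
# create_and_start_crawler(boto3.client("glue"), request)
```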

Once the data is crawled, which should take about a minute, we can view the database and tables in the AWS Glue Catalog and query the tables as well.

### Transform the data to Parquet

Let's query the table created:

In [None]:
spark.sql("use nyc_trips").show()

In [None]:
spark.sql("show tables").show()

In [None]:
df = spark.sql("select * from nyc_trips.nyc_trips_csv")
df.printSchema()

Let's now write an AWS Glue Spark job to convert this CSV data into a columnar (Parquet) format.

In [None]:
## We will simulate the Glue job arguments 
import sys
sys.argv = ["CSV2Parquet","--JOB_NAME", "CSV2Parquet"]

Let's start the code for the AWS Glue Job:

In [None]:
## Glue boilerplate code

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3, json

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
print (args['JOB_NAME']+" START...")
if 'sc' not in vars(): sc = SparkContext()
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

## Glue boilerplate code

In [None]:
db_name='nyc_trips'
tbl_name='nyc_trips_csv'
output_dir='s3://###s3_bucket###/data/nyc_trips_parquet/'

We can easily instantiate an AWS Glue DynamicFrame from the AWS Glue Catalog table:

In [None]:
# Read the input data
dyf = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_name)
dyf.printSchema()

As we are not doing any transformations here, we can write the data out to our Amazon S3 bucket in Parquet right away:

In [None]:
# Write the data out in Parquet
glueContext.write_dynamic_frame.from_options(frame = dyf, connection_type = "s3", connection_options = {"path": output_dir, "partitionKeys": ['vendor_name', 'year', 'month']}, format = "parquet")

In [None]:
## Glue boilerplate code

job.commit()
print (args['JOB_NAME']+" END...")

## Glue boilerplate code

### Crawl the Transformed Data

Now that the output data is in Amazon S3, let's crawl this dataset in AWS Glue and query this data using Amazon Athena.

- Navigate to the AWS Glue console at Services -> AWS Glue
- From the left-hand panel menu, navigate to Data Catalog -> Crawlers.
- Click on the button ‘Add Crawler’ to create a new AWS Glue Crawler.
- Fields to fill in:
    - Page: Add information about your crawler
        - Crawler name: **nyc_trips_parquet_crawler**
    - Page: Add a data store
        - Choose a data store: S3
        - Include path: **s3://###s3_bucket###/data/nyc_trips_parquet/**
    - Page: Choose an IAM role
        - IAM Role: Choose an existing IAM role **###iam_role###**
    - Page: Configure the crawler's output
        - Database: Select database as **nyc_trips**
- Click on the button ‘Finish’ to create the crawler.
- Select the new Crawler and click on Run crawler to run the Crawler.

Once the Crawler run has completed, navigate to the Amazon Athena console (Services -> Athena) to run Amazon Athena queries on this dataset.

Note: You may need to set the output location for Amazon Athena by clicking on Settings -> Query result location in the Amazon Athena console and setting the value to:

**s3://###s3_bucket###/athena-query-results/**
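
For reference, the same kind of Athena query can be submitted from code. This is a minimal sketch built on the boto3 Athena client call `start_query_execution`; the helper names and the example SQL are assumptions for illustration, and the query relies on the `vendor_name` and `year` partition columns written earlier.

```python
def athena_query_request(sql, database, output_location):
    """Assemble keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def submit_query(athena, request):
    """Submit the query and return its execution id. `athena` is a boto3 client."""
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


query = athena_query_request(
    sql=(
        "SELECT vendor_name, year, count(*) AS trips "
        "FROM nyc_trips_parquet "
        "GROUP BY vendor_name, year "
        "ORDER BY vendor_name, year"
    ),
    database="nyc_trips",
    output_location="s3://###s3_bucket###/athena-query-results/",
)
print(query["QueryString"])

# To run this against AWS (requires boto3 and credentials):
# import boto3
# print(submit_query(boto3.client("athena"), query))
```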

We can also query the data using Spark SQL:


In [None]:
spark.sql("show tables").show()

In [None]:
spark.sql("select count(*) from nyc_trips_parquet").show()

## Activity 2 : Building a Star Schema in your Datalake
[(Back to the top)](#top)

In this activity, we will denormalize an OLTP 3NF schema and write the result to Parquet. This activity demonstrates how to use AWS Glue operations to perform powerful data transformations on input data:

![alt text](../resources/denormalize.png "Building a Star Schema")

### Step 1 : Crawl the Source Data

The 1st step is to run the AWS Glue Crawler on the raw dataset to create the tables in the AWS Glue Catalog.

- Navigate to the AWS Glue console at Services -> AWS Glue
- From the left-hand panel menu, navigate to Data Catalog -> Crawlers.
- Click on the button ‘Add Crawler’ to create a new AWS Glue Crawler.
- Fields to fill in:
    - Page: Add information about your crawler
        - Crawler name: **salesdb_crawler**
    - Page: Add a data store
        - Choose a data store: S3
        - Include path: **s3://###s3_bucket###/data/salesdb/**
    - Page: Choose an IAM role
        - IAM Role: Choose an existing IAM role **###iam_role###**
    - Page: Configure the crawler's output
        - Database:  Click on ‘Add database’ and enter database name as **salesdb**.
- Click on the button ‘Finish’ to create the crawler.
- Select the new Crawler and click on Run crawler to run the Crawler.



In [None]:
spark.sql("use salesdb").show()
spark.sql("show tables").show()

### Step 2: Transform the dataset

Let's now denormalize the source tables where applicable and write out the data in Parquet format to the destination location:

In [None]:
db_name='salesdb'
table1='customer'
table2='customer_site'
output_dir='s3://###s3_bucket###/data/sales_analytics/customer_dim'
print (output_dir)

# Read the Source Tables
cust_dyf = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = table1)
cust_site_dyf = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = table2)

# Join the two Source Tables
customer_dim_dyf = Join.apply(cust_dyf,cust_site_dyf,
                       'cust_id', 'cust_id').drop_fields(['cust_id'])

# Write the denormalized CUSTOMER_DIM table in Parquet
glueContext.write_dynamic_frame.from_options(frame = customer_dim_dyf, connection_type = "s3", connection_options = {"path": output_dir}, format = "parquet")


In [None]:
table1='product_category'
table2='product'
output_dir='s3://###s3_bucket###/data/sales_analytics/product_dim/'
print (output_dir)

# Read the Source Tables
table1_dyf = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = table1)
table2_dyf = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = table2)

#Join the Source Tables
product_dim_dyf = Join.apply(table1_dyf,table2_dyf,
                       'category_id', 'category_id').drop_fields(['category_id'])

# Write the denormalized PRODUCT_DIM table in Parquet
glueContext.write_dynamic_frame.from_options(frame = product_dim_dyf, connection_type = "s3", connection_options = {"path": output_dir}, format = "parquet")


In [None]:
table1='supplier'
output_dir='s3://###s3_bucket###/data/sales_analytics/supplier_dim/'
print (output_dir)

# Read the Source Tables
table1_dyf = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = table1)


# Write the SUPPLIER_DIM table in Parquet (single source table, no join needed)
glueContext.write_dynamic_frame.from_options(frame = table1_dyf, connection_type = "s3", connection_options = {"path": output_dir}, format = "parquet")

In [None]:
table1='sales_order_detail'
table2='sales_order'
output_dir='s3://###s3_bucket###/data/sales_analytics/sales_order_fact/'
print (output_dir)

For the 'sales_order_fact' table, we will try a different approach:

- Convert the AWS Glue DynamicFrame to a Spark DataFrame
- Register the Spark DataFrame as a Spark Temporary View
- Use Spark SQL to build and write out the target dataset.

This demonstrates that AWS Glue DynamicFrames and Spark DataFrames can be converted into one another, so you can get the best of both worlds by using whichever option suits each step.
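
As a minimal sketch of that round trip, the helper below converts a DynamicFrame to a Spark DataFrame with `toDF()`, applies a DataFrame operation, and converts back with `DynamicFrame.fromDF`. The function name `filter_with_spark` and its parameters are hypothetical; it only runs inside a Glue environment where the `awsglue` library is available.

```python
def filter_with_spark(dyf, glue_context, condition, name="filtered"):
    """Round-trip a DynamicFrame through a Spark DataFrame.

    The awsglue import is deferred so this function can be defined
    outside of a Glue environment; calling it still requires Glue.
    """
    from awsglue.dynamicframe import DynamicFrame

    df = dyf.toDF()                   # DynamicFrame -> Spark DataFrame
    filtered = df.filter(condition)   # any DataFrame operation fits here
    return DynamicFrame.fromDF(filtered, glue_context, name)


# Example call inside the notebook (assumes table1_dyf and glueContext exist):
# clean_dyf = filter_with_spark(table1_dyf, glueContext, "order_id IS NOT NULL")
```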

In [None]:
# Read the Source Tables
table1_dyf = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = table1)
table2_dyf = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = table2)

In [None]:
table1_dyf.printSchema()

In [None]:
table2_dyf.printSchema()

In [None]:
table1_dyf.toDF().createOrReplaceTempView("sales_order_detail_v")
table2_dyf.toDF().createOrReplaceTempView("sales_order_v")

In [None]:
# Write the denormalized SALES_ORDER_FACT table
# a = order detail lines, b = order headers (site, date, ship mode)
df=spark.sql("SELECT a.*, b.site_id, b.order_date, b.ship_mode \
FROM sales_order_detail_v a, sales_order_v b \
WHERE a.order_id=b.order_id")
df.printSchema()
print(df.count())
df.coalesce(1).write.mode("OVERWRITE").parquet("s3://###s3_bucket###/data/sales_analytics/sales_order_fact/")

Note that we used the power of Spark SQL for this transformation instead of AWS Glue DynamicFrame transforms. This dataset is small, so we also coalesced the number of partitions in the Spark DataFrame to 1 to ensure that only 1 file gets written to our output location.
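
To see what the join used for `sales_order_fact` produces, here is a small self-contained sketch that uses Python's built-in `sqlite3` as a stand-in for Spark SQL. The rows are made up, and the detail columns other than `order_id` (`line_id`, `product_id`, `quantity`) are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_order (
        order_id INTEGER, site_id INTEGER, order_date TEXT, ship_mode TEXT);
    CREATE TABLE sales_order_detail (
        order_id INTEGER, line_id INTEGER, product_id INTEGER, quantity INTEGER);

    INSERT INTO sales_order VALUES
        (1, 10, '2020-01-05', 'AIR'),
        (2, 11, '2020-01-06', 'GROUND');
    INSERT INTO sales_order_detail VALUES
        (1, 1, 100, 2),
        (1, 2, 101, 1),
        (2, 1, 100, 5);
    """
)

# Each detail line picks up the header attributes of its order.
rows = conn.execute(
    """
    SELECT d.*, o.site_id, o.order_date, o.ship_mode
    FROM sales_order_detail d
    JOIN sales_order o ON d.order_id = o.order_id
    ORDER BY d.order_id, d.line_id
    """
).fetchall()

for row in rows:
    print(row)
```

The fact table keeps one row per order line, with the order-level attributes repeated on every line, which is what denormalizing into a star schema means here.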

In [None]:
%%sh
aws s3 ls s3://###s3_bucket###/data/sales_analytics/sales_order_fact/

Now that the output data is in Amazon S3, let's crawl this dataset in AWS Glue and query this data using Amazon Athena.

### Step 3 : Crawl the Transformed Data

- Navigate to the Glue console at Services -> Glue
- From the left-hand panel menu, navigate to Data Catalog -> Crawlers.
- Click on the button ‘Add Crawler’ to create a new Glue Crawler.
- Fields to fill in:
    - Page: Add information about your crawler
        - Crawler name: **sales_analytics_crawler**
    - Page: Add a data store
        - Choose a data store: S3
        - Include path: **s3://###s3_bucket###/data/sales_analytics/**
    - Page: Choose an IAM role
        - IAM Role: Choose an existing IAM role **###iam_role###**
    - Page: Configure the crawler's output
        - Database:  Click on ‘Add database’ and enter database name as **sales_analytics**.
- Click on the button ‘Finish’ to create the crawler.
- Select the new Crawler and click on Run crawler to run the Crawler.


In [None]:
spark.sql("use sales_analytics").show()
spark.sql("show tables").show()

## Activity 3 : Building an AWS Glue Workflow
[(Back to the top)](#top)

An AWS Glue workflow is an orchestration used to visualize and manage the relationship and execution of multiple AWS Glue triggers, jobs and crawlers. Let's now build an AWS Glue Workflow for the star schema load from Activity 2.

The 1st step is to create the AWS Glue Jobs. As the AWS Glue ETL code is already staged in our Amazon S3 bucket, we will simply call the AWS Glue APIs to create the AWS Glue Jobs.

In [None]:
%local

import boto3

acct_number=boto3.client('sts').get_caller_identity().get('Account')
bucket='###s3_bucket###'
iam_role='###iam_role###'

# Create the AWS Glue Spark Jobs
glue = boto3.client("glue")

for job_name in ['Load_SALES_ORDER_FACT', 'Load_PRODUCT_DIM', 'Load_CUSTOMER_DIM','Load_SUPPLIER_DIM']:
    response=glue.create_job(Name=job_name,
                         Role=f"arn:aws:iam::{acct_number}:role/{iam_role}",
                         ExecutionProperty={'MaxConcurrentRuns': 1},
                         Command={'Name': 'glueetl',
                                  'ScriptLocation': f's3://{bucket}/scripts/{job_name}.py',
                                  'PythonVersion': '3'},
                         DefaultArguments={'--TempDir': f's3://{bucket}/temp',
                                           '--enable-continuous-cloudwatch-log': 'true',
                                           '--enable-glue-datacatalog': '',
                                           '--enable-metrics': '',
                                           '--enable-spark-ui': 'true',
                                           '--spark-event-logs-path': f's3://{bucket}/spark_glue_etl_logs/{job_name}',
                                           '--job-bookmark-option': 'job-bookmark-disable',
                                           '--job-language': 'python',
                                           '--S3_BUCKET': bucket },
                         MaxRetries=0,
                         Timeout=2880,
                         MaxCapacity=3.0,
                         GlueVersion='1.0',
                         Tags={'Owner': 'Glue_Labs'}
                        )
    print (response)

The Workflow consists of 3 AWS Glue triggers:

- The 1st OnDemand Trigger loads the Dimension tables.
- The 2nd Conditional Trigger loads the Fact table.
- The 3rd Conditional Trigger updates the table definitions in the Catalog.
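
The sketch below illustrates how a conditional trigger's predicate decides when to fire. It is a simplified model, not Glue's implementation: `predicate_met` is a hypothetical helper that only handles the `EQUALS` operator, with `AND` requiring every condition to hold and `ANY` requiring at least one.

```python
def predicate_met(predicate, job_states):
    """Return True when the predicate is satisfied by the given job states."""
    results = [
        job_states.get(cond["JobName"]) == cond["State"]
        for cond in predicate["Conditions"]
    ]
    if predicate.get("Logical", "AND") == "AND":
        return all(results)
    return any(results)


# Predicate of the 2nd trigger: all three dimension loads must succeed.
facts_predicate = {
    "Logical": "AND",
    "Conditions": [
        {"LogicalOperator": "EQUALS", "JobName": name, "State": "SUCCEEDED"}
        for name in ("Load_SUPPLIER_DIM", "Load_PRODUCT_DIM", "Load_CUSTOMER_DIM")
    ],
}

states = {
    "Load_CUSTOMER_DIM": "SUCCEEDED",
    "Load_PRODUCT_DIM": "SUCCEEDED",
    "Load_SUPPLIER_DIM": "RUNNING",
}
print(predicate_met(facts_predicate, states))   # one job still running

states["Load_SUPPLIER_DIM"] = "SUCCEEDED"
print(predicate_met(facts_predicate, states))   # all dimension loads done
```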

In [None]:
%local

glue = boto3.client("glue")

# Create the AWS Glue Workflow
response = glue.create_workflow(
    Name='Sales_Analytics_Workflow',
    Description='Sales Analytics Workflow v1.0'
)
print (response)

# 1. The Trigger to load the Dimensions table
response = glue.create_trigger(
    Name='1_Load_Dimensions',
    WorkflowName='Sales_Analytics_Workflow',
    Type='ON_DEMAND',
    Actions=[{'JobName': 'Load_CUSTOMER_DIM',
    'Arguments': {'--job-bookmark-option': 'job-bookmark-disable'},
    'Timeout': 2880},
   {'JobName': 'Load_PRODUCT_DIM',
    'Arguments': {'--job-bookmark-option': 'job-bookmark-disable'},
    'Timeout': 2880},
   {'JobName': 'Load_SUPPLIER_DIM',
    'Arguments': {'--job-bookmark-option': 'job-bookmark-disable'},
    'Timeout': 2880}]
)
print (response)  

# 2. The Trigger to load the Facts table
response = glue.create_trigger(
    Name='2_Load_Facts',
    WorkflowName='Sales_Analytics_Workflow',
    Type='CONDITIONAL',
    StartOnCreation=True,
    Actions=[{'JobName': 'Load_SALES_ORDER_FACT'}],
    Predicate= {'Logical': 'AND',
    'Conditions': [{'LogicalOperator': 'EQUALS',
                  'JobName': 'Load_SUPPLIER_DIM',
                   'State': 'SUCCEEDED'},
                  {'LogicalOperator': 'EQUALS',
                   'JobName': 'Load_PRODUCT_DIM',
                   'State': 'SUCCEEDED'},
                  {'LogicalOperator': 'EQUALS',
                   'JobName': 'Load_CUSTOMER_DIM',
                   'State': 'SUCCEEDED'}]
               }
)
print (response)  

# Finally, the Trigger for the Crawler
response = glue.create_trigger(
    Name='3_Update_Catalog',
    WorkflowName='Sales_Analytics_Workflow',
    Type='CONDITIONAL',
    StartOnCreation=True,
    Actions=[{'CrawlerName': 'sales_analytics_crawler'}],
    Predicate= {'Logical': 'ANY',
   'Conditions': [{'LogicalOperator': 'EQUALS',
     'JobName': 'Load_SALES_ORDER_FACT',
     'State': 'SUCCEEDED'}]}
)
print (response)     

Let's review the AWS Glue Workflow created:
    
- Navigate to the Glue Console at Services -> Glue
- From the left-hand panel menu, choose Workflows
- Select the Workflow 'Sales_Analytics_Workflow'.

Your workflow should look like this:

![title](../resources/Glue_Workflow.png)
  

Let us now run this workflow: 

- Select the workflow and click on 'Action -> Run' to launch the workflow.
- You can view the run details and visually track the progress of each activity in the workflow from the 'History' tab by selecting the workflow run and clicking on 'View Run Details'.

![title](../resources/View_Run_Details.png)
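
The workflow can also be started and monitored from code. This is a minimal sketch using the boto3 Glue client calls `start_workflow_run` and `get_workflow_run`; the helper `run_workflow`, its polling interval and the injectable `sleep` parameter are assumptions made for illustration.

```python
import time

# Workflow run statuses after which polling can stop.
TERMINAL_STATUSES = {"COMPLETED", "STOPPED", "ERROR"}


def run_workflow(glue, name, poll_seconds=30, sleep=time.sleep):
    """Start a workflow run and poll until it reaches a terminal status.

    `glue` is a boto3 Glue client. Returns the final `Run` description.
    """
    run_id = glue.start_workflow_run(Name=name)["RunId"]
    while True:
        run = glue.get_workflow_run(Name=name, RunId=run_id)["Run"]
        if run["Status"] in TERMINAL_STATUSES:
            return run
        sleep(poll_seconds)


# To run this against AWS (requires boto3 and credentials):
# import boto3
# final_run = run_workflow(boto3.client("glue"), "Sales_Analytics_Workflow")
# print(final_run["Status"], final_run.get("Statistics"))
```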


## Wrap-up
[(Back to the top)](#top)


In this notebook, we:

1. Converted a CSV dataset to Parquet, and observed how easy it is to transform and write data to an Amazon S3 bucket using AWS Glue, partitioned by key fields.
2. Performed a more complex transformation, denormalizing a 3NF OLTP schema, and observed how easy it is to perform complex data transformations using the power of both AWS Glue DynamicFrames and Spark SQL.
3. Built and executed an AWS Glue Workflow to orchestrate multiple AWS Glue Jobs.
