## Building Open Data Lakes with Debezium, Apache Kafka, Hudi, Spark, and Hive on AWS

__Author:__ Gary A. Stafford  
__Date:__ 2021-10-20  
__Purpose:__ Demonstrate the use of Debezium, Apache Kafka, Hudi, Spark, and Hive to populate and manage an S3-based data lake on AWS from an Amazon RDS datasource. Apache Hudi, Spark, and Hive are hosted on Amazon EMR. Apache Kafka is hosted on Amazon MSK. Kafka Connect is hosted on Amazon EKS.  
__References:__  
- https://hudi.apache.org/docs/quick-start-guide/
- https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
- https://hudi.apache.org/docs/configurations#SPARK_DATASOURCE

### Run Commands from Master Node

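SSH to the EMR master node as the `hadoop` user, then copy the Hudi and Avro JARs into HDFS so that the notebook's Spark session can load them.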

```shell
hdfs dfs -mkdir -p /apps/hudi/lib
hdfs dfs -copyFromLocal /usr/lib/hudi/hudi-spark-bundle.jar /apps/hudi/lib/hudi-spark-bundle.jar
hdfs dfs -copyFromLocal /usr/lib/spark/jars/spark-avro.jar /apps/hudi/lib/spark-avro.jar
```

### Museum of Modern Art Collection
Title, artist, date, and medium of every artwork in the MoMA collection.

Dataset: https://www.kaggle.com/momanyc/museum-collection

CSV-format data files:
- artists.csv (596K / ~15k rows)
- artworks.csv (33M / ~130k rows)

### Hudi DeltaStreamer Spark Job

Start the Hudi DeltaStreamer job on EMR.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("moma-cdc-hudi") \
    .enableHiveSupport() \
    .getOrCreate()

In [None]:
%%configure -f
{
    "conf": {
        "spark.jars":
            "hdfs:///apps/hudi/lib/hudi-spark-bundle.jar,hdfs:///apps/hudi/lib/spark-avro.jar",
        "spark.serializer":
            "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.hive.convertMetastoreParquet":
            "false"
    }
}

In [None]:
from datetime import datetime
import os
import boto3

In [None]:
spark.sql("SHOW databases;").show(truncate=False)

In [None]:
spark.sql("USE moma;")
spark.sql("SHOW TABLES;").show(truncate=False)

In [None]:
spark.sql("DESCRIBE artists_ro;").show(truncate=False)

In [None]:
spark.sql("SHOW PARTITIONS artists_ro;").show(15, truncate=False)

In [None]:
spark.sql("SELECT * FROM artists_rt LIMIT 5;").show()

In [None]:
%%sh

# Preview the Hudi-managed files in S3
export DATA_LAKE_BUCKET="<your_data_lake_bucket>"

aws s3api list-objects-v2 \
    --bucket "$DATA_LAKE_BUCKET" --prefix "moma/artists" \
    --query "Contents[].Key" --max-items 50
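
The same listing can be done from Python with `boto3`, which the notebook already imports. The following is a minimal sketch, not part of the original workflow: the helper name `list_keys` is hypothetical, and the bucket name is a placeholder you must supply. Note that `MaxKeys` caps a single response, whereas the CLI's `--max-items` caps the paginated total.

```python
def list_keys(s3_client, bucket, prefix, max_items=50):
    """Return up to max_items object keys under the given prefix.

    s3_client is expected to be a boto3 S3 client, i.e. boto3.client("s3").
    """
    response = s3_client.list_objects_v2(
        Bucket=bucket, Prefix=prefix, MaxKeys=max_items
    )
    # "Contents" is absent from the response when no objects match the prefix
    return [obj["Key"] for obj in response.get("Contents", [])]


# Usage (assumes AWS credentials are available on the cluster):
#   import boto3
#   for key in list_keys(boto3.client("s3"), "<your_data_lake_bucket>", "moma/artists"):
#       print(key)
```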

### Make Some Changes

Make changes to the database. Confirm that new Avro files are created in the raw part of the data lake, and that new Parquet files are created in the Hudi-managed part of the data lake.

From AWS documentation: "_Hudi creates two tables in the Hive metastore for __MoR__: a table with the name that you specified, which is a read-optimized view (__\_ro__), and a table with the same name appended with __\_rt__, which is a real-time view. You can query both tables._"
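
To compare the two views side by side, it can help to generate the same query against each. The sketch below assumes the `_ro` and `_rt` suffixes used by the tables in this notebook (`artists_ro`, `artists_rt`). The helper name `mor_view_queries` is hypothetical.

```python
def mor_view_queries(base_table, where):
    """Build matching queries for the read-optimized and real-time views."""
    return {
        "read_optimized": f"SELECT * FROM {base_table}_ro WHERE {where}",
        "real_time": f"SELECT * FROM {base_table}_rt WHERE {where}",
    }


queries = mor_view_queries("artists", "artist_id IN (445, 535)")
for view, sql in queries.items():
    print(view, "->", sql)
    # In the notebook's Spark session, run each with: spark.sql(sql).show()
```

If recent changes have not yet been compacted, the real-time view would be expected to show them before the read-optimized view does.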

#### Debezium Operations

In a Debezium change event, the `op` field is a mandatory string that describes the type of operation that caused the connector to generate the event. For example, `c` indicates that the operation created a row. Valid values are:

- c = create
- r = read (applies to only snapshots)
- u = update
- d = delete
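
The operation codes above can be interpreted in code roughly as follows. This is a minimal sketch that assumes the event has already been reduced to its payload, with `op`, `before`, and `after` fields. The helper name `describe_change` and the sample record, including its `nationality` column, are hypothetical.

```python
import json

# Debezium operation codes, as listed above
DEBEZIUM_OPS = {"c": "create", "r": "read", "u": "update", "d": "delete"}


def describe_change(event):
    """Return (operation name, row image) for a Debezium change event payload."""
    op = event["op"]
    if op not in DEBEZIUM_OPS:
        raise ValueError(f"unexpected Debezium op: {op!r}")
    # Deletes carry the last row state in "before";
    # creates, reads, and updates carry the new state in "after"
    row = event.get("before") if op == "d" else event.get("after")
    return DEBEZIUM_OPS[op], row


# Hypothetical, hand-written payload for illustration only
sample = json.loads(
    '{"op": "u", "before": null, "after": {"artist_id": 445, "nationality": "American"}}'
)
print(describe_change(sample))
```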

#### References
- <https://hudi.apache.org/docs/querying_data/>  
- <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html>
- <https://debezium.io/documentation/reference/connectors/postgresql.html#postgresql-create-events>

In [None]:
spark.sql("SELECT * FROM artists_ro WHERE artist_id IN (445, 535);").show()

In [None]:
spark.sql("SELECT * FROM artists_rt WHERE artist_id IN (445, 535);").show()

In [None]:
spark.sql("SELECT * FROM artists_ro WHERE artist_id IN (451);").show()

In [None]:
spark.sql("SELECT * FROM artists_rt WHERE artist_id IN (451);").show()