## Configuration

Before running the cells below, SSH into your EMR cluster and run the following commands:

```hdfs dfs -mkdir -p /apps/hudi/lib```

```hdfs dfs -copyFromLocal /usr/lib/hudi/hudi-spark-bundle.jar /apps/hudi/lib/hudi-spark-bundle.jar```

```hdfs dfs -copyFromLocal /usr/lib/spark/external/lib/spark-avro.jar /apps/hudi/lib/spark-avro.jar```

This copies the Hudi and Spark Avro jar files from the master node's local file system to HDFS on the notebook cluster.
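
The `%%configure` cell that follows passes these HDFS paths to Spark through `spark.jars` as a comma-separated list. As a minimal sketch, the same JSON body can be generated from a list of jar names so the paths stay in sync with the copy commands. The helper name `build_configure_payload` and the constant `HDFS_LIB_DIR` are hypothetical and are not part of sparkmagic or Hudi.

```python
import json

# Hypothetical helper: builds the JSON body used by the %%configure cell.
HDFS_LIB_DIR = "hdfs:///apps/hudi/lib"  # directory created with `hdfs dfs -mkdir -p`


def build_configure_payload(jar_names, lib_dir=HDFS_LIB_DIR):
    """Return the %%configure body as a dict for the given jar file names."""
    jars = ",".join(f"{lib_dir}/{name}" for name in jar_names)
    return {
        "conf": {
            "spark.jars": jars,
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
            "spark.sql.hive.convertMetastoreParquet": "false",
        }
    }


payload = build_configure_payload(["hudi-spark-bundle.jar", "spark-avro.jar"])
print(json.dumps(payload, indent=4))
```

The printed JSON can be pasted under `%%configure` in the first cell.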

In [1]:
%%configure
{
    "conf": {
            "spark.jars":"hdfs:///apps/hudi/lib/hudi-spark-bundle.jar,hdfs:///apps/hudi/lib/spark-avro.jar",
            "spark.serializer":"org.apache.spark.serializer.KryoSerializer",
            "spark.sql.hive.convertMetastoreParquet":"false"
    }
}

In [2]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

from datetime import datetime

Starting Spark application


ID|YARN Application ID|Kind|State|Spark UI|Driver log|Current session?
:---|:---|:---|:---|:---|:---|:---
19|application_1638389236079_0047|pyspark|idle|Link|Link|✔


SparkSession available as 'spark'.


## Write to S3 via Hudi

In [3]:
data = [
        ("1", "Chris", "2020-01-01", datetime.strptime('2020-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')),
        ("2", "Will", "2020-01-01", datetime.strptime('2020-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')),
        ("3", "Emma", "2020-01-01", datetime.strptime('2020-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')),
        ("4", "John", "2020-01-01", datetime.strptime('2020-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')),
        ("5", "Eric", "2020-01-01", datetime.strptime('2020-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')),
        ("6", "Adam", "2020-01-01", datetime.strptime('2020-01-01 00:00:00', '%Y-%m-%d %H:%M:%S'))
]

schema = StructType([
        StructField("id", StringType(), False),
        StructField("name", StringType(), False), 
        StructField("create_date", StringType(), False),             
        StructField("last_update_time", TimestampType(), False)
])

inputDF = spark.createDataFrame(data=data,schema=schema)

# Create hudiOptions variable
hudiOptions = {
    # Table name matches the MERGE_ON_READ table type and the S3 path used below
    'hoodie.table.name': 'merge_on_read_python',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'create_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'merge_on_read_python',
    # The Hive partition column must match the partition path field above
    'hoodie.datasource.hive_sync.partition_fields': 'create_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ'
}

# Write a DataFrame to S3 as a Hudi dataset 
inputDF \
    .write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'insert') \
    .options(**hudiOptions) \
    .mode('overwrite') \
    .save('s3://hudi-sharkech/merge_on_read_python/')
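
The record key, partition path, and precombine options in the cell above must each name a column that exists in the DataFrame. A minimal sketch of a pre-write sanity check in plain Python follows. `check_hudi_fields` is a hypothetical helper, not a Hudi or PySpark API, and it looks only at these three write options.

```python
# Hypothetical pre-write check; not part of Hudi or PySpark.
FIELD_OPTIONS = (
    "hoodie.datasource.write.recordkey.field",
    "hoodie.datasource.write.partitionpath.field",
    "hoodie.datasource.write.precombine.field",
)


def check_hudi_fields(options, columns):
    """Return a list of (option, field) pairs whose field is not a column."""
    missing = []
    for key in FIELD_OPTIONS:
        # Split on commas so that multi-field values are checked one by one.
        for field in options.get(key, "").split(","):
            field = field.strip()
            if field and field not in columns:
                missing.append((key, field))
    return missing


columns = ["id", "name", "create_date", "last_update_time"]
options = {
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "create_date",
    "hoodie.datasource.write.precombine.field": "last_update_time",
}
print(check_hudi_fields(options, columns))  # an empty list means every field exists
```

In the notebook this could be called as `check_hudi_fields(hudiOptions, inputDF.columns)` before the write.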

## Upsert data

Let's do an upsert. This will be *upsert 1*.

In *upsert 1* we change **Chris** to **Chris Sharkey**.

Also note that for this write we set [inline compaction][0] to false with ```option("hoodie.compact.inline", "false")```. This keeps Hudi from compacting our changes during the write operation, so they stay in the delta files until a compaction runs.

[0]:https://hudi.apache.org/docs/0.7.0/configurations#withinlinecompactioninlinecompaction--false
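
To make the roles of the record key (`id`) and the precombine field (`last_update_time`) concrete, here is a toy model in plain Python. It is a deliberate simplification and not Hudi's implementation: it assumes that duplicates within one batch are resolved by keeping the row with the largest precombine value, and that the surviving row then replaces the stored row with the same key. The function name `upsert` and the dict-based table are illustrative only.

```python
from datetime import datetime


def upsert(table, batch, key="id", precombine="last_update_time"):
    """Toy upsert: return a new dict of rows keyed by record key."""
    # Step 1: de-duplicate the batch, keeping the largest precombine value per key.
    latest = {}
    for row in batch:
        current = latest.get(row[key])
        if current is None or row[precombine] >= current[precombine]:
            latest[row[key]] = row
    # Step 2: the surviving rows replace (or are added to) the stored rows.
    merged = dict(table)
    merged.update(latest)
    return merged


table = {
    "1": {"id": "1", "name": "Chris", "last_update_time": datetime(2020, 1, 1)},
    "2": {"id": "2", "name": "Will", "last_update_time": datetime(2020, 1, 1)},
}
batch = [{"id": "1", "name": "Chris Sharkey", "last_update_time": datetime(2020, 1, 2)}]

table = upsert(table, batch)
print(table["1"]["name"])  # Chris Sharkey
print(len(table))          # 2
```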

In [4]:
data = [
        ("1", "Chris Sharkey", "2020-01-01", datetime.strptime('2020-01-02 00:00:00', '%Y-%m-%d %H:%M:%S'))
]

schema = StructType([
        StructField("id", StringType(), False),
        StructField("name", StringType(), False),
        StructField("create_date", StringType(), False),             
        StructField("last_update_time", TimestampType(), False)
])

updateDF = spark.createDataFrame(data=data,schema=schema)

# Upsert the records in updateDF
updateDF \
    .write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .options(**hudiOptions) \
    .option("hoodie.compact.inline", "false") \
    .mode('append') \
    .save('s3://hudi-sharkech/merge_on_read_python/')

## Read the Hudi Table

Hudi provides three query types:
1. Snapshot Query
2. Read Optimized Query 
3. Incremental Query

We will cover snapshot queries and read optimized queries below. Incremental queries are covered in the [copy_on_write][1] notebooks.

Query Type|Description
:---|:---
Snapshot Queries|Queries that see the latest snapshot of the table as of a given commit or compaction action. For MoR tables, snapshot queries expose the most recent state of the table by merging the base and delta files of the latest file slice at the time of the query.
Incremental Queries|Queries that see only new data written to the table since a given commit or compaction. This effectively provides change streams to enable incremental data pipelines.
Read Optimized Queries|For MoR tables, queries see the latest compacted data. For CoW tables, queries see the latest committed data.

[1]:https://github.com/ev2900/Hudi_Elastic_Map_Reduce/tree/main/copy_on_write
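
The difference between snapshot and read optimized queries on a MoR table can be illustrated with a toy model in plain Python. This is a sketch of the concept only, not Hudi's storage format: the class `ToyMorTable` and its methods are hypothetical, the base files are modeled as a dict, and the delta files as a list of dicts.

```python
class ToyMorTable:
    """Toy Merge-on-Read table: a compacted base plus uncompacted delta commits."""

    def __init__(self, rows):
        self.base = dict(rows)  # stands in for the base parquet files
        self.deltas = []        # stands in for the delta files

    def upsert(self, rows):
        # Each write is appended as a delta commit; the base is left untouched.
        self.deltas.append(dict(rows))

    def read_optimized(self):
        # Read only the base, so uncompacted changes are invisible.
        return dict(self.base)

    def snapshot(self):
        # Merge the base with every delta commit, oldest first, at query time.
        merged = dict(self.base)
        for delta in self.deltas:
            merged.update(delta)
        return merged


table = ToyMorTable({"1": "Chris", "2": "Will"})
table.upsert({"1": "Chris Sharkey"})

print(table.snapshot()["1"])        # Chris Sharkey
print(table.read_optimized()["1"])  # Chris
```

The snapshot read pays the merge cost at query time, while the read optimized read is cheaper but can lag behind the latest writes.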

### Snapshot Query

We expect a snapshot query to return the most up-to-date version of a Hudi table.

The snapshot query should include *upsert 1*, which changed **Chris** to **Chris Sharkey**.

In [5]:
snapshotQueryDF = spark.read.format('org.apache.hudi').load('s3://hudi-sharkech/merge_on_read_python' + '/*/*')

# snapshotQueryDF.orderBy("id").show()
# snapshotQueryDF.select("_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name").orderBy("_hoodie_record_key").show()

snapshotQueryDF.select("id", "name", "create_date", "last_update_time").orderBy("id").show()

+---+-------------+-----------+-------------------+
| id|         name|create_date|   last_update_time|
+---+-------------+-----------+-------------------+
|  1|Chris Sharkey| 2020-01-01|2020-01-02 00:00:00|
|  2|         Will| 2020-01-01|2020-01-01 00:00:00|
|  3|         Emma| 2020-01-01|2020-01-01 00:00:00|
|  4|         John| 2020-01-01|2020-01-01 00:00:00|
|  5|         Eric| 2020-01-01|2020-01-01 00:00:00|
|  6|         Adam| 2020-01-01|2020-01-01 00:00:00|
+---+-------------+-----------+-------------------+

### Read Optimized Queries

A read optimized query returns the latest compacted data.

*upsert 1*, which changed **Chris** to **Chris Sharkey**, has not been compacted into the base parquet files yet because we set ```option("hoodie.compact.inline", "false")``` during the upsert in the prior step.

We expect the read optimized query to **not** reflect the changes made in *upsert 1*.

In [6]:
readOptimizedQueryDF = spark \
    .read \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.query.type', 'read_optimized') \
    .load('s3://hudi-sharkech/merge_on_read_python' + '/*/*')

readOptimizedQueryDF.select("id", "name", "create_date", "last_update_time").orderBy("id").show()

+---+-----+-----------+-------------------+
| id| name|create_date|   last_update_time|
+---+-----+-----------+-------------------+
|  1|Chris| 2020-01-01|2020-01-01 00:00:00|
|  2| Will| 2020-01-01|2020-01-01 00:00:00|
|  3| Emma| 2020-01-01|2020-01-01 00:00:00|
|  4| John| 2020-01-01|2020-01-01 00:00:00|
|  5| Eric| 2020-01-01|2020-01-01 00:00:00|
|  6| Adam| 2020-01-01|2020-01-01 00:00:00|
+---+-----+-----------+-------------------+

Now that we are getting the hang of it, let's do another upsert. This will be *upsert 2*.

*upsert 2* will change **Chris Sharkey** to **Chris M Sharkey**.

In [7]:
data = [
        ("1", "Chris M Sharkey", "2020-01-01", datetime.strptime('2020-01-02 00:00:00', '%Y-%m-%d %H:%M:%S'))
]

schema = StructType([
        StructField("id", StringType(), False),
        StructField("name", StringType(), False),
        StructField("create_date", StringType(), False),             
        StructField("last_update_time", TimestampType(), False)
])

updateDF = spark.createDataFrame(data=data,schema=schema)

# Upsert the records in updateDF
updateDF \
    .write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .options(**hudiOptions) \
    .option("hoodie.compact.inline", "false") \
    .mode('append') \
    .save('s3://hudi-sharkech/merge_on_read_python/')

Run the snapshot query again. The results should include the changes we just made in *upsert 2*.

In [8]:
snapshotQueryDF = spark.read.format('org.apache.hudi').load('s3://hudi-sharkech/merge_on_read_python' + '/*/*')

snapshotQueryDF.select("id", "name", "create_date", "last_update_time").orderBy("id").show()

+---+---------------+-----------+-------------------+
| id|           name|create_date|   last_update_time|
+---+---------------+-----------+-------------------+
|  1|Chris M Sharkey| 2020-01-01|2020-01-02 00:00:00|
|  2|           Will| 2020-01-01|2020-01-01 00:00:00|
|  3|           Emma| 2020-01-01|2020-01-01 00:00:00|
|  4|           John| 2020-01-01|2020-01-01 00:00:00|
|  5|           Eric| 2020-01-01|2020-01-01 00:00:00|
|  6|           Adam| 2020-01-01|2020-01-01 00:00:00|
+---+---------------+-----------+-------------------+

Run the read optimized query again. Neither *upsert 1* nor *upsert 2* has been compacted yet.

The read optimized query should **not** include the changes made by either *upsert 1* or *upsert 2*.

In [9]:
readOptimizedQueryDF = spark \
    .read \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.query.type', 'read_optimized') \
    .load('s3://hudi-sharkech/merge_on_read_python' + '/*/*')

readOptimizedQueryDF.select("id", "name", "create_date", "last_update_time").orderBy("id").show()

+---+-----+-----------+-------------------+
| id| name|create_date|   last_update_time|
+---+-----+-----------+-------------------+
|  1|Chris| 2020-01-01|2020-01-01 00:00:00|
|  2| Will| 2020-01-01|2020-01-01 00:00:00|
|  3| Emma| 2020-01-01|2020-01-01 00:00:00|
|  4| John| 2020-01-01|2020-01-01 00:00:00|
|  5| Eric| 2020-01-01|2020-01-01 00:00:00|
|  6| Adam| 2020-01-01|2020-01-01 00:00:00|
+---+-----+-----------+-------------------+

### Compaction

Running a compaction merges the changes we made in *upsert 1* and *upsert 2* into the base parquet files. After a compaction, the snapshot query and read optimized query return the same results.

An easy way to trigger a compaction is to do another write operation with ```option("hoodie.compact.inline", "true")``` and ```option("hoodie.compact.inline.max.delta.commits", "1")```.

We could also run a compaction from the [Hudi CLI][2] or by other methods.

[2]:https://hudi.apache.org/docs/0.7.0/deployment#compactions
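
As a rough illustration of how these two options interact, the toy model below uses plain Python with hypothetical names and is not Hudi's implementation. It assumes that each upsert adds one delta commit and that, when inline compaction is enabled, a compaction runs as part of the write once the number of delta commits since the last compaction reaches the configured threshold. The default values in the function signature are arbitrary choices for the toy model.

```python
def write_with_inline_compaction(base, deltas, rows, inline=False, max_delta_commits=1):
    """Toy write: append a delta commit, then compact if the threshold is reached.

    `base` is a dict and `deltas` a list of dicts. Returns the new (base, deltas) pair.
    """
    deltas = deltas + [dict(rows)]
    if inline and len(deltas) >= max_delta_commits:
        merged = dict(base)
        for delta in deltas:  # apply delta commits oldest first
            merged.update(delta)
        return merged, []     # no delta commits remain after compaction
    return dict(base), deltas


base, deltas = {"1": "Chris"}, []

# Two writes with inline compaction disabled: the base is unchanged.
base, deltas = write_with_inline_compaction(base, deltas, {"1": "Chris Sharkey"})
base, deltas = write_with_inline_compaction(base, deltas, {"1": "Chris M Sharkey"})
print(base["1"], len(deltas))  # Chris 2

# A third write with inline compaction enabled folds all three commits into the base.
base, deltas = write_with_inline_compaction(
    base, deltas, {"1": "Christopher M Sharkey"}, inline=True, max_delta_commits=1
)
print(base["1"], len(deltas))  # Christopher M Sharkey 0
```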

In [10]:
data = [
        ("1", "Christopher M Sharkey", "2020-01-01", datetime.strptime('2020-01-02 00:00:00', '%Y-%m-%d %H:%M:%S'))
]

schema = StructType([
        StructField("id", StringType(), False),
        StructField("name", StringType(), False),
        StructField("create_date", StringType(), False),             
        StructField("last_update_time", TimestampType(), False)
])

updateDF = spark.createDataFrame(data=data,schema=schema)

# Upsert the records in updateDF
updateDF \
    .write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .options(**hudiOptions) \
    .option("hoodie.compact.inline", "true") \
    .option("hoodie.compact.inline.max.delta.commits", "1") \
    .mode('append') \
    .save('s3://hudi-sharkech/merge_on_read_python/')

The snapshot query and read optimized query should now return the same results.

In [11]:
# Snapshot query
snapshotQueryDF = spark.read.format('org.apache.hudi').load('s3://hudi-sharkech/merge_on_read_python' + '/*/*')

snapshotQueryDF.select("id", "name", "create_date", "last_update_time").orderBy("id").show()

+---+--------------------+-----------+-------------------+
| id|                name|create_date|   last_update_time|
+---+--------------------+-----------+-------------------+
|  1|Christopher M Sha...| 2020-01-01|2020-01-02 00:00:00|
|  2|                Will| 2020-01-01|2020-01-01 00:00:00|
|  3|                Emma| 2020-01-01|2020-01-01 00:00:00|
|  4|                John| 2020-01-01|2020-01-01 00:00:00|
|  5|                Eric| 2020-01-01|2020-01-01 00:00:00|
|  6|                Adam| 2020-01-01|2020-01-01 00:00:00|
+---+--------------------+-----------+-------------------+

In [12]:
# Read optimized query
readOptimizedQueryDF = spark \
    .read \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.query.type', 'read_optimized') \
    .load('s3://hudi-sharkech/merge_on_read_python' + '/*/*')

readOptimizedQueryDF.select("id", "name", "create_date", "last_update_time").orderBy("id").show()

+---+--------------------+-----------+-------------------+
| id|                name|create_date|   last_update_time|
+---+--------------------+-----------+-------------------+
|  1|Christopher M Sha...| 2020-01-01|2020-01-02 00:00:00|
|  2|                Will| 2020-01-01|2020-01-01 00:00:00|
|  3|                Emma| 2020-01-01|2020-01-01 00:00:00|
|  4|                John| 2020-01-01|2020-01-01 00:00:00|
|  5|                Eric| 2020-01-01|2020-01-01 00:00:00|
|  6|                Adam| 2020-01-01|2020-01-01 00:00:00|
+---+--------------------+-----------+-------------------+