# Modern Data Lake Storage Layers - Apache Hudi

In [None]:
%%configure -f
{
    "conf": {
        "spark.jars": "hdfs:///apps/hudi/lib/hudi-spark-bundle.jar,hdfs:///apps/hudi/lib/spark-avro.jar",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.hive.convertMetastoreParquet": "false"
    }
}

In [None]:
%env S3_BUCKET_NAME=YOUR_S3_BUCKET_NAME

In [None]:
S3_BUCKET_NAME="YOUR_S3_BUCKET_NAME"

## The basics of Apache Hudi

We'll begin by following the [EMR Documentation for working with a Hudi Dataset](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html) that shows the minimal set of commands you need to get up and running. We'll start by running those commands and seeing what actually happens with the files on S3 as we progress.

First, we'll write a minimal set of sample data.

In [4]:
# Create a DataFrame
inputDF = spark.createDataFrame(
    [
        ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ["id", "creation_date", "last_update_time"],
)

# Specify common DataSourceWriteOptions in the single hudiOptions variable
hudiOptions = {
    "hoodie.table.name": "my_hudi_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "creation_date",
    "hoodie.datasource.write.precombine.field": "last_update_time",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.table": "my_hudi_table",
    "hoodie.datasource.hive_sync.partition_fields": "creation_date",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.index.type": "GLOBAL_BLOOM",  # This is required if we want to ensure we upsert a record, even if the partition changes
    "hoodie.bloom.index.update.partition.path": "true",  # This is required to write the data into the new partition (defaults to false in 0.8.0, true in 0.9.0)
}

# Write a DataFrame as a Hudi dataset
inputDF.write.format("org.apache.hudi").option(
    "hoodie.datasource.write.operation", "insert"
).options(**hudiOptions).mode("overwrite").save(f"s3://{S3_BUCKET_NAME}/tmp/hudi/")

Now let's take a quick look at our S3 bucket to see what's there.

In [5]:
%%sh

aws s3 ls s3://${S3_BUCKET_NAME}/tmp/hudi/

                           PRE .hoodie/
                           PRE 2015-01-01/
                           PRE 2015-01-02/
2022-02-01 19:36:01          0 .hoodie_$folder$
2022-02-01 19:36:09          0 2015-01-01_$folder$
2022-02-01 19:36:10          0 2015-01-02_$folder$


OK, so we've got a `.hoodie` metadata folder and then two different "partition" folders based on the `creation_date`. What's inside those?

In [6]:
%%sh

aws s3 ls s3://${S3_BUCKET_NAME}/tmp/hudi/ --recursive | tee /tmp/hudi_op_001

2022-02-01 19:36:01          0 tmp/hudi/.hoodie/.aux/.bootstrap/.fileids_$folder$
2022-02-01 19:36:01          0 tmp/hudi/.hoodie/.aux/.bootstrap/.partitions_$folder$
2022-02-01 19:36:01          0 tmp/hudi/.hoodie/.aux/.bootstrap_$folder$
2022-02-01 19:36:01          0 tmp/hudi/.hoodie/.aux_$folder$
2022-02-01 19:36:01          0 tmp/hudi/.hoodie/.temp_$folder$
2022-02-01 19:36:11       2706 tmp/hudi/.hoodie/20220201193557.commit
2022-02-01 19:36:03          0 tmp/hudi/.hoodie/20220201193557.commit.requested
2022-02-01 19:36:07       1842 tmp/hudi/.hoodie/20220201193557.inflight
2022-02-01 19:36:01          0 tmp/hudi/.hoodie/archived_$folder$
2022-02-01 19:36:01        503 tmp/hudi/.hoodie/hoodie.properties
2022-02-01 19:36:01          0 tmp/hudi/.hoodie_$folder$
2022-02-01 19:36:09         93 tmp/hudi/2015-01-01/.hoodie_partition_metadata
2022-02-01 19:36:10     434974 tmp/hudi/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-4-50_20220201193557.parquet
2022-02-01 19:36:09       

Awesome, so we can see some "commit" files in the `.hoodie` metadata folder with timestamps equal to when we inserted the data. 

And then we _also_ see two `.parquet` files that presumably contain the 6 rows we inserted. 

_NOTE_ There is _also_ a `.hoodie_partition_metadata` file in each partition...let's see what that is.

In [7]:
%%sh

aws s3 cp s3://${S3_BUCKET_NAME}/tmp/hudi/2015-01-01/.hoodie_partition_metadata -

#partition metadata
#Tue Feb 01 19:36:08 UTC 2022
commitTime=20220201193557
partitionDepth=1


## Updating Data

Alright, let's go ahead and update one of our rows.

In [8]:
from pyspark.sql.functions import lit

# Create a new DataFrame from the first row of inputDF with a different creation_date value
updateDF = inputDF.where("id = 100").withColumn("creation_date", lit("2022-01-11"))

updateDF.show()

updateDF.write.format("org.apache.hudi").option(
    "hoodie.datasource.write.operation", "upsert"
).options(**hudiOptions).mode("append").save(f"s3://{S3_BUCKET_NAME}/tmp/hudi/")

+---+-------------+--------------------+
| id|creation_date|    last_update_time|
+---+-------------+--------------------+
|100|   2022-01-11|2015-01-01T13:51:...|
+---+-------------+--------------------+

OK! Let's see what impact that had on our S3 files.

In [9]:
%%sh

aws s3 ls s3://${S3_BUCKET_NAME}/tmp/hudi/ --recursive > /tmp/hudi_op_002
diff -u /tmp/hudi_op_001 /tmp/hudi_op_002 || true

--- /tmp/hudi_op_001	2022-02-01 19:36:13.983085501 +0000
+++ /tmp/hudi_op_002	2022-02-01 19:36:32.319061451 +0000
@@ -6,12 +6,19 @@
 2022-02-01 19:36:11       2706 tmp/hudi/.hoodie/20220201193557.commit
 2022-02-01 19:36:03          0 tmp/hudi/.hoodie/20220201193557.commit.requested
 2022-02-01 19:36:07       1842 tmp/hudi/.hoodie/20220201193557.inflight
+2022-02-01 19:36:30       2705 tmp/hudi/.hoodie/20220201193615.commit
+2022-02-01 19:36:16          0 tmp/hudi/.hoodie/20220201193615.commit.requested
+2022-02-01 19:36:22       2560 tmp/hudi/.hoodie/20220201193615.inflight
 2022-02-01 19:36:01          0 tmp/hudi/.hoodie/archived_$folder$
 2022-02-01 19:36:01        503 tmp/hudi/.hoodie/hoodie.properties
 2022-02-01 19:36:01          0 tmp/hudi/.hoodie_$folder$
 2022-02-01 19:36:09         93 tmp/hudi/2015-01-01/.hoodie_partition_metadata
+2022-02-01 19:36:29     434922 tmp/hudi/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-36-13600_20220201193615.parquet
 2022-02-01 19:36:10  

So now we have a new `.commit` file and also new `parquet` files in the `2022-01-11` partition.

But! Notice that we _also_ have a new `.parquet` file in the `2015-01-01` partition. Let's do a quick query to see what our data looks like now.

We're only going to select the `id` and `creation_date` columns, plus the input `filename`.

In [10]:
from pyspark.sql.functions import input_file_name, regexp_replace

snapshotQueryDF = spark.read \
    .format('org.apache.hudi') \
    .load(f"s3://{S3_BUCKET_NAME}/tmp/hudi/") \
    .select('id', 'creation_date') \
    .withColumn("filename", regexp_replace(input_file_name(), S3_BUCKET_NAME, '<BUCKET>'))
    
snapshotQueryDF.show(truncate=False)

+---+-------------+----------------------------------------------------------------------------------------------------------+
|id |creation_date|filename                                                                                                  |
+---+-------------+----------------------------------------------------------------------------------------------------------+
|100|2022-01-11   |s3://<BUCKET>/tmp/hudi/2022-01-11/6c5d7a65-f1fb-4751-8bd1-a0a5ffc1caa2-0_1-42-13601_20220201193615.parquet|
|104|2015-01-02   |s3://<BUCKET>/tmp/hudi/2015-01-02/1feee853-39cf-437e-b705-96a2602b9f25-0_1-6-51_20220201193557.parquet    |
|105|2015-01-02   |s3://<BUCKET>/tmp/hudi/2015-01-02/1feee853-39cf-437e-b705-96a2602b9f25-0_1-6-51_20220201193557.parquet    |
|101|2015-01-01   |s3://<BUCKET>/tmp/hudi/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-36-13600_20220201193615.parquet|
|102|2015-01-01   |s3://<BUCKET>/tmp/hudi/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-36-13600_20220201

We can see here that the row with `2022-01-11` comes from a new Parquet file.

And all the values for the `2015-01-01` partition now come from the new Parquet file. This means that in order to update 1 row, we had to create 2 new Parquet files.
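
To make the "1 row, 2 files" effect concrete, here is a toy model in plain Python. It is not Hudi code: it assumes one base file per partition and a global index that can find a record key in any partition, and `upsert_with_partition_move` is a hypothetical helper written just for this illustration.

```python
# Toy model (not Hudi code): one base file per partition, keys looked up globally.
def upsert_with_partition_move(partitions, key, new_partition):
    """Move `key` to `new_partition`; return the partitions whose file is rewritten."""
    touched = set()
    for name, keys in partitions.items():
        if key in keys and name != new_partition:
            keys.discard(key)       # the old copy is dropped from its current file
            touched.add(name)
    partitions.setdefault(new_partition, set()).add(key)
    touched.add(new_partition)      # the new copy lands in the target partition
    return touched


# Same layout as the sample data: ids 100-103 and 104-105 in two partitions.
partitions = {
    "2015-01-01": {"100", "101", "102", "103"},
    "2015-01-02": {"104", "105"},
}
rewritten = upsert_with_partition_move(partitions, "100", "2022-01-11")
print(sorted(rewritten))  # ['2015-01-01', '2022-01-11']
```

Moving id 100 touches both the partition it leaves and the partition it enters, which matches the two new Parquet files in the listing, while `2015-01-02` is left alone.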

So wait...how is Hudi reading this data?! Let's do a "normal" read of the Parquet files to see what's in there...

In [11]:
from pyspark.sql.functions import split

rawDF = (
    spark.read.parquet(f"s3://{S3_BUCKET_NAME}/tmp/hudi/*/*.parquet")
    .withColumn("filename", split(input_file_name(), "tmp/hudi").getItem(1))
    .sort("_hoodie_commit_time", "_hoodie_commit_seqno")
)
rawDF.show(truncate=False)

+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---+-------------+---------------------------+------------------------------------------------------------------------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                       |id |creation_date|last_update_time           |filename                                                                            |
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---+-------------+---------------------------+------------------------------------------------------------------------------------+
|20220201193557     |20220201193557_0_1  |100               |2015-01-01            |141ef477-8767-4d67-afde-938ffa553306-0_0-4-50_202

We can see here that there are multiple records per `id`, both in the partition where we updated data and in the new partition. Specifically, ids 100-103 were initially in partition `2015-01-01` and 104-105 were in `2015-01-02`. We then updated id 100 to be in partition `2022-01-11`, so we see two sets of values for ids 100-103. Let's get a closer look...

In [12]:
rawDF.select("id", "creation_date", "filename").sort( "filename", "id").show(truncate=False)

+---+-------------+------------------------------------------------------------------------------------+
|id |creation_date|filename                                                                            |
+---+-------------+------------------------------------------------------------------------------------+
|101|2015-01-01   |/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-36-13600_20220201193615.parquet|
|102|2015-01-01   |/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-36-13600_20220201193615.parquet|
|103|2015-01-01   |/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-36-13600_20220201193615.parquet|
|100|2015-01-01   |/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-4-50_20220201193557.parquet    |
|101|2015-01-01   |/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-4-50_20220201193557.parquet    |
|102|2015-01-01   |/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-4-50_20220201193557.parquet    |
|103|2015-01-01   |/2015-01-01/141ef477-8767-4d67-afde-

So not only can we see the original file names and values, but also the new ones. In the background, Hudi figures out which commits and values to show. Let's also take a quick peek inside one of those commit files.

In [13]:
%%sh

commit_filename=$(aws s3 ls s3://${S3_BUCKET_NAME}/tmp/hudi/.hoodie/ | grep -E '\.commit$' | tr -s " " | cut -f4 -d\ | tail -n 1)
aws s3 cp s3://${S3_BUCKET_NAME}/tmp/hudi/.hoodie/${commit_filename} -

{
  "partitionToWriteStats" : {
    "2015-01-01" : [ {
      "fileId" : "141ef477-8767-4d67-afde-938ffa553306-0",
      "path" : "2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-36-13600_20220201193615.parquet",
      "prevCommit" : "20220201193557",
      "numWrites" : 3,
      "numDeletes" : 1,
      "numUpdateWrites" : 0,
      "numInserts" : 0,
      "totalWriteBytes" : 434922,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "2015-01-01",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 434922,
      "minEventTime" : null,
      "maxEventTime" : null
    } ],
    "2022-01-11" : [ {
      "fileId" : "6c5d7a65-f1fb-4751-8bd1-a0a5ffc1caa2-0",
      "path" : "2022-01-11/6c5d7a65-f1fb-4751-8bd1-a0a5ffc1caa2-0_1-42-13601_20220201193615.parquet

Oh cool, it's just a JSON file that gives information about the records written and deleted as well as the relevant paths.
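
Since it's plain JSON, the commit file is easy to summarize with the standard library. The sketch below uses a trimmed excerpt of the commit shown above (only the `2015-01-01` entry and a handful of its fields). `summarize_commit` is a hypothetical helper for this notebook, not part of Hudi.

```python
import json

# Excerpt of the commit file shown above, trimmed to one partition and a few fields.
commit_json = """
{
  "partitionToWriteStats": {
    "2015-01-01": [{
      "fileId": "141ef477-8767-4d67-afde-938ffa553306-0",
      "prevCommit": "20220201193557",
      "numWrites": 3,
      "numDeletes": 1,
      "numUpdateWrites": 0,
      "numInserts": 0
    }]
  }
}
"""


def summarize_commit(raw):
    """Return {partition: {"writes", "deletes", "inserts"}} summed over its files."""
    stats = json.loads(raw)["partitionToWriteStats"]
    summary = {}
    for partition, files in stats.items():
        summary[partition] = {
            "writes": sum(f["numWrites"] for f in files),
            "deletes": sum(f["numDeletes"] for f in files),
            "inserts": sum(f["numInserts"] for f in files),
        }
    return summary


summary = summarize_commit(commit_json)
print(summary)  # {'2015-01-01': {'writes': 3, 'deletes': 1, 'inserts': 0}}
```

The 3 writes and 1 delete are consistent with what we saw in the raw read: ids 101-103 were carried over into the new file and id 100 left the partition.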

## Deleting Data

We can also delete records! Hudi supports [two types of deletes](https://hudi.apache.org/docs/writing_data/#deletes) - soft and hard. We'll use a hard delete to remove any trace of the record.

In [14]:
updateDF.write.format("org.apache.hudi").option(
    "hoodie.datasource.write.operation", "upsert"
).option(
    "hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.EmptyHoodieRecordPayload",
).options(
    **hudiOptions
).mode(
    "append"
).save(
    f"s3://{S3_BUCKET_NAME}/tmp/hudi/"
)

And let's read the data again.

In [15]:
snapshotQueryDF = spark.read \
    .format('org.apache.hudi') \
    .load(f"s3://{S3_BUCKET_NAME}/tmp/hudi") \
    .select('id', 'creation_date') \
    .withColumn("filename", regexp_replace(input_file_name(), S3_BUCKET_NAME, '<BUCKET>'))
    
snapshotQueryDF.show(truncate=False)

+---+-------------+----------------------------------------------------------------------------------------------------------+
|id |creation_date|filename                                                                                                  |
+---+-------------+----------------------------------------------------------------------------------------------------------+
|104|2015-01-02   |s3://<BUCKET>/tmp/hudi/2015-01-02/1feee853-39cf-437e-b705-96a2602b9f25-0_1-6-51_20220201193557.parquet    |
|105|2015-01-02   |s3://<BUCKET>/tmp/hudi/2015-01-02/1feee853-39cf-437e-b705-96a2602b9f25-0_1-6-51_20220201193557.parquet    |
|101|2015-01-01   |s3://<BUCKET>/tmp/hudi/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-36-13600_20220201193615.parquet|
|102|2015-01-01   |s3://<BUCKET>/tmp/hudi/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-36-13600_20220201193615.parquet|
|103|2015-01-01   |s3://<BUCKET>/tmp/hudi/2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-36-13600_20220201

Cool! So the record for id 100 is gone...let's see what the files on S3 look like again.

In [16]:
%%sh

aws s3 ls s3://${S3_BUCKET_NAME}/tmp/hudi/ --recursive > /tmp/hudi_op_003
diff -u /tmp/hudi_op_002 /tmp/hudi_op_003 || true

--- /tmp/hudi_op_002	2022-02-01 19:36:32.319061451 +0000
+++ /tmp/hudi_op_003	2022-02-01 19:36:50.823038714 +0000
@@ -9,6 +9,9 @@
 2022-02-01 19:36:30       2705 tmp/hudi/.hoodie/20220201193615.commit
 2022-02-01 19:36:16          0 tmp/hudi/.hoodie/20220201193615.commit.requested
 2022-02-01 19:36:22       2560 tmp/hudi/.hoodie/20220201193615.inflight
+2022-02-01 19:36:48       1778 tmp/hudi/.hoodie/20220201193636.commit
+2022-02-01 19:36:37          0 tmp/hudi/.hoodie/20220201193636.commit.requested
+2022-02-01 19:36:41       1906 tmp/hudi/.hoodie/20220201193636.inflight
 2022-02-01 19:36:01          0 tmp/hudi/.hoodie/archived_$folder$
 2022-02-01 19:36:01        503 tmp/hudi/.hoodie/hoodie.properties
 2022-02-01 19:36:01          0 tmp/hudi/.hoodie_$folder$
@@ -20,5 +23,6 @@
 2022-02-01 19:36:11     434940 tmp/hudi/2015-01-02/1feee853-39cf-437e-b705-96a2602b9f25-0_1-6-51_20220201193557.parquet
 2022-02-01 19:36:10          0 tmp/hudi/2015-01-02_$folder$
 2022-02-01 19:36:29        

We can see the only difference is a new Parquet file in the `2022-01-11` partition and a new set of `.commit` files.

In [17]:
rawDF = spark.read.parquet(f"s3://{S3_BUCKET_NAME}/tmp/hudi/*/*.parquet")
rawDF.show(rawDF.count(), truncate=False)

+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---+-------------+---------------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                       |id |creation_date|last_update_time           |
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---+-------------+---------------------------+
|20220201193615     |20220201193615_1_7  |100               |2022-01-11            |6c5d7a65-f1fb-4751-8bd1-a0a5ffc1caa2-0_1-42-13601_20220201193615.parquet|100|2022-01-11   |2015-01-01T13:51:39.340396Z|
|20220201193557     |20220201193557_0_1  |100               |2015-01-01            |141ef477-8767-4d67-afde-938ffa553306-0_0-4-50_20220201193557.parquet    |100|2015-01-01   |2015-01-0

Note that only 1 file shows up for the `2022-01-11` partition: we deleted the only record in there, so the most recently committed Parquet file in that partition contains no rows.

## Overview Wrapup

OK, so to wrap up we learned a few things:
- Updating or deleting data rewrites the Parquet files that contain the affected rows
- Hudi figures out which set of data to show based on a combination of commit files and IDs in the filename
- This *all* happens inside Spark/Hudi itself - there is no external database that tracks this (although you _can_ [sync to an external metastore](https://hudi.apache.org/docs/syncing_metastore))
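
The second point above can be sketched in a few lines of Python. This is a simplification, not Hudi's actual implementation: it assumes base files are named `<fileId>_<writeToken>_<commitTime>.parquet` (a pattern inferred from the listings in this notebook), it ignores log files, and `latest_base_files` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# File names copied from the listings above.
paths = [
    "2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-4-50_20220201193557.parquet",
    "2015-01-01/141ef477-8767-4d67-afde-938ffa553306-0_0-36-13600_20220201193615.parquet",
    "2015-01-02/1feee853-39cf-437e-b705-96a2602b9f25-0_1-6-51_20220201193557.parquet",
    "2022-01-11/6c5d7a65-f1fb-4751-8bd1-a0a5ffc1caa2-0_1-42-13601_20220201193615.parquet",
]
# Commit times that have a completed `.commit` file in `.hoodie/`.
completed_commits = {"20220201193557", "20220201193615"}


def latest_base_files(paths, completed_commits):
    """Keep, per (partition, fileId), only the file from the newest completed commit."""
    latest = {}
    for path in paths:
        p = PurePosixPath(path)
        file_id, _write_token, instant = p.stem.split("_")
        if instant not in completed_commits:
            continue  # ignore files from commits that never completed
        key = (p.parent.name, file_id)
        if key not in latest or instant > latest[key][0]:
            latest[key] = (instant, path)
    return sorted(path for _, path in latest.values())


visible = latest_base_files(paths, completed_commits)
for path in visible:
    print(path)
```

With these inputs, the older `0-4-50` file in `2015-01-01` is skipped and the three remaining paths are the same files that showed up in the snapshot query after the update.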

One of the useful things about this approach is that it can support time travel. 😳 

You can perform something called an "incremental query" to see records that have changed since a given commit timestamp. In this case, let's take the most recent `_hoodie_commit_time` and see what the dataset looks like at that point. Let's fetch the last commit time, insert a new record, and then pull records since that time.

_We'll fetch this programmatically using a native Spark query as it will be different depending on when you run the notebook._

In [18]:
last_commit_time = rawDF.sort(rawDF._hoodie_commit_time.desc()).select("_hoodie_commit_time").limit(1).collect()[0]._hoodie_commit_time

# Create a DataFrame
newRecordsDF = spark.createDataFrame(
    [
        ("106", "2022-01-11", "2022-01-11T13:51:39.340396Z"),
    ],
    ["id", "creation_date", "last_update_time"],
)

# Write a DataFrame as a Hudi dataset
newRecordsDF.write.format("org.apache.hudi").option(
    "hoodie.datasource.write.operation", "insert"
).options(**hudiOptions).mode("append").save(f"s3://{S3_BUCKET_NAME}/tmp/hudi/")


# Read new data since the last commit
readOptions = {
  'hoodie.datasource.query.type': 'incremental',
  'hoodie.datasource.read.begin.instanttime': last_commit_time,
}

incQueryDF = spark.read \
    .format('org.apache.hudi') \
    .options(**readOptions) \
    .load(f"s3://{S3_BUCKET_NAME}/tmp/hudi")
    
incQueryDF.show()

+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|creation_date|    last_update_time|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------+--------------------+
|     20220201193653|  20220201193653_0_8|               106|            2022-01-11|6c5d7a65-f1fb-475...|106|   2022-01-11|2022-01-11T13:51:...|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------+--------------------+
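
Conceptually, an incremental query is a filter on `_hoodie_commit_time`. Here is a minimal sketch in plain Python, assuming the begin bound is exclusive and the optional end bound is inclusive. That assumption is inferred from the outputs in this notebook, not taken from the Hudi source, and `incremental_filter` is a hypothetical helper. Comparing the timestamps as strings works here because they are fixed-width digit strings.

```python
def incremental_filter(records, begin, end=None):
    """Keep records with begin < commit_time <= end (end=None means no upper bound)."""
    return [
        r
        for r in records
        if r["commit_time"] > begin and (end is None or r["commit_time"] <= end)
    ]


# A few (id, commit time) pairs taken from the query outputs in this notebook.
records = [
    {"id": "101", "commit_time": "20220201193557"},
    {"id": "100", "commit_time": "20220201193615"},
    {"id": "106", "commit_time": "20220201193653"},
]

# Everything written after the 20220201193615 commit.
new_ids = [r["id"] for r in incremental_filter(records, begin="20220201193615")]
print(new_ids)  # ['106']

# Everything written up to and including that commit.
old_ids = [r["id"] for r in incremental_filter(records, begin="0", end="20220201193615")]
print(old_ids)  # ['101', '100']
```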

In [19]:
# Or data _before_ the commit
readOptions = {
  'hoodie.datasource.query.type': 'incremental',
  'hoodie.datasource.read.begin.instanttime': '0',
  'hoodie.datasource.read.end.instanttime': last_commit_time,
}

incQueryDF = spark.read \
    .format('org.apache.hudi') \
    .options(**readOptions) \
    .load(f"s3://{S3_BUCKET_NAME}/tmp/hudi")
    
incQueryDF.show()

+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|creation_date|    last_update_time|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------+--------------------+
|     20220201193615|  20220201193615_1_7|               100|            2022-01-11|6c5d7a65-f1fb-475...|100|   2022-01-11|2015-01-01T13:51:...|
|     20220201193557|  20220201193557_1_5|               104|            2015-01-02|1feee853-39cf-437...|104|   2015-01-02|2015-01-01T12:15:...|
|     20220201193557|  20220201193557_1_6|               105|            2015-01-02|1feee853-39cf-437...|105|   2015-01-02|2015-01-01T13:51:...|
|     20220201193557|  20220201193557_0_2|               101|            2015-01-01|141ef477-8767-4d6...|101|   2015-01-01|2015-01

In [20]:
first_commit_time = rawDF.sort(rawDF._hoodie_commit_time.asc()).select("_hoodie_commit_time").limit(1).collect()[0]._hoodie_commit_time

# And finally data from the first version of the table
readOptions = {
  'hoodie.datasource.query.type': 'incremental',
  'hoodie.datasource.read.begin.instanttime': '0',
  'hoodie.datasource.read.end.instanttime': first_commit_time,
}

incQueryDF = spark.read \
    .format('org.apache.hudi') \
    .options(**readOptions) \
    .load(f"s3://{S3_BUCKET_NAME}/tmp/hudi")
    
incQueryDF.show()

+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|creation_date|    last_update_time|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------+--------------------+
|     20220201193557|  20220201193557_0_1|               100|            2015-01-01|141ef477-8767-4d6...|100|   2015-01-01|2015-01-01T13:51:...|
|     20220201193557|  20220201193557_0_2|               101|            2015-01-01|141ef477-8767-4d6...|101|   2015-01-01|2015-01-01T12:14:...|
|     20220201193557|  20220201193557_0_3|               102|            2015-01-01|141ef477-8767-4d6...|102|   2015-01-01|2015-01-01T13:51:...|
|     20220201193557|  20220201193557_0_4|               103|            2015-01-01|141ef477-8767-4d6...|103|   2015-01-01|2015-01

## Optimization: Copy on Write vs. Merge on Read

One of the interesting things about Hudi is that it has two different storage types - Copy on Write (COW) and Merge on Read (MOR) - that you can choose from depending on your workload. You can find more about the [difference between COW and MOR in the Hudi docs](https://hudi.apache.org/learn/faq#what-is-the-difference-between-copy-on-write-cow-vs-merge-on-read-mor-storage-types). 

In short, COW is great if you don't need real-time data and only have sparse updates. MOR is great if you want to be able to write **AND** query your data as fast as possible.

Let's see what impact changing the storage type has. Copy on Write is the default and what was used above. You can change the storage type by setting the `hoodie.datasource.write.storage.type` option. We'll take our same `inputDF` dataframe and write it with this added option.

In [21]:
hudiOptions['hoodie.datasource.write.storage.type'] = 'MERGE_ON_READ'

inputDF.write.format("org.apache.hudi").option(
    "hoodie.datasource.write.operation", "insert"
).options(**hudiOptions).mode("overwrite").save(f"s3://{S3_BUCKET_NAME}/tmp/hudi_mor/")

In [22]:
%%sh

aws s3 ls s3://${S3_BUCKET_NAME}/tmp/hudi_mor/ --recursive | tee /tmp/hudi_op_mor_001

2022-02-01 19:37:03          0 tmp/hudi_mor/.hoodie/.aux/.bootstrap/.fileids_$folder$
2022-02-01 19:37:02          0 tmp/hudi_mor/.hoodie/.aux/.bootstrap/.partitions_$folder$
2022-02-01 19:37:02          0 tmp/hudi_mor/.hoodie/.aux/.bootstrap_$folder$
2022-02-01 19:37:02          0 tmp/hudi_mor/.hoodie/.aux_$folder$
2022-02-01 19:37:02          0 tmp/hudi_mor/.hoodie/.temp_$folder$
2022-02-01 19:37:06       2726 tmp/hudi_mor/.hoodie/20220201193700.deltacommit
2022-02-01 19:37:03       1842 tmp/hudi_mor/.hoodie/20220201193700.deltacommit.inflight
2022-02-01 19:37:03          0 tmp/hudi_mor/.hoodie/20220201193700.deltacommit.requested
2022-02-01 19:37:02          0 tmp/hudi_mor/.hoodie/archived_$folder$
2022-02-01 19:37:03        595 tmp/hudi_mor/.hoodie/hoodie.properties
2022-02-01 19:37:02          0 tmp/hudi_mor/.hoodie_$folder$
2022-02-01 19:37:04         93 tmp/hudi_mor/2015-01-01/.hoodie_partition_metadata
2022-02-01 19:37:04     435021 tmp/hudi_mor/2015-01-01/ae2faccc-5620-49b0-a1

So far, everything looks similar...let's go ahead and upsert some new data.

In [23]:
updateDF.write.format("org.apache.hudi").option(
    "hoodie.datasource.write.operation", "upsert"
).options(**hudiOptions).mode("append").save(f"s3://{S3_BUCKET_NAME}/tmp/hudi_mor/")

In [24]:
%%sh

aws s3 ls s3://${S3_BUCKET_NAME}/tmp/hudi_mor/ --recursive > /tmp/hudi_op_mor_002
diff -u /tmp/hudi_op_mor_001 /tmp/hudi_op_mor_002 || true

--- /tmp/hudi_op_mor_001	2022-02-01 19:37:08.179017504 +0000
+++ /tmp/hudi_op_mor_002	2022-02-01 19:37:21.915001376 +0000
@@ -6,12 +6,19 @@
 2022-02-01 19:37:06       2726 tmp/hudi_mor/.hoodie/20220201193700.deltacommit
 2022-02-01 19:37:03       1842 tmp/hudi_mor/.hoodie/20220201193700.deltacommit.inflight
 2022-02-01 19:37:03          0 tmp/hudi_mor/.hoodie/20220201193700.deltacommit.requested
+2022-02-01 19:37:21       2943 tmp/hudi_mor/.hoodie/20220201193708.deltacommit
+2022-02-01 19:37:13       2560 tmp/hudi_mor/.hoodie/20220201193708.deltacommit.inflight
+2022-02-01 19:37:09          0 tmp/hudi_mor/.hoodie/20220201193708.deltacommit.requested
 2022-02-01 19:37:02          0 tmp/hudi_mor/.hoodie/archived_$folder$
 2022-02-01 19:37:03        595 tmp/hudi_mor/.hoodie/hoodie.properties
 2022-02-01 19:37:02          0 tmp/hudi_mor/.hoodie_$folder$
+2022-02-01 19:37:20        842 tmp/hudi_mor/2015-01-01/.ae2faccc-5620-49b0-a11b-f3d0face20c4-0_20220201193700.log.1_0-150-40834
 2022-02-

Here we start to see some of the differences between COW and MOR - with MOR, we simply see a new Parquet file added in the new `2022-01-11` partition as well as the addition of a `.log` file in the `2015-01-01` partition. What's in there?

In [25]:
%%sh

LOG_FILE_KEY=$(cat /tmp/hudi_op_mor_002 | tr -s " " | cut -f4 -d\ | grep '.log.')

aws s3 cp s3://${S3_BUCKET_NAME}/${LOG_FILE_KEY} -

#HUDI#      <              �{"type":"record","name":"my_hudi_table_record","namespace":"hoodie.my_hudi_table","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":["null","string"],"default":null},{"name":"creation_date","type":"string"},{"name":"last_update_time","type":["null","string"],"default":null}]}       20220201193708       m      e [Lorg.apache.hudi.common.model.HoodieKey�org.apache.hudi.common.model.HoodieKe�2015-01-0�10�          B

We can see that this is a binary file that contains some information about the update. Let's read the Parquet files to see what's in them.

In [26]:
rawDF = spark.read.parquet(f"s3://{S3_BUCKET_NAME}/tmp/hudi_mor/*/*.parquet")
rawDF.show(rawDF.count(), truncate=False)

+-------------------+--------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------------+---------------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |id |creation_date|last_update_time           |
+-------------------+--------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------------+---------------------------+
|20220201193700     |20220201193700_0_9  |100               |2015-01-01            |ae2faccc-5620-49b0-a11b-f3d0face20c4-0_0-121-27300_20220201193700.parquet|100|2015-01-01   |2015-01-01T13:51:39.340396Z|
|20220201193700     |20220201193700_0_10 |101               |2015-01-01            |ae2faccc-5620-49b0-a11b-f3d0face20c4-0_0-121-27300_20220201193700.parquet|101|2015-01-01   |2015

We can see here that **ONLY** the new data was added to the set of Parquet files, as opposed to Copy on Write, where new Parquet files were written in every partition where data was changed.
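
The contrast between the two storage types can be captured in a toy model. The classes below are hypothetical and written in plain Python, not Hudi code: they ignore deletes, indexing, and the compaction step that real Merge on Read tables use to fold log files back into base files.

```python
class CopyOnWriteFile:
    """Toy COW file group: every update writes a full new base file version."""

    def __init__(self, rows):
        self.versions = [dict(rows)]

    def upsert(self, key, value):
        new_version = dict(self.versions[-1])  # copy every existing row
        new_version[key] = value
        self.versions.append(new_version)

    def read(self):
        return self.versions[-1]               # readers just take the newest file


class MergeOnReadFile:
    """Toy MOR file group: updates go to a log and are merged when read."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.log = []

    def upsert(self, key, value):
        self.log.append((key, value))          # small append, base file untouched

    def read(self):
        merged = dict(self.base)
        for key, value in self.log:            # replay log entries in order
            merged[key] = value
        return merged


rows = {"101": "v1", "102": "v1", "103": "v1"}
cow, mor = CopyOnWriteFile(rows), MergeOnReadFile(rows)
cow.upsert("101", "v2")
mor.upsert("101", "v2")

print(cow.read() == mor.read())          # True: both readers see the same data
print(len(cow.versions), len(mor.log))   # 2 1: COW rewrote 3 rows, MOR logged 1 entry
```

Both variants return the same rows. The difference is where the cost lands: the COW write copies the whole file, while the MOR write is a small append and the merge work moves to read time.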