# Deep Dive into Hudi Table & Query Types: Snapshot, RO, Incremental, Time Travel, CDC
This notebook is a hands-on guide to Hudi's query capabilities. We'll work through examples of each read mode (Snapshot, Read-Optimized, Incremental, Time Travel, and Change Data Capture) to show when each one fits in a data pipeline.

## Setting up the Environment
We begin by loading the utils.ipynb notebook, which contains the necessary imports and functions to start a SparkSession.

In [1]:
%run utils.ipynb

Now, let's start the SparkSession. We'll give it the app name 'Query-Types' and configure it to use our Hudi and MinIO settings.

In [2]:
%%capture
spark = get_spark("Query-Types")

25/08/14 11:24:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/08/14 11:24:29 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/08/14 11:24:29 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


Before we can start querying, we need to create our Hudi tables. For this deep dive, we'll create one table for each of Hudi's two table types:

- **trips_table_cow:** Our Copy-on-Write (COW) table, which we'll use to demonstrate how Hudi rewrites files on updates.
- **trips_table_mor:** Our Merge-on-Read (MOR) table, which will help us understand how Hudi uses log files for faster updates and different read views.

After creating both tables, we'll be ready to explore all the query types.

This is the sample ride data we will use to create our Hudi table. It includes details like the timestamp, a unique ID, rider, driver, fare, and city.

In [3]:
columns = ["ts", "uuid", "rider", "driver", "fare", "city"]
data = [
    ("2025-08-10 08:15:30", "uuid-001", "rider-A", "driver-X", 18.50, "new_york"),
    ("2025-08-10 09:22:10", "uuid-002", "rider-B", "driver-Y", 22.75, "san_francisco"),
    ("2025-08-10 10:05:45", "uuid-003", "rider-C", "driver-Z", 14.60, "chicago"),
    ("2025-08-10 11:40:00", "uuid-004", "rider-D", "driver-W", 31.90, "new_york"),
    ("2025-08-10 12:55:15", "uuid-005", "rider-E", "driver-V", 25.10, "san_francisco"),
    ("2025-08-10 13:20:35", "uuid-006", "rider-F", "driver-U", 19.80, "chicago"),
    ("2025-08-10 14:10:05", "uuid-007", "rider-G", "driver-T", 28.45, "san_francisco"),
    ("2025-08-10 15:00:20", "uuid-008", "rider-H", "driver-S", 16.25, "new_york"),
    ("2025-08-10 15:45:50", "uuid-009", "rider-I", "driver-R", 24.35, "chicago"),
    ("2025-08-10 16:30:00", "uuid-010", "rider-J", "driver-Q", 20.00, "new_york"),
]

In [5]:
inputDF = spark.createDataFrame(data).toDF(*columns)
display(inputDF)

ts,uuid,rider,driver,fare,city
2025-08-10 08:15:30,uuid-001,rider-A,driver-X,18.5,new_york
2025-08-10 09:22:10,uuid-002,rider-B,driver-Y,22.75,san_francisco
2025-08-10 10:05:45,uuid-003,rider-C,driver-Z,14.6,chicago
2025-08-10 11:40:00,uuid-004,rider-D,driver-W,31.9,new_york
2025-08-10 12:55:15,uuid-005,rider-E,driver-V,25.1,san_francisco
2025-08-10 13:20:35,uuid-006,rider-F,driver-U,19.8,chicago
2025-08-10 14:10:05,uuid-007,rider-G,driver-T,28.45,san_francisco
2025-08-10 15:00:20,uuid-008,rider-H,driver-S,16.25,new_york
2025-08-10 15:45:50,uuid-009,rider-I,driver-R,24.35,chicago
2025-08-10 16:30:00,uuid-010,rider-J,driver-Q,20.0,new_york


Hudi offers two primary table types to choose from:

- **Copy-on-Write (COW)**
- **Merge-on-Read (MOR)**

### Hudi Configuration for a COW Table

In [7]:
table_name_cow = "trips_table_cow"
base_path = "s3a://warehouse/hudi-db"  # Plain string: no placeholders, so no f-string needed.

cow_hudi_conf = {
    "hoodie.table.name": table_name_cow, # The name of our Hudi table.
    "hoodie.datasource.write.recordkey.field": "uuid", # The column that acts as the unique identifier for each record.
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE", # Hudi uses Copy-on-Write as the default table type, but we are being explicit here.
    "hoodie.datasource.write.partitionpath.field": "city", # The column Hudi uses to partition the data on storage.
    "hoodie.datasource.write.precombine.field": "ts", # The field used to deduplicate records when a conflict occurs.
    "hoodie.write.markers.type": "DIRECT",
    "hoodie.table.cdc.enabled": "true",
    "hoodie.datasource.write.hive_style_partitioning": "true" # This ensures partition directories are named like `city=new_york`.
}
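
To make the role of the precombine field concrete, here is a minimal plain-Python sketch of the deduplication idea: when a batch carries several records with the same key, the one with the largest value in the ordering field is kept. The function name `precombine` and the sample batch are illustrative assumptions, not Hudi's API, and Hudi's real merge logic is configurable and more involved.

```python
def precombine(records, key_field="uuid", ordering_field="ts"):
    """Keep one record per key: the one with the largest ordering value."""
    latest = {}
    for rec in records:
        key = rec[key_field]
        # On equal ordering values the later record in the batch wins.
        # This tie-break is an arbitrary choice made for this sketch.
        if key not in latest or rec[ordering_field] >= latest[key][ordering_field]:
            latest[key] = rec
    return list(latest.values())


# Hypothetical batch with two versions of uuid-007.
batch = [
    {"uuid": "uuid-007", "ts": "2025-08-10 14:10:05", "fare": 28.45},
    {"uuid": "uuid-007", "ts": "2025-08-10 14:30:00", "fare": 30.00},
    {"uuid": "uuid-008", "ts": "2025-08-10 15:00:20", "fare": 16.25},
]

deduped = precombine(batch)
by_key = {rec["uuid"]: rec for rec in deduped}
print(by_key["uuid-007"]["fare"])
```

The timestamps here are strings in a fixed format, so comparing them lexicographically matches chronological order.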

### Inserting data in a COW Table (Fresh Insert)

In [8]:
# Write the DataFrame to a Hudi COW table
# The default operation is "upsert" if this is not specified.
inputDF.write \
    .format("hudi") \
    .option("hoodie.datasource.write.operation", "insert") \
    .options(**cow_hudi_conf) \
    .mode("overwrite") \
    .save(f"{base_path}/{table_name_cow}")



                                                                                

### Hudi Configuration for a MOR Table

In [9]:
table_name_mor = "trips_table_mor"
base_path = "s3a://warehouse/hudi-db"  # Same base path as the COW table.

mor_hudi_conf = {
    "hoodie.table.name": table_name_mor,
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.write.markers.type": "DIRECT",
    "hoodie.datasource.write.hive_style_partitioning": "true"
}

### Inserting Data into a MOR Table (Fresh Insert)

In [10]:
# Write the DataFrame to a Hudi MOR table
inputDF.write \
    .format("hudi") \
    .option("hoodie.datasource.write.operation", "insert") \
    .options(**mor_hudi_conf) \
    .mode("overwrite") \
    .save(f"{base_path}/{table_name_mor}")

Now that our tables are set up, we can begin our deep dive into Hudi's powerful query types. In this section, we will discuss:

- **Snapshot Query:** The default read mode for viewing the latest state of the table.
- **Incremental Query:** A way to read only the records that were inserted or updated after a given commit.
- **Time Travel Query:** How to view the table as it existed at a past moment.
- **Read-Optimized (RO) Query:** A specialized mode for faster reads on MOR tables.
- **Change Data Capture (CDC) Query:** How to retrieve a detailed stream of changes (updates, inserts, and deletes).

### Snapshot Query
This is the default query type when reading Hudi tables. Its goal is to give you a complete, up-to-the-minute view of your data. When you run this query on a Merge-on-Read (MOR) table, Hudi merges the recent changes from the log files with the base data files to present the latest records, which can affect performance.

Let's do a quick snapshot query to see the current state of our tables.

In [11]:
cowSnapshotQueryDF = spark.read \
        .format("hudi") \
        .load(f"{base_path}/{table_name_cow}" + "/*/*")

display(cowSnapshotQueryDF.select("_hoodie_commit_time", "uuid", "rider", "driver", "fare", "city", "ts"))

_hoodie_commit_time,uuid,rider,driver,fare,city,ts
20250814124928814,uuid-002,rider-B,driver-Y,22.75,san_francisco,2025-08-10 09:22:10
20250814124928814,uuid-005,rider-E,driver-V,25.1,san_francisco,2025-08-10 12:55:15
20250814124928814,uuid-007,rider-G,driver-T,28.45,san_francisco,2025-08-10 14:10:05
20250814124928814,uuid-001,rider-A,driver-X,18.5,new_york,2025-08-10 08:15:30
20250814124928814,uuid-004,rider-D,driver-W,31.9,new_york,2025-08-10 11:40:00
20250814124928814,uuid-008,rider-H,driver-S,16.25,new_york,2025-08-10 15:00:20
20250814124928814,uuid-010,rider-J,driver-Q,20.0,new_york,2025-08-10 16:30:00
20250814124928814,uuid-003,rider-C,driver-Z,14.6,chicago,2025-08-10 10:05:45
20250814124928814,uuid-006,rider-F,driver-U,19.8,chicago,2025-08-10 13:20:35
20250814124928814,uuid-009,rider-I,driver-R,24.35,chicago,2025-08-10 15:45:50


Now let's update one record in the COW table. We read the trip for rider-G and multiply its fare by 10.

In [15]:
from pyspark.sql.functions import col
updatesDF = spark.read.format("hudi").load(f"{base_path}/{table_name_cow}").filter(col("rider") == "rider-G").withColumn("fare", col("fare") * 10)

display(updatesDF.select("uuid", "rider", "driver", "fare", "city", "ts"))

uuid,rider,driver,fare,city,ts
uuid-007,rider-G,driver-T,284.5,san_francisco,2025-08-10 14:10:05


In [17]:
updatesDF.write \
    .format("hudi") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .options(**cow_hudi_conf) \
    .mode("append") \
    .save(f"{base_path}/{table_name_cow}")

Let's run the snapshot query again to confirm that it returns the latest view of the table. As the output shows, it picks up the updated fare.

In [19]:
cowSnapshotQueryDF = spark.read \
        .format("hudi") \
        .load(f"{base_path}/{table_name_cow}" + "/*/*")

display(cowSnapshotQueryDF.select("_hoodie_commit_time", "uuid", "rider", "driver", "fare", "city", "ts").filter(col("rider") == "rider-G"))

_hoodie_commit_time,uuid,rider,driver,fare,city,ts
20250814134348351,uuid-007,rider-G,driver-T,284.5,san_francisco,2025-08-10 14:10:05


### Incremental Reads
Hudi's incremental query lets us process only the records that changed after a given commit instead of rescanning the whole table. We first list the distinct commit times in the table, then take the most recent one (the commit created by our update) as the starting point for the incremental read.

In [23]:
# Get distinct commit times ordered
commits_df = spark.read.format("hudi").load(f"{base_path}/{table_name_cow}") \
    .select("_hoodie_commit_time") \
    .distinct() \
    .orderBy("_hoodie_commit_time")

# Collect top 50 commit times as a list
commits = [row['_hoodie_commit_time'] for row in commits_df.take(50)]

incrementalTime = commits[-1]  # Commit time we are interested in
display(commits_df)
print(f"Incremental commit time: {incrementalTime}")

_hoodie_commit_time
20250814124928814
20250814134348351


Incremental commit time: 20250814134348351


In [24]:
incremental_read_options = {
  'hoodie.datasource.query.type': 'incremental',
  'hoodie.datasource.read.begin.instanttime': incrementalTime,
}

incrementalQueryDF = spark.read.format("hudi") \
  .options(**incremental_read_options) \
  .load(f"{base_path}/{table_name_cow}")

incrementalQueryDF.createOrReplaceTempView("trips_incremental")

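When we query the temporary incremental view, it returns only the single record written by the update commit. Note that in this run the begin instant itself is included in the result. How this boundary is treated can differ between Hudi versions, so verify the behavior on the version you run.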

In [25]:
display(spark.sql("select _hoodie_commit_time, uuid, rider, driver, fare, city, ts from trips_incremental"))

_hoodie_commit_time,uuid,rider,driver,fare,city,ts
20250814134348351,uuid-007,rider-G,driver-T,284.5,san_francisco,2025-08-10 14:10:05
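
Incremental reads are often bounded on both sides, for example when a pipeline processes one window of commits per run. The sketch below builds the reader options for such a window. The helper name `incremental_options` is hypothetical; the option keys are the ones used in this notebook, plus `hoodie.datasource.read.end.instanttime` for the upper bound.

```python
def incremental_options(begin_instant, end_instant=None, cdc=False):
    """Build Hudi reader options for an incremental (optionally CDC) query."""
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Upper bound of the window; omit it to read up to the latest commit.
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    if cdc:
        # Return change records (op, before, after) instead of plain rows.
        opts["hoodie.datasource.query.incremental.format"] = "cdc"
    return opts


window_opts = incremental_options("20250814124928814", "20250814134348351")
print(window_opts)

# Usage inside the notebook (requires the Spark session created earlier):
# spark.read.format("hudi").options(**window_opts).load(f"{base_path}/{table_name_cow}")
```

Keeping the options in one helper makes it harder to mistype a key, since Spark silently ignores unknown reader options.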


### Time Travel Query
Hudi also allows for time travel, which means we can query the state of our table at a specific point in the past. By specifying the commit time from our initial data insertion, we can view the table's contents before we performed the update.

In [30]:
beginTime = commits[-2]  # Commit time we are interested in
print(f"Begin/Initial commit time: {beginTime}")

Begin/Initial commit time: 20250814124928814


In [31]:
spark.read.format("hudi") \
  .option("as.of.instant", beginTime) \
  .load(f"{base_path}/{table_name_cow}").createOrReplaceTempView("trips_time_travel")

In [32]:
display(spark.sql("select _hoodie_commit_time, uuid, rider, driver, fare, city, ts from trips_time_travel"))

_hoodie_commit_time,uuid,rider,driver,fare,city,ts
20250814124928814,uuid-002,rider-B,driver-Y,22.75,san_francisco,2025-08-10 09:22:10
20250814124928814,uuid-005,rider-E,driver-V,25.1,san_francisco,2025-08-10 12:55:15
20250814124928814,uuid-007,rider-G,driver-T,28.45,san_francisco,2025-08-10 14:10:05
20250814124928814,uuid-001,rider-A,driver-X,18.5,new_york,2025-08-10 08:15:30
20250814124928814,uuid-004,rider-D,driver-W,31.9,new_york,2025-08-10 11:40:00
20250814124928814,uuid-008,rider-H,driver-S,16.25,new_york,2025-08-10 15:00:20
20250814124928814,uuid-010,rider-J,driver-Q,20.0,new_york,2025-08-10 16:30:00
20250814124928814,uuid-003,rider-C,driver-Z,14.6,chicago,2025-08-10 10:05:45
20250814124928814,uuid-006,rider-F,driver-U,19.8,chicago,2025-08-10 13:20:35
20250814124928814,uuid-009,rider-I,driver-R,24.35,chicago,2025-08-10 15:45:50


As you can see, querying the historical view shows the original fare for 'rider-G' before we updated it. This is a great way to audit or restore data from the past.
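
The commit times in this notebook are 17-digit strings in the form yyyyMMddHHmmssSSS. When you want to time travel to "the table as of some wall-clock time" rather than to a known commit, it helps to convert between these strings and datetimes. The helpers below are a small sketch with hypothetical names (`parse_instant`, `format_instant`, `latest_commit_at_or_before`), and they assume every instant has exactly 17 digits.

```python
from datetime import datetime


def parse_instant(instant):
    """Convert a 17-digit instant (yyyyMMddHHmmssSSS) to a datetime."""
    base = datetime.strptime(instant[:14], "%Y%m%d%H%M%S")
    return base.replace(microsecond=int(instant[14:17]) * 1000)


def format_instant(dt):
    """Convert a datetime back to a 17-digit instant string."""
    return dt.strftime("%Y%m%d%H%M%S") + f"{dt.microsecond // 1000:03d}"


def latest_commit_at_or_before(commits, cutoff):
    """Pick the most recent commit that is not later than the cutoff."""
    eligible = [c for c in commits if parse_instant(c) <= cutoff]
    return max(eligible) if eligible else None


commits_seen = ["20250814124928814", "20250814134348351"]
as_of = latest_commit_at_or_before(commits_seen, datetime(2025, 8, 14, 13, 0, 0))
print(as_of)  # 20250814124928814
```

The selected value can then be passed as the `as.of.instant` option, exactly as `beginTime` is used above.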

### Change Data Capture (CDC)
Hudi's Change Data Capture (CDC) feature lets you read a stream of all the changes (inserts, updates, and deletes) that have been applied to your table. This is useful for downstream systems that need to react to data modifications as they happen. CDC must be enabled on the table, which we did earlier by setting hoodie.table.cdc.enabled to true in the COW configuration. We'll start by adding a new record and updating an existing one to generate some changes.

In [33]:
from pyspark.sql.functions import lit
from pyspark.sql import Row

# Define a DataFrame with one new record and one updated record
cdc_data = [
    ("2025-08-11 10:00:00", "uuid-011", "rider-K", "driver-P", 10.50, "chicago"), # new record
    ("2025-08-10 09:22:10", "uuid-002", "rider-B", "driver-Y", 50.00, "san_francisco") # updated record
]

cdc_columns = ["ts", "uuid", "rider", "driver", "fare", "city"]
cdcDF = spark.createDataFrame(cdc_data).toDF(*cdc_columns)

Now, we'll perform an upsert with our new data. This will create a new commit with one insert and one update.

In [35]:
cdcDF.write \
    .format("hudi") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .options(**cow_hudi_conf) \
    .mode("append") \
    .save(f"{base_path}/{table_name_cow}")

To see the changes from this specific transaction, we'll first get its commit time. We'll then use this as our starting point for the CDC query to capture all the changes from that moment forward.

In [36]:
from pyspark.sql import functions as F

# Find the most recent commit time.
# Using the F alias avoids shadowing Python's built-in max().
latest_commit_time = spark.read.format("hudi").load(f"{base_path}/{table_name_cow}") \
    .agg(F.max("_hoodie_commit_time")) \
    .collect()[0][0]

print(f"Latest commit time: {latest_commit_time}")

Latest commit time: 20250814144129688


Now we can perform a CDC query using a special incremental format. We'll set the query type to "incremental" and specify "hoodie.datasource.query.incremental.format": "cdc". By using the latest_commit_time we just fetched, we can capture all the changes from our last commit. The output will include the op column, which tells us whether a record was inserted, updated, or deleted.

In [41]:
cdc_read_options = {
  'hoodie.datasource.query.type': 'incremental',
  'hoodie.datasource.read.begin.instanttime': latest_commit_time,
  'hoodie.datasource.query.incremental.format': 'cdc'
}

cdcQueryDF = spark.read.format("hudi") \
  .options(**cdc_read_options) \
  .load(f"{base_path}/{table_name_cow}")

# show() returns None, so keep the DataFrame in its own variable before displaying it.
cdcQueryDF.show(truncate=False)

+---+-----+------+-----+
|op |ts_ms|before|after|
+---+-----+------+-----+
+---+-----+------+-----+



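The output captured above contains only the CDC schema (op, ts_ms, before, after) and no rows, so this run did not return the changes from the upsert. A likely cause is the choice of begin instant: we started the query at the latest commit itself, and the handling of that boundary can differ between Hudi versions and query formats. Starting from an earlier instant, such as the commit before the upsert, should bring the upsert into range. When the query does cover that commit, we expect two change records:

**Update:** A record where **op is u**, corresponding to the update we made to uuid-002. The before column holds the original fare of 22.75, and the after column holds the new fare of 50.0.

**Insert:** A record where **op is i** for the new uuid-011. The before column is null because the record did not exist before this commit, while the after column contains the full new record.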
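
To make the shape of CDC change records concrete, here is a plain-Python sketch that derives op, before, and after from two snapshots of a table keyed by uuid. It only illustrates the record layout. It is not how Hudi computes CDC internally, and the function name `cdc_records` and the trimmed sample data are assumptions for this example.

```python
def cdc_records(before, after):
    """Derive change records from two snapshots, each a dict keyed by record key."""
    changes = []
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old is None:
            changes.append({"op": "i", "before": None, "after": new})
        elif new is None:
            changes.append({"op": "d", "before": old, "after": None})
        elif old != new:
            changes.append({"op": "u", "before": old, "after": new})
        # Unchanged records produce no change record.
    return changes


# Trimmed-down snapshots around the upsert (only the fare column is kept).
snapshot_before = {"uuid-002": {"fare": 22.75}, "uuid-003": {"fare": 14.60}}
snapshot_after = {
    "uuid-002": {"fare": 50.00},
    "uuid-003": {"fare": 14.60},
    "uuid-011": {"fare": 10.50},
}

changes = cdc_records(snapshot_before, snapshot_after)
for change in changes:
    print(change)
```

With this input the sketch yields one update for uuid-002 and one insert for uuid-011, while the untouched uuid-003 is skipped.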

As with the COW table, let's query the MOR table. Because we have not applied any updates to it yet, it contains only base data files, which we can confirm by inspecting the file system.

In [12]:
ls(f"{base_path}/{table_name_mor}")

s3a://warehouse/hudi-db/trips_table_mor/.hoodie
s3a://warehouse/hudi-db/trips_table_mor/city=chicago
s3a://warehouse/hudi-db/trips_table_mor/city=new_york
s3a://warehouse/hudi-db/trips_table_mor/city=san_francisco


In [13]:
ls(f"{base_path}/{table_name_mor}/city=new_york")

s3a://warehouse/hudi-db/trips_table_mor/city=new_york/.hoodie_partition_metadata
s3a://warehouse/hudi-db/trips_table_mor/city=new_york/4791ceb3-88de-47b8-9b63-da6bed09fc5b-0_2-61-160_20250814125712793.parquet
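
The base file name in the listing encodes useful metadata. Assuming the usual Hudi layout of file ID, write token, and instant time separated by underscores, the sketch below splits a name into those parts. The helper name `parse_base_file_name` is hypothetical, and the layout is an assumption you should confirm for your Hudi version.

```python
def parse_base_file_name(name):
    """Split '<fileId>_<writeToken>_<instantTime>.<ext>' into its parts."""
    stem, _, extension = name.rpartition(".")
    # Split from the right so a file ID containing underscores stays intact.
    file_id, write_token, instant_time = stem.rsplit("_", 2)
    return {
        "file_id": file_id,
        "write_token": write_token,
        "instant_time": instant_time,
        "extension": extension,
    }


parts = parse_base_file_name(
    "4791ceb3-88de-47b8-9b63-da6bed09fc5b-0_2-61-160_20250814125712793.parquet"
)
print(parts["instant_time"])  # 20250814125712793
```

The instant time extracted here matches the `_hoodie_commit_time` of the rows written by the initial insert into the MOR table.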


### Snapshot Query on the MOR Table

In [14]:
morSnapshotQueryDF = spark.read \
        .format("hudi") \
        .load(f"{base_path}/{table_name_mor}" + "/*/*")

display(morSnapshotQueryDF.select("_hoodie_commit_time", "uuid", "rider", "driver", "fare", "city", "ts"))

_hoodie_commit_time,uuid,rider,driver,fare,city,ts
20250814125712793,uuid-002,rider-B,driver-Y,22.75,san_francisco,2025-08-10 09:22:10
20250814125712793,uuid-005,rider-E,driver-V,25.1,san_francisco,2025-08-10 12:55:15
20250814125712793,uuid-007,rider-G,driver-T,28.45,san_francisco,2025-08-10 14:10:05
20250814125712793,uuid-003,rider-C,driver-Z,14.6,chicago,2025-08-10 10:05:45
20250814125712793,uuid-006,rider-F,driver-U,19.8,chicago,2025-08-10 13:20:35
20250814125712793,uuid-009,rider-I,driver-R,24.35,chicago,2025-08-10 15:45:50
20250814125712793,uuid-001,rider-A,driver-X,18.5,new_york,2025-08-10 08:15:30
20250814125712793,uuid-004,rider-D,driver-W,31.9,new_york,2025-08-10 11:40:00
20250814125712793,uuid-008,rider-H,driver-S,16.25,new_york,2025-08-10 15:00:20
20250814125712793,uuid-010,rider-J,driver-Q,20.0,new_york,2025-08-10 16:30:00


### Reading in Read-Optimized Mode
Now, let's read the same table in read-optimized mode. This mode is typically faster because it reads only the columnar base files and skips merging log files, but it will not show recent updates that are still sitting in the log files.

In [43]:
mor_ro_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "read_optimized") \
    .load(f"{base_path}/{table_name_mor}")

display(mor_ro_df.select("_hoodie_commit_time", "uuid", "rider", "driver", "fare", "city", "ts"))

_hoodie_commit_time,uuid,rider,driver,fare,city,ts
20250814125712793,uuid-002,rider-B,driver-Y,22.75,san_francisco,2025-08-10 09:22:10
20250814125712793,uuid-005,rider-E,driver-V,25.1,san_francisco,2025-08-10 12:55:15
20250814125712793,uuid-007,rider-G,driver-T,28.45,san_francisco,2025-08-10 14:10:05
20250814125712793,uuid-001,rider-A,driver-X,18.5,new_york,2025-08-10 08:15:30
20250814125712793,uuid-004,rider-D,driver-W,31.9,new_york,2025-08-10 11:40:00
20250814125712793,uuid-008,rider-H,driver-S,16.25,new_york,2025-08-10 15:00:20
20250814125712793,uuid-010,rider-J,driver-Q,20.0,new_york,2025-08-10 16:30:00
20250814125712793,uuid-003,rider-C,driver-Z,14.6,chicago,2025-08-10 10:05:45
20250814125712793,uuid-006,rider-F,driver-U,19.8,chicago,2025-08-10 13:20:35
20250814125712793,uuid-009,rider-I,driver-R,24.35,chicago,2025-08-10 15:45:50


### Updating a Record in the MOR table
Let's update a record in our MOR table to see how it affects our read modes. We'll find the record for 'driver-W' and double its fare.

In [44]:
from pyspark.sql.functions import col
updatesDF = spark.read.format("hudi").load(f"{base_path}/{table_name_mor}").filter(col("driver") == "driver-W").withColumn("fare", col("fare") * 2)

display(updatesDF.select("uuid", "rider", "driver", "fare", "city", "ts"))

uuid,rider,driver,fare,city,ts
uuid-004,rider-D,driver-W,63.8,new_york,2025-08-10 11:40:00


Now we perform the upsert. In a MOR table, this update will be written to a log file, separate from the main Parquet data files.

In [45]:
updatesDF.write \
    .format("hudi") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .options(**mor_hudi_conf) \
    .mode("append") \
    .save(f"{base_path}/{table_name_mor}")

After the update, a snapshot query correctly shows the new fare for 'driver-W'. This is because the log files containing our update were merged with the base files during this read operation.

In [48]:
morSnapshotQueryDF = spark.read.format("hudi").load(f"{base_path}/{table_name_mor}")
display(morSnapshotQueryDF.select("_hoodie_commit_time", "uuid", "rider", "driver", "fare", "city", "ts").filter(col("driver") == "driver-W"))

_hoodie_commit_time,uuid,rider,driver,fare,city,ts
20250814151334128,uuid-004,rider-D,driver-W,63.8,new_york,2025-08-10 11:40:00


Finally, a read-optimized query of the same table still shows the old fare for 'driver-W'. This is because the read-optimized query only looks at the base data files and ignores the unmerged update in the log file.

In [50]:
mor_ro_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "read_optimized") \
    .load(f"{base_path}/{table_name_mor}")

display(mor_ro_df.select("_hoodie_commit_time", "uuid", "rider", "driver", "fare", "city", "ts").filter(col("driver") == "driver-W"))

_hoodie_commit_time,uuid,rider,driver,fare,city,ts
20250814125712793,uuid-004,rider-D,driver-W,31.9,new_york,2025-08-10 11:40:00
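
The difference between the two read modes on a MOR table can be summarized with a small plain-Python model. It treats a file group as a list of base records plus a list of log records, and is only a conceptual sketch: the function names are hypothetical, and real Hudi merging works on file slices with configurable merge logic.

```python
def snapshot_read(base_records, log_records, key_field="uuid"):
    """Merge log records over base records, as a snapshot query would."""
    merged = {rec[key_field]: rec for rec in base_records}
    for rec in log_records:  # log entries are applied in write order
        merged[rec[key_field]] = rec
    return list(merged.values())


def read_optimized(base_records):
    """Return base records only, ignoring anything still in the log."""
    return list(base_records)


# State of the new_york file group after the driver-W update (trimmed columns).
base = [{"uuid": "uuid-004", "driver": "driver-W", "fare": 31.9}]
log = [{"uuid": "uuid-004", "driver": "driver-W", "fare": 63.8}]

print(snapshot_read(base, log)[0]["fare"])  # 63.8
print(read_optimized(base)[0]["fare"])      # 31.9
```

Once the log records are folded into a new base file (which Hudi does through compaction), a read-optimized query would see the updated fare as well.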
