# Deep Dive into Hudi Metadata Table and Indexing Enhancements in 1.x, Including SQL-Based Index Management

Welcome to this guide on the Hudi Metadata Table and its role in boosting performance. In a large-scale data lake, simply listing files can become a significant bottleneck. Hudi's Metadata Table is a powerful, self-managed Hudi table that tracks all file listings, partitions, and statistics, allowing for much faster queries and more efficient operations.

In Hudi 1.x, these features have been further enhanced with the ability to manage indexes directly using SQL. This notebook will demonstrate:

- ***What the Metadata Table is:*** We'll inspect the files that make up the Metadata Table.
- ***The Performance Impact:*** We'll show how the Metadata Table speeds up file listing.
- ***SQL-Based Index Management:*** We'll create, use, and drop indexes directly with SQL commands to optimize queries.

## Setting up the Environment
First, we begin by importing our necessary libraries and starting a SparkSession configured to work with Hudi and MinIO.

In [1]:
%run utils.ipynb

Now, let's start the SparkSession. We'll give it the app name 'HudiMetadataIndexing' and configure it to use our Hudi and MinIO settings.

In [2]:
%%capture
spark = get_spark("HudiMetadataIndexing")

25/08/26 12:45:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## Initial Table Creation
We'll start with a simple dataset of ride data. This will be our main table, and we'll then explore its metadata.

In [3]:
initial_data = [
    ("2025-08-10 08:15:30", "uuid-001", "rider-A", "driver-X", 18.50, "new_york"),
    ("2025-08-10 09:22:10", "uuid-002", "rider-B", "driver-Y", 22.75, "san_francisco"),
    ("2025-08-10 10:05:45", "uuid-003", "rider-C", "driver-Z", 14.60, "chicago")
]
initial_columns = ["ts", "uuid", "rider", "driver", "fare", "city"]
initial_df = spark.createDataFrame(initial_data).toDF(*initial_columns)

display(initial_df)

                                                                                

ts,uuid,rider,driver,fare,city
2025-08-10 08:15:30,uuid-001,rider-A,driver-X,18.5,new_york
2025-08-10 09:22:10,uuid-002,rider-B,driver-Y,22.75,san_francisco
2025-08-10 10:05:45,uuid-003,rider-C,driver-Z,14.6,chicago


Now, let's create a Hudi table with a crucial configuration: ***"hoodie.metadata.enable": "true".*** This flag tells Hudi to maintain an internal Metadata Table, which will speed up our operations.

In [4]:
table_name = "rides_metadata_table"
base_path = "s3a://warehouse/hudi-metadata"

hudi_conf = {
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.metadata.enable": "false"
}

initial_df.write.format("hudi") \
    .options(**hudi_conf) \
    .mode("overwrite") \
    .save(f"{base_path}/{table_name}")

# Register a temp view to easily query the table
#spark.read.format("hudi").load(f"{base_path}/{table_name}").createOrReplaceTempView(table_name)

                                                                                

## The Hudi Metadata Table: A Deeper Look

Hudi employs a special internal metadata table within each dataset to track metadata information - such as file listings and column statistics, helping avoid costly file system scans and improving read/write efficiency.

***Key Features of the Metadata Table:***
- ***Scalable:*** Capable of scaling to large sizes, handling TBs of metadata efficiently.
- ***Flexible:*** Supports multi-modal indexing, allowing enabling/disabling various index types dynamically.
- ***Fast Lookups:*** Uses an SSTable-like base file format (HFile) for fast partial scans and selective column reads.

***The metadata table holds auxiliary data like:***
- File indices for efficient record location
- Column statistics for data skipping
- Bloom filters for quick membership tests
- Record and secondary indexes to speed up queries

Let's look at the file system of our newly created table. Here three directories with city names, are partitions containing data.

In [5]:
ls(f"{base_path}/{table_name}")

s3a://warehouse/hudi-metadata/rides_metadata_table/.hoodie
s3a://warehouse/hudi-metadata/rides_metadata_table/chicago
s3a://warehouse/hudi-metadata/rides_metadata_table/new_york
s3a://warehouse/hudi-metadata/rides_metadata_table/san_francisco


Now If you look inside a partition directory you will see following files. ***.hoodie_partition_metadata*** files store information about partition.

In [6]:
ls(f"{base_path}/{table_name}/new_york")

s3a://warehouse/hudi-metadata/rides_metadata_table/new_york/.hoodie_partition_metadata
s3a://warehouse/hudi-metadata/rides_metadata_table/new_york/422dd207-ac0a-406c-a388-5211af047bec-0_1-16-85_20250826124514061.parquet


The .hoodie directory contains subdirectories that store metadata files. Notice the special ***.hoodie/metadata*** directory. This is the Metadata Table itself. The files inside are not human-readable but are critical for Hudi's performance.

In [7]:
# List the contents of the Hudi table's .hoodie directory
ls(f"{base_path}/{table_name}/.hoodie")

s3a://warehouse/hudi-metadata/rides_metadata_table/.hoodie/hoodie.properties
s3a://warehouse/hudi-metadata/rides_metadata_table/.hoodie/.aux
s3a://warehouse/hudi-metadata/rides_metadata_table/.hoodie/.schema
s3a://warehouse/hudi-metadata/rides_metadata_table/.hoodie/.temp
s3a://warehouse/hudi-metadata/rides_metadata_table/.hoodie/timeline


The output shows several key directories and files:

- ***.aux, .index_defs, .temp:*** These folders store internal metadata and temporary files.
- ***.schema:*** This folder stores schema information for the Hudi table which helps in schema evolution.
- ***timeline:*** This directory contains all the files that make up the Hudi Timeline, which is a record of every transaction that has occurred on the table.
- ***metadata:*** This is the Metadata Table. It is itself a Hudi table and contains the file-level metadata like partition paths, file listings, and commit information that allows Hudi to quickly find files without performing a full file system scan.
- ***hoodie.properties:*** The main configuration file for the table, which holds settings like the table name, key fields, and partitioning.

Another crucial part of this metadata is the ***Hudi Timeline***, which consists of small files that log every change to the table. These meta-files follow the naming pattern below:

[action timestamp].[action type].[action state]

In [8]:
# List the contents of the Hudi table's timeline directory
ls(f"{base_path}/{table_name}/.hoodie/timeline")

s3a://warehouse/hudi-metadata/rides_metadata_table/.hoodie/timeline/20250826124514061.commit.requested
s3a://warehouse/hudi-metadata/rides_metadata_table/.hoodie/timeline/20250826124514061.inflight
s3a://warehouse/hudi-metadata/rides_metadata_table/.hoodie/timeline/20250826124514061_20250826124516083.commit
s3a://warehouse/hudi-metadata/rides_metadata_table/.hoodie/timeline/history


- An action timestamp is a unique, chronological identifier for each event, marking when it was scheduled.
- An action type describes the operation that took place. Examples include commit or deltacommit for data changes, compaction or clean for maintenance, and savepoint or restore for recovery.
- An action state shows the current status of the action. It can be requested (waiting to start), inflight (in progress), or commit (completed).

## Indexing Enhancements in Hudi 1.x
Hudi 1.x introduces an advanced indexing subsystem that generalizes index capabilities closer to those found in relational databases.

Important Enhancements:
- ***Secondary Indexes:*** Support for indexes on any secondary columns to speed up query filtering.
- ***Expression-Based Indexes:*** Indexes on expressions or transformed columns, enabling advanced data skipping.
- ***SQL-Based Index Management:*** Users can create and manage indexes using standard SQL DDL commands via Spark SQL.
- ***Asynchronous Indexing:*** Indexes can be built asynchronously alongside ongoing writes, improving write throughput without blocking.

## SQL-Based Index Creation and Management
With Hudi 1.x, you can create different types of indexes directly on the Metadata Table using SQL. These indexes further accelerate query performance, especially for filtering on specific columns.

### Example Commands:

- Enable record index (dependency for secondary index)
=> SET hoodie.metadata.record.index.enable=true;

- Create record index on primary key column (e.g., uuid)
=> CREATE INDEX record_index ON hudi_table (uuid);

- Create secondary index on 'rider' column
=> CREATE INDEX idx_rider ON hudi_table (rider);

- Create bloom filter index on 'driver' column
=> CREATE INDEX idx_bloom_driver ON hudi_table USING bloom_filters(driver) OPTIONS(expr='identity');

- Create expression-based column stats index on timestamp column
=> CREATE INDEX idx_column_ts ON hudi_table USING column_stats(ts) OPTIONS(expr='from_unixtime', format='yyyy-MM-dd');

- Drop indexes when no longer needed
=> DROP INDEX record_index ON hudi_table;

### Practical Example Workflow

***1. Create a Spark SQL Table on Hudi Dataset***

In [9]:
spark.sql(f"""
 CREATE TABLE {table_name} (
    ts BIGINT,
    uuid STRING,
    rider STRING,
    driver STRING,
    fare DOUBLE,
    city STRING
 ) USING hudi 
     options(
        primaryKey ='uuid'
)
PARTITIONED BY (city)
LOCATION '{base_path}';
""")

DataFrame[]

In [10]:
# Step 1: Enable record index (required for secondary index)
spark.sql("SET hoodie.metadata.record.index.enable=false")

DataFrame[key: string, value: string]

In [11]:
# Step 2: Create Record Index on primary key 'uuid'
spark.sql(f"CREATE INDEX record_index ON {table_name} (uuid)")

                                                                                



Py4JJavaError: An error occurred while calling o49.sql.
: org.apache.hudi.exception.HoodieException: Metadata table is not yet initialized. Initialize FILES partition before any other partition [Metadata partition {name: record_index, prefix: record-index-}]
	at org.apache.hudi.index.HoodieSparkIndexClient.doSchedule(HoodieSparkIndexClient.java:234)
	at org.apache.hudi.index.HoodieSparkIndexClient.createRecordIndex(HoodieSparkIndexClient.java:117)
	at org.apache.hudi.index.HoodieSparkIndexClient.create(HoodieSparkIndexClient.java:100)
	at org.apache.spark.sql.hudi.command.CreateIndexCommand.run(IndexCommands.scala:69)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:220)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:638)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:629)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:659)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:748)


In [None]:
# Step 3: Create Secondary Index on 'rider' column
spark.sql(f"CREATE INDEX idx_rider ON {table_name} (rider)")