docs: RFC-103 - Support Vector Indexing in Apache Hudi#14255
suryaprasanna wants to merge 7 commits into apache:master
Conversation
Thanks @suryaprasanna for making this! I was wondering if you can create a github issue for your RFC similar to something like this #14219 and then link your RFC there as well. Also wondering if you can change the title for both the issue and this PR to be something like |
| ## Background | ||
| Following are the goals for this RFC | ||
| - Creating vector indexes based on a base column in a table - either an embedding column or a text column. | ||
| - Indexes are automatically kept up to date when the base column changes, consistent with transactional boundaries. |
I am interested in this idea. You might cover later in the RFC what our strategies are for updating the index, since I am assuming this is expensive. Do we update on every commit, paying the cost on each write, or do we follow Hudi's current model of triggering this every x commits or on some time interval to reindex?
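To make the trigger question above concrete, here is a minimal sketch of one possible policy, modeled loosely on how Hudi schedules async table services. The helper name and default thresholds are invented for illustration only:

```python
def should_reindex(commits_since_last: int,
                   elapsed_seconds: float,
                   max_commits: int = 10,
                   max_seconds: float = 3600.0) -> bool:
    """Hypothetical trigger policy: rebuild the vector index after N commits
    or after a time interval, whichever comes first, instead of on every write."""
    return commits_since_last >= max_commits or elapsed_seconds >= max_seconds

# A scheduler loop would consult this after each commit:
# if should_reindex(commits, elapsed): schedule_index_compaction()
```

This mirrors the "every x commits or some time interval" option raised above; the per-commit alternative is simply `max_commits = 1`.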
| - Creating vector indexes based on a base column in a table - either an embedding column or a text column. | ||
| - Indexes are automatically kept up to date when the base column changes, consistent with transactional boundaries. | ||
| - First-class SQL experience for creating, and dropping indexes (Spark) | ||
| - SQL extensions to query the index. (Spark, then Presto/Trino) |
@suryaprasanna I am thinking that maybe we move these two bullet points into the scope of RFC 102, since it handles exposing the search capability to the user.
- First-class SQL experience for creating, and dropping indexes (Spark)
- SQL extensions to query the index. (Spark, then Presto/Trino)
Will link the draft RFC 102: #14219 around exposing vector search
I also think that we have SQL semantics already for creating and dropping indexes in spark, for secondary index https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSecondaryIndexDataTypes.scala#L190C1-L191C1. So wondering if we can leverage similar wiring?
| - Highly optimized C++ library with Python bindings | ||
| - Ideal for large-scale GPU-accelerated similarity search | ||
|
|
||
| For initial implementation, we will look into HNSW and Jvector implementations. |
@suryaprasanna I am curious if you looked into IVF and whether it's actually a good candidate to be implemented for Hudi's internal metadata table. https://blog.dailydoseofds.com/p/approximate-nearest-neighbor-search
I'm wondering if Locality-Sensitive Hashing (LSH) is a good candidate for a hashing index, because of its efficiency, and it may be more friendly for streaming ingestion? https://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf
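For reference, the core of the random-hyperplane variant of LSH mentioned above fits in a few lines of plain Python. This is illustrative only (function names invented), not a proposal for the actual index format:

```python
import random

def random_hyperplanes(dim: int, n_planes: int, seed: int = 42):
    """Generate random hyperplane normals for sign-based (cosine) LSH."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def lsh_signature(vec, planes) -> int:
    """Hash a vector to an n-bit signature: one bit per hyperplane side."""
    bits = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

# Vectors close in cosine distance tend to land in the same bucket,
# which is why LSH is cheap to maintain under streaming ingestion:
planes = random_hyperplanes(dim=4, n_planes=8)
buckets = {}
for name, v in [("a", [1.0, 0.9, 0.1, 0.0]), ("b", [1.0, 1.0, 0.0, 0.1])]:
    buckets.setdefault(lsh_signature(v, planes), []).append(name)
```

New records only need their signature computed against fixed hyperplanes, so no graph or tree restructuring is required on write.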
| #### Supporting transactions for Upsert/deletes: | ||
| Vector embeddings are not static—they can evolve over time as underlying features change or as new embedding generation algorithms are applied. Therefore, the vector index implementation must support upserts and deletes, along with integration into Hudi table services such as rollback and restore operations. | ||
|
|
||
| To achieve this, we propose using a versioning mechanism for maintaining and updating the vector index. Each update to the embeddings or index structure can be associated with a version, enabling consistent rollback and recovery when table operations require reverting to a previous state. |
Should we track any additional metadata, such as the embedding model used to generate this column, or the blob/binary column this embedding column maps to?
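To illustrate what tracking such metadata alongside index versions might look like, here is a hypothetical sketch (class and field names invented, not part of Hudi) of the versioning mechanism described in the RFC text, with rollback support:

```python
class VectorIndexVersions:
    """Hypothetical per-index version log: each commit records the index
    files plus extra metadata such as the embedding model and source column."""

    def __init__(self):
        self._versions = []  # list of dicts, newest last

    def commit(self, instant, index_files, embedding_model, source_column):
        self._versions.append({
            "instant": instant,
            "index_files": list(index_files),
            "embedding_model": embedding_model,
            "source_column": source_column,
        })

    def current(self):
        return self._versions[-1] if self._versions else None

    def rollback_to(self, instant):
        """Drop all versions newer than the given commit instant,
        mirroring a table-level rollback/restore."""
        while self._versions and self._versions[-1]["instant"] > instant:
            self._versions.pop()
```

Keeping the embedding model per version would also make it possible to detect when a re-embedding (new model) invalidates an existing index.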
|
|
||
| #### Clusters of Indexes (Horizontal Scalability for Vector Indexing) | ||
|
|
||
| Approximate Nearest Neighbor (ANN) algorithms are graph-based structures that are inherently designed for in-memory computation. As a result, traditional ANN implementations typically achieve scalability through vertical scaling — by adding more memory and CPU resources to a single node. |
are graph-based structures
The main families of ANN indexing structures are graph-based, tree-based, hashing-based, and quantization-based methods, not just graph-based?
And one base file could have multiple index columns right?
If a base file contains multiple embedding columns, then yes I would assume that you can have indexes built on each one.
|
|
||
| ##### Proposed Approach | ||
|
|
||
| To enable horizontal scalability, we propose organizing the vector embeddings into multiple clusters of indexes. Each cluster will manage a subset of the embeddings, allowing the overall dataset to be evenly partitioned across the clusters. |
How do we organize these clusters: hashing by keys? Can one cluster include multiple file groups, in Hudi's notion?
If I understand correctly, these clusters are basically rows that are grouped together by how similar their embedding values are to one another, and not directly related to the record key being hashed, unless you are talking about some other value being used as the key? @danny0405 @suryaprasanna
https://www.pinecone.io/learn/a-developers-guide-to-ann-algorithms/
Based on how vectors are structured, vector indexing algorithms can be divided into three main categories:
Spatial Partitioning (also known as cluster-based indexing or clustering indexes)
Graph-based indexing (e.g. [HNSW](https://www.pinecone.io/learn/series/faiss/hnsw/))
Hash-based indexing (e.g., [locality-sensitive hashing](https://www.pinecone.io/learn/series/faiss/vector-indexes/#Locality-Sensitive-Hashing))
We’ll skip hash-based indexing in this article because its performance on all aspects, including reads, writes, and storage, is currently worse than that of graph-based and spatial partitioning-based indexing. Almost no vector databases use hash-based indexing nowadays.
Inverted File Index (IVF)
Another popular index is the Inverted File Index (IVF). It’s easy to use, has high search quality, and has reasonable search speed.
It’s the most basic version of the spatial-partitioning index and inherits its advantages including little space overhead and the ability to work with object storage as well as disadvantages like lower query throughput compared to graph-based indexing in-memory and on-disk.
When I was reading online I saw that IVF seemed to be popular with object storage and seems to be a default in other systems in the space https://lancedb.com/docs/indexing/#:~:text=Inverted%20File%20Index%20(IVF)%20Implementation,but%20the%20slower%20the%20query.
My understanding is: we are grouping the embeddings by some clustering algorithm, and eventually each cluster group maps to a file group in the vector search index at the Hudi table level?
rfc/rfc-103/rfc-103.md
|
|
||
| #### Base & Log Files: | ||
|
|
||
| In Hudi’s vector index implementation, embeddings are actually inserted under the index files, while base files contain metadata about the index. File groups used for creating base files will be of the format **<COLUMN_NAME>-<FILE_INDEX>_<WRITE_TOKEN>_<COMMIT>.hfile**. Here, column_name is the column for which the vector index is created. This way, when there are multiple vector indices as part of the same dataset, the compaction operation can be performed on each vector index separately, thereby avoiding large-scale updates to all the indexes at the same time. |
The base file name in the example graph does not really follow this naming convention: **<COLUMN_NAME>-<FILE_INDEX>_<WRITE_TOKEN>_<COMMIT>.hfile**
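For clarity, the naming convention quoted above can be captured by a small helper; this is an illustrative sketch (helper names invented, not actual Hudi code), useful for checking the example graph against the spec:

```python
import re

# <COLUMN_NAME>-<FILE_INDEX>_<WRITE_TOKEN>_<COMMIT>.hfile, per the RFC text.
NAME_RE = re.compile(
    r"^(?P<column>.+)-(?P<file_index>\d+)_(?P<write_token>\d+)_(?P<commit>\d+)\.hfile$"
)

def make_index_file_name(column, file_index, write_token, commit):
    """Build a vector index base file name from its components."""
    return f"{column}-{file_index}_{write_token}_{commit}.hfile"

def parse_index_file_name(name):
    """Split a base file name back into its components, or fail loudly."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a vector index base file: {name}")
    return m.groupdict()
```

Prefixing the column name is what lets compaction plans for different index columns select their file groups without scanning each other's files.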
| - SQL extensions to query the index. (Spark, then Presto/Trino) | ||
|
|
||
| Non-goals/Unclear: | ||
| - Fast serving layer, directly usable from RAG applications (this can be left to existing ODBC/SQL gateways that can talk to Spark?) |
@suryaprasanna For these RAG applications, can the RAG application just be a python script shared via pyspark? The developer could then directly invoke our vector search API from RFC 102. I saw this basic blog on RAG https://huggingface.co/learn/cookbook/en/rag_with_hf_and_milvus that shows an example with a vector database like Milvus, so I am wondering if you could instead swap out Milvus with Hudi acting as the vector store, using pyspark as the execution layer.
Or is spark not suited for these kinda use cases?
@rahil-c @suryaprasanna can we merge this and the other RFC PR? Also please follow process and grab a RFC number
vinothchandar
left a comment
Left a lot of comments. Overall, I'd love for this to be rewritten based on what's already available in tableVersion=8/9.
| - @vinoth | ||
| - @rahil-c | ||
|
|
||
| ## Status |
| ##### Proposed Approach | ||
|
|
||
| To enable horizontal scalability, we propose organizing the vector embeddings into multiple clusters of indexes. Each cluster will manage a subset of the embeddings, allowing the overall dataset to be evenly partitioned across the clusters. | ||
| During execution, each Spark executor can load the relevant cluster’s index into memory—or operate in a hybrid mode using both disk and memory—to perform local nearest neighbor searches efficiently. |
so would this build the data structure needed for ANN each time a query is executed?
|
|
||
| #### Read Flow on Vector Index: | ||
|
|
||
| When users access the vector index on a column's embeddings to find the top K nearest neighbours for a vector, the query will first start a Spark stage to find the top K nearest neighbours within each of the clusters, with the work distributed using Spark. Say there are 10 clusters: then a total of 10xK vectors are collected as an intermediate result, and in the final step the nearest top K among these 10xK vectors is computed on the driver and returned. |
Shouldn't we just shuffle the output of the first stage and compute the final top K in a distributed fashion?
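The two-stage top-K described in the RFC text (per-cluster top K, then a final merge over the 10xK candidates) can be sketched as follows; whether the merge step runs on the driver or as a distributed shuffle is exactly the open question here. All names are illustrative:

```python
import heapq
import math

def euclidean(a, b) -> float:
    """Distance metric; the real index would likely use cosine or dot product."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_in_cluster(query, cluster_vectors, k):
    """Stage 1: each executor returns its cluster-local top K candidates."""
    return heapq.nsmallest(k, cluster_vectors, key=lambda v: euclidean(query, v))

def top_k_global(query, clusters, k):
    """Stage 2: merge the per-cluster candidates (e.g. 10xK of them)
    into the final top K. Here it runs locally, standing in for the driver."""
    candidates = [v for c in clusters for v in top_k_in_cluster(query, c, k)]
    return heapq.nsmallest(k, candidates, key=lambda v: euclidean(query, v))
```

Correctness holds either way because each cluster's true top K is a superset filter; only the merge's placement (driver vs. shuffle) changes.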
|
|
||
| Expanding on these workflows, following sections will go over the data layout and directory structure of vector index implementation. | ||
|
|
||
| #### Bootstrap workflow: |
Let's call this the indexer, or index building, consistent with other SI building workflows.
all of these parameters should be fed nicely into a CREATE INDEX .. command syntax
|
|
||
| For example, if a column’s embeddings are split into two clusters, the system generates two separate vector indexes for that column. | ||
|
|
||
| This process is repeated for every column where vector indexing is enabled. Thus, if two columns each produce two clusters, a total of four vector indexes are constructed. |
each column should be its own vector index?
|
|
||
| ##### Cluster Formation and Index Construction | ||
|
|
||
| After loading the embedding vectors from the data files, the system groups them into clusters. Clustering provides horizontal scalability for vector search, preventing the bottlenecks associated with a single monolithic index. Each cluster is processed independently using the configured indexing algorithm, resulting in one vector index per cluster. |
Can we add details on how exactly this clustering is done? Are we envisioning the use of something like k-means?
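If k-means were the clustering choice, the core loop is small enough to sketch in plain Python. This is illustrative only (a real implementation would run over Spark, on sampled embeddings, with a proper distance metric), not a proposed implementation:

```python
import random

def kmeans(vectors, k, iters=10, seed=0):
    """Minimal Lloyd's k-means over plain Python lists of floats."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        # Assignment step: each vector joins its nearest centroid.
        for v in vectors:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(v, centroids[i])))
            groups[idx].append(v)
        # Update step: move each centroid to the mean of its group.
        for i, g in enumerate(groups):
            if g:
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids, groups
```

Each resulting group would then be handed to the configured ANN algorithm to build one index per cluster, per the quoted RFC text.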
|
|
||
| Once the split is complete, the records destined for a specific vector index are evenly distributed across the file groups configured for that index and written into their log files. Importantly, the vector index base files stored under `vector_index/.index_files` are not updated during ingestion. Therefore, SQL queries that read only the index files will not immediately reflect changes written to log files. The .index_files are updated only during compaction. This separation ensures ingestion remains lightweight and avoids expensive index rebuild operations. | ||
|
|
||
| This design intentionally stores records for each vector index under column-specific file groups. The purpose is to avoid conflicts when multiple compaction plans run concurrently—each targeting different vector index columns. Because of this structure, when a compaction operation is triggered to refresh a specific vector index, only the file groups belonging to that index are scanned. The system does not need to scan every file group in the vector_index partition, enabling more efficient and isolated compaction. |
Wouldn't updates cause changes to cluster boundaries? Should this be readjusted based on periodic Hudi clustering?
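The even distribution of incoming records across an index's file groups, as described in the quoted text, could be as simple as round-robin routing. A hypothetical sketch (names invented; a real implementation might hash instead, to keep routing stable across batches):

```python
def route_to_file_groups(records, file_groups):
    """Round-robin records evenly across the file groups configured
    for one vector index column, as in the ingestion split above."""
    buckets = {fg: [] for fg in file_groups}
    for i, rec in enumerate(records):
        buckets[file_groups[i % len(file_groups)]].append(rec)
    return buckets
```

Because file groups are column-specific, two concurrent compactions on different index columns never touch the same buckets.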
|
|
||
| The ingestion workflow stores the changes to vector embeddings in a vector_index partition directory in log file format. These are not readily available for users to read. For users to consume these changes, they need to be applied to the vector indexes that are present under the .index_files directory. Hudi’s compaction is the table service that applies these changes onto the index files. | ||
|
|
||
| Compaction for the vector index partition is significantly more complex than compaction for other metadata partitions. This complexity arises because vector index files are stored outside the traditional base-file and log-file structure, and each new version of a vector index is maintained independently from its previous version. Since vector indexes do not support transaction-based change tracking and cannot roll back partial updates, version management must be implemented externally. This external versioning ensures that the currently serving index remains isolated and stable while a new version is being built and updated during compaction. |
I want to fit this all under the current storage format/model
| - Graph-based approach | ||
| - Excellent recall-speed tradeoff | ||
| - Widely used in production systems | ||
| 2. jVector |
can we produce some microbenchmarks for this?
Describe the issue this Pull Request addresses
Issue: #14290
This pull request proposes an RFC for adding support for vector indexes in Apache Hudi.
Summary and Changelog
Impact
none
Risk Level
low
Documentation Update
Contributor's checklist