
docs: RFC-103 - Support Vector Indexing in Apache Hudi#14255

Open
suryaprasanna wants to merge 7 commits into apache:master from suryaprasanna:vector-index-rfc

Conversation

@suryaprasanna
Contributor

@suryaprasanna suryaprasanna commented Nov 13, 2025

Describe the issue this Pull Request addresses

Issue: #14290

This pull request proposes an RFC for adding support for Vector Indexes in Apache Hudi.

Summary and Changelog

Impact

none

Risk Level

low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Nov 13, 2025
@rahil-c
Collaborator

rahil-c commented Nov 13, 2025

Thanks @suryaprasanna for making this!

I was wondering if you can create a GitHub issue for your RFC, similar to #14219, and then link your RFC there as well. Also wondering if you can change the title for both the issue and this PR to something like RFC-103 - Vector Index in Hudi.

## Background
The following are the goals of this RFC:
- Creating vector indexes based on a base column in a table - either an embedding column or a text column.
- Indexes are automatically kept up to date when the base column changes, consistent with transactional boundaries.
Collaborator

I am interested in this idea; you might cover later in the RFC what our strategies are for updating the index, since I am assuming this is expensive. Do we update on every commit, paying the cost on each write, or do we follow Hudi's current model of triggering this every x commits or on some time interval to reindex?

- Creating vector indexes based on a base column in a table - either an embedding column or a text column.
- Indexes are automatically kept up to date when the base column changes, consistent with transactional boundaries.
- First-class SQL experience for creating and dropping indexes (Spark)
- SQL extensions to query the index. (Spark, then Presto/Trino)
Collaborator
@rahil-c rahil-c Nov 13, 2025

@suryaprasanna I am thinking that maybe for these two bullet points, we move these items into the scope of RFC-102, since it handles exposing the capability to the user.

- First-class SQL experience for creating and dropping indexes (Spark)
- SQL extensions to query the index. (Spark, then Presto/Trino)

Will link the draft RFC 102: #14219 around exposing vector search

I also think that we have SQL semantics already for creating and dropping indexes in spark, for secondary index https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSecondaryIndexDataTypes.scala#L190C1-L191C1. So wondering if we can leverage similar wiring?

- Highly optimized C++ library with Python bindings
- Ideal for large-scale GPU-accelerated similarity search

For initial implementation, we will look into HNSW and Jvector implementations.
Collaborator

@suryaprasanna I am curious whether you looked into IVF, and if it's actually a good candidate to implement for Hudi's internal metadata table. https://blog.dailydoseofds.com/p/approximate-nearest-neighbor-search

Contributor

I'm wondering if Locality-Sensitive Hashing (LSH) is a good candidate for a hashing index, because of its efficiency; it may also be friendlier for streaming ingestion? https://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf
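For context, the random-hyperplane variant of LSH suggested here can be sketched in a few lines. This is a toy illustration, not a proposed implementation; the dimension and bit count are arbitrary:

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """One random hyperplane per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_hash(vector, hyperplanes):
    """Random-hyperplane LSH: bit i is the sign of the dot product with
    hyperplane i, so vectors at a small angle collide on most bits."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

planes = make_hyperplanes(dim=4, n_bits=8)
a = [1.0, 0.9, 0.1, 0.0]
h_a = lsh_hash(a, planes)
# Scaling a vector does not change any dot-product sign,
# so the signature is identical.
h_scaled = lsh_hash([2.0, 1.8, 0.2, 0.0], planes)
```

The ingestion-friendliness follows from the fact that hashing one vector needs no global state beyond the fixed hyperplanes, unlike graph or clustering indexes that must be rebuilt or rebalanced.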

#### Supporting transactions for Upsert/deletes:
Vector embeddings are not static—they can evolve over time as underlying features change or as new embedding generation algorithms are applied. Therefore, the vector index implementation must support upserts and deletes, along with integration into Hudi table services such as rollback and restore operations.

To achieve this, we propose using a versioning mechanism for maintaining and updating the vector index. Each update to the embeddings or index structure can be associated with a version, enabling consistent rollback and recovery when table operations require reverting to a previous state.
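As a rough sketch of this versioning idea (all names hypothetical, not an actual Hudi API), an external manifest could map each index version to the commit that produced it, so rolling back the table simply drops index versions past the target commit:

```python
class IndexVersionManifest:
    """Hypothetical sketch: track vector-index versions against table
    commits so the index can roll back together with the table."""

    def __init__(self):
        # Append-only list of (commit_time, index_path) pairs.
        self._versions = []

    def publish(self, commit_time, index_path):
        self._versions.append((commit_time, index_path))

    def current(self):
        """The index version that queries should be served from."""
        return self._versions[-1] if self._versions else None

    def rollback_to(self, commit_time):
        """Drop every index version written after the target commit."""
        self._versions = [(c, p) for c, p in self._versions if c <= commit_time]

manifest = IndexVersionManifest()
manifest.publish("20251113100000", "vector_index/v1")
manifest.publish("20251113110000", "vector_index/v2")
manifest.rollback_to("20251113100000")
# current() now points at the v1 entry again
```

The key property is that the serving version is only ever switched by publishing or rolling back, so a half-built index version is never visible to readers.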
Collaborator

Should we track any other additional metadata, such as which embedding model was used to generate this column, or which blob/binary column this embedding column maps to?


#### Clusters of Indexes (Horizontal Scalability for Vector Indexing)

Approximate Nearest Neighbor (ANN) algorithms are graph-based structures that are inherently designed for in-memory computation. As a result, traditional ANN implementations typically achieve scalability through vertical scaling — by adding more memory and CPU resources to a single node.
Contributor

are graph-based structures

The main families of ANN indexing structures are graph-based, tree-based, hashing-based, and quantization-based methods, not just graph-based?

Contributor

And one base file could have multiple index columns right?

Collaborator

If a base file contains multiple embedding columns, then yes I would assume that you can have indexes built on each one.


##### Proposed Approach

To enable horizontal scalability, we propose organizing the vector embeddings into multiple clusters of indexes. Each cluster will manage a subset of the embeddings, allowing the overall dataset to be evenly partitioned across the clusters.
Contributor

How do we organize these clusters, hashing by keys? Can one cluster include multiple file groups, in Hudi's notion?

Collaborator

If I understand correctly, these clusters are basically rows grouped together by how similar their embedding values are to one another, and not directly related to the record key being hashed, unless you are talking about some other value being used as a key? @danny0405 @suryaprasanna

https://www.pinecone.io/learn/a-developers-guide-to-ann-algorithms/

Based on how vectors are structured, vector indexing algorithms can be divided into three main categories:

Spatial Partitioning (also known as cluster-based indexing or clustering indexes)
Graph-based indexing (e.g. [HNSW](https://www.pinecone.io/learn/series/faiss/hnsw/))
Hash-based indexing (e.g., [locality-sensitive hashing](https://www.pinecone.io/learn/series/faiss/vector-indexes/#Locality-Sensitive-Hashing))
We’ll skip hash-based indexing in this article because its performance on all aspects, including reads, writes, and storage, is currently worse than that of graph-based and spatial partitioning-based indexing. Almost no vector databases use hash-based indexing nowadays.


Inverted File Index (IVF)
Another popular index is the Inverted File Index (IVF). It’s easy to use, has high search quality, and has reasonable search speed.

It’s the most basic version of the spatial-partitioning index and inherits its advantages including little space overhead and the ability to work with object storage as well as disadvantages like lower query throughput compared to graph-based indexing in-memory and on-disk.

When I was reading online I saw that IVF seemed to be popular with object storage and seems to be a default in other systems in the space https://lancedb.com/docs/indexing/#:~:text=Inverted%20File%20Index%20(IVF)%20Implementation,but%20the%20slower%20the%20query.
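To make the IVF idea in this thread concrete, here is a toy sketch with hand-picked centroids instead of learned ones; `nprobe` controls how many inverted lists are scanned per query:

```python
import math

def build_ivf(vectors, centroids):
    """Assign each vector id to its nearest centroid, forming inverted lists."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, k=1, nprobe=1):
    """Scan only the nprobe closest lists instead of every vector."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [vid for i in order[:nprobe] for vid in lists[i]]
    return sorted(candidates, key=lambda vid: math.dist(query, vectors[vid]))[:k]

vectors = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]]
centroids = [[0.0, 0.0], [5.0, 5.0]]   # normally learned via clustering
lists = build_ivf(vectors, centroids)
nearest = ivf_search([0.02, 0.02], vectors, centroids, lists, k=1)
# nearest == [0]: only the first inverted list is probed
```

Because each inverted list is just a flat set of ids, lists map naturally onto immutable files on object storage, which is presumably why IVF is popular in that setting.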

Member

My understanding is: we are grouping the embeddings by some clustering algorithm, and eventually each cluster group maps to a file group in the vector search index at the Hudi table level?


#### Base & Log Files:

In Hudi’s vector index implementation, the embeddings themselves are inserted into the index files, while base files contain metadata about the index. File groups used for creating base files follow the format **<COLUMN_NAME>-<FILE_INDEX>_<WRITE_TOKEN>_<COMMIT>.hfile**, where COLUMN_NAME is the column for which the vector index is created. This way, when multiple vector indices are part of the same dataset, compaction can be performed on each vector index separately, thereby avoiding large-scale updates to all the indexes at the same time.
Contributor

The base file name in the example graph does not really follow this naming convention: **<COLUMN_NAME>-<FILE_INDEX>_<WRITE_TOKEN>_<COMMIT>.hfile**
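As an aside, the stated convention can be exercised with a tiny builder/parser (hypothetical helper names; parsing from the right so hyphens inside the column name or write token do not confuse the split, assuming the write token contains no underscore):

```python
def make_index_file_name(column, file_index, write_token, commit):
    # <COLUMN_NAME>-<FILE_INDEX>_<WRITE_TOKEN>_<COMMIT>.hfile
    return f"{column}-{file_index}_{write_token}_{commit}.hfile"

def parse_index_file_name(name):
    """Split from the right so that hyphens and underscores inside
    COLUMN_NAME survive intact."""
    stem = name[:-len(".hfile")]
    head, write_token, commit = stem.rsplit("_", 2)
    column, file_index = head.rsplit("-", 1)
    return {"column": column, "file_index": int(file_index),
            "write_token": write_token, "commit": commit}

name = make_index_file_name("embedding_col", 0, "1-0-1", "20251113100000")
parsed = parse_index_file_name(name)
# parsed["column"] == "embedding_col", parsed["file_index"] == 0
```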

- SQL extensions to query the index. (Spark, then Presto/Trino)

Non-goals/Unclear:
- Fast serving layer, directly usable from RAG applications (this can be left to existing ODBC/SQL gateways that can talk to Spark?)
Collaborator
@rahil-c rahil-c Nov 15, 2025

@suryaprasanna For these RAG applications, can the RAG application just be a Python script using PySpark? The developer can then directly invoke our vector search API from RFC 102? I saw this basic blog on RAG https://huggingface.co/learn/cookbook/en/rag_with_hf_and_milvus that shows an example with a vector database like Milvus, so I'm wondering if instead you swap out Milvus with Hudi acting as the vector store, using PySpark as the execution layer.

Or is spark not suited for these kinda use cases?

Member

why spark specifically..

@suryaprasanna suryaprasanna changed the title Vector index rfc docs: RFC-103 - Support Vector Indexing in Apache Hudi Nov 17, 2025
@apache apache deleted a comment from hudi-bot Nov 17, 2025
@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@vinothchandar vinothchandar added the rfc Request for comments label Nov 20, 2025
@vinothchandar
Member

@rahil-c @suryaprasanna can we merge this and the other RFC PR?

Also, please follow the process and grab an RFC number.

Member
@vinothchandar vinothchandar left a comment

Left a lot of comments. Overall, I'd love for this to be rewritten based on what's already available in tableVersion=8/9.

- @vinoth
- @rahil-c

## Status
Member

??




##### Proposed Approach

To enable horizontal scalability, we propose organizing the vector embeddings into multiple clusters of indexes. Each cluster will manage a subset of the embeddings, allowing the overall dataset to be evenly partitioned across the clusters.
During execution, each Spark executor can load the relevant cluster’s index into memory—or operate in a hybrid mode using both disk and memory—to perform local nearest neighbor searches efficiently.
Member

so would this build the data structure needed for ANN each time a query is executed?


#### Read Flow on Vector Index:

When a user queries the vector index on a column's embeddings to find the top K nearest neighbours for a vector, the query first starts a Spark stage that finds the top K nearest neighbours within each cluster, with the work distributed by Spark. Say there are 10 clusters: a total of 10xK vectors are collected as an intermediate result, and in a final step the nearest K of these 10xK vectors are computed on the driver and returned.
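The two-stage read flow (per-cluster top-K on executors, then a final merge of the intermediate candidates) can be sketched as plain functions; the ids, embeddings, and squared-distance metric are illustrative:

```python
import heapq

def local_top_k(query, cluster, k):
    """Executor-side step: top-k within one cluster, as (distance², id) pairs."""
    scored = [(sum((q - v) ** 2 for q, v in zip(query, vec)), vid)
              for vid, vec in cluster]
    return heapq.nsmallest(k, scored)

def global_top_k(partials, k):
    """Driver-side step: merge the N*k intermediate candidates down to k."""
    return heapq.nsmallest(k, (pair for part in partials for pair in part))

query = [0.0, 0.0]
clusters = [
    [(0, [0.1, 0.0]), (1, [2.0, 2.0])],   # cluster A: (id, embedding)
    [(2, [0.0, 0.2]), (3, [3.0, 3.0])],   # cluster B
]
partials = [local_top_k(query, c, k=2) for c in clusters]
best_ids = [vid for _, vid in global_top_k(partials, k=2)]
# best_ids == [0, 2]
```

Since each cluster contributes at most K candidates, the driver merge handles only (number of clusters) x K rows, which is why collecting to the driver is feasible for small K.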
Member

Shouldn't we just shuffle the first stage's output and compute the final top K in a distributed fashion?


Expanding on these workflows, the following sections will go over the data layout and directory structure of the vector index implementation.

#### Bootstrap workflow:
Member

Let's call this the indexer, or index building, consistent with other SI building workflows.

Member

all of these parameters should be fed nicely into a CREATE INDEX .. command syntax


For example, if a column’s embeddings are split into two clusters, the system generates two separate vector indexes for that column.

This process is repeated for every column where vector indexing is enabled. Thus, if two columns each produce two clusters, a total of four vector indexes are constructed.
Member

each column should be its own vector index?


##### Cluster Formation and Index Construction

After loading the embedding vectors from the data files, the system groups them into clusters. Clustering provides horizontal scalability for vector search, preventing the bottlenecks associated with a single monolithic index. Each cluster is processed independently using the configured indexing algorithm, resulting in one vector index per cluster.
Member

Can we add details on how exactly this clustering is done? Are we envisioning the use of something like k-means?
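If k-means is indeed the choice, the clustering step could look like plain Lloyd's iterations. This is a toy sketch under that assumption; a real implementation would run distributed and likely operate on a sample of the embeddings:

```python
import math
import random

def kmeans(vectors, k, iters=10, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = random.Random(seed).sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for vec in vectors:
            nearest = min(range(k), key=lambda i: math.dist(vec, centroids[i]))
            groups[nearest].append(vec)
        # Recompute each centroid as the mean of its group (keep it if empty).
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids, groups

vectors = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]]
centroids, groups = kmeans(vectors, k=2)
# With these well-separated points, each group ends up with two vectors.
```

Each resulting group would then feed one per-cluster index build, which also suggests where periodic re-clustering fits: as updates drift, centroids are recomputed and records reshuffled between indexes.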


Once the split is complete, the records destined for a specific vector index are evenly distributed across the file groups configured for that index and written into their log files. Importantly, the vector index base files stored under `vector_index/.index_files` are not updated during ingestion. Therefore, SQL queries that read only the index files will not immediately reflect changes written to log files. The .index_files are updated only during compaction. This separation ensures ingestion remains lightweight and avoids expensive index rebuild operations.

This design intentionally stores records for each vector index under column-specific file groups. The purpose is to avoid conflicts when multiple compaction plans run concurrently—each targeting different vector index columns. Because of this structure, when a compaction operation is triggered to refresh a specific vector index, only the file groups belonging to that index are scanned. The system does not need to scan every file group in the vector_index partition, enabling more efficient and isolated compaction.
Member

Wouldn't updates cause changes to cluster boundaries? Should this be readjusted based on periodic Hudi clustering?


The ingestion workflow stores the changes to vector embeddings in a vector_index partition directory in log file format. These are not readily available for users to read; to consume these changes, they must be applied to the vector indexes present under the .index_files directory. Hudi's compaction is the table service that applies these changes to the index files.

Compaction for the vector index partition is significantly more complex than compaction for other metadata partitions. This complexity arises because vector index files are stored outside the traditional base-file and log-file structure, and each new version of a vector index is maintained independently from its previous version. Since vector indexes do not support transaction-based change tracking and cannot roll back partial updates, version management must be implemented externally. This external versioning ensures that the currently serving index remains isolated and stable while a new version is being built and updated during compaction.
Member

I want to fit this all under the current storage format/model

- Graph-based approach
- Excellent recall-speed tradeoff
- Widely used in production systems
2. jVector
Member

can we produce some microbenchmarks for this?
