docs: RFC-103 - Support Vector Indexing in Apache Hudi#14255
suryaprasanna wants to merge 7 commits into apache:master
Conversation
Thanks @suryaprasanna for making this! I was wondering if you can create a github issue for your RFC similar to something like this #14219 and then link your RFC there as well. Also wondering if you can change the title for both the issue and this PR to be something like |
| ## Background | ||
| Following are the goals for this RFC | ||
| - Creating vector indexes based on a base column in a table - either an embedding column or a text column. | ||
| - Indexes are automatically kept up to date when the base column changes, consistent with transactional boundaries. |
I am interested in this idea. You might cover later in the RFC what our strategies are for updating the index, since I am assuming this is expensive. Do we update on every commit, paying the cost on each write, or do we follow Hudi's current model of triggering this every x commits or on some time interval to reindex?
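To make the trigger question above concrete, here is a minimal sketch of one possible policy, modeled loosely on how Hudi schedules async table services. The helper name and default thresholds are invented for illustration only:

```python
def should_reindex(commits_since_last: int,
                   elapsed_seconds: float,
                   max_commits: int = 10,
                   max_seconds: float = 3600.0) -> bool:
    """Hypothetical trigger policy: rebuild the vector index after N commits
    or after a time interval, whichever comes first, instead of on every write."""
    return commits_since_last >= max_commits or elapsed_seconds >= max_seconds

# A scheduler loop would consult this after each commit:
# if should_reindex(commits, elapsed): schedule_index_compaction()
```

This mirrors the "every x commits or some time interval" option raised above; the per-commit alternative is simply `max_commits = 1`.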
| - Creating vector indexes based on a base column in a table - either an embedding column or a text column. | ||
| - Indexes are automatically kept up to date when the base column changes, consistent with transactional boundaries. | ||
| - First-class SQL experience for creating, and dropping indexes (Spark) | ||
| - SQL extensions to query the index. (Spark, then Presto/Trino) |
@suryaprasanna I am thinking that maybe we move these two bullet points into the scope of RFC 102, since it handles exposing the search capability to the user.
- First-class SQL experience for creating, and dropping indexes (Spark)
- SQL extensions to query the index. (Spark, then Presto/Trino)
Will link the draft RFC 102: #14219 around exposing vector search
I also think that we have SQL semantics already for creating and dropping indexes in spark, for secondary index https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSecondaryIndexDataTypes.scala#L190C1-L191C1. So wondering if we can leverage similar wiring?
| - Highly optimized C++ library with Python bindings | ||
| - Ideal for large-scale GPU-accelerated similarity search | ||
|
|
||
| For initial implementation, we will look into HNSW and Jvector implementations. |
@suryaprasanna I am curious if you looked into IVF and whether it's actually a good candidate to be implemented for Hudi's internal metadata table. https://blog.dailydoseofds.com/p/approximate-nearest-neighbor-search
I'm wondering if Locality-Sensitive Hashing (LSH) is a good candidate for a hashing index, because of its efficiency, and it may be more friendly for streaming ingestion? https://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf
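For reference, the core of the random-hyperplane variant of LSH mentioned above fits in a few lines of plain Python. This is illustrative only (function names invented), not a proposal for the actual index format:

```python
import random

def random_hyperplanes(dim: int, n_planes: int, seed: int = 42):
    """Generate random hyperplane normals for sign-based (cosine) LSH."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def lsh_signature(vec, planes) -> int:
    """Hash a vector to an n-bit signature: one bit per hyperplane side."""
    bits = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

# Vectors close in cosine distance tend to land in the same bucket,
# which is why LSH is cheap to maintain under streaming ingestion:
planes = random_hyperplanes(dim=4, n_planes=8)
buckets = {}
for name, v in [("a", [1.0, 0.9, 0.1, 0.0]), ("b", [1.0, 1.0, 0.0, 0.1])]:
    buckets.setdefault(lsh_signature(v, planes), []).append(name)
```

New records only need their signature computed against fixed hyperplanes, so no graph or tree restructuring is required on write.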
| #### Supporting transactions for Upsert/deletes: | ||
| Vector embeddings are not static—they can evolve over time as underlying features change or as new embedding generation algorithms are applied. Therefore, the vector index implementation must support upserts and deletes, along with integration into Hudi table services such as rollback and restore operations. | ||
|
|
||
| To achieve this, we propose using a versioning mechanism for maintaining and updating the vector index. Each update to the embeddings or index structure can be associated with a version, enabling consistent rollback and recovery when table operations require reverting to a previous state. |
Should we track any additional metadata, such as the embedding model used to generate this column, or the blob/binary column this embedding column maps to?
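To illustrate what tracking such metadata alongside index versions might look like, here is a hypothetical sketch (class and field names invented, not part of Hudi) of the versioning mechanism described in the RFC text, with rollback support:

```python
class VectorIndexVersions:
    """Hypothetical per-index version log: each commit records the index
    files plus extra metadata such as the embedding model and source column."""

    def __init__(self):
        self._versions = []  # list of dicts, newest last

    def commit(self, instant, index_files, embedding_model, source_column):
        self._versions.append({
            "instant": instant,
            "index_files": list(index_files),
            "embedding_model": embedding_model,
            "source_column": source_column,
        })

    def current(self):
        return self._versions[-1] if self._versions else None

    def rollback_to(self, instant):
        """Drop all versions newer than the given commit instant,
        mirroring a table-level rollback/restore."""
        while self._versions and self._versions[-1]["instant"] > instant:
            self._versions.pop()
```

Keeping the embedding model per version would also make it possible to detect when a re-embedding (new model) invalidates an existing index.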
|
|
||
| #### Clusters of Indexes (Horizontal Scalability for Vector Indexing) | ||
|
|
||
| Approximate Nearest Neighbor (ANN) algorithms are graph-based structures that are inherently designed for in-memory computation. As a result, traditional ANN implementations typically achieve scalability through vertical scaling — by adding more memory and CPU resources to a single node. |
are graph-based structures
The main families of ANN indexing structures are graph-based, tree-based, hashing-based, and quantization-based methods, not just graph-based?
And one base file could have multiple index columns right?
If a base file contains multiple embedding columns, then yes I would assume that you can have indexes built on each one.
|
|
||
| ##### Proposed Approach | ||
|
|
||
| To enable horizontal scalability, we propose organizing the vector embeddings into multiple clusters of indexes. Each cluster will manage a subset of the embeddings, allowing the overall dataset to be evenly partitioned across the clusters. |
How do we organize these clusters: hashing by keys? Can one cluster include multiple file groups, in Hudi's notion?
If I understand correctly, these clusters are basically rows that are grouped together by how similar their embedding values are to one another, and not directly related to the record key being hashed, unless you are talking about some other value being used as the key? @danny0405 @suryaprasanna
https://www.pinecone.io/learn/a-developers-guide-to-ann-algorithms/
Based on how vectors are structured, vector indexing algorithms can be divided into three main categories:
Spatial Partitioning (also known as cluster-based indexing or clustering indexes)
Graph-based indexing (e.g. [HNSW](https://www.pinecone.io/learn/series/faiss/hnsw/))
Hash-based indexing (e.g., [locality-sensitive hashing](https://www.pinecone.io/learn/series/faiss/vector-indexes/#Locality-Sensitive-Hashing))
We’ll skip hash-based indexing in this article because its performance on all aspects, including reads, writes, and storage, is currently worse than that of graph-based and spatial partitioning-based indexing. Almost no vector databases use hash-based indexing nowadays.
Inverted File Index (IVF)
Another popular index is the Inverted File Index (IVF). It’s easy to use, has high search quality, and has reasonable search speed.
It’s the most basic version of the spatial-partitioning index and inherits its advantages including little space overhead and the ability to work with object storage as well as disadvantages like lower query throughput compared to graph-based indexing in-memory and on-disk.
When I was reading online I saw that IVF seemed to be popular with object storage and seems to be a default in other systems in the space https://lancedb.com/docs/indexing/#:~:text=Inverted%20File%20Index%20(IVF)%20Implementation,but%20the%20slower%20the%20query.
My understanding is: we are grouping the embeddings by some clustering algorithm, and eventually each cluster group maps to a file group in the vector search index at the Hudi table level?
rfc/rfc-103/rfc-103.md
|
|
||
| #### Base & Log Files: | ||
|
|
||
| In Hudi’s vector index implementation, embeddings are actually inserted under the index files, while base files contain metadata about the index. File groups used for creating base files will be of the format **<COLUMN_NAME>-<FILE_INDEX>_<WRITE_TOKEN>_<COMMIT>.hfile**. Here, column_name is the column for which the vector index is created. This way, when there are multiple vector indices as part of the same dataset, the compaction operation can be performed on each vector index separately, thereby avoiding large-scale updates to all the indexes at the same time. |
The base file name in the example graph does not really follow this naming convention: **<COLUMN_NAME>-<FILE_INDEX>_<WRITE_TOKEN>_<COMMIT>.hfile**
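For clarity, the naming convention quoted above can be captured by a small helper; this is an illustrative sketch (helper names invented, not actual Hudi code), useful for checking the example graph against the spec:

```python
import re

# <COLUMN_NAME>-<FILE_INDEX>_<WRITE_TOKEN>_<COMMIT>.hfile, per the RFC text.
NAME_RE = re.compile(
    r"^(?P<column>.+)-(?P<file_index>\d+)_(?P<write_token>\d+)_(?P<commit>\d+)\.hfile$"
)

def make_index_file_name(column, file_index, write_token, commit):
    """Build a vector index base file name from its components."""
    return f"{column}-{file_index}_{write_token}_{commit}.hfile"

def parse_index_file_name(name):
    """Split a base file name back into its components, or fail loudly."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a vector index base file: {name}")
    return m.groupdict()
```

Prefixing the column name is what lets compaction plans for different index columns select their file groups without scanning each other's files.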
| - SQL extensions to query the index. (Spark, then Presto/Trino) | ||
|
|
||
| Non-goals/Unclear: | ||
| - Fast serving layer, directly usable from RAG applications (this can be left to existing ODBC/SQL gateways that can talk to Spark?) |
@suryaprasanna For these RAG applications, can the RAG application just be a python script shared via pyspark? The developer could then directly invoke our vector search API from RFC 102. I saw this basic blog on RAG https://huggingface.co/learn/cookbook/en/rag_with_hf_and_milvus that shows an example with a vector database like Milvus, so I am wondering if you could instead swap out Milvus with Hudi acting as the vector store, using pyspark as the execution layer.
Or is spark not suited for these kinda use cases?
@rahil-c @suryaprasanna can we merge this and the other RFC PR? Also please follow process and grab a RFC number
vinothchandar
left a comment
Left a lot of comments. Overall, I'd love for this to be rewritten based on what's already available in tableVersion=8/9.
| - @vinoth | ||
| - @rahil-c | ||
|
|
||
| ## Status |
| ##### Proposed Approach | ||
|
|
||
| To enable horizontal scalability, we propose organizing the vector embeddings into multiple clusters of indexes. Each cluster will manage a subset of the embeddings, allowing the overall dataset to be evenly partitioned across the clusters. | ||
| During execution, each Spark executor can load the relevant cluster’s index into memory—or operate in a hybrid mode using both disk and memory—to perform local nearest neighbor searches efficiently. |
so would this build the data structure needed for ANN each time a query is executed?
|
|
||
| #### Read Flow on Vector Index: | ||
|
|
||
| When users access the vector index on a column's embeddings to find the top K nearest neighbours for a vector, the query will first start a Spark stage to find the top K nearest neighbours within each of the clusters, with the work distributed using Spark. Say there are 10 clusters: then a total of 10xK vectors are collected as an intermediate result, and in the final step the nearest top K among these 10xK vectors is computed on the driver and returned. |
Shouldn't we just shuffle the output of the first stage and compute the final top K in a distributed fashion?
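The two-stage top-K described in the RFC text (per-cluster top K, then a final merge over the 10xK candidates) can be sketched as follows; whether the merge step runs on the driver or as a distributed shuffle is exactly the open question here. All names are illustrative:

```python
import heapq
import math

def euclidean(a, b) -> float:
    """Distance metric; the real index would likely use cosine or dot product."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_in_cluster(query, cluster_vectors, k):
    """Stage 1: each executor returns its cluster-local top K candidates."""
    return heapq.nsmallest(k, cluster_vectors, key=lambda v: euclidean(query, v))

def top_k_global(query, clusters, k):
    """Stage 2: merge the per-cluster candidates (e.g. 10xK of them)
    into the final top K. Here it runs locally, standing in for the driver."""
    candidates = [v for c in clusters for v in top_k_in_cluster(query, c, k)]
    return heapq.nsmallest(k, candidates, key=lambda v: euclidean(query, v))
```

Correctness holds either way because each cluster's true top K is a superset filter; only the merge's placement (driver vs. shuffle) changes.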
|
|
||
| Expanding on these workflows, following sections will go over the data layout and directory structure of vector index implementation. | ||
|
|
||
| #### Bootstrap workflow: |
Let's call this the indexer, or index building, consistent with other SI building workflows.
all of these parameters should be fed nicely into a CREATE INDEX .. command syntax
|
|
||
| For example, if a column’s embeddings are split into two clusters, the system generates two separate vector indexes for that column. | ||
|
|
||
| This process is repeated for every column where vector indexing is enabled. Thus, if two columns each produce two clusters, a total of four vector indexes are constructed. |
each column should be its own vector index?
|
|
||
| ##### Cluster Formation and Index Construction | ||
|
|
||
| After loading the embedding vectors from the data files, the system groups them into clusters. Clustering provides horizontal scalability for vector search, preventing the bottlenecks associated with a single monolithic index. Each cluster is processed independently using the configured indexing algorithm, resulting in one vector index per cluster. |
Can we add details on how exactly this clustering is done? Are we envisioning the use of something like k-means?
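If k-means were the clustering choice, the core loop is small enough to sketch in plain Python. This is illustrative only (a real implementation would run over Spark, on sampled embeddings, with a proper distance metric), not a proposed implementation:

```python
import random

def kmeans(vectors, k, iters=10, seed=0):
    """Minimal Lloyd's k-means over plain Python lists of floats."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        # Assignment step: each vector joins its nearest centroid.
        for v in vectors:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(v, centroids[i])))
            groups[idx].append(v)
        # Update step: move each centroid to the mean of its group.
        for i, g in enumerate(groups):
            if g:
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids, groups
```

Each resulting group would then be handed to the configured ANN algorithm to build one index per cluster, per the quoted RFC text.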
|
|
||
| Once the split is complete, the records destined for a specific vector index are evenly distributed across the file groups configured for that index and written into their log files. Importantly, the vector index base files stored under `vector_index/.index_files` are not updated during ingestion. Therefore, SQL queries that read only the index files will not immediately reflect changes written to log files. The .index_files are updated only during compaction. This separation ensures ingestion remains lightweight and avoids expensive index rebuild operations. | ||
|
|
||
| This design intentionally stores records for each vector index under column-specific file groups. The purpose is to avoid conflicts when multiple compaction plans run concurrently—each targeting different vector index columns. Because of this structure, when a compaction operation is triggered to refresh a specific vector index, only the file groups belonging to that index are scanned. The system does not need to scan every file group in the vector_index partition, enabling more efficient and isolated compaction. |
Wouldn't updates cause changes to cluster boundaries? Should this be readjusted based on periodic Hudi clustering?
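The even distribution of incoming records across an index's file groups, as described in the quoted text, could be as simple as round-robin routing. A hypothetical sketch (names invented; a real implementation might hash instead, to keep routing stable across batches):

```python
def route_to_file_groups(records, file_groups):
    """Round-robin records evenly across the file groups configured
    for one vector index column, as in the ingestion split above."""
    buckets = {fg: [] for fg in file_groups}
    for i, rec in enumerate(records):
        buckets[file_groups[i % len(file_groups)]].append(rec)
    return buckets
```

Because file groups are column-specific, two concurrent compactions on different index columns never touch the same buckets.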
|
|
||
| The ingestion workflow stores the changes to vector embeddings in a vector_index partition directory in log file format. These are not readily available for users to read. For users to consume these changes, they need to be applied to the vector indexes that are present under the .index_files directory. Hudi’s compaction is the table service that applies these changes onto the index files. | ||
|
|
||
| Compaction for the vector index partition is significantly more complex than compaction for other metadata partitions. This complexity arises because vector index files are stored outside the traditional base-file and log-file structure, and each new version of a vector index is maintained independently from its previous version. Since vector indexes do not support transaction-based change tracking and cannot roll back partial updates, version management must be implemented externally. This external versioning ensures that the currently serving index remains isolated and stable while a new version is being built and updated during compaction. |
I want to fit this all under the current storage format/model
| - Graph-based approach | ||
| - Excellent recall-speed tradeoff | ||
| - Widely used in production systems | ||
| 2. jVector |
can we produce some microbenchmarks for this?
Describe the issue this Pull Request addresses
Issue: #14290
This pull request proposes an RFC for adding support for vector indexes in Apache Hudi.
Summary and Changelog
Impact
none
Risk Level
low
Documentation Update
Contributor's checklist