Part of #18676. RFC-104 / design PR.
Scope
Prove the milestone-1 pipeline works end-to-end on Spark.
Tasks
- New Scala test `hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorIndexBootstrap.scala`.
- Test flow:
  - Write a small Hudi MOR table (~1k rows) with a `vector` column populated by synthetic embeddings drawn from K well-separated Gaussian clusters in R^32.
  - Run `CREATE INDEX vec_idx ON tbl USING vector_index (vector) OPTIONS (numClusters = 'K', fgPerCluster = '2')`.
- Assertions:
  - MDT partition `vector_index_vec_idx` exists on disk.
  - MDT file-group count equals `K * fgPerCluster`.
  - Every base-table record key appears exactly once in the MDT partition.
  - Each MDT record's `clusterId` is in `[0, K)` and its `vector` field matches the base-table vector for that key.
  - Bonus assertion: KMeans recovered the synthetic clusters (centroid-to-truth nearest-neighbor distance below a threshold).
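
A minimal sketch of the synthetic-data side of the test flow and the bonus assertion, in plain Scala with no Spark or Hudi dependency. Everything named here (`SyntheticClusters`, `Embedding`, `centers`, `generate`, `clusterMeans`, `worstRecoveryDistance`) is hypothetical and not part of the design. The one-axis-per-cluster placement of centers is just one easy way to get well-separated clusters, and "for each true center, distance to the nearest learned centroid" is one possible reading of the centroid-to-truth check. `clusterMeans` is only a stand-in for the centroids the index would actually learn.

```scala
import scala.util.Random

// Hypothetical helpers for illustration only; none of these names come from the RFC.
object SyntheticClusters {
  // One labelled embedding: record key, ground-truth cluster, and the vector itself.
  final case class Embedding(key: String, trueCluster: Int, vector: Array[Float])

  // Assumption: separate the K centers by giving each a large offset on its own axis.
  def centers(k: Int, dim: Int, separation: Float): Array[Array[Float]] = {
    require(k <= dim, "this simple placement needs k <= dim")
    Array.tabulate(k, dim)((c, d) => if (c == d) separation else 0f)
  }

  // Draw n points round-robin across the clusters, adding isotropic Gaussian noise.
  def generate(n: Int, truth: Array[Array[Float]], stdDev: Float, seed: Long): Seq[Embedding] = {
    val rng = new Random(seed)
    (0 until n).map { i =>
      val c = i % truth.length
      val v = truth(c).map(x => x + (rng.nextGaussian() * stdDev).toFloat)
      Embedding(s"key_$i", c, v)
    }
  }

  // Euclidean distance between two vectors of equal length.
  def distance(a: Array[Float], b: Array[Float]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => val d = (x - y).toDouble; d * d }.sum)

  // Stand-in for learned centroids: the per-cluster mean of the generated points.
  def clusterMeans(data: Seq[Embedding], k: Int): Array[Array[Float]] =
    Array.tabulate(k) { c =>
      val members = data.filter(_.trueCluster == c)
      val dim = members.head.vector.length
      Array.tabulate(dim)(d => (members.map(_.vector(d).toDouble).sum / members.size).toFloat)
    }

  // For each true center, the distance to its nearest learned centroid; return the worst case.
  def worstRecoveryDistance(truth: Array[Array[Float]], learned: Array[Array[Float]]): Double =
    truth.map(t => learned.map(l => distance(t, l)).min).max
}
```

In the real test, `learned` would come from whatever the index build exposes for its centroids, and the bonus assertion would compare `worstRecoveryDistance(truth, learned)` against a threshold chosen relative to `separation` and `stdDev`.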
Depends on
- Sub-issues 1–6 (this is the integration test that lights up the whole milestone)