[Vector Index] End-to-end integration test for vector index bootstrap #18856

@rahil-c

Description

Part of #18676. RFC-104 / design PR.

Scope

Prove the milestone-1 pipeline works end-to-end on Spark.

Tasks

  • New Scala test hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorIndexBootstrap.scala.
  • Test flow:
    1. Write a small Hudi MOR table (~1k rows) with a vector column populated by synthetic embeddings drawn from K well-separated Gaussian clusters in R^32.
    2. Run CREATE INDEX vec_idx ON tbl USING vector_index (vector) OPTIONS (numClusters = 'K', fgPerCluster = '2'), where K is the number of synthetic clusters from step 1.
    3. Assertions:
      • MDT partition vector_index_vec_idx exists on disk.
      • MDT file-group count equals K * fgPerCluster.
      • Every base-table record key appears exactly once in the MDT partition.
      • Each MDT record's clusterId is in [0, K) and its vector field matches the base-table vector for that key.
      • Bonus assertion: KMeans recovered the synthetic clusters (centroid-to-truth nearest-neighbor distance below a threshold).
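
The data generation in step 1 and the record-level and bonus assertions in step 3 can be sketched in plain Scala as below. This is a minimal sketch under stated assumptions, not the planned test: the object and helper names (SyntheticClusters, centers, sample, recovered, and so on) are hypothetical, the per-cluster sample means only stand in for the centroids that KMeans would produce, and nothing here touches Hudi or Spark. How the real test reads records and centroids back from the MDT partition is not specified in this issue.

```scala
import scala.util.Random

// Hypothetical helpers; none of these names come from Hudi or from this issue.
object SyntheticClusters {
  type Vec = Array[Float]

  // K centers, each `spacing` away from the origin along its own axis (needs k <= dim).
  def centers(k: Int, dim: Int, spacing: Float): Seq[Vec] = {
    require(k <= dim, "one axis per cluster")
    (0 until k).map { c =>
      val v = new Array[Float](dim)
      v(c) = spacing
      v
    }
  }

  // (recordKey, trueClusterId, vector) rows with isotropic Gaussian noise around each center.
  def sample(cs: Seq[Vec], rowsPerCluster: Int, sigma: Float, seed: Long): Seq[(String, Int, Vec)] = {
    val rng = new Random(seed)
    for {
      (center, c) <- cs.zipWithIndex
      i <- 0 until rowsPerCluster
    } yield (s"key_${c}_$i", c, center.map(x => x + sigma * rng.nextGaussian().toFloat))
  }

  // Euclidean distance between two vectors of equal length.
  def dist(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => val d = (x - y).toDouble; d * d }.sum)

  // Component-wise mean of a non-empty group of vectors.
  def mean(vs: Seq[Vec]): Vec =
    Array.tabulate(vs.head.length)(d => (vs.map(v => v(d).toDouble).sum / vs.size).toFloat)

  // "Every base-table record key appears exactly once" on the index side.
  def keysAppearExactlyOnce(baseKeys: Seq[String], indexKeys: Seq[String]): Boolean =
    indexKeys.size == indexKeys.distinct.size && indexKeys.toSet == baseKeys.toSet

  // "clusterId is in [0, K)".
  def clusterIdsInRange(ids: Seq[Int], k: Int): Boolean =
    ids.forall(id => id >= 0 && id < k)

  // Bonus check: every learned centroid lies within `threshold` of its nearest true center.
  def recovered(truth: Seq[Vec], learned: Seq[Vec], threshold: Double): Boolean =
    learned.forall(c => truth.map(t => dist(c, t)).min < threshold)

  def main(args: Array[String]): Unit = {
    val k = 8
    val truth = centers(k, dim = 32, spacing = 10f)
    val rows = sample(truth, rowsPerCluster = 125, sigma = 0.1f, seed = 42L) // 1000 rows
    // Stand-in for KMeans output: the mean of each true cluster.
    val learned = rows.groupBy(_._2).values.toSeq.map(rs => mean(rs.map(_._3)))
    assert(keysAppearExactlyOnce(rows.map(_._1), rows.map(_._1)))
    assert(clusterIdsInRange(rows.map(_._2), k))
    assert(recovered(truth, learned, threshold = 1.0))
  }
}
```

Placing each center on its own axis keeps every pair of centers spacing * sqrt(2) apart, so with sigma much smaller than spacing a fixed distance threshold separates "recovered" from "not recovered" without tuning. A fixed seed keeps the test deterministic.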

Depends on

  • Sub-issues 1–6 (this is the integration test that lights up the whole milestone)
