[Vector Index] End-to-end integration test for vector index bootstrap #18856

Part of #18676. RFC-104 / [design PR](https://github.com/chrevanthreddy/hudi/pull/1).

## Scope

Prove the milestone-1 pipeline works end-to-end on Spark.

## Tasks

- New Scala test `hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorIndexBootstrap.scala`.
- Test flow:
  1. Write a small Hudi MOR table (~1k rows) with a `vector` column populated by synthetic embeddings drawn from K well-separated Gaussian clusters in R^32.
  2. Run `CREATE INDEX vec_idx ON tbl USING vector_index (vector) OPTIONS (numClusters = 'K', fgPerCluster = '2')`, substituting the actual cluster count for `K`.
  3. Assertions:
     - MDT partition `vector_index_vec_idx` exists on disk.
     - MDT file-group count equals `K * fgPerCluster`.
     - Every base-table record key appears exactly once in the MDT partition.
     - Each MDT record's `clusterId` is in `[0, K)` and its `vector` field matches the base-table vector for that key.
     - Bonus assertion: KMeans recovered the synthetic clusters (centroid-to-truth nearest-neighbor distance below a threshold).
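
The synthetic data in step 1 and the bonus centroid check could be sketched roughly as below. This is plain Scala with no Spark or Hudi calls. All names (`SyntheticClusters`, `trueCentroids`, `sample`, `maxNearestDistance`) are hypothetical and not part of the design, and placing each cluster center on its own axis is just one assumed way to get well-separated clusters.

```scala
import scala.util.Random

// Hypothetical helper, not part of Hudi: generates test embeddings and
// scores how well a set of learned centroids matches the ground truth.
object SyntheticClusters {
  type Vec = Array[Float]

  /** K centers, each `separation` along its own axis and 0 elsewhere (assumes k <= dim). */
  def trueCentroids(k: Int, dim: Int, separation: Float): Array[Vec] = {
    require(k <= dim, "this layout needs one axis per cluster")
    Array.tabulate(k) { c =>
      Array.tabulate(dim)(d => if (d == c) separation else 0f)
    }
  }

  /** Draws n rows round-robin over the clusters: (recordKey, trueClusterId, vector). */
  def sample(centroids: Array[Vec], n: Int, sigma: Float, seed: Long): Seq[(String, Int, Vec)] = {
    val rng = new Random(seed) // fixed seed keeps the test deterministic
    (0 until n).map { i =>
      val c = i % centroids.length
      // Center plus isotropic Gaussian noise with standard deviation sigma.
      val v = centroids(c).map(x => x + (rng.nextGaussian() * sigma).toFloat)
      (f"key_$i%05d", c, v)
    }
  }

  /** Euclidean distance between two vectors of equal length. */
  def l2(a: Vec, b: Vec): Double =
    math.sqrt(a.indices.map { i => val d = (a(i) - b(i)).toDouble; d * d }.sum)

  /** For every true centroid, the distance to its nearest learned centroid; returns the worst case. */
  def maxNearestDistance(truth: Array[Vec], learned: Array[Vec]): Double =
    truth.map(t => learned.map(l2(t, _)).min).max
}
```

In the test, `learned` would be whatever centroids the index build produces, and the bonus assertion would compare `maxNearestDistance(truth, learned)` against a threshold chosen well below the spacing between centers (here `separation * sqrt(2)`).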

## Depends on

- Sub-issues 1–6 (this is the integration test that lights up the whole milestone)

