
[Vector Index] CREATE INDEX ... USING vector_index DDL + index definition wiring #18855

@rahil-c

Part of #18676. RFC-104 / design PR.

Scope

User-facing entry point that triggers the bootstrap pipeline from sub-issue 5.

Tasks

  • Extend hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/IndexCommands.scala to recognize the vector_index index type.
  • Extend hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/HoodieSparkIndexClient.java to:
    • Accept SQL options: vectorColumn (required), numClusters (optional, default from config), fgPerCluster (optional, default from config).
    • Validate the column exists on the table and is of type array<float> or array<double>.
    • Persist user-supplied params into HoodieIndexDefinition (so the bootstrap can read them back).
    • Invoke ScheduleIndexActionExecutor, which hands off to the metadata writer bootstrap path.
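
A minimal sketch of the option and column checks listed above, for discussion only. Everything here is hypothetical: the class name, the method names, the placeholder defaults (the real defaults are supposed to come from config), and the simplified schema, which is just a map from column name to a Spark-style type string. It does not use any Hudi or Spark API and is not the proposed change to HoodieSparkIndexClient.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of Hudi. Illustrates the validation rules only. */
final class VectorIndexOptionsSketch {

  // Placeholder values. The issue says the real defaults come from config.
  static final int ASSUMED_DEFAULT_NUM_CLUSTERS = 64;
  static final int ASSUMED_DEFAULT_FG_PER_CLUSTER = 1;

  /**
   * Validates the SQL options against a simplified schema (column name -> type string)
   * and returns the resolved options that would be persisted with the index definition.
   */
  static Map<String, String> resolve(Map<String, String> options, Map<String, String> schema) {
    String column = options.get("vectorColumn");
    if (column == null || column.trim().isEmpty()) {
      throw new IllegalArgumentException("vector_index requires the 'vectorColumn' option");
    }
    String type = schema.get(column);
    if (type == null) {
      throw new IllegalArgumentException("Column '" + column + "' does not exist on the table");
    }
    // Normalize so that "ARRAY<FLOAT>" and "array< float >" are treated alike.
    String normalized = type.replace(" ", "").toLowerCase(Locale.ROOT);
    if (!normalized.equals("array<float>") && !normalized.equals("array<double>")) {
      throw new IllegalArgumentException(
          "Column '" + column + "' must be array<float> or array<double>, found " + type);
    }
    Map<String, String> resolved = new LinkedHashMap<>();
    resolved.put("vectorColumn", column);
    resolved.put("numClusters",
        String.valueOf(positiveInt(options, "numClusters", ASSUMED_DEFAULT_NUM_CLUSTERS)));
    resolved.put("fgPerCluster",
        String.valueOf(positiveInt(options, "fgPerCluster", ASSUMED_DEFAULT_FG_PER_CLUSTER)));
    return resolved;
  }

  /** Parses an optional positive integer option, falling back to the given default. */
  static int positiveInt(Map<String, String> options, String key, int fallback) {
    String raw = options.get(key);
    if (raw == null) {
      return fallback;
    }
    int value;
    try {
      value = Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("Option '" + key + "' must be an integer, got '" + raw + "'");
    }
    if (value <= 0) {
      throw new IllegalArgumentException("Option '" + key + "' must be positive, got " + value);
    }
    return value;
  }
}
```

Returning the resolved map, defaults included, is one way to let the bootstrap read back exactly the parameters the index was created with. Whether defaults should be materialized at DDL time or at bootstrap time is an open design choice.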

Example DDL the change must support:

CREATE INDEX my_vec_idx ON hudi_tbl
USING vector_index (embedding)
OPTIONS (numClusters = '128', fgPerCluster = '2');
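
The example DDL names the column in the USING clause, while the task list describes vectorColumn as an option. One possible reading, shown here purely as an assumption, is that the single column in the column list is carried into the persisted options as vectorColumn. The class and method names below are hypothetical and the sketch uses no Hudi or Spark API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Hudi. Assumes the DDL column list supplies vectorColumn. */
final class VectorIndexDdlSketch {

  /** Merges the DDL column list and OPTIONS clause into one option map. */
  static Map<String, String> toIndexOptions(List<String> columns, Map<String, String> ddlOptions) {
    if (columns.size() != 1) {
      throw new IllegalArgumentException(
          "vector_index expects exactly one column, got " + columns.size());
    }
    String fromColumnList = columns.get(0);
    Map<String, String> merged = new LinkedHashMap<>(ddlOptions);
    String fromOptions = merged.get("vectorColumn");
    // Reject a conflicting explicit option rather than silently picking one.
    if (fromOptions != null && !fromOptions.equals(fromColumnList)) {
      throw new IllegalArgumentException(
          "vectorColumn option '" + fromOptions + "' conflicts with column '" + fromColumnList + "'");
    }
    merged.put("vectorColumn", fromColumnList);
    return merged;
  }
}
```

Under this assumption, the DDL above would yield vectorColumn = embedding, numClusters = 128, and fgPerCluster = 2. If vectorColumn is instead meant to be supplied only through OPTIONS, the example DDL would need that option added.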

Tests

  • Negative test: missing vectorColumn option → clear error.
  • Negative test: non-array column → clear error.
  • Positive test: valid DDL parses and persists the HoodieIndexDefinition correctly.

Depends on

  • Sub-issues 1, 5 (need partition type + bootstrap implementation)

Out of scope

DROP INDEX and REFRESH INDEX for vector indexes (later milestone).

Labels: type:feature (New features and enhancements)
