[Vector Index] File-group mapping function for cluster-to-file-group routing #18852

@rahil-c

Part of #18676. RFC-104 / design PR.

Scope

Records belonging to the same cluster must land in the same contiguous range of MDT (metadata table) file groups, much as a cluster can be thought of as a folder containing N files. This sub-task adds the mapping function that the MDT writer uses to route records.

Tasks

  • Add getVectorKeyToFileGroupMappingFunction(numClusters, fgPerCluster) in hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java.
  • Key encoding: prefix the record key with the cluster ID, e.g. C<hex(clusterId)>|<recordKey>. This allows per-cluster prefix scans at read time.
  • Mapping: fileGroupIndex = (clusterId * fgPerCluster) + (hash(recordKey) % fgPerCluster). The modulo result must be non-negative so the index stays inside the cluster's range.
  • Override getFileGroupMappingFunction(HoodieIndexVersion) on the VECTOR_INDEX enum in MetadataPartitionType so MDT routes records to the right file group.
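
The key encoding and mapping formula in the tasks above can be sketched as follows. This is a minimal, standalone illustration and not the proposed Hudi implementation: the class VectorKeyRouting, its method names, the ToIntFunction return type, and the use of String.hashCode() as the hash are all assumptions made for the example. The real function would live in HoodieTableMetadataUtil and use whatever hash and signature the MDT already relies on.

```java
import java.util.function.ToIntFunction;

/** Hypothetical sketch; class and method names are assumptions, not Hudi APIs. */
final class VectorKeyRouting {
  private static final char SEPARATOR = '|';

  /** Encodes a key as C<hex(clusterId)>|<recordKey>. */
  static String encodeKey(int clusterId, String recordKey) {
    return "C" + Integer.toHexString(clusterId) + SEPARATOR + recordKey;
  }

  /** Parses the cluster ID back out of an encoded key. */
  static int clusterIdOf(String encodedKey) {
    int sep = encodedKey.indexOf(SEPARATOR);
    if (sep < 2 || encodedKey.charAt(0) != 'C') {
      throw new IllegalArgumentException("Not a cluster-prefixed key: " + encodedKey);
    }
    return Integer.parseUnsignedInt(encodedKey.substring(1, sep), 16);
  }

  /** Returns the original record key (everything after the first separator). */
  static String recordKeyOf(String encodedKey) {
    return encodedKey.substring(encodedKey.indexOf(SEPARATOR) + 1);
  }

  /**
   * Builds a function from encoded key to file group index:
   * (clusterId * fgPerCluster) + (hash(recordKey) % fgPerCluster).
   */
  static ToIntFunction<String> mappingFunction(int numClusters, int fgPerCluster) {
    if (numClusters <= 0 || fgPerCluster <= 0) {
      throw new IllegalArgumentException("numClusters and fgPerCluster must be positive");
    }
    return encodedKey -> {
      int clusterId = clusterIdOf(encodedKey);
      if (clusterId < 0 || clusterId >= numClusters) {
        throw new IllegalArgumentException("Cluster ID out of range: " + clusterId);
      }
      // Assumption: String.hashCode() stands in for the real hash.
      // floorMod keeps the offset in [0, fgPerCluster) even for negative hashes.
      int offset = Math.floorMod(recordKeyOf(encodedKey).hashCode(), fgPerCluster);
      return Math.addExact(Math.multiplyExact(clusterId, fgPerCluster), offset);
    };
  }
}
```

With this encoding, a prefix scan for cluster c would use the prefix "C" + hex(c) + "|". Including the separator in the prefix keeps, for example, cluster 0x1 from matching keys of cluster 0x10, since the hex part here is not zero-padded.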

Tests

  • Unit test: insert many synthetic (recordKey, clusterId) tuples; assert all records for cluster c land in file groups [c*fgPerCluster, (c+1)*fgPerCluster).
  • Unit test: with varying fgPerCluster (1, 4, 16), assert that records within a cluster are distributed roughly uniformly across that cluster's file groups.
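
The two tests above could be built on a helper along these lines. This is a dependency-free sketch, not the actual test code: ClusterRoutingCheck, its methods, the synthetic key shape "key-<cluster>-<i>", and the Murmur3-style bit mixer applied to String.hashCode() are all assumptions for illustration. The real tests would presumably be JUnit tests that call the new function in HoodieTableMetadataUtil.

```java
/** Hypothetical test helper; the hash and key shapes are assumptions, not Hudi code. */
final class ClusterRoutingCheck {

  /** Murmur3-style finalizer, used here only to spread String.hashCode() bits. */
  static int mix(int h) {
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
  }

  /** fileGroupIndex = (clusterId * fgPerCluster) + (hash(recordKey) % fgPerCluster). */
  static int fileGroupIndex(int clusterId, String recordKey, int fgPerCluster) {
    return clusterId * fgPerCluster + Math.floorMod(mix(recordKey.hashCode()), fgPerCluster);
  }

  /**
   * Routes synthetic (recordKey, clusterId) tuples and returns the record count
   * per file group. Fails if any record leaves [c*fgPerCluster, (c+1)*fgPerCluster).
   */
  static int[] route(int numClusters, int fgPerCluster, int recordsPerCluster) {
    int[] counts = new int[numClusters * fgPerCluster];
    for (int c = 0; c < numClusters; c++) {
      for (int i = 0; i < recordsPerCluster; i++) {
        int fg = fileGroupIndex(c, "key-" + c + "-" + i, fgPerCluster);
        if (fg < c * fgPerCluster || fg >= (c + 1) * fgPerCluster) {
          throw new AssertionError("cluster " + c + " routed to file group " + fg);
        }
        counts[fg]++;
      }
    }
    return counts;
  }

  /** Largest relative deviation from the per-file-group mean within one cluster. */
  static double maxRelativeDeviation(int[] counts, int clusterId, int fgPerCluster) {
    int start = clusterId * fgPerCluster;
    double total = 0;
    for (int i = start; i < start + fgPerCluster; i++) {
      total += counts[i];
    }
    if (total == 0) {
      return 0.0; // no records in this cluster, nothing to compare
    }
    double mean = total / fgPerCluster;
    double worst = 0.0;
    for (int i = start; i < start + fgPerCluster; i++) {
      worst = Math.max(worst, Math.abs(counts[i] - mean) / mean);
    }
    return worst;
  }
}
```

A uniformity test would call route for each fgPerCluster in (1, 4, 16) and assert that maxRelativeDeviation stays under a chosen tolerance for every cluster. The tolerance is a judgment call that depends on the record count and on the hash actually used, so none is hard-coded here.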

Depends on

  • Sub-issue 1 (partition type registration)

Out of scope

Writing records into the file groups, which is handled in sub-issue 5.

Metadata

Labels: type:feature (New features and enhancements)
Status: Todo