Part of #18676. RFC-104 / design PR.
Scope
Records belonging to the same cluster must land in the same contiguous range of MDT file groups (by analogy, a cluster is a folder containing N files). This sub-task adds the mapping function used by the MDT writer.
Tasks
- Add `getVectorKeyToFileGroupMappingFunction(numClusters, fgPerCluster)` in `hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java`.
- Key encoding: prefix the record key with the cluster ID, e.g. `C<hex(clusterId)>|<recordKey>`. This allows prefix scans per cluster at read time.
- Mapping: `fileGroupIndex = (clusterId * fgPerCluster) + (hash(recordKey) % fgPerCluster)`.
- Override `getFileGroupMappingFunction(HoodieIndexVersion)` on the `VECTOR_INDEX` enum in `MetadataPartitionType` so MDT routes records to the right file group.
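
For discussion, here is a minimal sketch of the key encoding and mapping described above. It is not the proposed implementation: the class name `VectorKeyMappingSketch`, the helper `encodeKey`, and the `Function<String, Integer>` return type are hypothetical, and `String.hashCode()` stands in for whatever hash the MDT actually uses. `Math.floorMod` is used instead of `%` so that a negative hash cannot produce a negative offset.

```java
import java.util.function.Function;

/** Hypothetical sketch; names and hash choice are assumptions, not Hudi code. */
final class VectorKeyMappingSketch {

  /** Encodes a key as C<hex(clusterId)>|<recordKey>. */
  static String encodeKey(int clusterId, String recordKey) {
    if (clusterId < 0) {
      throw new IllegalArgumentException("clusterId must be non-negative");
    }
    return "C" + Integer.toHexString(clusterId) + "|" + recordKey;
  }

  /** Returns a function mapping an encoded key to its file group index. */
  static Function<String, Integer> mappingFunction(int numClusters, int fgPerCluster) {
    if (numClusters <= 0 || fgPerCluster <= 0) {
      throw new IllegalArgumentException("numClusters and fgPerCluster must be positive");
    }
    return encodedKey -> {
      // Hex digits never contain '|', so the first '|' ends the cluster prefix.
      int sep = encodedKey.indexOf('|');
      if (sep < 2 || encodedKey.charAt(0) != 'C') {
        throw new IllegalArgumentException("Malformed key: " + encodedKey);
      }
      int clusterId = Integer.parseInt(encodedKey.substring(1, sep), 16);
      if (clusterId >= numClusters) {
        throw new IllegalArgumentException("clusterId out of range: " + clusterId);
      }
      String recordKey = encodedKey.substring(sep + 1);
      // Assumption: String.hashCode() is a placeholder for the real hash.
      int offset = Math.floorMod(recordKey.hashCode(), fgPerCluster);
      return Math.addExact(Math.multiplyExact(clusterId, fgPerCluster), offset);
    };
  }

  public static void main(String[] args) {
    Function<String, Integer> mapping = mappingFunction(8, 4);
    String key = encodeKey(3, "row-42");
    System.out.println(key + " -> file group " + mapping.apply(key));
  }
}
```

With `numClusters = 8` and `fgPerCluster = 4`, any key for cluster 3 maps into file groups 12 through 15, regardless of the record key.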
Tests
- Unit test: insert many synthetic `(recordKey, clusterId)` tuples; assert all records for cluster `c` land in file groups `[c*fgPerCluster, (c+1)*fgPerCluster)`.
- Unit test: vary `fgPerCluster` (1, 4, 16) and assert that records within a cluster are distributed roughly uniformly across that cluster's file groups.
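
The two checks above could look roughly like the following plain-Java sketch. Everything here is an assumption made for illustration: the class and method names are hypothetical, `String.hashCode()` again stands in for the real hash, keys are generated from a seeded `java.util.Random`, and the 50% tolerance is an arbitrary, deliberately loose bound. A real test would presumably use the project's test framework and call the actual mapping function.

```java
import java.util.Random;

/** Hypothetical sketch of the range and uniformity checks; not Hudi test code. */
final class VectorMappingChecksSketch {

  /** Same formula as in the task list, with a placeholder hash. */
  static int fileGroupIndex(int clusterId, String recordKey, int fgPerCluster) {
    return clusterId * fgPerCluster + Math.floorMod(recordKey.hashCode(), fgPerCluster);
  }

  /**
   * Maps numRecords synthetic keys for one cluster and returns the count per
   * file group. Throws if any record falls outside [c*fg, (c+1)*fg).
   */
  static int[] countsForCluster(int clusterId, int fgPerCluster, int numRecords, long seed) {
    Random rng = new Random(seed);
    int lo = clusterId * fgPerCluster;
    int hi = lo + fgPerCluster;
    int[] counts = new int[fgPerCluster];
    for (int i = 0; i < numRecords; i++) {
      String recordKey = Long.toHexString(rng.nextLong());
      int idx = fileGroupIndex(clusterId, recordKey, fgPerCluster);
      if (idx < lo || idx >= hi) {
        throw new AssertionError("file group " + idx + " outside [" + lo + ", " + hi + ")");
      }
      counts[idx - lo]++;
    }
    return counts;
  }

  /** True if every count is within tolerance * expected of the mean count. */
  static boolean roughlyUniform(int[] counts, double tolerance) {
    long total = 0;
    for (int c : counts) {
      total += c;
    }
    double expected = (double) total / counts.length;
    for (int c : counts) {
      if (Math.abs(c - expected) > tolerance * expected) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    for (int fg : new int[] {1, 4, 16}) {
      for (int c = 0; c < 8; c++) {
        int[] counts = countsForCluster(c, fg, 16_000, 42L + c);
        if (!roughlyUniform(counts, 0.5)) {
          throw new AssertionError("skewed distribution: cluster " + c + ", fgPerCluster " + fg);
        }
      }
    }
    System.out.println("range and uniformity checks completed");
  }
}
```

The tolerance is kept loose so the check flags gross skew (for example, all records of a cluster landing in one file group) without becoming flaky on ordinary statistical noise.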
Depends on
- Sub-issue 1 (partition type registration)
Out of scope
Actual writing into the file groups, which happens in sub-issue 5.