Balance per-database replicas in PGP and Greedy region allocators #17714
Merged
PartiteGraphPlacementRegionGroupAllocator ignored the databaseAllocatedRegionGroups argument, so PGP balanced only the global per-DataNode region count. On clusters with multiple databases this let one DataNode hold many replicas of one database while holding few of another (e.g. 20 DataNodes / 2 databases / 60 RGs each could produce DN16 with 15 tod_sod0 replicas vs. 3 usr_sod0 replicas), which in turn prevented downstream leader balancing from reaching an even distribution.

Make PGP and the Greedy fallback aware of the per-database load:
- PGP now reads databaseAllocatedRegionGroups, tracks databaseRegionCounter[], and compares candidate sets with a (regionSum, databaseRegionSum, edgeSum) triple. Pre-sort inside the sub-graph uses (regionCount, databaseRegionCount, freeDiskSpace, random) so the fixed alpha slots also honour per-database balance.
- GreedyRegionGroupAllocator adds databaseRegionCount to DataNodeEntry and sorts by (regionCount, databaseRegionCount, freeDiskSpace, random); buildWeightList consumes databaseAllocatedRegionGroups.

The priority order (global > per-db > scatter) matches the user request. PGP's partite-graph structure still provides the high-scatter property by construction, so demoting edgeSum to the tertiary key does not regress scatter width.

Tests:
- New PartiteGraphPlacementRegionGroupAllocatorTest covers rf 2/3/5 multi-database scenarios, including the reported 20-DN/2-db regression. Each DataNode now holds exactly the expected per-db replica count.
- GreedyRegionGroupAllocatorTest gets a new per-database balance test.
- New IoTDBPerDatabaseRegionGroupAllocationIT exercises PGR, GCR, and GREEDY policies end-to-end on a real cluster.
- CommonConfig (+ Mpp/Shared/Remote impls) gains setRegionGroupAllocatePolicy so ITs can switch between allocators.
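The (regionSum, databaseRegionSum, edgeSum) comparison described in this commit message can be sketched roughly as follows. This is only an illustration of a lexicographic "smaller is better" triple, not the code from the PR: the class name `ValueSketch`, its constructor, and the sample numbers are assumptions; only the three field names come from the text above.

```java
import java.util.Comparator;

// Hypothetical stand-in for the Value triple named in the commit message.
// Only the three field names are taken from the PR text; the rest is assumed.
final class ValueSketch implements Comparable<ValueSketch> {
  final int regionSum;         // global per-DataNode region load of the candidate set
  final int databaseRegionSum; // load contributed by the database being allocated
  final int edgeSum;           // 2-region scatter term (PGP only)

  // Lexicographic order: global balance first, then per-database, then scatter.
  private static final Comparator<ValueSketch> ORDER =
      Comparator.comparingInt((ValueSketch v) -> v.regionSum)
          .thenComparingInt(v -> v.databaseRegionSum)
          .thenComparingInt(v -> v.edgeSum);

  ValueSketch(int regionSum, int databaseRegionSum, int edgeSum) {
    this.regionSum = regionSum;
    this.databaseRegionSum = databaseRegionSum;
    this.edgeSum = edgeSum;
  }

  @Override
  public int compareTo(ValueSketch other) {
    return ORDER.compare(this, other);
  }

  public static void main(String[] args) {
    ValueSketch a = new ValueSketch(10, 4, 7);
    ValueSketch b = new ValueSketch(10, 5, 1);
    // Same regionSum, so the lower databaseRegionSum decides; edgeSum is not consulted.
    System.out.println("a preferred over b: " + (a.compareTo(b) < 0));
  }
}
```

In this sketch a candidate with a worse `edgeSum` still wins whenever it is better on either of the first two keys, which is the "global > per-db > scatter" priority stated above.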
The Partite-Graph-Placement allocator's class is PartiteGraphPlacementRegionGroupAllocator, but the RegionGroupAllocatePolicy enum exposed it as PGR. Rename the enum constant to PGP so the user-facing config value matches the algorithm name. Also document region_group_allocate_policy in iotdb-system.properties.template (commented out, defaults to GCR) so the option is discoverable; the property had no template entry before.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #17714 +/- ##
============================================
+ Coverage 40.39% 40.42% +0.02%
+ Complexity 2575 2574 -1
============================================
Files 5179 5179
Lines 349659 349756 +97
Branches 44688 44712 +24
============================================
+ Hits 141251 141380 +129
+ Misses 208408 208376 -32
Summary
`PartiteGraphPlacementRegionGroupAllocator` ignored its `databaseAllocatedRegionGroups` argument, so PGP balanced only the global per-DataNode region count and let one DataNode hold many replicas of one database while holding few of another. On a 20-DataNode / 2-database cluster with 60 DataRegion groups per database (replication factor 3), this could leave DN-16 with 15 replicas of `tod_sod0` but only 3 of `usr_sod0`. That imbalance bled into the leader balancer: per-(database, DataNode) leader counts that should have been a flat 3 ranged from 2 to 4, with one DataNode holding 8 leaders total.

This PR makes both PGP and the Greedy fallback aware of the per-database load.
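The numbers in the reported scenario can be checked with a few lines of arithmetic. The helper below is a hypothetical sketch (the class and method names are not from the PR); it only reproduces the even-spread target implied by 60 region groups, replication factor 3, and 20 DataNodes, and compares it with the DN-16 figures quoted above.

```java
// Hypothetical helper for the arithmetic in the summary; not part of the PR.
final class ReplicaMathSketch {

  // Replicas of one database that each DataNode holds under a perfectly even spread.
  static int expectedReplicasPerDataNode(int regionGroups, int replicationFactor, int dataNodes) {
    int totalReplicas = regionGroups * replicationFactor;
    if (totalReplicas % dataNodes != 0) {
      throw new IllegalArgumentException("replicas do not divide evenly across DataNodes");
    }
    return totalReplicas / dataNodes;
  }

  public static void main(String[] args) {
    int perDatabase = expectedReplicasPerDataNode(60, 3, 20); // 60 * 3 / 20
    int todOnDn16 = 15; // replicas of tod_sod0 reported on DN-16
    int usrOnDn16 = 3;  // replicas of usr_sod0 reported on DN-16

    System.out.println("even per-database target: " + perDatabase);
    System.out.println("DN-16 total: " + (todOnDn16 + usrOnDn16)
        + ", global target for 2 databases: " + (2 * perDatabase));
    System.out.println("DN-16 deviation from target: tod_sod0 "
        + (todOnDn16 - perDatabase) + ", usr_sod0 " + (usrOnDn16 - perDatabase));
  }
}
```

DN-16's total of 18 replicas equals the global target, which is why a balancer that only looks at the global per-DataNode count sees nothing to fix even though each database is off its own target by 6.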
Approach
Comparison priority (smaller is better) for both allocators:
1. `regionSum`: total per-DataNode region count (global balance)
2. `databaseRegionSum`: per-(database, DataNode) region count (new)
3. `edgeSum`: 2-region scatter (PGP only)

PGP's partite-graph structure already provides the high-scatter property by construction, so demoting `edgeSum` to the tertiary key does not regress scatter width.

PartiteGraphPlacementRegionGroupAllocator
- `prepare()` now consumes `databaseAllocatedRegionGroups` and builds a `databaseRegionCounter[]` parallel to `regionCounter[]`.
- `valuation()` returns a `Value(regionSum, databaseRegionSum, edgeSum)` triple; `subGraphSearch` and `partiteGraphSearch` compare via `Value.compareTo`.
- `PgpDataNodeEntry` ordered by `(regionCount, databaseRegionCount, freeDiskSpace, random)` so the fixed alpha slots also honour per-database balance.

GreedyRegionGroupAllocator
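As a rough illustration of the `(regionCount, databaseRegionCount, freeDiskSpace, random)` entry ordering that this PR applies to the DataNode entries, a candidate sort could look like the sketch below. The class `DataNodeEntrySketch`, its fields beyond the four named keys, and the sample data are hypothetical. In particular, preferring more free disk space and treating the random key as a precomputed integer are assumptions; the PR text names the keys but not their direction or representation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical entry type; only the four sort keys are named in the PR text.
final class DataNodeEntrySketch implements Comparable<DataNodeEntrySketch> {
  final int dataNodeId;
  final int regionCount;         // regions already on this DataNode, all databases
  final int databaseRegionCount; // regions of the database currently being allocated
  final double freeDiskSpace;    // assumption: more free space is preferred
  final int randomWeight;        // assumption: precomputed random tie-breaker

  DataNodeEntrySketch(int dataNodeId, int regionCount, int databaseRegionCount,
      double freeDiskSpace, int randomWeight) {
    this.dataNodeId = dataNodeId;
    this.regionCount = regionCount;
    this.databaseRegionCount = databaseRegionCount;
    this.freeDiskSpace = freeDiskSpace;
    this.randomWeight = randomWeight;
  }

  @Override
  public int compareTo(DataNodeEntrySketch o) {
    if (regionCount != o.regionCount) {
      return Integer.compare(regionCount, o.regionCount);
    }
    if (databaseRegionCount != o.databaseRegionCount) {
      return Integer.compare(databaseRegionCount, o.databaseRegionCount);
    }
    if (freeDiskSpace != o.freeDiskSpace) {
      return Double.compare(o.freeDiskSpace, freeDiskSpace); // descending
    }
    return Integer.compare(randomWeight, o.randomWeight);
  }

  public static void main(String[] args) {
    List<DataNodeEntrySketch> entries = new ArrayList<>();
    entries.add(new DataNodeEntrySketch(1, 9, 15, 0.5, 0));
    entries.add(new DataNodeEntrySketch(2, 9, 3, 0.5, 0));
    entries.add(new DataNodeEntrySketch(3, 8, 20, 0.1, 0));
    Collections.sort(entries);
    for (DataNodeEntrySketch e : entries) {
      System.out.println("DataNode " + e.dataNodeId);
    }
  }
}
```

With these sample entries, DataNode 3 sorts first because its global count is lowest, and DataNode 2 precedes DataNode 1 because the per-database count breaks the tie on the global count.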
- `DataNodeEntry` gains a `databaseRegionCount` field; `compareTo` orders by `(regionCount, databaseRegionCount, freeDiskSpace, random)`.
- `buildWeightList` accepts and accumulates `databaseAllocatedRegionGroups`.

IT framework
- `CommonConfig` (and the `Mpp`/`Shared`/`Remote` implementations) gains `setRegionGroupAllocatePolicy(String)` so integration tests can switch between PGR / GCR / GREEDY.

Test plan
- `mvn test -pl iotdb-core/confignode -Dtest='*RegionGroupAllocator*Test'`: 9 unit tests pass.
- `PartiteGraphPlacementRegionGroupAllocatorTest` covers replication factor 2 / 3 / 5 multi-database scenarios. The 20-DN / 2-db / rf-3 regression now distributes each database's 60 region groups as exactly 9 replicas per DataNode (max − min = 0).
- `GreedyRegionGroupAllocatorTest`.
- `mvn verify -DskipUTs -Dit.test='IoTDBPerDatabaseRegionGroupAllocationIT#testPgrPolicyPerDbReplicaBalance' -PClusterIT -P with-integration-tests`: PGR end-to-end IT passes on a 1C4D cluster.
- `GCR` policy.
- `mvn spotless:apply`: clean.

Out of scope / follow-ups