[core][flink] Supporting Per-Partition Bucket Counts#7865
mikedias wants to merge 6 commits into
Problem
In partitioned Paimon tables, all partitions share the same bucket count defined at the table level. This becomes a bottleneck when data is highly skewed: a "hot" partition (e.g., a large tenant) may receive orders of magnitude more data than other partitions, yet it is forced to use the same number of buckets. The only workaround was to increase the number of buckets for the entire table, which in turn creates too many buckets for the smaller partitions, leading to a small-file problem.
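To make the skew concrete, fixed-bucket tables route each row by hashing its bucket key modulo the table-wide bucket count. The sketch below is a simplified illustration (the real logic lives in Paimon's key/bucket extractors and uses Paimon's own hashing); the class and partition names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BucketSkewDemo {

    // Simplified fixed-bucket assignment: hash modulo the table-wide count.
    // Every partition is forced to use the same numBuckets.
    public static int bucketOf(int keyHash, int numBuckets) {
        return Math.abs(keyHash % numBuckets);
    }

    public static void main(String[] args) {
        int numBuckets = 4; // single table-level setting, shared by all partitions
        Map<String, Integer> rowsPerPartition = new LinkedHashMap<>();
        rowsPerPartition.put("tenant=small", 1_000);
        rowsPerPartition.put("tenant=hot", 1_000_000);
        // With a shared count, the hot partition's buckets each hold ~1000x
        // more rows than the small partition's, yet both get 4 buckets.
        for (Map.Entry<String, Integer> e : rowsPerPartition.entrySet()) {
            System.out.println(e.getKey() + " -> ~" + (e.getValue() / numBuckets) + " rows per bucket");
        }
    }
}
```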
Solution
This PR introduces per-partition bucket counts, allowing individual partitions to be independently rescaled. Skewed partitions can be split into more buckets without affecting the rest of the table.
The core idea is a new `PartitionBucketMapping` that maintains an explicit partition → bucket count map alongside a table-level default. Every component that needs to assign a bucket to a row (write selectors, key extractors) now consults this mapping rather than blindly using `schema().numBuckets()`. Each partition's bucket count is derived from the `totalBuckets` field already stamped on its data files in the manifest, so no schema migration is required.
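The resolution rule can be sketched as follows. This is a hypothetical simplification for illustration only: the real `PartitionBucketMapping` keys on `BinaryRow` partitions and is built by a `loadFromTable` factory that scans the manifest, whereas this sketch uses plain strings.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the PR's PartitionBucketMapping.
public class PartitionBucketMappingSketch implements Serializable {

    private final int defaultNumBuckets;             // schema().numBuckets()
    private final Map<String, Integer> perPartition; // real class uses BinaryRow keys

    public PartitionBucketMappingSketch(int defaultNumBuckets, Map<String, Integer> perPartition) {
        this.defaultNumBuckets = defaultNumBuckets;
        this.perPartition = new HashMap<>(perPartition);
    }

    // Partitions with an explicit entry use it; everything else falls back
    // to the table-level default, so unpartitioned and untouched partitions
    // behave exactly as before.
    public int resolveNumBuckets(String partition) {
        return perPartition.getOrDefault(partition, defaultNumBuckets);
    }
}
```

Extractors and write selectors then call `resolveNumBuckets(partition)` per row instead of reading a fixed global count.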
Changes

Core (`paimon-core`)

• `PartitionBucketMapping` (new) — Serializable mapping of `BinaryRow partition → int bucketCount`, with a `loadFromTable` factory that scans the manifest to reconstruct the current per-partition layout and falls back to the schema default gracefully.
• `SchemaBucketFileStoreTable` (new) — A lightweight `DelegatedFileStoreTable` wrapper used during rescale/overwrite operations. It forces all writes to use the new target bucket count (ignoring the per-partition map), ensuring the overwrite lands in the right buckets.
• `FixedBucketRowKeyExtractor` / `FixedBucketWriteSelector` — Updated to accept a `PartitionBucketMapping` and call `resolveNumBuckets(partition)` per row instead of using a fixed global count.
• `WriteRestore` / `FileSystemWriteRestore` — Extended with `extractTotalBuckets` logic that correctly handles three cases: non-empty buckets (use the value from existing data files), empty buckets on partitioned tables (look up the per-partition override), and empty buckets on unpartitioned tables (fall back to the schema default so the committer-side mismatch check still fires).
• `PartitionEntry` — Minor fix for correct behaviour in non-partitioned table corner cases.

Flink (`paimon-flink`)

• `FlinkSinkBuilder` — Wires `PartitionBucketMapping` into the streaming sink pipeline so that per-partition bucket routing is applied at ingest time.
• `RescaleAction` / `CompactAction` — Use `RescaleFileStoreTable` when performing rescale/overwrite so the new bucket count is applied only to the target partitions.
• `RowDataChannelComputer` — Updated to route rows to the correct sub-task using the per-partition bucket count.
• `TableWriteCoordinator` / `PostponeFixedBucketChannelComputer` — Fixed to handle the "empty bucket" scenario that can arise in write-restore flows when a partition exists in the mapping but has no files yet.
• `RowDataKeyAndBucketExtractor` (deleted) — Test helper class replaced by using the superclass types directly.
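The three-way `extractTotalBuckets` decision described above can be sketched as a small pure function. The signature and types here are simplified placeholders for illustration; only the case analysis mirrors the PR.

```java
import java.util.Optional;

public class ExtractTotalBucketsSketch {

    public static int extractTotalBuckets(
            Optional<Integer> totalBucketsFromFiles, // totalBuckets stamped on existing data files
            boolean partitioned,
            Optional<Integer> perPartitionOverride,  // entry from the per-partition mapping
            int schemaDefault) {
        if (totalBucketsFromFiles.isPresent()) {
            // Case 1: the bucket already has files -> trust their totalBuckets field.
            return totalBucketsFromFiles.get();
        }
        if (partitioned) {
            // Case 2: empty bucket on a partitioned table -> use the
            // per-partition override if one exists.
            return perPartitionOverride.orElse(schemaDefault);
        }
        // Case 3: empty bucket on an unpartitioned table -> schema default,
        // so the committer-side mismatch check still fires.
        return schemaDefault;
    }
}
```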
Behaviour

Per-partition bucket counts can only be changed through the `rescale` procedure or a manual `INSERT OVERWRITE` in batch mode; a `RuntimeException` is thrown if this is violated. After the job completes, the rescaled partition uses 32 buckets while all other partitions are untouched.
Testing
We have been soaking this change in our test environments and are seeing good results. In addition, we added a set of new tests to validate that nothing breaks:
• `PartitionBucketMappingTest` — unit tests for mapping resolution and `loadFromTable`.
• `FixedBucketRowKeyExtractorTest` — verifies correct bucket assignment with heterogeneous per-partition counts.
• `FileStoreCommitTest` — integration tests covering rescale commits with mixed bucket counts.
• `FileSystemWriteRestoreTest` — covers the empty-bucket write-restore scenario end-to-end, including the non-partitioned corner case.
• `RescaleBucketITCase` — end-to-end Flink integration tests for `INSERT OVERWRITE`-based rescale and streaming restore after rescale.
• `RescaleActionITCase` — end-to-end tests for the rescale procedure action with per-partition targeting.
• `TableWriteCoordinatorTest` — unit tests for coordinator behaviour under the new mapping.