[python] Pre-repartition Ray writes by (partition, bucket) for fixed-bucket tables#7813
Open
TheR1sing3un wants to merge 3 commits into
Conversation
Force-pushed 0ed753e to 9d6bc0f
Without pre-clustering, Ray's default round-robin block distribution scatters rows that share the same (partition, bucket) across many Ray tasks. Each task opens its own writer, producing partitions x buckets x ray_tasks files instead of the partitions x buckets the writer would naturally produce.

This commit adds a helper module that groups rows by (partition_keys..., bucket) using Ray's groupby/map_groups so all rows for one (partition, bucket) land in one Ray task. Bucket assignment is computed via FixedBucketRowKeyExtractor (the same extractor the writer uses) so the shuffle bucket is byte-identical to the writer's.

The helper is opt-in (defaults to no-op) and the next commit wires it through write_paimon. Non-HASH_FIXED tables fall back with a warning instead of raising.
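The file-count blow-up is easy to see without Ray at all. The stdlib sketch below simulates the two placements described above (round-robin versus pre-clustered) and counts distinct writers, one data file each; the workload sizes (2 partitions, 4 buckets, 8 tasks) are made up for illustration:

```python
from itertools import product

# Hypothetical workload: 2 partitions x 4 buckets, rows spread over 8 Ray tasks.
partitions, buckets, tasks = 2, 4, 8
rows = [(p, b) for p, b in product(range(partitions), range(buckets))
        for _ in range(tasks)]

# Round-robin: row i goes to task i % tasks, so every (partition, bucket)
# hits many tasks and each task opens its own writer for it.
round_robin_writers = {(p, b, i % tasks) for i, (p, b) in enumerate(rows)}

# Pre-clustered: all rows of one (partition, bucket) land in one task,
# so the writer count is exactly partitions x buckets.
clustered_writers = {(p, b) for p, b in rows}

print(len(round_robin_writers))  # 64 = partitions * buckets * tasks
print(len(clustered_writers))    # 8  = partitions * buckets
```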
Adds two new keyword-only parameters to write_paimon():
* shuffle: bool = False — pre-cluster rows by (partition_keys..., bucket)
for HASH_FIXED tables, so each (partition, bucket) lands in one Ray
task. Mirrors what PaimonSparkWriter.repartitionByPartitionsAndBucket
does on the Spark side. Non-HASH_FIXED tables log a warning and
fall back to the original no-shuffle write.
* override_num_blocks: Optional[int] = None — Ray output block count
(mirrors the same-named parameter on read_paimon). With shuffle=True
it is a parallelism hint for the groupby shuffle; with shuffle=False
it triggers a plain Ray block rebalance.
End-to-end coverage in pypaimon/tests/ray_repartition_test.py:
* roundtrip equality for both shuffle=False and shuffle=True
* file-count reduction on a HASH_FIXED multi-block write
* soft fallback for BUCKET_UNAWARE + warning emitted
* sink-visible schema does not carry the transient bucket column
* override_num_blocks alone produces a plain rebalance
Defaults preserve the previous round-robin behaviour so no existing
caller is affected.
Force-pushed 9d6bc0f to 553d9c0
Ray's groupby pipeline drops Arrow field-level not-null annotations, so shuffle-written PK data files lose the not-null on PK columns. When read_paimon reads them back via RayDatasource, from_batches rejects the nullability mismatch (batch: int32 vs schema: int32 not null). This is a pre-existing read-path issue, not caused by shuffle. Switch the shuffle roundtrip tests to read via the direct table API (ReadBuilder -> to_arrow) instead of read_paimon, since these tests verify write correctness, not the Ray read path.
Purpose
When `write_paimon` is given a Ray Dataset, Ray's default round-robin block distribution scatters rows that share the same `(partition, bucket)` across many Ray tasks. Each task opens its own writer and emits its own data file, so the write produces `partitions × buckets × ray_tasks` files instead of the `partitions × buckets` the writer would naturally produce.

Spark and Flink already cluster rows by `(partition, bucket)` before writing; see `PaimonSparkWriter.repartitionByPartitionsAndBucket` and the `RowAssignerChannelComputer`/`RowWithBucketChannelComputer` chain. This PR brings the same pre-clustering to the Ray path.
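The group-then-write shape of that pre-clustering can be sketched in stdlib Python. Note the hedges: the real helper delegates bucket assignment to `FixedBucketRowKeyExtractor` so it matches the writer byte-for-byte, whereas this sketch substitutes `zlib.crc32` as a stand-in hash, and the column names `pt`/`k` are invented for the example:

```python
import zlib
from collections import defaultdict

def bucket_of(key: str, num_buckets: int) -> int:
    # Stand-in hash for illustration only; Paimon's actual bucket function
    # differs, which is why the real helper reuses the writer's extractor.
    return zlib.crc32(key.encode()) % num_buckets

rows = [{"pt": p, "k": f"k{i}"} for p in ("a", "b") for i in range(6)]
num_buckets = 2

# Group rows by (partition, bucket): the role Ray's groupby/map_groups
# plays in the helper, so each group maps to one task and one writer.
groups: dict[tuple, list] = defaultdict(list)
for row in rows:
    groups[(row["pt"], bucket_of(row["k"], num_buckets))].append(row)

print(sorted(groups))                         # at most partitions x buckets keys
print(sum(len(v) for v in groups.values()))   # 12: every row preserved
```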
Linked Issue
N/A
Effect
Two new keyword-only parameters on `write_paimon`:

* `shuffle: bool = False`: for HASH_FIXED tables, group rows by `(partition_keys..., bucket)` via Ray's `groupby`/`map_groups` so each `(partition, bucket)` lands in one Ray task. Bucket assignment is computed with `FixedBucketRowKeyExtractor`, the same extractor the writer uses, so the shuffle-time bucket is byte-equivalent to the writer's. Non-HASH_FIXED tables log a warning and write as before.
* `num_blocks: Optional[int] = None`: optional Ray output block count. With `shuffle=True` it is a parallelism hint for the groupby; with `shuffle=False` it triggers a plain Ray block rebalance.

Defaults preserve the previous behaviour, so no existing caller is affected.
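How the two flags interact can be summarized as a small decision table. The sketch below is illustrative only (the function name `plan_write` and the returned labels are invented; this is not the actual `write_paimon` code), but each branch mirrors a behaviour stated above:

```python
from typing import Optional

def plan_write(shuffle: bool, num_blocks: Optional[int], bucket_mode: str) -> str:
    # Illustrative decision table for the described parameter semantics.
    if num_blocks is not None and num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    if shuffle and bucket_mode == "HASH_FIXED":
        # num_blocks, if set, acts as a parallelism hint for the groupby.
        return "groupby(partition_keys..., bucket)"
    if shuffle:
        # Soft fallback: warn and write exactly as before.
        return "no-op (warn: unsupported bucket mode)"
    if num_blocks is not None:
        return "plain block rebalance"
    return "round-robin (default)"

print(plan_write(True, None, "HASH_FIXED"))
print(plan_write(True, None, "BUCKET_UNAWARE"))
print(plan_write(False, 4, "HASH_FIXED"))
```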
Tests
`pypaimon/tests/test_ray_shuffle_helper.py`: 8 unit tests covering the bucket-key UDF (column type, empty input, multi-chunk combine) and every no-op / soft-fallback branch.

`pypaimon/tests/ray_repartition_test.py`: 7 end-to-end tests:

* `shuffle=False` roundtrip equality
* `shuffle=True` roundtrip on a HASH_FIXED PK table
* `shuffle=True` on a partitioned HASH_FIXED PK table (post-groupby schema integrity check)
* `num_blocks=0` raises `ValueError`
* `num_blocks`-only plain block rebalance

Existing Ray tests (`ray_integration_test.py`, `ray_data_test.py`) remain green.
API & Format Impact
Two new keyword-only parameters added to `pypaimon.ray.write_paimon`. No signature break. The transient `__paimon_bucket__` column is stripped before the sink sees the dataset, so the on-disk layout is unaffected.
Documentation
`docs/content/pypaimon/ray-data.md`: new section explaining the small-file problem and the `shuffle`/`num_blocks` options.