Conversation

HeartSaVioR
Contributor

What changes were proposed in this pull request?

This PR proposes to use different ShuffleOrigin for the shuffle required from stateful operators.

Spark has been using ENSURE_REQUIREMENTS as the ShuffleOrigin, which is open to optimization, e.g. AQE can adjust the shuffle spec. Quoting the code of ENSURE_REQUIREMENTS:

```
// Indicates that the shuffle operator was added by the internal `EnsureRequirements` rule. It
// means that the shuffle operator is used to ensure internal data partitioning requirements and
// Spark is free to optimize it as long as the requirements are still ensured.
case object ENSURE_REQUIREMENTS extends ShuffleOrigin
```
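
To illustrate the point, here is a simplified sketch under assumed names (not Spark's actual AQE code): optimization rules consult the shuffle's origin to decide whether they may rewrite it, and ENSURE_REQUIREMENTS is the origin that grants them that freedom.

```
// Simplified sketch, not Spark's actual implementation: an AQE-style rule
// consults the ShuffleOrigin before deciding whether it may rewrite a shuffle.
sealed trait ShuffleOrigin
case object ENSURE_REQUIREMENTS extends ShuffleOrigin

final case class Shuffle(numPartitions: Int, origin: ShuffleOrigin)

def maybeCoalesce(shuffle: Shuffle, targetPartitions: Int): Shuffle =
  shuffle.origin match {
    // Added only to satisfy internal requirements, so AQE is free to adjust it.
    case ENSURE_REQUIREMENTS => shuffle.copy(numPartitions = targetPartitions)
    // Any other origin is left untouched.
    case _ => shuffle
  }
```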

But the distribution requirement for stateful operators is a lot stricter: the hash for partitioning has to be calculated over all of the expressions, and the number of shuffle partitions must be the same as in the spec. This is because a stateful operator assumes a 1:1 mapping between its partitions and the "physical" partitions of the checkpointed state. Given that, it is fragile to allow any optimization to be made against the shuffle for a stateful operator.
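
As a purely illustrative sketch (the class names here are assumptions, not the PR's code), the stateful requirement only accepts a partitioning whose expressions and partition count both match the checkpointed state exactly:

```
// Illustrative sketch with assumed names: the stateful requirement is
// satisfied only by hash partitioning over exactly the same expressions
// and with exactly the same number of partitions as the checkpointed state.
final case class HashPartitioning(expressions: Seq[String], numPartitions: Int)

final case class StatefulRequirement(expressions: Seq[String], numPartitions: Int) {
  def isSatisfiedBy(p: HashPartitioning): Boolean =
    p.expressions == expressions && p.numPartitions == numPartitions
}

// State checkpointed with 200 partitions hashed on (key1, key2): a shuffle
// coalesced by AQE down to 50 partitions no longer lines up with that state.
val requirement = StatefulRequirement(Seq("key1", "key2"), numPartitions = 200)
requirement.isSatisfiedBy(HashPartitioning(Seq("key1", "key2"), 200)) // true
requirement.isSatisfiedBy(HashPartitioning(Seq("key1", "key2"), 50))  // false
```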

To prevent this, this PR introduces a new ShuffleOrigin with a note that the shuffle is not expected to be "modified".
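
A minimal sketch of the shape of such an origin (the exact identifier and comment wording in the PR may differ):

```
// Sketch only; the actual identifier and comment added by the PR may differ.
// Indicates that the shuffle was added to satisfy the partitioning requirement
// of a stateful operator. Spark must not modify this shuffle, because the
// operator relies on a fixed 1:1 mapping to the checkpointed state partitions.
case object REQUIRED_BY_STATEFUL_OPERATOR extends ShuffleOrigin
```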

Why are the changes needed?

The current behavior opens up the possibility of broken state, given the contract above. We introduced StatefulOpClusteredDistribution for a similar reason.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UT added.

Was this patch authored or co-authored using generative AI tooling?

No.

@HeartSaVioR
Contributor Author

cc @cloud-fan Please take a look, thanks!

@HeartSaVioR
Contributor Author

I see no further feedback from others. Thanks! Merging to master.

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
…quired from stateful operators

Closes apache#48382 from HeartSaVioR/SPARK-49905.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>