[SPARK-57196][SQL] Make UnionExec whole-stage codegen thread-safe#56252
[SPARK-57196][SQL] Make UnionExec whole-stage codegen thread-safe#56252gengliangwang wants to merge 2 commits into
Conversation
### What changes were proposed in this pull request?
`UnionExec` whole-stage codegen fusion (SPARK-56482) kept per-emission codegen
state in mutable instance fields on the plan node: `currentEmittingChild` (set
in `doProduce`, read in `doConsume` to pick a child's projection) and
`numOutputRowsTerm` (the once-per-stage metric term). This PR moves both fields
to `ThreadLocal`, isolating the state to the single thread that runs a given
`doCodeGen` pass.
### Why are the changes needed?
A single `UnionExec` instance can have its whole-stage codegen driven by more
than one thread at the same time: a reused exchange/subquery stage is generated
concurrently with the main plan, and async subquery / dynamic-partition-pruning
execution can overlap a driver-side `doCodeGen`. With the shared mutable field,
a racing `doProduce` resets `currentEmittingChild` to -1 while another thread is
still inside `doConsume`, tripping:
java.lang.IllegalArgumentException: requirement failed:
UnionExec.doConsume invoked outside doProduce emission window
This surfaced as a flaky `LogicalPlanTagInSparkPlanSuite.q2` failure (q2 has a
UNION, and union fusion is on by default). Each `doCodeGen` pass is itself
single-threaded (`produce` -> `doConsume` run inline on one thread), so a
`ThreadLocal` isolates the state per pass without the cross-thread race.
### Does this PR introduce _any_ user-facing change?
No. It removes an intermittent internal codegen failure; generated code and
results are unchanged.
### How was this patch tested?
Added `UnionCodegenSuite` test "SPARK-57196: concurrent codegen of a shared
UnionExec stage is thread-safe" that drives `doCodeGen()` on one shared fused
`UnionExec` stage from 8 threads. It reproduces the "outside doProduce emission
window" failure on the old code and passes with this fix. Verified the full
`UnionCodegenSuite` (43 tests), the ANSI/AQE variants, and
`LogicalPlanTagInSparkPlanSuite` q2 all pass.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8)
Co-authored-by: Isaac
|
Example flaky test: https://github.com/gengliangwang/spark/actions/runs/26682136586/job/78902199893 |
cloud-fan
left a comment
There was a problem hiding this comment.
0 blocking, 2 non-blocking, 0 nits.
Correct, well-motivated, well-tested thread-safety fix. Two non-blocking design questions only.
Design / architecture (2)
- basicPhysicalOperators.scala:1004: pass-scoped codegen state stored as per-thread
ThreadLocal, while the same method already routes a sibling piece throughctx— why ThreadLocal overctx-scoped storage? — see inline - (general): is UnionExec uniquely exposed, or do other
CodegenSupportoperators with varying per-emission instance state have the same race — worth a documented invariant?
Verification
Traced the race: numOutputRowsTerm is ctx.addReferenceObj (a per-CodegenContext references[N] term) and currentEmittingChild cycles per child and resets to -1, so both vary per pass and a shared field lets a concurrent doProduce clobber a value another thread reads in doConsume. Each doCodeGen pass runs produce->doConsume inline on one thread and passes don't nest on a thread, so per-thread ThreadLocal == per-pass isolation — the race is removed and the metric stays once-per-stage. doConsume's require(i >= 0) still fires before numOutputRowsTerm.get, so a null term can't leak into generated code. The benign control case (CodegenSupport.parent, also a shared mutable field but always set to the same value) confirms the fix targets exactly the fields that vary.
| // is still in `doConsume`. Each `doCodeGen` pass is itself single-threaded | ||
| // (`produce` -> `doConsume` run inline on one thread), so a `ThreadLocal` | ||
| // isolates the state per pass without that cross-thread race. | ||
| @transient private lazy val numOutputRowsTerm = new ThreadLocal[String] |
There was a problem hiding this comment.
These two fields are pass-scoped codegen state but stored as per-thread ThreadLocal. The framework's established home for pass-scoped codegen state is CodegenContext (INPUT_ROW, currentVars, currentPartitionIndexVar, freshNamePrefix), created fresh per doCodeGen pass — and this very method already routes a sibling piece of pass state, currentPartitionIndexVar, through ctx (saved/restored just below). So two pieces of the same kind of state now use two different mechanisms here. ctx-scoped storage would be exactly per-pass (ThreadLocal's per-thread isolation is only correct because passes don't nest on a thread) and consistent with the existing pattern. The tradeoff — and why this is a question, not a blocker — is that adding union-specific fields to the shared CodegenContext pollutes a class every operator uses, whereas ThreadLocal stays localized here. Could you add a sentence (comment or PR description) on why ThreadLocal was preferred over threading these through ctx?
There was a problem hiding this comment.
Good point -- you're right that this is per-pass state, and ThreadLocal is correct only because per-pass and per-thread coincide here (a pass runs inline on one thread, and passes don't nest). I've expanded the comment to spell out the scope and the rationale.
The reason I preferred ThreadLocal over ctx: CodegenContext has no general-purpose per-pass attribute map -- currentPartitionIndexVar and friends are named fields. Threading currentEmittingChild / numOutputRowsTerm through ctx would mean adding UnionExec-specific fields to a class every operator instantiates, whereas the ThreadLocal keeps the state local to the one node that needs it. That's the tradeoff you described, and I landed on keeping it localized.
Happy to move it onto ctx instead if you feel the consistency with currentPartitionIndexVar outweighs the pollution -- let me know your preference.
…cal over ctx Address review feedback: explain that the state is valid only for one doCodeGen pass, why per-thread isolation is correct here, and why it is kept in a ThreadLocal rather than threaded through CodegenContext. Co-authored-by: Isaac
cloud-fan
left a comment
There was a problem hiding this comment.
1 addressed, 1 remaining, 0 new.
The expanded comment cleanly answers my ctx-vs-ThreadLocal question — I'm good with ThreadLocal here: the "ctx has no general-purpose per-pass attribute map, and adding UnionExec-specific fields would pollute a class every operator uses" argument outweighs the consistency-with-currentPartitionIndexVar point. Thanks for documenting it.
Design / architecture (1, non-blocking, remaining)
- The broader round-1 question is still open: are other
CodegenSupportoperators that keep varying per-emission state in instance fields exposed to the same race, or isUnionExec's per-child multiplexing genuinely unique? Optional follow-up — not blocking this PR.
Verification
Re-confirmed the documented invariant against source: a doCodeGen pass runs produce->doConsume inline on one thread and passes never nest, so per-thread ThreadLocal == per-pass isolation; require(i >= 0) (1088) still fires before numOutputRowsTerm.get (1097). The three new comment claims all check out (currentPartitionIndexVar save/restore at 1054/1072; CodegenContext exposes only named fields, no general map; reset-guards-stale at 1071).
|
@cloud-fan thanks for the review. Merging to master/4.x/4.2 |
|
cc @huaxingao as well |
### What changes were proposed in this pull request? `UnionExec` whole-stage codegen fusion (SPARK-56482) kept per-emission codegen state in mutable instance fields on the plan node: `currentEmittingChild` (set in `doProduce`, read in `doConsume` to pick a child's projection) and `numOutputRowsTerm` (the once-per-stage metric term). This PR moves both fields to `ThreadLocal`, isolating the state to the single thread that runs a given `doCodeGen` pass. ### Why are the changes needed? A single `UnionExec` instance can have its whole-stage codegen driven by more than one thread at the same time: a reused exchange/subquery stage is generated concurrently with the main plan, and async subquery / dynamic-partition-pruning execution can overlap a driver-side `doCodeGen`. With the shared mutable field, a racing `doProduce` resets `currentEmittingChild` to `-1` while another thread is still inside `doConsume`, tripping: ``` java.lang.IllegalArgumentException: requirement failed: UnionExec.doConsume invoked outside doProduce emission window ``` This surfaced as a flaky `LogicalPlanTagInSparkPlanSuite.q2` failure (q2 contains a `UNION`, and union fusion is enabled by default). Each `doCodeGen` pass is itself single-threaded (`produce` -> `doConsume` run inline on one thread), so a `ThreadLocal` isolates the state per pass without the cross-thread race, while preserving the existing per-stage semantics (the metric term is still computed once per pass). ### Does this PR introduce _any_ user-facing change? No. It removes an intermittent internal code-generation failure; the generated code and query results are unchanged. ### How was this patch tested? Added a `UnionCodegenSuite` test, "SPARK-57196: concurrent codegen of a shared UnionExec stage is thread-safe", that drives `doCodeGen()` on one shared fused `UnionExec` stage from 8 threads. It reproduces the "outside doProduce emission window" failure on the unpatched code and passes with this fix. Also verified the full `UnionCodegenSuite` (43 tests), its ANSI/AQE variants, and `LogicalPlanTagInSparkPlanSuite` q2 all pass. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.8) Closes #56252 from gengliangwang/spark-union-codegen-threadsafe. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 694c848) Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request? `UnionExec` whole-stage codegen fusion (SPARK-56482) kept per-emission codegen state in mutable instance fields on the plan node: `currentEmittingChild` (set in `doProduce`, read in `doConsume` to pick a child's projection) and `numOutputRowsTerm` (the once-per-stage metric term). This PR moves both fields to `ThreadLocal`, isolating the state to the single thread that runs a given `doCodeGen` pass. ### Why are the changes needed? A single `UnionExec` instance can have its whole-stage codegen driven by more than one thread at the same time: a reused exchange/subquery stage is generated concurrently with the main plan, and async subquery / dynamic-partition-pruning execution can overlap a driver-side `doCodeGen`. With the shared mutable field, a racing `doProduce` resets `currentEmittingChild` to `-1` while another thread is still inside `doConsume`, tripping: ``` java.lang.IllegalArgumentException: requirement failed: UnionExec.doConsume invoked outside doProduce emission window ``` This surfaced as a flaky `LogicalPlanTagInSparkPlanSuite.q2` failure (q2 contains a `UNION`, and union fusion is enabled by default). Each `doCodeGen` pass is itself single-threaded (`produce` -> `doConsume` run inline on one thread), so a `ThreadLocal` isolates the state per pass without the cross-thread race, while preserving the existing per-stage semantics (the metric term is still computed once per pass). ### Does this PR introduce _any_ user-facing change? No. It removes an intermittent internal code-generation failure; the generated code and query results are unchanged. ### How was this patch tested? Added a `UnionCodegenSuite` test, "SPARK-57196: concurrent codegen of a shared UnionExec stage is thread-safe", that drives `doCodeGen()` on one shared fused `UnionExec` stage from 8 threads. It reproduces the "outside doProduce emission window" failure on the unpatched code and passes with this fix. Also verified the full `UnionCodegenSuite` (43 tests), its ANSI/AQE variants, and `LogicalPlanTagInSparkPlanSuite` q2 all pass. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.8) Closes #56252 from gengliangwang/spark-union-codegen-threadsafe. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 694c848) Signed-off-by: Gengliang Wang <gengliang@apache.org>
What changes were proposed in this pull request?
UnionExecwhole-stage codegen fusion (SPARK-56482) kept per-emission codegen state in mutable instance fields on the plan node:currentEmittingChild(set indoProduce, read indoConsumeto pick a child's projection) andnumOutputRowsTerm(the once-per-stage metric term). This PR moves both fields toThreadLocal, isolating the state to the single thread that runs a givendoCodeGenpass.Why are the changes needed?
A single
UnionExecinstance can have its whole-stage codegen driven by more than one thread at the same time: a reused exchange/subquery stage is generated concurrently with the main plan, and async subquery / dynamic-partition-pruning execution can overlap a driver-sidedoCodeGen. With the shared mutable field, a racingdoProduceresetscurrentEmittingChildto-1while another thread is still insidedoConsume, tripping:This surfaced as a flaky
LogicalPlanTagInSparkPlanSuite.q2failure (q2 contains aUNION, and union fusion is enabled by default). EachdoCodeGenpass is itself single-threaded (produce->doConsumerun inline on one thread), so aThreadLocalisolates the state per pass without the cross-thread race, while preserving the existing per-stage semantics (the metric term is still computed once per pass).Does this PR introduce any user-facing change?
No. It removes an intermittent internal code-generation failure; the generated code and query results are unchanged.
How was this patch tested?
Added a
UnionCodegenSuitetest, "SPARK-57196: concurrent codegen of a shared UnionExec stage is thread-safe", that drivesdoCodeGen()on one shared fusedUnionExecstage from 8 threads. It reproduces the "outside doProduce emission window" failure on the unpatched code and passes with this fix. Also verified the fullUnionCodegenSuite(43 tests), its ANSI/AQE variants, andLogicalPlanTagInSparkPlanSuiteq2 all pass.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8)