[SPARK-56203][SQL] Fix race condition in `SortExec.rowSorter` by peter-toth · Pull Request #55006 · apache/spark

peter-toth · 2026-03-25T15:41:31Z

What changes were proposed in this pull request?

Replace `private[sql] var rowSorter: ThreadLocal[UnsafeExternalRowSorter]` with `@transient private[sql] lazy val rowSorter: ThreadLocal[UnsafeExternalRowSorter]` in `SortExec`.

Remove the `rowSorter = new ThreadLocal()` reassignment that was inside `createSorter()`.

Why are the changes needed?

`SortExec` is a shared plan object: the same instance is used by all tasks that execute different partitions of the same stage. In the original code, `createSorter()` would write `rowSorter = new ThreadLocal()`, an unsynchronised write to a shared `var`. If two tasks (threads T0 and T1) called `createSorter()` concurrently:

1. T0 writes `rowSorter = ThreadLocal_0`, sets `ThreadLocal_0.set(sorter_0)`
2. T1 writes `rowSorter = ThreadLocal_1`, sets `ThreadLocal_1.set(sorter_1)`
3. T0's `cleanupResources()` reads `rowSorter` (now pointing to `ThreadLocal_1`) and calls `ThreadLocal_1.get()` on thread T0 → `null` → `sorter_0` is leaked
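The interleaving above can be replayed deterministically outside Spark. The following is a minimal sketch, not the `SortExec` code: `FakeSorter`, `RacyPlan`, and `RaceReplay` are hypothetical names, and two single-thread executors stand in for the task threads T0 and T1 so the three steps run in a fixed order.

```scala
import java.util.concurrent.{Callable, ExecutorService, Executors}

// Hypothetical stand-in for UnsafeExternalRowSorter; not Spark's class.
final class FakeSorter(val name: String)

// Simplified model of the old shape: a shared, reassignable ThreadLocal field.
final class RacyPlan {
  var rowSorter: ThreadLocal[FakeSorter] = null

  def createSorter(name: String): FakeSorter = {
    val sorter = new FakeSorter(name)
    rowSorter = new ThreadLocal[FakeSorter]() // unsynchronised write to shared state
    rowSorter.set(sorter)
    sorter
  }

  // The sorter the calling thread would be able to clean up, if any.
  def sorterToCleanUp(): Option[FakeSorter] =
    Option(rowSorter).flatMap(tl => Option(tl.get()))
}

object RaceReplay {
  // Runs `body` on the executor's single thread and waits for it to finish,
  // which pins the order of the steps across the two threads.
  private def runOn[A](ex: ExecutorService)(body: => A): A =
    ex.submit(new Callable[A] { def call(): A = body }).get()

  // Returns the sorter names visible to T0 and T1 at cleanup time.
  def replay(): (Option[String], Option[String]) = {
    val t0 = Executors.newSingleThreadExecutor()
    val t1 = Executors.newSingleThreadExecutor()
    try {
      val plan = new RacyPlan
      runOn(t0)(plan.createSorter("sorter_0"))                     // step 1
      runOn(t1)(plan.createSorter("sorter_1"))                     // step 2
      val seenByT0 = runOn(t0)(plan.sorterToCleanUp().map(_.name)) // step 3
      val seenByT1 = runOn(t1)(plan.sorterToCleanUp().map(_.name))
      (seenByT0, seenByT1)
    } finally {
      t0.shutdown()
      t1.shutdown()
    }
  }
}
```

In this model `RaceReplay.replay()` returns `(None, Some("sorter_1"))`: T0 can no longer reach `sorter_0`, which is the leak described in step 3.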

With a stable `lazy val`, the `ThreadLocal` object is created once (Scala lazy-val initialisation is thread-safe). Every call to `createSorter()` just calls `rowSorter.set(newSorter)` on the same object. Because `ThreadLocal` gives each thread an independent slot, T0's slot and T1's slot are completely separate; neither task can observe or clobber the other's sorter reference.
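For comparison, here is a rough sketch of the fixed shape under the same T0/T1 ordering. `StubSorter`, `StablePlan`, and `StableReplay` are hypothetical names. The null check and the `remove()` call in `cleanupResources()` are assumptions of this sketch, not details taken from the patch.

```scala
import java.util.concurrent.{Callable, ExecutorService, Executors}

// Hypothetical stand-in for UnsafeExternalRowSorter; it only records cleanup.
final class StubSorter(val name: String) {
  @volatile var cleaned: Boolean = false
  def cleanupResources(): Unit = { cleaned = true }
}

// Simplified model of the fixed shape: one ThreadLocal for the plan's lifetime.
final class StablePlan {
  lazy val rowSorter: ThreadLocal[StubSorter] = new ThreadLocal[StubSorter]()

  def createSorter(name: String): StubSorter = {
    val sorter = new StubSorter(name)
    rowSorter.set(sorter) // touches only the calling thread's slot
    sorter
  }

  // Cleans up the calling thread's sorter; a thread that never created one
  // finds an empty slot and does nothing (assumption of this sketch).
  def cleanupResources(): Unit = {
    val sorter = rowSorter.get()
    if (sorter != null) {
      sorter.cleanupResources()
      rowSorter.remove()
    }
  }
}

object StableReplay {
  private def runOn[A](ex: ExecutorService)(body: => A): A =
    ex.submit(new Callable[A] { def call(): A = body }).get()

  // T0 creates, T1 creates, then only T0 cleans up.
  // Returns whether sorter_0 and sorter_1 were cleaned.
  def replay(): (Boolean, Boolean) = {
    val t0 = Executors.newSingleThreadExecutor()
    val t1 = Executors.newSingleThreadExecutor()
    try {
      val plan = new StablePlan
      val s0 = runOn(t0)(plan.createSorter("sorter_0"))
      val s1 = runOn(t1)(plan.createSorter("sorter_1"))
      runOn(t0)(plan.cleanupResources())
      (s0.cleaned, s1.cleaned)
    } finally {
      t0.shutdown()
      t1.shutdown()
    }
  }
}
```

Here `StableReplay.replay()` returns `(true, false)`: T0 cleans up its own sorter and leaves T1's untouched. Calling `cleanupResources()` on a fresh `StablePlan` is a no-op, which mirrors the empty-partition case.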

`@transient` is required because `ThreadLocal` is not `Serializable`; after deserialisation the lazy val re-initialises to a fresh `ThreadLocal` on first access, which is the correct behaviour on an executor.
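The serialisation behaviour can be checked with plain Java serialisation. This is a sketch under assumptions: `TransientPlan`, `NonTransientPlan`, and `SerDemo` are hypothetical names, a `String` stands in for the sorter, and an in-memory round trip stands in for shipping a plan to an executor.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{NotSerializableException, ObjectInputStream, ObjectOutputStream}
import scala.util.Try

// Hypothetical plan node whose ThreadLocal is excluded from serialisation.
class TransientPlan extends Serializable {
  @transient lazy val rowSorter: ThreadLocal[String] = new ThreadLocal[String]()
}

// Same shape without @transient, to show why the annotation matters.
class NonTransientPlan extends Serializable {
  lazy val rowSorter: ThreadLocal[String] = new ThreadLocal[String]()
}

object SerDemo {
  // Serialises `value` to bytes and reads it back as a new object.
  def roundTrip[A <: AnyRef](value: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }

  // True if the copy got its own, empty ThreadLocal after the round trip.
  def transientCopyIsFresh(): Boolean = {
    val plan = new TransientPlan
    plan.rowSorter.set("sorter_0") // forces initialisation before serialising
    val copy = roundTrip(plan)
    (copy.rowSorter ne plan.rowSorter) && copy.rowSorter.get() == null
  }

  // True if serialising an initialised non-transient plan is rejected.
  def nonTransientFails(): Boolean = {
    val plan = new NonTransientPlan
    plan.rowSorter.set("sorter_0")
    Try(roundTrip(plan)).failed.toOption.exists(_.isInstanceOf[NotSerializableException])
  }
}
```

Both helpers are expected to return `true`: the `@transient` field is skipped on write and rebuilt lazily on first access in the copy, while the non-transient variant fails once its `ThreadLocal` has been initialised.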

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added `SortSuite` test "cleanupResources is safe when createSorter was never called", which verifies that `cleanupResources()` is a no-op when called before any sorter is created (the empty-partition case that previously required a `rowSorter != null` guard on the now-removed reassignable `var`).

Existing `SortSuite` tests cover correct sort output and spill behaviour.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

dongjoon-hyun

+1, LGTM.

I guess this was the root cause of SortSuite flakiness, @peter-toth ?

peter-toth · 2026-03-25T18:44:56Z

> I guess this was the root cause of SortSuite flakiness, @peter-toth ?

Could be, but I found the bug in a different way.

dongjoon-hyun · 2026-03-25T19:31:46Z

All tests passed. Merged to master for Apache Spark 4.2.0.

peter-toth · 2026-03-25T19:35:40Z

Thank you for the prompt review @dongjoon-hyun.

dongjoon-hyun · 2026-03-25T19:36:56Z

Thank you. Feel free to backport this if you want to use this in the live release branches.

dongjoon-hyun approved these changes Mar 25, 2026

dongjoon-hyun closed this in 1e52a02 Mar 25, 2026

peter-toth mentioned this pull request Mar 27, 2026

[SPARK-56250][SQL] Remove confusing defensive code in SortExec.rowSorter and add warning comment #55048

Open

peter-toth wants to merge 1 commit into apache:master from peter-toth:SPARK-56203-fix-sortexec-rowsorter
