[SPARK-54055][CONNECT][PYTHON] Clean up per-session PythonWorkerFactory #55131
kumbham wants to merge 1 commit into apache:master
Conversation
Force-pushed from 16b9db1 to 8d4843c
All of the failed tests are failing with this exception: https://github.com/kumbham/spark/actions/runs/23860864341/job/69567033253#step:11:8167. None of the failing tests are related to the changes in this PR (PythonWorkerFactory, …)
Force-pushed from 8d4843c to 8ce7463
holdenk left a comment:
Thanks for working on this! I've got some questions, it's been a hot minute since I thought about how we spawn Python workers.
```scala
private val idleFactoryReaper =
  ThreadUtils.newDaemonSingleThreadScheduledExecutor("idle-python-factory-reaper")
idleFactoryReaper.scheduleAtFixedRate(
  () => evictIdlePythonWorkerFactories(),
  PythonWorkerFactory.IDLE_FACTORY_CHECK_INTERVAL_MS,
  PythonWorkerFactory.IDLE_FACTORY_CHECK_INTERVAL_MS,
  TimeUnit.MILLISECONDS)
```
Would it make sense to only launch this if a Python job is present? Or is the overhead low enough (and the complexity of doing that high enough) that it doesn't matter? (I can probably be convinced either way, just looking for the thinking here.)
```diff
@@ -120,6 +120,14 @@ class SparkEnv (
     pythonExec: String, workerModule: String, daemonModule: String, envVars: Map[String, String])
   private val pythonWorkers = mutable.HashMap[PythonWorkersKey, PythonWorkerFactory]()
```
So are we just depending on the env var flow through here to make the cache work? I know this isn't your OG code but it feels funky as is.
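For reference, here is a plain-Python sketch of how that env-var flow-through makes the cache work (the `workers_key` helper and tuple layout are illustrative, not Spark's actual `PythonWorkersKey`): because the env vars are folded into the key, a session-unique `SPARK_JOB_ARTIFACT_UUID` hashes to a distinct cache entry per session.

```python
def workers_key(python_exec, worker_module, daemon_module, env_vars):
    # Env vars are part of the key, so differing SPARK_JOB_ARTIFACT_UUID
    # values produce distinct (hashable) keys.
    return (python_exec, worker_module, daemon_module,
            tuple(sorted(env_vars.items())))

cache = {}
for session_uuid in ("uuid-a", "uuid-b"):
    key = workers_key("python3", "pyspark.worker", "pyspark.daemon",
                      {"SPARK_JOB_ARTIFACT_UUID": session_uuid})
    cache.setdefault(key, f"factory-for-{session_uuid}")

print(len(cache))  # 2: one factory per session
```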
```scala
// Create and start the daemon
val command = Arrays.asList(pythonExec, "-m", daemonModule, workerModule)
val pb = new ProcessBuilder(command)
val jobArtifactUUID = envVars.getOrElse("SPARK_JOB_ARTIFACT_UUID", "default")
```
This seems like a redef of L124
What changes were proposed in this pull request?
Each Spark Connect session creates its own `PythonWorkerFactory`, keyed by `SPARK_JOB_ARTIFACT_UUID`. These factories (and their daemon processes) were never cleaned up until `SparkContext` shutdown, causing unbounded process and thread leaks on long-running servers. This change adds two cleanup mechanisms:
1. **Eager cleanup (driver-side):** `SessionHolder.close()` now calls `SparkSession.cleanupPythonWorkers()`, which finds and stops all `PythonWorkerFactory` instances matching the session's artifact UUID in `SparkEnv`'s cache. This follows the same pattern as the existing `cleanupPythonWorkerLogs()` in the same lifecycle hook.
2. **Idle-timeout eviction (executor-side safety net):** A `ScheduledExecutorService` in `SparkEnv` periodically scans for `PythonWorkerFactory` instances with a non-default `SPARK_JOB_ARTIFACT_UUID` that have no active/idle workers and have been idle for more than 5 minutes, and evicts them. This handles executor-side cleanup, where session-close notifications from the driver cannot reach. The scheduler follows the same pattern used by `ContextCleaner`, `Heartbeater`, and other Spark core components.

Factories with a default artifact UUID (i.e., non-Connect workloads) are never evicted by the idle-timeout mechanism.
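The eviction criteria for the idle-timeout mechanism can be sketched in plain Python (field names such as `last_used_ms` and `active_workers` are illustrative, not the actual Scala member names):

```python
IDLE_TIMEOUT_MS = 5 * 60 * 1000  # evict only after more than 5 minutes idle

def is_idle_factory(job_artifact_uuid, last_used_ms, active_workers, now_ms):
    """A factory is evictable only if it belongs to a Connect session
    (non-default UUID), has no workers, and is past the idle timeout."""
    return (
        job_artifact_uuid != "default"
        and active_workers == 0
        and (now_ms - last_used_ms) > IDLE_TIMEOUT_MS
    )

now_ms = 10 * 60 * 1000
print(is_idle_factory("default", 0, 0, now_ms))             # False: non-Connect, never evicted
print(is_idle_factory("uuid-1", now_ms - 1000, 0, now_ms))  # False: recent activity
print(is_idle_factory("uuid-2", 0, 0, now_ms))              # True: idle past timeout
```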
Closes SPARK-54055
Why are the changes needed?
With Spark Connect, each session always has a unique `SPARK_JOB_ARTIFACT_UUID`, even if there are no artifacts. This makes the UDF environment built by `BasePythonRunner.compute` unique per session, so each session gets its own `PythonWorkerFactory` and daemon process.

`PythonWorkerFactory` has a `stop` method, but nothing called it except `SparkEnv.stop` (which only runs at full shutdown). On a long-running Spark Connect server, this causes unbounded accumulation of daemon processes, `MonitorThread`s, and stderr/stdout reader threads, eventually leading to OOM.

Reproduction (from the JIRA reporter):
After 200 sessions, 200+ daemon processes and ~1000 threads are leaked.
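The shape of the leak can be simulated stand-alone in plain Python (no Spark dependency; `open_connect_session` and the placeholder cache are hypothetical): each session's unique UUID creates a cache entry that nothing ever removes.

```python
import uuid

python_workers = {}  # stands in for SparkEnv's factory cache

def open_connect_session():
    """Each Connect session gets a fresh SPARK_JOB_ARTIFACT_UUID, hence a
    fresh cache entry that (before this PR) nothing ever removed."""
    session_uuid = str(uuid.uuid4())
    python_workers[session_uuid] = object()  # placeholder for a factory
    return session_uuid

for _ in range(200):
    open_connect_session()

print(len(python_workers))  # 200: every "factory" is still cached
```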
Does this PR introduce any user-facing change?
No. This is a resource leak fix. Python UDF behavior is unchanged.
How was this patch tested?
Added 4 new unit tests in `PythonWorkerFactoryIdleSuite`:
- `isIdleFactory` returns false for default artifact UUID: factories without a session UUID are never evicted
- `isIdleFactory` returns false for a session factory with recent activity: active factories are not evicted
- `isIdleFactory` returns true for a session factory past the timeout: idle session factories are correctly identified
- `destroyPythonWorkersByArtifactUUID` removes only matching factories: validates selective cleanup by UUID
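The selective-cleanup behavior the last test validates can be sketched as follows (hypothetical function name and key layout; the real Scala code would also stop each removed factory):

```python
def destroy_by_artifact_uuid(cache, session_uuid):
    """Remove only the factories keyed by this session's UUID."""
    victims = [key for key in cache if key[0] == session_uuid]
    for key in victims:
        cache.pop(key)  # the real code would also call factory.stop()
    return len(victims)

cache = {
    ("session-a", "python3"): "factory-a",
    ("session-b", "python3"): "factory-b",
}
removed = destroy_by_artifact_uuid(cache, "session-a")
print(removed)                    # 1
print([key[0] for key in cache])  # ['session-b']
```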
Also verified no regressions in existing test suites:
PythonWorkerFactorySuite (3/3 passed)
SparkConnectSessionHolderSuite (18/18 passed)
SparkConnectSessionManagerSuite (10/10 passed)
All api.python core tests (14/14 passed)
Was this patch authored or co-authored using generative AI tooling?
Cursor (Claude claude-4.6-opus-high-thinking)