[SPARK-56238][K8S] Fix app ID propagation in KubernetesClusterSchedulerBackend for `client` mode submission by xiaoxuandev · Pull Request #55355 · apache/spark

xiaoxuandev · 2026-04-15T19:15:46Z

What changes were proposed in this pull request?

Cache the application ID at construction time in KubernetesClusterSchedulerBackend so that applicationId() returns a stable value across calls.

Previously, applicationId() fell back to KubernetesConf.getKubernetesAppId() whenever spark.app.id was not yet set, and that method generates a new random UUID on every call. In client mode, SparkContext sets spark.app.id only after start() returns, so the calls to applicationId() made during start() (for podAllocator.start(), watchEvents.start(), pollEvents.start(), and setUpExecutorConfigMap()) each received a different ID. As a result, subsystems used inconsistent app IDs for pod labeling and filtering.

This only affects client mode. In cluster mode, the submission client generates the app ID upfront and writes it into spark.app.id via BasicDriverFeatureStep before the driver pod starts, so conf.getOption("spark.app.id") always returns a value and the getOrElse branch is never reached.
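The difference between the two modes can be sketched with a small standalone model. This is only an illustration under stated assumptions: `PreFixBackend` and `StartupOrder` are hypothetical names, a mutable `Map` stands in for `SparkConf`, and the four calls inside `start()` stand in for the four subsystems listed above.

```scala
import java.util.UUID
import scala.collection.mutable

// Hypothetical model of the pre-fix behavior: the ID is re-resolved on every call.
class PreFixBackend(conf: mutable.Map[String, String]) {
  // IDs handed out during start(), recorded so they can be inspected afterwards.
  val idsSeenDuringStart: mutable.Buffer[String] = mutable.Buffer.empty

  def applicationId(): String =
    conf.getOrElse("spark.app.id", s"spark-${UUID.randomUUID()}")

  // Stand-ins for podAllocator.start(), watchEvents.start(), pollEvents.start(),
  // and setUpExecutorConfigMap(), each of which asks for the app ID.
  def start(): Unit =
    (1 to 4).foreach(_ => idsSeenDuringStart += applicationId())
}

object StartupOrder {
  // presetAppId = Some(...) models cluster mode, None models client mode.
  def distinctIdsDuringStart(presetAppId: Option[String]): Int = {
    val conf = mutable.Map.empty[String, String]
    presetAppId.foreach(id => conf("spark.app.id") = id)
    val backend = new PreFixBackend(conf)
    backend.start()
    // As described above, spark.app.id is written only after start() returns.
    conf("spark.app.id") = backend.applicationId()
    backend.idsSeenDuringStart.distinct.size
  }
}
```

In this model, `StartupOrder.distinctIdsDuringStart(None)` counts one ID per call made inside `start()`, while passing a preset ID yields a single ID because the fallback branch is never taken.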

The fix adds a private val appId that resolves the ID once at construction time and returns it consistently, matching the pattern used by SchedulerBackend, LocalSchedulerBackend, and other backends.
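The shape of the change, as a minimal standalone sketch rather than the actual patch: `FakeKubernetesConf`, `UncachedBackend`, and `CachedBackend` are hypothetical stand-ins, an immutable `Map` replaces `SparkConf`, and the generated ID format is an assumption.

```scala
import java.util.UUID

// Hypothetical stand-in for KubernetesConf.getKubernetesAppId(): a fresh ID per call.
object FakeKubernetesConf {
  def getKubernetesAppId(): String =
    "spark-" + UUID.randomUUID().toString.replaceAll("-", "")
}

// Before: the fallback is evaluated on every call, so each call may return a new ID.
class UncachedBackend(conf: Map[String, String]) {
  def applicationId(): String =
    conf.getOrElse("spark.app.id", FakeKubernetesConf.getKubernetesAppId())
}

// After: the ID is resolved once at construction time and reused.
class CachedBackend(conf: Map[String, String]) {
  private val appId: String =
    conf.getOrElse("spark.app.id", FakeKubernetesConf.getKubernetesAppId())

  def applicationId(): String = appId
}
```

With an empty conf, two calls to `UncachedBackend.applicationId()` return different values, while `CachedBackend` returns the same value each time. When `spark.app.id` is present, both variants return it unchanged.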

Why are the changes needed?

Without this fix, the Kubernetes scheduler backend could propagate different app IDs to different subsystems during start() in client mode, leading to:

- Pod allocator, watch events, and poll events using different app IDs
- stop() unable to clean up resources created by start() (services, PVCs, config maps, executor pods) because the label selector uses a different ID
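The cleanup failure can be illustrated with a small in-memory model. Everything here is an assumption for illustration: `Resource` and `LabelSelectorDemo` are hypothetical, the label key is assumed, the ID values are made up, and no Kubernetes client API is involved.

```scala
// Hypothetical in-memory model of labeled cluster resources.
final case class Resource(name: String, labels: Map[String, String])

object LabelSelectorDemo {
  // Label key assumed here for illustration only.
  val AppIdLabel = "spark-app-selector"

  // Returns the resources that survive a "delete everything labeled with appId" call.
  def deleteByAppId(cluster: Seq[Resource], appId: String): Seq[Resource] =
    cluster.filterNot(_.labels.get(AppIdLabel).contains(appId))

  // Made-up IDs, as start() and stop() might have observed them before the fix.
  val idSeenByStart = "spark-aaaa"
  val idSeenByStop = "spark-bbbb"

  // Resources created during start(), labeled with the ID seen at that time.
  val cluster: Seq[Resource] = Seq(
    Resource("exec-1", Map(AppIdLabel -> idSeenByStart)),
    Resource("executor-config-map", Map(AppIdLabel -> idSeenByStart))
  )

  // Selector built from a different ID: nothing matches, so everything leaks.
  val leakedBeforeFix: Seq[Resource] = deleteByAppId(cluster, idSeenByStop)

  // Selector built from the same ID: everything is cleaned up.
  val leakedAfterFix: Seq[Resource] = deleteByAppId(cluster, idSeenByStart)
}
```

In this model, a selector built from a different ID matches nothing, so both resources remain after cleanup. A selector built from the same ID removes them all.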

Does this PR introduce any user-facing change?

No direct user-facing change. This fixes an internal consistency issue that could cause resource leaks in Kubernetes client-mode deployments.

How was this patch tested?

- Unit tests in KubernetesClusterSchedulerBackendSuite verifying applicationId() stability both when spark.app.id is set and when it is not set.
- Existing tests for DeploymentAllocatorSuite, ExecutorPodsLifecycleManagerSuite, and StatefulSetAllocatorSuite continue to pass.

Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored with Kiro.


dongjoon-hyun

cc @EnricoMi

dongjoon-hyun · 2026-04-15T19:33:33Z

+
+  test("SPARK-56238: applicationId() returns consistent value when spark.app.id is set") {
+    val id1 = schedulerBackendUnderTest.applicationId()
+    val id2 = schedulerBackendUnderTest.applicationId()


This test case doesn't make sense to me in this PR's context. Please remove it, because it passes without your PR, @xiaoxuandev.

Removed, thanks!

dongjoon-hyun · 2026-04-15T19:36:32Z

+    assert(id1 === TEST_SPARK_APP_ID)
+  }
+
+  test("SPARK-56238: applicationId() is stable across calls when spark.app.id is not set") {


This test case seems to reproduce the reported scenario.

dongjoon-hyun · 2026-04-15T19:37:33Z

+    when(localRpcEnv.setupEndpoint(any(), any())).thenReturn(driverEndpointRef)
+    val localTaskScheduler = mock(classOf[TaskSchedulerImpl])
+    when(localTaskScheduler.sc).thenReturn(localSc)
+    val backendWithoutAppId = new KubernetesClusterSchedulerBackend(


Do you happen to know when this situation occurs in a production environment, @xiaoxuandev? I'm wondering whether this is a valid case in Apache Spark usage.

This only affects client mode.

One more question. Do you know if this is a regression or not? (as Enrico claims)

Yes, this is a regression introduced by #54269. The original code cached the generated ID in private val appId, which was correct.

Affected versions: v4.2.0-preview3+. All 4.0.x and 4.1.x releases are clean.

Regarding production usage: this affects Kubernetes client mode (--deploy-mode client), where the driver runs outside the K8s cluster. In that path, spark.app.id is not pre-set before backend.start(), so each call to applicationId() during start() would generate a different UUID.

Got it. Thank you for confirming, @xiaoxuandev.

dongjoon-hyun

+1, LGTM. Thank you, @xiaoxuandev.

EnricoMi · 2026-04-16T06:11:35Z

Thanks for fixing this!

dongjoon-hyun reviewed Apr 15, 2026


remove unrelated test

e2d5ad8

dongjoon-hyun approved these changes Apr 15, 2026


dongjoon-hyun changed the title from "[SPARK-56238][K8S] Fix app ID propagation in KubernetesClusterSchedulerBackend" to "[SPARK-56238][K8S] Fix app ID propagation in KubernetesClusterSchedulerBackend for client mode submission" on Apr 15, 2026

dongjoon-hyun closed this in f6c8a95 Apr 15, 2026

xiaoxuandev wants to merge 2 commits into apache:master from xiaoxuandev:fix-56238

3 participants
