
Conversation

@dongjoon-hyun (Member) commented Jan 24, 2026

What changes were proposed in this pull request?

This PR aims to fix BasicExecutorFeatureStep to throw IllegalArgumentException for executor cpu misconfigurations in order to fail the Spark jobs ASAP.

Why are the changes needed?

Since Apache Spark 4.1.0, the Spark driver pod throws SparkException for this executor cpu misconfiguration before sending pod requests to the K8s control plane. This improvement reduces the burden on the K8s control plane.

26/01/24 06:55:31 INFO ExecutorPodsAllocator: Going to request 5 executors from Kubernetes for ResourceProfile Id: 0, target: 5, known: 0, sharedSlotFromPendingPods: 2147483647.
26/01/24 06:55:31 INFO ExecutorPodsAllocator: Found 0 reusable PVCs from 0 PVCs
26/01/24 06:55:31 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
org.apache.spark.SparkException: The executor cpu request (4) should be less than or equal to cpu limit (1)
	at org.apache.spark.deploy.k8s.features.BasicExecutorFeatureStep.$anonfun$configurePod$11(BasicExecutorFeatureStep.scala:236)

However, the Spark driver keeps retrying to create executor pods indefinitely unless the user has set an additional spark.driver.timeout configuration.

So, we had better exit the Spark job in this case ASAP. We can do that by simply switching SparkException to IllegalArgumentException, like the other steps, as sketched below.
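
For illustration, here is a minimal, self-contained Scala sketch of this fail-fast validation. The object and helper names (CpuValidationSketch, parseCpu) are hypothetical and not the actual BasicExecutorFeatureStep code; the real step resolves the request and limit from SparkConf entries such as spark.kubernetes.executor.request.cores and spark.kubernetes.executor.limit.cores.

```scala
// Hypothetical sketch; names are illustrative, not the upstream Spark code.
object CpuValidationSketch {
  // Parse a Kubernetes cpu quantity like "4", "0.5", or "500m" into millicores.
  private def parseCpu(quantity: String): Long =
    if (quantity.endsWith("m")) quantity.dropRight(1).toLong
    else (quantity.toDouble * 1000).toLong

  // require(...) throws IllegalArgumentException, so pod construction aborts
  // immediately instead of leaving the allocator to retry forever.
  def validate(requestCores: String, limitCores: Option[String]): Unit =
    limitCores.foreach { limit =>
      require(parseCpu(requestCores) <= parseCpu(limit),
        s"The executor cpu request ($requestCores) should be less than or " +
          s"equal to cpu limit ($limit)")
    }

  def main(args: Array[String]): Unit = {
    validate("500m", Some("1")) // passes: 500 millicores <= 1000 millicores
    validate("4", Some("1"))    // throws IllegalArgumentException
  }
}
```

Because require raises IllegalArgumentException instead of SparkException, the pod-building step fails the application immediately rather than surfacing the error only as a snapshot-subscriber warning.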

Does this PR introduce any user-facing change?

Technically no, because those misconfigured Spark jobs previously didn't get any resources anyway.

How was this patch tested?

Pass the CIs with the updated test case.
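
For illustration, a hypothetical ScalaTest shape for such a test; it exercises the standalone CpuValidationSketch above rather than the real BasicExecutorFeatureStepSuite, which needs Spark's full K8s test harness:

```scala
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical test shape; the names mirror, but are not, the real Spark suite.
class CpuValidationSketchSuite extends AnyFunSuite {
  test("executor cpu request <= limit passes") {
    CpuValidationSketch.validate("500m", Some("1"))
  }

  test("SPARK-55134: executor cpu request > limit fails fast") {
    val e = intercept[IllegalArgumentException] {
      CpuValidationSketch.validate("4", Some("1"))
    }
    assert(e.getMessage.contains("should be less than or equal to cpu limit"))
  }
}
```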

Also, I checked manually via spark-submit:

$ bin/spark-submit --master k8s://$K8S_MASTER \
--deploy-mode cluster \
-c spark.executor.instances=5 \
-c spark.kubernetes.executor.request.cores=4 \
-c spark.kubernetes.executor.limit.cores=1 \
-c spark.kubernetes.container.image=apache/spark:SPARK-55134 \
-c spark.kubernetes.authenticate.driver.serviceAccountName=spark \
-c spark.kubernetes.executor.useDriverPodIP=true \
--class org.apache.spark.examples.SparkPi \
local:///opt/spark/examples/jars/spark-examples.jar 200000
...
26/01/24 16:33:57 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: org-apache-spark-examples-sparkpi-0482f19beeec7491-driver
	 namespace: default
	 labels: spark-app-name -> org-apache-spark-examples-sparkpi, spark-app-selector -> spark-ee23f03db88b43fb906b0dbc1b04ad63, spark-role -> driver, spark-version -> 4.2.0-SNAPSHOT
	 pod uid: c6d41845-5893-4135-a065-278d94500315
	 creation time: 2026-01-24T07:33:52Z
	 service account name: spark
	 volumes: spark-local-dir-1, spark-conf-volume-driver, kube-api-access-8rbc8
	 node name: lima-rancher-desktop
	 start time: 2026-01-24T07:33:52Z
	 phase: Failed
	 container status:
		 container name: spark-kubernetes-driver
		 container image: apache/spark:SPARK-55134
		 container state: terminated
		 container started at: 2026-01-24T07:33:53Z
		 container finished at: 2026-01-24T07:33:55Z
		 exit code: 1
		 termination reason: Error
26/01/24 16:33:57 INFO LoggingPodStatusWatcherImpl: Application status for spark-ee23f03db88b43fb906b0dbc1b04ad63 (phase: Failed)
26/01/24 16:33:57 INFO LoggingPodStatusWatcherImpl: Container final statuses:
	 container name: spark-kubernetes-driver
	 container image: apache/spark:SPARK-55134
	 container state: terminated
	 container started at: 2026-01-24T07:33:53Z
	 container finished at: 2026-01-24T07:33:55Z
	 exit code: 1
	 termination reason: Error
26/01/24 16:33:57 INFO LoggingPodStatusWatcherImpl: Application org.apache.spark.examples.SparkPi with application ID spark-ee23f03db88b43fb906b0dbc1b04ad63 and submission ID default:org-apache-spark-examples-sparkpi-0482f19beeec7491-driver finished

Was this patch authored or co-authored using generative AI tooling?

No.


github-actions bot commented Jan 24, 2026

JIRA Issue Information

=== Bug SPARK-55134 ===
Summary: If the executor’s resource request is greater than its limit, the driver is expected to exit.
Assignee: None
Status: Open
Affected: ["4.1.0","4.1.1"]


This comment was automatically generated by GitHub Actions

@dongjoon-hyun (Member, Author)

Could you review this PR when you have some time, @peter-toth ?

@peter-toth (Contributor) left a comment

Sure. The change looks good to me.

@dongjoon-hyun (Member, Author)

Thank you so much, @peter-toth !

Merged to master/4.1.

dongjoon-hyun added a commit that referenced this pull request Jan 24, 2026
…tException` for executor cpu misconfigs

### What changes were proposed in this pull request?

This PR aims to fix `BasicExecutorFeatureStep` to throw `IllegalArgumentException` for executor cpu misconfigurations in order to fail the Spark jobs ASAP.

### Why are the changes needed?

Since Apache Spark 4.1.0, the Spark driver pod throws `SparkException` for this executor cpu misconfiguration before sending pod requests to the K8s control plane. This improvement reduces the burden on the K8s control plane.
- #51678

```
26/01/24 06:55:31 INFO ExecutorPodsAllocator: Going to request 5 executors from Kubernetes for ResourceProfile Id: 0, target: 5, known: 0, sharedSlotFromPendingPods: 2147483647.
26/01/24 06:55:31 INFO ExecutorPodsAllocator: Found 0 reusable PVCs from 0 PVCs
26/01/24 06:55:31 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
org.apache.spark.SparkException: The executor cpu request (4) should be less than or equal to cpu limit (1)
	at org.apache.spark.deploy.k8s.features.BasicExecutorFeatureStep.$anonfun$configurePod$11(BasicExecutorFeatureStep.scala:236)
```

However, the Spark driver keeps retrying to create executor pods indefinitely unless the user has set an additional `spark.driver.timeout` configuration.
- #45313

So, we had better exit the Spark job in this case ASAP. We can do that by simply switching `SparkException` to `IllegalArgumentException`, like the other steps.

- #30084

### Does this PR introduce _any_ user-facing change?

Technically no, because those misconfigured Spark jobs previously didn't get any resources anyway.

### How was this patch tested?

Pass the CIs with the updated test case.

Also, I checked manually via `spark-submit`:
```
$ bin/spark-submit --master k8s://$K8S_MASTER \
--deploy-mode cluster \
-c spark.executor.instances=5 \
-c spark.kubernetes.executor.request.cores=4 \
-c spark.kubernetes.executor.limit.cores=1 \
-c spark.kubernetes.container.image=apache/spark:SPARK-55134 \
-c spark.kubernetes.authenticate.driver.serviceAccountName=spark \
-c spark.kubernetes.executor.useDriverPodIP=true \
--class org.apache.spark.examples.SparkPi \
local:///opt/spark/examples/jars/spark-examples.jar 200000
...
26/01/24 16:33:57 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: org-apache-spark-examples-sparkpi-0482f19beeec7491-driver
	 namespace: default
	 labels: spark-app-name -> org-apache-spark-examples-sparkpi, spark-app-selector -> spark-ee23f03db88b43fb906b0dbc1b04ad63, spark-role -> driver, spark-version -> 4.2.0-SNAPSHOT
	 pod uid: c6d41845-5893-4135-a065-278d94500315
	 creation time: 2026-01-24T07:33:52Z
	 service account name: spark
	 volumes: spark-local-dir-1, spark-conf-volume-driver, kube-api-access-8rbc8
	 node name: lima-rancher-desktop
	 start time: 2026-01-24T07:33:52Z
	 phase: Failed
	 container status:
		 container name: spark-kubernetes-driver
		 container image: apache/spark:SPARK-55134
		 container state: terminated
		 container started at: 2026-01-24T07:33:53Z
		 container finished at: 2026-01-24T07:33:55Z
		 exit code: 1
		 termination reason: Error
26/01/24 16:33:57 INFO LoggingPodStatusWatcherImpl: Application status for spark-ee23f03db88b43fb906b0dbc1b04ad63 (phase: Failed)
26/01/24 16:33:57 INFO LoggingPodStatusWatcherImpl: Container final statuses:
	 container name: spark-kubernetes-driver
	 container image: apache/spark:SPARK-55134
	 container state: terminated
	 container started at: 2026-01-24T07:33:53Z
	 container finished at: 2026-01-24T07:33:55Z
	 exit code: 1
	 termination reason: Error
26/01/24 16:33:57 INFO LoggingPodStatusWatcherImpl: Application org.apache.spark.examples.SparkPi with application ID spark-ee23f03db88b43fb906b0dbc1b04ad63 and submission ID default:org-apache-spark-examples-sparkpi-0482f19beeec7491-driver finished
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53948 from dongjoon-hyun/SPARK-55134.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ab3ec9e)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun deleted the SPARK-55134 branch January 24, 2026 13:59