[SPARK-55134] Fix BasicExecutorFeatureStep to throw IllegalArgumentException for executor cpu misconfigs (#53948)
Conversation
JIRA Issue Information: Bug SPARK-55134 (this comment was automatically generated by GitHub Actions)
Could you review this PR when you have some time, @peter-toth?
peter-toth left a comment:
Sure. The change looks good to me.
Thank you so much, @peter-toth! Merged to master/4.1.
### What changes were proposed in this pull request?

This PR aims to fix `BasicExecutorFeatureStep` to throw `IllegalArgumentException` for executor cpu misconfigurations in order to fail the Spark jobs ASAP.

### Why are the changes needed?

Since Apache Spark 4.1.0, the Spark driver pod throws `SparkException` for an executor cpu misconfiguration before sending the request to the K8s control plane. This improvement reduces the burden on the K8s control plane.

- #51678

```
26/01/24 06:55:31 INFO ExecutorPodsAllocator: Going to request 5 executors from Kubernetes for ResourceProfile Id: 0, target: 5, known: 0, sharedSlotFromPendingPods: 2147483647.
26/01/24 06:55:31 INFO ExecutorPodsAllocator: Found 0 reusable PVCs from 0 PVCs
26/01/24 06:55:31 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
org.apache.spark.SparkException: The executor cpu request (4) should be less than or equal to cpu limit (1)
	at org.apache.spark.deploy.k8s.features.BasicExecutorFeatureStep.$anonfun$configurePod$11(BasicExecutorFeatureStep.scala:236)
```

However, the Spark driver keeps retrying to create executor pods anyway if the users didn't set an additional `spark.driver.timeout` configuration.

- `spark.driver.timeout` and `DriverTimeoutPlugin`: #45313

So, we had better exit the Spark job in this case ASAP. We can do that by simply switching `SparkException` to `IllegalArgumentException` like the other steps.

- #30084

### Does this PR introduce _any_ user-facing change?

Technically no, because previously those misconfigured Spark jobs didn't get any resources.

### How was this patch tested?

Pass the CIs with the updated test case.

Also, I checked manually via `spark-submit`:

```
$ bin/spark-submit --master k8s://$K8S_MASTER \
--deploy-mode cluster \
-c spark.executor.instances=5 \
-c spark.kubernetes.executor.request.cores=4 \
-c spark.kubernetes.executor.limit.cores=1 \
-c spark.kubernetes.container.image=apache/spark:SPARK-55134 \
-c spark.kubernetes.authenticate.driver.serviceAccountName=spark \
-c spark.kubernetes.executor.useDriverPodIP=true \
--class org.apache.spark.examples.SparkPi \
local:///opt/spark/examples/jars/spark-examples.jar 200000
...
26/01/24 16:33:57 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: org-apache-spark-examples-sparkpi-0482f19beeec7491-driver
	 namespace: default
	 labels: spark-app-name -> org-apache-spark-examples-sparkpi, spark-app-selector -> spark-ee23f03db88b43fb906b0dbc1b04ad63, spark-role -> driver, spark-version -> 4.2.0-SNAPSHOT
	 pod uid: c6d41845-5893-4135-a065-278d94500315
	 creation time: 2026-01-24T07:33:52Z
	 service account name: spark
	 volumes: spark-local-dir-1, spark-conf-volume-driver, kube-api-access-8rbc8
	 node name: lima-rancher-desktop
	 start time: 2026-01-24T07:33:52Z
	 phase: Failed
	 container status:
		 container name: spark-kubernetes-driver
		 container image: apache/spark:SPARK-55134
		 container state: terminated
		 container started at: 2026-01-24T07:33:53Z
		 container finished at: 2026-01-24T07:33:55Z
		 exit code: 1
		 termination reason: Error
26/01/24 16:33:57 INFO LoggingPodStatusWatcherImpl: Application status for spark-ee23f03db88b43fb906b0dbc1b04ad63 (phase: Failed)
26/01/24 16:33:57 INFO LoggingPodStatusWatcherImpl: Container final statuses:
	 container name: spark-kubernetes-driver
	 container image: apache/spark:SPARK-55134
	 container state: terminated
	 container started at: 2026-01-24T07:33:53Z
	 container finished at: 2026-01-24T07:33:55Z
	 exit code: 1
	 termination reason: Error
26/01/24 16:33:57 INFO LoggingPodStatusWatcherImpl: Application org.apache.spark.examples.SparkPi with application ID spark-ee23f03db88b43fb906b0dbc1b04ad63 and submission ID default:org-apache-spark-examples-sparkpi-0482f19beeec7491-driver finished
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53948 from dongjoon-hyun/SPARK-55134.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ab3ec9e)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
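To illustrate the kind of change described, here is a minimal, hypothetical Scala sketch (names and signatures simplified; this is not the actual Spark source). Scala's `require` throws `IllegalArgumentException`, which is the idiom the other feature steps use, so the driver aborts immediately instead of the pod-allocation loop catching `SparkException` and retrying:

```scala
// Hypothetical, simplified model of the cpu request/limit check in
// BasicExecutorFeatureStep. The real step builds a K8s executor pod spec;
// here we only model the validation this PR changes from SparkException
// (caught and retried by the snapshot subscriber) to IllegalArgumentException
// (fails the driver fast).
object ExecutorCpuValidation {
  def validate(requestCores: Double, limitCores: Option[Double]): Unit = {
    limitCores.foreach { limit =>
      // require throws IllegalArgumentException when the condition is false,
      // matching the validation style of the other K8s feature steps.
      require(
        requestCores <= limit,
        s"The executor cpu request ($requestCores) should be less than or " +
          s"equal to cpu limit ($limit)")
    }
  }
}
```

With this shape, a misconfiguration such as `request.cores=4` with `limit.cores=1` raises `IllegalArgumentException` at pod-configuration time, so the driver exits instead of retrying executor creation indefinitely.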