I use DolphinScheduler (ds) to schedule Spark tasks in spark-on-k8s mode. When the driver pod runs for too long, the task log on the DolphinScheduler task instance page ends with the following error:
191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:54.320 +0800 - ->
25/09/12 14:20:53 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:55.320 +0800 - ->
25/09/12 14:20:54 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:56.321 +0800 - ->
25/09/12 14:20:55 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:57.322 +0800 - ->
25/09/12 14:20:56 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:58.323 +0800 - ->
25/09/12 14:20:57 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:59.324 +0800 - ->
25/09/12 14:20:58 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
25/09/12 14:20:59 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
namespace: ds
labels: dolphinscheduler-label -> 103282_91946, spark-app-name -> com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob, spark-app-selector -> spark-191892f6b9e44020aea1005406d009b2, spark-role -> driver, spark-version -> 3.5.5
pod uid: 2c52087d-daf3-43c8-b57e-95841be1d52c
creation time: 2025-09-12T06:10:40Z
service account name: spark-driver-sa
volumes: spark-history-logs-pvc, hadoop-properties, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-d5fwt
node name: kubesphere-node-7
start time: 2025-09-12T06:10:40Z
phase: Running
container status:
container name: spark-kubernetes-driver
container image: harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
container state: terminated
container started at: 2025-09-12T06:10:42Z
container finished at: 2025-09-12T06:20:58Z
exit code: 0
termination reason: Completed
[INFO] 2025-09-12 14:21:00.325 +0800 - ->
25/09/12 14:20:59 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:21:01.325 +0800 - ->
25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
namespace: ds
labels: dolphinscheduler-label -> 103282_91946, spark-app-name -> com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob, spark-app-selector -> spark-191892f6b9e44020aea1005406d009b2, spark-role -> driver, spark-version -> 3.5.5
pod uid: 2c52087d-daf3-43c8-b57e-95841be1d52c
creation time: 2025-09-12T06:10:40Z
service account name: spark-driver-sa
volumes: spark-history-logs-pvc, hadoop-properties, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-d5fwt
node name: kubesphere-node-7
start time: 2025-09-12T06:10:40Z
phase: Running
container status:
container name: spark-kubernetes-driver
container image: harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
container state: terminated
container started at: 2025-09-12T06:10:42Z
container finished at: 2025-09-12T06:20:58Z
exit code: 0
termination reason: Completed
25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
namespace: ds
labels: dolphinscheduler-label -> 103282_91946, spark-app-name -> com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob, spark-app-selector -> spark-191892f6b9e44020aea1005406d009b2, spark-role -> driver, spark-version -> 3.5.5
pod uid: 2c52087d-daf3-43c8-b57e-95841be1d52c
creation time: 2025-09-12T06:10:40Z
service account name: spark-driver-sa
volumes: spark-history-logs-pvc, hadoop-properties, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-d5fwt
node name: kubesphere-node-7
start time: 2025-09-12T06:10:40Z
phase: Succeeded
container status:
container name: spark-kubernetes-driver
container image: harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
container state: terminated
container started at: 2025-09-12T06:10:42Z
container finished at: 2025-09-12T06:20:58Z
exit code: 0
termination reason: Completed
25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Succeeded)
25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Container final statuses:
container name: spark-kubernetes-driver
container image: harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
container state: terminated
container started at: 2025-09-12T06:10:42Z
container finished at: 2025-09-12T06:20:58Z
exit code: 0
termination reason: Completed
25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Application com.analysis.job.UserCleanupMergeBatchJob with application ID spark-191892f6b9e44020aea1005406d009b2 and submission ID ds:com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver finished
25/09/12 14:21:00 INFO ShutdownHookManager: Shutdown hook called
25/09/12 14:21:00 INFO ShutdownHookManager: Deleting directory /tmp/spark-1c12b6c0-f57a-4bf6-9321-336f6045ba05
[ERROR] 2025-09-12 14:21:01.327 +0800 - Handle pod log error
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: The driver pod does not exist.
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.run(AbstractCommandExecutor.java:182)
at org.apache.dolphinscheduler.plugin.task.api.AbstractYarnTask.handle(AbstractYarnTask.java:53)
at org.apache.dolphinscheduler.server.worker.runner.DefaultWorkerTaskExecutor.executeTask(DefaultWorkerTaskExecutor.java:51)
at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecutor.run(WorkerTaskExecutor.java:172)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The driver pod does not exist.
at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.lambda$collectPodLogIfNeeded$0(AbstractCommandExecutor.java:254)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
... 3 common frames omitted
Caused by: java.lang.RuntimeException: The driver pod does not exist.
at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.lambda$collectPodLogIfNeeded$0(AbstractCommandExecutor.java:244)
... 7 common frames omitted
[INFO] 2025-09-12 14:21:01.328 +0800 - process has exited. execute path:/tmp/dolphinscheduler/exec/process/root/147374930387104/151590881017889_8/103282/91946, processId:49917 ,exitStatusCode:0 ,processWaitForStatus:true ,processExitValue:0
[INFO] 2025-09-12 14:21:01.328 +0800 - Start finding appId in /opt/dolphinscheduler/logs/20250912/151590881017889/8/103282/91946.log, fetch way: log
[INFO] 2025-09-12 14:21:01.330 +0800 -
***********************************************************************************************
[INFO] 2025-09-12 14:21:01.330 +0800 - ********************************* Finalize task instance ************************************
[INFO] 2025-09-12 14:21:01.330 +0800 - ***********************************************************************************************
[INFO] 2025-09-12 14:21:01.331 +0800 - Upload output files: [] successfully
[INFO] 2025-09-12 14:21:01.333 +0800 - Send task execute status: SUCCESS to master : dolphinscheduler-worker-1.dolphinscheduler-worker-headless:1234
[INFO] 2025-09-12 14:21:01.333 +0800 - Remove the current task execute context from worker cache
[INFO] 2025-09-12 14:21:01.334 +0800 - The current execute mode isn't develop mode, will clear the task execute file: /tmp/dolphinscheduler/exec/process/root/147374930387104/151590881017889_8/103282/91946
[INFO] 2025-09-12 14:21:01.347 +0800 - Success clear the task execute file: /tmp/dolphinscheduler/exec/process/root/147374930387104/151590881017889_8/103282/91946
[INFO] 2025-09-12 14:21:01.347 +0800 - FINALIZE_SESSION
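The stack trace points at AbstractCommandExecutor.collectPodLogIfNeeded, which polls the driver pod from a scheduled executor. The log suggests a race: the application reaches Succeeded, Kubernetes removes the driver pod, and the collector's next poll finds nothing and throws. A minimal sketch of that race in plain Java follows; the class and method names are hypothetical, and only the error message is copied from the log above:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class PodLogRace {
    // Stand-in for "does the driver pod still exist in the cluster?"
    static volatile boolean driverPodExists = true;

    // Stand-in for the pod-log lookup inside collectPodLogIfNeeded.
    static String collectPodLog() {
        if (!driverPodExists) {
            throw new RuntimeException("The driver pod does not exist.");
        }
        return "phase: Running";
    }

    // Runs the collector on a worker thread, lets the pod vanish between
    // two polls, and returns the message that surfaces through the Future,
    // mirroring the ExecutionException in the task log.
    static String simulateRace() throws InterruptedException {
        ExecutorService poller = Executors.newSingleThreadExecutor();
        Future<?> collector = poller.submit(() -> {
            collectPodLog();            // first poll: pod still there
            driverPodExists = false;    // app Succeeded, pod reclaimed by k8s
            collectPodLog();            // second poll: throws
        });
        try {
            collector.get();
            return "no error";
        } catch (ExecutionException e) {
            return e.getCause().getMessage();
        } finally {
            poller.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(simulateRace());
    }
}
```

If this race is the cause, the collector would presumably need to treat a missing pod after a clean (exit code 0) termination as end-of-log rather than as an error.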
Search before asking
What happened
I use DolphinScheduler (ds) to schedule Spark tasks in spark-on-k8s mode. When the driver pod runs for too long, the task log on the DolphinScheduler task instance page ends with the error shown above ("Handle pod log error ... The driver pod does not exist.").
What you expected to happen
Every task instance on the DolphinScheduler task instance page should be able to pull pod logs normally, without the "Handle pod log error" failure.
How to reproduce
Schedule a Spark task with DolphinScheduler in spark-on-k8s mode whose execution time exceeds 10 minutes.
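A reproduction can be sketched as a spark-on-k8s submission of any job that runs longer than ten minutes. This is a non-runnable command template; the API host, image, namespace, class, and jar path are placeholders, not values from a working setup:

```shell
spark-submit \
  --master k8s://https://<kubernetes-api-host>:<port> \
  --deploy-mode cluster \
  --name long-running-job \
  --class com.example.LongRunningJob \
  --conf spark.kubernetes.namespace=<namespace> \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/app/long-running-job.jar
```

Wrapping this in a DolphinScheduler Spark task and letting the job run past the 10-minute mark should reproduce the log-collection error at task completion.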
Anything else
No response
Version
3.2.x
Are you willing to submit PR?
Code of Conduct