[Bug] [log] ds log pull error #17521

@silent-night-no-trace

Description

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

I use ds to schedule Spark tasks in spark-on-k8s mode. When a pod runs for long enough, the task instance page in ds shows the following error in the task's log:

191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:54.320 +0800 -  -> 
	25/09/12 14:20:53 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:55.320 +0800 -  -> 
	25/09/12 14:20:54 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:56.321 +0800 -  -> 
	25/09/12 14:20:55 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:57.322 +0800 -  -> 
	25/09/12 14:20:56 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:58.323 +0800 -  -> 
	25/09/12 14:20:57 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:59.324 +0800 -  -> 
	25/09/12 14:20:58 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
	25/09/12 14:20:59 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
		 pod name: com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
		 namespace: ds
		 labels: dolphinscheduler-label -> 103282_91946, spark-app-name -> com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob, spark-app-selector -> spark-191892f6b9e44020aea1005406d009b2, spark-role -> driver, spark-version -> 3.5.5
		 pod uid: 2c52087d-daf3-43c8-b57e-95841be1d52c
		 creation time: 2025-09-12T06:10:40Z
		 service account name: spark-driver-sa
		 volumes: spark-history-logs-pvc, hadoop-properties, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-d5fwt
		 node name: kubesphere-node-7
		 start time: 2025-09-12T06:10:40Z
		 phase: Running
		 container status: 
			 container name: spark-kubernetes-driver
			 container image: harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
			 container state: terminated
			 container started at: 2025-09-12T06:10:42Z
			 container finished at: 2025-09-12T06:20:58Z
			 exit code: 0
			 termination reason: Completed
[INFO] 2025-09-12 14:21:00.325 +0800 -  -> 
	25/09/12 14:20:59 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:21:01.325 +0800 -  -> 
	25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
		 pod name: com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
		 namespace: ds
		 labels: dolphinscheduler-label -> 103282_91946, spark-app-name -> com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob, spark-app-selector -> spark-191892f6b9e44020aea1005406d009b2, spark-role -> driver, spark-version -> 3.5.5
		 pod uid: 2c52087d-daf3-43c8-b57e-95841be1d52c
		 creation time: 2025-09-12T06:10:40Z
		 service account name: spark-driver-sa
		 volumes: spark-history-logs-pvc, hadoop-properties, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-d5fwt
		 node name: kubesphere-node-7
		 start time: 2025-09-12T06:10:40Z
		 phase: Running
		 container status: 
			 container name: spark-kubernetes-driver
			 container image: harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
			 container state: terminated
			 container started at: 2025-09-12T06:10:42Z
			 container finished at: 2025-09-12T06:20:58Z
			 exit code: 0
			 termination reason: Completed
	25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
	25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
		 pod name: com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
		 namespace: ds
		 labels: dolphinscheduler-label -> 103282_91946, spark-app-name -> com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob, spark-app-selector -> spark-191892f6b9e44020aea1005406d009b2, spark-role -> driver, spark-version -> 3.5.5
		 pod uid: 2c52087d-daf3-43c8-b57e-95841be1d52c
		 creation time: 2025-09-12T06:10:40Z
		 service account name: spark-driver-sa
		 volumes: spark-history-logs-pvc, hadoop-properties, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-d5fwt
		 node name: kubesphere-node-7
		 start time: 2025-09-12T06:10:40Z
		 phase: Succeeded
		 container status: 
			 container name: spark-kubernetes-driver
			 container image: harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
			 container state: terminated
			 container started at: 2025-09-12T06:10:42Z
			 container finished at: 2025-09-12T06:20:58Z
			 exit code: 0
			 termination reason: Completed
	25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Succeeded)
	25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Container final statuses:
	
	
		 container name: spark-kubernetes-driver
		 container image: harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
		 container state: terminated
		 container started at: 2025-09-12T06:10:42Z
		 container finished at: 2025-09-12T06:20:58Z
		 exit code: 0
		 termination reason: Completed
	25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Application com.analysis.job.UserCleanupMergeBatchJob with application ID spark-191892f6b9e44020aea1005406d009b2 and submission ID ds:com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver finished
	25/09/12 14:21:00 INFO ShutdownHookManager: Shutdown hook called
	25/09/12 14:21:00 INFO ShutdownHookManager: Deleting directory /tmp/spark-1c12b6c0-f57a-4bf6-9321-336f6045ba05
[ERROR] 2025-09-12 14:21:01.327 +0800 - Handle pod log error
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: The driver pod does not exist.
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.run(AbstractCommandExecutor.java:182)
	at org.apache.dolphinscheduler.plugin.task.api.AbstractYarnTask.handle(AbstractYarnTask.java:53)
	at org.apache.dolphinscheduler.server.worker.runner.DefaultWorkerTaskExecutor.executeTask(DefaultWorkerTaskExecutor.java:51)
	at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecutor.run(WorkerTaskExecutor.java:172)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The driver pod does not exist.
	at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.lambda$collectPodLogIfNeeded$0(AbstractCommandExecutor.java:254)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	... 3 common frames omitted
Caused by: java.lang.RuntimeException: The driver pod does not exist.
	at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.lambda$collectPodLogIfNeeded$0(AbstractCommandExecutor.java:244)
	... 7 common frames omitted
[INFO] 2025-09-12 14:21:01.328 +0800 - process has exited. execute path:/tmp/dolphinscheduler/exec/process/root/147374930387104/151590881017889_8/103282/91946, processId:49917 ,exitStatusCode:0 ,processWaitForStatus:true ,processExitValue:0
[INFO] 2025-09-12 14:21:01.328 +0800 - Start finding appId in /opt/dolphinscheduler/logs/20250912/151590881017889/8/103282/91946.log, fetch way: log 
[INFO] 2025-09-12 14:21:01.330 +0800 - 
***********************************************************************************************
[INFO] 2025-09-12 14:21:01.330 +0800 - *********************************  Finalize task instance  ************************************
[INFO] 2025-09-12 14:21:01.330 +0800 - ***********************************************************************************************
[INFO] 2025-09-12 14:21:01.331 +0800 - Upload output files: [] successfully
[INFO] 2025-09-12 14:21:01.333 +0800 - Send task execute status: SUCCESS to master : dolphinscheduler-worker-1.dolphinscheduler-worker-headless:1234
[INFO] 2025-09-12 14:21:01.333 +0800 - Remove the current task execute context from worker cache
[INFO] 2025-09-12 14:21:01.334 +0800 - The current execute mode isn't develop mode, will clear the task execute file: /tmp/dolphinscheduler/exec/process/root/147374930387104/151590881017889_8/103282/91946
[INFO] 2025-09-12 14:21:01.347 +0800 - Success clear the task execute file: /tmp/dolphinscheduler/exec/process/root/147374930387104/151590881017889_8/103282/91946
[INFO] 2025-09-12 14:21:01.347 +0800 - FINALIZE_SESSION

What you expected to happen

Every task instance on the ds task instance page should be able to pull its logs normally.

How to reproduce

Use ds to schedule a Spark task (spark-on-k8s mode) whose execution time exceeds 10 minutes.
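
The stack trace above points at collectPodLogIfNeeded in AbstractCommandExecutor: the scheduled log-collection task looks up the driver pod, but by the time the Spark application has finished, Kubernetes has already removed the completed pod, so the lookup throws "The driver pod does not exist." even though the job itself succeeded. A minimal sketch of that race and one tolerant way to handle it (class, method, and phase names here are hypothetical stand-ins, not DolphinScheduler's actual code):

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a pod-log collector that tolerates the driver pod being
// deleted after a terminal phase was already observed.
public class PodLogCollectorSketch {

    enum Phase { RUNNING, SUCCEEDED }

    // Simulated cluster lookup: returns empty once the pod has been deleted.
    static Optional<Phase> findDriverPod(AtomicReference<Phase> cluster) {
        return Optional.ofNullable(cluster.get());
    }

    // Returns "COMPLETED" instead of throwing when the pod has vanished
    // but a terminal phase was already seen -- the condition in this issue.
    static String collectOnce(AtomicReference<Phase> cluster, boolean sawTerminalPhase) {
        Optional<Phase> pod = findDriverPod(cluster);
        if (!pod.isPresent()) {
            if (sawTerminalPhase) {
                return "COMPLETED"; // pod already finished; not an error
            }
            throw new RuntimeException("The driver pod does not exist.");
        }
        return pod.get().name();
    }

    public static void main(String[] args) {
        AtomicReference<Phase> cluster = new AtomicReference<>(Phase.RUNNING);
        System.out.println(collectOnce(cluster, false));
        cluster.set(null); // Kubernetes garbage-collects the finished pod
        System.out.println(collectOnce(cluster, true));
    }
}
```

The key point is that "pod not found" after a Succeeded phase is a normal completion signal, not a failure, so the task instance should still be marked SUCCESS with its logs intact.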

Anything else

No response

Version

3.2.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Metadata

Assignees

No one assigned

    Labels

    wontfix (This will not be worked on)