Closed
Description
Currently, the run error (termination reason) can be seen via the CLI with dstack ps -v or via the API:
    NAME               BACKEND  INSTANCE  PRICE  STATUS      SUBMITTED    ERROR
    shaggy-husky-1     local    local     $0.0   failed      4 weeks ago  JOB_FAILED
                                                                          (CONTAINER_EXITED_WITH_ERROR)
    heavy-crab-1       local    local     $0.0   terminated  4 weeks ago  STOPPED_BY_USER
    tame-fox-1         local    local     $0.0   terminated  4 weeks ago  STOPPED_BY_USER
    ordinary-wombat-1  local    local     $0.0   done        4 weeks ago  ALL_JOBS_DONE
Users also expect to see run errors in the UI (e.g. #1654). Add an Error field next to Status on the run page. It should display run.termination_reason and, as the CLI does for failed runs, the termination reason of the latest job submission (run.jobs[0].job_submissions[-1].termination_reason). Here is the CLI logic:
dstack/src/dstack/_internal/cli/utils/run.py
Lines 177 to 199 in 1537163
    def _get_run_error(run: Run) -> str:
        if run._run.termination_reason is None:
            return ""
        if len(run._run.jobs) > 1:
            return run._run.termination_reason.name
        run_job_termination_reason = _get_run_job_termination_reason(run)
        # For failed runs, also show termination reason to provide more context.
        # For other run statuses, the job termination reason will duplicate run status.
        if run_job_termination_reason is not None and run._run.termination_reason in [
            RunTerminationReason.JOB_FAILED,
            RunTerminationReason.SERVER_ERROR,
            RunTerminationReason.RETRY_LIMIT_EXCEEDED,
        ]:
            return f"{run._run.termination_reason.name}\n({run_job_termination_reason.name})"
        return run._run.termination_reason.name


    def _get_run_job_termination_reason(run: Run) -> Optional[JobTerminationReason]:
        for job in run._run.jobs:
            if len(job.job_submissions) > 0:
                if job.job_submissions[-1].termination_reason is not None:
                    return job.job_submissions[-1].termination_reason
        return None
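To make the expected value of the Error field easier to check, here is a minimal, self-contained sketch of the same rules. It is not dstack code: the dataclasses (`RunData`, `Job`, `JobSubmission`), the function name `format_error_field`, and the single-line output format are hypothetical stand-ins, and the enums list only the members that appear in this issue.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


# Stand-in enums: only the members mentioned in this issue, not the full dstack sets.
class RunTerminationReason(Enum):
    ALL_JOBS_DONE = auto()
    JOB_FAILED = auto()
    RETRY_LIMIT_EXCEEDED = auto()
    SERVER_ERROR = auto()
    STOPPED_BY_USER = auto()


class JobTerminationReason(Enum):
    CONTAINER_EXITED_WITH_ERROR = auto()


# Hypothetical minimal models mirroring run.jobs[i].job_submissions[-1].termination_reason.
@dataclass
class JobSubmission:
    termination_reason: Optional[JobTerminationReason] = None


@dataclass
class Job:
    job_submissions: List[JobSubmission] = field(default_factory=list)


@dataclass
class RunData:
    termination_reason: Optional[RunTerminationReason] = None
    jobs: List[Job] = field(default_factory=list)


# Run-level reasons for which the job-level reason adds useful context.
FAILURE_REASONS = {
    RunTerminationReason.JOB_FAILED,
    RunTerminationReason.SERVER_ERROR,
    RunTerminationReason.RETRY_LIMIT_EXCEEDED,
}


def latest_job_termination_reason(run: RunData) -> Optional[JobTerminationReason]:
    """Return the latest submission's reason from the first job that has one."""
    for job in run.jobs:
        if job.job_submissions and job.job_submissions[-1].termination_reason is not None:
            return job.job_submissions[-1].termination_reason
    return None


def format_error_field(run: RunData) -> str:
    """Build the text for the Error field (single line here, an assumption for the UI)."""
    if run.termination_reason is None:
        return ""
    if len(run.jobs) > 1:
        # Multi-job runs show only the run-level reason, as in the CLI.
        return run.termination_reason.name
    job_reason = latest_job_termination_reason(run)
    if job_reason is not None and run.termination_reason in FAILURE_REASONS:
        return f"{run.termination_reason.name} ({job_reason.name})"
    return run.termination_reason.name


failed = RunData(
    RunTerminationReason.JOB_FAILED,
    [Job([JobSubmission(JobTerminationReason.CONTAINER_EXITED_WITH_ERROR)])],
)
print(format_error_field(failed))
```

Under these assumptions, the failed run from the table above would show both the run-level and the job-level reason, while stopped or finished runs would show only the run-level one.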