Fix Edge worker reporting crashed tasks as SUCCESS #65833
_run_job_via_supervisor returned 1 on a caught exception, but multiprocessing.Process ignores the target's return value. The subprocess exit code was therefore always 0, Job.is_success was True, and crashed tasks were reported to the central server as SUCCESS despite the "Task execution failed" traceback in the log. Use sys.exit(1) so the subprocess exits with code 1 and fetch_and_run_job takes the FAILED branch.
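The behaviour described above can be reproduced outside Airflow with a minimal sketch. The function names returns_one, exits_one and exit_code_of are hypothetical and not part of the edge3 provider. The "fork" start method is an assumption made only so the snippet runs without further setup on POSIX systems.

```python
import multiprocessing
import sys


def returns_one() -> int:
    # The return value of a Process target is discarded by multiprocessing,
    # so the child still exits with code 0.
    return 1


def exits_one() -> None:
    # Raising SystemExit(1) is what actually sets the child's exit code.
    sys.exit(1)


def exit_code_of(target) -> int:
    # Hypothetical helper: run target in a child process and return its exit code.
    # "fork" is chosen only to keep the sketch self-contained (POSIX only).
    ctx = multiprocessing.get_context("fork")
    proc = ctx.Process(target=target)
    proc.start()
    proc.join()
    return proc.exitcode


if __name__ == "__main__":
    print(exit_code_of(returns_one))  # prints 0
    print(exit_code_of(exits_one))    # prints 1
```

A parent that decides success from the exit code alone would treat the first child as successful, which matches the symptom this PR describes.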
@jscheffl, what you are reporting seems to be a different bug altogether. When a user task raises, task_runner.main() catches it, sends TaskState(FAILED) to the supervisor, and returns normally. _fork_main then exits the task-runner subprocess with code 0, so supervise_task returns exit_code=0. edge3's _run_job_via_supervisor only looks at the return value, so Job.is_success is True and it calls jobs_set_state(..., SUCCESS). I will raise a follow-up PR for that. What I was trying to address in this PR is the following exception:
So in your case the task failed with an HTTP exception and the worker then assumed "all is good". But if the API server is down (even after retries), can the worker report back to the central instance at all? Or do you have different or longer retries configured for the worker than for the execution API? If both are configured consistently, then either the worker and the task both retry until the API server is back, or both fail to communicate back to the origin. Does handling the error then "just" change the logs cosmetically?


Was generative AI tooling used to co-author this PR?
ClaudeCode Opus 4.7