We are experiencing intermittent job failures when using the k8s executor for job runs on our Kubernetes deployment. One of the steps in the run emits a "STEP SUCCESS" event but continues to run. After some time, when the ttlSecondsAfterFinished value kicks in, Kubernetes terminates the pod, a "STEP FAILURE" event is emitted, and the job fails with the message: Step xxx failed health check: Kubernetes job yyy for step zzz could not be found.
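For context on how the TTL reaches the per-step Kubernetes Jobs, here is a hypothetical sketch. The "dagster-k8s/config" tag is dagster-k8s's per-op override mechanism; the exact nesting and the value of 60 below are assumptions for illustration, not taken from our deployment.

```python
# Hypothetical illustration of attaching ttlSecondsAfterFinished to the
# Kubernetes Job that dagster-k8s creates for a step. The nesting under
# "job_spec_config" is an assumption based on the k8s V1JobSpec field names.
step_k8s_config = {
    "dagster-k8s/config": {
        "job_spec_config": {
            # Kubernetes deletes the Job (and its pod) this many seconds
            # after the Job reaches a terminal state.
            "ttl_seconds_after_finished": 60,
        }
    }
}
```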
What did you expect to happen?
Expected Behavior:
- The step should terminate properly after emitting the "STEP SUCCESS" event.
- The job should not fail due to TTL termination if the step has already succeeded.
Actual Behavior:
- The step emits a "STEP SUCCESS" event but continues running.
- The step is terminated by the TTL setting, causing the job to fail with a "STEP FAILURE" event.
How to reproduce?
1. Configure a job using the k8s executor on a Helm chart deployment.
2. Ensure the job has multiple steps, where one step emits a "STEP SUCCESS" event but does not immediately stop running.
3. Set a ttlSecondsAfterFinished value for the pods.
4. Observe that the step continues to run after emitting the success event.
5. Wait for the TTL to expire and the pod to be terminated by Kubernetes.
6. The job reports a failure with the message Step xxx failed health check: Kubernetes job yyy for step zzz could not be found, resulting in a "STEP FAILURE" event.
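The failure mode above can be sketched as a timing race. This is a toy simulation, not Dagster code; it assumes the executor's health check polls for the step's Kubernetes Job by name and treats a missing Job as a failure even after a "STEP SUCCESS" event was already emitted.

```python
# Toy simulation of the suspected race between the TTL controller and the
# executor's health check (names and behavior are assumptions).
jobs = {"dagster-step-xyz": "Complete"}  # Job status after the step succeeds
events = ["STEP SUCCESS"]                # event already emitted by the step

def ttl_controller(jobs, name):
    """ttlSecondsAfterFinished expires: Kubernetes deletes the finished Job."""
    jobs.pop(name, None)

def health_check(jobs, name):
    """Delayed executor poll: a vanished Job is reported as a failure."""
    if name not in jobs:
        return "STEP FAILURE: Kubernetes job %s for step could not be found" % name
    return "healthy"

ttl_controller(jobs, "dagster-step-xyz")               # TTL fires first...
events.append(health_check(jobs, "dagster-step-xyz"))  # ...then the check runs
print(events)  # a success followed by a spurious failure
```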
Extracting the events shows nothing suspicious. Relevant events for a failed step, extracted with the command kubectl get events --sort-by=.metadata.creationTimestamp | grep dagster-step-[xy]:
50m Normal SuccessfulCreate job/dagster-step-[yx] Created pod: dagster-step-[xyz]
50m Normal Scheduled pod/dagster-step-[xyz] Successfully assigned hermod/dagster-step-[xyz] to [xxx]
50m Normal Pulled pod/dagster-step-[xyz] Successfully pulled image "[yyy]" in 621ms (621ms including waiting)
50m Normal Created pod/dagster-step-[xyz] Created container dagster
50m Normal Pulling pod/dagster-step-[xyz] Pulling image "[yyy]"
50m Normal Started pod/dagster-step-[xyz] Started container dagster
47m Normal Completed job/dagster-step-[xy]
Dagster version
1.7.9
Deployment type
Dagster Helm chart
Deployment details
dagster-k8s version = 0.23.8
kubectl version --output=yaml:
Additional information
Screenshot
Suspected Cause
The issue seems to be related to the check in the executor codebase at:
dagster/python_modules/libraries/dagster-k8s/dagster_k8s/executor.py
Line 320 in 8d1987f
There appears to be a delay in invoking this check: the step continues running after emitting the success event, and by the time the check runs, the TTL may already have deleted the finished Job, so the lookup fails and the step is marked as failed.
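The suspected race can be sketched with toy numbers (the intervals below are illustrative assumptions, not values from the dagster-k8s codebase): if the gap between health-check polls exceeds the Job's TTL, the finished Job is garbage-collected before the executor ever observes its completed status.

```python
# Toy timing model of the suspected race. CHECK_INTERVAL and TTL are
# illustrative values, not constants from dagster-k8s.
CHECK_INTERVAL = 120   # assumed seconds between executor health checks
TTL = 60               # ttlSecondsAfterFinished on the step's Job

def job_visible_at(t, finished_at=0, ttl=TTL):
    """A finished Job remains queryable only until finished_at + ttl."""
    return t < finished_at + ttl

# The first poll after the step finishes happens at t = CHECK_INTERVAL,
# by which time the Job has already been garbage-collected.
first_poll = CHECK_INTERVAL
print(job_visible_at(first_poll))  # False
```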
Suggestions to Fix
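One possible direction (our own assumption, not a confirmed fix from the maintainers): when the health check cannot find a step's Job, consult the events already emitted for that step before declaring failure, so a TTL-deleted Job that already reported success is not treated as an error. A minimal sketch:

```python
# Hypothetical mitigation: treat "Job not found after STEP SUCCESS" as a
# benign TTL deletion rather than a failure. The event names mirror the
# ones quoted in this issue; the function itself is illustrative.
def resolve_missing_job(step_events):
    """Decide how to treat a step whose Kubernetes Job can no longer be found."""
    if "STEP SUCCESS" in step_events:
        return "ignore"        # Job was garbage-collected after succeeding
    return "STEP FAILURE"      # Job vanished before reporting success

print(resolve_missing_job(["STEP START", "STEP SUCCESS"]))  # ignore
print(resolve_missing_job(["STEP START"]))                  # STEP FAILURE
```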
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.