
Intermittent step failures after step success on k8s executor #22565

Open
miloszbednarzak opened this issue Jun 15, 2024 · 0 comments
Labels
deployment: k8s Related to deploying Dagster to Kubernetes type: bug Something isn't working

Comments


miloszbednarzak commented Jun 15, 2024

Dagster version

1.7.9

What's the issue?

We are experiencing intermittent job failures when using the k8s executor on our Kubernetes deployment. The failure originates in a single step of the run: the step emits a "STEP SUCCESS" event but is still reported as running. Once the ttlSecondsAfterFinished window elapses, Kubernetes deletes the step's Job, a "STEP FAILURE" event is emitted, and the run fails with the message: Step xxx failed health check: Kubernetes job yyy for step zzz could not be found.
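For context, a step-level TTL is typically attached through the dagster-k8s per-op configuration tag. The key names below follow the dagster-k8s convention (snake_case fields of the Kubernetes JobSpec under `job_spec_config`) but should be treated as an assumption, and the 30-second value is purely illustrative:

```python
# Hypothetical per-op tag in the dagster-k8s convention; the TTL value is
# illustrative, not the value used in our deployment.
K8S_CONFIG_TAG = {
    "dagster-k8s/config": {
        "job_spec_config": {
            # Kubernetes deletes the finished Job this many seconds
            # after it completes.
            "ttl_seconds_after_finished": 30,
        },
    }
}
```

With a tag like this on an op, the step's Kubernetes Job is garbage-collected shortly after it finishes, which is exactly the window in which the health check described above can no longer find it.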

What did you expect to happen?

Expected Behavior:
The step should terminate properly after emitting the "STEP SUCCESS" event.
The job should not fail due to TTL termination if the step has already succeeded.

Actual Behavior:
The step emits a "STEP SUCCESS" event but continues running.
The step is terminated by the TTL setting, causing the job to fail with a "STEP FAILURE" event.

How to reproduce?

  1. Configure a job using the k8s executor on a Helm chart deployment.
  2. Ensure the job has multiple steps, where one step emits a "STEP SUCCESS" event but does not immediately stop running.
  3. Set a ttlSecondsAfterFinished value for the pods.
  4. Observe that the step continues to run after emitting the success event.
  5. Wait for the TTL to expire and the pod to be terminated by Kubernetes.
  6. The job reports a failure with the message: Step xxx failed health check: Kubernetes job yyy for step zzz could not be found, resulting in a "STEP FAILURE" event.
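The race in steps 4–6 can be sketched with a small stdlib simulation (nothing here calls Dagster or Kubernetes; all names and values are hypothetical): the step succeeds at t=0, the Job object disappears TTL seconds later, and a health check that first observes the step after the deletion sees only "not found" and reports a failure.

```python
# Hypothetical timeline simulation of the TTL race described above.

TTL_SECONDS = 30.0  # stands in for ttlSecondsAfterFinished


def job_exists(now: float, success_at: float) -> bool:
    """The k8s Job object disappears TTL seconds after it finishes."""
    return now < success_at + TTL_SECONDS


def health_check(now: float, success_at: float) -> str:
    """Naive check: a missing Job is always treated as a step failure."""
    if job_exists(now, success_at):
        return "STEP_SUCCESS"
    return "STEP_FAILURE"  # "Kubernetes job ... could not be found"


success_at = 0.0
print(health_check(10.0, success_at))  # checked within the TTL: success observed
print(health_check(40.0, success_at))  # checked after the TTL: spurious failure
```

The simulation makes the intermittency plausible: whether a run fails depends only on whether the health check happens to fire before or after the TTL deletes the Job.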

Deployment type

Dagster Helm chart

Deployment details

dagster-k8s version = 0.23.8

kubectl version --output=yaml:

clientVersion:
  buildDate: "2023-01-18T15:51:24Z"
  compiler: gc
  gitCommit: 8f94681cd294aa8cfd3407b8191f6c70214973a4
  gitTreeState: clean
  gitVersion: v1.26.1
  goVersion: go1.19.5
  major: "1"
  minor: "26"
  platform: darwin/arm64
kustomizeVersion: v4.5.7
serverVersion:
  buildDate: "2024-04-18T09:15:27Z"
  compiler: gc
  gitCommit: 6182a9d75ea01b59cab8da1e38abe144c12196d1
  gitTreeState: clean
  gitVersion: v1.29.4-gke.1043002
  goVersion: go1.21.9 X:boringcrypto
  major: "1"
  minor: "29"
  platform: linux/amd64

Additional information

Screenshot

(screenshot attached in the original issue, taken 2024-06-06 at 14:21:21)

Checked events

The extracted events show nothing suspicious.

Relevant events for a failed step, extracted with kubectl get events --sort-by=.metadata.creationTimestamp | grep dagster-step-[xy]:

50m         Normal    SuccessfulCreate       job/dagster-step-[yx]       Created pod: dagster-step-[xyz]
50m         Normal    Scheduled              pod/dagster-step-[xyz]      Successfully assigned hermod/dagster-step-[xyz] to [xxx]
50m         Normal    Pulled                 pod/dagster-step-[xyz]      Successfully pulled image "[yyy]" in 621ms (621ms including waiting)
50m         Normal    Created                pod/dagster-step-[xyz]      Created container dagster
50m         Normal    Pulling                pod/dagster-step-[xyz]      Pulling image "[yyy]"
50m         Normal    Started                pod/dagster-step-[xyz]      Started container dagster
47m         Normal    Completed              job/dagster-step-[xy]

Suspected Cause

The issue seems to be related to the health check in the executor codebase at:


There appears to be a delay before this check is invoked, so Dagster still treats the step as running after it has already emitted the success event.

Suggestions to Fix

  1. Ensure Proper Step Termination: Modify the step code to explicitly exit after emitting the "STEP SUCCESS" event.
  2. Implement Robust Health Checks: Ensure health checks accurately reflect the step's status and handle termination properly.
  3. Enhance Logging: Add detailed logging around the status check in Dagster to better understand delays.
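Suggestion 2 can be sketched in plain Python (the event-log lookup and all names below are hypothetical, not the dagster-k8s API): before declaring a health-check failure for a missing Job, consult the step's already-emitted events and treat a recorded success as terminal rather than as a lost step.

```python
def resolve_missing_job(step_key: str, emitted_events: dict) -> str:
    """Decide what a 'Kubernetes job could not be found' observation means.

    If the step already emitted STEP_SUCCESS, the Job was most likely
    garbage-collected by ttlSecondsAfterFinished, so the step should be
    treated as finished rather than failed.
    """
    if "STEP_SUCCESS" in emitted_events.get(step_key, []):
        return "STEP_SUCCESS"
    return "STEP_FAILURE"


events = {
    "step_a": ["STEP_START", "STEP_SUCCESS"],  # succeeded, Job later TTL-deleted
    "step_b": ["STEP_START"],                  # vanished without succeeding
}
print(resolve_missing_job("step_a", events))  # STEP_SUCCESS: TTL cleanup, not a failure
print(resolve_missing_job("step_b", events))  # STEP_FAILURE: genuinely lost step
```

The design choice here is to make "Job not found" ambiguous by default and disambiguate it with the event log, which is cheaper than changing step termination behavior (suggestion 1) and keeps genuine pod losses failing as before.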

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@miloszbednarzak miloszbednarzak added the type: bug Something isn't working label Jun 15, 2024
@garethbrickman garethbrickman added the deployment: k8s Related to deploying Dagster to Kubernetes label Jun 17, 2024