
Intermittent step failures after step success on k8s executor #22565

Open
miloszbednarzak opened this issue Jun 15, 2024 · 0 comments
Labels
deployment: k8s Related to deploying Dagster to Kubernetes type: bug Something isn't working

Comments


miloszbednarzak commented Jun 15, 2024

Dagster version

1.7.9

What's the issue?

We are experiencing intermittent job failures when using the k8s executor on our Kubernetes deployment. The failure originates in a single step of the run: the step emits a "STEP SUCCESS" event but is still reported as running. Once the ttlSecondsAfterFinished window elapses, Kubernetes deletes the step's Job, a "STEP FAILURE" event is emitted, and the run fails with the message: Step xxx failed health check: Kubernetes job yyy for step zzz could not be found.
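For context, a step-level TTL is typically attached through the dagster-k8s per-op configuration tag. The key names below follow the dagster-k8s convention (snake_case fields of the Kubernetes JobSpec under `job_spec_config`) but should be treated as an assumption, and the 30-second value is purely illustrative:

```python
# Hypothetical per-op tag in the dagster-k8s convention; the TTL value is
# illustrative, not the value used in our deployment.
K8S_CONFIG_TAG = {
    "dagster-k8s/config": {
        "job_spec_config": {
            # Kubernetes deletes the finished Job this many seconds
            # after it completes.
            "ttl_seconds_after_finished": 30,
        },
    }
}
```

With a tag like this on an op, the step's Kubernetes Job is garbage-collected shortly after it finishes, which is exactly the window in which the health check described above can no longer find it.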

What did you expect to happen?

Expected Behavior:
The step should terminate properly after emitting the "STEP SUCCESS" event.
The job should not fail due to TTL termination if the step has already succeeded.

Actual Behavior:
The step emits a "STEP SUCCESS" event but continues running.
The step is terminated by the TTL setting, causing the job to fail with a "STEP FAILURE" event.

How to reproduce?

  1. Configure a job using the k8s executor on a Helm chart deployment.
  2. Ensure the job has multiple steps, where one step emits a "STEP SUCCESS" event but does not immediately stop running.
  3. Set a ttlSecondsAfterFinished value for the pods.
  4. Observe that the step continues to run after emitting the success event.
  5. Wait for the TTL to expire and the pod to be terminated by Kubernetes.
  6. The job reports a failure with the message: Step xxx failed health check: Kubernetes job yyy for step zzz could not be found, resulting in a "STEP FAILURE" event.
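The race in steps 4–6 can be sketched with a small stdlib simulation (nothing here calls Dagster or Kubernetes; all names and values are hypothetical): the step succeeds at t=0, the Job object disappears TTL seconds later, and a health check that first observes the step after the deletion sees only "not found" and reports a failure.

```python
# Hypothetical timeline simulation of the TTL race described above.

TTL_SECONDS = 30.0  # stands in for ttlSecondsAfterFinished


def job_exists(now: float, success_at: float) -> bool:
    """The k8s Job object disappears TTL seconds after it finishes."""
    return now < success_at + TTL_SECONDS


def health_check(now: float, success_at: float) -> str:
    """Naive check: a missing Job is always treated as a step failure."""
    if job_exists(now, success_at):
        return "STEP_SUCCESS"
    return "STEP_FAILURE"  # "Kubernetes job ... could not be found"


success_at = 0.0
print(health_check(10.0, success_at))  # checked within the TTL: success observed
print(health_check(40.0, success_at))  # checked after the TTL: spurious failure
```

The simulation makes the intermittency plausible: whether a run fails depends only on whether the health check happens to fire before or after the TTL deletes the Job.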

Deployment type

Dagster Helm chart

Deployment details

dagster-k8s version = 0.23.8

kubectl version --output=yaml:

clientVersion:
  buildDate: "2023-01-18T15:51:24Z"
  compiler: gc
  gitCommit: 8f94681cd294aa8cfd3407b8191f6c70214973a4
  gitTreeState: clean
  gitVersion: v1.26.1
  goVersion: go1.19.5
  major: "1"
  minor: "26"
  platform: darwin/arm64
kustomizeVersion: v4.5.7
serverVersion:
  buildDate: "2024-04-18T09:15:27Z"
  compiler: gc
  gitCommit: 6182a9d75ea01b59cab8da1e38abe144c12196d1
  gitTreeState: clean
  gitVersion: v1.29.4-gke.1043002
  goVersion: go1.21.9 X:boringcrypto
  major: "1"
  minor: "29"
  platform: linux/amd64

Additional information

Screenshot

(screenshot attached in the original issue, taken 2024-06-06 at 14:21:21)

Checked events

The extracted events show nothing suspicious.

Relevant events for a failed step, extracted with kubectl get events --sort-by=.metadata.creationTimestamp | grep dagster-step-[xy]:

50m         Normal    SuccessfulCreate       job/dagster-step-[yx]       Created pod: dagster-step-[xyz]
50m         Normal    Scheduled              pod/dagster-step-[xyz]      Successfully assigned hermod/dagster-step-[xyz] to [xxx]
50m         Normal    Pulled                 pod/dagster-step-[xyz]      Successfully pulled image "[yyy]" in 621ms (621ms including waiting)
50m         Normal    Created                pod/dagster-step-[xyz]      Created container dagster
50m         Normal    Pulling                pod/dagster-step-[xyz]      Pulling image "[yyy]"
50m         Normal    Started                pod/dagster-step-[xyz]      Started container dagster
47m         Normal    Completed              job/dagster-step-[xy]

Suspected Cause

The issue seems to be related to the health check in the executor codebase at:


There appears to be a delay before this check is invoked, so Dagster still treats the step as running after it has already emitted the success event.

Suggestions to Fix

  1. Ensure Proper Step Termination: Modify the step code to explicitly exit after emitting the "STEP SUCCESS" event.
  2. Implement Robust Health Checks: Ensure health checks accurately reflect the step's status and handle termination properly.
  3. Enhance Logging: Add detailed logging around the status check in Dagster to better understand delays.
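Suggestion 2 can be sketched in plain Python (the event-log lookup and all names below are hypothetical, not the dagster-k8s API): before declaring a health-check failure for a missing Job, consult the step's already-emitted events and treat a recorded success as terminal rather than as a lost step.

```python
def resolve_missing_job(step_key: str, emitted_events: dict) -> str:
    """Decide what a 'Kubernetes job could not be found' observation means.

    If the step already emitted STEP_SUCCESS, the Job was most likely
    garbage-collected by ttlSecondsAfterFinished, so the step should be
    treated as finished rather than failed.
    """
    if "STEP_SUCCESS" in emitted_events.get(step_key, []):
        return "STEP_SUCCESS"
    return "STEP_FAILURE"


events = {
    "step_a": ["STEP_START", "STEP_SUCCESS"],  # succeeded, Job later TTL-deleted
    "step_b": ["STEP_START"],                  # vanished without succeeding
}
print(resolve_missing_job("step_a", events))  # STEP_SUCCESS: TTL cleanup, not a failure
print(resolve_missing_job("step_b", events))  # STEP_FAILURE: genuinely lost step
```

The design choice here is to make "Job not found" ambiguous by default and disambiguate it with the event log, which is cheaper than changing step termination behavior (suggestion 1) and keeps genuine pod losses failing as before.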

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@miloszbednarzak miloszbednarzak added the type: bug Something isn't working label Jun 15, 2024
@garethbrickman garethbrickman added the deployment: k8s Related to deploying Dagster to Kubernetes label Jun 17, 2024