Add backoffLimit to DaskJobs #695

Open
eddienko opened this issue Mar 31, 2023 · 1 comment · May be fixed by #745


eddienko commented Mar 31, 2023

Would it be possible to add backoffLimit to DaskJobs? Kubernetes Jobs have this field so that the Job is reported as failed only if the Pod fails a certain number of times (see below). Could we add this to DaskJobs as well? I have been using this field in Jobs because Dask sometimes "just hangs/crashes" in very long jobs, and restarting the job fixes that.

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

jacobtomlinson (Member) commented

I agree that this would be a good improvement. Perhaps instead of making the DaskJob behave the same way as Job we should replace the internal Pod in the DaskJob with a Job so that we can leverage the existing functionality.
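
For anyone skimming the thread, here is a rough Python sketch of the retry behaviour being asked for. It is only a simplified model, not how Kubernetes or dask-kubernetes implements it: the function name `run_with_backoff_limit` and the returned dict are made up for illustration, the exact failure-counting rule is an assumption, and the exponential delay between retries is ignored. The default of 6 mirrors the Kubernetes default for `backoffLimit`.

```python
from typing import Callable


def run_with_backoff_limit(attempt: Callable[[], bool], backoff_limit: int = 6) -> dict:
    """Re-run `attempt` until it succeeds or failures exceed `backoff_limit`.

    Hypothetical helper for illustration only. It models a single-Pod Job
    with restartPolicy: Never and skips the delay between retries.
    """
    failures = 0
    while True:
        if attempt():
            return {"status": "Complete", "failed": failures}
        failures += 1
        # Assumed rule for this sketch: report failure once the number of
        # failed attempts exceeds the limit.
        if failures > backoff_limit:
            return {"status": "Failed", "failed": failures}


# A job that crashes twice and then succeeds is still reported as complete.
outcomes = iter([False, False, True])
print(run_with_backoff_limit(lambda: next(outcomes), backoff_limit=4))
```

With `backoff_limit=0` the first crash is reported as a failure straight away, which is effectively how the job behaves without a retry budget.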
