
Handle unschedulable jobs in k8s runner #10496

Merged · Apr 7, 2021 · 6 commits

Conversation

@almahmoud (Member) commented Oct 21, 2020

Addresses #10480

  • Adds k8s_unschedulable_walltime_limit, which specifies how long a job may wait while its pod is unschedulable.
  • Prevents the job from being marked as running in Galaxy when it has been dispatched but its pod is still unschedulable.
  • Marks the job as failed with a reasonable message, and cleans up the job, once it has been unschedulable past the specified limit.
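The new limit would presumably be set on the k8s destination in Galaxy's job configuration. A hypothetical job_conf fragment (the parameter name comes from this PR; the surrounding destination structure and the 1800-second value, matching the 30-minute default mentioned below, are assumptions):

```xml
<destinations default="k8s">
    <destination id="k8s" runner="k8s">
        <!-- Fail jobs whose pods remain unschedulable longer than
             30 minutes (assumed default from this PR's description) -->
        <param id="k8s_unschedulable_walltime_limit">1800</param>
    </destination>
</destinations>
```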

Testing everything together:

  • Job is not marked as running if the pod is unschedulable
  • Specified time limit is respected (tested with 30 seconds; the default is 30 minutes)
  • Job error is properly reported
  • Job resource is cleaned up in k8s
  • Schedulable jobs still run unaffected

@almahmoud almahmoud changed the title Handle unschedulable jobs Handle unschedulable jobs in k8s runner Oct 21, 2020
@galaxybot galaxybot added this to the 21.01 milestone Oct 21, 2020
@@ -515,6 +525,20 @@ def check_watched_item(self, job_state):
                # job is no longer viable - remove from watched jobs
                return None

    def _handle_unschedulable_job(self, job, job_state):
        # Handle unschedulable job that exceeded deadline
        with open(job_state.error_file, 'a') as error_file:
Member
I think it would be good to guard against the condition where this file doesn't exist, because I've seen several instances (in other places in the runner) where writing to the error file fails.
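The guard the reviewer asks for could look like the following sketch. The helper name `append_error_message` is hypothetical (not Galaxy's API); the point is that `open(..., 'a')` creates a missing file but not a missing parent directory, and that the write should be best-effort:

```python
import os

def append_error_message(error_file_path, message):
    """Append a message to the job's error file, guarding against the
    file or its parent directory being missing (hypothetical helper)."""
    try:
        parent = os.path.dirname(error_file_path)
        if parent:
            # 'a' mode creates the file if absent, but not its directory
            os.makedirs(parent, exist_ok=True)
        with open(error_file_path, 'a') as error_file:
            error_file.write(message)
        return True
    except OSError:
        # Best-effort: the failure reason should still reach the user
        # through the job state even if this write fails
        return False
```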

Member
What do we need this file for ? Can we eliminate this ? This was used in older traditional HPC runners that would kill a job and then write something to a specific file, but kubernetes doesn't do that, and we can write to job_state.fail_message. I would assume that ends up on the job_stderr column of the job table.

@almahmoud (Member, Author) commented Mar 17, 2021
I was just following how failures are handled in other places, e.g. https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/runners/kubernetes.py#L521 . Can we change this as part of moving to the Pulsar runner? It's not really worth the time sink now. Alternatively, I can just remove the file and set the fail message.
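The suggested alternative (recording the reason on the job state rather than in an on-disk error file) can be sketched with a minimal stand-in class. `JobState` and `fail_unschedulable` here are hypothetical stubs for illustration, not Galaxy's `AsynchronousJobState` or runner API; only the `fail_message` and `running` fields come from the discussion:

```python
from dataclasses import dataclass

@dataclass
class JobState:
    # Hypothetical stub modeling only the fields this discussion touches
    job_id: str
    fail_message: str = ""
    running: bool = False

def fail_unschedulable(job_state, limit_seconds):
    # Record the failure reason directly on the job state instead of
    # appending it to an error file on disk
    job_state.fail_message = (
        f"Pod for job {job_state.job_id} was unschedulable for more than "
        f"{limit_seconds} seconds and the job was terminated"
    )
    job_state.running = False
    return job_state
```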

@mvdbeek mvdbeek merged commit ba88c91 into galaxyproject:dev Apr 7, 2021
@mvdbeek (Member) commented Apr 7, 2021

Thanks @almahmoud!

5 participants