Handle unschedulable jobs in k8s runner #10496
@@ -515,6 +525,20 @@ def check_watched_item(self, job_state):
        # job is no longer viable - remove from watched jobs
        return None

    def _handle_unschedulable_job(self, job, job_state):
        # Handle unschedulable job that exceeded deadline
        with open(job_state.error_file, 'a') as error_file:
I think it would be good to guard against the condition where this file doesn't exist, because I've seen several instances (in other places in the runner) where writing to the error file fails.
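One way to guard the write could look like the sketch below. It is only an illustration: `append_error_message` is a hypothetical helper, not something in the runner. Note that opening a file in append mode creates it if it is missing, so the failures seen in practice are more likely a missing parent directory or a permissions problem; both raise `OSError`, which the sketch catches.

```python
def append_error_message(error_file_path, message):
    """Append message to the error file.

    Hypothetical helper (not part of the runner). Returns True on success
    and False if the file cannot be opened or written, instead of raising.
    """
    try:
        # Mode 'a' creates the file when it does not exist, but still fails
        # if the parent directory is missing or not writable.
        with open(error_file_path, "a") as error_file:
            error_file.write(message)
        return True
    except OSError:
        # Covers FileNotFoundError, PermissionError, and similar errors.
        return False
```

The caller can then decide what to do when the write fails, for example log a warning and carry on marking the job as failed.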
What do we need this file for? Can we eliminate it? It was used by older, traditional HPC runners that would kill a job and then write something to a specific file, but Kubernetes doesn't do that, and we can write to job_state.fail_message instead. I would assume that ends up in the job_stderr column of the job table.
I was just following what is done to handle failures in other places, e.g. https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/runners/kubernetes.py#L521 . Can we just change this as part of moving to the pulsar runner? Not really worth the time sink now? Alternatively, I can just remove the file and use the fail message.
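For reference, the "remove the file and do fail message" option could look roughly like the sketch below. Everything in it is an assumption made for illustration: `SketchJobState` is a stand-in for the runner's job state object, `mark_as_failed` is passed in as a plain callback, and the message text is made up. It is not the code in this PR.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SketchJobState:
    """Hypothetical stand-in for the runner's job state object."""
    job_id: str
    running: bool = True
    fail_message: Optional[str] = None


def handle_unschedulable_job(
    job_state: SketchJobState,
    mark_as_failed: Callable[[SketchJobState], None],
) -> None:
    """Fail a job whose pods stayed unschedulable past the time limit.

    No error file is written; the reason is recorded only on the job state.
    """
    # Assumed message text, for illustration only.
    job_state.fail_message = (
        "Job was unschedulable for longer than the configured time limit"
    )
    job_state.running = False
    mark_as_failed(job_state)
    # Returning None mirrors the diff above: the job is no longer viable
    # and should be dropped from the watched jobs.
    return None
```

This avoids the file I/O failure mode entirely, at the cost of diverging from how the other failure paths in the runner currently report errors.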
Thanks @almahmoud!
Addresses #10480
k8s_unschedulable_walltime_limit: specifies the time limit for a job to wait while pods are unschedulable.
Testing everything together: