
Handle unschedulable jobs in k8s runner #10496

Merged · Apr 7, 2021 · 6 commits

Conversation

@almahmoud (Member) commented Oct 21, 2020

Addresses #10480

  • Adds k8s_unschedulable_walltime_limit, which specifies how long a job may wait while its pod is unschedulable.
  • Prevents the job from being marked as running in Galaxy when it has been dispatched but its pod is still unschedulable.
  • Marks the job as failed with a reasonable message, and cleans up the job, once it has been unschedulable past the specified limit.
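The new limit would presumably be set on the k8s destination in Galaxy's job configuration. A hypothetical job_conf fragment (the parameter name comes from this PR; the surrounding destination structure and the 1800-second value, matching the 30-minute default mentioned below, are assumptions):

```xml
<destinations default="k8s">
    <destination id="k8s" runner="k8s">
        <!-- Fail jobs whose pods remain unschedulable longer than
             30 minutes (assumed default from this PR's description) -->
        <param id="k8s_unschedulable_walltime_limit">1800</param>
    </destination>
</destinations>
```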

Testing everything together:

  • Job is not marked as running if the pod is unschedulable
  • Specified time limit is respected (tested with 30 seconds; the default is 30 minutes)
  • Job error is properly reported
  • Job resource is cleaned up in k8s
  • Schedulable jobs still run unaffected

@almahmoud almahmoud changed the title Handle unschedulable jobs Handle unschedulable jobs in k8s runner Oct 21, 2020
@galaxybot galaxybot added this to the 21.01 milestone Oct 21, 2020
@@ -515,6 +525,20 @@ def check_watched_item(self, job_state):
                # job is no longer viable - remove from watched jobs
                return None

    def _handle_unschedulable_job(self, job, job_state):
        # Handle unschedulable job that exceeded deadline
        with open(job_state.error_file, 'a') as error_file:
Member
I think it would be good to guard against the condition where this file doesn't exist, because I've seen several instances (in other places in the runner) where writing to the error file fails.
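The guard the reviewer asks for could look like the following sketch. The helper name `append_error_message` is hypothetical (not Galaxy's API); the point is that `open(..., 'a')` creates a missing file but not a missing parent directory, and that the write should be best-effort:

```python
import os

def append_error_message(error_file_path, message):
    """Append a message to the job's error file, guarding against the
    file or its parent directory being missing (hypothetical helper)."""
    try:
        parent = os.path.dirname(error_file_path)
        if parent:
            # 'a' mode creates the file if absent, but not its directory
            os.makedirs(parent, exist_ok=True)
        with open(error_file_path, 'a') as error_file:
            error_file.write(message)
        return True
    except OSError:
        # Best-effort: the failure reason should still reach the user
        # through the job state even if this write fails
        return False
```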

Member
What do we need this file for ? Can we eliminate this ? This was used in older traditional HPC runners that would kill a job and then write something to a specific file, but kubernetes doesn't do that, and we can write to job_state.fail_message. I would assume that ends up on the job_stderr column of the job table.

@almahmoud (Member, Author) commented Mar 17, 2021
I was just following how failures are handled in other places, e.g. https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/runners/kubernetes.py#L521 . Can we change this as part of moving to the Pulsar runner? It's not really worth the time sink now. Alternatively, I can just remove the file and set the fail message.
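The suggested alternative (recording the reason on the job state rather than in an on-disk error file) can be sketched with a minimal stand-in class. `JobState` and `fail_unschedulable` here are hypothetical stubs for illustration, not Galaxy's `AsynchronousJobState` or runner API; only the `fail_message` and `running` fields come from the discussion:

```python
from dataclasses import dataclass

@dataclass
class JobState:
    # Hypothetical stub modeling only the fields this discussion touches
    job_id: str
    fail_message: str = ""
    running: bool = False

def fail_unschedulable(job_state, limit_seconds):
    # Record the failure reason directly on the job state instead of
    # appending it to an error file on disk
    job_state.fail_message = (
        f"Pod for job {job_state.job_id} was unschedulable for more than "
        f"{limit_seconds} seconds and the job was terminated"
    )
    job_state.running = False
    return job_state
```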

@mvdbeek mvdbeek merged commit ba88c91 into galaxyproject:dev Apr 7, 2021
@mvdbeek (Member) commented Apr 7, 2021

Thanks @almahmoud!

5 participants