
Job probes #127

Merged — 52 commits, Apr 1, 2020
Conversation

luke-c-sargent (Member) commented:
Hello;

This PR adds liveness/readiness checks to the job handler pods. Should this approach be deemed acceptable, I am fairly certain it can be extended to the workflow scheduler pod as well (and the web handlers, for that matter, though an HTTP check is probably the more relevant test there).

The approach is based on the observation that the worker_process database table contains heartbeat data for all Galaxy-associated processes (e.g., web, job, and workflow handlers); each process updates its timestamp every 60 seconds. By comparing the current time against this timestamp and ensuring the difference is <= 60s, we can determine that the process is 'live.' Fortunately, Galaxy's dependencies already include a Python library for communicating with PostgreSQL databases (psycopg2), which is used here to avoid adding new dependencies.
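The heartbeat check can be sketched roughly as below. This is an illustrative reconstruction, not the actual probedb.py from this PR; the worker_process column names (server_name, update_time) and the -e/-o/-i flags are assumptions based on the description and the probe command shown later.

```python
import sys
from datetime import datetime, timedelta


def is_live(last_heartbeat: datetime, now: datetime, interval_seconds: int) -> bool:
    """A heartbeat is considered fresh if it is no older than the allowed interval."""
    return (now - last_heartbeat) <= timedelta(seconds=interval_seconds)


def check_worker_process(dsn: str, server_name: str, interval_seconds: int) -> bool:
    """Query worker_process and verify this handler's heartbeat is fresh.

    Returns True when the process looks live. psycopg2 is already a Galaxy
    dependency, so no extra packages are needed.
    """
    import psycopg2

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # NOW() comes from the database so clock skew between the pod
            # and the DB server does not matter.
            cur.execute(
                "SELECT update_time, NOW() FROM worker_process WHERE server_name = %s",
                (server_name,),
            )
            row = cur.fetchone()
            if row is None:
                return False  # no heartbeat row for this handler yet
            update_time, now = row
            return is_live(update_time, now, interval_seconds)
    finally:
        conn.close()


if __name__ == "__main__" and len(sys.argv) > 1:
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("-e", "--engine", required=True, help="database connection string")
    parser.add_argument("-o", "--owner", required=True, help="handler/pod name")
    parser.add_argument("-i", "--interval", type=int, default=60)
    args = parser.parse_args()
    # Exit 0 (probe success) when live, 1 (probe failure) otherwise.
    sys.exit(0 if check_worker_process(args.engine, args.owner, args.interval) else 1)
```

Keeping the freshness comparison in a small pure function (is_live) separate from the database query makes the probe logic trivially unit-testable without a running PostgreSQL instance.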

This is added to the values files:

  # Probe variables
  readinessProbe:
    interval: 60
    initialDelaySeconds: 60
    periodSeconds: 60
    failureThreshold: 10
    timeoutSeconds: 3
  livenessProbe:
    interval: 60
    initialDelaySeconds: 60
    periodSeconds: 60
    failureThreshold: 10
    timeoutSeconds: 3

These initial defaults work for the time being, but could probably be tuned further. Ideally, we would also add startupProbes to tighten things up, but that requires Kubernetes 1.16+.

These values are then used in the probes (e.g., the liveness probe shown here):

          livenessProbe:
            exec:
              command: [
                'sh', '-c',
                'python /tmp/probedb.py -e $GALAXY_CONFIG_OVERRIDE_DATABASE_CONNECTION -o $POD_NAME -i {{ $.Values.jobHandlers.livenessProbe.interval }}'
              ]
            initialDelaySeconds: {{ $.Values.jobHandlers.livenessProbe.initialDelaySeconds }}
            periodSeconds: {{ $.Values.jobHandlers.livenessProbe.periodSeconds }}
            failureThreshold: {{ $.Values.jobHandlers.livenessProbe.failureThreshold }}
            timeoutSeconds: {{ $.Values.jobHandlers.livenessProbe.timeoutSeconds }}

Notes:

  • In the case of the job handler pod (and possibly the workflow handler pod), communication happens entirely through a shared database. Since the pod accepts no traditional traffic, is a readiness probe even appropriate here? Being live but not 'ready' (i.e., available to work but not accepting traffic) is not detrimental, because the pod only needs database access, which the liveness probe already checks.
  • If/when Kubernetes 1.16+ is available on GKE (etc.) for use with galaxy-helm, a startupProbe would really help clean up the other probes, which currently have to be meaningful checks for a healthy container while also tolerating the long startup time. With a startup probe, we could increase the specificity of the liveness probes.
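For illustration, once Kubernetes 1.16+ is available, a startupProbe might look something like the following sketch. The threshold numbers here are hypothetical, chosen only to show the pattern: the startup probe alone absorbs the slow boot (up to failureThreshold × periodSeconds), after which the liveness probe can use much tighter settings.

```yaml
          startupProbe:
            exec:
              command: [
                'sh', '-c',
                'python /tmp/probedb.py -e $GALAXY_CONFIG_OVERRIDE_DATABASE_CONNECTION -o $POD_NAME -i {{ $.Values.jobHandlers.startupProbe.interval }}'
              ]
            # Allow up to 30 * 10s = 300s for the handler to come up;
            # liveness/readiness probes only start once this succeeds.
            failureThreshold: 30
            periodSeconds: 10
```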

Thanks for reading; please let me know what I can do to improve the quality of the submission. If this is a reasonable approach, my next steps (after implementing suggested changes) would be applying it to the other pods, where appropriate.

nuwang (Member) commented:
@luke-c-sargent This is great! We can finally have some confidence that the handlers are running as expected, so thanks for doing this. Agree with your comment that the readiness probe is not necessary. Partly because until we resolve the handler naming issue, it might be dangerous to have a readiness probe, since 2 handlers with the same name could come up and be potentially handling requests at the same time.

Review threads on galaxy/scripts/probedb.py — resolved.
3 participants