
Adding health checks accessible via API #9180

Open
luke-c-sargent opened this issue Jan 6, 2020 · 3 comments

Comments

@luke-c-sargent
Member

hello,

tl;dr: i want to add a health check API endpoint; what is the best way to determine if web/job handlers are performing as desired, and how verbose should this output be?


i would like to add health check functionality to Galaxy for web and job handlers, and was looking to implement something similar to what is described in this IETF proposal draft, whose schema could look like this:

status: <pass/fail/warn>
notes: "..."
checks:
  web:
    - componentId: <id>
      componentType: webhandler
      status: <pass/fail/warn>
      notes: "..."
    - <..other web handlers..>
  job:
    - componentId: <id>
      componentType: jobhandler
      status: <pass/fail/warn>
      notes: "..."
    - <..other job handlers..>

roughly, we get an overall status, and further details are provided in the 'checks' field, which contains lists of responses from individual systems, grouped by component type. these checks can be more (or less) granular as desired.

this effort was started for kubernetes reasons (readiness / liveness probes) but could be broadly useful to sysadmins or extra-nerdy users curious about server status.

my naive approach has been to check the app member of the trans object:

  • for web: check application_stack.workers(), iterate through them, and accept an idle or busy status.
  • for job: check job_manager.job_handler.dispatcher.job_runners and ensure each runner has workers associated with it (nworkers > 0).

i am sure there are many caveats and oversights in the above (e.g., many runner/handler types), so i would love to hear from my learned colleagues. thanks for reading!
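roughly, the check logic i have in mind looks something like the sketch below; this is a sketch only, assuming the attributes named above exist on trans.app and that each stack worker is a dict with 'id' and 'status' keys (a guess on my part, not verified against every stack or runner type):

    def health_checks(trans):
        """Hypothetical sketch of the checks described above; names may differ."""
        checks = {"web": [], "job": []}

        # web: each application stack worker should report an idle or busy state
        for worker in trans.app.application_stack.workers():
            ok = worker.get("status") in ("idle", "busy")
            checks["web"].append({
                "componentType": "webhandler",
                "componentId": worker.get("id"),
                "status": "pass" if ok else "fail",
            })

        # job: each loaded runner should have at least one worker thread (nworkers > 0)
        dispatcher = trans.app.job_manager.job_handler.dispatcher
        for name, runner in dispatcher.job_runners.items():
            ok = getattr(runner, "nworkers", 0) > 0
            checks["job"].append({
                "componentType": name,
                "status": "pass" if ok else "fail",
            })

        # overall status is 'pass' only if every individual check passes
        all_pass = all(c["status"] == "pass" for group in checks.values() for c in group)
        return {"status": "pass" if all_pass else "fail", "notes": "", "checks": checks}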

@luke-c-sargent
Member Author

to expand, these are the additional routes added:

    webapp.mapper.connect('healthcheck',
                          '/api/healthcheck/',
                          controller='healthcheck',
                          action='get',
                          conditions=dict(method=["GET"]))

    webapp.mapper.connect('healthcheck_web',
                          '/api/healthcheck/web',
                          controller='healthcheck',
                          action='get_web',
                          conditions=dict(method=["GET"]))

    webapp.mapper.connect('healthcheck_job',
                          '/api/healthcheck/job',
                          controller='healthcheck',
                          action='get_job',
                          conditions=dict(method=["GET"]))

and sample output from my dev instance looks like this:

{
  "status": "pass",
  "notes": "",
  "checks": {
    "web": [
      {"componentType": "uWSGIworker", "status": "pass", "componentId": 1},
      {"componentType": "uWSGIworker", "status": "pass", "componentId": 2}
    ],
    "job": [
      {"componentType": "LocalRunner", "status": "pass"}
    ]
  }
}

note that the above is for the api/healthcheck endpoint; requesting api/healthcheck/<web/job> will give you the list of components under that heading in 'checks'.
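in other words, the sub-endpoints just return the corresponding slice of the same payload; roughly (not the exact controller code from the PR, and reusing the hypothetical health_checks() sketch from my first comment):

    # illustrative only: each action returns the relevant slice of the report
    def get(trans):      # GET /api/healthcheck/
        return health_checks(trans)

    def get_web(trans):  # GET /api/healthcheck/web
        return health_checks(trans)["checks"]["web"]

    def get_job(trans):  # GET /api/healthcheck/job
        return health_checks(trans)["checks"]["job"]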

@nuwang
Member

nuwang commented Feb 1, 2020

@luke-c-sargent Really like the idea of basing the schema off the IETF draft, and it's great to see the PR with the initial implementation too. I think this issue is related to several others, so I'm hoping we can get more input here and make the design compatible with those issues as well. Namely:

  1. Kubernetes health checks - judging from the PR, this health check endpoint is on the web API. What are the implications from the k8s side, in particular for job handlers? Does this mean that the job handler's liveness probe would have to query the web API to determine the liveness of the job handler? I feel that the job handler's liveness and readiness probes should be self-contained, and not dependent on the liveness or readiness of another container.
  2. Job handler reassignment - Currently, if a job handler dies, and a new job handler comes up, the jobs previously assigned to the dead handler become orphaned (unless the new handler happens to have the same exact name, which is a challenge if handlers scale up and down on demand and thus won't have the same name). There is no automatic process for reassigning such jobs to a new handler and @natefoo mentioned that he was hoping to implement a solution for this. I'm thinking that solution would have to tie into the implementation of health checks themselves, since only jobs belonging to dead handlers should be reassigned. Hoping @natefoo can comment.
  3. Celery implementation - I remember seeing @mvdbeek doing some preliminary work on a celery based implementation of what I thought were job handlers. I really like the idea of using celery to manage Galaxy job handlers because it could potentially decouple the communication implementation between the web handler and the job handlers, and thus improve scalability + reduce complexity. I think it may also just obviate issue 2. Hoping @mvdbeek can chime in on this.

@jmchilton
Member

Included in galaxyproject/galaxy-helm#127

Is there any plan to merge this back upstream into Galaxy? I suspect it would be good to have something like this in the code and apologies if the PR was abandoned because of committer inaction 😢. I told myself several times I was going to review the PR and never did.
