
Adding health checks accessible via API #9180

Open
luke-c-sargent opened this issue Jan 6, 2020 · 3 comments

Comments

@luke-c-sargent
Member

hello,

tl;dr: i want to add a health check API endpoint; what is the best way to determine if web/job handlers are performing as desired, and how verbose should this output be?


i would like to add health check functionality to Galaxy for web and job handlers, and was looking to implement something similar to what is described in this IETF proposal draft, whose schema could look like this:

status: <pass/fail/warn>
notes: "..."
checks:
  web:
    - componentId: <id>
      componentType: webhandler
      status: <pass/fail/warn>
      notes: "..."
    - <..other web handlers..>
  job:
    - componentId: <id>
      componentType: jobhandler
      status: <pass/fail/warn>
      notes: "..."
    - <..other job handlers..>

roughly, we get an overall status, and further details are provided in the 'checks' field, which contains lists of responses from individual systems, grouped by component type. these checks can be more (or less) granular as desired.

this effort was started for kubernetes reasons (readiness / liveness probes) but could be broadly useful to sysadmins or extra-nerdy users curious about server status.

my naive approach has been to check the app member of the trans object:

  • for web: check application_stack.workers(), iterate through them, and accept an idle or busy status.
  • for job: check job_manager.job_handler.dispatcher.job_runners and ensure each runner has workers associated with it (nworkers > 0).

i am sure there are many caveats and oversights in the above (e.g., many runner/handler types), so i would love to hear from my learned colleagues. thanks for reading!
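roughly, the check logic i have in mind looks something like the sketch below; this is a sketch only, assuming the attributes named above exist on trans.app and that each stack worker is a dict with 'id' and 'status' keys (a guess on my part, not verified against every stack or runner type):

    def health_checks(trans):
        """Hypothetical sketch of the checks described above; names may differ."""
        checks = {"web": [], "job": []}

        # web: each application stack worker should report an idle or busy state
        for worker in trans.app.application_stack.workers():
            ok = worker.get("status") in ("idle", "busy")
            checks["web"].append({
                "componentType": "webhandler",
                "componentId": worker.get("id"),
                "status": "pass" if ok else "fail",
            })

        # job: each loaded runner should have at least one worker thread (nworkers > 0)
        dispatcher = trans.app.job_manager.job_handler.dispatcher
        for name, runner in dispatcher.job_runners.items():
            ok = getattr(runner, "nworkers", 0) > 0
            checks["job"].append({
                "componentType": name,
                "status": "pass" if ok else "fail",
            })

        # overall status is 'pass' only if every individual check passes
        all_pass = all(c["status"] == "pass" for group in checks.values() for c in group)
        return {"status": "pass" if all_pass else "fail", "notes": "", "checks": checks}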

@luke-c-sargent
Member Author

to expand, these are the additional routes added:

    webapp.mapper.connect('healthcheck',
                          '/api/healthcheck/',
                          controller='healthcheck',
                          action='get',
                          conditions=dict(method=["GET"]))

    webapp.mapper.connect('healthcheck_web',
                          '/api/healthcheck/web',
                          controller='healthcheck',
                          action='get_web',
                          conditions=dict(method=["GET"]))

    webapp.mapper.connect('healthcheck_job',
                          '/api/healthcheck/job',
                          controller='healthcheck',
                          action='get_job',
                          conditions=dict(method=["GET"]))

and sample output from my dev instance looks like this:

{
  "status": "pass",
  "notes": "",
  "checks": {
    "web": [
      {"componentType": "uWSGIworker", "status": "pass", "componentId": 1},
      {"componentType": "uWSGIworker", "status": "pass", "componentId": 2}
    ],
    "job": [
      {"componentType": "LocalRunner", "status": "pass"}
    ]
  }
}

note that the above is for the api/healthcheck endpoint; requesting api/healthcheck/<web/job> will give you the list of components under that heading in 'checks'.
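in other words, the sub-endpoints just return the corresponding slice of the same payload; roughly (not the exact controller code from the PR, and reusing the hypothetical health_checks() sketch from my first comment):

    # illustrative only: each action returns the relevant slice of the report
    def get(trans):      # GET /api/healthcheck/
        return health_checks(trans)

    def get_web(trans):  # GET /api/healthcheck/web
        return health_checks(trans)["checks"]["web"]

    def get_job(trans):  # GET /api/healthcheck/job
        return health_checks(trans)["checks"]["job"]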

@nuwang
Member

nuwang commented Feb 1, 2020

@luke-c-sargent Really like the idea of basing the schema off the IETF draft, and it's great to see the PR with the initial implementation too. I think this issue is related to several others, so I'm hoping we can get more input here and make the design compatible with those issues as well. Namely:

  1. Kubernetes health checks - judging from the PR, this health check endpoint is on the web API. What are the implications from the k8s side, in particular for job handlers? Does this mean that the job handler's liveness probe would have to query the web API to determine the liveness of the job handler? I feel that the job handler's liveness and readiness probes should be self-contained, and not dependent on the liveness or readiness of another container.
  2. Job handler reassignment - Currently, if a job handler dies, and a new job handler comes up, the jobs previously assigned to the dead handler become orphaned (unless the new handler happens to have the same exact name, which is a challenge if handlers scale up and down on demand and thus won't have the same name). There is no automatic process for reassigning such jobs to a new handler and @natefoo mentioned that he was hoping to implement a solution for this. I'm thinking that solution would have to tie into the implementation of health checks themselves, since only jobs belonging to dead handlers should be reassigned. Hoping @natefoo can comment.
  3. Celery implementation - I remember seeing @mvdbeek doing some preliminary work on a celery based implementation of what I thought were job handlers. I really like the idea of using celery to manage Galaxy job handlers because it could potentially decouple the communication implementation between the web handler and the job handlers, and thus improve scalability + reduce complexity. I think it may also just obviate issue 2. Hoping @mvdbeek can chime in on this.

@jmchilton
Member

Included in galaxyproject/galaxy-helm#127

Is there any plan to merge this back upstream into Galaxy? I suspect it would be good to have something like this in the code and apologies if the PR was abandoned because of committer inaction 😢. I told myself several times I was going to review the PR and never did.
