Job probes #127
…ly in k8s versions >1.16
@luke-c-sargent This is great! We can finally have some confidence that the handlers are running as expected, so thanks for doing this. I agree with your comment that the readiness probe is not necessary, partly because, until we resolve the handler naming issue, a readiness probe might be dangerous: two handlers with the same name could come up and potentially handle requests at the same time.
…ly in k8s versions >1.16
827a5c4 to 8c89f82
Hello;
This PR seeks to add liveness / readiness checks to the job handler pods. Should this approach be deemed acceptable, I am fairly certain it can be extended to the workflow scheduler pod as well (and to the web handlers, for that matter, though the HTTP approach is probably a more relevant test there).
The approach is based on the observation that the `worker_process` database table contains heartbeat data for all Galaxy-associated processes (e.g., web, job, and workflow handlers). This heartbeat happens every 60 seconds and simply updates a timestamp. By checking the current time against this timestamp and ensuring that the difference is <= 60s, we can determine that the process is 'live.' Fortunately, Galaxy's dependencies include a Python library for communicating with Postgres DBs (psycopg2), which is used here to avoid extra dependencies.

This is added to the values files:
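Separately from the values-file snippet, the heartbeat comparison described above can be illustrated with a minimal Python sketch. It is not the code from this PR: the function names are hypothetical, and the column names `server_name` and `update_time` are assumptions about the `worker_process` schema. The query uses the `%s` placeholder style that psycopg2 cursors accept.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_fresh(last_update, now=None, max_age_seconds=60):
    """Return True if last_update is at most max_age_seconds old."""
    if last_update is None:
        # No heartbeat row found: treat the process as not live.
        return False
    if now is None:
        now = datetime.now(timezone.utc)
        if last_update.tzinfo is None:
            # Assumption: naive timestamps in the table are stored in UTC.
            now = now.replace(tzinfo=None)
    return (now - last_update) <= timedelta(seconds=max_age_seconds)


def latest_heartbeat(cursor, server_name):
    """Fetch the newest heartbeat timestamp for one handler.

    `cursor` is any DB-API cursor using %s placeholders (e.g. psycopg2).
    Column names are assumed, not taken from the PR.
    """
    cursor.execute(
        "SELECT update_time FROM worker_process "
        "WHERE server_name = %s ORDER BY update_time DESC LIMIT 1",
        (server_name,),
    )
    row = cursor.fetchone()
    return row[0] if row else None
```

Because the heartbeat interval and the threshold are both 60 seconds, `max_age_seconds` is left as a parameter in case some slack turns out to be useful.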
These initial defaults work for the time being but could probably be tuned. Ideally, we'd include startupProbes to tighten things up, but that requires k8s 1.16+.
These values are then used in the probes (e.g., liveness shown here):
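The probe definition itself is Helm/YAML. As an illustration of the script side only, here is a minimal sketch of what an exec-style probe command might run, relying on the Kubernetes convention that an exec probe succeeds when the command exits with status 0 and fails otherwise. The function names, the environment variables `PROBE_DB_DSN` and `PROBE_SERVER_NAME`, and the column names are all hypothetical and not taken from this PR.

```python
import os
import sys
from datetime import datetime, timedelta, timezone


def probe_exit_code(fetch_last_update, max_age_seconds=60, now=None):
    """Return 0 if the latest heartbeat is recent enough, otherwise 1."""
    try:
        last_update = fetch_last_update()
    except Exception:
        # Any database or connection error counts as a failed probe.
        return 1
    if last_update is None:
        return 1
    if now is None:
        now = datetime.now(timezone.utc)
        if last_update.tzinfo is None:
            # Assumption: naive timestamps in the table are stored in UTC.
            now = now.replace(tzinfo=None)
    age = now - last_update
    return 0 if age <= timedelta(seconds=max_age_seconds) else 1


def fetch_from_postgres():
    """Read the newest heartbeat via psycopg2 (hypothetical env vars)."""
    import psycopg2  # imported lazily so the logic above runs without it

    conn = psycopg2.connect(os.environ["PROBE_DB_DSN"])
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT max(update_time) FROM worker_process "
                "WHERE server_name = %s",
                (os.environ["PROBE_SERVER_NAME"],),
            )
            row = cur.fetchone()
            return row[0] if row else None
    finally:
        conn.close()


# As a probe command, a script would end with:
#     sys.exit(probe_exit_code(fetch_from_postgres))
```

Passing the fetch function in as an argument keeps the freshness logic checkable without a database; only `fetch_from_postgres` depends on psycopg2.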
Notes:
galaxy-helm, a startupProbe would really help clean up the other probes, which have to be resilient enough to be a meaningful check for a healthy container but must also tolerate the long startup time. With a startup probe, we can increase the specificity of the liveness probes.

Thanks for reading; please let me know what I can do to improve the quality of the submission. If this is a reasonable approach, my next steps (after implementing suggested changes) would be applying this approach to the other pods, where appropriate.