
Job probes #127

Merged — 52 commits, Apr 1, 2020
Conversation

luke-c-sargent (Member) commented:
Hello;

This PR adds liveness/readiness checks to the job handler pods. Should this approach be deemed acceptable, I am fairly certain it can be extended to the workflow scheduler pod as well (and the web handlers, for that matter, though an HTTP check is probably the more relevant test there).

The approach is based on the observation that the worker_process database table contains heartbeat data for all Galaxy-associated processes (e.g., web, job, and workflow handlers); each process updates its timestamp every 60 seconds. By comparing the current time against this timestamp and ensuring the difference is <= 60s, we can determine that the process is 'live.' Fortunately, Galaxy's dependencies already include a Python library for communicating with PostgreSQL databases (psycopg2), which is used here to avoid adding new dependencies.
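The heartbeat check can be sketched roughly as below. This is an illustrative reconstruction, not the actual probedb.py from this PR; the worker_process column names (server_name, update_time) and the -e/-o/-i flags are assumptions based on the description and the probe command shown later.

```python
import sys
from datetime import datetime, timedelta


def is_live(last_heartbeat: datetime, now: datetime, interval_seconds: int) -> bool:
    """A heartbeat is considered fresh if it is no older than the allowed interval."""
    return (now - last_heartbeat) <= timedelta(seconds=interval_seconds)


def check_worker_process(dsn: str, server_name: str, interval_seconds: int) -> bool:
    """Query worker_process and verify this handler's heartbeat is fresh.

    Returns True when the process looks live. psycopg2 is already a Galaxy
    dependency, so no extra packages are needed.
    """
    import psycopg2

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # NOW() comes from the database so clock skew between the pod
            # and the DB server does not matter.
            cur.execute(
                "SELECT update_time, NOW() FROM worker_process WHERE server_name = %s",
                (server_name,),
            )
            row = cur.fetchone()
            if row is None:
                return False  # no heartbeat row for this handler yet
            update_time, now = row
            return is_live(update_time, now, interval_seconds)
    finally:
        conn.close()


if __name__ == "__main__" and len(sys.argv) > 1:
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("-e", "--engine", required=True, help="database connection string")
    parser.add_argument("-o", "--owner", required=True, help="handler/pod name")
    parser.add_argument("-i", "--interval", type=int, default=60)
    args = parser.parse_args()
    # Exit 0 (probe success) when live, 1 (probe failure) otherwise.
    sys.exit(0 if check_worker_process(args.engine, args.owner, args.interval) else 1)
```

Keeping the freshness comparison in a small pure function (is_live) separate from the database query makes the probe logic trivially unit-testable without a running PostgreSQL instance.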

This is added to the values files:

  # Probe variables
  readinessProbe:
    interval: 60
    initialDelaySeconds: 60
    periodSeconds: 60
    failureThreshold: 10
    timeoutSeconds: 3
  livenessProbe:
    interval: 60
    initialDelaySeconds: 60
    periodSeconds: 60
    failureThreshold: 10
    timeoutSeconds: 3

These initial defaults work for the time being, but could probably be tuned further. Ideally, we would also add startupProbes to tighten things up, but that requires Kubernetes 1.16+.

These values are then used in the probes (e.g., the liveness probe shown here):

          livenessProbe:
            exec:
              command: [
                'sh', '-c',
                'python /tmp/probedb.py -e $GALAXY_CONFIG_OVERRIDE_DATABASE_CONNECTION -o $POD_NAME -i {{ $.Values.jobHandlers.livenessProbe.interval }}'
              ]
            initialDelaySeconds: {{ $.Values.jobHandlers.livenessProbe.initialDelaySeconds }}
            periodSeconds: {{ $.Values.jobHandlers.livenessProbe.periodSeconds }}
            failureThreshold: {{ $.Values.jobHandlers.livenessProbe.failureThreshold }}
            timeoutSeconds: {{ $.Values.jobHandlers.livenessProbe.timeoutSeconds }}

Notes:

  • In the case of the job handler pod (and possibly the workflow handler pod), communication happens entirely through a shared database. Since the pod accepts no traditional traffic, is a readiness probe even appropriate here? Being live but not 'ready' (i.e., available to work but not accepting traffic) is not detrimental, because the pod only needs database access, which the liveness probe already checks.
  • If/when Kubernetes 1.16+ is available on GKE (etc.) for use with galaxy-helm, a startupProbe would really help clean up the other probes, which currently have to be meaningful checks for a healthy container while also tolerating the long startup time. With a startup probe, we could increase the specificity of the liveness probes.
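For illustration, once Kubernetes 1.16+ is available, a startupProbe might look something like the following sketch. The threshold numbers here are hypothetical, chosen only to show the pattern: the startup probe alone absorbs the slow boot (up to failureThreshold × periodSeconds), after which the liveness probe can use much tighter settings.

```yaml
          startupProbe:
            exec:
              command: [
                'sh', '-c',
                'python /tmp/probedb.py -e $GALAXY_CONFIG_OVERRIDE_DATABASE_CONNECTION -o $POD_NAME -i {{ $.Values.jobHandlers.startupProbe.interval }}'
              ]
            # Allow up to 30 * 10s = 300s for the handler to come up;
            # liveness/readiness probes only start once this succeeds.
            failureThreshold: 30
            periodSeconds: 10
```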

Thanks for reading; please let me know what I can do to improve the quality of the submission. If this is a reasonable approach, my next steps (after implementing suggested changes) would be applying it to the other pods, where appropriate.

nuwang (Member) commented:
@luke-c-sargent This is great! We can finally have some confidence that the handlers are running as expected, so thanks for doing this. Agree with your comment that the readiness probe is not necessary. Partly because until we resolve the handler naming issue, it might be dangerous to have a readiness probe, since 2 handlers with the same name could come up and be potentially handling requests at the same time.

Review threads on galaxy/scripts/probedb.py — resolved.
3 participants