Restart handler strategy behaviour #231

Merged: 8 commits merged into main on May 16, 2024
Conversation

@guidograzioli (Member) commented May 15, 2024

Builds on and supersedes #230; a directory of restart strategies is provided, with three implementations: none (nothing is restarted), serial (restart hosts one at a time), and serial_then_parallel (restart the first node, verify its health URL, then proceed with the rest in parallel). The restart health check can be mixed and matched with the wait_for parameters.

The default strategy file, imported from the restart handler, is 'serial_then_parallel'; users can keep that, choose one of the others, or specify a custom task file (at a path relative to the calling playbook).

  • serial is the only viable restart behaviour for keycloak with the default embedded infinispan caches (distributed caches with owners = 2 imply a single point of failure).
  • serial_then_parallel is for the case of remote infinispan caches, where liveness of the first restarted node allows the other keycloak nodes to be restarted in parallel.

The molecule scenario quarkus_ha tests a cluster of two keycloak instances with a shared postgresql instance, restarted with the serial strategy.
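For illustration, a minimal serial strategy task file might look like the sketch below (structure inferred from the diff excerpts quoted later in this review, not the exact merged code):

    # Serial strategy sketch: delegate the shared restart.yml to one host at a time
    - name: Restart hosts one at a time
      ansible.builtin.include_tasks: ../restart.yml
      delegate_to: "{{ item }}"
      run_once: true
      loop: "{{ ansible_play_hosts }}"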

New parameters:

| Variable | Description | Default |
| --- | --- | --- |
| keycloak_quarkus_restart_strategy | Strategy task file for restarting in HA (one of the provided restart/['serial.yml', 'none.yml', 'serial_then_parallel.yml']) or a path to a custom strategy task file | restart/serial.yml |
| keycloak_quarkus_restart_health_check | Whether to wait for a successful health check after restart | {{ keycloak_quarkus_ha_enabled }} |
| keycloak_quarkus_restart_health_check_delay | Seconds to wait before starting health checks | 10 |
| keycloak_quarkus_restart_health_check_retries | Number of attempts for a successful health check before failing | 25 |
| keycloak_quarkus_restart_pause | Seconds to wait between restarts in HA strategies | 15 |
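For example, a playbook opting into the HA-aware restart behaviour could set the new variables like this (a sketch: the role invocation and the fully qualified role name are assumptions, not taken from this PR):

    - hosts: keycloak
      vars:
        keycloak_quarkus_ha_enabled: true
        # one of the provided strategies, or a path (relative to this playbook) to a custom task file
        keycloak_quarkus_restart_strategy: restart/serial_then_parallel.yml
        keycloak_quarkus_restart_health_check: true
        keycloak_quarkus_restart_health_check_delay: 10
        keycloak_quarkus_restart_health_check_retries: 25
        keycloak_quarkus_restart_pause: 15
      roles:
        - middleware_automation.keycloak.keycloak_quarkus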

Fix #182
Fix #221

Co-authored-by: Helmut Wolf <hwo@world-direct.at>
Co-authored-by: Guido Grazioli <ggraziol@redhat.com>
@guidograzioli requested a review from hwo-wd, May 15, 2024 11:54
@guidograzioli added the label major_changes (major changes mean the user can CHOOSE to make a change when they update but does not have to), May 15, 2024
@guidograzioli (Member, Author) commented:

@hwo-wd please review; if we agree on this coauthored PR, I'll find another pair of eyes to do the formal review.

@hwo-wd (Collaborator) left a comment:

Mostly nitpicking

    throttle: 1
    loop: "{{ ansible_play_hosts }}"
    block:
      - name: "Restart and enable {{ keycloak.service_name }} service on first host"
@hwo-wd (Collaborator) commented on the diff:

    ... service on {{ item }}"

        when: inventory_hostname != ansible_play_hosts | first
    - name: "Wait until {{ keycloak.service_name }} service becomes active {{ keycloak.health_url }}"
      ansible.builtin.uri:
        url: "{{ keycloak.health_url }}"

@guidograzioli (Member, Author) commented:

My intent here was indeed to contact localhost (localhost from delegate_to being the restarted node); do you mean that in a production scenario keycloak_quarkus_host would change from localhost? Ideally it should be keycloak_quarkus_frontend_url that takes the load-balanced / reverse-proxied domain name (no?).

@hwo-wd (Collaborator) commented:

you're right, I implied that keycloak_quarkus_host (=localhost, by default) == rhbk_frontend_url (=load balanced), which is not the case -- all good, sorry for the confusion

    @@ -425,6 +431,20 @@ argument_specs:
        default: true
        description: "Allow the option to ignore invalid certificates when downloading JDBC drivers from a custom URL"
        type: "bool"
      keycloak_quarkus_restart_health_check:


      type: "bool"
    keycloak_quarkus_restart_strategy:
      description: >
        Strategy task file for restarting in HA, one of [ 'serial', 'none', 'verify_first' ] below, or path to
@hwo-wd (Collaborator) commented on the diff:

    Strategy task file for restarting if keycloak_quarkus_restart_health_check==True,...
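Folding that suggestion in, the spec entry would read roughly like this (a sketch: the final wording, strategy names, and default are assumptions):

    keycloak_quarkus_restart_strategy:
      description: >
        Strategy task file for restarting if keycloak_quarkus_restart_health_check==True,
        one of [ 'serial', 'none', 'serial_then_parallel' ], or a path to a custom task file
      default: "restart/serial.yml"
      type: "str"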

    block:
      - name: "Restart and enable {{ keycloak.service_name }} service on first host"
        ansible.builtin.include_tasks: ../restart.yml
        delegate_to: "{{ item }}"
@hwo-wd (Collaborator) commented:

I guess we might need the pause task here as well, otherwise the health check returns 200 OK quite early (since the 2nd node is up) -> we restart the 2nd service -> ispn cache lost; considering this is the default strategy (if HA is enabled), we should make it the most failsafe, albeit the slowest:

    - name: Pause to give distributed ispn caches time to (re-)replicate back onto first host
      ansible.builtin.pause:
        seconds: "{{ keycloak_quarkus_restart_pause }}"
      when:
        - keycloak_quarkus_ha_enabled

@guidograzioli (Member, Author) commented:

That's very true; I'll move the sleep out of the strategy files and directly into the main restart.yml block.
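That plan would put the pause next to the service restart itself, roughly like this (a sketch of restart.yml with the pause folded in; the systemd task is an assumption, not quoted from the diff):

    - name: "Restart {{ keycloak.service_name }}"
      ansible.builtin.systemd:
        name: "{{ keycloak.service_name }}"
        state: restarted
        enabled: true

    - name: Pause to give distributed ispn caches time to re-replicate
      ansible.builtin.pause:
        seconds: "{{ keycloak_quarkus_restart_pause }}"
      when: keycloak_quarkus_ha_enabled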

@hwo-wd (Collaborator) commented:

not sure about the verify_first naming; the longer I think about it, the more it confuses me.
How about restart_first_verify_restart_rest -- not a nice name either, but more descriptive.

@guidograzioli (Member, Author) commented:

restart is implied by the restart/ subdirectory... how about serial_then_parallel?

@guidograzioli (Member, Author) commented:

let me expand on this; I probably jumped to code too fast, without design notes.

  • for HA clusters with embedded infinispan (on defaults, so distributed caches with 2 owners), we can tolerate only one node going down at a time, so serial is really the only strategy that works
  • restarting a single node, verifying its liveness, then parallelizing all other node restarts is for remote caches (see the sketch after this list). When the first node goes down, the others preserve the service; when the first is up again, it takes the traffic and all the others can restart. Ideally that "first" should become a parameter itself; it could be "2", "owners-1", or even, one day, "one per region".
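In task form the second strategy looks roughly like this (a sketch built from the delegate_to/run_once idiom suggested later in this review; the hard-coded "first" is the parameter mentioned above):

    # First node: restart alone and verify liveness before anyone else moves
    - name: Restart first host
      ansible.builtin.include_tasks: ../restart.yml
      delegate_to: "{{ ansible_play_hosts | first }}"
      run_once: true

    # Remaining nodes: restart in parallel once the first is serving traffic again
    - name: Restart remaining hosts in parallel
      ansible.builtin.include_tasks: ../restart.yml
      when: inventory_hostname != ansible_play_hosts | first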

@guidograzioli changed the title from "Feature/182 restart handler" to "Restart handler strategy behaviour", May 15, 2024
@hwo-wd self-requested a review, May 16, 2024 07:48
    ---
    - name: Verify first restarted service with health URL, then rest restart in parallel
      block:
        - name: "Restart and enable {{ keycloak.service_name }} service on first host"
@hwo-wd (Collaborator) commented:

we should re-use restart.yml here with delegate_to: "{{ ansible_play_hosts | first }}" and run_once: true, i.e. replacing the next three tasks with this include instead

@guidograzioli (Member, Author) commented:

not really; restart.yml can have the health check turned off, while here it is enforced

@hwo-wd (Collaborator) commented:

you're right, but still: e.g. introducing keycloak_quarkus_internal_enforce_health_check and setting it to true here would lower the maintenance burden in the long run -- your call, of course

@guidograzioli (Member, Author) commented:

mmh, on second thought, yeah, you're right; let me amend that
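Something like the following, with the health check force-enabled only for the first-node include (keycloak_quarkus_internal_enforce_health_check is the name proposed above; this is a sketch, not the merged code):

    - name: "Restart and enable {{ keycloak.service_name }} service on first host"
      ansible.builtin.include_tasks: ../restart.yml
      delegate_to: "{{ ansible_play_hosts | first }}"
      run_once: true
      vars:
        keycloak_quarkus_internal_enforce_health_check: true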

    @@ -7,7 +7,8 @@
         ansible.builtin.include_tasks: bootstrapped.yml
         listen: bootstrapped
       - name: "Restart {{ keycloak.service_name }}"
    -    ansible.builtin.include_tasks: restart.yml
    +    ansible.builtin.include_tasks:
    +      file: "{{ keycloak_quarkus_restart_strategy if keycloak_quarkus_ha_enabled else 'restart.yml' }}"
         listen: "restart keycloak"
@hwo-wd (Collaborator) commented:

the linter seems to prefer 'Restart keycloak' here

@guidograzioli (Member, Author) commented:

That's new... and kind of nonsense to me: ansible/ansible-lint#4168

@sabre1041 (Contributor) left a comment:

LGTM

@guidograzioli (Member, Author) commented:

@hwo-wd I believe all your notes have been addressed, and we got the green flag from Andy; if you have other commits to push or comments, please do; otherwise, I'll proceed and merge. This has been pretty epic :)

@hwo-wd (Collaborator) commented May 16, 2024:

lgtm, thanks for the journey :)

@guidograzioli merged commit 1519d46 into main, May 16, 2024
24 checks passed