feat(tamanu): TAM-6782: lifecycle subcommands (start/stop/restart/status)#352

Merged
passcod merged 10 commits into main from feat/tamanu-lifecycle
May 23, 2026
@passcod passcod commented May 23, 2026

🤖

Closes #313.

Four new subcommands on top of the services::expected() model from #336 and the post-#341 Context API.

Subcommands

  • tamanu status [NAMES...] — discovery render against the canonical expectation set. No sudo, no probes, no DB. Exits non-zero if any Up expectation is short of its min_count or any Down expectation has a running instance. --json emits a wire shape.
  • tamanu start [NAMES...] — idempotent bring-up. Computes the canonical units a Single/NumericAtLeast/Named expectation requires, diffs against discovery, and issues a single batched systemctl start / pm2 start for whatever's missing. Bails if a pm2-side expectation has fewer registered processes than needed (first-time pm2 setup stays with the ops playbook).
  • tamanu stop [NAMES...] — symmetric with start. Single batched stop call across every running matched instance. Caddy untouched.
  • tamanu restart [NAMES...] — rolling restart split by criticality:
    • Background (tasks, sync, fhir-*): single batched supervisor call.
    • Critical (api, frontend): rolling one instance at a time, with wait_running_one + per-instance HTTP probe + caddy reload + resolvectl flush-caches + cooldown between each, so caddy picks up the new netavark IP before the next probe lands.
    • Flags: --cooldown (default 30s, jiff-parsed), --no-probe-http, --check-url URL for a final end-to-end probe.
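
The ordering that restart describes (one bulk call for Background, then Critical instances one at a time) can be sketched as a planning step. This is a minimal illustration and not the code in lifecycle.rs: `Instance`, `Step`, and `plan_restart` are hypothetical names, and the per-instance probe, caddy reload, and cooldown are left out.

```rust
/// Hypothetical mirror of the Criticality field on Expectation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Criticality {
    Critical,
    Background,
}

/// Hypothetical running instance; only the fields the plan needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Instance {
    pub unit: String,
    pub criticality: Criticality,
}

/// One supervisor interaction in the restart plan.
#[derive(Debug, PartialEq, Eq)]
pub enum Step {
    /// Restart all listed units in a single batched supervisor call.
    Bulk(Vec<String>),
    /// Restart one unit, then (not shown) probe, reload caddy, and cool down.
    Rolling(String),
}

/// Background units go first as one bulk step; critical units follow one by one.
pub fn plan_restart(instances: &[Instance]) -> Vec<Step> {
    let (critical, background): (Vec<&Instance>, Vec<&Instance>) = instances
        .iter()
        .partition(|i| i.criticality == Criticality::Critical);

    let mut steps = Vec::new();
    if !background.is_empty() {
        steps.push(Step::Bulk(
            background.iter().map(|i| i.unit.clone()).collect(),
        ));
    }
    steps.extend(critical.iter().map(|i| Step::Rolling(i.unit.clone())));
    steps
}
```

Keeping the plan as data separate from execution would make the rolling vs bulk decision testable without a supervisor, though the PR may structure this differently.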

NAMES matcher

All four subcommands take a variadic positional NAMES.... Each name is a substring against the expectation name; an expectation matches if any name matches it (union). Empty NAMES = all expectations. Any zero-match name bails listing both the bad pattern and the available names (typo safety in multi-name invocations).
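
The matcher rules above (substring match, union across names, empty means all, any zero-match name is an error) can be sketched as follows. This is an illustration under assumptions, not the actual lifecycle::match_names: the signature, the plain `String` error, and the message wording are all hypothetical.

```rust
/// Hypothetical sketch: filter expectation names by substring patterns
/// with union semantics. The real function's signature is not shown in the PR.
pub fn match_names<'a>(
    available: &[&'a str],
    patterns: &[&str],
) -> Result<Vec<&'a str>, String> {
    // Empty NAMES selects every expectation.
    if patterns.is_empty() {
        return Ok(available.to_vec());
    }

    // A pattern that matches nothing is an error, so a typo in a
    // multi-name invocation is never silently dropped.
    let unmatched: Vec<&str> = patterns
        .iter()
        .copied()
        .filter(|p| !available.iter().any(|name| name.contains(*p)))
        .collect();
    if !unmatched.is_empty() {
        return Err(format!(
            "no expectation matches {:?}; available: {}",
            unmatched,
            available.join(", ")
        ));
    }

    // Union: keep an expectation if any pattern is a substring of its name.
    Ok(available
        .iter()
        .copied()
        .filter(|name| patterns.iter().any(|p| name.contains(*p)))
        .collect())
}
```

With a hypothetical expectation set of `api`, `frontend`, `tasks`, `sync`, and `fhir-refresh`, the patterns `fhir` and `api` would select `api` and `fhir-refresh`, while `api` plus a mistyped `frnt` would fail and list the available names.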

Implementation

  • New Criticality field on Expectation plus tests fixing the matrix per kind/supervisor.
  • New lifecycle.rs module holding: config_and_expectations, match_names, discover / Instance / group_by_expectation, ensure_root_or_reexec (sudo re-exec on systemd hosts), restart_one, wait_running / wait_running_one / wait_stopped, reload_caddy, container_ip_for_unit, pm2_port_for. The container-IP and caddy-reload helpers are lifted verbatim from #313's reload.rs.
  • New tamanu-lifecycle cargo feature; gated alongside the existing tamanu-* feature set.

logs.rs is untouched in this PR. A follow-up reshapes it to use lifecycle::match_names and adds caddy-as-pseudo-service.

The plan file (docs/plans/tamanu-lifecycle.md) is committed as the base of the stack and will be unplanned at the end.

passcod added 9 commits May 23, 2026 16:06
Annotate API and frontend as Critical (must always have one instance
up), everything else expected Up as Background. Drives the upcoming
restart subcommand's rolling vs bulk decision; ignored for start, stop,
and status.

New tamanu-lifecycle feature gates a new module shared by the upcoming
start/stop/restart/status subcommands. First primitive is match_names:
a substring-based filter over the expectation set with union semantics
across multiple names. Empty names = pass-through; any zero-match name
bails with the available list so a typo in a multi-name invocation
doesn't silently drop.

Discovery lifts the systemd/pm2 enumeration logic that doctor was
keeping private. Instance carries the supervisor identifiers needed to
build the eventual systemctl/pm2 commands (unit() / display()).
group_by_expectation joins discovered instances onto the expectation
they belong to, dropping anything not in the expected set.

A lighter cousin of tamanu doctor: enumerates services known to the
supervisor and renders them against the canonical expectation set with
running/missing counts. No HTTP probes, no DB queries. Exits non-zero
if any Up expectation is short of its min_count or any Down expectation
has a running instance.

Takes a variadic NAMES positional matcher (matches via
lifecycle::match_names, substring union). --json emits a serialisable
wire shape for piping into other tools.
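
The exit rule for status can be sketched as a pure check over discovered counts. This is an assumption-laden illustration: `Desired`, `StatusRow`, and `exit_code` are hypothetical names, and the source only says "non-zero", so the value 1 is a placeholder.

```rust
/// Hypothetical desired state for an expectation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Desired {
    /// Expected running, with at least `min_count` instances.
    Up { min_count: usize },
    /// Expected to have no running instances.
    Down,
}

/// Hypothetical row of the status render: one expectation and its running count.
#[derive(Debug, Clone)]
pub struct StatusRow {
    pub name: &'static str,
    pub desired: Desired,
    pub running: usize,
}

impl StatusRow {
    /// Up is satisfied at or above min_count; Down only with zero running.
    pub fn is_satisfied(&self) -> bool {
        match self.desired {
            Desired::Up { min_count } => self.running >= min_count,
            Desired::Down => self.running == 0,
        }
    }
}

/// 0 when every row is satisfied, otherwise 1 (the exact non-zero code is assumed).
pub fn exit_code(rows: &[StatusRow]) -> i32 {
    if rows.iter().all(StatusRow::is_satisfied) {
        0
    } else {
        1
    }
}
```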

Idempotent bring-up: enumerates required units against discovery and
issues a single systemctl start (or pm2 start) for whatever's missing.
Self-elevates via sudo on systemd if not root. Waits for everything it
started to become active before returning.

Adds Instances::required_systemd_units for computing the canonical
unit names a Single/NumericAtLeast/Named expectation requires.

For pm2, bails if the deployment has fewer registered processes than
the expectation needs; first-time pm2 setup stays in the ops playbook.
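
The idempotent part of start comes down to a set difference followed by one batched call. The sketch below assumes unit names are plain strings and only builds the argument vector rather than spawning anything; `units_to_start` and `start_command` are hypothetical helpers, not functions from this PR.

```rust
use std::collections::BTreeSet;

/// Required units that discovery did not report as running,
/// in the order they were required and without duplicates.
pub fn units_to_start(required: &[&str], running: &[&str]) -> Vec<String> {
    let running: BTreeSet<&str> = running.iter().copied().collect();
    let mut missing: Vec<String> = Vec::new();
    for unit in required {
        let already_listed = missing.iter().any(|m| m.as_str() == *unit);
        if !running.contains(unit) && !already_listed {
            missing.push(unit.to_string());
        }
    }
    missing
}

/// Argument vector for a single batched `systemctl start`,
/// or None when nothing is missing (the no-op case).
pub fn start_command(missing: &[String]) -> Option<Vec<String>> {
    if missing.is_empty() {
        return None;
    }
    let mut argv = vec!["systemctl".to_string(), "start".to_string()];
    argv.extend(missing.iter().cloned());
    Some(argv)
}
```

Running it a second time against a fully started host yields `None`, which is what makes the bring-up safe to repeat.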

Mirror of start: gathers every running instance under the matched
expectations and issues a single supervisor stop call. Self-elevates
under sudo on systemd. Waits for everything to be inactive before
returning. Caddy is not touched; its upstreams just become unreachable
which is the operator's intent for a maintenance window.

No critical/background ordering — once the operator decides to bring
things down, the supervisor's synchronous stop is enough.

Rolling restart that splits running instances by criticality:
background services restart in a single bulk supervisor call, then
critical services (api, frontend) roll one instance at a time with a
per-instance HTTP probe + caddy reload + cooldown between each. The
probe URL is derived from podman netavark on systemd (container IP +
:3000) or pm2's PORT env var.

Flags from #313: --cooldown (default 30s, jiff-parsed), --no-probe-http,
--check-url for an end-to-end probe after the roll.

Lifts reload_caddy, container_ip_for_unit, and the pm2 port lookup
verbatim from #313's reload.rs into the lifecycle module, where they
join restart_one, wait_running_one, and the new bulk_restart helper.

All four subcommands (status/start/stop/restart) are implemented with
the planned shape, plus self-elevation, USAGE.md regen, and the
criticality field on Expectation.
@passcod passcod added this pull request to the merge queue May 23, 2026
Merged via the queue into main with commit 006daee May 23, 2026
8 checks passed
@passcod passcod deleted the feat/tamanu-lifecycle branch May 23, 2026 06:49