ops: add minimum production monitoring (notification channel, healthz route + uptime checks, log-based ERROR alerts, 5xx/latency alerts) #693

@bpowers

Problem

Production (GAE default service) has no alerting and no error reporting. docs/dev/deploy.md:132 states it plainly:

There's no error reporting or alerting. Cloud Logging and the GAE metrics dashboard are it.

The documented rollback procedure (docs/dev/deploy.md:111-120) is a manual traffic split:

gcloud app services set-traffic default --splits=<known-good-version>=1

That rollback has no detection mechanism to trigger it. Nothing pages, emails, or otherwise signals that production is degraded, so mean-time-to-detect is effectively unbounded -- a regression sits live until a human happens to notice or a user complains. The rollback being "instant and lossless-for-envelopes" is moot if we never learn we need to run it.

Why it matters

  • Reliability / MTTD: An undetected outage or crash-loop stays live indefinitely. The rollback tooling exists but is never armed.
  • Known high-severity failure modes already exist and would currently fail silently in production:
    • The WASM-preload ServerInitError (Express instances failing to initialize).
    • The PNG preview path (src/server/render.ts) rasterizes user-uploaded models in-process with no timeout; the only limit is the 10 MB body cap (docs/dev/deploy.md:131). A preview-regeneration storm is therefore a latency event, not a 5xx spike.
    • Client-side login failures produce no server-side signal at all.

The static-handler blind spot (the non-obvious part)

A naive uptime check on / would not detect the highest-severity failure mode. app.yaml serves / as a static file via GAE's static handler, so / can stay green (HTTP 200) while every Express instance is crash-looping (e.g. the ServerInitError above). An uptime check must target an Express-routed path to actually exercise the Node server.

There is currently no health route: grepping src/server/ for health|_ah|readiness|liveness|healthz returns nothing. Options for the uptime target:

  • An existing Express route such as /api/user (treat 401 as "up").
  • A purpose-built unauthenticated /healthz route (does not exist today) -- this would make the uptime check clean and unambiguous. A tiny /healthz handler in src/server is the smallest enabling change.
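A minimal sketch of how the two options above could be interpreted by whatever evaluates the probe results. The names `expressIsUp`, `diagnose`, and the `Diagnosis` labels are hypothetical, and the sketch assumes an unauthenticated request to /api/user returns 401 and that a failed connection is reported as status 0. It is an illustration of the "static up, app down" distinction, not Cloud Monitoring's own logic.

```typescript
// Which Express-routed path the probe targets (both names are illustrative).
type ProbeTarget = "healthz" | "api-user";

// Returns true when the response shows that Express handled the request.
function expressIsUp(target: ProbeTarget, status: number): boolean {
  if (target === "healthz") {
    // A purpose-built route has one unambiguous healthy answer.
    return status === 200;
  }
  // Assumption: an unauthenticated probe of /api/user gets 401.
  // A 401 still shows that the Node server routed and answered the request.
  return status === 200 || status === 401;
}

type Diagnosis =
  | "healthy"
  | "static-up-app-down"
  | "app-up-static-down"
  | "everything-down";

// Combines the probe of / (static handler) with the Express probe.
function diagnose(staticStatus: number, expressUp: boolean): Diagnosis {
  const staticUp = staticStatus === 200;
  if (staticUp && expressUp) return "healthy";
  if (staticUp) return "static-up-app-down";
  if (expressUp) return "app-up-static-down";
  return "everything-down";
}
```

With these definitions, `diagnose(200, expressIsUp("healthz", 503))` yields "static-up-app-down", which is the case a check on / alone cannot see.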

Minimum viable setup

  1. One notification channel -- practical prerequisite: an alerting policy with no channel attached notifies no one, so the channel has to exist before any of the policies below are useful.
  2. Uptime checks: one on an Express-routed path (/api/user accepting 401-as-up, or a new /healthz) plus one on / (to distinguish "static serving up, app down" from "everything down").
  3. Log-based alerts: on severity>=ERROR, and on the literal strings ServerInitError and renderToPNG: (the preview-render failure marker).
  4. 5xx-ratio + p95-latency pair: 5xx ratio catches hard failures; p95 latency catches the preview-regeneration storm (a latency event that does not show up as 5xx).
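To illustrate why item 4 pairs the two signals, here is a small worked example over made-up request samples. The `RequestSample` shape, the helper names, and the numbers are all hypothetical; Cloud Monitoring derives these values from its own metrics, so this only shows the arithmetic, not how the alert policies are configured.

```typescript
interface RequestSample {
  status: number;
  latencyMs: number;
}

// Fraction of requests that returned a 5xx status.
function fiveXxRatio(samples: RequestSample[]): number {
  if (samples.length === 0) return 0;
  const errors = samples.filter((s) => s.status >= 500 && s.status <= 599).length;
  return errors / samples.length;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Illustrative preview storm: every request succeeds, but 10% are very slow.
const storm: RequestSample[] = [
  ...Array.from({ length: 90 }, () => ({ status: 200, latencyMs: 120 })),
  ...Array.from({ length: 10 }, () => ({ status: 200, latencyMs: 9000 })),
];

const stormErrorRatio = fiveXxRatio(storm); // 0: a 5xx-ratio alert stays silent
const stormP95 = percentile(storm.map((s) => s.latencyMs), 95); // 9000 ms
```

In this made-up sample the 5xx ratio is 0 while p95 latency is 9000 ms, so only the latency policy would fire.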

Note: client-side login failures produce no server signal, so they remain out of scope for server-side alerting -- called out so it is a known gap, not an oversight.

Components affected

  • src/server (would gain a tiny /healthz route to make the uptime check clean)
  • app.yaml (static handler for / is the reason / is not a valid liveness signal)
  • Ops/infra: GCP Cloud Monitoring config (notification channel, uptime checks, alerting policies) -- not in-repo today

Possible approaches

  • Add an unauthenticated /healthz Express route returning 200 once the WASM engine has finished preloading (so it doubles as a ServerInitError readiness signal), and point a Cloud Monitoring uptime check at it.
  • Stand up the Cloud Monitoring config described above (channel + 2 uptime checks + log-based ERROR/ServerInitError/renderToPNG: alerts + 5xx-ratio and p95-latency policies). Decide whether this config lives in-repo (e.g. Terraform / gcloud script) or is documented runbook-style; today none of it is tracked.
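The first approach could look roughly like the sketch below. `EngineReadiness`, `healthzHandler`, and the `ResponseLike` type are hypothetical names introduced for illustration; how the real WASM preload in src/server reports completion is not shown in this issue, so the `markReady` / `markFailed` hooks are assumptions. `ResponseLike` covers only the two response methods used here so the sketch compiles without importing Express.

```typescript
// Subset of an Express-style response used by the handler.
interface ResponseLike {
  status(code: number): ResponseLike;
  json(body: unknown): unknown;
}

// Hypothetical readiness flag; the preload code would call markReady()
// once the WASM engine is loaded, or markFailed() on a ServerInitError.
class EngineReadiness {
  private ready = false;

  markReady(): void {
    this.ready = true;
  }

  markFailed(): void {
    this.ready = false;
  }

  isReady(): boolean {
    return this.ready;
  }
}

// Builds the /healthz handler: 200 once the engine is ready, 503 otherwise.
function healthzHandler(readiness: EngineReadiness) {
  return (_req: unknown, res: ResponseLike): void => {
    if (readiness.isReady()) {
      res.status(200).json({ status: "ok" });
    } else {
      res.status(503).json({ status: "unavailable" });
    }
  };
}

// Registration, assuming an Express `app` and a shared `readiness` instance:
//   app.get("/healthz", healthzHandler(readiness));
```

Returning 503 until the preload finishes means an uptime check that expects 200 would fail during a crash-loop or a stuck initialization, which is the readiness signal described above.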

Discovery context

Identified during a deploy-risk audit of docs/dev/deploy.md. The doc's "Rough edges" section (line 132) states the bare fact ("no error reporting or alerting"); this issue captures the audit's additions beyond that statement: the static-handler blind spot that makes a / uptime check insufficient, the absence of any health route, and the concrete minimum-viable channel/check/alert set.

Metadata

    Labels

    backend: Involves the Google App Engine node app
