Problem
Production (GAE default service) has zero alerting and error reporting. docs/dev/deploy.md:132 states it plainly:
There's no error reporting or alerting. Cloud Logging and the GAE metrics dashboard are it.
The documented rollback procedure (docs/dev/deploy.md:111-120) is a manual traffic split:
gcloud app services set-traffic default --splits=<known-good-version>=1
That rollback has no detection mechanism to trigger it. Nothing pages, emails, or otherwise signals that production is degraded, so mean-time-to-detect is effectively unbounded -- a regression sits live until a human happens to notice or a user complains. The rollback being "instant and lossless-for-envelopes" is moot if we never learn we need to run it.
Why it matters
- Reliability / MTTD: An undetected outage or crash-loop stays live indefinitely. The rollback tooling exists but is never armed.
- Known high-severity failure modes already exist and would currently fail silently in production:
  - The WASM-preload ServerInitError (Express instances failing to initialize).
  - The PNG preview path (src/server/render.ts) rasterizes user-uploaded models in-process with no timeout, bounded only by the 10 MB body cap (docs/dev/deploy.md:131) -- so a preview-regeneration storm surfaces as a latency event, not a 5xx spike.
  - Client-side login failures produce no server-side signal at all.
The static-handler blind spot (the non-obvious part)
A naive uptime check on / would not detect the highest-severity failure mode. app.yaml serves / as a static file via GAE's static handler, so / can stay green (HTTP 200) while every Express instance is crash-looping (e.g. the ServerInitError above). An uptime check must target an Express-routed path to actually exercise the Node server.
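The blind spot can be made concrete with a small sketch that combines two probe results, one for / and one for an Express-routed path. Everything named here (ProbeResult, Diagnosis, diagnose) is hypothetical and for illustration only, and the 503 used for a crash-looping instance is an assumption; the real failure might equally be a timeout or another 5xx.

```typescript
// Outcome of one HTTP probe: a status code, or null if the request timed out
// or the connection failed. (Hypothetical type, not from the codebase.)
type ProbeResult = number | null;

type Diagnosis =
  | "healthy"
  | "static up, app down"
  | "app up, static down"
  | "everything down";

// True only for a 2xx answer.
function is2xx(result: ProbeResult): boolean {
  return result !== null && result >= 200 && result < 300;
}

// Combine a probe of "/" (served by the static handler) with a probe of an
// Express-routed path. Only the second probe exercises the Node server.
function diagnose(staticProbe: ProbeResult, expressProbe: ProbeResult): Diagnosis {
  const staticUp = is2xx(staticProbe);
  const appUp = is2xx(expressProbe);
  if (staticUp && appUp) return "healthy";
  if (staticUp) return "static up, app down";
  if (appUp) return "app up, static down";
  return "everything down";
}

// Crash-looping Express instances (assumed to answer 503): "/" alone still
// looks green, and only the pair reveals the outage.
const crashLoop: Diagnosis = diagnose(200, 503); // "static up, app down"
```

A check on / alone sees only the first argument, which is why it reports green in the crash-loop case.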
There is currently no health route: grepping src/server/ for health|_ah|readiness|liveness|healthz returns nothing. Options for the uptime target:
- An existing Express route such as /api/user (treat 401 as "up").
- A purpose-built unauthenticated /healthz route (does not exist today) -- this would make the uptime check clean and unambiguous. A tiny /healthz handler in src/server is the smallest enabling change.
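The difference between the two options comes down to which status codes count as "up". The sketch below is one way to express that rule; expressIsUp and UptimeTarget are hypothetical names, and accepting 2xx alongside 401 for /api/user is an assumption, not something the codebase specifies.

```typescript
// The two candidate uptime targets. "/healthz" does not exist today.
type UptimeTarget = "/api/user" | "/healthz";

// Decide whether a probe result proves that Express handled the request.
// A null status stands for a timeout or connection failure.
function expressIsUp(target: UptimeTarget, status: number | null): boolean {
  if (status === null) return false;
  if (target === "/api/user") {
    // An unauthenticated probe is expected to be rejected, but a 401 still
    // proves the request reached an Express route. 2xx is accepted as well
    // (assumption) in case the route ever answers without auth.
    return status === 401 || (status >= 200 && status < 300);
  }
  // A purpose-built health route has exactly one "up" answer.
  return status === 200;
}
```

The /api/user branch carries the ambiguity (a 401 means "up" here but would mean "broken" anywhere else), which is the argument for the dedicated route.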
Minimum viable setup
- One notification channel -- hard prerequisite: Cloud Monitoring refuses to save alerting policies without a channel attached.
- Uptime checks: one on an Express-routed path (/api/user accepting 401-as-up, or a new /healthz) plus one on / (to distinguish "static serving up, app down" from "everything down").
- Log-based alerts: on severity>=ERROR, and on the literal strings ServerInitError and renderToPNG: (the preview-render failure marker).
- 5xx-ratio + p95-latency pair: 5xx ratio catches hard failures; p95 latency catches the preview-regeneration storm (a latency event that does not show up as 5xx).
Note: client-side login failures produce no server signal, so they remain out of scope for server-side alerting -- called out so it is a known gap, not an oversight.
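To illustrate why the 5xx-ratio and p95-latency policies are needed as a pair, here is a minimal sketch that evaluates both over a window of request samples. The thresholds (5% and 2000 ms), the RequestSample shape, and all function names are assumptions chosen for illustration; real policies would be configured in Cloud Monitoring rather than computed in application code.

```typescript
// One observed request. (Hypothetical shape for illustration.)
interface RequestSample {
  status: number;
  latencyMs: number;
}

// Assumed thresholds; the issue does not specify values.
const MAX_5XX_RATIO = 0.05;
const MAX_P95_MS = 2000;

// Fraction of requests in the window that returned a 5xx status.
function fiveXxRatio(samples: RequestSample[]): number {
  if (samples.length === 0) return 0;
  const errors = samples.filter((s) => s.status >= 500 && s.status <= 599).length;
  return errors / samples.length;
}

// 95th-percentile latency using the nearest-rank method.
function p95LatencyMs(samples: RequestSample[]): number {
  if (samples.length === 0) return 0;
  const sorted = samples.map((s) => s.latencyMs).sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1] ?? 0;
}

// Names of the policies that would fire for this window.
function firingAlerts(samples: RequestSample[]): string[] {
  const alerts: string[] = [];
  if (fiveXxRatio(samples) > MAX_5XX_RATIO) alerts.push("5xx-ratio");
  if (p95LatencyMs(samples) > MAX_P95_MS) alerts.push("p95-latency");
  return alerts;
}

// A simulated preview-regeneration storm: every request succeeds, but two of
// twenty take nine seconds.
const storm: RequestSample[] = [];
for (let i = 0; i < 20; i++) {
  storm.push({ status: 200, latencyMs: i < 18 ? 100 : 9000 });
}
const stormAlerts = firingAlerts(storm); // ["p95-latency"]
```

In the simulated storm the 5xx ratio is exactly zero, so only the latency policy fires.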
Components affected
- src/server (would gain a tiny /healthz route to make the uptime check clean)
- app.yaml (static handler for / is the reason / is not a valid liveness signal)
- Ops/infra: GCP Cloud Monitoring config (notification channel, uptime checks, alerting policies) -- not in-repo today
Possible approaches
- Add an unauthenticated /healthz Express route returning 200 once the WASM engine has finished preloading (so it doubles as a ServerInitError readiness signal), and point a Cloud Monitoring uptime check at it.
- Stand up the Cloud Monitoring config described above (channel + 2 uptime checks + log-based ERROR/ServerInitError/renderToPNG: alerts + 5xx-ratio and p95-latency policies). Decide whether this config lives in-repo (e.g. Terraform / gcloud script) or is documented runbook-style; today none of it is tracked.
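A rough sketch of what the readiness-gated /healthz handler in the first approach might look like follows. It is written against a minimal structural ResponseLike type instead of importing Express so that it stays self-contained; Readiness, makeHealthzHandler, the response bodies, and the choice of 503 before the preload completes are all assumptions, not existing code.

```typescript
// Minimal structural view of a response object. Express's Response offers
// status() and send() in a compatible shape.
interface ResponseLike {
  status(code: number): ResponseLike;
  send(body: string): unknown;
}

// Hypothetical readiness flag, to be flipped once the WASM preload resolves.
class Readiness {
  private ready = false;

  markReady(): void {
    this.ready = true;
  }

  isReady(): boolean {
    return this.ready;
  }
}

// Build the handler: 200 once the engine is ready, 503 (assumed) until then,
// so a failed or stalled initialization shows up as a failing uptime check.
function makeHealthzHandler(readiness: Readiness) {
  return (_req: unknown, res: ResponseLike): void => {
    if (readiness.isReady()) {
      res.status(200).send("ok");
    } else {
      res.status(503).send("initializing");
    }
  };
}
```

With Express this would presumably be registered as app.get("/healthz", makeHealthzHandler(readiness)), with readiness.markReady() called after the preload succeeds. Where the preload actually completes in src/server is not shown in this issue, so that wiring is left open.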
Discovery context
Identified during a deploy-risk audit of docs/dev/deploy.md. The doc's "Rough edges" section (line 132) states the bare fact ("no error reporting or alerting"); this issue captures the audit's additions beyond that statement: the static-handler blind spot that makes a / uptime check insufficient, the absence of any health route, and the concrete minimum-viable channel/check/alert set.