ops: add minimum production monitoring (notification channel, healthz route + uptime checks, log-based ERROR alerts, 5xx/latency alerts) #693

## Problem

Production (GAE `default` service) has **no alerting or error reporting**. `docs/dev/deploy.md:132` states it plainly:

> There's no error reporting or alerting. Cloud Logging and the GAE metrics dashboard are it.

The documented rollback procedure (`docs/dev/deploy.md:111-120`) is a manual traffic split:

```bash
gcloud app services set-traffic default --splits=<known-good-version>=1
```

That rollback has **no detection mechanism to trigger it**. Nothing pages, emails, or otherwise signals that production is degraded, so mean-time-to-detect is effectively unbounded -- a regression sits live until a human happens to notice or a user complains. The rollback being "instant and lossless-for-envelopes" is moot if we never learn we need to run it.

## Why it matters

- **Reliability / MTTD**: An undetected outage or crash-loop stays live indefinitely. The rollback tooling exists but is never armed.
- **Known high-severity failure modes already exist** and would currently fail silently in production:
  - The WASM-preload `ServerInitError` (Express instances failing to initialize).
  - The PNG preview path (`src/server/render.ts`) rasterizes user-uploaded models in-process with no timeout; the only guard is the 10 MB body cap (`docs/dev/deploy.md:131`). A preview-regeneration storm is therefore a **latency** event, not a 5xx spike.
  - Client-side login failures produce **no server-side signal at all**.

## The static-handler blind spot (the non-obvious part)

A naive uptime check on `/` would **not** detect the highest-severity failure mode. `app.yaml` serves `/` as a **static file via GAE's static handler**, so `/` can stay green (HTTP 200) while **every Express instance is crash-looping** (e.g. the `ServerInitError` above). An uptime check must target an **Express-routed path** to actually exercise the Node server.

There is currently **no health route**: grepping `src/server/` for `health|_ah|readiness|liveness|healthz` returns nothing. Options for the uptime target:
- An existing Express route such as `/api/user` (treat 401 as "up").
- A purpose-built unauthenticated `/healthz` route (does not exist today) -- this would make the uptime check clean and unambiguous. A tiny `/healthz` handler in `src/server` is the smallest enabling change.
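The two target options differ only in which status codes count as "up". A minimal sketch of that acceptance rule, and of how a `/` result combines with an Express-routed result, might look like the following. The helper and type names are hypothetical; only the paths and the 401-as-up rule come from the text above.

```typescript
// Hypothetical names throughout; only the paths and the 401-as-up rule
// are taken from the issue text.
type ProbeTarget = "/" | "/api/user" | "/healthz";

// Status codes that count as "answered correctly" for each probe target.
const ACCEPTED_STATUSES: Record<ProbeTarget, number[]> = {
  "/": [200],              // served by GAE's static handler; never reaches Node
  "/api/user": [200, 401], // a 401 still proves Express handled the request
  "/healthz": [200],       // purpose-built route, so only 200 is "up"
};

function probeIsUp(target: ProbeTarget, httpStatus: number): boolean {
  return ACCEPTED_STATUSES[target].indexOf(httpStatus) !== -1;
}

type CheckDiagnosis =
  | "healthy"
  | "static-up-app-down"
  | "static-down-app-up"
  | "everything-down";

// Combine the static-handler check with the Express-routed check.
function diagnoseChecks(staticUp: boolean, expressUp: boolean): CheckDiagnosis {
  if (staticUp && expressUp) return "healthy";
  if (staticUp) return "static-up-app-down"; // the blind spot a "/"-only check misses
  if (expressUp) return "static-down-app-up";
  return "everything-down";
}
```

For example, `diagnoseChecks(probeIsUp("/", 200), probeIsUp("/api/user", 502))` yields `"static-up-app-down"`, which is exactly the crash-loop case a `/`-only check reports as green.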

## Minimum viable setup

1. **One notification channel** -- hard prerequisite: an alerting policy with no channel attached can open incidents but notifies no one.
2. **Uptime checks**: one on an **Express-routed path** (`/api/user` accepting 401-as-up, or a new `/healthz`) **plus** one on `/` (to distinguish "static serving up, app down" from "everything down").
3. **Log-based alerts**: on `severity>=ERROR`, and on the literal strings `ServerInitError` and `renderToPNG:` (the preview-render failure marker).
4. **5xx-ratio + p95-latency pair**: 5xx ratio catches hard failures; p95 latency catches the preview-regeneration storm (a latency event that does not show up as 5xx).
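To illustrate why item 4 needs both signals, here is a rough sketch that evaluates a window of request samples against a 5xx-ratio threshold and a p95-latency threshold. The sample shape, function names, and both threshold values are placeholders for illustration, not values from this issue or from Cloud Monitoring.

```typescript
// Hypothetical shape for one served request, e.g. as read from request logs.
interface RequestSample {
  httpStatus: number;
  latencyMs: number;
}

// Fraction of requests in the window that returned a 5xx status.
function fiveXxRatio(samples: RequestSample[]): number {
  if (samples.length === 0) return 0;
  const errors = samples.filter((s) => s.httpStatus >= 500 && s.httpStatus <= 599).length;
  return errors / samples.length;
}

// Nearest-rank p95: the smallest latency with at least 95% of samples at or below it.
function p95LatencyMs(samples: RequestSample[]): number {
  if (samples.length === 0) return 0;
  const sorted = samples.map((s) => s.latencyMs).sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length);
  return sorted[rank - 1];
}

// Placeholder thresholds (assumptions, to be tuned against real traffic).
const MAX_5XX_RATIO = 0.05;
const MAX_P95_MS = 2000;

// Names of the alert conditions that would fire for this window.
function firingAlerts(samples: RequestSample[]): string[] {
  const firing: string[] = [];
  if (fiveXxRatio(samples) > MAX_5XX_RATIO) firing.push("5xx-ratio");
  if (p95LatencyMs(samples) > MAX_P95_MS) firing.push("p95-latency");
  return firing;
}
```

A window where every request returns 200 but a tenth of them take several seconds trips only the latency condition, which is the preview-storm signature; a window of fast 500s trips only the ratio condition.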

Note: client-side login failures produce no server signal, so they remain out of scope for server-side alerting -- called out so it is a known gap, not an oversight.
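The log-based alerts in item 3 each need a Cloud Logging filter. The sketch below builds one filter for `severity>=ERROR` and one per literal marker. It assumes the logs carry the `gae_app` resource type with `module_id="default"`, and it uses a bare quoted search term so it makes no assumption about whether the marker lands in `textPayload` or `jsonPayload`. The function name is hypothetical.

```typescript
// Assumed scope: the GAE `default` service. Adjust if the logs are labeled differently.
const GAE_DEFAULT_SCOPE =
  'resource.type="gae_app" AND resource.labels.module_id="default"';

// Build one filter for ERROR-and-above plus one per literal marker string.
function logAlertFilters(markers: string[]): string[] {
  const bySeverity = `${GAE_DEFAULT_SCOPE} AND severity>=ERROR`;
  const byMarker = markers.map((marker) => {
    // Escape embedded double quotes so the marker stays one quoted term.
    const escaped = marker.replace(/"/g, '\\"');
    return `${GAE_DEFAULT_SCOPE} AND "${escaped}"`;
  });
  return [bySeverity].concat(byMarker);
}

// The two markers named in item 3.
const monitoringFilters = logAlertFilters(["ServerInitError", "renderToPNG:"]);
```

Keeping the marker filters separate from the severity filter means a `ServerInitError` logged below ERROR severity would still match.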

## Components affected

- `src/server` (would gain a tiny `/healthz` route to make the uptime check clean)
- `app.yaml` (static handler for `/` is the reason `/` is not a valid liveness signal)
- Ops/infra: GCP Cloud Monitoring config (notification channel, uptime checks, alerting policies) -- not in-repo today

## Possible approaches

- Add an unauthenticated `/healthz` Express route returning 200 once the WASM engine has finished preloading (so it doubles as a `ServerInitError` readiness signal), and point a Cloud Monitoring uptime check at it.
- Stand up the Cloud Monitoring config described above (channel + 2 uptime checks + log-based ERROR/`ServerInitError`/`renderToPNG:` alerts + 5xx-ratio and p95-latency policies). Decide whether this config lives in-repo (e.g. Terraform / `gcloud` script) or is documented runbook-style; today none of it is tracked.
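As a rough sketch of the first approach, the readiness decision can be kept as a pure function that the route handler calls. The `InitState` shape and the function name are hypothetical; the real server would need to track its own WASM preload outcome, and the Express wiring in the trailing comment assumes an `app` and a shared `initState` already exist.

```typescript
// Hypothetical init state; the real server would update it from its WASM preload.
interface InitState {
  wasmReady: boolean;
  initFailed: boolean; // e.g. set when a ServerInitError is caught
}

interface HealthReply {
  status: number;
  body: { ok: boolean; reason?: string };
}

// Decide what /healthz should answer for the current init state.
function healthzReply(state: InitState): HealthReply {
  if (state.initFailed) {
    // Keep the reason generic: the route is unauthenticated.
    return { status: 503, body: { ok: false, reason: "init failed" } };
  }
  if (!state.wasmReady) {
    return { status: 503, body: { ok: false, reason: "wasm preload pending" } };
  }
  return { status: 200, body: { ok: true } };
}

// Possible Express wiring (assumes `app` and `initState` are defined elsewhere):
// app.get("/healthz", (_req, res) => {
//   const reply = healthzReply(initState);
//   res.status(reply.status).json(reply.body);
// });
```

Returning 503 until preload completes means the uptime check would also flag an instance that starts but never becomes ready, not only one that crashes outright.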

## Discovery context

Identified during a deploy-risk audit of `docs/dev/deploy.md`. The doc's "Rough edges" section (line 132) states the bare fact ("no error reporting or alerting"); this issue captures the audit's additions beyond that statement: the static-handler blind spot that makes a `/` uptime check insufficient, the absence of any health route, and the concrete minimum-viable channel/check/alert set.
