server: preview rendering lost worker-thread isolation; no timeout + unbounded GAE instances compound the DoS/cost surface

## Summary

Server-side PNG preview rendering parses and rasterizes user-uploaded SD models **synchronously, in-process, on the HTTP event loop**, with no timeout. This regressed an isolation boundary that existed in the 2022 production code (a per-request `worker_threads.Worker`), and it compounds the already-known missing timeout/size-cap. Combined with `app.yaml`'s unbounded `max_instances`, a single slow or pathological model render can stall up to 100 co-tenant requests per instance and/or fan out GAE instance scaling on a route any authenticated user can hit for any public project.

This issue captures **three distinct facets** so none is lost:

1. No timeout / weak size cap on render.
2. **Isolation regression**: per-request worker thread -> shared in-process WASM backend on the HTTP event loop.
3. **Instance-scale cost cap**: GAE `automatic_scaling` has no `max_instances`.

The deploy doc already notes facet 1 in prose (`docs/dev/deploy.md` "Rough edges": "no size cap beyond the 10 MB request body limit and no timeout"), but that line is the *only* existing mention: it is not mirrored in `docs/tech-debt.md`, and it does not capture facets 2 or 3.

## What changed since 2022 (the isolation regression)

At the 2022 production commit, `src/server/render.ts` rasterized the SVG in a **per-request worker thread**:

```ts
// git show ae68caa4:src/server/render.ts
import { Worker } from 'worker_threads';
// ...
const worker = new Worker(__dirname + '/render-worker.js', {
  workerData: { svgString, viewbox },
});
worker.on('message', (result: Uint8Array) => { ok(result); });
```

`render-worker.js` (resvg-wasm) ran the CPU-heavy rasterize off the main thread, so a slow/pathological render was isolated from the HTTP event loop, and a WASM OOM/panic was contained to the worker.

At HEAD, `render-inner.ts` / `render-worker.ts` are gone. The last commit to touch the `render-worker` files was `78f021f1 server: calculate previews in async workers`, and no worker source exists at HEAD, only `src/server/render.ts`. `renderToPNG` now runs the whole pipeline synchronously, in-process, via the engine WASM `DirectBackend`:

```ts
// src/server/render.ts:54-64 (HEAD)
export async function renderToPNG(fileDoc: File): Promise<Uint8Array> {
  const engineProject = await EngineProject.openProtobuf(fileDoc.getProjectContents_asU8());
  try {
    const svg = await engineProject.renderSvgString('main');
    const intrinsic = parseSvgDimensions(svg);
    const dims = previewDimensions(intrinsic.width, intrinsic.height, MAX_PREVIEW_SIZE);
    return await engineProject.renderPng('main', dims.width, dims.height);
  } finally {
    await engineProject.dispose();
  }
}
```

On Node the backend is a **process-wide shared singleton** `DirectBackend` (`src/engine/src/backend-factory.node.ts` `getBackend()` memoizes one `DirectBackend`), i.e. the same WASM instance and the same OS thread/event loop as the Express server. There is no worker boundary anymore.
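To make the shared-singleton consequence concrete, here is a minimal sketch of a memoized factory. It is illustrative only: `FakeBackend` and this `getBackend` are hypothetical stand-ins and are not the code in `backend-factory.node.ts`. The point is just that every caller in the process gets the same object, and therefore the same state.

```ts
// Hypothetical stand-in for the engine backend; the real DirectBackend
// wraps a WASM instance, which is not modeled here.
class FakeBackend {
  // Counts opens through this backend, to make the shared state visible.
  openCount = 0;

  open(): number {
    this.openCount += 1;
    return this.openCount;
  }
}

let cached: FakeBackend | undefined;

// Memoized factory: the first call constructs the backend, later calls reuse it.
export function getBackend(): FakeBackend {
  if (cached === undefined) {
    cached = new FakeBackend();
  }
  return cached;
}

// Two "requests" resolve the backend independently but land on the same object.
const requestA = getBackend();
const requestB = getBackend();
requestA.open();
requestB.open();
console.log(requestA === requestB, requestA.openCount); // prints: true 2
```

With one instance per process, there is no per-request state to discard if a single render misbehaves.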

## Why it matters

- **Co-tenant stall (correctness/availability).** `app.yaml` sets `instance_class: F4`, `automatic_scaling.max_concurrent_requests: 100`. `openProtobuf` + `renderSvgString` + `renderPng` is synchronous CPU work on the event loop; one slow/pathological model render blocks the loop and stalls up to ~100 concurrently-multiplexed requests on that instance (health checks, other users' API calls, static-handler fallthroughs to Express).
- **Weaker crash isolation.** A WASM OOM or panic during rasterize now happens inside the server process rather than inside a disposable worker thread. The 2022 worker design degraded gracefully (kill the worker, fail one preview); the in-process design risks taking the instance with it.
- **No timeout, weak size cap.** The only gates are the 10 MB request body limit (`src/server/app.ts:235,240`, `limit: '10mb'`) and the 800px output cap (`src/server/render.ts:10`, `MAX_PREVIEW_SIZE = 400 * 2`). Neither bounds *render wall-clock*: a small protobuf can describe an expensive-to-rasterize diagram. There is no `Promise.race`/`AbortController`/worker-kill timeout anywhere on the path.
- **Unbounded instance scaling (cost).** `app.yaml`'s `automatic_scaling` block has only `max_concurrent_requests: 100` and no `max_instances`. Under sustained slow renders, GAE scales out F4 instances without an upper bound -- a cost-amplification angle distinct from the latency angle.
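
The co-tenant stall can be reproduced in isolation. The sketch below is illustrative only: `busyWait` and `measureTimerDelay` are hypothetical helpers, the busy loop stands in for the engine's CPU-bound work, and the 10 ms timer stands in for another request waiting on the same event loop.

```ts
// Spin on the calling thread for roughly `ms` milliseconds.
// Stands in for CPU-bound parse/rasterize work running on the event loop.
function busyWait(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // intentionally empty: burn CPU without yielding to the event loop
  }
}

// Schedule a 10 ms timer, then block the loop for `blockMs`.
// Resolves with how long the timer actually took to fire.
export function measureTimerDelay(blockMs: number): Promise<number> {
  return new Promise<number>((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), 10);
    busyWait(blockMs); // the timer callback cannot run until this returns
  });
}

measureTimerDelay(200).then((delayMs) => {
  console.log(`10 ms timer fired after ${delayMs} ms`);
});
```

The timer fires only after the blocking call returns, so its observed delay tracks the block duration rather than the requested 10 ms. The same applies to any timer-based timeout registered on the blocked loop.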

## Reachability

- Route: `GET /api/preview/:username/:projectName` (`src/server/api.ts:192`).
- Authenticated users only in the general case, **but** the username/ownership check is skipped for public projects: `if (!projectModel?.getIsPublic()) { ...require session owner... }` (`src/server/api.ts:206`). So **any authenticated user can trigger a render for any public project**.
- The render is on-demand and routine: when no cached preview exists the handler calls `updatePreview(app.db, projectModel)` -> `renderToPNG` synchronously (`src/server/api.ts:224-227`), and previews are **invalidated on every save** (the POST handler deletes the cached preview; see also tech-debt #41), so re-render on next view is the normal case, not an edge case.
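
The access rule described above can be modeled as a small pure function. This is a sketch under stated assumptions, not the handler in `src/server/api.ts`: `Session`, `ProjectInfo`, and `canTriggerPreview` are hypothetical names, and the real handler works with session and project model objects rather than these plain shapes.

```ts
// Hypothetical, simplified shapes for illustration.
interface Session {
  username: string;
}

interface ProjectInfo {
  ownerUsername: string;
  isPublic: boolean;
}

// Returns true when the caller is allowed to reach the preview render path.
export function canTriggerPreview(session: Session | undefined, project: ProjectInfo): boolean {
  // General case: unauthenticated callers are rejected.
  if (session === undefined) return false;
  // Public projects skip the ownership check entirely.
  if (project.isPublic) return true;
  // Private projects require the session user to be the owner.
  return session.username === project.ownerUsername;
}

const publicProject: ProjectInfo = { ownerUsername: 'alice', isPublic: true };
const privateProject: ProjectInfo = { ownerUsername: 'alice', isPublic: false };

console.log(canTriggerPreview({ username: 'mallory' }, publicProject)); // prints: true
console.log(canTriggerPreview({ username: 'mallory' }, privateProject)); // prints: false
```

Under this model, the set of users who can trigger a render of a public project is every authenticated user, not just the owner.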

## Components affected

- `src/server/render.ts` (in-process render)
- `src/server/api.ts` (`/preview` route, public-skip, on-demand `updatePreview`)
- `src/server/app.ts` (10 MB body limit)
- `src/engine/src/backend-factory.node.ts` (shared singleton `DirectBackend`)
- `app.yaml` (`automatic_scaling`, no `max_instances`)

## Possible approaches

- **Restore an off-event-loop boundary**: move rasterize back into a `worker_threads.Worker` (or a small worker pool), or render previews in a separate GAE service / background task rather than inline in the request that needs them. This re-establishes both the event-loop isolation and the crash isolation the 2022 design had.
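- **Hard render timeout**: enforce a render deadline so a pathological model fails fast with a 5xx instead of pinning the loop. Note that an `AbortController`/`Promise.race` timeout only helps once the render runs off the event loop: while synchronous WASM work holds the main thread, the timeout's timer callback cannot fire. With a worker-based render, the deadline becomes a kill-the-worker deadline.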
- **Tighten the input gate**: a render-specific size/complexity cap (variable/element/view-element counts) below the generic 10 MB body limit; reject obviously oversized models before `openProtobuf`.
- **Bound instance scale**: set `automatic_scaling.max_instances` in the deployed config to cap the blast-radius cost. (Note `app.yaml` is the committed reference; production deploys `.app.prod.yaml`, which is gitignored -- so this needs to be reflected in both the reference and the prod config per `docs/dev/deploy.md`.)
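
As a rough sketch of the first two approaches combined (off-event-loop render plus a kill-the-worker deadline), assuming Node's built-in `worker_threads`. Everything named here is hypothetical: `renderWithDeadline` is not an existing function, and `FAKE_RENDER_WORKER` is an inline busy loop standing in for the real engine render, so the example stays self-contained.

```ts
import { Worker } from 'node:worker_threads';

// Hypothetical stand-in for a real render worker: busy-loops for
// `workerData.busyMs` ms, then posts a 4-byte fake "PNG" (the PNG magic prefix).
const FAKE_RENDER_WORKER = `
  const { parentPort, workerData } = require('node:worker_threads');
  const end = Date.now() + workerData.busyMs;
  while (Date.now() < end) { /* simulate CPU-bound rasterize */ }
  parentPort.postMessage(new Uint8Array([0x89, 0x50, 0x4e, 0x47]));
`;

// Run one render in its own worker and reject if it exceeds `deadlineMs`.
export function renderWithDeadline(busyMs: number, deadlineMs: number): Promise<Uint8Array> {
  return new Promise<Uint8Array>((resolve, reject) => {
    const worker = new Worker(FAKE_RENDER_WORKER, { eval: true, workerData: { busyMs } });
    let settled = false;

    // Settle exactly once, and always tear the worker down afterwards.
    const settle = (action: () => void): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      void worker.terminate(); // stops the worker even if it is mid-loop
      action();
    };

    // The main thread stays free, so this timer can fire on schedule.
    const timer = setTimeout(
      () => settle(() => reject(new Error(`render exceeded ${deadlineMs} ms`))),
      deadlineMs,
    );

    worker.once('message', (png: Uint8Array) => settle(() => resolve(png)));
    worker.once('error', (err: Error) => settle(() => reject(err)));
    worker.once('exit', (code: number) =>
      settle(() => reject(new Error(`worker exited early with code ${code}`))));
  });
}
```

In a request handler, the rejection would map to a 5xx for that one preview while other requests on the instance keep being served. A per-request worker pays startup cost on every render; a small pool would amortize that at the price of more lifecycle code.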

## How discovered

Identified during a deploy-risk audit of the server-side preview path. Confirmed against the 2022 production commit `ae68caa4` (worker-thread render) vs. HEAD (in-process render), the current `app.yaml` scaling block, and the `/preview` route's public-skip and on-demand re-render behavior.

## Related existing tracking

- `docs/dev/deploy.md` "Rough edges" already notes the no-size-cap/no-timeout facet in prose (facet 1 only); not in `docs/tech-debt.md`, and silent on facets 2 (isolation) and 3 (instance cost cap).
- Tech-debt #41 (POST /api/projects is not transactional) is adjacent -- it covers unconditional preview *invalidation* on save -- but is about Firestore transaction atomicity, a different root issue.
- Tech-debt #14 lists the server "rendering pipeline" only as an untested area.
