Summary
Server-side PNG preview rendering parses and rasterizes user-uploaded SD models synchronously, in-process, on the HTTP event loop, with no timeout. This regressed an isolation boundary that existed in the 2022 production code (a per-request worker_threads.Worker), and it compounds the already-known missing timeout and weak size cap. Combined with app.yaml's unbounded max_instances, a single slow or pathological model render can stall up to 100 co-tenant requests per instance and/or fan out GAE instance scaling on a route any authenticated user can hit for any public project.
This issue captures three distinct facets so none is lost:
1. No timeout / weak size cap on render.
2. Isolation regression: per-request worker thread -> shared in-process WASM backend on the HTTP event loop.
3. Instance-scale cost cap: GAE automatic_scaling has no max_instances.
The deploy doc already notes facet 1 in prose (docs/dev/deploy.md "Rough edges": "no size cap beyond the 10 MB request body limit and no timeout"), but that line is the only existing mention; it is not in docs/tech-debt.md, and it does not capture facets 2 or 3.
What changed since 2022 (the isolation regression)
At the 2022 production commit, src/server/render.ts rasterized the SVG in a per-request worker thread:
    // git show ae68caa4:src/server/render.ts
    import { Worker } from 'worker_threads';
    // ...
    const worker = new Worker(__dirname + '/render-worker.js', {
      workerData: { svgString, viewbox },
    });
    worker.on('message', (result: Uint8Array) => {
      ok(result);
    });
render-worker.js (resvg-wasm) ran the CPU-heavy rasterize off the main thread, so a slow/pathological render was isolated from the HTTP event loop, and a WASM OOM/panic was contained to the worker.
At HEAD, render-inner.ts / render-worker.ts are gone. The last commit to touch the render-worker files was 78f021f1 (server: calculate previews in async workers); no worker source exists at HEAD, only src/server/render.ts. renderToPNG now runs the whole pipeline synchronously, in-process, via the engine WASM DirectBackend.
On Node the backend is a process-wide shared singleton DirectBackend (in src/engine/src/backend-factory.node.ts, getBackend() memoizes one DirectBackend), i.e. the same WASM instance and the same OS thread/event loop as the Express server. There is no worker boundary anymore.
Why it matters
Co-tenant stall (correctness/availability). app.yaml sets instance_class: F4, automatic_scaling.max_concurrent_requests: 100. openProtobuf + renderSvgString + renderPng is synchronous CPU work on the event loop; one slow/pathological model render blocks the loop and stalls up to ~100 concurrently-multiplexed requests on that instance (health checks, other users' API calls, static-handler fallthroughs to Express).
Weaker crash isolation. A WASM OOM or panic during rasterize now happens inside the server process rather than inside a disposable worker thread. The 2022 worker design degraded gracefully (kill the worker, fail one preview); the in-process design risks taking the instance with it.
No timeout, weak size cap. The only gates are the 10 MB request body limit (src/server/app.ts:235,240, limit: '10mb') and the 800px output cap (src/server/render.ts:10, MAX_PREVIEW_SIZE = 400 * 2). Neither bounds render wall-clock: a small protobuf can describe an expensive-to-rasterize diagram. There is no Promise.race/AbortController/worker-kill timeout anywhere on the path.
Unbounded instance scaling (cost). app.yaml's automatic_scaling block has only max_concurrent_requests: 100 and no max_instances. Under sustained slow renders, GAE scales out F4 instances without an upper bound -- a cost-amplification angle distinct from the latency angle.
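The co-tenant stall described above can be reproduced in isolation. The following is a minimal sketch, not the server's code: busyRender is a hypothetical stand-in for the synchronous openProtobuf + renderSvgString + renderPng path, and measureTimerDelay is an illustrative helper. It shows that a short timer scheduled just before a synchronous render cannot fire until the render returns, which is the same reason a health check or another user's request would wait.

```typescript
// Hypothetical stand-in for the synchronous render path: spins the CPU for `ms`.
function busyRender(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // Synchronous work: no other callback can run on this thread meanwhile.
  }
}

// Schedules a short timer (standing in for another tenant's request), then
// renders synchronously. Resolves with how long the timer actually waited.
function measureTimerDelay(renderMs: number, timerMs = 10): Promise<number> {
  return new Promise<number>((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), timerMs);
    busyRender(renderMs);
  });
}

measureTimerDelay(200).then((waitedMs) => {
  // The timer asked for 10 ms but cannot fire before the render finishes.
  console.log(`10 ms timer fired after ${waitedMs} ms`);
});
```

The same effect applies to any timer-based timeout placed on the same thread as the render: the timeout callback is queued behind the synchronous work it is supposed to bound.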
Reachability
Route: GET /api/preview/:username/:projectName (src/server/api.ts:192).
Authenticated users only in the general case, but the username/ownership check is skipped for public projects: if (!projectModel?.getIsPublic()) { ...require session owner... } (src/server/api.ts:206). So any authenticated user can trigger a render for any public project.
The render is on-demand and routine: when no cached preview exists the handler calls updatePreview(app.db, projectModel) -> renderToPNG synchronously (src/server/api.ts:224-227), and previews are invalidated on every save (the POST handler deletes the cached preview; see also tech-debt run management #41), so re-render on next view is the normal case, not an edge case.
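The reachability conditions above can be summarized as a small decision function. This is a simplified sketch under stated assumptions, not the handler in src/server/api.ts: the PreviewRequest shape, the previewOutcome name, and the outcome labels are all hypothetical, and only the public-skip and render-on-cache-miss behavior is taken from the description above.

```typescript
// Hypothetical, simplified view of the inputs the /preview handler works with.
interface PreviewRequest {
  sessionUser: string | undefined; // undefined when nobody is signed in
  routeUsername: string;           // :username from the URL
  projectIsPublic: boolean;        // what projectModel.getIsPublic() reports
  hasCachedPreview: boolean;       // false after every save invalidates the cache
}

type PreviewOutcome = 'rejected' | 'served-from-cache' | 'rendered-inline';

function previewOutcome(req: PreviewRequest): PreviewOutcome {
  // Unauthenticated requests never reach the render.
  if (req.sessionUser === undefined) return 'rejected';
  // The ownership check only applies to non-public projects.
  if (!req.projectIsPublic && req.sessionUser !== req.routeUsername) return 'rejected';
  // A cache miss triggers a synchronous render inside the request.
  return req.hasCachedPreview ? 'served-from-cache' : 'rendered-inline';
}

// A signed-in non-owner viewing a public project right after a save:
console.log(
  previewOutcome({
    sessionUser: 'mallory',
    routeUsername: 'alice',
    projectIsPublic: true,
    hasCachedPreview: false,
  }),
);
```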
Possible approaches
Restore an off-event-loop boundary: move rasterize back into a worker_threads.Worker (or a small worker pool), or render previews in a separate GAE service / background task rather than inline in the request that needs them. This re-establishes both the event-loop isolation and the crash isolation the 2022 design had.
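Hard render timeout: enforce a deadline on the render, using a kill-the-worker deadline if worker-based or an AbortController/Promise.race around an off-thread render, so a pathological model fails fast with a 5xx instead of pinning the loop. A timer-based race on its own cannot preempt synchronous work on the same thread, so the timeout is only effective once the render runs off the event loop.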
Tighten the input gate: a render-specific size/complexity cap (variable/element/view-element counts) below the generic 10 MB body limit; reject obviously oversized models before openProtobuf.
Bound instance scale: set automatic_scaling.max_instances in the deployed config to cap the blast-radius cost. (Note app.yaml is the committed reference; production deploys .app.prod.yaml, which is gitignored -- so this needs to be reflected in both the reference and the prod config per docs/dev/deploy.md.)
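The first two approaches combine naturally: once the render runs in a worker, the deadline becomes "terminate the worker". The following is a minimal sketch of that idea, not the 2022 implementation. runInWorkerWithDeadline and the two inline worker sources are hypothetical; eval: true is used only to keep the example self-contained, where real code would point the Worker at a render-worker file.

```typescript
import { Worker } from 'node:worker_threads';

// Hypothetical helper: run `workerSource` in a fresh worker and enforce a deadline.
function runInWorkerWithDeadline<T>(
  workerSource: string,
  workerData: unknown,
  timeoutMs: number,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData });
    let settled = false;

    // Settle the promise at most once and cancel the pending deadline.
    const finish = (settle: () => void): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      settle();
    };

    // The timer lives on the main thread, so it fires even while the worker spins.
    const timer = setTimeout(() => {
      finish(() => reject(new Error(`render timed out after ${timeoutMs} ms`)));
      void worker.terminate(); // kill the stuck render; the server keeps running
    }, timeoutMs);

    worker.once('message', (result: T) => {
      finish(() => resolve(result));
      void worker.terminate();
    });
    // An uncaught exception or out-of-memory condition in the worker surfaces here.
    worker.once('error', (err: Error) => finish(() => reject(err)));
    worker.once('exit', (code: number) =>
      finish(() => reject(new Error(`worker exited with code ${code} before replying`))),
    );
  });
}

// Stand-in worker bodies (plain CommonJS, since they are evaluated as scripts).
const fastWorkerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  parentPort.postMessage(new Uint8Array([workerData.width, workerData.height]));
`;
const stuckWorkerSource = `for (;;) {}`; // a render that never finishes

runInWorkerWithDeadline<Uint8Array>(stuckWorkerSource, null, 200).catch((err: Error) => {
  console.log(`preview failed: ${err.message}`);
});
```

A per-request worker keeps the sketch short; a small pool would avoid paying worker startup on every preview, at the cost of having to replace a pooled worker whenever the deadline kills it.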
How discovered
Identified during a deploy-risk audit of the server-side preview path. Confirmed against the 2022 production commit ae68caa4 (worker-thread render) vs. HEAD (in-process render), the current app.yaml scaling block, and the /preview route's public-skip and on-demand re-render behavior.
Related existing tracking
docs/dev/deploy.md "Rough edges" already notes the no-size-cap/no-timeout facet in prose (facet 1 only); not in docs/tech-debt.md, and silent on facets 2 (isolation) and 3 (instance cost cap).
Tech-debt run management #41 (POST /api/projects is not transactional) is adjacent -- it covers unconditional preview invalidation on save -- but is about Firestore transaction atomicity, a different root issue.
Tech-debt create flow #14 lists the server "rendering pipeline" only as an untested area.
Components affected
src/server/render.ts (in-process render)
src/server/api.ts (/preview route, public-skip, on-demand updatePreview)
src/server/app.ts (10 MB body limit)
src/engine/src/backend-factory.node.ts (shared singleton DirectBackend)
app.yaml (automatic_scaling, no max_instances)