feat(worker): /proc-based external sampler for per-task RSS and CPU #92

Merged
albanm merged 19 commits into master from perf-task-resource-metrics
May 27, 2026

albanm commented May 27, 2026

Adds a parent-side /proc-based resource sampler, one instance per task slot, so the
worker keeps reporting per-task RSS and CPU usage even when the child's in-process
df-mem sampler is starved by a CPU-bound plugin.

What changed:

  • New worker/src/utils/proc-stat.ts reads VmRSS from /proc/<pid>/status and
    utime/stime from /proc/<pid>/stat; computes a CPU ratio (1.0 = one full core).
  • New worker/src/task/external-sampler.ts runs the reader per slot at
    WORKER_TASK_MEMORY_SAMPLE_INTERVAL_MS, started in iter() after child.spawn()
    and stopped in the close/error handlers.
  • The external sampler becomes the authoritative writer for the
    df_processings_process_resident_memory_bytes{kind="task"} gauge; the in-process
    df-mem writer suppresses RSS when the sampler is active.
  • New gauge df_processings_process_cpu_usage_ratio{kind="task",slot=…}.
  • diagnoseExit now takes a lastExt: ExternalSample | null and renders the
    external RSS/CPU lines before the (possibly stale) child sample in oom-host
    diagnostics; the French user message gets the same treatment.
  • New config knob WORKER_TASK_EXTERNAL_SAMPLER_ENABLED (default true,
    auto-disabled at boot on non-Linux).
  • Tests: unit specs for proc-stat, external-sampler, and the updated
    exit-code; one e2e (memory-oom.e2e.spec.ts) driving a new processing-cpu-leak
    fixture that busy-loops the child while the parent keeps reading /proc.
  • Also bundled: an exit-time df-mem debug write + text-warning/
    text-medium-emphasis color fix in the run-logs list (commit 178204b).

Why: CPU-saturated tasks make the in-process df-mem sampler go stale, so the
lastMem attached to an oom-host diagnostic could be many seconds old at the
moment of the OS kill. Parent-side /proc reading sidesteps the child's event
loop entirely.

Regression risks:

  • diagnoseExit signature gained lastExt (5th arg). In-tree callers + tests are
    updated; flag for any out-of-tree consumers.
  • memReporter.stop() is now async and is awaited in task.ts's finally.
    The exit-time debug write now blocks task tear-down until the mongo log write
    resolves. A wedged mongo will stall task cleanup; consider a short timeout
    around the awaited debug call.
  • worker.task.externalSamplerEnabled is required in config/type/schema.json.
    Deployments with an explicit pinned config will need to add the key (default
    is true in default.mjs).
  • If the external sampler stops mid-run (e.g. /proc read fails), the in-process
    writer is not re-enabled — the RSS gauge for that slot freezes at the last
    external sample until the next run. Acceptable per the design, but undocumented
    in the "Failure modes" section.
  • The e2e fixture busy-loops for ~25 s against a 90 s timeout; the CPU-ratio
    threshold (> 0.1) may flake on heavily loaded CI. The RSS-fallback assertion
    mitigates this.
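
The risk list above suggests a short timeout around the awaited exit-time debug write. A minimal sketch of one way to bound such an await follows. `withTimeout` is a hypothetical helper, not part of this PR, and the timeout value is left to the caller.

```typescript
// Hypothetical helper: resolves with the work's value, or with the string
// 'timeout' if the work has not settled within `ms` milliseconds.
// A rejection of `work` still propagates to the caller.
async function withTimeout<T> (work: Promise<T>, ms: number): Promise<T | 'timeout'> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), ms)
  })
  try {
    return await Promise.race([work, timeout])
  } finally {
    // Always clear the timer so a fast path does not keep the event loop alive.
    clearTimeout(timer)
  }
}
```

In task.ts's finally block this could wrap the awaited `memReporter.stop()` call, for example `await withTimeout(memReporter.stop(), 2000)`, where 2000 ms is an arbitrary illustrative value. The underlying write is not cancelled; the tear-down simply stops waiting for it.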

albanm and others added 19 commits May 27, 2026 14:48
Spec for moving per-task RSS/CPU metrics collection from inside the
task child (in-process df-mem reporter) to the parent worker process,
so observation survives CPU-bound plugins saturating the child event
loop. Linux-only via /proc, observe-only (no new kill behaviour),
reuses WORKER_TASK_MEMORY_SAMPLE_INTERVAL_MS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ten-task plan covering procfs reader, per-slot sampler, metrics gauge
wiring, worker.ts lifecycle integration, exit-code diagnostic
extension, config knob + env binding, architecture docs, and a
CPU-saturation e2e test. References spec doc 2026-05-27-...-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure string→number helpers. /proc/<pid>/stat parser slices from the last
')' to handle process names containing spaces, parens, or newlines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
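
For readers unfamiliar with the procfs layout, here is a minimal sketch of the two pure parsers described in this commit. The name `parseStatFields` appears later in this PR; `parseVmRssBytes`, the return shapes, and the null-on-failure convention are assumptions made for illustration.

```typescript
// Sketch only: return shapes and parseVmRssBytes are illustrative assumptions.

/** Parse the VmRSS line of /proc/<pid>/status; returns bytes, or null if absent. */
function parseVmRssBytes (statusText: string): number | null {
  for (const line of statusText.split('\n')) {
    if (!line.startsWith('VmRSS:')) continue
    const m = /^(\d+)\s+kB$/.exec(line.slice('VmRSS:'.length).trim())
    return m ? Number(m[1]) * 1024 : null
  }
  return null
}

/** Extract utime and stime (in clock ticks) from the /proc/<pid>/stat line. */
function parseStatFields (statText: string): { utime: number, stime: number } | null {
  // The comm field (2nd) is wrapped in parentheses and may itself contain
  // spaces, parens, or newlines, so slice from the LAST ')'.
  const close = statText.lastIndexOf(')')
  if (close === -1) return null
  // The remainder starts at field 3 (state). utime is field 14 and stime is
  // field 15, i.e. indexes 11 and 12 of the remainder.
  const fields = statText.slice(close + 1).trim().split(/\s+/)
  if (fields.length < 13) return null
  const utime = Number(fields[11])
  const stime = Number(fields[12])
  if (!Number.isFinite(utime) || !Number.isFinite(stime)) return null
  return { utime, stime }
}
```

Slicing from the last `)` rather than splitting the whole line on whitespace is what keeps the field indexes stable for process names such as `my (weird) name`.
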
Per code review: the /^(\d+)\s+kB$/ capture guarantees a digit string,
so Number() is always finite — the guard was dead code. Adds a
parseStatFields test for stat lines with fewer than 13 fields after the
closing ')'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds readProcStat (fs-backed), computeCpuRatio (delta math), isSupported
(/proc gate). CLK_TCK is detected once via 'getconf CLK_TCK' with a
fallback to 100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
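
A minimal sketch of the one-time CLK_TCK detection described in this commit, assuming a Node.js runtime. The function name `detectClockTicksPerSec` appears later in this PR; the `log` parameter, the 1 s timeout, and the log wording are assumptions.

```typescript
import { execFileSync } from 'node:child_process'

const FALLBACK_CLK_TCK = 100

// Sketch only: runs `getconf CLK_TCK` once and falls back to 100 when the
// command is missing, fails, or prints something that is not a positive integer.
function detectClockTicksPerSec (log: (msg: string) => void = console.info): number {
  try {
    const out = execFileSync('getconf', ['CLK_TCK'], {
      encoding: 'utf8',
      timeout: 1000, // assumed bound so a hung getconf cannot block boot
      stdio: ['ignore', 'pipe', 'ignore']
    })
    const ticks = Number.parseInt(out.trim(), 10)
    if (Number.isInteger(ticks) && ticks > 0) return ticks
    log(`getconf CLK_TCK returned an unexpected value, falling back to ${FALLBACK_CLK_TCK}`)
  } catch {
    log(`getconf CLK_TCK unavailable, falling back to ${FALLBACK_CLK_TCK}`)
  }
  return FALLBACK_CLK_TCK
}
```
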
Per code review:
- computeCpuRatio guards against clockTicksPerSec <= 0 (was theoretical
  Infinity path if a future caller passed 0).
- detectClockTicksPerSec logs an info line on fallback so operators in
  minimal containers can tell why CLK_TCK is 100.
- Adds a non-Linux test for readProcStat returning null, and a
  clockTicksPerSec=0 test for computeCpuRatio.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
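
The delta math behind the CPU ratio can be sketched as follows. The function name `computeCpuRatio` comes from this PR, but the snapshot shape and the choice to return null on a guard failure are assumptions; the actual implementation may differ.

```typescript
// Sketch only: CpuSnapshot and the null return on guard failure are assumptions.
interface CpuSnapshot {
  utime: number // ticks spent in user mode
  stime: number // ticks spent in kernel mode
  atMs: number // wall-clock time of the read, in milliseconds
}

// Returns CPU usage between two snapshots, where 1.0 means one full core.
function computeCpuRatio (prev: CpuSnapshot, curr: CpuSnapshot, clockTicksPerSec: number): number | null {
  // Guard against a zero or negative tick rate, which would yield Infinity or NaN.
  if (!(clockTicksPerSec > 0)) return null
  const elapsedSec = (curr.atMs - prev.atMs) / 1000
  if (!(elapsedSec > 0)) return null
  const tickDelta = (curr.utime + curr.stime) - (prev.utime + prev.stime)
  if (tickDelta < 0) return null // counters went backwards, e.g. PID reuse
  return tickDelta / clockTicksPerSec / elapsedSec
}
```

As a worked case: 200 ticks consumed over 2 s at 100 ticks per second is 2 s of CPU time in 2 s of wall time, a ratio of 1.0.
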
Owns a setInterval per slot in the parent, snapshots /proc/<pid> via the
injectable readProcStat, computes CPU ratio against a previous snapshot
and exposes the result through an onSample callback. Stops itself when
the reader returns null (PID gone) or throws.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
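
A minimal sketch of the per-slot sampler shape this commit describes: an interval in the parent, an injectable reader, an `onSample` callback, and self-stop when the reader returns null or throws. All type and option names other than `readProcStat`, `onSample`, and `ExternalSample` are assumptions, and the reader is synchronous here for brevity.

```typescript
// Sketch only: field names and the factory signature are illustrative assumptions.
interface ExternalSample { rssBytes: number, cpuRatio: number | null, atMs: number }
interface ProcReading { rssBytes: number, cpuTicks: number }

interface SamplerOptions {
  pid: number
  intervalMs: number
  clockTicksPerSec: number
  readProcStat: (pid: number) => ProcReading | null // injectable for tests
  onSample: (sample: ExternalSample) => void
  now?: () => number
}

function createExternalSampler (opts: SamplerOptions) {
  const now = opts.now ?? (() => Date.now())
  let timer: ReturnType<typeof setInterval> | null = null
  let prev: { cpuTicks: number, atMs: number } | null = null

  const stop = (): void => {
    if (timer !== null) clearInterval(timer)
    timer = null
  }

  const tick = (): void => {
    let reading: ProcReading | null
    try {
      reading = opts.readProcStat(opts.pid)
    } catch {
      reading = null
    }
    // PID gone or read failure: the sampler stops itself.
    if (reading === null) { stop(); return }
    const atMs = now()
    // The first tick only establishes a baseline, so its cpuRatio is null.
    let cpuRatio: number | null = null
    if (prev !== null && atMs > prev.atMs && opts.clockTicksPerSec > 0) {
      const elapsedSec = (atMs - prev.atMs) / 1000
      cpuRatio = (reading.cpuTicks - prev.cpuTicks) / opts.clockTicksPerSec / elapsedSec
    }
    prev = { cpuTicks: reading.cpuTicks, atMs }
    opts.onSample({ rssBytes: reading.rssBytes, cpuRatio, atMs })
  }

  const start = (): void => {
    if (timer === null) timer = setInterval(tick, opts.intervalMs)
  }

  return { start, stop, tick, isRunning: () => timer !== null }
}
```

Exposing `tick` alongside `start` and `stop` lets a unit test drive the sampler deterministically with a fake reader and clock instead of waiting on real timers.
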
…only

Per code review: the default-singleton + lazy dynamic import path
silently swallowed gauge-write errors and existed only to dodge a
config/mongo load chain at test time. Drop the singleton and the
lazy path; the only consumer (worker.ts, Task 6) will build the
factory with an explicit updateGauge injection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds df_processings_process_cpu_usage_ratio (gauge, labels kind/slot)
and updateTaskExternalGauges. Introduces externalRssActive boot toggle:
when true, the in-process updateTaskMemoryGauges skips rssGauge so the
external sampler is the sole RSS authority for kind="task".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per code review and the design spec: writing only when cpuRatio !== null
leaked the previous run's CPU% into the first seconds of a reused slot.
Use cpuRatio ?? 0 so a new run's baseline tick resets the gauge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
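
The effect of `cpuRatio ?? 0` can be sketched with a stand-in gauge. `updateTaskExternalGauges` is named in this PR, but its signature here and the minimal `Gauge` interface (modelled on a `set(labels, value)` style API) are assumptions.

```typescript
// Sketch only: Gauge is a minimal stand-in, not the real metrics client type.
interface Labels { kind: string, slot: string }
interface Gauge { set: (labels: Labels, value: number) => void }

function updateTaskExternalGauges (
  rssGauge: Gauge,
  cpuGauge: Gauge,
  slot: string,
  sample: { rssBytes: number, cpuRatio: number | null }
): void {
  const labels: Labels = { kind: 'task', slot }
  rssGauge.set(labels, sample.rssBytes)
  // The baseline tick of a new run has cpuRatio === null. Writing 0 instead of
  // skipping the write overwrites the value left behind by the slot's previous run.
  cpuGauge.set(labels, sample.cpuRatio ?? 0)
}
```
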
Default: true. Bound to WORKER_TASK_EXTERNAL_SAMPLER_ENABLED with JSON
parsing so 'false' is correctly coerced to a boolean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
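
The reason for JSON parsing is that environment variables are always strings, and any non-empty string, including `'false'`, is truthy. A minimal sketch of the coercion follows; `parseBooleanEnv` is a hypothetical helper, and the PR's config layer may achieve the same result through its own env-binding mechanism.

```typescript
// Hypothetical helper: coerce an env string to a boolean via JSON.parse,
// falling back to a default when the variable is unset or empty.
function parseBooleanEnv (raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw.trim() === '') return fallback
  let parsed: unknown
  try {
    parsed = JSON.parse(raw)
  } catch {
    throw new Error(`expected "true" or "false", got ${JSON.stringify(raw)}`)
  }
  if (typeof parsed !== 'boolean') {
    throw new Error(`expected a JSON boolean, got ${JSON.stringify(raw)}`)
  }
  return parsed
}
```

With this, `parseBooleanEnv('false', true)` yields `false`, whereas the naive `Boolean('false')` yields `true`.
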
Starts a per-slot proc-stat sampler immediately after spawn, stops it
on close and on spawn error. Passes the last external sample through
to diagnoseExit (signature updated in the next commit). Activates
setExternalRssActive at boot when the platform supports /proc and the
feature flag is on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diagnoseExit now accepts a lastExt parameter (the most recent
ExternalSample) and weaves "Last seen RSS (external)" plus optional
"CPU usage (external)" lines into oom-host, oom-heap, plugin-error,
and unknown diagnostics. French run-log text mirrors the English ops
message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per code review:
- Add two tests asserting userMessage contains the French external-RSS
  and CPU usage lines (or omits the CPU line when cpuRatio is null) —
  extLinesFr was previously untested.
- Refresh the plugin-error comment to reflect admin/user divergence.
- Harmonize the unknown branch with filter(Boolean) to match plugin-error
  and defend against trimmed-empty-string concatenation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New fixture processing-cpu-leak busy-loops in 200ms bursts while
allocating moderate JS heap, then asserts the per-slot Prometheus
gauges (RSS, CPU ratio) reflect activity — proving the parent-side
sampler kept ticking through child event-loop saturation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an "External sampler" subsection covering the procfs reader, CPU
math, lifecycle, and failure modes. Updates the Metrics table to add
df_processings_process_cpu_usage_ratio and clarifies that the per-slot
RSS gauge is parent-observed when the external sampler is active.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…opt-out

- oom-host admin/user messages: external RSS/CPU is primary (the
  in-process sampler is what gets starved in CPU-bound OOM scenarios).
  Swap rendering order: extLines BEFORE memLine. Tests pin the order.
- worker.ts: log a one-liner when the operator explicitly disables the
  sampler via WORKER_TASK_EXTERNAL_SAMPLER_ENABLED=false. Previously
  only enabled/platform-disabled were announced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
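
The rendering order described above, external lines before the possibly stale child sample and empty entries dropped with `filter(Boolean)`, can be sketched as follows. The line labels come from this PR; `extLines` as a function, `renderOomHost`, and the number formatting are assumptions.

```typescript
// Sketch only: function names and value formatting are illustrative assumptions.
interface ExternalSample { rssBytes: number, cpuRatio: number | null }

function extLines (lastExt: ExternalSample | null): string[] {
  if (lastExt === null) return []
  const mib = (lastExt.rssBytes / 1024 / 1024).toFixed(1)
  const lines = [`Last seen RSS (external): ${mib} MiB`]
  // The CPU line is optional: it is omitted when no ratio has been computed yet.
  if (lastExt.cpuRatio !== null) {
    lines.push(`CPU usage (external): ${(lastExt.cpuRatio * 100).toFixed(0)}%`)
  }
  return lines
}

function renderOomHost (headline: string, lastExt: ExternalSample | null, memLine: string): string {
  // External lines come first; filter(Boolean) drops empty strings so a
  // missing or whitespace-only memLine does not leave a blank line behind.
  return [headline, ...extLines(lastExt), memLine.trim()].filter(Boolean).join('\n')
}
```
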
These were working artifacts for the external-sampler implementation and
don't belong alongside the human-facing docs/architecture content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
albanm merged commit 8da95ba into master on May 27, 2026
4 checks passed
albanm deleted the perf-task-resource-metrics branch on May 27, 2026 14:50