feat(worker): /proc-based external sampler for per-task RSS and CPU #92
Merged
Spec for moving per-task RSS/CPU metrics collection from inside the task child (in-process df-mem reporter) to the parent worker process, so observation survives CPU-bound plugins saturating the child event loop. Linux-only via /proc, observe-only (no new kill behaviour), reuses WORKER_TASK_MEMORY_SAMPLE_INTERVAL_MS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ten-task plan covering procfs reader, per-slot sampler, metrics gauge wiring, worker.ts lifecycle integration, exit-code diagnostic extension, config knob + env binding, architecture docs, and a CPU-saturation e2e test. References spec doc 2026-05-27-...-design.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure string→number helpers. /proc/<pid>/stat parser slices from the last ')' to handle process names containing spaces, parens, or newlines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per code review: the /^(\d+)\s+kB$/ capture guarantees a digit string, so Number() is always finite — the guard was dead code. Adds a parseStatFields test for stat lines with fewer than 13 fields after the closing ')'. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
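For readers following the two parser commits above, the idea can be sketched as follows. This is a minimal illustration, not the code in proc-stat.ts: the function name parseVmRssBytes and both return shapes are assumptions. Only parseStatFields, the /^(\d+)\s+kB$/ capture, the "slice from the last ')'" rule, and the 13-field minimum come from the commit messages.

```typescript
// Hypothetical sketch; names and return shapes are assumptions, not the PR's code.

/** Extract VmRSS from /proc/<pid>/status text; returns bytes, or null if absent. */
function parseVmRssBytes(statusText: string): number | null {
  for (const line of statusText.split("\n")) {
    if (!line.startsWith("VmRSS:")) continue;
    const match = /^(\d+)\s+kB$/.exec(line.slice("VmRSS:".length).trim());
    // The capture is digits only, so Number() is always finite here.
    return match ? Number(match[1]) * 1024 : null;
  }
  return null; // no VmRSS line found
}

/** Extract utime/stime (in clock ticks) from /proc/<pid>/stat text. */
function parseStatFields(statText: string): { utime: number; stime: number } | null {
  // The comm field is wrapped in parentheses and may itself contain spaces,
  // parens, or newlines, so everything up to the LAST ')' is skipped.
  const close = statText.lastIndexOf(")");
  if (close === -1) return null;
  const fields = statText.slice(close + 1).trim().split(/\s+/);
  // After ')': index 0 is state (stat field 3), so utime (field 14) is
  // index 11 and stime (field 15) is index 12. Fewer than 13 fields is malformed.
  if (fields.length < 13) return null;
  const utime = Number(fields[11]);
  const stime = Number(fields[12]);
  return Number.isFinite(utime) && Number.isFinite(stime) ? { utime, stime } : null;
}
```

Slicing from the last ')' rather than splitting on whitespace is what keeps a process named, say, "my (weird) name" from shifting every field index.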
Adds readProcStat (fs-backed), computeCpuRatio (delta math), isSupported (/proc gate). CLK_TCK is detected once via 'getconf CLK_TCK' with a fallback to 100. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per code review:
- computeCpuRatio guards against clockTicksPerSec <= 0 (was a theoretical Infinity path if a future caller passed 0).
- detectClockTicksPerSec logs an info line on fallback so operators in minimal containers can tell why CLK_TCK is 100.
- Adds a non-Linux test for readProcStat returning null, and a clockTicksPerSec=0 test for computeCpuRatio.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
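The delta math behind computeCpuRatio can be sketched roughly like this. The snapshot shape, the function name computeCpuRatioSketch, and the negative-delta guard are assumptions made for illustration; the clockTicksPerSec <= 0 guard and the "1.0 = one full core" convention come from this PR.

```typescript
// Hypothetical sketch; the real computeCpuRatio signature may differ.
interface CpuSnapshot {
  totalTicks: number; // utime + stime, in clock ticks
  atMs: number;       // wall-clock time of the read, in milliseconds
}

function computeCpuRatioSketch(
  prev: CpuSnapshot | null,
  curr: CpuSnapshot,
  clockTicksPerSec: number,
): number | null {
  if (prev === null) return null;          // baseline tick: nothing to diff against
  if (clockTicksPerSec <= 0) return null;  // avoids an Infinity/NaN result
  const elapsedSec = (curr.atMs - prev.atMs) / 1000;
  if (elapsedSec <= 0) return null;
  const deltaTicks = curr.totalTicks - prev.totalTicks;
  if (deltaTicks < 0) return null;         // assumption: counter went backwards, discard
  // CPU seconds consumed divided by wall seconds elapsed; 1.0 = one full core.
  return deltaTicks / clockTicksPerSec / elapsedSec;
}
```

With CLK_TCK = 100, a child that accumulates 200 ticks over a 2000 ms interval has used 2 CPU-seconds in 2 wall-seconds, giving a ratio of 1.0.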
Owns a setInterval per slot in the parent, snapshots /proc/<pid> via the injectable readProcStat, computes CPU ratio against a previous snapshot and exposes the result through an onSample callback. Stops itself when the reader returns null (PID gone) or throws. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…only
Per code review: the default-singleton + lazy dynamic import path silently swallowed gauge-write errors and existed only to dodge a config/mongo load chain at test time. Drop the singleton and the lazy path; the only consumer (worker.ts, Task 6) will build the factory with an explicit updateGauge injection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
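The per-slot sampler described in the two commits above could look roughly like the following. This is a sketch under stated assumptions: the factory name, the option and snapshot shapes, the synchronous reader, and the exposed tick() (included so the logic can be exercised without waiting on a timer) are all hypothetical. The behaviours taken from the commit messages are the injectable readProcStat, the onSample callback, the diff against the previous snapshot, and stopping when the reader returns null or throws.

```typescript
// Hypothetical sketch; not the code in external-sampler.ts.
interface ProcSnapshot { rssBytes: number; cpuTicks: number; atMs: number }
interface ExternalSampleSketch { rssBytes: number; cpuRatio: number | null }

interface SamplerOptions {
  pid: number;
  intervalMs: number;
  clockTicksPerSec: number;
  readProcStat: (pid: number) => ProcSnapshot | null; // injected, assumed synchronous
  onSample: (sample: ExternalSampleSketch) => void;
}

function createExternalSamplerSketch(opts: SamplerOptions) {
  let timer: ReturnType<typeof setInterval> | null = null;
  let prev: ProcSnapshot | null = null;

  const stop = (): void => {
    if (timer !== null) clearInterval(timer);
    timer = null;
  };

  const tick = (): void => {
    let snap: ProcSnapshot | null;
    try {
      snap = opts.readProcStat(opts.pid);
    } catch {
      snap = null; // a throwing reader is treated like a vanished PID
    }
    if (snap === null) {
      stop();
      return;
    }
    let cpuRatio: number | null = null; // stays null on the baseline tick
    if (prev !== null && snap.atMs > prev.atMs && opts.clockTicksPerSec > 0) {
      const cpuSec = (snap.cpuTicks - prev.cpuTicks) / opts.clockTicksPerSec;
      cpuRatio = cpuSec / ((snap.atMs - prev.atMs) / 1000);
    }
    prev = snap;
    opts.onSample({ rssBytes: snap.rssBytes, cpuRatio });
  };

  const start = (): void => {
    if (timer === null) timer = setInterval(tick, opts.intervalMs);
  };

  return { start, stop, tick, isRunning: (): boolean => timer !== null };
}
```

Because both the reader and the callback are injected, a test can feed canned snapshots and collect samples without touching /proc, config, or mongo, which is the point of dropping the singleton.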
Adds df_processings_process_cpu_usage_ratio (gauge, labels kind/slot) and updateTaskExternalGauges. Introduces externalRssActive boot toggle: when true, the in-process updateTaskMemoryGauges skips rssGauge so the external sampler is the sole RSS authority for kind="task". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per code review and the design spec: writing only when cpuRatio !== null leaked the previous run's CPU% into the first seconds of a reused slot. Use cpuRatio ?? 0 so a new run's baseline tick resets the gauge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
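A small illustration of why cpuRatio ?? 0 matters on a reused slot. The Map-backed FakeGauge and the function name writeExternalSampleSketch are stand-ins invented for this sketch; the real updateTaskExternalGauges signature is not shown in this PR description.

```typescript
// Hypothetical sketch; a Map stands in for a labelled Prometheus gauge.
interface ExternalSampleLike { rssBytes: number; cpuRatio: number | null }
type FakeGauge = Map<string, number>; // slot label -> last written value

function writeExternalSampleSketch(
  rss: FakeGauge,
  cpu: FakeGauge,
  slot: string,
  sample: ExternalSampleLike,
): void {
  rss.set(slot, sample.rssBytes);
  // The baseline tick of a new run has cpuRatio === null. Writing 0 instead of
  // skipping the write overwrites whatever the previous run left in this slot.
  cpu.set(slot, sample.cpuRatio ?? 0);
}
```

If the write were skipped when cpuRatio is null, a slot whose previous run ended at 0.9 would keep exporting 0.9 until the new run's second tick.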
Default: true. Bound to WORKER_TASK_EXTERNAL_SAMPLER_ENABLED with JSON parsing so 'false' is correctly coerced to a boolean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
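The reason JSON parsing is needed here: environment variables are always strings, and any non-empty string, including 'false', is truthy in JavaScript. The helper below is a hypothetical illustration of the coercion, not the config binding this PR actually uses; its name and fallback behaviour are assumptions.

```typescript
// Hypothetical helper; the PR binds the variable through its config layer instead.
function parseBooleanEnvSketch(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw.trim() === "") return fallback;
  try {
    const parsed: unknown = JSON.parse(raw);
    // Only a real JSON boolean is accepted; anything else falls back.
    return typeof parsed === "boolean" ? parsed : fallback;
  } catch {
    return fallback; // not valid JSON, e.g. "nope"
  }
}

// Naive coercion gets this wrong: Boolean("false") is true.
const samplerEnabled = parseBooleanEnvSketch(
  process.env.WORKER_TASK_EXTERNAL_SAMPLER_ENABLED,
  true, // default from this PR
);
```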
Starts a per-slot proc-stat sampler immediately after spawn, stops it on close and on spawn error. Passes the last external sample through to diagnoseExit (signature updated in the next commit). Activates setExternalRssActive at boot when the platform supports /proc and the feature flag is on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diagnoseExit now accepts a lastExt parameter (the most recent ExternalSample) and weaves "Last seen RSS (external)" plus optional "CPU usage (external)" lines into oom-host, oom-heap, plugin-error, and unknown diagnostics. French run-log text mirrors the English ops message. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per code review:
- Add two tests asserting userMessage contains the French external-RSS and CPU usage lines (or omits the CPU line when cpuRatio is null); extLinesFr was previously untested.
- Refresh the plugin-error comment to reflect admin/user divergence.
- Harmonize the unknown branch with filter(Boolean) to match plugin-error and defend against trimmed-empty-string concatenation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New fixture processing-cpu-leak busy-loops in 200ms bursts while allocating moderate JS heap; the e2e test then asserts that the per-slot Prometheus gauges (RSS, CPU ratio) reflect activity, proving the parent-side sampler kept ticking through child event-loop saturation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an "External sampler" subsection covering the procfs reader, CPU math, lifecycle, and failure modes. Updates the Metrics table to add df_processings_process_cpu_usage_ratio and clarifies that the per-slot RSS gauge is parent-observed when the external sampler is active. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…opt-out
- oom-host admin/user messages: external RSS/CPU is primary (the in-process sampler is what gets starved in CPU-bound OOM scenarios). Swap rendering order: extLines BEFORE memLine. Tests pin the order.
- worker.ts: log a one-liner when the operator explicitly disables the sampler via WORKER_TASK_EXTERNAL_SAMPLER_ENABLED=false. Previously only enabled/platform-disabled were announced.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These were working artifacts for the external-sampler implementation and don't belong alongside the human-facing docs/architecture content. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a parent-side /proc-based resource sampler that runs once per task slot, so the worker keeps reporting per-task RSS and CPU usage even when the child's in-process df-mem sampler is starved by a CPU-bound plugin.
What changed:
- worker/src/utils/proc-stat.ts reads VmRSS from /proc/<pid>/status and utime/stime from /proc/<pid>/stat; computes a CPU ratio (1.0 = one full core).
- worker/src/task/external-sampler.ts runs the reader per slot at WORKER_TASK_MEMORY_SAMPLE_INTERVAL_MS, started in iter() after child.spawn() and stopped in the close/error handlers.
- The external sampler now owns the df_processings_process_resident_memory_bytes{kind="task"} gauge; the in-process df-mem writer suppresses RSS when the sampler is active.
- New gauge df_processings_process_cpu_usage_ratio{kind="task",slot=…}.
- diagnoseExit now takes a lastExt: ExternalSample | null and renders the external RSS/CPU lines before the (possibly stale) child sample in oom-host diagnostics; the French user message gets the same treatment.
- New config knob WORKER_TASK_EXTERNAL_SAMPLER_ENABLED (default true, auto-disabled at boot on non-Linux).
- Unit tests for proc-stat, external-sampler, and the updated exit-code; one e2e (memory-oom.e2e.spec.ts) driving a new processing-cpu-leak fixture that busy-loops the child while the parent keeps reading /proc.
- Exit-time df-mem debug write, plus a text-warning/text-medium-emphasis color fix in the run-logs list (commit 178204b).

Why: CPU-saturated tasks make the in-process df-mem sampler go stale, so the lastMem attached to an oom-host diagnostic could be many seconds old at the moment of the OS kill. Parent-side /proc reading sidesteps the child's event loop entirely.
Regression risks:
- diagnoseExit signature gained lastExt (5th arg). In-tree callers + tests are updated; flag for any out-of-tree consumers.
- memReporter.stop() is now async and is awaited in task.ts's finally. The exit-time debug write now blocks task tear-down until the mongo log write resolves. A wedged mongo will stall task cleanup; consider a short timeout around the awaited debug call.
- worker.task.externalSamplerEnabled is required in config/type/schema.json. Deployments with an explicit pinned config will need to add the key (default is true in default.mjs).
- If the external sampler stops mid-run (e.g. a /proc read fails), the in-process writer is not re-enabled: the RSS gauge for that slot freezes at the last external sample until the next run. Acceptable per the design, but undocumented in the "Failure modes" section.
- The e2e CPU-ratio assertion (> 0.1) may flake on heavily-loaded CI. RSS-fallback assertion mitigates.
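
One possible shape for the "short timeout around the awaited debug call" suggested in the memReporter.stop() risk above. This is a sketch, not something this PR implements: the helper name settleWithin, the outcome labels, and the 2000 ms figure in the usage note are all assumptions.

```typescript
// Hypothetical helper; bounds how long tear-down waits on a best-effort write.
type Settled = "done" | "failed" | "timeout";

function settleWithin(work: Promise<unknown>, timeoutMs: number): Promise<Settled> {
  return new Promise<Settled>((resolve) => {
    // If the timer fires first, tear-down proceeds; the write is left to
    // finish (or hang) in the background.
    const timer = setTimeout(() => resolve("timeout"), timeoutMs);
    const finish = (outcome: Settled): void => {
      clearTimeout(timer);
      resolve(outcome); // a second resolve after a timeout is a no-op
    };
    // Rejections are reported as "failed" rather than thrown, so a failed
    // debug write cannot break the caller's finally block.
    work.then(() => finish("done"), () => finish("failed"));
  });
}
```

In task.ts's finally this might read `const outcome = await settleWithin(memReporter.stop(), 2000);`, logging when the outcome is not "done". The trade-off is that a timed-out debug write may be lost, which seems acceptable for a diagnostic line but is a judgment call for the maintainers.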