Sampler choice by asmacdo · Pull Request #415 · con/duct

asmacdo · 2026-04-24T01:41:54Z

Proof of Concept

Heavily WIP, lots remain to do.
Heavily vibe coded, lots of human review needed.
Built on top of the duct resource stats interpretation commit, but that is a separate PR, and can be merged independently.

Overreporting / Underreporting tests

Some of the added tests use systemd to directly run in isolation so we can accurately separate sample signal from host noise. The ps tests dont need this because they get the isolation with sessions. These results are output to csvs, with the intention of putting those results into the docs to help users pick the sampler that is appropriate for their use case. (currently have n/a for unwritten test, expect those to pass for both samplers)

Current sampler (ps):
https://github.com/asmacdo/duct/blob/dfd2702246dab94d1bf74c13589f3fc9a649b226/test/sampler_matrix_ps.csv

New optional sampler (cgroups hybrid with ps, linux only)
https://github.com/asmacdo/duct/blob/dfd2702246dab94d1bf74c13589f3fc9a649b226/test/sampler_matrix_cgroup-ps-hybrid.csv

The underlying claim

is that on linux systems that support cgroups, we can get significantly more accurate reporting, that should be very aligned with the monitoring that SLURM does, so we can get accurate readings to predict necessary resource requests on those jobs.

TODOs:

Make edits to the stats interpretation doc to show how this alternative sampler stats should be read
finish up todos in the context
discuss with yarik: is this heavy solution really required?
should this be broken up? design doc first, Add the CSV generator & tests?
close WIP: Validate resource stats match real usage #403 if this one moves forward (this absorbed those tests)
improve this PR description ;)

Explain the semantics of each field (`pcpu`, `rss`, `pmem`, `vsz`) under duct's current `ps`-based sampler, where the numbers are reliable, and where they mislead. Motivated by recurring confusion: `peak_pcpu` exceeding `Ncores × 100%` (issue con#399), `peak_vsz` reaching terabytes on workloads with only GBs of real memory, and `total_*` disagreeing with `sum(per-pid)` in aggregated records. All are correct behavior of `ps` semantics and duct's max-reduction aggregation — not bugs — but there was no reference explaining why. Scope is current behavior only; alternative samplers (psutil, cgroup) will be documented separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

No behavior change. Existing platform dispatch (_get_sample_linux / _get_sample_mac) stays as private helpers; PsSampler.sample() delegates to them. Report takes an optional sampler kwarg that defaults to PsSampler(), so every existing caller is unaffected. First step of the multi-sampler POC: this commit only reshapes where the ps-based sampling lives. The CLI flag, env var, and alternative sampler come in follow-ups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wire a --sampler={ps,cgroup-ps-hybrid} flag (default ps, DUCT_SAMPLER env) through execute() to Report. The ps path is identical to before; cgroup-ps-hybrid raises NotImplementedError (real impl comes later). Every usage.jsonl record and info.json now carries a "sampler" field indicating which sampler produced it. Additive only -- schema_version bump and a pre-tag translation layer are tracked as Open Questions for production follow-up (see docs/design/multiple-samplers.md, landing in a later commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cherry-picks the substantive work from the upstream branch claude/validate-resource-stats-uK9NV (commits a60a852, 0128f85) plus its lint/type fixups (6645c90, 8a59ad7, c1c0432), squashed to one commit since all five were Claude-authored upstream too. Drops test_total_rss_is_sum_of_processes from the pickup: per-sample total_rss-equals-sum-of-pid-rss is an artifact of the ps sampler and will not hold for cgroup-based samplers (see docs/resource-statistics.md Peak vs. average for the broader invariant-rejection rationale). Adds: - test/data/memory_children.py (multiprocessing helper) - test/duct_main/test_resource_validation.py (8 tests: memory allocation, wall-clock, CPU idle/busy, samples structure, child process memory, per-process tracking) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New test/data/workloads/ holds deterministic, standalone-runnable workload scripts for the sampler matrix tests (landing in a follow-up). Each script's docstring states its ground-truth contract so matrix assertions can reference it directly. - steady_cpu.py: one-core-pinned busy-loop; peak_pcpu ~= 100% - alloc_memory.py: bytearray(N MB), hold D seconds; peak_rss >= N MB README.md documents the catalog, points at test/data/memory_children.py (from PR con#403) as an additional matrix workload living outside this directory, and carries TODOs for: - a spikey multi-core CPU-burst workload (the con#399 anchor; planned but deferred out of this commit) - consolidating workload locations and overlap with test/data/test_script.py once the POC direction lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Plumbing for the sampler matrix -- no tests use it yet (checkpoints 6/8 populate the cells). Cells are simple pass/fail for now; richer status vocabulary deferred. - @pytest.mark.sampler_matrix(workload=X, sampler=Y): marks a test as one cell of the matrix. - test/conftest.py: session-start clears stale results; a pytest_runtest_makereport hook appends {workload, sampler, status} to .sampler_matrix_results.jsonl. - scripts/gen_sampler_matrix.py: pivots the JSONL into test/sampler_matrix.csv (rows=workload, cols=sampler, cells=pass/ fail/(not yet tested)). The CSV is committed; fixed column order keeps diffs stable. - test/sampler_matrix.csv committed in header-only state; checkpoints 6 and 8 fill in ps and cgroup-ps-hybrid columns respectively. CI step to enforce `git diff --exit-code` drift of the CSV is deferred (still within this PR, just not this commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reshapes checkpoint 5's sampler-matrix infra around three ideas: - Per-sampler CSVs (test/sampler_matrix_<sampler>.csv) with rows=workload, columns=property, cells=pass/fail/n/a. Easier to read as a per-sampler capability card than one pivoted CSV. - Marker carries (sampler, workload, property, expected) -- the test declares what property it probes AND whether we currently expect it to hold. - expected="fail" auto-dispatches to xfail(strict=True) via a pytest_collection_modifyitems hook. Known sampler limitations stay committed as expected-fail cells; the suite stays green as long as the matrix reflects reality; if a sampler improves, the xpass surfaces it noisily so we update the committed expectation. The conftest records the *actual* assertion outcome (from call.excinfo) rather than pytest's post-xfail interpretation, so the CSV reflects what the sampler did, not what we hoped. Adds test/duct_main/test_sampler_matrix.py with four ps-column tests: - alloc_memory/peak_rss_reaches_alloc: pass - memory_children/peak_rss_reaches_alloc: pass - memory_children/peak_rss_no_overcount: fail [con#399 anchor] - steady_cpu/peak_pcpu_reaches_floor: pass The memory_children cell is the con#399 story: ps double-counts shared library pages per PID, so peak_rss exceeds a realistic physical- memory ceiling. cgroup-ps-hybrid (checkpoint 7/8) should flip that cell to pass. File carries a TODO to consolidate with test_resource_validation.py once the POC direction lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

First working implementation of --sampler=cgroup-ps-hybrid. The --sampler flag now actually does something in addition to tagging records. The ps code path is unchanged. Shape: - Per-pid stats still come from the ps sub-sampler (gives us the ``stats: {pid: ProcessStats}`` dict users expect in usage.jsonl). - Session totals are cgroup-sourced: ``total_rss`` from ``memory.current``, ``total_pcpu`` from a delta of ``cpu.stat.usage_usec`` over the last sample interval (stateful - sampler remembers the previous reading). - ``total_vsz`` and ``total_pmem`` remain ps-sourced -- cgroup v2 has no direct analogs (memory.current is already physical memory, and vsz is per-process by definition). - Reader mode only. Duct does NOT create a cgroup; it reads the one it already lives in via /proc/self/cgroup. Constraints enforced at startup: - cgroup v2 required (refuse on v1 with NotImplementedError pointing at /sys/fs/cgroup/cgroup.controllers). - --mode=current-session required; --mode=new-session + cgroup-ps- hybrid raises ValueError before log paths are created. Type plumbing: added a ``Sampler`` Union alias in _sampling.py so the factory return + Report.sampler annotation can refer to either concrete class without introducing an ABC. TODO(poc) markers in code: - Polling shape (sample-at-interval + Sample.aggregate max) is a ps- shaped compromise; cgroup could emit cumulative/delta directly. - End-of-run overwrite of full_run_stats.total_rss with memory.peak would give a truer peak than max-of-currents. - Assumes the tracked command stays in duct's cgroup; systemd-run or similar migrations would silently break the measurement. Zero schema changes: the ``sampler`` tag added in checkpoint 2 is already the source disambiguator on each record. Matrix tests for the cgroup column are deferred to checkpoint 8 -- those need a scoped cgroup (SLURM job, container with cgroup ns, or systemd unit) to assert meaningful behavior; root-cgroup readings reflect the whole host and would make assertions trivially pass or fail by accident. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Populates the cgroup-ps-hybrid column of the sampler matrix with real data. Each cgroup matrix test spawns duct in a transient ``systemd-run --user --scope --collect`` unit, so the cgroup CgroupSampler reads is dedicated to just ``duct + workload``. Without the scope, duct's ambient cgroup (a login user slice, a non-ns container, etc.) contains megabytes to gigabytes of unrelated memory and ceilings become meaningless. Four cells, same workloads/properties as the ps column: alloc_memory/peak_rss_reaches_alloc = pass memory_children/peak_rss_reaches_alloc = pass memory_children/peak_rss_no_overcount = pass <-- the flip steady_cpu/peak_pcpu_reaches_floor = pass The memory_children cell flipping fail -> pass across the two samplers is the con#399 story captured in the matrix: ps per-pid sum overcounts shared library pages; cgroup v2 memory.current counts each physical page once. Opt-in via new ``@pytest.mark.cgroup_matrix`` + ``--cgroup-matrix`` pytest option. Default runs skip these tests (they require systemd-run --user and a writable /sys/fs/cgroup v2). To populate or refresh the CSV: .tox/py312/bin/python -m pytest -o addopts= \\ --cgroup-matrix test/duct_main/test_sampler_matrix.py python3 scripts/gen_sampler_matrix.py CI integration for the opt-in tests is TODO: likely a separate GitHub Actions job with podman-or-systemd-run setup; flagged in the PR description. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the canonical regeneration recipe (pytest --cgroup-matrix + the JSONL-to-CSV pivot) as a checked-in script, so it can be invoked reproducibly via: datalad run scripts/regen_matrix.sh which records the command, inputs, and outputs in the commit metadata. Works in plain git too (provenance is just missing). Deletes the previously-committed test/sampler_matrix_cgroup-ps-hybrid.csv so the next datalad-run regeneration creates it fresh (with provenance). The ps CSV stays as-is (already committed from checkpoint 6; a datalad-run refresh will update it in place if any ps cell changes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

=== Do not change lines below === { "chain": [], "cmd": "scripts/regen_matrix.sh", "exit": 0, "extra_inputs": [], "inputs": [], "outputs": [ "test/sampler_matrix_ps.csv", "test/sampler_matrix_cgroup-ps-hybrid.csv" ], "pwd": "." } ^^^ Do not change lines above ^^^

Schema: marker kwargs change from ``property=<name>`` to ``metric=<rss|pcpu>`` + ``direction=<underreport|overreport>``. Generator pivots to rows=``<workload>/<metric>`` and columns= ``no_<direction>``. Each cell now cleanly reports the sampler's ability to avoid a specific failure mode on a specific workload. Column layout is easier to scan as a capability card and avoids bespoke per-property column names. New workloads (test/data/workloads/): - ``ephemeral_cpu.py`` -- fork N short-lived parallel CPU workers. ps misses them (children die between samples, session empty at sample time). cgroup ``cpu.stat.usage_usec`` is cumulative and captures the work regardless. - ``spikey_cpu.py`` -- multi-threaded ``hashlib.pbkdf2_hmac`` (GIL-released) across N*M threads. Principled standalone equivalent of con#399's tox-C-extension-compile pathology. Triggers ps's Bug 1: lifetime cumulative ``cputime/elapsed`` inflates per worker, duct's per-pid sum compounds across workers. New matrix cells (ps / cgroup): ephemeral_cpu/pcpu no_underreport fail / pass spikey_cpu/pcpu no_overreport fail / pass (Linux only) Spikey is Linux-only (Bug 1 is ps-on-Linux-specific; Darwin's ps uses a decaying 1-min average). Spikey also skips when ``cpu_count >= workers * threads`` -- without thread oversubscription, ps reports stay near legitimate physical peak and the ceiling can't discriminate. Workload + ceiling are tuned for 4-12 core hosts. Cite: RESEARCH.md section 1.1 (pcpu fully broken); DEEP_DIVE_PROGRESS.md section 2 (Bug 1 confirmed in src/con_duct/_sampling.py). Distinct from Bug 2 (aggregation, xfailed in c017800). Matrix CSV regeneration via ``datalad run scripts/regen_matrix.sh`` lands as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Empirical results from host run showed two tunings needed: - Ephemeral cgroup floor 200% was too tight. Actual cgroup reading for 4 × 30ms parallel bursts over a 100ms sample window, with Python multiprocessing startup overhead included, comes in around 120-200%. Drop the floor to 80% -- still comfortably above ps's ~0% (which is what underreport-detection needs) while tolerating sample-window dilution and startup variance. - Spikey workload demand was 16 threads (4 workers x 4 threads), which on hosts with >= 16 cores skips via the oversubscription guard. Bump to 32 threads (4 workers x 8 threads) so the test can run on typical 8-16 core laptops and demonstrate Bug 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

=== Do not change lines below === { "chain": [], "cmd": "scripts/regen_matrix.sh", "exit": 0, "extra_inputs": [], "inputs": [], "outputs": [ "test/sampler_matrix_ps.csv", "test/sampler_matrix_cgroup-ps-hybrid.csv" ], "pwd": "." } ^^^ Do not change lines above ^^^

Revision of the aspirational design doc (originally drafted on the pcpu-overshoot-investigation branch) to reflect the POC as landed on branch sampler-choice. - Scope tightens to cgroup-ps-hybrid as the one additional sampler. psutil moves from a first-class backend to Future Directions (section 10.1), along with creator mode, memory.peak overwrite, pipeline-reshape for cgroup's cumulative nature, matrix completeness, CI wiring, and the placeholder-name decision. - Tense flips to present-indicative for implemented pieces: the sampler abstraction lives in _sampling.py, CgroupSampler does X, etc. Out-of-scope items stay future-tense in section 10. - New section 9 Schema Open Questions covers additive-tag trade-offs the POC deferred (schema_version bump, pre-tag-consumer compat, per-block source tagging, placeholder-name resolution). - New section 8 Results ships the two matrix CSVs as a capability card, with the three ps->cgroup flips called out as the POC claim. - ps pathologies in section 2 are described by mechanism rather than "Bug 1 / Bug 2" labels -- the cputime/elapsed lifetime ratio plus per-pid summing for CPU, shared-page per-pid summing for memory. Linux-only caveat for the CPU pathology called out (Darwin ps is a decaying 1-min average). - Section 12 annotates the original eight open questions as resolved / deferred / still-open. - Section 13 lists the landed commits so a reviewer can read the PR in order. This is the doc reviewers should open first. It documents what the POC actually demonstrates, what we deliberately left as future work, and why. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-04-24T01:44:02Z

Codecov Report

❌ Patch coverage is 39.39394% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.90%. Comparing base (2162da9) to head (9e105c8).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
src/con_duct/_sampling.py	30.90%	37 Missing and 1 partial ⚠️
src/con_duct/_duct_main.py	50.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #415      +/-   ##
==========================================
- Coverage   91.87%   88.90%   -2.97%     
==========================================
  Files          15       15              
  Lines        1120     1181      +61     
  Branches      139      149      +10     
==========================================
+ Hits         1029     1050      +21     
- Misses         69      107      +38     
- Partials       22       24       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces multiprocessing.Process with bare os.fork + os._exit so child lifetime collapses to ~work_ms without interpreter-reinit or spawn-method overhead. multiprocessing bloat defeated the "children die between samples" phenomenon on PyPy and Python 3.14 (forkserver default), causing test_ps_ephemeral_cpu_pcpu_no_underreport to xpass-strict on those runners while xfailing on CPython 3.10-3.13. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

asmacdo and others added 15 commits April 23, 2026 13:15

asmacdo and others added 2 commits April 24, 2026 07:34

docs: link sampler capability matrix from resource-statistics

7de046a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sampler choice#415

Sampler choice#415
asmacdo wants to merge 17 commits intocon:mainfrom
asmacdo:sampler-choice

asmacdo commented Apr 24, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Apr 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

asmacdo commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proof of Concept

Overreporting / Underreporting tests

The underlying claim

Uh oh!

codecov Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

asmacdo commented Apr 24, 2026 •

edited

Loading

codecov Bot commented Apr 24, 2026 •

edited

Loading