Skip to content

Sampler choice#415

Draft
asmacdo wants to merge 17 commits intocon:mainfrom
asmacdo:sampler-choice
Draft

Sampler choice#415
asmacdo wants to merge 17 commits intocon:mainfrom
asmacdo:sampler-choice

Conversation

@asmacdo
Copy link
Copy Markdown
Member

@asmacdo asmacdo commented Apr 24, 2026

Proof of Concept

  • Heavily WIP, lots remain to do.
  • Heavily vibe coded, lots of human review needed.
  • Built on top of the duct resource stats interpretation commit, but that is a separate PR, and can be merged independently.

Overreporting / Underreporting tests

Some of the added tests use systemd to directly run in isolation so we can accurately separate sample signal from host noise. The ps tests dont need this because they get the isolation with sessions. These results are output to csvs, with the intention of putting those results into the docs to help users pick the sampler that is appropriate for their use case. (currently have n/a for unwritten test, expect those to pass for both samplers)

Current sampler (ps):
https://github.com/asmacdo/duct/blob/dfd2702246dab94d1bf74c13589f3fc9a649b226/test/sampler_matrix_ps.csv

New optional sampler (cgroups hybrid with ps, linux only)
https://github.com/asmacdo/duct/blob/dfd2702246dab94d1bf74c13589f3fc9a649b226/test/sampler_matrix_cgroup-ps-hybrid.csv

The underlying claim

is that on linux systems that support cgroups, we can get significantly more accurate reporting, that should be very aligned with the monitoring that SLURM does, so we can get accurate readings to predict necessary resource requests on those jobs.

TODOs:

  • Make edits to the stats interpretation doc to show how this alternative sampler stats should be read
  • finish up todos in the context
  • discuss with yarik: is this heavy solution really required?
  • should this be broken up? design doc first, Add the CSV generator & tests?
  • close WIP: Validate resource stats match real usage #403 if this one moves forward (this absorbed those tests)
  • improve this PR description ;)

asmacdo and others added 15 commits April 23, 2026 13:15
Explain the semantics of each field (`pcpu`, `rss`, `pmem`, `vsz`)
under duct's current `ps`-based sampler, where the numbers are
reliable, and where they mislead. Motivated by recurring confusion:
`peak_pcpu` exceeding `Ncores × 100%` (issue con#399), `peak_vsz`
reaching terabytes on workloads with only GBs of real memory, and
`total_*` disagreeing with `sum(per-pid)` in aggregated records.
All are correct behavior of `ps` semantics and duct's max-reduction
aggregation — not bugs — but there was no reference explaining why.

Scope is current behavior only; alternative samplers (psutil, cgroup)
will be documented separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No behavior change. Existing platform dispatch (_get_sample_linux /
_get_sample_mac) stays as private helpers; PsSampler.sample() delegates
to them. Report takes an optional sampler kwarg that defaults to
PsSampler(), so every existing caller is unaffected.

First step of the multi-sampler POC: this commit only reshapes where the
ps-based sampling lives. The CLI flag, env var, and alternative sampler
come in follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire a --sampler={ps,cgroup-ps-hybrid} flag (default ps, DUCT_SAMPLER
env) through execute() to Report. The ps path is identical to before;
cgroup-ps-hybrid raises NotImplementedError (real impl comes later).

Every usage.jsonl record and info.json now carries a "sampler" field
indicating which sampler produced it. Additive only -- schema_version
bump and a pre-tag translation layer are tracked as Open Questions for
production follow-up (see docs/design/multiple-samplers.md, landing in
a later commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cherry-picks the substantive work from the upstream branch
claude/validate-resource-stats-uK9NV (commits a60a852, 0128f85) plus
its lint/type fixups (6645c90, 8a59ad7, c1c0432), squashed to one
commit since all five were Claude-authored upstream too.

Drops test_total_rss_is_sum_of_processes from the pickup: per-sample
total_rss-equals-sum-of-pid-rss is an artifact of the ps sampler and
will not hold for cgroup-based samplers (see
docs/resource-statistics.md Peak vs. average for the broader
invariant-rejection rationale).

Adds:
- test/data/memory_children.py (multiprocessing helper)
- test/duct_main/test_resource_validation.py (8 tests: memory
  allocation, wall-clock, CPU idle/busy, samples structure, child
  process memory, per-process tracking)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New test/data/workloads/ holds deterministic, standalone-runnable
workload scripts for the sampler matrix tests (landing in a follow-up).
Each script's docstring states its ground-truth contract so matrix
assertions can reference it directly.

- steady_cpu.py: one-core-pinned busy-loop; peak_pcpu ~= 100%
- alloc_memory.py: bytearray(N MB), hold D seconds; peak_rss >= N MB

README.md documents the catalog, points at test/data/memory_children.py
(from PR con#403) as an additional matrix workload living outside this
directory, and carries TODOs for:
  - a spikey multi-core CPU-burst workload (the con#399 anchor; planned
    but deferred out of this commit)
  - consolidating workload locations and overlap with
    test/data/test_script.py once the POC direction lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plumbing for the sampler matrix -- no tests use it yet (checkpoints 6/8
populate the cells). Cells are simple pass/fail for now; richer status
vocabulary deferred.

- @pytest.mark.sampler_matrix(workload=X, sampler=Y): marks a test as
  one cell of the matrix.
- test/conftest.py: session-start clears stale results; a
  pytest_runtest_makereport hook appends {workload, sampler, status}
  to .sampler_matrix_results.jsonl.
- scripts/gen_sampler_matrix.py: pivots the JSONL into
  test/sampler_matrix.csv (rows=workload, cols=sampler, cells=pass/
  fail/(not yet tested)). The CSV is committed; fixed column order
  keeps diffs stable.
- test/sampler_matrix.csv committed in header-only state; checkpoints
  6 and 8 fill in ps and cgroup-ps-hybrid columns respectively.

CI step to enforce `git diff --exit-code` drift of the CSV is deferred
(still within this PR, just not this commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reshapes checkpoint 5's sampler-matrix infra around three ideas:

- Per-sampler CSVs (test/sampler_matrix_<sampler>.csv) with
  rows=workload, columns=property, cells=pass/fail/n/a. Easier to
  read as a per-sampler capability card than one pivoted CSV.
- Marker carries (sampler, workload, property, expected) -- the test
  declares what property it probes AND whether we currently expect it
  to hold.
- expected="fail" auto-dispatches to xfail(strict=True) via a
  pytest_collection_modifyitems hook. Known sampler limitations stay
  committed as expected-fail cells; the suite stays green as long as
  the matrix reflects reality; if a sampler improves, the xpass
  surfaces it noisily so we update the committed expectation.

The conftest records the *actual* assertion outcome (from
call.excinfo) rather than pytest's post-xfail interpretation, so the
CSV reflects what the sampler did, not what we hoped.

Adds test/duct_main/test_sampler_matrix.py with four ps-column
tests:
- alloc_memory/peak_rss_reaches_alloc: pass
- memory_children/peak_rss_reaches_alloc: pass
- memory_children/peak_rss_no_overcount: fail  [con#399 anchor]
- steady_cpu/peak_pcpu_reaches_floor: pass

The memory_children cell is the con#399 story: ps double-counts shared
library pages per PID, so peak_rss exceeds a realistic physical-
memory ceiling. cgroup-ps-hybrid (checkpoint 7/8) should flip that
cell to pass.

File carries a TODO to consolidate with test_resource_validation.py
once the POC direction lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First working implementation of --sampler=cgroup-ps-hybrid. The
--sampler flag now actually does something in addition to tagging
records. The ps code path is unchanged.

Shape:
- Per-pid stats still come from the ps sub-sampler (gives us the
  ``stats: {pid: ProcessStats}`` dict users expect in usage.jsonl).
- Session totals are cgroup-sourced: ``total_rss`` from
  ``memory.current``, ``total_pcpu`` from a delta of
  ``cpu.stat.usage_usec`` over the last sample interval (stateful -
  sampler remembers the previous reading).
- ``total_vsz`` and ``total_pmem`` remain ps-sourced -- cgroup v2 has
  no direct analogs (memory.current is already physical memory, and
  vsz is per-process by definition).
- Reader mode only. Duct does NOT create a cgroup; it reads the one
  it already lives in via /proc/self/cgroup.

Constraints enforced at startup:
- cgroup v2 required (refuse on v1 with NotImplementedError pointing
  at /sys/fs/cgroup/cgroup.controllers).
- --mode=current-session required; --mode=new-session + cgroup-ps-
  hybrid raises ValueError before log paths are created.

Type plumbing: added a ``Sampler`` Union alias in _sampling.py so the
factory return + Report.sampler annotation can refer to either
concrete class without introducing an ABC.

TODO(poc) markers in code:
- Polling shape (sample-at-interval + Sample.aggregate max) is a ps-
  shaped compromise; cgroup could emit cumulative/delta directly.
- End-of-run overwrite of full_run_stats.total_rss with memory.peak
  would give a truer peak than max-of-currents.
- Assumes the tracked command stays in duct's cgroup; systemd-run or
  similar migrations would silently break the measurement.

Zero schema changes: the ``sampler`` tag added in checkpoint 2 is
already the source disambiguator on each record.

Matrix tests for the cgroup column are deferred to checkpoint 8 --
those need a scoped cgroup (SLURM job, container with cgroup ns, or
systemd unit) to assert meaningful behavior; root-cgroup readings
reflect the whole host and would make assertions trivially pass or
fail by accident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Populates the cgroup-ps-hybrid column of the sampler matrix with
real data. Each cgroup matrix test spawns duct in a transient
``systemd-run --user --scope --collect`` unit, so the cgroup
CgroupSampler reads is dedicated to just ``duct + workload``.
Without the scope, duct's ambient cgroup (a login user slice, a
non-ns container, etc.) contains megabytes to gigabytes of
unrelated memory and ceilings become meaningless.

Four cells, same workloads/properties as the ps column:
  alloc_memory/peak_rss_reaches_alloc           = pass
  memory_children/peak_rss_reaches_alloc        = pass
  memory_children/peak_rss_no_overcount         = pass  <-- the flip
  steady_cpu/peak_pcpu_reaches_floor            = pass

The memory_children cell flipping fail -> pass across the two
samplers is the con#399 story captured in the matrix: ps per-pid sum
overcounts shared library pages; cgroup v2 memory.current counts
each physical page once.

Opt-in via new ``@pytest.mark.cgroup_matrix`` + ``--cgroup-matrix``
pytest option. Default runs skip these tests (they require
systemd-run --user and a writable /sys/fs/cgroup v2). To populate
or refresh the CSV:

    .tox/py312/bin/python -m pytest -o addopts= \\
        --cgroup-matrix test/duct_main/test_sampler_matrix.py
    python3 scripts/gen_sampler_matrix.py

CI integration for the opt-in tests is TODO: likely a separate
GitHub Actions job with podman-or-systemd-run setup; flagged in
the PR description.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the canonical regeneration recipe (pytest --cgroup-matrix + the
JSONL-to-CSV pivot) as a checked-in script, so it can be invoked
reproducibly via:

    datalad run scripts/regen_matrix.sh

which records the command, inputs, and outputs in the commit
metadata. Works in plain git too (provenance is just missing).

Deletes the previously-committed test/sampler_matrix_cgroup-ps-hybrid.csv
so the next datalad-run regeneration creates it fresh (with provenance).
The ps CSV stays as-is (already committed from checkpoint 6; a
datalad-run refresh will update it in place if any ps cell changes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "scripts/regen_matrix.sh",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "test/sampler_matrix_ps.csv",
  "test/sampler_matrix_cgroup-ps-hybrid.csv"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
Schema: marker kwargs change from ``property=<name>`` to
``metric=<rss|pcpu>`` + ``direction=<underreport|overreport>``.
Generator pivots to rows=``<workload>/<metric>`` and columns=
``no_<direction>``. Each cell now cleanly reports the sampler's
ability to avoid a specific failure mode on a specific workload.
Column layout is easier to scan as a capability card and avoids
bespoke per-property column names.

New workloads (test/data/workloads/):

- ``ephemeral_cpu.py`` -- fork N short-lived parallel CPU workers.
  ps misses them (children die between samples, session empty at
  sample time). cgroup ``cpu.stat.usage_usec`` is cumulative and
  captures the work regardless.
- ``spikey_cpu.py`` -- multi-threaded ``hashlib.pbkdf2_hmac``
  (GIL-released) across N*M threads. Principled standalone
  equivalent of con#399's tox-C-extension-compile pathology. Triggers
  ps's Bug 1: lifetime cumulative ``cputime/elapsed`` inflates per
  worker, duct's per-pid sum compounds across workers.

New matrix cells (ps / cgroup):

  ephemeral_cpu/pcpu no_underreport   fail / pass
  spikey_cpu/pcpu    no_overreport    fail / pass  (Linux only)

Spikey is Linux-only (Bug 1 is ps-on-Linux-specific; Darwin's ps
uses a decaying 1-min average). Spikey also skips when
``cpu_count >= workers * threads`` -- without thread
oversubscription, ps reports stay near legitimate physical peak and
the ceiling can't discriminate. Workload + ceiling are tuned for
4-12 core hosts.

Cite: RESEARCH.md section 1.1 (pcpu fully broken);
DEEP_DIVE_PROGRESS.md section 2 (Bug 1 confirmed in
src/con_duct/_sampling.py). Distinct from Bug 2 (aggregation,
xfailed in c017800).

Matrix CSV regeneration via ``datalad run scripts/regen_matrix.sh``
lands as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical results from host run showed two tunings needed:

- Ephemeral cgroup floor 200% was too tight. Actual cgroup reading
  for 4 × 30ms parallel bursts over a 100ms sample window, with
  Python multiprocessing startup overhead included, comes in around
  120-200%. Drop the floor to 80% -- still comfortably above ps's
  ~0% (which is what underreport-detection needs) while tolerating
  sample-window dilution and startup variance.
- Spikey workload demand was 16 threads (4 workers x 4 threads),
  which on hosts with >= 16 cores skips via the oversubscription
  guard. Bump to 32 threads (4 workers x 8 threads) so the test
  can run on typical 8-16 core laptops and demonstrate Bug 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "scripts/regen_matrix.sh",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "test/sampler_matrix_ps.csv",
  "test/sampler_matrix_cgroup-ps-hybrid.csv"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
Revision of the aspirational design doc (originally drafted on the
pcpu-overshoot-investigation branch) to reflect the POC as landed on
branch sampler-choice.

- Scope tightens to cgroup-ps-hybrid as the one additional sampler.
  psutil moves from a first-class backend to Future Directions
  (section 10.1), along with creator mode, memory.peak overwrite,
  pipeline-reshape for cgroup's cumulative nature, matrix
  completeness, CI wiring, and the placeholder-name decision.
- Tense flips to present-indicative for implemented pieces: the
  sampler abstraction lives in _sampling.py, CgroupSampler does X,
  etc. Out-of-scope items stay future-tense in section 10.
- New section 9 Schema Open Questions covers additive-tag trade-offs
  the POC deferred (schema_version bump, pre-tag-consumer compat,
  per-block source tagging, placeholder-name resolution).
- New section 8 Results ships the two matrix CSVs as a capability
  card, with the three ps->cgroup flips called out as the POC claim.
- ps pathologies in section 2 are described by mechanism rather
  than "Bug 1 / Bug 2" labels -- the cputime/elapsed lifetime ratio
  plus per-pid summing for CPU, shared-page per-pid summing for
  memory. Linux-only caveat for the CPU pathology called out
  (Darwin ps is a decaying 1-min average).
- Section 12 annotates the original eight open questions as
  resolved / deferred / still-open.
- Section 13 lists the landed commits so a reviewer can read the
  PR in order.

This is the doc reviewers should open first. It documents what the
POC actually demonstrates, what we deliberately left as future work,
and why.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 39.39394% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.90%. Comparing base (2162da9) to head (9e105c8).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/con_duct/_sampling.py 30.90% 37 Missing and 1 partial ⚠️
src/con_duct/_duct_main.py 50.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #415      +/-   ##
==========================================
- Coverage   91.87%   88.90%   -2.97%     
==========================================
  Files          15       15              
  Lines        1120     1181      +61     
  Branches      139      149      +10     
==========================================
+ Hits         1029     1050      +21     
- Misses         69      107      +38     
- Partials       22       24       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

asmacdo and others added 2 commits April 24, 2026 07:34
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces multiprocessing.Process with bare os.fork + os._exit so
child lifetime collapses to ~work_ms without interpreter-reinit or
spawn-method overhead. multiprocessing bloat defeated the "children
die between samples" phenomenon on PyPy and Python 3.14 (forkserver
default), causing test_ps_ephemeral_cpu_pcpu_no_underreport to
xpass-strict on those runners while xfailing on CPython 3.10-3.13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant