feat: feature_matrix accepts df_seq + list_parts (prototype #306) by breimanntools · Pull Request #316 · breimanntools/aaanalysis

breimanntools · 2026-06-30T23:29:49Z

Morning status — SOLID. Works end to end with green local tests and the full ripple done. df_seq= / list_parts= added to SequenceFeature.feature_matrix; the legacy df_parts= path is byte-identical (golden + dataset-fingerprint regression tests). Review focus: the Validate-block wording and the choice to reject batch=True with df_seq (single-DataFrame input). One open API question below.

Part of #305 / prototype for #306.

What changed

SequenceFeature.feature_matrix(features, df_parts=None, ..., df_seq=None, list_parts=None):
- Exactly one of df_parts / df_seq must be supplied → else bare ValueError.
- When df_seq is given, df_parts is built internally via get_df_parts(df_seq, list_parts).
- When df_parts is given the code path is unchanged (byte-identical).
- batch=True together with df_seq raises ValueError; list_parts without df_seq raises ValueError.
- df_parts default flipped to None (additive; existing positional/keyword calls unaffected).
numpydoc: df_seq / list_parts params (df_seq uses the canonical baseline phrasing), a versionchanged note, df_parts marked optional.
Example notebook examples/feature_engineering/sf_feature_matrix.ipynb: new df_seq cell, re-executed with embedded outputs (nbmake-green).
Unit tests (test_sf_feature_matrix.py): TestFeatureMatrixDfSeq (positive + negative per new param) and TestFeatureMatrixRegression (byte-identical legacy df_parts path: hand-computed 1.5 golden + dataset fingerprint).
Release notes: Unreleased entry under Feature Engineering.

Local gates (all green, run against the worktree)

test_sf_feature_matrix.py — 29 passed (was 18 before the new tests).
tests/unit/sequence_feature_tests/ + test_param_coverage.py — 444 passed.
cpp feature-matrix parity + aap.predict_samples pipe — 70 passed.
nbmake on the example notebook — 1 passed.
docstring structural checker — 0 defects.

KPIs (issue #306)

Existing df_parts= output byte-identical (regression test on canonical fixture).
feature_matrix(df_seq=..., list_parts=..., ...) equals the two-step result exactly.
≥1 negative test (both / neither / batch+df_seq / list_parts w/o df_seq → ValueError).

Open API question

df_seq / list_parts are appended after batch to preserve positional compatibility, so they read out of logical order in the signature. Acceptable, or should they sit next to df_parts (a positional-compat break for the rare positional batch/n_jobs caller)?
Should list_parts passed alongside df_parts be a hard ValueError (current) or a silently-ignored no-op? Chose ValueError for clarity.

🤖 Generated with Claude Code

Critical self-review (reviewer pass)

Adversarial review by a second agent. Findings and fixes:

Defect found + fixed — property tests silently reparented. The new TestFeatureMatrixDfSeq / TestFeatureMatrixRegression classes were inserted between TestFeatureMatrixGoldenValues and its two trailing methods (test_property_shape_rows_cols, test_property_parallel_equals_serial), so those two pre-existing property tests were moved under the regression class — where they don't belong semantically ("byte-identical drift lock" vs. shape/parallel invariants). Restored them to TestFeatureMatrixGoldenValues and left the two new classes cleanly at the end of the file. No test lost; 29 pass, and collection now shows the property tests back under GoldenValues.

Verified (no change needed):

Byte-identical df_parts= path is genuinely proven, not a tautology. Recomputed the regression fingerprint on the independent master install: nansum=329.66268, X[0,0]=0.7, X[-1,-1]=0.16033 — matches the pinned constants exactly. The anchor would fail if the legacy path drifted, and it agrees with master, so byte-identity is real.
Mutual exclusion is complete and correct. (df_parts is None) == (df_seq is None) rejects both-given and neither-given; batch=True + df_seq, list_parts without df_seq, and non-DataFrame df_seq (via get_df_parts→check_df_seq) all raise clear ValueErrors. Confirmed each with concrete inputs.
Internal df_parts construction reuses get_df_parts (single call, no duplicated slicing logic; downstream re-validation is cheap and harmless).
Docstring — param order matches the signature, two distinct versionchanged:: 1.1.0 notes, Examples include resolves to an existing file. check_docstrings = 0 defects, doc_signature_drift = 0 candidates.
Efficiency — no needless recompute/copy; the df_seq path adds exactly one get_df_parts call.

Residual concerns (not blocking, not fixed):

test_df_parts_path_byte_identical_golden duplicates the existing test_golden_segment_whole_mean (both assert AC Segment(1,1) → 1.5). Intentional redundancy as an explicit regression lock; left as-is.
The two open API questions in the original body (param position after batch; hard-error vs. no-op for list_parts with df_parts) remain design calls for the maintainer.

Iterative review log

Round 2 (correctness & edge cases): found + fixed a misleading error path — a non-bool truthy batch (e.g. batch=1) with df_seq raised the "batch=True not supported with df_seq" message instead of the type error, and behaved inconsistently vs. the df_parts path. Hoisted ut.check_bool(name="batch") above the mutual-exclusion resolution (removed the now-duplicate later check) so batch is type-validated first; added test_invalid_batch_type_with_df_seq. 30 tests pass.
Round 3 (numerical robustness & API/param-name consistency with get_df_parts): re-proved the legacy df_parts= fingerprint is non-tautological by recomputing it on the independent master install (no df_seq param) — nansum=329.66268, X[0,0]=0.7, X[-1,-1]=0.16033 match the pinned constants exactly, so the anchor would fail on any drift. Confirmed mutual-exclusion is complete (both/neither/df_seq+batch/list_parts-without-df_seq) and part-building is a single get_df_parts delegation (no duplicated slicing). df_seq/list_parts names + phrasing mirror the sibling get_df_parts. Fixed one docstring-accuracy gap: documented that the shortcut only forwards list_parts while other get_df_parts options (jmd_n_len/jmd_c_len/tmd_len) use defaults, pointing to the two-step path for full control. 30 tests pass; check_docstrings 0 defects.
Round 4 (efficiency & simplicity): converged — no new defects. The df_seq path adds exactly one get_df_parts delegation (no loops, no needless copies, no recompute); the only redundancy (a duplicate check_bool(batch)) was already removed in round 2. The shared check_df_parts re-validation of the built df_parts is the common path for both input sources and is cheap — branching to skip it would add complexity, so it stays.
Round 5 (guides, docs & tests completeness): converged — no new defects. numpydoc has a one-line summary, named Returns, and an Examples include that resolves; no ADR refs / no print / ut.check_* in the diff. Frontend validates every public param (batch type-checked, df_seq/list_parts via get_df_parts delegation + mutual-exclusion), each new/changed param has positive+negative tests. Full area gate green: sequence_feature_tests 440 passed, param_coverage 5, import-hygiene 2, docstring-contracts 13, doc-signature-drift 0, example nbmake 1 passed.

Add df_seq= and list_parts= to SequenceFeature.feature_matrix so the get_df_parts -> feature_matrix two-step collapses into one call. When df_seq is given, df_parts is built internally via get_df_parts; when df_parts is given the path is unchanged (byte-identical). Exactly one of df_parts / df_seq must be supplied (bare ValueError in the Validate block); batch=True and list_parts-without-df_seq are rejected. Ripple: numpydoc (df_seq/list_parts params + versionchanged + canonical df_seq baseline), example notebook df_seq cell (re-executed with outputs), unit tests (positive+negative per new param + byte-identical regression on the legacy df_parts path), release-notes Unreleased entry. Part of #305. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ted) The df_seq PR inserted the new TestFeatureMatrixDfSeq / TestFeatureMatrixRegression classes between TestFeatureMatrixGoldenValues and its two trailing property tests (test_property_shape_rows_cols / test_property_parallel_equals_serial), silently moving them under the regression class. Restore them to GoldenValues where they belong; the new classes now sit cleanly at the end of the file. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Hoist ut.check_bool(name='batch') above the mutual-exclusion block so a non-bool truthy batch (e.g. batch=1) with df_seq raises the type error, consistent with the df_parts path, instead of the df_seq-incompat message. Remove the now-duplicate later check; add test_invalid_batch_type_with_df_seq. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Clarify that the df_seq path uses get_df_parts defaults for its other options (jmd_n_len/jmd_c_len/tmd_len) and point to the two-step get_df_parts path for full control. API-consistency/docstring-accuracy only; no code change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-df-seq

breimanntools · 2026-07-02T07:15:27Z

                       n_jobs: Union[int, None] = 1,
                       batch: bool = False,
+                       df_seq: Optional[pd.DataFrame] = None,
+                       list_parts: Optional[Union[str, List[str]]] = None,


but we need then as well len_jmd_n and len_jmd_c as paramter. Can we work here with a dict_df_parts to bundle all of the relevant parameters? We did this pattern already somewehre else, please apply this as well. default is 10 (if global value is not set) and the jmd lenght paramters are only considered if df_seq and list_parts are given.

breimanntools

Please adjust as desrcribed

…dict Address #316 review: replace the ad-hoc list_parts= arg with a general df_parts_kws dict that forwards any get_df_parts kwarg (list_parts, jmd_n_len, tmd_len, ...), matching the house parameter-bundling idiom (split_kws / aal_kws). Keys are validated against inspect.signature(get_df_parts) via check_df_parts_kws (mirroring check_aal_kws_keys); mutually exclusive with df_parts and only valid alongside df_seq. Docstring, example notebook, and tests updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…None, not 20) tmd_len is a valid get_df_parts kwarg but defaults to None (TMD length is variable, read per-sequence, fixed only in the Position-based input mode); jmd_n_len/jmd_c_len are the fixed flank lengths (default 10). Example used a misleading tmd_len=20. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rts_kws=) Use the single-call df_seq path (merged in #316) instead of the manual get_df_parts + feature_matrix pair. Behaviour-preserving; prediction suite green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

breimanntools and others added 5 commits July 1, 2026 01:29

Merge remote-tracking branch 'origin/master' into feat/feature-matrix…

d142923

…-df-seq

breimanntools commented Jul 2, 2026

View reviewed changes

breimanntools and others added 2 commits July 2, 2026 09:27

breimanntools marked this pull request as ready for review July 2, 2026 07:54

breimanntools merged commit 46793a8 into master Jul 2, 2026
12 of 13 checks passed

breimanntools deleted the feat/feature-matrix-df-seq branch July 2, 2026 07:55

breimanntools mentioned this pull request Jul 2, 2026

feat: feature_matrix accepts df_seq + list_parts directly (drop the get_df_parts triple-call) #306

Closed

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: feature_matrix accepts df_seq + list_parts (prototype #306)#316

feat: feature_matrix accepts df_seq + list_parts (prototype #306)#316
breimanntools merged 7 commits into
masterfrom
feat/feature-matrix-df-seq

breimanntools commented Jun 30, 2026 •

edited

Loading

Uh oh!

breimanntools Jul 2, 2026

Uh oh!

breimanntools left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

breimanntools commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

Local gates (all green, run against the worktree)

KPIs (issue #306)

Open API question

Critical self-review (reviewer pass)

Iterative review log

Uh oh!

breimanntools Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

breimanntools left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

breimanntools commented Jun 30, 2026 •

edited

Loading