feat: feature_matrix accepts df_seq + list_parts (prototype #306)#316
Merged
Conversation
Add df_seq= and list_parts= to SequenceFeature.feature_matrix so the get_df_parts -> feature_matrix two-step collapses into one call. When df_seq is given, df_parts is built internally via get_df_parts; when df_parts is given the path is unchanged (byte-identical). Exactly one of df_parts / df_seq must be supplied (bare ValueError in the Validate block); batch=True and list_parts-without-df_seq are rejected. Ripple: numpydoc (df_seq/list_parts params + versionchanged + canonical df_seq baseline), example notebook df_seq cell (re-executed with outputs), unit tests (positive+negative per new param + byte-identical regression on the legacy df_parts path), release-notes Unreleased entry. Part of #305. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ted) The df_seq PR inserted the new TestFeatureMatrixDfSeq / TestFeatureMatrixRegression classes between TestFeatureMatrixGoldenValues and its two trailing property tests (test_property_shape_rows_cols / test_property_parallel_equals_serial), silently moving them under the regression class. Restore them to GoldenValues where they belong; the new classes now sit cleanly at the end of the file. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Hoist ut.check_bool(name='batch') above the mutual-exclusion block so a non-bool truthy batch (e.g. batch=1) with df_seq raises the type error, consistent with the df_parts path, instead of the df_seq-incompat message. Remove the now-duplicate later check; add test_invalid_batch_type_with_df_seq. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Clarify that the df_seq path uses get_df_parts defaults for its other options (jmd_n_len/jmd_c_len/tmd_len) and point to the two-step get_df_parts path for full control. API-consistency/docstring-accuracy only; no code change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
breimanntools
commented
Jul 2, 2026
| n_jobs: Union[int, None] = 1, | ||
| batch: bool = False, | ||
| df_seq: Optional[pd.DataFrame] = None, | ||
| list_parts: Optional[Union[str, List[str]]] = None, |
Owner
Author
There was a problem hiding this comment.
but we need then as well len_jmd_n and len_jmd_c as paramter. Can we work here with a dict_df_parts to bundle all of the relevant parameters? We did this pattern already somewehre else, please apply this as well. default is 10 (if global value is not set) and the jmd lenght paramters are only considered if df_seq and list_parts are given.
breimanntools
commented
Jul 2, 2026
breimanntools
left a comment
Owner
Author
There was a problem hiding this comment.
Please adjust as desrcribed
…dict Address #316 review: replace the ad-hoc list_parts= arg with a general df_parts_kws dict that forwards any get_df_parts kwarg (list_parts, jmd_n_len, tmd_len, ...), matching the house parameter-bundling idiom (split_kws / aal_kws). Keys are validated against inspect.signature(get_df_parts) via check_df_parts_kws (mirroring check_aal_kws_keys); mutually exclusive with df_parts and only valid alongside df_seq. Docstring, example notebook, and tests updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…None, not 20) tmd_len is a valid get_df_parts kwarg but defaults to None (TMD length is variable, read per-sequence, fixed only in the Position-based input mode); jmd_n_len/jmd_c_len are the fixed flank lengths (default 10). Example used a misleading tmd_len=20. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closed
8 tasks
breimanntools
added a commit
that referenced
this pull request
Jul 2, 2026
…rts_kws=) Use the single-call df_seq path (merged in #316) instead of the manual get_df_parts + feature_matrix pair. Behaviour-preserving; prediction suite green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Morning status — SOLID. Works end to end with green local tests and the full ripple done.
df_seq=/list_parts=added toSequenceFeature.feature_matrix; the legacydf_parts=path is byte-identical (golden + dataset-fingerprint regression tests). Review focus: the Validate-block wording and the choice to rejectbatch=Truewithdf_seq(single-DataFrame input). One open API question below.Part of #305 / prototype for #306.
What changed
SequenceFeature.feature_matrix(features, df_parts=None, ..., df_seq=None, list_parts=None):df_parts/df_seqmust be supplied → else bareValueError.df_seqis given,df_partsis built internally viaget_df_parts(df_seq, list_parts).df_partsis given the code path is unchanged (byte-identical).batch=Truetogether withdf_seqraisesValueError;list_partswithoutdf_seqraisesValueError.df_partsdefault flipped toNone(additive; existing positional/keyword calls unaffected).df_seq/list_partsparams (df_sequses the canonical baseline phrasing), aversionchangednote,df_partsmarked optional.examples/feature_engineering/sf_feature_matrix.ipynb: newdf_seqcell, re-executed with embedded outputs (nbmake-green).test_sf_feature_matrix.py):TestFeatureMatrixDfSeq(positive + negative per new param) andTestFeatureMatrixRegression(byte-identical legacydf_partspath: hand-computed 1.5 golden + dataset fingerprint).Local gates (all green, run against the worktree)
test_sf_feature_matrix.py— 29 passed (was 18 before the new tests).tests/unit/sequence_feature_tests/+test_param_coverage.py— 444 passed.aap.predict_samplespipe — 70 passed.KPIs (issue #306)
df_parts=output byte-identical (regression test on canonical fixture).feature_matrix(df_seq=..., list_parts=..., ...)equals the two-step result exactly.batch+df_seq/list_partsw/odf_seq→ValueError).Open API question
df_seq/list_partsare appended afterbatchto preserve positional compatibility, so they read out of logical order in the signature. Acceptable, or should they sit next todf_parts(a positional-compat break for the rare positionalbatch/n_jobscaller)?list_partspassed alongsidedf_partsbe a hardValueError(current) or a silently-ignored no-op? Chose ValueError for clarity.🤖 Generated with Claude Code
Critical self-review (reviewer pass)
Adversarial review by a second agent. Findings and fixes:
Defect found + fixed — property tests silently reparented. The new
TestFeatureMatrixDfSeq/TestFeatureMatrixRegressionclasses were inserted betweenTestFeatureMatrixGoldenValuesand its two trailing methods (test_property_shape_rows_cols,test_property_parallel_equals_serial), so those two pre-existing property tests were moved under the regression class — where they don't belong semantically ("byte-identical drift lock" vs. shape/parallel invariants). Restored them toTestFeatureMatrixGoldenValuesand left the two new classes cleanly at the end of the file. No test lost; 29 pass, and collection now shows the property tests back underGoldenValues.Verified (no change needed):
df_parts=path is genuinely proven, not a tautology. Recomputed the regression fingerprint on the independentmasterinstall:nansum=329.66268,X[0,0]=0.7,X[-1,-1]=0.16033— matches the pinned constants exactly. The anchor would fail if the legacy path drifted, and it agrees with master, so byte-identity is real.(df_parts is None) == (df_seq is None)rejects both-given and neither-given;batch=True + df_seq,list_partswithoutdf_seq, and non-DataFramedf_seq(viaget_df_parts→check_df_seq) all raise clearValueErrors. Confirmed each with concrete inputs.df_partsconstruction reusesget_df_parts(single call, no duplicated slicing logic; downstream re-validation is cheap and harmless).versionchanged:: 1.1.0notes,Examplesinclude resolves to an existing file.check_docstrings= 0 defects,doc_signature_drift= 0 candidates.df_seqpath adds exactly oneget_df_partscall.Residual concerns (not blocking, not fixed):
test_df_parts_path_byte_identical_goldenduplicates the existingtest_golden_segment_whole_mean(both assertACSegment(1,1) → 1.5). Intentional redundancy as an explicit regression lock; left as-is.batch; hard-error vs. no-op forlist_partswithdf_parts) remain design calls for the maintainer.Iterative review log
batch(e.g.batch=1) withdf_seqraised the "batch=True not supported with df_seq" message instead of the type error, and behaved inconsistently vs. thedf_partspath. Hoistedut.check_bool(name="batch")above the mutual-exclusion resolution (removed the now-duplicate later check) sobatchis type-validated first; addedtest_invalid_batch_type_with_df_seq. 30 tests pass.get_df_parts): re-proved the legacydf_parts=fingerprint is non-tautological by recomputing it on the independent master install (nodf_seqparam) — nansum=329.66268, X[0,0]=0.7, X[-1,-1]=0.16033 match the pinned constants exactly, so the anchor would fail on any drift. Confirmed mutual-exclusion is complete (both/neither/df_seq+batch/list_parts-without-df_seq) and part-building is a singleget_df_partsdelegation (no duplicated slicing).df_seq/list_partsnames + phrasing mirror the siblingget_df_parts. Fixed one docstring-accuracy gap: documented that the shortcut only forwardslist_partswhile otherget_df_partsoptions (jmd_n_len/jmd_c_len/tmd_len) use defaults, pointing to the two-step path for full control. 30 tests pass;check_docstrings0 defects.df_seqpath adds exactly oneget_df_partsdelegation (no loops, no needless copies, no recompute); the only redundancy (a duplicatecheck_bool(batch)) was already removed in round 2. The sharedcheck_df_partsre-validation of the builtdf_partsis the common path for both input sources and is cheap — branching to skip it would add complexity, so it stays.Returns, and anExamplesinclude that resolves; no ADR refs / noprint/ut.check_*in the diff. Frontend validates every public param (batchtype-checked,df_seq/list_partsviaget_df_partsdelegation + mutual-exclusion), each new/changed param has positive+negative tests. Full area gate green: sequence_feature_tests 440 passed, param_coverage 5, import-hygiene 2, docstring-contracts 13, doc-signature-drift 0, example nbmake 1 passed.