feat(pu): dPULearn.fit split input (X_pos/X_unlabeled + mask_neg_) + aa.get_labels + sample colors #321
Merged
…totype #308)

Three additive conveniences for the positive/unlabelled -> mined-negatives flow, removing the recurring vstack/label-vector/color-lookup plumbing in the gamma-secretase use case. All existing APIs stay byte-identical.

- dPULearn.mine_negatives(X_pos, X_unlabelled, ...): one-call sugar over fit that returns the reliable-negative boolean mask over X_unlabelled. Equals the manual labels_[len(X_pos):]==0 result exactly (regression-tested). fit(X, labels=...) unchanged.
- get_labels(df, positive_label=1, col_label="label"): binary int label vector, the single-call form of (df[col]==x).astype(int).to_numpy().
- COLOR_SAMPLES_POS/NEG/UNL/REL_NEG: public named aliases for the canonical sample colors, equal to plot_get_cdict("DICT_COLOR")["SAMPLES_*"] (golden-tested).

Wired get_labels + the 4 color constants into __init__/__all__ (on the #308 wire-to-public-API list).

Ripple: numpydoc + 2 executed example notebooks (get_labels, dpul_mine_negatives), 39 unit tests (positive+negative+regression), release-notes Unreleased entries, cheat-sheet rows.

Part of #305 / prototype for #308

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
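The pandas expression that get_labels is said to wrap can be sketched as follows. This is a minimal illustration only: `get_labels_sketch` is a hypothetical stand-in written for this example, not the library's implementation, and it skips any input validation the real function performs.

```python
import pandas as pd


def get_labels_sketch(df, positive_label=1, col_label="label"):
    """Hypothetical stand-in: 1 where df[col_label] == positive_label, else 0."""
    # Comparison yields a boolean Series; astype(int) maps True/False to 1/0.
    return (df[col_label] == positive_label).astype(int).to_numpy()


df = pd.DataFrame({"label": [1, 0, 2, 1]})
labels = get_labels_sketch(df)                        # rows with label 1 are positive
labels_alt = get_labels_sketch(df, positive_label=2)  # treat class 2 as positive instead
```

Because the comparison is an equality test, every value other than `positive_label` maps to 0, including missing values.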
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #321 +/- ##
==========================================
+ Coverage 94.93% 94.95% +0.01%
==========================================
Files 185 186 +1
Lines 17883 17910 +27
Branches 3038 3034 -4
==========================================
+ Hits 16978 17007 +29
+ Misses 598 597 -1
+ Partials 307 306 -1
... and 2 files with indirect coverage changes
… in get_labels Reorder the get_labels Validate block so check_str(col_label) runs before check_df(cols_required=col_label). A non-str col_label now surfaces a clear 'col_label' error instead of an internal 'cols_required' one. No behaviour change on valid input. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mine_negatives validated X_pos and X_unlabelled separately with the default check_X min_n_samples=3, so it rejected n_pos<3 inputs that the manual stacking path accepts (the >=3 floor belongs to the stacked matrix, which fit enforces). Relax the per-matrix check to min_n_samples=1 to restore exact equivalence; add tests for the small-positive-set equivalence and get_labels single-class/NaN mapping. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
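The validation order described in this commit can be sketched as below. All names here (`check_min_samples`, `stack_split_input`) are hypothetical helpers written for illustration; they are not the package's `check_X` and only mimic the idea of a per-matrix floor of 1 with the >=3 floor applied to the stacked matrix.

```python
import numpy as np


def check_min_samples(X, name, min_n_samples):
    """Hypothetical validator: require a 2D array with at least min_n_samples rows."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[0] < min_n_samples:
        raise ValueError(f"'{name}' should be 2D with at least {min_n_samples} samples")
    return X


def stack_split_input(X_pos, X_unlabeled):
    # Per-matrix floor is 1, so small positive sets are not rejected early.
    X_pos = check_min_samples(X_pos, "X_pos", min_n_samples=1)
    X_unlabeled = check_min_samples(X_unlabeled, "X_unlabeled", min_n_samples=1)
    X = np.vstack([X_pos, X_unlabeled])
    # The >= 3 floor is enforced on the stacked matrix only.
    return check_min_samples(X, "X", min_n_samples=3)


X = stack_split_input(np.ones((2, 4)), np.zeros((5, 4)))  # 2 positives are accepted
```

With this ordering, a two-row positive set passes as long as the stacked matrix has at least three rows, which matches what the manual stacking path accepts.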
…sistency The rest of the package spells it 'unlabeled' (American, 85 uses) and abbreviates the marker as label_unl / n_unl_to_neg; the new public mine_negatives parameter used the British two-L 'X_unlabelled'. Rename the new/unreleased parameter, its match helper, docstrings, tests, cheat-sheet and release-notes entries, and re-execute the example notebook. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With no pre-labeled negatives, n_neg (total) and n_unl_to_neg (from the pool) are always equivalent in mine_negatives, so exposing both was redundant. Replace them with a single required n_neg (the method is new/unreleased, so non-breaking); it calls fit(n_unl_to_neg=n_neg) internally. Update docstring, tests, and the re-executed example notebook. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d error mine_negatives delegated n_neg validation to fit (which sees it as n_unl_to_neg), so an invalid n_neg raised an error naming the internal parameter. Validate n_neg explicitly in the frontend so the message names n_neg, and assert the name in the negative test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-mask # Conflicts: # docs/source/index/release_notes.rst
…d + mask_neg_); remove the method Instead of a separate mine_negatives method, extend fit to accept the positives/unlabeled split directly: pass X_pos + X_unlabeled (an alternative to X + labels), and read the new dPULearn.mask_neg_ attribute for the boolean mask of reliable negatives (over X_unlabeled in the split mode, over X otherwise). fit still returns self (sklearn contract preserved), so no output-parameter anti-pattern. Removes the mine_negatives method + its example notebook; folds the demo into dpul_fit; updates get_labels docstring, cheat sheet, and release notes. Tests rewritten to the split-mode + mask_neg_ (incl. manual-equivalence, both-modes guard, returns-self); 271 dpulearn + api-meta tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
breimanntools (Owner, Author) commented on Jul 3, 2026
Looks good. Please make sure that the alternative is clearly described in the example notebook and that both options are clearly introduced as two options (perhaps as bullet points at the beginning of the example notebook, before each parameter is introduced itself).
…p front (review) Per PR review: the dpul_fit example notebook now opens by presenting the two input options (Option 1: X + labels; Option 2: X_pos + X_unlabeled -> mask_neg_) as bullet points before any parameter is demonstrated, and labels the split-input section as Option 2 so the alternative is clearly described. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Part of #308. Three additive pieces for the positive/unlabeled workflow.
1. `dPULearn.fit`: positives/unlabeled split input + `mask_neg_`

For the common positives-vs-unlabeled setup, `fit` now accepts `X_pos` + `X_unlabeled` as an alternative to `X` + `labels`: the two matrices are stacked and marked (1/2) internally, so you don't hand-build the label vector. After fitting, the new `dPULearn.mask_neg_` attribute is the boolean mask of reliable negatives: over the rows of `X_unlabeled` in the split mode, over `X` otherwise (equals the manual `labels_[len(X_pos):] == 0` exactly). `fit` still returns `self` (scikit-learn contract preserved, no output-flag anti-pattern), and the existing `fit(X, labels=...)` path is byte-identical.

2. `aa.get_labels(df, positive_label=1, col_label="label")`

One-call binary label vector for the recurring `(df[col]==x).astype(int)` pattern.

3. `COLOR_SAMPLES_*` constants

Canonical `aa.COLOR_SAMPLES_POS/NEG/UNL/REL_NEG` colors so notebooks reference a named constant.

Tests cover the split-mode ↔ manual equivalence, `mask_neg_` semantics in both modes, the returns-self + both-modes-rejected guards, and `get_labels` edge cases. 271 dpulearn + api-meta tests green.

🤖 Generated with Claude Code
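The manual plumbing that the split input replaces can be sketched with plain NumPy. This is a minimal illustration, not the library's implementation: no dPULearn model is fitted here, the `labels_` values are made-up placeholders standing in for what a real `fit` call would produce, and `mask_neg` / `X_rel_neg` are hypothetical variable names.

```python
import numpy as np

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(4, 3))        # 4 positive samples, 3 features
X_unlabeled = rng.normal(size=(6, 3))  # 6 unlabeled samples

# Manual path: stack both matrices and hand-build the label vector
# (1 = positive, 2 = unlabeled, as described in the PR text).
X = np.vstack([X_pos, X_unlabeled])
labels = np.array([1] * len(X_pos) + [2] * len(X_unlabeled))

# Placeholder for the fitted labels_: assume fitting relabeled two
# unlabeled rows as reliable negatives (0). These values are invented.
labels_ = labels.copy()
labels_[[5, 8]] = 0

# Mask of reliable negatives over the rows of X_unlabeled only.
mask_neg = labels_[len(X_pos):] == 0
X_rel_neg = X_unlabeled[mask_neg]
```

Slicing off the first `len(X_pos)` entries is what aligns the mask with `X_unlabeled` rather than with the stacked matrix, which is the alignment the description attributes to `mask_neg_` in split mode.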