feat(pu): dPULearn.fit split input (X_pos/X_unlabeled + mask_neg_) + aa.get_labels + sample colors by breimanntools · Pull Request #321 · breimanntools/aaanalysis

breimanntools · 2026-06-30T23:49:43Z

Part of #308. Three additive pieces for the positive/unlabeled workflow.

1. `dPULearn.fit` — positives/unlabeled split input + `mask_neg_`

For the common positives-vs-unlabeled setup, `fit` now accepts `X_pos` + `X_unlabeled` as an alternative to `X` + `labels`: the two matrices are stacked and marked (1/2) internally, so you don't hand-build the label vector. After fitting, the new `dPULearn.mask_neg_` attribute is the boolean mask of reliable negatives: over the rows of `X_unlabeled` in split mode, over the rows of `X` otherwise. In split mode it equals the manual `labels_[len(X_pos):] == 0` exactly.

fit still returns self (scikit-learn contract preserved — no output-flag anti-pattern), and the existing fit(X, labels=...) path is byte-identical.

dpul.fit(X_pos=X_pos, X_unlabeled=X_pool, n_neg=49)
X_neg = X_pool[dpul.mask_neg_]
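
For reference, the manual plumbing that the split input replaces can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: `stack_split_input` and `mask_over_unlabeled` are hypothetical helper names, and `labels_fitted` is a hand-made stand-in for the `labels_` attribute that a real fit would produce.

```python
import numpy as np

# Label markers as described above: 1 = positive, 2 = unlabeled, 0 = reliable negative.
LABEL_POS, LABEL_UNL, LABEL_NEG = 1, 2, 0


def stack_split_input(X_pos, X_unlabeled):
    """Hypothetical helper: build the stacked (X, labels) pair from the split input."""
    X_pos = np.asarray(X_pos)
    X_unlabeled = np.asarray(X_unlabeled)
    X = np.vstack([X_pos, X_unlabeled])
    labels = np.concatenate([
        np.full(len(X_pos), LABEL_POS, dtype=int),
        np.full(len(X_unlabeled), LABEL_UNL, dtype=int),
    ])
    return X, labels


def mask_over_unlabeled(labels_fitted, n_pos):
    """Hypothetical helper: boolean mask of reliable negatives over the unlabeled rows."""
    return np.asarray(labels_fitted)[n_pos:] == LABEL_NEG


rng = np.random.default_rng(0)
X_pos = rng.normal(size=(4, 3))
X_pool = rng.normal(size=(6, 3))
X, labels = stack_split_input(X_pos, X_pool)

# Stand-in for the fitted labels_ (assumption for illustration only):
# pretend that pool rows 1 and 4 were relabeled as reliable negatives.
labels_fitted = labels.copy()
labels_fitted[len(X_pos) + np.array([1, 4])] = LABEL_NEG

mask_neg = mask_over_unlabeled(labels_fitted, n_pos=len(X_pos))
X_neg = X_pool[mask_neg]  # same indexing pattern as X_pool[dpul.mask_neg_]
```

The mask is aligned to the pool rather than to the stacked matrix, which is why it can index `X_pool` directly without an offset.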

Supersedes the earlier standalone dPULearn.mine_negatives method (removed): the convenience now lives on fit itself rather than adding a second entry point.

2. `aa.get_labels(df, positive_label=1, col_label="label")`

One-call binary label vector for the recurring (df[col]==x).astype(int) pattern.
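
The pattern being wrapped can be sketched as follows. This is an illustrative stand-in that assumes only pandas and NumPy: `get_labels_sketch` is a hypothetical name, it mirrors the signature shown in the heading, and it does not reproduce the library's input validation.

```python
import numpy as np
import pandas as pd


def get_labels_sketch(df, positive_label=1, col_label="label"):
    """Illustrative stand-in: 1 where df[col_label] equals positive_label, else 0."""
    if col_label not in df.columns:
        raise ValueError(f"'col_label' ({col_label}) is not a column of 'df'")
    return (df[col_label] == positive_label).astype(int).to_numpy()


df = pd.DataFrame({
    "entry": ["P1", "P2", "P3", "P4"],
    "label": [1, 0, 2, 1],
})

labels = get_labels_sketch(df)                        # rows with label == 1 become 1
labels_unl = get_labels_sketch(df, positive_label=2)  # rows with label == 2 become 1
```

Every value other than `positive_label` maps to 0, so the result is always a binary integer vector regardless of how many distinct values the column holds.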

3. `COLOR_SAMPLES_*` constants

Canonical aa.COLOR_SAMPLES_POS/NEG/UNL/REL_NEG colors so notebooks reference a named constant.

Tests cover the split-mode ↔ manual equivalence, mask_neg_ semantics in both modes, the returns-self + both-modes-rejected guards, and get_labels edge cases. 271 dpulearn + api-meta tests green.

🤖 Generated with Claude Code

…totype #308) Three additive conveniences for the positive/unlabelled -> mined-negatives flow, removing the recurring vstack/label-vector/color-lookup plumbing in the gamma-secretase use case. All existing APIs stay byte-identical. - dPULearn.mine_negatives(X_pos, X_unlabelled, ...): one-call sugar over fit that returns the reliable-negative boolean mask over X_unlabelled. Equals the manual labels_[len(X_pos):]==0 result exactly (regression-tested). fit(X, labels=...) unchanged. - get_labels(df, positive_label=1, col_label="label"): binary int label vector, the single-call form of (df[col]==x).astype(int).to_numpy(). - COLOR_SAMPLES_POS/NEG/UNL/REL_NEG: public named aliases for the canonical sample colors, equal to plot_get_cdict("DICT_COLOR")["SAMPLES_*"] (golden-tested). Wired get_labels + the 4 color constants into __init__/__all__ (on the #308 wire-to-public-API list). Ripple: numpydoc + 2 executed example notebooks (get_labels, dpul_mine_negatives), 39 unit tests (positive+negative+regression), release-notes Unreleased entries, cheat-sheet rows. Part of #305 / prototype for #308 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-07-01T00:14:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.95%. Comparing base (1a152de) to head (830879a).
⚠️ Report is 19 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #321      +/-   ##
==========================================
+ Coverage   94.93%   94.95%   +0.01%     
==========================================
  Files         185      186       +1     
  Lines       17883    17910      +27     
  Branches     3038     3034       -4     
==========================================
+ Hits        16978    17007      +29     
+ Misses        598      597       -1     
+ Partials      307      306       -1

| Files with missing lines | Coverage Δ | |
|---|---|---|
| aaanalysis/__init__.py | `97.29% <100.00%> (+0.03%)` | ⬆️ |
| aaanalysis/_constants.py | `100.00% <100.00%> (ø)` | |
| aaanalysis/data_handling/__init__.py | `100.00% <100.00%> (ø)` | |
| aaanalysis/data_handling/_get_labels.py | `100.00% <100.00%> (ø)` | |
| aaanalysis/pu_learning/_dpulearn.py | `98.06% <100.00%> (+0.32%)` | ⬆️ |

... and 2 files with indirect coverage changes

| Components | Coverage Δ |
|---|---|
| cpp_core | `94.95% <ø> (ø)` |

… in get_labels Reorder the get_labels Validate block so check_str(col_label) runs before check_df(cols_required=col_label). A non-str col_label now surfaces a clear 'col_label' error instead of an internal 'cols_required' one. No behaviour change on valid input. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mine_negatives validated X_pos and X_unlabelled separately with the default check_X min_n_samples=3, so it rejected n_pos<3 inputs that the manual stacking path accepts (the >=3 floor belongs to the stacked matrix, which fit enforces). Relax the per-matrix check to min_n_samples=1 to restore exact equivalence; add tests for the small-positive-set equivalence and get_labels single-class/NaN mapping. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…sistency The rest of the package spells it 'unlabeled' (American, 85 uses) and abbreviates the marker as label_unl / n_unl_to_neg; the new public mine_negatives parameter used the British two-L 'X_unlabelled'. Rename the new/unreleased parameter, its match helper, docstrings, tests, cheat-sheet and release-notes entries, and re-execute the example notebook. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

With no pre-labeled negatives, n_neg (total) and n_unl_to_neg (from the pool) are always equivalent in mine_negatives, so exposing both was redundant. Replace them with a single required n_neg (the method is new/unreleased, so non-breaking); it calls fit(n_unl_to_neg=n_neg) internally. Update docstring, tests, and the re-executed example notebook. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d error mine_negatives delegated n_neg validation to fit (which sees it as n_unl_to_neg), so an invalid n_neg raised an error naming the internal parameter. Validate n_neg explicitly in the frontend so the message names n_neg, and assert the name in the negative test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-mask # Conflicts: # docs/source/index/release_notes.rst

…-mask

…d + mask_neg_); remove the method Instead of a separate mine_negatives method, extend fit to accept the positives/unlabeled split directly: pass X_pos + X_unlabeled (an alternative to X + labels), and read the new dPULearn.mask_neg_ attribute for the boolean mask of reliable negatives (over X_unlabeled in the split mode, over X otherwise). fit still returns self (sklearn contract preserved), so no output-parameter anti-pattern. Removes the mine_negatives method + its example notebook; folds the demo into dpul_fit; updates get_labels docstring, cheat sheet, and release notes. Tests rewritten to the split-mode + mask_neg_ (incl. manual-equivalence, both-modes guard, returns-self); 271 dpulearn + api-meta tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

breimanntools

Looks good. Please make sure that the alternative is clearly described in the example notebook and that both options are clearly introduced as two options (perhaps as bullet points at the beginning of the example notebook, before each parameter is introduced itself).

…p front (review) Per PR review: the dpul_fit example notebook now opens by presenting the two input options (Option 1: X + labels; Option 2: X_pos + X_unlabeled -> mask_neg_) as bullet points before any parameter is demonstrated, and labels the split-input section as Option 2 so the alternative is clearly described. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

breimanntools and others added 7 commits July 1, 2026 04:50

Merge remote-tracking branch 'origin/master' into feat/dpulearn-mined…

884634e

…-mask # Conflicts: # docs/source/index/release_notes.rst

Merge remote-tracking branch 'origin/master' into feat/dpulearn-mined…

cd69329

…-mask

breimanntools marked this pull request as ready for review July 2, 2026 12:48

breimanntools changed the title ~~feat: dPULearn.mine_negatives + get_labels + named sample colors (prototype #308)~~ feat(pu): dPULearn.fit split input (X_pos/X_unlabeled + mask_neg_) + aa.get_labels + sample colors Jul 3, 2026

breimanntools commented Jul 3, 2026

breimanntools merged commit 0877b3d into master Jul 3, 2026
13 checks passed

breimanntools deleted the feat/dpulearn-mined-mask branch July 3, 2026 18:05

breimanntools merged 10 commits into master from feat/dpulearn-mined-mask
