
Fix NaN weight handling in segmentation and metrics #1036

Merged
etal merged 5 commits into master from infra on Mar 10, 2026
Conversation

@etal (Owner) commented Mar 10, 2026

Summary

  • Fix apply_weights() to replace NaN with epsilon before np.clip(), which does not clamp NaN values — this is the root cause of NaN weights in .cnr files
  • Add defensive NaN filtering in segment_mean(), calc_intervals(), and group_by_genes() to handle .cnr files produced by older CNVkit versions
  • Add tests for NaN weight scenarios in both segment_mean and apply_weights

Addresses #436, #908.

Details

Users reported crashes during CBS segmentation and segment_mean when .cnr files contained NaN weight values. The root cause: apply_weights() computes weights via formulas that can produce NaN (e.g. from degenerate coverage or NaN spread), then calls weights.clip(epsilon, 1.0) — but np.clip(NaN, min, max) returns NaN per IEEE 754.
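The NaN pass-through is easy to reproduce in isolation. This is a standalone sketch of the failure mode and the fix pattern, not CNVkit code; the `epsilon` value here is illustrative:

```python
import numpy as np

# np.clip does not clamp NaN: IEEE 754 comparisons against NaN are
# always false, so neither bound is ever applied.
weights = np.array([0.5, np.nan, 1.7])
clipped = weights.clip(0.01, 1.0)   # NaN survives at index 1

# Replacing NaN before clipping, as the fix does at the source:
epsilon = 0.01
cleaned = np.where(np.isnan(weights), epsilon, weights).clip(epsilon, 1.0)
# cleaned has no NaN: [0.5, epsilon, 1.0]
```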

These NaN weights then caused:

  • CBS segmentation: R's CNA() drops NaN rows internally, creating a length mismatch with the weights vector (#436)
  • segment_mean / np.average: ZeroDivisionError or silent NaN propagation through sex inference and segmetrics (#908)

The fix prevents NaN weights at the source and adds defensive guards at downstream np.average() call sites.

Test plan

  • test_segment_mean_nan_weights — partial NaN, all-NaN, and clean weights
  • test_apply_weights_no_nan — NaN spread in reference produces valid weights
  • Full test suite passes (156 tests)
  • mypy clean, ruff clean, all pre-commit hooks pass
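The partial-NaN and all-NaN scenarios exercised by these tests can be sketched in a self-contained way. The helper `safe_segment_mean` below is a hypothetical stand-in mirroring the defensive pattern described in this PR, not the actual test code:

```python
import numpy as np

def safe_segment_mean(log2s, weights):
    """Weighted mean of log2 ratios that tolerates NaN weights."""
    valid = ~np.isnan(weights)
    if valid.any():
        return np.average(log2s[valid], weights=weights[valid])
    # All weights NaN: fall back to the unweighted mean.
    return log2s.mean()

log2s = np.array([0.0, 0.3, 0.6])
clean = safe_segment_mean(log2s, np.array([1.0, 1.0, 1.0]))      # 0.3
partial = safe_segment_mean(log2s, np.array([2.0, np.nan, 1.0])) # 0.2
all_nan = safe_segment_mean(log2s, np.full(3, np.nan))           # 0.3
```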

🤖 Generated with Claude Code

etal and others added 3 commits March 9, 2026 16:26
#908)

NaN values in bin weights caused two classes of failures:
- CBS segmentation: R's CNA() dropped NaN rows internally, creating a
  length mismatch with the weights vector (#436, fixed in R script by #914)
- segment_mean / np.average: NaN weights produced ZeroDivisionError or
  silent NaN propagation through sex inference and segmetrics (#908)

Root cause: apply_weights() used np.clip() to bound weights, but
np.clip(NaN, min, max) returns NaN rather than clamping to min.

Fix at three layers:
- apply_weights(): replace NaN with epsilon before clipping (prevents
  NaN weights from ever being written to .cnr files)
- segment_mean(): filter NaN weights before np.average (handles .cnr
  files from older CNVkit versions)
- calc_intervals(), group_by_genes(): guard np.average calls against
  NaN weights in bootstrap CI and gene-level depth calculations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@etal (Owner, Author) commented Mar 10, 2026

@claude Please review.

@claude (Bot) commented Mar 10, 2026

Claude finished @etal's task.


Code Review for NaN Weight Handling Fix

I'll review this PR that fixes NaN weight handling in segmentation and metrics.

Todo List:

  • Read and analyze all changed files
  • Review the core fix in apply_weights()
  • Examine defensive measures in downstream functions
  • Evaluate test coverage and quality
  • Check for potential edge cases or improvements
  • Provide comprehensive review feedback

Summary

This is a well-implemented fix for a critical issue. The root cause analysis is correct: np.clip(NaN, min, max) returns NaN per IEEE 754, which then propagates through the pipeline causing crashes in CBS segmentation and np.average() calls.

Core Fix Analysis

cnvlib/fix.py:apply_weights() (lines 553-556)

✅ EXCELLENT - The fix is placed at exactly the right location:

# NaN can arise from degenerate coverage (e.g. all-zero bin sizes or
# NaN spread values).  np.clip does not clamp NaN, so replace first.
weights = np.where(np.isnan(weights), epsilon, weights)
return cnarr.add_columns(weight=weights.clip(epsilon, 1.0))

Strengths:

  • Addresses the root cause by replacing NaN with epsilon before clipping
  • Clear documentation explaining the IEEE 754 behavior
  • Preserves the existing epsilon clamping logic
  • Uses vectorized operation for efficiency

Defensive Measures

cnvlib/segmetrics.py:segment_mean() (lines 361-368)

✅ GOOD - Defensive NaN filtering:

weights = cnarr["weight"].to_numpy()
log2s = cnarr["log2"].to_numpy()
valid = ~np.isnan(weights)
if valid.any():
    return np.average(log2s[valid], weights=weights[valid])
return log2s.mean()  # all weights NaN: fall back to the unweighted mean

Strengths:

  • Handles legacy .cnr files with NaN weights
  • Graceful degradation to unweighted mean when all weights are NaN
  • Efficient boolean indexing

cnvlib/segmetrics.py:calc_intervals() (lines 165-172)

✅ GOOD - Similar defensive pattern for confidence intervals:

valid = ~np.isnan(wt)
if valid.all():
    out_vals_lo[i], out_vals_hi[i] = func(vals, wt)
elif valid.any():
    out_vals_lo[i], out_vals_hi[i] = func(vals[valid], wt[valid])

Minor optimization opportunity: Could combine the conditions:

valid = ~np.isnan(wt)
if valid.any():  # Handles both all() and partial cases
    out_vals_lo[i], out_vals_hi[i] = func(vals[valid], wt[valid])

cnvlib/reports.py:group_by_genes() (lines 446-452)

✅ GOOD - Consistent defensive pattern for depth calculations.

Test Coverage

test_segment_mean_nan_weights() (lines 314-333)

✅ EXCELLENT - Comprehensive test coverage:

  • Normal weights (baseline)
  • Partial NaN weights (key scenario)
  • All NaN weights (edge case)

test_apply_weights_no_nan() (lines 335-351)

✅ EXCELLENT - Direct testing of the root cause fix:

  • Normal case with valid spread values
  • NaN spread values (reproduces original bug)
  • Verifies no NaN weights in output

Strength: Uses deterministic RNG seeds for reproducible test data.

Code Quality

Type Safety

  • Good: Proper use of .to_numpy() instead of deprecated .values
  • Good: Consistent pandas API usage throughout

Performance

  • Good: Vectorized operations (np.where, boolean indexing)
  • Good: Minimal computational overhead

Documentation

  • Excellent: Clear comments explaining IEEE 754 behavior
  • Good: Docstring updates preserve existing API contracts

Edge Cases & Considerations

Potential Issues Found:

  1. Zero-sum weights: When all valid weights sum to zero, np.average() still raises ZeroDivisionError. However, this is extremely unlikely since weights are clamped to [epsilon, 1.0].

  2. Empty valid arrays: If valid.any() is False in downstream functions, the functions silently skip processing. This is appropriate behavior.
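Item 1 above can be confirmed directly. This standalone snippet assumes nothing about CNVkit internals; `epsilon` is illustrative:

```python
import numpy as np

vals = np.array([0.1, 0.2])

# Zero-sum weights are rejected by NumPy itself.
try:
    np.average(vals, weights=np.array([0.0, 0.0]))
    raised = False
except ZeroDivisionError:
    raised = True  # "Weights sum to zero, can't be normalized"

# Once weights are clamped to [epsilon, 1.0], their sum is strictly
# positive, so this path is unreachable after the fix.
epsilon = 0.01
mean = np.average(vals, weights=np.full(2, epsilon))  # equal weights: plain mean
```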

Not Issues:

  • Memory overhead: The defensive measures add minimal memory overhead
  • Performance impact: Boolean indexing is efficient for the typical data sizes
  • Backward compatibility: Changes are fully backward compatible

Recommendations

Minor Improvements:

  1. Consolidate condition checks in calc_intervals() as shown above
  2. Consider logging when NaN weights are found to help with debugging:
    nan_count = np.isnan(weights).sum()
    if nan_count > 0:
        logging.debug("Replaced %d NaN weights with epsilon=%g", nan_count, epsilon)

Architecture:

The fix appropriately addresses the issue at multiple levels:

  • Prevention at the source (apply_weights)
  • Defense in downstream consumers
  • Testing for both scenarios

This layered approach is robust and handles both new data (fixed weights) and legacy data (defensive filtering).

Overall Assessment

✅ APPROVE - This is a high-quality fix that:

  • Correctly identifies and fixes the root cause
  • Provides appropriate defensive measures
  • Includes comprehensive test coverage
  • Maintains backward compatibility
  • Follows project coding standards

The IEEE 754 NaN behavior insight is particularly valuable, and the implementation is both correct and efficient. The fix should resolve issues #436 and #908 without introducing new problems.


@codecov (Bot) commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 68.42105% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.68%. Comparing base (a8ccdfc) to head (796201d).
⚠️ Report is 1 commit behind head on master.

Files with missing lines   Patch %   Lines
cnvlib/fix.py              25.00%    2 Missing and 1 partial ⚠️
cnvlib/segmetrics.py       80.00%    0 Missing and 2 partials ⚠️
cnvlib/reports.py          80.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1036   +/-   ##
=======================================
  Coverage   65.67%   65.68%           
=======================================
  Files          72       72           
  Lines        7362     7375   +13     
  Branches     1290     1294    +4     
=======================================
+ Hits         4835     4844    +9     
- Misses       2093     2094    +1     
- Partials      434      437    +3     
Flag Coverage Δ
unittests 65.68% <68.42%> (+<0.01%) ⬆️


etal and others added 2 commits March 9, 2026 23:50
Combine the redundant valid.all()/valid.any() branches into a single
valid.any() check — when all weights are valid, indexing with the mask
is equivalent to using the original arrays.

Also log when NaN weights are found in apply_weights() so users can
see when degenerate bins are being handled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@etal merged commit 4225b17 into master on Mar 10, 2026
15 checks passed
@etal deleted the infra branch on March 10, 2026 17:28