
Fix NaN weight handling in segmentation and metrics #1036

Merged
etal merged 5 commits into master from infra on Mar 10, 2026
Conversation

@etal (Owner) commented Mar 10, 2026

Summary

  • Fix apply_weights() to replace NaN with epsilon before np.clip(), which does not clamp NaN values — this is the root cause of NaN weights in .cnr files
  • Add defensive NaN filtering in segment_mean(), calc_intervals(), and group_by_genes() to handle .cnr files produced by older CNVkit versions
  • Add tests for NaN weight scenarios in both segment_mean and apply_weights

Addresses #436, #908.

Details

Users reported crashes during CBS segmentation and segment_mean when .cnr files contained NaN weight values. The root cause: apply_weights() computes weights via formulas that can produce NaN (e.g. from degenerate coverage or NaN spread), then calls weights.clip(epsilon, 1.0) — but np.clip(NaN, min, max) returns NaN per IEEE 754.
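The NaN pass-through is easy to reproduce in isolation. This is a standalone sketch of the failure mode and the fix pattern, not CNVkit code; the `epsilon` value here is illustrative:

```python
import numpy as np

# np.clip does not clamp NaN: IEEE 754 comparisons against NaN are
# always false, so neither bound is ever applied.
weights = np.array([0.5, np.nan, 1.7])
clipped = weights.clip(0.01, 1.0)   # NaN survives at index 1

# Replacing NaN before clipping, as the fix does at the source:
epsilon = 0.01
cleaned = np.where(np.isnan(weights), epsilon, weights).clip(epsilon, 1.0)
# cleaned has no NaN: [0.5, epsilon, 1.0]
```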

These NaN weights then caused:

  • CBS segmentation: R's CNA() drops NaN rows internally, creating a length mismatch with the weights vector (#436)
  • segment_mean / np.average: ZeroDivisionError or silent NaN propagation through sex inference and segmetrics (#908)

The fix prevents NaN weights at the source and adds defensive guards at downstream np.average() call sites.

Test plan

  • test_segment_mean_nan_weights — partial NaN, all-NaN, and clean weights
  • test_apply_weights_no_nan — NaN spread in reference produces valid weights
  • Full test suite passes (156 tests)
  • mypy clean, ruff clean, all pre-commit hooks pass
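The partial-NaN and all-NaN scenarios exercised by these tests can be sketched in a self-contained way. The helper `safe_segment_mean` below is a hypothetical stand-in mirroring the defensive pattern described in this PR, not the actual test code:

```python
import numpy as np

def safe_segment_mean(log2s, weights):
    """Weighted mean of log2 ratios that tolerates NaN weights."""
    valid = ~np.isnan(weights)
    if valid.any():
        return np.average(log2s[valid], weights=weights[valid])
    # All weights NaN: fall back to the unweighted mean.
    return log2s.mean()

log2s = np.array([0.0, 0.3, 0.6])
clean = safe_segment_mean(log2s, np.array([1.0, 1.0, 1.0]))      # 0.3
partial = safe_segment_mean(log2s, np.array([2.0, np.nan, 1.0])) # 0.2
all_nan = safe_segment_mean(log2s, np.full(3, np.nan))           # 0.3
```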

🤖 Generated with Claude Code

etal and others added 3 commits March 9, 2026 16:26
#908)

NaN values in bin weights caused two classes of failures:
- CBS segmentation: R's CNA() dropped NaN rows internally, creating a
  length mismatch with the weights vector (#436, fixed in R script by #914)
- segment_mean / np.average: NaN weights produced ZeroDivisionError or
  silent NaN propagation through sex inference and segmetrics (#908)

Root cause: apply_weights() used np.clip() to bound weights, but
np.clip(NaN, min, max) returns NaN rather than clamping to min.

Fix at three layers:
- apply_weights(): replace NaN with epsilon before clipping (prevents
  NaN weights from ever being written to .cnr files)
- segment_mean(): filter NaN weights before np.average (handles .cnr
  files from older CNVkit versions)
- calc_intervals(), group_by_genes(): guard np.average calls against
  NaN weights in bootstrap CI and gene-level depth calculations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@etal (Owner, Author) commented Mar 10, 2026

@claude Please review.

@claude (Bot) commented Mar 10, 2026

Claude finished @etal's task.


Code Review for NaN Weight Handling Fix

I'll review this PR that fixes NaN weight handling in segmentation and metrics.

Todo List:

  • Read and analyze all changed files
  • Review the core fix in apply_weights()
  • Examine defensive measures in downstream functions
  • Evaluate test coverage and quality
  • Check for potential edge cases or improvements
  • Provide comprehensive review feedback

Summary

This is a well-implemented fix for a critical issue. The root cause analysis is correct: np.clip(NaN, min, max) returns NaN per IEEE 754, which then propagates through the pipeline causing crashes in CBS segmentation and np.average() calls.

Core Fix Analysis

cnvlib/fix.py:apply_weights() (lines 553-556)

✅ EXCELLENT - The fix is placed at exactly the right location:

# NaN can arise from degenerate coverage (e.g. all-zero bin sizes or
# NaN spread values).  np.clip does not clamp NaN, so replace first.
weights = np.where(np.isnan(weights), epsilon, weights)
return cnarr.add_columns(weight=weights.clip(epsilon, 1.0))

Strengths:

  • Addresses the root cause by replacing NaN with epsilon before clipping
  • Clear documentation explaining the IEEE 754 behavior
  • Preserves the existing epsilon clamping logic
  • Uses vectorized operation for efficiency

Defensive Measures

cnvlib/segmetrics.py:segment_mean() (lines 361-368)

✅ GOOD - Defensive NaN filtering:

weights = cnarr["weight"].to_numpy()
log2s = cnarr["log2"].to_numpy()
valid = ~np.isnan(weights)
if valid.any():
    return np.average(log2s[valid], weights=weights[valid])
return log2s.mean()  # all weights NaN: fall back to the unweighted mean

Strengths:

  • Handles legacy .cnr files with NaN weights
  • Graceful degradation to unweighted mean when all weights are NaN
  • Efficient boolean indexing

cnvlib/segmetrics.py:calc_intervals() (lines 165-172)

✅ GOOD - Similar defensive pattern for confidence intervals:

valid = ~np.isnan(wt)
if valid.all():
    out_vals_lo[i], out_vals_hi[i] = func(vals, wt)
elif valid.any():
    out_vals_lo[i], out_vals_hi[i] = func(vals[valid], wt[valid])

Minor optimization opportunity: Could combine the conditions:

valid = ~np.isnan(wt)
if valid.any():  # Handles both all() and partial cases
    out_vals_lo[i], out_vals_hi[i] = func(vals[valid], wt[valid])

cnvlib/reports.py:group_by_genes() (lines 446-452)

✅ GOOD - Consistent defensive pattern for depth calculations.

Test Coverage

test_segment_mean_nan_weights() (lines 314-333)

✅ EXCELLENT - Comprehensive test coverage:

  • Normal weights (baseline)
  • Partial NaN weights (key scenario)
  • All NaN weights (edge case)

test_apply_weights_no_nan() (lines 335-351)

✅ EXCELLENT - Direct testing of the root cause fix:

  • Normal case with valid spread values
  • NaN spread values (reproduces original bug)
  • Verifies no NaN weights in output

Strength: Uses deterministic RNG seeds for reproducible test data.

Code Quality

Type Safety

  • Good: Proper use of .to_numpy() instead of deprecated .values
  • Good: Consistent pandas API usage throughout

Performance

  • Good: Vectorized operations (np.where, boolean indexing)
  • Good: Minimal computational overhead

Documentation

  • Excellent: Clear comments explaining IEEE 754 behavior
  • Good: Docstring updates preserve existing API contracts

Edge Cases & Considerations

Potential Issues Found:

  1. Zero-sum weights: When all valid weights sum to zero, np.average() still raises ZeroDivisionError. However, this is extremely unlikely since weights are clamped to [epsilon, 1.0].

  2. Empty valid arrays: If valid.any() is False in downstream functions, the functions silently skip processing. This is appropriate behavior.
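Item 1 above can be confirmed directly. This standalone snippet assumes nothing about CNVkit internals; `epsilon` is illustrative:

```python
import numpy as np

vals = np.array([0.1, 0.2])

# Zero-sum weights are rejected by NumPy itself.
try:
    np.average(vals, weights=np.array([0.0, 0.0]))
    raised = False
except ZeroDivisionError:
    raised = True  # "Weights sum to zero, can't be normalized"

# Once weights are clamped to [epsilon, 1.0], their sum is strictly
# positive, so this path is unreachable after the fix.
epsilon = 0.01
mean = np.average(vals, weights=np.full(2, epsilon))  # equal weights: plain mean
```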

Not Issues:

  • Memory overhead: The defensive measures add minimal memory overhead
  • Performance impact: Boolean indexing is efficient for the typical data sizes
  • Backward compatibility: Changes are fully backward compatible

Recommendations

Minor Improvements:

  1. Consolidate condition checks in calc_intervals() as shown above
  2. Consider logging when NaN weights are found to help with debugging:
    nan_count = np.isnan(weights).sum()
    if nan_count > 0:
        logging.debug("Replaced %d NaN weights with epsilon=%g", nan_count, epsilon)

Architecture:

The fix appropriately addresses the issue at multiple levels:

  • Prevention at the source (apply_weights)
  • Defense in downstream consumers
  • Testing for both scenarios

This layered approach is robust and handles both new data (fixed weights) and legacy data (defensive filtering).

Overall Assessment

✅ APPROVE - This is a high-quality fix that:

  • Correctly identifies and fixes the root cause
  • Provides appropriate defensive measures
  • Includes comprehensive test coverage
  • Maintains backward compatibility
  • Follows project coding standards

The IEEE 754 NaN behavior insight is particularly valuable, and the implementation is both correct and efficient. The fix should resolve issues #436 and #908 without introducing new problems.


@codecov (Bot) commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 68.42105% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.68%. Comparing base (a8ccdfc) to head (796201d).
⚠️ Report is 1 commit behind head on master.

Files with missing lines   Patch %   Lines
cnvlib/fix.py              25.00%    2 Missing and 1 partial ⚠️
cnvlib/segmetrics.py       80.00%    0 Missing and 2 partials ⚠️
cnvlib/reports.py          80.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1036   +/-   ##
=======================================
  Coverage   65.67%   65.68%           
=======================================
  Files          72       72           
  Lines        7362     7375   +13     
  Branches     1290     1294    +4     
=======================================
+ Hits         4835     4844    +9     
- Misses       2093     2094    +1     
- Partials      434      437    +3     
Flag Coverage Δ
unittests 65.68% <68.42%> (+<0.01%) ⬆️


etal and others added 2 commits March 9, 2026 23:50
Combine the redundant valid.all()/valid.any() branches into a single
valid.any() check — when all weights are valid, indexing with the mask
is equivalent to using the original arrays.

Also log when NaN weights are found in apply_weights() so users can
see when degenerate bins are being handled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@etal merged commit 4225b17 into master on Mar 10, 2026
15 checks passed
@etal deleted the infra branch on March 10, 2026 17:28