
Fix polydiff for variable step size and missing data (issue #173) #190

Merged
pavelkomarov merged 3 commits into master from fix/issue-173-polydiff-variable-step-nan
Mar 6, 2026

Conversation

@pavelkomarov
Collaborator

Summary

Fixes #173 — makes polydiff work correctly when x contains NaN values (missing data) and/or when time points are nonuniformly spaced.

Changes

pynumdiff/utils/utility.py: slide_function

  • Renamed dt -> dt_or_t, which now accepts either a scalar step size or an array of time locations
  • When dt_or_t is an array, passes the actual t[window] slice to func so inner functions see real time values and can handle nonuniform spacing
  • Fully backward compatible: scalar dt is passed through unchanged, so lineardiff and all other callers are unaffected
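To make the dispatch concrete, here is a minimal sketch of the scalar-or-array behavior described above. The signature and internals are assumptions for illustration; the real slide_function handles strides and edge weighting differently.

```python
import numpy as np

def slide_function(func, x, dt_or_t, kernel, stride=1):
    """Minimal sketch: dt_or_t may be a scalar step size (old behavior) or an
    array of time locations; an array is sliced per-window before calling func."""
    scalar_dt = np.isscalar(dt_or_t)
    t = None if scalar_dt else np.asarray(dt_or_t)
    N, w = len(x), len(kernel)
    x_hat, dxdt_hat, weight = np.zeros(N), np.zeros(N), np.zeros(N)
    for start in range(0, N - w + 1, stride):
        window = slice(start, start + w)
        # scalar dt passes through unchanged, keeping old callers working;
        # an array is sliced so func sees the window's real time values
        dt_or_t_window = dt_or_t if scalar_dt else t[window]
        xw_hat, dxw_hat = func(x[window], dt_or_t_window)
        x_hat[window] += kernel * xw_hat
        dxdt_hat[window] += kernel * dxw_hat
        weight[window] += kernel
    return x_hat / weight, dxdt_hat / weight
```

Because the scalar path is untouched, a caller like lineardiff that only ever passes dt never sees a behavior change.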

pynumdiff/polynomial_fit.py: polydiff / _polydiff

  • Renamed dt -> dt_or_t to match
  • Builds time array from scalar dt or uses the passed array directly
  • Filters NaN values from x before calling np.polyfit, so missing data no longer causes all-NaN output
  • Handles all-NaN windows gracefully by returning NaN arrays rather than crashing
  • Updated docstrings to document the new behavior

pynumdiff/tests/test_diff_methods.py

  • test_polydiff_nan_inputs: verifies NaN gaps produce finite output and accurate values outside the gap
  • test_polydiff_variable_step: verifies irregular t array gives accurate derivatives, and beats the wrong-uniform-dt baseline

pynumdiff/tests/test_utils.py

  • Extended test_slide_function to cover the array-t code path

- slide_function: accept dt_or_t as scalar or array; when array, pass
  the actual t[window] slice to func so inner functions see real time
  values and can handle nonuniform spacing (backward compatible: scalar
  dt still passed as before, so lineardiff and other callers unchanged)
- polydiff/_polydiff: rename dt -> dt_or_t; build t from scalar dt or
  use the passed array directly; filter NaN values from x before
  np.polyfit so missing data no longer causes all-NaN outputs; handle
  all-NaN windows by returning NaN arrays rather than crashing
- Tests: add test_polydiff_nan_inputs and test_polydiff_variable_step
  to test_diff_methods.py, and extend test_slide_function in
  test_utils.py to cover the array-t code path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pavelkomarov
Collaborator Author

pavelkomarov commented Mar 6, 2026

I generated this with a Claude Code instance spawned by an OpenClaw assistant, mostly playing to see how well it can do, and this is a good problem to throw at it. I think this is a particularly tricky issue to address, possibly requiring a redesign on multiple levels, so it may not get us all the way. But even most of the way, or some of the way, is a win. It's really quite something, and makes it evident why Amazon is laying off so many people.

@pavelkomarov pavelkomarov left a comment

And an overall comment: The utilities tests are failing presently, because the test_peakdet function depends on some randomness that's only initialized at the beginning of the file. If you change the order of tests by adding one that comes before it alphabetically, or if you use randomness, then the answers it ends up getting don't match the check values, even if they're not wrong. It's a fragile test that way. Apologies. Can you set it up with its own source of randomness or otherwise make it more bulletproof?

fig.suptitle(f'{diff_method.__name__}', fontsize=16)


def test_polydiff_nan_inputs():

I don't love the addition of two more tests here. You can see the way I've structured the tests is basically to commonly test all methods for correctness in the first and methods' ability to support multi-dimensional data in the second. I don't explicitly test for missing data anywhere, which is maybe a weakness, but only a few methods can handle missing data, so for those I've just checked offline directly. See Fig 57 here https://arxiv.org/pdf/2512.09090. We could add a test for just those in theory, but I'd like to cover them all by the one test, not just polydiff.

@pavelkomarov pavelkomarov Mar 6, 2026

Agreed — the two standalone tests were too narrow. Removed both and added polydiff_irreg_step as an alias in diff_methods_and_params instead, mirroring the pattern for spline_irreg_step. It now exercises all 6 test functions with and without noise, which gives broader confidence for free. Held off on a dedicated NaN-handling test since that really should cover all missing-data-capable methods rather than singling out polydiff.
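The alias pattern being described can be sketched with a stand-in method (the real dictionary lives in test_diff_methods.py with many more entries and tuned parameters; this is illustrative only):

```python
import numpy as np

def polydiff(x, dt_or_t, poly_order=3):  # stand-in for the real pynumdiff method
    t = np.arange(len(x)) * dt_or_t if np.isscalar(dt_or_t) else np.asarray(dt_or_t)
    p = np.polyfit(t, x, poly_order)
    return np.polyval(p, t), np.polyval(np.polyder(p), t)

# name -> (method, params); an '_irreg_step' suffix tells the shared test body
# to pass a jittered t array instead of a scalar dt, mirroring spline_irreg_step
diff_methods_and_params = {
    'polydiff': (polydiff, {'poly_order': 3}),
    'polydiff_irreg_step': (polydiff, {'poly_order': 3}),
}
```

The payoff is that one alias line buys coverage of every test function in the parametrized suite, rather than one bespoke scenario.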

assert np.max(np.abs(x_hat[outside_gap] - x_clean[outside_gap])) < 0.1


def test_polydiff_variable_step():

Variable step is covered in the first test in this file. See how I've done it there, and add polydiff with variable step as just another method in there. That will save lines and ensure coverage over all the test functions.

@pavelkomarov pavelkomarov Mar 6, 2026

Done. Added a polydiff_irreg_step alias (same pattern as spline_irreg_step) and wired it into the irregular-step branch in the test body so it gets t_irreg as input. Had to compute fresh error bounds — they came out nearly identical to regular polydiff, which is reassuring: the variable-step path is not noticeably less accurate than the scalar-dt path on jittered-but-close-to-uniform grids.


# When t is an array, pass the actual time values for the window so func can handle
# nonuniform spacing; otherwise pass dt as before (backward compatible).
dt_window = dt_or_t if scalar_dt else t[window]

This is the right approach if the window is a constant size in index-space, I guess. I had imagined a window in actual t-space, which would then have more or fewer index points within it depending on how dense or thinned out they were over that little region. That adds the complication of having to find which index points fall within a window, and depending on how many your optimal polynomial degree might change. So this is perhaps the easier and more amenable approach to doing polyfits. What are your thoughts?


Good question, and a real tradeoff. A window in t-space would be more principled when the data is heavily non-uniform — a dense cluster of points in a short time interval would get a well-conditioned polyfit, while a sparse region with the same index-count window might be fitting over a very wide t-range and extrapolating at the edges. On the other hand, a t-space window complicates the implementation significantly: you need to binary-search for the window boundaries, the count of points inside varies per window (so the polynomial degree might need to adapt or you might get too few points for a stable fit), and the weighted overlap accumulation in slide_function assumes a fixed kernel size. The index-space approach here sidesteps all of that and is still a meaningful improvement over the previous uniform-dt assumption — the polynomial is now fit to the actual time coordinates even if the window boundaries are in index-space. Happy to revisit if you want to pursue the t-space approach, but it would need a more substantial redesign of slide_function.
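To make the comparison concrete, the t-space window lookup could be sketched as below (a hypothetical helper, not part of this PR), which shows exactly where the varying per-window point count comes from:

```python
import numpy as np

def t_space_windows(t, half_width):
    """For each sample i, the index range whose times fall within
    [t[i] - half_width, t[i] + half_width]. The count varies per window."""
    t = np.asarray(t)
    lo = np.searchsorted(t, t - half_width, side='left')
    hi = np.searchsorted(t, t + half_width, side='right')
    return lo, hi  # window i covers indices lo[i]:hi[i]
```

Dense regions yield many points per window and sparse regions few, so the polynomial degree (or a minimum-point fallback) would have to adapt per window, which is the redesign cost weighed above.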


I think this is the right approach and maybe the key insight that makes solving this issue more tractable.


# run the function on the window and add weighted results to cumulative answers
x_window_hat, dxdt_window_hat = func(x[window], dt, *args, **kwargs)
x_window_hat, dxdt_window_hat = func(x[window], dt_window, *args, **kwargs)

Can this be named dt_or_t_window? I just want to be extra clear that it can carry either information

@pavelkomarov pavelkomarov Mar 6, 2026

Renamed. dt_or_t_window is clearer — at a glance it says both what the variable can be and that it is already sliced to the current window.

:param np.array[float] x: data to differentiate
:param float dt: step size
:param np.array[float] x: data to differentiate. May contain NaN values (missing data);
NaNs are excluded from fitting and imputed by polynomial interpolation.

Nice, love the idea of imputing. That's what the kalman filter function does with missing data in the kalman_smooth module.


Thanks for the note — worth calling out the parallel to the kalman smoother. The imputation here is implicit: windows straddling a NaN gap simply exclude the missing points from the polyfit, then evaluate the fitted polynomial at those t-values to fill them in. Effectively polynomial interpolation, which is fine for short gaps. For longer gaps or noisier data the kalman approach would be more robust, but for the sliding-window polyfit use case this keeps things self-contained.
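A tiny numeric illustration of the implicit imputation described here (values chosen purely for the example):

```python
import numpy as np

# a quadratic with a short NaN gap; the fit skips the missing samples, and
# evaluating the fitted polynomial at their t-values fills the gap back in
t = np.linspace(0, 1, 11)
x = t**2
x[4:6] = np.nan
mask = ~np.isnan(x)
p = np.polyfit(t[mask], x[mask], 2)
x_imputed = np.polyval(p, t)  # NaN positions now hold interpolated values
```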

# Filter out NaN values so polyfit doesn't fail on missing data
mask = ~np.isnan(x)
if not np.any(mask): # all NaN window — return NaNs
return np.full_like(x, np.nan, dtype=float), np.full_like(x, np.nan, dtype=float)

Do you need to specify dtype=float here? NaNs are already float values, so I think it'll work if you exclude that parameter. I prefer to be less verbose when possible.

@pavelkomarov pavelkomarov Mar 6, 2026

Removed. You are right that NaN implies float — np.full_like will infer the dtype from x, which at that point already contains NaN values (so it must be float). Unnecessary verbosity.
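The point in miniature: since x already holds NaNs at this spot in _polydiff, it must be a float array, so np.full_like inherits a float dtype without being told.

```python
import numpy as np

x = np.array([np.nan, 1.0, 2.0])   # holding a NaN forces a float dtype
filled = np.full_like(x, np.nan)   # dtype inferred from x; dtype= is redundant
```

The explicit dtype would only matter if x could be an integer array, in which case the result could not hold NaN; that cannot happen here, since the guard only fires on data that already contains NaNs.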


# Filter out NaN values so polyfit doesn't fail on missing data
mask = ~np.isnan(x)
if not np.any(mask): # all NaN window — return NaNs

Can you put the line below on this same line after the if:? I know it's nonstandard style, but it better matches my style through this codebase.

@pavelkomarov pavelkomarov Mar 6, 2026

Done. It is a one-liner guard clause — keeping it on one line makes the "early exit on degenerate input" intent obvious without taking up visual space.

mask = ~np.isnan(x)
if not np.any(mask): # all NaN window — return NaNs
return np.full_like(x, np.nan, dtype=float), np.full_like(x, np.nan, dtype=float)
w = weights[mask] if weights is not None else None

Can you put this inline in the poly fit(w=) call? Again, I know that's nonstandard style, perhaps, but since we only use the masked w weights in the one place, I prefer not to define an actual named variable for it.

@pavelkomarov pavelkomarov Mar 6, 2026

Inlined. Agreed — it only appears in one place so naming it w just adds a line of indirection for no gain.

- Rename dt_window -> dt_or_t_window in slide_function for clarity
- Inline if-body and remove dtype=float in _polydiff NaN guard
- Inline weights[mask] directly into np.polyfit call
- Replace standalone test_polydiff_nan_inputs/variable_step with polydiff_irreg_step
  entry in the main parametrized test (covers all test functions, matches codebase style)
- Fix test_peakdet to use its own RandomState(42) so it's robust to test ordering
@pavelkomarov

pavelkomarov commented Mar 6, 2026

Addressed all review feedback in commit 5201024. Summary of decisions:

  • Tests: Removed the two standalone tests in favor of a polydiff_irreg_step alias in the main parametrized suite — covers all 6 test functions rather than a single scenario, and matches the pattern already established by spline_irreg_step. Held off on a NaN-handling test since ideally it should cover all missing-data-capable methods to fit the codebase style.
  • dt_or_t_window: Better name — communicates both the type ambiguity and that it's already sliced to the window.
  • _polydiff style: Inlined the early-exit guard and the weights expression; both appear only once so naming them adds indirection without clarity.
  • test_peakdet: Gave it a local RandomState(42) so it's no longer sensitive to test ordering.

All 280 tests pass.

@pavelkomarov pavelkomarov merged commit c1fe325 into master Mar 6, 2026
1 check passed
@pavelkomarov pavelkomarov deleted the fix/issue-173-polydiff-variable-step-nan branch March 10, 2026 22:21
Successfully merging this pull request may close these issues.

Make PolyDiff work with variable step size and missing data