Fix polydiff for variable step size and missing data (issue #173) #190
pavelkomarov merged 3 commits into master
Conversation
- slide_function: accept dt_or_t as a scalar or array; when it's an array, pass the actual t[window] slice to func so inner functions see real time values and can handle nonuniform spacing (backward compatible: a scalar dt is still passed as before, so lineardiff and other callers are unchanged)
- polydiff/_polydiff: rename dt -> dt_or_t; build t from a scalar dt or use the passed array directly; filter NaN values from x before np.polyfit so missing data no longer causes all-NaN outputs; handle all-NaN windows by returning NaN arrays rather than crashing
- Tests: add test_polydiff_nan_inputs and test_polydiff_variable_step to test_diff_methods.py, and extend test_slide_function in test_utils.py to cover the array-t code path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
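The dispatch the first bullet describes can be sketched as follows. This is a toy illustration, not the real slide_function (which also does weighted accumulation of overlapping windows); the helper name and signature here are hypothetical.

```python
import numpy as np

def slide_function_sketch(func, x, dt_or_t, window=5):
    """Toy sketch of the scalar-or-array dispatch: a scalar dt is
    forwarded unchanged, while an array t is sliced so the inner
    func sees the window's actual time values."""
    scalar_dt = np.isscalar(dt_or_t)
    out = np.empty(len(x))
    for i in range(len(x) - window + 1):
        w = slice(i, i + window)
        # backward compatible: scalar passes through; array gets sliced
        dt_or_t_window = dt_or_t if scalar_dt else np.asarray(dt_or_t)[w]
        out[w] = func(x[w], dt_or_t_window)
    return out
```

Because the scalar path is untouched, existing callers like lineardiff keep working without modification.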
I generated this thing with a Claude Code instance spawned by an OpenClaw assistant. Mostly playing to see how well it can do, and this is a good problem to throw it at. I think this is a particularly tricky issue to address, with a redesign on multiple levels maybe necessary, so it may not get us all the way. But even most of the way of some of the way is a win. It's really quite something and makes evident why Amazon is laying off so many people. |
pavelkomarov left a comment
And an overall comment: The utilities tests are failing presently, because the test_peakdet function depends on some randomness that's only initialized at the beginning of the file. If you change the order of tests by adding one that comes before it alphabetically, or if you use randomness, then the answers it ends up getting don't match the check values, even if they're not wrong. It's a fragile test that way. Apologies. Can you set it up with its own source of randomness or otherwise make it more bulletproof?
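One way to make the test order-independent, as requested, is to give it a private RandomState so it no longer consumes the module-level stream. A minimal sketch; the helper name and signal construction are illustrative, not the actual test_peakdet body.

```python
import numpy as np

def make_test_signal(seed=42):
    """Build the noisy test signal from a private RandomState, so the
    result is identical no matter which other tests ran first or how
    much global randomness they consumed."""
    rng = np.random.RandomState(seed)  # test-local source of randomness
    t = np.linspace(0, 4 * np.pi, 100)
    return np.sin(t) + rng.normal(0, 0.1, size=t.size)

def test_peakdet():
    x = make_test_signal()
    # hard-coded check values are now safe: x is fully reproducible
    assert np.array_equal(x, make_test_signal())
```

Adding or reordering tests in the file then has no effect on the values this test sees.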
pynumdiff/tests/test_diff_methods.py
Outdated
fig.suptitle(f'{diff_method.__name__}', fontsize=16)

def test_polydiff_nan_inputs():
I don't love the addition of two more tests here. You can see the way I've structured the tests: the first commonly tests all methods for correctness, and the second tests each method's ability to support multi-dimensional data. I don't explicitly test for missing data anywhere, which is maybe a weakness, but only a few methods can handle missing data, so for those I've just checked offline directly. See Fig 57 here https://arxiv.org/pdf/2512.09090. We could add a test for just those in theory, but I'd like to cover them all by the one test, not just polydiff.
Agreed — the two standalone tests were too narrow. Removed both and added polydiff_irreg_step as an alias in diff_methods_and_params instead, mirroring the pattern for spline_irreg_step. It now exercises all 6 test functions with and without noise, which gives broader confidence for free. Held off on a dedicated NaN-handling test since that really should cover all missing-data-capable methods rather than singling out polydiff.
pynumdiff/tests/test_diff_methods.py
Outdated
assert np.max(np.abs(x_hat[outside_gap] - x_clean[outside_gap])) < 0.1

def test_polydiff_variable_step():
Variable step is covered in the first test in this file. See how I've done it there, and add polydiff with variable step as just another method in there. That will save lines and ensure coverage over all the test functions.
Done. Added a polydiff_irreg_step alias (same pattern as spline_irreg_step) and wired it into the irregular-step branch in the test body so it gets t_irreg as input. Had to compute fresh error bounds — they came out nearly identical to regular polydiff, which is reassuring: the variable-step path is not noticeably less accurate than the scalar-dt path on jittered-but-close-to-uniform grids.
pynumdiff/utils/utility.py
Outdated
# When t is an array, pass the actual time values for the window so func can handle
# nonuniform spacing; otherwise pass dt as before (backward compatible).
dt_window = dt_or_t if scalar_dt else t[window]
This is the right approach if the window is a constant size in index-space, I guess. I had imagined a window in actual t-space, which would then have more or fewer index points within it depending on how dense or thinned out they were over that little region. That adds the complication of having to find which index points fall within a window, and depending on how many your optimal polynomial degree might change. So this is perhaps the easier and more amenable approach to doing polyfits. What are your thoughts?
Good question, and a real tradeoff. A window in t-space would be more principled when the data is heavily non-uniform — a dense cluster of points in a short time interval would get a well-conditioned polyfit, while a sparse region with the same index-count window might be fitting over a very wide t-range and extrapolating at the edges. On the other hand, a t-space window complicates the implementation significantly: you need to binary-search for the window boundaries, the count of points inside varies per window (so the polynomial degree might need to adapt or you might get too few points for a stable fit), and the weighted overlap accumulation in slide_function assumes a fixed kernel size. The index-space approach here sidesteps all of that and is still a meaningful improvement over the previous uniform-dt assumption — the polynomial is now fit to the actual time coordinates even if the window boundaries are in index-space. Happy to revisit if you want to pursue the t-space approach, but it would need a more substantial redesign of slide_function.
I think this is the right approach and maybe the key insight that makes solving this issue more tractable.
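The index-space-window idea settled on above can be shown in miniature: the window has a constant number of points, but the fit uses the window's actual time values, so nonuniform spacing is absorbed by the polynomial itself. The helper name `_polydiff_window` is hypothetical, a sketch rather than the pynumdiff source.

```python
import numpy as np

def _polydiff_window(x_w, t_w, degree=2):
    """Fit one fixed-size (in index space) window against its real
    time values, then return the fitted values and derivative."""
    coeffs = np.polyfit(t_w, x_w, degree)  # real t, not a uniform 0..k*dt grid
    return np.polyval(coeffs, t_w), np.polyval(np.polyder(coeffs), t_w)

# irregularly spaced grid; with x = t^2, dx/dt should come back as 2t
t = np.sort(np.random.RandomState(0).uniform(0, 1, 10))
x_hat, dxdt_hat = _polydiff_window(t**2, t)
```

A t-space window would instead vary the point count per window, which is where the degree-adaptation and boundary-search complications discussed above come in.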
pynumdiff/utils/utility.py
Outdated
# run the function on the window and add weighted results to cumulative answers
- x_window_hat, dxdt_window_hat = func(x[window], dt, *args, **kwargs)
+ x_window_hat, dxdt_window_hat = func(x[window], dt_window, *args, **kwargs)
Can this be named dt_or_t_window? I just want to be extra clear that it can carry either kind of information.
Renamed. dt_or_t_window is clearer — at a glance it says both what the variable can be and that it is already sliced to the current window.
pynumdiff/polynomial_fit.py
Outdated
- :param np.array[float] x: data to differentiate
- :param float dt: step size
+ :param np.array[float] x: data to differentiate. May contain NaN values (missing data);
+     NaNs are excluded from fitting and imputed by polynomial interpolation.
Nice, love the idea of imputing. That's what the kalman filter function does with missing data in the kalman_smooth module.
Thanks for the note — worth calling out the parallel to the kalman smoother. The imputation here is implicit: windows straddling a NaN gap simply exclude the missing points from the polyfit, then evaluate the fitted polynomial at those t-values to fill them in. Effectively polynomial interpolation, which is fine for short gaps. For longer gaps or noisier data the kalman approach would be more robust, but for the sliding-window polyfit use case this keeps things self-contained.
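The implicit imputation described above can be sketched in a few lines: mask the NaNs out of the fit, then evaluate the fitted polynomial at every t, which fills the missing positions as a side effect. Illustrative only; the function name and all-NaN guard mirror the PR's description, not its exact source.

```python
import numpy as np

def fit_window_with_nans(x, t, degree=2):
    """Exclude NaN samples from the polyfit, then evaluate the fitted
    polynomial at every t, imputing the missing positions."""
    mask = ~np.isnan(x)
    if not np.any(mask):  # all-NaN window: nothing to fit
        return np.full_like(x, np.nan), np.full_like(x, np.nan)
    coeffs = np.polyfit(t[mask], x[mask], degree)
    return np.polyval(coeffs, t), np.polyval(np.polyder(coeffs), t)

t = np.linspace(0, 1, 11)
x = t**2
x[4] = np.nan  # a short gap of missing data
x_hat, dxdt_hat = fit_window_with_nans(x, t)
```

For a short gap like this the polynomial interpolation is accurate; as noted above, long gaps or heavy noise would favor a kalman-style approach instead.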
pynumdiff/polynomial_fit.py
Outdated
# Filter out NaN values so polyfit doesn't fail on missing data
mask = ~np.isnan(x)
if not np.any(mask):  # all NaN window — return NaNs
    return np.full_like(x, np.nan, dtype=float), np.full_like(x, np.nan, dtype=float)
Do you need to specify dtype=float here? NaNs are already float values, so I think it'll work if you exclude that parameter. I prefer to be less verbose when possible.
Removed. You are right that NaN implies float — np.full_like will infer the dtype from x, which at that point already contains NaN values (so it must be float). Unnecessary verbosity.
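A quick check of the point: by the time the guard runs, x already holds NaN, so it must be a float array, and np.full_like inherits that dtype without the explicit parameter.

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])  # any array containing NaN is float already
filled = np.full_like(x, np.nan)  # dtype inferred from x: float64
# dtype=float would only matter if x could be an integer array,
# which it cannot be here
assert filled.dtype == np.float64 and np.isnan(filled).all()
```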
pynumdiff/polynomial_fit.py
Outdated
# Filter out NaN values so polyfit doesn't fail on missing data
mask = ~np.isnan(x)
if not np.any(mask):  # all NaN window — return NaNs
Can you put the line below on this same line after the if:? I know it's nonstandard style, but it better matches my style through this codebase.
Done. It is a one-liner guard clause — keeping it on one line makes the "early exit on degenerate input" intent obvious without taking up visual space.
pynumdiff/polynomial_fit.py
Outdated
mask = ~np.isnan(x)
if not np.any(mask):  # all NaN window — return NaNs
    return np.full_like(x, np.nan, dtype=float), np.full_like(x, np.nan, dtype=float)
w = weights[mask] if weights is not None else None
Can you put this inline in the np.polyfit(w=) call? Again, I know that's nonstandard style, perhaps, but since we only use the masked weights in the one place, I prefer not to define a named variable for it.
Inlined. Agreed — it only appears in one place so naming it w just adds a line of indirection for no gain.
Addressed all review feedback in commit 5201024. Summary of decisions:

- Rename dt_window -> dt_or_t_window in slide_function for clarity
- Inline the if-body and remove dtype=float in the _polydiff NaN guard
- Inline weights[mask] directly into the np.polyfit call
- Replace the standalone test_polydiff_nan_inputs/test_polydiff_variable_step with a polydiff_irreg_step entry in the main parametrized test (covers all test functions, matches codebase style)
- Fix test_peakdet to use its own RandomState(42) so it's robust to test ordering

All 280 tests pass.
Summary
Fixes #173 — makes polydiff work correctly when x contains NaN values (missing data) and/or when time points are nonuniformly spaced.

Changes

pynumdiff/utils/utility.py — slide_function:
- dt → dt_or_t, accepting either a scalar step size or an array of time locations
- when dt_or_t is an array, the actual t[window] slice is passed to func so inner functions see real time values and can handle nonuniform spacing
- a scalar dt is passed through unchanged, so lineardiff and all other callers are unaffected

pynumdiff/polynomial_fit.py — polydiff/_polydiff:
- dt → dt_or_t to match
- builds t from a scalar dt or uses the passed array directly
- filters NaN values from x before calling np.polyfit, so missing data no longer causes all-NaN output

pynumdiff/tests/test_diff_methods.py:
- test_polydiff_nan_inputs: verifies NaN gaps produce finite output and accurate values outside the gap
- test_polydiff_variable_step: verifies an irregular t array gives accurate derivatives, and beats the wrong-uniform-dt baseline

pynumdiff/tests/test_utils.py:
- extends test_slide_function to cover the array-t code path
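Putting the pieces of the summary together, the overall behavior can be sketched end to end. This is a toy stand-in, not the pynumdiff implementation (which uses weighted overlap accumulation and different parameter names): it accepts a scalar dt or an array t, tolerates NaNs, fits a polynomial per index-space window, and averages overlapping estimates.

```python
import numpy as np

def polydiff_sketch(x, dt_or_t, degree=2, window=7):
    """Toy end-to-end sketch of the PR's behavior: scalar-or-array time,
    NaN-tolerant per-window polynomial fits, averaged overlaps."""
    t = np.arange(len(x)) * dt_or_t if np.isscalar(dt_or_t) else np.asarray(dt_or_t)
    dxdt = np.zeros(len(x))
    counts = np.zeros(len(x))
    for i in range(len(x) - window + 1):
        w = slice(i, i + window)
        m = ~np.isnan(x[w])
        if m.sum() <= degree:  # too few valid points for a stable fit
            continue
        coeffs = np.polyfit(t[w][m], x[w][m], degree)  # fit against real t
        dxdt[w] += np.polyval(np.polyder(coeffs), t[w])
        counts[w] += 1
    return dxdt / np.maximum(counts, 1)
```

On an irregular grid with a NaN gap, a quadratic signal comes back with its derivative recovered everywhere, including at the missing sample, which is the fixed behavior issue #173 asked for.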