
statistical_inefficiency does not slice data frame #198

Closed
ptmerz opened this issue Jun 22, 2022 · 1 comment · Fixed by #199

Comments

ptmerz (Contributor) commented Jun 22, 2022

Description

When using alchemlyb.preprocessing.statistical_inefficiency with a data frame, a series to calculate statistical inefficiency on, and slicing parameters (lower, upper, step), the slicing is only applied to the series, but not to the data frame. Since statistical_inefficiency uses a raw index to subsample the data frame, the series and the data frame are out of sync by the end of the function. This leads to results that seem, at the very least, unintuitive.
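To make the mechanism concrete, here is a minimal, self-contained sketch of the failure mode using plain pandas and toy data (the names and numbers are illustrative only, not alchemlyb code): slicing only the series and then applying its positional indices to the unsliced data frame shifts the subsampled window.

```python
import pandas as pd

# Toy stand-ins: a "data frame" indexed by time and a series taken from it.
times = range(0, 10000, 10)
df = pd.DataFrame({"u": [0.1 * t for t in times]},
                  index=pd.Index(times, name="time"))
series = df["u"]

lower, upper = 1000, 5000

# Slice only the series, as the buggy code path effectively does ...
sliced_series = series.loc[lower:upper]

# ... then apply the *positional* indices chosen from the sliced series
# back to the unsliced data frame (here a plain stride stands in for
# the statistical-inefficiency subsampling).
picked_positions = list(range(0, len(sliced_series), 5))
subsampled = df.iloc[picked_positions]

# The window has shifted: it starts at time 0 and ends at upper - lower.
print(subsampled.index.min(), subsampled.index.max())  # prints: 0 4000
```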

Examples

I think that a few examples make this much easier to explain. Assume I have loaded some data, e.g.

from alchemlyb.parsing.gmx import extract_u_nk
data = extract_u_nk('lambda_0.xvg', T=298.15)

and want to use only part of this data, e.g.

lower = 1000
upper = 5000

then I would expect a few things:

  1. Slicing removes times before and after my set limits → this works ✅
from alchemlyb.preprocessing import slicing
sliced_data = slicing(data, lower=lower, upper=upper)
assert all(sliced_data.reset_index()['time'] >= lower)
assert all(sliced_data.reset_index()['time'] <= upper)
  2. Using the slicing arguments with statistical_inefficiency removes times before and after my set limits → this does not work ❌
from alchemlyb.preprocessing import statistical_inefficiency
subsampled_data = statistical_inefficiency(data, data.columns[0], lower=lower, upper=upper)
assert all(subsampled_data.reset_index()['time'] >= lower)
assert all(subsampled_data.reset_index()['time'] <= upper)

Specifically, I find that the times are shifted, i.e. that the lowest time of subsampled_data is 0, and the highest time is upper - lower.

  3. Using slicing on the data frame before running statistical_inefficiency, or running statistical_inefficiency with slicing parameters leads to the same result → this does not work ❌
sliced_data = slicing(data, lower=lower, upper=upper)
subsampled_sliced_data = statistical_inefficiency(sliced_data, data.columns[0])

subsampled_data = statistical_inefficiency(data, data.columns[0], lower=lower, upper=upper)

assert (subsampled_data == subsampled_sliced_data).all(axis=None)

The problem here is, unsurprisingly, similar to the one above: subsampled_data starts at time 0, ends at time upper - lower, whereas subsampled_sliced_data starts at time lower and ends at time upper.

  4. The issues demonstrated in cases 2 and 3 are especially problematic if we imagine a use case in which we would repeatedly sample from the same data, e.g.
window_size=1000
for window_idx in range(5):
    lower = window_idx * window_size
    upper = (window_idx + 1) * window_size
    subsampled_data = statistical_inefficiency(data, data.columns[0], lower=lower, upper=upper)
    # ...
    # further analysis of subsampled_data for window here
    # ...

My expectation here would be to analyze windows of different data at every iteration. The current implementation would, however, always subsample from the window 0 to 1000. (The subsampled data might differ between windows, since the statistical inefficiency is actually calculated on a different window of the series each time, but the resulting indices are applied to the same window of the data frame every time.)
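Under the broken behavior, the windowed loop above can be mimicked with a toy sketch (plain pandas; `buggy_subsample` is a hypothetical stand-in for the faulty code path, not alchemlyb itself) showing that every window subsamples the same rows:

```python
import pandas as pd

times = pd.Index(range(0, 10000, 10), name="time")
df = pd.DataFrame({"u": range(len(times))}, index=times)

def buggy_subsample(df, lower, upper, stride=5):
    # Mimics the broken code path: slice only the series, compute
    # positions from it, then apply them to the UNSLICED frame.
    series = df["u"].loc[lower:upper]
    positions = list(range(0, len(series), stride))
    return df.iloc[positions]

w0 = buggy_subsample(df, lower=0, upper=1000)
w1 = buggy_subsample(df, lower=1000, upper=2000)

# Both "windows" come from the same leading rows of the frame:
assert (w0.index == w1.index).all()
```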

  5. These issues could lead to correlated samples when the step keyword is used. Imagine that we have data which is correlated such that the conservative subsampling algorithm picks every 10th frame. If we were to use statistical_inefficiency with step=10 (by chance, or because we suspect we sampled the data too frequently), then statistical_inefficiency would return every frame of the first 10% of the data! We would have lost a lot of information, and would continue working with heavily correlated data.
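A back-of-the-envelope version of this scenario (toy numbers only, assuming a correlation time of about 10 frames):

```python
# Why step=10 can return *every* frame of the first 10% of the data.
n_frames = 1000
step = 10

# After slicing only the series with step=10, it has n_frames/step entries.
series_len = n_frames // step  # 100

# If the data are correlated over ~10 frames, the stepped series is
# effectively uncorrelated, so subsampling keeps every entry of it:
kept_series_positions = list(range(series_len))  # 0, 1, ..., 99

# The buggy code applies these *positional* indices to the UNSLICED
# data frame, i.e. frames 0..99: the first 10% of the data, spaced one
# frame apart and therefore still heavily correlated.
kept_frames = kept_series_positions
```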

Proposed solution

I will propose a change shortly which adds tests (in line with examples 1 to 3) and a fix for this behavior. I think the most stable fix is to call slicing on both the series and the dataframe. It might be possible to calculate an offset index plus a stride to avoid slicing the dataframe, but that seems like pointless overengineering since there is a slicing function readily available that makes sure that the series and dataframe slicing will always be done in the same way.
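In rough pandas terms, the fix amounts to something like the following sketch (`_slice` and `subsample_in_sync` are hypothetical stand-ins for illustration, not the actual alchemlyb API):

```python
import pandas as pd

def _slice(obj, lower=None, upper=None, step=None):
    # Simplified stand-in for alchemlyb.preprocessing.slicing:
    # select rows by time label, then apply the stride.
    return obj.loc[lower:upper:step]

def subsample_in_sync(df, series, lower=None, upper=None, step=None):
    # Slice BOTH the data frame and the series the same way, so that
    # positional indices computed from the series afterwards refer to
    # the same rows in both objects.
    df = _slice(df, lower, upper, step)
    series = _slice(series, lower, upper, step)
    # (The statistical-inefficiency subsampling of `series` would happen
    # here; its picked positions can now safely be applied to `df`.)
    return df, series

# Toy data indexed by time:
times = pd.Index(range(0, 10000, 10), name="time")
df = pd.DataFrame({"u": range(len(times))}, index=times)

sub_df, sub_series = subsample_in_sync(df, df["u"], lower=1000, upper=5000)
assert (sub_df.index == sub_series.index).all()
```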

PS: Maybe I've misunderstood everything and this is intended behavior. In this case, please close the PR, but I would then suggest to add some documentation (ideally also a note when calling the function) warning about this behavior 🙂

ptmerz added a commit to ptmerz/alchemlyb that referenced this issue Jun 22, 2022
Make sure that both the subsampling series and the dataframe are sliced
when calling statistical_inefficiency(). This ensures that the series and
the dataframe are in sync before the subsampled indices of the series
are applied to the dataframe.

Fixes alchemistry#198
@xiki-tempula
Collaborator

@ptmerz Many thanks for reporting this and for coming up with the PR. I will have a look at it.

@orbeckst orbeckst added the bug label Jul 6, 2022
orbeckst pushed a commit that referenced this issue Jul 7, 2022
* Fixes #198
* Sync series and dataframe when subsampling:
  Make sure that both the subsampling series and the dataframe are sliced
  when calling statistical_inefficiency(). This ensures that the series and
  the dataframe are in sync before the subsampled indices of the series
  are applied to the dataframe.
* Add tests: This adds a number of tests for slicing() and statistical_inefficiency():
  * Test that slicing() respects upper and lower time bounds (currently
    passes)
  * Test that statistical_inefficiency() respects upper and lower time bounds
    when it is used without a series to subsample (currently passes)
  * Test that statistical_inefficiency() respects upper and lower time bounds
    when it is used with a series to subsample (currently fails)
  * Test that first using slicing() on the data frame, then
    statistical_inefficiency() without time bounds yields the same results as
    a single call to statistical_inefficiency() with time bounds (currently
    fails)
  * Extract input sanity check in separate function
  * Extract test for unchanged data in separate test
  * Parametrize tests of statistical_inefficiency to merge cases
    which use `series` with the ones that don't
  * Parametrize all tests to use `conservative` `True` and `False`
  * Remove conservative keyword where it didn't belong
  * Use pytest fixture to load data at test time
* Add notes to docstring, CHANGES, and AUTHORS