
statistical_inefficiency does not slice data frame #198

Closed
ptmerz opened this issue Jun 22, 2022 · 1 comment · Fixed by #199

Comments

ptmerz (Contributor) commented Jun 22, 2022

Description

When using alchemlyb.preprocessing.statistical_inefficiency with a data frame, a series to calculate statistical inefficiency on, and slicing parameters (lower, upper, step), the slicing is only applied to the series, but not to the data frame. Since statistical_inefficiency uses a raw index to subsample the data frame, the series and the data frame are out of sync by the end of the function. This leads to results that seem, at the very least, unintuitive.
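To make the mechanism concrete, here is a minimal, self-contained sketch of the failure mode using plain pandas and toy data (the names and numbers are illustrative only, not alchemlyb code): slicing only the series and then applying its positional indices to the unsliced data frame shifts the subsampled window.

```python
import pandas as pd

# Toy stand-ins: a "data frame" indexed by time and a series taken from it.
times = range(0, 10000, 10)
df = pd.DataFrame({"u": [0.1 * t for t in times]},
                  index=pd.Index(times, name="time"))
series = df["u"]

lower, upper = 1000, 5000

# Slice only the series, as the buggy code path effectively does ...
sliced_series = series.loc[lower:upper]

# ... then apply the *positional* indices chosen from the sliced series
# back to the unsliced data frame (here a plain stride stands in for
# the statistical-inefficiency subsampling).
picked_positions = list(range(0, len(sliced_series), 5))
subsampled = df.iloc[picked_positions]

# The window has shifted: it starts at time 0 and ends at upper - lower.
print(subsampled.index.min(), subsampled.index.max())  # prints: 0 4000
```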

Examples

I think that a few examples make this much easier to explain. Assume I have loaded some data, e.g.

from alchemlyb.parsing.gmx import extract_u_nk
data = extract_u_nk('lambda_0.xvg', T=298.15)

and want to use only part of this data, e.g.

lower = 1000
upper = 5000

then I would expect a few things:

  1. Slicing removes times before and after my set limits → this works ✅
from alchemlyb.preprocessing import slicing
sliced_data = slicing(data, lower=lower, upper=upper)
assert all(sliced_data.reset_index()['time'] >= lower)
assert all(sliced_data.reset_index()['time'] <= upper)
  2. Using the slicing arguments with statistical_inefficiency removes times before and after my set limits → this does not work ❌
from alchemlyb.preprocessing import statistical_inefficiency
subsampled_data = statistical_inefficiency(data, data.columns[0], lower=lower, upper=upper)
assert all(subsampled_data.reset_index()['time'] >= lower)
assert all(subsampled_data.reset_index()['time'] <= upper)

Specifically, I find that the times are shifted, i.e. that the lowest time of subsampled_data is 0, and the highest time is upper - lower.

  3. Using slicing on the data frame before running statistical_inefficiency, or running statistical_inefficiency with slicing parameters leads to the same result → this does not work ❌
sliced_data = slicing(data, lower=lower, upper=upper)
subsampled_sliced_data = statistical_inefficiency(sliced_data, data.columns[0])

subsampled_data = statistical_inefficiency(data, data.columns[0], lower=lower, upper=upper)

assert (subsampled_data == subsampled_sliced_data).all(axis=None)

The problem here is, unsurprisingly, similar to the one above: subsampled_data starts at time 0, ends at time upper - lower, whereas subsampled_sliced_data starts at time lower and ends at time upper.

  4. The issues demonstrated in cases 2 and 3 are especially problematic if we imagine a use case in which we would repeatedly sample from the same data, e.g.
window_size=1000
for window_idx in range(5):
    lower = window_idx * window_size
    upper = (window_idx + 1) * window_size
    subsampled_data = statistical_inefficiency(data, data.columns[0], lower=lower, upper=upper)
    # ...
    # further analysis of subsampled_data for window here
    # ...

My expectation here would be to analyze windows of different data at every iteration. The current implementation would, however, always subsample from the window 0 to 1000. (The subsampled data might differ between windows, since the statistical inefficiency is actually calculated on a different window of the series each time, but the resulting indices are applied to the same window of the data frame every time.)
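Under the broken behavior, the windowed loop above can be mimicked with a toy sketch (plain pandas; `buggy_subsample` is a hypothetical stand-in for the faulty code path, not alchemlyb itself) showing that every window subsamples the same rows:

```python
import pandas as pd

times = pd.Index(range(0, 10000, 10), name="time")
df = pd.DataFrame({"u": range(len(times))}, index=times)

def buggy_subsample(df, lower, upper, stride=5):
    # Mimics the broken code path: slice only the series, compute
    # positions from it, then apply them to the UNSLICED frame.
    series = df["u"].loc[lower:upper]
    positions = list(range(0, len(series), stride))
    return df.iloc[positions]

w0 = buggy_subsample(df, lower=0, upper=1000)
w1 = buggy_subsample(df, lower=1000, upper=2000)

# Both "windows" come from the same leading rows of the frame:
assert (w0.index == w1.index).all()
```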

  5. These issues could lead to correlated samples when the step keyword is used. Imagine that we have data which is correlated such that the conservative subsampling algorithm picks every 10th frame. If we were to use statistical_inefficiency with step=10 (by chance, or because we suspect we sampled the data too frequently), then statistical_inefficiency would return every frame of the first 10% of the data! We would have lost a lot of information, and would continue working with heavily correlated data.
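A back-of-the-envelope version of this scenario (toy numbers only, assuming a correlation time of about 10 frames):

```python
# Why step=10 can return *every* frame of the first 10% of the data.
n_frames = 1000
step = 10

# After slicing only the series with step=10, it has n_frames/step entries.
series_len = n_frames // step  # 100

# If the data are correlated over ~10 frames, the stepped series is
# effectively uncorrelated, so subsampling keeps every entry of it:
kept_series_positions = list(range(series_len))  # 0, 1, ..., 99

# The buggy code applies these *positional* indices to the UNSLICED
# data frame, i.e. frames 0..99: the first 10% of the data, spaced one
# frame apart and therefore still heavily correlated.
kept_frames = kept_series_positions
```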

Proposed solution

I will propose a change shortly which adds tests (in line with examples 1 to 3) and a fix for this behavior. I think the most stable fix is to call slicing on both the series and the dataframe. It might be possible to calculate an offset index plus a stride to avoid slicing the dataframe, but that seems like pointless overengineering since there is a slicing function readily available that makes sure that the series and dataframe slicing will always be done in the same way.
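In rough pandas terms, the fix amounts to something like the following sketch (`_slice` and `subsample_in_sync` are hypothetical stand-ins for illustration, not the actual alchemlyb API):

```python
import pandas as pd

def _slice(obj, lower=None, upper=None, step=None):
    # Simplified stand-in for alchemlyb.preprocessing.slicing:
    # select rows by time label, then apply the stride.
    return obj.loc[lower:upper:step]

def subsample_in_sync(df, series, lower=None, upper=None, step=None):
    # Slice BOTH the data frame and the series the same way, so that
    # positional indices computed from the series afterwards refer to
    # the same rows in both objects.
    df = _slice(df, lower, upper, step)
    series = _slice(series, lower, upper, step)
    # (The statistical-inefficiency subsampling of `series` would happen
    # here; its picked positions can now safely be applied to `df`.)
    return df, series

# Toy data indexed by time:
times = pd.Index(range(0, 10000, 10), name="time")
df = pd.DataFrame({"u": range(len(times))}, index=times)

sub_df, sub_series = subsample_in_sync(df, df["u"], lower=1000, upper=5000)
assert (sub_df.index == sub_series.index).all()
```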

PS: Maybe I've misunderstood everything and this is intended behavior. In this case, please close the PR, but I would then suggest to add some documentation (ideally also a note when calling the function) warning about this behavior 🙂

ptmerz added a commit to ptmerz/alchemlyb that referenced this issue Jun 22, 2022
Make sure that both the subsampling series and the dataframe are sliced
when calling statistical_inefficiency(). This ensures that the series and
the dataframe are in sync before the subsampled indices of the series
are applied to the dataframe.

Fixes alchemistry#198
@xiki-tempula
Collaborator

@ptmerz Many thanks for reporting this and for coming up with the PR. I will have a look at it.

@orbeckst orbeckst added the bug label Jul 6, 2022
orbeckst pushed a commit that referenced this issue Jul 7, 2022
* Fixes #198
* Sync series and dataframe when subsampling:
  Make sure that both the subsampling series and the dataframe are sliced
  when calling statistical_inefficiency(). This ensures that the series and
  the dataframe are in sync before the subsampled indices of the series
  are applied to the dataframe.
* Add tests: This adds a number of tests for slicing() and statistical_inefficiency():
  * Test that slicing() respects upper and lower time bounds (currently
    passes)
  * Test that statistical_inefficiency() respects upper and lower time bounds
    when it is used without a series to subsample (currently passes)
  * Test that statistical_inefficiency() respects upper and lower time bounds
    when it is used with a series to subsample (currently fails)
  * Test that first using slicing() on the data frame, then
    statistical_inefficiency() without time bounds yields the same results as
    a single call to statistical_inefficiency() with time bounds (currently
    fails)
  * Extract input sanity check in separate function
  * Extract test for unchanged data in separate test
  * Parametrize tests of statistical_inefficiency to merge cases
    which use `series` with the ones that don't
  * Parametrize all tests to use `conservative` `True` and `False`
  * Remove conservative keyword where it didn't belong
  * Use pytest fixture to load data at test time
* Add notes to docstring, CHANGES, and AUTHORS