
change the subsampling method #38
Closed · wants to merge 2 commits
Conversation

@shuail (Collaborator) commented Nov 14, 2017

Change the subsampling code to use the pymbar module to get the subsampled indices. The original code uses the pandas slicing function, which requires the slicing step to be an integer. In theory I don't think we want to force the subsampling step to be an integer; I will open an issue to discuss this.

Commit: …mpled indexes, the original method uses the slicing function in pandas, which requires the slicing step to be an integer
@codecov-io commented Nov 14, 2017

Codecov Report

Merging #38 into master will decrease coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
- Coverage   97.59%   97.58%   -0.02%     
==========================================
  Files           9        9              
  Lines         374      372       -2     
  Branches       59       59              
==========================================
- Hits          365      363       -2     
  Misses          4        4              
  Partials        5        5
Impacted Files                               Coverage Δ
src/alchemlyb/preprocessing/subsampling.py   95.12% <100%> (-0.23%) ⬇️
src/alchemlyb/parsing/amber.py               98.24% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5173d17...cc2dfd1. Read the comment docs.

@orbeckst (Member) left a comment

I just had a quick look; I don't really understand the changes well enough yet.

@@ -4,6 +4,7 @@
import numpy as np
from pymbar.timeseries import statisticalInefficiency
from pymbar.timeseries import detectEquilibration
from pymbar.timeseries import subsampleCorrelatedData
Member:

Import these together (combine all three lines into one; no idea why they were separate):

from pymbar.timeseries import statisticalInefficiency, detectEquilibration, subsampleCorrelatedData

Collaborator (Author):

@orbeckst yes, changed to one line

df = df.loc[series.index]

#use the subsampleCorrelatedData function to get the subsample index
indices = subsampleCorrelatedData(series, g=statinef)
Member:

Can you briefly explain what subsampleCorrelatedData does (and maybe link to the code in their repo)?

Collaborator (Author):

@orbeckst yes, the link to the code is https://github.com/choderalab/pymbar/blob/ac648e63787592a251992a764876713a64f17455/pymbar/timeseries.py#L632. It basically subsamples the time series (series) based on the given g (statistical inefficiency).
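
For context, a minimal standalone illustration of subsampleCorrelatedData on made-up numpy data (a toy series, not alchemlyb's DataFrames; assumes the pymbar 3.x signature). Given g, it returns integer positions of approximately uncorrelated samples, spaced roughly g frames apart:

import numpy as np
from pymbar.timeseries import subsampleCorrelatedData

# Toy correlated series: smoothed Gaussian noise, 1,000 points.
rng = np.random.default_rng(0)
series = np.convolve(rng.normal(size=1000), np.ones(5) / 5, mode="same")

# With g supplied, no statistical inefficiency is estimated; the given value is used directly.
indices = subsampleCorrelatedData(series, g=2.5)
print(len(indices), indices[:5])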

Member:

I don't quite see the point of this change. From what I can tell the subsampleCorrelatedData function deals in integer-spaced samples, too. @shuail can you explain how this approach really differs from the one we already have here?

Collaborator (Author):

@dotsdl yes, the code you refer to is controlled by an optional conservative flag, and the default for that option is False. I hadn't noticed that option before; looking at the details, I think the implementation is also different from what alchemlyb currently has.

The topic here is basically all about the precision and distinguishability of the subsampling. For a given g like 1.5, the current alchemlyb code rounds it with int(np.rint(1.5)) = 1, the conservative option rounds it up to 2 using math.ceil(1.5), and the non-conservative option uses 1.5 directly. For a given time series with 1,000 data points, the current alchemlyb, conservative, and non-conservative approaches will keep 1,000, 500, and 666 data points respectively. Since the energy will depend on the subsampling, I think we should be careful about how we round g, and I suggest using the non-conservative option here to keep all datasets as distinguishable as possible.
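
A quick arithmetic check of the conservative and fractional counts quoted above (plain Python, not the alchemlyb code; the "current alchemlyb" case is omitted here because it depends on numpy's round-half-to-even behavior):

import math

N, g = 1000, 1.5
print(N // math.ceil(g))  # conservative: ceil(1.5) = 2 -> 500 samples
print(int(N / g))         # fractional g -> about 666 samples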

Member:

Add explicit keywords: subsampleCorrelatedData(..., fast=False, conservative=False, verbose=False) (especially conservative=False, to make sure that the behavior in alchemlyb won't change even if pymbar changes its defaults).

Member:

I agree with #38 (comment) – until we come up with anything better inside alchemlyb I would also go with the pymbar version.

Collaborator (Author):

@orbeckst yes, sounds good to me. I will add the default options explicitly.

indices = subsampleCorrelatedData(series, g=statinef)
picked_time_index = []
#pick the time index for the pandas dataframe based on the python index from subsample
for s_index, s_index_pair in enumerate(series.index):
Member:

This looks clunky and potentially slow (although that might not matter much). Isn't there a direct way to do this without the for loop?

(I don't really know what the data look like to say more here.)

Collaborator (Author):

Yeah, let me check if there is a better way to rewrite this.

Member:

Yes, this will be very slow. You generally want to avoid looping through the rows in a DataFrame or Series. You can directly extract all values for an integer-based index with series.iloc[indices] instead.
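
A small self-contained sketch of the suggested positional indexing (a toy DataFrame with a float time index; illustrative only, not the PR code):

import pandas as pd

# Toy data with a time index in ps, mimicking the DataFrame handled in subsampling.py.
df = pd.DataFrame({"dHdl": range(10)},
                  index=pd.Index([i * 0.5 for i in range(10)], name="time"))
series = df["dHdl"]

indices = [0, 3, 6, 9]             # integer positions, as returned by subsampleCorrelatedData
subsampled = series.iloc[indices]  # positional indexing, no Python loop over series.index
df = df.loc[subsampled.index]      # then select the matching rows of the full DataFrame
print(subsampled.index.tolist())   # [0.0, 1.5, 3.0, 4.5]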

Collaborator (Author):

@dotsdl yes, the iloc[indices] works. Thx!

@dotsdl (Member) left a comment

Added some comments to address.


@orbeckst (Member) left a comment

See minor comments, otherwise I agree with your rationale for changing.

Please also update the docs and explicitly explain that we are using pymbar.timeseries.subsampleCorrelatedData(..., fast=False, conservative=False), and say that this can lead to non-uniform sampling of the time series; it is no longer safe to assume that frames are equally spaced in time.

Do we have tests that distinguish old from new behavior?

@orbeckst (Member) left a comment

Docs and tests.

Otherwise I agree with the change on scientific grounds.


df = df.loc[series.index]

#use the subsampleCorrelatedData function to get the subsample index
Member:

Update the docs: make clear that intervals may no longer be uniform.

Collaborator (Author):

yes, will add docs


Member:

Have we got tests that test fractional g?

Collaborator (Author):

There is a test_preprocessing.py module, and inside it a test_subsampling function:

def test_subsampling(self, data, size):

It tests that the subsampled data are not bigger than the original data, but I don't think it explicitly tests the g value or the actual size of the subsampled data. Also, should we define a slicer function inside this class? Maybe there is a bug here.
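
A hedged sketch of what an explicit fractional-g test could look like (a hypothetical test, not the existing test_preprocessing.py code; the exact counts follow pymbar's documented rounding and may need adjusting):

import numpy as np
from pymbar.timeseries import subsampleCorrelatedData

def test_fractional_g_counts():
    # 1,000 points and a forced fractional g; check how the two rounding modes differ.
    series = np.random.default_rng(42).normal(size=1000)
    conservative = subsampleCorrelatedData(series, g=1.5, conservative=True)
    fractional = subsampleCorrelatedData(series, g=1.5, conservative=False)
    assert len(conservative) == 500       # ceil(1.5) = 2 -> every 2nd frame
    assert len(fractional) in (666, 667)  # ~N/g frames, unevenly spaced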

@orbeckst (Member)

Btw, @shuail this PR appears to come from your master branch. I recommend that PRs get their own branches.

@shuail (Collaborator, Author) commented Nov 15, 2017

@orbeckst thanks for the comments. Yes, I will clean up the code (add more docs, make the options explicit) and create another PR from a branch other than master.

@davidlmobley

So just to chime in here -- we're talking about a fairly significant change on the basis of how to round g, is that correct? Specifically:

The topic here is basically all about the precision and distinguishability of the subsampling. For a given g like 1.5, the current alchemlyb code rounds it with int(np.rint(1.5)) = 1, the conservative option rounds it up to 2 using math.ceil(1.5), and the non-conservative option uses 1.5 directly. For a given time series with 1,000 data points, the current alchemlyb, conservative, and non-conservative approaches will keep 1,000, 500, and 666 data points respectively. Since the energy will depend on the subsampling, I think we should be careful about how we round g, and I suggest using the non-conservative option here to keep all datasets as distinguishable as possible.

Let's think for a minute about the practical difference between using different values of g. When will this be important? I think in the limit of "well behaved" data, where we have plenty of statistically distinguishable samples, this is not going to really matter at all, so the change is unimportant and is unlikely to buy us any real value; presumably it will make uncertainty estimates a little lower, but that's not especially valuable. When it will be important is closer to the limit of "problematic" data, where we may not have enough sampling at all, or we just barely have enough sampling to get a reasonable estimate. This is the case we really want to be thinking about -- maybe $N_{eff}$ is 20 or 50 so we only have a rather small number of samples at some lambda value.

In that case -- as we approach the limit of problematic data -- what's our biggest goal? Is it being able to squeeze every last bit of POSSIBLE value out of the data we have, even if that results in a false appearance of precision when in fact our results are garbage? Or is it being able to correctly recognize that in fact we don't have enough data? I'd argue that the latter goal is more important: It's worse to get an "apparently reasonable" answer that is wrong than to get a "clearly noisy/useless" answer.

Thus, in my view, we should err in the direction of conservatism (at least by default) and round g up if we need to round. I could be OK with not rounding (though see below) but certainly not rounding down.

In my previous experience though, statistical uncertainties tend to always be underestimated anyway, so rounding up seems like a relatively safe thing to do if we need to round.

So in my mind the decision is between:

  • round up (and keep current code)
  • go with the non-conservative option and just use the numeric value of g, potentially giving non-uniformly spaced frames

One of our overall goals is to allow easy coupling with things like trajectory analysis, which may be an argument in favor of the simpler approach (round g upwards to the nearest integer) since having non-uniform frame spacing is going to complicate the analysis a lot.

@shuail (Collaborator, Author) commented Nov 15, 2017

@davidlmobley In our practice, suppose g is linearly correlated with subsample_diff = energy(subsampled) - energy(not subsampled), and we want to use subsample_diff as an indicator for further analysis (e.g., filtering the dataset, machine learning, etc.). It would be nice to have that indicator behave linearly, which means g should not be rounded. I can see that rounding up will not affect the detection of a large subsample_diff (g = 30 vs 31 makes little difference), so I suggest we adopt the subsampleCorrelatedData behavior here (conservative=False as the default) and add a conservative flag that is passed through to subsampleCorrelatedData. Another reason I want to avoid rounding up as the default is that for cases like g = 1.01, the rounded-up g will be 2, and we end up keeping only half of the data even though the points are nearly uncorrelated.

For coupling energy with trajectory analysis, the frequency at which energies and trajectory frames are written out is arbitrary and depends on the user's input, so even uniform frame spacing in the subsampling would not help much here. I would rather use the unsubsampled data for energy-versus-trajectory analysis.
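
To put numbers on the g = 1.01 example above (plain arithmetic, not alchemlyb code):

import math

N, g = 1000, 1.01
print(N // math.ceil(g))  # rounding up: ceil(1.01) = 2 keeps only 500 of 1,000 nearly uncorrelated points
print(int(N / g))         # the fractional g keeps about 990 points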

@orbeckst (Member)

Good points.

The current code rounds to the nearest integer, so it can create the appearance of more independent samples than there really are.

Perhaps use the conservative approach as the default and add an option for users to choose the non-uniform sampling (while making sure that the frame indices for slicing the original trajectory remain available downstream).

@orbeckst (Member)

To summarize the discussion so far:

  • We would rather be conservative, so the current implementation using statinef = int(np.rint(statinef)) should be changed, because numpy.rint can round down (e.g., g=1.4 will give g=1). If nothing else happens it should become statinef = int(np.ceil(statinef)).
  • We could rely completely on pymbar.timeseries.subsampleCorrelatedData() and use the conservative=True keyword to get the ceil(g) rounding.
  • If we use pymbar.timeseries.subsampleCorrelatedData() then we can also use conservative=False to obtain indices with uneven spacing that best approximate a fractional g. I don't think that unevenly sampled frames will make any difference for downstream processing in alchemlyb. It might become more difficult to use the processed energies in the context of a trajectory, but this might be less of a concern because
    • energy and trajectory output are often not written out at the same frequency, and slicing one by the other is tricky business anyway that has to be solved elsewhere (e.g., by comparing time stamps of frames – our standard forms of raw data contain time stamps, so they can be used);
    • a more typical approach might be (??) to featurize a trajectory, from the featurization determine how to slice the raw energy time series, and then subsample the sliced time series according to the g of the raw time series – in this case we never need to care about even or uneven sampling.

I suggest that we

  • add a keyword to allow the user to switch between conservative=True and False (I don't like the kw name very much but we should keep it to be consistent with pymbar)
  • use pymbar.timeseries.subsampleCorrelatedData() for fractional g or ceil(g)

The main remaining question is: what should the default be?

I am tending towards conservative=True to protect unsuspecting users from overly optimistic estimates while still allowing knowledgeable users to get what they want.
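
A minimal sketch of the kind of interface being proposed (hypothetical function and parameter names, not the actual alchemlyb API; the conservative keyword is simply passed through to pymbar):

from pymbar.timeseries import statisticalInefficiency, subsampleCorrelatedData

def subsample_correlated(df, series, conservative=True):
    """Subsample df according to the statistical inefficiency g of series.

    conservative=True  -> pymbar rounds g up with ceil(g): uniform spacing, fewer samples kept.
    conservative=False -> pymbar uses the fractional g: possibly non-uniform spacing.
    """
    statinef = statisticalInefficiency(series.values)
    # All keywords explicit so alchemlyb's behavior stays pinned even if pymbar changes its defaults.
    indices = subsampleCorrelatedData(series.values, g=statinef,
                                      fast=False, conservative=conservative,
                                      verbose=False)
    return df.iloc[indices]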

@davidlmobley

I think I'm on board with all of this and also lean towards having the default be conservative=True, for the same reasons just noted.

@shuail (Collaborator, Author) commented Nov 16, 2017

I think I'm fine with @orbeckst's suggested procedure. For the default option, I am also fine with the rounding approach, since that's close to the original code and philosophy. I can change the option for my own use anyway.

@orbeckst (Member) commented Nov 16, 2017 via email

@shuail (Collaborator, Author) commented Nov 16, 2017

Yes, I will work on the code and the tests. Once I have finished that, I will create another PR from a separate branch other than master.

@orbeckst (Member)

I continued the work on PR #41 – please push updates there. (#41 was rebased against master!)

@orbeckst closed this Nov 22, 2017