
change the subsampling method #38
Closed · wants to merge 2 commits
Conversation

@shuail (Collaborator) commented Nov 14, 2017

Change the subsampling code to use the pymbar module to get the subsampled indices. The original code uses the pandas slicing function, which requires the slicing step to be an integer. In theory I don't think we want to force the subsampling step to be an integer; I will open an issue to discuss this.

Commit: …mpled indexes, the original method uses the slicing function in pandas, which requires the slicing step to be an integer
@codecov-io commented Nov 14, 2017

Codecov Report

Merging #38 into master will decrease coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
- Coverage   97.59%   97.58%   -0.02%     
==========================================
  Files           9        9              
  Lines         374      372       -2     
  Branches       59       59              
==========================================
- Hits          365      363       -2     
  Misses          4        4              
  Partials        5        5
Impacted Files                               Coverage Δ
src/alchemlyb/preprocessing/subsampling.py   95.12% <100%> (-0.23%) ⬇️
src/alchemlyb/parsing/amber.py               98.24% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5173d17...cc2dfd1. Read the comment docs.

@orbeckst (Member) left a comment

I just had a quick look; I don't really understand the changes well enough yet.

@@ -4,6 +4,7 @@
import numpy as np
from pymbar.timeseries import statisticalInefficiency
from pymbar.timeseries import detectEquilibration
from pymbar.timeseries import subsampleCorrelatedData
Member:

Import these together (combine all three lines into one; no idea why they were separate):

from pymbar.timeseries import statisticalInefficiency, detectEquilibration, subsampleCorrelatedData

Collaborator (Author):

@orbeckst yes, changed to one line

df = df.loc[series.index]

#use the subsampleCorrelatedData function to get the subsample index
indices = subsampleCorrelatedData(series, g=statinef)
Member:

Can you briefly explain what subsampleCorrelatedData does (and maybe link to the code in their repo)?

Collaborator (Author):

@orbeckst yes, the link to the code is https://github.com/choderalab/pymbar/blob/ac648e63787592a251992a764876713a64f17455/pymbar/timeseries.py#L632. It basically subsamples the time series (series) based on the given g (statistical inefficiency).
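
For context, a minimal standalone illustration of subsampleCorrelatedData on made-up numpy data (a toy series, not alchemlyb's DataFrames; assumes the pymbar 3.x signature). Given g, it returns integer positions of approximately uncorrelated samples, spaced roughly g frames apart:

import numpy as np
from pymbar.timeseries import subsampleCorrelatedData

# Toy correlated series: smoothed Gaussian noise, 1,000 points.
rng = np.random.default_rng(0)
series = np.convolve(rng.normal(size=1000), np.ones(5) / 5, mode="same")

# With g supplied, no statistical inefficiency is estimated; the given value is used directly.
indices = subsampleCorrelatedData(series, g=2.5)
print(len(indices), indices[:5])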

Member:

I don't quite see the point of this change. From what I can tell the subsampleCorrelatedData function deals in integer-spaced samples, too. @shuail can you explain how this approach really differs from the one we already have here?

Collaborator (Author):

@dotsdl yes, the code you refer to is controlled by an optional conservative flag, and the default for that option is False. I hadn't noticed that option before; looking at the details, I think the implementation is also different from what alchemlyb currently has.

The topic here is basically all about the precision and distinguishability of the subsampling. For a given g like 1.5, the current alchemlyb code rounds it with int(np.rint(1.5)) = 1, the conservative option rounds it up to 2 using math.ceil(1.5), and the non-conservative option uses 1.5 directly. For a given time series with 1,000 data points, the current alchemlyb, conservative, and non-conservative approaches will keep 1,000, 500, and 666 data points respectively. Since the energy will depend on the subsampling, I think we should be careful about how we round g, and I suggest using the non-conservative option here to keep all datasets as distinguishable as possible.
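
A quick arithmetic check of the conservative and fractional counts quoted above (plain Python, not the alchemlyb code; the "current alchemlyb" case is omitted here because it depends on numpy's round-half-to-even behavior):

import math

N, g = 1000, 1.5
print(N // math.ceil(g))  # conservative: ceil(1.5) = 2 -> 500 samples
print(int(N / g))         # fractional g -> about 666 samples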

Member:

Add explicit keywords: subsampleCorrelatedData(..., fast=False, conservative=False, verbose=False) (especially conservative=False, to make sure that the behavior in alchemlyb won't change even if pymbar changes its defaults).

Member:

I agree with #38 (comment) – until we come up with anything better inside alchemlyb I would also go with the pymbar version.

Collaborator (Author):

@orbeckst yes, sounds good to me. I will add the default options explicitly.

indices = subsampleCorrelatedData(series, g=statinef)
picked_time_index = []
#pick the time index for the pandas dataframe based on the python index from subsample
for s_index, s_index_pair in enumerate(series.index):
Member:

This looks clunky and potentially slow (although that might not matter much). Isn't there a direct way to do this without the for loop?

(I don't really know what the data look like to say more here.)

Collaborator (Author):

Yeah, let me check if there is a better way to rewrite this.

Member:

Yes, this will be very slow. You generally want to avoid looping through the rows in a DataFrame or Series. You can directly extract all values for an integer-based index with series.iloc[indices] instead.
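
A small self-contained sketch of the suggested positional indexing (a toy DataFrame with a float time index; illustrative only, not the PR code):

import pandas as pd

# Toy data with a time index in ps, mimicking the DataFrame handled in subsampling.py.
df = pd.DataFrame({"dHdl": range(10)},
                  index=pd.Index([i * 0.5 for i in range(10)], name="time"))
series = df["dHdl"]

indices = [0, 3, 6, 9]             # integer positions, as returned by subsampleCorrelatedData
subsampled = series.iloc[indices]  # positional indexing, no Python loop over series.index
df = df.loc[subsampled.index]      # then select the matching rows of the full DataFrame
print(subsampled.index.tolist())   # [0.0, 1.5, 3.0, 4.5]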

Collaborator (Author):

@dotsdl yes, the iloc[indices] works. Thx!

@dotsdl (Member) left a comment

Added some comments to address.


@orbeckst (Member) left a comment

See minor comments, otherwise I agree with your rationale for changing.

Please also update the docs and explicitly explain that we are using pymbar.timeseries.subsampleCorrelatedData(..., fast=False, conservative=False), and say that this can lead to non-uniform sampling of the time series; it is no longer safe to assume that frames are equally spaced in time.

Do we have tests that distinguish old from new behavior?

@orbeckst (Member) left a comment

Docs and tests.

Otherwise I agree with the change on scientific grounds.


df = df.loc[series.index]

#use the subsampleCorrelatedData function to get the subsample index
Member:

Update the docs: make clear that intervals may no longer be uniform.

Collaborator (Author):

yes, will add docs


Member:

Have we got tests that test fractional g?

Collaborator (Author):

There is a test_preprocessing.py module, and inside it a test_subsampling function:

def test_subsampling(self, data, size):

It tests that the subsampled data are not bigger than the original data, but I don't think it explicitly tests the g value or the actual size of the subsampled data. Also, should we define a slicer function inside this class? Maybe there is a bug here.
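
A hedged sketch of what an explicit fractional-g test could look like (a hypothetical test, not the existing test_preprocessing.py code; the exact counts follow pymbar's documented rounding and may need adjusting):

import numpy as np
from pymbar.timeseries import subsampleCorrelatedData

def test_fractional_g_counts():
    # 1,000 points and a forced fractional g; check how the two rounding modes differ.
    series = np.random.default_rng(42).normal(size=1000)
    conservative = subsampleCorrelatedData(series, g=1.5, conservative=True)
    fractional = subsampleCorrelatedData(series, g=1.5, conservative=False)
    assert len(conservative) == 500       # ceil(1.5) = 2 -> every 2nd frame
    assert len(fractional) in (666, 667)  # ~N/g frames, unevenly spaced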

@orbeckst (Member)

Btw, @shuail this PR appears to come from your master branch. I recommend that PRs get their own branches.

@shuail (Collaborator, Author) commented Nov 15, 2017

@orbeckst thanks for the comments. Yes, I will clean up the code (add more docs, make the options explicit) and create another PR from a branch other than master.

@davidlmobley

So just to chime in here -- we're talking about a fairly significant change on the basis of how to round g, is that correct? Specifically:

The topic here is basically all about the precision and distinguishability of the subsampling. For a given g like 1.5, the current alchemlyb code rounds it with int(np.rint(1.5)) = 1, the conservative option rounds it up to 2 using math.ceil(1.5), and the non-conservative option uses 1.5 directly. For a given time series with 1,000 data points, the current alchemlyb, conservative, and non-conservative approaches will keep 1,000, 500, and 666 data points respectively. Since the energy will depend on the subsampling, I think we should be careful about how we round g, and I suggest using the non-conservative option here to keep all datasets as distinguishable as possible.

Let's think for a minute about the practical difference between using different values of g. When will this be important? I think in the limit of "well behaved" data, where we have plenty of statistically distinguishable samples, this is not going to really matter at all, so the change is unimportant and is unlikely to buy us any real value; presumably it will make uncertainty estimates a little lower, but that's not especially valuable. When it will be important is closer to the limit of "problematic" data, where we may not have enough sampling at all, or we just barely have enough sampling to get a reasonable estimate. This is the case we really want to be thinking about -- maybe $N_{eff}$ is 20 or 50 so we only have a rather small number of samples at some lambda value.

In that case -- as we approach the limit of problematic data -- what's our biggest goal? Is it being able to squeeze every last bit of POSSIBLE value out of the data we have, even if that results in a false appearance of precision when in fact our results are garbage? Or is it being able to correctly recognize that in fact we don't have enough data? I'd argue that the latter goal is more important: It's worse to get an "apparently reasonable" answer that is wrong than to get a "clearly noisy/useless" answer.

Thus, in my view, we should err in the direction of conservatism (at least by default) and round g up if we need to round. I could be OK with not rounding (though see below) but certainly not rounding down.

In my previous experience though, statistical uncertainties tend to always be underestimated anyway, so rounding up seems like a relatively safe thing to do if we need to round.

So in my mind the decision is between:

  • round up (and keep current code)
  • go with the non-conservative option and just use the numeric value of g, potentially giving non-uniformly spaced frames

One of our overall goals is to allow easy coupling with things like trajectory analysis, which may be an argument in favor of the simpler approach (round g upwards to the nearest integer) since having non-uniform frame spacing is going to complicate the analysis a lot.

@shuail (Collaborator, Author) commented Nov 15, 2017

@davidlmobley In our practice, suppose g is linearly correlated with subsample_diff = energy(subsampled) - energy(not subsampled), and we want to use subsample_diff as an indicator for further analysis (e.g., filtering the dataset, machine learning, etc.). It would be nice to have that indicator behave linearly, which means g should not be rounded. I can see that rounding up will not affect the detection of a large subsample_diff (g = 30 vs 31 makes little difference), so I suggest we adopt the subsampleCorrelatedData behavior here (conservative=False as the default) and add a conservative flag that is passed through to subsampleCorrelatedData. Another reason I want to avoid rounding up as the default is that for cases like g = 1.01, the rounded-up g will be 2, and we end up keeping only half of the data even though the points are nearly uncorrelated.

For coupling energy with trajectory analysis, the frequency at which energies and trajectory frames are written out is arbitrary and depends on the user's input, so even uniform frame spacing in the subsampling would not help much here. I would rather use the unsubsampled data for energy-versus-trajectory analysis.
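
To put numbers on the g = 1.01 example above (plain arithmetic, not alchemlyb code):

import math

N, g = 1000, 1.01
print(N // math.ceil(g))  # rounding up: ceil(1.01) = 2 keeps only 500 of 1,000 nearly uncorrelated points
print(int(N / g))         # the fractional g keeps about 990 points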

@orbeckst (Member)

Good points.

The current code rounds to the nearest integer, so it can create the appearance of more independent samples than there really are.

Perhaps use the conservative approach as the default and add an option for users to choose the non-uniform sampling (while making sure that the frame indices for slicing the original trajectory remain available downstream).

@orbeckst (Member)

To summarize the discussion so far:

  • We would rather be conservative, so the current implementation using statinef = int(np.rint(statinef)) should be changed, because numpy.rint can round down (e.g., g=1.4 will give g=1). If nothing else happens it should become statinef = int(np.ceil(statinef)).
  • We could rely completely on pymbar.timeseries.subsampleCorrelatedData() and use the conservative=True keyword to get the ceil(g) rounding.
  • If we use pymbar.timeseries.subsampleCorrelatedData() then we can also use conservative=False to obtain indices with uneven spacing that best approximate a fractional g. I don't think that unevenly sampled frames will make any difference for downstream processing in alchemlyb. It might become more difficult to use the processed energies in the context of a trajectory, but this might be less of a concern because
    • energy and trajectory output are often not written out at the same frequency, and slicing one by the other is tricky business anyway that has to be solved elsewhere (e.g., by comparing time stamps of frames – our standard forms of raw data contain time stamps, so they can be used);
    • a more typical approach might be (??) to featurize a trajectory, from the featurization determine how to slice the raw energy time series, and then subsample the sliced time series according to the g of the raw time series – in this case we never need to care about even or uneven sampling.

I suggest that we

  • add a keyword to allow the user to switch between conservative=True and False (I don't like the kw name very much but we should keep it to be consistent with pymbar)
  • use pymbar.timeseries.subsampleCorrelatedData() for fractional g or ceil(g)

The main remaining question is: what should the default be?

I am tending towards conservative=True to protect unsuspecting users from overly optimistic estimates while still allowing knowledgeable users to get what they want.
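
A minimal sketch of the kind of interface being proposed (hypothetical function and parameter names, not the actual alchemlyb API; the conservative keyword is simply passed through to pymbar):

from pymbar.timeseries import statisticalInefficiency, subsampleCorrelatedData

def subsample_correlated(df, series, conservative=True):
    """Subsample df according to the statistical inefficiency g of series.

    conservative=True  -> pymbar rounds g up with ceil(g): uniform spacing, fewer samples kept.
    conservative=False -> pymbar uses the fractional g: possibly non-uniform spacing.
    """
    statinef = statisticalInefficiency(series.values)
    # All keywords explicit so alchemlyb's behavior stays pinned even if pymbar changes its defaults.
    indices = subsampleCorrelatedData(series.values, g=statinef,
                                      fast=False, conservative=conservative,
                                      verbose=False)
    return df.iloc[indices]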

@davidlmobley

I think I'm on board with all of this and also lean towards having the default be conservative=True, for the same reasons just noted.

@shuail (Collaborator, Author) commented Nov 16, 2017

I think I'm fine with @orbeckst's suggested procedure. For the default option, I am also fine with the rounding approach, since that's close to the original code and philosophy. I can change the option for my own use anyway.

@orbeckst (Member) commented Nov 16, 2017 via email

@shuail (Collaborator, Author) commented Nov 16, 2017

Yes, I will work on the code and the tests. Once I have finished that, I will create another PR from a separate branch other than master.

@orbeckst (Member)

I continued the work on PR #41 – please push updates there. (#41 was rebased against master!)

@orbeckst closed this Nov 22, 2017