
Convert to N x K storage for U_kln matrix #3
Closed
mrshirts opened this issue Jul 5, 2013 · 32 comments

@mrshirts
Collaborator

mrshirts commented Jul 5, 2013

For data sets where we have different numbers of samples at different states, rather than having the energies stored in a K x K x N_each matrix, it would be more efficient to store the data as a K x N_tot matrix, where N_tot is the total number of samples collected and K is the number of states at which we evaluate the energies. This will take a bit of extra work, however.

@ghost ghost assigned mrshirts Jul 5, 2013
@kyleabeauchamp
Collaborator

So interestingly, we recently ran into a possibly similar issue in MSMBuilder. We previously stored the state assignments as a (num_trajectories, max_length) numpy array. This led to over-allocation of huge blocks of memory when we had data sets where some trajectories were very long, but others were very short (e.g. combining FAH and ANTON data).

Our proposed solution (hasn't been merged yet, but probably will be eventually) was to use Pandas to store the data in a different format (msmbuilder/msmbuilder-legacy#206 and msmbuilder/msmbuilder-legacy#205). Instead of a 2D array, we use a Pandas DataFrame that has fields for "trajectory ID", "timestep", and "conformational state".

I suspect a similar thing could be done in pymbar--we just create a DataFrame object that has entries for "sample ID", "sampled state", etc.
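As a rough illustration of that layout (a hypothetical sketch, not code from MSMBuilder or pymbar; the column names are made up):

    import pandas as pd

    # One row per (sample, evaluated state) pair; no padding is needed when
    # different states contribute different numbers of samples.
    u_df = pd.DataFrame({
        "sample_id": [0, 0, 1, 1, 2, 2],
        "sampled_state": [0, 0, 0, 0, 1, 1],
        "evaluated_state": [0, 1, 0, 1, 0, 1],
        "reduced_potential": [1.2, 3.4, 0.9, 2.8, 4.1, 1.7],
    })

    # Recover a dense K x N_tot matrix when needed:
    u_kn = u_df.pivot(index="evaluated_state", columns="sample_id",
                      values="reduced_potential").values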

@kyleabeauchamp
Collaborator

Obviously that requires an extra prerequisite, but it could be an elegant solution if this problem is a real efficiency breaker.

@jchodera
Member

I like this idea---it's effectively a sparse representation.

Optimally, I think we would implement a simple linear storage system that doesn't require pandas (all N samples are concatenated and evaluated at all K states) and a pandas-interoperable interface, so that either form could be passed in.

The big question is whether we can deprecate the old interface where a K x K x N_max matrix u_kln is passed in. I imagine deprecating it would be a bad idea, and we can simply support both as input (but use N x K internally) and determine which one is being passed by its dimensionality.
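A minimal sketch of what that conversion and dimension check could look like (a hypothetical helper, not the eventual pymbar code; assumes u_kln[k, l, n] is the n-th sample from state k evaluated at state l):

    import numpy as np

    def kln_to_kn(u_kln, N_k):
        # Flatten a (K, K, N_max) u_kln array into a (K, N_tot) u_kn array,
        # using N_k to know how many samples each sampled state contributed.
        K = len(N_k)
        u_kn = np.zeros((K, int(np.sum(N_k))))
        start = 0
        for k in range(K):
            n_k = int(N_k[k])
            u_kn[:, start:start + n_k] = u_kln[k, :, :n_k]
            start += n_k
        return u_kn

    # Accept either form and convert internally:
    # if u.ndim == 3: u = kln_to_kn(u, N_k)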

@mrshirts
Collaborator Author

I think the first step is to convert to an N x K representation, using the N_k data to identify where we transition between samples from state 1, state 2, etc. I have some ideas on this, but not sure when I'll have time to do too much.


@jchodera
Member

This should be straightforward, but will require some time investment to complete and test. We should only tackle this once we have a sufficient number of tests in place to ensure we haven't screwed anything major up.

@mrshirts
Collaborator Author

Agreed on all counts mentioned.


@kyleabeauchamp
Collaborator

I agree. Once tests are in place, it would be easy to hammer this out.

@kyleabeauchamp
Collaborator

Below, $q_{kj}$ refers to the unnormalized probability of conformation j in state k (i.e. assuming N x K storage).

So I've been playing with this a bit and have some initial thoughts regarding acceleration:

  1. If we switch from log-space to exponential space (e.g. from $f_i$ to $c_i$) (Eqns. C1, C2), the self-consistent equations of MBAR become essentially just two matrix multiplications.
  2. We can still handle overflow by dividing $q_{kj}$ by the largest element in each row (e.g. the most probable state for each sample).
  3. We can't handle underflow and still use fast BLAS / matrix multiplication--underflow requires Kahan summation, which is generally not a BLAS accelerated operation. Note that I don't think the current code actually handles underflow--just overflow. Correct me if I'm wrong here.
  4. My implementation of these ideas (with self-consistent iteration) gives a 100X speedup over the current C++ Newton code. Also, this requires only like 10 lines of code.
  5. If 100X is fast enough, do we still need to have the Newton code and c++ code? Do we want to run calculations for n_states > 10000? In such cases, we can't use NR because of the memory footprint of the Hessian--right? One possible solution is to transform the self-consistent equations into a nonlinear optimization problem (e.g. possibly using L2 error) and apply LBFGS. I'm not sure if this is worth the extra code required, though.

I can give more quantitative results / PR later.
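For concreteness, a minimal sketch of the kind of update described above (a rough illustration assuming u_kn stored as K x N_tot; not the benchmark code itself):

    import numpy as np

    def self_consistent_mbar(u_kn, N_k, tol=1e-10, max_iter=100000):
        # u_kn: (K, N_tot) reduced potentials of all samples at all states
        # N_k:  (K,) number of samples drawn from each state
        K, N = u_kn.shape
        # Handle overflow: shift each sample's energies by its minimum,
        # which cancels between numerator and denominator below.
        Q = np.exp(-(u_kn - u_kn.min(axis=0)))   # q_{kn}, shape (K, N)
        c = np.ones(K)                           # c_k = exp(-f_k)
        for _ in range(max_iter):
            denom = (N_k / c) @ Q                # (N,) mixture denominators
            c_new = Q @ (1.0 / denom)            # (K,) updated c_k
            c_new /= c_new[0]                    # fix f_0 = 0
            if np.max(np.abs(np.log(c_new / c))) < tol:
                c = c_new
                break
            c = c_new
        return -np.log(c)                        # f_k relative to state 0

Each iteration is essentially one matrix-vector product with Q and one with its transpose, which is what makes the BLAS acceleration possible.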

@mrshirts
Collaborator Author

mrshirts commented Sep 4, 2013

Hi, all-

Perhaps you can check the code in as a branch so we can look at it directly?

  So I've been playing with this a bit and have some initial thoughts regarding acceleration:

Great!

  1. If we switch from log-space to exponential space (e.g. from $f_i$ to $c_i$) (Eqns. C1, C2), the self-consistent equations of MBAR become essentially just two matrix multiplications.

There are some places where the log-space representation really is necessary for precision. Most of the code was originally in exponential space, and more and more got shifted over time to log space. I'd have to dig in a bit more. Is it possible to shift to exponential space just before the time-consuming calculations?

  2. We can still handle overflow by dividing $q_{kj}$ by the largest element in each row (e.g. the most probable state for each sample).

I don't THINK this is sufficient to avoid the error -- but we'd have to check.

  3. We can't handle underflow and still use fast BLAS / matrix multiplication--underflow requires Kahan summation, which is generally not a BLAS-accelerated operation. Note that I don't think the current code actually handles underflow--just overflow. Correct me if I'm wrong here.

I would have to look a little more carefully. I think we don't have to worry about underflow.

  4. My implementation of these ideas (with self-consistent iteration) gives a 100X speedup over the current C++ Newton code. Also, this requires only like 10 lines of code.

Great!

  5. If 100X is fast enough, do we still need to have the Newton code and C++ code?

NR is MUCH faster than the self-consistent solution near the answer. It might take 8 steps where self-consistent iteration still hasn't converged after 8000. Of particular worry are cases where the delta F is below the specified limit -- but each dF is correlated, so you could still be 0.01 away from the solution even though the increment has dropped below 0.00001, for example. So NR is still good for now. Don't remove NR even for a factor of 100; it's worth more than that in many cases. If you can replace the old self-consistent version without removing NR, then more power to you, of course.

  Do we want to run calculations for n_states > 10000? In such cases, we can't use NR because of the memory footprint of the Hessian--right?

Longer term, we do want to handle more states. However, we really only need this for states for which there are samples; states for which there are no samples do not require a Hessian matrix to be generated. The maximum size is K_sampled x K_sampled.

The likely cases where we want to handle many states are when we are exploring a highly multidimensional space and n_states >> n_samples, in which case we would be limited by the number of samples instead. And I think we want to think about these a bit differently.

  One possible solution is to transform the self-consistent equations into a nonlinear optimization problem (e.g. possibly using L2 error) and apply LBFGS. I'm not sure if this is worth the extra code required, though.

I think this is a great idea, but I think it can wait until we've stabilized the current changes a bit more.

@mrshirts
Collaborator Author

mrshirts commented Sep 4, 2013

  One possible solution is to transform the self-consistent equations into a nonlinear optimization problem (e.g. possibly using L2 error) and apply LBFGS. I'm not sure if this is worth the extra code required, though.

Note that there is an objective function that, when minimized, gives the free energies. See the function _objectiveF. It could certainly be written more efficiently.
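For reference, a minimal sketch of that kind of formulation (this uses the standard convex objective whose gradient reduces to Eq. C6; it is not the actual _objectiveF implementation, and the helper names are made up):

    import numpy as np
    from scipy.special import logsumexp
    from scipy.optimize import minimize

    def objective(f_k, u_kn, N_k):
        # sum_n log sum_k N_k exp(f_k - u_kn[k, n]) - sum_k N_k f_k
        log_denom_n = logsumexp(f_k[:, None] - u_kn, b=N_k[:, None], axis=0)
        return np.sum(log_denom_n) - np.dot(N_k, f_k)

    def gradient(f_k, u_kn, N_k):
        # d/df_i = sum_n N_i W_{ni} - N_i  (the negative of Eq. C6)
        log_denom_n = logsumexp(f_k[:, None] - u_kn, b=N_k[:, None], axis=0)
        log_W_kn = f_k[:, None] - u_kn - log_denom_n
        return N_k * np.exp(logsumexp(log_W_kn, axis=1)) - N_k

    def solve_f_k(u_kn, N_k):
        K = len(N_k)
        result = minimize(objective, np.zeros(K),
                          args=(u_kn, np.asarray(N_k, dtype=float)),
                          method="L-BFGS-B", jac=gradient)
        return result.x - result.x[0]   # free energies defined up to a constant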


@jchodera
Member

jchodera commented Sep 5, 2013

  1. If we switch from log-space to exponential space (e.g. from $f_i$ to $c_i$) (Eqns. C1, C2), the self-consistent equations of MBAR become essentially just two matrix multiplications.
  2. We can still handle overflow by dividing $q_{kj}$ by the largest element in each row (e.g. the most probable state for each sample).

  There are some places where the log space representation really is necessary for precision. Most was originally in exponential space, and more and more got shifted over time to log space. I'd have to dig in a bit more. Is it possible to shift to exponential space just before the time consuming calculations?

Note that Kyle says he can handle overflow by scaling each row. That could fix nearly all of the issues that caused us to go to log space. I think this might be worth exploring to see if underflow ends up being unproblematic.

To test this, I think we'd need to have the harmonic oscillator test be automated with a variety of different overlap conditions (good, medium, poor, very bad).

@jchodera
Member

jchodera commented Sep 5, 2013

My implementation of these ideas (with self-consistent iteration) gives a 100X speedup over the current C++ Newton code. Also, this requires only like 10 lines of code.
If 100X is fast enough, do we still need to have the Newton code and c++ code? Do we want to run calculations for n_states > 10000? In such cases, we can't use NR because of the memory footprint of the Hessian--right? One possible solution is to transform the self-consistent equations into a nonlinear optimization problem (e.g. possibly using L2 error) and apply LBFGS. I'm not sure if this is worth the extra code required, though.

I'd fully support this, since it would make for code that is more easily understood, still fast, and easier to maintain.

For the really big problems, maybe we could use a scheme that doesn't require explicit construction of the full matrix, or computes the elements on the fly as needed, possibly in a pyopencl/pycuda implementation (which would be compact and fast).

@mrshirts
Collaborator Author

mrshirts commented Sep 5, 2013

Note that Kyle says he can handle overflow by scaling each row. That could fix nearly all of the issues that caused us to go to log space. I think this might be worth exploring to see if underflow ends up being unproblematic.

Sure, just remember that almost everything in there is there for a reason. Before we make a major switch like this, I'll need to go through the check-in dates for these changes, check the emails from that time, and make sure there isn't some nasty test case that we're forgetting something important for. Limiting ourselves to the exponential representation limits us to a dynamic range of roughly 700 kT (about ln 10^300, which is where double-precision exp() overflows). I believe the specific problem was looking at heat capacities for protein aggregation.

To test this, I think we'd need to have the harmonic oscillator test be automated with a variety of different overlap conditions (good, medium, poor, very bad).

Unfortunately, harmonic oscillators are "nice" in a way that a lot of other problems are not. So just because something works with harmonic oscillators doesn't mean it will work with other problems. If it breaks harmonic oscillators, it will certainly break other problems.

The solution, of course, is to dig those problems out of the email so they can be tested as well. But given my schedule in Sept/Oct, I'm not going to be able to do that.

In conclusion: The todos as I see them:

  1. Fix the K x K x N representation (i.e. convert it to N x K).

This will make the formulas simpler, because we won't need to do nearly as much index juggling, and it will speed everything up as well as saving memory -- in most cases where memory fails, it fails because of K x K x N memory issues, not because of K x K issues.

  2. Make our test set include all the problems that failed in the past (this will require some data archeology in the emails and the subversion history), so that when we have good ideas, we can tell if they will break things.
  3. Then experiment with things like exponential-only representations, highly sparse spaces, and removing methods that are working pretty well. If it ain't broke, be very careful about fixing it.

@mrshirts
Collaborator Author

mrshirts commented Sep 5, 2013

OK, I found the example that caused me to harden the code. Basically, free energy differences greater than ~710 kBT cannot be handled in the exponential regime alone. Note that this has nothing to do with overlap -- even if the free energy differences come purely from the energy offsets, it will still fail.

I'll try to figure out where to upload this.

@kyleabeauchamp
Collaborator

In principle, I agree that we don't want to change things that aren't broken. However, if it leads to major code simplification and maintainability, such changes might be worthwhile.

@jchodera
Member

jchodera commented Sep 5, 2013

In principle, I agree that we don't want to change things that aren't broken. However, if it leads to major code simplification and maintainability, such changes might be worthwhile.

I agree completely.

OK, I found the example that caused me to harden the code. Basically, free energy differences greater than ~710 kBT cannot be handled in the exponential regime alone. Note that this has nothing to do with overlap -- even if the free energy differences come purely from the energy offsets, it will still fail.

We should create a special tests/ directory for difficult cases, provided they are small enough to include with the standard package. Otherwise, they should go into pymbar-examples or pymbar-datasets. Or maybe we can create a pymbar-tests?

@kyleabeauchamp
Collaborator

So here's my proposed organization:

  1. Create a directory of test systems, each of which is a class that generates the necessary data
  2. Have wrappers in our nosetests that run all the tests (some might be disabled for speed reasons, if necessary); see the sketch below.
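Something along these lines, as a rough sketch (the class and function names are made up, and the assertion against an actual estimator is left out):

    import numpy as np

    class HarmonicOscillatorTestSystem:
        # K harmonic oscillators (beta = 1) with known analytical free energies.
        def __init__(self, K_k, O_k, N_k, seed=0):
            self.K_k = np.asarray(K_k, dtype=float)   # spring constants
            self.O_k = np.asarray(O_k, dtype=float)   # offsets of the minima
            self.N_k = np.asarray(N_k, dtype=int)     # samples per state
            self.rng = np.random.RandomState(seed)

        def analytical_f_k(self):
            # f_k = -ln sqrt(2 pi / K_k), reported relative to state 0
            f_k = 0.5 * np.log(self.K_k / (2.0 * np.pi))
            return f_k - f_k[0]

        def sample(self):
            # Draw from each state, then evaluate every sample at every state.
            sigma_k = 1.0 / np.sqrt(self.K_k)
            x_n = np.concatenate([
                self.rng.normal(self.O_k[k], sigma_k[k], self.N_k[k])
                for k in range(len(self.K_k))])
            u_kn = 0.5 * self.K_k[:, None] * (x_n - self.O_k[:, None]) ** 2
            return u_kn, self.N_k

    def test_harmonic_oscillators():
        system = HarmonicOscillatorTestSystem(K_k=[1.0, 2.0, 4.0],
                                              O_k=[0.0, 1.0, 2.0],
                                              N_k=[500, 500, 500])
        u_kn, N_k = system.sample()
        f_k_true = system.analytical_f_k()
        # ...run the estimator under test on (u_kn, N_k) and compare to f_k_true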

@kyleabeauchamp
Collaborator

Regarding the ~700 kT underflow limit: while working in log space is one solution, another solution would be to switch to float128 in situations where more precision is needed.

@mrshirts
Collaborator Author

mrshirts commented Sep 5, 2013

That only doubles the range to 1400 kBT. There are things that are fundamentally better to do logarithmically, and it's surprisingly easy to bump up against them. Two harmonic oscillators with equal spring constants but energy offsets of 2000 kBT can still break parts of pymbar if done solely in exponential space.
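A quick illustration of the limit in question (just numpy/scipy, nothing pymbar-specific):

    import numpy as np
    from scipy.special import logsumexp

    np.exp(710.0)     # overflows to inf in double precision
    np.exp(-750.0)    # underflows to 0.0

    # An energy offset of ~2000 kT between two states:
    u = np.array([2000.0, 2001.0])
    np.log(np.sum(np.exp(-u)))   # log(0) = -inf; exponential space fails
    logsumexp(-u)                # about -1999.69; log space handles it fine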

@mrshirts
Collaborator Author

mrshirts commented Sep 5, 2013

So here's my proposed organization:

Create a directory of test systems, each of which is a class that generates the necessary data

Can you clarify this? Sometimes, there will be data sets where the data cannot be generated on the fly.

Have wrappers in our nosetests that runs all the tests (some might be disabled for speed reasons, if necessary)

Yes, there should be standard tests that always run.

I'm thinking that there are really three types of data sets and examples:

  1. tests (currently existing):
    This is supposed to provide the automated tests. For example, tests/harmonic-oscillators/harmonic-oscillators.py runs through all the functionality in pymbar and checks the analytical error estimates by making sure biases are within 1-2 std. It could probably be integrated much better. The tests/timeseries/timeseries/test-statistical-efficiency code checks timeseries.py. The tests/harmonic-oscillators/harmonic-oscillators.py is supposed to test the distribution of error, which it basically does, but it doesn't test all functionality.
  2. examples (currently existing): Pedagogical examples; should be part of the main download. These are for users and may involve small test sets, but the main goal is to help people write code using these examples. Data sets should be small enough to fit in the main download.
  3. large-datasets (kind of already existing): This should be for larger-scale testing that is 1) too expensive to automate and 2) a database of all the problems we've encountered, so we can make sure that new code doesn't break weird but important cases. Stress testing (like the energy range problem) should go here.

@kyleabeauchamp
Collaborator

Thanks. Here's another question: do we need everything in log space, or just $f_i$? For example, can we pre-compute $Q_{ij} = \exp(-u_{ij})$ rather than computing it over and over during the self-consistent iteration? That's probably a large chunk of the slowdown.

@mrshirts
Collaborator Author

mrshirts commented Sep 5, 2013

In principle, I agree that we don't want to change things that aren't broken. However, if it leads to major code simplification and maintainability, such changes might be worthwhile.

Certainly! I'm just pointing out that it is currently easy to break the code because we don't have all of the cases explicitly laid out -- they are all in emails between John and me that need to be added as test cases in pymbar. So one should not assume that just because harmonic-oscillators.py is unchanged, a new change will not break the code right now. The situation does need to change longer term (which for me, might mean a month or two).

@mrshirts
Collaborator Author

mrshirts commented Sep 5, 2013

Thanks. Here's another question: do we need everything in log space, or just $f_i$? For example, can we pre-compute $Q_{ij} = \exp(-u_{ij})$ rather than computing it over and over during the self-consistent iteration? That's probably a large chunk of the slowdown.

That's a really good question that I would need to play around with. I THINK the answer is no, but I can't remember.

My first reaction is caution, so I'd still advocate taking the current code, changing everything to N x K, and then all subsequent proposed modifications (like this one) will be much easier to make. I'd like to see how much of the problem is all the annoying is-sampled-state index dancing, and then work from there. Getting rid of that will make the code much more readable without changing any convergence properties, and then we can separately address optimizations that may result in different convergence.

I'm sorry I'm being a pain here. Clearly, the specific problems in numerical stability that resulted in the switch to log form need to be documented better. I'm just trying to make sure we minimize the pain of switching back, as I'm almost certain will be needed :) Though I could be wrong.

One of the reasons we haven't been worrying as much about the self-consistent speed is that NR is so much more robust as an optimization method than self-consistent iteration when it can be used (which is when the number of sampled states is less than ~10,000 or so).

@kyleabeauchamp
Collaborator

So note that in the benchmark, I think my proposed self-consistent code is actually 100X faster than the NR code--it's better than apples to apples...

@kyleabeauchamp
Collaborator

I agree that speed is lower priority than accuracy and readability, though.

@kyleabeauchamp
Collaborator

git clone https://github.com/kyleabeauchamp/pymbar.git
cd pymbar
git checkout refactor

@kyleabeauchamp
Collaborator

Oops, wrong thread.

@mrshirts
Collaborator Author

mrshirts commented Sep 5, 2013

So note that in the benchmark, I think my proposed self-consistent code is actually 100X faster than the NR code--it's better than apples to apples...

The problem with self-consistent iteration is that relative_tolerance means something very different there. For NR, given quadratic convergence, df_{i+1} - df_i is essentially df_{i+1} - df_converged. For self-consistent iteration, df_{i+1} - df_i != df_{i+1} - df_converged, since convergence is sub-linear. These are issues that only appear with nasty (but still common) cases.

@jchodera
Member

jchodera commented Sep 6, 2013

The problem with self-consistent iteration is that relative_tolerance means something very different there. For NR, given quadratic convergence, df_{i+1} - df_i is essentially df_{i+1} - df_converged. For self-consistent iteration, df_{i+1} - df_i != df_{i+1} - df_converged, since convergence is sub-linear. These are issues that only appear with nasty (but still common) cases.

Zhiqiang Tan had a great idea for looking at the relative tolerance for both of these schemes in a consistent fashion: Compute the norm of the gradient of the related minimization problem. We compute this gradient in our adaptive scheme to decide whether self-consistent iteration or Newton-Raphson is getting "closer" to the converged result:

      # gradient (defined by Eq. C6 of [1])
      # g_i(theta) = N_i - \sum_n N_i W_ni
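In code, that criterion could look something like this (a rough sketch assuming u_kn of shape (K, N_tot) and relative free energies f_k; not the adaptive-scheme implementation itself):

    import numpy as np
    from scipy.special import logsumexp

    def mbar_gradient_norm(u_kn, N_k, f_k):
        # log of the mixture denominator for each sample n:
        # log sum_k N_k exp(f_k - u_kn[k, n])
        log_denom_n = logsumexp(f_k[:, None] - u_kn, b=N_k[:, None], axis=0)
        # log W_{ni} for every state i and sample n
        log_W_kn = f_k[:, None] - u_kn - log_denom_n
        # g_i = N_i - sum_n N_i W_ni  (Eq. C6 of [1])
        g = N_k * (1.0 - np.exp(logsumexp(log_W_kn, axis=1)))
        return np.linalg.norm(g)

    # Iterate (by either method) until mbar_gradient_norm(...) < tol.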

@mrshirts
Collaborator Author

mrshirts commented Sep 6, 2013

This does lose some of the advantages of self-consistent iteration if you have to do the gradient calculation at each step. Though perhaps we can play around with the variables so that not much extra work is required at each step.

Please don't get me wrong in all of this. I think it's great Kyle is finding these tricks. We just need to get the robustness tests in so that we don't have to rely on me remembering what things broke the code in the past, leading to the somewhat more convoluted than ideal version that we have now.


@mrshirts
Collaborator Author

mrshirts commented Sep 6, 2013

Note also that even if the self-consistent iteration and NR are using the same test for convergence, that does not change the self-consistent iteration's performance. It's just that it becomes clearer that it takes more steps.


@mrshirts mrshirts modified the milestones: 3.0 release, 2.0 release Apr 22, 2014
@kyleabeauchamp kyleabeauchamp modified the milestones: 2.0 release, 3.0 release, 2.1 release Apr 22, 2014
@kyleabeauchamp
Collaborator

So NxK is done. There are some useful discussions here regarding precision and performance, but I think I've refreshed my memory of the key points.
