Convert to N x K storage for U_kln matrix #3
Comments
So interestingly, we recently ran into a possibly similar issue in MSMBuilder. We previously stored the state assignments as a 2D array. Our proposed solution (it hasn't been merged yet, but probably will be eventually) was to use Pandas to store the data in a different format (msmbuilder/msmbuilder-legacy#206 and msmbuilder/msmbuilder-legacy#205). Instead of a 2D array, we use a Pandas DataFrame that has fields for "trajectory ID", "timestep", and "conformational state". I suspect a similar thing could be done in pymbar -- we just create a DataFrame object that has entries for "sample ID", "sampled state", etc.
Obviously that requires an extra dependency, but it could be an elegant solution if this problem is a real efficiency bottleneck.
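A minimal sketch of the per-sample "tidy" layout described above, assuming hypothetical column names (these are illustrative, not pymbar's or MSMBuilder's actual schema):

```python
import pandas as pd

# One row per sample, instead of a padded 2D (or K x K x N) array.
samples = pd.DataFrame({
    "sample_id":     [0, 1, 2, 3, 4],
    "sampled_state": [0, 0, 0, 1, 1],            # state each sample was drawn from
    "u_0":           [1.2, 0.8, 1.1, 3.4, 2.9],  # reduced energy evaluated at state 0
    "u_1":           [2.5, 2.1, 2.6, 0.7, 0.4],  # reduced energy evaluated at state 1
})

# Uneven sample counts per state come for free -- no padding needed:
counts = samples.groupby("sampled_state").size()
print(counts.to_dict())  # {0: 3, 1: 2}
```

The point of the design is that ragged data (different numbers of samples per state) needs no N_max padding; grouping and filtering by origin state are one-liners.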
I like this idea---it's effectively a sparse representation. Optimally, I think we would implement a simple linear storage scheme that doesn't require pandas (all N samples are concatenated and evaluated at all K states) and a pandas-interoperable interface, so that either form could be passed in. The big question is whether we can deprecate the old interface, where an Nmax x Nmax x K matrix is passed in.
I think the first step is to convert to an N x K representation.
This should be straightforward, but will require some time investment to complete and test. We should only tackle this once we have a sufficient number of tests in place to ensure we haven't screwed anything major up.
Agreed on all counts mentioned.
I agree. Once tests are in place, it would be easy to hammer this out.
So I've been playing with this a bit and have some initial thoughts regarding acceleration:
I can give more quantitative results / a PR later.
Hi, all. Perhaps you can check the code in as a branch so we can look at it directly?
Great!
There are some places where the log space representation really is needed.
I would have to look a little more carefully; I think we don't have to use it everywhere.
Great!
NR is MUCH faster than self-consistent iteration near the answer. It might take only 8 steps.
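A toy illustration of this speed difference (not pymbar's solver), comparing self-consistent iteration with Newton-Raphson on the scalar fixed-point problem x = exp(-x):

```python
import math

def fixed_point_steps(x=0.5, tol=1e-12, max_iter=1000):
    """Self-consistent iteration x <- exp(-x); converges linearly."""
    for i in range(1, max_iter + 1):
        x_new = math.exp(-x)
        if abs(x_new - x) < tol:
            return i
        x = x_new
    return max_iter

def newton_steps(x=0.5, tol=1e-12, max_iter=100):
    """Newton-Raphson on g(x) = x - exp(-x); converges quadratically."""
    for i in range(1, max_iter + 1):
        g = x - math.exp(-x)
        gp = 1.0 + math.exp(-x)  # g'(x)
        x_new = x - g / gp
        if abs(x_new - x) < tol:
            return i
        x = x_new
    return max_iter

print(fixed_point_steps(), newton_steps())  # Newton needs far fewer iterations
```

Started near the root, Newton converges in a handful of steps while the self-consistent iteration needs dozens, which is the behavior the comment above describes.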
Longer term, we do want to handle more states. However, we really only need to in certain cases; the likely cases where we want to handle many states are when we have many thermodynamic states with only a few samples each.
I think this is a great idea, but I think it can wait until we've finished the N x K conversion.
Note that there is an objective function that, when minimized, gives the MBAR free energies.
Note that Kyle says he can handle overflow by scaling each row. That could fix nearly all of the issues that caused us to go to log space. I think this might be worth exploring to see if underflow ends up being unproblematic. To test this, I think we'd need to have the harmonic oscillator test be automated with a variety of different overlap conditions (good, medium, poor, very bad).
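A sketch of the row-scaling idea, assuming the standard max-shift trick (the function name is illustrative): shift each row of log-weights by its maximum before exponentiating, so the largest term is exactly 1 and relative weights survive.

```python
import numpy as np

def scaled_weights(minus_u):
    """Shift a row of log-weights by its max before exponentiating,
    so the largest term becomes exp(0) = 1. `minus_u` holds -u_k(x_n)
    for one sample evaluated at all K states; `shift` is carried
    separately and re-applied in log space at the end."""
    shift = minus_u.max()
    return np.exp(minus_u - shift), shift

# A naive exp underflows all of these to 0.0, destroying the ratios:
row = np.array([-2000.0, -1999.0, -2005.0])
print(np.exp(row))        # [0. 0. 0.]

# The scaled version preserves the relative weights:
w, shift = scaled_weights(row)
print(w)                  # [exp(-1), 1, exp(-6)] -- all finite
```

This handles per-row dynamic range but not the global case where weights across different samples span more than the float64 exponent range, which is where the log-space machinery below still matters.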
I'd fully support this, since it would make for code that is more easily understood, still fast, and easier to maintain. For the really big problems, maybe we could use a scheme that doesn't require explicit construction of the full matrix, or computes the elements on the fly as needed, possibly in a pyopencl/pycuda implementation (which would be compact and fast).
Sure, just remember that almost everything in there is there for a reason. Before we make a major switch like this, I'll need to go through the check-in dates for these changes, check the emails from that time, and make sure there's not some nasty test case we're forgetting. Limiting ourselves to the exponential case limits us to a dynamic range of ~700 kBT (log 10^300 is about 700). I believe the specific problem was looking at heat capacities for protein aggregation.
Unfortunately, harmonic oscillators are "nice" in a way that a lot of other problems are not. So just because something works with harmonic oscillators doesn't mean it will work with other problems; if it breaks harmonic oscillators, though, it will certainly break other problems. The solution, of course, is to dig those problems out of the email archive so they can be tested as well. But given my schedule in Sept/Oct, I'm not going to be able to do that. In conclusion, the todos as I see them:
This will make the formulas simpler because we won't need to do nearly as much index juggling, and it will save speed as well as memory -- in most cases where memory fails, it fails because of K x K x N memory issues, not because of K x K issues.
OK, I found the example that caused me to harden the code. Basically, free energy differences greater than ~710 kBT cannot be handled in the exponential regime alone. Note that this has nothing to do with overlap -- if the energy offsets alone produce free energy differences that large, it will still fail. I'll try to figure out where to upload this.
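The ~710 kBT figure matches the float64 exponential range, which is easy to check directly; a quick sketch, with `np.logaddexp` as the log-space workaround:

```python
import numpy as np

# log of the largest representable double is ~709.78, hence the ~710 limit:
print(np.log(np.finfo(np.float64).max))   # ~709.78

with np.errstate(over="ignore"):
    print(np.exp(709.0), np.exp(710.0))   # finite, then inf

# The same magnitudes are harmless in log space:
print(np.logaddexp(710.0, 712.0))         # ~712.127, no overflow
```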
In principle, I agree that we don't want to change things that aren't broken. However, if it leads to major code simplification and better maintainability, such changes might be worthwhile.
I agree completely.
We should create a special tests/ directory for difficult cases, provided they are small enough to include with the standard package. Otherwise, they should go into a separate repository.
So here's my proposed organization:
Regarding the 700 kBT underflow limit: while working in log space is one solution, another would be to switch to float128 in situations where more precision is needed.
That only doubles the range to 1400 kBT. There are things that are fundamentally better to do logarithmically, and it's surprisingly easy to bump up against them. Two harmonic oscillators with equal spring constants but energy offsets of 2000 kBT can still break parts of pymbar if everything is done solely in exponential space.
Can you clarify this? Sometimes, there will be data sets where the data cannot be generated on the fly.
Yes, there should be standard tests that always run. I'm thinking that there are really three types of data sets and examples:
Thanks. Here's another question: do we need everything in log space, or just the parts where over/underflow can actually occur?
Certainly! I'm just pointing out that it is currently easy to break the code because we don't have all of the cases explicitly laid out -- they are all in emails between John and me that need to be added as test cases in pymbar. So one should not assume that just because harmonic-oscillators.py is unchanged, a new change will not break the code right now. The situation does need to change longer term (which, for me, might mean a month or two).
That's a really good question that I would need to play around with. I THINK the answer is no, but I can't remember. My first reaction is caution, so I'd still advocate taking the current code, changing everything to N x K, and then all subsequent proposed modifications (like this one) will be much easier to make. I'd like to see how much of the problem is all the annoying is-sampled-state index dancing, and then work from there. Getting rid of that will make the code much more readable without changing any convergence properties, and then we can separately address optimizations that may result in different convergence.

I'm sorry I'm being a pain here. Clearly, the specific problems in numerical stability that resulted in the switch to log form need to be documented better. I'm just trying to make sure we minimize the pain of switching back, as I'm almost certain will be needed :) Though I could be wrong.

One of the reasons we haven't been worrying as much about self-consistent speed is that NR is so much more robust as an optimization method than self-consistent iteration when it can be used (which is when the number of states with samples is below ~1,000 or so).
So note that in the benchmark, I think my proposed self-consistent code is actually 100X faster than the NR code--it's better than apples to apples...
I agree that speed is lower priority than accuracy and readability, though.
Oops, wrong thread.
The problem with self-consistent iteration is that relative_tolerance means something very different there. For NR, given quadratic convergence, df_(i+1) - df_i is essentially df_(i+1) - df_converged. For self-consistent iteration, df_(i+1) - df_i != df_(i+1) - df_converged, since convergence is sub-linear. These are issues that only appear with nasty (but still common) cases.
Zhiqiang Tan had a great idea for looking at the relative tolerance for both of these schemes in a consistent fashion: compute the norm of the gradient of the related minimization problem. We compute this gradient in our adaptive scheme to decide whether self-consistent iteration or Newton-Raphson is getting "closer" to the converged result.
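A sketch of that gradient-norm criterion, using the standard MBAR objective gradient in the N x K storage discussed in this issue (the function name and array shapes are illustrative, not pymbar's actual interface):

```python
import numpy as np

def mbar_gradient(f, u_kn, N_k):
    """Gradient of the MBAR objective w.r.t. the dimensionless free
    energies f_k: grad_i = N_i * sum_n W_in - N_i, where
    W_in = exp(f_i - u_kn[i, n]) / sum_k N_k exp(f_k - u_kn[k, n]).
    Its norm vanishes at the solution, so it is a solver-independent
    convergence measure for both NR and self-consistent iteration."""
    log_denom = np.logaddexp.reduce(
        (f + np.log(N_k))[:, None] - u_kn, axis=0)  # log sum_k N_k e^{f_k - u_kn}
    W_sum = np.exp(f[:, None] - u_kn - log_denom[None, :]).sum(axis=1)
    return N_k * W_sum - N_k

# Toy check: with a single state, normalization forces the gradient to zero.
u_kn = np.array([[0.3, 1.1, 0.7]])
N_k = np.array([3.0])
print(np.linalg.norm(mbar_gradient(np.zeros(1), u_kn, N_k)))  # ~0 for K = 1
```

Because the log-sum-exp is done per sample, the gradient evaluation itself stays safe against overflow, so the same number is comparable across both iteration schemes.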
This does lose some of the advantages of self-consistent iteration, though. Please don't get me wrong in all of this: I think it's great that Kyle is working on this.
Note also that even if the self-consistent iteration and NR are using the same stopping criterion, the resulting accuracy can differ.
So NxK is done. There are some useful discussions here regarding precision and performance, but I think I've refreshed my memory of the key points.
For data sets where we have different numbers of samples at different states, rather than storing the energies in a K x K x N_each matrix, it would be more efficient to store the data as a K x N_tot matrix, where N_tot is the total number of samples collected and K is the number of states at which we are evaluating the energies. This will take a bit of extra work, however.
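A sketch of the conversion this proposes, assuming the u_kln layout from the issue title (the helper name and the toy data are illustrative): flatten the padded K x K x N_max array into a K x N_tot array by concatenating each origin state's real samples and dropping the padding.

```python
import numpy as np

def kln_to_kn(u_kln, N_k):
    """Flatten padded u_kln[k, l, n] (samples drawn from state k,
    evaluated at state l, padded out to N_max) into u_kn[l, n'] over
    the N_tot real samples, discarding the padding entries."""
    K = u_kln.shape[0]
    # take the first N_k[k] genuine samples from each origin state k
    return np.concatenate([u_kln[k, :, :N_k[k]] for k in range(K)], axis=1)

# 2 states with N_k = [3, 1], padded to N_max = 3:
u_kln = np.zeros((2, 2, 3))
u_kln[0, :, :3] = [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]]
u_kln[1, :, :1] = [[0.5], [1.5]]

u_kn = kln_to_kn(u_kln, np.array([3, 1]))
print(u_kn.shape)  # (2, 4): K x N_tot instead of K x K x N_max
```

Note the memory saving scales with K: the padded array holds K * K * N_max entries while the flat one holds only K * N_tot.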