Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NAMD parsers #7

Closed
dotsdl opened this issue Feb 6, 2017 · 32 comments
Closed

NAMD parsers #7

dotsdl opened this issue Feb 6, 2017 · 32 comments
Assignees
Labels
Projects
Milestone

Comments

@dotsdl
Copy link
Member

dotsdl commented Feb 6, 2017

In line with the overall API proposal, we want to have parsers for each of the major MD engines, and eventually have coverage for all of those in use. Since there are essentially two types of estimators (TI and FEP), each packages needs a parser for:

  1. Extracting reduced potentials u_nk from output files (for FEP).
  2. Extracting derivatives DHdl from output files (for TI).

This issue is the nexus for discussion for such parsers for the NAMD package. If you have existing parsing code for this package, comment below and we can begin adapting it into the parsers outlined above in a PR.

@davidlmobley
Copy link

I'm going to ask Christophe Chipot about this. @Limn1 in my group also has some basic parsers which might be good enough/serve as a starting point/provide a skeleton.

@davidlmobley
Copy link

Chipot is unfortunately unable to help at present, he reports. @Limn1 may be able to help but is tied up with some stuff on deadline til early March, I expect.

@davidlmobley
Copy link

davidlmobley commented Nov 28, 2017

I talked to @vtlim about this today. She could use some of her existing data, but we also batted around the idea of generating some sample data for a simpler test system.

@orbeckst
Copy link
Member

Good, feel free to grab alchemistry/alchemtest#17 !

@mrshirts
Copy link

There's a data set from JC Gumbart for the Tyr->Ala example from the NAMD FEP tutorial that he gave me. The data files are at http://simbac.gatech.edu/forMichael/NAMD-FEP-tutorial-TYR2ALA.tgz

If these look useful, then I can contact him about the requisite permissions.

@vtlim
Copy link
Collaborator

vtlim commented Nov 28, 2017

@mrshirts - if this is the same Tyr->Ala example as in this tutorial, I've played around with some of their example output a while back.

So, yes, it'd be helpful to see if we had permissions to use these.

@orbeckst
Copy link
Member

orbeckst commented Dec 5, 2017

I moved the discussion of tests datasets to alchemistry/alchemtest#17

@orbeckst
Copy link
Member

@davidlmobley please assign this issue (and alchemistry/alchemtest#17) to @vtlim

@vtlim
Copy link
Collaborator

vtlim commented Dec 12, 2017

@orbeckst - Apologies for the delay; I had a proposal application due in past hour.
Also, I see mention of "reST" in alchemistry/alchemtest#17
but could not find mention of this term elsewhere in either repo. This may be a trivial question,
but can you clarify what it is?

@orbeckst
Copy link
Member

orbeckst commented Dec 12, 2017 via email

@davidlmobley
Copy link

@orbeckst - we've not used Sphinx yet, so this is helpful. yet another argument for a general "best practices for code development" writeup. :)

@orbeckst orbeckst added this to the release 0.2.0 milestone Jan 16, 2018
@orbeckst
Copy link
Member

@vtlim the NAMD data set is now in alchemtest – thank you for making this work.

I tentatively added this feature to the 0.2.0 release but if you are pretty busy at the moment then we can easily move it to 0.3.0. We can release as quickly as we like.

@vtlim
Copy link
Collaborator

vtlim commented Jan 31, 2018

@orbeckst -- can you clarify what the fep-lambda column means in the pandas dataframe? I was reading the docs here but it's not clear to me what that column is for. I'm not sure if I should add that column after parsing the output file in namd.py or elsewhere.

@orbeckst
Copy link
Member

The html-formatted docs are nicer: https://alchemlyb.readthedocs.io/en/latest/parsing.html

I don't know what the NAMD data look like. But if it comes from any kind of free energy calculation then there should be either (for thermodynamic integration, TI) dH/dlambda derivatives (the dHdl standard form or for FEP/BAR/MBAR kind of calculations, potential energies evaluated at different lambdas (the u_nk standard form).

Do you know what kind of calculations were done in the NAMD case?

If it is dHdl then "fep-lambda" would be a generic name for the lambda value at which dHdl was sampled. For Gromacs, one has typically multiple lambda values (coul-lambda, vdw-lambda, maybe restr-lambda, ...) but for the Amber reader they are just called "lambdas"

df["lambdas"] = data_dic["lambdas"][0]

@vtlim
Copy link
Collaborator

vtlim commented Feb 1, 2018

The NAMD case I am working on is the FEP-based estimator (the extract_u_nk function).

Based on the link you sent:

For datasets that sample only a single λ parameter, then the DataFrame will feature only a single index in addition to time, with the values of λ for which reduced potentials were recorded given as column labels.

My understanding of the above was that the different columns represent the lambda value of sampling, but that would mean the fep-lambda column is doing something else since it has values of both 0 and 1 in the example in the docs.

@orbeckst
Copy link
Member

orbeckst commented Feb 1, 2018

@vtlim I am looking at the test data. But while doing so, I found that we overlooked something in the test data accessor. I pushed PR alchemistry/alchemtest#29 – if you have a moment could you review the PR, please? Thanks.

I'll say more about u_nk when I had a look at the data... hopefully in a few minutes.

@orbeckst
Copy link
Member

orbeckst commented Feb 2, 2018

For datasets that sample only a single λ parameter, then the DataFrame will feature only a single index in addition to time, with the values of λ for which reduced potentials were recorded given as column labels.

My understanding of the above was that the different columns represent the lambda value of sampling, but that would mean the fep-lambda column is doing something else since it has values of both 0 and 1 in the example in the docs.

The DataFrame should contain all data needed to run BAR/MBAR or one of the FEP estimators (which we don't have implemented yet). The "fep-lambda" index states at which lambda this particular frame was sampled. If you have one output file at lambda=0, one at lambda=0.5, and one at lambda=1 then your parser needs to read all the data files and join them together in one big DataFrame. The timesteps (or trajectory "frames") are each tagged with the lambda at which they were sample, i.e., data from the lambda=0.5 file will all have fep-lambda 0.5.

The columns are the evaluations of the Hamiltonian (or the potential energy U) at other lambdas (sometimes called "foreign lambdas"): take positions sampled at fep-lambda lambda0, x_lambda0 and evaluate the energy function U(x, lambda=lambda1) with the lambda set to foreign lambda1: U(x=x_lambda0(t), lambda=lambda1) — this gives you row lambda0 with column lambda1 (at time t).

I am looking at the data (the following needs PR alchemistry/alchemtest#29 to work):

>>> import alchemtest.namd
>>> import bz2
>>> t = alchemtest.namd.load_tyr2ala()
>>> fwd = bz2.open(t.data['forward'][0], mode="rt").readlines()
>>> print("".join(fwd[0:5]))
#            STEP                 Elec                            vdW                    dE           dE_avg         Temp             dG
#                           l             l+dl             l            l+dl         E(l+dl)-E(l)
#NEW FEP WINDOW: LAMBDA SET TO 0 LAMBDA2 0.05
FepEnergy:     10      -5391.7740     -5390.5907       533.4814       532.2565        -0.0415        -0.1133       302.6836        -0.1161
FepEnergy:     20      -5281.7946     -5280.8118       452.4335       451.4536         0.0029        -0.0665       298.4974        -0.0702

It looks to me as if we have a single lambda (as you say), which has the value lambda0=0. I am not quite sure how to interpret LAMBDA2 and the l+dl; I am assuming that l+dl == LAMBDA2. In my notation then you would apparently have Elec and vdW evaluated at lambda0 (l) and lambda1 (l+dl). I don't know if the sum of Elec and vdW is the full change in energy.

I think the question is how do you normally process the NAMD data to generate the free energies?

@vtlim
Copy link
Collaborator

vtlim commented Feb 2, 2018

If you have one output file at lambda=0, one at lambda=0.5, and one at lambda=1 then your parser needs to read all the data files and join them together in one big DataFrame.

NAMD joins the different lambda windows in its standard routine. Here's a web example of a NAMD output file. You can see that new windows are declared by "#NEW FEP WINDOW: LAMBDA SET TO ..." I don't have an additional set of lambda values other than these.

I am assuming that l+dl == LAMBDA2. In my notation then you would apparently have Elec and vdW evaluated at lambda0 (l) and lambda1 (l+dl).

Yes, that should be the case for both statements.

I don't know if the sum of Elec and vdW is the full change in energy.

Correct. Essentially, the dE column is equivalent to elec(l+dl) - elec(l) added to vdW(l+dl) - vdW(l).

I think the question is how do you normally process the NAMD data to generate the free energies?

We have two fepout files from NAMD: one representing the forward work of transformation, and one representing the reverse work. For each lambda window, (e.g., 0 to 0.05), we:

  • subsample fwd and rev based on statistical efficiency (done with the timeseries module of pymbar )
  • apply the BAR estimator to get the dG value
    Then net dG is the sum of the windows.

I can share the code I am in the process of adapting and paste the example output from the data I put in alchemtest.

@vtlim
Copy link
Collaborator

vtlim commented Feb 3, 2018

The DoBAR function in this script does what I discussed above.

@vtlim
Copy link
Collaborator

vtlim commented Feb 3, 2018

Applying the code I shared above on the example that is in alchemtest:
ex1_180126.txt

@orbeckst
Copy link
Member

orbeckst commented Feb 6, 2018

From what I can gather from what you posted is that you should assemble the data into a DataFrame that roughly looks like the following

                                   0.0           0.05             0.1            0.15          ......
time  fep-lambda

0.0       0.0                 xxx               yyy              0                0
1.0       0.0                 xxx               yyy              0                0
...
...
0.0       0.05                 0                xxx            yyy              0             
1.0       0.05                 0                xxx            yyy              0             
...
...
0.0       0.1                 0                    0           xxx             yyy                   
1.0       0.1                 0                    0           xxx             yyy                      
...
...

I think this should be for the total E (because that would be the u_nk) but I'd be happy if @mrshirts or @davidlmobley would chime in.

We could either have a function that also extracts the Elec and VDW components into the same format. Or one could add additional multi-indices to this DataFrame but it seems to me that this would break the overall API ( @dotsdl ?)

@davidlmobley
Copy link

That sounds right, I think. But @vtlim - you basically have this already working with BAR in your existing analysis, right? So you should be able to take a file you can analyze there, format it to analyze with alchemlyb, and check that you get the same answers... It should be using the same code (BAR/MBAR) for analysis, so the differences should just be in terms of how the data flow is organized.

@vtlim
Copy link
Collaborator

vtlim commented Feb 8, 2018

@orbeckst - BAR is not yet in alchemlyb, is it? I don't see it in the docs nor in my conda installation of alchemlyb estimators. I reassembled my DataFrame, but it's not suited for MBAR.
The sum of all N_k must equal the total number of samples (length of second dimension of u_kn.)

In other words, I have 20,000 samples based on 20,000 time steps, and this is not equal to the number of windows (20).

@orbeckst
Copy link
Member

orbeckst commented Feb 8, 2018

The missing BAR estimator #28 is one of the major open issues.

I don't think that the MBAR estimator requires you to have as many foreign lambdas as samples. Does this error (is it an error message?) come from alchemlyb or from pymbar?

@davidlmobley
Copy link

@vtlim could you say a little more about how you reassembled your data frame? Also, the traceback of the error message would probably help.

I think the error message is telling you that you have $N_k$ for some lambda which is not equal to the number of samples at that lambda value, e.g. at some lambda value you have $N_k$ set to, say, 15,345 but the number of snapshots provided is 20,000 (or some such).

@vtlim
Copy link
Collaborator

vtlim commented Feb 8, 2018

The error message above came from pymbar. The error is raised here.

I think the error is actually not an issue with mbar but with my dataframe somehow, triggered by somewhere around here. Right now, my code passes the wrong value for N_k which is why MBAR is unable to work.

        u_nk = u_nk.sort_index(level=u_nk.index.names[1:])

        groups = u_nk.groupby(level=u_nk.index.names[1:])
        N_k = [(len(groups.get_group(i)) if i in groups.groups else 0) for i in u_nk.columns]

For the Gromacs data I'm testing, it returns N_k of [4001, 0, 0, 0, 0] because there are 4001 samples and 5 sets of lambda values. For the NAMD data I'm testing, I have 1001 samples and 20 lambda windows. Here, N_k is [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]. I would expect the first item in that list to be 1001.

Here's the head of the Gromacs dataframe:

                         0.00       0.25       0.50       0.75       1.00
time    fep-lambda                                                       
0.0     0.0         13.257467  16.496997  19.736528  22.976059  26.215589
10.0    0.0          9.232471  11.465868  13.699265  15.932662  18.166059
20.0    0.0          5.423357   6.706389   7.989420   9.272451  10.555482

Here's the head of the NAMD dataframe:

                         0    0.05     0.1    0.15     0.2    0.25     0.3  \
time    fep-lambda                                                           
10000.0 0.0        -0.2674  0.5577  0.4597  0.5728  0.2116  0.5174  0.1851   
10010.0 0.0        -0.1513  0.2481  0.2725  0.2319  0.5871  0.5013  0.1299   
10020.0 0.0         0.1559 -0.1338  0.3915  0.1978  0.5235  0.6366  0.1526   
10030.0 0.0         0.1265  0.1149  0.5136 -0.6523  0.4662  0.4138  0.4163 

I'm not sure where the difference lies right now, but it must have something to do with how I created the pandas dataframe. I printed out the groups variable of the code above N_k, and both look analogous, having levels of (1) the time, (2) sequential integers, (3) zeroes with each level the same length as number of samples. I'm relatively new to pandas, so maybe there is something I'm not seeing. Any suggestions on where to look?

@vtlim
Copy link
Collaborator

vtlim commented Feb 8, 2018

I found the issue. The lambda values of the column headers have an inconsistent number of decimal places as printed out by NAMD. That means the 0 of the column header doesn't match exactly the 0.0 of the fep-lambda column. Right now I have a workaround to list the column headers and the fep-lambda column to be of two decimal points and of the same data type. How I end up fixing that workaround depends on how I want to structure the final dataframe (which should depend on the BAR implementation, I think).

The dataframe I printed out above is technically not correct since I populated by the column lambdas instead of the fep-lambda. The correct version (which I have also tested successfully with MBAR) looks more along the lines of what @orbeckst said earlier (but I have the yyy values as zero):

                       0.00  0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.40  \
timestep fep-lambda                                                           
10000.0  0.00       -0.2674   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
10010.0  0.00       -0.1513   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
10020.0  0.00        0.1559   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
10030.0  0.00        0.1265   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   

The reason that I think this should depend on how BAR is implemented is that it should take in a list of forward and reverse work values. This last dataframe I'm showing has 20020 rows compared to 1001 rows, and most of them are zero. It seems to me that the first format I posted would make more sense, but that would mean that the lambdas take on a different meaning in this application -- which I'm not sure would be consistent with the API.

@mrshirts
Copy link

What's the current status on this, @vtlim ? Anything that needs looking at?

@vtlim
Copy link
Collaborator

vtlim commented Mar 26, 2018

@mrshirts -- I have code takes the NAMD output and formats it into a pandas dataframe consistent with the existing API. The dataframe works with MBAR but I'm not sure if the output values are correct (I'm not sure how valid it is to put this dataset into MBAR. Though I realize that MBAR reduces to BAR in the limit of sampling just two states.) As of now, having the BAR estimator (#28) would be ideal for me to wrap this up.

@mrshirts
Copy link

Thanks for letting me know that BAR is a limiting issue now. Yes, one could examine it by simply running BAR on all pairwise states; that will give free energies that are equivalent to BAR.

@orbeckst
Copy link
Member

@vtlim we now have a BAR estimator (issue #28) in the development version. Any chance you can pick this one up again?

@vtlim
Copy link
Collaborator

vtlim commented Jan 14, 2019

@orbeckst - I'm not sure I'll be able to work on the NAMD parser soon due to file system rearrangements + operating system updates on our local cluster and also having not worked on this project in a while. So I can work on it, but probably not anytime soon.

@orbeckst orbeckst removed this from the release 0.2.0 milestone Feb 16, 2019
@orbeckst orbeckst added this to the release 0.2.0 milestone Mar 18, 2019
@orbeckst orbeckst moved this from posted to complete in parsing Jul 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
parsing
complete
Development

No branches or pull requests

5 participants