
add amber TI parser and improve the subsampling code #32

Closed
wants to merge 14 commits

Conversation

@shuail (Collaborator) commented Oct 30, 2017

Fixes #10

  1. Add the Amber TI parser to extract dH/dl.
  2. Inside the statistical_inefficiency function in subsampling.py, switch to pymbar's subsampleCorrelatedData to generate the subsampled data points; the original code used pandas slicing, which required rounding the statistical inefficiency to an integer. (A short sketch of the pymbar call follows this list.)
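A minimal sketch of the pymbar-based subsampling described in point 2 (the DataFrame and column name are made up for the example, not the actual alchemlyb code):

import numpy as np
import pandas as pd
from pymbar.timeseries import subsampleCorrelatedData

# toy correlated series standing in for dH/dl data
df = pd.DataFrame({'dHdl': np.cumsum(np.random.randn(1000))})

# indices of effectively uncorrelated frames; no rounding of the
# statistical inefficiency to an integer is needed
indices = subsampleCorrelatedData(df['dHdl'].values)
subsampled = df.iloc[indices]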

@codecov-io commented Oct 30, 2017

Codecov Report

Merging #32 into master will decrease coverage by 0.27%.
The diff coverage is 93.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #32      +/-   ##
==========================================
- Coverage   95.04%   94.77%   -0.28%     
==========================================
  Files           7        9       +2     
  Lines         202      383     +181     
  Branches       28       64      +36     
==========================================
+ Hits          192      363     +171     
- Misses          5       10       +5     
- Partials        5       10       +5
Impacted Files Coverage Δ
src/alchemlyb/parsing/amber.py 93.25% <93.25%> (ø)
src/alchemlyb/__init__.py 100% <0%> (ø)
src/alchemlyb/parsing/util.py 100% <0%> (+11.11%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8915ece...3bc2c67.

@davidlmobley commented:

@shuail - this will need to come with tests. You'll probably want to put the data for the tests in alchemtest and then build in tests which rely on it here.

Note also that there is a "code coverage" analysis being run above, which I presume checks what portion of the code is exercised by the tests; since yours doesn't include tests, code coverage drops dramatically and that check fails.

@orbeckst orbeckst self-requested a review October 30, 2017 19:24
@orbeckst orbeckst self-assigned this Oct 30, 2017
@orbeckst (Member) commented:

As just discussed:

  1. add tests to alchemtest (open a PR there)
  2. write tests (initially modelled on the Gromacs ones) that use the alchemtest data set

Ping me if anything is unclear.

@shuail (Collaborator, Author) commented Oct 31, 2017

@orbeckst I just added the Amber dataset to alchemtest and created a PR there. I also added an Amber test here that uses the files from alchemtest. It runs fine on my local machine with the updated alchemtest. Fingers crossed for Travis once the alchemtest library gets updated.

@orbeckst (Member) commented:

I'll deal with PR alchemistry/alchemtest#6 first to sort out the data side of things.

@orbeckst (Member) left a comment

Thanks for getting started. See comments.

Major things

  • keep the PR focused on the Amber parser: split the subsampling changes into a separate PR
  • do not SysExit in a library
  • use logging instead of print (a sketch of these last two points follows this list)
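A minimal sketch of those last two points, with an illustrative function body rather than the actual parser code: report problems through the logging module and signal failure through the return value (or an exception) instead of terminating the interpreter.

import logging

logger = logging.getLogger(__name__)

def file_validation(outfile):
    """Illustrative only: log a warning and return False instead of
    printing and calling sys.exit()."""
    found_atomic_section = False  # stand-in for the real parsing result
    if not found_atomic_section:
        logger.warning('no ATOMIC section found, ignoring file %s', outfile)
        return False
    return True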

@@ -4,6 +4,7 @@
import numpy as np
from pymbar.timeseries import statisticalInefficiency
from pymbar.timeseries import detectEquilibration
from pymbar.timeseries import subsampleCorrelatedData
orbeckst (Member):

The changes to subsampling are independent from the TI parser. Please remove from this PR and submit a new PR. This allows us to focus on one issue at a time, including reviewing and testing, and also makes the git history cleaner.

orbeckst (Member):

It also helps if you raise an issue for changes to the library, e.g., "Replace pandas-based subsampling with pymbar.subsampleCorrelatedData". We can then discuss the change and link the PR to an issue.

orbeckst (Member):

I am not reviewing these changes here because they will be in a separate PR.

@@ -0,0 +1,21 @@
"""Amber parser tests.
orbeckst (Member):

We would also like to have a test with at least one of the estimators. Can you add an Amber test to test_fep_estimators.py?

Change the final format to pandas to be consistent with the alchemlyb format
"""

import pandas as pd
orbeckst (Member):

order imports in canonical order (a short example follows this list):

  1. standard lib (os, re, ...)
  2. scientific python (numpy, pandas, ...)
  3. alchemlyb (if present)
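A short example of that ordering (the modules are only illustrative):

# 1. standard library
import logging
import os
import re

# 2. scientific Python
import numpy as np
import pandas as pd

# 3. this package
from alchemlyb.parsing import util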

import numpy as np
import os

def convert_to_pandas(file_datum, ):
orbeckst (Member):

doc string

#self.mbar_energies = []


def file_validation(outfile, ):
orbeckst (Member):

doc string, also remove comma + white space in arg list

invalid = True

if not secp.skip_after('^ 3. ATOMIC '):
print(' WARNING: no ATOMIC section found, ignoring file\n')
orbeckst (Member):

use logging


t0, = secp.extract_section('^ begin time', '^$', ['coords'])
if not secp.skip_after('^ 4. RESULTS'):
print(' WARNING: no RESULTS section found, ignoring file\n')
orbeckst (Member):

use logging

file_datum.t0 = t0
file_datum.dt = dt
file_datum.T = T
return file_datum
orbeckst (Member):

I would simplify to

if invalid:
    return False

file_datum = FEData()
file_datum.clambda = clambda
file_datum.t0 = t0
file_datum.dt = dt
file_datum.T = T
return file_datum


def extract_dHdl(outfile, ):
file_datum = file_validation(outfile)
if file_validation(outfile):
orbeckst (Member):

rewrite the if to terminate early, the following would be clearer:

if not file_validation(outfile):
   return None

finished = False
comps = []
...
return df

from alchemtest.amber import load_simplesolvated


def test_dHdl():
orbeckst (Member):

good start

@orbeckst orbeckst mentioned this pull request Nov 1, 2017
@shuail (Collaborator, Author) commented Nov 2, 2017

@orbeckst Thanks for all your comments, they are very helpful. I made three major changes:

  1. switched to logging in the Amber parser and cleaned up the code (added docstrings, empty lines, etc.)
  2. added test_ti_estimators_amber.py to test the TI estimator for Amber (a sketch follows below)
  3. rolled back the subsampling commit; I will create a separate PR for it later
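A minimal sketch of what such an estimator test could look like. It assumes that load_simplesolvated() exposes a dict of legs mapping to lists of output files under its data attribute and that extract_dHdl takes only a filename, as in this PR; the actual test in test_ti_estimators_amber.py may differ.

import pandas as pd

from alchemtest.amber import load_simplesolvated
from alchemlyb.parsing.amber import extract_dHdl
from alchemlyb.estimators import TI


def test_ti_amber():
    dataset = load_simplesolvated()
    dHdl = pd.concat([extract_dHdl(filename)
                      for leg in dataset.data
                      for filename in dataset.data[leg]])
    ti = TI().fit(dHdl)
    # the estimated free energy differences should at least be finite
    assert pd.notna(ti.delta_f_).all().all()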

df = convert_to_pandas(file_datum)
return df

#currently just check the code with a simple amber ti output file
@orbeckst (Member) commented Nov 3, 2017:
remove the whole main section (also, the print statements use Python 2.7 syntax, but we run under both Python 2 and Python 3 – this makes Travis fail at the moment).

shuail (Collaborator, Author):

Got it, the main section is removed now.

@orbeckst (Member) commented Nov 3, 2017

This is starting to look good.

We would now like to get test coverage up. If you look at https://codecov.io/gh/alchemistry/alchemlyb/src/master/src/alchemlyb/parsing/amber.py (follow the codecov/project report and look for the overall coverage and navigate in the files) there are many lines in the function parsing.amber.file_validation() that are not exercised by the tests. We would want to have tests that actually show that you catch these errors as expected.

The best thing to do would be to include small files with errors and then just test the file_validation() function itself on these files. It is difficult to check that a log message was printed, but you can test that file_validation() returns False for these cases. (As a side note, if you need to test exceptions and warnings, pytest also makes this straightforward.)
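A minimal sketch of such a validation test; the load_invalidfiles() accessor and the layout of its data attribute are assumptions here (an invalid-file dataset is only proposed further down in this discussion):

import pytest

from alchemtest.amber import load_invalidfiles
from alchemlyb.parsing.amber import file_validation

# flatten the (assumed) groups of invalid output files into one list
invalid_files = [filename
                 for group in load_invalidfiles().data
                 for filename in group]


@pytest.mark.parametrize('invalid_file', invalid_files)
def test_file_validation_rejects_invalid(invalid_file):
    # a broken output file should be rejected, not crash the parser
    assert not file_validation(invalid_file)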

(This is where finalizing PRs becomes painful and boring, but it is vital for the long-term survivability of the code that we check things now and are then assured that in the future everything will still work, or that we learn when it doesn't.)

Other comments on the code.

def __init__(self, filename):
"""Opens a file according to its file type."""
self.filename = filename
with open(filename, 'r') as f:
orbeckst (Member):

This approach with _MAGIC_CMPR can be replaced with util.anyopen(); see, for example, https://github.com/alchemistry/alchemlyb/blob/master/src/alchemlyb/parsing/gmx.py#L143 . That's cleaner and more expandable. (I hadn't picked up on it in my earlier code reviews.)

(You can do it yourself or I am also happy to finish the PR as is and do this later and separately.)
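A minimal sketch of the suggested change (the class name and attributes are illustrative): let util.anyopen() handle plain and compressed files instead of the _MAGIC_CMPR lookup.

from alchemlyb.parsing.util import anyopen


class SectionParser(object):
    def __init__(self, filename):
        """Open a (possibly compressed) Amber output file for parsing."""
        self.filename = filename
        # anyopen picks the appropriate opener based on the file type
        self.fileh = anyopen(self.filename, 'r')
        self.lineno = 0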

shuail (Collaborator, Author):

Got it, will try to switch to anyopen

@shuail (Collaborator, Author) commented Nov 3, 2017

@orbeckst Got it, I will add some tests for the file validation function. For the test output files, I plan to create an invalidfiles folder under alchemtest/amber and use the alchemtest framework in alchemlyb. Does this sound good?

@orbeckst (Member) commented Nov 3, 2017

Yes, good idea. Just check that the files are properly included when you run pip install, i.e., proper inclusions in setup.py (and perhaps MANIFEST.in... if we use it).

@shuail (Collaborator, Author) commented Nov 13, 2017

@orbeckst Just finished adding the test code here and the test files (in alchemtest) to test the file validation function in the Amber parser, and switched the file opener to anyopen. Hope the tests pass this time :)

@orbeckst (Member) commented:

Looking good, just waiting for PR alchemistry/alchemtest#11.

@orbeckst (Member) left a comment

Using anyopen() will require a few more changes, which I hadn't realized earlier.

If this becomes too difficult, I would suggest reverting the change where you introduced anyopen and using the original implementation for this PR. Then raise an issue to use anyopen and we can sort this out in a separate PR.

EDIT: Well, thinking about this again: try the original code — maybe it works and I am wrong.

Other minor things: see comments.

DVDL_COMPS = ['BOND', 'ANGLE', 'DIHED', '1-4 NB', '1-4 EEL', 'VDWAALS',
'EELEC', 'RESTRAINT']
_FP_RE = r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?'
_MAGIC_CMPR = {
orbeckst (Member):

not needed anymore since the switch to anyopen. Please remove.

shuail (Collaborator, Author):

removed

import re
import pandas as pd
import numpy as np
import logging
orbeckst (Member):

empty line between standard library imports and package imports .... but this is nit-picking on my part ;-)

orbeckst (Member):

logging would come after re and before numpy because it is standard lib... according to PEP 8

shuail (Collaborator, Author):

Got it, pushed a fix.

yield first

while it:
yield it.next()
orbeckst (Member):

For Python 3 this should be

next(it)

(I think)

shuail (Collaborator, Author):

Yes, changed it to next(it).

match = re.search(pattern, line)
if match:
break
return self.fileh.tell() != self.filesize
orbeckst (Member):

This is going to break when we use anyopen because we never uncompress the file to disk and thus the file size of the compressed file is not related to the position in the uncompressed stream fileh.

(More specifically: For a compressed file, self.filesize < self.fileh.tell() for some positions so you might get a random occurrence where this is False in the middle of the file. Conversely, you won't get True at the real end of the file.)

I think this is supposed to check for EOF (end of file). This would need to be implemented differently, depending on how skip_after() is used. Perhaps it can be rewritten along the lines of https://stackoverflow.com/a/24738688/334357?

orbeckst (Member):

Well, thinking about this again: try it out, maybe it works and I am wrong.

shuail (Collaborator, Author):

Yes, I checked that self.filesize < self.fileh.tell() for the zipped file, and the two are not equal at the end of the file, so I changed the skip_after() function to not use this detection but to explicitly return True if the pattern is found while iterating over the whole file.
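A minimal sketch of that approach (class and attribute names are illustrative): iterate over the file handle and report success explicitly instead of comparing file positions.

import re


class SectionParser(object):
    def __init__(self, fileh):
        self.fileh = fileh
        self.lineno = 0

    def skip_after(self, pattern):
        """Skip lines until one matches pattern; True if found, False at EOF."""
        for line in self.fileh:
            self.lineno += 1
            if re.search(pattern, line):
                return True
        return False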

"""Read next line of the filehandle and check for EOF."""
self.lineno += 1
curr_pos = self.fileh.tell()
if curr_pos == self.filesize:
orbeckst (Member):

This will break with compression (see comment above).

Not even sure why this is needed. Can't we just say

def next(self):
  self.lineno += 1
  return next(self.fileh)

This should raise StopIteration when at the end and it should return the line.

shuail (Collaborator, Author):

Yes, that's a better solution. Sorry, I didn't think about that part much when migrating the code from alchemical-analysis. It looks like this test successfully catches the untested parts of the original code!

return file_datum

def extract_dHdl(outfile):
"""Return gradients `dH/dl` from Amebr TI outputfile
orbeckst (Member):

typo: Amber

shuail (Collaborator, Author):

Fixed

self.filename = filename
try:
self.fileh = anyopen(self.filename, 'r')
self.filesize = os.stat(self.filename).st_size
orbeckst (Member):

This cannot be used for compressed files (see other comments).

I think this line can be deleted because it is useless.

orbeckst (Member):

Well, thinking about this again: try it out, maybe it works and I am wrong.

shuail (Collaborator, Author):

Yes, I think you are right, the filesize check does not work here for zipped files; I removed the filesize-related detection from the code.

@orbeckst (Member) commented Nov 13, 2017

I merged alchemistry/alchemtest#11 so you should now have the appropriate test files available for Travis CI. I will manually restart test #90. You don't need to do anything until the test results come back.

import pandas as pd
import numpy as np

from util import anyopen
orbeckst (Member):

Needs to be

from .util import anyopen

shuail (Collaborator, Author):

Oops, yes, fixed this

@shuail (Collaborator, Author) commented Nov 13, 2017

@orbeckst thanks for your comments! Yes, after I switched to the zipped files, the end-of-file (EOF) detection in the skip_after function no longer worked, so I changed that part and also the next() function to make them work with zipped files. anyopen actually works fine here.

@orbeckst (Member) commented:

One small change to get it to pass under python 3.

@orbeckst (Member) commented:

Once the tests pass, we have the coverage. Assuming that this will look good, we need to make sure that it can be merged and I want to clean up the PR, i.e., merging related commits and removing the subsampling code that was added and deleted. I can do this locally with git rebase.

A consequence for you will be that you also have to rebase your fork because you have been working from master instead of a branch. I can tell you what I would do if you want to know.


@orbeckst (Member) left a comment

One tiny additional test, please, then I am happy ;-)


for element in sequence:
    if element is None:
        return True
orbeckst (Member):

This is looking very good. The only way that you can sensibly improve the coverage is by adding a test for the function

parsing.amber.any_none(sequence)

Basically, make sure that line 42 is hit. The other remaining checks are fairly standard defensive programming and I am not even sure how to test them properly.
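A minimal sketch of such a test (the exact inputs depend on how any_none() is used in the parser):

from alchemlyb.parsing.amber import any_none


def test_any_none():
    # a sequence containing a None should be flagged ...
    assert any_none([1.0, None, 3.0])
    # ... while a fully populated sequence should not
    assert not any_none([1.0, 2.0, 3.0])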

shuail (Collaborator, Author):

Got it, added a test function for this line.

@orbeckst (Member) commented:

@shuail looking good, I'll do the rebase... hopefully done before I have to run to class...

@orbeckst orbeckst mentioned this pull request Nov 13, 2017
@orbeckst (Member) commented:

I created PR #36 to finalize.

@orbeckst orbeckst closed this Nov 13, 2017