Use atom slice #35

sroet · 2018-04-25T21:21:55Z

This uses atom_slice to cut down a trajectory before computing the neighborlist. This should speed up contactmap/frequency calculations for real systems
The code is not yet completely implemented, but is only tested at one point, and not yet benchmarked for now.

TODO:

write documentation
write proper tests
update example
do benchmarking

EDIT: found the first bug already (only converting the list/ not the numpy arrays)

dwhswenson · 2018-04-25T22:47:58Z

The Travis error is a py-3.4 error that has existed for a few weeks now. It looks like something goes wrong in the pandas install. That's not on you (exists independently of this PR), but you're more than welcome to investigate! I haven't had time.

On the CodeClimate stuff, I'm somewhat ambivalent. Things that better fit the current CodeClimate rules (relaxed from the defaults) are better, but I'm willing to make exceptions. Maybe try making use_atom_slice a class variable that defaults to True? That way we can turn it off for more thorough testing, but you don't increase argument count. That also removes cognitive complexity from ContactObject.__init__. I'll take a closer look at the complaints about ContactObject.contact_map and offer suggestions later (possibly offline). Obviously, you intend to fix your TODOs, so I'm not worried about those.

Finally, don't worry about serializing use_atom_slice. I take the view that things that only affect the performance, not the actual output, don't need to be serialized. We expect that this should be that sort of situation.

sroet · 2018-05-15T21:39:04Z

Alright, I benchmarked everything in contact_map.ipynb. The atom slicing adds some overhead but is faster for most situations, on my system (default slices 2722 atoms to 1408):

The single atom_slice per ContactMap or ContactFrequency: adds about 0.4 s. This scales with the number of atoms left after the slicing. It however does not scale with the length of the trajectory (as long as both the original and sliced trajectory fit in memory)
The computation time of the neighborlist reduces by the relative amount of atoms sliced, from 0.02 s to 0.01 s per frame for the defaults
The conversion of the neighborlist to the original indices adds about 0.01 s per frame for the default settings, this also time reduces if less atoms are left after slicing

Even for this trajectory without waters using atom_slice performs better for almost everything in the example, except for the ContactMap (110 ms to 516 ms because of the single 0.4s overhead) and the ContactFrequency with only a different query and no different haystack (2.87s to 3.18s). On the other hand for the example where both the query and haystack are defined (slicing the system down to 90 atoms) it is much faster (2.71s to 348ms).

Seeing this I would argue that use_atom_slice=True should be the default for both ContactMap and ContactFrequency. What is your opinion on this @dwhswenson ? And could you see if these benchmarks hold up you?

sroet · 2018-06-01T11:16:15Z

@dwhswenson This is ready for some benchmarking. I don't expect the working of the code to change, I only need to write extra documentation/tests and update the example.
Using the script on the bottom on different slices of a carbon (CH4 in united atoms) water (TIP4P) system:
On a node with 32 logical cores:

length traj: 2411 frames
Carbons: 511
Oxygens: 2944
Combined selections: 3455
All atoms: 12287

 3 frames
carbon query and haystack
with atom_slice: 0.20131815969944
no atom_slice: 0.28427401185035706
carbon query and oxygen haystack
with atom_slice: 0.2735431417822838
no atom_slice: 0.3182491138577461

 25 frames
carbon query and haystack
with atom_slice: 0.47005893290042877
no atom_slice: 1.526077225804329
carbon query and oxygen haystack
with atom_slice: 1.1776891574263573
no atom_slice: 1.7423365116119385

 242 frames
carbon query and haystack
with atom_slice: 4.201747737824917
no atom_slice: 14.126300185918808
carbon query and oxygen haystack
with atom_slice: 14.567059725522995
no atom_slice: 21.696870408952236

 2411 frames
carbon query and haystack
with atom_slice: 55.376811660826206
no atom_slice: 159.75818028301
carbon query and oxygen haystack
with atom_slice: 337.8058380037546
no atom_slice: 406.616727553308

And on a 4 core desktop

length traj: 2411 frames
Carbons: 511
Oxygens: 2944
Combined selections: 3455
All atoms: 12287

 3 frames
carbon query and haystack
with atom_slice: 0.26453333103563637
no atom_slice: 0.45479931705631316
carbon query and oxygen haystack
with atom_slice: 0.38118267501704395
no atom_slice: 0.5180625190259889

 25 frames
carbon query and haystack
with atom_slice: 0.57133743702434
no atom_slice: 2.5859786020591855
carbon query and oxygen haystack
with atom_slice: 1.5568417840404436
no atom_slice: 3.0209178120130673

 242 frames
carbon query and haystack
with atom_slice: 4.820568470982835
no atom_slice: 24.34508377406746
carbon query and oxygen haystack
with atom_slice: 17.302444033091888
no atom_slice: 31.778738516033627

 2411 frames
carbon query and haystack
with atom_slice: 60.8008491731016
no atom_slice: 257.47405256098136
carbon query and oxygen haystack
with atom_slice: 343.60291336802766
no atom_slice: 493.1326912479708

This was done using the following script:

from contact_map import ContactFrequency                                        
import mdtraj as md                                                             
import timeit                                                                   
                                                                                
traj = md.load('54.xtc', top='54.gro')                                          
print("length traj: {} frames".format(str(len(traj))))                          
                                                                                
top = traj.topology                                                             
c = top.select('element C')                                                     
o = top.select('element O')                                                     
                                                                                
                                                                                
print("Carbons: {}".format(str(len(c))))                                        
print("Oxygens: {}".format(str(len(o))))                                        
print("Combined selections: {}".format(str(len(c)+len(o))))                     
print("All atoms: {}".format(str(top.n_atoms)))                                 
                                                                                
class Test(object):                                                             
    from contact_map import ContactFrequency                                    
    import mdtraj as md                                                         
    def __init__(self, skip=1):                                                 
        self.traj = md.load('54.xtc', top = '54.gro')[::skip]                   
        self.top = traj.topology                                                
        self.c = self.top.select('element C')                                   
        self.o = self.top.select('element O')                                   
                                                                                
    def use_atom_slice_cs(self):                                                
        ContactFrequency._class_use_atom_slice = True                           
        _ = ContactFrequency(self.traj, query=self.c, haystack=self.c)          
                                                                                
    def use_atom_slice_os(self):                                                
        ContactFrequency._class_use_atom_slice = True                           
        _ = ContactFrequency(self.traj, query=self.c, haystack=self.o)          
                                                                                
                                                                                
    def use_no_atom_slice_cs(self):                                             
        ContactFrequency._class_use_atom_slice = False                          
        _ = ContactFrequency(self.traj, query=self.c, haystack=self.c)          
                                                                                
    def use_no_atom_slice_os(self):                                             
        ContactFrequency._class_use_atom_slice = False                          
        _ = ContactFrequency(self.traj, query=self.c, haystack=self.o)          
                                                                                
                                                                                
for skip in [1000,100,10,1]:                                                    
    test = Test(skip)                                                           
    frames = len(test.traj)                                                     
    print("\n {} frames".format(str(frames)))                                   
    print("carbon query and haystack")                                          
    print("with atom_slice: {}".format(timeit.Timer(test.use_atom_slice_cs).timeit(number=1)))
    print("no atom_slice: {}".format(timeit.Timer(test.use_no_atom_slice_cs).timeit(number=1)))
    print("carbon query and oxygen haystack")                                   
    print("with atom_slice: {}".format(timeit.Timer(test.use_atom_slice_os).timeit(number=1)))
    print("no atom_slice: {}".format(timeit.Timer(test.use_no_atom_slice_os).timeit(number=1)))

I do not have a straightforward way of providing you with the .xtc and .gro as they are 100 MB (when tarred) and the login of the figshare seems to be down. Could you try it on one of your systems to see how it holds up?

sroet · 2018-07-09T10:08:34Z

Alright, I am pretty confident it behaves properly now. every single contactmap/frequency in the example notebook (that has all the waters already cut out) is at least as fast or a lot faster (when query and haystack are both defined). This should be the worst real world case.
The absolute worst case scenario for this code would be a dense contactmap/contactfrequency (a lot of int to int mappings) for a topology with a single atom not included in the query/haystack (not triggering the safeguard to not do the mapping if nothing is cut). But I could not think of a real world example for this.

You could also check this performance by checking out this branch and by running the example notebook. Too deactivate the use of atom_slice add a cell after the import cell with:

ContactMap._class_use_atom_slice = False
ContactFrequency._class_use_atom_slice = False
ContactDifference._class_use_atom_slice = False

…ization

sroet · 2018-07-10T14:41:05Z

@dwhswenson this is ready for a review and a merge

sroet · 2018-07-10T14:52:46Z

coverage falls due to an error on the last line python-codacy-coverage -r cov.xml:
2018-07-10 14:44:41,194 - ERROR - environment variable CODACY_PROJECT_TOKEN is not defined.
(on all builds them, which is interesting )

EDIT: And it bumped up again to 100%, strange...

dwhswenson

Looks mostly good. One other change, aside from the comments above: In the main example, you didn't run the stuff under the "Use a different cutoff". I know they take a while, but could you re-run that before this goes into master?

Overall, it seems to me that what we really want here is to create two different IndexManager classes; one that works without atom slicing and one that works with atom slicing. This would move all the code you now have for that out of the ContactObject class, essentially abstracting the stuff we have now. This would also get rid of a lot of if/else blocks you have.

But we'll merge it in like this now, and try that as part of a later update.

contact_map/contact_map.py

dwhswenson · 2019-02-08T23:13:43Z

@sroet : Is there anything left to do on this? I had been misreading GitHub's status on my review comments (not noticing that they were outdated) and didn't realize you'd already dealt with pretty much everything.

The only (tiny) thing left is that my style for the alignment of multiline statement where the linebreak is immediately after the opening punctuation to be this:

dict_comprehension = {
    key: function_that_creates_value(key)
    for key in some_long_name_here
}

(similar to "one true brace" in C). You're still one indent level too far in. (I can understand why my comment wasn't clear.) Obviously not a big deal, but either you update it or I will. You might as well get credit for the commit (and keep the git-blame for the line!)

Go ahead and update that in a (really fast) commit, and let me know if there's anything else you're cleaning up in this before I merge.

Also, note that your GitHub email wasn't set (probably in gitconfig on whatever computer you were working on.) Please make sure that's set to take credit for future contributions!

sroet · 2019-02-11T12:51:40Z

@dwhswenson Did the last comment, and fixed the email.

sroet added 2 commits April 25, 2018 22:57

added atom slice code

17d949d

added at least 1 test

ea177ee

sroet added 2 commits May 15, 2018 19:36

atom slice does not alter the result, but is 10 times as slow

1f934c8

updated some pep8

caf225e

sroet added 10 commits May 23, 2018 22:10

Added documentation

698d4e5

Added use_atom_slice to dask runner

f440526

Added tests for use_atom_slice

4753c00

reworked code to fix code climate issues

57d4129

Refactor to make _use_atom_slice a class variable

9b484ca

changed class variable name and all atom tuple to sorted list

10107cd

reworked to do everything in sliced space

88bd67f

finished up contact map optimisation

ff5f20c

Mocked out slow md.atom_slice

52bd026

Merge branch 'master' into use_atom_slice

0fb5d0a

sroet added 4 commits July 9, 2018 17:41

moved back to tuple int to int mapping and fixed some pep8 stuff

8983fe5

added test for 100% coverage

5f7e96f

Merge branch 'master' into use_atom_slice

046c7bb

Updated the example and added an example on how to shut off the optim…

9276bec

…ization

sroet changed the title ~~WIP:Use atom slice~~ Use atom slice Jul 10, 2018

dwhswenson added enhancement performance labels Jul 19, 2018

dwhswenson requested changes Oct 10, 2018

View reviewed changes

Sander Roet added 2 commits November 17, 2018 14:57

merge upstream/master

4d882d9

fixed residue to ignore atoms when using atom_slice

1d584f9

work through comments David

263e7c4

final style issue

6526d28

dwhswenson approved these changes Feb 13, 2019

View reviewed changes

dwhswenson merged commit 9ed6977 into dwhswenson:master Feb 13, 2019

sroet deleted the use_atom_slice branch February 14, 2019 09:36

dwhswenson mentioned this pull request Jun 3, 2019

Release 0.4.0 #55

Merged

dwhswenson mentioned this pull request Sep 3, 2020

IndexManagers for atom slicing #87

Merged

sroet mentioned this pull request May 17, 2021

Faster load function and more omogeneus file parser interface mdtraj/mdtraj#1648

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use atom slice #35

Use atom slice #35

sroet commented Apr 25, 2018 •

edited

Loading

dwhswenson commented Apr 25, 2018

sroet commented May 15, 2018

sroet commented Jun 1, 2018

sroet commented Jul 9, 2018 •

edited

Loading

sroet commented Jul 10, 2018

sroet commented Jul 10, 2018 •

edited

Loading

dwhswenson left a comment

dwhswenson commented Feb 8, 2019 •

edited

Loading

sroet commented Feb 11, 2019

Use atom slice #35

Use atom slice #35

Conversation

sroet commented Apr 25, 2018 • edited Loading

dwhswenson commented Apr 25, 2018

sroet commented May 15, 2018

sroet commented Jun 1, 2018

sroet commented Jul 9, 2018 • edited Loading

sroet commented Jul 10, 2018

sroet commented Jul 10, 2018 • edited Loading

dwhswenson left a comment

Choose a reason for hiding this comment

dwhswenson commented Feb 8, 2019 • edited Loading

sroet commented Feb 11, 2019

sroet commented Apr 25, 2018 •

edited

Loading

sroet commented Jul 9, 2018 •

edited

Loading

sroet commented Jul 10, 2018 •

edited

Loading

dwhswenson commented Feb 8, 2019 •

edited

Loading