Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use atom slice #35

Merged
merged 22 commits into from
Feb 13, 2019
Merged

Use atom slice #35

merged 22 commits into from
Feb 13, 2019

Conversation

sroet
Copy link
Collaborator

@sroet sroet commented Apr 25, 2018

This uses atom_slice to cut down a trajectory before computing the neighborlist. This should speed up contactmap/frequency calculations for real systems
The code is not yet completely implemented, but is only tested at one point, and not yet benchmarked for now.

TODO:

  • write documentation
  • write proper tests
  • update example
  • do benchmarking

EDIT: found the first bug already (only converting the list/ not the numpy arrays)

@dwhswenson
Copy link
Owner

The Travis error is a py-3.4 error that has existed for a few weeks now. It looks like something goes wrong in the pandas install. That's not on you (exists independently of this PR), but you're more than welcome to investigate! I haven't had time.

On the CodeClimate stuff, I'm somewhat ambivalent. Things that better fit the current CodeClimate rules (relaxed from the defaults) are better, but I'm willing to make exceptions. Maybe try making use_atom_slice a class variable that defaults to True? That way we can turn it off for more thorough testing, but you don't increase argument count. That also removes cognitive complexity from ContactObject.__init__. I'll take a closer look at the complaints about ContactObject.contact_map and offer suggestions later (possibly offline). Obviously, you intend to fix your TODOs, so I'm not worried about those.

Finally, don't worry about serializing use_atom_slice. I take the view that things that only affect the performance, not the actual output, don't need to be serialized. We expect that this should be that sort of situation.

@sroet
Copy link
Collaborator Author

sroet commented May 15, 2018

Alright, I benchmarked everything in contact_map.ipynb. The atom slicing adds some overhead but is faster for most situations, on my system (default slices 2722 atoms to 1408):

  • The single atom_slice per ContactMap or ContactFrequency: adds about 0.4 s. This scales with the number of atoms left after the slicing. It however does not scale with the length of the trajectory (as long as both the original and sliced trajectory fit in memory)

  • The computation time of the neighborlist reduces by the relative amount of atoms sliced, from 0.02 s to 0.01 s per frame for the defaults

  • The conversion of the neighborlist to the original indices adds about 0.01 s per frame for the default settings, this also time reduces if less atoms are left after slicing

Even for this trajectory without waters using atom_slice performs better for almost everything in the example, except for the ContactMap (110 ms to 516 ms because of the single 0.4s overhead) and the ContactFrequency with only a different query and no different haystack (2.87s to 3.18s). On the other hand for the example where both the query and haystack are defined (slicing the system down to 90 atoms) it is much faster (2.71s to 348ms).

Seeing this I would argue that use_atom_slice=True should be the default for both ContactMap and ContactFrequency. What is your opinion on this @dwhswenson ? And could you see if these benchmarks hold up you?

@sroet
Copy link
Collaborator Author

sroet commented Jun 1, 2018

@dwhswenson This is ready for some benchmarking. I don't expect the working of the code to change, I only need to write extra documentation/tests and update the example.
Using the script on the bottom on different slices of a carbon (CH4 in united atoms) water (TIP4P) system:
On a node with 32 logical cores:

length traj: 2411 frames
Carbons: 511
Oxygens: 2944
Combined selections: 3455
All atoms: 12287

 3 frames
carbon query and haystack
with atom_slice: 0.20131815969944
no atom_slice: 0.28427401185035706
carbon query and oxygen haystack
with atom_slice: 0.2735431417822838
no atom_slice: 0.3182491138577461

 25 frames
carbon query and haystack
with atom_slice: 0.47005893290042877
no atom_slice: 1.526077225804329
carbon query and oxygen haystack
with atom_slice: 1.1776891574263573
no atom_slice: 1.7423365116119385

 242 frames
carbon query and haystack
with atom_slice: 4.201747737824917
no atom_slice: 14.126300185918808
carbon query and oxygen haystack
with atom_slice: 14.567059725522995
no atom_slice: 21.696870408952236

 2411 frames
carbon query and haystack
with atom_slice: 55.376811660826206
no atom_slice: 159.75818028301
carbon query and oxygen haystack
with atom_slice: 337.8058380037546
no atom_slice: 406.616727553308

And on a 4 core desktop

length traj: 2411 frames
Carbons: 511
Oxygens: 2944
Combined selections: 3455
All atoms: 12287

 3 frames
carbon query and haystack
with atom_slice: 0.26453333103563637
no atom_slice: 0.45479931705631316
carbon query and oxygen haystack
with atom_slice: 0.38118267501704395
no atom_slice: 0.5180625190259889

 25 frames
carbon query and haystack
with atom_slice: 0.57133743702434
no atom_slice: 2.5859786020591855
carbon query and oxygen haystack
with atom_slice: 1.5568417840404436
no atom_slice: 3.0209178120130673

 242 frames
carbon query and haystack
with atom_slice: 4.820568470982835
no atom_slice: 24.34508377406746
carbon query and oxygen haystack
with atom_slice: 17.302444033091888
no atom_slice: 31.778738516033627

 2411 frames
carbon query and haystack
with atom_slice: 60.8008491731016
no atom_slice: 257.47405256098136
carbon query and oxygen haystack
with atom_slice: 343.60291336802766
no atom_slice: 493.1326912479708

This was done using the following script:

from contact_map import ContactFrequency                                        
import mdtraj as md                                                             
import timeit                                                                   
                                                                                
traj = md.load('54.xtc', top='54.gro')                                          
print("length traj: {} frames".format(str(len(traj))))                          
                                                                                
top = traj.topology                                                             
c = top.select('element C')                                                     
o = top.select('element O')                                                     
                                                                                
                                                                                
print("Carbons: {}".format(str(len(c))))                                        
print("Oxygens: {}".format(str(len(o))))                                        
print("Combined selections: {}".format(str(len(c)+len(o))))                     
print("All atoms: {}".format(str(top.n_atoms)))                                 
                                                                                
class Test(object):                                                             
    from contact_map import ContactFrequency                                    
    import mdtraj as md                                                         
    def __init__(self, skip=1):                                                 
        self.traj = md.load('54.xtc', top = '54.gro')[::skip]                   
        self.top = traj.topology                                                
        self.c = self.top.select('element C')                                   
        self.o = self.top.select('element O')                                   
                                                                                
    def use_atom_slice_cs(self):                                                
        ContactFrequency._class_use_atom_slice = True                           
        _ = ContactFrequency(self.traj, query=self.c, haystack=self.c)          
                                                                                
    def use_atom_slice_os(self):                                                
        ContactFrequency._class_use_atom_slice = True                           
        _ = ContactFrequency(self.traj, query=self.c, haystack=self.o)          
                                                                                
                                                                                
    def use_no_atom_slice_cs(self):                                             
        ContactFrequency._class_use_atom_slice = False                          
        _ = ContactFrequency(self.traj, query=self.c, haystack=self.c)          
                                                                                
    def use_no_atom_slice_os(self):                                             
        ContactFrequency._class_use_atom_slice = False                          
        _ = ContactFrequency(self.traj, query=self.c, haystack=self.o)          
                                                                                
                                                                                
for skip in [1000,100,10,1]:                                                    
    test = Test(skip)                                                           
    frames = len(test.traj)                                                     
    print("\n {} frames".format(str(frames)))                                   
    print("carbon query and haystack")                                          
    print("with atom_slice: {}".format(timeit.Timer(test.use_atom_slice_cs).timeit(number=1)))
    print("no atom_slice: {}".format(timeit.Timer(test.use_no_atom_slice_cs).timeit(number=1)))
    print("carbon query and oxygen haystack")                                   
    print("with atom_slice: {}".format(timeit.Timer(test.use_atom_slice_os).timeit(number=1)))
    print("no atom_slice: {}".format(timeit.Timer(test.use_no_atom_slice_os).timeit(number=1)))                                                                      

I do not have a straightforward way of providing you with the .xtc and .gro as they are 100 MB (when tarred) and the login of the figshare seems to be down. Could you try it on one of your systems to see how it holds up?

@sroet
Copy link
Collaborator Author

sroet commented Jul 9, 2018

Alright, I am pretty confident it behaves properly now. every single contactmap/frequency in the example notebook (that has all the waters already cut out) is at least as fast or a lot faster (when query and haystack are both defined). This should be the worst real world case.
The absolute worst case scenario for this code would be a dense contactmap/contactfrequency (a lot of int to int mappings) for a topology with a single atom not included in the query/haystack (not triggering the safeguard to not do the mapping if nothing is cut). But I could not think of a real world example for this.

You could also check this performance by checking out this branch and by running the example notebook. Too deactivate the use of atom_slice add a cell after the import cell with:

ContactMap._class_use_atom_slice = False
ContactFrequency._class_use_atom_slice = False
ContactDifference._class_use_atom_slice = False

@sroet sroet changed the title WIP:Use atom slice Use atom slice Jul 10, 2018
@sroet
Copy link
Collaborator Author

sroet commented Jul 10, 2018

@dwhswenson this is ready for a review and a merge

@sroet
Copy link
Collaborator Author

sroet commented Jul 10, 2018

coverage falls due to an error on the last line python-codacy-coverage -r cov.xml:
2018-07-10 14:44:41,194 - ERROR - environment variable CODACY_PROJECT_TOKEN is not defined.
(on all builds them, which is interesting )

EDIT: And it bumped up again to 100%, strange...

Copy link
Owner

@dwhswenson dwhswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good. One other change, aside from the comments above: In the main example, you didn't run the stuff under the "Use a different cutoff". I know they take a while, but could you re-run that before this goes into master?

Overall, it seems to me that what we really want here is to create two different IndexManager classes; one that works without atom slicing and one that works with atom slicing. This would move all the code you now have for that out of the ContactObject class, essentially abstracting the stuff we have now. This would also get rid of a lot of if/else blocks you have.

But we'll merge it in like this now, and try that as part of a later update.

contact_map/contact_map.py Show resolved Hide resolved
contact_map/contact_map.py Outdated Show resolved Hide resolved
contact_map/contact_map.py Outdated Show resolved Hide resolved
contact_map/contact_map.py Show resolved Hide resolved
contact_map/contact_map.py Outdated Show resolved Hide resolved
contact_map/contact_map.py Outdated Show resolved Hide resolved
contact_map/contact_map.py Outdated Show resolved Hide resolved
@dwhswenson
Copy link
Owner

dwhswenson commented Feb 8, 2019

@sroet : Is there anything left to do on this? I had been misreading GitHub's status on my review comments (not noticing that they were outdated) and didn't realize you'd already dealt with pretty much everything.

The only (tiny) thing left is that my style for the alignment of multiline statement where the linebreak is immediately after the opening punctuation to be this:

dict_comprehension = {
    key: function_that_creates_value(key)
    for key in some_long_name_here
}

(similar to "one true brace" in C). You're still one indent level too far in. (I can understand why my comment wasn't clear.) Obviously not a big deal, but either you update it or I will. You might as well get credit for the commit (and keep the git-blame for the line!)

Go ahead and update that in a (really fast) commit, and let me know if there's anything else you're cleaning up in this before I merge.

Also, note that your GitHub email wasn't set (probably in gitconfig on whatever computer you were working on.) Please make sure that's set to take credit for future contributions!

@sroet
Copy link
Collaborator Author

sroet commented Feb 11, 2019

@dwhswenson Did the last comment, and fixed the email.

@dwhswenson dwhswenson merged commit 9ed6977 into dwhswenson:master Feb 13, 2019
@sroet sroet deleted the use_atom_slice branch February 14, 2019 09:36
@dwhswenson dwhswenson mentioned this pull request Jun 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants