DaskContactTrajectory #101
Conversation
Looks pretty good so far. One thing to keep in mind is that the use case for this is really going to be multiple nodes -- for same-node parallelization, MDTraj already uses OpenMM, and I think we can eventually figure out how to get numba to handle the loops in
Agreed. One thing I was thinking about is that it might also be nice to allow multiple files. In practice, this is what I see people do with very large trajectories. For example, a DESRES trajectory I've been playing with is 100 files with 1000 frames each. A list of filenames instead of a single filename would be reasonable input here.
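That suggestion could be handled with a small normalization step. A minimal sketch, assuming a hypothetical helper name (`normalize_filenames` is not part of the contact_map API): a single filename string is wrapped in a one-element list, and any other iterable of filenames is materialized as a list.

```python
def normalize_filenames(trajectories):
    """Accept either a single filename or an iterable of filenames.

    Hypothetical helper sketching the suggested input handling; the
    downstream loading code then only ever deals with a list of files.
    """
    if isinstance(trajectories, str):
        return [trajectories]  # one file: wrap it in a list
    return list(trajectories)  # many files: materialize as a list
```

With this, `normalize_filenames("run.xtc")` and `normalize_filenames(["part-%03d.dcd" % i for i in range(100)])` both yield lists that the loader can iterate over uniformly.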
I know, but the reason I was hesitant about the first implementation was that the 3x slowdown was way more than the one for
That is a bit out of scope for this PR (in my opinion), but I will make an issue to track a smarter loading implementation.
@dwhswenson This is ready for review.
(The first comment now gives an overview of the changes and additions in this PR.)
Codecov Report
@@ Coverage Diff @@
## master #101 +/- ##
==========================================
+ Coverage 99.52% 99.53% +0.01%
==========================================
Files 13 13
Lines 1043 1070 +27
==========================================
+ Hits 1038 1065 +27
Misses 5 5
Continue to review full report at Codecov.
@dwhswenson, friendly monthly ping to make sure it is still on a list somewhere :)
@dwhswenson After #106 I am not too certain what to do with the notebook that is added here (examples/dask_contact_trajectory.ipynb); do we want to wrap that into
Yeah, I think that makes sense. As that section gets longer, I'm thinking to split that notebook into
(in other words, feel free to make that split)
Will do. In the meantime, I have no clue why codecov suddenly thinks all the dask code is not hit (it claims to be covered locally...)
It may just be slow to catch up. We only run optional integrations in the mdtraj-dev build, which takes longer to install. I had that happen on a PR -- eventually it caught up.
Nope, it seems to be a similar issue to one already solved on openpathsampling (can't find that PR again, however).
2a2dc38 seems to fix that issue (and GA complaining about not knowing
Sure, please do cherry-pick it over. That's a PR I can actually promise to review tonight!
No rush. This is just a generic maintenance evening for me; it is just nice to have these PRs ready to go (again) for whenever you have some time.
…dd DaskContactTrajectory to the latter
I did split them out. One thing that I don't like in my local doc build is that the sidebar behaves erratically: if you open one of the notebooks, it permanently hides "Exporting data to other tools" until you click on "Userguide/Examples" again (not the toggle, but the actual link).
Alright, this should be mergeable again.
After
That solved the issue, thanks!
Sorry for taking forever on this one -- lgtm, will merge now!
This PR includes:
- [non-public API break] `ContactTrajectory._build_contacts()` now returns one iterator of `n_frames` with `(frame_n.atom_contacts, frame_n.residue_contacts)` for each frame, instead of a list of atom_contacts and a list of residue_contacts (this was necessary for the least amount of duplicate code between the dask implementation and `ContactTrajectory`, but can be reverted if required).
- [feature] Added `DaskContactTrajectory` (the `dask` implementation for `ContactTrajectory`) and some related convenience functions.
- [misc] Add cluster shutdown to the `DaskContactFrequency` example.

Original WIP comment below:
Things required:
- Rework how trajectories are loaded (this is really, really bad right now: it loads the whole trajectory `n+1` times for `n` slices)
- `ContactTrajectory`
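The loading problem above (the whole trajectory is loaded `n+1` times for `n` slices) can be illustrated with a pure-Python analogy: memoize the load step so that every slice task reuses one cached load. This is only a sketch of the idea behind the "pure task" option discussed in the next comment; dask deduplicates pure tasks in a similar spirit, and all names here are hypothetical.

```python
from functools import lru_cache

LOAD_CALLS = {"count": 0}  # instrumentation: counts how often we actually load


@lru_cache(maxsize=1)
def load_trajectory(filename):
    """Stand-in for an expensive full-trajectory load."""
    LOAD_CALLS["count"] += 1
    return list(range(1000))  # pretend the file holds 1000 frames


def load_slice(filename, start, stop):
    """Each slice task reuses the single cached load instead of reloading."""
    return load_trajectory(filename)[start:stop]
```

Three calls to `load_slice("traj.dcd", ...)` then trigger exactly one load instead of three (plus one), which is the behavior the "pure" declaration buys in dask.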
@dwhswenson There are two fast options for fixing the loading issue: 1) load outside of `dask` and `scatter`; 2) load the whole trajectory as a `pure` task in dask (which understands that this `load` task would always return the same data, so it does not repeat it) and do the slicing as a separate task.

nvm, we solved this issue already for `DaskContactFrequency`: it tries to load in 1 chunk per worker. Longer term we might need to think about how to avoid requiring that the whole trajectory fit into memory (like with a `skip` and a single call to `mdtraj.iterload()`).
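The "1 chunk per worker" strategy mentioned above amounts to computing one contiguous frame range per worker. A minimal sketch (hypothetical helper; the actual `DaskContactFrequency` code may slice differently), where each `(start, stop)` pair could then be read independently, e.g. with `mdtraj.iterload`'s `skip` and `chunk` arguments:

```python
def frame_slices(n_frames, n_workers):
    """Split n_frames into one contiguous (start, stop) slice per worker.

    Hypothetical sketch of the "1 chunk per worker" loading strategy:
    frames are divided as evenly as possible, with any remainder spread
    over the first few workers.
    """
    base, extra = divmod(n_frames, n_workers)
    slices, start = [], 0
    for i in range(n_workers):
        stop = start + base + (1 if i < extra else 0)
        slices.append((start, stop))
        start = stop
    return slices
```

For example, `frame_slices(100, 4)` yields four 25-frame chunks; an uneven split such as 10 frames over 3 workers gives chunks of 4, 3, and 3 frames.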