
DaskContactTrajectory #101

Merged 12 commits into dwhswenson:master on Apr 11, 2021

Conversation

@sroet (Collaborator) commented Nov 4, 2020

This PR includes:

  • [non-public API break] ContactTrajectory._build_contacts() now returns a single iterator over the n_frames frames, yielding (atom_contacts, residue_contacts) for each frame, instead of two separate lists (one of atom_contacts, one of residue_contacts). This minimizes duplicate code between the dask implementation and ContactTrajectory, but can be reverted if required (see the sketch after this list).
  • [feature] Added DaskContactTrajectory (the dask implementation of ContactTrajectory) and some related convenience functions.
  • [misc] Added cluster shutdown to the DaskContactFrequency example.
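A minimal sketch of the shape change described in the first bullet; `contact_map_for` and `FrameContacts` are hypothetical stand-ins for the real per-frame contact computation, not the actual contact_map internals:

```python
from collections import namedtuple

FrameContacts = namedtuple("FrameContacts", ["atom_contacts", "residue_contacts"])

def contact_map_for(frame):
    # Hypothetical stand-in for the real per-frame contact computation.
    return FrameContacts(atom_contacts={}, residue_contacts={})

def build_contacts_old(trajectory):
    # Old shape: two parallel lists, one per contact type.
    atoms = [contact_map_for(f).atom_contacts for f in trajectory]
    residues = [contact_map_for(f).residue_contacts for f in trajectory]
    return atoms, residues

def build_contacts_new(trajectory):
    # New shape: a single iterator of n_frames pairs, which both the serial
    # ContactTrajectory and DaskContactTrajectory can consume.
    for frame in trajectory:
        fm = contact_map_for(frame)
        yield fm.atom_contacts, fm.residue_contacts
```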

Original WIP comment below:

Things required:

  • Implement DaskContactTrajectory
  • Rework how trajectories are loaded (this is really bad right now: it loads the whole trajectory n+1 times for n slices)
  • Use the already-existing clever loading that tries to load one chunk per worker, and work on chunked trajectories
  • Use the clever _build_contacts that was already present in ContactTrajectory
  • Add tests
  • Add Documentation
  • Add example

@dwhswenson There are two quick options for fixing the loading issue:
1) Load the trajectory outside of dask and scatter it to the workers.
2) Load the whole trajectory as a pure task in dask (dask understands that a pure task always returns the same data, so it does not repeat it) and do the slicing as a separate task. Both options are sketched below.
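A hedged sketch of both options, assuming a dask.distributed Client and mdtraj; the filenames are placeholders:

```python
import dask
import mdtraj as md
from dask.distributed import Client

client = Client()

# Option 1: load once on the client, then scatter so every worker task
# receives the same in-memory trajectory.
traj = md.load("traj.xtc", top="top.pdb")
traj_future = client.scatter(traj, broadcast=True)

# Option 2: make the load a *pure* delayed task; dask deduplicates pure
# tasks with identical inputs, so the file is only read once.
@dask.delayed(pure=True)
def load_whole(filename, top):
    return md.load(filename, top=top)

@dask.delayed
def take_slice(traj, sl):
    return traj[sl]

whole = load_whole("traj.xtc", "top.pdb")
chunks = [take_slice(whole, slice(i, i + 10)) for i in range(0, 100, 10)]
```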

Never mind, we already solved this issue for DaskContactFrequency: it tries to load one chunk per worker (sketched below).
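A minimal sketch of the chunk-per-worker idea, assuming the total frame count is known up front; this is not the exact DaskContactFrequency code:

```python
from dask.distributed import Client

client = Client()
n_frames = 1000  # assumed total frame count

# Split the frame range into as many contiguous slices as there are workers.
n_workers = len(client.scheduler_info()["workers"])
per_chunk = -(-n_frames // n_workers)  # ceiling division
slices = [slice(start, min(start + per_chunk, n_frames))
          for start in range(0, n_frames, per_chunk)]
# Each slice becomes one load-and-analyze task, so every worker reads its
# own chunk of the trajectory exactly once.
```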

Longer term we might need to think about how to drop the requirement that the whole trajectory fits into memory
(for example, using the skip argument with a single call to mdtraj.iterload(); see the sketch below).
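A hedged sketch of the streaming idea; the filenames are placeholders and `process_chunk` is a hypothetical helper:

```python
import mdtraj as md

def process_chunk(chunk):
    # Placeholder for per-chunk contact analysis.
    print(chunk)

# Stream the trajectory in fixed-size chunks instead of holding it all in
# memory; each chunk is a small mdtraj.Trajectory that can be processed
# and then discarded.
for chunk in md.iterload("traj.xtc", top="top.pdb", chunk=1000):
    process_chunk(chunk)
```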

@sroet (Collaborator, Author) commented Nov 4, 2020

To show why the loading is bad:

[Screenshot (2020-11-04): evidence of the redundant loading]

It is also clearly visible in the task stream (the purple bars are 101 md.load(traj)[slice] calls):

[Screenshot (2020-11-04): task stream dominated by md.load(traj)[slice] calls]

@sroet (Collaborator, Author) commented Nov 11, 2020

It is better now:

[Screenshot (2020-11-11): improved performance]

With the following task loading:

[Screenshot (2020-11-11): task stream with the new loading pattern]

@dwhswenson (Owner) commented:

Looks pretty good so far. One thing to keep in mind is that the use case for this is really going to be multiple nodes -- for same-node parallelization, MDTraj already uses OpenMM, and I think we can eventually figure out how to get numba to handle the loops in ContactObject._contact_map. So per-frame, we'll have good parallelism in shared memory. But the loop over frames can be spread over nodes.
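A hedged illustration of the kind of shared-memory speedup being considered here; this is not the actual ContactObject._contact_map code, just a minimal numba-compiled pairwise-distance loop:

```python
import numba
import numpy as np

@numba.njit
def count_contacts(xyz, cutoff):
    # Count atom pairs closer than `cutoff`; nested loops like these are
    # exactly what numba compiles to fast machine code.
    n = xyz.shape[0]
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            d2 = 0.0
            for k in range(3):
                diff = xyz[i, k] - xyz[j, k]
                d2 += diff * diff
            if d2 < cutoff * cutoff:
                count += 1
    return count

# Example: one frame of 1000 atoms with a 0.45 nm cutoff.
frame = np.random.rand(1000, 3)
print(count_contacts(frame, 0.45))
```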

> Longer term we might need to think about how to drop the requirement that the whole trajectory fits into memory (for example, using the skip argument with a single call to mdtraj.iterload()).

Agreed. One thing I was thinking about is that it also might be nice to allow multiple files. In practice, this is what I see people do with very large trajectories. For example, a DESRES trajectory I've been playing with is 100 files with 1000 frames each. A list of filenames instead of just a single filename would be reasonable input here.
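A minimal sketch of the multi-file input idea; mdtraj's load accepts a list of filenames, and the DESRES-style file names and topology here are placeholders:

```python
import mdtraj as md

# A trajectory split across many files, e.g. 100 files of 1000 frames each;
# md.load accepts a list of filenames and concatenates the pieces in order.
filenames = [f"traj_part{i:03d}.dcd" for i in range(100)]
traj = md.load(filenames, top="system.pdb")
```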

@sroet (Collaborator, Author) commented Nov 12, 2020

> Looks pretty good so far. One thing to keep in mind is that the use case for this is really going to be multiple nodes -- for same-node parallelization, MDTraj already uses OpenMM, and I think we can eventually figure out how to get numba to handle the loops in ContactObject._contact_map. So per-frame, we'll have good parallelism in shared memory. But the loop over frames can be spread over nodes.

I know, but the reason I was hesitant about the first implementation was that its 3x slowdown was much worse than the one for DaskContactFrequency. I am pretty happy with the current implementation, and it should not change too much before the WIP label disappears.

> Agreed. One thing I was thinking about is that it also might be nice to allow multiple files. In practice, this is what I see people do with very large trajectories. For example, a DESRES trajectory I've been playing with is 100 files with 1000 frames each. A list of filenames instead of just a single filename would be reasonable input here.

That is a bit out of scope for this PR (in my opinion), but I will make an issue to track a smarter loading implementation.

@sroet changed the title from "WIP: DaskContactTrajectory" to "DaskContactTrajectory" on Nov 20, 2020
@sroet (Collaborator, Author) commented Nov 20, 2020

@dwhswenson This is ready for review.

@sroet (Collaborator, Author) commented Nov 20, 2020

(The first comment now gives an overview of changes and additions in this PR)

@codecov (bot) commented Dec 21, 2020

Codecov Report

Merging #101 (9d56908) into master (65cf1de) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
+ Coverage   99.52%   99.53%   +0.01%     
==========================================
  Files          13       13              
  Lines        1043     1070      +27     
==========================================
+ Hits         1038     1065      +27     
  Misses          5        5              
Impacted Files Coverage Δ
contact_map/__init__.py 100.00% <100.00%> (ø)
contact_map/contact_trajectory.py 100.00% <100.00%> (ø)
contact_map/dask_runner.py 100.00% <100.00%> (ø)
contact_map/frequency_task.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sroet (Collaborator, Author) commented Jan 21, 2021

@dwhswenson , friendly monthly ping to make sure it is still on a list somewhere :)

@sroet (Collaborator, Author) commented Mar 11, 2021

@dwhswenson After #106, I am not certain what to do with the notebook added here (examples/dask_contact_trajectory.ipynb). Do we want to fold it into integrations.ipynb (which already seems to reference functionality from this PR)?

@dwhswenson (Owner) commented:

Yeah, I think that makes sense. As that section gets longer, I'm thinking of splitting that notebook into exporting_data.ipynb and performance.ipynb -- performance might also eventually include numba integration, which may get a force-off switch like atom_slice.

@dwhswenson (Owner) commented:

(in other words, feel free to make that split)

@sroet (Collaborator, Author) commented Mar 11, 2021

> (in other words, feel free to make that split)

Will do. In the meantime, I have no clue why codecov suddenly thinks none of the dask code is hit (it claims to be covered locally...).

@dwhswenson (Owner) commented:

> I have no clue why codecov suddenly thinks none of the dask code is hit

It may just be slow to catch it. We only run the optional integrations in the mdtraj-dev build, which takes longer to install. I had that happen on a PR -- eventually it caught up.

@sroet (Collaborator, Author) commented Mar 11, 2021

> It may just be slow to catch it. We only run the optional integrations in the mdtraj-dev build, which takes longer to install. I had that happen on a PR -- eventually it caught up.

Nope, it seems to be a similar issue to one already solved on openpathsampling (I can't find that PR anymore, however):

> Issue detecting commit SHA. Please run actions/checkout with fetch-depth > 1 or set to 0,

2a2dc38 seems to fix that issue (and the GitHub Actions complaint about not knowing auto-update-python). I can cherry-pick it over to its own PR, if you want.

@dwhswenson (Owner) commented:

Sure, please do cherry-pick it over. That's a PR I can actually promise to review tonight!

@sroet mentioned this pull request on Mar 11, 2021
@sroet (Collaborator, Author) commented Mar 11, 2021

> Sure, please do cherry-pick it over. That's a PR I can actually promise to review tonight!

No rush. This is just a generic maintenance evening for me; it is nice to have these PRs ready to go (again) for whenever you have some time.

@sroet (Collaborator, Author) commented Mar 11, 2021

> Yeah, I think that makes sense. As that section gets longer, I'm thinking of splitting that notebook into exporting_data.ipynb and performance.ipynb -- performance might also eventually include numba integration, which may get a force-off switch like atom_slice.

I did split them out, added DaskContactTrajectory, and updated the doc building.

One thing I don't like in my local doc build is that the sidebar behaves erratically: if you open one of the notebooks, it permanently hides "Exporting data to other tools" until you click "Userguide/Examples" again (the actual link, not the toggle).

@sroet (Collaborator, Author) commented Mar 11, 2021

Alright, this should be mergeable again.

@dwhswenson (Owner) commented:

> One thing I don't like in my local doc build is that the sidebar behaves erratically

After make clean && make html? The default Makefile is pretty conservative about what it rebuilds (to keep build times fast), which regularly messes up the sidebar.

@sroet (Collaborator, Author) commented Mar 12, 2021

> After make clean && make html? The default Makefile is pretty conservative about what it rebuilds (to keep build times fast), which regularly messes up the sidebar.

That solved the issue, thanks!

@dwhswenson (Owner) left a comment:

Sorry for taking forever on this one -- lgtm, will merge now!

@dwhswenson merged commit eb0c208 into dwhswenson:master on Apr 11, 2021
@sroet deleted the daskcontacttrajectory branch on April 12, 2021