
Speed up multiprocessing when using a large amount of data #13

Merged · 2 commits · Apr 3, 2020

Conversation

cdcapano
Contributor

Currently, all data stored in the lnpost class, along with all the points in the kde, is pushed out to the child processes on every single iteration. This is because _GetLnProbWrapper is initialized after the pool is created, and re-initialized for each iteration. If lnpost involves a large (~90MB) amount of data, this can substantially slow multiprocessing down on more than just a few cores. The reason is that the head process has to rewrite all of the data to the memory pipe from which all the children read, so all of the children have to wait until the head process is done. Furthermore, each child can only read from the common pipe one at a time. This really slows things down when using a large number of cores. For example, on a 20-core machine with ~90MB of data being passed around, I found it was actually ~7x faster to use 6 cores rather than 20.
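The pre-patch behavior can be sketched roughly as follows. This is a minimal illustration, not kombine's actual code; `SlowWrapper` and its contents are hypothetical stand-ins for `_GetLnProbWrapper` and the lnpost data:

```python
import multiprocessing

class SlowWrapper:
    """Hypothetical stand-in for _GetLnProbWrapper: bundles the full dataset."""
    def __init__(self, data, kde):
        self.data = data  # ~90 MB in the case reported above
        self.kde = kde

    def __call__(self, p):
        # placeholder for the real lnpost evaluation
        return sum(self.data) * p

if __name__ == '__main__':
    data = list(range(1000))
    with multiprocessing.Pool(2) as pool:
        for kde in (None, None):              # each sampler iteration...
            wrapper = SlowWrapper(data, kde)  # ...rebuilds the wrapper,
            # so the wrapper (data and all) is pickled into every task:
            out = pool.map(wrapper, [1.0, 2.0])
```

Because `pool.map` serializes its callable into each task, every iteration re-sends the dataset through the single pipe all children share, which is the bottleneck described above.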

This patch fixes this by creating a single instance of _GetLnProbWrapper before the multiprocessing pool is initialized, and making it a global variable. (The use of a global variable seems like a bad idea, but appears to be the recommended way to go about it; see http://stackoverflow.com/a/10118250, for example.) This causes each child to have its own copy of the wrapper in its namespace. On each iteration, only the new points to test are pushed out to the children. Since this is only ever a few tens of floats per walker, it never amounts to more than a few KB. A method is also added to _GetLnProbWrapper that sets the kde; whenever the kde is updated, this method is called to push the new kde out to the children. I tested this patch using the above ~90MB of data. Run time on 20 cores improved from ~100 days (an estimate; I didn't actually let it run that long) to ~1.75 days (that run did complete).

Because the pool needs to be initialized after the wrapper, a pre-initialized pool can no longer be provided to the sampler. Instead, a pool class can be passed, which is used to create the pool after the wrapper is initialized. The class is assumed to be something that takes the number of processes to use as the only initialization variable.

@ahnitz
Contributor

ahnitz commented Dec 5, 2016

I reverted the patch so that kombine goes back to accepting a standard pool instance.

@bfarr @adrn We've also tested using this from PyCBC. Using schwimmbad and Kombine, we've been able to run over MPI on ~100 cores so far, and we'll be looking to expand this to O(1000) soon. There is a corresponding patch on the PyCBC inference side to enable this. The important part of this patch that makes the MPI work correctly is the explicit transmission of the kde updates.

gwastro/pycbc#1308

@bfarr
Owner

bfarr commented Dec 6, 2016

Thanks for tracking down these inefficiencies!

These changes seem to break the basic usage examples (e.g., this one), as well as the unit tests.

I'm happy to look into this, but probably won't be able to get to it for another day or two.

@ahnitz
Contributor

ahnitz commented Dec 6, 2016

@bfarr I'm still looking into it, but I'm pretty confused by the remaining travis error. I can't reproduce it on ATLAS, but I'll rebuild my stack elsewhere and see if I can get it to show up.

@ahnitz
Contributor

ahnitz commented Jan 20, 2017

@bfarr Hi Ben, I didn't want to leave you hanging, so here is the status. In short, we need a way to broadcast information to subprocesses without doing so on every single function call. If you transfer data such as what is needed to calculate the likelihood, or the internal kde, on every call, that turns out to be a huge overhead and you can't really parallelize to many cores. We had tried using the pool object to send updates of the kde and the likelihood class to the subprocesses as required; however, for multiprocessing this is not safe, as one cannot guarantee that every subprocess receives the update. That is why the unit test is failing, and why it is difficult to reproduce consistently on the other machines we've tried. I don't have a great solution at the moment that would keep things generic to just a pool object, but I'll give it some thought. Your thoughts are welcome as well.

@codecov-io

codecov-io commented Feb 14, 2017

Codecov Report

Merging #13 into master will decrease coverage by 0.36%.
The diff coverage is 65%.


@@            Coverage Diff             @@
##           master      #13      +/-   ##
==========================================
- Coverage   72.81%   72.44%   -0.37%     
==========================================
  Files           7        7              
  Lines         721      744      +23     
==========================================
+ Hits          525      539      +14     
- Misses        196      205       +9
Impacted Files Coverage Δ
kombine/sampler.py 72.72% <65%> (-0.83%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a1bb7c2...c670afe.

@ahnitz
Contributor

ahnitz commented Feb 14, 2017

@bfarr Hi Ben, I think this is now ready for your review. To resolve the problem I described before, we need an additional guarantee from the pool class: the ability to run a function on every process. I've implemented that outside of this library in a pool class. The patch here does nothing different if given a standard pool object, but if the pool can broadcast a function to every process, it speeds up the memory transfers by a large factor, removing a critical bottleneck. The unit tests and examples should now behave as normal.

@ahnitz
Contributor

ahnitz commented Apr 5, 2018

@bfarr @cdcapano Any update on what we should do with this PR?

@ahnitz
Contributor

ahnitz commented Jun 21, 2018

@cdcapano @bfarr Poke

@ahnitz
Contributor

ahnitz commented Aug 9, 2018

@cdcapano @bfarr Poke

bruce-edelman added a commit to bruce-edelman/kombine that referenced this pull request Mar 31, 2020
@bruce-edelman bruce-edelman changed the base branch from master to devel April 3, 2020 21:59
@bruce-edelman bruce-edelman merged commit 294de17 into bfarr:devel Apr 3, 2020