Speed up multiprocessing when using a large amount of data #13
Conversation
I reverted the patch so that kombine goes back to accepting a standard pool instance. @bfarr @adrn We've also tested using this from PyCBC: using schwimmbad and kombine, we've been able to run over MPI on ~100 cores so far, and we'll be looking to expand this to O(1000) soon. There is a corresponding patch on the PyCBC inference side to enable this. The important part of this patch that makes the MPI work correctly is the explicit transmission of the kde updates.
Thanks for tracking down these inefficiencies! These changes seem to break the basic usage examples (e.g., this one), as well as the unit tests. I'm happy to look into this, but probably won't be able to get to it for another day or two.
@bfarr I'm still looking into it, but I'm pretty confused by the remaining Travis error. I can't reproduce it on ATLAS, but I'll rebuild my stack elsewhere and see if I can get it to show up.
@bfarr Hi Ben, I didn't want to leave you hanging, so here is the status. In short, we need a way to broadcast information to sub-processes without doing so on every single function call. If you transfer data such as what is needed to calculate the likelihood, or the internal kde, on every call, that turns out to be a huge overhead and you can't really parallelize to many cores. We had tried using the pool object to send updates of the kde and the likelihood class to the subprocesses as required; however, for multiprocessing this is not safe, as one cannot guarantee that every subprocess receives the update. That is why the unit test is failing, and why it is a bit difficult to reproduce consistently on other machines we've tried. I don't have a great solution at the moment that would keep things generic to just a pool object, but I'll give it some thought. Your thoughts are welcome as well.
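To illustrate the non-guarantee described here, a toy sketch (not kombine or PyCBC code): with a standard `multiprocessing.Pool`, submitting one "update" task per worker does not ensure that each worker runs it, because tasks are pulled from a shared queue by whichever worker is free.

```python
import os
from multiprocessing import Pool

def apply_update(version):
    # Pretend to update per-process state (e.g. a new kde); report which
    # worker process actually ran the update.
    return os.getpid()

if __name__ == "__main__":
    nprocs = 4
    with Pool(processes=nprocs) as pool:
        # Attempt to "broadcast" by submitting one task per worker.
        pids = pool.map(apply_update, [1] * nprocs)
        updated = len(set(pids))
        # Frequently fewer than nprocs distinct workers show up here, which
        # means some workers never saw the update at all.
        print(f"{updated} of {nprocs} workers ran the update")
```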
Codecov Report
@@            Coverage Diff             @@
##           master      #13      +/-   ##
==========================================
- Coverage   72.81%   72.44%   -0.37%
==========================================
  Files           7        7
  Lines         721      744      +23
==========================================
+ Hits          525      539      +14
- Misses        196      205       +9
Continue to review full report at Codecov.
@bfarr Hi Ben, I think this is now ready for your review. To resolve the problem I talked about before, we need to have an additional guarantee from the pool class to be able to run a function on every process. I've implemented that outside of this library in a pool class. The patch here will do nothing different if given a standard pool object, but if it has the capability to broadcast a function to every process, it can speed up the memory transfers by a large factor, removing a critical bottleneck. The unittests and examples should now behave as normal. |
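As a sketch of the duck-typed check this implies (the method name `broadcast` and the helper names are assumptions for illustration, not the exact API of this patch or of the external pool class):

```python
_current_kde = None  # per-process copy of the kde, set inside each worker

def _set_kde(kde):
    # Runs inside a worker process and replaces its local copy of the kde.
    global _current_kde
    _current_kde = kde

def push_kde_update(pool, kde):
    """Send a freshly updated kde to the worker processes."""
    if hasattr(pool, "broadcast"):
        # A pool that guarantees the function runs once on every worker
        # (e.g. an MPI-backed pool) can receive the update exactly once.
        pool.broadcast(_set_kde, kde)
    else:
        # A plain multiprocessing.Pool offers no such guarantee, so with a
        # standard pool the kde has to travel with the likelihood calls
        # instead, as before.
        pass
```

With a plain pool object the patch thus leaves the existing behaviour untouched, as described above.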
Currently, all data stored in the `lnpost` class, along with all the points in the kde, is pushed out to the child processes on every single iteration. This is because `_GetLnProbWrapper` is initialized after the pool is created, and re-initialized for each iteration. If `lnpost` involves a large (~90 MB) amount of data, this can substantially slow multiprocessing down on more than just a few cores: the head process has to rewrite all of the data to the memory pipe from which the children read, so every child waits until the head process is done, and each child can only read from the common pipe one at a time. This really slows things down when using a large number of cores. For example, on a 20-core machine with ~90 MB of data being passed around, I found it was actually ~7x faster to use 6 cores rather than 20.
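As a rough illustration of that overhead (class and variable names here are made up for the example, not kombine's internals): when the wrapper holding the data is passed to `pool.map` on every iteration, the entire payload is pickled and written to the workers' pipe each time.

```python
import numpy as np
from multiprocessing import Pool

class LnProbWrapper:
    """Stand-in for a wrapper that captures a large dataset."""
    def __init__(self, data):
        self.data = data  # ~90 MB of samples in the scenario above

    def __call__(self, point):
        # Toy log-probability that touches the stored data.
        return -0.5 * float(np.sum((point - self.data[:3]) ** 2))

if __name__ == "__main__":
    data = np.random.normal(size=(90 * 1024 * 1024) // 8)  # roughly 90 MB
    with Pool(processes=4) as pool:
        for _ in range(3):  # a few sampler iterations
            # Re-creating and passing the wrapper each iteration means the
            # full ~90 MB is serialized to every worker on every map call.
            wrapper = LnProbWrapper(data)
            points = np.random.normal(size=(16, 3))
            logps = pool.map(wrapper, points)
```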
This patch fixes this by creating a single instance of `_GetLnProbWrapper` before the multiprocessing pool is initialized, and making it a global variable. (Using a global variable seems like a bad idea, but it appears to be the recommended way to go about this; see http://stackoverflow.com/a/10118250, for example.) Each child then has its own copy of the wrapper in its namespace, and on each iteration only the new points to test are pushed out to the children. Since this is only ever a few tens of floats per walker, it never amounts to more than a few KB. A method is also added to `_GetLnProbWrapper` that sets the kde; whenever the kde is updated, this method is called to push the new kde out to the children. I tested this patch using the ~90 MB of data mentioned above: run time on 20 cores improved from ~100 days (an estimate; I didn't actually let it run that long) to ~1.75 days (that run did complete).

Because the pool needs to be initialized after the wrapper, a pre-initialized pool can no longer be provided to the sampler. Instead, a pool class can be passed, which is used to create the pool after the wrapper is initialized. The class is assumed to take the number of processes to use as its only initialization argument.
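Putting those pieces together, here is a minimal sketch of the resulting pattern, assuming a fork-based start method (the Linux default); the names are illustrative stand-ins rather than kombine's exact internals.

```python
import numpy as np
from multiprocessing import Pool

class GetLnProbWrapper:
    """Stand-in for the wrapper holding lnpost (and its large dataset)."""
    def __init__(self, lnpost, kde=None):
        self.lnpost = lnpost
        self.kde = kde

    def set_kde(self, kde):
        # Called inside each child whenever the parent updates the kde.
        self.kde = kde

    def __call__(self, point):
        return self.lnpost(point)

_wrapper = None  # module-level global; each child inherits a copy at fork

def _evaluate(point):
    # Only `point` (a few floats per walker) crosses the pipe per call.
    return _wrapper(point)

def _lnpost(p):
    return -0.5 * float(np.dot(p, p))

if __name__ == "__main__":
    nprocs = 4
    pool_class = Pool  # the sampler receives a pool *class*, not an instance

    # 1. Build the wrapper (with its large payload) before any pool exists.
    _wrapper = GetLnProbWrapper(_lnpost)

    # 2. Only now create the pool, so fork gives every child its own copy
    #    of _wrapper without any explicit data transfer.
    pool = pool_class(nprocs)
    try:
        points = np.random.normal(size=(16, 3))
        logps = pool.map(_evaluate, points)
        # When the kde is updated, only the new kde needs to reach the
        # children, e.g. by running wrapper.set_kde on every worker (which
        # is where a broadcast-capable pool, discussed above, comes in).
        print(len(logps), "log-probabilities computed")
    finally:
        pool.close()
        pool.join()
```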