Speed up multiprocessing when using a large amount of data #13
Conversation
I reverted the patch so that kombine goes back to accepting a standard pool instance. @bfarr @adrn We've also tested using this from PyCBC: using schwimmbad and kombine, we've been able to run over MPI on ~100 cores so far, and we'll be looking to expand this to O(1000) soon. There is a corresponding patch on the PyCBC inference side to enable this. The important part of this patch that makes the MPI work correctly is the explicit transmission of the kde updates.
Thanks for tracking down these inefficiencies! These changes seem to break the basic usage examples (e.g., this one), as well as the unit tests. I'm happy to look into this, but probably won't be able to get to it for another day or two.
@bfarr I'm still looking into it, but I'm pretty confused by the remaining Travis error. I can't reproduce it on ATLAS, but I'll rebuild my stack elsewhere and see if I can get it to show up.
@bfarr Hi Ben, I didn't want to leave you hanging, so here is the status. In short, we need a way to broadcast information to sub-processes without doing so on every single function call. If you transfer data such as what is needed to calculate the likelihood, or the internal kde, on every call, that turns out to be a huge overhead and you can't really parallelize to many cores. We had tried using the pool object to send updates of the kde and the likelihood class to the subprocesses as required; however, for multiprocessing this is not safe, as one cannot guarantee that every subprocess receives the update. That is why the unit test is failing, and why it is a bit difficult to reproduce consistently on other machines we've tried. I don't have a great solution at the moment that would keep things generic to just a pool object, but I'll give it some thought. Your thoughts are welcome as well.
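To illustrate the non-guarantee described here, a toy sketch (not kombine or PyCBC code): with a standard `multiprocessing.Pool`, submitting one "update" task per worker does not ensure that each worker runs it, because tasks are pulled from a shared queue by whichever worker is free.

```python
import os
from multiprocessing import Pool

def apply_update(version):
    # Pretend to update per-process state (e.g. a new kde); report which
    # worker process actually ran the update.
    return os.getpid()

if __name__ == "__main__":
    nprocs = 4
    with Pool(processes=nprocs) as pool:
        # Attempt to "broadcast" by submitting one task per worker.
        pids = pool.map(apply_update, [1] * nprocs)
        updated = len(set(pids))
        # Frequently fewer than nprocs distinct workers show up here, which
        # means some workers never saw the update at all.
        print(f"{updated} of {nprocs} workers ran the update")
```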
Codecov Report
@@            Coverage Diff             @@
##           master      #13      +/-   ##
==========================================
- Coverage   72.81%   72.44%   -0.37%
==========================================
  Files           7        7
  Lines         721      744      +23
==========================================
+ Hits          525      539      +14
- Misses        196      205       +9
Continue to review full report at Codecov.
@bfarr Hi Ben, I think this is now ready for your review. To resolve the problem I talked about before, we need to have an additional guarantee from the pool class to be able to run a function on every process. I've implemented that outside of this library in a pool class. The patch here will do nothing different if given a standard pool object, but if it has the capability to broadcast a function to every process, it can speed up the memory transfers by a large factor, removing a critical bottleneck. The unittests and examples should now behave as normal. |
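As a sketch of the duck-typed check this implies (the method name `broadcast` and the helper names are assumptions for illustration, not the exact API of this patch or of the external pool class):

```python
_current_kde = None  # per-process copy of the kde, set inside each worker

def _set_kde(kde):
    # Runs inside a worker process and replaces its local copy of the kde.
    global _current_kde
    _current_kde = kde

def push_kde_update(pool, kde):
    """Send a freshly updated kde to the worker processes."""
    if hasattr(pool, "broadcast"):
        # A pool that guarantees the function runs once on every worker
        # (e.g. an MPI-backed pool) can receive the update exactly once.
        pool.broadcast(_set_kde, kde)
    else:
        # A plain multiprocessing.Pool offers no such guarantee, so with a
        # standard pool the kde has to travel with the likelihood calls
        # instead, as before.
        pass
```

With a plain pool object the patch thus leaves the existing behaviour untouched, as described above.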
Currently, all data stored in the `lnpost` class, along with all the points in the kde, is pushed out to the child processes on every single iteration. This is because `_GetLnProbWrapper` is initialized after the pool is created, and re-initialized for each iteration. If `lnpost` involves a large (~90 MB) amount of data, this can substantially slow multiprocessing down on more than just a few cores: the head process has to rewrite all of the data to the memory pipe from which the children read, so every child waits until the head process is done, and each child can only read from the common pipe one at a time. This really slows things down when using a large number of cores. For example, on a 20-core machine with ~90 MB of data being passed around, I found it was actually ~7x faster to use 6 cores rather than 20.
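As a rough illustration of that overhead (class and variable names here are made up for the example, not kombine's internals): when the wrapper holding the data is passed to `pool.map` on every iteration, the entire payload is pickled and written to the workers' pipe each time.

```python
import numpy as np
from multiprocessing import Pool

class LnProbWrapper:
    """Stand-in for a wrapper that captures a large dataset."""
    def __init__(self, data):
        self.data = data  # ~90 MB of samples in the scenario above

    def __call__(self, point):
        # Toy log-probability that touches the stored data.
        return -0.5 * float(np.sum((point - self.data[:3]) ** 2))

if __name__ == "__main__":
    data = np.random.normal(size=(90 * 1024 * 1024) // 8)  # roughly 90 MB
    with Pool(processes=4) as pool:
        for _ in range(3):  # a few sampler iterations
            # Re-creating and passing the wrapper each iteration means the
            # full ~90 MB is serialized to every worker on every map call.
            wrapper = LnProbWrapper(data)
            points = np.random.normal(size=(16, 3))
            logps = pool.map(wrapper, points)
```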
This patch fixes this by creating a single instance of `_GetLnProbWrapper` before the multiprocessing pool is initialized, and making it a global variable. (Using a global variable seems like a bad idea, but it appears to be the recommended way to go about this; see http://stackoverflow.com/a/10118250, for example.) Each child then has its own copy of the wrapper in its namespace, and on each iteration only the new points to test are pushed out to the children. Since this is only ever a few tens of floats per walker, it never amounts to more than a few KB. A method is also added to `_GetLnProbWrapper` that sets the kde; whenever the kde is updated, this method is called to push the new kde out to the children. I tested this patch using the ~90 MB of data mentioned above: run time on 20 cores improved from ~100 days (an estimate; I didn't actually let it run that long) to ~1.75 days (that run did complete).

Because the pool needs to be initialized after the wrapper, a pre-initialized pool can no longer be provided to the sampler. Instead, a pool class can be passed, which is used to create the pool after the wrapper is initialized. The class is assumed to take the number of processes to use as its only initialization argument.
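Putting those pieces together, here is a minimal sketch of the resulting pattern, assuming a fork-based start method (the Linux default); the names are illustrative stand-ins rather than kombine's exact internals.

```python
import numpy as np
from multiprocessing import Pool

class GetLnProbWrapper:
    """Stand-in for the wrapper holding lnpost (and its large dataset)."""
    def __init__(self, lnpost, kde=None):
        self.lnpost = lnpost
        self.kde = kde

    def set_kde(self, kde):
        # Called inside each child whenever the parent updates the kde.
        self.kde = kde

    def __call__(self, point):
        return self.lnpost(point)

_wrapper = None  # module-level global; each child inherits a copy at fork

def _evaluate(point):
    # Only `point` (a few floats per walker) crosses the pipe per call.
    return _wrapper(point)

def _lnpost(p):
    return -0.5 * float(np.dot(p, p))

if __name__ == "__main__":
    nprocs = 4
    pool_class = Pool  # the sampler receives a pool *class*, not an instance

    # 1. Build the wrapper (with its large payload) before any pool exists.
    _wrapper = GetLnProbWrapper(_lnpost)

    # 2. Only now create the pool, so fork gives every child its own copy
    #    of _wrapper without any explicit data transfer.
    pool = pool_class(nprocs)
    try:
        points = np.random.normal(size=(16, 3))
        logps = pool.map(_evaluate, points)
        # When the kde is updated, only the new kde needs to reach the
        # children, e.g. by running wrapper.set_kde on every worker (which
        # is where a broadcast-capable pool, discussed above, comes in).
        print(len(logps), "log-probabilities computed")
    finally:
        pool.close()
        pool.join()
```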