
Multiprocessing the Python module 'emcee', but not all CPU cores on the machine are being used #405

Open
Maryamtajalli opened this issue Sep 19, 2021 · 6 comments

Maryamtajalli commented Sep 19, 2021

Hello,

I am parallelizing emcee with the multiprocessing module, as described in the emcee documentation. However, htop shows that the program only ever uses a limited number of cores (26–27). The machine I am running on has 80 cores and I would like to use all of them to speed up the code. Could you please help me with this? I did not find a solution in the documentation. This only happens with emcee; when I use the multiprocessing module in other programs, htop shows all 80 cores in use.

This is my code:

```python
import sys

import emcee
import h5py
import numpy as np
from multiprocessing import Pool
from scipy.stats import truncnorm

# nwalkers, npars, nstep, pars, bounds, and logpfunc are defined earlier

with Pool() as pool:
    sampler = emcee.EnsembleSampler(nwalkers, npars, logpfunc, pool=pool)

    start = []
    if len(sys.argv) > 1:
        # Resume: initialize walkers from the last sample in an HDF5 file
        print('using last spread of %s to initialize walkers' % sys.argv[1])
        startfile = h5py.File(sys.argv[1], 'r')
        for i in range(nwalkers):
            tmp = np.zeros(npars)
            for nn in range(npars):
                tmp[nn] = startfile[pars[nn]['name']][i, -1]
            start.append(tmp)
        startfile.close()
    else:
        # Fresh start: draw each parameter from a truncated normal
        # centered on the guess, limited to its bounds
        for i in range(nwalkers):
            tmp = np.zeros(npars)
            for j in range(npars):
                a = (bounds[j][0] - pars[j]['guess']) / pars[j]['spread']
                b = (bounds[j][1] - pars[j]['guess']) / pars[j]['spread']
                tmp[j] = truncnorm.rvs(a, b) * pars[j]['spread'] + pars[j]['guess']
            start.append(tmp)

    print('Sampling')
    sampler.run_mcmc(start, nstep, progress=True)
```
dfm (Owner) commented Sep 20, 2021

emcee will use a number of cores equal to half the number of walkers. So in your case, you might want to use 160 walkers to take advantage of the number of cpus you have. Sometimes it can be worth also parallelizing your model evaluation so that it executes in parallel too, and then use a smaller number of walkers.
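To make the arithmetic behind this concrete, a minimal sketch (numbers hypothetical, assuming `os.cpu_count()` reflects the cores you want to saturate): the ensemble is advanced in two halves, so at most `nwalkers // 2` log-probability evaluations run concurrently, and `nwalkers = 2 * ncpu` is the smallest count that can keep every core busy.

```python
import os

# emcee updates the ensemble in two halves, so at most nwalkers // 2
# log-probability calls run at once. On an 80-core machine, 160 walkers
# lets the pool occupy every core.
ncpu = os.cpu_count()
nwalkers = 2 * ncpu

# Number of walkers advanced concurrently in each half-step:
concurrent = nwalkers // 2
assert concurrent == ncpu
```

`nwalkers` computed this way would then be passed to `emcee.EnsembleSampler` in place of a fixed value.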

Maryamtajalli (Author) replied:

> emcee will use a number of cores equal to half the number of walkers. So in your case, you might want to use 160 walkers to take advantage of the number of cpus you have. Sometimes it can be worth also parallelizing your model evaluation so that it executes in parallel too, and then use a smaller number of walkers.

I see, thank you very much. I once tried to parallelize my model evaluation, since there is an unavoidable for loop in my log-posterior function which is the main reason my posterior is computationally expensive, but I ran into the error:

`daemonic processes are not allowed to have children`
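That error arises because `multiprocessing.Pool` starts its workers as daemon processes, and Python forbids daemonic processes from spawning children of their own, so a nested `Pool` inside the log-probability function fails. A small sketch (illustrative, not from the thread) confirming that pool workers really are daemonic:

```python
import multiprocessing as mp

def is_daemon(_):
    # Creating another Pool from inside this worker would raise
    # "daemonic processes are not allowed to have children".
    return mp.current_process().daemon

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        print(pool.map(is_daemon, range(2)))  # [True, True]
```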

dfm (Owner) commented Sep 20, 2021

Yeah, I think it's right that you can't directly use multiprocessing for the model parallelization. I normally parallelize at a lower level (using numpy or hand-written C code), but that certainly isn't always straightforward, so perhaps using more walkers will do the trick for you!
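"Parallelizing at a lower level with numpy" usually means replacing an interpreted Python loop with array operations that run in compiled code. A hypothetical stand-in for an expensive model (a sum of Gaussians on a grid; none of these names come from the thread):

```python
import numpy as np

# Illustrative stand-in for an expensive model evaluation:
# a sum of 200 Gaussian components evaluated on a 1000-point grid.
x = np.linspace(-5, 5, 1000)
centers = np.linspace(-2, 2, 200)

# Python-loop version: one interpreted pass per component
slow = np.zeros_like(x)
for c in centers:
    slow += np.exp(-0.5 * (x - c) ** 2)

# Vectorized version: broadcasting builds a (200, 1000) array and the
# loop over components runs inside compiled numpy code instead
fast = np.exp(-0.5 * (x[None, :] - centers[:, None]) ** 2).sum(axis=0)

assert np.allclose(slow, fast)
```

The vectorized form trades memory (the intermediate 2-D array) for speed, which is usually a good deal at these sizes.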

Maryamtajalli (Author) commented Sep 23, 2021

> Yeah, I think it's right that you can't directly use multiprocessing for the model parallelization. I normally parallelize at a lower level (using numpy or hand-written C code), but that certainly isn't always straightforward, so perhaps using more walkers will do the trick for you!

Hello, I was thinking about one of your suggestions: writing C code for the model evaluation. My log_posterior function contains an integral inside a for loop, and using scipy.integrate makes my code very slow but the values of log_posterior much more accurate. I have tried other numerical integration methods (such as Monte Carlo integration) to speed up the calculation, but as I said, the results are not as precise as scipy.integrate. I was wondering whether it would be possible to write my log_posterior function in C and then somehow wrap this computationally expensive function into my main Python code and pass it to emcee.
Also, do you think using async / await to run the log_posterior function and emcee in parallel could improve performance?
I would appreciate it if you could let me know whether these are good ways to speed up emcee.
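Wrapping compiled C into Python is indeed possible; one route is `ctypes`. A minimal sketch, assuming a system C math library is available and borrowing its `exp()` as a stand-in for a hand-written log-posterior compiled to a shared library (the `logpost.c` / `liblogpost.so` names below are hypothetical):

```python
import ctypes
import ctypes.util
import math

# Load the C math library; a hand-written log-posterior compiled with
#     gcc -O2 -shared -fPIC logpost.c -o liblogpost.so
# would be loaded and declared the same way.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the C signature so ctypes converts arguments correctly
libm.exp.restype = ctypes.c_double
libm.exp.argtypes = [ctypes.c_double]

assert abs(libm.exp(1.0) - math.e) < 1e-12
```

`ctypes` is only one route: Cython or a scipy `LowLevelCallable` (as suggested later in this thread) reach the same goal with less manual signature declaration.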

andyfaff (Contributor) commented:

Speeding up emcee doesn't really seem like the right phrase to use here, given that the rate limiting step seems to be your log-posterior function.
You may have already followed these steps for improved efficiency, forgive me if you have:

  1. Try and eliminate as many loops as you can using numpy broadcasting and vectorisation. It's often possible to remove nested loops that are performance killers. I would focus on the func you're passing to scipy.integrate first.
  2. Can you vectorise func? For example, quadrature would speed up if you can pass a func that receives and returns vector arrays.
  3. Alternatively, since you mention that you call a scipy.integrate function in a for loop, could you eliminate that loop by using [quad_vec](https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.quad_vec.html)?

Note that the difference between 2 and 3 is that quadrature returns the result of a single integration, while quad_vec returns the results of multiple integrations.

  4. You can often get good speedups by cythonising func (either Cython or wrapped C), but that's a more involved process. This process becomes much easier if you've already tried point 1.
  5. If you carry out 4., then you can improve yet again by generating a LowLevelCallable from the Cython function and passing it to scipy.integrate.quad. The overhead of the external QUADPACK library calling Python functions repeatedly can be significant.
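Points 2 and 3 can be sketched concretely. A hypothetical stand-in for "a scipy.integrate call inside a for loop" (the integrand and parameter values below are illustrative, not from the thread):

```python
import numpy as np
from scipy.integrate import quad, quad_vec

# Integrate exp(-a * x) over [0, 1] for 50 values of the parameter a
a = np.linspace(1.0, 5.0, 50)

# Loop version: one quad call per parameter value
loop_results = np.array(
    [quad(lambda x, ai=ai: np.exp(-ai * x), 0, 1)[0] for ai in a]
)

# Vectorized version: quad_vec integrates a function returning an
# array, so all 50 integrals come back from a single call
vec_results, err = quad_vec(lambda x: np.exp(-a * x), 0, 1)

assert np.allclose(loop_results, vec_results)
```

Beyond removing Python-level overhead, `quad_vec` can share function evaluations across the whole batch of integrals, which is where most of the speedup comes from.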

From my own experience (having run into similar performance issues), it's not uncommon to get orders-of-magnitude speedups. I went from a naive Python implementation with nested for loops to C, and gained at least an order of magnitude. I then understood my algorithm better, which enabled me to vectorise (à la step 1), which brought the Python code to within a factor of 2 of the C.

Maryamtajalli (Author) replied:

> Speeding up emcee doesn't really seem like the right phrase to use here, given that the rate limiting step seems to be your log-posterior function. […]

Dear Andrew, thank you very much for your great suggestions. I vectorized func and, using quadrature, the code sped up by a factor of almost 2, although the results from quadrature were not as accurate as quad. I will also try to cythonize func and see how things improve further.
