Emcee and MPI with mpi4py #310

Open
FabRez opened this issue Sep 25, 2019 · 3 comments
FabRez commented Sep 25, 2019

General information:

  • emcee version: 3.0rc2
  • platform: Ubuntu
  • installation method (pip/conda/source/other?): pip install

Problem description:

Hi all,

I am pretty new to MPI parallelization, and this is my first attempt at using it in a Python code.

I have tried parallelizing emcee with mpi4py in a different way from the minimal examples proposed in the code's documentation (which actually use multiprocessing).

The idea behind this workaround is that I want each process to write its own "backend" file while emcee is running, which, to my knowledge so far, cannot be done using the "pool" option of the sampler. Furthermore, compared to the "pool" method, I can draw many more models at the same time, with speed comparable to running the code without parallelization.

What I would like to know is whether this is the right way to obtain what I wanted to achieve, or if there is another way to do it using the "pool" method. The workaround works fine, but I am concerned about the results I am drawing from it: are the results of each MCMC process really independent of one another in this way, or is there something I am missing?

Best,
F.

Minimal example:

import numpy as np
import emcee
from mpi4py import MPI

def lnlike(theta, x, y, yerr):
    m, b, lnf = theta
    model = m * x + b
    inv_sigma2 = 1.0 / (yerr**2 + model**2 * np.exp(2 * lnf))
    return -0.5 * np.sum((y - model)**2 * inv_sigma2 - np.log(inv_sigma2))

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Placeholder data and starting point (x, y, yerr, and p0 were not defined in
# the original snippet; any dataset and initial guess for (m, b, lnf) will do)
x = np.sort(10 * np.random.rand(50))
yerr = 0.1 + 0.5 * np.random.rand(50)
y = 0.5 * x + 1.0 + yerr * np.random.randn(50)
p0 = np.array([0.5, 1.0, -1.0])

sampler = {}
backends = {}

# each MPI rank writes its chain to its own file
filename = "examples_" + str(rank + 1) + ".h5"

ndim, nwalkers = 3, 10

# use the backends option of emcee to store the chains
backends[str(rank)] = emcee.backends.HDFBackend(filename)
backends[str(rank)].reset(nwalkers, ndim)

# create an initialization for the walkers
pos = [p0 + 1e-4 * np.random.randn(ndim) for i in range(nwalkers)]

sampler[str(rank)] = emcee.EnsembleSampler(
    nwalkers, ndim, lnlike, args=(x, y, yerr), backend=backends[str(rank)]
)

# start the actual MCMC process
sampler[str(rank)].run_mcmc(pos, 10000, progress=True)
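
For reference, a script like this is launched with one process per chain, e.g. mpiexec -n 4 python script.py (assuming the snippet is saved as script.py), which would produce examples_1.h5 through examples_4.h5.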


dfm commented Sep 26, 2019

There isn't anything fundamentally wrong with this approach (it can actually be useful, since you have many independent ensembles in this case). But one thing that you need to be careful about is making sure that the random number generator on each process gets properly initialized, or you'll end up with chains that are correlated or even identical.
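
One simple way to do that in the script above is a rank-dependent seed for NumPy's global generator, set before pos and the sampler are created. A minimal sketch (the base seed of 42 is arbitrary):

import numpy as np
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

# Seed NumPy's global generator differently on each MPI process so that the
# walker initializations (and anything else drawn from np.random) differ
# from rank to rank.
np.random.seed(42 + rank)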

wenbinlu commented:

I ran into a similar situation. Is it possible to random.seed() the random numbers used in emcee? Or does emcee already adopt a particular way of seeding the random number generator (in which case I won't have control over it)?


dfm commented Oct 31, 2019

The EnsembleSampler creates a RandomState (that it owns) when it's initialized, so the easiest way to have control over the state is to call np.random.seed before initializing the sampler. It would be good to add a seed argument to the sampler, but that's not yet implemented.
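
Concretely, something like the following sketch (log_prob and the seed offset are placeholders), where the np.random.seed call comes before the sampler is constructed:

import numpy as np
import emcee
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

# Seed BEFORE constructing the sampler, since the sampler sets up its
# RandomState at initialization time.
np.random.seed(1234 + rank)

def log_prob(theta):
    # placeholder log-probability: a standard Gaussian
    return -0.5 * np.sum(theta**2)

ndim, nwalkers = 3, 10
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
p0 = np.random.randn(nwalkers, ndim)
sampler.run_mcmc(p0, 100)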
