HDF backend slow #393

Open
AndreWaehlisch opened this issue Jun 24, 2021 · 3 comments

@AndreWaehlisch

General information:

  • emcee version: 3.0.2
  • platform: openSUSE
  • installation method (pip/conda/source/other?): pip

I am looking at your saving example at https://emcee.readthedocs.io/en/v3.0.2/tutorials/monitor/

Problem description:

I very much like the flexibility offered by the HDF backend: being able to save my chain to a file and continue at a later point in time (especially as a backup for long computations, in case my CPU node dies). However, when I have a fast log_prob function, the overhead of opening/writing/closing the HDF file on each iteration is disproportionately high and the overall performance is painfully slow.

Expected behavior:

Perhaps an easy solution would be an option to save the chain state only on every n-th iteration (where n is an adjustable number, or is calculated from the relative progress). This would reduce the overhead by opening/closing the HDF file only once in a while.
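
For illustration, something like this hypothetical option (the `save_every` keyword does not exist in emcee, it's just the idea) is what I have in mind:

```python
import emcee

# Hypothetical: "save_every" is not part of the emcee API. The idea is that
# the backend would buffer steps in memory and only touch the HDF file once
# every 100 iterations.
backend = emcee.backends.HDFBackend("chain.h5", save_every=100)
```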

@dfm
Owner

dfm commented Jun 24, 2021

Yes - this is a nice proposal and I'd be happy to review such a PR, but I'm not immediately sure how painful it would be to implement.

If you're planning on thinning your chain eventually, you could use the thin_by parameter for sample or run_mcmc (see here), but your suggestion would be much better in general!
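
For example, a toy sketch (the target and dimensions are placeholders):

```python
import numpy as np
import emcee

def log_prob(x):
    return -0.5 * np.sum(x ** 2)  # toy Gaussian target

nwalkers, ndim = 32, 5
p0 = np.random.randn(nwalkers, ndim)

backend = emcee.backends.HDFBackend("chain.h5")
backend.reset(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, backend=backend)

# 1000 * 10 = 10000 proposals are made in total, but only every 10th step
# is stored, so the HDF file is opened/written/closed 10x less often.
sampler.run_mcmc(p0, 1000, thin_by=10)
```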

@jpmvferreira

I was about to suggest this as well; I even opened a discussion on the emcee Google Groups - link - and in the fourth reply a user gave a small example of how to run emcee as an iterator, which I was already doing. I was now trying to figure out how to periodically write the chain to disk.

One thing I had in mind was saving the chain in parallel with computing the autocorrelation time, because for longer chains that computation can take a while, and that way you would make the most of that time.

In the meantime I'll keep trying to figure that out myself, but having this by default would be great!
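
The rough pattern I'm experimenting with looks like this (a sketch with a toy log_prob; the snapshot file layout is my own, not the HDFBackend's):

```python
import h5py
import numpy as np
import emcee

def log_prob(x):
    return -0.5 * np.sum(x ** 2)  # toy Gaussian target

nwalkers, ndim = 32, 5
p0 = np.random.randn(nwalkers, ndim)

# Default in-memory backend: no file I/O during sampling.
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)

for state in sampler.sample(p0, iterations=10000):
    if sampler.iteration % 1000 != 0:
        continue
    # Overwrite a snapshot of the in-memory chain every 1000 steps.
    with h5py.File("snapshot.h5", "w") as f:
        f.create_dataset("chain", data=sampler.get_chain())
        f.create_dataset("log_prob", data=sampler.get_log_prob())
```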

@jpmvferreira
Copy link

I've been digging through the source code: each time a step is computed, it is saved to the backend using the method save_step. If the backend is an HDF file, on each new step the file is opened and the contents are written to it.

The same thing happens when reading a value from this backend, with the method get_value: the file is opened, the relevant entry is read into memory, and then used for whatever computation it was requested for.
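
To be concrete, the write side follows roughly this pattern (a simplified paraphrase of what HDFBackend.save_step does, not the actual emcee source):

```python
import h5py

# Simplified paraphrase: the HDF file is opened and closed once per step.
def save_step(filename, name, coords, log_prob):
    with h5py.File(filename, "a") as f:   # open the file...
        g = f[name]
        iteration = g.attrs["iteration"]
        g["chain"][iteration, :, :] = coords
        g["log_prob"][iteration, :] = log_prob
        g.attrs["iteration"] = iteration + 1
    # ...and close it again, on every single iteration.
```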

I was trying to modify the sample function in ensemble.py, but I realized that if I saved only periodically and the user tried to, say, get the autocorrelation time, the file would be outdated and the result would be wrong. So all modifications should be in the backend itself, right?

If so, a way to communicate the buffer to the backend would have to be added, since at the moment there doesn't seem to be any way of doing that.
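
One way I can imagine doing it is a small subclass that buffers steps and only flushes every n iterations. This is just a sketch: a real version would also need to flush before reads like get_value (so results aren't stale, as mentioned above) and flush any remainder when sampling ends:

```python
import emcee

class BufferedHDFBackend(emcee.backends.HDFBackend):
    """Sketch: buffer steps in memory, flush to the HDF file every `every` steps."""

    def __init__(self, filename, every=100, **kwargs):
        super().__init__(filename, **kwargs)
        self.every = every
        self._buffer = []

    def save_step(self, state, accepted):
        self._buffer.append((state, accepted))
        if len(self._buffer) >= self.every:
            self.flush()

    def flush(self):
        # Replay the buffered steps through the parent implementation. A real
        # version would write the whole batch inside a single open() call to
        # actually avoid the per-step open/close overhead.
        buffered, self._buffer = self._buffer, []
        for state, accepted in buffered:
            super().save_step(state, accepted)

# usage: backend = BufferedHDFBackend("chain.h5", every=500)
```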
