(swmr)=

# Reading an HDF5 file while it is being written

This tutorial shows how to read from an HDF5 file while it is being written by the `HDFBackend`, without crashing the process that is writing to the file.

By default, an HDF5 file can be either read or written, but not both at the same time. Trying to do so anyway might crash the `emcee` process, because the information in the file is not properly synchronized.

**Important note**: how likely this is depends on how much time is spent writing relative to the time taken to perform a step. If saving takes only a small fraction of each step, you are less likely to encounter crashes of the `emcee` process.
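
To get a feeling for the note above, here is a rough sketch that estimates which fraction of each step the writer spends saving. This is a toy model, not part of `emcee`: the function name `write_fraction` and the timings are hypothetical, and the fraction is only a crude proxy for how often a reader can collide with the writer.

```python
def write_fraction(compute_time, write_time):
    """Fraction of a single step during which the writer is busy saving.

    Both arguments are durations in seconds for one step of the chain.
    """
    total = compute_time + write_time
    if total <= 0:
        raise ValueError("the step must take a positive amount of time")
    return write_time / total


# Hypothetical timings, not measured from the example in this tutorial.
cheap_likelihood = write_fraction(compute_time=0.01, write_time=0.01)
costly_likelihood = write_fraction(compute_time=10.0, write_time=0.01)
print(f"cheap likelihood:  {cheap_likelihood:.3f}")
print(f"costly likelihood: {costly_likelihood:.3f}")
```

With a cheap log-probability the file is being written for a large part of the run, while with an expensive one the writes become rare events.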

In [1]:
%config InlineBackend.figure_format = "retina"

from matplotlib import rcParams
rcParams["savefig.dpi"] = 100
rcParams["figure.dpi"] = 100
rcParams["font.size"] = 20

In [2]:
import multiprocessing as mp
import os
import time
import numpy as np
import emcee

## Example of failing to write and read at the same time

Let us consider the following scenario.

We want to run an MCMC on some problem: in a script we define the log-probability function and set up `emcee` inside a function.

In [3]:
def lnprob(x):
    time.sleep(0.0001)
    return 0.

def writer():
    nwalkers = 100
    nsteps = 1000
    if os.path.isfile('backend.h5'):
        os.remove('backend.h5')
    backend = emcee.backends.HDFBackend('backend.h5')
    backend.reset(nwalkers,1)
    sampler = emcee.EnsembleSampler(nwalkers,1,lnprob,backend=backend)
    pos0 = np.ones(nwalkers) + ((np.random.random(nwalkers)-0.5)*2e-3)
    print(pos0.shape)
    sampler.run_mcmc(pos0[:, None],nsteps,progress=True,store=True)

We also have a script to read the chain from the HDF5 file.

In [4]:
def reader():
    backend = emcee.backends.HDFBackend('backend.h5',read_only=True)
    chain = backend.get_chain()
    print(chain.shape)

Now, we start the chain in the background, with the help of the `multiprocessing` module.

In [8]:
writer_proc = mp.Process(target=writer)
writer_proc.start()

(100,)


 16%|██████████▋                                                      | 164/1000 [00:03<00:18, 44.11it/s]
Process Process-2:
Traceback (most recent call last):
  File "/home/ale/miniconda3/envs/emceefork/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ale/miniconda3/envs/emceefork/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/tmp/ipykernel_17208/3509143556.py", line 15, in writer
    sampler.run_mcmc(pos0[:, None],nsteps,progress=True,store=True)
  File "/home/ale/Documenti/DOTTORATO/Progetti/emcee/src/emcee/ensemble.py", line 438, in run_mcmc
    for results in self.sample(initial_state, iterations=nsteps, **kwargs):
  File "/home/ale/Documenti/DOTTORATO/Progetti/emcee/src/emcee/ensemble.py", line 405, in sample
    self.backend.save_step(state, accepted)
  File "/home/ale/Documenti/DOTTORATO/Progetti/emcee/src/emcee/backends/hdf.py", line 292, in save_step
    with sel

If for any reason you want to stop the writer, please uncomment the following cell and run it.

In [9]:
#writer_proc.terminate()

While the chain is running, we want to check on its status, so we read it.

In [10]:
reader()

(50, 100, 1)


If the function call above worked without causing problems, let us try repeating the read several times in succession. If instead the reader has already crashed, there is no need to run the cell below and you can read further down.

In [11]:
imax=50
i=0
while writer_proc.is_alive():
    i+=1
    try:
        reader()
    except Exception:
        print("Could not read from the backend.")
    if i>=imax:
        break
    time.sleep(0.5)
if i<imax and writer_proc.exitcode not in [0,None]:
    print("The writer crashed after {} tries.".format(i))

(77, 100, 1)
Could not read from the backend.
Could not read from the backend.
(144, 100, 1)
(164, 100, 1)
The writer crashed after 5 tries.


What very likely happened is that the cell above produced two kinds of outputs:

- `"Could not read from the backend."`, which means that the reader threw an exception while opening the file for reading. This happens when the file is read while the writer is in the middle of writing.
- `"The writer crashed after N tries."`, which shows that the writer crashed while trying to open the file when it was already opened by the reader.

The take-home message is that without precautions _it is not safe to read an HDF5 file while it is being written_.
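
The check on `writer_proc.exitcode` used above relies on the conventions of `multiprocessing.Process`: the exit code is `None` while the process is running, `0` after a normal return, and non-zero otherwise. As a small sketch, a hypothetical helper (`describe_exitcode` is not part of `emcee` or of the standard library) can turn that value into a readable status:

```python
def describe_exitcode(exitcode):
    """Translate ``multiprocessing.Process.exitcode`` into a short status."""
    if exitcode is None:
        return "still running"
    if exitcode == 0:
        return "finished normally"
    if exitcode < 0:
        # A negative value -N means the process was terminated by signal N.
        return f"terminated by signal {-exitcode}"
    # A positive value is reported, for example, after an uncaught exception.
    return f"exited with error code {exitcode}"


for code in (None, 0, 1, -15):
    print(code, "->", describe_exitcode(code))
```

In the notebook one would call `describe_exitcode(writer_proc.exitcode)` to see whether the writer is alive, done, or crashed.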

## How to read while the file is being written

To completely avoid crashing the writer and interrupting a possibly long-running chain, two things have to be done:

1) turn on the Single Writer Multiple Reader (SWMR) mode
2) deactivate the HDF5 file locking mechanism when reading

### 1. Activating SWMR

The SWMR mode has to be activated both for the writer and for the reader. Taking code from the above example, you would have to use
```python
backend = emcee.backends.HDFBackend('backend.h5',swmr=True)
```
for the writer and
```python
backend = emcee.backends.HDFBackend('backend.h5',read_only=True, swmr=True)
```
for the reader.

### 2. Deactivating the file locking

When running a chain, `emcee` opens and closes the HDF5 file of the backend at each step. This is an issue, because it is important that the writer opens the file before the reader can read it, and with repeated reopening the reader may already have the file open when the writer needs it. Under the hood, a file locking mechanism prevents mistakes when multiple processes access the file: the writer has to obtain the lock to be able to write, otherwise it will crash. The solution is to remove the locking mechanism from the reader, which is as simple as including
```python
import os
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'
```
in the script that reads the file.
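
If the reading code lives in a longer session, it can be convenient to restore the environment afterwards. Below is a minimal sketch of a context manager that does so; the name `hdf5_file_locking_disabled` is hypothetical, and the sketch assumes that setting the variable right before the file is opened is early enough, as is done in the cells of this tutorial.

```python
import os
from contextlib import contextmanager


@contextmanager
def hdf5_file_locking_disabled():
    """Temporarily set HDF5_USE_FILE_LOCKING to 'FALSE'."""
    name = "HDF5_USE_FILE_LOCKING"
    previous = os.environ.get(name)  # None if the variable was not set
    os.environ[name] = "FALSE"
    try:
        yield
    finally:
        # Restore the environment exactly as it was before.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


with hdf5_file_locking_disabled():
    # Open the backend and read the chain inside this block.
    print(os.environ["HDF5_USE_FILE_LOCKING"])
```

The variable is restored even if the read raises an exception, so the rest of the session is not affected.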

### A word of warning

Due to the complex operations performed by HDF5, it might still happen that the reader crashes while trying to read the file. As of today this is unavoidable, but it is a minor inconvenience compared to crashing the writer process.
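
Since a failed read is harmless for the writer, one simple way to cope with it is to try again after a short pause. The following is a sketch of such a retry loop, not something provided by `emcee`: the helper name `read_with_retries` and its default values are hypothetical, and it catches any `Exception` because the exact error raised by a failed read is not specified here.

```python
import time


def read_with_retries(read, max_tries=5, delay=0.5):
    """Call ``read()`` until it succeeds or ``max_tries`` is reached.

    Returns the value of the first successful call and re-raises the
    last exception if every attempt fails.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        try:
            return read()
        except Exception:
            if attempt == max_tries:
                raise
            time.sleep(delay)  # give the writer time to finish its step
```

Here `read` can be any function without arguments that opens the backend in read-only SWMR mode and returns the chain, for instance a variant of the reader function of this tutorial that returns the chain instead of printing its shape.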

## A complete solution

We show below a complete, working solution that does not crash the writer.

In [12]:
def writer_fixed():
    nwalkers = 100
    nsteps = 1000
    if os.path.isfile('backend.h5'):
        os.remove('backend.h5')
    # write to backend with SWMR mode active
    backend = emcee.backends.HDFBackend('backend.h5',swmr=True)
    backend.reset(nwalkers,1)
    sampler = emcee.EnsembleSampler(nwalkers,1,lnprob,backend=backend)
    pos0 = np.ones(nwalkers) + ((np.random.random(nwalkers)-0.5)*2e-3)
    print(pos0.shape)
    sampler.run_mcmc(pos0[:, None],nsteps,progress=True,store=True)
    
def reader_fixed():
    # read from backend with SWMR mode active
    backend = emcee.backends.HDFBackend('backend.h5',read_only=True,swmr=True)
    chain = backend.get_chain()
    print(chain.shape)

In [13]:
writer_proc = mp.Process(target=writer_fixed)
writer_proc.start()

(100,)


100%|████████████████████████████████████████████████████████████████| 1000/1000 [00:23<00:00, 43.18it/s]


In [14]:
# set the environment variable
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'
imax=500
i=0
while writer_proc.is_alive():
    i+=1
    try:
        reader_fixed()
    except Exception:
        print("Could not read from the backend.")
    if i>=imax:
        break
    time.sleep(0.5)
if i<imax and writer_proc.exitcode not in [0,None]:
    print("The writer crashed after {} tries.".format(i))
else:
    print("The writer did not crash.")
del os.environ['HDF5_USE_FILE_LOCKING'] # we remove the environment variable to restore the initial environment

(30, 100, 1)
(51, 100, 1)
(71, 100, 1)
(93, 100, 1)
Could not read from the backend.
(137, 100, 1)
Could not read from the backend.
(180, 100, 1)
(202, 100, 1)
(223, 100, 1)
Could not read from the backend.
(266, 100, 1)
(289, 100, 1)
(311, 100, 1)
(334, 100, 1)
(356, 100, 1)
(378, 100, 1)
(399, 100, 1)
(421, 100, 1)
Could not read from the backend.
(462, 100, 1)
Could not read from the backend.
Could not read from the backend.
Could not read from the backend.
(551, 100, 1)
Could not read from the backend.
(596, 100, 1)
(618, 100, 1)
(639, 100, 1)
(660, 100, 1)
(681, 100, 1)
(703, 100, 1)
(725, 100, 1)
(748, 100, 1)
(770, 100, 1)
(792, 100, 1)
Could not read from the backend.
Could not read from the backend.
(856, 100, 1)
(878, 100, 1)
(899, 100, 1)
(920, 100, 1)
(943, 100, 1)
(964, 100, 1)
Could not read from the backend.
The writer did not crash.


The output you see at this point should make it clear that, while the reader did not always succeed, the writer survived until the end.