# MPI Parallel HDF5

**Source:** *Python and HDF5* by Andrew Collette, O'Reilly 2013.

We would like several processes to share a single file, with the library coordinating their reads and writes for us.

<img src="./img/MPI.png" width=600 />

### `mpi4py`

#### `demo.py`

```python
from mpi4py import MPI
comm = MPI.COMM_WORLD
print("Hello World (from process %d)" % comm.rank)
```

### Run

```shell
$ mpiexec -n 4 python demo.py
Hello World (from process 0)
Hello World (from process 1)
Hello World (from process 3)
Hello World (from process 2)
```

The processes run independently, so the order of the output lines can change from run to run.

### Use the MPI/IO-based VFD

For this to work, you'll need a version of the HDF5 library with the MPI-parallel extensions enabled. Unfortunately, none of the mainstream Python distros ship with one.

```python
from mpi4py import MPI
import h5py

f = h5py.File("foo.hdf5", "w", driver="mpio", comm=MPI.COMM_WORLD)
```
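
Before opening a file with `driver="mpio"`, it can help to check whether the installed `h5py` was built against a parallel HDF5 at all. The sketch below is one way to do that; the helper name `has_parallel_h5py` is made up for illustration, and it assumes an `h5py` version that exposes `h5py.get_config().mpi`.

```python
def has_parallel_h5py():
    """Return True if h5py is importable and reports an MPI-enabled build."""
    try:
        import h5py
    except ImportError:
        # h5py is not installed at all, so there is no parallel support either.
        return False
    # get_config().mpi is True only when h5py was compiled in MPI mode.
    return bool(h5py.get_config().mpi)


if __name__ == "__main__":
    print("Parallel h5py available:", has_parallel_h5py())
```

If this reports `False`, the `mpio` driver will not work and `h5py` has to be rebuilt against a parallel HDF5 library.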

#### Our Favorite Example

```python
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD   # Communicator which links all our processes together
rank = comm.rank        # Number which identifies this process.  Since we'll
                        # have 4 processes, this will be in the range 0-3.

# 'r+' opens the existing file for reading and writing; 'coords' must already exist.
f = h5py.File('coords.hdf5', 'r+', driver='mpio', comm=comm)

coords_dset = f['coords']
distances_dset = f.create_dataset('distances', (1000,), dtype='f4')

idx = rank*250  # This will be our starting index.  Rank 0 handles coordinate
                # pairs 0-249, Rank 1 handles 250-499, Rank 2 500-749, and
                # Rank 3 handles 750-999.

coords = coords_dset[idx:idx+250]  # Load process-specific data

result = np.sqrt(np.sum(coords**2, axis=1))  # Compute distances

distances_dset[idx:idx+250] = result  # Write process-specific data

f.close()
```
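
The example above hard-codes 4 processes and 250 items per process. As a minimal sketch of how the same block decomposition might be generalized, the hypothetical helper `block_range` below (not part of `h5py` or `mpi4py`) computes the slice owned by each rank for any process count, including when the data does not divide evenly.

```python
def block_range(rank, size, n):
    """Return (start, stop) of the contiguous block of n items owned by rank."""
    base, extra = divmod(n, size)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


if __name__ == "__main__":
    # Reproduces the 0-249, 250-499, 500-749, 750-999 split used above.
    for r in range(4):
        print(r, block_range(r, 4, 1000))
```

In the script above, one would call `block_range(comm.rank, comm.size, 1000)` and use the result in place of `idx` and `idx+250`.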

### Atomicity

Without the line `f.atomic = True`, there is no guarantee about what rank 1 prints: even with the barrier, its read may not yet see the value written by rank 0.

```python
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

with h5py.File('atomicdemo.hdf5', 'w', driver='mpio', comm=comm) as f:

    f.atomic = True  # Enables strict atomic mode (requires HDF5 1.8.9+)

    dset = f.create_dataset('x', (1,), dtype='i')

    if rank == 0:
        dset[0] = 42

    comm.Barrier()

    if rank == 1:
        print(dset[0])
```

## Topics for Discussion

- Building the HDF5 library with parallel extensions enabled
  + `CC=/usr/local/mpi/bin/mpicc ./configure --enable-parallel --prefix=<install-directory>`
- Building `h5py` with parallel extensions enabled
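  + `python setup.py build --mpi [--hdf5=/path/to/parallel/hdf5]`
  + Recent `h5py` releases are configured through environment variables instead: `CC=mpicc HDF5_MPI="ON" HDF5_DIR=/path/to/parallel/hdf5 pip install --no-binary=h5py h5py`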