diff --git a/README.md b/README.md
index f327f4c93356..884e3bb878cb 100755
--- a/README.md
+++ b/README.md
@@ -367,6 +367,26 @@ deepspeed --include="worker-2:0,1" \
     --deepspeed --deepspeed_config ds_config.json
 ```
 
+### MPI Compatibility
+As described above, DeepSpeed provides its own parallel launcher to help launch
+multi-node/multi-GPU training jobs. If you prefer to launch your training job
+with MPI (e.g., mpirun), we support that as well. Note that DeepSpeed will
+still use the torch distributed NCCL backend and *not* the MPI backend. To
+launch your training job with mpirun + DeepSpeed, simply pass the additional
+flag `--deepspeed_mpi`. DeepSpeed will then use
+[mpi4py](https://pypi.org/project/mpi4py/) to discover the MPI environment
+(e.g., rank, world size) and properly initialize torch distributed for
+training. In this case you will explicitly invoke `python` to launch your model
+script instead of using the `deepspeed` launcher. For example:
+```bash
+mpirun <mpi-args> python \
+    <client_entry.py> <client args> \
+    --deepspeed_mpi --deepspeed --deepspeed_config ds_config.json
+```
+
+If you want to use this feature of DeepSpeed, please ensure that mpi4py is
+installed via `pip install mpi4py`.
+
 ## Resource Configuration (single-node)
 In the case that we are only running on a single node (with one or more GPUs)
 DeepSpeed *does not* require a hostfile as described above. If a hostfile is
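
For reference, the snippet below sketches what the mpi4py-based discovery and torch distributed initialization described in the added section can look like. It is a minimal illustration under stated assumptions, not DeepSpeed's implementation: the function name `init_distributed_from_mpi`, the `MASTER_ADDR`/`MASTER_PORT` defaults, and the one-GPU-per-local-rank mapping are all hypothetical and only meant to show the flow (MPI provides rank and world size, NCCL remains the torch backend).

```python
# Minimal sketch (not DeepSpeed's actual internals) of turning an mpirun launch
# into a torch.distributed initialization: mpi4py reports rank/world size, and
# NCCL is still used as the torch distributed backend.
import os

import torch
import torch.distributed as dist
from mpi4py import MPI


def init_distributed_from_mpi(master_addr="localhost", master_port="29500"):
    # Hypothetical helper; defaults for master address/port are assumptions.
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()        # global rank of this process
    world_size = comm.Get_size()  # total number of MPI processes

    # torch.distributed's env:// init method reads these variables.
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ.setdefault("MASTER_ADDR", master_addr)
    os.environ.setdefault("MASTER_PORT", master_port)

    # Simple single-node assumption: one GPU per local rank.
    local_rank = rank % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # NCCL backend, as the section above notes; MPI is only used for discovery.
    dist.init_process_group(backend="nccl", init_method="env://")
    return rank, world_size, local_rank
```

Launched with something like `mpirun -n 4 python train.py`, each of the four processes would discover its own rank via MPI, bind to a GPU, and join the same NCCL process group.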