From e3bd4e989dedcafb4aa2844f64e7d44cfcf3c7eb Mon Sep 17 00:00:00 2001
From: Jeff Rasley
Date: Thu, 27 Feb 2020 10:02:06 -0800
Subject: [PATCH 1/3] Update requirements.txt

---
 requirements.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/requirements.txt b/requirements.txt
index dd2c5fdf478f..5772558deaaa 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -9,3 +9,4 @@ nvidia-ml-py3
 pytest
 pytest-forked
 pre-commit
+mpi4py

From 92ef27d7b3b89a03e61c4581ffe6292d07b2d3e7 Mon Sep 17 00:00:00 2001
From: Jeff Rasley
Date: Thu, 27 Feb 2020 10:09:55 -0800
Subject: [PATCH 2/3] add text about mpirun

---
 README.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/README.md b/README.md
index f327f4c93356..e95ef824c8da 100755
--- a/README.md
+++ b/README.md
@@ -367,6 +367,23 @@ deepspeed --include="worker-2:0,1" \
 	--deepspeed --deepspeed_config ds_config.json
 ```
 
+### MPI Compatibility
+As described above, DeepSpeed provides its own parallel launcher to help launch
+multi-node/multi-gpu training jobs. If you prefer to launch your training job
+using MPI (e.g., mpirun), we provide support for this. It should be noted that
+DeepSpeed will still use the torch distributed NCCL backend and *not* the MPI
+backend. To launch your training job with mpirun + DeepSpeed you simply pass us
+an additional flag `--deepspeed_mpi`. DeepSpeed will then use
+[mpi4py](https://pypi.org/project/mpi4py/) to discover the MPI environment (e.g.,
+rank, world size) and properly initialize torch distributed for training. In this
+case you will explicitly invoke `python` to launch your model script instead of using
+the `deepspeed` launcher, here is an example:
+```bash
+mpirun python \
+	\
+	--deepspeed_mpi --deepspeed --deepspeed_config ds_config.json
+```
+
 ## Resource Configuration (single-node)
 In the case that we are only running on a single node (with one or more GPUs)
 DeepSpeed *does not* require a hostfile as described above. If a hostfile is

From 1ed239570d6271ac6533cd52bdb89bf9e42b9960 Mon Sep 17 00:00:00 2001
From: Jeff Rasley
Date: Thu, 27 Feb 2020 10:17:05 -0800
Subject: [PATCH 3/3] remove mpi4py from requirements.txt and update doc

---
 README.md        | 3 +++
 requirements.txt | 1 -
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e95ef824c8da..884e3bb878cb 100755
--- a/README.md
+++ b/README.md
@@ -384,6 +384,9 @@ mpirun python \
 	--deepspeed_mpi --deepspeed --deepspeed_config ds_config.json
 ```
 
+If you want to use this feature of DeepSpeed, please ensure that mpi4py is
+installed via `pip install mpi4py`.
+
 ## Resource Configuration (single-node)
 In the case that we are only running on a single node (with one or more GPUs)
 DeepSpeed *does not* require a hostfile as described above. If a hostfile is
diff --git a/requirements.txt b/requirements.txt
index 5772558deaaa..dd2c5fdf478f 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -9,4 +9,3 @@ nvidia-ml-py3
 pytest
 pytest-forked
 pre-commit
-mpi4py
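
Note on the README text added in this series: it says DeepSpeed uses mpi4py to discover the rank and world size and then initializes torch distributed with the NCCL backend. The sketch below illustrates that general idea only; it is not DeepSpeed's actual implementation. The function names `build_dist_env` and `init_from_mpi` are hypothetical, and the choice of rank 0's hostname as the rendezvous address and 29500 as the port are assumptions made for illustration.

```python
import os
import socket


def build_dist_env(rank, world_size, master_addr, master_port=29500):
    """Build the variables read by torch.distributed's env:// initialization.

    Hypothetical helper for illustration; the port default is an assumption.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return {
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


def init_from_mpi(master_port=29500):
    """Discover rank and world size through MPI, then start torch distributed."""
    # Imported lazily so build_dist_env stays usable without MPI or torch.
    from mpi4py import MPI
    import torch.distributed as dist

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    world_size = comm.Get_size()

    # Assumption: rank 0's hostname is reachable from every other process.
    # Broadcasting it makes all ranks agree on one rendezvous address.
    hostname = socket.gethostname() if rank == 0 else None
    master_addr = comm.bcast(hostname, root=0)

    os.environ.update(build_dist_env(rank, world_size, master_addr, master_port))

    # NCCL carries the collectives; MPI is used only for discovery above.
    # With no init_method given, torch reads the variables set above (env://).
    dist.init_process_group(backend="nccl")
    return rank, world_size
```

Under `mpirun`, each process would call `init_from_mpi()` once before building its model. Keeping the environment construction in a separate pure function makes the mapping from MPI values to torch variables easy to check without a cluster.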