[feat] Gossip/SlowMo #378

Merged (123 commits, Nov 8, 2021)
Changes from 2 commits
Commits (123)
268f2f8
Add latest version of gossip code from branch latest_master of vtanti…
vtantia Jan 5, 2021
f152379
Add code for importing GossipDataParallel in fairscale
vtantia Jan 5, 2021
bbeab4a
Add tests (currently in wrong location so will need to be moved)
vtantia Jan 5, 2021
4616722
Remove extra ad_psgd file
vtantia Jan 5, 2021
ed7b866
Add change in gitignore to ignore vscode config
vtantia Jan 5, 2021
89d865f
Perform formatting (black, isort, flake8)
vtantia Jan 5, 2021
8157603
Add scripts to load environment and format code
vtantia Jan 5, 2021
5d458f9
Add stubs for fairscale script
vtantia Jan 5, 2021
9fdd823
[Temp] Comment out a line in stubs to fix error message
vtantia Jan 5, 2021
3a09576
Remove remaining adpsgd code
vtantia Jan 6, 2021
96fec9e
Remove unnecessary function
vtantia Jan 19, 2021
d32d384
Add mypy typing to GossipDataParallel
vtantia Jan 19, 2021
015537f
Fix formatting
vtantia Jan 19, 2021
9b5aff7
Make format.sh a script
vtantia Jan 19, 2021
cc83b84
Make flaky test log message clearer
vtantia Jan 21, 2021
9c78976
Fix minor bug in mypy implementation
vtantia Jan 21, 2021
3068a34
Add tests for SGP
vtantia Jan 21, 2021
dbb4eb3
Minor mypy changes
vtantia Jan 21, 2021
4b4c373
Fix errors with multiple process groups by synchronizing appropriately
vtantia Jan 21, 2021
ab01f16
Remove deprecated file
vtantia Jan 21, 2021
993c6ff
Fix mypy in utils/helpers.py
vtantia Jan 21, 2021
5e30d5d
Finish mypy typing for distributed.py
vtantia Feb 2, 2021
00c1ff2
Add typing to and format test files
vtantia Feb 2, 2021
7d75ab2
Fix mypy errors including those for switching to Python 3.6
vtantia Feb 2, 2021
7c1e998
Temporary commit - cleaning up parameters
vtantia Feb 2, 2021
92aef32
Remove single process support to make code cleaner
vtantia Feb 2, 2021
0e6f6ea
Change localsgd to be set as an option
vtantia Feb 2, 2021
98f9d36
Refactor perform_additional_optimizer_actions function
vtantia Feb 2, 2021
9427e12
Clean up
vtantia Feb 2, 2021
b1c66c7
Factor out sgp_int
vtantia Feb 2, 2021
43efe01
Add temporary comments to prevent auto-formatting of argument separation
vtantia Feb 9, 2021
70ab95f
Rename sgp functions. Move sgp and slowmo functions together
vtantia Feb 9, 2021
383dfbc
Factorize creation of process groups in SlowMo
vtantia Feb 9, 2021
8f8a275
Remove extra variable
vtantia Feb 9, 2021
d5d4108
Change default value of localsgd_frequency to 3
vtantia Feb 9, 2021
9b99cbc
Factorize initialization of process groups
vtantia Feb 9, 2021
24bd02c
Minor name change
vtantia Feb 9, 2021
aa63481
Minor formatting change
vtantia Feb 9, 2021
242272a
Add a todo
vtantia Feb 9, 2021
2380783
Make distributed_broadcast_coalesced more generalizable
vtantia Feb 9, 2021
b67ef2c
Fix pre-commit errors (mainly mypy)
vtantia Feb 9, 2021
7c57b58
Formatting changes in scripts
vtantia Feb 9, 2021
01b34c3
Missed renaming change
vtantia Feb 9, 2021
b43d859
Precommit formatting
vtantia Feb 9, 2021
1c6549b
Add changes for fairseq fp16 optimizer
vtantia Feb 9, 2021
49db45e
Change slowmo_world_size to slowmo_num_shards
vtantia Feb 9, 2021
33a39bb
Fix flaky test and change parameter names
vtantia Feb 9, 2021
84fe38d
Fix minor bugs
vtantia Feb 9, 2021
669c90b
Fairscale pyproject change. Not sure why this happens
vtantia Feb 9, 2021
aed0595
Add a no sharding version of SlowMo. Add tests for the no sharding ve…
vtantia Feb 10, 2021
65d3861
Clean up SGP conditions
vtantia Feb 10, 2021
ed8b219
minor tweaks, seems to run fine
blefaudeux Feb 11, 2021
2bbc373
lint
blefaudeux Feb 11, 2021
e6c1b7f
Merge branch 'master' into slowmo_ben
blefaudeux Feb 11, 2021
2d7eff3
removing some changes which slipped in
blefaudeux Feb 11, 2021
ebfc864
changing the cudnn deterministic setting, seems that running all test…
blefaudeux Feb 11, 2021
898cc55
moving all the tests to pytest, would probably need a second cleanup …
blefaudeux Feb 13, 2021
ee5f94c
fix an assert on a parameter list
blefaudeux Feb 14, 2021
1675f39
small test refactor, not perfect but a bit more readable I presume
blefaudeux Feb 16, 2021
79ea7f8
does not look like setting files manually is a good idea
blefaudeux Feb 16, 2021
70e40bb
destroy process groups when done
blefaudeux Feb 17, 2021
1b74cc5
fixing unit tests firing consecutive process groups
blefaudeux Feb 19, 2021
152e004
Formatting changes
vtantia Feb 9, 2021
61d501f
Changes in documentation
vtantia Feb 10, 2021
94c6757
Add documentation for slowmo_memory_efficient
vtantia Feb 19, 2021
589a609
Make private methods start with underscore. Minor name changes
vtantia Feb 19, 2021
c10ada7
Move sgp related functions together
vtantia Feb 19, 2021
d0dece9
Minor flake8 fix
vtantia Feb 19, 2021
b9a7d8a
Remove enum SlowmoBaseAlgorithm. Use string instead
vtantia Feb 20, 2021
ecb558c
Remove extra parameter
vtantia Feb 20, 2021
4fd8be9
Change license header on all the files
vtantia Feb 20, 2021
ce50089
Rename function
vtantia Feb 20, 2021
8f7dc6e
Add tutorial for slowmo (very slightly modified from tutorial_oss.py)
vtantia Feb 20, 2021
c13e287
Fix broken tests on > 2 GPU machines
vtantia Feb 20, 2021
5d6dc69
Add SlowMo to init
vtantia Feb 20, 2021
3fa7593
Remove extra imports
vtantia Feb 20, 2021
cba3829
Minor addition missed 2 commits before
vtantia Feb 20, 2021
75b8cba
moving gossip to experimental
blefaudeux Mar 10, 2021
4ee9577
Merge branch 'master' into slowmo_ben
blefaudeux Mar 10, 2021
451cf6d
removing a change which slipped in
blefaudeux Mar 10, 2021
ef86bb1
Merge branch 'main' into slowmo_ben
blefaudeux Oct 18, 2021
b4a798f
code review + fixing an issue with model parallel tests
blefaudeux Oct 18, 2021
43ac702
removing private torch variable which seemed broken on nightly
blefaudeux Oct 18, 2021
76e87b4
addressing some more comments
blefaudeux Oct 18, 2021
fa214b7
tentatively debugging the unit tests, the interface is not too nice
blefaudeux Oct 19, 2021
1bd1b71
Fix a couple of bugs related to spawning processes
vtantia Oct 22, 2021
06f5af2
Fix a bug by ensuring that data is the same on all GPUs at setup time
vtantia Oct 22, 2021
5d025d0
Resolve comments on PR - misc
vtantia Oct 28, 2021
8fc366b
Resolve comments on PR - break rank and world_size into 2 variables
vtantia Oct 28, 2021
c89bdc9
Refactor to clean up _maybe_create_process_groups
vtantia Oct 28, 2021
1110eff
Fix non-deterministic behaviour in a clean way
vtantia Oct 28, 2021
7bf8017
Merge branch 'main' into slowmo_ben
vtantia Oct 28, 2021
da5357d
Fix bug by removing residual option
vtantia Oct 29, 2021
4d29165
Migrate list to deque to prevent future memory leak
vtantia Oct 29, 2021
e1aca67
Address PR comments
vtantia Oct 29, 2021
20f55ed
Minor formatting fixes
vtantia Oct 29, 2021
8d98b4d
Change slowmo_base_algorithm from string to Enum
vtantia Oct 29, 2021
cc3e829
Remove extra cast in the code
vtantia Oct 29, 2021
f3d91ec
Address PR comments
vtantia Oct 29, 2021
3eeef73
Update documentation to include SlowMo. Add tutorial. Remove tutorial…
vtantia Oct 27, 2021
e2a9d13
Modify docs to add custom sections
vtantia Nov 2, 2021
318dd91
Address comments in PR in docs and tutorials
vtantia Nov 2, 2021
f83d5ad
Convert class and methods to abstract to address PR review
vtantia Nov 2, 2021
fb8383d
Address further comments in PR in docs and tutorials
vtantia Nov 2, 2021
e19cc2a
Fix minor typo
vtantia Nov 2, 2021
da5bb69
Fix backticks linter error
vtantia Nov 2, 2021
ebe1196
Minor refactor - Rename an argument to remove Sphinx error
vtantia Nov 2, 2021
a19323e
Minor renaming in docs
vtantia Nov 3, 2021
306dbef
Merge branch 'main' into slowmo_ben
vtantia Nov 3, 2021
39383bd
Minor addition to CHANGELOG.md
vtantia Nov 3, 2021
5bd07f9
Merge branch 'main' into slowmo_ben
vtantia Nov 3, 2021
c6d0273
Add deep dive for SlowMo
vtantia Nov 4, 2021
b59a835
Modify deep dive and tutorial to address recommendations in code review
vtantia Nov 5, 2021
d9765a7
Minor refactor - name change
vtantia Nov 5, 2021
122e082
Modify deep dive to make condition for using SlowMo clearer
vtantia Nov 5, 2021
b325371
Modification to CHANGELOG.md to address review comments
vtantia Nov 5, 2021
45830c1
Add changes in documentation to address code review
vtantia Nov 5, 2021
c7242de
Fix minor linter error
vtantia Nov 5, 2021
22efbaa
Fix missing parameter in docs
vtantia Nov 5, 2021
68ff8f1
Fix link in docs
vtantia Nov 5, 2021
d0d94d0
Fix missing parameter in docs
vtantia Nov 5, 2021
67f6003
Modification to tutorials to address code review comments
vtantia Nov 8, 2021
9cf9153
Merge branch 'main' into slowmo_ben
vtantia Nov 8, 2021
41 changes: 22 additions & 19 deletions docs/source/deep_dive/slowmo_ddp.rst
@@ -5,61 +5,64 @@ Training neural networks in a distributed data-parallel manner results in non-linear
between the different nodes (as well as, to a lesser extent though, synchronization between the different nodes). So, a distributed training run
with 8 nodes is not 8x faster than a run with 1 node as we would expect it to be.

-SlowMo Distributed Data Parallel aims to solve this by replacing the exact allreduce between gradients, which is typically done, with an approximate
+SlowMo Distributed Data Parallel aims to solve this by replacing the typical exact allreduce between gradients with an approximate
averaging of parameters. This approximate averaging reduces both the time spent on communication as well as the synchronization between different
-nodes. It uses one of the following two algorithms (configurable) as a base algorithm for this purpose -
+nodes. It uses one of the following two algorithms (configurable) as a base algorithm for this purpose:

-* `Local <https://arxiv.org/abs/1602.05629>`_ `SGD <https://arxiv.org/abs/1705.09056>`_. This algorithm does an allreduce of the parameters every few iterations.
+* Local SGD (papers `#1 <https://arxiv.org/abs/1602.05629>`_ and `#2 <https://arxiv.org/abs/1705.09056>`_). This algorithm does an allreduce of the parameters every few iterations.

* `Stochastic Gradient Push <https://arxiv.org/abs/1811.10792>`_ (SGP). This algorithm involves one-to-one communications between nodes.

-These base algorithms (LocalSGD and SGP), when used only by themselves, result in reduced accuracy. The `SlowMo <https://arxiv.org/abs/1910.00643>`_
-algorithm removes this accuracy loss by doing a slow momentum step, typically, every 48 iterations.
+These base algorithms (LocalSGD and SGP), when used only by themselves, result in reduced model quality (measured as accuracy in a classification
+setting). The `SlowMo <https://arxiv.org/abs/1910.00643>`_ algorithm alleviates this issue by doing a slow momentum step, typically, every 48 iterations.

The training process with SlowMo looks as follows:

1. Compute the forward pass.

2. Compute the backward pass.

-3. During the backward pass, using a backward hook, the gradients are synchronized using allreduce across the different GPUs on a node.
+3. During the backward pass, using a backward hook, on each node, the gradients are synchronized using allreduce across the different GPUs on
+   that node.

-4. Perform the ``optimizer.step()`` to update parameters on a node with the gradients of that node.
+4. Perform the ``optimizer.step()`` to update parameters on each node with the gradients of that node.

5. Approximately average the parameters using a base algorithm - one of LocalSGD or SGP (both are described above).

6. Perform the slow momentum update step once every ``slowmo_frequency`` (typically 48) iterations. In this step, the parameters on different
-nodes are reduced, followed by a ``slowmo_optimizer.step()``. Note that this ``slowmo_optimizer`` is different from the original optimizer,
-and it is done in a `Zero-1 <https://fairscale.readthedocs.io/en/latest/deep_dive/oss_sdp_fsdp.html>`_ like manner to save memory.
+nodes are (exactly) averaged, followed by a ``slowmo_optimizer.step()``. Note that this ``slowmo_optimizer`` is different from the original optimizer,
+and it is done in a `Zero-1 <./oss_sdp_fsdp.html>`_ like manner to save memory.
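
As a minimal sketch of the six steps above (``MyModel``, ``dataloader``, ``loss_fn``, ``rank``, and the ``nprocs_per_node`` value are placeholders; the wrapper name, ``perform_slowmo()``, and the ``zero_grad()`` placement follow the tutorial further down in this PR):

.. code-block:: python

    import torch
    from fairscale.experimental.nn.data_parallel import SlowMoDistributedDataParallel as SlowMoDDP

    model = SlowMoDDP(MyModel().to(rank), nprocs_per_node=8)  # steps 3 and 5 are handled by the wrapper
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for data, target in dataloader:
        data, target = data.to(rank), target.to(rank)
        output = model(data)                 # step 1: forward pass
        loss = loss_fn(output, target)
        loss.backward()                      # steps 2-3: backward pass; per-node allreduce via backward hooks
        optimizer.step()                     # step 4: local optimizer step on this node's gradients
        model.zero_grad()                    # free gradients before the slow momentum step (see tip 4 below)
        model.perform_slowmo(optimizer)      # steps 5-6: approximate averaging + slow momentum every slowmo_frequency iterations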

Best practices for using ``SlowMoDistributedDataParallel``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-1. SlowMo will be useful in deep learning workloads which run on greater than 2 nodes in clusters with a slow interconnect, eg Ethernet.
+1. SlowMo will be useful in deep learning workloads which run on more than 2 nodes in clusters with a slow interconnect, eg Ethernet.

-2. SlowMo should be useful in your workload if the following condition holds:
+2. SlowMo should be useful in your workload if the following condition holds (in case you are using SGP as the base algorithm, the value of ``localsgd_frequency`` can be plugged in as 2):

:math:`\textrm{time_taken_for_all_reduce_of_gradients} \times (1 - \frac{1}{\textrm{localsgd_frequency}} ) > \textrm{time_taken_for_backward_pass}`

In clusters with slower interconnect, ``time_taken_for_all_reduce_of_gradients`` will go up, leading to SlowMo being more useful. ``localsgd_frequency``
is also an important factor here. More details on varying that to affect performance are in tip 2 of
`Performance tips for SlowMoDistributedDataParallel`_.

-3. ``slowmo_momentum`` will need to be tuned for obtaining good accuracy. A random search across 4 values from [0.1, 0.2, ..., 0.7] should be good enough
+3. ``slowmo_momentum`` will need to be tuned for obtaining good model quality. A grid search across {0.0, 0.1, 0.2, 0.4, 0.6} should be good enough
for tuning. This ``slowmo_momentum`` value holds consistent across multiple runs with similar settings. When the number of nodes used is increased,
however, a higher value of ``slowmo_momentum`` will typically be needed. More details about this can be found in the
-`documentation <https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html>`_.
+`documentation <../api/experimental/nn/slowmo_ddp.html>`_.

-4. Adding SlowMo involves two steps, which can be found in the `tutorial <https://fairscale.readthedocs.io/en/latest/tutorials/slowmo_ddp.html>`_.
+4. Adding SlowMo to existing Distributed Data Parallel code involves two steps, which can be found in the `tutorial <../tutorials/slowmo_ddp.html>`_.
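
As a rough illustration of the condition in tip 2 above — all timings below are hypothetical, and only the inequality itself is taken from the text:

.. code-block:: python

    # Hypothetical per-iteration timings, in seconds, for one workload.
    time_taken_for_all_reduce_of_gradients = 0.30
    time_taken_for_backward_pass = 0.12
    localsgd_frequency = 3  # plug in 2 here when SGP is the base algorithm

    # The condition from tip 2: the communication saved by averaging only every
    # localsgd_frequency iterations should exceed the backward-pass time.
    saved = time_taken_for_all_reduce_of_gradients * (1 - 1 / localsgd_frequency)
    print(saved > time_taken_for_backward_pass)  # 0.30 * (1 - 1/3) = 0.20 > 0.12 -> True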

Performance tips for ``SlowMoDistributedDataParallel``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-1. ``nprocs_per_node`` should be set to the number of GPUs in a node. This allows the API to exploit the fast interconnect between different GPUs
-   on a node.
+1. ``nprocs_per_node`` should be set to the number of GPUs on a node (this number should be the same on each node). This allows the API
+   to exploit the fast interconnect between different GPUs on a node.

-2. Increasing the ``localsgd_frequency`` results in an increase in speed. However, it comes with a tradeoff of reducing the accuracy.
+2. Increasing the ``localsgd_frequency`` results in an increase in speed. However, it comes with a tradeoff of reducing the model quality.
We recommend keeping the ``localsgd_frequency`` at 3.

-3. ``slowmo_memory_efficient`` should typically be used. It reduces memory usage by sharding the extra slow momentum optimizer's parameters in
-   a `Zero-1`_ like manner.
+3. ``slowmo_memory_efficient`` should typically be used (this is the default behavior). It reduces memory usage by sharding the additional
+   slow momentum optimizer's parameters in a `Zero-1`_ like manner.

4. A call to ``model.zero_grad()`` should be made after ``optimizer.step()`` in order to save memory for the ``model.perform_slowmo()`` step.
56 changes: 12 additions & 44 deletions docs/source/tutorials/slowmo_ddp.rst
@@ -7,48 +7,14 @@ clusters with low interconnect speeds between different nodes. When using
SlowMo, the models on the different nodes are no longer kept in sync after each
iteration, which leads to the optimization dynamics being affected. The end
result is close to the results of Distributed Data Parallel, but is not exactly
-the same. Let's suppose that your trainer looks like:
+the same.

-.. code-block:: python
-
-
-    import torch
-    from torch.nn.parallel import DistributedDataParallel as DDP
-
-
-    def train(
-        rank: int,
-        world_size: int,
-        epochs: int):
-
-        # process group init
-        dist_init(rank, world_size)
-
-        # Problem statement
-        model = MyAwesomeModel().to(rank)
-        model = DDP(model, device_ids=[rank])
-        dataloader = MySuperFastDataloader()
-        loss_ln = MyVeryRelevantLoss()
-        optimizer = MyAmazingOptimizer()
-
-        # Any relevant training loop, nothing specific to SlowMoDDP
-        # For example:
-        model.train()
-        for e in range(epochs):
-            for (data, targets) in dataloader:
-                data, targets = data.to(rank), targets.to(rank)
-                # Train
-                model.zero_grad()
-                outputs = model(data)
-                loss = loss_fn(outputs, targets)
-                loss.backward()
-                optimizer.step()
-
-
-Then using SlowMo Distributed Data Parallel is simply replacing the DDP call with a call to
-``fairscale.experimental.nn.data_parallel.SlowMoDistributedDataParallel`` and adding a
-``model.perform_slowmo(optimizer)`` call after ``optimizer.step()``, as follows. The different
-points at which ``use_slowmo`` is used below help demonstrate these changes.
+If you have code that is setup to use Distributed Data Parallel, using SlowMo Distributed Data Parallel
+is simply replacing the DDP call with a call to
+``fairscale.experimental.nn.data_parallel.SlowMoDistributedDataParallel``, adding a
+``model.perform_slowmo(optimizer)`` call after ``optimizer.step()``, and moving the ``model.zero_grad()``
+to be after ``optimizer.step()``, as follows. The different points at which ``use_slowmo`` is used
+below help demonstrate these changes:

.. code-block:: python

@@ -84,18 +50,20 @@ points at which ``use_slowmo`` is used below help demonstrate these changes.
            for (data, target) in dataloader:
                data, target = data.to(rank), target.to(rank)
                # Train
-                model.zero_grad()
+                if not use_slowmo:
+                    model.zero_grad()
                outputs = model(data)
                loss = loss_fn(outputs, target)
                loss.backward()
                optimizer.step()
+                if use_slowmo:
+                    model.zero_grad()

I assume it is important to have it here for SlowMo, but I don't remember why: it would be good to explain it in the docstring of perform_slowmo() and refer to this doc here.

Thanks for the update, a couple of follow-up comments on this point:

  1. Minor: would it be possible for the doc link to directly point to the perform_slowmo part of the page? (no big deal if not possible)

  2. How does this save memory? According to the doc (https://pytorch.org/docs/stable/generated/torch.nn.Module.html) it won't flush the tensors unless set_to_none is set to True

  1. Have fixed this. The link is a little ugly but it has very little chance of breaking in the future, so it might be good to go ahead with

  2. Ahh nice catch, I've fixed that. In the fairseq repo, setting to None was the default behavior of zero_grad so I got confused about that

+                model.perform_slowmo(optimizer) # SlowMoDDP specific
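
The behavior the reviewer is asking about is standard PyTorch rather than something introduced by this PR; a small sketch of the difference (``set_to_none`` was added to ``Module.zero_grad()`` around PyTorch 1.7, and older releases defaulted it to ``False``):

.. code-block:: python

    import torch

    model = torch.nn.Linear(4, 4)
    model(torch.randn(2, 4)).sum().backward()

    model.zero_grad(set_to_none=False)  # gradients are zeroed in place; their memory stays allocated
    print(model.weight.grad is None)    # False

    model.zero_grad(set_to_none=True)   # gradient tensors are dropped, so their memory can be freed
    print(model.weight.grad is None)    # True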

In the example above, when using SlowMoDDP, we are reducing the total communication between
-nodes by 3 times as the default ``localsgd_frequency`` is set to 3.
+nodes by 3 times as the default ``localsgd_frequency`` is set to 3 by default.
SlowMoDDP takes in ``slowmo_momentum`` as a parameter. This parameter may need to be tuned
depending on your use case. It also takes in ``nprocs_per_node`` which should be typically set
to the number of GPUs on a node. Please look at the
-`documentation <https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html>`_
+`documentation <../api/experimental/nn/slowmo_ddp.html>`_
for more details on these parameters as well as other advanced settings of the SlowMo algorithm.
8 changes: 5 additions & 3 deletions fairscale/experimental/nn/data_parallel/gossip/distributed.py
@@ -284,7 +284,9 @@ def __init__(

        self.slowmo_memory_efficient = slowmo_memory_efficient
        self.slowmo_num_shards = min(self.process_world_size, slowmo_num_shards) if self.slowmo_memory_efficient else 1
-        self.is_computing_slowmo = self.process_rank < self.slowmo_num_shards if self.slowmo_memory_efficient else True
+        self.is_current_node_a_slowmo_shard = (
+            self.process_rank < self.slowmo_num_shards if self.slowmo_memory_efficient else True
+        )

        self.nprocs_per_node_device = torch.tensor([self.nprocs_per_node], device=comm_device, dtype=first_param_dtype)

@@ -660,7 +662,7 @@ def _init_global_momentum_buffers(self, optimizer: torch.optim.Optimizer) -> None:

        self.world_portion_length = (total_elements + self.slowmo_num_shards - 1) // self.slowmo_num_shards

-        if not self.is_computing_slowmo:
+        if not self.is_current_node_a_slowmo_shard:
            return

        self.portion_start = self.process_rank * self.world_portion_length if self.slowmo_memory_efficient else 0
@@ -747,7 +749,7 @@ def _global_momentum_step(self, optimizer: torch.optim.Optimizer) -> None:
        if self.slowmo_memory_efficient:
            self._distributed_comm(optimizer, mode="gather")

-        if self.is_computing_slowmo:
+        if self.is_current_node_a_slowmo_shard:
            self._perform_local_optimization(optimizer)

        if self.slowmo_memory_efficient:
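
The renamed flag above decides whether a rank takes part in the slow momentum step. A simplified sketch of the sharding arithmetic visible in this hunk (the helper name and the clamping of the last shard are illustrative, not taken from the fairscale source): the first ``slowmo_num_shards`` ranks each own a contiguous portion of the flattened slow momentum state, of length ``ceil(total_elements / slowmo_num_shards)``.

.. code-block:: python

    # Simplified sketch, assuming the ceiling-division split and rank check shown above.
    def shard_bounds(total_elements: int, num_shards: int, rank: int):
        portion_length = (total_elements + num_shards - 1) // num_shards
        if rank >= num_shards:  # i.e. is_current_node_a_slowmo_shard would be False
            return None
        start = rank * portion_length
        return start, min(start + portion_length, total_elements)

    print(shard_bounds(total_elements=10, num_shards=4, rank=0))  # (0, 3)
    print(shard_bounds(total_elements=10, num_shards=4, rank=3))  # (9, 10)
    print(shard_bounds(total_elements=10, num_shards=4, rank=5))  # None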