From 14bd9c62a3084ebc1cb3b5feb1e6b83f90be6449 Mon Sep 17 00:00:00 2001
From: Myle Ott
Date: Mon, 7 Jan 2019 10:40:39 -0800
Subject: [PATCH] Update docs for --lazy-load and torch.distributed.launch

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/433

Differential Revision: D13588032

Pulled By: myleott

fbshipit-source-id: 0e5ff361e27b206c4490264f0f51863367499e81
---
 docs/getting_started.rst | 66 +++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 41 deletions(-)

diff --git a/docs/getting_started.rst b/docs/getting_started.rst
index 6f5d5a67a1..1658d4109f 100644
--- a/docs/getting_started.rst
+++ b/docs/getting_started.rst
@@ -154,51 +154,35 @@ Fairseq supports FP16 training with the ``--fp16`` flag:
 
     > python train.py --fp16 (...)
 
-Distributed training
---------------------
+Lazily loading large training datasets
+--------------------------------------
 
-Distributed training in fairseq is implemented on top of
-`torch.distributed <http://pytorch.org/docs/master/distributed.html>`__.
-Training begins by launching one worker process per GPU. These workers
-discover each other via a unique host and port (required) that can be
-used to establish an initial connection. Additionally, each worker has a
-rank, that is a unique number from 0 to n-1 where n is the total number
-of GPUs.
+By default fairseq loads the entire training set into system memory. For large
+datasets, the ``--lazy-load`` option can be used to instead load batches on-demand.
+For optimal performance, use the ``--num-workers`` option to control the number
+of background processes that will load batches.
 
-If you run on a cluster managed by
-`SLURM <https://slurm.schedmd.com/>`__ you can train a large
-English-French model on the WMT 2014 dataset on 16 nodes with 8 GPUs
-each (in total 128 GPUs) using this command:
+Distributed training
+--------------------
 
-.. code-block:: console
+Distributed training in fairseq is implemented on top of ``torch.distributed``.
+The easiest way to launch jobs is with the `torch.distributed.launch
+<https://pytorch.org/docs/stable/distributed.html#launch-utility>`__ tool.
 
-    > DATA=...   # path to the preprocessed dataset, must be visible from all nodes
-    > PORT=9218  # any available TCP port that can be used by the trainer to establish initial connection
-    > sbatch --job-name fairseq-py --gres gpu:8 --cpus-per-task 10 \
-        --nodes 16 --ntasks-per-node 8 \
-        --wrap 'srun --output train.log.node%t --error train.stderr.node%t.%j \
-        python train.py $DATA \
-        --distributed-world-size 128 \
-        --distributed-port $PORT \
-        --force-anneal 50 --lr-scheduler fixed --max-epoch 55 \
-        --arch fconv_wmt_en_fr --optimizer nag --lr 0.1,4 --max-tokens 3000 \
-        --clip-norm 0.1 --dropout 0.1 --criterion label_smoothed_cross_entropy \
-        --label-smoothing 0.1 --wd 0.0001'
-
-Alternatively you can manually start one process per GPU:
+For example, to train a large English-German Transformer model on 2 nodes each
+with 8 GPUs (in total 16 GPUs), run the following command on each node,
+replacing ``node_rank=0`` with ``node_rank=1`` on the second node:
 
 .. code-block:: console
 
-    > DATA=...   # path to the preprocessed dataset, must be visible from all nodes
-    > HOST_PORT=master.example.com:9218  # one of the hosts used by the job
-    > RANK=...   # the rank of this process, from 0 to 127 in case of 128 GPUs
-    > LOCAL_RANK=...  # the local rank of this process, from 0 to 7 in case of 8 GPUs per machine
-    > python train.py $DATA \
-        --distributed-world-size 128 \
-        --distributed-init-method 'tcp://$HOST_PORT' \
-        --distributed-rank $RANK \
-        --device-id $LOCAL_RANK \
-        --force-anneal 50 --lr-scheduler fixed --max-epoch 55 \
-        --arch fconv_wmt_en_fr --optimizer nag --lr 0.1,4 --max-tokens 3000 \
-        --clip-norm 0.1 --dropout 0.1 --criterion label_smoothed_cross_entropy \
-        --label-smoothing 0.1 --wd 0.0001
+    > python -m torch.distributed.launch --nproc_per_node=8 \
+        --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
+        --master_port=1234 \
+        train.py data-bin/wmt16_en_de_bpe32k \
+        --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
+        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
+        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
+        --lr 0.0005 --min-lr 1e-09 \
+        --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
+        --max-tokens 3584 \
+        --fp16
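+
+Lazy loading (described above) can be combined with distributed training. As a
+rough sketch, the lazy loading flags are simply appended to the training
+command; the ``--num-workers`` value below is illustrative and ``(...)``
+stands for the same model and optimization flags as above:
+
+.. code-block:: console
+
+    > python -m torch.distributed.launch --nproc_per_node=8 \
+        --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
+        --master_port=1234 \
+        train.py data-bin/wmt16_en_de_bpe32k \
+        (...) \
+        --lazy-load --num-workers 4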