From 14bd9c62a3084ebc1cb3b5feb1e6b83f90be6449 Mon Sep 17 00:00:00 2001
From: Myle Ott
Date: Mon, 7 Jan 2019 10:40:39 -0800
Subject: [PATCH] Update docs for --lazy-load and torch.distributed.launch

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/433

Differential Revision: D13588032

Pulled By: myleott

fbshipit-source-id: 0e5ff361e27b206c4490264f0f51863367499e81
---
 docs/getting_started.rst | 66 +++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 41 deletions(-)

diff --git a/docs/getting_started.rst b/docs/getting_started.rst
index 6f5d5a67a1..1658d4109f 100644
--- a/docs/getting_started.rst
+++ b/docs/getting_started.rst
@@ -154,51 +154,35 @@ Fairseq supports FP16 training with the ``--fp16`` flag:
 
     > python train.py --fp16 (...)
 
-Distributed training
---------------------
+Lazily loading large training datasets
+--------------------------------------
 
-Distributed training in fairseq is implemented on top of
-`torch.distributed <http://pytorch.org/docs/master/distributed.html>`__.
-Training begins by launching one worker process per GPU. These workers
-discover each other via a unique host and port (required) that can be
-used to establish an initial connection. Additionally, each worker has a
-rank, that is a unique number from 0 to n-1 where n is the total number
-of GPUs.
+By default fairseq loads the entire training set into system memory. For large
+datasets, the ``--lazy-load`` option can be used to instead load batches on-demand.
+For optimal performance, use the ``--num-workers`` option to control the number
+of background processes that will load batches.
 
-If you run on a cluster managed by
-`SLURM <https://slurm.schedmd.com/>`__ you can train a large
-English-French model on the WMT 2014 dataset on 16 nodes with 8 GPUs
-each (in total 128 GPUs) using this command:
+Distributed training
+--------------------
 
-.. code-block:: console
+Distributed training in fairseq is implemented on top of ``torch.distributed``.
+The easiest way to launch jobs is with the `torch.distributed.launch
+<https://pytorch.org/docs/stable/distributed.html#launch-utility>`__ tool.
 
-    > DATA=...   # path to the preprocessed dataset, must be visible from all nodes
-    > PORT=9218  # any available TCP port that can be used by the trainer to establish initial connection
-    > sbatch --job-name fairseq-py --gres gpu:8 --cpus-per-task 10 \
-        --nodes 16 --ntasks-per-node 8 \
-        --wrap 'srun --output train.log.node%t --error train.stderr.node%t.%j \
-        python train.py $DATA \
-        --distributed-world-size 128 \
-        --distributed-port $PORT \
-        --force-anneal 50 --lr-scheduler fixed --max-epoch 55 \
-        --arch fconv_wmt_en_fr --optimizer nag --lr 0.1,4 --max-tokens 3000 \
-        --clip-norm 0.1 --dropout 0.1 --criterion label_smoothed_cross_entropy \
-        --label-smoothing 0.1 --wd 0.0001'
-
-Alternatively you can manually start one process per GPU:
+For example, to train a large English-German Transformer model on 2 nodes each
+with 8 GPUs (in total 16 GPUs), run the following command on each node,
+replacing ``node_rank=0`` with ``node_rank=1`` on the second node:
 
 .. code-block:: console
 
-    > DATA=...   # path to the preprocessed dataset, must be visible from all nodes
-    > HOST_PORT=master.example.com:9218  # one of the hosts used by the job
-    > RANK=...   # the rank of this process, from 0 to 127 in case of 128 GPUs
-    > LOCAL_RANK=...  # the local rank of this process, from 0 to 7 in case of 8 GPUs per machine
-    > python train.py $DATA \
-        --distributed-world-size 128 \
-        --distributed-init-method 'tcp://$HOST_PORT' \
-        --distributed-rank $RANK \
-        --device-id $LOCAL_RANK \
-        --force-anneal 50 --lr-scheduler fixed --max-epoch 55 \
-        --arch fconv_wmt_en_fr --optimizer nag --lr 0.1,4 --max-tokens 3000 \
-        --clip-norm 0.1 --dropout 0.1 --criterion label_smoothed_cross_entropy \
-        --label-smoothing 0.1 --wd 0.0001
+    > python -m torch.distributed.launch --nproc_per_node=8 \
+        --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
+        --master_port=1234 \
+        train.py data-bin/wmt16_en_de_bpe32k \
+        --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
+        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
+        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
+        --lr 0.0005 --min-lr 1e-09 \
+        --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
+        --max-tokens 3584 \
+        --fp16
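+
+Lazy loading (described above) can be combined with distributed training. As a
+rough sketch, the lazy loading flags are simply appended to the training
+command; the ``--num-workers`` value below is illustrative and ``(...)``
+stands for the same model and optimization flags as above:
+
+.. code-block:: console
+
+    > python -m torch.distributed.launch --nproc_per_node=8 \
+        --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
+        --master_port=1234 \
+        train.py data-bin/wmt16_en_de_bpe32k \
+        (...) \
+        --lazy-load --num-workers 4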