# Distributed Data Parallel

In this tutorial we will cover distributed data parallel training using the PyTorch [distributed](https://pytorch.org/docs/stable/distributed.html) package.

[DistributedDataParallel](https://pytorch.org/docs/stable/_modules/torch/nn/parallel/distributed.html) (DDP) implements data parallelism at the module level. It uses communication collectives in the [torch.distributed](https://pytorch.org/tutorials/intermediate/dist_tuto.html) package to synchronize gradients, parameters, and buffers.

To initialize DDP we use the `init_process_group` function. This tutorial uses the `NCCL` backend for inter-process communication with the TCP initialization mechanism (by setting the URL `tcp://<master_address>:<master_port>`). See the [distributed](https://pytorch.org/docs/stable/distributed.html) documentation for details on the other supported methods.
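
The initialization step described above can be sketched roughly as follows. This is a minimal sketch, not the code of `ddp_tutorial.py`: the helper names `build_init_method` and `init_distributed` are hypothetical, and the address and port are placeholders you would supply yourself. Only `torch.distributed.init_process_group` and its `backend`, `init_method`, `rank`, and `world_size` arguments come from PyTorch.

```python
def build_init_method(master_addr: str, master_port: int) -> str:
    """Build a TCP init URL of the form tcp://<master_address>:<master_port>."""
    return f"tcp://{master_addr}:{master_port}"


def init_distributed(rank: int, world_size: int,
                     master_addr: str, master_port: int) -> None:
    """Join the process group using the NCCL backend and TCP initialization.

    Hypothetical helper: every process calls this with its own global rank
    and the same world_size, master_addr, and master_port.
    """
    # Imported lazily so the URL helper above can be used without PyTorch.
    import torch.distributed as dist

    dist.init_process_group(
        backend="nccl",
        init_method=build_init_method(master_addr, master_port),
        rank=rank,
        world_size=world_size,
    )
```

For the setup in this tutorial, each of the four processes would call something like `init_distributed(rank, 4, master_addr, master_port)` with its own rank, where the address and port identify the machine running rank 0.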



We will use a two-node setup with two GPUs per node, as shown in the diagram below:

![network](./images/docker-ptgtc.jpeg)


The following jobs will be launched for DDP training:

1. Rank 0 (master) on Node1
2. Rank 1 (worker1) on Node1
3. Rank 2 (worker1) on Node2
4. Rank 3 (worker2) on Node2
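
The rank layout listed above can be derived from the global rank alone. The sketch below assumes ranks are assigned contiguously per node with two GPUs per node, and that the index within the node is what is usually called the local rank; `Placement` and `placement` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    node: int        # 0-based node index (0 is Node1, 1 is Node2)
    local_rank: int  # index of the process (and its GPU) within the node


def placement(global_rank: int, gpus_per_node: int = 2) -> Placement:
    """Map a global rank to its node and local rank, assuming contiguous assignment."""
    if global_rank < 0:
        raise ValueError("global_rank must be non-negative")
    node, local_rank = divmod(global_rank, gpus_per_node)
    return Placement(node, local_rank)


for rank in range(4):
    p = placement(rank)
    print(f"rank {rank}: Node{p.node + 1}, local rank {p.local_rank}")
```

With four ranks this yields local ranks 0, 1, 0, 1, which matches the pairs of numbers passed to `ddp_tutorial.py` in the launch commands.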

Launch the first two jobs on Node1. Open a terminal window and run the following commands for the master job (rank 0):

    export NCCL_DEBUG=info 
    python ./code/ddp_tutorial.py 0 0

Open a second terminal window on Node1 and run the following commands for the worker process (rank 1):

    export NCCL_DEBUG=info
    python ./code/ddp_tutorial.py 1 1

Launch the second set of jobs on Node2. Open a terminal window and run the following commands for the first job on Node2 (rank 2):

    export NCCL_DEBUG=info
    python ./code/ddp_tutorial.py 2 0

Open a second terminal window on Node2 and run the following commands for the second job on Node2 (rank 3):

    export NCCL_DEBUG=info
    python ./code/ddp_tutorial.py 3 1
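
The launch commands pass two integers to the script. The sketch below shows one way a script could consume them, assuming the first is the global rank and the second is the local rank (GPU index on the node). That interpretation is an assumption based on the commands above, and `parse_ranks` and `wrap_model` are hypothetical helpers, not the contents of `ddp_tutorial.py`.

```python
import argparse
from typing import Sequence, Tuple


def parse_ranks(argv: Sequence[str]) -> Tuple[int, int]:
    """Parse the two positional arguments: global rank, then local rank (assumed)."""
    parser = argparse.ArgumentParser(description="Hypothetical DDP job arguments")
    parser.add_argument("global_rank", type=int, help="rank across all nodes")
    parser.add_argument("local_rank", type=int, help="GPU index on this node")
    args = parser.parse_args(argv)
    return args.global_rank, args.local_rank


def wrap_model(model, local_rank: int):
    """Move a module to this process's GPU and wrap it in DistributedDataParallel.

    Must be called after the process group has been initialized.
    """
    # Imported lazily so argument parsing works without PyTorch installed.
    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    torch.cuda.set_device(local_rank)
    model = model.to(torch.device("cuda", local_rank))
    return DDP(model, device_ids=[local_rank])
```

Under this reading, `parse_ranks(["2", "0"])` returns `(2, 0)`, that is, global rank 2 running on the first GPU of Node2.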

Open a third terminal window on Node1 and run the following command to monitor the ports (`-i` lists network connections, while `-P` and `-n` keep port numbers and addresses numeric):

    lsof -i -P -n
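
As a complement to `lsof`, a quick reachability check from the other node can confirm that the master port accepts connections. This is a minimal sketch using only the Python standard library; `is_port_open` is a hypothetical helper, and the host and port you pass would be the master address and port used for TCP initialization.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A `False` result before the rank 0 job has started is expected, since nothing is listening on the port yet.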