## summary

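- Data parallelism vs. model parallelism
    - Data parallelism: the model is copied to every device, and the data is split (chunked) along the batch dimension
        - The module is replicated on each device, and each replica handles a portion of the input.
        - During the backward pass, gradients from each replica are summed into the original module.
    - Model parallelism: the data is copied to every device, and the model itself is split (chunked) across devices (needed when the model does not fit on a single GPU)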
- DP => DDP
    - DP: `nn.DataParallel`
        - https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html
    - DDP: `nn.parallel.DistributedDataParallel`
    - The PyTorch documentation recommends using `nn.parallel.DistributedDataParallel` instead of multiprocessing or `nn.DataParallel`.
- References
    - https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
    - https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html

## Imports and parameters


In [1]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 30
data_size = 100

In [4]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 
device

device(type='cuda', index=0)

## dummy dataset

In [5]:
class RandomDataset(Dataset):
    # Dummy dataset of random vectors

    def __init__(self, size, length):
        self.len = length
        # shape: (length, size) = (100, 5)
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        # one sample, shape: (5,)
        return self.data[index]

    def __len__(self):
        # number of samples: 100
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, 
                         shuffle=True)

In [6]:
next(iter(rand_loader)).shape

torch.Size([30, 5])
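
Since `data_size = 100` is not a multiple of `batch_size = 30`, the last batch is smaller than the others. A minimal sketch of this, using `TensorDataset` as a stand-in for `RandomDataset` (an assumption made only to keep the snippet self-contained):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 random samples with 5 features each, as in the notebook.
dataset = TensorDataset(torch.randn(100, 5))

loader = DataLoader(dataset, batch_size=30, shuffle=True)
batch_sizes = [x.shape[0] for (x,) in loader]
print(batch_sizes)  # [30, 30, 30, 10]

# drop_last=True discards the incomplete final batch.
loader_drop = DataLoader(dataset, batch_size=30, shuffle=True, drop_last=True)
batch_sizes_drop = [x.shape[0] for (x,) in loader_drop]
print(batch_sizes_drop)  # [30, 30, 30]
```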

## simple model

In [7]:
class Model(nn.Module):
    # Our model: a single linear layer

    def __init__(self, input_size, output_size):
        # maps 5 input features to 2 output features
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        # printed once per replica, so the per-device batch size is visible
        print("\tIn Model: input size", input.size(),
              "output size", output.size())

        return output

## DataParallel

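- https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html
    - `device_ids=None`
        - which GPUs take part in training, e.g. `device_ids=gpus` (default: all visible GPUs)
    - `output_device=None`
        - the GPU on which the outputs are gathered, e.g. `output_device=gpus[0]` (default: `device_ids[0]`)
    - `dim=0`
        - the dimension along which the input is split, i.e. the batch dimension
- The parallelized module must have its parameters and buffers on `device_ids[0]` before running (forward/backward) this DataParallel module.
    - `model.to('cuda:0')`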
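
The parameters listed above can be passed explicitly. The following is a minimal sketch, assuming `gpus` is simply the list of all visible GPU indices (a name introduced here for illustration); it falls back to the plain module when fewer than two GPUs are available.

```python
import torch
import torch.nn as nn

model = nn.Linear(5, 2)
n_gpus = torch.cuda.device_count()

if n_gpus > 1:
    gpus = list(range(n_gpus))  # illustrative: use every visible GPU
    # Parameters and buffers must live on device_ids[0] before forward/backward.
    model = model.to(f"cuda:{gpus[0]}")
    model = nn.DataParallel(model,
                            device_ids=gpus,         # GPUs that hold a replica
                            output_device=gpus[0],   # where outputs are gathered
                            dim=0)                   # split along the batch dimension
elif n_gpus == 1:
    model = model.to("cuda:0")  # single GPU: no wrapper needed

wrapped = isinstance(model, nn.DataParallel)
print("wrapped in DataParallel:", wrapped)
```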

In [8]:
# nn.Linear(5, 2): 5 input features -> 2 output features
model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # splits along dim=0, e.g. [30, ...] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    # (on the 2 GPUs used here: [15, ...], [15, ...])
    model = nn.DataParallel(model)

Let's use 2 GPUs!


In [9]:
model

DataParallel(
  (module): Model(
    (fc): Linear(in_features=5, out_features=2, bias=True)
  )
)
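
As the printed structure shows, the original model is stored under the `module` attribute of the wrapper. A small sketch of what this implies, assuming a bare `nn.Linear` in place of `Model`: the wrapped network is reachable through `.module`, and the keys of the wrapper's `state_dict()` carry a `module.` prefix.

```python
import torch.nn as nn

inner = nn.Linear(5, 2)
wrapped = nn.DataParallel(inner)

# The wrapper keeps a reference to the original module, not a copy.
same_object = wrapped.module is inner

# Parameter names gain a "module." prefix in the wrapper's state_dict.
wrapped_keys = sorted(wrapped.state_dict().keys())
inner_keys = sorted(wrapped.module.state_dict().keys())
print(wrapped_keys)  # ['module.bias', 'module.weight']
print(inner_keys)    # ['bias', 'weight']
```

Saving `wrapped.module.state_dict()` rather than `wrapped.state_dict()` keeps the checkpoint loadable into an unwrapped model.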

In [10]:
# model = model.to(device)
model.to(device)

DataParallel(
  (module): Model(
    (fc): Linear(in_features=5, out_features=2, bias=True)
  )
)

### tensors: to(device)

In [11]:
a = torch.randn(3, 4)
print('a.is_cuda', a.is_cuda)
b = a.to('cuda:0')
print('a.is_cuda', a.is_cuda)
print('b.is_cuda', b.is_cuda)
# a and b are different tensors: Tensor.to returns a copy on the GPU, a stays on the CPU

a.is_cuda False
a.is_cuda False
b.is_cuda True


### models: to(device)

In [12]:
a = Model(3, 4)
print(next(a.parameters()).is_cuda)
b = a.to('cuda:0')
print(next(a.parameters()).is_cuda)
print(next(b.parameters()).is_cuda)
# a and b point to the same model: Module.to moves the parameters in place and returns self

False
True
True
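
The same difference between tensors and modules can be checked on a machine without a GPU. This sketch converts the dtype instead of the device (an assumption made only so that it runs on CPU); `Tensor.to` and `Module.to` behave the same way in both cases.

```python
import torch
import torch.nn as nn

# Tensor.to returns a new tensor when a conversion is needed.
t = torch.randn(3, 4)
t2 = t.to(torch.float64)
tensor_same_object = t2 is t          # False: t is left untouched
print(t.dtype, t2.dtype)              # torch.float32 torch.float64

# Module.to converts the parameters in place and returns the module itself.
m = nn.Linear(3, 4)
m2 = m.to(torch.float64)
module_same_object = m2 is m          # True
print(m.weight.dtype)                 # torch.float64
```

This is why `model.to(device)` works without reassignment, while a tensor needs `x = x.to(device)`.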


## run the model (forward)

In [13]:
for data in rand_loader:
    # the input can be on any device, including CPU;
    # DataParallel scatters it across the GPUs along dim=0
    input = data.to(device)
#     input = data
    output = model(input)
    print("Outside: input size", input.size(),
          "output_size", output.size())

	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
	In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
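
The per-replica sizes in the log (15 + 15 for a full batch, 5 + 5 for the last batch of 10) follow from splitting each batch along `dim=0`. A CPU-only sketch, assuming `torch.chunk` mirrors how the batch is divided; `replica_batch_sizes` is a hypothetical helper written for this illustration, not part of PyTorch.

```python
import torch


def replica_batch_sizes(batch_size, n_devices, n_features=5):
    """Return the chunk sizes obtained by splitting a batch along dim=0."""
    batch = torch.randn(batch_size, n_features)
    return [chunk.shape[0] for chunk in torch.chunk(batch, n_devices, dim=0)]


print(replica_batch_sizes(30, 2))  # [15, 15]
print(replica_batch_sizes(10, 2))  # [5, 5]
# An uneven split: the last chunk is smaller.
print(replica_batch_sizes(10, 3))  # [4, 4, 2]
```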
