# 1. Scaling UP (mono-node)

A note on GPU training: we naturally assume that a GPU is better than a CPU, but it really depends on the workflow. The GPU only pays off if you saturate it, typically by feeding it batches large enough to fill its memory.

Now let's go single-node multi-GPU. The same model is pushed to all available devices, each of which will:
1. Perform the forward pass with its own batch of data
2. Compute the loss and perform the backward pass, including the weight update
3. Send its weights to be collected and synchronized across all devices before the next pass
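
The three steps above can be sketched without any GPU at all. The toy below is only an illustration under stated assumptions: the "devices" are simulated by a plain Python loop, the model is a single weight `w` fitted to `y = 3x`, and `local_step` and `synchronize` are hypothetical helper names, not part of Lightning or PyTorch.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def local_step(w: float, shard: List[Sample], lr: float) -> float:
    """Steps 1 and 2 on one simulated device: forward pass, loss gradient, weight update."""
    # Gradient of mean((w * x - y) ** 2) with respect to w
    grad = sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)
    return w - lr * grad


def synchronize(weights: List[float]) -> float:
    """Step 3: collect the per-device weights and average them."""
    return sum(weights) / len(weights)


# Toy data following y = 3x, split between two simulated devices
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [data[0::2], data[1::2]]

w = 0.0
for _ in range(200):
    # Every device starts the pass from the same synchronized weight
    replicas = [local_step(w, shard, lr=0.01) for shard in shards]
    w = synchronize(replicas)

print(f"learned weight: {w:.3f}")
```

Each replica sees different data and therefore proposes a slightly different weight; averaging brings them back to a common value before the next pass.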

#### **Cleanup**

You might need to clean up ghost runs if something fails and breaks the training logic. You can do this in one of two ways:
* If you are still inside the same process (same PID) as the training launched with `python train.py`:
<code>
import gc, torch; gc.collect(); torch.cuda.empty_cache()
</code>
* Otherwise, kill the job still running on the GPU: get the ghost job's PID with the `nvitop` command, then kill it.

In [1]:
!nvitop

Thu Feb 17 14:37:20 2022
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╡
│   0  Tesla P100-PCIE...  Off  │ 00000000:00:04.0 Off │                    0 │
│ MAX   41C    P0    27W / 250W │      2MiB / 16281MiB │      0%      Default │
├───────────────────────────────┼──────────────────────┼──────────────────────┤
│   1  Tesla P100-PCIE...  Off  │ 00000000:00:05.0 Off │                    0 │
│ MAX   38C    P0    27W / 250W │      2MiB / 16281MiB │      0%      D

In [2]:
# Replace the placeholder PID 99999999 in the command below with the PID reported by nvitop
!sudo kill -15 99999999

kill: (99999999): No such process
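
If you would rather do the same thing from Python, a minimal sketch could look like the following. It sends signal 15 (`SIGTERM`), like the `kill -15` command above; `terminate` is a hypothetical helper name, and the sketch assumes a POSIX system and that you own the target process (otherwise `os.kill` raises `PermissionError`, which is deliberately not caught here).

```python
import os
import signal


def terminate(pid: int) -> bool:
    """Send SIGTERM (signal 15) to `pid`; return False if no such process exists."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        # Same situation as the "No such process" message printed by `kill`
        return False
    return True


if not terminate(99999999):
    print("No such process")
```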


## 1.1. Achieving Data Parallelism

#### **1.1.1. Strategies**

Data parallelism (DP) consists of replicating the model and training each replica on a different mini-batch of data of size `batch_size // num_parallel_instances`. Each replica converges differently on its own mini-batch, so the weights are collected and usually averaged after `p` batches, then synchronized across all replicas for the next round of passes.
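
To make the bookkeeping concrete, here is a small sketch of the two quantities involved: the per-instance mini-batch and the synchronization schedule. `split_batch` and `is_sync_step` are hypothetical names used for illustration only, and dropping the remainder samples is an assumption made to keep the example short.

```python
from typing import List, Sequence


def split_batch(batch: Sequence[int], num_parallel_instances: int) -> List[Sequence[int]]:
    """Cut a global batch into equal per-instance mini-batches."""
    size = len(batch) // num_parallel_instances
    # Assumption: samples left over after the integer division are dropped
    return [batch[i * size:(i + 1) * size] for i in range(num_parallel_instances)]


def is_sync_step(batch_index: int, p: int) -> bool:
    """True when the weights should be averaged, i.e. after every `p` batches."""
    return (batch_index + 1) % p == 0


global_batch = list(range(512))
mini_batches = split_batch(global_batch, num_parallel_instances=2)
print([len(mb) for mb in mini_batches])  # [256, 256]

# With p = 3, synchronization happens after the 3rd and 6th batches (indices 2 and 5)
print([i for i in range(6) if is_sync_step(i, p=3)])  # [2, 5]
```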

#### **0. Over CPU**

Let's launch a reference training on CPU. Take a look at the `trainup_cpu.py` script.

In [1]:
!python trainup_cpu.py --batch-size 512 \
                       --num-processes 2 > ${HOME}/.kosmoss/logs/trainup_cpu.stdout

Global seed set to 42
Traceback (most recent call last):
  File "trainup_cpu.py", line 68, in <module>
    args.num_processes
  File "trainup_cpu.py", line 25, in main
    out_channels=y_feats
  File "/opt/conda/lib/python3.7/site-packages/kosmoss/parallel/models.py", line 104, in __init__
    self.normalization_layer = LitMLP.Normalize(self.epsilon)
  File "/opt/conda/lib/python3.7/site-packages/kosmoss/parallel/models.py", line 80, in __init__
    stats = torch.load(osp.join(DATA_PATH, f"stats-flattened-{step}.pt"))
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file 

#### **1. Launching a training on GPU**

In [2]:
!cat trainup_gpu.py

import os.path as osp
import psutil
from pytorch_lightning import Trainer, seed_everything
from pytorch_lightning.loggers.tensorboard import TensorBoardLogger
from typing import Union

from kosmoss import CONFIG, LOGS_PATH, METADATA
from kosmoss.parallel.data import FlattenedDataModule
from kosmoss.parallel.models import LitMLP

def main(batch_size: int,
         lr: float,
         strategy: Union['ddp', 'horovod'],
         gpus: int,
         num_nodes: int) -> None:

    seed_everything(42, workers=True)
    
    step = CONFIG['timestep']
    params = METADATA[str(step)]['flattened']

    x_feats = params['x_shape'][-1]
    y_feats = params['y_shape'][-1]

    mlp = LitMLP(
        in_channels=x_feats,
        hidden_channels=100,
        out_channels=y_feats,
        
        # Adjust the learning rate accordingly to account for the increase in total batch size
        # Or use a Lightning LR Finder functionality, or any other framework's finder
        lr=lr,
    )

    cores = p

Let's launch the training with 2 nodes and 1 GPU per node. Since we're actually on a single machine, each "node" here designates an independent process.

In [3]:
%%bash
python trainup_gpu.py --batch-size 512 \
                      --lr=1e-4 \
                      --strategy 'ddp' \
                      --gpus 1 \
                      --num-nodes 2 > ${HOME}/.kosmoss/logs/trainup_gpu_ddp.stdout

Global seed set to 42
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Global seed set to 42
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
bash: line 5: 28849 Killed                  python trainup_gpu.py --batch-size 512 --lr=1e-4 --strategy 'ddp' --gpus 1 --num-nodes 2 > ${HOME}/.kosmoss/logs/trainup_gpu_ddp.stdout


CalledProcessError: Command 'b"python trainup_gpu.py --batch-size 512 \\\n                      --lr=1e-4 \\\n                      --strategy 'ddp' \\\n                      --gpus 1 \\\n                      --num-nodes 2 > ${HOME}/.kosmoss/logs/trainup_gpu_ddp.stdout\n"' returned non-zero exit status 137.
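
The exit status 137 reported above can be decoded: shells report a process killed by a signal as 128 plus the signal number. The helper below is a small sketch (`describe_exit_status` is a hypothetical name) that assumes a POSIX system. Note that the status alone does not say who sent the signal; running out of host memory is a common cause of `SIGKILL`, but that remains a guess without checking the system logs.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Map a shell exit status to a readable description."""
    if status > 128:
        try:
            # Shells report death-by-signal as 128 + signal number
            return f"killed by {signal.Signals(status - 128).name}"
        except ValueError:
            # Not a known signal number on this platform
            pass
    return f"exited with code {status}"


print(describe_exit_status(137))  # killed by SIGKILL
```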

#### **1.1.2. Horovod**

With a simple change in the Trainer options, you can rely on the Horovod backend to perform the computations. Parallelism is achieved by SPMD with MPI: one process per GPU, potentially distributed across multiple nodes, with collective computation handled by the process of rank 0.

There is no need to adjust the learning rate `lr` this time: Horovod takes care of that underneath.

In [None]:
%%bash
python trainup_gpu.py --batch-size 512 \
                      --strategy 'horovod' \
                      --gpus 1 \
                      --num-nodes 2 > ${HOME}/.kosmoss/logs/trainup_gpu_horovod.stdout
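
As a rough idea of what such a learning rate adjustment looks like, the widely used linear scaling rule multiplies the single-process rate by the number of workers, since the effective batch size grows by the same factor. Whether this matches exactly what the backend does underneath is an assumption here; `scaled_lr` and `world_size` are illustrative names.

```python
def scaled_lr(base_lr: float, world_size: int) -> float:
    """Linear scaling rule: multiply the single-process rate by the number of workers."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return base_lr * world_size


# 2 processes, starting from the rate used in the DDP run above
print(scaled_lr(1e-4, world_size=2))  # 0.0002
```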

## 1.2. A Note on Model Parallelism

Model parallelism is really worth considering starting at around 500M parameters. There is no material on it here; just know that it exists and that it is a complex subject that would require an entire session. Lightning comes standard with a series of distribution strategies, each with a specific implementation related to the network that first introduced it.

Refer to the Lightning documentation for more information.