# 1. Scaling UP (mono-node)

A note on GPU training: we naturally assume that a GPU is better than a CPU, but it really depends on the workload. A GPU only pays off if you keep both its memory and its compute units saturated.

This session is focused on providing the candidates with minimum information to scale their current workflow with HW acceleration **AT THE APPLICATION LEVEL**. Low-level, high-value optimization is also a viable angle to address distributed training and inference, but this session does not cover it.

## What's with GPUs anyway?

DL is basically linear algebra with a few non-linear operations, and GPUs turn out to be a great tool for that kind of computation. A few pieces of context:

In 2022, Nvidia is the leader in HW dedicated to DL. The company was the first to develop and push a [suite of libraries based on CUDA](https://developer.nvidia.com/gpu-accelerated-libraries) called CUDA-X for HW acceleration of ML/DL workloads, among which:
* `cuBLAS`, `cuFFT`, `CUDA MathLib`, `cuRAND`, `cuSOLVER`, `cuSPARSE`, `cuTENSOR` for GPU-accelerated basic linear algebra (2D + nD), Fast Fourier Transform, and standard Math primitives, computations on sparse matrices
* `cuDNN` for GPU-accelerated primitives for Deep NN
* `TensorRT` for high-performance DL inference optimizer and runtime for production deployment
* `DALI`, a portable open-source library for decoding and augmenting images and videos
* Additionally, Nvidia GPUs rely on the NCCL library for fast, multi-GPU, multi-node communications, also a great tool for distributed DL.

AMD also offers ML support, although less mature, through the [ROCm framework](https://www.amd.com/en/graphics/servers-solutions-rocm-ml).

A few startups have started to tackle the HW problem from very different angles, notably:
* [Graphcore](https://www.graphcore.ai/products/ipu) with its IPU die (250 TFlop and a large in-processor memory) and its SW stack (Poplar SDK), which converts existing TF and PT models into IPU-executable code
* [Cerebras](https://cerebras.net/chip/) with its massive 850,000-core chip, the Wafer-Scale Engine, which offers high memory bandwidth and a large amount of memory per core

Google has also invested in Tensor-optimized HW with its [TPU devices](https://cloud.google.com/tpu) which, since version 3, are only available through its [cloud platform GCP](https://cloud.google.com/).
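
To make the "linear algebra with a few non-linear operations" point concrete, here is a minimal sketch of one fully connected layer in plain Python. The names `matvec`, `relu` and `dense_layer` and the toy values are hypothetical and chosen only for illustration; real frameworks dispatch the same arithmetic to libraries such as `cuBLAS` and `cuDNN` instead of looping in Python.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Linear part: one dot product per output unit."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(z: Vector) -> Vector:
    """Non-linear part: element-wise max(0, z)."""
    return [max(0.0, v) for v in z]


def dense_layer(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """One fully connected layer: relu(W @ x + b)."""
    z = [a + b for a, b in zip(matvec(weights, x), bias)]
    return relu(z)


# Hypothetical toy values, chosen only for illustration.
W = [[1.0, -2.0], [0.5, 0.5]]
b = [0.0, -1.0]
x = [1.0, 3.0]
y = dense_layer(W, b, x)  # first unit is clipped to 0 by the ReLU
```

Every output unit is an independent dot product, which is why this kind of workload maps well onto the many parallel cores of a GPU.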

## A note on cleanup

You might need to clean up your ghost runs if something fails and breaks the training logic. You can do this in one of two ways:
* If you are still inside the process that ran the training (for example one started with `python train.py`), free the memory from Python:
<code>
import gc, torch; gc.collect(); torch.cuda.empty_cache()
</code>
* Otherwise, try to kill the job still running on the GPU: get the ghost job's PID with the `nvitop` command, then send it a termination signal

In [1]:
!nvitop

Wed Mar 09 10:01:36 2022
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╡
│   0  A100-SXM4-40GB      Off  │ 00000000:07:00.0 Off │                    0 │
│ MAX   29C    P0    53W / 400W │      3MiB / 39.59GiB │      0%      Default │
├───────────────────────────────┼──────────────────────┼──────────────────────┤
│   1  A100-SXM4-40GB      Off  │ 00000000:0F:00.0 Off │                    0 │
│ MAX   28C    P0    53W / 400W │      3MiB / 39.59GiB │      0%      D

In [2]:
# Replace the placeholder PID 99999999 in the command below with the PID reported by nvitop
!sudo kill -15 99999999

kill: (99999999): No such process
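
`kill -15` sends `SIGTERM`, which asks the process to shut down cleanly. The same thing can be done from Python with the standard library; the sketch below is only an illustration, with a hypothetical helper `terminate` and a sleeping child process standing in for the ghost job.

```python
import os
import signal
import subprocess
import sys


def terminate(pid: int) -> bool:
    """Send SIGTERM (signal 15) to `pid`; return False if no such process exists."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return False
    return True


# Stand-in for a ghost job: a child process that would otherwise sleep for a minute.
ghost = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
sent = terminate(ghost.pid)
exit_code = ghost.wait(timeout=10)  # returns once the child has exited
```

The `ProcessLookupError` branch corresponds to the `No such process` message shown above. Terminating a process owned by another user raises `PermissionError` instead, which is why the shell command uses `sudo`.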


## 1.1. Achieving Data Parallelism

Now let's go single-node, multi-GPU. The same model is pushed to all available devices, and each training step then runs as follows:
1. Each device performs a forward pass with its own batch of data
2. Each device computes the loss and performs the backward pass, including the weight update
3. The weights are collected and synchronized across all devices for the next pass
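
How each device ends up with "its specific batch of data" can be sketched as a rank-strided split of the sample indices. This is a simplified illustration of the slicing that distributed samplers such as PyTorch's `DistributedSampler` apply, without shuffling or padding; the function name `shard_indices` is hypothetical.

```python
from typing import List


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Return the sample indices that replica `rank` would process.

    Simplified: no shuffling and no padding when the dataset size
    is not divisible by the number of replicas.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(range(num_samples))[rank::world_size]


world_size = 2  # e.g. two GPUs on one node
shards = [shard_indices(8, world_size, r) for r in range(world_size)]
```

With 8 samples and 2 replicas, rank 0 sees the even indices and rank 1 the odd ones, so no sample is processed twice within an epoch.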

#### **1.1.1. Strategies**

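DP consists of replicating the model and training each instance on a different mini-batch of data of size `batch_size // num_parallel_instances`. Each instance converges differently on its own mini-batches, so the weights are collected and usually averaged after `p` batches, then synchronized across all instances for the next round of passes. Note that PyTorch's DDP strategy, used below, synchronizes more often: it averages the gradients across instances at every backward pass, so all replicas hold the same weights after each optimizer step.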
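
The "train locally for `p` batches, then average the weights" scheme can be simulated in plain Python on a one-parameter model. This is a toy illustration under stated assumptions (model `y_hat = w * x`, mean squared error, hypothetical helpers `sgd_step` and `train_round`); it is not how Lightning's `ddp` strategy is implemented.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def sgd_step(w: float, batch: List[Sample], lr: float) -> float:
    """One SGD step on mean squared error for the model y_hat = w * x."""
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad


def train_round(w: float, shards: List[List[List[Sample]]], lr: float) -> float:
    """Each replica starts from `w`, runs its own mini-batches, then weights are averaged."""
    local_weights = []
    for replica_batches in shards:
        w_local = w
        for batch in replica_batches:
            w_local = sgd_step(w_local, batch, lr)
        local_weights.append(w_local)
    return sum(local_weights) / len(local_weights)


# Toy data generated from y = 3 * x
data = [(float(x), 3.0 * x) for x in range(1, 9)]
global_batch_size, num_instances = 4, 2
per_instance = global_batch_size // num_instances  # mini-batch size per replica

# Each replica takes every num_instances-th sample, then cuts its share into mini-batches.
shards = []
for rank in range(num_instances):
    own = data[rank::num_instances]
    shards.append([own[i:i + per_instance] for i in range(0, len(own), per_instance)])

w = 0.0
for _ in range(50):
    w = train_round(w, shards, lr=0.01)
```

Here each replica runs `p = 2` mini-batches of 2 samples between synchronizations, and the averaged weight converges towards the true value 3.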

#### **0. Over CPU**

Let's launch a reference training on CPU. Take a look at the `trainup_cpu.py` script.

In [3]:
!python trainup_cpu.py --batch-size 512 \
                       --num-processes 2 > ${HOME}/.kosmoss/logs/trainup_cpu.stdout

Global seed set to 42
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
  rank_zero_warn(
Global seed set to 42
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Global seed set to 42
initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------


  | Name                | Type       | Params
---------------------------------------------------
0 | normalization_layer | Normalize  | 0     
1 | net                 | Sequential | 488 K 
---------------------------------------------------
488 K     Trainable params
0         Non-trainable params
488 K     Total params
1.955     Total estimated model params size (MB)
  rank_zero_warn(
^C
Traceback (m

#### **1. Launching a training on GPU**

In [4]:
!cat trainup_gpu.py

import psutil
from pytorch_lightning import Trainer, seed_everything
from pytorch_lightning.loggers.tensorboard import TensorBoardLogger
from typing import Literal

from kosmoss import CONFIG, LOGS_PATH, METADATA
from kosmoss.parallel.data import FlattenedDataModule
from kosmoss.parallel.mlp import LitMLP

def main(batch_size: int,
         lr: float,
         strategy: Literal['ddp', 'horovod'],
         gpus: int,
         num_nodes: int) -> None:

    seed_everything(42, workers=True)
    
    step = CONFIG['timestep']
    params = METADATA[str(step)]['flattened']

    x_feats = params['x_shape'][-1]
    y_feats = params['y_shape'][-1]

    mlp = LitMLP(
        in_channels=x_feats,
        hidden_channels=100,
        out_channels=y_feats,
        
        # Adjust the learning rate accordingly to account for the increase in total batch size
        # Or use a Lightning LR Finder functionality, or any other framework's finder
        lr=lr,
    )

    cores = psutil.cpu_count(logical=F

Let's launch the training with 2 nodes and 1 GPU per node. Since we're on a single physical machine, each "node" here simply designates an independent process.

In [5]:
%%bash
python trainup_gpu.py --batch-size 512 \
                      --lr=1e-4 \
                      --strategy 'ddp' \
                      --gpus 1 \
                      --num-nodes 2 > ${HOME}/.kosmoss/logs/trainup_gpu_ddp.stdout

Global seed set to 42
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Global seed set to 42
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2


Error while terminating subprocess (pid=2764243): 
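
The comment in `trainup_gpu.py` about adjusting the learning rate to the larger total batch size can be made concrete with a little arithmetic. With Lightning's DDP, every process loads its own `batch_size` samples, so the effective batch size grows with the number of processes. The sketch below applies the common linear scaling heuristic; both helper names are hypothetical, and treating `1e-4` as a single-process baseline is an assumption made only for the arithmetic.

```python
def effective_batch_size(batch_size: int, gpus: int, num_nodes: int) -> int:
    """Samples consumed per optimizer step when every process loads `batch_size` samples."""
    return batch_size * gpus * num_nodes


def linearly_scaled_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling heuristic: grow the learning rate in proportion to the batch size."""
    return base_lr * new_batch_size / base_batch_size


# Values taken from the launch command above: 512 samples, 1 GPU per node, 2 nodes.
total = effective_batch_size(batch_size=512, gpus=1, num_nodes=2)
# Assumption: 1e-4 was tuned for a single process with batch size 512.
scaled = linearly_scaled_lr(base_lr=1e-4, base_batch_size=512, new_batch_size=total)
```

Linear scaling is only a starting point; a learning-rate finder, as the script comment suggests, remains the safer option.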


## **1.2. A Note on Model Parallelism**

As a rule of thumb, model parallelism becomes worth considering once a model reaches around 500M parameters.

This session provides no material on it: the subject is complex and would require an entire session of its own. Just know that it exists.

To go further, look at the [overview from the Lightning guides](https://pytorch-lightning.readthedocs.io/en/stable/advanced/advanced_gpu.html#choosing-an-advanced-distributed-gpu-plugin), or the [in-depth documentation of the FairScale initiative](https://fairscale.readthedocs.io/en/latest/deep_dive/oss_sdp_fsdp.html).
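
To get a feel for why the 500M-parameter mark matters, here is a rough back-of-the-envelope estimate of the memory taken by the weights, gradients and optimizer states. It is a sketch under stated assumptions: 32-bit floats, an Adam-like optimizer keeping two extra values per parameter, and activations ignored. The function name `training_memory_gb` is hypothetical.

```python
def training_memory_gb(num_params: int,
                       bytes_per_value: int = 4,
                       optimizer_states_per_param: int = 2) -> float:
    """Rough memory for weights + gradients + optimizer states, in GB (activations excluded)."""
    values_per_param = 1 + 1 + optimizer_states_per_param  # weight, gradient, optimizer states
    return num_params * values_per_param * bytes_per_value / 1e9


small = training_memory_gb(488_000)        # roughly the size of the MLP trained in this notebook
large = training_memory_gb(500_000_000)    # the rule-of-thumb threshold
```

Under these assumptions, a 500M-parameter model already needs about 8 GB before any activation is stored, whereas the MLP used here needs less than 0.01 GB.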