## Additional Techniques

So far, we have mostly covered topics related to **parallelism** and **memory**.  
Beyond parallelization, however, many other techniques are useful for **large-scale modeling**.

In this session, we will briefly introduce some of those techniques.


## 1. Kernel Fusion

Kernel fusion is a technique that improves computation speed by merging several GPU operations into a single custom CUDA kernel.

### What is a Kernel?

A kernel can be thought of as code that runs directly on a GPU device.
CUDA provides syntax like the following:

<br>

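```cpp
// Device function: called from GPU code, runs on the GPU.
__device__ float twice(float v) { return 2.0f * v; }

// Kernel: launched from CPU code, runs on the GPU.
__global__ void twice_kernel(float* data) { data[threadIdx.x] = twice(data[threadIdx.x]); }

// Host function: called from CPU code, runs on the CPU.
__host__ void launch(float* d_data, int n) { twice_kernel<<<1, n>>>(d_data); }
```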

<br>

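Functions marked `__global__` or `__device__` run on the GPU,
while functions marked `__host__` (or with no qualifier) run on the CPU.

Strictly speaking, a kernel is a `__global__` function: the CPU launches it and the GPU executes it. `__device__` functions are helpers that only GPU code can call. In everyday usage, though, both kinds of GPU-side code are loosely called kernels.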

### What is Fusion?

In practice, we build models by combining small operations provided by PyTorch.
For example, we combine `torch.matmul` and `torch.add` to create an operation like `y = torch.matmul(x, w) + b`.

Similarly, we combine operations such as `torch.split` and `torch.permute` to implement the Transformer head split and merge.

This approach is easy to implement and clean, but when analyzed internally, it is quite inefficient.


<br>

![](../images/kernel_fusion.png)

<br>

Let's assume we execute an operation like `torch.matmul(x, w) * b`.


What happens internally is as follows:

1. First, PyTorch's host code loads tensors `x` and `w` and launches the kernel (cuBLAS) corresponding to `torch.matmul`.
2. The computed result is stored in memory.
3. Next, to compute `* b`, the previously stored result of `matmul(x, w)` and tensor `b` are loaded again.
4. The multiplication is performed, and the result is stored once more.

This is clearly inefficient.

Wouldn't it be better if we could load `x`, `w`, and `b` **once**, perform all computations at once, and then store the result **only once**?

Unfortunately, PyTorch needs to provide operations in small, composable units to maximize user flexibility. As a result, except for a few special cases, this kind of repeated load/store behavior is unavoidable.

However, if we know how to write CUDA code, we don't have to rely on PyTorch's built-in kernels. We can directly implement a custom kernel that loads `x`, `w`, and `b` all at once, performs the computation, and stores the result in a single step.

For this reason, many CUDA programmers write fused kernels of this kind, which cut both the unnecessary load/store operations and the overhead of launching several kernels.
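
To make the difference concrete, here is a minimal Python sketch that counts tensor-level loads and stores for the two approaches. It is a toy model, not real GPU code: `GlobalMemory`, `unfused`, and `fused` are hypothetical names introduced for illustration, plain Python lists stand in for tensors, and `b` is assumed to be a vector broadcast across the columns of the matmul result.

```python
class GlobalMemory:
    """Toy stand-in for GPU global memory that counts tensor-level loads/stores."""

    def __init__(self, **tensors):
        self.data = dict(tensors)
        self.loads = 0
        self.stores = 0

    def load(self, name):
        self.loads += 1
        return self.data[name]

    def store(self, name, value):
        self.stores += 1
        self.data[name] = value


def matmul(a, b):
    # Plain triple-loop matrix multiply on nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def scale_columns(rows, b):
    # Elementwise multiply, with vector b broadcast over every row.
    return [[v * b[j] for j, v in enumerate(row)] for row in rows]


def unfused(mem):
    # "Kernel" 1: load x and w, store the intermediate result.
    x, w = mem.load("x"), mem.load("w")
    mem.store("tmp", matmul(x, w))
    # "Kernel" 2: load the intermediate result and b again, store the output.
    tmp, b = mem.load("tmp"), mem.load("b")
    mem.store("y", scale_columns(tmp, b))


def fused(mem):
    # One "kernel": load everything once, compute, store once.
    x, w, b = mem.load("x"), mem.load("w"), mem.load("b")
    mem.store("y", scale_columns(matmul(x, w), b))


if __name__ == "__main__":
    tensors = dict(x=[[1, 2], [3, 4]], w=[[1, 0], [0, 1]], b=[2, 3])
    for fn in (unfused, fused):
        mem = GlobalMemory(**tensors)
        fn(mem)
        print(fn.__name__, mem.loads, "loads,", mem.stores, "stores")
```

Both versions produce the same `y`, but the unfused path performs 4 loads and 2 stores while the fused path performs 3 loads and 1 store. On a real GPU the saving is in global-memory traffic and kernel launches rather than in Python calls.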



### However...

Including CUDA programming from scratch in this presentation would greatly exceed its scope, so CUDA programming details are omitted here.

Instead, we recommend several **effective Transformer-related kernels** that have been released so far. Their documentation explains usage well, so you can refer to them to improve computational performance.  
(If time allows in the future, it would be great to create CUDA-related notebooks as well 😊)




<br>

#### Training Kernel
- [Apex Fused Kernel](https://github.com/NVIDIA/apex/tree/master/csrc)
- [LightSeq Training Kernel](https://github.com/bytedance/lightseq/tree/master/lightseq/training)
- [DeepSpeed Training Kernel](https://www.deepspeed.ai/tutorials/transformer_kernel/)

#### Inference Kernel
- [LightSeq Inference Kernel](https://github.com/bytedance/lightseq/tree/master/lightseq/inference)
- [FastSeq NGram repeat blocking Kernel](https://github.com/microsoft/fastseq/tree/main/fastseq/clib/cuda)
- [Faster Transformer Kernel](https://github.com/NVIDIA/FasterTransformer)
- [DeepSpeed Inference Kernel](https://www.deepspeed.ai/tutorials/inference-tutorial/)
- [Turbo Transformer Kernel](https://github.com/Tencent/TurboTransformers)
- [Effective Transformer Kernel](https://github.com/bytedance/effective_transformer)

## 2. Progressive Layer Dropping (PLD)

Progressive Layer Dropping (PLD) is a technique that **randomly skips (drops) layers during training** to improve training speed and performance.  
The authors of the paper made the following two key observations about the BERT training process.

<br>

![](../images/pld_1.png)

<br>

- **In the early stage of training**, the differences in **L2-Norm** and **cosine similarity** between the input and output of layers are **large**, but **these differences become smaller in the later stages**.
  - In other words, during early training, the model learns many new things and changes significantly, whereas in later training, it mainly performs fine-grained adjustments.

<br>

- In **Pre-LN models**, the **earlier layers** show relatively large differences between input and output, while the **later layers produce very similar outputs**.
  - This suggests that as we go deeper into the network, later layers tend to make only very small corrections instead of significantly changing the representations.

<br>

Based on these observations, PLD improves training speed and performance by **dropping some layers during training**.  
Guided by these observations, layers are dropped less often where they matter more (early in training, and in the earlier layers) and more often where they matter less (late in training, and in the later layers). This strategy was shown to achieve good performance.

<br>

![](../images/pld_2.png)

Dropping is controlled by a **gate function** `G ∈ {0, 1}`, which determines whether a layer is executed or dropped.

If the gate outputs `0`, the layer is skipped: its input is passed unchanged to the next layer (an **identity mapping**).

<br>

![](../images/pld_3.png)
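
The gating described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's or DeepSpeed's exact implementation: the function names, the exponential-decay schedule, and the default values `theta_min=0.5` and `gamma=0.001` are all assumptions chosen so that the keep probability falls as training progresses and as layers get deeper.

```python
import math
import random


def keep_probability(step, layer_idx, num_layers, theta_min=0.5, gamma=0.001):
    """Probability that layer `layer_idx` (1-based) is executed at `step`.

    Assumed schedule: a global keep ratio that decays from 1.0 toward
    `theta_min` over training, scaled so deeper layers are dropped more.
    """
    theta_t = (1.0 - theta_min) * math.exp(-gamma * step) + theta_min
    return 1.0 - (layer_idx / num_layers) * (1.0 - theta_t)


def forward_with_pld(x, layers, step, rng):
    """Run residual layers, skipping each one when its gate samples 0."""
    num_layers = len(layers)
    for idx, layer in enumerate(layers, start=1):
        # Gate G in {0, 1}: 1 = execute the layer, 0 = drop it.
        gate = 1 if rng.random() < keep_probability(step, idx, num_layers) else 0
        if gate == 1:
            x = x + layer(x)  # residual connection around the sublayer
        # gate == 0: identity mapping, x passes through unchanged
    return x


if __name__ == "__main__":
    layers = [lambda v: 1.0] * 12  # dummy sublayers that each add 1.0
    rng = random.Random(0)
    for step in (0, 1000, 10000):
        out = forward_with_pld(0.0, layers, step, rng)
        print(f"step={step}: {int(out)} of {len(layers)} layers executed")
```

At step 0 every keep probability is 1.0, so no layer is dropped; later in training the deepest layer's keep probability approaches `theta_min`.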

PLD is currently implemented in **DeepSpeed**, but it is **not yet compatible with ZeRO**, which makes it difficult to use in real large-scale model training at the moment.  
Hopefully, PLD will be integrated with other DeepSpeed features in the near future 🙂

For more details, please refer to the paper:  
https://arxiv.org/abs/2010.13369


## 3. 1-Bit Compressive Optimizers

- **1-bit Adam**
- **1-bit LAMB (1-bit LAdam)**

1-Bit Compressive Optimizers are techniques that **compress the optimizer states exchanged between devices, such as momentum, to 1-bit precision**.  
Gradients and optimizer states are frequently exchanged between distributed devices and often become **communication bottlenecks** in large-scale training.

For **SGD**, it has been common to compress the momentum term to 1-bit during communication and then apply **error compensation** afterward to improve communication efficiency.  
However, for **Adam**, this approach is more difficult because the **variance term introduces non-linearity**, making naive compression ineffective.

**1-bit Adam** addresses this problem with a two-stage scheme and achieves good performance even with heavy compression.

<br>

![](../images/one_bit_adam.png)

<br>

The key idea is as follows:

- During the **initial training phase**, Adam operates **without compression**.
- After reaching a certain point in training:
  - The **variance term is frozen at a constant value**, so it no longer needs to be updated or communicated.
  - Only the **momentum term is properly compressed to 1-bit** and communicated, similar to SGD.

This approach achieves **the same convergence rate as standard (uncompressed) Adam**, while **reducing communication volume by up to 5×**.
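
The building block that the compressed stage borrows from SGD, 1-bit compression with error compensation, can be sketched as follows. This is a single-worker illustration only: `one_bit_compress` is a hypothetical helper, using the mean absolute value as the shared scale is an assumption of this sketch, and real implementations operate on tensors and combine the compressed values across devices.

```python
def one_bit_compress(values, error):
    """Compress `values` to one sign bit per element plus one shared scale.

    `error` is the residual left over from the previous round. It is added
    back before compressing, so information lost to quantization is delayed
    rather than discarded.
    """
    corrected = [v + e for v, e in zip(values, error)]
    # One shared magnitude for all elements (assumed: mean absolute value).
    scale = sum(abs(c) for c in corrected) / len(corrected)
    # Each element is transmitted as +scale or -scale, i.e. a single bit.
    compressed = [scale if c >= 0 else -scale for c in corrected]
    # Residual to carry into the next round.
    new_error = [c - q for c, q in zip(corrected, compressed)]
    return compressed, new_error


if __name__ == "__main__":
    momentum = [0.5, -1.5, 1.0, -1.0]
    error = [0.0] * len(momentum)
    for round_idx in range(3):
        sent, error = one_bit_compress(momentum, error)
        print(f"round {round_idx}: sent={sent} carried_error={error}")
```

After every round, the sum of everything sent so far plus the carried error equals the sum of the true values so far, which is why the compression does not accumulate bias.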

For more details, please refer to the paper:  
https://arxiv.org/abs/2102.02888 🙂


## 4. Curriculum Learning

Curriculum Learning is inspired by **how humans learn**.  
Humans typically start by learning **easy concepts** at a young age and gradually move on to **more difficult concepts** as they grow older.  
However, modern neural network training usually does **not** follow this pattern.

Curriculum Learning applies this human-like strategy to neural networks by:
- Showing **easy samples early in training**
- Gradually introducing **harder samples later in training**



### How do we define "easy" and "hard" samples?

In the paper, the authors applied Curriculum Learning to **GPT pre-training** and used a very simple heuristic:

- **Shorter samples → easy samples**
- **Longer samples → hard samples**

In **Megatron-LM**, all samples are concatenated together during pre-training.  
To apply Curriculum Learning, the authors:
- **Chunked the input into shorter sequences during early training**
- Gradually **increased the input sequence length** as training progressed
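
A sequence-length curriculum of this kind can be sketched as below. The linear schedule, the function names, and the default values (`min_len=64`, `max_len=1024`, rounding down to a multiple of 8) are assumptions for illustration, not the paper's or DeepSpeed's exact settings.

```python
def seqlen_at_step(step, total_curriculum_steps, min_len=64, max_len=1024, multiple=8):
    """Sequence length to use at `step`, growing linearly from min_len to max_len."""
    if step >= total_curriculum_steps:
        return max_len  # curriculum finished: train at full length
    frac = step / total_curriculum_steps
    length = min_len + (max_len - min_len) * frac
    # Round down to a multiple (assumed here to keep shapes hardware-friendly).
    length = int(length) // multiple * multiple
    return max(min_len, min(length, max_len))


def chunk(tokens, seqlen):
    """Split one long concatenated token stream into pieces of at most `seqlen`."""
    return [tokens[i:i + seqlen] for i in range(0, len(tokens), seqlen)]


if __name__ == "__main__":
    stream = list(range(4096))  # stand-in for concatenated training tokens
    for step in (0, 2500, 5000, 10000):
        seqlen = seqlen_at_step(step, total_curriculum_steps=10000)
        pieces = chunk(stream, seqlen)
        print(f"step={step}: seqlen={seqlen}, {len(pieces)} sequences")
```

Early steps therefore see many short sequences cut from the same token stream, and the model only trains at the full length once the curriculum has finished.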

<br>

![](../images/cl_1.png)

<br>

Despite its simplicity, this method showed **surprisingly strong performance**.  
From the graph, we can see that models trained with Curriculum Learning achieve **lower loss** compared to those trained without it.

Curriculum Learning is **already implemented in DeepSpeed**, so it is highly recommended to take advantage of this feature when training large-scale models.

For more details, please refer to the paper:  
https://arxiv.org/abs/2108.06084
