# Overview of Parallelism

Before diving into parallel processing in detail, this session gives a high-level overview of the main parallelism techniques.


## 1. Parallelism

Parallelism refers to techniques that process multiple tasks simultaneously and is one of the most important concepts in large-scale modeling. In machine learning, it is mainly used to parallelize computation across multiple devices in order to improve speed and memory efficiency.

<br><br>


## 2. Data Parallelism

Data parallelism speeds up training on large datasets by processing different data in parallel. The model is replicated on every device, and each device receives a different slice of each batch. After the backward pass, the gradients are averaged across devices so that every replica applies the same update. As a result, the effective batch size grows in proportion to the number of devices. However, data parallelism is only possible when the whole model fits on a single device.

![](../images/data_parallelism.png)
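
The idea can be sketched in plain Python, without any real devices. The example below is only an illustration under simplifying assumptions: the "model" is a single weight `w`, the devices are simulated as list slices, and the helper names `grad` and `data_parallel_step` are made up for this sketch. It shows that averaging per-shard gradients gives the same update as computing the gradient over the whole batch on one device, provided the shards have equal size.

```python
# Toy data parallelism: a 1-parameter linear model y = w * x with squared-error loss.
# "Devices" are simulated as plain Python lists; no real GPUs are involved.

def grad(w, batch):
    """Mean gradient of (w*x - y)^2 with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_step(w, global_batch, num_devices, lr=0.01):
    """One training step: shard the batch, compute local gradients, average, update."""
    shard_size = len(global_batch) // num_devices
    shards = [global_batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_devices)]
    # Every replica holds the same w and sees a different shard.
    local_grads = [grad(w, shard) for shard in shards]
    # Stand-in for an all-reduce: average the local gradients.
    avg_grad = sum(local_grads) / num_devices
    return w - lr * avg_grad

global_batch = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples, true w = 3
w_parallel = data_parallel_step(0.0, global_batch, num_devices=4)
w_single = 0.0 - 0.01 * grad(0.0, global_batch)  # same step on one device
```

Here `w_parallel` and `w_single` agree up to floating-point rounding. In a real framework the averaging step is a collective communication operation rather than a Python `sum`.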

<br><br>


## 3. Model Parallelism

If a model is too large to fit entirely on a single device, its parameters must be split and placed across multiple devices. As a result, each device holds only a portion of the model’s parameters. This makes it possible to handle very large models using multiple smaller devices. Depending on the dimension along which the model is parallelized, model parallelism can be classified into **inter-layer** and **intra-layer** parallelism.

![](../images/model_parallelism.png)

### Inter-layer Model Parallelism
Inter-layer model parallelism splits the model along layer boundaries and places consecutive groups of layers on different devices. For example, layers 1, 2, and 3 can be assigned to GPU 1, while layers 4 and 5 are assigned to GPU 2. A representative example of this approach is Google’s **GPipe**.

![](../images/inter_layer.png)
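
A minimal sketch of this layer placement, assuming toy "layers" that are just elementwise affine functions and "GPUs" that are dictionary keys (all names here are hypothetical and chosen for illustration):

```python
# Toy inter-layer model parallelism: consecutive layers are grouped into stages,
# and each stage is assigned to one (simulated) device.

def make_layer(scale, shift):
    """A stand-in 'layer': an elementwise affine map over a list of floats."""
    return lambda xs: [scale * x + shift for x in xs]

layers = [make_layer(2.0, 0.0), make_layer(1.0, 1.0), make_layer(0.5, 0.0),
          make_layer(1.0, -1.0), make_layer(3.0, 0.0)]

# Layers 1-3 go to "GPU 1", layers 4-5 to "GPU 2", as in the text.
placement = {"gpu1": layers[0:3], "gpu2": layers[3:5]}

def run_stage(stage_layers, xs):
    """Apply the layers of one stage in order."""
    for layer in stage_layers:
        xs = layer(xs)
    return xs

def forward(xs):
    # The activation leaving gpu1 is the only thing sent to gpu2.
    hidden = run_stage(placement["gpu1"], xs)
    return run_stage(placement["gpu2"], hidden)

def forward_single_device(xs):
    """Reference: all five layers on one device."""
    return run_stage(layers, xs)

out = forward([1.0, 2.0])
```

The split changes where each layer lives, not what the model computes: `forward` and `forward_single_device` return the same result. Only the intermediate activation crosses the device boundary.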

### Intra-layer Model Parallelism
Intra-layer model parallelism splits the tensors themselves, independent of layer boundaries. For instance, a parameter tensor of shape `[256, 256]` can be split into two shards of shape `[128, 256]` (along the first dimension) or two shards of shape `[256, 128]` (along the second dimension), with each shard placed on a different device. A representative example of this approach is NVIDIA’s **Megatron-LM**.

![](../images/intra_layer.png)
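
To make the two split directions concrete, here is a small sketch that uses a `[4, 4]` weight matrix instead of `[256, 256]` and a hand-written `matvec` helper (an assumption of this example, not a library function). It illustrates how the partial results are recombined in each case; it is not meant to reflect how Megatron-LM is actually implemented.

```python
def matvec(x, w):
    """Compute y = x @ w for a vector x of length n and an n-by-m matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

w = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0],
     [2.0, 0.0, 2.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
x = [1.0, 2.0, 3.0, 4.0]

# Column split: [4, 4] -> two [4, 2] shards. Each device sees the full input
# and produces half of the output features; the halves are concatenated.
w_col0 = [row[:2] for row in w]
w_col1 = [row[2:] for row in w]
y_col = matvec(x, w_col0) + matvec(x, w_col1)  # list concatenation

# Row split: [4, 4] -> two [2, 4] shards. Each device sees half of the input
# features and produces a partial sum; the partial sums are added together.
w_row0, w_row1 = w[:2], w[2:]
partial0 = matvec(x[:2], w_row0)
partial1 = matvec(x[2:], w_row1)
y_row = [a + b for a, b in zip(partial0, partial1)]

y_full = matvec(x, w)  # reference result on a single device
```

Both `y_col` and `y_row` equal `y_full`. The difference lies in the communication pattern: a column split needs a concatenation of outputs, while a row split needs a sum of partial outputs.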

### Pipeline Parallelism
Pipeline parallelism is a model parallelism technique designed to address the drawbacks of inter-layer model parallelism. With inter-layer model parallelism, there is an inherent execution order among GPUs. For example, layers 4 and 5 cannot run until layers 1, 2, and 3 have produced their output. As a result, GPU 2 must wait until GPU 1 finishes its computation.

![](../images/pipeline_parallelism.png)

<br>

This is highly inefficient: even though multiple GPUs are available, only one of them is doing useful work at any given time. To solve this problem, pipeline parallelism splits each batch into smaller micro-batches so that, as illustrated below, GPU 2 can process one micro-batch while GPU 1 has already started on the next. (Sounds complicated, right? We’ll explain this in detail later.)

<br>

![](../images/pipeline_parallelism2.png)
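
<br>

The benefit of overlapping can be estimated with a very rough model. The sketch below assumes a forward-only schedule in which every stage takes exactly one time slot per micro-batch and communication is free; real schedules also include the backward pass. The function names are hypothetical. It counts which fraction of (GPU, time slot) cells is busy.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Forward-only schedule: stage s runs micro-batch m in time slot s + m.

    Returns a list of time slots; each slot maps stage index -> micro-batch index.
    """
    total_slots = num_stages + num_microbatches - 1
    slots = [dict() for _ in range(total_slots)]
    for s in range(num_stages):
        for m in range(num_microbatches):
            slots[s + m][s] = m
    return slots

def utilization(num_stages, num_microbatches):
    """Fraction of (stage, slot) cells that do useful work."""
    slots = pipeline_schedule(num_stages, num_microbatches)
    busy = sum(len(slot) for slot in slots)
    return busy / (num_stages * len(slots))

naive = utilization(num_stages=2, num_microbatches=1)      # no micro-batching
pipelined = utilization(num_stages=2, num_microbatches=4)  # 4 micro-batches
```

Under these assumptions, two GPUs without micro-batching are busy half of the time, while four micro-batches raise that to 80%. The idle cells that remain at the start and end of the schedule shrink as the number of micro-batches grows.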

<br><br>


## 4. Multi-dimensional Parallelism

The parallelization techniques described above can be applied together, and the number of dimensions equals the number of techniques combined. As shown below, n-dimensional parallelism can be built in several ways.

- e.g. 2D parallelism: Data parallelism + inter-layer parallelism  
- e.g. 2D parallelism: Data parallelism + intra-layer parallelism  
- e.g. 3D parallelism: Data parallelism + intra-layer parallelism + pipeline parallelism  

![](../images/parallelism.png)
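
One way to picture 3D parallelism is as a grid of devices, where each device has one coordinate per parallelism dimension. The sketch below assumes 8 devices arranged as 2 (data) x 2 (pipeline) x 2 (tensor, i.e. intra-layer), with the tensor index varying fastest. This ordering and the function names are assumptions made for illustration; actual frameworks choose their own rank layout.

```python
def rank_to_coords(rank, dp_size, pp_size, tp_size):
    """Map a flat device rank to (dp, pp, tp) coordinates, tensor index fastest."""
    assert 0 <= rank < dp_size * pp_size * tp_size
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return dp, pp, tp

def groups(dp_size, pp_size, tp_size, axis):
    """Group ranks that differ only along `axis` (0 = data, 1 = pipeline, 2 = tensor)."""
    result = {}
    for rank in range(dp_size * pp_size * tp_size):
        coords = rank_to_coords(rank, dp_size, pp_size, tp_size)
        key = coords[:axis] + coords[axis + 1:]  # coordinates on the other axes
        result.setdefault(key, []).append(rank)
    return list(result.values())

tensor_groups = groups(2, 2, 2, axis=2)  # devices sharing shards of the same layers
data_groups = groups(2, 2, 2, axis=0)    # devices holding identical model pieces
```

With this layout, `tensor_groups` pairs neighbouring ranks such as 0 and 1, while `data_groups` pairs ranks such as 0 and 4, which hold the same piece of the model and therefore average their gradients with each other.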

This type of multi-dimensional parallelism is currently one of the most popular approaches in large-scale model training. In addition to the methods mentioned above, there are other techniques such as **ZeRO**, which will be explained in detail in a later chapter.
