# Module 2
## Lab 2: Back to Libraries

Now that we understand how sharding works, we can look at how libraries use it to optimize model inference and training across multiple accelerators. If you want a deeper dive into the individual sharding patterns that make up parallelism strategies, and the collectives associated with them, we highly recommend [this chapter](https://jax-ml.github.io/scaling-book/sharding/) of **How to Scale Your Model**.

In this notebook, though, we will focus on the parallelism strategies these libraries implement and the considerations that come with each.

We'll use a 1B parameter model for demonstration purposes, but will show how you can combine these techniques with a larger model at the end.

### Parallelism Patterns
This lab is a hands-on tour of common parallelism patterns. If you want an academic deep dive, we suggest reading through [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=high_level_overview).

> Note: we will be using DeepSpeed here, but this could just as easily be done with any distributed inference/training framework, as they have similar interfaces.

We will be covering the following:
- Data Parallelism
- ZeRO-3/FSDP (these are interchangeable in most contexts)
- Tensor Parallelism (as we covered in the last chapter)

There are other strategies as well, and more will likely emerge, but these are the main ones and the others follow similar concepts. Throughout the first part of this module we'll keep using a small model to demonstrate the different types of parallelism. At the end we will combine these strategies to run a much larger model on the same instance!

The concepts we cover in this module can scale to thousands of GPUs.

Let's start by installing our libraries.


In [1]:
import os
parent_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
os.chdir(parent_dir)

In [6]:
%pip install transformers accelerate deepspeed

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


Now we will run through the different types of parallelism. Keep an eye out for `OUTPUT BREAKDOWN` to see what the token generation and cost look like. You may see some warnings/errors, but these are benign and can be ignored.
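To make the `OUTPUT BREAKDOWN` numbers easier to read, here is a minimal sketch of how throughput and cost per 1M tokens relate to each other. We have not inspected how the lab's helper computes them, so treat this as one plausible reading: the function names are ours, and the hourly price is a hypothetical placeholder rather than the price of any particular instance.

```python
def throughput_tokens_per_sec(tokens_generated, elapsed_sec):
    """Tokens produced per second of wall-clock time."""
    return tokens_generated / elapsed_sec


def cost_per_million_tokens(tokens_per_sec, hourly_price_usd):
    """Dollars spent to generate one million tokens at a steady throughput."""
    price_per_sec = hourly_price_usd / 3600.0
    return price_per_sec / tokens_per_sec * 1_000_000


# Hypothetical hourly instance price, for illustration only.
HYPOTHETICAL_HOURLY_PRICE_USD = 1.21

tps = throughput_tokens_per_sec(tokens_generated=32, elapsed_sec=0.5)
cost = cost_per_million_tokens(tps, HYPOTHETICAL_HOURLY_PRICE_USD)
print(f"{tps:.2f} tokens/sec, ${cost:.5f} per 1M tokens")
```

Because cost is inversely proportional to throughput, doubling tokens per second on the same hardware halves the cost per token. That is the lever the strategies in this lab try to pull.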

## Definitions
Before we get started, the appendix below explains the diagrams we'll use to visualize the types of parallelism, along with their advantages and disadvantages.

![](./assets/appendix.png)

- The A matrix represents the input data; its size depends on the sequence length and batch size
- The B matrix represents the parameters for the model
- The training states represent the additional memory needed for the forward and backward passes during training. In inference this isn't relevant, since you essentially only run the forward pass
- We will be using GPUs but this could work with Neuron devices or any other accelerator
- We won't be using 2x2 topologies here, but in practice that's how you'd lay out your GPU topology when combining multiple types of parallelism
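As a rough illustration of how the A and B matrices drive the work a GPU does, the sketch below counts the FLOPs and the bytes touched by a single `A @ B` matmul. It assumes an idealized model where each matrix is read or written exactly once and each element takes 2 bytes (as in bfloat16). The function name and the 2048-wide layer are hypothetical, chosen only for illustration.

```python
def matmul_profile(batch_size, seq_len, d_in, d_out, bytes_per_elem=2):
    """Idealized cost of A @ B, where A is the input and B holds parameters."""
    rows = batch_size * seq_len               # A grows with batch and sequence
    flops = 2 * rows * d_in * d_out           # one multiply + one add per term
    elements = rows * d_in + d_in * d_out + rows * d_out  # A, B, and the output
    bytes_moved = bytes_per_elem * elements
    return {
        "a_shape": (rows, d_in),
        "b_shape": (d_in, d_out),
        "flops": flops,
        "bytes": bytes_moved,
        "intensity": flops / bytes_moved,     # FLOPs per byte
    }


small = matmul_profile(batch_size=4, seq_len=8, d_in=2048, d_out=2048)
large = matmul_profile(batch_size=16, seq_len=32, d_in=2048, d_out=2048)
print(f"small A: {small['intensity']:.1f} FLOP/B")
print(f"large A: {large['intensity']:.1f} FLOP/B")
```

B stays the same size in both cases, so a larger A reuses each parameter more often and the FLOPs-per-byte ratio goes up.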

With that out of the way, let's jump into data parallelism.

#### Data Parallel
![](./assets/dp.png)
**How it works:**
Each GPU has a full copy of the model. Batches are split across GPUs, and gradients are synced after the backward pass. As you can see, this effectively just splits the input data. If your model is too large, or the training states take up too much memory, this won't reduce your memory footprint. It's best used as a tool to improve throughput.

✅ Advantages:
- Easy to implement (e.g., torch.nn.DataParallel, DDP)
- Scales well for small to mid-sized models
- No model code changes required
- Best for large data sets and small models

❌ Disadvantages:
- Inefficient for very large models (can't fit on one GPU)
- All-reduce on gradients becomes a bottleneck at high scale
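The gradient sync can be sketched in plain Python. This is a toy simulation, not how DDP or DeepSpeed is implemented: a one-parameter model `y_hat = w * x` stands in for the network, a list of per-rank values stands in for the GPUs, and `all_reduce_mean` is our stand-in for the real collective. It assumes the batch divides evenly across ranks.

```python
def local_gradient(w, xs, ys):
    """Gradient of mean squared error for y_hat = w * x over one shard."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for all-reduce: every rank ends up with the average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)


def data_parallel_gradient(w, xs, ys, world_size):
    """Split the batch across ranks, compute local grads, then sync them."""
    shard = len(xs) // world_size  # assumes an evenly divisible batch
    local = [
        local_gradient(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(world_size)
    ]
    return all_reduce_mean(local)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5  # every rank holds the same full copy of the "model"

synced = data_parallel_gradient(w, xs, ys, world_size=4)
full = local_gradient(w, xs, ys)
print(synced, full)
```

With equal-sized shards, the averaged gradient on every rank matches the gradient a single GPU would compute on the full batch. This is why DP changes throughput but not the result, and why every rank still needs memory for the whole model.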


Let's see it in action. We'll use a smaller model (1B) because we aren't splitting the model this time.

> Note: DeepSpeed does this automatically when you give it a basic configuration and multiple GPUs, as it's a standard optimization technique.

In [2]:
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

# Generate a DeepSpeed config with key overrides only
# Using batch size 2, grad accumulation 2, and no fp16
ds_config = mutils.make_ds_config(
    fp16=False
)

# Run single-GPU benchmark (no sharding)
results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=8,
    batch_sizes=[4],
    dtype=torch.bfloat16,
    sharding=False,
    world_size=1,
    ds_config=ds_config
)

mutils.reset_distributed_and_clear_memory()

# Run multi-GPU benchmark with sharding across 4 GPUs with a bigger batch size
results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=32,
    batch_sizes=[16],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=4,
    ds_config=ds_config
)

mutils.reset_distributed_and_clear_memory()


[2025-05-27 15:13:30,113] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-27 15:13:33,935] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)

🔁 Running batch size = 4
Batch=4 | SeqLen=8
Elapsed GPU time: 0.5126s | TFLOP/s: 1.5 | AI: 22.86 FLOP/B
--------- OUTPUT BREAKDOWN ---------
🧠 Tokens generated: 32
⚡ Throughput: 62.42335 tokens/sec
⏱️ Total time: 0.51263 sec
💸 Cost per 1M tokens: $5.38438
------------------------------------
✅ Distributed env torn down and memory cleared.




[2025-05-27 15:13:46,100] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-27 15:13:46,149] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-27 15:13:46,174] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-27 15:13:46,199] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-27 15:13:48,552] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-27 15:13:48,552] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-27 15:13:48,735] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-27 15:13:48,749] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-27 15:13:48,769] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-27 15:13:48,831] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
[2025-05-2

Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...


[2025-05-27 15:13:53,288] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-05-27 15:13:53,289] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.3 GB         CA 2.3 GB         Max_CA 2 GB 
[2025-05-27 15:13:53,289] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 7.63 GB, percent = 4.2%
[2025-05-27 15:13:53,290] [INFO] [stage3.py:168:__init__] Reduce bucket size 500000000
[2025-05-27 15:13:53,290] [INFO] [stage3.py:169:__init__] Prefetch bucket size 50000000
[2025-05-27 15:13:53,396] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-05-27 15:13:53,396] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.3 GB         CA 2.3 GB         Max_CA 2 GB 
[2025-05-27 15:13:53,396] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 7.63 GB, percent = 4.2%
[2025-05-27 15:13:53,398] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
Parameter Offload: Tota

Great! As you can see, distributing the data lets us process our workload much faster. This is similar to how we previously used batching to improve price/performance. But once our model is large enough this strategy will no longer work, and we'll run into the same memory issues as before. So we'll have to find ways to reduce either the model size or the training states held on each GPU (in most cases both).

#### ZeRO-3/FSDP
![](./assets/zero.png)
**How it works:**
ZeRO-3 (and FSDP) shards all model states — including parameters, gradients, and optimizer states — across GPUs. It avoids redundancy by ensuring that no GPU holds a full copy of the model. Parameters are temporarily reconstructed using AllGather at compute time.
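The shard-then-gather cycle can be sketched with plain lists. This is a toy model rather than the DeepSpeed or FSDP implementation: `shard` and `all_gather` are our own stand-ins, a flat list of floats stands in for one layer's parameters, and each inner list plays the role of one GPU's memory.

```python
def shard(params, world_size):
    """Split a flat parameter list into one contiguous shard per rank."""
    size = -(-len(params) // world_size)  # ceiling division
    return [params[r * size:(r + 1) * size] for r in range(world_size)]


def all_gather(shards):
    """Stand-in for AllGather: concatenate every rank's shard in rank order."""
    full = []
    for s in shards:
        full.extend(s)
    return full


layer_weights = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6, 0.7, -0.8]
inputs = [1.0] * len(layer_weights)

# At rest, each of the 4 ranks stores only a quarter of the layer.
shards = shard(layer_weights, world_size=4)
print("per-rank shard sizes:", [len(s) for s in shards])

# At compute time, the full layer is rebuilt, used, and dropped again.
gathered = all_gather(shards)
output = sum(w * x for w, x in zip(gathered, inputs))
del gathered  # only the local shard stays resident after the layer runs
print("layer output:", output)
```

The memory saving comes from the time spent "at rest", and the cost is the gather that has to happen every time the layer is needed.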

✅ Advantages:
- Shards the entire model, including non-linear layers, embeddings, and optimizer states
- Enables training models that exceed per-GPU memory
- Integrates with existing PyTorch models via FSDP or DeepSpeed ZeRO

❌ Disadvantages:
- Requires an AllGather of each layer's full parameters before its forward and backward computation
- Communication-intensive (especially with many GPUs)
- Can be slower without high-bandwidth interconnect (e.g., NVLink or InfiniBand)
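To put rough numbers on the memory side of this trade-off, the sketch below estimates per-GPU model-state memory for each ZeRO stage. It assumes the commonly quoted mixed-precision Adam accounting (2 bytes per parameter for weights, 2 for gradients, 12 for optimizer states) and ignores activations, buffers, and fragmentation. Real usage will differ, and the function name is ours.

```python
def model_state_gb(num_params, world_size, stage):
    """Approximate per-GPU model-state memory (GB) for a given ZeRO stage."""
    params_b, grads_b, optim_b = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim_b /= world_size   # stage 1 shards optimizer states
    if stage >= 2:
        grads_b /= world_size   # stage 2 also shards gradients
    if stage >= 3:
        params_b /= world_size  # stage 3 also shards parameters
    return num_params * (params_b + grads_b + optim_b) / 1e9


for stage in range(4):
    gb = model_state_gb(num_params=1_000_000_000, world_size=4, stage=stage)
    print(f"ZeRO stage {stage}: ~{gb:.1f} GB per GPU")
```

Only stage 3 makes the per-GPU footprint shrink in proportion to the number of GPUs, which is why it can fit models that no single device could hold.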

> We will often stack ZeRO-3/FSDP with other strategies like Tensor and Data Parallelism. For example, we may shard linear layers with TP while still using ZeRO-3 to handle the remaining memory overhead from unsharded layers and optimizer state. This allows us to scale across both memory and compute bottlenecks.

In [3]:
import json
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

seq_len = 32
min_new_tokens = 1
world_size = 2

# Use ZeRO stage 3 and offload parameters and optimizer state to CPU
ds_config = mutils.make_ds_config(
    zero_config={
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "contiguous_gradients": True,
        "overlap_comm": True
    })

# Run benchmark with the generated config
results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=seq_len,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=world_size,
    ds_config=ds_config
)



[2025-05-27 15:16:14,285] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-27 15:16:14,295] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-27 15:16:16,635] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-27 15:16:16,636] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-27 15:16:16,686] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-27 15:16:16,790] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-05-27 15:16:16,851] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.2, git-hash=unknown, git-branch=unknown
[2025-05-27 15:16:16,851] [INFO] [comm.py:677:init_distributed] Distributed backend already initialized
[2025-05-27 15:16:16,851] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-05-27 15:16:17,799] [INFO] [logging.py:128:log_dist]

Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module cpu_adam...
Loading extension module cpu_adam...


[2025-05-27 15:16:21,436] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-05-27 15:16:21,436] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.3 GB         CA 2.3 GB         Max_CA 2 GB 
[2025-05-27 15:16:21,437] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.84 GB, percent = 3.8%
[2025-05-27 15:16:21,437] [INFO] [stage3.py:168:__init__] Reduce bucket size 500000000
[2025-05-27 15:16:21,437] [INFO] [stage3.py:169:__init__] Prefetch bucket size 50000000
[2025-05-27 15:16:21,545] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-05-27 15:16:21,545] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.3 GB         CA 2.3 GB         Max_CA 2 GB 
[2025-05-27 15:16:21,545] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.84 GB, percent = 3.8%
[2025-05-27 15:16:21,547] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
Parameter Offload: Tota

#### Tensor Parallelism
![](./assets/tp.png)
**How it works:**
Individual layers are sharded across GPUs, e.g., by splitting the rows or columns of the weight matrices in linear layers. Typically this requires a custom, parallelism-aware implementation of the model. Frameworks like PyTorch and DeepSpeed usually provide this for popular models, but keep it in mind when using cutting-edge models or creating your own: if the architecture is unique, the model definition will need to account for this.

✅ Advantages:
- Enables sharding of very large models/layers
- Reduces per-GPU memory usage
- Exploits fine-grained parallelism within layers
- Less collective communication (CC) than CC-heavy DP
- Reduces per-GPU compute

❌ Disadvantages:
- Requires deep model rewrites or tools like DeepSpeed/FSDP
- Requires custom communication (e.g., all_gather, reduce_scatter)
- Collective communication (CC, e.g., via NCCL) can dominate runtime if it is not optimized or if TP is scaled across too many GPUs
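Splitting a linear layer by columns or by rows can be sketched with nested lists. This is a toy simulation under our own helper names, not a framework API: each shard's matmul would run on a different GPU in practice, and the final concatenation or sum stands in for the all-gather or all-reduce collective. Dimensions are assumed to divide evenly by the number of shards.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_columns(m, parts):
    """Split a matrix into `parts` equal-width column blocks."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal-height row blocks."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]


def column_parallel(a, b, world_size):
    """Each rank holds a column block of B; outputs are concatenated."""
    partials = [matmul(a, b_shard) for b_shard in split_columns(b, world_size)]
    return [sum((p[i] for p in partials), []) for i in range(len(a))]  # all-gather


def row_parallel(a, b, world_size):
    """Each rank holds a row block of B and the matching columns of A."""
    partials = [matmul(a_s, b_s) for a_s, b_s in
                zip(split_columns(a, world_size), split_rows(b, world_size))]
    return [[sum(p[i][j] for p in partials) for j in range(len(b[0]))]
            for i in range(len(a))]  # all-reduce (sum)


A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 0], [2, 2, 1, 1]]
print(column_parallel(A, B, world_size=2) == matmul(A, B))
print(row_parallel(A, B, world_size=2) == matmul(A, B))
```

Both layouts reproduce the unsharded result while each rank stores only a fraction of B. They differ in which collective is needed afterwards.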

> We will stack types of parallelism on top of each other, since on their own they may not be enough to fit the model in memory. For example, we will stack DP and TP in this case. You will see DP and TP together moving forward as well.

In [4]:
import json
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

seq_len = 32
min_new_tokens = 1
world_size = 4

# Generate DeepSpeed config with FP16 and tensor parallelism across 4 GPUs
ds_config = mutils.make_ds_config(
    tensor_parallel=world_size
)

# Run benchmark using tensor parallelism
results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=seq_len,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=world_size,
    ds_config=ds_config
)


[2025-05-27 15:16:56,793] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-27 15:16:56,808] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-27 15:16:56,856] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-27 15:16:56,868] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-27 15:16:59,285] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-27 15:16:59,285] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-27 15:16:59,390] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-27 15:16:59,413] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-27 15:16:59,421] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-27 15:16:59,485] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
[2025-05-2

Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...


[2025-05-27 15:17:05,062] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-05-27 15:17:05,063] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.3 GB         CA 2.3 GB         Max_CA 2 GB 
[2025-05-27 15:17:05,063] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 8.89 GB, percent = 4.9%
[2025-05-27 15:17:05,064] [INFO] [stage3.py:168:__init__] Reduce bucket size 500000000
[2025-05-27 15:17:05,064] [INFO] [stage3.py:169:__init__] Prefetch bucket size 50000000
[2025-05-27 15:17:05,175] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-05-27 15:17:05,175] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.3 GB         CA 2.3 GB         Max_CA 2 GB 
[2025-05-27 15:17:05,175] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 8.88 GB, percent = 4.9%
[2025-05-27 15:17:05,177] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
Parameter Offload: Tota

This tensor parallelism is exactly the same as what we demonstrated in the last lab. It allows us to launch a much larger model by utilizing more GPUs.

# Use Cases
We've discussed some of the different approaches to parallelism for models at scale, but they don't work in a vacuum. As you already saw, you can stack some of these strategies to get the best of both worlds.

## Large Model Use Case
The following is an example of how you could scale a large model by leveraging ZeRO-3, DP, and TP together. 

![](./assets/large_model.png)

Stacking lets us save memory on parameters and optimizer states and split the data across GPUs, without leaning too heavily on ZeRO's collective communication.
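One way to picture the stacking is as a 2D grid of ranks, with one axis for DP and one for TP. The sketch below builds such a grid and derives the communication groups from it. The rank ordering (TP ranks adjacent to each other) is a common convention but still an assumption here, since frameworks differ, and the function names are hypothetical.

```python
def build_mesh(world_size, tp_size):
    """Arrange ranks in a grid: one row per DP replica, one column per TP shard."""
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    dp_size = world_size // tp_size
    return [[dp * tp_size + tp for tp in range(tp_size)] for dp in range(dp_size)]


def tp_groups(mesh):
    """Ranks in a row jointly hold one copy of the model, split by layer shards."""
    return [list(row) for row in mesh]


def dp_groups(mesh):
    """Ranks in a column hold the same shard and sync gradients with each other."""
    return [list(col) for col in zip(*mesh)]


mesh = build_mesh(world_size=4, tp_size=2)
print("mesh:     ", mesh)
print("TP groups:", tp_groups(mesh))
print("DP groups:", dp_groups(mesh))
```

Keeping TP groups small and on adjacent ranks keeps the chattiest collectives between nearby devices, while the DP groups only need to sync gradients once per step.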

## Big Data Use Case
What about use cases where the model is small? In these cases we can focus only on DP, as ZeRO-3 and TP would add unnecessary communication, and the bottleneck is the data.

![](./assets/big_data.png)

As you can see we get most of the value from DP, and there's no need to introduce other forms of parallelism.

# Conclusion
The concepts introduced here don't cover all forms of parallelism, but they are the foundation. When choosing how to scale a model across accelerators, consider the trade-offs of each strategy and the engineering effort required to implement it. Other forms of parallelism, like pipeline parallelism, can be introduced to scale your workloads further, but require even more invasive engineering.

When developing a platform or speaking to those who are, understand the use case for the model, and the priorities when it comes to scaling.