# Module 2
## Lab 2: Back to Libraries

Now that we understand how sharding works, we can look at how libraries use it to optimize model inference and training across multiple accelerators. For a deeper dive into the individual sharding patterns that make up parallelism strategies, and the collectives associated with them, we highly recommend [this chapter](https://jax-ml.github.io/scaling-book/sharding/) of **How to Scale Your Model**.

In this notebook, though, we will focus on the different parallelism strategies these libraries implement and the trade-offs that come with each.

### Parallelism Patterns
This lab is a hands-on tour of common parallelism patterns. If you want an academic deep dive, we suggest reading through [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=high_level_overview).

> Note: we will be using DeepSpeed here, but this could just as easily be done with any distributed inference/training framework, since they have similar interfaces.

We will be covering the following:
- Data Parallelism
- Tensor Parallelism (like we covered last chapter)
- Pipeline Parallelism
- ZeRO

There are other strategies as well, and more will likely emerge, but these are the main ones, and the others follow similar concepts.

Let's start by installing our libraries.


In [2]:
import os
parent_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
os.chdir(parent_dir)

In [None]:
%pip install transformers accelerate deepspeed

#### Data Parallel
**How it works:**
Each GPU has a full copy of the model. Batches are split across GPUs. Gradients are synced after the backward pass.

✅ Advantages:
- Easy to implement (e.g., torch.nn.DataParallel, DDP)
- Scales well for small to mid-sized models
- No model code changes required
- Best for large data sets and small models

❌ Disadvantages:
- Inefficient for very large models (can't fit on one GPU)
- All-reduce on gradients becomes a bottleneck at high scale
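The mechanics described above can be sketched without any GPU. The toy below is a minimal illustration in plain Python, not how DeepSpeed or DDP implement it: the "replicas" are just list entries, `all_reduce_mean` is a hypothetical stand-in for the real collective, and the model is a one-parameter linear regression chosen for brevity.

```python
# Toy data parallelism: every "replica" holds the same weight w,
# sees a different slice of the batch, and gradients are averaged.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def split_batch(xs, ys, world_size):
    """Give each replica an equal, contiguous slice of the batch."""
    per_rank = len(xs) // world_size
    return [
        (xs[r * per_rank:(r + 1) * per_rank], ys[r * per_rank:(r + 1) * per_rank])
        for r in range(world_size)
    ]

def all_reduce_mean(values):
    """Hypothetical stand-in for an all-reduce that averages across ranks."""
    return sum(values) / len(values)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Single-device reference: gradient over the full batch.
full_grad = grad_mse(w, xs, ys)

# Data parallel: local gradients on each shard, then one sync step.
local_grads = [grad_mse(w, sx, sy) for sx, sy in split_batch(xs, ys, world_size=2)]
synced_grad = all_reduce_mean(local_grads)

print(full_grad, synced_grad)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why no model code has to change. The cost is the sync step itself: every replica must exchange a gradient the size of the model on every step.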


Let's see it in action. We'll use a smaller model (1B) because we aren't splitting the model this time.
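One detail worth keeping in mind when reading the config in the next cell: DeepSpeed ties `train_batch_size` to the number of GPUs through `train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size`. The helper below is a hypothetical illustration of that arithmetic (it is not part of DeepSpeed or `mutils`, and how the lab's own helper consumes the config is not shown here).

```python
def micro_batch_per_gpu(train_batch_size, gradient_accumulation_steps, world_size):
    """Per-GPU micro-batch implied by a DeepSpeed-style global batch size."""
    denom = gradient_accumulation_steps * world_size
    if train_batch_size % denom != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"gradient_accumulation_steps * world_size = {denom}"
        )
    return train_batch_size // denom

# Same values as the lab config: train_batch_size=2, gradient_accumulation_steps=1.
for world_size in (1, 2):
    mb = micro_batch_per_gpu(
        train_batch_size=2, gradient_accumulation_steps=1, world_size=world_size
    )
    print(f"world_size={world_size}: micro-batch per GPU = {mb}")
```

In other words, with a fixed global batch size, adding GPUs shrinks the slice each GPU processes rather than growing the total amount of work per step.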

In [31]:
import importlib
import src.utils.model_utils as mutils
importlib.reload(mutils)
os.environ["DEEPSPEED_LOG_LEVEL"] = "error"

# DeepSpeed config as a dict (ZeRO stage 0 disables ZeRO sharding, i.e. plain data parallelism)
ds_config = {
    "train_batch_size": 2,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 0}
}

# Running with 1 GPU
result = mutils.run_deepspeed_inference(model_name="NousResearch/Llama-3.2-1B", ds_config=ds_config, world_size=1)
mutils.reset_distributed_and_clear_memory()
print(result)

# Copy the model to 2 GPUs and split the data
result = mutils.run_deepspeed_inference(model_name="NousResearch/Llama-3.2-1B", ds_config=ds_config)
mutils.reset_distributed_and_clear_memory()
print(result)

[2025-05-05 20:59:09,466] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)

🔁 Running batch size = 10
[
  {
    "batch_size": 10,
    "world_size": 1,
    "avg_time_seconds": 0.025810943603515626,
    "local_gflops": 40218.75805650925,
    "aggregated_gflops": 40218.75805650925,
    "total_flops": 1038084096000,
    "estimated_memory_bytes": 2474659840,
    "arithmetic_intensity": 419.4855709946786,
    "cost_per_1m_tokens": 0.08675344933403863
  }
]
[2025-05-05 20:59:16,813] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-05 20:59:16,844] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)

🔁 Running batch size = 10
[
  {
    "batch_size": 10,
    "world_size": 2,
    "avg_time_seconds": 0.6220800170898437,
    "local_gflops": 41003.100108126644,
    "aggregated_gflops": 41003.100108126644,
    "total_flops": 25507209216000,
    "estimated_memor

Great! As you can see, distributing the data lets us process our workload much faster. This is similar to how we previously used batching to improve price/performance. But once the model is too large to fit on a single GPU, this strategy no longer works, and we run into the same issues as before.
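To put rough numbers on when a model stops fitting on one GPU, a back-of-the-envelope check is to compare the weight footprint against a single device's memory. The sketch below assumes nominal parameter counts read off the model names (1B and 13B), 2 bytes per parameter for 16-bit weights, and a hypothetical 24 GB GPU. It ignores activations, KV cache, and framework overhead, so real usage is higher.

```python
BYTES_PER_PARAM_16BIT = 2          # fp16 / bf16 weights
GPU_MEMORY_BYTES = 24 * 10**9      # hypothetical 24 GB accelerator

def weight_bytes(num_params, bytes_per_param=BYTES_PER_PARAM_16BIT):
    """Memory needed just to hold the weights."""
    return num_params * bytes_per_param

def fits_on_one_gpu(num_params, gpu_bytes=GPU_MEMORY_BYTES):
    """True if the weights alone fit in one GPU's memory."""
    return weight_bytes(num_params) <= gpu_bytes

# Nominal sizes implied by the model names; exact counts differ slightly.
for name, params in [("1B", 1 * 10**9), ("13B", 13 * 10**9)]:
    gb = weight_bytes(params) / 10**9
    print(f"{name}: ~{gb:.0f} GB of weights, fits on one GPU: {fits_on_one_gpu(params)}")
```

Under these assumptions the 13B model cannot be replicated on every GPU, so the model itself has to be split.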

In [15]:
mutils.reset_distributed_and_clear_memory()

✅ Distributed env torn down and memory cleared.


#### Tensor Parallelism
**How it works:**
Individual layers are sharded across GPUs, e.g., by splitting the weight matrices of linear layers along rows or columns.

✅ Advantages:
- Enables sharding of very large models/layers
- Reduces per-GPU memory usage
- Exploits fine-grained parallelism within layers

❌ Disadvantages:
- Requires deep model rewrites or tools like DeepSpeed/FSDP
- Requires custom communication (e.g., all_gather, reduce_scatter)
- Collective comms (e.g., NCCL) can dominate runtime if not optimized
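The row/column split can be illustrated on a single linear layer. The following is a minimal sketch in plain Python with made-up matrices, not DeepSpeed's implementation: each list entry plays the role of one GPU's shard, list concatenation stands in for an all-gather, and element-wise summation stands in for an all-reduce.

```python
def matmul(a, b):
    """Plain-Python matrix multiply for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_cols(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

x = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                 # activations: (batch=2, hidden=4)
w = [[1, 0, 2, 1],
     [0, 1, 1, 2],
     [3, 1, 0, 0],
     [1, 2, 1, 3]]                 # weights: (hidden=4, out=4)

reference = matmul(x, w)           # what a single GPU would compute

# Column parallel: each rank owns a column block of W and produces a
# column block of the output; an all-gather would concatenate them.
col_outputs = [matmul(x, w_shard) for w_shard in split_cols(w, 2)]
column_parallel = [left + right for left, right in zip(*col_outputs)]

# Row parallel: each rank owns a row block of W and the matching column
# block of X; an all-reduce would sum the partial outputs.
partials = [matmul(xs, ws) for xs, ws in zip(split_cols(x, 2), split_rows(w, 2))]
row_parallel = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(*partials)]

print(reference == column_parallel == row_parallel)  # prints True
```

Both splits reproduce the single-device result while each rank stores only half of `w`. The difference is in the communication: the column split needs a gather of output slices, while the row split needs a sum of full-size partial outputs.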

In [23]:
import json
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Nous-Hermes-Llama2-13b",
    seq_len=32,
    min_new_tokens=1,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding="tensor",
    world_size=2, # Number of GPUs
)

print(json.dumps(results, indent=2))

[2025-05-05 20:49:57,427] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-05 20:49:57,436] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)

🔁 Running batch size = 1
[
  {
    "batch_size": 1,
    "world_size": 2,
    "avg_time_seconds": 0.0659137954711914,
    "local_gflops": 6681.033493094344,
    "aggregated_gflops": 6681.033493094344,
    "total_flops": 440372275200,
    "estimated_memory_bytes": 13345280000,
    "arithmetic_intensity": 32.998354114713216,
    "cost_per_1m_tokens": 22.154359033372664
  }
]


#### Pipeline Parallelism
**How it works:**
Each GPU holds a different stage of the model. Batches are split into micro-batches and passed between GPUs sequentially.

✅ Advantages:
- Works well for extremely deep models
- Spreads compute and memory across GPUs
- Compatible with tensor parallelism for hybrid scaling

❌ Disadvantages:
- Latency due to pipeline bubbles (idle GPUs while others compute)
- Complex micro-batching & scheduling
- Harder to load balance if layers are uneven in cost
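The pipeline bubble can be quantified with a small simulation. The sketch below assumes a simple fill-and-drain schedule, forward pass only, with every stage taking one time step per micro-batch. Real frameworks use more elaborate schedules, so treat the numbers as an illustration of the trend rather than a prediction. The function names are hypothetical.

```python
def forward_schedule(num_stages, num_micro_batches):
    """Return schedule[t][s] = micro-batch run by stage s at step t, or None if idle."""
    total_steps = num_micro_batches + num_stages - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            mb = t - s                      # stage s starts s steps late
            row.append(mb if 0 <= mb < num_micro_batches else None)
        schedule.append(row)
    return schedule

def bubble_fraction(schedule):
    """Share of (step, stage) slots in which a stage sits idle."""
    slots = [slot for row in schedule for slot in row]
    return slots.count(None) / len(slots)

for m in (1, 4, 16):
    sched = forward_schedule(num_stages=4, num_micro_batches=m)
    print(f"micro_batches={m:2d}  bubble={bubble_fraction(sched):.2f}")
```

Under these assumptions the idle share works out to `(stages - 1) / (micro_batches + stages - 1)`, so 4 stages with a single batch leave GPUs idle 75% of the time, while 16 micro-batches bring that down to about 16%. This is why micro-batching matters, and also why it adds scheduling complexity.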