# Module 2
## Lab 2: Back to Libraries

Now that we understand HOW sharding is done, we can look at how libraries use it to optimize model inference and training across multiple accelerators. If you want a deeper dive into the individual sharding patterns that make up parallelism strategies, and the collectives associated with them, we highly recommend [this chapter](https://jax-ml.github.io/scaling-book/sharding/) of **How to Scale your Model**.

In this notebook, though, we will focus on the different parallelism strategies these libraries implement and the trade-offs that come with each.

### Parallelism Patterns
This lab is a hands-on walkthrough of common parallelism patterns. If you want an academic deep dive, we suggest you read through [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=high_level_overview).

> Note: we will be using DeepSpeed here, but this could just as easily be done with any distributed inference/training framework, since they have similar interfaces.

We will be covering the following:
- Data Parallelism
- Tensor Parallelism (like we covered last chapter)
- Pipeline Parallelism
- ZeRO

There are other strategies as well, and likely more will emerge, but these are the main ones and others will follow similar concepts. 

Let's start by installing our libraries.


In [1]:
import os
parent_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
os.chdir(parent_dir)

In [2]:
%pip install transformers accelerate deepspeed

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


Now we will run through the different types of parallelism. Keep an eye out for `OUTPUT BREAKDOWN` to see what the token generation and cost look like. You may see some warnings/errors, but these are benign and can be ignored.

#### Data Parallelism
**How it works:**
Each GPU has a full copy of the model. Batches are split across GPUs. Gradients are synced after the backward pass.

✅ Advantages:
- Easy to implement (e.g., torch.nn.DataParallel, DDP)
- Scales well for small to mid-sized models
- No model code changes required
- Best for large data sets and small models

❌ Disadvantages:
- Inefficient for very large models (can't fit on one GPU)
- All-reduce on gradients becomes a bottleneck at high scale
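
The gradient sync described above can be illustrated without any GPUs. The following is a minimal sketch in plain Python, not how DeepSpeed or DDP is implemented: each "replica" holds the same weight, computes a gradient on its own slice of the batch, and a stand-in `all_reduce_mean` (a hypothetical helper, not a library call) averages the results. With equal-sized slices, the averaged gradient matches the gradient computed over the full batch.

```python
# Toy data parallelism: every replica has the same weight `w`
# and sees a different, equal-sized slice of the batch.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every replica ends up with the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
world_size = 4  # number of simulated GPUs

# Split the batch into one contiguous slice per replica.
shard = len(xs) // world_size
local_grads = [
    grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(world_size)
]

# After the "all-reduce", each replica holds the same averaged gradient.
synced = all_reduce_mean(local_grads)
full = grad_mse(w, xs, ys)
print("per-replica gradients:", local_grads)
print("synced gradient:", synced[0], "| full-batch gradient:", full)
```

Because every replica applies the same averaged gradient, the model copies stay identical after each step. The cost is that the all-reduce moves a gradient for every parameter on every step, which is where the bottleneck at high scale comes from.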


Let's see it in action. We'll use a smaller model (1B) because we aren't splitting the model this time.

In [11]:
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

# DeepSpeed config, passed to the benchmark helper as a Python dict
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,

    "fp16": { "enabled": True },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },

    "zero_optimization": {
        "stage": 1
    },

    "tensor_parallel": {
        "enabled": True,
        "tp_size": 4
    },

    "replace_with_kernel_inject": False,
    "enable_cuda_graph": False
}


# # Running with 1 GPU
# results = mutils.run_distributed_benchmark(
#     model_name="NousResearch/Llama-3.2-1B",
#     seq_len=32,
#     min_new_tokens=1,
#     batch_sizes=[100],
#     dtype=torch.bfloat16,
#     sharding=False,
#     world_size=1, # Number of GPUs
#     ds_config=ds_config
# )
# mutils.reset_distributed_and_clear_memory()

# Copy the model to 4 GPUs and split the data
results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=8,
    min_new_tokens=1,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=4, # Number of GPUs
    ds_config=ds_config
)
mutils.reset_distributed_and_clear_memory()

[2025-05-14 17:01:34,462] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-14 17:01:34,477] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-14 17:01:34,489] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-14 17:01:34,497] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-14 17:01:37,424] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.2, git-hash=unknown, git-branch=unknown
[2025-05-14 17:01:37,424] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-14 17:01:37,424] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
[2025-05-14 17:01:37,593] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.2, git-hash=unknown, git-branch=unknown
[2025-05-14 17:01:37,593] [INFO] [comm.py:652:init_distributed] cdb=None
[

Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/fused_adam/build.ninja...
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module fused_adam...
Loading extension module fused_adam...
Loading extension module fused_adam...
Loading extension module fused_adam...


[2025-05-14 17:01:38,700] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-05-14 17:01:38,700] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-05-14 17:01:38,702] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2025-05-14 17:01:38,702] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2025-05-14 17:01:38,702] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 1 optimizer
[2025-05-14 17:01:38,702] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
[2025-05-14 17:01:38,702] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2025-05-14 17:01:38,702] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2025-05-14 17:01:38,702] [INFO] [stage_1_and_2.py:152:__init__] R

Great! Splitting the data across GPUs lets us process a workload faster, much as we previously used batching to improve price/performance. But once the model is too large to fit on a single GPU, this strategy no longer works, and we run into the same memory limits as before.

#### Tensor Parallelism
**How it works:**
Individual layers are sharded across GPUs — e.g., split matrix rows/columns in linear layers.

✅ Advantages:
- Enables sharding of very large models/layers
- Reduces per-GPU memory usage
- Exploits fine-grained parallelism within layers

❌ Disadvantages:
- Requires deep model rewrites or tools like DeepSpeed or Megatron-LM
- Requires custom communication (e.g., all_gather, reduce_scatter)
- Collective comms (e.g., NCCL) can dominate runtime if not optimized
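
To make the row/column split concrete, here is a small sketch in plain Python using nested lists in place of tensors. It is an illustration under simplifying assumptions (one linear layer, no bias, two simulated devices), and the helper `matmul` and variable names are made up for this example. Splitting the weight by columns means each device produces a slice of the output that is concatenated (an all-gather); splitting by rows means each device produces a partial sum that is added together (an all-reduce).

```python
def matmul(a, b):
    """Plain-Python matrix multiply for small nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


x = [[1.0, 2.0, 3.0, 4.0]]            # 1 x 4 activation
W = [[1.0, 0.0, 2.0, 1.0],            # 4 x 4 weight of a linear layer
     [0.0, 1.0, 1.0, 3.0],
     [2.0, 1.0, 0.0, 1.0],
     [1.0, 2.0, 1.0, 0.0]]
tp_size = 2                           # number of simulated devices

# Column parallel: each device holds a block of W's columns and the full x.
cols = len(W[0]) // tp_size
col_shards = [[row[r * cols:(r + 1) * cols] for row in W] for r in range(tp_size)]
col_partials = [matmul(x, shard) for shard in col_shards]
# "All-gather": concatenate the output slices along the column axis.
col_out = [sum((p[i] for p in col_partials), []) for i in range(len(x))]

# Row parallel: each device holds a block of W's rows and the matching slice of x.
rows = len(W) // tp_size
row_shards = [W[r * rows:(r + 1) * rows] for r in range(tp_size)]
x_shards = [[xr[r * rows:(r + 1) * rows] for xr in x] for r in range(tp_size)]
row_partials = [matmul(xs, ws) for xs, ws in zip(x_shards, row_shards)]
# "All-reduce": sum the partial outputs element-wise.
row_out = [
    [sum(p[i][j] for p in row_partials) for j in range(len(W[0]))]
    for i in range(len(x))
]

reference = matmul(x, W)
print("reference:      ", reference)
print("column parallel:", col_out)
print("row parallel:   ", row_out)
```

Each device stores only `1 / tp_size` of the weight, which is the memory saving, but every sharded layer needs a collective to reassemble its output, which is why communication can dominate runtime.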

In [None]:
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

seq_len = 32          # prompt length in tokens
min_new_tokens = 1    # tokens to generate per prompt
world_size = 2        # number of GPUs to shard across
max_tokens = seq_len + min_new_tokens

ds_config = {
    "replace_with_kernel_inject": True,
    "enable_cuda_graph": False,
    # Split each layer's weights across `world_size` GPUs
    "tensor_parallel": {
        "enabled": True,
        "tp_size": world_size
    },
    # Optional: cap the total tokens (input + output) DeepSpeed plans for
    "max_tokens": max_tokens
}

results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Nous-Hermes-Llama2-13b",
    seq_len=seq_len,
    min_new_tokens=min_new_tokens,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=world_size, # Number of GPUs
    ds_config=ds_config
)
mutils.reset_distributed_and_clear_memory()

[2025-05-07 15:21:48,942] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-07 15:21:48,950] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)

🔁 Running batch size = 1


Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  7.48it/s]


[2025-05-07 15:21:51,329] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.2, git-hash=unknown, git-branch=unknown
[2025-05-07 15:21:51,329] [INFO] [logging.py:128:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2025-05-07 15:21:51,332] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-07 15:21:51,371] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 5120, 'intermediate_size': 13824, 'heads': 40, 'num_hidden_layers': -1, 'dtype': torch.bfloat16, 'pre_layer_norm': True, 'norm_type': <NormType.RMSNorm: 3>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': 128, 'rotate_half': True, 'rotate_every_two': False, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GATED_SILU: 4>, 'training_mp_size': 1, 'big

Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  7.31it/s]
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/transformer_inference/build.ninja...
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module transformer_inference...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/transformer_inference/build.ninja...
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Building extension module transformer_inference...
Allowing ninja to set a default number of

ninja: no work to do.
Time to load transformer_inference op: 0.020978927612304688 seconds
ninja: no work to do.
Time to load transformer_inference op: 0.02348494529724121 seconds
------------------------------------------------------
Free memory : 7.072205 (GigaBytes)  
Total memory: 22.045044 (GigaBytes)  
Requested memory: 0.015776 (GigaBytes) 
Setting maximum total tokens (input + output) to 33 
WorkSpace: 0x7f10c1400000 
------------------------------------------------------
Batch=1 | Seq=32+1
Elapsed GPU time: 0.0664s | TFLOP/s: 6.6 | AI: 33.00 FLOP/B
🧠 Tokens generated: 2
⚡ Throughput: 30.05825 tokens/sec
⏱️ Total time: 0.06654 sec
💸 Cost per 1M tokens: $11.18199


Traceback (most recent call last):
  File "/usr/local/lib/python3.12/multiprocessing/util.py", line 303, in _run_finalizers
    finalizer()
  File "/usr/local/lib/python3.12/multiprocessing/util.py", line 227, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/multiprocessing/util.py", line 303, in _run_finalizers
    finalizer()
  File "/usr/local/lib/python3.12/multiprocessing/util.py", line 227, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory


[
  {
    "batch_size": 1,
    "world_size": 2,
    "avg_time_seconds": 0.06635622406005859,
    "local_gflops": 6636.48785683498,
    "aggregated_gflops": 6636.48785683498,
    "total_flops": 440372275200,
    "estimated_memory_bytes": 13345280000,
    "arithmetic_intensity": 32.998354114713216,
    "cost_per_1m_tokens": 11.181992049225501
  }
]
✅ Distributed env torn down and memory cleared.



This tensor parallelism is exactly what we demonstrated in the last lab. It allows us to launch a much larger model by spreading it across more GPUs.
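
The cost figure in the output above can be sanity-checked from the throughput. The sketch below assumes cost is simply an hourly price divided by tokens per hour; `cost_per_million_tokens` is a hypothetical helper written for this example, not the function used in `model_utils`, and the hourly price of $1.21 is an assumption chosen because it is consistent with the reported numbers. The notebook does not state which price it uses or whether it is per GPU or per instance.

```python
def cost_per_million_tokens(tokens_per_sec, price_per_hour):
    """Dollars to generate 1M tokens at a steady throughput."""
    seconds_per_million = 1_000_000 / tokens_per_sec
    return price_per_hour * seconds_per_million / 3600


throughput = 30.05825        # tokens/sec, taken from the benchmark output
assumed_price = 1.21         # USD per hour: an assumption, not from the notebook

estimate = cost_per_million_tokens(throughput, assumed_price)
print(f"Estimated cost per 1M tokens: ${estimate:.2f}")
```

The relationship is inversely proportional: doubling throughput at the same hourly price halves the cost per token, which is why batching and parallelism matter for price/performance.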

#### Pipeline Parallelism
**How it works:**
Each GPU holds a different stage of the model. Batches are split into micro-batches and passed between GPUs sequentially.

✅ Advantages:
- Works well for extremely deep models
- Spreads compute and memory across GPUs
- Compatible with tensor parallelism for hybrid scaling

❌ Disadvantages:
- Latency due to pipeline bubbles (idle GPUs while others compute)
- Complex micro-batching & scheduling
- Harder to load balance if layers are uneven in cost
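
The pipeline bubble can be quantified with a simple schedule. This sketch assumes a forward-only pipeline in which every stage takes one time step per micro-batch, which is a simplification of real schedules; `pipeline_schedule` and `bubble_fraction` are illustrative names, not DeepSpeed APIs. Under that assumption, with `p` stages and `m` micro-batches the pipeline takes `m + p - 1` steps and each stage sits idle for `p - 1` of them.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Forward-only schedule: stage s processes micro-batch m at step s + m."""
    total_steps = num_stages + num_microbatches - 1
    grid = [["." for _ in range(total_steps)] for _ in range(num_stages)]
    for s in range(num_stages):
        for m in range(num_microbatches):
            grid[s][s + m] = str(m)
    return grid


def bubble_fraction(num_stages, num_microbatches):
    """Share of time slots in which a stage is idle."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# Each row is a GPU (stage), each column a time step; "." marks an idle slot.
for stage, row in enumerate(pipeline_schedule(num_stages=4, num_microbatches=8)):
    print(f"GPU {stage}: " + " ".join(row))

for m in (1, 4, 8, 32):
    print(f"micro-batches={m:>2} -> idle fraction {bubble_fraction(4, m):.2f}")
```

With a single micro-batch, only one GPU works at a time. Adding micro-batches shrinks the idle fraction, at the price of more complex scheduling.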

In [None]:
import json
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

seq_len = 32
min_new_tokens = 1
world_size = 2
max_tokens =  seq_len + min_new_tokens

ds_config = {
    "pipeline_parallel": {
        "enabled": True,
        "pp_size": world_size
    },
    "max_tokens": max_tokens
}

results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Nous-Hermes-Llama2-13b",
    seq_len=32,
    min_new_tokens=1,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding="tensor",
    world_size=2, # Number of GPUs
    ds_config=ds_config
)
mutils.reset_distributed_and_clear_memory()

[2025-05-07 15:39:39,445] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-07 15:39:39,471] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)

🔁 Running batch size = 1


Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  7.82it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  7.67it/s]


[2025-05-07 15:39:42,517] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.2, git-hash=unknown, git-branch=unknown
[2025-05-07 15:39:42,573] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.2, git-hash=unknown, git-branch=unknown




ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/home/ec2-user/environment/src/utils/model_utils.py", line 403, in _distributed_worker
    res = benchmark_batch_sizes(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/environment/src/utils/model_utils.py", line 57, in benchmark_batch_sizes
    elapsed_s, tokens_generated, metrics, cost = benchmark_llm(
                                                 ^^^^^^^^^^^^^^
  File "/home/ec2-user/environment/src/utils/model_utils.py", line 201, in benchmark_llm
    model = load_sharded_model(model_name, dtype, sharding, world_size, seq_len, min_new_tokens, ds_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/environment/src/utils/model_utils.py", line 144, in load_sharded_model
    model = deepspeed.init_inference(
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/.local/lib/python3.12/site-packages/deepspeed/__init__.py", line 362, in init_inference
    ds_inference_config = DeepSpeedInferenceConfig(**config_dict)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/.local/lib/python3.12/site-packages/deepspeed/runtime/config_utils.py", line 57, in __init__
    super().__init__(**data)
  File "/home/ec2-user/.local/lib/python3.12/site-packages/pydantic/main.py", line 253, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 2 validation errors for DeepSpeedInferenceConfig
pipeline_parallel
  Extra inputs are not permitted [type=extra_forbidden, input_value={'enabled': True, 'pp_size': 2}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/extra_forbidden
pp_engine
  Extra inputs are not permitted [type=extra_forbidden, input_value=True, input_type=bool]
    For further information visit https://errors.pydantic.dev/2.11/v/extra_forbidden

> Note: unlike the earlier warnings, this error is not benign, and the cell produces no benchmark results. The traceback shows `DeepSpeedInferenceConfig` rejecting the `pipeline_parallel` and `pp_engine` keys as extra inputs, so `deepspeed.init_inference` in this version (0.16.2) does not accept a pipeline-parallel setting through this config. DeepSpeed exposes pipeline parallelism for training through `deepspeed.pipe.PipelineModule`.
