# Module 2
## Lab 2: Back to Libraries

Now that we understand how sharding is done, we can look at how libraries use it to optimize model inference and training across multiple accelerators. If you want a deeper dive on the individual sharding patterns that make up parallelism strategies, and the collectives associated with them, we highly recommend [this chapter](https://jax-ml.github.io/scaling-book/sharding/) of **How to Scale Your Model**.

In this notebook, though, we will focus on the different parallelism strategies these libraries can implement, and the considerations that come with each.

### Parallelism Patterns
This lab is a hands-on walkthrough of common parallelism patterns. If you want an academic deep dive, we suggest reading through [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=high_level_overview).

> Note: we will be using DeepSpeed here, but this could just as easily be done with any distributed inference/training framework; they have similar interfaces.

We will be covering the following:
- Data Parallelism
- Tensor Parallelism (like we covered last chapter)
- Pipeline Parallelism
- ZeRO-3/FSDP (these are interchangeable in most contexts)

There are other strategies as well, and more will likely emerge, but these are the main ones, and the others follow similar concepts. Throughout the first part of this module we'll still be using a small model to demonstrate the different types of parallelism. At the end we will combine these strategies to run a much larger model on the same instance!

The concepts we cover in this module can scale to thousands of GPUs.

Let's start by installing our libraries.


In [1]:
import os
parent_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
os.chdir(parent_dir)

In [2]:
%pip install transformers accelerate deepspeed

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


Now we will run through the different types of parallelism. Keep an eye out for `OUTPUT BREAKDOWN` to see what the token generation and cost look like. You may see some warnings/errors, but these are benign and can be ignored.

## Definitions
Before we get started, here is a legend for the diagrams we'll use to visualize each type of parallelism, along with its advantages and disadvantages.

![](./assets/appendix.png)

- The A matrix represents the input data; its size reflects the sequence length and batch size
- The B matrix represents the parameters of the model
- The training states represent the additional memory needed for the forward/backward pass during training. In inference this isn't relevant, since you essentially only have the forward pass
- We will be using GPUs, but this could work with Neuron devices or any other accelerator
- We won't be using 2x2 topologies, but in practice that's how you'd lay out your GPU topology when combining multiple types of parallelism
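
To make the 2x2 topology idea concrete, here is a minimal sketch of how four ranks could be laid out on a data-parallel by tensor-parallel grid. The helper names (`mesh_coords`, `groups`) and the row-major layout, where tensor-parallel ranks sit next to each other, are assumptions for illustration; they are not part of this lab's utilities or of any particular framework.

```python
def mesh_coords(rank: int, tp_size: int) -> tuple[int, int]:
    """Map a flat rank to (dp_index, tp_index) on a row-major grid."""
    return divmod(rank, tp_size)


def groups(world_size: int, tp_size: int):
    """Return (tp_groups, dp_groups), each a list of rank lists."""
    dp_size = world_size // tp_size
    # Ranks that share a dp_index jointly hold one sharded copy of the model.
    tp_groups = [[dp * tp_size + tp for tp in range(tp_size)] for dp in range(dp_size)]
    # Ranks that share a tp_index hold the same shard and sync gradients with each other.
    dp_groups = [[dp * tp_size + tp for dp in range(dp_size)] for tp in range(tp_size)]
    return tp_groups, dp_groups


tp_groups, dp_groups = groups(world_size=4, tp_size=2)
print(tp_groups)  # [[0, 1], [2, 3]]
print(dp_groups)  # [[0, 2], [1, 3]]
```

Each row of the grid is one model replica split two ways, and each column is a set of GPUs that hold the same shard and see different data.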

With that out of the way, let's jump into data parallelism.

#### Data Parallel
![](./assets/dp.png)
**How it works:**
Each GPU has a full copy of the model. Batches are split across GPUs. Gradients are synced after the backward pass. As you can see, this effectively just splits the input data. If your model is too large, or the training states take up too much memory, this won't reduce your memory footprint. It is best used as a tool to improve throughput.

✅ Advantages:
- Easy to implement (e.g., torch.nn.DataParallel, DDP)
- Scales well for small to mid-sized models
- No model code changes required
- Best for large data sets and small models

❌ Disadvantages:
- Inefficient for very large models (can't fit on one GPU)
- All-reduce on gradients becomes a bottleneck at high scale
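
The gradient sync described above can be sketched without any GPUs. This is a toy illustration only: a one-parameter model, two simulated workers, and an `all_reduce_mean` helper that stands in for the real collective. All names here are made up for the example.

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values: list[float]) -> float:
    """Stand-in for an all-reduce that averages one value per worker."""
    return sum(values) / len(values)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Each simulated GPU holds the full model (w) and half of the batch.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
local_grads = [grad_mse(w, sx, sy) for sx, sy in shards]

synced = all_reduce_mean(local_grads)   # what every worker applies
full = grad_mse(w, xs, ys)              # single-GPU reference
print(local_grads, synced, full)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why data parallelism changes throughput but not the result. Note that every worker still stores the whole model, so memory per GPU is unchanged.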


Let's see it in action. We'll use a smaller model (1B) because we aren't splitting the model this time.

> Note: DeepSpeed does this automatically when you provide a basic configuration and multiple GPUs, as it's a standard optimization technique.

In [9]:
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

# Create a temporary deepspeed config file
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,

    "fp16": { "enabled": True },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },

    "zero_optimization": {
        "stage": 1
    },

    "replace_with_kernel_inject": False,
    "enable_cuda_graph": False
}
results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=8,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=1, # Number of GPUs
    ds_config=ds_config
)


results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=32,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=2, # Number of GPUs
    ds_config=ds_config
)
mutils.reset_distributed_and_clear_memory()

[2025-05-21 20:24:34,386] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-21 20:24:34,409] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-21 20:24:36,075] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-21 20:24:36,075] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-21 20:24:36,172] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-21 20:24:36,636] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-05-21 20:24:36,757] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.2, git-hash=unknown, git-branch=unknown
[2025-05-21 20:24:36,757] [INFO] [comm.py:677:init_distributed] Distributed backend already initialized
[2025-05-21 20:24:36,757] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-05-21 20:24:36,991] [INFO] [logging.py:128:log_dist]

Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/fused_adam/build.ninja...
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module fused_adam...
Loading extension module fused_adam...


[2025-05-21 20:24:37,521] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-05-21 20:24:37,521] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-05-21 20:24:37,523] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2025-05-21 20:24:37,523] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2025-05-21 20:24:37,523] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 1 optimizer
[2025-05-21 20:24:37,523] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
[2025-05-21 20:24:37,523] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2025-05-21 20:24:37,523] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2025-05-21 20:24:37,523] [INFO] [stage_1_and_2.py:152:__init__] R



Batch=1 | SeqLen=32
Elapsed GPU time: 0.2025s | TFLOP/s: 3.9 | AI: 22.86 FLOP/B
Batch=1 | SeqLen=32
Elapsed GPU time: 0.2030s | TFLOP/s: 3.9 | AI: 22.86 FLOP/B
--------- OUTPUT BREAKDOWN ---------
🧠 Tokens generated: 64
⚡ Throughput: 315.70004 tokens/sec
⏱️ Total time: 0.20272 sec
💸 Cost per 1M tokens: $1.06465
------------------------------------
✅ Distributed env torn down and memory cleared.


Great! As you can see, data parallelism lets us process our workload much faster. This is similar to how we previously used batching to improve price/performance. But once our model is large enough, this strategy will no longer work, and we'll run into the same issues as before. So we'll have to find ways to reduce either the model's per-GPU memory footprint or the training states (in most cases both).
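
If you want to sanity-check the `OUTPUT BREAKDOWN` numbers yourself, the arithmetic can be sketched as below. The hourly rate of $1.21 is an assumption inferred from the reported throughput and cost, not a value stated anywhere in this lab, and the function names are illustrative rather than part of `model_utils`.

```python
def throughput(tokens: int, seconds: float) -> float:
    """Tokens generated per second of wall-clock time."""
    return tokens / seconds


def cost_per_million_tokens(tokens_per_sec: float, hourly_rate_usd: float) -> float:
    """Dollars needed to generate one million tokens at a flat hourly rate."""
    dollars_per_sec = hourly_rate_usd / 3600
    return dollars_per_sec / tokens_per_sec * 1_000_000


HOURLY_RATE_USD = 1.21  # assumption: inferred from the logged output, not documented

tps = throughput(tokens=64, seconds=0.20272)
print(f"{tps:.1f} tokens/sec")
print(f"${cost_per_million_tokens(315.70004, HOURLY_RATE_USD):.5f} per 1M tokens")
```

Because the instance price is fixed, cost per token is inversely proportional to throughput: doubling tokens per second halves the cost.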

#### ZeRO-3/FSDP
![](./assets/zero.png)
**How it works:**
ZeRO-3 (and FSDP) shards all model states — including parameters, gradients, and optimizer states — across GPUs. It avoids redundancy by ensuring that no GPU holds a full copy of the model. Parameters are temporarily reconstructed using AllGather at compute time.
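
As a rough sketch of the shard-then-gather idea, the snippet below splits one layer's parameters across two simulated ranks and rebuilds them just before compute. `shard` and `all_gather` are toy stand-ins written for this example; real implementations operate on tensors and overlap the communication with compute.

```python
def shard(params: list[float], world_size: int) -> list[list[float]]:
    """Split a flat parameter list into equal contiguous shards, one per rank."""
    assert len(params) % world_size == 0, "pad parameters to a multiple of world_size"
    n = len(params) // world_size
    return [params[r * n:(r + 1) * n] for r in range(world_size)]


def all_gather(shards: list[list[float]]) -> list[float]:
    """Stand-in for AllGather: every rank ends up with the concatenation."""
    return [p for s in shards for p in s]


layer_params = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

shards = shard(layer_params, world_size=2)  # what each rank stores at rest
full = all_gather(shards)                   # materialized only while the layer runs
print([len(s) for s in shards], len(full))

# After the layer's forward (or backward) pass, each rank drops `full`
# and keeps only its own shard again.
```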

✅ Advantages:
- Shards the entire model, including non-linear layers, embeddings, and optimizer states
- Enables training models that exceed per-GPU memory
- Integrates with existing PyTorch models via FSDP or DeepSpeed ZeRO

❌ Disadvantages:
- Requires full parameter AllGather before each forward/backward step
- Communication-intensive (especially with many GPUs)
- Can be slower without high-bandwidth interconnect (e.g., NVLink or InfiniBand)
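
To see where the memory savings come from, here is a back-of-the-envelope sketch using the accounting commonly quoted for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. The function name and the round 1B parameter count are assumptions for illustration, and the estimate ignores activations, buffers, and CPU offload.

```python
def model_state_bytes_per_gpu(num_params: int, world_size: int, zero_stage: int) -> float:
    """Approximate per-GPU bytes for weights, gradients, and optimizer state."""
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 master weights + Adam momentum and variance
    if zero_stage >= 1:       # ZeRO-1 shards optimizer state
        optim /= world_size
    if zero_stage >= 2:       # ZeRO-2 also shards gradients
        grads /= world_size
    if zero_stage >= 3:       # ZeRO-3 also shards the weights themselves
        params /= world_size
    return params + grads + optim


GIB = 1024 ** 3
for stage in (0, 1, 2, 3):
    gib = model_state_bytes_per_gpu(1_000_000_000, world_size=2, zero_stage=stage) / GIB
    print(f"ZeRO-{stage}: {gib:.1f} GiB per GPU")
```

Only stage 3 makes the per-GPU footprint shrink in proportion to the number of GPUs, which is what allows models larger than a single device's memory.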

> We will often stack ZeRO-3/FSDP with other strategies like Tensor and Data Parallelism. For example, we may shard linear layers with TP while still using ZeRO-3 to handle the remaining memory overhead from unsharded layers and optimizer state. This allows us to scale across both memory and compute bottlenecks.

In [13]:
import json
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

seq_len = 32
min_new_tokens = 1
world_size = 2

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,

    "fp16": { "enabled": True },
    
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        # optional perf tweaks:
        "contiguous_gradients": True,
        "overlap_comm": True
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },

    "replace_with_kernel_inject": False,
    "enable_cuda_graph": False
}


results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=32,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=world_size, # Number of GPUs
    ds_config=ds_config
)

[2025-05-21 20:35:33,222] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-21 20:35:33,246] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-21 20:35:34,919] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-21 20:35:34,920] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-21 20:35:35,051] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-21 20:35:35,515] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-05-21 20:35:35,624] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.2, git-hash=unknown, git-branch=unknown
[2025-05-21 20:35:35,624] [INFO] [comm.py:677:init_distributed] Distributed backend already initialized
[2025-05-21 20:35:35,624] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-05-21 20:35:35,860] [INFO] [logging.py:128:log_dist]

Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module cpu_adam...
Loading extension module cpu_adam...


[2025-05-21 20:35:39,511] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-05-21 20:35:39,511] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.79 GB         CA 2.79 GB         Max_CA 3 GB 
[2025-05-21 20:35:39,512] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.62 GB, percent = 3.6%
[2025-05-21 20:35:39,512] [INFO] [stage3.py:168:__init__] Reduce bucket size 500000000
[2025-05-21 20:35:39,512] [INFO] [stage3.py:169:__init__] Prefetch bucket size 50000000
[2025-05-21 20:35:39,620] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-05-21 20:35:39,620] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.3 GB         CA 2.79 GB         Max_CA 3 GB 
[2025-05-21 20:35:39,620] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.62 GB, percent = 3.6%
[2025-05-21 20:35:39,622] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
Parameter Offload: T



Batch=1 | SeqLen=32
Elapsed GPU time: 1.5687s | TFLOP/s: 0.1 | AI: 22.85 FLOP/B
Batch=1 | SeqLen=32
Elapsed GPU time: 1.5746s | TFLOP/s: 0.1 | AI: 22.85 FLOP/B
--------- OUTPUT BREAKDOWN ---------
🧠 Tokens generated: 64
⚡ Throughput: 40.72200 tokens/sec
⏱️ Total time: 1.57163 sec
💸 Cost per 1M tokens: $8.25380
------------------------------------


#### Tensor Parallelism
![](./assets/tp.png)
**How it works:**
Individual layers are sharded across GPUs, e.g., by splitting the rows or columns of the weight matrices in linear layers. This typically requires a custom, parallelism-aware implementation of the model. Frameworks like PyTorch and DeepSpeed usually provide this for popular models, but keep it in mind when using cutting-edge models or creating your own: if the architecture is unique, the model definition will need to account for it.

✅ Advantages:
- Enables sharding of very large models/layers
- Reduces per-GPU memory usage
- Exploits fine-grained parallelism within layers
- Less collective communication (CC) than CC-heavy DP
- Reduces compute per GPU

❌ Disadvantages:
- Requires deep model rewrites or tools like DeepSpeed/FSDP
- Requires custom communication (e.g., all_gather, reduce_scatter)
- CC (e.g., over NCCL) can dominate runtime if it is not optimized or if TP is scaled too high
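
The row/column splitting can be checked on a tiny example. This is a pure-Python sketch with a hand-written `matmul` and two simulated GPUs; real frameworks do the same thing on tensors, with an all-gather or all-reduce in place of the list operations below.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                      # activations, 2 x 4
B = [[1, 0], [0, 1], [1, 1], [2, 0]]    # weights, 4 x 2

full = matmul(A, B)                     # single-GPU reference

# Column parallel: each GPU holds one column of B and sees all of A.
# The per-GPU outputs are concatenated (an all-gather).
B_cols = [[row[:1] for row in B], [row[1:] for row in B]]
col_parts = [matmul(A, b) for b in B_cols]
col_result = [left + right for left, right in zip(*col_parts)]

# Row parallel: each GPU holds half of B's rows and the matching half of A's columns.
# The partial products are summed (an all-reduce).
A_halves = [[row[:2] for row in A], [row[2:] for row in A]]
B_rows = [B[:2], B[2:]]
row_parts = [matmul(a, b) for a, b in zip(A_halves, B_rows)]
row_result = [[x + y for x, y in zip(r0, r1)] for r0, r1 in zip(*row_parts)]

print(full, col_result == full, row_result == full)
```

In both layouts each GPU stores only half of `B`, which is where the per-GPU memory saving comes from; the price is the extra collective after every sharded layer.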

> We will stack types of parallelism on top of each other, since on their own they may not be enough to fit the model in memory. For example, we will stack DP and TP in this case. You will see DP and TP combined moving forward as well.

In [11]:
import json
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

seq_len = 32
min_new_tokens = 1
world_size = 4

ds_config = {
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,

  "fp16": {
    "enabled": True
  },

  "tensor_parallel": {
    "enabled": True,
    "tp_size": 4
  },

  # TODO: the benchmark code currently requires an optimizer entry here; fix
  "optimizer": {
    "type": "CPUAdam",         
    "params": {
      "lr": 3e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },


  "replace_with_kernel_inject": True,
  "enable_cuda_graph": False,

  "wall_clock_breakdown": False,
  "steps_per_print": 1000000
}



results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=32,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=world_size, # Number of GPUs
    ds_config=ds_config
)

[2025-05-21 20:56:32,118] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-21 20:56:32,233] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-21 20:56:32,260] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-21 20:56:32,267] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-21 20:56:33,976] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-21 20:56:34,053] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-21 20:56:34,053] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-21 20:56:34,198] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-21 20:56:34,222] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-21 20:56:34,714] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
[2025-05-2

This tensor parallelism is exactly what we demonstrated in the last lab. It allows us to launch a much larger model by utilizing more GPUs.

#### Pipeline Parallelism
![](./assets/pp.png)
**How it works:**
Each GPU holds a different stage of the model. Batches are split into micro-batches and passed between GPUs sequentially.

✅ Advantages:
- Works well for extremely deep models
- Spreads compute and memory across GPUs
- Compatible with tensor parallelism for hybrid scaling
- Reduces the collective communication (CC) needed for parallelism

❌ Disadvantages:
- Latency due to pipeline bubbles (idle GPUs while others compute)
- Complex micro-batching & scheduling
- Harder to load balance if layers are uneven in cost
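
The pipeline bubble can be quantified with a simplified, forward-only fill-and-drain schedule in which every stage takes one time step per micro-batch. The helper names are hypothetical, and real schedules that interleave forward and backward passes will differ in detail.

```python
def pipeline_schedule(num_stages: int, num_microbatches: int) -> list[list[str]]:
    """Timeline per stage: stage s works on micro-batch m at step s + m."""
    steps = num_stages + num_microbatches - 1
    timeline = []
    for s in range(num_stages):
        row = []
        for t in range(steps):
            m = t - s
            row.append(f"m{m}" if 0 <= m < num_microbatches else "--")
        timeline.append(row)
    return timeline


def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Share of time slots in which a stage sits idle."""
    return (num_stages - 1) / (num_stages + num_microbatches - 1)


for row in pipeline_schedule(num_stages=4, num_microbatches=4):
    print(" ".join(row))

print(bubble_fraction(4, 1))                # 0.75
print(round(bubble_fraction(4, 16), 3))
```

With four stages and a single micro-batch, each GPU is idle three quarters of the time; adding micro-batches is the main lever for shrinking the bubble.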

> Note: We will be using DP, as DeepSpeed does by default, but no TP.

In [None]:
import json
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

seq_len = 32
world_size = 4

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,

    "fp16": { "enabled": True },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    }, # Ignore this for now
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        # optional perf tweaks:
        "contiguous_gradients": True,
        "overlap_comm": True
    },
    
    # Splitting the model across 4 GPUs
    "tensor_parallel": {
        "enabled": True,
        "tp_size": world_size
    },

    # pipeline parallel: split layers
    "pipeline": {
        "enabled": True,
        "stages": 4,                # 4 pipeline stages, one per GPU
        "partition_method": "parameters",
        "activation_checkpoint_interval": 1
    },

    "replace_with_kernel_inject": False,
    "enable_cuda_graph": False
}

results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Meta-Llama-3.1-8B",
    seq_len=32,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=world_size, # Number of GPUs
    ds_config=ds_config
)

[2025-05-16 19:06:59,197] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-16 19:06:59,217] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-16 19:06:59,294] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-16 19:06:59,295] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-16 19:07:01,097] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-16 19:07:01,097] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2025-05-16 19:07:01,249] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.0.170, master_port=29500
[2025-05-16 19:07:01,249] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[202

Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  7.98it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  7.99it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  6.94it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  6.47it/s]


SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0): 0, ProcessCoord(pipe=1, data=0): 1, ProcessCoord(pipe=2, data=0): 2, ProcessCoord(pipe=3, data=0): 3}
[2025-05-16 19:07:02,119] [INFO] [module.py:396:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=1
     0: LlamaModel
stage=1 layers=1
     1: Linear
stage=2 layers=0
stage=3 layers=0
  loss: CrossEntropyLoss
[2025-05-16 19:07:02,138] [INFO] [comm.py:677:init_distributed] Distributed backend already initialized
[2025-05-16 19:07:02,138] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-05-16 19:07:02,187] [INFO] [comm.py:677:init_distributed] Distributed backend already initialized
[2025-05-16 19:07:02,188] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-05-16 19:07:02,699] [INFO] [comm.py:677:init_distributed] Distributed backend already initialized
[2025-05-16 19:07:02,699] [INFO] [config.py:733:__ini

Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module cpu_adam...
[rank1]:[W516 19:07:07.988433556 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.


benchmark failed: optimizer got an empty parameter list
benchmark failed: optimizer got an empty parameter list


Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fba3c32e200>
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.12/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
    ^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/multiprocessing/util.py", line 303, in _run_finalizers
    finalizer()
  File "/usr/local/lib/python3.12/multiprocessing/util.py", line 227, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f20c8e32200>
Traceback (most recent call last):
  File "/home/ec2-user/.local/

Installed CUDA version 12.6 does not match the version torch was compiled with 12.4 but since the APIs are compatible, accepting this combination
ninja: no work to do.
Time to load cpu_adam op: 2.288069725036621 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000030, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
[2025-05-16 19:07:08,240] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-05-16 19:07:08,240] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-05-16 19:07:08,247] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2025-05-16 19:07:08,247] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2025-05-16 19:07:08,247] [INFO] [logging.py:128:log_dist] [Rank 0] Cr

[rank0]:[W516 19:07:17.938833174 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.


KeyboardInterrupt: 