# Module 2
## Lab 2: Back to Libraries

Now that we understand how sharding is done, we can look at how libraries use it to optimize model inference and training across multiple accelerators. If you want a deeper dive on the individual sharding patterns that make up parallelism strategies, and the collectives associated with them, we highly recommend [this chapter](https://jax-ml.github.io/scaling-book/sharding/) of **How to Scale Your Model**.

In this notebook, though, we will focus on the different parallelism strategies that these libraries implement, and the considerations that come with each.

### Parallelism Patterns
This lab is a hands-on walkthrough of common parallelism patterns. If you want an academic deep dive, we suggest you read through [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=high_level_overview).

> Note: We will be using DeepSpeed here, but this could just as easily be done with any distributed inference/training framework, as they have similar interfaces.

We will be covering the following:
- Data Parallelism
- Tensor Parallelism (as we covered in the last chapter)
- Pipeline Parallelism
- ZeRO-3/FSDP (these are interchangeable in most contexts)

There are other strategies as well, and more will likely emerge, but these are the main ones, and the others follow similar concepts. Throughout the first part of this module we'll still be using a small model to demonstrate the different types of parallelism. At the end we will combine these strategies to run a much larger model on the same instance!

The concepts we cover in this module can scale to thousands of GPUs.

Let's start by installing our libraries.


In [1]:
import os
parent_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
os.chdir(parent_dir)

In [2]:
%pip install transformers accelerate deepspeed

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


Now we will run through the different types of parallelism. Keep an eye out for `OUTPUT BREAKDOWN` to see what the token generation and cost look like. You may see some warnings/errors, but these are benign and can be ignored.
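
To make the cost figure in the breakdown easier to interpret, here is a minimal sketch of how a cost per million tokens can be derived from a token count and an elapsed time. The function name and the hourly price are hypothetical, and the lab's helper may compute its number differently (for example, by accounting for the number of GPUs).

```python
def cost_per_million_tokens(tokens: int, elapsed_s: float, hourly_price_usd: float) -> float:
    """Dollar cost to process one million tokens at the measured throughput."""
    if tokens <= 0 or elapsed_s <= 0:
        raise ValueError("tokens and elapsed_s must be positive")
    tokens_per_second = tokens / elapsed_s
    price_per_second = hourly_price_usd / 3600.0
    return price_per_second / tokens_per_second * 1_000_000


# Hypothetical inputs: 32 tokens in 0.519 s on an instance billed at $1.00/hour.
cost = cost_per_million_tokens(tokens=32, elapsed_s=0.519, hourly_price_usd=1.00)
print(f"Cost/1M tokens: ${cost:.2f}")

# Doubling throughput at the same price halves the cost.
cheaper = cost_per_million_tokens(tokens=64, elapsed_s=0.519, hourly_price_usd=1.00)
print(f"Cost/1M tokens at 2x throughput: ${cheaper:.2f}")
```

The takeaway is that any strategy that raises tokens per second on the same hardware lowers the cost per token proportionally.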

## Definitions
Before we get started, here is a reference key for the images we'll use to visualize each type of parallelism, along with its advantages and disadvantages.

![](./assets/appendix.png)

- The A matrix represents the input data; its size reflects the sequence length and batch size
- The B matrix represents the parameters of the model
- The training states represent the additional memory needed for the forward/backward pass in training; in inference this isn't relevant, because you essentially only have the forward pass
- We will be using GPUs, but this could work with Neuron devices or any other accelerator
- We won't be using 2x2 topologies, but in reality, that's how you'd model your GPU topology when combining multiple types of parallelism

With that out of the way, let's jump into data parallelism.
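
As a rough illustration of the 2x2 topology mentioned above, the sketch below maps four ranks onto a grid with one data-parallel axis and one tensor-parallel axis. The `build_mesh` helper and the rank ordering are assumptions made for illustration; real frameworks choose their own layouts.

```python
from itertools import product


def build_mesh(dp: int, tp: int) -> dict[int, tuple[int, int]]:
    """Map each global rank to (dp_index, tp_index), keeping TP ranks adjacent."""
    return {d * tp + t: (d, t) for d, t in product(range(dp), range(tp))}


dp_size, tp_size = 2, 2
mesh = build_mesh(dp_size, tp_size)
for rank, (d, t) in sorted(mesh.items()):
    print(f"rank {rank}: model replica {d}, tensor shard {t}")

# Ranks holding different shards of the same replica communicate for TP.
tp_groups = [[r for r, (d, _) in mesh.items() if d == g] for g in range(dp_size)]
# Ranks holding the same shard in different replicas sync gradients for DP.
dp_groups = [[r for r, (_, t) in mesh.items() if t == g] for g in range(tp_size)]
print("TP groups:", tp_groups)
print("DP groups:", dp_groups)
```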

#### Data Parallel
![](./assets/dp.png)
**How it works:**
Each GPU has a full copy of the model. Batches are split across GPUs. Gradients are synced after the backward pass. As you can see, this effectively just splits the input data. If your model is too large, or the training states take up too much memory, this won't reduce your memory footprint. It's best used as a tool to improve throughput.
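
The gradient sync described above can be sketched on a single process with plain Python, no GPUs required. This is a toy model (one weight, mean squared error) with made-up data, where each "rank" is just a slice of the batch; it shows that averaging the per-rank gradients reproduces the full-batch gradient when the slices are the same size.

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5                                   # every rank holds the same weight
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]                # toy targets
world_size = 4
per_rank = len(xs) // world_size

# Each rank computes a gradient on its own slice of the batch.
local_grads = [
    grad_mse(w, xs[r * per_rank:(r + 1) * per_rank], ys[r * per_rank:(r + 1) * per_rank])
    for r in range(world_size)
]

# Stand-in for the all-reduce: average the local gradients across ranks.
synced_grad = sum(local_grads) / world_size
full_grad = grad_mse(w, xs, ys)
print(synced_grad, full_grad)
```

Note that the weight `w` is replicated on every rank, which is why this pattern does nothing for memory.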

✅ Advantages:
- Easy to implement (e.g., torch.nn.DataParallel, DDP)
- Scales well for small to mid-sized models
- No model code changes required
- Best for large data sets and small models

❌ Disadvantages:
- Does not work for very large models (the full model must fit on one GPU)
- All-reduce on gradients becomes a bottleneck at high scale


Let's see it in action. We'll use a smaller model (1B) because we aren't splitting the model this time.

> Note: DeepSpeed applies data parallelism automatically when you provide a basic configuration and multiple GPUs, as it's a standard optimization technique.

In [3]:
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

# Generate a DeepSpeed config with key overrides only
# Using batch size 2, grad accumulation 2, and no fp16
ds_config = mutils.make_ds_config(
    fp16=False
)

# Run single-GPU benchmark (no sharding)
results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=8,
    batch_sizes=[4],
    dtype=torch.bfloat16,
    sharding=False,
    world_size=1,
    ds_config=ds_config
)

mutils.reset_distributed_and_clear_memory()

# Run multi-GPU benchmark with sharding across 4 GPUs with a bigger batch size
results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=32,
    batch_sizes=[16],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=4,
    ds_config=ds_config
)

mutils.reset_distributed_and_clear_memory()


[2025-05-26 21:56:57,018] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 21:57:00,840] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
🔁 Running batch size = 4
Batch=4|Seq=8 Elapsed=0.5190s TFLOPs=1.5 AI=22.86
--- Breakdown ---
Tokens: 32, Time: 0.519s, Cost/1M: $5.45082
Cleared distributed env and caches.
[2025-05-26 21:57:12,862] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 21:57:13,092] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 21:57:13,169] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 21:57:13,190] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 21:57:15,273] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-26 21:57:15,469] [INFO

Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...


[2025-05-26 21:57:20,367] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-05-26 21:57:20,368] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.3 GB         CA 2.3 GB         Max_CA 2 GB 
[2025-05-26 21:57:20,368] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 7.54 GB, percent = 4.2%
[2025-05-26 21:57:20,369] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
Parameter Offload: Total persistent parameters: 67584 in 33 params
[2025-05-26 21:57:22,768] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-05-26 21:57:22,769] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 2.3 GB         CA 2.3 GB         Max_CA 2 GB 
[2025-05-26 21:57:22,769] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 10.25 GB, percent = 5.6%
[2025-05-26 21:57:22,897] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2025-05-26 21:57:22,897

ERROR:root:Benchmark failed: cannot reshape tensor of 0 elements into shape [0, 23, -1, 64] because the unspecified dimension size -1 can be any value and is ambiguous
ERROR:root:Benchmark returned None
ERROR:root:Benchmark failed: cannot reshape tensor of 0 elements into shape [0, 25, -1, 64] because the unspecified dimension size -1 can be any value and is ambiguous
ERROR:root:Benchmark failed: cannot reshape tensor of 0 elements into shape [0, 26, -1, 64] because the unspecified dimension size -1 can be any value and is ambiguous
ERROR:root:Benchmark failed: cannot reshape tensor of 0 elements into shape [0, 22, -1, 64] because the unspecified dimension size -1 can be any value and is ambiguous


Cleared distributed env and caches.


Great! Distributing the data across GPUs lets us process our workload much faster, much like how we previously used batching to improve price/performance. But once our model is large enough, this strategy will no longer work, and we'll run into the same issues as before. So we'll have to find ways to reduce the per-GPU footprint of the model, the training states, or (in most cases) both.

#### ZeRO-3/FSDP
![](./assets/zero.png)
**How it works:**
ZeRO-3 (and FSDP) shards all model states, including parameters, gradients, and optimizer states, across GPUs. It avoids redundancy by ensuring that no GPU holds a full copy of the model. Parameters are temporarily reconstructed using AllGather at compute time.
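
To get a feel for what this sharding saves, here is a back-of-the-envelope estimate of per-GPU model-state memory. It assumes mixed-precision training with Adam, using the common accounting of 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for fp32 optimizer state (master weights, momentum, variance). The function name and the 1B parameter count are illustrative, and activations and temporary buffers are not included.

```python
def model_state_gb(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Approximate per-GPU memory (GB) for parameters, gradients, and optimizer state."""
    params, grads, optim = 2.0, 2.0, 12.0   # bytes per parameter (assumed)
    if zero_stage >= 1:
        optim /= num_gpus                   # stage 1 shards optimizer state
    if zero_stage >= 2:
        grads /= num_gpus                   # stage 2 also shards gradients
    if zero_stage >= 3:
        params /= num_gpus                  # stage 3 also shards parameters
    return num_params * (params + grads + optim) / 1e9


num_params = 1e9                            # illustrative 1B-parameter model
for stage in range(4):
    gb = model_state_gb(num_params, num_gpus=4, zero_stage=stage)
    print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```

Stage 0 here corresponds to plain data parallelism, where every GPU carries the full set of model states.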

✅ Advantages:
- Shards the entire model, including non-linear layers, embeddings, and optimizer states
- Enables training models that exceed per-GPU memory
- Integrates with existing PyTorch models via FSDP or DeepSpeed ZeRO

❌ Disadvantages:
- Requires an AllGather of each layer's full parameters before its forward and backward computation
- Communication-intensive (especially with many GPUs)
- Can be slower without high-bandwidth interconnect (e.g., NVLink or InfiniBand)

> We will often stack ZeRO-3/FSDP with other strategies like Tensor and Data Parallelism. For example, we may shard linear layers with TP while still using ZeRO-3 to handle the remaining memory overhead from unsharded layers and optimizer state. This allows us to scale across both memory and compute bottlenecks.
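
For orientation, the sketch below shows roughly what a ZeRO-3 DeepSpeed configuration can look like when written out as a plain dictionary. The keys are standard DeepSpeed config options, but the values are assumptions chosen for illustration; what `mutils.make_ds_config()` actually generates is not shown in this lab and may differ.

```python
import json

# Illustrative ZeRO-3 config; values are assumptions, not the lab's defaults.
zero3_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 2,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                               # shard params, grads, optimizer state
        "overlap_comm": True,                     # overlap communication with compute
        "reduce_bucket_size": 500_000_000,
        "stage3_prefetch_bucket_size": 50_000_000,
    },
}

print(json.dumps(zero3_config, indent=2))
```

A dictionary like this is typically handed to `deepspeed.initialize` through its `config` argument, or saved as JSON and referenced from the launcher.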

In [4]:
import json
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

seq_len = 32
min_new_tokens = 1
world_size = 2

# Generate DeepSpeed config with FP16 enabled and default optimizer settings
ds_config = mutils.make_ds_config()

# Run benchmark with the generated config
results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=seq_len,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=world_size,
    ds_config=ds_config
)

[2025-05-26 21:57:55,636] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 21:57:55,651] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 21:57:57,968] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-26 21:57:57,968] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-26 21:57:58,030] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-26 21:57:58,491] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-05-26 21:57:58,622] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.2, git-hash=unknown, git-branch=unknown
[2025-05-26 21:57:58,622] [INFO] [comm.py:677:init_distributed] Distributed backend already initialized
[2025-05-26 21:57:58,622] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-05-26 21:57:58,855] [INFO] [logging.py:128:log_dist]

Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module cpu_adam...
Loading extension module cpu_adam...


[2025-05-26 21:58:02,492] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-05-26 21:58:02,492] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.79 GB         CA 2.79 GB         Max_CA 3 GB 
[2025-05-26 21:58:02,493] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.15 GB, percent = 3.4%
[2025-05-26 21:58:02,493] [INFO] [stage3.py:168:__init__] Reduce bucket size 500000000
[2025-05-26 21:58:02,493] [INFO] [stage3.py:169:__init__] Prefetch bucket size 50000000
[2025-05-26 21:58:02,604] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-05-26 21:58:02,604] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.3 GB         CA 2.79 GB         Max_CA 3 GB 
[2025-05-26 21:58:02,604] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.15 GB, percent = 3.4%
[2025-05-26 21:58:02,606] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
Parameter Offload: T

#### Tensor Parallelism
![](./assets/tp.png)
**How it works:**
Individual layers are sharded across GPUs, e.g., by splitting the rows or columns of the weight matrices in linear layers. Typically this requires a custom implementation of the model for parallelism. Frameworks like PyTorch and DeepSpeed usually provide this for popular models, but keep it in mind when using cutting-edge models or creating your own: if the architecture is unique, the model definition will need to account for it.
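
The row/column split can be checked by hand with small matrices. The sketch below uses plain Python lists and made-up values, with each "GPU" represented by a shard of the weight matrix B; it shows that a column split (outputs concatenated, as an all-gather would do) and a row split (partial outputs summed, as an all-reduce would do) both reproduce the unsharded result.

```python
def matmul(a, b):
    """Plain matrix multiply for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


A = [[1, 2, 3, 4], [5, 6, 7, 8]]            # activations: batch 2 x hidden 4
B = [[1, 0], [0, 1], [2, 1], [1, 2]]        # weights: hidden 4 x out 2
full = matmul(A, B)

# Column parallel: each shard owns one column of B and sees all of A.
col_shards = [[[row[j]] for row in B] for j in range(2)]
col_outputs = [matmul(A, shard) for shard in col_shards]
# Concatenate the per-shard outputs along the column axis (all-gather).
col_parallel = [sum((out[i] for out in col_outputs), []) for i in range(len(A))]

# Row parallel: each shard owns half the rows of B and the matching columns of A.
b_shards = [B[:2], B[2:]]
a_shards = [[row[:2] for row in A], [row[2:] for row in A]]
partials = [matmul(a, b) for a, b in zip(a_shards, b_shards)]
# Sum the partial outputs element-wise (all-reduce).
row_parallel = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

print(full, col_parallel, row_parallel)
```

In both cases each shard stores only half of B, which is where the per-GPU memory saving comes from.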

✅ Advantages:
- Enables sharding of very large models/layers
- Reduces per-GPU memory usage
- Exploits fine-grained parallelism within layers
- Can need less collective communication (CC) than CC-heavy DP
- Reduces per-GPU compute

❌ Disadvantages:
- Requires deep model rewrites or tools like DeepSpeed/FSDP
- Requires custom communication (e.g., all_gather, reduce_scatter)
- CC (e.g., over NCCL) can dominate runtime if it is not optimized or if TP is scaled too high

> We will stack types of parallelism on top of each other, since any one of them alone may not be enough to fit the model in memory. For example, we stack DP and TP in this case, and you will see DP and TP combined moving forward as well.

In [5]:
import json
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

seq_len = 32
min_new_tokens = 1
world_size = 4

# Generate DeepSpeed config with FP16 and tensor parallelism across 4 GPUs
ds_config = mutils.make_ds_config(
    tensor_parallel=world_size
)

# Run benchmark using tensor parallelism
results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=seq_len,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=world_size,
    ds_config=ds_config
)


[2025-05-26 21:58:25,266] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 21:58:25,324] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 21:58:25,349] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 21:58:25,364] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 21:58:27,582] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-26 21:58:27,582] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-26 21:58:27,856] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-26 21:58:27,929] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-26 21:58:27,934] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-26 21:58:28,412] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
[2025-05-2

Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...


[2025-05-26 21:58:32,488] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-05-26 21:58:32,488] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.79 GB         CA 2.79 GB         Max_CA 3 GB 
[2025-05-26 21:58:32,489] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 7.61 GB, percent = 4.2%
[2025-05-26 21:58:32,489] [INFO] [stage3.py:168:__init__] Reduce bucket size 500000000
[2025-05-26 21:58:32,489] [INFO] [stage3.py:169:__init__] Prefetch bucket size 50000000
[2025-05-26 21:58:32,595] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-05-26 21:58:32,596] [INFO] [utils.py:782:see_memory_usage] MA 2.3 GB         Max_MA 2.3 GB         CA 2.79 GB         Max_CA 3 GB 
[2025-05-26 21:58:32,596] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 7.61 GB, percent = 4.2%
[2025-05-26 21:58:32,597] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4


KeyboardInterrupt: 

This tensor parallelism is exactly the same as what we demonstrated in the last lab. It allows us to launch a much larger model by utilizing more GPUs.

#### Pipeline Parallelism
![](./assets/pp.png)
**How it works:**
Each GPU holds a different stage of the model. Batches are split into micro-batches and passed between GPUs sequentially.
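
To visualize how micro-batches flow through the stages, the sketch below builds a simplified forward-only timeline in which stage `s` works on micro-batch `m` at time step `s + m`. Both helper functions are illustrative and ignore the backward pass and unequal stage costs; the idle-time formula `(stages - 1) / (stages + micro_batches - 1)` is the usual estimate for this kind of simple schedule.

```python
def pipeline_schedule(num_stages: int, num_microbatches: int) -> list[list[str]]:
    """Forward-only timeline: one row per stage, one column per time step."""
    steps = num_stages + num_microbatches - 1
    timeline = [["--"] * steps for _ in range(num_stages)]
    for s in range(num_stages):
        for m in range(num_microbatches):
            timeline[s][s + m] = f"m{m}"      # stage s handles micro-batch m
    return timeline


def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Share of time slots in which a stage sits idle."""
    return (num_stages - 1) / (num_stages + num_microbatches - 1)


timeline = pipeline_schedule(num_stages=4, num_microbatches=4)
for stage, row in enumerate(timeline):
    print(f"GPU {stage}: " + " ".join(row))

idle = sum(slot == "--" for row in timeline for slot in row)
total = sum(len(row) for row in timeline)
print(f"idle share with 4 micro-batches:  {idle / total:.3f}")
print(f"idle share with 16 micro-batches: {bubble_fraction(4, 16):.3f}")
```

Using more micro-batches per batch shrinks the idle share, which is why micro-batching is central to this strategy.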

✅ Advantages:
- Works well for extremely deep models
- Spreads compute and memory across GPUs
- Compatible with tensor parallelism for hybrid scaling
- Reduces the collective communication (CC) needed for parallelism

❌ Disadvantages:
- Latency due to pipeline bubbles (idle GPUs while others compute)
- Complex micro-batching & scheduling
- Harder to load balance if layers are uneven in cost

> Note: We will be using DP, as DeepSpeed does by default, but no TP.

In [5]:
import json
import importlib
import torch
import src.utils.model_utils as mutils
importlib.reload(mutils)

seq_len = 32
world_size = 4

# Generate DeepSpeed config with FP16, tensor parallelism, and pipeline parallelism across 4 GPUs
ds_config = mutils.make_ds_config(
    tensor_parallel=world_size // 2,
    pipeline={
        "stages": world_size // 2,
        "partition_method": "parameters",
        "activation_checkpoint_interval": 1
    }
)

# Run benchmark using tensor + pipeline parallelism
results = mutils.run_distributed_benchmark(
    model_name="NousResearch/Llama-3.2-1B",
    seq_len=seq_len,
    batch_sizes=[1],
    dtype=torch.bfloat16,
    sharding=True,
    world_size=world_size,
    ds_config=ds_config
)

[2025-05-26 22:21:42,139] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 22:21:42,182] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 22:21:42,199] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 22:21:42,206] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-26 22:21:45,181] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-26 22:21:45,181] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-26 22:21:45,451] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-26 22:21:45,458] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-26 22:21:45,458] [INFO] [comm.py:652:init_distributed] cdb=None
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0): 0, ProcessCoor

Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/fused_adam/build.ninja...
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module fused_adam...
Loading extension module fused_adam...


ninja: no work to do.
Time to load fused_adam op: 0.040541648864746094 seconds
Time to load fused_adam op: 0.10106611251831055 seconds
[2025-05-26 22:21:45,981] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-05-26 22:21:46,113] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.2, git-hash=unknown, git-branch=unknown
[2025-05-26 22:21:46,113] [INFO] [comm.py:677:init_distributed] Distributed backend already initialized
[2025-05-26 22:21:46,113] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-05-26 22:21:46,320] [INFO] [engine.py:146:__init__] is_pipe_partitioned= False is_grad_partitioned= False
[2025-05-26 22:21:46,326] [INFO] [engine.py:146:__init__] is_pipe_partitioned= False is_grad_partitioned= False
[2025-05-26 22:21:46,339] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
ninja: no work to do.
Time to load fused_adam op: 0.03986167907714844 seconds
Time to load fused_a

Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Using /home/ec2-user/.cache/torch_extensions/py312_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py312_cu124/fused_adam/build.ninja...
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module fused_adam...
Loading extension module fused_adam...


[2025-05-26 22:21:46,854] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-05-26 22:21:46,854] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-05-26 22:21:46,856] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2025-05-26 22:21:46,856] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2025-05-26 22:21:46,856] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 1 optimizer
[2025-05-26 22:21:46,856] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
[2025-05-26 22:21:46,856] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2025-05-26 22:21:46,856] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2025-05-26 22:21:46,856] [INFO] [stage_1_and_2.py:152:__init__] R