# Assignment 6: Training Language Models Using DeepSpeed
## Overview
In this assignment, you will explore DeepSpeed, a powerful deep learning optimization library developed by Microsoft. DeepSpeed enables training of extreme-scale models through various optimization techniques including ZeRO (Zero Redundancy Optimizer), mixed precision training, model parallelism, and memory offloading. You will configure DeepSpeed for different scenarios and analyze the performance impacts.

## Objectives
- Understand DeepSpeed's optimization techniques
- Configure DeepSpeed for various performance optimization scenarios
- Measure and analyze performance metrics across different configurations
- Apply the knowledge to train larger models efficiently

## Environment Setup
1. Install the required package:
   ```
   pip install "deepspeed>=0.6.5"
   ```
2. Other typically required packages include PyTorch, `transformers`, and `accelerate`; the demonstration script in this notebook also imports `datasets`, `psutil`, and `pandas`.
3. Additional packages may be required depending on your implementation.

## Grading Criteria
- Correct understanding and explanation of DeepSpeed concepts (40%)
- Proper implementation and configuration of DeepSpeed (30%)
- Thorough analysis of experimental results with metrics (20%)
- Report quality and clarity (10%)

Successfully answering all questions with thorough explanations and corresponding experimental records, and submitting a well-structured report that demonstrates your reasoning will earn you 8/10 of the total score. There is no optional puzzle this time.

### Reference Materials:  
1. [DeepSpeed official documentation](https://deepspeed.readthedocs.io/en/latest/)  
2. [Hugging Face Accelerate DeepSpeed guide](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed)

### For any questions, please try the following in order of priority:  
1. Search for similar questions on Slack (https://app.slack.com/client/T088V95D8LC/C088L557RK8).  
2. Post a new question on Slack if it has not already been answered.  
3. For non-public questions, email Xiangyan Liu (e0950125@u.nus.edu) and Pengfei Zhou (e1374451@u.nus.edu) with the subject starting with "CS5260 2025 Spring".

## Tasks and Questions

For each of the following questions, you need to:
1. Explain the concept in detail
2. Demonstrate the configuration in a DeepSpeed config file
3. Run experiments and record relevant metrics (training speed, GPU memory usage, CPU utilization, etc.)
4. Analyze the results and provide insights

> **Note:** You do not need to train the models on the full dataset. Training for a short duration and documenting the necessary metrics is sufficient. Evaluation is not required.
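
One way to turn recorded step times into a comparable training-speed figure is tokens per second. The sketch below assumes a CSV with a `step_time` column holding seconds per optimizer step, which is the format the demonstration script in this notebook writes. The function name `tokens_per_second`, the warm-up handling, and the sample numbers are hypothetical.

```python
import csv
import io


def tokens_per_second(csv_text, micro_batch_size, seq_len, num_gpus,
                      grad_accum=1, warmup_steps=1):
    """Estimate throughput from a step-time log.

    Assumes each row's `step_time` is the wall-clock duration in seconds of
    one optimizer step and that every sequence is padded to `seq_len` tokens.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Skip the first steps, which often include one-off setup cost.
    times = [float(row["step_time"]) for row in rows[warmup_steps:]]
    if not times:
        raise ValueError("no step times left after skipping warm-up steps")
    avg_step = sum(times) / len(times)
    tokens_per_step = micro_batch_size * grad_accum * num_gpus * seq_len
    return tokens_per_step / avg_step


# Hypothetical log: three steps, the first one slow because of warm-up.
sample = "elapsed_time,step_time\n5.00,4.0\n6.00,1.0\n7.00,1.0\n"
print(tokens_per_second(sample, micro_batch_size=2, seq_len=1024, num_gpus=2))
```

Reporting tokens per second rather than raw step time keeps runs comparable when the batch size or the number of GPUs changes between configurations.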

### Question 1: ZeRO Optimization Stages
What are the different ZeRO optimization stages in DeepSpeed? What are the characteristics of each stage?

**Note:** Experiment with different ZeRO stages and measure their impact on memory usage and training speed.
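
As a rough guide to what the measurements might show, the sketch below estimates per-GPU memory for model states at each ZeRO stage. It assumes mixed-precision Adam with 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states), which is the accounting used in the ZeRO paper. Activations and temporary buffers are ignored, and the function name and the 410M parameter count are illustrative only.

```python
def zero_model_state_bytes(num_params, num_gpus, stage):
    """Approximate per-GPU bytes of model states under a given ZeRO stage.

    Assumes mixed-precision Adam: 2 bytes/param for fp16 weights, 2 for fp16
    gradients, and 12 for fp32 optimizer states (master weights, momentum,
    variance). Activations, buffers, and fragmentation are ignored.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError(f"unknown ZeRO stage: {stage}")
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        optim /= num_gpus   # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus   # stage 2 also partitions gradients
    if stage >= 3:
        params /= num_gpus  # stage 3 also partitions parameters
    return params + grads + optim


for stage in range(4):
    gib = zero_model_state_bytes(410_000_000, num_gpus=2, stage=stage) / 2**30
    print(f"ZeRO-{stage}: ~{gib:.2f} GiB of model states per GPU")
```

Measured GPU memory will be higher than these figures because activations, communication buffers, and allocator overhead are not part of the estimate.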

### Question 2: Mixed Precision Training
How do you configure mixed precision training in DeepSpeed?

**Note:** Compare the performance of FP32, FP16, and BF16 (if available) training in terms of memory usage, training speed, and convergence (optional).
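
The precision mode is selected through the `fp16` and `bf16` sections of the DeepSpeed config. The helper below is a minimal sketch that generates that part of the config for each mode. `precision_config` is a hypothetical name, and the loss-scaling values are example settings rather than recommendations.

```python
import json


def precision_config(mode):
    """Return the precision-related part of a DeepSpeed config as a dict."""
    if mode not in ("fp32", "fp16", "bf16"):
        raise ValueError(f"unknown precision mode: {mode}")
    cfg = {
        "fp16": {"enabled": mode == "fp16"},
        "bf16": {"enabled": mode == "bf16"},
    }
    if mode == "fp16":
        # loss_scale = 0 selects dynamic loss scaling; the starting scale
        # is 2 ** initial_scale_power.
        cfg["fp16"].update({"loss_scale": 0, "initial_scale_power": 16})
    return cfg


print(json.dumps(precision_config("fp16"), indent=2))
```

FP16 has a narrow exponent range, which is why it is usually paired with loss scaling. BF16 keeps the FP32 exponent range, so the sketch adds no scaling options for it.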

### Question 3: Model Parallelism
How do you configure DeepSpeed to implement model parallelism rather than just data parallelism?

**Note:** Demonstrate the configuration and analyze how it affects training large models across multiple GPUs.
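
Pipeline parallelism, one form of model parallelism, assigns contiguous groups of layers to different GPUs. The following is a conceptual sketch in plain Python of a uniform split, not DeepSpeed's own partitioning code. `partition_layers` is a hypothetical helper and the 24-layer model is an illustrative number. In DeepSpeed itself, pipeline parallelism is normally set up by wrapping the model in `deepspeed.pipe.PipelineModule` rather than by a config entry alone.

```python
def partition_layers(num_layers, num_stages):
    """Split layer indices 0..num_layers-1 into contiguous pipeline stages.

    Stages differ in size by at most one layer; earlier stages receive the
    extra layers when the split is uneven.
    """
    if num_stages < 1 or num_stages > num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Illustrative example: a 24-layer model on 2 GPUs.
for stage, layers in enumerate(partition_layers(24, 2)):
    print(f"stage {stage}: layers {layers.start}..{layers.stop - 1}")
```

With such a split each GPU stores only its own layers, so per-GPU parameter memory shrinks, while activations must be passed between stages at every micro-batch.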

### Question 4: Offload Techniques
What are DeepSpeed's offload techniques? How do you enable them in the configuration?

**Note:** Experiment with offloading to CPU and/or NVMe and analyze the trade-offs between memory usage and training speed.
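
Offloading is enabled inside the `zero_optimization` section through `offload_optimizer` and `offload_param`. The helper below is a sketch that builds that section and rejects two combinations the DeepSpeed documentation describes as unsupported: optimizer offload without ZeRO, and parameter offload outside stage 3. `offload_config` and the example path `/local_nvme` are hypothetical.

```python
import json


def offload_config(stage, optimizer_device="none", param_device="none",
                   nvme_path=None):
    """Build a `zero_optimization` block with optional offloading."""
    if optimizer_device != "none" and stage < 1:
        raise ValueError("optimizer offload requires ZeRO stage 1, 2, or 3")
    if param_device != "none" and stage != 3:
        raise ValueError("parameter offload requires ZeRO stage 3")
    zero = {"stage": stage}
    targets = (("offload_optimizer", optimizer_device),
               ("offload_param", param_device))
    for key, device in targets:
        if device == "none":
            continue
        if device not in ("cpu", "nvme"):
            raise ValueError(f"unsupported offload device: {device}")
        entry = {"device": device, "pin_memory": True}
        if device == "nvme":
            if not nvme_path:
                raise ValueError("nvme offload needs an nvme_path")
            entry["nvme_path"] = nvme_path  # must point at a real NVMe mount
        zero[key] = entry
    return {"zero_optimization": zero}


print(json.dumps(offload_config(3, "nvme", "cpu", nvme_path="/local_nvme"), indent=2))
```

Generating the variants from one function keeps everything except the offload target identical across runs, which makes the memory and speed comparison easier to interpret.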

### Question 5: Training Larger Models
Using the knowledge gained from the previous questions, configure DeepSpeed to train a model from the [Pythia collection](https://huggingface.co/collections/EleutherAI/pythia-scaling-suite-64fb5dfa8c21ebb3db7ad2e1) that is larger than would be possible with standard training approaches. Try to maximize model size while maintaining reasonable training speed.

**Note:** Demonstrate your approach and configuration, and analyze the results in terms of model size, memory usage, and training speed.
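
Before launching runs, a back-of-the-envelope check can narrow down which model sizes are worth trying. The sketch below is a rough heuristic, not a measurement. It assumes 16 bytes of model states per parameter (mixed-precision Adam) split evenly across GPUs by ZeRO stage 3, no offloading, and a hypothetical `state_fraction` of GPU memory reserved for those states. The parameter counts are nominal values read off the model names.

```python
# Nominal parameter counts read off the model names; real counts differ slightly.
PYTHIA_NOMINAL_PARAMS = {
    "pythia-70m": 70e6,
    "pythia-160m": 160e6,
    "pythia-410m": 410e6,
    "pythia-1b": 1e9,
    "pythia-1.4b": 1.4e9,
    "pythia-2.8b": 2.8e9,
    "pythia-6.9b": 6.9e9,
    "pythia-12b": 12e9,
}


def largest_fitting_model(gpu_mem_gib, num_gpus, state_fraction=0.5):
    """Pick the largest model whose ZeRO-3 model states fit on each GPU.

    Assumes 16 bytes/param of model states split evenly across GPUs, with
    `state_fraction` of GPU memory reserved for them and the rest left for
    activations and buffers. Offloading is not modelled.
    """
    budget = gpu_mem_gib * 2**30 * state_fraction
    fitting = [
        (params, name)
        for name, params in PYTHIA_NOMINAL_PARAMS.items()
        if 16 * params / num_gpus <= budget
    ]
    return max(fitting)[1] if fitting else None


print(largest_fitting_model(gpu_mem_gib=16, num_gpus=2))
```

CPU or NVMe offloading can push the limit beyond this estimate at the cost of slower steps, so the result is best read as a starting point for experiments.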


In [1]:
!pip3 install deepspeed

Collecting deepspeed
  Downloading deepspeed-0.16.7.tar.gz (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 14.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting hjson (from deepspeed)
  Downloading hjson-3.1.0-py3-none-any.whl.metadata (2.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch->deepspeed)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch->deepspeed)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch->deepspeed)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch->deepspeed)
  Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collectin

### The following is a demonstration case showing how to enable DeepSpeed during LLM training:

In [None]:
# DeepSpeed Experiment Scripts and Configurations
# -----------------------------------------------
# This file bundle includes:
# 1) train_deepspeed.py - Main training script with metric logging and time-based stopping
# 2) DeepSpeed JSON config files for five experiments:
#    - ZeRO Optimization Stages (1, 2, 3)
#    - Mixed Precision (FP32, FP16, BF16)
#    - Model Parallelism
#    - Offload Techniques (CPU, NVMe)
#    - Large Model Training

# ======== train_deepspeed.py ========

In [3]:
%%writefile train_deepspeed.py
import os
import time
import json
import subprocess
import threading
import psutil
import argparse
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    TrainerCallback,
    DataCollatorForLanguageModeling
)

# Module-level flag polled by the logger threads; set to False to stop them.
logging_active = True

def gpu_usage_logger(logfile="gpu_usage.csv", interval=2, start_time=None):
    global logging_active
    if start_time is None:
        start_time = time.time()
    with open(logfile, "w") as f:
        f.write("elapsed_time,gpu_index,gpu_utilization,memory_utilization,memory_used_MB,memory_total_MB\n")
    while logging_active:
        elapsed = time.time() - start_time
        try:
            cmd = [
                "nvidia-smi",
                "--query-gpu=index,utilization.gpu,utilization.memory,memory.used,memory.total",
                "--format=csv,noheader,nounits"
            ]
            result = subprocess.run(cmd, capture_output=True, text=True)
            lines = result.stdout.strip().split("\n")
            with open(logfile, "a") as f:
                for line in lines:
                    f.write(f"{elapsed:.2f}," + line + "\n")
        except Exception as e:
            print("[GPU Logger]", e)
        time.sleep(interval)

def cpu_usage_logger(logfile="cpu_usage.csv", interval=2, start_time=None):
    global logging_active
    if start_time is None:
        start_time = time.time()
    with open(logfile, "w") as f:
        f.write("elapsed_time,cpu_usage_percent\n")
    psutil.cpu_percent(interval=None)
    while logging_active:
        elapsed = time.time() - start_time
        try:
            cpu = psutil.cpu_percent(interval=None)
            with open(logfile, "a") as f:
                f.write(f"{elapsed:.2f},{cpu}\n")
        except Exception as e:
            print("[CPU Logger]", e)
        time.sleep(interval)

# Callback that logs the wall-clock duration of every training step
class StepTimeCallback(TrainerCallback):
    def __init__(self, logfile="step_times.csv"):
        super().__init__()
        self.logfile = logfile
        # these will get their first real value at train-begin
        self.start_time = None
        self.step_start = None

        # write header once
        with open(self.logfile, "w") as f:
            f.write("elapsed_time,step_time\n")

    def on_train_begin(self, args, state, control, **kwargs):
        # set both timers at exactly the same baseline
        now = time.time()
        self.start_time = now
        self.step_start = now

    def on_step_begin(self, args, state, control, **kwargs):
        # mark the beginning of *this* training step
        self.step_start = time.time()

    def on_step_end(self, args, state, control, **kwargs):
        now = time.time()

        # guard against uninitialized timers
        if self.start_time is None:
            self.start_time = now
        if self.step_start is None:
            # nothing to log this round, but seed it for next time
            self.step_start = now
            return

        step_time = now - self.step_start
        elapsed   = now - self.start_time

        with open(self.logfile, "a") as f:
            f.write(f"{elapsed:.2f},{step_time:.4f}\n")


# Callback to stop training after a max duration (seconds)
class TimeBasedStoppingCallback(TrainerCallback):
    def __init__(self, max_duration):
        super().__init__()
        self.max_duration = max_duration
        self.start_time = None

    def on_train_begin(self, args, state, control, **kwargs):
        self.start_time = time.time()

    def on_step_end(self, args, state, control, **kwargs):
        # Ask the Trainer to stop once the time budget is used up
        if self.start_time is not None and (time.time() - self.start_time) > self.max_duration:
            control.should_training_stop = True
        return control

# Compute summary metrics after training
def compute_summary(prefix, gpu_csv, cpu_csv, speed_csv, summary_file):
    import pandas as pd
    try:
        df_gpu = pd.read_csv(gpu_csv).sort_values("elapsed_time")
        df_cpu = pd.read_csv(cpu_csv).sort_values("elapsed_time")
        df_speed = pd.read_csv(speed_csv)

        avg_gpu_mem = df_gpu["memory_used_MB"].mean()
        avg_gpu_util = df_gpu["gpu_utilization"].mean()
        avg_cpu = df_cpu["cpu_usage_percent"].mean()
        avg_step_time = df_speed["step_time"].mean()
        steps_per_sec = 1.0 / avg_step_time if avg_step_time > 0 else 0.0

        with open(summary_file, "w") as f:
            f.write(f"Experiment {prefix} Summary:\n")
            f.write(f"Avg GPU Memory Used (MB): {avg_gpu_mem:.2f}\n")
            f.write(f"Avg GPU Utilization (%): {avg_gpu_util:.2f}\n")
            f.write(f"Avg CPU Usage (%): {avg_cpu:.2f}\n")
            f.write(f"Avg Step Time (s): {avg_step_time:.4f}\n")
            f.write(f"Est. Steps/sec: {steps_per_sec:.2f}\n")
        print(f"Written summary to {summary_file}")
    except Exception as e:
        print(f"[Summary Error] Could not compute summary for {prefix}: {e}")

# Main training function
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--deepspeed_config", type=str,
        default="ds_config.json",
        help="Path to DeepSpeed JSON config"
    )
    parser.add_argument(
        "--max_time", type=int,
        default=None,
        help="Max training time in seconds (e.g., 120 for 2 minutes)"
    )
    args = parser.parse_args()

    # Load the DeepSpeed config so its fp16/bf16 flags can be mirrored in the HF TrainingArguments
    with open(args.deepspeed_config, "r") as f:
        ds_cfg = json.load(f)
    enable_fp16 = bool(ds_cfg.get("fp16", {}).get("enabled", False))
    enable_bf16 = bool(ds_cfg.get("bf16", {}).get("enabled", False))

    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    os.environ["NCCL_P2P_DISABLE"] = "1"

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"🚀 Using {torch.cuda.device_count()} GPUs for training!")

    MODEL_PATH = os.getenv("MODEL_PATH", "EleutherAI/pythia-410m")
    DATASET_HF_PATH = os.getenv("DATASET_HF_PATH", "xyliu6/deita_6k_processed_cs5260")
    OUTPUT_DIR = os.getenv("OUTPUT_DIR", "./output")
    LOG_DIR = os.getenv("LOG_DIR", "./logs")

    EPOCHS = int(os.getenv("EPOCHS", 1))
    BATCH_SIZE = int(os.getenv("BATCH_SIZE", 2))
    LEARNING_RATE = float(os.getenv("LEARNING_RATE", 2e-5))
    MAX_LENGTH = int(os.getenv("MAX_LENGTH", 1024))

    dataset = load_dataset(DATASET_HF_PATH)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        trust_remote_code=True
    )
    model.config.use_cache = False
    model.to(device)

    def preprocess(batch):
        texts = batch.get("formatted_text", [""])
        new_texts = [" ".join(t) if isinstance(t, list) else str(t) for t in texts]
        tokenized = tokenizer(
            new_texts,
            truncation=True,
            padding="max_length",
            max_length=MAX_LENGTH
        )
        tokenized["labels"] = tokenized.input_ids.copy()
        return tokenized

    tokenized = dataset.map(preprocess, batched=True)
    tokenized.set_format(type="torch", columns=["input_ids","attention_mask","labels"])

    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        logging_dir=LOG_DIR,
        num_train_epochs=EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        learning_rate=LEARNING_RATE,
        deepspeed=args.deepspeed_config,
        fp16=enable_fp16,
        bf16=enable_bf16,
        gradient_accumulation_steps=1,
        save_total_limit=1,
        report_to="none",
        remove_unused_columns=False
    )

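    # Log step times to the per-config file that compute_summary reads after training
    step_cb = StepTimeCallback(logfile=f"step_times_{args.deepspeed_config}.csv")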
    start_time = time.time()
    gpu_thread = threading.Thread(target=gpu_usage_logger, args=(f"gpu_{args.deepspeed_config}.csv",2,start_time))
    cpu_thread = threading.Thread(target=cpu_usage_logger, args=(f"cpu_{args.deepspeed_config}.csv",2,start_time))
    gpu_thread.start()
    cpu_thread.start()

    # build callbacks list
    callbacks = [step_cb]
    if args.max_time:
        time_cb = TimeBasedStoppingCallback(args.max_time)
        callbacks.append(time_cb)

    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        callbacks=callbacks,
        tokenizer=tokenizer,
        data_collator=data_collator
    )

    try:
        trainer.train()
    finally:
        # Always stop the logger threads, even if training raises
        logging_active = False
        gpu_thread.join()
        cpu_thread.join()

    prefix = os.path.splitext(os.path.basename(args.deepspeed_config))[0]
    compute_summary(prefix,
                    f"gpu_{args.deepspeed_config}.csv",
                    f"cpu_{args.deepspeed_config}.csv",
                    f"step_times_{args.deepspeed_config}.csv",
                    f"{prefix}_summary.txt")

Overwriting train_deepspeed.py


In [7]:
%%bash
# 1) ZeRO stages 1-3 (unquoted EOF so ${stage} expands)
for stage in 1 2 3; do
  cat > ds_zero_stage${stage}.json << EOF
{
  "zero_optimization": { "stage": ${stage} },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF
done

# 2) Mixed precision
cat > ds_fp32.json << 'EOF'
{
  "fp16":   { "enabled": false },
  "bf16":   { "enabled": false },
  "train_batch_size": "auto",
  "zero_optimization": {
    "stage": 1
  }
}
EOF

cat > ds_fp16.json << 'EOF'
{
  "fp16":   { "enabled": true },
  "bf16":   { "enabled": false },
  "train_batch_size": "auto",
  "zero_optimization": {
    "stage": 1
  }
}
EOF

cat > ds_bf16.json << 'EOF'
{
  "fp16":   { "enabled": false },
  "bf16":   { "enabled": true },
  "train_batch_size": "auto",
  "zero_optimization": {
    "stage": 1
  }
}
EOF


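# 3) Model parallelism
# Caveat: DeepSpeed pipeline parallelism normally requires wrapping the model in
# deepspeed.pipe.PipelineModule and is documented as incompatible with ZeRO
# stages 2 and 3, so the "pipeline" block below is likely ignored when the
# model is trained through the Hugging Face Trainer.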
cat > ds_model_parallel.json << EOF
{
  "zero_optimization": { "stage": 3 },
  "fp16": { "enabled": true },
  "pipeline": { "enabled": true, "stages": 2 },
  "train_batch_size": "auto"
}
EOF

# 4) Offload techniques
cat > ds_offload_cpu.json << EOF
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param":     { "device": "cpu", "pin_memory": true }
  },
  "fp16": { "enabled": true },
  "train_batch_size": "auto"
}
EOF

cat > ds_offload_nvme.json << EOF
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "nvme", "nvme_path": "/nvme/tmp" },
    "offload_param":     { "device": "cpu", "pin_memory": true }
  },
  "fp16": { "enabled": true },
  "train_batch_size": "auto"
}
EOF

# 5) Large model (Pythia-410m)
cat > ds_large_model.json << EOF
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true  },
    "offload_param":     { "device": "cpu",  "pin_memory": true }
  },
  "bf16":     { "enabled": true },
  "pipeline": { 
      "enabled": true, 
      "stages": 2  },
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF


## Q1

In [9]:
!for cfg in ds_zero_stage1.json; do \
    echo "=== Running $cfg ==="; \
    CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 \
      train_deepspeed.py --deepspeed_config $cfg --max_time 120; \
  done

=== Running ds_zero_stage1.json ===
W0423 15:43:33.751000 2141 torch/distributed/run.py:793] 
W0423 15:43:33.751000 2141 torch/distributed/run.py:793] *****************************************
W0423 15:43:33.751000 2141 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0423 15:43:33.751000 2141 torch/distributed/run.py:793] *****************************************
2025-04-23 15:43:39.964818: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-23 15:43:39.968431: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:174542301

In [8]:
!for cfg in ds_zero_stage2.json; do \
    echo "=== Running $cfg ==="; \
    CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 \
      train_deepspeed.py --deepspeed_config $cfg --max_time 120; \
  done


=== Running ds_zero_stage2.json ===
W0423 15:31:10.292000 1284 torch/distributed/run.py:793] 
W0423 15:31:10.292000 1284 torch/distributed/run.py:793] *****************************************
W0423 15:31:10.292000 1284 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0423 15:31:10.292000 1284 torch/distributed/run.py:793] *****************************************
2025-04-23 15:31:16.309313: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-23 15:31:16.309402: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:174542227

In [7]:
!for cfg in ds_zero_stage*.json; do \
    echo "=== Running $cfg ==="; \
    CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 \
      train_deepspeed.py --deepspeed_config $cfg --max_time 120; \
  done


=== Running ds_zero_stage3.json ===
W0423 15:27:28.843000 1012 torch/distributed/run.py:793] 
W0423 15:27:28.843000 1012 torch/distributed/run.py:793] *****************************************
W0423 15:27:28.843000 1012 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0423 15:27:28.843000 1012 torch/distributed/run.py:793] *****************************************
2025-04-23 15:27:35.122356: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-23 15:27:35.122459: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:174542205

## Q2

In [10]:
!for cfg in ds_fp32.json; do \
    echo "=== Running $cfg ==="; \
    CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 \
      train_deepspeed.py --deepspeed_config $cfg --max_time 120; \
  done


=== Running ds_fp32.json ===
W0423 15:53:55.722000 2879 torch/distributed/run.py:793] 
W0423 15:53:55.722000 2879 torch/distributed/run.py:793] *****************************************
W0423 15:53:55.722000 2879 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0423 15:53:55.722000 2879 torch/distributed/run.py:793] *****************************************
2025-04-23 15:54:02.088994: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-23 15:54:02.088994: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1745423642.11179

In [11]:
!for cfg in ds_fp16.json; do \
    echo "=== Running $cfg ==="; \
    CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 \
      train_deepspeed.py --deepspeed_config $cfg --max_time 120; \
  done


=== Running ds_fp16.json ===
W0423 15:57:08.882000 3186 torch/distributed/run.py:793] 
W0423 15:57:08.882000 3186 torch/distributed/run.py:793] *****************************************
W0423 15:57:08.882000 3186 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0423 15:57:08.882000 3186 torch/distributed/run.py:793] *****************************************
2025-04-23 15:57:15.002001: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-23 15:57:15.002065: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1745423835.02423

In [12]:
!for cfg in ds_bf16.json; do \
    echo "=== Running $cfg ==="; \
    CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 \
      train_deepspeed.py --deepspeed_config $cfg --max_time 120; \
  done


=== Running ds_bf16.json ===
W0423 16:00:14.963000 3487 torch/distributed/run.py:793] 
W0423 16:00:14.963000 3487 torch/distributed/run.py:793] *****************************************
W0423 16:00:14.963000 3487 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0423 16:00:14.963000 3487 torch/distributed/run.py:793] *****************************************
2025-04-23 16:00:21.060589: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-23 16:00:21.060611: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1745424021.08354

## Q3

In [13]:
!for cfg in ds_model_parallel.json; do \
    echo "=== Running $cfg ==="; \
    CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 \
      train_deepspeed.py --deepspeed_config $cfg --max_time 120; \
  done


=== Running ds_model_parallel.json ===
W0423 16:03:49.519000 3820 torch/distributed/run.py:793] 
W0423 16:03:49.519000 3820 torch/distributed/run.py:793] *****************************************
W0423 16:03:49.519000 3820 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0423 16:03:49.519000 3820 torch/distributed/run.py:793] *****************************************
2025-04-23 16:03:55.697494: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-23 16:03:55.697498: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:174542

## Q4

In [14]:
!for cfg in ds_offload_cpu.json; do \
    echo "=== Running $cfg ==="; \
    CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 \
      train_deepspeed.py --deepspeed_config $cfg --max_time 120; \
  done

=== Running ds_offload_cpu.json ===
W0423 16:06:50.108000 4124 torch/distributed/run.py:793] 
W0423 16:06:50.108000 4124 torch/distributed/run.py:793] *****************************************
W0423 16:06:50.108000 4124 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0423 16:06:50.108000 4124 torch/distributed/run.py:793] *****************************************
2025-04-23 16:06:56.637234: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1745424416.661057    4127 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1745424416.668380    4127 cuda_blas.cc:1418]

## Q5

In [8]:
# 1) Install a NumPy 1.24.x wheel
!pip install numpy==1.24.4

# 2) Restart your Kaggle session (so the new NumPy is actually loaded)

!for cfg in ds_large_model.json; do \
    echo "=== Running $cfg ==="; \
    CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 \
      train_deepspeed.py --deepspeed_config $cfg --max_time 120; \
  done

=== Running ds_large_model.json ===
W0425 14:32:43.451000 428 torch/distributed/run.py:793] 
W0425 14:32:43.451000 428 torch/distributed/run.py:793] *****************************************
W0425 14:32:43.451000 428 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0425 14:32:43.451000 428 torch/distributed/run.py:793] *****************************************
2025-04-25 14:32:49.491384: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-25 14:32:49.491548: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1745591569.51