# Implementing DeepSpeed: A Hands-On Approach 🚀

## Learning Objectives 🎯
- Learn how to implement DeepSpeed to optimize training for large models.
- Understand the configuration of DeepSpeed for memory-efficient training.
- Gain hands-on experience with DeepSpeed, even on a single GPU, to grasp the key concepts.

## GPU Verification ✅
Verify that a GPU is available. A free-tier T4 cannot handle the original 8B model, so we use a smaller 1.1B model instead, which loads faster and fits within Colab limits.

DeepSpeed's benefits are most visible with multiple GPUs, where training state can be split across devices.

In [None]:
import torch
# Check that a GPU is available. A free-tier T4 is NOT enough for an 8B-parameter model,
# but a slightly smaller model works and the lesson remains the same.
assert torch.cuda.is_available(), "No GPU detected; enable a GPU runtime before continuing."
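To see why the 8B model is out of reach on a T4, a rough weight-only estimate helps. The sketch below assumes 2 bytes per parameter (fp16/bf16) and a nominal 16 GB card; the function name and constants are illustrative, and the estimate ignores activations, gradients, and optimizer state, which only make things tighter.

```python
# Rough weight-only memory estimate (assumption: fp16/bf16 weights, 2 bytes each).
BYTES_PER_PARAM_FP16 = 2
T4_MEMORY_GB = 16  # nominal capacity; usable memory is somewhat lower


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Return the memory needed just to hold the weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, params in [("8B model", 8e9), ("1.1B model", 1.1e9)]:
    needed = weight_memory_gb(params)
    verdict = "fits" if needed < T4_MEMORY_GB else "does not fit"
    print(f"{name}: ~{needed:.1f} GB of weights, {verdict} in {T4_MEMORY_GB} GB")
```

The 8B weights alone already consume the whole card, while the 1.1B model leaves room for everything else training needs.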

## Installing Axolotl and DeepSpeed 🛠️
Install Axolotl with DeepSpeed support. While DeepSpeed is optimized for multi-GPU setups, you can still run this configuration on a single GPU to understand how the system works and the benefits of the Zero Redundancy Optimizer (ZeRO).

In [None]:
!pip install -e 'git+https://github.com/axolotl-ai-cloud/axolotl.git@0aeb277#egg=axolotl[deepspeed]' # ensures the same version we used in the course

## Configuration Setup for DeepSpeed Training 📝
Set up a YAML configuration for training the model, using a smaller model for free-tier compatibility. The original model from the lesson can be used if you have access to more powerful hardware.

In [None]:
import yaml

train_config = {
    # "base_model": "unsloth/Meta-Llama-3.1-8B-Instruct", # The original model from the lesson
    "base_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", # For faster loading on Colab


    # dataset params
    "datasets": [
        {
            "path": "TheFuzzyScientist/squad-for-llms",
            "type": {
                "system_prompt": "Read the following context and concisely answer my question.",
                "field_system": "system",
                "field_instruction": "question",
                "field_input": "context",
                "field_output": "output",
                "format": "<|user|> {input} {instruction} </s> <|assistant|>",
                "no_input_format": "<|user|> {instruction} </s> <|assistant|>",
            },
        }
    ],
    "output_dir": "./models/Llama3_squad",

    # model params
    "sequence_len": 2048,  # Axolotl's key for the maximum sequence length
    "bf16": "auto",
    "tf32": False,

    # training params
    "micro_batch_size": 4,
    "num_epochs": 4,
    "optimizer": "adamw_bnb_8bit",
    "learning_rate": 0.0002,
    "logging_steps": 1,

    # LoRA / qLoRA
    "adapter": "lora",
    "lora_r": 32,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "lora_target_linear": True,

    # Gradient Accumulation
    "gradient_accumulation_steps": 1,

    # Gradient Checkpointing
    "gradient_checkpointing": True,
}


# Write the YAML file
with open("deepspeed_train.yml", 'w') as file:
    yaml.dump(train_config, file)
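The batch-related settings above combine into the effective batch size seen by each optimizer step. As a minimal sketch (the helper name is hypothetical, and `num_gpus` is an assumption about your hardware rather than a key in the config), the relationship is:

```python
def effective_batch_size(micro_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


# Values from the config above on a single GPU
print(effective_batch_size(4, 1))              # 4
# The same config on a hypothetical 4-GPU machine
print(effective_batch_size(4, 1, num_gpus=4))  # 16
```

This is the same product DeepSpeed uses to relate `train_batch_size` to `train_micro_batch_size_per_gpu`, so adding GPUs raises the effective batch size unless the other two values are reduced.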


## DeepSpeed Configuration 🧠
Create a DeepSpeed configuration for ZeRO Stage 1 to enable memory optimization during training. Stage 1 partitions the optimizer states across GPUs, which lowers per-GPU memory use and can allow larger batch sizes when scaling to multiple GPUs.

In [None]:
import json

zero1_conf = {
    "zero_optimization": {"stage": 1, "overlap_comm": True},
    "bf16": {"enabled": "auto"},
    "fp16": {
        "enabled": "auto",
        "auto_cast": False,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": False,
}

# Write the DeepSpeed config file
with open("zero1.json", 'w') as fp:
    json.dump(zero1_conf, fp)
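To get a feel for what Stage 1 saves, the following sketch uses the common mixed-precision Adam accounting: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer state (an fp32 weight copy, momentum, and variance). It assumes full fine-tuning with standard Adam, which is not what this notebook does (LoRA with an 8-bit optimizer uses far less), so treat the numbers as an illustration of the partitioning idea only. The function name is hypothetical.

```python
def model_state_gb(num_params: float, num_gpus: int, zero_stage: int = 0) -> float:
    """Approximate per-GPU memory for weights, gradients, and optimizer state, in GB."""
    if zero_stage not in (0, 1):
        raise ValueError("this sketch only models ZeRO stage 0 (off) and stage 1")
    weights_and_grads = 4 * num_params   # 2 bytes fp16 weights + 2 bytes fp16 gradients
    optimizer_states = 12 * num_params   # fp32 weights copy + momentum + variance
    if zero_stage == 1:
        optimizer_states /= num_gpus     # stage 1 shards optimizer state across GPUs
    return (weights_and_grads + optimizer_states) / 1e9


for gpus in (1, 4):
    print(f"{gpus} GPU(s): {model_state_gb(1.1e9, gpus, zero_stage=1):.1f} GB per GPU")
```

With one GPU there is nothing to partition, so Stage 1 gives no saving; the per-GPU figure drops only as more devices share the optimizer state.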

## Launching DeepSpeed Training 🚀
Launch training with the `accelerate launch` command, passing the DeepSpeed configuration via `--deepspeed`. Running this on a single GPU won't show the full benefits, but it still demonstrates how DeepSpeed plugs into a training run and how it optimizes large-scale training.

In [None]:
!accelerate launch -m axolotl.cli.train deepspeed_train.yml --deepspeed zero1.json
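On a machine with several GPUs, the same command can be given an explicit process count. The sketch below only assembles the command string; `build_launch_command` is a hypothetical helper, and it assumes that `accelerate launch` is given one process per GPU through its `--num_processes` option and that the two config files from the earlier cells are in the working directory.

```python
import shlex


def build_launch_command(num_gpus: int,
                         train_config: str = "deepspeed_train.yml",
                         deepspeed_config: str = "zero1.json") -> str:
    """Assemble the training command with one process per GPU (assumed setup)."""
    return shlex.join([
        "accelerate", "launch",
        "--num_processes", str(num_gpus),
        "-m", "axolotl.cli.train", train_config,
        "--deepspeed", deepspeed_config,
    ])


print(build_launch_command(2))
```

The printed string can be pasted into a notebook cell after a `!`; without `--num_processes`, the process count comes from your Accelerate configuration or its defaults.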

In [None]:
# Optional: merge the trained LoRA adapter into the base model
!accelerate launch -m axolotl.cli.merge_lora deepspeed_train.yml