# Applying FSDP: Real-World Usage and Best Practices 🚀

## Learning Objectives 🎯
- Learn how to apply Fully Sharded Data Parallel (FSDP) for large-scale models.
- Understand FSDP’s configurations for optimizing memory usage and efficiency.
- Gain hands-on experience by running FSDP even on a single GPU, keeping in mind that the real benefits become clear with multiple GPUs.

## GPU Check ✅
Make sure a GPU is available. A free-tier T4 cannot hold a large model such as a 70B-parameter Llama, so this notebook uses a smaller 8B model, which also loads faster. **FSDP shines with multiple GPUs**, but you can still explore the key concepts on a single GPU.

In [None]:
import torch
# Confirm that a GPU is available. A free-tier T4 is NOT enough for a 70B-parameter model,
# so we use a smaller one; the lesson stays the same.
assert torch.cuda.is_available(), "No CUDA GPU detected; switch the runtime to a GPU."
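
To see why a T4 is too small for the larger models, a rough back-of-the-envelope estimate helps. The sketch below counts only the memory for the weights (no gradients, optimizer state, or activations), treats 1 GB as 1e9 bytes, and assumes a nominal 16 GB for the T4. The helper `estimate_weight_memory_gb` is a hypothetical name introduced here for illustration.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


T4_MEMORY_GB = 16  # nominal capacity (assumption); the usable amount is somewhat lower

for name, n_params in [("8B", 8e9), ("70B", 70e9)]:
    for bits in (16, 4):
        need = estimate_weight_memory_gb(n_params, bits)
        verdict = "fits" if need < T4_MEMORY_GB else "does not fit"
        print(f"{name} at {bits}-bit: ~{need:.0f} GB of weights, {verdict} in {T4_MEMORY_GB} GB")
```

Under these assumptions, only the 4-bit 8B model leaves headroom on a single T4, which is why the configuration later in this notebook loads the model in 4-bit.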

## Installing Axolotl with FSDP 🛠️
Install Axolotl, ensuring FSDP support is included. This allows us to explore real-world usage scenarios. While FSDP is designed for distributed setups with multiple GPUs, we can still learn and practice on a single GPU for educational purposes.

In [None]:
!pip install -e 'git+https://github.com/axolotl-ai-cloud/axolotl.git@0aeb277#egg=axolotl' # ensures the same version we used in the course

Obtaining axolotl from git+https://github.com/axolotl-ai-cloud/axolotl.git@0aeb277#egg=axolotl
  Cloning https://github.com/axolotl-ai-cloud/axolotl.git (to revision 0aeb277) to ./src/axolotl
  Running command git clone --filter=blob:none --quiet https://github.com/axolotl-ai-cloud/axolotl.git /content/src/axolotl
  Running command git checkout -q 0aeb277
  Resolved https://github.com/axolotl-ai-cloud/axolotl.git to commit 0aeb277
  Preparing metadata (setup.py) ... done
Collecting fschat@ git+https://github.com/lm-sys/FastChat.git@27a05b04a35510afb1d767ae7e5990cbd278f8fe (from axolotl)
  Cloning https://github.com/lm-sys/FastChat.git (to revision 27a05b04a35510afb1d767ae7e5990cbd278f8fe) to /tmp/pip-install-5c0hhwsw/fschat_533ec7dc724f426e8da3b294833a545b
  Running command git clone --filter=blob:none --quiet https://github.com/lm-sys/FastChat.git /tmp/pip-install-5c0hhwsw/fschat_533ec7dc724f426e8da3b294833a545b
  Running command git rev-parse -q --verify 'sha^27a05b04

## FSDP Configuration for Training 📝
Set up the YAML configuration to train the model. This includes parameters optimized for FSDP, such as sharding configurations and memory efficiency techniques. The smaller model and dataset used here ensure that the training is manageable even on Colab.

## Applying FSDP Configurations 🧠
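FSDP is enabled through the `fsdp` options `full_shard` and `auto_wrap`. `full_shard` splits the parameters, gradients, and optimizer state across GPUs, and `auto_wrap` wraps each transformer layer (here `LlamaDecoderLayer`) in its own FSDP unit, so full parameters are gathered layer by layer rather than for the whole model at once. On a single GPU there is nothing to shard across, but the same configuration carries over unchanged to larger-scale distributed training.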
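
The idea behind full sharding can be sketched without any GPUs. The toy example below is a conceptual illustration in plain Python, not how PyTorch implements it: real FSDP operates on tensors and communicates through `torch.distributed`. The helpers `shard` and `all_gather` are hypothetical names. Each rank keeps an equal slice of a layer's flattened parameters, and the full layer is reassembled only when it is needed.

```python
def shard(flat_params, world_size):
    """Split a flat parameter list into world_size equal shards, zero-padding the tail."""
    shard_size = -(-len(flat_params) // world_size)  # ceiling division
    padding = shard_size * world_size - len(flat_params)
    padded = flat_params + [0.0] * padding
    return [padded[r * shard_size:(r + 1) * shard_size] for r in range(world_size)]


def all_gather(shards, original_len):
    """Rebuild the full parameter list from every rank's shard, dropping the padding."""
    full = [value for rank_shard in shards for value in rank_shard]
    return full[:original_len]


layer_params = [0.1 * i for i in range(10)]  # stand-in for one layer's weights
shards = shard(layer_params, world_size=4)

for rank, rank_shard in enumerate(shards):
    print(f"rank {rank} stores {len(rank_shard)} of {len(layer_params)} values")

# Gathering the shards recovers the original layer exactly.
assert all_gather(shards, len(layer_params)) == layer_params
```

With `world_size=1` the single shard is the whole layer, which mirrors what happens when this notebook runs on one GPU.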

In [None]:
import yaml

train_config = {
    # "base_model": "casperhansen/llama-3-70b-fp16",  # needs at least 2 x 24 GB GPUs
    "base_model": "unsloth/llama-3-8b-Instruct",

    # dataset params
    "datasets": [{"path": "Yukang/LongAlpaca-12k", "type": "alpaca"}],
    "output_dir": "./models/llama70B-LongAlpaca",  # the name mentions 70B, but this run trains the 8B model above

    # model params
    "sequence_len": 1024,  # Axolotl's key for the maximum sequence length
    "pad_to_sequence_len": True,
    "special_tokens": {"pad_token": "<|end_of_text|>"},

    "bf16": "auto",
    "tf32": False,

    # training params
    "micro_batch_size": 1,
    "num_epochs": 1,
    "optimizer": "adamw_torch",
    "learning_rate": 0.0002,

    "logging_steps": 1,

    # LoRA / qLoRA
    "adapter": "qlora",
    "lora_r": 32,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "lora_target_linear": True,

    # Gradient Accumulation
    "gradient_accumulation_steps": 1,

    # Gradient Checkpointing
    "gradient_checkpointing": True,

    # Low Precision
    "load_in_8bit": False,
    "load_in_4bit": True,

    # Flash Attention
    "flash_attention": False,

    # FSDP
    "fsdp": ["full_shard", "auto_wrap"],
    "fsdp_config": {
        "fsdp_offload_params": True,
        "fsdp_cpu_ram_efficient_loading": True,
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
    },
}



# Write the YAML file
with open("train_fsdp.yml", 'w') as file:
    yaml.dump(train_config, file)
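
The config above sets both `micro_batch_size` and `gradient_accumulation_steps` to 1. Because FSDP is a form of data parallelism, each GPU processes its own micro-batch, so the number of samples per optimizer step grows with the GPU count. A minimal sketch of that arithmetic, where `effective_batch_size` is a hypothetical helper and the two-GPU figure is only an example:

```python
def effective_batch_size(micro_batch_size: int, gradient_accumulation_steps: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step under data-parallel training."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


# Values copied from the training config above.
batch_settings = {"micro_batch_size": 1, "gradient_accumulation_steps": 1}

for num_gpus in (1, 2):
    size = effective_batch_size(num_gpus=num_gpus, **batch_settings)
    print(f"{num_gpus} GPU(s): effective batch size = {size}")
```

If you move from one GPU to several and want to keep the optimization behavior comparable, you may need to adjust `gradient_accumulation_steps` or the learning rate accordingly.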


## Launching FSDP Training 🚀
Start the training using FSDP with `accelerate launch`. FSDP works best in a multi-GPU setup, but you can still proceed with a single GPU to observe how it manages memory sharding and learn its benefits.

In [None]:
!accelerate launch -m axolotl.cli.train train_fsdp.yml

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
2024-09-05 09:34:38.036546: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-05 09:34:38.057258: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-05 09:34:38.063421: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-05 09:34:38.078557: I tensorflow/core/platform/cpu_feature_guard.cc:210] This 
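
The launch log above reports that `--num_processes` defaulted to 1, so the run used a single process. On a machine with several GPUs you would typically pass that flag explicitly, with one process per GPU. The sketch below only builds and prints such a command; `build_launch_command` is a hypothetical helper, and the count of 2 is an assumed example rather than a setup used in this notebook.

```python
import shlex


def build_launch_command(config_path: str, num_processes: int) -> list:
    """Assemble an `accelerate launch` command with one process per GPU."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    return [
        "accelerate", "launch",
        "--num_processes", str(num_processes),
        "-m", "axolotl.cli.train",
        config_path,
    ]


command = build_launch_command("train_fsdp.yml", num_processes=2)
print(shlex.join(command))
```

In practice, `torch.cuda.device_count()` is a natural source for `num_processes`.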

In [None]:
# Optional: Merge the trained adapter
!accelerate launch -m axolotl.cli.merge_lora train_fsdp.yml