# 🚀 HyperSloth Demo Training Notebook

This notebook demonstrates how to fine-tune large language models using HyperSloth's multi-GPU capabilities. It's equivalent to running:

```bash
hypersloth-train examples/example_sharegpt_lora_2gpus.py
```

## What This Demo Does

- **Multi-GPU Training**: Uses 2 GPUs with NCCL synchronization
- **Adaptive Batching**: Optimizes sequence sorting and padding
- **LoRA Fine-tuning**: Efficient parameter updates with Low-Rank Adaptation
- **Response-only Loss**: Calculates loss only on assistant responses
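
To make the response-only loss idea concrete, here is a minimal sketch in plain Python. It is not HyperSloth's implementation: the helper names `find_subsequence` and `mask_non_response_tokens` are hypothetical, and the toy token ids stand in for the tokenized instruction and response markers. The one firm fact it relies on is that a label of `-100` is ignored by PyTorch's `CrossEntropyLoss` by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def find_subsequence(haystack, needle, start=0):
    """Return the first index >= start where needle occurs in haystack, or -1."""
    n = len(needle)
    for i in range(start, len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def mask_non_response_tokens(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens inside assistant turns (hypothetical helper)."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_subsequence(input_ids, response_ids, pos)
        if start == -1:
            break
        body_start = start + len(response_ids)
        # The assistant turn ends where the next user turn begins, or at the end.
        end = find_subsequence(input_ids, instruction_ids, body_start)
        if end == -1:
            end = len(input_ids)
        labels[body_start:end] = input_ids[body_start:end]
        pos = end
    return labels


# Toy token ids: [1, 2] marks a user turn, [3, 4] marks an assistant turn.
toy_ids = [1, 2, 10, 11, 3, 4, 20, 21, 1, 2, 12, 3, 4, 22]
labels = mask_non_response_tokens(toy_ids, instruction_ids=[1, 2], response_ids=[3, 4])
print(labels)
```

Only the ids 20, 21, and 22 (the assistant's tokens) keep their values; every prompt and marker token is replaced by `-100`, so it contributes nothing to the loss.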

## Prerequisites

1. HyperSloth installed: `pip install git+https://github.com/anhvth/HyperSloth.git`
2. At least 2 GPUs available (adjust `gpus=[0, 1]` if needed)
3. Sufficient VRAM (reduce `per_device_train_batch_size` or `max_seq_length` if you run out of memory)

In [1]:
# Import HyperSloth configuration classes
from HyperSloth.hypersloth_config import *
from HyperSloth.scripts.hp_trainer import run_multiprocess_training, setup_envs

# Check GPU availability
import torch
print(f'🔥 CUDA Available: {torch.cuda.is_available()}')
print(f'🔥 GPU Count: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    print(f'   GPU {i}: {torch.cuda.get_device_name(i)}')

🔥 CUDA Available: True
🔥 GPU Count: 4
   GPU 0: NVIDIA H100 80GB HBM3
   GPU 1: NVIDIA H100 80GB HBM3
   GPU 2: NVIDIA H100 80GB HBM3
   GPU 3: NVIDIA H100 80GB HBM3


## ⚙️ Configuration Setup

HyperSloth uses Pydantic models for type-safe configuration. We'll set up:

1. **Data Configuration**: Dataset and tokenization settings
2. **Training Configuration**: GPU allocation and loss calculation
3. **Model Configuration**: Base model and LoRA parameters
4. **Training Arguments**: Learning rate, batch size, and optimization settings
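
As a rough guide to what the LoRA parameters mean, the sketch below works through the standard LoRA arithmetic for the values this notebook uses (`r=8`, `lora_alpha=16`, `use_rslora=False`). The function names are hypothetical and the 1024 x 1024 layer shape is purely illustrative, not the actual dimensions of any Qwen3 layer.

```python
import math


def lora_scaling(r, lora_alpha, use_rslora=False):
    """Factor applied to the low-rank update B @ A before it is added to the frozen weight."""
    # Standard LoRA scales by alpha / r; rank-stabilized LoRA uses alpha / sqrt(r).
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


def lora_trainable_params(d_in, d_out, r):
    """Parameters in one adapter: A is (r x d_in) and B is (d_out x r)."""
    return r * (d_in + d_out)


scale = lora_scaling(r=8, lora_alpha=16)
print(scale)  # 2.0

# Illustrative layer shape only (an assumption for this example).
d_in, d_out = 1024, 1024
params = lora_trainable_params(d_in, d_out, r=8)
full = d_in * d_out
print(params, full)
print(f"{params / full:.4%} of the full weight matrix is trainable")
```

Doubling `r` doubles the adapter size, and with `use_rslora=False` it also halves the scaling factor unless `lora_alpha` is raised to match.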

In [2]:
from HyperSloth.hypersloth_config import *
from HyperSloth.scripts.hp_trainer import run_multiprocess_training, setup_envs

# Main configuration using Pydantic models
hyper_config_model = HyperConfig(
    data=HFDatasetConfig(
        dataset_name="llamafactory/OpenThoughts-114k",
        split="train",
        tokenizer_name="Qwen/Qwen3-8B",  # any Qwen3 checkpoint works here: the family shares a tokenizer
        num_samples=1000,
        instruction_part="<|im_start|>user\n",
        response_part="<|im_start|>assistant\n",
        chat_template="chatml",
    ),
    training=TrainingConfig(
        gpus=[0, 1],
        loss_type="response_only",
    ),
    fast_model_args=FastModelArgs(
        model_name="unsloth/Qwen3-0.6b-bnb-4bit",
        max_seq_length=32_000,
        load_in_4bit=True,
    ),
    lora_args=LoraArgs(
        r=8,
        lora_alpha=16,
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
        ],
        lora_dropout=0,
        bias="none",
        use_rslora=False,
    ),
)

# Training arguments using Pydantic model
training_config_model = TrainingArgsConfig(
    output_dir="outputs/qwen3-8b-openthought-2gpus/",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    logging_steps=3,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=5,
    save_total_limit=2,
    weight_decay=0.01,
    optim="adamw_8bit",
    seed=3407,
    report_to="none",  # or "tensorboard" / "wandb"
)
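
The batch settings above combine into the effective (global) batch size that the trainer reports when it starts. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the steps-per-epoch figure is only an estimate, since it assumes all 1000 samples are used and the last partial batch is kept.

```python
import math


def global_batch_size(per_device_batch_size, gradient_accumulation_steps, num_gpus):
    """Samples consumed per optimizer step across all GPUs."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


gbs = global_batch_size(
    per_device_batch_size=1,        # per_device_train_batch_size
    gradient_accumulation_steps=8,  # gradient_accumulation_steps
    num_gpus=2,                     # len(gpus) for gpus=[0, 1]
)
print(gbs)  # 16

# Approximate optimizer steps per epoch for num_samples=1000 (assumption: no samples dropped).
steps = math.ceil(1000 / gbs)
print(steps)  # 63
```

With only about 63 optimizer steps per epoch, `warmup_steps=5` covers a small but non-trivial share of the first epoch.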

## 🏋️ Run Training

Prepare the environment with `setup_envs`, then launch training on the GPUs listed in `hyper_config_model.training.gpus`:

In [3]:

setup_envs(hyper_config_model, training_config_model)

run_multiprocess_training(
    hyper_config_model.training.gpus, hyper_config_model, training_config_model
)

Global batch size: 16
[MP] Running on 2 GPUs


[32m03:31:54[0m | [1mINFO    [0m | [36mGPU1[0m | [36mhp_trainer.py:44[0m | [1m🔧 GPU 1 (Rank 1/1) | Model: unsloth/Qwen3-0.6b-bnb-4bit[0m
[32m03:31:54[0m | [1mINFO    [0m | [36mGPU1[0m | [36mhp_trainer.py:50[0m | [1mTraining on GPU 1 with output_dir outputs/qwen3-8b-openthought-2gpus/[0m
[32m03:31:54[0m | [1mINFO    [0m | [36mGPU1[0m | [36mhp_trainer.py:53[0m | [1m🚀 Starting total training timer[0m
[32m03:31:54[0m | [1mINFO    [0m | [36mGPU0[0m | [36mhp_trainer.py:44[0m | [1m🔧 GPU 0 (Rank 0/1) | Model: unsloth/Qwen3-0.6b-bnb-4bit[0m
[32m03:31:54[0m | [1mINFO    [0m | [36mGPU0[0m | [36mhp_trainer.py:50[0m | [1mTraining on GPU 0 with output_dir outputs/qwen3-8b-openthought-2gpus/[0m
[32m03:31:54[0m | [1mINFO    [0m | [36mGPU0[0m | [36mhp_trainer.py:53[0m | [1m🚀 Starting total training timer[0m
Process Process-2:
Traceback (most recent call last):
  File "/home/anhvth5/miniconda3/envs/unsloth_env/lib/python3.11/multiprocessing/pro

Error in training, terminating all processes


Exception: Error in training