# 🚀 HyperSloth Demo Training Notebook

This notebook demonstrates how to fine-tune large language models using HyperSloth's multi-GPU capabilities. It's equivalent to running:

```bash
hypersloth-train examples/example_sharegpt_lora_2gpus.py
```

## What This Demo Does

- **Multi-GPU Training**: Trains across multiple GPUs with NCCL synchronization (the run below uses 4 GPUs)
- **Adaptive Batching**: Optimizes sequence sorting and padding
- **LoRA Fine-tuning**: Efficient parameter updates with Low-Rank Adaptation
- **Response-only Loss**: Calculates loss only on assistant responses
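
To illustrate why sorting sequences by length matters for batching, here is a minimal sketch in plain Python. It is not HyperSloth's implementation. The helper `padding_tokens` and the sequence lengths are hypothetical, chosen only to show that padding each batch to its longest sequence wastes fewer tokens when similar lengths are grouped together.

```python
import random


def padding_tokens(lengths, batch_size):
    """Count pad tokens needed when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


random.seed(0)
# Hypothetical sequence lengths (in tokens) for 64 samples.
lengths = [random.randint(50, 2000) for _ in range(64)]

unsorted_pad = padding_tokens(lengths, batch_size=8)
sorted_pad = padding_tokens(sorted(lengths), batch_size=8)

print(f"pad tokens, original order:   {unsorted_pad}")
print(f"pad tokens, sorted by length: {sorted_pad}")
```

With equal-sized batches, grouping by sorted length never needs more padding than any other grouping of the same samples, which is the effect a length-aware batching strategy relies on.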

## Prerequisites

1. HyperSloth installed: `pip install git+https://github.com/anhvth/HyperSloth.git`
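2. At least 2 GPUs available (adjust the `gpus` list in `TrainingConfig` to match your machine; the configuration below uses 4)
3. Sufficient VRAM (reduce `per_device_train_batch_size` or `max_seq_length` if you run out of memory)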

In [1]:
%%capture
%load_ext autoreload
%autoreload 2

In [2]:
# Import HyperSloth configuration classes
from HyperSloth.hypersloth_config import *

# Check GPU availability
import torch
print(f'🔥 CUDA Available: {torch.cuda.is_available()}')
print(f'🔥 GPU Count: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    print(f'   GPU {i}: {torch.cuda.get_device_name(i)}')


🔥 CUDA Available: True
🔥 GPU Count: 4
   GPU 0: NVIDIA H100 80GB HBM3
   GPU 1: NVIDIA H100 80GB HBM3
   GPU 2: NVIDIA H100 80GB HBM3
   GPU 3: NVIDIA H100 80GB HBM3


## ⚙️ Configuration Setup

HyperSloth uses Pydantic models for type-safe configuration. We'll set up:

1. **Data Configuration**: Dataset and tokenization settings
2. **Training Configuration**: GPU allocation and loss calculation
3. **Model Configuration**: Base model and LoRA parameters
4. **Training Arguments**: Learning rate, batch size, and optimization settings
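
The loss calculation setting refers to response-only loss: tokens outside assistant turns are excluded from the loss. The sketch below shows the general idea on toy token ids, assuming turns are delimited by marker sequences such as the `instruction_part` and `response_part` strings in the data configuration. The function names and token ids are hypothetical, and this is not HyperSloth's or Unsloth's actual code.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_all(seq, pattern):
    """Return every start index where `pattern` occurs in `seq`."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]


def mask_non_response(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens inside assistant turns."""
    labels = [IGNORE_INDEX] * len(input_ids)
    user_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)  # first token after the assistant marker
        # The response runs until the next user turn, or to the end of the sequence.
        end = next((u for u in user_starts if u > start), len(input_ids))
        labels[begin:end] = input_ids[begin:end]
    return labels


# Toy token ids (hypothetical): [1, 2] marks a user turn, [1, 3] an assistant turn.
ids = [1, 2, 10, 11, 1, 3, 20, 21, 22, 1, 2, 12, 1, 3, 23]
labels = mask_non_response(ids, instruction_ids=[1, 2], response_ids=[1, 3])
print(labels)
```

Only the assistant tokens (20, 21, 22 and 23 in this toy example) keep their ids as labels; every other position is set to `-100` and contributes nothing to the loss.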

In [None]:
from HyperSloth.hypersloth_config import *
from HyperSloth.scripts.hp_trainer import run_mp_training, setup_envs

# Main configuration using Pydantic models
hyper_config_model = HyperConfig(
    data=HFDatasetConfig(
        dataset_name="llamafactory/OpenThoughts-114k",
        split="train",
        tokenizer_name="Qwen/Qwen3-8B",  # any model from the same Qwen3 family works here
        num_samples=1000,
        instruction_part="<|im_start|>user\n",
        response_part="<|im_start|>assistant\n",
        chat_template="chatml",
    ),
    training=TrainingConfig(
        gpus=[0, 1, 2, 3],
        loss_type="response_only",
    ),
    fast_model_args=FastModelArgs(
        model_name="unsloth/Qwen3-0.6b-bnb-4bit",
        max_seq_length=32_000,
        load_in_4bit=True,
    ),
    lora_args=LoraArgs(
        r=8,
        lora_alpha=16,
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
        ],
        lora_dropout=0,
        bias="none",
        use_rslora=False,
    ),
)

# Training arguments using Pydantic model
training_config_model = TrainingArgsConfig(
    output_dir="outputs/qwen3-8b-openthought-2gpus/",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    logging_steps=3,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=5,
    save_total_limit=2,
    weight_decay=0.01,
    optim="adamw_8bit",
    seed=3407,
    report_to="none",  # or "tensorboard" / "wandb"
)

setup_envs(hyper_config_model, training_config_model)

run_mp_training(
    hyper_config_model.training.gpus, hyper_config_model, training_config_model
)

Global batch size: 32
[MP] Running on 4 GPUs
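
The reported global batch size follows from the configuration values. As a quick check, the arithmetic can be reproduced as below. The step count assumes that a final partial batch still counts as one optimizer step, which is an assumption about the trainer rather than something stated in this notebook.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 4  # len(gpus) in TrainingConfig
num_samples = 1000
num_train_epochs = 3

# Samples consumed per optimizer step across all GPUs.
global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Assumption: the last, partially filled batch of an epoch still triggers a step.
steps_per_epoch = math.ceil(num_samples / global_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(global_batch_size, steps_per_epoch, total_steps)  # 32 32 96
```

The result of 96 total steps agrees with the progress bar total (`0/96`) in the training log.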


06:53:26 | INFO     | GPU0 | hp_trainer.py:42 | Training on GPU 0 with output_dir outputs/qwen3-8b-openthought-2gpus/
06:53:26 | INFO     | GPU0 | hp_trainer.py:45 | 🚀 Starting total training timer
06:53:26 | INFO     | GPU2 | hp_trainer.py:42 | Training on GPU 2 with output_dir outputs/qwen3-8b-openthought-2gpus/
06:53:26 | INFO     | GPU2 | hp_trainer.py:45 | 🚀 Starting total training timer
06:53:26 | INFO     | GPU3 | hp_trainer.py:42 | Training on GPU 3 with output_dir outputs/qwen3-8b-openthought-2gpus/
06:53:26 | INFO     | GPU3 | hp_trainer.py:45 | 🚀 Starting total training timer
06:53:26 | INFO     | GPU1 | hp_trainer.py:42 | Training on GPU 1 with output_dir outputs/qwen3-8b-openthough

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Using compiler location: .cache/unsloth_compiled_cache_1
==((====))==  Unsloth 2025.5.9: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Making `model.base_model.model.model` require gradients
[LOCAL_RANK=1] Patching log. Dir: outputs/qwen3-8b-openthought-2gpus/, GPUs: 4
[LOCAL_RANK=1] Log patch initialization complete.
🔧 Patching Trainer to use RandomSamplerSeededByEpoch


06:54:03 | INFO     | GPU2 | logging_config.py:140 | ⏱️  final_trainer_creation: 3.92s
06:54:03 | INFO     | GPU2 | init_modules.py:155 | Replacing DataCollatorForLanguageModeling with DataCollatorForSeq2Seq for better sequence handling
06:54:03 | INFO     | GPU2 | init_modules.py:163 | Trainer setup completed successfully
06:54:03 | INFO     | GPU2 | logging_config.py:140 | ⏱️  trainer_setup: 3.94s
06:54:03 | INFO     | GPU2 | init_modules.py:119 | Add callback ShuffleData to Trainer UnslothSFTTrainer
06:54:03 | INFO     | GPU2 | logging_config.py:140 | ⏱️  trainer_creation: 3.95s
06:54:03 | INFO     | GPU2 | logging_config.py:140 | ⏱️  total_setup: 36.37s
06:54:03 | INFO

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Using compiler location: .cache/unsloth_compiled_cache_3
==((====))==  Unsloth 2025.5.9: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Making `model.base_model.model.model` require gradients
[LOCAL_RANK=3] Patching log. Dir: outputs/qwen3-8b-openthought-2gpus/, GPUs: 4
[LOCAL_RANK=3] Log patch initialization complete.
🔧 Patching Trainer to use RandomSamplerSeededByEpoch
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now 

  0%|          | 0/96 [00:00<?, ?it/s]
06:54:03 | INFO     | GPU3 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 1000 indices
First ids: [776, 507, 895, 922, 33, 483, 85, 750, 354, 523]
...Last ids: [104, 754, 142, 228, 250, 281, 759, 25, 114, 654]
  0%|          | 0/96 [00:00<?, ?it/s]
06:54:03 | INFO     | GPU0 | patch_sampler.py:21 | 🔄 Starting epoch 1
06:54:03 | INFO     | GPU0 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 1000 indices
First ids: [776, 507, 895, 922, 33, 483, 85, 750, 354, 523]
...Last ids: [104, 754, 142, 228, 250, 281, 759, 25, 114, 654]
  0%|          | 0/96 [00:00<?, ?it/s]
06:54:04 | INFO     | GPU2 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 1000 indices
First ids: [776, 507, 895, 922, 33, 483, 85, 750, 354, 523]
...Last ids: [104, 754, 142, 228, 250, 281, 759, 25, 114, 654]

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Using compiler location: .cache/unsloth_compiled_cache_2
==((====))==  Unsloth 2025.5.9: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Making `model.base_model.model.model` require gradients
[LOCAL_RANK=2] Patching log. Dir: outputs/qwen3-8b-openthought-2gpus/, GPUs: 4
[LOCAL_RANK=2] Log patch initialization complete.
🔧 Patching Trainer to use RandomSamplerSeededByEpoch


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
06:54:12 | INFO     | GPU0 | patch_sampler.py:28 | 📋 Dataloader examples logged to .log/dataloader_examples.html
06:54:12 | INFO     | GPU0 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 1000 indices
First ids: [776, 507, 895, 922, 33, 483, 85, 750, 354, 523]
...Last ids: [104, 754, 142, 228, 250, 281, 759, 25, 114, 654]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
