# 🚀 OpenSloth Demo Training Notebook

This notebook demonstrates how to fine-tune large language models using opensloth's multi-GPU capabilities. It's equivalent to running:

```bash
opensloth-train examples/example_sharegpt_lora_2gpus.py
```

## What This Demo Does

- **Multi-GPU Training**: Runs one process per GPU with NCCL synchronization (the example script targets 2 GPUs; the cells below use 4)
- **Adaptive Batching**: Optimizes sequence sorting and padding
- **LoRA Fine-tuning**: Efficient parameter updates with Low-Rank Adaptation
- **Response-only Loss**: Calculates loss only on assistant responses
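
To make the response-only loss idea concrete, here is a minimal sketch in plain Python. It is not OpenSloth's implementation: the helper names `find_subsequence` and `mask_non_response` and the toy token ids are hypothetical. It relies only on the common convention that a label of `-100` is ignored by PyTorch's cross-entropy loss, so every token outside an assistant turn is masked out.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default


def find_subsequence(haystack, needle, start=0):
    """Return the first index >= start where needle occurs in haystack, or -1."""
    n = len(needle)
    for i in range(start, len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def mask_non_response(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens inside assistant turns."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        resp_start = find_subsequence(input_ids, response_ids, pos)
        if resp_start == -1:
            break
        # The response content starts right after the assistant marker
        content_start = resp_start + len(response_ids)
        # ...and ends at the next user marker, or at the end of the sequence
        next_instr = find_subsequence(input_ids, instruction_ids, content_start)
        content_end = len(input_ids) if next_instr == -1 else next_instr
        labels[content_start:content_end] = input_ids[content_start:content_end]
        pos = content_end
    return labels


# Toy ids (hypothetical): 1 = user marker, 2 = assistant marker
input_ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
labels = mask_non_response(input_ids, instruction_ids=[1], response_ids=[2])
print(labels)
```

In the real configuration the markers are the tokenized forms of `instruction_part` and `response_part`, which span several tokens each; the sketch handles multi-token markers the same way.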

## Prerequisites

1. opensloth installed: `pip install git+https://github.com/anhvth/opensloth.git`
2. At least 2 GPUs available (adjust the `gpus` list in `GpuConfig`, e.g. `gpus=[0, 1]`, to match your hardware)
3. Sufficient VRAM (reduce batch size if needed)

In [1]:
%%capture
%load_ext autoreload
%autoreload 2
# %cd ../

In [2]:
# Import opensloth configuration classes
from opensloth.opensloth_config import *

# Check GPU availability
import torch
print(f'🔥 CUDA Available: {torch.cuda.is_available()}')
print(f'🔥 GPU Count: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    print(f'   GPU {i}: {torch.cuda.get_device_name(i)}')


🔥 CUDA Available: True
🔥 GPU Count: 4
   GPU 0: NVIDIA H100 80GB HBM3
   GPU 1: NVIDIA H100 80GB HBM3
   GPU 2: NVIDIA H100 80GB HBM3
   GPU 3: NVIDIA H100 80GB HBM3


## ⚙️ Configuration Setup

OpenSloth uses Pydantic models for type-safe configuration. We'll set up:

1. **Data Configuration**: Dataset and tokenization settings
2. **Training Configuration**: GPU allocation and loss calculation
3. **Model Configuration**: Base model and LoRA parameters
4. **Training Arguments**: Learning rate, batch size, and optimization settings
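
When choosing batch-size settings, it helps to compute the effective global batch size up front. The sketch below assumes the usual data-parallel relationship (per-device batch size × gradient accumulation steps × number of GPUs); the helper name `global_batch_size` is hypothetical and not part of OpenSloth.

```python
def global_batch_size(per_device_batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Number of samples contributing to one optimizer step across all GPUs."""
    return per_device_batch_size * grad_accum_steps * num_gpus


# Values used in the 4-GPU configuration cell of this notebook
multi_gpu = global_batch_size(per_device_batch_size=1, grad_accum_steps=8, num_gpus=4)

# Values used in the single-GPU Unsloth comparison script
single_gpu = global_batch_size(per_device_batch_size=4, grad_accum_steps=8 * 4, num_gpus=1)

print(multi_gpu, single_gpu)
```

These two values correspond to the `Global batch size: 32` and `Global batch size: 128` lines that appear in the training logs.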

In [None]:
# %%writefile train_opensloth.py
from typing import TYPE_CHECKING

from opensloth.scripts.opensloth_trainer import run_mp_training, setup_envs
from opensloth.opensloth_config import (
    OpenSlothConfig,
    HFDatasetConfig,
    GpuConfig,
    FastModelArgs,
    LoraArgs,
)

if TYPE_CHECKING:
    # Imported for type hints only, so this line does not load transformers at import time
    from transformers import TrainingArguments


# Main configuration using Pydantic models
def get_configs() -> "tuple[OpenSlothConfig, TrainingArguments]":
    # Important: do not import transformers/unsloth related modules at the top level
    from transformers import TrainingArguments

    opensloth_config = OpenSlothConfig(
        data=HFDatasetConfig(
            tokenizer_name="Qwen/Qwen3-8B",
            chat_template="qwen3",
            instruction_part="<|im_start|>user\n",
            response_part="<|im_start|>assistant\n",
            num_samples=1000,
            nproc=52,
            max_seq_length=4096,
            source_type="hf",
            dataset_name="mlabonne/FineTome-100k",
            split="train",
        ),
        GpuConfig=GpuConfig(
            gpus=[0, 1, 2, 3],
        ),
        fast_model_args=FastModelArgs(
            model_name="unsloth/Qwen3-8B-bnb-4bit",
            max_seq_length=4096,
            load_in_4bit=True,
        ),
        lora_args=LoraArgs(
            r=8,
            lora_alpha=16,
            target_modules=[
                "q_proj",
                "k_proj",
                "v_proj",
                "o_proj",
                "gate_proj",
                "up_proj",
                "down_proj",
            ],
            lora_dropout=0,
            bias="none",
            use_rslora=False,
        ),
    )

    # Training arguments (a Hugging Face TrainingArguments instance)
    training_config = TrainingArguments(
        output_dir="outputs/qwen3-8b-FineTome-4gpus/",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        logging_steps=1,
        num_train_epochs=1,
        lr_scheduler_type="linear",
        warmup_steps=5,
        save_total_limit=1,
        weight_decay=0.01,
        optim="adamw_8bit",
        seed=3407,
        report_to="tensorboard",  # "tensorboard" or "wandb"
    )
    setup_envs(opensloth_config, training_config)
    return opensloth_config, training_config


if __name__ == "__main__":
    opensloth_config, training_config = get_configs()
    run_mp_training(opensloth_config.GpuConfig.gpus, opensloth_config, training_config)

Global batch size: 32
[MP] Running on 4 GPUs


12:42:19 | INFO     | GPU0 | opensloth_trainer.py:42 | Training on GPU 0 with output_dir outputs/qwen3-8b-FineTome-4gpus/
12:42:19 | INFO     | GPU0 | opensloth_trainer.py:45 | 🚀 Starting total training timer
12:42:20 | INFO     | GPU3 | opensloth_trainer.py:42 | Training on GPU 3 with output_dir outputs/qwen3-8b-FineTome-4gpus/
12:42:20 | INFO     | GPU3 | opensloth_trainer.py:45 | 🚀 Starting total training timer
12:42:20 | INFO     | GPU2 | opensloth_trainer.py:42 | Training on GPU 2 with output_dir outputs/qwen3-8b-FineTome-4gpus/
12:42:20 | INFO     | GPU2 | opensloth_trainer.py:45 | 🚀 Starting total training timer
12:42:20 | INFO     | GPU1 | opensloth_trainer.py:42 | Training on GPU 1 wit

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Using compiler location: .cache/unsloth_compiled_cache_3
==((====))==  Unsloth 2025.5.9: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Making `model.base_model.model.model` require gradients


Tokenizing dataset (num_proc=52): 100%|██████████| 1000/1000 [00:10<00:00, 94.21 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 1000/1000 [00:00<00:00, 25213.88 examples/s]
12:43:18 | INFO     | GPU3 | init_modules.py:148 | Creating final SFTTrainer with prepared dataset...
12:43:18 | INFO     | GPU2 | init_modules.py:148 | Creating final SFTTrainer with prepared dataset...
12:43:18 | INFO     | GPU0 | init_modules.py:148 | Creating final SFTTrainer with prepared dataset...
12:43:19 | INFO     | GPU1 | init_modules.py:148 | Creating final SFTTrainer with prepared dataset...
12:43:20 | INFO     | GPU3 | init_modules.py:161 | Replacing DataCollatorForLanguageModeling with DataCollatorForSeq2Seq for better sequence handling
12:43:20 | INFO     | 

[LOCAL_RANK=3] Patching log. Dir: outputs/qwen3-8b-FineTome-4gpus/, GPUs: 4
[LOCAL_RANK=3] Log patch initialization complete.
🔧 Patching Trainer to use RandomSamplerSeededByEpoch
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Using compiler location: .cache/unsloth_compiled_cache_2
==((====))==  Unsloth 2025.5.9: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Making `model.base_model.model.model` require gradients
[LOCAL_RANK=2] Patching log. Dir: outputs/qwen3-8b-FineTome-4gpus/, GPUs: 4
[LOCAL_RANK=2] Log patc

  0%|          | 0/32 [00:00<?, ?it/s]
12:43:21 | INFO     | GPU2 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 1000 indices
First ids dataset samples: [776, 507, 895, 922, 33, 483, 85, 750, 354, 523]
...Last ids: [104, 754, 142, 228, 250, 281, 759, 25, 114, 654]
Process Process-3:
Traceback (most recent call last):
  File "/home/anhvth5/miniconda3/envs/opensloth_env/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/anhvth5/miniconda3/envs/opensloth_env/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/anhvth5/projects/opensloth/src/opensloth/scripts/opensloth_trainer.py", line 69, in train_on_single_gpu
    trainer.train()
  File "/home/anhvth5/miniconda3/envs/opensloth_env/lib/python3.11/site-packages/transformers/trainer.py", line 2240, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/ho

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Using compiler location: .cache/unsloth_compiled_cache_0
==((====))==  Unsloth 2025.5.9: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Making `model.base_model.model.model` require gradients
[LOCAL_RANK=0] Patching log. Dir: outputs/qwen3-8b-FineTome-4gpus/, GPUs: 4
[LOCAL_RANK=0] Log patch initialization complete.
🔧 Patching Trainer to use RandomSamplerSeededByEpoch


12:43:22 | INFO     | GPU0 | patch_sampler.py:28 | 📋 Dataloader examples logged to .log/dataloader_examples.html
12:43:22 | INFO     | GPU0 | patch_sampler.py:52 | 🎲 Sampler epoch 0: emitting 1000 indices
First ids dataset samples: [776, 507, 895, 922, 33, 483, 85, 750, 354, 523]
...Last ids: [104, 754, 142, 228, 250, 281, 759, 25, 114, 654]
Process Process-1:
Traceback (most recent call last):
  File "/home/anhvth5/miniconda3/envs/opensloth_env/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/anhvth5/miniconda3/envs/opensloth_env/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/anhvth5/projects/opensloth/src/opensloth/scripts/opensloth_trainer.py", line 69, in train_on_single_gpu
    trainer.train()
  File "/home/anhvth5/miniconda3/envs/opensloth_env/lib/python3.11/site-packages/tra


=== EXAMPLE #1 ===
<|im_start|>user
How does the structure and function of the small intestine relate to nutrient absorption?<|im_end|>
<|im_start|>assistant
<think>

</think>

The small intestine is a crucial organ in the digestive system, responsible for the absorption of nutrients from the food we consume. Its structure and function are intricately related to its role in nutrient absorption.

1. Length and surface area: The small intestine is approximately 6 meters (20 feet) long, which provides an extensive surface area for nutrient absorption. The inner lining of the small intestine, known as the mucosa, is covered with tiny, finger-like projections called villi. These villi further increase the surface area for absorption. Each villus is covered with even smaller projections called microvilli, which form the brush border. This extensive surface area allows for efficient absorption of nutrients.

2. Specialized cells: The small intestine is lined with specialized ce



Error in training, terminating all processes


Exception: Error in training

In [32]:
!python train_opensloth.py

Global batch size: 128
[MP] Running on 4 GPUs
Global batch size: 128
[MP] Running on 4 GPUs
Global batch size: 128
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/anhvth5/miniconda3/envs/opensloth_env/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
[MP] Running on 4 GPUs
Global batch size: 128
    exitcode = _main(fd, parent_sentinel)
[MP] Running on 4 GPUs
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anhvth5/miniconda3/envs/opensloth_env/lib/python3.11/multiprocessing/spawn.py", line 131, in _main
    prepare(preparation_data)
  File "/home/anhvth5/miniconda3/envs/opensloth_env/lib/python3.11/multiprocessing/spawn.py", line 246, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/anhvth5/miniconda3/envs/opensloth_env/lib/python3.11/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
Global batch size: 128
    main_content = runpy.run_path(main_path,
[MP] Running on 4 GPUs
       

### Compare with Unsloth

In [6]:
import os
# from llm_utils import *
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
import unsloth

Dataset({
    features: ['conversations', 'source', 'score', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 10000
})

# Unsloth Main
The script below is adapted from the Unsloth notebook at https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(14B)-Reasoning-Conversational.ipynb

In [16]:
%%writefile /tmp/train_unsloth.py
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Set visible devices for training
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0,
    bias="none",
    use_rslora=False,
)
os.environ['HYPERSLOTH_LOCAL_RANK'] = '0'  # Set local rank for distributed training
from opensloth.dataset_utils import get_tokenized_dataset
from opensloth.opensloth_config import HFDatasetConfig

# Build the dataset config (same settings as the OpenSloth run, but 10,000 samples)
data = HFDatasetConfig(
    tokenizer_name="Qwen/Qwen3-8B",
    chat_template="qwen3",
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
    num_samples=10000,
    nproc=52,
    max_seq_length=4096,
    source_type="hf",
    dataset_name="mlabonne/FineTome-100k",
    split="train",
)

# Get the tokenized dataset
tokenized_dataset = get_tokenized_dataset(data)
print(tokenized_dataset)  # a bare expression prints nothing when run as a script
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = tokenized_dataset, # Use the tokenized dataset
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        output_dir = "outputs/qwen3-8b-FineTome-unsloth/",
        dataset_text_field = "text",
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8*4, # 4 per device * 32 accumulation steps = 128 global batch size on 1 GPU
        warmup_steps = 5,
        num_train_epochs = 1, # Set this for 1 full training run.
        learning_rate = 1e-5, # Same learning rate as the OpenSloth run above
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        # seed = 3407,
        report_to = "tensorboard", # or "wandb"
    ),
)

from opensloth.patching.patch_sampler import RandomSamplerSeededByEpoch
from fastcore.all import patch


@patch
def _get_train_sampler(
    self: type(trainer), train_dataset=None
) -> RandomSamplerSeededByEpoch:
    """Get a custom sampler for the training dataset."""
    if train_dataset is None:
        train_dataset = self.train_dataset

    print(f"Using custom sampler for {train_dataset.__class__.__name__}")
    return RandomSamplerSeededByEpoch(train_dataset)  # type: ignore

trainer.train()

Overwriting /tmp/train_unsloth.py
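
The script patches the trainer to use `RandomSamplerSeededByEpoch`, whose implementation is not shown in this notebook. The earlier logs show two GPU processes reporting identical index lists for epoch 0, which suggests the shuffle order depends only on a seed and the epoch number. The sketch below illustrates that idea with the standard library; `EpochSeededSampler` is a hypothetical stand-in, and the default seed value and the `set_epoch` method are assumptions, not OpenSloth's API.

```python
import random


class EpochSeededSampler:
    """Hypothetical sampler: the permutation depends only on (seed, epoch)."""

    def __init__(self, dataset_len: int, seed: int = 3407):
        self.dataset_len = dataset_len
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # Changing the epoch changes the seed used for the next shuffle
        self.epoch = epoch

    def __iter__(self):
        # A fresh generator per iteration keeps the order reproducible
        rng = random.Random(self.seed + self.epoch)
        indices = list(range(self.dataset_len))
        rng.shuffle(indices)
        return iter(indices)

    def __len__(self) -> int:
        return self.dataset_len


# Two independent samplers (for example, in two processes) agree on the order
order_a = list(EpochSeededSampler(10))
order_b = list(EpochSeededSampler(10))
print(order_a == order_b)
```

Because every process derives the same order, a single-GPU baseline and a multi-GPU run can be compared on the same sequence of samples.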


In [17]:
!python /tmp/train_unsloth.py

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.9: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth 2025.5.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
10:12:50 | INFO     | GPU0 | dataset_utils.py:222 | Preparing dataset 1d8d794fd6fc0ba8...
Tokenizing dataset (num_proc=52): 100%|█| 10000/10000 [00:12<00:00, 817.88 examp
Saving the dataset (1/1 shards): 100%|█| 10000/10000 [00:00<00:00, 79043.41 exam
Using custom sample