Collecting unsloth
  Downloading unsloth-2025.5.9-py3-none-any.whl.metadata (47 kB)
Collecting torchaudio
  Downloading torchaudio-2.7.1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (6.6 kB)
Collecting torchvision
  Downloading torchvision-0.22.1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (6.1 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting transformers
  Downloading transformers-4.52.4-py3-none-any.whl.metadata (38 kB)
Collecting accelerate
  Downloading accelerate-1.7.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting lm-eval
  Downloading lm_eval-0.4.8-py3-none-any.whl.metadata (50 kB)


In [2]:
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset
import torch
import json
import os
import gc

os.environ["WANDB_DISABLED"] = "true"  # Deprecated in transformers (removal planned for v5); report_to="none" is the supported switch

# Check GPU availability and memory
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    print(f"GPU name: {torch.cuda.get_device_name()}")

# Clear GPU cache
torch.cuda.empty_cache()
gc.collect()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


2025-06-06 05:42:09.990223: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1749188530.241016      35 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1749188530.371431      35 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


🦥 Unsloth Zoo will now patch everything to make training faster!
CUDA available: True
GPU count: 1
GPU memory: 15.9 GB
GPU name: Tesla P100-PCIE-16GB


581

In [3]:
max_seq_length = 2048  # Maximum tokens per training example; Unsloth handles RoPE scaling internally
dtype = None  # None auto-detects: float16 on pre-Ampere GPUs (T4, V100, P100), bfloat16 on Ampere and newer
load_in_4bit = True  # Use 4bit quantization to reduce memory usage

print("Loading Mistral-7B-q4 model...")
try:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/mistral-7b-bnb-4bit",  # More reliable than mistral-7b-q4
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
        trust_remote_code=True,
    )
    
    # Add padding token if missing
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    print("Model loaded successfully!")
    
except Exception as e:
    print(f"Error loading model: {e}")
    print("Trying alternative model...")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
        trust_remote_code=True,
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

# Add LoRA adapters
print("Adding LoRA adapters...")
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank; any integer > 0 works (8, 16, 32, 64, 128 are common choices)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # No dropout; supports any value, but 0 is the optimized setting
    bias="none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  # Set True to use rank-stabilized LoRA
    loftq_config=None,  # Optional LoftQ initialization config
)

print("Model setup complete!")
print(f"Model parameters: {model.num_parameters():,}")

Loading Mistral-7B-q4 model...
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.5.9: Fast Mistral patching. Transformers: 4.52.4.
   \\   /|    Tesla P100-PCIE-16GB. Num GPUs = 1. Max memory: 15.888 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 6.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Model loaded successfully!
Adding LoRA adapters...


Unsloth 2025.5.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Model setup complete!
Model parameters: 7,283,675,136
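
The adapter size follows directly from the LoRA settings in the cell above: each adapted linear layer of shape (d_in, d_out) gains two low-rank matrices holding r * (d_in + d_out) weights in total. Below is a minimal sketch of that arithmetic. It assumes the published Mistral-7B dimensions (hidden size 4096, 8 key/value heads of size 128, MLP width 14336, 32 layers), which the notebook itself never prints, and the helper name `lora_param_count` is hypothetical.

```python
# Assumed Mistral-7B dimensions (taken from the public model config,
# not from anything printed in this notebook).
HIDDEN = 4096        # model width
KV_DIM = 8 * 128     # grouped-query attention: 8 KV heads x head size 128
MLP = 14336          # feed-forward inner width
NUM_LAYERS = 32

# (d_in, d_out) of every module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA weights: A is (rank x d_in), B is (d_out x rank) per module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(rank=16)
print(f"{trainable:,}")
```

With r=16 this gives 41,943,040, the same "Trainable parameters" figure Unsloth reports when training starts. Adding it to the 7,241,732,096 weights implied by those dimensions (with a 32,000-token vocabulary) gives the 7,283,675,136 printed above. Adapter size grows linearly with r, so r=32 would double it.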


In [4]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Handle None values
        input_text = input_text if input_text is not None else ""
        
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        
        # Truncate by characters to prevent memory issues
        # (~4 characters per token is only a rough heuristic, not a token count)
        if len(text) > max_seq_length * 4:
            text = text[:max_seq_length * 4] + EOS_TOKEN
            
        texts.append(text)
    
    return {"text": texts}

# Load dataset
print("Loading Alpaca dataset...")
try:
    dataset = load_dataset("tatsu-lab/alpaca", split="train")
    print(f"Original dataset size: {len(dataset)}")
    
    # Take a smaller subset for faster training and memory efficiency with 7B model
    dataset = dataset.select(range(1500))  # Use first 1500 examples for 7B model
    print(f"Using subset size: {len(dataset)}")
    
    # Apply formatting with error handling
    dataset = dataset.map(
        formatting_prompts_func, 
        batched=True,
        remove_columns=dataset.column_names,  # Remove original columns
        num_proc=1  # Reduced to prevent multiprocessing issues
    )
    
    print("Dataset processed successfully!")
    print("\nSample formatted text:")
    print(dataset[0]["text"][:500] + "...")
    
except Exception as e:
    print(f"Error loading dataset: {e}")
    raise

# ============================
# ⚙️ TRAINING SETUP
# ============================
print("Setting up training...")

# Enable training mode
FastLanguageModel.for_training(model)

# Clear cache before training
torch.cuda.empty_cache()
gc.collect()

# Training arguments optimized for Mistral-7B
training_args = TrainingArguments(
    per_device_train_batch_size=8,    # Sequences per forward pass; lower this first on OOM
    gradient_accumulation_steps=4,   # Effective batch size = 8 x 4 = 32
    warmup_steps=5,
    max_steps=-1,  # -1 for full training (no step limit)
    num_train_epochs=2,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=3407,
    output_dir="mistral_outputs",
    save_steps=50,
    save_total_limit=1,  # Keep only 1 checkpoint to save space
    dataloader_pin_memory=False,
    dataloader_num_workers=0,
    remove_unused_columns=False,
    group_by_length=True,  # Can help with memory efficiency
    ddp_find_unused_parameters=False,
    report_to="none",  # Disable all logging integrations (None falls back to the library default)
    disable_tqdm=False,  # Keep progress bar enabled
    save_strategy="steps",
    logging_strategy="steps",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=1,
    packing=False,  # Can make training 5x faster for short sequences.
    args=training_args,
)

print("Training setup complete!")

Loading Alpaca dataset...


Original dataset size: 52002
Using subset size: 1500


Dataset processed successfully!

Sample formatted text:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Input:


### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.</s>...
Setting up training...


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Training setup complete!
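
The length of the run is fixed by the settings above: 1,500 examples, a per-device batch of 8, 4 accumulation steps, and 2 epochs. The sketch below works out the number of optimizer steps from those values. The function name `total_optimizer_steps` is hypothetical, and how a trailing partial accumulation group is rounded is an assumption that can differ between Trainer versions; here both divisions come out exact, so the rounding mode does not change the answer.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate optimizer updates for a run (rounding behaviour is an assumption)."""
    # Mini-batches the dataloader yields per epoch (the last one may be partial).
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # One optimizer update per `grad_accum` mini-batches.
    updates_per_epoch = max(math.ceil(batches_per_epoch / grad_accum), 1)
    return updates_per_epoch * epochs


steps = total_optimizer_steps(num_examples=1500, per_device_batch=8, grad_accum=4, epochs=2)
print(steps)
```

This yields 94, matching the "Total steps = 94" line in the Unsloth banner when training starts. With `save_steps=50`, only one intermediate checkpoint falls inside such a run.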


In [5]:
print("\n" + "="*50)
print("🚀 STARTING TRAINING...")
print("="*50)

try:
    # Monitor memory before training
    if torch.cuda.is_available():
        print(f"GPU memory before training: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    
    trainer_stats = trainer.train()
    print("Training completed successfully!")
    print(f"Training loss: {trainer_stats.training_loss:.4f}")
    
except torch.cuda.OutOfMemoryError as e:
    print(f"GPU Out of Memory Error: {e}")
    print("\n🔧 Trying fallback settings...")
    
    # Clear cache and try with even smaller settings
    torch.cuda.empty_cache()
    gc.collect()
    
    # Fallback training arguments
    training_args.per_device_train_batch_size = 1
    training_args.gradient_accumulation_steps = 32
    training_args.num_train_epochs = 1
    max_seq_length = 1024
    
    # Recreate trainer with smaller dataset and settings
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset.select(range(500)),  # Even smaller fallback dataset
        dataset_text_field="text",
        max_seq_length=1024,  # Reduced sequence length
        dataset_num_proc=1,
        packing=False,
        args=training_args,
    )
    
    try:
        trainer_stats = trainer.train()
        print("✅ Training completed with fallback settings!")
    except Exception as e2:
        print(f"❌ Fallback training also failed: {e2}")
        print("\nTroubleshooting suggestions:")
        print("1. Use a GPU with more VRAM (16GB+ recommended for Mistral-7B)")
        print("2. Consider using TinyLlama or Gemma-2B instead")
        print("3. Reduce dataset size further")
        print("4. Use max_seq_length=512")
        
except Exception as e:
    print(f"Training error: {e}")

# Clear cache after training
torch.cuda.empty_cache()
gc.collect()


🚀 STARTING TRAINING...
GPU memory before training: 4.02 GB


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,500 | Num Epochs = 2 | Total steps = 94
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 41,943,040/7,000,000,000 (0.60% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step  Training Loss
  10  1.1192
  20  0.8407
  30  0.7686
  40  0.7436
  50  0.7033
  60  0.5988
  70  0.6123
  80  0.5969
  90  0.6625


Training completed successfully!
Training loss: 0.7312


1900
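
The reported "Training loss: 0.7312" is the average over the whole run, not the loss at the last step, which is why it sits well above the final logged values. A small sketch using only the nine values logged above (one every `logging_steps=10` steps) makes that visible; the variable names are illustrative.

```python
from statistics import mean

# Losses copied from the training log above, keyed by optimizer step.
logged = {
    10: 1.1192, 20: 0.8407, 30: 0.7686, 40: 0.7436, 50: 0.7033,
    60: 0.5988, 70: 0.6123, 80: 0.5969, 90: 0.6625,
}

avg_logged = mean(logged.values())
print(f"mean of logged losses (steps 1-90): {avg_logged:.4f}")
print(f"last logged loss (step 90):         {logged[90]:.4f}")
```

The mean of the logged values is about 0.738, close to the reported 0.7312; the small gap is consistent with steps 91 to 94, which are included in the run average but never reach a logging boundary.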

In [6]:
print("Saving model...")
try:
    model.save_pretrained("mistral_finetuned")
    tokenizer.save_pretrained("mistral_finetuned")
    print("✅ Model saved to ./mistral_finetuned/")
except Exception as e:
    print(f"❌ Error saving model: {e}")

Saving model...
✅ Model saved to ./mistral_finetuned/


In [10]:
print("\n" + "="*50)
print("🧪 TESTING MODEL INFERENCE...")
print("="*50)

try:
    FastLanguageModel.for_inference(model)  # Enable native 2x faster inference
    
    # Test generation
    test_instruction = "Explain about LLM Fine-Tuning also tell about LoRa technique"
    test_prompt = alpaca_prompt.format(
        test_instruction,
        "",
        "",
    )
    
    inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")
    
    print("Generating response...")
    with torch.no_grad():
        outputs = model.generate(
            **inputs, 
            max_new_tokens=500, 
            use_cache=True,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    print("\n=== Sample Generation ===")
    response = tokenizer.batch_decode(outputs)[0]
    generated_text = response.split("### Response:")[-1].split(EOS_TOKEN)[0].strip()
    print(f"Question: {test_instruction}")
    print(f"Answer: {generated_text}")
    
except Exception as e:
    print(f"❌ Inference test error: {e}")

# Clear cache before evaluation
torch.cuda.empty_cache()
gc.collect()


🧪 TESTING MODEL INFERENCE...
Generating response...

=== Sample Generation ===
Question: Explain about LLM Fine-Tuning also tell about LoRa technique
Answer: LLM fine-tuning is a technique used to improve the performance of large language models (LLMs) by training them on a specific task. This involves fine-tuning the model on a smaller dataset of the desired task, which allows the model to learn the task-specific knowledge and better generalize to the given task. LoRa is a technique used to improve the performance of LLMs by compressing the model parameters, making it easier to fine-tune and deploy on resource-constrained devices. LoRa is achieved by using a combination of low-rank matrix factorization and quantization techniques.


0
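
The inference cell reuses the training template with an empty response slot, then cuts the decoded text at the `### Response:` marker and at the EOS token. The same logic is sketched below as two standalone helpers that can be tried without a GPU. The helper names are hypothetical, and the `</s>` default is an assumption based on the EOS token visible in the sample formatted text earlier.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def build_inference_prompt(instruction, input_text=""):
    """Fill the template but leave the response slot empty for the model to complete."""
    return ALPACA_PROMPT.format(instruction, input_text, "")


def extract_response(decoded, eos_token="</s>"):
    """Keep the text after the last response marker, up to the first EOS token."""
    return decoded.split("### Response:")[-1].split(eos_token)[0].strip()


prompt = build_inference_prompt("Give three tips for staying healthy.")
print(prompt.endswith("### Response:\n"))
```

Splitting on the last marker keeps only the final answer if the model repeats the template. An alternative is to slice the generated token IDs past the prompt length before decoding, which avoids string matching altogether.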

In [11]:
!lm_eval --model hf \
    --model_args pretrained=./mistral_finetuned,dtype=auto,trust_remote_code=True \
    --tasks arc_easy \
    --batch_size 2 \
    --output_path finetuned_results \
    --log_samples

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


2025-06-06 06:47:44.279949: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1749192464.302728     318 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1749192464.310099     318 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
README.md: 100%|███████████████████████████| 9.00k/9.00k [00:00<00:00, 44.7MB/s]
ARC-Easy/train-00000-of-00001.parquet: 100%|██| 331k/331k [00:00<00:00, 662kB/s]
ARC-Easy/test-00000-of-00001.parquet: 100%|███| 346k/346k [00:00<00:00, 600kB/s]
ARC-Easy/validation-00000-of-00001.parqu(…): 100%|█| 86.1k/86.1k [00:00<00:00, 1
Generating train split: 100%|████| 2251/2251 [00:00<00:00, 245511.19 examples/s]
Generating test split: 100%|████

In [12]:
!lm_eval --model hf \
    --model_args pretrained=unsloth/mistral-7b-bnb-4bit,dtype=auto,trust_remote_code=True \
    --tasks arc_easy \
    --batch_size 2 \
    --output_path baseline_mistral.json

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


2025-06-06 07:26:40.038430: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1749194800.061369     386 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1749194800.068765     386 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
tokenizer_config.json: 100%|███████████████| 1.02k/1.02k [00:00<00:00, 9.43MB/s]
tokenizer.model: 100%|███████████████████████| 493k/493k [00:00<00:00, 1.16MB/s]
tokenizer.json: 100%|██████████████████████| 1.80M/1.80M [00:00<00:00, 6.93MB/s]
special_tokens_map.json: 100%|█████████████████| 438/438 [00:00<00:00, 5.12MB/s]
100%|█████████████████████████████████████| 2376/2376 [00:01<00:00, 1248.59it/s]
Running loglikelihood requests: 
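
Once both evaluations finish, the two result files can be compared directly. The sketch below is hedged: it assumes the JSON written under `--output_path` has a top-level `results` mapping keyed by task name, with metric keys such as `acc,none`. The exact file names and locations depend on the lm-eval version, so the paths are left as arguments, and both helper names are hypothetical.

```python
import json


def load_task_metrics(results_file, task="arc_easy"):
    """Read one lm-eval results JSON and return the metric dict for a task.

    Assumes a top-level "results" mapping keyed by task name.
    """
    with open(results_file) as f:
        data = json.load(f)
    return data["results"][task]


def metric_delta(baseline, finetuned, metric="acc,none"):
    """Fine-tuned minus baseline for one metric (positive means improvement)."""
    return finetuned[metric] - baseline[metric]
```

A typical use would be `metric_delta(load_task_metrics(baseline_path), load_task_metrics(finetuned_path))`, where both paths point at whatever results files the two runs actually produced.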