# Unsloth Fine-tuning on Google Colab

Train and fine-tune LLMs with Unsloth on Google Colab's free GPU.

**Before you start:**
1. Runtime → Change runtime type → GPU → T4 GPU (free tier)
2. Make a copy of this notebook to your Google Drive

**Total time:** ~10-15 minutes (setup + training)

## Step 1: Setup Environment

Install dependencies (takes ~5 minutes)

In [None]:
%%capture
# Install dependencies in the correct order
!pip install --upgrade pip

# Core ML frameworks
!pip install "trl>=0.12.0" "peft>=0.13.0" "bitsandbytes>=0.45.0" "transformers[sentencepiece]>=4.46.0"

# PyTorch
# torch 2.8.0 wheels are published for CUDA 12.6 and newer, so use the cu126 index
!pip install torch==2.8.0 torchvision --index-url https://download.pytorch.org/whl/cu126

# Unsloth
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# xformers
!pip install --no-deps "xformers>=0.0.32,<0.0.33" --index-url https://download.pytorch.org/whl/cu126

# Additional dependencies
!pip install datasets huggingface_hub accelerate sentencepiece protobuf python-dotenv

print("✅ Installation complete!")
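
Because `%%capture` hides pip's output, a failed install can go unnoticed. A minimal sketch for checking what actually got installed, using only the standard library; the package list is an assumption taken from the install commands above, and `installed_versions` is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Package names assumed from the pip commands in this step
PACKAGES = ["torch", "transformers", "trl", "peft", "bitsandbytes", "xformers", "unsloth"]

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:15s} {ver or 'MISSING'}")
```

Any line that prints `MISSING` suggests rerunning the install cell without `%%capture` to see the underlying pip error.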

## Step 2: Clone Repository

In [None]:
# Clone the repository
!git clone https://github.com/farhan-syah/unsloth-finetuning.git
%cd unsloth-finetuning

print("✅ Repository cloned!")

## Step 3: Configure Training

Edit these settings for your training run:

In [None]:
# ============================================
# CONFIGURATION - Edit these settings
# ============================================

# Model Selection (choose based on use case)
LORA_BASE_MODEL = "unsloth/Qwen3-1.7B-unsloth-bnb-4bit"  # 1.7B model, fits T4 GPU
# LORA_BASE_MODEL = "unsloth/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit"  # 2B model with vision
# LORA_BASE_MODEL = "unsloth/Qwen3-4B-unsloth-bnb-4bit"  # 4B model (needs A100)

# Inference/Merging Model (OPTIONAL - for higher quality merged model)
# Leave empty to use LORA_BASE_MODEL (faster, uses 4-bit)
# Uncomment to use unquantized model for true 16-bit quality (slower, requires more VRAM)
INFERENCE_BASE_MODEL = ""  # Empty = use LORA_BASE_MODEL (4-bit, faster)
# INFERENCE_BASE_MODEL = "unsloth/Qwen3-1.7B"  # True 16-bit (requires ~15GB VRAM during build)
# INFERENCE_BASE_MODEL = "unsloth/Qwen3-VL-2B-Instruct"  # For vision models

# Dataset
DATASET_NAME = "yahma/alpaca-cleaned"  # Change to your dataset

# Training Mode
# Quick test (recommended for first run)
MAX_STEPS = 50              # Train for 50 steps only (~2 minutes)
DATASET_MAX_SAMPLES = 100   # Use 100 samples only

# Full training (uncomment to use)
# MAX_STEPS = 0               # Train for full epochs
# DATASET_MAX_SAMPLES = 0     # Use all samples

# ============================================
# HUGGINGFACE CONFIGURATION (Optional)
# ============================================
# Set your HuggingFace username here if you plan to push to HF Hub in Step 8
# This allows proper model card generation with cross-links between repos
HF_USERNAME = ""  # Your HuggingFace username (e.g., "your-username")

# ============================================
# TRAINING HYPERPARAMETERS
# ============================================

# Sequence Length
# Modern models support 8k-32k+ but longer = quadratic VRAM usage
# 512-1024: Short instructions, 2048-4096: Balanced, 8192+: Long context
MAX_SEQ_LENGTH = 4096       # Maximum tokens per training sample (reduce if OOM)

# LoRA Configuration
# Rank: Controls trainable parameters (8, 16, 32, 64, 128)
# Alpha: Scaling factor (typically r or r*2)
LORA_RANK = 16              # Recommended: 16 or 32
LORA_ALPHA = 32             # Recommended: same as rank or 2x rank

# Batch Size Configuration
# EFFECTIVE BATCH SIZE = BATCH_SIZE × GRADIENT_ACCUMULATION_STEPS
# Target: 8-16 for stable training
BATCH_SIZE = 2              # Samples per GPU pass (reduce to 1 if OOM)
GRADIENT_ACCUMULATION_STEPS = 4  # Micro-batches before update
# Current Effective Batch Size: 2 × 4 = 8

# For faster training with T4 (15GB VRAM):
# BATCH_SIZE = 4, GRADIENT_ACCUMULATION_STEPS = 2  → Effective = 8 (faster)
# For maximum speed (if no OOM):
# BATCH_SIZE = 8, GRADIENT_ACCUMULATION_STEPS = 2  → Effective = 16 (fastest)

# Learning Rate & Schedule
LEARNING_RATE = 2e-4        # Standard for LoRA (2e-4 recommended)
NUM_TRAIN_EPOCHS = 1        # Number of passes through dataset
WARMUP_STEPS = 5            # Gradual LR ramp-up (5-10% of total steps)

# Optimization Settings
PACKING = False             # Pack short sequences (True = faster for short texts)
USE_GRADIENT_CHECKPOINTING = True  # False = faster but more VRAM

# For SPEED OPTIMIZATION on T4 (if no OOM):
# USE_GRADIENT_CHECKPOINTING = False
# BATCH_SIZE = 4 or 8
# PACKING = True

# Output Formats (GGUF requires llama.cpp - not available in Colab)
OUTPUT_FORMATS = ""  # Empty = no GGUF conversion (recommended for Colab)

# Output naming
OUTPUT_MODEL_NAME = "auto"  # Auto-generate name

# Author
AUTHOR_NAME = "Your Name"  # Your name for model card

# ============================================
# CALCULATED VALUES
# ============================================
effective_batch_size = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

print("✅ Configuration set!")
print(f"\n📊 Model Configuration:")
print(f"   Training Model: {LORA_BASE_MODEL}")
print(f"   Merging Model: {INFERENCE_BASE_MODEL if INFERENCE_BASE_MODEL else 'Same as training (4-bit)'}")
print(f"\n📚 Dataset:")
print(f"   Dataset: {DATASET_NAME}")
print(f"   Training: {MAX_STEPS if MAX_STEPS > 0 else 'Full epochs'} steps, {DATASET_MAX_SAMPLES if DATASET_MAX_SAMPLES > 0 else 'All'} samples")
print(f"\n⚙️  Hyperparameters:")
print(f"   Max Seq Length: {MAX_SEQ_LENGTH}")
print(f"   LoRA Rank: {LORA_RANK}, Alpha: {LORA_ALPHA}")
print(f"   Batch Size: {BATCH_SIZE}")
print(f"   Gradient Accumulation: {GRADIENT_ACCUMULATION_STEPS}")
print(f"   Effective Batch Size: {effective_batch_size} {'✓ Good' if 8 <= effective_batch_size <= 16 else '⚠️  Consider 8-16'}")
print(f"   Learning Rate: {LEARNING_RATE}")
print(f"\n🚀 Optimization:")
print(f"   Gradient Checkpointing: {'ON (slower, less VRAM)' if USE_GRADIENT_CHECKPOINTING else 'OFF (faster, more VRAM)'}")
print(f"   Packing: {'ON' if PACKING else 'OFF'}")
if HF_USERNAME:
    print(f"\n📤 HuggingFace:")
    print(f"   Username: {HF_USERNAME} (model cards will include HF links)")

## Step 4: Create .env File

In [None]:
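# Create .env file with configuration
import os

# HF_TOKEN is normally set in Step 8. Default it here so this cell does not
# raise NameError when run first (keeps an existing value, else env var, else empty).
HF_TOKEN = globals().get("HF_TOKEN") or os.environ.get("HF_TOKEN", "")

env_content = f"""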
# Model
LORA_BASE_MODEL={LORA_BASE_MODEL}
INFERENCE_BASE_MODEL={INFERENCE_BASE_MODEL}
OUTPUT_MODEL_NAME={OUTPUT_MODEL_NAME}

# Dataset
DATASET_NAME={DATASET_NAME}
DATASET_MAX_SAMPLES={DATASET_MAX_SAMPLES}
MAX_STEPS={MAX_STEPS}

# Training
MAX_SEQ_LENGTH={MAX_SEQ_LENGTH}
LORA_RANK={LORA_RANK}
LORA_ALPHA={LORA_ALPHA}
BATCH_SIZE={BATCH_SIZE}
GRADIENT_ACCUMULATION_STEPS={GRADIENT_ACCUMULATION_STEPS}
LEARNING_RATE={LEARNING_RATE}
NUM_TRAIN_EPOCHS={NUM_TRAIN_EPOCHS}
WARMUP_STEPS={WARMUP_STEPS}
PACKING={'true' if PACKING else 'false'}

# Optimization
USE_GRADIENT_CHECKPOINTING={'true' if USE_GRADIENT_CHECKPOINTING else 'false'}
MAX_GRAD_NORM=1.0
OPTIM=adamw_8bit

# Logging
LOGGING_STEPS=5
SAVE_STEPS=25
SAVE_TOTAL_LIMIT=2
SAVE_ONLY_FINAL=true

# Monitoring
WANDB_ENABLED=false

# Output
OUTPUT_FORMATS={OUTPUT_FORMATS}
OUTPUT_DIR_BASE=./outputs
PREPROCESSED_DATA_DIR=./data/preprocessed
CACHE_DIR=./cache

# HuggingFace
PUSH_TO_HUB=false
HF_USERNAME={HF_USERNAME}
HF_MODEL_NAME=auto
HF_TOKEN={HF_TOKEN}

# Author
AUTHOR_NAME={AUTHOR_NAME}

# Advanced
SEED=3407
FORCE_PREPROCESS=false
FORCE_RETRAIN=true
FORCE_REBUILD=true
CHECK_SEQ_LENGTH=false
"""

with open('.env', 'w') as f:
    f.write(env_content)

print("✅ .env file created!")
print(f"\n⚙️  Effective Batch Size: {effective_batch_size}")
if INFERENCE_BASE_MODEL:
    print(f"⚠️  Using true 16-bit model for merging: {INFERENCE_BASE_MODEL}")
    print(f"   This will require more VRAM during Step 7 (build)")
if not USE_GRADIENT_CHECKPOINTING:
    print(f"🚀 Gradient checkpointing disabled for faster training")
if HF_USERNAME:
    print(f"📤 HuggingFace username set: {HF_USERNAME}")
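
Since the `.env` file is assembled from a long template, a quick sanity check can catch a key that is assigned twice (dotenv-style loaders commonly keep only one of the values). A minimal sketch with a hypothetical parser that handles only plain `KEY=VALUE` lines, not quoting or multi-line values:

```python
from collections import Counter

def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        pairs.append((key.strip(), value.strip()))
    return pairs

def duplicate_keys(pairs):
    """Return keys that are assigned more than once, in sorted order."""
    counts = Counter(key for key, _ in pairs)
    return sorted(key for key, n in counts.items() if n > 1)

# Small illustrative input; not the real file
sample = "SEED=3407\nCHECK_SEQ_LENGTH=true\n# comment\nCHECK_SEQ_LENGTH=false\n"
print(duplicate_keys(parse_env(sample)))  # ['CHECK_SEQ_LENGTH']
```

In the notebook, `duplicate_keys(parse_env(open('.env').read()))` would run the same check on the generated file.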

## Step 5: Preprocess Dataset

Analyze your dataset and get smart configuration recommendations.

**This step:**
- Preprocesses and analyzes your dataset (cached, won't rerun if already done)
- Shows sequence length statistics
- Recommends BATCH_SIZE and MAX_STEPS values for 1-3 epochs
- Analyzes GPU memory and suggests settings

**After running this step:**
1. Review the recommendations shown below
2. If you want to use the recommended settings:
   - Go back to Step 3 and update the configuration values
   - Rerun Step 4 (Create .env File) to update the .env file
   - Skip rerunning this step (preprocessed data is already saved)
3. Continue to Step 6 (Train Model)

**Note:** Preprocessed data is cached. If you change DATASET_NAME or MAX_SEQ_LENGTH, set `FORCE_PREPROCESS=true` in the `.env` template in Step 4 and rerun that cell before rerunning this step.

In [None]:
# Preprocess dataset and get recommendations
!python preprocess.py
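
The kind of sequence length statistics this step reports can be approximated by hand. The sketch below is not the logic of `preprocess.py` (which is not shown here); `length_summary` is a hypothetical helper and the token counts are made up. It illustrates why `MAX_SEQ_LENGTH` matters: samples longer than the limit would be truncated or dropped, depending on how the training script handles them.

```python
import statistics

def length_summary(token_lengths, max_seq_length):
    """Summarize token counts and the share of samples that exceed the limit."""
    ordered = sorted(token_lengths)
    too_long = sum(1 for n in ordered if n > max_seq_length)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "over_limit_pct": 100 * too_long / len(ordered),
    }

# Made-up token counts, for illustration only
lengths = [120, 340, 512, 900, 5000]
print(length_summary(lengths, max_seq_length=4096))
```

If most samples sit well below the limit, lowering `MAX_SEQ_LENGTH` is one way to save VRAM.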

## Step 6: Train Model

This takes ~2 minutes for the quick test, or hours for full training.

In [None]:
# Run training
!python train.py

## Step 7: Build Merged Model

This creates the merged model (LoRA + base model combined) in safetensors format.

**Why skip GGUF in Colab?**
- GGUF conversion requires llama.cpp (not available in Colab)
- **Better workflow:** Create merged model here, then convert to GGUF locally (CPU-only, no GPU needed)

**This step creates:** `merged_16bit/` folder with complete model in safetensors format

In [None]:
# Build merged model (safetensors format)
# This skips GGUF since OUTPUT_FORMATS is empty
!python build.py

## Step 8: Save Your Model

**You have two models to save:**

1. **LoRA adapters** (~80-100MB) - Small, efficient, requires base model to use
2. **Merged model** (size varies by model) - Complete model, ready to use anywhere

**Choose your preferred method:**
- **Option A (Recommended):** HuggingFace Hub - Free, unlimited storage, easy sharing
- **Option B:** Google Drive - Simple, but limited free storage (15GB)

In [None]:
# Check your model output
import os

# List output directories
output_dirs = [d for d in os.listdir('outputs') if os.path.isdir(os.path.join('outputs', d))]
if output_dirs:
    model_dir = output_dirs[0]
    print(f"✅ Your model is in: outputs/{model_dir}")
    print(f"\nContents:")
    !ls -lh outputs/{model_dir}
    print(f"\nLoRA adapters: outputs/{model_dir}/lora/")
    print(f"Merged model: outputs/{model_dir}/merged_16bit/")
else:
    print("❌ No model found in outputs/")
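
`os.listdir` returns entries in arbitrary order, so `output_dirs[0]` may not be the latest run if `outputs/` holds several. A possible alternative, sketched with a hypothetical helper, picks the most recently modified subdirectory:

```python
import os

def newest_subdir(base):
    """Return the name of the most recently modified subdirectory of base, or None."""
    subdirs = [entry for entry in os.scandir(base) if entry.is_dir()]
    if not subdirs:
        return None
    return max(subdirs, key=lambda entry: entry.stat().st_mtime).name

# Only inspect outputs/ if it exists, so the snippet is safe to run anywhere
if os.path.isdir("outputs"):
    print(newest_subdir("outputs"))
```

In the cell above, `model_dir = newest_subdir("outputs")` could stand in for `output_dirs[0]`.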

### Option A: Push to HuggingFace Hub (Recommended)

**Why HuggingFace?**
- Free, unlimited storage
- Easy sharing and version control
- Direct integration with transformers, Ollama, etc.

**Steps:**
1. Get your HuggingFace token: https://huggingface.co/settings/tokens (create with "Write" access)
2. Run the cells below to push both LoRA and merged models

In [None]:
# A1. Configure HuggingFace settings
from huggingface_hub import login, HfApi
import os

# Try to get HF_TOKEN from Colab secrets first (recommended)
try:
    from google.colab import userdata
    HF_TOKEN = userdata.get('HF_TOKEN')
    print("✅ Using HF_TOKEN from Colab secrets")
except Exception:
    # Not running in Colab, or the secret is missing or inaccessible:
    # fall back to the environment variable, then to the interactive prompt in A2
    HF_TOKEN = os.getenv('HF_TOKEN', '')
    if not HF_TOKEN:
        print("💡 TIP: Add HF_TOKEN to Colab secrets (🔑 icon in left sidebar) for easier reuse")

# If you didn't set HF_USERNAME in Step 3, set it here
if not HF_USERNAME:
    HF_USERNAME = "your-username"  # Your HuggingFace username

# Repository names (auto-generated from model_dir by default)
LORA_REPO_NAME = f"{model_dir}-lora"
MERGED_REPO_NAME = f"{model_dir}"  # No suffix for merged model

print(f"HuggingFace Username: {HF_USERNAME}")
print(f"\nRepositories to create:")
print(f"   1. {HF_USERNAME}/{LORA_REPO_NAME} (LoRA adapters, ~80MB)")
print(f"   2. {HF_USERNAME}/{MERGED_REPO_NAME} (Merged model, size varies by model)")
print(f"\n💡 Later you can also create: {HF_USERNAME}/{model_dir}-gguf (for GGUF quantized)")
print(f"\n📖 How to set up Colab secrets:")
print(f"   1. Click the 🔑 icon in the left sidebar")
print(f"   2. Add new secret: Name='HF_TOKEN', Value=<your token from https://huggingface.co/settings/tokens>")
print(f"   3. Toggle 'Notebook access' ON for this notebook")
print(f"\nReady to push? Run the next cell.")


In [None]:
# A2. Push both models to HuggingFace Hub
from huggingface_hub import login, HfApi, create_repo
from pathlib import Path
import os
import subprocess

# Login to HuggingFace
if HF_TOKEN:
    login(token=HF_TOKEN)
else:
    print("\n🔐 No HF_TOKEN found. Please enter your token:")
    print("   Get it from: https://huggingface.co/settings/tokens")
    login()  # Will prompt interactively

api = HfApi()

# Get model paths
lora_path = f"outputs/{model_dir}/lora"
merged_path = f"outputs/{model_dir}/merged_16bit"

# Calculate sizes
lora_size = sum(f.stat().st_size for f in Path(lora_path).rglob('*') if f.is_file())
lora_size_mb = lora_size / (1024 * 1024)

merged_size = sum(f.stat().st_size for f in Path(merged_path).rglob('*') if f.is_file())
merged_size_gb = merged_size / (1024 * 1024 * 1024)

print("="*60)
print("UPLOADING TO HUGGINGFACE HUB")
print("="*60)

# Generate README files using standardized script
print("\n[0/3] Generating model cards...")
try:
    result = subprocess.run(
        ["python", "generate_readme_train.py"],
        capture_output=True,
        text=True,
        timeout=10
    )
    if result.returncode == 0:
        print("      ✅ Model cards generated")
    else:
        print(f"      ⚠️  Warning: {result.stderr}")
except Exception as e:
    print(f"      ⚠️  Could not generate model cards: {e}")

# 1. Push LoRA adapters
lora_repo_id = f"{HF_USERNAME}/{LORA_REPO_NAME}"
print(f"\n[1/3] Pushing LoRA adapters to {lora_repo_id}...")
print(f"      Size: {lora_size_mb:.1f} MB")

try:
    create_repo(repo_id=lora_repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(
        folder_path=lora_path,
        repo_id=lora_repo_id,
        repo_type="model",
        commit_message="Upload LoRA adapters"
    )
    print(f"      ✅ LoRA adapters uploaded!")
    print(f"      🔗 https://huggingface.co/{lora_repo_id}")
except Exception as e:
    print(f"      ❌ Error: {e}")

# 2. Push merged model
merged_repo_id = f"{HF_USERNAME}/{MERGED_REPO_NAME}"
print(f"\n[2/3] Pushing merged model to {merged_repo_id}...")
print(f"      Size: {merged_size_gb:.2f} GB (this will take several minutes)")

try:
    create_repo(repo_id=merged_repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(
        folder_path=merged_path,
        repo_id=merged_repo_id,
        repo_type="model",
        commit_message="Upload merged model"
    )
    print(f"      ✅ Merged model uploaded!")
    print(f"      🔗 https://huggingface.co/{merged_repo_id}")
except Exception as e:
    print(f"      ❌ Error: {e}")

print("\n" + "="*60)
print("UPLOAD COMPLETE")
print("="*60)
print(f"\n📦 Your models on HuggingFace:")
print(f"   • LoRA: https://huggingface.co/{lora_repo_id}")
print(f"   • Merged: https://huggingface.co/{merged_repo_id}")
print(f"\n💡 Use the merged model with:")
print(f"   • transformers: model = AutoModelForCausalLM.from_pretrained('{merged_repo_id}')")
print(f"   • Ollama (needs a GGUF repo, see Step 9): ollama pull hf.co/{merged_repo_id}-gguf")
print(f"\n📝 Model cards generated from training configuration")
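
Since both uploads are wrapped in `try`/`except`, an incomplete folder only surfaces as an error message mid-run. A minimal pre-flight check could look like the sketch below. `missing_for_upload` is a hypothetical helper; the file names are the usual Hugging Face conventions (`config.json` plus `.safetensors` weights for a merged model, `adapter_config.json` for a PEFT adapter), and the exact contents written by `build.py` are an assumption.

```python
from pathlib import Path

def missing_for_upload(folder, required=("config.json",), weight_suffix=".safetensors"):
    """Return a list of things that look absent from a model folder."""
    root = Path(folder)
    if not root.is_dir():
        return [f"{folder} (folder not found)"]
    problems = [name for name in required if not (root / name).is_file()]
    has_weights = any(p.is_file() and p.suffix == weight_suffix for p in root.rglob("*"))
    if not has_weights:
        problems.append(f"*{weight_suffix}")
    return problems

# Placeholder paths; in the notebook these would be merged_path and lora_path
print(missing_for_upload("outputs/example/merged_16bit"))
print(missing_for_upload("outputs/example/lora", required=("adapter_config.json",)))
```

An empty list means the folder passes this (deliberately shallow) check.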

### Option B: Google Drive (Alternative)

In [None]:
# B1. Upload to Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Copy to Google Drive
!mkdir -p /content/drive/MyDrive/unsloth-models
!cp -r outputs/* /content/drive/MyDrive/unsloth-models/

print("✅ Model copied to Google Drive: MyDrive/unsloth-models/")
print("")
print("📁 Your model contains:")
print("   - lora/ - LoRA adapters (~80MB)")
print("   - merged_16bit/ - Merged model in safetensors format (size varies by model)")
print("")
print("⚠️  Note: Google Drive free tier has 15GB storage limit")
print("Next: Download from Google Drive to convert to GGUF locally")

## Step 9: Convert to GGUF Locally (Optional)

After uploading to HuggingFace, you can download and convert to GGUF on your local machine.

**Why local conversion?**
- GGUF conversion is CPU-only (no GPU needed, works on any machine)
- llama.cpp not available in Colab
- Better for creating multiple quantization formats

---

### Step-by-Step: Download from HuggingFace and Convert to GGUF

Run these commands on your **local machine**:

```bash
# ============================================
# 1. Setup local environment (one-time)
# ============================================
# If you haven't set up locally yet:
git clone https://github.com/farhan-syah/unsloth-finetuning.git
cd unsloth-finetuning
bash setup.sh  # Installs dependencies + llama.cpp

# ============================================
# 2. Download models from HuggingFace
# ============================================
# Download LoRA adapters
hf download {HF_USERNAME}/{model_dir}-lora \
  --local-dir outputs/{model_dir}/lora

# Download merged model
hf download {HF_USERNAME}/{model_dir} \
  --local-dir outputs/{model_dir}/merged_16bit

# ============================================
# 3. Create .env file with your configuration
# ============================================
cat > .env <<'EOF'
# Model (must match what was used in training)
LORA_BASE_MODEL={LORA_BASE_MODEL}
# If INFERENCE_BASE_MODEL was left empty in Step 3, use the LORA_BASE_MODEL value here
INFERENCE_BASE_MODEL={INFERENCE_BASE_MODEL}
OUTPUT_MODEL_NAME={OUTPUT_MODEL_NAME}

# Dataset (for README generation)
DATASET_NAME={DATASET_NAME}
DATASET_MAX_SAMPLES={DATASET_MAX_SAMPLES}
MAX_STEPS={MAX_STEPS}

# Training params (for README generation)
MAX_SEQ_LENGTH={MAX_SEQ_LENGTH}
LORA_RANK={LORA_RANK}
LORA_ALPHA={LORA_ALPHA}
BATCH_SIZE={BATCH_SIZE}
GRADIENT_ACCUMULATION_STEPS={GRADIENT_ACCUMULATION_STEPS}
LEARNING_RATE={LEARNING_RATE}
NUM_TRAIN_EPOCHS={NUM_TRAIN_EPOCHS}

# Output formats - ADD YOUR DESIRED GGUF FORMATS HERE
OUTPUT_FORMATS=gguf_q4_k_m,gguf_q5_k_m

# HuggingFace (for model card generation)
HF_USERNAME={HF_USERNAME}
HF_MODEL_NAME=auto

# Author
AUTHOR_NAME={AUTHOR_NAME}

# Advanced
OUTPUT_DIR_BASE=./outputs
FORCE_REBUILD=false
SEED=3407
EOF

# ============================================
# 4. Convert to GGUF
# ============================================
# This will:
# - Load merged_16bit model
# - Create GGUF quantizations (Q4_K_M, Q5_K_M)
# - Generate README for GGUF
python build.py

# ============================================
# 5. Your GGUF files are ready!
# ============================================
# Location: outputs/{model_dir}/gguf/
ls -lh outputs/{model_dir}/gguf/

# Files created:
# - model.Q4_K_M.gguf  (~1.0GB for 1.7B model)
# - model.Q5_K_M.gguf  (~1.2GB for 1.7B model)
# - README.md (usage instructions)
# - tokenizer files
```

---

### Using Your GGUF Model

**With Ollama:**
```bash
cd outputs/{model_dir}/gguf

# Create Ollama model
cat > Modelfile <<EOF
FROM ./model.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

# Replace {model_dir_lowercase} with the lowercase form of your model directory name
ollama create {model_dir_lowercase} -f Modelfile
ollama run {model_dir_lowercase} "Hello! How can you help me?"
```

**With llama.cpp:**
```bash
# Interactive mode
./llama.cpp/llama-cli -m outputs/{model_dir}/gguf/model.Q4_K_M.gguf \
  -p "Hello! How can you help me?" \
  --temp 0.7

# Server mode
./llama.cpp/llama-server -m outputs/{model_dir}/gguf/model.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080
```

---

### Optional: Upload GGUF to HuggingFace

After creating GGUF files, you can upload them to a separate repository:

```bash
# Login to HuggingFace (one-time)
hf auth login

# Create GGUF repository
hf repo create {model_dir}-gguf --repo-type model

# Upload GGUF files
hf upload {HF_USERNAME}/{model_dir}-gguf \
  outputs/{model_dir}/gguf \
  --repo-type model
```

Your GGUF models will be at: `https://huggingface.co/{HF_USERNAME}/{model_dir}-gguf`

---

### Available GGUF Quantizations

Edit `OUTPUT_FORMATS` in the `.env` file above to choose quantizations:

| Format | Size (1.7B) | Quality | Use Case |
|--------|-------------|---------|----------|
| `gguf_q4_k_m` | ~1.0GB | Good | **Recommended** - best balance |
| `gguf_q5_k_m` | ~1.2GB | Better | Higher quality, larger size |
| `gguf_q8_0` | ~1.8GB | Excellent | Near original quality |
| `gguf_f16` | ~3.4GB | Best | Full precision (largest) |

Example for multiple formats:
```bash
OUTPUT_FORMATS=gguf_q4_k_m,gguf_q5_k_m,gguf_q8_0
```
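
The sizes in the table follow roughly from parameters × bits per weight ÷ 8. A sketch of that arithmetic, assuming approximate bits-per-weight figures for each format (the K-quant values in particular are rounded averages, and real files add some metadata overhead):

```python
# Approximate bits per weight for each format (assumed, rounded figures)
BITS_PER_WEIGHT = {
    "gguf_q4_k_m": 4.85,
    "gguf_q5_k_m": 5.7,
    "gguf_q8_0": 8.5,
    "gguf_f16": 16.0,
}

def estimated_size_gb(num_params, fmt):
    """Rough file size in GB (10^9 bytes): parameters x bits per weight / 8."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

# Estimates for a 1.7B-parameter model
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:12s} ~{estimated_size_gb(1.7e9, fmt):.1f} GB")
```

The same function gives a quick estimate for other model sizes, for example `estimated_size_gb(4e9, "gguf_q4_k_m")` for a 4B model.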

## Step 10: Test Your Model (Optional)

Quick test of your fine-tuned model:

In [None]:
from unsloth import FastLanguageModel
import torch

# Load your fine-tuned model
model_path = f"outputs/{model_dir}/lora"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_path,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

FastLanguageModel.for_inference(model)

# Test prompt
prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is machine learning?

### Response:
"""

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
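# do_sample=True is needed for temperature to take effect (greedy decoding ignores it)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)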
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n" + "="*50)
print("MODEL RESPONSE:")
print("="*50)
print(response)
print("="*50)
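
`tokenizer.decode(outputs[0])` returns the prompt together with the generated continuation. To print only the answer, one option is to split on the Alpaca-style `### Response:` marker used in the prompt. `extract_response` below is a hypothetical helper, shown on a hard-coded string so it runs without a model:

```python
def extract_response(decoded, marker="### Response:"):
    """Return the text after the last occurrence of marker, or the whole text if absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()

example = "### Instruction:\nWhat is machine learning?\n\n### Response:\nA field of AI."
print(extract_response(example))  # A field of AI.
```

In the test cell, `print(extract_response(response))` would then show just the model's answer.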

## 🎉 Done!

Your model has been trained and is ready to use!

**Next steps:**
1. Download the model from Google Drive or HuggingFace
2. Use it locally with Ollama or transformers
3. Share it on HuggingFace Hub

**Resources:**
- [Documentation](https://github.com/farhan-syah/unsloth-finetuning/tree/main/docs)
- [Training Guide](https://github.com/farhan-syah/unsloth-finetuning/blob/main/docs/TRAINING.md)
- [FAQ](https://github.com/farhan-syah/unsloth-finetuning/blob/main/docs/FAQ.md)