# 🚀 Fine-Tuning Qwen3-0.6B with Unsloth + QLoRA

This notebook is a complete **step-by-step guide** to fine-tuning the model.

---

## STEP 0: Connecting to Google Colab

### Option A: Directly in the Browser (Recommended for Beginners)

1. **Open Google Colab**: https://colab.research.google.com
2. **Upload this notebook**: File → Upload notebook → select `training.ipynb`
3. **Select a GPU runtime**:
   - Click **Runtime** → **Change runtime type**
   - Set **Hardware accelerator** to **T4 GPU**
   - Click **Save**
4. **Run the cells one at a time**, from top to bottom

---

### Option B: From VS Code (Advanced)

**Prerequisites:**
- VS Code with the official **Google Colab** extension
- The **Jupyter** extension from Microsoft

**Connection steps:**

1. **Install the extension**:
   ```
   Ctrl+Shift+X → Search "Google Colab" → Install (Publisher: Google)
   ```

2. **Open this notebook in VS Code**

3. **Select Kernel** (top-right corner):
   - Click **Select Kernel**
   - Choose **Google Colab**
   - Choose **New Colab Server**

4. **Choose the hardware**:
   - Select **GPU - T4** (free tier)
   - Click **Connect**

5. **Authenticate**:
   - A browser window opens for Google login
   - Allow access
   - Copy the authorization code
   - Paste it into VS Code

6. **Verify**: the status bar should show **Connected to Colab**

---

### ⚠️ Known Issues (VS Code Extension)

- `drive.mount()` is **NOT AVAILABLE**: use `files.upload()` instead
- `userdata.get()` is **NOT AVAILABLE**: hardcode secrets temporarily
- The session times out after ~90 minutes of idle time
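
Hardcoded secrets are easy to leak when a notebook is shared. As one possible alternative to the `userdata.get()` workaround above, the sketch below reads a secret from an environment variable and only prompts when it is missing. The helper name `get_secret` and the `prompt_fn` parameter are illustrative assumptions, not part of this project.

```python
import os
from getpass import getpass


def get_secret(name: str, prompt_fn=getpass) -> str:
    """Return a secret from the environment, prompting only if it is missing."""
    value = os.environ.get(name)
    if not value:
        # Interactive fallback: the value is typed at runtime,
        # so it is never written into the notebook source.
        value = prompt_fn(f"Enter {name}: ")
        os.environ[name] = value  # cache for later cells in this session
    return value
```

A later cell could then call `HF_TOKEN = get_secret("HF_TOKEN")` instead of pasting the token into the code.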

---

## 📋 PLANNING OVERVIEW

**Total Steps**: 15 (plus Step 0 for the connection)

| Step | Cell | Description | Time |
|------|------|-------------|------|
| 0 | Connection | Connect to Colab + GPU | ~2 min |
| 1 | Environment Setup | Set up cache & env vars | ~5 sec |
| 2 | Install Dependencies | Install libraries + Unsloth (T4) | ~2-3 min |
| 3 | Verify GPU | Check that the T4 GPU is available | ~5 sec |
| 4 | Upload Files | Upload `upload_package.zip` (src.zip + dataset) | Manual |
| 5 | Configuration | Set model, paths, hyperparams | ~5 sec |
| 6 | Pre-Download Model | Download the model to the cache | ~1-2 min |
| 7 | Load & Split Dataset | Split 80/10/10 | ~30 sec |
| 8 | Load Model + LoRA | Load Qwen3 + apply LoRA | ~1-2 min |
| 9 | Setup Trainer | Configure training args | ~5 sec |
| 10 | Training | Run training loop | ~30-60 min |
| 11 | Evaluation | Final validation | ~5 min |
| 12 | Test Model | Test inference | ~1 min |
| 13 | Merge LoRA | Merge adapters into the base model | ~2-3 min |
| 14 | Convert GGUF | Convert + quantize | ~5-10 min |
| 15 | Download GGUF | Download the GGUF file | ~2-5 min |

**Total Estimated Time**: ~60-90 min (depends on dataset size)

**Final Output**: `model-q4_k_m.gguf`, ready to use with Ollama/LM Studio

---

## 📁 Files to Upload

1. **`src.zip`** - Zip of the `src/` folder (training modules)
2. **`train_data.jsonl`** - Dataset in JSONL format
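
Step 9 passes `dataset_text_field='text'` to the trainer, so each JSONL line is presumably a JSON object with a string `text` field. A quick local check before uploading might look like the sketch below; `validate_jsonl` is a hypothetical helper, and the required field name is an assumption you should adjust to your own data.

```python
import json


def validate_jsonl(path: str, required_field: str = "text") -> int:
    """Count valid records; raise ValueError on the first malformed line."""
    count = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
            # Each record must be an object holding a string under required_field.
            if not isinstance(record, dict) or not isinstance(record.get(required_field), str):
                raise ValueError(f"line {lineno}: missing string field '{required_field}'")
            count += 1
    return count
```

Running `validate_jsonl("train_data.jsonl")` locally reports the first bad line before any Colab time is spent.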

### 📋 How to create src.zip:
```bash
cd fine-tuning-project
zip -r src.zip src/
```

---

**Model**: `Qwen/Qwen3-0.6B`  
**GPU**: Google Colab T4 (16GB)  
**Technique**: QLoRA (4-bit quantization + LoRA)

---
## 📦 Step 1: Environment Setup

**What this does:**
- Sets up the HuggingFace cache directory
- Sets environment variables
- Avoids re-downloading the model on every run within the same session

⏱️ **Time**: ~5 sec
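
The setup cell hardcodes `/content/hf_cache`. As a sketch, the same idea can be wrapped in a function that takes the base directory as a parameter, which makes it easy to reuse outside Colab. `configure_hf_cache` is an illustrative name; `HF_HOME` and `HF_HUB_CACHE` are the environment variables the cell itself sets.

```python
import os
from pathlib import Path


def configure_hf_cache(base_dir: str) -> dict:
    """Point the Hugging Face caches at base_dir and create the folders."""
    base = Path(base_dir)
    env = {
        "HF_HOME": str(base),
        "HF_HUB_CACHE": str(base / "hub"),
    }
    for name, value in env.items():
        os.environ[name] = value
        Path(value).mkdir(parents=True, exist_ok=True)
    return env
```

Either way, the variables need to be set before any Hugging Face library is imported, because those libraries read them at import time.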

In [None]:
# ===== ENVIRONMENT SETUP =====
# Run this cell FIRST, before installing dependencies

import os

# Set HuggingFace cache directory (persists for the duration of the session)
os.environ['HF_HOME'] = '/content/hf_cache'
# TRANSFORMERS_CACHE is a legacy variable; newer transformers releases read HF_HOME
os.environ['TRANSFORMERS_CACHE'] = '/content/hf_cache/transformers'
os.environ['HF_HUB_CACHE'] = '/content/hf_cache/hub'

# Create cache directories
os.makedirs('/content/hf_cache', exist_ok=True)
os.makedirs('/content/hf_cache/transformers', exist_ok=True)
os.makedirs('/content/hf_cache/hub', exist_ok=True)

# Create output directories
os.makedirs('/content/outputs', exist_ok=True)
os.makedirs('/content/outputs/checkpoints', exist_ok=True)

print('✅ Environment configured!')
print(f'📁 HF Cache: {os.environ["HF_HOME"]}')
print('📁 Outputs: /content/outputs')

---
## 📦 Step 2: Install Dependencies

**What this does:**
- Installs PyTorch, Transformers, PEFT, TRL
- Installs Unsloth (optimized for the T4 GPU)
- Installs monitoring tools (tensorboard, pynvml)

⏱️ **Time**: ~2-3 min

⚠️ **Note**: Dependency warnings will appear during installation; they are expected and can be ignored.

In [None]:
# ===== INSTALL CORE DEPENDENCIES =====
!pip install -q torch transformers accelerate bitsandbytes peft trl \
    datasets sentencepiece protobuf huggingface-hub wandb tensorboard \
    psutil pynvml pyyaml tqdm numpy

# ===== INSTALL UNSLOTH (T4 GPU COMPATIBLE) =====
# The T4 is an older (Turing, pre-Ampere) GPU, so it needs this install variant
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Re-install these packages without dependency resolution to avoid conflicts (important on T4!)
!pip install -q --no-deps trl peft accelerate bitsandbytes

# Flash Attention (optional - uncomment to try it)
# !pip install -q flash-attn --no-build-isolation

print('\n' + '='*50)
print('✅ All dependencies installed!')
print('✅ Unsloth ready (T4 GPU compatible)')
print('='*50)

---
## 🎮 Step 3: Verify GPU

**What this does:**
- Checks whether the T4 GPU is available
- Verifies the CUDA and PyTorch versions

⏱️ **Time**: ~5 sec

⚠️ **If no GPU is available:**
1. Click **Runtime** → **Change runtime type**
2. Select **GPU** → **T4**
3. Restart the runtime

In [None]:
# ===== VERIFY GPU =====
import torch

print('üéÆ GPU Verification')
print('='*50)
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')

if torch.cuda.is_available():
    print(f'CUDA version: {torch.version.cuda}')
    print(f'Device name: {torch.cuda.get_device_name(0)}')
    print(f'Device count: {torch.cuda.device_count()}')
    print('\nüìä GPU Details:')
    !nvidia-smi
    print('\n‚úÖ GPU ready for training!')
else:
    print('\n‚ö†Ô∏è GPU NOT AVAILABLE!')
    print('üëâ Go to: Runtime ‚Üí Change runtime type ‚Üí GPU ‚Üí T4')
    print('üëâ Then restart runtime and run from Cell 1')

---
## 📤 Step 4: Upload Project Files

**Upload file**: `upload_package.zip`

This archive contains:
- `src.zip` - training modules
- `train_data.jsonl` - your dataset
- `training.ipynb` - the notebook (optional)

⏱️ **Time**: ~1-2 min (depends on dataset size)

### 📋 How to create upload_package.zip:
```bash
cd fine-tuning-project
python3 scripts/package_for_upload.py
```
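
The upload cell unpacks the package, then the nested `src.zip`, then looks for a `.jsonl` file. The sketch below expresses that flow as one reusable function and adds a check that no archive entry escapes the destination folder. `extract_package` is a hypothetical helper, not part of the project's `src/` modules.

```python
import zipfile
from pathlib import Path


def extract_package(package: str, dest: str) -> list:
    """Extract package (and a nested src.zip, if present) into dest.

    Returns the .jsonl files found directly under dest.
    """
    dest_path = Path(dest).resolve()

    def safe_extract(zip_path: Path) -> None:
        with zipfile.ZipFile(zip_path) as zf:
            for member in zf.namelist():
                target = (dest_path / member).resolve()
                # Refuse entries such as ../../x that would land outside dest.
                if dest_path != target and dest_path not in target.parents:
                    raise ValueError(f"unsafe path in archive: {member}")
            zf.extractall(dest_path)

    safe_extract(Path(package))
    nested = dest_path / "src.zip"
    if nested.exists():
        safe_extract(nested)
    return sorted(str(p) for p in dest_path.glob("*.jsonl"))
```

In Colab this could be called as `extract_package(uploaded_file, '.')`, taking the first returned path as `DATASET_PATH`.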

In [None]:
# ===== UPLOAD PACKAGE =====
from google.colab import files
import zipfile
import os

print('📤 Upload file: upload_package.zip')
print('='*60)
print('This file is created with: python3 scripts/package_for_upload.py')
print()

# Default value so later cells do not raise NameError if the upload is not a zip
DATASET_PATH = None

uploaded = files.upload()

# Get uploaded filename
uploaded_file = list(uploaded.keys())[0]
print(f'\n📦 Uploaded: {uploaded_file}')

# Extract upload_package.zip
if uploaded_file.endswith('.zip'):
    print(f'\n📂 Extracting {uploaded_file}...')
    with zipfile.ZipFile(uploaded_file, 'r') as zip_ref:
        zip_ref.extractall('.')

    # Check if src.zip exists and extract it
    if os.path.exists('src.zip'):
        print('\n📂 Extracting src.zip...')
        with zipfile.ZipFile('src.zip', 'r') as zip_ref:
            zip_ref.extractall('.')
        print('   ✅ src/ extracted!')

    # Find dataset file
    dataset_files = [f for f in os.listdir('.') if f.endswith('.jsonl')]
    if dataset_files:
        DATASET_PATH = dataset_files[0]
        print(f'   ✅ Dataset found: {DATASET_PATH}')
    else:
        print('   ⚠️ No .jsonl file found!')

    print('\n📁 Extracted files:')
    !ls -la
else:
    print('⚠️ Expected a .zip file!')

In [None]:
# ===== VERIFY EXTRACTION =====
print('📋 Verification')
print('='*60)

# Check src/
if os.path.exists('src'):
    print('✅ src/ folder found')
    !ls src/
else:
    print('❌ src/ folder NOT found!')

# Check dataset
print(f'\n📊 Dataset: {DATASET_PATH}')
if DATASET_PATH and os.path.exists(DATASET_PATH):
    size_kb = os.path.getsize(DATASET_PATH) / 1024
    print(f'   Size: {size_kb:.1f} KB')
    print('\n📋 Preview (first 3 lines):')
    !head -3 {DATASET_PATH}
else:
    print('❌ Dataset NOT found!')

---
## ⚙️ Step 5: Configuration

**What this does:**
- Sets the model name and paths
- Imports the custom modules from src/
- Configures hyperparameters

⏱️ **Time**: ~5 sec

### 🔧 Parameters you can change:
- `MODEL_NAME`: Model from HuggingFace
- `NUM_EPOCHS`: Number of training epochs
- `VRAM_GB`: GPU VRAM (T4 = 16GB)

In [None]:
# ===== CONFIGURATION =====
import sys
import os
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments, DataCollatorForLanguageModeling

# Add src to path
sys.path.insert(0, '.')

# Import custom modules
from src.data.dataset_analyzer import analyze_dataset_and_configure
from src.data.dataset_splitter import split_dataset, analyze_split_distribution
from src.training.mixed_precision import setup_mixed_precision
from src.training.callbacks import (
    VRAMMonitorCallback, 
    DynamicConfigCallback,
    EarlyStoppingCallback,
    ValidationLossLoggerCallback,
)
from src.training.metrics import compute_perplexity_only
from src.models.lora_config import get_dynamic_lora_config

print('✅ Custom modules imported!')

# ===== CONFIGURATION (EDIT AS NEEDED) =====
MODEL_NAME = 'Qwen/Qwen3-0.6B'  # Model from HuggingFace
OUTPUT_DIR = '/content/outputs'  # Directory for saved models
VRAM_GB = 16.0  # T4 GPU VRAM
NUM_EPOCHS = 3  # Number of training epochs

# Dataset path (from the earlier upload)
# DATASET_PATH was already set in the upload cell

print('\n📋 Configuration:')
print(f'   Model: {MODEL_NAME}')
print(f'   Output: {OUTPUT_DIR}')
print(f'   VRAM: {VRAM_GB}GB')
print(f'   Epochs: {NUM_EPOCHS}')
print(f'   Dataset: {DATASET_PATH}')

---
## 📥 Step 6: Pre-Download Model (Optional)

**What this does:**
- Downloads the model to the cache before loading
- Confirms the download succeeds before training starts
- Skips the download if the model is already cached

⏱️ **Time**: ~1-2 min (first time)

⚠️ **Skip this cell** if the model was already downloaded earlier in this session.

In [None]:
# ===== PRE-DOWNLOAD MODEL (OPTIONAL) =====
# Skip this cell if the model is already in the cache

from huggingface_hub import snapshot_download
import os

print(f'📥 Pre-downloading model: {MODEL_NAME}')
print('='*50)

try:
    cache_path = snapshot_download(
        repo_id=MODEL_NAME,
        cache_dir=os.environ.get('HF_HUB_CACHE', '/content/hf_cache/hub'),
        # Skip docs only; *.txt is kept because tokenizer files such as merges.txt may use it
        ignore_patterns=['*.md', '*.rst']
    )
    print(f'\n✅ Model cached to: {cache_path}')
except Exception as e:
    print(f'\n⚠️ Pre-download skipped: {e}')
    print('💡 The model will be downloaded automatically when it is loaded')

---
## 📊 Step 7: Load & Split Dataset

**What this does:**
- Loads the dataset from the JSONL file
- Splits it into Train/Validation/Test (80/10/10)
- Analyzes the token distribution

⏱️ **Time**: ~30 sec (depends on dataset size)

### 📋 Split Result:
- **Train (80%)**: used for training
- **Validation (10%)**: used for evaluation every N steps
- **Test (10%)**: DO NOT TOUCH until training is finished!
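
`split_dataset` comes from the project's `src/` package and is not shown in this notebook. To illustrate what a seeded 80/10/10 split involves, here is a minimal stand-in that works on row indices only. The function name `split_indices` and its rounding behavior are assumptions, not the project's actual implementation.

```python
import random


def split_indices(n: int, train_ratio: float = 0.80, val_ratio: float = 0.10, seed: int = 42) -> dict:
    """Shuffle range(n) deterministically and cut it into train/validation/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # same seed, same order
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    return {
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],  # remainder, so no row is dropped
    }
```

Fixing the seed matters here: it keeps the held-out test rows identical across reruns, so they never leak into training.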

In [None]:
# ===== LOAD & SPLIT DATASET =====
print(f'üì• Loading dataset: {DATASET_PATH}')
print('='*50)

# Load dataset
full_dataset = load_dataset('json', data_files={'train': DATASET_PATH}, split='train')
print(f'Total samples: {len(full_dataset):,}')

# Split dataset (80/10/10)
dataset_dict = split_dataset(
    full_dataset,
    train_ratio=0.80,
    val_ratio=0.10,
    test_ratio=0.10,
    seed=42
)

# Save test set (DO NOT TOUCH until training is finished!)
test_dataset_path = f'{OUTPUT_DIR}/test_dataset.json'
dataset_dict['test'].to_json(test_dataset_path)
print(f'\n✅ Test dataset saved: {test_dataset_path}')
print('⚠️ DO NOT use test set until training is fully complete!')

---
## 🔧 Step 8: Load Model + Apply LoRA

**What this does:**
- Sets up mixed precision (bf16/fp16)
- Loads the tokenizer and analyzes the dataset
- Loads the model with Unsloth (2x faster)
- Applies the LoRA adapters

⏱️ **Time**: ~1-2 min
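
`setup_mixed_precision` is another project module that is not shown here. A plausible rule, sketched below as an assumption, is to pick bf16 on GPUs with CUDA compute capability 8.0 or higher and fall back to fp16 otherwise. The T4 is a Turing GPU with compute capability 7.5, so under this rule it would use fp16. The name `choose_precision` is hypothetical.

```python
def choose_precision(compute_capability: tuple) -> dict:
    """Pick mixed-precision flags from a CUDA compute capability (major, minor)."""
    major, _minor = compute_capability
    bf16 = major >= 8  # Ampere (8.x) and newer support bfloat16 natively
    return {"bf16": bf16, "fp16": not bf16, "mode": "bf16" if bf16 else "fp16"}
```

In a live session the tuple can be read with `torch.cuda.get_device_capability(0)`.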

### 🧠 Dynamic Config:
Batch size and gradient accumulation are adjusted automatically based on token length!
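
The actual logic lives in `analyze_dataset_and_configure`, which is not shown. To make the idea concrete, the sketch below uses a purely hypothetical heuristic: cap the tokens per device batch at a budget, then use gradient accumulation to reach a target effective batch size. The function name, the 8192-token budget, and the target of 16 are all illustrative assumptions; only the returned key names mirror those used later in this notebook.

```python
def suggest_batch_config(max_tokens: int, target_effective_batch: int = 16,
                         token_budget: int = 8192) -> dict:
    """Pick a per-device batch size that keeps batch * tokens under a budget."""
    per_device = max(1, min(target_effective_batch, token_budget // max(1, max_tokens)))
    # Ceiling division, so the effective batch never falls below the target.
    grad_accum = -(-target_effective_batch // per_device)
    return {
        "per_device_train_batch_size": per_device,
        "gradient_accumulation_steps": grad_accum,
        "effective_batch_size": per_device * grad_accum,
    }
```

The trade-off is that longer sequences get a smaller per-device batch and more accumulation steps, which keeps the effective batch size roughly constant while limiting peak memory.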

In [None]:
# ===== SETUP MIXED PRECISION =====
bf16_support, fp16_support, precision_mode = setup_mixed_precision()
print(f'üìä Precision mode: {precision_mode}')

# ===== LOAD TOKENIZER & ANALYZE DATASET =====
print(f'\nüìù Loading tokenizer: {MODEL_NAME}')
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    print('   Added [PAD] token')

# Analyze dataset dan generate dynamic config
train_dataset, dynamic_config = analyze_dataset_and_configure(
    dataset_dict['train'], 
    tokenizer, 
    max_length=32768, 
    vram_gb=VRAM_GB
)

# Analyze distribution per split
analyze_split_distribution(dataset_dict, tokenizer)

In [None]:
# ===== LOAD MODEL WITH UNSLOTH =====
from unsloth import FastLanguageModel

print(f'🔥 Loading model: {MODEL_NAME}')
print('='*50)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=dynamic_config['max_seq_length'],
    dtype=torch.bfloat16 if bf16_support else None,
    load_in_4bit=True,
    device_map='auto'
)

# Resize embeddings in case tokens were added
model.resize_token_embeddings(len(tokenizer))

print('\n✅ Model loaded with Unsloth!')

In [None]:
# ===== APPLY LoRA =====
lora_config = get_dynamic_lora_config(MODEL_NAME, dynamic_config['max_seq_length'])

model = FastLanguageModel.get_peft_model(
    model,
    r=lora_config['r'],
    lora_alpha=lora_config['lora_alpha'],
    target_modules=lora_config['target_modules'],
    lora_dropout=lora_config['lora_dropout'],
    bias=lora_config['bias'],
    use_gradient_checkpointing=dynamic_config['use_gradient_checkpointing'],
    use_rslora=lora_config['use_rslora'],
    random_state=3407
)

print('\n✅ LoRA applied!')
print(f'   r: {lora_config["r"]}')
print(f'   alpha: {lora_config["lora_alpha"]}')
print(f'   Gradient checkpointing: {dynamic_config["use_gradient_checkpointing"]}')

---
## 🎯 Step 9: Setup Trainer

**What this does:**
- Configures the training arguments
- Sets up callbacks (VRAM monitor, early stopping)
- Creates the SFTTrainer

⏱️ **Time**: ~5 sec

### 📋 Training Features:
- ✅ Evaluation every 100 steps
- ✅ Auto-save checkpoints
- ✅ VRAM monitoring
- ✅ Early stopping
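
The project's `EarlyStoppingCallback` takes `patience` and `min_delta` arguments, but its code is not part of this notebook. The sketch below shows the usual patience-based rule those names suggest: stop once the eval loss has failed to improve by at least `min_delta` for `patience` evaluations in a row. `EarlyStopper` is a hypothetical stand-in, not the project's class.

```python
class EarlyStopper:
    """Signal a stop after `patience` evaluations without sufficient improvement."""

    def __init__(self, patience: int = 5, min_delta: float = 0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, eval_loss: float) -> bool:
        if eval_loss < self.best - self.min_delta:
            self.best = eval_loss  # meaningful improvement: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

With evaluation every 100 steps and `patience=5`, this rule would tolerate up to 500 steps without improvement before stopping.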

In [None]:
# ===== TRAINING ARGUMENTS =====
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=8
)

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=dynamic_config['per_device_train_batch_size'],
    gradient_accumulation_steps=dynamic_config['gradient_accumulation_steps'],
    
    # Learning rate & schedule
    learning_rate=2e-5,
    num_train_epochs=NUM_EPOCHS,
    lr_scheduler_type='cosine',
    warmup_ratio=0.1,
    
    # Mixed precision
    bf16=bf16_support,
    fp16=fp16_support,
    
    # Optimizer
    optim='paged_adamw_8bit',
    weight_decay=0.01,
    max_grad_norm=1.0,
    
    # Gradient checkpointing
    gradient_checkpointing=dynamic_config['use_gradient_checkpointing'],
    gradient_checkpointing_kwargs={'use_reentrant': False},
    
    # Evaluation
    eval_strategy='steps',
    eval_steps=100,
    per_device_eval_batch_size=2,
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    
    # Logging & Saving
    logging_steps=10,
    save_strategy='steps',
    save_steps=100,
    save_total_limit=3,
    report_to=['tensorboard'],
    
    # Performance
    dataloader_num_workers=2,
    dataloader_pin_memory=True,
)

print('✅ Training arguments configured!')
print(f'   Batch size: {dynamic_config["per_device_train_batch_size"]}')
print(f'   Gradient accumulation: {dynamic_config["gradient_accumulation_steps"]}')
print(f'   Effective batch: {dynamic_config["effective_batch_size"]}')

In [None]:
# ===== SETUP CALLBACKS =====
callbacks = [
    VRAMMonitorCallback(threshold_percent=95.0),
    DynamicConfigCallback(dynamic_config),
    EarlyStoppingCallback(patience=5, min_delta=0.001),
    ValidationLossLoggerCallback(),
]

print('✅ Callbacks configured:')
for cb in callbacks:
    print(f'   - {cb.__class__.__name__}')

In [None]:
# ===== CREATE TRAINER =====
from trl import SFTTrainer

# Note: depending on the installed TRL version, dataset_text_field, max_seq_length
# and packing may need to be passed via trl.SFTConfig instead of SFTTrainer.
# Adjust this call if it raises a TypeError.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset_dict['train'],
    eval_dataset=dataset_dict['validation'],
    data_collator=data_collator,
    callbacks=callbacks,
    compute_metrics=compute_perplexity_only,
    dataset_text_field='text',
    max_seq_length=dynamic_config['max_seq_length'],
    packing=False,
)

print('\n✅ Trainer created!')
print('\n📋 Ready for training. Run next cell to start!')

---
## 🚀 Step 10: Start Training

**What this does:**
- Runs the training loop
- Logs metrics to TensorBoard
- Auto-saves checkpoints

⏱️ **Time**: ~30-60 min (depends on dataset)

### 📊 Monitor:
- Loss should decrease gradually
- Eval loss should track training loss; a widening gap suggests overfitting
- VRAM usage is monitored automatically

In [None]:
# ===== START TRAINING! =====
print('\n' + '='*80)
print('🚀 STARTING TRAINING')
print('='*80)
print('\n📋 Training config:')
print(f'   Epochs: {NUM_EPOCHS}')
print(f'   Train samples: {len(dataset_dict["train"]):,}')
print(f'   Eval samples: {len(dataset_dict["validation"]):,}')
print(f'   Max seq length: {dynamic_config["max_seq_length"]}')
print('\n' + '-'*80)

train_result = trainer.train()

print('\n' + '='*80)
print('✅ TRAINING COMPLETED!')
print('='*80)

---
## 📊 Step 11: Final Evaluation

**What this does:**
- Runs the final validation
- Calculates perplexity
- Displays training stats

⏱️ **Time**: ~5 min
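
Perplexity is the exponential of the mean per-token cross-entropy loss, so it can be derived from `eval_loss` even if the metrics function does not report it. The helper below is a minimal sketch of that relationship; the name `perplexity` is illustrative, and it assumes the loss is measured in nats, as is usual for PyTorch cross-entropy.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity is exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)
```

A loss of 0 gives a perplexity of 1 (a perfectly confident model), and every additional nat of loss multiplies perplexity by e.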

In [None]:
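# ===== FINAL VALIDATION =====
import math

print('📊 Running final validation...')
val_results = trainer.evaluate()

eval_loss = val_results.get('eval_loss')
eval_ppl = val_results.get('eval_perplexity')
# Fall back to exp(loss) if the metrics function did not report perplexity
if eval_ppl is None and eval_loss is not None:
    eval_ppl = math.exp(eval_loss)

# Format defensively: applying ':.4f' to a missing value would raise an error
loss_text = f'{eval_loss:.4f}' if eval_loss is not None else 'N/A'
ppl_text = f'{eval_ppl:.4f}' if eval_ppl is not None else 'N/A'

print('\n' + '='*60)
print('📊 FINAL VALIDATION RESULTS')
print('='*60)
print(f'   Validation Loss: {loss_text}')
print(f'   Validation Perplexity: {ppl_text}')
print('='*60)

print('\n📊 Training Stats:')
print(f'   Total steps: {train_result.global_step}')
print(f'   Training loss: {train_result.training_loss:.4f}')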

---
## 🧪 Step 12: Test Model

**What this does:**
- Tests inference with sample prompts
- Verifies that the model works correctly
- Checks output quality

⏱️ **Time**: ~1 min

In [None]:
# ===== TEST INFERENCE =====
print('🧪 Testing model inference...')
print('='*60)

# Switch to inference mode
FastLanguageModel.for_inference(model)

# Test prompts (Indonesian: "Build a simple to-do list app", "I need an API for e-commerce")
test_prompts = [
    'Buatkan aplikasi todo list sederhana',
    'Saya butuh API untuk e-commerce',
]

for i, prompt in enumerate(test_prompts, 1):
    print(f'\n--- Test {i} ---')
    print(f'📝 Prompt: {prompt}')

    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print('🤖 Response:')
    # Truncate long responses to 500 characters for display
    print((response[:500] + '...') if len(response) > 500 else response)

print('\n' + '='*60)
print('✅ Model test completed!')

---
## 💾 Step 13: Save & Merge LoRA Adapters

**What this does:**
- Saves the model with its LoRA adapters
- Merges the LoRA adapters into the base model (for GGUF conversion)

⏱️ **Time**: ~2-3 min

⚠️ **Note**: Merging is required before converting to GGUF

In [None]:
# ===== SAVE MODEL WITH LORA =====
FINAL_MODEL_PATH = f'{OUTPUT_DIR}/final_model'
MERGED_MODEL_PATH = f'{OUTPUT_DIR}/merged_model'

print('💾 Saving model with LoRA adapters...')
trainer.save_model(FINAL_MODEL_PATH)
tokenizer.save_pretrained(FINAL_MODEL_PATH)
print(f'   ✅ LoRA model saved to: {FINAL_MODEL_PATH}')

# Merge LoRA into the base model
print('\n🔀 Merging LoRA adapters to base model...')

# VOCAB FIX: the Qwen3-0.6B tokenizer vocabulary has 151669 entries,
# but the loaded embedding matrix is padded to 151936 rows.
if '0.6B' in MODEL_NAME:
    print('🔧 Applying Vocab Fix: Resizing to 151669...')
    model.resize_token_embeddings(151669)

model.save_pretrained_merged(
    MERGED_MODEL_PATH,
    tokenizer,
    save_method='merged_16bit'
)
print(f'   ✅ Merged model saved to: {MERGED_MODEL_PATH}')

print('\n📁 Merged model files:')
!ls -lh {MERGED_MODEL_PATH}

---
## 🔄 Step 14: Convert to GGUF

**What this does:**
- Installs llama.cpp
- Converts the model to GGUF format
- Quantizes to Q4_K_M (a good size/quality balance)

⏱️ **Time**: ~5-10 min

### 📋 Quantization Options:
| Type | Size | Quality | Use Case |
|------|------|---------|----------|
| Q4_K_M | ~400MB | Good | ✅ **Recommended** |
| Q5_K_M | ~500MB | Better | High quality |
| Q8_0 | ~700MB | Best | Maximum quality |
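
The sizes in the table can be sanity-checked with a back-of-the-envelope estimate: parameter count times average bits per weight, divided by 8. The bits-per-weight values below are rough assumptions (Q8_0 stores 32 8-bit weights plus a 16-bit scale, giving 8.5; the K-quant figures are approximate averages), and the 0.6 billion parameter count is taken from the model name.

```python
# Approximate average bits per weight; rough assumptions, not exact figures.
APPROX_BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.9}


def estimate_gguf_mb(n_params: float, quant: str) -> float:
    """Estimate file size in MB (10**6 bytes), ignoring metadata overhead."""
    return n_params * APPROX_BITS_PER_WEIGHT[quant] / 8 / 1e6
```

Real files may come out somewhat larger than this estimate, since metadata is ignored and some tensors may be stored at higher precision than the nominal type.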

In [None]:
# ===== INSTALL LLAMA.CPP =====
print('📦 Installing llama.cpp...')
print('='*60)

# Clone llama.cpp
!git clone --depth 1 https://github.com/ggerganov/llama.cpp /content/llama.cpp

# Install Python requirements
!pip install -q /content/llama.cpp

print('\n✅ llama.cpp installed!')

In [None]:
# ===== CONVERT TO GGUF =====
import os

GGUF_OUTPUT = f'{OUTPUT_DIR}/model.gguf'
GGUF_QUANTIZED = f'{OUTPUT_DIR}/model-q4_k_m.gguf'

print('üîÑ Converting to GGUF format...')
print('='*60)

# Convert to GGUF (f16)
!python /content/llama.cpp/convert_hf_to_gguf.py \
    {MERGED_MODEL_PATH} \
    --outfile {GGUF_OUTPUT} \
    --outtype f16

if os.path.exists(GGUF_OUTPUT):
    size_mb = os.path.getsize(GGUF_OUTPUT) / (1024 * 1024)
    print(f'\n✅ GGUF created: {GGUF_OUTPUT}')
    print(f'   Size: {size_mb:.1f} MB')
else:
    print('❌ GGUF conversion failed!')

In [None]:
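# ===== QUANTIZE TO Q4_K_M =====
print('📉 Quantizing to Q4_K_M...')
print('='*60)

# Build the llama.cpp quantize tool with CMake (current llama.cpp no longer builds via `make`)
# LLAMA_CURL=OFF avoids needing libcurl headers; curl is not used for quantization
!cmake -S /content/llama.cpp -B /content/llama.cpp/build -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=OFF
!cmake --build /content/llama.cpp/build --target llama-quantize -j

# Quantize (CMake places the binary under build/bin/)
!/content/llama.cpp/build/bin/llama-quantize {GGUF_OUTPUT} {GGUF_QUANTIZED} Q4_K_M

if os.path.exists(GGUF_QUANTIZED):
    size_mb = os.path.getsize(GGUF_QUANTIZED) / (1024 * 1024)
    print(f'\n✅ Quantized GGUF created: {GGUF_QUANTIZED}')
    print(f'   Size: {size_mb:.1f} MB')
    print('\n📊 Size comparison:')
    !ls -lh {OUTPUT_DIR}/*.gguf
else:
    print('❌ Quantization failed! Using unquantized version.')
    GGUF_QUANTIZED = GGUF_OUTPUT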

---
## 📥 Step 15: Download GGUF Model

**What this does:**
- Downloads the quantized GGUF file
- The file is ready to use with llama.cpp, Ollama, LM Studio, and similar tools

⏱️ **Time**: ~2-5 min (depends on file size)

### 📋 After Downloading:
1. The file lands in your **Downloads** folder
2. Move it to the `outputs/` folder of your local project
3. Run it with Ollama/LM Studio
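
The `ollama create mymodel -f Modelfile` command printed by the download cell expects a `Modelfile` next to the GGUF file. As a sketch, the helper below writes the smallest useful one, a single `FROM` line pointing at the local GGUF path. `write_modelfile` is a hypothetical helper; prompt templates and sampling parameters are omitted here and may be needed for good chat behavior.

```python
from pathlib import Path


def write_modelfile(gguf_path: str, out_dir: str = ".") -> Path:
    """Write a minimal Ollama Modelfile that points at a local GGUF file."""
    modelfile = Path(out_dir) / "Modelfile"
    modelfile.write_text(f"FROM {gguf_path}\n", encoding="utf-8")
    return modelfile
```

After calling `write_modelfile("./model-q4_k_m.gguf")` in the folder that holds the GGUF file, `ollama create mymodel -f Modelfile` followed by `ollama run mymodel` should load the model.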

In [None]:
# ===== DOWNLOAD GGUF =====
from google.colab import files
import os

print('üì• Preparing download...')
print('='*60)

# Check which file to download
if os.path.exists(GGUF_QUANTIZED):
    download_file = GGUF_QUANTIZED
    print(f'üì¶ Downloading quantized model (Q4_K_M)...')
else:
    download_file = GGUF_OUTPUT
    print(f'üì¶ Downloading unquantized model (F16)...')

size_mb = os.path.getsize(download_file) / (1024 * 1024)
print(f'   File: {os.path.basename(download_file)}')
print(f'   Size: {size_mb:.1f} MB')
print('\n‚è≥ Starting download (this may take a few minutes)...\n')

files.download(download_file)

print('\n' + '='*60)
print('‚úÖ GGUF MODEL DOWNLOADED!')
print('='*60)
print('\nüìã Cara menggunakan:')
print('   1. Ollama: ollama create mymodel -f Modelfile')
print('   2. LM Studio: Import model dari file GGUF')
print('   3. llama.cpp: ./llama-cli -m model-q4_k_m.gguf -p "prompt"')

---
## 📦 (Optional) Download LoRA Adapters Only

To download only the LoRA adapters (smaller, ~50MB):

In [None]:
# ===== DOWNLOAD LORA ONLY (OPTIONAL) =====
# Uncomment to download only the LoRA adapters

# from google.colab import files
# import shutil

# print('📦 Creating LoRA zip archive...')
# shutil.make_archive('lora_adapters', 'zip', FINAL_MODEL_PATH)

# print('📥 Downloading LoRA adapters...')
# files.download('lora_adapters.zip')

# print('✅ LoRA adapters downloaded!')
# print('💡 To use them, merge with the base model locally')

---
## 🌐 (Optional) Upload to HuggingFace Hub

In [None]:
# ===== UPLOAD TO HUGGINGFACE (OPTIONAL) =====
# Uncomment to upload to the HuggingFace Hub

# from huggingface_hub import login, HfApi

# # Log in with your token
# HF_TOKEN = 'hf_your_token_here'  # Replace with your token
# login(token=HF_TOKEN)

# # Upload GGUF file
# api = HfApi()
# REPO_NAME = 'your-username/qwen3-0.6b-finetuned-gguf'

# api.create_repo(repo_id=REPO_NAME, exist_ok=True)
# api.upload_file(
#     path_or_fileobj=GGUF_QUANTIZED,
#     path_in_repo='model-q4_k_m.gguf',
#     repo_id=REPO_NAME
# )

# print(f'✅ GGUF uploaded to: https://huggingface.co/{REPO_NAME}')

---
## 🔌 Cleanup & Disconnect

In [None]:
# ===== CLEANUP =====
import gc
gc.collect()
torch.cuda.empty_cache()

print('✅ Cache cleared!')
print('\n' + '='*60)
print('🎉 TRAINING & EXPORT COMPLETE!')
print('='*60)
print('\n📋 Summary:')
print(f'   Model: {MODEL_NAME}')
print(f'   Training epochs: {NUM_EPOCHS}')
print('   GGUF file: model-q4_k_m.gguf')
print('\n📌 Next Steps:')
print('   1. Move the GGUF file to your local outputs/ folder')
print('   2. Import it into Ollama/LM Studio')
print('   3. Test it with prompts')

# Terminate runtime (uncomment to use)
# from google.colab import runtime
# runtime.unassign()