# Phase 1A: LLAMA-3.2-8B QLoRA Training

**Project:** Cogumi-LLM  
**Phase:** 1A - Base Model Distillation  
**Duration:** 36-48 hours  
**GPU Required:** A100 40GB  

---

## Setup Instructions

1. **Select Runtime**: Runtime → Change runtime type → A100 GPU
2. **Connect to GPU**: Click Connect in top-right
3. **Run all cells sequentially**
4. **Monitor training**: Check TensorBoard and logs

⚠️ **Important**: Colab Pro+ allows up to 24 hours per session. Training takes 36-48 hours, so you'll need to resume from checkpoint.

## 📋 Best Practices for Long-Running Tasks

**Background Execution**: For verification and monitoring tasks, use `nohup` to run them in the background:
```bash
# Run dataset verification in background
nohup python src/phase0_dataset/verify_dataset.py --sample-size 10000 > verify.log 2>&1 &

# Check progress anytime
tail -f verify.log

# Check if still running
ps aux | grep verify_dataset
```

**Benefits**:
- ✅ Continue working on other setup tasks
- ✅ Process survives if you switch cells
- ✅ Can monitor multiple tasks simultaneously
- ✅ Logs saved for later review

**When to Use Background**:
- Dataset verification (5-10 minutes)
- Model downloads (10-15 minutes)
- Benchmark evaluations (15-30 minutes)
- **NOT for training** (use TensorBoard for monitoring)

---

## 1. Environment Setup

In [None]:
# Check GPU availability
!nvidia-smi

In [None]:
# Verify we have A100
import torch
print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"CUDA Version: {torch.version.cuda}")
print(f"GPU Device: {torch.cuda.get_device_name(0)}")
print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

# Verify it's A100
gpu_name = torch.cuda.get_device_name(0)
if 'A100' not in gpu_name:
    print("\n⚠️ WARNING: You need A100 GPU for this training!")
    print("Go to Runtime → Change runtime type → Select A100")
else:
    print("\n✅ A100 GPU detected! Ready to train.")

## 2. Install Dependencies

⚠️ **Important**: Colab comes with pre-installed packages that conflict with our requirements. We'll clean install everything.

**⏱️ Estimated time: 5-7 minutes**

In [None]:
print("=" * 60)
print("📦 DEPENDENCY INSTALLATION (Section 2)")
print("=" * 60)
print("\n🧹 Step 1: Removing conflicting pre-installed packages...")
print("=" * 60)

# Remove ALL conflicting packages that come pre-installed in Colab
conflicting_packages = [
    'torch', 'torchvision', 'torchaudio',  # Colab has torch 2.8.0, we need 2.4.0
    'transformers', 'accelerate', 'peft',   # Need specific versions
    'tensorflow', 'tensorboard',            # TensorFlow conflicts with our tensorboard
    'opencv-python', 'opencv-python-headless', 'opencv-contrib-python',  # Require numpy 2.x
    'timm', 'pillow',                      # Vision packages not needed
    'axolotl',                             # May exist from previous runs
]

for package in conflicting_packages:
    !pip uninstall -y {package} 2>/dev/null || true

print("\n✅ Cleanup complete!")
print("\n" + "=" * 60)
print("📦 Step 2: Installing compatible package versions...")
print("=" * 60)

# Install PyTorch ecosystem with compatible versions
!pip install -q torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118

# Install core ML packages
!pip install -q transformers==4.41.0
!pip install -q accelerate==0.33.0
!pip install -q peft==0.12.0
!pip install -q bitsandbytes==0.43.3

# Install data handling packages
!pip install -q datasets==2.20.0
!pip install -q tokenizers==0.19.1
!pip install -q numpy==1.26.4  # Compatible with all our packages

# Install monitoring packages
!pip install -q wandb
!pip install -q tensorboard==2.17.0  # Compatible version

print("\n" + "=" * 60)
print("📦 Step 3: Installing Axolotl (this may take 2-3 minutes)...")
print("=" * 60)

# Install Axolotl v0.4.0 (compatible with our package versions)
!pip install -q --no-deps git+https://github.com/OpenAccess-AI-Collective/axolotl.git@v0.4.0

# Install Axolotl's dependencies manually to avoid conflicts
!pip install -q fire pyyaml huggingface-hub

print("\n" + "=" * 60)
print("✅ All dependencies installed successfully!")
print("=" * 60)
print("\n📋 Installed versions:")
print(f"  • torch: 2.4.0")
print(f"  • transformers: 4.41.0")
print(f"  • accelerate: 0.33.0")
print(f"  • peft: 0.12.0")
print(f"  • bitsandbytes: 0.43.3")
print(f"  • datasets: 2.20.0")
print(f"  • axolotl: v0.4.0")
print("\n💡 Clean install complete - no dependency conflicts!")
print("🚀 Ready to proceed with training setup")
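The summary at the end of the install cell prints the pinned version strings, not what pip actually resolved. A minimal sketch for reading the installed versions back is shown here; `installed_versions` is a hypothetical helper name, and the package list is copied from the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the pip commands above.
pinned = ["torch", "transformers", "accelerate", "peft", "bitsandbytes", "datasets"]
for name, ver in installed_versions(pinned).items():
    print(f"  • {name}: {ver or 'NOT INSTALLED'}")
```

Comparing this output with the pins makes it easier to spot a package that pip silently upgraded or failed to install.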

## 3. Clone Repository & Setup

In [None]:
# Clone repository (or pull latest changes if already exists)
import os

if os.path.exists('Cogumi-LLM'):
    print("📂 Repository already exists, pulling latest changes...")
    %cd Cogumi-LLM
    !git pull origin main
    print("✅ Repository updated to latest version")
else:
    print("📥 Cloning repository...")
    !git clone https://github.com/dkeviv/Cogumi-LLM.git
    %cd Cogumi-LLM
    print("✅ Repository cloned successfully")

In [None]:
# Check whether the dataset is already present
# (on a fresh clone it is not; upload it in Section 3b)
!ls -lh data/phase1/public_500k_filtered.jsonl
!wc -l data/phase1/public_500k_filtered.jsonl

## 3b. Upload Dataset

⚠️ **Important**: The dataset is not in the Git repository (too large). You need to upload it.

**Choose one option:**

### Option 1: Upload Compressed File (RECOMMENDED - 3x faster!)
- **File**: `public_500k_filtered.jsonl.gz` (264 MB)
- **Time**: ~9-10 minutes
- **Location**: `/Users/vivekdurairaj/Projects/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl.gz`

### Option 2: Upload Original File
- **File**: `public_500k_filtered.jsonl` (870 MB)
- **Time**: ~30-35 minutes
- **Location**: `/Users/vivekdurairaj/Projects/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl`

### Option 1: Upload Compressed File (9-10 minutes)

In [None]:
# Create data directory structure
!mkdir -p data/phase1

# Upload compressed dataset file
from google.colab import files
print("📤 OPTION 1: Upload compressed file (FASTER)")
print("📂 Click 'Choose Files' and select: public_500k_filtered.jsonl.gz")
print("📍 Location: /Users/vivekdurairaj/Projects/Cogumi-LLM/data/phase1/")
print("⏱️  Upload: ~9-10 minutes (264 MB)")
print("\nWaiting for file selection...")

uploaded = files.upload()

# Move and decompress
print("\n📦 Moving and decompressing file...")
!mv public_500k_filtered.jsonl.gz data/phase1/
!gunzip data/phase1/public_500k_filtered.jsonl.gz

print("\n✅ Upload and decompression complete! Verifying...")

### Option 2: Upload Original File (30-35 minutes)

In [None]:
# Create data directory structure
!mkdir -p data/phase1

# Upload uncompressed dataset file
from google.colab import files
print("📤 OPTION 2: Upload original file (slower)")
print("📂 Click 'Choose Files' and select: public_500k_filtered.jsonl")
print("📍 Location: /Users/vivekdurairaj/Projects/Cogumi-LLM/data/phase1/")
print("⏱️  Upload: ~30-35 minutes (870 MB)")
print("\nWaiting for file selection...")

uploaded = files.upload()

# Move to correct location
print("\n📦 Moving file to data/phase1/...")
!mv public_500k_filtered.jsonl data/phase1/

print("\n✅ Upload complete! Verifying...")

In [None]:
# Verify dataset uploaded correctly
import json

print("📊 Dataset Verification:\n")

# Check file exists and size
!ls -lh data/phase1/public_500k_filtered.jsonl

# Count lines
print("\n📏 Line count:")
!wc -l data/phase1/public_500k_filtered.jsonl

# Verify format (first 3 examples)
print("\n✅ First 3 examples:")
with open('data/phase1/public_500k_filtered.jsonl', 'r') as f:
    for i in range(3):
        line = f.readline()
        example = json.loads(line)
        print(f"\nExample {i+1}:")
        print(f"  Keys: {list(example.keys())}")
        if 'instruction' in example:
            print(f"  Instruction: {example['instruction'][:80]}...")
        if 'response' in example:
            print(f"  Response: {example['response'][:80]}...")

print("\n🎉 Dataset ready for training!")

### Optional: Verify Dataset Quality (Run in Background)

You can verify dataset quality while setting up other components. This takes 5-10 minutes.

In [None]:
# Option A: Run verification in background (recommended)
# This allows you to continue with other setup tasks
!nohup python src/phase0_dataset/verify_dataset.py --sample-size 10000 > verify.log 2>&1 &
print("✅ Verification running in background. Check progress with: !tail -20 verify.log")

In [None]:
# Option B: Check verification progress
!tail -20 verify.log

In [None]:
# Option C: Check if verification is still running
!ps aux | grep verify_dataset.py | grep -v grep

## 4. HuggingFace Authentication

You need a HuggingFace token to download LLAMA-3.2-8B.

1. Go to: https://huggingface.co/settings/tokens
2. Create a new token (read access)
3. Accept LLAMA-3.2 license at: https://huggingface.co/meta-llama/Llama-3.2-8B
4. Paste token below

In [None]:
from huggingface_hub import login

# Paste your HuggingFace token here
HF_TOKEN = "YOUR_HF_TOKEN_HERE"

login(token=HF_TOKEN)
print("✅ HuggingFace authentication successful!")
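Pasting a token into a cell leaves it in the saved notebook. One possible alternative, sketched under the assumption that the token is either exported as an `HF_TOKEN` environment variable or typed at a hidden prompt, is the hypothetical helper below.

```python
import os
from getpass import getpass


def get_hf_token(env=os.environ, prompt=getpass):
    """Return the token from HF_TOKEN, or ask for it without echoing the input."""
    token = env.get("HF_TOKEN")
    if not token:
        token = prompt("HuggingFace token: ")
    return token.strip()
```

With this in place, the login call could become `login(token=get_hf_token())`, so the token never appears in the notebook source.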

## 5. Create Training Configuration

In [None]:
%%writefile configs/base_training.yaml
# Base Model Configuration
base_model: meta-llama/Llama-3.2-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: false

# QLoRA Configuration
load_in_4bit: true
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: true
bnb_4bit_compute_dtype: bfloat16

adapter: lora
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Dataset Configuration
datasets:
  - path: data/phase1/public_500k_filtered.jsonl
    type: completion
    field: response

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true
max_packed_sequence_len: 2048

# Training Hyperparameters
num_epochs: 3
micro_batch_size: 4
gradient_accumulation_steps: 8
gradient_checkpointing: true

# Optimizer Configuration
optimizer: adamw_torch
learning_rate: 0.000005
lr_scheduler: cosine
warmup_steps: 500
weight_decay: 0.01
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-8
max_grad_norm: 1.0

# Precision & Hardware
bf16: true
tf32: true
flash_attention: true

# Logging & Checkpointing
logging_steps: 10
eval_steps: 500
save_steps: 1000
save_total_limit: 5
output_dir: ./data/checkpoints/llama-3.2-8b-phase1a

# Early Stopping
early_stopping_patience: 6
load_best_model_at_end: true
metric_for_best_model: loss
greater_is_better: false

# Evaluation (eval_steps is set once, under Logging & Checkpointing)
evaluation_strategy: steps
per_device_eval_batch_size: 4
eval_accumulation_steps: 4

# Additional Optimizations
group_by_length: true
ddp_find_unused_parameters: false
dataloader_num_workers: 4
dataloader_pin_memory: true
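YAML parsers typically keep only the last value when a key is repeated, so a duplicated setting in a long config can go unnoticed. The following is a rough sketch of a check for repeated top-level keys; it assumes a flat config like the one above and is not a full YAML parser. `duplicate_top_level_keys` is a hypothetical helper name.

```python
from collections import Counter


def duplicate_top_level_keys(yaml_text):
    """Return top-level keys that occur more than once in a flat YAML document."""
    keys = []
    for line in yaml_text.splitlines():
        # Skip blanks, comments, list items and indented (nested) lines.
        if not line.strip() or line[0] in " \t#-" or ":" not in line:
            continue
        keys.append(line.split(":", 1)[0].strip())
    return sorted(k for k, n in Counter(keys).items() if n > 1)


sample = "logging_steps: 10\neval_steps: 500\nsave_steps: 1000\neval_steps: 500\n"
print(duplicate_top_level_keys(sample))
```

To check the real file, pass `open('configs/base_training.yaml').read()` instead of `sample`.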

## 6. Prepare Dataset for Axolotl

Axolotl needs a specific format. Let's verify and convert if needed.

In [None]:
# Check first few examples
import json

with open('data/phase1/public_500k_filtered.jsonl', 'r') as f:
    for i in range(3):
        line = f.readline()
        example = json.loads(line)
        print(f"\nExample {i+1}:")
        print(f"Keys: {list(example.keys())}")
        if 'instruction' in example:
            print(f"Instruction: {example['instruction'][:100]}...")
        if 'response' in example:
            print(f"Response: {example['response'][:100]}...")
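The cell above inspects only the first three records. Because the training config reads the `response` field, it may be worth counting how many records lack it across the whole file. This is a minimal sketch, assuming every line is a JSON object; `count_missing_field` is a hypothetical helper name.

```python
import json


def count_missing_field(lines, field="response"):
    """Count JSONL records whose `field` is absent, empty, or not a string."""
    missing = 0
    total = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        total += 1
        value = json.loads(line).get(field)
        if not isinstance(value, str) or not value.strip():
            missing += 1
    return missing, total


sample = ['{"instruction": "Add 2+2", "response": "4"}', '{"instruction": "Hi"}']
print(count_missing_field(sample))
```

On the real dataset, open `data/phase1/public_500k_filtered.jsonl` and pass the file object as `lines`; the function streams it line by line.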

## 7. Start Training

⚠️ **This will run for 36-48 hours**. Colab Pro+ sessions time out after 24 hours, so you'll need to resume from a checkpoint.

### 🔧 CHECKPOINT: Verify Installation

**📍 Run this cell after installing dependencies (Section 2)**

This verifies all packages are correctly installed and compatible.

In [None]:
print("=" * 60)
print("🔍 VERIFICATION CHECKPOINT")
print("=" * 60)
print("\n📋 Testing all package installations...\n")

import sys
all_good = True

# Test critical imports
try:
    import torch
    assert torch.__version__.startswith("2.4"), f"Wrong torch version: {torch.__version__}"
    print(f"✅ PyTorch {torch.__version__}")
except Exception as e:
    print(f"❌ PyTorch error: {e}")
    all_good = False

try:
    import transformers
    assert transformers.__version__.startswith("4.41"), f"Wrong transformers version: {transformers.__version__}"
    print(f"✅ Transformers {transformers.__version__}")
except Exception as e:
    print(f"❌ Transformers error: {e}")
    all_good = False

try:
    import accelerate
    print(f"✅ Accelerate {accelerate.__version__}")
except Exception as e:
    print(f"❌ Accelerate error: {e}")
    all_good = False

try:
    import peft
    print(f"✅ PEFT {peft.__version__}")
except Exception as e:
    print(f"❌ PEFT error: {e}")
    all_good = False

try:
    import bitsandbytes
    print(f"✅ BitsAndBytes {bitsandbytes.__version__}")
except Exception as e:
    print(f"❌ BitsAndBytes error: {e}")
    all_good = False

try:
    import axolotl
    print(f"✅ Axolotl imported successfully")
except Exception as e:
    print(f"❌ Axolotl error: {e}")
    all_good = False

try:
    # Test critical transformers imports
    from transformers import AutoModelForCausalLM, AutoTokenizer
    print(f"✅ Transformers models imported successfully")
except Exception as e:
    print(f"❌ Transformers model import error: {e}")
    all_good = False

print("=" * 60)
if all_good:
    print("\n🎉 All packages installed correctly!")
    print("🚀 Ready to proceed with training setup")
else:
    print("\n⚠️  Some packages have issues!")
    print("💡 Solution: Runtime → Restart runtime, then rerun Section 2 (Dependencies)")

### ⚠️ EMERGENCY ONLY: Complete Clean Restart

**📍 Only use if verification fails or training won't start**

This will restart your runtime completely. You'll need to rerun all cells.

In [None]:
print("=" * 60)
print("⚠️  NUCLEAR OPTION - RUNTIME RESTART")
print("=" * 60)
print("\nThis option will:")
print("  1. Kill the current Python process (Colab restarts it automatically)")
print("  2. Clear all variables and imported modules")
print("\nFiles on disk and pip-installed packages survive this restart.")
print("For a full wipe, use Runtime → Disconnect and delete runtime.")
print("\nAfter restart, you'll need to:")
print("  • Rerun Section 2 (Dependencies) if verification failed")
print("  • Re-upload the dataset if the runtime was deleted")
print("  • Rerun all setup cells")
print("\n" + "=" * 60)
print("\nTo proceed, uncomment the line below and run this cell:")
print()

# Uncomment this line to restart runtime:
# import os; os.kill(os.getpid(), 9)

In [None]:
# Start TensorBoard in background (open in new tab)
%load_ext tensorboard
%tensorboard --logdir data/checkpoints/llama-3.2-8b-phase1a

In [None]:
print("=" * 60)
print("🚀 LAUNCH TRAINING (Section 7)")
print("=" * 60)

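# Launch training with warning suppression
import os
import warnings

# Suppress Python warnings (e.g. from torchvision) in the launched process and in this kernel.
# PYTHONWARNINGS accepts only Warning categories; 'ignore' silences all of them.
os.environ['PYTHONWARNINGS'] = 'ignore'
warnings.filterwarnings('ignore')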

print("🚀 Launching training...")
print("⏱️  Expected duration: 36-48 hours on A100 40GB")
print("📊 Monitor progress in TensorBoard (see cell above)\n")

!accelerate launch -m axolotl.cli.train configs/base_training.yaml

## 8. Resume Training (After Session Timeout)

If Colab disconnects, rerun Sections 1-5, then run the cells below to find the latest checkpoint and resume:

In [None]:
# Check available checkpoints
!ls -lh data/checkpoints/llama-3.2-8b-phase1a/

# Find latest checkpoint
import os
import re

checkpoint_dir = "data/checkpoints/llama-3.2-8b-phase1a"
checkpoints = [d for d in os.listdir(checkpoint_dir) if d.startswith('checkpoint-')]
if checkpoints:
    # Sort by step number
    checkpoints.sort(key=lambda x: int(re.findall(r'\d+', x)[0]))
    latest = checkpoints[-1]
    print(f"\n✅ Latest checkpoint: {latest}")
    print(f"\nTo resume, update configs/base_training.yaml:")
    print(f"Add: resume_from_checkpoint: {checkpoint_dir}/{latest}")
else:
    print("No checkpoints found yet.")

In [None]:
# Resume training from latest checkpoint
!accelerate launch -m axolotl.cli.train configs/base_training.yaml --resume_from_checkpoint data/checkpoints/llama-3.2-8b-phase1a/checkpoint-XXXX
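Rather than editing the `checkpoint-XXXX` placeholder by hand, the command can be assembled from the checkpoint directory listing. The sketch below reuses the `--resume_from_checkpoint` flag and config path shown above; `latest_checkpoint` and `resume_command` are hypothetical helper names.

```python
import re


def latest_checkpoint(names):
    """Return the `checkpoint-<step>` name with the highest step, or None."""
    steps = {}
    for name in names:
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match:
            steps[int(match.group(1))] = name
    return steps[max(steps)] if steps else None


def resume_command(checkpoint_dir, names, config="configs/base_training.yaml"):
    """Build the resume command line, or return None when no checkpoint exists."""
    latest = latest_checkpoint(names)
    if latest is None:
        return None
    return (f"accelerate launch -m axolotl.cli.train {config} "
            f"--resume_from_checkpoint {checkpoint_dir}/{latest}")


print(resume_command("data/checkpoints/llama-3.2-8b-phase1a",
                     ["checkpoint-1000", "checkpoint-2000", "runs"]))
```

In the notebook, `names` would come from `os.listdir(checkpoint_dir)`, and the resulting string can be executed with `!{cmd}` after assigning it to a variable `cmd`.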

## 9. Monitor Training Progress

In [None]:
# Check training logs
!tail -50 data/checkpoints/llama-3.2-8b-phase1a/training.log

In [None]:
# Plot loss curve
import json
import os
import matplotlib.pyplot as plt

trainer_state_file = "data/checkpoints/llama-3.2-8b-phase1a/trainer_state.json"

if os.path.exists(trainer_state_file):
    with open(trainer_state_file, 'r') as f:
        state = json.load(f)
    
    # Extract loss history
    steps = []
    losses = []
    for entry in state['log_history']:
        if 'loss' in entry:
            steps.append(entry['step'])
            losses.append(entry['loss'])
    
    # Plot
    plt.figure(figsize=(12, 6))
    plt.plot(steps, losses, linewidth=2)
    plt.xlabel('Training Steps')
    plt.ylabel('Loss')
    plt.title('Training Loss Curve')
    plt.grid(True, alpha=0.3)
    plt.show()
    
    print(f"\nCurrent step: {state['global_step']}")
    print(f"Current loss: {losses[-1]:.4f}")
    print(f"Best loss: {min(losses):.4f}")
    print(f"Progress: {state['global_step']/60000*100:.1f}% (target: 60K steps)")
else:
    print("Training state file not found yet.")

## 10. Merge LoRA Adapters (After Training)

Run this after training completes to merge LoRA weights into base model.

In [None]:
# Merge LoRA adapters into base model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "data/checkpoints/llama-3.2-8b-phase1a"
)

# Merge and unload
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("models/llama-3.2-8b-phase1a-merged")

# Save tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-8B")
tokenizer.save_pretrained("models/llama-3.2-8b-phase1a-merged")

print("✅ Model merged and saved to models/llama-3.2-8b-phase1a-merged")

## 11. Test the Model

In [None]:
# Quick test
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="models/llama-3.2-8b-phase1a-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

test_prompt = "Write a Python function to calculate the factorial of a number."

result = generator(
    test_prompt,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

print(result[0]['generated_text'])

## 12. Download Model to Local

After training completes, download the model to continue with Phase 2.

In [None]:
# Compress model for download
!tar -czf llama-3.2-8b-phase1a-merged.tar.gz models/llama-3.2-8b-phase1a-merged/
!ls -lh llama-3.2-8b-phase1a-merged.tar.gz

print("\n✅ Model compressed. Download from Files panel on left.")

In [None]:
# Alternative: Upload to HuggingFace Hub
from huggingface_hub import HfApi

api = HfApi()

# Create repository (change username)
repo_id = "YOUR_USERNAME/cogumi-llm-phase1a"

api.create_repo(repo_id=repo_id, private=True, exist_ok=True)

# Upload model
api.upload_folder(
    folder_path="models/llama-3.2-8b-phase1a-merged",
    repo_id=repo_id,
    repo_type="model"
)

print(f"✅ Model uploaded to: https://huggingface.co/{repo_id}")

---

## Training Checklist

- [ ] A100 GPU selected
- [ ] Dependencies installed
- [ ] Repository cloned
- [ ] HuggingFace authenticated
- [ ] Dataset verified (640,637 examples)
- [ ] Training config created
- [ ] Training started
- [ ] TensorBoard monitoring
- [ ] Checkpoint saved (every 1000 steps)
- [ ] Training completed (60K steps)
- [ ] LoRA merged into base
- [ ] Model tested
- [ ] Model downloaded/uploaded

## Expected Timeline

- **Epoch 1**: 12-14 hours (steps 0-20K)
- **Epoch 2**: 12-14 hours (steps 20K-40K)
- **Epoch 3**: 12-14 hours (steps 40K-60K)
- **Total**: 36-48 hours
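The step counts above follow from the dataset size and the batch settings in the training config. A quick sketch of the arithmetic, assuming a single GPU and one example per sequence (sample packing, which the config enables, would lower the real step count):

```python
import math

examples = 640_637          # dataset size from the checklist
micro_batch_size = 4        # from the training config
grad_accum_steps = 8        # from the training config
num_epochs = 3

# Examples consumed per optimizer step on one GPU.
effective_batch = micro_batch_size * grad_accum_steps
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch: {steps_per_epoch:,}")
print(f"Total steps: {total_steps:,}")
```

This gives roughly 20K steps per epoch and 60K in total, matching the timeline.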

## Troubleshooting

**Session Timeout**: Resume from latest checkpoint (see Section 8)

**OOM Error**: Reduce `micro_batch_size` to 2 in config

**Slow Progress**: Check GPU utilization with `!nvidia-smi`

**Loss Not Decreasing**: Check TensorBoard, may need to reduce learning rate

**CUDA Error**: Restart runtime, rerun setup cells
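For the OOM case, lowering `micro_batch_size` alone also shrinks the effective batch size. One option, sketched here as an assumption rather than a required step, is to raise `gradient_accumulation_steps` so the product stays the same. `rebalance` is a hypothetical helper name.

```python
def rebalance(micro_batch_size, grad_accum_steps, new_micro_batch_size):
    """Return the accumulation steps that keep micro * accum unchanged."""
    effective = micro_batch_size * grad_accum_steps
    if effective % new_micro_batch_size:
        raise ValueError("new micro batch size must divide the effective batch size")
    return effective // new_micro_batch_size


# Config values: micro_batch_size 4, gradient_accumulation_steps 8; new micro batch 2.
print(rebalance(4, 8, 2))
```

With the config values, dropping to a micro batch of 2 pairs with 16 accumulation steps, which keeps the step counts in the timeline unchanged.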

---

**Next Steps After Phase 1A:**
1. Evaluate on benchmarks (MMLU, HumanEval, GSM8K)
2. Proceed to Phase 2: Compression (95% size reduction)
3. Create domain modifiers in Phase 3