# Cogumi-LLM Training on Google Colab Pro+

## Overview
This notebook fine-tunes the Qwen 2.5 7B model with QLoRA on a Google Colab Pro+ A100 (40GB) GPU.

**Training Details:**
- Model: Qwen 2.5 7B (7.62B parameters)
- Method: QLoRA 4-bit quantization
- Dataset: 640,637 samples (870MB)
- Expected Time: 20-25 hours
- Compute Units: ~62.5 units
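
As a rough illustration of why 4-bit quantization matters on a 40GB card, the arithmetic below estimates the memory taken by the base-model weights alone at different precisions. It is a back-of-the-envelope sketch: it ignores quantization constants, LoRA adapter weights, optimizer state, and activations, and the helper name `weight_memory_gb` is illustrative, not part of the project.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7.62e9  # parameter count quoted in the training details above

for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions the weights take roughly 15.2 GB at 16-bit precision and roughly 3.8 GB at 4-bit, which leaves most of the GPU memory for activations and adapter training.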

**Prerequisites:**
- ✅ Google Colab Pro+ subscription
- ✅ Training data: `public_500k_filtered.jsonl` (870MB)
- ✅ A100 GPU runtime selected

## Instructions
Run the cells in order from top to bottom; steps marked Optional can be skipped. Each step is explained above its cell.

## Step 1: Verify GPU and Environment

In [None]:
# Check GPU availability and type
!nvidia-smi

import torch
print(f"\n{'='*60}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU Memory: {total_memory:.1f} GB")
    
    # Verify we have A100 with enough memory
    assert 'A100' in torch.cuda.get_device_name(0), "Please select A100 GPU runtime"
    assert total_memory >= 35, f"Insufficient GPU memory: {total_memory:.1f}GB < 35GB"
    
    print(f"\n✅ Environment ready for training!")
else:
    print("\n❌ CUDA not available. Please select GPU runtime.")
    print("Go to: Runtime → Change runtime type → Select A100 GPU")

print(f"{'='*60}")

## Step 2: Clone Repository

In [None]:
import os

# Clone the repository if not already cloned
if not os.path.exists('Cogumi-LLM'):
    print("Cloning Cogumi-LLM repository...")
    !git clone https://github.com/YOUR_USERNAME/Cogumi-LLM.git
    print("✅ Repository cloned")
else:
    print("✅ Repository already exists")

# Change to project directory
%cd Cogumi-LLM

# Show current status
!pwd
!git log --oneline -1

## Step 3: Upload Training Data

**Important**: Upload the `public_500k_filtered.jsonl` file (870MB) from your local machine.

This will take 2-3 minutes depending on your internet speed.
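
Once the file is in place, a quick integrity check can catch a truncated or corrupted upload before training starts. The sketch below counts records and confirms each line parses as JSON; the function name `count_jsonl_records` is illustrative and not part of the repository.

```python
import json


def count_jsonl_records(path: str) -> int:
    """Count non-empty lines in a JSONL file, checking that each parses as JSON."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"Invalid JSON on line {line_number}: {err}") from err
            count += 1
    return count
```

For the dataset described in the overview, `count_jsonl_records('data/phase1/public_500k_filtered.jsonl')` would be expected to return 640637; a smaller number suggests an incomplete upload.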

In [None]:
from google.colab import files
import os

# Create directory structure
!mkdir -p data/phase1

# Check if file already exists
data_file = 'data/phase1/public_500k_filtered.jsonl'

if os.path.exists(data_file):
    print(f"✅ Training data already uploaded")
    !ls -lh {data_file}
    !wc -l {data_file}
else:
    print("📤 Please upload: public_500k_filtered.jsonl (870MB)")
    print("This will take 2-3 minutes...\n")
    
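    uploaded = files.upload()
    
    # Move the first uploaded .jsonl file to the expected location.
    # shutil.move handles filenames containing spaces, which `!mv {filename}` does not.
    import shutil
    for filename in uploaded:
        if filename.endswith('.jsonl'):
            shutil.move(filename, data_file)
            print("\n✅ File uploaded successfully")
            break
    else:
        raise FileNotFoundError("No .jsonl file was uploaded; re-run this cell")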
    
    # Verify upload
    print("\nVerifying upload:")
    !ls -lh {data_file}
    !wc -l {data_file}
    
    # Should show:
    # -rw-r--r-- 1 root root 870M ... public_500k_filtered.jsonl
    # 640637 data/phase1/public_500k_filtered.jsonl

## Step 4: Install Dependencies

In [None]:
# Install Python packages
print("Installing dependencies (3-5 minutes)...\n")

!pip install -q -r requirements.txt
!pip install -q flash-attn --no-build-isolation

print("\n✅ Dependencies installed")

# Verify key packages
import transformers
import peft
import bitsandbytes

print(f"\nPackage versions:")
print(f"  transformers: {transformers.__version__}")
print(f"  peft: {peft.__version__}")
print(f"  bitsandbytes: {bitsandbytes.__version__}")

## Step 5: Validate Training Setup

This runs a 1-step training test to catch any issues before the full 20-25 hour run.

In [None]:
# Run validation (2-3 minutes)
print("Running pre-training validation...\n")

!python scripts/validate_training_setup.py

# Should see:
# ✅ Data loaded successfully
# ✅ Model loaded successfully  
# ✅ Training step completed
# ✅ Validation PASSED

print("\nValidation script finished. Confirm '✅ Validation PASSED' appears above before starting training.")

## Step 6: Start Training

**⚠️ This will run for 20-25 hours**

- Keep this browser tab open (can minimize)
- Training will save checkpoints every 500 steps
- Monitor progress in the output below

**Expected behavior:**
- Initial setup: 5-10 minutes (downloading model)
- Training: ~1.5 hours per epoch × 15 epochs = 22.5 hours
- Loss should decrease from ~2.5 to ~1.0
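
The time estimate above is simple arithmetic, sketched here so it can be re-run with different numbers if the observed time per epoch differs. The function name `estimate_training_hours` is illustrative; the inputs are the figures quoted in this section.

```python
def estimate_training_hours(hours_per_epoch: float, epochs: int,
                            setup_minutes: float = 0.0) -> float:
    """Estimate wall-clock hours: training time plus one-time setup."""
    return hours_per_epoch * epochs + setup_minutes / 60


# Figures from this section: ~1.5 h/epoch, 15 epochs, 5-10 minutes of setup
low = estimate_training_hours(1.5, 15, setup_minutes=5)
high = estimate_training_hours(1.5, 15, setup_minutes=10)
print(f"Estimated wall-clock time: {low:.2f} to {high:.2f} hours")
```

With the quoted figures this comes to roughly 22.6 to 22.7 hours, which sits inside the 20-25 hour range.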

In [None]:
# Start training
import time
from datetime import datetime

start_time = time.time()
print(f"🚀 Training started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("Expected duration: 20-25 hours\n")
print("="*60)

!python scripts/train_qwen_qlora.py

# Training complete
end_time = time.time()
duration_hours = (end_time - start_time) / 3600

print("\n" + "="*60)
print("✅ Training script finished (check the output above for errors)")
print(f"Total time: {duration_hours:.2f} hours")
print(f"Finished at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*60)

## Step 7: Monitor Progress (Optional)

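Colab runs one cell at a time, so this cell cannot execute while the training cell is still running, even from a second browser tab on the same notebook. To follow the log live, open the Colab terminal and run `tail -f logs/training_*.log` there.

**Note**: Run the cell below after training finishes or is interrupted to see the most recent log lines.

In [None]:
# Show the most recent training log lines
!tail -n 50 logs/training_*.log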

## Step 8: Check Checkpoints

In [None]:
# List saved checkpoints
print("Saved checkpoints:\n")
!ls -lth models/qwen-7b-qlora-phase2/ | head -20

# Show model directory size
!du -sh models/qwen-7b-qlora-phase2/

## Step 9: Keep Session Alive (Optional)

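Run this to help keep the session from timing out. Colab can disconnect sessions that look idle, and the exact limits depend on your plan. Because Colab runs one cell at a time, this loop cannot run alongside the training cell, so use it only while no other cell is executing (for example, while you wait to download the model).

**Note**: Stop this cell manually before running any other cell.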

In [None]:
import time
from datetime import datetime

print("🔄 Session keep-alive started")
print("This will ping every 5 minutes to keep the session active.")
print("Stop this cell manually when done.\n")

try:
    while True:
        current_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print(f"✅ Session active at {current_time}")
        time.sleep(300)  # 5 minutes
except KeyboardInterrupt:
    print("\n🛑 Keep-alive stopped")

## Step 10: Download Trained Model

After training completes, compress and download the model.
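
Because the archive is large, it can be worth recording a checksum in Colab and comparing it against the downloaded file. The sketch below computes a SHA-256 digest in chunks so the whole archive is never held in memory; the helper name `sha256_of_file` is illustrative.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes at EOF
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After compression, `print(sha256_of_file('qwen-7b-qlora-phase2.tar.gz'))` gives a value that can be compared with `sha256sum` (Linux) or `shasum -a 256` (macOS) on the local copy.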

In [None]:
# Compress the trained model
print("Compressing model (this may take 5-10 minutes)...\n")

!tar -czf qwen-7b-qlora-phase2.tar.gz models/qwen-7b-qlora-phase2/

print("\n✅ Model compressed")
!ls -lh qwen-7b-qlora-phase2.tar.gz

# Download to local machine
print("\nStarting download...")
from google.colab import files
files.download('qwen-7b-qlora-phase2.tar.gz')

print("\n✅ Download started - check your browser's download folder")

## Resume Training from Checkpoint (If Needed)

If your session disconnected, use this cell to resume from the last checkpoint.
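
Sorting by modification time usually finds the right checkpoint, but choosing by step number is more robust if files were touched or copied. The sketch below assumes checkpoint directories are named `checkpoint-<step>`, as in the resume command in this section; the helper name `latest_checkpoint` is illustrative.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for entry in root.iterdir():
        match = pattern.match(entry.name)
        if entry.is_dir() and match:
            # Compare steps numerically so checkpoint-1000 beats checkpoint-500
            candidates.append((int(match.group(1)), entry))

    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

Calling `latest_checkpoint('models/qwen-7b-qlora-phase2')` returns the path to pass to `--resume_from_checkpoint`, or `None` if no checkpoint has been saved yet.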

In [None]:
# Find the latest checkpoint
!ls -lt models/qwen-7b-qlora-phase2/ | grep checkpoint | head -1

# Uncomment the two lines below and replace XXXX with the checkpoint number
# checkpoint_num = "XXXX"  # e.g., "1000"
# !python scripts/train_qwen_qlora.py --resume_from_checkpoint models/qwen-7b-qlora-phase2/checkpoint-{checkpoint_num}

## Troubleshooting

### Session Disconnected
- Resume from last checkpoint (see cell above)
- Checkpoints saved every 500 steps

### CUDA Out of Memory
- Shouldn't happen on A100 40GB
- If it does, reduce batch size in `configs/student_model.yaml`

### Training Too Slow
- Check GPU utilization: `!nvidia-smi`
- Should show ~90-95% GPU usage
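
For a reading that is easier to log or compare than the full `nvidia-smi` table, the query mode of `nvidia-smi` can be parsed in Python. This is a minimal sketch: the helper names are illustrative, and it assumes the CSV output contains one line per GPU with utilization (percent) and used memory (MiB).

```python
import subprocess
from typing import List, Tuple

# nvidia-smi query mode: one CSV line per GPU, without header or units
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'util, mem' CSV lines into (utilization %, memory used MiB) tuples."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append((int(util), int(mem)))
    return stats


def read_gpu_stats() -> List[Tuple[int, int]]:
    """Run nvidia-smi and return parsed stats for every visible GPU."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(output)
```

A utilization value well below the ~90-95% range during training would point to a bottleneck outside the GPU, such as data loading.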

### Upload Failed
- The file may be too large for a browser upload
- Alternative: host the file in cloud storage and fetch it with `!wget`
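
If the file is hosted somewhere reachable over HTTP(S), the download can also be done from Python instead of `!wget`. The sketch below streams the response to disk using only the standard library; the helper name `download_file` and any URL passed to it are placeholders, since the hosting location is not specified in this notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, destination: str) -> int:
    """Stream a URL to a local file and return the number of bytes written."""
    dest = Path(destination)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # copyfileobj streams in chunks, so the file is never fully held in memory
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest.stat().st_size
```

For example, `download_file(YOUR_FILE_URL, 'data/phase1/public_500k_filtered.jsonl')` places the file where Step 3 expects it, with `YOUR_FILE_URL` replaced by the actual location of your hosted copy.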

## Next Steps

1. ✅ Download trained model
2. Evaluate on benchmarks
3. Proceed to Phase 3a (general modifiers)
4. Proceed to Phase 3b (coding modifiers)

## Resources

- Full Guide: `docs/COLAB_PRO_PLUS_GUIDE.md`
- Training Config: `configs/student_model.yaml`
- Execution Plan: `docs/EXECUTION_PLAN.md`