# ⚠️ GOLDEN DEPENDENCY SET - DO NOT CHANGE

**Tested and Working Configuration for H100 Training:**

| Package | Version | Notes |
|---------|---------|-------|
| **Python** | 3.10.12 | Base system Python |
| **PyTorch** | 2.8.0+cu128 | CUDA 12.8 build |
| **CUDA** | 12.8 | GPU compute platform |
| **bitsandbytes** | 0.48.1 | 8-bit optimizer support |
| **xformers** | 0.0.32.post2 | Memory efficient attention |
| **transformers** | 4.57.1 | HuggingFace models |
| **Unsloth** | 2025.10.8 | Fast fine-tuning library |

**Hardware:** NVIDIA H100 80GB HBM3

**Installation Method:** Use the `golden_dynamic_setup_full.sh` script, which creates a virtual environment at `/workspace/golden-venv/` with these exact versions.

---

# H100 Training with Unsloth - Production Ready

**Complete step-by-step guide using the GOLDEN DEPENDENCY SET (tested & working)**

**Time**: 8-9 hours | **Cost**: ~$10 on H100

⚠️ **IMPORTANT:** Use the exact dependency versions documented above. Other combinations may fail!

## Step 1: Dry Run - Verify What Will Be Installed

**First, run a dry-run to confirm the golden dependency versions will be installed.**

This will show:
- PyTorch 2.8.0+cu128
- transformers 4.57.1
- xformers 0.0.32.post2
- bitsandbytes 0.48.1
- Unsloth 2025.10.8

✅ This is the **GOLDEN DEPENDENCY SET** that has been tested and confirmed working.

In [None]:
# Cell 1: Dry-run to see what will be installed
# Upload golden_dynamic_setup_full.sh to the same folder as the notebook
!echo "🔹 Running dry-run to show planned installation..."
!bash golden_dynamic_setup_full.sh --dry-run

## Step 2: Install Golden Dependency Set

**This installs the exact tested versions in a virtual environment.**

Creates: `/workspace/golden-venv/` with Python 3.10.12 and all dependencies

**Takes 10-15 minutes**

In [None]:
# Cell 3: Run full installation
!bash golden_dynamic_setup_full.sh

## Step 3: Switch Kernel & Restart

**After installation completes:**

1. Click **"Kernel"** menu at the top
2. Select **"Restart Kernel"**
3. Wait for kernel to restart

**Then change kernel to golden-venv:**

1. Click **"Kernel"** menu → **"Change Kernel"**
2. Select the kernel from `/workspace/golden-venv/bin/python`
3. Wait for connection (green checkmark)

## Step 4: Verify Golden Dependency Set

**Run the cell below to verify all packages match the golden set:**

Expected output:
```
Python: 3.10.12
Torch: 2.8.0+cu128, CUDA: 12.8, GPUs: 1
GPU 0: NVIDIA H100 80GB HBM3
bitsandbytes: 0.48.1
xformers: 0.0.32.post2
transformers: 4.57.1
🦥 Unsloth version: 2025.10.8
```

If any version differs, the installation did not complete or the notebook is still attached to the old kernel. Repeat Steps 2 and 3 before continuing.
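The comparison against the golden set can also be automated. The following is a minimal sketch, assuming the pip distribution names below match how the packages were installed; `GOLDEN` and `check_versions` are illustrative names, not part of the setup script.

```python
from importlib import metadata

# Golden versions copied from the table at the top of this notebook.
# Assumption: these pip distribution names match the installed packages.
GOLDEN = {
    "torch": "2.8.0+cu128",
    "bitsandbytes": "0.48.1",
    "xformers": "0.0.32.post2",
    "transformers": "4.57.1",
    "unsloth": "2025.10.8",
}


def check_versions(expected, get_version=metadata.version):
    """Return {package: (expected, found)} for every mismatch or missing package."""
    problems = {}
    for name, want in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != want:
            problems[name] = (want, found)
    return problems


mismatches = check_versions(GOLDEN)
for name, (want, found) in mismatches.items():
    print(f"{name}: expected {want}, found {found or 'not installed'}")
if not mismatches:
    print("All packages match the golden set")
```

An empty result means every listed package matches. Any printed line names a package to reinstall before training.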

## ⚠️ CRITICAL: Disk Space Prevention

**This notebook includes automatic checkpoint cleanup to prevent disk space crashes.**

**What's implemented:**
- `save_total_limit=2` (keep only 2 checkpoints instead of 3)
- `save_steps=2000` (save less frequently to reduce disk pressure)
- Automatic cleanup callback (force-deletes old checkpoints every save)
- Disk monitoring (alerts if disk usage >70%)

**Why this matters:**
- Each checkpoint = ~3GB
- Without aggressive cleanup, disk fills up → training crashes
- This configuration keeps max 6GB of checkpoints (safe on 100GB volume)

✅ **Safe for 100GB volumes** - tested and confirmed working!
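The 6GB figure is simply `save_total_limit` multiplied by the checkpoint size. A small sketch of that arithmetic follows. It assumes that one extra checkpoint exists briefly, because the cleanup runs only after a new checkpoint has been written. The function name and the `transient_extra` parameter are illustrative.

```python
def checkpoint_footprint_gb(save_total_limit, checkpoint_gb, transient_extra=1):
    """Return (steady_state_gb, peak_gb) for rotating checkpoints."""
    steady = save_total_limit * checkpoint_gb
    # Assumption: the newest checkpoint is written before the oldest is removed,
    # so the peak briefly includes `transient_extra` additional checkpoints.
    peak = (save_total_limit + transient_extra) * checkpoint_gb
    return steady, peak


steady, peak = checkpoint_footprint_gb(save_total_limit=2, checkpoint_gb=3)
print(f"steady: {steady} GB, peak: {peak} GB")  # steady: 6 GB, peak: 9 GB
```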

In [None]:
#!/usr/bin/env python3
try:
    import unsloth
    version = getattr(unsloth, "__version__", "unknown")
    print(f"🦥 Unsloth version: {version} (import first in your scripts!)")
except ImportError:
    print("⚠️ Unsloth not installed! Re-run golden_dynamic_setup_full.sh (Step 2) and switch to the golden-venv kernel (Step 3)")

import sys
import torch


print(f"Python: {sys.version.split()[0]}")  # version number only, e.g. 3.10.12
print(f"Torch: {torch.__version__}, CUDA: {torch.version.cuda}, GPUs: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

for pkg in ["bitsandbytes", "xformers", "transformers"]:
    try:
        mod = __import__(pkg)
        print(f"{pkg}: {mod.__version__}")
    except ImportError:
        print(f"⚠️ {pkg} not installed")


## Step 5: HuggingFace Authentication

1. Get token from: https://huggingface.co/settings/tokens
2. Accept LLAMA license: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
3. Run the cell below and paste your token
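For unattended runs, the interactive prompt can be skipped. This is a minimal sketch, assuming the token has been exported as the `HF_TOKEN` environment variable; `resolve_token` and `authenticate` are hypothetical helper names.

```python
import os


def resolve_token(env=os.environ):
    """Return the token from HF_TOKEN, or None if it is unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def authenticate():
    """Log in with HF_TOKEN when available, else fall back to the prompt."""
    from huggingface_hub import login  # imported lazily so the helper above has no dependency

    token = resolve_token()
    if token:
        login(token=token)
    else:
        login()  # interactive prompt, as in the cell below


# authenticate()  # uncomment to use instead of the interactive cell
```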

In [None]:
from huggingface_hub import login
login()
print("\n✅ Login prompt shown. Paste your token above and wait for the confirmation message.")

## Step 6: Upload Dataset

**Before running the next cell, upload your dataset:**

1. In JupyterLab, use the file browser on the left
2. Navigate to `/data/Cogumi-LLM/data/phase1/`
3. Click the **Upload Files** button (↑ icon)
4. Select `public_500k_filtered.jsonl` from your local machine
5. Wait for upload to complete (~5-10 minutes for 870MB file)

The training script expects: `/data/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl`
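Before training, it can help to confirm that every record carries the `instruction` and `response` fields the training script reads. The sketch below assumes one JSON object per line; `validate_jsonl` is an illustrative helper, not part of the original notebook.

```python
import json
import os


def validate_jsonl(path, required=("instruction", "response"), max_errors=5):
    """Return (record_count, first_few_problems) for a JSONL file."""
    total, errors = 0, []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            total += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problem = f"line {lineno}: invalid JSON ({exc.msg})"
            else:
                # A non-object record counts as missing every required field.
                missing = [
                    key for key in required
                    if not isinstance(record, dict) or key not in record
                ]
                problem = f"line {lineno}: missing {missing}" if missing else None
            if problem and len(errors) < max_errors:
                errors.append(problem)
    return total, errors


path = "/data/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl"
if os.path.exists(path):
    total, errors = validate_jsonl(path)
    print(f"{total} records checked, first problems: {errors or 'none'}")
```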

## Step 7: Create Training Script

**Creates train.py with:**
- Unsloth 2025.10.8 compatible code
- Batched formatting function for instruction/response dataset
- QLoRA 4-bit training configuration
- Optimized for H100 with golden dependency set
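"Batched" here means the formatting function receives a dict of column lists and returns one string per example. A standalone sketch of that shape follows, mirroring the prompt template used in `train.py`; the name `format_batch` and the sample batch are illustrative.

```python
def format_batch(examples):
    """Turn a batch (dict of column lists) into a list of prompt strings."""
    return [
        f"### Instruction:\n{instruction}\n\n### Response:\n{response}"
        for instruction, response in zip(examples["instruction"], examples["response"])
    ]


batch = {"instruction": ["Add 2 and 3."], "response": ["5"]}
print(format_batch(batch)[0])
```

Printing one formatted example is a quick way to confirm that the section markers sit on their own lines.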

In [None]:
import os

# ----------------------------
# Notebook cell to create train.py - WITH AUTO-RESUME
# ----------------------------
script = """# ----------------------------
# train.py - H100 Optimized (Packing DISABLED for stability)
# Compatible with Unsloth 2025.10.8, TRL, PEFT, 4-bit training
# INCLUDES: Automatic checkpoint cleanup AND auto-resume capability
# ----------------------------

import unsloth  # Must be first for Unsloth patching
import torch
from transformers import TrainingArguments, TrainerCallback
from trl import SFTTrainer
from datasets import load_dataset
from unsloth import FastLanguageModel
import gc
import os
import glob
import shutil
import subprocess
import threading
import time

# Clear any existing GPU memory
gc.collect()
torch.cuda.empty_cache()

# ============================================================================
# AUTO-RESUME: Check for existing checkpoints
# ============================================================================
checkpoint_dir = "/data/Cogumi-LLM/checkpoints"
resume_checkpoint = None

if os.path.exists(checkpoint_dir):
    checkpoints = glob.glob(f"{checkpoint_dir}/checkpoint-*")
    if checkpoints:
        # Sort by step number and get the latest
        checkpoints.sort(key=lambda x: int(x.split('-')[-1]))
        resume_checkpoint = checkpoints[-1]
        step_num = resume_checkpoint.split('-')[-1]
        print("=" * 70)
        print(f"🔄 RESUMING from checkpoint: {os.path.basename(resume_checkpoint)}")
        print(f"   Previous progress: {step_num} steps completed")
        print("=" * 70)
    else:
        print("=" * 70)
        print("🆕 No checkpoints found - starting fresh training")
        print("=" * 70)
else:
    print("=" * 70)
    print("🆕 Starting fresh training (no checkpoint directory)")
    print("=" * 70)

# ============================================================================
# CHECKPOINT CLEANUP CALLBACK - Prevents disk space crashes
# ============================================================================
class CheckpointCleanupCallback(TrainerCallback):
    \"\"\"
    Aggressively deletes old checkpoints to prevent disk space exhaustion.
    
    Keeps only the N most recent checkpoints (save_total_limit).
    Triggered after every checkpoint save to ensure immediate cleanup.
    \"\"\"
    def on_save(self, args, state, control, **kwargs):
        \"\"\"Called after checkpoint is saved.\"\"\"
        checkpoint_dir = args.output_dir
        
        # Find all checkpoint directories
        checkpoints = sorted(
            glob.glob(f\"{checkpoint_dir}/checkpoint-*\"),
            key=lambda x: int(x.split('-')[-1])
        )
        
        # Keep only the last N checkpoints
        keep_last_n = args.save_total_limit or 2
        
        if len(checkpoints) > keep_last_n:
            to_delete = checkpoints[:-keep_last_n]
            
            for checkpoint_path in to_delete:
                try:
                    print(f\"🗑️  Auto-deleting old checkpoint: {os.path.basename(checkpoint_path)}\")
                    shutil.rmtree(checkpoint_path)
                    print(f\"   ✅ Deleted successfully\")
                except Exception as e:
                    print(f\"   ⚠️  Failed to delete: {e}\")
        
        # Report disk usage after cleanup
        try:
            result = subprocess.run(['df', '-h', checkpoint_dir], 
                                  capture_output=True, text=True)
            lines = result.stdout.strip().split('\\n')
            if len(lines) > 1:
                usage = lines[1].split()[4]
                print(f\"💾 Disk usage after cleanup: {usage}\")
        except Exception:
            pass  # Disk reporting is best-effort; never interrupt training

# ============================================================================
# DISK MONITORING - Early warning system
# ============================================================================
def monitor_disk_space():
    \"\"\"Monitor disk usage every 10 minutes and warn if getting full.\"\"\"
    while True:
        try:
            result = subprocess.run(['df', '-h', '/data'], 
                                  capture_output=True, text=True)
            lines = result.stdout.strip().split('\\n')
            if len(lines) > 1:
                usage = lines[1].split()[4]
                used = lines[1].split()[2]
                timestamp = time.strftime('%H:%M:%S')
                
                usage_pct = int(usage.strip('%'))
                
                if usage_pct > 90:
                    print(f\"\\n🚨 CRITICAL: Disk {used} used ({usage}) at {timestamp}\")
                    print(f\"   ⚠️  Training may crash soon due to disk space!\")
                elif usage_pct > 70:
                    print(f\"\\n⚠️  Warning: Disk {used} used ({usage}) at {timestamp}\")
                else:
                    print(f\"💾 Disk: {used} used ({usage}) - {timestamp}\")
        except Exception as e:
            print(f\"⚠️  Disk monitoring error: {e}\")
        
        time.sleep(600)  # Check every 10 minutes

# Start disk monitoring in background
monitor_thread = threading.Thread(target=monitor_disk_space, daemon=True)
monitor_thread.start()
print(\"✅ Disk space monitoring started (checks every 10 minutes)\")

# Load model + tokenizer with H100 optimizations
print("🔄 Loading model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,  # Handle sequences up to 1024 tokens
    load_in_4bit=True,
    dtype=None,  # Auto-detect bf16 for H100
    attn_implementation="flash_attention_2",  # CRITICAL: Enable FA2
)
print("✅ Model loaded successfully")

# Apply PEFT / LoRA
print("🔄 Applying LoRA...")
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
)
print("✅ LoRA applied successfully")

# Prepare model for training (enables Flash Attention 2)
# Force disable gradient offloading for maximum speed
import os
os.environ["UNSLOTH_OFFLOAD_GRADIENTS"] = "0"
print("🔄 Preparing model for training...")
model = FastLanguageModel.for_training(model)
print("✅ Model ready for training")

# Load dataset with caching
print("📥 Loading dataset...")
dataset = load_dataset(
    "json",
    data_files="/data/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl",
    split="train",
    cache_dir="/tmp/hf_cache",
    encoding="utf-8"
)
print(f"✅ Dataset loaded: {len(dataset)} examples")

# Define formatting function (required by Unsloth)
def formatting_func(examples):
    instructions = examples['instruction']
    responses = examples['response']
    
    texts = []
    for instruction, response in zip(instructions, responses):
        text = f"### Instruction:\\n{instruction}\\n\\n### Response:\\n{response}"
        texts.append(text)
    
    return texts

# Training arguments - OPTIMIZED FOR H100
args = TrainingArguments(
    output_dir="/data/Cogumi-LLM/checkpoints",
    num_train_epochs=3,
    
    # Batch size for H100 80GB with 4-bit and seq_len 1024
    per_device_train_batch_size=4,  # Reduced for no-packing mode
    gradient_accumulation_steps=2,   # Effective batch size of 8
    
    # Optimization settings
    learning_rate=2e-5,
    warmup_steps=10,
    logging_steps=1,
    save_steps=2000,              # UPDATED: Save every 2000 steps (was 1000)
    save_total_limit=2,           # UPDATED: Keep only 2 checkpoints (was 3)
    
    # H100 optimizations
    optim="adamw_8bit",
    bf16=True,
    tf32=True,
    
    # Dataloader settings (conservative)
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    dataloader_prefetch_factor=2,
    dataloader_persistent_workers=True,
    group_by_length=False,
    
    # Memory optimizations
    gradient_checkpointing=False,
    max_grad_norm=1.0,
    
    # Disable unnecessary features
    logging_first_step=False,
    logging_nan_inf_filter=False,
    save_safetensors=True,
    
    # Report to nothing (disable wandb etc)
    report_to="none",
)

# Create trainer WITHOUT packing (packing was causing batch size mismatch)
print("🔄 Creating trainer...")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=args,
    formatting_func=formatting_func,
    max_seq_length=1024,  # MUST match model max_seq_length
    packing=False,  # DISABLED: packing caused batch mismatch error
    dataset_num_proc=2,
)

# Add checkpoint cleanup callback
trainer.add_callback(CheckpointCleanupCallback())
print("✅ Trainer created successfully")
print("✅ Checkpoint cleanup callback registered")

# Train the model (WITH AUTO-RESUME)
print("=" * 70)
if resume_checkpoint:
    print(f"🔄 RESUMING training from step {resume_checkpoint.split('-')[-1]}")
else:
    print("🚀 STARTING fresh training")
print("=" * 70)
print(f"   Max sequence length: 1024")
print(f"   Batch size per device: {args.per_device_train_batch_size}")
print(f"   Gradient accumulation: {args.gradient_accumulation_steps}")
print(f"   Effective batch size: {args.per_device_train_batch_size * args.gradient_accumulation_steps}")
print(f"   Total training steps: ~{len(dataset) // (args.per_device_train_batch_size * args.gradient_accumulation_steps) * args.num_train_epochs}")
print(f"   Dataloader workers: {args.dataloader_num_workers}")
print(f"   Prefetch factor: {args.dataloader_prefetch_factor}")
print(f"   Dataset processing workers: 2")
print(f"   Packing: DISABLED (was causing batch mismatch)")
print(f"   Flash Attention 2: ENABLED")
print(f"   Gradient offloading: DISABLED")
print(f"   Expected speed: 2-4 it/s on H100")
print(f"   💾 Checkpoint saves: Every {args.save_steps} steps")
print(f"   💾 Checkpoints kept: {args.save_total_limit} (auto-cleanup enabled)")
print(f"   📊 Disk monitoring: Active (every 10 min)")
print(f"   🔄 Auto-resume: ENABLED")
print("=" * 70)

try:
    trainer.train(resume_from_checkpoint=resume_checkpoint)
    print("\\n✅ Training completed successfully!")
except Exception as e:
    print(f"\\n❌ Training failed with error: {e}")
    import traceback
    traceback.print_exc()
    raise

# Save model
print("💾 Saving final model...")
model.save_pretrained("/data/Cogumi-LLM/checkpoints/final")
tokenizer.save_pretrained("/data/Cogumi-LLM/checkpoints/final")
print("✅ Model saved to /data/Cogumi-LLM/checkpoints/final")
"""

# Write train.py to disk
train_path = "/data/Cogumi-LLM/train.py"
os.makedirs(os.path.dirname(train_path), exist_ok=True)
with open(train_path, "w", encoding="utf-8") as f:
    f.write(script)

print(f"✅ AUTO-RESUME training script created at {train_path}")
print(f"   🔄 Auto-resume: ENABLED - will resume from last checkpoint automatically")
print(f"   ⚡ Flash Attention 2: Enabled")
print(f"   🚫 Gradient offloading: DISABLED")
print(f"   🔧 Sequence length: 1024")
print(f"   📦 Batch size: 4")
print(f"   🔄 Gradient accumulation: 2 (effective batch = 8)")
print(f"   👷 Dataloader workers: 4")
print(f"   ❌ Packing: DISABLED (was causing batch dimension mismatch)")
print(f"   💾 Save every 2000 steps")
print(f"   💾 Keep only 2 checkpoints")
print(f"   🗑️  Auto-cleanup: ENABLED (prevents disk space crashes)")
print(f"   📊 Disk monitoring: ENABLED (checks every 10 min)")
print(f"   ✅ Safe to interrupt - will resume automatically on restart!")


## Step 7.5: Clean Up Old Checkpoints (If Resuming After Crash)

**⚠️ ONLY RUN THIS IF:**
- Training crashed previously due to disk space
- You have multiple old checkpoints taking up space
- You're resuming training from a checkpoint

**This will:**
- Keep only the latest checkpoint
- Delete all older checkpoints
- Free up 20-30GB of disk space

**Skip this step if starting fresh training.**
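Because the cleanup is destructive, a dry run that only lists what would be removed is a cheap safeguard. The sketch below sorts by the numeric step, since a plain string sort would place `checkpoint-10000` before `checkpoint-2000`. `plan_cleanup` is an illustrative helper and deletes nothing.

```python
from pathlib import Path


def plan_cleanup(checkpoint_dir, keep=1):
    """Return (to_keep, to_delete) checkpoint paths, ordered by training step."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return [], []

    def step(path):
        return int(path.name.rsplit("-", 1)[1])

    # Only directories named checkpoint-<number> are considered.
    checkpoints = sorted(
        (
            p for p in root.glob("checkpoint-*")
            if p.is_dir() and p.name.rsplit("-", 1)[1].isdigit()
        ),
        key=step,
    )
    if keep <= 0:
        return [], checkpoints
    return checkpoints[-keep:], checkpoints[:-keep]


to_keep, to_delete = plan_cleanup("/data/Cogumi-LLM/checkpoints")
print("keep:  ", [p.name for p in to_keep])
print("delete:", [p.name for p in to_delete])
```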

In [None]:
# ============================================================================
# CHECKPOINT CLEANUP - Run ONLY if resuming after crash
# ============================================================================

import os
import subprocess
from pathlib import Path

checkpoint_dir = Path("/data/Cogumi-LLM/checkpoints")

print("🧹 CHECKPOINT CLEANUP")
print("=" * 70)

if not checkpoint_dir.exists():
    print("✅ No checkpoints directory found - nothing to clean")
else:
    # Find all checkpoints
    checkpoints = sorted(
        checkpoint_dir.glob("checkpoint-*"),
        key=lambda x: int(x.name.split("-")[1])
    )
    
    if len(checkpoints) == 0:
        print("✅ No checkpoints found - nothing to clean")
    elif len(checkpoints) == 1:
        print(f"✅ Only 1 checkpoint found: {checkpoints[0].name}")
        print("   Nothing to delete")
    else:
        print(f"📁 Found {len(checkpoints)} checkpoints:")
        for cp in checkpoints:
            size = subprocess.run(
                ['du', '-sh', str(cp)],
                capture_output=True,
                text=True
            ).stdout.split()[0]
            print(f"   - {cp.name} ({size})")
        
        print("\n🗑️  Deleting old checkpoints (keeping latest)...")
        
        # Delete all but the latest
        for cp in checkpoints[:-1]:
            print(f"   Deleting {cp.name}...")
            subprocess.run(['rm', '-rf', str(cp)])
        
        # Verify cleanup
        remaining = list(checkpoint_dir.glob("checkpoint-*"))
        print(f"\n✅ Cleanup complete!")
        print(f"   Kept: {remaining[0].name if remaining else 'None'}")
        print(f"   Deleted: {len(checkpoints) - len(remaining)} checkpoints")
        
        # Show disk space
        result = subprocess.run(['df', '-h', '/data'], 
                              capture_output=True, text=True)
        lines = result.stdout.strip().split('\n')
        if len(lines) > 1:
            parts = lines[1].split()
            print(f"\n💾 Disk Space: {parts[2]} used / {parts[1]} total ({parts[4]} full)")
            print(f"   Available: {parts[3]}")

print("\n✅ Ready to resume training!")

## Step 8: Start Training 🚀

**Training Details:**
- Duration: 8-9 hours on H100
- Cost: ~$10 on Vast.ai
- Model: Llama-3.1-8B-Instruct (4-bit QLoRA)
- Dataset: 640K instruction/response pairs

**Monitor with:** `nvidia-smi` in a terminal, or `watch -n 1 nvidia-smi` for a live view.

Run the pre-flight check (Step 7.9) first; the training cell follows it.
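Training duration follows from the step count and the measured throughput. The sketch below is a rough version of that arithmetic, using the same floor division that `train.py` uses for its printed step estimate. The example count is taken from the dataset figure quoted above and should be replaced with the count `train.py` prints after loading. `estimate_training` is an illustrative name.

```python
def estimate_training(num_examples, per_device_batch, grad_accum, epochs, steps_per_second):
    """Return (total_optimizer_steps, estimated_hours)."""
    effective_batch = per_device_batch * grad_accum
    total_steps = num_examples // effective_batch * epochs
    hours = total_steps / steps_per_second / 3600
    return total_steps, hours


# Batch size 4, gradient accumulation 2, and 3 epochs, as configured in train.py.
for speed in (2.0, 4.0):  # the it/s range train.py reports as expected
    steps, hours = estimate_training(640_000, 4, 2, 3, speed)
    print(f"{steps} steps at {speed} it/s -> {hours:.1f} h")
```

The estimate is sensitive to both inputs, so treat it as a sanity check against the observed progress bar rather than a prediction.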

## Step 7.9: Pre-Flight Check - Verify Auto-Resume Setup ✈️

**ALWAYS RUN THIS BEFORE STARTING TRAINING!**

This verifies:
1. ✅ train.py has auto-resume code
2. ✅ Checkpoints are detected correctly
3. ✅ Training will resume from the right place

**If this shows issues, fix them before running Step 8!**

In [None]:
# ============================================================================
# PRE-FLIGHT CHECK - Verify auto-resume setup before training
# ============================================================================

import os
import glob
import subprocess

print("✈️ PRE-FLIGHT CHECKLIST")
print("=" * 70)

# Check 1: Does train.py exist?
train_py_path = "/data/Cogumi-LLM/train.py"
if not os.path.exists(train_py_path):
    print("❌ CRITICAL: train.py not found!")
    print(f"   Expected at: {train_py_path}")
    print("   🔧 FIX: Re-run Cell 15 (Step 7) to create train.py")
    print("=" * 70)
    raise FileNotFoundError("train.py missing - cannot continue")

print("✅ CHECK 1: train.py exists")

# Check 2: Does train.py have auto-resume code?
print("\n🔍 CHECK 2: Checking for auto-resume code in train.py...")
with open(train_py_path, 'r', encoding='utf-8') as f:  # train.py was written as UTF-8
    content = f.read()
    
has_checkpoint_detection = "resume_checkpoint = None" in content
has_glob_import = "import glob" in content
has_resume_param = "resume_from_checkpoint=resume_checkpoint" in content

if has_checkpoint_detection and has_glob_import and has_resume_param:
    print("✅ CHECK 2: Auto-resume code FOUND in train.py")
    
    # Show the exact lines
    print("\n📝 Auto-resume code preview:")
    result = subprocess.run(['grep', '-n', '-A', '2', 'resume_checkpoint = None', train_py_path],
                          capture_output=True, text=True)
    print(result.stdout[:300])
else:
    print("❌ CHECK 2 FAILED: Auto-resume code MISSING in train.py!")
    print(f"   - Checkpoint detection: {'✅' if has_checkpoint_detection else '❌'}")
    print(f"   - Glob import: {'✅' if has_glob_import else '❌'}")
    print(f"   - Resume parameter: {'✅' if has_resume_param else '❌'}")
    print("\n🔧 FIX: Re-run Cell 15 (Step 7) to regenerate train.py with auto-resume")
    print("=" * 70)

# Check 3: What checkpoints exist?
print("\n🔍 CHECK 3: Scanning for existing checkpoints...")
checkpoint_dir = "/data/Cogumi-LLM/checkpoints"

if not os.path.exists(checkpoint_dir):
    print("ℹ️  No checkpoint directory found")
    print("   → Training will START FRESH (expected for first run)")
else:
    checkpoints = glob.glob(f"{checkpoint_dir}/checkpoint-*")
    
    if not checkpoints:
        print("ℹ️  Checkpoint directory exists but is EMPTY")
        print("   → Training will START FRESH")
    else:
        checkpoints.sort(key=lambda x: int(x.split('-')[-1]))
        latest = checkpoints[-1]
        step_num = latest.split('-')[-1]
        
        print(f"✅ Found {len(checkpoints)} checkpoint(s):")
        for cp in checkpoints:
            step = cp.split('-')[-1]
            print(f"   - checkpoint-{step}")
        
        print(f"\n🎯 Training will RESUME from: checkpoint-{step_num}")
        print(f"   Previous progress: {step_num} steps completed")
        
        # Verify checkpoint is valid
        adapter_path = os.path.join(latest, "adapter_model.safetensors")
        if os.path.exists(adapter_path):
            size_mb = os.path.getsize(adapter_path) / (1024 * 1024)
            print(f"   Checkpoint size: {size_mb:.1f} MB (looks valid ✅)")
        else:
            print(f"   ⚠️  WARNING: adapter_model.safetensors not found!")
            print(f"   This checkpoint might be corrupted!")

# Check 4: Disk space
print("\n🔍 CHECK 4: Checking disk space...")
result = subprocess.run(['df', '-h', '/data'], capture_output=True, text=True)
lines = result.stdout.strip().split('\n')
if len(lines) > 1:
    parts = lines[1].split()
    usage = parts[4]
    available = parts[3]
    
    usage_pct = int(usage.strip('%'))
    
    if usage_pct > 80:
        print(f"⚠️  DISK SPACE WARNING: {usage} used, {available} available")
    else:
        print(f"✅ CHECK 4: Disk space OK - {usage} used, {available} available")

# Final summary
print("\n" + "=" * 70)
print("📋 FINAL STATUS:")
print("=" * 70)

if has_checkpoint_detection and has_glob_import and has_resume_param:
    if os.path.exists(checkpoint_dir) and glob.glob(f"{checkpoint_dir}/checkpoint-*"):
        print("✅ READY TO RESUME: Training will continue from last checkpoint")
    else:
        print("✅ READY FOR FRESH START: Training will begin from step 0")
    print("\n🚀 You can now run Step 8 (Start Training)")
else:
    print("❌ NOT READY: Auto-resume code missing from train.py")
    print("\n🔧 ACTION REQUIRED:")
    print("   1. Re-run Cell 15 (Step 7) to regenerate train.py")
    print("   2. Then re-run this cell to verify")
    print("   3. Then run Step 8 to start training")

print("=" * 70)

In [None]:
!python /data/Cogumi-LLM/train.py

## Step 8.5: Monitor Disk Space During Training (Optional)

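**Run this from a second notebook or a terminal Python session to monitor disk space in real time.** The training cell keeps this notebook's kernel busy until training finishes, so a cell queued here would not start until then.

This shows disk usage every 30 seconds while training runs.

**Stop with:** Ctrl+C in a terminal, or **Interrupt Kernel** in a notebook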
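The same thresholds can be checked without parsing `df` output. This minimal sketch uses the standard library's `shutil.disk_usage`. The percentage may differ slightly from `df`, which calculates usage against used plus available space. `disk_status` is an illustrative helper.

```python
import shutil


def disk_status(path, warn_pct=70, critical_pct=90):
    """Return (percent_used, label) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    percent = usage.used / usage.total * 100 if usage.total else 0.0
    if percent > critical_pct:
        label = "CRITICAL"
    elif percent > warn_pct:
        label = "WARNING"
    else:
        label = "HEALTHY"
    return percent, label


percent, label = disk_status(".")  # use "/data" on the training machine
print(f"{label}: {percent:.1f}% used")
```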

In [None]:
# Real-time disk monitoring (runs continuously)
# Press Ctrl+C to stop

import time
import subprocess
from datetime import datetime

print("🔍 Real-time disk space monitoring")
print("=" * 70)
print("Press Ctrl+C to stop")
print("=" * 70)

try:
    while True:
        # Get disk usage
        result = subprocess.run(['df', '-h', '/data'], 
                              capture_output=True, text=True)
        lines = result.stdout.strip().split('\n')
        
        if len(lines) > 1:
            parts = lines[1].split()
            total = parts[1]
            used = parts[2]
            available = parts[3]
            usage_pct = parts[4]
            
            timestamp = datetime.now().strftime('%H:%M:%S')
            
            # Color coding based on usage
            usage_num = int(usage_pct.strip('%'))
            if usage_num > 90:
                status = "🚨 CRITICAL"
            elif usage_num > 70:
                status = "⚠️  WARNING"
            else:
                status = "✅ HEALTHY"
            
            print(f"{timestamp} | {status} | Used: {used}/{total} ({usage_pct}) | Free: {available}")
        
        # Check checkpoint count
        try:
            checkpoint_count = subprocess.run(
                ['sh', '-c', 'ls -1d /data/Cogumi-LLM/checkpoints/checkpoint-* 2>/dev/null | wc -l'],
                capture_output=True, text=True
            )
            count = checkpoint_count.stdout.strip()
            if count and int(count) > 0:
                print(f"          | 📁 Checkpoints: {count}")
        except Exception:
            pass  # Checkpoint count is informational only
        
        time.sleep(30)  # Update every 30 seconds
        
except KeyboardInterrupt:
    print("\n\n✅ Monitoring stopped")

## Step 8.6: Quick Diagnostic (Optional)

**If training seems slow despite high GPU utilization, run this to check:**
- The actual batch size being used
- That packing is still disabled (`packing=False` in `train.py`)
- Samples per second versus steps per second
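Steps per second and samples per second differ by the effective batch size. The sketch below shows the conversion, assuming the progress bar's it/s counts optimizer steps, each of which covers `per_device_batch * grad_accum` examples per GPU. The function name is illustrative.

```python
def samples_per_second(steps_per_second, per_device_batch, grad_accum, num_gpus=1):
    """Convert optimizer steps per second into training examples per second."""
    return steps_per_second * per_device_batch * grad_accum * num_gpus


# Batch size 4 and gradient accumulation 2, as configured in train.py.
print(samples_per_second(2.0, 4, 2))  # 16.0
```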

In [None]:
# Quick diagnostic - show the end of the training script and any running training process
!tail -50 /data/Cogumi-LLM/train.py 2>/dev/null || echo "Training script not found"
!echo ""
!echo "Running training processes:"
!ps aux | grep train.py | grep -v grep || echo "No training process running"

In [None]:
# Verify flash-attn 2.8.2 is installed and working
import sys
print(f"Python executable: {sys.executable}")

try:
    import flash_attn
    print(f"✅ flash-attn version: {flash_attn.__version__}")
    
    from flash_attn import flash_attn_func
    print(f"✅ flash_attn_func importable: True")
    
    import torch
    print(f"✅ CUDA available: {torch.cuda.is_available()}")
    print(f"✅ CUDA version: {torch.version.cuda}")
    
    print("\n🎉 Flash Attention 2 is ready!")
except ImportError as e:
    print(f"❌ Flash Attention 2 import failed: {e}")
    print("⚠️ You may need to reinstall flash-attn")

In [None]:
# CRITICAL DIAGNOSTIC: Check if FA2 is actually being used
# Even though FA2 = True, check if it's really active in the model

import unsloth
from unsloth import FastLanguageModel

# Quick test load to see actual configuration
model, tokenizer = FastLanguageModel.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
    dtype=None,
    attn_implementation="flash_attention_2",
)

# Check what attention implementation is actually being used
print("\n🔍 CHECKING ACTUAL ATTENTION IMPLEMENTATION:")
print(f"Model config attn_implementation: {model.config._attn_implementation}")
print(f"Model type: {type(model)}")

# Check individual layers
if hasattr(model, 'model') and hasattr(model.model, 'layers'):
    first_layer = model.model.layers[0]
    if hasattr(first_layer, 'self_attn'):
        attn_class = type(first_layer.self_attn).__name__
        print(f"Attention layer class: {attn_class}")
        print(f"Expected for FA2: Should contain 'FlashAttention' or 'Llama.*FlashAttention'")

# Free GPU memory held by the test model
import gc
import torch

del model, tokenizer
gc.collect()
torch.cuda.empty_cache()

## 🚨 EMERGENCY: Check Training Status

**If GPU shows 0% after browser refresh, run these diagnostic cells:**
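The GPU diagnostic below prints the raw `nvidia-smi` CSV line. The sketch here turns such a line into numbers, assuming the `csv,noheader` format with unit suffixes as in the sample string. The sample is illustrative, not captured output, and `parse_gpu_line` is a hypothetical helper.

```python
def parse_gpu_line(line):
    """Parse 'util %, used MiB, total MiB' into a dict of integers."""
    util, used, total = (field.strip().split()[0] for field in line.split(","))
    return {
        "util_pct": int(util),
        "mem_used_mib": int(used),
        "mem_total_mib": int(total),
    }


sample = "0 %, 1024 MiB, 81559 MiB"  # illustrative, not real output
stats = parse_gpu_line(sample)
print(stats, "-> idle" if stats["util_pct"] == 0 else "-> busy")
```

Zero utilization alongside a running `train.py` process usually points to a stall rather than a crash, so check both before restarting.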

In [None]:
# Step 1: Check if training process is still running
import subprocess
import os

print("🔍 DIAGNOSTIC 1: Checking for training process...")
result = subprocess.run(['ps', 'aux'], capture_output=True, text=True)
train_processes = [line for line in result.stdout.split('\n') if 'train.py' in line and 'grep' not in line]

if train_processes:
    print("✅ Training process FOUND and running:")
    for proc in train_processes:
        print(f"   {proc}")
else:
    print("❌ NO training process found - training has stopped!")
    
print("\n🔍 DIAGNOSTIC 2: Checking GPU status...")
gpu_result = subprocess.run(['nvidia-smi', '--query-gpu=utilization.gpu,memory.used,memory.total', '--format=csv,noheader'], 
                           capture_output=True, text=True)
print(f"GPU Utilization: {gpu_result.stdout.strip()}")

print("\n🔍 DIAGNOSTIC 3: Checking for recent checkpoints...")
checkpoint_dir = "/data/Cogumi-LLM/checkpoints"
if os.path.exists(checkpoint_dir):
    checkpoints = subprocess.run(['ls', '-lht', checkpoint_dir], capture_output=True, text=True)
    print(checkpoints.stdout)
else:
    print(f"❌ No checkpoint directory at {checkpoint_dir}")

print("\n🔍 DIAGNOSTIC 4: Check last training logs...")
try:
    # Look for any log files or check recent output
    log_check = subprocess.run(['find', '/data/Cogumi-LLM', '-name', '*.log', '-mmin', '-60'], 
                              capture_output=True, text=True)
    if log_check.stdout.strip():
        print(f"Recent logs:\n{log_check.stdout}")
    else:
        print("No recent log files found")
except Exception as e:
    print(f"Could not check logs: {e}")

## 🔄 Recovery Actions (Based on Diagnostic Results)

**Choose the appropriate action based on what you found above:**

In [None]:
# ACTION A: If process died - Resume from last checkpoint
# This will automatically detect the latest checkpoint and resume training

import os
import subprocess
import glob

print("🔍 Searching for latest checkpoint...")
checkpoint_dir = "/data/Cogumi-LLM/checkpoints"

if os.path.exists(checkpoint_dir):
    # Find all checkpoint folders
    checkpoints = glob.glob(f"{checkpoint_dir}/checkpoint-*")
    if checkpoints:
        # Sort by step number
        checkpoints.sort(key=lambda x: int(x.split('-')[-1]))
        latest_checkpoint = checkpoints[-1]
        step_num = latest_checkpoint.split('-')[-1]
        
        print(f"✅ Found latest checkpoint: {latest_checkpoint}")
        print(f"   Training was at step {step_num}")
        print(f"\n⚠️ To resume training from this checkpoint:")
        print(f"   1. Confirm train.py passes resume_from_checkpoint (the Step 7 script detects '{latest_checkpoint}' automatically)")
        print(f"   2. Re-run the training cell")
        print(f"\nACTION B below writes a train.py with auto-resume built in.")
        
        # Check how many steps were completed
        print(f"\n📊 Progress estimate:")
        print(f"   Completed steps: {step_num}")
        print(f"   At ~1.7 it/s, you've trained for ~{int(step_num) / 1.7 / 3600:.1f} hours")
    else:
        print("❌ No checkpoints found - training crashed before the first checkpoint (step 2000)")
        print("   You'll need to restart training from scratch")
else:
    print("❌ Checkpoint directory doesn't exist - training never started properly")

In [None]:
# ACTION B: Create a training script with auto-resume capability
# This version will automatically resume from the last checkpoint if it exists

import os

script_with_resume = """# ----------------------------
# train.py - H100 Optimized with AUTO-RESUME
# Automatically resumes from last checkpoint if training was interrupted
# ----------------------------

import unsloth
import torch
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset
from unsloth import FastLanguageModel
import gc
import os
import glob

# Clear GPU memory
gc.collect()
torch.cuda.empty_cache()

# Check for existing checkpoints
checkpoint_dir = "/data/Cogumi-LLM/checkpoints"
resume_checkpoint = None

if os.path.exists(checkpoint_dir):
    checkpoints = glob.glob(f"{checkpoint_dir}/checkpoint-*")
    if checkpoints:
        checkpoints.sort(key=lambda x: int(x.split('-')[-1]))
        resume_checkpoint = checkpoints[-1]
        print(f"🔄 Found checkpoint to resume from: {resume_checkpoint}")
        step_num = resume_checkpoint.split('-')[-1]
        print(f"   Resuming from step {step_num}")
    else:
        print("🆕 No checkpoints found - starting fresh training")
else:
    print("🆕 Starting fresh training")

# Load model + tokenizer
print("🔄 Loading model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
    dtype=None,
    attn_implementation="flash_attention_2",
)
print("✅ Model loaded")

# Apply LoRA
print("🔄 Applying LoRA...")
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
)
print("✅ LoRA applied")

# Prepare for training
os.environ["UNSLOTH_OFFLOAD_GRADIENTS"] = "0"
model = FastLanguageModel.for_training(model)
print("✅ Model ready")

# Load dataset
print("📥 Loading dataset...")
dataset = load_dataset(
    "json",
    data_files="/data/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl",
    split="train",
    cache_dir="/tmp/hf_cache",
    encoding="utf-8"
)
print(f"✅ Dataset loaded: {len(dataset)} examples")

# Formatting function
def formatting_func(examples):
    instructions = examples['instruction']
    responses = examples['response']
    texts = []
    for instruction, response in zip(instructions, responses):
        text = f"### Instruction:\\n{instruction}\\n\\n### Response:\\n{response}"
        texts.append(text)
    return texts

# Training arguments
args = TrainingArguments(
    output_dir="/data/Cogumi-LLM/checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,      # Your actual config
    gradient_accumulation_steps=2,       # Your actual config
    learning_rate=2e-5,
    warmup_steps=10,
    logging_steps=1,
    save_steps=1000,
    save_total_limit=3,
    optim="adamw_8bit",
    bf16=True,
    tf32=True,
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    dataloader_prefetch_factor=2,
    dataloader_persistent_workers=True,
    group_by_length=False,
    gradient_checkpointing=False,
    max_grad_norm=1.0,
    logging_first_step=False,
    logging_nan_inf_filter=False,
    save_safetensors=True,
    report_to="none",
)

# Create trainer
print("🔄 Creating trainer...")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=args,
    formatting_func=formatting_func,
    max_seq_length=1024,
    packing=False,
    dataset_num_proc=2,
)
print("✅ Trainer created")

# Train with auto-resume
print("=" * 70)
if resume_checkpoint:
    print(f"🔄 RESUMING training from {resume_checkpoint}")
else:
    print("🚀 STARTING fresh training")
print("=" * 70)

try:
    trainer.train(resume_from_checkpoint=resume_checkpoint)
    print("\\n✅ Training completed!")
except Exception as e:
    print(f"\\n❌ Training failed: {e}")
    import traceback
    traceback.print_exc()
    raise

# Save final model
print("💾 Saving model...")
model.save_pretrained("/data/Cogumi-LLM/checkpoints/final")
tokenizer.save_pretrained("/data/Cogumi-LLM/checkpoints/final")
print("✅ Model saved")
"""

# Write the script
train_path = "/data/Cogumi-LLM/train.py"
os.makedirs(os.path.dirname(train_path), exist_ok=True)
with open(train_path, "w", encoding="utf-8") as f:
    f.write(script_with_resume)

print(f"✅ Auto-resume training script created at {train_path}")
print(f"   This script will automatically resume from the last checkpoint if training crashes")
print(f"\n🔄 Now run the training cell again to restart with auto-resume")
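**Optional: syntax-check the generated script.** Because `train.py` is written from a Python string, an escaping or quoting slip only shows up when training starts. The sketch below parses the file with the standard-library `ast` module before you launch anything. The function name `check_script_syntax` is hypothetical. It checks syntax only, not imports or runtime behavior.

```python
import ast


def check_script_syntax(path: str) -> bool:
    """Return True if the file at `path` parses as valid Python, else report the error."""
    with open(path, "r", encoding="utf-8") as f:
        source = f.read()
    try:
        ast.parse(source, filename=path)
    except SyntaxError as err:
        print(f"❌ Syntax error in {path}, line {err.lineno}: {err.msg}")
        return False
    print(f"✅ {path} parses cleanly")
    return True
```

For this notebook the call would be `check_script_syntax("/data/Cogumi-LLM/train.py")`.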

## ✅ PERFECT! Resume from Checkpoint 9000

**You have checkpoint-9000! That's ~5.3 hours of progress saved!**

**Next steps:**
1. ✅ Run the ACTION B cell above to update train.py with auto-resume
2. ✅ Then scroll back up and run the training cell again
3. ✅ Training will automatically resume from step 9000

**Progress so far:**
- Completed: 9,000 / 240,240 steps (~3.7%)
- Time invested: ~5.3 hours
- Remaining: ~32.7 hours
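**Optional: recompute progress from your own numbers.** The sketch below shows the arithmetic behind a progress estimate: percent complete from the step counts, and remaining time by linear extrapolation from the observed pace. It assumes constant throughput, so treat the remaining-time figure as rough and prefer the ETA on the trainer's progress bar when it is available. The function name `training_progress` is hypothetical.

```python
def training_progress(completed_steps: int, total_steps: int, elapsed_hours: float) -> dict:
    """Estimate progress and remaining time, assuming a constant pace per step."""
    if completed_steps <= 0 or total_steps <= 0:
        raise ValueError("completed_steps and total_steps must be positive")

    hours_per_step = elapsed_hours / completed_steps
    remaining_steps = max(total_steps - completed_steps, 0)
    return {
        "percent_complete": 100.0 * completed_steps / total_steps,
        "hours_per_step": hours_per_step,
        "remaining_hours": remaining_steps * hours_per_step,
    }
```

For example, `training_progress(9000, 240240, 5.3)["percent_complete"]` evaluates to about 3.7, matching the percentage listed above.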

## 🔍 Verify Auto-Resume Script Was Created

**Before running training, verify the script has resume capability:**

In [None]:
# Check if train.py has auto-resume code
import subprocess

print("🔍 Checking if train.py has resume capability...")
result = subprocess.run(['grep', '-n', 'resume_checkpoint', '/data/Cogumi-LLM/train.py'], 
                       capture_output=True, text=True)

if result.returncode == 0:
    print("✅ Auto-resume code FOUND in train.py:")
    print(result.stdout)
    print("\n✅ Script is ready to auto-resume from checkpoint-9000!")
else:
    print("❌ Auto-resume code NOT found!")
    print("⚠️ You need to run the ACTION B cell above first!")
    
print("\n" + "="*60)
print("Now check what checkpoint it will find:")
print("="*60)

# Check what checkpoint will be detected
checkpoint_check = subprocess.run(['ls', '-lht', '/data/Cogumi-LLM/checkpoints/'], 
                                 capture_output=True, text=True)
print(checkpoint_check.stdout)

## 🚨 CHECKPOINT CORRUPTED - Use Previous Checkpoint

**Checkpoint-9000 is corrupted (incomplete save). We'll use checkpoint-8000 instead.**

You still saved **4.7 hours** of work! Better than starting from scratch.
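**Optional: check a checkpoint before trusting it.** An interrupted save often leaves files missing, empty, or truncated. The sketch below looks for those symptoms. The file list is an assumption based on what the Hugging Face Trainer typically writes, so compare it against a known-good checkpoint and adjust it as needed. The function name `checkpoint_problems` is hypothetical. An empty result means only that none of these particular problems was found.

```python
import json
import os

# Assumption: files typically written into each checkpoint by the Hugging Face
# Trainer. Adjust this tuple to match a checkpoint you know is good.
EXPECTED_FILES = ("trainer_state.json", "optimizer.pt", "scheduler.pt")


def checkpoint_problems(checkpoint_path, expected_files=EXPECTED_FILES):
    """Return a list of detected problems; an empty list means none were found."""
    if not os.path.isdir(checkpoint_path):
        return [f"{checkpoint_path} is not a directory"]

    problems = []
    for name in expected_files:
        path = os.path.join(checkpoint_path, name)
        if not os.path.isfile(path):
            problems.append(f"missing {name}")
        elif os.path.getsize(path) == 0:
            problems.append(f"{name} is empty")

    # A truncated write usually leaves trainer_state.json unparseable.
    state_path = os.path.join(checkpoint_path, "trainer_state.json")
    if os.path.isfile(state_path) and os.path.getsize(state_path) > 0:
        try:
            with open(state_path, encoding="utf-8") as f:
                state = json.load(f)
            if "global_step" not in state:
                problems.append("trainer_state.json has no global_step")
        except json.JSONDecodeError:
            problems.append("trainer_state.json is not valid JSON (possibly a truncated write)")
    return problems
```

Running `checkpoint_problems("/data/Cogumi-LLM/checkpoints/checkpoint-9000")` before deleting anything gives you a record of what was wrong with the checkpoint.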

In [None]:
# Delete corrupted checkpoint-9000 and use checkpoint-8000 instead
import subprocess
import shutil
import os

print("🗑️ Removing corrupted checkpoint-9000...")
corrupted_path = "/data/Cogumi-LLM/checkpoints/checkpoint-9000"
if os.path.exists(corrupted_path):
    shutil.rmtree(corrupted_path)
    print(f"✅ Deleted {corrupted_path}")
else:
    print(f"⚠️ Checkpoint-9000 not found (may already be deleted)")

print("\n📋 Current checkpoints:")
result = subprocess.run(['ls', '-lht', '/data/Cogumi-LLM/checkpoints/'], 
                       capture_output=True, text=True)
print(result.stdout)

print("\n✅ Now training will resume from checkpoint-8000")
print("   Progress: 8,000 steps completed (~4.7 hours saved)")
print("   Remaining: ~33.3 hours")
print("\n🔄 Run the training cell again to resume from checkpoint-8000")
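**Optional: quarantine instead of delete.** `shutil.rmtree` cannot be undone. If disk space allows, renaming the folder so that it no longer matches the `checkpoint-*` pattern has the same effect on checkpoint discovery and keeps the files around for inspection. This is a sketch: the function name `quarantine_checkpoint` and the `corrupted-` prefix are hypothetical choices.

```python
import os
import shutil


def quarantine_checkpoint(checkpoint_path: str, prefix: str = "corrupted-") -> str:
    """Rename a checkpoint folder so a 'checkpoint-*' glob no longer matches it."""
    checkpoint_path = os.path.normpath(checkpoint_path)
    parent, name = os.path.split(checkpoint_path)
    target = os.path.join(parent, prefix + name)

    if not os.path.isdir(checkpoint_path):
        raise FileNotFoundError(f"{checkpoint_path} does not exist")
    if os.path.exists(target):
        raise FileExistsError(f"{target} already exists")

    # Moving within the same directory is a rename on the same filesystem.
    shutil.move(checkpoint_path, target)
    return target
```

For example, `quarantine_checkpoint("/data/Cogumi-LLM/checkpoints/checkpoint-9000")` would leave the data in `corrupted-checkpoint-9000`. Delete that folder once the resumed run has written a newer checkpoint.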

## 🚨 EMERGENCY: Checkpoint Accidentally Deleted

**Run this if you accidentally deleted checkpoints with `rm -rf`**

This cell will:
1. Assess what was lost
2. Verify dataset integrity
3. Prepare for fresh training restart
4. Give you clear next steps

In [None]:
# ============================================================================
# EMERGENCY RECOVERY - After Accidental Checkpoint Deletion
# ============================================================================

import os
import subprocess

print("🚨 CHECKPOINT RECOVERY ASSESSMENT")
print("=" * 70)

# 1. Check what was deleted
checkpoint_dir = "/data/Cogumi-LLM/checkpoints"
if not os.path.exists(checkpoint_dir):
    print("\n❌ Checkpoint directory doesn't exist")
    print("   All checkpoints were deleted")
else:
    remaining = [f for f in os.listdir(checkpoint_dir) if f.startswith("checkpoint-")]
    if len(remaining) == 0:
        print("\n❌ All checkpoints were deleted")
    else:
        print(f"\n✅ Found {len(remaining)} remaining checkpoint(s):")
        for cp in remaining:
            print(f"   - {cp}")

# 2. Check dataset integrity
dataset_path = "/data/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl"
if os.path.exists(dataset_path):
    size_mb = os.path.getsize(dataset_path) / (1024 * 1024)
    print(f"\n✅ Dataset intact: {size_mb:.1f} MB")
else:
    print("\n❌ Dataset not found - needs re-upload")

# 3. Check train.py
if os.path.exists("/data/Cogumi-LLM/train.py"):
    print("\n✅ train.py exists")
else:
    print("\n⚠️  train.py missing - re-run Step 7 to create it")

# 4. Check disk space
result = subprocess.run(['df', '-h', '/data'], capture_output=True, text=True)
lines = result.stdout.strip().split('\n')
if len(lines) > 1:
    parts = lines[1].split()
    print(f"\n💾 Disk Space:")
    print(f"   Total: {parts[1]}")
    print(f"   Used: {parts[2]} ({parts[4]})")
    print(f"   Available: {parts[3]}")

# 5. Recovery recommendation
print("\n" + "=" * 70)
print("📋 RECOVERY PLAN:")
print("=" * 70)

if not os.path.exists(checkpoint_dir) or len([f for f in os.listdir(checkpoint_dir) if f.startswith("checkpoint-")]) == 0:
    print("\n🔄 OPTION 1: Start Fresh Training (RECOMMENDED)")
    print("   - All progress lost, but clean slate")
    print("   - Re-run Step 8 (Start Training)")
    print("   - Duration: 8-9 hours")
    print("   - Cost: ~$10")
    
    print("\n📂 OPTION 2: Check Vast.ai Backups")
    print("   - Some Vast.ai instances auto-backup /data/")
    print("   - Check: ls -la /data/.backups/ or similar")
    print("   - Contact Vast.ai support if backups exist")
else:
    print("\n✅ Some checkpoints remain - you can resume!")
    print("   - Re-run Step 8 (Start Training)")
    print("   - Training will resume from latest checkpoint")

print("\n" + "=" * 70)
print("✅ Assessment complete")
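**Optional: disk space without parsing `df`.** The disk check above splits the text output of `df -h`, which depends on the column layout. The sketch below gets the same information from the standard-library call `shutil.disk_usage`. The function name `disk_report` is hypothetical. The percentage is used/total, so it can differ slightly from the `Use%` column that `df` prints.

```python
import shutil


def disk_report(path: str = "/data") -> dict:
    """Return total, used, and free space for the filesystem containing `path`, in GiB."""
    usage = shutil.disk_usage(path)  # raises FileNotFoundError if path is missing
    gib = 1024 ** 3
    return {
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
        "percent_used": 100.0 * usage.used / usage.total if usage.total else 0.0,
    }
```

A typical use is `report = disk_report("/data")` followed by `print(f"Available: {report['free_gib']:.1f} GiB")`.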