# ⚠️ GOLDEN DEPENDENCY SET - DO NOT CHANGE

**Tested and Working Configuration for H100 Training:**

| Package | Version | Notes |
|---------|---------|-------|
| **Python** | 3.10.12 | Base system Python |
| **PyTorch** | 2.8.0+cu128 | CUDA 12.8 build |
| **CUDA** | 12.8 | GPU compute platform |
| **bitsandbytes** | 0.48.1 | 8-bit optimizer support |
| **xformers** | 0.0.32.post2 | Memory efficient attention |
| **transformers** | 4.57.1 | HuggingFace models |
| **Unsloth** | 2025.10.8 | Fast fine-tuning library |

**Hardware:** NVIDIA H100 80GB HBM3

**Installation Method:** Use `golden_dynamic_setup_full.sh` script which creates a virtual environment at `/workspace/golden-venv/` with these exact versions.

---

# H100 Training with Unsloth - Production Ready

**Complete 8-step guide using the GOLDEN DEPENDENCY SET (tested & working)**

**Time**: 8-9 hours | **Cost**: ~$10 on H100

⚠️ **IMPORTANT:** Use the exact dependency versions documented above. Other combinations may fail!

## Step 1: Dry Run - Verify What Will Be Installed

**First, run a dry-run to confirm the golden dependency versions will be installed.**

This will show:
- PyTorch 2.8.0+cu128
- transformers 4.57.1
- xformers 0.0.32.post2
- bitsandbytes 0.48.1
- Unsloth 2025.10.8

✅ This is the **GOLDEN DEPENDENCY SET** that has been tested and confirmed working.

In [None]:
# Cell 1: Dry-run to see what will be installed
# Upload golden_dynamic_setup_full.sh to the same folder as the notebook
!echo "🔹 Running dry-run to show planned installation..."
!bash golden_dynamic_setup_full.sh --dry-run

## Step 2: Install Golden Dependency Set

**This installs the exact tested versions in a virtual environment.**

Creates: `/workspace/golden-venv/` with Python 3.10.12 and all dependencies

**Takes 10-15 minutes**

In [None]:
# Cell 2: Run full installation
!bash golden_dynamic_setup_full.sh

## Step 3: Switch Kernel & Restart

**After installation completes:**

1. Click **"Kernel"** menu at the top
2. Select **"Restart Kernel"**
3. Wait for kernel to restart

**Then change kernel to golden-venv:**

1. Click **"Kernel"** menu → **"Change Kernel"**
2. Select the kernel from `/workspace/golden-venv/bin/python`
3. Wait for connection (green checkmark)
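
To confirm the kernel switch actually took effect, a small check on the interpreter path can help. This is a minimal sketch: the helper name `runs_inside` is made up for illustration, and it assumes the venv lives at `/workspace/golden-venv` as stated above.

```python
import sys
from pathlib import Path

# Venv location taken from the installation step above.
GOLDEN_VENV = Path("/workspace/golden-venv")

def runs_inside(executable, venv_root=GOLDEN_VENV):
    """Return True if the interpreter path lies under venv_root.

    The path is deliberately not resolved: a venv's python is usually a
    symlink to the system interpreter, so resolving would hide the venv.
    """
    return Path(venv_root) in Path(executable).parents

print(f"Interpreter: {sys.executable}")
print(f"Inside golden-venv: {runs_inside(sys.executable)}")
```

If the second line prints `False`, the notebook is most likely still attached to the default kernel.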

## Step 4: Verify Golden Dependency Set

**Run the cell below to verify all packages match the golden set:**

Expected output:
```
Python: 3.10.12
Torch: 2.8.0+cu128, CUDA: 12.8, GPUs: 1
GPU 0: NVIDIA H100 80GB HBM3
bitsandbytes: 0.48.1
xformers: 0.0.32.post2
transformers: 4.57.1
🦥 Unsloth version: 2025.10.8 (import first in your scripts!)
```

If versions don't match, something went wrong with installation!
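
The comparison against the golden set can also be done programmatically rather than by eye. The sketch below is one possible approach, not part of the setup script: it reads installed versions through `importlib.metadata` and assumes the PyPI distribution names `torch`, `bitsandbytes`, `xformers`, `transformers`, and `unsloth`. The helper names are hypothetical.

```python
from importlib import metadata

# Versions copied from the golden dependency table; the distribution
# names are an assumption about how the packages are published.
GOLDEN = {
    "torch": "2.8.0+cu128",
    "bitsandbytes": "0.48.1",
    "xformers": "0.0.32.post2",
    "transformers": "4.57.1",
    "unsloth": "2025.10.8",
}

def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

def find_mismatches(expected, lookup=installed_version):
    """Return {name: (expected, found)} for every package that differs."""
    mismatches = {}
    for name, want in expected.items():
        found = lookup(name)
        if found != want:
            mismatches[name] = (want, found)
    return mismatches

for name, (want, found) in find_mismatches(GOLDEN).items():
    print(f"{name}: expected {want}, found {found}")
```

An empty result means every listed package matches; any printed line points at the package to reinstall.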

In [None]:
#!/usr/bin/env python3
import sys
import torch


print(f"Python: {sys.version.split()[0]}")  # version number only, e.g. 3.10.12
print(f"Torch: {torch.__version__}, CUDA: {torch.version.cuda}, GPUs: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

for pkg in ["bitsandbytes", "xformers", "transformers"]:
    try:
        mod = __import__(pkg)
        print(f"{pkg}: {mod.__version__}")
    except ImportError:
        print(f"⚠️ {pkg} not installed")

try:
    import unsloth
    version = getattr(unsloth, "__version__", "unknown")
    print(f"🦥 Unsloth version: {version} (import first in your scripts!)")
except ImportError:
    print("⚠️ Unsloth not installed! Install via 'pip install unsloth'")


## Step 5: HuggingFace Authentication

1. Get token from: https://huggingface.co/settings/tokens
2. Accept the Llama license: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
3. Run the cell below and paste your token

In [None]:
from huggingface_hub import login
login()
print("\n✅ Authentication successful! Token saved.")

## Step 6: Upload Dataset

**Before running the next cell, upload your dataset:**

1. In JupyterLab, use the file browser on the left
2. Navigate to `/data/Cogumi-LLM/data/phase1/`
3. Click the **Upload Files** button (↑ icon)
4. Select `public_500k_filtered.jsonl` from your local machine
5. Wait for upload to complete (~5-10 minutes for 870MB file)

The training script expects: `/data/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl`
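
Before starting a multi-hour run, it may be worth checking that the uploaded file parses and carries the `instruction` and `response` fields the training script reads. The following is a rough sketch with a hypothetical helper, `check_jsonl`. It samples only the first lines of the file, so it cannot catch a truncated upload.

```python
import json

# Field names taken from the formatting function in train.py.
REQUIRED_FIELDS = ("instruction", "response")

def check_jsonl(path, required=REQUIRED_FIELDS, max_lines=1000):
    """Scan up to max_lines lines; return (records_checked, problems)."""
    problems = []
    checked = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if line_no > max_lines:
                break
            if not line.strip():
                continue  # tolerate blank lines
            checked += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: not a JSON object")
                continue
            missing = [key for key in required if key not in record]
            if missing:
                problems.append(f"line {line_no}: missing {missing}")
    return checked, problems
```

Calling `check_jsonl("/data/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl")` should return an empty problem list if the sampled records are well formed.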

## Step 7: Create Training Script

**Creates train.py with:**
- Unsloth 2025.10.8 compatible code
- Batched formatting function for instruction/response dataset
- QLoRA 4-bit training configuration
- Optimized for H100 with golden dependency set
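
To make the batched formatting concrete, here is the same function applied to a two-row toy batch. The sample instructions and responses are invented for illustration. A batched function receives one dict of column lists and returns one string per row.

```python
def formatting_func(examples):
    """Turn a batch (dict of column lists) into a list of prompt strings."""
    texts = []
    for instruction, response in zip(examples["instruction"], examples["response"]):
        texts.append(f"### Instruction:\n{instruction}\n\n### Response:\n{response}")
    return texts

# Toy batch in the column-oriented layout that a batched function receives.
batch = {
    "instruction": ["Add 2 and 3.", "Name a primary color."],
    "response": ["5", "Red"],
}

formatted = formatting_func(batch)
print(formatted[0])
```

Each returned string should contain real line breaks between the header lines and the content, which is what the tokenizer will see during training.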

In [None]:
import os

# ----------------------------
# Notebook cell to create train.py - PACKING DISABLED
# ----------------------------
script = """# ----------------------------
# train.py - H100 Optimized (Packing DISABLED for stability)
# Compatible with Unsloth 2025.10.8, TRL, PEFT, 4-bit training
# ----------------------------

import unsloth  # Must be first for Unsloth patching
import torch
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset
from unsloth import FastLanguageModel
import gc

# Clear any existing GPU memory
gc.collect()
torch.cuda.empty_cache()

# Load model + tokenizer with H100 optimizations
print("🔄 Loading model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,  # Handle sequences up to 1024 tokens
    load_in_4bit=True,
    dtype=None,  # Auto-detect bf16 for H100
    attn_implementation="flash_attention_2",  # CRITICAL: Enable FA2
)
print("✅ Model loaded successfully")

# Apply PEFT / LoRA
print("🔄 Applying LoRA...")
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
)
print("✅ LoRA applied successfully")

# Prepare model for training (enables Flash Attention 2)
# Force disable gradient offloading for maximum speed
import os
os.environ["UNSLOTH_OFFLOAD_GRADIENTS"] = "0"
print("🔄 Preparing model for training...")
model = FastLanguageModel.for_training(model)
print("✅ Model ready for training")

# Load dataset with caching
print("📥 Loading dataset...")
dataset = load_dataset(
    "json",
    data_files="/data/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl",
    split="train",
    cache_dir="/tmp/hf_cache",
    encoding="utf-8"
)
print(f"✅ Dataset loaded: {len(dataset)} examples")

# Define formatting function (required by Unsloth)
def formatting_func(examples):
    instructions = examples['instruction']
    responses = examples['response']
    
    texts = []
    for instruction, response in zip(instructions, responses):
        text = f"### Instruction:\\n{instruction}\\n\\n### Response:\\n{response}"
        texts.append(text)
    
    return texts

# Training arguments - OPTIMIZED FOR H100
args = TrainingArguments(
    output_dir="/data/Cogumi-LLM/checkpoints",
    num_train_epochs=3,
    
    # Batch size for H100 80GB with 4-bit and seq_len 1024
    per_device_train_batch_size=16,  # Reduced for no-packing mode
    gradient_accumulation_steps=8,   # Higher accumulation for effective batch of 128
    
    # Optimization settings
    learning_rate=2e-5,
    warmup_steps=10,
    logging_steps=1,
    save_steps=1000,
    save_total_limit=3,
    
    # H100 optimizations
    optim="adamw_8bit",
    bf16=True,
    tf32=True,
    
    # Dataloader settings (conservative)
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    dataloader_prefetch_factor=2,
    dataloader_persistent_workers=True,
    group_by_length=False,
    
    # Memory optimizations
    gradient_checkpointing=False,
    max_grad_norm=1.0,
    
    # Disable unnecessary features
    logging_first_step=False,
    logging_nan_inf_filter=False,
    save_safetensors=True,
    
    # Report to nothing (disable wandb etc)
    report_to="none",
)

# Create trainer WITHOUT packing (packing was causing batch size mismatch)
print("🔄 Creating trainer...")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=args,
    formatting_func=formatting_func,
    max_seq_length=1024,  # MUST match model max_seq_length
    packing=False,  # DISABLED: packing caused batch mismatch error
    dataset_num_proc=2,
)
print("✅ Trainer created successfully")

# Train the model
print("=" * 70)
print("🚀 Starting H100 training (packing DISABLED for stability)...")
print("=" * 70)
print(f"   Max sequence length: 1024")
print(f"   Batch size per device: {args.per_device_train_batch_size}")
print(f"   Gradient accumulation: {args.gradient_accumulation_steps}")
print(f"   Effective batch size: {args.per_device_train_batch_size * args.gradient_accumulation_steps}")
print(f"   Total training steps: ~{len(dataset) // (args.per_device_train_batch_size * args.gradient_accumulation_steps) * args.num_train_epochs}")
print(f"   Dataloader workers: {args.dataloader_num_workers}")
print(f"   Prefetch factor: {args.dataloader_prefetch_factor}")
print(f"   Dataset processing workers: 2")
print(f"   Packing: DISABLED (was causing batch mismatch)")
print(f"   Flash Attention 2: ENABLED")
print(f"   Gradient offloading: DISABLED")
print(f"   Expected speed: 2-4 it/s on H100")
print("=" * 70)

try:
    trainer.train()
    print("\\n✅ Training completed successfully!")
except Exception as e:
    print(f"\\n❌ Training failed with error: {e}")
    import traceback
    traceback.print_exc()
    raise

# Save model
print("💾 Saving model...")
model.save_pretrained("/data/Cogumi-LLM/checkpoints/final")
tokenizer.save_pretrained("/data/Cogumi-LLM/checkpoints/final")
print("✅ Model saved to /data/Cogumi-LLM/checkpoints/final")
"""

# Write train.py to disk
train_path = "/data/Cogumi-LLM/train.py"
os.makedirs(os.path.dirname(train_path), exist_ok=True)
with open(train_path, "w", encoding="utf-8") as f:
    f.write(script)

print(f"✅ STABLE training script created at {train_path}")
print(f"   ⚡ Flash Attention 2: Enabled")
print(f"   🚫 Gradient offloading: DISABLED")
print(f"   🔧 Sequence length: 1024")
print(f"   📦 Batch size: 16 (reduced for no-packing)")
print(f"   🔄 Gradient accumulation: 8 (effective batch = 128)")
print(f"   👷 Dataloader workers: 4")
print(f"   ❌ Packing: DISABLED (was causing batch dimension mismatch)")
print(f"   This should fix the cross_entropy batch size error!")


## Step 8: Start Training 🚀

**Training Details:**
- Duration: 8-9 hours on H100
- Cost: ~$10 on Vast.ai
- Model: Llama-3.1-8B-Instruct (4-bit QLoRA)
- Dataset: 640K instruction/response pairs

**Monitor with:** `nvidia-smi` in a terminal or `watch -n 1 nvidia-smi`
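
The step count for a run follows directly from the batch settings. The sketch below mirrors the estimate that `train.py` prints at startup: floor division per epoch, on a single GPU. It treats the "640K" figure above as roughly 640,000 examples, which is an approximation rather than the exact dataset size.

```python
def training_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective_batch_size, estimated_total_optimizer_steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = num_examples // effective_batch
    return effective_batch, steps_per_epoch * epochs

# Step 7 configuration: batch 16, accumulation 8, 3 epochs.
effective, total = training_steps(640_000, 16, 8, 3)
print(f"Effective batch size: {effective}")   # 128
print(f"Estimated total steps: {total}")      # 15000
```

Shrinking the per-device batch or the accumulation steps raises the step count proportionally without changing the amount of data processed.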

## Step 7.5: Quick Diagnostic (Optional)

**If training seems slow despite high GPU utilization, run this to check:**
- Actual batch size being used
- Whether packing is working
- Samples per second vs steps per second
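
Steps per second on their own can mislead, because one optimizer step covers `per_device_train_batch_size * gradient_accumulation_steps` samples. The sketch below converts between the two, assuming the progress bar counts optimizer steps. The two rates used are illustrative placeholders, not measurements from this run.

```python
def samples_per_second(steps_per_second, per_device_batch, grad_accum, num_gpus=1):
    """Convert optimizer steps/s into training samples/s."""
    return steps_per_second * per_device_batch * grad_accum * num_gpus

# Hypothetical comparison of a large-batch and a small-batch configuration.
large = samples_per_second(0.5, per_device_batch=16, grad_accum=8)
small = samples_per_second(1.7, per_device_batch=4, grad_accum=2)
print(f"batch 16 x accum 8 at 0.5 it/s: {large:.1f} samples/s")
print(f"batch 4 x accum 2 at 1.7 it/s: {small:.1f} samples/s")
```

Under these assumed rates the configuration with the lower it/s processes far more samples per second, so it/s should only be compared between runs that use the same effective batch size.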

In [None]:
# Quick diagnostic - show the end of train.py and check for a running training process
!tail -50 /data/Cogumi-LLM/train.py 2>/dev/null || echo "Training script not found"
!echo ""
!echo "Running training processes:"
!ps aux | grep train.py | grep -v grep || echo "No training process running"

In [None]:
# Verify flash-attn 2.8.2 is installed and working
import sys
print(f"Python executable: {sys.executable}")

try:
    import flash_attn
    print(f"✅ flash-attn version: {flash_attn.__version__}")

    from flash_attn import flash_attn_func
    print(f"✅ flash_attn_func importable: True")

    import torch
    print(f"✅ CUDA available: {torch.cuda.is_available()}")
    print(f"✅ CUDA version: {torch.version.cuda}")

    print("\n🎉 Flash Attention 2 is ready!")
except ImportError as e:
    print(f"❌ Flash Attention 2 import failed: {e}")
    print("⚠️ You may need to reinstall flash-attn")

In [None]:
# CRITICAL DIAGNOSTIC: Check if FA2 is actually being used
# Even though FA2 = True, check if it's really active in the model

import unsloth
from unsloth import FastLanguageModel

# Quick test load to see actual configuration
model, tokenizer = FastLanguageModel.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
    dtype=None,
    attn_implementation="flash_attention_2",
)

# Check what attention implementation is actually being used
print("\n🔍 CHECKING ACTUAL ATTENTION IMPLEMENTATION:")
print(f"Model config attn_implementation: {model.config._attn_implementation}")
print(f"Model type: {type(model)}")

# Check individual layers
if hasattr(model, 'model') and hasattr(model.model, 'layers'):
    first_layer = model.model.layers[0]
    if hasattr(first_layer, 'self_attn'):
        attn_class = type(first_layer.self_attn).__name__
        print(f"Attention layer class: {attn_class}")
        print(f"Expected for FA2: Should contain 'FlashAttention' or 'Llama.*FlashAttention'")

del model, tokenizer  # Free memory

In [None]:
import subprocess
import sys

# Path to your venv Python
venv_python = '/workspace/golden-venv/bin/python'

# Run the training script with live output
print("🚀 Starting training with live output...")
print("=" * 60)

# Use Popen for live streaming output
process = subprocess.Popen(
    [venv_python, '/data/Cogumi-LLM/train.py'],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    universal_newlines=True,
    bufsize=1
)

# Stream output line by line
for line in process.stdout:
    print(line, end='')
    sys.stdout.flush()

process.wait()
print("\n" + "=" * 60)
print(f"Training {'completed successfully' if process.returncode == 0 else 'failed with error code ' + str(process.returncode)}")
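
The cell above streams output through a pipe owned by the notebook kernel, so the output is lost if the browser disconnects or the kernel restarts. One possible alternative, sketched here under assumptions, is to start the script in its own session and send its output to a log file. The helper name `launch_detached` and the log path in the usage note are hypothetical, and `start_new_session` is a POSIX-only option.

```python
import subprocess
import sys

def launch_detached(cmd, log_path):
    """Start cmd in its own session, appending stdout and stderr to log_path."""
    log_file = open(log_path, "a", encoding="utf-8")
    process = subprocess.Popen(
        cmd,
        stdout=log_file,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # POSIX only: separate session from the kernel
    )
    log_file.close()  # the child process keeps its own copy of the descriptor
    return process
```

A call such as `launch_detached(["/workspace/golden-venv/bin/python", "-u", "/data/Cogumi-LLM/train.py"], "/data/Cogumi-LLM/train.log")` would then leave a log that `tail -f` can follow from a terminal. The `-u` flag asks Python for unbuffered output so lines reach the file promptly.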

## 🚨 EMERGENCY: Check Training Status

**If GPU shows 0% after browser refresh, run these diagnostic cells:**

In [None]:
# Step 1: Check if training process is still running
import subprocess
import os

print("🔍 DIAGNOSTIC 1: Checking for training process...")
result = subprocess.run(['ps', 'aux'], capture_output=True, text=True)
train_processes = [line for line in result.stdout.split('\n') if 'train.py' in line and 'grep' not in line]

if train_processes:
    print("✅ Training process FOUND and running:")
    for proc in train_processes:
        print(f"   {proc}")
else:
    print("❌ NO training process found - training has stopped!")

print("\n🔍 DIAGNOSTIC 2: Checking GPU status...")
gpu_result = subprocess.run(['nvidia-smi', '--query-gpu=utilization.gpu,memory.used,memory.total', '--format=csv,noheader'], 
                           capture_output=True, text=True)
print(f"GPU Utilization: {gpu_result.stdout.strip()}")

print("\n🔍 DIAGNOSTIC 3: Checking for recent checkpoints...")
checkpoint_dir = "/data/Cogumi-LLM/checkpoints"
if os.path.exists(checkpoint_dir):
    checkpoints = subprocess.run(['ls', '-lht', checkpoint_dir], capture_output=True, text=True)
    print(checkpoints.stdout)
else:
    print(f"❌ No checkpoint directory at {checkpoint_dir}")

print("\n🔍 DIAGNOSTIC 4: Check last training logs...")
try:
    # Look for any log files or check recent output
    log_check = subprocess.run(['find', '/data/Cogumi-LLM', '-name', '*.log', '-mmin', '-60'], 
                              capture_output=True, text=True)
    if log_check.stdout.strip():
        print(f"Recent logs:\n{log_check.stdout}")
    else:
        print("No recent log files found")
except Exception as e:
    print(f"Could not check logs: {e}")

## 🔄 Recovery Actions (Based on Diagnostic Results)

**Choose the appropriate action based on what you found above:**

In [None]:
# ACTION A: If process died - Resume from last checkpoint
# This will automatically detect the latest checkpoint and resume training

import os
import subprocess
import glob

print("🔍 Searching for latest checkpoint...")
checkpoint_dir = "/data/Cogumi-LLM/checkpoints"

if os.path.exists(checkpoint_dir):
    # Find all checkpoint folders
    checkpoints = glob.glob(f"{checkpoint_dir}/checkpoint-*")
    if checkpoints:
        # Sort by step number
        checkpoints.sort(key=lambda x: int(x.split('-')[-1]))
        latest_checkpoint = checkpoints[-1]
        step_num = latest_checkpoint.split('-')[-1]
        
        print(f"✅ Found latest checkpoint: {latest_checkpoint}")
        print(f"   Training was at step {step_num}")
        print(f"\n⚠️ To resume training from this checkpoint:")
        print(f"   1. Modify train.py to call: trainer.train(resume_from_checkpoint='{latest_checkpoint}')")
        print(f"   2. Re-run the training cell")
        print(f"\nAlternatively, run the ACTION B cell below to create a script that resumes automatically.")

        # Check how many steps were completed
        print(f"\n📊 Progress estimate:")
        print(f"   Completed steps: {step_num}")
        print(f"   At ~1.7 it/s, you've trained for ~{int(step_num) / 1.7 / 3600:.1f} hours")
    else:
        print("❌ No checkpoints found - training crashed before first checkpoint (step 1000)")
        print("   You'll need to restart training from scratch")
else:
    print("❌ Checkpoint directory doesn't exist - training never started properly")

In [None]:
# ACTION B: Create a training script with auto-resume capability
# This version will automatically resume from the last checkpoint if it exists

import os

script_with_resume = """# ----------------------------
# train.py - H100 Optimized with AUTO-RESUME
# Automatically resumes from last checkpoint if training was interrupted
# ----------------------------

import unsloth
import torch
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset
from unsloth import FastLanguageModel
import gc
import os
import glob

# Clear GPU memory
gc.collect()
torch.cuda.empty_cache()

# Check for existing checkpoints
checkpoint_dir = "/data/Cogumi-LLM/checkpoints"
resume_checkpoint = None

if os.path.exists(checkpoint_dir):
    checkpoints = glob.glob(f"{checkpoint_dir}/checkpoint-*")
    if checkpoints:
        checkpoints.sort(key=lambda x: int(x.split('-')[-1]))
        resume_checkpoint = checkpoints[-1]
        print(f"🔄 Found checkpoint to resume from: {resume_checkpoint}")
        step_num = resume_checkpoint.split('-')[-1]
        print(f"   Resuming from step {step_num}")
    else:
        print("🆕 No checkpoints found - starting fresh training")
else:
    print("🆕 Starting fresh training")

# Load model + tokenizer
print("🔄 Loading model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
    dtype=None,
    attn_implementation="flash_attention_2",
)
print("✅ Model loaded")

# Apply LoRA
print("🔄 Applying LoRA...")
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
)
print("✅ LoRA applied")

# Prepare for training
os.environ["UNSLOTH_OFFLOAD_GRADIENTS"] = "0"
model = FastLanguageModel.for_training(model)
print("✅ Model ready")

# Load dataset
print("📥 Loading dataset...")
dataset = load_dataset(
    "json",
    data_files="/data/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl",
    split="train",
    cache_dir="/tmp/hf_cache",
    encoding="utf-8"
)
print(f"✅ Dataset loaded: {len(dataset)} examples")

# Formatting function
def formatting_func(examples):
    instructions = examples['instruction']
    responses = examples['response']
    texts = []
    for instruction, response in zip(instructions, responses):
        text = f"### Instruction:\\n{instruction}\\n\\n### Response:\\n{response}"
        texts.append(text)
    return texts

# Training arguments
args = TrainingArguments(
    output_dir="/data/Cogumi-LLM/checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,      # Your actual config
    gradient_accumulation_steps=2,       # Your actual config
    learning_rate=2e-5,
    warmup_steps=10,
    logging_steps=1,
    save_steps=1000,
    save_total_limit=3,
    optim="adamw_8bit",
    bf16=True,
    tf32=True,
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    dataloader_prefetch_factor=2,
    dataloader_persistent_workers=True,
    group_by_length=False,
    gradient_checkpointing=False,
    max_grad_norm=1.0,
    logging_first_step=False,
    logging_nan_inf_filter=False,
    save_safetensors=True,
    report_to="none",
)

# Create trainer
print("🔄 Creating trainer...")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=args,
    formatting_func=formatting_func,
    max_seq_length=1024,
    packing=False,
    dataset_num_proc=2,
)
print("✅ Trainer created")

# Train with auto-resume
print("=" * 70)
if resume_checkpoint:
    print(f"🔄 RESUMING training from {resume_checkpoint}")
else:
    print("🚀 STARTING fresh training")
print("=" * 70)

try:
    trainer.train(resume_from_checkpoint=resume_checkpoint)
    print("\\n✅ Training completed!")
except Exception as e:
    print(f"\\n❌ Training failed: {e}")
    import traceback
    traceback.print_exc()
    raise

# Save final model
print("💾 Saving model...")
model.save_pretrained("/data/Cogumi-LLM/checkpoints/final")
tokenizer.save_pretrained("/data/Cogumi-LLM/checkpoints/final")
print("✅ Model saved")
"""

# Write the script
train_path = "/data/Cogumi-LLM/train.py"
os.makedirs(os.path.dirname(train_path), exist_ok=True)
with open(train_path, "w", encoding="utf-8") as f:
    f.write(script_with_resume)

print(f"✅ Auto-resume training script created at {train_path}")
print(f"   This script will automatically resume from the last checkpoint if training crashes")
print(f"\n🔄 Now run the training cell again to restart with auto-resume")

## ✅ PERFECT! Resume from Checkpoint 9000

**You have checkpoint-9000! That's ~5.3 hours of progress saved!**

**Next steps:**
1. ✅ Run the ACTION B cell above to update train.py with auto-resume
2. ✅ Then scroll back up and run the training cell again
3. ✅ Training will automatically resume from step 9000

**Progress so far:**
- Completed: 9,000 / 240,240 steps (~3.7%)
- Time invested: ~5.3 hours
- Remaining: ~32.7 hours
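
The percentage above is plain arithmetic on the step counter, and the same numbers give a remaining-time estimate once a throughput is known. This is a small sketch with hypothetical helper names. The rate passed to `hours_remaining` should be the it/s actually observed in the training progress bar.

```python
def percent_complete(completed_steps, total_steps):
    """Share of optimizer steps already finished, in percent."""
    return 100.0 * completed_steps / total_steps

def hours_remaining(completed_steps, total_steps, steps_per_second):
    """Estimated hours left at a constant, measured step rate."""
    return (total_steps - completed_steps) / steps_per_second / 3600

# Step counts taken from the progress summary above.
print(f"{percent_complete(9_000, 240_240):.1f}% complete")   # 3.7% complete
```

Because the estimate scales inversely with the rate, it is worth re-measuring throughput after resuming rather than reusing an earlier figure.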

## 🔍 Verify Auto-Resume Script Was Created

**Before running training, verify the script has resume capability:**

In [None]:
# Check if train.py has auto-resume code
import subprocess

print("🔍 Checking if train.py has resume capability...")
result = subprocess.run(['grep', '-n', 'resume_checkpoint', '/data/Cogumi-LLM/train.py'], 
                       capture_output=True, text=True)

if result.returncode == 0:
    print("✅ Auto-resume code FOUND in train.py:")
    print(result.stdout)
    print("\n✅ Script is ready to auto-resume from checkpoint-9000!")
else:
    print("❌ Auto-resume code NOT found!")
    print("⚠️ You need to run the ACTION B cell above first!")
    
print("\n" + "="*60)
print("Now check what checkpoint it will find:")
print("="*60)

# Check what checkpoint will be detected
checkpoint_check = subprocess.run(['ls', '-lht', '/data/Cogumi-LLM/checkpoints/'], 
                                 capture_output=True, text=True)
print(checkpoint_check.stdout)

## 🚨 CHECKPOINT CORRUPTED - Use Previous Checkpoint

**Checkpoint-9000 is corrupted (incomplete save). We'll use checkpoint-8000 instead.**

You still saved **4.7 hours** of work! Better than starting from scratch.
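
An interrupted save can leave a checkpoint folder that exists but is missing files, and the auto-resume logic would then pick it simply because it has the highest step number. A possible safeguard, sketched below, is to walk back from the newest checkpoint until one contains a set of expected files. The default list with only `trainer_state.json` is an assumption. Extend it to match what your intact checkpoints contain, and note that a file's presence does not prove it was written completely.

```python
import glob
import os

def latest_complete_checkpoint(checkpoint_dir, required_files=("trainer_state.json",)):
    """Return the newest checkpoint-* folder containing all required files, or None."""
    candidates = []
    for path in glob.glob(os.path.join(checkpoint_dir, "checkpoint-*")):
        suffix = path.rsplit("-", 1)[-1]
        if suffix.isdigit() and os.path.isdir(path):
            candidates.append((int(suffix), path))
    # Walk from the highest step number downwards.
    for _, path in sorted(candidates, reverse=True):
        if all(os.path.isfile(os.path.join(path, name)) for name in required_files):
            return path
    return None
```

Used in place of "take the last entry after sorting", this would skip an incomplete `checkpoint-9000` and fall back to `checkpoint-8000` without deleting anything.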

In [None]:
# Delete corrupted checkpoint-9000 and use checkpoint-8000 instead
import subprocess
import shutil
import os

print("🗑️ Removing corrupted checkpoint-9000...")
corrupted_path = "/data/Cogumi-LLM/checkpoints/checkpoint-9000"
if os.path.exists(corrupted_path):
    shutil.rmtree(corrupted_path)
    print(f"✅ Deleted {corrupted_path}")
else:
    print(f"⚠️ Checkpoint-9000 not found (may already be deleted)")

print("\n📋 Current checkpoints:")
result = subprocess.run(['ls', '-lht', '/data/Cogumi-LLM/checkpoints/'], 
                       capture_output=True, text=True)
print(result.stdout)

print("\n✅ Now training will resume from checkpoint-8000")
print("   Progress: 8,000 steps completed (~4.7 hours saved)")
print("   Remaining: ~33.3 hours")
print("\n🔄 Run the training cell again to resume from checkpoint-8000")