# 🚀 H100 Training - Fresh Start

**Fresh training from base Llama-3.1-8B-Instruct model**

**What You Need:**
1. ✅ Training dataset: `public_500k_filtered.jsonl` (870MB)
2. ✅ Vast.ai account with $25 credit
3. ✅ 8-9 hours of time

**Expected:**
- Speed: 40-45 it/s
- Time: 8-9 hours
- Cost: ~$17-21 total
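As a rough sanity check on the budget above, the cost is the hourly rental rate times the hours the instance is running. The sketch below assumes a hypothetical rate of $2.20 per hour and half an hour of setup time; neither number comes from Vast.ai, so substitute the rate shown for your instance.

```python
def estimate_cost(hourly_rate_usd, train_hours, setup_hours=0.5):
    """Estimate total rental cost: rate times (training time + setup time)."""
    return hourly_rate_usd * (train_hours + setup_hours)


RATE = 2.20    # hypothetical hourly rate in USD, not a quoted price
CREDIT = 25.0  # account credit listed under "What You Need"

for hours in (8, 9):
    cost = estimate_cost(RATE, hours)
    print(f"{hours} h training + 0.5 h setup: ${cost:.2f} "
          f"(fits ${CREDIT:.0f} credit: {cost <= CREDIT})")
```

If the estimate gets close to the available credit, leave some margin: the instance keeps billing during dataset upload and dependency installation.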

## Part 1: Verify Dataset on Mac

Run this on your Mac first to verify you have the dataset.

In [None]:
import os

dataset_path = '/Users/vivekdurairaj/Projects/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl'

if os.path.exists(dataset_path):
    size_mb = os.path.getsize(dataset_path) / (1024 * 1024)
    print(f'✅ Dataset found: {size_mb:.1f} MB')
    print(f'📁 Location: {dataset_path}')
else:
    print('❌ Dataset not found!')
    print(f'   Expected: {dataset_path}')

## Part 2: Create Directory Structure (Run on H100)

After uploading this notebook to JupyterLab on H100, run this cell.

In [None]:
import os
import subprocess

dirs = [
    '/data/Cogumi-LLM/data/phase1',
    '/data/Cogumi-LLM/data/checkpoints/llama-3.1-8b-phase1a-h100',
    '/data/Cogumi-LLM/configs'
]

for d in dirs:
    os.makedirs(d, exist_ok=True)
    print(f'✅ Created: {d}')

print('\n📁 Directory structure:')
result = subprocess.run(['ls', '-la', '/data/Cogumi-LLM/'], capture_output=True, text=True)
print(result.stdout)

## Part 3: Verify H100 Setup

Check GPU, storage, CUDA, and Axolotl installation.

In [None]:
import subprocess

print('üîç GPU Check:')
subprocess.run(['nvidia-smi'])

print('\nüìä Storage Check:')
subprocess.run(['df', '-h', '/data'])

print('\nüîß CUDA Version:')
subprocess.run(['nvcc', '--version'])

print('\nüì¶ Axolotl Check:')
result = subprocess.run(['pip', 'list'], capture_output=True, text=True)
if 'axolotl' in result.stdout:
    print('‚úÖ Axolotl is installed')
else:
    print('‚ö†Ô∏è  Axolotl not found')

## Part 3.5: Install Training Dependencies

**Using same versions as Colab for consistency**

This will take 3-5 minutes.

In [None]:
import subprocess
import sys

print('üîß Installing training dependencies (Colab-compatible versions)...')
print('‚è±Ô∏è  This will take 3-5 minutes\n')

# Install PyTorch 2.4.0 with CUDA 12.1 (closest to Colab's cu118)
print('Step 1: Installing PyTorch 2.4.0...')
sys.stdout.flush()
torch_cmd = 'pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121'
result = subprocess.run(torch_cmd, shell=True, capture_output=False)
if result.returncode == 0:
    print('‚úÖ PyTorch installed!\n')
else:
    print('‚ö†Ô∏è  PyTorch installation had warnings (might still work)\n')

# Install exact versions from Colab
print('Step 2: Installing HuggingFace stack...')
sys.stdout.flush()

packages = [
    'transformers==4.46.3',
    'accelerate==1.2.1',
    'peft==0.13.2',
    'bitsandbytes==0.45.0',
    'datasets==3.2.0',
    'tokenizers==0.20.3',  # transformers 4.46.x requires tokenizers>=0.20,<0.21
    'wandb',
    'tensorboard==2.18.0',
    'trl==0.12.2'
]

for i, pkg in enumerate(packages, 1):
    print(f'  [{i}/{len(packages)}] Installing {pkg}...')
    sys.stdout.flush()
    subprocess.run(f'pip install -q {pkg}', shell=True)

print('\n✅ Installation complete!')
print('\n🔍 Verifying versions...\n')
sys.stdout.flush()

# Verify installations
result = subprocess.run(
    'python -c "import torch, transformers, peft, accelerate; '
    'print(f\'PyTorch: {torch.__version__}\'); '
    'print(f\'Transformers: {transformers.__version__}\'); '
    'print(f\'PEFT: {peft.__version__}\'); '
    'print(f\'Accelerate: {accelerate.__version__}\')"',
    shell=True, capture_output=True, text=True
)

if result.returncode == 0:
    print(result.stdout)
    print('\nüéâ All packages installed! Training environment ready!')
else:
    print('‚ö†Ô∏è  Verification had issues, but training might still work')
    print(result.stderr)
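The verification above prints the installed versions but leaves the comparison against the pins to the reader. A minimal sketch of an automated comparison follows, assuming the pins listed in this notebook; `check_pins` is a hypothetical helper, not part of pip or any of the installed libraries. Local version suffixes such as `+cu121` are stripped before comparing, since PyTorch wheels from the CUDA index report them.

```python
from importlib import metadata


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, installed)} for every pin that does not match.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            # Drop local build tags such as "+cu121" before comparing.
            installed = get_version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Pins copied from the install cell above.
PINS = {
    "torch": "2.4.0",
    "transformers": "4.46.3",
    "peft": "0.13.2",
    "accelerate": "1.2.1",
}

problems = check_pins(PINS)
print("All pins match" if not problems else f"Mismatches: {problems}")
```

An empty result means every listed package is installed at the pinned version; anything else is worth fixing before starting a multi-hour run.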

## Part 3.6: HuggingFace Authentication

You need a HuggingFace token to download Llama-3.1-8B-Instruct.

1. Go to: https://huggingface.co/settings/tokens
2. Create a new token (read access)
3. Accept the Llama 3.1 license at: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
4. Run this cell and paste your token when prompted

In [None]:
from huggingface_hub import login

# Option 1: Paste your token directly (less secure but faster)
# HF_TOKEN = "hf_..."  # Uncomment and paste your token here
# login(token=HF_TOKEN)

# Option 2: Interactive login (more secure, requires terminal access)
login()

print("✅ HuggingFace authentication successful!")

## Part 4: Upload Dataset

Use JupyterLab UI to upload:
1. Navigate to `/data/Cogumi-LLM/data/phase1/`
2. Click Upload button
3. Select `public_500k_filtered.jsonl`
4. Wait for upload (~5-10 min)

Then run this cell to verify:

In [None]:
import os
import subprocess

dataset_path = '/data/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl'

if os.path.exists(dataset_path):
    size_mb = os.path.getsize(dataset_path) / (1024 * 1024)
    print(f'✅ Dataset found: {size_mb:.1f} MB')
    
    result = subprocess.run(['wc', '-l', dataset_path], capture_output=True, text=True)
    lines = result.stdout.split()[0]
    print(f'✅ Lines: {lines}')
else:
    print(f'❌ Dataset not found at: {dataset_path}')
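File size and line count do not confirm that the records have the shape the training script in Part 5 expects, which reads an `instruction` and a `response` field from every row. The sketch below is one way to spot-check that before committing to a long run; `find_bad_records` is a hypothetical helper, and the sample lines are made up for illustration.

```python
import json

REQUIRED_FIELDS = ("instruction", "response")


def find_bad_records(lines, required=REQUIRED_FIELDS, limit=5):
    """Return up to `limit` (line_number, reason) pairs for malformed JSONL lines."""
    bad = []
    for lineno, line in enumerate(lines, 1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            bad.append((lineno, f"invalid JSON: {exc.msg}"))
        else:
            if not isinstance(record, dict):
                bad.append((lineno, "not a JSON object"))
            else:
                missing = [f for f in required if not isinstance(record.get(f), str)]
                if missing:
                    bad.append((lineno, f"missing or non-string fields: {missing}"))
        if len(bad) >= limit:
            break
    return bad


# Made-up sample lines; on the H100, pass an open file handle instead:
#   with open(dataset_path, encoding="utf-8") as f:
#       print(find_bad_records(f))
sample = [
    '{"instruction": "Add 2 and 3.", "response": "5"}',
    '{"instruction": "No response field here"}',
]
print(find_bad_records(sample))
```

Because the function stops after `limit` problems, it returns quickly on a badly broken file while still scanning a clean file to the end.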

## Part 5: Create Training Script

Using HuggingFace Trainer (same as Colab) - more stable than Axolotl.

In [None]:
%%writefile /data/Cogumi-LLM/train_qlora_h100.py
"""
🚀 QLoRA Training Script for H100 80GB
- Same stable configuration as Colab
- Optimized for H100: Higher batch size for faster training
- Uses HuggingFace Trainer (more stable than Axolotl)
"""
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import os

# Model configuration
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
output_dir = "/data/Cogumi-LLM/data/checkpoints/llama-3.1-8b-phase1a-h100"

# QLoRA configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# LoRA configuration (same as Colab)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Training arguments - OPTIMIZED FOR H100 80GB
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=8,           # HIGHER than Colab (H100 has more VRAM)
    gradient_accumulation_steps=4,            # Effective batch = 32 (same as Colab)
    gradient_checkpointing=True,
    optim="adamw_torch",
    learning_rate=2e-5,                       # Same as Colab
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    bf16=True,
    tf32=True,
    logging_steps=100,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=2,
    report_to="tensorboard",                  # H100 can handle tensorboard
    max_grad_norm=1.0,
    dataloader_num_workers=8,                 # H100 has better CPU
    dataloader_pin_memory=True,
)

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
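# Llama ships without a pad token, so reuse EOS. Note: the collator below masks
# every pad-token position in the labels, so EOS tokens do not contribute to the loss.
tokenizer.pad_token = tokenizer.eos_token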
tokenizer.padding_side = "right"

print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=False
)

print("Preparing model for training...")
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

print("Loading dataset...")
dataset = load_dataset("json", data_files="/data/Cogumi-LLM/data/phase1/public_500k_filtered.jsonl", split="train")

def tokenize_function(examples):
    # Combine instruction and response
    texts = []
    for inst, resp in zip(examples["instruction"], examples["response"]):
        texts.append(f"{inst}\n\n{resp}")
    
    return tokenizer(
        texts,
        truncation=True,
        max_length=2048,
        padding=False,
        return_tensors=None
    )

print("Tokenizing dataset...")
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names,
    desc="Tokenizing"
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

print("Creating trainer...")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

print("Starting training...")
trainer.train()

print("Saving final model...")
trainer.save_model()
print("Training complete!")
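The batch settings in the script determine how many optimizer steps the run takes and how many checkpoints it writes. The sketch below works through that arithmetic using the script's values (`per_device_train_batch_size=8`, `gradient_accumulation_steps=4`, `num_train_epochs=3`, `save_steps=1000`). The example count of 500,000 is an assumption based on the dataset's file name; use the line count reported in Part 4 instead. The step count is an approximation, since the Trainer's handling of a final partial batch can shift it slightly.

```python
import math


def training_steps(num_examples, per_device_batch=8, grad_accum=4, epochs=3, num_gpus=1):
    """Return (effective_batch_size, approximate_total_optimizer_steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


NUM_EXAMPLES = 500_000  # assumption: replace with the real line count
SAVE_STEPS = 1000       # from TrainingArguments in the script

effective_batch, total_steps = training_steps(NUM_EXAMPLES)
print(f"Effective batch size: {effective_batch}")
print(f"Approximate optimizer steps: {total_steps}")
print(f"Checkpoint saves during the run: {total_steps // SAVE_STEPS}")
```

With `save_total_limit=2`, only the two most recent of those checkpoints remain on disk at any time.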

## Part 6: Start Training

In [None]:
import os
import subprocess

print('üöÄ Starting training...')
print('‚è±Ô∏è  Expected time: 8-9 hours')
print('üíæ Checkpoints saved every 1,000 steps\n')

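# Environment variables for the training process. They are passed on the
# command line because a tmux session does not reliably inherit os.environ
# from this notebook kernel (an already-running tmux server keeps its own).
env_vars = 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 TOKENIZERS_PARALLELISM=true'

# Start a detached tmux session (so training survives disconnection)
print('Starting tmux session...')
subprocess.run(['tmux', 'new', '-s', 'training', '-d'])

# Run the training script inside the session
cmd = f'cd /data/Cogumi-LLM && {env_vars} python train_qlora_h100.py'
subprocess.run(['tmux', 'send-keys', '-t', 'training', cmd, 'C-m'])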

print('✅ Training started in tmux session "training"')
print('\n📊 To monitor training:')
print('   1. Open terminal in JupyterLab')
print('   2. Run: tmux attach -t training')
print('   3. To detach: Press Ctrl+B, then D')
print('\n💡 Training continues even if you disconnect!')

## Part 7: Monitor Training

In [None]:
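import subprocess

# `tmux attach` needs a real terminal and fails inside a notebook cell,
# so print the most recent output of the training session instead.
print('Recent output from the training session:')
result = subprocess.run(
    ['tmux', 'capture-pane', '-p', '-t', 'training', '-S', '-50'],
    capture_output=True, text=True
)
print(result.stdout or result.stderr)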

In [None]:
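import subprocess
import time
from IPython.display import clear_output

print('GPU Monitoring (interrupt the kernel to stop)')
try:
    while True:
        # `clear` has no effect on notebook output; clear the cell output instead
        clear_output(wait=True)
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        print(result.stdout)
        time.sleep(5)
except KeyboardInterrupt:
    print('\nMonitoring stopped')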
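For a more compact readout than the full `nvidia-smi` table, the tool's `--query-gpu` option emits selected fields as CSV, which is easy to parse. The sketch below assumes the three fields shown; `parse_gpu_line` and `read_gpu_stats` are hypothetical helper names, and the sample line is made up for illustration.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line of the query above into a dict of floats."""
    util, used, total = (float(field) for field in line.split(","))
    return {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}


def read_gpu_stats():
    """Run nvidia-smi and return one dict per GPU (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Made-up sample line in the format the query produces.
print(parse_gpu_line("97, 41250, 81559"))
```

On the H100 instance, calling `read_gpu_stats()` inside the monitoring loop gives values that can be logged or compared over time; sustained low utilization during training usually points at a data-loading bottleneck.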