# 🎵 LoFi Music Generator - GPU Training on Colab

**Train your LoFi music AI model with FREE GPU!**

This notebook will:
- ✅ Use Google's FREE GPU (100x faster than CPU)
- ✅ Train on 176k MIDI files from the Lakh dataset
- ✅ Save trained model for download
- ✅ Complete in 8-12 hours (not 43 days!)

---

## ⚡ IMPORTANT: Enable GPU First!

1. Click **Runtime** → **Change runtime type**
2. Select **T4 GPU** or **GPU** from Hardware accelerator
3. Click **Save**

**Then run the cells below in order!**

## 📦 Step 1: Setup Environment

In [None]:
# Check GPU is available
import torch
print(f"🔍 GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("❌ NO GPU! Go to Runtime → Change runtime type → Select GPU")

## 💾 Mount Google Drive (IMPORTANT!)

**This is CRITICAL when training on 176k files!**

Why you need this:
- Training takes 8-12 hours on 176k MIDI files
- Colab sessions can disconnect after about 12 hours (or earlier, without warning)
- Without Drive, all progress is LOST on disconnect
- With Drive, you can resume from the most recent checkpoint

What gets saved to Drive:
- ✅ Tokenized MIDI files (so you don't re-tokenize 176k files!)
- ✅ Model checkpoints every 500 steps
- ✅ Training logs and metrics
- ✅ Final trained model
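
The caching idea behind the first item can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: `load_or_build` and `fake_tokenize` are hypothetical names, and writing to a temporary file before renaming is an added assumption, meant to avoid a half-written cache if the session dies mid-save.

```python
import os
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build_fn):
    """Return cached data if present; otherwise build it and cache it.

    `build_fn` is a hypothetical zero-argument callable that does the
    expensive work (for example, tokenizing every MIDI file).
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    data = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, then rename, so an interrupted
    # save does not leave a truncated cache file behind.
    tmp_path = cache_path.parent / (cache_path.name + ".tmp")
    with open(tmp_path, "wb") as f:
        pickle.dump(data, f)
    os.replace(tmp_path, cache_path)
    return data


# Demo in a throwaway directory: the second call reads the cache
# instead of calling the build function again.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tokenized_data" / "sequences.pkl"
    build_calls = []

    def fake_tokenize():
        build_calls.append(1)
        return {"token_sequences": [[1, 2, 3]], "vocab_size": 10}

    first = load_or_build(path, fake_tokenize)
    second = load_or_build(path, fake_tokenize)
```

On Colab, `cache_path` would point into the mounted Drive folder so the cache survives a disconnect.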

In [None]:
# Mount Google Drive
from google.colab import drive
import os

print("🔗 Mounting Google Drive...")
print("You'll need to click the link and authorize access.\n")

drive.mount('/content/drive')

# Create directory structure in Drive
drive_dir = '/content/drive/MyDrive/LoFi_Training'
os.makedirs(drive_dir, exist_ok=True)
os.makedirs(f'{drive_dir}/checkpoints', exist_ok=True)
os.makedirs(f'{drive_dir}/tokenized_data', exist_ok=True)

print(f"\n✅ Google Drive mounted!")
print(f"📁 Training data will be saved to: {drive_dir}")
print("\nIf Colab disconnects, just re-run the cells and training will resume automatically!")

## 📥 Clone Repository

**IMPORTANT:** If you get authentication errors when cloning, choose ONE solution:

### Option 1: Make Repository Public (Easiest)
1. Go to https://github.com/andy-regulore/lofi
2. Click **Settings** → **General**
3. Scroll to **Danger Zone** → **Change repository visibility**
4. Click **Change visibility** → **Make public**

### Option 2: Use Personal Access Token
1. Go to https://github.com/settings/tokens
2. Generate new token (classic) with **repo** scope
3. Copy the token
4. In the cell below, replace the clone command with:
   ```
   !git clone https://YOUR_TOKEN_HERE@github.com/andy-regulore/lofi.git
   ```
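
Pasting a token directly into a cell leaves it in the saved notebook. One possible alternative, sketched below, reads the token from an environment variable instead. The variable name `GITHUB_TOKEN` and the helper `build_clone_url` are assumptions made for illustration.

```python
import os


def build_clone_url(repo="andy-regulore/lofi", token=None):
    """Build an HTTPS clone URL, embedding a token only if one is given."""
    if token:
        return f"https://{token}@github.com/{repo}.git"
    return f"https://github.com/{repo}.git"


# GITHUB_TOKEN is a hypothetical variable name; set it before running.
# If it is unset, the plain public URL is used.
clone_url = build_clone_url(token=os.environ.get("GITHUB_TOKEN"))
```

In a Colab cell, the result can be passed to git with `!git clone {clone_url}`.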

Then run the cell below ⬇️

In [None]:
# Clone repository (handles both first run and re-runs after disconnect)
import os

if os.path.exists('/content/lofi'):
    print("📁 Repository already exists (from previous run)")
    print("Skipping clone, changing to directory...\n")
    %cd /content/lofi
    
    # Pull latest changes just in case
    print("🔄 Pulling latest changes...")
    !git pull origin main
else:
    print("📥 Cloning repository for first time...\n")
    # NOTE: If you get authentication errors, you need to either:
    # 1. Make your repository PUBLIC on GitHub (Settings → General → Danger Zone → Change visibility)
    # 2. OR use a personal access token: !git clone https://YOUR_TOKEN@github.com/andy-regulore/lofi.git
    
    !git clone https://github.com/andy-regulore/lofi.git
    %cd /content/lofi
    
    # Use main branch (all fixes are here)
    !git checkout main

# Verify we're in the right place
print("\n✅ Current directory:")
!pwd
print("\n✅ Checking for config.yaml:")
!ls -la config.yaml

In [None]:
# Install dependencies
print("📦 Installing dependencies (this takes 3-5 minutes)...")
!pip install -q torch transformers datasets accelerate
!pip install -q miditok miditoolkit pretty_midi
!pip install -q librosa soundfile scipy numpy pandas
!pip install -q pyyaml scikit-learn tqdm tensorboard
print("✅ Dependencies installed!")

## 📂 Step 2: Get Training Data

**Choose ONE option:**
- **Option A:** Download Lakh MIDI Dataset (176k files, ~20GB)
- **Option B:** Upload your own MIDI files from Google Drive

In [None]:
# OPTION A: Download Lakh MIDI Dataset (recommended)
print("📥 Downloading Lakh MIDI Dataset (~20GB, takes 10-20 minutes)...")
!mkdir -p data/training
!wget -q --show-progress http://hog.ee.columbia.edu/craffel/lmd/lmd_full.tar.gz
print("\n📦 Extracting dataset...")
!tar -xzf lmd_full.tar.gz -C data/training/
!rm lmd_full.tar.gz
print("✅ Dataset ready!")

# Count files
import os
midi_count = sum(1 for root, dirs, files in os.walk('data/training') 
                 for f in files if f.endswith(('.mid', '.midi')))
print(f"\n🎵 Found {midi_count:,} MIDI files")

In [None]:
# OPTION B: Use Google Drive (skip if you used Option A)
# Uncomment if you have MIDI files in Google Drive

# from google.colab import drive
# drive.mount('/content/drive')

# # Copy from your Google Drive to Colab
# !mkdir -p data/training
# !cp -r /content/drive/MyDrive/your-midi-folder/* data/training/

# # Count files
# import os
# midi_count = sum(1 for root, dirs, files in os.walk('data/training') 
#                  for f in files if f.endswith(('.mid', '.midi')))
# print(f"🎵 Found {midi_count:,} MIDI files")

## 🚀 Step 3: Train the Model!

This will:
1. Tokenize all MIDI files (1-2 hours)
2. Train GPT-2 model (6-10 hours)
3. Save trained model

**Total time: 8-12 hours with GPU**
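
To get a feel for how the epoch count, batch size, and checkpoint interval interact, the following sketch estimates the number of training steps. It assumes one optimizer step per batch (no gradient accumulation), and the sequence count in the example is a made-up figure, not a measurement from this dataset.

```python
import math


def estimate_steps(num_sequences, batch_size, num_epochs, save_steps):
    """Estimate steps per epoch, total steps, and checkpoints written.

    Assumes one optimizer step per batch (no gradient accumulation).
    """
    steps_per_epoch = math.ceil(num_sequences / batch_size)
    total_steps = steps_per_epoch * num_epochs
    checkpoints_written = total_steps // save_steps
    return steps_per_epoch, total_steps, checkpoints_written


# Hypothetical example: 100,000 training sequences with the settings used
# in the training cell (batch size 4, 15 epochs, checkpoint every 500 steps).
steps_per_epoch, total_steps, checkpoints_written = estimate_steps(
    num_sequences=100_000, batch_size=4, num_epochs=15, save_steps=500
)
print(steps_per_epoch, total_steps, checkpoints_written)
```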

In [None]:
# Run training directly in Python with Google Drive checkpoint saving
print("🚀 Starting training with checkpoint saving...\n")
print("💾 All progress will be saved to Google Drive!")
print("If Colab disconnects, re-run this cell to resume from last checkpoint.\n")

# Make sure we're in the right directory
import os
import pickle
from pathlib import Path
from multiprocessing import Pool, cpu_count
from functools import partial

if not Path('config.yaml').exists():
    print("❌ Error: Not in lofi directory!")
    print("Please run the repository clone cell first.")
    raise FileNotFoundError("config.yaml not found - wrong directory")

print(f"✅ Working directory: {os.getcwd()}\n")

import yaml
import torch
from src.tokenizer import LoFiTokenizer
from src.model import ConditionedLoFiModel
from src.trainer import LoFiTrainer
from sklearn.model_selection import train_test_split

# Google Drive paths
DRIVE_DIR = '/content/drive/MyDrive/LoFi_Training'
TOKENIZED_DATA_PATH = f'{DRIVE_DIR}/tokenized_data/sequences.pkl'
CHECKPOINT_DIR = f'{DRIVE_DIR}/checkpoints'
FINAL_MODEL_DIR = f'{DRIVE_DIR}/final_model'

# Load config
with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Override settings for Colab with Google Drive checkpointing
config['training']['output_dir'] = CHECKPOINT_DIR
config['training']['device'] = 'cuda'
config['training']['fp16'] = True
config['training']['num_epochs'] = 15  # Full passes over the training data
config['training']['batch_size'] = 4   # Sequences per step; lower this if you run out of GPU memory
config['training']['save_steps'] = 500  # Save checkpoint every 500 steps
config['training']['save_total_limit'] = 3  # Keep only last 3 checkpoints to save space
config['data']['quality_filters']['require_drums'] = False
config['data']['quality_filters']['min_tempo'] = 1
config['data']['quality_filters']['max_tempo'] = 999

print("="*60)
print("PHASE 1: TOKENIZING MIDI FILES")
print("="*60)

# Check if we already have tokenized data in Google Drive
if os.path.exists(TOKENIZED_DATA_PATH):
    print(f"\n🎉 Found existing tokenized data in Google Drive!")
    print(f"Loading from: {TOKENIZED_DATA_PATH}")
    print("This saves hours of tokenization time!\n")
    
    with open(TOKENIZED_DATA_PATH, 'rb') as f:
        saved_data = pickle.load(f)
        token_sequences = saved_data['token_sequences']
        vocab_size = saved_data['vocab_size']
    
    print(f"✅ Loaded {len(token_sequences):,} token sequences")
    print(f"Vocabulary size: {vocab_size}")
    
else:
    print("\n💾 No existing tokenized data found. Starting tokenization...")
    print("(This will be saved to Google Drive for future runs)\n")
    
    # Initialize tokenizer
    tokenizer = LoFiTokenizer(config)
    vocab_size = tokenizer.tokenizer.vocab_size
    print(f"Vocabulary size: {vocab_size}")

    # Find all MIDI files
    training_dir = Path('data/training')
    midi_files = list(training_dir.glob('**/*.mid')) + list(training_dir.glob('**/*.midi'))
    
    if len(midi_files) == 0:
        print("\n" + "="*60)
        print("❌ ERROR: NO MIDI FILES FOUND!")
        print("="*60)
        print("\nYou need to download training data first!")
        print("\n📋 STEPS TO FIX:")
        print("1. Scroll up to 'Step 2: Get Training Data'")
        print("2. Run either:")
        print("   - OPTION A: Download Lakh MIDI Dataset cell (recommended)")
        print("   - OPTION B: Upload your own MIDI files from Google Drive")
        print("3. Wait for download/upload to complete")
        print("4. Then come back and re-run this training cell")
        print("\n" + "="*60)
        raise ValueError("No MIDI files found! Please download training data first (see Step 2 above).")
    
    print(f"Found {len(midi_files):,} MIDI files\n")

    # Parallel tokenization function
    def tokenize_single_file(midi_file, config_dict):
        """Tokenize a single MIDI file (for parallel processing)"""
        try:
            # Create tokenizer instance for this process
            tokenizer = LoFiTokenizer(config_dict)
            
            result = tokenizer.tokenize_midi(str(midi_file), check_quality=False)
            if result and 'tokens' in result and len(result['tokens']) > 0:
                chunks = tokenizer.chunk_sequence(result['tokens'])
                if chunks:
                    return ('success', chunks)
                else:
                    return ('fail', None)
            else:
                return ('fail', None)
        except Exception as e:
            return ('error', str(e)[:100])

    # Use parallel processing to speed up tokenization
    num_workers = cpu_count()
    print(f"⚡ Using {num_workers} CPU cores for parallel tokenization")
    print(f"   This can be up to ~{num_workers}x faster than using a single core\n")
    
    print(f"Tokenizing {len(midi_files):,} files in parallel...")
    print("Progress updates every 1000 files:\n")

    token_sequences = []
    success_count = 0
    fail_count = 0
    error_count = 0
    
    # Process in batches for progress tracking
    batch_size = 1000
    
    with Pool(num_workers) as pool:
        for batch_start in range(0, len(midi_files), batch_size):
            batch_end = min(batch_start + batch_size, len(midi_files))
            batch_files = midi_files[batch_start:batch_end]
            
            # Process this batch in parallel
            tokenize_func = partial(tokenize_single_file, config_dict=config)
            results = pool.map(tokenize_func, batch_files)
            
            # Collect results
            for result in results:
                status, data = result
                if status == 'success':
                    token_sequences.extend(data)
                    success_count += 1
                elif status == 'fail':
                    fail_count += 1
                else:  # error
                    error_count += 1
                    if error_count <= 10:
                        print(f"  Error: {data}")
            
            # Progress update
            processed = batch_end
            elapsed_pct = processed / len(midi_files) * 100
            print(f"  [{elapsed_pct:5.1f}%] Processed {processed:,}/{len(midi_files):,} files - "
                  f"Success: {success_count:,}, Failed: {fail_count:,}")

    print(f"\n✅ Tokenization complete!")
    print(f"  Success: {success_count:,} files")
    print(f"  Failed: {fail_count:,} files")
    print(f"  Errors: {error_count:,} files")
    print(f"  Generated: {len(token_sequences):,} token sequences")

    if len(token_sequences) == 0:
        raise ValueError("No valid sequences generated. Check MIDI files.")
    
    # Save tokenized data to Google Drive
    print(f"\n💾 Saving tokenized data to Google Drive...")
    print(f"   Path: {TOKENIZED_DATA_PATH}")
    
    with open(TOKENIZED_DATA_PATH, 'wb') as f:
        pickle.dump({
            'token_sequences': token_sequences,
            'vocab_size': vocab_size
        }, f)
    
    print("✅ Tokenized data saved! Future runs will skip tokenization.")

print("\n" + "="*60)
print("PHASE 2: SPLITTING DATASET")
print("="*60)

# Split into train/eval
train_sequences, eval_sequences = train_test_split(
    token_sequences, test_size=0.1, random_state=42
)

print(f"Training sequences: {len(train_sequences):,}")
print(f"Evaluation sequences: {len(eval_sequences):,}")

print("\n" + "="*60)
print("PHASE 3: INITIALIZING MODEL")
print("="*60)

# Initialize model
model = ConditionedLoFiModel(config, vocab_size)
model_info = model.get_model_info()

print(f"Model: GPT-2")
print(f"Parameters: {model_info['total_parameters']:,} ({model_info['total_parameters']/1e6:.1f}M)")
print(f"Layers: {model_info['num_layers']}")
print(f"Context length: {model_info['context_length']}")

print("\n" + "="*60)
print("PHASE 4: TRAINING MODEL")
print("="*60)
print(f"\n💾 Checkpoints will be saved to Google Drive every 500 steps")
print(f"📁 Location: {CHECKPOINT_DIR}")
print(f"\nEpochs: {config['training']['num_epochs']}, Batch size: {config['training']['batch_size']}")
print("\n⚠️ If Colab disconnects, just re-run this cell - it will auto-resume from the last checkpoint!\n")

# Check for existing checkpoints (just for user info - trainer handles this automatically)
from transformers.trainer_utils import get_last_checkpoint

existing_checkpoint = None
if os.path.exists(CHECKPOINT_DIR):
    existing_checkpoint = get_last_checkpoint(CHECKPOINT_DIR)
    if existing_checkpoint:
        print(f"🔄 Found existing checkpoint: {os.path.basename(existing_checkpoint)}")
        print(f"   Training will resume from this checkpoint!\n")

# Initialize trainer (it will auto-detect and resume from checkpoints)
trainer = LoFiTrainer(model, config, vocab_size)

# Train! (trainer automatically resumes from last checkpoint if it exists)
results = trainer.train(train_sequences, eval_sequences)

print("\n" + "="*60)
print("✅ TRAINING COMPLETE!")
print("="*60)

print(f"\nFinal metrics:")
print(f"  Train loss: {results['train_metrics'].get('train_loss', 'N/A')}")
print(f"  Eval loss: {results['eval_metrics'].get('eval_loss', 'N/A')}")

# Copy the training output directory (including any checkpoint-* subfolders) to a separate directory in Google Drive
print(f"\n💾 Copying final model to: {FINAL_MODEL_DIR}")
import shutil
if os.path.exists(FINAL_MODEL_DIR):
    shutil.rmtree(FINAL_MODEL_DIR)
shutil.copytree(CHECKPOINT_DIR, FINAL_MODEL_DIR)

print(f"\n✅ Model saved to Google Drive!")
print(f"📁 Location: {FINAL_MODEL_DIR}")
print("\n👉 Run the download cell below to get your trained model!")

## 📊 Step 4: Monitor Training (Optional)

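Run this cell before you start the training cell. Colab runs cells one at a time, so a cell started during training waits in the queue until training finishes. Once loaded, the TensorBoard panel can be refreshed while training runs.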

In [None]:
# Load TensorBoard to monitor training from Google Drive
%load_ext tensorboard
%tensorboard --logdir /content/drive/MyDrive/LoFi_Training/checkpoints

# You can also view logs locally if training hasn't started yet:
# %tensorboard --logdir models/colab-trained/logs

## 💾 Step 5: Download Trained Model

In [None]:
# Download trained model from Google Drive
import os
from google.colab import files

FINAL_MODEL_DIR = '/content/drive/MyDrive/LoFi_Training/final_model'

if os.path.exists(FINAL_MODEL_DIR):
    print("📦 Zipping your trained model from Google Drive...")
    print(f"📁 Source: {FINAL_MODEL_DIR}\n")
    
    # Zip the model
    !cd /content/drive/MyDrive/LoFi_Training && zip -r trained_lofi_model.zip final_model/
    
    print("\n💾 Downloading to your computer...")
    files.download('/content/drive/MyDrive/LoFi_Training/trained_lofi_model.zip')
    
    print("\n✅ Model downloaded!")
    print("\nNext steps:")
    print("1. Unzip trained_lofi_model.zip")
    print("2. Copy the 'final_model' folder to your local lofi/models/ directory")
    print("3. Rename it to 'lofi-gpt2' (or update config.yaml)")
    print("4. Start generating music with your web UI!")
else:
    print("❌ No trained model found in Google Drive!")
    print(f"Expected location: {FINAL_MODEL_DIR}")
    print("\nPlease run the training cell first.")

## 🎵 Step 6: Test Generation (Optional)

In [None]:
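# Generate a test track (if scripts exist)
# Note: You can also generate using the web UI after downloading the model
import os
import subprocess
import sys
from google.colab import files

# The training cell saves its output to Google Drive, not to models/colab-trained.
# This assumes the generation script can load a model from this directory.
FINAL_MODEL_DIR = '/content/drive/MyDrive/LoFi_Training/final_model'

try:
    # check=True raises CalledProcessError on a non-zero exit code.
    # A `!python ...` shell line does not raise, so the except block would never run.
    subprocess.run(
        [sys.executable, 'scripts/04_generate.py',
         '--config', 'config.yaml',
         '--model-path', FINAL_MODEL_DIR,
         '--output-dir', 'output/test',
         '--num-tracks', '1',
         '--mood', 'chill',
         '--tempo', '75'],
        check=True,
    )

    print("\n✅ Generated test track in output/test/")

    # Download the first generated MIDI file
    if os.path.exists('output/test/midi'):
        midi_files = [f for f in os.listdir('output/test/midi') if f.endswith('.mid')]
        if midi_files:
            files.download(f'output/test/midi/{midi_files[0]}')
    else:
        print("MIDI output directory not found")
except (subprocess.CalledProcessError, FileNotFoundError) as e:
    print(f"Generation failed: {e}")
    print("You can generate music using the web UI after downloading the model")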

## ✅ Next Steps

After training completes:

1. **Download the model** (Step 5 above)
2. **Unzip** on your local machine
3. **Place in** `lofi/models/lofi-gpt2/`
4. **Generate music** using your local web UI!
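
Steps 2 and 3 can also be scripted on the local machine. The sketch below assumes the archive contains a top-level `final_model/` folder (as produced by the download cell) and that the target is `lofi/models/lofi-gpt2`. `install_model` is a hypothetical helper, and the demo builds a tiny stand-in archive so that it runs anywhere.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_model(zip_path, models_dir, name="lofi-gpt2"):
    """Extract `final_model/` from the archive and move it to models_dir/name."""
    models_dir = Path(models_dir)
    target = models_dir / name
    if target.exists():
        raise FileExistsError(f"{target} already exists; remove or rename it first")
    with tempfile.TemporaryDirectory() as staging:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(staging)
        extracted = Path(staging) / "final_model"
        if not extracted.is_dir():
            raise FileNotFoundError("archive has no top-level final_model/ folder")
        models_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(extracted), str(target))
    return target


# Demo with a stand-in archive holding a single placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "trained_lofi_model.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("final_model/config.json", "{}")
    installed = install_model(zip_path, Path(tmp) / "lofi" / "models")
    installed_name = installed.name
    has_config = (installed / "config.json").is_file()
```

Refusing to overwrite an existing target is a deliberate choice here, so an earlier model is never replaced silently.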

---

### 💾 Google Drive Checkpoint System

**How it works:**
- ✅ Tokenized MIDI data saved to Drive (1-2 hour savings!)
- ✅ Model checkpoints saved every 500 steps
- ✅ Keeps last 3 checkpoints (saves Drive space)
- ✅ Auto-resumes from latest checkpoint
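
The "keep the last 3" rule amounts to deleting the oldest checkpoint folders once newer ones exist. The notebook delegates this to the trainer through the `save_total_limit` setting. The sketch below only illustrates the idea, and `prune_checkpoints` is a hypothetical helper, not part of the repository.

```python
import shutil
import tempfile
from pathlib import Path


def prune_checkpoints(checkpoint_dir, keep=3):
    """Delete all but the `keep` highest-numbered checkpoint-<step> folders.

    Returns the names of the folders that were removed, oldest first.
    """
    def step_of(path):
        return int(path.name.split("-")[1])

    folders = sorted(
        (p for p in Path(checkpoint_dir).glob("checkpoint-*")
         if p.is_dir() and p.name.split("-")[1].isdigit()),
        key=step_of,
    )
    removed = folders[:-keep] if keep > 0 else folders
    for folder in removed:
        shutil.rmtree(folder)
    return [p.name for p in removed]


# Demo: five checkpoints exist, only the newest three survive.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 1500, 2000, 2500):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    removed_names = prune_checkpoints(tmp, keep=3)
    remaining_names = sorted(p.name for p in Path(tmp).iterdir())
```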

**If Colab disconnects:**
1. Check whether the runtime shows as disconnected
2. Re-open this notebook
3. Re-run the cells in order
4. Training resumes from the most recent checkpoint (at most 500 steps are repeated)

**Your Google Drive will have:**
```
MyDrive/
  LoFi_Training/
    ├── tokenized_data/
    │   └── sequences.pkl (saved for future runs)
    ├── checkpoints/
    │   ├── checkpoint-500/
    │   ├── checkpoint-1000/
    │   └── checkpoint-1500/
    └── final_model/ (ready to download)
```
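
Resuming relies on picking the checkpoint folder with the highest step number. The sketch below shows that selection logic in plain Python, assuming the `checkpoint-<step>` naming shown above. `find_latest_checkpoint` is a hypothetical helper for illustration; the training cell itself uses `get_last_checkpoint` from `transformers` for this.

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(checkpoint_dir):
    """Return the checkpoint-<step> folder with the largest step, or None."""
    checkpoint_dir = Path(checkpoint_dir)
    if not checkpoint_dir.is_dir():
        return None
    candidates = []
    for entry in checkpoint_dir.iterdir():
        match = CHECKPOINT_PATTERN.match(entry.name)
        if entry.is_dir() and match:
            candidates.append((int(match.group(1)), entry))
    if not candidates:
        return None
    # Compare numerically: a plain string sort would rank
    # "checkpoint-500" above "checkpoint-1500".
    return max(candidates, key=lambda item: item[0])[1]


# Demo with empty stand-in folders.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 1500):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    latest_name = find_latest_checkpoint(tmp).name
    empty_result = find_latest_checkpoint(Path(tmp) / "missing")
```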

---

### 🎯 Pro Tips:

- **First run takes longest** - tokenization (1-2 hours) + training (6-10 hours)
- **Subsequent runs are faster** - tokenized data is cached in Drive
- **Training interrupted?** - Just re-run, it auto-resumes
- **Want faster training?** - Reduce `num_epochs` from 15 to 5 in training cell
- **Check progress anytime** - Use TensorBoard cell or check Google Drive folder
- **Drive space low?** - Training uses ~5-10GB total
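
To see how much space the training folder currently occupies, a small helper can sum the file sizes. This is a sketch: `directory_size_bytes` is a hypothetical name, and on Colab the path would be the `LoFi_Training` folder on the mounted Drive rather than the temporary directory used in the demo.

```python
import tempfile
from pathlib import Path


def directory_size_bytes(root):
    """Sum the sizes of all regular files under `root`, recursively."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


# Demo with two placeholder files of known size.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "checkpoints").mkdir()
    (Path(tmp) / "checkpoints" / "weights.bin").write_bytes(b"\0" * 2500)
    (Path(tmp) / "sequences.pkl").write_bytes(b"\0" * 1000)
    total_bytes = directory_size_bytes(tmp)

print(f"{total_bytes / 1e9:.6f} GB")
```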

---

### ⚡ Troubleshooting:

**"Runtime disconnected"**
- This is normal: Colab has a 12-hour limit. Re-run the cells to resume from the last checkpoint.

**"Out of memory"**
- Reduce `batch_size` from 4 to 2 in training cell
- Make sure you selected GPU (not CPU) in runtime settings

**"No MIDI files found"**
- Make sure Step 2 (download dataset) completed successfully
- Check `data/training/` directory has files
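
A quick way to verify the second point is to count the MIDI files under the directory. This sketch matches extensions case-insensitively, which is an addition: the notebook's own checks only match lowercase `.mid` and `.midi`. `count_midi_files` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def count_midi_files(root):
    """Count files under `root` (recursively) ending in .mid or .midi."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(
        1 for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in {".mid", ".midi"}
    )


# Demo with empty placeholder files; in the notebook the argument
# would be 'data/training'.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a").mkdir()
    for name in ("a/x.mid", "a/y.MID", "z.midi", "notes.txt"):
        (Path(tmp) / name).touch()
    found = count_midi_files(tmp)
    missing = count_midi_files(Path(tmp) / "does-not-exist")
```

If this count is higher than the number reported by the training cell, some files have uppercase extensions that the lowercase-only patterns skip.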

**"Checkpoint not found"**
- First run won't have checkpoints - that's normal
- Checkpoints appear after 500 training steps