# Project Sullivan - Transformer Model Training

**Phase 2-B: Advanced Architecture Training on Google Colab**

This notebook trains the Transformer model for acoustic-to-articulatory inversion.

---

## 📋 Prerequisites

**None beyond a GPU runtime. Just run the notebook.**

All data (78 MB) is included in the GitHub repository and is extracted automatically.

---

## 🚀 Runtime Settings

**IMPORTANT**: Change runtime to GPU
- Menu: Runtime → Change runtime type
- Hardware accelerator: **GPU** (T4 recommended)
- Click Save


## ⚙️ Configuration

Simple configuration - just set your training mode:

In [None]:
# ============================================
# Training Configuration
# ============================================
QUICK_TEST = False  # Set to True for 10-epoch validation test (20-30 min)
                    # Set to False for full training (2-3 hours)

CONFIG_FILE = 'configs/transformer_quick_test.yaml' if QUICK_TEST else 'configs/transformer_config.yaml'

# GitHub Repository (already set correctly)
GITHUB_REPO = 'faransansj/Project_Sullivan'
BRANCH = 'main'

# ⭐ NOTE: sequence_length is now set in the config files.
# - transformer_config.yaml: sequence_length = 100
# - This splits the data into 100-frame sequences
# - Train samples: 50 → 1,240 (about 25x more)
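
The effect of `sequence_length` can be illustrated with a small sketch. It assumes non-overlapping windows and that a trailing remainder shorter than `sequence_length` is dropped; the repository's actual data loader may pad or overlap instead. The helper `split_into_windows` and the utterance lengths are hypothetical.

```python
def split_into_windows(frames, sequence_length=100):
    """Split one utterance (a list of frames) into non-overlapping windows.

    Assumption: a trailing remainder shorter than sequence_length is dropped.
    """
    n_windows = len(frames) // sequence_length
    return [
        frames[i * sequence_length:(i + 1) * sequence_length]
        for i in range(n_windows)
    ]


# Hypothetical utterance lengths in frames (not taken from the dataset).
utterances = [list(range(n)) for n in (250, 1030, 99)]
windows = [w for utt in utterances for w in split_into_windows(utt)]
print(f"{len(utterances)} utterances -> {len(windows)} training sequences")
```

Splitting long recordings this way is how a small number of utterances can yield many more training samples.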

## 🔧 Setup Environment

In [None]:
# Check GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ WARNING: GPU not available! Training will be slow.")
    print("Please change runtime: Runtime → Change runtime type → GPU")

In [None]:
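# Clone GitHub repository
import os

# On a re-run the working directory is already the repo; avoid a nested clone.
in_repo = os.path.basename(os.getcwd()) == 'Project_Sullivan'

if in_repo:
    !git pull origin {BRANCH}
elif not os.path.exists('Project_Sullivan'):
    !git clone https://github.com/{GITHUB_REPO}.git
    %cd Project_Sullivan
    !git checkout {BRANCH}
else:
    %cd Project_Sullivan
    !git pull origin {BRANCH}

print("\n✅ Repository ready!")
!pwd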

In [None]:
# Install dependencies
print("📦 Installing dependencies...\n")
!pip install -q torch torchvision torchaudio
!pip install -q pytorch-lightning tensorboard
!pip install -q librosa soundfile
!pip install -q numpy scipy pandas matplotlib seaborn
!pip install -q pyyaml tqdm

print("\n✅ Dependencies installed!")
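
Because `pip install -q` suppresses most output, it can help to confirm what actually got installed. The sketch below uses only the standard library; the package list is a subset of the names in the pip commands, and `installed_versions` is a hypothetical helper.

```python
from importlib import metadata

# Distribution names as they appear in the pip commands.
PACKAGES = ["torch", "pytorch-lightning", "tensorboard", "librosa", "soundfile"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if not found."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:20s} {version or 'NOT INSTALLED'}")
```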

## 📥 Extract Data from Repository

In [None]:
# Extract data archives (already in repository)
import tarfile
import os

print("📦 Extracting data from repository archives...\n")

# Create data directory
os.makedirs('data/processed', exist_ok=True)

# Extract combined archive
archive_path = 'colab_data_archives/processed_data_all.tar.gz'

if os.path.exists(archive_path):
    print(f"Found archive: {archive_path}")
    print("Extracting...")
    
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall('data/processed')
    
    print("✅ Data extracted!\n")
else:
    print("⚠️ Archive not found. Trying individual archives...\n")
    
    # Try individual archives as fallback
    archives = {
        'audio_features': 'colab_data_archives/audio_features.tar.gz',
        'parameters': 'colab_data_archives/parameters.tar.gz',
        'segmentations': 'colab_data_archives/segmentations.tar.gz',
        'splits': 'colab_data_archives/splits.tar.gz'
    }
    
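    extracted = 0
    for name, path in archives.items():
        if os.path.exists(path):
            print(f"Extracting {name}...")
            with tarfile.open(path, 'r:gz') as tar:
                tar.extractall('data/processed')
            extracted += 1
        else:
            print(f"⚠️ Missing archive: {path}")

    # Report the real count instead of claiming success unconditionally
    status = "✅" if extracted == len(archives) else "⚠️"
    print(f"\n{status} Extracted {extracted} of {len(archives)} archives.")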

# Verify data
print("📊 Data verification:")
!ls -lh data/processed/
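
A programmatic check can be more reliable than reading the `ls` output. The sketch below assumes each archive unpacks into a subdirectory named after its key in the `archives` dict (`audio_features`, `parameters`, `segmentations`, `splits`); that layout is an assumption, so adjust the names to match what the archives actually contain. `missing_subdirs` is a hypothetical helper.

```python
import os

# Assumed layout: one subdirectory per archive name under data/processed.
EXPECTED_SUBDIRS = ["audio_features", "parameters", "segmentations", "splits"]


def missing_subdirs(root, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that are absent or empty under root."""
    missing = []
    for name in expected:
        path = os.path.join(root, name)
        if not os.path.isdir(path) or not os.listdir(path):
            missing.append(name)
    return missing


missing = missing_subdirs("data/processed")
if missing:
    print(f"Missing or empty: {', '.join(missing)}")
else:
    print("All expected data directories are present.")
```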

## 🏋️ Model Training

In [None]:
# Start training
print(f"🚀 Starting Transformer training ({CONFIG_FILE})...\n")
print(f"Quick test mode: {QUICK_TEST}")
print(f"GPU available: {torch.cuda.is_available()}\n")

# Run training script
!python scripts/train_transformer.py \
    --config {CONFIG_FILE} \
    --gpus 1

## 📊 Monitor Training with TensorBoard

In [None]:
# Load TensorBoard (to watch metrics live, run this cell before the training cell)
%load_ext tensorboard
%tensorboard --logdir logs/training/

## 📈 View Training Results

In [None]:
# Check training logs
import glob
import os

log_dirs = glob.glob('logs/training/*/')
if log_dirs:
    # Pick the most recently modified run; sorting names alphabetically
    # can misorder runs (e.g. version_10 sorts before version_2).
    latest_log = max(log_dirs, key=os.path.getmtime)
    print(f"📁 Latest training run: {latest_log}\n")
    
    # Show metrics
    metrics_file = os.path.join(latest_log, 'metrics.csv')
    if os.path.exists(metrics_file):
        import pandas as pd
        df = pd.read_csv(metrics_file)
        print("📊 Training Metrics:")
        print(df.tail(10))
    
    # List checkpoints
    checkpoints = glob.glob(os.path.join(latest_log, 'checkpoints', '*.ckpt'))
    if checkpoints:
        print(f"\n💾 Checkpoints ({len(checkpoints)} found):")
        for ckpt in sorted(checkpoints)[-3:]:
            print(f"   - {os.path.basename(ckpt)}")
else:
    print("No training logs found.")
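
Beyond printing the last rows, it is often useful to find the epoch with the best validation score. The sketch below uses the standard `csv` module and assumes `metrics.csv` has a `val_loss` column; the actual column names depend on what the training script logs, so treat `val_loss` and the helper `best_row` as assumptions.

```python
import csv


def best_row(metrics_path, metric="val_loss"):
    """Return the row of a metrics CSV with the lowest value of `metric`.

    Rows where the metric is empty are skipped, since loggers often write
    training and validation metrics on separate rows.
    """
    best = None
    with open(metrics_path, newline="") as f:
        for row in csv.DictReader(f):
            value = row.get(metric)
            if not value:
                continue
            if best is None or float(value) < float(best[metric]):
                best = row
    return best
```

With the variables from the cell above, `best_row(metrics_file)` returns a dict for the best row, or `None` if the metric was never logged.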

## 💾 Download Results

In [None]:
# Create archive of results
import shutil
from datetime import datetime

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
archive_name = f'transformer_results_{timestamp}'

# Copy important files
os.makedirs(archive_name, exist_ok=True)

# Copy logs
if log_dirs:
    shutil.copytree(latest_log, os.path.join(archive_name, 'logs'))

# Create zip
shutil.make_archive(archive_name, 'zip', archive_name)

print(f"\n📦 Results archived: {archive_name}.zip")
print("\n📥 Download using Files panel (left sidebar)")
print(f"   Or, after mounting Drive in the next cell: !cp {archive_name}.zip /content/drive/MyDrive/")
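
As an alternative to the Files panel, Colab can trigger a browser download directly through `google.colab.files.download`. The wrapper below is a minimal sketch; `download_in_colab` is a hypothetical helper that falls back to a message when the code runs outside Colab.

```python
def download_in_colab(path):
    """Request a browser download in Colab; return False when not in Colab."""
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        print(f"Not running in Colab; {path} stays on the local filesystem.")
        return False
    files.download(path)
    return True
```

For the archive created above, the call would be `download_in_colab(f"{archive_name}.zip")`.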

## 💾 Save to Google Drive (Optional)

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Copy results to Drive
drive_path = '/content/drive/MyDrive/Project_Sullivan_Results/'
os.makedirs(drive_path, exist_ok=True)

!cp {archive_name}.zip {drive_path}

print(f"\n✅ Results saved to Google Drive: {drive_path}")

---

## 📝 Notes

### Expected Training Time (T4 GPU)
- Quick test (10 epochs): ~20-30 minutes
- Full training (50 epochs): ~2-3 hours

### Expected Performance
- **Target RMSE**: 0.20-0.30 (3-5× better than baseline LSTM)
- **Target PCC**: 0.30-0.45 (3-4× better than baseline LSTM)
- **Baseline LSTM**: RMSE 1.011, PCC 0.105
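
For reference, the two metrics can be computed as in the sketch below, which also derives the improvement factors implied by the targets. This is a plain-Python illustration on flat sequences; how the project aggregates the metrics across articulatory parameters is not shown here and may differ.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pcc(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    mp = sum(pred) / len(pred)
    mt = sum(target) / len(target)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Improvement factors implied by the targets, relative to the baseline LSTM.
BASELINE_RMSE, BASELINE_PCC = 1.011, 0.105
for target_rmse in (0.30, 0.20):
    print(f"RMSE {target_rmse:.2f}: {BASELINE_RMSE / target_rmse:.1f}x lower")
for target_pcc in (0.30, 0.45):
    print(f"PCC  {target_pcc:.2f}: {target_pcc / BASELINE_PCC:.1f}x higher")
```

The ratios come out at roughly 3.4 to 5.1 for RMSE and 2.9 to 4.3 for PCC, consistent with the rounded factors quoted in the list.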

### Advantages of This Setup
- **No Google Drive needed** - all data in repository
- **No manual configuration** - just set QUICK_TEST and run
- **Automatic extraction** - data extracted from included archives
- **Simple and fast** - fewer steps, less setup time

### Troubleshooting

**Out of Memory Error:**
- Reduce batch size in config file
- Use gradient accumulation
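
Gradient accumulation trades time for memory: several small micro-batches are processed one after another, their gradients are combined, and only then is one optimizer step taken. In PyTorch Lightning this is controlled by the `accumulate_grad_batches` argument of `Trainer`; whether this project's config file exposes that setting is not shown here. The framework-free sketch below, using a hypothetical one-parameter linear model, illustrates why the result matches a single large batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average per-micro-batch gradients, as gradient accumulation does.

    Assumes len(xs) is a multiple of micro_batch_size so that every
    micro-batch carries equal weight.
    """
    starts = range(0, len(xs), micro_batch_size)
    total = 0.0
    for i in starts:
        total += grad_mse(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
    return total / len(starts)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)              # one batch of 4
accum = accumulated_grad(0.5, xs, ys, 2)  # two micro-batches of 2
print(full, accum)
```

Both values agree, so halving the batch size and accumulating over two steps keeps the same effective batch while holding fewer samples in memory at once.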

**GPU not available:**
- Runtime → Change runtime type → GPU
- May need to wait for GPU allocation

**Data extraction fails:**
- Check that repository cloned successfully
- Verify `colab_data_archives/` folder exists

---

**Updated**: 2025-12-02  
**Project**: Sullivan - Acoustic-to-Articulatory Inversion  
**Phase**: 2-B Transformer Training  
**Setup**: Simplified - No Google Drive required!