# Project Sullivan - Transformer Model Training

**Phase 2-B: Advanced Architecture Training on Google Colab**

This notebook trains the Transformer model for acoustic-to-articulatory inversion.

---

## 📋 Prerequisites

1. Upload compressed data to Google Drive:
   - `processed_data_all.tar.gz` (78MB) - Recommended (all data in one file)
   - OR individual files:
     - `audio_features.tar.gz` (48MB)
     - `parameters.tar.gz` (11MB)
     - `segmentations.tar.gz` (19MB)
     - `splits.tar.gz` (4KB)

2. Get a shareable link for each file, with access set to "Anyone with the link" (Viewer)

3. Extract file IDs from URLs:
   - URL format: `https://drive.google.com/file/d/FILE_ID/view`
   - Copy the `FILE_ID` part

4. Update `GDRIVE_FILE_ID_ALL` (or the entries of `GDRIVE_FILE_IDS`) in the Configuration cell below
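
Copying the ID by hand is easy to get wrong, so step 3 can also be done programmatically. The following is a minimal sketch, assuming the share link has the `/file/d/FILE_ID/` shape shown above; `extract_file_id` and the example URL are hypothetical names used only for illustration.

```python
import re
from typing import Optional

# Matches the ID segment of a ".../file/d/FILE_ID/..." share link.
_FILE_ID_PATTERN = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def extract_file_id(share_url: str) -> Optional[str]:
    """Return the FILE_ID from a Drive share URL, or None if the URL has another shape."""
    match = _FILE_ID_PATTERN.search(share_url)
    return match.group(1) if match else None


# Hypothetical link used only for illustration; paste your own share URL instead.
example_url = "https://drive.google.com/file/d/EXAMPLE_FILE_ID/view?usp=sharing"
print(extract_file_id(example_url))  # EXAMPLE_FILE_ID
```

A `None` result usually means the link was copied in a different format, in which case the ID has to be located manually.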

---

## 🚀 Runtime Settings

**IMPORTANT**: Change runtime to GPU
- Menu: Runtime → Change runtime type
- Hardware accelerator: **GPU** (T4 recommended)
- Click Save


## ⚙️ Configuration

Update these variables with your Google Drive file IDs and GitHub repository:

In [None]:
# ============================================
# Google Drive File IDs
# ============================================
# Replace 'YOUR_FILE_ID_HERE' with actual file ID from Google Drive

# Option 1: Use combined archive (recommended)
GDRIVE_FILE_ID_ALL = 'YOUR_FILE_ID_HERE'  # processed_data_all.tar.gz
USE_COMBINED_ARCHIVE = True

# Option 2: Use individual archives (if combined fails)
GDRIVE_FILE_IDS = {
    'audio_features': 'YOUR_FILE_ID_HERE',
    'parameters': 'YOUR_FILE_ID_HERE',
    'segmentations': 'YOUR_FILE_ID_HERE',
    'splits': 'YOUR_FILE_ID_HERE'
}

# ============================================
# Training Configuration
# ============================================
QUICK_TEST = False  # Set to True for 10-epoch validation test
CONFIG_FILE = 'configs/transformer_config.yaml' if not QUICK_TEST else 'configs/transformer_quick_test.yaml'

# GitHub Repository
GITHUB_REPO = 'YOUR_GITHUB_USERNAME/Project_Sullivan'  # Update with your repo
BRANCH = 'main'
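
Forgetting to replace a placeholder only shows up later as a failed download, so it may help to check the configuration first. This is a sketch under the assumption that unset IDs still equal the literal `'YOUR_FILE_ID_HERE'`; the helper `unset_file_ids` is hypothetical and not part of the repository.

```python
PLACEHOLDER = "YOUR_FILE_ID_HERE"


def unset_file_ids(use_combined, combined_id, individual_ids):
    """Return the names of the file IDs that still hold the placeholder value."""
    if use_combined:
        return ["GDRIVE_FILE_ID_ALL"] if combined_id == PLACEHOLDER else []
    return [name for name, file_id in individual_ids.items() if file_id == PLACEHOLDER]


# Example values mirroring the configuration cell; the real IDs are yours to fill in.
missing = unset_file_ids(
    use_combined=False,
    combined_id=PLACEHOLDER,
    individual_ids={"audio_features": "abc", "parameters": PLACEHOLDER},
)
print(missing)  # ['parameters']
```

In the notebook the call would be `unset_file_ids(USE_COMBINED_ARCHIVE, GDRIVE_FILE_ID_ALL, GDRIVE_FILE_IDS)`; an empty list means every ID in use has been filled in.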

## 🔧 Setup Environment

In [None]:
# Check GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ WARNING: GPU not available! Training will be slow.")
    print("Please change runtime: Runtime → Change runtime type → GPU")

In [None]:
# Clone GitHub repository
import os

# Skip the clone/cd when the cell is re-run from inside the checkout
if os.path.basename(os.getcwd()) != 'Project_Sullivan':
    if not os.path.exists('Project_Sullivan'):
        !git clone https://github.com/{GITHUB_REPO}.git
    %cd Project_Sullivan

!git checkout {BRANCH}
!git pull origin {BRANCH}

print("\n✅ Repository ready!")
!pwd

In [None]:
# Install dependencies
print("📦 Installing dependencies...\n")
!pip install -q torch torchvision torchaudio
!pip install -q pytorch-lightning tensorboard
!pip install -q librosa soundfile
!pip install -q numpy scipy pandas matplotlib seaborn
!pip install -q pyyaml tqdm
!pip install -q gdown  # For Google Drive downloads

print("\n✅ Dependencies installed!")

## 📥 Download Data from Google Drive

In [None]:
# Download and extract data
import gdown
import tarfile
import os

# Create data directory
os.makedirs('data/processed', exist_ok=True)

def download_and_extract(file_id, archive_path, dest='data/processed'):
    """Download one archive from Google Drive, extract it into dest, then delete it."""
    url = f'https://drive.google.com/uc?id={file_id}'
    # Depending on the gdown version, a failed download either raises or returns None
    if gdown.download(url, archive_path, quiet=False) is None:
        raise RuntimeError(f"Download failed for {archive_path}; check the file ID and sharing settings.")

    # Extract
    print(f"\n📦 Extracting {archive_path}...")
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(dest)

    # Cleanup
    os.remove(archive_path)

if USE_COMBINED_ARCHIVE:
    print("📥 Downloading combined data archive...\n")
    download_and_extract(GDRIVE_FILE_ID_ALL, 'processed_data_all.tar.gz')
    print("✅ Data extracted!")
else:
    print("📥 Downloading individual archives...\n")
    for name, file_id in GDRIVE_FILE_IDS.items():
        print(f"\nDownloading {name}...")
        download_and_extract(file_id, f'{name}.tar.gz')
    print("\n✅ All data extracted!")

# Verify data
print("\n📊 Data verification:")
!ls -lh data/processed/
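
The `ls` listing has to be checked by eye. A programmatic check could look like the sketch below, which assumes each archive unpacks into a directory named after it (`audio_features`, `parameters`, `segmentations`, `splits`); that layout is inferred from the archive names and may not match the actual data, and `missing_data_dirs` is a hypothetical helper.

```python
import os
import tempfile

# Assumed layout: one directory per archive name listed in the prerequisites.
EXPECTED_DIRS = ("audio_features", "parameters", "segmentations", "splits")


def missing_data_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected subdirectories that are absent or empty under root."""
    missing = []
    for name in expected:
        path = os.path.join(root, name)
        if not os.path.isdir(path) or not os.listdir(path):
            missing.append(name)
    return missing


# Self-contained demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "audio_features"))
    open(os.path.join(tmp, "audio_features", "utt0.npy"), "w").close()
    os.makedirs(os.path.join(tmp, "splits"))  # present but empty
    demo_missing = missing_data_dirs(tmp)
print(demo_missing)  # ['parameters', 'segmentations', 'splits']
```

In the notebook, `missing_data_dirs('data/processed')` returning an empty list would indicate that every expected directory exists and contains at least one entry.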

## 🏋️ Model Training

In [None]:
# Start training
print(f"🚀 Starting Transformer training ({CONFIG_FILE})...\n")
print(f"Quick test mode: {QUICK_TEST}")
print(f"GPU available: {torch.cuda.is_available()}\n")

# Run training script
!python scripts/train_transformer.py \
    --config {CONFIG_FILE} \
    --gpus 1

## 📊 Monitor Training with TensorBoard

In [None]:
# Load TensorBoard
%load_ext tensorboard
%tensorboard --logdir logs/training/

## 📈 View Training Results

In [None]:
# Check training logs
import glob
import os

log_dirs = glob.glob('logs/training/*/')
if log_dirs:
    # Use modification time: a lexicographic sort can misorder run names (e.g. version_10 before version_9)
    latest_log = max(log_dirs, key=os.path.getmtime)
    print(f"📁 Latest training run: {latest_log}\n")

    # Show metrics
    metrics_file = os.path.join(latest_log, 'metrics.csv')
    if os.path.exists(metrics_file):
        import pandas as pd
        df = pd.read_csv(metrics_file)
        print("📊 Training Metrics:")
        print(df.tail(10))

    # List checkpoints
    checkpoints = glob.glob(os.path.join(latest_log, 'checkpoints', '*.ckpt'))
    if checkpoints:
        print(f"\n💾 Checkpoints ({len(checkpoints)} found):")
        for ckpt in sorted(checkpoints)[-3:]:
            print(f"   - {os.path.basename(ckpt)}")
else:
    print("No training logs found.")
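
The last ten rows of `metrics.csv` do not necessarily include the best epoch. CSV loggers such as PyTorch Lightning's `CSVLogger` typically write training and validation metrics on separate rows, leaving the other columns blank. The sketch below picks the row with the lowest validation loss from such a file. The column name `val_loss` and the numbers are assumptions made for illustration; the real names depend on what the training script logs.

```python
import csv
import io


def best_row(csv_text, metric="val_loss"):
    """Return the row with the lowest value of `metric`, skipping rows where it is blank."""
    rows = [r for r in csv.DictReader(io.StringIO(csv_text)) if r.get(metric)]
    return min(rows, key=lambda r: float(r[metric])) if rows else None


# Made-up numbers in the sparse layout described above.
sample = """epoch,step,train_loss,val_loss
0,99,0.91,
0,99,,0.80
1,199,0.62,
1,199,,0.55
2,299,0.50,
2,299,,0.58
"""
best = best_row(sample)
print(best["epoch"], best["val_loss"])  # 1 0.55
```

For a real run, the file contents could be passed in with `best_row(open(metrics_file).read())`.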

## 💾 Download Results

In [None]:
# Create archive of results
import shutil
from datetime import datetime

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
archive_name = f'transformer_results_{timestamp}'

# Copy important files
os.makedirs(archive_name, exist_ok=True)

# Copy logs
if log_dirs:
    shutil.copytree(latest_log, os.path.join(archive_name, 'logs'))

# Create zip
shutil.make_archive(archive_name, 'zip', archive_name)

print(f"\n📦 Results archived: {archive_name}.zip")
print(f"\n📥 Download using Files panel (left sidebar)")
print(f"   Or run: !cp {archive_name}.zip /content/drive/MyDrive/")

## 💾 Save to Google Drive (Optional)

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Copy results to Drive
drive_path = '/content/drive/MyDrive/Project_Sullivan_Results/'
os.makedirs(drive_path, exist_ok=True)

!cp {archive_name}.zip {drive_path}

print(f"\n✅ Results saved to Google Drive: {drive_path}")

---

## 📝 Notes

### Expected Training Time (T4 GPU)
- Quick test (10 epochs): ~20-30 minutes
- Full training (50 epochs): ~2-3 hours

### Expected Performance
- **Target RMSE**: 0.20-0.30 (3-5× lower than the baseline LSTM)
- **Target PCC**: 0.30-0.45 (3-4× higher than the baseline LSTM)
- **Baseline LSTM**: RMSE 1.011, PCC 0.105
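
For reference, the two metrics can be computed as in the sketch below, using the standard definitions of root-mean-square error and the Pearson correlation coefficient. How the project aggregates them across articulatory parameters and utterances is not stated here, so treat this as an illustration of the definitions rather than the project's evaluation code; the sample values are made up.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pcc(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Toy trajectory for one parameter: the prediction is offset by a constant 0.5.
target = [0.0, 1.0, 2.0, 3.0]
pred = [0.5, 1.5, 2.5, 3.5]
print(rmse(pred, target))  # 0.5
print(pcc(pred, target))   # 1.0
```

The toy case shows why both numbers are reported: a constant offset leaves the correlation perfect while the RMSE stays nonzero. As a check on the ratios, 1.011 / 0.30 ≈ 3.4 and 1.011 / 0.20 ≈ 5.1 for RMSE, and 0.30 / 0.105 ≈ 2.9 and 0.45 / 0.105 ≈ 4.3 for PCC.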

### Troubleshooting

**Out of Memory Error:**
- Reduce batch size in config file
- Use gradient accumulation
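
Gradient accumulation lets a smaller batch fit in memory while keeping the same effective batch size. The sketch below illustrates the idea in plain Python on a one-parameter linear model, not on the project's Transformer: averaging the gradients of equally sized micro-batches reproduces the gradient of the full batch. In PyTorch Lightning this is usually enabled through the `Trainer` argument `accumulate_grad_batches`; whether this project's config file exposes that option is an assumption to verify.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Gradient from one batch of 4 samples.
full_batch_grad = mse_grad(w, xs, ys)

# Same gradient from 2 micro-batches of 2 samples: each micro-batch gradient is
# scaled by 1/accum_steps and summed before the single optimizer step.
accum_steps = 2
micro = len(xs) // accum_steps
accumulated_grad = 0.0
for i in range(accum_steps):
    batch_x = xs[i * micro:(i + 1) * micro]
    batch_y = ys[i * micro:(i + 1) * micro]
    accumulated_grad += mse_grad(w, batch_x, batch_y) / accum_steps

print(full_batch_grad, accumulated_grad)  # -22.5 -22.5
```

Halving the batch size and setting the accumulation factor to 2 therefore keeps the update comparable, at the cost of more forward and backward passes per optimizer step.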

**GPU not available:**
- Runtime → Change runtime type → GPU
- May need to wait for GPU allocation

**Download fails:**
- Check file IDs are correct
- Ensure sharing is set to "Anyone with link"
- Try individual archives instead of combined
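
When sharing is misconfigured, a download can in some cases leave an HTML error page on disk instead of the archive, which then fails at the extraction step with a confusing error. A quick check with the standard library's `tarfile.is_tarfile` can tell the two apart. This is a sketch; `looks_like_archive` is a hypothetical helper and the files are created only for the demonstration.

```python
import io
import os
import tarfile
import tempfile


def looks_like_archive(path):
    """Return True if path exists and can be opened as a tar archive."""
    return os.path.isfile(path) and tarfile.is_tarfile(path)


# Self-contained demonstration with two temporary files.
with tempfile.TemporaryDirectory() as tmp:
    # An HTML page saved under an archive name, as can happen when access is denied.
    html_path = os.path.join(tmp, "denied.tar.gz")
    with open(html_path, "w") as f:
        f.write("<html><body>Sign in</body></html>")

    # A genuine gzip-compressed tar with one small member.
    good_path = os.path.join(tmp, "good.tar.gz")
    with tarfile.open(good_path, "w:gz") as tar:
        payload = b"train\n"
        info = tarfile.TarInfo(name="splits/train.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    html_ok = looks_like_archive(html_path)
    good_ok = looks_like_archive(good_path)

print(html_ok, good_ok)  # False True
```

If the check returns `False` for a downloaded file, opening it in a text editor usually shows whether Drive returned a sign-in or quota page.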

---

**Generated**: 2025-12-01  
**Project**: Sullivan - Acoustic-to-Articulatory Inversion  
**Phase**: 2-B Transformer Training
