# BOM VAE v15 - LBO Compliant Training

**Features:**
- Pure LBO with Directive #4 rollback mechanism
- Directive #6: Adaptive squeeze starting at plateau (epochs 5, 8, 11, 14, 17)
- Behavioral disentanglement testing (core→structure, detail→appearance)
- 35 epochs, L4 GPU optimized (batch_size=256)
- NO epsilon, NO softmin, NO clamps on goals - pure `-log(min())`
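
The `-log(min())` objective in the last bullet can be sketched in a few lines. This is an illustration in plain Python, not the repository's implementation: the function name `lbo_loss`, the group names, and the example scores are all assumptions. Scores are assumed to lie in (0, 1], and a non-positive minimum raises instead of being patched with an epsilon or a clamp.

```python
import math

def lbo_loss(group_scores):
    """Return (-log(min score), name of the bottleneck group).

    `group_scores` maps a hypothetical group name to a score in (0, 1].
    No epsilon, softmin, or clamp is applied.
    """
    name, s_min = min(group_scores.items(), key=lambda kv: kv[1])
    if s_min <= 0:
        # log is undefined here; a caller would reject the step instead
        raise ValueError(f"S_min <= 0 for group {name!r}")
    return -math.log(s_min), name

# Made-up scores for illustration only
scores = {"recon": 0.5, "core": 0.8, "swap": 0.9}
loss, bottleneck = lbo_loss(scores)
print(f"loss={loss:.4f} bottleneck={bottleneck}")
```

Only the lowest-scoring group contributes to the loss value, which is why the notebook tracks which group is the bottleneck.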

In [None]:
# Check GPU
!nvidia-smi

In [None]:
# Clone repository and checkout branch
!git clone https://github.com/caseymrobbins/bom_vea.git
%cd bom_vea
!git checkout claude/loss-function-goals-kRaQb
!git log -1 --oneline

In [None]:
# Install dependencies
!pip install -q torch torchvision tqdm pillow scikit-image

In [None]:
# Download and setup CelebA dataset
!python celeba_setup.py

## Training Configuration

Current settings:
- **Epochs**: 35
- **Batch Size**: 256 (L4 GPU)
- **Learning Rate**: 1e-3 (VAE), 1e-4 (Discriminator)
- **Latent Dim**: 128 (16 core, 112 detail)
- **Progressive Tightening**: Epochs 5, 8, 11, 14, 17 (starts at plateau)
- **Stable Convergence**: Epochs 18-35 (18 epochs for final refinement)
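
For reference, the settings above could be collected in one place roughly as follows. The class name `TrainConfig` and its field names are hypothetical and may not match how `train.py` stores its configuration; only the values come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values from the settings list; field names are illustrative
    epochs: int = 35
    batch_size: int = 256
    lr_vae: float = 1e-3
    lr_disc: float = 1e-4
    latent_dim: int = 128
    core_dim: int = 16
    detail_dim: int = 112
    tighten_epochs: tuple = (5, 8, 11, 14, 17)

    def __post_init__(self):
        # The core and detail slices must cover the latent vector exactly
        if self.core_dim + self.detail_dim != self.latent_dim:
            raise ValueError("core_dim + detail_dim must equal latent_dim")
        # Every tightening step must happen before training ends
        if max(self.tighten_epochs) >= self.epochs:
            raise ValueError("tightening must finish before the last epoch")

    def split(self, z):
        """Split a latent vector into (core, detail) parts."""
        return z[:self.core_dim], z[self.core_dim:]

cfg = TrainConfig()
core, detail = cfg.split(list(range(cfg.latent_dim)))
print(len(core), len(detail))
```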

In [None]:
# Start training
!python train.py

## Expected Behavior

### LBO Rollback Mechanism
`[ROLLBACK]` messages are **expected and correct**. Each one means:
- The optimizer attempted a step that would violate the constraints (S_min ≤ 0)
- The system detected the violation before a crash
- The previous state was restored and the update rejected
- Training continues safely
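
The snapshot, check, and restore sequence described above can be sketched with a toy parameter dictionary. This is not the code in `train.py`: `guarded_step`, the toy score function, and the update sizes are assumptions chosen to show one accepted step and one rejected step.

```python
import copy

def guarded_step(params, propose_update, min_score):
    """Apply an in-place update; roll it back if S_min <= 0 afterwards."""
    snapshot = copy.deepcopy(params)   # state saved before the step
    propose_update(params)             # stand-in for an optimizer step
    if min_score(params) <= 0:
        params.clear()
        params.update(snapshot)        # restore the saved state
        print("[ROLLBACK] S_min <= 0, update rejected")
        return False
    return True

def toy_score(p):
    # Hypothetical score: positive only while |w| < 1
    return 1.0 - abs(p["w"])

params = {"w": 0.5}
ok_small = guarded_step(params, lambda p: p.update(w=p["w"] + 0.2), toy_score)
ok_large = guarded_step(params, lambda p: p.update(w=p["w"] + 5.0), toy_score)
print(ok_small, ok_large, params)
```

The second update pushes the toy score below zero, so it is discarded and `params` keeps the value from the first step.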

### Bottleneck Tracking
Watch the bottleneck percentages:
- **Epochs 1-4**: Initial convergence, recon usually dominates (60-80%)
- **Epoch 5**: Recon tightened → should be bottleneck 85%+
- **Epochs 6-7**: System adapts to new recon baseline
- **Epoch 8**: Core tightened → Core/Recon compete
- **Epochs 9-10**: Adaptation
- **Epoch 11**: Swap tightened
- **Epoch 14**: Realism tightened
- **Epoch 17**: Disentangle tightened (final squeeze)
- **Epochs 18-35**: Long stable convergence, all groups balanced
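
The schedule above can be written down as a lookup table, together with one possible way to compute bottleneck percentages. The sketch assumes a bottleneck percentage is the share of batches in which a group had the lowest score; the notebook does not define it, and the helper names and sample scores are hypothetical.

```python
from collections import Counter

# Epoch -> group tightened at that epoch, as listed above
TIGHTEN_SCHEDULE = {
    5: "recon",
    8: "core",
    11: "swap",
    14: "realism",
    17: "disentangle",
}

def tightened_by(epoch):
    """Groups that have been tightened on or before `epoch`."""
    return [g for e, g in sorted(TIGHTEN_SCHEDULE.items()) if e <= epoch]

def bottleneck_share(per_batch_scores):
    """Percentage of batches in which each group had the lowest score."""
    counts = Counter(min(s, key=s.get) for s in per_batch_scores)
    total = sum(counts.values())
    return {g: 100.0 * n / total for g, n in counts.items()}

# Made-up per-batch scores for illustration only
batches = [
    {"recon": 0.4, "core": 0.6},
    {"recon": 0.5, "core": 0.7},
    {"recon": 0.9, "core": 0.3},
    {"recon": 0.2, "core": 0.8},
]
share = bottleneck_share(batches)
print(tightened_by(11), share)
```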

### Key Metrics
- **Loss**: Should decrease. The loss is `-log(min())`, so it falls as the weakest group score rises
- **Min Group**: Should increase toward 1.0 (all constraints satisfied)
- **SSIM**: Should increase toward 1.0 (perfect reconstruction)
- **KL**: Should stabilize within the clamp range [0, 10] per latent dimension
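
To make the KL range concrete, the sketch below computes the per-dimension KL divergence of a diagonal Gaussian from a standard normal and checks it against [0, 10]. The closed-form KL is standard; that the repository computes KL this way is an assumption, as are the helper names `kl_per_dim` and `within_clamps`.

```python
import math

KL_MIN, KL_MAX = 0.0, 10.0  # per-dimension range quoted above

def kl_per_dim(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)) for each latent dimension."""
    return [0.5 * (m * m + math.exp(lv) - 1.0 - lv)
            for m, lv in zip(mu, logvar)]

def within_clamps(kl_values):
    """True if every per-dimension KL lies inside [KL_MIN, KL_MAX]."""
    return all(KL_MIN <= k <= KL_MAX for k in kl_values)

# Made-up encoder outputs for two latent dimensions
kl = kl_per_dim(mu=[0.0, 1.0], logvar=[0.0, 0.0])
print(kl, within_clamps(kl))
```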

In [None]:
# View outputs
from IPython.display import Image, display
import os

output_dir = '/content/outputs_bom_v15'

# (section title, image file) pairs, displayed in this order
panels = [
    ("Reconstructions", "reconstructions.png"),
    ("Core Traversals (Structure)", "traversals_core.png"),
    ("Detail Traversals (Appearance)", "traversals_detail.png"),
    ("Cross Reconstruction (Core/Detail Swap)", "cross_reconstruction.png"),
    ("Group Balance Over Time", "group_balance.png"),
]

if os.path.exists(output_dir):
    for title, fname in panels:
        print(f"\n=== {title} ===")
        path = os.path.join(output_dir, fname)
        # Skip images that training has not written yet
        if os.path.exists(path):
            display(Image(filename=path))
else:
    print(f"Output directory not found: {output_dir}")

## Troubleshooting

### Training crashes immediately
- Check whether the S_min ≤ 0 violation is being caught
- Look for `[ROLLBACK]` messages in the output
- If the crash is a `log(0)` error, the pre-log check may have a bug

### Too many rollbacks (>50% of batches)
- Learning rate may be too high
- Consider reducing LR: 1e-3 → 5e-4
- Or widen initial BOX constraints
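
A rough way to apply the 50% rule is to count `[ROLLBACK]` lines in captured training output. The sketch assumes one `[ROLLBACK]` line per rejected batch, which the notebook does not confirm. The helper names and the sample log lines are made up for illustration.

```python
def rollback_rate(log_lines, batches_seen):
    """Fraction of batches rolled back, assuming one [ROLLBACK] line each."""
    rollbacks = sum("[ROLLBACK]" in line for line in log_lines)
    return rollbacks / batches_seen

def suggest_lr(lr, rate, threshold=0.5):
    """Halve the learning rate when the rollback rate exceeds the threshold."""
    return lr / 2 if rate > threshold else lr

# Hypothetical log excerpt covering four batches
log_lines = [
    "batch 1 ok",
    "[ROLLBACK] batch 2",
    "[ROLLBACK] batch 3",
    "[ROLLBACK] batch 4",
]
rate = rollback_rate(log_lines, batches_seen=4)
new_lr = suggest_lr(1e-3, rate)
print(rate, new_lr)
```

Halving matches the 1e-3 → 5e-4 suggestion above; other reduction factors may work equally well.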

### OOM (Out of Memory)
- The L4 has 24 GB and should handle a batch size of 256
- If it still runs out of memory, reduce the batch size: 256 → 128
- On an A100, the batch size can be increased to 512

### Disentangle always at 1.0
- Expected at init (collapsed encoder)
- Should drop as latent space becomes expressive
- If still 1.0 at epoch 20+, encoder may not be learning