# BOM VAE v15 - LBO Compliant Training

**Features:**
- Pure LBO with Directive #4 rollback mechanism
- Directive #6: Natural adaptive squeeze (LBO's infinite gradient automatically pushes all groups → 1.0)
- Behavioral disentanglement testing (core→structure, detail→appearance)
- 35 epochs, L4 GPU optimized (batch_size=256)
- NO epsilon, NO softmin, NO clamps on goals - pure `-log(min())`
- NO manual recalibration - scales are set once at epoch 1; any later tightening is automatic
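
As a rough illustration of the pure `-log(min())` objective, here is a minimal framework-free sketch. The function name `lbo_loss` and the score values are hypothetical and not taken from `train.py`; the sketch only assumes each group reports a score in (0, 1]. Because the derivative of `-log(s)` with respect to `s` is `-1/s`, the gradient grows without bound as the weakest score approaches 0, which is the "infinite gradient" squeeze mentioned above.

```python
import math

def lbo_loss(group_scores):
    """Return -log of the weakest group score (hypothetical helper, not the repo's API)."""
    s_min = min(group_scores.values())
    if s_min <= 0.0:
        # No epsilon and no clamp: a non-positive score is treated as a
        # constraint violation rather than being patched into a valid number.
        raise ValueError(f"S_min = {s_min} violates the barrier (must be > 0)")
    return -math.log(s_min)

# Hypothetical per-group scores for one batch.
scores = {"recon": 0.62, "core": 0.80, "realism": 0.71}
print(f"loss = {lbo_loss(scores):.4f}")  # driven entirely by the weakest group
```

Only the weakest group contributes to the loss, so improving any other group does not lower it until the bottleneck itself moves.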

In [None]:
# Check GPU
!nvidia-smi

In [None]:
# Clone repository and checkout branch
!git clone https://github.com/caseymrobbins/bom_vea.git
%cd bom_vea
!git checkout claude/loss-function-goals-kRaQb
!git log -1 --oneline

In [None]:
# Install dependencies
!pip install -q torch torchvision tqdm pillow scikit-image

In [None]:
# Download and setup CelebA dataset
!python celeba_setup.py

## Training Configuration

Current settings:
- **Epochs**: 35 (max - may stop early)
- **Batch Size**: 256 (L4 GPU)
- **Learning Rate**: 1e-3 (VAE), 1e-4 (Discriminator)
- **Latent Dim**: 128 (16 core, 112 detail)
- **Calibration**: Epoch 1 only (sets initial scales)
- **Adaptive Tightening**: Starts at epoch 5; tightens 5% per epoch until the rollback rate reaches 5%, then runs 1 more epoch and stops

In [None]:
# Start training
!python train.py

## Expected Behavior

### LBO Rollback Mechanism
If you see `[ROLLBACK]` messages, this is **expected and correct**:
- Optimizer attempted a step that would violate constraints (S_min ≤ 0)
- System detected violation before crash
- State restored, update rejected
- Training continues safely
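
The four steps above can be sketched as a snapshot-and-restore wrapper around one update. This is a toy illustration under stated assumptions, not the code in `train.py`: `guarded_step`, `propose_update`, `score_fn`, and the toy model are all hypothetical names, and the exact wording of the log line after the `[ROLLBACK]` tag is invented.

```python
import copy

def guarded_step(params, propose_update, score_fn):
    """Apply one update, keeping it only if every group score stays positive.

    Returns (params, rolled_back). All names here are illustrative.
    """
    snapshot = copy.deepcopy(params)      # state to restore on violation
    candidate = propose_update(params)    # tentative step (may mutate params)
    s_min = min(score_fn(candidate).values())
    if not s_min > 0.0:                   # also catches NaN
        print(f"[ROLLBACK] S_min={s_min:.4f} is not positive; update rejected")
        return snapshot, True
    return candidate, False

# Toy model: one weight, two group scores that are both positive for 0 < w < 1.
def toy_scores(p):
    return {"recon": p["w"], "core": 1.0 - p["w"]}

def toy_update(p):
    p["w"] += 0.25
    return p

state = {"w": 0.5}
state, rolled_back = guarded_step(state, toy_update, toy_scores)  # w = 0.75, accepted
state, rolled_back = guarded_step(state, toy_update, toy_scores)  # w = 1.0 gives core = 0, rejected
print(state, rolled_back)
```

In a PyTorch training loop the snapshot would presumably cover both the model and optimizer state, since a rejected step should not leave stale optimizer statistics behind.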

### Adaptive Tightening & Early Stopping (LBO Directive #6)
The system automatically tightens constraints and stops when optimal:
- **Epochs 1-4**: Initial calibration and convergence
- **Epochs 5+**: Automatic tightening begins (5% per epoch)
  - Tightens MINIMIZE_SOFT scales → goals harder to achieve
  - Narrows BOX bounds → stricter constraints
- **At 5% rollback rate**: Constraints are optimally tight
  - System runs 1 more epoch for stability
  - Training stops automatically (may be before epoch 35)
- **🏁 STOPPING message**: Indicates successful convergence

### How Adaptive Tightening Works
1. **Monitor rollback rate**: `rollbacks / total_batches`
2. **While rate < 5%**: Tighten all constraints by 5%
3. **When rate ≥ 5%**: Stop tightening (constraints at limit)
4. **After 1 more epoch**: Stop training (optimal convergence reached)

This stops training once the constraints are as tight as they can get while rejected updates stay at roughly 5% of batches.
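
The schedule can be sketched by replaying a list of per-epoch rollback rates. This is an illustration only: `run_schedule` and its parameters are hypothetical names, and it assumes that "tighten by 5%" means multiplying a single scale factor by 0.95 and that the decision is taken at the end of each epoch from epoch 5 onward. The actual logic in `train.py` may differ on both points.

```python
def rollback_rate(rollbacks, total_batches):
    """Fraction of batches whose update was rejected in one epoch."""
    return rollbacks / total_batches

def run_schedule(rates, start_epoch=5, step=0.05, target=0.05, max_epochs=35):
    """Replay per-epoch rollback rates; return (scale used in each epoch, stop epoch)."""
    scale = 1.0
    scales = []          # scale in effect during each epoch
    stop_after = None    # set once the target rollback rate is reached
    for epoch, rate in enumerate(rates[:max_epochs], start=1):
        scales.append(scale)
        if stop_after is not None and epoch >= stop_after:
            return scales, epoch             # extra stability epoch is done
        if epoch >= start_epoch and stop_after is None:
            if rate >= target:
                stop_after = epoch + 1       # stop tightening, run one more epoch
            else:
                scale *= 1.0 - step          # tighten for the next epoch
    return scales, len(scales)

# Hypothetical rollback rates for epochs 1..10.
rates = [0.0, 0.0, 0.0, 0.0, 0.01, 0.03, 0.06, 0.05, 0.05, 0.05]
scales, stop_epoch = run_schedule(rates)
print(stop_epoch, [round(s, 4) for s in scales])
```

With these made-up rates the scale is tightened after epochs 5 and 6, the 5% threshold is crossed in epoch 7, and the run ends after epoch 8.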

### Bottleneck Tracking
Watch the bottleneck percentages - they should shift naturally:
- Early: Recon typically dominates (hardest to perfect)
- Mid: Competition between recon, core, realism
- Late: More balanced as all groups approach 1.0
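
One way to read the bottleneck percentages is as the share of batches in which each group had the lowest score. The sketch below assumes that interpretation; `bottleneck_percentages` and the sample scores are hypothetical and not taken from the training script.

```python
from collections import Counter

def bottleneck_percentages(batch_scores):
    """Percentage of batches in which each group was the weakest (illustrative helper)."""
    if not batch_scores:
        return {}
    # For each batch, find the group with the lowest score.
    counts = Counter(min(scores, key=scores.get) for scores in batch_scores)
    total = len(batch_scores)
    return {group: 100.0 * n / total for group, n in counts.items()}

# Hypothetical scores from four early-training batches.
early = [
    {"recon": 0.30, "core": 0.55, "realism": 0.60},
    {"recon": 0.32, "core": 0.50, "realism": 0.58},
    {"recon": 0.41, "core": 0.39, "realism": 0.62},
    {"recon": 0.35, "core": 0.52, "realism": 0.61},
]
print(bottleneck_percentages(early))
```

A group that never appears as the minimum is simply absent from the result, which matches the early-training pattern where recon dominates.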

### Key Metrics
- **Loss**: Should decrease steadily (negative log, so lower is better)
- **Min Group**: Should increase toward 1.0 (all constraints satisfied)
- **SSIM**: Should increase toward 1.0 (perfect reconstruction)
- **KL**: Should stabilize within the clamp range of [0, 10] per dimension
- **Rollback Rate**: Should gradually increase as constraints tighten, stopping at ~5%

In [None]:
# View outputs
from IPython.display import Image, display
import os

output_dir = '/content/outputs_bom_v15'

# (section title, image file) pairs expected in output_dir
figures = [
    ("Reconstructions", "reconstructions.png"),
    ("Core Traversals (Structure)", "traversals_core.png"),
    ("Detail Traversals (Appearance)", "traversals_detail.png"),
    ("Cross Reconstruction (Core/Detail Swap)", "cross_reconstruction.png"),
    ("Group Balance Over Time", "group_balance.png"),
]

if os.path.exists(output_dir):
    for title, filename in figures:
        print(f"\n=== {title} ===")
        path = os.path.join(output_dir, filename)
        # Skip images that were not produced (e.g. training stopped early)
        if os.path.exists(path):
            display(Image(filename=path))
else:
    print(f"Output directory not found: {output_dir}")

## Troubleshooting

### Training crashes immediately
- Check if S_min ≤ 0 violation is being caught
- Look for `[ROLLBACK]` messages
- If crashing with `log(0)`, the pre-log check may have a bug

### Too many rollbacks (>50% of batches)
- Learning rate may be too high
- Consider reducing LR: 1e-3 → 5e-4
- Or widen initial BOX constraints

### Groups not improving
- Check if bottleneck shifts between groups (healthy sign)
- Loss should decrease even if some groups plateau temporarily
- Natural squeeze can be slower than manual tightening but more stable

### OOM (Out of Memory)
- Reduce batch size: 256 → 128
- L4 has 24GB, should handle 256 fine
- If using A100, can increase to 512

### Disentangle always at 1.0
- Expected at init (collapsed encoder)
- Should drop as latent space becomes expressive
- If still 1.0 at epoch 20+, encoder may not be learning