# NFL Big Data Bowl 2026 - Ensemble Prediction

**Best Overall: 0.540-0.541 Public LB**

This notebook demonstrates how to combine multiple models into an ensemble for best performance.

**4-Model Ensemble Components**:
| Model | Public LB | Weight |
|-------|-----------|--------|
| ST Transformer (6L) | 0.547 | 0.2517 |
| Multiscale CNN | 0.548 | 0.2517 |
| GRU (Seed 27) | 0.557 | 0.2476 |
| Position-Specific ST | 0.553 | 0.2490 |

**Contents**:
1. Ensemble Strategy
2. Weight Calculation
3. Ensemble Predictor Class
4. Simple Ensemble Function
5. Ensemble Results Summary

In [1]:
import numpy as np
import pandas as pd
import torch
from pathlib import Path
import pickle
from typing import Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')

print('Imports ready')

Imports ready


## 1. Ensemble Strategy

**Why Ensemble?**
- Different architectures capture different patterns
- Reduces variance and overfitting
- Typical improvement: 0.01-0.02 on leaderboard

**Best Practices**:
1. Use diverse architectures (Transformer, GRU, CNN)
2. Use different seeds for same architecture
3. Weight by inverse LB score (lower score = better model = higher weight)
4. Apply TTA to each model before averaging
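
The diversity point can be illustrated with synthetic data. The sketch below is not part of the competition pipeline: it assumes four hypothetical predictors whose errors are unit-variance Gaussian noise with a tunable pairwise correlation `rho`, and compares the RMSE of their equal-weight average for independent versus highly correlated errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_samples = 4, 20_000
truth = rng.normal(size=n_samples)  # synthetic targets, not competition data

def ensemble_rmse(rho: float) -> float:
    """RMSE of an equal-weight average of n_models predictors whose
    errors have unit variance and pairwise correlation rho."""
    shared = rng.normal(size=n_samples)               # error common to all models
    private = rng.normal(size=(n_models, n_samples))  # model-specific error
    errors = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private
    preds = truth + errors                            # (n_models, n_samples)
    return float(np.sqrt(np.mean((preds.mean(axis=0) - truth) ** 2)))

rmse_diverse = ensemble_rmse(rho=0.0)  # independent errors
rmse_similar = ensemble_rmse(rho=0.9)  # nearly identical models

print(f"independent errors: {rmse_diverse:.3f}")
print(f"correlated errors:  {rmse_similar:.3f}")
```

Each individual predictor has an RMSE near 1.0 by construction. In expectation the averaged error variance is `rho + (1 - rho) / n_models`, so independent errors give roughly 0.5 while `rho = 0.9` gives roughly 0.96. This is why mixing architectures tends to help more than adding near-copies of the same model.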

In [2]:
# Model configurations from actual submissions
ENSEMBLE_CONFIG = {
    'st_transformer': {
        'name': '6-Layer ST Transformer',
        'public_lb': 0.547,
        'cv_score': 0.0750,
        'n_folds': 20,
        'kaggle_dataset': '6layer-seed700-flip-only',
    },
    'multiscale_cnn': {
        'name': 'Multiscale CNN + 2L ST',
        'public_lb': 0.548,
        'cv_score': 0.0751,
        'n_folds': 20,
        'kaggle_dataset': 'st-multiscale-cnn-w10-20fold',
    },
    'gru_seed27': {
        'name': 'GRU (Seed 27)',
        'public_lb': 0.557,
        'cv_score': 0.0798,
        'n_folds': 20,
        'kaggle_dataset': 'gru-w9-seed27-20fold',
    },
    'position_st': {
        'name': 'Position-Specific ST',
        'public_lb': 0.553,
        'cv_score': 0.0750,
        'n_folds': 5,  # per position
        'kaggle_dataset': 'nfl-bdb-2026-position-st-combined',
    },
}

print('Ensemble models:')
for key, cfg in ENSEMBLE_CONFIG.items():
    print(f"  {cfg['name']}: {cfg['public_lb']} LB")

Ensemble models:
  6-Layer ST Transformer: 0.547 LB
  Multiscale CNN + 2L ST: 0.548 LB
  GRU (Seed 27): 0.557 LB
  Position-Specific ST: 0.553 LB


## 2. Weight Calculation

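Each model's weight is the reciprocal of its public LB score, normalized so that the weights sum to 1. Lower (better) scores therefore receive higher weights. Because the four scores lie within 0.010 of each other, the result is close to equal weighting.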

In [3]:
def calculate_weights(lb_scores: Dict[str, float]) -> Dict[str, float]:
    """
    Calculate ensemble weights from LB scores.
    Lower LB score = higher weight (inverse weighting)
    
    Args:
        lb_scores: Dict of model_name -> public LB score
    
    Returns:
        Dict of model_name -> normalized weight
    """
    # Inverse scores (lower is better)
    inv_scores = {k: 1.0 / v for k, v in lb_scores.items()}
    
    # Normalize to sum to 1
    total = sum(inv_scores.values())
    weights = {k: v / total for k, v in inv_scores.items()}
    
    return weights

# Calculate weights for our ensemble
lb_scores = {k: v['public_lb'] for k, v in ENSEMBLE_CONFIG.items()}
WEIGHTS = calculate_weights(lb_scores)

print('Ensemble Weights:')
for model, weight in WEIGHTS.items():
    print(f"  {model}: {weight:.4f}")
print(f"\nTotal: {sum(WEIGHTS.values()):.4f}")

Ensemble Weights:
  st_transformer: 0.2519
  multiscale_cnn: 0.2515
  gru_seed27: 0.2474
  position_st: 0.2492

Total: 1.0000
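
As a worked check of the numbers above, write $s_i$ for the public LB score of model $i$ (notation introduced here for illustration). The normalized inverse-score weight is:

$$
w_i = \frac{1/s_i}{\sum_j 1/s_j}
$$

The four reciprocals are approximately $1.8282$, $1.8248$, $1.7953$ and $1.8083$, which sum to about $7.2566$. For the ST Transformer ($s = 0.547$) this gives:

$$
w_{\text{ST}} \approx \frac{1.8282}{7.2566} \approx 0.2519
$$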


## 3. Ensemble Predictor Class

In [4]:
class EnsemblePredictor:
    """
    Ensemble multiple trajectory prediction models.
    
    Supports:
    - Multiple model types with different architectures
    - Weighted averaging based on LB scores
    - Test-Time Augmentation (TTA)
    """
    
    def __init__(self, model_configs: Dict, weights: Dict, device='cuda'):
        """
        Args:
            model_configs: Dict with model configurations
            weights: Dict with model weights
            device: 'cuda' or 'cpu'
        """
        self.configs = model_configs
        self.weights = weights
        self.device = torch.device(device)
        
        # Model storage
        self.models = {}  # model_name -> List[fold_models]
        self.scalers = {}  # model_name -> List[fold_scalers]
        
        self.loaded = False
    
    def load_models(self, base_dir: Path):
        """
        Load all ensemble models from directory.
        
        Expected structure:
        base_dir/
            st_transformer/
                model_fold1.pt, scaler_fold1.pkl, ...
            gru_seed27/
                model_fold1.pt, scaler_fold1.pkl, ...
            ...
        """
        base_dir = Path(base_dir)
        
        for model_name, cfg in self.configs.items():
            model_dir = base_dir / model_name
            if not model_dir.exists():
                print(f"Warning: {model_name} directory not found")
                continue
            
            models = []
            scalers = []
            
            for fold in range(1, cfg['n_folds'] + 1):
                model_path = model_dir / f'model_fold{fold}.pt'
                scaler_path = model_dir / f'scaler_fold{fold}.pkl'
                
                if model_path.exists():
                    # Load based on model type
                    model = self._create_model(model_name)
                    state = torch.load(model_path, map_location='cpu')
                    model.load_state_dict(state)
                    model.to(self.device)
                    model.eval()
                    models.append(model)
                    
                    if scaler_path.exists():
                        with open(scaler_path, 'rb') as f:
                            scalers.append(pickle.load(f))
                    else:
                        # Keep scalers aligned with models when a fold has no scaler
                        scalers.append(None)
            
            self.models[model_name] = models
            self.scalers[model_name] = scalers
            print(f"Loaded {model_name}: {len(models)} folds")
        
        # Mark as loaded only if at least one model has at least one fold
        self.loaded = any(self.models.values())
    
    def _create_model(self, model_name: str):
        """Create model instance based on type."""
        # Import appropriate model class based on name
        # This is a placeholder - implement based on your model classes
        raise NotImplementedError("Implement model creation for each type")
    
    def predict(self, sequences: List[np.ndarray], use_tta: bool = True) -> Tuple[np.ndarray, np.ndarray]:
        """
        Run ensemble prediction.
        
        Args:
            sequences: List of (window_size, n_features) arrays
            use_tta: Whether to use test-time augmentation
        
        Returns:
            (dx, dy) ensemble predictions
        """
        all_dx = []
        all_dy = []
        all_weights = []
        
        for model_name, models in self.models.items():
            if not models:
                continue
            
            scalers = self.scalers.get(model_name, [None] * len(models))
            weight = self.weights.get(model_name, 1.0 / len(self.models))
            
            # Average predictions across folds
            fold_preds = []
            for model, scaler in zip(models, scalers):
                def to_tensor(seqs):
                    # Scale (if this fold has a scaler) and stack into a batch tensor
                    if scaler is not None:
                        seqs = [scaler.transform(s) for s in seqs]
                    return torch.tensor(np.stack(seqs).astype(np.float32)).to(self.device)
                
                with torch.no_grad():
                    preds = model(to_tensor(sequences)).cpu().numpy()
                
                if use_tta:
                    # Flip the raw (unscaled) sequences, then scale them:
                    # 53.3 - y is only valid in field coordinates
                    flipped = [self._horizontal_flip(s) for s in sequences]
                    
                    with torch.no_grad():
                        preds_flip = model(to_tensor(flipped)).cpu().numpy()
                    
                    # Average (dx is unchanged by the flip; dy changes sign)
                    preds[:, :, 0] = (preds[:, :, 0] + preds_flip[:, :, 0]) / 2
                    preds[:, :, 1] = (preds[:, :, 1] - preds_flip[:, :, 1]) / 2
                
                fold_preds.append(preds)
            
            # Average across folds
            model_preds = np.mean(fold_preds, axis=0)
            all_dx.append(model_preds[:, :, 0] * weight)
            all_dy.append(model_preds[:, :, 1] * weight)
            all_weights.append(weight)
        
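        # Weighted sum, renormalized over the models that actually loaded
        if not all_weights:
            raise RuntimeError("No models loaded; call load_models() first")
        total_weight = sum(all_weights)
        ens_dx = sum(all_dx) / total_weight
        ens_dy = sum(all_dy) / total_weight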
        
        return ens_dx, ens_dy
    
    def _horizontal_flip(self, seq: np.ndarray, y_idx: int = 1) -> np.ndarray:
        """Mirror the y coordinate across the field width (53.3 yd) for TTA."""
        flipped = seq.copy()
        flipped[:, y_idx] = 53.3 - flipped[:, y_idx]
        return flipped

print('EnsemblePredictor class defined')

EnsemblePredictor class defined
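
The flip-and-average step inside `predict` can be checked in isolation. The sketch below uses a hypothetical `toy_model` (not one of the ensemble models) that extrapolates the last observed step and deliberately adds a spurious +0.1 bias to `dy`. It assumes features are ordered `[x, y]`. Averaging with the mirrored prediction, after negating the mirrored `dy`, cancels that bias.

```python
import numpy as np

FIELD_WIDTH = 53.3  # yards; same constant as in _horizontal_flip

def flip_y(seq: np.ndarray, y_idx: int = 1) -> np.ndarray:
    """Mirror the y coordinate across the field width."""
    out = seq.copy()
    out[:, y_idx] = FIELD_WIDTH - out[:, y_idx]
    return out

def toy_model(batch: np.ndarray, horizon: int = 3) -> np.ndarray:
    """Hypothetical stand-in model: linear extrapolation of the last step,
    plus a spurious +0.1 bias in dy that TTA should cancel."""
    step = batch[:, -1, :2] - batch[:, -2, :2]    # (N, 2)
    k = np.arange(1, horizon + 1)[None, :, None]  # (1, horizon, 1)
    preds = step[:, None, :] * k                  # (N, horizon, 2)
    preds[:, :, 1] += 0.1
    return preds

def predict_with_tta(model, sequences):
    """Average the original prediction with the un-flipped mirrored one."""
    p = model(np.stack(sequences))
    p_flip = model(np.stack([flip_y(s) for s in sequences]))
    out = np.empty_like(p)
    out[:, :, 0] = (p[:, :, 0] + p_flip[:, :, 0]) / 2  # dx is unchanged by the flip
    out[:, :, 1] = (p[:, :, 1] - p_flip[:, :, 1]) / 2  # dy changes sign under the flip
    return out

seq = np.array([[10.0, 20.0], [11.0, 21.0], [12.0, 22.0]])  # (window, [x, y])
plain = toy_model(seq[None])
tta = predict_with_tta(toy_model, [seq])

print("dy without TTA:", plain[0, :, 1])
print("dy with TTA:   ", tta[0, :, 1])
```

Note that only the `y` position is mirrored here, as in `_horizontal_flip`. If the real feature set contains other quantities that change under the mirror (for example lateral velocity or heading), those would need the matching transformation as well.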


## 4. Simple Ensemble Function

For quick use without the full class.

In [5]:
def simple_ensemble(predictions: List[np.ndarray], weights: List[float] = None) -> np.ndarray:
    """
    Simple weighted average of predictions.
    
    Args:
        predictions: List of (N, horizon, 2) prediction arrays
        weights: Optional weights (defaults to equal)
    
    Returns:
        (N, horizon, 2) ensemble predictions
    """
    if weights is None:
        weights = [1.0 / len(predictions)] * len(predictions)
    
    # Normalize weights
    weights = np.array(weights)
    weights = weights / weights.sum()
    
    # Weighted average
    ensemble = np.zeros_like(predictions[0], dtype=float)  # float accumulator, even for integer inputs
    for pred, w in zip(predictions, weights):
        ensemble += pred * w
    
    return ensemble

# Example usage
print('Example ensemble:')
pred1 = np.random.randn(100, 94, 2)  # Model 1
pred2 = np.random.randn(100, 94, 2)  # Model 2
pred3 = np.random.randn(100, 94, 2)  # Model 3

weights = [0.4, 0.35, 0.25]  # Illustrative weights (in practice, derive them from LB scores)
ensemble_pred = simple_ensemble([pred1, pred2, pred3], weights)

print(f'  Input shapes: {pred1.shape}, {pred2.shape}, {pred3.shape}')
print(f'  Output shape: {ensemble_pred.shape}')

Example ensemble:
  Input shapes: (100, 94, 2), (100, 94, 2), (100, 94, 2)
  Output shape: (100, 94, 2)
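
The same weighted average can also be written without an explicit loop. The sketch below is an alternative formulation, not the notebook's implementation: it relies on NumPy's `np.average`, which normalizes by the sum of the weights, and checks the result against a loop equivalent to `simple_ensemble`. The array shapes and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
preds = [rng.normal(size=(100, 94, 2)) for _ in range(3)]  # illustrative predictions
weights = np.array([0.4, 0.35, 0.25])

# Vectorized: stack to (n_models, N, horizon, 2) and average over the model axis
vectorized = np.average(np.stack(preds), axis=0, weights=weights)

# Loop form, equivalent to simple_ensemble
looped = np.zeros_like(preds[0])
for p, w in zip(preds, weights / weights.sum()):
    looped += p * w

print(vectorized.shape)
print(np.allclose(vectorized, looped))
```

The vectorized form holds all model predictions in memory at once, so the loop may still be preferable when each prediction array is large.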


## 5. Ensemble Results Summary

In [6]:
# Results from actual submissions
results = pd.DataFrame([
    {'Model': 'ST Transformer (6L)', 'Public LB': 0.547, 'CV Score': 0.0750, 'In Ensemble': 'Yes'},
    {'Model': 'Multiscale CNN', 'Public LB': 0.548, 'CV Score': 0.0751, 'In Ensemble': 'Yes'},
    {'Model': 'Position-Specific ST', 'Public LB': 0.553, 'CV Score': 0.0750, 'In Ensemble': 'Yes'},
    {'Model': 'GRU (Seed 27)', 'Public LB': 0.557, 'CV Score': 0.0798, 'In Ensemble': 'Yes'},
    {'Model': 'Geometric Network', 'Public LB': 0.559, 'CV Score': 0.0828, 'In Ensemble': 'Top5'},
    {'Model': '4-Model Ensemble', 'Public LB': 0.541, 'CV Score': 'N/A', 'In Ensemble': 'BEST'},
])

print('Model Performance Summary:')
print(results.to_string(index=False))

print('\n Key Insights:')
print('  - Ensemble improves ~0.006 over best single model')
print('  - Diverse architectures contribute most')
print('  - TTA adds ~0.005-0.010 improvement')
print('  - Weight by inverse LB score works well')

Model Performance Summary:
               Model  Public LB CV Score In Ensemble
 ST Transformer (6L)      0.547    0.075         Yes
      Multiscale CNN      0.548   0.0751         Yes
Position-Specific ST      0.553    0.075         Yes
       GRU (Seed 27)      0.557   0.0798         Yes
   Geometric Network      0.559   0.0828        Top5
    4-Model Ensemble      0.541      N/A        BEST

 Key Insights:
  - Ensemble improves ~0.006 over best single model
  - Diverse architectures contribute most
  - TTA adds ~0.005-0.010 improvement
  - Weight by inverse LB score works well
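
The ~0.006 figure follows directly from the table. A minimal sketch that recomputes the ensemble's gain over the best single model from the public LB scores listed above (lower is better):

```python
# Public LB scores copied from the results table (lower is better)
single_models = {
    'ST Transformer (6L)': 0.547,
    'Multiscale CNN': 0.548,
    'Position-Specific ST': 0.553,
    'GRU (Seed 27)': 0.557,
    'Geometric Network': 0.559,
}
ensemble_lb = 0.541

best_name = min(single_models, key=single_models.get)
best_lb = single_models[best_name]
abs_gain = best_lb - ensemble_lb
rel_gain = abs_gain / best_lb

print(f"Best single model: {best_name} ({best_lb})")
print(f"Absolute gain: {abs_gain:.3f}")
print(f"Relative gain: {rel_gain:.1%}")
```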


## Summary

**Ensemble Best Practices**:

1. **Diversity** - Use different architectures (Transformer, GRU, CNN)
2. **Seeds** - Train same architecture with multiple seeds
3. **Weighting** - Inverse LB score weighting
4. **TTA** - Horizontal flip averaging
5. **Folds** - Average across all CV folds

**Final Ensemble (0.541 LB)**:
- ST Transformer (25.17%)
- Multiscale CNN (25.17%)
- GRU Seed 27 (24.76%)
- Position-Specific ST (24.90%)

**Code Notebooks**:
- `02_st_transformer_training.ipynb` - ST Transformer
- `05_gnn_geometric_training.ipynb` - GNN/Geometric
- `06_gru_training.ipynb` - GRU
- `07_kaggle_submission.ipynb` - Submission format