##  1. What Is Ensembling?

In **deep learning**, *ensembling* means **combining predictions from multiple models** to produce a final result that’s usually **more accurate and robust** than any single model.

Think of it like:

> Asking 5 expert photographers to pick the best photo. Even if each one makes small mistakes, combining their opinions usually gets you closer to the best choice.

---

**Why Ensemble?**

You do it to **reduce error** caused by:

* **Variance** → A single model’s predictions shift depending on the data split and random seed it was trained with; averaging several models smooths this out.
* **Bias** → Combining models with different architectures can capture patterns that any one architecture misses.
* **Noise** → Random quirks in training that hurt one model are averaged out.

**Benefits:**

* Higher accuracy
* More robustness to outliers
* Better generalization

**Trade-offs:**

* More computing at inference
* More storage for multiple models
* Harder to deploy in production

---

## **2. Sampling With Replacement**

**Sampling with replacement** means:

* You **pick an item from the dataset**.
* **You put it back** before picking the next item.
* This means the **same item can be chosen more than once** in the sample.

---

####  Small Example

Let’s say our dataset is:

$$
\{A, B, C\}
$$

We want to pick **3 samples** *with replacement*.

**Step-by-step:**

1. First pick → **B** (now we put B back into the pool).
2. Second pick → **A** (A goes back into the pool).
3. Third pick → **B** again (possible because we replaced it).

**Final sample**: $\{B, A, B\}$

Note:

* **B** appears twice.
* **C** was never picked.

---

####  Why “with replacement” Matters in Bagging

In **Bagging (Bootstrap Aggregating)**:

* The bootstrap sample is created by drawing **n** items **with replacement** from the original **n**-sized dataset.
* Because we replace each time, the **same data point might be in the sample multiple times** and some points might be **missing entirely**.
* This randomness means each model gets a **different view** of the data, adding diversity.
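
A minimal sketch of drawing bootstrap samples, using only the standard library. The six-item dataset, the seed, and the helper name `bootstrap_sample` are illustrative assumptions, not part of any library.

```python
import random
from collections import Counter

def bootstrap_sample(items, rng):
    """Draw len(items) items with replacement."""
    return [rng.choice(items) for _ in items]

rng = random.Random(0)            # fixed seed, chosen arbitrarily
data = ["A", "B", "C", "D", "E", "F"]

for model_id in range(3):
    sample = bootstrap_sample(data, rng)
    counts = Counter(sample)
    duplicated = sorted(k for k, c in counts.items() if c > 1)
    missing = sorted(set(data) - set(sample))   # points this model never sees
    print(model_id, sample, "duplicated:", duplicated, "missing:", missing)
```

Each pass through the loop plays the role of one model’s training set; the duplicated and missing items usually differ from model to model, which is where the diversity comes from.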

---

####  Probability Insight

If we have $n$ data points and we sample $n$ times with replacement:

* The probability that a given data point **is NOT selected** in a single draw = $\frac{n-1}{n}$
* Probability it is **never selected** in $n$ draws = $\left(\frac{n-1}{n}\right)^n \approx e^{-1} \approx 0.368$

➡ For large $n$, about **36.8% of the original data points** are missing from any given bootstrap sample on average; the other 63.2% appear at least once, some of them several times.
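
The limit can be checked numerically. The sketch below evaluates $\left(\frac{n-1}{n}\right)^n$ for a few sizes and then estimates the fraction of distinct points in a bootstrap sample by simulation; the sizes, seed, and trial count are arbitrary choices. For small datasets the value sits a little below the limit, for example about 0.335 for $n = 6$.

```python
import math
import random

def p_never_selected(n):
    """Probability that a given point is absent from a size-n bootstrap sample."""
    return ((n - 1) / n) ** n

for n in (3, 6, 100, 10_000):
    print(n, round(p_never_selected(n), 4))
print("limit e^-1:", round(math.exp(-1), 4))

# Empirical check: average fraction of distinct points present in a sample.
rng = random.Random(0)
n, trials = 1000, 200
frac_present = sum(
    len({rng.randrange(n) for _ in range(n)}) / n for _ in range(trials)
) / trials
print("empirical fraction present:", round(frac_present, 3))
```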

---


##  **3. Common Ensembling Strategies**

### **3.1. Bagging (Bootstrap Aggregating)**

* Train multiple models **independently**, each on its own **bootstrap sample** (a random sample of the data drawn with replacement).
* Average predictions (for regression) or use majority voting (for classification).
* Example: **Random Forest** is bagging with decision trees (plus random feature selection at each split).
* In deep learning, strict bootstrap resampling is less common; a bagging-like ensemble is often built by training the same architecture with different seeds and data shuffling.

---


### **3.1.1. Numerical Example for Bagging Using Voting**

We’ll pretend we have:

* **Dataset**: 6 data points
* **Goal**: Predict whether a fruit is **Sweet (1)** or **Not Sweet (0)**
* **Base models**: 3 weak classifiers (decision stumps in this example)
* **Final decision rule**: Majority vote

---

#### **Step 1 — Original Dataset**

| ID | Feature (Sugar %) | Label (Sweet=1, Not=0) |
| -- | ----------------- | ---------------------- |
| 1  | 5%                | 0                      |
| 2  | 8%                | 0                      |
| 3  | 12%               | 1                      |
| 4  | 15%               | 1                      |
| 5  | 18%               | 1                      |
| 6  | 20%               | 1                      |

---

#### **Step 2 — Create Bootstrap Samples**

We randomly sample **with replacement** for each model.

**Model 1’s training set** (random pick with replacement):

* {ID 2, ID 3, ID 3, ID 4, ID 5, ID 6}

**Model 2’s training set**:

* {ID 1, ID 2, ID 4, ID 4, ID 5, ID 6}

**Model 3’s training set**:

* {ID 1, ID 3, ID 3, ID 4, ID 5, ID 5}

⚠ Notice:

* Some points appear multiple times.
* Some points are missing in each sample.

---

#### **Step 3 — Train Models**

We train a **simple classifier** on each bootstrap set.

For example:

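* Model 1’s rule → “Sweet if sugar > 10%”
* Model 2’s rule → “Sweet if sugar ≥ 15%”
* Model 3’s rule → “Sweet if sugar ≥ 12%”

These thresholds are illustrative. Each one classifies its own bootstrap sample correctly, but a different sample (or a different tie-breaking rule inside the learner) would give a different cut-off.

---

#### **Step 4 — Make Predictions on a New Fruit**

Let’s say our new fruit has **14% sugar**.

* **Model 1** (Sweet if > 10%) → **Predict: 1**
* **Model 2** (Sweet if ≥ 15%) → **Predict: 0**
* **Model 3** (Sweet if ≥ 12%) → **Predict: 1**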

---

#### **Step 5 — Aggregate Predictions**

We take a **majority vote**:

| Model | Prediction |
| ----- | ---------- |
| M1    | 1          |
| M2    | 0          |
| M3    | 1          |

Sum = 1 + 0 + 1 = **2 votes for Sweet** out of 3 → **Final Prediction = Sweet (1)** ✅
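
The vote itself is a one-liner. This is a small sketch using the three predictions from the table; `majority_vote` and the extra `batch_votes` rows are hypothetical names and made-up values added for illustration.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label among the models' predictions."""
    # Counter orders equal counts by first appearance, so with an even
    # number of models a tie goes to whichever label was seen first.
    return Counter(votes).most_common(1)[0][0]

votes = [1, 0, 1]                  # M1, M2, M3 from the table
print(majority_vote(votes))        # 1 (Sweet)

# Same idea for several inputs at once: one list of votes per input.
batch_votes = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
print([majority_vote(v) for v in batch_votes])   # [1, 0, 1]
```

Using an odd number of models for binary problems avoids ties altogether.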

---

#### **Key Points to Notice**

* Bagging improves stability by **reducing variance** — each model sees a different “view” of the data.
* If a single model makes an error on a point, other models can overrule it.
* Works best when models are **unstable** (like decision trees).

---

### **3.1.2. Numerical Example for Regression Bagging**


We have one feature $x$ and a target $y$:

| ID |  x |  y |
| -: | -: | -: |
|  1 |  1 |  2 |
|  2 |  2 |  4 |
|  3 |  3 |  7 |
|  4 |  4 |  8 |
|  5 |  5 | 10 |
|  6 |  6 | 12 |

**Base learner** (kept intentionally simple): a **regression stump** with a **fixed threshold** at $x=4.5$.

* If $x \le 4.5$: predict the **mean $y$** of points on the **left** (as seen in that model’s training sample).
* If $x > 4.5$: predict the **mean $y$** of points on the **right** (as seen in that model’s training sample).

We’ll predict at a new point $x^*=5.5$ (so we’ll use the right-branch mean).

---

#### Bagging: 3 bootstrap models

Each model is trained on a **bootstrap sample (size 6, with replacement)**.

#### Model A — sample: {2, 3, 3, 4, 5, 6}

* Left (≤4.5): IDs 2,3,3,4 → $y=\{4,7,7,8\}$ → mean $= (4+7+7+8)/4 = 6.5$
* Right (>4.5): IDs 5,6 → $y=\{10,12\}$ → mean $= 11$
* **Prediction at $x^*=5.5$**: **11**

#### Model B — sample: {1, 2, 4, 4, 5, 6}

* Left: IDs 1,2,4,4 → $y=\{2,4,8,8\}$ → mean $= 22/4 = 5.5$
* Right: IDs 5,6 → $y=\{10,12\}$ → mean $= 11$
* **Prediction at $x^*=5.5$**: **11**

#### Model C — sample: {1, 3, 3, 4, 5, 5}

* Left: IDs 1,3,3,4 → $y=\{2,7,7,8\}$ → mean $= 24/4 = 6$
* Right: IDs 5,5 → $y=\{10,10\}$ → mean $= 10$
* **Prediction at $x^*=5.5$**: **10**

---

#### Aggregate (bagged) prediction

Average the three model predictions:

$$
\hat{y}_{\text{bag}}(5.5) = \frac{11 + 11 + 10}{3} = \frac{32}{3} \approx \mathbf{10.67}
$$

---

#### Single (non-bagged) model baseline

Train the **same regression stump** on the **full dataset**:

* Left (IDs 1–4): $y=\{2,4,7,8\}$ → mean $= 21/4 = 5.25$
* Right (IDs 5–6): $y=\{10,12\}$ → mean $= 11$
* **Prediction at $x^*=5.5$**: **11**

---

#### What this shows

* Each bootstrap sample yields **slightly different right-branch means** (11, 11, 10).
* **Bagging averages** these to **10.67**, reducing variance relative to any single weak learner.
* With more diverse base learners (different seeds/architectures/hyperparams), this effect is stronger.
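
The arithmetic above can be reproduced in a few lines. This sketch hard-codes the table, the fixed threshold, and the three bootstrap samples from this example; the helper names `fit_stump` and `predict` are assumptions made for illustration.

```python
from statistics import mean

# Dataset from the table: ID -> (x, y)
data = {1: (1, 2), 2: (2, 4), 3: (3, 7), 4: (4, 8), 5: (5, 10), 6: (6, 12)}
THRESHOLD = 4.5

def fit_stump(ids):
    """Return (left_mean, right_mean) of y for the sampled IDs.

    Assumes the sample has at least one point on each side of the threshold.
    """
    left = [data[i][1] for i in ids if data[i][0] <= THRESHOLD]
    right = [data[i][1] for i in ids if data[i][0] > THRESHOLD]
    return mean(left), mean(right)

def predict(stump, x):
    left_mean, right_mean = stump
    return left_mean if x <= THRESHOLD else right_mean

samples = {"A": [2, 3, 3, 4, 5, 6], "B": [1, 2, 4, 4, 5, 6], "C": [1, 3, 3, 4, 5, 5]}
preds = {name: predict(fit_stump(ids), 5.5) for name, ids in samples.items()}
bagged = mean(list(preds.values()))
print(preds, round(bagged, 2))     # per-model predictions and their average
```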


### **3.2. Boosting**

* Train models **sequentially**.
* Each new model focuses on **mistakes** made by previous ones.
* Usually used in tree-based models (e.g., XGBoost), but also applicable in deep learning.
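
One way to make "focus on mistakes" concrete is the residual-fitting flavour of boosting (gradient boosting with squared error), sketched below with one-split regressors in plain Python. This is not how XGBoost is implemented; the learning rate, number of rounds, and toy data are arbitrary assumptions.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Best single-split regressor (least squares) on 1-D inputs.

    Assumes xs contains at least two distinct values.
    """
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = mean(left), mean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def boost(xs, ys, n_rounds=5, lr=0.5):
    """Each round fits a stump to the current residuals (the 'mistakes')."""
    base = mean(ys)
    stumps = []

    def predict(x):
        return base + lr * sum(s(x) for s in stumps)

    history = []                   # training MSE before each round
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        history.append(mean([r * r for r in residuals]))
        stumps.append(fit_stump(xs, residuals))
    return predict, history

xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 7, 8, 10, 12]
model, mse_per_round = boost(xs, ys)
print([round(m, 3) for m in mse_per_round])   # error shrinks round by round
```

Unlike bagging, the models here cannot be trained in parallel, because each stump depends on the residuals left by the ones before it.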

---

### **3.3. Simple Averaging**

* Train different models (same or different architectures).
* Average their predicted probabilities:

  $$
  p_{\text{final}} = \frac{1}{N} \sum_{i=1}^N p_i
  $$
* Easiest to implement.

---

### **3.4. Weighted Averaging**

* Like above, but give **more weight** to better models:

  $$
  p_{\text{final}} = \sum_{i=1}^N w_i p_i
  $$

  where $\sum w_i = 1$.
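
Both averaging formulas applied to plain arrays, as a small NumPy sketch. The probabilities and weights are made-up numbers for illustration.

```python
import numpy as np

# Hypothetical class probabilities from 3 models for 2 inputs and 3 classes.
# Shape: [n_models, n_samples, n_classes]
probs = np.array([
    [[0.50, 0.40, 0.10], [0.2, 0.7, 0.1]],
    [[0.45, 0.50, 0.05], [0.1, 0.8, 0.1]],
    [[0.90, 0.05, 0.05], [0.3, 0.6, 0.1]],
])

simple = probs.mean(axis=0)                      # (1/N) * sum_i p_i

weights = np.array([0.5, 0.3, 0.2])              # non-negative, sum to 1
weighted = np.tensordot(weights, probs, axes=1)  # sum_i w_i * p_i

print(simple.argmax(axis=1), weighted.argmax(axis=1))
```

Because the weights sum to 1, each row of `weighted` is still a valid probability distribution.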

---

### **3.5. Stacking (Stacked Generalization)**

* Train **level-1 models** (your base models).
* Collect their predictions as **features**.
* Train a **meta-model** (e.g., logistic regression, small NN) on these predictions to output the final result.

---

### **3.6. Snapshot Ensembling**

* Train **one model** with a cyclical learning rate.
* Save **snapshots** at the end of each learning-rate cycle, when the rate is lowest → each snapshot tends to sit in a different local minimum, so the snapshots behave like different models.
* Average their predictions at inference.
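
A common choice for the cyclical schedule is cosine annealing with restarts, as in the original snapshot ensembling paper. The sketch below only computes the learning rate and marks where a snapshot would be saved; the function name and the step counts are illustrative assumptions, and no model is trained.

```python
import math

def snapshot_lr(step, total_steps, n_cycles, lr_max):
    """Cosine-annealed learning rate that restarts every cycle.

    step is 0-based; each cycle lasts ceil(total_steps / n_cycles) steps.
    """
    cycle_len = math.ceil(total_steps / n_cycles)
    pos = step % cycle_len                       # position inside the cycle
    return lr_max / 2 * (math.cos(math.pi * pos / cycle_len) + 1)

total_steps, n_cycles, lr_max = 12, 3, 0.1       # illustrative values
cycle_len = math.ceil(total_steps / n_cycles)
for step in range(total_steps):
    lr = snapshot_lr(step, total_steps, n_cycles, lr_max)
    end_of_cycle = (step + 1) % cycle_len == 0   # lowest rate of the cycle
    print(step, round(lr, 4), "<- save snapshot" if end_of_cycle else "")
```

The jump back to `lr_max` at the start of each cycle is what pushes the model out of its current minimum before it settles into the next one.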

---

### **3.7. Test Time Augmentation (TTA)**

* Not exactly a multi-model ensemble.
* At inference, create multiple augmented versions of the **same input** (e.g., flips, crops), pass them through the **same model**, and average the results.

---


##  PyTorch Example — Simple Weighted Average

Let’s say you have trained:

* `modelA` = ResNet50
* `modelB` = EfficientNet

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, weights, x):
    """
    models: list of trained PyTorch models
    weights: list of weights for each model (sum to 1)
    x: input batch
    """
    preds = []
    for model, w in zip(models, weights):
        model.eval()
        with torch.no_grad():
            output = F.softmax(model(x), dim=1)  # probs
            preds.append(w * output)
    return torch.stack(preds).sum(dim=0)

# Example usage:
models = [modelA, modelB]
weights = [0.6, 0.4]
final_probs = ensemble_predict(models, weights, input_batch)
final_labels = final_probs.argmax(dim=1)
```

---

## **4. PyTorch Example — Stacking with Logistic Regression**

```python
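import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def predict_raw(model, x):
    """Raw model outputs (logits) as a NumPy array: eval mode, no gradients."""
    model.eval()
    return model(x).cpu().numpy()  # .cpu() so this also works for GPU models

# Base-model predictions on a held-out validation set
predsA = predict_raw(modelA, val_x)
predsB = predict_raw(modelB, val_x)

# Stack as features: one row per sample, columns = [outputs A | outputs B]
stacked_preds = np.hstack([predsA, predsB])

# Train meta-model on validation predictions (not on training-set predictions)
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(stacked_preds, val_labels)

# At test time:
test_predsA = predict_raw(modelA, test_x)
test_predsB = predict_raw(modelB, test_x)
stacked_test = np.hstack([test_predsA, test_predsB])
final_preds = meta_model.predict(stacked_test)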
```

---

##  Practical Tips

* **Diversity matters** → Models should differ (architecture, initialization, training data order).
* **Calibrate probabilities** → Ensures fair combination.
* **Avoid overfitting in stacking** → Use out-of-fold predictions for meta-model training.
* **Balance cost vs. accuracy** → Sometimes a single large model is better than a complex ensemble for deployment.

---



## **5. Practical workflow for Ensembling Multiple Pretrained Models**

Here’s a **practical, end-to-end workflow** to ensemble multiple **ImageNet-pretrained** models in **PyTorch** (e.g., ResNet50, DenseNet121, EfficientNet-B0). It’s the pattern I recommend for robust results without getting lost in glue code.

---

# 0) Goals & High-level plan

* **Train** 3–5 diverse backbones with the **same dataset & splits**.
* **Track** out-of-fold (OOF) predictions for proper validation.
* **Choose** an ensemble method: simple average → weighted average → stacking.
* **Polish**: TTA + probability calibration.
* **Ship**: optionally **distill** the ensemble to one student for cheap inference.

---

# 1) Project skeleton

```
project/
  data/
  src/
    dataset.py
    models.py
    train.py
    infer.py
    utils.py
  oof/            # OOF predictions per fold & model
  ckpt/           # checkpoints
  cfg.yaml
```

### cfg.yaml (example)

```yaml
seed: 2025
img_size: 224
n_classes: 4
folds: 5
batch_size: 32
epochs: 20
models:
  - name: resnet50
    lr: 3e-4
  - name: densenet121
    lr: 3e-4
  - name: efficientnet_b0
    lr: 2e-4
```

---

# 2) Data & transforms (shared across models)

```python
# src/dataset.py
import torch, torchvision as tv
from torch.utils.data import Dataset
from PIL import Image

def get_transforms(img_size):
    train_tf = tv.transforms.Compose([
        tv.transforms.Resize(int(img_size*1.15)),
        tv.transforms.CenterCrop(img_size),
        tv.transforms.RandomHorizontalFlip(),
        tv.transforms.ColorJitter(0.1,0.1,0.1,0.05),
        tv.transforms.ToTensor(),
        tv.transforms.Normalize(mean=[0.485,0.456,0.406],
                                std=[0.229,0.224,0.225]),
    ])
    val_tf = tv.transforms.Compose([
        tv.transforms.Resize(int(img_size*1.15)),
        tv.transforms.CenterCrop(img_size),
        tv.transforms.ToTensor(),
        tv.transforms.Normalize(mean=[0.485,0.456,0.406],
                                std=[0.229,0.224,0.225]),
    ])
    return train_tf, val_tf

class ImgDataset(Dataset):
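    def __init__(self, df, tfm):
        # Keep the original index labels (assumed to be integers):
        # reset_index(drop=True) discards them, and they are needed to map
        # OOF predictions back to dataset rows.
        self.orig_idx = df.index.to_numpy()
        self.df, self.tfm = df.reset_index(drop=True), tfm
    def __len__(self): return len(self.df)
    def __getitem__(self, i):
        row = self.df.iloc[i]
        img = Image.open(row.path).convert("RGB")
        x = self.tfm(img)
        y = torch.tensor(row.label, dtype=torch.long)
        # row.index would be the column names of this row, not its position
        return x, y, int(self.orig_idx[i])  # original row index for OOF mapping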
```

Use **StratifiedKFold** to create folds once and reuse for every backbone.
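
A sketch of assigning folds once so that every backbone reuses the same split. The helper `assign_folds`, the toy labels, and the output path in the comment are hypothetical; only `StratifiedKFold` and its `split` method come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def assign_folds(labels, n_folds=5, seed=2025):
    """Return an array giving each sample's validation fold (0..n_folds-1)."""
    labels = np.asarray(labels)
    fold_of = np.full(len(labels), -1, dtype=int)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # split() only uses X for its length, so a dummy array is enough.
    for fold, (_, val_idx) in enumerate(skf.split(np.zeros(len(labels)), labels)):
        fold_of[val_idx] = fold
    return fold_of

labels = np.array([0, 1, 2, 3] * 10)   # toy labels: 4 classes, 10 samples each
fold_of = assign_folds(labels)
print(np.bincount(fold_of))            # number of validation samples per fold
# e.g. df["fold"] = fold_of; df.to_csv("data/folds.csv")   (hypothetical path)
```

Storing the fold column next to the data removes any dependence on every script using the same seed and row order.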

---

# 3) Load & adapt ImageNet backbones

```python
# src/models.py
import torch.nn as nn
import torchvision.models as M

def make_model(name: str, n_classes: int, pretrained=True):
    if name == "resnet50":
        m = M.resnet50(weights=M.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None)
        m.fc = nn.Linear(m.fc.in_features, n_classes)
    elif name == "densenet121":
        m = M.densenet121(weights=M.DenseNet121_Weights.IMAGENET1K_V1 if pretrained else None)
        m.classifier = nn.Linear(m.classifier.in_features, n_classes)
    elif name == "efficientnet_b0":
        m = M.efficientnet_b0(weights=M.EfficientNet_B0_Weights.IMAGENET1K_V1 if pretrained else None)
        m.classifier[1] = nn.Linear(m.classifier[1].in_features, n_classes)
    else:
        raise ValueError(name)
    return m
```

---

# 4) Train each backbone with K-fold and save OOF predictions

Key ideas:

* **OOF** = predictions for validation fold **not seen** during training → honest validation for ensembling/stacking.
* Save both **checkpoint** and **OOF logits/probs** to disk.

```python
# src/train.py
import torch, torch.nn as nn, torch.optim as optim
from torch.utils.data import DataLoader
import numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from dataset import ImgDataset, get_transforms
from models import make_model

def train_one_fold(model_name, df, fold_idx, folds=5, n_classes=4, epochs=20, batch_size=32, lr=3e-4, img_size=224, device="cuda"):
    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=2025)
    train_idx, val_idx = list(kf.split(df.path, df.label))[fold_idx]
    tr_df, va_df = df.iloc[train_idx], df.iloc[val_idx]

    tf_train, tf_val = get_transforms(img_size)
    dl_tr = DataLoader(ImgDataset(tr_df, tf_train), batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True)
    dl_va = DataLoader(ImgDataset(va_df, tf_val), batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True)

    model = make_model(model_name, n_classes).to(device)
    optimizer = optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler(enabled=(device=="cuda"))

    best_acc, best_path = -1.0, f"ckpt/{model_name}_fold{fold_idx}.pt"  # -1 so the first epoch always saves a checkpoint
    for epoch in range(epochs):
        model.train()
        for x,y,_ in dl_tr:
            x,y = x.to(device), y.to(device)
            optimizer.zero_grad()
            with torch.cuda.amp.autocast(enabled=(device=="cuda")):
                logits = model(x)
                loss = criterion(logits, y)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

        # validate
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for x,y,_ in dl_va:
                x,y = x.to(device), y.to(device)
                logits = model(x)
                pred = logits.argmax(1)
                correct += (pred==y).sum().item()
                total += y.size(0)
        acc = correct/total
        if acc > best_acc:
            best_acc = acc
            torch.save(model.state_dict(), best_path)

    # Save OOF probabilities for stacking/weighting
    model.load_state_dict(torch.load(best_path, map_location=device))
    model.eval()
    oof_probs = np.zeros((len(va_df), n_classes), dtype=np.float32)
    idxs = []
    with torch.no_grad():
        for x,y,ii in dl_va:
            x = x.to(device)
            logits = model(x)
            probs = torch.softmax(logits, dim=1).cpu().numpy()
            oof_probs[len(idxs):len(idxs)+len(ii)] = probs
            idxs.extend(ii.numpy().tolist())

    oof_df = pd.DataFrame(oof_probs, index=va_df.index, columns=[f"p{c}" for c in range(n_classes)])
    oof_df["label"] = va_df.label.values
    oof_df.to_csv(f"oof/{model_name}_fold{fold_idx}.csv", index=True)
    print(model_name, fold_idx, "best_acc", best_acc)
```

Train **each model across all folds** (loop `model_name` in cfg and folds 0..K-1).

---

# 5) Build the ensemble on OOF predictions

## 5.1 Simple average (strong baseline)

```python
import pandas as pd, numpy as np, glob

# All OOF files, e.g. oof/resnet50_fold0.csv, oof/resnet50_fold1.csv, ...
paths = sorted(glob.glob("oof/*.csv"))

# Merge each model's folds first (one OOF table per model),
# then average across models, never across folds.
def merge_folds(model_prefix):
    """Concatenate one model's per-fold OOF files, sorted by dataset index."""
    files = [p for p in paths if model_prefix in p]
    dd = [pd.read_csv(p, index_col=0) for p in files]
    dfm = pd.concat(dd).sort_index()  # back to dataset index order
    return dfm

resnet_oof = merge_folds("resnet50")
dense_oof  = merge_folds("densenet121")
eff_oof    = merge_folds("efficientnet_b0")

all_probs = np.stack([resnet_oof.filter(like="p").values,
                      dense_oof.filter(like="p").values,
                      eff_oof.filter(like="p").values], axis=0)
avg_probs = all_probs.mean(axis=0)
labels = resnet_oof["label"].values  # same for all

oof_acc = (avg_probs.argmax(1) == labels).mean()
print("OOF accuracy (simple avg):", oof_acc)
```

## 5.2 Learn **non-negative weights** for a weighted average

We’ll parameterize weights via softmax so they’re ≥0 and sum to 1, and **optimize them on OOF**.

```python
import torch
probs_t = torch.tensor(all_probs)       # [M, N, C]
labels_t = torch.tensor(labels)         # [N]
w_logits = torch.zeros(len(probs_t), requires_grad=True)  # init equal weights
opt = torch.optim.LBFGS([w_logits], max_iter=100)

def closure():
    opt.zero_grad()
    w = torch.softmax(w_logits, dim=0)               # [M]
    ens = (probs_t * w[:,None,None]).sum(dim=0)      # [N, C]
    loss = torch.nn.functional.nll_loss(torch.log(ens+1e-8), labels_t)
    loss.backward()
    return loss

opt.step(closure)
w = torch.softmax(w_logits.detach(), dim=0)
print("Learned weights:", w.tolist())
```

Use these weights at test time.

## 5.3 Stacking (meta-learner)

* Build a feature vector = **concatenate** model probabilities: `[p_resnet | p_densenet | p_efficientnet]`.
* Train a **meta model** (e.g., LogisticRegression, small MLP) **on OOF**.
* Predict with the meta model on test-time base probabilities.

*(Skeleton, using scikit):*

```python
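import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.concatenate([resnet_oof.filter(like="p").values,
                    dense_oof.filter(like="p").values,
                    eff_oof.filter(like="p").values], axis=1)
y = labels
# The multi_class argument is deprecated in recent scikit-learn releases;
# the default already fits a multinomial model for multiclass targets.
meta = LogisticRegression(max_iter=1000)
# Score with cross-validation: meta.score(X, y) after fitting on the same
# rows would report training accuracy, not an honest estimate.
cv_acc = cross_val_score(meta, X, y, cv=5, scoring="accuracy").mean()
print("Meta-model CV accuracy on OOF features:", cv_acc)
meta.fit(X, y)  # final fit on all OOF rows for use at test time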

---

# 6) Inference with TTA + ensemble

```python
# src/infer.py
import torch, torchvision as tv
from PIL import Image
import numpy as np

def tta_transforms(img_size):
    base = tv.transforms.Compose([
        tv.transforms.Resize(int(img_size*1.15)),
        tv.transforms.CenterCrop(img_size),
        tv.transforms.ToTensor(),
        tv.transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225]),
    ])
    hflip = tv.transforms.Compose([tv.transforms.RandomHorizontalFlip(p=1.0), *base.transforms])
    return [base, hflip]

@torch.no_grad()
def predict_single(models, img_path, weights=None, img_size=224, device="cuda"):
    tfs = tta_transforms(img_size)
    img = Image.open(img_path).convert("RGB")
    M = len(models)
    if weights is None:
        weights = [1.0/M]*M

    # TTA per model
    total = None
    for m, w in zip(models, weights):
        m.eval().to(device)
        probs_sum = 0
        for tf in tfs:
            x = tf(img).unsqueeze(0).to(device)
            logits = m(x)
            probs_sum += torch.softmax(logits, dim=1)
        probs = probs_sum / len(tfs)
        total = probs * w if total is None else total + probs * w

    return total.squeeze(0).cpu().numpy()  # [C]
```

---

# 7) Probability calibration (nice polish)

Calibrate the **final ensemble** with **temperature scaling** on a held-out set (or OOF).

```python
class TemperatureScaler(torch.nn.Module):
    def __init__(self): 
        super().__init__(); self.t = torch.nn.Parameter(torch.ones(1))
    def forward(self, logits): 
        return logits / self.t.clamp_min(1e-6)

# Fit on logits+labels (use OOF logits from the ensemble, not probs)
# minimize NLL by optimizing t
```

*(You can store per-model logits during OOF generation, combine to ensemble logits via log-weights, and fit one temperature.)*
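
A sketch of the fitting step, written as a standalone function instead of extending the `TemperatureScaler` class above. It optimizes $\log T$ with L-BFGS so the temperature stays positive. The function name and the synthetic data are assumptions; for a probability-averaged ensemble, `torch.log(ens_probs)` can serve as the logits, since the softmax of log-probabilities returns the probabilities.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, max_iter=50):
    """Find T > 0 that minimizes the NLL of softmax(logits / T)."""
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t), starts at 1
    opt = torch.optim.LBFGS([log_t], lr=1.0, max_iter=max_iter,
                            line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

# Toy check on synthetic, deliberately over-confident logits (made-up data).
torch.manual_seed(0)
labels = torch.randint(0, 4, (512,))
calibrated = F.one_hot(labels, 4).float() + torch.randn(512, 4)
overconfident = 5.0 * calibrated                 # sharpened by a factor of 5
T = fit_temperature(overconfident, labels)
print("fitted temperature:", round(T, 2))        # should land in the vicinity of 5
```

Temperature scaling rescales every logit by the same factor, so it changes confidence but not the predicted class or the accuracy.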

---

# 8) Speed/Memory options

* **SWA/Polyak averaging** per model → better single checkpoints.
* **Half-precision** (AMP) + **channels-last**.
* **Prune** or **quantize** the **student** if you distill.

---

# 9) Distill the ensemble (optional, highly recommended for prod)

Train a **student** to match the **ensemble’s soft targets**:

```python
# KL(student_logits / T, teacher_probs) + α * CE(student, hard_labels)
```

You keep ensemble quality but ship **one small model**.
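
The loss in the comment above could look like the following PyTorch sketch. The function name, the default `T` and `alpha`, and the random tensors standing in for real batches are assumptions. If teacher logits are available, they are often softened with the same temperature as the student; here the ensemble probabilities are used as they are.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_probs, hard_labels, T=2.0, alpha=0.5):
    """KL(teacher || student at temperature T) + alpha * CE on hard labels."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # kl_div expects log-probabilities as input and probabilities as target.
    kl = F.kl_div(log_p_student, teacher_probs, reduction="batchmean")
    ce = F.cross_entropy(student_logits, hard_labels)
    return kl + alpha * ce

# Random stand-ins for one batch (8 samples, 4 classes).
torch.manual_seed(0)
student_logits = torch.randn(8, 4, requires_grad=True)
teacher_probs = torch.softmax(torch.randn(8, 4), dim=1)   # ensemble output stand-in
hard_labels = torch.randint(0, 4, (8,))

loss = distill_loss(student_logits, teacher_probs, hard_labels)
loss.backward()                                            # gradients reach the student
print(float(loss))
```

In a training loop, `teacher_probs` would come from the weighted ensemble run in `torch.no_grad()` mode on the same batch.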

---

# 10) Checklist / gotchas

* Keep **identical folds** across backbones.
* Always compute metrics on **OOF** (not per-fold val reported separately).
* Prefer **probability-level** blending (not argmax of each).
* Start with **simple average**, then try **learned weights**, then **stacking**.
* Save seeds & versions; log runs (MLflow/W\&B) for reproducibility.

---

## TL;DR recipe

1. Make 5 folds; train **ResNet50, DenseNet121, EfficientNet-B0** per fold, save **best ckpt** and **OOF probs**.
2. **Simple average** OOF → baseline ensemble score.
3. Fit **weights** on OOF via softmax-constrained optimization.
4. (Optional) Fit a **stacking** meta-model on concatenated OOF probs.
5. At inference: **TTA + weighted ensemble**.
6. (Optional) **Calibrate** + **distill** to a single student.

