# How the Order of BatchNorm and Activation Affects Performance

**Objective:**  
This experiment investigates how the order of BatchNorm and Activation affects performance.

**Conditions:**  
- Implement three networks with the following structures:
  1) Conv (or FC) → BatchNorm → ReLU  
  2) Conv (or FC) → ReLU → BatchNorm  
  3) Conv (or FC) → LeakyReLU → BatchNorm  
- Each network should have 8 blocks

**Expected Outcome:**  
- No predefined “correct” result; analyze and explain observed performance differences  
- A convincing analysis is required regardless of whether the result matches expectations

## Theoretical Background

In deep learning, the order of Batch Normalization (BN) and activation functions (e.g., ReLU) within a neural network layer can significantly influence the training dynamics and final performance. Conventionally, BN is applied before the activation (i.e., `Conv → BN → ReLU`). This ordering stabilizes the input distribution to the nonlinearity, potentially enhancing gradient flow.

However, the impact of altering this order—such as placing BN after the activation (`Conv → ReLU → BN`) or combining BN with alternative activations like LeakyReLU—remains a topic of empirical interest. Differences in gradient propagation, representational capacity, and learning stability can emerge depending on this sequence.
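To make the difference between the two orderings concrete, the following is a minimal sketch in plain Python rather than the notebook's PyTorch code. It assumes a single channel of Gaussian pre-activations and a BatchNorm without learned scale and shift; `batch_norm`, `relu`, and `pre_act` are illustrative names that are not part of the experiment.

```python
import math
import random
import statistics

def batch_norm(xs, eps=1e-5):
    """Normalize a batch of scalars to zero mean and unit variance (no learned scale/shift)."""
    mean = statistics.fmean(xs)
    var = statistics.pvariance(xs, mu=mean)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def relu(xs):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in xs]

random.seed(0)
# Hypothetical pre-activations of one channel, deliberately shifted and scaled.
pre_act = [random.gauss(3.0, 2.0) for _ in range(10_000)]

bn_then_relu = relu(batch_norm(pre_act))  # Case 1 ordering: Layer -> BN -> ReLU
relu_then_bn = batch_norm(relu(pre_act))  # Case 2 ordering: Layer -> ReLU -> BN

for name, out in [("BN -> ReLU", bn_then_relu), ("ReLU -> BN", relu_then_bn)]:
    zero_frac = sum(v == 0.0 for v in out) / len(out)
    print(f"{name}: mean={statistics.fmean(out):.3f}, "
          f"std={statistics.pstdev(out):.3f}, zeros={zero_frac:.2%}")
```

Under these assumptions, the `BN -> ReLU` output is non-negative with roughly half of its entries zeroed, whereas the `ReLU -> BN` output is re-centered to zero mean and unit variance before it reaches the next layer. Real BatchNorm layers also learn a scale and shift, so this only illustrates the starting point of training.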

This experiment aims to investigate how such ordering affects not only performance (in terms of accuracy) but also stability (variance across runs) and generalization (consistency across architectures).

## Experimental Setup

- **Architectures Tested**:
  - Multi-Layer Perceptron (MLP)
  - Convolutional Neural Network (CNN)

- **Block Definition** (repeated 8 times in the CNN; the MLP defined later applies the variant only in its first layer):
  - Structure 1: `Layer → BatchNorm → ReLU`
  - Structure 2: `Layer → ReLU → BatchNorm`
  - Structure 3: `Layer → LeakyReLU → BatchNorm`

- **Evaluation Metrics**:
  - **Performance**: Top-1 accuracy on test set
  - **Stability**: Mean and standard deviation over 5 independent runs
  - **Generalization**: Comparison of patterns across MLP and CNN

- **Dataset**: CIFAR-10
- **Optimizer**: Adam (learning rate 0.001)
- **Epochs**: 10
- **Batch Size**: 128
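Because stability is measured across independent runs, it can help to fix the random seeds so that each run is repeatable. The helper below is a sketch under that assumption; `set_seed` is a hypothetical function that the original notebook does not define, and the notebook's reported results were produced without it.

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs (hypothetical helper, not in the original notebook)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds the default PyTorch generator
    torch.cuda.manual_seed_all(seed)  # seeds all GPUs; silently ignored without CUDA

# Usage: the same seed reproduces the same random tensor.
set_seed(0)
a = torch.randn(3)
set_seed(0)
b = torch.randn(3)
print(torch.equal(a, b))
```

Calling `set_seed(run)` at the start of each repetition would give every run a distinct but reproducible initialization. Some GPU kernels are nondeterministic, so bit-identical results are still not guaranteed.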

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import numpy as np

## Define Three Cases for the Basic Block

In [2]:
class Case1Block(nn.Module):
    """Conv -> BatchNorm -> ReLU"""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

class Case2Block(nn.Module):
    """Conv -> ReLU -> BatchNorm"""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        return self.bn(self.relu(self.conv(x)))

class Case3Block(nn.Module):
    """Conv -> LeakyReLU -> BatchNorm"""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU()
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        return self.bn(self.act(self.conv(x)))

## Network

In [3]:
class CNNNetwork(nn.Module):
    def __init__(self, block_type='case1', num_blocks=8):
        super().__init__()
        if block_type == 'case1':
            block = Case1Block
        elif block_type == 'case2':
            block = Case2Block
        else:
            block = Case3Block

        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[block(64, 64) for _ in range(num_blocks)])
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(64, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.blocks(x)
        x = self.pool(x)
        x = torch.flatten(x, 1)
        return self.fc(x)

## Train, Evaluate

In [4]:
# train
def train(model, loader, criterion, optimizer, device):
    model.train()
    correct, total, loss_sum = 0, 0, 0.0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        loss_sum += loss.item()
        _, preds = outputs.max(1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return loss_sum / len(loader), 100. * correct / total

# eval
def test(model, loader, criterion, device):
    model.eval()
    correct, total, loss_sum = 0, 0, 0.0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss_sum += loss.item()
            _, preds = outputs.max(1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return loss_sum / len(loader), 100. * correct / total

In [5]:
def run_experiment(block_type, num_blocks, device, train_loader, test_loader, num_epochs, lr):
    model = CNNNetwork(block_type=block_type, num_blocks=num_blocks).to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_test_acc = 0
    for epoch in range(num_epochs):
        train_loss, train_acc = train(model, train_loader, criterion, optimizer, device)
        test_loss, test_acc = test(model, test_loader, criterion, device)
        best_test_acc = max(best_test_acc, test_acc)
        print(f"Epoch {epoch+1}/{num_epochs} | Test Acc: {test_acc:.2f}%")
    return best_test_acc

def repeat_experiment(block_type, num_blocks, repeat, device, train_loader, test_loader, num_epochs, lr):
    accs = []
    print(f"\n[Experiment] {block_type.upper()} - {repeat} times\n")

    for run in range(repeat):
        print(f"[{block_type}] Run {run+1}/{repeat}")
        model = CNNNetwork(block_type=block_type, num_blocks=num_blocks).to(device)
        optimizer = optim.Adam(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()

        for epoch in range(num_epochs):
            train_loss, train_acc = train(model, train_loader, criterion, optimizer, device)
            test_loss, test_acc = test(model, test_loader, criterion, device)
            print(f"\tEpoch {epoch+1}/{num_epochs} | Test Acc: {test_acc:.2f}%")

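        accs.append(test_acc)  # record the final-epoch test accuracy of this run

    mean_acc = np.mean(accs)
    std_acc = np.std(accs)  # population standard deviation (NumPy default, ddof=0)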
    print(f"\n {block_type.upper()} result: average Test Acc = {mean_acc:.2f}%, std = {std_acc:.2f}%\n")
    return mean_acc, std_acc

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
print("Using device:", device)

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(testset, batch_size=128, shuffle=False, num_workers=2)

print("\n==== Experiment 5 times ====")
results = {}

for case in ['case1', 'case2', 'case3']:
    mean_acc, std_acc = repeat_experiment(case, num_blocks=8, repeat=5, device=device,
                                          train_loader=train_loader, test_loader=test_loader,
                                          num_epochs=10, lr=0.001)
    results[case] = (mean_acc, std_acc)

print("\n=== Summary of Results ===")
for case, (mean, std) in results.items():
    print(f"{case.upper():<6} : average Test Accuracy = {mean:.2f}% ± {std:.2f}%")

Using device: cuda



==== Experiment repeated 5 times - mean and standard deviation ====

[Experiment] CASE1 - repeated 5 times

[case1] Run 1/5
	Epoch 1/10 | Test Acc: 55.42%
	Epoch 2/10 | Test Acc: 58.55%
	Epoch 3/10 | Test Acc: 66.91%
	Epoch 4/10 | Test Acc: 66.51%
	Epoch 5/10 | Test Acc: 71.23%
	Epoch 6/10 | Test Acc: 68.95%
	Epoch 7/10 | Test Acc: 75.48%
	Epoch 8/10 | Test Acc: 74.47%
	Epoch 9/10 | Test Acc: 77.94%
	Epoch 10/10 | Test Acc: 71.78%
[case1] Run 2/5
	Epoch 1/10 | Test Acc: 53.30%
	Epoch 2/10 | Test Acc: 54.39%
	Epoch 3/10 | Test Acc: 59.88%
	Epoch 4/10 | Test Acc: 69.44%
	Epoch 5/10 | Test Acc: 69.20%
	Epoch 6/10 | Test Acc: 73.15%
	Epoch 7/10 | Test Acc: 73.24%
	Epoch 8/10 | Test Acc: 78.16%
	Epoch 9/10 | Test Acc: 75.87%
	Epoch 10/10 | Test Acc: 76.14%
[case1] Run 3/5
	Epoch 1/10 | Test Acc: 45.12%
	Epoch 2/10 | Test Acc: 63.17%
	Epoch 3/10 | Test Acc: 69.20%
	Epoch 4/10 | Test Acc: 63.68%
	Epoch 5/10 | Test Acc: 71.95%
	Epoch 6/10 | Test Acc: 75.72%
	Epoch 7/10 | Test Acc: 79.22%
	Epoch 8/10 | Test Acc: 73.29%
	Epoch 9/10 | Test Acc: 78.25%

## Result - Experiment 1

|  | Mean Test Accuracy (%) | Standard Deviation (%) |
| --- | --- | --- |
| CASE1 | 75.5 | ± 2.28 |
| CASE2 | 80.91 | ± 0.77 |
| CASE3 | 81.73 | ± 0.92 |

- Test accuracy: CASE 1 < CASE 2 < CASE 3
- Standard deviation: CASE 2 < CASE 3 < CASE 1

## Experiment 2 — Generalization Across Architectures

This experiment aims to assess the generality of the observed effects regarding the ordering of activation functions and batch normalization layers. Specifically, we investigate whether the trends observed in performance and stability are consistent across different network architectures.

To this end, we conduct parallel experiments on two distinct model types:
- A **Multi-Layer Perceptron (MLP)**, representing a fully connected feedforward network
- A **Convolutional Neural Network (CNN)**, representing spatial feature-based architectures

### Research Question
> Does altering the sequence of batch normalization and activation layers lead to similar outcomes in both MLP and CNN architectures, in terms of performance trends and training stability?

By comparing these two structurally different models under identical experimental conditions, we aim to determine the extent to which the effect of layer ordering generalizes across architectural paradigms.

### MLP model

In [7]:
class MLPNetwork(nn.Module):
    def __init__(self, block_type):
        super().__init__()
        self.flatten = nn.Flatten()
        if block_type == 'case1':
            act1 = nn.Sequential(nn.Linear(3*32*32, 512), nn.BatchNorm1d(512), nn.ReLU())
        elif block_type == 'case2':
            act1 = nn.Sequential(nn.Linear(3*32*32, 512), nn.ReLU(), nn.BatchNorm1d(512))
        else:
            act1 = nn.Sequential(nn.Linear(3*32*32, 512), nn.LeakyReLU(), nn.BatchNorm1d(512))

        # Stack multiple layers
        self.mlp = nn.Sequential(
            act1,
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        return self.mlp(x)
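Note that `MLPNetwork` applies the ordering variant only in its first layer. As a point of comparison, the sketch below shows one way an MLP could repeat the selected ordering in every hidden block, mirroring the 8-block CNN. `make_fc_block`, `DeepMLPNetwork`, and the hidden width of 512 are assumptions made for illustration; this variant was not used to produce any result reported in this notebook.

```python
import torch
import torch.nn as nn

def make_fc_block(in_features: int, out_features: int, block_type: str) -> nn.Sequential:
    """Build one fully connected block in the ordering named by block_type."""
    linear = nn.Linear(in_features, out_features)
    bn = nn.BatchNorm1d(out_features)
    if block_type == 'case1':
        return nn.Sequential(linear, bn, nn.ReLU())       # FC -> BN -> ReLU
    if block_type == 'case2':
        return nn.Sequential(linear, nn.ReLU(), bn)       # FC -> ReLU -> BN
    if block_type == 'case3':
        return nn.Sequential(linear, nn.LeakyReLU(), bn)  # FC -> LeakyReLU -> BN
    raise ValueError(f"unknown block_type: {block_type!r}")

class DeepMLPNetwork(nn.Module):
    """Hypothetical MLP in which every hidden block uses the selected ordering."""
    def __init__(self, block_type: str, num_blocks: int = 8, hidden: int = 512):
        super().__init__()
        # Input width for CIFAR-10 images, followed by num_blocks hidden widths.
        widths = [3 * 32 * 32] + [hidden] * num_blocks
        self.flatten = nn.Flatten()
        self.blocks = nn.Sequential(*[
            make_fc_block(w_in, w_out, block_type)
            for w_in, w_out in zip(widths[:-1], widths[1:])
        ])
        self.fc = nn.Linear(hidden, 10)

    def forward(self, x):
        return self.fc(self.blocks(self.flatten(x)))

# Usage: a forward pass on a dummy CIFAR-10-shaped batch.
model = DeepMLPNetwork('case2')
logits = model(torch.randn(4, 3, 32, 32))
print(logits.shape)
```

With this design, the three MLP variants differ in all hidden blocks rather than in one layer, which would make an MLP-versus-CNN comparison of ordering effects more like-for-like.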


In [8]:
def run_experiment(model_type='cnn', block_type='case1', num_blocks=8,
                   epochs=10, batch_size=128, lr=0.001, repeat=5, device='cpu'):
    criterion = nn.CrossEntropyLoss()
    acc_list = []
    for r in range(repeat):
        # select model
        if model_type == 'cnn':
            model = CNNNetwork(block_type, num_blocks).to(device)
        elif model_type == 'mlp':
            model = MLPNetwork(block_type).to(device)

        optimizer = optim.Adam(model.parameters(), lr=lr)

        for epoch in range(epochs):
            train_loss, train_acc = train(model, train_loader, criterion, optimizer, device)
            test_loss, test_acc = test(model, test_loader, criterion, device)

        acc_list.append(test_acc)
        print(f"{model_type.upper()} {block_type.upper()} Run {r+1}: {test_acc:.2f}%")

    acc_array = np.array(acc_list)
    return acc_array.mean(), acc_array.std()

In [9]:
for model_type in ['cnn', 'mlp']:
    print(f"\n {model_type.upper()} Results:")
    for block_type in ['case1', 'case2', 'case3']:
        mean_acc, std_acc = run_experiment(model_type=model_type, block_type=block_type, device=device)
        print(f"{block_type.upper()} : {mean_acc:.2f}% ± {std_acc:.2f}%")


 CNN Results:
CNN CASE1 Run 1: 80.59%
CNN CASE1 Run 2: 77.32%
CNN CASE1 Run 3: 76.81%
CNN CASE1 Run 4: 78.24%
CNN CASE1 Run 5: 80.74%
CASE1 : 78.74% ± 1.64%
CNN CASE2 Run 1: 80.21%
CNN CASE2 Run 2: 80.65%
CNN CASE2 Run 3: 81.83%
CNN CASE2 Run 4: 79.96%
CNN CASE2 Run 5: 80.90%
CASE2 : 80.71% ± 0.65%
CNN CASE3 Run 1: 81.42%
CNN CASE3 Run 2: 80.91%
CNN CASE3 Run 3: 81.55%
CNN CASE3 Run 4: 80.46%
CNN CASE3 Run 5: 79.82%
CASE3 : 80.83% ± 0.64%

 MLP Results:
MLP CASE1 Run 1: 56.60%
MLP CASE1 Run 2: 56.05%
MLP CASE1 Run 3: 55.43%
MLP CASE1 Run 4: 55.45%
MLP CASE1 Run 5: 55.36%
CASE1 : 55.78% ± 0.48%
MLP CASE2 Run 1: 55.50%
MLP CASE2 Run 2: 55.73%
MLP CASE2 Run 3: 55.27%
MLP CASE2 Run 4: 56.02%
MLP CASE2 Run 5: 55.82%
CASE2 : 55.67% ± 0.26%
MLP CASE3 Run 1: 55.07%
MLP CASE3 Run 2: 55.72%
MLP CASE3 Run 3: 56.24%
MLP CASE3 Run 4: 55.25%
MLP CASE3 Run 5: 56.25%
CASE3 : 55.71% ± 0.49%


## Result - Experiment 2

| **Model** | **Case** | **Run 1** | **Run 2** | **Run 3** | **Run 4** | **Run 5** | **Mean (%)** | **Std (%)** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CNN | CASE1 | 80.59% | 77.32% | 76.81% | 78.24% | 80.74% | **78.74** | **1.64** |
| CNN | CASE2 | 80.21% | 80.65% | 81.83% | 79.96% | 80.90% | **80.71** | **0.65** |
| CNN | CASE3 | 81.42% | 80.91% | 81.55% | 80.46% | 79.82% | **80.83** | **0.64** |
| MLP | CASE1 | 56.60% | 56.05% | 55.43% | 55.45% | 55.36% | **55.78** | **0.48** |
| MLP | CASE2 | 55.50% | 55.73% | 55.27% | 56.02% | 55.82% | **55.67** | **0.26** |
| MLP | CASE3 | 55.07% | 55.72% | 56.24% | 55.25% | 56.25% | **55.71** | **0.49** |

**Based on performance (Test Accuracy):**
- In the CNN architecture: CASE 3 ≈ CASE 2 > CASE 1 (CASE 3 leads CASE 2 by only 0.12 percentage points, well within one standard deviation)
- In the MLP architecture: CASE 1 ≈ CASE 3 ≈ CASE 2 (differences are negligible)

**Based on stability (Standard Deviation):**
- CNN: CASE 3 ≈ CASE 2 < CASE 1
- MLP: All three cases exhibit low variance, with no significant difference observed
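As a rough check on how much weight these rankings can bear, the sketch below recomputes the CNN summary statistics from the per-run accuracies in the table and adds a Welch t statistic for two pairwise comparisons. The `welch_t` helper is an illustrative addition, not part of the original analysis. Note that `np.std` as used in the notebook returns the population standard deviation; the sample standard deviation is shown alongside it.

```python
import math
import statistics

# Final-epoch test accuracies (%) copied from the Experiment 2 table.
cnn_runs = {
    "case1": [80.59, 77.32, 76.81, 78.24, 80.74],
    "case2": [80.21, 80.65, 81.83, 79.96, 80.90],
    "case3": [81.42, 80.91, 81.55, 80.46, 79.82],
}

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se

for case, runs in cnn_runs.items():
    print(f"{case}: mean={statistics.fmean(runs):.2f}, "
          f"population std={statistics.pstdev(runs):.2f}, "
          f"sample std={statistics.stdev(runs):.2f}")

print(f"t(case2 vs case1) = {welch_t(cnn_runs['case2'], cnn_runs['case1']):.2f}")
print(f"t(case3 vs case2) = {welch_t(cnn_runs['case3'], cnn_runs['case2']):.2f}")
```

The statistic for CASE 2 versus CASE 1 comes out a little above 2, while the one for CASE 3 versus CASE 2 is well below 1. With only five runs per configuration, this supports treating CASE 2 and CASE 3 as indistinguishable and the gap to CASE 1 as suggestive rather than conclusive.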

## Analysis

### Interpretation Based on Performance Metrics
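- **CASE 1** (`Conv → BN → ReLU`) yielded the lowest accuracy of the three CNN configurations in both experiments (75.5% and 78.74%). This may be attributed to how the nonlinearity interacts with normalization in this ordering: ReLU zeroes the negative half of each normalized activation, which could limit the representational power of the network. The experiments here do not isolate this mechanism.
- Both **CASE 2** and **CASE 3** achieved higher test accuracy in the CNN. This suggests that applying Batch Normalization after the activation function may improve learning dynamics, at least within the 10-epoch training budget used here.
- **CASE 3**, which employs LeakyReLU followed by BatchNorm, had the highest mean CNN accuracy: 0.82 percentage points above CASE 2 in Experiment 1 (81.73% vs. 80.91%) and 0.12 points in Experiment 2 (80.83% vs. 80.71%). Both gaps are within one standard deviation, so any benefit from LeakyReLU's nonzero gradient for negative inputs is suggestive rather than established.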
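The gradient difference between the two activations can be sketched in a few lines of plain Python. The helper names `relu_grad` and `leaky_relu_grad` are illustrative; the slope of 0.01 matches the default `negative_slope` of `nn.LeakyReLU`, which is what `Case3Block` uses.

```python
def relu_grad(x: float) -> float:
    """Derivative of ReLU, taking the derivative at 0 to be 0."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    """Derivative of LeakyReLU; 0.01 is the default negative_slope of nn.LeakyReLU."""
    return 1.0 if x > 0 else negative_slope

# For negative inputs ReLU passes no gradient, while LeakyReLU passes a small one.
for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f}: ReLU'={relu_grad(x)}, LeakyReLU'={leaky_relu_grad(x)}")
```

For negative pre-activations, ReLU blocks the gradient entirely while LeakyReLU scales it by 0.01, so units that are inactive for a given input still receive a small learning signal.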

### Interpretation Based on Training Stability
- In the CNN architecture, **CASE 2** and **CASE 3** exhibited lower standard deviation compared to **CASE 1**, indicating more consistent convergence across runs.
- This may be explained by the fact that when BatchNorm is applied after the activation, the activation outputs are normalized, which stabilizes gradient flow and leads to reduced variability between experiments.

### Architectural Differences
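- In the MLP architecture, all three configurations showed negligible differences in performance (55.67% to 55.78%). One likely reason is the model definition itself: `MLPNetwork` applies the ordering variant only in its first layer, and the remaining hidden layers are plain `Linear → ReLU` without BatchNorm, so the three MLP variants are nearly identical. MLPs, which process flattened fixed-size vectors, may also be inherently less sensitive to the activation-BN ordering, but this setup cannot separate the two explanations.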
- Although the MLP results exhibited low variance overall, their absolute accuracy was lower than CNNs, suggesting that the model may have been less responsive to the structural variations under the given experimental conditions.

## Conclusion

In the CNN architecture, **CASE 3** (`LeakyReLU → BatchNorm`) yielded the highest mean accuracy, with **CASE 2** (`ReLU → BatchNorm`) slightly lower; the gap between the two is within one standard deviation, and both were superior to the conventional approach.

In contrast, the MLP architecture showed minimal sensitivity to the ordering of activation and normalization layers, with negligible differences observed across all configurations.

Notably, in the CNN, **CASE 2** (`ReLU → BatchNorm`) outperformed the widely adopted **CASE 1** (`BatchNorm → ReLU`) in both experiments, in terms of both accuracy and stability (as measured by standard deviation); in the MLP the two were indistinguishable. These findings, obtained with 10 epochs and 5 runs per configuration, suggest that the placement of normalization relative to the activation function should not be considered a fixed design convention. Instead, it should be treated as a tunable architectural choice that warrants empirical evaluation, particularly in convolutional models where its impact may be more pronounced.