# 📉 Pooling Layers - Smarter Downsampling

Welcome to pooling! The unsung hero of CNNs! 🎉

In the previous notebooks, we learned about convolution - the star of CNNs. Now let's meet its supporting actor: **pooling**! While not as flashy, pooling plays a crucial role in making CNNs work efficiently.

## 🎯 What You'll Learn

By the end of this notebook, you'll understand:
- **What is pooling** and why we need it
- **Max pooling** - keeping the strongest signals
- **Average pooling** - smooth downsampling
- **When to use each** type of pooling
- **Translation invariance** - why pooling helps recognition
- **Implementing pooling** from scratch in NumPy
- **Global pooling** - a special case
- **Alternatives** to pooling in modern CNNs

**Prerequisites:** Notebooks 01-02 (What are CNNs, Convolution Operation)

---

## 🖼️ The Photo Resizing Analogy

Think of pooling like **resizing a photo**:
- **Original**: 1000×1000 pixels (huge!)
- **Thumbnail**: 100×100 pixels (manageable!)

But how do we shrink it?
- **Max pooling**: Take the brightest pixel from each region
- **Average pooling**: Take the average color of each region

Both make the image smaller while preserving important information! 📐

Let's explore! 🚀

In [None]:
# Import our tools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import matplotlib.patches as mpatches

# Set random seed for reproducibility
np.random.seed(42)

# Configure matplotlib for better plots
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 10

print("✅ Libraries imported successfully!")
print(f"📦 NumPy version: {np.__version__}")

---
## 🤔 What is Pooling?

### 🎯 The Core Idea

**Pooling** is a downsampling operation that:
- **Reduces spatial dimensions** (makes feature maps smaller)
- **Keeps important information** (doesn't throw away everything)
- **Has no learnable parameters** (just a fixed operation)

### 📐 How It Works

1. **Divide** feature map into non-overlapping regions (e.g., 2×2 blocks)
2. **Apply** pooling operation to each region
3. **Output** one value per region

**Result**: Smaller feature map! 🎯

### 🤷 Why Do We Need Pooling?

**Problem without pooling:**
```
Input: 224×224×3
After Conv1: 224×224×64
After Conv2: 224×224×128
After Conv3: 224×224×256
...
```

Feature maps stay HUGE! 😱
- Too much memory
- Too much computation
- Too many parameters in FC layers

**Solution: Pooling!**
```
Input: 224×224×3
Conv1: 224×224×64 → Pool: 112×112×64  ✅ Halved!
Conv2: 112×112×128 → Pool: 56×56×128  ✅ Halved!
Conv3: 56×56×256 → Pool: 28×28×256    ✅ Halved!
```

Much more manageable! 🎉
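As a rough back-of-the-envelope check, the sketch below counts how many activation values each stack hands to the next layer, using the shapes from the two listings above. The helper name `activation_count` is introduced here purely for illustration.

```python
def activation_count(shape):
    """Number of values in a feature map of shape (H, W, C)."""
    h, w, c = shape
    return h * w * c

# Feature maps passed on after each block, copied from the listings above
without_pool = [(224, 224, 64), (224, 224, 128), (224, 224, 256)]
with_pool = [(112, 112, 64), (56, 56, 128), (28, 28, 256)]

total_without = sum(activation_count(s) for s in without_pool)
total_with = sum(activation_count(s) for s in with_pool)

print(f"Without pooling: {total_without:,} activations")
print(f"With pooling:    {total_with:,} activations")
print(f"Ratio:           {total_without / total_with:.1f}x")
```

This only counts stored values; the compute savings in later convolutions follow the same trend, because each convolution's cost scales with the spatial size of its input.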

### 🎁 Benefits of Pooling

- ✅ **Reduces computational cost** (smaller feature maps)
- ✅ **Provides translation invariance** (small shifts don't matter)
- ✅ **Increases receptive field** (each neuron sees more of the image)
- ✅ **Helps prevent overfitting** (reduces parameters in FC layers)
- ✅ **Makes network more robust** (less sensitive to exact positions)
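The receptive-field benefit can be made concrete with the usual recurrence: each layer with kernel size k grows the receptive field by (k - 1) times the current spacing between neighboring units, and each stride multiplies that spacing. The following is a minimal sketch under that assumption; `receptive_field` and the `(kernel_size, stride)` layer encoding are illustrative names, not part of this notebook's other code.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

conv, pool = (3, 1), (2, 2)  # 3x3 conv with stride 1; 2x2 pool with stride 2

print("Three convs, no pooling:", receptive_field([conv, conv, conv]))
print("Conv-pool-conv-pool-conv:", receptive_field([conv, pool, conv, pool, conv]))
```

With these settings, three stacked 3×3 convolutions see a 7×7 input patch, while interleaving two 2×2 pools widens that to 18×18.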

Let's see pooling in action!

In [None]:
# Create a simple example to show pooling effect
simple_image = np.array([
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [9, 2, 1, 3],
    [4, 5, 8, 7]
])

print("🖼️  Simple 4×4 Image:")
print(simple_image)
print(f"\nShape: {simple_image.shape}")
print(f"Total elements: {simple_image.size}")

# Manually compute 2x2 max pooling
pooled = np.array([
    [np.max(simple_image[0:2, 0:2]), np.max(simple_image[0:2, 2:4])],
    [np.max(simple_image[2:4, 0:2]), np.max(simple_image[2:4, 2:4])]
])

print("\n📉 After 2×2 Max Pooling:")
print(pooled)
print(f"\nShape: {pooled.shape}")
print(f"Total elements: {pooled.size}")

print("\n🎯 What Happened:")
print("   • Original: 4×4 = 16 values")
print("   • After pooling: 2×2 = 4 values")
print(f"   • Reduction: {(1 - pooled.size/simple_image.size)*100:.0f}% fewer values!")
print("\n💡 Pooling kept the MAXIMUM value from each 2×2 region")

---
## 🏆 Max Pooling - Keep the Winner!

### 🎯 The Idea

**Max pooling** takes the MAXIMUM value from each pooling window.

```
Input region:     Max pooling result:
[1  3]              6
[5  6]              ↑
                (takes maximum)
```

### 🤔 Why Maximum?

Think about what feature maps represent:
- **High values** = strong feature detected
- **Low values** = weak feature detected

**Max pooling says:** "I only care about the STRONGEST signal in this region!"

This makes sense because:
- If a cat ear is detected somewhere in a region, that's what matters
- Exact position within the region is less important
- We want to preserve strong activations

### 📐 Max Pooling Algorithm

```python
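# Sketch: assumes feature_map has shape (H, W) and output has shape (H // 2, W // 2)
for i in range(H // 2):
    for j in range(W // 2):
        output[i, j] = feature_map[2*i:2*i + 2, 2*j:2*j + 2].max()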
```

**Common configuration**: 2×2 windows, stride 2
- Result: Feature map size halved in each dimension
- Channels remain unchanged

Let's implement it!

In [None]:
def max_pool2d(input_data, pool_size=2, stride=2):
    """
    Perform 2D max pooling.
    
    Parameters:
    -----------
    input_data : np.ndarray, shape (H, W) or (H, W, C)
        Input feature map(s)
    pool_size : int
        Size of pooling window (pool_size × pool_size)
    stride : int
        Step size between pooling windows
    
    Returns:
    --------
    output : np.ndarray
        Pooled feature map(s)
    """
    # Handle both 2D and 3D inputs
    if input_data.ndim == 2:
        # Add channel dimension
        input_data = input_data[:, :, np.newaxis]
        squeeze_output = True
    else:
        squeeze_output = False
    
    height, width, channels = input_data.shape
    
    # Calculate output dimensions
    out_height = (height - pool_size) // stride + 1
    out_width = (width - pool_size) // stride + 1
    
    # Initialize output
    output = np.zeros((out_height, out_width, channels))
    
    # Perform max pooling
    for c in range(channels):
        for i in range(out_height):
            for j in range(out_width):
                # Calculate window boundaries
                h_start = i * stride
                h_end = h_start + pool_size
                w_start = j * stride
                w_end = w_start + pool_size
                
                # Extract window
                window = input_data[h_start:h_end, w_start:w_end, c]
                
                # Take maximum
                output[i, j, c] = np.max(window)
    
    # Remove channel dimension if input was 2D
    if squeeze_output:
        output = output[:, :, 0]
    
    return output

# Test with our simple example
print("🧪 Testing Max Pooling Implementation")
print("="*60)
print("\nOriginal 4×4 image:")
print(simple_image)

pooled = max_pool2d(simple_image, pool_size=2, stride=2)

print("\nAfter 2×2 max pooling (stride=2):")
print(pooled)

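# Check against the hand-computed maxima of the four 2×2 regions
assert np.array_equal(pooled, [[6, 8], [9, 8]]), "unexpected pooling result"
print("\n✅ Implementation works correctly!")
print(f"\n📊 Size reduction: {simple_image.shape} → {pooled.shape}")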
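The triple loop above is easy to follow but slow on large feature maps. As a hedged alternative, the sketch below uses NumPy's `sliding_window_view` (available in NumPy 1.20 and later) to express the same operation without Python loops. The function name `max_pool2d_vectorized` is made up for this example, and it handles 2D inputs only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool2d_vectorized(feature_map, pool_size=2, stride=2):
    """Max pooling on a 2D array without Python loops (illustrative sketch)."""
    # All pool_size x pool_size windows at stride 1:
    # shape (H - k + 1, W - k + 1, k, k)
    windows = sliding_window_view(feature_map, (pool_size, pool_size))
    # Keep every `stride`-th window, then reduce each window to its maximum
    return windows[::stride, ::stride].max(axis=(2, 3))

image = np.array([
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [9, 2, 1, 3],
    [4, 5, 8, 7],
])

print(max_pool2d_vectorized(image))
print(max_pool2d_vectorized(np.arange(25).reshape(5, 5), pool_size=3, stride=2).shape)
```

Because the windows are taken at stride 1 and then subsampled, the same function also covers overlapping pooling (for example 3×3 windows with stride 2).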

### 🎨 Visualizing Max Pooling

Let's see exactly what max pooling does at each position!

In [None]:
# Create a more interesting 8×8 test image
test_image = np.random.randint(0, 10, size=(8, 8))

# Apply max pooling
pooled_image = max_pool2d(test_image, pool_size=2, stride=2)

# Visualize the process
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Plot 1: Original image with grid
ax1 = axes[0]
im1 = ax1.imshow(test_image, cmap='viridis', interpolation='nearest')
ax1.set_title('Original Image (8×8)', fontsize=13, fontweight='bold')

# Draw 2×2 pooling regions
for i in range(0, 8, 2):
    for j in range(0, 8, 2):
        rect = Rectangle((j-0.5, i-0.5), 2, 2,
                        linewidth=3, edgecolor='red', facecolor='none')
        ax1.add_patch(rect)

# Add values
for i in range(8):
    for j in range(8):
        ax1.text(j, i, f'{test_image[i, j]}',
                ha='center', va='center',
                color='white', fontweight='bold', fontsize=9)

ax1.set_xticks(range(8))
ax1.set_yticks(range(8))
ax1.grid(True, color='white', linewidth=0.5)
plt.colorbar(im1, ax=ax1, fraction=0.046, pad=0.04)

# Plot 2: Pooled image
ax2 = axes[1]
im2 = ax2.imshow(pooled_image, cmap='viridis', interpolation='nearest')
ax2.set_title('After Max Pooling (4×4)', fontsize=13, fontweight='bold')

# Add values
for i in range(4):
    for j in range(4):
        ax2.text(j, i, f'{pooled_image[i, j]:.0f}',
                ha='center', va='center',
                color='white', fontweight='bold', fontsize=11)

ax2.set_xticks(range(4))
ax2.set_yticks(range(4))
ax2.grid(True, color='white', linewidth=1)
plt.colorbar(im2, ax=ax2, fraction=0.046, pad=0.04)

# Plot 3: Show one pooling region in detail
ax3 = axes[2]
ax3.axis('off')
ax3.set_xlim(0, 10)
ax3.set_ylim(0, 10)
ax3.set_title('Example: One Pooling Operation', fontsize=13, fontweight='bold')

# Show a 2×2 region
example_region = test_image[0:2, 0:2]
example_max = pooled_image[0, 0]

# Draw the region
region_str = f"Region:\n"
for i in range(2):
    region_str += "  ".join([f"{example_region[i, j]:2.0f}" for j in range(2)]) + "\n"

ax3.text(5, 7, region_str, ha='center', va='top',
        fontsize=14, family='monospace', fontweight='bold',
        bbox=dict(boxstyle='round,pad=1', facecolor='lightblue',
                 edgecolor='blue', linewidth=3))

# Draw arrow
ax3.annotate('', xy=(5, 4), xytext=(5, 5.5),
            arrowprops=dict(arrowstyle='->', lw=3, color='red'))
ax3.text(5, 4.8, 'max()', ha='center', fontsize=12, fontweight='bold', color='red')

# Draw result
ax3.text(5, 3, f"Result: {example_max:.0f}", ha='center', va='top',
        fontsize=16, fontweight='bold',
        bbox=dict(boxstyle='round,pad=1', facecolor='lightgreen',
                 edgecolor='green', linewidth=3))

ax3.text(5, 1, f'Maximum of {example_region.flatten().tolist()}\nis {example_max:.0f}',
        ha='center', fontsize=11, style='italic')

plt.tight_layout()
plt.show()

print("\n🎯 Key Observations:")
print("   • Each 2×2 region (red boxes) becomes one output value")
print("   • Output value = maximum from that region")
print("   • Spatial size reduced by factor of 2 (8×8 → 4×4)")
print("   • Channels remain unchanged (applies independently to each channel)")

---
## 📊 Average Pooling - Take the Mean

### 🎯 The Idea

**Average pooling** takes the AVERAGE (mean) value from each pooling window.

```
Input region:     Average pooling result:
[1  3]              3.75
[5  6]               ↑
              (1+3+5+6)/4 = 3.75
```

### 🤔 Why Average?

Average pooling is **smoother** than max pooling:
- Considers all values in the region (not just the max)
- Less aggressive: no single value decides the output
- Preserves more information about the overall pattern

### 📊 Max vs Average: When to Use?

**Max Pooling** 🏆:
- Most common in modern CNNs
- Good for detecting features ("is this feature present?")
- Preserves strong activations
- More translation invariant

**Average Pooling** 📊:
- Smoother, less aggressive
- Good for preserving overall structure
- Often used in final layers (global average pooling)
- Less prone to noise
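To make the noise point concrete, here is a tiny sketch with one 2×2 window before and after a single outlier is injected. The values are invented for illustration.

```python
import numpy as np

window = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
noisy = window.copy()
noisy[0, 1] = 9.0  # a single outlier, e.g. one noisy activation

for name, w in [("clean", window), ("noisy", noisy)]:
    print(f"{name}: max = {w.max():.1f}, mean = {w.mean():.1f}")
```

The max jumps all the way to the outlier, while the mean moves only a quarter of the way, because the spike is averaged with three ordinary values. The flip side is that a genuine strong activation is diluted in exactly the same way.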

Let's implement average pooling!

In [None]:
def avg_pool2d(input_data, pool_size=2, stride=2):
    """
    Perform 2D average pooling.
    
    Parameters:
    -----------
    input_data : np.ndarray, shape (H, W) or (H, W, C)
        Input feature map(s)
    pool_size : int
        Size of pooling window
    stride : int
        Step size between pooling windows
    
    Returns:
    --------
    output : np.ndarray
        Pooled feature map(s)
    """
    # Handle both 2D and 3D inputs
    if input_data.ndim == 2:
        input_data = input_data[:, :, np.newaxis]
        squeeze_output = True
    else:
        squeeze_output = False
    
    height, width, channels = input_data.shape
    
    # Calculate output dimensions
    out_height = (height - pool_size) // stride + 1
    out_width = (width - pool_size) // stride + 1
    
    # Initialize output
    output = np.zeros((out_height, out_width, channels))
    
    # Perform average pooling
    for c in range(channels):
        for i in range(out_height):
            for j in range(out_width):
                # Calculate window boundaries
                h_start = i * stride
                h_end = h_start + pool_size
                w_start = j * stride
                w_end = w_start + pool_size
                
                # Extract window
                window = input_data[h_start:h_end, w_start:w_end, c]
                
                # Take average (mean)
                output[i, j, c] = np.mean(window)
    
    if squeeze_output:
        output = output[:, :, 0]
    
    return output

# Test with our simple example
print("🧪 Testing Average Pooling Implementation")
print("="*60)
print("\nOriginal 4×4 image:")
print(simple_image)

avg_pooled = avg_pool2d(simple_image, pool_size=2, stride=2)

print("\nAfter 2×2 average pooling (stride=2):")
print(avg_pooled)

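# Check against the hand-computed means of the four 2×2 regions
assert np.allclose(avg_pooled, [[3.75, 5.25], [5.0, 4.75]]), "unexpected pooling result"
print("\n✅ Implementation works correctly!")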

# Show the calculation for one region
region = simple_image[0:2, 0:2]
print(f"\n🔍 Example calculation for top-left region:")
print(f"   Region: {region.flatten()}")
print(f"   Average: ({' + '.join(map(str, region.flatten()))}) / 4 = {np.mean(region):.2f}")

### ⚖️ Comparing Max and Average Pooling

Let's see them side-by-side on the same image!

In [None]:
# Create a test image with clear features
test_img = np.array([
    [1, 2, 3, 4, 5, 6, 7, 8],
    [2, 9, 3, 5, 6, 8, 7, 9],  # High value (9) in this region
    [3, 4, 5, 6, 7, 8, 9, 1],
    [4, 5, 6, 2, 8, 9, 1, 2],
    [5, 6, 7, 8, 1, 2, 3, 4],
    [6, 7, 8, 9, 2, 3, 4, 5],
    [7, 8, 9, 1, 3, 4, 5, 6],
    [8, 9, 1, 2, 4, 5, 6, 7]
])

# Apply both types of pooling
max_pooled = max_pool2d(test_img, pool_size=2, stride=2)
avg_pooled = avg_pool2d(test_img, pool_size=2, stride=2)

# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Original
im0 = axes[0].imshow(test_img, cmap='hot', interpolation='nearest')
axes[0].set_title('Original Image (8×8)', fontsize=13, fontweight='bold')
for i in range(8):
    for j in range(8):
        axes[0].text(j, i, f'{test_img[i, j]}',
                    ha='center', va='center',
                    color='white', fontweight='bold', fontsize=8)
axes[0].grid(True, color='gray', linewidth=0.5)
plt.colorbar(im0, ax=axes[0], fraction=0.046, pad=0.04)

# Max pooling
im1 = axes[1].imshow(max_pooled, cmap='hot', interpolation='nearest')
axes[1].set_title('Max Pooling (4×4)\n"Keep strongest signals"',
                 fontsize=13, fontweight='bold')
for i in range(4):
    for j in range(4):
        axes[1].text(j, i, f'{max_pooled[i, j]:.1f}',
                    ha='center', va='center',
                    color='white', fontweight='bold', fontsize=10)
axes[1].grid(True, color='gray', linewidth=1)
plt.colorbar(im1, ax=axes[1], fraction=0.046, pad=0.04)

# Average pooling
im2 = axes[2].imshow(avg_pooled, cmap='hot', interpolation='nearest')
axes[2].set_title('Average Pooling (4×4)\n"Smooth downsampling"',
                 fontsize=13, fontweight='bold')
for i in range(4):
    for j in range(4):
        axes[2].text(j, i, f'{avg_pooled[i, j]:.1f}',
                    ha='center', va='center',
                    color='white', fontweight='bold', fontsize=10)
axes[2].grid(True, color='gray', linewidth=1)
plt.colorbar(im2, ax=axes[2], fraction=0.046, pad=0.04)

plt.tight_layout()
plt.show()

# Compare statistics
print("\n📊 Comparison Statistics:")
print("="*70)
print(f"{'Metric':<25} {'Max Pooling':<20} {'Average Pooling':<20}")
print("="*70)
print(f"{'Output range':<25} [{max_pooled.min():.1f}, {max_pooled.max():.1f}]" + 
      f"{' '*8} [{avg_pooled.min():.1f}, {avg_pooled.max():.1f}]")
print(f"{'Mean value':<25} {max_pooled.mean():.2f}" + 
      f"{' '*15} {avg_pooled.mean():.2f}")
print(f"{'Std deviation':<25} {max_pooled.std():.2f}" + 
      f"{' '*15} {avg_pooled.std():.2f}")
print("="*70)

print("\n🎯 Key Differences:")
print("   • Max pooling preserves high values (peaks)")
print("   • Average pooling is smoother (less extreme values)")
print("   • On this image, max pooling has slightly higher variance")
print("   • Average pooling mean = original mean (each pixel is counted exactly once)")
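The last observation holds for any input as long as the windows tile the image without overlap. A minimal sketch, assuming a random 8×8 input and a reshape-based 2×2 pooling (a different route from the loop-based functions above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 10, size=(8, 8)).astype(float)

# Non-overlapping 2x2 windows via reshape:
# blocks[i, a, j, b] == x[2*i + a, 2*j + b], so axes 1 and 3 index inside a window
blocks = x.reshape(4, 2, 4, 2)
avg = blocks.mean(axis=(1, 3))
mx = blocks.max(axis=(1, 3))

print("Average pooling keeps the global mean:", np.isclose(avg.mean(), x.mean()))
print("Max pooling never lowers it:", mx.mean() >= x.mean())
```

Every pixel contributes to exactly one window with equal weight, so the mean of the window means equals the overall mean. Each window maximum is at least its window mean, so max pooling can only shift the average upward.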

---
## 🔄 Translation Invariance - The Magic of Pooling

### 🎯 What is Translation Invariance?

**Translation invariance** means: "Small shifts in input don't change output"

**Why is this important?**
- A cat is still a cat whether it's on the left or right of the image
- We want the network to recognize objects regardless of their exact position
- Small movements shouldn't dramatically change activations

### 🎨 How Pooling Helps

Pooling provides **local** translation invariance:
```
Region 1:        Region 2:        Both pool to:
[1, 9]           [9, 1]              9
[2, 3]           [3, 2]              ↑
                                 (max is 9)
```

The pattern shifted within the region, but the pooled output is the same! 🎯

Let's demonstrate this!

In [None]:
# Create an image with a bright spot
def create_image_with_spot(spot_position, image_size=8):
    """Create image with a single bright pixel at the given position."""
    img = np.ones((image_size, image_size)) * 2
    y, x = spot_position
    img[y, x] = 9  # Bright spot (1×1, so it stays inside one 2×2 pooling window)
    return img

# Create images with spot at different positions (within same pooling region)
spot_positions = [(0, 0), (0, 1), (1, 0), (1, 1)]
images = [create_image_with_spot(pos) for pos in spot_positions]

# Apply max pooling to all
pooled_images = [max_pool2d(img, pool_size=2, stride=2) for img in images]

# Visualize
fig, axes = plt.subplots(2, 4, figsize=(16, 8))

for idx, (img, pooled, pos) in enumerate(zip(images, pooled_images, spot_positions)):
    # Original image
    ax_orig = axes[0, idx]
    im = ax_orig.imshow(img, cmap='hot', interpolation='nearest', vmin=0, vmax=10)
    ax_orig.set_title(f'Spot at {pos}', fontsize=11, fontweight='bold')
    ax_orig.set_xticks([])
    ax_orig.set_yticks([])
    
    # Highlight the pooling region
    rect = Rectangle((-0.5, -0.5), 2, 2,
                    linewidth=3, edgecolor='cyan', facecolor='none')
    ax_orig.add_patch(rect)
    
    # Pooled image
    ax_pool = axes[1, idx]
    ax_pool.imshow(pooled, cmap='hot', interpolation='nearest', vmin=0, vmax=10)
    ax_pool.set_title(f'After pooling\nTop-left: {pooled[0, 0]:.0f}',
                     fontsize=11, fontweight='bold')
    ax_pool.set_xticks([])
    ax_pool.set_yticks([])
    
    # Add values to pooled image
    for i in range(4):
        for j in range(4):
            ax_pool.text(j, i, f'{pooled[i, j]:.0f}',
                        ha='center', va='center',
                        color='white', fontweight='bold')

plt.suptitle('Translation Invariance: Spot Moves, but Pooled Output Stays Same!',
            fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n🎯 Translation Invariance Demonstrated!")
print("="*70)
print("\nTop-left pooled values (all should be 9):")
for idx, (pos, pooled) in enumerate(zip(spot_positions, pooled_images)):
    print(f"   Spot at {pos}: Pooled value = {pooled[0, 0]:.0f}")

print("\n💡 Key Insight:")
print("   Even though the bright spot moved within the region (cyan box),")
print("   the max pooled output STAYED THE SAME (9)!")
print("   This is translation invariance - small shifts don't change the output!")
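The invariance is only local, and it helps to see where it stops. The sketch below moves a single bright pixel first within one 2×2 window and then across a window boundary. The helpers `max_pool_2x2` and `single_pixel_image` are hypothetical and kept separate from the functions defined earlier.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling for arrays with even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def single_pixel_image(row, col, size=4):
    """All-zero image with one bright pixel at (row, col)."""
    img = np.zeros((size, size))
    img[row, col] = 9
    return img

inside = max_pool_2x2(single_pixel_image(0, 0))
shifted_inside = max_pool_2x2(single_pixel_image(0, 1))  # still in window (0, 0)
shifted_across = max_pool_2x2(single_pixel_image(0, 2))  # now in window (0, 1)

print("Shift within a window leaves the output unchanged:",
      np.array_equal(inside, shifted_inside))
print("Shift across a window boundary leaves the output unchanged:",
      np.array_equal(inside, shifted_across))
```

A shift that stays inside a window is absorbed completely, while a shift that crosses a boundary moves the activation to a neighboring output cell. Stacking several pooling layers widens the range of shifts that are absorbed, but the effect never becomes exact for arbitrary shifts.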

---
## 🌍 Global Pooling - Pool Everything!

### 🎯 What is Global Pooling?

**Global pooling** pools each channel's ENTIRE feature map into a single value!

```
Feature map:         Global pooling:
[1  3  2  4]              
[5  6  7  8]         →   Single value
[9  2  1  3]              (max or average)
[4  5  8  7]
```

### 🤔 Why Use Global Pooling?

**Problem with fully-connected layers:**
```
Feature map: 7×7×512 = 25,088 neurons
FC layer: 1000 outputs
Parameters: 25,088 × 1,000 ≈ 25 MILLION parameters! 😱
```

**Solution: Global Average Pooling (GAP)**
```
Feature map: 7×7×512
↓ Global pool each channel
Result: 1×1×512 = 512 values
↓ FC layer
Output: 1000 classes
Parameters: 512 × 1,000 = 512k (49× fewer!) 🎉
```

### 🎁 Benefits of Global Average Pooling

- ✅ **Drastically reduces parameters** (helps prevent overfitting)
- ✅ **Discards spatial layout only at the very end** (where classification no longer needs it)
- ✅ **More robust to spatial translation** (the average ignores where an activation occurs)
- ✅ **Forces features to be meaningful** (can't rely on position)

Let's implement it!

In [None]:
def global_avg_pool2d(input_data):
    """
    Perform global average pooling.
    
    Parameters:
    -----------
    input_data : np.ndarray, shape (H, W, C)
        Input feature maps
    
    Returns:
    --------
    output : np.ndarray, shape (C,)
        One value per channel
    """
    if input_data.ndim == 2:
        # Single channel
        return np.mean(input_data)
    else:
        # Multiple channels - average each channel separately
        return np.mean(input_data, axis=(0, 1))

def global_max_pool2d(input_data):
    """
    Perform global max pooling.
    
    Parameters:
    -----------
    input_data : np.ndarray, shape (H, W, C)
        Input feature maps
    
    Returns:
    --------
    output : np.ndarray, shape (C,)
        One value per channel
    """
    if input_data.ndim == 2:
        return np.max(input_data)
    else:
        return np.max(input_data, axis=(0, 1))

# Test with a multi-channel feature map
print("🧪 Testing Global Pooling")
print("="*70)

# Simulate a 7×7×3 feature map (3 channels)
feature_map = np.random.randn(7, 7, 3)

print(f"\nInput feature map shape: {feature_map.shape}")
print(f"Total values: {feature_map.size}")

# Apply global pooling
gap_result = global_avg_pool2d(feature_map)
gmp_result = global_max_pool2d(feature_map)

print(f"\nAfter Global Average Pooling: {gap_result.shape}")
print(f"Values: {gap_result}")

print(f"\nAfter Global Max Pooling: {gmp_result.shape}")
print(f"Values: {gmp_result}")

print("\n🎯 What Happened:")
print(f"   • Original: 7×7×3 = {7*7*3} values")
print("   • After global pooling: 3 values (one per channel)")
print(f"   • Reduction: {(1 - 3/(7*7*3))*100:.1f}% fewer values!")

# Demonstrate parameter reduction
print("\n💡 Parameter Reduction Example:")
print("   Without GAP (7×7×512 → 1000 classes):")
fc_params_without = 7 * 7 * 512 * 1000
print(f"     {fc_params_without:,} parameters")

print("\n   With GAP (512 → 1000 classes):")
fc_params_with = 512 * 1000
print(f"     {fc_params_with:,} parameters")

print(f"\n   Reduction: {fc_params_without / fc_params_with:.0f}× fewer parameters! 🎉")
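The position-invariance of global average pooling can be checked directly. The sketch below shifts a random feature map circularly with `np.roll` and compares the pooled vectors. The circular shift is an idealization chosen so that no values leave the map; a real translation would push content off one edge and change the result slightly.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((7, 7, 3))

# Circularly shift the map 2 rows down and 3 columns right (channels untouched)
shifted = np.roll(feature_map, shift=(2, 3), axis=(0, 1))

gap_original = feature_map.mean(axis=(0, 1))
gap_shifted = shifted.mean(axis=(0, 1))

print("Maps differ:", not np.allclose(feature_map, shifted))
print("GAP outputs match:", np.allclose(gap_original, gap_shifted))
```

The per-channel mean depends only on which values are present, not on where they sit, so any rearrangement of positions within a channel leaves the output unchanged.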

---
## 🏗️ Pooling in Real CNN Architectures

### 📊 Typical CNN Pattern

```
Conv → ReLU → Pool → Conv → ReLU → Pool → ... → Flatten → FC
```

**Each pooling layer:**
- Reduces spatial dimensions (halves height and width)
- Keeps channel dimension unchanged
- Has zero learnable parameters

### 🎯 Common Configurations

**Standard Max Pooling:**
- Window: 2×2
- Stride: 2
- Result: Halves spatial dimensions

**Overlapping Pooling:**
- Window: 3×3
- Stride: 2
- Result: Overlapping regions, slightly more information preserved
- Used in AlexNet

**Global Pooling:**
- Window: entire feature map
- Replaces the large FC layers (often followed by a single small classifier layer)
- Common in modern architectures (ResNet, Inception)
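The output size for the windowed configurations above follows the same formula used inside `max_pool2d`. A small helper makes the difference between the standard and overlapping settings easy to check. The name `pool_output_size` and the input size 55 are chosen here for illustration only, and no padding is assumed.

```python
def pool_output_size(input_size, pool_size, stride):
    """Output length along one axis for pooling without padding."""
    return (input_size - pool_size) // stride + 1

print("2x2, stride 2 on 224:", pool_output_size(224, 2, 2))
print("3x3, stride 2 on 55: ", pool_output_size(55, 3, 2))
print("2x2, stride 2 on 55: ", pool_output_size(55, 2, 2))
```

The integer division matters for odd sizes: rows and columns that do not fill a complete window are dropped.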

Let's trace through a typical CNN!

In [None]:
# Simulate a typical CNN architecture with pooling
def trace_cnn_with_pooling():
    """
    Trace feature map sizes through a typical CNN.
    """
    print("🏗️  Typical CNN Architecture with Pooling")
    print("="*80)
    print(f"{'Layer':<20} {'Operation':<20} {'Output Shape':<20} {'Parameters':<15}")
    print("="*80)
    
    # Start with ImageNet-sized input
    h, w, c = 224, 224, 3
    
    print(f"{'Input':<20} {'-':<20} {f'{h}×{w}×{c}':<20} {0:<15}")
    
    # Block 1
    c = 64
    params1 = 3 * 3 * 3 * c + c  # 3×3 conv from 3 to 64 channels
    print(f"{'Conv1 (3×3, 64)':<20} {'Convolution':<20} {f'{h}×{w}×{c}':<20} {params1:<15,}")
    print(f"{'ReLU1':<20} {'Activation':<20} {f'{h}×{w}×{c}':<20} {0:<15}")
    
    h, w = h // 2, w // 2
    print(f"{'MaxPool1 (2×2)':<20} {'Max Pooling':<20} {f'{h}×{w}×{c}':<20} {0:<15}")
    
    # Block 2
    c_prev = c
    c = 128
    params2 = 3 * 3 * c_prev * c + c
    print(f"{'Conv2 (3×3, 128)':<20} {'Convolution':<20} {f'{h}×{w}×{c}':<20} {params2:<15,}")
    print(f"{'ReLU2':<20} {'Activation':<20} {f'{h}×{w}×{c}':<20} {0:<15}")
    
    h, w = h // 2, w // 2
    print(f"{'MaxPool2 (2×2)':<20} {'Max Pooling':<20} {f'{h}×{w}×{c}':<20} {0:<15}")
    
    # Block 3
    c_prev = c
    c = 256
    params3 = 3 * 3 * c_prev * c + c
    print(f"{'Conv3 (3×3, 256)':<20} {'Convolution':<20} {f'{h}×{w}×{c}':<20} {params3:<15,}")
    print(f"{'ReLU3':<20} {'Activation':<20} {f'{h}×{w}×{c}':<20} {0:<15}")
    
    h, w = h // 2, w // 2
    print(f"{'MaxPool3 (2×2)':<20} {'Max Pooling':<20} {f'{h}×{w}×{c}':<20} {0:<15}")
    
    # Global pooling
    print(f"{'GlobalAvgPool':<20} {'Global Pooling':<20} {f'1×1×{c}':<20} {0:<15}")
    
    # Final FC
    fc_params = c * 1000 + 1000
    print(f"{'FC (1000 classes)':<20} {'Fully Connected':<20} {'1000':<20} {fc_params:<15,}")
    
    print("="*80)
    
    total_params = params1 + params2 + params3 + fc_params
    print(f"\nTotal parameters: {total_params:,}")
    
    print("\n🎯 Key Observations:")
    print("   • Pooling layers have 0 parameters (just operations)")
    print("   • Spatial dimensions: 224 → 112 → 56 → 28 (halved each time)")
    print("   • Channels increase: 3 → 64 → 128 → 256 (learning more features)")
    print("   • Global pooling reduces 28×28×256 to just 256 values!")
    print("   • Without global pooling, the FC layer would need 784× more weights (28×28 = 784)!")

trace_cnn_with_pooling()

---
## 🆕 Modern Alternatives to Pooling

### 🤔 Is Pooling Always Necessary?

**Traditional wisdom**: Yes! Every CNN needs pooling.

**Modern view**: Not always! There are alternatives.

### 🔄 Alternative: Strided Convolutions

Instead of:
```
Conv (stride=1) → Pool (stride=2)
```

Use:
```
Conv (stride=2)
```

**Advantages:**
- Learnable downsampling (can learn optimal way to reduce size)
- Fewer layers (simpler architecture)
- Still reduces spatial dimensions

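**Disadvantages:**
- More parameters if the strided convolution is added as an extra layer in place of the pool (none extra if the stride is folded into an existing convolution)
- Less translation invariance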
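To see how the two downsampling routes relate, the sketch below runs a naive single-channel convolution (strictly, a cross-correlation, with no padding) at stride 1 and stride 2 with the same random kernel. The function `conv2d_valid` is a hypothetical helper written for this comparison, not a library API.

```python
import numpy as np

def conv2d_valid(x, kernel, stride=1):
    """Naive single-channel cross-correlation without padding."""
    k = kernel.shape[0]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

dense = conv2d_valid(x, kernel, stride=1)            # 6x6 output
strided = conv2d_valid(x, kernel, stride=2)          # 3x3 output
pooled = dense.reshape(3, 2, 3, 2).max(axis=(1, 3))  # 2x2 max pool of dense

print("Shapes:", dense.shape, pooled.shape, strided.shape)
print("Strided conv == every second dense output:",
      np.allclose(strided, dense[::2, ::2]))
```

Both routes end at the same spatial size. With a fixed kernel, the strided convolution simply keeps the top-left response of each 2×2 block, whereas max pooling keeps the largest one. The "learnable" advantage comes from training: the kernel weights adapt to the fact that only those positions are kept.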

### 🎯 When to Use What?

**Use Max Pooling when:**
- You want translation invariance
- You want to minimize parameters
- Classification tasks
- Standard CNN architectures

**Use Strided Convolutions when:**
- You want learnable downsampling
- Position matters (e.g., segmentation)
- Modern architectures (ResNet uses both!)

**Use Global Average Pooling when:**
- Replacing fully-connected layers
- Want to reduce parameters dramatically
- Final layers of the network

Let's compare these approaches!

In [None]:
# Compare pooling vs strided convolution
print("⚖️  Pooling vs Strided Convolution Comparison")
print("="*80)

# Scenario: Downsample 224×224×64 to 112×112×128
print("\nScenario: Downsample 224×224×64 → 112×112×128\n")

print("Approach 1: Conv (stride=1) + Max Pool")
print("─"*80)
conv1_params = 3 * 3 * 64 * 128 + 128  # 3×3 conv
pool_params = 0
total1 = conv1_params + pool_params
print(f"  Conv (3×3, stride=1): {conv1_params:,} parameters")
print(f"  Max Pool (2×2, stride=2): {pool_params:,} parameters")
print(f"  Total: {total1:,} parameters")

print("\nApproach 2: Strided Convolution Only")
print("─"*80)
conv2_params = 3 * 3 * 64 * 128 + 128  # 3×3 conv with stride=2
total2 = conv2_params
print(f"  Conv (3×3, stride=2): {conv2_params:,} parameters")
print(f"  Total: {total2:,} parameters")

print("\n" + "="*80)
print("\n📊 Analysis:")
print(f"   Parameter difference: {abs(total2 - total1):,} (same in this case!)")
print("   Both produce 112×112×128 output")
print("   Pooling provides more translation invariance")
print("   Strided conv allows learning optimal downsampling")

print("\n🎯 Modern Practice:")
print("   • Early CNNs (AlexNet, VGG): Always use pooling")
print("   • Modern CNNs (ResNet, Inception): Mix both approaches")
print("   • Some networks (All-CNN): Replace all pooling with strided conv")
print("   • Choice depends on task and architecture goals")

---
## 🎯 Summary: Pooling Layers

Congratulations! You now understand pooling - the downsampling hero of CNNs! 🎉

### ✅ What We Learned

1. **What is Pooling:**
   - Downsampling operation that reduces spatial dimensions
   - No learnable parameters (just an operation)
   - Helps manage computational cost

2. **Max Pooling:**
   - Takes maximum value from each window
   - Preserves strongest activations
   - Most common in CNNs
   - Best for feature detection

3. **Average Pooling:**
   - Takes mean value from each window
   - Smoother downsampling
   - Used in some architectures
   - Good for final layers

4. **Translation Invariance:**
   - Small shifts in input don't change output
   - Key benefit of pooling
   - Makes networks robust to object position

5. **Global Pooling:**
   - Pools each channel's entire feature map to a single value
   - Drastically reduces parameters
   - Alternative to fully-connected layers

6. **Modern Alternatives:**
   - Strided convolutions can replace pooling
   - Trade-offs between parameters and invariance
   - Modern architectures use both

### 🧮 Key Concepts

**Standard Pooling:**
```
Output Size = floor((Input - Pool_Size) / Stride) + 1
```

**Common Configuration:**
- Pool size: 2×2
- Stride: 2
- Result: Halves spatial dimensions

**Global Pooling:**
```
Input: H × W × C
Output: 1 × 1 × C  (or just C values)
```

### 💡 Design Guidelines

**Use Max Pooling:**
- After conv layers for downsampling
- 2×2 window, stride 2 is standard
- When you want translation invariance

**Use Average Pooling:**
- When you want smoother features
- Less common than max pooling

**Use Global Average Pooling:**
- Replace fully-connected layers
- Before final classification
- Dramatically reduces parameters

**Consider Strided Convolutions:**
- When you want learnable downsampling
- In modern architectures
- When position information is important

### 🎓 What's Next?

Now that you understand both convolution and pooling, you're ready to:

**Next Notebook: Building a Complete CNN**
- Combine convolution, pooling, and FC layers
- Train on MNIST or Fashion-MNIST
- Visualize what the network learns
- Understand the complete training pipeline

Let's build our first complete CNN! 🚀

---
## 🎮 Practice Exercises

Test your understanding with these exercises:

### Exercise 1: Implement Different Pool Sizes
Modify the max pooling function to support:
- 3×3 pooling
- Non-square pooling (e.g., 2×3)
- Different strides

### Exercise 2: Visualize Pooling Effect
Create an image with various patterns (edges, spots, textures).
Apply max and average pooling with different window sizes.
Compare which features are preserved.

### Exercise 3: Calculate Output Dimensions
Given:
- Input: 100×100 feature map
- Pooling: 3×3 window
- Stride: 2

What is the output size? Write a function to verify.

### Exercise 4: Implement Overlapping Pooling
Implement max pooling with:
- Window: 3×3
- Stride: 2
- Compare with non-overlapping (2×2, stride=2)

### Exercise 5: Test Translation Invariance
Create an image with a pattern.
Shift the pattern by 1 pixel in different directions.
Apply pooling and compare outputs.
Measure how much the pooled output changes.

### Exercise 6: Compare Parameter Counts
For a network that goes from 224×224×64 to 112×112×128:
- Calculate parameters for: Conv + Pool
- Calculate parameters for: Strided Conv only
- Compare the trade-offs

**Try these exercises in the exercises.ipynb notebook!**

---

*Excellent work! You now understand both convolution AND pooling!* 💪

*Ready to build a complete CNN? Let's go!* → **[Next: Notebook 04 - Building a Complete CNN](04_building_complete_cnn.ipynb)**