# Day 10 — "Activation Functions: Geometry, Derivatives & Optimization Effects"

Activation functions are the folds that give neural networks nonlinear expressivity. They sculpt geometry and control gradient flow.

## 1. Core Intuition

- Stacked linear layers without activations collapse to a single linear map, because a product of matrices is itself a matrix.
- Activations bend space, creating folds, thresholds, and curved regions.
- Their derivatives determine how gradients propagate.
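
The collapse claim in the first bullet can be checked numerically. The following is a minimal sketch with hand-picked 2×2 weights (the names `W1`, `W2`, and `x` are illustrative and not part of the course code): two stacked linear layers match the single merged matrix, while inserting a ReLU between them changes the result.

```python
import numpy as np

# Hand-picked toy weights (illustrative values, not from the course repo).
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 0.0])

# Two linear layers applied in sequence.
stacked = W2 @ (W1 @ x)

# The same computation as one merged linear layer.
merged = (W2 @ W1) @ x

# Inserting a ReLU between the layers zeroes the negative hidden unit,
# so the output can no longer be reproduced by the merged matrix.
with_relu = W2 @ np.maximum(W1 @ x, 0.0)

print("stacked  :", stacked)
print("merged   :", merged)
print("with ReLU:", with_relu)
```

Here the hidden pre-activation is `[1, -1]`; the linear path sums it to zero, whereas the ReLU path discards the negative entry first. That discarded half-space is the "fold" the bullets describe.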

## 2. Popular Activations

| Function | Formula | Derivative | Geometry/Notes |
| --- | --- | --- | --- |
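| Sigmoid | `σ(x) = 1/(1+e^{-x})` | `σ(x)(1-σ(x))` | S-curve with outputs in (0, 1); derivative peaks at 0.25 and vanishes as x moves away from 0 (saturation). |
| Tanh | `(e^x-e^{-x})/(e^x+e^{-x})` | `1 - tanh^2(x)` | Zero-centered S-curve with outputs in (-1, 1); derivative peaks at 1 at x = 0 but still saturates. |
| ReLU | `max(0,x)` | `1 if x>0 else 0` | Piecewise linear; gradient is exactly 1 for active units, but units stuck at x ≤ 0 receive no gradient ("dead" units). |
| Leaky ReLU | `x` if `x>0` else `αx` | `1 if x>0 else α` | A small slope α (e.g. 0.01) for x < 0 keeps the gradient nonzero, mitigating dead units. |
| GELU | `xΦ(x) ≈ 0.5x(1+tanh(√(2/π)(x+0.044715x^3)))` | `Φ(x) + xφ(x)` (exact form) | Φ and φ are the standard normal CDF and PDF. Smooth everywhere; common in Transformers. |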
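
The derivative column can be sanity-checked against central finite differences. The sketch below defines standalone NumPy versions of each activation; these helper names are assumptions for illustration and are not the functions bundled in `days/day10/code/activations.py`. For GELU it differentiates the tanh approximation from the table directly, and the test points avoid x = 0, where ReLU and Leaky ReLU have a kink.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(x, 0.0)

def d_relu(x):
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def d_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

C = np.sqrt(2.0 / np.pi)

def gelu_tanh(x):
    # Tanh approximation of GELU, as written in the table.
    u = C * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + np.tanh(u))

def d_gelu_tanh(x):
    # Product rule plus chain rule applied to the tanh approximation.
    u = C * (x + 0.044715 * x**3)
    du = C * (1.0 + 3.0 * 0.044715 * x**2)
    t = np.tanh(u)
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t**2) * du

def numeric_grad(fn, x, eps=1e-5):
    # Central difference approximation of fn'(x).
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)

# Test points chosen away from 0 to avoid the ReLU kink.
x = np.array([-3.0, -0.5, 0.7, 2.5])

pairs = {
    "sigmoid": (sigmoid, d_sigmoid),
    "tanh": (np.tanh, d_tanh),
    "relu": (relu, d_relu),
    "leaky_relu": (leaky_relu, d_leaky_relu),
    "gelu_tanh": (gelu_tanh, d_gelu_tanh),
}

max_err = {
    name: float(np.max(np.abs(d(x) - numeric_grad(f, x))))
    for name, (f, d) in pairs.items()
}
for name, err in max_err.items():
    print(f"{name:>10s}  max |analytic - numeric| = {err:.2e}")
```

A large error for any entry would point to a mistake in the corresponding closed-form derivative.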

## 3. Python Implementation — Activation Utilities

`days/day10/code/activations.py` bundles functions and derivatives.

```python
from __future__ import annotations

import sys
from pathlib import Path
import numpy as np


def find_repo_root(marker: str = "days") -> Path:
    """Walk upward from the working directory until a folder containing `marker` is found."""
    path = Path.cwd()
    while path != path.parent:
        if (path / marker).exists():
            return path
        path = path.parent
    raise RuntimeError("Run this notebook from inside the repository tree.")

# Make the repository importable so that `days.day10...` resolves.
REPO_ROOT = find_repo_root()
if str(REPO_ROOT) not in sys.path:
    sys.path.append(str(REPO_ROOT))

from days.day10.code.activations import build_activations

# Evaluate every activation on five evenly spaced points in [-5, 5].
acts = build_activations()
x = np.linspace(-5, 5, 5)
for name, act in acts.items():
    print(name, act.fn(x))
```


## 4. Visualization — Curves & Derivatives

`days/day10/code/visualizations.py` plots activations and animates their derivatives.

```python
import numpy as np

from days.day10.code.visualizations import plot_activations, anim_activation_derivatives

# Figure generation is off by default; flip the flag to rebuild the assets.
RUN_ANIMATIONS = False

if RUN_ANIMATIONS:
    x = np.linspace(-5, 5, 400)
    path_curve = plot_activations(x)
    path_gif = anim_activation_derivatives(x)
    print('Saved assets →', path_curve, path_gif)
else:
    print('Set RUN_ANIMATIONS = True to regenerate Day 10 figures in days/day10/outputs/.')
```


## 5. Optimization Effects

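- Gradient flow: ReLU passes a gradient of exactly 1 through active units, and GELU's derivative approaches 1 for large positive inputs. The sigmoid derivative never exceeds 0.25 and the tanh derivative drops below 1 away from zero, so multiplying these factors across many layers shrinks the gradient.
- Loss landscape: smooth activations (GELU/Swish) have continuous derivatives, which tends to give smoother loss surfaces than ReLU's kink at zero.
- Architecture choices: common defaults are ReLU in CNNs, GELU in Transformers, and sigmoid/tanh in RNN gates.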
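
The gradient-flow bullet can be made concrete with a toy scalar chain. This is a rough sketch under strong simplifying assumptions: every layer is a single unit with weight 1 and no bias, so the backpropagated gradient is just the product of activation derivatives along the chain. The helper `chain_gradient` is hypothetical and not part of the course code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def chain_gradient(act, d_act, x0, depth):
    """Derivative of act(act(...act(x0))) with respect to x0 via the chain rule."""
    a, grad = x0, 1.0
    for _ in range(depth):
        grad *= d_act(a)   # local derivative at this layer's input
        a = act(a)         # forward pass to the next layer
    return grad

depth = 10
grads = {
    "sigmoid": chain_gradient(
        sigmoid, lambda a: sigmoid(a) * (1.0 - sigmoid(a)), 1.0, depth
    ),
    "tanh": chain_gradient(
        np.tanh, lambda a: 1.0 - np.tanh(a) ** 2, 1.0, depth
    ),
    "relu": chain_gradient(
        lambda a: max(a, 0.0), lambda a: 1.0 if a > 0 else 0.0, 1.0, depth
    ),
}

for name, g in grads.items():
    print(f"{name:>8s}: gradient after {depth} layers = {g:.3e}")
```

Because each sigmoid factor is at most 0.25, the sigmoid product is bounded by 0.25 to the power of the depth, while the ReLU chain stays at 1 as long as the unit remains active. Real networks also multiply by weight matrices, which this sketch deliberately ignores.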

## 6. Mini Exercises

1. Compare ReLU vs LeakyReLU derivatives.
2. Train a small MLP with Sigmoid vs ReLU; inspect gradient norms.
3. Swap GELU into a CNN and measure training speed.
4. Visualize gradient norms across layers for different activations.
5. Add BatchNorm to tanh networks and observe stability improvements.

## 7. Key Takeaways

| Point | Meaning |
| --- | --- |
| Activations bend space | Enable complex decision boundaries. |
| Derivatives control gradient flow | Good activations preserve signal. |
| ReLU revolutionized deep nets | No saturation on the positive side, so gradients pass through active units unchanged. |
| GELU/Swish | Smooth, modern choices (LLMs, ViTs). |
| Architecture-specific choices | Use the activation that fits gradient needs. |

> Activations are the soul of deep learning—they shape the geometry of learning itself.