# Concepts: Neural Networks

Understanding the building blocks of deep learning.

This notebook covers the **theory** behind neural networks. No framework code—just intuition and visualization.

---

## What is a Neural Network?

A neural network is a **function** that maps inputs to outputs by learning patterns from data.

```
    INPUT              HIDDEN LAYERS              OUTPUT
    
    x₁ ─────┐     ┌─────○─────┐     ┌─────○
            │     │           │     │      
    x₂ ─────┼─────┼─────○─────┼─────┼─────○───▶ ŷ
            │     │           │     │      
    x₃ ─────┘     └─────○─────┘     └─────○
    
           weights (W)    activation    weights (W)
```

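**The key insight:** By adjusting the **weights**, the network can learn to approximate a very wide range of functions. With enough hidden units and a non-linear activation, it can approximate any continuous function on a bounded input range to arbitrary accuracy.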

---

## The Neuron (Perceptron)

A single neuron performs two operations:

**1. Linear combination:**

$$z = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b = \mathbf{w}^T \mathbf{x} + b$$

**2. Activation:**

$$a = \sigma(z)$$

Where:
- $\mathbf{x}$ = input vector
- $\mathbf{w}$ = weights (learned)
- $b$ = bias (learned)
- $\sigma$ = activation function
- $a$ = output (activation)

```
        ┌─────────────────────────────────┐
   x₁ ──┤ w₁                              │
        │     ╲                           │
   x₂ ──┤ w₂ ──→  Σ + b  ──→  σ(z)  ──→  │──→ output
        │     ╱                           │
   x₃ ──┤ w₃                              │
        └─────────────────────────────────┘
```
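The two operations can be sketched in a few lines of NumPy. The input, weight, and bias values below are illustrative assumptions, and sigmoid is picked as one possible choice of $\sigma$:

```python
import numpy as np

def sigmoid(z):
    """One possible activation function (an assumption for this sketch)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])     # input vector
w = np.array([0.5, -0.25, 0.1])   # weights (illustrative values)
b = 0.2                           # bias (illustrative value)

z = w @ x + b    # 1. linear combination: w^T x + b
a = sigmoid(z)   # 2. activation

print(z)  # approximately 0.5
print(a)  # approximately 0.622
```

During training, only `w` and `b` change; the two-step structure stays the same.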

---

## Activation Functions

Activation functions introduce **non-linearity**, allowing networks to learn complex patterns.

Without them, a deep network would collapse to a single linear function, no matter how many layers it has, because a composition of linear maps is itself linear.
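A small numerical check of the collapse, as a sketch: the layer sizes and random weights are assumptions made for illustration. Two stacked layers with no activation in between give exactly the same output as one linear layer with combined weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two layers with no activation function in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```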

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-5, 5, 200)

fig, axes = plt.subplots(1, 4, figsize=(14, 3))

# Sigmoid (backslashes are doubled so Python does not treat \s as an escape)
axes[0].plot(z, 1 / (1 + np.exp(-z)), 'b-', linewidth=2)
axes[0].set_title('Sigmoid\n$\\sigma(z) = 1/(1+e^{-z})$')
axes[0].set_xlabel('z')
axes[0].axhline(0.5, color='gray', linestyle='--', alpha=0.5)
axes[0].grid(True, alpha=0.3)

# Tanh
axes[1].plot(z, np.tanh(z), 'g-', linewidth=2)
axes[1].set_title('Tanh\n$\\tanh(z)$')
axes[1].set_xlabel('z')
axes[1].axhline(0, color='gray', linestyle='--', alpha=0.5)
axes[1].grid(True, alpha=0.3)

# ReLU
axes[2].plot(z, np.maximum(0, z), 'r-', linewidth=2)
axes[2].set_title('ReLU\n$\\max(0, z)$')
axes[2].set_xlabel('z')
axes[2].axhline(0, color='gray', linestyle='--', alpha=0.5)
axes[2].grid(True, alpha=0.3)

# Linear
axes[3].plot(z, z, 'm-', linewidth=2)
axes[3].set_title('Linear\n$z$')
axes[3].set_xlabel('z')
axes[3].axhline(0, color='gray', linestyle='--', alpha=0.5)
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

### When to Use Each Activation

| Function | Range | Use Case |
|----------|-------|----------|
| **Sigmoid** | (0, 1) | Binary classification output |
| **Tanh** | (-1, 1) | Hidden layers (older networks) |
| **ReLU** | [0, ∞) | Hidden layers (modern default) |
| **Softmax** | (0, 1), sums to 1 | Multi-class classification output |
| **Linear** | (-∞, ∞) | Regression output |
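Softmax is the one entry in the table that acts on a whole vector of scores rather than on a single value. A minimal NumPy sketch follows; the logits are illustrative values, and subtracting the maximum is a common numerical-stability trick added here, not something the table requires:

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)    # does not change the result, avoids overflow
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])   # illustrative raw scores for 3 classes
probs = softmax(logits)

print(probs)        # roughly [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0, up to floating-point rounding
```

The largest score gets the largest probability, and every output stays strictly between 0 and 1.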

---

## Loss Functions

The **loss function** measures how wrong the model's predictions are. Training minimizes this loss.

### Regression: Mean Squared Error (MSE)

$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Squaring the error penalizes large errors much more heavily than small ones: doubling an error quadruples its contribution to the loss.
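A quick sketch of the formula in NumPy, using made-up targets and predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])   # illustrative targets
y_pred = np.array([2.5, 5.0, 4.0])   # illustrative predictions

squared_errors = (y_true - y_pred) ** 2   # [0.25, 0.0, 4.0]
mse = np.mean(squared_errors)

print(squared_errors)
print(mse)  # 4.25 / 3, approximately 1.417
```

The third prediction is off by 2, four times the first prediction's error of 0.5, yet it contributes 16 times as much to the loss.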

### Binary Classification: Binary Cross-Entropy

$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

Used when the output $\hat{y}_i$ is a predicted probability between 0 and 1 and the label $y_i$ is either 0 or 1.
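The sketch below evaluates this loss for two sets of illustrative predictions. The function name and the `eps` clipping (to avoid `log(0)`) are assumptions of this example rather than part of the formula:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; eps keeps log() away from 0."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
confident_right = np.array([0.9, 0.1, 0.9])   # illustrative predictions
confident_wrong = np.array([0.1, 0.9, 0.1])

loss_right = binary_cross_entropy(y_true, confident_right)
loss_wrong = binary_cross_entropy(y_true, confident_wrong)

print(loss_right)  # -log(0.9), approximately 0.105
print(loss_wrong)  # -log(0.1), approximately 2.303
```

Confident wrong predictions are punished far more than confident correct ones are rewarded.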

### Multi-class Classification: Categorical Cross-Entropy

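$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \log(\hat{y}_{i,k})$$

Here $K$ is the number of classes, $y_{i,k}$ is 1 if example $i$ belongs to class $k$ (and 0 otherwise), and $\hat{y}_{i,k}$ is the predicted probability of class $k$ for example $i$.

Used with softmax output (probability distribution over classes).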
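For a single example with a one-hot label, only the term for the true class survives the sum. A short sketch with illustrative numbers:

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])   # one-hot label: the true class is index 1
y_pred = np.array([0.2, 0.7, 0.1])   # illustrative softmax output

loss = -np.sum(y_true * np.log(y_pred))

print(loss)  # -log(0.7), approximately 0.357
```

The loss depends only on the probability assigned to the correct class, so it falls toward 0 as that probability approaches 1.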

### Summary: Which Loss to Use?

| Problem | Output Activation | Loss Function |
|---------|-------------------|---------------|
| Regression | Linear | MSE |
| Binary classification | Sigmoid | Binary Cross-Entropy |
| Multi-class (one-hot) | Softmax | Categorical Cross-Entropy |
| Multi-class (integers) | Softmax | Sparse Categorical Cross-Entropy |
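The last two rows compute the same quantity and differ only in how labels are stored. The following sketch, using illustrative probabilities and labels, computes both by hand in NumPy (it does not call any framework loss function):

```python
import numpy as np

# Illustrative softmax outputs for 2 examples and 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

labels_int = np.array([0, 2])            # integer class labels
labels_onehot = np.eye(3)[labels_int]    # the same labels, one-hot encoded

# Categorical cross-entropy: sum over classes, mean over examples
cce = -np.mean(np.sum(labels_onehot * np.log(probs), axis=1))

# Sparse version: index the true-class probability directly
rows = np.arange(len(labels_int))
sparse_cce = -np.mean(np.log(probs[rows, labels_int]))

print(np.isclose(cce, sparse_cce))  # True
```

Integer labels avoid building the one-hot matrix, which matters when the number of classes is large.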

---

## How Networks Learn: Gradient Descent

Training a neural network means finding weights that **minimize the loss**.

**Gradient descent** does this by repeating three steps:
1. Compute the loss for the current weights
2. Compute the **gradient** of the loss with respect to each weight (the direction of steepest increase)
3. Update the weights in the **opposite direction**

$$w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial \mathcal{L}}{\partial w}$$

Where $\eta$ is the **learning rate** (step size).
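As a worked single step, assume an illustrative loss $\mathcal{L}(w) = (w - 3)^2 + 1$, a starting weight $w_{\text{old}} = 0$, and $\eta = 0.1$ (all values chosen for illustration):

$$\frac{\partial \mathcal{L}}{\partial w} = 2(w_{\text{old}} - 3) = -6, \qquad w_{\text{new}} = 0 - 0.1 \cdot (-6) = 0.6$$

The gradient is negative, so the update moves $w$ to the right, toward the minimum at $w = 3$, and the loss drops from $10$ to $6.76$.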

```python
import numpy as np
import matplotlib.pyplot as plt

# Visualize gradient descent
def loss_fn(w):
    return (w - 3) ** 2 + 1  # Minimum at w=3

# Simulate gradient descent
w = 0.0    # starting weight
lr = 0.1   # learning rate (eta)
history = [w]
for _ in range(15):
    gradient = 2 * (w - 3)   # derivative of loss_fn with respect to w
    w = w - lr * gradient    # step against the gradient
    history.append(w)

# Plot
w_range = np.linspace(-1, 6, 100)
plt.figure(figsize=(10, 4))
plt.plot(w_range, loss_fn(w_range), 'b-', linewidth=2, label='Loss function')
plt.scatter(history, [loss_fn(h) for h in history], c='red', s=80, zorder=5, label='Gradient descent steps')
plt.plot(history, [loss_fn(h) for h in history], 'r--', alpha=0.5)
plt.xlabel('Weight (w)', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Gradient Descent: Finding the Minimum', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

---

## Backpropagation (Intuition)

**Backpropagation** computes gradients efficiently using the chain rule.

The key insight: to know how much each weight contributes to the loss, we propagate the error **backwards** through the network.

```
FORWARD PASS (compute predictions)
────────────────────────────────▶

  Input ──▶ Layer 1 ──▶ Layer 2 ──▶ Output ──▶ Loss

◀────────────────────────────────
BACKWARD PASS (compute gradients)
```

**You don't need to implement this yourself.** Frameworks like Keras compute these gradients automatically!
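Still, seeing the chain rule once on a single neuron can help the intuition. The sketch below assumes a sigmoid neuron with a squared-error loss and illustrative values for `x`, `y`, `w`, and `b`; it computes the gradient by hand and checks it against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, b, x, y):
    """Squared error of a single sigmoid neuron (an assumed setup)."""
    return (sigmoid(w @ x + b) - y) ** 2

x = np.array([1.0, 2.0, 3.0])     # illustrative input
y = 1.0                           # illustrative target
w = np.array([0.5, -0.25, 0.1])   # illustrative weights
b = 0.2                           # illustrative bias

# Forward pass
z = w @ x + b
a = sigmoid(z)

# Backward pass: multiply local derivatives from the loss back to the weights
dL_da = 2 * (a - y)       # derivative of (a - y)^2 with respect to a
da_dz = a * (1 - a)       # derivative of sigmoid with respect to z
dL_dz = dL_da * da_dz
grad_w = dL_dz * x        # dz/dw = x
grad_b = dL_dz            # dz/db = 1

# Numerical check with central differences
eps = 1e-6
numeric_grad_w = np.array([
    (loss_fn(w + eps * e, b, x, y) - loss_fn(w - eps * e, b, x, y)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad_w, numeric_grad_w, atol=1e-6))  # True
```

In a deeper network the same pattern repeats layer by layer, with each layer reusing the gradient already computed for the layer after it.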

---

## Key Takeaways

1. **Neurons** compute weighted sums followed by an activation function
2. **Activation functions** add non-linearity (ReLU for hidden layers, sigmoid/softmax for output)
3. **Loss functions** measure prediction error (MSE for regression, cross-entropy for classification)
4. **Gradient descent** minimizes the loss by iteratively updating weights
5. **Backpropagation** efficiently computes gradients for all weights

**Next:** Learn about training in [02_concepts_training](02_concepts_training.ipynb), then apply these concepts in [03_lab_first_neural_network](03_lab_first_neural_network.ipynb).