# 🧠 Softmax Activation Function
This notebook explains the softmax function: its formula, the intuition behind it, and a NumPy implementation.

## 📐 Formula
For a vector $\mathbf{z} \in \mathbb{R}^k$ (logits), the softmax function is defined as:

$$
\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}
$$

This turns a vector of raw scores into probabilities.
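
As a quick worked example, take the illustrative logits $\mathbf{z} = (2.0, 1.0, 0.1)$. All values are rounded to three decimals:

$$
e^{2.0} \approx 7.389, \quad e^{1.0} \approx 2.718, \quad e^{0.1} \approx 1.105, \quad \sum_{j=1}^{3} e^{z_j} \approx 11.213
$$

$$
\sigma(\mathbf{z}) \approx \left( \frac{7.389}{11.213},\ \frac{2.718}{11.213},\ \frac{1.105}{11.213} \right) \approx (0.659,\ 0.242,\ 0.099)
$$

The three outputs are positive and sum to 1 (up to rounding), and the largest logit receives the largest share.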

## 💡 Intuition
- Converts raw logits into a probability distribution.
- The output values lie in the range (0, 1) and sum to 1.
- Because of the exponential, larger logits receive a disproportionately large share of the probability mass, yet every element keeps a nonzero probability.
- Commonly used in the output layer for multi-class classification problems.
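
The properties listed above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; `softmax_sketch` and the logits are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax_sketch(z):
    """Illustrative softmax for a 1-D vector of moderate-sized logits."""
    exp_z = np.exp(np.asarray(z, dtype=float))
    return exp_z / exp_z.sum()

logits = np.array([3.0, 1.0, -2.0])  # made-up scores
probs = softmax_sketch(logits)

# Every output lies strictly between 0 and 1, and the outputs sum to 1.
print("probabilities:", probs)
print("sum:", probs.sum())

# The ranking of the logits is preserved in the probabilities.
print("same ordering:", np.array_equal(np.argsort(probs), np.argsort(logits)))

# Scaling the logits up concentrates more mass on the largest one.
sharper = softmax_sketch(2.0 * logits)
print("max prob before/after scaling:", probs.max(), sharper.max())
```

Doubling the logits widens the gaps between them, which is why the largest probability grows.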

In [None]:
import numpy as np

def softmax(z):
    """Basic softmax over the last axis; np.exp can overflow for large logits."""
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def stable_softmax(z):
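    """Numerically stable softmax over the last axis."""
    # Subtracting the max leaves the result unchanged but keeps every exponent <= 0.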
    z_max = np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z - z_max)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Example usage:
z = np.array([2.0, 1.0, 0.1])
print("Softmax:", softmax(z))
print("Stable Softmax:", stable_softmax(z))
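
For small logits the two functions agree, so the difference only shows up with large inputs. The sketch below repeats both definitions so that it runs on its own, and uses made-up logits around 1000 to trigger overflow in the basic version. `np.errstate` is used only to silence the expected floating-point warnings.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def stable_softmax(z):
    z_max = np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z - z_max)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

big = np.array([1000.0, 1001.0, 1002.0])  # illustrative large logits

# exp(1000) overflows to inf in float64, so the basic version computes inf / inf = nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive_out = softmax(big)

stable_out = stable_softmax(big)

# Softmax is invariant to adding a constant to every logit,
# so shifting by -1000 gives a safe reference value.
reference = softmax(big - 1000.0)

print("Basic: ", naive_out)
print("Stable:", stable_out)
print("Matches shifted reference:", np.allclose(stable_out, reference))
```

The shift invariance follows from the formula: a common factor $e^{-c}$ cancels between the numerator and the denominator.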

## 📉 Gradient (Jacobian)
Writing $\sigma_i$ for $\sigma(\mathbf{z})_i$, the partial derivative of the $i$-th output with respect to the $j$-th logit is:

$$
\frac{\partial \sigma_i}{\partial z_j} = 
\begin{cases}
\sigma_i (1 - \sigma_i), & i = j \\
- \sigma_i \sigma_j, & i \neq j
\end{cases}
$$

Equivalently, in matrix form, with $\sigma$ treated as a column vector:
$$
J = \text{diag}(\sigma) - \sigma \sigma^T
$$
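
The matrix form translates directly into NumPy. The following is a minimal sketch for a single 1-D logit vector; `softmax_jacobian` and `numerical_jacobian` are hypothetical helper names, and the central-difference check is only a sanity test, not part of the derivation.

```python
import numpy as np

def stable_softmax(z):
    z_max = np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z - z_max)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def softmax_jacobian(z):
    """J = diag(sigma) - sigma sigma^T for a 1-D logit vector z."""
    s = stable_softmax(z)
    return np.diag(s) - np.outer(s, s)

def numerical_jacobian(f, z, eps=1e-6):
    """Central-difference approximation; column j holds d f / d z_j."""
    k = z.shape[0]
    jac = np.zeros((k, k))
    for j in range(k):
        step = np.zeros(k)
        step[j] = eps
        jac[:, j] = (f(z + step) - f(z - step)) / (2.0 * eps)
    return jac

z = np.array([2.0, 1.0, 0.1])
J = softmax_jacobian(z)
J_num = numerical_jacobian(stable_softmax, z)

print(J)
print("Matches finite differences:", np.allclose(J, J_num, atol=1e-7))
# Each row sums to zero because the outputs always sum to 1.
print("Row sums:", J.sum(axis=1))
```

The Jacobian is symmetric, its diagonal entries are $\sigma_i (1 - \sigma_i)$, and its off-diagonal entries are $-\sigma_i \sigma_j$, matching the case-by-case formula.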

## ✅ Summary
- Softmax maps arbitrary real values to probabilities.
- It is commonly used in classification problems.
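- Always use the numerically stable version in real implementations: subtracting the maximum logit leaves the output unchanged but prevents overflow in the exponentials.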