# 🧠 Softmax Activation Function and Classifier
This notebook explains the softmax function, its mathematical formulation, and implements a softmax classifier from scratch using NumPy.

## 📐 Softmax Formula
Given a vector $\mathbf{z} \in \mathbb{R}^K$, the softmax function is defined as:

$$
\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$$

- Outputs probabilities that sum to 1.
- Used in the output layer of neural networks for multiclass classification.

## 💡 Intuition
- Converts raw logits into a probability distribution.
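- Exponentiation amplifies differences between logits, so the largest logit receives the largest share of the probability mass.
- Softmax is unchanged when the same constant is added to every logit, so implementations subtract the max logit before exponentiating. This prevents overflow in `exp` without altering the result.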

In [None]:
import numpy as np

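def softmax(z):
    # Shift each row by its max for numerical stability. Building a new
    # array (instead of `z -= ...`) leaves the caller's input untouched.
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)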

# Example
z = np.array([[2.0, 1.0, 0.1]])
print("Softmax Output:", softmax(z))
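
To illustrate why the max logit is subtracted, the following sketch compares a naive softmax with a shifted one on very large logits. The names `naive_softmax` and `stable_softmax` and the test values are assumptions chosen for illustration, not part of the notebook's classifier.

```python
import numpy as np

def naive_softmax(z):
    # Exponentiates the raw logits directly; overflows for large values.
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def stable_softmax(z):
    # Shifts each row so its largest entry is 0 before exponentiating.
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

big = np.array([[1000.0, 999.0, 998.0]])

# exp(1000) overflows to inf in float64, and inf / inf evaluates to nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive_out = naive_softmax(big)
stable_out = stable_softmax(big)

# Adding a constant to every logit should not change the output.
small = np.array([[2.0, 1.0, 0.1]])
shift_invariant = np.allclose(stable_softmax(small), stable_softmax(small + 100.0))

print("Naive: ", naive_out)
print("Stable:", stable_out)
print("Shift invariant:", shift_invariant)
```

The naive version returns `nan` entries for these logits, while the shifted version still returns a valid distribution.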

## 🔁 Gradient and Training
We use **cross-entropy loss** combined with softmax for training.
For $m$ samples with true labels $y_i \in \{0,1,\dots,K-1\}$ and predicted probabilities $\hat{Y} \in \mathbb{R}^{m \times K}$, the loss is:

$$
\mathcal{L} = -\frac{1}{m} \sum_{i=1}^m \log(\hat{Y}_{i, y_i})
$$
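
As a small sketch of this formula, the loss can be computed by picking the predicted probability of the true class in each row. The probabilities and labels below are made-up values for illustration only.

```python
import numpy as np

# Hypothetical predicted probabilities for m = 3 samples and K = 3 classes.
Y_hat = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])
y = np.array([0, 1, 2])  # integer class labels, one per sample

m = Y_hat.shape[0]

# Integer-array indexing selects Y_hat[i, y[i]] for every row i.
true_class_probs = Y_hat[np.arange(m), y]

# Mean negative log-likelihood of the true classes.
loss = -np.mean(np.log(true_class_probs))
print("Cross-entropy loss:", loss)
```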

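### Gradient of the loss w.r.t. logits:
$$
\frac{\partial \mathcal{L}}{\partial Z} = \frac{1}{m}\left(\hat{Y} - Y_{\text{true}}\right)
$$
where $Y_{\text{true}}$ is the one-hot encoding of the labels $y$, and the factor $\frac{1}{m}$ comes from averaging the loss over the $m$ samples.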

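We use this gradient in gradient descent. With logits $Z = XW + b$, the chain rule gives $\frac{\partial \mathcal{L}}{\partial W} = X^\top \frac{\partial \mathcal{L}}{\partial Z}$, and $\frac{\partial \mathcal{L}}{\partial b}$ is the column-wise sum of $\frac{\partial \mathcal{L}}{\partial Z}$. Each step subtracts the learning rate times these gradients from the weights and bias.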
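
One way to build confidence in the gradient with respect to the logits is a finite-difference check. The sketch below compares the analytic form, including the $\frac{1}{m}$ factor from averaging the loss, against central differences on random logits. The helper names (`cross_entropy`, `analytic_grad`, `numeric_grad`) and the random test data are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def cross_entropy(Z, y):
    # Mean negative log-probability of the true class.
    m = Z.shape[0]
    probs = softmax(Z)
    return -np.mean(np.log(probs[np.arange(m), y]))

def analytic_grad(Z, y):
    # (softmax(Z) - one_hot(y)) / m, without building the one-hot matrix.
    m = Z.shape[0]
    grad = softmax(Z)
    grad[np.arange(m), y] -= 1.0
    return grad / m

def numeric_grad(Z, y, eps=1e-6):
    # Central finite differences, one logit at a time.
    grad = np.zeros_like(Z)
    for idx in np.ndindex(*Z.shape):
        Z_plus, Z_minus = Z.copy(), Z.copy()
        Z_plus[idx] += eps
        Z_minus[idx] -= eps
        grad[idx] = (cross_entropy(Z_plus, y) - cross_entropy(Z_minus, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 3))   # 4 samples, 3 classes
y = np.array([0, 2, 1, 2])

max_abs_diff = np.max(np.abs(analytic_grad(Z, y) - numeric_grad(Z, y)))
print("Max abs difference:", max_abs_diff)
```

A difference close to zero indicates that the analytic and numerical gradients agree.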

In [None]:
class SoftmaxClassifier:
    def __init__(self, lr=0.1, n_iter=1000):
        self.lr = lr
        self.n_iter = n_iter

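    def _softmax(self, z):
        # Shift each row by its max for numerical stability, without
        # modifying the input array in place.
        shifted = z - np.max(z, axis=1, keepdims=True)
        exp_z = np.exp(shifted)
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)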

    def fit(self, X, y):
        # X: (m, n) feature matrix; y: (m,) integer labels in {0, ..., K-1}
        m, n = X.shape
        self.num_classes = int(np.max(y)) + 1
        self.weights = np.zeros((n, self.num_classes))
        self.bias = np.zeros((1, self.num_classes))

        for _ in range(self.n_iter):
            logits = np.dot(X, self.weights) + self.bias
            probs = self._softmax(logits)

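            # Gradient w.r.t. logits: (probs - one_hot(y)) / m.
            # Subtracting 1 at each true-class index is equivalent to
            # subtracting the one-hot matrix, without building it.
            probs[np.arange(m), y] -= 1
            probs /= m

            # Chain rule through logits = X @ weights + bias
            dw = np.dot(X.T, probs)
            db = np.sum(probs, axis=0, keepdims=True)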

            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        logits = np.dot(X, self.weights) + self.bias
        probs = self._softmax(logits)
        return np.argmax(probs, axis=1)
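
A possible usage sketch follows. The class is repeated in compact form only so that the block runs on its own, and the three-blob toy dataset, the seed, and the hyperparameters are assumptions for illustration rather than part of the original notebook.

```python
import numpy as np

class SoftmaxClassifier:
    """Compact copy of the classifier above so this block is standalone."""

    def __init__(self, lr=0.1, n_iter=1000):
        self.lr = lr
        self.n_iter = n_iter

    def _softmax(self, z):
        shifted = z - np.max(z, axis=1, keepdims=True)
        exp_z = np.exp(shifted)
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)

    def fit(self, X, y):
        m, n = X.shape
        self.num_classes = int(np.max(y)) + 1
        self.weights = np.zeros((n, self.num_classes))
        self.bias = np.zeros((1, self.num_classes))
        for _ in range(self.n_iter):
            # Gradient w.r.t. logits: (probs - one_hot(y)) / m
            grad = self._softmax(X @ self.weights + self.bias)
            grad[np.arange(m), y] -= 1
            grad /= m
            self.weights -= self.lr * (X.T @ grad)
            self.bias -= self.lr * grad.sum(axis=0, keepdims=True)

    def predict(self, X):
        return np.argmax(X @ self.weights + self.bias, axis=1)

# Hypothetical toy data: three well-separated 2-D Gaussian blobs, 50 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 5.0], [-4.0, -3.0], [4.0, -3.0]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])
y = np.repeat(np.arange(3), 50)  # integer labels 0, 1, 2

clf = SoftmaxClassifier(lr=0.1, n_iter=200)
clf.fit(X, y)

accuracy = np.mean(clf.predict(X) == y)
print("Training accuracy:", accuracy)
```

Note that `y` must be an integer array, since it is used both to infer the number of classes and to index the probability matrix.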

## ✅ Summary
- Softmax converts logits to probabilities.
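- Combined with softmax, cross-entropy loss yields a simple gradient with respect to the logits: the predicted probabilities minus the one-hot labels, averaged over the batch.
- The classifier computes this gradient by subtracting 1 from the predicted probability of each true class, so the one-hot matrix never has to be built explicitly.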