# Softmax Regression from Scratch

## 1. **Intuitive Overview**

Softmax Regression, also known as **Multinomial Logistic Regression**, is a generalization of logistic regression for multi-class classification problems. While logistic regression is used for binary classification, softmax regression can handle cases where the target variable can take on more than two classes.

**Analogy:**  
Imagine you are at an ice cream shop with three flavors: vanilla, chocolate, and strawberry. Given your preferences (features), softmax regression helps estimate the probability that you will choose each flavor.

---

## 2. **Mathematical Foundations**

### **2.1. Model Formulation**

Given an input vector $\mathbf{x} \in \mathbb{R}^d$ and $K$ possible classes, softmax regression models the probability that $\mathbf{x}$ belongs to class $k$ as:

$$
P(y = k \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}{\sum_{j=1}^K \exp(\mathbf{w}_j^\top \mathbf{x} + b_j)}
$$

- $\mathbf{w}_k$ is the weight vector for class $k$
- $b_k$ is the bias for class $k$

The denominator ensures that the probabilities sum to 1 across all classes.
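
As a minimal sketch of the formula above, the snippet below computes the class probabilities for a single input with NumPy. The parameter values and the helper name `class_probabilities` are made up for illustration; they do not come from a trained model.

```python
import numpy as np

def class_probabilities(x, W, b):
    """Return P(y = k | x) for every class k.

    x: (d,) input vector
    W: (K, d) matrix whose row k is the weight vector w_k
    b: (K,) vector of biases b_k
    """
    scores = W @ x + b                    # w_k^T x + b_k for each class k
    exp_scores = np.exp(scores)           # numerators of the formula
    return exp_scores / exp_scores.sum()  # divide by the shared denominator

# Hypothetical parameters: K = 3 classes, d = 2 features
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.8],
              [-1.0,  0.3]])
b = np.array([0.1, 0.0, -0.1])
x = np.array([0.5, 1.5])

probs = class_probabilities(x, W, b)
print(probs, probs.sum())  # three positive numbers that sum to 1
```

With these values the second class has the largest score, so it also receives the largest probability.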

---

### **2.2. The Softmax Function**

The **softmax function** transforms a vector of real numbers into a probability distribution:

$$
\text{softmax}(\mathbf{z})_k = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)}
$$

where $\mathbf{z} = [z_1, z_2, \dots, z_K]$.

**Key Properties:**
- All outputs are in $(0, 1)$
- Outputs sum to 1
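
One practical detail when coding this: `exp` overflows for large inputs. A common workaround, sketched below, is to subtract the largest entry of $\mathbf{z}$ before exponentiating. This leaves the result unchanged because the common factor cancels between numerator and denominator. The function name `softmax` and the test vectors are illustrative choices.

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis, computed in a numerically stable way."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)  # largest entry becomes 0
    exp_z = np.exp(shifted)                      # every value now lies in (0, 1]
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

p = softmax([2.0, 1.0, 0.0])
print(p)        # roughly 0.665, 0.245, 0.090
print(p.sum())  # 1 up to floating-point rounding

# Scores this large would overflow a naive np.exp(z); the shifted version is fine.
big = softmax([1000.0, 1001.0, 1002.0])
print(big)      # same values as p, in reverse order
```

Both key properties can be checked directly: every output is strictly between 0 and 1, and the outputs sum to 1.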

---

### **2.3. Loss Function: Cross-Entropy**

For a dataset with $N$ samples, the **cross-entropy loss** is:

$$
L = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K y_{ik} \log P(y = k \mid \mathbf{x}_i)
$$

where $y_{ik}$ is 1 if sample $i$ belongs to class $k$, else 0.
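
The loss can be sketched in a few lines of NumPy. The probabilities below are hand-picked for illustration, and the `eps` clipping is an added safeguard against `log(0)` rather than part of the formula.

```python
import numpy as np

def cross_entropy(probs, y_onehot, eps=1e-12):
    """Mean cross-entropy loss.

    probs:    (N, K) predicted probabilities, each row summing to 1
    y_onehot: (N, K) one-hot labels, y_onehot[i, k] = y_ik
    eps:      assumed safeguard so that log never receives exactly 0
    """
    log_p = np.log(np.clip(probs, eps, 1.0))
    return -np.mean(np.sum(y_onehot * log_p, axis=1))

# Three samples, three classes (made-up predictions)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])   # true class index for each sample
y_onehot = np.eye(3)[labels]   # convert indices to one-hot rows

loss = cross_entropy(probs, y_onehot)
print(loss)  # about 0.4987, i.e. -(ln 0.7 + ln 0.8 + ln 0.4) / 3
```

Because $y_{ik}$ is one-hot, only the log-probability of the true class contributes for each sample.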

---

### **2.4. Gradient Derivation**

Let’s derive the gradient for parameter $\mathbf{w}_k$:

$$
\frac{\partial L}{\partial \mathbf{w}_k} = -\frac{1}{N} \sum_{i=1}^N \left[ y_{ik} - P(y = k \mid \mathbf{x}_i) \right] \mathbf{x}_i
$$

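**Proof Sketch:**
- Write $z_{ik} = \mathbf{w}_k^\top \mathbf{x}_i + b_k$ and $p_{ik} = P(y = k \mid \mathbf{x}_i)$. Differentiating the log of the softmax gives $\partial \log p_{ij} / \partial z_{ik} = \mathbb{1}[j = k] - p_{ik}$.
- Since $\sum_j y_{ij} = 1$, the per-sample loss $-\sum_j y_{ij} \log p_{ij}$ has derivative $p_{ik} - y_{ik}$ with respect to $z_{ik}$. Applying the chain rule with $\partial z_{ik} / \partial \mathbf{w}_k = \mathbf{x}_i$ and averaging over samples gives the formula above.
- The loss is differentiable and convex in the parameters, so any local minimum is global. The gradient points in the direction of steepest increase, so gradient descent steps along the *negative* gradient to reduce the loss.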
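
A quick way to build confidence in the gradient formula is to compare it with a finite-difference estimate. The sketch below does this on small random data. The function names, array shapes, and the bias gradient (the same expression with $\mathbf{x}_i$ replaced by 1) are assumptions of this example, not definitions from the text.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def loss_and_grads(X, Y, W, b):
    """X: (N, d), Y: (N, K) one-hot, W: (d, K) with column k = w_k, b: (K,)."""
    N = X.shape[0]
    P = softmax_rows(X @ W + b)               # P[i, k] = P(y = k | x_i)
    loss = -np.mean(np.sum(Y * np.log(P), axis=1))
    grad_W = -(X.T @ (Y - P)) / N             # column k is dL/dw_k
    grad_b = -(Y - P).sum(axis=0) / N         # assumed analogue for the biases
    return loss, grad_W, grad_b

rng = np.random.default_rng(0)
N, d, K = 5, 3, 4
X = rng.normal(size=(N, d))
Y = np.eye(K)[rng.integers(0, K, size=N)]     # random one-hot labels
W = rng.normal(size=(d, K))
b = rng.normal(size=K)

loss, grad_W, grad_b = loss_and_grads(X, Y, W, b)

# Central finite difference on a single weight entry
h = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += h
W_minus[1, 2] -= h
numeric = (loss_and_grads(X, Y, W_plus, b)[0]
           - loss_and_grads(X, Y, W_minus, b)[0]) / (2 * h)

print(abs(numeric - grad_W[1, 2]))  # should be tiny if the formula is right
```

The same check can be repeated for any other entry of `W` or `b` by changing the perturbed index.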

---

## 3. **Why Softmax?**

- **Generalizes logistic regression** to multi-class problems.
- **Probabilistic interpretation:** Outputs can be interpreted as class probabilities.
- **Differentiable:** Suitable for gradient-based optimization.
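
To make the first point concrete: with $K = 2$, the softmax probability of the first class equals the sigmoid of the score difference, since $e^{z_1} / (e^{z_1} + e^{z_2}) = 1 / (1 + e^{-(z_1 - z_2)})$. The short check below uses arbitrary scores and helper names chosen for this example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(t):
    """Logistic sigmoid used in binary logistic regression."""
    return 1.0 / (1.0 + np.exp(-t))

z1, z2 = 1.3, -0.4                  # arbitrary scores for two classes
p_softmax = softmax([z1, z2])[0]    # P(class 1) from a two-class softmax
p_sigmoid = sigmoid(z1 - z2)        # sigmoid of the score difference

print(p_softmax, p_sigmoid)         # the two values agree up to rounding
```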

---

## 4. **Real-World Analogy**

Think of softmax as a "voting system" where each class assigns a score to the input. The softmax function converts these scores into probabilities, much like how a group of friends might each rate a restaurant, and you use their ratings to decide where to eat.

---

## 5. **Modern Applications in AI/ML**

- **Image Classification:** Assigning labels to images (e.g., cats, dogs, cars).
- **Natural Language Processing:** Predicting the next word in a sentence (language modeling).
- **Recommender Systems:** Ranking items for users.
- **Neural Networks:** The final layer of many neural networks for classification uses softmax.

---

## 6. **Summary Table**

| Aspect                | Logistic Regression (Binary) | Softmax Regression (Multi-class) |
|-----------------------|------------------------------|-----------------------------------|
| Output                | Probability (2 classes)      | Probability distribution (K classes) |
| Activation Function   | Sigmoid                      | Softmax                           |
| Loss Function         | Binary Cross-Entropy         | Categorical Cross-Entropy         |

---

**In summary:**  
Softmax regression is a foundational tool for multi-class classification, combining clear probabilistic interpretation, mathematical elegance, and practical utility in modern AI systems. Understanding its derivation and intuition is essential for mastering machine learning.

## Implementing From Scratch

We will follow these steps to implement softmax regression from scratch:
1. Generate synthetic data for 3 classes
2. Build softmax function
3. Compute cross-entropy loss
4. Calculate gradients
5. Train with gradient descent
6. Visualize accuracy

### **1. Generating Synthetic Data**

```python
import numpy as np
import matplotlib.pyplot as plt
```