# 4.1. Softmax Regression

**Regression**: How much? or How many?

**Classification**: Which category does this belong to?

Examples of classification:
- Email: spam or not spam
- Image: cat, dog, or bird
- Handwriting: digit 0-9
- Recommendation: which movie to watch next

## 1. Classification

Suppose we have a collection of 2x2 pixel images (grayscale), and we want to classify them into 3 categories: dog, cat, or chicken.

- Features: $\mathbf{x}=(x_1, x_2, x_3, x_4)$ (2x2 pixel values)
- Labels: $y \in \{\text{dog}, \text{cat}, \text{chicken}\}$

How to represent the labels?
1. $y \in \{0, 1, 2\}$: dog=0, cat=1, chicken=2
   - This is called **integer encoding**
   - This is not a good representation for classification
   - It imposes an artificial order (dog < cat < chicken) that the model may try to exploit
   - It is appropriate when the categories are naturally ordered, as in ordinal regression (e.g., rating 1-5)
2. **One-hot encoding**: $y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}$
   - dog=(1, 0, 0), cat=(0, 1, 0), chicken=(0, 0, 1)
   - This is a better representation for classification
   - Each category is represented by a vector with one element as 1 and the rest as 0
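
As a small illustration of the one-hot encoding above, the sketch below maps a class name to its vector. The class order in `CLASSES` and the helper name `one_hot` are assumptions made for this example.

```python
# Assumed class order for this example: dog=0, cat=1, chicken=2.
CLASSES = ["dog", "cat", "chicken"]

def one_hot(label, classes=CLASSES):
    """Return a list with 1 at the label's index and 0 everywhere else."""
    index = classes.index(label)
    return [1 if i == index else 0 for i in range(len(classes))]

print(one_hot("cat"))  # [0, 1, 0]
```

Unlike the integer encoding, no vector is "larger" than another, so no order between the classes is implied.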


### 1.1. Linear Model
![Softmax regression model](https://d2l.ai/_images/softmaxreg.svg)

*Figure 1: Visualization of the softmax regression model.*

$$\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$$

- $\mathbf{o} \in \mathbb{R}^{3}$: output vector (size: 3x1)
- $\mathbf{W} \in \mathbb{R}^{3\times4}$: weight matrix (size: 3x4)
- $\mathbf{b} \in \mathbb{R}^{3}$: bias vector (size: 3x1)
- $\mathbf{x} \in \mathbb{R}^{4}$: input vector (size: 4x1)
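
To make the shapes concrete, here is a minimal sketch of $\mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}$ in plain Python. The numbers in `x`, `W`, and `b` are made up for illustration, and `linear` is a hypothetical helper name.

```python
def linear(W, x, b):
    """Compute o = W x + b, where W is given as a list of q rows of length d."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# Hypothetical values: 4 pixel intensities, 3 classes.
x = [1.0, 0.0, 0.5, 0.5]
W = [[0.5, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, -1.0]]
b = [0.0, 0.25, -1.0]

o = linear(W, x, b)
print(o)  # one output per class: [0.5, 0.25, -1.0]
```

Note that the third output is negative, so the raw outputs cannot be read as probabilities.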


### 1.2. The Softmax Function


We could train the linear model and treat the outputs $\mathbf{o}$ directly as class probabilities.

However, this is not a good idea: **the outputs can be negative or greater than 1**, and nothing forces them to sum to 1, so they do not form a valid probability distribution.

We need to convert the outputs into a **probability distribution**: every entry in $[0, 1]$ and all entries summing to 1.

#### Probit model (obsolete)

- Label = Output + Gaussian noise
- $\mathbf{y}=\mathbf{o}+\epsilon$, where $\epsilon$ is Gaussian noise.
- Works less well in practice than softmax and does not lead to as convenient an optimization problem.

#### Softmax function
- Idea: $P(y=i)\propto \exp(o_i)$ and normalize it.

$$
\mathbf{\hat{y}} = \text{softmax}(\mathbf{o})
\quad \text{where} \quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j\exp(o_j)}
$$

- Then, we can interpret $\hat{y}_i$ as the probability of class $i$.

- Also, the softmax preserves the order of the output.
  - If $o_i > o_j$, then $\hat{y}_i > \hat{y}_j$.
  - If $o_i < o_j$, then $\hat{y}_i < \hat{y}_j$.
  - If $o_i = o_j$, then $\hat{y}_i = \hat{y}_j$.
  - Hence $\operatorname*{argmax}_i \hat{y}_i = \operatorname*{argmax}_i o_i$.

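- The idea of softmax comes from the **Boltzmann distribution** in statistical physics.
- One may also introduce a temperature parameter $T$, as in $\hat{y}_i \propto \exp(o_i / T)$, to control the sharpness of the distribution: a small $T$ concentrates the probability on the largest output, while a large $T$ pushes it toward uniform.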
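
The properties listed above can be checked with a short sketch. Two things here are assumptions rather than part of the notes: the temperature is applied in the usual form $\exp(o_i / T)$, and the maximum is subtracted before exponentiating, a common numerical-stability trick that leaves the result unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(o, T=1.0):
    """Softmax of the outputs o with temperature T (T=1 gives the plain softmax)."""
    scaled = [o_i / T for o_i in o]
    m = max(scaled)                            # shift by the max to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

o = [0.5, 0.25, -1.0]                          # hypothetical outputs
y_hat = softmax(o)

print(sum(y_hat))                              # close to 1, up to rounding
print(y_hat.index(max(y_hat)) == o.index(max(o)))  # True: the argmax is preserved

sharp = softmax(o, T=0.1)                      # low temperature: mass on the largest output
flat = softmax(o, T=10.0)                      # high temperature: closer to uniform
print(max(sharp) > max(y_hat) > max(flat))     # True
```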

### 1.3. Vectorization of Softmax

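- For a minibatch of $n$ examples with $d$ features and $q$ classes, the model can be vectorized as follows:

$$
\begin{aligned}
& \mathbf{O} = \mathbf{X} \mathbf{W} + \mathbf{b} \\
& \mathbf{\hat{Y}} = \text{softmax}(\mathbf{O})
\end{aligned}
$$

- $\mathbf{O} \in \mathbb{R}^{n\times q}$: output matrix
- $\mathbf{X} \in \mathbb{R}^{n\times d}$: input matrix (one example per row)
- $\mathbf{W} \in \mathbb{R}^{d\times q}$: weight matrix (transposed relative to Section 1.1, since examples are now rows)
- $\mathbf{b} \in \mathbb{R}^{q}$: bias vector (broadcast across the $n$ rows)
- The softmax is applied to each row of $\mathbf{O}$ separately.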
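
The following sketch spells out the minibatch computation with plain nested lists so that the shapes are explicit. In practice an array library would do the matrix product in one call. The data and the helper name `forward` are hypothetical.

```python
import math

def softmax(o):
    """Softmax of one row of outputs (max subtracted for numerical stability)."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def forward(X, W, b):
    """Return softmax(XW + b) for X (n x d), W (d x q), b (length q)."""
    q = len(b)
    O = [[sum(x_k * W[k][j] for k, x_k in enumerate(x)) + b[j] for j in range(q)]
         for x in X]
    return [softmax(row) for row in O]  # softmax is applied row by row

# Hypothetical minibatch: n = 2 images, d = 4 pixels, q = 3 classes.
X = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.0, 1.0]]
W = [[0.5, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, -1.0]]
b = [0.0, 0.25, -1.0]

Y_hat = forward(X, W, b)
print(len(Y_hat), len(Y_hat[0]))  # 2 3
```

Each row of `Y_hat` is a probability distribution over the classes for one example.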

## 2. Loss Function

Our model (a linear map followed by softmax) maps features $\mathbf{x}$ to probabilities $\hat{\mathbf{y}}$.

$\mathbf{\hat{y}}$ can be interpreted as the estimated conditional probability of each class given the input $\mathbf{x}$:

$$
\hat{y}_i = \frac{\exp(o_i)}{\sum_j\exp(o_j)} = P(y=i|\mathbf{x}) 
$$

If we use one-hot encoding for the labels, we can represent the label as a vector $\mathbf{y} \in \mathbb{R}^{q}$, where $y_j=1$ if $j$ is the true class and $y_j=0$ otherwise.

We can train the model by maximizing the likelihood function:

$$
P(\mathbf{Y}|\mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)}|\mathbf{x}^{(i)})
$$

(Assuming independence: each sample is independently drawn from the same distribution.)

Maximizing the likelihood is equivalent to minimizing the negative **log-likelihood**:

$$
-\log P(\mathbf{Y}|\mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)}|\mathbf{x}^{(i)})
= \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)})
$$


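How do we compute $l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)})$? Because $\mathbf{y}$ is one-hot, $P(\mathbf{y}|\mathbf{x}) = \hat{y}_c$ for the true class $c$, so $-\log P(\mathbf{y}|\mathbf{x}) = -\log \hat{y}_c$.

Writing this as a sum over all classes gives the **cross-entropy loss**: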

$$l(\mathbf{y}, \hat{\mathbf{y}})=-\sum_{j=1}^q y_j \log(\hat{y}_j)$$

- Cross entropy loss is a measure of the difference between two probability distributions: the true distribution $\mathbf{y}$ and the predicted distribution $\hat{\mathbf{y}}$.
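
A minimal sketch of the cross-entropy formula for a single example follows. The label and the predicted probabilities are made-up values, and terms with $y_j = 0$ are skipped so that a zero probability on a wrong class does not trigger `log(0)`.

```python
import math

def cross_entropy(y, y_hat):
    """l(y, y_hat) = -sum_j y_j * log(y_hat_j); terms with y_j = 0 contribute nothing."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j != 0)

y = [0, 1, 0]                      # one-hot label: the true class is index 1
y_hat = [0.2, 0.7, 0.1]            # hypothetical predicted probabilities

loss = cross_entropy(y, y_hat)     # equals -log(0.7)
worse = cross_entropy(y, [0.9, 0.05, 0.05])  # confident but wrong prediction
print(loss < worse)                # True
```

With a one-hot label, only the probability assigned to the true class matters, and the loss grows as that probability shrinks.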

- Substituting the softmax into the loss, and using $\sum_j y_j = 1$ (the label is one-hot) in the last step:

$$
\begin{aligned}
l(\mathbf{y}, \hat{\mathbf{y}}) &= -\sum_{j=1}^q y_j \log(\hat{y}_j) \\
&= -\sum_{j=1}^{q} y_j \log(\frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)}) \\
&= \sum_{j=1}^{q} y_j \log(\sum_{k=1}^q \exp(o_k)) - \sum_{j=1}^{q} y_j o_j \\
&= \log(\sum_{k=1}^q \exp(o_k)) - \sum_{j=1}^{q} y_j o_j \\
\end{aligned}
$$
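
The last line of the derivation lets us compute the loss directly from the outputs $\mathbf{o}$, without forming $\hat{\mathbf{y}}$ first. The sketch below compares the two routes on made-up values. Shifting by the maximum inside the log-sum-exp is an added numerical-stability assumption, not part of the derivation.

```python
import math

def loss_from_outputs(y, o):
    """log(sum_k exp(o_k)) - sum_j y_j * o_j, with the max shifted out for stability."""
    m = max(o)
    log_sum_exp = m + math.log(sum(math.exp(o_k - m) for o_k in o))
    return log_sum_exp - sum(y_j * o_j for y_j, o_j in zip(y, o))

def loss_via_softmax(y, o):
    """Same loss computed the long way: softmax first, then cross-entropy."""
    total = sum(math.exp(o_k) for o_k in o)
    y_hat = [math.exp(o_k) / total for o_k in o]
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j != 0)

y = [0, 1, 0]                      # hypothetical one-hot label
o = [0.5, 0.25, -1.0]              # hypothetical outputs

gap = abs(loss_from_outputs(y, o) - loss_via_softmax(y, o))
print(gap < 1e-12)                 # True: both routes agree

# The shifted form also handles large outputs, where exp(1000) would overflow.
big = loss_from_outputs([1, 0, 0], [1000.0, 0.0, -1000.0])
```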

The gradient is

$$
\begin{aligned}
\frac{\partial l(\mathbf{y}, \hat{\mathbf{y}})}{\partial o_j} &= \frac{\partial}{\partial o_j} \left( \log(\sum_{k=1}^q \exp(o_k)) - \sum_{k=1}^{q} y_k o_k \right) \\
&= \frac{\partial}{\partial o_j} \log(\sum_{k=1}^q \exp(o_k)) - y_j \\
&= \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j \\
&= \hat{y}_j - y_j \\
&= \text{softmax}(\mathbf{o})_j - y_j
\end{aligned}
$$

The derivative of the loss with respect to the output $o_j$ is simply the difference between the predicted probability $\hat{y}_j$ and the true label $y_j$.
(This is not specific to softmax: the same form appears in the gradient of the log-likelihood of any exponential family model.)
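
One way to sanity-check the result $\partial l / \partial o_j = \hat{y}_j - y_j$ is to compare it against a central finite-difference approximation. The sketch below does this on made-up values. The helper names and the step size `eps` are choices made for this example.

```python
import math

def softmax(o):
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def loss(y, o):
    """Cross-entropy of softmax(o) against the one-hot label y."""
    m = max(o)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in o))
    return log_sum_exp - sum(y_j * o_j for y_j, o_j in zip(y, o))

def analytic_grad(y, o):
    """Gradient from the derivation: softmax(o)_j - y_j."""
    return [p_j - y_j for p_j, y_j in zip(softmax(o), y)]

def numeric_grad(y, o, eps=1e-6):
    """Central finite differences, perturbing one output at a time."""
    grad = []
    for j in range(len(o)):
        up = o[:j] + [o[j] + eps] + o[j + 1:]
        down = o[:j] + [o[j] - eps] + o[j + 1:]
        grad.append((loss(y, up) - loss(y, down)) / (2 * eps))
    return grad

y = [0, 1, 0]                      # hypothetical one-hot label
o = [0.5, 0.25, -1.0]              # hypothetical outputs

max_gap = max(abs(a - n) for a, n in zip(analytic_grad(y, o), numeric_grad(y, o)))
print(max_gap < 1e-6)              # True: the two gradients agree
```

The gradient entries sum to zero, since both $\hat{\mathbf{y}}$ and $\mathbf{y}$ sum to 1.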
