# Softmax Regression

### Basics
- Assume 2x2 images, so each input has 4 features ($x_1, \dots, x_4$)
- *One-hot encoding*: a vector with as many components as there are categories. The component corresponding to a particular instance's category is set to 1 and all others are set to 0. Here the label $y$ has 3 categories: cat, chicken, dog

$$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}.$$
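
As a small illustration of the encoding above, here is a minimal sketch in plain Python. The helper name `one_hot` and the constant `CATEGORIES` are made up for this example; the category order (cat, chicken, dog) follows the text.

```python
# Category order follows the notes: cat, chicken, dog.
CATEGORIES = ["cat", "chicken", "dog"]

def one_hot(label, categories=CATEGORIES):
    """Return a list with 1 at the label's position and 0 everywhere else."""
    index = categories.index(label)  # raises ValueError for an unknown label
    return [1 if i == index else 0 for i in range(len(categories))]

print(one_hot("cat"))      # [1, 0, 0]
print(one_hot("chicken"))  # [0, 1, 0]
print(one_hot("dog"))      # [0, 0, 1]
```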

- We need a model with multiple outputs -> **one per class**
- 4 features and 3 categories require $3 \times 4 = 12$ weight scalars and 3 bias scalars (one per output):
$$
\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
\end{aligned}
$$
- Fully connected layer (every output depends on every input):
![softmax](softmaxreg.svg)
- The model can be written more compactly:
$\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$
- - $\mathbf{W}$: 3x4 matrix
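
The compact form $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$ can be sketched in plain Python as follows. The function name `logits` and all numeric values are hypothetical and chosen only so the arithmetic is easy to follow; each row of `W` holds the four weights of one output.

```python
def logits(W, x, b):
    """Compute o = W x + b, where W has one row per class."""
    return [
        sum(w_jk * x_k for w_jk, x_k in zip(row, x)) + b_j
        for row, b_j in zip(W, b)
    ]

# Hypothetical values for illustration only.
x = [1.0, 2.0, 3.0, 4.0]        # 4 features (a flattened 2x2 image)
W = [
    [1.0, 0.0, 0.0, 0.0],       # weights w_11 .. w_14 for o_1
    [0.0, 1.0, 0.0, 0.0],       # weights w_21 .. w_24 for o_2
    [0.0, 0.0, 1.0, 1.0],       # weights w_31 .. w_34 for o_3
]                               # 3x4 matrix: 12 weight scalars
b = [0.0, 0.5, -0.5]            # 3 bias scalars

o = logits(W, x, b)
print(o)  # [1.0, 2.5, 6.5]
```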

### Approach
- We interpret the outputs as probabilities
- - They must be non-negative and sum to 1
- Optimization: maximize the likelihood
- **Softmax** does exactly this: it transforms the logits so that they become non-negative and sum to 1
$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})\quad \text{where}\quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}. $$

- So $\hat{y}_1 + \hat{y}_2 + \hat{y}_3 = 1$. Thus, $\hat{\mathbf{y}}$ is a proper probability distribution whose element values can be interpreted accordingly. In other words, each $\exp(o_j)$ is divided by the sum of all the exponentials.
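
A minimal sketch of the softmax formula in plain Python. Subtracting the largest logit before exponentiating is not part of the notes; it is a common numerical-stability trick that leaves the result unchanged, because the shift cancels between numerator and denominator. The input logits are hypothetical.

```python
import math

def softmax(o):
    """Map a list of logits to non-negative values that sum to 1."""
    m = max(o)                                # assumption: stability shift, not in the notes
    exps = [math.exp(o_j - m) for o_j in o]   # every exponent is <= 0, so no overflow
    total = sum(exps)
    return [e / total for e in exps]

y_hat = softmax([1.0, 2.0, 3.0])  # hypothetical logits
print(y_hat)       # larger logits receive larger probabilities
print(sum(y_hat))  # sums to 1 up to floating-point rounding
```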

### Vectorization of Minibatches
- Processing a whole minibatch with matrix operations improves computational efficiency
- We are given a minibatch $\mathbf{X}$ of examples with feature dimensionality $d$ (number of inputs) and batch size $n$.
- We have $q$ categories in the output
- - Then the minibatch features $\mathbf{X}$ are in $\mathbb{R}^{n \times d}$
- - weights $\mathbf{W} \in \mathbb{R}^{d \times q}$
- - bias satisfies $\mathbf{b} \in \mathbb{R}^{1\times q}$

$$ \begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b}, \\ \hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O}). \end{aligned} $$
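
The shapes above can be checked with a short NumPy sketch. The sizes `n`, `d`, `q`, the random data, and the helper name `softmax_rows` are assumptions for illustration. Note that $\mathbf{W}$ is stored as $d \times q$ here (the transpose of the 3x4 layout used for a single example), the bias row is broadcast across all $n$ rows, and softmax is applied to each row separately.

```python
import numpy as np

def softmax_rows(O):
    """Apply softmax independently to each row of O (shape n x q)."""
    shifted = O - O.max(axis=1, keepdims=True)   # assumption: per-row stability shift
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

n, d, q = 5, 4, 3                    # hypothetical batch size, features, categories
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))          # minibatch features, n x d
W = rng.normal(size=(d, q))          # weights, d x q
b = np.zeros((1, q))                 # bias, 1 x q, broadcast over the n rows

O = X @ W + b                        # logits, n x q
Y_hat = softmax_rows(O)              # predicted probabilities, n x q

print(O.shape, Y_hat.shape)          # (5, 3) (5, 3)
print(Y_hat.sum(axis=1))             # each row sums to 1 up to rounding
```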

### Loss Function (Cross Entropy Loss)
- The loss function measures the quality of our predicted probabilities.
- $\hat{\mathbf{y}}$ gives us a vector of **conditional** probabilities of each class given any input $\mathbf{x}$.
- - $\hat{y}_1 = P(y=\text{cat} \mid \mathbf{x})$
- Suppose that the entire dataset $\{\mathbf{X}, \mathbf{Y}\}$ has $n$ examples, where the example indexed by $i$ consists of a feature vector $\mathbf{x}^{(i)}$ and a one-hot label vector $\mathbf{y}^{(i)}$. We can compare the estimates with reality by checking how probable the actual classes are according to our model, given the features:

- Probability of the entire dataset (the product form assumes the examples are independent):
$$
P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}).
$$

- - $P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})$ is the probability of the label of a single example

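- According to maximum likelihood estimation, we maximize $P(\mathbf{Y} \mid \mathbf{X})$, which is equivalent to minimizing the negative log-likelihood:

$$
-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}),
$$

- where for any pair of label $\mathbf{y}$ and model prediction $\hat{\mathbf{y}}$ over $q$ classes, the loss function $l$ is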

$$ l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j. $$
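
A minimal sketch of this loss in plain Python. Because $\mathbf{y}$ is one-hot, only the term of the true class survives, so the loss reduces to $-\log \hat{y}_j$ for the correct category $j$. The function name `cross_entropy` and the probabilities are hypothetical.

```python
import math

def cross_entropy(y, y_hat):
    """l(y, y_hat) = -sum_j y_j * log(y_hat_j); terms with y_j == 0 contribute nothing."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j != 0)

y = [0, 1, 0]              # one-hot label: the true class is "chicken"
y_hat = [0.1, 0.7, 0.2]    # hypothetical predicted probabilities

loss = cross_entropy(y, y_hat)
print(loss)                # equals -log(0.7)
```

A confident correct prediction gives a loss near 0, while assigning low probability to the true class makes the loss grow without bound.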