# Softmax Regression

Linear regression produces a single continuous output, which makes it a poor fit for classification problems, where the goal is to choose among discrete categories.

## Classification Problems

### Label Representation

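Statisticians represent categorical data with one-hot encoding: a vector with one component per category, in which the component for the true category is 1 and all others are 0. For three categories: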

$$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\} $$
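As a minimal sketch of one-hot encoding, the snippet below converts integer class indices into the vectors shown above. The helper name `one_hot` and the use of NumPy are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an array of shape (len(labels), num_classes) with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes))
    # Row i gets a 1 in the column given by labels[i].
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Three examples belonging to classes 0, 2, and 1 (hypothetical labels).
Y = one_hot([0, 2, 1], num_classes=3)
print(Y)
```

Each row of `Y` is one of the three vectors in the set above.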


### Network Architecture

With multiple classes, we need a model with multiple outputs, one per category. Each output corresponds to its own linear function, so we need as many linear functions as we have outputs.

With 4 features and 3 categories, we need $4 \times 3 = 12$ scalars to represent the weights and 3 scalars to represent the biases:

$$ \begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1, \\
o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2, \\
o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3,
\end{aligned} $$

Suppose $o_1 = 0.1$, $o_2 = 10$, and $o_3 = 1$. If we treat these values as relative confidence levels that the item belongs to each category, then $o_2$ is the largest, so the model is most confident in category 2.

In vector form, with $\mathbf{W} \in \mathbb{R}^{3 \times 4}$ and $\mathbf{b} \in \mathbb{R}^{3}$:

$$ \mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b} $$
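The vector form can be checked against the three scalar equations with a short sketch. The weights, biases, and feature values below are random or made-up placeholders, not values from the notes; only the shapes (3 outputs, 4 features) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: 3 categories x 4 features.
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = np.array([1.0, 2.0, 3.0, 4.0])  # one example with 4 features

# Vector form: o = W x + b
o = W @ x + b

# The same outputs computed one linear function at a time.
o_manual = np.array(
    [sum(x[j] * W[i, j] for j in range(4)) + b[i] for i in range(3)]
)

print(o)
print(np.allclose(o, o_manual))
```

Row $i$ of `W` holds the weights $w_{i1}, \dots, w_{i4}$ of output $o_i$.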


But how do we convert these outputs into a discrete prediction? Using the raw outputs directly raises two problems:

1. The range of the outputs is not fixed, so it is difficult to judge what the values mean.
2. It is not obvious how to train the model on such outputs.


### Softmax Operation


$$ \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})\quad \text{where}\quad 
\hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}
$$

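So $\hat{y}_1 + \hat{y}_2 + \hat{y}_3 = 1$ with $0 \leq \hat{y}_i \leq 1$, which means $\hat{\mathbf{y}}$ can be interpreted as a probability distribution over the categories. The predicted class is the one with the largest probability: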

$$ \hat{\imath}({\mathbf{o}}) = \operatorname*{argmax}_i o_i = \operatorname*{argmax}_i \hat y_i $$

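The softmax operation preserves the ordering of its inputs, so the most likely class can be determined from $\mathbf{o}$ directly, without computing the softmax.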
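A minimal NumPy sketch of the softmax operation, applied to the example outputs $(0.1, 10, 1)$ used earlier. Subtracting the maximum before exponentiating is a common numerical-stability step that is not discussed in these notes; it does not change the result, because the common factor cancels between numerator and denominator.

```python
import numpy as np

def softmax(o):
    """Map a vector of outputs to non-negative values that sum to 1."""
    o = np.asarray(o, dtype=float)
    # Stability step (an addition to the notes): exp(o - max) avoids overflow.
    exp = np.exp(o - o.max())
    return exp / exp.sum()

o = np.array([0.1, 10.0, 1.0])
y_hat = softmax(o)

print(y_hat)
print(y_hat.sum())
# The index of the largest entry is the same before and after softmax.
print(np.argmax(o), np.argmax(y_hat))
```

Both `argmax` calls return index 1 (the second category), which illustrates the order-preserving property.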


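For a minibatch, stack the examples as the columns of $\mathbf{X}$. The bias $\mathbf{b}$ is added to every column, and the softmax is applied to each column of $\mathbf{O}$ separately:

$$ \begin{aligned}
\mathbf{O} & = \mathbf{W} \mathbf{X} + \mathbf{b} \\ 
\hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O})
\end{aligned} $$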
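The minibatch computation can be sketched as follows, assuming each column of `X` is one example (shape 4 × n) so that the product $\mathbf{W}\mathbf{X}$ is defined. The batch size, the random values, and the helper name `softmax_columns` are hypothetical.

```python
import numpy as np

def softmax_columns(O):
    """Apply softmax independently to each column of O."""
    # Stability step (an addition to the notes): subtract each column's maximum.
    exp = np.exp(O - O.max(axis=0, keepdims=True))
    return exp / exp.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
n = 5                          # hypothetical batch size
W = rng.normal(size=(3, 4))    # 3 categories x 4 features
b = rng.normal(size=(3, 1))    # column vector, broadcast across the batch
X = rng.normal(size=(4, n))    # one example per column

O = W @ X + b                  # shape (3, n)
Y_hat = softmax_columns(O)     # each column sums to 1

print(Y_hat.shape)
print(Y_hat.sum(axis=0))
```

Many libraries store one example per row instead, in which case the same computation is written with the transposed matrices.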





## Loss Function

We need a loss function that measures how close the predicted probability distribution is to the true distribution.

We use the cross-entropy loss, which measures the difference between two distributions.

### Log-Likelihood

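Maximum likelihood estimation chooses the parameters that maximize the probability of the observed labels given the features. Assuming the examples are independent, this probability factorizes: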

$$ p(Y|X) = \prod_{i=1}^n p(y^{(i)}|x^{(i)}) $$

Maximizing this product is equivalent to minimizing its negative logarithm, which turns the product into a sum:

$$ -\log p(Y|X) = \sum_{i=1}^n -\log p(y^{(i)}|x^{(i)})$$
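A small numeric check of this identity, using made-up per-example probabilities $p(y^{(i)}|x^{(i)})$ (the values are hypothetical and only serve to illustrate the arithmetic):

```python
import numpy as np

# Hypothetical probabilities the model assigns to the correct label of each example.
p = np.array([0.7, 0.9, 0.6])

likelihood = np.prod(p)      # product over examples
nll = -np.sum(np.log(p))     # sum of negative logs

print(likelihood)
print(nll)
print(np.isclose(nll, -np.log(likelihood)))
```

Working with the sum of logs is also more practical numerically, since a product of many probabilities quickly underflows.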

The label vector $y$ is all zeros except for a 1 at the correct label, such as $(1, 0, 0)$. So $\log p(y|x)$ can be written as $\sum_j y_j \log \hat{y}_j$, where $\hat{y}$ is the predicted probability distribution: only the term for the correct class is nonzero.

Loss function:

$$l = -\log p(y|x) = - \sum_j y_j \log \hat{y}_j$$
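A minimal sketch of this loss for a single example, with a one-hot label and a made-up prediction. The function name `cross_entropy` and the numbers are illustrative assumptions, and the sketch assumes every entry of `y_hat` is strictly positive so that the logarithm is finite.

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Cross-entropy between a one-hot label y and a predicted distribution y_hat."""
    # Assumes y_hat > 0 everywhere; a zero entry would make the log infinite.
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0])       # correct class is the second one
y_hat = np.array([0.2, 0.7, 0.1])   # hypothetical predicted probabilities

loss = cross_entropy(y, y_hat)

print(loss)
print(np.isclose(loss, -np.log(0.7)))
```

Because `y` is one-hot, the sum reduces to the negative log of the probability assigned to the correct class, here $-\log 0.7$.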
