# Multiclass Logistic Regression

Multiclass logistic regression is a generalization of binary logistic regression to multiple classes. In binary logistic regression, the output variable $y$ is binary, taking values in $\{0, 1\}$. In multiclass logistic regression, the output variable $y$ can take on $K$ different values, where $K > 2$. The model is also known as multinomial logistic regression.

## Model

As in the binary classification case, we have a set of $n$ observations, each with $p$ features. The input data is represented by a matrix $X$ of size $n \times p$, where each row corresponds to an observation and each column corresponds to a feature. The output variable $y$ is represented by a vector of size $n$, where each element is an integer in the range $\{1, 2, \ldots, K\}$ (in `Python` it is convenient for the categories to take values in $\{0, 1, \ldots, K - 1\}$).

The model assumes that the probability of an observation $i$ belonging to class $k$ is given by the softmax function:

$$
P(y_i = k | x_i) = \frac{e^{w_k^T x_i}}{\sum_{j=1}^K e^{w_j^T x_i}}
$$

where $w_k$ is the weight vector for class $k$ and $x_i$ is the feature vector for observation $i$. The softmax function ensures that the predicted probabilities sum to 1 over all classes.
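The softmax formula can be evaluated directly for a single observation. The sketch below is only an illustration: the feature vector `x_i` and the weight vectors stacked in `w` are made-up values, not part of the original example.

```python
import numpy as np

# Hypothetical example with p = 2 features and K = 3 classes
x_i = np.array([1.0, 2.0])     # feature vector of one observation
w = np.array([[0.5, 0.3],      # w_1
              [-0.2, 0.8],     # w_2
              [0.1, -0.5]])    # w_3

# Linear predictor w_k^T x_i for each class k
linear_predictors = w @ x_i

# Softmax: exponentiate and normalize so the probabilities sum to 1
probs = np.exp(linear_predictors) / np.sum(np.exp(linear_predictors))

print(probs)
print(probs.sum())
```

The class with the largest linear predictor receives the largest probability, and the probabilities sum to 1 regardless of the weights.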

The dot product $w_k^T x_i$ is the linear predictor for class $k$ and observation $i$. The likelihood (assuming the observations are independent) is given by:

$$
L(w) = \prod_{i=1}^n \prod_{k=1}^K P(y_i = k | x_i)^{I(y_i = k)}
$$

where $I(y_i = k)$ is an indicator function that is 1 if $y_i = k$ and 0 otherwise. The log-likelihood is:

$$
\ell(w) = \sum_{i=1}^n \sum_{k=1}^K I(y_i = k) \log P(y_i = k | x_i)
$$
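As a small sketch of the log-likelihood formula, the double sum over observations and classes can be computed with an indicator matrix, or equivalently by picking the probability of the observed class for each observation. The probabilities and labels below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities for n = 3 observations and K = 3 classes
# (row i holds P(y_i = k | x_i) for k = 0, 1, 2)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])
y = np.array([0, 1, 2])  # observed classes, coded 0..K-1

# Indicator matrix with entries I(y_i = k)
indicator = np.eye(probs.shape[1])[y]

# Double sum exactly as in the formula
loglik_double_sum = np.sum(indicator * np.log(probs))

# Equivalent shortcut: select the probability of the observed class in each row
loglik_direct = np.sum(np.log(probs[np.arange(len(y)), y]))

print(loglik_double_sum, loglik_direct)
```

Both expressions give the same value because the indicator removes every term except the one for the observed class.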


## Multiclass Logistic Regression as a Neural Network

The multinomial logistic regression model can be represented as a neural network with a single layer of neurons where each neuron corresponds to a class. The input layer has $p$ neurons, one for each feature, and the output layer has $K$ neurons, one for each class. The weights of the model are represented by the edges connecting the input layer to the output layer (the bias terms are not shown in the diagram). The output of each neuron in the output layer is passed through the softmax function to obtain the predicted probabilities.


```{mermaid}
%%| label: fig-single-neuron-multiclass
%%| fig-width: 6
%%| fig-cap: "ANN model for logistic regression for a single observation"

graph LR
    x1["$$x_{i1}$$"] -->|$$w_1$$| B1(("$$w_{1}^T x_i + b_1$$"))
    x2["$$x_{i2}$$"] -->|$$w_2$$| B2(("$$w_{2}^T x_i + b_2$$"))
    xp["$$x_{ip}$$"] -->|$$w_K$$| B3(("$$w_{K}^T x_i + b_K$$"))
    x1 --> B2
    x1 --> B3
    x2 --> B1
    x2 --> B3
    xp --> B1
    xp --> B2
    B1 --> P1["$$\hat{y}$$"]
    B2 --> P2["$$\hat{y}$$"]
    B3 --> P3["$$\hat{y}$$"]
```

# Entropy and Cross-Entropy

In evaluating a model's accuracy, we need a measure of the distance between our model's prediction and a perfect (out-of-sample) prediction. This measure should be able to account for the fact that some outcomes (targets) are easier to predict than others. Consider the task of predicting the weather (sunshine/rain) in a desert, where it almost never rains. A model that always predicts sunshine will be correct most of the time, but it is not a very useful model, as you will always be surprised when it rains.

The *entropy* of a distribution is a measure of its uncertainty that has four properties:

- It is zero if the distribution is degenerate (i.e. the outcome is always sunshine).
- It is continuous, so a small change in the distribution will result in a small change in the entropy.
- It is higher for distributions that can produce more different outcomes than for distributions that can produce fewer outcomes.
- It is additive, so the entropy of a joint distribution of independent components is the sum of the entropies of its components. This means that if we first measure the uncertainty about being male/female and then measure the uncertainty about being a soccer fan or not, the uncertainty of the combinations (male/soccer fan, male/not soccer fan, female/soccer fan, female/not soccer fan) should be the sum of the two uncertainties (assuming the two characteristics are independent).

It is easy to show that the entropy, defined as the negative expected value of the log-probabilities of the outcomes, satisfies these four properties.

$$
H(p) = -\sum_{k} p_k \log p_k
$$

So the entropy gives us the uncertainty when predicting outcomes using the true distribution. In classification problems, however, we don't know this distribution. Instead, we rely on a model to produce probabilities that we hope are close to the true probabilities. We can ask: how large is the uncertainty if we use the wrong (the model's) probabilities $q$ instead of the true probabilities $p$? This is the *cross-entropy*.

$$
H(p, q) = -\sum_{k} p_k \log q_k
$$

It can also be decomposed into the entropy of the true distribution and the Kullback-Leibler divergence between the true distribution and the model distribution.

$$
H(p, q) = H(p) + \text{KL}(p, q)
$$

In the above expression, $H(p)$ is the entropy of the data-generating distribution, and $\text{KL}(p, q)$ is the Kullback-Leibler divergence between the data-generating distribution and the model distribution. The KL divergence is always non-negative, and it is zero if the two distributions are identical. Therefore, the cross-entropy is always greater than or equal to the entropy of the data-generating distribution.

$$
\text{KL}(p, q) = \sum_{k} p_k (\log p_k - \log q_k) = \sum_{k} p_k \log \frac{p_k}{q_k}
$$

The KL-divergence describes how different $p$ and $q$ are on average (in units of entropy). You have likely encountered a scaled version of it when studying generalized linear models (GLM) under the name of *deviance*. The deviance is the KL-divergence between the data-generating distribution and the model distribution, scaled by a factor of two.
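The decomposition of the cross-entropy into entropy plus KL-divergence can be checked numerically. The following is a minimal sketch; the helper functions and the two distributions `p_true` and `q_model` are illustrative choices, and both distributions are assumed to have strictly positive probabilities so that the logarithms are finite.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_k p_k log p_k."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """KL(p, q) = sum_k p_k log(p_k / q_k)."""
    return np.sum(p * np.log(p / q))

# Hypothetical distributions with strictly positive probabilities
p_true = np.array([0.1, 0.1, 0.8])
q_model = np.array([0.2, 0.3, 0.5])

lhs = cross_entropy(p_true, q_model)
rhs = entropy(p_true) + kl_divergence(p_true, q_model)

print(lhs, rhs)                         # the two sides agree
print(kl_divergence(p_true, q_model))   # non-negative
print(kl_divergence(p_true, p_true))    # zero when the distributions coincide
```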

In gradient descent, we want to minimize the cross-entropy between the true distribution (the labels) and the model distribution (the predicted probabilities). The loss function for multiclass logistic regression is the cross-entropy loss:

$$
\text{CE}(w) = -\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log \hat{y}_{ik}
$$

where $y_{ik}$ is an indicator function that is 1 if observation $i$ belongs to class $k$ and 0 otherwise, and $\hat{y}_{ik}$ is the predicted probability that observation $i$ belongs to class $k$.
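To get a feeling for how this loss behaves, the sketch below evaluates the contribution of a single observation under three hypothetical prediction vectors. Because the label is one-hot, only the predicted probability of the true class enters the loss.

```python
import numpy as np

# One-hot label: the true class is the third of K = 3 classes
y_true = np.array([0.0, 0.0, 1.0])

# Hypothetical predicted probability vectors for the same observation
predictions = {
    "confident and right": np.array([0.05, 0.05, 0.90]),
    "uninformative": np.array([1 / 3, 1 / 3, 1 / 3]),
    "confident and wrong": np.array([0.90, 0.05, 0.05]),
}

# Only the term for the true class survives the multiplication by y_true
losses = {name: -np.sum(y_true * np.log(y_hat))
          for name, y_hat in predictions.items()}

for name, loss in losses.items():
    print(f"{name}: {loss:.4f}")
```

The loss grows without bound as the predicted probability of the true class approaches zero, so confident wrong predictions are penalized much more heavily than uninformative ones.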


In [2]:
import numpy as np

# A small illustration of entropy
# Let p be a probability distribution over 2 classes

p = np.array([0.1, 0.9])
print("p:", p)
print("Entropy:", -np.sum(p * np.log(p)))

p: [0.1 0.9]
Entropy: 0.3250829733914482


In [3]:
q = np.array([1/2, 1/2])
print("q:", q)
print("Entropy:", -np.sum(q * np.log(q)))

q: [0.5 0.5]
Entropy: 0.6931471805599453


In [4]:
q1 = np.array([1/3, 1/3, 1/3])
print("q:", q1)
print("Entropy:", -np.sum(q1 * np.log(q1)))

q: [0.33333333 0.33333333 0.33333333]
Entropy: 1.0986122886681096


In [5]:
import numpy as np

# A small illustration of the cross-entropy loss
# Let p be a probability distribution over 3 classes

p = np.array([0.1, 0.1, 0.8])

# The entropy of p is given by

entropy_p = -np.sum(p * np.log(p))
print(f"Entropy of p: {entropy_p}")

# Values from the distribution of p would be easier to predict than values from the following distribution q

q = np.array([1/3, 1/3, 1/3])

# Its entropy should be higher, reflecting the uncertainty in the distribution.

entropy_q = -np.sum(q * np.log(q))
print(f"Entropy of q: {entropy_q}")

Entropy of p: 0.639031859650177
Entropy of q: 1.0986122886681096


In [6]:
print("P: ", p)
print("Q: ", q)

P:  [0.1 0.1 0.8]
Q:  [0.33333333 0.33333333 0.33333333]


In [7]:
# When we try to predict values from q using q, the cross-entropy is equal to the entropy of q

cross_entropy_q_q = -np.sum(q * np.log(q))

print(f"Cross-entropy of q using q: {cross_entropy_q_q}")

# If we use p to predict values from q, the cross-entropy is higher, as we will be wrong more often:

cross_entropy_p_q = -np.sum(q * np.log(p))

print(f"Cross-entropy of q using p: {cross_entropy_p_q}")

# The KL divergence between q and p measures the extra (above the inherent prediction difficulty)
# difficulty in predicting values from q using p, compared to using q

kl_divergence_q_p = -np.sum(q * np.log(p/q))

print(f"KL divergence between q and p: {kl_divergence_q_p}")

Cross-entropy of q using q: 1.0986122886681096
Cross-entropy of q using p: 1.6094379124341
KL divergence between q and p: 0.5108256237659906


## Forward Pass (Prediction)

As the networks become more complex, it is convenient to represent them in matrix notation and use vectorized operations. To follow the notation, we must pay special attention to the layout of the data and the shapes of the matrices.

The multi-class logistic regression model is:

$$
\hat{Y} = \text{softmax}(W^T X)
$$

The shapes of the matrices are:

- $X \in \mathbb{R}^{P \times N}$, where $N$ is the number of observations in the data (or the batch) and $P$ is the number of input features.
- $W \in \mathbb{R}^{P \times K}$, where $K$ is the number of classes.
- $\hat{Y} \in \mathbb{R}^{K \times N}$, where each column of $\hat{Y}$ contains the predicted probabilities for each class of each observation.


The softmax function is applied to each column of the $K \times N$ matrix $W^T X$ to obtain the predicted probabilities for each class of each observation. The cross-entropy loss is then computed by comparing the predicted probabilities with the true labels.

Let's create a small example to illustrate the forward pass and the back-propagation step for a batch of N = 2 observations, each with P = 4 features and K = 3 classes.

The input matrix $X \in \mathbb{R}^{4 \times 2}$ is:

$$
X = \begin{bmatrix}
1 & 2 \\
3 & 4 \\
5 & 6 \\
7 & 8
\end{bmatrix}
$$

The weight matrix $W \in \mathbb{R}^{4 \times 3}$ is:

$$
W = \begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
1.0 & 1.1 & 1.2
\end{bmatrix}
$$

The true labels $Y \in \mathbb{R}^{3 \times 2}$, one-hot encoded with one column per observation, are:

$$
Y = \begin{bmatrix}
1 & 0 \\
0 & 0 \\
0 & 1
\end{bmatrix}
$$

The true labels matrix here shows that the first observation belongs to class 0 and the second observation belongs to class 2.

The forward pass is:

$$
Z = W^T X = \begin{bmatrix}
0.1 & 0.4 & 0.7 & 1.0 \\
0.2 & 0.5 & 0.8 & 1.1 \\
0.3 & 0.6 & 0.9 & 1.2
\end{bmatrix} \begin{bmatrix}
1 & 2 \\
3 & 4 \\
5 & 6 \\
7 & 8
\end{bmatrix} = \begin{bmatrix}
11.8 & 14.0 \\
13.4 & 16.0 \\
15 & 18
\end{bmatrix}
$$

The softmax function is applied to each column of $Z$ to obtain the predicted probabilities:

$$
\hat{Y} = \text{softmax}(Z) = \begin{bmatrix}
0.03280241 & 0.01587624 \\
0.16247141 & 0.11731043 \\
0.80472617 & 0.86681333
\end{bmatrix}
$$

Note that the predicted probabilities sum to 1 for each observation (column).

The matrix of true labels must have K rows, where K is the number of classes (3 in this case), and N columns, where N is the number of observations (2 in this case). Let the first observation belong to class 0, the second observation to class 2. The matrix of true labels is:

$$
Y = \begin{bmatrix}
1 & 0 \\
0 & 0 \\
0 & 1
\end{bmatrix}
$$

The logarithm of the predicted probabilities is computed element-wise:

$$
\log \hat{Y} = \begin{bmatrix}
-3.41725321 & -4.14293163 \\
-1.81725321 & -2.14293163 \\
-0.21725321 & -0.14293163
\end{bmatrix}
$$


Let's verify these calculations with `numpy`:

In [8]:
import numpy as np

# Input data 
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
print("X is", X.shape)
X

X is (4, 2)


array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

In [9]:
W = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]])
W

array([[0.1, 0.2, 0.3],
       [0.4, 0.5, 0.6],
       [0.7, 0.8, 0.9],
       [1. , 1.1, 1.2]])

In [10]:
W.T @ X

array([[11.8, 14. ],
       [13.4, 16. ],
       [15. , 18. ]])

In [11]:
np.exp(W.T @ X)

array([[  133252.35294553,  1202604.28416478],
       [  660003.22476616,  8886110.52050787],
       [ 3269017.37247211, 65659969.13733051]])

In [12]:
Y_hat = np.exp(W.T @ X) / np.sum(np.exp(W.T @ X), axis=0, keepdims=True)
Y_hat

array([[0.03280241, 0.01587624],
       [0.16247141, 0.11731043],
       [0.80472617, 0.86681333]])
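The cell above exponentiates the linear predictors directly, which works for this small example but can overflow when the entries of $W^T X$ are large. A common remedy, sketched here as a hypothetical helper `softmax_columns` (not part of the original notebook), is to subtract the column maximum before exponentiating. The shift cancels in the ratio, so the probabilities are unchanged.

```python
import numpy as np

def softmax_columns(Z):
    """Column-wise softmax of a K x N matrix of linear predictors."""
    # Subtracting the column maximum leaves the result unchanged
    # but keeps np.exp from overflowing
    Z_shifted = Z - np.max(Z, axis=0, keepdims=True)
    E = np.exp(Z_shifted)
    return E / np.sum(E, axis=0, keepdims=True)

# The matrix Z = W^T X from the worked example
Z_example = np.array([[11.8, 14.0],
                      [13.4, 16.0],
                      [15.0, 18.0]])
print(softmax_columns(Z_example))

# With much larger linear predictors the naive formula would overflow,
# while the shifted version still returns valid probabilities
Z_large = Z_example + 1000.0
print(softmax_columns(Z_large))
```

Adding the same constant to every entry of a column does not change its softmax, so both calls print the same probabilities.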

In [13]:
# Indicator matrix for the true classes

Y = np.array([[1, 0, 0], [0, 0, 1]]).T
Y

array([[1, 0],
       [0, 0],
       [0, 1]])

In [14]:
np.log(Y_hat)

array([[-3.41725321, -4.14293163],
       [-1.81725321, -2.14293163],
       [-0.21725321, -0.14293163]])

In [15]:
Y * np.log(Y_hat)

array([[-3.41725321, -0.        ],
       [-0.        , -0.        ],
       [-0.        , -0.14293163]])

In [16]:
- np.sum(Y * np.log(Y_hat))

3.560184843372646

In [17]:
# The average cross-entropy loss for the batch is

-np.sum(Y * np.log(Y_hat)) / X.shape[1]

1.780092421686323

## Backward Pass (Gradient Descent)

Now that we have computed the loss for the batch in the forward pass, we can compute the gradients of the loss with respect to the weights. Here it is convenient to use matrix notation and vectorized operations.

For a single observation $i$, the predicted probabilities are:

$$
\begin{align*}
z & = W^T x \\
\hat{y} & = \text{softmax}(z)
\end{align*}
$$

The cross-entropy loss is a scalar value that depends on the weights $W$. The gradient of the loss with respect to the weights is a matrix of the same shape as $W$.

$$
\text{CE}(W) = - \sum_{k=1}^K y_{k} \log \hat{y}_{k}
$$

The chain rule tells us that the gradient of the loss with respect to the weights is:

$$
\frac{\partial \text{CE}(W)}{\partial W} = \frac{\partial \text{CE}(W)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} \frac{\partial z}{\partial W}
$$

The derivative of the cross-entropy loss with respect to $z$ is actually quite simple. As we are differentiating a mapping from $\mathbb{R}^{K \times 1}$ to $\mathbb{R}$, the derivative is a vector of the same shape as $z$. It is easiest to derive it for a single element of $z$, say $z_{k'}$. The first thing to notice is that the loss depends on $z_{k'}$ through every $\hat{y}_{k}$, $k = 1, \ldots, K$, because the softmax function divides each element of the column by the _sum of all elements_ of the column. The chain rule tells us that the derivative of the loss with respect to $z_{k'}$ is:

$$
\frac{\partial \text{CE}(W)}{\partial z_{k'}} = - \sum_{k=1}^K \frac{y_{k}}{\hat{y}_{k}}\frac{\partial \hat{y}_{k}}{\partial z_{k'}}
$$

Note that we are using $k'$ as the index of the class in the derivative to avoid confusion with the summation index $k$.


Now, let's tackle the derivative of the softmax function.

$$
\hat{y}_{k} = \frac{e^{z_{k}}}{\sum_{j=1}^K e^{z_{j}}}
$$

$$
\begin{align*}
\frac{\partial}{\partial z_{k'}} \hat{y}_{k} & = \frac{\partial}{\partial z_{k'}} \frac{e^{z_{k}}}{\sum_{j=1}^K e^{z_{j}}}
\end{align*}
$$

Here we have two cases to consider. If $k = k'$, then the derivative is:

$$
\begin{align*}
\frac{\partial}{\partial z_{k'}} \hat{y}_{k} & = \frac{\partial}{\partial z_{k'}} \frac{e^{z_{k}}}{\sum_{j=1}^K e^{z_{j}}} \\
& = \frac{e^{z_{k}}\sum_{j=1}^K e^{z_{j}} - e^{z_{k}}e^{z_{k}}}{\left(\sum_{j=1}^K e^{z_{j}}\right)^2} \\
& = \frac{e^{z_{k}}}{\sum_{j=1}^K e^{z_{j}}} \left(1 - \frac{e^{z_{k}}}{\sum_{j=1}^K e^{z_{j}}}\right) \\
& = \hat{y}_{k} (1 - \hat{y}_{k})
\end{align*}
$$


The second case is when $k \neq k'$. In this case, the derivative is:

$$
\begin{align*}
\frac{\partial}{\partial z_{k'}} \hat{y}_{k} & = \frac{\partial}{\partial z_{k'}} \frac{e^{z_{k}}}{\sum_{j=1}^K e^{z_{j}}} \\
& = \frac{0 - e^{z_{k}}e^{z_{k'}}}{\left(\sum_{j=1}^K e^{z_{j}}\right)^2} \\
& = -\frac{e^{z_{k}}}{\sum_{j=1}^K e^{z_{j}}} \frac{e^{z_{k'}}}{\sum_{j=1}^K e^{z_{j}}} \\
& = -\hat{y}_{k} \hat{y}_{k'}
\end{align*}
$$


We can combine both cases into a single expression using the Kronecker delta $\delta_{kk'}$ which is 1 if $k = k'$ and 0 otherwise:

$$
\delta_{kk'} = \begin{cases}
1 & \text{if } k = k' \\
0 & \text{if } k \neq k'
\end{cases}
$$

$$
\begin{align*}
\frac{\partial}{\partial z_{k'}} \hat{y}_{k} & = \hat{y}_{k} (\delta_{kk'} - \hat{y}_{k'})
\end{align*}
$$

You can check that this expression is correct by verifying that it gives the correct results for the two cases we considered above.
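The expression can also be checked numerically. The sketch below compares the analytic Jacobian $\hat{y}_{k} (\delta_{kk'} - \hat{y}_{k'})$ with central finite differences; the vector `z` and the step size `eps` are arbitrary illustrative choices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shifting by the max leaves the result unchanged
    return e / np.sum(e)

z = np.array([0.5, -1.0, 2.0])  # hypothetical linear predictors, K = 3
y_hat = softmax(z)

# Analytic Jacobian: J[k, k'] = y_hat[k] * (delta[k, k'] - y_hat[k'])
J_analytic = np.diag(y_hat) - np.outer(y_hat, y_hat)

# Central finite differences, one column per z_{k'}
eps = 1e-6
J_numeric = np.zeros((3, 3))
for k_prime in range(3):
    step = np.zeros(3)
    step[k_prime] = eps
    J_numeric[:, k_prime] = (softmax(z + step) - softmax(z - step)) / (2 * eps)

# Largest absolute discrepancy between the two Jacobians
print(np.max(np.abs(J_analytic - J_numeric)))
```

Each column of the Jacobian sums to zero: increasing one linear predictor moves probability between classes, but the total stays equal to 1.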

Now we are ready to substitute this derivative into the expression for the derivative of the loss with respect to $z_{k'}$:

$$
\begin{align*}
\frac{\partial \text{CE}(W)}{\partial z_{k'}} & = - \sum_{k=1}^K \frac{y_{k}}{\hat{y}_{k}}\frac{\partial \hat{y}_{k}}{\partial z_{k'}} \\
& = - \sum_{k=1}^K \frac{y_{k}}{\cancel{\hat{y}_{k}}} \cancel{\hat{y}_{k}} (\delta_{kk'} - \hat{y}_{k'}) \\
& = - \sum_{k = 1}^{K} y_{k} (\delta_{kk'} - \hat{y}_{k'})
\end{align*}
$$

The sum simplifies beautifully because of the special structure of $\delta_{kk'}$ and $y_{k}$. The sum is

$$
\sum_{k = 1}^{K} y_{k} (\delta_{kk'} - \hat{y}_{k'}) = \sum_{k = 1}^{K} y_{k} \delta_{kk'} - \sum_{k = 1}^{K} y_{k} \hat{y}_{k'}
$$

Now you need to consider only two things. In the first sum we are multiplying $y_{k}$ by $\delta_{kk'}$. The Kronecker delta is 1 only when $k = k'$, so the sum is only over the terms where $k = k'$.

$$
\sum_{k = 1}^{K} y_{k} \delta_{kk'} = y_{k'}
$$

For the second sum, notice that the $\hat{y}_{k'}$ does not depend on the summation index $k$. Therefore, we can take it out of the sum.

$$
\sum_{k = 1}^{K} y_{k} \hat{y}_{k'} = \hat{y}_{k'} \sum_{k = 1}^{K} y_{k} = \hat{y}_{k'}
$$

The last equality is true because the sum is over the elements of $y$, which is a one-hot encoded vector showing the true class of the observation. Therefore, its sum over all elements is 1.

In the end, the derivative of the loss with respect to the linear predictor $z_{k'}$ is:

$$
\frac{\partial \text{CE}(W)}{\partial z_{k'}} = (-y_{k'} + \hat{y}_{k'}) = \hat{y}_{k'} - y_{k'}
$$

So the $k'$-th element of the gradient of the loss with respect to $z$ is the difference between the predicted probability and the true label. We can write this in vector form:

$$
\frac{\partial \text{CE}(W)}{\partial z} = \hat{y} - y
$$
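This result is easy to confirm with finite differences. The sketch below uses the first column of $Z$ from the worked example together with a one-hot label for the first class; the helper functions and the step size `eps` are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shifting by the max leaves the result unchanged
    return e / np.sum(e)

def cross_entropy(z, y):
    """Cross-entropy loss of one observation as a function of z."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([11.8, 13.4, 15.0])  # linear predictors of the first observation
y = np.array([1.0, 0.0, 0.0])     # one-hot label: the true class is the first

# Analytic gradient: predicted probabilities minus the true labels
grad_analytic = softmax(z) - y

# Central finite differences, one element of z at a time
eps = 1e-6
grad_numeric = np.zeros_like(z)
for k in range(len(z)):
    step = np.zeros_like(z)
    step[k] = eps
    grad_numeric[k] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

print(grad_analytic)
print(grad_numeric)
```

The gradient is negative for the true class and positive for the others, so a gradient descent step raises the linear predictor of the true class and lowers the rest.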

What remains now is to compute the derivative of $z = W^T x$ with respect to $W$. Here it is helpful to consider the derivative with respect to a single weight $W_{ij}$ and consider a small example.

Let $W$ be a $3 \times 4$ matrix ($P = 3$ features and $K = 4$ classes), so that $W^T$ is a $4 \times 3$ matrix:

$$
z = W^T x = \begin{bmatrix}
w_{11} & w_{21} & w_{31} \\
w_{12} & w_{22} & w_{32} \\
w_{13} & w_{23} & w_{33} \\
w_{14} & w_{24} & w_{34}
\end{bmatrix} \begin{bmatrix}
x_1 \\
x_2 \\
x_3
\end{bmatrix} = \begin{bmatrix}
w_{11}x_1 + w_{21}x_2 + w_{31}x_3 \\
w_{12}x_1 + w_{22}x_2 + w_{32}x_3 \\
w_{13}x_1 + w_{23}x_2 + w_{33}x_3 \\
w_{14}x_1 + w_{24}x_2 + w_{34}x_3
\end{bmatrix} = 
\begin{bmatrix}
z_1 \\
z_2 \\
z_3 \\
z_4
\end{bmatrix}
$$

The derivative of $z_{k}$ with respect to $W_{ij}$ (the weight connecting feature $i$ to class $j$) is:

$$
\frac{\partial z_{k}}{\partial W_{ij}} = \begin{cases}
x_{i} & \text{if } j = k \\
0 & \text{if } j \neq k
\end{cases}
$$

So the derivative of the whole vector $z$ with respect to a single weight, say $W_{i1}$, is again a vector of the same shape as $z$. For the first three columns of $W$:

$$
\frac{\partial z}{\partial W_{i1}} = \begin{bmatrix}
x_{i} \\
0 \\
0 \\
0
\end{bmatrix}
\quad
\frac{\partial z}{\partial W_{i2}} = \begin{bmatrix}
0 \\
x_{i} \\
0 \\
0
\end{bmatrix}
\quad
\frac{\partial z}{\partial W_{i3}} = \begin{bmatrix}
0 \\
0 \\
x_{i} \\
0
\end{bmatrix}
$$

The last result implies that when we take the derivative of the loss with respect to the weights, we get a matrix of the same shape as $W$ whose $ij$-th element is the product of the $i$-th element of the input vector $x$ and the $j$-th element of the derivative of the loss with respect to $z$:

$$
\frac{\partial \text{CE}(W)}{\partial W_{ij}} = \sum_{k=1}^K \frac{\partial \text{CE}(W)}{\partial z_{k}} \frac{\partial z_{k}}{\partial W_{ij}} = x_{i} (\hat{y}_{j} - y_{j})
$$

In matrix form, this is the outer product of the input vector and the prediction error:

$$
\frac{\partial \text{CE}(W)}{\partial W} = x (\hat{y} - y)^T
$$


In [19]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

np.outer(a, b)

array([[ 4,  5,  6],
       [ 8, 10, 12],
       [12, 15, 18]])
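Putting the pieces together for the batch from the forward pass, the sketch below computes the gradient of the average cross-entropy loss as a sum of per-observation outer products of the input vector and the prediction error, laid out with the same $P \times K$ shape as $W$. It then checks one entry with finite differences and takes a single gradient descent step. The helper names, the checked entry, and the learning rate are illustrative assumptions.

```python
import numpy as np

def softmax_columns(Z):
    """Column-wise softmax of a K x N matrix."""
    E = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return E / np.sum(E, axis=0, keepdims=True)

def batch_loss(W, X, Y):
    """Average cross-entropy over the N columns of X."""
    return -np.sum(Y * np.log(softmax_columns(W.T @ X))) / X.shape[1]

# Data from the forward pass example
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=float)  # P x N
W = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]])             # P x K
Y = np.array([[1, 0], [0, 0], [0, 1]], dtype=float)          # K x N

Y_hat = softmax_columns(W.T @ X)

# Sum over observations of the outer products of x_i and (y_hat_i - y_i),
# divided by N because the loss is an average
grad_W = X @ (Y_hat - Y).T / X.shape[1]                      # P x K

# Finite-difference check of a single entry of the gradient
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[2, 1] += eps
W_minus[2, 1] -= eps
numeric = (batch_loss(W_plus, X, Y) - batch_loss(W_minus, X, Y)) / (2 * eps)
print(grad_W[2, 1], numeric)

# One gradient descent step with a hypothetical learning rate
learning_rate = 0.01
W_new = W - learning_rate * grad_W
print(batch_loss(W, X, Y), batch_loss(W_new, X, Y))
```

With this small learning rate the loss after the step is lower than before, which is what a correct gradient should deliver.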