# Multiclass Logistic Regression

Multiclass logistic regression is a generalization of binary logistic regression to multiple classes. In binary logistic regression, the output variable $y$ is binary, taking values in $\{0, 1\}$. In multiclass logistic regression, the output variable $y$ can take on $K$ different values, where $K > 2$. The model is also known as multinomial logistic regression.

## Model

As in the binary classification case, we have a set of $n$ observations, each with $p$ features. The input data is represented by a matrix $X$ of size $n \times p$, where each row corresponds to an observation and each column corresponds to a feature. The output variable $y$ is represented by a vector of size $n$, where each element is an integer in the range $\{1, 2, \ldots, K\}$ (in `Python` it is convenient for the categories to take values in $\{0, 1, \ldots, K - 1\}$).

The model assumes that the probability of an observation $i$ belonging to class $k$ is given by the softmax function:

$$
P(y_i = k | x_i) = \frac{e^{w_k^T x_i}}{\sum_{j=1}^K e^{w_j^T x_i}}
$$

where $w_k$ is the weight vector for class $k$ and $x_i$ is the feature vector for observation $i$. The softmax function ensures that the predicted probabilities sum to 1 over all classes.
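The softmax probabilities can be computed for all observations at once. The following is a minimal sketch, assuming the class weight vectors $w_k$ are stacked as the columns of a matrix `W`; the sizes and the random data are made up for illustration. Subtracting the row maximum before exponentiating leaves the result unchanged but avoids overflow.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, p, K = 5, 3, 4                 # illustrative sizes (assumption)
X = rng.normal(size=(n, p))       # one row per observation
W = rng.normal(size=(p, K))       # one column w_k per class

probs = softmax(X @ W)            # probs[i, k] = P(y_i = k | x_i)
print(probs.sum(axis=1))          # every row sums to 1 up to rounding
```

Adding the same constant to every linear predictor of an observation does not change its probabilities, which is why the shift by the row maximum is harmless.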

The dot product $w_k^T x_i$ is the linear predictor for class $k$ and observation $i$. The likelihood (assuming the observations are independent) is given by:

$$
L(w) = \prod_{i=1}^n \prod_{k=1}^K P(y_i = k | x_i)^{I(y_i = k)}
$$

where $I(y_i = k)$ is an indicator function that is 1 if $y_i = k$ and 0 otherwise. The log-likelihood is:

$$
\ell(w) = \sum_{i=1}^n \sum_{k=1}^K I(y_i = k) \log P(y_i = k | x_i)
$$
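Because the indicator selects exactly one class per observation, the double sum reduces to picking out the log-probability of the observed class. A small sketch under the assumption that labels are coded $0, \ldots, K - 1$ and that `W` holds one weight column per class (the function name `log_likelihood` is hypothetical):

```python
import numpy as np

def log_likelihood(W, X, y):
    """Sum over observations of log P(y_i | x_i) under the softmax model."""
    z = X @ W
    z = z - z.max(axis=1, keepdims=True)                      # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(y)), y].sum()              # observed class only

rng = np.random.default_rng(1)
n, p, K = 6, 2, 3                      # illustrative sizes (assumption)
X = rng.normal(size=(n, p))
y = rng.integers(0, K, size=n)         # labels in {0, ..., K-1}

# With all weights zero every class has probability 1/K,
# so the log-likelihood is -n * log(K).
ll_zero = log_likelihood(np.zeros((p, K)), X, y)
print(ll_zero, -n * np.log(K))
```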


## Multiclass Logistic Regression as a Neural Network

The multinomial logistic regression model can be represented as a neural network with a single layer of neurons where each neuron corresponds to a class. The input layer has $p$ neurons, one for each feature, and the output layer has $K$ neurons, one for each class. The weights of the model are represented by the edges connecting the input layer to the output layer (the bias terms are not shown in the diagram). The output of each neuron in the output layer is passed through the softmax function to obtain the predicted probabilities.


```{mermaid}
%%| label: fig-single-neuron-multiclass
%%| fig-width: 6
%%| fig-cap: "ANN model for logistic regression for a single observation"

graph LR
    x1["$$x_{i1}$$"] -->|$$w_{11}$$| B1(("$$w_{1}^T x_i + b_1$$"))
    x2["$$x_{i2}$$"] -->|$$w_{22}$$| B2(("$$w_{2}^T x_i + b_2$$"))
    xp["$$x_{ip}$$"] -->|$$w_{pK}$$| B3(("$$w_{K}^T x_i + b_K$$"))
    x1 --> B2
    x1 --> B3
    x2 --> B1
    x2 --> B3
    xp --> B1
    xp --> B2
    B1 --> P1["$$\hat{y}_{i1}$$"]
    B2 --> P2["$$\hat{y}_{i2}$$"]
    B3 --> P3["$$\hat{y}_{iK}$$"]
```
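The nodes in the diagram carry bias terms $b_k$, while the softmax formula in the model section is written without them. One common way to reconcile the two, sketched here as an assumption rather than as the text's own construction, is to prepend a constant column of ones to $X$ so that the biases become an extra row of the weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 4, 3, 2                      # illustrative sizes (assumption)
X = rng.normal(size=(n, p))
W = rng.normal(size=(p, K))            # one column of weights per class
b = rng.normal(size=K)                 # one bias per output neuron

# Explicit bias: broadcasting adds b to every row of X @ W.
Z_explicit = X @ W + b

# Absorbed bias: a column of ones in X and b as the first row of W.
X_aug = np.hstack([np.ones((n, 1)), X])    # shape (n, p + 1)
W_aug = np.vstack([b, W])                  # shape (p + 1, K)
Z_absorbed = X_aug @ W_aug

print(np.allclose(Z_explicit, Z_absorbed))  # True
```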

# Entropy and Cross-Entropy

In evaluating a model's accuracy, we need a measure of the distance between our model's prediction and a perfect (out-of-sample) prediction. This measure should be able to account for the fact that some outcomes (targets) are easier to predict than others. Consider the task of predicting the weather (sunshine/rain) in a desert, where it almost never rains. A model that always predicts sunshine will be correct most of the time, but it is not a very useful model, as you will always be surprised when it rains.

The *entropy* of a distribution is a measure of its uncertainty that has four properties:

- It is zero if the distribution is degenerate (i.e. the outcome is always sunshine)
- It is continuous, so a small change in the distribution will result in a small change in the entropy
- It is higher for distributions that can produce more different outcomes than for distributions that can produce fewer outcomes
- It is additive, so the entropy of a distribution is the sum of the entropies of its independent components. This means that if we first measure the uncertainty about being male/female and then measure the uncertainty about being a soccer fan or not, and the two are independent, the uncertainty of the combinations (male/soccer fan, male/not soccer fan, female/soccer fan, female/not soccer fan) should be the sum of the two uncertainties.

It is easy to show that the entropy defined as the negative expected value of the log-probabilities of the outcomes satisfies these four properties.

$$
H(p) = -\sum_{k} p_k \log p_k
$$
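The additivity property can be checked numerically. The sketch below takes entropy to be the negative expected log-probability and uses made-up probabilities for the two attributes, which are assumed to be independent so that the joint distribution is their outer product.

```python
import numpy as np

def entropy(p):
    """Negative expected log-probability, with the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

sex = np.array([0.5, 0.5])             # hypothetical P(male), P(female)
fan = np.array([0.3, 0.7])             # hypothetical P(fan), P(not fan)
joint = np.outer(sex, fan).ravel()     # four combinations, assuming independence

# The two numbers agree: the joint uncertainty is the sum of the parts.
print(entropy(joint), entropy(sex) + entropy(fan))
```

The same helper also illustrates two of the other properties: a degenerate distribution such as `[1.0, 0.0]` has zero entropy, and a uniform distribution over three outcomes has higher entropy than a uniform distribution over two.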

So the entropy gives us the uncertainty when predicting outcomes using the true distribution. In classification problems, however, we don't know this distribution. Instead, we rely on a model to produce probabilities that we hope are close to the true probabilities. We can ask: how large is the uncertainty if we use the model's (wrong) probabilities $q$ instead of the true probabilities $p$? This is the *cross-entropy*.

$$
H(p, q) = -\sum_{k} p_k \log q_k
$$

It can also be decomposed into the entropy of the true distribution and the Kullback-Leibler divergence between the true distribution and the model distribution.

$$
H(p, q) = H(p) + \text{KL}(p, q)
$$

In the above expression, $H(p)$ is the entropy of the data-generating distribution, and $\text{KL}(p, q)$ is the Kullback-Leibler divergence between the data-generating distribution and the model distribution. The KL divergence is always non-negative, and it is zero if the two distributions are identical. Therefore, the cross-entropy is always greater than or equal to the entropy of the data-generating distribution.

$$
\text{KL}(p, q) = \sum_{k} p_k (\log p_k - \log q_k) = \sum_{k} p_k \log \frac{p_k}{q_k}
$$

The KL-divergence describes how different $p$ and $q$ are on average (in units of entropy). You have likely encountered a scaled version of it when studying generalized linear models (GLM) under the name of *deviance*. The deviance is the KL-divergence between the data-generating distribution and the model distribution, scaled by a factor of two.
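The decomposition of the cross-entropy into entropy plus KL divergence can be verified numerically. This is a minimal sketch with illustrative distributions (`p_true` and `q_model` are made-up names and values); it assumes all probabilities are strictly positive so that the logarithms are finite.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p_true = np.array([0.1, 0.9])      # stands in for the data-generating distribution
q_model = np.array([0.5, 0.5])     # stands in for the model distribution

lhs = cross_entropy(p_true, q_model)
rhs = entropy(p_true) + kl_divergence(p_true, q_model)
print(np.isclose(lhs, rhs))        # True

# The divergence is not symmetric in its arguments.
print(kl_divergence(p_true, q_model), kl_divergence(q_model, p_true))
```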

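We fit the model by using gradient descent to minimize the cross-entropy between the true distribution (the labels) and the model distribution (the predicted probabilities). Because each label puts probability 1 on the observed class, this loss is exactly the negative of the log-likelihood $\ell(w)$ from the model section. The loss function for multiclass logistic regression is the cross-entropy loss: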

$$
\text{CE}(w) = -\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log \hat{y}_{ik}
$$

where $y_{ik}$ is an indicator function that is 1 if observation $i$ belongs to class $k$ and 0 otherwise, and $\hat{y}_{ik}$ is the predicted probability that observation $i$ belongs to class $k$.
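With one-hot labels, only the predicted probability of the observed class contributes to the loss. The sketch below uses made-up predicted probabilities and hypothetical helper names (`one_hot`, `cross_entropy_loss`); the clipping constant `eps` is an added safeguard against `log(0)` and is not part of the formula above.

```python
import numpy as np

def one_hot(y, K):
    """Turn integer labels in {0, ..., K-1} into an n x K indicator matrix."""
    Y = np.zeros((len(y), K))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def cross_entropy_loss(Y, Y_hat, eps=1e-12):
    """Summed cross-entropy; eps guards against log(0)."""
    return -np.sum(Y * np.log(np.clip(Y_hat, eps, 1.0)))

y = np.array([0, 2, 1])                     # observed classes (illustrative)
Y = one_hot(y, K=3)
Y_hat = np.array([[0.7, 0.2, 0.1],          # made-up predicted probabilities
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

loss = cross_entropy_loss(Y, Y_hat)
# Same value, computed directly from the probability of each observed class.
direct = -np.sum(np.log(Y_hat[np.arange(len(y)), y]))
print(loss, direct)
```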


```python
import numpy as np

# A small illustration of entropy
# Let p be a probability distribution over 2 classes

p = np.array([0.1, 0.9])
print("p:", p)
print("Entropy:", -np.sum(p * np.log(p)))
```

```
p: [0.1 0.9]
Entropy: 0.3250829733914482
```


```python
q = np.array([1/2, 1/2])
print("q:", q)
print("Entropy:", -np.sum(q * np.log(q)))
```

```
q: [0.5 0.5]
Entropy: 0.6931471805599453
```


```python
q1 = np.array([1/3, 1/3, 1/3])
print("q:", q1)
print("Entropy:", -np.sum(q1 * np.log(q1)))
```

```
q: [0.33333333 0.33333333 0.33333333]
Entropy: 1.0986122886681096
```


```python
import numpy as np

# A small illustration of entropy with 3 classes
# Let p3 be a probability distribution over 3 classes
# (named p3 and q3 so that the two-class p and q used elsewhere are not overwritten)

p3 = np.array([0.1, 0.1, 0.8])

# The entropy of p3 is given by

entropy_p3 = -np.sum(p3 * np.log(p3))
print(f"Entropy of p3: {entropy_p3}")

# Values from the distribution p3 would be easier to predict than values from the following distribution q3

q3 = np.array([1/3, 1/3, 1/3])

# Its entropy should be higher, reflecting the uncertainty in the distribution.

entropy_q3 = -np.sum(q3 * np.log(q3))
print(f"Entropy of q3: {entropy_q3}")
```

```python
print("P: ", p)
print("Q: ", q)
```

```
P:  [0.1 0.9]
Q:  [0.5 0.5]
```


```python
# When we try to predict values from q using q, the cross-entropy is equal to the entropy of q

cross_entropy_q_q = -np.sum(q * np.log(q))

print(f"Cross-entropy of q using q: {cross_entropy_q_q}")

# If we use p to predict values from q, the cross-entropy is higher, as we will be wrong more often:

cross_entropy_p_q = -np.sum(q * np.log(p))

print(f"Cross-entropy of q using p: {cross_entropy_p_q}")

# The KL divergence between q and p measures the extra (above the inherent prediction difficulty)
# difficulty in predicting values from q using p, compared to using q

kl_divergence_q_p = -np.sum(q * np.log(p/q))

print(f"KL divergence between q and p: {kl_divergence_q_p}")
```

```
Cross-entropy of q using q: 0.6931471805599453
Cross-entropy of q using p: 1.203972804325936
KL divergence between q and p: 0.5108256237659906
```


## Weight Updates



First, it is convenient to write down the network model as equations and to keep track of the dimensions of the vectors and matrices. The input data is represented by a matrix $X$ of size $n \times p$, where each row corresponds to an observation and each column corresponds to a feature. The output variable is represented by a one-hot matrix $Y$ of size $n \times K$, where each row corresponds to an observation and each column corresponds to a class. The weight matrix $W$ is of size $p \times K$, where each column corresponds to the weights for a class. For a single observation $x_i \in \mathbb{R}^p$ (row $i$ of $X$ written as a column vector), the model is:

$$
\begin{align*}
z_i & = W^T x_i \in \mathbb{R}^K \\
\hat{y}_i & = \text{softmax}(z_i) \in [0, 1]^{K}\\
CE(w) & = -\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log \hat{y}_{ik} \in \mathbb{R}
\end{align*}
$$

In order to use gradient descent to find the optimal weights, we need to compute the gradient of the loss (the negative log-likelihood) with respect to the weights.

The gradient of the loss function with respect to the weights is given by a matrix of the same shape as the weight matrix, $p \times K$. The element in row $j$ and column $k$ of the gradient matrix is given by:

$$
\frac{\partial CE(w)}{\partial w_{jk}} = -\sum_{i=1}^n x_{ij} (y_{ik} - \hat{y}_{ik})
$$

In matrix form, the gradient is given by:

$$
\frac{\partial CE(w)}{\partial W} = -X^T (Y - \hat{Y})
$$

where $Y$ is the matrix of true labels and $\hat{Y}$ is the matrix of predicted probabilities.
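The matrix form of the gradient can be checked against a finite-difference approximation and then used in a plain gradient descent loop. The following is a sketch on simulated data; the sizes, the learning rate, and the number of iterations are arbitrary choices made for illustration, not values taken from the text.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)        # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def loss(W, X, Y):
    """Summed cross-entropy for one-hot labels Y."""
    return -np.sum(Y * np.log(softmax(X @ W)))

def gradient(W, X, Y):
    """Matrix form -X^T (Y - Y_hat); same shape as W."""
    return -X.T @ (Y - softmax(X @ W))

rng = np.random.default_rng(0)
n, p, K = 50, 3, 4                              # illustrative sizes (assumption)
X = rng.normal(size=(n, p))
y = rng.integers(0, K, size=n)
Y = np.eye(K)[y]                                # one-hot labels, n x K
W = rng.normal(size=(p, K))

# Central finite difference for a single entry w_jk.
j, k, h = 1, 2, 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[j, k] += h
W_minus[j, k] -= h
numeric = (loss(W_plus, X, Y) - loss(W_minus, X, Y)) / (2 * h)
analytic = gradient(W, X, Y)[j, k]
print(numeric, analytic)                        # the two should agree closely

# Plain gradient descent starting from zero weights.
W_gd = np.zeros((p, K))
learning_rate = 0.01                            # arbitrary small step size
losses = []
for _ in range(200):
    losses.append(loss(W_gd, X, Y))
    W_gd -= learning_rate * gradient(W_gd, X, Y)
print(losses[0], losses[-1])                    # the loss goes down
```

The labels here are generated independently of the features, so the loss can only fall a little below its starting value of $n \log K$; with informative features the decrease would be larger.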
