(C-categorical)=
# Analyzing categorical data

Many of the problems we considered so far involved predicting a *continuous* variable (e.g., the price of a car). In many real-world problems, we are instead interested in predicting a *categorical* variable, i.e., a variable that can take only finitely many values. While the difference may seem minor from a mathematical perspective, different techniques must be used to deal with the categorical setting. 

## Loss function

When working with continuous data, we often use the mean-squared error (MSE) 

$$
\frac{1}{n} \sum_{i=1}^n (\widehat{y}_i - y_i)^2
$$

as a measure of error, where $\widehat{y}_i$ is the predicted value for sample $y_i$. When fitting a model (e.g., linear regression), we then choose the model parameters to minimize that error (i.e., we choose the parameters to fit the training data as well as possible). 

Suppose now that the output $Y$ can only take finitely many values, say $\{1, \dots, K\}$. Think of these as labels for data points. For example, suppose we have pictures of cats, dogs, and birds that we would like to automatically classify using a model. Let us label the images using "1" for cats, "2" for dogs, and "3" for birds. We can then aim to train a model to predict the class of each image as accurately as possible. 

````{note}

Digital images can be seen as a collection of *pixels*, small squares with different colors. An image can therefore be encoded as a matrix whose $(i,j)$-th entry is a number representing the color of the corresponding pixel. 

```{figure} images/pokemon.png
---
width: 200 px
---
An illustration of the pixels of an image
```

An image of dimension $m \times n$ can be reshaped into an $mn \times 1$ vector by stacking the columns (or the rows) of the corresponding matrix. We can thus think of an image as a vector. The image classification problem therefore involves taking a vector representing an image as input, and returning the correct image label.

````
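As a minimal sketch of the reshaping described in the note, the following flattens a small array by stacking its columns or its rows. The $2 \times 3$ array and its pixel values are made up for illustration and are not taken from the figure.

```python
import numpy as np

# Hypothetical 2 x 3 grayscale "image"; the values are made up for illustration
img = np.array([[10, 20, 30],
                [40, 50, 60]])

m, n = img.shape

# Stack the columns into a vector of length m*n (column-major order)
v_cols = img.flatten(order="F")

# Stack the rows instead (row-major order, NumPy's default)
v_rows = img.flatten(order="C")

print(v_cols)  # [10 40 20 50 30 60]
print(v_rows)  # [10 20 30 40 50 60]
```

Either ordering works for classification, as long as the same one is used for every image.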


Suppose the correct label of a sample $y_1$ is $1$ (i.e., the first image is a cat). In the MSE setting, a predicted value of $3$ (bird) for $y_1$ is worse than a predicted value of $2$ (dog) since $(3-1)^2 > (2-1)^2$. However, here, the labels $1,2,3$ are completely arbitrary, so a predicted label of $3$ is not really worse than a predicted label of $2$. It therefore makes sense to use a different loss function. 
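The following sketch illustrates the point numerically. It compares the squared error of the same two mistakes (predicting "dog" or "bird" for a cat) under the labeling above and under a second, hypothetical relabeling chosen only for illustration.

```python
import numpy as np

def mse(y_hat, y):
    """Mean-squared error between predicted and true values."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)

# The true class is "cat"; the two wrong predictions are "dog" and "bird".

# Labeling A (as in the text): cat=1, dog=2, bird=3
print(mse([2], [1]), mse([3], [1]))  # 1.0 4.0

# Labeling B (hypothetical relabeling): cat=2, dog=3, bird=1
# The same two mistakes now receive the same penalty.
print(mse([3], [2]), mse([1], [2]))  # 1.0 1.0
```

The penalty assigned to a mistake changes with the labeling, even though the mistake itself is the same.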

### Cross-entropy

Instead of predicting the label of a given sample, many categorical models predict a probability distribution on the different labels. In that context, a very common loss function is **cross-entropy**. Cross-entropy can be seen as a measure of distance between two probability distributions. Here, a probability distribution on $\{1,\dots,K\}$ is a collection of numbers $(p_1, \dots, p_K)$ such that 

1. $0 \leq p_i \leq 1$ for all $i=1, \dots, K$, 
2. $\sum_{i=1}^K p_i = 1$. 

```{admonition} Definition (cross-entropy)

Let $p, q$ be two probability distributions on $\{1,\dots, K\}$. The *cross-entropy* $H(p,q)$ is given by 

$$
H(p,q) := -\sum_{i=1}^K p_i \log q_i, 
$$

where we use the convention $0 \cdot \log 0 = 0$. 

```

Now, instead of using labels such as $1, 2, 3$, etc. for the different categories, we use a **one-hot encoding**. This means that if $y_i$ has label $j$, we represent it as a $K$-dimensional vector with a $1$ in the $j$-th position, and zeros everywhere else:

$$
y_i = (0,0,\dots,0, \underbrace{1}_{j\textrm{-th}},0,0,\dots,0).
$$
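A minimal sketch of this encoding with NumPy, assuming the labels take values in $\{1, \dots, K\}$; the label values below are made up for illustration.

```python
import numpy as np

K = 3                             # number of categories
labels = np.array([1, 3, 2, 1])   # made-up labels in {1, ..., K}

# Row j-1 of the K x K identity matrix is the one-hot vector for label j
Y = np.eye(K)[labels - 1]

print(Y)
```

Each row of `Y` contains a single $1$, in the position given by the corresponding label.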

Equivalently, we can think of $y_i$ as the probability distribution taking value $j$ with probability $1$. We can now compare the predicted probability distribution

$$
\widehat{y}_i = (\widehat{y}_i^{(1)}, \dots, \widehat{y}_i^{(K)})
$$

with $y_i$ using cross-entropy: 

$$
H(y_i, \widehat{y}_i) = -\sum_{j=1}^K y_i^{(j)} \log \widehat{y}_i^{(j)}. 
$$

Finally, we can average the cross-entropy over all the samples to measure how well the model is doing:

$$
L(y, \widehat{y}) = \frac{1}{n} \sum_{i=1}^n H(y_i, \widehat{y}_i) = -\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^K y_i^{(j)} \log \widehat{y}_i^{(j)}. 
$$
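As a sketch, the average loss $L(y, \widehat{y})$ can be computed in vectorized form as follows. The function name `average_cross_entropy` and the two-sample data are illustrative assumptions; the rows of `Y` are one-hot labels and the rows of `Y_hat` are predicted distributions.

```python
import numpy as np

def average_cross_entropy(Y, Y_hat):
    """Average cross-entropy between one-hot rows of Y and predicted rows of Y_hat."""
    Y = np.asarray(Y, dtype=float)
    Y_hat = np.asarray(Y_hat, dtype=float)
    # Only entries with y_i^{(j)} != 0 contribute (convention 0 * log 0 = 0)
    mask = Y > 0
    return -np.sum(Y[mask] * np.log(Y_hat[mask])) / Y.shape[0]

# Two made-up samples with K = 3 categories
Y = np.array([[1, 0, 0],
              [0, 1, 0]])
Y_hat = np.array([[0.5, 0.25, 0.25],
                  [0.25, 0.5, 0.25]])

loss = average_cross_entropy(Y, Y_hat)
print(loss)
```

Here each sample assigns probability $1/2$ to its correct class, so the loss equals $\log 2$. A model that assigns probability $1$ to every correct class would have loss $0$.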

```{note}

Cross-entropy finds its origin in information theory, where it is used to measure the expected number of bits needed to encode samples from a true discrete distribution $p$ when using a coding scheme optimized for a model distribution $q$. It is commonly used in probability theory to compare probability distributions.

```

#### Example

Suppose we work with data having $3$ different categories. Assume one sample has label $y = (1,0,0)$. Consider two possible predictions $(1/3,1/3,1/3)$ and $(1/4, 1/2, 1/4)$. Let us compute which distribution is closest to $y$ with respect to cross-entropy.


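```python
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy H(p, q) between two discrete distributions."""
    K = len(p)
    CE = 0.0
    for j in range(K):
        # Convention 0 * log 0 = 0: skip the terms with p[j] == 0.
        # A term with p[j] > 0 and q[j] == 0 is infinite and must be kept.
        if p[j] != 0:
            CE -= p[j] * np.log(q[j])
    return CE

y = np.array([1, 0, 0])
y1 = np.array([1/3, 1/3, 1/3])
y2 = np.array([1/4, 1/2, 1/4])

CE1 = cross_entropy(y, y1)
CE2 = cross_entropy(y, y2)

print(CE1)
print(CE2)
```

```
1.0986122886681098
1.3862943611198906
```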


Since a smaller cross-entropy indicates a better match, we conclude that $(1/3,1/3,1/3)$ is "closer" to $y$ than $(1/4,1/2,1/4)$.

## The nearest neighbors model

A simple approach to predict the labels in the categorical setting is to use the neighbors of a point to guide the prediction. Suppose the predictors $x_1, \dots, x_n$ belong to the $N$-dimensional Euclidean space $\mathbb{R}^N$ and assume the responses $y_1, \dots, y_n$ are categorical and encoded using a one-hot encoding as described above. For $x \in \mathbb{R}^N$ and an integer $1 \leq k \leq n-1$, let $N_k(x)$ denote the set of $k$ nearest neighbors of $x$ (i.e., the $k$ points closest to $x$ among $x_1, \dots, x_n$). 

```{admonition} Definition ($k$ nearest neighbors predictor)

In the above setting, the *$k$ nearest neighbors predictor* is

$$
\widehat{Y}(x) = \frac{1}{k} \sum_{x_m \in N_k(x)} y_m.
$$

```

In other words, the $k$ nearest neighbors predictor returns the proportion of samples of each category in the $k$ neighborhood of $x$. New points can then be classified using the category with the largest proportion of neighbors. The nearest neighbors approach thus uses a "majority vote" based on the value of the nearest neighbors to make the prediction. This makes sense in scenarios where "similar" points typically have the same label. 

```{figure} images/Fig2p2.png
---
width: 300 px
---
An illustration of the $k$ nearest neighbors classifier with $k=15$. 
```

### Example

Consider the following categorical dataset: 

$$
X = 
\begin{bmatrix}
2 & 5 \\
-1 & 3 \\
4 & 0 \\
0 & -2 \\
3 & 1
\end{bmatrix}, \quad
y =
\begin{bmatrix}
0 \\
2 \\
1 \\
0 \\
2
\end{bmatrix}.
$$

Let us use Python to compute the $3$ nearest neighbors predictor at the new point $x = (1,1)$. We first enter the data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Feature matrix (5 samples, 2 features)
X = np.array([
    [2, 5],
    [-1, 3],
    [4, 0],
    [0, -2],
    [3, 1]
])

n = X.shape[0]  # Number of samples

# Response vector
y = np.array([0, 2, 1, 0, 2])

# Let us convert y to a one-hot encoding
encoder = OneHotEncoder(sparse_output=False)
y_one_hot = encoder.fit_transform(y.reshape(-1, 1))
```

We can verify that $y$ was encoded properly:

```python
print(y_one_hot)
```

```
[[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
```


Let us now compute the 3 nearest neighbors predictor.

```python
# Point to predict
x = np.array([1, 1])

# Compute the distance between x and the samples
d = np.zeros((n, 1))
for i in range(n):
    d[i] = np.linalg.norm(x - X[i, :])

# Find the index of the 3 nearest neighbors
I = np.argsort(d, axis=0)  # Returns the indices of d from smallest to largest

I3 = I[0:3]  # 3 nearest neighbors

# Average the labels of the nearest neighbors
yhat = y_one_hot[I3, :].mean(axis=0)

print(yhat)

print("Final prediction: ")
print(np.argmax(yhat))
```

```
[[0.33333333 0.         0.66666667]]
Final prediction: 
2
```


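Since the label with the largest proportion of $3$-neighbors of $x$ is "2", we classify the new point $x$ as "2". In some sense, we are relatively confident in our classification since $2/3$ of the closest neighbors have this label. Note that the samples $(4,0)$ and $(0,-2)$ are both at distance $\sqrt{10}$ from $x$, so which of them counts as the third nearest neighbor depends on how `np.argsort` breaks the tie. Here the choice affects the predicted proportions but not the final classification. 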

```{note} 
One can think of several variants of nearest neighbors. For example, instead of using the $k$ nearest neighbors, one could use all the neighbors at distance at most $d$ for some given value of $d > 0$. In that case, if the number of such neighbors is small for a given $x$, one can decide to return no classification as there are not enough neighbors to make an informed decision. Another variant uses all the samples in the prediction, with weights that decrease with distance, say

$$
\widehat{Y}(x) = \sum_{i=1}^n w_i y_i, 
$$

where $w_i = w_i(x) \geq 0$. The classical $k$ nearest neighbors classifier arises in that way by setting $w_i = 1/k$ for the $k$ nearest neighbors and $w_i = 0$ otherwise. Another possible choice for $w_i$ is 

$$
w_i(x) = \frac{e^{-\|x-x_i\|_2^2}}{\sum_{m=1}^n e^{-\|x-x_m\|_2^2}}, 
$$

where points in the training set have a weight that decreases exponentially fast with distance. The denominator is used to make the weights sum to $1$. 
```
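As a rough sketch of the exponentially weighted variant from the note, the following computes the normalized weights and the resulting prediction for the example data and the point $x = (1,1)$. The variable names are illustrative, and the one-hot encoding is built with `np.eye` rather than scikit-learn to keep the block self-contained.

```python
import numpy as np

# Data from the example above
X = np.array([[2, 5], [-1, 3], [4, 0], [0, -2], [3, 1]])
y = np.array([0, 2, 1, 0, 2])
Y = np.eye(3)[y]        # one-hot encoding of the labels
x = np.array([1, 1])    # point to classify

# Squared Euclidean distances from x to every sample
sq_dist = np.sum((X - x) ** 2, axis=1)

# Exponentially decaying weights, normalized to sum to 1
w = np.exp(-sq_dist)
w = w / w.sum()

# Weighted average of the one-hot labels
y_hat = w @ Y

print(y_hat)
print("Final prediction:", np.argmax(y_hat))
```

Because the weights decay very quickly, the prediction is dominated by the closest sample $(3,1)$, which has label "2". In practice one may want to rescale the distances before exponentiating, which is a design choice not covered here.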

```{admonition} Exercise

Implement your own nearest neighbors classifier in Python. Your function should take $X, y, k, x$ as input and return the $k$ nearest neighbors prediction for $x$. Also implement some of the variants described above.
```

When using nearest neighbors, a value of $k$ needs to be chosen carefully. Although a small $k$ leads to a small training error, the model may not generalize well (large test error). The value of $k$ is typically chosen using [](sec-cross-validation).
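To see why a small $k$ gives a small training error, consider $k = 1$ on the example data: each training point is its own nearest neighbor, so it is always classified correctly. The sketch below checks this, assuming no two training samples coincide. It says nothing about the test error.

```python
import numpy as np

# Data from the example above
X = np.array([[2, 5], [-1, 3], [4, 0], [0, -2], [3, 1]])
y = np.array([0, 2, 1, 0, 2])

# Pairwise distances between the training samples (n x n matrix)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# With k = 1, the nearest neighbor of each training point is itself (distance 0)
nearest = np.argmin(D, axis=1)

# Proportion of misclassified training points
train_error = np.mean(y[nearest] != y)
print(train_error)  # 0.0
```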