# Linear Models for Classification

Linear vs. non-linear classifier: See [Scikit-learn Classifier comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py).


In [None]:
import numpy as np
import pandas as pd

from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Plot
import matplotlib.pyplot as plt
import seaborn as sns

# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)

## Geometric Method: Naive Method

Principles:

- Compute the class means $\mathbf{\mu}_1, \mathbf{\mu}_2, \ldots, \mathbf{\mu}_K$.
- Assign a new point $\mathbf{x}$ to the class with the closest mean, i.e., class $\arg \min_k \|\mathbf{x} - \mathbf{\mu}_k\|_2$.

For binary classification, this is equivalent to computing the most discriminative direction as the vector between the class means:
$$
\boxed{\mathbf{w}_{\text{naive}} = \mathbf{\mu}_1 - \mathbf{\mu}_2}
$$

And projecting a new point $\mathbf{x_i}$ along this direction to obtain a score $z_i$:
$$
z_i = \mathbf{x_i}^\top \mathbf{w}_{\text{naive}}
$$

Finally, classify the point according to the closest projected class mean.
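
A minimal sketch of this equivalence, assuming synthetic two-class Gaussian data (the class means, sample sizes and the helper names `nearest_mean` and `projected_rule` are illustrative and not part of the original text). It checks that assigning each point to the closest mean gives the same labels as thresholding the score $z_i$ at the midpoint between the projected class means:

```python
import numpy as np

# Synthetic data (assumption): two Gaussian blobs for classes 1 and 2
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(50, 2))
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

def nearest_mean(X):
    """Assign each row of X to the class (1 or 2) with the closest mean."""
    d1 = np.linalg.norm(X - mu1, axis=1)
    d2 = np.linalg.norm(X - mu2, axis=1)
    return np.where(d1 <= d2, 1, 2)

# Naive discriminative direction and midpoint of the projected means
w_naive = mu1 - mu2
threshold = w_naive @ (mu1 + mu2) / 2

def projected_rule(X):
    """Project on w_naive and compare the score to the midpoint threshold."""
    return np.where(X @ w_naive >= threshold, 1, 2)

X_all = np.vstack([X1, X2])
same = np.array_equal(nearest_mean(X_all), projected_rule(X_all))
print("Both rules agree:", same)
```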

Illustration:

![Most discriminant projections, Naive and Fisher methods](images/fisher_linear_disc.png)


## Geometric Method: Fisher's Linear Discriminant

Principles:

- Dimensionality reduction before later classification.
- Find the most discriminant axis.
- Takes the distribution of the points into account, assuming the same normal distribution for all classes.

Simply compute the **within class covariance** $\mathbf{S_W}$ to rotate the projection direction according to the point (elliptic) distribution:

$$
\boxed{\mathbf{w}_{\text{Fisher}} = \mathbf{S_W}^{-1}(\mathbf{\mu_1} - \mathbf{\mu_0})}.
$$

This geometric method does not make any probabilistic assumptions; instead, it relies on distances. It looks for the **linear projection** of the data points onto a vector, $\mathbf{w}$, that maximizes the between/within variance ratio, denoted $F(\mathbf{w})$. Under a few assumptions, it provides the same results as linear discriminant analysis (LDA), explained below.

Suppose two classes of observations, $C_0$ and $C_1$, have means $\mathbf{\mu_0}$ and $\mathbf{\mu_1}$ and the same total within-class scatter ("covariance") matrix,

$$
    \mathbf{S_W} &= \sum_{i\in C_0} (\mathbf{x_i} - \mathbf{\mu_0})(\mathbf{x_i} - \mathbf{\mu_0})^T + \sum_{j\in C_1} (\mathbf{x_j} - \mathbf{\mu_1})(\mathbf{x_j} -\mathbf{\mu_1})^T\\
        &= \mathbf{X_c}^T \mathbf{X_c},
$$

where $\mathbf{X_c}$ is the $(N \times P)$ matrix of data centered on their respective means:

$$
\mathbf{X_c} = \begin{bmatrix}
          \mathbf{X_0} -  \mathbf{\mu_0} \\
          \mathbf{X_1} -  \mathbf{\mu_1} 
      \end{bmatrix},
$$

where $\mathbf{X_0}$ and $\mathbf{X_1}$ are the $(N_0 \times P)$ and $(N_1 \times P)$ matrices of samples of classes $C_0$ and $C_1$.

Let $\mathbf{S_B}$ be the "between-class" scatter matrix, given by

$$
    \mathbf{S_B} = (\mathbf{\mu_1} - \mathbf{\mu_0} )(\mathbf{\mu_1} - \mathbf{\mu_0} )^T.
$$

The linear combination of features $\mathbf{w}^T \mathbf{x}$ has means $\mathbf{w}^T \mathbf{\mu_i}$ for $i=0,1$, and variance $\mathbf{w}^T \mathbf{X_c}^T \mathbf{X_c} \mathbf{w}$. Fisher defined the separation between these two distributions to be the ratio of the variance between the classes to the variance within the classes:

$$
F_{\text{Fisher}}(\mathbf{w}) &= \frac{\sigma_{\text{between}}^2}{\sigma_{\text{within}}^2}\\
                     &= \frac{(\mathbf{w}^T \mathbf{\mu_1} - \mathbf{w}^T \mathbf{\mu_0})^2}{\mathbf{w}^T \mathbf{X_c}^T \mathbf{X_c} \mathbf{w}}\\
                     &= \frac{(\mathbf{w}^T (\mathbf{\mu_1} - \mathbf{\mu_0}))^2}{\mathbf{w}^T \mathbf{X_c}^T \mathbf{X_c} \mathbf{w}}\\
                     &= \frac{\mathbf{w}^T (\mathbf{\mu_1} - \mathbf{\mu_0}) (\mathbf{\mu_1} - \mathbf{\mu_0})^T \mathbf{w}}{\mathbf{w}^T \mathbf{X_c}^T \mathbf{X_c} \mathbf{w}}\\
                     &= \frac{\mathbf{w}^T \mathbf{S_B} \mathbf{w}}{\mathbf{w}^T \mathbf{S_W} \mathbf{w}}.
$$


In the two-class case, the maximum separation is obtained by projecting onto the direction $(\mathbf{\mu_1} - \mathbf{\mu_0})$ corrected by the Mahalanobis metric $\mathbf{S_W}^{-1}$, so that

$$
\boxed{\mathbf{w} \propto \mathbf{S_W}^{-1}(\mathbf{\mu_1} - \mathbf{\mu_0})}.
$$

**Demonstration**

Differentiating $F_{\text{Fisher}}(\mathbf{w})$ with respect to $\mathbf{w}$ and setting the gradient to zero gives

$$
    \nabla_{\mathbf{w}}F_{\text{Fisher}}(\mathbf{w}) &= 0\\
    \nabla_{\mathbf{w}}\left(\frac{\mathbf{w}^T \mathbf{S_B} w}{\mathbf{w}^T \mathbf{S_W} \mathbf{w}}\right) &= 0\\
    (\mathbf{w}^T \mathbf{S_W} \mathbf{w})(2 \mathbf{S_B} \mathbf{w}) - (\mathbf{w}^T \mathbf{S_B} \mathbf{w})(2 \mathbf{S_W} \mathbf{w}) &= 0\\
    (\mathbf{w}^T \mathbf{S_W} \mathbf{w})(\mathbf{S_B} \mathbf{w}) &= (\mathbf{w}^T \mathbf{S_B} \mathbf{w})(\mathbf{S_W} \mathbf{w})\\
    \mathbf{S_B} \mathbf{w} &= \frac{\mathbf{w}^T \mathbf{S_B} \mathbf{w}}{\mathbf{w}^T \mathbf{S_W} \mathbf{w}}(\mathbf{S_W} \mathbf{w})\\
    \mathbf{S_B} \mathbf{w} &= \lambda (\mathbf{S_W} \mathbf{w})\\
    \mathbf{S_W}^{-1}{\mathbf{S_B}} \mathbf{w} &= \lambda  \mathbf{w}.
$$

Since we do not care about the magnitude of $\mathbf{w}$, only its direction, we replaced the scalar factor $(\mathbf{w}^T \mathbf{S_B} \mathbf{w}) / (\mathbf{w}^T \mathbf{S_W} \mathbf{w})$ by $\lambda$. 

In the multiple-class case, the solutions $w$ are determined by the eigenvectors of $\mathbf{S_W}^{-1}{\mathbf{S_B}}$ that correspond to the $K-1$ largest eigenvalues.

However, in the two-class case (in which $\mathbf{S_B} = (\mathbf{\mu_1} - \mathbf{\mu_0} )(\mathbf{\mu_1} - \mathbf{\mu_0} )^T$ has rank one) it is easy to show that $\mathbf{w} = \mathbf{S_W}^{-1}(\mathbf{\mu_1} - \mathbf{\mu_0})$ is, up to a scale factor, the only eigenvector of $\mathbf{S_W}^{-1}{\mathbf{S_B}}$ with a non-zero eigenvalue:

$$
    \mathbf{S_W}^{-1}(\mathbf{\mu_1} - \mathbf{\mu_0} )(\mathbf{\mu_1} - \mathbf{\mu_0} )^T \mathbf{w} &= \lambda  \mathbf{w}\\
    \mathbf{S_W}^{-1}(\mathbf{\mu_1} - \mathbf{\mu_0} )(\mathbf{\mu_1} - \mathbf{\mu_0} )^T \mathbf{S_W}^{-1}(\mathbf{\mu_1} - \mathbf{\mu_0}) &= \lambda  \mathbf{S_W}^{-1}(\mathbf{\mu_1} - \mathbf{\mu_0}),
$$

where $\lambda = (\mathbf{\mu_1} - \mathbf{\mu_0} )^T \mathbf{S_W}^{-1}(\mathbf{\mu_1} - \mathbf{\mu_0})$. This leads to the result

$$
\mathbf{w} \propto \mathbf{S_W}^{-1}(\mathbf{\mu_1} - \mathbf{\mu_0}).
$$
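
The result can be checked numerically. The sketch below assumes synthetic two-class Gaussian data (means, covariance, sample sizes and the helper name `fisher_ratio` are illustrative). It verifies that $\mathbf{w} = \mathbf{S_W}^{-1}(\mathbf{\mu_1} - \mathbf{\mu_0})$ satisfies the eigenvector equation with the stated $\lambda$, and that no randomly drawn direction reaches a larger value of $F_{\text{Fisher}}$:

```python
import numpy as np

# Synthetic data (assumption): two Gaussian classes sharing one covariance
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([0.0, 2.0], cov, size=200)
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class and between-class scatter matrices
Xc = np.vstack([X0 - mu0, X1 - mu1])
S_W = Xc.T @ Xc
d = mu1 - mu0
S_B = np.outer(d, d)

def fisher_ratio(w):
    """Between/within variance ratio F(w)."""
    return (w @ S_B @ w) / (w @ S_W @ w)

# Closed-form direction and its eigenvalue lambda = d^T S_W^{-1} d
w_fisher = np.linalg.solve(S_W, d)
lam = d @ w_fisher

# Eigenvector equation: S_W^{-1} S_B w = lambda w
is_eigvec = np.allclose(np.linalg.solve(S_W, S_B @ w_fisher), lam * w_fisher)

# Compare with the ratio reached by random directions
best_random = max(fisher_ratio(w) for w in rng.normal(size=(1000, 2)))
print(is_eigvec, fisher_ratio(w_fisher), best_random)
```

At the optimum, $F_{\text{Fisher}}(\mathbf{w})$ equals $\lambda$, which is why the last two printed values are ordered as expected.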

**The separating hyperplane**

The separating hyperplane is a $(P-1)$-dimensional hypersurface, orthogonal to the projection vector $\mathbf{w}$. There is no single best way to find the origin of the plane along $\mathbf{w}$, or equivalently the classification threshold that determines whether a point should be classified as belonging to $C_0$ or to $C_1$. However, if the projected points have roughly the same distribution, then the threshold can be chosen as the hyperplane exactly between the projections of the two means, i.e. as

$$
T = \mathbf{w} \cdot \frac{1}{2}(\mathbf{\mu_1} + \mathbf{\mu_0}).
$$
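
As a hedged illustration, the sketch below classifies synthetic data (an assumption, as are all variable names) by projecting on the Fisher direction and comparing the score to the midpoint between the two projected means, $\mathbf{w} \cdot \frac{1}{2}(\mathbf{\mu_0} + \mathbf{\mu_1})$:

```python
import numpy as np

# Synthetic data (assumption): two correlated Gaussian classes
rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=100)
X1 = rng.multivariate_normal([0.0, 2.0], cov, size=100)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 100)

# Fisher direction from the class means and the within-class scatter
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Xc = np.vstack([X0 - mu0, X1 - mu1])
w = np.linalg.solve(Xc.T @ Xc, mu1 - mu0)

# Threshold halfway between the projections of the two means
T = w @ (mu0 + mu1) / 2

# Points whose score exceeds T are assigned to class 1
y_pred = (X @ w > T).astype(int)
accuracy = (y_pred == y).mean()
print("Training accuracy: %.2f" % accuracy)
```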

## Generative Model: Linear Discriminant Analysis (LDA)

- Probabilistic generalization of Fisher's linear discriminant.
- Generative model of the **conditional distribution** of the input data $\mathbf{x}$ given the label $k$: $p(\mathbf{x}|y=k)$.
- Uses Bayes' rule to provide the **posterior distribution** of the label $k$ given the input data $\mathbf{x}$: $p(y=k|\mathbf{x})$.
- Uses Bayes' rule to fix the threshold based on prior probabilities of classes.


1. First compute the class-**conditional distributions** of $\mathbf{x}$ given class $C_k$: $p(\mathbf{x}|C_k) = \mathcal{N}(\mathbf{x}|\mathbf{\mu_k}, \mathbf{S_W})$, where $\mathcal{N}(\mathbf{x}|\mathbf{\mu_k}, \mathbf{S_W})$ is the multivariate Gaussian distribution defined over a $P$-dimensional vector $\mathbf{x}$ of continuous variables, which is given by

$$
\mathcal{N}(\mathbf{x}|\mathbf{\mu_k}, \mathbf{S_W}) = \frac{1}{(2\pi)^{P/2}|\mathbf{S_W}|^{1/2}}\exp\left\{-\frac{1}{2} (\mathbf{x} - \mathbf{\mu_k})^T \mathbf{S_W}^{-1}(\mathbf{x} - \mathbf{\mu_k})\right\}
$$

2. Estimate the **prior probabilities** of class $k$, $p(C_k) = N_k/N$.

3. Compute **posterior probabilities** (i.e., the probability of each class given a sample) by combining the class-conditional distributions with the priors using Bayes' rule:

$$
p(C_k|\mathbf{x}) = \frac{p(C_k) p(\mathbf{x}|C_k)}{p(\mathbf{x})}
$$

where $p(\mathbf{x})$ is the marginal distribution of $\mathbf{x}$. As usual, the denominator in Bayes' theorem can be expressed in terms of the quantities appearing in the numerator, by summing over the classes:

$$
p(x) = \sum_k p(\mathbf{x}|C_k)p(C_k)
$$

4. Classify $\mathbf{x}$ using the maximum a posteriori (MAP) rule: $\hat{C} = \arg \max_{C_k} p(C_k|\mathbf{x})$.
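
The four steps above can be sketched with NumPy only. This is an illustration under stated assumptions: the data are synthetic and imbalanced, the shared covariance is estimated as the pooled within-class scatter divided by $N$ (one possible normalization), and all variable and function names are hypothetical:

```python
import numpy as np

# Synthetic, imbalanced data (assumption): 150 samples of class 0, 50 of class 1
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = np.vstack([rng.multivariate_normal([0.0, 0.0], cov, size=150),
               rng.multivariate_normal([0.0, 2.0], cov, size=50)])
y = np.repeat([0, 1], [150, 50])
N, P = X.shape
classes = np.unique(y)

# Step 1: class means and shared (pooled) covariance
means = np.array([X[y == k].mean(axis=0) for k in classes])
Xc = X - means[y]
Sigma = Xc.T @ Xc / N
Sigma_inv = np.linalg.inv(Sigma)
_, logdet = np.linalg.slogdet(Sigma)

def log_gaussian(X, mu):
    """Log-density of N(mu, Sigma) evaluated at each row of X."""
    diff = X - mu
    maha = np.einsum("ij,jk,ik->i", diff, Sigma_inv, diff)
    return -0.5 * (maha + logdet + P * np.log(2 * np.pi))

# Step 2: priors estimated by class frequencies N_k / N
priors = np.array([(y == k).mean() for k in classes])

# Step 3: posteriors via Bayes' rule (computed in log space for stability)
log_joint = np.column_stack([np.log(priors[k]) + log_gaussian(X, means[k])
                             for k in classes])
log_joint -= log_joint.max(axis=1, keepdims=True)
posterior = np.exp(log_joint)
posterior /= posterior.sum(axis=1, keepdims=True)

# Step 4: maximum a posteriori classification
y_pred = classes[posterior.argmax(axis=1)]
print("Training accuracy: %.2f" % (y_pred == y).mean())
```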

LDA is a **generative model** since the class-conditional distributions can be used to generate samples of each class.

LDA is useful to deal with imbalanced group sizes (e.g., $N_1 \gg N_0$) since prior probabilities can be used to explicitly re-balance the classification by setting $p(C_0) = p(C_1) = 1/2$ or whatever seems relevant.

LDA can be generalized to the multiclass case with $K>2$.

With $N_1 = N_0$, LDA leads to the same solution as Fisher's linear discriminant.

**Question:** How many parameters must be estimated to perform an LDA?

Application with scikit-learn

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Dataset: two multivariate normal distributions sharing the same covariance
n_samples, n_features = 100, 2
mean0, mean1 = np.array([0, 0]), np.array([0, 2])
Cov = np.array([[1, .8],[.8, 1]])
np.random.seed(42)
X0 = np.random.multivariate_normal(mean0, Cov, n_samples)
X1 = np.random.multivariate_normal(mean1, Cov, n_samples)
X = np.vstack([X0, X1])
y = np.array([0] * X0.shape[0] + [1] * X1.shape[0])
m = X.mean(axis=0)

# Naive rule
w_naive = mean1 - mean0
w_naive = w_naive / np.linalg.norm(w_naive, ord=2) * 2
score_naive = np.dot(X, w_naive)

# Fisher rule
Xc = np.vstack([X0 - mean0, X1 - mean1])
Sw = np.dot(Xc.T , Xc)
w_fisher = np.dot(np.linalg.inv(Sw), mean1 - mean0)
w_fisher = w_fisher / np.linalg.norm(w_fisher, ord=2) * 2
score_fisher = np.dot(X, w_fisher)


# LDA with scikit-learn
lda = LDA()
score_lda = lda.fit(X, y).transform(X).ravel()
w_lda = lda.coef_.ravel()
w_lda = w_lda / np.linalg.norm(w_lda, ord=2) * 2
y_pred_lda = lda.predict(X)

errors =  y_pred_lda != y
print("Nb errors=%i, error rate=%.2f" % 
      (errors.sum(), errors.sum() / len(y_pred_lda)))

# Plot
plt.figure(figsize=(fig_w * 0.5, fig_w * 0.5))
data = pd.DataFrame(np.hstack([X, y.reshape(-1, 1)]), columns=("x1", "x2", "y"))
ax_ = sns.scatterplot(data=data, x="x1", y="x2", hue="y")
ax_.quiver([m[0]] * 3, [m[1]] * 3,
           [w_naive[0], w_fisher[0], w_lda[0]],
           [w_naive[1], w_fisher[1], w_lda[1]],
           units='xy', scale=1)

scores = [("Naive", score_naive), ("Fisher", score_fisher), ("LDA", score_lda)]
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
#colors = [colors[i] for i in [0, 2]]

#Plot
fig, axes = plt.subplots(1, 3, figsize=(fig_w * 2, fig_h * .5), sharey=True)
for ax, (title, score) in zip(axes, scores):
    for lab in np.unique(y):
        sns.histplot(score[y == lab], ax=ax,  label=f"Label {lab}", kde=True, color=colors[lab])
    ax.set_title(title)
    ax.legend()
fig.suptitle('Projection on discriminative directions', fontsize=16)
plt.tight_layout()
plt.show()

## Logistic Regression

![Linear (logistic) classification](images/linear_logistic.png){width=15cm}

Principles:

1. Map the output of a linear model, $\mathbf{w}^\top \mathbf{x}$, into a score $z$. This step performs a **projection** or a **rotation** of the input sample into a one-dimensional subspace that best discriminates the samples of the current class from the samples of the other classes.

2. Using an activation function $\sigma(.)$, this score (a.k.a. decision function) is transformed into the "posterior probability" of class $k$: $P(y=k| \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x})$.

Properties:

- Consider only the posterior probability of a class $k$: $P(y=k| \mathbf{x})$
- Multiclass classification problems can be seen as several binary classification problems $y_i \in \{0, 1\}$ where the classifier aims to discriminate the samples of the current class (label 1) from the samples of the other classes (label 0).
- The decision surfaces (hyperplanes orthogonal to $\mathbf{w}$) correspond to $\sigma(z)=\text{constant}$, so that $\mathbf{x}^T \mathbf{w}=\text{constant}$ and hence the decision surfaces are linear functions of $\mathbf{x}$, even if the function $\sigma(.)$ is nonlinear.
- A thresholding of the activation (shifted by the bias or intercept) provides the predicted class label.


**Linear Discriminant Analysis (LDA) vs. Logistic Regression**

| Feature                     | **Linear Discriminant Analysis (LDA)**                          | **Logistic Regression**                                  |
|-----------------------------|------------------------------------------------------------------|----------------------------------------------------------|
| **Model Type**              | Generative model                                                 | Discriminative model                                     |
| **What It Models**          | Joint probability: $P(x, y) = P(x \mid y) P(y)$             | Posterior probability: $P(y \mid x)$              |
| **Assumptions**             | Gaussian class-conditional distributions with equal covariance  | No distributional assumption on features                |
| **Decision Boundary**       | Linear (under equal covariance assumption)                      | Linear                                                   |
| **Probability Estimation**  | Uses Bayes' rule + Gaussian likelihood                          | Uses sigmoid of linear function                         |
| **Interpretability**        | Less intuitive, based on data distribution                      | Coefficients directly reflect impact on class log-odds  |
| **Performance**             | Can outperform logistic regression when assumptions hold        | More robust when assumptions (e.g., normality) are violated |
| **Use in Multiclass**       | Naturally extends to multiclass                                | Extends via one-vs-rest or softmax (multinomial logistic) |
| **Regularization**          | Not built-in (requires extensions like shrinkage LDA)           | Easily regularized (e.g., with L1/L2 penalties)         |


**Key Insights**

- **LDA** is **generative**: it models how the data was generated (distribution of features given class), then uses Bayes’ theorem to classify.
- **Logistic regression** is **discriminative**: it models the boundary between classes directly.


### Activation Functions for Classification (Sigmoid and Softmax)

The **sigmoid** and **softmax** functions are closely related—they both transform real-valued inputs (*logits*) into probabilities, but they are used in different settings:


**The sigmoid function** maps real-valued inputs (called *logits*) into the interval $(0, 1)$, making the outputs interpretable as probabilities (for scalar $z$):

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

It has the following properties:

- $\sigma(z) \to 1$ as $z \to +\infty$
- $\sigma(z) \to 0$ as $z \to -\infty$
- $\sigma(0) = 0.5$

In binary classification, we use $\sigma(\mathbf{w}^\top \mathbf{x})$ to estimate the probability of the positive class. This function is differentiable, which is essential for optimization via gradient descent.


**The softmax function** converts a vector of raw scores $\mathbf{z} \in \mathbb{R}^K$ into probabilities:

$$
\text{softmax}(z_k) = \hat{p}_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} \quad \text{for } k = 1, \dots, K
$$

It ensures that output probabilities are positive and sum to 1, making it suitable for use with cross-entropy.



The **sigmoid function is a special case of the softmax** function when you have **two classes (binary classification)**.
Given two logits $z_0$ and $z_1$, the softmax for class 1 is:

$$
\text{softmax}(z_1) = \frac{e^{z_1}}{e^{z_0} + e^{z_1}}
$$

If we define $z = z_1 - z_0$, then:

$$
\text{softmax}(z_1) = \frac{1}{1 + e^{-(z_1 - z_0)}} = \sigma(z)
$$

So:  **Sigmoid = Softmax over 2 logits, if one logit is fixed as 0 (reference class)**
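
A short numerical check of this identity, as a sketch with randomly drawn logits (the helper names `sigmoid` and `softmax` and the data are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Five samples with two logits each: columns are z_0 and z_1
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 2))

# Probability of class 1 computed both ways
p_softmax = softmax(logits)[:, 1]
p_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])
print(np.allclose(p_softmax, p_sigmoid))
```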

Intuition

- **Sigmoid** gives the probability of one class (positive class) vs. its complement.
- **Softmax** generalizes this to $K > 2$ classes, ensuring the sum of probabilities is 1.



**Summary**

| Function   | Use Case                         | Output                         |
|------------|----------------------------------|---------------------------------|
| Sigmoid | Binary classification         | Scalar in $(0, 1)$          |
| Softmax | Multiclass classification     | Vector of probabilities summing to 1 over classes |


- Sigmoid and softmax both map logits to probabilities over classes, but are used in different classification settings.
- The **sigmoid function is equivalent to a 2-class softmax**.
- Softmax ensures a **normalized probability distribution** over multiple classes.

In [None]:
def sigmoid(x): return 1 / (1 + np.exp(-x))

x = np.linspace(-6, 6, 100)
plt.plot(x, sigmoid(x))
plt.grid(True)
plt.title('Sigmoid (Logistic)')

### Loss Functions for Classification

#### Negative Log-Likelihood (NLL) {#test-nll}

The Negative Log-Likelihood (NLL) is defined for a dataset of observations $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where each $\mathbf{x}_i$ is an input and $y_i$ is the corresponding label.

**General Formulation of NLL**

Given a **probabilistic model** that predicts $P(y_i \mid x_i; \theta)$, where $\theta$ are the model parameters (e.g., weights in a neural network or logistic regression), the **Negative Log-Likelihood** over the dataset is:

$$
\mathcal{L}_{\text{NLL}}(\theta) = - \sum_{i=1}^n \log P(y_i \mid x_i; \theta)
$$


This expression encourages the model to assign **high probability** to the **correct labels**.


**Binary Classification (Sigmoid Output) a.k.a. Logistic or Binary Cross-Entropy Loss**

For binary labels $y_i \in \{0, 1\}$, and using:

$$
P(y_i = 1 \mid x_i; \theta) = \hat{p}_i = \sigma(z_i)
$$

where $\hat{p}_i$ is the predicted probability of the positive class (obtained via a sigmoid activation, $\sigma(z_i)$). The NLL becomes:

$$
\mathcal{L}_{\text{NLL}} = -\log \prod_{i=1}^n \left[\hat{p}_i^{y_i} (1 - \hat{p}_i)^{(1 - y_i)} \right]
$$

$$
\mathcal{L}_{\text{NLL}} = - \sum_{i=1}^n \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]
$$

or 

$$
\mathcal{L}_{\text{NLL}}(\mathbf{w}) = \sum_{i=1}^n \left[ \log(1 + e^{-z_i}) + (1 - y_i) z_i \right] \quad \text{with } y_i \in \{0, 1\}.
$$

Recoding output $y_i \in \{-1, +1\}$, the expression simplifies to

$$
\boxed{\mathcal{L}_{\text{NLL}}(\mathbf{w}) = \sum_{i=1}^n \log\left(1 + e^{-y_i \cdot z_i} \right)} \quad \text{with } y_i \in \{-1, +1\}.
$$

For linear models: $z_i=\mathbf{w}^\top \mathbf{x}_i$. For more details, see the 
[Demonstration of Negative Log-Likelihood (NLL)](ref:demonstration-nll) Loss.

This expression is also known as the **logistic loss** or **binary cross-entropy**. It penalizes confident but incorrect predictions very heavily (e.g., predicting $\hat{p} = 0.99$ when $y = 0$).
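
The three expressions of the binary NLL given above are equal term by term. A minimal numerical check, assuming random scores and labels (all names are illustrative):

```python
import numpy as np

# Random scores z_i and 0/1 labels (assumption, for illustration only)
rng = np.random.default_rng(0)
z = rng.normal(scale=2.0, size=20)
y01 = rng.integers(0, 2, size=20)

# Form 1: binary cross-entropy with p_hat = sigmoid(z)
p_hat = 1.0 / (1.0 + np.exp(-z))
nll_ce = -np.sum(y01 * np.log(p_hat) + (1 - y01) * np.log(1 - p_hat))

# Form 2: expressed with the scores z, still with 0/1 labels
nll_z = np.sum(np.log1p(np.exp(-z)) + (1 - y01) * z)

# Form 3: logistic loss with labels recoded in {-1, +1}
ypm = 2 * y01 - 1
nll_pm = np.sum(np.log1p(np.exp(-ypm * z)))

print(nll_ce, nll_z, nll_pm)
```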




In [None]:
# Define the logistic loss for binary classification
def logistic_loss(z, y):
    return np.log(1 + np.exp(-y * z))


# Input range (raw scores from the model)
z = np.linspace(-10, 10, 400)

# Logistic loss for y = 1 and y = 0
loss_y1 = logistic_loss(z, 1)
loss_y0 = logistic_loss(z, -1)  # equivalent to logistic loss for y=0

# Plotting
plt.figure()
plt.plot(z, loss_y1, label="Logistic Loss (y = 1)")
plt.plot(z, loss_y0, label="Logistic Loss (y = 0)", linestyle="--")
plt.title("Logistic Loss as a Function of Raw Score z")
plt.xlabel("Model score z")
plt.ylabel("Loss")
plt.legend()
plt.show()

**Gradient of the Logistic (Binary Cross-Entropy) Loss for Linear Models**

For linear models where $z_i=\mathbf{w}^\top \mathbf{x}_i$, and with labels $y_i \in \{0, 1\}$, the gradient of the NLL with respect to the coefficient vector $\mathbf{w}$ is:

  $$
  \boxed{\nabla_{\mathbf{w}} \mathcal{L}_{\text{NLL}} = \sum_{i=1}^n (\sigma(\mathbf{w}^\top \mathbf{x}_i) - y_i)\mathbf{x}_i}
  $$
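
This gradient can be compared with a finite-difference approximation. The sketch below assumes random data with 0/1 labels; the helper names `nll` and `nll_grad` are hypothetical:

```python
import numpy as np

# Random design matrix, 0/1 labels and weight vector (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = rng.integers(0, 2, size=30)
w = rng.normal(size=3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w):
    """Binary NLL written with the scores z_i = w^T x_i and 0/1 labels."""
    z = X @ w
    return np.sum(np.log1p(np.exp(-z)) + (1 - y) * z)

def nll_grad(w):
    """Analytical gradient: sum_i (sigmoid(w^T x_i) - y_i) x_i."""
    return X.T @ (sigmoid(X @ w) - y)

# Central finite differences along each coordinate axis
eps = 1e-6
numeric_grad = np.array([(nll(w + eps * e) - nll(w - eps * e)) / (2 * eps)
                         for e in np.eye(3)])
print(np.max(np.abs(numeric_grad - nll_grad(w))))
```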
  
**Multiclass Classification (Softmax Output) a.k.a. Cross-Entropy Loss**

Assume $y_i \in \{1, 2, \dots, K\}$, and the model outputs softmax probabilities $\hat{p}_{i,k} = P(y_i = k \mid x_i; \theta)$. Then:

$$
\mathcal{L}_{\text{NLL}} = - \sum_{i=1}^n \log \hat{p}_{i, y_i}
$$

If you use one-hot encoded labels $\mathbf{y}_i$, the NLL is also written as:

$$
\mathcal{L}_{\text{NLL}} = - \sum_{i=1}^n \sum_{k=1}^K y_{i,k} \log(\hat{p}_{i,k})
$$

This is equivalent to the **cross-entropy loss** when combined with softmax.
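
A small check that the index form and the one-hot form of the multiclass NLL coincide. This is a sketch on random logits; labels are coded $0, \dots, K-1$ here for NumPy indexing, which is an assumption of the example:

```python
import numpy as np

# Random logits for n samples and K classes (assumption)
rng = np.random.default_rng(0)
n, K = 6, 3
logits = rng.normal(size=(n, K))
y = rng.integers(0, K, size=n)  # labels coded 0..K-1

# Softmax probabilities p_hat[i, k]
shifted = logits - logits.max(axis=1, keepdims=True)
p_hat = np.exp(shifted)
p_hat /= p_hat.sum(axis=1, keepdims=True)

# Index form: pick the probability of the true class of each sample
nll_index = -np.sum(np.log(p_hat[np.arange(n), y]))

# One-hot form: sum over classes weighted by the one-hot labels
Y_onehot = np.eye(K)[y]
nll_onehot = -np.sum(Y_onehot * np.log(p_hat))

print(nll_index, nll_onehot)
```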

**Summary**

- Negative Log-Likelihood = general loss for probabilistic models
- Logistic loss = binary cross-entropy
- Cross-Entropy loss = NLL + Softmax (for multiclass problems)
- These losses are convex (for linear models), interpretable, and widely used in training classifiers like logistic regression and neural networks.

See also [Scikit learn doc](https://scikit-learn.org/stable/modules/sgd.html#mathematical-formulation)

### Hinge loss or $\ell_1$ loss

TODO

### Logistic Regression Summary

Logistic regression minimizes the Cross-Entropy loss i.e. NLL with Sigmoid (binary problems) or NLL with Softmax (multiclass problems):

$$
\boxed{\min_{\mathbf{w}}~\text{Logistic}(\mathbf{w}) = \mathcal{L}_{\text{NLL}}(\mathbf{w})}
$$

### Logistic Regression with Scikit-Learn

In [None]:
import sklearn.linear_model as lm

# Unpenalized logistic regression (penalty=None), fitted once on X, y
logreg = lm.LogisticRegression(penalty=None).fit(X, y)
y_pred_logreg = logreg.predict(X)

errors =  y_pred_logreg != y
print("Nb errors=%i, error rate=%.2f" % 
      (errors.sum(), errors.sum() / len(y_pred_logreg)))
print(logreg.coef_.round(2))

## Regularization using penalization of coefficients

The penalties used in regression are also used in classification. The only difference is the loss function, generally the negative log-likelihood (cross-entropy) or the hinge loss. We will explore:

### Summary

- Ridge (also called $\ell_2$) penalty: $\|\mathbf{w}\|_2^2$. It shrinks coefficients toward 0.
- Lasso (also called $\ell_1$) penalty: $\|\mathbf{w}\|_1$. It performs feature selection by setting some coefficients to 0.
- ElasticNet (also called $\ell_1\ell_2$) penalty: $\alpha \left(\rho~\|\mathbf{w}\|_1 + (1-\rho)~\|\mathbf{w}\|_2^2 \right)$. It selects groups of correlated features by setting some coefficients to 0.

In [None]:
# Dataset with some correlation
X, y = make_classification(n_samples=100, n_features=10,
                           n_informative=5, n_redundant=3,
                           n_classes=2, random_state=3, shuffle=False)

# Logistic Regression unpenalized
lr = lm.LogisticRegression(penalty=None).fit(X, y)

# Logistic Regression with L2 penalty
l2 = lm.LogisticRegression(penalty='l2', C=.1).fit(X, y)  # lambda = 1 / C!

# Logistic Regression with L1 penalty
# use solver 'saga' to handle L1 penalty. lambda = 1 / C
l1 = lm.LogisticRegression(penalty='l1', C=.1, solver='saga').fit(X, y)

# Logistic Regression with L1/L2 penalties
l1l2 = lm.LogisticRegression(penalty='elasticnet',  C=.1, l1_ratio=0.5,
                             solver='saga').fit(X, y)  # lambda = 1 / C!


print(pd.DataFrame(np.vstack((lr.coef_, l2.coef_, l1.coef_, l1l2.coef_)),
             index=['lr', 'l2', 'l1', 'l1l2']).round(2))

### Understanding the effect of the penalty using $\ell_2$-regularized Fisher's linear classification

When the matrix $\mathbf{S_W}$ is not full rank, e.g., when $P \gg N$, the estimate of Fisher's most discriminant projection is not unique. This can be solved using a biased version of $\mathbf{S_W}$:
$$
\mathbf{S_W}^{Ridge} = \mathbf{S_W} + \lambda \mathbf{I}
$$

where $\mathbf{I}$ is the $P \times P$ identity matrix. This leads to the regularized (ridge) estimator of Fisher's linear discriminant:
$$
    \mathbf{w}^{Ridge} \propto (\mathbf{S_W} + \lambda \mathbf{I})^{-1}(\mathbf{\mu_1} - \mathbf{\mu_0})
$$

![The Ridge Fisher most discriminant projection](images/ridge_fisher_linear_disc.png){width=15cm}

Increasing $\lambda$ will:

- Shrink the coefficients toward zero.
- Make the regularized covariance converge toward a diagonal matrix, reducing the contribution of the pairwise covariances.
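
Both effects can be observed on a small simulation. The sketch below assumes synthetic data with more features than samples ($P = 50$, $N = 20$), so that $\mathbf{S_W}$ is singular; the values of $\lambda$ and all names are illustrative:

```python
import numpy as np

# Synthetic high-dimensional data (assumption): 10 samples per class, 50 features
rng = np.random.default_rng(0)
n_per_class, P = 10, 50
X0 = rng.normal(size=(n_per_class, P))
X1 = rng.normal(size=(n_per_class, P)) + 0.5
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter: its rank is at most N - 2 = 18 < P, so it is not invertible
Xc = np.vstack([X0 - mu0, X1 - mu1])
S_W = Xc.T @ Xc
rank = np.linalg.matrix_rank(S_W)

d = mu1 - mu0
lambdas = [0.1, 10.0, 1e4]
norms, cosines = [], []
for lam in lambdas:
    # Ridge Fisher direction: (S_W + lambda I)^{-1} (mu1 - mu0)
    w = np.linalg.solve(S_W + lam * np.eye(P), d)
    norms.append(np.linalg.norm(w))
    cosines.append(w @ d / (np.linalg.norm(w) * np.linalg.norm(d)))

print("rank(S_W) =", rank, "for P =", P)
print("||w|| for increasing lambda:", np.round(norms, 4))
print("cosine(w, mu1 - mu0):", np.round(cosines, 4))
```

As $\lambda$ grows, the norm of $\mathbf{w}^{Ridge}$ decreases and its direction approaches the naive direction $\mathbf{\mu_1} - \mathbf{\mu_0}$, since $\mathbf{S_W} + \lambda \mathbf{I}$ is then dominated by its diagonal part.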

## $\ell_2$-regularized logistic regression

The **objective function** to be minimized is now the combination of the logistic loss (Negative Log Likelihood) with a penalty of the L2 norm of the weights vector. In the two-class case, using the 0/1 coding we obtain:

$$
\min_{\mathbf{w}}~\text{Logistic L2}(\mathbf{w}) = \mathcal{L}_{\text{NLL}}(\mathbf{w})
 + \lambda~\|\mathbf{w}\|_2^2
$$
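
A hedged sketch of the shrinkage effect with scikit-learn, assuming a synthetic dataset from `make_classification` and an arbitrary grid of `C` values. According to the scikit-learn user guide, the L2-penalized objective is written as $\frac{1}{2}\|\mathbf{w}\|_2^2 + C~\mathcal{L}_{\text{NLL}}(\mathbf{w})$, so $\lambda$ in the formula above corresponds to $1/(2C)$: a smaller `C` means a stronger penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic dataset with some redundant features (assumption)
X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=3, n_classes=2, random_state=3)

# Fit L2-penalized logistic regression (the default penalty) for several C
Cs = [0.01, 0.1, 1.0, 10.0]
norms = []
for C in Cs:
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms.append(np.linalg.norm(model.coef_))

# The norm of the coefficient vector grows as the penalty gets weaker
for C, norm in zip(Cs, norms):
    print("C=%6.2f  ||w||_2=%.3f" % (C, norm))
```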

In [None]:
import sklearn.linear_model as lm
lrl2 = lm.LogisticRegression(penalty='l2', C=.1)
# This class implements regularized logistic regression.
# C is the Inverse of regularization strength. Large value => no regularization.

lrl2.fit(X, y)
y_pred_l2 = lrl2.predict(X)
prob_pred_l2 = lrl2.predict_proba(X)

print("Probas of 5 first samples for class 0 and class 1:")
print(prob_pred_l2[:5, :].round(2))

print("Coef vector:")
print(lrl2.coef_.round(2))

# Retrieve proba from coef vector
probas = 1 / (1 + np.exp(- (np.dot(X, lrl2.coef_.T) + lrl2.intercept_))).ravel()
print("Diff", np.max(np.abs(prob_pred_l2[:, 1] - probas)))

errors =  y_pred_l2 != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y)))

## Lasso logistic regression ($\ell_1$-regularization)

The **objective function** to be minimized is now the combination of the logistic loss (Negative Log Likelihood) with a penalty of the L1 norm of the weights vector. In the two-class case, using the 0/1 coding we obtain:

$$
\min_{\mathbf{w}}~\text{Logistic Lasso}(\mathbf{w}) = \mathcal{L}_{\text{NLL}}(\mathbf{w})
 + \lambda~\|\mathbf{w}\|_1
$$
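
To illustrate how the $\ell_1$ penalty produces exact zeros, here is a sketch that minimizes this objective with proximal gradient descent (ISTA, soft-thresholding). It is not the solver used by scikit-learn; the data, the true weights, the iteration count and the function names are all assumptions of the example:

```python
import numpy as np

# Synthetic data (assumption): only the first 3 of 10 features are informative
rng = np.random.default_rng(0)
n, P = 100, 10
X = rng.normal(size=(n, P))
w_true = np.zeros(P)
w_true[:3] = [2.0, -2.0, 1.0]
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink toward 0 and clip at 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_logistic(lam, n_iter=2000):
    """Minimize NLL(w) + lam * ||w||_1 by proximal gradient descent."""
    # Step 1/L, where L = ||X||_2^2 / 4 bounds the curvature of the NLL
    step = 4.0 / np.linalg.norm(X, ord=2) ** 2
    w = np.zeros(P)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Above lam_max = ||X^T (1/2 - y)||_inf, w = 0 is the solution
lam_max = np.max(np.abs(X.T @ (0.5 - y)))
w_small = lasso_logistic(lam=1.0)
w_large = lasso_logistic(lam=1.01 * lam_max)

print("Non-zero coefficients (lam=1):", np.count_nonzero(w_small))
print("Non-zero coefficients (lam>lam_max):", np.count_nonzero(w_large))
```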

In [None]:
import sklearn.linear_model as lm
lrl1 = lm.LogisticRegression(penalty='l1', C=.1, solver='saga') # lambda = 1 / C!

# This class implements regularized logistic regression. C is the Inverse of regularization strength.
# Large value => no regularization.

lrl1.fit(X, y)
y_pred_lrl1 = lrl1.predict(X)

errors =  y_pred_lrl1 != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_lrl1)))

print("Coef vector:")
print(lrl1.coef_.round(2))

## Linear Support Vector Machine ($\ell_2$-regularization with Hinge loss)

Support Vector Machines seek a separating hyperplane with maximum margin to enforce robustness against noise. Like logistic regression, it is a **discriminative method** that focuses only on predictions.

Here we present the non-separable case of maximum margin classifiers with $\pm 1$ coding (i.e., $y_i \in \{-1, +1\}$).

![Linear large margin classifiers](images/svm.png)

Linear SVM for classification (also called SVM-C or SVC) minimizes:

$$
\min_{\mathbf{w}}~\text{Linear SVM}(\mathbf{w}) = \|\mathbf{w}\|_2^2 +  C~\sum_i^n\max(0, 1 - y_i~ (\mathbf{w} \cdot \mathbf{x_i}))
$$
where $\max(0, 1 - y_i~ (\mathbf{w} \cdot \mathbf{x_i}))$ is the **hinge loss**.


The hinge loss is equivalent to introducing slack variables $\xi_i$, with $\xi_i = 0$ for points that are on or inside the correct margin boundary and $\xi_i = |y_i - (\mathbf{w} \cdot \mathbf{x_i})|$ for other points, i.e., $\xi_i = \max(0, 1 - y_i (\mathbf{w} \cdot \mathbf{x_i}))$. Thus:

1. If $y_i (\mathbf{w} \cdot \mathbf{x_i}) \geq 1$ then the point lies outside the margin, on the correct side of the decision boundary. In this case the constraint is not active for this point. It does not contribute to the prediction.

2. If $1 > y_i (\mathbf{w} \cdot \mathbf{x_i}) \geq 0$ then the point lies inside the margin and on the correct side of the decision boundary. In this case the constraint is active for this point. It does contribute to the prediction as a support vector.

3. If $y_i (\mathbf{w} \cdot \mathbf{x_i}) < 0$ then the point is on the wrong side of the decision boundary (misclassification). In this case the constraint is active for this point. It does contribute to the prediction as a support vector.


So linear SVM is close to ridge logistic regression, using the hinge loss instead of the logistic loss. Both will provide very similar predictions.
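
A small sketch comparing the two losses as functions of the margin $y_i (\mathbf{w} \cdot \mathbf{x_i})$, with $\pm 1$ coding. The margin values are hand-picked to cover the three cases listed above; variable names are illustrative:

```python
import numpy as np

# Hand-picked margins y_i * (w . x_i) (assumption, for illustration)
margin = np.array([2.0, 1.0, 0.5, 0.0, -0.5])

# Hinge loss: zero as soon as the margin is at least 1
hinge = np.maximum(0.0, 1.0 - margin)

# Logistic loss: always positive, decreasing smoothly with the margin
logistic = np.log1p(np.exp(-margin))

# The three cases discussed above
outside_margin = margin >= 1                  # case 1: zero hinge loss
inside_margin = (margin >= 0) & (margin < 1)  # case 2: support vector
wrong_side = margin < 0                       # case 3: misclassified

for m, h, l in zip(margin, hinge, logistic):
    print("margin=%5.2f  hinge=%.2f  logistic=%.2f" % (m, h, l))
```

Only points with a non-zero hinge loss (cases 2 and 3) influence the SVM solution, whereas every point contributes to the logistic loss.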

In [None]:
from sklearn import svm

svmlin = svm.LinearSVC(C=.1)
# Remark: by default LinearSVC uses squared_hinge as loss
svmlin.fit(X, y)
y_pred_svmlin = svmlin.predict(X)

errors =  y_pred_svmlin != y
print("Nb errors=%i, error rate=%.2f" %
      (errors.sum(), errors.sum() / len(y_pred_svmlin)))
print("Coef vector:")
print(svmlin.coef_.round(2))

## Lasso linear Support Vector Machine ($\ell_1$-regularization)

Linear SVM with l1-regularization:

$$
\min_{\mathbf{w}}~\text{Lasso Linear SVM}(\mathbf{w}) = \|\mathbf{w}\|_1 +  C~\sum_i^n\max(0, 1 - y_i~ (\mathbf{w} \cdot \mathbf{x_i}))
$$

In [None]:
from sklearn import svm

svmlinl1 = svm.LinearSVC(penalty='l1', dual=False)
# Remark: by default LinearSVC uses squared_hinge as loss

svmlinl1.fit(X, y)
y_pred_svmlinl1 = svmlinl1.predict(X)

errors =  y_pred_svmlinl1 != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_svmlinl1)))
print("Coef vector:")
print(svmlinl1.coef_.round(2))

## Elastic-net classification ($\ell_1\ell_2$-regularization)

The **objective function** to be minimized is now a loss combined with both L1 and L2 penalties:

- For the NLL loss:
$$
\min~\text{Logistic Enet}(\mathbf{w}) = \mathcal{L}_{\text{NLL}}(\mathbf{w}) + \alpha~\left(\rho~\|\mathbf{w}\|_1 + (1-\rho)~\|\mathbf{w}\|_2^2 \right)
$$

- For the Hinge loss
$$
\min~\text{Hinge Enet}(\mathbf{w}) = \text{Hinge loss}(\mathbf{w}) + \alpha~\left(\rho~\|\mathbf{w}\|_1 + (1-\rho)~\|\mathbf{w}\|_2^2 \right)
$$

In [None]:
# Use SGD solver
enetlog = lm.SGDClassifier(loss="log_loss", penalty="elasticnet",
                            alpha=0.1, l1_ratio=0.5, random_state=42)
enetlog.fit(X, y)

# Or saga solver:
# enetloglike = lm.LogisticRegression(penalty='elasticnet',
#                                    C=.1, l1_ratio=0.5, solver='saga')

enethinge = lm.SGDClassifier(loss="hinge", penalty="elasticnet",
                            alpha=0.1, l1_ratio=0.5, random_state=42)
enethinge.fit(X, y)

print("Hinge loss and logistic loss provide almost the same predictions.")
print("Confusion matrix")
# Print explicitly: a bare expression in the middle of a cell is not displayed
print(metrics.confusion_matrix(enetlog.predict(X), enethinge.predict(X)))

print("Decision_function log x hinge losses:")
_ = plt.plot(enetlog.decision_function(X),
             enethinge.decision_function(X), "o")

## Classification performance evaluation metrics

[Wikipedia Sensitivity and specificity](https://en.wikipedia.org/wiki/Sensitivity_and_specificity)

Imagine a study evaluating a new test that screens people for a disease. Each person taking the test either has or does not have the disease. The test outcome can be positive (classifying the person as having the disease) or negative (classifying the person as not having the disease). The test results for each subject may or may not match the subject's actual status. In that setting:

- True positive (TP): Sick people correctly identified as sick

- False positive (FP): Healthy people incorrectly identified as sick

- True negative (TN): Healthy people correctly identified as healthy

- False negative (FN): Sick people incorrectly identified as healthy

- **Accuracy** (ACC):

    ACC = (TP + TN) / (TP + FP + FN + TN)


- **Sensitivity** (SEN) or **recall** of the positive class or true positive rate (TPR) or hit rate:

    SEN = TP / P = TP / (TP+FN)


- **Specificity** (SPC) or **recall** of the negative class or true negative rate:

    SPC = TN / N = TN / (TN+FP) 


- **Precision** or positive predictive value (PPV):

    PPV = TP / (TP + FP)


- **Balanced accuracy** (bACC): a useful performance measure that avoids inflated performance estimates on imbalanced datasets (Brodersen et al. (2010), "The balanced accuracy and its posterior distribution"). It is defined as the arithmetic mean of sensitivity and specificity, or the average accuracy obtained on either class:

    bACC = 1/2 * (SEN + SPC) 

- **F1 score** (or F-score): the harmonic mean of precision and recall, useful for dealing with imbalanced datasets:

    F1 = 2 * PPV * SEN / (PPV + SEN)
    
The four outcomes can be formulated in a 2×2 contingency table or [confusion matrix](https://en.wikipedia.org/wiki/Sensitivity_and_specificity).

For more details, see the [Scikit-learn doc](http://scikit-learn.org/stable/modules/model_evaluation.html).
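
The definitions above can be checked directly from the four counts of the confusion matrix. The following is a minimal sketch on hypothetical labels (the arrays `y_true` and `y_pred` are made up for illustration), cross-checked against the corresponding scikit-learn functions:

```python
import numpy as np
from sklearn import metrics

# Hypothetical imbalanced example: 6 negatives, 2 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0])

# For binary labels {0, 1}, ravel() returns the counts in this order
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

acc = (tp + tn) / (tp + fp + fn + tn)
sen = tp / (tp + fn)               # recall of class 1
spc = tn / (tn + fp)               # recall of class 0
ppv = tp / (tp + fp)               # precision
bacc = (sen + spc) / 2
f1 = 2 * ppv * sen / (ppv + sen)   # harmonic mean of precision and recall

print("ACC=%.3f SEN=%.3f SPC=%.3f PPV=%.3f bACC=%.3f F1=%.3f"
      % (acc, sen, spc, ppv, bacc, f1))

# Cross-check against scikit-learn
print(np.isclose(bacc, metrics.balanced_accuracy_score(y_true, y_pred)))
print(np.isclose(f1, metrics.f1_score(y_true, y_pred)))
```

On this example the accuracy is pulled up by the majority class, while the balanced accuracy and F1 score reflect the weaker performance on the minority class.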

In [None]:
from sklearn import metrics
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]

print("ACC:", metrics.accuracy_score(y_true, y_pred))

# Precision and recall of the positive class (label 1)
print("Precision:", metrics.precision_score(y_true, y_pred))
print("Recall:", metrics.recall_score(y_true, y_pred))

# Recalls on individual classes: SPC & SEN
recalls = metrics.recall_score(y_true, y_pred, average=None)
print("SPC (recall of class 0):", recalls[0])
print("SEN (recall of class 1):", recalls[1])

# Balanced accuracy
b_acc = recalls.mean()
print("bACC:", b_acc)

# Precision, recall, F-score and support for each individual class
p, r, f, s = metrics.precision_recall_fscore_support(y_true, y_pred)

### Area Under Curve (AUC) of Receiver operating characteristic (ROC)

A classifier may have found a good discriminative projection $w$. However, if the threshold used to decide the final predicted class is poorly adjusted, the performance will show a high specificity and a low sensitivity, or the contrary.

In this case, it is recommended to use the AUC of a ROC analysis, which measures the overlap of the two classes when points are projected onto the discriminative axis. See [Wikipedia: ROC and AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic).
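
One way to read the AUC as a measure of overlap: it equals the proportion of (negative, positive) pairs in which the positive sample receives the higher score, ties counting for one half. The sketch below checks this on simulated scores; the data and the variable names (`score`, `pairwise`) are assumptions made for illustration.

```python
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(0)
y_true = np.array([0] * 50 + [1] * 50)
# Hypothetical scores: class 1 is shifted by one standard deviation
score = rng.normal(loc=y_true.astype(float), scale=1.0)

auc = metrics.roc_auc_score(y_true, score)

# Fraction of (negative, positive) pairs ranked in the right order;
# ties count for one half
neg, pos = score[y_true == 0], score[y_true == 1]
diff = pos[:, None] - neg[None, :]
pairwise = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

print("AUC=%.3f, pairwise ranking=%.3f" % (auc, pairwise))
```

Because only the ranking of the scores matters, the AUC does not depend on the decision threshold.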

In [None]:
score_pred = np.array([.1, .2, .3, .4, .5, .6, .7, .8])
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
thres = .9
y_pred = (score_pred > thres).astype(int)

print("With a threshold of %.2f, the rule always predicts 0. Predictions:" % thres)
print(y_pred)
print("ACC:", metrics.accuracy_score(y_true, y_pred))

# Recall on each individual class
r = metrics.recall_score(y_true, y_pred, average=None)
print("Recalls on individual classes are:", r,
      "i.e., 100% specificity, 0% sensitivity")

# However, AUC=1 indicates a perfect separation of the two classes
auc = metrics.roc_auc_score(y_true, score_pred)
print("But the AUC of %.2f demonstrates a good separation of the classes." % auc)

## Imbalanced classes

Learning with discriminative methods (logistic regression, SVM) is generally based on minimizing the misclassification of training samples, which may be unsuitable for imbalanced datasets, where recognition may be biased in favor of the most numerous class. This problem can be addressed with a generative approach, which typically requires more parameters to be estimated, leading to reduced performance in high dimensions.

Imbalanced classes can be handled in three main ways (see Japkowicz and Stephen (2002) for a review): resampling, reweighting and one-class learning.

In **sampling strategies**, either the minority class is oversampled, or the majority class is undersampled, or some combination of the two is deployed. Undersampling the majority class (Zhang and Mani, 2003) makes poor use of the left-out samples. Sometimes one cannot afford such a strategy, since the sample size is already small even for the majority class.
Informed oversampling, which goes beyond a trivial duplication of minority class samples, requires estimating the class-conditional distributions in order to generate synthetic samples; generative models are required here. An alternative, proposed in (Chawla et al., 2002), generates samples along the line segments joining any/all of the k nearest neighbors within the minority class. Such a procedure blindly generalizes the minority area without regard to the majority class, which may be particularly problematic with high-dimensional and potentially skewed class distributions.
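
The line-segment idea can be sketched in a few lines. This is a simplified illustration, not the reference implementation of (Chawla et al., 2002): the function name `interpolate_minority`, its parameters and the simulated data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def interpolate_minority(X_min, n_new, k=5, random_state=0):
    """Generate n_new synthetic points on segments joining randomly
    chosen minority points to one of their k nearest minority neighbors."""
    rng = np.random.default_rng(random_state)
    # Ask for k + 1 neighbors: each training point is returned as its
    # own nearest neighbor (assuming no duplicated points)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = neigh_idx[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along each segment, in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])


rng = np.random.default_rng(42)
X_min = rng.normal(size=(12, 2))  # hypothetical minority class samples
X_new = interpolate_minority(X_min, n_new=50, k=3)
print(X_new.shape)
```

Note that the majority class never appears in this procedure, which is exactly the limitation discussed above.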

**Reweighting**, also called cost-sensitive learning, works at an algorithmic level by adjusting the costs of the various classes to counter the class imbalance. Such reweighting can be implemented within SVM (Chang and Lin, 2001) or logistic regression (Friedman et al., 2010) classifiers. Most Scikit-learn classifiers offer such reweighting.

The ``class_weight`` parameter can be set to ``"balanced"``, which uses the values of $y$ to automatically set weights inversely proportional to class frequencies in the input data: $N / (K N_k)$, where $N$ is the number of samples, $K$ the number of classes and $N_k$ the number of samples in class $k$, i.e., $N / (2 N_k)$ for two classes.
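
As a quick check of this formula, the sketch below compares a manual computation with scikit-learn's `compute_class_weight` helper on hypothetical labels (12 samples of class 0 and 250 of class 1, chosen for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels
y = np.array([0] * 12 + [1] * 250)
classes = np.unique(y)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Manual computation: N / (K * N_k)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))
print(np.allclose(weights, manual))
```

With these weights, each class contributes the same total weight to the loss, whatever its size.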

In [None]:
# dataset
X, y = make_classification(n_samples=500,
                           n_features=5,
                           n_informative=2,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=1,
                           shuffle=False)

print(*["#samples of class %i = %i;" % (lev, np.sum(y == lev))
      for lev in np.unique(y)])

print('# No reweighting on balanced dataset')
lr_inter = lm.LogisticRegression(C=1)
lr_inter.fit(X, y)
p, r, f, s = metrics.precision_recall_fscore_support(y, lr_inter.predict(X))
print("SPC: %.3f; SEN: %.3f" % tuple(r))
print('# => The predictions are balanced in sensitivity and specificity\n')

# Create imbalanced dataset by subsampling class 0: keep only 5% (1/20) of
#  class 0's samples and all class 1's samples.
n0 = int(np.rint(np.sum(y == 0) / 20))
subsample_idx = np.concatenate((np.where(y == 0)[0][:n0], np.where(y == 1)[0]))
Ximb = X[subsample_idx, :]
yimb = y[subsample_idx]
print(*["#samples of class %i = %i;" % (lev, np.sum(yimb == lev)) for lev in
        np.unique(yimb)])

print('# No Reweighting on imbalanced dataset')
lr_inter = lm.LogisticRegression(C=1)
lr_inter.fit(Ximb, yimb)
p, r, f, s = metrics.precision_recall_fscore_support(
    yimb, lr_inter.predict(Ximb))
print("SPC: %.3f; SEN: %.3f" % tuple(r))
print('# => Sensitivity >> specificity\n')

print('# Reweighting on imbalanced dataset')
lr_inter_reweight = lm.LogisticRegression(C=1, class_weight="balanced")
lr_inter_reweight.fit(Ximb, yimb)
p, r, f, s = metrics.precision_recall_fscore_support(yimb,
                                                     lr_inter_reweight.predict(Ximb))
print("SPC: %.3f; SEN: %.3f" % tuple(r))
print('# => The predictions are balanced in sensitivity and specificity\n')

## Confidence interval cross-validation

Confidence interval (CI) of the classification accuracy measured by cross-validation:
![CI classification](images/classif_accuracy_95ci_sizes.png)
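
A rough way to obtain such intervals is to treat each test prediction as an independent Bernoulli trial and compute an exact binomial CI. This is a sketch under that assumption (cross-validation folds share training samples, so the interval is only approximate); the accuracy of 0.70 and the sample sizes are hypothetical.

```python
from scipy import stats

acc = 0.70  # hypothetical observed accuracy
widths = []
for n in [20, 50, 100, 500, 1000]:
    k = int(round(acc * n))  # number of correct predictions
    ci = stats.binomtest(k=k, n=n).proportion_ci(
        confidence_level=0.95, method='exact')
    widths.append(ci.high - ci.low)
    print("N=%4i: ACC=%.2f, 95%% CI=[%.3f, %.3f]" % (n, k / n, ci.low, ci.high))
```

The interval narrows as the number of observations grows, which is why accuracies measured on small samples should be reported with their CI.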


## Significance of classification metrics

**P-value of classification accuracy:** Compare the number of correct classifications (= accuracy $\times N$) to the null hypothesis of a Binomial distribution with parameters $p$ (typically the 50% chance level) and $N$ (number of observations). Is 60% accuracy a significant prediction rate among 50 observations?
The exact binomial test is **two-sided** by default. Since we only test whether the accuracy is above the chance level, a one-sided test is appropriate; with $p = 0.5$ and an observed accuracy above chance, this amounts to dividing the two-sided p-value by two.
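
The question above (60% accuracy among 50 observations) can be answered with a short sketch using `scipy.stats.binomtest`; the variable names are chosen for illustration:

```python
from scipy import stats

n_obs = 50
k_correct = 30  # 60% accuracy among 50 observations

one_sided = stats.binomtest(k=k_correct, n=n_obs, p=0.5,
                            alternative='greater')
two_sided = stats.binomtest(k=k_correct, n=n_obs, p=0.5,
                            alternative='two-sided')

print("one-sided p=%.3f, two-sided p=%.3f"
      % (one_sided.pvalue, two_sided.pvalue))
```

The one-sided p-value is about 0.10, so 60% accuracy on 50 observations is not significantly above chance at the 5% level.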

**P-value of ROC-AUC:** ROC-AUC measures the ranking of the two classes. Therefore, a non-parametric test can be used to assess the significance of the separation between classes. Mason and Graham (RMetS, 2002) show that the ROC area is equivalent to the Mann–Whitney U-statistic.
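
The equivalence can be verified numerically: dividing the U-statistic of the positive class by the number of (positive, negative) pairs gives the AUC. The sketch below uses simulated scores (an assumption for illustration) and relies on `scipy.stats.mannwhitneyu` returning the statistic of its first sample, as in recent SciPy versions.

```python
import numpy as np
from scipy import stats
from sklearn import metrics

rng = np.random.default_rng(0)
y_true = np.array([0] * 30 + [1] * 20)
# Hypothetical scores: class 1 is shifted upward
score = rng.normal(loc=0.8 * y_true, scale=1.0)

pos, neg = score[y_true == 1], score[y_true == 0]

# One-sided test: are the scores of class 1 larger than those of class 0?
res = stats.mannwhitneyu(pos, neg, alternative='greater')
auc_from_u = res.statistic / (len(pos) * len(neg))
auc = metrics.roc_auc_score(y_true, score)

print("AUC=%.3f, U/(n1*n0)=%.3f, pval=%.4f" % (auc, auc_from_u, res.pvalue))
```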

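The Mann–Whitney U test (also called the Mann–Whitney–Wilcoxon, Wilcoxon rank-sum, or Wilcoxon–Mann–Whitney test) is a nonparametric test of the null hypothesis that a value drawn at random from one group is equally likely to be smaller or larger than a value drawn at random from the other group.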
In [None]:
import numpy as np
from sklearn import metrics
N_test = 50
X, y = make_classification(n_samples=200, random_state=42,
                           shuffle=False, class_sep=0.80)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                        test_size=N_test, random_state=42)

lr = lm.LogisticRegression().fit(X_train, y_train)

y_pred = lr.predict(X_test)
proba_pred = lr.predict_proba(X_test)[:, 1]

acc = metrics.accuracy_score(y_test, y_pred)
bacc = metrics.balanced_accuracy_score(y_test, y_pred)
auc = metrics.roc_auc_score(y_test, proba_pred)

print("ACC=%.2f, bACC=%.2f, AUC=%.2f," % (acc, bacc, auc))

In [None]:
from scipy import stats

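# Number of correct predictions (round to avoid float truncation)
k = int(round(acc * N_test))
# One-sided exact binomial test against the 50% chance level
acc_test = stats.binomtest(k=k, n=N_test, p=0.5, alternative='greater')
# Mann-Whitney U test comparing the predicted probabilities of the two classes
auc_pval = stats.mannwhitneyu(
    proba_pred[y_test == 0], proba_pred[y_test == 1]).pvalue


def is_significant(pval):
    return bool(pval < 0.05)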


print("ACC=%.2f (pval=%.3f, significance=%r)" %
      (acc, acc_test.pvalue, is_significant(acc_test.pvalue)))
print("AUC=%.2f (pval=%.3f, significance=%r)" %
      (auc, auc_pval, is_significant(auc_pval)))

## Exercise

### Fisher linear discriminant rule

Write a class ``FisherLinearDiscriminant`` that implements Fisher's linear discriminant analysis. This class must be compliant with the scikit-learn API by providing two methods:

- ``fit(X, y)`` which fits the model and returns the object itself;
- ``predict(X)`` which returns a vector of the predicted values.

Apply the object to the dataset presented for the LDA.