# Machine Learning I - Linear Models

Andrés F. LOPEZ-LOPERA <br/>
Université Polytechnique Hauts-de-France

---

<div class="alert alert-block alert-warning"> 
    For this lab session, you are free to use the language of your choice, but Python is strongly recommended. This notebook focuses on Python implementations based on the toolboxes 'numpy', 'matplotlib', 'pandas', 'seaborn' and 'sklearn'.
    
</div>

In [None]:
import numpy as np              # toolbox with comprehensive mathematical functions
import matplotlib.pyplot as plt # toolbox for plotting figures
from matplotlib.colors import ListedColormap # to fix some colormaps
import pandas as pd             # toolbox for managing dataframes
import seaborn as sns           # toolbox for visualization

## Introduction

This notebook focuses on exercises related to classification and clustering models. More precisely, we will explore the following two families of methods:

- Linear models for classification
    - Linear discriminant analysis (LDA)
    - Logistic regression
- Clustering
    - $k$-means
    - Gaussian mixture models (GMMs)

## Exercise 1: Linear models for classification

### Iris dataset

In this exercise, we will use the **Iris dataset**, a well-known dataset in the machine learning community. It is readily available in the `sklearn` library. You can find more details in the official documentation here: [Iris dataset Documentation](https://scikit-learn.org/1.5/datasets/toy_dataset.html)

**Question 1 (data analysis).** Load the **Iris dataset** and create a Pandas DataFrame with appropriately named columns for $(X, y)$. Use Python visualization tools (e.g., those covered in the *Python M1* course) to analyze the dataset.  

Write a brief report (maximum 10 lines) summarizing your findings, making explicit references to the figures included in your analysis.

In [None]:
from sklearn import datasets     # toolbox ML + datasets

# Loading the Iris dataset
dataset_full = 
dataset_full

In [None]:
# creating a dataframe
dataset = pd.DataFrame()
dataset = dataset.set_axis(dataset_full.feature_names + ['y'], axis = 1)

dataset

In [None]:
# defining the design matrix
pattern_names = # names of the features
target_names =  # names of the classes
n, d =  # (nb of observations, input dimension + output)
K = # nb of classes

print("Classes:", target_names)
print("Patterns:", pattern_names)

In [None]:
## to add your plots here

### Linear discriminant analysis in $\mathbb{R}^d$ for a binary classification

Conditionally on the class label $y = k$, with $k \in \{0, 1\}$, the observation $X$ is assumed to follow a Gaussian distribution with a covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$ common to both classes. Its probability density function is

\begin{align}
	f_{X| y = k}(x) 
    = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(- \frac{1}{2}(x-\mu_k)^\top \Sigma^{-1} (x-\mu_k)\right),
\end{align}

where $\mu_k \in \mathbb{R}^d$ represents the mean vector of the Gaussian distribution for class $k$. We assume prior probabilities $\mathbb{P}(y = k) = \pi_k$ for all $k \in \{0, 1\}$.  

By applying Bayes' theorem, the log posterior ratio is given by  

\begin{align*}
	\log \left(\frac{\mathbb{P}(y = 1|x)}{\mathbb{P}(y = 0|x)}\right)
	= \delta_1(x) - \delta_0(x)
	= \langle x, \beta \rangle + \beta_0,
\end{align*}

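where $\beta = \Sigma^{-1}(\mu_1 - \mu_0)$, $\beta_0 = -\frac{1}{2} \mu_1^\top \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^\top \Sigma^{-1} \mu_0 + \log \left(\pi_1 / \pi_0\right)$, and $\delta_k$ is the linear discriminant function defined as  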

\begin{align*}
    \delta_k(x) 
	&= \langle x, \Sigma^{-1} \mu_k \rangle - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log \left(\pi_k \right).
\end{align*}

Within this context, the classification rule is given by

\begin{equation*}
    \eta_{\theta}(x) = \mathbb{1}_{\mathbb{P}(y = 1|x) \geq \mathbb{P}(y = 0|x)}
	\quad
	\Leftrightarrow
	\quad
	\eta_{\theta}(x) = \mathbb{1}_{\delta_1(x) \geq \delta_0(x)}.
\end{equation*}
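
As a quick numerical check of the formulas above, here is a minimal sketch that evaluates $\delta_0$, $\delta_1$ and the linear form $\langle x, \beta \rangle + \beta_0$ at one point. The values of `mu`, `Sigma`, `pi` and `x` are hypothetical and chosen only for illustration; they are not taken from the Iris data.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration
mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   # class means mu_0, mu_1
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])          # shared covariance matrix
pi = [0.6, 0.4]                                     # class priors pi_0, pi_1

Sigma_inv = np.linalg.inv(Sigma)

def delta(x, k):
    """Linear discriminant function delta_k(x)."""
    return (x @ Sigma_inv @ mu[k]
            - 0.5 * mu[k] @ Sigma_inv @ mu[k]
            + np.log(pi[k]))

# Coefficients of the linear log posterior ratio <x, beta> + beta_0
beta = Sigma_inv @ (mu[1] - mu[0])
beta_0 = (-0.5 * mu[1] @ Sigma_inv @ mu[1]
          + 0.5 * mu[0] @ Sigma_inv @ mu[0]
          + np.log(pi[1] / pi[0]))

x = np.array([1.0, 0.5])                # hypothetical test point
log_ratio = delta(x, 1) - delta(x, 0)   # log posterior ratio
label = int(log_ratio >= 0)             # classification rule

print("delta_1 - delta_0 :", log_ratio)
print("<x, beta> + beta_0:", x @ beta + beta_0)
print("predicted class   :", label)
```

Both printed quantities should coincide, which illustrates that the decision boundary $\{x : \delta_1(x) = \delta_0(x)\}$ is a hyperplane.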

#### Parameter estimators

The parameters of the LDA model can be estimated using the maximum likelihood method. The first-order optimality conditions yield the following estimators:  

1. **Class Prior Probabilities:** The empirical estimates of the prior probabilities are given by  

   $$
   \hat{\pi}_k = \frac{n_k}{n}, \quad k \in \{0,1\},
   $$
   
   where $n_k$ is the number of observations in class $k$, and $n$ is the total number of observations.

2. **Class Means:** The maximum likelihood estimates of the class means are  
   
   $$
   \hat{\mu}_k = \sum_{i: y_i = k} \frac{x_i}{n_k}, \quad k \in \{0,1\}.
   $$

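3. **Covariance Matrix:** Since LDA assumes a covariance matrix common to both classes, the class-wise sample covariances $\hat{\Sigma}_k$ are combined into the pooled sample covariance $\hat{\Sigma}$ (the bias-corrected counterpart of the maximum likelihood estimator, which divides by $n$ instead of $n-2$):  
   
   $$
    \hat{\Sigma}_k = \sum_{i : y_i = k} \frac{(x_i-\hat\mu_k)(x_i-\hat\mu_k)^\top}{n_k-1}, \quad k \in \{0,1\}, \qquad
	\hat{\Sigma} = \sum_{k = 0}^{1}  \frac{n_k-1}{n-2} \hat{\Sigma}_k.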
   $$

These estimates can then be used to construct the linear discriminant functions and derive the corresponding decision rule.  
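
The three estimators above can be sketched as follows on synthetic data. This is only an illustration under stated assumptions: the arrays `X_demo` and `y_demo`, the class sizes and the Gaussian parameters are hypothetical, and the code is not meant as the reference solution for the Iris data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data in R^2 (illustrative only)
n0, n1 = 40, 60
X_demo = np.vstack([rng.normal([0, 0], 1.0, size=(n0, 2)),
                    rng.normal([3, 1], 1.0, size=(n1, 2))])
y_demo = np.array([0] * n0 + [1] * n1)
n = len(y_demo)

pi_demo, mu_demo, Sigma_local_demo = [], [], []
Sigma_pooled_demo = np.zeros((2, 2))

for k in range(2):
    Xk = X_demo[y_demo == k]          # observations of class k
    nk = Xk.shape[0]
    pi_demo.append(nk / n)            # class prior: n_k / n
    mu_demo.append(Xk.mean(axis=0))   # class mean
    centered = Xk - mu_demo[k]
    Sigma_k = centered.T @ centered / (nk - 1)   # class-wise sample covariance
    Sigma_local_demo.append(Sigma_k)
    Sigma_pooled_demo += (nk - 1) / (n - 2) * Sigma_k   # pooled covariance

print("priors:", pi_demo)
print("means:", mu_demo)
print("pooled covariance:\n", Sigma_pooled_demo)
```

Each class-wise covariance computed this way should match `np.cov(Xk, rowvar=False)`, which is a convenient sanity check.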

**Question 2 (Implementation of LDA in $\mathbb{R}^2$ from scratch).** Consider a binary classification problem involving only the `setosa` and `versicolor` classes (i.e., exclude data related to `virginica`). Focus on the features `petal length (cm)` and `petal width (cm)`.  

Using the formulas derived in the course, compute the estimators of the following parameters:  
- Class priors: $\hat{\pi}_k$  
- Class means: $\hat{\mu}_k$  
- Class covariance matrices: $\hat{\Sigma}_k$  
- Shared covariance matrix: $\hat{\Sigma}$  

In a single panel, create a scatterplot of the data points and display the estimated means $\hat{\mu}_k$. Additionally, visualize the estimated local covariances $\hat{\Sigma}_k$ by plotting ellipsoids using four contour levels. The `Ellipse` function from `matplotlib.patches` can be useful for this purpose.  

Finally, apply the LDA classification rule to compute predictions and report the model's accuracy.
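
To draw a covariance matrix as an ellipse, one possible approach is to use its eigendecomposition: the eigenvectors give the orientation and the eigenvalues give the squared half-axis lengths. The sketch below uses a hypothetical covariance `covar_demo` and mean `mean_demo`. The scaling `2 * sqrt(eigenvalue)` is one common convention and is an assumption here; the notebook may intend a different scaling.

```python
import numpy as np
from matplotlib.patches import Ellipse

# Hypothetical mean and covariance, chosen only for illustration
mean_demo = np.array([1.0, 2.0])
covar_demo = np.array([[2.0, 0.8], [0.8, 1.0]])

# Eigendecomposition of a symmetric matrix (eigenvalues in ascending order)
eigenvalues, eigenvectors = np.linalg.eigh(covar_demo)

# Reordering so that the first eigenpair corresponds to the major axis
order = eigenvalues.argsort()[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Orientation of the major axis, in degrees
angle = np.degrees(np.arctan2(eigenvectors[1, 0], eigenvectors[0, 0]))

# Full axis lengths of the "one standard deviation" ellipse (assumed convention)
width, height = 2 * np.sqrt(eigenvalues)

ellipse = Ellipse(mean_demo, width, height, angle=angle, fill=False)
print("angle (deg):", angle, "| width:", width, "| height:", height)
```

The patch can then be added to an existing axes object with `ax.add_patch(ellipse)`; multiplying `width` and `height` by a constant produces nested contour levels.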

In [None]:
def plot_points(X, y, target_names):
    ax = plt.gca()  # drawing on the current axes instead of relying on a global variable
    for k in range(len(target_names)):
        ax.scatter(X[y == k, 0], X[y == k, 1], label = target_names[k], s = 20)
    ax.set(xlabel = dataset_full.feature_names[2],
           ylabel = dataset_full.feature_names[3])
    ax.legend(loc = 4)

In [None]:
from sklearn.metrics import accuracy_score
from matplotlib.patches import Ellipse

# Function to plot Gaussian ellipses
def plot_LDA_ellipses(means, covariances, ax):
    colors = sns.color_palette("colorblind", 2)
    for i, (mean, covar) in enumerate(zip(means, covariances)):
        eigenvalues, eigenvectors = 
        order = eigenvalues.argsort()[::-1]
        eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
        angle = np.degrees(np.arctan2(*eigenvectors[:, 0][::-1]))
        volume = 
        width, height = 
        for ncontour in range(4):
            ellipse = Ellipse(mean, (ncontour+1)*width, (ncontour+1)*height, angle = angle, 
                              edgecolor = colors[i], facecolor = colors[i], lw = 2, alpha = 0.2)
            ax.add_patch(ellipse);

In [None]:
pattern_names_2d = # defining target features
target_names_bin = # defining target classes

X = # extracting X related to the petal 
y = # extracting y  

idx_keep = # discarding data with class "virginica"
y = y[idx_keep]    # to ensure y \in {0, 1} 
X = X[idx_keep, :]
nbin = X.shape[0]  # nb of observations kept for the binary problem

In [None]:
pi_hat = [None] * 2                # pi parameters
mu_hat = [None] * 2                # mean parameters
Sigma_hat_local = [None] * 2       # local covariance parameters
Sigma_hat_shared = np.zeros((2,2)) # shared covariance parameter

for k in range(2):
    pi_hat[k] = 
    mu_hat[k] = 
    Sigma_hat_local[k] = 
    Sigma_hat_shared += 

In [None]:
fig, ax = plt.subplots()
plot_points(X, y, target_names_bin)
plot_LDA_ellipses(mu_hat, Sigma_hat_local, ax)

In [None]:
y_pred = 
print("accuracy:", np.sum(y_pred == y) / nbin)
print("accuracy (sklearn):", accuracy_score(y, y_pred))  # signature: (y_true, y_pred)

(to add your comments here)

**Question 3 (plotting the decision boundary).** The function `frontier` visualizes the decision boundary of a classifier by evaluating it on a grid of points covering the range of the input variables in the dataset $X$. It requires the following arguments:  
- `f` : classification function $f: \mathbb{R}^{d} \to \mathbb{N}$, which assigns a class label to a single input point  
- `X` : feature matrix $X \in \mathbb{R}^{n \times d}$ (only the first two columns are used for the plot)  
- `y` : ground truth labels $y \in \mathbb{N}^n$  
- `step` : number of points per dimension used to construct the grid  

Modify your previous code to ensure compatibility with `frontier`, then use it to visualize the decision boundary. Finally, display the results and analyze your observations.  

In [None]:
def frontier(f, X, y, step = 100):
    """ function for plotting the decision frontier of a classifier f (binary or multiclass) """    
    # converting data to np.arrays
    X, y = np.array(X), np.array(y)
    
    # defining a grid of test points for evaluation
    eps = 0.1
    min_x0, max_x0 = np.min(X[:, 0])-eps, np.max(X[:, 0])+eps
    min_x1, max_x1 = np.min(X[:, 1])-eps, np.max(X[:, 1])+eps
    x1, x2 = np.meshgrid(np.arange(min_x0, max_x0, (max_x0 - min_x0) / step),
                         np.arange(min_x1, max_x1, (max_x1 - min_x1) / step))
    
    # computing the predictions for each point in the test grid
    z = np.array([f(vec) for vec in np.c_[x1.ravel(), x2.ravel()]]).reshape(x1.shape)
    pred_labels = np.unique(z)
    
    #  defining colormap to have similar colors to the points
    color_blind_list = sns.color_palette("colorblind", pred_labels.shape[0])
    sns.set_palette(color_blind_list)
    cmap = ListedColormap(color_blind_list)
    
    # plotting prediction map 
    plt.imshow(z, origin = 'lower', extent = [min_x0, max_x0, min_x1, max_x1],
               interpolation = 'mitchell', alpha = 0.80, cmap = cmap, aspect='auto')
    ax = plt.gca()
    cbar = plt.colorbar(ticks = pred_labels)
    cbar.ax.set_yticklabels(pred_labels)

    labels = np.unique(y).shape[0]
    color_blind_list = sns.color_palette("colorblind", labels)
    for i, label in enumerate(y):
        plt.scatter(X[i, 0], X[i, 1],
                    color = color_blind_list[int(y[i])], s = 80)
    plt.xlim([min_x0, max_x0])
    plt.ylim([min_x1, max_x1]);

In [None]:
def f(x):
    """ My LDA classifier """
    return()

nb_steps = 200 # resolution for the evaluation grid

fig = plt.figure()
title = "Accuracy (LDA) " + \
        ": {:.2f}".format(accuracy_score(y, y_pred))
frontier(f, X, y, nb_steps)
plt.title(title)
plt.show()

**Question 4 (LDA using sklearn).** Repeat **Question 3**, this time using the `LinearDiscriminantAnalysis` class from `sklearn` ([LDA Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)). Compare the decision boundaries obtained with those from the previous method. Analyze any differences you observe. 

  

Adapt the code for a classification problem involving three classes: `setosa`, `versicolor`, and `virginica` from the Iris dataset. Report your findings.  
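
Since `frontier` calls the classifier on one point at a time, while `sklearn` estimators expect a 2D array of shape `(n_samples, n_features)`, a small wrapper is needed. The sketch below shows one way to write it on synthetic data; the names `X_demo`, `y_demo`, `clf_demo` and `f_demo` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic, well-separated two-class data (illustrative only)
X_demo = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
                    rng.normal([3, 3], 0.5, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

# Fitting the model and predicting on the training set
clf_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)
y_demo_pred = clf_demo.predict(X_demo)

def f_demo(x):
    """Classify a single point x of shape (d,) with the fitted model."""
    x = np.asarray(x).reshape(1, -1)   # sklearn expects shape (1, d)
    return clf_demo.predict(x)[0]

print("prediction at (0, 0):", f_demo([0.0, 0.0]))
print("prediction at (3, 3):", f_demo([3.0, 3.0]))
```

The same wrapper works unchanged for a three-class model, because `predict` returns the class label whatever the number of classes.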

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

clf_LDA = 
y_LDA =

def f(x):
    """Classifier"""
    return()

fig = plt.figure()
title = "Accuracy (LDA) " + \
        ": {:.2f}".format(accuracy_score(y, y_LDA))
frontier(f, X, y, nb_steps)
plt.title(title)
plt.show()

In [None]:
# Three-class model
X = # extracting X related to the petal 
y = # extracting y  


**Question 5 (Logistic regression).** The logistic classifier is also available in `sklearn` ([LogisticRegression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)). Repeat **Question 4**, this time using the `LogisticRegression` classifier. Compare the results with those obtained using LDA. Analyze any differences in decision boundaries and classification performance.  
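
For the comparison with LDA, it can help to recall that binary logistic regression also models the log posterior ratio as a linear function $\langle x, \beta \rangle + \beta_0$, but estimates $(\beta, \beta_0)$ directly instead of going through $(\hat\pi_k, \hat\mu_k, \hat\Sigma)$. The sketch below reads these coefficients from a fitted model on synthetic data; `X_demo`, `y_demo` and `clf_demo` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic two-class data (illustrative only)
X_demo = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
                    rng.normal([3, 3], 1.0, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

clf_demo = LogisticRegression().fit(X_demo, y_demo)

# Coefficients of the linear log-odds
beta, beta_0 = clf_demo.coef_[0], clf_demo.intercept_[0]

x = np.array([1.5, 1.5])                                 # hypothetical test point
log_odds = x @ beta + beta_0                             # <x, beta> + beta_0
proba = clf_demo.predict_proba(x.reshape(1, -1))[0, 1]   # P(y = 1 | x)

print("log-odds:", log_odds)
print("P(y = 1 | x):", proba)
```

Note that `LogisticRegression` applies an $\ell_2$ penalty by default, so its coefficients are regularized estimates.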

In [None]:
from sklearn.linear_model import LogisticRegression

clf_logistic = LogisticRegression()

## Exercise 2: Clustering

**Question 1 ($k$-means from scratch).** Consider the **Iris dataset**. As in **Exercise 1**, focus on the features `petal length (cm)` and `petal width (cm)`. Implement Lloyd's algorithm for $k$-means clustering, as discussed in the course, for $K$ clusters. In your experiments, consider $K = 3$. Initialize the centroids randomly. Plot the result of each iteration to visualize how the clustering evolves. Define a stopping criterion for the algorithm.

Repeat the previous procedure for different random initializations and different values of $K$. What can you conclude?
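
For reference, Lloyd's algorithm alternates an assignment step and an update step until the centroids stop moving. The following is a minimal sketch without plotting, on synthetic data. The function name `lloyd_kmeans`, the tolerance `tol` and the handling of empty clusters (the previous centroid is kept) are assumptions made for this illustration, not choices prescribed by the course.

```python
import numpy as np

def lloyd_kmeans(X, centroids, n_iter_max=100, tol=1e-8):
    """Lloyd's algorithm; stops when the centroids move by less than tol."""
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter_max):
        # Step 1: assigning each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        pred = dists.argmin(axis=1)

        # Step 2: updating each centroid as the mean of its cluster
        new_centroids = centroids.copy()
        for k in range(len(centroids)):
            if np.any(pred == k):   # an empty cluster keeps its previous centroid
                new_centroids[k] = X[pred == k].mean(axis=0)

        # Stopping criterion: total displacement of the centroids
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return pred, centroids

rng = np.random.default_rng(0)

# Three synthetic blobs (illustrative only)
X_demo = np.vstack([rng.normal(c, 0.3, size=(30, 2))
                    for c in ([0, 0], [3, 0], [0, 3])])

# Random initialization: K = 3 distinct data points used as initial centroids
init = X_demo[rng.choice(len(X_demo), size=3, replace=False)]
labels, centers = lloyd_kmeans(X_demo, init)
print("cluster sizes:", np.bincount(labels, minlength=3))
```

Because the objective is non-convex, different initializations may converge to different local minima, which is one reason to repeat the experiment with several random seeds.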

In [None]:
def dist_euclidean(x, y):
    return()

def my_kmeans_plot(X, centroids, n_iter_max):
    colors = sns.color_palette("colorblind", len(centroids))
    
    for n in range(n_iter_max):
        # Step 1: assigning points to clusters according to the nearest centroid
        pred = 
            
        # Step 2: updating the centroids
        centroids =
        
        _, ax = plt.subplots()
        for k in range(len(centroids)):  # one group per cluster, valid for any K
            ax.scatter(X[pred == k, 0], X[pred == k, 1], label = "cluster " + str(k), s = 20)
            ax.scatter(centroids[k, 0], centroids[k, 1], marker = 'x', 
                       s = 80, linewidths = 3, color = colors[k], zorder = 10)
        ax.set(xlabel = dataset_full.feature_names[2],
               ylabel = dataset_full.feature_names[3])
        
    return((pred, centroids))

In [None]:
X = # extracting X related to the petal 
n_clusters = # defining nb of clusters

# random initialization of the centroids
centroids = 

y_kmeans, centroids_pred = my_kmeans_plot(X, centroids, 5)

**Question 2 (plotting the decision boundary).** Using the function `frontier`, visualize the decision boundary of the $k$-means clustering algorithm. Given the true labels available in the dataset, compute the accuracy of the model. You may need to adjust the predicted labels to account for any identification issues.
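
Cluster indices are arbitrary, so cluster `0` does not necessarily correspond to class `0`. One simple way to handle this identification issue, sketched below, is to try every permutation of the cluster labels and keep the one with the highest accuracy. The helper `align_labels` and the toy vectors are hypothetical; the exhaustive search is only practical for a small number of clusters.

```python
import numpy as np
from itertools import permutations

def align_labels(y_true, y_pred, n_clusters):
    """Relabel clusters with the permutation that maximizes the accuracy."""
    best_perm, best_acc = None, -1.0
    for perm in permutations(range(n_clusters)):
        mapped = np.array(perm)[y_pred]   # cluster j is renamed perm[j]
        acc = np.mean(mapped == y_true)
        if acc > best_acc:
            best_perm, best_acc = perm, acc
    return np.array(best_perm)[y_pred], best_acc

# Toy example (illustrative only)
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_clust_demo = np.array([2, 2, 0, 0, 1, 0])

y_aligned, acc = align_labels(y_true_demo, y_clust_demo, 3)
print("aligned labels:", y_aligned)
print("accuracy:", acc)
```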

In [None]:
def f(x):
    """ k-means """
    return() 

fig = plt.figure()
plt.show()

**Question 3 (GMM using sklearn).** Repeat **Questions 1-2** using a Gaussian Mixture Model (GMM) with three mixture components. You can use the `GaussianMixture` class available in `sklearn` ([GMM Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html)).

For displaying the results, first plot the corresponding centers and ellipsoids of each Gaussian component. You may need to adapt the `plot_LDA_ellipses` function from **Exercise 1 (Question 2)** to pass the means and covariances of the GMM. 

Next, visualize the decision boundary using the `frontier` function from **Exercise 1 (Question 3)**. What can you conclude?
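
As a starting point, the sketch below fits a three-component GMM on synthetic data and shows where the fitted parameters are stored. The names `X_demo`, `gmm_demo` and `f_gmm_demo` are hypothetical, and `random_state=0` is an arbitrary choice made for reproducibility.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Three synthetic blobs (illustrative only)
X_demo = np.vstack([rng.normal(c, 0.3, size=(30, 2))
                    for c in ([0, 0], [3, 0], [0, 3])])

# Fitting a GMM with three components (one full covariance per component by default)
gmm_demo = GaussianMixture(n_components=3, random_state=0).fit(X_demo)
y_gmm_demo = gmm_demo.predict(X_demo)   # most probable component for each point

def f_gmm_demo(x):
    """Assign a single point x of shape (d,) to its most probable component."""
    return gmm_demo.predict(np.asarray(x).reshape(1, -1))[0]

print("weights:", gmm_demo.weights_)                     # shape (3,)
print("means shape:", gmm_demo.means_.shape)             # (3, 2)
print("covariances shape:", gmm_demo.covariances_.shape) # (3, 2, 2)
```

The arrays `means_` and `covariances_` can be passed to an ellipse-plotting helper, and a wrapper such as `f_gmm_demo` has the single-point signature expected by `frontier`.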

In [None]:
from matplotlib.patches import Ellipse
colors = sns.color_palette("colorblind", n_clusters)

# Function to plot Gaussian ellipses
def plot_gmm_ellipses():

In [None]:
from sklearn.mixture import GaussianMixture

# fitting a Gaussian Mixture Model
gmm = 
y_gmm =

# plotting points and ellipses
fig, ax = plt.subplots()
plt.show()

In [None]:
print("weights:", gmm.weights_)
print("means:", gmm.means_)
print("covariances:", gmm.covariances_)

In [None]:
def f(x):
    """ GMM """
    return() 

# plotting the decision boundary
fig = plt.figure()
plt.show()