# Machine Learning I - Linear Models

Andrés F. LOPEZ-LOPERA <br/>
Université Polytechnique Hauts-de-France

---

<div class="alert alert-block alert-warning"> 
    For this lab session, you are free to use the language of your choice but Python is strongly recommended. In this notebook we will focus on Python implementations based on the toolboxes 'numpy', 'matplotlib', 'pandas' and 'sklearn'.
    
</div>

In [None]:
import numpy as np              # toolbox with comprehensive mathematical functions
import matplotlib.pyplot as plt # toolbox for plotting figures
import pandas as pd             # toolbox for managing dataframes

## Introduction

This notebook focuses on exercises related to linear regression models. More precisely, we will explore the following three applications:

- Linear regression using least squares (LS)
- Linear regression with Ridge and Lasso regularization
- Image compression using Singular Value Decomposition (SVD)

## Exercise 1: Linear regression using least squares (LS)

### Diabetes dataset

In this exercise, we will work with the diabetes dataset, which is frequently used in the machine learning community. This dataset is conveniently available in the `sklearn` toolbox. You can find more details in the documentation here: <a href="https://scikit-learn.org/1.5/datasets/toy_dataset.html" > https://scikit-learn.org/1.5/datasets/toy_dataset.html </a>

**Question 1 (data analysis).** Load the diabetes dataset and create a Pandas DataFrame with appropriately named columns for $(X, y)$.

Before applying any machine learning technique, it is important to first analyze the data. This helps build familiarity with the dataset and provides better context for interpreting the results. Feel free to use any Python visualization tools in your analysis (see, e.g., the ones studied in the "Python M1" course).

Write a short report (max 10 lines) summarizing your analysis and findings. Don't forget to refer to the figures produced in your analysis.
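As a starting point, the loading step could look like the sketch below. It assumes the target column is named `y` (the name used later in this notebook); the variable names `diabetes` and `df` are illustrative and not imposed by the exercise.

```python
import pandas as pd
from sklearn import datasets

# load_diabetes() returns a Bunch with (among others) data, target and feature_names
diabetes = datasets.load_diabetes()
feature_names = list(diabetes.feature_names)

# one column per feature, plus the target stored under the (assumed) name "y"
df = pd.DataFrame(diabetes.data, columns=feature_names)
df["y"] = diabetes.target

print(df.shape)  # (442, 11)
print(df.describe().T[["mean", "std", "min", "max"]])
```

A summary table such as `describe()` is only a first look; histograms, pairwise scatter plots and a correlation matrix are natural next steps for the report.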

In [None]:
from sklearn import datasets     # toolbox ML + datasets

# Loading the diabetes dataset
dataset_full = 
names_patterns = # extracting names of the features
dataset_full

For easier visualization and manipulation of the dataset, we can create a (pandas) dataframe.

In [None]:
# creating a dataframe
dataset = pd.DataFrame()
dataset = dataset.set_axis()

dataset

### Least squares in $\mathbb{R}^d$

We consider the linear model

\begin{equation}
    y(x) = \beta_0 + \sum_{j = 1}^{d} \beta_j x_j + \varepsilon,
\end{equation}

with $x = (x_1, \ldots, x_d) \in \mathbb{R}^{d}$ (covariates), $\beta = (\beta_0, \beta_1, \ldots, \beta_d) \in \mathbb{R}^{p}$ with $p = d + 1$ (coefficient parameters) and $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ (additive Gaussian noise).

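Let $X \in \mathbb{R}^{n \times p}$ denote the design matrix, whose first column is filled with ones to account for $\beta_0$. If $X^\top X$ is a full rank matrix, then one can show that the LS estimator of $\beta$ is given by

\begin{equation*}
    \widehat{\beta} = (X^\top X)^{-1} X^\top y.
\end{equation*}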

Once $\widehat{\beta}$ is computed, the prediction $\widehat{y}$ of $y$ at $x \in \mathbb{R}^d$ is given by

\begin{equation}
    \widehat{y}(x) = \widehat{\beta}_0 + \sum_{j = 1}^{d} \widehat{\beta}_j x_j.
\end{equation}
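The two formulas above can be checked numerically on synthetic data. The sketch below is illustrative only: the data, the seed and the names `X_raw`, `X_design` and `beta_true` are assumptions, and the normal equations are solved with `np.linalg.solve` rather than by forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X_raw = rng.normal(size=(n, d))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # (beta_0, beta_1, ..., beta_d)

# design matrix with a leading column of ones for the intercept beta_0
X_design = np.column_stack([np.ones(n), X_raw])
y = X_design @ beta_true + 0.1 * rng.normal(size=n)

# LS estimator: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# predictions at the observed inputs
y_hat = X_design @ beta_hat

print("beta_hat =", beta_hat)
```

Solving the linear system is usually preferred to computing $(X^\top X)^{-1}$ because it is cheaper and numerically more stable; `np.linalg.lstsq` is another option when $X^\top X$ is close to singular.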

**Question 2 (LS in  $\mathbb{R}$ from scratch).** Consider a simple linear regression model for each variable: `age`, `bmi` and `bp`. Using the formulas derived in the course, compute the LS estimator for $\beta$.

On the same plot, display the observed data $y$ along with the predicted values $\widehat{y}$. Then, compute and print the following performance metrics: $\operatorname{MSE}$ (Mean Squared Error), $\operatorname{SMSE}$ (Standardized Mean Squared Error), and $R^2$ (coefficient of determination). For these metrics, write Python functions, each with a short description of its input and output arguments.

Analyze and discuss the results. What observations can you make from the metrics and plots? What conclusions can you draw?
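One possible way to write the three metrics is sketched below. It assumes that the SMSE is defined as the MSE divided by the empirical variance of the observations, so that $R^2 = 1 - \operatorname{SMSE}$; check this against the definitions given in the course. The `_sketch` suffix only marks these names as illustrative.

```python
import numpy as np

def mse_sketch(y, y_pred):
    """Mean squared error between observations y and predictions y_pred (1D array-likes)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y - y_pred) ** 2)

def smse_sketch(y, y_pred):
    """MSE divided by the empirical variance of y (assumed definition of the SMSE)."""
    return mse_sketch(y, y_pred) / np.var(y)

def r2_sketch(y, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot, which equals 1 - SMSE here."""
    return 1.0 - smse_sketch(y, y_pred)

y_obs = [1.0, 2.0, 3.0, 4.0]
y_fit = [1.1, 1.9, 3.2, 3.8]
print(mse_sketch(y_obs, y_fit), smse_sketch(y_obs, y_fit), r2_sketch(y_obs, y_fit))
```

With these definitions, a model that always predicts the mean of $y$ has $\operatorname{SMSE} = 1$ and $R^2 = 0$, which gives a convenient baseline for interpreting the values.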

In [None]:
# defining the design matrix
X = dataset[names_patterns]
y = np.array(dataset['y'])
n, d = X.shape # (nb of observations, input dimension)

In [None]:
# defining functions for computing the performance indicators
def MSE(y, y_pred):
    """ Compute the Mean Squared Error (MSE) """
    return()

def SMSE(y, y_pred):
    """ Compute the Standardized Mean Squared Error (SMSE) """    
    return()

def R2(y, y_pred):
    """ Compute the coefficient of determination $R^2$ """
    return()

In [None]:
names_patterns_short = ['age', 'bmi', 'bp']

for pattern in names_patterns_short:
    # extracting data for the target pattern
    x = 
    
    # computing the LS estimator
    beta = 
    
    # computing the predictions
    y_pred = 
    
    # plotting the prediction
    fig = plt.figure()
    plt.show()
    
    print(pattern)
    print("beta =", beta)
    print("[MSE, SMSE, R2] =", [MSE(y, y_pred), SMSE(y, y_pred), R2(y, y_pred)])

**Question 3 (LS in  $\mathbb{R}^d$ from scratch).** Now, consider a linear regression model that accounts for the three variables: `age`, `bmi` and `bp`. Compute the LS estimator for $\beta$.

Create a scatter plot comparing the predicted values $\widehat{y}$ to the ground truth observations $y$. This plot is known as a *calibration plot*. What patterns or insights can you observe?

Finally, compute and print the performance metrics: $\operatorname{MSE}$ (Mean Squared Error), $\operatorname{SMSE}$ (Standardized Mean Squared Error), and $R^2$ (coefficient of determination). Discuss your findings.

In [None]:
# extracting data for the patterns ["age", "bmi", "bp"]
pattern = ["age", "bmi", "bp"]
x = 
beta = 
y_pred = 
  
# plotting predictions vs observations
fig = plt.figure()
plt.show()
    
print("beta =", beta)
print("[MSE, SMSE, R2] =", [MSE(y, y_pred), SMSE(y, y_pred), R2(y, y_pred)])

**Question 4 (LS using sklearn).** Repeat **Questions 1-3** using the utilities provided in `sklearn`. Refer to the following documentation for guidance:

- <a href="https://scikit-learn.org/1.5/modules/linear_model.html" > https://scikit-learn.org/1.5/modules/linear_model.html </a>
- <a href="https://scikit-learn.org/1.6/modules/model_evaluation.html" > https://scikit-learn.org/1.6/modules/model_evaluation.html </a>

Compare the results from your implementations with those obtained using `sklearn`. Do you notice any differences? If so, what might explain them?

In [None]:
from sklearn import 
from sklearn.metrics import 

In [None]:
# single linear regression


In [None]:
# multiple linear regression


**Question 5 (model selection).** Propose a new linear model but this time considering all the input variables available in the dataset. What observations and conclusions can you draw? Are all the input variables necessary for the model? Justify your answer by applying a forward (or backward) selection routine for model selection.
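A forward selection routine could be organized as in the following sketch. It is run on synthetic data and uses the MSE on a held-out validation set as the selection criterion; both the criterion and the stopping rule (stop as soon as no candidate improves the validation MSE) are assumptions, and other criteria such as AIC, BIC or adjusted $R^2$ are equally valid.

```python
import numpy as np

def fit_ls(X, y):
    """LS coefficients (intercept first) for the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def val_mse(X_tr, y_tr, X_va, y_va, cols):
    """Validation MSE of the LS model restricted to the columns in cols."""
    beta = fit_ls(X_tr[:, cols], y_tr)
    pred = beta[0] + X_va[:, cols] @ beta[1:]
    return np.mean((y_va - pred) ** 2)

def forward_selection(X_tr, y_tr, X_va, y_va):
    """Greedily add the column that most reduces the validation MSE."""
    selected, remaining = [], list(range(X_tr.shape[1]))
    best = np.mean((y_va - y_tr.mean()) ** 2)  # intercept-only model
    while remaining:
        scores = [val_mse(X_tr, y_tr, X_va, y_va, selected + [j]) for j in remaining]
        k = int(np.argmin(scores))
        if scores[k] >= best:  # no candidate improves the criterion: stop
            break
        best = scores[k]
        selected.append(remaining.pop(k))
    return selected, best

# synthetic example: only columns 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.5 * rng.normal(size=300)
X_tr, X_va, y_tr, y_va = X[:200], X[200:], y[:200], y[200:]

selected, best = forward_selection(X_tr, y_tr, X_va, y_va)
print("selected columns (in order):", selected, "validation MSE:", best)
```

Backward selection follows the same pattern, starting from the full model and removing one column at a time.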

## Exercise 2: Linear regression via Ridge and Lasso


### Generating toy data

As discussed in the course, we can generate toy data by sampling $X \in \mathbb{R}^{n \times p}$ from a Gaussian distribution. We can assume, for instance, that the entries $x_{i,j}$ are independent and identically distributed (i.i.d.) random variables with $x_{i,j} \sim \mathcal{N}(0, 1)$, where $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, p\}$.

Given a fixed $\beta_\star \in \mathbb{R}^p$, we then have that:

\begin{equation}
    y = X \beta_\star.
\end{equation}

**Question 1 (data generation).** Consider $\beta_\star \in \mathbb{R}^p$ as a vector where all elements are zero, except for the first eight (8) terms, which are defined by your birthday sequence: $\beta_\star = (d_1, d_2, m_1, m_2, y_1, y_2, y_3, y_4, 0, \ldots, 0)$, where $(d_1 d_2)$, $(m_1 m_2)$, and $(y_1 y_2 y_3 y_4)$ represent the day, month, and year of your birth, respectively.

Generate a toy dataset by sampling a (pseudo-)random matrix $X \in \mathbb{R}^{51 \times 50}$ with $x_{i,j} \sim \mathcal{N}(0, 1)$. To ensure reproducibility, set a random seed for the generation of $X$. Compute the target values $y$ using the linear model, and then add additive noise to the observations:

\begin{equation}
    y = X \beta_\star + \varepsilon,
\end{equation}

where $\varepsilon \sim \mathcal{N}(0, \tau^2 I)$ with $\tau^2 = 1$.

The resulting dataset $(X, y)$ will be used for the subsequent analysis.
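The generation procedure can be sketched as follows. The birthday (07/03/1998) and the seed value are purely hypothetical placeholders, and `default_rng` is one of several ways to fix the seed in NumPy.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility (arbitrary value)
n, p = 51, 50                    # nb of observations, nb of features

# hypothetical birthday 07/03/1998 -> digits (0, 7, 0, 3, 1, 9, 9, 8)
beta_star = np.zeros(p)
beta_star[:8] = [0, 7, 0, 3, 1, 9, 9, 8]

# i.i.d. standard Gaussian design matrix
X = rng.normal(loc=0.0, scale=1.0, size=(n, p))

# noisy observations y = X beta_star + eps, with eps ~ N(0, tau^2 I)
tau2 = 1.0
eps = rng.normal(scale=np.sqrt(tau2), size=n)
y = X @ beta_star + eps

print(X.shape, y.shape)  # (51, 50) (51,)
```

Note that `rng.normal` takes the standard deviation as `scale`, hence the square root of $\tau^2$.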

In [None]:
# defining the "size" of the problem
p =  # nb features 
n =  # nb obs

# generating pseudo-random data for a given beta
beta_true = 

X = 
y_true = 

# adding noise
var_sigma = 
noise = 
y = 

**Question 2 (Ridge using SVD).** Using the SVD framework, compute the Ridge estimator for different values of the regularization parameter $\lambda \in [10^{-2}, 10^{4}]$. Consider a grid of $100$ values of $\lambda$ on a logarithmic scale. Following the example shown in the course, display the evolution of each coefficient $\beta_i$, $i \in \{1, \ldots, p\}$, as a function of $\lambda$ in a single plot.

For this procedure, implement a function `ridge_path` that takes as input $X$, $y$ and a vector containing the regularization parameters $\lambda$. The function should return a matrix containing the estimated coefficients $\beta$ for each value of $\lambda$. Be sure to include a brief description of the function: this is always useful when other people use your implementations!

Compute the predictions $\widehat{y}$ obtained with $\widehat{\lambda}_{CV}$ (estimated in Question 3). Display a calibration plot together with some error indicators. What patterns or insights can you observe? What can be said about the cardinality (number of non-zero coefficients) of $\widehat{\beta}_{\widehat\lambda_{CV}}$?
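For reference, a ridge path based on the SVD could be sketched as below. It assumes the ridge estimator is written as $\widehat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y = V \operatorname{diag}\big(s_i / (s_i^2 + \lambda)\big) U^\top y$, without intercept and without rescaling $\lambda$ by $n$; the course may use a different normalization. The name `ridge_path_sketch` and the synthetic data are illustrative.

```python
import numpy as np

def ridge_path_sketch(X, y, lambdas):
    """Ridge coefficients for each regularization value.

    X: (n, p) design matrix, y: (n,) observations, lambdas: (L,) positive values.
    Returns an (L, p) array whose k-th row is the estimator for lambdas[k].
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    # shrinkage factors s_i / (s_i^2 + lambda), one row per lambda
    shrink = s / (s ** 2 + np.asarray(lambdas, dtype=float)[:, None])
    return (shrink * Uty) @ Vt

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(51, 50))
y_demo = X_demo[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=51)

lambdas = np.logspace(-2, 4, 100)  # 100 values on a log scale
path = ridge_path_sketch(X_demo, y_demo, lambdas)
print(path.shape)  # (100, 50)
```

The SVD is computed once and reused for every $\lambda$, which is the main advantage over solving a new linear system for each value of the grid.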

In [None]:
def ridge_path(X, y, alphas):
    """ compute the ridge path for a list of tuning parameters """
    
    return beta_ridge

# defining the grid for the Ridge parameter (\lambda in the course)
alpha_max = 
alpha_min = 
n_alphas = 

alphas = 

In [None]:
beta_ridge = ridge_path(X, y, alphas)

fig = plt.figure(figsize=(12, 8))
plt.show()

**Question 3 (cross-validation).** Estimate $\lambda$ via $K$-fold cross-validation with $K = 2, 5, 10$. You can use the function `RidgeCV` from `sklearn.linear_model`. Display again the evolution of $\beta$ as a function of $\lambda$, and mark the estimated value $\widehat{\lambda}_{CV}$ on the plot.
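To make the mechanics of $K$-fold cross-validation explicit, here is a from-scratch sketch in NumPy. It is not what `RidgeCV` does internally: it uses a ridge model without intercept, a random fold assignment and the validation MSE as score, all of which are assumptions made for illustration, so its selected value may differ from the one returned by `sklearn`.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimator (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def kfold_cv_error(X, y, lambdas, K, seed=0):
    """Average validation MSE over K folds, for each value in lambdas."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    errors = np.zeros(len(lambdas))
    for val_idx in folds:
        train = np.ones(n, dtype=bool)
        train[val_idx] = False  # hold out the current fold
        for k, lam in enumerate(lambdas):
            beta = ridge_fit(X[train], y[train], lam)
            errors[k] += np.mean((y[val_idx] - X[val_idx] @ beta) ** 2)
    return errors / K

# synthetic data with a hypothetical sparse beta
rng = np.random.default_rng(1)
X_cv = rng.normal(size=(51, 50))
beta_cv = np.zeros(50)
beta_cv[:4] = [3.0, -2.0, 1.0, 4.0]
y_cv = X_cv @ beta_cv + rng.normal(size=51)

lambdas = np.logspace(-2, 4, 100)
lambda_cv = {}
for K in (2, 5, 10):
    cv_err = kfold_cv_error(X_cv, y_cv, lambdas, K)
    lambda_cv[K] = lambdas[np.argmin(cv_err)]
    print(f"K = {K:2d}: lambda_CV = {lambda_cv[K]:.3g}")
```

The selected $\widehat{\lambda}_{CV}$ generally depends on $K$ and on the random fold assignment, which is worth discussing when comparing the three settings.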

In [None]:
from sklearn.linear_model import RidgeCV


In [None]:
# plotting predictions vs observations
y_pred = 
fig = plt.figure()
plt.show()

print("[MSE, SMSE, R2] =", )
print("|beta| =", )

**Question 4 (Lasso).** Repeat **Questions 2-3**, this time using Lasso regression. You can use the functions available in `sklearn`. For more details, refer to the documentation here: <a href="https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.Lasso.html" > https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.Lasso.html </a>

**Question 5 (LSLasso).** Implement LSLasso and compare the results with those obtained using Ridge and Lasso regression.
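The sketch below assumes that LSLasso denotes the two-step procedure in which the Lasso is used only to select the support, followed by an ordinary LS refit on the selected variables; check this against the definition given in the course. To keep the example self-contained, the Lasso step is a small cyclic coordinate-descent solver for the objective $\frac{1}{2n} \|y - X\beta\|_2^2 + \alpha \|\beta\|_1$ (the scaling used by `sklearn`), not the `sklearn` implementation itself. All function names are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Cyclic coordinate descent for (1/(2n)) ||y - X b||^2 + alpha ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]  # remove the contribution of coordinate j
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
            resid -= X[:, j] * beta[j]  # add the updated contribution back
    return beta

def ls_lasso(X, y, alpha):
    """Lasso for support selection, then LS refit on the selected columns."""
    support = np.flatnonzero(lasso_cd(X, y, alpha))
    beta = np.zeros(X.shape[1])
    if support.size:
        beta[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return beta, support

rng = np.random.default_rng(0)
X_l = rng.normal(size=(100, 20))
beta_l = np.zeros(20)
beta_l[:3] = [5.0, -4.0, 3.0]
y_l = X_l @ beta_l + 0.5 * rng.normal(size=100)

beta_refit, support = ls_lasso(X_l, y_l, alpha=0.5)
print("support:", support)
print("refitted coefficients:", beta_refit[support])
```

The refit removes the shrinkage bias that the $\ell_1$ penalty introduces on the selected coefficients, which is the point to examine when comparing with plain Ridge and Lasso.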

## Exercise 3: Image compression via SVD

In this exercise, we will explore how Singular Value Decomposition (SVD) can be used to "compress" a graphical figure. By representing the figure as a matrix and applying the SVD, we can identify the closest matrix of lower rank to the original. This method serves as a foundation for efficient compression techniques.

In [None]:
import numpy as np
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

### Load, manipulate and display the image

This NASA photo, taken by the Hubble telescope, offers a dramatic view of the Tarantula Nebula, an extragalactic formation.

![TarantulaNebula.jpg](images/TarantulaNebula.jpg)

In [None]:
nasa = mpimg.imread("images/TarantulaNebula.jpg")
print(nasa.shape)
plt.imshow(nasa) 
plt.show()

**Question 1.** Extract and comment on the information related to the nasa matrix: dimensions, min and max values of the elements, etc.

We can also transform the RGB image into a greyscale image. To simplify this exercise, we convert the nasa image to greyscale, stored as double-precision values in the range 0-255.

In [None]:
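nasa_float = nasa.astype(np.float64)  # cast first: summing integer (uint8) channels directly would overflow
img_greyscale = nasa_float[:,:,0] + nasa_float[:,:,1] + nasa_float[:,:,2]   # sum the red + green + blue channels
img_greyscale = img_greyscale*255 / np.max(img_greyscale) # rescale so that the brightest pixel maps to 255 (white)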

plt.imshow(img_greyscale, cmap='gray') 
plt.show()

# Remark:
# imshow() accepts either an image of RGB values (shape=(dim1, dim2, 3)) or an image of
# scalar values (shape=(dim1, dim2)). If we pass an image of RGB values, the cmap parameter is ignored.
# If we pass an image of scalar values and keep the default value of the cmap parameter, imshow()
# maps the scalar data to colors, so to display a greyscale image we need to set cmap to 'gray'.

### SVD decomposition

For any matrix $X \in \mathbb{R}^{n \times p}$, there exist two orthogonal matrices $U = [u_1, \ldots, u_n] \in \mathbb{R}^{n \times n}$ and $V = [v_1, \ldots, v_p] \in \mathbb{R}^{p \times p}$ such that

\begin{equation*}
	X = U \Sigma V^\top,
\end{equation*}

with $\Sigma \in \mathbb{R}^{n \times p}$ a rectangular diagonal matrix whose non-zero diagonal entries are $s_1 \geq \cdots \geq s_{r} > 0$, and $r = \operatorname{rank}(X)$.

Here, 
- The orthogonal matrix $V$ contains the right-singular vectors of $X$.
- The orthogonal matrix $U$ contains the left-singular vectors of $X$.
- The diagonal matrix $\Sigma$ contains the singular values of $X$. They correspond to the square roots of the eigenvalues of $X^\top X$. The number of non-zero singular values is equal to the rank of $X$. By convention, the diagonal entries $\Sigma_{i,i}$ are sorted in decreasing order.

We call this factorization the singular value decomposition (SVD) of $X$.
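The properties listed above can be verified numerically on a small random matrix. The sketch below uses an arbitrary $6 \times 4$ matrix and an arbitrary tolerance of $10^{-12}$ for counting non-zero singular values; both are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))

# full SVD: U is 6 x 6, s holds the 4 singular values, Vt is V^T (4 x 4)
U, s, Vt = np.linalg.svd(X)

# eigenvalues of the symmetric matrix X^T X, sorted in decreasing order
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]

print("singular values are sqrt of eigenvalues:", np.allclose(s, np.sqrt(eigvals)))
print("rank:", np.linalg.matrix_rank(X), "| non-zero singular values:", np.count_nonzero(s > 1e-12))
print("U orthogonal:", np.allclose(U.T @ U, np.eye(6)))
print("V orthogonal:", np.allclose(Vt @ Vt.T, np.eye(4)))
print("sorted in decreasing order:", np.all(np.diff(s) <= 0))
```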

**Question 2.** Perform the SVD of the greyscale image using the `svd` function from the NumPy linear algebra module (`numpy.linalg`). Check the dimensions of the decomposition outputs.

See the documentation here: <a href="https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html" > https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html </a>
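Before working on the full image, it can help to look at the output shapes on a small matrix. In the sketch below, the $5 \times 8$ matrix `A` is a hypothetical stand-in for the greyscale image; note that `np.linalg.svd` returns the singular values as a 1D vector and returns $V^\top$ rather than $V$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.uniform(0, 255, size=(5, 8))  # hypothetical stand-in for the image

U, s, Vt = np.linalg.svd(A)           # full_matrices=True by default
print(U.shape, s.shape, Vt.shape)     # (5, 5) (5,) (8, 8)

# rebuild the rectangular matrix Sigma (5 x 8) from the vector s
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)

A_rebuilt = U @ S @ Vt
print("max reconstruction error:", np.max(np.abs(A - A_rebuilt)))
```

Passing `full_matrices=False` returns the reduced factors instead, which avoids building the zero-padded matrix `S` and saves memory on large images.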

In [None]:
# Apply the SVD to the img_greyscale 
U, s, Vt = 

In [None]:
print(U)
print(np.shape(U))

In [None]:
print(np.shape(s))
plt.semilogy(s,'*')

In [None]:
print(Vt)
print(np.shape(Vt))

### Reconstruction of the image using some of the singular values

We can verify that the image $X$ can be reconstructed from the SVD decomposition:

In [None]:
# First, build the matrix S from the vector s (which holds its diagonal terms)
S =
print('matrix S', S)

In [None]:
img_greyscale_recomposed = 

plt.imshow(img_greyscale_recomposed, cmap='gray')
plt.title('Reconstructed full SVD')
plt.show()

print("MSE:", np.mean((img_greyscale - img_greyscale_recomposed)**2))

**Question 3.** Compare three recomposed matrices:

- using the first $r$ elements (reduced SVD)
- using the first 300 elements
- using the first 50 elements

Plot the associated images and comment on the results.
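A rank-$k$ reconstruction keeps only the first $k$ singular triplets. The sketch below applies it to a random matrix rather than to the image, with illustrative values of $k$; it also checks the known identity that the Frobenius error of the truncation equals $\sqrt{\sum_{i > k} s_i^2}$.

```python
import numpy as np

def truncate_svd(U, s, Vt, k):
    """Rank-k approximation built from the k largest singular triplets."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(3)
A = rng.normal(size=(40, 60))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in (40, 20, 5):
    A_k = truncate_svd(U, s, Vt, k)
    err = np.linalg.norm(A - A_k)             # Frobenius norm of the residual
    expected = np.sqrt(np.sum(s[k:] ** 2))    # energy of the discarded singular values
    print(f"k = {k:2d}: error = {err:.4f}, expected = {expected:.4f}")
```

Storing the truncated factors requires $k(n + p + 1)$ numbers instead of $np$, which is where the compression comes from.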

**Question 4.** Design an automatic rule for the selection of the number of singular values used in the reconstruction of the image. Justify your choice.
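One common rule, given here only as an example, is to keep the smallest number of singular values that captures a given fraction of the total "energy" $\sum_i s_i^2$. The threshold of 99% and the function name `rank_for_energy` are assumptions; other rules (an elbow in the singular value plot, a target MSE, a storage budget) can be justified as well.

```python
import numpy as np

def rank_for_energy(s, threshold=0.99):
    """Smallest k such that the first k singular values hold at least
    `threshold` of the total squared energy sum(s**2)."""
    s = np.asarray(s, dtype=float)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, threshold)) + 1
    return min(k, len(s))  # guard against rounding when threshold is close to 1

s_demo = np.array([10.0, 5.0, 1.0, 0.1])
for thr in (0.5, 0.99, 0.999):
    print(thr, rank_for_energy(s_demo, thr))
```

Because the squared singular values sum to the squared Frobenius norm of the matrix, the threshold directly controls the relative squared reconstruction error.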

**Question 5 (bonus).** Implement the SVD from scratch by considering the spectral theorem studied in the course for the diagonalization of a symmetric matrix. Compare the results with those obtained using the command `np.linalg.svd`. Discuss the differences if necessary.
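A possible route, sketched below under the assumption that the intended construction goes through the eigendecomposition of the symmetric matrix $X^\top X$: its eigenvectors give $V$, the square roots of its positive eigenvalues give the singular values, and the left-singular vectors follow from $u_i = X v_i / s_i$. The function name and the relative tolerance `tol` are illustrative.

```python
import numpy as np

def svd_via_eigh(X, tol=1e-10):
    """Reduced SVD of X obtained from the eigendecomposition of X^T X."""
    eigvals, V = np.linalg.eigh(X.T @ X)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder in decreasing order
    eigvals, V = eigvals[order], V[:, order]
    keep = eigvals > tol * eigvals[0]         # discard numerically zero eigenvalues
    s = np.sqrt(eigvals[keep])
    V = V[:, keep]
    U = (X @ V) / s                           # u_i = X v_i / s_i
    return U, s, V.T

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 5))

U, s, Vt = svd_via_eigh(X)
s_ref = np.linalg.svd(X, compute_uv=False)

print("same singular values:", np.allclose(s, s_ref))
print("reconstruction ok:", np.allclose((U * s) @ Vt, X))
```

Two differences with `np.linalg.svd` are worth discussing: singular vectors are only defined up to a sign, and forming $X^\top X$ squares the condition number, so small singular values are computed less accurately than with a dedicated SVD routine.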