<center><img src='https://drive.google.com/uc?export=view&id=12CrUdXDAiltLBT26sG7HZ_HciIhvGyT8'></center>|

# *Wstęp do uczenia maszynowego* - Notebook 9, version for students
**Author: Michał Ciach, Dorota Celińska-Kopczyńska**  
Exercises denoted with a star \* are optional. They may be more difficult or time-consuming.  


## Description

In today's class, we will learn more advanced approaches to linear modelling that can handle correlated predictors and noisy data. We will cover the Ridge regression, the LASSO, and the Principal Component Regression.   

## Data & module loading

In [1]:
import pandas as pd
import plotly.express as px
import numpy as np
import numpy.random as rd
import sklearn.datasets
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LassoLars, ElasticNet
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.decomposition import PCA
from scipy.stats import uniform, norm
import statsmodels.api as sm

Below, we load and pre-process the `diabetes` data frame. We permute the data and add the interactions between the blood measurement variables.    
We also remove the `sex` variable, because it is a categorical variable, not a numerical one.  
Although using LASSO with categorical variables is possible, it is much too advanced for this class.  

In [3]:
diabetes = sklearn.datasets.load_diabetes(as_frame=True, scaled=False)['frame']
diabetes.drop('sex', axis=1, inplace=True)
diabetes = diabetes.sample(frac=1)
blood_variables = ['s%i' % i for i in range(1, 7)]
# Note that the following code will append the columns to the end of the data frame,
# so the `target` variable will be hidden in the middle
for svar1_id, svar1 in enumerate(blood_variables):
  for svar2 in blood_variables[(svar1_id+1):]:
    diabetes[svar1+svar2] = diabetes[svar1]*diabetes[svar2]
diabetes

Unnamed: 0,age,bmi,bp,s1,s2,s3,s4,s5,s6,target,...,s2s3,s2s4,s2s5,s2s6,s3s4,s3s5,s3s6,s4s5,s4s6,s5s6
58,60.0,20.4,105.00,198.0,78.4,99.0,2.00,4.6347,79.0,170.0,...,7761.6,156.800,363.36048,6193.6,198.00,458.8353,7821.0,9.269400,158.00,366.1413
154,67.0,28.3,93.00,204.0,132.2,49.0,4.00,4.7362,92.0,197.0,...,6477.8,528.800,626.12564,12162.4,196.00,232.0738,4508.0,18.944800,368.00,435.7304
88,34.0,22.6,75.00,166.0,91.8,60.0,3.00,4.2627,108.0,42.0,...,5508.0,275.400,391.31586,9914.4,180.00,255.7620,6480.0,12.788100,324.00,460.3716
373,28.0,25.4,93.00,141.0,79.0,49.0,3.00,4.1744,91.0,168.0,...,3871.0,237.000,329.77760,7189.0,147.00,204.5456,4459.0,12.523200,273.00,379.8704
158,45.0,20.3,74.33,190.0,126.2,49.0,3.88,4.3041,79.0,96.0,...,6183.8,489.656,543.17742,9969.8,190.12,210.9009,3871.0,16.699908,306.52,340.0239
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66,46.0,24.7,85.00,174.0,123.2,30.0,6.00,4.6444,96.0,150.0,...,3696.0,739.200,572.19008,11827.2,180.00,139.3320,2880.0,27.866400,576.00,445.8624
211,74.0,29.8,101.00,171.0,104.8,50.0,3.00,4.3944,86.0,70.0,...,5240.0,314.400,460.53312,9012.8,150.00,219.7200,4300.0,13.183200,258.00,377.9184
152,39.0,26.3,115.00,218.0,158.2,32.0,7.00,4.9345,109.0,292.0,...,5062.4,1107.400,780.63790,17243.8,224.00,157.9040,3488.0,34.541500,763.00,537.8605
75,46.0,23.5,87.00,181.0,114.8,44.0,4.00,4.7095,98.0,42.0,...,5051.2,459.200,540.65060,11250.4,176.00,207.2180,4312.0,18.838000,392.00,461.5310


## Regularized linear models

When fitting linear models to the data, i.e. estimating the values of the $\beta$ parameters, we encounter the same *variance - bias tradeoff* as with any other estimation procedure. The estimators of $\beta$, which we denote as $\hat{\beta}$, are random variables. This is because they're calculated using the the dependent variable $Y$, which, due to the presence of random errors, is a random variable $Y\sim \mathcal{N}(X\beta, \sigma^2)$, where $\sigma^2$ is the variance of the error term $\epsilon$ (remember the definition of a linear model). As all random variables, the variables $\hat{\beta}$ have their expected value and variance. For an Ordinary Least Squares regression, we have $\mathbb{E}\hat{\beta} = \beta$, so this particular kind of estimation is unbiased - but there are other ways, which we will see in a moment. First, however, note that this consideration means that we also have a variance - bias tradeoff when we compute predicted values for new data - the predictions are random variables, because they are based on the observed values of $Y$, so they have their expected value and variance.

This tradeoff is closely connected to the problem of *overfitting*. This happens when our model fits the training data too well and it learns to predict not only the trends, but also the random errors.
When the model is overfitted, it mimics the errors from the training data set when we use it for forecasting - and because of this, it is biased (in the sense that after the training is completed, the model may systemically under- or overestimate the predicted values for new data). This usually happens when our training data set is too small, e.g. when the number of observations is similar to the number of predictors, or when the training data set is selected incorrectly, e.g. when we forgot to randomly permute the observations, or when the training data is selected from some specific sub-population of observations.

One of the approaches to deal with overfitting and the bias - variance tradeoff is to use *regularized* linear models. These are the alternative methods of estimating $\beta$ which I mentioned above. Most commonly used ones are the $L_2$-regularized model, called the Ridge regression, and the $L_1$-regularized model, called the LASSO:

$$\hat{\beta}_{\text{Ridge}} = \text{argmin}_\beta ||Y - X\beta||^2 + \alpha ||\beta||^2$$

$$\hat \beta_{\text{LASSO}} = \text{argmin}_\beta ||Y - X\beta||^2 + \alpha |\beta|$$

Although they seem similar, they have different properties and applications.  

An important caveat is that we typically do not include the intercept, $\beta_0$, in the penalty term - we only regularize the $\beta$ coefficients that correspond to measurements in $X$. This is because the intercept typically doesn't overfit - it is equal to the average value of $Y$ when all the predictors in $X$ are equal to zero, and averages are not that prone to overfitting. Introducing a penalty for the intercept therefore would introduce a bias in the estimation a term that is otherwise easy to estimate, so it does more harm than good.  

In practice, we can avoid all these problems and difficult considerations about the intercept by simply centering all the variables - i.e. subtracting their means from them. This includes both the independent variables $X$ and the dependent variable $Y$. After centering the whole data set, the average value of $Y$ when $X=0$ is equal to zero - so the intercept simply disappears. If you feel like you're not convinced by this explanation, look at the bottom of this script for a mathematical one.  

In the cell below, we have prepared a data set for you to use in the next few exercises. We have taken the numerical variables from  the diabetes data set from scikit, added interactions and polynomials and then centered and scaled the variables. Note that since all the variables are centered (including the response `Y`), we don't need to include intercepts in our linear models - they will always be equal to zero. Also note that the observations were already permuted in the *Data & modules* section to avoid biasing cross-valiation.    

**Exercise 1.** In this exercise, we will compare the ordinary least squares, Ridge regression and Lasso to predict one year progression of diabetes. We will use the `diabetes` data set, which is loaded in the *Data & modules* section and already initially preprocessed.

Is this pre-processing sufficient for regularized linear models?   
0. Inspect the pre-processed `diabetes` data frame.  
  0.1. Do all the variables have similar scales (i.e. are they expressed in similar units, which take values from a similar range)?  
  0.2. Look at the equations for $\hat{\beta}_{\text{Ridge}}$ and $\hat{\beta}_{\text{LASSO}}$. Do you think that it is important that the variable in `diabetes` have similar scales? Can different scales of variables influence the estimation?   
  0.3. Are there any variables that do not need to be scaled? Are there any variables that shouldn't be scaled? Why/why not?     
  0.4. Are the variables centered (i.e. do they have zero mean)? Do we want them to be centered? Are there any variables that don't need to, or should not, be centered?   
  0.5. Center and scale the data appropriately. You can either do it manually or using the `StandardScaler` funcion from `scikit`. Save the modified data in a new data frame to avoid modifying the original data (e.g. by using the `df.copy()` method before scaling).  

After the necessary inspection of the data and an appropriate pre-processing, we can proceed with modeling.   
1. For a baseline model (i.e. a simple model to which we will compare the improved ones), create an ordinary least squares linear model (with an intercept) explaining the disease progression using all the other variables. Use the `statsmodels` library (the documentation is available [here](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html); check the "Examples" section of the documentation to see how to fit a simple model). Check the summary of the model.   
  1.1. Are the coefficients jointly significant (in other words, is the *whole model* significant)? To check this, inspect the p-value of the F test, shown in the `Prob (F-statistic)` value in the summary header.  
  1.2. Which coefficients are individually statistically significant? Is the model simple to interpret?   
  1.3. What is the size of the model (i.e., the number of non-zero $\beta$ coefficients)? Are there any coefficients that are equal to zero?  
  1.4. Do the results agree with your conclusions from Notebook 8, where we studied model seletion techniques?
2. Evaluate the test RMSE of the model using 10-fold cross validation. For this, you will need to create the model again, this time as a `LinearRegression` function from `scikit-learn`. Adjust the keyword `fit_intercept` depending on how you centered the data. For cross validation, use the `cross_val_score` function from `scikit-learn` with parameters `cv=10`, `scoring='neg_root_mean_squared_error'`.    
3. Now, fit a Ridge regression model to the data, by using the `Ridge` function from `scikit-learn`. Adjust the keyword `fit_intercept` depending on how you centered the data. Use any value of the `alpha` parameter that you like (for example 0.1, 1, or 10) - we will see how to choose it better in Exercise 3.  
  3.1. Evaluate the RMSE using cross-validation. You can use the `cross_val_score` function from `scikit-learn`. Is it lower or higher than for the OLS? How about its standard deviation?     
  3.2. Inspect the coefficients of the model. Are there any coefficients equal to zero? Are the coefficients similar or different to the ones from OLS? Are they smaller in magnitude? Do they have the same signs?   
  3.3. Does the model obtained with the Ridge regression have a similar interpretation to the one obtained with OLS? Answer by comparing the coefficients and checking which ones are positive or negative and their magnitudes.   
4. Now, fit a Lasso model to the data, by using either the `Lasso` or the `LassoLars` function from `scikit-learn` (they implement two different algorithms to calculate the coefficients). Use any value of the `alpha` parameter that you like (for example 0.1, 1, or 10) - we will see how to choose it better in Exercise 3.  
  Note that, as opposed to Ridge and OLS, the coefficients of the Lasso model are calculated numerically using iterative procedures. If you run into convergence problems, you can try using a different algorithm (`Lasso` or `LassoLars`) or increasing the numbers of iterations by adjusting the `max_iter` parameter.  
  4.1. Evaluate the RMSE using cross-validation. Is it lower or higher than the one for the OLS and Ridge?   
  4.2. Inspect the coefficients of the model. Are there any coefficients that equal zero?  
  4.3. Does the fitted model have a similar interpretation to any of the two previous ones?    
5. Now, let's compare the performance of a regularized linear models to the performance of a forward (or backward) model selection in OLS.  
  5.1. Use the `SequentialFeatureSelector` function from `scikit-learn` to select a set of features for prediction with OLS, in the direction (forward/backward) of your choice (you can copy your solution fron Notebook 8).  
  5.2. Fit an OLS model on the selected features.  
  5.3. Evaluate the RMSE using cross-validation and compare it to the other three approaches.   
  5.4. Inspect the coefficients of the model and compare them to the other approaches, including the interpretation of the model.   
  5.5.\* Use the `statsmodels` library to fit an OLS model on the selected features and check how model selection has influenced its statistics. In particular, are there more coefficients which are individually significant (measured with the t-test)? Are the coefficients jointly significant (measured with the F-test)?  
6. \*\* Try to combine two or more modelling techniques used in this exercise and see if you can obtain an even better model.    
7. Which of the tested approaches would you choose for this data set, and why?

In [4]:
""" 0. Inspect the pre-processed diabetes data frame."""
# 0.1. Do all the variables have similar scales
#      (i.e. are they expressed in similar units, which take values from a similar range)?
print(diabetes.describe())
print("No, variables do not have similar scales, they are not expressed in similar units and their values range differently.")

# 0.2. Look at the equations for  𝛽̂ Ridge  and  𝛽̂ LASSO .
#      Do you think that it is important that the variable in diabetes have similar scales?
#      Can different scales of variables influence the estimation?
print("Yes, it is important. In Ridge and LASSO regression, the regularization terms can be affected by the scale of the variable.")

# 0.3. Are there any variables that do not need to be scaled?
#      Are there any variables that shouldn't be scaled? Why/why not?
print("Target should not be scaled.")

# 0.4. Are the variables centered (i.e. do they have zero mean)?
#      Do we want them to be centered?
#      Are there any variables that don't need to, or should not, be centered?
print("Yes, we want them to be centered to ensure that the intercept is correctly calculated.")
print("Target should not be centered.")

# 0.5. Center and scale the data appropriately.
#      You can either do it manually or using the StandardScaler funcion from scikit.
#      Save the modified data in a new data frame to avoid modifying the original data (e.g. by using the df.copy() method before scaling).

diabetes_scaled = diabetes.copy()
features_to_scale = diabetes.columns.drop('target')

scaler = StandardScaler()
diabetes_scaled[features_to_scale] = scaler.fit_transform(diabetes_scaled[features_to_scale])

print(diabetes_scaled.describe())

              age         bmi          bp          s1          s2          s3  \
count  442.000000  442.000000  442.000000  442.000000  442.000000  442.000000   
mean    48.518100   26.375792   94.647014  189.140271  115.439140   49.788462   
std     13.109028    4.418122   13.831283   34.608052   30.413081   12.934202   
min     19.000000   18.000000   62.000000   97.000000   41.600000   22.000000   
25%     38.250000   23.200000   84.000000  164.250000   96.050000   40.250000   
50%     50.000000   25.700000   93.000000  186.000000  113.000000   48.000000   
75%     59.000000   29.275000  105.000000  209.750000  134.500000   57.750000   
max     79.000000   42.200000  133.000000  301.000000  242.400000   99.000000   

               s4          s5          s6      target  ...          s2s3  \
count  442.000000  442.000000  442.000000  442.000000  ...    442.000000   
mean     4.070249    4.641411   91.260181  152.133484  ...   5670.432692   
std      1.290450    0.522391   11.496335 

In [7]:
# 1
X =  diabetes_scaled[diabetes_scaled.columns.drop('target')]
y = diabetes_scaled['target']
X = sm.add_constant(X)
model_0 = sm.OLS(y, X)
results_0 = model_0.fit()
print(results_0.summary())

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.524
Model:                            OLS   Adj. R-squared:                  0.497
Method:                 Least Squares   F-statistic:                     19.14
Date:                Thu, 30 May 2024   Prob (F-statistic):           5.24e-53
Time:                        17:47:08   Log-Likelihood:                -2383.0
No. Observations:                 442   AIC:                             4816.
Df Residuals:                     417   BIC:                             4918.
Df Model:                          24                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.601     58.488      0.0

In [27]:
# 2
X =  diabetes_scaled[diabetes_scaled.columns.drop('target')]
y = diabetes_scaled['target']

k = 10
scores = cross_val_score(LinearRegression(fit_intercept=True), X, y, cv=k, scoring='neg_root_mean_squared_error')

mean_rmse = -scores.mean()
std_rmse = scores.std()


print(f"OLS mean RMSE: {mean_rmse}, Std Dev RMSE: {std_rmse}")

OLS mean RMSE: 56.64405649468391, Std Dev RMSE: 2.814783527571737


In [25]:
# 3
X =  diabetes_scaled[diabetes_scaled.columns.drop('target')]
y = diabetes_scaled['target']

k = 10
scores = cross_val_score(Ridge(fit_intercept=True, alpha=1), X, y, cv=k, scoring='neg_root_mean_squared_error')

mean_rmse = -scores.mean()
std_rmse = scores.std()


print(f"Ridge mean RMSE: {mean_rmse}, Std Dev RMSE: {std_rmse}")

Ridge mean RMSE: 55.94173427375843, Std Dev RMSE: 3.67183356661185


In [28]:
# 4
X =  diabetes_scaled[diabetes_scaled.columns.drop('target')]
y = diabetes_scaled['target']

k = 10
scores = cross_val_score(Lasso(fit_intercept=True, alpha=1), X, y, cv=k, scoring='neg_root_mean_squared_error')

mean_rmse = -scores.mean()
std_rmse = scores.std()


print(f"Lasso mean RMSE: {mean_rmse}, Std Dev RMSE: {std_rmse}")

Lasso mean RMSE: 55.470788443342585, Std Dev RMSE: 3.8302174042496295


In [31]:
# 5
X =  diabetes_scaled[diabetes_scaled.columns.drop('target')]
y = diabetes_scaled['target']

model_forward = LinearRegression(fit_intercept=True)
sfs = SequentialFeatureSelector(model_forward, n_features_to_select='auto', direction='forward', cv=10, tol=1e-100)

# Fit the SequentialFeatureSelector
sfs.fit(X, y)

# Get the selected features
selected_features = sfs.get_support(indices=True)
selected_feature_names = X.columns[selected_features]
print("Selected features:", selected_feature_names)

# Fit the model with the selected features
model_forward.fit(X.iloc[:, selected_features], y)

# Evaluate the new model using cross-validation
cv_scores = cross_val_score(model_forward, X.iloc[:, selected_features], y, cv=10, scoring='neg_root_mean_squared_error')

# Calculate mean and standard deviation of RMSE scores
mean_rmse = -cv_scores.mean()
std_rmse = cv_scores.std()

# Print the results
print("Model with Selected Features:")
print(f"Mean RMSE: {mean_rmse}, Std Dev RMSE: {std_rmse}")

X =  diabetes_scaled.iloc[:, selected_features]
y = diabetes_scaled['target']
X = sm.add_constant(X)
model_forward_sm = sm.OLS(y, X)
results_forward_sm = model_forward_sm.fit()
print(results_forward_sm.summary())


Selected features: Index(['bmi', 'bp', 's5', 's1s3'], dtype='object')
Model with Selected Features:
Mean RMSE: 55.09543320219444, Std Dev RMSE: 3.925567070436146
                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.487
Model:                            OLS   Adj. R-squared:                  0.482
Method:                 Least Squares   F-statistic:                     103.5
Date:                Thu, 30 May 2024   Prob (F-statistic):           5.84e-62
Time:                        18:14:48   Log-Likelihood:                -2399.8
No. Observations:                 442   AIC:                             4810.
Df Residuals:                     437   BIC:                             4830.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t     

In [None]:
# Put your code here


                age           bmi            bp            s1            s2  \
count  4.420000e+02  4.420000e+02  4.420000e+02  4.420000e+02  4.420000e+02   
mean  -8.037814e-18  1.044916e-16  1.054963e-15 -2.973991e-16 -5.777179e-16   
std    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
min   -2.251738e+00 -1.895781e+00 -2.360375e+00 -2.662394e+00 -2.427874e+00   
25%   -7.832846e-01 -7.188104e-01 -7.697777e-01 -7.192046e-01 -6.375263e-01   
50%    1.130443e-01 -1.529591e-01 -1.190789e-01 -9.073818e-02 -8.020037e-02   
75%    7.995940e-01  6.562083e-01  7.485196e-01  5.955183e-01  6.267323e-01   
max    2.325260e+00  3.581660e+00  2.772916e+00  3.232188e+00  4.174548e+00   

                 s3            s4            s5            s6      target  \
count  4.420000e+02  4.420000e+02  4.420000e+02  4.420000e+02  442.000000   
mean  -1.085105e-16 -1.547279e-16  2.371155e-16  2.250588e-16  152.133484   
std    1.000000e+00  1.000000e+00  1.000000e+00  1.000000

**Exercise 2.** In the previous exercise, Lasso and Ridge regularized models obtained a comparable performance to each other and to an OLS model constructed after a forward feature selection. This is not always the case! To highlight some of the differences between these techniques, in this exercise we will compare the performance of the Ridge and Lasso regularizations on simulated data.

1. Simulate a $240\times 100$ matrix of independent variables $X$ from the uniform distribution $Unif(0, 1)$. We will use the upper half of the matrix as the training data set, and the lower half as the test one.
2. Simulate 6 dependent variables $Y_1, \dots, Y_6$, corresponding to the following two models:
  
    1: $\beta_1, \dots, \beta_{100}=1$,  
    2: $\beta_1, \dots, \beta_{10}=1,\ \beta_{11}, \dots, \beta_{100}=0$,

  and three values of the standard deviation of the error: $\sigma = 0.01,\ 0.1,\ 1$. Each of the $Y_i$ variables should be a vector of length 240 corresponding to a given model and a given standard deviation of the error.  

3. For each of the dependent variables, use the first 120 observations to fit a linear model, a Ridge regression model with $\alpha=3$, and a LASSO model with $\alpha=0.1$. Use the model to predict the values of the dependent variable for the lower half of the $X$ matrix and calculate the RMSE.

4. Create a bar plot showing the RMSE for the three algorithms depending on the data set. What is the advantage of each regularization? Are there cases when no regularization should be used? Is there a single regularization that works best for all data sets? Run your code several times to compare the results for different $X$ matrices.


In [None]:
# Put your code here


**Exercise 3.** In the previous two exercises, we have used Ridge and Lasso with seemingly arbitrary regularization parameters. In this exercise, we will finally see how to properly select the values of these parameters. The answer is, again, cross-validation. We create a range of parameter values to inspect, e.g. with `np.linspace(0, 1, num=100)`, and, for each value, we perform a cross-validation to estimate the prediction error. Finally, we select the parameter which gives the lowest error.  

1. Use the `GridSearchCV` function from `scikit` to find the optimal regularization parameter for the Ridge regression on the diabetes dataset that you pre-processed in Exercise 1. Pay attention to the `cv` and `scoring` keyword arguments. Set any reasonable range of the parameter values to inspect - we will adjust it in a moment anyway.  
  1.1. Create a data frame with the results of the grid search. Use it to create a plot of the mean test score agains the value of the regularization parameter.   
  1.2. Based on the plot, see if you need to adjust the range of the parameter values to find the (approximate) optimum. Adjust it accordingly.  
  1.3. What is the optimal RMSE?   
2. Use the `GridSearchCV` function to optimize the Lasso parameter. Repeat the steps from point 1.  
  2.1. Do you encounter numerical problems for small values of the regularization parameter? If so, how to choose the optimal value?  
  2.2. Is the optimal value of the parameter the same as for the Ridge regression? Is this a feature of the two models, or of this particular data set?   
  2.3. Is the optimal RMSE lower or higher than for the Ridge regression in this case? Do you suspect that this is a feature of Lasso, or of this particular data set?   
3. \* An *Elastic net* is a straightforward combination of Ridge and Lasso regularizations - we simply add two penalty terms to the MSE term. Use appropriate `scikit` functions to fit an Elastic net model to the city income data. Use cross validation to find optimal regularization parameters. Try to come up with a way to visualize the dependence of RMSE on the value of the parameters. Can you find a set of parameters which result in a lower RMSE than the Ridge and Lasso models? Is the method numerically stable for low values of the regularization parameters?   

In [43]:
X =  diabetes_scaled[diabetes_scaled.columns.drop('target')]
y = diabetes_scaled['target']
params = {'alpha': (np.linspace(0, 80, num=100))}
ridge_model = GridSearchCV(Ridge(), params, cv = 10, scoring='neg_root_mean_squared_error')
ridge_model.fit(X, y)
print(ridge_model.best_params_)
print(ridge_model.best_score_)

results = pd.DataFrame(ridge_model.cv_results_)

# Extract the relevant columns
results = results[['param_alpha', 'mean_test_score']]

# Convert mean_test_score from negative MSE to positive MSE for clarity
results['mean_test_score'] = results['mean_test_score']

# Create the plot using Plotly Express
fig = px.scatter(results, x='param_alpha', y='mean_test_score',
              labels={'param_alpha': 'Regularization parameter (alpha)', 'mean_test_score': 'Mean test score (MSE)'},
              title='RIDGE: Mean Test Score vs. Regularization Parameter (alpha)')
fig.show()

{'alpha': 27.474747474747474}
-55.6011582698341


In [46]:
X =  diabetes_scaled[diabetes_scaled.columns.drop('target')]
y = diabetes_scaled['target']
params = {'alpha': (np.linspace(1, 80, num=100))}
lasso_model = GridSearchCV(Lasso(max_iter=10000), params, cv = 10, scoring='neg_root_mean_squared_error')
lasso_model.fit(X, y)
print(lasso_model.best_params_)
print(lasso_model.best_score_)

results = pd.DataFrame(lasso_model.cv_results_)

# Extract the relevant columns
results = results[['param_alpha', 'mean_test_score']]

# Convert mean_test_score from negative MSE to positive MSE for clarity
results['mean_test_score'] = results['mean_test_score']

# Create the plot using Plotly Express
fig = px.scatter(results, x='param_alpha', y='mean_test_score',
              labels={'param_alpha': 'Regularization parameter (alpha)', 'mean_test_score': 'Mean test score (MSE)'},
              title='LASSO: Mean Test Score vs. Regularization Parameter (alpha)')
fig.show()

{'alpha': 1.797979797979798}
-55.45086051435933


In [None]:
# Put your code here


## Principal Component Analysis

One of the most common techniques of exploratory data analysis is the Principal Component Analysis (PCA). PCA is a transformation of the features (the columns of the data set) into *principal components*, which are weighted sums of the columns:
$$PC_i = w_{i1} X_1 + \dots + w_{ik}X_k,$$
where $X_1, \dots, X_k$ are the columns of our data frame and $w_{ij}$ is the weight associated with $i$-th principal component and $j$-th column.

The first principal component corresponds to the direction of the largest variance of the data set - that is, the direction in which our data set is "biggest".  Each next principal component is selected to maximize the variance in a subspace perpendicular to all the previous components. PCA allows us to detect trends in high-dimentional data, reduce the number of dimentions, and visualize the data set on a two-dimentional scatter plot. This can be used, for example, for a rough check if the data contains clusters.  

The transformation is defined so that for each principal component, the vector of weights has a unit length: $w_{i1}^2 + \dots + w_{ik}^2 = 1$. The components of these vectors, $w_{ij}$ are also called *loadings*, and $w_{i}$ are called *loading vectors*.  

Did I just write that $w_{ij}$ are called loadings? I'm sorry, I meant to write that *a lot authors* call them loadings. For example, that's the name used in *The Elements of Statistical Learning*, one of the most influential sources in statistics. But other authors use the term *loadings* to refer to the vectors of weights multiplied by the standard deviation of the data projected onto the associated component (i.e. $w_i \cdot \text{sd}(PC_i)$). For example, that's the name used on Wikipedia, also one of the most influential sources in statistics. In this course, we follow the convention from *The Elements of Statistical Learning*.

To add to the confusion, there are even more naming conventions: some authors call the $w_{i}$ the principal components, and call $PC_i$ the *scores* or the *principal component scores*. Some other authors call $w_{i}$ the *principal axes* or *principal directions*. Unfortunately, the terminology used in PCA is *unbelievably* inconsistent (see the discussions [here](https://stats.stackexchange.com/questions/88118/what-exactly-is-called-principal-component-in-pca), [here](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues), [here](https://stats.stackexchange.com/questions/125684/how-does-fundamental-theorem-of-factor-analysis-apply-to-pca-or-how-are-pca-l), [here](https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another), [here](https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca)...).  

Because of this inconsistency, you cannot rely too much on terminology when you read about PCA from different sources. You need to understand the mathematics and figure out from context what people mean by all these terms. To make it simpler, remember that we have four main mathematical concepts:
 - the $w_{i}$, which are vectors that point to the directions of the maximum variances in the data (i.e. the directions in which the data is "the biggest"). In this course, we call this the *principal directions* or *loadings*;
 - the $PC_i$, which are columns of the data points projected onto the $w_i$ vectors (i.e. the transformed observations). In this course, we call this the *principal components*;
 - the $\text{sd}(PC_i)$, which is the variance of the data after projecting it onto $w_i$ (i.e. the variance "explained" by the direction $w_i$). In this course, we call this the *variances*;
 - the $w_i \cdot \text{sd}(PC_i)$, which is a hybrid concept that combines the i-th direction of the biggest variance with the amount of variance explained by that direction. In this course, we don't use a special name.   

PCA always assumes that the data is **centered** - meaning that values in each column sum up to zero. Many implementations center the data by themselves.  




**Exercise 5.** On the plot below, what do the arrows correspond to?

<center><img src='https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/GaussianScatterPCA.svg/1024px-GaussianScatterPCA.svg.png' height="400">

*Principal component analysis.
[Source](https://commons.wikimedia.org/wiki/File:GaussianScatterPCA.svg)*
</center>

**Exercise 5.** In this exercise, we will use the Principal Component Analysis to analyze the blood serum measurements of patients from the `diabetes` data set (columns s1-s6). The goal of this analysis is to plot the whole data on a 2-dimentional plot. This allows us to check the structure of the data and detect which measuements are correlated. We will do this without looking at the target variable - this is an *unsupervised* technique. We're not interested in the progression of diabetes in this case, but rather in detecting potential problems with our data set such as the presence of distinct groups of patients or redundant variables.  

1. Create a new data frame containing the blood measurement variables from `diabetes` (the corresponding columns are in the `blood_variables` list from the *Data & modules* section). Use the `df.copy()` method from Pandas to avoid modifying the original data.  
2. Perform a principal component analysis of the blood variables. Use the `PCA` method from `sklearn` with `n_components=6`. Check in the documentation if you need to center the data yourself.   
2. Check if the weights of the transformation, stored in the `components_` attribute of the resulting variable, correspond to unit-length vectors (which some author call loadings) or to vectors scaled by the explained variance (which some other authors call loadings). You can do it by calculating the sum of their squares.  
3. Create a bar plot showing the percentage of variance explained by each principal component.  
4. Try to figure out whether the first principal component represents some meaningful quantity, i.e. whether you can explain its meaning in medical terms. The descriptions of the s1-s6 variables can be found in the [documentation of the data set](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset).  
5. Now, perform a centering and scaling of the data, i.e. subtract the mean from each column and then divide each column by the standard deviation of its values, and fit the PCA again or, in case you already worked on scaled data, try to do a PCA without scaling. Repeat points 2-4. Which results changed? Why? In particular, did the interpretation of the principal components change? Why?
6. Create a scatter plot of PC1 vs PC2, and a scatter plot of PC3 vs PC4. You can obtain the values of the principal components by either running the `PCA.fit_transform()` function or first running the `PCA.fit()` and then the `PCA.transform()` functions.   
  7.1. Does the sampled population appear homogeneous, or are there distinct sub-groups?  
  7.2. Do you see any outlying observations?  
7. Color the points of the scatter plots by the corresponding value of the target variable. Do you see any patterns? Compare scatter plots for PCA with and without scaling of the variables.  

     



In [None]:
# Put your code here


Sums of squared component coordinates before scaling: [1. 1. 1. 1. 1. 1.]
Components before scaling:
[[ 0.75257852  0.65220101 -0.01975516  0.01769197  0.00522842  0.08676868]
 [ 0.35994526 -0.35518498  0.82285241 -0.05482102 -0.00619895 -0.25328478]
 [ 0.24286683 -0.39845592 -0.00531336  0.00919538  0.02239053  0.88410094]
 [ 0.49340926 -0.5373713  -0.56145643  0.05766962  0.05044966 -0.38298186]
 [-0.02270569  0.00212056  0.0763452   0.99537234 -0.0536979  -0.00134085]
 [ 0.03335315 -0.0306259  -0.03786161 -0.05005204 -0.99699763  0.00257769]]
Explained variance ratios before scaling:
[8.37673796e-01 9.24511100e-02 5.04738979e-02 1.93105480e-02
 7.91362392e-05 1.15121729e-05]
Correlation between the magnitude of the PC1 component and the mean of the columns:
[[1.        0.8804702]
 [0.8804702 1.       ]]
Correlation between the magnitude of the PC1 component and the standard deviation of the columns:
[[1.         0.93781923]
 [0.93781923 1.        ]]


Sums of squared component coordinates after scaling: [1. 1. 1. 1. 1. 1.]
Components after scaling:
[[ 0.42707351  0.43991107 -0.30325166  0.5009118   0.4148502   0.33032042]
 [ 0.546628    0.41805638  0.66155304 -0.21703282 -0.15116962 -0.13718154]
 [ 0.00885114  0.31590905 -0.34804522  0.28820177 -0.33991153 -0.7618317 ]
 [ 0.11565274 -0.28958975  0.13748803 -0.04133145  0.76843945 -0.54004794]
 [-0.0360978  -0.36394909  0.47693931  0.78029534 -0.17290591  0.00310363]
 [ 0.70996415 -0.56177168 -0.32074362 -0.09140435 -0.26289078  0.00454855]]
Explained variance ratios after scaling:
[0.54594331 0.21808828 0.13336092 0.08775054 0.01342019 0.00143675]



Patient with the highest PC1:
age          50.000000
bmi          29.600000
bp           94.330000
s1          300.000000
s2          242.400000
s3           33.000000
s4            9.090000
s5            4.812200
s6          109.000000
target       84.000000
s1s2      72720.000000
s1s3       9900.000000
s1s4       2727.000000
s1s5       1443.660000
s1s6      32700.000000
s2s3       7999.200000
s2s4       2203.416000
s2s5       1166.477280
s2s6      26421.600000
s3s4        299.970000
s3s5        158.802600
s3s6       3597.000000
s4s5         43.742898
s4s6        990.810000
s5s6        524.529800
Name: 123, dtype: float64

Patient with the lowest PC1:
age         30.0000
bmi         21.3000
bp          87.0000
s1         134.0000
s2          63.0000
s3          63.0000
s4           2.0000
s5           3.6889
s6          66.0000
target     143.0000
s1s2      8442.0000
s1s3      8442.0000
s1s4       268.0000
s1s5       494.3126
s1s6      8844.0000
s2s3      3969.0000
s2s4       126.000

## Principal component regression

Performing PCA and selecting the first few components as independent variables for linear regression is another way to reduce problems with collinearity in the data.  

**Exercise 6.** In this exercise, we will use PCA to reduce the dimentionality and collinearity in the model for prediction of the progression of diabetes.  

1. Perform PCA on the `diabetes` data with interactions between the blood measurement variables (as in Exercise 1).      
2. Select the first few principal components that jointly explain >=95% of the variability in the data.  
3. Create a linear model to predict the progression of diabetes based on  the selected principal components. Compare the cross-validated RMSEs of this model and the best model from Exercise 1.  
4. \* Use cross-validation to select the optimal number of principal components for regression.  
5. Select the first 10 principal components. Use LASSO to select from them the optimal subset for prediction (use the `cross_val_score` function to estimate the penalty parameter of LASSO). How many components were selected? What is the corresponding cross-validated RMSE?   
  5.1.\* Compare the results when you fit LASSO to the first 14 principal components. Did anything change? Why/why not? Pay a particular attention to the selected model. Can adding new columns to the data change which of the old columns get selected? Do you suspect that the answer is the same for a stepwise model selection?       

Do you need to scale the variables in this exercise?

In [None]:
# Put your code here


Principal components:
[[ 0.0911  0.0679  0.128   0.1057  0.2438  0.2436 -0.1132  0.2401  0.195
   0.1611  0.2477  0.0403  0.2693  0.2558  0.2564  0.1035  0.2597  0.27
   0.2601  0.2338 -0.0333 -0.0399  0.2449  0.2474  0.2058 -0.    ]
 [ 0.0189 -0.1518 -0.102  -0.0334  0.1952  0.0978  0.3754 -0.1901 -0.0715
  -0.0415  0.138   0.4112 -0.0595  0.1051  0.1152  0.3699 -0.0617  0.0596
   0.0612  0.1799  0.3794  0.3613 -0.1717 -0.169  -0.0682  0.    ]
 [ 0.241   0.0342  0.2202  0.3427 -0.0831 -0.2464  0.0616 -0.1067  0.2722
   0.4175 -0.1994  0.0087 -0.1232  0.0526  0.1407 -0.1398 -0.2048 -0.1233
  -0.0504 -0.1117  0.1881  0.2647 -0.0071  0.0455  0.4084 -0.    ]
 [-0.4425 -0.5906  0.1093 -0.2697  0.0335 -0.1567  0.005   0.0503  0.3831
  -0.1254 -0.075   0.0087  0.0653  0.1981 -0.0293 -0.1246 -0.0397  0.0054
  -0.1711  0.0486  0.1674 -0.0685  0.1668  0.0159  0.1426  0.    ]
 [-0.3899 -0.0718 -0.1821 -0.3921 -0.0836  0.0334 -0.0047 -0.0147 -0.2837
   0.5159 -0.0049 -0.0748 -0.0209 -0.1792  0.21

LASSO CV RMSE: 54.8477893026939 SD: 4.055191797364895
Component	Coefficient
PC0     	9.295972094949136
PC1     	-8.867625398400039
PC2     	18.139320546323027
PC3     	14.11790179926799
PC4     	-13.69950388207271
PC5     	19.503151208833675
PC6     	3.871125445067289
PC7     	-0.6595272686875172
PC8     	-1.9243796032929081
PC9     	0.0
PC10     	0.0
PC11     	-20.783065548317502
PC12     	-15.203158494569012
PC13     	0.0


**Exercise 7.** Quiz time!  
  1. Assign a proper technique to the rows of the table below. Select from LASSO, Ridge, OLS and PCA.

| Property   | Technique |
|------------------------------------------------------------------------------------------------|-----------|
| This technique combines correlated variables using weighted sums.                               |           |
| When some independent variables are highly correlated, this technique fails.                    |           |
| When some independent variables are correlated, this technique can select one of them.          |           |
| When some independent variables are correlated, this technique gives them similar coefficients. |           |

2. Fill the following tables with Yes, No, or Not Applicable. If you think that none of those answers is correct, explain your reasoning.  

 Standardization of X in the table refers to centering the columns of the matrix of independent variables and and dividing them by their standard deviations. Standardization of Y refers to centering the dependent variable and dividing it by its standard deviation.  

| Technique | Standardization of X changes model coefficients or weights | Standardization of X changes RMSE| Standardization of X changes the set of selected variables |
|-----------|------------------------------|---------------------|----|
| OLS       |                              |                     |   |
| LASSO     |                              |                     |   |
| Ridge     |                              |                      |   |
| PCA       |                              |                      |   |
| PC regression |                          |                      |    |  

--------

| Technique | Standardization of Y changes model coefficients or weights | Standardization of Y changes RMSE| Standardization of Y changes the set of selected variables |
|-----------|------------------------------|---------------------|----|
| OLS       |                 |                     |   |
| LASSO     |                              |                     |   |
| Ridge     |                              |                      |   |
| PCA       |                              |                      |   |
| PC regression |                          |                      |    |  

## How centering the data removes the intercept

This is a short explanation on how we can remove the intercept from our model by centering the data. Suppose our data comes from a linear model $Y_i = \beta_0 + x_i^T\beta + \epsilon_i$, where $x_i, \beta \in \mathbb{R}^k$, and $\mathbb{E}\epsilon_i = 0$ for all i. We have:

$$\bar{Y} := \frac{1}{N}\sum_{i=1}^N Y_i = \frac{1}{N}\sum_{i=1}^N \left( \beta_0 + x_i^T\beta + \epsilon_i\right)$$

$$\frac{1}{N}\sum_{i=1}^N \left( \beta_0 + x_i^T\beta + \epsilon_i\right) = \beta_0 + \frac{1}{N} \sum_{i=1}^N x_i^T\beta + \frac{1}{N} \sum_{i=1}^N \epsilon_i$$

Define $\bar{x} =  \frac{1}{N} \sum_{i=1}^N x_i$, a vector of the mean values of the columns of $X$ (where $x_i = X[i, :]$), and $\bar{\epsilon} =  \frac{1}{N} \sum_{i=1}^N \epsilon_i$. This gives us

$$\bar{Y} = \beta_0 + \bar{x}^T \beta + \bar{\epsilon}.$$

Now, if $\bar{x} = 0$ (because of the centering), $\bar{Y}$ is an unbiased estimator of $\beta_0$ (because $\mathbb{E} \bar{Y} = \mathbb{E} \left(\beta_0 + \bar{\epsilon} \right) = \beta_0 + \mathbb{E}\bar{\epsilon} = \beta_0$). Because of this, if we centered the $X$ matrix, then we can estimate the intercept simply as $\bar{Y}$ and then remove it from $Y$ and fit a model without intercept.  


<center><img src='https://drive.google.com/uc?export=view&id=12CrUdXDAiltLBT26sG7HZ_HciIhvGyT8'></center>|