# Laboratory 6 - Variable Selection and Regularization

## Dataset Information

The dataset contains health-related data for predicting diabetes. Below are the features:

* `Pregnancies`: Number of pregnancies
* `Glucose`: Plasma glucose concentration
* `BloodPressure`: Diastolic blood pressure (mm Hg)
* `SkinThickness`: Triceps skin fold thickness (mm)
* `Insulin`: 2-hour serum insulin (mu U/ml)
* `BMI`: Body mass index
* `DiabetesPedigreeFunction`: Diabetes pedigree function
* `Age`: Age (years)
* `Outcome`: Target variable (1 = Diabetes, 0 = No Diabetes)

The goal is to train a model for predicting the probability that a patient has diabetes given their healthcare data. We start by loading and cleaning the data using polars.

In [None]:
# Standard imports
import numpy as np
from itertools import chain, combinations

# Data manipulation
import pandas as pd
import polars as pl

# Sklearn imports
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SequentialFeatureSelector

# Plotting packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Download the data.
!gdown https://drive.google.com/uc?id=1-_YcEl0q5LsDXRq5eix9K4gjSq78Ffd5

Downloading...
From: https://drive.google.com/uc?id=1-_YcEl0q5LsDXRq5eix9K4gjSq78Ffd5
To: /content/diabetes.csv


In [None]:
# Load the dataset
data = pl.read_csv('diabetes.csv')
data.head()

Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
i64,i64,i64,i64,i64,f64,f64,i64,i64
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1


In [None]:
# Show descriptive statistics
data.describe()

statistic,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
str,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""count""",768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
"""null_count""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""mean""",3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
"""std""",3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
"""min""",0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
"""25%""",1.0,99.0,62.0,0.0,0.0,27.3,0.244,24.0,0.0
"""50%""",3.0,117.0,72.0,23.0,32.0,32.0,0.374,29.0,0.0
"""75%""",6.0,140.0,80.0,32.0,127.0,36.6,0.626,41.0,1.0
"""max""",17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [None]:
# Get train and test splits
X_train, X_test, y_train, y_test = train_test_split(data.drop('Outcome'),
                            data.select('Outcome'),
                            test_size = 0.3,
                            random_state = 42)

In [None]:
# BloodPressure, BMI, and Glucose cannot take zero values! Treat zeros as missing values.
# Get the train median values
bp_median = X_train.select('BloodPressure').median()
bmi_median = X_train.select('BMI').median()
glucose_median = X_train.select('Glucose').median()

# Apply to both train and test sets
X_train = X_train.with_columns(pl.col('BloodPressure').replace(0, bp_median))
X_train = X_train.with_columns(pl.col('BMI').replace(0, bmi_median))
X_train = X_train.with_columns(pl.col('Glucose').replace(0, glucose_median))
X_test = X_test.with_columns(pl.col('BloodPressure').replace(0, bp_median))
X_test = X_test.with_columns(pl.col('BMI').replace(0, bmi_median))
X_test = X_test.with_columns(pl.col('Glucose').replace(0, glucose_median))

# Check replacement
X_train.select('BloodPressure', 'BMI', 'Glucose').describe()

statistic,BloodPressure,BMI,Glucose
str,f64,f64,f64
"""count""",537.0,537.0,537.0
"""null_count""",0.0,0.0,0.0
"""mean""",72.232775,32.273557,121.938547
"""std""",12.204867,6.964647,30.142292
"""min""",24.0,18.2,44.0
"""25%""",64.0,27.1,100.0
"""50%""",72.0,32.0,117.0
"""75%""",80.0,36.5,139.0
"""max""",122.0,67.1,199.0


In [None]:
import multiprocessing

cores = multiprocessing.cpu_count() # Count the number of cores in a computer
cores

2
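One detail of the imputation step above is worth noting: the medians are computed on columns that still contain the zero placeholders. A minimal NumPy sketch of an alternative computes the median over the non-zero training values only and reuses it for the test set. The helper name `impute_zeros_with_train_median` and the toy values are hypothetical, not part of the lab data.

```python
import numpy as np

def impute_zeros_with_train_median(train, test):
    """Replace zeros in both arrays with the median of the non-zero train values."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    median = np.median(train[train != 0])  # zeros are treated as missing
    train_out = np.where(train == 0, median, train)
    test_out = np.where(test == 0, median, test)
    return train_out, test_out, median

# Toy example (hypothetical values, not taken from the dataset)
bp_train = [72, 0, 64, 80, 0, 66]
bp_test = [0, 70]
train_imp, test_imp, med = impute_zeros_with_train_median(bp_train, bp_test)
print(med)       # median of the non-zero training values
print(test_imp)  # test zeros replaced with the training median
```

Using only training data for the median keeps information from the test set out of the preprocessing step.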

## Sequential Feature Selection

Now we are ready to apply different feature selection methods to the data. Let's start with sequential feature selection (forward and backward). For this:

1. We will use the [`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html) operator within a pipeline to conduct a cross-validated search for the best subset of variables, with both a forward and a backward search.
2. We will compare the subsets found by both methods and, where they disagree, run an exhaustive search over the variables in dispute.
3. We will train the final model and apply it to the test set to measure its performance.

In [None]:
# Create a pipeline that scales the features and trains a logistic regression
selector_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('sfs', SequentialFeatureSelector(LogisticRegression(solver = 'lbfgs',
                               penalty = None,
                               max_iter = 1000,
                               verbose = 0,
                               random_state = 42,
                               n_jobs = 1,
                               class_weight = 'balanced'),
                      direction = 'forward',
                      scoring = 'roc_auc',
                      cv=5,
                      tol=0.001,
                      n_jobs = -1))
    ])

# Fit the pipeline
selector_pipe.fit(X_train, y_train.to_numpy().ravel())

The core operator is the `SequentialFeatureSelector`, which takes:
1. The model we are selecting features for, in this case, a logistic regression.
2. The direction of the search, which can be `'forward'` or `'backward'`.
3. The scoring used to select the best model. By default, it uses the model's default score method; for logistic regression, that is accuracy. We replace it with `'roc_auc'`. Note that sklearn scorers follow the convention that higher values are better.
4. The `cv` parameter, which can be either a cross-validator (as in the last lab) or simply the number of cross-validation folds to use. We use five here.
5. The `n_jobs` parameter; setting it to `-1` tells sklearn to use all cores.
6. The `tol` parameter, which controls how strict we are when adding or removing a variable. Here we stop if we do not gain at least 0.001 of AUC.
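
The greedy logic behind a forward search with a `tol` stopping rule can be sketched in a few lines. This is a simplified illustration, not sklearn's actual implementation: `forward_select` and `toy_score` are hypothetical names, and the toy score is a made-up additive function standing in for a cross-validated AUC.

```python
def forward_select(features, score_fn, tol=0.001):
    """Greedy forward selection: add the best feature until the gain is below tol."""
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        # Score every candidate extension of the current subset
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        new_score, best_feature = max(scored)
        # Stop when the improvement is smaller than the tolerance
        if new_score - best_score < tol:
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = new_score
    return selected, best_score

# Hypothetical per-feature gains; a real run would use cross-validated AUC
gains = {"Glucose": 0.25, "BMI": 0.04, "Age": 0.02, "Insulin": 0.0005}

def toy_score(subset):
    return 0.5 + sum(gains[f] for f in subset)

selected, score = forward_select(gains, toy_score, tol=0.001)
print(selected)  # ['Glucose', 'BMI', 'Age']
```

`Insulin` is left out because adding it would improve the toy score by only 0.0005, which is below the tolerance.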

Let's check the results.

In [None]:
# Get the selected features
selected_forward = selector_pipe.named_steps['sfs'].get_feature_names_out(input_features = X_train.columns)
selected_forward

array(['Glucose', 'BMI', 'DiabetesPedigreeFunction', 'Age'], dtype=object)

Only four variables were selected! We can now do the same, but with backwards elimination.

In [None]:
# Create a pipeline that scales the features and trains a logistic regression
selector_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('sfs', SequentialFeatureSelector(LogisticRegression(solver = 'lbfgs',
                                                    penalty = None,
                                                    max_iter = 1000,
                                                    verbose = 0,
                                                    random_state = 42,
                                                    n_jobs = 1,
                                                    class_weight = 'balanced'),
                          direction = 'backward',
                          scoring = 'roc_auc',
                          cv=5,
                          tol=0.001,
                          n_jobs = -1))

    ])

# Fit the pipeline
selector_pipe.fit(X_train, y_train.to_numpy().ravel())

# Get the selected features
selected_backward = selector_pipe.named_steps['sfs'].get_feature_names_out(input_features = X_train.columns)
selected_backward

array(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'BMI',
       'DiabetesPedigreeFunction', 'Age'], dtype=object)

We can see that backward elimination is *much* more generous about which variables to include. In fact, it drops only one (`Insulin`)!

We have our search space now. We must keep at least `['Glucose', 'BMI', 'DiabetesPedigreeFunction', 'Age']`, and we must conduct an exhaustive search over the remaining three variables (`['Pregnancies', 'BloodPressure', 'SkinThickness']`). Unfortunately, there is no exhaustive search method in sklearn, but we can easily create our own.
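
With three undecided variables there are only $2^3 = 8$ possible subsets, so enumerating them all is cheap. The sketch below lists them with `itertools`; the helper `all_subsets` is a hypothetical name. The empty tuple corresponds to keeping only the forward-selected variables.

```python
from itertools import chain, combinations

def all_subsets(items, include_empty=True):
    """Return every subset of items as a list of tuples."""
    start = 0 if include_empty else 1
    return list(chain.from_iterable(
        combinations(items, k) for k in range(start, len(items) + 1)))

candidates = ['Pregnancies', 'BloodPressure', 'SkinThickness']
subsets = all_subsets(candidates)
print(len(subsets))  # 8 subsets, including the empty one
for s in subsets:
    print(s)
```

Whether to include the empty subset is a design choice: including it lets the forward-selected model compete against every extension.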

In [None]:
# Perform exhaustive search over the three variables that remain.
exhaust_vars = ['Pregnancies', 'BloodPressure', 'SkinThickness']

# Get all non-empty combinations of the three variables
# (the empty subset, i.e. the forward-selected features alone, is not evaluated)
n_features = len(exhaust_vars)
subsets = chain.from_iterable(combinations(range(n_features), k + 1)
               for k in range(n_features))

best_score = -np.inf
best_subset = None

# Create a pipeline with scaler and logistic regression
sequential_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(solver = 'lbfgs',
                                  penalty = None,
                                  max_iter = 1000,
                                  verbose = 0,
                                  random_state = 42,
                                  n_jobs = -1,
                                  class_weight = 'balanced'))
])

# Iterate over all subsets
for subset in subsets:
    # Combine the selected forward features with the subset, from the exhaustive search
    subset_vars = list(selected_forward) + [exhaust_vars[i] for i in subset]
    # No scoring argument is passed, so cross_val_score uses the
    # estimator's default score (accuracy for logistic regression)
    score = cross_val_score(sequential_pipe,
                            X_train.select(subset_vars),
                            y_train.to_numpy().ravel(),
                            cv=5).mean()
    if score > best_score:
        best_score, best_subset = score, subset

# Get the best subset
best_subset_vars = list(selected_forward) + [exhaust_vars[i] for i in best_subset]
best_subset_vars

['Glucose',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age',
 'BloodPressure',
 'SkinThickness']

Our best subset adds just two of the three candidate variables (`BloodPressure` and `SkinThickness`) to the four selected by the forward search. Let's apply this model to the test set and compute bootstrapped confidence intervals.

In [None]:
# Train a logistic regression with the best subset
sequential_pipe.fit(X_train.select(best_subset_vars), y_train.to_numpy().ravel())

# Apply the model to the test set
y_pred = sequential_pipe.predict(X_test.select(best_subset_vars))
y_prob = sequential_pipe.predict_proba(X_test.select(best_subset_vars))[:, 1]

# Get the accuracy and AUC scores for the test set
accuracy = accuracy_score(y_test.to_pandas(), y_pred)
auc = roc_auc_score(y_test.to_pandas(), y_prob)

# Create a bootstrap measurement for accuracy and AUC. We will use 100 bootstraps. Normally we would use 1000 or more.
n_bootstraps = 100
bootstrapped_accuracy = np.zeros(n_bootstraps)
bootstrapped_auc = np.zeros(n_bootstraps)

for i in range(n_bootstraps):
    # Get the indices for the bootstrap sample
    idx = np.random.choice(len(y_test), len(y_test), replace=True)

    # Get the accuracy of the bootstrap sample
    bootstrapped_accuracy[i] = accuracy_score(y_test.to_pandas().iloc[idx], y_pred[idx])

    # Get the AUC of the bootstrap sample
    bootstrapped_auc[i] = roc_auc_score(y_test.to_pandas().iloc[idx], y_prob[idx])

# Get the differences between the bootstrapped values and the original values
accuracy_diff = bootstrapped_accuracy - accuracy
auc_diff = bootstrapped_auc - auc

# Calculate the 95% confidence interval for the accuracy and AUC
accuracy_ci = np.percentile(accuracy_diff, [2.5, 97.5])
auc_ci = np.percentile(auc_diff, [2.5, 97.5])

# Print the results. Centre the values around the original values.
print(f"The 95% confidence interval for the accuracy (sequential) is [{(accuracy - accuracy_ci[1])*100:.2f}%, {(accuracy - accuracy_ci[0])*100:.2f}%]")
print(f"The 95% confidence interval for the AUC (sequential) is [{(auc - auc_ci[1]):.2f}, {(auc - auc_ci[0]):.2f}]")

The 95% confidence interval for the accuracy (sequential) is [64.94%, 76.42%]
The 95% confidence interval for the AUC (sequential) is [0.73, 0.84]
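
The interval computed above is the basic (pivotal) bootstrap interval, $[2\hat\theta - q_{97.5},\ 2\hat\theta - q_{2.5}]$, where $\hat\theta$ is the test-set statistic and $q$ are percentiles of the bootstrapped statistics. Since the same steps are repeated for every model, they could be wrapped in a helper. The sketch below is one possible version on synthetic data; `basic_bootstrap_ci` and `accuracy` are hypothetical names, and the fixed seed is an addition for reproducibility.

```python
import numpy as np

def basic_bootstrap_ci(y_true, y_hat, metric, n_bootstraps=1000, alpha=0.05, seed=42):
    """Basic (pivotal) bootstrap CI: [2*stat - q_hi, 2*stat - q_lo]."""
    y_true = np.asarray(y_true).ravel()
    y_hat = np.asarray(y_hat).ravel()
    rng = np.random.default_rng(seed)  # seeded for reproducibility
    stat = metric(y_true, y_hat)
    boot = np.empty(n_bootstraps)
    for i in range(n_bootstraps):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        boot[i] = metric(y_true[idx], y_hat[idx])
    # Percentiles of the differences to the original statistic
    q_lo, q_hi = np.percentile(boot - stat, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return stat - q_hi, stat - q_lo

def accuracy(y_true, y_hat):
    return float(np.mean(y_true == y_hat))

# Synthetic labels and predictions (roughly 80% correct)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
y_hat = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)
low, high = basic_bootstrap_ci(y_true, y_hat, accuracy)
print(f"[{low:.3f}, {high:.3f}]")
```

For a metric such as AUC, a resample containing a single class would need to be skipped or redrawn, which this sketch does not handle.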


Let's print the coefficients.

In [None]:
# Get the coefficients.
coefs = sequential_pipe.named_steps['logreg'].coef_[0]

# Create a dataframe with the coefficients and the variable names
coefs_df = pd.DataFrame({'Variable': best_subset_vars, 'Coefficient': coefs})

# Print the coefficients
coefs_df

Unnamed: 0,Variable,Coefficient
0,Glucose,1.20272
1,BMI,0.841162
2,DiabetesPedigreeFunction,0.163302
3,Age,0.535483
4,BloodPressure,-0.126777
5,SkinThickness,-0.125021


Great! Now we are ready to use some more sophisticated methods: L1 LASSO, L2 Ridge, and ElasticNet regularization.

## Regularization

Now we can repeat the analysis, but we will use regularization. scikit-learn already comes with several operators that implement regularized regressions:

- [`ElasticNet`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html) implements the ElasticNet linear regressor. It can act as either LASSO or Ridge by setting `l1_ratio` to 1 or 0, respectively. There are also standalone versions of these models ([`Lasso`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) and [`Ridge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)).
- Logistic Regression accepts both `C` and `l1_ratio` parameters, controlling the overall penalty strength and the L1 share of the penalty, respectively. That is, `C` is the inverse of the regularization strength ($1/\lambda$ using the lecture notation), while `l1_ratio` measures how much of the regularization goes to LASSO. `C` $\in (0, \infty)$, `l1_ratio` $\in [0,1]$. Smaller values of `C` mean stronger regularization.
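
To make the roles of `C` and `l1_ratio` concrete, the sketch below evaluates the elastic net penalty $\frac{1 - \rho}{2}\lVert w \rVert_2^2 + \rho \lVert w \rVert_1$ with $\rho$ = `l1_ratio`, and an objective of the form penalty $+\ C \cdot$ loss. This follows the parametrization described in the scikit-learn user guide as far as I can tell; the function names are hypothetical and the exact scaling inside the library may differ.

```python
import numpy as np

def elastic_net_penalty(w, l1_ratio):
    """Penalty term: (1 - l1_ratio) / 2 * ||w||_2^2 + l1_ratio * ||w||_1."""
    w = np.asarray(w, dtype=float)
    l2_term = 0.5 * np.sum(w ** 2)
    l1_term = np.sum(np.abs(w))
    return (1 - l1_ratio) * l2_term + l1_ratio * l1_term

def penalized_objective(loss, w, C, l1_ratio):
    """Objective in the C-parametrization: penalty + C * loss (C > 0)."""
    if C <= 0:
        raise ValueError("C must be strictly positive")
    return elastic_net_penalty(w, l1_ratio) + C * loss

w = [1.0, -2.0, 0.5]
print(elastic_net_penalty(w, 0.0))  # pure Ridge: 0.5 * (1 + 4 + 0.25) = 2.625
print(elastic_net_penalty(w, 1.0))  # pure LASSO: 1 + 2 + 0.5 = 3.5
```

Because `C` multiplies the loss rather than the penalty, a small `C` makes the penalty dominate, which is why smaller values mean stronger regularization.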

However, the far more interesting [`ElasticNetCV`](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.ElasticNetCV.html) and [`LogisticRegressionCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html) implement both Ridge and LASSO **and** search for the best parameters using cross-validation, making them a better choice for most uses.

Let's use [`LogisticRegressionCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html) to implement a logistic regression with Ridge, LASSO, and ElasticNet penalization. Let's start with a Ridge model.

In [None]:
# Create a pipeline using LogisticRegressionCV to get the best model.
ridge_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegressionCV(solver = 'saga',
                                    Cs = 10,
                                    penalty = 'elasticnet',
                                    max_iter = 1000,
                                    verbose = 0,
                                    random_state = 42,
                                    n_jobs = -1,
                                    class_weight = 'balanced',
                                    l1_ratios = [0],
                                    cv = 5))
])

# Fit the pipeline
ridge_pipe.fit(X_train, y_train.to_numpy().ravel())

# Apply the model to the test set
y_pred = ridge_pipe.predict(X_test)
y_prob = ridge_pipe.predict_proba(X_test)[:, 1]

# Get the accuracy and AUC scores for the test set
accuracy = accuracy_score(y_test.to_pandas(), y_pred)
auc = roc_auc_score(y_test.to_pandas(), y_prob)

# Create a bootstrap measurement for accuracy and AUC. We will use 100 bootstraps. Normally we would use 1000 or more.
n_bootstraps = 100
bootstrapped_accuracy = np.zeros(n_bootstraps)
bootstrapped_auc = np.zeros(n_bootstraps)

for i in range(n_bootstraps):
    # Get the indices for the bootstrap sample
    idx = np.random.choice(len(y_test), len(y_test), replace=True)

    # Get the accuracy of the bootstrap sample
    bootstrapped_accuracy[i] = accuracy_score(y_test.to_pandas().iloc[idx], y_pred[idx])

    # Get the AUC of the bootstrap sample
    bootstrapped_auc[i] = roc_auc_score(y_test.to_pandas().iloc[idx], y_prob[idx])

# Get the differences between the bootstrapped values and the original values
accuracy_diff = bootstrapped_accuracy - accuracy
auc_diff = bootstrapped_auc - auc

# Calculate the 95% confidence interval for the accuracy and AUC
accuracy_ci = np.percentile(accuracy_diff, [2.5, 97.5])
auc_ci = np.percentile(auc_diff, [2.5, 97.5])

# Print the results. Centre the values around the original values.
print(f"The 95% confidence interval for the accuracy (Ridge) is [{(accuracy - accuracy_ci[1])*100:.2f}%, {(accuracy - accuracy_ci[0])*100:.2f}%]")
print(f"The 95% confidence interval for the AUC (Ridge) is [{(auc - auc_ci[1]):.2f}, {(auc - auc_ci[0]):.2f}]")

The 95% confidence interval for the accuracy (Ridge) is [65.37%, 76.62%]
The 95% confidence interval for the AUC (Ridge) is [0.75, 0.86]


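Note that we passed `penalty = 'elasticnet'` with `l1_ratios = [0]`, which is equivalent to a pure L2 (Ridge) penalty. Had we set `penalty = 'l2'` while still supplying `l1_ratios`, sklearn would raise a warning; passing `elasticnet` as the penalty suppresses it. Let's check the coefficients and the value of the selected hyperparameter.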

In [None]:
# Get the variables and coefficients
variables = X_train.columns
coefficients = ridge_pipe.named_steps['logreg'].coef_[0]

# Create a dataframe with the variables and coefficients
coef_df = pd.DataFrame({'Variable': variables, 'Coefficient': coefficients})

# Get the regression hyperparameters
C = ridge_pipe.named_steps['logreg'].C_[0]
penalty = ridge_pipe.named_steps['logreg'].penalty

# Print the hyperparameters
print(f"The best hyperparameters for the model are C = {C:.3f} for an L2 penalty")
coef_df

The best hyperparameters for the model are C = 21.544 for an L2 penalty


Unnamed: 0,Variable,Coefficient
0,Pregnancies,0.224411
1,Glucose,1.274672
2,BloodPressure,-0.137595
3,SkinThickness,-0.068687
4,Insulin,-0.147398
5,BMI,0.838635
6,DiabetesPedigreeFunction,0.176809
7,Age,0.410403


We can see the shrinkage effect now. Some variables (for example, `Age`) have a smaller coefficient than before. Note how the Ridge regression keeps all variables. Do we see a change with LASSO?
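
To see why Ridge keeps every variable while LASSO can drop some, consider the textbook special case of a linear model with orthonormal predictors. This is an assumption made purely for illustration; the logistic models in this lab have no such closed form. With the objectives $\frac{1}{2}\lVert y - X\beta \rVert^2 + \frac{\lambda}{2}\lVert \beta \rVert_2^2$ and $\frac{1}{2}\lVert y - X\beta \rVert^2 + \lambda\lVert \beta \rVert_1$, Ridge rescales each least-squares coefficient by $1/(1+\lambda)$, while LASSO applies soft-thresholding. The function names and coefficient values below are hypothetical.

```python
import numpy as np

def ridge_shrink(beta, lam):
    """Ridge in the orthonormal case: proportional shrinkage."""
    return np.asarray(beta, dtype=float) / (1.0 + lam)

def lasso_shrink(beta, lam):
    """LASSO in the orthonormal case: soft-thresholding at lam."""
    beta = np.asarray(beta, dtype=float)
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

beta_ols = np.array([1.2, 0.8, -0.1, 0.05])  # hypothetical least-squares coefficients
ridge = ridge_shrink(beta_ols, 0.25)
lasso = lasso_shrink(beta_ols, 0.25)
print(ridge)  # every coefficient shrinks but stays non-zero
print(lasso)  # the two small coefficients become exactly zero
```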

In [None]:
# Create a pipeline using LogisticRegressionCV to get the best model.
lasso_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegressionCV(solver = 'saga',
                                    Cs = 10,
                                    penalty = 'l1',
                                    max_iter = 1000,
                                    verbose = 0,
                                    random_state = 42,
                                    n_jobs = -1,
                                    class_weight = 'balanced',
                                    #l1_ratios = [1],
                                    cv = 5))
])

# Fit the pipeline
lasso_pipe.fit(X_train, y_train.to_numpy().ravel())

# Apply the model to the test set
y_pred = lasso_pipe.predict(X_test)
y_prob = lasso_pipe.predict_proba(X_test)[:, 1]

# Get the accuracy and AUC scores for the test set
accuracy = accuracy_score(y_test.to_pandas(), y_pred)
auc = roc_auc_score(y_test.to_pandas(), y_prob)

# Create a bootstrap measurement for accuracy and AUC. We will use 100 bootstraps. Normally we would use 1000 or more.
n_bootstraps = 100
bootstrapped_accuracy = np.zeros(n_bootstraps)
bootstrapped_auc = np.zeros(n_bootstraps)

for i in range(n_bootstraps):
    # Get the indices for the bootstrap sample
    idx = np.random.choice(len(y_test), len(y_test), replace=True)

    # Get the accuracy of the bootstrap sample
    bootstrapped_accuracy[i] = accuracy_score(y_test.to_pandas().iloc[idx], y_pred[idx])

    # Get the AUC of the bootstrap sample
    bootstrapped_auc[i] = roc_auc_score(y_test.to_pandas().iloc[idx], y_prob[idx])

# Get the differences between the bootstrapped values and the original values
accuracy_diff = bootstrapped_accuracy - accuracy
auc_diff = bootstrapped_auc - auc

# Calculate the 95% confidence interval for the accuracy and AUC
accuracy_ci = np.percentile(accuracy_diff, [2.5, 97.5])
auc_ci = np.percentile(auc_diff, [2.5, 97.5])

# Print the results. Centre the values around the original values.
print(f"The 95% confidence interval for the accuracy (LASSO) is [{(accuracy - accuracy_ci[1])*100:.2f}%, {(accuracy - accuracy_ci[0])*100:.2f}%]")
print(f"The 95% confidence interval for the AUC (LASSO) is [{(auc - auc_ci[1]):.2f}, {(auc - auc_ci[0]):.2f}]")

The 95% confidence interval for the accuracy (LASSO) is [64.94%, 75.55%]
The 95% confidence interval for the AUC (LASSO) is [0.76, 0.86]


In [None]:
# Get the variables and coefficients
variables = X_train.columns
coefficients = lasso_pipe.named_steps['logreg'].coef_[0]

# Create a dataframe with the variables and coefficients
coef_df = pd.DataFrame({'Variable': variables, 'Coefficient': coefficients})

# Get the regression hyperparameters
C = lasso_pipe.named_steps['logreg'].C_[0]
penalty = lasso_pipe.named_steps['logreg'].penalty

# Print the hyperparameters
print(f"The best hyperparameters for the model are C = {C:.3f} for an l1 penalty")
coef_df

The best hyperparameters for the model are C = 2.783 for an l1 penalty


Unnamed: 0,Variable,Coefficient
0,Pregnancies,0.220719
1,Glucose,1.264861
2,BloodPressure,-0.127721
3,SkinThickness,-0.064128
4,Insulin,-0.141025
5,BMI,0.827819
6,DiabetesPedigreeFunction,0.170349
7,Age,0.405606
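
The selected values of `C` are not arbitrary. According to the scikit-learn documentation, an integer `Cs` requests that many values on a logarithmic grid between 1e-4 and 1e4. The sketch below reproduces such a grid with `np.logspace`, which is an assumption about how the grid is built internally. Both values reported in this lab, 21.544 for Ridge and 2.783 for LASSO, are points of that grid.

```python
import numpy as np

# Ten candidate values for C, log-spaced between 1e-4 and 1e4
Cs_grid = np.logspace(-4, 4, 10)
print(np.round(Cs_grid, 3))
```

With only ten candidates the search is coarse; a finer grid around the selected value could be passed explicitly as a list to `Cs`.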


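It seems that all models perform similarly. LASSO shows a slight edge in AUC, but it is not statistically significant. Note also that, at the selected `C`, LASSO does not set any coefficient to exactly zero, so it keeps all eight variables. Finally, let's train an elastic net and see what the model decides.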

In [None]:
# Create a pipeline using LogisticRegressionCV to get the best model.
elastic_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegressionCV(solver = 'saga',
                    Cs = 10,
                    penalty = 'elasticnet',
                    max_iter = 1000,
                    verbose = 0,
                    random_state = 42,
                    n_jobs = -1,
                    class_weight = 'balanced',
                    l1_ratios = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1],
                    cv = 5))
])

# Fit the pipeline
elastic_pipe.fit(X_train, y_train.to_numpy().ravel())

# Apply the model to the test set
y_pred = elastic_pipe.predict(X_test)
y_prob = elastic_pipe.predict_proba(X_test)[:, 1]

# Get the accuracy and AUC scores for the test set
accuracy = accuracy_score(y_test.to_pandas(), y_pred)
auc = roc_auc_score(y_test.to_pandas(), y_prob)

# Create a bootstrap measurement for accuracy and AUC. We will use 100 bootstraps. Normally we would use 1000 or more.
n_bootstraps = 100
bootstrapped_accuracy = np.zeros(n_bootstraps)
bootstrapped_auc = np.zeros(n_bootstraps)

for i in range(n_bootstraps):
    # Get the indices for the bootstrap sample
    idx = np.random.choice(len(y_test), len(y_test), replace=True)

    # Get the accuracy of the bootstrap sample
    bootstrapped_accuracy[i] = accuracy_score(y_test.to_pandas().iloc[idx], y_pred[idx])

    # Get the AUC of the bootstrap sample
    bootstrapped_auc[i] = roc_auc_score(y_test.to_pandas().iloc[idx], y_prob[idx])

# Get the differences between the bootstrapped values and the original values
accuracy_diff = bootstrapped_accuracy - accuracy
auc_diff = bootstrapped_auc - auc

# Calculate the 95% confidence interval for the accuracy and AUC
accuracy_ci = np.percentile(accuracy_diff, [2.5, 97.5])
auc_ci = np.percentile(auc_diff, [2.5, 97.5])

# Print the results. Centre the values around the original values.
print(f"The 95% confidence interval for the accuracy (ElasticNet) is [{(accuracy - accuracy_ci[1])*100:.2f}%, {(accuracy - accuracy_ci[0])*100:.2f}%]")
print(f"The 95% confidence interval for the AUC (ElasticNet) is [{(auc - auc_ci[1]):.2f}, {(auc - auc_ci[0]):.2f}]")

# Get the variables and coefficients
variables = X_train.columns
coefficients = elastic_pipe.named_steps['logreg'].coef_[0]

# Create a dataframe with the variables and coefficients
coef_df = pd.DataFrame({'Variable': variables, 'Coefficient': coefficients})

# Get the regression hyperparameters
C = elastic_pipe.named_steps['logreg'].C_[0]
l1_ratio = elastic_pipe.named_steps['logreg'].l1_ratio_[0]
penalty = elastic_pipe.named_steps['logreg'].penalty

# Print the hyperparameters
print(f"The best hyperparameters for the model are C = {C:.3f} and l1_ratio = {l1_ratio} for an ElasticNet penalty")
coef_df

The 95% confidence interval for the accuracy (ElasticNet) is [65.80%, 75.76%]
The 95% confidence interval for the AUC (ElasticNet) is [0.76, 0.85]
The best hyperparameters for the model are C = 21.544 and l1_ratio = 0 for an ElasticNet penalty


Unnamed: 0,Variable,Coefficient
0,Pregnancies,0.224411
1,Glucose,1.274672
2,BloodPressure,-0.137595
3,SkinThickness,-0.068687
4,Insulin,-0.147398
5,BMI,0.838635
6,DiabetesPedigreeFunction,0.176809
7,Age,0.410403


Interestingly, ElasticNet settles on a pure Ridge regression (`l1_ratio=0`), with coefficients identical to those of the Ridge model above. This is not very surprising, as the small sample size and the previous results suggest there is not much to gain from removing variables.

To compare the models side-by-side, let's calculate one ROC curve for every model in one plot.

In [None]:
# Calculate the ROC curves for all models
fpr_seq, tpr_seq, _ = roc_curve(y_test.to_pandas(), sequential_pipe.predict_proba(X_test.select(best_subset_vars))[:, 1])
fpr_ridge, tpr_ridge, _ = roc_curve(y_test.to_pandas(), ridge_pipe.predict_proba(X_test)[:, 1])
fpr_lasso, tpr_lasso, _ = roc_curve(y_test.to_pandas(), lasso_pipe.predict_proba(X_test)[:, 1])
fpr_elastic, tpr_elastic, _ = roc_curve(y_test.to_pandas(), elastic_pipe.predict_proba(X_test)[:, 1])

# Plot the ROC curves
plt.figure(figsize=(10, 8))
plt.plot(fpr_seq, tpr_seq, label = 'Sequential')
plt.plot(fpr_ridge, tpr_ridge, label = 'Ridge')
plt.plot(fpr_lasso, tpr_lasso, label = 'LASSO')
plt.plot(fpr_elastic, tpr_elastic, label = 'ElasticNet')

plt.plot([0, 1], [0, 1], 'k--', label = 'Random')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curves for all models')
plt.legend()
plt.show()


Here we can see what we suspected: no model is a clear winner. ElasticNet has a slight edge across most of the curve, but this difference is not statistically significant. The sequential model seems to be the weakest. This is not surprising either, as greedy algorithms are known not to be the best selection method. In fact, around the middle of the curve, there is a noticeable drop in quality for the sequential model. Between 0.3 and 0.4 FPR, however, it is the sequential model that has a slight, but most likely not significant, edge. LASSO also appears to be dominated by ElasticNet/Ridge across most of the curve.
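
Claims about statistical significance between two models can be checked more directly with a paired bootstrap on the AUC difference. The sketch below is one way to do it, on synthetic scores rather than the lab's predictions. It uses a percentile interval (not the basic interval used earlier), and `auc_score` and `paired_auc_diff_ci` are hypothetical helper names. If the resulting interval contains zero, the difference is not significant at the chosen level.

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def paired_auc_diff_ci(y_true, scores_a, scores_b, n_bootstraps=1000, seed=42):
    """Percentile bootstrap CI for AUC(a) - AUC(b) on the same resampled rows."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_bootstraps:
        idx = rng.integers(0, n, n)  # same rows for both models (paired)
        y = y_true[idx]
        if y.min() == y.max():  # skip resamples with a single class
            continue
        diffs.append(auc_score(y, scores_a[idx]) - auc_score(y, scores_b[idx]))
    return tuple(np.percentile(diffs, [2.5, 97.5]))

# Synthetic example: model b is a slightly noisier copy of model a
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 150)
scores_a = y + rng.normal(scale=1.0, size=150)
scores_b = scores_a + rng.normal(scale=0.2, size=150)
low, high = paired_auc_diff_ci(y, scores_a, scores_b)
print(f"AUC difference CI: [{low:.3f}, {high:.3f}]")
```

Resampling the same rows for both models matters: it accounts for the strong correlation between their predictions, which comparing two separate intervals ignores.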

Now we know how to build strong regularized models! Note that with Ridge or ElasticNet we can largely ignore correlations between predictors. For LASSO and sequential selection, we should study correlated variables before training the models and remove those with significant correlation.
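
A simple way to study correlated variables is to compute the pairwise Pearson correlations and flag pairs above a cutoff. The sketch below does this on synthetic data; `correlated_pairs` is a hypothetical helper and the 0.8 threshold is an arbitrary choice that should be adapted to the problem.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.8):
    """Return (name_i, name_j, r) for pairs with |Pearson r| above threshold."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic data: b is nearly a copy of a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + 0.1 * rng.normal(size=300)
c = rng.normal(size=300)
X = np.column_stack([a, b, c])
pairs = correlated_pairs(X, ['a', 'b', 'c'])
print(pairs)
```

For each flagged pair, one of the two variables would typically be dropped before running LASSO or a sequential search.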