## Supervised machine learning: Regression examples

### Data and metadata

The dataset comes from scikit-learn's built-in reference datasets.

In [1]:
from sklearn import datasets

In [2]:
diabetes = datasets.load_diabetes()

### Dataset description

- Feature matrix (a NumPy array) with 442 rows, one per diabetes patient, and 10 columns: age, sex, body mass index (BMI), average blood pressure, and six blood serum measurements
- Goal: predict a quantitative measure of disease progression one year after baseline from these variables
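
As a quick check of the description above, the following sketch inspects the object returned by `load_diabetes()`. It assumes only that scikit-learn is installed; the names `X` and `y` are introduced here for illustration and are not used elsewhere in this notebook.

```python
from sklearn import datasets

# load_diabetes() returns a Bunch: a dict-like object with array attributes
diabetes = datasets.load_diabetes()

X = diabetes.data      # NumPy array of features, one row per patient
y = diabetes.target    # NumPy array with the disease progression measure

print(type(X).__name__, X.shape)   # ndarray (442, 10)
print(y.shape)                     # (442,)
print(diabetes.feature_names)
```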

In [3]:
print(diabetes.DESCR) 

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

### Data

In [5]:
diabetes.data

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286377, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04687948,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452837, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00421986,  0.00306441]])

In [6]:
diabetes.data.shape

(442, 10)

### Target (the variable we want to predict)

In [9]:
diabetes.target

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 28

### Split the dataset into train and test

In [13]:
from sklearn.model_selection import train_test_split

# Split the diabetes dataset into training (80%) and test (20%) sets;
# random_state fixes the shuffle so the split is reproducible
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=1
)

print(diabetes_X_train.shape)
print(diabetes_X_test.shape)

(353, 10)
(89, 10)


### Decision tree based regression model

In [19]:
from sklearn.tree import DecisionTreeRegressor

regr_tree = DecisionTreeRegressor()
regr_tree = regr_tree.fit(diabetes_X_train, diabetes_y_train) #fitting the model to the training data

tree_y_pred = regr_tree.predict(diabetes_X_test) # making predictions on the testing data

print("R2:", regr_tree.score(diabetes_X_test, diabetes_y_test))
# R2 (coefficient of determination): the fraction of the variance in the target that the model explains.
# 1.0 is a perfect fit, 0.0 is no better than always predicting the mean, and negative values are worse than that.

R2: -0.2148625135374207


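An R2 of about -0.215 is below zero, which means the decision tree's predictions on the test set are worse than simply predicting the mean of the test targets for every patient. A fully grown tree like this one (no depth limit) tends to overfit the training data. Because no `random_state` is set on the tree, the exact score can also vary slightly between runs.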
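
To see how R2 can drop below zero, here is a minimal sketch that computes it by hand as 1 - SS_res / SS_tot on a toy example. The helper `r2` and the toy values are illustrative assumptions, not part of the notebook; on real data `sklearn.metrics.r2_score` does the same job.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # squared error of the model
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # squared error of the mean
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]

print(r2(y_true, [1.0, 2.0, 3.0]))  # perfect predictions -> 1.0
print(r2(y_true, [2.0, 2.0, 2.0]))  # always predicting the mean -> 0.0
print(r2(y_true, [3.0, 2.0, 1.0]))  # worse than the mean -> -3.0
```

Whenever the model's squared error exceeds that of the constant mean prediction, the ratio is above 1 and R2 becomes negative.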

#### More metrics

In [20]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

print("MSE:", mean_squared_error(diabetes_y_test, tree_y_pred))
# MSE: mean of the squared differences between the true values (diabetes_y_test)
# and the predictions (tree_y_pred); squaring penalises large errors heavily.
print("MAE:", mean_absolute_error(diabetes_y_test, tree_y_pred))
# MAE: mean of the absolute differences between diabetes_y_test and tree_y_pred;
# each error counts in proportion to its size, so MAE is less sensitive to outliers than MSE.
    
print("R2:", r2_score(diabetes_y_test, tree_y_pred))

MSE: 6473.966292134832
MAE: 61.60674157303371
R2: -0.2148625135374207


MSE: 6473.97

MSE is expressed in squared target units, so it is easier to read through its square root: an RMSE of about 80.5. That is a large typical error for a target whose listed values lie between roughly 40 and 340.

MAE: 61.61

On average, the predictions are off by about 62 units of the disease progression measure, which is also large relative to the spread of the target.
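
The different sensitivity of MSE and MAE to large errors can be seen on a toy example. This is a minimal sketch with hand-written helpers (`mse`, `mae`) and made-up values, chosen only for illustration.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10, 20, 30, 40]
close = [12, 18, 33, 40]        # absolute errors: 2, 2, 3, 0
one_outlier = [12, 18, 33, 60]  # same, but the last error is 20

print(mae(y_true, close), mse(y_true, close))              # 1.75 4.25
print(mae(y_true, one_outlier), mse(y_true, one_outlier))  # 6.75 104.25
print(math.sqrt(mse(y_true, one_outlier)))                 # RMSE, back in target units
```

A single large error raises MAE by a factor of about 4 here, but MSE by a factor of about 25.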

### Linear regression

In [23]:
from sklearn import linear_model

regr_model = linear_model.LinearRegression()
regr_model = regr_model.fit(diabetes_X_train, diabetes_y_train) #fitting the model to the training data

regr_y_pred = regr_model.predict(diabetes_X_test)

#### Metrics 

In [24]:
print("R2:", regr_model.score(diabetes_X_test, diabetes_y_test))
print("MSE:", mean_squared_error(diabetes_y_test, regr_y_pred))
print("MAE:", mean_absolute_error(diabetes_y_test, regr_y_pred))

R2: 0.43843604017332694
MSE: 2992.5576814529445
MAE: 41.974875685462315


### SVM (Support Vector Regression)

In [25]:
from sklearn.svm import SVR
regr_svm = SVR(kernel = "rbf", C = 100)

regr_svm = regr_svm.fit(diabetes_X_train, diabetes_y_train) #fitting the model to the training data

svm_pred = regr_svm.predict(diabetes_X_test)

#### Metrics

In [26]:
print("R2:", regr_svm.score(diabetes_X_test, diabetes_y_test))
print("MSE:", mean_squared_error(diabetes_y_test, svm_pred))
print("MAE:", mean_absolute_error(diabetes_y_test, svm_pred))
print("R2:", r2_score(diabetes_y_test, svm_pred))

R2: 0.3083249276182749
MSE: 3685.9159401260745
MAE: 45.2333388001873
R2: 0.3083249276182749


## Cross validation: Linear Regression Model

In [27]:
from sklearn.model_selection import cross_val_score

scores_linear_cv = cross_val_score(estimator=regr_model, X=diabetes.data, y=diabetes.target, cv=5)
    # performing Cross-Validation with Linear Regression

print('R2 values:', scores_linear_cv)
print('Mean R2:', scores_linear_cv.mean()) # mean R2 score across all the folds

R2 values: [0.42955643 0.52259828 0.4826784  0.42650827 0.55024923]
Mean R2: 0.48231812211149394


On average across the five folds, the linear regression model explains about 48.2% of the variance in the target variable (here, disease progression one year after baseline).
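
The sketch below shows how 5-fold cross-validation partitions the 442 samples. It assumes that `cv=5` resolves to an unshuffled `KFold` for a regressor, which is scikit-learn's documented default; the helper `kfold_indices` is a hypothetical re-implementation for illustration, not the library's code.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # the first `extra` folds get one more sample
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(kfold_indices(442, 5))
print([len(test) for _, test in folds])  # [89, 89, 88, 88, 88]
```

Each sample lands in exactly one test fold, so every R2 value above is computed on data the model did not see while fitting for that fold.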

#### Other score metrics (MSE)

In [21]:
scores_linear_mse = cross_val_score(regr_model, diabetes.data, diabetes.target, cv=5, scoring='neg_mean_squared_error')

print('Negative MSE values:', scores_linear_mse)
print('Mean MSE:', -1*scores_linear_mse.mean())

Negative MSE values: [-2779.92210988 -3028.84335258 -3237.70099059 -3008.69133019
 -2910.20693327]
Mean MSE: 2993.072943299886


### Leave-One-Out Cross-Validation (LOOCV): Linear Regression Model

In [28]:
from sklearn.model_selection import LeaveOneOut

loo_cv = LeaveOneOut() 
    # LOOCV is a special case of cross-validation where each data point is used as a test set once while the rest of the data points are used for training

scores_loo = cross_val_score(regr_model, diabetes.data, diabetes.target, cv=loo_cv, scoring='neg_mean_squared_error')

print('Mean negative MSE:', scores_loo.mean())

Mean negative MSE: -3001.7462317329464


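scikit-learn scorers follow a higher-is-better convention, so MSE is reported as its negative (`neg_mean_squared_error`). Values closer to zero correspond to a lower (and better) MSE; here the LOOCV estimate of the MSE is about 3001.75.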
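
As a small worked example of the sign convention, the sketch below converts the five negative MSE values printed by the 5-fold run above into a mean MSE and an RMSE. The variable names are illustrative.

```python
import math

# Negative MSE values for the five folds, copied from the 5-fold output above
neg_mse = [-2779.92210988, -3028.84335258, -3237.70099059,
           -3008.69133019, -2910.20693327]

mean_mse = -sum(neg_mse) / len(neg_mse)   # flip the sign to recover the MSE
rmse = math.sqrt(mean_mse)                # back in the units of the target

print(round(mean_mse, 2))  # 2993.07
print(round(rmse, 2))
```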

### Resampling (ShuffleSplit cross-validation): Linear Regression Model

In [30]:
from sklearn.model_selection import ShuffleSplit

data_split = ShuffleSplit(n_splits=30, test_size=0.3, random_state=1)

scores_ss = cross_val_score(regr_model, diabetes.data, diabetes.target, cv=data_split)

print('Mean R2:', scores_ss.mean())

Mean R2: 0.48040581282261186


## Ensemble models to improve overall predictive performance, robustness, and generalization.

### Bagging (training multiple base models independently on random subsets (with replacement) of the training data)

In [31]:
from sklearn.ensemble import BaggingRegressor

bagged_model = BaggingRegressor(regr_model, max_samples=0.5, max_features=0.5)

scores_bag = cross_val_score(bagged_model, diabetes.data, diabetes.target, cv=5)

print('R2 values:', scores_bag)
print('Mean R2: %0.2f' % scores_bag.mean())

R2 values: [0.34451301 0.50740874 0.47756454 0.4320375  0.48856484]
Mean R2: 0.45


### Random Forests (multiple decision trees to create a more accurate and robust predictive model)

In [32]:
from sklearn.ensemble import RandomForestRegressor

rf_regr = RandomForestRegressor()

scores_rf = cross_val_score(rf_regr, diabetes.data, diabetes.target, cv=5)

print('R2 values:', scores_rf)
print('Mean R2: %0.2f' % scores_rf.mean())

R2 values: [0.36856327 0.49930084 0.40880181 0.3885778  0.42361865]
Mean R2: 0.42


### Boosting

In [33]:
from sklearn.ensemble import AdaBoostRegressor

ada_reg = AdaBoostRegressor(n_estimators=100)

scores_ada_reg = cross_val_score(ada_reg, diabetes.data, diabetes.target, cv=5)

print('R2 values:', scores_ada_reg)
print('Mean R2: %0.2f' % scores_ada_reg.mean())

R2 values: [0.37981909 0.46670793 0.42697884 0.34242086 0.43470139]
Mean R2: 0.41


### Gradient Boosting

In [34]:
from sklearn.ensemble import GradientBoostingRegressor

params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'squared_error'}

gradient_boost = GradientBoostingRegressor(**params)

scores_mse = cross_val_score(gradient_boost, diabetes.data, diabetes.target, scoring = "neg_mean_squared_error", cv=5)
print('Negative MSE values:', scores_mse)
meanMSE = -1.0*scores_mse.mean()
print('Mean MSE: %0.2f' % meanMSE)

Negative MSE values: [-3077.53201177 -3194.25296095 -3806.10236419 -3541.70273589
 -3854.11301257]
Mean MSE: 3494.74


### Voting Regressors

In [35]:
from sklearn.ensemble import VotingRegressor

voting_reg = VotingRegressor(estimators=[('lr', regr_model), ('rf', rf_regr), ('svm', regr_svm)])

for reg, label in zip([regr_model, rf_regr, regr_svm, voting_reg], ['Linear', "RF", "SVM", 'Ensemble']):
    scores = cross_val_score(reg, diabetes.data, diabetes.target, cv=5)
    print("R2: %0.2f (std %0.2f) [%s]" % (scores.mean(), scores.std(), label))

R2: 0.48 (std 0.05) [Linear]
R2: 0.43 (std 0.05) [RF]
R2: 0.43 (std 0.10) [SVM]
R2: 0.48 (std 0.06) [Ensemble]
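
With default (equal) weights, a `VotingRegressor` predicts the average of its base models' predictions. The sketch below checks this on a smaller ensemble; the choice of base models, `max_depth=3`, and the labels `'lr'` and `'tree'` are assumptions made for the example, not the models used above.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = datasets.load_diabetes(return_X_y=True)

# Two simple base models; the names 'lr' and 'tree' are arbitrary labels
voter = VotingRegressor(estimators=[
    ('lr', LinearRegression()),
    ('tree', DecisionTreeRegressor(max_depth=3, random_state=0)),
])
voter.fit(X, y)

ensemble_pred = voter.predict(X[:5])
# Average the predictions of the fitted base models by hand
manual_mean = np.mean([est.predict(X[:5]) for est in voter.estimators_], axis=0)

print(np.allclose(ensemble_pred, manual_mean))  # True
```

Averaging can only help when the base models make different errors, which may explain why the ensemble above does not beat plain linear regression.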


## Feature selection

### SelectKBest: Support Vector Machine Regression model 

In [36]:
from sklearn.feature_selection import SelectKBest, f_regression

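# Note: the selector is fitted on the full dataset (including the rows later used
# for testing in each CV fold), so the cross-validated score below can be optimistic
filt_kb = SelectKBest(f_regression, k=6).fit_transform(diabetes.data, diabetes.target)
print(filt_kb.shape)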

scores_kb = cross_val_score(regr_svm, filt_kb, diabetes.target, cv=10)
print('Mean R2: %0.2f' % scores_kb.mean())

scores = cross_val_score(regr_svm, diabetes.data, diabetes.target, cv=10)
print('Old R2: %0.2f' % scores.mean())

(442, 6)
Mean R2: 0.37
Old R2: 0.41
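
In the cell above the features are selected on the full dataset before cross-validation. One way to keep the selection inside each fold is to wrap the selector and the model in a pipeline. This is a sketch under that assumption; `select_then_svr` and `scores_pipe` are names introduced here, and the resulting score may differ from the value printed above.

```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

X, y = datasets.load_diabetes(return_X_y=True)

# Selector and model in one estimator: inside cross_val_score the selector
# is re-fitted on the training part of every fold only
select_then_svr = make_pipeline(
    SelectKBest(f_regression, k=6),
    SVR(kernel="rbf", C=100),
)

scores_pipe = cross_val_score(select_then_svr, X, y, cv=10)
print('Mean R2: %0.2f' % scores_pipe.mean())
```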


In [38]:
from sklearn.feature_selection import SelectKBest, f_regression

# Create the SelectKBest feature selector
selector = SelectKBest(f_regression, k=6)

# Fit the selector to your data and target
filt_kb = selector.fit_transform(diabetes.data, diabetes.target)

# Get the mask of selected features
selected_features_mask = selector.get_support()

# Print the indices or names of selected features
selected_feature_indices = [i for i, selected in enumerate(selected_features_mask) if selected]
selected_feature_names = [diabetes.feature_names[i] for i in selected_feature_indices]

print("Selected feature indices:", selected_feature_indices)
print("Selected feature names:", selected_feature_names)

Selected feature indices: [2, 3, 6, 7, 8, 9]
Selected feature names: ['bmi', 'bp', 's3', 's4', 's5', 's6']


### Wrapper: recursive feature elimination (RFE): Support Vector Machine Regression model

In [39]:
from sklearn.feature_selection import RFE

In [40]:
svr = SVR(kernel = "linear", C=100.)

rfe = RFE(estimator=svr, n_features_to_select=8, step=2)

scores_rfe = cross_val_score(rfe, diabetes.data, diabetes.target, cv=10)
print('Mean R2: %0.2f' % scores_rfe.mean())

Mean R2: 0.44
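
The cross-validated score does not show which features RFE keeps. A minimal sketch for inspecting them is to fit the same `RFE` configuration once on the full dataset and read its `support_` and `ranking_` attributes; fitting on all data here is for inspection only, not for estimating performance.

```python
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

diabetes = datasets.load_diabetes()

rfe = RFE(estimator=SVR(kernel="linear", C=100.), n_features_to_select=8, step=2)
rfe.fit(diabetes.data, diabetes.target)

# support_ is a boolean mask over the 10 input columns; ranking_ is 1 for kept features
kept = [name for name, keep in zip(diabetes.feature_names, rfe.support_) if keep]
print("Kept features:", kept)
print("Ranking:", rfe.ranking_)
```

With 10 features, `step=2`, and `n_features_to_select=8`, a single elimination round removes the two features with the smallest absolute coefficients.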


## Hyperparameter optimisation: Support Vector Machine Regression model

Hyperparameter optimisation searches for the model settings (here the SVR `kernel`, `C`, and `gamma`) that give the best cross-validated score.

In [41]:
from sklearn.model_selection import GridSearchCV

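# Note: gamma only affects the 'rbf' kernel here; it is ignored when kernel='linear'
parameters = {'kernel':['linear', 'rbf'], 'C':[1, 10, 100, 1000, 10000, 100000], 'gamma':[1e-3, 1e-4, 1e-5, 1e-6]}

svm_model_d = SVR()
opt_model_d = GridSearchCV(svm_model_d, parameters)  # defaults: 5-fold CV, scored with R2

opt_model_d.fit(diabetes.data, diabetes.target)
print(opt_model_d.best_estimator_)

# Nested cross-validation: the whole grid search is repeated inside each of the 5 outer folds
scores_opt = cross_val_score(opt_model_d, diabetes.data, diabetes.target, cv = 5).mean()
scores_opt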

SVR(C=1000, gamma=0.001, kernel='linear')


0.46525737220642405
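
Because `gamma` has no effect on the linear kernel, the grid can be written as one dictionary per kernel so that redundant combinations are not fitted. The sketch below shows this form; the shortened value ranges are an assumption made to keep the run fast and are not the grid used above.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = datasets.load_diabetes(return_X_y=True)

# One grid per kernel, so gamma is only varied where it has an effect.
# The value ranges are shortened here (an assumption) to keep the run fast.
param_grid = [
    {'kernel': ['linear'], 'C': [1, 10, 100]},
    {'kernel': ['rbf'], 'C': [1, 10, 100], 'gamma': [1e-3, 1e-4]},
]

search = GridSearchCV(SVR(), param_grid, cv=5)
search.fit(X, y)

print(len(search.cv_results_['params']))  # 3 linear + 6 rbf = 9 candidates
print(search.best_params_)
print('Best mean CV R2: %0.3f' % search.best_score_)
```

`best_score_` is the mean cross-validated score of the winning candidate, so it is selected to look good on these folds; the nested cross-validation in the cell above gives a less optimistic estimate.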