# Tutorial 04: ML modelling
by Dr Ivan Olier-Caparroso, last updated: 01/02/22

## Introduction
We will use the *Pima Indians Diabetes* dataset, which we have used in previous tutorials. Please refer to them for a description of the data. Note that the version used in this tutorial contains missing values, so it is not equivalent to the ones used previously. On this occasion, the data is already pre-processed and split into training and test subsets.

We will train several ML models and test their performance via a Kaggle competition. Kaggle competitions follow the typical idea that the targets (or outputs) of the test set are hidden from you. You use the training set to create your models and make predictions on the test set. These predictions are then uploaded to Kaggle, which estimates the model performance score and ranks your model against those of other competitors. In this tutorial, we will produce several ML models, generate a predictions file for each of them using the test set, and upload the files to Kaggle.

More information about Kaggle can be found on its website (https://www.kaggle.com/) and on its Wikipedia page (https://en.wikipedia.org/wiki/Kaggle).

The data is available in the Kaggle competition, with link:

https://www.kaggle.com/c/tutorial-04

and the invitation to join the competition is available here:

https://www.kaggle.com/t/fb5e796e3f4c4c568f4f31522f09ed32

Let's assume you have downloaded the data already:

In [None]:
import os
from google.colab import files
from google.colab import drive
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
uploaded = files.upload()


Saving pima_test.csv to pima_test.csv
Saving pima_train.csv to pima_train.csv
Saving predictions_example.csv to predictions_example.csv


In [None]:
dset_trn = pd.read_csv('pima_train.csv')
dset_tst = pd.read_csv('pima_test.csv')

* Let's have a look at the training set:

In [None]:
dset_trn.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,Id
0,0,131.0,66.0,40.0,,34.3,0.196,22,1,397
1,4,114.0,64.0,,,28.9,0.126,24,0,474
2,9,122.0,56.0,,,33.3,1.114,33,1,131
3,10,111.0,70.0,27.0,,27.5,0.141,40,1,667
4,12,84.0,72.0,31.0,,29.7,0.297,46,1,510


* and at the test set:

In [None]:
dset_tst.head() # Note there is no 'Outcome'

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Id
0,1,85.0,66.0,29.0,,26.6,0.351,31,1
1,0,137.0,40.0,35.0,168.0,43.1,2.288,33,4
2,5,116.0,74.0,,,25.6,0.201,30,5
3,10,168.0,74.0,,,38.0,0.537,34,11
4,1,103.0,30.0,38.0,83.0,43.3,0.183,33,18


Notice that the above dataframes contain rows with missing values. We must decide how to handle them as part of the data pre-processing. *Scikit-learn* offers several alternatives to handle missing values. Please refer to its documentation for more information: https://scikit-learn.org/stable/modules/impute.html. 

First, let's have a look at the level of missingness in the data. We can use the combination of the pandas functions `isna` and `sum` to count the number of missing values in each column:

In [None]:
dset_trn.isna().sum()     # shows the number of missings per column

Pregnancies                   0
Glucose                       4
BloodPressure                26
SkinThickness               162
Insulin                     266
BMI                           8
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
Id                            0
dtype: int64

Or, in proportions, as follows:

In [None]:
n_obs_trn = dset_trn.shape[0]
dset_trn.isna().sum()/n_obs_trn

Pregnancies                 0.000000
Glucose                     0.007435
BloodPressure               0.048327
SkinThickness               0.301115
Insulin                     0.494424
BMI                         0.014870
DiabetesPedigreeFunction    0.000000
Age                         0.000000
Outcome                     0.000000
Id                          0.000000
dtype: float64

Notice that almost 50% of the *Insulin* values are missing, so it is sensible to drop that variable.

In [None]:
dset_trn.drop(columns='Insulin', inplace=True)
dset_tst.drop(columns='Insulin', inplace=True)

For the remaining variables, we can adopt a *simple missing value imputation* that considers every column separately. We will impute the *mean* value, which is perhaps the simplest strategy. *Scikit-learn* implements a simple imputer through the class `SimpleImputer`. Again, refer to the above documentation for more details. Before we impute the missing values, we convert the data to `numpy` arrays, which is the format `sklearn` works with.

In [None]:
X_train = dset_trn.drop(columns=['Id','Outcome']).to_numpy()
X_test = dset_tst.drop(columns='Id').to_numpy()
y_train = dset_trn['Outcome'].to_numpy()


In [None]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean').fit(X_train)
X_train = imp.transform(X_train)
X_test = imp.transform(X_test)
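
To make the behaviour of the imputer concrete, here is a minimal sketch on a toy array. The values are hypothetical, not taken from the Pima data, and the names `toy_train`, `toy_test` and `toy_imp` are introduced only for illustration. It shows that the column means are learned from the training rows only and then reused to fill the gaps in the test rows:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values, not from the Pima dataset)
toy_train = np.array([[1.0, 10.0],
                      [3.0, np.nan],
                      [np.nan, 30.0]])
toy_test = np.array([[np.nan, np.nan]])

# Fit on the training rows only: the imputer stores one mean per column
toy_imp = SimpleImputer(strategy='mean').fit(toy_train)
print(toy_imp.statistics_)       # column means computed while ignoring NaNs

# The test gaps are filled with the *training* means
toy_test_filled = toy_imp.transform(toy_test)
print(toy_test_filled)
```

Fitting on the training set and only transforming the test set keeps information from the test set out of the pre-processing step.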

Now, we can scale the data as we did in previous tutorials:

In [None]:
from sklearn import preprocessing

mm_scaler = preprocessing.StandardScaler()  # creates a transformer that standardises each column (zero mean, unit variance)
X_train = mm_scaler.fit_transform(X_train)  # estimates the means and standard deviations from the training set and applies them to it
X_test = mm_scaler.transform(X_test)  # transforms the test set using the training-set parameters
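
As a quick sanity check of what standardisation does, the following sketch uses a small hypothetical array (the names `toy_train`, `toy_test` and `toy_scaler` are illustrative only). After the transformation, the training columns have zero mean and unit standard deviation, whereas the test row is simply shifted and rescaled with the training parameters:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) with two columns on very different scales
toy_train = np.array([[1.0, 100.0],
                      [2.0, 200.0],
                      [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler().fit(toy_train)
toy_train_std = toy_scaler.transform(toy_train)
toy_test_std = toy_scaler.transform(toy_test)

print(toy_scaler.mean_, toy_scaler.scale_)                    # parameters learned from the training set
print(toy_train_std.mean(axis=0), toy_train_std.std(axis=0))  # approximately 0 and 1 per column
print(toy_test_std)                                           # test row expressed in training-set units
```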



## Multiple logistic regression

Let's now produce a multiple logistic regression model. *Scikit-learn* implements logistic regression with penalisation, with *ridge penalisation (L2)* as the default option. It also implements several solvers for the underlying optimisation problem (maximum likelihood estimation). According to the documentation, `saga` is a fast option for large datasets, whereas `liblinear` is a good choice for small ones. For more details follow this link: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

The `sklearn` implementation of logistic regression has several hyperparameters that we might like to tune. In particular, we can tune the penalisation factor ($\lambda$ in the lecture slides), which `sklearn` expresses through the *inverse of regularisation strength*, `C` (a smaller `C` means stronger penalisation). `sklearn` implements a *cross-validated* version of logistic regression that is helpful if we would like to tune this hyperparameter. This is done with the class `LogisticRegressionCV` (instead of `LogisticRegression`, which assumes you know your hyperparameters already). Refer to its documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV.

In [None]:
from sklearn.linear_model import LogisticRegressionCV

lr_mdl = LogisticRegressionCV(penalty='l2', #default penalty
                              scoring='roc_auc',
                              solver='saga', #solving the optimisation problem (maximum likelihood) 
                              Cs = 6) #grid of Cs values are chosen in a logarithmic scale between 1e-4 and 1e4.

Now, we can just use the function `fit` to train the model. In this case, `sklearn` will:

1. tune the *inverse of regularisation strength* hyperparameter (C) by testing 6 values between 1e-4 and 1e4 using cross-validation (5 folds by default), and
2. use the whole `X_train` to train a logistic regression model using the optimal C.
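
To see which values `Cs=6` corresponds to, the sketch below rebuilds the grid with `numpy`, assuming (as the documentation describes) 6 points evenly spaced on a logarithmic scale between 1e-4 and 1e4. The name `candidate_Cs` is illustrative; after fitting, the same values should be available in the model's `Cs_` attribute.

```python
import numpy as np

# Six candidate values for C, evenly spaced on a log scale from 1e-4 to 1e4
candidate_Cs = np.logspace(-4, 4, 6)
print(candidate_Cs)
```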


In [None]:
lr_mdl.fit(X_train, y_train)

LogisticRegressionCV(Cs=6, scoring='roc_auc', solver='saga')

We can query several attributes such as the coefficients (or $\beta$s), intercept ($\beta_0$), optimal C, and the performance scores for each hyperparameter value and CV fold:

In [None]:
print("Coefficients (betas): ", lr_mdl.coef_)
print("Intercept: ", lr_mdl.intercept_)
print("Optimal inverse of regularization parameter value: ", lr_mdl.C_)
print('Model performance scores [AUC]: ', lr_mdl.scores_)

Coefficients (betas):  [[ 0.31892261  0.95725913 -0.07526795  0.09631392  0.71166728  0.2673513
   0.18009998]]
Intercept:  [-0.81942342]
Optimal inverse of regularization parameter value:  [0.15848932]
Model performance scores [AUC]:  {1: array([[0.83120301, 0.82857143, 0.82969925, 0.82706767, 0.82706767,
        0.82706767],
       [0.80639098, 0.8093985 , 0.81353383, 0.81428571, 0.81390977,
        0.81390977],
       [0.78270677, 0.79586466, 0.81278195, 0.81466165, 0.81466165,
        0.81466165],
       [0.8020595 , 0.80549199, 0.80244088, 0.79824561, 0.79786423,
        0.79786423],
       [0.86384439, 0.87299771, 0.89588101, 0.89969489, 0.89969489,
        0.89969489]])}
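
The score matrix has one row per CV fold and one column per candidate value of C. As a rough check of how the optimal C was selected, the sketch below averages the printed AUCs over the folds and picks the column with the highest mean. The array is copied from the output above, and the names `fold_scores`, `mean_auc` and `best_idx` are illustrative only:

```python
import numpy as np

candidate_Cs = np.logspace(-4, 4, 6)   # grid implied by Cs=6

# AUC per fold (rows) and per candidate C (columns), copied from the printed scores_
fold_scores = np.array([
    [0.83120301, 0.82857143, 0.82969925, 0.82706767, 0.82706767, 0.82706767],
    [0.80639098, 0.8093985,  0.81353383, 0.81428571, 0.81390977, 0.81390977],
    [0.78270677, 0.79586466, 0.81278195, 0.81466165, 0.81466165, 0.81466165],
    [0.8020595,  0.80549199, 0.80244088, 0.79824561, 0.79786423, 0.79786423],
    [0.86384439, 0.87299771, 0.89588101, 0.89969489, 0.89969489, 0.89969489],
])

mean_auc = fold_scores.mean(axis=0)     # average over the 5 folds
best_idx = int(np.argmax(mean_auc))     # column with the highest mean AUC
print(mean_auc.round(4))
print(candidate_Cs[best_idx])           # agrees with the reported C_ of about 0.158
```

Note how small the differences between columns are: the mean AUC changes by well under 0.02 across the whole grid, so the model is not very sensitive to C on this dataset.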


Finally, we use `predict_proba` as in previous tutorials to predict the probabilities of the outcomes and form a dataframe that could be uploaded to Kaggle:

In [None]:
y_tst_lr = lr_mdl.predict_proba(X_test)

dset_sol_lr = pd.DataFrame({"Id": dset_tst.Id, "Predicted" : y_tst_lr[:,1]})
dset_sol_lr.head()

Unnamed: 0,Id,Predicted
0,1,0.050062
1,4,0.888402
2,5,0.132097
3,11,0.86053
4,18,0.390431


We save the data using the CSV file format:

In [None]:
dset_sol_lr.to_csv("pima_sol_lr.csv", index=False)

## Support Vector Machines (SVM)

SVMs are implemented in `sklearn` with the classes `SVC` and `SVR`, for classification and regression, respectively. The `SVC` class can be used to fit a *support vector classifier* (linear SVM) when the argument `kernel="linear"` is used. The `C` argument allows us to specify the cost of a violation to the margin. When `C` is **small**, the margins will be wide and many support vectors will be on the margin or will violate the margin. When `C` is **large**, the margins will be narrow and there will be few support vectors on the margin or violating the margin.
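
The effect of the cost on the number of support vectors can be illustrated with a small sketch. It uses synthetic, overlapping two-class data generated with `make_blobs` (an assumption made purely for illustration; it is not the Pima data), and the names `X_toy`, `y_toy` and `n_sv` are hypothetical:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping Gaussian clusters (synthetic data, for illustration only)
X_toy, y_toy = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                          cluster_std=1.0, random_state=0)

n_sv = {}
for C in [0.001, 100]:
    clf = SVC(C=C, kernel='linear').fit(X_toy, y_toy)
    n_sv[C] = int(clf.n_support_.sum())   # total support vectors over both classes

print(n_sv)   # the small cost yields the wider margin and hence more support vectors
```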

We can use the `SVC` class to fit the support vector classifier for a given value of the cost parameter `C` as follows:

In [None]:
from sklearn.svm import SVC

svc = SVC(C=1, kernel='linear')
svc.fit(X_train, y_train)

SVC(C=1, kernel='linear')

Again, we use `fit` to train the model. In the above case, we assumed fixed hyperparameter values. 

The `sklearn.model_selection` module includes the class `GridSearchCV`, which implements the *grid search* strategy for hyperparameter tuning. It evaluates each candidate with cross-validation or another resampling method. For instance, the following code implements hyperparameter tuning using a grid search to find an optimal value for the cost `C`:


In [None]:
from sklearn.model_selection import GridSearchCV

# Select the optimal C parameter by cross-validation
tuned_parameters = [{'C': [0.001, 0.01, 0.1, 1, 5, 10, 100]}]
svc = GridSearchCV(SVC(kernel='linear',probability=True), tuned_parameters, cv=10, scoring='roc_auc')
svc.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=SVC(kernel='linear', probability=True),
             param_grid=[{'C': [0.001, 0.01, 0.1, 1, 5, 10, 100]}],
             scoring='roc_auc')

We can easily access the cross-validation results (fit times and AUC scores) for each of these models:

In [None]:
svc.cv_results_

{'mean_fit_time': array([0.02496965, 0.02365448, 0.02428138, 0.03492167, 0.07159429,
        0.11784496, 0.73245852]),
 'mean_score_time': array([0.00214601, 0.00202584, 0.00193896, 0.00195456, 0.00205903,
        0.00190654, 0.00213878]),
 'mean_test_score': array([0.82525874, 0.82785935, 0.82820433, 0.82775763, 0.82794781,
        0.82779743, 0.82764706]),
 'param_C': masked_array(data=[0.001, 0.01, 0.1, 1, 5, 10, 100],
              mask=[False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 0.001},
  {'C': 0.01},
  {'C': 0.1},
  {'C': 1},
  {'C': 5},
  {'C': 10},
  {'C': 100}],
 'rank_test_score': array([7, 3, 1, 5, 2, 4, 6], dtype=int32),
 'split0_test_score': array([0.81052632, 0.80150376, 0.80150376, 0.80300752, 0.80150376,
        0.80150376, 0.80150376]),
 'split1_test_score': array([0.85413534, 0.84661654, 0.8556391 , 0.85864662, 0.85864662,
        0.85864662, 0.85864662]),
 'split2_test_score': array([0.79849

The fitted `GridSearchCV` object stores the best parameters obtained, which can be accessed as follows:

In [None]:
svc.best_params_

{'C': 0.1}
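
The raw `cv_results_` dictionary is hard to read. One option is to load it into a pandas dataframe and keep only the columns of interest. The sketch below does this on synthetic data from `make_classification` with a reduced grid. Both are assumptions made so that the example runs on its own, and the names `toy_search` and `summary` are illustrative:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data (for illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_search = GridSearchCV(SVC(kernel='linear'), {'C': [0.01, 0.1, 1]},
                          cv=5, scoring='roc_auc')
toy_search.fit(X_toy, y_toy)

# One row per candidate, with the mean and spread of the AUC across folds
results = pd.DataFrame(toy_search.cv_results_)
summary = results[['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))

# best_score_ is the mean cross-validated AUC of the top-ranked candidate
print(toy_search.best_params_, toy_search.best_score_)
```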

Similarly to the logistic regression model, we can easily produce a prediction file that contains the predicted probabilities on the test set:

In [None]:
y_tst_svc = svc.predict_proba(X_test)

dset_sol_svc = pd.DataFrame({"Id": dset_tst.Id, "Predicted" : y_tst_svc[:,1]})
dset_sol_svc.to_csv("pima_sol_svc_linear.csv", index=False)




We can also build a non-linear SVM using a non-linear kernel function, such as a polynomial or a radial basis function (RBF). To fit an SVM with a polynomial kernel we use `kernel="poly"`, and to fit an SVM with a radial kernel we use `kernel="rbf"`. In the former case we also use the `degree` argument to specify the degree of the polynomial kernel, and in the latter case we use `gamma` to specify a value of $\gamma$ for the radial basis kernel.
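
To clarify the role of $\gamma$, recall that the radial basis kernel is $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$, so a larger $\gamma$ makes the similarity between two points decay faster with distance. The following sketch computes this kernel by hand on a few random points and compares it with the `rbf_kernel` helper from `sklearn.metrics.pairwise`. The data and the names `A`, `K_manual` and `K_sklearn` are illustrative only:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # 4 random points with 3 features (illustrative)
gamma = 0.5

# Squared Euclidean distance between every pair of rows
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)

K_manual = np.exp(-gamma * sq_dists)       # RBF kernel from its definition
K_sklearn = rbf_kernel(A, gamma=gamma)     # same kernel computed by scikit-learn

print(np.allclose(K_manual, K_sklearn))
```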

As before, we can perform a grid search to tune the hyperparameters (such as kernel type, degree, $\gamma$, cost, etc.):

In [None]:
svm_hyp_grid = [
    {'C': [0.1, 1, 10, 100], 'kernel': ['linear']},
    {'C': [0.1, 1, 10, 100], 'degree': [2, 4, 8], 'kernel': ['poly']},
    {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 0.5, 1, 100], 'kernel': ['rbf']}
]

mdls_svm = GridSearchCV(SVC(probability=True), param_grid=svm_hyp_grid, cv=5, scoring='roc_auc').fit(X_train, y_train)

So, the best set of hyperparameters is:

In [None]:
mdls_svm.best_params_

{'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
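
It is worth knowing how much work such a grid search involves. The sketch below counts the candidates in the grid defined above using `ParameterGrid` from `sklearn.model_selection`. The names `n_candidates` and `n_fits` are illustrative:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
svm_hyp_grid = [
    {'C': [0.1, 1, 10, 100], 'kernel': ['linear']},
    {'C': [0.1, 1, 10, 100], 'degree': [2, 4, 8], 'kernel': ['poly']},
    {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 0.5, 1, 100], 'kernel': ['rbf']}
]

n_candidates = len(ParameterGrid(svm_hyp_grid))   # 4 linear + 12 polynomial + 20 RBF
n_fits = n_candidates * 5                         # one fit per candidate and per CV fold (cv=5)
print(n_candidates, n_fits)
```

The count grows multiplicatively with every hyperparameter added to the grid, which is one reason to consider a random search for larger grids.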

Using similar code as before, we produce a new prediction file:

In [None]:
y_tst_svm = mdls_svm.predict_proba(X_test)

dset_sol_svm = pd.DataFrame({"Id": dset_tst.Id, "Predicted" : y_tst_svm[:,1]})
dset_sol_svm.to_csv("pima_sol_svm.csv", index=False)


If we want a random search instead of a grid search, the code is very similar: we use `RandomizedSearchCV` instead of `GridSearchCV`. More details here: https://scikit-learn.org/stable/modules/grid_search.html.
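
A minimal sketch of a random search is shown below. It is not part of the original tutorial code. It assumes synthetic data from `make_classification`, samples `C` and `gamma` from log-uniform distributions (`scipy.stats.loguniform`) whose ranges are chosen arbitrarily, and uses the illustrative names `param_dist` and `rand_search`:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data (for illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Distributions to sample from, instead of fixed lists of values
param_dist = {'C': loguniform(1e-2, 1e2),
              'gamma': loguniform(1e-3, 1e1)}

rand_search = RandomizedSearchCV(SVC(kernel='rbf'), param_distributions=param_dist,
                                 n_iter=10,          # number of sampled candidates
                                 cv=5, scoring='roc_auc', random_state=0)
rand_search.fit(X_toy, y_toy)
print(rand_search.best_params_)
```

Unlike the grid search, the cost here is fixed by `n_iter` rather than by the size of the grid.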

## Random Forest (RF) and Gradient Boosting Machines (GBM)

RF and GBM are available in `sklearn.ensemble`. Working with them is similar to working with SVMs (and with most of the ML algorithms implemented in Scikit-learn). RF is implemented with the class `RandomForestClassifier`, whilst GBM is implemented with `GradientBoostingClassifier`. Both methods have several hyperparameters that we might want to tune. Typically, we tune the following RF hyperparameters:

* `max_features` - The number of features to consider when looking for the best split.
* `min_samples_leaf` - The minimum number of samples required to be at a leaf node.

and the following GBM hyperparameters:

* `learning_rate` - The factor by which the contribution of each tree is shrunk.
* `min_samples_split` - The minimum number of samples required to split an internal node.

For more details about `RandomForestClassifier` and `GradientBoostingClassifier` visit: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html and https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html.

Let's have a look at the code for an RF model:

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_hyp_grid = {'max_features':[0.1, 0.3, 0.5, 0.7, 0.9],
               'min_samples_leaf':range(3,18,3)}
               
mdl_rf = GridSearchCV(RandomForestClassifier(), param_grid=rf_hyp_grid, cv=5, scoring='roc_auc').fit(X_train, y_train)
mdl_rf.best_params_

{'max_features': 0.9, 'min_samples_leaf': 12}

Now, we produce the prediction file:

In [None]:
y_tst_rf = mdl_rf.predict_proba(X_test)

dset_sol_rf = pd.DataFrame({"Id": dset_tst.Id, "Predicted" : y_tst_rf[:,1]})
dset_sol_rf.to_csv("pima_sol_rf.csv", index=False)

Now, it is the turn for GBM:

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbm_hyp_grid = {'learning_rate':[0.001, 0.01, 0.1, 0.5],
               'min_samples_split':range(3,18,3)}
               
mdl_gbm = GridSearchCV(GradientBoostingClassifier(), param_grid=gbm_hyp_grid, cv=5, scoring='roc_auc').fit(X_train, y_train)
mdl_gbm.best_params_

{'learning_rate': 0.1, 'min_samples_split': 15}

... and its prediction file:

In [None]:
y_tst_gbm = mdl_gbm.predict_proba(X_test)

dset_sol_gbm = pd.DataFrame({"Id": dset_tst.Id, "Predicted" : y_tst_gbm[:,1]})
dset_sol_gbm.to_csv("pima_sol_gbm.csv", index=False)
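
Before uploading, the mean cross-validated AUC stored in each fitted search object (`best_score_`) gives a rough local estimate of how the models compare. The sketch below illustrates the idea on synthetic data with deliberately small grids and only 50 trees per model. These choices, as well as the names `searches` and `cv_auc`, are assumptions made to keep the example fast and self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (for illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

searches = {
    'rf': GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=0),
                       {'min_samples_leaf': [3, 9]}, cv=5, scoring='roc_auc'),
    'gbm': GridSearchCV(GradientBoostingClassifier(n_estimators=50, random_state=0),
                        {'learning_rate': [0.01, 0.1]}, cv=5, scoring='roc_auc'),
}

# Fit each search and collect the mean CV AUC of its best candidate
cv_auc = {name: search.fit(X_toy, y_toy).best_score_ for name, search in searches.items()}
print(cv_auc)
```

These estimates come from the training data only, so the ranking on the hidden test set may differ.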

## Kaggle competition

Now that you have several prediction files, one for each machine learning algorithm tested, it is time to see how they perform on the test set. To do this, we will upload them to a new Kaggle competition for this tutorial. The link to the competition is the following:

https://www.kaggle.com/c/tutorial-10-competition

And the invitation to join the competition is the following link:


## Exercise 2
* Which model performed the best? Why do you think that model was the best one? Try a wider range of hyperparameters for each algorithm to see whether there is any improvement. Also, you could consider using other ML algorithms (e.g. Neural Networks, Decision Trees, k-NN).