# Tutorial 03 (part B): Working with ML algorithms in Scikit-learn
By Dr Ivan Olier-Caparroso, 30/01/22


## Task
We will perform a two-class classification task on the *South African Heart* dataset (SAHeart). The dataset is publicly available and easy to find by searching for its name. It is also available in the module's GitHub data repository (https://raw.githubusercontent.com/iaolier/7021DATSCI/main/data/SAheart.csv). The dataset is a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. There are roughly two controls per case of coronary heart disease (CHD). Many of the CHD-positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. In some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in Rousseauw et al., 1983, South African Medical Journal.

This is the set of variables in the dataset:

* sbp - systolic blood pressure
* tobacco - cumulative tobacco (kg)
* ldl - low density lipoprotein cholesterol
* adiposity
* famhist - family history of heart disease (Present, Absent)
* typea - type-A behavior
* obesity
* alcohol - current alcohol consumption
* age - age at onset
* chd - response, coronary heart disease

The aim is to predict the risk of CHD as a function of the other variables. This is essentially a classification task with two classes: CHD/No CHD (coded as 1 and 0, respectively).

## Activities

1. Clean the data (if needed) and convert any categorical variable to binary.
2. Split the data into training and test subsets, following an out-of-sample resampling strategy. Leave 30% for testing.
3. Standardise the data splits (remove mean and divide by standard deviation).

-- The default score for model evaluation is the AUC (area under the ROC curve).

4. Perform hyperparameter tuning on logistic regression (LR) and select the best possible model. *Scikit-learn* has several LR implementations (or solvers) that fit the model by maximum likelihood. We commonly work with the *saga* solver, which allows for the *elastic net* penalisation, a generalisation of the *ridge* and *lasso* penalisations we studied. Ridge penalisation uses the squared $\ell_{2}$-norm, whilst lasso uses the $\ell_{1}$-norm. Elastic net mixes the two penalties, with `l1_ratio` setting the weight given to the $\ell_{1}$ term (0 is pure ridge, 1 is pure lasso). In addition, the solver uses a complexity parameter `C`, the inverse of the regularisation strength, in a similar fashion to *SVM*. More details can be found here: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression. Use several values of `C` and `l1_ratio`.
5. Perform hyperparameter tuning on k-nearest neighbour (KNN) and select the best possible model. Use several `k` values.
6. Perform hyperparameter tuning on support vector machines (SVM) and select the best possible model. Use several `C` values, and several kernel functions (and kernel hyperparameters).
7. Report test AUCs of optimised models and indicate the best one. Comments?
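
To make the role of `l1_ratio` in activity 4 concrete, the elastic net penalty can be written as below. This is a sketch following the form given in the scikit-learn user guide; the symbols $w$ (the coefficient vector) and $\rho$ (standing for `l1_ratio`) are notation introduced here, not taken from the tutorial.

$$
r(w) = \frac{1-\rho}{2}\,\lVert w \rVert_{2}^{2} + \rho\,\lVert w \rVert_{1}, \qquad 0 \le \rho \le 1
$$

Setting $\rho = 0$ leaves only the squared $\ell_{2}$ term (ridge), and $\rho = 1$ leaves only the $\ell_{1}$ term (lasso). Smaller values of `C` give the penalty more weight relative to the data-fit term.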

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

## Dataset

Let's read the data:

In [None]:
SAHeart = pd.read_csv('https://raw.githubusercontent.com/iaolier/7021DATSCI/main/data/SAheart.csv')
SAHeart.head()

In [None]:
SAHeart.dtypes

In [None]:
pd.plotting.scatter_matrix(SAHeart, alpha=0.2, figsize=(10, 10))
plt.show()

In [None]:
SAHeart.describe(include="all")

In [None]:
import seaborn as sns

In [None]:
sns.histplot(SAHeart["tobacco"])

In [None]:
sns.histplot(SAHeart["alcohol"])

In [None]:
sns.histplot(SAHeart["age"])

In [None]:
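# One-hot encode famhist; dtype=int gives 0/1 columns rather than booleans
dummies = pd.get_dummies(SAHeart[['famhist']], dtype=int)
# Keep a single indicator column (1 = Present, 0 = Absent)
SAHeart = SAHeart.drop('famhist', axis=1)
SAHeart = pd.concat([SAHeart, dummies[['famhist_Present']]], axis=1)
SAHeart.head()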
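
For a variable with only two categories, a direct mapping is a possible alternative to building dummy columns. The sketch below uses a small made-up frame (`toy`, with invented values) rather than the real data, so it only illustrates the idea.

```python
import pandas as pd

# Toy frame standing in for SAHeart; the values are made up for illustration.
toy = pd.DataFrame({"famhist": ["Present", "Absent", "Absent", "Present"],
                    "age": [52, 63, 46, 58]})

# Map the two categories straight to 1/0 instead of building dummy columns.
toy["famhist_Present"] = toy["famhist"].map({"Absent": 0, "Present": 1})
toy = toy.drop(columns="famhist")
print(toy)
```

Any category missing from the mapping would become `NaN`, which makes unexpected labels easy to spot.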

In [None]:
pd.plotting.scatter_matrix(SAHeart, alpha=0.2, figsize=(10, 10))
plt.show()

## Out-of-sample resampling strategy

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X, X_test, y, y_test = train_test_split(SAHeart.drop('chd', axis = 1), SAHeart.chd, test_size = 0.3, random_state=1)

We further split the training set, so we have a validation set.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 1)
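
The two successive splits leave roughly 56% of the rows for training, 14% for validation and 30% for testing. The sketch below checks this on synthetic stand-in data (`X_all`, `y_all` are invented here) and also passes `stratify`, an optional argument not used in the cells above, which keeps the class ratio similar in every subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, about one third positive labels.
rng = np.random.default_rng(1)
X_all = rng.normal(size=(100, 3))
y_all = np.array([1] * 34 + [0] * 66)

# 30% held out for testing; stratify keeps the class ratio similar in each split.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=1)
# 20% of the remaining 70% becomes the validation set (14% of all rows).
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=1)

print(len(X_tr), len(X_va), len(X_te))
print(round(y_tr.mean(), 2), round(y_va.mean(), 2), round(y_te.mean(), 2))
```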

## Data transformation

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
X_train[:5,:]
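
The scaler is fitted on the training split only, and the validation and test splits are then transformed with the training mean and standard deviation. The sketch below reproduces that computation by hand on made-up numbers; `demo_scaler`, `train` and `val` are names introduced for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and validation matrices (rows = samples, columns = features).
train = np.array([[120.0, 1.0], [140.0, 3.0], [160.0, 8.0]])
val = np.array([[150.0, 2.0]])

demo_scaler = StandardScaler().fit(train)

# Same computation by hand: subtract the training mean and divide by the
# training standard deviation (population form, ddof=0, as StandardScaler uses).
mu = train.mean(axis=0)
sigma = train.std(axis=0)
val_manual = (val - mu) / sigma

print(np.allclose(demo_scaler.transform(val), val_manual))
```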

## Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
m_lr = LogisticRegression().fit(X_train, y_train)

In [None]:
# predict probabilities
lr_val_probs = m_lr.predict_proba(X_val)
# keep probabilities for the positive outcome only
lr_val_probs = lr_val_probs[:, 1]

In [None]:
# training set
lr_trn_probs = m_lr.predict_proba(X_train)
lr_trn_probs = lr_trn_probs[:, 1]

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
print('Logistic model: AUC: Training = %.3f, Validation = %.3f' % (roc_auc_score(y_train, lr_trn_probs),
                                                                   roc_auc_score(y_val, lr_val_probs)))

## Support Vector Machines

In [None]:
from sklearn.svm import SVC

In [None]:
m_svm = SVC(probability=True).fit(X_train, y_train)

In [None]:
# predict probabilities
svm_val_probs = m_svm.predict_proba(X_val)
# keep probabilities for the positive outcome only
svm_val_probs = svm_val_probs[:, 1]

In [None]:
# training set
svm_trn_probs = m_svm.predict_proba(X_train)
svm_trn_probs = svm_trn_probs[:, 1]

In [None]:
print('SVM model: AUC: Training = %.3f, Validation = %.3f' % (roc_auc_score(y_train, svm_trn_probs),
                                                              roc_auc_score(y_val, svm_val_probs)))

## Hyper-parameter tuning

### Logistic regression

In [None]:
LogisticRegression().get_params()

In [None]:
param_grid = [
    {'C' : [0.1, 1.0, 10, 100], 'l1_ratio' : [0, 0.25, 0.5, 0.75, 1]}
]

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
clf = GridSearchCV(LogisticRegression(penalty='elasticnet', solver='saga', random_state=1, max_iter=10000),
                  param_grid,
                  scoring = 'roc_auc')
# X is the unscaled training data, so apply the fitted scaler before the search
clf.fit(scaler.transform(X), y)

In [None]:
clf.best_params_

In [None]:
print("Grid scores (mean cross-validated AUC):")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
print()
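
If the data are standardised before being passed to `GridSearchCV`, the scaler has already seen rows from every cross-validation fold. One way around this, sketched here on synthetic data, is to put the scaler and the model in a `Pipeline` so that scaling is refitted inside each fold. The names `X_demo`, `y_demo`, `pipe`, `pipe_grid` and `search`, and the reduced grid, are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=1)

# Scaling is a pipeline step, so it is refitted on the training part of every fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lr", LogisticRegression(penalty="elasticnet", solver="saga",
                              max_iter=10000, random_state=1)),
])

# Hyperparameters of a pipeline step are addressed as <step name>__<parameter>.
pipe_grid = {"lr__C": [0.1, 1.0, 10], "lr__l1_ratio": [0, 0.5, 1]}

search = GridSearchCV(pipe, pipe_grid, scoring="roc_auc", cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```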

### Support Vector Machines

In [None]:
SVC().get_params()

In [None]:
param_grid = [
    {'C' : [0.1, 1.0, 10, 100], 'kernel' : ['rbf'], 'gamma' : [0.0001, 0.001, 0.01, 0.1, 1]}
]

In [None]:
clf = GridSearchCV(SVC(probability=True, random_state=1),
                  param_grid,
                  scoring = 'roc_auc')
# X is the unscaled training data, so apply the fitted scaler before the search
clf.fit(scaler.transform(X), y)

In [None]:
clf.best_params_

In [None]:
print("Grid scores (mean cross-validated AUC):")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
print()
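
### K-nearest neighbours

Activity 5 asks for the same kind of tuning on KNN over several `k` values. A minimal sketch, assuming synthetic data in place of SAHeart and an arbitrary list of candidate `k` values, could look like this. `KNeighborsClassifier` calls the number of neighbours `n_neighbors`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=1)

knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
# make_pipeline names each step after its lowercased class name.
knn_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 11, 21, 41]}

knn_search = GridSearchCV(knn_pipe, knn_grid, scoring="roc_auc", cv=5)
knn_search.fit(X_demo, y_demo)
print(knn_search.best_params_, round(knn_search.best_score_, 3))
```

KNN relies on distances between rows, so standardising the features matters here more than for most models.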

### Exercise 1:
Use a random search instead of a grid search to find the optimal hyperparameters. Compare with the above results. Any comments?
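
As a possible starting point, scikit-learn's `RandomizedSearchCV` draws a fixed number of candidates from distributions instead of trying every grid point. The sketch below is run on synthetic data; the log-uniform ranges, `n_iter=20` and the names `param_dist` and `rnd` are assumptions chosen to mirror the SVM grid above.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=1)

# Sample C and gamma on a log scale instead of fixing a grid of values.
param_dist = {"C": loguniform(1e-1, 1e2), "gamma": loguniform(1e-4, 1e0)}

rnd = RandomizedSearchCV(SVC(kernel="rbf", random_state=1), param_dist,
                         n_iter=20, scoring="roc_auc", cv=5, random_state=1)
rnd.fit(X_demo, y_demo)
print(rnd.best_params_, round(rnd.best_score_, 3))
```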

### Exercise 2:
Implement an artificial neural network on the SAHeart data. Tune several ANN hyperparameters such as learning rate, number of neurons per hidden layer and number of hidden layers. Is there any ANN that performs better than the above implementations?
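
One option within scikit-learn is `MLPClassifier`. The sketch below tunes it on synthetic data. The layer sizes, learning rates, `max_iter=2000` and the names `ann`, `ann_grid` and `ann_search` are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=1)

ann = make_pipeline(StandardScaler(),
                    MLPClassifier(max_iter=2000, random_state=1))
ann_grid = {
    # One tuple entry per hidden layer; each number is that layer's neuron count.
    "mlpclassifier__hidden_layer_sizes": [(5,), (10,), (5, 5)],
    "mlpclassifier__learning_rate_init": [0.001, 0.01],
}
ann_search = GridSearchCV(ann, ann_grid, scoring="roc_auc", cv=3)
ann_search.fit(X_demo, y_demo)
print(ann_search.best_params_, round(ann_search.best_score_, 3))
```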

### Exercise 3:
From the above model performances, identify any model that might be overfitting or underfitting.
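
A simple diagnostic is the gap between training and validation AUC: a large gap suggests overfitting, while low scores on both suggest underfitting. The sketch below illustrates this on noisy synthetic data with an RBF SVM at three `gamma` values. The data, the `gamma` values and the `gaps` dictionary are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy synthetic data: flip_y randomly reassigns a share of the labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     flip_y=0.2, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=1)

# A very large gamma lets the RBF kernel fit each training point individually.
gaps = {}
for gamma in [0.001, 0.1, 10]:
    svm = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, svm.decision_function(X_tr))
    auc_va = roc_auc_score(y_va, svm.decision_function(X_va))
    gaps[gamma] = auc_tr - auc_va
    print(f"gamma={gamma}: train AUC={auc_tr:.3f}, "
          f"validation AUC={auc_va:.3f}, gap={gaps[gamma]:.3f}")
```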