## Table of contents

- [Classification](#Classification)
    - [Dataset](#Dataset)
- [Regression](#Regression)
- [Exercises](#Exercises)

In [None]:
# Important modules
import numpy as np
import matplotlib
# Reshape figures to a larger figsize
matplotlib.rcParams['figure.figsize'] = [14,14]

## References

Here are some additional references to guide your self-study. Each reference is tagged with the topic it covers.

References for classification:
- https://scikit-learn.org/stable/modules/linear_model.html#classification for linear models
- https://scikit-learn.org/stable/modules/tree.html#classification for decision trees

References for other topics:
- https://scikit-learn.org/stable/datasets/toy_dataset.html#toy-datasets for toy datasets
- https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-data for preprocessing
- https://scikit-learn.org/stable/modules/decomposition.html for decomposition
- https://scikit-learn.org/stable/modules/feature_selection.html#feature-selection for feature selection
- https://scikit-learn.org/stable/modules/linear_model.html#regression for regression using linear models
- https://scikit-learn.org/stable/model_selection.html#model-selection-and-evaluation for model selection and hyperparameter tuning
- https://scikit-learn.org/stable/modules/compose.html#build-a-pipeline for pipelines




## Introduction

Scikit-learn is an open-source Python library widely used for machine learning tasks such as data preprocessing, classification, regression, and clustering. It is built on top of other scientific libraries such as NumPy, SciPy, and Matplotlib, and it provides tools for preparing data and building models from it. It is also user-friendly and efficient for both beginners and experts in data science.

This notebook covers different tools to perform supervised (classification and regression) and unsupervised learning (clustering). Therefore, it contains some basic machine learning concepts. However, the goal is not to teach about the algorithms and procedures used in this field, but rather to give the user an idea of what can be done with this Python library.

## Dataset

### Load Data

To explore the library, we first load a dataset from the list of toy datasets available in scikit-learn. We start with the Breast Cancer Wisconsin (Diagnostic) dataset, which contains 569 samples with 30 features each. Each sample is classified as either malignant or benign.

In [None]:
from sklearn import datasets
data = datasets.load_breast_cancer()
features = data['data']
targets = data['target']
feature_names = data['feature_names']
target_names = data['target_names']

### Train-Validation-Test Split

A train-validation-test split is fundamental when building machine learning models. Machine learning algorithms are very powerful and can easily memorise the data used for training. Therefore, it is highly recommended to split the data into three sets: a training set, a validation set, and a test set. The training set is used to fit the model. The validation set is used for hyperparameter optimisation, i.e., we train the model several times with different settings and select the one that performs best on the validation set. Finally, the model is evaluated on an out-of-sample set (the test set) to estimate its performance in a situation similar to a real-life scenario.

Using ```model_selection.train_test_split```, we first divide the dataset into train and test sets. Then, we further divide the train set into a smaller train set and a validation set. A common ratio is 60% for training, 20% for validation, and 20% for test. This is why the second split uses ```test_size = 0.25```: a quarter of the remaining 80% is 20% of the full dataset.

In [None]:
from sklearn import model_selection
features_train, features_test, targets_train, targets_test = model_selection.train_test_split(features, targets, test_size = 0.20, random_state = 42)
features_train, features_val, targets_train, targets_val = model_selection.train_test_split(features_train, targets_train, test_size = 0.25, random_state = 42)
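
As a quick sanity check, the sketch below repeats the two splits on the same dataset and prints the fraction of samples that ends up in each set. The shortened variable names (`f_train`, `t_train`, and so on) are chosen only for this example.

```python
from sklearn import datasets, model_selection

features, targets = datasets.load_breast_cancer(return_X_y=True)

# First split: hold out 20% of all samples as the test set
f_train, f_test, t_train, t_test = model_selection.train_test_split(
    features, targets, test_size=0.20, random_state=42)
# Second split: 25% of the remaining 80% is 20% of the full dataset
f_train, f_val, t_train, t_val = model_selection.train_test_split(
    f_train, t_train, test_size=0.25, random_state=42)

n_total = len(features)
split_fractions = {name: len(part) / n_total
                   for name, part in [("train", f_train), ("val", f_val), ("test", f_test)]}
for name, fraction in split_fractions.items():
    print(f"{name}: {fraction:.2f}")
```

If the classes are imbalanced, passing `stratify=targets` to `train_test_split` is one option for keeping the class proportions similar in every set.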

## Data Preprocessing

Before developing the models, the data should be preprocessed. Note that toy datasets are usually already clean and ready to use, whereas real-life data often contain errors that must be handled first. Furthermore, datasets usually contain raw data, so feature extraction must be performed to obtain information that can be used to develop a machine learning model. However, as machine learning is not the main focus of the workshop, we skip this task and use a dataset that already contains some features.

### Standardising data

Standardisation is the process of rescaling the features so that they have a mean of 0 and a standard deviation of 1. It is also known as Z-score normalisation. The formula is given by 

$$
z = \frac{x - \mu}{\sigma}
$$

This procedure centres the data around zero, which helps many algorithms converge faster during training. It also puts all variables on a comparable scale, preventing features with larger values from dominating in certain algorithms or models.
Note that this scaling procedure should only be applied to numerical features. Other types of features, e.g. categorical ones, require other techniques.

First, we compute the mean and the standard deviation of the train set with ```StandardScaler().fit()```, and then we scale the different sets with ```StandardScaler().transform()```. Only the train set is used to fit the scaler, in order to avoid data leakage.

In [None]:
from sklearn import preprocessing
standard_scaler = preprocessing.StandardScaler()
standard_scaler.fit(features_train)
features_train_standardised = standard_scaler.transform(features_train)
features_val_standardised = standard_scaler.transform(features_val)
features_test_standardised = standard_scaler.transform(features_test)
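
To connect the formula with the code, the sketch below applies the z-score by hand to a small synthetic array and compares the result with ```StandardScaler```. The synthetic data and the variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn import preprocessing

# Synthetic "train set": 100 samples, 3 numerical features
rng = np.random.default_rng(0)
x_train = rng.normal(loc=5.0, scale=3.0, size=(100, 3))

scaler = preprocessing.StandardScaler().fit(x_train)
x_scaled = scaler.transform(x_train)

# Manual z-score per feature; np.std uses the population standard deviation by default
x_manual = (x_train - x_train.mean(axis=0)) / x_train.std(axis=0)

print("matches StandardScaler:", np.allclose(x_scaled, x_manual))
print("mean per feature:", np.round(x_scaled.mean(axis=0), 6))
print("std per feature:", np.round(x_scaled.std(axis=0), 6))
```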

### Dimensionality reduction

#### Feature selection

Feature selection is the process of choosing the most relevant and informative features from the available ones. Its goal is to improve model performance, reduce overfitting, decrease computational overhead, and enhance interpretability. There are different ways of performing it. In this case, we use a filter method, which removes the less informative variables independently of the machine learning model used afterwards.

Using ```SelectKBest()``` with the ```f_classif``` scoring function, we select the best 10 features (an arbitrarily chosen number). Again, we fit the selector on the train set only and then use the fitted selector to transform the different sets. Afterwards, ```SelectKBest().get_support(indices=True)``` returns the indexes of the selected features (the most informative ones).

In [None]:
from sklearn import feature_selection
feature_selection_function = feature_selection.f_classif
select_best_features = feature_selection.SelectKBest(score_func = feature_selection_function, k = 10)
select_best_features.fit(features_train_standardised, targets_train)
features_train_selected = select_best_features.transform(features_train_standardised)
features_val_selected = select_best_features.transform(features_val_standardised)
features_test_selected = select_best_features.transform(features_test_standardised)

selected_features_indexes = select_best_features.get_support(indices=True)
selected_features_names = feature_names[selected_features_indexes]
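
To see why these features were chosen, one option is to inspect the scores stored in the fitted selector. The sketch below is self-contained and assumes the same dataset, split, and scaling as above; it ranks the features by their ANOVA F-score and checks that the ten highest-scoring ones are those returned by ```get_support```.

```python
import numpy as np
from sklearn import datasets, feature_selection, model_selection, preprocessing

data = datasets.load_breast_cancer()
x_train, _, y_train, _ = model_selection.train_test_split(
    data["data"], data["target"], test_size=0.20, random_state=42)
x_train = preprocessing.StandardScaler().fit_transform(x_train)

selector = feature_selection.SelectKBest(score_func=feature_selection.f_classif, k=10)
selector.fit(x_train, y_train)

# Rank all features by their ANOVA F-score, highest first
ranking = np.argsort(selector.scores_)[::-1]
for index in ranking[:10]:
    print(f"{data['feature_names'][index]:<25s} F = {selector.scores_[index]:.1f}")

top_10_by_score = set(ranking[:10])
selected = set(selector.get_support(indices=True))
print("same features:", top_10_by_score == selected)
```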


#### Feature projection

Feature projection is a form of dimensionality reduction that transforms the features into a new, smaller set of variables. In this case, we use Principal Component Analysis (PCA), which transforms the variables into orthogonal variables called principal components. It is usually used to reduce the number of dimensions while retaining as much of the variability as possible, which facilitates data visualisation in lower-dimensional spaces. However, note that once the features are transformed they are no longer directly interpretable, as the new features are linear combinations of the original ones. This technique may also be considered an unsupervised learning method. However, we keep it here to make the notebook easier to follow.

First, we select the number of components that we want in the new set with ```PCA(n_components = 2)```. Then, we fit the model using the train set. Finally, we transform the different sets. Using ```PCA().explained_variance_ratio_```, we can check how much variance is explained by the computed components. A rule of thumb is to keep at least 70-80% of the explained variance; below that, the projection may discard too much information to represent the data reliably.

In [None]:
from sklearn import decomposition
pca_model = decomposition.PCA(n_components = 2)
pca_model.fit(features_train_selected)
features_train_reduced = pca_model.transform(features_train_selected)
features_val_reduced = pca_model.transform(features_val_selected)
features_test_reduced = pca_model.transform(features_test_selected)
explained_variance_ratio = pca_model.explained_variance_ratio_
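
The rule of thumb can also be applied directly. As a minimal sketch, assuming we work on all 30 standardised features rather than the 10 selected ones, a fractional ```n_components``` asks PCA to keep just enough components to pass that share of explained variance.

```python
import numpy as np
from sklearn import datasets, decomposition, model_selection, preprocessing

features, targets = datasets.load_breast_cancer(return_X_y=True)
x_train, x_test = model_selection.train_test_split(features, test_size=0.20, random_state=42)
x_train_std = preprocessing.StandardScaler().fit_transform(x_train)

# A float n_components asks PCA for the smallest number of components
# whose cumulative explained variance exceeds that fraction
pca = decomposition.PCA(n_components=0.80, svd_solver="full").fit(x_train_std)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

print("components kept:", pca.n_components_)
print("cumulative explained variance:", np.round(cumulative_variance, 3))
```

With two components fixed for plotting, as in the cell above, the sum of ```explained_variance_ratio_``` shows how far the 2D view is from that threshold.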

### Plot dataset

After reducing the dimension of our dataset to just 2 variables, we are able to see the samples in a 2D plot.

In [None]:
import matplotlib.pyplot as plt
plt.scatter(features_train_reduced[:,0], features_train_reduced[:,1], c = targets_train)

### Unsupervised learning

Unsupervised learning is a type of machine learning where the model is trained on unlabeled data. The goal of this field is to discover hidden patterns, structures, or relationships within the data. KMeans is a widely used unsupervised learning method. It groups data points in such a way that points within the same cluster are more similar to each other than to those in the other clusters. The centroid of each cluster is the average of all data points belonging to it.

First, we define how many clusters the model should find with ```KMeans(n_clusters = 2)```. Then, we fit the model and get the assigned labels. As we want to see the samples in a 2D plot, we use the projected features for training the clustering method. To streamline the notebook, we use only the train set for this task.

In [None]:
from sklearn import cluster
kmeans_model = cluster.KMeans(n_clusters = 2, random_state = 42)
kmeans_model.fit(features_train_reduced)

targets_pred_train_cluster = kmeans_model.predict(features_train_reduced)

# KMeans cluster ids are arbitrary: flip them if they disagree with the targets more often than not
if np.mean(targets_pred_train_cluster == targets_train) < 0.5:
    targets_pred_train_cluster = 1 - targets_pred_train_cluster

# Let's see the result
true_positive_train = features_train_reduced[(targets_train == 1) & (targets_pred_train_cluster == 1)]
true_negative_train = features_train_reduced[(targets_train == 0) & (targets_pred_train_cluster == 0)]
false_positive_train = features_train_reduced[(targets_train == 0) & (targets_pred_train_cluster == 1)]
false_negative_train = features_train_reduced[(targets_train == 1) & (targets_pred_train_cluster == 0)]
plt.scatter(true_positive_train[:,0], true_positive_train[:,1], color = 'blue', marker = '.')
plt.scatter(true_negative_train[:,0], true_negative_train[:,1], color = 'green', marker = '.')
plt.scatter(false_positive_train[:,0], false_positive_train[:,1], color = 'blue', marker = 'x')
plt.scatter(false_negative_train[:,0], false_negative_train[:,1], color = 'green', marker = 'x')
plt.legend(['TP', 'TN', 'FP', 'FN'])
plt.title('Clustering data')
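
KMeans assigns cluster ids arbitrarily, so an element-by-element comparison with the targets depends on which cluster happens to be called 1. One way around this, sketched below, is a score that ignores the ids, such as the adjusted Rand index. The example is self-contained and, for brevity, assumes the full dataset is scaled and projected without a train-test split.

```python
from sklearn import cluster, datasets, decomposition, metrics, preprocessing

features, targets = datasets.load_breast_cancer(return_X_y=True)
features_std = preprocessing.StandardScaler().fit_transform(features)
features_2d = decomposition.PCA(n_components=2).fit_transform(features_std)

kmeans = cluster.KMeans(n_clusters=2, n_init=10, random_state=42).fit(features_2d)

# The adjusted Rand index compares two partitions and ignores the cluster ids
ari = metrics.adjusted_rand_score(targets, kmeans.labels_)
ari_flipped = metrics.adjusted_rand_score(targets, 1 - kmeans.labels_)
print(f"ARI: {ari:.3f}, ARI with flipped ids: {ari_flipped:.3f}")
```

A value close to 1 means the clusters largely coincide with the two classes, while a value near 0 corresponds to a random assignment.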

### Supervised learning

Supervised learning is a type of machine learning where the algorithm learns patterns and relations directly from labeled training data. It can be split into classification, in which the model predicts categories, e.g. image classification or spam filters, and regression, in which the model predicts a continuous numerical output, e.g. house price prediction or time-series forecasting.

#### Classification

As mentioned above, classification assigns samples to predefined classes. Two widely used methods for this task are logistic regression (a linear classifier) and the decision tree (a non-linear classifier). Both are easily interpretable and do not require a significant amount of computational resources to be trained. We use the selected features from before the PCA projection because we want to be able to interpret the models.

##### Logistic Regression

First, we fit the model on the train set (```LogisticRegression().fit()```) and then predict the labels of the validation set with ```predict```. The validation performance is what guides the choice of hyperparameters. Since the aim here is not to find the best possible classifier, we skip the tuning, concatenate the train and validation sets, and retrain the model. After training, the values needed to reconstruct the equation used for predicting the labels are available in the attributes ```LogisticRegression().intercept_``` and ```LogisticRegression().coef_```.

In [None]:
from sklearn import linear_model

logistic_regression_model = linear_model.LogisticRegression()
logistic_regression_model.fit(features_train_selected, targets_train)

targets_pred_val = logistic_regression_model.predict(features_val_selected)

# Change the classifier to get the best validation performance
complete_features_train_selected = np.vstack((features_train_selected, features_val_selected))
complete_targets_train = np.hstack((targets_train, targets_val))
logistic_regression_model.fit(complete_features_train_selected, complete_targets_train)

targets_pred_test_lr = logistic_regression_model.predict(features_test_selected)
lr_intercept = logistic_regression_model.intercept_
lr_coefs = np.squeeze(logistic_regression_model.coef_)

# Build one "(coefficient x_i)" term per selected feature and join them into the equation
terms = " + ".join("(%.2f x%d)" % (coef, index) for index, coef in enumerate(lr_coefs))
equation = "logit(X) = %.2f + %s" % (lr_intercept[0], terms)
print(equation)
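
The printed equation gives the logit, not the class label. As a minimal sketch of how the two relate, assuming for brevity a model fitted on all standardised features without a split, the logit can be computed by hand from ```intercept_``` and ```coef_``` and passed through the sigmoid to recover the probabilities that scikit-learn reports.

```python
import numpy as np
from sklearn import datasets, linear_model, preprocessing

features, targets = datasets.load_breast_cancer(return_X_y=True)
features_std = preprocessing.StandardScaler().fit_transform(features)
model = linear_model.LogisticRegression(max_iter=1000).fit(features_std, targets)

# Linear part of the model: intercept + sum of coefficient * feature
logit = model.intercept_[0] + features_std @ model.coef_[0]
# The sigmoid maps the logit to the probability of class 1
probability_class_1 = 1.0 / (1.0 + np.exp(-logit))
# A positive logit (probability above 0.5) is predicted as class 1
manual_labels = (logit > 0).astype(int)

print("probabilities match:", np.allclose(probability_class_1, model.predict_proba(features_std)[:, 1]))
print("labels match:", np.array_equal(manual_labels, model.predict(features_std)))
```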

##### Decision Tree

First, we fit the model on the train set (```DecisionTreeClassifier().fit()```) and then predict the labels of the validation set with ```predict```. Again, the validation performance is what guides the choice of hyperparameters. Since the aim here is not to find the best possible classifier, we skip the tuning, concatenate the train and validation sets, and retrain the model. After training, the function ```tree.plot_tree``` draws the tree structure, showing which features were used and at which thresholds they were split to separate the classes.

In [None]:
from sklearn import tree

decision_tree_model = tree.DecisionTreeClassifier(max_depth = 3)
decision_tree_model.fit(features_train_selected, targets_train)

targets_pred_val = decision_tree_model.predict(features_val_selected)

complete_features_train_selected = np.vstack((features_train_selected, features_val_selected))
complete_targets_train = np.hstack((targets_train, targets_val))
decision_tree_model.fit(complete_features_train_selected, complete_targets_train)

targets_pred_test_dt = decision_tree_model.predict(features_test_selected)

tree.plot_tree(decision_tree_model, feature_names = selected_features_names, class_names = target_names, filled = True)
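
The hyperparameter selection that this notebook skips could look like the sketch below. It tries a few values of ```max_depth``` and keeps the one with the best validation accuracy. The candidate values are hypothetical, and the example uses all 30 raw features for brevity.

```python
from sklearn import datasets, metrics, model_selection, tree

features, targets = datasets.load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    features, targets, test_size=0.20, random_state=42)
x_train, x_val, y_train, y_val = model_selection.train_test_split(
    x_train, y_train, test_size=0.25, random_state=42)

# Hypothetical candidate values; any small grid would do
candidate_depths = [1, 2, 3, 4, 5]
validation_accuracy = {}
for depth in candidate_depths:
    model = tree.DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(x_train, y_train)
    validation_accuracy[depth] = metrics.accuracy_score(y_val, model.predict(x_val))

# Keep the depth with the highest validation accuracy (first one wins on ties)
best_depth = max(validation_accuracy, key=validation_accuracy.get)
print(validation_accuracy, "-> best max_depth:", best_depth)
```

The test set is not touched in this loop, so it remains available for the final out-of-sample evaluation.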

### Evaluation

After developing the classifiers, you should measure how well they perform. Confusion matrices are commonly used for this because several metrics, such as accuracy, recall, and precision, can be derived from them. They are computed with ```metrics.confusion_matrix(true_labels, predicted_labels)```. Scikit-learn also provides a function that returns a short report with the main metrics for each class (```metrics.classification_report(true_labels, predicted_labels)```).

In [None]:
from sklearn import metrics
confusion_matrix_test_lr = metrics.confusion_matrix(targets_test, targets_pred_test_lr)
confusion_matrix_test_dt = metrics.confusion_matrix(targets_test, targets_pred_test_dt)

print("---------------------Logistic Regression---------------------")
print(metrics.classification_report(targets_test, targets_pred_test_lr, target_names = target_names))

print("------------------------Decision Tree------------------------")
print(metrics.classification_report(targets_test, targets_pred_test_dt, target_names = target_names))
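
To make the link between the confusion matrix and the metrics concrete, the sketch below derives accuracy, precision, and recall from the four counts. The labels are hypothetical and chosen only to keep the counts easy to follow.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels, chosen only to make the counts easy to follow
true_labels = np.array([0, 0, 1, 1, 1, 0, 1, 1])
predicted_labels = np.array([0, 1, 1, 1, 0, 0, 1, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(true_labels, predicted_labels).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

The same values are returned by ```metrics.accuracy_score```, ```metrics.precision_score```, and ```metrics.recall_score```.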

## Regression

### Dataset

#### Load Data

In [None]:
import sklearn.datasets
data = sklearn.datasets.load_diabetes()
X = data['data']
y = data['target']
feature_names = np.array(data['feature_names'])

#### Train-validation-test split

In [None]:
import sklearn.model_selection
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.20, random_state = 42)
X_train, X_val, y_train, y_val = sklearn.model_selection.train_test_split(X_train, y_train, test_size = 0.25, random_state = 42)

### Preprocessing

#### Standardising data

In [None]:
import sklearn.preprocessing
standard_scaler = sklearn.preprocessing.StandardScaler()
X_train_standardised = standard_scaler.fit_transform(X_train)
X_val_standardised = standard_scaler.transform(X_val)
X_test_standardised = standard_scaler.transform(X_test)

#### Feature selection

In [None]:
import sklearn.feature_selection
feature_selection_function = sklearn.feature_selection.f_regression
select_best_features = sklearn.feature_selection.SelectKBest(score_func = feature_selection_function, k = 2)
X_train_selected = select_best_features.fit_transform(X_train_standardised, y_train)
X_val_selected = select_best_features.transform(X_val_standardised)
X_test_selected = select_best_features.transform(X_test_standardised)
selected_features_indexes = select_best_features.get_support(indices=True)
selected_features_names = feature_names[selected_features_indexes]

### Supervised learning

#### Linear Regression

In [None]:
import sklearn.linear_model

linear_regression_model = sklearn.linear_model.LinearRegression()
linear_regression_model.fit(X_train_selected, y_train)
y_pred_val = linear_regression_model.predict(X_val_selected)

# Change the regressor to get the best validation performance
complete_X_train_selected = np.vstack((X_train_selected, X_val_selected))
complete_y_train = np.hstack((y_train, y_val))
linear_regression_model.fit(complete_X_train_selected, complete_y_train)

y_pred_test_linreg = linear_regression_model.predict(X_test_selected)
linreg_intercept = linear_regression_model.intercept_
linreg_coefs = np.squeeze(linear_regression_model.coef_)

equation = "y = %.2f + (%.2f x0) + (%.2f x1)" % (linreg_intercept, linreg_coefs[0], linreg_coefs[1])
print(equation)

#### Evaluation

In [None]:
import sklearn.metrics
r2_score_linreg = sklearn.metrics.r2_score(y_test, y_pred_test_linreg)
# Take the square root explicitly: the `squared` argument is deprecated in recent scikit-learn releases
rmse_linreg = np.sqrt(sklearn.metrics.mean_squared_error(y_test, y_pred_test_linreg))
rms_target = np.sqrt(np.mean(y_test ** 2))
rrmse_linreg = rmse_linreg/rms_target * 100
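
For reference, the quantities computed in this cell can be written as follows, where $y_i$ are the test targets, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the test targets, and $n$ the number of test samples. The name rRMSE (relative RMSE) is taken from the variable name in the code; here it is the RMSE expressed as a percentage of the root mean square of the targets.

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\qquad
\mathrm{rRMSE} = 100 \cdot \frac{\mathrm{RMSE}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2}}
$$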


# Pipeline
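
A minimal sketch of how the steps used earlier in this notebook could be chained with ```pipeline.Pipeline```, assuming the same scaler, feature selector, and classifier as in the classification part. The step names are arbitrary labels chosen for this example.

```python
from sklearn import (datasets, feature_selection, linear_model,
                     model_selection, pipeline, preprocessing)

features, targets = datasets.load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    features, targets, test_size=0.20, random_state=42)

# Step names ("scaler", "selector", "classifier") are arbitrary labels
classification_pipeline = pipeline.Pipeline([
    ("scaler", preprocessing.StandardScaler()),
    ("selector", feature_selection.SelectKBest(score_func=feature_selection.f_classif, k=10)),
    ("classifier", linear_model.LogisticRegression()),
])

# fit() fits every step on the train set only; score() reuses the fitted steps
classification_pipeline.fit(x_train, y_train)
test_accuracy = classification_pipeline.score(x_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the scaler and the selector are fitted inside ```fit()```, the test set never influences the preprocessing, which is the same data-leakage precaution taken manually above.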

## Exercises

In [None]:
%reload_ext tutorial.tests.testsuite

In [None]:
%%ipytest
def solution_equal_to_one(a):
    return a==2