# Introduction to Machine Learning 

Notebook prepared by [Chloé-Agathe Azencott](http://cazencott.info), with thanks to [Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook). Data from the GEMLeR (Gene Expression Machine Learning Repository) maintained by [Gregor Stiglic](http://www.ri.fzv.um.si/gstiglic/).

In this notebook, we'll try to build a classifier that automatically separates breast cancer tumors from ovarian cancer tumors, based on the gene expression (microarray data) of 3,000 genes.

# 1. Preamble
## 1.1 What is Jupyter? 

A Jupyter notebook is a web application that allows you to create and share documents (such as this `.ipynb` notebook) that contain live code, visualisations and explanatory text (with equations).

Here are some tips on using a Jupyter notebook:
* Each block of text is contained in a _cell_. A cell can be either raw text, code, or markdown text (such as this cell). For more info on markdown syntax, follow the [guide](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html).
* You can run a cell by clicking inside it and hitting `Shift+Enter` (or the play button in the toolbar).

In [None]:
2 + 2  # hit Shift+Enter to run

* If you want to create a new cell below the one you're running, hit `Alt+Enter` (or the plus button in the toolbar).

Some tips on using a Jupyter notebook with Python:
* A notebook behaves like an interactive Python shell! This means that
    * classes, functions, and variables defined at the cell level have global scope throughout the notebook
    * hitting `Tab` will autocomplete the keyword you have started typing
    * typing a question mark after a function name will load the interactive help for this function.
* Jupyter has special Python commands (shortcuts, if you will) called _magics_. For instance, `%%bash` at the top of a cell will allow you to run bash code in that cell, `%paste` will allow you to paste a block of code while retaining its formatting, and `%matplotlib inline` will set up the visualization library matplotlib so that its plots are displayed inline, that is, below the cell. Here's a full list: http://ipython.readthedocs.io/en/stable/interactive/magics.html 
* Learn more about the interactive Python shell here: http://ipython.readthedocs.io/en/stable/interactive/tutorial.html

For more info on Jupyter: https://jupyter.org/

### Google Colab

You can run this notebook on [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb).

Here are several things you will need that are specific to Google Colab:
* Make sure to download the data file `small_Breast_Ovary.csv` locally (to your computer), and upload it again on Colab.
* You may need to force Colab to use Python 3.7. To do so, uncomment (that is to say, remove the `# `) and run the cell below:

In [None]:
#!apt-get install python3.7

### Local installs

If you want to be able to run this notebook on your own machine, here's what you'll need:

__Option 1:__ If you are familiar with python and comfortable with managing your own installation, make sure you have Python 3.7 installed and the following packages (all can be installed with pip): numpy, scipy, pandas, matplotlib, scikit-learn, jupyter and jupyterlab.

__Option 2:__ If you are not familiar with Python and library management, we recommend using either
* miniconda: https://docs.conda.io/en/latest/miniconda.html
* anaconda: https://www.anaconda.com/distribution/

Miniconda is lighter, but you will need to make sure all the required packages are installed; anaconda is heavier (requires a few GB of space) but everything should work “out of the box”.

Make sure to follow the installation instructions for your operating system (Mac/Windows/Linux) and install the Python 3.7 version.

If you’re unsure whether your Windows machine is running a 32-bit or 64-bit system, you can use the instructions here: https://www.lifewire.com/am-i-running-a-32-bit-or-64-bit-version-of-windows-2624475 to check. If you have a 32-bit version, you’ll need to use miniconda. For Linux, run “uname -i” in a terminal. If the answer is x86_64, you have a 64-bit system; if it is i386 or i686, you have a 32-bit system.

## 1.2 Data science libraries

Let us start with the Jupyter magic "`%pylab inline`", which is roughly equivalent to importing `numpy` as `np`, and importing `matplotlib.pyplot` as `plt`. 

`numpy` (for "numeric python") is the library used for manipulating arrays (typically representing vectors and matrices) in Python. To access object `a_numpy_object` from `numpy`, we'll use `np.a_numpy_object`.

`matplotlib` is a plotting library inspired by Matlab.

The `inline` specifier makes it so that the plots will appear under the cell and not in a separate window.

In [None]:
%pylab inline

This command is equivalent to:

```python
import numpy as np
import matplotlib.pyplot as plt
```

We will also import the `pandas` library, which is very useful for data manipulation.

__Documentation:__ http://pandas.pydata.org/pandas-docs/stable/

In [None]:
import pandas as pd

For all our machine learning purposes, we will use the library `scikit-learn`: https://scikit-learn.org/stable/index.html
Its documentation is very complete! Don't hesitate to refer to it extensively.

## 1.3 Data 

### Load the data

In this data set, each observation is a tumor, and it is described by the expression of 3,000 genes. There are two types of tumors: breast tumors and ovary tumors. Our goal will be to build a tumor classifier based on gene expression.

In [None]:
bvo_df = pd.read_csv('small_Breast_Ovary.csv')

In [None]:
bvo_df.head()

The first column ("ID_REF") contains the sample ID, the last one ("Tissue") the "Breast" or "Ovary" label, and all others are gene expressions.

### Transform the data into NumPy arrays

The information describing the samples can be thought of as a two-dimensional numerical array or matrix, which we will call the __design matrix.__ By convention, this matrix is often stored in a variable named `X`. It is assumed to be two-dimensional, with shape `(n_samples, n_features)`, contained in a NumPy array.

The samples (i.e., rows) always refer to the individual objects described by the dataset; here, our tumors. 

The features (i.e., columns) always refer to the distinct observations that describe each sample in a quantitative manner; here, the transcript levels.

In addition to the feature matrix `X`, we also work (in supervised learning) with a NumPy array containing the labels (or targets), which we will usually call `y`. It is stored as a one-dimensional NumPy array of shape `(n_samples, )`. This __target array__ may have continuous numerical values, or discrete classes/labels. This array contains the variable we want to _predict_, as opposed to the feature matrix, which contains the variables we want to _use to make our predictions_.

Let us extract these arrays from the `bvo_df` dataframe:

In [None]:
# design matrix
X = np.array(bvo_df.drop(columns=["ID_REF", "Tissue"]))

In [None]:
X.shape

We have 542 samples, each represented by 3000 gene expressions.

In [None]:
n_features = X.shape[1]

In [None]:
# target array
y = np.array(bvo_df["Tissue"])

# convert "Breast" to 0 and the other label (here, "Ovary") to 1
y = np.where(y=='Breast', 0, 1)

In [None]:
y.shape

### Data standardization

Let us make sure our features all have a mean of 0 and a standard deviation of 1: this will avoid giving too much importance to genes that are more abundant across the whole data set.

This can easily be done with scikit-learn's [preprocessing module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)

In [None]:
from sklearn import preprocessing

Let us instantiate a scaler:

In [None]:
scaler = preprocessing.StandardScaler()

And then compute the scaling parameters on our data:

In [None]:
scaler.fit(X)

Now we can create a scaled version of the data:

In [None]:
X_scaled = scaler.transform(X)

In [None]:
X_scaled
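To make the scaling step more concrete, here is a minimal sketch of the computation on a made-up matrix (`X_toy` and the other names below are illustrative and not part of our data set). It subtracts each column's mean and divides by each column's standard deviation, using the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Toy design matrix: 4 samples (rows), 2 features (columns); values are made up.
X_toy = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0],
                  [4.0, 800.0]])

# Per-feature mean and standard deviation, computed over the samples (axis=0).
feature_means = X_toy.mean(axis=0)
feature_stds = X_toy.std(axis=0)  # population standard deviation (ddof=0)

# Center each column, then rescale it.
X_toy_scaled = (X_toy - feature_means) / feature_stds

print(X_toy_scaled.mean(axis=0))  # close to 0 for each feature
print(X_toy_scaled.std(axis=0))   # close to 1 for each feature
```

The two columns differ by a factor of 200 before scaling, and become identical afterwards: the scale of a feature no longer matters.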

# 2. Training a logistic regression
In this section you will learn how to train a logistic regression on this data.

All machine learning algorithms implemented in scikit-learn follow the same logic:

1. Choose an algorithm and import the appropriate class from scikit-learn

The [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regression#), which is a linear model, is part of the `linear_model` module.

In [None]:
from sklearn import linear_model

2. Instantiate this class with desired hyperparameters.

Here, we don't want to use regularization yet, so we'll use `penalty='none'` (in scikit-learn 1.2 and later, this is written `penalty=None`).

In [None]:
my_model = linear_model.LogisticRegression(penalty='none')

3. Fit the model to the data using `fit()`.

At this stage, our model has not seen any data. Now we'll pass it the data we want it to learn on. 

In [None]:
my_model.fit(X_scaled, y)

We have learned a model! In the case of linear models, we can inspect its coefficients:

In [None]:
my_model.coef_

In [None]:
plt.scatter(np.arange(n_features), my_model.coef_)
plt.xlabel("Feature/Gene index")
plt.ylabel("Coefficient in the logistic model")

Notice that most of these coefficients are very close to zero... this is why we'll use regularization later on.

4. Use the model to make predictions with `predict()`.

Here we'll make predictions on the data we used to learn from.

In [None]:
y_predicted = my_model.predict(X_scaled)
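What does `predict()` do with the coefficients we just looked at? For a binary logistic regression, it computes a linear score for each sample and predicts class 1 when that score is positive. Here is a small sketch on synthetic data (the `_toy` names and the data-generating rule are assumptions made for illustration, and the model uses scikit-learn's default settings):

```python
import numpy as np
from sklearn import linear_model

# Synthetic data: 100 samples, 5 features; the label depends mostly on features 0 and 1.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

toy_model = linear_model.LogisticRegression()
toy_model.fit(X_toy, y_toy)

# Linear score: one coefficient per feature, plus an intercept.
scores = X_toy @ toy_model.coef_.ravel() + toy_model.intercept_[0]

# A positive score means the model favors class 1.
manual_predictions = (scores > 0).astype(int)

print(np.array_equal(manual_predictions, toy_model.predict(X_toy)))
```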

Scikit-learn has a lot of ways to evaluate predictions in its [`metrics`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics) module.

We can for example look at the accuracy of our model: what proportion of the samples did it predict correctly?
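As a quick sketch of what this means, the accuracy is simply the fraction of positions where the predicted label equals the true label. The two small arrays below are made up for illustration:

```python
import numpy as np

# Made-up true and predicted labels for 8 samples.
y_true_toy = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Comparing the arrays gives True/False per sample; the mean is the proportion of True.
toy_accuracy = np.mean(y_true_toy == y_pred_toy)

print("Accuracy: %.3f" % toy_accuracy)  # 6 correct out of 8 -> 0.750
```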

In [None]:
from sklearn import metrics

In [None]:
print("Accuracy of the logistic regression: %.3f" % metrics.accuracy_score(y, y_predicted))

__Wow!__ Our model made perfect predictions!

True, but that's easy to do on the data it learned from... Think of it as being tested on the exact same exercises you did in class — not the same as being able to solve a brand new problem, right?

# 3. Using a test set

It's much more realistic to evaluate the performance of a model on data it has never seen before. For that reason, we're going to set aside a chunk of our data, called the __test set__, which we'll only use for evaluation purposes. We'll train our model on the rest of the data, called the __train set__.

## 3.1 Splitting the data into a train and a test set

Scikit-learn provides utilities to create train and test sets (and more complex evaluation/validation setups) in the `model_selection` module.

In [None]:
from sklearn import model_selection

In [None]:
(X_train, X_test, y_train, y_test) = model_selection.train_test_split(X_scaled, y, 
                                                                      test_size=0.2, 
                                                                      stratify=y # stratifying means respecting the proportion of samples of each class in all sets
                                                                     )

`test_size=0.2` means the test set will be 20% of the full set.

`stratify=y` means the relative proportions of samples of each class in `y` will be respected in the train and test sets.

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

The train set contains 433 samples; the test set contains 109 samples.
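To check what stratification does, here is a small sketch on made-up labels (70 samples of class 0 and 30 of class 1; all `_toy` names are illustrative). We also pass `random_state`, an optional argument of `train_test_split` that makes the split reproducible from one run to the next.

```python
import numpy as np
from sklearn import model_selection

# Made-up data: 100 samples, 30% of them in class 1.
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_toy_train, X_toy_test, y_toy_train, y_toy_test = model_selection.train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    stratify=y_toy,   # keep the class proportions in both sets
    random_state=0)   # fix the random shuffling

print("Proportion of class 1 in the train set: %.2f" % y_toy_train.mean())
print("Proportion of class 1 in the test set: %.2f" % y_toy_test.mean())
```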

## 3.2 Training on the train set only

We can now train our logistic regression on the train set only:

In [None]:
my_model.fit(X_train, y_train)

## 3.3 Evaluation on the test set

Let us now use this model to make predictions on the test set:

In [None]:
y_predicted = my_model.predict(X_test)

The accuracy of the model is now:

In [None]:
print("Accuracy of the logistic regression: %.3f" % metrics.accuracy_score(y_test, y_predicted))

Not bad, but it's not perfect any longer.

To understand this performance in more depth, we can look at the __confusion matrix__ of our predictions:

In [None]:
metrics.plot_confusion_matrix(my_model, X_test, y_test, 
                             cmap=plt.cm.Blues # use a blue color map
                             )

The bottom left cell contains the number of tumors that were predicted to be from breast cancer (Predicted label=0), whereas they were ovarian cancer (True label=1). 
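If you prefer numbers to a plot, `metrics.confusion_matrix` returns the same counts as an array, with true labels as rows and predicted labels as columns. Here is a small sketch on made-up labels (the `_toy` arrays are not part of our data set):

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels for 8 samples.
y_true_toy = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 1, 1, 0, 0, 0, 1, 0])

cm = metrics.confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Row index = true label, column index = predicted label.
# cm[1, 0] is the "bottom left" cell: true label 1, predicted label 0.
print("True 1 predicted as 0:", cm[1, 0])  # -> 2
```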

# 4. Regularized logistic regression

Let us look at the model coefficients again:

In [None]:
plt.scatter(np.arange(n_features), my_model.coef_)
plt.xlabel("Feature/Gene index")
plt.ylabel("Coefficient in the logistic model")

Many of these coefficients are very close to 0. Using a logistic regression with l1 regularization can bring down these coefficients to exactly zero, resulting in a _sparse_ model. Then we can make the hypothesis that only the genes that have non-zero coefficients in the model are relevant to the prediction!

## 4.1 Training a regularized logistic regression

In [None]:
my_l1_regularized_model = linear_model.LogisticRegression(penalty='l1',
                                                         solver='liblinear')

`solver='liblinear'` tells scikit-learn which optimization algorithm to use to fit the model. The default solver is not compatible with l1 regularization, so here we need to explicitly set a solver that can be used for l1 regularization. You can learn more about it [in the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regression#).

In [None]:
my_l1_regularized_model.fit(X_train, y_train)

## 4.2 Effect of the regularization on the model coefficients

Let us now look at the model's coefficients:

In [None]:
plt.scatter(np.arange(n_features), my_l1_regularized_model.coef_)
plt.xlabel("Feature/Gene index")
plt.ylabel("Coefficient in the model")
plt.title("L1-regularized logistic regression")

In [None]:
np.nonzero(my_model.coef_)

To make the difference more clear, we'll plot in two different colors the coefficients that are equal to zero and those that aren't:

In [None]:
fig = plt.figure(figsize=(10, 3))

# First subplot in a (1 x 2) grid
ax = plt.subplot(1, 2, 1)
nonzero_coefficents_indices = np.nonzero(my_model.coef_)
plt.scatter(nonzero_coefficents_indices[1], 
            my_model.coef_[nonzero_coefficents_indices], label='Non-zero coefficients')
zero_coefficients_indices = np.nonzero(my_model.coef_ == 0)
plt.scatter(zero_coefficients_indices[1], 
           my_model.coef_[zero_coefficients_indices], label='Zero coefficients')
plt.xlabel("Feature/Gene index")
plt.ylabel("Coefficient in the model")
plt.title("Logistic regression")

# Second subplot in a (1 x 2) grid
ax = plt.subplot(1, 2, 2)
nonzero_coefficents_indices = np.nonzero(my_l1_regularized_model.coef_)
plt.scatter(nonzero_coefficents_indices[1], 
            my_l1_regularized_model.coef_[nonzero_coefficents_indices], label='Non-zero coefficients')
zero_coefficients_indices = np.nonzero(my_l1_regularized_model.coef_ == 0)
plt.scatter(zero_coefficients_indices[1], 
            my_l1_regularized_model.coef_[zero_coefficients_indices], label='Zero coefficients')
plt.xlabel("Feature/Gene index")
plt.ylabel("Coefficient in the model")
plt.title("L1-regularized logistic regression")

plt.legend()

Without regularization, no feature has a coefficient of exactly zero!

## 4.3 Prediction performance

In [None]:
y_predicted_l1log = my_l1_regularized_model.predict(X_test)

The accuracy of the model is now:

In [None]:
print("Accuracy of the l1-regularized logistic regression: %.3f" % metrics.accuracy_score(y_test, y_predicted_l1log))

In [None]:
metrics.plot_confusion_matrix(my_l1_regularized_model, X_test, y_test, 
                             cmap=plt.cm.Blues # use a blue color map
                             )

## 4.4 Effect of the amount of regularization

### Large regularization

We have used the default setting for the inverse of the regularization strength parameter `C`. 

However, changing this hyperparameter has a strong effect: the stronger the regularization (i.e. the smaller `C`), the more coefficients will be set to zero in the model. 

We can observe this by reiterating the above experiment with `C=0.01`:

In [None]:
my_l1_regularized_model_2 = linear_model.LogisticRegression(penalty='l1',
                                                            solver='liblinear', 
                                                           C=0.01)

In [None]:
my_l1_regularized_model_2.fit(X_train, y_train)

Let us look at the coefficients now:

In [None]:
fig = plt.figure(figsize=(15, 3))

# First subplot in a (1 x 3) grid
ax = plt.subplot(1, 3, 1)
nonzero_coefficents_indices = np.nonzero(my_model.coef_)
plt.scatter(nonzero_coefficents_indices[1], 
            my_model.coef_[nonzero_coefficents_indices], label='Non-zero coefficients')
zero_coefficients_indices = np.nonzero(my_model.coef_ == 0)
plt.scatter(zero_coefficients_indices[1], 
           my_model.coef_[zero_coefficients_indices], label='Zero coefficients')
plt.xlabel("Feature/Gene index")
plt.ylabel("Coefficient in the model")
plt.title("Logistic regression")

# Second subplot in a (1 x 3) grid
ax = plt.subplot(1, 3, 2)
nonzero_coefficents_indices = np.nonzero(my_l1_regularized_model.coef_)
plt.scatter(nonzero_coefficents_indices[1], 
            my_l1_regularized_model.coef_[nonzero_coefficents_indices], label='Non-zero coefficients')
zero_coefficients_indices = np.nonzero(my_l1_regularized_model.coef_ == 0)
plt.scatter(zero_coefficients_indices[1], 
            my_l1_regularized_model.coef_[zero_coefficients_indices], label='Zero coefficients')
plt.xlabel("Feature/Gene index")
plt.ylabel("Coefficient in the model")
plt.title("L1-regularized logistic regression (C=1.0)")

# Third subplot in a (1 x 3) grid
ax = plt.subplot(1, 3, 3)
nonzero_coefficents_indices = np.nonzero(my_l1_regularized_model_2.coef_)
plt.scatter(nonzero_coefficents_indices[1], 
            my_l1_regularized_model_2.coef_[nonzero_coefficents_indices], label='Non-zero coefficients')
zero_coefficients_indices = np.nonzero(my_l1_regularized_model_2.coef_ == 0)
plt.scatter(zero_coefficients_indices[1], 
            my_l1_regularized_model_2.coef_[zero_coefficients_indices], label='Zero coefficients')
plt.xlabel("Feature/Gene index")
plt.ylabel("Coefficient in the model")
plt.title("L1-regularized logistic regression (C=0.01)")

plt.legend()

Many more coefficients are equal to zero now. How did this affect the prediction performance?

In [None]:
y_predicted_l1log_2 = my_l1_regularized_model_2.predict(X_test)

The accuracy of the model is now:

In [None]:
print("Accuracy of the l1-regularized logistic regression (C=0.01): %.3f" % metrics.accuracy_score(y_test, y_predicted_l1log_2))

In [None]:
metrics.plot_confusion_matrix(my_l1_regularized_model_2, X_test, y_test, 
                             cmap=plt.cm.Blues # use a blue color map
                             )

Increasing the amount of regularization reduces the number of features used by the model, but this can also hurt performance.
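The link between `C` and sparsity can also be checked by counting non-zero coefficients directly. The sketch below does this on synthetic data (20 features, of which only the first two carry signal; the data and the `_toy` names are assumptions made for illustration), for a few values of `C`:

```python
import numpy as np
from sklearn import linear_model

# Synthetic data: 200 samples, 20 features; only features 0 and 1 influence the label.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 20))
y_toy = (X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

n_nonzero = {}
for C in [0.01, 0.1, 1.0]:
    toy_model = linear_model.LogisticRegression(penalty='l1', solver='liblinear',
                                                C=C, random_state=0)
    toy_model.fit(X_toy, y_toy)
    # Number of features the model actually uses for this value of C.
    n_nonzero[C] = np.count_nonzero(toy_model.coef_)
    print("C=%.2f: %d non-zero coefficients out of %d" % (C, n_nonzero[C], X_toy.shape[1]))
```

The exact counts depend on the data, but the smallest value of `C` should keep the fewest features.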

## 4.5 Setting the amount of regularization by cross-validation

We now want to perform __model selection__, that is to say, _select_ the best value of `C`. 

One way to approach the problem would be to test several values for `C` and compare performance on the test set. We would then pick the value of `C` leading to the best performance. Unfortunately, if we proceed in this way, the performance we observe on the test set is biased: it's not true any more to say that we have not touched the test set to create our model! 

What we could do now is split the training set into two sets again: a train set and a validation set. 

However, if we split the data in, say, 60% train + 20% validation + 20% test, now we're only using little more than half our data for training! This is not optimal, especially if the initial set of training data is small, because the more data we have, the better we learn. 

One way to address this is to use __cross-validation__; that is, to do a sequence of fits where each subset of the data is used both as a training set and as a validation set. If we do a 5-fold cross-validation, we split the data in 5 blocks, and run 5 experiments for each value of `C` that we want to test: use the first 4 blocks for training and the last one for validation; use the first three blocks and the last one for training, and the fourth one for validation; and so on and so forth. We end up with 5 measures of performance for each value of `C`, which we can then average to get a global picture of the performance (still for each value of `C`). We can now pick the value of `C` that led to the best performance, train our model again on the training set, and evaluate its performance on the test set.
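To visualize the blocks, here is a minimal sketch that builds 5 folds by hand for 10 samples, using NumPy only. It is a plain split in index order, made for illustration; in practice the samples are usually shuffled, and for classifiers scikit-learn additionally stratifies the folds by class.

```python
import numpy as np

n_samples, n_folds = 10, 5
indices = np.arange(n_samples)

# Cut the sample indices into 5 consecutive blocks.
folds = np.array_split(indices, n_folds)

for k in range(n_folds):
    # Block k is used for validation, all the other blocks for training.
    validation_indices = folds[k]
    train_indices = np.concatenate([folds[j] for j in range(n_folds) if j != k])
    print("Fold %d: train on %s, validate on %s" % (k, train_indices, validation_indices))
```

Each sample is used for validation exactly once, and for training in the four other experiments.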

### Automated model selection by cross-validation with `GridSearchCV`

Let us start by setting up a grid of values of `C`. We'll use 50 values, spread on a logarithmic scale between 1e-3 and 1e3:

In [None]:
C_values = np.logspace(-3, 3, 50)

We can now use scikit-learn's `GridSearchCV`:

In [None]:
l1_regularized_cv = model_selection.GridSearchCV(linear_model.LogisticRegression(penalty='l1', solver='liblinear'), 
                                                 {'C': C_values},
                                                 cv=5)

`{'C': C_values}` tells scikit-learn that it will have to consider all models `linear_model.LogisticRegression(penalty='l1', solver='liblinear', C=xxx)` with `xxx` in `C_values`. 

`cv=5` tells scikit-learn to use a 5-fold cross-validation.

Now we can train our model as usual:

In [None]:
l1_regularized_cv.fit(X_train, y_train)

### Optimal model

The optimal value of the hyperparameter is in the `best_params_` attribute of our trained model:

In [None]:
l1_regularized_cv.best_params_

The corresponding performance is in the `best_score_` attribute of our trained model:

In [None]:
l1_regularized_cv.best_score_

Wait, but what measure of performance is this? The [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) tells us that this can be set with the `scoring` parameter of `GridSearchCV`, which we did not touch. So the default was used — the documentation reads "If None, the estimator’s score method is used." So let's look up the documentation of [LogisticRegression()](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regression#); it tells us that its `score` method returns the mean accuracy, so that's what we're looking at.
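To see where `best_score_` comes from, we can look at the `cv_results_` attribute, which stores the mean cross-validated score for every value in the grid. The sketch below uses synthetic data and a small grid of 4 values (both are assumptions made for illustration, as are the `toy_` names):

```python
import numpy as np
from sklearn import linear_model, model_selection

# Synthetic data: 150 samples, 10 features; only feature 0 influences the label.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 10))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=150) > 0).astype(int)

toy_search = model_selection.GridSearchCV(
    linear_model.LogisticRegression(penalty='l1', solver='liblinear'),
    {'C': [0.01, 0.1, 1.0, 10.0]},
    cv=5)
toy_search.fit(X_toy, y_toy)

# One entry per value of C: the score averaged over the 5 validation folds.
mean_scores = toy_search.cv_results_['mean_test_score']
for params, mean_score in zip(toy_search.cv_results_['params'], mean_scores):
    print("C=%g: mean cross-validated accuracy = %.3f" % (params['C'], mean_score))

print("best_score_ = %.3f" % toy_search.best_score_)
```

`best_score_` is the largest of these averages, and `best_params_` is the value of `C` that produced it.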

scikit-learn has also retrained an l1-regularized logistic regression with the optimal hyperparameter. It is accessible in the `best_estimator_` attribute of our model:

In [None]:
l1_regularized_cv.best_estimator_

We can plot the weights of this model:

In [None]:
plt.scatter(np.arange(n_features), l1_regularized_cv.best_estimator_.coef_)
plt.xlabel("Feature/Gene index")
plt.ylabel("Coefficient in the model")
plt.title("L1-regularized logistic regression")

We can look at the performance of this model on the test data set:

In [None]:
y_predicted_l1log_cv = l1_regularized_cv.best_estimator_.predict(X_test)

In [None]:
print("Accuracy of the l1-regularized logistic regression (optimal C): %.3f" % metrics.accuracy_score(y_test, y_predicted_l1log_cv))

In [None]:
metrics.plot_confusion_matrix(l1_regularized_cv.best_estimator_, X_test, y_test, 
                             cmap=plt.cm.Blues # use a blue color map
                             )

The number of selected features is:

In [None]:
np.count_nonzero(l1_regularized_cv.best_estimator_.coef_)

# 5. Scoring for an unbalanced data set

We used accuracy to select the best model. However, the data is _unbalanced_: there are more breast tumors than ovarian tumors. Let us check their numbers in the training set:

In [None]:
print("Number of breast tumors = %d (%.2f %%  of the training set)" % ((np.count_nonzero(y_train==0)), 100*(np.count_nonzero(y_train==0)/y_train.shape[0])))
print("Number of ovarian tumors = %d (%.2f %% of the training set)" % (np.count_nonzero(y_train==1), 100*(np.count_nonzero(y_train==1)/y_train.shape[0])))

What this means is that, on the training set, a model that systematically returns 0 (i.e. "breast") will have an accuracy of 63.5%.

The imbalance in the data means that models will tend to favor the majority class.

To avoid this, we can use a performance score that accounts for this imbalance. These include the __balanced accuracy__ and the __f1__ score. You can learn more about them in the [documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics).
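As a small sketch of the difference, consider a model that always predicts the majority class on made-up labels (7 samples of class 0 and 3 of class 1). The balanced accuracy is the average of the per-class recalls, so it does not reward ignoring the minority class:

```python
import numpy as np
from sklearn import metrics

# Made-up labels: 70% class 0, 30% class 1, and a model that always predicts 0.
y_true_toy = np.array([0] * 7 + [1] * 3)
y_pred_toy = np.zeros(10, dtype=int)

toy_accuracy = metrics.accuracy_score(y_true_toy, y_pred_toy)
toy_balanced_accuracy = metrics.balanced_accuracy_score(y_true_toy, y_pred_toy)

# The same quantity by hand: fraction of correct predictions within each class, then averaged.
recall_per_class = [np.mean(y_pred_toy[y_true_toy == c] == c) for c in (0, 1)]
manual_balanced_accuracy = np.mean(recall_per_class)

print("Accuracy: %.2f" % toy_accuracy)                    # 0.70
print("Balanced accuracy: %.2f" % toy_balanced_accuracy)  # 0.50
```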

## 5.1 Balanced accuracy of the previous model

In [None]:
print("Balanced accuracy of the l1-regularized logistic regression (optimal C): %.3f" % metrics.balanced_accuracy_score(y_test, y_predicted_l1log_cv))

## 5.2 Optimizing for balanced accuracy

In [None]:
l1_regularized_cv_ba = model_selection.GridSearchCV(linear_model.LogisticRegression(penalty='l1', solver='liblinear'), 
                                                   {'C': C_values},
                                                   cv=5, scoring='balanced_accuracy')

In [None]:
l1_regularized_cv_ba.fit(X_train, y_train)

In [None]:
l1_regularized_cv_ba.best_params_

In [None]:
l1_regularized_cv_ba.best_score_

In [None]:
l1_regularized_cv_ba.best_estimator_

We can plot the weights of this model:

In [None]:
plt.scatter(np.arange(n_features), l1_regularized_cv_ba.best_estimator_.coef_)
plt.xlabel("Feature/Gene index")
plt.ylabel("Coefficient in the model")
plt.title("L1-regularized logistic regression")

In [None]:
np.count_nonzero(l1_regularized_cv_ba.best_estimator_.coef_)

We can look at the performance of this model on the test data set:

In [None]:
y_predicted_l1log_cv = l1_regularized_cv_ba.best_estimator_.predict(X_test)

In [None]:
print("Balanced accuracy of the l1-regularized logistic regression (optimal C): %.3f" % metrics.balanced_accuracy_score(y_test, y_predicted_l1log_cv))

In [None]:
metrics.plot_confusion_matrix(l1_regularized_cv_ba.best_estimator_, X_test, y_test, 
                             cmap=plt.cm.Blues # use a blue color map
                             )

# 6. Decision trees and random forest classifiers

## 6.1 Decision tree

Let us start with a simple non-linear model: a decision tree. Decision trees are implemented in scikit-learn's [tree.DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

In [None]:
from sklearn import tree

We create a DT model:

In [None]:
dt_model = tree.DecisionTreeClassifier()

We can now train a decision tree on the train set only:

In [None]:
dt_model.fit(X_train, y_train)

Let us now use this model to make predictions on the test set:

In [None]:
y_predicted = dt_model.predict(X_test)

The performance of the decision tree is:

In [None]:
print("Accuracy of the decision tree: %.3f" % metrics.accuracy_score(y_test, y_predicted))

In [None]:
print("Balanced accuracy of the decision tree: %.3f" % metrics.balanced_accuracy_score(y_test, y_predicted))

In [None]:
metrics.plot_confusion_matrix(dt_model, X_test, y_test, 
                             cmap=plt.cm.Blues # use a blue color map
                             )

A decision tree clearly underperforms compared to a logistic regression.

Can we improve this with __ensemble methods__?

## 6.2 Random forests

A random forest combines the predictions of multiple decision trees, each trained on a subset of the samples and of the features.

Can an ensemble method improve the performance of the decision tree on the difficult data set? We will use the random forest implementation in scikit-learn's [ensemble.RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
from sklearn import ensemble

An important hyperparameter of a random forest is the number `n_estimators` of trees it contains. We will therefore use cross-validation to set this hyperparameter.

In [None]:
ntrees_values = [10, 20, 50, 100, 300, 500, 2000]

In [None]:
rf_cv = model_selection.GridSearchCV(ensemble.RandomForestClassifier(),  
                                     {'n_estimators': ntrees_values},
                                     cv=5, scoring='balanced_accuracy')

In [None]:
rf_cv.fit(X_train, y_train)

The optimal number of trees is:

In [None]:
rf_cv.best_params_

In [None]:
print("Optimal cross-validated balanced accuracy: %.3f" % rf_cv.best_score_)

Let's see how it performs on the test set:

In [None]:
y_predicted = rf_cv.best_estimator_.predict(X_test)

In [None]:
print("Balanced accuracy of the random forest: %.3f" % metrics.balanced_accuracy_score(y_test, y_predicted))

In [None]:
metrics.plot_confusion_matrix(rf_cv.best_estimator_, X_test, y_test, 
                             cmap=plt.cm.Blues # use a blue color map
                             )

The performance is much better than that of a single decision tree, and is also better than that of the linear models.

### Feature Importance

Random forests have a notion of _feature importance_, stored in the `feature_importances_` attribute. The importance of a feature is computed by looking at how much using that feature decreases the Gini impurity (a measure of classification error) of the model.

In [None]:
plt.scatter(np.arange(n_features), rf_cv.best_estimator_.feature_importances_)
plt.xlabel("Feature/Gene index")
plt.ylabel("Feature importance")
plt.title("Random forest")

We can consider that all features with a non-zero importance are selected:

In [None]:
np.count_nonzero(rf_cv.best_estimator_.feature_importances_)

But that's a lot, so we can also set a threshold by hand (either on the number of features to keep, or on the importance value).

For example, here, we can decide to keep only the features with an importance at least equal to 0.005:

In [None]:
np.count_nonzero(rf_cv.best_estimator_.feature_importances_ >= 0.005)

In [None]:
np.nonzero(rf_cv.best_estimator_.feature_importances_ >= 0.005)[0]
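To get a feel for how such a threshold behaves, here is a sketch on synthetic data where we know which feature matters (only feature 0 determines the label). The data, the threshold of 0.1 and the `toy_` names are assumptions made for illustration:

```python
import numpy as np
from sklearn import ensemble

# Synthetic data: 200 samples, 10 features; the label depends on feature 0 only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_forest = ensemble.RandomForestClassifier(n_estimators=50, random_state=0)
toy_forest.fit(X_toy, y_toy)

# Importances are normalized: they add up to 1 over all features.
importances = toy_forest.feature_importances_
print("Sum of importances: %.3f" % importances.sum())

# Keep only the features whose importance reaches the threshold.
threshold = 0.1
selected = np.nonzero(importances >= threshold)[0]
X_toy_reduced = X_toy[:, selected]

print("Selected features:", selected)
print("Shape of the reduced matrix:", X_toy_reduced.shape)
```

Because the importances sum to 1, a sensible threshold depends on the number of features: with 3,000 genes, even a value as small as 0.005 is far above the average importance.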

Can we really use only these features? Let's retrain a random forest only on those features:

In [None]:
selected_features = np.nonzero(rf_cv.best_estimator_.feature_importances_ >= 0.005)[0]

In [None]:
X_train_reduced = X_train[:, selected_features]

In [None]:
X_train_reduced.shape

In [None]:
rf_reduced_cv = model_selection.GridSearchCV(ensemble.RandomForestClassifier(),  
                                     {'n_estimators': ntrees_values},
                                     cv=5, scoring='balanced_accuracy')

In [None]:
rf_reduced_cv.fit(X_train_reduced, y_train)

The optimal number of trees is:

In [None]:
rf_reduced_cv.best_params_

In [None]:
print("Optimal cross-validated balanced accuracy: %.3f" % rf_reduced_cv.best_score_)

The performance didn't really drop that much!

Let's see how it performs on the test set:

In [None]:
X_test_reduced = X_test[:, selected_features]

In [None]:
y_predicted = rf_reduced_cv.best_estimator_.predict(X_test_reduced)

In [None]:
print("Balanced accuracy of the random forest: %.3f" % metrics.balanced_accuracy_score(y_test, y_predicted))

In [None]:
metrics.plot_confusion_matrix(rf_reduced_cv.best_estimator_, X_test_reduced, y_test, 
                             cmap=plt.cm.Blues # use a blue color map
                             )

Our final model makes very few mistakes and uses very few genes!