# Introduction to Tree Learning

In this practical you will have your first real contact with supervised machine learning applied to real biological data.

Your task is to establish which biomarkers (features/attributes) influence the outcome. This exercise goes through the clinical biomarkers and looks at the data using decision trees and random forests. The author of the paper (see below) established that no real clinical markers could be found. Instead, he found some other biomarkers. The file

```
'clinical_biomarkers.csv'
```

which we use initially, contains the clinical biomarkers, and the file

```
'biomarkers.csv'
```

contains the informative ones.

Please go through the exercise/tutorial and make sure that you understand each step. In a second round, use the second file and look into the informative biomarkers. Which one is the most informative?



## Data origin

The data originates from the following publication:

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1292-2

(Scroll down to the section "Additional files"; Additional file 3 gives you the full list of raw data.)

For the purpose of the exercise, we transformed the data already.

Before downloading the data, here are some common imports:



In [None]:
import os
import sys
import pandas as pd
import numpy as np


import matplotlib.pyplot as plt # plotting and visualisation
import seaborn as sns # nicer (easier) visualisation
%matplotlib inline


In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

## Some required import for plotting a learnt tree graphically

Please execute this command only if you are sure that you have Graphviz installed. Please also ensure that you are using the right Python version for pip (in case you still have Python 2.7 installed).

To install graphviz (especially for Windows) have a look here:

https://graphviz.gitlab.io/download/

You most likely will have to set the Windows PATH variable. Something similar to this one:

```!set PATH=%PATH%;C:\Program Files (x86)\Graphviz2.38\bin```



In [None]:
#!set PATH=%PATH%;C:\Program Files (x86)\Graphviz2.38\bin # I could not test this

In [None]:
#!pip install graphviz --user # or similar

This assumes that Graphviz is installed.

In [None]:
# own mini-library
import session_helpers
import IPython.display


## Loading in the file and setting the first column to be the index

In [None]:
biomarkers_file_csv = 'clinical_biomarkers.csv'


df = pd.read_csv(biomarkers_file_csv)
df = df.set_index(['Sample'])


Please have a look at the loaded data. How many columns/attributes does it have?
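
One way to answer this is to inspect the shape and the first rows of the frame. The sketch below uses a small made-up frame (`toy_df`, with the hypothetical columns `marker_a` and `marker_b`) instead of the real file, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data (not the real biomarker values).
toy_df = pd.DataFrame(
    {
        "Sample": ["s1", "s2", "s3"],
        "Response": ["C.", "High", "Int. I."],
        "marker_a": [0.1, 0.7, 0.4],
        "marker_b": [1.2, 0.9, 1.5],
    }
).set_index("Sample")

n_rows, n_columns = toy_df.shape           # (number of samples, number of columns)
print(n_rows, n_columns)
print(toy_df.head())                       # first rows, including the Response column
print(toy_df["Response"].value_counts())   # number of samples per class
```

The same three calls (`shape`, `head()`, `value_counts()`) should work unchanged on `df`.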

### Mapping classes into positive and negative

The following maps all examples to either positive or negative. In the mapping below, the controls ('C.') are counted as negative. Labels that match no key in the mapping become NaN and are removed by `dropna()`.

In [None]:
df_ex = df.copy()
df_ex['Response'] = df_ex['Response'].map(
    {
     'C. R.':'negative',
     'C.':'negative',
     'Int. II. R.':'negative',
     'High R.':'negative',
     'Int. I.':'positive',
     'Int. II.':'positive',
     'High':'positive',
    })


df_ex = df_ex.dropna()
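
To see why `dropna()` is needed after the mapping, here is a minimal sketch of how `Series.map` treats labels that are missing from the dictionary. The label `'Unknown'` is made up for illustration and does not occur in the exercise.

```python
import pandas as pd

# Hypothetical labels; 'Unknown' has no entry in the mapping below.
labels = pd.Series(["High", "High R.", "Unknown"], name="Response")
mapped = labels.map({"High": "positive", "High R.": "negative"})

print(mapped.isna().sum())        # number of labels without a mapping
print(mapped.dropna().tolist())   # labels that survive dropna()
```

A misspelt key in the dictionary therefore silently removes all rows with that label, so it is worth comparing the number of rows before and after `dropna()`.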

## Plotting the values of all columns

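Here we use the `melt` function of pandas. It reshapes the table from wide format (one column per feature) into long format (one row per sample/feature pair), which makes it easy to draw one box per feature and class. Just click on Run and see.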

Are you able to spot an attribute or two, separating positive from negative?


In [None]:
plot_data_melt = pd.melt(df_ex,id_vars="Response",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(20,10))
sns.boxplot(x="features", y="value", hue="Response", data=plot_data_melt)
plt.xticks(rotation=90)
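
The reshaping done by `pd.melt` is easier to follow on a tiny table. The sketch below uses two made-up features (`marker_a`, `marker_b`); only the call itself mirrors the cell above.

```python
import pandas as pd

# Hypothetical wide table: one row per sample, one column per feature.
toy_wide = pd.DataFrame(
    {
        "Response": ["positive", "negative"],
        "marker_a": [0.1, 0.7],
        "marker_b": [1.2, 0.9],
    }
)

# Long table: one row per (sample, feature) pair.
toy_long = pd.melt(toy_wide, id_vars="Response",
                   var_name="features", value_name="value")
print(toy_long)
```

Two samples with two features each give four rows in the long table, with the columns `Response`, `features` and `value`.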

## First Decision Tree Model

You might or might not have been able to spot a pattern in the data in order to distinguish positive from negative examples. Here, we build a first decision tree to see what underlying pattern can be found. 

Before doing this, we split the data into data X and labels y.


In [None]:
y = df_ex['Response']
X = df_ex.drop(['Response'],axis=1)

## Train/Test Split

For an initial evaluation of the model, we use a simple train/test split.

In [None]:
from sklearn.model_selection import train_test_split
# simple train and test split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=15)
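
By default, `train_test_split` keeps 25% of the samples for testing. With few samples per class it can help to also pass `stratify`, so that the class proportions are similar in both parts. The sketch below shows this on made-up data (`X_toy`, `y_toy`); whether stratification is appropriate for the biomarker data is left for you to judge.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, two balanced classes.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(["positive"] * 10 + ["negative"] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=15, stratify=y_toy
)
print(len(X_tr), len(X_te))                  # sizes of the two parts
print(np.unique(y_te, return_counts=True))   # class counts in the test part
```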

### import the DecisionTreeClassifier

In [None]:
from sklearn.tree import DecisionTreeClassifier


### Training the classifier

In sklearn, we first have to set up the decision tree model and then train it using our training data. The `fit` method expects at least two inputs: the actual data and the labels.

In [None]:
dt_model = DecisionTreeClassifier(random_state=0)
dtree = dt_model.fit(X_train,y_train)

### Analysing the learnt tree

In [None]:
dtree

Now, this is a bit disappointing. You can use the model to predict, but the printout is not very informative. To overcome this, I have written a plotting function (hidden in the session_helpers import from the beginning).

### Plotting the Tree




Here we are going to plot the tree inside the model. This will only work when Graphviz and the Python module for graphviz are installed.

You should see something similar to the following:

![2 Class Tree](img/tree_2class.png)


In [None]:
# for visualisation:
image = session_helpers.plot_tree(dtree,X_test,y_test,rotate=False,max_depth=None)
IPython.display.Image(image)
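
If Graphviz is not available, scikit-learn's own `export_text` gives a plain-text view of a fitted tree. This is an alternative to `session_helpers.plot_tree`, not what that helper does internally. The sketch uses the Iris data bundled with scikit-learn as a stand-in for the biomarker data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()  # stand-in data set, ships with scikit-learn
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(iris.data, iris.target)

# One line per node: split conditions on inner nodes, predicted class on leaves.
tree_text = export_text(demo_tree, feature_names=list(iris.feature_names))
print(tree_text)
```

For the tree from this exercise, the equivalent call would presumably be `export_text(dtree, feature_names=list(X_train.columns))`.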

Play around with some of the settings of the decision tree as well as (if you like) with rotate and max_depth in the plotting command.

# A more realistic validation scenario - k-fold cross-validation

The single train/test split in the previous sections gave only a first glimpse of how well the tree performs. Here we use cross-validation to estimate the performance of the learning algorithm. To do this, we need some additional imports:

In [None]:
from sklearn.model_selection import LeaveOneOut, GridSearchCV, KFold
from sklearn.metrics import confusion_matrix

## Cross validation

As we do not want to split off the folds and merge everything back together ourselves, we use the predefined cross-validation function in sklearn.

Here, we use a simple 5-fold CV. Have a look at what other parameters are possible (this might involve searching the net!).

Within each of the folds, we print the confusion matrix. Can you change the code such that it calculates the accuracy on each test fold? Maybe even precision and recall?






In [None]:
kf = KFold(n_splits=5, random_state=15, shuffle=True)
count_k = 0
for train_index, test_index in kf.split(X):
    count_k += 1
    X_train = X.iloc[train_index]
    X_test  = X.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test  = y.iloc[test_index]
    dtree = dt_model.fit(X_train,y_train)
    y_test_predicted = dtree.predict(X_test)
    print('Confusion Matrix (k={})'.format(count_k))
    print(confusion_matrix(y_test,y_test_predicted))
    print()
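
As a hint for the question above, here is a minimal sketch of the three metrics on made-up labels (`y_true_toy`, `y_pred_toy`). Inside the loop you would pass `y_test` and `y_test_predicted` instead. Because the labels are strings, `pos_label` has to name the positive class explicitly.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true and predicted labels for one test fold.
y_true_toy = ["positive", "negative", "positive", "negative", "positive"]
y_pred_toy = ["positive", "negative", "negative", "negative", "positive"]

acc = accuracy_score(y_true_toy, y_pred_toy)                          # 4 of 5 correct
prec = precision_score(y_true_toy, y_pred_toy, pos_label="positive")  # TP / (TP + FP)
rec = recall_score(y_true_toy, y_pred_toy, pos_label="positive")      # TP / (TP + FN)
print(acc, prec, rec)
```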
    

# A more realistic setting

Actually, the data contained more than two classes. Here we map all 'R.' (Recovery) ones into the class negative and leave the rest as is. 

Furthermore, we perform the same kind of analysis as before.

In [None]:
df_ex = df.copy()
df_ex['Response'] = df_ex['Response'].map(
    {
     'C. R.':'negative',
     'Int. II. R.':'negative',
     'High R.':'negative',
     'C.':'C.',
     'Int. I.':'Int. I.',
     'Int. II.':'Int. II.',
     'High':'High',
    })
df_ex = df_ex.dropna()
y = df_ex['Response']
X = df_ex.drop(['Response'],axis=1)



### Plotting the data



In [None]:
plot_data_melt = pd.melt(df_ex,id_vars="Response",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(30,10))
sns.boxplot(x="features", y="value", hue="Response", data=plot_data_melt)
plt.xticks(rotation=90)

# Simple Train/Test - Decision Tree

Warning: there are now more than two classes! What does that mean later on?
In case Graphviz does not work in your setting, here is the tree I generated:

![5 Class Tree](img/tree_5class.png)



In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=1)
dtree = dt_model.fit(X_train,y_train)
# for visualisation:
image = session_helpers.plot_tree(dtree,X_test,y_test,rotate=False,max_depth=None)
IPython.display.Image(image)



### Cross Validation

Can you still calculate the accuracy?

In [None]:
kf = KFold(n_splits=5, random_state=15, shuffle=True)
count_k = 0
for train_index, test_index in kf.split(X):
    count_k += 1
    X_train = X.iloc[train_index]
    X_test  = X.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test  = y.iloc[test_index]
    dtree = dt_model.fit(X_train,y_train)
    y_test_predicted = dtree.predict(X_test)
    print('Confusion Matrix (k={})'.format(count_k))
    print(confusion_matrix(y_test,y_test_predicted))
    print()
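
Accuracy is still defined for more than two classes, but it can be dominated by the largest class. Precision and recall need an `average` argument in the multi-class case. The sketch below, on made-up labels, also shows that balanced accuracy is the same number as the macro-averaged recall.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical multi-class labels for one test fold.
y_true_mc = ["C.", "C.", "High", "negative", "negative", "negative"]
y_pred_mc = ["C.", "High", "High", "negative", "negative", "C."]

acc_mc = accuracy_score(y_true_mc, y_pred_mc)               # fraction of correct predictions
bal_acc = balanced_accuracy_score(y_true_mc, y_pred_mc)     # mean of the per-class recalls
macro_rec = recall_score(y_true_mc, y_pred_mc, average="macro")
print(acc_mc, bal_acc, macro_rec)
```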
    

## Grid Search

Your normal task would be to establish the best parameters for each of these folds. Python's sklearn offers an easy way to evaluate and test which parameter setting is best. This is called grid search. The idea is that you give a range of hyper-parameters, which are then tested in the inner loop.

Here we will only do the inner loop on a single train/test split. However, you should do this in a real cross-validation (outer loop). Furthermore, sklearn cannot easily deal with more than two classes when the grid search is scored by area under the curve. Hence, we will be using some form of accuracy. Here is a link to the possible scoring parameters: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter .

To get an idea of which options can be passed as parameters in the grid search, have a look at the decision tree method of sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In case you get a warning (red message with DeprecationWarning), please ignore.



In [None]:
parameters = {
    'criterion':('gini', 'entropy'), 
    'max_depth':[1,2,3,4],
    'min_samples_leaf':[2,5,10]
}

X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=15)

dt_grid_search = GridSearchCV(dt_model, parameters, cv=5,scoring='balanced_accuracy') # balanced accuracy = mean per-class recall, works for multi-class
grid_search = dt_grid_search.fit(X_train, y_train)
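
The outer loop mentioned above can be written compactly by handing the whole grid search to `cross_val_score`: the grid search then runs as the inner loop on each outer training fold. This is only a sketch on synthetic data from `make_classification` (the names `X_demo`, `y_demo`, `inner_search` and the smaller grid are assumptions for illustration), not the analysis of the biomarker files.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data as a stand-in for the real data.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Inner loop: hyper-parameter search on each outer training fold.
inner_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3, 4], "min_samples_leaf": [2, 5, 10]},
    cv=3,
    scoring="balanced_accuracy",
)

# Outer loop: performance estimate on data the search has never seen.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=15)
outer_scores = cross_val_score(
    inner_search, X_demo, y_demo, cv=outer_cv, scoring="balanced_accuracy"
)
print(outer_scores, outer_scores.mean())
```

The mean of `outer_scores` estimates the performance of the whole procedure (search plus refit), not of one particular parameter setting.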



Here is a list of what the grid search returns as information from the search

In [None]:
sorted(dt_grid_search.cv_results_.keys())


To find out what the best score was, we can simply store the best performance as a number:

In [None]:
best_result = max(dt_grid_search.cv_results_['mean_test_score'])
best_result


... and now look at which parameter setting achieved that score.

In [None]:
for parameter_setting, mean_test_score in zip(dt_grid_search.cv_results_['params'],dt_grid_search.cv_results_['mean_test_score']):
    if mean_test_score == best_result:
        print('-'*80)
        print('BEST RESULTS!!')
        print(parameter_setting, mean_test_score)
        print('-'*80)
    else:
        print(parameter_setting, mean_test_score)


### A better way 

A better way to find the best-performing decision tree is to ask for it directly. Once it is returned, we can use the `get_params()` method to establish what its hyper-parameters were:

In [None]:
best_tree_model = dt_grid_search.best_estimator_ # best model according to grid search 

best_tree_model.get_params()
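
`get_params()` lists every hyper-parameter of the tree, including the defaults. If you only want the settings that were actually searched, a fitted `GridSearchCV` also exposes `best_params_` and `best_score_`. A minimal sketch on the bundled Iris data (used here only as a stand-in):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)  # stand-in data set

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3], "min_samples_leaf": [2, 5]},
    cv=5,
    scoring="balanced_accuracy",
).fit(X_iris, y_iris)

print(demo_search.best_params_)  # only the searched hyper-parameters
print(demo_search.best_score_)   # mean cross-validated score of that setting
```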

### Predict the test set 

In [None]:
y_test_predicted = best_tree_model.predict(X_test)

print('Confusion Matrix of best model on test')
print(confusion_matrix(y_test,y_test_predicted))


### Feature importance

If you want to find out which attributes (features or biomarkers) are the most influential, you can use the tree's built-in information about this.

Please note that we use the `zip(A,B)` function of Python to produce a list of tuples from two lists of single values, i.e.
```python 
zip(['a1','a2','a3'],['b1','b2','b3'])
```

produces
```python 
[('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]
```
(Actually, if you want to print it, you have to wrap the `zip()` in a list: `list(zip( ... , ... ))`.)

Back to feature importance. Have a look at the most important features:

In [None]:
for feature_name,feature_importance in zip(X_test.columns.values,best_tree_model.feature_importances_):
    if feature_importance > 0.0:
        print('{:20s}:{:3.4f}'.format(feature_name,feature_importance))
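
The loop above prints the features in column order. An alternative that sorts them by importance is to wrap the values in a pandas Series. This is a sketch on the bundled Iris data, used as a stand-in; for the exercise you would pass `best_tree_model.feature_importances_` and `X_test.columns` instead.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()  # stand-in data set
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
demo_tree.fit(iris.data, iris.target)

# Label each importance with its feature name, keep the non-zero ones, sort.
importances = pd.Series(demo_tree.feature_importances_, index=iris.feature_names)
ranked = importances[importances > 0].sort_values(ascending=False)
print(ranked)
```

The importances of a fitted tree sum to 1, so each value can be read as a share of the total impurity reduction.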

## Random Forest Classifier

Let us repeat this exercise with random forests.

In [None]:
from sklearn.ensemble import RandomForestClassifier


### Grid search

In [None]:
parameters = {
    'n_estimators': [2,3,5], 
    'max_depth':[1,2,3,4],
    'min_samples_leaf':[2,5,10]
}

random_f_model = RandomForestClassifier() 
rf_grid_search = GridSearchCV(random_f_model, parameters, cv=5,scoring='balanced_accuracy') # balanced accuracy = mean per-class recall, works for multi-class
grid_search = rf_grid_search.fit(X_train, y_train)



### Best model

In [None]:
best_random_f_model = rf_grid_search.best_estimator_ # best model according to grid search 

best_random_f_model.get_params()

### Confusion Matrix 

In [None]:
y_test_predicted = best_random_f_model.predict(X_test)

print('Confusion Matrix of best model on test')
print(confusion_matrix(y_test,y_test_predicted))



### Most important biomarkers

In [None]:
for feature_name,feature_importance in zip(X_test.columns.values,best_random_f_model.feature_importances_):
    if feature_importance > 0.0:
        print('{:20s}:{:3.4f}'.format(feature_name,feature_importance))
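
For a random forest, the reported importance of a feature is an average over the individual trees, which are available as `estimators_`. The sketch below checks this on the bundled Iris data (a stand-in); it assumes that every tree in the forest has at least one split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_iris, y_iris = load_iris(return_X_y=True)  # stand-in data set
demo_forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
demo_forest.fit(X_iris, y_iris)

# Importances of each single tree, one row per tree.
per_tree = np.array([tree.feature_importances_ for tree in demo_forest.estimators_])
mean_over_trees = per_tree.mean(axis=0)

print(mean_over_trees)
print(demo_forest.feature_importances_)
```

With only 2 to 5 trees, as in the grid above, these averages can change noticeably between runs unless `random_state` is fixed.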

## For the informative biomarkers

Just exchange the two filenames:
        

In [None]:
#biomarkers_file_csv = 'clinical_biomarkers.csv'
biomarkers_file_csv = 'biomarkers.csv'



Now you can re-run the complete exercise or just concentrate on the essentials:

In [None]:
df_bio = pd.read_csv(biomarkers_file_csv)
df_bio = df_bio.set_index(['Sample'])

df_bio_ex = df_bio.copy()

df_bio_ex['Response'] = df_bio_ex['Response'].map(
    {
     'C. R.':'negative',
     'Int. II. R.':'negative',
     'High R.':'negative',
     'C.':'C.',
     'Int. I.':'Int. I.',
     'Int. II.':'Int. II.',
     'High':'High',
    })
df_bio_ex = df_bio_ex.dropna()

y = df_bio_ex['Response']
X = df_bio_ex.drop(['Response'],axis=1)


In [None]:
plot_data_melt_bio = pd.melt(df_bio_ex,id_vars="Response",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(30,10))
sns.boxplot(x="features", y="value", hue="Response", data=plot_data_melt_bio)
plt.xticks(rotation=90)


In [None]:
parameters = {
    'n_estimators': [2,3,5], 
    'max_depth':[1,2,3,4],
    'min_samples_leaf':[2,5,10]
}


random_f_model = RandomForestClassifier() 



kf1 = KFold(n_splits=5, random_state=15, shuffle=True)

count_k = 0
for train_index, test_index in kf1.split(X):
    count_k += 1
    # set up train and test  data
    X_train = X.iloc[train_index]
    X_test  = X.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test  = y.iloc[test_index]
    
    
    # set up grid search
    rf_grid_search = GridSearchCV(random_f_model, parameters, cv=5,scoring='balanced_accuracy') 
    grid_search_result = rf_grid_search.fit(X_train, y_train)

    # get best model
    best_rf_model = rf_grid_search.best_estimator_ # best model according to grid search 
    
    # print the 'best' model's parameters
    best_rf_model_parameters = best_rf_model.get_params()
    print('k = {}'.format(count_k))
    for parameter in parameters:
        print('{:30}\t{}'.format(parameter,best_rf_model_parameters[parameter]))
    print()

    # print confusion matrix
    y_test_predicted = best_rf_model.predict(X_test)

    print('Confusion Matrix on test')
    print(confusion_matrix(y_test,y_test_predicted))     
    print()
    
    
    # get feature importance
    feature_importances = list(zip(best_rf_model.feature_importances_,X_test.columns.values))
    feature_importances.sort(reverse=True)
    
    # only print the top 5 (please adapt)
    for feature_importance,feature_name in feature_importances[:5]:
        print('{:20s}:{:3.4f}'.format(feature_name,feature_importance))


    print('-'*60)
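
The top-5 lists can differ from fold to fold. One possible way to answer "which biomarker is the most informative?" is to collect the importances of every fold in one table and average them. The sketch below does this on the bundled Iris data with a fixed forest instead of a grid search; both are simplifications for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

iris = load_iris()  # stand-in data set
X_demo = pd.DataFrame(iris.data, columns=iris.feature_names)
y_demo = pd.Series(iris.target)

fold_importances = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=15).split(X_demo):
    forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
    forest.fit(X_demo.iloc[train_idx], y_demo.iloc[train_idx])
    fold_importances.append(forest.feature_importances_)

# One row per fold, one column per feature; average over the folds and sort.
summary = pd.DataFrame(fold_importances, columns=X_demo.columns)
print(summary.mean().sort_values(ascending=False))
```

In the loop above, the same idea would mean appending `best_rf_model.feature_importances_` to a list in each fold.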
    
