# Part 1: Introduction to Decision Trees

Let's import the packages that we will use during the practical:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### The dataset

The dataset is available in the `data/` directory, but it can also be downloaded from [here](https://archive.ics.uci.edu/ml/datasets/bank+marketing). It consists of data from the marketing campaigns of a Portuguese bank. We will try to build a classifier that predicts whether or not the client targeted by the campaign ended up subscribing to a term deposit (column `y`).

Load the file `data/bank-marketing.csv` with `pandas` and check the distribution of the target `y`. Note that the separator here is `';'` rather than a comma.

Save the DataFrame as `df`.

In [None]:
# Your code here...
df = pd.read_csv("data/bank-marketing.csv", sep=";")
df['y'].value_counts()


The dataset is imbalanced, so we will need to keep that in mind when building our models!
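To put a number on the imbalance, it can help to look at class proportions rather than raw counts. The snippet below is a minimal sketch on a hypothetical toy target (`toy_y` is made up for illustration and is not the bank data); the same call should work on `df['y']`.

```python
import pandas as pd

# Hypothetical toy target: 9 "no" for every 1 "yes" (not the real bank data).
toy_y = pd.Series(["no"] * 90 + ["yes"] * 10, name="y")

# normalize=True returns proportions instead of raw counts.
class_share = toy_y.value_counts(normalize=True)
print(class_share)
```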

Now split the data into the feature matrix `X` (all features except `y`) and the target vector `y`, making sure that you convert `yes` to `1` and `no` to `0`.

In [None]:
# Get X, y
# Your code here...
y = df["y"].map({"no":0, "yes":1})
X = df.drop("y", axis=1)


Here is the list of features in our `X` matrix:

| Feature | Description | Type and values |
| --- | --- | --- |
age | | numeric 
job | type of job | categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown'
marital | marital status | categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed
education | | categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown'
default | has credit in default? | categorical: 'no','yes','unknown'
housing | has housing loan? | categorical: 'no','yes','unknown'
loan | has personal loan? | categorical: 'no','yes','unknown'
contact | contact communication type | categorical: 'cellular','telephone'
month | last contact month of year | categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec'
day_of_week | last contact day of the week | categorical: 'mon','tue','wed','thu','fri'
duration | last contact duration, in seconds | numeric. Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
campaign | number of contacts performed during this campaign and for this client | numeric, includes last contact
pdays | number of days that passed by after the client was last contacted from a previous campaign | numeric; 999 means client was not previously contacted
previous | number of contacts performed before this campaign and for this client | numeric
poutcome | outcome of the previous marketing campaign | categorical: 'failure','nonexistent','success'
emp.var.rate | employment variation rate - quarterly indicator | numeric
cons.price.idx | consumer price index - monthly indicator | numeric
cons.conf.idx | consumer confidence index - monthly indicator | numeric
euribor3m | euribor 3 month rate - daily indicator | numeric 
nr.employed | number of employees - quarterly indicator | numeric

Note the comment about the `duration` feature. We will exclude it from our analysis.

Drop `duration` from `X`:

In [None]:
# Your code here...
X.drop("duration", inplace=True, axis=1)


Now we can check the types of all our features. We see that some seem to be categorical whilst others are numerical. We will keep two lists, one for each type, so we can preprocess them differently.

In [None]:
X.dtypes

In [None]:
# when there is a third class "unknown", we'll process the feature as non-binary categorical
num_features = ["age", "campaign", "pdays", "previous", "emp.var.rate", 
                "cons.price.idx", "cons.conf.idx","euribor3m", "nr.employed"]

cat_features = ["job", "marital", "education","default", "housing", "loan",
                "contact", "month", "day_of_week", "poutcome"]

### Visualise the numerical features

Using `seaborn`, show a boxplot of the numerical features.

In [None]:
# Your code here...
plt.figure(figsize=(20, 10))
sns.boxplot(data=X[num_features], ax=plt.gca())
plt.show()


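The features are not on the same scale. That is fine for tree-based methods, as we said in the lesson: each split compares a single feature against a threshold, so rescaling a feature does not change which splits are possible. We do not need to do any scaling here!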
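As a quick sanity check of the scaling claim, here is a small sketch on synthetic data (an assumption: `make_classification` stands in for the bank dataset, and the names `X_toy`, `scale`, `tree_raw` and `tree_scaled` are illustrative only). It fits the same shallow tree on the raw and on the rescaled features and compares the predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the bank dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=4, random_state=0)

# Rescale each column by a different positive factor.
# Powers of two keep the floating-point arithmetic exact.
scale = np.array([1.0, 1024.0, 8.0, 64.0])
X_scaled = X_toy * scale

# Same hyperparameters and seed, different feature scales.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_toy)

# Compare the predictions of both trees on their respective inputs.
same_predictions = np.array_equal(
    tree_raw.predict(X_toy), tree_scaled.predict(X_scaled)
)
print(same_predictions)
```

Only the split thresholds move with the scale; the partition of the samples, and therefore the predictions, should stay the same.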

### One-hot encoding on categorical features

The `sklearn` implementation of decision trees cannot work directly with categorical features, so we need to make sure our dataset contains only numbers. Consequently, we will need to transform our categorical features into one-hot encoded features.
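To make the idea of one-hot encoding concrete, here is a minimal sketch on a hypothetical three-row frame (`toy` is made up for illustration; it only borrows two column names from the feature table above).

```python
import pandas as pd

# Hypothetical toy frame reusing two column names from the feature table.
toy = pd.DataFrame({
    "contact": ["cellular", "telephone", "cellular"],
    "poutcome": ["failure", "success", "nonexistent"],
})

# Each category becomes its own indicator column named <feature>_<category>.
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Every row has exactly one active indicator per original feature, which is what lets the tree split on individual categories.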

To do so, use `pd.get_dummies` on our DataFrame (select only the categorical features - we already have them stored in the variable `cat_features`) to generate the new columns.

Assign the new DataFrame to a variable `X_categorical`.

In [None]:
# Your code here...
X_categorical = pd.get_dummies(X[cat_features])


Create a DataFrame that combines our numerical features from `X` (we have their names stored in the variable `num_features`) with the `X_categorical` DataFrame.

Use `pd.concat` (making sure to specify the correct axis!) and call the new DataFrame `X_processed`.

In [None]:
X_processed = pd.concat([X[num_features], X_categorical], axis=1)


### Split the data into training and test sets

Split the data (use `X_processed`) into a training set and test set. Here we are dealing with an imbalanced dataset, so it is important to enforce stratification. We will use the argument `stratify` from `train_test_split` to do so (check the documentation).

Call the new variables `X_train`, `X_test`, `y_train`, and `y_test`.
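The effect of `stratify` is easiest to see on a small example. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`, with 10% positives) rather than `X_processed`, and checks that the positive rate is roughly preserved in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 10% positives.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratify, both splits keep a positive rate close to the original 10%.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without `stratify`, a small test set could by chance end up with very few positives, which would make the test metrics for class `1` unreliable.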

In [None]:
# Your code here...
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_processed,
    y,
    test_size=.3,
    random_state=42,
    stratify=y
)


## Train a decision tree

Now that we have done our preprocessing and our data is ready, we can train a decision tree. We will use `DecisionTreeClassifier` from `sklearn.tree`.

For now we will keep our tree unconstrained with:
- `max_depth=None`
- `min_samples_split=2`

Create a new decision tree, assigning it to the variable `dtc`.

In [None]:
# Your code here...
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(max_depth=None, min_samples_split=2)


Now fit the model on the training set:

In [None]:
# Your code here...
dtc.fit(X_train, y_train)


Execute the cell below to display the tree in the notebook. What do you observe?

In [None]:
from sklearn import tree

plt.figure(figsize=(200, 20))
tree.plot_tree(dtc, 
               filled=True, 
               rounded=True,
               max_depth=6,
               proportion=True,
               fontsize=10,
               feature_names=list(X_train.columns))  # a plain list works across sklearn versions
plt.show()

Compute the accuracy of the model on the training data and then on the test data. What can you conclude?

In [None]:
# Your code here...
from sklearn.metrics import accuracy_score

print(accuracy_score(y_train, dtc.predict(X_train)))
print(accuracy_score(y_test, dtc.predict(X_test)))


Now let's investigate a bit more by looking at the `classification_report` (you can import it from `sklearn.metrics`) for our test set. That will provide us with more information about precision and recall on both our classes.

In [None]:
# Your code here...
from sklearn.metrics import classification_report

print(classification_report(y_test, dtc.predict(X_test)))


It looks like our model predicts the majority class `0` (no) really well, which leads to a high accuracy, but it is really bad at predicting class `1`, which corresponds to successful campaigns and is the class of interest here!
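
To see why accuracy alone can be misleading on imbalanced data, consider a hypothetical "model" that always predicts class `0`. The sketch below uses made-up labels (`y_true`, `y_pred`), not the notebook's test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 90 negatives and 10 positives.
y_true = np.array([0] * 90 + [1] * 10)

# A trivial baseline that always predicts the majority class.
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred, pos_label=1)
print("accuracy:", baseline_accuracy)
print("recall on class 1:", baseline_recall)
```

The baseline is right on every negative and misses every positive, so accuracy looks high while recall on class `1` is zero.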

# Part 2: Parameter Tuning and Feature Importance

## Parameter tuning

We've found two major issues with our model so far:

- It greatly overfits
- It focuses on the majority class

With our decision tree, we can address both. 

- For the first issue we will need to tune `max_depth` and `min_samples_split`. 
- For the second issue, we will set `class_weight='balanced'` so that it automatically gives more weight to our minority class as a way to compensate.
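
As a rough illustration of what `class_weight='balanced'` does, the weights can be computed explicitly with `compute_class_weight` from `sklearn.utils.class_weight`. The sketch below uses a hypothetical toy target (`y_toy`), not the bank data; the "balanced" heuristic weights each class by `n_samples / (n_classes * count_of_that_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy target: 90 negatives and 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)

# "balanced" weights: n_samples / (n_classes * np.bincount(y)).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)
print(dict(zip([0, 1], weights)))
```

The minority class receives the larger weight, so misclassifying one of its samples costs more when the tree evaluates a split.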

### Exploration of different parameters

Let's use more constraining values for `max_depth` and `min_samples_split`, say `6` and `20` respectively.

To change the parameters of the existing tree classifier `dtc`, you can use `set_params` on it with the name and values you want to update (for example `max_depth=6`).

Don't forget to re-train the tree after changing the parameters.
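
The following sketch shows how `set_params` behaves on a fresh, hypothetical estimator (`clf` is an illustrative name, separate from `dtc`): it updates the hyperparameters and returns the estimator itself, but it does not fit anything.

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

# set_params updates hyperparameters in place and returns the estimator.
returned = clf.set_params(max_depth=6, min_samples_split=20)

params = clf.get_params()
print(params["max_depth"], params["min_samples_split"])

# No tree has been built yet: the fitted attribute tree_ only appears after fit().
print(hasattr(clf, "tree_"))
```

Because the estimator is returned, the update and the re-training can also be chained as `clf.set_params(...).fit(X, y)`.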

In [None]:
# Your code here...
dtc.set_params(max_depth=6, min_samples_split=20)
dtc.fit(X_train, y_train)


Let's check the accuracy on both the train and the test set. Is it better than before?

In [None]:
# Your code here...
from sklearn.metrics import accuracy_score

print(accuracy_score(y_train, dtc.predict(X_train)))
print(accuracy_score(y_test, dtc.predict(X_test)))


We can also visualise our tree:

In [None]:
plt.figure(figsize=(120, 12))
tree.plot_tree(dtc, 
               filled=True, 
               rounded=True,
               max_depth=6,
               proportion=True,
               fontsize=10,
               feature_names=list(X_train.columns))  # a plain list works across sklearn versions
plt.show()

That's a simpler tree!

Let's have a look at the classification report now for the test set:

In [None]:
# Your code here...
# dtc was already re-trained above, so we can evaluate it directly
print(classification_report(y_test, dtc.predict(X_test)))


It is still doing really badly on class `1`. Try to set the parameter `class_weight` to `"balanced"` and retrain the tree:

In [None]:
# Your code here...
dtc.set_params(class_weight="balanced")
dtc.fit(X_train, y_train)


Check the classification report again:

In [None]:
# Your code here...
print(classification_report(y_test, dtc.predict(X_test)))


That's much better!

### Use grid search to find the optimal parameters

Now that we've observed the impact of various parameters, we can do a grid search to find the optimal ones.

Define a new `parameters` dictionary that contains all the values you want to try for `max_depth` and `min_samples_split`.

Then define a new `GridSearchCV` object and find the best parameters.

When searching for the best parameters, we typically select the ones which give the best results on the validation set, which is distinct from the training and test sets. `GridSearchCV` includes cross-validation, so we can pass it the training data directly. As part of cross-validation, the original training data will be repeatedly split into various training and validation sets.
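
For orientation, here is a minimal sketch of a cross-validated grid search on synthetic data (an assumption: `make_classification` stands in for the training set, and `param_grid`, `search` and `results` are illustrative names). It also shows how the per-combination scores can be inspected through `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (not the bank dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Every combination of these values is evaluated: 2 x 2 = 4 candidates.
param_grid = {"max_depth": [2, 4], "min_samples_split": [5, 20]}

# cv=5: each candidate is trained and validated on 5 different splits.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_toy, y_toy)

# One row per candidate, with its mean validation score and rank.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
print(search.best_params_)
```

By default `GridSearchCV` ranks candidates with the estimator's own `score` method, which is accuracy for classifiers; a different metric can be requested through the `scoring` argument.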

In [None]:
# Your code here...
from sklearn.model_selection import GridSearchCV

parameters = [{"max_depth": [3, 4, 7], "min_samples_split": [5, 10, 20]}]

gridCV = GridSearchCV(dtc, parameters, cv=10)

gridCV.fit(X_train, y_train)


What are your best parameters?

In [None]:
# Your code here...
gridCV.best_params_


Now we can re-train our model using these parameters. Set the parameters of the tree to be the best ones given by the grid search, and train the model again:

In [None]:
# Your code here...
dtc.set_params(**gridCV.best_params_)
dtc.fit(X_train, y_train)


Display the final tree:

In [None]:
plt.figure(figsize=(40, 8))
tree.plot_tree(dtc, 
               filled=True, 
               rounded=True,
               max_depth=6,
               proportion=True,
               fontsize=10,
               feature_names=list(X_train.columns))  # a plain list works across sklearn versions
plt.show()

Compute its accuracy on the train and test sets:

In [None]:
# Your code here...
from sklearn.metrics import accuracy_score

print(accuracy_score(y_train, dtc.predict(X_train)))
print(accuracy_score(y_test, dtc.predict(X_test)))


Finally check the classification report for the test set:

In [None]:
# Your code here...
print(classification_report(y_test, dtc.predict(X_test)))


## Feature importance

Decision trees have the advantage of providing a feature importance, a score allowing you to rank all features by their importance for the model when predicting the outcome. With `sklearn`, you can access it with the attribute called `feature_importances_`.
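
As a small illustration on synthetic data (an assumption: `make_classification` and the column names `f0` to `f4` are made up for this sketch), the importances of a fitted tree are non-negative and normalised to sum to 1, and features that are never used in a split get an importance of 0.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 5 features, only 2 of which carry signal.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
names = [f"f{i}" for i in range(5)]  # hypothetical column names

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

# Pair each importance with its feature name for readability.
importances = pd.Series(clf.feature_importances_, index=names)
print(importances.sort_values(ascending=False))
print("sum of importances:", importances.sum())
```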

Take a look at the `feature_importances_` attribute:

In [None]:
# Your code here...
dtc.feature_importances_


That's hard to read. The array gives a number for each column in our training set, in the same order. A better way to visualise it would be to put it in a table, so let's do that.

Create a new DataFrame where the data will be the feature importances from above, and the index will be the list of columns from our training data. Call this DataFrame `importances_df`.

In [None]:
# Your code here...
importances_df = pd.DataFrame(
    dtc.feature_importances_,
    columns=["importance"],
    index=X_train.columns
)
importances_df.sort_values("importance", ascending=False).head()


Plot it as a bar plot:

In [None]:
# Your code here...
importances_df.sort_values("importance", ascending=False).plot(kind="bar", figsize=(20,7))
plt.show()


What's the most important feature?