# Classification and Regression Trees

In this notebook you will learn how to build classification and regression trees for prediction, and how to evaluate the performance of these models. Most of the content is based on the examples from the textbook.

> (c) 2019 Galit Shmueli, Peter C. Bruce, Peter Gedeck 
>
> Code included in
>
> _Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python_ (First Edition) 
> Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel. 2019.

Let's start by importing all the required libraries:

In [None]:
import os
import io
import math
import pydotplus

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error
from sklearn import tree
from IPython.display import Image

%matplotlib inline

## Dataset

The very first thing we will do is to load the dataset. If you look at our folder structure, we have a folder called `data` inside the `lab3` folder. This is the place where we will store our datasets. Up to now we loaded data from the same folder as the notebook or from the `dmba` library. But you can tell Python the location of your dataset and load the data directly from there. Now we will see how it works:

In [None]:
# Get your current working directory
CWD = os.getcwd()
CWD

In [None]:
# Full path of the dataset
DATA_DIR = 'data'
filename = 'UniversalBank.csv'
FILE_PATH = os.path.join(CWD, DATA_DIR, filename)
FILE_PATH
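
As a side note, `os.path.join` only builds a path string using the separator of your operating system; it does not check that the file is there. A minimal sketch with hypothetical folder and file names (they are not part of this lab):

```python
import os

# Hypothetical folder and file names, for illustration only
example_path = os.path.join('some_folder', 'data', 'example.csv')
print(example_path)

# os.path.exists tells you whether the path points to something real
print(os.path.exists(example_path))
```

If `pd.read_csv` complains that it cannot find a file, printing the path and checking it with `os.path.exists` is a quick first diagnostic.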

And we can finally load the data:

In [None]:
df = pd.read_csv(FILE_PATH)

Now let's have a quick look at this dataset. You can use the commands you learned in the previous lab:

In [None]:
# TODO: how do we print the first lines?


In [None]:
# TODO: what is the dimension of the data, i.e. number of rows and columns


In [None]:
# TODO: what are the data types?


In [None]:
# TODO: what are the statistics for the numerical variables?


We have a dataset that doesn't need any cleaning, so we can go ahead and train a model directly. But bear in mind that this almost never happens in real life. For most real-life problems it takes quite some effort to prepare the data before it can be used to train models: these are the pre-processing steps of the data mining process.


## Classification trees

Now let's build our first classification model. We want to train a model to recognize when someone is likely to take a personal loan. The very first thing we need to do is to split our data into predictors and outcome. We know that our outcome is the personal loan column. Next, we need to decide which columns we want to use as predictors. Normally we would now do some dimension reduction, for example through feature selection. But to keep things simple and concentrate on how to build a decision tree, we will simply remove the ID and ZIP code columns and keep the rest:

In [None]:
X = df.drop(columns=['ID', 'ZIP Code', 'Personal Loan']) # features/predictors
y = df['Personal Loan'] # outcome

### Data partition

The next step is to divide our data into training and validation sets. As you should remember from the lectures, this is important because we will use the train set to train the model and the validation set to evaluate model performance.

This is very easy to do with the scikit-learn library, which provides a lot of functionalities for data mining:

In [None]:
# Splitting data into train/validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)

Now a few questions to see whether you understood what happened above:

1. What is the name of the function used to split the data?
2. What is test_size?
3. What is random_state?
4. Why do we have these 4 variables separated with commas?
5. What would have happened if we had assigned the outcome of this function to one variable?

TIP: have a look at the docstring of the function as we learned last time (shift + tab).


### Choosing technique

We will use the DecisionTreeClassifier functionality of scikit-learn as our decision tree algorithm:

In [None]:
classTree = DecisionTreeClassifier()

### Applying algorithm & interpreting results

And we will train a model with the train data:

In [None]:
# TODO: what do we need to give as input for fit?
# Replace <input1> and <input2> for your answer
classTree.fit(<input1>, <input2>)

Now we want to visualize the results. Before we do, we will write a function for it, since we will be repeating the tree visualization a few times:

In [None]:
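def tree_visualization(decisionTree, feature_names=None, class_names=None):
    # plot_tree in some scikit-learn versions rejects a pandas Index, so pass a plain list
    if feature_names is not None:
        feature_names = list(feature_names)

    plt.figure(figsize=(60, 30))
    tree.plot_tree(decisionTree, filled=True, rounded=True, impurity=False,
                   feature_names=feature_names, class_names=class_names, label='root')
    plt.show()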

Now let's call the function to visualize the tree:

In [None]:
tree_visualization(classTree, feature_names=X_train.columns)
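
A fully grown tree can be hard to read as a picture. As a possible alternative, scikit-learn can also print the splitting rules as plain text with `export_text`. The sketch below is not part of the textbook example: it uses the built-in iris dataset and a deliberately shallow tree (`max_depth=2`), both chosen here only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Built-in toy dataset, used only for illustration (not the bank data)
iris = load_iris()
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=1)
demo_tree.fit(iris.data, iris.target)

# One line per split or leaf, indented by depth
text_repr = export_text(demo_tree, feature_names=list(iris.feature_names))
print(text_repr)
```

The same call should work for the trees in this notebook if you pass `list(X_train.columns)` as `feature_names`.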

Finally, we want to know the performance of our model. We can get the confusion matrix and accuracy by calling the following scikit-learn functions, but we will also be repeating this process a few times today. So can you make a function out of it?

In [None]:
# Performance with training data
y_pred = classTree.predict(X_train)
cm = confusion_matrix(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
settype = 'Train'

sns.heatmap(cm, annot=True, fmt="d", cbar=False)
plt.title('%s accuracy = %f' % (settype, accuracy))
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.show()
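
To read the heatmap it helps to know how scikit-learn lays out the confusion matrix: rows are the actual classes and columns are the predicted classes. A minimal sketch with made-up labels (not the bank data) to illustrate this:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (1 = took the loan, 0 = did not)
y_actual = [0, 0, 0, 0, 1, 1, 1, 0]
y_predicted = [0, 0, 1, 0, 1, 0, 1, 0]

cm_toy = confusion_matrix(y_actual, y_predicted)
# Rows are actual classes, columns are predicted classes:
# cm_toy[0, 0] = true negatives,  cm_toy[0, 1] = false positives
# cm_toy[1, 0] = false negatives, cm_toy[1, 1] = true positives
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)

# Accuracy is the share of correct predictions: (TN + TP) / total
acc_toy = accuracy_score(y_actual, y_predicted)
print('Accuracy:', acc_toy)
```

Here 6 of the 8 made-up records are classified correctly, so the accuracy is 0.75.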

In [None]:
# TODO: write a function to check model performance using the code from the cell above


In [None]:
# TODO: now get the performance for the validation data


Is this a good model? We discussed that during the video lecture and the Q&A session.

One way of testing whether a model is good is by running a sensitivity test, which we can do using k-fold cross-validation:

In [None]:
# Five-fold cross-validation of the full decision tree classifier
treeClassifier = DecisionTreeClassifier()
scores = cross_val_score(treeClassifier, X_train, y_train, cv=5)

And let's have a look at the accuracy of each fold and their average:

In [None]:
print('Accuracy scores of each fold:', np.round(scores, 3))
print('Accuracy average:', np.round(scores.mean(), 3))
print('Accuracy standard deviation:', np.round(scores.std(), 4))

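Is the accuracy stable? It seems to be: every fold scores very high, although the highest accuracy (0.992) is still noticeably above the lowest (0.972). Your exact numbers may differ slightly between runs, because the tree is built without a fixed `random_state`.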
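
For intuition, here is roughly what `cross_val_score` does with `cv=5`. This is a sketch under assumptions, not the library's implementation: it uses synthetic data from `make_classification`, fixes `random_state` so the comparison is reproducible, and relies on the documented behaviour that an integer `cv` with a classifier uses stratified folds.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to illustrate the mechanics
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=1)
model = DecisionTreeClassifier(random_state=1)

# Manual loop: train on four folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The library call should give the same five accuracies
library_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
```

Each record is used for scoring exactly once, which is why the spread of the five scores says something about how sensitive the model is to the particular training sample.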

To build a reliable model we should avoid overfitting, which is what is happening with the model above. One way to do so is by limiting tree growth. As we learned in the video lecture, a common way to stop tree growth is to set thresholds for the maximum tree depth, the minimum number of samples a node needs in order to be split, and the minimum impurity decrease a split must achieve. Now let's see how to do it:

In [None]:
smallClassTree = DecisionTreeClassifier(max_depth=30, min_samples_split=20, min_impurity_decrease=0.01)
smallClassTree.fit(X_train, y_train)

tree_visualization(smallClassTree, feature_names=X_train.columns)

Now we have a tree that is much easier to read. We can explain this model to anyone. And what is the accuracy now?

In [None]:
# TODO: use the function you wrote above to show the performance for both train and validation sets


As you can see, the accuracy estimated with the validation set is as good as the accuracies obtained with a fully grown tree and similar to the one from the train set. But in this case we have a much simpler model that is easy to explain and much more likely to perform well with new data, as it is not modelling all the small variations in the train data (i.e. noise).
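
The difference in size between a fully grown and a restricted tree can also be checked numerically with `get_depth()` and `get_n_leaves()`. The sketch below uses synthetic, deliberately noisy data from `make_classification` (an assumption made for illustration, not the bank data) and the same thresholds as above:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% randomly flipped labels to mimic noise
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     flip_y=0.1, random_state=1)

full_tree = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)
small_tree = DecisionTreeClassifier(max_depth=30, min_samples_split=20,
                                    min_impurity_decrease=0.01,
                                    random_state=1).fit(X_demo, y_demo)

for name, model in [('Full tree', full_tree), ('Small tree', small_tree)]:
    print(name, '- depth:', model.get_depth(),
          '- leaves:', model.get_n_leaves(),
          '- train accuracy:', round(model.score(X_demo, y_demo), 3))
```

The full tree keeps splitting until it reproduces the training labels, noise included, while the restricted tree stops as soon as a split no longer pays off.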

Now we know how to build a good classifier, preventing overfitting by limiting the tree growth. Moreover, we should be able to explain such a model to anyone. The problem here is that we had to choose thresholds for the tree depth, the number of samples a node needs in order to be split, and the reduction in impurity. We had some numbers above, but they will not be good for every problem. So how can we choose them in an unbiased way? Well... if you remember the lecture, we saw that we can perform a grid search. A good way of performing a grid search is by starting with an initial guess:

In [None]:
# Start with an initial guess for parameters
param_grid = {
    'max_depth': [10, 20, 30, 40], 
    'min_samples_split': [20, 40, 60, 80, 100], 
    'min_impurity_decrease': [0.0009, 0.001, 0.005, 0.01], 
}
gridSearch = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, n_jobs=-1)
gridSearch.fit(X_train, y_train)
print('Initial score: ', gridSearch.best_score_)
print('Initial parameters: ', gridSearch.best_params_)

And subsequently using the results to refine our search:

In [None]:
# Adapt grid based on result from initial grid search
param_grid = {
    'max_depth': list(range(2, 16)), 
    'min_samples_split': list(range(10, 25)), 
    'min_impurity_decrease': [0.0009, 0.001, 0.0011], 
}
gridSearch = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, n_jobs=-1)
gridSearch.fit(X_train, y_train)
print('Improved score: ', gridSearch.best_score_)
print('Improved parameters: ', gridSearch.best_params_)
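
Keep in mind that the cost of a grid search grows multiplicatively with the number of values per hyperparameter. A small sketch that counts the combinations in the two grids used above with scikit-learn's `ParameterGrid` (the variable names are chosen here for illustration):

```python
from sklearn.model_selection import ParameterGrid

initial_grid = {
    'max_depth': [10, 20, 30, 40],
    'min_samples_split': [20, 40, 60, 80, 100],
    'min_impurity_decrease': [0.0009, 0.001, 0.005, 0.01],
}
refined_grid = {
    'max_depth': list(range(2, 16)),
    'min_samples_split': list(range(10, 25)),
    'min_impurity_decrease': [0.0009, 0.001, 0.0011],
}

n_folds = 5
for name, grid in [('Initial', initial_grid), ('Refined', refined_grid)]:
    n_combinations = len(ParameterGrid(grid))
    # Every combination is fitted once per cross-validation fold
    print(name, 'grid:', n_combinations, 'combinations,',
          n_combinations * n_folds, 'model fits')

n_initial = len(ParameterGrid(initial_grid))
n_refined = len(ParameterGrid(refined_grid))
```

The initial grid has 4 x 5 x 4 = 80 combinations and the refined one 14 x 15 x 3 = 630, so with five folds the second search fits 3150 trees. This is why it pays to start coarse and refine only around promising values.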

Now let's pick the best result and see what happened:

In [None]:
bestClassTree = gridSearch.best_estimator_
tree_visualization(bestClassTree, feature_names=X_train.columns)

And what is the accuracy?

In [None]:
# TODO: use the function you wrote above to show the performance for both train and validation sets


By running a grid search we managed to improve the accuracy while maintaining interpretability and removing bias from our choice of hyperparameters.


## Homework: regression trees

Let's now move to regression trees. They work very similarly to classification trees, but let's see the differences.

We start, as always, by loading the data:

In [None]:
filename = 'ToyotaCorolla.csv'
FILE_PATH = os.path.join(CWD, DATA_DIR, filename)

df = pd.read_csv(FILE_PATH).iloc[:1000,:]
df = df.rename(columns={'Age_08_04': 'Age', 'Quarterly_Tax': 'Tax'})
df.head(3)

Then, we define predictors and outcome:

In [None]:
# TODO: why are we choosing these? 
# You could try and have a look at the data to see whether you agree, 
# perhaps you even come up with a better model? ;-)
predictors = ['Age', 'KM', 'Fuel_Type', 'HP', 'Met_Color', 
              'Automatic', 'CC', 'Doors', 'Tax', 'Weight']
outcome = 'Price'

In this case we have some categorical variables, meaning that we need to create dummy variables from them:

In [None]:
X = pd.get_dummies(df[predictors], drop_first=True)
y = df[outcome]
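
To see what `pd.get_dummies` with `drop_first=True` does, here is a minimal sketch on a made-up table (the values are invented for illustration and do not come from the Toyota data):

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Fuel_Type': ['Diesel', 'Petrol', 'CNG', 'Petrol'],
    'HP': [90, 110, 110, 97],
})

# Numerical columns are kept as they are; each category becomes its own column.
# drop_first=True drops the first category (alphabetically, here 'CNG'),
# since it is implied when all other dummy columns are 0.
toy_dummies = pd.get_dummies(toy, drop_first=True)
print(toy_dummies)
```

The result has the columns `HP`, `Fuel_Type_Diesel` and `Fuel_Type_Petrol`; a row with both dummies equal to 0 (or `False`) is a CNG car.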

Next step is data partition:

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)

We will start right away with a grid search to look for the best hyperparameters:

In [None]:
# First guess to find optimized tree
param_grid = {
    'max_depth': [5, 10, 15, 20, 25], 
    'min_impurity_decrease': [0, 0.001, 0.005, 0.01], 
    'min_samples_split': [10, 20, 30, 40, 50], 
}
gridSearch = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5, n_jobs=-1)
gridSearch.fit(X_train, y_train)
print('Initial parameters: ', gridSearch.best_params_)

In [None]:
# Let's now refine our search with the results
param_grid = {
    'max_depth': [3, 4, 5, 6, 7], 
    'min_impurity_decrease': [0, 0.001, 0.002, 0.003, 0.005, 0.006, 0.007, 0.008], 
    'min_samples_split': [15, 16, 18, 20, 21, 22, 23, 24, 25], 
}
gridSearch = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5, n_jobs=-1)
gridSearch.fit(X_train, y_train)
print('Improved parameters: ', gridSearch.best_params_)

And we pick the best result as our model:

In [None]:
regTree = gridSearch.best_estimator_

Let's now have a look at performance measures for the regression tree model:

In [None]:
def regression_performance(decisionTree, X, y, settype):

    # True and predicted values
    y_true = y.values
    y_pred = decisionTree.predict(X)
    
    # Compute error
    y_res = y_true - y_pred
    
    # Metrics
    metrics = [
        ('Mean Error (ME)', sum(y_res) / len(y_res)),
        ('Root Mean Squared Error (RMSE)', math.sqrt(mean_squared_error(y_true, y_pred))),
        ('Mean Absolute Error (MAE)', sum(abs(y_res)) / len(y_res)),
    ]
    if all(yt != 0 for yt in y_true):
        metrics.extend([
            ('Mean Percentage Error (MPE)', 100 * sum(y_res / y_true) / len(y_res)),
            ('Mean Absolute Percentage Error (MAPE)', 100 * sum(abs(y_res / y_true)) / len(y_res)),
        ])
        
    # Print results
    fmt1 = '{{:>{}}} : {{:.4f}}'.format(max(len(m[0]) for m in metrics))
    print('\n%s regression statistics\n' % settype)
    for metric, value in metrics:
        print(fmt1.format(metric, value))
    print()
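
To make the metrics concrete, here is a small worked example with three made-up prices and predictions (not taken from the Toyota data). It computes the same five measures with NumPy:

```python
import numpy as np

# Made-up values for illustration only
y_true_demo = np.array([100.0, 200.0, 400.0])
y_pred_demo = np.array([110.0, 180.0, 400.0])
residuals = y_true_demo - y_pred_demo                 # [-10, 20, 0]

me = residuals.mean()                                 # errors may cancel out
rmse = np.sqrt((residuals ** 2).mean())               # penalizes large errors
mae = np.abs(residuals).mean()                        # average error size
mpe = 100 * (residuals / y_true_demo).mean()          # signed, in percent
mape = 100 * np.abs(residuals / y_true_demo).mean()   # unsigned, in percent

print('ME: %.4f  RMSE: %.4f  MAE: %.4f' % (me, rmse, mae))
print('MPE: %.4f  MAPE: %.4f' % (mpe, mape))
```

Note that the MPE is 0 here even though two of the three predictions are off by 10%: a -10% and a +10% error cancel each other. That is why the signed measures (ME, MPE) tell you about systematic over- or under-prediction, while the unsigned ones (RMSE, MAE, MAPE) tell you about the size of the errors.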

In [None]:
# Performance measures
regression_performance(regTree, X_train, y_train, 'Train')
regression_performance(regTree, X_valid, y_valid, 'Validation')

In [None]:
tree_visualization(regTree, feature_names=X_train.columns)

Now you also know how to build a regression tree! :)

You can explore both datasets further and play around with them.