# Working with Classification Trees in Python

## Learning Objectives
Decision Trees are one of the most popular approaches to supervised machine learning. Decision Trees use an inverted tree-like structure to model the relationship between independent variables and a dependent variable. A tree with a categorical dependent variable is known as a **Classification Tree**. By the end of this tutorial, you will have learned:

+ How to import, explore and prepare data
+ How to build a Classification Tree model
+ How to visualize the structure of a Classification Tree
+ How to prune a Classification Tree

## 1. Collect the Data

<p style="color:yellow">In this exercise, we'll use a sample loans data set to build a classification tree that predicts whether a borrower will default or not default on a new loan.</p>

We start by importing the Pandas package.

In [None]:
import pandas as pd # import pandas package

Then we import the data into a data frame called loan and preview it to make sure that the import worked as expected.

In [None]:
loan = pd.read_csv("loan.csv")
loan.head()

## 2. Explore the Data

Now that we have our data, let's try to understand it. First, we get a concise summary of the structure of the data by calling the info method of the data frame.

In [None]:
loan.info()

From the summary, we can tell that there are 30 instances in the dataset by looking at the range index. We can also tell that there are three columns in the dataset. Looking at the Dtype column of the summary, we see that the Income and Loan Amount columns hold integer values while the Default column holds text, which pandas reports as object.

Next, we get summary statistics for the data by calling the describe method of the data frame.

In [None]:
loan.describe()

From the statistics, we see that the minimum income value in the data is 5, while the maximum value is 34. Note that these values are in thousands of dollars, so what we're seeing here is $5,000 and $34,000. Likewise, the average loan amount is $51,967.

Next, let's also visually explore the data by creating a few plots.

To ensure that our plots show up inline, we run the %matplotlib inline magic command.

In [None]:
%matplotlib inline

Then, we import two packages: matplotlib's pyplot module and the seaborn package.

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

The first plot we create is a boxplot. It shows the difference in annual income between borrowers who did not default (No) and borrowers who did default (Yes).
The plot shows that borrowers who did not default on their loans tend to have a higher annual income.

In [None]:
ax = sns.boxplot(data = loan, x = 'Default', y = 'Income')

Next, let's create another boxplot to show the difference in loan amount between those that did not default on their loans and those that did.

This chart shows that borrowers who defaulted on their loans tend to have borrowed slightly more than those who did not.

In [None]:
ax = sns.boxplot(data = loan, x = 'Default', y = 'Loan Amount')

Finally, let's create a scatter plot to look at the relationship between income and loan amount.

This chart doesn't show a clear linear relationship between those two variables. There isn't much we can really infer from it, so we can move on now.

In [None]:
ax = sns.scatterplot(data = loan, 
                     x = 'Loan Amount', 
                     y = 'Income', 
                     hue = 'Default', 
                     style = 'Default', 
                     markers = ['^','o'], 
                     s = 150)
plt.legend(bbox_to_anchor = (1.02, 1), loc = 'upper left'); # place the legend outside the plot area

## 3. Prepare the Data

Before we build our classification tree though, we need to split the data into training and test sets. Prior to doing so, we must first separate the dependent variable from the independent variables.

Let's start by creating a data frame called y for the dependent variable, which is Default.

In [None]:
y = loan[['Default']]

Then we do the same and create a second data frame called X for the independent variables, Income and Loan Amount.

In [None]:
X = loan[['Income', 'Loan Amount']]

Now that we have our two data frames, we can split the data. Before we do so, we have to import the train_test_split function from the sklearn.model_selection sub package. Using this function, we split the X and y data frames into X_train, X_test, y_train and y_test.

Note that here we set train_size to 0.8. This means we want 80% of the original data to become the training data, while 20% becomes the test data. We also set stratify to y, which means we want the data split using stratified random sampling based on the values of y. Finally, we set random_state to 1234, simply so we get the same results every time we do the split.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size = 0.8,
                                                    stratify = y,
                                                    random_state = 1234) 
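
To see what stratified sampling does, the following sketch applies the same split settings to a small made-up data set and counts the classes in each split. The toy frame and its 18 No and 12 Yes labels are hypothetical; they are not the loan data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the loan data: 30 rows, 18 'No' and 12 'Yes'.
toy = pd.DataFrame({
    'Income': range(30),
    'Default': ['No'] * 18 + ['Yes'] * 12,
})
toy_X = toy[['Income']]
toy_y = toy['Default']

# Same settings as in the tutorial: 80% training data, stratified on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y,
                                          train_size = 0.8,
                                          stratify = toy_y,
                                          random_state = 1234)

# Count the classes in each split to compare them with the full data.
train_counts = y_tr.value_counts().to_dict()
test_counts = y_te.value_counts().to_dict()
print(train_counts, test_counts)
```

Because the split is stratified, both splits keep close to the 60/40 mix of No and Yes found in the full toy data.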

Now that our data is split, the shape attribute of the X_train and X_test data frames tells us how many instances, or records, are in each data frame.

We can see that we have 24 instances in the training set and six instances in the test set.

In [None]:
X_train.shape, X_test.shape

## 4. Train and Evaluate the Classification Tree

To build a classification tree in Python, we need to import the DecisionTreeClassifier class from the sklearn.tree sub package. We then instantiate an object from the class. We call the object classifier.

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(random_state = 1234)

Now that we have an object, we can build or fit a classification tree model using the training data.

In [None]:
model = classifier.fit(X_train, y_train)

To estimate the future performance of our model, we can now see how well it performs on the test data. To do so, we pass the test data to the score method of the model. This returns the accuracy of the model against the test data.

Our classification tree correctly predicts the outcome for only 50% of the instances in the test data. That's no better than a coin toss. We can do better.

In [None]:
model.score(X_test, y_test)
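
For a classifier, the score method reports accuracy: the fraction of instances whose predicted label matches the actual label. As a minimal sketch, the example below computes accuracy by hand for six test instances. The two label lists are hypothetical and are not output from our model.

```python
# Hypothetical actual and predicted labels for six test instances.
actual    = ['No', 'No',  'Yes', 'Yes', 'No',  'Yes']
predicted = ['No', 'Yes', 'Yes', 'No',  'Yes', 'Yes']

# Accuracy is the number of matching labels divided by the number of instances.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(correct, accuracy)
```

With only six test instances, each prediction moves the accuracy by about 16.7 percentage points, so a score of 50% corresponds to three correct predictions.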

## 5. Visualize the Classification Tree

Now that we've trained a classification tree, let's visualize it to get a better understanding of the tree logic.

First, we import the tree module from the sklearn package.
The figure function of pyplot allows us to specify the size of the plot that holds our tree. Feel free to adjust this to see how it impacts the size of your tree.

In [None]:
from sklearn import tree

Finally, we use the plot_tree function of the tree module to visualize the tree.

The first argument we pass to this function is the classification tree model itself, model. Then we specify the names of the independent variables as a list. Next, we specify the possible values of the dependent variable as a list in ascending order, No and Yes. Finally, we specify that we want the nodes of the tree filled with color.

In [None]:
plt.figure(figsize = (15,15))
tree.plot_tree(model, 
                   feature_names = list(X.columns), 
                   class_names = ['No','Yes'],
                   filled = True);

Now we have our tree.

Let's take some time to understand the structure of this classification tree. We see that the root node asks the question: is income less than or equal to 14.5, that is, $14,500?
This means that the first split that the classifier made during the recursive partitioning process was at an income value of 14.5.
The fact that the income variable was used for the first split lets us know that it is the most important variable within the dataset in predicting the outcome.
The branch to the left of each node is followed when the answer to the node's question is yes, while the branch to the right is followed when the answer is no. These answers are not the same thing as the Yes and No values of the dependent variable.
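
The same left and right branching can also be read as text. The sketch below assumes the export_text function from sklearn.tree and fits a tree on a tiny invented data set. The names toy_X, toy_y and toy_tree are hypothetical and the values are not taken from the loan data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: low incomes default, high incomes do not.
toy_X = pd.DataFrame({'Income':      [10, 12, 14, 20, 25, 30],
                      'Loan Amount': [50, 40, 60, 55, 45, 35]})
toy_y = ['Yes', 'Yes', 'Yes', 'No', 'No', 'No']

toy_tree = DecisionTreeClassifier(random_state = 1234).fit(toy_X, toy_y)

# Each "<=" line is the left branch of a node (the answer to its question is yes);
# the matching ">" line is the right branch (the answer is no).
rules = export_text(toy_tree, feature_names = list(toy_X.columns))
print(rules)
```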

Within each node, we get a value for the Gini impurity score. 

Gini is a measure of the degree of impurity in the partition. The smaller this value is, the more homogenous the items in a partition are. 
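
The Gini impurity of a node can be computed as one minus the sum of the squared class proportions. As a worked sketch, the example below applies this to the 14 No and 10 Yes items held by the root node of our tree. The gini_impurity helper is hypothetical and is not part of scikit-learn.

```python
def gini_impurity(class_counts):
    """Return 1 minus the sum of squared class proportions for one node."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

root = gini_impurity([14, 10])  # root node: 14 No, 10 Yes
pure = gini_impurity([8, 0])    # a node that holds a single class
even = gini_impurity([5, 5])    # a node split evenly between two classes
print(round(root, 3), pure, even)
```

The root node works out to roughly 0.486, a pure node scores 0, and an evenly split two-class node scores 0.5, which is the maximum for two classes.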

We also see the number of items or samples within each partition. Notice that this value decreases as we work our way down the tree towards the leaf nodes. This is expected since the primary objective of recursive partitioning is to create smaller, more homogenous subsets of the data.

The next piece of information in each node, value, indicates the count of items within each class. This is the item distribution.
For example, in the root node there are 14 items with a value of No and 10 with a value of Yes. The No items are the majority, which is why the class value is equal to No.
This means that if our classification tree were just one node, the root node, it would label every loan as not default. Notice how the Gini impurity values change in relation to the item distributions. As one class dominates, the Gini value tends toward zero.
One of the benefits of decision trees is that they are pretty good at ranking the effectiveness of independent variables in predicting the values of the dependent variable. This is known as feature importance.

________________________________________________________________________________________


We can visualize the feature importance of the independent variables as follows.
First, we assign the feature_importances_ attribute of the model to a variable, which we call importance.
The attribute returns an array of the importance scores of each independent variable. Next, we create a pandas Series called feature_importance by using the importance array as the values and the independent variable names as the index.
Finally, we plot the series. Let's take a look at it.

From the plot, we see that the income variable is more important than the loan amount in predicting whether a borrower will default on their loan or not.

In [None]:
importance = model.feature_importances_
feature_importance = pd.Series(importance, index = X.columns)
feature_importance.plot(kind = 'bar')
plt.ylabel('Importance');
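
The importance scores can also be inspected as numbers rather than as a chart. The sketch below fits a tree on a tiny invented data set (toy_X, toy_y and toy_tree are hypothetical) and sorts the scores from most to least important. For a tree that makes at least one split, scikit-learn normalizes the scores so that they sum to one.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data in which Income alone separates the two classes.
toy_X = pd.DataFrame({'Income':      [10, 12, 14, 20, 25, 30],
                      'Loan Amount': [50, 40, 60, 55, 45, 35]})
toy_y = ['Yes', 'Yes', 'Yes', 'No', 'No', 'No']

toy_tree = DecisionTreeClassifier(random_state = 1234).fit(toy_X, toy_y)

# Pair each score with its column name and sort in descending order.
toy_importance = pd.Series(toy_tree.feature_importances_, index = toy_X.columns)
toy_importance = toy_importance.sort_values(ascending = False)
print(toy_importance)
```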

## 6. Prune the Classification Tree

Now that we've trained and visualized a classification tree, let's look into what we can do to improve its performance by pruning. Decision trees are prone to overfitting. One telltale sign that a tree has overfit is if it has a high accuracy score on the training data with a low accuracy score on the test data.

Let's start by getting our tree's accuracy on the training data. To do this, we pass the training data to the score method of the model.

Our model is 100% accurate on the training data. That's suspicious. Let's get a second opinion from the test data.

In [None]:
model.score(X_train, y_train)

Similarly, we pass the test data to the score method of the model. Our model is 50% accurate on the test data. Our model has definitely overfit on the training data and needs to be pruned.

In [None]:
model.score(X_test, y_test)

There are two ways to prune a decision tree. One is to set parameters that manage its growth during the recursive partitioning process. This is known as pre-pruning. Another approach is to allow the tree to grow fully, unimpeded, and then gradually reduce its size in order to improve its performance. This is known as post-pruning.

In this tutorial, we will use a pre-pruning approach. This means that we need to figure out the best combination of values for the parameters of the tree that will result in the best performance. This is known as hyperparameter tuning. The scikit-learn package provides several parameters we can tune during this process.

We will limit ourselves to three of them.
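
As a rough illustration of what pre-pruning does, the sketch below grows one unconstrained tree and one constrained tree on the same data and compares their sizes. It assumes a synthetic data set from make_classification, and the parameter values are chosen only for illustration; they are not the tuned values for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used here only as a stand-in.
toy_X, toy_y = make_classification(n_samples = 200, n_features = 4,
                                   random_state = 1234)

# A tree that grows unimpeded, and one whose growth is limited up front.
full_tree = DecisionTreeClassifier(random_state = 1234).fit(toy_X, toy_y)
pruned_tree = DecisionTreeClassifier(max_depth = 2,
                                     min_samples_leaf = 6,
                                     random_state = 1234).fit(toy_X, toy_y)

# Compare the depth and the number of leaf nodes of the two trees.
print(full_tree.get_depth(), full_tree.get_n_leaves())
print(pruned_tree.get_depth(), pruned_tree.get_n_leaves())
```

A tree limited to a depth of two can have at most four leaf nodes, however many items the training data holds.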

We start by creating a dictionary which we call grid that holds the values of the parameters we want to try out. 

The first parameter is max_depth. This sets the maximum depth of the decision tree. We will try setting the value to two, three, four and five to see which is the best.

The next parameter is min_samples_split. This sets the minimum number of items we can have in a partition before it can be split. Studies suggest that a value between one and 40 is best, although scikit-learn requires an integer value of at least two. We will try setting the value to two, three, and four.

Next is the min_samples_leaf parameter. This sets the minimum number of items we can have in a leaf node. Studies suggest that the best values are between one and 20. We will try setting the value to one, two, three, four, five, and six. In the code, we write this as range(1, 7), because the end value of a range is not included.

In [None]:
grid = {'max_depth': [2, 3, 4, 5],
        'min_samples_split': [2, 3, 4],
        'min_samples_leaf': range(1, 7)}
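
It helps to know how much work a grid of this size implies. The sketch below counts the parameter combinations with itertools.product. The five-fold figure is an assumption based on the default cross-validation setting of GridSearchCV in recent scikit-learn versions.

```python
from itertools import product

grid = {'max_depth': [2, 3, 4, 5],
        'min_samples_split': [2, 3, 4],
        'min_samples_leaf': range(1, 7)}

# Every combination takes one value from each parameter list.
combinations = list(product(*grid.values()))
n_combinations = len(combinations)

# Assumption: five cross-validation folds, so each combination is fitted five times.
n_folds = 5
n_fits = n_combinations * n_folds
print(n_combinations, n_fits)
```

Four depths, three split sizes and six leaf sizes give 72 combinations, or 360 model fits under five-fold cross-validation.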

The GridSearchCV class from the scikit-learn model_selection sub package allows us to perform a grid search to find the best parameter values for our tree. We import the class.

In [None]:
from sklearn.model_selection import GridSearchCV

Then we instantiate a DecisionTreeClassifier object and pass it to a new GridSearchCV object, which we call gcv. We also pass the parameter grid to the object. We then pass the training data to the fit method of gcv so it evaluates each hyperparameter combination in grid.

In [None]:
classifier = DecisionTreeClassifier(random_state = 1234)
gcv = GridSearchCV(estimator = classifier, param_grid = grid)
gcv.fit(X_train, y_train)

The best_estimator_ attribute of gcv returns the classifier with the best combination of hyperparameters for our data.

We then fit a classification tree on the training data using this classifier.

The output shows that the best combination of hyperparameters is max_depth set at two and min_samples_leaf set at six.

In [None]:
model_ = gcv.best_estimator_
model_.fit(X_train, y_train)
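
The winning values can also be read directly from the fitted search object. This is a minimal sketch that assumes the best_params_ and best_score_ attributes of GridSearchCV and a synthetic data set from make_classification. The names toy_X, toy_y, toy_grid and toy_gcv are hypothetical stand-ins for the loan data and the grid above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used here only as a stand-in.
toy_X, toy_y = make_classification(n_samples = 60, n_features = 2,
                                   n_informative = 2, n_redundant = 0,
                                   random_state = 1234)

# A smaller grid than the one in the tutorial, to keep the example quick.
toy_grid = {'max_depth': [2, 3],
            'min_samples_leaf': [1, 3, 6]}

toy_gcv = GridSearchCV(estimator = DecisionTreeClassifier(random_state = 1234),
                       param_grid = toy_grid)
toy_gcv.fit(toy_X, toy_y)

# best_params_ holds the winning combination as a dictionary;
# best_score_ is its mean cross-validated accuracy.
print(toy_gcv.best_params_)
print(round(toy_gcv.best_score_, 3))
```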

Now we can reevaluate how well our model fits the training data by passing the training data to the score method of the model.

In [None]:
model_.score(X_train, y_train)

We see that the accuracy has gone down from 100% to 87.5%.

Let's see how the model fits the test data as well.

In [None]:
model_.score(X_test, y_test)

Now, the model's accuracy on the test data has risen from 50% to 83.3%. That is much better.

Finally, we can visualize our pruned model. Our pruned tree is much smaller than the one we started off with, but it generalizes much better.

In [None]:
plt.figure(figsize = (8,8))
tree.plot_tree(model_, 
                   feature_names = list(X.columns), 
                   class_names = ['No','Yes'],
                   filled = True);