We learned about how decision trees are constructed. We used a modified version of ID3, which is a bit simpler than the most common tree building algorithms, [C4.5](https://en.wikipedia.org/wiki/C4.5_algorithm) and [CART](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees). The basics are the same, however, so we can apply what we learned about how decision trees work to any tree construction algorithm.

In this project, we'll learn about when to use decision trees, and how to use them most effectively.

We'll continue using the 1994 census data on U.S. incomes. It contains information on marital status, age, type of work, and more. The target column, high_income, indicates an income of less than or equal to 50k a year (0), or more than 50k a year (1).

We can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).

In [18]:
import pandas as pd
import numpy as np

income = pd.read_csv("income.csv")

In [19]:
cols = ["workclass","education", "marital_status", "occupation", "relationship", "race", "sex", "native_country", "high_income"]

for col in cols:
    income[col] = pd.Categorical(income[col]).codes

In [20]:
income.head()

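| Unnamed: 0 | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | high_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 7 | 77516 | 9 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 | 0 |
| 1 | 50 | 6 | 83311 | 9 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 | 0 |
| 2 | 38 | 4 | 215646 | 11 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 | 0 |
| 3 | 53 | 4 | 234721 | 1 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 | 0 |
| 4 | 28 | 4 | 338409 | 9 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 | 0 |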
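
The integer values in the categorical columns above come from `pd.Categorical(...).codes`. As a minimal sketch of what that conversion does, the example below uses a short, made-up series of labels (the values are hypothetical, not taken from income.csv) and shows how to recover the mapping from each code back to its original label.

```python
import pandas as pd

# Hypothetical values standing in for one categorical column
workclass = pd.Series(["Private", "State-gov", "Private", None, "Federal-gov"])

cat = pd.Categorical(workclass)

# One integer code per row; categories are sorted, and missing values become -1
codes = cat.codes

# Lookup table from each code back to the original label
mapping = dict(enumerate(cat.categories))

print(list(codes))
print(mapping)
```

The codes follow the sorted order of the labels, so the numeric ordering carries no real meaning. A decision tree will still treat these columns as ordered numbers when it picks split thresholds.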


We can use the scikit-learn package to fit a decision tree. The interface is very similar to other algorithms.

We use the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) class for classification problems, and [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor) for regression problems. The sklearn.tree package includes both of these classes.

In this case, we're predicting a binary outcome, so we'll use a classifier.

The first step is to train the classifier on the data. We'll use the fit method on a classifier to do this.

We'll want to split our data into training and testing sets first. If we don't, we'll be making predictions on the same data that we train our algorithm with. This hides overfitting, and will make our error appear lower than it is.

# Overfitting

If we memorize how to perform three specific addition problems (2+2, 3+6, 3+3), we'll get those specific problems correct every time.

On the other hand, if someone asks us what 4+4 is, we won't know how to do it, because we don't know the rules of addition. If we learn the rules of addition, we'll get problems wrong sometimes (because operations like 3443343434+24344343 can be hard to do mentally). Even so, we'll be able to do any problem, and we'll get most of them right.

The first example represents overfitting, where we memorize the details of the training set, but can't generalize to new examples we're asked to make predictions on.

We can avoid overfitting by always making predictions and evaluating error on data that we haven't trained our algorithm with. This will show us when we're overfitting by giving us a realistic error on data that the algorithm hasn't seen before.

We can split the data by shuffling the order of the dataframe, then selecting certain rows to include in the training set, and certain rows to include in the testing set.

In [21]:
import math

np.random.seed(1)

income = income.reindex(np.random.permutation(income.index))
income.head()

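| Unnamed: 0 | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | high_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9646 | 62 | 6 | 26911 | 5 | 4 | 6 | 8 | 1 | 4 | 0 | 0 | 0 | 66 | 39 | 0 |
| 709 | 18 | 4 | 208103 | 1 | 7 | 4 | 8 | 2 | 4 | 1 | 0 | 0 | 25 | 39 | 0 |
| 7385 | 25 | 4 | 102476 | 9 | 13 | 4 | 5 | 3 | 4 | 1 | 27828 | 0 | 50 | 39 | 1 |
| 16671 | 33 | 4 | 511517 | 11 | 9 | 2 | 10 | 0 | 4 | 1 | 0 | 0 | 40 | 39 | 0 |
| 21932 | 36 | 4 | 292570 | 1 | 7 | 4 | 7 | 4 | 4 | 0 | 0 | 0 | 40 | 39 | 0 |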


In [23]:
# we'll make 80% of our rows training data, and the rest testing data.

train_max_row = math.floor(income.shape[0]*.8)
train = income[:train_max_row]
test = income[train_max_row:]
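
The shuffle-and-slice split above can also be done in one call with scikit-learn's `train_test_split`. The sketch below is an alternative, not what this project uses; the `demo` dataframe is a hypothetical stand-in with random values, and the `stratify` argument is an optional extra that keeps the share of each class the same in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the income dataframe (random values)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "age": rng.integers(18, 80, size=100),
    "high_income": rng.integers(0, 2, size=100),
})

# Shuffle and split in one step: 80% training rows, 20% testing rows
train_demo, test_demo = train_test_split(
    demo,
    test_size=0.2,
    random_state=1,                # fixed seed so the split is reproducible
    stratify=demo["high_income"],  # keep class proportions similar in both sets
)

print(train_demo.shape, test_demo.shape)
```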

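We'll evaluate our predictions with [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve), a common metric for binary classification. AUC ranges from 0 to 1: a value of 1 means the predictions separate the two classes perfectly, and a value of .5 is no better than random guessing. The higher the AUC, the more accurate our predictions.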

We can compute AUC with the [roc_auc_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) function from sklearn.metrics. This function takes in two parameters:

1. y_true: true labels
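2. y_score: predicted scores, such as class probabilities. Hard 0/1 labels (which we pass in this project) also work, but give a coarser estimate.

It then calculates and returns the AUC value.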
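
To make the role of y_score concrete, here's a small sketch with made-up labels and probabilities (none of these values come from the census data). It computes AUC once from the probabilities and once from the 0/1 labels we get by thresholding them at .5.

```python
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities for five rows
y_true = [0, 0, 1, 1, 1]
y_prob = [0.2, 0.6, 0.4, 0.7, 0.9]

# Hard labels obtained by thresholding the probabilities at .5
y_label = [int(p >= 0.5) for p in y_prob]

auc_from_scores = roc_auc_score(y_true, y_prob)
auc_from_labels = roc_auc_score(y_true, y_label)

print(round(auc_from_scores, 3))
print(round(auc_from_labels, 3))
```

Here the probabilities give an AUC of about .833, while the thresholded labels give about .583, because thresholding throws away the ranking information within each predicted class. With a fitted classifier, `clf.predict_proba(X)[:, 1]` would supply the probabilities.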

In [26]:
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# A list of columns to train with
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", 
           "race", "sex", "hours_per_week", "native_country"]

clf = DecisionTreeClassifier(random_state = 1) # Set random_state to 1 to make sure the results are consistent

clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])

# AUC is a score (higher is better), not an error, so name it accordingly
test_auc = roc_auc_score(test["high_income"], predictions)
print(test_auc)

0.6934656324746192


The AUC for the predictions on the testing set is about .693. Let's compare this against the AUC for predictions on the training set to see if the model is overfitting.

It's normal for the model to predict the training set better than the testing set. After all, it has full knowledge of that data and the outcomes. However, if the AUC between training set predictions and actual values is significantly higher than the AUC between test set predictions and actual values, it's a sign that the model may be overfitting.

In [27]:
predictions = clf.predict(train[columns])
print(roc_auc_score(train["high_income"], predictions))

0.9471244501437455


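The training set AUC (about .947) is much higher than the testing set AUC, so the model is overfitting. To see why trees overfit, consider the full diagram for a decision tree that splits only on age: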

![image.png](attachment:image.png)

This tree predicts all of our values perfectly. It will always get a right answer on the training set, but this is equivalent to memorizing specific addition problems instead of learning the rules of addition. While we've built our tree in such a way that it can perfectly predict the training set, the way it's constructed doesn't make sense when we take a step back.

That's because the tree above is saying that:

* If we're under 22.5 years old, we have a low income
* If we're 22.5 to 37.5, we have a high income
* If we're 37.5 to 47.5, we have a low income
* If we're 47.5 to 55, we have a high income
* Finally, if we're above 55, we have a low income

These rules are very specific to the training set.

Think about the problem with a real-world lens. Does it make sense to predict that someone who's 20 has a low income, someone who's 25 has a high income, and someone who's 40 has a low income? Intuitively, we know that younger people tend to earn less, middle-aged people earn more, and people who have retired earn less.

Our tree has created so many age-based splits in an attempt to perfectly predict everyone's income that each split is effectively meaningless.

Here's a tree that matches up with our intuition better:

![image.png](attachment:image.png)

All we've done is "prune" the tree by removing some of the lower leaves and turning some of the higher-level nodes into leaves instead. This version actually has lower accuracy on our training set, but will generalize to new examples better because it matches reality more closely.

Trees overfit when they have too much depth and make overly complex rules that match the training data, but aren't able to generalize well to new data. This may seem to be a strange principle at first, but past a certain depth, a deeper tree typically performs worse on new data.

There are three main ways to combat overfitting:

1. "Prune" the tree after we build it to remove unnecessary leaves.
2. Use ensembling to blend the predictions of many trees.
3. Restrict the depth of the tree while we're building it.

We'll look at the third method first.

Limiting tree depth during the building process will result in more general rules. This prevents the tree from overfitting.

We can restrict tree depth by adding a few parameters when we initialize the DecisionTreeClassifier class:

* max_depth - Globally restricts how deep the tree can go
* min_samples_split - The minimum number of rows a node must have before it can be split; if this is set to 3, for example, then nodes with 2 rows won't be split, and will become leaves instead
* min_samples_leaf - The minimum number of rows a leaf must have
* min_weight_fraction_leaf - The minimum fraction of the total sample weight a leaf must have; when rows aren't weighted, this is the fraction of input rows
* max_leaf_nodes - The maximum number of total leaves; this will cap the count of leaf nodes as the tree is being built

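Some of these parameters interact, however. For example, older versions of scikit-learn ignored **max_depth** when **max_leaf_nodes** was set, so it's worth checking the documentation for the installed version before combining them.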
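
To see what these parameters do to the shape of a tree, here's a minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the census data). It fits one unrestricted tree and one restricted tree, then compares their depth and leaf counts using the `get_depth` and `get_n_leaves` methods.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), standing in for a real dataset
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.1, random_state=1)

# No limits: the tree keeps splitting until its leaves are pure
unrestricted = DecisionTreeClassifier(random_state=1).fit(X, y)

# Limits on depth and leaf size stop the tree early
restricted = DecisionTreeClassifier(
    random_state=1, max_depth=4, min_samples_leaf=20
).fit(X, y)

for name, tree in [("unrestricted", unrestricted), ("restricted", restricted)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```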

In [28]:
# Set min_samples_split to 13

clf = DecisionTreeClassifier(random_state = 1, min_samples_split= 13)

clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)

print(test_auc)
print(train_auc)

0.6995617145150872
0.8421431849275413


By setting min_samples_split to 13, we managed to boost the test AUC from .693 to .700. The training set AUC decreased from .947 to .842, showing that the model we built was less overfit to the training set than before.

In [29]:
# Set max_depth to 7 and min_samples_split to 13

clf = DecisionTreeClassifier(random_state = 1, min_samples_split= 13, max_depth = 7)

clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)

print(test_auc)
print(train_auc)

0.7436344996725136
0.748037708309209


We just improved the AUC again! The test set AUC increased to .744, while the training set AUC decreased to .748. The two values are now nearly equal, which suggests the model is no longer overfitting.


In [30]:
# Set max_depth to 2 and min_samples_split to 100

clf = DecisionTreeClassifier(random_state = 1, min_samples_split= 100, max_depth = 2)

clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)

print(test_auc)
print(train_auc)

0.6553138481876499
0.6624508042161483


Both AUC values went down: the test set AUC is now .655, and the training set AUC is .662.

![image.png](attachment:image.png)

This is because we're now [underfitting](https://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted). Underfitting is what occurs when our model is too simple to explain the relationships between the variables.

![image.png](attachment:image.png)

This model will predict that anyone under 37.5 has a high income (.66 rounds up), and anyone over 37.5 has a low income (.33 rounds down). It's too simple to model reality, in which people earn less while they're young, more during middle age, and less again after retirement.

Therefore, this tree underfits the data and will have lower accuracy than the appropriate version.

By artificially restricting the depth of our tree, we prevent it from creating a model that's complex enough to correctly categorize some of the rows. If we don't perform the artificial restrictions, however, the tree becomes too complex, fits quirks in the data that only exist in the training set, and doesn't generalize to new data.
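
One way to look for a depth between these two extremes is to fit a tree at several depths and compare training and testing AUC for each. The sketch below does this on synthetic data from `make_classification` (an assumption for illustration; the list of depths is arbitrary), using hard predictions as in the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data standing in for a real dataset
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4, flip_y=0.15, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

results = {}
for depth in [1, 2, 4, 7, 12, None]:  # None means no depth limit
    clf = DecisionTreeClassifier(random_state=1, max_depth=depth)
    clf.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, clf.predict(X_train))
    test_auc = roc_auc_score(y_test, clf.predict(X_test))
    results[depth] = (train_auc, test_auc)
    print(depth, round(train_auc, 3), round(test_auc, 3))
```

A large gap between the two columns points to overfitting, while low values in both point to underfitting. In practice, a separate validation set or cross-validation is a safer basis for picking the depth than the test set itself.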

This is known as the [bias-variance tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff). Imagine that we take many random samples of the training data and train a model on each one. If the models' predictions for the same row are far apart from each other, we have high variance. If the models' predictions for the same row are close together but far from the actual value, we have high bias.

High bias can cause underfitting -- if a model is consistently failing to predict the correct value, it may be that it's too simple to model the data faithfully.

High variance can cause overfitting. If a model varies its predictions significantly based on small changes in the input data, then it's likely fitting itself to quirks in the training data, rather than making a generalizable model.

We call this the bias-variance tradeoff because decreasing one characteristic will usually increase the other. This is a limitation of all machine learning algorithms.
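
The thought experiment above can be run directly. The sketch below is a rough illustration under stated assumptions: the data is synthetic (an XOR-style pattern with 20% of labels flipped), each model is trained on a bootstrap resample, and "bias" is measured against the noise-free pattern. The helper names `make_data` and `bias_and_variance` are made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

def make_data(n):
    """Synthetic rows: the true pattern is an XOR of two thresholds, plus label noise."""
    X = rng.uniform(size=(n, 2))
    signal = ((X[:, 0] > 0.5) ^ (X[:, 1] > 0.5)).astype(int)
    flip = rng.random(n) < 0.2            # flip 20% of the labels
    y = np.where(flip, 1 - signal, signal)
    return X, y, signal

X_pool, y_pool, _ = make_data(500)
X_eval, _, signal_eval = make_data(200)   # fixed rows that every model predicts

def bias_and_variance(max_depth, n_models=30):
    """Train many trees on bootstrap resamples and compare their predictions."""
    preds = []
    for seed in range(n_models):
        idx = rng.integers(0, len(X_pool), size=len(X_pool))
        clf = DecisionTreeClassifier(max_depth=max_depth, random_state=seed)
        clf.fit(X_pool[idx], y_pool[idx])
        preds.append(clf.predict_proba(X_eval)[:, 1])
    preds = np.array(preds)
    # Variance: how far apart the models' predictions are for the same row
    variance = preds.var(axis=0).mean()
    # Squared bias: how far the average prediction is from the true pattern
    bias_sq = ((preds.mean(axis=0) - signal_eval) ** 2).mean()
    return bias_sq, variance

shallow_bias, shallow_var = bias_and_variance(max_depth=1)
deep_bias, deep_var = bias_and_variance(max_depth=None)

print("depth 1:   bias^2 =", round(shallow_bias, 3), "variance =", round(shallow_var, 3))
print("unlimited: bias^2 =", round(deep_bias, 3), "variance =", round(deep_var, 3))
```

On this data, the depth-1 tree should show the higher bias (a single split can't capture the pattern), and the unrestricted tree the higher variance (its predictions depend heavily on which rows it happened to see).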

We'll generally need to use our intuition and manually tweak parameters to get the "right" fit.

We can induce variance and see what happens with a decision tree. To introduce noise into the data, we'll add a column of random values. A model with high variance (like a decision tree) will pick up on this noise, and overfit to it. This is because models with high variance are very sensitive to small changes in input data.

In [37]:
np.random.seed(1)

# Generate a column containing random numbers from 0 to 3
income["noise"] = np.random.randint(4, size=income.shape[0])

columns = ["noise", "age", "workclass", "education_num", "marital_status", "occupation", 
           "relationship", "race", "sex", "hours_per_week", "native_country"]

train_max_row = math.floor(income.shape[0] * .8)
train = income.iloc[:train_max_row]
test = income.iloc[train_max_row:]

clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)

print(test_auc)
print(train_auc)

0.6914060013941348
0.9750761614350801


As we can see above, the random noise column makes the overfitting worse. Our test set AUC decreases from .693 to .691, and our training set AUC increases from .947 to .975.
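
We can check whether a tree is actually using a noise column by looking at its `feature_importances_` attribute. The sketch below uses a small synthetic dataframe (the column names `signal`, `other`, and `noise` are hypothetical) where only `signal` is related to the target, and compares an unrestricted tree with a depth-1 tree.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 500

# Integer-coded columns, similar in spirit to the categorical codes used earlier
demo = pd.DataFrame({
    "signal": rng.integers(0, 10, size=n),  # the only column related to the target
    "other": rng.integers(0, 5, size=n),    # unrelated column
    "noise": rng.integers(0, 4, size=n),    # random values from 0 to 3
})

# Target depends on "signal" only, with 20% of labels flipped
label = (demo["signal"] >= 5).astype(int).to_numpy()
target = np.where(rng.random(n) < 0.2, 1 - label, label)

full = DecisionTreeClassifier(random_state=1).fit(demo, target)
stump = DecisionTreeClassifier(random_state=1, max_depth=1).fit(demo, target)

full_importance = dict(zip(demo.columns, full.feature_importances_))
stump_importance = dict(zip(demo.columns, stump.feature_importances_))

print(full_importance)
print(stump_importance)
```

The unrestricted tree assigns some importance to `noise`, because it needs that column to separate rows whose other values are identical but whose labels differ. The depth-1 tree splits on `signal` alone.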

One way to prevent overfitting is to block the tree from growing beyond a certain depth (we tried this before). Another technique is called [pruning](https://en.wikipedia.org/wiki/Decision_tree_pruning). Pruning involves building a full tree, and then removing the leaves that don't add to prediction accuracy. Pruning prevents a model from becoming overly complex. It can result in a simpler model that has higher accuracy on the testing set.

Data scientists use pruning less often than parameter optimization (what we just did) and ensembling.
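
For reference, recent versions of scikit-learn (0.22 and later) support one form of pruning, minimal cost-complexity pruning, through the `ccp_alpha` parameter. The sketch below uses synthetic data and picks an arbitrary alpha from the middle of the pruning path purely for illustration; in practice we'd choose alpha with a validation set or cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data standing in for a real dataset
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.15, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Build the full tree first
full = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Candidate pruning strengths, from no pruning up to pruning back to the root
path = full.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary choice for illustration: an alpha from the middle of the path
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]

# Refit with that alpha; subtrees that add too little are collapsed into leaves
pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X_train, y_train)

print("full leaves:  ", full.get_n_leaves())
print("pruned leaves:", pruned.get_n_leaves())
```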

# The main advantages of using decision trees 

* Easy to interpret
* Relatively fast to fit and make predictions
* Able to handle multiple types of data
* Able to pick up nonlinearities in data, and usually fairly accurate

The main disadvantage of using decision trees is their tendency to overfit.

Decision trees are a good choice for tasks where it's important to be able to interpret and convey why the algorithm is doing what it's doing.

The most powerful way to reduce decision tree overfitting is to create ensembles of trees. The [random forest](https://en.wikipedia.org/wiki/Random_forest) algorithm is a popular choice for doing this. In cases where prediction accuracy is the most important consideration, random forests usually perform better.
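
As a rough preview, here's a sketch that compares a single unrestricted tree with scikit-learn's `RandomForestClassifier` on synthetic data (an assumption for illustration, not the census data). Both models are scored with AUC computed from predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data standing in for a real dataset
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4, flip_y=0.15, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# A single unrestricted tree, which tends to overfit
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# An ensemble of 100 trees, each trained on a different resample of the rows
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

tree_auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
forest_auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])

print("single tree AUC:", round(tree_auc, 3))
print("random forest AUC:", round(forest_auc, 3))
```

Averaging many trees smooths out the quirks that any one tree picks up, which is why the forest typically scores higher on the testing set. The cost is that the combined model is harder to interpret than a single tree.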