# Applying Decision Trees

Decision trees are a powerful and popular machine learning technique. The basic concept is very similar to trees that we commonly use to aid decision-making.

Machine learning techniques enable us to automatically construct a decision tree that tells us which outcome to predict in a given situation.

The decision tree algorithm is a supervised learning algorithm -- we first construct the tree with historical data, and then use it to predict an outcome. 
- One of the major advantages of decision trees is that they can pick up nonlinear interactions between variables in the data that linear regression can't.
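
To make the point about nonlinear interactions concrete, here is a minimal sketch on synthetic data (not the census data). The label depends only on whether two features share the same sign, and `LogisticRegression` stands in for a linear model; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic data: the label is 1 when both features have the same sign.
# It depends on the interaction of the two features, so no single
# straight line separates the classes.
X = rng.uniform(-1, 1, size=(600, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
linear = LogisticRegression().fit(X_train, y_train)

# Accuracy on rows neither model has seen
tree_acc = tree.score(X_test, y_test)
linear_acc = linear.score(X_test, y_test)
print("Tree accuracy:  ", tree_acc)
print("Linear accuracy:", linear_acc)
```

On data like this, the tree should score well above the linear model, which can do little better than guessing.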

The following is a walkthrough of the building blocks for constructing a decision tree automatically.

We'll be looking at individual income in the United States. 
The data is from the 1994 census, and contains information on an individual's 
- marital status
- age 
- type of work, and more. 

The target column, or what we want to predict, is whether individuals make less than or equal to 50k a year, or more than 50k a year. The dataset can be accessed from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).

An [Introduction to Decision Trees](https://github.com/ajdatahub/ProjectDS/tree/master/Decision%20Trees), which shows how they are implemented using the same data set, is available as a companion project. 
That project covers how decision trees are constructed. It demonstrates a modified version of ID3, which is a bit simpler than the most common tree-building algorithms, C4.5 and CART. The basics are the same, however, so what it teaches about how decision trees work applies to any tree construction algorithm.

In the current project, we'll learn about when to use decision trees, and how to use them most effectively.

In [61]:
import pandas as pd

# First column is age of the person. Set index_col to False to avoid pandas thinking that the first column is row indexes
income = pd.read_csv("income.csv", index_col=False)
print(income.head(5))

   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country high_income  
0          2174             0              40   United-States   

We have categorical variables such as workclass that have string values.
- Multiple individuals can share the same string value. The types of work include State-gov, Self-emp-not-inc, Private, and so on. Each of these strings is a label for a category.
- Another example of a column of categories is sex, where the options are Male and Female.

Before we get started with decision trees, we need to convert the categorical variables in our data set to numeric variables.

In [62]:
workclass_col = pd.Categorical(income['workclass'])
income['workclass'] = workclass_col.codes
income['workclass'].head()

0    7
1    6
2    4
3    4
4    4
Name: workclass, dtype: int8
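
The integer codes are only meaningful alongside the categories they index. As a small sketch, using a toy column rather than the real `workclass` data, the mapping from code back to label can be recovered from the `categories` attribute:

```python
import pandas as pd

# Toy column standing in for income['workclass'] (values are illustrative).
workclass = pd.Series(["State-gov", "Private", "Self-emp-not-inc", "Private"])

cat = pd.Categorical(workclass)

# Categories are sorted, and each code is an index into cat.categories.
code_to_label = dict(enumerate(cat.categories))
print(code_to_label)
print(list(cat.codes))
```

Keeping this mapping around makes it possible to translate a tree's numeric split rules back into readable labels. Note that the codes impose an arbitrary ordering on the categories.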

In [63]:
def convert_categorical(column, data):
    """Return the integer category codes for one column of a DataFrame."""
    return pd.Categorical(data[column]).codes

# Convert every remaining string column to numeric codes
for name in ['education', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'native_country', 'high_income']:
    income[name] = convert_categorical(name, income)

In [64]:
income.head()

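| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | high_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 7 | 77516 | 9 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 | 0 |
| 1 | 50 | 6 | 83311 | 9 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 | 0 |
| 2 | 38 | 4 | 215646 | 11 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 | 0 |
| 3 | 53 | 4 | 234721 | 1 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 | 0 |
| 4 | 28 | 4 | 338409 | 9 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 | 0 |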


We can use the scikit-learn package to fit a decision tree. 
- For classification problems, we can use DecisionTreeClassifier
- For regression problems, we can use DecisionTreeRegressor

In our project, we are trying to predict a binary outcome, high_income. So, we will be using the classifier class.

First, we will train our classifier on the data.

In [65]:
from sklearn.tree import DecisionTreeClassifier

columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

cl = DecisionTreeClassifier(random_state=1)

# Training the model
cl.fit(income[columns], income['high_income'])


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')

Now, we will split our data into train and test datasets. If we do not split our data, we will be making predictions about data that has already been fed into the model, and we won't be able to tell whether the model is overfitting.

- Evaluating on the training data makes the error metric look artificially good, because the model has already been trained on the observations we are trying to predict.
- We can detect overfitting by making predictions and evaluating the error metric on data the model was not trained on.

In [66]:
# Splitting the data into train and test

import numpy as np
import math

# Set a random seed so the shuffle is the same every time
np.random.seed(1)

income = income.reindex(np.random.permutation(income.index))

split_val = math.floor(income.shape[0] * .8)

train = income.iloc[:split_val]
test = income.iloc[split_val:]
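
As an alternative to the manual shuffle above, scikit-learn provides `train_test_split`. The sketch below uses a small stand-in DataFrame (an assumption for illustration) and is not guaranteed to produce the same rows in each set as the manual approach, since the shuffling differs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small stand-in frame; in the walkthrough this would be the income DataFrame.
df = pd.DataFrame({"age": range(20, 30), "high_income": [0, 1] * 5})

# shuffle=True is the default; random_state makes the shuffle repeatable.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=1)

print(len(train_df), len(test_df))
```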

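We'll use AUC (area under the ROC curve) as our error metric. AUC ranges from 0 to 1, where 0.5 corresponds to random guessing, which makes it well suited to binary classification. The higher the AUC, the better our predictions separate the two classes.

We can compute AUC with the roc_auc_score function from sklearn.metrics. This function takes in two parameters:
- y_true: true labels
- y_score: predicted scores, such as class probabilities; here we pass the predicted 0/1 labels, which also works but evaluates only a single threshold

It then calculates and returns the AUC value.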
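
One way to build intuition for what the AUC value means: it equals the fraction of (positive, negative) pairs in which the positive row receives the higher score, with ties counted as half. The helper below, `pairwise_auc`, is a hypothetical name introduced here for illustration; it is a slow pure-Python sketch, not how scikit-learn computes the metric.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```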

In [67]:
from sklearn.metrics import roc_auc_score

cl = DecisionTreeClassifier(random_state=1)
cl.fit(train[columns], train["high_income"])

predictions = cl.predict(test[columns])

error = roc_auc_score(test["high_income"], predictions)
print(error)

0.6934656324746192
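
The score above was computed from hard 0/1 predictions. `roc_auc_score` can also be given predicted probabilities from `predict_proba`, which lets it consider every threshold. The sketch below compares the two on synthetic data from `make_classification`; the data and the `min_samples_leaf=10` setting are assumptions for illustration, so the numbers will not match the census results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the income data)
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# min_samples_leaf > 1 so that leaves hold mixed classes and the
# predicted probabilities are not all exactly 0 or 1.
clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=10)
clf.fit(X_train, y_train)

# AUC from hard labels vs. AUC from the probability of class 1
label_auc = roc_auc_score(y_test, clf.predict(X_test))
proba_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print("AUC from labels:       ", label_auc)
print("AUC from probabilities:", proba_auc)
```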


Let's compare this against the AUC for predictions on the training set to see if the model is overfitting.

- It's normal for the model to predict the training set better than the testing set. After all, it has full knowledge of that data and the outcomes. 
- However, if the AUC between training set predictions and actual values is significantly higher than the AUC between test set predictions and actual values, it's a sign that the model may be overfitting.

In [68]:
# Calculating predictions for train data set
predictions1 = cl.predict(train[columns])

# Calculating the roc score
error = roc_auc_score(train['high_income'], predictions1)

print(error)

0.9471244501437455


Our AUC on the training set was .947, and the AUC on the test set was .693. 
- There's no hard and fast rule on when overfitting is occurring, but our model is predicting the training set much better than the test set. 
- Splitting the data into training and testing sets doesn't prevent overfitting -- it just helps us detect and fix it.

Based on our AUC measurements, it appears that we are in fact overfitting.

Trees overfit when they have too much depth and make overly complex rules that match the training data, but aren't able to generalize well to new data. 
- This may seem to be a strange principle at first, but beyond a certain point, the deeper a tree is, the worse it typically performs on new data.

###  Reducing Overfitting With a Shallower Tree

There are three main ways to combat overfitting:
- "Prune" the tree after we build it to remove unnecessary leaves.
- Use ensembling to blend the predictions of many trees.
- Restrict the depth of the tree while we're building it.

Let's discuss the third method first.

Limiting tree depth during the building process will result in more general rules. This helps prevent the tree from overfitting.

We can restrict tree depth by adding a few parameters when we initialize the DecisionTreeClassifier class:

- max_depth - Globally restricts how deep the tree can go
- min_samples_split - The minimum number of rows a node must have before it can be split; if this is set to 13, for example, then nodes with fewer than 13 rows won't be split, and will become leaves instead
- min_samples_leaf - The minimum number of rows a leaf must have
- min_weight_fraction_leaf - The minimum fraction of the (weighted) input rows a leaf must have
- max_leaf_nodes - The maximum number of total leaves; this will cap the count of leaf nodes as the tree is being built

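Some of these parameters interact, however. For example, older scikit-learn releases ignored max_depth whenever max_leaf_nodes was set, so it's worth checking the documentation for the version in use before combining them.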
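
To see how these parameters shape a tree, the sketch below fits a few configurations on synthetic data and reports each tree's depth and leaf count. The data set and the specific parameter values are assumptions chosen for illustration, and `get_depth` and `get_n_leaves` require scikit-learn 0.21 or later.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the income data
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

settings = {
    "unrestricted": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=50": {"min_samples_leaf": 50},
    "max_leaf_nodes=8": {"max_leaf_nodes": 8},
}

shapes = {}
for name, params in settings.items():
    clf = DecisionTreeClassifier(random_state=1, **params).fit(X, y)
    # Record (depth, number of leaves) for each configuration
    shapes[name] = (clf.get_depth(), clf.get_n_leaves())
    print(name, "depth:", shapes[name][0], "leaves:", shapes[name][1])
```

Each restriction should yield a noticeably smaller tree than the unrestricted fit.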



In [69]:
# Setting min_samples_split to 13 when creating the DecisionTreeClassifier

cl = DecisionTreeClassifier(random_state=1, min_samples_split = 13)
cl.fit(train[columns],train['high_income'])

# Predicting the labels for test data set
predictions = cl.predict(test[columns])
test_auc = roc_auc_score(test['high_income'],predictions)

# Predicting the labels for train data set
predictions1 = cl.predict(train[columns])
train_auc = roc_auc_score(train['high_income'],predictions1)

print('Test AUC : ',test_auc)
print('Train AUC : ',train_auc)

Test AUC :  0.6995617145150872
Train AUC :  0.8421431849275413


By setting min_samples_split to 13, we managed to boost the test AUC from .693 to .700. The training set AUC decreased from .947 to .842, showing that the model we built was less overfit to the training set than before.

Let's add more parameters to see how the error metric changes. (Set max_depth to 7 and min_samples_split to 13 when creating the DecisionTreeClassifier.)

In [70]:
# Set max_depth to 7 and min_samples_split to 13 when creating the DecisionTreeClassifier.

cl = DecisionTreeClassifier(random_state = 1, min_samples_split = 13, max_depth = 7)
cl.fit(train[columns],train['high_income'])

# Predicting the labels for test data set
predictions = cl.predict(test[columns])
test_auc = roc_auc_score(test['high_income'],predictions)

# Predicting the labels for train data set
predictions1 = cl.predict(train[columns])
train_auc = roc_auc_score(train['high_income'],predictions1)

print('Test AUC : ',test_auc)
print('Train AUC : ',train_auc)

Test AUC :  0.7436344996725136
Train AUC :  0.748037708309209


The model no longer appears to be overfitting, because both AUC values are about the same. 
- Let's tweak the parameters more aggressively and see how the error metric changes (set max_depth to 2 and min_samples_split to 100 when creating the DecisionTreeClassifier).

In [71]:
# Set max_depth to 2 and min_samples_split to 100 when creating the DecisionTreeClassifier.

cl = DecisionTreeClassifier(random_state = 1, min_samples_split = 100, max_depth = 2)
cl.fit(train[columns],train['high_income'])

# Predicting the labels for test data set
predictions = cl.predict(test[columns])
test_auc = roc_auc_score(test['high_income'],predictions)

# Predicting the labels for train data set
predictions1 = cl.predict(train[columns])
train_auc = roc_auc_score(train['high_income'],predictions1)

print('Test AUC : ',test_auc)
print('Train AUC : ',train_auc)

Test AUC :  0.6553138481876499
Train AUC :  0.6624508042161483


Now, both AUC values came down. This happened because we are underfitting.  
- Underfitting occurs when our model is too simple to capture the relationships between the variables.
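
The experiments above all repeat the same fit, predict, and score steps. A small helper can make such comparisons less repetitive. `fit_and_score` is a hypothetical name introduced here, and the tiny `demo` frame exists only so the sketch runs on its own.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(train, test, columns, target, **tree_params):
    """Fit a tree on train and return (test_auc, train_auc)."""
    clf = DecisionTreeClassifier(random_state=1, **tree_params)
    clf.fit(train[columns], train[target])
    test_auc = roc_auc_score(test[target], clf.predict(test[columns]))
    train_auc = roc_auc_score(train[target], clf.predict(train[columns]))
    return test_auc, train_auc


# Tiny synthetic frame: y is 1 exactly when x >= 20.
demo = pd.DataFrame({"x": list(range(40)), "y": [0] * 20 + [1] * 20})
demo_train, demo_test = demo.iloc[::2], demo.iloc[1::2]

demo_scores = fit_and_score(demo_train, demo_test, ["x"], "y", max_depth=2)
print(demo_scores)
```

With the walkthrough's data, a call such as `fit_and_score(train, test, columns, "high_income", max_depth=7, min_samples_split=13)` would repeat the second experiment.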


###  The Bias-Variance Tradeoff

- By artificially restricting the depth of our tree, we prevent it from creating a model that's complex enough to correctly categorize some of the rows. 
- If we don't perform the artificial restrictions, however, the tree becomes too complex, fits quirks in the data that only exist in the training set, and doesn't generalize to new data.

This is known as the bias-variance tradeoff.
- Suppose we take many random samples of the training data and fit a model to each one. If the models' predictions for the same row are far apart from each other, we have high variance. 
- If, instead, the models' predictions for the same row are close together but far from the actual value, then we have high bias.

High bias can cause underfitting. If a model is consistently failing to predict the correct value, it may be that it's too simple to model the data faithfully.

High variance can cause overfitting. If a model varies its predictions significantly based on small changes in the input data, then it's likely fitting itself to quirks in the training data, rather than making a generalizable model.

- Decreasing one of these properties will typically increase the other. The bias-variance tradeoff affects essentially all machine learning algorithms. 

    - Decision trees typically suffer from high variance. 
    - The entire structure of a decision tree can change if we make a minor alteration to its training data. 
    - By restricting the depth of the tree, we increase the bias and decrease the variance. 
    - If we restrict the depth too much, we increase bias to the point where it will underfit.
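
The definitions above can be turned into a small experiment. The sketch below repeatedly draws noisy training samples from a known function, fits a tree to each, and measures how the predictions for one fixed row behave. It uses `DecisionTreeRegressor` so that predictions are continuous; the sine target, noise level, and sample sizes are all assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)


def true_fn(x):
    return np.sin(2 * np.pi * x)


x_query = np.array([[0.3]])        # the single "row" we predict repeatedly
target = true_fn(x_query[0, 0])    # its actual (noise-free) value


def bias_and_variance(max_depth, n_models=200, n_rows=100):
    preds = []
    for _ in range(n_models):
        # Draw a fresh noisy training sample for every model
        x = rng.uniform(0, 1, size=(n_rows, 1))
        y = true_fn(x[:, 0]) + rng.normal(0, 0.3, size=n_rows)
        model = DecisionTreeRegressor(max_depth=max_depth).fit(x, y)
        preds.append(model.predict(x_query)[0])
    preds = np.array(preds)
    bias_sq = (preds.mean() - target) ** 2   # average prediction vs. actual value
    variance = preds.var()                   # scatter of predictions across models
    return bias_sq, variance


results = {}
for depth in (1, None):   # a one-split stump vs. an unrestricted tree
    results[depth] = bias_and_variance(depth)
    print("max_depth =", depth,
          "bias^2 =", round(results[depth][0], 3),
          "variance =", round(results[depth][1], 3))
```

We would typically expect the stump to show the larger squared bias and the unrestricted tree the larger variance, though the exact numbers depend on the assumptions above.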
    

### Pruning

One way to reduce overfitting is pruning. Basically, we build the full tree and then remove (prune) the leaves that do not add to the prediction accuracy.

- Pruning prevents a model from becoming overly complex. It can result in a simpler model that has higher accuracy on the testing set.
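
scikit-learn supports one form of pruning, minimal cost-complexity pruning, through the `ccp_alpha` parameter. This assumes scikit-learn 0.22 or later, which is newer than the version whose output appears earlier in this walkthrough. The sketch below compares a full tree with a pruned one on synthetic data; the data set and the `ccp_alpha=0.005` value are assumptions for illustration, and in practice the value would be tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) so the full tree overfits
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.1,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# ccp_alpha=0 (the default) means no pruning; larger values prune more.
full = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=0.005)
pruned.fit(X_train, y_train)

for name, clf in (("full", full), ("pruned", pruned)):
    print(name, "leaves:", clf.get_n_leaves(),
          "test accuracy:", clf.score(X_test, y_test))
```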

### When to use Decision Trees

Decision trees have several advantages. They are:
- easy to interpret
- relatively fast to fit and make predictions
- able to pick up nonlinearities in data, and usually fairly accurate

The main disadvantage of using decision trees is their tendency to overfit.

- Decision trees are a good choice for tasks where it's important to be able to interpret and convey why the algorithm is doing what it's doing.