# MATH 3375 Examples Notebook #17

# Decision Trees

We explore another algorithm for classification (predicting a categorical response), using the **mtcars** and **iris** data sets.

For the first example (**mtcars**), our response variable is again **am**, the transmission type, where 0 = automatic and 1 = manual.


In [None]:
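#Uncomment to install the packages if they are not already available
#install.packages("rpart")
#install.packages("rpart.plot")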
library(rpart)
library(rpart.plot)

In [None]:
#Look at data set
head(mtcars)

## Decision Tree with Default Configuration Parameters 

Our first model is a decision tree created using the **rpart** library, with default values for all parameters.

In [None]:
#Create model
tran_model_tree_01 <- rpart(am~.,data=mtcars,method="class")

#Print model summary
print(tran_model_tree_01)

#View model sketch and show group sizes
par(xpd=NA)
plot(tran_model_tree_01)
text(tran_model_tree_01,use.n=TRUE)

### A More Refined Plot

The plot below is easier to read, but does not show the precise count of data points in each group.  However, it does provide:

* Branch labels (yes/no)
* Majority class at each node
* Proportion of each node that has class 1 (manual transmission)
* Proportion of overall data set at each node

In [None]:
rpart.plot(tran_model_tree_01)

### Surprisingly Simple?

This model uses only one variable and makes a very simple comparison. Yet the final nodes are not "pure" (containing records of only one class). Is this the best we can do?
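
One way to check the purity claim is to count how many cars of each transmission type land in each terminal node. The sketch below refits the same default tree under a new, illustrative name (`fit_default`) so that it runs on its own, and uses the `where` component of the fitted object to identify each car's leaf.

```r
library(rpart)

# Refit the default tree (same call as tran_model_tree_01 above)
fit_default <- rpart(am ~ ., data = mtcars, method = "class")

# fit_default$where gives, for each car, the row of fit_default$frame
# that holds the terminal node (leaf) the car falls into
leaf_counts <- table(leaf = fit_default$where, am = mtcars$am)
leaf_counts
```

A leaf is pure when its row in this table has a nonzero count in only one column.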

The default parameters for a decision tree keep the tree from adding splits that are supported by very little data and are therefore likely to be overfit to the training data. We will look at three parameters that control how far the tree is allowed to grow (how much it is "pruned").

* **cp** - Complexity Parameter - Minimum improvement in fit that must be accomplished to warrant splitting a node (default 0.01)
* **minsplit** - Minimum number of data points in a node before it can be split (default 20)
* **minbucket** - Minimum number of data points in a terminal node ('leaf') (default is one third of **minsplit**, rounded)

For more information on the parameters that govern tree construction, view the documentation for rpart by running the cells below.

In [None]:
?rpart

In [None]:
?rpart.control
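
The default values of these parameters can also be read directly from `rpart.control()`, which returns the control settings as a list. This is a small sketch; `ctrl` is just an illustrative name.

```r
library(rpart)

# rpart.control() with no arguments returns the default settings
ctrl <- rpart.control()

ctrl$minsplit   # 20
ctrl$minbucket  # 7, which is round(minsplit / 3)
ctrl$cp         # 0.01

# When only minsplit is supplied, minbucket is derived from it
rpart.control(minsplit = 2)$minbucket  # 1, which is round(2 / 3)
```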

### An Un-Pruned Tree

By setting **minsplit** and **minbucket** to their smallest values, we can let the tree keep splitting instead of stopping early. The resulting tree is very likely to be overfit, but it is useful for getting a better idea of how trees work and how the parameters affect the tree. Notice that the unpruned tree continues until every terminal node is pure (every 'leaf' contains only one class of data points).

In [None]:
#Create model
tran_model_tree_02 <- rpart(am~.,data=mtcars,method="class",minsplit=2,minbucket=1)

print(tran_model_tree_02)

#View model sketch and show group sizes
par(xpd=NA)
plot(tran_model_tree_02)
text(tran_model_tree_02,use.n=TRUE)

In [None]:
rpart.plot(tran_model_tree_02)
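
Purity can also be checked numerically rather than read off the plot. The sketch below refits the same tree under an illustrative name (`fit_unpruned`) and inspects the `frame` component of the fitted object, which has one row per node. For a classification tree, the `dev` column counts the data points in a node that do not belong to its majority class, so a pure leaf has `dev` equal to 0.

```r
library(rpart)

# Same call as tran_model_tree_02 above
fit_unpruned <- rpart(am ~ ., data = mtcars, method = "class",
                      minsplit = 2, minbucket = 1)

# Keep only the terminal nodes; rpart marks them with var == "<leaf>"
leaves <- fit_unpruned$frame[fit_unpruned$frame$var == "<leaf>", ]

# n = cars in the leaf, dev = cars not in the leaf's majority class,
# yval = index of the predicted class
leaves[, c("n", "dev", "yval")]

# TRUE only if every leaf is pure
all(leaves$dev == 0)
```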

## Second Example: Iris Species

We examine a non-binary classification example with the **iris** data set, where _Species_ has 3 possible values.

This time, we hold out 5 rows that we can use for testing. We will train the first tree with the remaining 145 rows.

The first model is pruned with default parameter settings in **rpart**.

In [None]:
test_rows <- c(14,23,80,119,123)
iris_test <- iris[test_rows,]
iris_train <- iris[-test_rows,]

#Create model
iris_model_tree_01 <- rpart(Species~.,data=iris_train,method="class")

print(iris_model_tree_01)

#View model sketch and show group sizes
par(xpd=NA)
plot(iris_model_tree_01)
text(iris_model_tree_01,use.n=TRUE)

In [None]:
rpart.plot(iris_model_tree_01)

### Reduce Pruning

We will train the remaining trees with the full data set. Because the held-out rows are then part of the training data, we will NOT test these models.

The examples below illustrate levels of pruning different from the default. First we use the **minsplit** and **minbucket** parameters to increase splitting (reduce pruning).

In [None]:
iris_model_tree_02 <- rpart(Species~.,data=iris,method="class",minsplit=2,minbucket=1)
rpart.plot(iris_model_tree_02)

### Complexity Parameter

This time, notice that even with **minsplit** and **minbucket** set to the lowest possible values, the terminal nodes are not all pure. This is because the _complexity parameter_ prevents a node from splitting if the split does not improve the model fit enough. The default complexity parameter is 0.01: a split is kept only if it decreases the overall lack of fit (for a classification tree, the misclassification error measured relative to the root node) by at least that factor. Below, we show what happens if we reduce this value.

In [None]:
iris_model_tree_03 <- rpart(Species~.,data=iris,method="class",minsplit=2,minbucket=1,cp=0.001)
rpart.plot(iris_model_tree_03)
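
To see how the complexity parameter relates to the number of splits, it can help to look at the `cptable` component of the fitted model. The sketch below refits the same tree under an illustrative name (`fit_low_cp`) so that it runs on its own.

```r
library(rpart)

# Same call as iris_model_tree_03 above
fit_low_cp <- rpart(Species ~ ., data = iris, method = "class",
                    minsplit = 2, minbucket = 1, cp = 0.001)

# One row per candidate tree size:
#   CP        = complexity parameter value at which this size is reached
#   nsplit    = number of splits in the tree
#   rel error = training error relative to the root node (root = 1)
#   xerror    = cross-validated error (varies from run to run)
#   xstd      = standard error of xerror
fit_low_cp$cptable
```

The `rel error` column shows how much each additional group of splits reduces the training error. The `xerror` column comes from random cross-validation folds, so its values change between runs unless a seed is set.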

### Completely Un-Pruned Tree

By reducing the complexity parameter threshold, we have prevented any pruning of the tree, and every terminal node is pure. 

The tree above is almost certainly overfit to the particular data set that we used to train it.
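
As a rough comparison of the three levels of pruning, the sketch below counts the terminal nodes in each tree. All three models are fit to the full **iris** data here so that the counts are comparable (the first model above was trained on 145 rows); `count_leaves` and `fits` are illustrative names, not part of **rpart**.

```r
library(rpart)

# Helper (hypothetical name): number of terminal nodes in a fitted tree
count_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

fits <- list(
  default      = rpart(Species ~ ., data = iris, method = "class"),
  low_minsplit = rpart(Species ~ ., data = iris, method = "class",
                       minsplit = 2, minbucket = 1),
  low_cp       = rpart(Species ~ ., data = iris, method = "class",
                       minsplit = 2, minbucket = 1, cp = 0.001)
)

# Number of leaves for each setting
leaf_totals <- sapply(fits, count_leaves)
leaf_totals
```

More leaves means the tree fits the training data more closely, which is where the risk of overfitting comes from.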

## Using Decision Trees for Prediction

Using the test set that we set aside when creating the first model, we will see how the trees can be used for prediction.

In [None]:
iris_test


In [None]:
#By default, predict() returns a matrix of class probabilities (one column per species)
test_pred <- predict(iris_model_tree_01,iris_test)
test_pred
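
The probability matrix can be inspected a little further. The sketch below rebuilds the same train/test split and default tree under illustrative names (`fit_iris`, `prob`), confirms that each row of probabilities sums to 1, and reads off the most probable species in each row using base R's `max.col()`.

```r
library(rpart)

# Same hold-out rows and default model as above
test_rows <- c(14, 23, 80, 119, 123)
fit_iris <- rpart(Species ~ ., data = iris[-test_rows, ], method = "class")

# Matrix with one row per test flower and one column per species
prob <- predict(fit_iris, iris[test_rows, ])

# Each row is a probability distribution over the three species
rowSums(prob)

# Species with the highest probability in each row
colnames(prob)[max.col(prob, ties.method = "first")]
```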

### Predicting Class Directly

In [None]:
test_pred <- predict(iris_model_tree_01,iris_test,type="class")
test_pred
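
A natural next step is to compare the predicted classes with the true species of the held-out rows. The sketch below is self-contained, so it repeats the split and model fit under illustrative names (`fit_iris`, `pred`); with only 5 test rows, the resulting accuracy is a very rough estimate.

```r
library(rpart)

# Same hold-out rows and default model as above
test_rows <- c(14, 23, 80, 119, 123)
iris_test <- iris[test_rows, ]
iris_train <- iris[-test_rows, ]
fit_iris <- rpart(Species ~ ., data = iris_train, method = "class")

# Predicted class labels for the held-out rows
pred <- predict(fit_iris, iris_test, type = "class")

# Confusion matrix: rows are predictions, columns are true species
table(predicted = pred, actual = iris_test$Species)

# Proportion of held-out rows classified correctly
accuracy <- mean(pred == iris_test$Species)
accuracy
```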

## Suggestion

Explore decision trees further on your own using any data set that has a binary or categorical response variable.