# tree vs. forest
A single decision tree is prone to overfitting and sensitive to outliers and missing values. Random forests address these issues by building many decision trees, each trained on a bootstrapped sample of the original data and restricted to a random subset of the features at each split. Each tree gets a "vote" when predicting an outcome. While a random forest generally makes more accurate and more generalizable predictions than a single decision tree, it can be much more computationally expensive. Let's test whether this holds using the Iris dataset that ships with scikit-learn.
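To make the bootstrap-and-vote idea concrete, here is a minimal hand-rolled sketch. It is not how the notebook trains its forest: the ensemble size `n_trees` and the variable names are hypothetical, and scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities rather than counting hard votes as done here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

n_trees = 25  # hypothetical ensemble size, chosen only for illustration
trees = []
for _ in range(n_trees):
    # bootstrap: draw as many rows as the dataset has, with replacement
    rows = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes the tree consider a random subset of features at each split
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X[rows], y[rows]))

# every tree votes on every sample; the most common class wins
votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (n_trees, n_samples)
majority = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

print("Ensemble accuracy on the training data:", (majority == y).mean())
```

Because each tree sees a different resample and different candidate features, their individual errors tend to differ, which is what lets the vote smooth them out.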

## dataset
The [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) has 150 examples of Iris flowers, 50 from each of 3 species, and 4 features (sepal length, sepal width, petal length, and petal width) that can help predict the species.
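A quick way to confirm the number of samples, the features, and the number of flowers per species is to inspect the object returned by `load_iris`. This is a small standalone check, separate from the cells that follow:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# (number of samples, number of features)
print(iris.data.shape)

# names of the measured features
print(iris.feature_names)

# number of samples per species
counts = np.bincount(iris.target)
print(dict(zip(iris.target_names, counts.tolist())))
```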

In [36]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text  # sklearn.tree.export was removed in newer scikit-learn releases
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import warnings

warnings.filterwarnings("ignore")

In [37]:
# load data
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

In [38]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## decision tree

### train model

In [45]:
%%time

decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(X_train, y_train)

CPU times: user 800 µs, sys: 302 µs, total: 1.1 ms
Wall time: 923 µs
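The notebook imports `export_text` but never calls it. As a sketch of how it could be used, the following standalone cell refits the same depth-2 tree and prints its split rules as text. The variable name `rules` is an assumption made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

# same settings as the tree trained above
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree.fit(X_train, y_train)

# one line per split or leaf, indented by depth
rules = export_text(decision_tree, feature_names=list(iris.feature_names))
print(rules)
```

With `max_depth=2` the printout stays short enough to read in full, which is one practical advantage a single tree has over a forest.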


### accuracy

In [46]:
%%time

# make predictions
y_pred = decision_tree.predict(X_test)

# measure accuracy
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix: \n{}\n".format(cm))
print(
    "Accuracy is {} for a single decision tree.".format(accuracy_score(y_test, y_pred))
)

Confusion matrix: 
[[11  0  0]
 [ 0 13  0]
 [ 0  1  5]]

Accuracy is 0.9666666666666667 for a single decision tree.
CPU times: user 1.47 ms, sys: 435 µs, total: 1.9 ms
Wall time: 1.5 ms


## random forest

### train model

In [47]:
%%time

random_forest = RandomForestClassifier(max_depth=2, random_state=0)
random_forest = random_forest.fit(X_train, y_train)

CPU times: user 10.3 ms, sys: 2.28 ms, total: 12.6 ms
Wall time: 10.8 ms
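A single `%%time` measurement in the millisecond range is noisy. One way to get a steadier comparison, sketched here outside the notebook's flow, is to average several fits with `time.perf_counter`. The helper `mean_fit_seconds` and the repeat count are hypothetical, and for brevity the models are fit on the full dataset rather than on `X_train`.

```python
import time

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def mean_fit_seconds(make_model, repeats=20):
    """Average wall-clock seconds to fit a freshly built model `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        make_model().fit(X, y)
    return (time.perf_counter() - start) / repeats


tree_s = mean_fit_seconds(lambda: DecisionTreeClassifier(max_depth=2, random_state=0))
forest_s = mean_fit_seconds(lambda: RandomForestClassifier(max_depth=2, random_state=0))

print(f"decision tree: {tree_s * 1e3:.2f} ms per fit")
print(f"random forest: {forest_s * 1e3:.2f} ms per fit")
```

Absolute numbers depend on the machine and the scikit-learn version, so only the ratio between the two is worth comparing.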


### accuracy

In [48]:
%%time

# make predictions
y_pred = random_forest.predict(X_test)

# measure accuracy
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix: \n{}\n".format(cm))
print("Accuracy is {} for a random forest.".format(accuracy_score(y_test, y_pred)))

Confusion matrix: 
[[11  0  0]
 [ 0 13  0]
 [ 0  2  4]]

Accuracy is 0.9333333333333333 for a random forest.
CPU times: user 3.72 ms, sys: 1.05 ms, total: 4.77 ms
Wall time: 4.03 ms
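The test set holds only 30 flowers, so the two accuracies above differ by a single misclassified sample. A less split-dependent comparison could use cross-validation, as in the sketch below. It is not part of the original analysis, and the choice of 10 stratified folds is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10 stratified folds: each fold keeps the three species balanced
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=2, random_state=0),
    "random forest": RandomForestClassifier(max_depth=2, random_state=0),
}

scores = {}
for name, model in models.items():
    # one accuracy value per fold
    scores[name] = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {scores[name].mean():.3f} (std {scores[name].std():.3f})")
```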


## summary
In this case, with only 150 samples and 4 features in the dataset, a random forest is no more accurate than a single decision tree; if anything, it is slightly less accurate (93.3% vs. 96.7% accuracy) on the Iris data, a gap of one misclassified flower in a 30-flower test set. The forest also took roughly ten times as long to train (10.8 ms vs. 923 µs wall time). As 