## Chapter 6 - Decision Trees

### Setup

In [None]:
import matplotlib as mpl
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

### Solutions

**1** What is the approximate depth of a decision tree trained (without restrictions) on a training set with one million instances?

A decision tree trained without restrictions keeps splitting until every leaf is pure, which typically means about one leaf per training instance, so roughly $m$ leaves, where $m$ is the number of training instances. \
A well-balanced binary tree of depth $k$ has $2^k$ leaves, so setting $m = 2^k$ and taking the base-2 logarithm of both sides gives $k = \log_2(m)$. For one million instances, $\log_2(10^6) \approx 19.9$, so the depth is about 20. In practice the tree is rarely perfectly balanced, so the actual depth can be somewhat larger.
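The arithmetic above can be checked in a couple of lines. This is only a sketch of the balanced-tree estimate; the variable names `m` and `depth` are illustrative and not part of the original notebook.

```python
import math

m = 1_000_000  # number of training instances

# Depth of a perfectly balanced binary tree with m leaves, rounded up
depth = math.ceil(math.log2(m))

print(f"log2(m) = {math.log2(m):.2f}, estimated depth = {depth}")
```

The estimate is a lower bound in spirit: a tree grown on real data is usually not perfectly balanced.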

### 7: Train and fine-tune a Decision Tree for the moons dataset by following these steps:

a. Use `make_moons(n_samples=10000, noise=0.4)` to generate a moons dataset.

In [None]:
X_moons, y_moons = make_moons(n_samples=10000, noise=0.4, random_state=0)

In [None]:
# Visualise the moons dataset
plt.scatter(X_moons[:,0], X_moons[:,1], c=y_moons, cmap='Paired', s=2)
plt.title('Moons dataset')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()

b. Use `train_test_split()` to split the dataset into a training set and a test set.

In [None]:
X_moons_train, X_moons_test, y_moons_train, y_moons_test = train_test_split(
    X_moons, y_moons, test_size=0.2, random_state=0)
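As a quick sanity check on the split, the shapes and class counts of the two sets can be inspected. The sketch below regenerates the data so that it runs on its own; the shorter names (`X_train`, `train_counts`, and so on) are illustrative stand-ins for the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Same dataset and split parameters as in the cells above
X, y = make_moons(n_samples=10000, noise=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# test_size=0.2 should leave 80% of the instances for training
print("train:", X_train.shape, "test:", X_test.shape)

# Number of instances per class in each set
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train class counts:", train_counts)
print("test class counts:", test_counts)
```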

c. Use grid search with cross-validation (with the help of the `GridSearchCV` class) to find good hyperparameter values for a `DecisionTreeClassifier`. \
Hint: try various values for `max_leaf_nodes`.

In [None]:
# Cross-validated search over max_leaf_nodes = 2..99 (cv is left at its default)
grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       param_grid={'max_leaf_nodes': list(range(2, 100))},
                       verbose=1, n_jobs=-1)
grid_dt.fit(X_moons_train, y_moons_train)

In [None]:
grid_dt.cv_results_
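The raw `cv_results_` dictionary is hard to read at a glance. A minimal sketch of one way to summarise it is shown below, ranking the candidates by mean cross-validation accuracy and then scoring the refitted best estimator on the test set. To keep it self-contained and fast, it rebuilds the data and uses a coarser grid (`range(2, 100, 7)`) than the cell above; that grid, the explicit `cv=5`, and the names `search`, `ranked`, and `test_accuracy` are assumptions made for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Assumption: a coarser grid than the notebook's, only to keep this sketch quick
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_leaf_nodes': list(range(2, 100, 7))},
                      cv=5)
search.fit(X_train, y_train)

# Pair each candidate's mean CV accuracy with its parameters, best first
results = search.cv_results_
ranked = sorted(zip(results['mean_test_score'], results['params']),
                key=lambda pair: pair[0], reverse=True)
for score, params in ranked[:5]:
    print(f"max_leaf_nodes={params['max_leaf_nodes']:>2}: mean CV accuracy {score:.4f}")

# The best candidate is refitted on the whole training set (refit=True by default)
print("best params:", search.best_params_)
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.4f}")
```

With the notebook's own objects, the same attributes are available as `grid_dt.best_params_`, `grid_dt.best_score_`, and `grid_dt.best_estimator_`.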