# Exercise: Decision Trees in sklearn

In this section, you'll use decision trees to fit a given sample dataset.

For your decision tree model, you'll be using scikit-learn's `DecisionTreeClassifier` class.
This class provides the methods to define the model and fit it to your data.

## Hyperparameters

When we define the model, we can specify the hyperparameters. In practice, the most common ones are:

- `max_depth`: The maximum number of levels in the tree.
- `min_samples_leaf`: The minimum number of samples allowed in a leaf.
- `min_samples_split`: The minimum number of samples required to split an internal node.

For example, here we define a model where the maximum depth of the tree, `max_depth`, is 7,
and the minimum number of samples in each leaf, `min_samples_leaf`, is 10.
```python
>>> model = DecisionTreeClassifier(max_depth=7, min_samples_leaf=10)
```
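
To make the effect of these hyperparameters concrete, here is a minimal sketch that fits the same configuration and then inspects the resulting tree. The synthetic dataset from `make_classification` and the names `X_demo`, `y_demo`, and `demo_model` are assumptions for illustration; they are not part of the quiz data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data (an assumption, not the quiz dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

demo_model = DecisionTreeClassifier(max_depth=7, min_samples_leaf=10, random_state=0)
demo_model.fit(X_demo, y_demo)

# apply() returns the index of the leaf each sample lands in,
# so counting the indices gives the number of training samples per leaf.
_, leaf_sizes = np.unique(demo_model.apply(X_demo), return_counts=True)

print("depth:", demo_model.get_depth())
print("number of leaves:", demo_model.get_n_leaves())
print("smallest leaf size:", leaf_sizes.min())
```

The fitted tree should never be deeper than `max_depth`, and no leaf should hold fewer than `min_samples_leaf` training samples.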

## Decision Tree Quiz

In this quiz, you'll be given a sample dataset, and your goal is to define a model that gives 100% accuracy on it.

Note: This quiz asks you to reach 100% accuracy on the training set, which amounts to memorizing the training data. A model tuned to reach 100% accuracy on training data is unlikely to generalize well to new data. The less you constrain the tree (a very large `max_depth`, or very small `min_samples_leaf` and `min_samples_split`), the more closely it fits the training set and the less likely it is to generalize. Try to find the most constrained setting that still does the job, for example the smallest `max_depth`; that model will be more likely to generalize well.
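
The gap between training and held-out accuracy can be seen in a small sketch. It assumes a synthetic noisy dataset from `make_moons` and a hypothetical 75/25 split; neither comes from the quiz, which only uses training data.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (an assumption, not the quiz dataset).
X_demo, y_demo = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# With default hyperparameters the tree grows until every leaf is pure.
unconstrained = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, unconstrained.predict(X_train))
test_acc = accuracy_score(y_test, unconstrained.predict(X_test))
print(f"training accuracy: {train_acc}")
print(f"held-out accuracy: {test_acc}")
```

The unconstrained tree memorizes the training points, so its training accuracy is 1.0, while the held-out accuracy on noisy data like this is typically lower.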


```python
# Import statements
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
```

```python
# Read the data.
data = np.asarray(pd.read_csv('data_1.csv', header=None))
# Assign the features to the variable X, and the labels to the variable y.
X = data[:, 0:2]
y = data[:, 2]
```

```python
# Create the decision tree model and assign it to the variable model.
max_depth = 7
min_samples_split = 5
min_samples_leaf = 10
model = DecisionTreeClassifier(max_depth=max_depth,
                               min_samples_leaf=min_samples_leaf,
                               min_samples_split=min_samples_split)
```

```python
# Fit the model.
# Fitting the model means building a tree that fits the training data.
model.fit(X, y)
```

```python
# Make predictions. Store them in the variable y_pred.
y_pred = model.predict(X)

# Make another prediction on two new points.
print(model.predict([[0.2, 0.8], [0.5, 0.4]]))
```

```
[0. 1.]
```


```python
# Calculate the accuracy and assign it to the variable acc.
acc = accuracy_score(y, y_pred)
print(f"[INFO] The accuracy for \n       max_depth={max_depth} \n       min_samples_split={min_samples_split} \n       min_samples_leaf={min_samples_leaf} is: {acc} ")
```

```
[INFO] The accuracy for 
       max_depth=7 
       min_samples_split=5 
       min_samples_leaf=10 is: 0.8333333333333334 
```
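
One way to look for the most constrained tree that still reaches 100% training accuracy is to increase `max_depth` step by step and stop at the first perfect score. The sketch below does this on synthetic data from `make_moons`, which is an assumption standing in for `data_1.csv`; the names `X_demo`, `y_demo`, `candidate`, and `best_depth` are illustrative. It leaves `min_samples_leaf` and `min_samples_split` at their defaults, since larger values can prevent the tree from isolating every training point.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the quiz data (an assumption).
X_demo, y_demo = make_moons(n_samples=200, noise=0.25, random_state=0)

best_depth = None
# A tree on n samples can never be deeper than n - 1 levels.
for depth in range(1, len(X_demo)):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    candidate.fit(X_demo, y_demo)
    candidate_acc = accuracy_score(y_demo, candidate.predict(X_demo))
    print(f"max_depth={depth}: training accuracy={candidate_acc}")
    if candidate_acc == 1.0:
        best_depth = depth
        break

print(f"smallest max_depth with 100% training accuracy: {best_depth}")
```

The same loop could be run on `X` and `y` from the quiz; the depth it finds depends on the dataset.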
