# Decision Trees in Practice

This notebook explores techniques for preventing overfitting in decision trees.
It extends the binary decision tree implementation from the previous assignment.

In this notebook, we
* Implement binary decision trees with different Early Stopping methods.
* Compare models with different stopping parameters.
* Visualize the concept of overfitting in decision trees.

In [1]:
import turicreate

### Loading the LendingClub Dataset

This assignment will use the [LendingClub](https://www.lendingclub.com/) dataset used in the previous two assignments.

In [3]:
loans = turicreate.SFrame('../data/lending-club-data.sframe/')

In [None]:
# Create `safe_loans` column to have +1 for a safe loan, and -1 for a risky (bad) loan.
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)

# Remove the bad_loans column
loans = loans.remove_column('bad_loans')

We will use the same four categorical features as in the previous assignment:
1. `grade`, the grade of the loan
2. `term`, the length of the loan term
3. `home_ownership`, the home ownership status: own, mortgage, or rent
4. `emp_length`, the number of years of employment

As in the previous notebook, we convert these features to binary data using 1-hot encoding.
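
To make the encoding concrete, here is a minimal plain-Python sketch that does not depend on turicreate. The helper `one_hot_encode` and the toy rows are hypothetical; only the `feature.value` column naming is taken from the column names shown later in this notebook.

```python
# Plain-Python sketch of 1-hot encoding (illustrative; not the turicreate API).
def one_hot_encode(rows, feature):
    """Replace `feature` in each row dict with one 0/1 column per observed value."""
    values = sorted({row[feature] for row in rows})
    encoded = []
    for row in rows:
        # Keep every column except the categorical one being encoded.
        new_row = {k: v for k, v in row.items() if k != feature}
        for value in values:
            new_row[f"{feature}.{value}"] = 1 if row[feature] == value else 0
        encoded.append(new_row)
    return encoded

# Toy data, made up for this example.
toy_rows = [
    {"grade": "A", "safe_loans": 1},
    {"grade": "B", "safe_loans": -1},
    {"grade": "A", "safe_loans": -1},
]
encoded_rows = one_hot_encode(toy_rows, "grade")
print(encoded_rows[0])
```

Each row ends up with exactly one `1` among the columns generated for a given feature, which is what lets the tree treat every split as a binary test.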

In [6]:
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'
loans = loans[features + [target]]

### Balancing class distribution in the dataset

In [7]:
safe_loans_all = loans[loans[target] == 1]
risky_loans_all = loans[loans[target] == -1]

ratio_risky_to_safe = len(risky_loans_all)/float(len(safe_loans_all))

safe_loans = safe_loans_all.sample(ratio_risky_to_safe, seed = 1)

risky_loans = risky_loans_all

# Merge both into a single sample
loans_data = risky_loans.append(safe_loans)

print("Percentage of safe loans                 :", len(safe_loans) / float(len(loans_data)))
print("Percentage of risky loans                :", len(risky_loans) / float(len(loans_data)))
print("Total number of loans in our new dataset :", len(loans_data))

Percentage of safe loans                 : 0.5022361744216048
Percentage of risky loans                : 0.4977638255783951
Total number of loans in our new dataset : 46508
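
The balancing step above undersamples the majority class. As a rough plain-Python illustration of the idea, the sketch below keeps all minority rows and an equally sized random subset of the majority rows. The helper `undersample` and the toy data are hypothetical. Unlike `SFrame.sample`, which draws an approximate fraction of rows (hence the roughly 50/50 split printed above), this variant draws an exact count.

```python
import random

def undersample(majority_rows, minority_rows, seed=1):
    """Return the minority rows plus an equally sized random subset of the majority rows."""
    rng = random.Random(seed)  # local generator, so the result is reproducible
    kept_majority = rng.sample(majority_rows, len(minority_rows))
    return minority_rows + kept_majority

# Toy data, made up for this example: 8 safe loans and 3 risky loans.
toy_safe = [{"id": i, "safe_loans": +1} for i in range(8)]
toy_risky = [{"id": i, "safe_loans": -1} for i in range(8, 11)]

balanced = undersample(toy_safe, toy_risky)
n_safe = sum(1 for row in balanced if row["safe_loans"] == +1)
n_risky = len(balanced) - n_safe
print(n_safe, n_risky)
```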


### Transforming categorical data into binary features

In [8]:
# For every feature of our working data:
for feature in features:
    
    # Convert the values in the column into single-key dictionaries 
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})  #SArray
    
    # Use the SArray.unpack method to convert into an SFrame with a column for each value:
    # an SArray of dictionaries is expanded into as many columns as there are unique keys.
    # We use the original feature's name as a prefix.
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature) # SFrame

    
    # Replace the default None values of the unpacked columns with 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)
    
    # Remove the original feature column and add the unpacked columns to the working data SFrame
    loans_data = loans_data.remove_column(feature)
    loans_data = loans_data.add_columns(loans_data_unpacked)

The feature columns now look like this:

In [9]:
features = loans_data.column_names()
features.remove('safe_loans')  # Remove the response variable
features

['grade.A',
 'grade.B',
 'grade.C',
 'grade.D',
 'grade.E',
 'grade.F',
 'grade.G',
 'term. 36 months',
 'term. 60 months',
 'home_ownership.MORTGAGE',
 'home_ownership.OTHER',
 'home_ownership.OWN',
 'home_ownership.RENT',
 'emp_length.1 year',
 'emp_length.10+ years',
 'emp_length.2 years',
 'emp_length.3 years',
 'emp_length.4 years',
 'emp_length.5 years',
 'emp_length.6 years',
 'emp_length.7 years',
 'emp_length.8 years',
 'emp_length.9 years',
 'emp_length.< 1 year',
 'emp_length.n/a']

### Splitting Train-Validation

In [10]:
train_data, validation_set = loans_data.random_split(.8, seed=1)

# Early stopping methods for decision trees

In the lecture, we discussed three early stopping methods:

1. Reached a **maximum depth**. (set by parameter `max_depth`). _Note: This one has been implemented in the previous assignment._
2. Reached a **minimum node size**. (set by parameter `min_node_size`).
3. Don't split if the **gain in error reduction** is too small. (set by parameter `min_error_reduction`).

For the rest of the notebook, we will refer to these three as **early stopping conditions 1, 2, and 3**.
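
The three conditions can be summarized as a single check. The sketch below is only illustrative: `early_stopping_reason` is a hypothetical helper, not part of the assignment code, and its defaults simply mirror those of `buildTreeEarlyStopping`. In the actual tree builder, condition 3 can only be evaluated after the best split has been found, because the error gain depends on that split.

```python
def early_stopping_reason(current_depth, n_points, error_gain,
                          max_depth=10, min_node_size=1, min_error_reduction=0.0):
    """Return the name of the first early stopping condition that fires, or None."""
    # Condition 1: the node is already at the maximum allowed depth.
    if current_depth >= max_depth:
        return "max_depth"
    # Condition 2: the node holds too few data points to split.
    if n_points <= min_node_size:
        return "min_node_size"
    # Condition 3: the best split does not reduce the error enough.
    if error_gain <= min_error_reduction:
        return "min_error_reduction"
    return None

print(early_stopping_reason(6, 500, 0.10, max_depth=6))
print(early_stopping_reason(2, 9, 0.10, min_node_size=10))
print(early_stopping_reason(2, 500, 0.00))
print(early_stopping_reason(2, 500, 0.05))
```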



### Early stopping condition 2: Minimum node size

The function `notEnoughData` returns `True` when a node holds at most `min_node_size` data points, i.e. when the node is too small to split further.
This function will be used to detect this early stopping condition in the `decision_tree_create` function.

In [17]:
'''
Function that returns True if the number of data points is less than or equal to the Minimum Node Size parameter.
'''
def notEnoughData(data, min_node_size):
    """
    * data - the observations present in the node
    * min_node_size - the minimum number of data points required for any node
    """
    return len(data) <= min_node_size
    

**Quiz Question:** Given an intermediate node with 6 safe loans and 3 risky loans, if the `min_node_size` parameter is 10, what should the tree learning algorithm do next? `A: stop right there`

### Early stopping condition 3: Minimum gain in error reduction

The function `errorGain` will be used to detect this early stopping condition in the `decision_tree_create` function.

In [63]:
'''
Compute the gain in error reduction: the difference between classification error before and after the split.

'''
def errorGain(error_before_split, error_after_split):
    return error_before_split - error_after_split

**Quiz Question:** Assume an intermediate node has 6 safe loans and 3 risky loans.  For each of 4 possible features to split on, the error reduction is 0.0, 0.05, 0.1, and 0.14, respectively. If the **minimum gain in error reduction** parameter is set to 0.2, what should the tree learning algorithm do next?

`A: Stop right there`
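
Worked through in code, the quiz above comes down to comparing the best available error reduction with the threshold. The variable names are illustrative; the `<=` comparison mirrors the one used in `buildTreeEarlyStopping`.

```python
# Error reductions of the four candidate features, taken from the quiz question.
error_reductions = [0.0, 0.05, 0.1, 0.14]
min_error_reduction = 0.2

# The tree always considers the feature with the largest reduction.
best_reduction = max(error_reductions)

# Stop (create a leaf) when even the best split does not clear the threshold.
should_stop = best_reduction <= min_error_reduction
print(best_reduction, should_stop)
```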

### Decision Tree Utility functions from past assignments

In [64]:
'''
Computes the number of prediction mistakes at an intermediate node of a Decision Tree 
making predictions based on Majority Class.
'''
def nodeMistakes(labels):
    """Compute and return the number of misclassified examples of an intermediate node,
        given the array of labels of the data points in that node."""
    # Edge case: empty node
    if len(labels) == 0:
        return 0
    
    # Count the number of 1's (safe loans)
    n_safe = (labels == +1).sum()
    
    # Count the number of -1's (risky loans)
    n_risk = (labels == -1).sum()
                
    # The majority classifier labels all datapoints in the node with the most popular label.
    # Number of mistakes is equal to the observations not in the majority class.
    return min(n_safe, n_risk)
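
As a quick sanity check of the majority-class logic, here is a plain-list equivalent applied to the node from the quiz questions (6 safe loans, 3 risky loans). `node_mistakes_list` is a hypothetical stand-in that works on Python lists; `nodeMistakes` above expects an array type that supports element-wise comparison.

```python
def node_mistakes_list(labels):
    """Mistakes made by a majority-class prediction on a plain list of +1/-1 labels."""
    n_safe = sum(1 for y in labels if y == +1)
    n_risky = sum(1 for y in labels if y == -1)
    # Every point outside the majority class is misclassified.
    return min(n_safe, n_risky)

quiz_labels = [+1] * 6 + [-1] * 3
quiz_mistakes = node_mistakes_list(quiz_labels)
quiz_error = quiz_mistakes / len(quiz_labels)
print(quiz_mistakes, quiz_error)
```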

In [65]:
'''
Loops through the list of available features, 
computes classification error based on would-be feature split
and returns the feature with smallest classification error after split
'''
def bestFeatureSplit(data, features, target):
    """
    Receives:
    * data - SFrame that includes all the feature columns and label column
    * features - list of strings of column names to consider for splits
    * target - string for the name of target label column
    Returns: 
      string for the name of next best splitting feature
    
    """
    best_feature = features[0] 
    best_error = 10     # Error is always <= 1, so we initialize it with something larger than 1.

    N = float(len(data))
    
    # For every possible feature split
    for feature in features:
        
        # feature value = 0
        left_split = data[data[feature] == 0]
        
        # feature value = 1
        right_split =  data[data[feature] == 1]
            
        # Count number of misclassified examples
        left_mistakes = nodeMistakes(labels=left_split[target])           
        right_mistakes = nodeMistakes(labels=right_split[target])
            
        # Classification error of current split
        error = (left_mistakes + right_mistakes) / N 

        if error < best_error:
            best_feature = feature
            best_error = error
        
    
    return best_feature # Return the best feature found
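
The following sketch replays the same selection rule on a tiny hand-made dataset so the error computation can be followed by hand. Everything here (`split_error`, the feature names `f1` and `f2`, the toy rows) is made up for illustration and uses plain lists of dicts rather than an SFrame.

```python
def majority_mistakes(labels):
    """Number of labels that disagree with the majority class."""
    n_pos = sum(1 for y in labels if y == +1)
    return min(n_pos, len(labels) - n_pos)

def split_error(rows, feature, target):
    """Classification error after splitting `rows` on a binary `feature`."""
    left = [r[target] for r in rows if r[feature] == 0]
    right = [r[target] for r in rows if r[feature] == 1]
    return (majority_mistakes(left) + majority_mistakes(right)) / len(rows)

# f1 separates the classes perfectly; f2 carries no information.
toy_rows = [
    {"f1": 0, "f2": 0, "y": +1},
    {"f1": 0, "f2": 1, "y": +1},
    {"f1": 1, "f2": 0, "y": -1},
    {"f1": 1, "f2": 1, "y": -1},
]
split_errors = {f: split_error(toy_rows, f, "y") for f in ["f1", "f2"]}
best_toy_feature = min(split_errors, key=split_errors.get)
print(split_errors, best_toy_feature)
```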

In [66]:
'''
Creates a LEAF node given the array of target values of the data so as to set its prediction label.
'''
def createLeaf(labels):
    
    # Create a leaf node dictionary
    leaf = {
            'splitting_feature' : None,
            'left' : None,
            'right' : None,
            'is_leaf':  True
            }
    
    # Count the number of datapoints from each class in this node
    num_plus = len(labels[labels == +1])
    num_minus = len(labels[labels == -1])
    
    # Set the prediction to be the majority class.
    if num_plus > num_minus:
        leaf['prediction'] = +1         
    else:
        leaf['prediction'] = -1        

        
    return leaf 

## Incorporating Early Stopping conditions in binary decision tree implementation

In [69]:
'''
Recursive function to build a decision tree that consists of dictionary nodes; 
given the data, list of features, target name, current tree depth and maximum depth limit.
It is an extension of `buildTree` that implements early stopping conditions.
'''
def buildTreeEarlyStopping(data, features, target, current_depth = 0, 
                         max_depth = 10, min_node_size=1, 
                         min_error_reduction=0.0):
    
    remaining_features = features[:] # Shallow copy of the features.
    
    labels = data[target]
    print("--------------------------------------------------------------------")
    print("Subtree, depth = %s (%s data points)." % (current_depth, len(labels)))

    
    # Stopping condition 1: All data points in the node have the same class.
    if nodeMistakes(labels) == 0:
        print("Stopping condition 1 reached. All nodes same class.")
        return createLeaf(labels)

    # Stopping condition 2: No more features to split on.
    if remaining_features == []:
        print("Stopping condition 2 reached. No more features.")
        return createLeaf(labels)
    
    # Early stopping condition 1: Reached max depth limit.
    if current_depth >= max_depth:
        print("Early stopping condition 1 reached. Reached maximum depth.")
        return createLeaf(labels)
    
    # Early stopping condition 2: Reached the minimum node size.
    if notEnoughData(data, min_node_size):
        print("Early stopping condition 2 reached. Reached minimum node size.")
        return createLeaf(labels)
    
    # Find the best splitting feature
    splitting_feature = bestFeatureSplit(data, features, target)
    
    # Split on the best feature that we found. 
    left_split = data[data[splitting_feature] == 0]
    right_split = data[data[splitting_feature] == 1]
    

    error_before_split = nodeMistakes(labels) / float(len(data))
    
    left_mistakes = nodeMistakes(labels=left_split[target])           
    right_mistakes = nodeMistakes(labels=right_split[target])
    
    error_after_split = (left_mistakes + right_mistakes) / float(len(data))
    
    # If the error reduction is LESS THAN OR EQUAL TO min_error_reduction, return a leaf.
    if errorGain(error_before_split, error_after_split) <= min_error_reduction:
        print("Early stopping condition 3 reached. Minimum error reduction.")
        return createLeaf(labels)
    
    
    remaining_features.remove(splitting_feature)
    #print("Split on feature %s. (%s, %s)" % (splitting_feature, len(left_split), len(right_split)))

    # Recurse on both subtrees
    left_tree = buildTreeEarlyStopping(left_split, remaining_features, target, 
                                       current_depth + 1, max_depth, min_node_size, min_error_reduction)        

    right_tree = buildTreeEarlyStopping(right_split, remaining_features, target,
                                        current_depth + 1, max_depth, min_node_size, min_error_reduction)
    
    return {'is_leaf'          : False, 
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree, 
            'right'            : right_tree}

In [70]:
def count_nodes(tree):
    if tree['is_leaf']:
        return 1
    return 1 + count_nodes(tree['left']) + count_nodes(tree['right'])
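
To see why a tree in which every path has exactly two splits contains 7 nodes, the sketch below builds small toy trees in the same dictionary format and counts their nodes. `make_leaf` and `make_full_tree` are hypothetical helpers used only for this illustration. A tree in which every path has `d` splits has `2**(d + 1) - 1` nodes.

```python
def count_nodes(tree):
    if tree['is_leaf']:
        return 1
    return 1 + count_nodes(tree['left']) + count_nodes(tree['right'])

def make_leaf(prediction):
    """Leaf node in the same dictionary format used by the notebook."""
    return {'is_leaf': True, 'prediction': prediction,
            'splitting_feature': None, 'left': None, 'right': None}

def make_full_tree(depth):
    """Toy tree in which every root-to-leaf path has exactly `depth` splits."""
    if depth == 0:
        return make_leaf(+1)
    return {'is_leaf': False, 'prediction': None,
            'splitting_feature': 'toy_feature',
            'left': make_full_tree(depth - 1),
            'right': make_full_tree(depth - 1)}

node_counts = [count_nodes(make_full_tree(d)) for d in range(4)]
print(node_counts)
```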

Run the following test code to check your implementation. Make sure you get **'Test passed'** before proceeding.

In [71]:
small_decision_tree = buildTreeEarlyStopping(train_data, features, 'safe_loans', max_depth = 2, 
                                        min_node_size = 10, min_error_reduction=0.0)
if count_nodes(small_decision_tree) == 7:
    print('Test passed!')
else:
    print('Test failed... try again!')
    print('Number of nodes found                :', count_nodes(small_decision_tree))
    print('Number of nodes that should be there : 7' )

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 2 (101 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 1 (28001 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (23300 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 2 (4701 data points).
Early stopping condition 1 reached. Reached maximum depth.
Test p

# Building a tree with Early Stopping

First we train a tree model `my_decision_tree_new` on the **train_data** with
* `max_depth = 6`
* `min_node_size = 100`, 
* `min_error_reduction = 0.0`

In [72]:
my_decision_tree_new = buildTreeEarlyStopping(train_data, features, 'safe_loans', max_depth = 6, 
                                min_node_size = 100, min_error_reduction=0.0)

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
Early stopping condition 3 reached. Minimum error reduction.
--------------------------------------------------------------------
Subtree, depth = 2 (101 data points).
--------------------------------------------------------------------
Subtree, depth = 3 (96 data points).
Early stopping condition 2 reached. Reached minimum node size.
--------------------------------------------------------------------
Subtree, depth = 3 (5 data points).
Early stopping condition 2 reached. Reached minimum node size.
--------------------------------------------------------------------
Subtree, depth = 1 (28001 data points).
-------------------------------------------------------------

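Next, train a tree model `my_decision_tree_old` that ignores early stopping conditions 2 and 3, so that we get the same tree as in the previous assignment. Set `min_node_size=0` and `min_error_reduction` to a negative value (the cell below uses `-100`). Any negative value works, because splitting a node on a binary feature never increases the number of majority-class mistakes, so the error gain is never negative.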

In [73]:
my_decision_tree_old = buildTreeEarlyStopping(train_data, features, 'safe_loans', max_depth = 6, 
                                min_node_size = 0, min_error_reduction=-100)

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
--------------------------------------------------------------------
Subtree, depth = 3 (8074 data points).
--------------------------------------------------------------------
Subtree, depth = 4 (5884 data points).
--------------------------------------------------------------------
Subtree, depth = 5 (3826 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (1693 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (2133 data points).
Early stopping condition 1 reached. Reached maximum depth.
-----------------

--------------------------------------------------------------------
Subtree, depth = 6 (9 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (2 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 3 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 4 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 5 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (1276 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (0 data points).
Stopping condition 1 reached. All nodes same class.
-------------------------------

## Making predictions

In [74]:
'''
Recursive function that traverses a Decision Tree to return the predicted label for a given input x.
'''
def classify(node, x, annotate = False):   
    
    if node['is_leaf']:
        if annotate: 
            print(f"Reached leaf, predicted: {node['prediction']}")
        return node['prediction'] 
    
    else:
        # Go down the subtree that corresponds to feature split
        split_feature_value = x[node['splitting_feature']]
        if annotate: 
            print(f"Split on {node['splitting_feature']} = {split_feature_value}")
        
        if split_feature_value == 0:  
            # belongs to left_split
            return classify(node['left'], x, annotate)
        else:
            # belongs to right_split
            return classify(node['right'], x, annotate)
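
The traversal can also be written as a loop. The sketch below is a hypothetical variant, `classify_with_path`, that additionally reports how many splits were evaluated, which is the quantity compared in the quiz questions further down. The toy tree and the feature names `feature_a` and `feature_b` are made up and unrelated to the learned models.

```python
def classify_with_path(node, x):
    """Return (prediction, number of splits evaluated) for the input x."""
    n_splits = 0
    while not node['is_leaf']:
        n_splits += 1
        # Feature value 0 goes left, anything else goes right (as in classify).
        if x[node['splitting_feature']] == 0:
            node = node['left']
        else:
            node = node['right']
    return node['prediction'], n_splits

def toy_leaf(prediction):
    return {'is_leaf': True, 'prediction': prediction,
            'splitting_feature': None, 'left': None, 'right': None}

# Hand-built toy tree: the left branch splits once more, the right branch is a leaf.
toy_tree = {
    'is_leaf': False, 'prediction': None, 'splitting_feature': 'feature_a',
    'left': {'is_leaf': False, 'prediction': None, 'splitting_feature': 'feature_b',
             'left': toy_leaf(-1), 'right': toy_leaf(+1)},
    'right': toy_leaf(+1),
}

long_path = classify_with_path(toy_tree, {'feature_a': 0, 'feature_b': 0})
short_path = classify_with_path(toy_tree, {'feature_a': 1, 'feature_b': 0})
print(long_path, short_path)
```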

Now, let's consider the first example of the validation set and see what the `my_decision_tree_new` model predicts for this data point.

In [75]:
validation_set[0]

{'safe_loans': -1,
 'grade.A': 0,
 'grade.B': 0,
 'grade.C': 0,
 'grade.D': 1,
 'grade.E': 0,
 'grade.F': 0,
 'grade.G': 0,
 'term. 36 months': 0,
 'term. 60 months': 1,
 'home_ownership.MORTGAGE': 0,
 'home_ownership.OTHER': 0,
 'home_ownership.OWN': 0,
 'home_ownership.RENT': 1,
 'emp_length.1 year': 0,
 'emp_length.10+ years': 0,
 'emp_length.2 years': 1,
 'emp_length.3 years': 0,
 'emp_length.4 years': 0,
 'emp_length.5 years': 0,
 'emp_length.6 years': 0,
 'emp_length.7 years': 0,
 'emp_length.8 years': 0,
 'emp_length.9 years': 0,
 'emp_length.< 1 year': 0,
 'emp_length.n/a': 0}

In [76]:
print('Predicted class: %s ' % classify(my_decision_tree_new, validation_set[0]))

Predicted class: -1 


Let's add some annotations to our prediction to see the path that led to this predicted class:

In [77]:
classify(my_decision_tree_new, validation_set[0], annotate = True)

Split on term. 36 months = 0
Split on grade.A = 0
Reached leaf, predicted: -1


-1

Let's now recall the prediction path for the decision tree learned in the previous assignment, which we recreated here as `my_decision_tree_old`.

In [78]:
classify(my_decision_tree_old, validation_set[0], annotate = True)

Split on term. 36 months = 0
Split on grade.A = 0
Split on grade.B = 0
Split on grade.C = 0
Split on grade.D = 1
Split on grade.E = 0
Reached leaf, predicted: -1


-1

**Quiz Question:** For `my_decision_tree_new`, is the prediction path for `validation_set[0]` shorter, longer, or the same as for `my_decision_tree_old` that ignored the early stopping conditions 2 and 3? `A: much shorter`

**Quiz Question:** For `my_decision_tree_new` trained with `max_depth = 6`, `min_node_size = 100`, `min_error_reduction=0.0`, is the prediction path for **any point** always shorter, always longer, always the same, shorter or the same, or longer or the same as for `my_decision_tree_old` that ignored the early stopping conditions 2 and 3? `For the new tree, shorter or the same`

**Quiz Question:** For a tree trained on **any** dataset using `max_depth = 6`, `min_node_size = 100`, `min_error_reduction=0.0`, what is the maximum number of splits encountered while making a single prediction? `A: 6`
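
The bound in the last question follows from the depth check in `buildTreeEarlyStopping`: a node at `current_depth >= max_depth` always becomes a leaf, so no path can contain more than `max_depth` splits. A small helper can measure this on any tree in the dictionary format. `tree_depth` and the toy tree below are hypothetical and only sketch the idea.

```python
def tree_depth(tree):
    """Largest number of splits on any root-to-leaf path."""
    if tree['is_leaf']:
        return 0
    return 1 + max(tree_depth(tree['left']), tree_depth(tree['right']))

def toy_leaf(prediction):
    return {'is_leaf': True, 'prediction': prediction,
            'splitting_feature': None, 'left': None, 'right': None}

# Unbalanced toy tree: one path has 2 splits, the other has 1.
unbalanced_tree = {
    'is_leaf': False, 'prediction': None, 'splitting_feature': 'feature_a',
    'left': {'is_leaf': False, 'prediction': None, 'splitting_feature': 'feature_b',
             'left': toy_leaf(-1), 'right': toy_leaf(+1)},
    'right': toy_leaf(+1),
}

single_leaf_depth = tree_depth(toy_leaf(+1))
unbalanced_depth = tree_depth(unbalanced_tree)
print(single_leaf_depth, unbalanced_depth)
```

Applied to the models in this notebook, a call such as `tree_depth(my_decision_tree_new)` should therefore never exceed the `max_depth` used for training.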

# Evaluating the models

In [79]:
def getClassificationError(tree, data, target):
    """
    Receives:
    * tree - the root node of a learned decision tree as implemented above
    * data - SFrame including feature columns and label column
    * target - name of target column
    """
    # Apply the classify(tree, x) to each row in your data
    prediction = data.apply(lambda x: classify(tree, x))
    
    # Once you've made the predictions, calculate the classification error and return it
    mistakes = (prediction != data[target]).sum()
    
    return mistakes / float(len(data))
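
The error itself is just the fraction of predictions that disagree with the labels. A plain-list sketch, with the hypothetical helper `classification_error` and made-up predictions, shows the arithmetic:

```python
def classification_error(predictions, labels):
    """Fraction of positions where the prediction differs from the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    mistakes = sum(1 for p, y in zip(predictions, labels) if p != y)
    return mistakes / len(labels)

toy_predictions = [+1, -1, +1, +1]
toy_labels = [+1, +1, +1, -1]
toy_error = classification_error(toy_predictions, toy_labels)
print(toy_error)
```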

We use this function to evaluate the classification error of `my_decision_tree_new` on the **validation_set**.

In [80]:
getClassificationError(my_decision_tree_new, validation_set, target)

0.38367083153813014

In [81]:
getClassificationError(my_decision_tree_old, validation_set, target)

0.3837785437311504

**Quiz Question:** Is the validation error of the new decision tree (using early stopping conditions 2 and 3) lower than, higher than, or the same as that of the old decision tree from the previous assignment?
`A: lower`
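
It helps to translate the two error rates back into mistake counts. The sketch below assumes the validation set contains every row that is not in the training set, so its size is the total printed after balancing minus the root node size shown in the training logs. Under that assumption the two trees differ by a single validation example, so the gap is too small to say that one model generalizes better than the other.

```python
n_total = 46508   # total loans after balancing (printed earlier in the notebook)
n_train = 37224   # size of the root node in the training logs
n_validation = n_total - n_train  # assumes the split assigns every row to one set

new_error = 0.38367083153813014   # my_decision_tree_new
old_error = 0.3837785437311504    # my_decision_tree_old

# Recover whole-number mistake counts from the reported error rates.
new_mistakes = round(new_error * n_validation)
old_mistakes = round(old_error * n_validation)
print(n_validation, new_mistakes, old_mistakes)
```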

# Exploring the effect of a DT's maximum depth

In this section we compare three models trained with different values of the `max_depth` stopping criterion.
The values were chosen deliberately to be **too small**, **just right**, and **too large**.

1. **model_1**: max_depth = 2 (too small)
2. **model_2**: max_depth = 6 (just right)
3. **model_3**: max_depth = 14 (may be too large)

For each of these three, we set `min_node_size = 0` and `min_error_reduction = -1`.

_Waiting Warning_: `model_3` will probably take the longest to train, up to a couple of minutes.

In [82]:
# Too small: max_depth = 2
model_1 = buildTreeEarlyStopping(train_data, features, 'safe_loans', max_depth = 2, 
                                min_node_size = 0, min_error_reduction=-1)



--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 2 (101 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 1 (28001 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (23300 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 2 (4701 data points).
Early stopping condition 1 reached. Reached maximum depth.


In [83]:
# Just Right: max_depth = 6
model_2 = buildTreeEarlyStopping(train_data, features, 'safe_loans', max_depth = 6, 
                                min_node_size = 0, min_error_reduction=-1)



--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
--------------------------------------------------------------------
Subtree, depth = 3 (8074 data points).
--------------------------------------------------------------------
Subtree, depth = 4 (5884 data points).
--------------------------------------------------------------------
Subtree, depth = 5 (3826 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (1693 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (2133 data points).
Early stopping condition 1 reached. Reached maximum depth.
-----------------

--------------------------------------------------------------------
Subtree, depth = 6 (9 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (2 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 3 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 4 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 5 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (1276 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (0 data points).
Stopping condition 1 reached. All nodes same class.
-------------------------------

In [84]:
# Maybe Too Large: max_depth = 14
model_3 = buildTreeEarlyStopping(train_data, features, 'safe_loans', max_depth = 14, 
                                min_node_size = 0, min_error_reduction=-1)

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
--------------------------------------------------------------------
Subtree, depth = 3 (8074 data points).
--------------------------------------------------------------------
Subtree, depth = 4 (5884 data points).
--------------------------------------------------------------------
Subtree, depth = 5 (3826 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (1693 data points).
--------------------------------------------------------------------
Subtree, depth = 7 (1692 data points).
--------------------------------------------------------------------
Subtree, depth = 8 (339 data points).
----------------------------

--------------------------------------------------------------------
Subtree, depth = 7 (2133 data points).
--------------------------------------------------------------------
Subtree, depth = 8 (2133 data points).
--------------------------------------------------------------------
Subtree, depth = 9 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 9 (2133 data points).
--------------------------------------------------------------------
Subtree, depth = 10 (1045 data points).
--------------------------------------------------------------------
Subtree, depth = 11 (1044 data points).
--------------------------------------------------------------------
Subtree, depth = 12 (879 data points).
--------------------------------------------------------------------
Subtree, depth = 13 (0 data points).
Stopping condition 1 reached. All nodes same class.
-----------------------------------

--------------------------------------------------------------------
Subtree, depth = 5 (2190 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (2190 data points).
--------------------------------------------------------------------
Subtree, depth = 7 (2190 data points).
--------------------------------------------------------------------
Subtree, depth = 8 (2190 data points).
--------------------------------------------------------------------
Subtree, depth = 9 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 9 (2190 data points).
--------------------------------------------------------------------
Subtree, depth = 10 (803 data points).
--------------------------------------------------------------------
Subtree, depth = 11 (746 data points).
--------------------------------------------------------------------
Subtree, depth = 1

--------------------------------------------------------------------
Subtree, depth = 13 (545 data points).
--------------------------------------------------------------------
Subtree, depth = 14 (506 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (39 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 13 (35 data points).
--------------------------------------------------------------------
Subtree, depth = 14 (35 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 12 (22 data points).
---------------------

--------------------------------------------------------------------
Subtree, depth = 6 (85 data points).
--------------------------------------------------------------------
Subtree, depth = 7 (85 data points).
--------------------------------------------------------------------
Subtree, depth = 8 (85 data points).
--------------------------------------------------------------------
Subtree, depth = 9 (85 data points).
--------------------------------------------------------------------
Subtree, depth = 10 (85 data points).
--------------------------------------------------------------------
Subtree, depth = 11 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 11 (85 data points).
--------------------------------------------------------------------
Subtree, depth = 12 (26 data points).
--------------------------------------------------------------------
Subtree, depth = 13 (24 data

--------------------------------------------------------------------
Subtree, depth = 5 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 4 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 1 (28001 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (23300 data points).
--------------------------------------------------------------------
Subtree, depth = 3 (22024 data points).
--------------------------------------------------------------------
Subtree, depth = 4 (21666 data points).
--------------------------------------------------------------------
Subtree, depth = 5 (20734 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (20638 data points).
--------------------------------

--------------------------------------------------------------------
Subtree, depth = 13 (4169 data points).
--------------------------------------------------------------------
Subtree, depth = 14 (4169 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 13 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 11 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 10 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 8 (28

--------------------------------------------------------------------
Subtree, depth = 14 (41 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (1 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 13 (1 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 12 (1 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 11 (52 data points).
--------------------------------------------------------------------
Subtree, depth = 12 (47 data points).
--------------------------------------------------------------------
Subtree, depth = 13 (47 data points).
---------------------------------------

--------------------------------------------------------------------
Subtree, depth = 7 (230 data points).
--------------------------------------------------------------------
Subtree, depth = 8 (230 data points).
--------------------------------------------------------------------
Subtree, depth = 9 (230 data points).
--------------------------------------------------------------------
Subtree, depth = 10 (230 data points).
--------------------------------------------------------------------
Subtree, depth = 11 (119 data points).
--------------------------------------------------------------------
Subtree, depth = 12 (119 data points).
--------------------------------------------------------------------
Subtree, depth = 13 (71 data points).
--------------------------------------------------------------------
Subtree, depth = 14 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 14 (

--------------------------------------------------------------------
Subtree, depth = 4 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 5 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 7 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 8 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 9 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 10 (855 data points).
--------------------------------------------------------------------
Subtree, depth = 11 (849 data points).
--------------------------------------------------------------------
Subtree, depth = 12 (737 data points).
----------------------------

--------------------------------------------------------------------
Subtree, depth = 14 (374 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (30 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 13 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 11 (10 data points).
--------------------------------------------------------------------
Subtree, depth = 12 (10 data points).
--------------------------------------------------------------------
Subtree, depth = 13 (10 data points).
--------------------------------------------------------------------
Subtree, depth = 14 (9 data points).
Early stopping condition 1 reached. Reached maximum depth.
-----------------------

### Evaluating the models

Let us evaluate the models on the **training** and **validation** data, starting with the classification error on the training data:

In [86]:
print("Training data, classification error (model 1):", getClassificationError(model_1, train_data, target))
print("Training data, classification error (model 2):", getClassificationError(model_2, train_data, target))
print("Training data, classification error (model 3):", getClassificationError(model_3, train_data, target))

Training data, classification error (model 1): 0.40003761014399314
Training data, classification error (model 2): 0.38185041908446166
Training data, classification error (model 3): 0.37446271222866967
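
The helper `getClassificationError` used above is defined earlier in the notebook. As a reminder of the quantity being reported, here is a minimal sketch that assumes classification error means the fraction of examples whose prediction differs from the true label. The function name `classification_error` and the toy data are hypothetical and work on plain Python lists rather than on a tree and an SFrame.

```python
def classification_error(predictions, labels):
    """Return the fraction of predictions that differ from the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute an error on an empty data set")
    # Count the positions where the prediction disagrees with the label.
    mistakes = sum(1 for p, y in zip(predictions, labels) if p != y)
    return mistakes / len(labels)


# Toy example with +1 (safe) and -1 (risky) labels: 2 mistakes out of 5.
toy_labels = [+1, -1, +1, +1, -1]
toy_predictions = [+1, +1, +1, -1, -1]
print(classification_error(toy_predictions, toy_labels))
```

Under this definition, a value near 0.5 on the balanced data set is what a constant prediction would achieve.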


Now evaluate the classification error on the validation data.

In [89]:
print("Validation set, classification error (model 1, depth 2):", getClassificationError(model_1, validation_set, target))
print("Validation set, classification error (model 2, depth 6):", getClassificationError(model_2, validation_set, target))
print("Validation set, classification error (model 3, depth 14):", getClassificationError(model_3, validation_set, target))

Validation set, classification error (model 1, depth 2): 0.3981042654028436
Validation set, classification error (model 2, depth 6): 0.3837785437311504
Validation set, classification error (model 3, depth 14): 0.38000861697544164


**Quiz Question:** Which tree has the smallest error on the validation data? `A: Model 3`

**Quiz Question:** Does the tree with the smallest error in the training data also have the smallest error in the validation data? `A: yes`

**Quiz Question:** Is it always true that the tree with the lowest classification error on the **training** set will result in the lowest classification error in the **validation** set? `A: no`

### Measuring the complexity of the tree

Recall from the lecture that deeper trees are more complex. We will measure the complexity of the tree as

```
  complexity(T) = number of leaves in T
```

Here, we provide a function `count_leaves` that counts the number of leaves in a tree.

In [91]:
def count_leaves(tree):
    """Recursively count the leaves of a tree with dictionary-based nodes."""
    if tree['is_leaf']:
        return 1
    return count_leaves(tree['left']) + count_leaves(tree['right'])
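
To see the recursion at work before running it on the trained models, here is a small sketch on a hand-built tree. The toy tree is hypothetical and uses only the keys that `count_leaves` reads (`is_leaf`, `left`, `right`); the trees built in this notebook carry additional fields. The function is repeated so that the snippet runs on its own.

```python
def count_leaves(tree):
    """Recursively count the leaves of a tree with dictionary-based nodes."""
    if tree['is_leaf']:
        return 1
    return count_leaves(tree['left']) + count_leaves(tree['right'])


# Hypothetical toy tree: the root splits into a leaf (left) and an
# internal node (right) that in turn splits into two leaves.
toy_tree = {
    'is_leaf': False,
    'left': {'is_leaf': True},
    'right': {
        'is_leaf': False,
        'left': {'is_leaf': True},
        'right': {'is_leaf': True},
    },
}

print(count_leaves(toy_tree))  # three leaves in total
```

Note that the toy tree has five nodes but only three leaves, so the complexity measure defined above counts leaves rather than all nodes.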

Compute the number of leaves in `model_1`, `model_2`, and `model_3`.

In [92]:
print(f"Model 1 {count_leaves(model_1)}")
print(f"Model 2 {count_leaves(model_2)}")
print(f"Model 3 {count_leaves(model_3)}")

Model 1 4
Model 2 41
Model 3 341


**Quiz Question:** Which tree has the largest complexity? `Model 3`

**Quiz Question:** Is it always true that the most complex tree will result in the lowest classification error in the **validation_set**? `No`

# Exploring the effect of a DT's minimum error gain

We compare three models trained with different values of the `min_error_reduction` stopping criterion. The values are chosen intentionally to span the range:

1. **model_4**: `min_error_reduction = -1` (ignoring this early stopping condition)
2. **model_5**: `min_error_reduction = 0` (just right)
3. **model_6**: `min_error_reduction = 5` (too positive)

For each of these three, we set `max_depth = 6`, and `min_node_size = 0`.
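
To make the three settings concrete, here is a minimal sketch of the check, assuming the rule is "stop when the error before the split minus the error after the split is at most `min_error_reduction`", with both errors expressed as fractions between 0 and 1. The actual check lives inside `buildTreeEarlyStopping`, defined earlier in the notebook, and may differ in detail; the function name and the example error values below are hypothetical.

```python
def stops_on_min_error_reduction(error_before_split, error_after_split,
                                 min_error_reduction):
    """Return True if the split does not reduce the error by more than the threshold."""
    error_reduction = error_before_split - error_after_split
    return error_reduction <= min_error_reduction


# A hypothetical split that lowers the error from 0.40 to 0.38,
# and one that leaves the error unchanged at 0.40.
for threshold in (-1, 0, 5):
    helpful = stops_on_min_error_reduction(0.40, 0.38, threshold)
    useless = stops_on_min_error_reduction(0.40, 0.40, threshold)
    print(f"min_error_reduction={threshold}: "
          f"stop on helpful split={helpful}, stop on useless split={useless}")
```

Under this assumption, a threshold of -1 in practice never triggers, a threshold of 0 stops only when a split brings no improvement, and a threshold of 5 always stops because a difference of two fractions cannot exceed 1.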

_Waiting Warning_: Each tree can take up to 30 seconds to train.

In [101]:
# Ignores minimum error gain
model_4 = buildTreeEarlyStopping(train_data, features, 'safe_loans', max_depth=6, 
                                min_node_size = 0, min_error_reduction=-1)


--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
--------------------------------------------------------------------
Subtree, depth = 3 (8074 data points).
--------------------------------------------------------------------
Subtree, depth = 4 (5884 data points).
--------------------------------------------------------------------
Subtree, depth = 5 (3826 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (1693 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (2133 data points).
Early stopping condition 1 reached. Reached maximum depth.
-----------------

--------------------------------------------------------------------
Subtree, depth = 3 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 4 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 5 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (1276 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 5 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 4 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------

In [102]:
# Just right
model_5 = buildTreeEarlyStopping(train_data, features, 'safe_loans', max_depth=6, 
                                min_node_size = 0, min_error_reduction=0)


--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
Early stopping condition 3 reached. Minimum error reduction.
--------------------------------------------------------------------
Subtree, depth = 2 (101 data points).
--------------------------------------------------------------------
Subtree, depth = 3 (96 data points).
--------------------------------------------------------------------
Subtree, depth = 4 (85 data points).
Early stopping condition 3 reached. Minimum error reduction.
--------------------------------------------------------------------
Subtree, depth = 4 (11 data points).
Early stopping condition 3 reached. Minimum error reduction.
-------------------------------------------------------------------

In [103]:
# Way too demanding
model_6 = buildTreeEarlyStopping(train_data, features, 'safe_loans', max_depth=6, 
                                min_node_size = 0, min_error_reduction=5)


--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
Early stopping condition 3 reached. Minimum error reduction.


Calculate the classification error of each model (**model_4**, **model_5**, and **model_6**) on the validation set.

In [104]:
print("Validation data, classification error (model 4):", getClassificationError(model_4, validation_set, target))
print("Validation data, classification error (model 5):", getClassificationError(model_5, validation_set, target))
print("Validation data, classification error (model 6):", getClassificationError(model_6, validation_set, target))

Validation data, classification error (model 4): 0.3837785437311504
Validation data, classification error (model 5): 0.3837785437311504
Validation data, classification error (model 6): 0.503446790176648


Using the `count_leaves` function, compute the number of leaves in each of the models (**model_4**, **model_5**, and **model_6**).

In [105]:
print(f"Model 4 leaves = {count_leaves(model_4)}")
print(f"Model 5 leaves = {count_leaves(model_5)}")
print(f"Model 6 leaves = {count_leaves(model_6)}")

Model 4 leaves = 41
Model 5 leaves = 13
Model 6 leaves = 1


**Quiz Question:** Using the complexity definition above, which model (**model_4**, **model_5**, or **model_6**) has the largest complexity? `A: model 4`

Did this match your expectation? `yes`

**Quiz Question:** **model_4** and **model_5** have similar classification error on the validation set but **model_5** has lower complexity. Should you pick **model_5** over **model_4**? `We should pick Model 5 because it is considered 'simpler'`


# Exploring the effect of a DT's minimum node size

Train three models with these parameters:
1. **model_7**: `min_node_size = 0` (too small)
2. **model_8**: `min_node_size = 2000` (just right)
3. **model_9**: `min_node_size = 50000` (too large)

For each of these three, we set `max_depth = 6`, and `min_error_reduction = -1`.
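
As a rough illustration of how these three values behave, here is a minimal sketch that assumes a node becomes a leaf when its number of data points is less than or equal to `min_node_size`. The helper name is hypothetical and the real check inside `buildTreeEarlyStopping` may be written differently. The root size of 37224 is the one printed in the training logs; the 101-point node is just an example of a small node.

```python
def reached_minimum_node_size(num_data_points, min_node_size):
    """Return True if the node is too small to be split further."""
    return num_data_points <= min_node_size


root_size = 37224   # root node size reported in the training logs
small_node = 101    # an example of a small node deeper in the tree

for min_node_size in (0, 2000, 50000):
    print(f"min_node_size={min_node_size}: "
          f"root stops={reached_minimum_node_size(root_size, min_node_size)}, "
          f"small node stops={reached_minimum_node_size(small_node, min_node_size)}")
```

With a threshold of 50000, which exceeds the size of the training set, the root itself is never split, so the tree collapses to a single leaf.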

In [106]:
# Too small of a node
model_7 = buildTreeEarlyStopping(train_data, features, 'safe_loans', max_depth=6, 
                                min_node_size=0, min_error_reduction=-1)

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
--------------------------------------------------------------------
Subtree, depth = 3 (8074 data points).
--------------------------------------------------------------------
Subtree, depth = 4 (5884 data points).
--------------------------------------------------------------------
Subtree, depth = 5 (3826 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (1693 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (2133 data points).
Early stopping condition 1 reached. Reached maximum depth.
-----------------

--------------------------------------------------------------------
Subtree, depth = 4 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 5 (1276 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (1276 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 5 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 4 (0 data points).
Stopping condition 1 reached. All nodes same class.
--------------------------------------------------------------------
Subtree, depth = 2 (4701 data points).
--------------------------------------

In [107]:
# Just Right
model_8 = buildTreeEarlyStopping(train_data, features, 'safe_loans', max_depth=6, 
                                min_node_size=2000, min_error_reduction=-1)



--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
--------------------------------------------------------------------
Subtree, depth = 3 (8074 data points).
--------------------------------------------------------------------
Subtree, depth = 4 (5884 data points).
--------------------------------------------------------------------
Subtree, depth = 5 (3826 data points).
--------------------------------------------------------------------
Subtree, depth = 6 (1693 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (2133 data points).
Early stopping condition 1 reached. Reached maximum depth.
-----------------

In [108]:
# Too demanding of node size
model_9 = buildTreeEarlyStopping(train_data, features, 'safe_loans', max_depth=6, 
                                min_node_size=50000, min_error_reduction=-1)


--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
Early stopping condition 2 reached. Reached minimum node size.


Now, let us evaluate the models (**model_7**, **model_8**, and **model_9**) on the **validation_set**.

In [109]:
print("Validation data, classification error (model 7):", getClassificationError(model_7, validation_set, target))
print("Validation data, classification error (model 8):", getClassificationError(model_8, validation_set, target))
print("Validation data, classification error (model 9):", getClassificationError(model_9, validation_set, target))

Validation data, classification error (model 7): 0.3837785437311504
Validation data, classification error (model 8): 0.38453252908229213
Validation data, classification error (model 9): 0.503446790176648


Using the `count_leaves` function, compute the number of leaves in each of the models (**model_7**, **model_8**, and **model_9**).

In [110]:
print(f"Model 7 leaves = {count_leaves(model_7)}")
print(f"Model 8 leaves = {count_leaves(model_8)}")
print(f"Model 9 leaves = {count_leaves(model_9)}")

Model 7 leaves = 41
Model 8 leaves = 19
Model 9 leaves = 1


**Quiz Question:** Using the results obtained in this section, which model (**model_7**, **model_8**, or **model_9**) would you choose to use?
`A: probably model 8, since its error is similar to model 7's but its complexity is lower`
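
The reasoning behind this answer can be written as a simple rule: among the models whose validation error is within some tolerance of the best one, prefer the one with the fewest leaves. The sketch below applies that rule to the numbers printed above. The function name and the tolerance of 0.001 are assumptions made for illustration, not part of the assignment.

```python
# Validation errors and leaf counts copied from the outputs above.
results = {
    "model_7": {"validation_error": 0.3837785437311504, "leaves": 41},
    "model_8": {"validation_error": 0.38453252908229213, "leaves": 19},
    "model_9": {"validation_error": 0.503446790176648, "leaves": 1},
}


def pick_simplest_near_best(results, tolerance):
    """Pick the model with the fewest leaves among those close to the best error."""
    best_error = min(r["validation_error"] for r in results.values())
    candidates = [
        name for name, r in results.items()
        if r["validation_error"] <= best_error + tolerance
    ]
    return min(candidates, key=lambda name: results[name]["leaves"])


# The tolerance is a hypothetical choice; a different value can change the pick.
print(pick_simplest_near_best(results, tolerance=0.001))
```

With a tolerance of 0 the rule falls back to the model with the lowest validation error, which here is model 7, so the choice depends on how much error one is willing to trade for simplicity.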