# Decision Trees
In this notebook, we'll implement the decision tree algorithm and use it to identify edible mushrooms, as in the course's graded lab.  
The code here is based on my own implementation from the graded lab, reorganized and rewritten to be more concise and clear.

## Tools

In [1]:
import numpy as np

## Dataset

The dataset for this task is as follows:

|                                                     | Cap Color | Stalk Shape | Solitary | Edible |
|:---------------------------------------------------:|:---------:|:-----------:|:--------:|:------:|
| <img src="images/0.png" alt="drawing" width="50"/> |   Brown   |   Tapering  |    Yes   |    1   |
| <img src="images/1.png" alt="drawing" width="50"/> |   Brown   |  Enlarging  |    Yes   |    1   |
| <img src="images/2.png" alt="drawing" width="50"/> |   Brown   |  Enlarging  |    No    |    0   |
| <img src="images/3.png" alt="drawing" width="50"/> |   Brown   |  Enlarging  |    No    |    0   |
| <img src="images/4.png" alt="drawing" width="50"/> |   Brown   |   Tapering  |    Yes   |    1   |
| <img src="images/5.png" alt="drawing" width="50"/> |    Red    |   Tapering  |    Yes   |    0   |
| <img src="images/6.png" alt="drawing" width="50"/> |    Red    |  Enlarging  |    No    |    0   |
| <img src="images/7.png" alt="drawing" width="50"/> |   Brown   |  Enlarging  |    Yes   |    1   |
| <img src="images/8.png" alt="drawing" width="50"/> |    Red    |   Tapering  |    No    |    1   |
| <img src="images/9.png" alt="drawing" width="50"/> |   Brown   |  Enlarging  |    No    |    0   |


-  You have 10 examples of mushrooms. For each example, you have
    - Three features
        - Cap Color (`Brown` or `Red`),
        - Stalk Shape (`Tapering (as in \/)` or `Enlarging (as in /\)`), and
        - Solitary (`Yes` or `No`)
    - Label
        - Edible (`1` indicating yes or `0` indicating poisonous)

### One hot encoded dataset
We can one-hot encode the features in our dataset:  

- `X_train` contains three features for each example 
    - Brown Color (A value of `1` indicates "Brown" cap color and `0` indicates "Red" cap color)
    - Tapering Shape (A value of `1` indicates "Tapering Stalk Shape" and `0` indicates "Enlarging" stalk shape)
    - Solitary  (A value of `1` indicates "Yes" and `0` indicates "No")
- `y_train` is whether the mushroom is edible 
    - `y = 1` indicates edible
    - `y = 0` indicates poisonous
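
As a rough sketch of how this encoding could be derived from the table, the snippet below maps each categorical value to `0` or `1`. Because every feature has only two values, one column per feature is enough. The names `raw_rows`, `encode_row`, `X_encoded` and `y_encoded` are hypothetical and only used for illustration; the notebook itself simply hard-codes the arrays.

```python
import numpy as np

# Rows copied from the dataset table: (cap color, stalk shape, solitary, edible)
raw_rows = [
    ("Brown", "Tapering",  "Yes", 1),
    ("Brown", "Enlarging", "Yes", 1),
    ("Brown", "Enlarging", "No",  0),
    ("Brown", "Enlarging", "No",  0),
    ("Brown", "Tapering",  "Yes", 1),
    ("Red",   "Tapering",  "Yes", 0),
    ("Red",   "Enlarging", "No",  0),
    ("Brown", "Enlarging", "Yes", 1),
    ("Red",   "Tapering",  "No",  1),
    ("Brown", "Enlarging", "No",  0),
]

def encode_row(cap_color, stalk_shape, solitary):
    """Map one row of categorical values to three 0/1 features (hypothetical helper)."""
    return [
        int(cap_color == "Brown"),       # Brown Color
        int(stalk_shape == "Tapering"),  # Tapering Shape
        int(solitary == "Yes"),          # Solitary
    ]

X_encoded = np.array([encode_row(c, s, sol) for c, s, sol, _ in raw_rows])
y_encoded = np.array([label for *_, label in raw_rows])

print(X_encoded.shape, y_encoded.shape)
```

The resulting arrays should match the hard-coded `X_train` and `y_train` defined in the next cell.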

In [2]:
X_train = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],[0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])
y_train = np.array([1,1,0,0,1,0,0,1,1,0])

In [3]:
print('The shape of X_train is:', X_train.shape)
print('The shape of y_train is: ', y_train.shape)
print('Number of training examples (m):', len(X_train))

The shape of X_train is: (10, 3)
The shape of y_train is:  (10,)
Number of training examples (m): 10


## Implementation
We'll follow the guidelines of the decision tree algorithm to implement our decision tree.
- Start with all examples at the root node
- Calculate the information gain for every candidate feature, and pick the one with the highest information gain
- Split the dataset on the selected feature, creating the left and right branches of the tree
- Keep repeating the splitting process until a stopping criterion is met. Here we'll use these two criteria:
    - When a node is 100% one class
    - When splitting a node will result in the tree exceeding maximum depth
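
The two stopping criteria above could be checked with a small helper like the one below. This is only a sketch: `should_stop` is a hypothetical name, and it assumes the root sits at depth 0, so a node whose depth already equals `max_depth` may not be split further.

```python
import numpy as np

def should_stop(y_node, current_depth, max_depth):
    """Return True if the node should become a leaf (hypothetical helper)."""
    # Criterion 1: the node is 100% one class (an empty node also counts as pure here)
    is_pure = len(np.unique(y_node)) <= 1
    # Criterion 2: splitting again would push the tree past the maximum depth
    at_max_depth = current_depth >= max_depth
    return is_pure or at_max_depth

print(should_stop(np.array([1, 1, 1]), current_depth=0, max_depth=2))  # pure node
print(should_stop(np.array([1, 0, 1]), current_depth=0, max_depth=2))  # mixed, shallow
print(should_stop(np.array([1, 0, 1]), current_depth=2, max_depth=2))  # mixed, at max depth
```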

Our implementation of the decision tree will consist of the following functions:
- `compute_entropy`: computes entropy at a given node
- `split_dataset`: takes in the data at a node and a feature to split on and splits it into left and right branches
- `compute_information_gain`: computes information gain of a split
- `get_best_feature`: gets the best feature to split the node on
- `build_tree_recursive`: builds the decision tree recursively
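
For reference, the quantities behind these functions can be written out as follows. This assumes the usual binary-entropy and weighted-entropy definitions; the symbols $p_1$ (fraction of edible examples at a node) and $w^{\text{left}}$, $w^{\text{right}}$ (fraction of the node's examples sent to each branch) are notation chosen here and do not appear in the code.

$$H(p_1) = -p_1 \log_2 p_1 - (1 - p_1) \log_2 (1 - p_1), \qquad H(0) = H(1) = 0 \text{ by convention}$$

$$\text{Information gain} = H\left(p_1^{\text{node}}\right) - \left( w^{\text{left}} H\left(p_1^{\text{left}}\right) + w^{\text{right}} H\left(p_1^{\text{right}}\right) \right)$$

A split is useful when the weighted entropy of the two branches is lower than the entropy of the node being split.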

In [7]:
def compute_entropy(y):
    """
    Computes the entropy at a given node
    
    Args:
       y (ndarray): labels of the examples at the node
       
    Returns:
        entropy (float): Entropy at that node       
    """
    if len(y) == 0:  # If the node has no examples, entropy is defined to be 0
        return 0
        
    p1 = np.sum(y) / len(y)
    
    if p1 == 0 or p1 == 1:  # If p1 = 0 or p0 = 0, entropy is defined to be 0
        return 0

    entropy = - p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)
    
    return entropy

In [8]:
# Compute entropy at the root node (i.e. with all examples)
# Since we have 5 edible and 5 non-edible mushrooms, the entropy should be 1

print("Entropy at root node: ", compute_entropy(y_train)) 

Entropy at root node:  1.0


**Expected Output**:
<table>
  <tr>
    <td> <b>Entropy at root node:</b> 1.0 </td> 
  </tr>
</table>
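
A few more values help build intuition for how entropy behaves away from the balanced case. The snippet below is a standalone cross-check, not part of the implementation: `binary_entropy` is a hypothetical helper that takes the fraction of positive labels directly instead of a label array.

```python
import numpy as np

def binary_entropy(p1):
    """Entropy in bits of a node whose fraction of positive labels is p1 (hypothetical helper)."""
    if p1 in (0, 1):  # pure node: entropy is defined to be 0
        return 0.0
    return float(-p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1))

print(binary_entropy(0.5))             # balanced node -> 1.0
print(binary_entropy(1.0))             # pure node -> 0.0
print(round(binary_entropy(0.25), 3))  # 1 edible out of 4 -> 0.811
```

Entropy is highest for a 50/50 node and falls to 0 as the node becomes pure, which is why splits that produce purer branches are preferred.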

In [21]:
def split_dataset(X, node_indices, feature):
    """
    Splits the data at the given node into left and right branches
    
    Args:
        X (ndarray):             Data matrix of shape(n_samples, n_features)
        node_indices (ndarray):  Array of active indices, i.e. the samples being considered at this step.
        feature (int):           Index of feature to split on
    
    Returns:
        left_indices (ndarray):  Indices with feature value == 1
        right_indices (ndarray): Indices with feature value == 0
    """
    msk_arr = X[node_indices, feature] == 1  # Create a mask array for samples with feature value == 1
    left_indices = node_indices[msk_arr]
    right_indices = node_indices[~msk_arr]
    
    return left_indices, right_indices

In [23]:
# Case 1
root_indices = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
feature = 0

left_indices, right_indices = split_dataset(X_train, root_indices, feature)

print("CASE 1:")
print("Left indices: ", left_indices)
print("Right indices: ", right_indices)

# Case 2
root_indices_subset = np.array([0, 2, 4, 6, 8])
left_indices, right_indices = split_dataset(X_train, root_indices_subset, feature)

print("CASE 2:")
print("Left indices: ", left_indices)
print("Right indices: ", right_indices)

CASE 1:
Left indices:  [0 1 2 3 4 7 9]
Right indices:  [5 6 8]
CASE 2:
Left indices:  [0 2 4]
Right indices:  [6 8]


**Expected Output**:
```
CASE 1:
Left indices:  [0, 1, 2, 3, 4, 7, 9]
Right indices:  [5, 6, 8]

CASE 2:
Left indices:  [0, 2, 4]
Right indices:  [6, 8]
```
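
To see what the Case 1 split does to the labels, we can count the edible mushrooms in each branch. The snippet below is a self-contained sketch that repeats the dataset and the masking logic inline; `is_brown`, `left` and `right` are names chosen for illustration.

```python
import numpy as np

X_train = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],[0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])
y_train = np.array([1,1,0,0,1,0,0,1,1,0])

# Split all ten examples on feature 0 (Brown Color), as in Case 1
node_indices = np.arange(len(X_train))
is_brown = X_train[node_indices, 0] == 1
left, right = node_indices[is_brown], node_indices[~is_brown]

# Count edible mushrooms (label 1) in each branch
print(f"Left branch:  {y_train[left].sum()} of {len(left)} edible")    # 4 of 7
print(f"Right branch: {y_train[right].sum()} of {len(right)} edible")  # 1 of 3
```

Neither branch is pure after this split, so both would still have non-zero entropy.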

In [24]:
def compute_information_gain(X, y, node_indices, feature): 
    """
    Compute the information gain of splitting the node on a given feature
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         List or ndarray with n_samples containing the target variable
        node_indices (ndarray): Array of active indices, i.e. the samples being considered in this step.
        feature (int):          Index of feature to split on
   
    Returns:
        info_gain (float):      Information gain
    """
    