![table.png](pics/table.png)

![dt_decisiveness.png](pics/dt_decisiveness.png)

![table_to_tree.png](pics/table_to_tree.png)

![best_feature.png](pics/best_feature.png)


![first_split](pics/first_split.png)

![final_grades_tree.png](pics/final_grades_tree.png)

## Binary entropy

![entropy_explained.png](pics/entropy_explained.png)

![prob_winning.png](pics/prob_winning.png)

![entropy_formula.png](pics/entropy_formula.png)

![entropy_example.png](pics/entropy_example.png)

### Quiz

What is the entropy for a bucket with a ratio of four red balls to ten blue balls? Input your answer to at least three decimal places.



In [2]:
from math import log2
entropy = -(4/14) * log2(4/14) -(10/14) * log2(10/14)
print(entropy)

0.863120568566631


## Multi-class Entropy
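Last time, you saw this equation for entropy for a bucket with $m$ red balls and $n$ blue balls:

$$entropy = -\frac{m}{m+n}\log_2\left(\frac{m}{m+n}\right)-\frac{n}{m+n}\log_2\left(\frac{n}{m+n}\right)$$

We can state this in terms of probabilities instead, writing the probability of drawing a red ball as $p_1 = \frac{m}{m+n}$ and the probability of drawing a blue ball as $p_2 = \frac{n}{m+n}$: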

$$entropy = -p_1\log_2(p_1)-p_2\log_2(p_2)$$

This entropy equation can be extended to the multi-class case, where we have three or more possible values:

$$entropy = -p_1\log_2(p_1) - p_2\log_2(p_2) - ... - p_n\log_2(p_n) = -\sum\limits_{i=1}^n p_i\log_2(p_i)$$

The minimum value is still 0, when all elements are of the same value. The maximum value is still achieved when the outcome probabilities are the same, but the upper limit increases with the number of different outcomes. (For example, you can verify the maximum entropy is 2 if there are four different possibilities, each with probability 0.25.)
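As a quick check of the formula and its limits, here is a minimal sketch that computes entropy from per-class counts. The helper name `entropy_from_counts` is made up for this illustration, and empty classes are skipped on the usual convention that $0 \cdot \log_2(0)$ counts as 0.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy (in bits) of a collection described by per-class counts."""
    total = sum(counts)
    # Skip empty classes: 0 * log2(0) is treated as 0 by convention.
    probabilities = [c / total for c in counts if c > 0]
    return sum(-p * log2(p) for p in probabilities)


two_class = entropy_from_counts([4, 10])          # four red, ten blue
uniform_four = entropy_from_counts([1, 1, 1, 1])  # four equally likely outcomes
single_class = entropy_from_counts([7, 0])        # every element has the same value

print(two_class)
print(uniform_four)
print(single_class)
```

The four-outcome case reaches the maximum of 2 mentioned above, and the single-class case gives the minimum of 0.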

### Quiz

If we have a bucket with eight red balls, three blue balls, and two yellow balls, what is the entropy of the set of balls? Input your answer to at least three decimal places.



In [3]:
entropy = -(8/13) * log2(8/13) -(3/13) * log2(3/13) - (2/13) * log2(2/13)
print(entropy)

1.3346791410515946


### Quiz: Information Gain

![quiz_information_gain.png](pics/quiz_information_gain.png)

Where did we gain more information? Where did we gain less? Match the columns.

**Solution**

Tree 1: Small, Tree 2: Medium, Tree 3: Large

![information_gain_example.png](pics/information_gain_example.png)

![information_gain_calculation.png](pics/information_gain_calculation.png)

![information_gain_calculation_example.png](pics/information_gain_calculation_example.png)

![information_gain_calculation_example2.png](pics/information_gain_calculation_example2.png)

### Maximizing Information Gain

![max_ig_1.png](pics/max_ig_1.png)

![max_ig_2.png](pics/max_ig_2.png)

![max_ig_3.png](pics/max_ig_3.png)

![max_ig_4.png](pics/max_ig_4.png)

![max_ig_5.png](pics/max_ig_5.png)


For the following quiz, consider the data found in `ml_bugs.csv`, consisting of twenty-four made-up insects measured on their length and color.

Which of the following splitting criteria provides the most information gain for discriminating Mobugs from Lobugs?

In [7]:
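# Here is how I calculated the solution:
import numpy as np


def two_group_ent(first, tot):
    """Entropy of a two-class group with `first` of `tot` items in one class."""
    return -(first/tot*np.log2(first/tot) +
             (tot-first)/tot*np.log2((tot-first)/tot))

# Entropy of the full set of bugs before any split
tot_ent = two_group_ent(10, 24)
# Weighted average entropy of the two groups (15 and 9 bugs) after the split
g17_ent = 15/24 * two_group_ent(11,15) + 9/24 * two_group_ent(6,9)

# Information gain = parent entropy - weighted average child entropy
answer = tot_ent - g17_ent

print(answer)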

0.11260735516748954
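The same calculation can be written more generally. The sketch below is one possible way to do it, using only the standard library; the function names are illustrative, and the per-class counts `[4, 11]` and `[6, 3]` are an assumption chosen to be consistent with the numbers in the cell above (10 of 24 bugs in one class, split into groups of 15 and 9).

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) from a list of per-class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n_parent = sum(parent_counts)
    weighted_child_entropy = sum(
        sum(child) / n_parent * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted_child_entropy


# Assumed class counts: [class A, class B] in the parent and in each child group.
gain = information_gain([10, 14], [[4, 11], [6, 3]])
print(gain)
```

To compare several candidate splits, call `information_gain` once per split and pick the largest value.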


## Hyperparameters for Decision Trees

In order to create decision trees that generalize well to new data, we can tune a number of different aspects of the trees. We call these aspects "hyperparameters". These are some of the most important hyperparameters used in decision trees:

### Maximum Depth
The maximum depth of a decision tree is simply the largest possible length of a path from the root to a leaf. A tree of maximum depth $k$ can have at most $2^k$ leaves.

![depth_tree.png](pics/depth_tree.png)
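To see where the $2^k$ bound comes from, here is a small sketch (the function name `max_leaves` is just for illustration). It assumes a binary tree, so each extra level can at most double the number of leaves.

```python
def max_leaves(depth):
    """Largest number of leaves a binary tree of the given depth can have."""
    if depth == 0:
        return 1  # a lone root node is itself a leaf
    # Every leaf at the previous depth can be split into at most two leaves.
    return 2 * max_leaves(depth - 1)


leaf_bounds = [max_leaves(k) for k in range(5)]
print(leaf_bounds)
```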

### Minimum number of samples to split

A node must have at least `min_samples_split` samples in order to be large enough to split. If a node has fewer than `min_samples_split` samples, it will not be split, and the splitting process stops at that node.

![min-samples-split.png](pics/min-samples-split.png)

However, `min_samples_split` doesn't control the minimum size of leaves. As you can see in the example on the right, above, the parent node had 20 samples, greater than `min_samples_split = 11`, so the node was split. But when the node was split, a child node was created that had 5 samples, less than `min_samples_split = 11`.

### Minimum number of samples per leaf

When splitting a node, one could run into the problem of having 99 samples in one child and 1 in the other. Such a split does little to separate the data, and would be a waste of resources and time. If we want to avoid this, we can set a minimum for the number of samples we allow on each leaf.

![min_sample_per_leaf.png](pics/min_sample_per_leaf.png)

This number can be specified as an integer or as a float. If it's an **integer**, it's the minimum **number of samples** allowed in a leaf. If it's a **float**, it's the minimum **fraction** of samples allowed in a leaf. For example, 0.1, or 10%, implies that a particular split will not be allowed if one of the leaves that results contains less than 10% of the samples in the dataset.

If a threshold on a feature results in a leaf that has fewer samples than `min_samples_leaf`, the algorithm will not allow that split, but it may perform a split on the same feature at a different threshold, that does satisfy `min_samples_leaf`.
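The two rules above can be sketched as a simple check. This is a simplified illustration, not scikit-learn's actual implementation: the helper names and the dataset size `n_total=40` are assumptions, and a float setting is interpreted as a fraction of the whole training set, rounded up.

```python
from math import ceil


def effective_min_samples(value, n_total):
    """Turn an int-or-float setting into an absolute number of samples."""
    if isinstance(value, float):
        return ceil(value * n_total)  # fraction of the training set, rounded up
    return value


def split_allowed(n_left, n_right, n_total, min_samples_split=2, min_samples_leaf=1):
    """Check a candidate split against both minimum-size rules."""
    n_node = n_left + n_right
    if n_node < effective_min_samples(min_samples_split, n_total):
        return False  # the node is too small to be split at all
    min_leaf = effective_min_samples(min_samples_leaf, n_total)
    return n_left >= min_leaf and n_right >= min_leaf  # both children must be big enough


# The example from the figure: a node of 20 samples split into 5 and 15.
print(split_allowed(5, 15, n_total=40, min_samples_split=11))
print(split_allowed(5, 15, n_total=40, min_samples_split=11, min_samples_leaf=6))
print(split_allowed(5, 15, n_total=40, min_samples_split=11, min_samples_leaf=0.25))
```

The first call allows the split because only `min_samples_split` is checked against the parent; the other two reject it because the 5-sample child is smaller than the leaf minimum.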

###  Quiz

Let's test your intuition. Which hyperparameter values are associated with underfitting and which with overfitting?

* Large depth very often causes overfitting, since a tree that is too deep can memorize the data.
* Small depth can result in a very simple model, which may cause underfitting.
* Small minimum samples per split may result in a complicated, highly branched tree, which can mean the model has memorized the data, or in other words, overfit.
* Large minimum samples per split may leave the tree without enough flexibility to fit the data, and may result in underfitting.
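One way to observe these effects is to compare training and test accuracy for trees of different depths. The sketch below uses a made-up noisy dataset (random points in the unit square, with about 15% of the labels flipped), so the exact numbers are only illustrative.

```python
import random

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

random.seed(1)
# Hypothetical data: the true rule is a + b > 1, with roughly 15% label noise.
X = [[random.random(), random.random()] for _ in range(400)]
y = [int(a + b > 1.0) ^ int(random.random() < 0.15) for a, b in X]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

scores = {}
for depth in (1, 3, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, scores[depth])
```

A large gap between the two accuracies for the unrestricted tree is the usual sign of overfitting, while low accuracy on both sets for the shallowest tree suggests underfitting.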

## Decision Trees in sklearn

In this section, you'll use decision trees to fit a given sample dataset.

Before you do that, let's go over the tools required to build this model.

For your decision tree model, you'll be using scikit-learn's `DecisionTreeClassifier` class. This class provides the methods to define the model and fit it to your data.

```
>>> from sklearn.tree import DecisionTreeClassifier
>>> model = DecisionTreeClassifier()
>>> model.fit(x_values, y_values)
```

In the example above, the `model` variable is a decision tree model that has been fitted to the data `x_values` and `y_values`. Fitting the model means finding the best tree that fits the training data. Let's make two predictions using the model's `predict()` function.

```
>>> print(model.predict([ [0.2, 0.8], [0.5, 0.4] ]))
[0. 1.]
```

The model returned an array of predictions, one prediction for each input array. The first input, `[0.2, 0.8]`, got a prediction of `0.`. The second input, `[0.5, 0.4]`, got a prediction of `1.`.
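Since `x_values` and `y_values` are not defined in the snippets above, here is a self-contained sketch with a tiny made-up dataset. The six points and their labels are assumptions chosen only so that the two query points have unambiguous answers.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two features per point, labels 0 or 1.
x_values = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7],
            [0.5, 0.4], [0.6, 0.3], [0.8, 0.2]]
y_values = [0, 0, 0, 1, 1, 1]

model = DecisionTreeClassifier(random_state=0)
model.fit(x_values, y_values)

# predict() takes a list of inputs and returns one label per input row.
predictions = model.predict([[0.2, 0.8], [0.5, 0.4]])
print(predictions.shape)
print(predictions)
```

Note that `predict()` returns a one-dimensional array whose length equals the number of input rows.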

### Hyperparameters

When we define the model, we can specify the hyperparameters. In practice, the most common ones are:

- `max_depth`: The maximum number of levels in the tree.
- `min_samples_leaf`: The minimum number of samples allowed in a leaf.
- `min_samples_split`: The minimum number of samples required to split an internal node.

For example, here we define a model where the maximum depth of the tree `max_depth` is 7, and the minimum number of elements in each leaf `min_samples_leaf` is 10.

`>>> model = DecisionTreeClassifier(max_depth = 7, min_samples_leaf = 10)`
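To confirm that the hyperparameters actually constrain the fitted tree, you can inspect it with the `get_depth()` and `get_n_leaves()` methods. The following is a sketch on a made-up dataset of 200 random points; the data and the variable names `constrained` and `unconstrained` are assumptions for illustration.

```python
import random

from sklearn.tree import DecisionTreeClassifier

random.seed(0)
# Hypothetical data: 200 points in the unit square with noisy labels.
X = [[random.random(), random.random()] for _ in range(200)]
y = [int(a + b > 1.0) ^ int(random.random() < 0.1) for a, b in X]

unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)
constrained = DecisionTreeClassifier(
    max_depth=7, min_samples_leaf=10, random_state=0
).fit(X, y)

# Depth and leaf count of each fitted tree.
print(unconstrained.get_depth(), unconstrained.get_n_leaves())
print(constrained.get_depth(), constrained.get_n_leaves())
```

With 200 samples and at least 10 samples per leaf, the constrained tree can have no more than 20 leaves, and its depth cannot exceed 7.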

### Decision Tree Quiz

In this quiz, you'll be given the following sample dataset, and your goal is to define a model that gives 100% accuracy on it.

![tree_programming_quiz.png](pics/tree_programming_quiz.png)

The data file can be found under the "data_tree.csv" tab in the quiz below. It includes three columns: the first two contain the coordinates of the points, and the third contains the label.

The data will be loaded for you, and split into features X and labels y.

You'll need to complete each of the following steps:

1. Build a decision tree model

Create a decision tree classification model using scikit-learn's `DecisionTreeClassifier` and assign it to the variable `model`.

2. Fit the model to the data

You won't need to specify any of the hyperparameters, since the default ones will yield a model that perfectly classifies the training data. However, we encourage you to play with hyperparameters such as `max_depth` and `min_samples_leaf` to try to find the simplest possible model.

3. Predict using the model

Predict the labels for the training set, and assign this list to the variable `y_pred`.

4. Calculate the accuracy of the model

For this, use the sklearn function `accuracy_score`. A model's accuracy is the fraction of all data points that it correctly classified.

When you hit Test Run, you'll be able to see the boundary region of your model, which will help you tune the parameters, in case you need to.
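The accuracy used in step 4 is simple enough to compute by hand. Here is a minimal standard-library sketch of the same idea; the function name `accuracy` and the example labels are made up for illustration, and in the quiz itself you should still use `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Four of the five hypothetical predictions below are correct.
example_accuracy = accuracy([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(example_accuracy)
```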

Note: This quiz requires you to find an accuracy of 100% on the training set. This is like memorizing the training data! A model designed to have 100% accuracy on training data is unlikely to generalize well to new data. If you let the tree grow very complex (for example, with a very large `max_depth`), the model will fit the training set very well, but may not generalize well. Try to find the simplest settings that do the job (for example, the smallest `max_depth`), since a simpler model is more likely to generalize well. (This aspect of the exercise won't be graded.)



In [9]:
# Import statements 
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Read the data.
data = np.asarray(pd.read_csv('data/data_tree.csv', header=None))
# Assign the features to the variable X, and the labels to the variable y. 
X = data[:,0:2]
y = data[:,2]

# TODO: Create the decision tree model and assign it to the variable model.
model = DecisionTreeClassifier()

# TODO: Fit the model.
model.fit(X,y)

# TODO: Make predictions. Store them in the variable y_pred.
y_pred = model.predict(X)

# TODO: Calculate the accuracy and assign it to the variable acc.
acc = accuracy_score(y, y_pred)

print(acc)

1.0


## Lab: Titanic Survival Exploration with Decision Trees

### Getting Started
In this lab, you will see how decision trees work by implementing a decision tree in sklearn.

We'll start by loading the dataset and displaying some of its rows.

In [10]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Pretty display for notebooks
%matplotlib inline

# Set a random seed
import random
random.seed(42)

# Load the dataset
in_file = 'data/titanic_data.csv'
full_data = pd.read_csv(in_file)

# Print the first few entries of the RMS Titanic data
display(full_data.head())

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


Recall that these are the various features present for each passenger on the ship:
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Since we're interested in the outcome of survival for each passenger or crew member, we can remove the **Survived** feature from this dataset and store it as its own separate variable `outcomes`. We will use these outcomes as our prediction targets.  
Run the code cell below to remove **Survived** as a feature of the dataset and store it in `outcomes`.

In [11]:
# Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)

# Show the new dataset with 'Survived' removed
display(features_raw.head())

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


The very same sample of the RMS Titanic data now shows the **Survived** feature removed from the DataFrame. Note that `features_raw` (the passenger data) and `outcomes` (the outcomes of survival) are now *paired*. That means for any passenger `features_raw.loc[i]`, they have the survival outcome `outcomes[i]`.

### Preprocessing the data

Now, let's do some data preprocessing. First, we'll remove the names of the passengers, and then one-hot encode the features.

One-hot encoding converts categorical data into numerical data: each distinct option within a category becomes its own *new* column, containing 1 if the row has that option and 0 if it does not (e.g. Queenstown port or not Queenstown port). Check out [this article](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) before continuing. 
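To make the idea concrete, here is a small standard-library sketch of one-hot encoding a single column. The function `one_hot` and the five example ports are hypothetical; in this lab the actual encoding is done by `pd.get_dummies`.

```python
def one_hot(values, prefix):
    """Return {new_column_name: list of 0/1 flags}, one new column per distinct value."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [int(value == category) for value in values]
        for category in categories
    }


# Hypothetical port-of-embarkation values for five passengers.
embarked = ["S", "C", "S", "S", "Q"]
encoded = one_hot(embarked, prefix="Embarked")
for column, flags in encoded.items():
    print(column, flags)
```

Each row has exactly one 1 across the new columns, and the number of new columns equals the number of distinct values in the original column.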

**Question:** Why would it be a terrible idea to one-hot encode the data without removing the names?

In [12]:
# Removing the names
features_no_names = features_raw.drop(['Name'], axis=1)

# One-hot encoding
features = pd.get_dummies(features_no_names)

And now we'll fill in any blanks with zeroes.

In [13]:
features = features.fillna(0.0)
display(features.head())

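| | PassengerId | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Ticket_110152 | Ticket_110413 | ... | Cabin_F G73 | Cabin_F2 | Cabin_F33 | Cabin_F38 | Cabin_F4 | Cabin_G6 | Cabin_T | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 4 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 5 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |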


### Training the model

Now we're ready to train a model in sklearn. First, let's split the data into training and testing sets. Then we'll train the model on the training set.

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)

In [27]:
# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# TODO: Define the classifier, and fit it to the data
model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=6, min_samples_split=10).fit(X_train, y_train)

### Testing the model
Now, let's see how our model does by calculating the accuracy over both the training and the testing sets.

In [28]:
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 0.8707865168539326
The test accuracy is 0.8547486033519553


## Outro 

Congratulations! In this section you've learned all about decision trees, and how to use them to make predictions. In the next section, we are going to learn more about some of the concepts we alluded to in this section, namely how to test and evaluate your model to see how it's performing. See you there!

