In [None]:
# You should only need these 3 imports to complete this assignment
import pandas as pd
import numpy as np
from plotnine import *


# 0. Together

## 0.1 Entropy

Entropy is a measure of disorder/chaos. We want ordered and organized data in the leaf nodes of our decision trees. So we want LOW entropy. **Entropy** is defined as:

$$ E = -\sum_{i=1}^N p_i \log_2(p_i) $$

Where $N$ is the number of categories or labels in our outcome variable, and $p_i$ is the proportion of data points in the node that belong to category $i$.

Compare this with **gini impurity**, which is:

$$GI = 1 - \sum_{i=1}^N p_i^2$$

(If you're super into decision trees, check out this paper: [Theoretical comparison between the Gini Index and Information Gain criteria](https://www.unine.ch/files/live/sites/imi/files/shared/documents/papers/Gini_index_fulltext.pdf).)
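
As a worked example of the two formulas above, take a hypothetical node (the numbers are made up for illustration) holding 8 data points of one class and 2 of another, so $p_1 = 0.8$ and $p_2 = 0.2$:

$$ E = -\left(0.8 \log_2(0.8) + 0.2 \log_2(0.2)\right) \approx 0.722 $$

$$ GI = 1 - \left(0.8^2 + 0.2^2\right) = 1 - 0.68 = 0.32 $$

A perfectly mixed two-class node ($p_1 = p_2 = 0.5$) would instead give $E = 1$ and $GI = 0.5$, the largest values either measure can take with two classes.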

### *Question*

WHY do we want the leaf nodes of our tree to be ordered (that is, to have low entropy or impurity)?

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />


### Answer

## 0.2 Measures of Chaos for a Split

When you split a node, you get two new nodes. To calculate the chaos (entropy or gini impurity) of the split, calculate the chaos for EACH of the new nodes and then take the weighted average of the two values.

We weight each node in this calculation because a node that holds more data has more impact, so its measure of chaos (entropy or gini impurity) should count for more.

In general, once you've calculated the chaos (entropy or gini impurity) for each of the new nodes, you'll use this formula to calculate the weighted average:


$$ WC = \left(\frac{N_L}{Total} \cdot C_L\right) + \left(\frac{N_R}{Total} \cdot C_R\right) $$

Where $N_L$ is the number of data points in the Left Node, $N_R$ is the number of data points in the Right Node, and $Total = N_L + N_R$ is the total number of data points in that split. $C_L$ and $C_R$ are the chaos measures (entropy or gini impurity) for the left and right nodes, respectively.
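
The weighted average above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `weighted_chaos` is a hypothetical helper name, the chaos values passed in are made-up numbers rather than values computed from real data, and it is not meant to replace the functions you write later in this assignment.

```python
def weighted_chaos(n_left, chaos_left, n_right, chaos_right):
    """Weighted average chaos of a split (hypothetical helper, for illustration)."""
    total = n_left + n_right
    # Each node's chaos counts in proportion to the share of data it holds.
    return (n_left / total) * chaos_left + (n_right / total) * chaos_right


# Made-up example: 15 points in the left node, 5 points in the right node.
wc = weighted_chaos(n_left=15, chaos_left=0.4, n_right=5, chaos_right=0.2)
print(wc)  # roughly 0.35, i.e. 0.75 * 0.4 + 0.25 * 0.2
```

Because the left node holds three quarters of the data, the result sits closer to the left node's chaos than to the right node's.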



# 1. Measures of Chaos

## 1.1 Gini Impurity

Use Python and NumPy to write two functions, as described in the comments below.
<img src = "https://drive.google.com/uc?id=1MQEeJDxxcV8zmhzBgaDZ2QY0Ng8z8hz8" width = 300px/>

In [None]:
### YOUR CODE HERE ############


def gini():
    # This function calculates the gini impurity for ONE node (left, right, or root!).
    # It should take in the counts of each class in that node as arguments
    # (for example, the number of positives and the number of negatives)
    # and calculate the gini impurity for that node based on the formula above.
    # Return the impurity for the node.
    
    pass

def gini_split():
    
    # This function takes FOUR arguments: LNP, LNN, RNP, and RNN (the positive and
    # negative counts in the Left Node, then in the Right Node). It calculates
    # the gini impurity for each node (by calling gini()) and then calculates
    # the weighted average of the impurity in each node.
    # Return the impurity for the split.
    
    pass

### YOUR CODE HERE ###############

In [None]:
# use this to test your code; if it prints True, you got the right answer
# (the four counts describe a split, so the check calls gini_split())

abs(gini_split(10,5,2,12) - 0.3481116584564861) <= 0.0001

## 1.2 Entropy

Use Python and NumPy to write two functions, as described in the comments below. If you want to read more about entropy, see this [article](https://bricaud.github.io/personal-blog/entropy-in-decision-trees/).

hint: `np.log2()`
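
One thing worth knowing about `np.log2()`: `np.log2(0)` evaluates to `-inf` (with a runtime warning), and `0 * -inf` is `nan`, so a class with zero data points in a node can turn the whole sum into `nan`. The usual convention is to treat $0 \cdot \log_2(0)$ as 0. Below is a minimal sketch of one way to apply that convention. `plogp` is a hypothetical helper name chosen for this illustration; it is not one of the functions the assignment asks for, and other approaches work just as well.

```python
import numpy as np


def plogp(p):
    """Return p * log2(p) element-wise, using 0 wherever p is 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = p > 0
    # Only take the log where the proportion is strictly positive.
    out[mask] = p[mask] * np.log2(p[mask])
    return out


print(plogp([0.5, 0.5]))  # both terms are 0.5 * log2(0.5)
print(plogp([1.0, 0.0]))  # no nan, even though one proportion is 0
```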

In [None]:
### YOUR CODE HERE ###############

def entropy():
    # This function calculates the entropy for ONE node (left, right, or root!).
    # It should take in the counts of each class in that node as arguments
    # (for example, the number of positives and the number of negatives)
    # and calculate the entropy for that node based on the formula above.
    # Return the entropy for the node.
    pass

def entropy_split():
    # This function takes FOUR arguments: LNP, LNN, RNP, and RNN (the positive and
    # negative counts in the Left Node, then in the Right Node). It calculates
    # the entropy for each node (by calling entropy()) and then calculates
    # the weighted average of the entropy in each node.
    # Return the entropy for the split.
    pass

### YOUR CODE HERE ###############

In [None]:
# use this to test your code; if it prints True, you got the right answer
# (the four counts describe a split, so the check calls entropy_split())

abs(entropy_split(10,5,2,12) - 0.7606157383093077) <= 0.0001

# 2. Build a Categorical Decision Tree

In [None]:
# Load Mushroom Data------------------------------------
import pandas as pd

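# see this site for what variables mean: http://archive.ics.uci.edu/ml/datasets/Mushroom
# NOTE: the data file has no header row, so read_csv treats the first record as
# the header, and that record is lost when the columns are renamed below
# (8123 of the 8124 rows remain). Pass header=None if you want to keep every row.
mush = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data")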

mush.columns = ['poison','cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size',
                'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
                'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number','ring-type',
                'spore-print-color', 'population', 'habitat']

mush.head()

For your sanity, let's restrict our dataset to 3 predictor variables...

In [None]:
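# .copy() gives an independent DataFrame, so adding a column to it later
# does not trigger pandas' SettingWithCopyWarning
mush_small = mush[["poison", "bruises", "gill-size"]].copy()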

In [None]:
# make a bar plot of edible/poisonous mushrooms############

In [None]:
# make a bar plot of bruised/not-bruised mushrooms############

In [None]:
# make a bar plot of broad/narrow gilled mushrooms############

## 2.1 Build!

Use the functions you built earlier to build a decision tree that classifies each data point as either edible (`e`) or poisonous (`p`). You can choose to either use entropy or gini impurity. 

### 2.1.1 Layer 1

Choose which variable to use to split the first layer
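
To compare candidate splits you first need the class counts in the root node and in each child node. The sketch below shows one way to get such counts with pandas on a tiny made-up table; the DataFrame `toy` and its column names `label` and `feature` are assumptions for illustration only, so adapt the idea to `mush_small` yourself.

```python
import pandas as pd

# Made-up data: six rows, one outcome column and one candidate split column.
toy = pd.DataFrame({
    "label":   ["e", "e", "p", "p", "e", "p"],
    "feature": ["a", "a", "a", "b", "b", "b"],
})

# Class counts for the root node (all rows together).
root_counts = toy["label"].value_counts().to_dict()

# Class counts for each child node: one inner dictionary per feature value.
split_counts = {
    value: group["label"].value_counts().to_dict()
    for value, group in toy.groupby("feature")
}

print(root_counts)
print(split_counts)
```

Note that `value_counts()` only lists classes that actually occur, so a child node containing a single class yields a dictionary with one key.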

In [None]:
# create dictionaries of possible splits############
# Try getting something like this for the root node: {'e': 4208, 'p': 3915}
# BUT CALCULATE IT FROM THE DATA, DON'T JUST HARDCODE THAT DICTIONARY


# Something like this for splitting on bruises: 
# {'f': {'e': 1456, 'p': 3292}, 't': {'e': 2752, 'p': 623}}
# BUT CALCULATE IT FROM THE DATA, DON'T JUST HARDCODE THAT DICTIONARY


# Something like this for splitting on gill-size: 
# {'b': {'e': 3920, 'p': 1692}, 'n': {'e': 288, 'p': 2223}}
# BUT CALCULATE IT FROM THE DATA, DON'T JUST HARDCODE THAT DICTIONARY



In [None]:
# calculate impurity/entropy of each possible split using your functions###########


In [None]:
# choose which split improves prediction most############

### *Question*

Does splitting the root node improve the tree? How can you tell?

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />

### Answer


### 2.1.2 Create Classifications

Pretend that this decision stump (a decision tree with only one layer) is your final tree. Generate the classification for each data point and store it in `mush_small`.
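
A decision stump classifies every data point as the most common label in the leaf it falls into. Here is a minimal sketch of that idea on a made-up table; `toy`, `label`, `feature`, and `predicted` are hypothetical names for illustration, and this is one possible approach rather than the required one.

```python
import pandas as pd

# Made-up data: six rows, one outcome column and the variable used for the split.
toy = pd.DataFrame({
    "label":   ["e", "e", "p", "p", "e", "p"],
    "feature": ["a", "a", "a", "b", "b", "b"],
})

# Most common label within each value of the splitting variable (one per leaf).
majority = toy.groupby("feature")["label"].agg(lambda s: s.value_counts().idxmax())

# Each row is classified as the majority label of the leaf it falls into.
toy["predicted"] = toy["feature"].map(majority)

print(toy)
```

If the frame you assign into was created by selecting columns from another frame, pandas may emit a `SettingWithCopyWarning`; calling `.copy()` on the selection first avoids that.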

In [1]:
# classification############

### 2.1.3 Calculate Accuracy

Count how often your model made the correct classification. How well did your model do?
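
Accuracy is the number of correct classifications divided by the total number of data points. A short sketch with made-up labels, where `actual` and `predicted` are hypothetical Series standing in for the true and predicted columns of your data:

```python
import pandas as pd

# Made-up true labels and model classifications for six data points.
actual = pd.Series(["e", "e", "p", "p", "e", "p"])
predicted = pd.Series(["e", "e", "e", "p", "p", "p"])

# Element-wise comparison gives True where the classification was right.
correct = (actual == predicted)

# The mean of a boolean Series is the fraction of True values.
accuracy = correct.mean()
print(f"{correct.sum()} of {len(correct)} correct, accuracy = {accuracy:.3f}")
# prints "4 of 6 correct, accuracy = 0.667"
```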

In [None]:
# accuracy############

# 3. Chaos

### *Question*

When would Gini Impurity be 0? When would Entropy be 0? What does that mean about our tree/node?

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />

### Answer