## What is a decision tree?

Decision trees can be used for regression (continuous real-valued output, e.g. predicting the price of a house) or classification (categorical output, e.g. predicting email spam vs. no spam), but here we will focus on classification.

A decision tree classifier is a binary tree where predictions are made by traversing the tree from root to leaf — at each node, we go left if a feature is less than a threshold, right otherwise. Finally, each leaf is associated with a class, which is the output of the predictor.
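The root-to-leaf traversal described above can be sketched in a few lines of Python. This is only an illustration: the `Node` and `Leaf` classes, the `predict` function and the toy tree are hypothetical names and values introduced here, not part of any library.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    label: str  # class predicted for every sample that reaches this leaf


@dataclass
class Node:
    feature: int      # index of the feature tested at this node
    threshold: float  # go left if x[feature] < threshold, right otherwise
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def predict(tree: Union[Node, Leaf], x: Sequence[float]) -> str:
    """Walk from the root to a leaf and return the class stored in that leaf."""
    node = tree
    while isinstance(node, Node):
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.label


# Hypothetical two-level tree over two numeric features (made-up thresholds)
toy_tree = Node(
    feature=0, threshold=5.0,
    left=Leaf("small"),
    right=Node(feature=1, threshold=2.5, left=Leaf("light"), right=Leaf("heavy")),
)

print(predict(toy_tree, [3.0, 9.9]))  # feature 0 is below 5.0, so we go left
print(predict(toy_tree, [7.0, 1.0]))  # right at the root, then left
```

Note that a value exactly equal to the threshold goes right, matching the "less than" rule in the text.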

### Types of decision trees:

There is a wide variety of decision trees, such as:
- ID3 (Iterative Dichotomiser 3)
- CART (Classification and Regression Tree)
- CHAID (Chi-squared Automatic Interaction Detector), etc.

#### How does it work?

- Start at the root node
- Split the data into groups (based on some criterion)
- Set a decision at the node
- Move the data along the respective branches
- Repeat the process until a stopping criterion is met (maximum depth reached, too few samples left to split, nothing left to split, etc.)
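The stopping criteria in the last step can be written as a small check. This is a minimal sketch; the function name `should_stop` and the default values for `max_depth` and `min_samples_split` are assumptions chosen for illustration.

```python
from typing import Sequence


def should_stop(labels: Sequence[str], depth: int,
                max_depth: int = 3, min_samples_split: int = 2) -> bool:
    """Return True if the node holding `labels` should become a leaf."""
    if depth >= max_depth:               # maximum depth reached
        return True
    if len(labels) < min_samples_split:  # too few samples left to split
        return True
    if len(set(labels)) <= 1:            # node is pure: nothing left to split
        return True
    return False


print(should_stop(["Yes", "No", "Yes"], depth=0))  # mixed node, shallow: keep splitting
print(should_stop(["Yes", "Yes"], depth=1))        # pure node: stop
```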

#### How to choose the root node:

This is slightly different in regression and classification trees.

**In regression trees**, we choose the splitting point that gives the greatest reduction in RSS (Residual Sum of Squares).


<p>
        <img src = assets/1.png/ height = "400px" width = "400px">
</p>

Alternatively, we can calculate the standard deviation reduction of the feature with respect to the training data. Here YR1 and YR2 are the mean responses of regions 1 and 2. Once the tree is trained, we predict the response for a test observation using the mean of the training observations in the region it falls into.
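As a rough illustration of the RSS criterion, the sketch below tries every midpoint between neighbouring values of a single numeric feature and keeps the threshold with the smallest combined RSS of the two regions. The helper names `rss` and `best_split` and the house data are made up for this example.

```python
from typing import Sequence, Tuple


def rss(values: Sequence[float]) -> float:
    """Residual sum of squares of `values` around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, combined RSS) for the best split of one feature."""
    candidates = sorted(set(x))
    best = (float("nan"), float("inf"))
    # Candidate thresholds: midpoints between neighbouring feature values
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < threshold]    # region 1
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]  # region 2
        total = rss(left) + rss(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Made-up data: house size vs. price
sizes = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
prices = [100.0, 110.0, 105.0, 300.0, 310.0, 305.0]
threshold, total_rss = best_split(sizes, prices)
print(threshold, total_rss)
```

A new house would then be assigned the mean price of whichever region its size falls into.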

**In Classification trees:**

ID3 uses Entropy and Information Gain, while CART uses the Gini Index for classification.
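The rest of this section works with entropy, so the Gini index is not developed further. For reference, the usual definition (one minus the sum of squared class proportions) can be sketched as follows; the function name `gini` is an assumption made here.

```python
from collections import Counter
from typing import Sequence


def gini(labels: Sequence[str]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini(["cat"] * 50 + ["dog"] * 50))  # evenly mixed group
print(gini(["cat"] * 100))                # pure group
```

Like entropy, it is largest for an evenly mixed group and zero for a pure one.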

**Entropy (Shannon’s Entropy)** quantifies the uncertainty or chaos in a group. Higher entropy means greater disorder. It is denoted by H(x), where x is a vector of probabilities p1, p2, p3, …

<p>
        <img src = assets/2.png/ height = "200px" width = "200px">
</p>

From the above figure, we can see that the entropy (uncertainty) is highest (1) when the probability is 0.5, i.e. a 50–50 chance. Entropy is lowest (0) when the probability is 0 or 1, i.e. the outcome is certain and there is no uncertainty.


So, entropy is maximum if a group contains an equal number of objects from each class (e.g. 50 cats and 50 dogs), and minimum if the node is pure (e.g. only 100 cats or only 100 dogs). We ultimately want minimum entropy for the tree, i.e. pure or uniform groups at the leaf nodes.
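The cats-and-dogs intuition can be checked numerically. The sketch below computes Shannon entropy (base 2) from class counts; the function name `entropy` and the 90/10 group are added here for illustration.

```python
import math
from typing import Sequence


def entropy(counts: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a group described by its class counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:  # an empty class contributes nothing (avoids log(0))
            p = c / total
            h -= p * math.log2(p)
    return h


print(entropy([50, 50]))  # 50 cats and 50 dogs: maximum uncertainty
print(entropy([100, 0]))  # only cats: a pure group
print(entropy([90, 10]))  # mostly cats: somewhere in between
```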

##### Entropy calculation:

<p>
        <img src = assets/3.png/ height = "400px" width = "400px">
</p>

- S: the current group for which we want to calculate the entropy.
- Pi: the probability of finding the system in the i-th state; here, the proportion of elements in S that belong to the i-th class.
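Written out in standard notation (assuming the figure shows the usual Shannon formula, with c as the number of classes), the entropy of a group S is:

```latex
H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
```

For instance, a group with three elements of one class and one of another has p = (3/4, 1/4):

```latex
H(S) = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 0.811
```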

In this classification tree, while splitting we select the attribute that achieves the greatest reduction in entropy. This reduction (or change) in entropy is measured by **Information Gain**, which is given by:

<p>
        <img src = assets/4.png/ height = "400px" width = "400px">
</p>
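In text form, the usual ID3 definition of information gain for splitting a group S on an attribute A is given below. The symbols A and S_v (the subset of S where A takes the value v) are notation assumed here; the figure may use different names.

```latex
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

The second term is the entropy remaining after the split: each subset's entropy, weighted by the share of elements that fall into it.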

### Example:

The problem is to predict whether a kid is going to eat a particular type of food, given the kid's prior eating habits.

<p>
        <img src = assets/5.png/ height = "400px" width = "400px">
</p>

From the above chart, we can see that the food properties Taste, Temperature and Texture are the explanatory variables and Eat (Yes/No) is the target variable.

Now, we need to construct a top-down decision tree that splits the dataset until it forms pure groups, so that we can predict whether the kid eats a new, unseen food.

We are going to use the ID3 algorithm for this.

<p>
        <img src = assets/6.png/ >
</p>

<p>
        <img src = assets/7.png/>
</p>

<p>
        <img src = assets/8.png/ >
</p>

<p>
        <img src = assets/8_1.png/ >
</p>

<p>
        <img src = assets/9.png/ >
</p>

<p>
        <img src = assets/10.png/ >
</p>

<p>
        <img src = assets/10_1.png/ >
</p>


<p>
        <img src = assets/11.png/ >
</p>

<p>
        <img src = assets/12.png/ >
</p>

<p>
        <img src = assets/12_1.png/ >
</p>

<p>
        <img src = assets/13.png/ >
</p>

<p>
        <img src = assets/final.png/>
</p>


#### Conclusion:

Finally, we can conclude that if the food is Sweet, the kid eats it regardless of its Temperature or Texture.

If the food is Salty, he eats it only if the Texture is Hard. And if the food is Spicy, he eats it if it is Hot and Hard or Cold and Soft. Bizarre kid.
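These rules can be encoded by hand and checked against the ten rows of the example dataset. The function name `eats` is hypothetical; the rows are copied from the dataset used in this example.

```python
def eats(taste: str, temperature: str, texture: str) -> str:
    """Hand-written encoding of the rules stated in the conclusion."""
    if taste == "Sweet":
        return "Yes"
    if taste == "Salty":
        return "Yes" if texture == "Hard" else "No"
    if taste == "Spicy":
        hot_hard = temperature == "Hot" and texture == "Hard"
        cold_soft = temperature == "Cold" and texture == "Soft"
        return "Yes" if hot_hard or cold_soft else "No"
    raise ValueError(f"unknown taste: {taste!r}")


# (Taste, Temperature, Texture, Eat)
rows = [
    ("Salty", "Hot", "Soft", "No"),
    ("Spicy", "Hot", "Soft", "No"),
    ("Spicy", "Hot", "Hard", "Yes"),
    ("Spicy", "Cold", "Hard", "No"),
    ("Spicy", "Hot", "Hard", "Yes"),
    ("Sweet", "Cold", "Soft", "Yes"),
    ("Salty", "Cold", "Soft", "No"),
    ("Sweet", "Hot", "Soft", "Yes"),
    ("Spicy", "Cold", "Soft", "Yes"),
    ("Salty", "Hot", "Hard", "Yes"),
]

# Rows where the rules disagree with the recorded answer
mismatches = [r for r in rows if eats(r[0], r[1], r[2]) != r[3]]
print(mismatches)
```

An empty list means the rules reproduce every training row.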

#### REFERENCES:

- [Math behind DT](https://medium.com/@rakendd/building-decision-trees-and-its-math-711862eea1c0)

### Code sample 1:

In [2]:
import numpy as np
import pandas as pd

eps = np.finfo(float).eps # 'eps' is machine epsilon: the gap between 1.0 and the next representable float.
# At times we get log(0) or 0 in the denominator; adding eps avoids that.

from numpy import log2 as log

In [3]:
# Since the dataset we saw above was small, we represent it with a dictionary

dataset = {'Taste':['Salty','Spicy','Spicy','Spicy','Spicy','Sweet','Salty','Sweet','Spicy','Salty'],
       'Temperature':['Hot','Hot','Hot','Cold','Hot','Cold','Cold','Hot','Cold','Hot'],
       'Texture':['Soft','Soft','Hard','Hard','Hard','Soft','Soft','Soft','Soft','Hard'],
       'Eat':['No','No','Yes','No','Yes','Yes','No','Yes','Yes','Yes']}


In [4]:
df = pd.DataFrame(dataset, columns = ['Taste', 'Temperature', 'Texture', 'Eat'])

In [5]:
df

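| | Taste | Temperature | Texture | Eat |
|---|---|---|---|---|
| 0 | Salty | Hot | Soft | No |
| 1 | Spicy | Hot | Soft | No |
| 2 | Spicy | Hot | Hard | Yes |
| 3 | Spicy | Cold | Hard | No |
| 4 | Spicy | Hot | Hard | Yes |
| 5 | Sweet | Cold | Soft | Yes |
| 6 | Salty | Cold | Soft | No |
| 7 | Sweet | Hot | Soft | Yes |
| 8 | Spicy | Cold | Soft | Yes |
| 9 | Salty | Hot | Hard | Yes |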


We first need to find entropy and then information gain for splitting the dataset

<p>
        <img src = assets/3.png/ height = "400px" width = "400px">
</p>

We’ll write code that takes the class (target variable vector) and finds its entropy.
Here the fraction is ‘pi’: the proportion of elements in the group that belong to a given class.

In [16]:
entropy_node = 0 #initialize entropy
values = df.Eat.unique() #Unique objects - 'Yes' or 'No'
# values
for value in values:
    frac = df.Eat.value_counts()[value]/len(df.Eat)
    entropy_node += -frac*np.log2(frac) #summation
    
# df.Eat.value_counts()

In [42]:
entropy_node

0.9709505944546686

This is the same as the entropy (Eo) we calculated above.

Now we calculate the entropy of the other attributes.

In [43]:
attribute = 'Taste'
target_variables = df.Eat.unique() #This gives all 'Yes' and 'No'
variables = df[attribute].unique() #This gives different features in that attribute (like 'Sweet')  

In [36]:
df[attribute]

0    Salty
1    Spicy
2    Spicy
3    Spicy
4    Spicy
5    Sweet
6    Salty
7    Sweet
8    Spicy
9    Salty
Name: Taste, dtype: object

In [44]:
entropy_attribute = 0
for variable in variables:
    subset = df[df[attribute] == variable] # rows where Taste takes this value
    den = len(subset)
    entropy_each_feature = 0
    for target_variable in target_variables:
        num = len(subset[subset.Eat == target_variable])
        frac = num/(den+eps) # pi; eps guards against a zero denominator
        entropy_each_feature += -frac*log(frac+eps) # entropy of one value like 'Sweet'; eps avoids log(0)

    frac2 = den/len(df) # share of all rows that have this value
    entropy_attribute += frac2*entropy_each_feature # weighted sum gives E(Taste), which is non-negative

In [45]:
# Entropy for Taste:
abs(entropy_attribute)

0.7609640474436806

In [50]:
# The information gain is simply: entropy_node - entropy_attribute
IG_taste = entropy_node - abs(entropy_attribute)
IG_taste

0.20998654701098796

In [51]:
# We now calculate the entropy for other attributes like temp and texture

attribute = 'Temperature'
target_variables = df.Eat.unique() #This gives all 'Yes' and 'No'
variables = df[attribute].unique() #This gives different features in that attribute (like 'Hot')

In [52]:
df[attribute]

0     Hot
1     Hot
2     Hot
3    Cold
4     Hot
5    Cold
6    Cold
7     Hot
8    Cold
9     Hot
Name: Temperature, dtype: object

In [53]:
entropy_attribute = 0
for variable in variables:
    subset = df[df[attribute] == variable] # rows where Temperature takes this value
    den = len(subset)
    entropy_each_feature = 0
    for target_variable in target_variables:
        num = len(subset[subset.Eat == target_variable])
        frac = num/(den+eps) # pi; eps guards against a zero denominator
        entropy_each_feature += -frac*log(frac+eps) # entropy of one value like 'Hot'; eps avoids log(0)

    frac2 = den/len(df) # share of all rows that have this value
    entropy_attribute += frac2*entropy_each_feature # weighted sum gives E(Temperature), which is non-negative

In [54]:
# Entropy for Temperature:
abs(entropy_attribute)

0.950977500432693

In [55]:
# The information gain is simply: entropy_node - entropy_attribute
IG_temp = entropy_node - abs(entropy_attribute)
IG_temp

0.019973094021975557

In [56]:
attribute = 'Texture'
target_variables = df.Eat.unique() #This gives all 'Yes' and 'No'
variables = df[attribute].unique() #This gives different features in that attribute (like 'Soft')

In [57]:
df[attribute]

0    Soft
1    Soft
2    Hard
3    Hard
4    Hard
5    Soft
6    Soft
7    Soft
8    Soft
9    Hard
Name: Texture, dtype: object

In [58]:
entropy_attribute = 0
for variable in variables:
    subset = df[df[attribute] == variable] # rows where Texture takes this value
    den = len(subset)
    entropy_each_feature = 0
    for target_variable in target_variables:
        num = len(subset[subset.Eat == target_variable])
        frac = num/(den+eps) # pi; eps guards against a zero denominator
        entropy_each_feature += -frac*log(frac+eps) # entropy of one value like 'Soft'; eps avoids log(0)

    frac2 = den/len(df) # share of all rows that have this value
    entropy_attribute += frac2*entropy_each_feature # weighted sum gives E(Texture), which is non-negative

In [59]:
# Entropy for Texture:
abs(entropy_attribute)

0.9245112497836524

In [60]:
# The information gain is simply: entropy_node - entropy_attribute
IG_texture = entropy_node - abs(entropy_attribute)
IG_texture

0.04643934467101618
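The three cells above repeat the same computation with a different `attribute`. As a sketch, the calculation can be folded into one helper and run for all attributes at once. The version below is deliberately pandas-free (plain tuples and the standard library), and the names `COLUMNS`, `entropy` and `information_gain` are assumptions introduced here, not functions from the notebook.

```python
import math
from collections import Counter

COLUMNS = ["Taste", "Temperature", "Texture", "Eat"]
rows = [
    ("Salty", "Hot", "Soft", "No"),
    ("Spicy", "Hot", "Soft", "No"),
    ("Spicy", "Hot", "Hard", "Yes"),
    ("Spicy", "Cold", "Hard", "No"),
    ("Spicy", "Hot", "Hard", "Yes"),
    ("Sweet", "Cold", "Soft", "Yes"),
    ("Salty", "Cold", "Soft", "No"),
    ("Sweet", "Hot", "Soft", "Yes"),
    ("Spicy", "Cold", "Soft", "Yes"),
    ("Salty", "Hot", "Hard", "Yes"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target="Eat"):
    """Entropy of the target minus the weighted entropy after splitting."""
    a, t = COLUMNS.index(attribute), COLUMNS.index(target)
    parent = entropy([r[t] for r in rows])
    weighted = 0.0
    for value in sorted({r[a] for r in rows}):
        subset = [r[t] for r in rows if r[a] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return parent - weighted


gains = {name: information_gain(rows, name) for name in COLUMNS[:3]}
best = max(gains, key=gains.get)  # attribute with the highest information gain
print(gains)
print(best)
```

Because only classes that actually occur are counted, no `eps` correction is needed here; the results agree with the notebook outputs above up to floating-point rounding.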

In [None]:
# We’ll find the winner node, the one with the highest Information Gain. We repeat this process to find 
# which is the attribute we need to consider to split the data at the nodes.
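
Repeating the process at every node amounts to a short recursive function. The following is a minimal ID3-style sketch in plain Python, not the notebook's own implementation: `build_tree`, `classify` and the nested-dict tree layout are choices made here for illustration, and ties between attributes are broken by list order.

```python
import math
from collections import Counter

ATTRIBUTES = ["Taste", "Temperature", "Texture"]
TARGET = "Eat"
RAW = [
    ("Salty", "Hot", "Soft", "No"), ("Spicy", "Hot", "Soft", "No"),
    ("Spicy", "Hot", "Hard", "Yes"), ("Spicy", "Cold", "Hard", "No"),
    ("Spicy", "Hot", "Hard", "Yes"), ("Sweet", "Cold", "Soft", "Yes"),
    ("Salty", "Cold", "Soft", "No"), ("Sweet", "Hot", "Soft", "Yes"),
    ("Spicy", "Cold", "Soft", "Yes"), ("Salty", "Hot", "Hard", "Yes"),
]
rows = [dict(zip(ATTRIBUTES + [TARGET], r)) for r in RAW]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute):
    """Entropy of the target minus the weighted entropy after splitting."""
    parent = entropy([r[TARGET] for r in rows])
    weighted = 0.0
    for value in sorted({r[attribute] for r in rows}):
        subset = [r[TARGET] for r in rows if r[attribute] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return parent - weighted


def build_tree(rows, attributes):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [r[TARGET] for r in rows]
    if len(set(labels)) == 1:  # pure node: make a leaf
        return labels[0]
    if not attributes:         # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a))
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree([r for r in rows if r[best] == value], remaining)
                   for value in sorted({r[best] for r in rows})}}


def classify(tree, row):
    """Follow the branches matching `row` until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][row[attribute]]
    return tree


tree = build_tree(rows, ATTRIBUTES)
print(tree)
```

On this dataset the root is Taste, in line with the information gains computed above. Under Spicy, Temperature and Texture give the same gain, so `max` keeps whichever comes first in the list; either choice classifies all ten rows correctly.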
