<div class="alert alert-block alert-info">
    <h2 align="center">Decision Tree</h2>
    <h4 align="center"><a href="https://t.me/afsharino">Mohammad Afshari</a></h4>
</div>

<style>
.aligncenter {
    text-align: center;
}
</style>
<p class="aligncenter">
    <img src = "https://www.researchgate.net/publication/221156067/figure/fig3/AS:393937173925895@1470933349578/The-induced-play-golf-decision-tree-result-of-training-with-all-data.png"  width=70%>
</p>

# Import Libraries

In [1]:
# Scientific
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

# Machine Learning
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.tree import plot_tree

# Others
import warnings
warnings.filterwarnings("ignore")

# Load Dataset

In [2]:
data = pd.read_csv('../dataset/playgolf.csv', index_col=False)

# Inspect the Data

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Outlook    14 non-null     object
 1   Temp       14 non-null     object
 2   Humidity   14 non-null     object
 3   Windy      14 non-null     bool  
 4   Play Golf  14 non-null     object
dtypes: bool(1), object(4)
memory usage: 594.0+ bytes


In [4]:
data.head(14)

    Outlook  Temp  Humidity  Windy  Play Golf
0     Rainy   Hot      High  False         No
1     Rainy   Hot      High   True         No
2  Overcast   Hot      High  False        Yes
3     Sunny  Mild      High  False        Yes
4     Sunny  Cool    Normal  False        Yes
5     Sunny  Cool    Normal   True         No
6  Overcast  Cool    Normal   True        Yes
7     Rainy  Mild      High  False         No
8     Rainy  Cool    Normal  False        Yes
9     Sunny  Mild    Normal  False        Yes


# Entropy

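$$ E(S) = -\sum_{i=1}^{n} p_{i} \log_{2} p_{i} $$

Here $p_{i}$ is the proportion of samples in $S$ that belong to class $i$, and $n$ is the number of classes.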

In [8]:
S = data['Play Golf']

In [49]:
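def entropy(S):
    """Binary entropy (in bits) of a Series of 'Yes'/'No' labels."""
    p1 = (S == "Yes").mean()  # proportion of "Yes" labels
    p2 = (S == "No").mean()   # proportion of "No" labels
    # Skip zero proportions: 0*log2(0) counts as 0 by convention,
    # but NumPy would evaluate it to nan for a pure subset.
    E = sum(p * np.log2(1 / p) for p in (p1, p2) if p > 0)
    return E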
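
The formula sums over any number of classes, whereas the function above looks only at the labels "Yes" and "No". A minimal sketch of a more general variant follows. The name `entropy_general` is illustrative and not part of the notebook, and the label list is reconstructed from the Temp subsets printed further down rather than read from the CSV file.

```python
import numpy as np
import pandas as pd

def entropy_general(labels):
    """Shannon entropy (in bits) of labels with any number of classes."""
    # Relative frequency of each class that occurs in `labels`.
    probs = pd.Series(labels).value_counts(normalize=True).to_numpy()
    # Guard against zero frequencies (possible with categorical dtypes),
    # since log2(1/0) is undefined.
    probs = probs[probs > 0]
    # sum(p * log2(1/p)) is the same quantity as -sum(p * log2(p)).
    return float(np.sum(probs * np.log2(1.0 / probs)))

# Assumption: labels in row order 0..13, rebuilt from the printed subsets.
labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

root_entropy = entropy_general(labels)
pure_entropy = entropy_general(["Yes", "Yes", "Yes"])
print(f"root: {root_entropy:.4f} bits, pure subset: {pure_entropy:.4f} bits")
```

Counting classes with `value_counts` avoids hard-coding the label names, so the same function would also apply to a target with three or more classes.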

In [50]:
entropy_root = entropy(S)
print(f'The entropy of root is {entropy_root*100:0.2f}')

The entropy of root is 94.03
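
The printed value can be checked by hand. The counts used here (9 "Yes" and 5 "No" out of 14 rows) are read off the Temp subsets printed in the cells that follow, and the notebook prints entropies multiplied by 100, so 94.03 corresponds to about 0.9403 bits:

$$ E(S) = -\frac{9}{14}\log_{2}\frac{9}{14} - \frac{5}{14}\log_{2}\frac{5}{14} \approx 0.4098 + 0.5305 = 0.9403 $$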


#### Let's try splitting on the "Temp" attribute

In [51]:
hot = data[data['Temp']=='Hot']['Play Golf']
print(hot)
entropy_hot = entropy(hot)
print(f'\nThe entropy of temp=hot is {entropy_hot*100:0.2f}')


0      No
1      No
2     Yes
12    Yes
Name: Play Golf, dtype: object

The entropy of temp=hot is 100.00


In [54]:
mild = data[data['Temp']=='Mild']['Play Golf']
print(mild)
entropy_mild = entropy(mild)
print(f'\nThe entropy of temp=mild is {entropy_mild*100:0.2f}')

3     Yes
7      No
9     Yes
10    Yes
11    Yes
13     No
Name: Play Golf, dtype: object

The entropy of temp=mild is 91.83


In [55]:
cool = data[data['Temp']=='Cool']['Play Golf']
print(cool)
entropy_cool = entropy(cool)
print(f'\nThe entropy of temp=cool is {entropy_cool*100:0.2f}')

4    Yes
5     No
6    Yes
8    Yes
Name: Play Golf, dtype: object

The entropy of temp=cool is 81.13
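
The three cells above repeat the same steps for each value of Temp. The following is a sketch of the same computation in a single pass with `groupby`. The names `df`, `binary_entropy`, and `subset_entropy` are illustrative, and the two columns are reconstructed from the printed subsets rather than loaded from the CSV file.

```python
import numpy as np
import pandas as pd

# Assumption: Temp and Play Golf in row order 0..13, rebuilt from the outputs above.
df = pd.DataFrame({
    "Temp": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
             "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Play Golf": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                  "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

def binary_entropy(labels):
    """Entropy (in bits) of a Series of 'Yes'/'No' labels."""
    p = (labels == "Yes").mean()
    if p == 0.0 or p == 1.0:
        return 0.0  # a pure subset has zero entropy
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

# One entropy value per distinct Temp value.
subset_entropy = {
    value: binary_entropy(group)
    for value, group in df.groupby("Temp")["Play Golf"]
}

for value, e in subset_entropy.items():
    print(f"Temp={value}: {e:.4f} bits")
```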


# Average Entropy

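$$ I(S, A) = \sum_{i=1}^{k} \frac{|S_{i}|}{|S|} E(S_{i}) $$

Here $S_{i}$ is the subset of $S$ in which attribute $A$ takes its $i$-th value, and $k$ is the number of distinct values of $A$.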

In [56]:
n = len(data)  # total number of samples
avg_entropy = (
    len(hot) / n * entropy_hot
    + len(mild) / n * entropy_mild
    + len(cool) / n * entropy_cool
)
print(f'The average entropy of attribute=Temp is {avg_entropy*100:0.2f}')

The average entropy of attribute=Temp is 91.11
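
Written out with the subset sizes (4 Hot, 6 Mild, 4 Cool) and the subset entropies printed above, rescaled from the notebook's factor of 100 to bits, the weighted sum is approximately:

$$ I(S, \text{Temp}) = \frac{4}{14}(1.0000) + \frac{6}{14}(0.9183) + \frac{4}{14}(0.8113) \approx 0.2857 + 0.3936 + 0.2318 = 0.9111 $$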


# Information Gain

$$ \mathrm{Gain}(S, A) = E(S) - I(S, A) = E(S) - \sum_{i=1}^{k} \frac{|S_{i}|}{|S|} E(S_{i}) $$

In [57]:
information_gain = entropy_root - avg_entropy
print(f'The information gain is {information_gain*100:0.2f}')

The information gain is 2.92
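
The steps above (root entropy, per-value entropies, weighted average, difference) can be wrapped in one function. This is a sketch under assumptions: `entropy_bits` and `information_gain` are hypothetical helper names, not part of the notebook, and only the Temp and Play Golf columns are used because they are the only ones shown in full in the outputs.

```python
import numpy as np
import pandas as pd

def entropy_bits(labels):
    """Shannon entropy (in bits) of a pandas Series of class labels."""
    probs = labels.value_counts(normalize=True).to_numpy()
    probs = probs[probs > 0]  # drop empty classes before taking logs
    return float(np.sum(probs * np.log2(1.0 / probs)))

def information_gain(df, attribute, target):
    """Gain(S, A) = E(S) minus the size-weighted entropy of each subset."""
    weighted = sum(
        len(group) / len(df) * entropy_bits(group)
        for _, group in df.groupby(attribute)[target]
    )
    return entropy_bits(df[target]) - weighted

# Assumption: rows 0..13 rebuilt from the subsets printed in this notebook.
df = pd.DataFrame({
    "Temp": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
             "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Play Golf": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                  "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

gain_temp = information_gain(df, "Temp", "Play Golf")
print(f"Gain(S, Temp) = {gain_temp:.4f} bits")
```

With the full dataset loaded, the same call could be repeated for each attribute column, and the attribute with the largest gain would be the candidate for the first split.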
