# Information Quantities for Decision Tree Induction

**CS5483 Data Warehousing and Data Mining**
___

In [1]:
%reset -f
%matplotlib inline
import dit
from dit.shannon import entropy, conditional_entropy, mutual_information
from IPython.display import Math
import matplotlib.pyplot as plt

In this notebook, we will use the [`dit` package](https://dit.readthedocs.io/en/latest/) to compute some basic information quantities used in decision tree algorithms. A summary of Shannon's information measures and their relationships is given first.

## Information Measures

The following are the mathematical definitions of *entropy*, *conditional entropy*, and *mutual information*:

\begin{align}
H(Y) &= E\left[\log \tfrac1{P_{Y}(Y)}\right] && \text{entropy of $Y$}\\
H(Y|X) &= E\left[\log \tfrac1{P_{Y|X}(Y|X)}\right] && \text{conditional entropy of $Y$ given $X$}\\
I(X;Y) &= E\left[\log \tfrac{P_{Y|X}(Y|X)}{P_Y(Y)}\right] && \text{mutual information of $X$ and $Y$}
\end{align}

These information quantities can be related using a *Venn Diagram*:

<a title="KonradVoelkel, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Entropy-mutual-information-relative-entropy-relation-diagram.svg"><img width="512" alt="Entropy-mutual-information-relative-entropy-relation-diagram" src="https://upload.wikimedia.org/wikipedia/commons/d/d4/Entropy-mutual-information-relative-entropy-relation-diagram.svg"></a>

\begin{align}
H(X,Y)&=H(X)+H(Y|X) && \text{chain rule of entropy}\\
&=H(Y)+H(X|Y)\\
I(X;Y)&=H(Y)-H(Y|X) && \text{mutual information in terms of entropies}\\
&=H(X)+H(Y)-H(X,Y)\\
&=H(X)-H(X|Y)
\end{align}
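
As a quick sanity check on these identities, the following sketch evaluates each quantity directly from its definition on a small joint distribution and confirms the chain rule and the three expressions for mutual information. The joint pmf `joint` and the helper names are made up for illustration, logarithms are taken base 2 (bits), and only the Python standard library is used.

```python
from math import log2, isclose

# Hypothetical joint pmf P(X=x, Y=y); the numbers are for illustration only.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}

def marginal(pmf, axis):
    """Sum a joint pmf {(x, y): p} down to one coordinate."""
    out = {}
    for outcome, p in pmf.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out

def entropy_bits(pmf):
    """E[log2 1/P] for a pmf given as {outcome: probability}."""
    return sum(p * log2(1 / p) for p in pmf.values() if p > 0)

p_x, p_y = marginal(joint, 0), marginal(joint, 1)
h_x, h_y, h_xy = entropy_bits(p_x), entropy_bits(p_y), entropy_bits(joint)

# Conditional entropies and mutual information straight from the definitions.
h_y_given_x = sum(p * log2(p_x[x] / p) for (x, y), p in joint.items() if p > 0)
h_x_given_y = sum(p * log2(p_y[y] / p) for (x, y), p in joint.items() if p > 0)
i_xy = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)

# Chain rule of entropy, in both orders.
assert isclose(h_xy, h_x + h_y_given_x)
assert isclose(h_xy, h_y + h_x_given_y)

# Mutual information in terms of entropies.
assert isclose(i_xy, h_y - h_y_given_x)
assert isclose(i_xy, h_x + h_y - h_xy)
assert isclose(i_xy, h_x - h_x_given_y)
```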

## Entropy

Consider the following distribution:
\begin{align}
p_k=\begin{cases}
\frac12 & k=0\\
\frac14 & k=1,2\\
0 & \text{otherwise.}
\end{cases}
\end{align}

In [None]:
p = dit.Distribution(['0', '1', '2', '3'], [1/2, 1/4, 1/4, 0])

# `use_line_collection` is omitted: recent Matplotlib releases no longer accept it.
plt.stem(p.outcomes, p.pmf)
plt.xlabel('k')
plt.ylabel(r'$p_k$')
plt.ylim((0,1))
plt.show()

In [None]:
# Raw f-string so that `\dots` is passed to LaTeX instead of being read as an escape.
Math(rf'h(p_1,p_2,\dots)={entropy(p)}')
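
For this distribution the entropy can also be worked out by hand: $\tfrac12\log_2 2+2\cdot\tfrac14\log_2 4=1.5$ bits. Below is a minimal standard-library sketch of the same sum, assuming base-2 logarithms (which should match the bits reported by `dit`) and the usual convention that zero-probability outcomes contribute nothing.

```python
from math import log2

pmf = [1/2, 1/4, 1/4, 0]  # p_0, p_1, p_2, p_3 from the definition above

# Zero-probability outcomes are skipped, following the convention 0 log(1/0) = 0.
h = sum(p * log2(1 / p) for p in pmf if p > 0)
print(h)  # 1.5
```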

## Information Gain

Consider the dataset $D$:

|X1|X2|X3|X4|Y|
|:---:|:---:|:---:|:---:|:-:|
|0    |0    |0    |00   |0  |
|0    |0    |0    |00   |0  |
|0    |0    |1    |01   |1  |
|1    |0    |1    |11   |1  |
|0    |1    |0    |00   |2  |
|1    |1    |0    |10   |2  |
|1    |1    |1    |11   |3  |
|1    |1    |1    |11   |3  |

**How to determine which attribute is more informative?**

First, create a uniform distribution over the instances in $D$.

In [None]:
d = dit.uniform([('0','0','0','00','0'),
                 ('0','0','0','00','0'),
                 ('0','0','1','01','1'),
                 ('1','0','1','11','1'),
                 ('0','1','0','00','2'),
                 ('1','1','0','10','2'),
                 ('1','1','1','11','3'),
                 ('1','1','1','11','3')])
d.set_rv_names(('X1','X2','X3','X4','Y'))
d

We can then calculate $\text{Info}(D)$ and $\text{Info}_{X_i}(D)$ for $i\in\{1,2,3,4\}$ as the entropy $H(Y)$ and the conditional entropies $H(Y|X_i)$, respectively.

In [None]:
InfoD = entropy(d,['Y'])
InfoX1D = conditional_entropy(d,['Y'],['X1'])
InfoX2D = conditional_entropy(d,['Y'],['X2'])
InfoX3D = conditional_entropy(d,['Y'],['X3'])
InfoX4D = conditional_entropy(d,['Y'],['X4'])

Math(r'''
\begin{{aligned}}
\text{{Info}}(D)&={:.3g}\\
\text{{Info}}_{{X_1}}(D)&={:.3g}\\
\text{{Info}}_{{X_2}}(D)&={:.3g}\\
\text{{Info}}_{{X_3}}(D)&={:.3g}\\
\text{{Info}}_{{X_4}}(D)&={:.3g}\\
\end{{aligned}}
'''.format(InfoD,InfoX1D,InfoX2D,InfoX3D,InfoX4D))
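
To see what these numbers mean in terms of counts, here is a sketch that computes the same quantities from the table directly: $\text{Info}(D)$ is the entropy of the class frequencies, and $\text{Info}_{X_i}(D)$ is the size-weighted average of the class entropy within each group of rows sharing a value of $X_i$. The helper names `info` and `info_split` are hypothetical, and base-2 logarithms are assumed.

```python
from collections import Counter, defaultdict
from math import log2

# Rows of D as (X1, X2, X3, X4, Y), copied from the table above.
D = [('0', '0', '0', '00', '0'),
     ('0', '0', '0', '00', '0'),
     ('0', '0', '1', '01', '1'),
     ('1', '0', '1', '11', '1'),
     ('0', '1', '0', '00', '2'),
     ('1', '1', '0', '10', '2'),
     ('1', '1', '1', '11', '3'),
     ('1', '1', '1', '11', '3')]

def info(labels):
    """Entropy (bits) of the empirical distribution of a list of labels."""
    n = len(labels)
    return sum(c / n * log2(n / c) for c in Counter(labels).values())

def info_split(rows, attr):
    """Weighted class entropy after partitioning rows on column index `attr`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[-1])  # collect class labels per attribute value
    return sum(len(g) / len(rows) * info(g) for g in groups.values())

info_D = info([row[-1] for row in D])
info_X = [info_split(D, i) for i in range(4)]  # X1, X2, X3, X4
print(info_D, info_X)
```

Subtracting each entry of `info_X` from `info_D` gives the corresponding information gain.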

The information gain $\text{Gain}_{X_i}(D)$ can be calculated as the mutual information $I(X_i;Y):=H(Y)-H(Y|X_i)$.

In [None]:
GainX1D = mutual_information(d,['X1'],['Y'])
GainX2D = mutual_information(d,['X2'],['Y'])
GainX3D = mutual_information(d,['X3'],['Y'])
GainX4D = mutual_information(d,['X4'],['Y'])

Math(r'''
\begin{{aligned}}
\text{{Gain}}_{{X_1}}(D)&={:.3g}\\
\text{{Gain}}_{{X_2}}(D)&={:.3g}\\
\text{{Gain}}_{{X_3}}(D)&={:.3g}\\
\text{{Gain}}_{{X_4}}(D)&={:.3g}\\
\end{{aligned}}
'''.format(GainX1D,GainX2D,GainX3D,GainX4D))

**Exercise** Which attribute gives the highest information gain? Should we choose it as the splitting attribute?

YOUR ANSWER HERE

## Information Gain Ratio

To normalize information gain properly, we first calculate $\text{SplitInfo}_{X_i}(D)$ as $H(X_i)$:

In [None]:
SplitInfoX1D = entropy(d,['X1'])
SplitInfoX2D = entropy(d,['X2'])
SplitInfoX3D = entropy(d,['X3'])
SplitInfoX4D = entropy(d,['X4'])

Math(r'''
\begin{{aligned}}
\text{{SplitInfo}}_{{X_1}}(D)&={:.3g}\\
\text{{SplitInfo}}_{{X_2}}(D)&={:.3g}\\
\text{{SplitInfo}}_{{X_3}}(D)&={:.3g}\\
\text{{SplitInfo}}_{{X_4}}(D)&={:.3g}\\
\end{{aligned}}
'''.format(SplitInfoX1D,SplitInfoX2D,SplitInfoX3D,SplitInfoX4D))

Finally, we calculate the information gain ratios:

In [None]:
Math(r'''
\begin{{aligned}}
\frac{{\text{{Gain}}_{{X_1}}(D)}}{{\text{{SplitInfo}}_{{X_1}}(D)}}&={:.3g}\\
\frac{{\text{{Gain}}_{{X_2}}(D)}}{{\text{{SplitInfo}}_{{X_2}}(D)}}&={:.3g}\\
\frac{{\text{{Gain}}_{{X_3}}(D)}}{{\text{{SplitInfo}}_{{X_3}}(D)}}&={:.3g}\\
\frac{{\text{{Gain}}_{{X_4}}(D)}}{{\text{{SplitInfo}}_{{X_4}}(D)}}&={:.3g}\\
\end{{aligned}}
'''.format(GainX1D/SplitInfoX1D,GainX2D/SplitInfoX2D,GainX3D/SplitInfoX3D,GainX4D/SplitInfoX4D))
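
The ratio is undefined when $\text{SplitInfo}_{X_i}(D)=0$, which happens if an attribute takes a single value on $D$. The sketch below recomputes the gain ratio with the standard library, using $I(X_i;Y)=H(X_i)+H(Y)-H(X_i,Y)$, and returns `0.0` in that degenerate case. That fallback and the function name `gain_ratio` are our assumptions for illustration, not part of `dit` or of any particular decision tree implementation.

```python
from collections import Counter
from math import log2

# Rows of D as (X1, X2, X3, X4, Y), copied from the table above.
D = [('0', '0', '0', '00', '0'),
     ('0', '0', '0', '00', '0'),
     ('0', '0', '1', '01', '1'),
     ('1', '0', '1', '11', '1'),
     ('0', '1', '0', '00', '2'),
     ('1', '1', '0', '10', '2'),
     ('1', '1', '1', '11', '3'),
     ('1', '1', '1', '11', '3')]

def H(values):
    """Entropy (bits) of the empirical distribution of a list of values."""
    n = len(values)
    return sum(c / n * log2(n / c) for c in Counter(values).values())

def gain_ratio(rows, attr):
    """Gain / SplitInfo for column index `attr`, with the class in the last column."""
    x = [row[attr] for row in rows]
    y = [row[-1] for row in rows]
    split_info = H(x)
    if split_info == 0:  # constant attribute: assumed fallback, see the note above
        return 0.0
    gain = H(x) + H(y) - H(list(zip(x, y)))  # I(X;Y) via joint entropy
    return gain / split_info

ratios = [gain_ratio(D, i) for i in range(4)]  # X1, X2, X3, X4
print([f'{r:.3g}' for r in ratios])
```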

**Exercise** Is $X_4$ a good splitting attribute? Why?

YOUR ANSWER HERE