Information entropy:
$$
\mathrm{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|}p_k\log_2p_k.
$$
$D$: the sample set

$p_k$: the proportion of samples in $D$ that belong to class $k$

$|\mathcal{Y}|$: the number of classes

Information gain:
$$
\mathrm{Gain}(D,a)=\mathrm{Ent}(D)-\sum_{v=1}^V\frac{|D^v|}{|D|}\mathrm{Ent}(D^v).
$$
Here attribute $a$ has $V$ possible values $(a^1,a^2,\dots,a^V)$, and $D^v$ is the subset of $D$ whose samples take the value $a^v$ on $a$.


For example, consider this data set:
![](1.png)

The information entropy is:
$$
\mathrm{Ent}(D)=-\sum_{k=1}^{2}p_{k}\log_{2}p_{k}=-\left(\frac{8}{17}\log_{2}\frac{8}{17}+\frac{9}{17}\log_{2}\frac{9}{17}\right)=0.998.
$$
Splitting the data on 色泽 (color) gives three subsets:
$$
D^1\ (\text{色泽}=\text{青绿}),\quad D^2\ (\text{色泽}=\text{乌黑}),\quad D^3\ (\text{色泽}=\text{浅白}).
$$
The entropy of each subset and the resulting information gain are:
$$
\begin{gathered}
\mathrm{Ent}(D^{1}) =-\left(\frac36\log_2\frac36+\frac36\log_2\frac36\right)=1.000, \\
\mathrm{Ent}(D^{2}) =-\left(\frac46\log_2\frac46+\frac26\log_2\frac26\right)=0.918, \\
\mathrm{Ent}(D^3) =-\left(\frac15\log_2\frac15+\frac45\log_2\frac45\right)=0.722, 
\end{gathered}
$$
$$
\begin{aligned}
\mathrm{Gain}(D,\text{色泽})& =\mathrm{Ent}(D)-\sum_{v=1}^{3}\frac{|D^{v}|}{|D|}\mathrm{Ent}(D^{v})  \\
&=0.998-\left(\frac{6}{17}\times1.000+\frac{6}{17}\times0.918+\frac{5}{17}\times0.722\right) \\
&=0.109.
\end{aligned}
$$
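
As a quick check of the arithmetic above, the following sketch recomputes $\mathrm{Ent}(D)$ and $\mathrm{Gain}(D,\text{色泽})$ from the class counts alone. The helper names `entropy_from_counts` and `gain_from_counts` are illustrative and not part of the notebook code. Without rounding the intermediate entropies the gain comes out at about 0.108; the 0.109 above results from rounding each term to three decimals first.

```python
from math import log2

def entropy_from_counts(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    # Skip empty classes: 0 * log2(0) is taken to be 0
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_from_counts(parent_counts, subset_counts):
    """Information gain of a split, given class counts before and after it."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy_from_counts(s) for s in subset_counts)
    return entropy_from_counts(parent_counts) - weighted

# (good, bad) counts for the whole data set, then for 青绿 / 乌黑 / 浅白
ent_d = entropy_from_counts([8, 9])
gain_color = gain_from_counts([8, 9], [[3, 3], [4, 2], [1, 4]])

print(f"Ent(D) = {ent_d:.3f}")
print(f"Gain(D, color) = {gain_color:.4f}")
```

Working from counts keeps the example independent of pandas; the notebook code further down computes the same quantities directly from the DataFrame.
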
In the same way,
$$
\begin{aligned}&\mathrm{Gain}(D,\text{根蒂})=0.143;\quad\mathrm{Gain}(D,\text{敲声})=0.141;\\&\mathrm{Gain}(D,\text{纹理})=0.381;\quad\mathrm{Gain}(D,\text{脐部})=0.289;\\&\mathrm{Gain}(D,\text{触感})=0.006.\end{aligned}
$$

纹理 (texture) has the largest information gain, so it is chosen for the first split:
$$
\begin{array}{ccc}&\boxed{\text{纹理}=?}\\[5pt]\text{清晰}&\text{稍糊}&\text{模糊}\\\boxed{\{1,2,3,4,5,6,8,10,15\}}&\boxed{\{7,9,13,14,17\}}&\boxed{\{11,12,16\}}\end{array}
$$

Next, take the 清晰 branch, whose subset is denoted $D^1$ here. The remaining attributes are 色泽, 根蒂, 敲声, 脐部, and 触感, and their information gains on $D^1$ are:
$$
\begin{aligned}&\mathrm{Gain}(D^{1},\text{色泽})=0.043;\quad\mathrm{Gain}(D^{1},\text{根蒂})=0.458;\\&\mathrm{Gain}(D^{1},\text{敲声})=0.331;\quad\mathrm{Gain}(D^{1},\text{脐部})=0.458;\\&\mathrm{Gain}(D^{1},\text{触感})=0.458.\end{aligned}
$$

Repeating this procedure recursively on every branch gives the final tree:
![](2.png)

In [122]:
import numpy as np
import pandas as pd

# Define the data set (one row per watermelon, samples 1 to 17)
data = [
    ['青绿', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '好瓜'],
    ['乌黑', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', '好瓜'],
    ['乌黑', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '好瓜'],
    ['青绿', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', '好瓜'],
    ['浅白', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '好瓜'],
    ['青绿', '稍蜷', '浊响', '清晰', '稍凹', '软粘', '好瓜'],
    ['乌黑', '稍蜷', '浊响', '稍糊', '稍凹', '软粘', '好瓜'],
    ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '硬滑', '好瓜'],
    ['乌黑', '稍蜷', '沉闷', '稍糊', '稍凹', '硬滑', '坏瓜'],
    ['青绿', '硬挺', '清脆', '清晰', '平坦', '软粘', '坏瓜'],
    ['浅白', '硬挺', '清脆', '模糊', '平坦', '硬滑', '坏瓜'],
    ['浅白', '蜷缩', '浊响', '模糊', '平坦', '软粘', '坏瓜'],
    ['青绿', '稍蜷', '浊响', '稍糊', '凹陷', '硬滑', '坏瓜'],
    ['浅白', '稍蜷', '沉闷', '稍糊', '凹陷', '硬滑', '坏瓜'],
    ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '软粘', '坏瓜'],
    ['浅白', '蜷缩', '浊响', '模糊', '平坦', '硬滑', '坏瓜'],
    ['青绿', '蜷缩', '沉闷', '稍糊', '稍凹', '硬滑', '坏瓜']
]

# Create the DataFrame
# Columns: color, root, knock sound, texture, navel, touch, label
columns = ['色泽', '根蒂', '敲声', '纹理', '脐部', '触感', '标签']
df = pd.DataFrame(data, columns=columns)

# Information entropy of a label Series
def entropy(y):
    # value_counts only returns classes that occur, so every p > 0 and log2 is safe
    p = y.value_counts(normalize=True)
    return -(p * np.log2(p)).sum()

# Information gain obtained by splitting df on feature
def information_gain(df, feature, target):
    total_entropy = entropy(df[target])
    feature_entropy = 0
    
    for value in df[feature].unique():
        subset = df[df[feature] == value]
        feature_entropy += (len(subset) / len(df)) * entropy(subset[target])
    
    return total_entropy - feature_entropy


In [123]:
class Node:
    def __init__(self, feature=None, label=None):
        self.feature = feature  # attribute tested at this node (None for a leaf)
        self.label = label      # predicted label (None for an internal node)
        self.children = {}      # maps each attribute value to a child Node

def build_tree(df, target):
    # All samples share one label: return a leaf
    if df[target].nunique() == 1:
        return Node(label=df[target].iloc[0])

    # No attributes left: return a leaf with the majority label
    if len(df.columns) == 1:
        return Node(label=df[target].mode()[0])

    # Compute the information gain of every remaining attribute
    gains = {feature: information_gain(df, feature, target) for feature in df.columns if feature != target}
    max_gain = max(gains.values())
    # Ties are broken by column order; isclose guards against floating-point noise
    best_feature = next(feature for feature, gain in gains.items() if np.isclose(gain, max_gain))
    tree = Node(feature=best_feature)

    # Recursively build one subtree per value that occurs in this subset
    for value in df[best_feature].unique():
        subset = df[df[best_feature] == value].drop(columns=[best_feature])
        tree.children[value] = build_tree(subset, target)

    return tree

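# Build the decision tree on the full data set
decision_tree = build_tree(df, '标签')

# Print the tree with one indentation level per depth.
# Output labels: 特征 = feature, 值 = value, 预测 = prediction
def print_tree(node, depth=0):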
    if node.label is not None:
        print("  " * depth + f"预测: {node.label}")
    else:
        print("  " * depth + f"[特征: {node.feature}]")
        for value, child in node.children.items():
            print("  " * depth + f"[值: {value}]")
            print_tree(child, depth + 1)

print_tree(decision_tree)

[特征: 纹理]
[值: 清晰]
  [特征: 根蒂]
  [值: 蜷缩]
    预测: 好瓜
  [值: 稍蜷]
    [特征: 色泽]
    [值: 青绿]
      预测: 好瓜
    [值: 乌黑]
      [特征: 触感]
      [值: 硬滑]
        预测: 好瓜
      [值: 软粘]
        预测: 坏瓜
  [值: 硬挺]
    预测: 坏瓜
[值: 稍糊]
  [特征: 触感]
  [值: 软粘]
    预测: 好瓜
  [值: 硬滑]
    预测: 坏瓜
[值: 模糊]
  预测: 坏瓜
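
The notebook builds and prints the tree but does not yet classify samples with it. A minimal sketch of prediction follows, assuming the `Node` layout used above. The extra `children` constructor argument, the `leaf` helper, the `predict` function, and its `default` parameter are additions for illustration. The tree is rebuilt by hand from the printed output so that the block runs on its own.

```python
class Node:
    def __init__(self, feature=None, label=None, children=None):
        self.feature = feature            # attribute tested at this node
        self.label = label                # set only on leaves
        self.children = children or {}    # attribute value -> child Node

def leaf(label):
    return Node(label=label)

# Hand-written copy of the tree printed above
tree = Node('纹理', children={
    '清晰': Node('根蒂', children={
        '蜷缩': leaf('好瓜'),
        '稍蜷': Node('色泽', children={
            '青绿': leaf('好瓜'),
            '乌黑': Node('触感', children={'硬滑': leaf('好瓜'), '软粘': leaf('坏瓜')}),
        }),
        '硬挺': leaf('坏瓜'),
    }),
    '稍糊': Node('触感', children={'软粘': leaf('好瓜'), '硬滑': leaf('坏瓜')}),
    '模糊': leaf('坏瓜'),
})

def predict(node, sample, default=None):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while node.label is None:
        child = node.children.get(sample[node.feature])
        if child is None:
            # The tree has no branch for this value
            return default
        node = child
    return node.label

# Samples 8 and 15 differ only in 触感
sample_8 = {'色泽': '乌黑', '根蒂': '稍蜷', '敲声': '浊响', '纹理': '清晰', '脐部': '稍凹', '触感': '硬滑'}
sample_15 = dict(sample_8, 触感='软粘')
# A combination that does not occur in the data set
unseen = dict(sample_8, 色泽='浅白')

print(predict(tree, sample_8), predict(tree, sample_15), predict(tree, unseen))
```

Because `build_tree` only creates branches for values present in a subset, a sample such as `unseen` reaches a node with no matching child; passing a `default` label is one simple way to handle that case.
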


Page 78 appears to be wrong: no watermelon in the data set has 纹理=清晰, 根蒂=稍蜷, and 色泽=浅白, so the tree built above has no 浅白 branch under 根蒂=稍蜷.

<img src="error.png" alt="Description" width="500"/>

#### Gain ratio

The gain ratio can be used instead of the information gain to choose the best splitting attribute:
$$
\mathrm{Gain}_\mathrm{ratio}(D,a)=\frac{\mathrm{Gain}(D,a)}{\mathrm{IV}(a)},
$$
where $\mathrm{Gain}(D,a)$ is the information gain defined earlier,
$$
\mathrm{Gain}(D,a)=\mathrm{Ent}(D)-\sum_{v=1}^V\frac{|D^v|}{|D|}\mathrm{Ent}(D^v),
$$
and $\mathrm{IV}(a)$ is the "intrinsic value" of attribute $a$:
$$
\mathrm{IV}(a)=-\sum_{v=1}^V\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}.
$$
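
As an illustration, the sketch below evaluates both formulas for 色泽 on the full data set, using the subset class counts from the information-gain example (3/3, 4/2, and 1/4 good/bad). The function names are hypothetical. Note that $\mathrm{IV}(a)$ is just the entropy of the subset sizes, so one helper serves both purposes.

```python
from math import log2

def entropy_from_counts(counts):
    """Entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, subset_counts):
    """Return (gain ratio, IV) for a split described by per-subset class counts.

    Assumes at least two non-empty subsets; otherwise IV is 0 and the ratio is undefined.
    """
    total = sum(parent_counts)
    sizes = [sum(s) for s in subset_counts]
    gain = entropy_from_counts(parent_counts) - sum(
        n / total * entropy_from_counts(s) for n, s in zip(sizes, subset_counts)
    )
    iv = entropy_from_counts(sizes)  # IV(a): entropy of the subset sizes
    return gain / iv, iv

# (good, bad) counts for D, then for 青绿 / 乌黑 / 浅白
ratio_color, iv_color = gain_ratio([8, 9], [[3, 3], [4, 2], [1, 4]])
print(f"IV = {iv_color:.3f}, gain ratio = {ratio_color:.3f}")
```
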

### Pruning

To reduce overfitting, we can prune the tree, using a held-out validation set to decide which branches to keep.

Training set: samples {1, 2, 3, 6, 7, 10, 14, 15, 16, 17}.

Validation set: samples {4, 5, 8, 9, 11, 12, 13}.

<img src="3.png" alt="Description" width="500"/>

In [124]:
# data is 0-indexed, so sample k is data[k - 1]
data_train = [data[0], data[1], data[2], data[5], data[6], data[9], data[13], data[14], data[15], data[16]]
data_validation = [data[3], data[4], data[7], data[8], data[10], data[11], data[12]]

In [125]:
df_train = pd.DataFrame(data_train,columns=columns)
df_train

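| | 色泽 | 根蒂 | 敲声 | 纹理 | 脐部 | 触感 | 标签 |
|---|---|---|---|---|---|---|---|
| 0 | 青绿 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 好瓜 |
| 1 | 乌黑 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 好瓜 |
| 2 | 乌黑 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 好瓜 |
| 3 | 青绿 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 软粘 | 好瓜 |
| 4 | 乌黑 | 稍蜷 | 浊响 | 稍糊 | 稍凹 | 软粘 | 好瓜 |
| 5 | 青绿 | 硬挺 | 清脆 | 清晰 | 平坦 | 软粘 | 坏瓜 |
| 6 | 浅白 | 稍蜷 | 沉闷 | 稍糊 | 凹陷 | 硬滑 | 坏瓜 |
| 7 | 乌黑 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 软粘 | 坏瓜 |
| 8 | 浅白 | 蜷缩 | 浊响 | 模糊 | 平坦 | 硬滑 | 坏瓜 |
| 9 | 青绿 | 蜷缩 | 沉闷 | 稍糊 | 稍凹 | 硬滑 | 坏瓜 |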


In [126]:
# Build the tree once on the training set, then print it
tree_train = build_tree(df_train, '标签')
print_tree(tree_train)

[特征: 色泽]
[值: 青绿]
  [特征: 敲声]
  [值: 浊响]
    预测: 好瓜
  [值: 清脆]
    预测: 坏瓜
  [值: 沉闷]
    预测: 坏瓜
[值: 乌黑]
  [特征: 根蒂]
  [值: 蜷缩]
    预测: 好瓜
  [值: 稍蜷]
    [特征: 纹理]
    [值: 稍糊]
      预测: 好瓜
    [值: 清晰]
      预测: 坏瓜
[值: 浅白]
  预测: 坏瓜
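
To illustrate why pruning is needed, the sketch below scores the unpruned tree printed above on the validation set and compares it with a version in which the 乌黑 subtree is collapsed into a single leaf. The nested-dict tree is transcribed by hand from the printed output, and `predict`, `accuracy`, and the choice of which subtree to collapse are illustrative assumptions, not the notebook's code. The leaf label 好瓜 is the majority label of the four 乌黑 training samples (three 好瓜, one 坏瓜).

```python
import copy

COLUMNS = ['色泽', '根蒂', '敲声', '纹理', '脐部', '触感', '标签']

# Validation samples 4, 5, 8, 9, 11, 12, 13 from the data set
validation_rows = [
    ['青绿', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', '好瓜'],
    ['浅白', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '好瓜'],
    ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '硬滑', '好瓜'],
    ['乌黑', '稍蜷', '沉闷', '稍糊', '稍凹', '硬滑', '坏瓜'],
    ['浅白', '硬挺', '清脆', '模糊', '平坦', '硬滑', '坏瓜'],
    ['浅白', '蜷缩', '浊响', '模糊', '平坦', '软粘', '坏瓜'],
    ['青绿', '稍蜷', '浊响', '稍糊', '凹陷', '硬滑', '坏瓜'],
]

# Internal nodes are {'feature': ..., 'children': {...}}; leaves are label strings
full_tree = {'feature': '色泽', 'children': {
    '青绿': {'feature': '敲声', 'children': {'浊响': '好瓜', '清脆': '坏瓜', '沉闷': '坏瓜'}},
    '乌黑': {'feature': '根蒂', 'children': {
        '蜷缩': '好瓜',
        '稍蜷': {'feature': '纹理', 'children': {'稍糊': '好瓜', '清晰': '坏瓜'}},
    }},
    '浅白': '坏瓜',
}}

def predict(node, sample):
    """Walk down the tree; returns None if a value has no branch."""
    while isinstance(node, dict):
        node = node['children'].get(sample[node['feature']])
    return node

def accuracy(tree, rows):
    """Fraction of rows whose predicted label matches the 标签 column."""
    samples = [dict(zip(COLUMNS, row)) for row in rows]
    return sum(predict(tree, s) == s['标签'] for s in samples) / len(samples)

# Candidate pruning step: replace the 乌黑 subtree with its majority training label
pruned_tree = copy.deepcopy(full_tree)
pruned_tree['children']['乌黑'] = '好瓜'

acc_full = accuracy(full_tree, validation_rows)
acc_pruned = accuracy(pruned_tree, validation_rows)
print(f"unpruned: {acc_full:.3f}, pruned: {acc_pruned:.3f}")
```

On this split the unpruned tree classifies 2 of the 7 validation samples correctly and the collapsed tree 3 of 7, so a post-pruning step that keeps a collapse whenever validation accuracy does not drop would accept this one.
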
