# Chapter 5: Decision Trees

1. What is a decision tree?
2. Which algorithms are used to learn decision trees?
3. How is feature selection performed?

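1. A classification decision tree is a tree structure that classifies instances based on their features. A decision tree can be converted into a set of if-then rules. It can also be viewed as a conditional probability distribution of the classes, defined over a partition of the feature space.
2. Decision tree learning aims to build a tree that fits the training data well while keeping its complexity low. Because selecting the optimal tree directly from all possible decision trees is an NP-complete problem, heuristic methods are used in practice to learn a suboptimal tree.
Decision tree learning algorithms consist of three parts: feature selection, tree generation, and tree pruning. Commonly used algorithms are ID3, C4.5, and CART.
3. The purpose of feature selection is to pick the features that are able to classify the training data. The key to feature selection is the criterion. Common criteria are: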

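(1) Information gain of feature A on sample set D (ID3)
$$g(D,A)=H(D) - H(D|A)$$
$$H(D) = -\sum_{k=1}^K{\frac{|C_k|}{|D|}\log_2 \frac{|C_k|}{|D|}}$$
$$H(D|A) = \sum_{i=1}^{n}{\frac{|D_i|}{|D|}H(D_i)}$$

Here $H(D)$ is the entropy of dataset $D$, $H(D_i)$ is the entropy of dataset $D_i$, and $H(D|A)$ is the conditional entropy of $D$ given feature $A$. $D_i$ is the subset of samples in $D$ for which feature $A$ takes its $i$-th value, and $C_k$ is the subset of samples in $D$ that belong to class $k$. $n$ is the number of distinct values of feature $A$, and $K$ is the number of classes.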
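
As a quick check of these formulas, the sketch below computes $H(D)$ and $g(D,A)$ from class counts alone. The counts are taken from the loan dataset used later in this chapter (9 samples of class 是 and 6 of class 否 overall; the 年龄 feature splits them into 2/3, 3/2 and 4/1). The helper names `entropy` and `info_gain` are illustrative and are not the notebook's own functions.

```python
from math import log2

def entropy(counts):
    """Empirical entropy, in bits, of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, child_counts):
    """g(D, A) = H(D) - sum_i |D_i|/|D| * H(D_i)."""
    total = sum(parent_counts)
    cond = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - cond

# [yes, no] class counts: the whole dataset, then the three age groups
parent = [9, 6]
by_age = [[2, 3], [3, 2], [4, 1]]

print(round(entropy(parent), 3))            # 0.971
print(round(info_gain(parent, by_age), 3))  # 0.083
```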

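(2) Information gain ratio of feature A on sample set D (C4.5)
$$g_R(D,A)=\frac{g(D,A)}{H_A(D)}$$
$$H_A(D) = -\sum_{i=1}^{n}{\frac{|D_i|}{|D|}\log_2 \frac{|D_i|}{|D|}}$$
Here $g(D,A)$ is the information gain and $H_A(D)$ is the entropy of dataset $D$ with respect to the values of feature $A$.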
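
A minimal sketch of the gain ratio follows, assuming the denominator is the entropy of $D$ with respect to the values of feature $A$ (often called the split information), which is the usual C4.5 definition. The function names are illustrative, and the counts are again those of the 年龄 feature in the loan dataset used later in this chapter.

```python
from math import log2

def entropy(counts):
    """Empirical entropy, in bits, of a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    """g_R(D, A) = g(D, A) / H_A(D)."""
    total = sum(parent_counts)
    cond = sum(sum(c) / total * entropy(c) for c in child_counts)
    gain = entropy(parent_counts) - cond
    # H_A(D): entropy of the subset sizes |D_i|, not of the class counts
    split_info = entropy([sum(c) for c in child_counts])
    return gain / split_info if split_info > 0 else 0.0

# [yes, no] class counts for the whole dataset and for the three age groups
print(round(gain_ratio([9, 6], [[2, 3], [3, 2], [4, 1]]), 3))  # 0.052
```

Dividing by the split information penalizes features with many distinct values, which plain information gain tends to favor.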

(3) Gini index of sample set D (CART)
$$Gini(D)=1-\sum_{k=1}^{K}{(\frac{|C_k|}{|D|})^2}$$
Gini index of set D given feature A, where A splits D into two subsets $D_1$ and $D_2$:
$$Gini(D,A)=\frac{|D_1|}{|D|}Gini(D_1)+\frac{|D_2|}{|D|}Gini(D_2)$$
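
The two Gini formulas can be checked with a few lines of Python. This is a sketch with illustrative function names. The counts assume the loan dataset used later in this chapter, split on whether the applicant owns a house (6 samples with a house, all of class 是; 9 without, of which 3 are 是 and 6 are 否).

```python
def gini(counts):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2 for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(d1_counts, d2_counts):
    """Gini(D, A): size-weighted Gini index of the two subsets D_1 and D_2."""
    n1, n2 = sum(d1_counts), sum(d2_counts)
    n = n1 + n2
    return n1 / n * gini(d1_counts) + n2 / n * gini(d2_counts)

# [yes, no] class counts
print(round(gini([9, 6]), 2))                # 0.48
print(round(gini_split([6, 0], [3, 6]), 2))  # 0.27
```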

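4. Decision tree generation. The feature selection criterion is usually maximum information gain, maximum information gain ratio, or minimum Gini index. Generation typically starts at the root node and builds the tree recursively by computing the information gain or another criterion. This amounts to repeatedly choosing the locally optimal feature under that criterion, or to splitting the training set into subsets that can be classified basically correctly.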

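5. Decision tree pruning. Because a generated decision tree tends to overfit, it needs to be pruned to simplify the learned tree. Pruning typically cuts some leaf nodes, or subtrees above the leaf nodes, from the generated tree and turns their parent node or root node into a new leaf node, thereby simplifying the generated tree.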
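
The text above does not say how to decide whether a cut is worthwhile. One common criterion, assumed here for illustration, is the cost $C_\alpha(T)=\sum_t N_t H_t(T)+\alpha|T|$, where $t$ ranges over the leaves, $N_t$ is the number of samples in leaf $t$, $H_t(T)$ is its empirical entropy, and $|T|$ is the number of leaves. Sibling leaves are merged into their parent when this does not increase the cost. The function names, the leaf counts and the values of `alpha` below are all hypothetical.

```python
from math import log2

def leaf_cost(counts):
    """N_t * H_t(T) for one leaf, given its class counts."""
    total = sum(counts)
    return -sum(c * log2(c / total) for c in counts if c > 0)

def should_prune(leaf_counts, alpha):
    """True if merging the sibling leaves into one leaf does not raise C_alpha."""
    before = sum(leaf_cost(c) for c in leaf_counts) + alpha * len(leaf_counts)
    merged = [sum(col) for col in zip(*leaf_counts)]  # class counts of the parent
    after = leaf_cost(merged) + alpha * 1
    return after <= before

leaves = [[6, 0], [3, 6]]  # hypothetical [yes, no] counts in two sibling leaves

print(should_prune(leaves, alpha=0.5))   # False
print(should_prune(leaves, alpha=10.0))  # True
```

A larger `alpha` puts more weight on the number of leaves, so it favors smaller trees. A smaller `alpha` favors a closer fit to the training data.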

In [112]:
import numpy as np
import pandas as pd
from math import log

In [113]:
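# Dataset
# Shape: number of samples x number of features
# Example 5.1 from the book
# Values stay in Chinese: 青年/中年/老年 = young/middle-aged/elderly,
# 是/否 = yes/no, 一般/好/非常好 = fair/good/very good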
def create_data():
    datasets = [['青年', '否', '否', '一般', '否'],
               ['青年', '否', '否', '好', '否'],
               ['青年', '是', '否', '好', '是'],
               ['青年', '是', '是', '一般', '是'],
               ['青年', '否', '否', '一般', '否'],
               ['中年', '否', '否', '一般', '否'],
               ['中年', '否', '否', '好', '否'],
               ['中年', '是', '是', '好', '是'],
               ['中年', '否', '是', '非常好', '是'],
               ['中年', '否', '是', '非常好', '是'],
               ['老年', '否', '是', '非常好', '是'],
               ['老年', '否', '是', '好', '是'],
               ['老年', '是', '否', '好', '是'],
               ['老年', '是', '否', '非常好', '是'],
               ['老年', '否', '否', '一般', '否'],
               ]
    # Column names: age, has a job, owns a house, credit rating, class
    labels = ['年龄', '有工作', '有自己的房子', '信贷情况', '类别']
    # Return the dataset and the name of each column
    return datasets, labels

In [114]:
datasets, labels = create_data()

In [115]:
train_data = pd.DataFrame(datasets, columns=labels)

In [116]:
train_data

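| | 年龄 | 有工作 | 有自己的房子 | 信贷情况 | 类别 |
|---|---|---|---|---|---|
| 0 | 青年 | 否 | 否 | 一般 | 否 |
| 1 | 青年 | 否 | 否 | 好 | 否 |
| 2 | 青年 | 是 | 否 | 好 | 是 |
| 3 | 青年 | 是 | 是 | 一般 | 是 |
| 4 | 青年 | 否 | 否 | 一般 | 否 |
| 5 | 中年 | 否 | 否 | 一般 | 否 |
| 6 | 中年 | 否 | 否 | 好 | 否 |
| 7 | 中年 | 是 | 是 | 好 | 是 |
| 8 | 中年 | 否 | 是 | 非常好 | 是 |
| 9 | 中年 | 否 | 是 | 非常好 | 是 |
| 10 | 老年 | 否 | 是 | 非常好 | 是 |
| 11 | 老年 | 否 | 是 | 好 | 是 |
| 12 | 老年 | 是 | 否 | 好 | 是 |
| 13 | 老年 | 是 | 否 | 非常好 | 是 |
| 14 | 老年 | 否 | 否 | 一般 | 否 |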


In [117]:
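# Empirical entropy of the class column (the last column)
# Note: np.log is the natural logarithm, so the result is in nats rather than
# the bits of the log_2 formula; ranking features by gain is unaffected.
def calc_ent(datasets):
    label_col = datasets[datasets.columns[-1]]
    categories = pd.unique(label_col)
    ans = 0
    for x in categories:
        # proportion of samples that belong to class x
        tmp = sum(label_col==x)/len(datasets)
        ans += -tmp*np.log(tmp)
    return ans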

In [118]:
calc_ent(train_data)

0.6730116670092565

In [119]:
# Empirical conditional entropy H(D|A) for the feature in column `axis`
def cond_ent(datasets, axis=0):
    features = datasets.columns
    sel_feature = features[axis]
    sel_feature_vals = pd.unique(datasets[sel_feature])
    ans = 0
    for x in sel_feature_vals:
        t_dataset = datasets[datasets[sel_feature]==x]
        # accumulate the weighted entropy of every subset D_i
        ans += len(t_dataset)/len(datasets) * calc_ent(t_dataset)
    return ans

In [120]:
round(cond_ent(train_data, 0), 4)

0.6155

In [121]:
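# Information gain g(D, A) = H(D) - H(D|A)
def info_gain(ent, cond_ent):
    return ent - cond_ent

def info_gain_train(datasets):
    # every column except the last (the class label) is a candidate feature
    count = len(datasets.columns) - 1
    ent = calc_ent(datasets)
    best_feature = []
    for c in range(count):
        c_info_gain = info_gain(ent, cond_ent(datasets, c))
        best_feature.append((c, c_info_gain))

    # return the (column index, information gain) pair with the largest gain
    return max(best_feature, key=lambda x: x[-1])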

In [122]:
# def calc_ent(datasets):
#         data_length = len(datasets)
#         label_count = {}
#         for i in range(data_length):
#             label = datasets[i][-1]
#             if label not in label_count:
#                 label_count[label] = 0
#             label_count[label] += 1
#         ent = -sum([(p / data_length) * log(p / data_length, 2)
#                     for p in label_count.values()])
#         return ent

# # Empirical conditional entropy
# def cond_ent(datasets, axis=0):
#     data_length = len(datasets)
#     feature_sets = {}
#     for i in range(data_length):
#         feature = datasets[i][axis]
#         if feature not in feature_sets:
#             feature_sets[feature] = []
#         feature_sets[feature].append(datasets[i])
#     cond_ent = sum([(len(p) / data_length) * calc_ent(p)
#                     for p in feature_sets.values()])
#     return cond_ent

# # Information gain
# def info_gain(ent, cond_ent):
#     return ent - cond_ent

# def info_gain_train(datasets):
#     count = len(datasets[0]) - 1
#     ent = calc_ent(datasets)
#     best_feature = []
#     for c in range(count):
#         c_info_gain = info_gain(ent, cond_ent(datasets, axis=c))
#         best_feature.append((c, c_info_gain))
#     # Pick the feature with the largest gain
#     best_ = max(best_feature, key=lambda x: x[-1])
#     return best_

In [123]:
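class Node:
    def __init__(self, root=True, label=None, feature_name=None, feature=None):
        # root=True marks a single-node tree, i.e. a leaf that carries a class label
        self.root = root
        self.label = label
        self.feature_name = feature_name
        # column index of the splitting feature, relative to the columns that
        # remain at this node (train() drops a feature once it has been used)
        self.feature = feature
        self.tree = {}
        self.result = {
            'label': self.label,
            'feature': self.feature,
            'tree': self.tree
        }
        
    def __repr__(self):
        return '{}'.format(self.result)
    
    def add_node(self, val, node):
        # attach the subtree for feature value `val`
        self.tree[val] = node
    
    def predict(self, features):
        if self.root is True:
            return self.label
        val = features[self.feature]
        # drop the used feature so that child indices match the reduced columns
        rest = list(features[:self.feature]) + list(features[self.feature + 1:])
        return self.tree[val].predict(rest)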

In [128]:
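def train(train_data):
    """
    Input: dataset D as a DataFrame whose last column is the class label
    Output: decision tree T (a Node)
    """
    y_train = train_data.iloc[:, -1]
    features = train_data.columns[:-1]
    
    # 1. All instances belong to one class: return a single-node tree with that class
    if len(y_train.value_counts())==1:
        return Node(root=True, label=y_train.iloc[0])
    # 2. No features left to choose from: return a single-node tree with the majority class
    if len(features) == 0:
        return Node(root=True, label=y_train.value_counts().sort_values(ascending=False).index[0])
    
    # 3. Find the feature Ag with the largest information gain
    max_feature, max_info_gain = info_gain_train(train_data)
    max_feature_name = features[max_feature]
    
    # 4. If the gain of Ag is below the threshold (0.1), return a single-node tree
    #    with the majority class
    if max_info_gain < 0.1:
        return Node(root=True, label=y_train.value_counts().sort_values(ascending=False).index[0])
    
    # 5. Split D into subsets, one per value of Ag, and drop the Ag column
    node_tree = Node(root=False, feature_name=max_feature_name, feature=max_feature)
    feature_list = train_data[max_feature_name].value_counts().index
    for f in feature_list:
        sub_train_df = train_data.loc[train_data[max_feature_name]==f].drop([max_feature_name], axis=1)
     
        # 6. Build the subtree recursively
        sub_tree = train(sub_train_df)
        node_tree.add_node(f, sub_tree)
    
    return node_tree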

In [129]:
def fit(train_data):
    return train(train_data)

In [130]:
tree = fit(train_data)

In [131]:
tree

{'label': None, 'feature': 2, 'tree': {'否': {'label': None, 'feature': 1, 'tree': {'否': {'label': '否', 'feature': None, 'tree': {}}, '是': {'label': '是', 'feature': None, 'tree': {}}}}, '是': {'label': '是', 'feature': None, 'tree': {}}}}

In [132]:
tree.predict(['老年', '否', '否', '一般'])

'否'
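
As noted at the start of the chapter, a decision tree can be read as a set of if-then rules, one per path from the root to a leaf. The sketch below illustrates this on a small hand-written tree in a simplified nested-dict form. It does not use the `Node` objects above; the `tree_dict` layout and the function name `tree_to_rules` are assumptions made for the example.

```python
def tree_to_rules(node, conditions=()):
    """Yield one (conditions, label) pair per root-to-leaf path."""
    if not isinstance(node, dict):  # a leaf holds the class label directly
        yield list(conditions), node
        return
    for value, child in node["children"].items():
        yield from tree_to_rules(child, conditions + ((node["feature"], value),))

# Illustrative tree: split on "owns a house" (有自己的房子), then on "has a job" (有工作)
tree_dict = {
    "feature": "有自己的房子",
    "children": {
        "是": "是",
        "否": {"feature": "有工作", "children": {"是": "是", "否": "否"}},
    },
}

for conds, label in tree_to_rules(tree_dict):
    clause = " and ".join(f"{name} == {value!r}" for name, value in conds)
    print(f"if {clause} then class = {label!r}")
```

Each printed line is one rule. The rules are mutually exclusive and together cover every path through the tree.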