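Given a sample set $D$ and a continuous feature $a$, suppose $a$ takes $n$ distinct values on $D$. Sort these values in ascending order and denote them $\{ a_1, a_2, \dots, a_n \}$. A split point $t$ divides $D$ into the subsets $D_t^{-}$ and $D_t^{+}$, where $D_t^{-}$ contains the samples whose value of $a$ is no greater than $t$, and $D_t^{+}$ contains the samples whose value of $a$ is greater than $t$. Clearly, for adjacent values $a_i$ and $a_{i+1}$, every $t$ in the interval $[a_i, a_{i+1})$ produces the same partition. For a continuous feature $a$ we can therefore consider the set of $n-1$ candidate split points    
$$ T_a=\left\{ \frac{a_i + a_{i+1}}{2} \;\middle|\; 1 \leq i \leq n-1 \right\} $$   
that is, the midpoint $\frac{a_i + a_{i+1}}{2}$ of the interval $[a_i, a_{i+1})$ is taken as a candidate split point. We can then evaluate these split points just as we would the values of a discrete feature, and choose the best one to partition the sample set.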
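
As a minimal sketch of the construction above, the candidate set $T_a$ can be computed by taking the midpoints of consecutive distinct values. The helper name `candidate_split_points` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def candidate_split_points(values):
    """Return the midpoints between consecutive distinct values (hypothetical helper)."""
    distinct = np.unique(np.asarray(values, dtype=float))  # sorted and deduplicated
    return (distinct[:-1] + distinct[1:]) / 2  # n distinct values give n - 1 midpoints

# Illustrative values; 0.697 appears twice, so there are 4 distinct values.
points = candidate_split_points([0.697, 0.774, 0.634, 0.608, 0.697])
print(points)
```

Duplicates are removed before the midpoints are taken, so the number of candidates depends on the number of distinct values rather than on the number of samples.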

In [11]:
from collections import Counter
import numpy as np
import pandas as pd

In [12]:
string = """编号,色泽,根蒂,敲声,纹理,脐部,触感,密度,含糖率,好瓜
1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,0.697,0.46,是
2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,0.774,0.376,是
3,乌黑,蜷缩,浊响,清晰,凹陷,硬滑,0.634,0.264,是
4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,0.608,0.318,是
5,浅白,蜷缩,浊响,清晰,凹陷,硬滑,0.556,0.215,是
6,青绿,稍蜷,浊响,清晰,稍凹,软粘,0.403,0.237,是
7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,0.481,0.149,是
8,乌黑,稍蜷,浊响,清晰,稍凹,硬滑,0.437,0.211,是
9,乌黑,稍蜷,沉闷,稍糊,稍凹,硬滑,0.666,0.091,否
10,青绿,硬挺,清脆,清晰,平坦,软粘,0.243,0.267,否
11,浅白,硬挺,清脆,模糊,平坦,硬滑,0.245,0.057,否
12,浅白,蜷缩,浊响,模糊,平坦,软粘,0.343,0.099,否
13,青绿,稍蜷,浊响,稍糊,凹陷,硬滑,0.639,0.161,否
14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,0.657,0.198,否
15,乌黑,稍蜷,浊响,清晰,稍凹,软粘,0.36,0.37,否
16,浅白,蜷缩,浊响,模糊,平坦,硬滑,0.593,0.042,否
17,青绿,蜷缩,沉闷,稍糊,稍凹,硬滑,0.719,0.103,否"""

In [13]:
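def out_df(string):
    """Convert a comma-separated string into a DataFrame (all values stay strings)."""
    lst = list()
    for i in string.split('\n'):
        lst.append(i.split(','))  # one list of fields per line
    arr = np.array(lst)

    return pd.DataFrame(arr[1:], columns=arr[0])  # the first row is the header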

In [14]:
df = out_df(string) 
df

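|   | 编号 | 色泽 | 根蒂 | 敲声 | 纹理 | 脐部 | 触感 | 密度 | 含糖率 | 好瓜 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 青绿 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.697 | 0.46 | 是 |
| 1 | 2 | 乌黑 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 0.774 | 0.376 | 是 |
| 2 | 3 | 乌黑 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.634 | 0.264 | 是 |
| 3 | 4 | 青绿 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 0.608 | 0.318 | 是 |
| 4 | 5 | 浅白 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.556 | 0.215 | 是 |
| 5 | 6 | 青绿 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 软粘 | 0.403 | 0.237 | 是 |
| 6 | 7 | 乌黑 | 稍蜷 | 浊响 | 稍糊 | 稍凹 | 软粘 | 0.481 | 0.149 | 是 |
| 7 | 8 | 乌黑 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 硬滑 | 0.437 | 0.211 | 是 |
| 8 | 9 | 乌黑 | 稍蜷 | 沉闷 | 稍糊 | 稍凹 | 硬滑 | 0.666 | 0.091 | 否 |
| 9 | 10 | 青绿 | 硬挺 | 清脆 | 清晰 | 平坦 | 软粘 | 0.243 | 0.267 | 否 |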


In [15]:
def info_gain(dataframe, feature_name):
    """Information gain from splitting `dataframe` on the feature `feature_name`."""
    def calc_ent(dataframe):  # local helper
        """Entropy of the class labels (the last column)."""
        counter = Counter(dataframe.iloc[:, -1])  # sample count per class
        pro_vector = np.array(list(counter.values())) / len(dataframe)  # class proportions
        res = -pro_vector @ np.log2(pro_vector)

        return res

    def calc_cond_ent(dataframe, feature_name):
        """Conditional entropy of the class labels given `feature_name`."""
        counter = Counter(dataframe.loc[:, feature_name])  # sample count per feature value
        data_length = len(dataframe)
        pro_vector = np.array(list(counter.values())) / data_length  # proportion of each feature value
        hd_vector = list()
        for i in counter.keys():
            # entropy of the subset in which the feature takes the value i
            subset = dataframe.loc[(dataframe.loc[:, feature_name] == i).values]
            hd_vector.append(calc_ent(subset))
        result = pro_vector @ hd_vector  # conditional entropy

        return result

    return calc_ent(dataframe) - calc_cond_ent(dataframe, feature_name)  # information gain
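
To sanity-check the entropy term used by `info_gain`, the sketch below computes the entropy of the full dataset by hand, assuming the 17 samples listed earlier with 8 positive (是) and 9 negative (否) labels. The helper name `entropy` is hypothetical and written only for this check.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels (hypothetical helper)."""
    _, counts = np.unique(labels, return_counts=True)  # sample count per class
    p = counts / counts.sum()                          # class proportions
    return float(-(p * np.log2(p)).sum())

# Assumed label distribution of the dataset above: 8 good melons, 9 bad ones.
labels = ["是"] * 8 + ["否"] * 9
root_entropy = entropy(labels)
print(round(root_entropy, 3))  # approximately 0.998
```

Because the two classes are nearly balanced, the value is close to the maximum of 1 bit for a binary label.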
    

In [16]:
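def contin_del(df, feature):
    """Binarize a continuous feature at every candidate split point."""
    ser = df.loc[:, feature].astype(float).values  # feature values as floats
    val = sorted(set(ser))  # distinct values in ascending order
    ser_div = list()  # one 0/1 encoding of the feature per split point
    div_point = list()  # candidate split points
    for i in range(len(val)-1):
        point = float("{:.3f}".format((val[i+1] + val[i])/2))  # midpoint, rounded to 3 decimals
        div_point.append(point)
        # code 0: value <= point, code 1: value > point
        ser_div.append(pd.cut(ser, [-float('inf'), point, float('inf')]).codes)

    return ser_div, div_point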
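
The binarization step inside `contin_del` relies on `pd.cut` with two bins. The sketch below, which uses illustrative values and a hypothetical split point of 0.5, suggests that the resulting codes match a plain `value > point` comparison, since the default bins are closed on the right.

```python
import numpy as np
import pandas as pd

ser = np.array([0.697, 0.243, 0.403, 0.774])  # illustrative density values
point = 0.5                                   # hypothetical split point

# Bins are (-inf, point] and (point, inf), giving codes 0 and 1 respectively.
codes = pd.cut(ser, [-float('inf'), point, float('inf')]).codes
mask = (ser > point).astype(int)              # the same split written as a comparison
print(codes, mask)
```

A value exactly equal to the split point falls into the left bin, which agrees with the definition of $D_t^{-}$ as the samples whose value is no greater than $t$.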

In [17]:
info_gain(df, '色泽') # information gain of the feature 色泽 (color)

0.10812516526536531

In [18]:
info_gain(df, '纹理')

0.3805918973682686

In [19]:
density, point = contin_del(df, '密度')
density_df = pd.DataFrame(np.array(density).T)
density_df['好瓜'] = df['好瓜']
density_df

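|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 好瓜 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 是 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 是 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 是 |
| 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 是 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 是 |
| 5 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 是 |
| 6 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 是 |
| 7 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 是 |
| 8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 否 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 否 |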


In [20]:
for i in density_df.columns[:-1]:
    print('Split point:', point[list(density_df.columns).index(i)],
          'information gain:', info_gain(density_df, i)) # information gain of the (continuous) density feature at each split point

Split point: 0.244 information gain: 0.05632607578088
Split point: 0.294 information gain: 0.1179805181500242
Split point: 0.352 information gain: 0.18613819904679052
Split point: 0.382 information gain: 0.2624392604045632
Split point: 0.42 information gain: 0.0934986902367243
Split point: 0.459 information gain: 0.03020211515891169
Split point: 0.518 information gain: 0.003585078590305879
Split point: 0.575 information gain: 0.002226985278291793
Split point: 0.601 information gain: 0.002226985278291793
Split point: 0.621 information gain: 0.003585078590305879
Split point: 0.637 information gain: 0.03020211515891169
Split point: 0.648 information gain: 0.006046489176565584
Split point: 0.661 information gain: 0.0007697888924075302
Split point: 0.681 information gain: 0.024085993037174735
Split point: 0.708 information gain: 0.00033345932649475607
Split point: 0.746 information gain: 0.06696192680347068
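
Choosing the best split point then amounts to taking the candidate with the largest information gain. The sketch below illustrates this on a handful of (split point, gain) pairs copied from the output above; the dictionary `gains` and the variable `best_point` are hypothetical names introduced for illustration.

```python
# A few (split point -> information gain) pairs copied from the printed results.
gains = {
    0.352: 0.18613819904679052,
    0.382: 0.2624392604045632,
    0.42: 0.0934986902367243,
    0.746: 0.06696192680347068,
}

# Pick the split point whose information gain is largest.
best_point = max(gains, key=gains.get)
print(best_point, gains[best_point])
```

In the full list of 16 candidates the largest gain also occurs at 0.382, so that point would be used to split on density.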
