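Two problems need to be solved:   
1. How do we select the splitting feature when some feature values are missing?
2. Given a splitting feature, how do we assign a sample to a child node if its value for that feature is missing?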

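&emsp;&emsp;Given a dataset $D$ and a feature $a$, let $ \tilde{D} $ denote the subset of samples in $D$ that have no missing value for feature $a$. For problem 1, the quality of feature $a$ can clearly be judged only from $ \tilde{D}$.
Suppose feature $a$ has $V$ possible values $ \{ a^1, a^2, \dots,a^V \} $. Let $\tilde{D}^v$ denote the subset of $\tilde{D}$ whose value for feature $a$ is $a^v$,
and let $\tilde{D}_{k}$ denote the subset of $ \tilde{D} $ belonging to class $k$ $(k=1,2,\dots,|\mathcal{Y}|)$.
Then $ \tilde{D} = \cup_{k=1}^{|\mathcal{Y}|} \tilde{D}_{k}$ and $\tilde{D} = \cup_{v=1}^{V} \tilde{D}^{v}$.
Suppose we assign each sample $\mathbf{x}$ a weight $w_\mathbf{x}$ (at the start of decision tree learning, the weight of every sample in the root node is initialized to 1), and define       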

\begin{align}
\rho &= \frac{\sum_{\mathbf{x} \in \tilde{D}} w_\mathbf{x}}{\sum_{\mathbf{x} \in D} w_\mathbf{x}} \\
\tilde{p}_k &= \frac{\sum_{\mathbf{x} \in \tilde{D}_k} w_\mathbf{x}}{\sum_{\mathbf{x} \in \tilde{D}} w_\mathbf{x}} \quad (1 \leq k \leq |\mathcal{Y}|) \\
\tilde{r}_v &= \frac{\sum_{\mathbf{x} \in \tilde{D}^v} w_\mathbf{x}}{\sum_{\mathbf{x} \in \tilde{D}} w_\mathbf{x}} \quad (1 \leq v \leq V) \\
\end{align}    
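Intuitively, for feature $a$, $\rho$ is the proportion of samples with no missing value, $\tilde{p}_k$ is the proportion of class $k$ among the samples with no missing value, and $\tilde{r}_v $ is
the proportion of samples taking value $ a^v $ for feature $a$ among the samples with no missing value. All three proportions are measured by weight. Clearly, $\sum_{k=1}^{|\mathcal{Y}|} \tilde{p}_k =1$ and $\sum_{v=1}^{V}\tilde{r}_{v}=1$.    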
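
The three quantities can be computed directly from weight sums. The following is a minimal sketch on a hypothetical toy node; the tuple layout `(value, label, weight)` and the names `samples`, `rho`, `p_tilde` and `r_tilde` are illustrative assumptions, not part of the notebook's code.

```python
from collections import defaultdict

# Hypothetical toy node: (value of feature a, class label, weight w_x).
# None marks a missing value. The weights are deliberately not all 1,
# as happens below the root once samples have been split fractionally.
samples = [
    ("x", "pos", 1.0),
    ("x", "neg", 1.0),
    ("y", "pos", 0.5),
    (None, "pos", 1.0),
    ("y", "neg", 0.5),
]

known = [s for s in samples if s[0] is not None]  # D-tilde: value of a is known
w_all = sum(w for _, _, w in samples)
w_known = sum(w for _, _, w in known)

rho = w_known / w_all  # weighted share of samples with a known value

class_weight = defaultdict(float)  # total weight per class within D-tilde
value_weight = defaultdict(float)  # total weight per feature value within D-tilde
for value, label, w in known:
    class_weight[label] += w
    value_weight[value] += w

p_tilde = {k: w / w_known for k, w in class_weight.items()}
r_tilde = {v: w / w_known for v, w in value_weight.items()}

print(rho, p_tilde, r_tilde)
```

Because every ratio uses weight sums rather than row counts, the same code applies unchanged at the root (all weights 1) and at deeper nodes.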
&emsp;&emsp;Based on the definitions above, the formula for information gain generalizes to:   
\begin{align}
\mathrm{Gain}(D, a) &= \rho \times \mathrm{Gain}(\tilde{D}, a) \\
                    &= \rho \times \left( H(\tilde{D}) -  \sum_{v=1}^{V} \tilde{r}_v H(\tilde{D}^v)  \right)
\end{align}     
where $$ H(\tilde{D}) = -\sum_{k=1}^{|\mathcal{Y}|} \tilde{p}_k \log_2 \tilde{p}_k  $$   
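&emsp;&emsp;For problem 2, if the value of sample $\mathbf{x}$ for the splitting feature $a$ is known, $\mathbf{x}$ is assigned to the child node corresponding to that value, and its weight in the child node remains $w_\mathbf{x}$. If the value of $\mathbf{x}$ for
the splitting feature $a$ is unknown, $\mathbf{x}$ is assigned to all child nodes simultaneously, and its weight in the child node corresponding to value $a^v$ is adjusted to $\tilde{r}_v \cdot w_\mathbf{x}$.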
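
The splitting rule can be sketched in plain Python as follows. The helper `split_with_missing` and the `(value, label, weight)` tuple layout are hypothetical and only illustrate the rule; the sketch assumes at least one sample in the node has a known value.

```python
from collections import defaultdict

def split_with_missing(samples):
    """Split (value, label, weight) samples on the feature value.

    A sample with a known value goes to the matching child with its weight
    unchanged. A sample whose value is None goes to every child, with its
    weight scaled by that child's r_v.
    """
    known = [s for s in samples if s[0] is not None]
    missing = [s for s in samples if s[0] is None]
    w_known = sum(w for _, _, w in known)  # assumed to be > 0

    value_weight = defaultdict(float)
    for value, _, w in known:
        value_weight[value] += w
    r_tilde = {v: w / w_known for v, w in value_weight.items()}

    children = {v: [s for s in known if s[0] == v] for v in r_tilde}
    for v, r in r_tilde.items():
        children[v].extend((None, label, r * w) for _, label, w in missing)
    return children

# Hypothetical node: three samples with known values, one with a missing value
samples = [("x", "pos", 1.0), ("x", "neg", 1.0), ("y", "pos", 1.0), (None, "pos", 1.0)]
children = split_with_missing(samples)
for value, child in children.items():
    print(value, child)
```

Since $\sum_{v} \tilde{r}_v = 1$, the fractions of a sample with a missing value add up to its original weight, so the total weight over all child nodes equals the total weight of the parent.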

In [2]:
import numpy as np
import pandas as pd
from collections import Counter

In [3]:
string = """编号,色泽,根蒂,敲声,纹理,脐部,触感,好瓜
1,???,蜷缩,浊响,清晰,凹陷,硬滑,是
2,乌黑,蜷缩,沉闷,清晰,凹陷,???,是
3,乌黑,蜷缩,???,清晰,凹陷,硬滑,是
4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,是
5,???,蜷缩,浊响,清晰,凹陷,硬滑,是
6,青绿,稍蜷,浊响,清晰,???,软粘,是
7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,是
8,乌黑,稍蜷,浊响,???,稍凹,硬滑,是
9,乌黑,???,沉闷,稍糊,稍凹,硬滑,否
10,青绿,硬挺,清脆,???,平坦,软粘,否
11,浅白,硬挺,清脆,模糊,平坦,???,否
12,浅白,蜷缩,???,模糊,平坦,软粘,否
13,???,稍蜷,浊响,稍糊,凹陷,硬滑,否
14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,否
15,乌黑,稍蜷,浊响,清晰,???,软粘,否
16,浅白,蜷缩,浊响,模糊,平坦,硬滑,否
17,青绿,???,沉闷,稍糊,稍凹,硬滑,否"""

In [4]:
def out_df(string):
    """Convert the comma-separated string into a DataFrame."""
    lst = list()
    for i in string.split('\n'):
        lst.append(i.split(','))
    arr = np.array(lst)
    frame = pd.DataFrame(arr[1:], columns=arr[0])
    frame.replace('???', np.nan, inplace=True)
    frame['权重'] = 1 # the weight of every sample in the root node is initialized to 1
    
    return frame

In [5]:
df = out_df(string) 
df

Unnamed: 0,编号,色泽,根蒂,敲声,纹理,脐部,触感,好瓜,权重
0,1,,蜷缩,浊响,清晰,凹陷,硬滑,是,1
1,2,乌黑,蜷缩,沉闷,清晰,凹陷,,是,1
2,3,乌黑,蜷缩,,清晰,凹陷,硬滑,是,1
3,4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,是,1
4,5,,蜷缩,浊响,清晰,凹陷,硬滑,是,1
5,6,青绿,稍蜷,浊响,清晰,,软粘,是,1
6,7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,是,1
7,8,乌黑,稍蜷,浊响,,稍凹,硬滑,是,1
8,9,乌黑,,沉闷,稍糊,稍凹,硬滑,否,1
9,10,青绿,硬挺,清脆,,平坦,软粘,否,1


In [6]:
df_dropna = df[['编号', '纹理', '好瓜']].dropna(axis=0)
df_dropna

Unnamed: 0,编号,纹理,好瓜
0,1,清晰,是
1,2,清晰,是
2,3,清晰,是
3,4,清晰,是
4,5,清晰,是
5,6,清晰,是
6,7,稍糊,是
8,9,稍糊,否
10,11,模糊,否
11,12,模糊,否


In [7]:
p = len(df_dropna) / len(df) # rho: share of samples with no missing value (all weights are 1 here)
p

0.8823529411764706

In [8]:
def info_gain(dataframe, feature_name):
    """Information gain of splitting dataframe on feature_name.

    The class label is assumed to be the last column. Rows are counted,
    so the result matches the weighted definition only when all weights are 1.
    """
    def calc_ent(dataframe): # local helper
        """Compute the entropy of the class labels."""
        counter = Counter(dataframe.iloc[:, -1]) # number of samples in each class
        pro_vector = np.array(list(counter.values())) / len(dataframe) # proportion of each class
        res =  - pro_vector @ np.log2(pro_vector)

        return  res
    
    def calc_cond_ent(dataframe, feature_name):
        """Compute the conditional entropy given feature_name."""
        counter = Counter(dataframe.loc[:, feature_name]) # number of samples for each value of feature_name
        data_length = len(dataframe)
        pro_vector = np.array(list(counter.values())) / data_length # proportion of each value of feature_name
        hd_vector = list()
        for i in counter.keys():
            # entropy of the subset whose value of feature_name equals i
            hd_vector.append(calc_ent(dataframe[dataframe[feature_name] == i]))
        result = pro_vector @ hd_vector # conditional entropy
    
        return result
    
    return calc_ent(dataframe) - calc_cond_ent(dataframe, feature_name) # information gain

In [9]:
info_gain(df_dropna, '纹理') * p # generalized information gain: rho * Gain(D~, a)

0.42356026795361434
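
The `info_gain` function above counts rows, which agrees with the weighted definition only at the root, where every weight is 1. Below is a minimal weight-aware sketch in plain Python. The helpers `weighted_entropy` and `generalized_gain` and the `(value, label, weight)` tuple layout are hypothetical; the `texture` and `label` lists restate the 纹理 and 好瓜 columns of the 17 samples above.

```python
from collections import defaultdict
from math import log2

def weighted_entropy(samples):
    """Entropy of the class labels, using weight sums instead of row counts."""
    total = sum(w for _, _, w in samples)
    class_weight = defaultdict(float)
    for _, label, w in samples:
        class_weight[label] += w
    return -sum((w / total) * log2(w / total) for w in class_weight.values() if w > 0)

def generalized_gain(samples):
    """rho * (H(D~) - sum_v r_v * H(D~^v)) for (value, label, weight) samples."""
    known = [s for s in samples if s[0] is not None]  # D-tilde
    w_known = sum(w for _, _, w in known)
    rho = w_known / sum(w for _, _, w in samples)

    by_value = defaultdict(list)  # D-tilde^v for each value v
    for s in known:
        by_value[s[0]].append(s)

    cond_ent = sum(
        (sum(w for _, _, w in subset) / w_known) * weighted_entropy(subset)
        for subset in by_value.values()
    )
    return rho * (weighted_entropy(known) - cond_ent)

# 纹理 and 好瓜 of samples 1..17; None marks the two missing values (samples 8 and 10)
texture = ["清晰", "清晰", "清晰", "清晰", "清晰", "清晰", "稍糊", None, "稍糊",
           None, "模糊", "模糊", "稍糊", "稍糊", "清晰", "模糊", "稍糊"]
label = ["是"] * 8 + ["否"] * 9
root = [(t, y, 1.0) for t, y in zip(texture, label)]  # root node: all weights are 1

print(generalized_gain(root))
```

With all weights equal to 1, this should agree with the value printed above up to floating-point rounding; with fractional weights it follows the weighted definitions of $\rho$, $\tilde{p}_k$ and $\tilde{r}_v$ instead of row counts.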

In [10]:
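def weight(dataframe, feature, name):
    """Return the samples sent to the child node where feature == name, with updated weights."""
    # Number of samples taking each (non-NaN) value of feature.
    # The count ratio equals r_v here because all weights are 1 at the root.
    feature_dropna = dict(Counter(dataframe[feature].dropna()))
    # Samples whose value of feature is missing; copy so the caller's frame is not modified
    na_part = dataframe[dataframe[feature].isna()].copy()
    # Samples 8 and 10 are missing the attribute "纹理", so each enters all three branches,
    # with its weight adjusted to 7/15, 5/15 and 3/15 in the three child nodes respectively
    na_part['权重'] = na_part['权重'] * \
                    (feature_dropna[name] / sum(feature_dropna.values())) # update the weights
    name_part = dataframe[dataframe[feature] == name]
    
    return  pd.concat([na_part, name_part])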
    

In [15]:
weight(df, '纹理', '清晰')

Unnamed: 0,编号,色泽,根蒂,敲声,纹理,脐部,触感,好瓜,权重
7,8,乌黑,稍蜷,浊响,,稍凹,硬滑,是,0.466667
9,10,青绿,硬挺,清脆,,平坦,软粘,否,0.466667
0,1,,蜷缩,浊响,清晰,凹陷,硬滑,是,1.0
1,2,乌黑,蜷缩,沉闷,清晰,凹陷,,是,1.0
2,3,乌黑,蜷缩,,清晰,凹陷,硬滑,是,1.0
3,4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,是,1.0
4,5,,蜷缩,浊响,清晰,凹陷,硬滑,是,1.0
5,6,青绿,稍蜷,浊响,清晰,,软粘,是,1.0
14,15,乌黑,稍蜷,浊响,清晰,,软粘,否,1.0


In [16]:
weight(df, '纹理', '模糊')

Unnamed: 0,编号,色泽,根蒂,敲声,纹理,脐部,触感,好瓜,权重
7,8,乌黑,稍蜷,浊响,,稍凹,硬滑,是,0.2
9,10,青绿,硬挺,清脆,,平坦,软粘,否,0.2
10,11,浅白,硬挺,清脆,模糊,平坦,,否,1.0
11,12,浅白,蜷缩,,模糊,平坦,软粘,否,1.0
15,16,浅白,蜷缩,浊响,模糊,平坦,硬滑,否,1.0


In [17]:
weight(df, '纹理', '稍糊')

Unnamed: 0,编号,色泽,根蒂,敲声,纹理,脐部,触感,好瓜,权重
7,8,乌黑,稍蜷,浊响,,稍凹,硬滑,是,0.333333
9,10,青绿,硬挺,清脆,,平坦,软粘,否,0.333333
6,7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,是,1.0
8,9,乌黑,,沉闷,稍糊,稍凹,硬滑,否,1.0
12,13,,稍蜷,浊响,稍糊,凹陷,硬滑,否,1.0
13,14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,否,1.0
16,17,青绿,,沉闷,稍糊,稍凹,硬滑,否,1.0
