# Feature Importance Algorithms

## Information Gain: Formulas

* Definition of entropy:
    * The entropy of the variable $ Y $ measures its uncertainty:
    $$ 
    P\left(Y=y_{j}\right)=p_{j}, \quad j=1,2, \cdots, n
    $$
    $$
    H(Y)=-\sum_{j=1}^{n} p_{j} \log p_{j}
    $$
    
* Definition of conditional entropy:
    * The uncertainty of $ Y $ given that $ X $ is known, where $ p_{i}=P\left(X=x_{i}\right) $:
    $$ 
    P\left(X=x_{i}, Y=y_{j}\right)=p_{i j}, \quad i=1,2, \cdots, n ; \quad j=1,2, \cdots, m
    $$
    $$ 
    H(Y | X)=\sum_{i=1}^{n} p_{i} H\left(Y | X=x_{i}\right)
    $$

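* Procedure for computing the information gain
    1. Compute the empirical entropy of the dataset D, i.e. the entropy of $ y $, where $ C_{k} $ is the set of samples in class $ k $
    $$ 
    H(D)=-\sum_{k=1}^{K} \frac{\left|C_{k}\right|}{|D|} \log _{2} \frac{\left|C_{k}\right|}{|D|}
    $$
    2. Compute the empirical conditional entropy of D given feature A, i.e. the entropy of $ y $ under each value of $ x $, weighted by the share of samples taking that value ($ D_{i} $ is the subset where A takes its $ i $-th value, and $ D_{i k} $ is the part of $ D_{i} $ in class $ k $)
    $$ 
    H(D | A)=\sum_{i=1}^{n} \frac{\left|D_{i}\right|}{|D|} H\left(D_{i}\right)=-\sum_{i=1}^{n} \frac{\left|D_{i}\right|}{|D|} \sum_{k=1}^{K} \frac{\left|D_{i k}\right|}{\left|D_{i}\right|} \log _{2} \frac{\left|D_{i k}\right|}{\left|D_{i}\right|}
    $$
    3. Take the difference to obtain the gain
    $$ 
    g(D, A)=H(D)-H(D | A)
    $$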
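
The three steps above can be sketched compactly with NumPy. This is a minimal illustration under stated assumptions, not the notebook's own implementation: the function names `empirical_entropy` and `information_gain` and the toy arrays are made up for this example.

```python
import numpy as np

def empirical_entropy(labels):
    """Step 1: H(D), the base-2 entropy of the empirical class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(feature, labels):
    """Step 3: g(D, A) = H(D) - H(D|A) for one discrete feature."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    conditional = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # Step 2: accumulate |D_i| / |D| * H(D_i)
        conditional += mask.mean() * empirical_entropy(labels[mask])
    return empirical_entropy(labels) - conditional

# Hypothetical toy data: the feature separates the two classes perfectly,
# so the gain equals the full entropy of the labels.
toy_feature = np.array([0, 0, 1, 1])
toy_labels = np.array([0, 0, 1, 1])
print(information_gain(toy_feature, toy_labels))
```

A feature that carries no information about the labels, such as `[0, 1, 0, 1]` against the same labels, yields a gain of zero under this sketch.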


In [48]:
import numpy as np
import math
'''
Compute the entropy (base 2) of the values in y_values.
'''
def entropy(y_values):
    e = 0
    unique_vals = np.unique(y_values)
    for val in unique_vals:
        p = np.sum(y_values == val)/len(y_values)
        e += (p * math.log(p, 2))
    return -1 * e

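'''
Compute the information gain H(Y) - H(Y|X).
Note: despite its name, this function returns the gain,
not the conditional entropy itself.
'''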
def entropy_condition(x_values, y_values):
    ey = entropy(y_values)
    ey_condition = 0
    xy = np.hstack((x_values, y_values))
    unique_x = np.unique(x_values)
    for x_val in unique_x:
        px = np.sum(x_values == x_val) / len(x_values)
        xy_condition_x = xy[np.where(xy[:, 0] == x_val)]
        ey_condition_x = entropy(xy_condition_x[:, 1])
        ey_condition += (px * ey_condition_x)
    return ey - ey_condition

'''
Information gain ratio: corrects the bias of information gain
toward features that take many distinct values.
'''
def entropy_condition_ratio(x_values, y_values):
    return entropy_condition(x_values, y_values) / entropy(x_values)

* Using the example on p. 62 of the book as a test; the values below are the information gains of A1, A2, A3, and A4 in turn

In [49]:
xy = np.array([[0,0,0,0,0,1,1,1,1,1,2,2,2,2,2], [0,0,1,1,0,0,0,1,0,0,0,0,1,1,0], [0,0,0,1,0,0,0,1,1,1,1,1,0,0,0], 
             [0,1,1,0,0,0,1,1,2,2,2,1,1,2,0], [0,0,1,1,0,0,0,1,1,1,1,1,1,1,0]]).T
#A1
print(entropy_condition(xy[:, 0].reshape(-1, 1), 
                        xy[:, -1].reshape(-1, 1)))
#A2
print(entropy_condition(xy[:, 1].reshape(-1, 1), 
                        xy[:, -1].reshape(-1, 1)))

#A3
print(entropy_condition(xy[:, 2].reshape(-1, 1), 
                        xy[:, -1].reshape(-1, 1)))

#A4
print(entropy_condition(xy[:, 3].reshape(-1, 1), 
                        xy[:, -1].reshape(-1, 1)))


0.08300749985576883
0.32365019815155627
0.4199730940219749
0.36298956253708536


* These values agree with the results in the book
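
As a check on the first printed value, the gain of A1 can be worked out by hand from the array above. The label column contains 9 ones and 6 zeros, and A1 splits the 15 samples into three groups of 5 that contain 2, 3, and 4 positive labels respectively:

$$
H(D)=-\frac{9}{15} \log _{2} \frac{9}{15}-\frac{6}{15} \log _{2} \frac{6}{15} \approx 0.971
$$
$$
H\left(D | A_{1}\right)=\frac{5}{15}(0.971)+\frac{5}{15}(0.971)+\frac{5}{15}(0.722) \approx 0.888
$$
$$
g\left(D, A_{1}\right) \approx 0.971-0.888=0.083
$$

The first two groups have class proportions 2/5 and 3/5, which give the same entropy of about 0.971; the third has proportions 4/5 and 1/5, which give about 0.722.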

In [50]:
xy = np.array([[0,0,0,0,0,1,1,1,1,1,2,2,2,2,2], [0,0,1,1,0,0,0,1,0,0,0,0,1,1,0], [0,0,0,1,0,0,0,1,1,1,1,1,0,0,0], 
             [0,1,1,0,0,0,1,1,2,2,2,1,1,2,0], [0,0,1,1,0,0,0,1,1,1,1,1,1,1,0]]).T
#A1
print(entropy_condition_ratio(xy[:, 0].reshape(-1, 1), 
                        xy[:, -1].reshape(-1, 1)))
#A2
print(entropy_condition_ratio(xy[:, 1].reshape(-1, 1), 
                        xy[:, -1].reshape(-1, 1)))

#A3
print(entropy_condition_ratio(xy[:, 2].reshape(-1, 1), 
                        xy[:, -1].reshape(-1, 1)))

#A4
print(entropy_condition_ratio(xy[:, 3].reshape(-1, 1), 
                        xy[:, -1].reshape(-1, 1)))


0.05237190142858302
0.3524465495205019
0.4325380677663126
0.23185388128724224
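
The ratio printed above divides the information gain by the entropy of the feature's own value distribution, which is what `entropy(x_values)` computes in the code. In the notation of the formulas above it can be written as follows (the symbols $ g_{R} $ and $ H_{A}(D) $ are introduced here for illustration):

$$
g_{R}(D, A)=\frac{g(D, A)}{H_{A}(D)}, \quad H_{A}(D)=-\sum_{i=1}^{n} \frac{\left|D_{i}\right|}{|D|} \log _{2} \frac{\left|D_{i}\right|}{|D|}
$$

For A1, each of the three values covers 5 of the 15 samples, so $ H_{A_{1}}(D)=\log _{2} 3 \approx 1.585 $ and the ratio is about $ 0.083 / 1.585 \approx 0.052 $, which matches the first printed value.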


## Gini Index: Formulas

$$ 
\operatorname{Gini}(p)=\sum_{k=1}^{K} p_{k}\left(1-p_{k}\right)=1-\sum_{k=1}^{K} p_{k}^{2}
 $$
 $$ 
\operatorname{Gini}(D)=1-\sum_{k=1}^{K}\left(\frac{\left|C_{k}\right|}{|D|}\right)^{2}
 $$
 $$ 
\operatorname{Gini}(D, A)=\frac{\left|D_{1}\right|}{|D|} \operatorname{Gini}\left(D_{1}\right)+\frac{\left|D_{2}\right|}{|D|} \operatorname{Gini}\left(D_{2}\right)
 $$
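
As a small worked case of the first formula: for a two-class problem, writing $ p $ for the probability of the first class, the sum has only two terms and simplifies to

$$
\operatorname{Gini}(p)=p(1-p)+(1-p) p=2 p(1-p)
$$

For instance, the label column of the example data has class proportions 9/15 and 6/15, which gives $ 2 \times 0.6 \times 0.4=0.48 $.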

In [51]:
'''
Compute the Gini index of the values in y_values.
'''
def gini(y_values):
    g = 0
    unique_vals = np.unique(y_values)
    for val in unique_vals:
        p = np.sum(y_values == val)/len(y_values)
        g += (p * p)
    return 1 - g

'''
Compute the Gini index of the binary split (x == value vs. x != value)
for each distinct value of x.
'''
def gini_condition(x_values, y_values):
    g_condition = {}
    xy = np.hstack((x_values, y_values))
    unique_x = np.unique(x_values)
    for x_val in unique_x:
        xy_condition_x = xy[np.where(xy[:, 0] == x_val)]
        xy_condition_notx = xy[np.where(xy[:, 0] != x_val)]
        # Weighted Gini index of the two subsets (x == x_val and x != x_val)
        w_x = len(xy_condition_x)/len(x_values)
        w_notx = len(xy_condition_notx)/len(x_values)
        g_condition[x_val] = w_x * gini(xy_condition_x[:, 1]) + w_notx * gini(xy_condition_notx[:, 1])
    return g_condition

In [52]:
xy = np.array([[0,0,0,0,0,1,1,1,1,1,2,2,2,2,2], [0,0,1,1,0,0,0,1,0,0,0,0,1,1,0], [0,0,0,1,0,0,0,1,1,1,1,1,0,0,0], 
             [0,1,1,0,0,0,1,1,2,2,2,1,1,2,0], [0,0,1,1,0,0,0,1,1,1,1,1,1,1,0]]).T
#A1
print(gini_condition(xy[:, 0].reshape(-1, 1), 
                        xy[:, -1].reshape(-1, 1)))
#A2
print(gini_condition(xy[:, 1].reshape(-1, 1), 
                        xy[:, -1].reshape(-1, 1)))

#A3
print(gini_condition(xy[:, 2].reshape(-1, 1), 
                        xy[:, -1].reshape(-1, 1)))

#A4
print(gini_condition(xy[:, 3].reshape(-1, 1), 
                        xy[:, -1].reshape(-1, 1)))


{0: 0.44, 1: 0.4799999999999999, 2: 0.43999999999999995}
{0: 0.31999999999999995, 1: 0.31999999999999995}
{0: 0.26666666666666666, 1: 0.26666666666666666}
{0: 0.31999999999999984, 1: 0.4740740740740741, 2: 0.3636363636363637}


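* These values agree with p. 71 of the book: the feature and value of $ x $ with the smallest Gini index are chosen as the optimal feature and split point.
* In other words, choosing the smallest Gini index means choosing the feature, and the value of that feature, under which the uncertainty of $ y $ is smallest.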
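
The selection rule described above can be sketched as a search over every (feature, value) pair. This is a minimal illustration, assuming binary splits of the form "equals the value" versus "does not equal the value"; the names `gini_impurity` and `best_binary_split` are hypothetical and not part of the notebook.

```python
import numpy as np

def gini_impurity(labels):
    """Gini index of a label vector: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - float(np.sum(probs ** 2))

def best_binary_split(features, labels):
    """Return (feature_index, value, weighted_gini) with the smallest Gini index."""
    best = None
    for j in range(features.shape[1]):
        column = features[:, j]
        for value in np.unique(column):
            mask = column == value
            # Weighted Gini index of the two subsets produced by the split
            weighted = (mask.mean() * gini_impurity(labels[mask])
                        + (~mask).mean() * gini_impurity(labels[~mask]))
            if best is None or weighted < best[2]:
                best = (j, value, weighted)
    return best

# Same data as the cells above: columns A1..A4 followed by the label
data = np.array([[0,0,0,0,0,1,1,1,1,1,2,2,2,2,2],
                 [0,0,1,1,0,0,0,1,0,0,0,0,1,1,0],
                 [0,0,0,1,0,0,0,1,1,1,1,1,0,0,0],
                 [0,1,1,0,0,0,1,1,2,2,2,1,1,2,0],
                 [0,0,1,1,0,0,0,1,1,1,1,1,1,1,0]]).T

feature_index, value, score = best_binary_split(data[:, :-1], data[:, -1])
print(feature_index, value, score)
```

On this data the search selects the third column (A3, index 2), in line with the dictionaries printed above. Because A3 is binary, both of its values describe the same partition, so which of the two values is reported is not meaningful.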

## Comparing Feature Importance
### Computing feature importance with a random forest, using the data from the book

In [53]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42).fit(xy[:, :-1], xy[:, -1])

print(rf.feature_importances_)


[0.16228836 0.29464286 0.44417989 0.09888889]




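* Broadly consistent: like information gain, gain ratio, and the Gini index, the random forest ranks A3 as the most important feature, although the order of the remaining features differs (for example, A4 ranks second by information gain but last in the random forest).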
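
The comparison can be made explicit by sorting the features under each method. The sketch below simply reuses the numbers printed earlier in this notebook; the helper name `ranking` and the dictionary layout are assumptions for illustration.

```python
import numpy as np

names = np.array(["A1", "A2", "A3", "A4"])

# Scores copied from the outputs printed earlier in this notebook
scores = {
    "information gain": np.array([0.08300749985576883, 0.32365019815155627,
                                  0.4199730940219749, 0.36298956253708536]),
    "gain ratio": np.array([0.05237190142858302, 0.3524465495205019,
                            0.4325380677663126, 0.23185388128724224]),
    "random forest": np.array([0.16228836, 0.29464286, 0.44417989, 0.09888889]),
}

def ranking(values):
    """Feature names ordered from most to least important."""
    return names[np.argsort(-values)].tolist()

rankings = {method: ranking(values) for method, values in scores.items()}
for method, order in rankings.items():
    print(method, order)
```

All three methods place A3 first; they disagree mainly on where A4 falls.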