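Filter-based feature selection scores each feature by its dispersion (variance) or by its correlation with the target variable, then keeps the highest-scoring features.
Univariate filter methods: consider only the relationship between a single feature and the target variable.
Multivariate filter methods: consider both the relationship between the features and the target variable and the relationships among the features themselves.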
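
To make the univariate/multivariate distinction concrete, here is a minimal sketch of one simple multivariate-style check: dropping a feature that is almost perfectly correlated with a feature already kept. The toy matrix `X_demo`, the 0.9 cutoff, and the helper name `drop_redundant` are assumptions made up for illustration; they are not part of this notebook's data or of any library.

```python
import numpy as np


def drop_redundant(X, threshold=0.9):
    """Greedily keep features whose |Pearson r| with every kept feature is <= threshold."""
    # feature-by-feature absolute correlation matrix
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept


# hypothetical data: the third column is an exact linear function of the first
X_demo = np.array([
    [1, 5, 3],
    [2, 1, 5],
    [3, 4, 7],
    [4, 2, 9],
    [5, 6, 11],
    [6, 3, 13],
], dtype=float)

kept = drop_redundant(X_demo)
print("kept feature indices:", kept)
```

A univariate score would rate the first and third columns identically against any target, so only a check that compares features with each other can detect that one of them is redundant.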

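1. Variance Threshold
An unsupervised feature selection method: it looks only at X and drops features whose variance does not exceed the threshold.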

In [1]:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

In [3]:
# dataset
X = np.array([
    [1, 2, 3, 4],
    [1, 6, 7, 9],
    [1, 4, 4, 2],
    [1, 4, 6, 1],
    [0, 7, 3, 2],
    [1, 5, 2, 6]
])
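# keep only features whose variance is greater than 1.0
selector = VarianceThreshold(threshold=1.0)
# fit on the dataset (computes the per-feature variances)
selector.fit(X)
# select features
selected_X = selector.transform(X)
print("Feature variances: ", selector.variances_)
print("Dataset after feature selection:", selected_X)

Feature variances:  [0.13888889 2.55555556 3.13888889 7.66666667]
Dataset after feature selection: [[2 3 4]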
 [6 7 9]
 [4 4 2]
 [4 6 1]
 [7 3 2]
 [5 2 6]]
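
As a cross-check, the same selection can be sketched with plain NumPy. This assumes the population variance (`ddof=0`, NumPy's default), which is what the variances printed above correspond to (for example 5/36 ≈ 0.1389 for the first column). The names `variances` and `mask` are illustrative only.

```python
import numpy as np

X = np.array([
    [1, 2, 3, 4],
    [1, 6, 7, 9],
    [1, 4, 4, 2],
    [1, 4, 6, 1],
    [0, 7, 3, 2],
    [1, 5, 2, 6]
])

# population variance of every column (ddof=0)
variances = X.var(axis=0)
# boolean mask of the columns that pass the threshold
mask = variances > 1.0

print("variances:", variances)
print("kept columns:", np.flatnonzero(mask))
print(X[:, mask])
```

The first column is nearly constant (five 1s and a single 0), so it carries little information and is the only column removed.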


2. Chi-squared Statistic (classification)
Requires non-negative feature values, such as counts or frequencies.

In [5]:
from sklearn.feature_selection import chi2, SelectKBest

In [7]:
# class labels for the rows of X
Y = np.array([1, 0, 1, 1, 1, 0])
selector = SelectKBest(chi2, k=2)
selector.fit(X, Y)
selected_X = selector.transform(X)

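# a larger chi-squared value means the feature depends more strongly on the class label
print("Chi-squared statistic per feature:", selector.scores_)
print("Dataset after feature selection:", selected_X)

Chi-squared statistic per feature: [0.1        0.44642857 0.08       9.1875    ]
Dataset after feature selection: [[2 4]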
 [6 9]
 [4 2]
 [4 1]
 [7 2]
 [5 6]]
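
The scores above can be reproduced by hand, which shows what the statistic measures. The sketch below assumes the usual observed-versus-expected formulation: for each feature, the observed value is the sum of that feature within each class, and the expected value is the feature's overall sum split according to the class frequencies. Variable names are illustrative.

```python
import numpy as np

X = np.array([
    [1, 2, 3, 4],
    [1, 6, 7, 9],
    [1, 4, 4, 2],
    [1, 4, 6, 1],
    [0, 7, 3, 2],
    [1, 5, 2, 6]
])
Y = np.array([1, 0, 1, 1, 1, 0])

classes = np.unique(Y)
# one-hot encode the labels: shape (n_samples, n_classes)
Y_onehot = (Y[:, None] == classes).astype(float)

# observed per-class feature sums: shape (n_classes, n_features)
observed = Y_onehot.T @ X
# expected sums if feature and class were independent
class_prob = Y_onehot.mean(axis=0)
feature_total = X.sum(axis=0)
expected = np.outer(class_prob, feature_total)

chi2_manual = ((observed - expected) ** 2 / expected).sum(axis=0)
print(chi2_manual)
```

For the last feature the class sums are 15 and 9 against expected values of 8 and 16, which gives 49/8 + 49/16 = 9.1875.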


3. Mutual Information
Classification (mutual_info_classif): when y is a discrete variable
Regression (mutual_info_regression): when y is a continuous variable

In [9]:
from sklearn.feature_selection import mutual_info_classif

In [10]:
# scores can vary slightly between runs unless mutual_info_classif is given a fixed random_state
selector = SelectKBest(mutual_info_classif, k=2)
selector.fit(X, Y)
selected_X = selector.transform(X)

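print("Mutual information between each feature and the target:", selector.scores_)
print("Dataset after feature selection:", selected_X)

Mutual information between each feature and the target: [0.33888889 0.         0.         0.39444444]
Dataset after feature selection: [[1 4]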
 [1 9]
 [1 2]
 [1 1]
 [0 2]
 [1 6]]
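
For intuition, mutual information between two discrete variables can be computed directly from their joint and marginal frequencies. The sketch below is a plain-NumPy illustration with a hypothetical helper name, `discrete_mutual_info`. It is not the estimator scikit-learn used for the scores above: with the default settings, a dense X is treated as continuous and the score comes from a nearest-neighbor estimate, so the numbers differ.

```python
import numpy as np


def discrete_mutual_info(x, y):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            # joint probability of the pair (xv, yv)
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                p_x = np.mean(x == xv)
                p_y = np.mean(y == yv)
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi


x0 = np.array([1, 1, 1, 1, 0, 1])   # first column of X
Y = np.array([1, 0, 1, 1, 1, 0])

mi_x0 = discrete_mutual_info(x0, Y)
mi_self = discrete_mutual_info(Y, Y)  # equals the entropy of Y
print(mi_x0, mi_self)
```

Mutual information is zero when the two variables are independent and reaches the entropy of the target when the feature determines the target completely.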


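4. F-statistic (F-score)
Classification (f_classif)
Regression (f_regression)

A larger F value indicates a stronger (linear) dependence between the feature and the target variable.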

In [11]:
from sklearn.feature_selection import f_classif

In [13]:
selector = SelectKBest(f_classif, k=2)
selector.fit(X, Y)
selected_X = selector.transform(X)

print("F-statistic per feature:", selector.scores_)
print("Dataset after feature selection:", selected_X)

F-statistic per feature: [ 0.44444528  0.62892926  0.07207263 15.8918915 ]
Dataset after feature selection: [[2 4]
 [6 9]
 [4 2]
 [4 1]
 [7 2]
 [5 6]]
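
For the classification case, the F value is the one-way ANOVA ratio of between-class variance to within-class variance. The following is a minimal sketch of that ratio on the same X and Y, using a hypothetical helper `anova_f`; it is meant to show what the score measures, not to replace f_classif.

```python
import numpy as np

X = np.array([
    [1, 2, 3, 4],
    [1, 6, 7, 9],
    [1, 4, 4, 2],
    [1, 4, 6, 1],
    [0, 7, 3, 2],
    [1, 5, 2, 6]
])
Y = np.array([1, 0, 1, 1, 1, 0])


def anova_f(X, y):
    """One-way ANOVA F value for every column of X, grouped by the labels y."""
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        # spread of the class means around the overall mean
        ss_between += len(Xc) * (class_mean - grand_mean) ** 2
        # spread of the samples around their own class mean
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)
    return (ss_between / (n_classes - 1)) / (ss_within / (n_samples - n_classes))


f_manual = anova_f(X, Y)
print(f_manual)
```

The last feature scores highest because its class means (2.25 and 7.5) are far apart relative to the spread inside each class.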


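5. Pearson Correlation Coefficient
Regression problems

Measures the strength of the linear correlation between two continuous variables (the covariance of the two variables divided by the product of their standard deviations).

The larger the absolute value of the Pearson correlation coefficient, the more strongly the feature is linearly related to the target variable.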
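
The definition can be sketched in a few lines of NumPy. The arrays `x_demo` and `y_demo` are made-up toy values and `pearson_r` is a hypothetical helper name; the result is compared against `np.corrcoef` as a sanity check.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())


# hypothetical toy data
x_demo = np.array([1, 2, 3, 4, 5])
y_demo = np.array([2, 4, 5, 4, 5])

r = pearson_r(x_demo, y_demo)
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]
r_flipped = pearson_r(x_demo, -y_demo)
print(r, r_numpy, r_flipped)
```

Flipping the sign of one variable flips the sign of r but not its magnitude, which is why ranking features by the absolute value is the sensible choice for selection.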

In [14]:
from scipy.stats import pearsonr
import pandas as pd

In [23]:
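house_train_df = pd.read_csv("../data/california_house/train.csv", index_col=['id'])
# Use a 1-D Series for the target. Passing the 2-D DataFrame
# house_train_df[['MedHouseVal']] was the likely cause of the original failure:
# pearsonr then returns non-scalar results and SelectKBest raises
# "TypeError: ufunc 'isnan' not supported for the input types".
Y = house_train_df['MedHouseVal']
X = house_train_df.drop(columns=['MedHouseVal'])


def ud_pearsonr(X, y):
    """Score function for SelectKBest: returns (|r|, p-value) for each feature."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    results = [pearsonr(x, y) for x in X.T]
    scores = np.array([abs(r[0]) for r in results])
    pvalues = np.array([r[1] for r in results])
    return scores, pvalues

selector = SelectKBest(ud_pearsonr, k=4)
selector.fit(X, Y)
selected_X = selector.transform(X)

print("Absolute Pearson correlation per feature:\n", pd.Series(selector.scores_, index=X.columns))
print("Selected features:", np.array(X.columns)[selector.get_support(indices=True)])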