### （特征工程2 - 特征选择） - 分别用IV值和随机森林挑选特征，再构建模型，进行模型评估

在二分类问题中，IV值（Information Value）主要用来对输入变量进行编码和预测能力评估。

一般选择中等和强预测能力的变量用于模型开发，一些学派也只提倡具有中等IV值的变量来进行模型开发。

IV值主要用于特征选择，如果想对变量的预测能力进行排序，可以按 IV 值从高到低筛选。

In [1]:
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split

# 导入数据
data = pd.read_csv('data_2.csv')
data.drop_duplicates(inplace=True)

# 载入特征
with open('feature.pkl', 'rb') as f:
    X = pickle.load(f)

# 提取标签
y = data.status

# 划分训练集测试集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=2333)

In [2]:
# 性能评估
from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # 预测
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    
    y_train_proba = clf.predict_proba(X_train)[:,1]
    y_test_proba = clf.predict_proba(X_test)[:,1]
    
    # 准确率
    print('[准确率]', end = ' ')
    print('训练集：', '%.4f'%accuracy_score(y_train, y_train_pred), end = ' ')
    print('测试集：', '%.4f'%accuracy_score(y_test, y_test_pred))
    
    # auc取值：用roc_auc_score或auc
    print('[auc值]', end = ' ')
    print('训练集：', '%.4f'%roc_auc_score(y_train, y_train_proba), end = ' ')
    print('测试集：', '%.4f'%roc_auc_score(y_test, y_test_proba))


#### 1. IV值进行特征选择

stats.scoreatpercentile(x, 50) # 得到x在50%处的数值

np.in1d(B,A) # 在序列B中寻找与序列A相同的值，并返回一逻辑值（True,False）

处理上述特征时, 遇到了IV的极端情况, 响应数为0或未响应数为0。

为简单起见, 我们在代码中对极端值进行平滑处理。

In [6]:
import math
import numpy as np
from scipy import stats
from sklearn.utils.multiclass import type_of_target

def woe(X, y, event=1):  
    res_woe = []
    iv_dict = {}
    for feature in X.columns:
        x = X[feature].values
        # 1) 连续特征离散化
        if type_of_target(x) == 'continuous':
            x = discrete(x)
        # 2) 计算该特征的woe和iv
        # woe_dict, iv = woe_single_x(x, y, feature, event)
        woe_dict, iv = woe_single_x(x, y, feature, event)
        iv_dict[feature] = iv
        res_woe.append(woe_dict) 
        
    return iv_dict
        
def discrete(x):
    # 使用5等分离散化特征
    res = np.zeros(x.shape)
    for i in range(5):
        point1 = stats.scoreatpercentile(x, i * 20)
        point2 = stats.scoreatpercentile(x, (i + 1) * 20)
        x1 = x[np.where((x >= point1) & (x <= point2))]
        mask = np.in1d(x, x1)
        res[mask] = i + 1    # 将[i, i+1]块内的值标记成i+1
    return res

def woe_single_x(x, y, feature,event = 1):
    # event代表预测正例的标签
    event_total = sum(y == event)
    non_event_total = y.shape[-1] - event_total
    
    iv = 0
    woe_dict = {}
    for x1 in set(x):    # 遍历各个块
        y1 = y.reindex(np.where(x == x1)[0])
        event_count = sum(y1 == event)
        non_event_count = y1.shape[-1] - event_count
        rate_event = event_count / event_total    
        rate_non_event = non_event_count / non_event_total
        
        if rate_event == 0:
            rate_event = 0.0001
            # woei = -20
        elif rate_non_event == 0:
            rate_non_event = 0.0001
            # woei = 20
        woei = math.log(rate_event / rate_non_event)
        woe_dict[x1] = woei
        iv += (rate_event - rate_non_event) * woei
    return woe_dict, iv


In [7]:
import warnings
warnings.filterwarnings("ignore")

iv_dict = woe(X_train, y_train)
iv = sorted(iv_dict.items(), key = lambda x:x[1],reverse = True)
iv


[('loans_latest_time_month', 0.23536009610676906),
 ('latest_query_time_weekday', 0.22341492175897784),
 ('latest_query_time_month', 0.22281703262580477),
 ('avg_consume_less_12_valid_month', 0.22239702810015521),
 ('first_transaction_time_weekday', 0.22040179517292707),
 ('reg_preference_for_trad', 0.2177870321526657),
 ('first_transaction_time_month', 0.21771183020755758),
 ('loans_cash_count', 0.21757359026066675),
 ('loans_org_count_current', 0.21757359026066675),
 ('history_suc_fee', 0.21659344037069103),
 ('loans_latest_day', 0.2164524029470496),
 ('loans_latest_time_weekday', 0.21550477980307856),
 ('latest_query_day', 0.21265811933623147),
 ('apply_credibility', 0.21236457565886172),
 ('trans_fail_top_count_enum_last_12_month', 0.2123504757203065),
 ('loans_product_count', 0.21194036422792928),
 ('loans_score', 0.21110907500076165),
 ('transd_mcc', 0.211048655685347),
 ('latest_three_month_loan', 0.21045694705509993),
 ('consfin_max_limit', 0.20992973971407752),
 ('consfin_cred

#### 2. 随机森林挑选特征

首先网格调参，求得模型参数。

In [8]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# 观察默认参数的性能
rf0 = RandomForestClassifier(oob_score=True, random_state=2333)
rf0.fit(X_train, y_train)
print('袋外分数：', rf0.oob_score_)
model_metrics(rf0, X_train, X_test, y_train, y_test)


袋外分数： 0.7333934475503456
[准确率] 训练集： 0.9802 测试集： 0.7730
[auc值] 训练集： 0.9995 测试集： 0.7201


In [9]:
# 网格法调参, 步骤省略...

param_test = {'n_estimators':range(20,200,20)}
# param_test = {'max_depth':range(3,14,2), 'min_samples_split':range(50,201,20)}
# param_test = {'min_samples_split':range(10,100,20), 'min_samples_leaf':range(10,60,10)}
# param_test = {'max_features':range(3,17,2)}
gsearch = GridSearchCV(estimator = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50, 
                                                          min_samples_leaf=20, max_features = 9,random_state=2333), 
                       param_grid = param_test, scoring='roc_auc', cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_



({'n_estimators': 180}, 0.7841192009211578)

最终参数及性能

In [10]:
rf = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50,
                            min_samples_leaf=20, max_features = 9,oob_score=True, random_state=2333)
rf.fit(X_train, y_train)
print('袋外分数：', rf.oob_score_)
model_metrics(rf, X_train, X_test, y_train, y_test)


袋外分数： 0.7844905320108205
[准确率] 训练集： 0.8115 测试集： 0.7912
[auc值] 训练集： 0.8949 测试集： 0.7922


##### 2.1 平均不纯度减少 mean decrease impurity

对于每颗树，按照impurity（此处是gini指数 ）给特征排序，然后整个森林取平均

In [12]:
rf.fit(X_train, y_train)
feature_impotance1 = sorted(zip(map(lambda x: '%.4f'%x, rf.feature_importances_), list(X_train.columns)), reverse=True)


In [13]:
feature_impotance1[:10]


[('0.1231', 'trans_fail_top_count_enum_last_1_month'),
 ('0.0801', 'history_fail_fee'),
 ('0.0737', 'loans_score'),
 ('0.0672', 'apply_score'),
 ('0.0667', 'latest_one_month_fail'),
 ('0.0416', 'loans_overdue_count'),
 ('0.0320', 'trans_fail_top_count_enum_last_12_month'),
 ('0.0250', 'trans_fail_top_count_enum_last_6_month'),
 ('0.0208', 'trans_day_last_12_month'),
 ('0.0167', 'rank_trad_1_month')]

##### 2.2 平均精确率减少 Mean decrease accuracy

打乱每个特征的特征值顺序，并且度量顺序变动对模型的精确率的影响。（也可以measure每个特征加躁，看对结果的准确率的影响。）

In [14]:
import numpy as np
from collections import defaultdict
from sklearn.model_selection import cross_val_score, ShuffleSplit

scores = defaultdict(list)
rs = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
for train_idx, test_idx in rs.split(X_train):
    x_train, x_test = X_train.values[train_idx], X_train.values[test_idx]
    Y_train, Y_test = y_train.values[train_idx], y_train.values[test_idx]
    r = rf.fit(x_train, Y_train)
    acc = accuracy_score(Y_test, rf.predict(x_test))
    for i in range(x_train.shape[1]):
        X_t = x_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = accuracy_score(Y_test, rf.predict(X_t))
        scores[X_train.columns[i]].append((acc - shuff_acc) / acc)
        
feature_impotance2=sorted([('%.4f'%np.mean(score), feat) for feat, score in scores.items()], reverse=True)


In [15]:
feature_impotance2[:10]

[('0.0184', 'history_fail_fee'),
 ('0.0171', 'trans_fail_top_count_enum_last_1_month'),
 ('0.0120', 'latest_one_month_fail'),
 ('0.0089', 'apply_score'),
 ('0.0087', 'loans_score'),
 ('0.0076', 'loans_overdue_count'),
 ('0.0056', 'trans_fail_top_count_enum_last_6_month'),
 ('0.0043', 'latest_one_month_suc'),
 ('0.0026', 'rank_trad_1_month'),
 ('0.0023', 'latest_query_day')]

#### 3 综合挑选特征

In [16]:
useless = []
for feature in X_train.columns:
    if feature in [t[1] for t in feature_impotance1[50:]] and feature in [t[1] for t in feature_impotance2[50:]]:
        useless.append(feature)
        print(feature, iv_dict[feature])


first_transaction_time_year 0.2046103090844746
first_transaction_time_weekday 0.22040179517292707
latest_query_time_year 0.19785800765281902
latest_query_time_month 0.22281703262580477
loans_latest_time_year 0.19963733017168203
loans_latest_time_month 0.23536009610676906
loans_latest_time_weekday 0.21550477980307856
regional_mobility 0.20654433409120623
is_high_user 0.19615128275454694
avg_consume_less_12_valid_month 0.22239702810015521
reg_preference_for_trad 0.2177870321526657
consume_top_time_last_6_month 0.20011434179160154
railway_consume_count_last_12_month 0.1961481366164229
jewelry_consume_count_last_6_month 0.1961481366164229
apply_credibility 0.21236457565886172
query_org_count 0.20111878037869466
loans_credibility_behavior 0.20913259783498153
consfin_org_count_behavior 0.20557154838757058
latest_one_month_loan 0.19737896713498773
loans_credibility_limit 0.20533354487764222
loans_product_count 0.21194036422792928
consfin_org_count_current 0.20557154838757058
consfin_product_c

In [17]:
X_train.drop(useless, axis = 1, inplace = True)
X_test.drop(useless, axis = 1, inplace = True)


### 模型选择与模型评估

In [18]:
from sklearn.preprocessing import StandardScaler

# 特征归一化
std = StandardScaler()
X_train = std.fit_transform(X_train.values)
X_test = std.transform(X_test.values)


In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from mlxtend.classifier import StackingClassifier

lr = LogisticRegression(C = 0.1, penalty = 'l1')
svm_linear = svm.SVC(C = 0.01, kernel = 'linear', probability=True)
svm_poly =  svm.SVC(C = 0.01, kernel = 'poly', probability=True)
svm_rbf =  svm.SVC(gamma = 0.01, C =0.01 , probability=True)
svm_sigmoid =  svm.SVC(C = 0.01, kernel = 'sigmoid',probability=True)
dt = DecisionTreeClassifier(max_depth=5,min_samples_split=50,min_samples_leaf=60, max_features=9, random_state =2333)
xgb = XGBClassifier(learning_rate =0.1, n_estimators=80, max_depth=3, min_child_weight=5, 
                    gamma=0.2, subsample=0.8, colsample_bytree=0.8, reg_alpha=1e-5, 
                    objective= 'binary:logistic', nthread=4,scale_pos_weight=1, seed=27)
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=100, max_depth=3, min_child_weight=11, 
                    gamma=0.1, subsample=0.5, colsample_bytree=0.9, reg_alpha=1e-5, 
                    nthread=4,scale_pos_weight=1, seed=27)


In [20]:
sclf = StackingClassifier(classifiers=[svm_linear, svm_poly, svm_rbf, svm_sigmoid, dt, xgb, lgb], 
                            meta_classifier=lr, use_probas=True,average_probas=False)
sclf.fit(X_train, y_train.values)
model_metrics(sclf, X_train, X_test, y_train, y_test)


[准确率] 训练集： 0.8563 测试集： 0.8024
[auc值] 训练集： 0.9026 测试集： 0.7909


分析：调参后单模型性能有所提升。Stacking后和未特征选择时的结果对比，相差不大（AUC略有下降）。

起码说明，删除某些特征后，对性能影响不大 → 这些特征冗余。