# Implementing Permutation Importance
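Permutation Importance is used to explain which features drive a predictive model's output.
For a trained model, it randomly shuffles the values of one feature at a time and compares the resulting change in prediction error.
The larger the change, the more important the feature is judged to be.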
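
The procedure described above can be sketched using only the standard library. This is a minimal illustration under stated assumptions, not eli5's implementation: `accuracy`, `permutation_importance_sketch`, `toy_predict`, and the toy data are hypothetical names introduced here, and accuracy stands in for whatever score the model is evaluated with.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance_sketch(predict, rows, labels, n_repeats=5, seed=0):
    """Return, per feature, the mean drop in accuracy after shuffling that feature."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (hypothetical): the label depends only on feature 0.
toy_rng = random.Random(42)
toy_rows = [[toy_rng.random(), toy_rng.random()] for _ in range(200)]
toy_labels = [int(r[0] > 0.5) for r in toy_rows]


def toy_predict(row):
    """Stand-in for a trained model: thresholds feature 0 and ignores feature 1."""
    return int(row[0] > 0.5)


toy_importances = permutation_importance_sketch(toy_predict, toy_rows, toy_labels)
print(toy_importances)
```

Because `toy_predict` never reads feature 1, shuffling that column leaves every prediction unchanged and its importance is exactly zero, while shuffling feature 0 lowers accuracy noticeably.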

In [1]:
from pathlib import Path
import pandas as pd
from dfply import *
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from eli5.sklearn import PermutationImportance

In [2]:
%load_ext autoreload
%autoreload 2
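# Use the full option name; the short form "max_columns" is ambiguous in recent pandas versions
pd.set_option("display.max_columns", 10000)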

In [3]:
# Directory for the CSV outputs; created if it does not exist yet
result_dir_path = Path("result")
result_dir_path.mkdir(parents=True, exist_ok=True)

## Loading the data
This notebook uses the breast cancer dataset bundled with scikit-learn.
Its features describe characteristics of cell nuclei.

In [4]:
# Load the data and convert it to a DataFrame
cancer_data = load_breast_cancer()

data_y = cancer_data.target
data_x = pd.DataFrame(
    cancer_data.data,
    columns=cancer_data.feature_names
)
data_x.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [5]:
train_x, test_x, train_y, test_y = train_test_split(
    data_x,
    data_y,
    test_size=0.3
)

## Building the prediction model

In [6]:
model = XGBClassifier()
model.fit(train_x, train_y)
print("Accuracy: {:.0f}%".format(model.score(test_x, test_y) * 100))

Accuracy: 97%


## Computing Permutation Importance

In [7]:
perm = PermutationImportance(model, random_state=1).fit(test_x, test_y)

Store the feature importances in a DataFrame and save the result.

In [8]:
perm_weights = (
    pd.DataFrame({
        "column": data_x.columns.tolist(),
        "weight": perm.feature_importances_,
        "std": perm.feature_importances_std_,
    })
    >> mutate(weight=X.weight.astype(float))
    >> arrange(X.weight, ascending=False)
)
perm_weights.to_csv(result_dir_path.joinpath("feature_permutations.csv"), index=False)
perm_weights

| | column | weight | std |
|---|---|---|---|
| 21 | worst texture | 0.01637427 | 0.011928 |
| 26 | worst concavity | 0.007017544 | 0.00573 |
| 13 | area error | 0.003508772 | 0.004678 |
| 1 | mean texture | 0.002339181 | 0.002865 |
| 0 | mean radius | 0.0 | 0.0 |
| 28 | worst symmetry | 0.0 | 0.0 |
| 25 | worst compactness | 0.0 | 0.0 |
| 19 | fractal dimension error | 0.0 | 0.0 |
| 18 | symmetry error | 0.0 | 0.0 |
| 17 | concave points error | 0.0 | 0.0 |
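
For readers who prefer to stay within scikit-learn, a comparable table can be produced with `sklearn.inspection.permutation_importance`. The sketch below is an alternative, not the method used in this notebook: it swaps eli5 for scikit-learn's own function and, as an assumption made only to keep the dependencies to scikit-learn and pandas, swaps `XGBClassifier` for a `RandomForestClassifier`. The names `sketch_model` and `sketch_weights` and the fixed `random_state=0` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Load the same breast cancer data, directly as a DataFrame and a Series.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Stand-in model (assumption): a random forest instead of XGBClassifier.
sketch_model = RandomForestClassifier(n_estimators=100, random_state=0)
sketch_model.fit(X_train, y_train)

# Shuffle each column n_repeats times on held-out data and record the score drop.
result = permutation_importance(
    sketch_model, X_test, y_test, n_repeats=5, random_state=0
)

# Same layout as the table above: one row per feature, sorted by mean importance.
sketch_weights = (
    pd.DataFrame({
        "column": X.columns,
        "weight": result.importances_mean,
        "std": result.importances_std,
    })
    .sort_values("weight", ascending=False)
    .reset_index(drop=True)
)
print(sketch_weights.head(10))
```

The numbers will not match the eli5 output above, since both the model and the split differ; only the shape of the result is meant to be comparable.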


Remove the features that Permutation Importance marks as unnecessary, then retrain.

In [9]:
model = XGBClassifier()

# Keep only the features whose permutation importance is positive
target_perm_weights = perm_weights >> filter_by(X.weight > 0)
target_columns = target_perm_weights["column"].tolist()
model.fit(train_x >> select(target_columns), train_y)
print("Accuracy: {:.0f}%".format(model.score(test_x >> select(target_columns), test_y) * 100))

Accuracy: 92%


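In the run shown above, accuracy fell from 97% to 92% after the features were removed, so it was not preserved; what is gained is a simpler model that uses fewer features. The train/test split is not seeded, so the exact figures vary between runs.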

## Comparing with XGBoost's feature importances

In [10]:
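model = XGBClassifier()
model.fit(train_x, train_y)

# Rank the features by XGBoost's built-in importance scores
coef_dt = pd.DataFrame({
    "column": train_x.columns.tolist(),
    "weight": model.feature_importances_
}) >> mutate(weight=X.weight.astype(float)) >> arrange(X.weight, ascending=False)

# Features with a positive importance. Note that the fit below selects the columns of
# coef_dt (all features, reordered), not target_coef_dt, so the score is for the full feature set.
target_coef_dt = coef_dt >> filter_by(X.weight > 0)
model.fit(train_x >> select(coef_dt["column"].tolist()), train_y)
model.score(test_x >> select(coef_dt["column"].tolist()), test_y)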

0.9707602339181286

The features that XGBoost ranks as important differ from those that Permutation Importance ranks as important.
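
One way to put a number on that difference is to measure the overlap between the two sets of top-ranked features. The sketch below copies the feature names from the two result tables in this notebook (the four features with positive permutation weight, and the ten highest XGBoost weights); the variable names `perm_top`, `xgb_top`, `shared`, and `jaccard` are introduced here for illustration.

```python
# Feature names copied from the two result tables in this notebook.
perm_top = [
    "worst texture", "worst concavity", "area error", "mean texture",
]  # permutation importance weight > 0
xgb_top = [
    "worst radius", "worst perimeter", "mean concave points", "worst concave points",
    "mean area", "worst area", "mean concavity", "worst concavity",
    "mean texture", "mean symmetry",
]  # ten highest XGBoost weights

# Features that appear in both lists, in permutation-importance order.
shared = [name for name in perm_top if name in xgb_top]

# Jaccard similarity: size of the intersection divided by size of the union.
jaccard = len(shared) / len(set(perm_top) | set(xgb_top))

print(shared)             # ['worst concavity', 'mean texture']
print(round(jaccard, 3))  # 0.167
```

Only two features are shared. A likely contributor, stated here as a general property rather than something verified on this data, is that XGBoost's scores are derived from the trained trees, whereas permutation importance is measured on held-out data, where shuffling one of several correlated features (for example radius, perimeter, and area) may barely change the predictions.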

In [11]:
coef_dt.to_csv(result_dir_path.joinpath("xgboost_coef_dt.csv"), index=False)
coef_dt

| | column | weight |
|---|---|---|
| 20 | worst radius | 0.20034 |
| 22 | worst perimeter | 0.160787 |
| 7 | mean concave points | 0.153448 |
| 27 | worst concave points | 0.114759 |
| 3 | mean area | 0.057425 |
| 23 | worst area | 0.056538 |
| 6 | mean concavity | 0.033484 |
| 26 | worst concavity | 0.028849 |
| 1 | mean texture | 0.024475 |
| 8 | mean symmetry | 0.019769 |
