## Preprocessing: One-Hot Encoding and Missing-Value Handling (Practice)

To learn one-hot encoding and missing-value handling, we use a loan screening results dataset.

In [1]:
# import sample data: Loan screening data for classification 
import pandas as pd

df = pd.read_csv('./data/av_loan_u6lujuX_CVtuZ9i.csv',header=0)
X = df.iloc[:,:-1]           # all columns except the last are the features X
X = X.drop('Loan_ID',axis=1) # the first column is an ID, so drop it from the features
y = df.iloc[:,-1]            # the last column is the target label

# check the shape
print('X shape: (%i,%i)' %X.shape)

# Map samples whose loan was rejected ('N') to 1 (the positive class)
class_mapping = {'N':1, 'Y':0}
y = y.map(class_mapping)
print('--------------------')
print(y.value_counts())
X.join(y).head()

X shape: (614,11)
--------------------
0    422
1    192
Name: Loan_Status, dtype: int64


Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,0
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,1
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,0
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,0
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,0


In [2]:
## Check the data types
X.join(y).dtypes

Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status            int64
dtype: object

In [3]:
## Summary statistics for numeric columns (int64, float64, etc.)
## The count row reveals how many values are missing in each column
X.join(y).describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status
count,614.0,614.0,592.0,600.0,564.0,614.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199,0.312704
std,6109.041673,2926.248369,85.587325,65.12041,0.364878,0.463973
min,150.0,0.0,9.0,12.0,0.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0,0.0
50%,3812.5,1188.5,128.0,360.0,1.0,0.0
75%,5795.0,2297.25,168.0,360.0,1.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0,1.0


In [4]:
## Show missing entries as True
X.join(y).isnull().head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,False,False,False,False,False,False,False,True,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False


In [5]:
## Whether each column contains at least one missing value
X.join(y).isnull().any()

Gender                True
Married               True
Dependents            True
Education            False
Self_Employed         True
ApplicantIncome      False
CoapplicantIncome    False
LoanAmount            True
Loan_Amount_Term      True
Credit_History        True
Property_Area        False
Loan_Status          False
dtype: bool

In [6]:
## Count the missing values per column
X.join(y).isnull().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

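In the output above, for example, the first row of LoanAmount is missing. <b>Replacing such a missing value with the mean of the LoanAmount column is called missing-value imputation, and converting categorical variables such as Gender and Married into 0/1 binary variables is called one-hot encoding</b>.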
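
The two operations can be sketched on a tiny example. The frame `toy` below is hypothetical data invented for illustration (not the loan dataset), and `fillna` is used only to show the idea of mean imputation; it is not the method this notebook applies later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 200.0],
    "Gender": ["Male", "Female", np.nan],
})

# Mean imputation: the NaN in LoanAmount becomes the column mean (150.0)
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].mean())

# One-hot encoding: Gender becomes indicator columns;
# dummy_na=True adds a Gender_nan column that flags the missing entry
encoded = pd.get_dummies(toy, columns=["Gender"], dummy_na=True)
print(encoded.columns.tolist())
```

Depending on the pandas version, the indicator columns may be boolean rather than 0/1 integers; `astype(int)` converts them if needed.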

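To keep the preprocessing consistent throughout this course, <b>we (1) first one-hot encode the data, which resolves missing values in categorical variables by turning them into flag variables, and then (2) replace the remaining missing values in continuous variables with the column mean</b>. Let's start with one-hot encoding. Set the option dummy_na=True so that the fact that a value was missing is itself captured as a feature.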

In [17]:
ohe_columns = ['Dependents',
               'Gender',
               'Married',
               'Education',
               'Self_Employed',
               'Property_Area']       ## string (object) columns

X_new = pd.get_dummies(X, dummy_na=True, columns=ohe_columns)    ## dummy_na=True: adds a NaN indicator column for every encoded column, even one with no missing values
X_new.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849,0.0,,360.0,1.0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,1,0
1,4583,1508.0,128.0,360.0,1.0,0,1,0,0,0,...,1,0,0,1,0,0,1,0,0,0
2,3000,0.0,66.0,360.0,1.0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,1,0
3,2583,2358.0,120.0,360.0,1.0,1,0,0,0,0,...,0,1,0,1,0,0,0,0,1,0
4,6000,0.0,141.0,360.0,1.0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,1,0


In [18]:
## Even without columns=ohe_columns, object-dtype columns are detected and encoded automatically
pd.get_dummies(X, dummy_na=True).head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Gender_nan,Married_No,Married_Yes,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849,0.0,,360.0,1.0,0,1,0,1,0,...,1,0,0,1,0,0,0,0,1,0
1,4583,1508.0,128.0,360.0,1.0,0,1,0,0,1,...,1,0,0,1,0,0,1,0,0,0
2,3000,0.0,66.0,360.0,1.0,0,1,0,0,1,...,1,0,0,0,1,0,0,0,1,0
3,2583,2358.0,120.0,360.0,1.0,0,1,0,0,1,...,0,1,0,1,0,0,0,0,1,0
4,6000,0.0,141.0,360.0,1.0,0,1,0,1,0,...,1,0,0,1,0,0,0,0,1,0
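
With dummy_na=True, an indicator such as Education_nan is created even though Education has no missing values, so that column is all zeros. As an optional step that this notebook does not perform, such empty indicators could be dropped. The sketch below uses hypothetical toy data; note that dropping columns this way is only safe if the same column set is also applied to any later test data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Education has no missing values, Gender has one
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female"],
    "Education": ["Graduate", "Not Graduate", "Graduate"],
})
encoded = pd.get_dummies(toy, dummy_na=True)

# Indicator columns ending in "_nan" that never fire carry no information
nan_cols = [c for c in encoded.columns if c.endswith("_nan")]
empty_nan_cols = [c for c in nan_cols if encoded[c].sum() == 0]

reduced = encoded.drop(columns=empty_nan_cols)
print(empty_nan_cols)
```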


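This completes the numerical encoding of the categorical variables and the handling of their missing values. <b>Next, we replace missing values in the continuous variables with the mean. This can be done with sklearn's Imputer class</b>. To verify later that the step worked, first check the summary statistics of LoanAmount and confirm that its mean is 146.412162.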

In [19]:
X_new.describe()     ## the count row reveals the number of missing values in the continuous variables

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
count,614.0,614.0,592.0,600.0,564.0,614.0,614.0,614.0,614.0,614.0,...,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199,0.561889,0.166124,0.164495,0.083062,0.02443,...,0.781759,0.218241,0.0,0.814332,0.13355,0.052117,0.291531,0.379479,0.32899,0.0
std,6109.041673,2926.248369,85.587325,65.12041,0.364878,0.496559,0.372495,0.371027,0.276201,0.154506,...,0.413389,0.413389,0.0,0.389155,0.340446,0.222445,0.454838,0.485653,0.470229,0.0
min,150.0,0.0,9.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3812.5,1188.5,128.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,5795.0,2297.25,168.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0
max,81000.0,41667.0,700.0,480.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0


In [22]:
## Fitting on data that still contains missing values raises an error (the fit calls below are commented out)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
#lr.fit(X_new, y)

from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
#gb.fit(X_new, y)
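
The error mentioned in the cell above can be reproduced safely on a small array. This is a minimal sketch with hypothetical toy data; it checks only LogisticRegression, since NaN support in other estimators varies across scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with a single missing value
X_toy = np.array([[1.0, 2.0], [np.nan, 1.0], [3.0, 0.5], [4.0, 0.1]])
y_toy = np.array([0, 0, 1, 1])

try:
    LogisticRegression().fit(X_toy, y_toy)
    raised = False
except ValueError as exc:
    # Input validation rejects NaN before any training happens
    raised = True
    print("fit failed:", type(exc).__name__)
```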

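Now let's impute the missing values of the continuous variables with the mean. We import Imputer from the preprocessing module (Imputer was removed in scikit-learn 0.22 in favor of sklearn.impute.SimpleImputer, so the code below requires an older release). Applying the Imputer's transform method replaces the missing LoanAmount values (the first row, for example) with the mean (146.412162) instead of NaN.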

In [23]:
from sklearn.preprocessing import Imputer

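# Instantiate the imputer
imp = Imputer(missing_values='NaN', # treat NaN as the missing value
              strategy='mean',      # replace it with the mean
              axis=0)               # impute along columns (per-column mean)

# Learn the mean of each feature
imp.fit(X_new)

# Apply the fitted Imputer to replace the missing values in X_new.
X_new_columns = X_new.columns.values
X_new = pd.DataFrame(imp.transform(X_new),
                     columns=X_new_columns)
# Show the result
X_new.head()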
X_new.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849.0,0.0,146.412162,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,4583.0,1508.0,128.0,360.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,3000.0,0.0,66.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,2583.0,2358.0,120.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,6000.0,0.0,141.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
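
For readers on a scikit-learn release where `sklearn.preprocessing.Imputer` is no longer available, the same mean imputation might look as follows with `sklearn.impute.SimpleImputer`. This is a sketch on hypothetical toy data, not the notebook's original code; the names `toy` and `imp_new` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 200.0],
    "Credit_History": [1.0, 0.0, np.nan],
})

# SimpleImputer always works per column, so there is no axis argument
imp_new = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_imputed = pd.DataFrame(imp_new.fit_transform(toy), columns=toy.columns)

print(imp_new.statistics_)  # the per-column means learned by fit
```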


In [24]:
## Default settings of Imputer
Imputer()

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

In [25]:
## Mean of each column
imp.statistics_

array([5.40345928e+03, 1.62124580e+03, 1.46412162e+02, 3.42000000e+02,
       8.42198582e-01, 5.61889251e-01, 1.66123779e-01, 1.64495114e-01,
       8.30618893e-02, 2.44299674e-02, 1.82410423e-01, 7.96416938e-01,
       2.11726384e-02, 3.46905537e-01, 6.48208469e-01, 4.88599349e-03,
       7.81758958e-01, 2.18241042e-01, 0.00000000e+00, 8.14332248e-01,
       1.33550489e-01, 5.21172638e-02, 2.91530945e-01, 3.79478827e-01,
       3.28990228e-01, 0.00000000e+00])

In [26]:
X_new.columns.values     ## how to get the column names

array(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Dependents_0',
       'Dependents_1', 'Dependents_2', 'Dependents_3+', 'Dependents_nan',
       'Gender_Female', 'Gender_Male', 'Gender_nan', 'Married_No',
       'Married_Yes', 'Married_nan', 'Education_Graduate',
       'Education_Not Graduate', 'Education_nan', 'Self_Employed_No',
       'Self_Employed_Yes', 'Self_Employed_nan', 'Property_Area_Rural',
       'Property_Area_Semiurban', 'Property_Area_Urban',
       'Property_Area_nan'], dtype=object)

In [27]:
## Fill missing entries with the mean (X_new is already imputed at this point, so the values are unchanged)
imp.transform(X_new)

array([[5.84900000e+03, 0.00000000e+00, 1.46412162e+02, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
       [4.58300000e+03, 1.50800000e+03, 1.28000000e+02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [3.00000000e+03, 0.00000000e+00, 6.60000000e+01, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
       ...,
       [8.07200000e+03, 2.40000000e+02, 2.53000000e+02, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
       [7.58300000e+03, 0.00000000e+00, 1.87000000e+02, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
       [4.58300000e+03, 0.00000000e+00, 1.33000000e+02, ...,
        1.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

In [28]:
X_new.describe()    ## count is now 614 for every column

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
count,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,...,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199,0.561889,0.166124,0.164495,0.083062,0.02443,...,0.781759,0.218241,0.0,0.814332,0.13355,0.052117,0.291531,0.379479,0.32899,0.0
std,6109.041673,2926.248369,84.037468,64.372489,0.349681,0.496559,0.372495,0.371027,0.276201,0.154506,...,0.413389,0.413389,0.0,0.389155,0.340446,0.222445,0.454838,0.485653,0.470229,0.0
min,150.0,0.0,9.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2877.5,0.0,100.25,360.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3812.5,1188.5,129.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,5795.0,2297.25,164.75,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0
max,81000.0,41667.0,700.0,480.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0


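This completes one-hot encoding and missing-value imputation. If the number of feature dimensions is still small at this point, you can feed the data to a learning algorithm as is. In practice, however, the dimensionality often exceeds hundreds or thousands, in which case we apply the dimensionality reduction described below.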

## Preprocessing: Dimensionality Reduction (RFE & PCA)

The features have now grown from the original 11 dimensions to 26.<br><b>Here we use RFE to narrow them down to the top 10 variables judged useful for prediction.</b>

In [29]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Use RandomForestClassifier as the classifier that estimates feature importance
# Keep 10 features in the end
# Remove 5% of the features at each step
selector = RFE(estimator=RandomForestClassifier(random_state=0),
               n_features_to_select=10,
               step=.05)
selector.fit(X_new,y)            ## learn which features to drop and which to keep
print('Done normally')

Done normally


sklearn.feature_selection.RFE
- http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
- After fitting the model given as estimator (any model with a .fit() method), features are ranked by its coef_ or feature_importances_ attribute

In [30]:
## verbose lets you monitor progress. 26 * 0.05 = 1.3, so one variable is removed per step
RFE(estimator=RandomForestClassifier(random_state=0), n_features_to_select=10, step=.05, verbose=1).fit(X_new,y)

Fitting estimator with 26 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.


RFE(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False),
  n_features_to_select=10, step=0.05, verbose=1)
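
The 16 "Fitting estimator" lines in the log above follow from the step arithmetic. The sketch below reproduces that count in plain Python; it assumes a fractional step is truncated to a whole number of features with a minimum of one per round, which matches the behavior described in the cell comment.

```python
n_features, n_to_select, step = 26, 10, 0.05

# Assumption: a fractional step becomes at least one feature per round
n_remove = max(1, int(step * n_features))   # int(1.3) -> 1

rounds = 0
remaining = n_features
while remaining > n_to_select:
    # Never remove more than is needed to reach the target size
    remaining -= min(n_remove, remaining - n_to_select)
    rounds += 1

print(n_remove, rounds)  # 1 feature per round, 16 elimination rounds (26 down to 11)
```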

In [31]:
## Elimination order: the larger the number, the earlier the feature was dropped as unimportant (1 = kept)
selector.ranking_

array([ 1,  1,  1,  1,  1,  1,  1,  7, 10, 14,  4,  8, 12,  1, 13, 15,  5,
        2, 16,  9,  3, 11,  1,  1,  6, 17])

Fitting the RFE determined which of the 26 variables to keep.<br><b>You can check the retained variables through the "support_" attribute.</b><br>True marks the positions of the selected variables.

In [32]:
print(selector.support_)

[ True  True  True  True  True  True  True False False False False False
 False  True False False False False False False False False  True  True
 False False]
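
The relationship between `ranking_` and `support_` can be checked on a small synthetic problem. This is a sketch with data generated by `make_classification`, not the loan data; `toy_selector` and the chosen sizes are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 8 features, keep 3, remove one feature per round
X_toy, y_toy = make_classification(n_samples=200, n_features=8,
                                   n_informative=3, random_state=0)
toy_selector = RFE(estimator=RandomForestClassifier(n_estimators=20, random_state=0),
                   n_features_to_select=3, step=1).fit(X_toy, y_toy)

# support_ is True exactly where ranking_ equals 1
kept = np.flatnonzero(toy_selector.support_)
print("kept column indices:", kept)
print("ranking:", toy_selector.ranking_)
```

With step=1, each eliminated feature receives its own rank, so the five dropped features here are ranked 2 through 6.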


In [38]:
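## Internally, RFE uses RandomForestClassifier's feature importances for the selection
## Unimportant variables (low values) are removed
## (no random_state is set here, so the values vary from run to run)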
rf = RandomForestClassifier()
rf.fit(X_new,y)
rf.feature_importances_

array([0.19131698, 0.10825263, 0.17097205, 0.05086249, 0.23800176,
       0.01892412, 0.01886573, 0.01339865, 0.00571424, 0.00455582,
       0.01162349, 0.00871389, 0.00808601, 0.01536651, 0.0122019 ,
       0.        , 0.01420829, 0.01635809, 0.        , 0.01171654,
       0.01092854, 0.01218058, 0.01599709, 0.02586954, 0.01588505,
       0.        ])
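
A raw importance array is hard to read without names. One way to pair the values with feature names and sort them is sketched below on synthetic data; the names `f0` to `f5` are hypothetical placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=6,
                                   n_informative=3, random_state=0)
names = [f"f{i}" for i in range(X_toy.shape[1])]  # hypothetical feature names

rf_toy = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Label each importance with its feature name and sort from most to least important
importances = pd.Series(rf_toy.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

For the loan data, the same pattern would use the column names of the encoded frame as the index.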

Now that fit has determined which variables to select, let's actually reduce the data.<br>As with Imputer, the data is converted with transform.

In [41]:
# Reduce 26 dimensions to 10
X_new_selected=selector.transform(X_new)            ## pass the original data to the fitted selector's transform method; it returns only the retained features
X_new_selected=pd.DataFrame(X_new_selected,
                            columns=X_new_columns[selector.support_])

print('---------------------------------------')
print('X shape after RFE:', X_new_selected.shape)
print('---------------------------------------')
print(X_new_selected.dtypes)
X_new_selected.head()

---------------------------------------
X shape after RFE: (614, 10)
---------------------------------------
ApplicantIncome            float64
CoapplicantIncome          float64
LoanAmount                 float64
Loan_Amount_Term           float64
Credit_History             float64
Dependents_0               float64
Dependents_1               float64
Married_No                 float64
Property_Area_Rural        float64
Property_Area_Semiurban    float64
dtype: object


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Married_No,Property_Area_Rural,Property_Area_Semiurban
0,5849.0,0.0,146.412162,360.0,1.0,1.0,0.0,1.0,0.0,0.0
1,4583.0,1508.0,128.0,360.0,1.0,0.0,1.0,0.0,1.0,0.0
2,3000.0,0.0,66.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0
3,2583.0,2358.0,120.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0
4,6000.0,0.0,141.0,360.0,1.0,1.0,0.0,1.0,0.0,0.0


In [42]:
## Only the retained variables are returned, as an array
selector.transform(X_new)

array([[5.84900000e+03, 0.00000000e+00, 1.46412162e+02, ...,
        1.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [4.58300000e+03, 1.50800000e+03, 1.28000000e+02, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
       [3.00000000e+03, 0.00000000e+00, 6.60000000e+01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [8.07200000e+03, 2.40000000e+02, 2.53000000e+02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [7.58300000e+03, 0.00000000e+00, 1.87000000e+02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [4.58300000e+03, 0.00000000e+00, 1.33000000e+02, ...,
        1.00000000e+00, 0.00000000e+00, 1.00000000e+00]])

In [43]:
## Names of the retained variables
X_new_columns[selector.support_]

array(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Dependents_0',
       'Dependents_1', 'Married_No', 'Property_Area_Rural',
       'Property_Area_Semiurban'], dtype=object)

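A variant of RFE, the RFECV class, is also available; it determines the number of features to select automatically through cross-validation. It is more computationally expensive, but it is useful when the dataset is small and you want fewer hyperparameters to tune.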

In [44]:
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

selector = RFECV(estimator=RandomForestClassifier(random_state=0),
                 step=0.05)
X_new_selected = selector.fit_transform(X_new, y)
print(X_new_selected.shape)

(614, 22)


sklearn.feature_selection.RFECV
- http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html
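
After fitting, the number of features RFECV settled on is available as `n_features_`. The sketch below makes the cross-validation settings explicit; the choices `cv=3` and `scoring="f1"` and the synthetic data are assumptions for illustration, not the settings used in the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X_toy, y_toy = make_classification(n_samples=200, n_features=8,
                                   n_informative=3, random_state=0)

# cv and scoring control how each candidate feature count is evaluated
toy_rfecv = RFECV(estimator=RandomForestClassifier(n_estimators=20, random_state=0),
                  step=1, cv=3, scoring="f1")
X_toy_selected = toy_rfecv.fit_transform(X_toy, y_toy)

print(toy_rfecv.n_features_, X_toy_selected.shape)
```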

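This concludes feature selection with RFE. In predictive modeling, the RFE-selected features X are then put through cross-validation to choose the best model. Finally, <b>let's look at dimensionality reduction with PCA.</b> The simplest implementation is to embed PCA in a pipeline. The following applies PCA to X_new (the 26-dimensional features before RFE).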

In [52]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Embedding PCA in the pipeline applies the dimensionality reduction automatically
clf = Pipeline([('scl', StandardScaler()),
                ('reduct', PCA(n_components=10,random_state=1)),               ## reduce to 10 dimensions
                ('clf', GradientBoostingClassifier(random_state=1))])

# Scaling and PCA are applied automatically during training
clf.fit(X_new, y)
print('Normally done')

Normally done


Because the learner clf is actually a pipeline, <b>saving it as a trained model stores all three fitted components together: the fitted scl, the fitted reduct (PCA), and the fitted clf (model)</b>.
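
One common way to save such a pipeline is joblib. The following is a sketch on synthetic data, assuming joblib is installed (it ships as a scikit-learn dependency); the file name `pipeline.joblib` and the variable names are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=200, n_features=12, random_state=0)
pipe = Pipeline([("scl", StandardScaler()),
                 ("reduct", PCA(n_components=5, random_state=1)),
                 ("clf", GradientBoostingClassifier(random_state=1))]).fit(X_toy, y_toy)

# Persist the whole fitted pipeline to a temporary directory, then reload it
path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")  # hypothetical file name
joblib.dump(pipe, path)
restored = joblib.load(path)

# The restored object carries all three fitted steps and predicts identically
same = bool((restored.predict(X_toy) == pipe.predict(X_toy)).all())
print(list(restored.named_steps), same)
```

Files saved this way should only be loaded from trusted sources and with a compatible scikit-learn version.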

In [62]:
X_new.shape

(614, 26)

In [57]:
clf_pca = clf.named_steps['reduct']       ## the PCA step of the pipeline
clf_pca

PCA(copy=True, iterated_power='auto', n_components=10, random_state=1,
  svd_solver='auto', tol=0.0, whiten=False)

In [66]:
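## Inspect the data reduced to 10 dimensions
## Note: X_new is passed to the PCA step without the pipeline's StandardScaler,
## so these values differ from what the classifier receives inside clf
pd.DataFrame( clf_pca.transform(X_new) ).head()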

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-567.461858,1958.813917,319.239812,-352.505394,1094.945827,2281.3507,-376.883728,1249.747394,-108.276765,-741.583342
1,-556.083236,1628.159876,173.526494,-481.212264,835.839651,1675.059168,-320.040921,1071.059013,266.506298,137.262842
2,-271.574554,1008.46027,146.875476,-185.161151,584.693951,1159.121405,-162.660544,613.505158,-58.505329,-409.279887
3,-420.766213,1024.43682,14.14765,-481.966769,466.28575,839.050225,-193.9473,691.250168,504.051205,785.123102
4,-581.627569,2005.919092,328.237999,-360.520594,1120.04986,2337.798753,-387.593756,1281.818803,-111.312893,-759.964959
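
The large magnitudes in the table above arise because unscaled data was fed to a PCA that was fitted on standardized data. To see the components as the classifier sees them, the scaler step has to be applied first. A sketch on synthetic data follows; the pipeline `pipe` and its sizes are hypothetical, and LogisticRegression stands in for the final model only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=200, n_features=12, random_state=0)
pipe = Pipeline([("scl", StandardScaler()),
                 ("reduct", PCA(n_components=5, random_state=1)),
                 ("clf", LogisticRegression())]).fit(X_toy, y_toy)

# Apply the fitted steps in pipeline order: scale first, then project
scaled = pipe.named_steps["scl"].transform(X_toy)
scores = pipe.named_steps["reduct"].transform(scaled)

print(scores.shape)
print(pipe.named_steps["reduct"].explained_variance_ratio_)
```

Because the training data was standardized before PCA, each component column of `scores` has a mean of approximately zero.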


In [69]:
## Predicted probabilities for X_new (the same data used for training)
clf.predict_proba(X_new)

array([[0.78096757, 0.21903243],
       [0.47887936, 0.52112064],
       [0.89160544, 0.10839456],
       ...,
       [0.93715737, 0.06284263],
       [0.76268201, 0.23731799],
       [0.50406041, 0.49593959]])
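
The probabilities above are computed on the training data itself, so they tend to look better than performance on unseen data. Cross-validation gives a fairer estimate. The sketch below scores a pipeline of the same shape with F1 on synthetic data; the fold count and the data are assumptions, not results for the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=200, n_features=12, random_state=0)
pipe = Pipeline([("scl", StandardScaler()),
                 ("reduct", PCA(n_components=5, random_state=1)),
                 ("clf", GradientBoostingClassifier(random_state=1))])

# Each fold refits the scaler, the PCA and the classifier on its training part only
scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="f1")
print(scores.mean())
```

Passing the whole pipeline to `cross_val_score` keeps the preprocessing inside each fold, which avoids leaking information from the validation part into the scaler or the PCA.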