## Data Processing in the Scoring Phase (Understanding the Problem) (Practice)

Using loan screening data, let's <b>review the data processing done in the modeling phase</b>,<br>and then learn the techniques needed for data processing in the scoring phase.

In [1]:
# import sample data: Loan screening data for classification 
import pandas as pd

df = pd.read_csv('./data/av_loan_u6lujuX_CVtuZ9i.csv',header=0)
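X  = df.iloc[:,:-1]           # all columns except the last are the features X
ID = X.iloc[:,[0]]            # the first column is the primary key (Loan_ID), so keep it as ID
X  = X.drop('Loan_ID',axis=1) # drop the first column (Loan_ID) from the feature matrix
y  = df.iloc[:,-1]            # the last column is the target label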

# check the shape
print('--------------------------------------')
print('Raw shape: (%i,%i)' %df.shape)
print('X shape: (%i,%i)' %X.shape)

# convert string labels to numbers (binary flag)
# samples whose loan application was rejected ('N') become 1 (the positive class)
class_mapping = {'N':1, 'Y':0}
y = y.map(class_mapping)
print('---------------------------------------')
print(y.value_counts())
print('---------------------------------------')
print(ID.join(X).join(y).dtypes)
ID.join(X).join(y).head()

--------------------------------------
Raw shape: (614,13)
X shape: (614,11)
---------------------------------------
0    422
1    192
Name: Loan_Status, dtype: int64
---------------------------------------
Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status            int64
dtype: object


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,0
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,1
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,0
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,0
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,0


In [2]:
## check for missing values
ID.join(X).join(y).isnull().any()

Loan_ID              False
Gender                True
Married               True
Dependents            True
Education            False
Self_Employed         True
ApplicantIncome      False
CoapplicantIncome    False
LoanAmount            True
Loan_Amount_Term      True
Credit_History        True
Property_Area        False
Loan_Status          False
dtype: bool

As preprocessing for the modeling phase, we first apply one-hot encoding.

In [4]:
ohe_columns = ['Dependents',
               'Gender',
               'Married',
               'Education',
               'Self_Employed',
               'Property_Area']
X_ohe = pd.get_dummies(X,
                       dummy_na=True,
                       columns=ohe_columns)
print('X_ohe shape:(%i,%i)' % X_ohe.shape)
X_ohe.head()

X_ohe shape:(614,26)


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849,0.0,,360.0,1.0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,1,0
1,4583,1508.0,128.0,360.0,1.0,0,1,0,0,0,...,1,0,0,1,0,0,1,0,0,0
2,3000,0.0,66.0,360.0,1.0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,1,0
3,2583,2358.0,120.0,360.0,1.0,1,0,0,0,0,...,0,1,0,1,0,0,0,0,1,0
4,6000,0.0,141.0,360.0,1.0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,1,0
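
One detail in the output above is worth a second look: `dummy_na=True` adds a `_nan` indicator column for every encoded variable, even one with no missing values, which is why `Education_nan` appears although `Education` has no missing entries. The following is a minimal sketch on a made-up three-row frame; the frame `toy` and its values are assumptions for illustration and are not part of the loan dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): Education has no missing entries
toy = pd.DataFrame({'Education': ['Graduate', 'Not Graduate', 'Graduate']})

toy_ohe = pd.get_dummies(toy, dummy_na=True, columns=['Education'])

# The NaN indicator column is created even though nothing is missing
print(list(toy_ohe.columns))
print(int(toy_ohe['Education_nan'].sum()))
```

Because these indicator columns are always created, the column count alone does not reveal whether a variable actually contained missing values.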


Next, we replace missing values in the continuous variables with the column mean.

In [5]:
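# Note: Imputer was removed in scikit-learn 0.22; on newer versions use
# sklearn.impute.SimpleImputer(strategy='mean') instead.
from sklearn.preprocessing import Imputer

# replace missing values (NaN) with the column mean
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(X_ohe)

# apply the fitted Imputer to fill the missing values in X_ohe
X_ohe_columns = X_ohe.columns.values
X_ohe = pd.DataFrame(imp.transform(X_ohe), columns=X_ohe_columns)

# show the result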
X_ohe.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849.0,0.0,146.412162,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,4583.0,1508.0,128.0,360.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,3000.0,0.0,66.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,2583.0,2358.0,120.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,6000.0,0.0,141.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
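
What matters for the scoring phase is that the fill values are learned from the modeling data and then reused, rather than recomputed on new data. Below is a minimal pandas-only sketch of that idea; the frames `train` and `score` and their values are hypothetical stand-ins, and `fillna` is swapped in for the imputer purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical values) for the modeling and scoring data
train = pd.DataFrame({'LoanAmount': [100.0, np.nan, 200.0]})
score = pd.DataFrame({'LoanAmount': [np.nan, 50.0]})

# Learn the fill values from the modeling data only
train_means = train.mean()

train_filled = train.fillna(train_means)
# Reuse the modeling-time means; do not recompute them on the scoring data
score_filled = score.fillna(train_means)

print(score_filled)
```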


In [6]:
X_ohe.join(y).isnull().any()

ApplicantIncome            False
CoapplicantIncome          False
LoanAmount                 False
Loan_Amount_Term           False
Credit_History             False
Dependents_0               False
Dependents_1               False
Dependents_2               False
Dependents_3+              False
Dependents_nan             False
Gender_Female              False
Gender_Male                False
Gender_nan                 False
Married_No                 False
Married_Yes                False
Married_nan                False
Education_Graduate         False
Education_Not Graduate     False
Education_nan              False
Self_Employed_No           False
Self_Employed_Yes          False
Self_Employed_nan          False
Property_Area_Rural        False
Property_Area_Semiurban    False
Property_Area_Urban        False
Property_Area_nan          False
Loan_Status                False
dtype: bool

In [10]:
## check the shapes before and after processing
print( X.shape )
print( X_ohe.shape )

(614, 11)
(614, 26)


Finally, we perform feature selection with RFE (recursive feature elimination).

In [11]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

selector = RFE(RandomForestClassifier(random_state=1),
               n_features_to_select=10,                      ## select 10 features
               step=.05)

selector.fit(X_ohe,y)

X_fin = pd.DataFrame(selector.transform(X_ohe),
                     columns=X_ohe_columns[selector.support_])

print('X_fin shape:(%i,%i)' % X_fin.shape)
X_fin.head()

X_fin shape:(614,10)


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Education_Graduate,Self_Employed_No,Property_Area_Semiurban,Property_Area_Urban
0,5849.0,0.0,146.412162,360.0,1.0,1.0,1.0,1.0,0.0,1.0
1,4583.0,1508.0,128.0,360.0,1.0,0.0,1.0,1.0,0.0,0.0
2,3000.0,0.0,66.0,360.0,1.0,1.0,1.0,0.0,0.0,1.0
3,2583.0,2358.0,120.0,360.0,1.0,1.0,0.0,1.0,0.0,1.0
4,6000.0,0.0,141.0,360.0,1.0,1.0,1.0,1.0,0.0,1.0


In [12]:
## before feature selection: 26 features
X_ohe.columns

Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Dependents_0', 'Dependents_1',
       'Dependents_2', 'Dependents_3+', 'Dependents_nan', 'Gender_Female',
       'Gender_Male', 'Gender_nan', 'Married_No', 'Married_Yes', 'Married_nan',
       'Education_Graduate', 'Education_Not Graduate', 'Education_nan',
       'Self_Employed_No', 'Self_Employed_Yes', 'Self_Employed_nan',
       'Property_Area_Rural', 'Property_Area_Semiurban', 'Property_Area_Urban',
       'Property_Area_nan'],
      dtype='object')

In [13]:
## after feature selection: 10 features
X_fin.columns

Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Dependents_0',
       'Education_Graduate', 'Self_Employed_No', 'Property_Area_Semiurban',
       'Property_Area_Urban'],
      dtype='object')

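That completes the data processing for the modeling phase. For unseen data (the scoring data), <b>we must produce exactly these 10 features, in exactly this order</b>. The model trained on this data computes its predictions on the assumption that these columns occupy positions 0 through 9. Now, <b>at last, we process the scoring data</b>.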
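
As a brief aside before loading the scoring data, here is a small sketch of why column position matters. It assumes a hypothetical linear scorer with made-up weights (not the notebook's model) and shows that the same values fed in a different column order produce a different score.

```python
import numpy as np

# Hypothetical weights of a linear scorer fitted on columns [income, loan_amount]
weights = np.array([0.001, -0.02])

row_train_order = np.array([5000.0, 120.0])  # [income, loan_amount]
row_swapped = np.array([120.0, 5000.0])      # same values, columns swapped

score_ok = float(weights @ row_train_order)
score_bad = float(weights @ row_swapped)

# The model has no notion of column names here, only positions
print(score_ok, score_bad)
```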

In [14]:
# import sample data for classification
df_s = pd.read_csv('./data/av_loan_test_Y3wMUE5_7gLdaTN.csv', header=0)
ID_s = df_s.iloc[:,[0]]            # column 0 is the primary key (Loan_ID), so keep it as ID
X_s  = df_s.drop('Loan_ID',axis=1) # Loan_ID is an identifier, so drop it from the feature matrix

# check the shape
print('Raw shape: (%i,%i)' %df_s.shape)
print('X shape: (%i,%i)' %X_s.shape)
print('-------------------------------')
print(X_s.dtypes)

Raw shape: (333,12)
X shape: (333,11)
-------------------------------
Gender                object
Married               object
Dependents           float64
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome      int64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
dtype: object


In [15]:
## the original data used to build the model
X.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban


In [17]:
## the newly loaded scoring data
X_s.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,Male,Yes,0.0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,Male,Yes,1.0,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,Male,Yes,2.0,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,Male,Yes,2.0,Graduate,No,2340,2546,100.0,360.0,,Urban
4,Male,No,0.0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban


As in the modeling phase, we start with one-hot encoding.

In [20]:
X_ohe_s = pd.get_dummies(X_s,
                         dummy_na=True,
                         columns=ohe_columns)
print('X_ohe_s shape:(%i,%i)' % X_ohe_s.shape)
X_ohe_s.head()

X_ohe_s shape:(333,26)


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0.0,Dependents_1.0,Dependents_2.0,Dependents_nan,Gender_Female,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5720,0,110.0,360.0,1.0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,1,0
1,3076,1500,126.0,360.0,1.0,0,1,0,0,0,...,1,0,0,1,0,0,0,0,1,0
2,5000,1800,208.0,360.0,1.0,0,0,1,0,0,...,1,0,0,1,0,0,0,0,1,0
3,2340,2546,100.0,360.0,,0,0,1,0,0,...,1,0,0,1,0,0,0,0,1,0
4,3276,0,78.0,360.0,1.0,1,0,0,0,0,...,0,1,0,1,0,0,0,0,1,0


In [23]:
print( X_ohe.shape )
print( X_ohe_s.shape )
## the modeling and scoring data have the same number of columns

(614, 26)
(333, 26)


In [24]:
X_ohe_s.isnull().any()

ApplicantIncome            False
CoapplicantIncome          False
LoanAmount                  True
Loan_Amount_Term            True
Credit_History              True
Dependents_0.0             False
Dependents_1.0             False
Dependents_2.0             False
Dependents_nan             False
Gender_Female              False
Gender_Male                False
Gender_Unknown             False
Gender_nan                 False
Married_No                 False
Married_Yes                False
Married_nan                False
Education_Graduate         False
Education_Not Graduate     False
Education_nan              False
Self_Employed_No           False
Self_Employed_Yes          False
Self_Employed_nan          False
Property_Area_Rural        False
Property_Area_Semiurban    False
Property_Area_Urban        False
Property_Area_nan          False
dtype: bool

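After one-hot encoding, the scoring data has 26 feature dimensions, the same as at modeling time (26), but let's compare the feature lists more rigorously by looking at the set differences between them.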

In [26]:
cols_model = set(X_ohe.columns.values)
cols_score = set(X_ohe_s.columns.values)

# columns present in the modeling data but missing from the scoring data
diff1 = cols_model - cols_score                        ## converting to sets lets us take the difference with "-"
print('Model only:%s' % diff1)

# columns present in the scoring data but missing from the modeling data
diff2 = cols_score - cols_model
print('Score only:%s' % diff2)

Model only:{'Dependents_1', 'Dependents_2', 'Dependents_3+', 'Dependents_0'}
Score only:{'Dependents_2.0', 'Dependents_0.0', 'Gender_Unknown', 'Dependents_1.0'}


In [29]:
## focus on the "Dependents" and "Gender" columns

In [27]:
## original modeling data
X[["Dependents","Gender"]]

Unnamed: 0,Dependents,Gender
0,0,Male
1,1,Male
2,0,Male
3,0,Male
4,0,Male
5,2,Male
6,0,Male
7,3+,Male
8,2,Male
9,1,Male


In [28]:
## original scoring data
X_s[["Dependents","Gender"]]

Unnamed: 0,Dependents,Gender
0,0.0,Male
1,1.0,Male
2,2.0,Male
3,2.0,Male
4,0.0,Male
5,0.0,Male
6,1.0,Female
7,2.0,Male
8,2.0,Male
9,0.0,Male


In [33]:
X[["Dependents","Gender"]].dtypes

Dependents    object
Gender        object
dtype: object

In [34]:
X_s[["Dependents","Gender"]].dtypes

Dependents    float64
Gender         object
dtype: object
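
One plausible cause of this dtype difference (an assumption, since the notebook does not inspect the raw files) is how `read_csv` infers types: a single non-numeric token such as `3+` keeps the whole column as strings, while a column of plain numbers with a missing entry is read as float64. The sketch below reproduces this on two small in-memory CSVs whose contents are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV snippets: one contains the token "3+", the other a blank field
csv_model = "Loan_ID,Dependents\nA,0\nB,1\nC,3+\n"
csv_score = "Loan_ID,Dependents\nA,0\nB,1\nC,\nD,2\n"

m = pd.read_csv(io.StringIO(csv_model))
s = pd.read_csv(io.StringIO(csv_score))

# "3+" cannot be parsed as a number, so the column stays non-numeric (string/object)
print(m['Dependents'].dtype)
# All values are numeric, and the missing entry forces a float column
print(s['Dependents'].dtype)
```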

In [35]:
X["Gender"].value_counts()

Male      489
Female    112
Name: Gender, dtype: int64

In [36]:
X_s["Gender"].value_counts()

Male       256
Female      65
Unknown      1
Name: Gender, dtype: int64

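In fact, this scoring data has been deliberately altered in the following two ways.
1. A new value, "Unknown", was added to the Gender variable.
2. The value "3+" was removed from the Dependents variable (the remaining values are the three categories 0, 1, and 2).

As a result, the dummy variable "Dependents_3+" exists only in the modeling data, and the dummy variable "Gender_Unknown" exists only in the scoring data. The remaining differences arise because pandas inferred Dependents as object dtype when loading the modeling data but as float64 when loading the scoring data. In summary, one-hot encoding the scoring data can introduce the following inconsistencies.
1. Columns that did not exist in the modeling data may be generated (Gender_Unknown).
2. Columns that existed in the modeling data may disappear (Dependents_3+).
3. Differences in data type may cause items 1 and 2.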
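
One possible way to reconcile such mismatches, sketched here on toy data and not taken from the notebook itself, is to normalize the category labels first and then reindex the scoring columns to the modeling columns. The frames `train` and `score`, their values, and the label-conversion rule are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the modeling and scoring data (hypothetical values)
train = pd.DataFrame({'Dependents': ['0', '1', '3+'],
                      'Gender': ['Male', 'Female', 'Male']})
score = pd.DataFrame({'Dependents': [0.0, 1.0, 2.0],
                      'Gender': ['Male', 'Unknown', 'Female']})
cols = ['Dependents', 'Gender']

train_ohe = pd.get_dummies(train, columns=cols)

# Third item: make the float labels (0.0, 1.0, ...) match the training strings ('0', '1', ...)
score_fixed = score.copy()
score_fixed['Dependents'] = score_fixed['Dependents'].map(
    lambda v: v if pd.isna(v) else str(int(v)))

score_ohe = pd.get_dummies(score_fixed, columns=cols)

# First and second items: drop columns unseen in training, add missing ones
# as all-zero columns, and put everything in the training column order
score_aligned = score_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(list(score_aligned.columns))
```

Note the design choice this implies: categories unseen at modeling time (here `Gender_Unknown`) are silently dropped, so such rows end up with all zeros for that variable. Whether that is acceptable depends on the application.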