## Sprint 機械学習フロー

## 2.機械学習フロー


Kaggleの Home Credit Default Risk コンペティションを題材に、機械学習の実践的な流れを学びます。特に適切な 検証 を行い、高い 汎化性能 のあるモデルを完成させることを目指します。

[Home Credit Default Risk | Kaggle](https://www.kaggle.com/c/home-credit-default-risk)

#### 【問題1】クロスバリデーション
事前学習期間では検証データをはじめに分割しておき、それに対して指標値を計算することで検証を行っていました。（ホールドアウト法）しかし、分割の仕方により精度は変化します。実践的には クロスバリデーション（交差検証） を行います。分割を複数回行い、それぞれに対して学習と検証を行う方法です。複数回の分割のためにscikit-learnにはKFoldクラスが用意されています。

事前学習期間の課題で作成したベースラインモデルに対してKFoldクラスによるクロスバリデーションを行うコードを作成し実行してください。

[sklearn.model_selection.KFold — scikit-learn 0.21.3 documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression  # 疑似データ作成用
from sklearn.model_selection import train_test_split  # 疑似データ作成用
from sklearn.model_selection import KFold

In [3]:
train = pd.read_csv("../../Week4/application_train.csv")
test = pd.read_csv("../../Week4/application_test.csv")
train_X = train.drop(["TARGET"],axis=1)
train_y = train["TARGET"]
train_X.shape

(307511, 121)

In [4]:
train_X = train_X.drop(["SK_ID_CURR"],axis=1)
test = test.drop(["SK_ID_CURR"],axis=1)

In [5]:
train_X.shape

(307511, 120)

In [6]:
train_num = train_X.dtypes[train_X.dtypes!=object].index.values
test_num = test.dtypes[test.dtypes!=object].index.values

In [7]:
train_num.shape,test_num.shape

((104,), (104,))

In [8]:
cols = ['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
       'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH']

In [9]:
train_X2 = train_X[cols].fillna(0)
train_X2.isnull().sum()

CNT_CHILDREN                  0
AMT_INCOME_TOTAL              0
AMT_CREDIT                    0
AMT_ANNUITY                   0
AMT_GOODS_PRICE               0
REGION_POPULATION_RELATIVE    0
DAYS_BIRTH                    0
dtype: int64

In [10]:
X_train,X_test,y_train,y_test=train_test_split(train_X2.values,train_y.values,random_state=0)

In [11]:
X_train.shape,X_test.shape,y_train.shape

((230633, 7), (76878, 7), (230633,))

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

In [13]:
type(y_train)

numpy.ndarray

In [14]:
clf = LogisticRegression()
clf.fit(X_train,y_train)
y_pre = clf.predict_proba(X_test)[:,1]
roc_auc_score(y_test,y_pre)


0.6041217758872266

#### クロスバリデーション

In [15]:
X = train_X2.values
y = train_y.values

In [16]:
kf = KFold(n_splits=4)
kf.get_n_splits(X)
scores=[]

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train,y_train)
    y_pre = clf.predict_proba(X_test)[:,1]
    score = roc_auc_score(y_test,y_pre)
    scores.append(score)

TRAIN: [ 76878  76879  76880 ... 307508 307509 307510] TEST: [    0     1     2 ... 76875 76876 76877]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [ 76878  76879  76880 ... 153753 153754 153755]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [153756 153757 153758 ... 230631 230632 230633]
TRAIN: [     0      1      2 ... 230631 230632 230633] TEST: [230634 230635 230636 ... 307508 307509 307510]


In [17]:
scores,"mean{}".format(np.mean(scores))

([0.6063397847859013,
  0.6061090615614468,
  0.6020403292004952,
  0.6087892076616084],
 'mean0.6058195958023629')

#### 【問題2】グリッドサーチ
これまで分類器のパラメータには触れず、デフォルトの設定を使用していました。パラメータの詳細は今後のSprintで学んでいくことになります。機械学習の前提として、パラメータは状況に応じて最適なものを選ぶ必要があります。最適なパラメータを探していくことを パラメータチューニング と呼びます。パラメータチューニングをある程度自動化する単純な方法としては グリッドサーチ があります。

scikit-learnのGridSearchCVを使い、グリッドサーチを行うコードを作成してください。そして、ベースラインモデルに対して何らかしらのパラメータチューニングを行なってください。どのパラメータをチューニングするかは、使用した手法の公式ドキュメントを参考にしてください。

[sklearn.model_selection.GridSearchCV — scikit-learn 0.21.3 documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

GridSearchCVクラスには引数としてモデル、探索範囲、さらにクロスバリデーションを何分割で行うかを与えます。クロスバリデーションの機能も含まれているため、これを使用する場合はKFoldクラスを利用する必要はありません。

In [18]:
from sklearn.model_selection import GridSearchCV

In [19]:
X.shape,y.shape

((307511, 7), (307511,))

In [20]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C' : [0.001, 0.01, 0.1, 1, 10, 100]}
clf = GridSearchCV(LogisticRegression(), param_grid)
clf.fit(X, y)


GridSearchCV(estimator=LogisticRegression(),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100]})

In [21]:
print(clf.best_params_)
print(clf.best_score_)

{'C': 0.001}
0.9192451652313359


In [22]:
lr2 = LogisticRegression(C=0.001)

In [141]:
kf = KFold(n_splits=4)
kf.get_n_splits(X)
scores0=[]
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    lr2.fit(X_train,y_train)
    y_pre = lr2.predict_proba(X_test)[:,1]
    score = roc_auc_score(y_test,y_pre)
    scores0.append(score)
scores2,"mean{}".format(np.mean(scores0))

TRAIN: [ 76878  76879  76880 ... 307508 307509 307510] TEST: [    0     1     2 ... 76875 76876 76877]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [ 76878  76879  76880 ... 153753 153754 153755]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [153756 153757 153758 ... 230631 230632 230633]
TRAIN: [     0      1      2 ... 230631 230632 230633] TEST: [230634 230635 230636 ... 307508 307509 307510]


([0.6059724980057503,
  0.6053849490781649,
  0.6060261192669107,
  0.6017049294727999,
  0.6100858508661682],
 'mean0.605819586194731')

#### 【問題3】Kaggle Notebooksからの調査
KaggleのNotebooksから様々なアイデアを見つけ出して、列挙してください。


- ハイパーパラメータのランダムサーチ　randomizedsearchCV
- ベイズ最適化　(hyperopt,optunna,scikit-optimize)
- 計算時間を節約するためにクロスバリデーション全てのfoldでなく  そのうち一つのfoldを使って精度を確認する。
- foldの分け方を変えて平均を使う
- 同じモデルの乱数シードを変えて平均をとる
- ２つ以上のモデルを組み合わせて予測する
- アンサンブルを行う


#### 【問題4】高い汎化性能のモデル作成
問題3で見つけたアイデアと、独自のアイデアを組み合わせ高い汎化性能のモデル作りを進めてください。

その過程として、何を行うことで、クロスバリデーションの結果がどの程度変化したかを表にまとめてください。


#### splitの数を変える

In [192]:
kf = KFold(n_splits=5,random_state=71)
kf.get_n_splits(X)
scores2=[]
lft=[]
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    lr2.fit(X_train,y_train)
    y_pre2 = lr2.predict_proba(X_test)[:,1]
    score = roc_auc_score(y_test,y_pre2)
    scores2.append(score)
    lft.append(y_pre2)
scores2,"mean{}".format(np.mean(scores2))



TRAIN: [ 61503  61504  61505 ... 307508 307509 307510] TEST: [    0     1     2 ... 61500 61501 61502]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [ 61503  61504  61505 ... 123002 123003 123004]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [123005 123006 123007 ... 184504 184505 184506]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [184507 184508 184509 ... 246006 246007 246008]
TRAIN: [     0      1      2 ... 246006 246007 246008] TEST: [246009 246010 246011 ... 307508 307509 307510]


([0.6059724980057503,
  0.6053849490781649,
  0.6060261192669107,
  0.6017049294727999,
  0.6100858508661682],
 'mean0.6058348693379589')

#### 違うモデルを試す

ランダムフォレスト

In [25]:
from sklearn.ensemble import RandomForestClassifier

In [26]:
rfc = RandomForestClassifier()

In [138]:
kf = KFold(n_splits=4,random_state=71)
kf.get_n_splits(X)
scores3=[]
rant=[]
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    rfc.fit(X_train,y_train)
    y_pre3 = rfc.predict_proba(X_test)[:,1]
    score = roc_auc_score(y_test,y_pre3)
    scores3.append(score)
    rant.append(y_pre3)
scores3,"mean{}".format(np.mean(scores3))

TRAIN: [ 76878  76879  76880 ... 307508 307509 307510] TEST: [    0     1     2 ... 76875 76876 76877]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [ 76878  76879  76880 ... 153753 153754 153755]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [153756 153757 153758 ... 230631 230632 230633]
TRAIN: [     0      1      2 ... 230631 230632 230633] TEST: [230634 230635 230636 ... 307508 307509 307510]


([0.6207453587857042,
  0.6187088857664038,
  0.6096555763208752,
  0.6162109540320113],
 'mean0.6163301937262486')

XGboost

In [139]:
import xgboost as xgb
xmo = xgb.XGBClassifier()
kf = KFold(n_splits=4,random_state=71)
kf.get_n_splits(X)
scores4=[]
xmot=[]
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    xmo.fit(X_train,y_train)
    y_pre4 = xmo.predict_proba(X_test)[:,1]
    score = roc_auc_score(y_test,y_pre4)
    scores4.append(score)
    xmot.append(y_pre4)
scores4,"mean{}".format(np.mean(scores4))



TRAIN: [ 76878  76879  76880 ... 307508 307509 307510] TEST: [    0     1     2 ... 76875 76876 76877]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [ 76878  76879  76880 ... 153753 153754 153755]




TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [153756 153757 153758 ... 230631 230632 230633]




TRAIN: [     0      1      2 ... 230631 230632 230633] TEST: [230634 230635 230636 ... 307508 307509 307510]






([0.6565378127614647,
  0.6555051190798276,
  0.6501837187469788,
  0.6554634097356806],
 'mean0.6544225150809879')

エクストラツリー

In [140]:
et = ExtraTreesClassifier()
kf = KFold(n_splits=4,random_state=71)
kf.get_n_splits(X)
scores6=[]
ett = []
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    et.fit(X_train,y_train)
    y_pre6 = et.predict_proba(X_test)[:,1]
    score6 = roc_auc_score(y_test,y_pre6)
    scores6.append(score6)
    ett.append(y_pre6)
scores6,"mean{}".format(np.mean(scores6))



TRAIN: [ 76878  76879  76880 ... 307508 307509 307510] TEST: [    0     1     2 ... 76875 76876 76877]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [ 76878  76879  76880 ... 153753 153754 153755]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [153756 153757 153758 ... 230631 230632 230633]
TRAIN: [     0      1      2 ... 230631 230632 230633] TEST: [230634 230635 230636 ... 307508 307509 307510]


([0.603602517303539,
  0.6034187535895288,
  0.5946225975297057,
  0.6039120433613492],
 'mean0.6013889779460306')

#### スタッキング

In [51]:
test_x = test[cols].fillna(0).values
test_x.shape,X.shape

((48744, 7), (307511, 7))

In [110]:
# パラメータ
ntrain = train.shape[0] # 891
ntest = test.shape[0] # 418
SEED = 0
NFOLDS = 5 # 5分割
kf = KFold(n_splits= NFOLDS, random_state=SEED)

# Sclearn分類機を拡張
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self,x,y):
        return self.clf.fit(x,y)

    def feature_importances(self,x,y):
        print(self.clf.fit(x,y).feature_importances_)



In [111]:
def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train,y_train)): # NFOLDS回まわる
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.train(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

In [112]:
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)

In [113]:
logi = SklearnHelper(clf=LogisticRegression,params={'C' : 0.001 })
ran = SklearnHelper(clf=RandomForestClassifier,params={'n_jobs':-1, 'max_depth': 6,})
gb = SklearnHelper(clf=GradientBoostingClassifier,params={'max_depth': 5})
ada = SklearnHelper(clf=AdaBoostClassifier,params={'n_estimators': 500})
et = SklearnHelper(clf=ExtraTreesClassifier,params={'max_depth': 8})

In [114]:
logi_train,logi_test = get_oof(logi,X,y,test_x)
ran_train,ran_test = get_oof(ran,X,y,test_x)
gb_train,gb_test = get_oof(gb,X,y,test_x)
ada_train,ada_test = get_oof(ada,X,y,test_x)
et_train,et_test = get_oof(et,X,y,test_x)

In [116]:
#logi_feature = logi.feature_importances(X,y)
ran_feature = ran.feature_importances(X,y)
gb_feature = gb.feature_importances(X,y)
ada_feature = ada.feature_importances(X,y)
et_feature = et.feature_importances(X,y)

[0.01057125 0.03914087 0.17121599 0.13723603 0.21867806 0.08960133
 0.33355648]
[0.01349136 0.0529016  0.17151891 0.18503347 0.2068336  0.12543991
 0.24478115]
[0.02  0.038 0.204 0.204 0.164 0.248 0.122]
[0.01819274 0.01596718 0.1104713  0.05911038 0.14122406 0.13360715
 0.52142719]


In [119]:
ran_feature = [0.01057125, 0.03914087, 0.17121599, 0.13723603, 0.21867806, 0.08960133, 0.33355648]
gb_feature = [0.01349136, 0.0529016,  0.17151891, 0.18503347, 0.2068336,  0.12543991, 0.24478115]
ada_feature = [0.02,  0.038, 0.204, 0.204, 0.164, 0.248, 0.122]
et_feature = [0.01819274, 0.01596718, 0.1104713,  0.05911038, 0.14122406, 0.13360715, 0.52142719]

In [125]:
base_predictions_train = pd.DataFrame( 
    {'RandomForest': ran_train.ravel(),
     'ExtraTrees': et_train.ravel(),
     'AdaBoost': ada_train.ravel(),
      'GradientBoost': gb_train.ravel()
    })
print('base_predictions_train.shape : ', base_predictions_train.shape)
base_predictions_train.head()

base_predictions_train.shape :  (307511, 4)


Unnamed: 0,RandomForest,ExtraTrees,AdaBoost,GradientBoost
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0


In [130]:
x_train = np.concatenate(( et_train, ran_train, ada_train, gb_train,logi_train), axis=1)
x_test = np.concatenate(( et_test, ran_test, ada_test, gb_test,logi_test), axis=1)
print('x_train.shape : ', x_train.shape)
print('x_test.shape : ', x_test.shape)

x_train.shape :  (307511, 5)
x_test.shape :  (48744, 5)


In [134]:
xmo = xgb.XGBClassifier(n_estimators= 2000,max_depth= 8,)
kf = KFold(n_splits=4,random_state=71)
kf.get_n_splits(x_train)
scores5=[]
for train_index, test_index in kf.split(x_train):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train1, X_test1 = x_train[train_index], x_train[test_index]
    y_train1, y_test1 = y[train_index], y[test_index]
    
    xmo.fit(X_train1,y_train1)
    y_pre5 = xmo.predict_proba(X_test1)[:,1]
    score1 = roc_auc_score(y_test1,y_pre5)
    scores5.append(score1)
scores5,"mean{}".format(np.mean(scores5))



TRAIN: [ 76878  76879  76880 ... 307508 307509 307510] TEST: [    0     1     2 ... 76875 76876 76877]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [ 76878  76879  76880 ... 153753 153754 153755]




TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [153756 153757 153758 ... 230631 230632 230633]




TRAIN: [     0      1      2 ... 230631 230632 230633] TEST: [230634 230635 230636 ... 307508 307509 307510]




([0.5001195928210687,
  0.4998723567204187,
  0.5000915977754025,
  0.5000392588876704],
 'mean0.50003070155114')

0.5と値が下がってしまった

#### スタッキングその２

それぞれのモデルの予測値を結合させる

In [229]:
a,b,c,d,e=lft


In [230]:
lf_all = np.concatenate([a,b,c,d,e])
lf_all.shape

(307511,)

In [231]:
roc_auc_score(y,lf_all)

0.6057810870956006

In [232]:
a,b,c,d = rant
ran_all = np.concatenate([a,b,c,d])
ran_all.shape

(307511,)

In [233]:
roc_auc_score(y,ran_all)

0.6163022051896226

In [235]:
a,b,c,d = xmot
xg_all = np.concatenate([a,b,c,d])
xg_all.shape

(307511,)

In [237]:
roc_auc_score(y,xg_all)

0.6543636154994501

In [238]:
a,b,c,d = ett
et_all = np.concatenate([a,b,c,d])
et_all.shape

(307511,)

In [239]:
roc_auc_score(y,et_all)

0.6013854872364193

In [242]:
sta = pd.DataFrame( 
    {'ロジスティック回帰': lf_all.ravel(),
     'ランダムフォレスト': ran_all.ravel(),
     'XGBoost': xg_all.ravel(),
     'エクストラツリー': et_all.ravel(),})
print('全体 : ', sta.shape)
sta.head()

全体 :  (307511, 4)


Unnamed: 0,ロジスティック回帰,ランダムフォレスト,XGBoost,エクストラツリー
0,0.155668,0.22,0.176915,0.27
1,0.048792,0.04,0.051697,0.05
2,0.069418,0.02,0.034552,0.0
3,0.058801,0.08,0.059227,0.08
4,0.043945,0.03,0.057191,0.03


4つの予測値からxgboostにて一つの予測値を出す

In [243]:
X1 = sta.values

In [244]:
kf = KFold(n_splits=4,random_state=71)
kf.get_n_splits(X)
scores7=[]
end=[]
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X1[train_index], X1[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    xmo.fit(X_train,y_train)
    y_pre7 = xmo.predict_proba(X_test)[:,1]
    score7 = roc_auc_score(y_test,y_pre7)
    scores7.append(score7)
    end.append(y_pre7)
scores7,"mean{}".format(np.mean(scores7))



TRAIN: [ 76878  76879  76880 ... 307508 307509 307510] TEST: [    0     1     2 ... 76875 76876 76877]
TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [ 76878  76879  76880 ... 153753 153754 153755]




TRAIN: [     0      1      2 ... 307508 307509 307510] TEST: [153756 153757 153758 ... 230631 230632 230633]




TRAIN: [     0      1      2 ... 230631 230632 230633] TEST: [230634 230635 230636 ... 307508 307509 307510]






([0.6477740959758902,
  0.6483415889949594,
  0.6436278273959045,
  0.6502155792589552],
 'mean0.6474897729064274')

#### 結果

In [245]:
base = pd.DataFrame( 
    {'ロジスティック回帰クロスバリデーション（ノーマル）': np.mean(scores).ravel(),
     'ロジスティック回帰グリッドサーチ後': np.mean(scores0).ravel(),
     'ロジスティック回帰分割数変更': np.mean(scores2).ravel(),
     'ランダムフォレスト': np.mean(scores3).ravel(),
     'XGBoost': np.mean(scores4).ravel(),
     'エクストラツリー': np.mean(scores6).ravel(),
     'スタッキング': np.mean(scores5).ravel(),
     'スタッキングその2': np.mean(scores7).ravel(),
    })
print('base : ', base.shape)
base.head(5)

base :  (1, 8)


Unnamed: 0,ロジスティック回帰クロスバリデーション（ノーマル）,ロジスティック回帰グリッドサーチ後,ロジスティック回帰分割数変更,ランダムフォレスト,XGBoost,エクストラツリー,スタッキング,スタッキングその2
0,0.60582,0.60582,0.605835,0.61633,0.654423,0.601389,0.500031,0.64749


In [None]:
普通にxgboostを試した方が良い値であった。

#### 【問題5】最終的なモデルの選定
最終的にこれは良いというモデルを選び、推定した結果をKaggleに提出してスコアを確認してください。どういったアイデアを取り入れ、どの程度のスコアになったかを記載してください。

グリッドサーチや、クロスバリデーションの分割数を変えたり、アンサンブルのスタッキングという方法で推測したが、XGBoostを普通に試したモデルが一番roc評価が高くなった。それでも大きな値の変化はなく、モデルチューニングや、モデル変更ではなく前処理が一番重要だということが今回の課題を通じてわかった。


前処理でターゲットエンコーディングをしたxgboostをkaggleに提出することとした。

In [248]:
train = pd.read_csv("../../Week4/application_train.csv")
test = pd.read_csv("../../Week4/application_test.csv")
train_X = train.drop(["TARGET"],axis=1)
train_y = train["TARGET"]
test_y = pd.read_csv("../../Week4/application_test.csv")

In [249]:
ob = train_X[train_X==object].columns.values

In [250]:
for i in ob:
    data_tmp = pd.DataFrame({i:train_X[i],"target":train_y})
    target_mean = data_tmp.groupby(i)["target"].mean()
    test[i] = test[i].map(target_mean)
    
    tmp = np.repeat(np.nan,train_X.shape[0])
    
    kf = KFold(n_splits=4,shuffle=True,random_state=72)
    for idx_1,idx_2 in kf.split(train_X):
        target_mean = data_tmp.iloc[idx_1].groupby(i)["target"].mean()
        tmp[idx_2]  = train_X[i].iloc[idx_2].map(target_mean)
        
    train_X[i] = tmp

In [251]:
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=20,random_state=71)
model.fit(train_X,train_y)

pred = model.predict_proba(test)[:,1]
submission = pd.DataFrame({"SK_ID_CURR":test_y["SK_ID_CURR"],
                           "TARGET":pred})





In [252]:
submission.to_csv("sub10.csv",index=False)

#### kaggleで0.71883という点数となった。