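# Part 3 Practical Examples, Chap. 7: Binary Classification Competition

## 7.1 Overview of the Home Credit Default Risk competition

## 7.2 Analysis steps

- Build a baseline
  - Data used: one table
  - Model: LightGBM
  - Target variable: default or not (0 or 1)
  - Validation design: 5-fold cross-validation (StratifiedKFold)
  - Evaluation metric: AUC
- Feature engineering
  - Generate features using the other tables as well
  - Main feature types: hypothesis-driven features, aggregate features
- Model tuning
  - Hyperparameter tuning

## 7.3 Building the baseline

### 7.3.1 Analysis design

- Target variable: 1 = default, 0 = no default
- Model: binary classifier for default vs. no default (predictions are continuous values between 0 and 1)
- Evaluation metric: AUC (Area Under the Curve)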
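
As a reminder of what AUC measures, the sketch below computes it directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. The helper name `auc_by_pairs` and the toy labels and scores are made up for illustration; the notebook itself uses `roc_auc_score` from scikit-learn.

```python
import numpy as np


def auc_by_pairs(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 2 negatives and 2 positives
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_pairs(y_true, y_score)
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Because only the ordering of the scores matters, AUC does not require the predictions to be calibrated probabilities.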

### 7.3.2 Data preprocessing

In [5]:
# Import libraries
import numpy as np
import pandas as pd
import re
import pickle
import gc

# scikit-learn
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

# LightGBM
# !pip install lightgbm==3.2.1  # Pin the LightGBM version (to reproduce the book's results)
import lightgbm as lgb

import warnings
warnings.filterwarnings("ignore")

pd.options.display.float_format = "{:.4f}".format
pd.set_option("display.max_columns", None)

In [6]:
# Load the file and inspect the data
application_train = pd.read_csv("./home-credit-default-risk/application_train.csv")
print(application_train.shape)
application_train.head()

(307511, 122)


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.0188,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.083,0.2629,0.1394,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.0035,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,School,0.3113,0.6222,,0.0959,0.0529,0.9851,0.796,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9851,0.804,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9851,0.7987,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,reg oper account,block of flats,0.0714,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.01,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,Government,,0.5559,0.7296,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,Business Entity Type 3,,0.6504,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.0287,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,Core staff,1.0,2,2,THURSDAY,11,0,0,0,0,1,1,Religion,,0.3227,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
def reduce_mem_usage(df):
    """
    Downcast numeric columns to smaller dtypes to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print(f"Memory usage of dataframe is {start_mem:.2f} MB")

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            pass

    end_mem = df.memory_usage().sum() / 1024**2
    print(f"Memory usage after optimization is: {end_mem:.2f} MB")
    print(f"Decreased by {100 * (start_mem - end_mem) / start_mem:.1f}%")

    return df

In [8]:
# Run the memory reduction
application_train = reduce_mem_usage(application_train)

Memory usage of dataframe is 286.23 MB
Memory usage after optimization is: 92.38 MB
Decreased by 67.7%
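
One thing to keep in mind about `reduce_mem_usage` is that it picks a dtype from the value range alone, so float columns whose range fits are cast to `float16`. The sketch below, using made-up values, shows how coarse `float16` is. Whether that loss of precision matters for this model is not evaluated in these notes, so treat this as a caveat to check rather than a known problem.

```python
import numpy as np

# Largest finite value that float16 can hold
f16_max = float(np.finfo(np.float16).max)
print(f16_max)

# Small decimals are rounded to roughly three significant digits
x16 = np.float16(0.1)
err = abs(float(x16) - 0.1)
print(float(x16), err)

# Integers above 2048 can no longer all be represented exactly
big = float(np.float16(2049.0))
print(big)
```

If exact values matter for a column (for example before computing ratios), one option is to stop the downcast at `float32` for that column.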


### 7.3.3 Creating the dataset

In [9]:
# Create the dataset
x_train = application_train.drop(columns=["TARGET", "SK_ID_CURR"])
y_train = application_train["TARGET"]
id_train = application_train[["SK_ID_CURR"]]
print(x_train.shape, y_train.shape, id_train.shape)

(307511, 120) (307511,) (307511, 1)


In [10]:
# Convert categorical variables to the category dtype
for col in x_train.columns:
    if x_train[col].dtype == "object":
        x_train[col] = x_train[col].astype("category")

x_train.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 120 columns):
 #    Column                        Dtype   
---   ------                        -----   
 0    NAME_CONTRACT_TYPE            category
 1    CODE_GENDER                   category
 2    FLAG_OWN_CAR                  category
 3    FLAG_OWN_REALTY               category
 4    CNT_CHILDREN                  int8    
 5    AMT_INCOME_TOTAL              float32 
 6    AMT_CREDIT                    float32 
 7    AMT_ANNUITY                   float32 
 8    AMT_GOODS_PRICE               float32 
 9    NAME_TYPE_SUITE               category
 10   NAME_INCOME_TYPE              category
 11   NAME_EDUCATION_TYPE           category
 12   NAME_FAMILY_STATUS            category
 13   NAME_HOUSING_TYPE             category
 14   REGION_POPULATION_RELATIVE    float16 
 15   DAYS_BIRTH                    int16   
 16   DAYS_EMPLOYED                 int32   
 17   DAYS_REGISTRATION          

### 7.3.4 Validation design

- Cross-validation

In [11]:
# Check the share of 1s and the count of each class
print(f"mean: {y_train.mean():.4f}")
y_train.value_counts()

mean: 0.0807


TARGET
0    282686
1     24825
Name: count, dtype: int64

In [13]:
print(f"{24825 / 282686:.4f}")

0.0878


In [14]:
# Create the list of validation indices

# Build the list of stratified train/validation index pairs
cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=123).split(x_train, y_train))

# Check the indices: training data of fold 0
print("index(train):", cv[0][0])

# Check the indices: validation data of fold 0
print("index(valid):", cv[0][1])

index(train): [     0      1      3 ... 307508 307509 307510]
index(valid): [     2     11     22 ... 307488 307495 307497]
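
To see why a stratified split is used for an imbalanced target like this one, the following sketch checks the positive rate in each validation fold. It uses a synthetic target with 8% positives (an assumption chosen to resemble `TARGET`, not the real data) and dummy features, since only the labels influence the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced target: 80 positives out of 1000 rows (8%)
rng = np.random.default_rng(0)
y = np.array([1] * 80 + [0] * 920)
rng.shuffle(y)
X = np.zeros((len(y), 1))  # features do not affect the stratified split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
rates = [y[idx_va].mean() for _, idx_va in skf.split(X, y)]
print(rates)  # each validation fold keeps a positive rate close to 8%
```

With a plain `KFold`, the positive rate could drift between folds, which would make the per-fold AUC values harder to compare.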


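### 7.3.5 Model training

- Training flow under cross-validation
1. Per-fold processing
   1. Split into training and validation data
   2. Train the model
   3. Evaluate the model
   4. Get the predictions on the OOF data
   5. Get the feature importances
2. Model evaluation
3. Get the predictions on the OOF data (summary over all folds)
4. Get the feature importances (summary over all folds)

#### 1 Per-fold processing (1-1 to 1-5)

##### 1-1 Split into training and validation data (per fold)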

In [19]:
# Split into training and validation data

# Build the list of indices for each fold
cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=123).split(x_train, y_train))

# Get the indices of fold 0
nfold = 0
idx_tr, idx_va = cv[nfold][0], cv[nfold][1]

# Split into training and validation data
# (StratifiedKFold returns positional indices, so select rows with iloc)
x_tr, y_tr, id_tr = x_train.iloc[idx_tr, :], y_train.iloc[idx_tr], id_train.iloc[idx_tr, :]
x_va, y_va, id_va = x_train.iloc[idx_va, :], y_train.iloc[idx_va], id_train.iloc[idx_va, :]
print(x_tr.shape, y_tr.shape, id_tr.shape)
print(x_va.shape, y_va.shape, id_va.shape)

(246008, 120) (246008,) (246008, 1)
(61503, 120) (61503,) (61503, 1)


##### 1-2 Train the model (per fold)

In [20]:
# Model training
params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.05,
    "num_leaves": 32,
    "n_estimators": 100000,
    "random_state": 123,
    "importance_type": "gain",
}

# Train the model
model = lgb.LGBMClassifier(**params)
# model.fit(x_tr, y_tr, eval_set=[(x_tr, y_tr), (x_va, y_va)], early_stopping_rounds=100, verbose=100)
# # To run in an environment as of 2024/02/14, use the code below instead.
model.fit(
    x_tr,
    y_tr,
    eval_set=[(x_tr, y_tr), (x_va, y_va)],
    callbacks=[
        lgb.early_stopping(stopping_rounds=100, verbose=True),
        lgb.log_evaluation(100),
    ],
)

# Save the model
with open("model_lgb_fold0.pickle", "wb") as f:
    pickle.dump(model, f, protocol=4)

[LightGBM] [Info] Number of positive: 19860, number of negative: 226148
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.048685 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 11367
[LightGBM] [Info] Number of data points in the train set: 246008, number of used features: 116
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432482
[LightGBM] [Info] Start training from score -2.432482
Training until validation scores don't improve for 100 rounds
[100]	training's auc: 0.782506	valid_1's auc: 0.755903
[200]	training's auc: 0.808961	valid_1's auc: 0.758356
[300]	training's auc: 0.829245	valid_1's auc: 0.757774
Early stopping, best iteration is:
[217]	training's auc: 0.812578	valid_1's auc: 0.758595


##### 1-3 Evaluate the model (per fold)

In [22]:
# Model evaluation

# Predict on the training data and compute the ROC AUC
y_tr_pred = model.predict_proba(x_tr)[:, 1]
metric_tr = roc_auc_score(y_tr, y_tr_pred)

# Predict on the validation data and compute the ROC AUC
y_va_pred = model.predict_proba(x_va)[:, 1]
metric_va = roc_auc_score(y_va, y_va_pred)

# Create the list that holds the metrics (only on the first fold)
metrics = list()

# Store the metrics
metrics.append([nfold, metric_tr, metric_va])

# Show the result
print(f"[auc] tr: {metric_tr:.4f}, va: {metric_va:.4f}")

[auc] tr: 0.8126, va: 0.7586


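##### 1-4 Get the predictions on the OOF data (per fold)

- OOF (out of fold)
  - The part of the training data that was not used for fitting in a given fold (that is, the validation data)
  - Concatenating the validation predictions of every fold yields predictions for the entire training dataset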
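
The OOF idea can be sketched on toy data as follows. This is an illustration under assumptions, not the notebook's pipeline: it uses synthetic data from `make_classification` and a `LogisticRegression` in place of LightGBM, and it initialises the OOF array with NaN (the notebook uses zeros) so that any row left unfilled is easy to spot.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced binary classification data (hypothetical)
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0)

# One slot per training row; NaN marks rows that have not been predicted yet
oof = np.full(len(X), np.nan)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
for idx_tr, idx_va in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[idx_tr], y[idx_tr])
    # Each row is predicted by the one model that did not see it during fitting
    oof[idx_va] = model.predict_proba(X[idx_va])[:, 1]

n_missing = int(np.isnan(oof).sum())
oof_auc = roc_auc_score(y, oof)
print(n_missing, oof_auc)
```

After the loop every row has exactly one prediction, so a single AUC can be computed over the whole training set.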

In [25]:
# Get the predictions on the OOF data

# Create the array that holds the OOF predictions
train_oof = np.zeros(len(x_train))

# Store the predictions at the indices of the validation data
train_oof[idx_va] = y_va_pred

print(train_oof.shape)

(307511,)


##### 1-5 Get the feature importances (per fold)

In [26]:
# Get the feature importances

# Get the importances
imp_fold = pd.DataFrame({"col": x_train.columns, "imp": model.feature_importances_, "nfold": nfold})
# Check the top 10 importances
display(imp_fold.sort_values("imp", ascending=False)[:10])

# Create the DataFrame that collects the importances of all 5 folds
imp = pd.DataFrame()
# Append imp_fold to it
imp = pd.concat([imp, imp_fold])

Unnamed: 0,col,imp,nfold
41,EXT_SOURCE_3,66225.0205,0
40,EXT_SOURCE_2,52568.8338,0
38,ORGANIZATION_TYPE,20218.5235,0
39,EXT_SOURCE_1,19776.2523,0
6,AMT_CREDIT,8111.3212,0
8,AMT_GOODS_PRICE,7120.9604,0
15,DAYS_BIRTH,7042.223,0
7,AMT_ANNUITY,6992.5518,0
16,DAYS_EMPLOYED,5236.5141,0
26,OCCUPATION_TYPE,4376.6517,0


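#### 2 Model evaluation (summary over all folds)

In [27]:
# Model evaluation (summary over all folds)

# Convert the list to an array
metrics = np.array(metrics)
print(metrics)

# Mean and standard deviation of the metric on the training/validation data
print(
    f"[cv] tr:{metrics[:, 1].mean():.4f}+-{metrics[:, 1].std():.4f}, va:{metrics[:, 2].mean():.4f}+-{metrics[:, 2].std():.4f}"
)

# Metric on the OOF predictions
# (only fold 0 has been filled in at this point, so this value is not yet meaningful)
print(f"[oof] {roc_auc_score(y_train, train_oof):.4f}")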

[[0.         0.81257796 0.75859528]]
[cv] tr:0.8126+-0.0000, va:0.7586+-0.0000
[oof] 0.5103


#### 3 Get the predictions on the OOF data (summary over all folds)

In [28]:
# Get the predictions on the OOF data (summary over all folds)

train_oof = pd.concat(
    [
        id_train,
        pd.DataFrame({"true": y_train, "pred": train_oof}),
    ],
    axis=1,
)
train_oof.head()

Unnamed: 0,SK_ID_CURR,true,pred
0,100002,1,0.0
1,100003,0,0.0
2,100004,0,0.0319
3,100006,0,0.0
4,100007,0,0.0


#### 4 Get the feature importances (summary over all folds)

In [29]:
# Get the feature importances (summary over all folds)

imp = imp.groupby("col")["imp"].aggregate(["mean", "std"]).reset_index(drop=False)
imp.columns = ["col", "imp", "imp_std"]
imp.head()

Unnamed: 0,col,imp,imp_std
0,AMT_ANNUITY,6992.5518,
1,AMT_CREDIT,8111.3212,
2,AMT_GOODS_PRICE,7120.9604,
3,AMT_INCOME_TOTAL,1595.7406,
4,AMT_REQ_CREDIT_BUREAU_DAY,128.8429,


#### Defining the training function

In [30]:
# Define the training function
def train_lgb(
    input_x,
    input_y,
    input_id,
    params,
    list_nfold=[0, 1, 2, 3, 4],
    n_splits=5,
):
    train_oof = np.zeros(len(input_x))
    metrics = list()
    imp = pd.DataFrame()

    # cross-validation
    cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=123).split(input_x, input_y))
    for nfold in list_nfold:
        print("-" * 20, nfold, "-" * 20)

        # make dataset
        idx_tr, idx_va = cv[nfold][0], cv[nfold][1]
        # StratifiedKFold returns positional indices, so select rows with iloc
        x_tr, y_tr, id_tr = input_x.iloc[idx_tr, :], input_y.iloc[idx_tr], input_id.iloc[idx_tr, :]
        x_va, y_va, id_va = input_x.iloc[idx_va, :], input_y.iloc[idx_va], input_id.iloc[idx_va, :]
        print(x_tr.shape, x_va.shape)

        # train
        model = lgb.LGBMClassifier(**params)
        # model.fit(x_tr, y_tr, eval_set=[(x_tr, y_tr), (x_va, y_va)], early_stopping_rounds=100, verbose=100)
        # # To run in an environment as of 2024/02/14, use the code below instead.
        model.fit(
            x_tr,
            y_tr,
            eval_set=[(x_tr,y_tr), (x_va,y_va)],
            callbacks=[
                lgb.early_stopping(stopping_rounds=100, verbose=True),
                lgb.log_evaluation(100),
            ],
        )

        # saving the model
        fname_lgb = "model_lgb_fold{}.pickle".format(nfold)
        with open(fname_lgb, "wb") as f:
            pickle.dump(model, f, protocol=4)

        # evaluate
        y_tr_pred = model.predict_proba(x_tr)[:, 1]
        y_va_pred = model.predict_proba(x_va)[:, 1]
        metric_tr = roc_auc_score(y_tr, y_tr_pred)
        metric_va = roc_auc_score(y_va, y_va_pred)
        metrics.append([nfold, metric_tr, metric_va])
        print(f"[auc] tr:{metric_tr:.4f}, va:{metric_va:.4f}")

        # oof
        train_oof[idx_va] = y_va_pred

        # imp
        _imp = pd.DataFrame({"col": input_x.columns, "imp": model.feature_importances_, "nfold": nfold})
        imp = pd.concat([imp, _imp])

    print("-" * 20, "result", "-" * 20)

    # metric
    metrics = np.array(metrics)
    print(metrics)
    print(
        f"[cv] tr:{metrics[:, 1].mean():.4f}+-{metrics[:, 1].std():.4f}, va:{metrics[:, 2].mean():.4f}+-{metrics[:, 2].std():.4f}"
    )
    print(f"[oof] {roc_auc_score(input_y, train_oof):.4f}")

    # oof
    train_oof = pd.concat([input_id, pd.DataFrame({"pred": train_oof})], axis=1)

    # importance
    imp = imp.groupby("col")["imp"].aggregate(["mean", "std"]).reset_index(drop=False)
    imp.columns = ["col", "imp", "imp_std"]

    return train_oof, imp, metrics

#### Running the training

In [31]:
# Run the training
# Set the hyperparameters
params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.05,
    "num_leaves": 32,
    "n_estimators": 100000,
    "random_state": 123,
    "importance_type": "gain",
}

# Run the training
train_oof, imp, metrics = train_lgb(
    x_train,
    y_train,
    id_train,
    params,
    list_nfold=[0, 1, 2, 3, 4],
    n_splits=5,
)

-------------------- 0 --------------------
(246008, 120) (61503, 120)
[LightGBM] [Info] Number of positive: 19860, number of negative: 226148
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.050502 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 11367
[LightGBM] [Info] Number of data points in the train set: 246008, number of used features: 116
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432482
[LightGBM] [Info] Start training from score -2.432482
Training until validation scores don't improve for 100 rounds
[100]	training's auc: 0.782506	valid_1's auc: 0.755903
[200]	training's auc: 0.808961	valid_1's auc: 0.758356
[300]	training's auc: 0.829245	valid_1's auc: 0.757774
Early stopping, best iteration is:
[217]	training's auc: 0.812578	valid_1's auc: 0.758595
[auc] tr:0.8126, va:0.7586
-------------------- 1 

#### Checking the feature importances (top 10)

In [32]:
# Check the feature importances (top 10)
imp.sort_values("imp", ascending=False)[:10]

Unnamed: 0,col,imp,imp_std
38,EXT_SOURCE_3,65353.9075,1558.2012
37,EXT_SOURCE_2,54545.3883,1251.7989
102,ORGANIZATION_TYPE,21441.9175,1450.2462
36,EXT_SOURCE_1,20051.9342,685.8522
1,AMT_CREDIT,8263.2287,410.3844
22,DAYS_BIRTH,7645.5891,689.4588
2,AMT_GOODS_PRICE,7263.0546,405.837
0,AMT_ANNUITY,6762.9536,479.302
23,DAYS_EMPLOYED,5810.2884,552.9377
101,OCCUPATION_TYPE,5502.6759,831.8724


### 7.3.6 Model inference

#### Creating the inference dataset

In [33]:
# Create the inference dataset
# Load the file
application_test = pd.read_csv("./home-credit-default-risk/application_test.csv")
application_test = reduce_mem_usage(application_test)

# Build the dataset
x_test = application_test.drop(columns=["SK_ID_CURR"])
id_test = application_test[["SK_ID_CURR"]]

# Convert categorical variables to the category dtype
for col in x_test.columns:
    if x_test[col].dtype == "object":
        x_test[col] = x_test[col].astype("category")

print(x_test.shape)
x_test.info(verbose=True)

Memory usage of dataframe is 45.00 MB
Memory usage after optimization is: 14.60 MB
Decreased by 67.6%
(48744, 120)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Data columns (total 120 columns):
 #    Column                        Dtype   
---   ------                        -----   
 0    NAME_CONTRACT_TYPE            category
 1    CODE_GENDER                   category
 2    FLAG_OWN_CAR                  category
 3    FLAG_OWN_REALTY               category
 4    CNT_CHILDREN                  int8    
 5    AMT_INCOME_TOTAL              float32 
 6    AMT_CREDIT                    float32 
 7    AMT_ANNUITY                   float32 
 8    AMT_GOODS_PRICE               float32 
 9    NAME_TYPE_SUITE               category
 10   NAME_INCOME_TYPE              category
 11   NAME_EDUCATION_TYPE           category
 12   NAME_FAMILY_STATUS            category
 13   NAME_HOUSING_TYPE             category
 14   REGION_POPULATION_RELATIVE    float16 
 15   DAYS

#### Inference flow with cross-validation

1. Per-fold processing
   1. Load the trained model
   2. Run inference with the model
2. Collect the predictions (summary over all folds)

#### Per-fold processing

##### 1-1 Load the trained model (per fold)

In [34]:
# 1-1 Load the trained model (per fold)
with open("model_lgb_fold0.pickle", "rb") as f:
    model = pickle.load(f)

##### 1-2 Run inference with the model (per fold)

In [35]:
# 1-2 Run inference with the model (per fold)
# Inference
test_pred_fold = model.predict_proba(x_test)[:, 1]

# Create the array that holds the predictions
test_pred = np.zeros((len(x_test), 5))

# Store the predictions of the first fold (fold 0)
test_pred[:, 0] = test_pred_fold

#### 2 Collect the predictions

In [36]:
# Compute the predictions for the inference dataset

# Average the predictions of all folds
test_pred_mean = test_pred.mean(axis=1)

# Build the prediction DataFrame
df_test_pred = pd.concat(
    [
        id_test,
        pd.DataFrame({"pred": test_pred_mean}),
    ],
    axis=1,
)
print(df_test_pred.shape)
df_test_pred.head()

(48744, 2)


Unnamed: 0,SK_ID_CURR,pred
0,100001,0.0066
1,100005,0.0239
2,100013,0.0042
3,100028,0.009
4,100038,0.0308
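
Note that in this step-by-step walkthrough only column 0 of `test_pred` has been filled, while the row-wise mean is taken over all five columns, so the values above are one fifth of the fold-0 predictions. The sketch below reproduces that effect with made-up numbers; `fold0_pred` is a hypothetical stand-in for `test_pred_fold`.

```python
import numpy as np

# Hypothetical fold-0 predictions for three test rows
fold0_pred = np.array([0.10, 0.50, 0.90])

# Same layout as in the walkthrough: 5 columns, only column 0 filled
test_pred = np.zeros((len(fold0_pred), 5))
test_pred[:, 0] = fold0_pred

# Averaging over all 5 columns shrinks the values by a factor of 5
mean_all = test_pred.mean(axis=1)
print(mean_all)

# Averaging only over the filled column recovers the fold-0 predictions
mean_filled = test_pred[:, :1].mean(axis=1)
print(mean_filled)
```

Once all five folds are filled, the plain row-wise mean is the intended fold average.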


#### Defining the inference function

In [37]:
# Define the inference function
def predict_lgb(
    input_x,
    input_id,
    list_nfold=[0, 1, 2, 3, 4],
):
    pred = np.zeros((len(input_x), len(list_nfold)))
    # Index the columns by position so that a subset such as list_nfold=[2, 3] also works
    for i, nfold in enumerate(list_nfold):

        print("-" * 20, nfold, "-" * 20)

        fname_lgb = f"model_lgb_fold{nfold}.pickle"
        with open(fname_lgb, "rb") as f:
            model = pickle.load(f)
        pred[:, i] = model.predict_proba(input_x)[:, 1]

    pred = pd.concat(
        [
            input_id,
            pd.DataFrame({"pred": pred.mean(axis=1)}),
        ],
        axis=1,
    )

    print("Done.")

    return pred

#### Running the inference

In [38]:
# Run the inference

test_pred = predict_lgb(
    x_test,
    id_test,
    list_nfold=[0, 1, 2, 3, 4]
)

-------------------- 0 --------------------
-------------------- 1 --------------------
-------------------- 2 --------------------
-------------------- 3 --------------------
-------------------- 4 --------------------
Done.


#### Creating the submission file

In [39]:
# Create the submission file
df_submit = test_pred.rename(columns={"pred": "TARGET"})
print(df_submit.shape)
display(df_submit.head())

# Write the file
df_submit.to_csv("submission_baseline.csv", index=False)

(48744, 2)


Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.0418
1,100005,0.1264
2,100013,0.0225
3,100028,0.0397
4,100038,0.1566


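## 7.4 Feature engineering

- Generate features
- Train and evaluate the model
- Decide whether to keep each feature


### 7.4.1 Feature engineering

- application_train.csv

#### Checking and handling anomalous values

- `DAYS_EMPLOYED` (days employed)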

In [40]:
# Check the data
display(application_train["DAYS_EMPLOYED"].value_counts())
print(f"Share of positive values: {(application_train['DAYS_EMPLOYED'] > 0).mean():.4f}")
print(f"Number of positive values: {(application_train['DAYS_EMPLOYED'] > 0).sum()}")
# -> About 18% of the values are positive, and every one of them is the same value, 365243.
#    The column records the days since employment started as negative numbers, so 365243 is treated as missing.

DAYS_EMPLOYED
 365243    55374
-200         156
-224         152
-230         151
-199         151
           ...  
-13961         1
-11827         1
-10176         1
-9459          1
-8694          1
Name: count, Length: 12574, dtype: int64

Share of positive values: 0.1801
Number of positive values: 55374


In [43]:
# Handle the anomalous value (replace it with NaN)
# 365243 -> NaN
application_train["DAYS_EMPLOYED"] = application_train["DAYS_EMPLOYED"].replace(365243, np.nan)

In [44]:
application_train["DAYS_EMPLOYED"].isna().sum()

55374

#### Feature generation (p. 246)

In [45]:
# Generate hypothesis-driven features
# Feature 1: total income divided by the number of family members
application_train["INCOME_div_PERSON"] = application_train["AMT_INCOME_TOTAL"] / application_train["CNT_FAM_MEMBERS"]

# Feature 2: total income divided by the days employed
application_train["INCOME_div_EMPLOYED"] = application_train["AMT_INCOME_TOTAL"] / application_train["DAYS_EMPLOYED"]

# Feature 3: mean and other statistics of the external scores
application_train["EXT_SOURCE_mean"] = application_train[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].mean(axis=1)
application_train["EXT_SOURCE_max"] = application_train[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].max(axis=1)
application_train["EXT_SOURCE_min"] = application_train[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].min(axis=1)
application_train["EXT_SOURCE_std"] = application_train[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].std(axis=1)
application_train["EXT_SOURCE_count"] = (
    application_train[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].notnull().sum(axis=1)
)

# Feature 4: days employed divided by age in days (share of the applicant's life spent employed)
application_train["DAYS_EMPLOYED_div_BIRTH"] = application_train["DAYS_EMPLOYED"] / application_train["DAYS_BIRTH"]

# Feature 5: annuity payment (AMT_ANNUITY) divided by income
application_train["ANNUITY_div_INCOME"] = application_train["AMT_ANNUITY"] / application_train["AMT_INCOME_TOTAL"]

# Feature 6: annuity payment (AMT_ANNUITY) divided by the credit amount
application_train["ANNUITY_div_CREDIT"] = application_train["AMT_ANNUITY"] / application_train["AMT_CREDIT"]

In [46]:
application_train.shape

(307511, 132)
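
Since the `EXT_SOURCE_*` columns contain many missing values, it helps to know how the row-wise statistics above behave with NaN. The sketch below uses a small made-up frame: pandas skips NaN by default, so the mean is taken over the available scores, while the sample standard deviation is NaN when fewer than two scores are present.

```python
import numpy as np
import pandas as pd

# Hypothetical external scores: row 0 complete, row 1 has one score, row 2 has none
ext = pd.DataFrame(
    {
        "EXT_SOURCE_1": [0.2, np.nan, np.nan],
        "EXT_SOURCE_2": [0.4, 0.6, np.nan],
        "EXT_SOURCE_3": [0.6, np.nan, np.nan],
    }
)

stats = pd.DataFrame(
    {
        "mean": ext.mean(axis=1),               # NaN values are skipped
        "std": ext.std(axis=1),                 # sample std (ddof=1); NaN with fewer than 2 values
        "count": ext.notnull().sum(axis=1),     # number of available scores
    }
)
print(stats)
```

LightGBM handles the resulting NaN values natively, which is presumably why the notebook leaves them in place; the count feature lets the model tell "one score" apart from "three scores".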

In [49]:
# Create the training dataset
x_train = application_train.drop(columns=["TARGET", "SK_ID_CURR"])
y_train = application_train["TARGET"]
id_train = application_train[["SK_ID_CURR"]]

for col in x_train.columns:
    if x_train[col].dtype == "object":
        x_train[col] = x_train[col].astype("category")

In [50]:
# Model training
train_oof, imp, metrics = train_lgb(
    x_train,
    y_train,
    id_train,
    params,
    list_nfold=[0, 1, 2, 3, 4],
    n_splits=5,
)

-------------------- 0 --------------------
(246008, 130) (61503, 130)
[LightGBM] [Info] Number of positive: 19860, number of negative: 226148
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.057827 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 13680
[LightGBM] [Info] Number of data points in the train set: 246008, number of used features: 126
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432482
[LightGBM] [Info] Start training from score -2.432482
Training until validation scores don't improve for 100 rounds
[100]	training's auc: 0.787817	valid_1's auc: 0.760032
[200]	training's auc: 0.816788	valid_1's auc: 0.763696
[300]	training's auc: 0.838351	valid_1's auc: 0.764008
[400]	training's auc: 0.856611	valid_1's auc: 0.764045
[500]	training's auc: 0.871304	valid_1's auc: 0.764075
Early stopping, best iteration

In [51]:
# Check the feature importances
imp.sort_values("imp", ascending=False)[:10]

Unnamed: 0,col,imp,imp_std
44,EXT_SOURCE_mean,114005.2147,1381.6456
10,ANNUITY_div_CREDIT,23720.3016,805.3975
112,ORGANIZATION_TYPE,22660.2106,1372.2304
41,EXT_SOURCE_3,12046.8546,886.6537
24,DAYS_BIRTH,8108.6841,578.9724
45,EXT_SOURCE_min,7727.3916,314.2032
39,EXT_SOURCE_1,7155.6192,472.4225
2,AMT_GOODS_PRICE,6148.1679,364.159
0,AMT_ANNUITY,6091.8052,581.9879
46,EXT_SOURCE_std,5830.3907,679.9639


In [52]:
# Create the inference dataset
# Replace the anomalous value with NaN
application_test["DAYS_EMPLOYED"] = application_test["DAYS_EMPLOYED"].replace(365243, np.nan)

# Generate the features
application_test["INCOME_div_PERSON"] = application_test["AMT_INCOME_TOTAL"] / application_test["CNT_FAM_MEMBERS"]
application_test["INCOME_div_EMPLOYED"] = application_test["AMT_INCOME_TOTAL"] / application_test["DAYS_EMPLOYED"]
application_test["EXT_SOURCE_mean"] = application_test[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].mean(axis=1)
application_test["EXT_SOURCE_max"] = application_test[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].max(axis=1)
application_test["EXT_SOURCE_min"] = application_test[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].min(axis=1)
application_test["EXT_SOURCE_std"] = application_test[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].std(axis=1)
application_test["EXT_SOURCE_count"] = (
    application_test[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].notnull().sum(axis=1)
)
application_test["DAYS_EMPLOYED_div_BIRTH"] = application_test["DAYS_EMPLOYED"] / application_test["DAYS_BIRTH"]
application_test["ANNUITY_div_INCOME"] = application_test["AMT_ANNUITY"] / application_test["AMT_INCOME_TOTAL"]
application_test["ANNUITY_div_CREDIT"] = application_test["AMT_ANNUITY"] / application_test["AMT_CREDIT"]

# Build the dataset
x_test = application_test.drop(columns=["SK_ID_CURR"])
id_test = application_test[["SK_ID_CURR"]]

# Convert categorical variables to the category dtype
for col in x_test.columns:
    if x_test[col].dtype == "object":
        x_test[col] = x_test[col].astype("category")

In [53]:
# Run the inference
test_pred = predict_lgb(
    x_test,
    id_test,
    list_nfold=[0, 1, 2, 3, 4],
)

-------------------- 0 --------------------
-------------------- 1 --------------------
-------------------- 2 --------------------
-------------------- 3 --------------------
-------------------- 4 --------------------
Done.


In [54]:
# Create the submission file
df_submit = test_pred.rename(columns={"pred": "TARGET"})
print(df_submit.shape)
display(df_submit.head())
df_submit.to_csv("submission_FeatureEngineering1.csv", index=False)

(48744, 2)


Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.029
1,100005,0.1218
2,100013,0.0227
3,100028,0.0444
4,100038,0.1819


### 7.4.2 Feature engineering

- POS_CASH_balance.csv

In [55]:
# Load the file
pos = pd.read_csv("./home-credit-default-risk/POS_CASH_balance.csv")
pos = reduce_mem_usage(pos)
print(pos.shape)
pos.head()

Memory usage of dataframe is 610.43 MB
Memory usage after optimization is: 238.45 MB
Decreased by 60.9%
(10001358, 8)


Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,1803195,182943,-31,48.0,45.0,Active,0,0
1,1715348,367990,-33,36.0,35.0,Active,0,0
2,1784872,397406,-32,12.0,9.0,Active,0,0
3,1903291,269225,-35,48.0,42.0,Active,0,0
4,2341044,334279,-35,36.0,35.0,Active,0,0


#### Aggregating categorical variables

1. Convert the categorical variable to numbers with one-hot encoding
2. Aggregate with `SK_ID_CURR` as the grouping key
3. Join to the `application_train` table with `SK_ID_CURR` as the join key
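
The three steps can be sketched on a tiny made-up table before applying them to the full data. The frames `pos_toy` and `app_toy` and their values are hypothetical; the point is that the mean of a one-hot column per `SK_ID_CURR` is the share of that customer's records with the given status, and that a left join leaves NaN for customers with no records.

```python
import pandas as pd

# Hypothetical history table: several rows per customer
pos_toy = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 1, 2, 2],
        "SK_DPD": [0, 3, 0, 0, 0],
        "NAME_CONTRACT_STATUS": ["Active", "Active", "Completed", "Active", "Signed"],
    }
)
# Hypothetical main table: customer 3 has no history rows
app_toy = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

# 1. One-hot encode the categorical column
ohe = pd.get_dummies(pos_toy, columns=["NAME_CONTRACT_STATUS"], dtype=float)

# 2. Aggregate per customer
agg = ohe.groupby("SK_ID_CURR").agg(
    {
        "SK_DPD": ["mean", "max"],
        "NAME_CONTRACT_STATUS_Active": ["mean"],
        "NAME_CONTRACT_STATUS_Completed": ["mean"],
        "NAME_CONTRACT_STATUS_Signed": ["mean"],
    }
)
agg.columns = [f"{col}_{stat}" for col, stat in agg.columns]
agg = agg.reset_index()

# 3. Left-join onto the main table
merged = app_toy.merge(agg, on="SK_ID_CURR", how="left")
print(merged)
```

Customer 1 has two of three records with status Active, so `NAME_CONTRACT_STATUS_Active_mean` is about 0.667, and customer 3 gets NaN in every aggregated column.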

In [56]:
# 1 Convert the categorical variable to numbers with one-hot encoding
pos_ohe = pd.get_dummies(pos, columns=["NAME_CONTRACT_STATUS"], dummy_na=True)
col_ohe = sorted(list(set(pos_ohe.columns) - set(pos.columns)))
print(len(col_ohe))
col_ohe

10


['NAME_CONTRACT_STATUS_Active',
 'NAME_CONTRACT_STATUS_Amortized debt',
 'NAME_CONTRACT_STATUS_Approved',
 'NAME_CONTRACT_STATUS_Canceled',
 'NAME_CONTRACT_STATUS_Completed',
 'NAME_CONTRACT_STATUS_Demand',
 'NAME_CONTRACT_STATUS_Returned to the store',
 'NAME_CONTRACT_STATUS_Signed',
 'NAME_CONTRACT_STATUS_XNA',
 'NAME_CONTRACT_STATUS_nan']

In [57]:
# 2 Aggregate with SK_ID_CURR as the key
pos_ohe_agg = pos_ohe.groupby("SK_ID_CURR").aggregate(
    {
        # Aggregate the numeric columns
        "MONTHS_BALANCE": ["mean", "std", "min", "max"],
        "CNT_INSTALMENT": ["mean", "std", "min", "max"],
        "CNT_INSTALMENT_FUTURE": ["mean", "std", "min", "max"],
        "SK_DPD": ["mean", "std", "min", "max"],
        "SK_DPD_DEF": ["mean", "std", "min", "max"],
        # Aggregate the one-hot-encoded categorical columns
        "NAME_CONTRACT_STATUS_Active": ["mean"],
        "NAME_CONTRACT_STATUS_Amortized debt": ["mean"],
        "NAME_CONTRACT_STATUS_Approved": ["mean"],
        "NAME_CONTRACT_STATUS_Canceled": ["mean"],
        "NAME_CONTRACT_STATUS_Completed": ["mean"],
        "NAME_CONTRACT_STATUS_Demand": ["mean"],
        "NAME_CONTRACT_STATUS_Returned to the store": ["mean"],
        "NAME_CONTRACT_STATUS_Signed": ["mean"],
        "NAME_CONTRACT_STATUS_XNA": ["mean"],
        "NAME_CONTRACT_STATUS_nan": ["mean"],
        # Count the unique IDs (and the number of records while we are at it)
        "SK_ID_PREV": ["count", "nunique"],
    }
)

# Flatten the column names
pos_ohe_agg.columns = [i + "_" + j for i, j in pos_ohe_agg.columns]
pos_ohe_agg = pos_ohe_agg.reset_index(drop=False)

print(pos_ohe_agg.shape)
pos_ohe_agg.head()

(337252, 33)


Unnamed: 0,SK_ID_CURR,MONTHS_BALANCE_mean,MONTHS_BALANCE_std,MONTHS_BALANCE_min,MONTHS_BALANCE_max,CNT_INSTALMENT_mean,CNT_INSTALMENT_std,CNT_INSTALMENT_min,CNT_INSTALMENT_max,CNT_INSTALMENT_FUTURE_mean,CNT_INSTALMENT_FUTURE_std,CNT_INSTALMENT_FUTURE_min,CNT_INSTALMENT_FUTURE_max,SK_DPD_mean,SK_DPD_std,SK_DPD_min,SK_DPD_max,SK_DPD_DEF_mean,SK_DPD_DEF_std,SK_DPD_DEF_min,SK_DPD_DEF_max,NAME_CONTRACT_STATUS_Active_mean,NAME_CONTRACT_STATUS_Amortized debt_mean,NAME_CONTRACT_STATUS_Approved_mean,NAME_CONTRACT_STATUS_Canceled_mean,NAME_CONTRACT_STATUS_Completed_mean,NAME_CONTRACT_STATUS_Demand_mean,NAME_CONTRACT_STATUS_Returned to the store_mean,NAME_CONTRACT_STATUS_Signed_mean,NAME_CONTRACT_STATUS_XNA_mean,NAME_CONTRACT_STATUS_nan_mean,SK_ID_PREV_count,SK_ID_PREV_nunique
0,100001,-72.5556,20.8633,-96,-53,4.0,0.0,4.0,4.0,1.4444,1.424,0.0,4.0,0.7778,2.3333,0,7,0.7778,2.3333,0,7,0.7778,0.0,0.0,0.0,0.2222,0.0,0.0,0.0,0.0,0.0,9,2
1,100002,-10.0,5.6273,-19,-1,24.0,0.0,24.0,24.0,15.0,5.6273,6.0,24.0,0.0,0.0,0,0,0.0,0.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19,1
2,100003,-43.7857,24.6402,-77,-18,10.1071,2.8066,6.0,12.0,5.7857,3.8428,0.0,12.0,0.0,0.0,0,0,0.0,0.0,0,0,0.9286,0.0,0.0,0.0,0.0714,0.0,0.0,0.0,0.0,0.0,28,3
3,100004,-25.5,1.291,-27,-24,3.75,0.5,3.0,4.0,2.25,1.7078,0.0,4.0,0.0,0.0,0,0,0.0,0.0,0,0,0.75,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,4,1
4,100005,-20.0,3.3166,-25,-15,11.7,0.9487,9.0,12.0,7.2,3.6148,0.0,12.0,0.0,0.0,0,0,0.0,0.0,0,0,0.8182,0.0,0.0,0.0,0.0909,0.0,0.0,0.0909,0.0,0.0,11,1
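Passing a dict of lists to `aggregate` produces two-level column labels, which is why the cell above joins each pair with `"_"`. A minimal sketch on a made-up frame (`toy` and its values are hypothetical) shows the labels before and after flattening:

```python
import pandas as pd

# Hypothetical records: two for customer 1, one for customer 2.
toy = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "SK_DPD": [0, 7, 3]})

toy_agg = toy.groupby("SK_ID_CURR").agg({"SK_DPD": ["mean", "max"]})

# Each column label is a (source column, statistic) tuple at this point.
labels_before = toy_agg.columns.tolist()
print(labels_before)

# Join the two levels into one string, then turn the group key back into a column.
toy_agg.columns = [f"{col}_{stat}" for col, stat in toy_agg.columns]
toy_agg = toy_agg.reset_index()
print(toy_agg.columns.tolist())
```

Flat string names are easier to pass to a merge and to read back in a feature-importance table.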


In [58]:
# 3. Join on SK_ID_CURR
df_train = pd.merge(application_train, pos_ohe_agg, on="SK_ID_CURR", how="left")
print(df_train.shape)
df_train.head()

(307511, 164)


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,INCOME_div_PERSON,INCOME_div_EMPLOYED,EXT_SOURCE_mean,EXT_SOURCE_max,EXT_SOURCE_min,EXT_SOURCE_std,EXT_SOURCE_count,DAYS_EMPLOYED_div_BIRTH,ANNUITY_div_INCOME,ANNUITY_div_CREDIT,MONTHS_BALANCE_mean,MONTHS_BALANCE_std,MONTHS_BALANCE_min,MONTHS_BALANCE_max,CNT_INSTALMENT_mean,CNT_INSTALMENT_std,CNT_INSTALMENT_min,CNT_INSTALMENT_max,CNT_INSTALMENT_FUTURE_mean,CNT_INSTALMENT_FUTURE_std,CNT_INSTALMENT_FUTURE_min,CNT_INSTALMENT_FUTURE_max,SK_DPD_mean,SK_DPD_std,SK_DPD_min,SK_DPD_max,SK_DPD_DEF_mean,SK_DPD_DEF_std,SK_DPD_DEF_min,SK_DPD_DEF_max,NAME_CONTRACT_STATUS_Active_mean,NAME_CONTRACT_STATUS_Amortized debt_mean,NAME_CONTRACT_STATUS_Approved_mean,NAME_CONTRACT_STATUS_Canceled_mean,NAME_CONTRACT_STATUS_Completed_mean,NAME_CONTRACT_STATUS_Demand_mean,NAME_CONTRACT_STATUS_Returned to the store_mean,NAME_CONTRACT_STATUS_Signed_mean,NAME_CONTRACT_STATUS_XNA_mean,NAME_CONTRACT_STATUS_nan_mean,SK_ID_PREV_count,SK_ID_PREV_nunique
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.0188,-9461,-637.0,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.083,0.2629,0.1394,0.0247,0.0369,0.9722,0.6191,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6343,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6245,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,202500.0,-317.8964,0.1617,0.2629,0.083,0.092,3,0.0673,0.122,0.0607,-10.0,5.6273,-19.0,-1.0,24.0,0.0,24.0,24.0,15.0,5.6273,6.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.0035,-16765,-1188.0,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,School,0.3113,0.6221,,0.0959,0.0529,0.9849,0.7959,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9849,0.8042,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9849,0.7988,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,reg oper account,block of flats,0.0714,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,135000.0,-227.2727,0.4668,0.6221,0.3113,0.2198,2,0.0709,0.1322,0.0276,-43.7857,24.6402,-77.0,-18.0,10.1071,2.8066,6.0,12.0,5.7857,3.8428,0.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9286,0.0,0.0,0.0,0.0714,0.0,0.0,0.0,0.0,0.0,28.0,3.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.01,-19046,-225.0,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,Government,,0.5562,0.7295,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,67500.0,-300.0,0.6426,0.7295,0.5562,0.1226,2,0.0118,0.1,0.05,-25.5,1.291,-27.0,-24.0,3.75,0.5,3.0,4.0,2.25,1.7078,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.75,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,4.0,1.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008,-19005,-3039.0,-9832.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,Business Entity Type 3,,0.6504,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,67500.0,-44.4225,0.6504,0.6504,0.6504,,1,0.1599,0.2199,0.0949,-9.619,6.0785,-20.0,-1.0,12.0,9.2793,1.0,48.0,8.65,10.1633,0.0,48.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8571,0.0,0.0,0.0,0.0952,0.0,0.0476,0.0,0.0,0.0,21.0,3.0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.0287,-19932,-3038.0,-4312.0,-3458,,1,1,0,1,0,0,Core staff,1.0,2,2,THURSDAY,11,0,0,0,0,1,1,Religion,,0.3228,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,121500.0,-39.9934,0.3228,0.3228,0.3228,,1,0.1524,0.18,0.0426,-33.6364,22.5891,-77.0,-1.0,15.3333,4.8843,10.0,24.0,8.9697,6.3123,0.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9394,0.0,0.0,0.0,0.0455,0.0,0.0,0.0152,0.0,0.0,66.0,5.0
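In the joined output, integer aggregates such as `MONTHS_BALANCE_min` and `SK_ID_PREV_count` appear as floats (for example `-19.0` and `19.0`). This is presumably because a left join leaves `NaN` for applicants with no matching row in the aggregated table, which forces a float dtype. The sketch below reproduces the effect on made-up frames; `app` and `agg` are hypothetical stand-ins, and `validate="one_to_one"` is an optional check added here to confirm the join cannot duplicate rows.

```python
import pandas as pd

# Hypothetical application table: three applicants.
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})

# Hypothetical aggregated table: applicant 2 has no history.
agg = pd.DataFrame({"SK_ID_CURR": [1, 3], "SK_ID_PREV_count": [19, 4]})

# Left join keeps every applicant; validate raises if a key is duplicated on either side.
merged = pd.merge(app, agg, on="SK_ID_CURR", how="left", validate="one_to_one")

print(merged)
print(merged.dtypes)
```

The row count stays at three, and the missing aggregate for applicant 2 can be left as `NaN`, since LightGBM handles missing values natively.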


In [59]:
# Build the training dataset
x_train = df_train.drop(columns=["TARGET", "SK_ID_CURR"])
y_train = df_train["TARGET"]
id_train = df_train[["SK_ID_CURR"]]

for col in x_train.columns:
    if x_train[col].dtype == "object":
        x_train[col] = x_train[col].astype("category")

print(x_train.shape)

(307511, 162)


In [60]:
# Train the model
train_oof, imp, metrics = train_lgb(
    x_train,
    y_train,
    id_train,
    params,
    list_nfold=[0, 1, 2, 3, 4],
    n_splits=5,
)

-------------------- 0 --------------------
(246008, 162) (61503, 162)
[LightGBM] [Info] Number of positive: 19860, number of negative: 226148
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.056284 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 18345
[LightGBM] [Info] Number of data points in the train set: 246008, number of used features: 158
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432482
[LightGBM] [Info] Start training from score -2.432482
Training until validation scores don't improve for 100 rounds
[100]	training's auc: 0.794833	valid_1's auc: 0.7663
[200]	training's auc: 0.825601	valid_1's auc: 0.771197
[300]	training's auc: 0.848487	valid_1's auc: 0.771543
[400]	training's auc: 0.866733	valid_1's auc: 0.771473
Early stopping, best iteration is:
[384]	training's auc: 0.864221	valid_1's auc: 0.7717
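The log above reports iteration 384 as the best one even though training continued past it: with a patience of 100 rounds, boosting stops only after 100 consecutive rounds without a better validation score, and the best round is then reported. The function below is a simplified sketch of that rule, not LightGBM's implementation; the function name and the score list are made up for illustration.

```python
def best_iteration_with_patience(valid_scores, patience):
    """Return (best_index, best_score, stopped_at) for a higher-is-better metric."""
    best_idx, best_score = 0, float("-inf")
    for i, score in enumerate(valid_scores):
        if score > best_score:
            # New best validation score: remember where it happened.
            best_idx, best_score = i, score
        elif i - best_idx >= patience:
            # No improvement for `patience` rounds: stop and report the best round.
            return best_idx, best_score, i
    # Ran out of rounds before the patience was exhausted.
    return best_idx, best_score, len(valid_scores) - 1


# Hypothetical validation AUC per round.
scores = [0.70, 0.74, 0.76, 0.77, 0.769, 0.768, 0.7695, 0.765]
result = best_iteration_with_patience(scores, patience=3)
print(result)
```

With these made-up scores the peak is at index 3, and the loop stops three rounds later, at index 6, without ever looking at the last score.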

In [61]:
# Check feature importance (top 10)
imp.sort_values("imp", ascending=False)[:10]

Unnamed: 0,col,imp,imp_std
52,EXT_SOURCE_mean,112560.3515,859.5278
134,ORGANIZATION_TYPE,21677.7985,2271.1035
10,ANNUITY_div_CREDIT,18437.1357,669.4396
49,EXT_SOURCE_3,10476.6146,911.0729
53,EXT_SOURCE_min,7080.9135,760.2313
32,DAYS_BIRTH,6742.9137,1121.8684
47,EXT_SOURCE_1,6468.1095,796.9921
21,CNT_INSTALMENT_FUTURE_mean,6197.1364,784.9495
108,MONTHS_BALANCE_std,5543.389,596.0482
0,AMT_ANNUITY,5530.288,555.477
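The table has an `imp` column and an `imp_std` column per feature. One plausible way to obtain such a table is to collect the importance of each feature in every fold and then take the mean and standard deviation across folds. The sketch below assumes that layout; `fold_imps` and its values are hypothetical, and it is not necessarily how `train_lgb` builds `imp`.

```python
import pandas as pd

# Hypothetical per-fold importances in long format (two folds, two features).
fold_imps = pd.DataFrame(
    {
        "nfold": [0, 0, 1, 1],
        "col": ["EXT_SOURCE_mean", "AMT_ANNUITY", "EXT_SOURCE_mean", "AMT_ANNUITY"],
        "imp": [100.0, 10.0, 120.0, 14.0],
    }
)

# Mean and standard deviation of the importance across folds, per feature.
imp_toy = fold_imps.groupby("col")["imp"].agg(["mean", "std"]).reset_index()
imp_toy.columns = ["col", "imp", "imp_std"]

top = imp_toy.sort_values("imp", ascending=False)
print(top)
```

A large `imp_std` relative to `imp` would suggest that a feature's contribution varies a lot between folds.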


In [62]:
# Build the inference dataset
# Join the tables
df_test = pd.merge(application_test, pos_ohe_agg, on="SK_ID_CURR", how="left")

# Build the dataset
x_test = df_test.drop(columns=["SK_ID_CURR"])
id_test = df_test[["SK_ID_CURR"]]

# Convert categorical variables to the category dtype
for col in x_test.columns:
    if x_test[col].dtype == "object":
        x_test[col] = x_test[col].astype("category")

print(x_test.shape, id_test.shape)
print(x_test.info())

(48744, 162) (48744, 1)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 162 entries, NAME_CONTRACT_TYPE to SK_ID_PREV_nunique
dtypes: category(16), float16(69), float32(9), float64(29), int16(2), int64(1), int8(36)
memory usage: 21.9 MB
None


In [63]:
# Run inference on the inference dataset
test_pred = predict_lgb(
    x_test,
    id_test,
    list_nfold=[0, 1, 2, 3, 4],
)

-------------------- 0 --------------------
-------------------- 1 --------------------
-------------------- 2 --------------------
-------------------- 3 --------------------
-------------------- 4 --------------------
Done.


In [64]:
# Create the submission file
df_submit = test_pred.rename(columns={"pred": "TARGET"})
print(df_submit.shape)
display(df_submit.head())
df_submit.to_csv("submission_FeatureEngineering2.csv", index=False)

(48744, 2)


Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.0328
1,100005,0.1186
2,100013,0.0318
3,100028,0.0485
4,100038,0.2123
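Before uploading, it can help to check the submission frame for basic problems. The helper below is a hypothetical sketch (the function name and the specific checks are assumptions, not part of the original notebook); it is exercised on the five rows displayed above.

```python
import pandas as pd


def check_submission(df_submit, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(df_submit.columns) != ["SK_ID_CURR", "TARGET"]:
        return ["unexpected columns"]
    problems = []
    if len(df_submit) != expected_rows:
        problems.append("unexpected row count")
    if df_submit["SK_ID_CURR"].duplicated().any():
        problems.append("duplicated IDs")
    if df_submit["TARGET"].isna().any():
        problems.append("missing predictions")
    if not df_submit["TARGET"].dropna().between(0.0, 1.0).all():
        problems.append("predictions outside [0, 1]")
    return problems


# The five rows shown in the output above.
head_rows = pd.DataFrame(
    {
        "SK_ID_CURR": [100001, 100005, 100013, 100028, 100038],
        "TARGET": [0.0328, 0.1186, 0.0318, 0.0485, 0.2123],
    }
)
print(check_submission(head_rows, expected_rows=5))
```

For the full file, `expected_rows` would be 48744, the number of rows printed for `df_submit`.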


## 7.5 Model tuning

In [65]:
# Build a feature list narrowed down by importance (top 100)
col_filter = sorted(list(imp.sort_values("imp", ascending=False)[:100]["col"]))
col_filter

['AMT_ANNUITY',
 'AMT_CREDIT',
 'AMT_GOODS_PRICE',
 'AMT_INCOME_TOTAL',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR',
 'ANNUITY_div_CREDIT',
 'ANNUITY_div_INCOME',
 'APARTMENTS_AVG',
 'APARTMENTS_MEDI',
 'APARTMENTS_MODE',
 'BASEMENTAREA_AVG',
 'BASEMENTAREA_MEDI',
 'BASEMENTAREA_MODE',
 'CNT_FAM_MEMBERS',
 'CNT_INSTALMENT_FUTURE_max',
 'CNT_INSTALMENT_FUTURE_mean',
 'CNT_INSTALMENT_FUTURE_min',
 'CNT_INSTALMENT_FUTURE_std',
 'CNT_INSTALMENT_max',
 'CNT_INSTALMENT_mean',
 'CNT_INSTALMENT_min',
 'CNT_INSTALMENT_std',
 'CODE_GENDER',
 'COMMONAREA_AVG',
 'COMMONAREA_MODE',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'DAYS_EMPLOYED_div_BIRTH',
 'DAYS_ID_PUBLISH',
 'DAYS_LAST_PHONE_CHANGE',
 'DAYS_REGISTRATION',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'ENTRANCES_AVG',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'EXT_SOURCE_count',
 'EXT_SOURCE_max',
 'EXT_SOURCE_mean',
 'EXT_SOURCE_min',
 'EXT_SOURCE_std',
 'FLAG_DOCUMENT_3',
 'F
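Note that the training dataset built in the next cell still has all 162 columns, so `col_filter` is not applied there. If the narrowed list is to be used, the feature frame can be indexed with it. The sketch below shows the pattern on made-up data; `imp_toy`, `x_toy`, and `top_k` are hypothetical names.

```python
import pandas as pd

# Hypothetical importance table and feature frame.
imp_toy = pd.DataFrame({"col": ["a", "b", "c", "d"], "imp": [5.0, 50.0, 0.5, 20.0]})
x_toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# Keep the top_k most important features, sorted by name for a stable column order.
top_k = 2
col_filter_toy = sorted(imp_toy.sort_values("imp", ascending=False)["col"].head(top_k))

# Select only those columns; the same list must be applied to the test features.
x_filtered = x_toy[col_filter_toy]
print(col_filter_toy, x_filtered.shape)
```

Applying the identical list to both the training and the inference frames keeps their columns aligned.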

### 7.5.1 Running automated tuning with Optuna

In [66]:
# Import the library
import optuna

In [67]:
# Build the training dataset
x_train = df_train.drop(columns=["TARGET", "SK_ID_CURR"])
y_train = df_train["TARGET"]
id_train = df_train[["SK_ID_CURR"]]

for col in x_train.columns:
    if x_train[col].dtype == "object":
        x_train[col] = x_train[col].astype("category")

print(x_train.shape, y_train.shape, id_train.shape)

(307511, 162) (307511,) (307511, 1)


In [68]:
# Define the objective function

# Hyperparameters that are not searched
params_base = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "verbosity": -1,
    "learning_rate": 0.05,
    "n_estimators": 100000,
    "bagging_freq": 1,
    "seed": 123,
}


# Objective function: returns the validation AUC for one trial
def objective(trial):
    # Hyperparameters to search
    params_tuning = {
        "num_leaves": trial.suggest_int("num_leaves", 8, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 200),
        "min_sum_hessian_in_leaf": trial.suggest_float("min_sum_hessian_in_leaf", 1e-5, 1e-2, log=True),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.5, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.5, 1.0),
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-2, 1e2, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-2, 1e2, log=True),
    }
    params_tuning.update(params_base)

    # Train and evaluate the model
    list_metrics = list()
    cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=123).split(x_train, y_train))
    list_fold = [0]  # Use only the first fold to speed up the search.
    for nfold in list_fold:
        idx_tr, idx_va = cv[nfold][0], cv[nfold][1]
        # split() yields positional indices, so select rows with iloc
        x_tr, y_tr = x_train.iloc[idx_tr], y_train.iloc[idx_tr]
        x_va, y_va = x_train.iloc[idx_va], y_train.iloc[idx_va]
        model = lgb.LGBMClassifier(**params_tuning)
        # model.fit(
        #     x_tr,
        #     y_tr,
        #     eval_set=[(x_tr, y_tr), (x_va, y_va)],
        #     early_stopping_rounds=100,
        #     verbose=0,
        # )
        # To run in an environment as of 2024-02-14, use the code below (early stopping and logging are passed as callbacks).
        model.fit(
            x_tr,
            y_tr,
            eval_set=[(x_tr, y_tr), (x_va, y_va)],
            callbacks=[
                lgb.early_stopping(stopping_rounds=100, verbose=True),
                lgb.log_evaluation(0),
            ],
        )

        y_va_pred = model.predict_proba(x_va)[:, 1]
        metric_va = roc_auc_score(y_va, y_va_pred)  # Use AUC as the evaluation metric
        list_metrics.append(metric_va)

    # Compute the evaluation metric (mean over the evaluated folds)
    metrics = np.mean(list_metrics)

    return metrics
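Three of the searched parameters (`min_sum_hessian_in_leaf`, `lambda_l1`, `lambda_l2`) use `log=True`, which asks Optuna to sample in the log domain so that each order of magnitude gets similar attention. The NumPy sketch below illustrates the idea for a purely random draw over the `lambda_l1` range; it is a conceptual illustration with a hypothetical helper, not Optuna's sampler (TPE additionally adapts to earlier trials).

```python
import numpy as np


def sample_log_uniform(rng, low, high, size):
    """Draw values whose logarithm is uniform between log(low) and log(high)."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))


rng = np.random.default_rng(123)

# Same range as lambda_l1 / lambda_l2 in the objective: 1e-2 to 1e2.
lambda_l1_samples = sample_log_uniform(rng, 1e-2, 1e2, size=10_000)

# In log space, 1.0 is the midpoint of [1e-2, 1e2], so about half the draws fall below it.
share_below_one = float(np.mean(lambda_l1_samples < 1.0))
print(round(share_below_one, 3))
```

A plain uniform draw over the same range would put only about 1% of the samples below 1.0, leaving small regularization values almost unexplored.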

In [69]:
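# Run the optimization (search)
# n_trials: number of trials
# n_jobs: number of parallel jobs
# Note: with n_jobs > 1 the trial order depends on timing, so results may not be reproducible even with a fixed sampler seed.
sampler = optuna.samplers.TPESampler(seed=123)
study = optuna.create_study(sampler=sampler, direction="maximize")
study.optimize(objective, n_trials=50, n_jobs=5)
# Runtime: 32 min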

[I 2024-03-05 17:47:19,746] A new study created in memory with name: no-name-ac7ff1b8-68ca-4798-a166-b3df8675218d


Training until validation scores don't improve for 100 rounds
Training until validation scores don't improve for 100 rounds
Training until validation scores don't improve for 100 rounds
Training until validation scores don't improve for 100 rounds
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[535]	training's auc: 0.835049	valid_1's auc: 0.773084


[I 2024-03-05 17:48:26,915] Trial 3 finished with value: 0.7730843806464844 and parameters: {'num_leaves': 24, 'min_child_samples': 90, 'min_sum_hessian_in_leaf': 1.491734478505277e-05, 'feature_fraction': 0.6905665929242375, 'bagging_fraction': 0.5249478321412735, 'lambda_l1': 0.43181970669255415, 'lambda_l2': 11.781341928813271}. Best is trial 3 with value: 0.7730843806464844.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[190]	training's auc: 0.873435	valid_1's auc: 0.770563


[I 2024-03-05 17:48:44,640] Trial 4 finished with value: 0.7705628101653383 and parameters: {'num_leaves': 113, 'min_child_samples': 154, 'min_sum_hessian_in_leaf': 1.091998718850581e-05, 'feature_fraction': 0.9082262487273489, 'bagging_fraction': 0.6684323256691458, 'lambda_l1': 0.6066392740168435, 'lambda_l2': 12.143001881197026}. Best is trial 3 with value: 0.7730843806464844.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[159]	training's auc: 0.89124	valid_1's auc: 0.767415


[I 2024-03-05 17:49:02,090] Trial 0 finished with value: 0.7674147202621115 and parameters: {'num_leaves': 169, 'min_child_samples': 176, 'min_sum_hessian_in_leaf': 0.00011680763406220553, 'feature_fraction': 0.9967666321153877, 'bagging_fraction': 0.5829489135426986, 'lambda_l1': 2.290786479021068, 'lambda_l2': 4.18319624934137}. Best is trial 3 with value: 0.7730843806464844.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[251]	training's auc: 0.906397	valid_1's auc: 0.767906


[I 2024-03-05 17:50:29,031] Trial 1 finished with value: 0.7679064534553435 and parameters: {'num_leaves': 236, 'min_child_samples': 107, 'min_sum_hessian_in_leaf': 8.186231051345929e-05, 'feature_fraction': 0.8605993381919688, 'bagging_fraction': 0.6431500273316987, 'lambda_l1': 15.609939135443666, 'lambda_l2': 0.9618713061741688}. Best is trial 3 with value: 0.7730843806464844.


Early stopping, best iteration is:
[267]	training's auc: 0.904878	valid_1's auc: 0.772073


[I 2024-03-05 17:50:32,363] Trial 2 finished with value: 0.7720734233696507 and parameters: {'num_leaves': 192, 'min_child_samples': 153, 'min_sum_hessian_in_leaf': 4.1240552530253936e-05, 'feature_fraction': 0.5793323023756751, 'bagging_fraction': 0.871956811032129, 'lambda_l1': 14.337079987384774, 'lambda_l2': 0.06089551522448312}. Best is trial 3 with value: 0.7730843806464844.


Training until validation scores don't improve for 100 rounds
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[254]	training's auc: 0.867993	valid_1's auc: 0.767778


[I 2024-03-05 17:51:31,329] Trial 5 finished with value: 0.7677780795114066 and parameters: {'num_leaves': 205, 'min_child_samples': 87, 'min_sum_hessian_in_leaf': 4.9858739088801204e-05, 'feature_fraction': 0.9362363130437326, 'bagging_fraction': 0.6971083594354832, 'lambda_l1': 30.81100620636748, 'lambda_l2': 0.026143578221411667}. Best is trial 3 with value: 0.7730843806464844.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[170]	training's auc: 0.940067	valid_1's auc: 0.769165


[I 2024-03-05 17:51:49,198] Trial 6 finished with value: 0.769165195670696 and parameters: {'num_leaves': 216, 'min_child_samples': 145, 'min_sum_hessian_in_leaf': 0.001249120703167817, 'feature_fraction': 0.6793144613574608, 'bagging_fraction': 0.8750714400654913, 'lambda_l1': 0.08130465538589339, 'lambda_l2': 0.3072992455268723}. Best is trial 3 with value: 0.7730843806464844.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[633]	training's auc: 0.830921	valid_1's auc: 0.772105


[I 2024-03-05 17:52:06,106] Trial 8 finished with value: 0.7721052674889994 and parameters: {'num_leaves': 19, 'min_child_samples': 82, 'min_sum_hessian_in_leaf': 0.00012815591523077815, 'feature_fraction': 0.9936702929164138, 'bagging_fraction': 0.5501373536659775, 'lambda_l1': 1.578070882426181, 'lambda_l2': 13.985592814411795}. Best is trial 3 with value: 0.7730843806464844.


Early stopping, best iteration is:
[167]	training's auc: 0.94851	valid_1's auc: 0.768381
Training until validation scores don't improve for 100 rounds


[I 2024-03-05 17:52:16,227] Trial 7 finished with value: 0.7683809126654989 and parameters: {'num_leaves': 222, 'min_child_samples': 85, 'min_sum_hessian_in_leaf': 0.0014445780940415317, 'feature_fraction': 0.7392837367652674, 'bagging_fraction': 0.9032118588305915, 'lambda_l1': 0.30768544758019195, 'lambda_l2': 0.1347508248568293}. Best is trial 3 with value: 0.7730843806464844.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[211]	training's auc: 0.869796	valid_1's auc: 0.769733


[I 2024-03-05 17:53:40,485] Trial 11 finished with value: 0.7697328004439581 and parameters: {'num_leaves': 77, 'min_child_samples': 103, 'min_sum_hessian_in_leaf': 0.008493420091758219, 'feature_fraction': 0.8825327839177592, 'bagging_fraction': 0.6402356017260964, 'lambda_l1': 0.26710507063297123, 'lambda_l2': 1.763133303169384}. Best is trial 3 with value: 0.7730843806464844.


Early stopping, best iteration is:
[205]	training's auc: 0.872317	valid_1's auc: 0.771222


[I 2024-03-05 17:53:44,511] Trial 12 finished with value: 0.7712224775380331 and parameters: {'num_leaves': 82, 'min_child_samples': 167, 'min_sum_hessian_in_leaf': 0.005016660504227434, 'feature_fraction': 0.91544975785175, 'bagging_fraction': 0.9648475933081238, 'lambda_l1': 1.8903901005900492, 'lambda_l2': 0.2541890366376793}. Best is trial 3 with value: 0.7730843806464844.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[208]	training's auc: 0.879925	valid_1's auc: 0.768386


[I 2024-03-05 17:53:54,269] Trial 9 finished with value: 0.7683863737948156 and parameters: {'num_leaves': 256, 'min_child_samples': 199, 'min_sum_hessian_in_leaf': 0.0001208990529912692, 'feature_fraction': 0.7808171982521007, 'bagging_fraction': 0.5473985106445489, 'lambda_l1': 13.702395404290565, 'lambda_l2': 0.18922336367854078}. Best is trial 3 with value: 0.7730843806464844.


Training until validation scores don't improve for 100 rounds
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[249]	training's auc: 0.847534	valid_1's auc: 0.772166


[I 2024-03-05 17:54:33,173] Trial 13 finished with value: 0.7721663979384932 and parameters: {'num_leaves': 56, 'min_child_samples': 13, 'min_sum_hessian_in_leaf': 0.00011446077850710227, 'feature_fraction': 0.8219870199472179, 'bagging_fraction': 0.7341965099234689, 'lambda_l1': 0.06295025082145492, 'lambda_l2': 12.269267412996541}. Best is trial 3 with value: 0.7730843806464844.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[161]	training's auc: 0.923747	valid_1's auc: 0.768428


[I 2024-03-05 17:54:59,697] Trial 10 finished with value: 0.7684283991976522 and parameters: {'num_leaves': 205, 'min_child_samples': 186, 'min_sum_hessian_in_leaf': 0.00013384492364563967, 'feature_fraction': 0.8621411304820359, 'bagging_fraction': 0.710469280589955, 'lambda_l1': 0.029855321905251146, 'lambda_l2': 2.2830558969303167}. Best is trial 3 with value: 0.7730843806464844.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[572]	training's auc: 0.812318	valid_1's auc: 0.772664


[I 2024-03-05 17:57:15,572] Trial 15 finished with value: 0.7726642655509577 and parameters: {'num_leaves': 18, 'min_child_samples': 27, 'min_sum_hessian_in_leaf': 0.00034476940498004286, 'feature_fraction': 0.588873778700843, 'bagging_fraction': 0.5157036670976389, 'lambda_l1': 0.016520305511118574, 'lambda_l2': 84.14869930049673}. Best is trial 3 with value: 0.7730843806464844.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[1035]	training's auc: 0.805018	valid_1's auc: 0.773329


[I 2024-03-05 17:58:05,757] Trial 16 finished with value: 0.7733292622448904 and parameters: {'num_leaves': 9, 'min_child_samples': 27, 'min_sum_hessian_in_leaf': 1.0880889260834155e-05, 'feature_fraction': 0.5885272808935014, 'bagging_fraction': 0.5117226900538523, 'lambda_l1': 0.010555054044692732, 'lambda_l2': 84.61027694438329}. Best is trial 16 with value: 0.7733292622448904.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[705]	training's auc: 0.832584	valid_1's auc: 0.774014


[I 2024-03-05 17:58:43,246] Trial 17 finished with value: 0.7740137950335215 and parameters: {'num_leaves': 21, 'min_child_samples': 12, 'min_sum_hessian_in_leaf': 1.1186800886410533e-05, 'feature_fraction': 0.6080814058980422, 'bagging_fraction': 0.7806595142664121, 'lambda_l1': 0.012742409039829173, 'lambda_l2': 64.96544736665552}. Best is trial 17 with value: 0.7740137950335215.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[912]	training's auc: 0.814407	valid_1's auc: 0.774339


[I 2024-03-05 17:59:02,375] Trial 18 finished with value: 0.774338848005229 and parameters: {'num_leaves': 11, 'min_child_samples': 12, 'min_sum_hessian_in_leaf': 1.208544038988097e-05, 'feature_fraction': 0.6216638436421144, 'bagging_fraction': 0.7864096954539077, 'lambda_l1': 0.012241152570934326, 'lambda_l2': 62.467269232509025}. Best is trial 18 with value: 0.774338848005229.


Early stopping, best iteration is:
[847]	training's auc: 0.833758	valid_1's auc: 0.774131


[I 2024-03-05 17:59:09,007] Trial 14 finished with value: 0.7741309011679157 and parameters: {'num_leaves': 20, 'min_child_samples': 13, 'min_sum_hessian_in_leaf': 1.0723624703556903e-05, 'feature_fraction': 0.5003205484064259, 'bagging_fraction': 0.770945652729496, 'lambda_l1': 0.03540965929660106, 'lambda_l2': 87.1959978702186}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[1047]	training's auc: 0.810929	valid_1's auc: 0.773871


[I 2024-03-05 18:01:41,855] Trial 19 finished with value: 0.7738711501932752 and parameters: {'num_leaves': 10, 'min_child_samples': 20, 'min_sum_hessian_in_leaf': 0.00046378456563141764, 'feature_fraction': 0.5054242501697139, 'bagging_fraction': 0.5242403864488414, 'lambda_l1': 0.010361160720293059, 'lambda_l2': 55.11393379000911}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[427]	training's auc: 0.845566	valid_1's auc: 0.774038


[I 2024-03-05 18:02:31,247] Trial 20 finished with value: 0.7740377983533753 and parameters: {'num_leaves': 49, 'min_child_samples': 44, 'min_sum_hessian_in_leaf': 1.00876448084067e-05, 'feature_fraction': 0.5072359006010247, 'bagging_fraction': 0.7989394097097444, 'lambda_l1': 0.10744321914841211, 'lambda_l2': 99.26158377695306}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[369]	training's auc: 0.849823	valid_1's auc: 0.773744


[I 2024-03-05 18:02:46,052] Trial 22 finished with value: 0.7737444505681765 and parameters: {'num_leaves': 51, 'min_child_samples': 56, 'min_sum_hessian_in_leaf': 2.404916419909978e-05, 'feature_fraction': 0.507261575091893, 'bagging_fraction': 0.7980636536394629, 'lambda_l1': 0.06786951250601601, 'lambda_l2': 42.74227961032233}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[459]	training's auc: 0.854557	valid_1's auc: 0.773837


[I 2024-03-05 18:03:03,717] Trial 21 finished with value: 0.773836876530421 and parameters: {'num_leaves': 49, 'min_child_samples': 47, 'min_sum_hessian_in_leaf': 1.6693518566885993e-05, 'feature_fraction': 0.5162824429413363, 'bagging_fraction': 0.8038610834404655, 'lambda_l1': 0.011171869391200963, 'lambda_l2': 71.50057337422876}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[267]	training's auc: 0.894271	valid_1's auc: 0.771606


[I 2024-03-05 18:04:29,521] Trial 23 finished with value: 0.771606106732411 and parameters: {'num_leaves': 135, 'min_child_samples': 54, 'min_sum_hessian_in_leaf': 2.7519720944911214e-05, 'feature_fraction': 0.5033558749499052, 'bagging_fraction': 0.8003214945705489, 'lambda_l1': 0.09582650248549308, 'lambda_l2': 31.15128333985646}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[435]	training's auc: 0.867462	valid_1's auc: 0.772876


[I 2024-03-05 18:05:06,685] Trial 24 finished with value: 0.7728759386382809 and parameters: {'num_leaves': 52, 'min_child_samples': 45, 'min_sum_hessian_in_leaf': 2.932965684574063e-05, 'feature_fraction': 0.5273030772364647, 'bagging_fraction': 0.7876785985593697, 'lambda_l1': 0.0966491151501192, 'lambda_l2': 21.498089935371343}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[353]	training's auc: 0.850226	valid_1's auc: 0.773988


[I 2024-03-05 18:05:23,002] Trial 26 finished with value: 0.7739875402891876 and parameters: {'num_leaves': 48, 'min_child_samples': 48, 'min_sum_hessian_in_leaf': 2.6355243709998052e-05, 'feature_fraction': 0.5406037029645386, 'bagging_fraction': 0.8103584763128626, 'lambda_l1': 0.14973789762955667, 'lambda_l2': 22.341868955406625}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[424]	training's auc: 0.85928	valid_1's auc: 0.773233


[I 2024-03-05 18:05:44,797] Trial 25 finished with value: 0.773233088658353 and parameters: {'num_leaves': 49, 'min_child_samples': 49, 'min_sum_hessian_in_leaf': 2.4191310978449368e-05, 'feature_fraction': 0.5069001171814758, 'bagging_fraction': 0.8082267483365126, 'lambda_l1': 0.09326597752502529, 'lambda_l2': 28.35748678589741}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[261]	training's auc: 0.884553	valid_1's auc: 0.772715


[I 2024-03-05 18:06:34,515] Trial 27 finished with value: 0.7727149297265228 and parameters: {'num_leaves': 115, 'min_child_samples': 56, 'min_sum_hessian_in_leaf': 2.8171072723459607e-05, 'feature_fraction': 0.5484848950529893, 'bagging_fraction': 0.8346844946307501, 'lambda_l1': 0.03857431263765092, 'lambda_l2': 30.266052919191573}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[325]	training's auc: 0.882549	valid_1's auc: 0.772205


[I 2024-03-05 18:07:54,008] Trial 28 finished with value: 0.772204579532763 and parameters: {'num_leaves': 85, 'min_child_samples': 40, 'min_sum_hessian_in_leaf': 3.0286318567658172e-05, 'feature_fraction': 0.5525135337926876, 'bagging_fraction': 0.7517983579563073, 'lambda_l1': 0.02902998136915894, 'lambda_l2': 21.821522737847587}. Best is trial 18 with value: 0.774338848005229.


Early stopping, best iteration is:
[252]	training's auc: 0.88405	valid_1's auc: 0.77276
Early stopping, best iteration is:
[263]	training's auc: 0.889834	valid_1's auc: 0.772749


[I 2024-03-05 18:07:58,596] Trial 30 finished with value: 0.7727603500779823 and parameters: {'num_leaves': 85, 'min_child_samples': 5, 'min_sum_hessian_in_leaf': 5.821168127844469e-05, 'feature_fraction': 0.649439841901169, 'bagging_fraction': 0.863118744927204, 'lambda_l1': 0.031067956690333062, 'lambda_l2': 4.009189496301025}. Best is trial 18 with value: 0.774338848005229.
[I 2024-03-05 18:07:58,989] Trial 29 finished with value: 0.7727492888865093 and parameters: {'num_leaves': 91, 'min_child_samples': 5, 'min_sum_hessian_in_leaf': 1.9951798883255676e-05, 'feature_fraction': 0.6378049773837202, 'bagging_fraction': 0.8333666053821815, 'lambda_l1': 0.03533382114093348, 'lambda_l2': 4.925943721830825}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Training until validation scores don't improve for 100 rounds
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[239]	training's auc: 0.871354	valid_1's auc: 0.772867


[I 2024-03-05 18:08:15,055] Trial 31 finished with value: 0.7728672143684201 and parameters: {'num_leaves': 81, 'min_child_samples': 5, 'min_sum_hessian_in_leaf': 5.8527758475475315e-05, 'feature_fraction': 0.6407426702602146, 'bagging_fraction': 0.7514527671708687, 'lambda_l1': 0.0295037370002284, 'lambda_l2': 6.338231334038101}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[351]	training's auc: 0.899001	valid_1's auc: 0.772652


[I 2024-03-05 18:09:25,698] Trial 32 finished with value: 0.7726518079063258 and parameters: {'num_leaves': 77, 'min_child_samples': 7, 'min_sum_hessian_in_leaf': 6.340023617484815e-05, 'feature_fraction': 0.6605531931016393, 'bagging_fraction': 0.7426248151369592, 'lambda_l1': 0.027486742999175445, 'lambda_l2': 4.928042409348317}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[399]	training's auc: 0.835023	valid_1's auc: 0.774196


[I 2024-03-05 18:09:58,548] Trial 33 finished with value: 0.7741960677945234 and parameters: {'num_leaves': 30, 'min_child_samples': 6, 'min_sum_hessian_in_leaf': 6.065893620239605e-05, 'feature_fraction': 0.6186540813220591, 'bagging_fraction': 0.7517312206745859, 'lambda_l1': 4.293284520904504, 'lambda_l2': 3.160952812050787}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[610]	training's auc: 0.847047	valid_1's auc: 0.774177


[I 2024-03-05 18:11:46,663] Trial 36 finished with value: 0.7741774579187568 and parameters: {'num_leaves': 33, 'min_child_samples': 30, 'min_sum_hessian_in_leaf': 1.0749166753092036e-05, 'feature_fraction': 0.6219296377317414, 'bagging_fraction': 0.7603988133503716, 'lambda_l1': 0.01842431086953345, 'lambda_l2': 86.30089149510675}. Best is trial 18 with value: 0.774338848005229.


Early stopping, best iteration is:
[591]	training's auc: 0.851794	valid_1's auc: 0.773669


[I 2024-03-05 18:11:59,791] Trial 35 finished with value: 0.7736689957866657 and parameters: {'num_leaves': 38, 'min_child_samples': 31, 'min_sum_hessian_in_leaf': 1.0305934546469622e-05, 'feature_fraction': 0.6217702335479749, 'bagging_fraction': 0.7530462956251059, 'lambda_l1': 0.01940229803604593, 'lambda_l2': 97.23638149156733}. Best is trial 18 with value: 0.774338848005229.


Training until validation scores don't improve for 100 rounds
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[481]	training's auc: 0.835511	valid_1's auc: 0.774615


[I 2024-03-05 18:12:21,910] Trial 37 finished with value: 0.7746152566711185 and parameters: {'num_leaves': 32, 'min_child_samples': 29, 'min_sum_hessian_in_leaf': 1.1406122742331612e-05, 'feature_fraction': 0.6092896660516469, 'bagging_fraction': 0.9448701543486733, 'lambda_l1': 0.016539525197213148, 'lambda_l2': 97.94301511767947}. Best is trial 37 with value: 0.7746152566711185.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[193]	training's auc: 0.902273	valid_1's auc: 0.771521


[I 2024-03-05 18:12:48,705] Trial 34 finished with value: 0.7715210370859129 and parameters: {'num_leaves': 149, 'min_child_samples': 33, 'min_sum_hessian_in_leaf': 0.00021628375871291414, 'feature_fraction': 0.6392023456795418, 'bagging_fraction': 0.7534334576766978, 'lambda_l1': 0.8641757042857885, 'lambda_l2': 5.242338719507973}. Best is trial 37 with value: 0.7746152566711185.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[466]	training's auc: 0.865085	valid_1's auc: 0.773755


[I 2024-03-05 18:13:35,282] Trial 38 finished with value: 0.7737549168421052 and parameters: {'num_leaves': 40, 'min_child_samples': 32, 'min_sum_hessian_in_leaf': 0.00021331353769349988, 'feature_fraction': 0.7441697323194054, 'bagging_fraction': 0.9289168797995269, 'lambda_l1': 5.7800128837452025, 'lambda_l2': 0.7609366332808881}. Best is trial 37 with value: 0.7746152566711185.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[347]	training's auc: 0.837393	valid_1's auc: 0.771868


[I 2024-03-05 18:14:20,471] Trial 40 finished with value: 0.771867870452038 and parameters: {'num_leaves': 34, 'min_child_samples': 68, 'min_sum_hessian_in_leaf': 1.503598167441926e-05, 'feature_fraction': 0.7094442445531383, 'bagging_fraction': 0.6244386097065968, 'lambda_l1': 4.391104376055152, 'lambda_l2': 0.5121023530560861}. Best is trial 37 with value: 0.7746152566711185.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[493]	training's auc: 0.82771	valid_1's auc: 0.772878


[I 2024-03-05 18:14:44,206] Trial 39 finished with value: 0.772877677079968 and parameters: {'num_leaves': 29, 'min_child_samples': 69, 'min_sum_hessian_in_leaf': 1.629059319974351e-05, 'feature_fraction': 0.7138224313237866, 'bagging_fraction': 0.6109027223563842, 'lambda_l1': 8.725387443426191, 'lambda_l2': 41.14280340250841}. Best is trial 37 with value: 0.7746152566711185.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[466]	training's auc: 0.860987	valid_1's auc: 0.774162


[I 2024-03-05 18:15:38,104] Trial 41 finished with value: 0.77416215036972 and parameters: {'num_leaves': 36, 'min_child_samples': 119, 'min_sum_hessian_in_leaf': 4.061680018339771e-05, 'feature_fraction': 0.7136934861782758, 'bagging_fraction': 0.9660445152189178, 'lambda_l1': 4.428034073787398, 'lambda_l2': 0.5915201222665156}. Best is trial 37 with value: 0.7746152566711185.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[402]	training's auc: 0.840287	valid_1's auc: 0.77254


[I 2024-03-05 18:16:13,124] Trial 44 finished with value: 0.7725397567898705 and parameters: {'num_leaves': 30, 'min_child_samples': 128, 'min_sum_hessian_in_leaf': 8.069090007223803e-05, 'feature_fraction': 0.7067066549550891, 'bagging_fraction': 0.6927166591748504, 'lambda_l1': 3.8339963240148704, 'lambda_l2': 0.02999022103254685}. Best is trial 37 with value: 0.7746152566711185.


Early stopping, best iteration is:
[467]	training's auc: 0.848864	valid_1's auc: 0.773143
Training until validation scores don't improve for 100 rounds


[I 2024-03-05 18:16:25,917] Trial 43 finished with value: 0.773143281045781 and parameters: {'num_leaves': 33, 'min_child_samples': 67, 'min_sum_hessian_in_leaf': 1.6358961603810537e-05, 'feature_fraction': 0.5713968584697342, 'bagging_fraction': 0.6192062244934453, 'lambda_l1': 5.262058042001089, 'lambda_l2': 0.01168358442051133}. Best is trial 37 with value: 0.7746152566711185.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[1164]	training's auc: 0.82261	valid_1's auc: 0.772808


[I 2024-03-05 18:17:27,432] Trial 42 finished with value: 0.7727803100959608 and parameters: {'num_leaves': 34, 'min_child_samples': 71, 'min_sum_hessian_in_leaf': 1.63559734180479e-05, 'feature_fraction': 0.7045047812649514, 'bagging_fraction': 0.96814467983717, 'lambda_l1': 73.84274342313871, 'lambda_l2': 0.7266567695598871}. Best is trial 37 with value: 0.7746152566711185.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[589]	training's auc: 0.851671	valid_1's auc: 0.772405


[I 2024-03-05 18:19:12,826] Trial 45 finished with value: 0.7724048138162796 and parameters: {'num_leaves': 62, 'min_child_samples': 131, 'min_sum_hessian_in_leaf': 4.327152323641915e-05, 'feature_fraction': 0.5718763114587616, 'bagging_fraction': 0.6934685110992704, 'lambda_l1': 27.51479571338394, 'lambda_l2': 49.42385357832582}. Best is trial 37 with value: 0.7746152566711185.


Early stopping, best iteration is:
[318]	training's auc: 0.846056	valid_1's auc: 0.772409


[I 2024-03-05 18:19:19,123] Trial 47 finished with value: 0.7724092561047714 and parameters: {'num_leaves': 67, 'min_child_samples': 109, 'min_sum_hessian_in_leaf': 4.4735355351425136e-05, 'feature_fraction': 0.6752763400953709, 'bagging_fraction': 0.9792021755022551, 'lambda_l1': 24.36128360941631, 'lambda_l2': 0.08847953691604753}. Best is trial 37 with value: 0.7746152566711185.


Early stopping, best iteration is:
[561]	training's auc: 0.850194	valid_1's auc: 0.772639
Early stopping, best iteration is:
[410]	training's auc: 0.84695	valid_1's auc: 0.7727


[I 2024-03-05 18:19:34,175] Trial 46 finished with value: 0.7726388052174767 and parameters: {'num_leaves': 62, 'min_child_samples': 132, 'min_sum_hessian_in_leaf': 3.816882789408487e-05, 'feature_fraction': 0.6814841620419297, 'bagging_fraction': 0.9919985462141654, 'lambda_l1': 47.012841108584006, 'lambda_l2': 0.08006822619539991}. Best is trial 37 with value: 0.7746152566711185.
[I 2024-03-05 18:19:36,572] Trial 48 finished with value: 0.7727001244731373 and parameters: {'num_leaves': 60, 'min_child_samples': 111, 'min_sum_hessian_in_leaf': 4.0588898286676534e-05, 'feature_fraction': 0.7880160756985722, 'bagging_fraction': 0.9656031799335398, 'lambda_l1': 34.84281773891688, 'lambda_l2': 2.1843056847173785}. Best is trial 37 with value: 0.7746152566711185.


Early stopping, best iteration is:
[189]	training's auc: 0.918307	valid_1's auc: 0.771348


[I 2024-03-05 18:19:51,839] Trial 49 finished with value: 0.7713483756275178 and parameters: {'num_leaves': 163, 'min_child_samples': 115, 'min_sum_hessian_in_leaf': 4.67721443246508e-05, 'feature_fraction': 0.6679853133050253, 'bagging_fraction': 0.9357252561195047, 'lambda_l1': 2.690605004865937, 'lambda_l2': 0.11689360492625384}. Best is trial 37 with value: 0.7746152566711185.
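
The trials in the log above finish out of order (presumably because several ran in parallel), which makes it hard to compare them by eye. As a rough sketch, the `Trial N finished with value: V` lines can be pulled out with a regular expression and ranked. The helper name `parse_trials` is hypothetical, and the three sample lines are copied from the log with their parameter dicts shortened to `{...}`.

```python
import re

# Three lines copied from the log above; the parameter dicts are shortened
# to "{...}" here for readability.
log_text = """\
[I 2024-03-05 18:12:21,910] Trial 37 finished with value: 0.7746152566711185 and parameters: {...}. Best is trial 37 with value: 0.7746152566711185.
[I 2024-03-05 18:12:48,705] Trial 34 finished with value: 0.7715210370859129 and parameters: {...}. Best is trial 37 with value: 0.7746152566711185.
[I 2024-03-05 18:19:51,839] Trial 49 finished with value: 0.7713483756275178 and parameters: {...}. Best is trial 37 with value: 0.7746152566711185.
"""

# Matches only the "Trial N finished with value: V" part of each line
TRIAL_RE = re.compile(r"Trial (\d+) finished with value: ([0-9.]+)")


def parse_trials(text):
    """Return a list of (trial_number, value) tuples found in the log text."""
    return [(int(n), float(v)) for n, v in TRIAL_RE.findall(text)]


trials = parse_trials(log_text)

# Highest validation AUC first
ranked = sorted(trials, key=lambda t: t[1], reverse=True)
for number, value in ranked:
    print(f"trial {number:>2}: {value:.6f}")
```

If the `study` object is still in memory, `study.trials_dataframe()` gives the same information directly, so parsing the text is only useful when the log is all that remains.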


In [70]:
# Check the search results (the study value is the validation AUC, not accuracy)
trial = study.best_trial
print(f"auc(best)={trial.value:.4f}")
display(trial.params)

auc(best)=0.7746


{'num_leaves': 32,
 'min_child_samples': 29,
 'min_sum_hessian_in_leaf': 1.1406122742331612e-05,
 'feature_fraction': 0.6092896660516469,
 'bagging_fraction': 0.9448701543486733,
 'lambda_l1': 0.016539525197213148,
 'lambda_l2': 97.94301511767947}
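
The value reported for the best trial is a validation AUC (the base parameters set `'metric': 'auc'`). As a reminder of what that number measures, here is a minimal pure-Python sketch that computes AUC as the fraction of (positive, negative) pairs that the scores rank correctly. The function name `pairwise_auc` and the toy data are made up for illustration. In the notebook itself, `roc_auc_score` from scikit-learn does this job.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The cost is O(n_pos * n_neg),
    so this is only meant for small illustrative inputs.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (not competition data): 3 of the 4 pairs are ordered correctly
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75
```

Because AUC depends only on the ordering of the predictions, it needs no decision threshold, which is why it suits an imbalanced target such as loan default.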

In [71]:
# Get the best hyperparameters and merge in the fixed base parameters
params_best = trial.params
params_best.update(params_base)
display(params_best)

{'num_leaves': 32,
 'min_child_samples': 29,
 'min_sum_hessian_in_leaf': 1.1406122742331612e-05,
 'feature_fraction': 0.6092896660516469,
 'bagging_fraction': 0.9448701543486733,
 'lambda_l1': 0.016539525197213148,
 'lambda_l2': 97.94301511767947,
 'boosting_type': 'gbdt',
 'objective': 'binary',
 'metric': 'auc',
 'verbosity': -1,
 'learning_rate': 0.05,
 'n_estimators': 100000,
 'bagging_freq': 1,
 'seed': 123}
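
One detail of the merge is worth spelling out: `dict.update` overwrites existing keys, so if a tuned parameter also appeared in `params_base`, the base value would silently win. The sketch below shows a non-mutating merge plus a guard against that case. The two small dicts are stand-ins holding a few of the values displayed above, not the full parameter sets.

```python
# Stand-ins for trial.params and params_base (only a few of the keys shown above)
tuned = {"num_leaves": 32, "lambda_l2": 97.94301511767947}
base = {"objective": "binary", "metric": "auc", "learning_rate": 0.05, "seed": 123}

# Guard: a key defined in both dicts would have its tuned value overwritten
overlap = set(tuned) & set(base)
if overlap:
    raise ValueError(f"keys defined in both dicts: {sorted(overlap)}")

# Non-mutating merge: on a key conflict the right-hand dict wins,
# which is the same precedence as tuned.update(base)
merged = {**tuned, **base}
print(merged)
```

Building a new dict leaves the tuned parameters untouched, which can be convenient if they are inspected again later.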

In [72]:
# Train the model with the best hyperparameters
train_oof, imp, metrics = train_lgb(
    x_train,
    y_train,
    id_train,
    list_nfold=[0, 1, 2, 3, 4],
    n_splits=5,
    params=params_best,
)
# Runtime: about 4 minutes

-------------------- 0 --------------------
(246008, 162) (61503, 162)
Training until validation scores don't improve for 100 rounds
[100]	training's auc: 0.784121	valid_1's auc: 0.764129
[200]	training's auc: 0.803207	valid_1's auc: 0.771165
[300]	training's auc: 0.816935	valid_1's auc: 0.773429
[400]	training's auc: 0.8278	valid_1's auc: 0.774152
[500]	training's auc: 0.837195	valid_1's auc: 0.774482
Early stopping, best iteration is:
[481]	training's auc: 0.835511	valid_1's auc: 0.774615
[auc] tr:0.8355, va:0.7746
-------------------- 1 --------------------
(246009, 162) (61502, 162)
Training until validation scores don't improve for 100 rounds
[100]	training's auc: 0.783849	valid_1's auc: 0.767894
[200]	training's auc: 0.802897	valid_1's auc: 0.77515
[300]	training's auc: 0.816275	valid_1's auc: 0.777699
[400]	training's auc: 0.827064	valid_1's auc: 0.778903
[500]	training's auc: 0.836339	valid_1's auc: 0.779559
[600]	training's auc: 0.844631	valid_1's auc: 0.779922
Early stopping,
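
The log lines "Training until validation scores don't improve for 100 rounds" and "Early stopping, best iteration is: [481]" describe a patience rule: training stops once the validation metric has gone 100 rounds without a new best, and the model from the best round is kept. The following is a simplified sketch of that rule for a higher-is-better metric, not LightGBM's actual implementation. The function name and the short score list are made up for illustration.

```python
def best_iteration_with_patience(valid_scores, patience):
    """Return (best_iteration, best_score, stopped_early).

    Iterations are 1-based. The metric is assumed to be higher-is-better
    (as with AUC). Scanning stops as soon as `patience` rounds have passed
    since the last improvement.
    """
    best_iter, best_score = 0, float("-inf")
    for i, score in enumerate(valid_scores, start=1):
        if score > best_score:
            best_iter, best_score = i, score
        elif i - best_iter >= patience:
            return best_iter, best_score, True
    return best_iter, best_score, False


# Made-up validation curve: peaks at round 3, then never beats 0.74 again
scores = [0.70, 0.72, 0.74, 0.73, 0.735, 0.738, 0.739]
print(best_iteration_with_patience(scores, patience=3))
```

Under this rule, fold 0 (best iteration 481, patience 100) would stop at about iteration 581, which is consistent with `[500]` being the last periodic line printed for that fold.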

In [73]:
# Build the inference data, run inference, and create the submission file

# Build the inference dataset
x_test = df_test.drop(columns=["SK_ID_CURR"])
id_test = df_test[["SK_ID_CURR"]]

# Convert categorical (object) columns to the category dtype
for col in x_test.columns:
    if x_test[col].dtype == "object":
        x_test[col] = x_test[col].astype("category")

# Predict with the models for folds 0-4
test_pred = predict_lgb(
    x_test,
    id_test,
    list_nfold=[0, 1, 2, 3, 4],
)

# Create the submission file (index=False keeps the row index out of the CSV)
df_submit = test_pred.rename(columns={"pred": "TARGET"})
print(df_submit.shape)
display(df_submit.head())
df_submit.to_csv("submission_HyperParameterTuning.csv", index=False)

-------------------- 0 --------------------
-------------------- 1 --------------------
-------------------- 2 --------------------
-------------------- 3 --------------------
-------------------- 4 --------------------
Done.
(48744, 2)


Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.0424
1,100005,0.1253
2,100013,0.0262
3,100028,0.0469
4,100038,0.2131
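
Before uploading, a quick sanity check of the submission file can catch formatting mistakes early. The sketch below verifies the header, that each `TARGET` lies in [0, 1], and that no `SK_ID_CURR` is repeated. The function name `check_submission` is hypothetical, and the sample text reuses the five rows displayed above (with `TARGET` rounded as shown) instead of reading the real file.

```python
import csv
import io

# The five rows shown by display(df_submit.head()); TARGET values are the
# rounded values as displayed, not the full-precision predictions.
sample_csv = """\
SK_ID_CURR,TARGET
100001,0.0424
100005,0.1253
100013,0.0262
100028,0.0469
100038,0.2131
"""


def check_submission(fileobj, expected_columns=("SK_ID_CURR", "TARGET")):
    """Validate a submission CSV and return the number of data rows."""
    reader = csv.DictReader(fileobj)
    if tuple(reader.fieldnames or ()) != tuple(expected_columns):
        raise ValueError(f"unexpected header: {reader.fieldnames}")

    seen_ids = set()
    for row in reader:
        sk_id = int(row["SK_ID_CURR"])
        target = float(row["TARGET"])
        if sk_id in seen_ids:
            raise ValueError(f"duplicate SK_ID_CURR: {sk_id}")
        # This comparison is also False for NaN, so missing predictions are caught
        if not 0.0 <= target <= 1.0:
            raise ValueError(f"TARGET out of [0, 1]: {target}")
        seen_ids.add(sk_id)
    return len(seen_ids)


n_rows = check_submission(io.StringIO(sample_csv))
print(n_rows)  # 5
```

For the real file, the same function could be called on `open("submission_HyperParameterTuning.csv", newline="")`, and the returned count should match the 48744 rows printed above.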
