<a href="https://colab.research.google.com/github/applejxd/colaboratory/blob/master/LightGbmSample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A sample regression task with LightGBM
- [A regression task on Boston housing prices](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html)
- [Prepare the explanatory variables as a DataFrame](https://scikit-learn.org/stable/datasets/index.html#boston-dataset)

In [1]:
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston

data = load_boston()
train_x: pd.DataFrame = pd.DataFrame(data.data, columns=data.feature_names)
train_x.head()

      CRIM    ZN  INDUS  CHAS    NOX  ...  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  ...  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  ...  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  ...  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  ...  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  ...  3.0  222.0     18.7  396.90   5.33

[5 rows x 13 columns]

Prepare the target variable as a DataFrame as well.

In [2]:
train_y: pd.DataFrame = pd.DataFrame(data.target, columns=["MEDV"])
train_y.head()

   MEDV
0  24.0
1  21.6
2  34.7
3  33.4
4  36.2

Define a function that trains a model and returns it.

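In [3]:
# !pip install optuna
# import optuna.integration.lightgbm as lgb
import lightgbm as lgb

def Fit(tr_x, tr_y, va_x, va_y):
    """Train a LightGBM regressor on (tr_x, tr_y) and return the booster."""
    lgb_train = lgb.Dataset(tr_x, tr_y)
    lgb_eval = lgb.Dataset(va_x, va_y)
    # Optimize the L2 regression objective and report MAE ('l1') as the metric.
    params = {'objective': 'regression', 'metric': 'l1', 'seed': 71}
    # Note: verbose_eval was removed in LightGBM 4.0; on 4.x pass
    # callbacks=[lgb.log_evaluation(period=0)] instead.
    booster = lgb.train(params, lgb_train,
                        valid_names=['train', 'valid'],
                        valid_sets=[lgb_train, lgb_eval],
                        verbose_eval=False)
    return booster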
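
The parameters in `Fit` train with the `regression` objective (L2 loss) but report `l1` (mean absolute error), and the two losses favor different predictions. The following is a minimal sketch of that difference using NumPy only. The target values and the search grid are hypothetical and chosen purely for illustration; nothing here calls LightGBM.

```python
import numpy as np

# Hypothetical targets with one outlier (not taken from the dataset).
y = np.array([20.0, 21.0, 22.0, 23.0, 50.0])

# Candidate constant predictions on a fine grid (step 0.01).
candidates = np.linspace(15.0, 55.0, 4001)

# Loss of each constant prediction against every target.
diff = y[None, :] - candidates[:, None]
l2_loss = (diff ** 2).mean(axis=1)    # what the 'regression' objective minimizes
l1_loss = np.abs(diff).mean(axis=1)   # what the 'l1' metric reports

best_l2 = candidates[l2_loss.argmin()]
best_l1 = candidates[l1_loss.argmin()]

print(f"L2-optimal constant: {best_l2:.2f} (mean of y = {y.mean():.2f})")
print(f"L1-optimal constant: {best_l1:.2f} (median of y = {np.median(y):.2f})")
```

The L2-optimal constant tracks the mean and is pulled toward the outlier, while the L1-optimal constant tracks the median. A model trained on L2 and scored on MAE is therefore not optimizing exactly the number it reports.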

Set up cross-validation.

In [4]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=4, shuffle=True, random_state=71)
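
To make the behavior of `KFold` concrete, here is a small sketch on a toy array. The eight dummy samples are an assumption for illustration and stand in for the 506 rows of the real dataset; the `KFold` settings match the cell above.

```python
import numpy as np
from sklearn.model_selection import KFold

# Eight dummy samples stand in for the real feature matrix.
X = np.arange(8).reshape(-1, 1)

kf = KFold(n_splits=4, shuffle=True, random_state=71)

folds = []
for fold, (tr_idx, va_idx) in enumerate(kf.split(X)):
    # Each iteration yields positional indices, suitable for DataFrame.iloc.
    folds.append((tr_idx, va_idx))
    print(f"fold {fold}: train={tr_idx.tolist()} valid={va_idx.tolist()}")

# Every sample lands in exactly one validation fold.
all_valid = np.sort(np.concatenate([va for _, va in folds]))
```

Because `split` returns positional indices rather than labels, they pair naturally with `.iloc` on a DataFrame.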

Run the training.

In [5]:
from IPython.display import display
from sklearn.metrics import mean_absolute_error

scores = []
for tr_idx, va_idx in kf.split(train_x):
    # Split into training and validation data
    tr_x, va_x = train_x.iloc[tr_idx], train_x.iloc[va_idx]
    tr_y, va_y = train_y.iloc[tr_idx], train_y.iloc[va_idx]

    # Train
    model = Fit(tr_x, tr_y, va_x, va_y)

    # Feature importance (split counts by default)
    importance = pd.DataFrame(model.feature_importance(),
                              index=tr_x.columns.values,
                              columns=["importance"])
    display(importance)

    # Validate
    va_pred = model.predict(va_x)
    score = mean_absolute_error(va_y, va_pred)
    scores.append(score)

       importance
CRIM          145
ZN              5
INDUS          55
CHAS            6
NOX           128
RM            205
AGE           149
DIS           222
RAD            28
TAX            67


       importance
CRIM          133
ZN              6
INDUS          50
CHAS            2
NOX            79
RM            184
AGE           164
DIS           226
RAD            50
TAX            66


       importance
CRIM          145
ZN             12
INDUS          46
CHAS           15
NOX           104
RM            192
AGE           166
DIS           212
RAD            35
TAX            68


       importance
CRIM          152
ZN              8
INDUS          53
CHAS           16
NOX           142
RM            230
AGE           125
DIS           161
RAD            38
TAX            67
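
The four per-fold tables are easier to compare once they are combined. The sketch below is one possible way to do that with pandas, not part of the original notebook. It uses the values for three features copied from the tables above; the names `per_fold`, `table`, and `summary` are illustrative.

```python
import pandas as pd

# Split-count importances copied from the four per-fold tables above
# (only three features, to keep the example short).
per_fold = [
    pd.Series({"CRIM": 145, "RM": 205, "DIS": 222}),
    pd.Series({"CRIM": 133, "RM": 184, "DIS": 226}),
    pd.Series({"CRIM": 145, "RM": 192, "DIS": 212}),
    pd.Series({"CRIM": 152, "RM": 230, "DIS": 161}),
]

# One column per fold, one row per feature.
table = pd.concat(per_fold, axis=1, keys=[f"fold{i}" for i in range(4)])

# Mean and spread across folds, most important feature first.
summary = pd.DataFrame({"mean": table.mean(axis=1), "std": table.std(axis=1)})
summary = summary.sort_values("mean", ascending=False)
print(summary)
```

In the training loop, the same list could be filled with `pd.Series(model.feature_importance(), index=tr_x.columns)` on each fold instead of hard-coded values.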


Evaluation

In [6]:
import numpy as np
print(f'MAE: {np.mean(scores):.4f}')

MAE: 2.2818
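
As a check on what the reported number means, this sketch computes MAE by hand and compares it with `mean_absolute_error`. The three targets are the first MEDV values shown earlier; the predictions and the per-fold scores are made up for illustration and are not results from the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First three MEDV targets from the dataset; the predictions are hypothetical.
y_true = np.array([24.0, 21.6, 34.7])
y_pred = np.array([25.0, 20.6, 32.7])

# MAE is the mean of the absolute errors.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(f"manual: {mae_manual:.4f}, sklearn: {mae_sklearn:.4f}")

# Hypothetical per-fold scores: reporting the spread next to the mean
# shows how much the estimate varies between folds.
fold_scores = [2.1, 2.4, 2.2, 2.5]
print(f"MAE: {np.mean(fold_scores):.4f} +/- {np.std(fold_scores):.4f}")
```

With the real `scores` list from the loop, the same final line would report the fold-to-fold variation alongside the mean.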
