<a href="https://colab.research.google.com/github/applejxd/colaboratory/blob/master/LightGbmSample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- [Regression task on Boston housing prices](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html)
- [Prepare the explanatory variables as a DataFrame](https://scikit-learn.org/stable/datasets/index.html#boston-dataset)

In [1]:
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston

data = load_boston()
# Explanatory variables: one row per sample, one column per feature
train_x: pd.DataFrame = pd.DataFrame(data.data, columns=data.feature_names)
train_x.head()

      CRIM    ZN  INDUS  CHAS    NOX  ...  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  ...  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  ...  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  ...  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  ...  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  ...  3.0  222.0     18.7  396.90   5.33

[5 rows x 13 columns]

Prepare the target variable as a DataFrame as well

In [2]:
# Target variable: median home value (MEDV)
train_y: pd.DataFrame = pd.DataFrame(data.target, columns=["MEDV"])
train_y.head()

   MEDV
0  24.0
1  21.6
2  34.7
3  33.4
4  36.2

- LightGBM, a gradient boosting decision tree library
- Prepare the training data for LightGBM
- Train the model

In [3]:
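# !pip install optuna
# import optuna.integration.lightgbm as lgb
import lightgbm as lgb

def Fit(tr_x, tr_y, va_x, va_y):
    """Train a LightGBM regressor and report MAE on the training and validation sets."""
    lgb_train = lgb.Dataset(tr_x, tr_y)
    # Bin the validation data with the same feature histograms as the training data
    lgb_eval = lgb.Dataset(va_x, va_y, reference=lgb_train)
    # L2 regression objective, evaluated with L1 (mean absolute error)
    params = {'objective': 'regression', 'metric': 'l1',
              'seed': 71, 'verbose': 0}
    booster = lgb.train(params, lgb_train,
                        valid_names=['train', 'valid'],
                        valid_sets=[lgb_train, lgb_eval])
    return booster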
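
To illustrate what "gradient boosting decision trees" means, here is a minimal sketch that boosts one-split trees (stumps) on the residuals of a squared-error objective and tracks the training MAE after each round, similar in spirit to the `l1` values that LightGBM logs. This is not LightGBM's algorithm (which grows histogram-based, leaf-wise trees); the synthetic data and the helper names `fit_stump`, `predict_stump`, `boost`, and `predict` are assumptions made for this illustration only.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (feature, threshold, left_value, right_value) minimizing squared error."""
    best = None
    for j in range(x.shape[1]):
        # Skip the largest value so the right-hand side is never empty
        for t in np.unique(x[:, j])[:-1]:
            left = x[:, j] <= t
            l_val, r_val = residual[left].mean(), residual[~left].mean()
            sse = ((residual[left] - l_val) ** 2).sum() \
                + ((residual[~left] - r_val) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t, l_val, r_val)
    return best[1:]

def predict_stump(stump, x):
    j, t, l_val, r_val = stump
    return np.where(x[:, j] <= t, l_val, r_val)

def boost(x, y, n_rounds=20, learning_rate=0.1):
    base = y.mean()                      # initial constant prediction
    pred = np.full(len(y), base)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual
        stump = fit_stump(x, y - pred)
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        history.append(np.abs(y - pred).mean())  # training MAE after this round
    return base, stumps, history

def predict(base, stumps, x, learning_rate=0.1):
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Synthetic regression data (hypothetical, not the Boston dataset)
rng = np.random.default_rng(71)
x = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * x[:, 0] + np.sin(x[:, 1]) + rng.normal(0, 0.5, size=200)

base, stumps, history = boost(x, y)
print(f"MAE after round 1: {history[0]:.3f}, after round 20: {history[-1]:.3f}")
```

Each round adds a small correction scaled by the learning rate, which is why the logged error shrinks gradually rather than in one step.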

Cross-validation

In [4]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

scores = []  # validation MAE of each fold

kf = KFold(n_splits=4, shuffle=True, random_state=71)
for tr_idx, va_idx in kf.split(train_x):
    # Split the rows into this fold's training and validation parts
    tr_x, va_x = train_x.iloc[tr_idx], train_x.iloc[va_idx]
    tr_y, va_y = train_y.iloc[tr_idx], train_y.iloc[va_idx]

    model = Fit(tr_x, tr_y, va_x, va_y)
    va_pred = model.predict(va_x)
    score = mean_absolute_error(va_y, va_pred)
    scores.append(score)

[1]	train's l1: 6.13826	valid's l1: 5.91354
[2]	train's l1: 5.6415	valid's l1: 5.4627
[3]	train's l1: 5.19888	valid's l1: 5.08207
[4]	train's l1: 4.77993	valid's l1: 4.72924
[5]	train's l1: 4.41776	valid's l1: 4.41847
[6]	train's l1: 4.09875	valid's l1: 4.16244
[7]	train's l1: 3.81636	valid's l1: 3.94985
[8]	train's l1: 3.5714	valid's l1: 3.73869
[9]	train's l1: 3.35946	valid's l1: 3.5482
[10]	train's l1: 3.1667	valid's l1: 3.41049
[11]	train's l1: 2.98996	valid's l1: 3.26836
[12]	train's l1: 2.83606	valid's l1: 3.17448
[13]	train's l1: 2.69567	valid's l1: 3.0942
[14]	train's l1: 2.55712	valid's l1: 2.99249
[15]	train's l1: 2.44295	valid's l1: 2.93069
[16]	train's l1: 2.34523	valid's l1: 2.86741
[17]	train's l1: 2.25535	valid's l1: 2.81889
[18]	train's l1: 2.17257	valid's l1: 2.7607
[19]	train's l1: 2.10596	valid's l1: 2.70483
[20]	train's l1: 2.03825	valid's l1: 2.67443
[21]	train's l1: 1.98793	valid's l1: 2.63265
[22]	train's l1: 1.94837	valid's l1: 2.60926
[23]	train's l1: 1.91034	v
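
As a rough sketch of what the K-fold split in the loop above does, the following reimplements the index partitioning with NumPy only. The helper name `kfold_indices` is hypothetical, and the shuffle will not reproduce the exact folds that scikit-learn's `KFold` generates for the same seed; it only shows the idea that every sample lands in the validation set exactly once.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # array_split tolerates sizes that do not divide evenly
    folds = np.array_split(shuffled, n_splits)
    for i in range(n_splits):
        valid_idx = np.sort(folds[i])
        train_idx = np.sort(
            np.concatenate([folds[j] for j in range(n_splits) if j != i])
        )
        yield train_idx, valid_idx

# Same sizes as in the notebook: 506 rows, 4 folds
splits = list(kfold_indices(n_samples=506, n_splits=4, seed=71))
fold_sizes = [len(va_idx) for _, va_idx in splits]
print(fold_sizes)  # [127, 127, 126, 126]
```

With 506 rows and 4 folds, each model is trained on roughly three quarters of the data and scored on the remaining quarter.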

Evaluation

In [5]:
import numpy as np
print(np.mean(scores))

2.2817945860126856
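
The score printed above is the mean of the per-fold mean absolute errors. As a small sketch of that calculation, the helper below (`mean_absolute_error_manual`, a hypothetical name) computes MAE by hand. It flattens both inputs first, because subtracting a prediction of shape `(n,)` from a single-column target of shape `(n, 1)` would otherwise broadcast to an `(n, n)` matrix. The `fold_scores` values are made up for illustration and are not the notebook's actual fold scores.

```python
import numpy as np

def mean_absolute_error_manual(y_true, y_pred):
    """MAE = mean(|y_true - y_pred|), computed on flattened arrays."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean(np.abs(y_true - y_pred)))

mae = mean_absolute_error_manual([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(mae)  # 0.5

# Hypothetical per-fold scores; reporting the spread alongside the mean
# shows how stable the estimate is across folds.
fold_scores = [2.0, 2.5, 2.1, 2.4]
print(np.mean(fold_scores), np.std(fold_scores))
```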
