## LightGBM Usage Cheat Sheet
### by Han Xiaoyang (寒小阳), 《网易云课程 x 稀牛学院 机器学习工程师微专业》 (Machine Learning Engineer Micro-Specialization)

#### 1. Load data and build a model with specified parameters

# coding: utf-8
import lightgbm as lgb
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Load the dataset
print('Load data...')

df_train = load_boston()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df_train.data, df_train.target, test_size=0.25, random_state=42
)

# Preprocessing: standardize features and target (scalers are fit on the training split only)
ss_X = StandardScaler()
ss_y = StandardScaler()
X_train = ss_X.fit_transform(X_train)
X_test = ss_X.transform(X_test)
y_train = ss_y.fit_transform(y_train.reshape(-1, 1))
y_test = ss_y.transform(y_test.reshape(-1, 1))

# Wrap the arrays in LightGBM Dataset objects
lgb_train = lgb.Dataset(X_train, y_train.ravel())
lgb_eval = lgb.Dataset(X_test, y_test.ravel(), reference=lgb_train)

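# Fix one set of parameters
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    # 'l2' is the relevant metric here; 'auc' is a classification metric and is
    # not meaningful for a continuous target
    'metric': {'l2', 'auc'},
    'num_leaves': 10,
    # alias of num_iterations: when set in params it takes precedence over the
    # num_boost_round argument of lgb.train
    'n_estimators': 40,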
    'learning_rate': 0.1,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

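print('Start training...')
# Train the model. The early_stopping_rounds keyword below was removed from
# lgb.train in LightGBM 4.0; there, use callbacks=[lgb.early_stopping(10)]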
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                early_stopping_rounds=10)

# Save the model to a file
print('Saving model...')
gbm.save_model('model.txt')

print('Start predicting...')
# Predict with the best iteration found by early stopping
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# Evaluate
print('RMSE of the predictions:')
print(mean_squared_error(y_test, y_pred) ** 0.5)

Load data...
Start training...
[1]	valid_0's l2: 0.698317	valid_0's auc: 0.895122
Training until validation scores don't improve for 10 rounds.
[2]	valid_0's l2: 0.607243	valid_0's auc: 0.899458
[3]	valid_0's l2: 0.531016	valid_0's auc: 0.902439
[4]	valid_0's l2: 0.469681	valid_0's auc: 0.902981
[5]	valid_0's l2: 0.419501	valid_0's auc: 0.905285
[6]	valid_0's l2: 0.375309	valid_0's auc: 0.907453
[7]	valid_0's l2: 0.339459	valid_0's auc: 0.904743
[8]	valid_0's l2: 0.315451	valid_0's auc: 0.907859
[9]	valid_0's l2: 0.291734	valid_0's auc: 0.909214
[10]	valid_0's l2: 0.270602	valid_0's auc: 0.908537
[11]	valid_0's l2: 0.253084	valid_0's auc: 0.911382
[12]	valid_0's l2: 0.242599	valid_0's auc: 0.909214
[13]	valid_0's l2: 0.22814	valid_0's auc: 0.911518
[14]	valid_0's l2: 0.215387	valid_0's auc: 0.91748
[15]	valid_0's l2: 0.20552	valid_0's auc: 0.922087
[16]	valid_0's l2: 0.196092	valid_0's auc: 0.926152
[17]	valid_0's l2: 0.1908	valid_0's auc: 0.927236
[18]	valid_0's l2: 0.186486	valid_0's auc: 0.92
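
Because the target was standardized with `StandardScaler`, the RMSE printed by the cell above is expressed in standard deviations of the training target, not in the target's original units. A minimal sketch of the conversion follows, using only the standard library. The values in `y_true` and `y_pred` are made up for illustration and are not taken from the run above.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical targets and predictions in the original units.
y_true = [24.0, 21.6, 34.7, 33.4, 36.2]
y_pred = [25.0, 20.0, 33.0, 35.0, 35.0]

# StandardScaler computes (y - mean) / std with the population std (ddof=0).
mean = statistics.fmean(y_true)
std = statistics.pstdev(y_true)


def scale(values):
    """Apply the same standardization to any sequence of target values."""
    return [(v - mean) / std for v in values]


rmse_scaled = rmse(scale(y_true), scale(y_pred))
rmse_original = rmse(y_true, y_pred)

# Standardization is linear, so the two errors differ only by the factor std.
print(rmse_scaled * std, rmse_original)
```

With the scalers from the cell above, the same conversion would be the printed RMSE multiplied by `ss_y.scale_[0]`, or equivalently the error computed after applying `ss_y.inverse_transform` to `y_test` and `y_pred.reshape(-1, 1)`.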

### Using LightGBM together with sklearn
#### 1. Model with LightGBM, evaluate with sklearn

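# coding: utf-8
import lightgbm as lgb
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler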

# Load the data
print('Loading data...')
df_train = load_boston()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df_train.data, df_train.target, test_size=0.25, random_state=42
)

# Preprocessing: standardize features and target (scalers are fit on the training split only)
ss_X = StandardScaler()
ss_y = StandardScaler()
X_train = ss_X.fit_transform(X_train)
X_test = ss_X.transform(X_test)
y_train = ss_y.fit_transform(y_train.reshape(-1, 1))
y_test = ss_y.transform(y_test.reshape(-1, 1))

print('Start training...')
# Instantiate LGBMRegressor directly
# LightGBM's regressor follows essentially the same API as other sklearn regressors
gbm = lgb.LGBMRegressor(objective='regression',
                        num_leaves=10,
                        learning_rate=0.1,
                        n_estimators=40)

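# Fit the model. The early_stopping_rounds keyword below was removed from fit()
# in LightGBM 4.0; there, use callbacks=[lgb.early_stopping(5)]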
gbm.fit(X_train, y_train.ravel(),
        eval_set=[(X_test, y_test.ravel())],
        eval_metric='l1',
        early_stopping_rounds=5)

# Predict
print('Start predicting...')
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
# Evaluate the predictions
print('RMSE of the predictions:')
print(mean_squared_error(y_test, y_pred) ** 0.5)

Loading data...
Start training...
[1]	valid_0's l1: 0.611117
Training until validation scores don't improve for 5 rounds.
[2]	valid_0's l1: 0.565039
[3]	valid_0's l1: 0.524798
[4]	valid_0's l1: 0.489442
[5]	valid_0's l1: 0.457042
[6]	valid_0's l1: 0.431108
[7]	valid_0's l1: 0.411458
[8]	valid_0's l1: 0.393641
[9]	valid_0's l1: 0.378178
[10]	valid_0's l1: 0.365914
[11]	valid_0's l1: 0.3535
[12]	valid_0's l1: 0.343257
[13]	valid_0's l1: 0.33072
[14]	valid_0's l1: 0.317305
[15]	valid_0's l1: 0.305088
[16]	valid_0's l1: 0.293354
[17]	valid_0's l1: 0.283972
[18]	valid_0's l1: 0.276938
[19]	valid_0's l1: 0.271041
[20]	valid_0's l1: 0.264199
[21]	valid_0's l1: 0.257531
[22]	valid_0's l1: 0.254052
[23]	valid_0's l1: 0.250127
[24]	valid_0's l1: 0.247797
[25]	valid_0's l1: 0.247851
[26]	valid_0's l1: 0.245651
[27]	valid_0's l1: 0.244956
[28]	valid_0's l1: 0.244051
[29]	valid_0's l1: 0.242614
[30]	valid_0's l1: 0.241154
[31]	valid_0's l1: 0.238583
[32]	valid_0's l1: 0.237489
[33]	valid_0's l1: 0.235866
[34]	valid_

#### 2. Grid search for the best hyperparameters

# Use scikit-learn's cross-validated grid search to select the best hyperparameters
estimator = lgb.LGBMRegressor(num_leaves=31)

param_grid = {
    'n_estimators': [20, 40],
    'learning_rate': [0.01, 0.1, 1],
    'num_leaves': [10, 100],
}

gbm = GridSearchCV(estimator, param_grid)

# y_train is a column vector after StandardScaler; flatten it to 1-D for fit()
gbm.fit(X_train, y_train.ravel())

print('Best hyperparameters found by grid search:')
print(gbm.best_params_)

Best hyperparameters found by grid search:
{'n_estimators': 40, 'learning_rate': 0.1, 'num_leaves': 10}
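
To make the cost of the search concrete, the grid above can be enumerated by hand. This sketch uses only the standard library. The fold count `cv = 5` is an assumption: it matches the default in recent scikit-learn releases (older releases defaulted to 3 folds), and the original call does not set `cv`.

```python
from itertools import product

param_grid = {
    'n_estimators': [20, 40],
    'learning_rate': [0.01, 0.1, 1],
    'num_leaves': [10, 100],
}

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv = 5  # assumed number of cross-validation folds
n_fits = len(candidates) * cv + 1  # +1 for the final refit on all training data

print(len(candidates))  # 12
print(n_fits)           # 61
```

The grid grows multiplicatively with every added parameter, so each new list of values multiplies the number of models to train.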
