## Main LightGBM tuning parameters and their meanings

1. Other parameters
    * boosting
    * n_jobs/num_threads===>xgboost(nthread)
    * objective===>xgboost(objective)
    * num_class===>xgboost(num_class)
    * verbosity/verbose===>xgboost(verbosity)
    * metric===>xgboost(eval_metric)
    * (the entries below are arguments of `lgb.train` / `xgboost.train`, not keys of `params`)
    * num_boost_round===>xgboost(num_boost_round)
    * train_set===>xgboost(dtrain)
    * valid_sets===>xgboost(evals)
    * feval===>xgboost(feval)

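2. Tree-tuning parameters
    * max_depth===>xgboost(max_depth)
    * min_sum_hessian_in_leaf/min_child_weight===>xgboost(min_child_weight)
    * min_data_in_leaf/min_child_samples (no direct XGBoost equivalent; XGBoost's gamma corresponds to LightGBM's min_gain_to_split/min_split_gain)
    * num_leaves/max_leaf===>xgboost(max_leaves)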

3. Parameters to prevent overfitting
    * bagging_fraction/subsample===>xgboost(subsample)
    * bagging_freq/subsample_freq (no direct XGBoost equivalent; XGBoost resamples rows at every boosting iteration whenever subsample < 1)
    * learning_rate===>xgboost(learning_rate)
    * feature_fraction/sub_feature/colsample_bytree===>xgboost(colsample_bytree)
    * lambda_l1/reg_alpha===>xgboost(reg_alpha)
    * lambda_l2/reg_lambda===>xgboost(reg_lambda)
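
The name pairs in the lists above can be put into a small lookup table. The sketch below is only an illustration: `LGB_TO_XGB` and `rename_params` are hypothetical names, not part of either library, and the table covers only pairs listed above. It renames keys only. Values are not translated (for example, LightGBM's `"multiclass"` objective is spelled `"multi:softprob"` in XGBoost).

```python
# Hypothetical lookup table: LightGBM parameter name -> XGBoost parameter name.
# It covers only pairs from the lists above and is not an official mapping.
LGB_TO_XGB = {
    "num_threads": "nthread",
    "objective": "objective",
    "num_class": "num_class",
    "verbosity": "verbosity",
    "metric": "eval_metric",
    "max_depth": "max_depth",
    "min_sum_hessian_in_leaf": "min_child_weight",
    "bagging_fraction": "subsample",
    "learning_rate": "learning_rate",
    "feature_fraction": "colsample_bytree",
    "lambda_l1": "reg_alpha",
    "lambda_l2": "reg_lambda",
}


def rename_params(lgb_params):
    """Rename known keys; collect keys that have no entry in the table."""
    renamed, unmapped = {}, []
    for name, value in lgb_params.items():
        if name in LGB_TO_XGB:
            renamed[LGB_TO_XGB[name]] = value  # the value is copied unchanged
        else:
            unmapped.append(name)
    return renamed, unmapped


renamed, unmapped = rename_params(
    {"learning_rate": 0.2, "lambda_l1": 0.2, "boosting": "gbdt"}
)
print(renamed)
print(unmapped)
```

Keys that end up in `unmapped` need a manual decision rather than a mechanical rename.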

In [11]:
import lightgbm as lgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

In [12]:
covtype = datasets.fetch_covtype()  # load the dataset once and reuse it for X and y
X = covtype.data[:3000]
y = covtype.target[:3000]
X_train, X_test, y_train, y_test = train_test_split(X, y)

print(X_train.shape)
print(y_train.shape)
print(np.unique(y_train))  # 7-class classification task

(2250, 54)
(2250,)
[1 2 3 4 5 6 7]


In [13]:
enc = OrdinalEncoder()
y_train_enc = enc.fit_transform(y_train.reshape(-1, 1))
y_test_enc = enc.transform(y_test.reshape(-1, 1))
print(np.unique(y_train_enc))

[0. 1. 2. 3. 4. 5. 6.]
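
The encoder above returns floating-point labels with shape `(n, 1)`. As a possible alternative, the sketch below produces 0-based integer labels with plain NumPy. It uses toy labels rather than the covtype data, and `classes`, `y_enc` and `y_back` are illustrative names.

```python
import numpy as np

# Toy labels standing in for y_train; note that 3, 4 and 6 never occur.
y = np.array([1, 2, 7, 5, 2, 1])

# classes: sorted unique labels; y_enc: index of each label within classes.
classes, y_enc = np.unique(y, return_inverse=True)
print(classes)
print(y_enc)

# Indexing classes with the encoded labels restores the original values.
y_back = classes[y_enc]
print(y_back)
```

Because the indices are contiguous over the labels actually present, the number of classes equals `len(classes)`, which is the value `num_class` would have to match.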


In [14]:
train_dataset = lgb.Dataset(data=X_train, label=y_train_enc)

In [15]:
objective = ["regression",  # regression: L2 loss
             "regression_l1",  # regression: L1 loss
             "binary",  # binary classification: binary log loss (num_class must stay at its default of 1)
             'multiclass',  # multi-class classification; alias: softmax
             'cross_entropy']  # cross-entropy loss

params = {"objective": "multiclass",
          # for the best speed, set this to the number of real CPU cores, not the number of threads
          "num_threads": 8,
          "num_class": 7}

# default: objective=regression
# default: num_class=1 (used only in multi-class classification application)
model = lgb.train(params=params, train_set=train_dataset)  # for classification, y labels must start from 0
model.predict(X_test).shape

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1873
[LightGBM] [Info] Number of data points in the train set: 2250, number of used features: 34
[LightGBM] [Info] Start training from score -1.744876
[LightGBM] [Info] Start training from score -1.241713
[LightGBM] [Info] Start training from score -2.455995
[LightGBM] [Info] Start training from score -3.123566
[LightGBM] [Info] Start training from score -1.353935
[LightGBM] [Info] Start training from score -2.298150
[LightGBM] [Info] Start training from score -3.036554






(750, 7)
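
With the `multiclass` objective, `predict` returns one column per class, which is why the shape above is `(750, 7)`. The sketch below shows one way to turn such a matrix into class labels. The `proba` array is a toy stand-in for the model output, and `classes` is an assumed array of original labels in encoder order.

```python
import numpy as np

# Toy stand-in for model.predict(X_test): 3 samples, 4 classes, rows sum to 1.
proba = np.array([[0.1, 0.6, 0.2, 0.1],
                  [0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.2, 0.1, 0.5]])
classes = np.array([1, 2, 3, 4])  # original labels, in encoder order (assumed)

pred_idx = proba.argmax(axis=1)   # 0-based index of the most probable class
pred_label = classes[pred_idx]    # map indices back to the original labels
print(pred_idx, pred_label)
```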

In [16]:
boosting = ['gbdt', 'rf', 'dart', 'goss']

for i in boosting:  # default boosting='gbdt'
    if i == 'goss':
        # Note: bagging cannot be used with GOSS
        model = lgb.train(params={"objective": "multiclass",
                                  "num_class": 7,
                                  "boosting": i}, train_set=train_dataset)
    else:
        # Note: boosting='rf' requires bagging to be enabled
        '''
        bagging_freq:frequency for bagging
            0 means disable bagging; k means perform bagging at every k iteration.

        bagging_fraction:Subsample ratio of the training instances
            * 0.0 < bagging_fraction <= 1.0
            * to enable bagging, bagging_freq should be set to a non zero value as well
        '''
        model = lgb.train({"n_jobs": 1, "objective": "multiclass",
                           "num_class": 7,
                           "boosting": i,
                           "subsample_freq": 1,  # default subsample_freq=0
                           "bagging_fraction": 0.9,  # default bagging_fraction=1
                           "bagging_fraction_seed": 1},
                          train_set=train_dataset)
    print(model.predict(X_test).shape)

You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1873
[LightGBM] [Info] Number of data points in the train set: 2250, number of used features: 34
[LightGBM] [Info] Start training from score -1.744876
[LightGBM] [Info] Start training from score -1.241713
[LightGBM] [Info] Start training from score -2.455995
[LightGBM] [Info] Start training from score -3.123566
[LightGBM] [Info] Start training from score -1.353935
[LightGBM] [Info] Start training from score -2.298150
[LightGBM] [Info] Start training from score -3.036554
(750, 7)
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1873
[LightGBM] [Info] Number of data points in the train set: 2250, number of used features: 34
[LightGBM] [Info] Start training from score -1.744876
[LightGBM] [Info] Start training from score -1.241713
[LightGBM] [Info] Start training from score -2.455995
[LightGBM] [Info] Start training from score -3.123566
[LightGBM] [Info] Start train
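
The `bagging_freq` and `bagging_fraction` notes in the cell above describe a sampling schedule: a new row subset is drawn at every k-th iteration and reused in between. The sketch below simulates that schedule in plain Python. It is an illustration under assumptions, not LightGBM's implementation: `bagging_schedule` is a hypothetical helper, and the rounding of the subset size is a guess.

```python
import random


def bagging_schedule(n_rows, n_iters, bagging_fraction, bagging_freq, seed=1):
    """Return the list of row indices used at each iteration (illustrative)."""
    rng = random.Random(seed)
    all_rows = list(range(n_rows))
    current = all_rows  # with bagging disabled, every iteration sees all rows
    used = []
    for it in range(n_iters):
        resample = (bagging_freq > 0 and bagging_fraction < 1.0
                    and it % bagging_freq == 0)
        if resample:
            k = int(n_rows * bagging_fraction)  # subset size: assumed rounding
            current = sorted(rng.sample(all_rows, k))  # without replacement
        used.append(current)
    return used


bags = bagging_schedule(n_rows=10, n_iters=4, bagging_fraction=0.9, bagging_freq=2)
for it, rows in enumerate(bags):
    print(it, rows)
```

With `bagging_freq=2`, iterations 0 and 1 share one subset and iterations 2 and 3 share the next. With `bagging_freq=0`, no sampling happens regardless of `bagging_fraction`.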

In [17]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4}
# controls the level of LightGBM's verbosity (< 0: Fatal, = 0: Error (Warning), = 1: Info, > 1: Debug)
model = lgb.train(params=params, train_set=train_dataset)  # default verbosity=1
model.predict(X_test).shape

(750, 7)

In [18]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "learning_rate": 0.2}  # default learning_rate=0.1
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)
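
The learning rate scales how much each new tree contributes. The toy calculation below assumes an idealised learner that predicts the current residual exactly, so that the only thing slowing convergence is the shrinkage itself. Real trees do not fit the residual exactly, and `residual_after` is a hypothetical helper.

```python
def residual_after(n_rounds, learning_rate, initial_residual=1.0):
    """Residual left after n_rounds when each step removes lr * residual."""
    residual = initial_residual
    for _ in range(n_rounds):
        step = residual                   # idealised learner: exact fit
        residual -= learning_rate * step  # only a fraction of it is applied
    return residual


for lr in (0.1, 0.2):
    print(lr, residual_after(10, lr))
```

Under this assumption the residual after n rounds is `(1 - learning_rate) ** n`, so a larger rate converges faster but each tree has more influence on the final model.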

In [19]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "feature_fraction": 0.5}  # default feature_fraction=1.0
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)
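
With `feature_fraction` below 1.0, each tree is trained on a random subset of the columns. The sketch below only illustrates that idea for the 54 columns of this dataset. `sample_features` is a hypothetical helper, and both the rounding rule and the random generator are assumptions rather than LightGBM's behaviour.

```python
import random


def sample_features(n_features, feature_fraction, seed):
    """Pick the column indices one tree would be allowed to split on."""
    rng = random.Random(seed)
    k = max(1, int(n_features * feature_fraction))  # assumed rounding rule
    return sorted(rng.sample(range(n_features), k))  # without replacement


tree_0 = sample_features(54, 0.5, seed=0)
tree_1 = sample_features(54, 0.5, seed=1)
print(len(tree_0), tree_0[:5])
print(len(tree_1), tree_1[:5])
```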

In [20]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "lambda_l1": 0.2}  # default lambda_l1=0.0
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)

In [21]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "lambda_l2": 0.2}  # default lambda_l2=0.0
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)
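
To see what `lambda_l1` and `lambda_l2` do, it helps to look at the textbook second-order leaf value used in gradient-boosting derivations: with gradient sum G and hessian sum H in a leaf, the value is `-sign(G) * max(|G| - lambda_l1, 0) / (H + lambda_l2)`. The sketch below evaluates that formula. `leaf_value` is a hypothetical helper, and LightGBM applies further details on top of this that are not modelled here.

```python
def leaf_value(sum_grad, sum_hess, lambda_l1=0.0, lambda_l2=0.0):
    """Textbook regularised leaf value (illustrative, not LightGBM's code)."""
    shrunk = max(abs(sum_grad) - lambda_l1, 0.0)  # L1: soft-threshold G
    sign = 1.0 if sum_grad > 0 else -1.0
    return -sign * shrunk / (sum_hess + lambda_l2)  # L2: enlarge denominator


print(leaf_value(-4.0, 8.0))                 # no regularisation
print(leaf_value(-4.0, 8.0, lambda_l1=0.2))  # L1 shrinks the numerator
print(leaf_value(-4.0, 8.0, lambda_l2=0.2))  # L2 grows the denominator
print(leaf_value(-0.1, 8.0, lambda_l1=0.2))  # small gradients are zeroed out
```

Both penalties pull leaf values toward zero. L1 can set a leaf to exactly zero, while L2 only scales it down.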

In [22]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "max_depth": 5}  # default max_depth=-1 (<= 0 means no limit)
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)
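
`max_depth` interacts with `num_leaves`: a binary tree of depth d has at most 2^d leaves, so whichever limit is tighter decides the tree size. The sketch below computes that cap. `effective_leaf_cap` is a hypothetical helper written for this note, not a LightGBM function.

```python
def effective_leaf_cap(num_leaves, max_depth):
    """Upper bound on leaves given both limits (max_depth <= 0: no limit)."""
    if max_depth <= 0:
        return num_leaves
    return min(num_leaves, 2 ** max_depth)  # a depth-d binary tree has <= 2**d leaves


print(effective_leaf_cap(31, 5))   # depth 5 allows 32 leaves, so num_leaves binds
print(effective_leaf_cap(31, 3))   # depth 3 allows only 8 leaves
print(effective_leaf_cap(62, -1))  # no depth limit
```

With `max_depth=5` and `num_leaves=31`, the depth limit barely matters, because 2^5 = 32 is already above 31.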

In [23]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "min_sum_hessian_in_leaf": 5}  # default min_sum_hessian_in_leaf=1e-3
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)

In [24]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "min_data_in_leaf": 25}  # default min_data_in_leaf=20 (constraints: min_data_in_leaf >= 0)
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)
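
`min_data_in_leaf` and `min_sum_hessian_in_leaf` both act as admission tests on a candidate split: a split is rejected if either child would be too small by row count or by hessian sum. The sketch below expresses that rule. `split_allowed` is a hypothetical helper, and the way LightGBM actually evaluates candidates is not reproduced here.

```python
def split_allowed(left, right, min_data_in_leaf=20, min_sum_hessian_in_leaf=1e-3):
    """left and right are (row_count, hessian_sum) pairs for the two children."""
    return all(
        count >= min_data_in_leaf and hess >= min_sum_hessian_in_leaf
        for count, hess in (left, right)
    )


# Thresholds taken from the two cells above: 25 rows and a hessian sum of 5.
print(split_allowed((30, 6.0), (25, 5.5), 25, 5))  # both children pass
print(split_allowed((30, 6.0), (24, 5.5), 25, 5))  # right child has too few rows
print(split_allowed((30, 6.0), (25, 4.0), 25, 5))  # right child hessian sum too low
```

Raising either threshold produces fewer, larger leaves, which is why both parameters also help against overfitting.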

In [25]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "num_leaves": 62}  # default num_leaves=31
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)
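
The next cell reports `multi_error`, the error rate for multi-class classification. As a rough cross-check, the sketch below computes a top-1 error rate from a probability matrix with NumPy. `multi_error` here is a local helper, not LightGBM's function, and tie handling may differ from the library's.

```python
import numpy as np


def multi_error(y_true, proba):
    """Fraction of rows whose most probable class differs from the label."""
    pred = np.asarray(proba).argmax(axis=1)
    # ravel() guards against (n, 1)-shaped labels broadcasting to (n, n).
    return float(np.mean(pred != np.asarray(y_true).ravel()))


proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.2, 0.6]])
y_true = np.array([[0], [1], [2], [2]])  # column vector, like an encoder output
print(multi_error(y_true, proba))
```

Only the third row is misclassified (predicted class 1, true class 2), so one of four rows counts as an error.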

In [26]:
# Evaluation metrics for validation data
# default metric="" (empty string or not specified) means that metric corresponding to specified objective will be used
'''
l1:absolute loss
l2:square loss
rmse:root mean squared error
cross_entropy
multi_error:error rate for multi-class classification
multi_logloss:log loss for multi-class classification
binary_logloss
auc
'''
val_dataset = lgb.Dataset(data=X_test, label=y_test_enc)
eval_set = [train_dataset, val_dataset]

params = {"objective": "multiclass",
          "num_class": 7,
          "metric": "multi_error"}
model = lgb.train(params=params,
                  train_set=train_dataset,
                  valid_sets=eval_set)

You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1873
[LightGBM] [Info] Number of data points in the train set: 2250, number of used features: 34
[LightGBM] [Info] Start training from score -1.744876
[LightGBM] [Info] Start training from score -1.241713
[LightGBM] [Info] Start training from score -2.455995
[LightGBM] [Info] Start training from score -3.123566
[LightGBM] [Info] Start training from score -1.353935
[LightGBM] [Info] Start training from score -2.298150
[LightGBM] [Info] Start training from score -3.036554
[1]	training's multi_error: 0.411111	valid_1's multi_error: 0.412
[2]	training's multi_error: 0.225778	valid_1's multi_error: 0.286667
[3]	training's multi_error: 0.153778	valid_1's multi_error: 0.234667
[4]	training's multi_error: 0.133333	valid_1's multi_error: 0.222667
[5]	training's multi_error: 0.119556	valid_1's multi_error: 0.210667
[6]	training's multi_error: 0.113333	valid_1's multi_error: 0.208
[7]	training's multi_error: 0.