## Main LightGBM tuning parameters and their meanings

1. General parameters
    * boosting
    * n_jobs/num_threads===>xgboost(nthread)
    * objective===>xgboost(objective)
    * num_class===>xgboost(num_class)
    * verbosity/verbose===>xgboost(verbosity)
    * metric===>xgboost(eval_metric)
    * ***** the entries below are arguments of `lgb.train`, not keys of `params` *****
    * num_boost_round===>xgboost(num_boost_round)
    * train_set===>xgboost(dtrain)
    * early_stopping_rounds===>xgboost(early_stopping_rounds)
    * valid_sets===>xgboost(evals)
    * evals_result===>xgboost(evals_result)
    * feval===>xgboost(feval)
    * verbose_eval===>xgboost(verbose_eval)

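2. Tree parameters
    * max_depth===>xgboost(max_depth)
    * min_sum_hessian_in_leaf/min_child_weight===>xgboost(min_child_weight)
    * min_data_in_leaf/min_child_samples (no direct xgboost equivalent; xgboost's gamma corresponds to LightGBM's min_gain_to_split/min_split_gain)
    * num_leaves/max_leaf===>xgboost(max_leaves)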

3. Parameters for preventing overfitting
    * bagging_fraction/subsample===>xgboost(subsample)
    * bagging_freq/subsample_freq (no direct xgboost equivalent; xgboost's colsample_bylevel is per-level column subsampling, not a bagging frequency)
    * learning_rate===>xgboost(learning_rate)
    * feature_fraction/sub_feature/colsample_bytree===>xgboost(colsample_bytree)
    * lambda_l1/reg_alpha===>xgboost(reg_alpha)
    * lambda_l2/reg_lambda===>xgboost(reg_lambda)
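
Because LightGBM accepts several aliases for the same parameter, it can help to normalize a `params` dict to canonical names before comparing configurations. The sketch below covers only alias pairs listed above; `CANONICAL` and `canonicalize` are hypothetical helpers for illustration, not part of LightGBM.

```python
# Alias -> canonical LightGBM parameter name (only pairs from the list above).
CANONICAL = {
    "n_jobs": "num_threads",
    "verbose": "verbosity",
    "min_child_weight": "min_sum_hessian_in_leaf",
    "min_child_samples": "min_data_in_leaf",
    "max_leaf": "num_leaves",
    "subsample": "bagging_fraction",
    "subsample_freq": "bagging_freq",
    "sub_feature": "feature_fraction",
    "colsample_bytree": "feature_fraction",
    "reg_alpha": "lambda_l1",
    "reg_lambda": "lambda_l2",
}


def canonicalize(params):
    """Return a copy of params with known aliases replaced by canonical names."""
    result = {}
    for key, value in params.items():
        name = CANONICAL.get(key, key)  # unknown keys pass through unchanged
        if name in result:
            raise ValueError(f"parameter {name!r} given more than once")
        result[name] = value
    return result


print(canonicalize({"n_jobs": 1, "subsample": 0.9, "subsample_freq": 1}))
# {'num_threads': 1, 'bagging_fraction': 0.9, 'bagging_freq': 1}
```

Raising on duplicates is a design choice here: passing both an alias and its canonical name is usually a mistake worth surfacing.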

In [1]:
import lightgbm as lgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

In [2]:
covtype = datasets.fetch_covtype()  # load once instead of fetching twice
X = covtype.data[:3000]
y = covtype.target[:3000]
X_train, X_test, y_train, y_test = train_test_split(X, y)

print(X_train.shape)
print(y_train.shape)
print(np.unique(y_train))  # 7-class classification task

(2250, 54)
(2250,)
[1 2 3 4 5 6 7]


In [3]:
enc = OrdinalEncoder()
y_train_enc = enc.fit_transform(y_train.reshape(-1, 1))
y_test_enc = enc.transform(y_test.reshape(-1, 1))
print(np.unique(y_train_enc))

[0. 1. 2. 3. 4. 5. 6.]
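
The same zero-based encoding can be sketched with NumPy alone, which also yields integer labels rather than floats. The toy `labels` array is an assumption standing in for `y_train`.

```python
import numpy as np

labels = np.array([5, 2, 7, 2, 1, 5])  # toy stand-in for y_train (values in 1..7)

# classes: sorted unique labels; encoded: index of each label within classes
classes, encoded = np.unique(labels, return_inverse=True)

print(classes)           # [1 2 5 7]
print(encoded)           # [2 1 3 1 0 2]
print(classes[encoded])  # recovers the original labels
```

Note that the codes are contiguous only over the classes actually present, so `num_class` should match `len(classes)`.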


In [4]:
train_dataset = lgb.Dataset(data=X_train, label=y_train_enc)

In [5]:
objective = ["regression",  # regression: L2 loss
             "regression_l1",  # regression: L1 loss
             "binary",  # binary classification: binary log loss
             'multiclass',  # multi-class classification; alias: softmax
             'cross_entropy']  # cross-entropy loss

params = {"objective": "multiclass",
          "num_class": 7}

# default: objective=regression
# default: num_class=1 (used only in multi-class classification)
model = lgb.train(params=params, train_set=train_dataset)  # class labels must be integers starting from 0
model.predict(X_test).shape
model.predict(X_test).shape

You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1864
[LightGBM] [Info] Number of data points in the train set: 2250, number of used features: 33
[LightGBM] [Info] Start training from score -1.734749
[LightGBM] [Info] Start training from score -1.215895
[LightGBM] [Info] Start training from score -2.509199
[LightGBM] [Info] Start training from score -3.055246
[LightGBM] [Info] Start training from score -1.399717
[LightGBM] [Info] Start training from score -2.238047
[LightGBM] [Info] Start training from score -3.093713






(750, 7)
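
Each row of the `(750, 7)` output holds one probability per class. A minimal sketch of turning such rows into labels, assuming the 1..7 to 0..6 encoding used above; the `proba` array is made up for illustration and is not model output.

```python
import numpy as np

# Made-up probabilities for 3 samples and 7 classes (each row sums to 1).
proba = np.array([
    [0.70, 0.10, 0.05, 0.05, 0.04, 0.03, 0.03],
    [0.05, 0.10, 0.05, 0.05, 0.60, 0.10, 0.05],
    [0.02, 0.03, 0.05, 0.10, 0.10, 0.10, 0.60],
])

encoded_pred = proba.argmax(axis=1)  # index of the most probable class, 0..6
original_pred = encoded_pred + 1     # undo the zero-based encoding (assumes labels 1..7)

print(encoded_pred)   # [0 4 6]
print(original_pred)  # [1 5 7]
```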

In [6]:
boosting = ['gbdt', 'rf', 'dart', 'goss']

for i in boosting:
    if i == 'goss':
        # Note: bagging cannot be used with GOSS
        model = lgb.train(params={"objective": "multiclass",
                                  "num_class": 7,
                                  "boosting": i}, train_set=train_dataset)
    else:
        # Note: boosting='rf' requires bagging (bagging_freq > 0 and
        # bagging_fraction < 1) or feature_fraction < 1
        # bagging_freq: frequency for bagging
        #     0 disables bagging; k performs bagging every k iterations
        # bagging_fraction: fraction of training instances sampled per bagging round
        #     * 0.0 < bagging_fraction <= 1.0
        #     * to enable bagging, bagging_freq must also be non-zero
        model = lgb.train({"n_jobs": 1, "objective": "multiclass",
                           "num_class": 7,
                           "boosting": i,
                           "subsample_freq": 1,  # alias of bagging_freq; default 0
                           "bagging_fraction": 0.9,  # default 1.0
                           "bagging_fraction_seed": 1},  # alias of bagging_seed
                          train_set=train_dataset)
    print(model.predict(X_test).shape)

You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1864
[LightGBM] [Info] Number of data points in the train set: 2250, number of used features: 33
[LightGBM] [Info] Start training from score -1.734749
[LightGBM] [Info] Start training from score -1.215895
[LightGBM] [Info] Start training from score -2.509199
[LightGBM] [Info] Start training from score -3.055246
[LightGBM] [Info] Start training from score -1.399717
[LightGBM] [Info] Start training from score -2.238047
[LightGBM] [Info] Start training from score -3.093713
(750, 7)
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1864
[LightGBM] [Info] Number of data points in the train set: 2250, number of used features: 33
[LightGBM] [Info] Start training from score -1.734749
[LightGBM] [Info] Start training from score -1.215895
[LightGBM] [Info] Start training from score -2.509199
[LightGBM] [Info] Start training from score -3.055246
[LightGBM] [Info] Start train

In [7]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4}
# verbosity controls LightGBM's logging level (< 0: Fatal, = 0: Error (Warning), = 1: Info, > 1: Debug)
model = lgb.train(params=params, train_set=train_dataset)  # default: verbosity=1
model.predict(X_test).shape

(750, 7)

In [8]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "learning_rate": 0.2}  # default: learning_rate=0.1
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)

In [9]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "feature_fraction": 0.5}  # default: feature_fraction=1.0
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)

In [10]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "lambda_l1": 0.2}  # default: lambda_l1=0.0
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)

In [11]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "lambda_l2": 0.2}  # default: lambda_l2=0.0
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)

In [13]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "max_depth": 5}  # default: max_depth=-1 (<= 0 means no limit)
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)

In [14]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "min_sum_hessian_in_leaf": 5}  # default: min_sum_hessian_in_leaf=1e-3
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)

In [15]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "min_data_in_leaf": 25}  # default: min_data_in_leaf=20 (constraint: min_data_in_leaf >= 0)
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)

In [16]:
params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "num_leaves": 62}  # default: num_leaves=31
model = lgb.train(params=params, train_set=train_dataset)
model.predict(X_test).shape

(750, 7)
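
`num_leaves` and `max_depth` interact: a binary tree limited to depth `d` has at most `2**d` leaves, so a larger `num_leaves` has no effect once `max_depth` is set. A small sketch of that bound; `max_leaves_for_depth` is a hypothetical helper, not a LightGBM function.

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on the leaves of a binary tree with the given depth limit.

    Returns None when max_depth <= 0, which LightGBM treats as "no limit".
    """
    if max_depth <= 0:
        return None
    return 2 ** max_depth


for depth in (-1, 5, 6):
    print(depth, max_leaves_for_depth(depth))
```

For example, with `max_depth=5` the bound is 32, so `num_leaves=62` could not be reached by a single tree.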

In [17]:
# Evaluation metrics for validation data
# default: metric="" (empty string or not specified) means the metric corresponding to the specified objective is used
'''
l1: absolute loss
l2: square loss
rmse: root mean squared error (root of the square loss)
cross_entropy
multi_error: error rate for multi-class classification
multi_logloss: log loss for multi-class classification
binary_logloss
auc
'''
evals_result = {}
val_dataset = lgb.Dataset(data=X_test, label=y_test_enc)
eval_set = [train_dataset, val_dataset]

params = {"objective": "multiclass",
          "num_class": 7,
          "verbosity": -4,
          "metric": "multi_error"}
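# Note: the evals_result argument exists only in LightGBM < 4.0;
# newer versions use callbacks=[lgb.record_evaluation(evals_result)] instead.
model = lgb.train(params=params,
                  train_set=train_dataset,
                  valid_sets=eval_set,
                  evals_result=evals_result)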

[1]	training's multi_error: 0.433333	valid_1's multi_error: 0.428
[2]	training's multi_error: 0.250222	valid_1's multi_error: 0.308
[3]	training's multi_error: 0.160889	valid_1's multi_error: 0.238667
[4]	training's multi_error: 0.131556	valid_1's multi_error: 0.221333
[5]	training's multi_error: 0.123556	valid_1's multi_error: 0.217333
[6]	training's multi_error: 0.112889	valid_1's multi_error: 0.202667
[7]	training's multi_error: 0.104889	valid_1's multi_error: 0.201333
[8]	training's multi_error: 0.0968889	valid_1's multi_error: 0.197333
[9]	training's multi_error: 0.0888889	valid_1's multi_error: 0.185333
[10]	training's multi_error: 0.0871111	valid_1's multi_error: 0.177333
[11]	training's multi_error: 0.0804444	valid_1's multi_error: 0.181333
[12]	training's multi_error: 0.0773333	valid_1's multi_error: 0.177333
[13]	training's multi_error: 0.0715556	valid_1's multi_error: 0.173333
[14]	training's multi_error: 0.0684444	valid_1's multi_error: 0.173333
[15]	training's multi_error:

In [18]:
print(evals_result['training'].keys())
print(evals_result['valid_1'].keys())

odict_keys(['multi_error'])
odict_keys(['multi_error'])
