## CatBoost main tuning parameters

1. Other parameters
    * thread_count==>xgboost(n_jobs)
    * verbose==>xgboost(verbosity)
    * allow_writing_files
    * loss_function==>xgboost(objective)
    * eval_metric==>xgboost.fit
    * task_type
    * leaf_estimation_method

2. Tree parameters
    * n_estimators==>xgboost
    * max_depth==>xgboost

3. Parameters for preventing overfitting
    * colsample_bylevel
    * learning_rate==>xgboost
    * reg_lambda==>xgboost
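The `==>xgboost` annotations above can be read as a name mapping. The following is a minimal sketch that uses only the pairs listed in this section. The names `CATBOOST_TO_XGBOOST` and `to_xgboost_names` are hypothetical. Only parameter names are translated, since the accepted values may differ between the two libraries (for instance `verbose` versus `verbosity`).

```python
# Name pairs copied from the list above; the dict and helper names are illustrative.
CATBOOST_TO_XGBOOST = {
    "thread_count": "n_jobs",
    "verbose": "verbosity",
    "loss_function": "objective",
    "n_estimators": "n_estimators",
    "max_depth": "max_depth",
    "learning_rate": "learning_rate",
    "reg_lambda": "reg_lambda",
}


def to_xgboost_names(catboost_params):
    """Rename keys that have a listed XGBoost counterpart; collect the rest."""
    renamed, unmapped = {}, []
    for name, value in catboost_params.items():
        if name in CATBOOST_TO_XGBOOST:
            renamed[CATBOOST_TO_XGBOOST[name]] = value
        else:
            unmapped.append(name)
    return renamed, unmapped


renamed, unmapped = to_xgboost_names(
    {"thread_count": -1, "max_depth": 6, "allow_writing_files": False}
)
print(renamed)   # keys renamed where a counterpart is listed
print(unmapped)  # parameters with no counterpart in the list
```

Parameters such as `allow_writing_files` and `task_type` have no counterpart in the list, so the helper reports them separately instead of guessing.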

In [1]:
from catboost import CatBoostClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np

In [2]:
covtype = datasets.fetch_covtype()  # load the dataset once and reuse it
X = covtype.data[:3000]
y = covtype.target[:3000]
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [3]:
print(X_train.shape)  # the dataset has 54 features
print(np.unique(y_train))  # 7 classes

(2250, 54)
[1 2 3 4 5 6 7]


In [4]:
'''
verbose: the purpose of this parameter depends on the type of the given value:

1. bool — Defines the logging level:
    * “True”  corresponds to the Verbose logging level
    * “False” corresponds to the Silent logging level
2. int — Use the Verbose logging level and set the logging period to the value of this parameter.
'''
cat = CatBoostClassifier(thread_count=-1,
                         verbose=100,
                         # Allow writing analytical and snapshot files during training
                         allow_writing_files=False)  # default: allow_writing_files=True
cat.fit(X_train, y_train)
print(cat.score(X_test, y_test))

Learning rate set to 0.082468
0:	learn: 1.7863484	total: 190ms	remaining: 3m 9s
100:	learn: 0.4575110	total: 1.03s	remaining: 9.2s
200:	learn: 0.3393051	total: 1.89s	remaining: 7.51s
300:	learn: 0.2688756	total: 2.72s	remaining: 6.32s
400:	learn: 0.2229096	total: 3.55s	remaining: 5.3s
500:	learn: 0.1885924	total: 4.38s	remaining: 4.36s
600:	learn: 0.1602304	total: 5.22s	remaining: 3.46s
700:	learn: 0.1374790	total: 6.05s	remaining: 2.58s
800:	learn: 0.1211865	total: 6.88s	remaining: 1.71s
900:	learn: 0.1071322	total: 7.72s	remaining: 848ms
999:	learn: 0.0946276	total: 8.6s	remaining: 0us
0.86
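The log above shows which iterations get reported when `verbose=100`: every 100th iteration, plus the final one. The sketch below reproduces that pattern. It is an illustration of the observed output, not CatBoost's logging code, and `logged_iterations` is a hypothetical helper.

```python
def logged_iterations(n_iterations, period):
    """Iterations reported when logging every `period` steps plus the last one."""
    if n_iterations < 1 or period < 1:
        raise ValueError("n_iterations and period must be positive")
    shown = list(range(0, n_iterations, period))
    last = n_iterations - 1
    if shown[-1] != last:
        shown.append(last)  # the final iteration is always reported
    return shown


print(logged_iterations(1000, 100))
```

For 1000 iterations and a period of 100 this gives 0, 100, ..., 900, 999, which matches the iteration numbers in the log above.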


In [5]:
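# loss_function: The metric to use in training.
# eval_metric: The metric used for overfitting detection (if enabled) and best model selection (if enabled).
'''
Possible values for loss_function/eval_metric:
RMSE: root mean squared error
MAE: mean absolute error
R2: R-squared
Logloss: negative log-likelihood
CrossEntropy: cross-entropy
Precision: precision
Recall: recall
F1: F1 score
Accuracy: accuracy
AUC
MultiClass: multiclass logloss

Note: R2, Precision, Recall, F1, Accuracy and AUC are evaluation metrics;
they can be passed as eval_metric but not optimized as loss_function.
'''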
cat = CatBoostClassifier(thread_count=-1, verbose=False, allow_writing_files=False,
                         loss_function='MultiClass',
                         eval_metric='MultiClass')  # can be customized; multiple evaluation metrics are not supported
cat.fit(X_train, y_train)
print(cat.score(X_test, y_test))

0.86
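To make the `MultiClass` entry concrete, here is a minimal sketch of an unweighted multiclass logloss computed from raw per-class scores with a softmax. It uses the textbook definition and ignores object weights, so it should be read as an illustration and not as CatBoost's exact implementation. `multiclass_logloss` is a hypothetical name, and labels are 0-based indices here, unlike the 1..7 labels of the dataset.

```python
import math


def multiclass_logloss(raw_scores, labels):
    """Mean negative log of the softmax probability assigned to the true class."""
    total = 0.0
    for scores, label in zip(raw_scores, labels):
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]
    return total / len(labels)


# Equal scores for all 7 classes carry no information: the loss equals log(7).
uninformed = multiclass_logloss([[0.0] * 7], [3])
# A large score on the true class drives the loss towards 0.
confident = multiclass_logloss([[0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0]], [3])
print(uninformed, confident)
```

With seven classes the uninformed value is log(7), roughly 1.946. The first `learn` value in the training log above (1.786) sits a little below that, as one would expect after a single boosting step.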


In [6]:
cat = CatBoostClassifier(thread_count=-1, verbose=False, allow_writing_files=False)
cat.fit(X_train, y_train)
print(cat.score(X_test, y_test))

0.86


In [None]:
'''
The processing unit type to use for training.

Possible values:
    * CPU
    * GPU
'''
cat = CatBoostClassifier(thread_count=-1, verbose=False, allow_writing_files=False,
                         task_type='GPU')  # default: task_type='CPU'
cat.fit(X_train, y_train)
print(cat.score(X_test, y_test))

In [None]:
'''
The method used to calculate the values in leaves.

Possible values:
    * Newton
    * Gradient
    * Exact
'''

# Default value
"""
Depends on the mode and the selected loss function:
    * Regression with Quantile or MAE loss functions — One Exact iteration.
    * Regression with any loss function but Quantile or MAE – One Gradient iteration.
    * Classification mode – Ten Newton iterations.
    * Multiclassification mode – One Newton iteration.
"""
cat = CatBoostClassifier(thread_count=-1, verbose=False, allow_writing_files=False,
                         leaf_estimation_method="Gradient")
cat.fit(X_train, y_train)
print(cat.score(X_test, y_test))
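The difference between the `Gradient` and `Newton` options can be sketched for a single leaf under binary log loss. The formulas below are the textbook first-order and second-order steps with an L2 term. Treating them as what CatBoost computes internally is an assumption, and `leaf_values` with its `l2` argument is a hypothetical helper.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def leaf_values(raw_scores, targets, l2=3.0):
    """Return (gradient_step, newton_step) for one leaf under binary log loss."""
    # First derivative of log loss w.r.t. the raw score: p - y
    grads = [sigmoid(a) - y for a, y in zip(raw_scores, targets)]
    # Second derivative: p * (1 - p)
    hessians = [sigmoid(a) * (1.0 - sigmoid(a)) for a in raw_scores]
    gradient_step = -sum(grads) / (len(grads) + l2)   # averaged gradient
    newton_step = -sum(grads) / (sum(hessians) + l2)  # curvature-scaled
    return gradient_step, newton_step


# Four objects in one leaf, all starting from a raw score of 0 (p = 0.5).
g_step, n_step = leaf_values([0.0] * 4, [1, 1, 1, 0])
print(g_step, n_step)
```

In this example the gradient step is 1/7 and the Newton step is 0.25. The Newton step is larger because each Hessian term is at most 0.25, so the denominator is smaller. Raising `l2` shrinks both values towards zero.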

In [None]:
n_estimators = [10, 20, 50, 100]  # default: n_estimators=1000
for i in n_estimators:
    cat = CatBoostClassifier(thread_count=-1, allow_writing_files=False, verbose=False,
                             n_estimators=i)
    cat.fit(X_train, y_train)
    print('n_estimators=' + str(i) + ',  score=', cat.score(X_test, y_test))

In [None]:
max_depth = [1, 3, 6, 9]
for i in max_depth:
    cat = CatBoostClassifier(thread_count=-1, verbose=False, allow_writing_files=False,
                             max_depth=i)  # default: 6 (16 if the growing policy is set to Lossguide)
    cat.fit(X_train, y_train)
    print('max_depth=' + str(i) + ',  score=', cat.score(X_test, y_test))

In [None]:
colsample_bylevel = [0.1, 0.3, 0.6, 0.7, 0.8, 0.95, 1]
for i in colsample_bylevel:
    cat = CatBoostClassifier(thread_count=-1, verbose=False, allow_writing_files=False,
                             # Random subspace method: the fraction of features to use at each split selection, with features re-sampled at random each time
                             colsample_bylevel=i)  # default: colsample_bylevel=1.0
    cat.fit(X_train, y_train)
    print('colsample_bylevel=' + str(i) + ',  score=', cat.score(X_test, y_test))
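The random subspace idea behind `colsample_bylevel` can be sketched as drawing a fresh random subset of feature indices for each split selection. This is a minimal sketch: `sample_features` is a hypothetical helper, and the rounding rule (truncate, keep at least one feature) is an assumption, not CatBoost's documented behaviour.

```python
import random


def sample_features(n_features, fraction, rng):
    """Pick a random subset holding roughly `fraction` of the feature indices."""
    k = max(1, int(fraction * n_features))  # rounding rule is an assumption
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # seeded so repeated runs draw the same subsets
for fraction in (0.1, 0.6, 1):
    subset = sample_features(54, fraction, rng)
    print(fraction, len(subset))
```

With the 54 features of this dataset, a fraction of 0.1 leaves only 5 candidate features per split, while 1 keeps all of them.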

In [None]:
learning_rate = [0.01, 0.02, 0.05, 0.5, 0.7, 0.9]  # default: set automatically from the dataset and iteration count (0.082468 in the run above)
for i in learning_rate:
    cat = CatBoostClassifier(thread_count=-1, verbose=False, allow_writing_files=False,
                             learning_rate=i)
    cat.fit(X_train, y_train)
    print('learning_rate=' + str(i) + ',  score=', cat.score(X_test, y_test))
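Why a small `learning_rate` needs more trees can be seen in an idealized setting: suppose every tree fits the current residual exactly and its prediction is scaled by the learning rate before being added to the ensemble. This is a simplification for illustration, not a model of CatBoost's training, and `residual_after` is a hypothetical helper.

```python
def residual_after(n_trees, learning_rate, initial_residual=1.0):
    """Residual left after `n_trees` shrunken updates in the idealized setting."""
    residual = initial_residual
    for _ in range(n_trees):
        # The tree predicts the residual; only a fraction of it is applied.
        residual -= learning_rate * residual
    return residual


for lr in (0.01, 0.05, 0.5, 0.9):
    print(lr, residual_after(100, lr))
```

The residual decays as `(1 - learning_rate) ** n_trees`, so after 100 trees a rate of 0.01 still leaves about 37% of the initial error. Larger rates converge faster on the training data but, on real data, are more prone to overfitting.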

In [None]:
reg_lambda = [0, 1, 3, 9, 27, 81]
for i in reg_lambda:
    cat = CatBoostClassifier(thread_count=-1, verbose=False, allow_writing_files=False,
                             # Coefficient at the L2 regularization term of the cost function.
                             reg_lambda=i)  # default: reg_lambda=3.0
    cat.fit(X_train, y_train)
    print('reg_lambda=' + str(i) + ',  score=', cat.score(X_test, y_test))
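The sweeps above all follow the same pattern: build a model with one parameter varied, fit it, and score it on the test split. A minimal sketch of a reusable helper is shown below. `sweep_parameter` is a hypothetical name, and `ConstantClassifier` is a stand-in that exists only so the sketch runs without CatBoost installed.

```python
def sweep_parameter(make_model, name, values, X_train, y_train, X_test, y_test, **fixed):
    """Fit one model per value of `name` and return {value: test score}."""
    scores = {}
    for value in values:
        model = make_model(**fixed, **{name: value})
        model.fit(X_train, y_train)
        scores[value] = model.score(X_test, y_test)
    return scores


class ConstantClassifier:
    """Hypothetical stand-in: always predicts `guess`, exposes fit/score."""

    def __init__(self, guess=0):
        self.guess = guess

    def fit(self, X, y):
        return self

    def score(self, X, y):
        return sum(label == self.guess for label in y) / len(y)


scores = sweep_parameter(ConstantClassifier, "guess", [1, 2],
                         [[0]], [1], [[0], [0], [0], [0]], [1, 1, 2, 1])
print(scores)
```

With the notebook's variables, the last sweep would become `sweep_parameter(CatBoostClassifier, "reg_lambda", [0, 1, 3, 9, 27, 81], X_train, y_train, X_test, y_test, thread_count=-1, verbose=False, allow_writing_files=False)`.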