## CatBoost: main tuning parameters and their meanings

1. Other parameters
    * loss_function/objective===>xgboost(objective)
    * thread_count===>xgboost(nthread)
    * allow_writing_files
    * eval_metric===>xgboost(eval_metric)
    * task_type
    * leaf_estimation_method
    * iterations/n_estimators===>xgboost(num_boost_round)
    * use_best_model
    * Arguments passed to train() itself rather than set in params:
        * pool===>xgboost(dtrain)
        * early_stopping_rounds===>xgboost(early_stopping_rounds)
        * eval_set===>xgboost(evals)
        * verbose_eval===>xgboost(verbose_eval)

3. Tree tuning parameters
    * max_depth/depth===>xgboost(max_depth)

4. Parameters that prevent overfitting
    * colsample_bylevel/rsm===>xgboost(colsample_bylevel)
    * learning_rate===>xgboost(learning_rate)
    * reg_lambda===>xgboost(reg_lambda)
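
The correspondence listed above can also be written as a small lookup table. This is only a sketch: `CATBOOST_TO_XGBOOST` and `to_xgboost_names` are hypothetical names introduced here for illustration, and only the pairs from the list are included. It renames parameters but does not translate their values (for example, loss names differ between the two libraries).

```python
# Name correspondence copied from the list above (CatBoost name -> xgboost name).
CATBOOST_TO_XGBOOST = {
    "loss_function": "objective",
    "objective": "objective",
    "thread_count": "nthread",
    "eval_metric": "eval_metric",
    "iterations": "num_boost_round",
    "n_estimators": "num_boost_round",
    "pool": "dtrain",
    "early_stopping_rounds": "early_stopping_rounds",
    "eval_set": "evals",
    "verbose_eval": "verbose_eval",
    "max_depth": "max_depth",
    "depth": "max_depth",
    "colsample_bylevel": "colsample_bylevel",
    "rsm": "colsample_bylevel",
    "learning_rate": "learning_rate",
    "reg_lambda": "reg_lambda",
}


def to_xgboost_names(catboost_params):
    """Rename keys that have a listed xgboost counterpart; return the rest separately."""
    renamed, unmapped = {}, {}
    for name, value in catboost_params.items():
        if name in CATBOOST_TO_XGBOOST:
            renamed[CATBOOST_TO_XGBOOST[name]] = value
        else:
            # No counterpart in the list above (e.g. task_type, use_best_model).
            unmapped[name] = value
    return renamed, unmapped


renamed, unmapped = to_xgboost_names(
    {"depth": 5, "rsm": 0.8, "thread_count": 4, "task_type": "CPU"}
)
print(renamed)
print(unmapped)
```

Parameters without a listed counterpart, such as `task_type`, are returned separately so they are not silently dropped.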

In [12]:
import catboost as cat
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np

In [13]:
# Load the dataset once and keep the first 3000 rows
X, y = datasets.fetch_covtype(return_X_y=True)
X, y = X[:3000], y[:3000]
X_train, X_test, y_train, y_test = train_test_split(X, y)

print(X_train.shape)
print(y_train.shape)
print(np.unique(y_train))  # 7-class classification task

(2250, 54)
(2250,)
[1 2 3 4 5 6 7]


In [14]:
train_dataset = cat.Pool(data=X_train, label=y_train)

In [15]:
'''
# loss_function:The metric to use in training.
# eval_metric:The metric used for overfitting detection (if enabled) and best model selection (if enabled).
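# Both loss_function and eval_metric can be user-defined; see https://catboost.ai/docs/concepts/python-usages-examples.html#custom-loss-function-eval-metric
Available values for loss_function/eval_metric (the arrow shows where each one can be used):
RMSE: root mean squared error ===> loss_function/eval_metric
MAE: mean absolute error ===> loss_function/eval_metric
R2: coefficient of determination (R squared) ===> eval_metric

CrossEntropy: cross-entropy ===> loss_function/eval_metric
Logloss: negative log-likelihood (binary classification) ===> loss_function/eval_metric

MultiClass: multiclass logloss ===> loss_function/eval_metric

Precision: precision (multiclass) ===> eval_metric
Recall: recall (multiclass) ===> eval_metric
F1: F1 score (multiclass) ===> eval_metric
Accuracy: accuracy (multiclass) ===> eval_metric
AUC: multiclass ===> eval_metric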
'''

'''
verbose:
The purpose of this parameter depends on the type of the given value:

1. bool — Defines the logging level:
    * “True”  corresponds to the Verbose logging level
    * “False” corresponds to the Silent logging level
2. int — Use the Verbose logging level and set the logging period to the value of this parameter.
'''
params = {"loss_function": "MultiClass",
          # Only a single eval_metric can be specified
          # Defaults to the same metric as loss_function
          "eval_metric": "MultiClass",
          # Allow to write analytical and snapshot files during training.
          # If set to “False”, the snapshot and data visualization tools are unavailable.
          # Default: allow_writing_files=True
          "allow_writing_files": False,
          # The number of threads to use during the training.
          "thread_count": -1,
          }

'''
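Compared with xgboost/lightgbm:
    * Class labels do not have to start at 0
    * The number of classes does not need to be specified; it is inferred from the labels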
'''
model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

Learning rate set to 0.082468
0:	learn: 1.7633679	total: 8.98ms	remaining: 8.98s
1:	learn: 1.6485576	total: 16.9ms	remaining: 8.41s
2:	learn: 1.5421658	total: 24.9ms	remaining: 8.29s
3:	learn: 1.4646668	total: 33ms	remaining: 8.21s
4:	learn: 1.3887613	total: 40.5ms	remaining: 8.06s
5:	learn: 1.3191996	total: 49.1ms	remaining: 8.14s
6:	learn: 1.2585012	total: 56.7ms	remaining: 8.05s
7:	learn: 1.2050367	total: 65.6ms	remaining: 8.13s
8:	learn: 1.1603612	total: 73.2ms	remaining: 8.06s
9:	learn: 1.1165846	total: 81.9ms	remaining: 8.11s
10:	learn: 1.0789472	total: 89.3ms	remaining: 8.03s
11:	learn: 1.0440096	total: 97.3ms	remaining: 8.01s
12:	learn: 1.0157868	total: 105ms	remaining: 7.95s
13:	learn: 0.9863562	total: 112ms	remaining: 7.9s
14:	learn: 0.9579050	total: 121ms	remaining: 7.92s
15:	learn: 0.9295251	total: 131ms	remaining: 8.03s
16:	learn: 0.9073169	total: 141ms	remaining: 8.13s
17:	learn: 0.8869492	total: 151ms	remaining: 8.23s
18:	learn: 0.8679071	total: 163ms	remaining: 8.41s
19

(750, 7)
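
The `(750, 7)` output above has one column per class. Assuming these are raw per-class scores (the default output of `predict`, to the best of my understanding) and that columns follow the sorted class labels 1 to 7, a softmax converts them to probabilities and an argmax to labels. The score matrix below is made up for illustration and is not model output.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical raw scores for two samples and seven classes.
raw = np.array([[2.0, 0.5, -1.0, 0.0, 0.0, -0.5, 0.1],
                [-1.0, 0.0, 3.0, 0.2, 0.1, 0.0, -0.3]])
classes = np.arange(1, 8)  # assumed column order: labels 1..7

proba = softmax(raw)
labels = classes[proba.argmax(axis=1)]

print(proba.shape)  # (2, 7)
print(labels)       # [1 3]
```

With a trained model, the same results should be obtainable directly by passing `prediction_type="Probability"` or `prediction_type="Class"` to `model.predict`.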

In [16]:
"""
The method used to calculate the values in leaves.
Possible values:
    * Newton
    * Gradient
    * Exact

Default:
Depends on the mode and the selected loss function:
    * Regression with Quantile or MAE loss functions — One Exact iteration.
    * Regression with any loss function but Quantile or MAE – One Gradient iteration.
    * Classification mode – Ten Newton iterations.
    * Multiclassification mode – One Newton iteration.
"""
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          "leaf_estimation_method": "Gradient"
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

(750, 7)
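
To give a feel for how a Gradient step and a Newton step differ, here is a textbook-style sketch for a single leaf under binary logloss. It is a generic illustration under stated assumptions, not CatBoost's internal code and not the MultiClass computation; `leaf_values` is a hypothetical helper and `l2=3.0` is just an example regularization value.

```python
import numpy as np


def leaf_values(y, score, l2):
    """One-step leaf value estimates for binary logloss on the samples in one leaf."""
    p = 1.0 / (1.0 + np.exp(-score))  # current predicted probability
    grad = y - p                      # first derivative of the log-likelihood
    hess = p * (1.0 - p)              # magnitude of the second derivative
    # Gradient step: average gradient, damped by the L2 term.
    gradient_step = grad.sum() / (len(y) + l2)
    # Newton step: gradient sum scaled by the curvature instead of the count.
    newton_step = grad.sum() / (hess.sum() + l2)
    return gradient_step, newton_step


y = np.array([1.0, 1.0, 1.0, 0.0])  # labels of the samples in the leaf
score = np.zeros(4)                 # current raw scores (all zero at the start)
gradient_step, newton_step = leaf_values(y, score, l2=3.0)
print(gradient_step, newton_step)
```

In this toy case the Newton step (0.25) is larger than the Gradient step (about 0.143), because the curvature sum (1.0) is smaller than the sample count (4).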

In [17]:
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          "n_estimators": 200  # default: n_estimators=1000
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

(750, 7)

In [18]:
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          # Default: 6 (16 if the growing policy is set to Lossguide)
          "max_depth": 5  # maximum allowed value: 16
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

(750, 7)

In [19]:
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          # Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random.
          "colsample_bylevel": 0.8  # default: colsample_bylevel=1
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

(750, 7)

In [20]:
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          # Used for reducing the gradient step.
          "learning_rate": 0.01
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

(750, 7)

In [21]:
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          # Coefficient at the L2 regularization term of the cost function.
          "reg_lambda": 1  # default: reg_lambda=3.0
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

(750, 7)

In [22]:
"""
If this parameter is set, the number of trees that are saved in the resulting model is defined as follows:
    1. Build the number of trees defined by the training parameters.
    2. Use the validation dataset to identify the iteration with the optimal value of the metric specified in --eval-metric (eval_metric).
    3. No trees are saved after this iteration.
"""
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          # Select the best model according to eval_metric
          # Requires the eval_set argument to be provided
          "use_best_model": True
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

CatBoostError: To employ param {'use_best_model': True} provide non-empty 'eval_set'.

In [23]:
%%timeit

params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          "task_type" : "CPU"
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

8.02 s ± 283 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [24]:
%%timeit

params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          "task_type" : "GPU"  # default: task_type='CPU'
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

6.08 s ± 217 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
