## CatBoost: main tuning parameters and their meanings

1. Other parameters
    * loss_function/objective===>xgboost(objective)
    * thread_count===>xgboost(nthread)
    * allow_writing_files
    * eval_metric===>xgboost(eval_metric)
    * task_type
    * leaf_estimation_method
    * iterations/n_estimators===>xgboost(num_boost_round)
    * use_best_model
    * *****************************
    * pool===>xgboost(dtrain)
    * early_stopping_rounds===>xgboost(early_stopping_rounds)
    * eval_set===>xgboost(evals)
    * verbose_eval===>xgboost(verbose_eval)

2. Tree tuning parameters
    * max_depth/depth===>xgboost(max_depth)

3. Parameters for preventing overfitting
    * colsample_bylevel/rsm===>xgboost(colsample_bylevel)
    * learning_rate===>xgboost(learning_rate)
    * reg_lambda===>xgboost(reg_lambda)
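
The name mapping in the lists above can also be written as a small lookup table. This is only a sketch: `CATBOOST_TO_XGBOOST`, `CATBOOST_ALIASES`, and `to_xgboost_name` are hypothetical helper names, not part of either library, and the choice of which spelling counts as canonical is an assumption made for illustration.

```python
# CatBoost parameter name -> closest xgboost counterpart, copied from the lists above.
CATBOOST_TO_XGBOOST = {
    "loss_function": "objective",
    "thread_count": "nthread",
    "eval_metric": "eval_metric",
    "iterations": "num_boost_round",
    "pool": "dtrain",
    "early_stopping_rounds": "early_stopping_rounds",
    "eval_set": "evals",
    "verbose_eval": "verbose_eval",
    "depth": "max_depth",
    "rsm": "colsample_bylevel",
    "learning_rate": "learning_rate",
    "reg_lambda": "reg_lambda",
}

# Alternative spellings that the lists above give for the same CatBoost parameter.
CATBOOST_ALIASES = {
    "objective": "loss_function",
    "n_estimators": "iterations",
    "max_depth": "depth",
    "colsample_bylevel": "rsm",
}


def to_xgboost_name(catboost_name):
    """Return the xgboost counterpart, or None when the lists above name none."""
    canonical = CATBOOST_ALIASES.get(catboost_name, catboost_name)
    return CATBOOST_TO_XGBOOST.get(canonical)


print(to_xgboost_name("n_estimators"))
print(to_xgboost_name("task_type"))
```

Parameters without a listed counterpart (`allow_writing_files`, `task_type`, `leaf_estimation_method`, `use_best_model`) return `None`.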

In [10]:
import catboost as cat
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np

In [11]:
covtype = datasets.fetch_covtype()  # load the dataset once instead of twice
X = covtype.data[:3000]
y = covtype.target[:3000]
X_train, X_test, y_train, y_test = train_test_split(X, y)

print(X_train.shape)
print(y_train.shape)
print(np.unique(y_train))  # 7-class classification task

(2250, 54)
(2250,)
[1 2 3 4 5 6 7]


In [12]:
train_dataset = cat.Pool(data=X_train, label=y_train)

In [13]:
'''
# loss_function: The metric to use in training.
# eval_metric: The metric used for overfitting detection (if enabled) and best model selection (if enabled).
# loss_function/eval_metric can be user-defined; see https://catboost.ai/docs/concepts/python-usages-examples.html#custom-loss-function-eval-metric
Available values for loss_function/eval_metric:
RMSE: root mean squared error===>loss_function/eval_metric
MAE: mean absolute error===>loss_function/eval_metric
R2: R-squared===>eval_metric

CrossEntropy: cross-entropy===>loss_function/eval_metric
Logloss: negative log-likelihood (binary classification)===>loss_function/eval_metric

MultiClass: multiclass logloss===>loss_function/eval_metric

Precision: precision (multiclass)===>eval_metric
Recall: recall (multiclass)===>eval_metric
F1: F1 score (multiclass)===>eval_metric
Accuracy: accuracy (multiclass)===>eval_metric
AUC: AUC (multiclass)===>eval_metric
'''

'''
verbose:
The purpose of this parameter depends on the type of the given value:

1. bool — Defines the logging level:
    * “True”  corresponds to the Verbose logging level
    * “False” corresponds to the Silent logging level
2. int — Use the Verbose logging level and set the logging period to the value of this parameter.
'''

'''
task_type:
The processing unit type to use for training.

Possible values:
    * CPU
    * GPU
'''
params = {"loss_function": "MultiClass",
          # Specifying multiple eval_metric values is not supported
          # Defaults to the same metric as loss_function
          "eval_metric": "MultiClass",
          # Allows writing analytical and snapshot files during training.
          # If set to "False", the snapshot and data visualization tools are unavailable.
          # Default: allow_writing_files=True
          "allow_writing_files": False,
          # The number of threads to use during the training.
          "thread_count": -1,
          "task_type": 'CPU'  # Default: task_type='CPU'
          }

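'''
Compared with xgboost/lightgbm:
    * The y labels are not required to start from 0
    * The number of classes does not need to be specified; it is inferred from the labels
'''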
model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

Learning rate set to 0.082468
0:	learn: 1.7841495	total: 7.22ms	remaining: 7.21s
1:	learn: 1.6525647	total: 14.1ms	remaining: 7.04s
2:	learn: 1.5506747	total: 20.5ms	remaining: 6.82s
3:	learn: 1.4629622	total: 27.2ms	remaining: 6.78s
4:	learn: 1.3917608	total: 33.1ms	remaining: 6.58s
5:	learn: 1.3246631	total: 39.4ms	remaining: 6.53s
6:	learn: 1.2629908	total: 45.8ms	remaining: 6.5s
7:	learn: 1.2097860	total: 51.8ms	remaining: 6.42s
8:	learn: 1.1585170	total: 58.1ms	remaining: 6.39s
9:	learn: 1.1167620	total: 64.7ms	remaining: 6.41s
10:	learn: 1.0756527	total: 70.9ms	remaining: 6.38s
11:	learn: 1.0433370	total: 77.2ms	remaining: 6.36s
12:	learn: 1.0117343	total: 83.1ms	remaining: 6.31s
13:	learn: 0.9828477	total: 89ms	remaining: 6.27s
14:	learn: 0.9555561	total: 95.2ms	remaining: 6.25s
15:	learn: 0.9287209	total: 101ms	remaining: 6.24s
16:	learn: 0.9083755	total: 108ms	remaining: 6.25s
17:	learn: 0.8840428	total: 114ms	remaining: 6.23s
18:	learn: 0.8645431	total: 121ms	remaining: 6.22s

(750, 7)

In [14]:
"""
leaf_estimation_method:
The method used to calculate the values in leaves.

Possible values:
    * Newton
    * Gradient
    * Exact

Default value:
Depends on the mode and the selected loss function:
    * Regression with Quantile or MAE loss functions: one Exact iteration.
    * Regression with any loss function but Quantile or MAE: one Gradient iteration.
    * Classification mode: ten Newton iterations.
    * Multiclassification mode: one Newton iteration.
"""
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          "leaf_estimation_method": "Gradient"
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

(750, 7)

In [15]:
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          "n_estimators": 200  # Default: n_estimators=1000
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

(750, 7)

In [16]:
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          # Default: 6 (16 if the growing policy is set to Lossguide)
          "max_depth": 5  # Maximum allowed value: 16
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

(750, 7)

In [17]:
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          # Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random.
          "colsample_bylevel": 0.8  # Default: colsample_bylevel=1
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

(750, 7)

In [18]:
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          # Used for reducing the gradient step.
          "learning_rate": 0.01
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

(750, 7)

In [19]:
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          # Coefficient at the L2 regularization term of the cost function.
          "reg_lambda": 1  # Default: reg_lambda=3.0
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape

(750, 7)

In [20]:
"""
If this parameter is set, the number of trees that are saved in the resulting model is defined as follows:
    Build the number of trees defined by the training parameters.
    Use the validation dataset to identify the iteration with the optimal value of the metric specified in  --eval-metric (eval_metric).
"""
params = {"loss_function": "MultiClass",
          "allow_writing_files": False,
          "verbose": False,
          # Select the best model according to eval_metric
          # Requires the eval_set argument to be provided
          "use_best_model": True
          }

model = cat.train(pool=train_dataset, params=params)
model.predict(X_test).shape


CatBoostError: To employ param {'use_best_model': True} provide non-empty 'eval_set'.