## Main XGBoost tuning parameters

1. Other parameters
    * Keys of the `params` dict
        * booster
        * nthread
        * objective
        * num_class
        * verbosity
        * eval_metric
    * Arguments of `xgb.train`
        * dtrain
        * num_boost_round
        * evals
        * early_stopping_rounds
        * evals_result
        * feval
        * verbose_eval

2. Tree tuning parameters
    * max_depth
    * min_child_weight
    * gamma/min_split_loss

3. Parameters for preventing overfitting
    * eta/learning_rate
    * subsample
    * colsample_bytree
    * colsample_bylevel
    * reg_alpha/alpha
    * reg_lambda/lambda

In [97]:
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

In [98]:
X, y = datasets.fetch_covtype(return_X_y=True)  # load the dataset once
X, y = X[:3000], y[:3000]
X_train, X_test, y_train, y_test = train_test_split(X, y)

print(X_train.shape)
print(y_train.shape)
print(np.unique(y_train))  # 7-class classification task

(2250, 54)
(2250,)
[1 2 3 4 5 6 7]


In [99]:
enc = OrdinalEncoder()
y_train_enc = enc.fit_transform(y_train.reshape(-1, 1))
y_test_enc = enc.transform(y_test.reshape(-1, 1))
print(np.unique(y_train_enc))

[0. 1. 2. 3. 4. 5. 6.]


In [100]:
train_dataset = xgb.DMatrix(data=X_train, label=y_train_enc)
test_dataset = xgb.DMatrix(data=X_test)

In [101]:
"""
Specify the learning task and the corresponding learning objective.
objective [default=reg:squarederror]
    reg:squarederror: regression with squared loss.
    binary:logistic: logistic regression for binary classification, output probability
    multi:softmax: set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)
    multi:softprob: same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata * nclass matrix.
        The result contains predicted probability of each data point belonging to each class.
"""

"""
Evaluation metrics for validation data, a default metric will be assigned according to objective (rmse for regression, and logloss for classification, mean average precision for ranking)
User can add multiple evaluation metrics.
    rmse: root mean square error
    mae: mean absolute error
    logloss: negative log-likelihood
    error: Binary classification error rate.
    merror: Multiclass classification error rate.
    mlogloss: Multiclass logloss.
    auc: Receiver Operating Characteristic Area under the Curve. Available for classification and learning-to-rank tasks.
        * When used with binary classification, the objective should be binary:logistic or similar functions that work on probability.
        * When used with multi-class classification, objective should be multi:softprob instead of multi:softmax, as the latter doesn’t output probability. Also the AUC is calculated by 1-vs-rest with reference class weighted by class prevalence.
"""
params = {'objective': 'multi:softprob',
          "eval_metric": 'mlogloss',
          'num_class': 7}  # number of classes (required for multi-class objectives)
model = xgb.train(params=params, dtrain=train_dataset)  # class labels must be integers starting from 0
model.predict(test_dataset).shape

(750, 7)
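
With `multi:softprob` each row of the prediction holds one probability per class, so turning it into a class label takes an extra `argmax` step and, if needed, a mapping back to the original labels. A minimal sketch, assuming a small hand-written probability matrix in place of the real model output and assuming the original labels are the integers 1 to 7 seen above:

```python
import numpy as np

# Hypothetical softprob output for 3 samples and 7 classes (each row sums to 1).
proba = np.array([
    [0.70, 0.10, 0.05, 0.05, 0.04, 0.03, 0.03],
    [0.05, 0.80, 0.05, 0.03, 0.03, 0.02, 0.02],
    [0.02, 0.03, 0.05, 0.10, 0.10, 0.10, 0.60],
])

original_classes = np.arange(1, 8)   # covtype labels 1..7, encoded as 0..6
encoded_pred = proba.argmax(axis=1)  # index of the most probable class per row
label_pred = original_classes[encoded_pred]  # map back to the original labels

print(encoded_pred)  # [0 1 6]
print(label_pred)    # [1 2 7]
```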

In [102]:
'''
booster [default= gbtree ]
    Which booster to use. Can be gbtree, gblinear or dart; gbtree and dart use tree based models while gblinear uses linear functions.
'''
params = {'objective': 'multi:softprob',
          "eval_metric": 'mlogloss',
          'num_class': 7,
          'booster': 'dart'}  # with booster='gblinear', the sklearn API can expose coef_ and intercept_
model = xgb.train(params=params, dtrain=train_dataset)
model.predict(test_dataset).shape

(750, 7)

In [103]:
params = {'objective': 'multi:softprob',
          "eval_metric": 'mlogloss',
          'num_class': 7,
          "nthread": -1,  # default to maximum number of threads available if not set
          # verbosity: Verbosity of printing messages. Valid values of 0 (silent), 1 (warning), 2 (info), and 3 (debug).
          "verbosity": 2
          }
model = xgb.train(params=params, dtrain=train_dataset)
model.predict(test_dataset).shape

[16:44:33] INFO: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/tree/updater_prune.cc:101: tree pruning end, 78 extra nodes, 0 pruned nodes, max_depth=6
[16:44:33] INFO: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/tree/updater_prune.cc:101: tree pruning end, 74 extra nodes, 0 pruned nodes, max_depth=6
[16:44:33] INFO: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/tree/updater_prune.cc:101: tree pruning end, 50 extra nodes, 0 pruned nodes, max_depth=6
[16:44:33] INFO: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/tree/updater_prune.cc:101: tree pruning end, 24 extra nodes, 0 pruned nodes, max_depth=5
[16:44:33] INFO: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/tree/updater_prune.cc:101: tree pruning end, 34 extra nodes, 0 pruned nodes, max_depth=6
[16:44:33] INFO: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/tree/updater_prune.cc:101: tree pruning end, 38 extra nodes, 0 

(750, 7)

In [104]:
params = {'objective': 'multi:softprob',
          "eval_metric": 'mlogloss',
          "verbosity": 0,
          'num_class': 7,
          # Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit
          'max_depth': 6  # default: max_depth=6
          }
model = xgb.train(params=params, dtrain=train_dataset)
model.predict(test_dataset).shape

(750, 7)

In [105]:
params = {'objective': 'multi:softprob',
          "eval_metric": 'mlogloss',
          'verbosity': 0,
          'num_class': 7,
          # Minimum sum of instance weight (hessian) needed in a child.
          'min_child_weight': 2  # default: min_child_weight=1
          }
model = xgb.train(params=params, dtrain=train_dataset)
model.predict(test_dataset).shape

(750, 7)
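
To make "sum of instance weight (hessian)" concrete: for logistic loss the hessian of one instance is `p * (1 - p)`, so a leaf full of confidently predicted instances carries very little weight. The sketch below assumes binary logistic loss for simplicity (not the softmax objective used in this notebook), and the helper name `hessian_sum_logistic` is made up for the illustration.

```python
import numpy as np

def hessian_sum_logistic(p):
    """Sum of the second-order gradients p * (1 - p) of logistic loss."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

uncertain_leaf = [0.5, 0.5, 0.5, 0.5]      # 4 instances, model is unsure
confident_leaf = [0.99, 0.99, 0.99, 0.99]  # 4 instances, model is confident

min_child_weight = 0.5
for name, leaf in [("uncertain", uncertain_leaf), ("confident", confident_leaf)]:
    h = hessian_sum_logistic(leaf)
    # A child is only allowed if its hessian sum reaches min_child_weight.
    print(name, round(h, 4), "kept" if h >= min_child_weight else "rejected")
# uncertain 1.0 kept
# confident 0.0396 rejected
```

For squared-error regression every hessian equals 1, so in that case `min_child_weight` is simply a minimum number of instances per leaf.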

In [106]:
'''
gamma [default=0, alias: min_split_loss]
    Minimum loss reduction required to make a further partition on a leaf node of the tree.
    The larger gamma is, the more conservative the algorithm will be.
'''
params = {'objective': 'multi:softprob',
          "eval_metric": 'mlogloss',
          'verbosity': 0,
          'num_class': 7,
          'gamma': 1
          }
model = xgb.train(params=params, dtrain=train_dataset)
model.predict(test_dataset).shape

(750, 7)
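
The effect of `gamma` can be sketched with the split-gain formula from the XGBoost paper, where `G` and `H` are the sums of first- and second-order gradients in a node. The numbers and function names below are made up for illustration, and the library's internal scaling of the gain may differ from this textbook form.

```python
def leaf_score(G, H, reg_lambda):
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return G * G / (H + reg_lambda)

def split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of splitting a node into left and right children, minus gamma."""
    gain = 0.5 * (leaf_score(GL, HL, reg_lambda)
                  + leaf_score(GR, HR, reg_lambda)
                  - leaf_score(GL + GR, HL + HR, reg_lambda))
    return gain - gamma

# Hypothetical gradient statistics of the two children.
GL, HL, GR, HR = -4.0, 3.0, 5.0, 4.0

print(split_gain(GL, HL, GR, HR, gamma=0.0))  # 4.4375
print(split_gain(GL, HL, GR, HR, gamma=1.0))  # 3.4375, still worth splitting
print(split_gain(GL, HL, GR, HR, gamma=5.0))  # -0.5625, split is rejected
```

A split is only kept when the value stays positive, which is why a larger `gamma` gives a more conservative tree.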

In [107]:
'''
eta [default=0.3, alias: learning_rate]
    Step size shrinkage used in each update to prevent overfitting.
'''
params = {'objective': 'multi:softprob',
          "eval_metric": 'mlogloss',
          'verbosity': 0,
          'num_class': 7,
          'eta': 0.01
          }
model = xgb.train(params=params, dtrain=train_dataset)
model.predict(test_dataset).shape

(750, 7)

In [108]:
'''
subsample [default=1]
    Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data before growing each tree, which helps prevent overfitting. Subsampling occurs once in every boosting iteration.
    range: (0,1]
'''
params = {'objective': 'multi:softprob',
          "eval_metric": 'mlogloss',
          'verbosity': 0,
          'num_class': 7,
          'subsample': 0.8
          }
model = xgb.train(params=params, dtrain=train_dataset)
model.predict(test_dataset).shape

(750, 7)

In [109]:
# colsample_bytree is the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.
params = {'objective': 'multi:softprob',
          "eval_metric": 'mlogloss',
          'verbosity': 0,
          'num_class': 7,
          'colsample_bytree': 0.9
          }
model = xgb.train(params=params, dtrain=train_dataset)
model.predict(test_dataset).shape

(750, 7)

In [None]:
# colsample_bylevel is the subsample ratio of columns for each level.
# Subsampling occurs once for every new depth level reached in a tree.
# Columns are subsampled from the set of columns chosen for the current tree.
params = {'objective': 'multi:softprob',
          "eval_metric": 'mlogloss',
          'verbosity': 0,
          'num_class': 7,
          'colsample_bylevel': 0.9
          }
model = xgb.train(params=params, dtrain=train_dataset)
model.predict(test_dataset).shape
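
Because the columns for each level are drawn from the columns already chosen for the tree, the two ratios compound. The following NumPy sketch mimics that nesting for the 54 covtype features. It is not XGBoost's own sampler: the rounding rule (truncate, but keep at least one column) and the helper name `sample_columns` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 54
colsample_bytree = 0.9
colsample_bylevel = 0.9

def sample_columns(columns, ratio, rng):
    """Draw a random subset of `columns` without replacement."""
    k = max(1, int(ratio * len(columns)))  # assumed rounding rule
    return np.sort(rng.choice(columns, size=k, replace=False))

# Sampled once per tree, from all features.
tree_cols = sample_columns(np.arange(n_features), colsample_bytree, rng)
# Sampled again at each new depth level, from the tree's columns only.
level_cols = sample_columns(tree_cols, colsample_bylevel, rng)

print(len(tree_cols), len(level_cols))  # 48 43
```

Under these assumptions roughly 0.9 * 0.9 = 81% of the features remain available at a given level.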

In [110]:
'''
alpha [default=0, alias: reg_alpha]
    L1 regularization term on weights. Increasing this value will make the model more conservative.
'''
params = {'objective': 'multi:softprob',
          "eval_metric": 'mlogloss',
          'verbosity': 0,
          'num_class': 7,
          'alpha': 0.1
          }
model = xgb.train(params=params, dtrain=train_dataset)
model.predict(test_dataset).shape

(750, 7)

In [111]:
'''
lambda [default=1, alias: reg_lambda]
    L2 regularization term on weights.
    Increasing this value will make the model more conservative. (For booster='gblinear' the default is 0 and the term is normalised to the number of training examples.)
'''
params = {'objective': 'multi:softprob',
          "eval_metric": 'mlogloss',
          'verbosity': 0,
          'num_class': 7,
          'lambda': 0.1
          }
model = xgb.train(params=params, dtrain=train_dataset)
model.predict(test_dataset).shape

(750, 7)
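
How `alpha` and `lambda` act on a tree can be sketched through the optimal leaf weight: `lambda` enlarges the denominator, while `alpha` shrinks the gradient sum towards zero (soft thresholding). The function name and numbers below are made up for illustration. This is the textbook form, and the value stored in a real tree is additionally scaled by `eta`.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight for gradient sum G and hessian sum H."""
    if abs(G) <= reg_alpha:
        return 0.0  # L1 term pushes small gradient sums to exactly zero
    # Shrink the gradient sum towards zero by alpha (soft thresholding).
    shrunk = G - reg_alpha if G > 0 else G + reg_alpha
    return -shrunk / (H + reg_lambda)

G, H = -4.0, 3.0  # hypothetical gradient statistics of one leaf

print(leaf_weight(G, H, reg_lambda=0.0))                 # 1.3333333333333333
print(leaf_weight(G, H, reg_lambda=1.0))                 # 1.0
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=1.0))  # 0.75
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=5.0))  # 0.0
```

Both terms reduce the magnitude of the leaf value, which is the sense in which larger values make the model more conservative.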