## XGBoost Quick Start: Using XGBoost with scikit-learn (Cross-Validation)

In [1]:
# Run the example program shipped with the xgboost installation package
import xgboost as xgb
from xgboost import XGBClassifier

# Loader for libsvm-format data
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

import matplotlib.pyplot as plt
%matplotlib inline

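## Reading the data
scikit-learn supports several data formats, including LibSVM. XGBoost can also load text data in libsvm format. The libsvm file format (sparse features) looks like this:

    1 101:1.2 102:0.03
    0 1:2.1 10001:300 10002:400
    ...

Each line represents one sample. The "1" at the start of the first line is that sample's label; "101" and "102" are feature indices, and "1.2" and "0.03" are the corresponding feature values. In binary classification, "1" marks a positive sample and "0" a negative sample. A label may also be a probability in [0, 1], giving the probability that the sample is positive.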
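
To make the format concrete, here is a minimal sketch of a parser for a single libsvm line, written in plain Python. The helper name `parse_libsvm_line` is hypothetical and is not part of scikit-learn or XGBoost; in this notebook the actual loading is done by `load_svmlight_file`, which returns a sparse matrix rather than dictionaries.

```python
def parse_libsvm_line(line):
    """Parse one libsvm-format line into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])            # first token is the label
    features = {}
    for token in tokens[1:]:            # remaining tokens are index:value pairs
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features

# The two sample lines from the format description.
sample_lines = ["1 101:1.2 102:0.03", "0 1:2.1 10001:300 10002:400"]
parsed = [parse_libsvm_line(line) for line in sample_lines]
for label, features in parsed:
    print(label, features)
```

Only the features that appear on a line are stored, which is why the format suits sparse data: any index that is not listed is implicitly zero.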

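In the sample data below, the task is to decide from a mushroom's attributes whether the species is poisonous. See the [UCI data description](http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/). Each sample describes 22 attributes of a mushroom, such as shape and odor (which become 126 features after conversion to libsvm format), together with whether the mushroom is edible. Of these samples, 6513 are used for training and 1611 for testing.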

In [2]:
# Read in the data. The files ship in the demo directory of the xgboost installation; here they have been copied into the data directory next to the code.
my_workpath = './data/'
X_train, y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')

## Building the model

In [3]:
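# n_estimators (the number of boosting rounds) is not set here; the grid search below tunes it.
# verbosity=0 replaces the deprecated silent=True.
bst = XGBClassifier(max_depth=2, learning_rate=0.1, verbosity=0, objective='binary:logistic')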

## Cross-validation
This can be fairly slow :(
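
A rough way to see where the time goes is to count model fits. The sketch below assumes the grid used in this section (`n_estimators` from 1 to 50, `cv=5`) and scikit-learn's default `refit=True`, which trains one final model on the full training set after the search.

```python
# Count how many times GridSearchCV trains a model for this search.
n_candidates = len(range(1, 51, 1))   # n_estimators = 1, 2, ..., 50
n_folds = 5                           # cv=5

n_cv_fits = n_candidates * n_folds    # one fit per candidate per fold
n_total_fits = n_cv_fits + 1          # plus the final refit (assumes refit=True)

print(n_cv_fits, n_total_fits)
```

Setting `return_train_score=True` adds no extra fits, but it does add the cost of scoring every fitted model on its training folds as well.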

In [4]:
# stratified k-fold cross validation evaluation of xgboost model
param_test = {'n_estimators': range(1, 51, 1)}
clf = GridSearchCV(estimator=bst, param_grid=param_test, scoring='accuracy', cv=5, return_train_score=True)
clf.fit(X_train, y_train)
clf.cv_results_, clf.best_params_, clf.best_score_

({'mean_fit_time': array([0.02912192, 0.03091731, 0.03291445, 0.03510628, 0.03829732,
         0.04009233, 0.04029179, 0.04228668, 0.04428148, 0.04567795,
         0.04807158, 0.05026493, 0.05166383, 0.05325747, 0.0550528 ,
         0.05744624, 0.05964022, 0.06343026, 0.06362963, 0.06522551,
         0.06662149, 0.06901531, 0.07061124, 0.07260542, 0.07539549,
         0.0763957 , 0.07958689, 0.08257885, 0.08218007, 0.08457384,
         0.08636875, 0.08776522, 0.08956041, 0.09215355, 0.09394717,
         0.09913492, 0.09793792, 0.10192728, 0.1035182 , 0.10352287,
         0.10711341, 0.1077116 , 0.1132966 , 0.11289806, 0.11329689,
         0.11469297, 0.11628895, 0.11948047, 0.12107615, 0.12705994]),
  'std_fit_time': array([9.77369407e-04, 6.30901829e-04, 6.31591281e-04, 7.46327472e-04,
         1.84952454e-03, 7.45754635e-04, 7.98106227e-04, 4.88480359e-04,
         7.97987020e-04, 3.98898135e-04, 3.99136582e-04, 7.97403298e-04,
         1.16527653e-03, 4.88519261e-04, 7.46595138e-04,
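
The comment in the cross-validation cell mentions stratified k-fold. With an integer `cv` and a classifier, scikit-learn splits the data so that each fold keeps roughly the same class ratio as the whole training set. The following is a simplified sketch of that idea in plain Python, not scikit-learn's actual implementation; the function `stratified_fold_ids` and the toy labels are hypothetical.

```python
from collections import defaultdict

def stratified_fold_ids(labels, n_folds):
    """Assign each sample a fold id so every fold keeps roughly the class ratio."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        # Deal the samples of one class round-robin across the folds.
        for position, i in enumerate(indices):
            fold_of[i] = position % n_folds
    return fold_of

# Hypothetical toy labels: 10 positives and 5 negatives (ratio 2:1).
toy_labels = [1] * 10 + [0] * 5
fold_of = stratified_fold_ids(toy_labels, n_folds=5)
for fold in range(5):
    held_out = [toy_labels[i] for i, f in enumerate(fold_of) if f == fold]
    print(fold, held_out)
```

Every held-out fold here contains two positives and one negative, so the accuracy measured on each fold is computed on the same class balance as the full set.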

## Testing

In [5]:
# make prediction
preds = clf.predict(X_test)
predictions = [round(value) for value in preds]

test_accuracy = accuracy_score(y_test, predictions)
print('Test Accuracy of gridsearchcv: %.2f%%' % (test_accuracy * 100.0))

Test Accuracy of gridsearchcv: 97.27%
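
For reference, the accuracy reported here is the fraction of test samples whose predicted label exactly matches the true label. A minimal plain-Python sketch of that definition follows; the helper `accuracy` and the toy labels are hypothetical and are not taken from the mushroom data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical toy labels: 4 of the 5 predictions are correct.
toy_true = [1.0, 0.0, 1.0, 1.0, 0.0]
toy_pred = [1.0, 0.0, 0.0, 1.0, 0.0]
toy_accuracy = accuracy(toy_true, toy_pred)
print('Toy accuracy: %.2f%%' % (toy_accuracy * 100.0))
```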
