## XGBoost Usage Cheat Sheet
### by Han Xiaoyang, NetEase Cloud Classroom x Xiniu College "Machine Learning Engineer" micro-specialization

#### 1. Read libsvm-format data and train a model with specified parameters

In [18]:
# coding: utf-8
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

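# Load the dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
print('Load data...')

df_train = load_boston()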

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df_train.data, df_train.target, test_size=0.25, random_state=42
)

# Preprocessing: standardize the features and the target
ss_X = StandardScaler()
ss_y = StandardScaler()
X_train = ss_X.fit_transform(X_train)
X_test = ss_X.transform(X_test)
y_train = ss_y.fit_transform(y_train.reshape(-1, 1))
y_test = ss_y.transform(y_test.reshape(-1, 1))

# Build XGBoost DMatrix objects
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)

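# Hyperparameters
# (the number of trees is set through num_round below, not in this dict)
param = {
    'max_depth': 8,
    'eta': 0.25,
    'verbosity': 0,                   # replaces the deprecated 'silent': 1
    'objective': 'reg:squarederror'   # current name of the deprecated 'reg:linear'
}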

# Watchlist: datasets whose metric is printed after every boosting round
watchlist  = [(dtest,'eval'), (dtrain,'train')]
num_round = 20
bst = xgb.train(param, dtrain, num_round, watchlist)

# Predict with the trained model
preds = bst.predict(dtest)

# Save the model
bst.save_model('./0001.model')

Load data...
[0]	eval-rmse:0.864868	train-rmse:0.868291
[1]	eval-rmse:0.68519	train-rmse:0.680458
[2]	eval-rmse:0.557824	train-rmse:0.537732
[3]	eval-rmse:0.469427	train-rmse:0.427458
[4]	eval-rmse:0.40723	train-rmse:0.343512
[5]	eval-rmse:0.368745	train-rmse:0.278701
[6]	eval-rmse:0.343565	train-rmse:0.228765
[7]	eval-rmse:0.329269	train-rmse:0.190318
[8]	eval-rmse:0.319058	train-rmse:0.158885
[9]	eval-rmse:0.311615	train-rmse:0.133616
[10]	eval-rmse:0.305831	train-rmse:0.114225
[11]	eval-rmse:0.302456	train-rmse:0.099174
[12]	eval-rmse:0.298727	train-rmse:0.087801
[13]	eval-rmse:0.297225	train-rmse:0.076566
[14]	eval-rmse:0.296927	train-rmse:0.069883
[15]	eval-rmse:0.295933	train-rmse:0.062789
[16]	eval-rmse:0.294876	train-rmse:0.057303
[17]	eval-rmse:0.292889	train-rmse:0.052135
[18]	eval-rmse:0.292624	train-rmse:0.045967
[19]	eval-rmse:0.291023	train-rmse:0.039977


#### 3. Using XGBoost's scikit-learn wrapper

In [16]:
#!/usr/bin/python
import warnings
warnings.filterwarnings("ignore")
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23


# Basic example: load the Boston housing data and fit a regressor
data = load_boston()

# Split into training and test sets
train_X, test_X, train_y, test_y = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)

# Preprocessing: standardize the features and the target
ss_X = StandardScaler()
ss_y = StandardScaler()
train_X = ss_X.fit_transform(train_X)
test_X = ss_X.transform(test_X)
train_y = ss_y.fit_transform(train_y.reshape(-1, 1))
test_y = ss_y.transform(test_y.reshape(-1, 1))

# Initialize the model
xgb_regressor = xgb.XGBRegressor(
    n_estimators=30,
    max_depth=8,
    learning_rate=0.25,
    subsample=0.9,
    colsample_bytree=0.9,
    verbosity=0,  # suppress XGBoost log messages
)

# Fit the model; RMSE is the default evaluation metric for squared-error regression
xgb_regressor.fit(train_X, train_y,
                  eval_set=[(train_X, train_y), (test_X, test_y)], verbose=True)

# Predict with the trained model
preds = xgb_regressor.predict(test_X)

# Save the model
joblib.dump(xgb_regressor, './0003.model')

[0]	validation_0-rmse:0.877011	validation_1-rmse:0.895387
[1]	validation_0-rmse:0.691485	validation_1-rmse:0.719628
[2]	validation_0-rmse:0.554144	validation_1-rmse:0.598901
[3]	validation_0-rmse:0.442405	validation_1-rmse:0.493608
[4]	validation_0-rmse:0.361971	validation_1-rmse:0.426578
[5]	validation_0-rmse:0.295493	validation_1-rmse:0.383583
[6]	validation_0-rmse:0.245715	validation_1-rmse:0.355404
[7]	validation_0-rmse:0.205383	validation_1-rmse:0.337687
[8]	validation_0-rmse:0.178193	validation_1-rmse:0.325121
[9]	validation_0-rmse:0.153158	validation_1-rmse:0.319087
[10]	validation_0-rmse:0.134972	validation_1-rmse:0.315769
[11]	validation_0-rmse:0.118034	validation_1-rmse:0.31032
[12]	validation_0-rmse:0.102505	validation_1-rmse:0.304502
[13]	validation_0-rmse:0.092036	validation_1-rmse:0.303424
[14]	validation_0-rmse:0.083077	validation_1-rmse:0.300918
[15]	validation_0-rmse:0.075729	validation_1-rmse:0.299148
[16]	validation_0-rmse:0.070185	validation_1-rmse:0.298403
[17]	val

['./0003.model']

#### 2. Grid search for the best hyperparameters

In [14]:
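from sklearn.model_selection import GridSearchCV

print("Hyperparameter search:")
y = train_y
X = train_X
xgb_model = xgb.XGBRegressor()
# cv=3 matches the 3-fold run shown below (newer scikit-learn defaults to 5 folds)
clf = GridSearchCV(xgb_model,
                   {'max_depth': [2, 4, 6],
                    'n_estimators': [50, 100, 200]},
                   cv=3, verbose=1)
clf.fit(X, y)
print(clf.best_score_)  # mean cross-validated R^2 of the best candidate
print(clf.best_params_)

Hyperparameter search: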
Fitting 3 folds for each of 9 candidates, totalling 27 fits
0.8331697180655884
{'max_depth': 2, 'n_estimators': 200}


[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    0.9s finished
