# XGBoost Model
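* This notebook trains an XGBoost model; the trained model is saved in the ```Model``` directory.
* Before running this notebook, make sure the prerequisite script ```trainPrep.py``` has run successfully and produced the feature file ```train_data.csv```, and that both files are in the same directory as this notebook.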
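
The prerequisite in the list above can be checked before any training starts. The following is a minimal sketch, not part of the original notebook; the helper name `missing_inputs` is hypothetical, and it only assumes that `train_data.csv` should sit in the working directory.

```python
from pathlib import Path

def missing_inputs(required, base_dir="."):
    """Return the required file names that are absent from base_dir."""
    base = Path(base_dir)
    return [name for name in required if not (base / name).is_file()]

# Assumed usage: report a missing feature file before loading it.
missing = missing_inputs(["train_data.csv"])
if missing:
    print("Missing input files:", missing)
```

An empty list means every listed file was found, so the later cells can read the data safely.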

## 1. Preparation
Import the required modules.

In [1]:
import warnings
warnings.filterwarnings('ignore') # suppress all warnings

import time
import xgboost as xgb
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split

## 2. Load Data & Standardization
Read the data and standardize it with z-scores.

In [2]:
# Read the data
train_ = pd.read_csv('train_data.csv')
# Fill missing values with 0
train_ = train_.fillna(0)
# Define the standardization function
def standardization(df):
    newDataFrame = pd.DataFrame(index=df.index)
    columns = df.columns.tolist()
    for c in columns:
        if (c == 'label'):
            # Leave the class label unchanged
            newDataFrame[c] = df[c].tolist()
        else:
            # z-score: subtract the column mean, divide by the column standard deviation
            d = df[c]
            newDataFrame[c] = ((d - np.mean(d)) / (np.std(d))).tolist()
    return newDataFrame
# Standardize the data
train_data = standardization(train_)
# Extract the features and the class labels
label = train_data['label'] # labels start at 0; 0, 1, 2, 3 denote the 4 emotions
feature = train_data.drop(['label'],axis=1)
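
The column-by-column loop above can also be written as one vectorized expression. The sketch below is an alternative formulation, not the notebook's own code: the function name `zscore_features` and the toy data are hypothetical, it uses the population standard deviation (`ddof=0`, the default of `np.std`), and it adds one assumption the original does not make, namely that constant columns are mapped to 0 instead of producing NaN.

```python
import numpy as np
import pandas as pd

def zscore_features(df, label_col="label"):
    """Z-score every column except label_col, using the population std (ddof=0)."""
    features = df.drop(columns=[label_col])
    # Assumption: a constant column has std 0; divide by 1 so it becomes 0, not NaN.
    std = features.std(ddof=0).replace(0, 1.0)
    scaled = (features - features.mean()) / std
    scaled[label_col] = df[label_col]  # the label is appended unchanged as the last column
    return scaled

# Hypothetical toy data, only to exercise the function.
toy = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0],
                    "f2": [5.0, 5.0, 5.0, 5.0],
                    "label": [0, 1, 2, 3]})
scaled = zscore_features(toy)
```

After scaling, each non-constant feature column has mean 0 and population standard deviation 1, which is the property the z-score step is meant to provide.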

## 3. Model Training
Train the XGBoost model and adjust its hyperparameters.

In [3]:
now = time.time()
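# Hyperparameters
paras={
    'booster':'gbtree',
    'objective':'multi:softmax', # multiclass classification with the softmax objective (outputs the predicted class)
    'num_class':4, # number of classes, required with multi:softmax
    'gamma':0.015, # minimum loss reduction required to split a leaf node further
    'max_depth':20, # maximum tree depth
    'lambda':40, # L2 regularization weight
    'subsample':0.6, # fraction of the training samples drawn for each tree
    'colsample_bytree':0.7, # fraction of the features sampled when building each tree
    'min_child_weight':22, # minimum sum of instance weights (hessian) required in a child node
    'silent':1, # suppress messages (older XGBoost releases; newer releases use 'verbosity')
    'eta':0.5, # learning rate: step-size shrinkage applied to each update to curb overfitting
    'seed':123,
    'nthread':4, # number of CPU threads
}
# Collect the hyperparameters as a list of (name, value) pairs
plst=list(paras.items())

# Split the data into a training set (90%) and a validation set (10%)
X_train,x_val,Y_train,y_val = train_test_split(feature,label,test_size=0.1,random_state=156)

# Maximum number of boosting rounds
num_rounds=5000

# Wrap the data in DMatrix objects
xgtrain=xgb.DMatrix(X_train, label=Y_train) # training features and labels
xgval=xgb.DMatrix(x_val,label=y_val) # validation features and labels

# Report the error rate on both the training set and the validation set
watchlist =[(xgtrain,'train'),(xgval,'val')]

# Train the XGBoost model; stop early if val-merror has not improved in 100 rounds
model = xgb.train(plst,xgtrain,num_rounds,evals=watchlist,early_stopping_rounds=100)

# Report the training time
cost_time=time.time()-now
print("end...",'\n',"cost time",cost_time,"(s)...")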

[0]	train-merror:0.546347	val-merror:0.605195
Multiple eval metrics have been passed: 'val-merror' will be used for early stopping.

Will train until val-merror hasn't improved in 100 rounds.
[1]	train-merror:0.493214	val-merror:0.579221
[2]	train-merror:0.464915	val-merror:0.592208
[3]	train-merror:0.436616	val-merror:0.592208
[4]	train-merror:0.41207	val-merror:0.574026
[5]	train-merror:0.402252	val-merror:0.563636
[6]	train-merror:0.375975	val-merror:0.535065
[7]	train-merror:0.359226	val-merror:0.563636
[8]	train-merror:0.338435	val-merror:0.563636
[9]	train-merror:0.322841	val-merror:0.563636
[10]	train-merror:0.309269	val-merror:0.558442
[11]	train-merror:0.304649	val-merror:0.548052
[12]	train-merror:0.293965	val-merror:0.568831
[13]	train-merror:0.286168	val-merror:0.550649
[14]	train-merror:0.263355	val-merror:0.550649
[15]	train-merror:0.258158	val-merror:0.550649
[16]	train-merror:0.255848	val-merror:0.542857
[17]	train-merror:0.241409	val-merror:0.548052
...

[171]	train-merror:0.002021	val-merror:0.509091
[172]	train-merror:0.00231	val-merror:0.506494
[173]	train-merror:0.00231	val-merror:0.503896
[174]	train-merror:0.00231	val-merror:0.501299
[175]	train-merror:0.00231	val-merror:0.503896
[176]	train-merror:0.00231	val-merror:0.503896
[177]	train-merror:0.00231	val-merror:0.509091
[178]	train-merror:0.00231	val-merror:0.506494
[179]	train-merror:0.002021	val-merror:0.511688
[180]	train-merror:0.002021	val-merror:0.509091
[181]	train-merror:0.00231	val-merror:0.511688
[182]	train-merror:0.002021	val-merror:0.503896
[183]	train-merror:0.00231	val-merror:0.509091
[184]	train-merror:0.001733	val-merror:0.506494
[185]	train-merror:0.002021	val-merror:0.506494
[186]	train-merror:0.002021	val-merror:0.511688
[187]	train-merror:0.002021	val-merror:0.511688
[188]	train-merror:0.001733	val-merror:0.509091
[189]	train-merror:0.001733	val-merror:0.511688
[190]	train-merror:0.001733	val-merror:0.514286
[191]	train-merror:0.001733	val-merror:0.519481
...
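
To help read the log above, here is a minimal sketch of the two ideas it relies on: `merror` is the fraction of misclassified samples, and early stopping ends training once the validation score has gone a fixed number of rounds without improving. The function names are hypothetical and this is not XGBoost's internal code. The example reuses the first ten `val-merror` values from the log, with a deliberately small patience of 3 instead of the notebook's 100.

```python
import numpy as np

def merror(y_true, y_pred):
    """Multiclass error rate: fraction of samples whose predicted class is wrong."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def early_stop_round(val_scores, patience):
    """Return (stop_round, best_round) for a lower-is-better metric.

    Training stops at the first round that lies `patience` rounds past the
    best score seen so far; if that never happens, it stops at the last round.
    """
    best_round = 0
    for i, score in enumerate(val_scores):
        if score < val_scores[best_round]:
            best_round = i
        elif i - best_round >= patience:
            return i, best_round
    return len(val_scores) - 1, best_round

# val-merror for rounds 0 to 9, copied from the training log.
val_log = [0.605195, 0.579221, 0.592208, 0.592208, 0.574026,
           0.563636, 0.535065, 0.563636, 0.563636, 0.563636]

print(merror([0, 1, 2, 3], [0, 1, 1, 3]))     # 0.25: one of four predictions is wrong
print(early_stop_round(val_log, patience=3))  # (9, 6): best at round 6, stop at round 9
```

In the log, `train-merror` falls close to 0 while `val-merror` stays around 0.5, so the validation column is the one that drives early stopping.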

## 4. Save The Model
Save the trained model to the ```Model``` directory under the file name ```XGB.pickle.dat```.

In [4]:
# Save the model (the Model directory must already exist)
with open("Model/XGB.pickle.dat", "wb") as f:
    pickle.dump(model, f)
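
A saved model is only useful if it can be read back. The sketch below shows a pickle round trip with two hypothetical helpers, `save_pickle` and `load_pickle`; it uses a stand-in dictionary in a temporary directory so it runs without a trained model, and in the notebook the object would be `model`. XGBoost also offers its own `Booster.save_model` and `Booster.load_model`, which may be preferable when the file has to work across library versions.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Pickle obj to path, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load a pickled object; only use this on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip with a stand-in object; the directory layout mirrors Model/XGB.pickle.dat.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Model", "XGB.pickle.dat")
    save_pickle({"stand_in": "model"}, path)
    restored = load_pickle(path)
```

Creating the directory inside the helper avoids a `FileNotFoundError` when ```Model``` does not exist yet.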