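## Lab Overview

### 1. Lab Contents

This lab covers:
* Understanding the concept of ensemble learning and the common ensemble learning algorithms.
* Using ensemble learning algorithms to predict, on the horse colic dataset, whether a horse with colic will survive.
* Reference: Chapter 7 of Machine Learning in Action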

### 2. Lab Environment

* python 3.6.5
* numpy 1.13.3
* pandas 0.23.4 

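### 3. Data Description

* The dataset is stored in two files, horseColicTrain.txt and horseColicTest.txt, which hold the training data and the test data respectively.
* Each data file has 22 columns; the last column is the class label.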

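### 4. Lab Preparation

Click the experiment-data download module at the top right of the screen and download HorseColicData.tgz to a directory of your choice. Then choose File->Open->Upload at the top to upload the downloaded archive, and extract it with the following command (adjust the archive name and paths to match your upload):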

In [1]:
# !tar -zxvf ./work/horseColic.tgz  -C ./dataset/HorseColic/

## Experiment

### 1. Import the required libraries

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import os

### 2. Inspect the dataset

In [3]:
horse_train_path = './dataset/HorseColic/horseColicTrain.txt'
horse_test_path = './dataset/HorseColic/horseColicTest.txt'

horse_train = pd.read_table(horse_train_path, header=None)
horse_test = pd.read_table(horse_test_path, header=None)

horse_train[21] = horse_train[21].astype(int)
horse_test[21] = horse_test[21].astype(int)

In [4]:
horse_train.loc[horse_train[21] == -1, 21] = 0
horse_test.loc[horse_test[21] == -1, 21] = 0

In [5]:
horse_train.head()

| |0|1|2|3|4|5|6|7|8|9|...|12|13|14|15|16|17|18|19|20|21|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|2.0|1.0|38.5|66.0|28.0|3.0|3.0|0.0|2.0|5.0|...|0.0|0.0|0.0|3.0|5.0|45.0|8.4|0.0|0.0|0|
|1|1.0|1.0|39.2|88.0|20.0|0.0|0.0|4.0|1.0|3.0|...|0.0|0.0|0.0|4.0|2.0|50.0|85.0|2.0|2.0|0|
|2|2.0|1.0|38.3|40.0|24.0|1.0|1.0|3.0|1.0|3.0|...|0.0|0.0|0.0|1.0|1.0|33.0|6.7|0.0|0.0|1|
|3|1.0|9.0|39.1|164.0|84.0|4.0|1.0|6.0|2.0|2.0|...|1.0|2.0|5.0|3.0|0.0|48.0|7.2|3.0|5.3|0|
|4|2.0|1.0|37.3|104.0|35.0|0.0|0.0|6.0|2.0|0.0|...|0.0|0.0|0.0|0.0|0.0|74.0|7.4|0.0|0.0|0|


In [6]:
horse_test.head()

| |0|1|2|3|4|5|6|7|8|9|...|12|13|14|15|16|17|18|19|20|21|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|2.0|1.0|38.5|54.0|20.0|0.0|1.0|2.0|2.0|3.0|...|2.0|2.0|5.9|0.0|2.0|42.0|6.3|0.0|0.0|1|
|1|2.0|1.0|37.6|48.0|36.0|0.0|0.0|1.0|1.0|0.0|...|0.0|0.0|0.0|0.0|0.0|44.0|6.3|1.0|5.0|1|
|2|1.0|1.0|37.7|44.0|28.0|0.0|4.0|3.0|2.0|5.0|...|1.0|1.0|0.0|3.0|5.0|45.0|70.0|3.0|2.0|1|
|3|1.0|1.0|37.0|56.0|24.0|3.0|1.0|4.0|2.0|4.0|...|1.0|1.0|0.0|0.0|0.0|35.0|61.0|3.0|2.0|0|
|4|2.0|1.0|38.0|42.0|12.0|3.0|0.0|3.0|1.0|1.0|...|0.0|0.0|0.0|0.0|2.0|37.0|5.8|0.0|0.0|1|


In [7]:
X_train = horse_train.iloc[:, :-1].values
y_train = horse_train.iloc[:, -1].values

X_test = horse_test.iloc[:, :-1].values
y_test = horse_test.iloc[:, -1].values

In [8]:
X_train.shape

(299, 21)

In [9]:
y_train.shape

(299,)

## 3. Boosting

### 3.1 AdaBoost

In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

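> By default, `AdaBoostClassifier` uses `DecisionTreeClassifier(max_depth=1)` as its base learner; the cell below uses a deeper tree (`max_depth=3`) instead.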

In [11]:
base_estimator = DecisionTreeClassifier(max_depth=3)

In [12]:
adaboost = AdaBoostClassifier(n_estimators=50, random_state=0, base_estimator=base_estimator)
adaboost.fit(X_train, y_train)



In [13]:
print("Accuracy on training set: {:.3f}".format(adaboost.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(adaboost.score(X_test, y_test)))

Accuracy on training set: 0.997
Accuracy on test set: 0.836


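> Use an SVM as the base learner (`algorithm='SAMME'` is set because `SVC` does not provide `predict_proba` unless `probability=True`)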

In [14]:
svc = SVC(kernel='linear', C=1.0, random_state=0)
adaboost = AdaBoostClassifier(base_estimator=svc, n_estimators=50, random_state=0, algorithm='SAMME')
adaboost.fit(X_train, y_train)



In [15]:
print("Accuracy on training set: {:.3f}".format(adaboost.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(adaboost.score(X_test, y_test)))

Accuracy on training set: 0.732
Accuracy on test set: 0.761


In [16]:
svc = SVC(kernel='rbf', C=0.1, random_state=0)
adaboost = AdaBoostClassifier(base_estimator=svc, n_estimators=50, random_state=0, algorithm='SAMME')
adaboost.fit(X_train, y_train)



In [17]:
print("Accuracy on training set: {:.3f}".format(adaboost.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(adaboost.score(X_test, y_test)))

Accuracy on training set: 0.595
Accuracy on test set: 0.701


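> `AdaBoostClassifier` accepts a single base estimator and fits a fresh copy of it in every round, so all weak learners in one ensemble are of the same type: for example, all decision trees or all SVMs. The sample-weight update and the weighted combination of weak classifiers both depend on the base learners' outputs. If base learners of different types were mixed, their outputs would not necessarily be comparable, and the weight-update and combination strategy could not be applied effectively across them.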
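
To make the weight update and the weighted combination concrete, here is a minimal from-scratch sketch of discrete AdaBoost with decision stumps. It is not scikit-learn's implementation: the helper names (`fit_stump`, `stump_predict`, `adaboost_fit`, `adaboost_predict`) and the toy data are assumptions made for illustration, and labels are taken to be in {-1, +1} rather than the 0/1 encoding used in this notebook.

```python
import numpy as np

def stump_predict(X, j, thr, polarity):
    """Predict +1/-1 from one feature j, a threshold, and a direction."""
    return np.where(polarity * (X[:, j] - thr) > 0, 1, -1)

def fit_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)  # (weighted error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = stump_predict(X, j, thr, polarity)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost_fit(X, y, n_rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)  # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, polarity = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this stump
        pred = stump_predict(X, j, thr, polarity)
        w = w * np.exp(-alpha * y * pred)      # up-weight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, j, thr, polarity))
    return ensemble

def adaboost_predict(X, ensemble):
    """Sign of the alpha-weighted sum of stump outputs."""
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy data (hypothetical): a diagonal boundary that no single stump can match.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, 1, -1)

ensemble = adaboost_fit(X_toy, y_toy, n_rounds=20)
train_acc = np.mean(adaboost_predict(X_toy, ensemble) == y_toy)
first_acc = np.mean(stump_predict(X_toy, *ensemble[0][1:]) == y_toy)
print("first stump accuracy: {:.3f}".format(first_acc))
print("ensemble accuracy:    {:.3f}".format(train_acc))
```

Every round relies on the same two things from the weak learner: hard +1/-1 predictions and a weighted error. That shared contract is what lets the update rule treat all rounds uniformly.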

### 3.2 Gradient Boosting

In [18]:
from sklearn.ensemble import GradientBoostingClassifier

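> `GradientBoostingClassifier` always builds its boosting stages from regression trees (`max_depth=3` by default). The `init` argument used below does not replace these trees; it only sets the estimator that produces the initial prediction, which the trees then correct.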

In [19]:
base_estimator = DecisionTreeClassifier(max_depth=3)
gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.01, random_state=0, init=base_estimator)
gb.fit(X_train, y_train)

In [20]:
print("Accuracy on training set: {:.3f}".format(gb.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gb.score(X_test, y_test)))

Accuracy on training set: 0.836
Accuracy on test set: 0.731
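
In scikit-learn's `GradientBoostingClassifier`, the `init` estimator only supplies the starting prediction; the boosting stages that follow are regression trees fitted to the negative gradient of the loss. The sketch below illustrates that split of roles under simplifying assumptions: one feature, regression stumps instead of depth-3 trees, plain gradient steps on the log-loss, and a constant class-prior probability standing in for the `init` estimator. The helper names (`fit_regression_stump`, `gradient_boost`) and the toy data are hypothetical, not part of any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, F):
    """Mean binary log-loss for raw scores F (log-odds)."""
    p = np.clip(sigmoid(F), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_regression_stump(x, residual):
    """Least-squares stump on a 1-D feature: (threshold, left value, right value)."""
    best = None
    for thr in np.unique(x)[:-1]:  # drop the max so the right side is never empty
        left, right = residual[x <= thr], residual[x > thr]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]

def gradient_boost(x, y, init_prob, n_rounds=50, learning_rate=0.1):
    # Stage 0: raw score taken from the initial estimator's probability.
    p0 = np.clip(init_prob, 1e-6, 1 - 1e-6)
    F = np.log(p0 / (1 - p0))
    stumps = []
    for _ in range(n_rounds):
        residual = y - sigmoid(F)  # negative gradient of the log-loss
        thr, left_val, right_val = fit_regression_stump(x, residual)
        F = F + learning_rate * np.where(x <= thr, left_val, right_val)
        stumps.append((thr, left_val, right_val))
    return F, stumps

# Toy data (hypothetical): one noisy feature, binary 0/1 labels.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = (x + 0.3 * rng.normal(size=200) > 0).astype(float)

prior = np.full_like(x, y.mean())  # constant "init" prediction
F0 = np.log(prior / (1 - prior))
F_final, stumps = gradient_boost(x, y, init_prob=prior)

loss_init = log_loss(y, F0)
loss_final = log_loss(y, F_final)
print("log-loss with init only:    {:.3f}".format(loss_init))
print("log-loss after 50 stages:   {:.3f}".format(loss_final))
```

Swapping the constant prior for the probabilities of another fitted model changes only the starting point `F`; the stages are still regression stumps.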


> Use a logistic regression classifier as the initial estimator (`init`)

In [21]:
from sklearn.linear_model import LogisticRegression

In [22]:
base_estimator = LogisticRegression()
gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.01, random_state=0, init=base_estimator)
gb.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [23]:
print("Accuracy on training set: {:.3f}".format(gb.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gb.score(X_test, y_test)))

Accuracy on training set: 0.766
Accuracy on test set: 0.761


> Use KNN as the initial estimator (`init`)

In [24]:
from sklearn.neighbors import KNeighborsClassifier

In [25]:
base_estimator = KNeighborsClassifier(n_neighbors=3)
gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.01, random_state=0, init=base_estimator)
gb.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(gb.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gb.score(X_test, y_test)))

Accuracy on training set: 0.839
Accuracy on test set: 0.731


> Use SVC as the initial estimator (`init`)

In [26]:
base_estimator = SVC(kernel='rbf', C=1, gamma=0.1, probability=True)
gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.01, random_state=0, init=base_estimator)
gb.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(gb.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gb.score(X_test, y_test)))

Accuracy on training set: 0.997
Accuracy on test set: 0.731


> Use RandomForestClassifier as the initial estimator (`init`)

In [27]:
from sklearn.ensemble import RandomForestClassifier

In [28]:
base_estimator = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.01, random_state=0, init=base_estimator)
gb.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(gb.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gb.score(X_test, y_test)))

Accuracy on training set: 0.839
Accuracy on test set: 0.776


> Use MLPClassifier as the initial estimator (`init`)

In [29]:
from sklearn.neural_network import MLPClassifier

In [30]:
base_estimator = MLPClassifier(hidden_layer_sizes=(5,), activation='relu', solver='adam', max_iter=500, random_state=0)
gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.01, random_state=0, init=base_estimator)
gb.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(gb.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gb.score(X_test, y_test)))

Accuracy on training set: 0.793
Accuracy on test set: 0.701




### 3.3 XGBoost

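> XGBoost's base learners are decision trees, usually CART trees. In each boosting round, XGBoost computes the residuals (more precisely, the gradients of the loss) from the previous round's predictions and fits a new CART tree to them, producing a new weak learner. The weak learners are then combined, each scaled by the learning rate, into the final strong classifier. XGBoost also supports regularization, such as L1 and L2 penalties, to control model complexity and reduce overfitting.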
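
The effect of the L2 penalty can be seen in the standard second-order formulas for a tree leaf: with per-sample gradients `g` and Hessians `h`, the optimal leaf weight is `-G / (H + lambda)` and a split is scored by how much it improves `G^2 / (H + lambda)`. The sketch below evaluates these formulas by hand for the logistic loss on four made-up samples. The function names are hypothetical, the L1 penalty is left out for brevity, and this is not XGBoost's actual code.

```python
import numpy as np

def leaf_weight(g, h, reg_lambda=1.0):
    """Optimal leaf value under an L2 penalty: -G / (H + lambda)."""
    return -np.sum(g) / (np.sum(h) + reg_lambda)

def leaf_score(g, h, reg_lambda=1.0):
    """Quality term G^2 / (H + lambda) of a single leaf."""
    return np.sum(g) ** 2 / (np.sum(h) + reg_lambda)

def split_gain(g, h, left_mask, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into left/right children."""
    right_mask = ~left_mask
    return 0.5 * (
        leaf_score(g[left_mask], h[left_mask], reg_lambda)
        + leaf_score(g[right_mask], h[right_mask], reg_lambda)
        - leaf_score(g, h, reg_lambda)
    ) - gamma

# Four made-up samples, all currently predicted with probability 0.5.
y = np.array([0.0, 0.0, 1.0, 1.0])
p = np.full(4, 0.5)
g = p - y        # gradient of the logistic loss w.r.t. the raw score
h = p * (1 - p)  # Hessian of the logistic loss

left = np.array([True, True, False, False])  # a split that separates the classes

w_left = leaf_weight(g[left], h[left], reg_lambda=1.0)
w_left_unreg = leaf_weight(g[left], h[left], reg_lambda=0.0)
gain = split_gain(g, h, left, reg_lambda=1.0)

print("left leaf weight (lambda=1): {:.4f}".format(w_left))
print("left leaf weight (lambda=0): {:.4f}".format(w_left_unreg))
print("split gain (lambda=1):       {:.4f}".format(gain))
```

A larger `reg_lambda` pulls leaf weights toward zero, and a positive `gamma` makes a split worthwhile only if its gain exceeds that threshold.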

In [31]:
import xgboost as xgb

In [32]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

In [33]:
# Note: the labels here are binary (0/1); 'num_class': 3 runs, but the third class is never used.
param = {'max_depth': 3, 'eta': 0.01, 'objective': 'multi:softmax', 'num_class': 3, 'eval_metric': 'merror'}
num_round = 50

In [34]:
xgb_model = xgb.train(param, dtrain, num_round)

In [35]:
pred_train = xgb_model.predict(dtrain)
pred_test = xgb_model.predict(dtest)

In [36]:
acc_train = sum(pred_train == y_train) / len(y_train)
acc_test = sum(pred_test == y_test) / len(y_test)

print("Accuracy on training set: {:.3f}".format(acc_train))
print("Accuracy on test set: {:.3f}".format(acc_test))

Accuracy on training set: 0.816
Accuracy on test set: 0.731


## 4. Bagging

### 4.1 Bagging

In [37]:
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

In [38]:
# Create the Bagging classifier
bg_clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)

In [39]:
# Train the model
bg_clf.fit(X_train, y_train)

# Evaluate the model
y_pred = bg_clf.predict(X_train)
accuracy = accuracy_score(y_train, y_pred)
print('Accuracy (train): {:.2f}%'.format(accuracy * 100))

y_pred = bg_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy (test): {:.2f}%'.format(accuracy * 100))



Accuracy (train): 99.67%
Accuracy (test): 70.15%
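
For intuition about what a bagging ensemble does internally, the following sketch implements the two core steps by hand: drawing bootstrap samples (rows sampled with replacement) and combining the fitted models by majority vote. To keep it short, a nearest-centroid classifier stands in for the decision tree used above; that substitution, the helper names, and the toy data are all assumptions for illustration and do not reflect how `BaggingClassifier` is implemented.

```python
import numpy as np

def fit_centroids(X, y):
    """Base learner: store the mean point of class 0 and of class 1."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict_centroids(centroids, X):
    """Assign each row to the class of the nearest centroid."""
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def bagging_fit(X, y, n_estimators=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # bootstrap: n rows drawn with replacement
        models.append(fit_centroids(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the base learners' 0/1 predictions."""
    votes = np.stack([predict_centroids(m, X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Toy data (hypothetical): two Gaussian blobs, 100 points per class.
rng = np.random.default_rng(1)
X_toy = np.vstack([
    rng.normal(loc=-1.0, size=(100, 2)),
    rng.normal(loc=1.0, size=(100, 2)),
])
y_toy = np.array([0] * 100 + [1] * 100)

models = bagging_fit(X_toy, y_toy)
acc = np.mean(bagging_predict(models, X_toy) == y_toy)
print("Accuracy (toy data): {:.2f}%".format(acc * 100))
```

Each base learner sees a slightly different resampling of the training set, and the vote averages out their individual errors.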


### 4.2 Random Forest

In [40]:
from sklearn.ensemble import RandomForestClassifier

In [41]:
# Build the random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

In [42]:
y_pred = clf.predict(X_train)
accuracy = accuracy_score(y_train, y_pred)
print('Accuracy (train): {:.2f}%'.format(accuracy * 100))

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy (test): {:.2f}%'.format(accuracy * 100))

Accuracy (train): 99.67%
Accuracy (test): 79.10%


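## 5. Conclusion

The table below compares the best result obtained for each algorithm. In practice, model performance varies considerably with the **hyperparameters and base learners** chosen, so these results are for reference only.

|Model|Accuracy (train set)|Accuracy (test set)|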
|:--:|:--:|:--:|
|AdaBoost|0.997|0.836|
|Gradient Boosting|0.997|0.716|
|XGBoost|0.816|0.731|
|Bagging|0.997|0.702|
|Random Forest|0.997|0.791|
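
One way to read the table is to look at the gap between training and test accuracy, which gives a rough indication of how strongly each model overfits the 299 training samples. The short script below computes that gap from the figures in the table; the `results` dictionary simply restates the table, and using the gap as an overfitting indicator is an interpretive assumption rather than part of the original experiment.

```python
# Train/test accuracies copied from the summary table.
results = {
    "AdaBoost": (0.997, 0.836),
    "Gradient Boosting": (0.997, 0.716),
    "XGBoost": (0.816, 0.731),
    "Bagging": (0.997, 0.702),
    "Random Forest": (0.997, 0.791),
}

# Gap between train and test accuracy for each model.
gaps = {name: round(train - test, 3) for name, (train, test) in results.items()}

# List models from the smallest gap to the largest.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print("{:<18} gap = {:.3f}".format(name, gap))
```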