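# Ensemble Models (Classification)

This section applies ensemble models to a classification task on the Titanic dataset. It compares a single decision tree, a random forest classifier, and a gradient boosting decision tree model.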

In [4]:
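# Preprocess the dataset
import pandas as pd
titanic = pd.read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt")

# Copy the selected columns so the fill below modifies x itself, not a view of titanic
x = titanic[['pclass', 'age', 'sex']].copy()
y = titanic['survived']
# Fill missing ages with the mean age
x['age'] = x['age'].fillna(x['age'].mean())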

# Encode non-numeric features (string-valued columns become one-hot columns)
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
x_ = dv.fit_transform(x.to_dict(orient='records'))
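
To make the encoding step more concrete, here is a minimal sketch of what `DictVectorizer` does to rows that mix numeric and string values. The two rows are hypothetical toy values, not records from the Titanic file, and `toy_dv` is an illustrative name.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy rows (assumption: not taken from the Titanic data).
rows = [
    {"pclass": "1st", "age": 29.0, "sex": "female"},
    {"pclass": "3rd", "age": 40.0, "sex": "male"},
]

toy_dv = DictVectorizer(sparse=False)
encoded = toy_dv.fit_transform(rows)

# Numeric values pass through unchanged; each distinct string value
# gets its own 0/1 column named "<key>=<value>".
feature_names = toy_dv.feature_names_
print(feature_names)
print(encoded)
```

Each string-valued key expands into one column per distinct value, so the width of the encoded matrix depends on how many categories appear in the data passed to `fit_transform`.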

In [5]:
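# Split into training and test sets
# (train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split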
x_train, x_test, y_train, y_test = train_test_split(x_, y, test_size=0.25, random_state=33)

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Single decision tree with default settings
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
y_dtc_predict = dtc.predict(x_test)

# Random forest with default settings
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
y_rfc_predict = rfc.predict(x_test)

# Gradient boosting decision trees with default settings
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)
y_gbc_predict = gbc.predict(x_test)

In [8]:
# Compare performance
from sklearn.metrics import classification_report
print("DecisionTreeClassifier's accuracy:", dtc.score(x_test, y_test))
print(classification_report(y_test, y_dtc_predict, target_names=['died', 'survived']))

print("RandomForestClassifier's accuracy:", rfc.score(x_test, y_test))
print(classification_report(y_test, y_rfc_predict, target_names=['died', 'survived']))

print("GradientBoostingClassifier's accuracy:", gbc.score(x_test, y_test))
print(classification_report(y_test, y_gbc_predict, target_names=['died', 'survived']))

DecisionTreeClassifier's accuracy: 0.781155015198
             precision    recall  f1-score   support

       died       0.78      0.91      0.84       202
   survived       0.80      0.58      0.67       127

avg / total       0.78      0.78      0.77       329

RandomForestClassifier's accuracy: 0.793313069909
             precision    recall  f1-score   support

       died       0.79      0.90      0.84       202
   survived       0.80      0.62      0.70       127

avg / total       0.79      0.79      0.79       329

GradientBoostingClassifier's accuracy: 0.790273556231
             precision    recall  f1-score   support

       died       0.78      0.92      0.84       202
   survived       0.82      0.58      0.68       127

avg / total       0.80      0.79      0.78       329



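**Characteristics:** Ensemble models are arguably the most widely used models in practice. Unlike a single learner, an ensemble can combine several different kinds of models, or fit the same kind of model many times. Because parameter estimation is affected by randomness, any single fitted model carries some uncertainty. An ensemble therefore takes longer to train, but the combined model often performs better and is more stable than its individual members.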
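
The stability argument can be illustrated with a small calculation. The sketch below computes the probability that a majority vote is correct, under the strong assumption that every member is right independently with the same probability. Real ensemble members, such as the trees in a random forest, are correlated, so this is an idealized picture rather than a prediction for the models above. The function name `majority_vote_accuracy` and the value 0.7 are illustrative choices.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n_models is correct.

    Assumes each model is correct independently with probability
    p_correct (an idealization; real ensemble members are correlated).
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Compare a single model with ensembles of growing size.
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

Under these assumptions the vote becomes more reliable as members are added, provided each member is better than chance. The gain shrinks when members make correlated errors, which is one reason the ensemble scores above are only slightly higher than the single tree's.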