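Decision trees are widely used for classification across many industries: in finance they can assess loan risk, in healthcare they can support diagnostic decisions, and in e-commerce they can be used to predict sales.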

Loading the data

In [3]:
# -*- coding: utf-8 -*-
import pandas as pd
train_data = pd.read_csv('./data/titanic_train.csv')
test_data = pd.read_csv('./data/titanic_test.csv')

Data exploration

In [4]:
train_data.info() # shows which columns have missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
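
The non-null counts above can also be turned into explicit missing-value counts per column. A minimal sketch, using a small made-up frame (`toy_df` is hypothetical and only stands in for `train_data`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values per column, largest first
missing_counts = toy_df.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

Run against the real data, the same call would be `train_data.isnull().sum()`.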


In [5]:
train_data.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [21]:
train_data.describe(include=['O']) # include=['O'] summarizes the string (non-numeric) columns

,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,891
unique,891,2,681,147,3
top,"Allison, Master. Hudson Trevor",male,CA. 2343,G6,S
freq,1,577,7,4,646


First five rows

In [7]:
train_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Last five rows

In [8]:
train_data.tail()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


Data cleaning
Fill the NaN values in Age with the mean age

In [14]:
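# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which recent pandas versions warn about and may not apply to the original frame
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].mean())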

Fill the NaN values in Fare with the mean fare

In [15]:
train_data['Fare'] = train_data['Fare'].fillna(train_data['Fare'].mean())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
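
A common variation, which is not what the cells above do, is to compute the fill value on the training set only and reuse it for the test set, so that no test-set statistic enters the preprocessing. A minimal sketch with made-up frames (`toy_train` and `toy_test` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train_data and test_data
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

# Mean of the non-missing training ages: (20 + 30 + 40) / 3 = 30
train_age_mean = toy_train["Age"].mean()

# Fill both sets with the training mean
toy_train["Age"] = toy_train["Age"].fillna(train_age_mean)
toy_test["Age"] = toy_test["Age"].fillna(train_age_mean)
```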

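Cabin is the cabin number; it has too many missing values to fill in.
Embarked is the port of embarkation; it has a few missing values.
print(train_data['Embarked'].value_counts()) shows that there are only three ports and that S has the most passengers, so the missing values are filled with S.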

In [17]:
train_data['Embarked'] = train_data['Embarked'].fillna('S')
test_data['Embarked'] = test_data['Embarked'].fillna('S')
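
Rather than hard-coding 'S', the most frequent value can be looked up programmatically. A minimal sketch on a made-up Series (`toy_embarked` is hypothetical, not the real column):

```python
import pandas as pd

# Hypothetical toy column standing in for train_data['Embarked']
toy_embarked = pd.Series(["S", "C", "S", None, "Q", "S"])

# value_counts() ignores missing values by default
print(toy_embarked.value_counts())

# mode() returns the most frequent value(s); take the first one
most_common = toy_embarked.mode()[0]
filled = toy_embarked.fillna(most_common)
```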

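Feature selection
From the data exploration we can see that PassengerId is the passenger number and has no bearing on classification;
Name is the passenger's name and has no bearing on classification;
Cabin has too many missing values and can be dropped;
Ticket is the ticket number, which is messy and shows no pattern, so it can be dropped;
the remaining fields, Pclass, Sex, Age, SibSp, Parch, Fare and Embarked, give the passenger's ticket class, sex, age, number of siblings/spouses aboard, number of parents/children aboard, fare and port of embarkation, and may be relevant;
exactly how they relate to survival can be left to the classifier.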

Put the potentially useful fields into the feature list features

In [20]:
features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
train_features=train_data[features]
train_labels=train_data['Survived']
test_features=test_data[features]
#test_labels=test_data['Survived']

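Some feature values are strings, which is inconvenient for later computation, so they need to be converted to numeric types. For example, Sex takes the values male and female; we can turn it into two fields, Sex=male and Sex=female, each holding 0 or 1.
Embarked has three possible values, S, C and Q, so it likewise becomes three fields, Embarked=S, Embarked=C and Embarked=Q, each holding 0 or 1.
The DictVectorizer class in sklearn's feature extraction module (sklearn.feature_extraction) handles such symbolic values and encodes them as 0/1 numeric columns.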

In [23]:
from sklearn.feature_extraction import DictVectorizer
dvec=DictVectorizer(sparse=False)
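# orient='records' (plural) yields one dict per row; the abbreviation 'record' is rejected by recent pandas
train_features=dvec.fit_transform(train_features.to_dict(orient='records'))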

The fit_transform function converts the list of feature dicts into a feature matrix. Let's look at the feature names dvec holds after the conversion:

In [24]:
dvec.feature_names_

['Age',
 'Embarked=C',
 'Embarked=Q',
 'Embarked=S',
 'Fare',
 'Parch',
 'Pclass',
 'Sex=female',
 'Sex=male',
 'SibSp']

The single Embarked column has become three columns, and the Sex column has become two. The train_features matrix therefore has 10 features (columns) and 891 samples (rows).
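
The same expansion can be seen on a tiny example. This is a minimal sketch with three made-up records (`toy_records` and `toy_vec` are hypothetical names), showing that string fields become one 0/1 column per value while numeric fields pass through unchanged:

```python
from sklearn.feature_extraction import DictVectorizer

# Three made-up passengers, one dict per row
toy_records = [
    {"Sex": "male", "Embarked": "S", "Age": 22.0},
    {"Sex": "female", "Embarked": "C", "Age": 38.0},
    {"Sex": "female", "Embarked": "Q", "Age": 26.0},
]

toy_vec = DictVectorizer(sparse=False)
toy_matrix = toy_vec.fit_transform(toy_records)

print(toy_vec.feature_names_)  # column names, sorted alphabetically
print(toy_matrix)              # one row per record, one column per name

# A new row is encoded with the columns learned during fit
new_row = toy_vec.transform([{"Sex": "male", "Embarked": "S", "Age": 30.0}])
print(new_row)
```

Calling `transform` on later data, rather than `fit_transform`, keeps the column order identical to the one the model was trained on.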

In [26]:
train_features

array([[22.        ,  0.        ,  0.        , ...,  0.        ,
         1.        ,  1.        ],
       [38.        ,  1.        ,  0.        , ...,  1.        ,
         0.        ,  1.        ],
       [26.        ,  0.        ,  0.        , ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [29.69911765,  0.        ,  0.        , ...,  1.        ,
         0.        ,  1.        ],
       [26.        ,  1.        ,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       [32.        ,  0.        ,  1.        , ...,  0.        ,
         1.        ,  0.        ]])

In [29]:
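# Decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Build a decision tree that splits on information gain (entropy).
# Note: scikit-learn's trees are CART-based, so this is not a strict ID3 tree.
clf = DecisionTreeClassifier(criterion='entropy')
# Train the decision tree
clf.fit(train_features,train_labels)

# Prediction and evaluation
# Reuse the vectorizer fitted on the training set: transform only, do not refit,
# so the test columns match the columns the tree was trained on
test_features=dvec.transform(test_features.to_dict(orient='records'))
# Predict with the decision tree
pred_labels=clf.predict(test_features)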
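
With criterion='entropy', candidate splits are compared by how much they reduce label entropy. The sketch below works through that calculation in plain Python; the helper names `entropy` and `information_gain` and the labels are made up for illustration and are not scikit-learn's internal implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Made-up labels: two survivors (1) and two non-survivors (0)
parent = [1, 1, 0, 0]

# A split that separates the classes completely removes all uncertainty
perfect_gain = information_gain(parent, [1, 1], [0, 0])

# A split that leaves both children as mixed as the parent gains nothing
useless_gain = information_gain(parent, [1, 0], [1, 0])

print(perfect_gain, useless_gain)
```

At each node the tree picks the split with the largest gain among those it evaluates.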

In [30]:
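# Accuracy of the decision tree on the training set
acc_decision_tree = round(clf.score(train_features,train_labels),6)
# Scoring on the same data the tree was trained on gives an optimistically high accuracy
print('Training accuracy (score): {:.4f}'.format(acc_decision_tree))

Training accuracy (score): 0.9820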
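
A training accuracy this high mostly reflects the tree memorizing its training samples. One simple check is to hold out part of the labeled data and score on that instead. The sketch below uses synthetic, deliberately noisy data from `make_classification` (an assumption for illustration, not the Titanic set) to show the gap between the two numbers:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, standing in for train_features / train_labels
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)

# Keep 25% of the labeled samples aside for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

holdout_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
holdout_tree.fit(X_train, y_train)

train_acc = holdout_tree.score(X_train, y_train)   # accuracy on data the tree has seen
valid_acc = holdout_tree.score(X_valid, y_valid)   # accuracy on unseen data
print("train accuracy: {:.4f}".format(train_acc))
print("held-out accuracy: {:.4f}".format(valid_acc))
```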


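K-fold cross-validation trains on most of the samples and validates the classifier on the small remainder. The data is split into K parts; in each of K rounds,
one part (1/K of the data) is held out for validation and the rest is used for training, and the K scores are averaged. K = 10 is a common choice.
1. Split the dataset into K equal parts
2. Use one part as test data and the rest as training data
3. Compute the test accuracy
4. Repeat steps 2 and 3 with a different part as the test set each time, then average the K accuracies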
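
The four steps above can be sketched as an explicit loop. This is a minimal illustration on synthetic data (the data and the names `base_tree` and `fold_scores` are assumptions); it uses StratifiedKFold because, to the best of my understanding, that is the splitter cross_val_score picks for a classifier when cv is an integer:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for train_features / train_labels
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
base_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)

fold_scores = []
splitter = StratifiedKFold(n_splits=5)              # step 1: split into K parts
for train_idx, test_idx in splitter.split(X, y):    # step 4: rotate the held-out part
    model = clone(base_tree)                        # fresh, unfitted copy per fold
    model.fit(X[train_idx], y[train_idx])           # step 2: train on the other K-1 parts
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # step 3: test accuracy

manual_mean = np.mean(fold_scores)
library_mean = np.mean(cross_val_score(base_tree, X, y, cv=5))
print(manual_mean, library_mean)
```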

In [31]:
import numpy as np
from sklearn.model_selection import cross_val_score
# Estimate the decision tree's accuracy with K-fold cross-validation
print("cross_val_score accuracy: {:.4f}".format(np.mean(cross_val_score(clf,train_features,train_labels,cv=10))))

cross_val_score accuracy: 0.7802


In [32]:
import graphviz
from sklearn import tree
# Visualize the decision tree
dot_data = tree.export_graphviz(clf,out_file=None)
graph=graphviz.Source(dot_data)
graph.view()

'Source.gv.pdf'
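
When Graphviz is not installed, the tree can also be printed as indented text rules with sklearn.tree.export_text. A minimal sketch on made-up data (`toy_X`, `toy_y` and `toy_feature_names` are hypothetical; the labels are constructed to depend only on the `Sex=male` column):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up samples: [Age, Sex=male]; label is 1 exactly when Sex=male is 0
toy_feature_names = ["Age", "Sex=male"]
toy_X = [[22, 1], [38, 0], [26, 0], [35, 1], [30, 0], [40, 1]]
toy_y = [0, 1, 1, 0, 1, 0]

toy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
toy_tree.fit(toy_X, toy_y)

# Passing feature_names labels each split with a column name instead of an index
rules = export_text(toy_tree, feature_names=toy_feature_names)
print(rules)
```

On the notebook's own tree the equivalent call would be `export_text(clf, feature_names=list(dvec.feature_names_))`; `export_graphviz` accepts a `feature_names` argument in the same way.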