# Decision Trees

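## Example from the official scikit-learn User Guide
`DecisionTreeClassifier` is a class capable of performing multi-class classification on a dataset. Like other classifiers, it takes two arrays as input: an array X of shape [n_samples, n_features] holding the training samples, and an array Y of integer values of shape [n_samples] holding the class labels of the training samples.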
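
As a minimal sketch of these two inputs, the toy data below (two samples with two features each, made up for illustration and not part of the iris example) is fitted and then used to classify a new point. The name `toy_clf` is likewise an assumption.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data, assumed for illustration: 2 samples, 2 features each
X = [[0, 0], [1, 1]]   # shape [n_samples, n_features]
Y = [0, 1]             # shape [n_samples], integer class labels

toy_clf = DecisionTreeClassifier(random_state=0)
toy_clf.fit(X, Y)

# Predict the class of a new sample, and its per-class probabilities
pred = toy_clf.predict([[2.0, 2.0]])
proba = toy_clf.predict_proba([[2.0, 2.0]])
print(pred, proba)
```

The probabilities are the class fractions of the training samples in the leaf that the new sample falls into.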

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
import graphviz

In [None]:
iris = load_iris()
dt_clf = DecisionTreeClassifier()
dt_clf

In [None]:
clf = dt_clf.fit(iris.data, iris.target)

In [None]:
cross_val_score(clf, iris.data, iris.target, cv=10)

Once trained, the tree can be exported in Graphviz format with the `export_graphviz` exporter.

In [None]:
dot_data = export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph
# graph.render("iris", format='png')  # save to a file in another format, e.g. PNG

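`export_graphviz` also supports a variety of aesthetic options, including coloring nodes by their class (or by their value for regression) and, if desired, using explicit variable and class names.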

In [None]:
dot_data = export_graphviz(clf, out_file=None,
                        feature_names=iris.feature_names,
                        class_names=iris.target_names,
                        filled=True, rounded=True,
                        special_characters=True  # keep special characters (False ignores them for PostScript compatibility)
                        )
graph = graphviz.Source(dot_data)
graph
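
If Graphviz is not installed, a plain-text view of the tree can serve as a substitute. The sketch below uses `sklearn.tree.export_text` (a different exporter from the one used above, available in scikit-learn 0.21 and later); the shallow `max_depth=2`, the `random_state`, and the name `tree_clf` are assumptions chosen to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A deliberately shallow tree so the printed rules stay readable
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_clf.fit(iris.data, iris.target)

# Render the fitted tree as indented text rules
rules = export_text(tree_clf, feature_names=list(iris.feature_names))
print(rules)
```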

Tips on practical use: <https://github.com/apachecn/sklearn-doc-zh/blob/master/docs/0.21.3/11.md#1105-%E5%AE%9E%E9%99%85%E4%BD%BF%E7%94%A8%E6%8A%80%E5%B7%A7>

## sklearn-cookbook example
<https://github.com/apachecn/sklearn-cookbook-zh/blob/master/4.md#41-%E4%BD%BF%E7%94%A8%E5%86%B3%E7%AD%96%E6%A0%91%E5%AE%9E%E7%8E%B0%E5%9F%BA%E6%9C%AC%E7%9A%84%E5%88%86%E7%B1%BB>

In [None]:
from sklearn import datasets
import numpy as np
np.set_printoptions(precision=4, suppress=True, threshold=15)

In [None]:
# Generate samples for an n-class problem (2 classes by default)
# 3 features, of which 0 are redundant and 0 are repeated
X, y = datasets.make_classification(n_samples=1000, n_features=3, n_redundant=0)
X

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X, y)

In [None]:
preds = dt.predict(X)
np.mean(preds == y)

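**max_depth** is the maximum depth of the decision tree; it caps how many levels of splits, and therefore how many branches, the tree can have.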
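
To make the effect concrete, the sketch below fits trees with a few depth limits and records the resulting depth and leaf count. The synthetic data sizes, the `random_state` values, and the names `X_demo`, `y_demo`, and `shapes` are assumptions for illustration; `get_depth()` and `get_n_leaves()` require scikit-learn 0.21 or later.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; sizes and seed are assumed, chosen only for reproducibility
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=0)

shapes = {}
for depth in (1, 2, 3, None):   # None means no depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_demo, y_demo)
    # Record (actual depth, number of leaves) for this limit
    shapes[depth] = (tree.get_depth(), tree.get_n_leaves())
    print(depth, shapes[depth])
```

A binary tree of depth d has at most 2^d leaves, so a small `max_depth` bounds the model's complexity directly.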

In [None]:
n_features = 200
X, y = datasets.make_classification(1000, n_features=n_features, n_informative=5)  # 5 informative features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
accuracies = []  # test-set accuracy for each depth

In [None]:
# Effect of different maximum depths
for x in range(1, n_features+1):
    dt_clf = DecisionTreeClassifier(max_depth=x)
    dt_clf.fit(X_train, y_train)
    score = dt_clf.score(X_test, y_test)
    accuracies.append(score)

In [None]:
np.argmax(accuracies) + 1

In [None]:
# Visualization
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.plot(range(1, n_features+1), accuracies)
plt.xlabel('Max Depth')
plt.ylabel('Score')

In fact, quite good accuracy is already reached at fairly low maximum depths.

In [None]:
N = 15
plt.plot(range(1, n_features+1)[:N], accuracies[:N])
plt.xlabel('Max Depth')
plt.ylabel('Score')

**Tuning the decision tree model**

In [None]:
def dt_max_depth(*args, **kwargs):
    X, y = datasets.make_classification(1000, n_features=20, n_informative=3)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    dt = DecisionTreeClassifier(*args, **kwargs)
    dt.fit(X_train, y_train)
    return dt

In [None]:
def view_dt(dt):
    dot_data = export_graphviz(dt, out_file=None)
    graph = graphviz.Source(dot_data)
    return graph

In [None]:
view_dt(dt_max_depth())  # no depth limit; noted as 10 levels in the source, though this varies with the random data

In [None]:
# Lower the maximum depth
view_dt(dt_max_depth(max_depth=5)) 

In [None]:
# Split using information gain (entropy)
view_dt(dt_max_depth(max_depth=5, criterion='entropy', min_samples_leaf=10))   # at least 10 samples per leaf
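
The `criterion` argument selects the impurity measure used to score candidate splits. Below is a minimal NumPy sketch of the two common measures, Gini impurity (scikit-learn's default) and entropy, plus the information gain of one hypothetical split. The helper names and the label arrays are made up for illustration and are not scikit-learn internals.

```python
import numpy as np

def class_probs(labels):
    # Fraction of samples in each class that is present
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    p = class_probs(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    p = class_probs(labels)
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of the children
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]   # a hypothetical perfect split

print(gini(parent), entropy(parent), information_gain(parent, [left, right]))
```

A balanced two-class node has the maximum impurity under both measures, and a split that separates the classes completely recovers all of the parent's entropy as gain.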

## Explaining the Titanic hypothesis with decision trees
Source: Learning scikit-learn
<https://github.com/apachecn/misc-docs-zh/blob/master/docs/learning-sklearn/ch02.md#%E7%94%A8%E5%86%B3%E7%AD%96%E6%A0%91%E8%A7%A3%E9%87%8A%E6%B3%B0%E5%9D%A6%E5%B0%BC%E5%85%8B%E5%8F%B7%E5%81%87%E8%AE%BE>

The attributes are: Ordinal, Class, Survived (0=no, 1=yes), Name, Age, Port of Embarkation, Home/Destination, Room, Ticket, Boat, and Sex.

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = pd.read_csv('titanic.txt')
data

### Preprocessing the data
Three columns, pclass, age, and sex, are selected as the features.

In [None]:
titanic_X , titanic_y = data.iloc[:, [1, 4, -1]], data.iloc[:, 2]
features = titanic_X.columns

In [None]:
titanic_y.value_counts()

**Handling missing values**

Missing ages are replaced with the mean age of all passengers.
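
A self-contained sketch of mean imputation on a tiny made-up frame with the same three columns (the rows are invented, not taken from `titanic.txt`). Passing `numeric_only=True` is an assumption that keeps the mean restricted to numeric columns, which recent pandas versions need when string columns are present.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same three columns as the selected features
toy = pd.DataFrame({
    "pclass": ["1st", "2nd", "3rd", "3rd"],
    "age": [20.0, np.nan, 40.0, np.nan],
    "sex": ["female", "male", "male", "female"],
})

# The mean is computed over numeric columns only (here just age),
# then used to fill the missing entries of that column
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled)
```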

In [None]:
titanic_X = titanic_X.fillna(titanic_X.mean(numeric_only=True))  # only age is numeric; pclass and sex are strings
X = titanic_X.values  # pandas -> NumPy array
X

In [None]:
y = titanic_y.values
y

**Encoding categorical features**  
Label values are converted to integers in 0..K-1.
```
class sklearn.preprocessing.LabelEncoder
    Encode target labels with value between 0 and n_classes-1
```
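
A small sketch of what `LabelEncoder` does, using made-up values for the sex column; the name `sex_encoder` is an assumption for illustration.

```python
from sklearn.preprocessing import LabelEncoder

sex = ["female", "male", "male", "female"]   # made-up values

sex_encoder = LabelEncoder()
codes = sex_encoder.fit_transform(sex)

print(sex_encoder.classes_)                   # sorted unique labels
print(codes)                                  # integer code of each entry
print(sex_encoder.inverse_transform(codes))   # back to the original labels
```

The integer assigned to a label is its position in the sorted `classes_` array.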

In [None]:
from sklearn import preprocessing
enc = preprocessing.LabelEncoder()
X[:, -1] = enc.fit_transform(X[:, -1])  # encode the sex column in place
X

In [None]:
enc.classes_

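Applying the same encoding to pclass gives three values: 0, 1, 2. This implicitly introduces an order among the classes, even though they are actually unordered.  
Another encoding that turns nominal features into a form scikit-learn models can use is one-of-K, also known as **one-hot** or dummy encoding.  
It is implemented in the class `OneHotEncoder`, which transforms each categorical feature with n_categories possible values into a binary feature vector of length n_categories, with a single 1 and 0 everywhere else.  
Here pclass takes the values 1st, 2nd, and 3rd.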
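
A minimal sketch of one-hot encoding on made-up pclass values. It assumes scikit-learn 0.20 or later, where `OneHotEncoder` accepts string categories directly; the names `pclass`, `ohe`, and `one_hot` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up pclass values as a single-column 2D array
pclass = np.array([["1st"], ["3rd"], ["2nd"], ["1st"]])

ohe = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense
one_hot = ohe.fit_transform(pclass).toarray()

print(ohe.categories_)   # the sorted categories found in each column
print(one_hot)           # one row per sample, a single 1 per row
```

Each output column corresponds to one category, so no ordering between 1st, 2nd, and 3rd is implied.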

In [None]:
# First convert the nominal labels to integers
enc = preprocessing.LabelEncoder()
label_encoder = enc.fit(X[:, 0])
label_encoder.classes_

In [None]:
integer_classes = label_encoder.transform(label_encoder.classes_)
integer_classes  # the 3 integer classes

In [None]:
int_classes = label_encoder.transform(X[:, 0])
int_classes[:, np.newaxis]  # every value in the pclass column as an int, reshaped to a column

In [None]:
# Turn the integer feature into a one-hot encoding
one_hot_encoder = preprocessing.OneHotEncoder()
one_hot_encoder

In [None]:
new_features = one_hot_encoder.fit_transform(int_classes[:, np.newaxis]).toarray()
new_features  # each of the n pclass values becomes a vector with a single 1 and 0 elsewhere

In [None]:
X

In [None]:
X = np.concatenate((new_features, X[:, 1:]), axis=1)  # the final feature matrix
X