# Decision Trees

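## Example from the official scikit-learn documentation
DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset. Like other classifiers, it takes two arrays as input: an array X of shape [n_samples, n_features] holding the training samples, and an array Y of integer values of shape [n_samples] holding the class labels of the training samples: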

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
import graphviz

In [None]:
iris = load_iris()
dt_clf = DecisionTreeClassifier()
dt_clf

In [None]:
clf = dt_clf.fit(iris.data, iris.target)
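
Once fitted, the classifier can be used to predict the class of samples. The following is a minimal sketch on the same iris data, not part of the original notebook; the names `X`, `y`, and `sample` are illustrative, and `random_state=0` is an assumption added only to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target        # X: (n_samples, n_features), y: (n_samples,)
print(X.shape, y.shape)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

sample = X[:1]                       # slice keeps the 2-D shape (1, n_features)
pred = clf.predict(sample)           # predicted class label
proba = clf.predict_proba(sample)    # one probability per class
print(pred, proba)
```

`predict` expects a 2-D array even for a single sample, which is why the example slices with `X[:1]` rather than indexing with `X[0]`.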

In [None]:
cross_val_score(clf, iris.data, iris.target, cv=10)
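
`cross_val_score` returns one score per fold, so the array is usually summarized by its mean and standard deviation. A small sketch under the same setup; `scores` is an illustrative name and `random_state=0` is an assumption for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# cross_val_score clones the estimator and refits it on each training fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         iris.data, iris.target, cv=10)
print("mean accuracy: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```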

Once trained, the tree can be exported in Graphviz format with the export_graphviz exporter.

In [None]:
dot_data = export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph
# graph.render("iris", format='png')  # save to a file in another format

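export_graphviz also supports a variety of aesthetic options, including coloring nodes by their class (or by value for regression) and, if desired, using explicit variable and class names.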

In [None]:
dot_data = export_graphviz(clf, out_file=None,
                        feature_names=iris.feature_names,
                        class_names=iris.target_names,
                        filled=True, rounded=True,
                        special_characters=True  # keep special characters (False ignores them for PostScript compatibility)
                        )
graph = graphviz.Source(dot_data)
graph
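
If Graphviz is not installed, the same tree can be inspected as plain text. This is a sketch using `sklearn.tree.export_text` (available in scikit-learn 0.21 and later); the name `small_clf` and the choice of `max_depth=2` are assumptions made only to keep the printed output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree so the text report stays readable
small_clf = DecisionTreeClassifier(max_depth=2, random_state=0)
small_clf.fit(iris.data, iris.target)

# One line per node: split conditions on inner nodes, predicted class on leaves
text = export_text(small_clf, feature_names=list(iris.feature_names))
print(text)
```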

Practical tips: <https://github.com/apachecn/sklearn-doc-zh/blob/master/docs/0.21.3/11.md#1105-%E5%AE%9E%E9%99%85%E4%BD%BF%E7%94%A8%E6%8A%80%E5%B7%A7>

## Example from the sklearn cookbook

In [None]:
from sklearn import datasets
import numpy as np
np.set_printoptions(precision=4, suppress=True, threshold=15)

In [None]:
# Generate samples for an n-class problem (2 classes by default)
# 3 features, of which 0 are redundant and 0 are repeated
X, y = datasets.make_classification(n_samples=1000, n_features=3, n_redundant=0)
X

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X, y)

In [None]:
preds = dt.predict(X)
np.mean(preds == y)
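
The accuracy above is measured on the same data the tree was trained on, so it tends to be optimistic. A hedged sketch of the usual check against a held-out split follows; the split sizes, the `random_state=0` values, and the names `train_acc` and `test_acc` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=3, n_redundant=0,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

dt = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = np.mean(dt.predict(X_train) == y_train)  # accuracy on seen data
test_acc = dt.score(X_test, y_test)                  # accuracy on unseen data
print(train_acc, test_acc)
```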

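**max_depth** is the maximum depth of the decision tree; it limits how many successive splits the tree can make along any path from the root to a leaf.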
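
To make the effect of the cap concrete, here is a small sketch that fits trees with different `max_depth` values and reports their actual depth and leaf count via `get_depth()` and `get_n_leaves()`. The dataset, the names `X_demo`, `y_demo`, and `shapes`, and the depths tried are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=3,
                                     n_redundant=0, random_state=0)

shapes = []
for d in (1, 2, 3):
    tree = DecisionTreeClassifier(max_depth=d, random_state=0)
    tree.fit(X_demo, y_demo)
    # A binary tree of depth d has at most 2**d leaves
    shapes.append((d, tree.get_depth(), tree.get_n_leaves()))

for d, depth, leaves in shapes:
    print("max_depth=%d -> depth=%d, leaves=%d" % (d, depth, leaves))
```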

In [None]:
n_features = 200
X, y = datasets.make_classification(1000, n_features=n_features, n_informative=5)  # 5 informative features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
accuracies = []  # test accuracy for each max_depth

In [None]:
# Effect of different maximum depths
for x in range(1, n_features+1):
    dt_clf = DecisionTreeClassifier(max_depth=x)
    dt_clf.fit(X_train, y_train)
    score = dt_clf.score(X_test, y_test)
    accuracies.append(score)

In [None]:
np.argmax(accuracies) + 1

In [None]:
# Visualization
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.plot(range(1, n_features+1), accuracies)
plt.xlabel('Max Depth')
plt.ylabel('Score')

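In practice, good accuracy is already reached at fairly low maximum depths. Let's take a closer look at the accuracy at low depths, starting with the first 15.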

In [None]:
N = 15
plt.plot(range(1, n_features+1)[:N], accuracies[:N])
plt.xlabel('Max Depth')
plt.ylabel('Score')
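
One reason very large `max_depth` values add little: a tree stops growing once its leaves are pure, so beyond that point `max_depth` is no longer the binding constraint. A sketch of how to check this, assuming a fixed `random_state=0` (not used in the cells above) so that the two fits are comparable; the names `full` and `capped` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(1000, n_features=200, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Unconstrained tree: grows until every leaf is pure
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
depth = full.get_depth()
print("depth of the fully grown tree:", depth)

# A cap well above that depth never takes effect
capped = DecisionTreeClassifier(max_depth=depth + 50, random_state=0)
capped.fit(X_train, y_train)
same = np.array_equal(full.predict(X_test), capped.predict(X_test))
print("same depth:", capped.get_depth() == depth, "same predictions:", same)
```

Without a fixed `random_state`, repeated fits can still differ slightly from run to run, because ties between equally good splits are broken randomly.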