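# Decision Trees

Idea: at each node, choose the feature and split value that make the resulting child nodes as "pure" as possible, so that purity increases with depth.

Notation: as a summation limit, $D$ is the number of classes and $V$ is the number of distinct values a feature takes. Elsewhere, $D$ is the sample set at the node, $D^v$ is the subset whose feature takes the $v$-th value, and $p_i$ is the proportion of samples in class $i$.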

**ID3**: choose the split with the largest information gain.

Information entropy: $Ent(D)=-\sum_{i=1}^{D}p_i\log(p_i)$

Information gain: $Gain=Ent(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$
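The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function names, the toy `wind`/`grade` data, and the choice of a base-2 logarithm are all assumptions (the text does not fix the base).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Ent(D) = -sum_i p_i * log2(p_i), over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Gain = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)."""
    n = len(labels)
    # Group the labels by feature value to form the subsets D^v.
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy data: one discrete feature and a binary label.
wind = ["low", "low", "high", "high", "low", "high"]
grade = ["bad", "bad", "good", "good", "bad", "bad"]
gain = information_gain(wind, grade)
print(round(entropy(grade), 4), round(gain, 4))
```

ID3 would evaluate `information_gain` for every candidate feature and split on the one with the largest value.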

**C4.5**: choose the split with the largest information gain ratio.

Intrinsic value: $IV=-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log(\frac{|D^v|}{|D|})$

Information gain ratio: $Gain\_ratio=\frac{Gain}{IV}$
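To see why dividing by the intrinsic value matters, the sketch below compares an ordinary two-valued feature with an ID-like feature that takes a different value on every sample. It assumes a base-2 logarithm; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Entropy of the empirical distribution of items (base-2 logarithm)."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def gain_ratio(feature_values, labels):
    """Gain_ratio = Gain / IV for one discrete feature."""
    n = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())
    # IV has the same form as the entropy of the feature's own value distribution.
    iv = entropy(feature_values)
    # A feature with a single value has IV = 0 and cannot split the data.
    return gain / iv if iv > 0 else 0.0


# Hypothetical toy data.
grade = ["bad", "bad", "good", "good", "bad", "bad"]
wind = ["low", "low", "high", "high", "low", "high"]
row_id = [1, 2, 3, 4, 5, 6]  # unique per sample, so every subset is pure

ratio_wind = gain_ratio(wind, grade)
ratio_id = gain_ratio(row_id, grade)
print(round(ratio_wind, 4), round(ratio_id, 4))
```

The ID-like feature has the larger information gain, because every subset is pure, but its large intrinsic value pulls its gain ratio below that of `wind`.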

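**CART**: for classification, choose the split with the smallest Gini index; for regression, choose the split with the smallest variance.

Gini value: $Gini(D)=1-\sum_{i=1}^{D}p_i^2$

Gini index: $Gini\_index=\sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v)$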
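The Gini value and Gini index can be sketched the same way. The code follows the multiway formula as written above, summing over all $V$ feature values; scikit-learn's CART implementation uses binary splits instead, so this is an illustration of the formula, not of that library. Function names and data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_index(feature_values, labels):
    """Gini_index = sum_v |D^v|/|D| * Gini(D^v)."""
    n = len(labels)
    # Group the labels by feature value to form the subsets D^v.
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    return sum(len(s) / n * gini(s) for s in subsets.values())


# Hypothetical toy data.
wind = ["low", "low", "high", "high", "low", "high"]
grade = ["bad", "bad", "good", "good", "bad", "bad"]
before = gini(grade)
after = gini_index(wind, grade)
print(round(before, 4), round(after, 4))
```

A lower Gini index after the split means purer child nodes, so the feature with the smallest value is chosen.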

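Variance (as a sum of squared deviations): $m \cdot s^2=\sum_{i=1}^{m}(y^{(i)}-\bar{y})^2$, where $m$ is the number of samples in the node and $\bar{y}$ is the mean of their target values.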
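For regression, the criterion above can be sketched as a search over thresholds on one numeric feature, keeping the threshold that minimizes the summed squared deviations of the two child nodes. Trying the midpoints between sorted feature values is one common choice and is an assumption here, as are the function names and the toy data.

```python
def sse(ys):
    """m * s^2: the sum of squared deviations from the mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_threshold(xs, ys):
    """Return (threshold, total) minimizing sse(left) + sse(right)."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical toy data with two well-separated groups.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, total = best_threshold(xs, ys)
print(threshold, total)
```

The chosen threshold falls between the two groups, where each child node's targets sit closest to their own mean.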

In [1]:
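import pandas as pd

# Load the Wuhan air-quality data, using the 'date' column as the index.
wh_data = pd.read_csv('武汉.csv', index_col='date', encoding='utf-8', engine='python')
# Drop rows whose quality grade ('质量等级') is '无' (none), i.e. days without a grade.
wh_data.drop(wh_data[wh_data['质量等级']=='无'].index, inplace=True)
wh_data.head()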

| date | AQI | 质量等级 | PM2.5 | PM10 | SO2 | CO | NO2 | O3_8h |
|---|---|---|---|---|---|---|---|---|
| 2014-01-01 | 203 | 重度污染 | 153 | 210 | 70 | 2.0 | 106 | 55 |
| 2014-01-02 | 231 | 重度污染 | 181 | 254 | 89 | 2.4 | 112 | 26 |
| 2014-01-03 | 224 | 重度污染 | 174 | 226 | 63 | 1.7 | 84 | 55 |
| 2014-01-04 | 147 | 轻度污染 | 112 | 184 | 73 | 1.6 | 87 | 40 |
| 2014-01-05 | 195 | 中度污染 | 147 | 213 | 89 | 2.2 | 91 | 53 |


In [0]:
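from sklearn.model_selection import train_test_split

# Features: the pollutant columns (PM2.5 through O3_8h); AQI is left out.
X = wh_data.iloc[:, 2:]
# Target: the quality grade ('质量等级').
y = wh_data.iloc[:, 1]
# Default split: 75% train, 25% test.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)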

In [3]:
from sklearn.tree import DecisionTreeClassifier

# Fit a CART classifier with default hyperparameters (Gini criterion, no depth limit).
dt_clf = DecisionTreeClassifier(random_state=100)
dt_clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=100, splitter='best')

In [4]:
# Mean accuracy on the held-out test set.
dt_clf.score(X_test, y_test)

0.973568281938326

In [5]:
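from sklearn.model_selection import GridSearchCV

dt_clf = DecisionTreeClassifier(random_state=100)
# Search over class weighting, maximum depth, and the maximum number of leaves.
params = {
    'class_weight': [None, 'balanced'],
    'max_depth': list(range(3, 12)),
    'max_leaf_nodes': list(range(10, 30)),
}
# 3-fold cross-validation on the training set, using all CPU cores.
grid = GridSearchCV(dt_clf, params, cv=3, n_jobs=-1)
grid.fit(X_train, y_train)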

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=100,
                                              splitter='best'),
             iid='warn', n_jobs=-1,
             param_grid={'class_weight': [None, 'balanced'],
                         'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 11],
         

In [6]:
grid.best_params_

{'class_weight': None, 'max_depth': 9, 'max_leaf_nodes': 20}

In [7]:
# Test accuracy of the best estimator, refit on the full training set.
grid.score(X_test, y_test)

0.9779735682819384