**Decision Tree**

@Time: 2024-12-01<br>
@Author: Rui Zhu<br>
@Follow: Chapter 2, Decision Trees in Depth

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

import astrokit as ak
ak.pandas_show_all_columns()

dir_data = Path("/Users/rui/Code/Astronote/31_XGBoost/data")

---
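# Decision Trees
- Why introduce decision trees before learning XGBoost?
    1. XGBoost is an ensemble method that combines multiple ML models (base learners)
    2. Decision trees are the most commonly used base learner in XGBoost
- Decision trees overfit easily
    1. A decision tree can grow thousands of branches until every training sample is mapped to its correct target, but such a model generalizes poorly
    2. Remedy 1: hyperparameter tuning
    3. Remedy 2: ensembles of decision trees, i.e. random forests and XGBoost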
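
The overfitting behavior described above can be illustrated with a minimal sketch. It assumes synthetic data from `make_classification` (not the census data used later), and the names `X_demo`, `full_tree`, and `small_tree` are introduced here only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# No depth limit: the tree keeps splitting until the training set is fit exactly
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Depth capped at 3: at most 8 leaves
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unconstrained", full_tree), ("max_depth=3", small_tree)]:
    print(f"{name:>13}: leaves={tree.get_n_leaves():4d}, "
          f"train acc={tree.score(X_tr, y_tr):.3f}, "
          f"test acc={tree.score(X_te, y_te):.3f}")
```

A large gap between training and test accuracy for the unconstrained tree is the typical signature of overfitting; the exact numbers depend on the data.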

---
# Hands-on: Predicting Whether a Person's Annual Income Exceeds 50,000 USD

In [2]:
df_census = pd.read_csv(dir_data / "census_cleaned.csv")
df_census.shape

(32561, 93)

In [3]:
X = df_census.iloc[:, :-1]
y = df_census.iloc[:, -1]

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.8139106402579457

---
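# Bias and Variance
- Bias: the error a model makes when applied to a real-world problem because its assumptions are too simple to capture the true pattern
    1. A model that underfits the data has high bias
- Variance: how much a model changes when it is trained on different training sets
    1. Models with high variance tend to overfit
- Machine learning should aim for both low bias and low variance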
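
To make the two terms concrete, here is a rough sketch that estimates squared bias and variance empirically. It assumes a made-up target function (a noisy sine curve) and hypothetical helpers `sample_training_set` and `bias_variance`; it is one simple way to estimate these quantities, not a standard scikit-learn routine.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 50).reshape(-1, 1)   # fixed evaluation points
f_true = np.sin(2 * np.pi * x_grid).ravel()     # noise-free target (assumed)

def sample_training_set(n=200, noise=0.5):
    """Draw a fresh noisy training set from the same underlying function."""
    x = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, noise, size=n)
    return x, y

def bias_variance(max_depth, n_rounds=50):
    """Estimate squared bias and variance of a tree over repeated training sets."""
    preds = np.empty((n_rounds, len(x_grid)))
    for i in range(n_rounds):
        x, y = sample_training_set()
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        preds[i] = tree.fit(x, y).predict(x_grid)
    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)  # average prediction vs. truth
    variance = np.mean(preds.var(axis=0))                  # spread across training sets
    return bias_sq, variance

bias_stump, var_stump = bias_variance(max_depth=1)    # very shallow tree: underfits
bias_deep, var_deep = bias_variance(max_depth=None)   # unlimited depth: overfits
print(f"max_depth=1   : bias^2={bias_stump:.3f}, variance={var_stump:.3f}")
print(f"max_depth=None: bias^2={bias_deep:.3f}, variance={var_deep:.3f}")
```

The shallow tree should show the larger bias term and the unlimited tree the larger variance term, which is the trade-off that hyperparameter tuning tries to balance.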

---
# 决策树的超参数
- [Official documentation: DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
- Hyperparameters to tune first:
    1. max_depth
    2. max_features
    3. min_samples_leaf
    4. max_leaf_nodes
    5. min_impurity_decrease
    6. min_samples_split
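
As a quick orientation, the sketch below fits one tree per hyperparameter from the list above and reports the resulting tree size. The data comes from `make_regression` and the specific values (3, 0.5, 10, 8, 50.0, 40) are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for this sketch)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=20.0, random_state=0
)

# One constraint at a time, compared against the unconstrained defaults
settings = {
    "defaults": {},
    "max_depth=3": {"max_depth": 3},
    "max_features=0.5": {"max_features": 0.5},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
    "max_leaf_nodes=8": {"max_leaf_nodes": 8},
    "min_impurity_decrease=50": {"min_impurity_decrease": 50.0},
    "min_samples_split=40": {"min_samples_split": 40},
}

trees = {}
for name, kwargs in settings.items():
    trees[name] = DecisionTreeRegressor(random_state=0, **kwargs).fit(X_demo, y_demo)
    print(f"{name:>26}: depth={trees[name].get_depth():2d}, "
          f"leaves={trees[name].get_n_leaves():3d}")
```

Most of these settings shrink the tree relative to the defaults, which is why they are the usual first candidates for reducing overfitting.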

## Hands-on: Predicting the Number of Bike Rentals on a Given Day
- Baseline test, showing that a single decision tree overfits easily

In [31]:
# Baseline test
df = pd.read_csv(dir_data / "bike_rentals_cleaned.csv")
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def grid_serach(params=None, reg=None):
    # Build the defaults on each call instead of sharing mutable default arguments
    params = {} if params is None else params
    reg = DecisionTreeRegressor(random_state=42) if reg is None else reg

    grid_reg = GridSearchCV(reg, params, scoring="neg_mean_squared_error",
                            cv=5, n_jobs=-1)
    grid_reg.fit(X_train, y_train)

    best_params = grid_reg.best_params_
    # best_score_ is the mean cross-validated negative MSE; convert it to RMSE
    best_score = np.sqrt(-grid_reg.best_score_)

    # Evaluate the refitted best model on the held-out test set
    best_model = grid_reg.best_estimator_
    y_pred = best_model.predict(X_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    print(f"Best params: {best_params}")
    print(f"Training score: {best_score:.2f}")  # cross-validated RMSE on the training set
    print(f"Test score: {test_rmse:.2f}")

grid_serach()

Best params: {}
Training score: 1000.03
Test score: 816.36


In [30]:
# Inspect the model's performance on the training set
from sklearn.metrics import mean_squared_error

reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_train)

reg_mse = mean_squared_error(y_train, y_pred)
reg_rmse = np.sqrt(reg_mse)
reg_rmse  # ! An RMSE of 0 means the model fits the training set perfectly

np.float64(0.0)

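## max_depth (maximum tree depth)
- Default: None, i.e. no limit; a single decision tree can keep splitting indefinitely, which makes it prone to overfitting
- Use grid search to select the best value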
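
To see what the grid search compares for each depth, the following sketch prints training and validation RMSE side by side. It assumes synthetic data from `make_regression` rather than the bike rentals data, and relies on `return_train_score=True` so that `cv_results_` also contains training scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for this sketch)
X_demo, y_demo = make_regression(
    n_samples=400, n_features=10, noise=30.0, random_state=0
)

depths = [None, 2, 3, 4, 5, 6, 8, 10, 20]
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": depths},
    scoring="neg_mean_squared_error",
    cv=5,
    return_train_score=True,   # also record scores on the training folds
)
search.fit(X_demo, y_demo)

results = search.cv_results_
train_rmse = np.sqrt(-results["mean_train_score"])
valid_rmse = np.sqrt(-results["mean_test_score"])
for params, tr, va in zip(results["params"], train_rmse, valid_rmse):
    print(f"max_depth={str(params['max_depth']):>4}: "
          f"train RMSE={tr:8.2f}, validation RMSE={va:8.2f}")
print("Best:", search.best_params_)
```

With no depth limit the training error drops to zero while the validation error does not, so the search picks the depth with the lowest validation error instead.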

In [32]:
params = {'max_depth': [None, 2, 3, 4, 5, 6, 8, 10, 20]}

grid_serach(params)

Best params: {'max_depth': 5}
Training score: 906.15
Test score: 858.24


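## min_samples_leaf (minimum number of samples per leaf node)
- Default: 1, i.e. a leaf may consist of a single sample, which easily leads to overfitting
- Increasing it helps reduce the risk of overfitting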
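
A small sketch of how `min_samples_leaf` shapes the fitted tree, assuming synthetic data from `make_regression`. The helper `smallest_leaf` is hypothetical and reads the fitted `tree_` structure; the float value 0.05 is included because scikit-learn documents a float as a fraction of the training samples.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with 400 samples (an assumption for this sketch)
X_demo, y_demo = make_regression(
    n_samples=400, n_features=10, noise=30.0, random_state=0
)

def smallest_leaf(tree):
    """Return the number of training samples in the smallest leaf."""
    is_leaf = tree.tree_.children_left == -1   # leaves have no child nodes
    return int(tree.tree_.n_node_samples[is_leaf].min())

leaf_sizes = {}
for msl in [1, 2, 10, 30, 0.05]:   # 0.05 means 5% of the 400 samples
    reg_demo = DecisionTreeRegressor(min_samples_leaf=msl, random_state=0)
    reg_demo.fit(X_demo, y_demo)
    leaf_sizes[msl] = smallest_leaf(reg_demo)
    print(f"min_samples_leaf={msl!s:>4}: leaves={reg_demo.get_n_leaves():3d}, "
          f"smallest leaf={leaf_sizes[msl]}")
```

Larger values force every leaf to average over more samples, which smooths the predictions at the cost of a coarser tree.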

In [33]:
params = {'min_samples_leaf': [1, 2, 4, 6, 8, 10, 20, 30]}
grid_serach(params)

Best params: {'min_samples_leaf': 10}
Training score: 853.53
Test score: 910.45


In [34]:
params = {
    'max_depth': [None, 2, 3, 4, 5, 6, 8, 10, 20],
    'min_samples_leaf': [1, 2, 4, 6, 8, 10, 20, 30]
}
grid_serach(params)

Best params: {'max_depth': None, 'min_samples_leaf': 10}
Training score: 853.53
Test score: 910.45


---
# Decision Tree Classification Hands-on: Predicting Heart Disease

In [36]:
df_heart = pd.read_csv(dir_data / "heart_disease.csv")
df_heart.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [37]:
X = df_heart.iloc[:, :-1]   
y = df_heart.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Baseline Model

In [39]:
model = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"Accuracy: {np.round(scores, 2)}")
print(f"Mean accuracy: {np.round(scores.mean(), 2)}")

Accuracy: [0.76 0.73 0.81 0.69 0.73]
Mean accuracy: 0.74


## Hyperparameter Selection

In [45]:
from sklearn.model_selection import RandomizedSearchCV

def random_search_clf(params, runs=20):
    clf = DecisionTreeClassifier(random_state=42)
    rand_clf = RandomizedSearchCV(
        clf, params, n_iter=runs, cv=5, n_jobs=5,
        random_state=42
    )
    rand_clf.fit(X_train, y_train)
    best_model = rand_clf.best_estimator_
    # best_score_ is the mean cross-validated accuracy of the best candidate
    best_score = rand_clf.best_score_
    print(f"Training score: {best_score:.2f}")

    # Evaluate the refitted best model on the held-out test set
    y_pred = best_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Test score: {accuracy:.2f}")
    return best_model

In [61]:
params = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'min_samples_split': [2, 3, 4, 5, 6, 8, 10],
    'min_samples_leaf': [1, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1],
    'min_impurity_decrease': [0.0, 0.0005, 0.005, 0.05, 0.1, 0.15, 0.2],
    'max_leaf_nodes': [10, 15, 20, 25, 30, 35, 40, 45, 50, None],
    'max_features': [None, 0.95, 0.90, 0.85, 0.80, 0.75, 0.7],
    'max_depth': [None, 2, 3, 4, 5, 6, 8, 10, 20],
    'min_weight_fraction_leaf': [0.0, 0.0025, 0.005, 0.0075, 0.01, 0.05]
}

best_model = random_search_clf(params, runs=20)
best_model.get_params()

Training score: 0.80
Test score: 0.77


{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 5,
 'max_features': 0.75,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0005,
 'min_samples_leaf': 0.01,
 'min_samples_split': 4,
 'min_weight_fraction_leaf': 0.005,
 'monotonic_cst': None,
 'random_state': 42,
 'splitter': 'random'}
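
The reason for using a randomized search here becomes clear when counting the candidates. The sketch below reuses the search space from the cell above under the hypothetical name `params_demo` and uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers; to my understanding `RandomizedSearchCV` draws its candidates in a similar way, but treat that as an assumption.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as above, copied under a separate (hypothetical) name
params_demo = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'min_samples_split': [2, 3, 4, 5, 6, 8, 10],
    'min_samples_leaf': [1, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1],
    'min_impurity_decrease': [0.0, 0.0005, 0.005, 0.05, 0.1, 0.15, 0.2],
    'max_leaf_nodes': [10, 15, 20, 25, 30, 35, 40, 45, 50, None],
    'max_features': [None, 0.95, 0.90, 0.85, 0.80, 0.75, 0.7],
    'max_depth': [None, 2, 3, 4, 5, 6, 8, 10, 20],
    'min_weight_fraction_leaf': [0.0, 0.0025, 0.005, 0.0075, 0.01, 0.05],
}

# An exhaustive grid search would have to evaluate every combination
n_total = len(ParameterGrid(params_demo))

# A randomized search only evaluates n_iter randomly drawn combinations
sampled = list(ParameterSampler(params_demo, n_iter=20, random_state=42))

print(f"Full grid: {n_total:,} combinations")
print(f"Randomized search evaluates: {len(sampled)} of them")
print("One sampled candidate:", sampled[0])
```

With 5-fold cross-validation, the exhaustive grid would require millions of model fits, whereas 20 sampled candidates need only 100.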

## Checking the Performance of the Tuned Model

In [62]:
model = best_model
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"Accuracy: {np.round(scores, 2)}")
print(f"Mean accuracy: {np.round(scores.mean(), 2)}")

Accuracy: [0.82 0.78 0.83 0.79 0.77]
Mean accuracy: 0.8


## Feature Importance

In [71]:
import operator

# Pair each feature name with its importance and sort in descending order
feature_dict = dict(zip(X.columns, model.feature_importances_))
sorted(feature_dict.items(), key=operator.itemgetter(1), reverse=True)

[('exang', np.float64(0.3073929485313097)),
 ('ca', np.float64(0.23768966305113415)),
 ('sex', np.float64(0.09381420780786087)),
 ('cp', np.float64(0.0858642434380593)),
 ('slope', np.float64(0.07750439180073812)),
 ('oldpeak', np.float64(0.07481278188102705)),
 ('thal', np.float64(0.07019042250056112)),
 ('trestbps', np.float64(0.024647943028019394)),
 ('age', np.float64(0.014574116934075365)),
 ('restecg', np.float64(0.013509281027214822)),
 ('chol', np.float64(0.0)),
 ('fbs', np.float64(0.0)),
 ('thalach', np.float64(0.0))]
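
The zeros at the bottom of the list correspond to features the tree never split on. The sketch below illustrates this on synthetic data from `make_classification` (the column names `f0` to `f7` are made up), and also shows that the importances of a fitted tree are normalized to sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical column names f0..f7
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

clf_demo = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Importances as a labeled Series, sorted in descending order
importances = pd.Series(clf_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
print(f"Sum of importances: {importances.sum():.3f}")

# Features that appear in at least one split (negative entries mark leaf nodes)
used = {X_demo.columns[i] for i in clf_demo.tree_.feature if i >= 0}
print("Features used in splits:", sorted(used))
```

A shallow tree has only a few split nodes, so it is normal for several features to end up with zero importance; that does not by itself mean those features carry no signal.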