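# Lesson 9: Advanced Ensemble Learning

## Learning Objectives
- Understand how Boosting works in depth
- Learn to use XGBoost
- Learn to use LightGBM
- Get to know the distinguishing features of CatBoost
- Learn the Stacking ensemble method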

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer, fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

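## 1. Boosting Review

Boosting combines many weak learners into a single strong learner:

- **Sequential training**: each model is fit to correct the errors of the models before it
- **Weighted combination**: the final prediction is a weighted sum of all the models' outputs
- **Representative algorithms**: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost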
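
The two ideas above, sequential error correction and a weighted sum of models, can be sketched from scratch. The block below is a minimal illustration of gradient boosting for squared loss using decision stumps on one feature. The helper names (`fit_stump`, `predict_stump`, `boost`) and the synthetic data are assumptions made for this sketch and are not taken from any of the libraries used in this lesson.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 tree: one threshold, one constant prediction per side."""
    best = None
    # Excluding the largest value keeps both sides of every split non-empty.
    for t in np.unique(x)[:-1]:
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    base = y.mean()                      # round 0: a constant model
    pred = np.full(len(y), base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred              # negative gradient of squared loss
        stump = fit_stump(x, residual)   # the next model targets the current errors
        pred += learning_rate * predict_stump(stump, x)  # weighted sum of models
        stumps.append(stump)
    return base, stumps, pred

# Synthetic 1-D regression problem (illustrative data only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

base, stumps, pred = boost(x, y)
mse_start = float(np.mean((y - y.mean()) ** 2))
mse_end = float(np.mean((y - pred) ** 2))
print(f"Training MSE with constant model: {mse_start:.4f}")
print(f"Training MSE after {len(stumps)} rounds: {mse_end:.4f}")
```

Each round fits only what the current ensemble still gets wrong, which is why the training error keeps shrinking as rounds are added.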

In [None]:
# Prepare the data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Feature names: {cancer.feature_names[:5]}...")

## 2. Sklearn Gradient Boosting

In [None]:
# Gradient Boosting from scikit-learn
gb_clf = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

gb_clf.fit(X_train, y_train)
y_pred_gb = gb_clf.predict(X_test)

print("Sklearn Gradient Boosting:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_gb):.4f}")

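## 3. XGBoost

XGBoost (eXtreme Gradient Boosting) is one of the most widely used Boosting libraries.

Key features:
- L1/L2 regularization on leaf weights to limit overfitting
- Parallelized split finding within each tree to speed up training (the trees themselves are still built sequentially)
- Native handling of missing values
- Built-in cross-validation (`xgb.cv`)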
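
To make the regularization point concrete, here is a small sketch of the textbook second-order formulas that XGBoost's objective is based on: the optimal leaf weight `-G / (H + lambda)` and the split gain with an L2 penalty `lambda` and a per-split penalty `gamma`. `G` and `H` are the sums of first and second derivatives of the loss over the samples in a node. The function names are chosen for this example and the numbers are made up; this is not the library's internal code.

```python
def leaf_weight(G, H, reg_lambda=1.0):
    """Optimal leaf value under an L2 penalty on leaf weights."""
    return -G / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into a left and a right child."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma

# Hypothetical gradient statistics for one candidate split.
gain_no_reg = split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=0.0)
gain_l2 = split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=1.0)
gain_l2_gamma = split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=1.0, gamma=5.0)

print(f"gain, no regularization: {gain_no_reg:.4f}")    # 6.1250
print(f"gain, lambda=1:          {gain_l2:.4f}")        # 4.0667
print(f"gain, lambda=1, gamma=5: {gain_l2_gamma:.4f}")  # -0.9333, so the split is rejected
print(f"left leaf weight, lambda=1: {leaf_weight(-4.0, 2.0):.4f}")  # 1.3333
```

A larger `reg_lambda` shrinks both the leaf values and the gain of every split, and `gamma` sets a minimum gain a split must reach, so both make the trees more conservative.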

In [None]:
# Install: pip install xgboost
try:
    import xgboost as xgb
    print(f"XGBoost version: {xgb.__version__}")
except ImportError:
    print("Please install xgboost first: pip install xgboost")

In [None]:
import xgboost as xgb

# Create an XGBoost classifier
xgb_clf = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='logloss'
)

xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_test)

print("XGBoost:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_xgb):.4f}")

In [None]:
# Feature importance
feature_importance = pd.DataFrame({
    'feature': cancer.feature_names,
    'importance': xgb_clf.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 8))
plt.barh(feature_importance['feature'][:15], feature_importance['importance'][:15])
plt.xlabel('Importance')
plt.title('XGBoost Feature Importance (Top 15)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
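# Early stopping
# Hold out a validation set from the training data so the test set stays unseen
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

xgb_clf_es = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
    eval_metric='logloss',
    early_stopping_rounds=10  # as a constructor argument this requires xgboost >= 1.6
)

xgb_clf_es.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print(f"Best iteration: {xgb_clf_es.best_iteration}")
print(f"Accuracy: {accuracy_score(y_test, xgb_clf_es.predict(X_test)):.4f}")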
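
The stopping rule itself is simple enough to sketch in plain Python. The function below is a simplified illustration of patience-based early stopping, not XGBoost's implementation: training stops once the validation loss has failed to improve for `patience` consecutive rounds, and the best round seen so far is reported. The function name and the loss values are invented for this example.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_iteration, stopped_at) for a sequence of validation losses."""
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_iter, best_loss = i, loss       # new best: reset the patience window
        elif i - best_iter >= patience:
            return best_iter, i                  # no improvement for `patience` rounds
    return best_iter, len(val_losses) - 1        # ran out of rounds without stopping

# Hypothetical validation log-loss per boosting round.
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.36, 0.39, 0.40]
best, stopped = best_iteration_with_patience(losses, patience=3)
print(f"best iteration: {best}, stopped at: {stopped}")  # best iteration: 3, stopped at: 6
```

Because the loop is driven entirely by the validation losses, the data used for `eval_set` influences the final model, which is why it should be separate from the data used to report the final score.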

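## 4. LightGBM

LightGBM is an efficient Boosting framework developed by Microsoft.

Key features:
- **Leaf-wise growth**: always splits the leaf with the largest loss reduction, which usually lowers the loss faster than level-wise growth for the same number of leaves but can overfit small datasets
- **Histogram algorithm**: buckets feature values into bins to speed up split finding
- **Categorical feature support**: no one-hot encoding needed (columns must be integer-coded or use the pandas `category` dtype)
- **Very fast training**: especially well suited to large datasets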
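
The histogram idea can be sketched with NumPy. The function below is a simplified, single-feature illustration under squared loss (so every sample has a second derivative of 1); it is not LightGBM's actual implementation. The function name, the quantile-based binning, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def histogram_best_split(x, grad, n_bins=16):
    """Pick a split threshold for one feature by scanning a gradient histogram."""
    # Interior bin edges at evenly spaced quantiles (n_bins - 1 of them).
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    # Bin index of each sample, in [0, n_bins - 1].
    bins = np.searchsorted(edges, x, side="right")
    # A single pass over the data builds the histogram.
    grad_sum = np.bincount(bins, weights=grad, minlength=n_bins)
    count = np.bincount(bins, minlength=n_bins)

    total_grad, total_count = grad_sum.sum(), count.sum()
    best_threshold, best_gain = None, 0.0
    left_grad, left_count = 0.0, 0
    # Only n_bins - 1 candidate splits are scored, not one per unique value.
    for b in range(n_bins - 1):
        left_grad += grad_sum[b]
        left_count += count[b]
        right_grad = total_grad - left_grad
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        gain = (left_grad ** 2 / left_count + right_grad ** 2 / right_count
                - total_grad ** 2 / total_count)
        if gain > best_gain:
            best_threshold, best_gain = edges[b], gain
    # Samples with x < best_threshold go to the left child.
    return best_threshold, best_gain

# Synthetic gradients that change sign around x = 4 (illustrative data only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
grad = np.where(x < 4, -1.0, 1.0) + rng.normal(scale=0.1, size=1000)

threshold, gain = histogram_best_split(x, grad)
print(f"chosen threshold: {threshold:.3f}, gain: {gain:.1f}")
```

The chosen threshold can only fall on a bin edge, so it lands near 4 rather than exactly on it. That loss of precision is the price paid for scoring a fixed number of candidates per feature.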

In [None]:
# Install: pip install lightgbm
try:
    import lightgbm as lgb
    print(f"LightGBM version: {lgb.__version__}")
except ImportError:
    print("Please install lightgbm first: pip install lightgbm")

In [None]:
import lightgbm as lgb

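# Create a LightGBM classifier
lgb_clf = lgb.LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    num_leaves=31,  # with max_depth=3 a tree has at most 2**3 = 8 leaves, so this cap is never reached
    subsample=0.8,
    subsample_freq=1,  # row subsampling only takes effect when subsample_freq > 0
    colsample_bytree=0.8,
    random_state=42,
    verbose=-1
)

lgb_clf.fit(X_train, y_train)
y_pred_lgb = lgb_clf.predict(X_test)

print("LightGBM:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lgb):.4f}")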

In [None]:
# Two kinds of LightGBM feature importance
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Split importance: how many times a feature is used to split
lgb.plot_importance(lgb_clf, importance_type='split', max_num_features=15, ax=axes[0])
axes[0].set_title('Feature Importance (Split)')

# Gain importance: total loss reduction from splits on a feature
lgb.plot_importance(lgb_clf, importance_type='gain', max_num_features=15, ax=axes[1])
axes[1].set_title('Feature Importance (Gain)')

plt.tight_layout()
plt.show()
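
The two importance types can rank features differently. The toy example below shows why, using a hand-written list of splits: `split` importance counts how often a feature is used, while `gain` importance sums the loss reduction of those splits. The feature names and gain values are hypothetical and do not come from the model trained above.

```python
from collections import Counter, defaultdict

# (feature used, gain of that split) for every split in a hypothetical ensemble.
splits = [
    ("radius", 12.0),
    ("texture", 0.8),
    ("texture", 0.5),
    ("texture", 0.7),
    ("radius", 9.0),
    ("area", 3.0),
]

# 'split' importance: number of times each feature is used.
split_importance = Counter(feature for feature, _ in splits)

# 'gain' importance: total gain over all splits that use the feature.
gain_importance = defaultdict(float)
for feature, gain in splits:
    gain_importance[feature] += gain

top_by_split = max(split_importance, key=split_importance.get)
top_by_gain = max(gain_importance, key=gain_importance.get)
print(f"top feature by split count: {top_by_split}")  # texture
print(f"top feature by total gain:  {top_by_gain}")   # radius
```

A feature that is used often for small refinements scores high on split count, while a feature used rarely for large improvements scores high on gain.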

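## 5. CatBoost

CatBoost is a Boosting framework developed by Yandex that is particularly strong at handling categorical features.

Key features:
- **Ordered target encoding**: encodes each categorical value using target statistics from preceding rows only, which limits target leakage
- **Symmetric (oblivious) trees**: every node at the same depth uses the same split, giving balanced trees
- **Little tuning required**: the default parameters are often a strong baseline, though tuning can still help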
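
Ordered target encoding can be illustrated in a few lines of plain Python. The sketch below is a simplification: it treats the given row order as the ordering, whereas CatBoost uses random permutations, and the function name and the `prior` / `prior_weight` parameters are chosen for this example. Each row is encoded using only the targets of earlier rows with the same category, smoothed towards a prior.

```python
def ordered_target_encode(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row from the targets of earlier rows with the same category."""
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # The row's own target is NOT used for its own encoding.
        encoded.append((s + prior * prior_weight) / (c + prior_weight))
        # Only after encoding is the row's target added to the running statistics.
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
encoded = ordered_target_encode(cats, ys)
print([round(v, 4) for v in encoded])  # [0.5, 0.5, 0.75, 0.8333, 0.25]
```

The first occurrence of each category falls back to the prior, and later occurrences move towards the observed target mean of the rows that came before them.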

In [None]:
# Install: pip install catboost
try:
    from catboost import CatBoostClassifier
    import catboost
    print(f"CatBoost version: {catboost.__version__}")
except ImportError:
    print("Please install catboost first: pip install catboost")

In [None]:
from catboost import CatBoostClassifier

# Create a CatBoost classifier
cat_clf = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=3,
    random_state=42,
    verbose=False
)

cat_clf.fit(X_train, y_train)
y_pred_cat = cat_clf.predict(X_test)

print("CatBoost:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_cat):.4f}")

## 6. Comparing the Three Major Boosting Libraries

In [None]:
import time

# Performance comparison (single-run timings on a small dataset, so treat them as rough indications)
results = []

# Sklearn GB (included as a baseline)
start = time.perf_counter()
gb_clf.fit(X_train, y_train)
gb_time = time.perf_counter() - start
gb_acc = accuracy_score(y_test, gb_clf.predict(X_test))
results.append({'Model': 'Sklearn GB', 'Accuracy': gb_acc, 'Time': gb_time})

# XGBoost
start = time.perf_counter()
xgb_clf.fit(X_train, y_train)
xgb_time = time.perf_counter() - start
xgb_acc = accuracy_score(y_test, xgb_clf.predict(X_test))
results.append({'Model': 'XGBoost', 'Accuracy': xgb_acc, 'Time': xgb_time})

# LightGBM
start = time.perf_counter()
lgb_clf.fit(X_train, y_train)
lgb_time = time.perf_counter() - start
lgb_acc = accuracy_score(y_test, lgb_clf.predict(X_test))
results.append({'Model': 'LightGBM', 'Accuracy': lgb_acc, 'Time': lgb_time})

# CatBoost
start = time.perf_counter()
cat_clf.fit(X_train, y_train, verbose=False)
cat_time = time.perf_counter() - start
cat_acc = accuracy_score(y_test, cat_clf.predict(X_test))
results.append({'Model': 'CatBoost', 'Accuracy': cat_acc, 'Time': cat_time})

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

In [None]:
# Visual comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Accuracy
axes[0].bar(results_df['Model'], results_df['Accuracy'], color=['blue', 'green', 'red', 'purple'])
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Model Accuracy Comparison')
axes[0].set_ylim(0.9, 1.0)

# Training time
axes[1].bar(results_df['Model'], results_df['Time'], color=['blue', 'green', 'red', 'purple'])
axes[1].set_ylabel('Time (seconds)')
axes[1].set_title('Training Time Comparison')

plt.tight_layout()
plt.show()

## 7. XGBoost Hyperparameter Tuning

In [None]:
# Important XGBoost parameters
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0]
}

xgb_grid = GridSearchCV(
    xgb.XGBClassifier(random_state=42, eval_metric='logloss'),
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

xgb_grid.fit(X_train, y_train)

print(f"\nBest parameters: {xgb_grid.best_params_}")
print(f"Best CV score: {xgb_grid.best_score_:.4f}")
print(f"Test set accuracy: {accuracy_score(y_test, xgb_grid.predict(X_test)):.4f}")
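
Before running a grid search it helps to know how many models it will train. The short calculation below reproduces the grid from the cell above and counts the parameter combinations and cross-validation fits; the variable names `n_candidates` and `n_fits` are chosen for this example.

```python
from itertools import product

# Same grid and fold count as in the search above.
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
}
cv = 3

# Every combination of one value per parameter is one candidate.
n_candidates = len(list(product(*param_grid.values())))
n_fits = n_candidates * cv

print(f"candidates: {n_candidates}")  # candidates: 54
print(f"cross-validation fits: {n_fits}")  # cross-validation fits: 162
```

With the default `refit=True`, `GridSearchCV` then trains the best candidate once more on the full training set. Adding one more value to any parameter multiplies the cost, which is why large grids are often replaced by randomized search.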

## 8. Stacking Ensemble

Stacking trains a final estimator on the cross-validated predictions of several base models.

In [None]:
# Define the base models
estimators = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('xgb', xgb.XGBClassifier(n_estimators=50, random_state=42, eval_metric='logloss')),
    ('lgb', lgb.LGBMClassifier(n_estimators=50, random_state=42, verbose=-1))
]

# Create the Stacking classifier
stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(random_state=42),
    cv=5
)

stacking_clf.fit(X_train, y_train)
y_pred_stack = stacking_clf.predict(X_test)

print("Stacking ensemble:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_stack):.4f}")
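
The role of `cv=5` is easier to see when the out-of-fold step is written out by hand. The sketch below is a simplified NumPy-only illustration on a synthetic regression problem, not a reimplementation of `StackingClassifier`: regression replaces classification to keep the code short, and the two base learners (a linear model and a k-nearest-neighbour average) and all helper names are assumptions made for this example.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear model with intercept; returns a predict function."""
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def fit_knn(X, y, k=5):
    """k-nearest-neighbour average; returns a predict function."""
    def predict(Xn):
        dist = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(dist, axis=1)[:, :k]
        return y[nearest].mean(axis=1)
    return predict

def out_of_fold(fit_fn, X, y, n_folds=5, seed=0):
    """Predict every row with a model that never saw that row during fitting."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    oof = np.empty(len(X))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(X)), held_out)
        model = fit_fn(X[train], y[train])
        oof[held_out] = model(X[held_out])
    return oof

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Synthetic data with one linear and one non-linear effect (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.1, size=300)

# One column of out-of-fold predictions per base model.
meta_features = np.column_stack([
    out_of_fold(fit_linear, X, y),
    out_of_fold(fit_knn, X, y),
])

# The final estimator learns how to weight the base models' predictions.
meta_model = fit_linear(meta_features, y)
stacked = meta_model(meta_features)

print(f"linear  out-of-fold MSE: {mse(y, meta_features[:, 0]):.4f}")
print(f"k-NN    out-of-fold MSE: {mse(y, meta_features[:, 1]):.4f}")
print(f"stacked (meta-model fit) MSE: {mse(y, stacked):.4f}")
```

Using out-of-fold predictions means the final estimator is trained on predictions the base models made for rows they had not seen, which keeps it from simply rewarding whichever base model overfits the most. The stacked MSE printed here is measured on the same rows the meta-model was fit on, so it is an optimistic number, not a test score.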

## 9. Regression Example

In [None]:
# Load the California housing data
housing = fetch_california_housing()
X_reg, y_reg = housing.data, housing.target

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42)

print(f"Training set: {X_train_r.shape}")
print(f"Target range: {y_reg.min():.2f} - {y_reg.max():.2f}")

In [None]:
# Train the regression models
xgb_reg = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
lgb_reg = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42, verbose=-1)

xgb_reg.fit(X_train_r, y_train_r)
lgb_reg.fit(X_train_r, y_train_r)

# Predict
y_pred_xgb_r = xgb_reg.predict(X_test_r)
y_pred_lgb_r = lgb_reg.predict(X_test_r)

print("XGBoost regression:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test_r, y_pred_xgb_r)):.4f}")
print(f"  R2: {r2_score(y_test_r, y_pred_xgb_r):.4f}")

print("\nLightGBM regression:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test_r, y_pred_lgb_r)):.4f}")
print(f"  R2: {r2_score(y_test_r, y_pred_lgb_r):.4f}")
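
For reference, the two metrics reported above can be computed directly from their definitions. The sketch below uses NumPy on a tiny hand-made example; the function names `rmse` and `r2` are chosen for this example, and the values are not related to the housing data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])

print(f"RMSE: {rmse(y_true, y_pred):.4f}")  # RMSE: 0.5774
print(f"R2:   {r2(y_true, y_pred):.4f}")    # R2:   0.5000
```

RMSE is expressed in the target's own units, while R2 compares the model with a baseline that always predicts the mean, so a value of 0 means no better than that baseline.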

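## 10. Exercises

### Exercise 1: Tune LightGBM
Run a grid search over LightGBM's hyperparameters.

In [None]:
# Write your code here


### Exercise 2: Custom Stacking
Try different combinations of base models and compare the Stacking results.

In [None]:
# Write your code here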


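## 11. Lesson Summary

### Comparing the Three Major Boosting Libraries

| Feature | XGBoost | LightGBM | CatBoost |
|------|---------|----------|----------|
| Tree growth strategy | Level-wise (default) | Leaf-wise | Symmetric |
| Training speed | Medium | Usually fastest | Often slower |
| Memory usage | Medium | Usually lowest | Often higher |
| Categorical features | Encoding usually needed (native support via `enable_categorical` in recent versions) | Supported | Best support |
| Default performance | Good | Good | Often best |
| GPU support | Yes | Yes | Yes |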

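### How to Choose

1. **XGBoost**: the most general-purpose option, with very strong community support
2. **LightGBM**: a strong first choice for large datasets, where it typically trains fastest
3. **CatBoost**: a strong first choice when the data has categorical features; works well out of the box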

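### Important Parameters

- **n_estimators**: number of trees (`iterations` in CatBoost)
- **learning_rate**: shrinkage applied to each tree; smaller values need more trees
- **max_depth**: maximum tree depth (`depth` in CatBoost)
- **subsample**: fraction of rows sampled for each tree
- **colsample_bytree**: fraction of features sampled for each tree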
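
The trade-off between `learning_rate` and `n_estimators` can be made concrete with a back-of-the-envelope calculation. The sketch below rests on a strong idealization, stated as an assumption: every tree fits the current residual perfectly, so each round removes exactly a `learning_rate` fraction of what is left. Real trees are imperfect, so actual counts are higher, but the scaling is the same. The function name `rounds_needed` is chosen for this example.

```python
import math

def rounds_needed(learning_rate, target_fraction=0.01):
    """Rounds until the residual falls to `target_fraction` of its start,
    assuming each round removes a `learning_rate` share of the residual."""
    # After n rounds the remaining residual is (1 - learning_rate) ** n.
    return math.ceil(math.log(target_fraction) / math.log(1.0 - learning_rate))

for lr in (0.2, 0.1, 0.01):
    print(f"learning_rate={lr}: {rounds_needed(lr)} rounds")
# learning_rate=0.2: 21 rounds
# learning_rate=0.1: 44 rounds
# learning_rate=0.01: 459 rounds
```

Dividing the learning rate by ten requires roughly ten times as many trees, which is why the two parameters are normally tuned together or paired with early stopping.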