# Machine Learning Notes: Tree Models

## 1. Overview of Tree Models

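| Model type | Representative models | Key characteristics and use cases | Python class / origin |
|---------|------------------------------------|-------------------|------------------------------------------------------------------------------------------------------------------------------|
| Single model | Decision tree | Splits the data with if-else rules; simple, intuitive, and easy to interpret; prone to overfitting; suited to small datasets and settings that require interpretability | `sklearn.tree.DecisionTreeClassifier` / `DecisionTreeRegressor` |
| Bagging | Random forest | Many trees trained in parallel and combined by voting; stable and resistant to overfitting; computationally expensive; a good baseline for tabular data | `sklearn.ensemble.RandomForestClassifier` / `RandomForestRegressor` |
| Boosting | GBDT | Trees are fit sequentially to the residuals of the previous ones; high predictive accuracy; slow to train, with many hyperparameters; suited to medium-sized datasets where performance matters | `sklearn.ensemble.GradientBoostingClassifier` / `GradientBoostingRegressor` |
| Optimized boosting | XGBoost <br>LightGBM <br> CatBoost | **XGBoost**: an optimized GBDT implementation; feature-rich with strong performance, but many parameters; widely used in competitions and industry<br>**LightGBM**: histogram-based algorithm; fast training and low memory use; prone to overfitting on small datasets; suited to very large datasets<br>**CatBoost**: handles categorical features automatically; little preprocessing needed; training is relatively slow; suited to data with many categorical features | `xgboost.XGBClassifier` / `XGBRegressor` (developed by **Tianqi Chen** at the **University of Washington**) <br>`lightgbm.LGBMClassifier` / `LGBMRegressor` (**Microsoft**) <br> `catboost.CatBoostClassifier` / `CatBoostRegressor` (**Yandex**, a Russian technology company) |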
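To make the first three rows of the table concrete, here is a minimal sketch that compares the scikit-learn decision tree, random forest, and GBDT classifiers with 5-fold cross-validation. The synthetic dataset and all hyperparameters are assumptions chosen only for illustration; the resulting scores say nothing about how the models rank on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic tabular data (an assumption; swap in a real dataset as needed).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gbdt": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

XGBoost, LightGBM, and CatBoost expose scikit-learn-compatible estimators, so they can usually be added to the same `models` dictionary once the packages are installed.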

## 2. Demo Projects

### 2-1: InterDIA

[Huang L, Liu P, Huang X. InterDIA: Interpretable prediction of drug-induced autoimmunity through ensemble machine learning approaches[J]. Toxicology, 2025:511. DOI:10.1016/j.tox.2025.154064.](https://www.sciencedirect.com/science/article/abs/pii/S0300483X25000204)

[https://github.com/Huangxiaojie2024/InterDIA](https://github.com/Huangxiaojie2024/InterDIA)

**Workflow diagram**

<img src="./InterDIA.jpg" width="700" height="250">

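#### Key points

1.  Handling imbalanced data with imbalanced-ensemble

    In classification problems, when the number of samples differs greatly between classes, the model tends to favor the majority class, which hurts its performance on the minority class. A common remedy is the SMOTE (Synthetic Minority Over-sampling Technique) algorithm, which oversamples the minority class with synthetic examples.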

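| Category | Core idea | Representative models | Brief description |
|---------|---------|---------|---------|
| **🔄 Resampling-based ensembles** | Before training each base learner, resample the training data (e.g. by oversampling or undersampling) to balance the class distribution. | **SMOTEBoost**<br>**RUSBoost**<br>**UnderBagging**<br>**OverBagging**<br>**SMOTEBagging** | Combine classic SMOTE and random over-/undersampling with the Boosting or Bagging framework. |
| **⚖️ Cost-sensitive ensembles** | Leave the data distribution unchanged; instead, make the learning algorithm pay more attention during training to the high cost of misclassifying minority-class samples. | **AdaCost**<br>**AsymBoost** | Achieve cost-sensitive learning by modifying the loss function or the weight-update mechanism. |
| **✨ Adaptive-sampling ensembles** | Based on how earlier base learners performed, adaptively adjust the sample distribution for the next round and focus on hard-to-classify samples. | **BalanceCascade**<br>**Self-Paced Ensemble (SPE)** | Sample dynamically and selectively, which often improves both efficiency and learning performance. |
| **🔄 Hybrid methods** | Combine several imbalance-handling techniques (e.g. sampling and cost-sensitive learning), or design new ensemble strategies. | **EasyEnsemble**<br>**BalanceCascade**<br>*(also in this category in a broad sense)* | Combine the strengths of different strategies to obtain a more robust model. |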
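The resampling-based idea in the first row can be sketched by hand. The example below is a simplified UnderBagging written with plain scikit-learn and NumPy, not the imbalanced-ensemble implementation: each tree is trained on all minority samples plus an equally sized random draw from the majority class. The helper names `fit_under_bagging` and `predict_under_bagging`, the synthetic data, and the 0/1 label encoding are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_under_bagging(X, y, n_estimators=10, random_state=0):
    """Train one tree per balanced bag (hypothetical helper)."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    majority_idx = np.flatnonzero(y != minority)

    models = []
    for i in range(n_estimators):
        # Randomly undersample the majority class down to the minority size.
        sampled_majority = rng.choice(
            majority_idx, size=len(minority_idx), replace=False
        )
        bag_idx = np.concatenate([minority_idx, sampled_majority])
        tree = DecisionTreeClassifier(random_state=i)
        models.append(tree.fit(X[bag_idx], y[bag_idx]))
    return models


def predict_under_bagging(models, X):
    """Average class-1 probabilities over the trees (binary 0/1 labels assumed)."""
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= 0.5).astype(int)


# Synthetic data with roughly a 9:1 class ratio (an assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
models = fit_under_bagging(X, y)
y_pred = predict_under_bagging(models, X)

print("minority share in labels:", y.mean())
print("minority share in predictions:", y_pred.mean())
```

Because every bag is balanced, no single tree can reach a low training error by always predicting the majority class, which is the point of resampling before each base learner.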

2.  Model selection
  - Balanced Random Forest (BRF)
  - Easy Ensemble Classifier (EEC)
  - XGBoost with Balanced Bagging (BBC+XGBoost)
  - Gradient Boosting with Balanced Bagging (BBC+GBDT)
  - LightGBM with Balanced Bagging (BBC+LightGBM)

### 2-2: IntelliGenes

[DeGroat W, Mendhe D, Bhusari A, et al. IntelliGenes: a novel machine learning pipeline for biomarker discovery and predictive analysis using multi-genomic profiles[J]. Bioinformatics, 2023, 39(12): btad755.](https://academic.oup.com/bioinformatics/article/39/12/btad755/7473370)

https://github.com/drzeeshanahmed/intelligenes

**Workflow diagram**

![IntelliGenes](./IntelliGenes.jpg)

The most important functions for feature selection come from scikit-learn: `from sklearn.feature_selection import SelectKBest, chi2, f_classif, RFE`

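`SelectKBest`: scores the features with `chi2` or `f_classif` and is configured to select the 10 best (`k = 10`). An important addition here is the p-value: the pipeline keeps only the features whose p-value is below 0.05.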
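As a small illustration of the difference between the top-`k` mask and the p-value filter, the sketch below fits `SelectKBest` with `f_classif` on synthetic data and inspects both. The data and the `gene_*` column names are hypothetical. `chi2` would work the same way, but it requires non-negative inputs.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data with hypothetical feature names.
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, n_redundant=0, random_state=0
)
columns = [f"gene_{i}" for i in range(X.shape[1])]

selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)

results = pd.DataFrame({
    "feature": columns,
    "p_value": selector.pvalues_,         # one p-value per feature
    "in_top_k": selector.get_support(),   # True for the k best-scoring features
})

# Filter by significance instead of by rank.
significant = results[results["p_value"] < 0.05]
print(significant.sort_values("p_value"))
```

The two criteria need not agree: a feature can be in the top 10 without being significant, and more than 10 features can have p < 0.05.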

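`RFE`: ranks the features by recursive feature elimination, here with a decision tree as the estimator; the top-ranked 10% of features are kept.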
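A minimal sketch of ranking by recursive feature elimination, assuming synthetic data and hypothetical `gene_*` feature names. Setting `n_features_to_select=1` makes `RFE` eliminate features one at a time until a single one remains, so `ranking_` becomes a full ordering from 1 (best) to the number of features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical feature names.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=4, random_state=0
)
columns = [f"gene_{i}" for i in range(X.shape[1])]

rfe = RFE(
    estimator=DecisionTreeClassifier(random_state=0),
    n_features_to_select=1,
).fit(X, y)

ranking = pd.DataFrame({"feature": columns, "rank": rfe.ranking_})
ranking = ranking.sort_values("rank")

# Keep the best-ranked 10% of the features (2 of 20 here).
n_keep = int(len(columns) * 0.10)
top_features = ranking[ranking["rank"] <= n_keep]
print(top_features)
```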

Pearson correlation: filters the features by the p-value of their correlation with the label (p < 0.05).
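The Pearson filter can be illustrated with `scipy.stats.pearsonr`, which returns the correlation coefficient and its p-value. The two features below are made up for the example: one is built from the label plus noise, the other is pure noise.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                 # binary label encoded as 0/1
features = {
    "related": y + rng.normal(scale=0.5, size=200),  # depends on the label
    "noise": rng.normal(size=200),                   # independent of the label
}

results = {}
for name, values in features.items():
    r, p = pearsonr(values, y)
    results[name] = (r, p)
    print(f"{name}: r = {r:.3f}, p = {p:.3g}, keep = {p < 0.05}")
```

The label has to be numeric for this to work, and with a binary label the statistic is the point-biserial correlation.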

Finally, the features selected by all of the methods (their intersection) are taken as the final biomarkers.
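Taking the intersection can be done with successive inner merges on the feature-name column. The three small tables below are invented for illustration; only the column names follow the pipeline's conventions.

```python
import pandas as pd

# Hypothetical per-selector results.
rfe_df = pd.DataFrame({"attributes": ["g1", "g2", "g3"],
                       "rfe_rankings": [1, 2, 3]})
pearson_df = pd.DataFrame({"attributes": ["g2", "g3", "g4"],
                           "pearson_p-value": [0.01, 0.03, 0.02]})
anova_df = pd.DataFrame({"attributes": ["g3", "g2", "g5"],
                         "anova_p-value": [0.001, 0.04, 0.02]})

# An inner merge keeps only the features present in every table.
selected = rfe_df
for df in (pearson_df, anova_df):
    selected = selected.merge(df, how="inner", on="attributes")

print(selected)
```

Only `g2` and `g3` appear in all three tables, so they are the only rows left, each with its score from every selector.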


```python
# (Packages/Libraries) Matrix Manipulation
import pandas as pd

# (Packages/Libraries) Statistical Analysis & Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif, RFE
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import pearsonr

# (Packages/Libraries) Miscellaneous
import argparse
import os
from datetime import datetime
from pathlib import Path

class FeatureSelection:

    def __init__(self: 'FeatureSelection', cgit_file: str, output_dir: str, random_state: int = 42, test_size: float = 0.3, use_rfe: bool = True, use_pearson: bool = True, use_chi2: bool = True, use_anova: bool = True, use_normalization: bool = False):
        self.cgit_file = cgit_file
        self.output_dir = output_dir
        self.random_state = random_state
        self.test_size = test_size
        self.use_rfe = use_rfe
        self.use_pearson = use_pearson
        self.use_chi2 = use_chi2
        self.use_anova = use_anova
        self.use_normalization = use_normalization

        self.df = pd.read_csv(self.cgit_file)

        # 'Type' is the label column and 'ID' the sample identifier; all other columns are features.
        self.y = self.df['Type']
        self.X = self.df.drop(['Type', 'ID'], axis = 1)

        # Optional min-max scaling to [0, 1]. Note that the scaler is fit on the full dataset, before the split.
        if self.use_normalization:
            self.X = pd.DataFrame(MinMaxScaler().fit_transform(self.X), columns = self.X.columns)

        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y, test_size = self.test_size, random_state = self.random_state)

        self.selectors = []

    def rfe_selector(self: 'FeatureSelection'):
        if self.use_rfe:
            print("Recursive Feature Elimination...")
            # n_features_to_select = 1 yields a full ranking (1 = best) over all features.
            rfe_selection = RFE(estimator = DecisionTreeClassifier(random_state = self.random_state), n_features_to_select = 1).fit(self.X_train, self.y_train)
            rfe_df = pd.DataFrame({'attributes': self.X_train.columns,
                                   'rfe_rankings': rfe_selection.ranking_})

            # Keep the top-ranked 10% of the features (df.shape[1] - 2 excludes 'Type' and 'ID').
            rfe_df = rfe_df.sort_values(by = 'rfe_rankings').loc[rfe_df['rfe_rankings'] <= int((self.df.shape[1] - 2) * .10)]
            return rfe_df
        return None

    def pearson_selector(self: 'FeatureSelection'):
        if self.use_pearson:
            print("Pearson's Correlation...")
            # pearsonr returns (correlation, p-value); the label column must be numeric.
            pearson_selection = [pearsonr(self.X_train[column], self.y_train) for column in self.X_train.columns]
            pearson_df = pd.DataFrame({'attributes': self.X_train.columns,
                                       'pearson_p-value': [corr[1] for corr in pearson_selection]})

            pearson_df = pearson_df[pearson_df['pearson_p-value'] < 0.05]
            return pearson_df
        return None

    def chi2_selector(self: 'FeatureSelection'):
        if self.use_chi2:
            print("Chi-Square Test...")
            # chi2 requires non-negative feature values (e.g. after --normalize).
            chi2_selection = SelectKBest(score_func = chi2, k = 10).fit(self.X_train, self.y_train)
            chi2_df = pd.DataFrame({'attributes': self.X_train.columns,
                                    'chi2_p-value': chi2_selection.pvalues_})

            # Features are filtered by p-value, not by the top-k mask.
            chi2_df = chi2_df[chi2_df['chi2_p-value'] < 0.05]
            return chi2_df
        return None

    def anova_selector(self: 'FeatureSelection'):
        if self.use_anova:
            print("ANOVA...")
            anova_selection = SelectKBest(score_func = f_classif, k = 10).fit(self.X_train, self.y_train)
            anova_df = pd.DataFrame({'attributes': self.X_train.columns,
                                     'anova_p-value': anova_selection.pvalues_})

            anova_df = anova_df[anova_df['anova_p-value'] < 0.05]
            return anova_df
        return None

    def execute_selectors(self: 'FeatureSelection'):
        self.selectors = [self.rfe_selector(),
                          self.pearson_selector(),
                          self.chi2_selector(),
                          self.anova_selector()]

        # Disabled selectors return None and are dropped here.
        self.selectors = [df for df in self.selectors if df is not None]

    def selected_attributes(self: 'FeatureSelection'):
        # Successive inner merges keep only the features chosen by every enabled selector.
        selected_attributes = pd.DataFrame({'attributes': self.X_train.columns})
        for df in self.selectors:
            selected_attributes = selected_attributes.merge(df, how = 'inner', on = 'attributes')

        selector_cols = ['rfe_rankings', 'pearson_p-value', 'chi2_p-value', 'anova_p-value']
        selectors_used = [col for col in selector_cols if col in selected_attributes.columns]
        # Defensive check: drop rows with missing values if a column belongs to a disabled selector.
        if any(not self.__dict__[f"use_{selector.split('_')[0]}"] for selector in selectors_used):
            selected_attributes = selected_attributes.dropna(subset = selectors_used, how = 'any')

        selected_attributes = selected_attributes.rename(columns={
            'attributes': 'Features',
            'rfe_rankings': 'RFE Rankings',
            'pearson_p-value': "Pearson's Correlation (p-value)",
            'chi2_p-value': 'Chi-Square Test (p-value)',
            'anova_p-value': 'ANOVA (p-value)'
        })

        return selected_attributes

def main():
    print("\n")
    print("IntelliGenes Feature Selection/Biomarker Location...")

    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--cgit_file', required = True)
    parser.add_argument('-o', '--output_dir', required = True)
    parser.add_argument('--random_state', type = int, default = 42)
    parser.add_argument('--test_size', type = float, default = 0.3)
    parser.add_argument('--no_rfe', action = 'store_true')
    parser.add_argument('--no_pearson', action = 'store_true')
    parser.add_argument('--no_chi2', action = 'store_true')
    parser.add_argument('--no_anova', action = 'store_true')
    parser.add_argument('--normalize', action = 'store_true')
    args = parser.parse_args()

    pipeline = FeatureSelection(
        cgit_file  = args.cgit_file,
        output_dir = args.output_dir,
        random_state = args.random_state,
        test_size = args.test_size,
        use_rfe = not args.no_rfe,
        use_pearson = not args.no_pearson,
        use_chi2 = not args.no_chi2,
        use_anova = not args.no_anova,
        use_normalization = args.normalize
    )

    pipeline.execute_selectors()
    features_df = pipeline.selected_attributes()

    # Create the output directory if it does not exist yet.
    os.makedirs(args.output_dir, exist_ok = True)

    # Output name: <input stem>_<timestamp>_Selected-Features.csv
    file_name = Path(args.cgit_file).stem
    features_name = f"{file_name}_{datetime.now().strftime('%m-%d-%Y-%I-%M-%S-%p')}_Selected-Features.csv"
    features_file = os.path.join(args.output_dir, features_name)

    features_df.to_csv(features_file, index = False)
    print("\n Selected Features:", features_file, "\n")

if __name__ == '__main__':
    main()
```