# Model Ensembling (Ensemble Learning) Techniques

<center><img src="http://ml2022.oss-cn-hangzhou.aliyuncs.com/img/image-20221101183239483.png" alt="image-20221101183239483" style="zoom:50%;" /></center>

In [1]:
import joblib
import pathlib
import warnings

import pandas as pd

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

PROCESSED_DATA_DIR = pathlib.Path("../dataset/processed")

## Loading the Data

### Load the Dataset

In [2]:
from sklearn.model_selection import train_test_split

inputs = joblib.load(PROCESSED_DATA_DIR / "inputs.joblib")
target = joblib.load(PROCESSED_DATA_DIR / "target.joblib")
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.25, random_state=42, stratify=target)
y_train.value_counts()

fraudulent
0    12252
1     8127
Name: count, dtype: int64

### Load the Preprocessor

In [3]:
preprocessor = joblib.load(PROCESSED_DATA_DIR / "preprocessor.joblib")
preprocessor

### Load the Base Models

In [4]:
tree = joblib.load("../app/models/DecisionTreeClassifier.joblib")
lr = joblib.load("../app/models/LogisticRegression.joblib")
lsvc = joblib.load("../app/models/LinearSVC.joblib")
sgdc = joblib.load("../app/models/SGDClassifier.joblib")
rf = joblib.load("../app/models/RandomForestClassifier.joblib")

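## Voting

When the models output class labels, they can be combined by voting. The ensemble follows the **majority rule**: the class predicted by most models wins. For example, suppose five models A, B, C, D and E each predict the same four samples, with these results: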

<table>
  <tr>
    <th rowspan="2">Sample</th>
    <th colspan="5">Individual model predictions</th>
    <th colspan="2">Vote counts</th>
    <th>Final prediction</th>
  </tr>
  <tr>
    <th>Model A</th>
    <th>Model B</th>
    <th>Model C</th>
    <th>Model D</th>
    <th>Model E</th>
    <th>Votes for 0</th>
    <th>Votes for 1</th>
    <th>Rule: majority wins</th>
  </tr>
  <tr>
    <td>1</td>
    <td>1</td>
    <td>1</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
    <td>1</td>
    <td>4</td>
    <td>1</td>
  </tr>
  <tr>
    <td>2</td>
    <td>1</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
    <td>2</td>
    <td>3</td>
    <td>1</td>
  </tr>
  <tr>
    <td>3</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>3</td>
    <td>2</td>
    <td>0</td>
  </tr>
  <tr>
    <td>4</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>4</td>
    <td>1</td>
    <td>0</td>
  </tr>
</table>

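In practice, it is often easier to restate the majority rule as a single test: do **more than half of the estimators predict class 1** for this sample? If so, the output is 1; otherwise it is 0. This form is simpler to implement and is the procedure used in the rest of this notebook. In the example above, with five models, a sample is labeled 1 exactly when at least three models predict 1.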
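The more-than-half rule can be sketched in a few lines of plain Python. The vote matrix simply copies the four rows of the table above; the names `votes` and `majority_vote` are illustrative and not part of the notebook.

```python
# Each row holds the class predictions of models A-E for one sample,
# copied from the table above.
votes = [
    [1, 1, 1, 1, 0],  # sample 1
    [1, 1, 1, 0, 0],  # sample 2
    [1, 1, 0, 0, 0],  # sample 3
    [1, 0, 0, 0, 0],  # sample 4
]


def majority_vote(row):
    """Return 1 if more than half of the models predict class 1, else 0."""
    return int(sum(row) > len(row) / 2)


final = [majority_vote(row) for row in votes]
print(final)  # [1, 1, 0, 0]
```

The result matches the last column of the table. With an even number of models, a tie falls to class 0 under this rule.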

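How much can this kind of voting ensemble actually improve performance? In theory, [Narasimhamurthy, 2003] shows that, when the ensemble is sufficiently diverse, the performance of voting is bounded as follows: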

<center><img src="https://s2.loli.net/2022/05/20/uVrzS79LaBsJgKp.png" alt="image-20220520122231347" style="zoom:33%;" /></center>

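The figure shows that when the individual classifiers are about 80% accurate (a fairly common situation), voting improves accuracy by about 15% on average. This result, however, is derived under the assumption that the classifiers are mutually independent. That assumption does not hold in most real applications, so the conclusion is best read as a theoretical upper limit rather than a typical outcome.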
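To see where a gain of this size can come from, consider the idealized case in which n classifiers err independently and each is correct with probability p. The majority vote is then correct whenever more than half of them are, which is a binomial tail sum. This is a back-of-the-envelope calculation under the independence assumption, not the bound derived in the cited work; the function name is illustrative.

```python
from math import comb


def majority_accuracy(p, n):
    """Probability that more than half of n independent classifiers,
    each correct with probability p, are correct (n is assumed odd)."""
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


acc = majority_accuracy(0.8, 5)
print(round(acc, 5))  # 0.94208
```

With five independent classifiers at 80% accuracy, the vote is right about 94% of the time, a gain of roughly 14 points. Correlated errors shrink this gain.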

In [5]:
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import classification_report

### Hard Voting

In [6]:
estimators = [
    ("tree", tree), 
    ("lr", lr),
    ("lsvc", lsvc), 
    ("sgdc", sgdc),
    ("rf", rf)
]
hard_vc = VotingClassifier(estimators, voting="hard")
hard_vc.fit(X_train, y_train)

[ColumnTransformer] . (3 of 4) Processing OneHotEncoder, total=   0.0s
[ColumnTransformer]  (2 of 4) Processing OrdinalEncoder, total=   0.0s
[ColumnTransformer] ... (4 of 4) Processing StandScaler, total=   0.0s
[ColumnTransformer] .... (1 of 4) Processing vectorizer, total=  18.7s
[Pipeline] ....... (step 1 of 2) Processing transformer, total=  28.0s
[Pipeline]  (step 2 of 2) Processing DecisionTreeClassifier, total= 1.9min
[ColumnTransformer] ... (4 of 4) Processing StandScaler, total=   0.0s
[ColumnTransformer]  (2 of 4) Processing OrdinalEncoder, total=   0.0s
[ColumnTransformer] . (3 of 4) Processing OneHotEncoder, total=   0.0s
[ColumnTransformer] .... (1 of 4) Processing vectorizer, total=  12.9s
[ColumnTransformer] ... (4 of 4) Processing StandScaler, total=   0.0s
[ColumnTransformer]  (2 of 4) Processing OrdinalEncoder, total=   0.0s
[ColumnTransformer] . (3 of 4) Processing OneHotEncoder, total=   0.0s
[ColumnTransformer] .... (1 of 4) Processing vectorizer, total=  14.1s
[C

In [7]:
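# classification_report expects (y_true, y_pred). The report recorded below was produced
# with the arguments reversed, so its precision and recall columns are swapped.
print(classification_report(y_test, hard_vc.predict(X_test), digits=6))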

              precision    recall  f1-score   support

           0   0.998286  0.990525  0.994390      4116
           1   0.985604  0.997385  0.991459      2677

    accuracy                       0.993228      6793
   macro avg   0.991945  0.993955  0.992925      6793
weighted avg   0.993288  0.993228  0.993235      6793
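
Hard voting looks only at each model's predicted label, so a barely confident vote counts as much as a very confident one. The sketch below uses made-up class-1 probabilities from three hypothetical models to show a case where counting labels and averaging probabilities (soft voting) disagree.

```python
# Hypothetical class-1 probabilities from three models for a single sample.
probs = [0.45, 0.45, 0.90]

# Hard voting: threshold each model at 0.5, then take the majority label.
labels = [int(p > 0.5) for p in probs]
hard = int(sum(labels) > len(labels) / 2)

# Soft voting: average the probabilities first, then threshold once.
mean_prob = sum(probs) / len(probs)
soft = int(mean_prob > 0.5)

print(labels, hard)               # [0, 0, 1] 0
print(round(mean_prob, 2), soft)  # 0.6 1
```

Two hesitant votes for class 0 outvote one confident vote for class 1 under hard voting, while the averaged probability tips the other way.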



### Soft Voting

In [8]:
soft_vc = VotingClassifier(estimators, voting="soft")
soft_vc.fit(X_train, y_train)

[ColumnTransformer]  (2 of 4) Processing OrdinalEncoder, total=   0.0s
[ColumnTransformer] ... (4 of 4) Processing StandScaler, total=   0.0s
[ColumnTransformer] . (3 of 4) Processing OneHotEncoder, total=   0.0s
[ColumnTransformer] .... (1 of 4) Processing vectorizer, total=  15.7s
[Pipeline] ....... (step 1 of 2) Processing transformer, total=  19.9s
[Pipeline]  (step 2 of 2) Processing DecisionTreeClassifier, total= 2.1min
[ColumnTransformer]  (2 of 4) Processing OrdinalEncoder, total=   0.0s
[ColumnTransformer] ... (4 of 4) Processing StandScaler, total=   0.0s
[ColumnTransformer] . (3 of 4) Processing OneHotEncoder, total=   0.0s
[ColumnTransformer] .... (1 of 4) Processing vectorizer, total=  20.3s
[ColumnTransformer]  (2 of 4) Processing OrdinalEncoder, total=   0.0s
[ColumnTransformer] ... (4 of 4) Processing StandScaler, total=   0.0s
[ColumnTransformer] . (3 of 4) Processing OneHotEncoder, total=   0.1s
[ColumnTransformer] .... (1 of 4) Processing vectorizer, total=  14.9s
[C

In [9]:
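# Pass y_true first. The report recorded below used the reverse order,
# so its precision and recall columns are swapped.
print(classification_report(y_test, soft_vc.predict(X_test), digits=6))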

              precision    recall  f1-score   support

           0   0.998531  0.989566  0.994028      4121
           1   0.984127  0.997754  0.990894      2672

    accuracy                       0.992787      6793
   macro avg   0.991329  0.993660  0.992461      6793
weighted avg   0.992865  0.992787  0.992795      6793
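
Soft voting averages the class probabilities returned by each estimator's `predict_proba`. A plain `LinearSVC` does not provide that method, so when a base model lacks it, one common option is to wrap the model in `CalibratedClassifierCV`. The sketch below runs on synthetic data from `make_classification`. It does not load the notebook's saved pipelines, and the wrapper is an assumption about how such a model could be adapted, not a description of how the saved `lsvc` was built.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# LinearSVC exposes decision_function but not predict_proba.
has_proba = hasattr(LinearSVC(), "predict_proba")

# Calibration adds predict_proba on top of the SVM's decision scores.
calibrated_svc = CalibratedClassifierCV(LinearSVC(), cv=3)

demo_vc = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svc", calibrated_svc),
    ],
    voting="soft",
)
demo_vc.fit(X_demo, y_demo)

# One row per sample, one column per class; each row sums to 1.
proba = demo_vc.predict_proba(X_demo[:5])
print(has_proba, proba.shape)
```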



## Averaging

### Weight Design Strategies

#### Mostly Uniform Weights, Drawing on Every Model's Strengths

In [10]:
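# Use each base model's accuracy on the training set as its voting weight.
# The order matches `estimators`: tree, lr, lsvc, sgdc, rf.
weight1 = tree.score(X_train, y_train)
weight2 = lr.score(X_train, y_train)
weight3 = lsvc.score(X_train, y_train)
weight4 = sgdc.score(X_train, y_train)
weight5 = rf.score(X_train, y_train)
weights = [weight1, weight2, weight3, weight4, weight5]
weights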

[1.0, 0.9894008538201089, 1.0, 0.9855243142450562, 1.0]
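
`VotingClassifier` uses the weights only in relative terms, so it helps to normalize them before reading anything into them. The sketch below does that for the five training accuracies printed above; the variable names are illustrative.

```python
# Training accuracies from the output above, in the order tree, lr, lsvc, sgdc, rf.
raw = [1.0, 0.9894008538201089, 1.0, 0.9855243142450562, 1.0]

# Divide by the total so the weights sum to 1.
total = sum(raw)
normalized = [w / total for w in raw]
print([round(w, 4) for w in normalized])
```

The normalized weights all sit within about 0.003 of each other, close to 0.2, so this scheme stays near a plain average, as the heading suggests. Because the scores are measured on the training data, the three models that fit it perfectly receive the maximum weight regardless of how well they generalize.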


In [11]:
soft_vc_weight = VotingClassifier(
    estimators=estimators, 
    voting='soft', 
    weights=weights)
soft_vc_weight.fit(X_train, y_train)

[ColumnTransformer]  (2 of 4) Processing OrdinalEncoder, total=   0.0s
[ColumnTransformer] . (3 of 4) Processing OneHotEncoder, total=   0.1s
[ColumnTransformer] ... (4 of 4) Processing StandScaler, total=   0.1s
[ColumnTransformer] .... (1 of 4) Processing vectorizer, total=  16.1s
[Pipeline] ....... (step 1 of 2) Processing transformer, total=  20.6s
[Pipeline]  (step 2 of 2) Processing DecisionTreeClassifier, total= 1.9min
[ColumnTransformer]  (2 of 4) Processing OrdinalEncoder, total=   0.0s
[ColumnTransformer] . (3 of 4) Processing OneHotEncoder, total=   0.1s
[ColumnTransformer] ... (4 of 4) Processing StandScaler, total=   0.1s
[ColumnTransformer] .... (1 of 4) Processing vectorizer, total=  17.0s
[ColumnTransformer]  (2 of 4) Processing OrdinalEncoder, total=   0.0s
[ColumnTransformer] . (3 of 4) Processing OneHotEncoder, total=   0.0s
[ColumnTransformer] ... (4 of 4) Processing StandScaler, total=   0.0s
[ColumnTransformer] .... (1 of 4) Processing vectorizer, total=  12.8s


In [None]:
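# Pass y_true first. The report recorded below used the reverse order,
# so its precision and recall columns are swapped.
print(classification_report(y_test, soft_vc_weight.predict(X_test), digits=6))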

              precision    recall  f1-score   support

           0   0.997074  0.961439  0.978932      4253
           1   0.939394  0.995301  0.966540      2554

    accuracy                       0.974144      6807
   macro avg   0.968234  0.978370  0.972736      6807
weighted avg   0.975432  0.974144  0.974283      6807



#### Designing Core and Auxiliary Estimators

In [None]:
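# Weights follow the order of `estimators`: tree, lr, lsvc, sgdc, rf.
# rf is the core estimator, the tree gets a medium weight, and the rest are auxiliaries.
weight1 = 10
weight2 = 1
weight3 = 1
weight4 = 1
weight5 = 100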

weights = [weight1, weight2, weight3, weight4, weight5]
soft_vc_core_weight = VotingClassifier(
    estimators=estimators, 
    voting='soft', 
    weights=weights
)
soft_vc_core_weight.fit(X_train, y_train)

[ColumnTransformer]  (2 of 5) Processing OrdinalEncoder, total=   0.0s
[ColumnTransformer] ... (5 of 5) Processing StandScaler, total=   0.0s
[ColumnTransformer]  (4 of 5) Processing CatBoostEncoder, total=   0.0s
[ColumnTransformer] .. (3 of 5) Processing CountEncoder, total=   0.0s
[ColumnTransformer] .... (1 of 5) Processing vectorizer, total=  12.9s
[Pipeline] ....... (step 1 of 2) Processing transformer, total=  17.4s
[Pipeline]  (step 2 of 2) Processing DecisionTreeClassifier, total=  18.2s
[ColumnTransformer]  (2 of 5) Processing OrdinalEncoder, total=   0.0s
[ColumnTransformer] ... (5 of 5) Processing StandScaler, total=   0.0s
[ColumnTransformer]  (4 of 5) Processing CatBoostEncoder, total=   0.0s
[ColumnTransformer] .. (3 of 5) Processing CountEncoder, total=   0.1s
[ColumnTransformer] .... (1 of 5) Processing vectorizer, total=  13.0s
[ColumnTransformer]  (2 of 5) Processing OrdinalEncoder, total=   0.0s
[ColumnTransformer]  (4 of 5) Processing CatBoostEncoder, total=   0.0s

In [None]:
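# Pass y_true first. The report recorded below used the reverse order,
# so its precision and recall columns are swapped.
print(classification_report(y_test, soft_vc_core_weight.predict(X_test), digits=6))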

              precision    recall  f1-score   support

           0   0.999756  0.979221  0.989382      4187
           1   0.967849  0.999618  0.983477      2620

    accuracy                       0.987072      6807
   macro avg   0.983803  0.989420  0.986430      6807
weighted avg   0.987475  0.987072  0.987109      6807
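
With weights of 10, 1, 1, 1 and 100, the last estimator (`rf`) holds 100 of the 113 total weight units, so its probability largely decides the outcome. The sketch below uses made-up probabilities to show an extreme case: `rf` leans slightly toward class 1 while the four other models are certain of class 0.

```python
# Weights from the cell above, in the order tree, lr, lsvc, sgdc, rf.
weights = [10, 1, 1, 1, 100]

# Hypothetical class-1 probabilities: only rf (last) favors class 1.
probs = [0.0, 0.0, 0.0, 0.0, 0.6]

# Weighted soft vote: weighted mean of the probabilities, then threshold.
weighted_mean = sum(w * p for w, p in zip(weights, probs)) / sum(weights)
prediction = int(weighted_mean > 0.5)

print(round(weighted_mean, 3), prediction)  # 0.531 1
```

Four unanimous auxiliary models cannot overturn a mildly confident core model here, which is the trade-off of this design.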



## Stacking