# Task05 Model Fusion

## Task05 Goals and Content
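### Goal
>Fuse several models whose hyperparameters have already been tuned.

### Content
>1. Simple weighted fusion
2. Stacking  
(The competition task is a regression problem, so fusion methods for classification models are not discussed.)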

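### Common model fusion methods
Model fusion methods commonly used for regression include:
1. Simple weighted fusion
2. Stacking / blending
3. Boosting / bagging  

The random forest and XGBoost models trained earlier already rely on bagging and boosting respectively, so those two methods are not considered again at the fusion stage.  
Blending fits the second-stage model only on a held-out portion of the data and so makes less efficient use of the dataset; it is set aside for now.

The sections below take the Task04 random forest and the grid-search-tuned XGBoost model as base learners and try simple weighted averaging and stacking to fuse them.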
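
For orientation, the K-fold form of stacking builds the second-stage training features from out-of-fold predictions, so every training row contributes without leaking its own label. The sketch below is illustrative only and rests on assumptions: the data is synthetic, `GradientBoostingRegressor` from scikit-learn stands in for the tuned XGBoost model, and the names `base_learners` and `oof_features` are hypothetical. The notebook itself loads predictions saved in Task04 instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression data stands in for the competition features (assumption)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

base_learners = {
    "rf": RandomForestRegressor(n_estimators=30, random_state=0),
    # Stand-in for the grid-search-tuned XGBoost model (assumption)
    "gbr": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each row's prediction comes from a model fitted on the other four folds
oof_features = np.column_stack([
    cross_val_predict(model, X, y, cv=cv) for model in base_learners.values()
])
print(oof_features.shape)  # (300, 2): one column per base learner
```

The resulting two-column array plays the same role as `stacking_x_train` later in this notebook: one column of predictions per base learner.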

## 0. Base Learners

In [26]:
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
import warnings
warnings.filterwarnings("ignore")

In [47]:
### Load the predictions of the Task04 random forest and the grid-search-tuned XGBoost model
train_pred_RF = pd.read_csv('./pred_result/RF_train_pred_1.csv')
train_pred_xgb_GS = pd.read_csv('./pred_result/xgb_GS_train_pred_1.csv')

test_pred_RF = pd.read_csv('./submits/RF_submit_1.csv')
test_pred_xgb_GS = pd.read_csv('./submits/xgb_GS_submit_1.csv')

In [34]:
### Load the true price values for the test split held out from the training set
train_y_test = pd.read_csv('./pred_result/train_y_test_1.csv')

In [48]:
### Extract the price column from each frame as a Series
train_pred_RF_price = train_pred_RF['price']
train_pred_xgb_GS_price = train_pred_xgb_GS['price']

test_pred_RF_price = test_pred_RF['price']
test_pred_xgb_GS_price = test_pred_xgb_GS['price']

train_y_test_price = train_y_test['price']

Base learner performance:  

||Random forest|XGB_GS|
|---|---|---|
|Training set MAE|647.02|624.17|
|Test set MAE|630.31|613.27|

## I. Simple Weighted Fusion

>Combine the base learners' predictions by taking their simple or weighted average.
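
In symbols, the idea can be written as follows. The notation ($M$ models, weights $w_m$, $n$ samples) is introduced here for illustration and does not come from the notebook:

$$
\hat{y}_i = \sum_{m=1}^{M} w_m \, \hat{y}_i^{(m)}, \qquad w_m \ge 0, \qquad \sum_{m=1}^{M} w_m = 1
$$

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$

The simple average is the special case $w_m = 1/M$; with two base learners that gives $w_1 = w_2 = 0.5$.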

In [42]:
from sklearn.metrics import mean_absolute_error

### 1. Simple average

In [52]:
train_mixed_price_11 = 0.5*train_pred_RF_price+0.5*train_pred_xgb_GS_price
MAE_11 = mean_absolute_error(train_y_test_price,train_mixed_price_11)
print(MAE_11)

609.81310263665


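### 2. Average weighted by model performance

In [53]:
### Derive the weights from each model's MAE (the lower MAE gets the larger weight)
MAE_RF = 647.02
MAE_xgb_GS = 624.17

w_RF = 1-(MAE_RF/(MAE_RF+MAE_xgb_GS))
w_xgb_GS = 1-(MAE_xgb_GS/(MAE_RF+MAE_xgb_GS))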
print('w_RF={}; w_xgb_GS={}'.format(w_RF,w_xgb_GS))

w_RF=0.4910123584987296; w_xgb_GS=0.5089876415012705


In [54]:
train_mixed_price_12 = w_RF*train_pred_RF_price+w_xgb_GS*train_pred_xgb_GS_price
MAE_12 = mean_absolute_error(train_y_test_price,train_mixed_price_12)
print(MAE_12)

609.602690021946


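Because the two base learners have similar MAEs, the MAE-based weights are both close to 0.5, and the weighted average (609.60) is almost identical to the simple average (609.81). Both are below the MAE of either base learner (647.02 and 624.17), so simple weighted fusion does improve the result.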
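
The `1 - MAE_i / sum` weights sum to 1 only when there are exactly two models. For two models they equal weights proportional to `1 / MAE`, and that form extends to any number of models. The sketch below shows the generalization; the helper names `inverse_mae_weights` and `weighted_blend` are hypothetical and not part of the notebook.

```python
import numpy as np

def inverse_mae_weights(maes):
    """Return weights proportional to 1/MAE, normalized to sum to 1."""
    inv = 1.0 / np.asarray(maes, dtype=float)
    return inv / inv.sum()

def weighted_blend(predictions, weights):
    """Blend equal-length prediction arrays with the given weights."""
    stacked = np.column_stack(predictions)  # shape (n_samples, n_models)
    return stacked @ np.asarray(weights, dtype=float)

# MAE values reported above for the random forest and XGB_GS
weights = inverse_mae_weights([647.02, 624.17])
print(weights)

# Toy predictions, only to show the call shape
blended = weighted_blend([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])
print(blended)
```

With the two MAEs above, this reproduces the `w_RF` and `w_xgb_GS` values printed earlier, since `1 - a/(a+b)` equals `b/(a+b)`.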

In [92]:
test_mixed_price_12 = w_RF*test_pred_RF_price+w_xgb_GS*test_pred_xgb_GS_price

In [64]:
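def transform_pred_to_submit(pred_price_array, SaleID_Series):
    # Wrap the predictions in a Series so they can be concatenated column-wise
    pred_price = pd.Series(pred_price_array)
    # Rows are matched on the index, so both inputs need the same row index
    df = pd.concat([SaleID_Series, pred_price],axis=1,ignore_index=True)
    df.columns=['SaleID','price']
    return df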

In [69]:
SaleID_test = test_pred_RF[['SaleID']]

In [93]:
RF_xgb_GS_wavg_submit = transform_pred_to_submit(test_mixed_price_12,SaleID_test)
RF_xgb_GS_wavg_submit.to_csv('./submits/RF_xgb_GS_wavg_submit_1.csv',index=False)
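
One thing to watch in `transform_pred_to_submit`: `pd.concat(..., axis=1)` matches rows by index label, and `ignore_index=True` only resets the column labels. If the predictions ever arrive with a different index than the SaleID frame, the result would contain misaligned rows and NaN values. A more defensive variant, sketched here under the assumption that both inputs are already in the same row order, pairs them by position. The name `build_submission` and the toy IDs are hypothetical.

```python
import numpy as np
import pandas as pd

def build_submission(pred_price, sale_ids):
    """Pair predictions with SaleIDs by position, ignoring any pandas index."""
    pred_values = np.asarray(pred_price, dtype=float).ravel()
    id_values = np.asarray(sale_ids).ravel()
    if len(pred_values) != len(id_values):
        raise ValueError("prediction and SaleID lengths differ")
    return pd.DataFrame({"SaleID": id_values, "price": pred_values})

# Toy data: the prediction Series deliberately carries a non-default index
ids = pd.DataFrame({"SaleID": [150000, 150001, 150002]})
preds = pd.Series([100.0, 200.0, 300.0], index=[7, 8, 9])

submission = build_submission(preds, ids)
print(submission)
```

The length check fails loudly on a mismatch rather than writing a submission file with missing prices.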

## II. Stacking

>Train a simple model on the base learners' predictions.

In [60]:
### Assemble the base learners' predictions into a training set
stacking_x_train = pd.concat([train_pred_RF_price,train_pred_xgb_GS_price],axis=1)
stacking_x_train.columns=['RF_pred_price','xgb_GS_pred_price']
stacking_x_train.head()

||RF_pred_price|xgb_GS_pred_price|
|---|---|---|
|0|13167.932293|14541.303|
|1|992.860321|958.42175|
|2|784.172263|798.15173|
|3|388.659678|396.65408|
|4|6473.975776|6552.6533|


In [61]:
stacking_x_test = pd.concat([test_pred_RF_price,test_pred_xgb_GS_price],axis=1)
stacking_x_test.columns=['RF_pred_price','xgb_GS_pred_price']
stacking_x_test.head()

||RF_pred_price|xgb_GS_pred_price|
|---|---|---|
|0|31927.97175|31928.271|
|1|352.267288|358.585|
|2|6433.327046|6475.3423|
|3|11768.682098|11873.419|
|4|617.96145|608.3052|


In [55]:
from sklearn.linear_model import LinearRegression

In [91]:
### First split stacking_x_train (fit on the first 25,000 rows, evaluate on the rest) for a preliminary check
stacking_LR = LinearRegression()
stacking_LR = stacking_LR.fit(stacking_x_train[:25000],train_y_test_price[:25000])
stacking_pred_test = stacking_LR.predict(stacking_x_train[25000:])
mean_absolute_error(train_y_test_price[25000:],stacking_pred_test)

591.9276455268716

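The MAE on the held-out rows (591.93) is lower than that of the weighted average (609.60), although the two figures are computed on different subsets and are not strictly comparable. Stacking therefore appears to bring some improvement, so the LR model is next trained on all of stacking_x_train: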

In [62]:
stacking_LR = LinearRegression()
stacking_LR = stacking_LR.fit(stacking_x_train,train_y_test_price)
stacking_pred_test = stacking_LR.predict(stacking_x_test)

In [63]:
stacking_pred_test

array([32525.34039437,   365.68066947,  6584.74944814, ...,
        7888.54091517,  8958.33204713,  3551.70949075])

In [72]:
RF_xgb_GS_LR_stacking_submit = transform_pred_to_submit(stacking_pred_test,SaleID_test)
RF_xgb_GS_LR_stacking_submit.to_csv('./submits/RF_xgb_GS_LR_stacking_submit_1.csv',index=False)
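
A single 25,000-row split gives only one estimate of the meta-model's error. As a possible complement, the sketch below scores a `LinearRegression` meta-model with 5-fold cross-validation and prints its coefficients, which can be read as learned blending weights. Everything in it is a toy stand-in: `y_true`, `pred_a`, `pred_b`, and `meta_X` are hypothetical names filled with simulated numbers, not the notebook's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Simulated target and two noisy "base learner" predictions (assumption)
y_true = rng.uniform(500, 30000, size=1000)
pred_a = y_true + rng.normal(0, 900, size=1000)
pred_b = y_true + rng.normal(0, 700, size=1000)
meta_X = np.column_stack([pred_a, pred_b])

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negated MAE so that higher is better; flip the sign
fold_mae = -cross_val_score(
    LinearRegression(), meta_X, y_true,
    cv=cv, scoring="neg_mean_absolute_error",
)
print(fold_mae.mean(), fold_mae.std())

# The fitted coefficients act as blending weights for the two inputs
meta_model = LinearRegression().fit(meta_X, y_true)
print(meta_model.coef_, meta_model.intercept_)
```

In the notebook's setting, `meta_X` and `y_true` would correspond to `stacking_x_train` and `train_y_test_price`.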

## Test Set Results

**Base learner performance:**  

||Random forest|XGB_GS|
|---|---|---|
|Training set MAE|647.02|624.17|
|Test set MAE|630.31|613.27|

**Fusion performance:**  

||MAE-weighted average|LR Stacking|
|---|---|---|
|Training set MAE|609.60|591.93|
|Test set MAE|596.38|604.46|

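These results show that fusion improves on the base learners for both the training set and the test set. On the test set the MAE-weighted average scores best (596.38), ahead of LR stacking (604.46), even though stacking has the lower training set MAE. Model fusion is worth including in later optimization work.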