<a href="https://colab.research.google.com/github/cclljj/LJ-test/blob/master/6_3_%E8%B3%87%E6%96%99%E6%A0%A1%E6%AD%A3_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 6.3 - Data Calibration

**Author**: 羅泉恆 <br>
**Last updated**: 2022.09.29


---

## Related Links


1. Calibration model
    - [DCF-PM2.5 (GitHub)](https://github.com/IISNRL/DCF-PM2.5)

2. Data sources
    - [AirBox status report](https://pm25.lass-net.org/AirBox/)
    - [Open Data API](https://app.swaggerhub.com/apis-docs/I2875/PM25_Open_Data/)
    - [Civil IoT Taiwan - Data Service Platform](https://ci.taiwan.gov.tw/dsp/dataset_air_airbox.aspx)

3. Python packages
  * Data analysis
    - [Pandas](https://pandas.pydata.org/)
    - [Numpy](https://numpy.org/)
  * Data visualization
    - [Plotly](https://plotly.com/python/)
    - [Matplotlib](https://matplotlib.org/)

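## Introduction

> Air quality monitoring stations

Traditional air quality monitoring relies on large, highly specialized, and expensive monitoring stations. Because these professional stations are costly to deploy and maintain (depending on the monitoring purpose, some cost more than US$50,000), they are usually operated by the local environmental protection agency (EPA) and are therefore not deployed in every community. According to the [Taiwan EPA website](https://airtw.epa.gov.tw/CHT/EnvMonitoring/Central/Background_Intro.aspx), as of July 2022 Taiwan had 81 central monitoring stations.

So when we ask "How is the air quality?", can the professional stations answer? Yes. But if we change the question to "**What is the air quality right here, right now?**", they cannot. They cannot be deployed densely around where we live, and because of their high accuracy requirements, most professional stations publish values only once per hour. Thanks to advances in Internet of Things (IoT) technology, low-cost air quality sensors can effectively meet this need.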

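> Low-cost air quality sensors

Compared with traditional large professional stations, low-cost sensors are cheaper to set up, offer more flexible installation conditions, and extend the area that can be covered. Because they are easy to install and maintain, they satisfy the requirements of a large-scale, real-time air quality monitoring system. Because they can upload data every five minutes, users can react immediately to sudden pollution events and further reduce harm.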

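> Accuracy vs. consistency

Of course, we cannot expect a low-cost sensor to match the accuracy of a professional instrument, so improving its accuracy becomes another problem to solve. Before that, we should get to know another concept, "**consistency**": rather than insisting that readings be exactly identical, it is enough that the sensors' readings stay stably within the same range, so that they can be calibrated, compared, and used to reveal trends. The "**AirBox**" (空氣盒子) devices deployed jointly by Academia Sinica and the EPA are micro-sensors with high consistency, so data processing can compensate for the hardware's weaknesses and increase both the credibility of the data and the range of its applications.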

## DCF - Dynamic Calibration Framework


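(Insert architecture figure) [DCF architecture diagram](https://drive.google.com/file/d/18p7tEnAwmgByfnxGDrb0uDFp0ZTCvWrw/view?usp=sharing)

Unlike professional stations, whose installation environment is strictly regulated, micro air quality sensors face environments that are far more variable and uncertain. This is a major challenge when calibrating low-cost sensors.

This calibration framework provides a standard approach to the problem: a dynamic calibration model framework based on station location that can respond quickly to environmental changes. Calibration models are rebuilt every day using different combinations of parameters and training methods, so that potential influencing factors are taken into account and the models fit different conditions.

So far, we have built calibration models that use individual EPA professional stations as the target values and the readings of nearby AirBox devices as the paired input values, and published them on an open online platform. Users can pick the current day's model that is geographically closest to the AirBox they want to calibrate, without training a model or tuning parameters themselves.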
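
The selection step described above can be sketched as a nearest-neighbour lookup by great-circle distance. This is a minimal illustration and not the DCF implementation: the function names, the `MODEL_SITES` table, and all coordinates in it are hypothetical placeholders.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_model_site(sensor_lat, sensor_lon, model_sites):
    """Return the name of the model site closest to the sensor."""
    return min(
        model_sites,
        key=lambda name: haversine_km(sensor_lat, sensor_lon, *model_sites[name]),
    )

# Hypothetical model sites: name -> (latitude, longitude); illustrative values only.
MODEL_SITES = {"site_a": (25.05, 121.51), "site_b": (24.15, 120.65)}

print(nearest_model_site(25.03, 121.52, MODEL_SITES))
```

A real deployment would read the list of model sites and their coordinates from the platform rather than hard-coding them.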


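## Results Overview


A screenshot of the IDW map of Taiwan's main island at 17:40 on 2021/03/15.
The raw low-cost sensors (LCAS) read higher than the EPA professional stations (EPA) in some areas, but after calibration (Calibrated LCAS) the difference is clearly smaller, which increases the credibility of the data.

(Insert three screenshots)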

*   [LCAS](https://drive.google.com/file/d/1qCnIG-lw1ozOYG4Z7uGbFMQeJOQEF38b/view?usp=sharing)
*   [EPA](https://drive.google.com/file/d/1vuhqAhiKKiNlguKfEwi3fpdyVFDUFx1-/view?usp=sharing)
*   [Calibrated LCAS](https://drive.google.com/file/d/1211g1shm6NZNVCiInyI4pdqbqNZdMkLG/view?usp=sharing)
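
The maps referenced above are IDW (inverse distance weighting) interpolations: the value at any map location is a weighted average of the station readings, with weights that shrink as the distance grows. The sketch below illustrates the idea only. The `idw` function, the planar distance, the `power=2.0` exponent, and the sample readings are assumptions made for illustration and are not taken from the platform that produced the screenshots.

```python
import math

def idw(x, y, samples, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    samples: iterable of (sx, sy, value) tuples.
    """
    num = 0.0
    den = 0.0
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value  # exactly on a sample point: use its reading
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Made-up readings at two locations on a line.
samples = [(0.0, 0.0, 10.0), (1.0, 0.0, 20.0)]
print(idw(0.5, 0.0, samples))   # halfway between the two samples
print(idw(0.1, 0.0, samples))   # much closer to the first sample
```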

---

# Implementation

In [None]:
# Disable Warnings
import warnings
warnings.filterwarnings("ignore")

import sys, traceback

In [None]:
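def error_msg( e ):
    # Print where the exception currently being handled was raised
    # (line number and function name), plus its class and message.
    detail = e.args[0] if e.args else ""   # some exceptions carry no arguments
    error_class = e.__class__.__name__
    cl, exc, tb = sys.exc_info()
    lastCallStack = traceback.extract_tb(tb)[-1]
    lineNum = lastCallStack[1]
    funcName = lastCallStack[2]
    print("Unexpected error: line:{} in {}: [{}] {}".format(lineNum, funcName, error_class, detail))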
    

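### 0 - Basic Settings and Model Parameters
We use the Wanhua professional station ([Dynamic Calibration Model Station - Wanhua](https://pm25.lass-net.org/DCF/site.html?site=wanhua&sensor=PMS5003)) and the micro-sensors deployed at that station to build a calibration model for the Wanhua area.
The training window, the selected features, and the model type and parameters are all configurable. This example uses:


*   Wanhua professional station ID:
*   Training window length: 8, 5, 3 days
*   Model types: LinearRegression, RandomForestRegressor, SVR
*   Features: PM2.5, humidity, temperature, timestamp (hour of day)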

In [None]:
from sklearn import linear_model, svm, tree
from sklearn.ensemble import RandomForestRegressor

In [None]:
SITE = "wanhua"
EPA = "EPA-Wanhua"
AIRBOXS = ['08BEAC028A52', '08BEAC028690']
DAYS = [8, 5, 3] 
METHODS = ['LinearRegression', 'RandomForestRegressor', 'SVR']
METHOD_SW = { 'LinearRegression':'LinearR', 'RandomForestRegressor':'RFR', 'SVR':'SVR' }

METHOD_FUNTION = {'LinearRegression':linear_model.LinearRegression(),
                  'RandomForestRegressor': RandomForestRegressor(n_estimators = 300, random_state = 36),
                  'SVR': svm.SVR(C=20)
                }

FIELD_SW = {'s_d0':'PM25', 'pm25':'PM25', 'PM2_5':"PM25", 'pm2_5':"PM25", 's_h0':"HUM", 's_t0':'TEM'}
FEATURES_METHOD = {'PHTR':["PM25", "HR", "TEM", "HUM"], 
                   'PH':['PM25','HR'], 
                   'PT':['PM25','TEM'], 'PR':['PM25', 'HUM'], 
                   'P':['PM25'], 
                   'PHR':["PM25", "HR", "HUM"], 'PTR':["PM25", "TEM", "HUM"], 'PHT':["PM25", "HR", "TEM"]
                    }

In [None]:
print( "\n Site: {site}\n Device id: [EPA]{EPA} | [AIRBOX]{AIRBOXS}\n Day List: {day}\n Method List: {method}\n Feature Set: {feature}".format(
    site=SITE, 
    EPA=EPA, AIRBOXS=AIRBOXS,
    day=DAYS, 
    method=METHODS, 
    feature=list(FEATURES_METHOD.keys()) 
    ) 
)


 Site: wanhua
 Device id: [EPA]EPA-Wanhua | [AIRBOX]['08BEAC028A52', '08BEAC028690']
 Day List: [8, 5, 3]
 Method List: ['LinearRegression', 'RandomForestRegressor', 'SVR']
 Feature Set: ['PHTR', 'PH', 'PT', 'PR', 'P', 'PHR', 'PTR', 'PHT']


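### 1 - Load Training Data
**Best-Yesterday Method**

To obtain (predict) the calibration model for day N, we use the data from day N - 1 as test data and the data from day N - 2 back to day N - (1+X) as training data, where X is the length of the training window in days. For example, with a 7-day window the training data spans day N - 2 back to day N - 8.

To demonstrate the full workflow, we set N to the current date (today). With the settings in Part 0, the earliest day we need is day N - 9, so we load ten days of historical data.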
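
The date arithmetic of the Best-Yesterday method can be sketched as follows, for a training window of X days that ends on day N - 2. The helper name `best_yesterday_windows` is hypothetical and is not part of the notebook's code.

```python
from datetime import date, timedelta

def best_yesterday_windows(target_day, window_days):
    """Return (train_start, train_end, test_day) for the model of `target_day`."""
    test_day = target_day - timedelta(days=1)                  # day N - 1
    train_end = target_day - timedelta(days=2)                 # day N - 2
    train_start = train_end - timedelta(days=window_days - 1)  # day N - (1 + X)
    return train_start, train_end, test_day

n = date(2022, 9, 30)
for x in (8, 5, 3):
    start, end, test = best_yesterday_windows(n, x)
    print(f"X={x}: train {start} .. {end}, test {test}")
```

With X = 8 the training window starts on day N - 9, which is why ten days of history are enough for every configuration used here.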

**Open Data API**

By date: pm25.lass-net.org/data/history-date.php?device_id=\<ID\>&date=\<YYYY-MM-DD\>&format=CSV

For example, the data for EPA-Wanhua on 2022-09-21:<br>
[https://pm25.lass-net.org/data/history-date.php?device_id=EPA-Wanhua&date=2022-09-21&format=CSV](https://pm25.lass-net.org/data/history-date.php?device_id=EPA-Wanhua&date=2022-09-21&format=CSV)
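
The request URL can also be assembled with the standard library instead of string concatenation, so that the query parameters are always encoded correctly. The helper name `history_url` is an assumption for illustration; the endpoint and parameter names are the ones shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://pm25.lass-net.org/data/history-date.php"

def history_url(device_id, day):
    """Build the history-by-date URL for one device and one YYYY-MM-DD date."""
    query = urlencode({"device_id": device_id, "date": day, "format": "CSV"})
    return f"{BASE_URL}?{query}"

print(history_url("EPA-Wanhua", "2022-09-21"))
```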

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

In [None]:
TODAY = datetime.today()
ENDDATE = (TODAY - timedelta(days=2)).date()
TESTDATE = (TODAY - timedelta(days=1)).date()
print("TODAY: " + TODAY.strftime("%Y-%m-%d"))

TODAY: 2022-09-30


In [None]:
def getDF(id):
  temp_list = []
  for i in range(1,11):
    date = (TODAY - timedelta(days=i)).strftime("%Y-%m-%d")

    URL = "https://pm25.lass-net.org/data/history-date.php?device_id=" + id + "&date=" + date + "&format=CSV"
    temp_DF = pd.read_csv( URL, index_col=0 )
    temp_list.append( temp_DF )

    print("ID: {id}, Date: {date}, Shape: {shape}".format(id=id, date=date, shape=temp_DF.shape))

  All_DF = pd.concat( temp_list )
  return All_DF

In [None]:
# AirBox
AirBox1_DF = getDF(AIRBOXS[0])
AirBox1_DF.head()

ID: 08BEAC028A52, Date: 2022-09-29, Shape: (208, 19)
ID: 08BEAC028A52, Date: 2022-09-28, Shape: (222, 19)
ID: 08BEAC028A52, Date: 2022-09-27, Shape: (225, 19)
ID: 08BEAC028A52, Date: 2022-09-26, Shape: (230, 19)
ID: 08BEAC028A52, Date: 2022-09-25, Shape: (231, 19)
ID: 08BEAC028A52, Date: 2022-09-24, Shape: (232, 19)
ID: 08BEAC028A52, Date: 2022-09-23, Shape: (223, 19)
ID: 08BEAC028A52, Date: 2022-09-22, Shape: (220, 19)
ID: 08BEAC028A52, Date: 2022-09-21, Shape: (222, 19)
ID: 08BEAC028A52, Date: 2022-09-20, Shape: (215, 19)


| index | time | SiteAddr | SiteName | app | area | date | device_id | gps_alt | gps_fix | gps_lat | gps_lon | gps_num | name | s_d0 | s_d1 | s_d2 | s_h0 | s_t0 | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00:03:19 | Wanhua | 萬華-1(2020) | AirBox | taipei | 2022-09-29 | 08BEAC028A52 | 2.0 | 1.0 | 25.0471 | 121.508 | 9.0 | 手動站：萬華站-1（A52) | 17.0 | 18.0 | 11.0 | 75.0 | 30.61 | 2022-09-29T00:03:19Z |
| 1 | 00:09:24 | Wanhua | 萬華-1(2020) | AirBox | taipei | 2022-09-29 | 08BEAC028A52 | 2.0 | 1.0 | 25.0471 | 121.508 | 9.0 | 手動站：萬華站-1（A52) | 10.0 | 11.0 | 6.0 | 73.0 | 31.0 | 2022-09-29T00:09:24Z |
| 2 | 00:15:32 | Wanhua | 萬華-1(2020) | AirBox | taipei | 2022-09-29 | 08BEAC028A52 | 2.0 | 1.0 | 25.0471 | 121.508 | 9.0 | 手動站：萬華站-1（A52) | 8.0 | 9.0 | 4.0 | 71.0 | 31.61 | 2022-09-29T00:15:32Z |
| 3 | 00:21:38 | Wanhua | 萬華-1(2020) | AirBox | taipei | 2022-09-29 | 08BEAC028A52 | 2.0 | 1.0 | 25.0471 | 121.508 | 9.0 | 手動站：萬華站-1（A52) | 8.0 | 8.0 | 4.0 | 70.0 | 31.86 | 2022-09-29T00:21:38Z |
| 4 | 00:27:43 | Wanhua | 萬華-1(2020) | AirBox | taipei | 2022-09-29 | 08BEAC028A52 | 2.0 | 1.0 | 25.0471 | 121.508 | 9.0 | 手動站：萬華站-1（A52) | 8.0 | 8.0 | 4.0 | 70.0 | 32.0 | 2022-09-29T00:27:43Z |


In [None]:
# AirBox
AirBox2_DF = getDF(AIRBOXS[1])
AirBox2_DF.head()

ID: 08BEAC028690, Date: 2022-09-29, Shape: (200, 19)
ID: 08BEAC028690, Date: 2022-09-28, Shape: (221, 19)
ID: 08BEAC028690, Date: 2022-09-27, Shape: (227, 19)
ID: 08BEAC028690, Date: 2022-09-26, Shape: (230, 19)
ID: 08BEAC028690, Date: 2022-09-25, Shape: (226, 19)
ID: 08BEAC028690, Date: 2022-09-24, Shape: (228, 19)
ID: 08BEAC028690, Date: 2022-09-23, Shape: (217, 19)
ID: 08BEAC028690, Date: 2022-09-22, Shape: (222, 19)
ID: 08BEAC028690, Date: 2022-09-21, Shape: (219, 19)
ID: 08BEAC028690, Date: 2022-09-20, Shape: (211, 19)


| index | time | SiteAddr | SiteName | app | area | date | device_id | gps_alt | gps_fix | gps_lat | gps_lon | gps_num | name | s_d0 | s_d1 | s_d2 | s_h0 | s_t0 | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00:00:38 | Wanhua | 萬華-2(2020) | AirBox | taipei | 2022-09-29 | 08BEAC028690 | 2.0 | 1.0 | 25.047 | 121.5081 | 9.0 | 手動站：萬華站-2（690) | 14.0 | 16.0 | 10.0 | 2.0 | 30.25 | 2022-09-29T00:00:38Z |
| 1 | 00:06:44 | Wanhua | 萬華-2(2020) | AirBox | taipei | 2022-09-29 | 08BEAC028690 | 2.0 | 1.0 | 25.047 | 121.5081 | 9.0 | 手動站：萬華站-2（690) | 10.0 | 11.0 | 7.0 | 12.0 | 30.86 | 2022-09-29T00:06:44Z |
| 2 | 00:12:51 | Wanhua | 萬華-2(2020) | AirBox | taipei | 2022-09-29 | 08BEAC028690 | 2.0 | 1.0 | 25.047 | 121.5081 | 9.0 | 手動站：萬華站-2（690) | 8.0 | 9.0 | 5.0 | 18.0 | 31.36 | 2022-09-29T00:12:51Z |
| 3 | 00:18:57 | Wanhua | 萬華-2(2020) | AirBox | taipei | 2022-09-29 | 08BEAC028690 | 2.0 | 1.0 | 25.047 | 121.5081 | 9.0 | 手動站：萬華站-2（690) | 8.0 | 9.0 | 5.0 | 23.0 | 31.61 | 2022-09-29T00:18:57Z |
| 4 | 00:25:03 | Wanhua | 萬華-2(2020) | AirBox | taipei | 2022-09-29 | 08BEAC028690 | 2.0 | 1.0 | 25.047 | 121.5081 | 9.0 | 手動站：萬華站-2（690) | 7.0 | 8.0 | 4.0 | 25.0 | 32.0 | 2022-09-29T00:25:03Z |


In [None]:
# EPA
EPA_DF = getDF(EPA)
EPA_DF.head()

ID: EPA-Wanhua, Date: 2022-09-29, Shape: (21, 36)
ID: EPA-Wanhua, Date: 2022-09-28, Shape: (20, 36)
ID: EPA-Wanhua, Date: 2022-09-27, Shape: (22, 36)
ID: EPA-Wanhua, Date: 2022-09-26, Shape: (21, 36)
ID: EPA-Wanhua, Date: 2022-09-25, Shape: (23, 36)
ID: EPA-Wanhua, Date: 2022-09-24, Shape: (22, 37)
ID: EPA-Wanhua, Date: 2022-09-23, Shape: (22, 37)
ID: EPA-Wanhua, Date: 2022-09-22, Shape: (22, 36)
ID: EPA-Wanhua, Date: 2022-09-21, Shape: (22, 36)
ID: EPA-Wanhua, Date: 2022-09-20, Shape: (18, 37)


| index | time | County | SiteName | SiteType | app | aqi | co | co_8hr | county | datacreationdate | ... | siteid | sitename | sitetype | so2 | so2_avg | status | ver_format | winddirec | windspeed | pollutant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00:00:00 | 臺北市 | 萬華 | 一般站 | EPA_COPY | 30.0 | 0.3 | 0.1 | 臺北市 | 2022-09-29 08:00 | ... | 13 | 萬華 | 一般站 | 2.9 | 1.0 | 良好 | 3 | 100.0 | 1.2 |  |
| 1 | 01:00:00 | 臺北市 | 萬華 | 一般站 | EPA_COPY | 27.0 | 0.37 | 0.1 | 臺北市 | 2022-09-29 09:00 | ... | 13 | 萬華 | 一般站 | 3.0 | 1.0 | 良好 | 3 | 112.0 | 1.5 |  |
| 2 | 02:00:00 | 臺北市 | 萬華 | 一般站 | EPA_COPY | 27.0 | 0.31 | 0.2 | 臺北市 | 2022-09-29 10:00 | ... | 13 | 萬華 | 一般站 | 2.9 | 1.0 | 良好 | 3 | 72.0 | 1.3 |  |
| 3 | 03:00:00 | 臺北市 | 萬華 | 一般站 | EPA_COPY | 26.0 |  | 0.2 | 臺北市 | 2022-09-29 11:00 | ... | 13 | 萬華 | 一般站 |  | 2.0 | 良好 | 3 | 97.0 | 1.4 |  |
| 4 | 04:00:00 | 臺北市 | 萬華 | 一般站 | EPA_COPY |  |  |  | 臺北市 | 2022-09-29 12:00 | ... | 13 | 萬華 | 一般站 |  |  |  | 3 |  |  |  |


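### 2 - Data Preprocessing


*   Select the required columns (only temperature, humidity, PM2.5, and time are needed)
*   Compute hourly averages
*   Rename the columns (to simplify the training steps that follow)
*   Merge the AirBox and EPA data
*   Drop missing values
*   Derive the hour-of-day feature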
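
The steps listed above can be seen end to end on a tiny synthetic example. This is a minimal sketch: the readings are made up, the frames `sensor` and `reference` are hypothetical stand-ins for the downloaded data, and `"60min"` is used as the resampling rule for hourly bins.

```python
import pandas as pd

# Synthetic stand-ins for the raw frames; the values are made up for illustration.
sensor = pd.DataFrame({
    "timestamp": ["2022-09-29T00:03:19Z", "2022-09-29T00:33:19Z", "2022-09-29T01:03:19Z"],
    "s_d0": [10.0, 12.0, 8.0],
    "s_t0": [30.0, 31.0, 29.0],
    "s_h0": [70.0, 72.0, 75.0],
})
reference = pd.DataFrame({
    "timestamp": pd.to_datetime(["2022-09-29T00:00:00Z", "2022-09-29T01:00:00Z"], utc=True),
    "EPA_PM25": [9.0, None],  # the second hour has no reference value
})

# 1. Hourly mean of the sensor readings.
sensor["timestamp"] = pd.to_datetime(sensor["timestamp"], utc=True)
hourly = sensor.set_index("timestamp").resample("60min").mean().reset_index()

# 2. Rename the columns, 3. merge on the hour, 4. drop rows with missing values.
hourly = hourly.rename(columns={"s_d0": "PM25", "s_t0": "TEM", "s_h0": "HUM"})
merged = hourly.merge(reference, on="timestamp", how="inner").dropna(how="any")

# 5. Hour-of-day feature, computed for the whole column at once.
merged["HR"] = merged["timestamp"].dt.hour
print(merged)
```

Only the 00:00 hour survives here, because the 01:00 row has no reference value and is removed by `dropna`.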


In [None]:
# Select the required columns

Col_need = ["timestamp", "s_d0", "s_t0", "s_h0"]
AirBox1_DF_need = AirBox1_DF[Col_need]
print(AirBox1_DF_need.head())
AirBox2_DF_need = AirBox2_DF[Col_need]
print(AirBox2_DF_need.head())

Col_need = ["time", "date", "pm2_5"]
EPA_DF_need = EPA_DF[Col_need]

print(EPA_DF_need.head())

                  timestamp  s_d0   s_t0  s_h0
index                                         
0      2022-09-29T00:03:19Z  17.0  30.61  75.0
1      2022-09-29T00:09:24Z  10.0  31.00  73.0
2      2022-09-29T00:15:32Z   8.0  31.61  71.0
3      2022-09-29T00:21:38Z   8.0  31.86  70.0
4      2022-09-29T00:27:43Z   8.0  32.00  70.0
                  timestamp  s_d0   s_t0  s_h0
index                                         
0      2022-09-29T00:00:38Z  14.0  30.25   2.0
1      2022-09-29T00:06:44Z  10.0  30.86  12.0
2      2022-09-29T00:12:51Z   8.0  31.36  18.0
3      2022-09-29T00:18:57Z   8.0  31.61  23.0
4      2022-09-29T00:25:03Z   7.0  32.00  25.0
           time        date  pm2_5
index                             
0      00:00:00  2022-09-29    5.0
1      01:00:00  2022-09-29    7.0
2      02:00:00  2022-09-29    8.0
3      03:00:00  2022-09-29    NaN
4      04:00:00  2022-09-29    NaN


In [None]:
# EPA
# date + time = timestamp

# assign() returns a new DataFrame instead of writing into a slice of EPA_DF
EPA_DF_need = EPA_DF_need.assign( timestamp=pd.to_datetime( EPA_DF_need["date"] + "T" + EPA_DF_need["time"], utc=True ) )
EPA_DF_need.head()

| index | time | date | pm2_5 | timestamp |
|---|---|---|---|---|
| 0 | 00:00:00 | 2022-09-29 | 5.0 | 2022-09-29 00:00:00+00:00 |
| 1 | 01:00:00 | 2022-09-29 | 7.0 | 2022-09-29 01:00:00+00:00 |
| 2 | 02:00:00 | 2022-09-29 | 8.0 | 2022-09-29 02:00:00+00:00 |
| 3 | 03:00:00 | 2022-09-29 |  | 2022-09-29 03:00:00+00:00 |
| 4 | 04:00:00 | 2022-09-29 |  | 2022-09-29 04:00:00+00:00 |


In [None]:
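# Hourly average

def getHourly(DF):
  DF = DF.set_index( pd.DatetimeIndex(DF["timestamp"]) )
  # numeric_only=True skips non-numeric columns (e.g. the original timestamp column),
  # which newer pandas versions refuse to average
  DF_Hourly = DF.resample('H').mean(numeric_only=True)
  DF_Hourly.reset_index(inplace=True)

  return DF_Hourly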

In [None]:
AirBox1_DF_need_Hourly = getHourly( AirBox1_DF_need)
AirBox2_DF_need_Hourly = getHourly( AirBox2_DF_need)


EPA_DF_need_Hourly = getHourly( EPA_DF_need) # Optional: the raw data is already hourly

In [None]:
AirBox2_DF_need_Hourly.head()

|  | timestamp | s_d0 | s_t0 | s_h0 |
|---|---|---|---|---|
| 0 | 2022-09-20 00:00:00+00:00 | 7.5 | 28.6625 | -49.75 |
| 1 | 2022-09-20 01:00:00+00:00 | 7.222222 | 28.204444 | -56.222222 |
| 2 | 2022-09-20 02:00:00+00:00 | 7.25 | 28.05875 | -59.625 |
| 3 | 2022-09-20 03:00:00+00:00 | 6.666667 | 28.157778 | -61.555556 |
| 4 | 2022-09-20 04:00:00+00:00 | 5.625 | 28.08625 | -63.75 |


In [None]:
EPA_DF_need_Hourly.head()

|  | timestamp | pm2_5 |
|---|---|---|
| 0 | 2022-09-20 03:00:00+00:00 | 14.0 |
| 1 | 2022-09-20 04:00:00+00:00 | 15.0 |
| 2 | 2022-09-20 05:00:00+00:00 | 12.0 |
| 3 | 2022-09-20 06:00:00+00:00 | 12.0 |
| 4 | 2022-09-20 07:00:00+00:00 | 13.0 |


In [None]:
# Rename columns

# s_d0 = PM25, s_h0 = relative humidity (HUM), s_t0 = temperature (TEM)
Col_rename = {"s_d0":"PM25", "s_h0":"HUM", "s_t0":"TEM"}

AirBox1_DF_need_Hourly.rename(columns=Col_rename, inplace=True)
AirBox2_DF_need_Hourly.rename(columns=Col_rename, inplace=True)

AirBox2_DF_need_Hourly.head()

|  | timestamp | PM25 | TEM | HUM |
|---|---|---|---|---|
| 0 | 2022-09-20 00:00:00+00:00 | 7.5 | 28.6625 | -49.75 |
| 1 | 2022-09-20 01:00:00+00:00 | 7.222222 | 28.204444 | -56.222222 |
| 2 | 2022-09-20 02:00:00+00:00 | 7.25 | 28.05875 | -59.625 |
| 3 | 2022-09-20 03:00:00+00:00 | 6.666667 | 28.157778 | -61.555556 |
| 4 | 2022-09-20 04:00:00+00:00 | 5.625 | 28.08625 | -63.75 |


In [None]:
# pm2_5 = EPA_PM25
Col_rename = {"pm2_5":"EPA_PM25"}

EPA_DF_need_Hourly.rename(columns=Col_rename, inplace=True)
EPA_DF_need_Hourly.head()

|  | timestamp | EPA_PM25 |
|---|---|---|
| 0 | 2022-09-20 03:00:00+00:00 | 14.0 |
| 1 | 2022-09-20 04:00:00+00:00 | 15.0 |
| 2 | 2022-09-20 05:00:00+00:00 | 12.0 |
| 3 | 2022-09-20 06:00:00+00:00 | 12.0 |
| 4 | 2022-09-20 07:00:00+00:00 | 13.0 |


In [None]:
# Merge

# Stack the data of the two AirBox devices
AirBoxs_DF = pd.concat([AirBox1_DF_need_Hourly, AirBox2_DF_need_Hourly]).reset_index(drop=True)

# Merge the EPA and AirBox data on the timestamp column
# inner: keep only the timestamps present in both
All_DF = pd.merge( AirBoxs_DF, EPA_DF_need_Hourly, on=["timestamp"], how="inner" )

In [None]:
All_DF

|  | timestamp | PM25 | TEM | HUM | EPA_PM25 |
|---|---|---|---|---|---|
| 0 | 2022-09-20 03:00:00+00:00 | 6.800000 | 28.667000 | 60.100000 | 14.0 |
| 1 | 2022-09-20 03:00:00+00:00 | 6.666667 | 28.157778 | -61.555556 | 14.0 |
| 2 | 2022-09-20 04:00:00+00:00 | 5.888889 | 29.006667 | 58.555556 | 15.0 |
| 3 | 2022-09-20 04:00:00+00:00 | 5.625000 | 28.086250 | -63.750000 | 15.0 |
| 4 | 2022-09-20 05:00:00+00:00 | 5.777778 | 29.687778 | 56.666667 | 12.0 |
| ... | ... | ... | ... | ... | ... |
| 469 | 2022-09-29 21:00:00+00:00 | 8.000000 | 26.896000 | -68.000000 | 10.0 |
| 470 | 2022-09-29 22:00:00+00:00 | 9.000000 | 26.713750 | 80.625000 | 10.0 |
| 471 | 2022-09-29 22:00:00+00:00 | 8.375000 | 26.918750 | -67.375000 | 10.0 |
| 472 | 2022-09-29 23:00:00+00:00 | 9.125000 | 28.265000 | 75.750000 | 7.0 |


In [None]:
# Drop rows with missing values

All_DF.dropna(how="any", inplace=True)
All_DF.reset_index(inplace=True, drop=True)
All_DF

|  | timestamp | PM25 | TEM | HUM | EPA_PM25 |
|---|---|---|---|---|---|
| 0 | 2022-09-20 03:00:00+00:00 | 6.800000 | 28.667000 | 60.100000 | 14.0 |
| 1 | 2022-09-20 03:00:00+00:00 | 6.666667 | 28.157778 | -61.555556 | 14.0 |
| 2 | 2022-09-20 04:00:00+00:00 | 5.888889 | 29.006667 | 58.555556 | 15.0 |
| 3 | 2022-09-20 04:00:00+00:00 | 5.625000 | 28.086250 | -63.750000 | 15.0 |
| 4 | 2022-09-20 05:00:00+00:00 | 5.777778 | 29.687778 | 56.666667 | 12.0 |
| ... | ... | ... | ... | ... | ... |
| 407 | 2022-09-29 21:00:00+00:00 | 8.000000 | 26.896000 | -68.000000 | 10.0 |
| 408 | 2022-09-29 22:00:00+00:00 | 9.000000 | 26.713750 | 80.625000 | 10.0 |
| 409 | 2022-09-29 22:00:00+00:00 | 8.375000 | 26.918750 | -67.375000 | 10.0 |
| 410 | 2022-09-29 23:00:00+00:00 | 9.125000 | 28.265000 | 75.750000 | 7.0 |


In [None]:
# HR column
def return_HR(row):
    row['HR'] = int(row[ "timestamp" ].hour)
    return row

In [None]:
All_DF = All_DF.apply(return_HR , axis=1)

In [None]:
All_DF.head()

|  | timestamp | PM25 | TEM | HUM | EPA_PM25 | HR |
|---|---|---|---|---|---|---|
| 0 | 2022-09-20 03:00:00+00:00 | 6.8 | 28.667 | 60.1 | 14.0 | 3 |
| 1 | 2022-09-20 03:00:00+00:00 | 6.666667 | 28.157778 | -61.555556 | 14.0 | 3 |
| 2 | 2022-09-20 04:00:00+00:00 | 5.888889 | 29.006667 | 58.555556 | 15.0 | 4 |
| 3 | 2022-09-20 04:00:00+00:00 | 5.625 | 28.08625 | -63.75 | 15.0 | 4 |
| 4 | 2022-09-20 05:00:00+00:00 | 5.777778 | 29.687778 | 56.666667 | 12.0 | 5 |


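### 3 - Train and Validate Candidate Models

Combining the feature sets, training window lengths, and model types produces many candidate calibration models. With the settings above, there are 3 (training windows) * 8 (feature sets) * 3 (model types) = 72 candidate models.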
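
The 72 combinations can be enumerated with `itertools.product`. The sketch below only builds the candidate names, following the `site_day_method_feature` naming pattern used by `BuildModel` in this notebook; the lower-case variable names are local to the example.

```python
from itertools import product

days = [8, 5, 3]
methods = ["LinearRegression", "RandomForestRegressor", "SVR"]
features = ["PHTR", "PH", "PT", "PR", "P", "PHR", "PTR", "PHT"]
short = {"LinearRegression": "LinearR", "RandomForestRegressor": "RFR", "SVR": "SVR"}

# One name per (window, model type, feature set) combination.
candidates = [f"wanhua_{d}_{short[m]}_{f}" for d, m, f in product(days, methods, features)]

print(len(candidates))
print(candidates[0], candidates[-1])
```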

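The data from day N - 1 serves as the test data for computing MSE and MAE.


*   MAE, Mean Absolute Error<br>
    The mean of the absolute differences between the target values and the predicted values. It reflects the actual size of the prediction error directly; the smaller the value, the better the model.
*   MSE, Mean Squared Error<br>
    The mean of the squared differences between the predicted values and the observed values. The smaller the MSE, the more accurately the model describes the data.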
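
Both metrics are simple averages and can be computed by hand, which helps when checking the values reported by `sklearn.metrics.mean_absolute_error` and `sklearn.metrics.mean_squared_error`. The functions `mae` and `mse` below are illustrative helpers, and the numbers are made up.

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 12.0, 14.0]
y_pred = [11.0, 12.0, 11.0]

print(mae(y_true, y_pred))  # (1 + 0 + 3) / 3
print(mse(y_true, y_pred))  # (1 + 0 + 9) / 3
```

Because MSE squares each error, the single error of 3 dominates it far more than it dominates MAE.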



In [None]:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.feature_selection import f_regression
from sklearn import metrics as sk_metrics

import scipy.stats as stats

In [None]:
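def SlideDay( Hourly_DF, day, enddate ):
    # Keep the rows between two midnight timestamps, both inclusive:
    # 00:00 UTC on the day (day-1) days before `enddate`, and 00:00 UTC on `enddate`.
    startdate = enddate - timedelta( days = (day-1) )
    time_mask = Hourly_DF["timestamp"].between( pd.Timestamp(startdate, tz='utc'), pd.Timestamp(enddate, tz='utc') )
    return Hourly_DF[ time_mask ]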
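
`SlideDay` builds both bounds from dates, which pandas converts to midnight timestamps, so `between` keeps only the 00:00 row of `enddate`. If the window should cover every hour of `enddate`, one option is a half-open interval that stops at the following midnight. The function `slide_day_full` below is a hypothetical alternative, not the notebook's method, and it changes which rows are selected.

```python
from datetime import date, timedelta

import pandas as pd

def slide_day_full(hourly_df, day, enddate):
    """Rows within the `day` calendar days that end on `enddate` (inclusive)."""
    start_ts = pd.Timestamp(enddate - timedelta(days=day - 1), tz="utc")
    stop_ts = pd.Timestamp(enddate + timedelta(days=1), tz="utc")  # exclusive upper bound
    mask = (hourly_df["timestamp"] >= start_ts) & (hourly_df["timestamp"] < stop_ts)
    return hourly_df[mask]

# Three days of synthetic hourly timestamps: 2022-09-27 00:00 to 2022-09-29 23:00 UTC.
demo = pd.DataFrame({
    "timestamp": pd.date_range("2022-09-27", periods=72, freq="60min", tz="utc"),
})

print(len(slide_day_full(demo, 1, date(2022, 9, 28))))
print(len(slide_day_full(demo, 2, date(2022, 9, 28))))
```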

In [None]:
from sklearn.base import clone

def BuildModel( site, enddate, FEATURE, day, method, Training_DF ):

    X_train = Training_DF[ FEATURES_METHOD[ FEATURE ] ]
    Y_train = Training_DF[ "EPA_PM25" ]
    model_result = {}
    model_result["site"], model_result["day"], model_result["feature"], model_result["method"] = site, day, FEATURE, method
    model_result["datapoints"], model_result["modelname"] = X_train.shape[0], (site + "_" + str(day) + "_" + METHOD_SW[method] + "_" + FEATURE)
    model_result["date"] = enddate.strftime( "%Y-%m-%d" )
    # add timestamp field
    Now_Time = datetime.utcnow().strftime( "%Y-%m-%d %H:%M:%S" )
    model_result['create_timestamp_utc'] = Now_Time

    ### training model ###
    print( "[BuildR]-\"{method}\" with {day}/{feature}".format(method=method, day=day, feature=FEATURE) )
    
    # fit a fresh copy of the configured estimator, so that models returned
    # by different calls never share (and overwrite) the same fitted object
    lm = clone( METHOD_FUNTION[ method ] )
    lm.fit( X_train, Y_train )

    # get score on the training data
    Y_pred = lm.predict( X_train )
    model_result['Train_MSE'] = sk_metrics.mean_squared_error( Y_train, Y_pred )
    model_result['Train_MAE'] = sk_metrics.mean_absolute_error( Y_train, Y_pred )

    return model_result, lm

In [None]:
def TestModel( site, feature, modelname, Testing_DF, lm ):

    X_test = Testing_DF[ FEATURES_METHOD[ feature ] ]
    Y_test = Testing_DF[ "EPA_PM25" ]

    # add timestamp field
    Now_Time = datetime.utcnow().strftime( "%Y-%m-%d %H:%M:%S" )

    ### testing model ###
    # predict
    Y_pred = lm.predict( X_test )

    # get score
    test_result = {}
    test_result["test_MSE"] = round( sk_metrics.mean_squared_error( Y_test, Y_pred ), 3)
    test_result["test_MAE"] = round( sk_metrics.mean_absolute_error( Y_test, Y_pred ), 3)
    # print( "[Test]-Result: MAE={MAE}, MSE={MSE}".format( MAE=test_result["test_MAE"], MSE=test_result["test_MSE"] ) )

    return test_result

In [None]:
AllResult_list = []

for day in DAYS:
  for method in METHODS:
    for feature in FEATURES_METHOD:
      Training_DF = SlideDay(All_DF, day, ENDDATE)[ FEATURES_METHOD[feature] + ["EPA_PM25"] ]
      result, lm = BuildModel( SITE, TESTDATE, feature, day, method, Training_DF )
      test_result = TestModel(SITE, feature, result["modelname"], SlideDay(All_DF, 1, TESTDATE), lm)
      R_DF = pd.DataFrame.from_dict( [{ **result, **test_result }] )
      AllResult_list.append( R_DF )

AllResult_DF = pd.concat(AllResult_list)

[BuildR]-"LinearRegression" with 8/PHTR
[BuildR]-"LinearRegression" with 8/PH
[BuildR]-"LinearRegression" with 8/PT
[BuildR]-"LinearRegression" with 8/PR
[BuildR]-"LinearRegression" with 8/P
[BuildR]-"LinearRegression" with 8/PHR
[BuildR]-"LinearRegression" with 8/PTR
[BuildR]-"LinearRegression" with 8/PHT
[BuildR]-"RandomForestRegressor" with 8/PHTR
[BuildR]-"RandomForestRegressor" with 8/PH
[BuildR]-"RandomForestRegressor" with 8/PT
[BuildR]-"RandomForestRegressor" with 8/PR
[BuildR]-"RandomForestRegressor" with 8/P
[BuildR]-"RandomForestRegressor" with 8/PHR
[BuildR]-"RandomForestRegressor" with 8/PTR
[BuildR]-"RandomForestRegressor" with 8/PHT
[BuildR]-"SVR" with 8/PHTR
[BuildR]-"SVR" with 8/PH
[BuildR]-"SVR" with 8/PT
[BuildR]-"SVR" with 8/PR
[BuildR]-"SVR" with 8/P
[BuildR]-"SVR" with 8/PHR
[BuildR]-"SVR" with 8/PTR
[BuildR]-"SVR" with 8/PHT
[BuildR]-"LinearRegression" with 5/PHTR
[BuildR]-"LinearRegression" with 5/PH
[BuildR]-"LinearRegression" with 5/PT
[BuildR]-"LinearRegressi

In [None]:
AllResult_DF.head()

|  | site | day | feature | method | datapoints | modelname | date | create_timestamp_utc | Train_MSE | Train_MAE | test_MSE | test_MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wanhua | 8 | PHTR | LinearRegression | 306 | wanhua_8_LinearR_PHTR | 2022-09-29 | 2022-09-30 11:19:24 | 7.679059 | 2.085894 | 7.937 | 2.801 |
| 0 | wanhua | 8 | PH | LinearRegression | 306 | wanhua_8_LinearR_PH | 2022-09-29 | 2022-09-30 11:19:24 | 9.165171 | 2.25935 | 13.661 | 3.693 |
| 0 | wanhua | 8 | PT | LinearRegression | 306 | wanhua_8_LinearR_PT | 2022-09-29 | 2022-09-30 11:19:24 | 9.506875 | 2.305106 | 8.49 | 2.91 |
| 0 | wanhua | 8 | PR | LinearRegression | 306 | wanhua_8_LinearR_PR | 2022-09-29 | 2022-09-30 11:19:24 | 8.090155 | 2.167699 | 7.224 | 2.672 |
| 0 | wanhua | 8 | P | LinearRegression | 306 | wanhua_8_LinearR_P | 2022-09-29 | 2022-09-30 11:19:24 | 9.511185 | 2.305515 | 7.586 | 2.75 |


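### 4 - Train Today's Best Calibration Model

1. Using test_MSE as the criterion, select the candidate model with the smallest MSE
2. Retrain on the new training data (the window now ends on day N - 1) using that model's configuration
3. Save the model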
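
Step 1 can be sketched as a sort followed by taking the first row, which also gives a deterministic result when several candidates tie on `test_MSE`. Breaking ties by `test_MAE` is an assumption of this sketch, and the model names and scores below are made up.

```python
import pandas as pd

# Made-up scores; the duplicate index mimics the result of pd.concat on one-row frames.
results = pd.DataFrame(
    {
        "modelname": ["m_a", "m_b", "m_c"],
        "test_MSE": [7.9, 0.8, 0.8],
        "test_MAE": [2.8, 0.7, 0.6],
    },
    index=[0, 0, 0],
)

# Sort by MSE first, then by MAE, and take the first row by position.
best = results.sort_values(["test_MSE", "test_MAE"]).iloc[0].to_dict()
print(best["modelname"], best["test_MSE"], best["test_MAE"])
```

Selecting by position with `iloc[0]` avoids relying on index labels, which are all 0 in the concatenated results table.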

In [None]:
import joblib

In [None]:
# Lowest test_MSE

FIELD = "test_MSE"
BEST = AllResult_DF[ AllResult_DF[FIELD] == AllResult_DF[FIELD].min() ]
BEST

|  | site | day | feature | method | datapoints | modelname | date | create_timestamp_utc | Train_MSE | Train_MAE | test_MSE | test_MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wanhua | 3 | PHT | SVR | 86 | wanhua_3_SVR_PHT | 2022-09-29 | 2022-09-30 11:19:36 | 2.17931 | 1.117448 | 0.054 | 0.224 |


In [None]:
# Train the new model
# iloc[0] takes the first best row by position, even if several candidates tie
BEST_DC = BEST.iloc[0].to_dict()

Training_DF = SlideDay(All_DF, BEST_DC["day"], TESTDATE)[ FEATURES_METHOD[BEST_DC["feature"]] + ["EPA_PM25"] ]
result, lm = BuildModel( SITE, TODAY, BEST_DC["feature"], BEST_DC["day"], BEST_DC["method"], Training_DF )

[BuildR]-"SVR" with 3/PHT


In [None]:
result

{'site': 'wanhua',
 'day': 3,
 'feature': 'PHT',
 'method': 'SVR',
 'datapoints': 80,
 'modelname': 'wanhua_3_SVR_PHT',
 'date': '2022-09-30',
 'create_timestamp_utc': '2022-09-30 11:19:48',
 'Train_MSE': 3.91517342356589,
 'Train_MAE': 1.42125724796098}

In [None]:
# Save the model

# wanhua_3_SVR_PHT.joblib
model_dumpname = result["modelname"] + ".joblib"

# Output directory (an empty string means the current working directory)
MODEL_OUTPUT_PATH = ""

try:
    joblib.dump( lm, MODEL_OUTPUT_PATH + model_dumpname )
    print( "[BuildR]-dump {}".format( MODEL_OUTPUT_PATH+model_dumpname ) )
except Exception as e:
    print( "ERROR! [dump model] {}".format( result["modelname"] ) )
    error_msg(e)

[BuildR]-dump wanhua_3_SVR_PHT.joblib
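
A saved model can later be loaded with `joblib.load` and applied to new sensor readings that carry the same feature columns. The sketch below is self-contained and therefore trains a small stand-in model on made-up data and writes it to a temporary directory; the file name `demo_model.joblib` and all values are hypothetical, and the feature list corresponds to the `PHT` feature set.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.svm import SVR

features = ["PM25", "HR", "TEM"]  # the "PHT" feature set

# Made-up training data standing in for the merged hourly table.
train = pd.DataFrame({
    "PM25": [6.0, 8.0, 10.0, 12.0, 15.0, 20.0],
    "HR": [0, 4, 8, 12, 16, 20],
    "TEM": [28.0, 27.5, 29.0, 31.0, 30.0, 28.5],
    "EPA_PM25": [9.0, 10.0, 12.0, 13.0, 17.0, 21.0],
})
model = SVR(C=20).fit(train[features], train["EPA_PM25"])

# Round trip: dump to disk, then load the model back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# Calibrate a new (made-up) hourly reading with the loaded model.
new_rows = pd.DataFrame({"PM25": [9.0], "HR": [12], "TEM": [29.5]})
calibrated = loaded.predict(new_rows[features])
print(calibrated)
```

The column order passed to `predict` must match the order used for training, which is why the same `features` list is reused in both places.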
