<font size=6> Time Series </font> <br/>
<font size=3> 以空氣指標預測未來一小時、六小時是否會降雨 </font> <br/>
<font size=3>
 空氣指標共有 18 種，測試拿單一指標
 <font size=3 color=green> PM 2.5 </font>
 與所有指標的差異
</font>

In [2]:
import pandas as pd
import numpy as np
data = pd.read_csv('新竹_2020.csv', encoding='Big5')
data.columns = ['A', 'date', 'att', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10',
                '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23']
data.drop(columns=['A'], inplace=True)
data.drop(index=[0], inplace=True)
replace_symbol = ['#                              ', 'x                              ', '*                              '
                 , 'A                              ', 'NA                             ']
data = data.apply( lambda x: x.replace(replace_symbol, np.nan) )
data.drop(columns=['date'], inplace=True)
# 4932 之後是10月的
train_data = data.iloc[4932:6030, :]
test_data = data.iloc[6030:, :]

<font size=5> Preprocess </font> <br/>
<font size=3> 前處理後切分訓練、測試資料 </font>

In [3]:
# 前處理，將資料變成 18 列、天數*24 欄
def preprocess(data):
    new_data = []
    for i in range(18):
        j = i
        all_day = []
        new_all_day = []
        while j < len( data ):
            # while 跑一圈代表一天的資料
            one_day = []
            # columns 0是 att (空汙名稱)
            one_day = data.iloc[j, 1:].tolist()
            # 將一列資料(一天的數據)存到one_day串列中，然後去掉空格、轉 float
            for item in range( len(one_day) ):
                if type(one_day[item]) == str:
                    one_day[item] = float(one_day[item].strip())
            # 處理完一天的數據，將串列存到 all_day
            all_day.append(one_day)
            j += 18
        # 合併串列裡的多個串列成為一列，這樣一個空汙指標會有 1 * 1464 (24*61)
        for q in range( len(all_day) ):
            for w in all_day[q]:
                new_all_day.append(w)
        # 不用新串列去接的話，會一直往後，變成串列中只有一個元素，該元素裡有 1464*18 個資料
        # 但變成 18 * 1464，之後轉 dataframe 才會是 18列 1464欄
        new_data.append(new_all_day)
    # 填補 nan
    n_data = pd.DataFrame(new_data)
    f_data = n_data.copy()
    b_data = n_data.copy()
    f_data = f_data.ffill(axis=1)
    b_data = f_data.bfill(axis=1)
    f_data.fillna(0.0)
    b_data.fillna(0.0)
    n_data = (f_data + b_data) / 2
    print(n_data.shape)
    return n_data

In [4]:
# 前處理後，生出要拿去模型訓練、預測的資料 X
train_data = preprocess(train_data)
test_data = preprocess(test_data)

(18, 1464)
(18, 744)


In [5]:
# 生成要拿去模型訓練、預測的資料
# 以單一指標(PM 2.5)去預測 1小時後、6小時後。index 9是 PM 2.5
def forecast_one(data, hour):
    X = []
    Y = []
    if hour == 1:
        for k in range( len(data.columns)-6 ):
            x_temp = data.iloc[9, k:k+6].tolist()
            X.append(x_temp)
            Y.append(data.iloc[9, k+6])
    elif hour == 6:
        for k in range( len(data.columns)-11 ):
            x_temp = data.iloc[9, k:k+6].tolist()
            X.append(x_temp)
            Y.append(data.iloc[9, k+11])
    else:
        print('error')
    
    return X, Y

# 以 18種指標去分別預測 1小時後、6小時後
def forecast_all(data, hour):
    X = []
    Y = []
    if hour == 1:
        # 18種空汙指標，每種指標有 6筆資料 (6小時)
        for i in range( len(data.columns)-6 ):
            # 順序可以有 2種。18*6、6*18 (一次取一種指標的 6小時資料，再接下一個指標)
            # 這邊是用後面那種
            temp = []
            one_row = []
            for j in range( len(data.index) ):
                x_temp = data.iloc[j, i:i+6].tolist()
                temp.append(x_temp)
            # 要把上面 for 產生的 18個元素(每個元素裡 6個資料)併成一列
            for q in range( len(temp) ):
                for w in range( len(temp[q]) ):
                    one_row.append(temp[q][w])
            X.append(one_row)
            Y.append(data.iloc[9, i+6])
    
    elif hour == 6:
        for i in range( len(data.columns)-11 ):
            # 順序可以有 2種。18*6、6*18 (一次取一種指標的 6小時資料，再接下一個指標)
            # 這邊是用後面那種
            temp = []
            one_row = []
            for j in range( len(data.index) ):
                x_temp = data.iloc[j, i:i+6].tolist()
                temp.append(x_temp)
            # 要把上面 for 產生的 18個元素(每個元素裡 6個資料)併成一列
            for q in range( len(temp) ):
                for w in range( len(temp[q]) ):
                    one_row.append(temp[q][w])
            X.append(one_row)
            Y.append(data.iloc[9, i+11])
    return X, Y

<font size=5> Evaluation </font> <br/>
<font size=3>
    以 <font size=3 color=yellow> Linear Regression </font>
    與 <font size=3 color=yellow> XGBoost </font>
    兩種方法作為預測模型，並計算 MAE 作為最終結果
</font>

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
import xgboost as xgb

def forecast_model(train_X, train_Y, test_X, test_Y):
    # linear Regression
    lm = LinearRegression()
    lm.fit(train_X, train_Y)
    mae_linear = mean_absolute_error(lm.predict(test_X), test_Y)
    # XGBoost
    xgm = xgb.XGBRegressor(n_estimators = 50)
    xgm.fit(train_X, train_Y)
    mae_xgb = mean_absolute_error(xgm.predict(test_X), test_Y)
    
    return mae_linear, mae_xgb

In [15]:
# PM 2.5去預測
# 訓練資料的 X、Y
train_X_f1, train_Y_f1 = forecast_one(train_data, 1)
train_X_f6, train_Y_f6 = forecast_one(train_data, 6)
# 測試資料的 X、Y
test_X_f1, test_Y_f1 = forecast_one(train_data, 1)
test_X_f6, test_Y_f6 = forecast_one(train_data, 6)

print('以單一指標 PM 2.5 去預測')
mae_linear_1, mae_xgb_1 = forecast_model(train_X_f1, train_Y_f1, test_X_f1, test_Y_f1)
mae_linear_6, mae_xgb_6 = forecast_model(train_X_f6, train_Y_f6, test_X_f6, test_Y_f6)
print('future 1 hour MAE ')
print('linear :', round(mae_linear_1, 3), '/ xgb :', round(mae_xgb_1, 3))
print('future 6 hour MAE ')
print('linear :', round(mae_linear_6, 3), '/ xgb :', round(mae_xgb_6, 3))


以單一指標 PM 2.5 去預測
future 1 hour MAE 
linear : 2.577 / xgb : 1.156
future 6 hour MAE 
linear : 4.023 / xgb : 1.837


In [16]:
# 全部指標去預測
# 訓練資料的 X、Y
train_X_f1, train_Y_f1 = forecast_all(train_data, 1)
train_X_f6, train_Y_f6 = forecast_all(train_data, 6)
# 測試資料的 X、Y
test_X_f1, test_Y_f1 = forecast_all(train_data, 1)
test_X_f6, test_Y_f6 = forecast_all(train_data, 6)

print('用所有指標(18個) 去預測')
mae_linear_1, mae_xgb_1 = forecast_model(train_X_f1, train_Y_f1, test_X_f1, test_Y_f1)
mae_linear_6, mae_xgb_6 = forecast_model(train_X_f6, train_Y_f6, test_X_f6, test_Y_f6)

print('future 1 hour MAE ')
print('linear :', round(mae_linear_1, 3), '/ xgb :', round(mae_xgb_1, 3))
print('future 6 hour MAE ')
print('linear :', round(mae_linear_6, 3), '/ xgb :', round(mae_xgb_6, 3))

用所有指標(18個) 去預測
future 1 hour MAE 
linear : 2.3 / xgb : 0.413
future 6 hour MAE 
linear : 3.642 / xgb : 0.354


<font size=5> Result </font>

In [90]:
# 統整結果
lin = [l11, l16, l21, l26]
xg = [x11, x16, x21, x26]
a = pd.DataFrame({'linear':lin, 'XGBoost':xg})
a.index = ['PM2.5, 1H', 'PM2.5, 6H', 'All, 1H', 'All, 6H']
print(a)

           linear  XGBoost
PM2.5, 1H   2.577    1.156
PM2.5, 6H   4.023    1.837
All, 1H     2.300    0.413
All, 6H     3.642    0.354
