*This is the open-source code for the paper: <u>A hybrid-model forecasting framework for reducing the building energy performance gap.</u><br>
@Date    : 2022-05-29 18:31:45<br>
@Author  : Xia CHEN (xia.chen@iek.uni-hannover.de), Tong Guo, Martin Kriegel, Philipp Geyer<br>
@Link    : https://doi.org/10.1016/j.aei.2022.101627<br>
@Ver    : v01<br>
If you use the code or data, please cite: <br>*
- *Chen, Xia, et al. "A hybrid-model forecasting framework for reducing the building energy performance gap." Advanced Engineering Informatics 52 (2022): 101627.*<br>
- *Xiao, Tong, Xu, Peng, Sha, Huajing, Chen, Zhe, & Gu, Jiefan. (2022). XuPengResearchGroup/EnergyDetective2020_dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6590976*<br>


*The original data and description are available at:<br>*
- *https://github.com/XuPengResearchGroup/EnergyDetective2020/tree/main/dataset*

In [1]:
import os, sys, gc, time, warnings, psutil, random, csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from math import ceil
import lightgbm as lgb
import plotly.offline as py
py.init_notebook_mode(connected=True) 
import plotly.graph_objs as go     

from utils import *

warnings.filterwarnings('ignore')

## Load data & visualization

In [2]:
filename = 'LOI3_3-3'
df = pd.read_csv('3Y/result_' + filename + '.csv', index_col=0)
df

| Unnamed: 0 | Time | Simu_Record | Record |
|---|---|---|---|
| 0 | 2015-01-01 00:00:00 | 49.981329 | 59.17 |
| 1 | 2015-01-01 01:00:00 | 52.510945 | 57.63 |
| 2 | 2015-01-01 02:00:00 | 51.702943 | 60.00 |
| 3 | 2015-01-01 03:00:00 | 56.192828 | 58.17 |
| 4 | 2015-01-01 04:00:00 | 58.558172 | 60.74 |
| ... | ... | ... | ... |
| 26299 | 2017-12-31 19:00:00 | 56.948915 | 64.81 |
| 26300 | 2017-12-31 20:00:00 | 61.802434 | 67.87 |
| 26301 | 2017-12-31 21:00:00 | 64.598717 | 66.88 |
| 26302 | 2017-12-31 22:00:00 | 67.675437 | 67.38 |


### All

In [3]:
'''
Evaluation: 
    Evaluate the performance gap between different simulations and actual records.
    Plot the simulation and record by plotly to allow interactive analysis.
''' 
evaluation(df, filename)

Performance evaluation LOI3_3-3: RMSE is 134.1307, MAPE is 28.8336, MAE is 70.3593, NRMSE is 7.8286, R^2 is 0.8253
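
The metric helpers behind `evaluation` live in `utils` and are not shown in this notebook. As a rough sketch of what the reported numbers mean, the functions below compute them with NumPy. The names `rmse`, `mae`, `mape`, `nrmse`, and `r2` are hypothetical, and the NRMSE normalisation used here (RMSE divided by the range of the records, in percent) is an assumption that may differ from the one in `utils`.

```python
import numpy as np

def _as_arrays(y_true, y_pred):
    return np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)

def rmse(y_true, y_pred):
    y_true, y_pred = _as_arrays(y_true, y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    y_true, y_pred = _as_arrays(y_true, y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    # Percentage error relative to the true values (undefined if any true value is 0).
    y_true, y_pred = _as_arrays(y_true, y_pred)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def nrmse(y_true, y_pred):
    # ASSUMPTION: normalise by the range of the true values, expressed in percent.
    y_true, y_pred = _as_arrays(y_true, y_pred)
    return rmse(y_true, y_pred) / float(y_true.max() - y_true.min()) * 100

def r2(y_true, y_pred):
    y_true, y_pred = _as_arrays(y_true, y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Toy values standing in for df['Record'] and df['Simu_Record'].
record = [1.0, 2.0, 3.0]
simulated = [1.0, 2.0, 5.0]

scores = {
    "RMSE": rmse(record, simulated),
    "MAE": mae(record, simulated),
    "MAPE": mape(record, simulated),
    "NRMSE": nrmse(record, simulated),
    "R2": r2(record, simulated),
}
r2_swapped = r2(simulated, record)  # arguments in the wrong order
```

RMSE and MAE are symmetric in their arguments, but MAPE, NRMSE, and R^2 are not: on the toy data `scores["R2"]` and `r2_swapped` differ, so the actual records should always be passed as the first argument.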


### Target period
WT: Jan 05 - Jan 17<br>
WE: Feb 09 - Feb 21<br>
ST: Jul 12 - Jul 24<br>
SE: Jul 19 - Jul 31<br>


In [4]:
'''
Visualization of the four selected periods.

'''

df['Time'] = pd.to_datetime(df['Time'])  # unit-less astype('datetime64') is rejected by pandas >= 2.0
mask_WT = (df['Time'] > '2017-1-5') & (df['Time'] <= '2017-1-17')
mask_WE = (df['Time'] > '2017-2-9') & (df['Time'] <= '2017-2-21')
mask_ST = (df['Time'] > '2017-7-6') & (df['Time'] <= '2017-7-18')
mask_SE = (df['Time'] > '2017-7-22') & (df['Time'] <= '2017-8-3')

# print('Winter Typical:')
df_WT = df.loc[mask_WT]
evaluation(df_WT, filename + ' Winter Typical 2017')
# print('Winter Extreme:')
df_WE = df.loc[mask_WE]
evaluation(df_WE, filename + 'Winter Extreme 2017')
# print('Summer Typical:')
df_ST = df.loc[mask_ST]
evaluation(df_ST, filename + 'Summer Typical 2017')
# print('Summer Extreme:')
df_SE = df.loc[mask_SE]
evaluation(df_SE, filename + 'Summer Extreme 2017')

Performance evaluation LOI3_3-3 Winter Typical 2017: RMSE is 16.2601, MAPE is 12.6653, MAE is 11.2575, NRMSE is 8.7304, R^2 is 0.9229


Performance evaluation LOI3_3-3Winter Extreme 2017: RMSE is 29.2316, MAPE is 19.4570, MAE is 21.2997, NRMSE is 15.4829, R^2 is 0.7237


Performance evaluation LOI3_3-3Summer Typical 2017: RMSE is 224.7688, MAPE is 22.6506, MAE is 143.0527, NRMSE is 15.0381, R^2 is 0.8165


Performance evaluation LOI3_3-3Summer Extreme 2017: RMSE is 200.6393, MAPE is 26.6078, MAE is 135.8452, NRMSE is 11.9616, R^2 is 0.8941
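
The four masks in the cell above all follow the same pattern, so they can be generated from a small table of windows. The sketch below is one way to do that; the dictionary `PERIODS` and the helper `period_mask` are hypothetical names, the window dates are copied from the masks in the code, and a toy hourly timeline stands in for `df['Time']`.

```python
import pandas as pd

# Windows copied from the masks in the cell above (start exclusive, end inclusive).
PERIODS = {
    "Winter Typical": ("2017-01-05", "2017-01-17"),
    "Winter Extreme": ("2017-02-09", "2017-02-21"),
    "Summer Typical": ("2017-07-06", "2017-07-18"),
    "Summer Extreme": ("2017-07-22", "2017-08-03"),
}

def period_mask(times, start, end):
    """Boolean mask for start < t <= end, matching the comparisons used above."""
    times = pd.to_datetime(pd.Series(times).reset_index(drop=True))
    mask = (times > pd.Timestamp(start)) & (times <= pd.Timestamp(end))
    return mask.to_numpy()

# Toy hourly timeline for 2017 (365 days * 24 hours).
toy_time = pd.date_range("2017-01-01", periods=365 * 24, freq=pd.Timedelta(hours=1))

hours_per_period = {
    name: int(period_mask(toy_time, *window).sum())
    for name, window in PERIODS.items()
}
```

With hourly data each window selects 288 samples (12 days). Because the start is exclusive and the end inclusive, the first sample is 01:00 on the start date and the last is 00:00 on the end date.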


## Feature engineering

In [5]:
'''
Prepare training data with weather data and feature engineering.

'''

# load weather data
w_2017_df = pd.read_csv('shanghai_weather_all.csv')

w_2017_df['Time'] = pd.to_datetime(w_2017_df['Time'])
df['Time'] = pd.to_datetime(df['Time'])
df = df.merge(w_2017_df, on='Time', how='outer')

# Time-series feature engineering
df['tm_h_sin'] = np.sin(2*np.pi*df['Time'].dt.hour/24)
df['tm_h_cos'] = np.cos(2*np.pi*df['Time'].dt.hour/24)

df['tm_d'] = df['Time'].dt.day.astype(np.int8)
df['tm_w'] = df['Time'].dt.isocalendar().week.astype(np.int8)  # Series.dt.week was removed in pandas 2.0
df['tm_m'] = df['Time'].dt.month.astype(np.int8)
df['tm_y'] = df['Time'].dt.year
df['tm_y'] = (df['tm_y'] - df['tm_y'].min()).astype(np.int8)
df['tm_wm'] = df['tm_d'].apply(lambda x: ceil(x/7)).astype(np.int8)

df['tm_dw'] = df['Time'].dt.dayofweek.astype(np.int8)
df['tm_w_end'] = (df['tm_dw']>=5).astype(np.int8)

for d_shift in [1,3,6,12]: 
    for d_window in [6,12]:
        col_name = 'rolling_mean_temp'+str(d_shift)+'_'+str(d_window)
        df[col_name] = df['temp'].transform(lambda x: x.shift(d_shift).rolling(d_window).mean()).astype(np.float16)
        col_name = 'rolling_mean_dew'+str(d_shift)+'_'+str(d_window)
        df[col_name] = df['dew'].transform(lambda x: x.shift(d_shift).rolling(d_window).mean()).astype(np.float16)
        col_name = 'rolling_mean_hum'+str(d_shift)+'_'+str(d_window)
        df[col_name] = df['hum'].transform(lambda x: x.shift(d_shift).rolling(d_window).mean()).astype(np.float16)
        col_name = 'rolling_mean_simu'+str(d_shift)+'_'+str(d_window)
        df[col_name] = df['Simu_Record'].transform(lambda x: x.shift(d_shift).rolling(d_window).mean()).astype(np.float16)
for d_shift in [1,2,3]:
    col_name = 'Shift_temp'+str(d_shift)
    df[col_name] = df['temp'].transform(lambda x: x.shift(d_shift)).astype(np.float16)
    col_name = 'Shift_dew'+str(d_shift)
    df[col_name] = df['dew'].transform(lambda x: x.shift(d_shift)).astype(np.float16)
    col_name = 'Shift_hum'+str(d_shift)
    df[col_name] = df['hum'].transform(lambda x: x.shift(d_shift)).astype(np.float16)
    col_name = 'Shift_Simu_record_'+str(d_shift)
    df[col_name] = df['Simu_Record'].transform(lambda x: x.shift(d_shift)).astype(np.float16)

df = df.dropna()
df = df.set_index('Time')
df.columns

Index(['Simu_Record', 'Record', 'temp', 'dew', 'hum', 'pres', 'winds',
       'tm_h_sin', 'tm_h_cos', 'tm_d', 'tm_w', 'tm_m', 'tm_y', 'tm_wm',
       'tm_dw', 'tm_w_end', 'rolling_mean_temp1_6', 'rolling_mean_dew1_6',
       'rolling_mean_hum1_6', 'rolling_mean_simu1_6', 'rolling_mean_temp1_12',
       'rolling_mean_dew1_12', 'rolling_mean_hum1_12', 'rolling_mean_simu1_12',
       'rolling_mean_temp3_6', 'rolling_mean_dew3_6', 'rolling_mean_hum3_6',
       'rolling_mean_simu3_6', 'rolling_mean_temp3_12', 'rolling_mean_dew3_12',
       'rolling_mean_hum3_12', 'rolling_mean_simu3_12', 'rolling_mean_temp6_6',
       'rolling_mean_dew6_6', 'rolling_mean_hum6_6', 'rolling_mean_simu6_6',
       'rolling_mean_temp6_12', 'rolling_mean_dew6_12', 'rolling_mean_hum6_12',
       'rolling_mean_simu6_12', 'rolling_mean_temp12_6',
       'rolling_mean_dew12_6', 'rolling_mean_hum12_6', 'rolling_mean_simu12_6',
       'rolling_mean_temp12_12', 'rolling_mean_dew12_12',
       'rolling_mean_hum12_12', 'r
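
To make the two main ideas of the feature-engineering cell concrete, here is a small sketch on toy data. It shows (1) that shifting before rolling keeps each rolling mean strictly in the past, and (2) that the sine/cosine encoding places hour 23 next to hour 0. The series `toy` and the helper `encode_hour` are illustrative only and not part of the original code.

```python
import numpy as np
import pandas as pd

# Toy hourly series standing in for a column such as df['temp'].
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Shift first, then roll: the value at row t averages rows t-3 .. t-1 only,
# so the current observation never leaks into its own feature.
lagged_mean = toy.shift(1).rolling(3).mean()

def encode_hour(hour):
    """Map an hour of day onto the unit circle, as done for tm_h_sin / tm_h_cos."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Distances between encoded hours: 23:00 is close to 00:00, 12:00 is opposite.
dist_0_23 = float(np.linalg.norm(encode_hour(0) - encode_hour(23)))
dist_0_12 = float(np.linalg.norm(encode_hour(0) - encode_hour(12)))
```

The first `d_shift + d_window - 1` rows of each rolling feature are NaN (three rows in this toy case), which is why the cell above ends with `df.dropna()`.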

## Training

In [6]:
########################### Model params #################################################################################
lgb_params = {
            'boosting_type': 'gbdt',
            'metric': 'rmse',
            'subsample': 0.5,
            'subsample_freq': 1,
            'learning_rate': 0.1,
            'feature_fraction': 0.5,
            'boost_from_average': False,
            'verbose': -1,
            'n_jobs': -1,
            }

########################### Vars #################################################################################
VER = 1                          # Version
SEED = 42                        # We want all things
seed_everything(SEED)            # to be as deterministic 
lgb_params['seed'] = SEED        # as possible
N_CORES = psutil.cpu_count()     # Available CPU cores


mask_WT = (df.index > '2017-1-5') & (df.index <= '2017-1-17')
mask_WE = (df.index > '2017-2-9') & (df.index <= '2017-2-21')
mask_ST = (df.index > '2017-7-6') & (df.index <= '2017-7-18')
mask_SE = (df.index > '2017-7-22') & (df.index <= '2017-8-3')

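# Alternative test set restricted to the four target periods (not used below):
# test_df = pd.concat([df.loc[mask_WT],df.loc[mask_WE],df.loc[mask_ST],df.loc[mask_SE]])
# Test set: January to November 2017. Copy so that prediction columns can be added
# later without triggering SettingWithCopyWarning.
test_df = df.loc['2017-1':'2017-11'].copy()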

In [7]:
for WITH_SIMU in (True, False):
    Target_features = ['Record', 'Simu_Record']
    features = [col for col in list(df) if col not in Target_features]
    if WITH_SIMU:
        print('Now running Hybrid approach:')
        features.append('Simu_Record')
        col_name = 'Hybrid pred'
    else:
        print('Now running Pure ML approach:')
        col_name = 'Pure ML pred'
    TARGET = Target_features[0]

    train_df = df['2015-1':'2016-11']
    # vaild_df = df['2017-9':'2017-10']
    # test_df = df['2017-12':]

    train_data = lgb.Dataset(train_df[features], 
                       label=train_df[Target_features[0]])
    # train_data = lgb.Dataset('train_data.bin')

    # valid_data = lgb.Dataset(vaild_df[features], 
    #                    label=vaild_df[Target_features[0]])

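    # NOTE: early_stopping_rounds / verbose_eval as keyword arguments belong to the
    # LightGBM 3.x API. From LightGBM 4.0 on they are passed as callbacks instead,
    # e.g. callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)].
    estimator = lgb.cv(
                        lgb_params,
                        train_data,
                        num_boost_round=10000,
                        nfold=3,
                        early_stopping_rounds=100,
                        verbose_eval=100,
                        stratified=False,
                        seed=SEED
    )

    print('Best RMSE in cv {:.5f}, with std {:.5f}.'.format(
        estimator['rmse-mean'][-1], estimator['rmse-stdv'][-1]))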
    print('Best iteration rounds is {}.'.format(len(estimator['rmse-mean'])))

    # Refit on the full training window, using the number of rounds selected by CV.
    estimator = lgb.train(lgb_params,
                          train_data,
                          verbose_eval=100,
                          num_boost_round=len(estimator['rmse-mean']),
                          )
    
    test_df[col_name] = estimator.predict(test_df[features])
    test_df['Time'] = test_df.index
    
    print('Performance, whole test period:')
    y_true, y_pred = test_df['Record'], test_df[col_name]
    print('LightGBM result: RMSE is {:.4f}, MAPE is {:.4f}, MAE is {:.4f}, NRMSE is {:.4f}, R^2 is {:.4f}'.format(
        RMSE(y_true, y_pred),
        MAPE(y_pred, y_true),   # order as in the source; verify against utils.MAPE
        MAE(y_true, y_pred),
        NRMSE(y_true, y_pred),
        metrics.r2_score(y_true, y_pred)))  # sklearn expects (y_true, y_pred)

Now running Hybrid approach:
[100]	cv_agg's rmse: 51.9704 + 1.58155
[200]	cv_agg's rmse: 47.8352 + 1.75313
[300]	cv_agg's rmse: 46.272 + 1.61167
[400]	cv_agg's rmse: 45.4846 + 1.60483
[500]	cv_agg's rmse: 44.9802 + 1.53684
[600]	cv_agg's rmse: 44.6374 + 1.55789
[700]	cv_agg's rmse: 44.4041 + 1.52463
[800]	cv_agg's rmse: 44.2528 + 1.56633
[900]	cv_agg's rmse: 44.1496 + 1.55523
[1000]	cv_agg's rmse: 44.0554 + 1.59115
[1100]	cv_agg's rmse: 43.9622 + 1.5903
[1200]	cv_agg's rmse: 43.9121 + 1.5994
[1300]	cv_agg's rmse: 43.8841 + 1.60602
[1400]	cv_agg's rmse: 43.8575 + 1.6057
[1500]	cv_agg's rmse: 43.8189 + 1.60969
[1600]	cv_agg's rmse: 43.7948 + 1.61105
[1700]	cv_agg's rmse: 43.769 + 1.61709
[1800]	cv_agg's rmse: 43.7442 + 1.62977
[1900]	cv_agg's rmse: 43.7368 + 1.63434
[2000]	cv_agg's rmse: 43.7288 + 1.63708
[2100]	cv_agg's rmse: 43.719 + 1.63621
[2200]	cv_agg's rmse: 43.7084 + 1.6437
[2300]	cv_agg's rmse: 43.7101 + 1.64429
Best RMSE in cv 43.70800, with std 1.64400.
Best iteration rounds is
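
The only difference between the two runs in the loop above is whether `Simu_Record` is offered to the model as an input. A minimal sketch of that selection rule, using a hypothetical helper `select_features` and a shortened, illustrative column list:

```python
def select_features(columns, with_simu, targets=("Record", "Simu_Record")):
    """Return the model inputs for the hybrid (with_simu=True) or pure ML run."""
    # Start from every column that is neither the target nor the raw simulation.
    features = [col for col in columns if col not in targets]
    # The hybrid run adds the simulation output back as an extra input.
    if with_simu:
        features.append("Simu_Record")
    return features

# Shortened column list for illustration; the real df has many more columns.
toy_columns = ["Simu_Record", "Record", "temp", "dew", "hum",
               "tm_h_sin", "Shift_Simu_record_1"]

hybrid_features = select_features(toy_columns, with_simu=True)
pure_ml_features = select_features(toy_columns, with_simu=False)
```

Under this rule only the raw `Simu_Record` column is withheld from the pure ML run. Columns derived from the simulation, such as `Shift_Simu_record_1` or the `rolling_mean_simu*` features, remain in both lists.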

## Evaluation

In [8]:
mask_WT = (test_df.index > '2017-1-5') & (test_df.index <= '2017-1-17')
mask_WE = (test_df.index > '2017-2-9') & (test_df.index <= '2017-2-21')
mask_ST = (test_df.index > '2017-7-6') & (test_df.index <= '2017-7-18')
mask_SE = (test_df.index > '2017-7-22') & (test_df.index <= '2017-8-3')

In [9]:
print('Winter typical 2017:')
Test_evaluation(test_df.loc[mask_WT], filename)
print('Winter extreme 2017:')
Test_evaluation(test_df.loc[mask_WE], filename)
print('Summer typical 2017:')
Test_evaluation(test_df.loc[mask_ST], filename)
print('Summer extreme 2017:')
Test_evaluation(test_df.loc[mask_SE], filename)

Winter typical 2017:
Hybrid approach LOI3_3-3: RMSE is 16.0747, MAPE is 10.2784, MAE is 11.0089, NRMSE is 6.8950, R^2 is 0.9406
Pure ML LOI3_3-3: RMSE is 26.6059, MAPE is 18.2251, MAE is 18.8151, NRMSE is 10.2855, R^2 is 0.8086


Winter extreme 2017:
Hybrid approach LOI3_3-3: RMSE is 22.6184, MAPE is 10.3408, MAE is 14.4352, NRMSE is 8.2462, R^2 is 0.9036
Pure ML LOI3_3-3: RMSE is 29.7498, MAPE is 16.5477, MAE is 21.0339, NRMSE is 9.7241, R^2 is 0.8027


Summer typical 2017:
Hybrid approach LOI3_3-3: RMSE is 81.2288, MAPE is 17.1028, MAE is 51.4569, NRMSE is 4.2792, R^2 is 0.9860
Pure ML LOI3_3-3: RMSE is 218.5658, MAPE is 50.1181, MAE is 140.6564, NRMSE is 10.3846, R^2 is 0.8896


Summer extreme 2017:
Hybrid approach LOI3_3-3: RMSE is 88.4926, MAPE is 200.1567, MAE is 59.7215, NRMSE is 4.3600, R^2 is 0.9868
Pure ML LOI3_3-3: RMSE is 185.9916, MAPE is 42.4256, MAE is 120.2519, NRMSE is 9.2589, R^2 is 0.9331
