# E4C CHALLENGE

## GREENER BUILDING: THE CO2 FOOTPRINT REDUCTION CHALLENGE 

### Team A* - Forecasting model

This notebook presents, with detailed commentary, the code used to process the data and to train and export the prediction model of the A* team.

In [1]:
# import numpy and pandas with usual aliases

import numpy as np
import pandas as pd

### Data completion


Each prediction requires _4536 predictors_ (27 columns × 24 hours × 7 days), corresponding to data from the previous 7 days, so each missing day prevents the creation of training data for every subsequent day whose 7-day window includes it. One solution we considered was to use a model capable of handling missing values. The selected model (GradientBoosting) actually supports this capability.

The solution we adopted involves filling in missing values as follows: for each missing data point, we take the average between its value at the same hour the previous day and its value at the same hour the following day.

In [2]:
# read the data from the file
df = pd.read_csv('students_drahi_production_consumption_hourly.csv', parse_dates=[0])

# set datetime as the index
df.set_index('datetime', inplace=True)

# fill each missing data point with the average of its values 24 hours earlier and 24 hours later
df.fillna((df.shift(24) + df.shift(-24))/2, inplace=True)

# save the completed data to a new file
df.to_csv('students_drahi_production_consumption_hourly_completed.csv')
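
The filling rule can be checked on a toy series. The sketch below is only an illustration: the three-day index and the `value` column are made up, and the rule is the same `shift`-based expression as above. Note that `shift(24)` moves by 24 rows, so the rule assumes a regular hourly index, and a point stays missing when either of its two neighbours is itself missing.

```python
import numpy as np
import pandas as pd

# Toy hourly series over three days: the value equals the row number (0..71).
index = pd.date_range("2024-01-01", periods=72, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"value": np.arange(72, dtype=float)}, index=index)

# Remove one point on day 2 (both neighbours exist) and one on day 1
# (there is no row 24 hours earlier).
toy.iloc[30, 0] = np.nan
toy.iloc[5, 0] = np.nan

# Same rule as in the notebook: mean of the values 24 rows before and 24 rows after.
filled = toy.fillna((toy.shift(24) + toy.shift(-24)) / 2)

print(filled.iloc[30, 0])  # (6 + 54) / 2 = 30.0
print(filled.iloc[5, 0])   # nan: the earlier neighbour does not exist
```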

### Data preprocessing

For data preprocessing, we first process the completed data with the provided code to obtain an _X_ of shape _(723, 4536)_ (one row per target day) and a _y_ of shape _(723,)_ (one daily energy consumption value per row of _X_).

In [3]:
# read the file
hourly_data_path = 'students_drahi_production_consumption_hourly_completed.csv'
hourly_data = pd.read_csv(hourly_data_path)

# code from example_model.py and challenge_utils.py

time = pd.to_datetime(hourly_data['datetime']).values.astype('datetime64[s]')
power_consumption = hourly_data['kw_total_zone2'].values

dec = [] # daily energy consumption
t_dec = []

for ti, t in enumerate(time):
    tmp_t = pd.Timestamp(t)

    if np.isclose(tmp_t.hour, 0) and np.isclose(tmp_t.minute, 0):

        day_end = np.datetime64(tmp_t + pd.Timedelta(days=1))
        ind = np.where((time > tmp_t) & (time < day_end), True, False)

        if len(time[ind]) > 0 and not np.isnan(power_consumption[ind]).any():
            t_dec.append(np.datetime64(tmp_t).astype('datetime64[s]'))
            dec.append(np.trapz(power_consumption[ind], time[ind].astype(int))/3600)

t_dec = np.array(t_dec)
dec = np.array(dec)

N = 7 # N days of predictors beforehand
final_ind = []
final_hourly = []

predictor_window = pd.Timedelta(days=N)

for ti, t in enumerate(t_dec):
    tmp_t = pd.Timestamp(t)

    ind = np.where((time >= tmp_t - predictor_window) & (time < tmp_t), True, False) # finding indices within the N prior days

    bad_ind = np.isnan(hourly_data.iloc[ind, 1::].values)
    if len(time[ind]) >= 24 * N and not bad_ind.any(): # rejecting any data with NaNs; useful for the student dataset
        final_ind.append(ti)
        final_hourly.append(hourly_data.iloc[ind, 1::].values)

target_time = t_dec[final_ind]
predictors = np.array(final_hourly)

X = predictors.reshape(len(predictors), -1)
y = np.array(dec[final_ind])
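
The figure of 4536 predictors follows from 27 columns × 24 hours × 7 days. The sketch below reproduces the flattening step on made-up data (the number of days and the array contents are hypothetical) and checks which feature and hourly row each flat column comes from.

```python
import numpy as np

n_features = 27      # columns of the hourly CSV besides 'datetime'
n_hours = 24 * 7     # N = 7 days of hourly rows before each target day
n_days = 3           # hypothetical number of target days, for illustration

# One (168, 27) block of hourly rows per target day, as collected in final_hourly.
windows = np.arange(n_days * n_hours * n_features, dtype=float).reshape(
    n_days, n_hours, n_features
)

# Same flattening as `predictors.reshape(len(predictors), -1)`.
X_toy = windows.reshape(len(windows), -1)
print(X_toy.shape)  # (3, 4536)

# Row-major flattening: flat column j holds feature j % 27 at hourly row j // 27.
j = 1000
hour, feature = divmod(j, n_features)
assert X_toy[0, j] == windows[0, hour, feature]
```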

For the subsequent preprocessing, we work with Sklearn pipelines. We aim to apply a Sklearn StandardScaler to the predictors. Since each of the 27 original columns has been split into 168 columns (one per hour of the 7-day window), we first fit one scaler per original column on the initial data (from the CSV) and cache it using a joblib Memory. Then, we construct the preprocessor by applying each of these scalers to the corresponding columns among the 4536.

In [4]:
from sklearn.ensemble import GradientBoostingRegressor
from joblib import Memory
from tempfile import mkdtemp
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# original columns. 
cols = ['AirTemp', 'pres', 'rain', 'rh', 'wd', 'ws', 'Global_Solar_Flux',
        'Diffuse_Solar_Flux', 'Direct_Solar_Flux', 'Downwelling_IR_Flux', 'SAA',
        'SZA', 'PAC', 'TGBT [kW]', 'kw_heater_corridor1_zone1',
        'kw_heaters_corridor_zone2', 'kw_heaters_toilets_zone2',
        'kw_heatingcoolingtotal_zone1', 'kw_heatingcoolingtotal_zone2',
        'kw_lights_zone1', 'kw_lights_zone2', 'kw_total_zone1',
        'kw_total_zone2', 'kw_ventilation_zone1', 'kw_ventilation_zone2',
        'kw_water_heater_zone2', 'plugs_zone2']

# pipelines[i] will be a pipeline containing a scaler fitted on the i-th column and cached in a joblib Memory.
pipelines = []

for col in cols:
    # first create the memory
    mem = Memory(location=mkdtemp())
    
    # create the pipeline
    scaler_pipeline = Pipeline([('scaler', StandardScaler())], memory=mem)
    
    # fit the scaler on its own column
    scaler_pipeline.fit(hourly_data[col].values.reshape(-1, 1))
    
    # append pipelines[i]
    pipelines.append(scaler_pipeline)

# build the preprocessor.
# The i-th original column has been "split" over the i-th, (i+27)-th, (i+27*2)-th... columns of X.
preprocessor = ColumnTransformer(transformers=[
    (f"{cols[i%27]}-{i//27}", pipelines[i%27], [i]) for i in range(4536)
])
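
To see the naming and column-selection scheme at a smaller scale, here is a minimal sketch with hypothetical sizes (3 features, 4 time steps) and random data. It uses fresh `StandardScaler` instances rather than the cached pipelines above, and `fit_transform` fits each scaler on its own flat column.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy sizes: 3 features observed at 4 time steps -> 12 flat columns.
names = ["a", "b", "c"]
n_features, n_steps = len(names), 4
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(50, n_features * n_steps))

# One scaler entry per flat column, named "<feature>-<time step>".
toy_preprocessor = ColumnTransformer(transformers=[
    (f"{names[i % n_features]}-{i // n_features}", StandardScaler(), [i])
    for i in range(n_features * n_steps)
])

X_scaled = toy_preprocessor.fit_transform(X_toy)
print(X_scaled.shape)                 # (50, 12)
print(toy_preprocessor.transformers[4][0])  # b-1
```

After fitting, every output column has zero mean and unit variance, and the columns keep the order of the `transformers` list.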

### Model

Initially, we hesitated between several models: DecisionTree, Multi-Layer Perceptron, GradientBoosting, and eXtreme Gradient Boosting. Ultimately, we decided to stick with the GradientBoostingRegressor provided in Sklearn. After multiple tests, we chose 300 estimators with a learning rate of 0.04, which performed well. Moreover, we use the mean squared error criterion without Friedman improvement.

In [8]:
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor(criterion='squared_error', learning_rate=0.04, n_estimators=300, verbose=1))
])
model.fit(X, y) #train on the full dataset

      Iter       Train Loss   Remaining Time 
         1         887.6147            3.34m
         2         832.4215            3.33m
         3         780.7148            3.32m
         4         733.0708            3.31m
         5         688.5990            3.30m
         6         647.7434            3.29m
         7         608.5958            3.27m
         8         573.2039            3.26m
         9         539.1684            3.25m
        10         508.2946            3.24m
        20         292.6742            3.11m
        30         183.5648            2.99m
        40         122.3920            2.88m
        50          87.6460            2.77m
        60          66.5240            2.65m
        70          53.2523            2.54m
        80          44.1689            2.43m
        90          37.6642            2.32m
       100          32.6297            2.21m
       200          14.4676            1.12m
       300           7.9095            0.00s


We train the model on the full dataset and first evaluate it on that same data with the provided metric (relative squared error). Since the model has already seen this data, the score is optimistic.

In [9]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_validate

# Provided method

def relative_squared_error(y_true, y_pred):
    """Relative squared error (RSE; also called relative mean square error). < 1 is good, = 1 is bad, > 1 really bad."""
    return np.mean((y_pred - y_true)**2)/np.mean((y_true - y_true.mean())**2)

# Turn it into a sklearn scorer
rse_scorer = make_scorer(relative_squared_error)

# Performance of the model on the full data
relative_squared_error(y, model.predict(X))

0.008342439664725029
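
To help read this value, the sketch below evaluates the metric on a few made-up targets. It shows the two reference points (0 for a perfect prediction, 1 for always predicting the mean) and that, on a given dataset, the relative squared error equals 1 − R². The arrays are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def relative_squared_error(y_true, y_pred):
    """Mean squared error divided by the variance of the true values."""
    return np.mean((y_pred - y_true) ** 2) / np.mean((y_true - y_true.mean()) ** 2)

# Hypothetical daily consumptions, for illustration only.
y_true = np.array([50.0, 60.0, 100.0, 90.0])

perfect = y_true.copy()                         # exact predictions
baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
noisy = y_true + np.array([5.0, -5.0, 5.0, -5.0])

print(relative_squared_error(y_true, perfect))   # 0.0
print(relative_squared_error(y_true, baseline))  # 1.0

# On the same data, RSE and R^2 are complementary: RSE = 1 - R^2.
rse_noisy = relative_squared_error(y_true, noisy)
print(np.isclose(rse_noisy, 1 - r2_score(y_true, noisy)))  # True
```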

To better evaluate the performance of the model, we cross-validate it using 5-fold cross-validation with the provided scorer. With `cv=5` and a regressor, Sklearn uses a plain (non-stratified, unshuffled) KFold.

In [7]:
cross_validate(model, X, y, cv=5, scoring=rse_scorer)['test_score']

      Iter       Train Loss   Remaining Time 
         1         621.9775            2.67m
         2         581.1608            2.67m
         3         543.4558            2.62m
         4         508.4803            2.65m
         5         476.0522            2.63m
         6         445.7646            2.62m
         7         417.9700            2.60m
         8         392.0190            2.59m
         9         368.1454            2.60m
        10         345.8756            2.60m
        20         191.0344            2.51m
        30         113.7766            2.39m
        40          72.7633            2.29m
        50          50.5462            2.20m
        60          37.5574            2.11m
        70          29.3929            2.02m
        80          23.8564            1.93m
        90          19.7942            1.84m
       100          16.8699            1.76m
       200           6.4737           53.27s
       300           3.2106            0.00s
      Ite

array([0.76698578, 0.50386456, 0.36638312, 0.3271854 , 0.1228128 ])
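
The sketch below, on synthetic data with a stand-in `Ridge` model (both are assumptions, not part of our pipeline), illustrates what `cv=5` does for a regressor: it gives the same folds as an explicit `KFold(n_splits=5)`. It also shows `TimeSeriesSplit` as one possible alternative for time-ordered data, where each test fold comes after its training data; we did not use it here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_validate

def relative_squared_error(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2) / np.mean((y_true - y_true.mean()) ** 2)

rse_scorer = make_scorer(relative_squared_error)

# Synthetic, hypothetical regression data standing in for (X, y).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

toy_model = Ridge()  # stand-in regressor, much faster than gradient boosting

# cv=5 and an explicit unshuffled KFold give the same test scores.
scores_int = cross_validate(toy_model, X_toy, y_toy, cv=5,
                            scoring=rse_scorer)["test_score"]
scores_kfold = cross_validate(toy_model, X_toy, y_toy, cv=KFold(n_splits=5),
                              scoring=rse_scorer)["test_score"]
print(np.allclose(scores_int, scores_kfold))  # True

# Possible alternative for time-ordered data: train on the past, test on the future.
scores_ts = cross_validate(toy_model, X_toy, y_toy, cv=TimeSeriesSplit(n_splits=5),
                           scoring=rse_scorer)["test_score"]
print(len(scores_ts))  # 5
```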

We can plot the values predicted by the model along with the real values.

In [18]:
pd.options.plotting.backend = "plotly"
df = pd.DataFrame({'real': y, 'predicted': model.predict(X)}, index=target_time)
df.plot()



Finally, we can export the model to an ONNX file.

In [15]:
from skl2onnx import to_onnx

onnx_model = to_onnx(model, X[:1].astype(np.float32), target_opset=12, verbose=True)

with open('AStar-GradientBoost-Final.onnx', 'wb') as file:
    file.write(onnx_model.SerializeToString())

[to_onnx] initial_types=[('X', FloatTensorType(shape=[None, 4536]))]
[convert_sklearn] parse_sklearn_model
[convert_sklearn] convert_topology
[convert_operators] begin
[convert_operators] iteration 1 - n_vars=0 n_ops=9074
[call_converter] call converter for 'SklearnArrayFeatureExtractor'.
[call_converter] call converter for 'SklearnScaler'.
[call_converter] call converter for 'SklearnArrayFeatureExtractor'.
[call_converter] call converter for 'SklearnScaler'.
[call_converter] call converter for 'SklearnArrayFeatureExtractor'.
[call_converter] call converter for 'SklearnScaler'.
[call_converter] call converter for 'SklearnArrayFeatureExtractor'.
[call_converter] call converter for 'SklearnScaler'.
[call_converter] call converter for 'SklearnArrayFeatureExtractor'.
[call_converter] call converter for 'SklearnScaler'.
[call_converter] call converter for 'SklearnArrayFeatureExtractor'.
[call_converter] call converter for 'SklearnScaler'.
[call_converter] call converter for 'SklearnArrayFea

To test the export, we can reload the model from the ONNX file:

In [16]:
from onnxruntime import InferenceSession

sess = InferenceSession('AStar-GradientBoost-Final.onnx', providers=["CPUExecutionProvider"])
run_onnx = sess.run(None, {"X": X.astype(np.float32)})[0][:, 0]

and compare the predictions of the Sklearn model with the output of the ONNX run:

In [17]:
pd.DataFrame({'ONNX': run_onnx, 'model': model.predict(X)}, index=target_time)

| target_time | ONNX | model |
|---|---|---|
| 2022-01-08 | 51.117664 | 51.117663 |
| 2022-01-09 | 58.054474 | 58.054472 |
| 2022-01-10 | 106.206978 | 106.206977 |
| 2022-01-11 | 98.000046 | 98.000050 |
| 2022-01-12 | 98.405914 | 98.405905 |
| ... | ... | ... |
| 2023-12-27 | 62.185150 | 62.185149 |
| 2023-12-28 | 69.891838 | 69.891840 |
| 2023-12-29 | 69.697113 | 69.697111 |
| 2023-12-30 | 59.056992 | 59.056992 |
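
Rather than comparing the two columns by eye, the agreement can be checked numerically. The sketch below is a minimal illustration that hard-codes the first five rows shown above; on the real data one would pass `run_onnx` and `model.predict(X)` instead. The tolerance is an assumption: the ONNX graph runs in float32, so tiny differences are expected.

```python
import numpy as np

# First five rows of the comparison table, copied by hand for illustration.
onnx_pred = np.array([51.117664, 58.054474, 106.206978, 98.000046, 98.405914])
skl_pred = np.array([51.117663, 58.054472, 106.206977, 98.000050, 98.405905])

# Largest absolute gap between the two sets of predictions.
max_abs_diff = np.max(np.abs(onnx_pred - skl_pred))

# Relative tolerance chosen loosely above float32 rounding error (assumption).
print(np.allclose(onnx_pred, skl_pred, rtol=1e-5, atol=0.0))  # True
```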
