# Time-series Forecasting

## Description
The data consists of 52,416 observations of energy consumption recorded at 10-minute intervals. Every observation is described by the feature columns listed below.

Your task is to **aggregate the observations into 2-hour intervals**. For each interval, use the values of the **4 previous intervals** to forecast the target value one step into the future. Choose which features you are going to use.

**You must train a boosting model for the task. Choose the model based on the number and type of features available.**



Features:

* Date: Time window of ten minutes.
* Temperature: Weather Temperature.
* Humidity: Weather Humidity.
* WindSpeed: Wind Speed.
* GeneralDiffuseFlows: “Diffuse flow” is a catchall term to describe low-temperature (< 0.2° to ~ 100°C) fluids that slowly discharge through sulfide mounds, fractured lava flows, and assemblages of bacterial mats and macrofauna.
* DiffuseFlows

Target:

* SolarPower

## Dataset links:
* [DS1](https://drive.google.com/file/d/1-Pcpb1xWpKc8Cgs-P7xqBFHw2NM0dBsA/view?usp=sharing)
* [DS2](https://drive.google.com/file/d/1-Pul07w6LXpm-uo99qbNc86FHhwl4yQD/view?usp=sharing)

In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

## Read the datasets

In [2]:
df1 = pd.read_csv('datasets/power_consumption_g3.csv')
df2 = pd.read_csv('datasets/power_consumption_g3_feat.csv')

In [3]:
df1

Unnamed: 0,Date,SolarPower
0,2017-06-22 11:50:00,35818.80795
1,2017-04-24 22:50:00,34628.20237
2,2017-11-05 09:00:00,22781.53846
3,2017-10-19 23:20:00,31925.77681
4,2017-03-25 17:10:00,30246.12766
...,...,...
52411,2017-02-08 16:10:00,31808.13559
52412,2017-07-04 17:40:00,35816.61130
52413,2017-07-02 17:00:00,31185.64784
52414,2017-08-02 16:40:00,39463.35183


In [4]:
df2

Unnamed: 0,Date,Temperature,Humidity,WindSpeed,GeneralDiffuseFlows,DiffuseFlows
0,2017-03-01 16:40:00,21.33,55.91,0.080,387.400,427.300
1,2017-07-27 06:30:00,23.10,48.58,4.908,10.450,8.630
2,2017-10-11 19:00:00,23.10,59.82,0.084,0.446,0.322
3,2017-02-10 06:50:00,12.25,80.80,4.916,0.051,0.111
4,2017-03-06 16:00:00,15.62,59.38,0.075,533.400,579.900
...,...,...,...,...,...,...
52411,2017-05-14 02:20:00,23.58,43.10,0.075,0.110,0.122
52412,2017-11-17 19:20:00,17.30,76.50,0.075,0.040,0.148
52413,2017-03-21 12:10:00,17.90,50.28,0.081,837.000,296.700
52414,2017-07-28 05:10:00,25.23,61.32,4.907,0.091,0.119


## Merge the datasets (and pre-processing if needed)

In [6]:
data = pd.merge(df1, df2, on='Date')
data['Date'] = pd.to_datetime(data['Date'])

In [7]:
data

Unnamed: 0,Date,SolarPower,Temperature,Humidity,WindSpeed,GeneralDiffuseFlows,DiffuseFlows
0,2017-06-22 11:50:00,35818.80795,27.43,33.46,4.924,879.000,48.180
1,2017-04-24 22:50:00,34628.20237,16.93,76.10,0.082,0.015,0.141
2,2017-11-05 09:00:00,22781.53846,17.13,88.70,0.073,180.100,171.200
3,2017-10-19 23:20:00,31925.77681,20.00,86.00,4.920,0.055,0.100
4,2017-03-25 17:10:00,30246.12766,17.18,43.83,0.086,480.600,485.400
...,...,...,...,...,...,...,...
52411,2017-02-08 16:10:00,31808.13559,11.20,74.90,4.915,45.870,45.430
52412,2017-07-04 17:40:00,35816.61130,26.41,74.00,4.921,437.200,218.400
52413,2017-07-02 17:00:00,31185.64784,27.01,70.50,4.924,538.500,199.700
52414,2017-08-02 16:40:00,39463.35183,26.59,84.70,4.908,356.800,261.300
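
The two files list their rows in different orders, so the merge relies on `Date` being a unique key in each of them. A minimal sketch of how that assumption could be checked, using small made-up frames (`left`, `right` and their values are illustrative stand-ins, not data from the files):

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (values are made up for illustration).
left = pd.DataFrame({
    "Date": ["2017-01-01 00:10:00", "2017-01-01 00:00:00"],
    "SolarPower": [2.0, 1.0],
})
right = pd.DataFrame({
    "Date": ["2017-01-01 00:00:00", "2017-01-01 00:10:00"],
    "Temperature": [5.0, 6.0],
})

# validate="one_to_one" raises pandas.errors.MergeError if Date repeats on either side.
merged = pd.merge(left, right, on="Date", how="inner", validate="one_to_one")

# Parse the timestamps and put the rows in chronological order.
merged["Date"] = pd.to_datetime(merged["Date"])
merged = merged.sort_values("Date").reset_index(drop=True)
```

Sorting is optional before resampling, but it makes the merged frame easier to inspect.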


## Group the datasets into time intervals of 2 hours

In [8]:
data.set_index('Date', inplace=True)

In [9]:
data_resampled = data.resample('2h').mean().dropna()
data_resampled.reset_index(inplace=True)
data_resampled

Unnamed: 0,Date,SolarPower,Temperature,Humidity,WindSpeed,GeneralDiffuseFlows,DiffuseFlows
0,2017-01-01 00:00:00,26927.594937,5.874636,76.154545,0.081917,0.060167,0.105667
1,2017-01-01 02:00:00,21447.088607,5.029333,78.008333,0.082583,0.061417,0.135083
2,2017-01-01 04:00:00,20641.518987,4.919667,74.641667,0.081667,0.061917,0.120833
3,2017-01-01 06:00:00,20094.683545,4.512750,74.575000,0.082417,0.063583,0.122500
4,2017-01-01 08:00:00,21255.189872,4.632167,73.791667,0.082417,79.281917,15.761833
...,...,...,...,...,...,...,...
4359,2017-12-30 14:00:00,29293.789606,14.513333,39.486364,0.077667,409.650000,42.163333
4360,2017-12-30 16:00:00,31262.864386,14.015000,43.236364,0.077500,153.905000,152.368333
4361,2017-12-30 18:00:00,37721.673005,10.112500,60.239091,0.075583,1.618917,1.676750
4362,2017-12-30 20:00:00,36183.523447,8.526667,66.832500,0.080917,0.062917,0.101667
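
To see what `resample('2h').mean()` does, here is a small sketch on made-up data (the frame `toy` and its values are assumptions for illustration): twenty-four 10-minute rows collapse into two 2-hour rows, each labelled with the start of its bin.

```python
import pandas as pd

# 24 ten-minute observations covering 00:00 through 03:50 (made-up values 0..23).
idx = pd.date_range("2017-01-01 00:00", periods=24, freq="10min")
toy = pd.DataFrame({"SolarPower": range(24)}, index=idx)

# Each 2-hour bin holds twelve 10-minute rows and is labelled by its start time.
two_hourly = toy.resample("2h").mean()
```

The mean is one reasonable aggregate here; whether a sum or another statistic suits a given column better is a modelling choice.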


## Create lags

In [21]:
n_lags = 4 

for lag in range(1, n_lags + 1):
    data_resampled[f'SolarPower_lag{lag}'] = data_resampled['SolarPower'].shift(lag)
    
data_lagged = data_resampled.dropna()
data_lagged

Unnamed: 0,Date,SolarPower,Temperature,Humidity,WindSpeed,GeneralDiffuseFlows,DiffuseFlows,SolarPower_lag1,SolarPower_lag2,SolarPower_lag3,SolarPower_lag4
4,2017-01-01 08:00:00,21255.189872,4.632167,73.791667,0.082417,79.281917,15.761833,20094.683545,20641.518987,21447.088607,26927.594937
5,2017-01-01 10:00:00,27986.835442,8.019333,63.835833,2.913333,346.072727,34.108333,21255.189872,20094.683545,20641.518987,21447.088607
6,2017-01-01 12:00:00,30060.759495,15.263333,57.075000,0.076167,486.391667,40.981667,27986.835442,21255.189872,20094.683545,20641.518987
7,2017-01-01 14:00:00,29558.481012,15.662500,56.914167,0.075667,377.458333,48.125000,30060.759495,27986.835442,21255.189872,20094.683545
8,2017-01-01 16:00:00,31576.708860,15.309167,59.112500,0.077250,160.075833,169.773333,29558.481012,30060.759495,27986.835442,21255.189872
...,...,...,...,...,...,...,...,...,...,...,...
4359,2017-12-30 14:00:00,29293.789606,14.513333,39.486364,0.077667,409.650000,42.163333,30490.240812,29649.683142,23720.152091,21307.984791
4360,2017-12-30 16:00:00,31262.864386,14.015000,43.236364,0.077500,153.905000,152.368333,29293.789606,30490.240812,29649.683142,23720.152091
4361,2017-12-30 18:00:00,37721.673005,10.112500,60.239091,0.075583,1.618917,1.676750,31262.864386,29293.789606,30490.240812,29649.683142
4362,2017-12-30 20:00:00,36183.523447,8.526667,66.832500,0.080917,0.062917,0.101667,37721.673005,31262.864386,29293.789606,30490.240812
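
The cell above lags only `SolarPower`; the weather columns still hold values from the interval being predicted. If the forecast should rely purely on the 4 previous intervals, the other columns can be lagged in the same way. A minimal sketch, assuming a hypothetical helper `add_lags` and made-up values:

```python
import pandas as pd

def add_lags(frame, columns, n_lags):
    """Return a copy of `frame` with `<name>_lag1..n_lags` columns appended (hypothetical helper)."""
    out = frame.copy()
    for col in columns:
        for lag in range(1, n_lags + 1):
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    # The first n_lags rows have no complete history, so they are dropped.
    return out.dropna().reset_index(drop=True)

# Made-up, chronologically ordered toy data.
toy = pd.DataFrame({
    "SolarPower": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    "Temperature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
lagged = add_lags(toy, ["SolarPower", "Temperature"], n_lags=4)

# Use only the lagged columns as model inputs.
lag_features = [c for c in lagged.columns if "_lag" in c]
```

The rows must already be in chronological order for `shift` to produce meaningful lags, which is the case after resampling.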


## Split the dataset into 80% training and 20% testing datasets

In [37]:
features = ['Temperature', 'Humidity', 'WindSpeed', 'GeneralDiffuseFlows', 'DiffuseFlows'] + [f'SolarPower_lag{lag}' for lag in range(1, n_lags+1)]

target = 'SolarPower'

In [38]:
x = data_lagged[features]
y = data_lagged[target]

In [39]:
y

4       21255.189872
5       27986.835442
6       30060.759495
7       29558.481012
8       31576.708860
            ...     
4359    29293.789606
4360    31262.864386
4361    37721.673005
4362    36183.523447
4363    32050.697084
Name: SolarPower, Length: 4360, dtype: float64

In [41]:
x

Unnamed: 0,Temperature,Humidity,WindSpeed,GeneralDiffuseFlows,DiffuseFlows,SolarPower_lag1,SolarPower_lag2,SolarPower_lag3,SolarPower_lag4
4,4.632167,73.791667,0.082417,79.281917,15.761833,20094.683545,20641.518987,21447.088607,26927.594937
5,8.019333,63.835833,2.913333,346.072727,34.108333,21255.189872,20094.683545,20641.518987,21447.088607
6,15.263333,57.075000,0.076167,486.391667,40.981667,27986.835442,21255.189872,20094.683545,20641.518987
7,15.662500,56.914167,0.075667,377.458333,48.125000,30060.759495,27986.835442,21255.189872,20094.683545
8,15.309167,59.112500,0.077250,160.075833,169.773333,29558.481012,30060.759495,27986.835442,21255.189872
...,...,...,...,...,...,...,...,...,...
4359,14.513333,39.486364,0.077667,409.650000,42.163333,30490.240812,29649.683142,23720.152091,21307.984791
4360,14.015000,43.236364,0.077500,153.905000,152.368333,29293.789606,30490.240812,29649.683142,23720.152091
4361,10.112500,60.239091,0.075583,1.618917,1.676750,31262.864386,29293.789606,30490.240812,29649.683142
4362,8.526667,66.832500,0.080917,0.062917,0.101667,37721.673005,31262.864386,29293.789606,30490.240812


In [42]:
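# Note: train_test_split shuffles the rows by default (shuffle=True), so this is a random split, not a chronological one
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)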
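
For time-series data a chronological split is often preferred, so that the model is tested only on intervals that come after the ones it was trained on. A minimal sketch on toy arrays (`x_toy` and `y_toy` are made-up stand-ins for `x` and `y`); the metrics reported in this notebook come from the shuffled split and would likely differ under this variant:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for time-ordered rows: the values double as time indices.
x_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

# shuffle=False keeps the original order, so the test set is the final 20% of rows.
x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, test_size=0.2, shuffle=False)
```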

## Create the model, pre-process the data and make it suitable for training

In [43]:
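# Base regressor; the grid search clones it for every parameter combination
model = XGBRegressor()

# Small grid (2 x 2 x 2 = 8 candidates) to keep the search fast
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [3, 5],
    'learning_rate': [0.01, 0.1],
}

# scoring='neg_mean_squared_error': higher (closer to zero) is better
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=1)

grid_search.fit(x_train, y_train)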

Fitting 5 folds for each of 8 candidates, totalling 40 fits


## Perform hyper-parameter optimization with 5-fold cross-validation

Important: Because of time constraints, use only a few values for each hyper-parameter.

KEEP IN MIND THE DATASET IS TIME-SERIES.
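
One way to respect the time ordering during cross-validation is scikit-learn's `TimeSeriesSplit`, in which every validation fold lies strictly after its training fold. A minimal sketch on twelve made-up, time-ordered samples (`x_toy` is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered toy samples.
x_toy = np.arange(12).reshape(-1, 1)

# Five expanding-window folds: the training part grows, the validation part moves forward.
tscv = TimeSeriesSplit(n_splits=5)
folds = [(train_idx.tolist(), val_idx.tolist()) for train_idx, val_idx in tscv.split(x_toy)]
```

Such a splitter can be passed to `GridSearchCV` as `cv=tscv` in place of `cv=5`, provided the training rows are in chronological order.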

## Fit the model with the best parameters on the training dataset

## Calculate the adequate metrics on the testing dataset

In [45]:
# With the default refit=True, best_estimator_ is already refit on the whole training set
best_model = grid_search.best_estimator_

y_pred = best_model.predict(x_test)

In [51]:
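# A notebook cell displays only the value of its last expression, so the output below is the R^2 score
mean_squared_error(y_test, y_pred)
mean_absolute_error(y_test, y_pred)
r2_score(y_test, y_pred)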

0.9619661769744317
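
To report every metric rather than only the last one, the values can be collected and printed explicitly. A minimal sketch with made-up arrays (`y_true` and `y_hat` are illustrative stand-ins for `y_test` and `y_pred`, not results from this notebook):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up targets and predictions, only to show the calls.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 6.0])

mse = mean_squared_error(y_true, y_hat)
metrics = {
    "MSE": mse,
    "RMSE": float(np.sqrt(mse)),  # same units as the target
    "MAE": mean_absolute_error(y_true, y_hat),
    "R2": r2_score(y_true, y_hat),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```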

## Visualize the targets against the predictions
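
A minimal sketch of such a plot, assuming made-up arrays `actual` and `predicted` in place of `y_test` and `y_pred` (the values are illustrative only):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-ins for y_test and y_pred.
actual = np.array([21.0, 28.0, 30.0, 29.5, 31.5])
predicted = np.array([22.0, 27.0, 30.5, 29.0, 32.0])

# Draw both series on the same axes so deviations are easy to spot.
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(actual, label="Actual")
ax.plot(predicted, label="Predicted", linestyle="--")
ax.set_xlabel("Test sample")
ax.set_ylabel("SolarPower")
ax.set_title("Actual vs. predicted SolarPower")
ax.legend()
fig.tight_layout()
```

With a shuffled test set the x-axis is just the sample position; with a chronological split it can be replaced by the interval timestamps.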