![](https://images.pexels.com/photos/532192/pexels-photo-532192.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1)

# Wind power forecasting
Wind energy is the kinetic energy of the wind converted into useful energy by wind turbines. This renewable source is widely used as an alternative to fossil fuels: it is clean, does not emit greenhouse gases during operation, and can be deployed in many locations. It still carries some environmental and social problems, such as soil compaction and the noise emitted by the blades. In addition, wind generation fluctuates strongly with wind conditions, which opens the door to Machine Learning models for generation forecasting. This project aims to forecast the wind power generation of a wind turbine located in Germany, using historical data from 2011 to the end of 2021.

# Data dictionary (columns)
- dt: Timestamp of each observation (15-minute time step).
- MW: Wind power generated (MW).
   
# References
- [kaggle dataset](https://www.kaggle.com/datasets/l3llff/wind-power)


## 1) Importing libraries and loading data

In [None]:
# Importing Libraries:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly as pl
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_selection import r_regression
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

import tensorflow as tf
# Import all Keras layers from tf.keras to avoid mixing it with the standalone keras package:
from tensorflow.keras.layers import LSTM, Conv1D, MaxPooling1D
from sklearn.metrics import mean_squared_error

import os
from datetime import datetime as dt

In [None]:
# Loading dataset
df = pd.read_csv("data.csv")

In [None]:
# Copy of the dataset:
dado_horarios = df.copy()

## 2) Exploratory Data Analysis

In [None]:
# Let's see the first five rows:
dado_horarios.head()

In [None]:
# Some information about data type and memory:
dado_horarios.info()

In [None]:
# Changing the type of the Time series column:
dado_horarios['dt'] = pd.to_datetime(dado_horarios['dt'])

In [None]:
# Dt column as index:
dado_horarios.set_index('dt', inplace=True)

In [None]:
# Some descriptive statistics:
dado_horarios.describe()

In [None]:
# Missing data:
dado_horarios.isna().sum()

As we can see above, there is no missing data.
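
Note that `isna()` only detects missing values, not missing timestamps. A minimal sketch of how gaps in the 15-minute grid could be checked follows. It uses a small hypothetical series (`toy`) rather than the real dataset, so all names and values here are illustrative assumptions:

```python
import pandas as pd

# Hypothetical toy series on a 15-minute grid with one timestamp removed.
idx = pd.date_range("2021-01-01 00:00", periods=8, freq="15min")
toy = pd.DataFrame({"MW": range(8)}, index=idx).drop(idx[3])

# Build the complete expected grid and list the timestamps that are absent.
expected = pd.date_range(toy.index.min(), toy.index.max(), freq="15min")
missing_timestamps = expected.difference(toy.index)

print(len(missing_timestamps))
print(missing_timestamps[0])
```

The same check could be applied to `dado_horarios.index` once the `dt` column has been converted to datetime and set as the index.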

In [None]:
# Aggregating information:
dado_horarios['month'] = dado_horarios.index.month
dado_horarios['year'] = dado_horarios.index.year

group = dado_horarios.groupby(["year", "month"]).mean().reset_index()

In [None]:
# Plotting the monthly average Wind Power by year:
fig = px.line(data_frame=group, x='month', y='MW', color='year', title="Monthly Average Wind power generation by year")
fig.show()

Observations:
- Compared with 2011, the average power generated in 2021 is much higher in every month.


In [None]:
# Histograms of the monthly averages for each year:
fig, ax = plt.subplots(ncols=4, nrows=3, sharex=False, sharey=False, figsize=(25, 20))

# One panel per year, filled row by row:
for axis, year in zip(ax.flat, range(2011, 2022)):
    sns.histplot(x=group[group['year'] == year]['MW'], kde=True, ax=axis)
    axis.set_title("Monthly average Wind power distribution of {}".format(year))

# The 12th panel is unused:
ax[2, 3].set_visible(False)

plt.show()

Observations:
- These distributions do not seem to follow a normal distribution.

In [None]:
# Setting the time series column as index:
df.set_index('dt', inplace=True)

## 3) Preprocessing

The class below provides a method that transforms a time series into a tabular dataset of lagged features and future targets, so that it can be used for supervised learning.
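
To make the idea concrete, here is a minimal sketch of the same sliding-window transformation on a tiny series. It uses NumPy's `sliding_window_view` as an alternative to an explicit loop; the series, the window size (3) and the horizon (2) are hypothetical values chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series; n_in (window) and n_out (horizon) are illustrative.
series = np.arange(1, 9, dtype=float)  # 1, 2, ..., 8
n_in, n_out = 3, 2

# Each row holds n_in past values followed by n_out future values.
windows = np.lib.stride_tricks.sliding_window_view(series, n_in + n_out)

# Same column naming scheme as the notebook: lags first, then targets.
columns = ["var(t - {})".format(i) for i in range(n_in, 0, -1)] + \
          ["var(t)" if i == 0 else "var(t + {})".format(i) for i in range(n_out)]
toy_supervised = pd.DataFrame(windows, columns=columns)

print(toy_supervised)
```

A series of length 8 with a combined width of 5 yields 4 rows, each shifted by one timestep relative to the previous one.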

In [None]:
# Preprocessing class
class Preprocessamento:

    def timeseries_to_supervised(self, df, n_features, n_target):
        n_linhas = 0
        colunas_features = ['var(t - {})'.format(str(i)) for i in range(n_features, -1, -1) if i != 0]
        colunas_target = ['var(t)' if i==0 else 'var(t + {})'.format(str(i)) for i in range(0, n_target)]
        colunas_total = colunas_features + colunas_target
        lista=[]
        
        while n_linhas <= len(df) - n_target - n_features:
            quantidade_de_features_iteracao = df.iloc[n_linhas:n_linhas + n_features].values
            quantidade_de_target_iteracao = df.iloc[n_linhas + n_features: n_linhas + n_features + n_target].values
            
            linha = np.concatenate([quantidade_de_features_iteracao, quantidade_de_target_iteracao], axis=0)
            linha_reshape = linha.reshape(1, -1)
            lista.append(linha_reshape[0])
    
            n_linhas += 1
        df_iter = pd.DataFrame(lista, columns=colunas_total)

        return df_iter


In [None]:
# Function that computes and prints several metrics:
def metricas(X_test, y_test, models):
    for name, model in models.items():
        if name == 'LSTM':
            y_pred = []
            for i in range(len(X_test)):
                X_test_linha = X_test[i, 0:]
                X_test_reshaped = X_test_linha.reshape(1, 1, len(X_test_linha))
                predicoes = model.predict(X_test_reshaped, batch_size=1, verbose=0)
                retorno = [x for x in predicoes[0]][0]
                y_pred.append(retorno)
            y_pred = np.array(y_pred)
        else:
            y_pred = model.predict(X_test)
        
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        coef_pearson = r_regression(y_pred.reshape(-1, 1), y_test)[0]
        print(f'Mean squared error: {mse}')
        print(f'Root Mean squared error: {rmse}')
        print(f'Coef de pearson: {coef_pearson}')
        print('###########################################\n')

        # Returns after the first model, so `models` is expected to hold a single entry:
        return mse, rmse, coef_pearson

In [None]:
# Splitting into train and test datasets:
percentagem_treino_inicial = 0.8

limite_treino_inicial = int(len(df)*percentagem_treino_inicial)
df_train_inicial = df.iloc[0:limite_treino_inicial]
df_test = df.iloc[limite_treino_inicial:]

In [None]:
# Splitting into training and validation datasets:
percentagem_treino = 0.8

limite_treino = int(len(df_train_inicial)*percentagem_treino)
df_train = df_train_inicial.iloc[0:limite_treino]
df_val = df_train_inicial.iloc[limite_treino:]

In [None]:
# Instance of the Preprocessing class:
prep_obj = Preprocessamento()

We will use a window of 20 and a horizon of 10. In other words, the 20 most recent observations are used as features to predict the next 10 timesteps. Furthermore, we will fit one model per timestep ahead, so each algorithm yields 10 models.
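
The "one model per timestep" approach can be sketched as follows. This is only an illustration under stated assumptions: the data is random, and `LinearRegression` is a stand-in for the algorithms used later in the notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 20 lagged features, 10 future steps.
X_toy = rng.normal(size=(200, 20))
Y_toy = rng.normal(size=(200, 10))

# Fit one independent model per forecast step (column of Y_toy).
models_per_horizon = [
    LinearRegression().fit(X_toy, Y_toy[:, h]) for h in range(Y_toy.shape[1])
]

# Forecast all steps for new rows by stacking each model's prediction.
forecast = np.column_stack([m.predict(X_toy[:5]) for m in models_per_horizon])
print(forecast.shape)
```

Each model only ever sees its own target column, so errors at one horizon do not propagate to the others, at the cost of training 10 models per algorithm.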

In [None]:
# Transforming the time series into a supervised problem:
n_features = 20
n_target = 10

df_train_supervised = prep_obj.timeseries_to_supervised(df_train, n_features, n_target)
df_val_supervised = prep_obj.timeseries_to_supervised(df_val, n_features, n_target)
df_test_supervised = prep_obj.timeseries_to_supervised(df_test, n_features, n_target)

In [None]:
# Some descriptive statistics about the features:
df_train_supervised.describe()

Observations:
- All of the features have approximately the same mean and standard deviation.

In [None]:
# Splitting the train, validation and test sets into features and targets:
X_train, y_train = df_train_supervised.values[:, 0:n_features], df_train_supervised.values[:, n_features:]
X_val, y_val = df_val_supervised.values[:, 0:n_features], df_val_supervised.values[:, n_features:]
X_test, y_test = df_test_supervised.values[:, 0:n_features], df_test_supervised.values[:, n_features:]

In [None]:
# Features shape:
print('Feature shapes:')
print(f'Training: {X_train.shape}')
print(f'Validation: {X_val.shape}')
print(f'Test: {X_test.shape}')
print('#########################\n')

# Target shape:
print('Target shapes:')
print(f'Training: {y_train.shape}')
print(f'Validation: {y_val.shape}')
print(f'Test: {y_test.shape}')


### 3.1) Data transformation

#### 3.1.1) Standard Scale

Definition:
- The Standard Scaler is a technique that rescales the distribution of a variable so that the mean of the observed sample is 0 and the standard deviation is 1. It is particularly useful for algorithms that rely on distance measures, such as K-means and K-nearest neighbors (KNN). Additionally, it is a recommended choice for algorithms based on neural networks.

OBS: Standard Scaler can perform slightly worse than other transformations when the data is far from normally distributed, since the mean and standard deviation are sensitive to skew and outliers. However, you can still standardize your data.

Mathematical definition:

$X_{new_{i}} = \frac{X_{i} - \mu}{\sigma}$

- $\mu:$ Mean of the sample.
- $\sigma:$ Standard Deviation of the sample.
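
As a quick check of the formula, the sketch below standardizes a small hypothetical column by hand and compares the result with `StandardScaler`. The array `x` is an illustrative assumption, not data from this project:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature column.
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual standardization with the population standard deviation (ddof=0),
# which is the estimator StandardScaler uses.
manual = (x - x.mean(axis=0)) / x.std(axis=0)
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
```

In the notebook the scaler is fitted on the training features only and then applied to the validation and test sets, which keeps information from those sets out of the estimated mean and standard deviation.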

In [None]:
# Fitting a Standard Scaler object:
std_scaler = StandardScaler()
std_scaler.fit(X_train)

# Transforming all the sets:
X_train_std = std_scaler.transform(X_train) 
X_val_std = std_scaler.transform(X_val)
X_test_std = std_scaler.transform(X_test)

## 4) Fitting models

In [None]:
# Create nested directories: 
def make_directory(path):
    try:
        os.makedirs(path)
    except FileExistsError:
        print("Directory already exists!")

In [None]:
# Directory name:
directory = dt.now().strftime("%Y-%m-%d__%H_%M_%S")

### 4.1) Multilayer Neural Network

In [None]:
# Function that builds a simple neural network architecture:
def mlp_simples(device):

    with tf.device(device):
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(100, activation='relu', input_shape=X_train.shape[1:]),
            tf.keras.layers.Dense(50, activation='relu'),
            tf.keras.layers.Dense(25, activation='relu'),
            tf.keras.layers.Dense(10, activation='relu'),
            tf.keras.layers.Dense(1)
        ])
        model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mse'])
        
    return model

In [None]:
# Making a directory for MLP models:
path_modelos_mlp_simples = "modelos_mlp_simples/{}".format(directory)
make_directory(path_modelos_mlp_simples)

# Training an MLP for each horizon:
for target in range(n_target):
    
    # EarlyStopping callback:
    earlystopping = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
    # save_best_only keeps the file with the lowest val_loss instead of the last epoch:
    checkpoint_mlp = tf.keras.callbacks.ModelCheckpoint('{}/model{}_mlp.h5'.format(path_modelos_mlp_simples, target), save_best_only=True)
    model_mlp = mlp_simples('/device:GPU:0')
    history_2 = model_mlp.fit(X_train_std, y_train[:, target], epochs=60, 
                    validation_data=(X_val_std, y_val[:, target]), callbacks=[earlystopping, checkpoint_mlp])

In [None]:
# Metrics for each horizon for MLP:
resultados_mlp = []
path = "modelos_mlp_simples/{}".format(directory)

for target, i in enumerate(sorted(os.listdir(path))):  # sorted so that file order matches horizon order
    path_temp = path + "/" + str(i)
    modelo_carregado = tf.keras.models.load_model(path_temp)
    mse, rmse, coef_p = metricas(X_test_std, y_test[:, target], {'mlp':modelo_carregado})
    resultados_mlp.append([mse, rmse, coef_p])

### 4.2) Random Forest

In [None]:
# Training Random Forest models for each horizon:
resultados_rnd = []
for target in range(n_target):
    rnd_model = RandomForestRegressor(random_state=42)
    rnd_model.fit(X_train_std, y_train[:, target])
    print(f'Random Forest - {target}:')
    mse, rmse, coef_p = metricas(X_test_std, y_test[:, target], {'rnd':rnd_model})
    resultados_rnd.append([mse, rmse, coef_p])


### 4.3) XGBoost

In [None]:
# Training XGBoost models for each horizon:
resultados_xgb = []
for target in range(n_target):
    xgb_model = XGBRegressor(random_state=42)
    xgb_model.fit(X_train_std, y_train[:, target])

    print(f'XGBoost - {target}:')
    mse, rmse, coef_p = metricas(X_test_std, y_test[:, target], {'xgb':xgb_model})
    resultados_xgb.append([mse, rmse, coef_p])

### 4.4) Long Short-Term Memory Neural Network (LSTM)

In [None]:
# Transforming the features into a 3D Matrix:
X_train_lstm = X_train_std.reshape(X_train_std.shape[0], 1, X_train_std.shape[1])
X_val_lstm = X_val_std.reshape(X_val_std.shape[0], 1, X_val_std.shape[1])

In [None]:
# LSTM's architecture:
def model_LSTM(device):
    with tf.device(device):
        model = tf.keras.Sequential()
        model.add(LSTM(1, batch_input_shape=(1, X_train_lstm.shape[1], X_train_lstm.shape[2]), stateful=True))
        model.add(tf.keras.layers.Dense(1))  # single output: each model predicts one horizon
        model.compile(loss='mean_squared_error', optimizer='adam')

    return model

earlystop = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)

In [None]:
# Training an LSTM for each horizon:
path = "modelos_lstm/{}".format(directory)
make_directory(path)

for target in range(n_target):
    
    # EarlyStopping callback:
    earlystopping = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
    checkpoint = tf.keras.callbacks.ModelCheckpoint("{}/lstm_model{}.h5".format(path, target))
    modelo_lstm = model_LSTM('/device:GPU:0')

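    # Note: each fit() call runs a single epoch, and EarlyStopping resets its
    # patience counter at the start of every call, so it never stops training early here.
    for i in range(120):
        history = modelo_lstm.fit(X_train_lstm, y_train[:, target], epochs=1, batch_size=1, verbose=1, shuffle=False, 
        validation_data=(X_val_lstm, y_val[:, target]), validation_batch_size=1, callbacks=[earlystopping, checkpoint])
        modelo_lstm.reset_states()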

In [None]:
# Metrics for the LSTM:
resultados_lstm = []
path = "modelos_lstm/{}".format(directory)

for target, i in enumerate(sorted(os.listdir(path))):  # sorted so that file order matches horizon order
    path_temp = path + "/" + str(i)
    modelo_carregado = tf.keras.models.load_model(path_temp)
    
    print(f'LSTM - {target}:')
    mse, rmse, coef_p = metricas(X_test_std, y_test[:, target], {'LSTM':modelo_carregado})
    resultados_lstm.append([mse, rmse, coef_p])

### 4.5) Convolutional 1D Neural Network

In [None]:
# Reshaping variables to fit as Input for a CNN 1D:
X_train_conv = np.array(X_train_std).reshape(X_train_std.shape[0], X_train_std.shape[1], 1)
X_val_conv = np.array(X_val_std).reshape(X_val_std.shape[0], X_val_std.shape[1], 1)
X_test_conv = np.array(X_test_std).reshape(X_test_std.shape[0], X_test_std.shape[1], 1)

In [None]:
# CNN 1D:
def timeseries_model_conv(device):
    with tf.device(device):
        model = tf.keras.Sequential()
        model.add(Conv1D(filters=64, kernel_size=7, activation="relu", padding="same", input_shape=(X_train_conv.shape[1], 1)))
        model.add(MaxPooling1D(pool_size=2))
        model.add(Conv1D(filters=128, kernel_size=3, activation="relu", padding="same"))
        model.add(MaxPooling1D(pool_size=2))
        model.add(Conv1D(filters=256, kernel_size=3, activation="relu", padding="same"))
        model.add(MaxPooling1D(pool_size=2))
        model.add(tf.keras.layers.Flatten())
        model.add(tf.keras.layers.Dense(128, activation="relu"))
        model.add(tf.keras.layers.Dropout(0.5))
        model.add(tf.keras.layers.Dense(64, activation="relu"))
        model.add(tf.keras.layers.Dense(1))
        model.compile(optimizer="adam", loss="mse", metrics=["mse"])
        return model

In [None]:
# Training a 1D CNN for each horizon:
path = "modelos_cnn/{}".format(directory)
make_directory(path)

for target in range(n_target):
    earlystop = tf.keras.callbacks.EarlyStopping(patience=10)
    checkpoint = tf.keras.callbacks.ModelCheckpoint("{}/cnn_model{}.h5".format(path, target))
    model_conv = timeseries_model_conv("/device:GPU:0")

    history_conv = model_conv.fit(X_train_conv, y_train[:, target], validation_data=(X_val_conv, y_val[:, target]), 
                  epochs=120, callbacks=[earlystop, checkpoint])

In [None]:
# Metrics for the 1D CNN:
resultados_cnn = []
path = 'modelos_cnn/{}'.format(directory)

# sorted so that file order matches horizon order; `target` indexes the matching column of y_test:
for target, model in enumerate(sorted(os.listdir(path))):
    path_temp = path + "/" + model
    modelo_carregado = tf.keras.models.load_model(path_temp)
    
    print(f'CNN 1D - {target}:')
    mse, rmse, coef_p = metricas(X_test_conv, y_test[:, target], {'CNN':modelo_carregado})
    resultados_cnn.append([mse, rmse, coef_p])

## 5) Results

In [None]:
# Plotting the result figures:
resultados_mlp = np.array(resultados_mlp)
resultados_rnd = np.array(resultados_rnd)
resultados_xgb = np.array(resultados_xgb)
resultados_lstm = np.array(resultados_lstm)
resultados_cnn = np.array(resultados_cnn)
metricas_iterar = ['MSE', 'RMSE', 'COEF_PEARSON']

# Creating a directory to store all the images:
path = 'imagens/{}'.format(directory)
make_directory(path)

for index, nome in enumerate(metricas_iterar):
    resultados_mlp_reshaped = resultados_mlp[:, index].reshape(-1, 1)
    resultados_rnd_reshaped = resultados_rnd[:, index].reshape(-1, 1)
    resultados_xgb_reshaped = resultados_xgb[:, index].reshape(-1, 1)
    resultados_lstm_reshaped = resultados_lstm[:, index].reshape(-1, 1)
    resultados_cnn_reshaped = resultados_cnn[:, index].reshape(-1, 1)

    array_resultados_mse = np.concatenate([resultados_mlp_reshaped, resultados_rnd_reshaped, 
                                          resultados_xgb_reshaped, resultados_lstm_reshaped,
                                          resultados_cnn_reshaped], axis=1)

    df_resultados = pd.DataFrame(array_resultados_mse, columns=['MLP', 'RANDOM_FOREST', 'XGBOOST', 'LSTM', 'CNN']).reset_index(names='Horizontes')
    df_resultados_melted = df_resultados.melt(id_vars='Horizontes', value_name=nome, var_name='Modelos')

    fig_resultado = px.line(df_resultados_melted, x='Horizontes', y=nome, hover_data=['Modelos'], color='Modelos',
    title='{} - {} Horizons'.format(nome, n_target))
    
    # Save into the directory created above, using portable path separators:
    pl.io.write_image(fig=fig_resultado, file=os.path.join(path, '{}.jpg'.format(nome)), width=1000, height=500)