Team name: Bsahtek le ML

Students: 
- TESTARD Arthur: 105022
- VERDON Valentin: 106002
- VERDIER Nahel: 105247

# TDT4173 - Machine Learning
## Final Notebook of project

## Table of contents

0. [Imports and data initialization](#0-imports-and-data-initialization)
1. [Data analysis](#1-data-analysis)
    1. [Global analysis](#11-global-analysis)
    2. [NaN study](#12-nan-study)
    3. [Study of the target](#13-study-of-the-target)
    4. [Inputs comparison](#14-inputs-comparison)
    5. [Constant columns study](#15-constant-columns-study)
    6. [Correlation study](#16-correlation-study)
    <!-- 1. [Sub paragraph](#2-research-leads) -->
2. [Research leads](#2-research-leads)
    1. [Signal analysis](#21-signal-analysis)
3. [Preprocessing](#3-preprocessing)
    1. [Management of Nan values](#31-management-of-nan-values)
    2. [Constant columns management](#32-constant-columns-management)
    3. [Normalization](#33-normalization)
    4. [Feature engineering](#34-feature-engineering)
    5. [Management of constant parts of pv_measurement on B and C](#35-management-of-constant-parts-of-pv_measurement-on-b-and-c)
    6. [Interpolation of the output values](#36-interpolation-of-the-output-values)
    7. [Input reshaping](#37-input-reshaping)
    8. [Pre-processing function](#38-preprocessing-function)
4. [Pre-processing ideas that didn't work](#4-pre-processing-ideas-that-didnt-work)
    1. [Polynomial features](#41-polynomial-features)
    2. [Mean of features](#42-mean-of-features)
5. [Separation of training and test sets](#5-separation-of-training-and-test-sets)
    1. [Split training and test sets on estimated set dates](#51-split-training-and-test-sets-on-estimated-set-dates)
    2. [Split training and test sets on prediction dates range and estimated set](#52-split-training-and-test-sets-on-prediction-dates-range-and-estimated-set)
6. [Models](#6-models)
    1. [Models performance comparison](#61-models-performance-comparison)
    2. [XGBoost hyper-parameters](#62-xgboost-hyper-parameters)
    3. [Hyper-parameters selection](#63-post-processing)
    4. [Some other models which did not succeed](#64-some-other-models-which-did-not-succeed)
7. [Areas for improvement](#7-areas-for-improvement)
8. [Conclusion](#conclusion)
<!-- 7. [Another paragraph](#part7) -->

## 0. Imports and data initialization

In [None]:
import os
import sys
sys.path.append('../')

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
import seaborn as sns
from tqdm import tqdm
import scipy.signal  # explicit submodule import, needed for scipy.signal.find_peaks
import pickle
from datetime import datetime, timedelta
from functools import reduce

from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics

from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, learning_curve, train_test_split

import xgboost as xgb
from xgboost import plot_importance, plot_tree

import tensorflow as tf
import autokeras as ak

In [None]:
class DataFolder:
    def __init__(self, folder_name: str):
        self.folder_name: str = folder_name
        self.X_test_estimated: str = f"{folder_name}/X_test_estimated.parquet"
        self.X_train_estimated: str = f"{folder_name}/X_train_estimated.parquet"
        self.X_train_observed: str = f"{folder_name}/X_train_observed.parquet"
        self.train_targets: str | None = f"{folder_name}/train_targets.parquet"

A = DataFolder(folder_name='A')
B = DataFolder(folder_name='B')
C = DataFolder(folder_name='C')

In [None]:
def read_files(diff_path: str = '../'):
    train_a = pd.read_parquet(diff_path + A.train_targets)
    train_b = pd.read_parquet(diff_path + B.train_targets)
    train_c = pd.read_parquet(diff_path + C.train_targets)

    X_train_estimated_a = pd.read_parquet(diff_path + A.X_train_estimated)
    X_train_estimated_b = pd.read_parquet(diff_path + B.X_train_estimated)
    X_train_estimated_c = pd.read_parquet(diff_path + C.X_train_estimated)

    X_train_observed_a = pd.read_parquet(diff_path + A.X_train_observed)
    X_train_observed_b = pd.read_parquet(diff_path + B.X_train_observed)
    X_train_observed_c = pd.read_parquet(diff_path + C.X_train_observed)

    X_test_estimated_a = pd.read_parquet(diff_path + A.X_test_estimated)
    X_test_estimated_b = pd.read_parquet(diff_path + B.X_test_estimated)
    X_test_estimated_c = pd.read_parquet(diff_path + C.X_test_estimated)

    return train_a, train_b, train_c, X_train_estimated_a, X_train_estimated_b, X_train_estimated_c, X_train_observed_a, X_train_observed_b, X_train_observed_c, X_test_estimated_a, X_test_estimated_b, X_test_estimated_c

## 1. Data analysis

### 1.1 Global analysis

In [None]:
train_a, train_b, train_c, X_train_estimated_a, X_train_estimated_b, X_train_estimated_c, X_train_observed_a, X_train_observed_b, X_train_observed_c, X_test_estimated_a, X_test_estimated_b, X_test_estimated_c = read_files()

In this section, we get to grips with the data and begin to understand its structure and content.

In [None]:
print("Files dimensions:")
print(f"train_a: {train_a.shape}, train_b: {train_b.shape}, train_c: {train_c.shape}")
print(f"X_train_estimated_a: {X_train_estimated_a.shape}, X_train_estimated_b: {X_train_estimated_b.shape}, X_train_estimated_c: {X_train_estimated_c.shape}")
print(f"X_train_observed_a: {X_train_observed_a.shape}, X_train_observed_b: {X_train_observed_b.shape}, X_train_observed_c: {X_train_observed_c.shape}")
print(f"X_test_estimated_a: {X_test_estimated_a.shape}, X_test_estimated_b: {X_test_estimated_b.shape}, X_test_estimated_c: {X_test_estimated_c.shape}")

We observe that `X_train_estimated` and `X_test_estimated` have one more column than `X_train_observed`, namely `date_calc`.
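
As a quick illustration of how such a column difference can be checked, here is a minimal sketch on toy frames. The frames `observed` and `estimated` and their values are hypothetical stand-ins, not the project data.

```python
import pandas as pd

# Hypothetical toy frames standing in for the observed and estimated feature sets.
observed = pd.DataFrame({
    "date_forecast": pd.to_datetime(["2021-01-01"]),
    "direct_rad:W": [0.0],
})
estimated = pd.DataFrame({
    "date_calc": pd.to_datetime(["2020-12-31"]),
    "date_forecast": pd.to_datetime(["2021-01-01"]),
    "direct_rad:W": [0.0],
})

# Columns present in the estimated frame but absent from the observed one.
extra_columns = estimated.columns.difference(observed.columns).tolist()
print(extra_columns)
```

The same `columns.difference` call could be applied to the real frames to confirm which columns differ.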

Look at the variable types:

In [None]:
types = []
for frame in (X_train_estimated_a, X_train_estimated_b, X_train_estimated_c):
    for dtype in frame.dtypes.values:  # `dtype` avoids shadowing the built-in `type`
        if dtype not in types:
            types.append(dtype)

print("Our data only contains these types:", types, "(datetime and float)")

We look at the start and end dates of the files:

In [None]:
print("X_train_observed_a:", X_train_observed_a['date_forecast'].min(), X_train_observed_a['date_forecast'].max())
print("X_train_estimated_a:", X_train_estimated_a['date_forecast'].min(), X_train_estimated_a['date_forecast'].max())
print("X_train_observed_b:",X_train_observed_b['date_forecast'].min(), X_train_observed_b['date_forecast'].max())
print("X_train_estimated_b:", X_train_estimated_b['date_forecast'].min(), X_train_estimated_b['date_forecast'].max())
print("X_train_observed_c:", X_train_observed_c['date_forecast'].min(), X_train_observed_c['date_forecast'].max())
print("X_train_estimated_c:", X_train_estimated_c['date_forecast'].min(), X_train_estimated_c['date_forecast'].max())

### 1.2 NaN study

Here, we look at how the missing data is distributed so that we can deal with it later. The cell below gives the proportion of missing values in each column of the DataFrame, sorted from highest to lowest, as a fraction of the total number of rows.

We only look at `X_train_estimated_a` as an example, to save some time.

In [None]:
X_train_estimated_a.isna().sum().sort_values(ascending=False)/(len(X_train_estimated_a))

We notice that the columns with NaN values are `snow_density`, `ceiling_height_agl` and `cloud_base_agl`. `snow_density` is about 90% empty, while the others are 5 to 20% empty.

### 1.3. Study of the target

We study the target to see how it evolves over time, and whether it carries information that our model will need to understand and use to make its predictions.

We plot the target for the 3 locations:

In [None]:
# Show the variable pv_measurement as a function of time
fig, axs = plt.subplots(3, 1, figsize=(30, 20), sharex=True)

train_a[['time','pv_measurement']].set_index('time').plot(ax=axs[0], title='pv_measurement on location A')
train_b[['time','pv_measurement']].set_index('time').plot(ax=axs[1], title='pv_measurement on location B')
train_c[['time','pv_measurement']].set_index('time').plot(ax=axs[2], title='pv_measurement on location C')

Our data shows periodicities. In section 2.1 (Signal analysis), we identify 3 main periods: the year, the day and 12 hours.

Our variable also has gaps or constant parts in certain areas that need to be explored in detail.

However, our variable varies a lot according to the weather, hence the need for a Machine Learning model.

In [None]:
train_a = train_a.rename(columns={'time': 'date_forecast'})
train_b = train_b.rename(columns={'time': 'date_forecast'})
train_c = train_c.rename(columns={'time': 'date_forecast'})

merged_df_a = pd.merge(X_train_observed_a, train_a, on='date_forecast', how='inner')
merged_df_b = pd.merge(X_train_observed_b, train_b, on='date_forecast', how='inner')
merged_df_c = pd.merge(X_train_observed_c, train_c, on='date_forecast', how='inner')

# Add prefixes to column names during concatenation
merged_df_a.columns = 'a_' + merged_df_a.columns
merged_df_b.columns = 'b_' + merged_df_b.columns
merged_df_c.columns = 'c_' + merged_df_c.columns

merged_df_a = merged_df_a.rename(columns={'a_date_forecast': 'date_forecast'})
merged_df_b = merged_df_b.rename(columns={'b_date_forecast': 'date_forecast'})
merged_df_c = merged_df_c.rename(columns={'c_date_forecast': 'date_forecast'})

# Concatenate the DataFrames
aux =  pd.merge(merged_df_a, merged_df_b, on='date_forecast', how='inner')
total = pd.merge(aux, merged_df_c, on='date_forecast', how='inner')

print('correlation between A and B of pv_measurement:', total['a_pv_measurement'].corr(total['b_pv_measurement']))
print('correlation between A and C of pv_measurement:', total['a_pv_measurement'].corr(total['c_pv_measurement']))
print('correlation between B and C of pv_measurement:', total['b_pv_measurement'].corr(total['c_pv_measurement']))


### 1.4. Inputs comparison

|Sun-related variables|Snow-related variables|Atmospheric pressure variables|Rain variables|Wind variables|Cloud-related variables|Visibility and altitude variables|Variables related to diffuse and direct light|Day/night and shade variable|
|--- |:-: |:-: |:-: |:-: |:-: |:-: |:-:   |--:   |
|`clear_sky_energy_1h:J`|`fresh_snow_12h:cm`|`msl_pressure:hPa`|`precip_5min:mm`|`wind_speed_10m:ms`|`effective_cloud_cover:p`|`ceiling_height_agl:m`|`diffuse_rad:W`|`is_day:idx`|
|`clear_sky_rad:W`|`fresh_snow_1h:cm`|`pressure_100m:hPa`|`rain_water:kgm2`|`wind_speed_u_10m:ms`|`total_cloud_cover:p`|`elevation:m`|`diffuse_rad_1h:J`|`is_in_shadow:idx`|
|`sun_azimuth:d`|`fresh_snow_24h:cm`|`pressure_50m:hPa`|`prob_rime:p`|`wind_speed_v_10m:ms`|`relative_humidity_1000hPa:p`|`visibility:m`|`direct_rad:W`||
|`sun_elevation:d`|`fresh_snow_3h:cm`|`sfc_pressure:hPa`|`precip_type_5min:idx`||`absolute_humidity_2m:gm3`||`direct_rad_1h:J`||
||`fresh_snow_6h:cm`|`t_1000hPa:K`|`dew_or_rime`||`air_density_2m:kgm3`||||
||`snow_density:kgm3`|`wind_speed_w_1000hPa:ms`|`dew_point_2m:K`||`cloud_base_agl:m`||||
||`snow_depth:cm`||||||||
||`snow_drift:idx`||||||||
||`snow_melt_10min:mm`||||||||
||`snow_water:kgm2`||||||||
||`super_cooled_liquid_water:kgm2`||||||||

In [None]:
scaler_a = MinMaxScaler()
scaler_b = MinMaxScaler()
scaler_c = MinMaxScaler()

X_a_to_analyse = X_train_observed_a.drop(columns='date_forecast', inplace=False)
X_b_to_analyse = X_train_observed_b.drop(columns='date_forecast', inplace=False)
X_c_to_analyse = X_train_observed_c.drop(columns='date_forecast', inplace=False)

columns_match_already_done = []
X_a_normed = scaler_a.fit_transform(X_a_to_analyse)
X_b_normed = scaler_b.fit_transform(X_b_to_analyse)
X_c_normed = scaler_c.fit_transform(X_c_to_analyse)

Here we plot pairs of columns that appear to be strongly linked (that is, we could write a relation such as $f(x_1) = x_2$, with $f$ a usual function or a combination of usual functions).

For each of these pairs, the link is fairly obvious. For example, sun-related data, such as its energy or its position in the sky, are strongly linked.

In [None]:
columns = list(X_a_to_analyse.columns.values)
columns_to_plot = [ 
    ("dew_point_2m:K","absolute_humidity_2m:gm3"),
    ("msl_pressure:hPa","pressure_100m:hPa"),
    ("msl_pressure:hPa","sfc_pressure:hPa"),
    ("pressure_100m:hPa","sfc_pressure:hPa"),
    ("t_1000hPa:K", "absolute_humidity_2m:gm3"),
    ("dew_point_2m:K", "t_1000hPa:K"),
    ("diffuse_rad:W", "sun_azimuth:d"),
    ("diffuse_rad_1h:J","sun_azimuth:d"),
    ("direct_rad:W", "direct_rad_1h:J"),
    ("direct_rad:W", "sun_azimuth:d"),
    ("direct_rad_1h:J", "sun_azimuth:d"),
    ("sun_elevation:d","sun_azimuth:d"),
]

for c1, c2 in columns_to_plot:
    c1_index = columns.index(c1)
    c2_index = columns.index(c2)
    # x-axis holds c1 and y-axis holds c2, to match the axis labels and the title below
    plt.scatter(X_a_normed[:, c1_index], X_a_normed[:, c2_index], alpha=1, label='A position')
    plt.scatter(X_b_normed[:, c1_index], X_b_normed[:, c2_index], alpha=0.1, label='B position')
    plt.scatter(X_c_normed[:, c1_index], X_c_normed[:, c2_index], alpha=0.01, label='C position')
    plt.xlabel(c1)
    plt.ylabel(c2)
    plt.grid()
    plt.legend()
    plt.title(f"{c2} = f({c1})")
    plt.show()
    columns_match_already_done.append((c1, c2))
    columns_match_already_done.append((c2, c1))

### 1.5. Constant columns study

Here we determine which columns are constant, and therefore useless for our future model.

In [None]:
train_a, train_b, train_c, X_train_estimated_a, X_train_estimated_b, X_train_estimated_c, X_train_observed_a, X_train_observed_b, X_train_observed_c, X_test_estimated_a, X_test_estimated_b, X_test_estimated_c = read_files()

print(X_train_observed_a.std()[X_train_observed_a.std() == 0].keys(), X_train_observed_b.std()[X_train_observed_b.std() == 0].keys(), X_train_observed_c.std()[X_train_observed_c.std() == 0].keys())
print(X_train_estimated_a.std()[X_train_estimated_a.std() == 0].keys(), X_train_estimated_b.std()[X_train_estimated_b.std() == 0].keys(), X_train_estimated_c.std()[X_train_estimated_c.std() == 0].keys())
print(X_test_estimated_a.std()[X_test_estimated_a.std() == 0].keys(), X_test_estimated_b.std()[X_test_estimated_b.std() == 0].keys(), X_test_estimated_c.std()[X_test_estimated_c.std() == 0].keys())

We can say three things about this result. First, `elevation:m` acts as a parameter that tells us the position of the solar panel. Second, our weather estimates do not predict `wind_speed_w_1000hPa:ms`, so we need to take this into account in our model. Finally, we could adjust our model with regard to the snow parameters, because they are constant in the estimated test sets.
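
An alternative way to list constant columns is to count distinct values instead of testing for a zero standard deviation. The sketch below is only an illustration: the helper `constant_columns` and the toy values are hypothetical. Unlike a `std() == 0` test, this variant also flags columns that are entirely NaN.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns holding at most one distinct non-NaN value."""
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]

# Hypothetical toy frame: two constant columns and one varying column.
toy = pd.DataFrame({
    "elevation:m": [6.0, 6.0, 6.0],
    "snow_drift:idx": [0.0, 0.0, 0.0],
    "direct_rad:W": [0.0, 12.5, 30.1],
})
print(constant_columns(toy))
```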

### 1.6. Correlation study

With the help of the correlation matrices, we can begin to study the relationships between the different variables in order to potentially distinguish groups, or other features of our dataset.

In [None]:
def build_corr_matrix(data_frames, figsize=(20,20), annot=True):
    plt.figure(figsize=figsize)
    sns.heatmap(data_frames.corr(method='pearson'), annot=annot, cmap=plt.cm.Reds)
    
keys = [1, 2, 3]

frames_train_estimated = [X_train_estimated_a.drop(columns=["date_calc"]), X_train_observed_a, X_test_estimated_a.drop(columns=["date_calc"])]
X_frames_a = pd.concat(frames_train_estimated, keys=keys)
X_frames_a.reset_index(level=0, inplace=True, names='frame_type')

train_a = train_a.rename(columns={'time': 'date_forecast'})
X_y_a = X_frames_a.merge(train_a, on='date_forecast', how='inner')
X_y_a.reset_index(drop=True, inplace=True)

build_corr_matrix(X_y_a, figsize=(50,50))


The variables most correlated with `pv_measurement` are:

*   `clear_sky_energy_1h:J`: clear sky energy of previous time period, available up to 24h [J/m2]
*   `clear_sky_rad:W`: clear sky radiation flux [W/m2]
*   `diffuse_rad:W`: diffuse radiation flux [W/m2]
*   `diffuse_rad_1h:J`: accumulated diffuse radiation of previous time period, available up to 24h [J/m2]
*   `direct_rad:W`: direct radiation flux [W/m2]
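
Rather than reading such a ranking off the heatmap, it can be extracted from the correlation matrix directly. The following is a minimal sketch on synthetic data (the values and the linear relation are assumptions made up for the example), ranking features by absolute Pearson correlation with the target.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends strongly on the radiation and not on the wind.
rng = np.random.default_rng(0)
n = 500
direct_rad = rng.uniform(0, 800, n)
toy = pd.DataFrame({
    "direct_rad:W": direct_rad,
    "wind_speed_10m:ms": rng.normal(3, 1, n),
    "pv_measurement": 5 * direct_rad + rng.normal(0, 50, n),
})

# Absolute Pearson correlation of every feature with the target, strongest first.
ranking = (
    toy.corr(method="pearson")["pv_measurement"]
    .drop("pv_measurement")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```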

In [None]:
frames_train_estimated = [X_train_estimated_b.drop(columns=["date_calc"]), X_train_observed_b, X_test_estimated_b.drop(columns=["date_calc"])]
X_frames_b = pd.concat(frames_train_estimated, keys=keys)
X_frames_b.reset_index(level=0, inplace=True, names='frame_type')

train_b = train_b.rename(columns={'time': 'date_forecast'})
X_y_b = X_frames_b.merge(train_b.dropna(), on='date_forecast', how='inner')
X_y_b.reset_index(drop=True, inplace=True)

frames_train_estimated = [X_train_estimated_c.drop(columns=["date_calc"]), X_train_observed_c, X_test_estimated_c.drop(columns=["date_calc"])]
X_frames_c = pd.concat(frames_train_estimated, keys=keys)
X_frames_c.reset_index(level=0, inplace=True, names='frame_type')

train_c = train_c.rename(columns={'time': 'date_forecast'})
X_y_c = X_frames_c.merge(train_c.dropna(), on='date_forecast', how='inner')
X_y_c.reset_index(drop=True, inplace=True)

frames_location= [X_y_a, X_y_b, X_y_c]
X_y = pd.concat(frames_location, keys=keys)
X_y.reset_index(level=0, inplace=True, names='location')

build_corr_matrix(X_y, figsize=(50,50))


In [None]:
frames_train_estimated = [X_train_estimated_a, X_train_estimated_b, X_train_estimated_c]
X_train_estimated = pd.concat(frames_train_estimated, keys=keys)
X_train_estimated.reset_index(level=0, inplace=True, names='location')

frames_train_observed = [X_train_observed_a, X_train_observed_b, X_train_observed_c]
X_train_observed = pd.concat(frames_train_observed, keys=keys)
X_train_observed.reset_index(level=0, inplace=True, names='location')

frames_test_estimated = [X_test_estimated_a, X_test_estimated_b, X_test_estimated_c]
X_test_estimated = pd.concat(frames_test_estimated, keys=keys)
X_test_estimated.reset_index(level=0, inplace=True, names='location')

X_train_estimated_merged = pd.concat(frames_train_estimated, axis=1)
X_train_estimated_merged.reset_index(drop=True, inplace=True)

X_train_observed_merged = pd.concat(frames_train_observed, axis=1)
X_train_observed_merged.reset_index(drop=True, inplace=True)

X_test_estimated_merged = pd.concat(frames_test_estimated, axis=1)
X_test_estimated_merged.reset_index(drop=True, inplace=True)

build_corr_matrix(X_train_estimated_merged, figsize=(100,100), annot=True)


## 2. Research leads

### 2.1. Signal analysis

Because our data showed some periodicities, one of our first ideas was to analyse the different signals we have, starting with our target, `pv_measurement`. However, as the following plot shows, the data is not completely clean, especially on B and C. We will come back to this point in the Preprocessing part.

In [None]:
train_a, train_b, train_c, X_train_estimated_a, X_train_estimated_b, X_train_estimated_c, X_train_observed_a, X_train_observed_b, X_train_observed_c, X_test_estimated_a, X_test_estimated_b, X_test_estimated_c = read_files()

def filter_dates_when_constants(df, date_c = 'time', y = 'pv_measurement', delta = { 'days': 3 }):
    df = df.copy()
    mask_y_change = df[y] != df[y].shift(1)

    start_date = None
    end_date = None

    constant_periods = []

    for index, row in df.iterrows():
        if not mask_y_change[index]:
            if start_date is None:
                start_date = row[date_c]
            end_date = row[date_c]
        else:
            if start_date is not None and (end_date - start_date) >= pd.Timedelta(**delta):
                constant_periods.append((start_date, end_date))
            start_date = None
            end_date = None

    if start_date is not None and (end_date - start_date) >= pd.Timedelta(**delta):
        constant_periods.append((start_date, end_date))
    return constant_periods

def delete_date_range_from_df(df, dates, date_c = 'time'):
    df = df.copy()
    c = 0
    for start_date, end_date in dates:
        mask = (df[date_c] >= start_date) & (df[date_c] < end_date)
        df = df[~mask]
    df.reset_index(drop=True, inplace=True)
    return df

delta = { 'hours': 12 * 4}
train_a = delete_date_range_from_df(train_a, filter_dates_when_constants(train_a, delta=delta))
train_b = delete_date_range_from_df(train_b, filter_dates_when_constants(train_b, delta=delta))
train_c = delete_date_range_from_df(train_c, filter_dates_when_constants(train_c, delta=delta))

fig, axs = plt.subplots(3, 1, figsize=(30, 20), sharex=True)

train_a[['time','pv_measurement']].set_index('time').plot(ax=axs[0], title='pv_measurement on location A')
train_b[['time','pv_measurement']].set_index('time').plot(ax=axs[1], title='pv_measurement on location B')
train_c[['time','pv_measurement']].set_index('time').plot(ax=axs[2], title='pv_measurement on location C')

As mentioned, we then tried to analyse this signal with basic tools such as Fourier analysis.

In [None]:
def get_fft_transforms(train):
    y = train["pv_measurement"].dropna().values
    time_diff = train["time"].diff().mean().total_seconds()
    sampling_rate = 1 / time_diff

    n = len(y)
    freq = np.fft.fftfreq(n, 1 / sampling_rate)
    fft_y = np.fft.fft(y)
    amp_fft_y = np.abs(fft_y)
    phase = np.angle(fft_y)
    return freq, fft_y, amp_fft_y, phase, sampling_rate

trains = { 'A': train_a.dropna(subset='pv_measurement'), 'B': train_b.dropna(subset='pv_measurement'), 'C': train_c.dropna(subset='pv_measurement') }
locations = trains.keys()

freqs, fft_ys, amp_fft_ys, phases, sampling_rates = [
    { 
        loc: get_fft_transforms(trains[loc])[k] for loc in locations 
    } 
    for k in range(5)]

plt.figure(figsize=(30, 20))
k = 1
for loc in locations:
    plt.subplot(3, 1, k)
    plt.plot(freqs[loc][:len(freqs[loc])//2], amp_fft_ys[loc][:len(freqs[loc])//2])
    plt.title(f"Spectrum of the pv_measurement at {loc} location")
    plt.xlabel("Frequency (Hz)")
    plt.ylabel("Amplitude")
    plt.grid()
    k += 1
plt.show()

plt.figure(figsize=(30, 20))
k = 1
for loc in locations:
    plt.subplot(3, 1, k)
    plt.plot(freqs[loc][:len(freqs[loc])//2], 20 * np.log10(amp_fft_ys[loc][:len(freqs[loc])//2]))
    plt.title(f"Spectrum in magnitude of the pv_measurement at {loc} location")
    plt.xlabel("Frequency (Hz)")
    plt.ylabel("Magnitude (dB)")
    plt.grid()
    k += 1
plt.show()

We then looked at the most important frequencies in these spectra.

In [None]:
def print_peak_frequencies(amp_fft_y, freq, threshold, loc):
    peaks, _ = scipy.signal.find_peaks(amp_fft_y[:len(amp_fft_y)//2], height=threshold)
    peak_frequencies = freq[:len(freq)//2][peaks]

    # Length of the first peak's period in samples (period in seconds times samples per second)
    period_size = int(sampling_rates[loc] / peak_frequencies[0])
    continuous_component = np.mean(trains[loc]["pv_measurement"].dropna().values[:period_size])

    print("Location:", loc)
    print(f'Most important periods (in days): \n{1 / peak_frequencies / 3600 / 24}')
    print(f'Value of the continuous component: {continuous_component}\n\n')

thresholds = { 'A': .5e7, 'B': .5e6, 'C': .25e6 }
for loc in locations:
    print_peak_frequencies(amp_fft_ys[loc], freqs[loc], thresholds[loc], loc)

First comments:

We know from the analysis of the NaN values that A has the cleanest data in terms of `pv_measurement` values, so our analysis is mostly based on what we see on A. We notice 3 main frequencies: one for the year, one for the day and one for a half-day (12 hours). Looking more closely at the frequency plot, we also notice a smaller peak, which our threshold prevents us from reading in the last print. It seems to correspond to a period of 8 hours, according to the code cell below.

Because B and C are less clean, we suppose that the large differences with A come from the NaN values: they leave gaps in these frames, which are compensated by higher frequency values. However, we did not pay much attention to this at first, because most of our analysis was based on the A data.
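
To make the peak-reading step concrete, here is a minimal sketch on a synthetic hourly signal. The amplitudes, the 60-day duration and the 10% height threshold are assumptions chosen for the example; the signal contains only a 24-hour and a 12-hour component, so those are the two periods we expect to recover.

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic hourly signal over 60 days with a 24 h and a 12 h component.
dt = 3600.0                      # sampling interval in seconds
n = 24 * 60                      # number of hourly samples
t = np.arange(n) * dt
y = 10 * np.sin(2 * np.pi * t / (24 * 3600)) + 4 * np.sin(2 * np.pi * t / (12 * 3600))

# One-sided spectrum: frequencies in Hz and their amplitudes.
freq = np.fft.rfftfreq(n, d=dt)
amp = np.abs(np.fft.rfft(y))

# Keep the peaks that reach at least 10% of the strongest amplitude.
peaks, _ = find_peaks(amp, height=0.1 * amp.max())
periods_h = 1 / freq[peaks] / 3600   # peak periods in hours
print(periods_h)
```

On real data the peaks are broader and noisier, so the height threshold has to be tuned per location, as done with `thresholds` above.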

In [None]:
# Strongest peak between 3e-5 Hz and 4e-5 Hz, searched on the positive half of the spectrum
half = len(freqs['A']) // 2
pos_freqs = freqs['A'][:half]
pos_amps = amp_fft_ys['A'][:half]  # compare magnitudes, not complex FFT values
band = (pos_freqs > .00003) & (pos_freqs < .00004)
1 / pos_freqs[band][np.argmax(pos_amps[band])] / 3600  # period in hours

We can confirm what we said about B and C compared to A by looking at the sampling rates for each location. In theory, the sampling interval should be close to one hour ($=3600$ seconds), because our values are measured every hour. But if we look at `1 / sampling_rates['B']` and `1 / sampling_rates['C']`, we see that it is larger than that for the B and C locations. This comes from the NaN values and confirms our point above.

We can also notice that `1 / sampling_rates['A']` is slightly larger than an hour. We can explain it by the one-week gap between `X_train_observed_a` and `X_train_estimated_a`, which also exists in `train_a`.
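
The effect of missing rows on the mean sampling interval can be reproduced on toy timestamps. This is only a sketch: the 48-hour range and the 10-hour hole are made-up values.

```python
import pandas as pd

# 48 regular hourly timestamps.
full = pd.Series(pd.date_range("2021-01-01", periods=48, freq=pd.Timedelta(hours=1)))

# Remove a 10-hour block to mimic rows dropped because of NaN values.
gappy = full.drop(index=range(10, 20)).reset_index(drop=True)

# Mean time step in hours, computed the same way as in get_fft_transforms.
mean_step_full = full.diff().mean().total_seconds() / 3600
mean_step_gappy = gappy.diff().mean().total_seconds() / 3600
print(mean_step_full, mean_step_gappy)
```

The total span is unchanged but it is divided among fewer steps, so the mean step grows as soon as rows are missing.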

In [None]:
1 / sampling_rates['A'] / 3600, 1 / sampling_rates['B'] / 3600, 1 / sampling_rates['C'] / 3600

The first thing we can try is to reconstruct the signal with the inverse Fourier transform.

In [None]:
i_fft_ys = { loc: np.fft.ifft(fft_ys[loc]) for loc in locations }

plt.figure(figsize=(30, 20))
k = 1
for loc in locations:
    plt.subplot(3, 1, k)
    plt.plot(trains[loc]['pv_measurement'], label='real pv_measurement')
    plt.plot(np.real(i_fft_ys[loc]), label='ifft pv_measurement')  # imaginary part is numerical noise
    plt.title(f"pv_measurement and its inverse FFT at {loc} location")
    plt.xlabel("Sample index")
    plt.ylabel("pv_measurement")
    plt.grid()
    plt.legend()
    k += 1
plt.show()

We notice a gap in C, but it is in fact due to the index.

Now the idea is to keep only the most important frequencies in order to have a model which can be written like this:
$$y[n] = \hat{y}[n] + r[n]$$

where $n$ is the index of the output, $y[n]$ is the real value of `pv_measurement` at index $n$ (or time $t$), $\hat{y}[n]$ is the value at index $n$ of the filtered signal predicted by signal analysis, and $r[n]$ is the value at index $n$ of the noise, mostly created by the weather and described by our inputs `X_train_estimated`, `X_train_observed`, etc. This residual would be modelled by a machine learning model. We did not have the time to test this idea entirely, because of a lack of time and the prioritization of our goals. It is therefore not fully designed, but we detail how far we got.
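
The decomposition $y[n] = \hat{y}[n] + r[n]$ can be sketched on a synthetic signal. This is an illustration under stated assumptions, not the procedure used in the notebook: the helper `keep_strongest_frequencies`, the synthetic daily signal and the choice of keeping the 2 strongest bins are all hypothetical.

```python
import numpy as np

def keep_strongest_frequencies(y: np.ndarray, k: int) -> np.ndarray:
    """Return y_hat, the signal rebuilt from its k strongest one-sided FFT bins."""
    spectrum = np.fft.rfft(y)
    strongest = np.argsort(np.abs(spectrum))[-k:]   # indices of the k largest magnitudes
    filtered = np.zeros_like(spectrum)
    filtered[strongest] = spectrum[strongest]
    return np.fft.irfft(filtered, n=len(y))

# Synthetic hourly signal over 30 days: constant level + daily cycle + noise.
rng = np.random.default_rng(0)
n = 24 * 30
t = np.arange(n)
seasonal = 5 + 3 * np.sin(2 * np.pi * t / 24)
y = seasonal + rng.normal(0, 0.5, n)

y_hat = keep_strongest_frequencies(y, k=2)   # mean level and daily component
r = y - y_hat                                # residual left for a learned model
print(np.std(r))
```

In this toy case the residual `r` is close to the injected noise; in the project it would be the part of the target that the weather features have to explain.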

In [None]:
end_dates = { 'A': '2022-10-21', 'B': '2022-03-15', 'C': '2022-04-01'}
# end_dates = { 'A': X_train_observed_a['date_forecast'].max(), 'B': X_train_observed_b['date_forecast'].max(), 'C': X_train_observed_c['date_forecast'].max() }

start_dates = { 'A': '2020-10-21', 'B': '2020-03-15', 'C': '2020-04-01' }

def get_model(fft_values, threshold=60, sample_rate=1):
    # Keep only the frequency bins whose magnitude exceeds the threshold
    n = len(fft_values)

    frequencies = np.fft.fftfreq(n, 1 / sample_rate)
    # Store magnitudes only: the phase is kept separately so it is not applied twice
    amplitudes = np.abs(fft_values) * (np.abs(fft_values) > threshold)
    phases = np.angle(fft_values)
    return {"frequencies": frequencies, "amplitudes": amplitudes, "phases": phases}

def reconstruct_signal(model, duration, sample_rate):
    frequencies = model["frequencies"]
    amplitudes = model["amplitudes"]
    phases = model["phases"]

    t = np.arange(0, duration, 1)
    signal = np.zeros(len(t), dtype=np.complex128)

    for freq, amp, phase in tqdm(zip(frequencies, amplitudes, phases)):
        if amp == 0:
            continue  # bins removed by the threshold contribute nothing
        # The phase belongs inside the complex exponent: amp * exp(j * (2*pi*f*t + phase))
        signal += amp * np.exp(1j * (2 * np.pi * freq * t / sample_rate + phase))
    return signal / len(frequencies)

def get_thresholds_to_get_n_freq(signal, nb_freq, threshold, step):
    assert step > 0
    fft = np.fft.fft(signal)
    abs_fft = np.abs(fft[:len(fft)//2])

    freqs = [ f for f in abs_fft if f > threshold ]
    threshold += step
    while len(freqs) > nb_freq:
        freqs = [ f for f in abs_fft if f > threshold ]
        threshold += step
    threshold = threshold if len(freqs) > 0 else threshold - step
    return threshold

def get_filtred_signal(signal, nb_freqs, sample_rate, nb_days_to_predict = 0, threshold = 0, scaler = StandardScaler):
    scaler_pred = scaler
    scaler = scaler()
    Y_normed = scaler.fit_transform(np.array(signal['pv_measurement'].dropna()).reshape(-1, 1)).reshape(-1)

    threshold = get_thresholds_to_get_n_freq(signal=Y_normed, nb_freq=nb_freqs, threshold=0, step=.5)
    model = get_model(fft_values=np.fft.fft(Y_normed), threshold=threshold, sample_rate=sample_rate)
    pred_from_model_data = np.real(reconstruct_signal(model, duration=len(model["frequencies"]) + nb_days_to_predict, sample_rate=sample_rate)) 
    scaler_pred = scaler_pred()
    pred_normed = scaler_pred.fit_transform(pred_from_model_data.reshape(-1, 1)).reshape(-1)

    Y_filtred = scaler.inverse_transform(pred_normed.reshape(-1, 1)).reshape(-1)

    # If we want to filter negative values
    Y_filtred[Y_filtred < 0] = 0
    return Y_filtred

nb_freqs = 2
nb_days_to_predict = 0

trains_on_dates = { loc: trains[loc][(trains[loc]["time"] < pd.Timestamp(end_dates[loc])) & (trains[loc]["time"] >= pd.Timestamp(start_dates[loc]))] for loc in locations }
Y_filtred = { loc: get_filtred_signal(trains_on_dates[loc], nb_freqs, sampling_rates[loc], nb_days_to_predict=nb_days_to_predict) for loc in locations }

sp = 1

plt.figure(figsize=(50, 30))
for loc in locations:
    plt.subplot(3, 1, sp)
    plt.plot(Y_filtred[loc], color='b')
    plt.plot(np.array(trains_on_dates[loc]['pv_measurement']), color='orange')
    plt.title(f"Filtered signal on location {loc}")
    sp += 1

We explored different ways to design $\hat{y}[n]$. The first one, shown above, is a raw filter on the whole signal. This method was not very efficient. In our research we found `prophet`, a Python (and R) library that provides a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [None]:
end_dates = { 'A': X_train_estimated_a['date_forecast'].max(), 'B': X_train_estimated_b['date_forecast'].max(), 'C': X_train_estimated_c['date_forecast'].max() }
trains = { loc: trains[loc][trains[loc]['time'] <= end_dates[loc]] for loc in locations}

prophet_scalers = { loc: MinMaxScaler() for loc in locations }

trains_raw_prophet = { loc: trains[loc].dropna(subset='pv_measurement').reset_index().rename(columns={'time': 'ds', 'pv_measurement': 'y'}) for loc in locations }

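# Copy each frame so that scaling does not overwrite the raw values
trains_prophet = { loc: trains_raw_prophet[loc].copy() for loc in locations }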
for loc in locations:
    trains_prophet[loc]['y'] = prophet_scalers[loc].fit_transform(np.array(trains_raw_prophet[loc]['y']).reshape(-1, 1)).reshape(-1)

models_prophet = { loc: Prophet(changepoint_prior_scale=0.05) for loc in locations }
predictions_prophet = {}
forecast = {}

for loc in locations:
    models_prophet[loc].fit(trains_prophet[loc])
    predictions_prophet[loc] = models_prophet[loc].make_future_dataframe(periods=66, freq='h')
    forecast[loc] = models_prophet[loc].predict(predictions_prophet[loc])

sp = 1

plt.figure(figsize=(50, 30))
for loc in locations:
    plt.subplot(3, 1, sp)
    plt.plot(trains_prophet[loc]['y'], color='orange')
    plt.plot(forecast[loc]['yhat'], color='b')
    plt.title(f"Prophet on location {loc}")
    sp += 1

These results, from Prophet and from the filtered signal, are not really satisfying. Then came the idea, inspired by [this paper](https://peerj.com/preprints/3190.pdf), to see what happens if we model one signal for each hour of the day (which would make $24 \times 3 = 72$ models). We therefore first split our signals by hour and plot what we get with our filter and with the Prophet prediction.
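
The per-hour split can be sketched with a `groupby` on the hour of the timestamp. The toy frame below is hypothetical (5 days of a clipped sine standing in for `pv_measurement`); it only shows how one daily-sampled series per hour of the day is obtained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy target: 5 days of hourly values, zero at night, peak at noon.
times = pd.date_range("2021-06-01", periods=24 * 5, freq=pd.Timedelta(hours=1))
hour_of_day = times.hour.to_numpy()
toy = pd.DataFrame({
    "time": times,
    "pv_measurement": np.maximum(0.0, np.sin((hour_of_day - 6) / 12 * np.pi)),
})

# One series per hour of the day, each sampled once per day.
series_by_hour = {
    int(hour): group["pv_measurement"].to_numpy()
    for hour, group in toy.groupby(toy["time"].dt.hour)
}
print(len(series_by_hour), series_by_hour[12])
```

Each of the 24 series per location varies much more slowly than the raw hourly signal, which is what makes a simple filter or a seasonal model easier to fit.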

In [None]:
# Useful values for the following cells
hours = [ f"{h:02d}" for h in range(24) ]

nb_freqs = 1

trains_on_dates = { loc: trains[loc][(trains[loc]["time"] < pd.Timestamp(end_dates[loc])) & (trains[loc]["time"] >= pd.Timestamp(start_dates[loc]))] for loc in locations }
trains_on_hours = { loc: { h: trains_on_dates[loc][trains_on_dates[loc]['time'].dt.strftime('%H:%M:%S').str.endswith(f'{h}:00:00')] for h in hours } for loc in locations }

# MAYBE DROP NA ON B AND C
Y_h_filtred = { loc: { h: get_filtred_signal(trains_on_hours[loc][h], nb_freqs, sampling_rates[loc] * 24, nb_days_to_predict=nb_days_to_predict, scaler=StandardScaler) for h in hours} for loc in locations }

for loc in locations:
    sp = 1
    plt.figure(figsize=(50, 30))
    for h in hours:
        plt.subplot(4, 6, sp)
        plt.plot(np.array(trains_on_hours[loc][h]['pv_measurement']), color='orange')
        plt.plot(Y_h_filtred[loc][h], color='b')
        plt.title(f"Filtered signal at time h={h} for location {loc}")
        sp += 1

def convert_hours_to_days(signal):
    min_len = np.min([len(signal[h]) for h in hours ])
    y_pred = []
    for d in range(min_len):
        for h in hours:
            y_pred.append(signal[h][d])
    return np.array(y_pred)

reconstructed_train = { loc: convert_hours_to_days(Y_h_filtred[loc]) for loc in locations }

sp = 1

plt.figure(figsize=(30, 20))
for loc in locations:
    plt.subplot(3, 1, sp)
    plt.plot(reconstructed_train[loc], color='b', label='reconstructed from signal analysis')
    plt.plot(np.array(trains_on_dates[loc]['pv_measurement']), color='orange', label='real value')
    plt.title(f"Filtered signal reconstruction on location {loc}")
    plt.legend()
    sp += 1

We get a far more satisfying result here. One problem remains for B: we did not figure out why the curve does not go down to 0.

In [None]:
df = { loc: {} for loc in locations }
trains_prophet_on_hours = { loc: { h: trains_prophet[loc][trains_prophet[loc]['ds'].dt.strftime('%H:%M:%S').str.endswith(f'{h}:00:00')] for h in hours } for loc in locations }

for loc in locations:
    for h in hours:
        date_index = [ (pd.Timestamp(start_dates[loc]) + d).strftime("%Y-%m-%d") for d in [ timedelta(days=k) for k in range(len(trains_prophet_on_hours[loc][h])) ] ]
        trains_prophet_on_hours[loc][h] = pd.DataFrame({'ds': date_index, 'y': trains_prophet_on_hours[loc][h]['y']})

models_prophet = { loc: { h: Prophet(changepoint_prior_scale=0.05) for h in hours} for loc in locations }
predictions_prophet = { loc: {} for loc in locations }
forecast = { loc: {} for loc in locations }

for loc in locations:
    for h in hours:
        models_prophet[loc][h].fit(trains_prophet_on_hours[loc][h])
        predictions_prophet[loc][h] = models_prophet[loc][h].make_future_dataframe(periods=0, freq='h')
        forecast[loc][h] = models_prophet[loc][h].predict(predictions_prophet[loc][h])

for loc in locations:
    sp = 1
    plt.figure(figsize=(50, 30))
    for h in hours:
        plt.subplot(4, 6, sp)
        plt.plot(np.array(trains_prophet_on_hours[loc][h].reset_index()['y']), color='orange')
        plt.plot(np.array(forecast[loc][h]['yhat']), color='b')
        plt.title(f"Prophet on location {loc}")
        sp += 1

# Y_to_reconstruct = { loc: { h: prophet_scalers[loc].inverse_transform(np.array(forecast[loc][h]['yhat']).reshape(-1, 1)).reshape(-1) for h in hours } for loc in locations }
Y_to_reconstruct = { loc: { h: np.array(forecast[loc][h]['yhat']) for h in hours } for loc in locations }
reconstructed_train_prophet = { loc: convert_hours_to_days(Y_to_reconstruct[loc]) for loc in locations }

sp = 1

plt.figure(figsize=(30, 20))
for loc in locations:
    plt.subplot(3, 1, sp)
    plt.plot(reconstructed_train_prophet[loc], color='b', label='reconstructed from Prophet')
    plt.plot(np.array(trains_prophet[loc]['y']), color='orange', label='real value')
    plt.title(f"Prophet on location {loc}")
    plt.legend()
    sp += 1

These results are less satisfying than the previous ones. But let's see what happens if we try to predict the noise, $r[n]$, with Prophet.

In [None]:
noise_prophet = { loc: { h: np.array(trains_on_hours[loc][h]['pv_measurement']) - Y_h_filtred[loc][h] for h in hours } for loc in locations }

trains_noise_prophet_on_hours = { loc: {} for loc in locations }

for loc in locations:
    for h in hours:
        date_index = [ (pd.Timestamp(start_dates[loc]) + d).strftime("%Y-%m-%d") for d in [ timedelta(days=k) for k in range(len(noise_prophet[loc][h])) ] ]
        trains_noise_prophet_on_hours[loc][h] = pd.DataFrame({'ds': date_index, 'y': noise_prophet[loc][h]})

models_prophet = { loc: { h: Prophet(changepoint_prior_scale=0.05) for h in hours} for loc in locations }
predictions_prophet = { loc: {} for loc in locations }
forecast = { loc: {} for loc in locations }

for loc in locations:
    for h in hours:
        models_prophet[loc][h].fit(trains_noise_prophet_on_hours[loc][h])
        predictions_prophet[loc][h] = models_prophet[loc][h].make_future_dataframe(periods=0, freq='h')
        forecast[loc][h] = models_prophet[loc][h].predict(predictions_prophet[loc][h])

for loc in locations:
    sp = 1
    plt.figure(figsize=(50, 30))
    for h in hours:
        plt.subplot(4, 6, sp)
        plt.plot(np.array(noise_prophet[loc][h]), color='orange')
        plt.plot(np.array(forecast[loc][h]['yhat']), color='b')
        plt.title(f"Prophet on location {loc}")
        sp += 1

To conclude this part, we did not obtain very successful results with it. The curves obtained were not that bad, but we did not have enough time to merge this approach with the main model. Maybe it was too complicated as a first approach and we should have focused on it later. For now, we consider it a way to improve the current model, which we present in the next parts.

## 3. Preprocessing

During our study, we explored many ways to pre-process our inputs. Some were compatible with each other, some were not. In this part, we present all the pre-processing steps we tried and those we finally used.

### 3.1. Management of NaN values

As we saw earlier, three variables have NaN values: `snow_density:kgm3`, `ceiling_height_agl:m` and `cloud_base_agl:m`.


Since `snow_density:kgm3` is more than 95% empty, it is unusable and we delete it.


As for the other two, their missing values are probably due to measurement errors. We simply perform a **linear interpolation**, which fills each gap along the straight line between the valid values just before and after it (for a single missing value, this is their average).

In [None]:
def gestion_nan(df):
  df_copy = df.copy()
  # Drop the snow density column
  df_copy = df_copy.drop('snow_density:kgm3',axis=1)
  # Approximation of the other two columns
  df_copy['ceiling_height_agl:m'] = df_copy['ceiling_height_agl:m'].interpolate(method='linear', limit_direction='both')
  df_copy['cloud_base_agl:m'] = df_copy['cloud_base_agl:m'].interpolate(method='linear', limit_direction='both')
  return df_copy
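
To make the behaviour of this interpolation concrete, here is a minimal sketch on a toy series (the values are invented for illustration and do not come from the dataset). It shows a single gap, a two-value gap, and a leading gap, which `limit_direction='both'` fills with the nearest valid value.

```python
import numpy as np
import pandas as pd

# Toy column (invented values): a leading gap, a single gap and a two-value gap
toy = pd.Series([np.nan, 100.0, np.nan, 300.0, np.nan, np.nan, 600.0])

# Same call as in gestion_nan
filled = toy.interpolate(method='linear', limit_direction='both')
print(filled.tolist())
```

Interior gaps are placed on the straight line between their neighbours, while the leading gap cannot be interpolated and simply copies the first valid value.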

### 3.2. Constant columns management

As we said in the Data analysis part, we decided to delete, for each location, all the columns that are constant in the estimated test sets.

In [None]:
columns_to_drop_a = X_test_estimated_a.std()[X_test_estimated_a.std() == 0].keys() 
columns_to_drop_b = X_test_estimated_b.std()[X_test_estimated_b.std() == 0].keys() 
columns_to_drop_c = X_test_estimated_c.std()[X_test_estimated_c.std() == 0].keys()

def gestion_constant_columns(df, location):
    columns_to_drop = columns_to_drop_a if location=='A' else columns_to_drop_b if location=='B' else columns_to_drop_c
    return df.drop(columns=columns_to_drop)
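
As an illustration of the selection rule above, the following sketch applies the same zero-standard-deviation test to a small invented frame. The column names and values are placeholders chosen for the example, not an extract of the real test sets.

```python
import pandas as pd

# Invented test frame: two constant columns and one varying column
toy_test = pd.DataFrame({
    'elevation:m': [6.0, 6.0, 6.0],
    'snow_drift:idx': [0.0, 0.0, 0.0],
    'direct_rad:W': [0.0, 120.5, 310.2],
})

# A column is constant when its standard deviation is exactly 0
stds = toy_test.std()
constant_columns = stds[stds == 0].index.tolist()

reduced = toy_test.drop(columns=constant_columns)
print(constant_columns, list(reduced.columns))
```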

### 3.3. Normalization 

Our data have variable ranges that are quite different from one another. So we're going to do some normalization to solve this problem.

To do this, we're going to use different normalizations, and when we train the model we'll see which normalization corresponds best to our dataset.

The scalers we used are `StandardScaler`, `Normalizer` and `MinMaxScaler`, from `sklearn.preprocessing`.

Here's the typical function we'll use to normalize the data.

Note: the function handles both the training and the test set, since the scaler fitted on the training set must be reused as-is on the test set.

In [None]:
def sklearn_z_score_normalize_dataframe(df,return_scaler=False,scaler=None,fit=True):
    """
    Normalizes a DataFrame with a Scikit-Learn scaler (z-score normalization by default).

    Parameters:
    df (pd.DataFrame): The DataFrame to be normalized.
    return_scaler (bool): If True, also return the scaler that was used.
    scaler: An existing scaler to use; if None, a new StandardScaler is created and fitted.
    fit (bool): If True, (re)fit the given scaler on df; if False, only apply it (use this for test sets).

    Returns:
    pd.DataFrame: The normalized DataFrame (and the scaler if return_scaler is True).
    """
    if scaler is None:
      # Create a StandardScaler instance
      scaler = StandardScaler()

      # Fit the scaler on the DataFrame and transform the data
      normalized_data = scaler.fit_transform(df)

    else :
      if fit:
        normalized_data = scaler.fit_transform(df)
      else:
        normalized_data = scaler.transform(df)

    # Create a new DataFrame with the scaled data
    normalized_df = pd.DataFrame(normalized_data, columns=df.columns)

    # Return the scaler if requested
    if return_scaler :
      return normalized_df,scaler
    return normalized_df
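
The point about using the same scaler for both stages can be sketched as follows. This is a stand-alone toy example with invented numbers, calling `StandardScaler` directly rather than the function above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented data: the test set contains a value outside the training range
train = pd.DataFrame({'feature': [0.0, 10.0, 20.0]})
test = pd.DataFrame({'feature': [10.0, 30.0]})

scaler = StandardScaler()
train_norm = scaler.fit_transform(train)  # mean and std are learned on the training set only
test_norm = scaler.transform(test)        # the same mean and std are reused, without re-fitting

print(scaler.mean_, test_norm.ravel())
```

With the function above, the equivalent of the second call is to pass the fitted scaler together with `fit=False`.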

### 3.4. Feature Engineering

Our data is made up of many features.


**Creating Features**:

For the `date_forecast` feature, we saw during the signal processing analysis that it carries events that repeat with given periods (years/days), so we split it into several sub-features to try to obtain relationships with the target.

We therefore extract: *hour, day of the week, quarter (as a proxy for the season), month, year, day of the year and day of the month*.

In order to take into account the difference between *observed* and *estimated* data, we add a feature that represents the delay between the moment the forecast was computed and the moment it is made for. We compute it as the difference between `date_forecast` and `date_calc`.

**Modification of Features**:

When we explore our data, we see that `sun_azimuth:d` is the position of the sun in degrees, so 0 and 360 correspond to the same direction. We therefore transform this feature with a cosine to make it continuous and consistent with what it represents.

In [None]:
extraction = X_train_observed_a['sun_azimuth:d'][0:199]
transfo = extraction.apply(lambda x : np.cos(x/180*np.pi + np.pi))

plt.figure(figsize=(10,5))
plt.plot(X_train_observed_a['date_forecast'][0:199],extraction/180-1,label="old",ls="--")
plt.plot(X_train_observed_a['date_forecast'][0:199],transfo,label="new",ls="--")
plt.plot(train_a['time'][0:49],train_a["pv_measurement"][0:49]/2500-1,label="target",alpha=0.5)
plt.legend()
plt.grid()
plt.title("sun_azimuth modification")
plt.show()
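
The effect of the transformation can also be checked numerically. The sketch below evaluates it on a few hand-picked angles (not taken from the data).

```python
import numpy as np

# Hand-picked azimuth angles in degrees
azimuth_deg = np.array([0.0, 90.0, 180.0, 270.0, 360.0])

# Same transformation as in the notebook: cos(x + pi) = -cos(x)
transformed = np.cos(azimuth_deg / 180 * np.pi + np.pi)
print(np.round(transformed, 3))
```

0° and 360° now map to the same value, which removes the artificial jump. Note that 90° and 270° also collapse onto the same value; a sine component could separate them if that distinction turns out to matter.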

In [None]:
def create_features(df, label):
    # creating features from dateforecast
    df['hour'] = df["date_forecast"].dt.hour
    df['dayofweek'] = df["date_forecast"].dt.dayofweek
    df['quarter'] = df["date_forecast"].dt.quarter
    df['month'] = df["date_forecast"].dt.month
    df['year'] = df["date_forecast"].dt.year
    df['dayofyear'] = df["date_forecast"].dt.dayofyear
    df['dayofmonth'] = df["date_forecast"].dt.day

    # creating features from datecalc
    if "date_calc" in df.columns :
      
      df['date_calc'] = pd.to_datetime(df['date_calc'])
      # rows without date_calc get date_forecast, so that their delay is 0
      df['date_calc'] = df['date_calc'].fillna(df['date_forecast'])
      df['delta_hour'] = df['hour'] - df["date_calc"].dt.hour

      # add 24h when the forecast is for the day after it was computed
      df['correction'] = df['date_forecast'].dt.dayofyear - df['date_calc'].dt.dayofyear
      df.loc[df['correction'] == 1, 'delta_hour'] += 24
      df.drop('correction', axis=1, inplace=True)

      df = df.drop(["date_calc","date_forecast"],axis=1) 
    else :

      # assign the result: drop() is not in-place by default
      df = df.drop(["date_forecast"],axis=1) 
      df['delta_hour'] = 0

    # Modification of sun_azimuth
    df['sun_azimuth:d'] = df['sun_azimuth:d'].apply(lambda x : np.cos(x/180*np.pi + np.pi))

    # return the target if necessary
    if label:
        y = df[label]
        df = df.drop(label,axis=1)
        return df, y
    return df
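
As a possible alternative for the delay feature, and not what `create_features` does, the two timestamps can be subtracted directly. The sketch below uses invented timestamps; it also covers a forecast that crosses a year boundary, a case where the day-of-year difference used above is not equal to 1.

```python
import pandas as pd

# Invented timestamps; the last row has no date_calc, like rows without a forecast computation time
toy = pd.DataFrame({
    'date_calc': pd.to_datetime(['2022-12-31 08:00:00', '2023-01-01 07:00:00', None]),
    'date_forecast': pd.to_datetime(['2023-01-01 02:00:00', '2023-01-01 09:00:00', '2023-01-01 10:00:00']),
})

# Missing computation times fall back to date_forecast, giving a delay of 0
date_calc = toy['date_calc'].fillna(toy['date_forecast'])

# Delay in hours, computed on the full timestamps
toy['delta_hour'] = (toy['date_forecast'] - date_calc).dt.total_seconds() / 3600
print(toy['delta_hour'].tolist())
```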

### 3.5. Management of constant parts of pv_measurement on B and C

If we look closely at the shape of `pv_measurement` on B and C, we can see stretches of several days during which the measurement does not vary at all, whereas it should at least follow the day/night cycle.

We therefore assume that these are erroneous measurements that need to be removed.

In [None]:
def filter_dates_when_constants(df, date_c = 'time', y = 'pv_measurement', delta = { 'days': 3 }):
    df = df.copy()
    mask_y_change = df[y] != df[y].shift(1)

    start_date = None
    end_date = None

    constant_periods = []

    for index, row in df.iterrows():
        if not mask_y_change[index]:
            if start_date is None:
                start_date = row[date_c]
            end_date = row[date_c]
        else:
            if start_date is not None and (end_date - start_date) >= pd.Timedelta(**delta):
                constant_periods.append((start_date, end_date))
            start_date = None
            end_date = None

    if start_date is not None and (end_date - start_date) >= pd.Timedelta(**delta):
        constant_periods.append((start_date, end_date))
    return constant_periods

def delete_date_range_from_df(df, dates, date_c = 'time'):
    # Remove every row whose date falls in one of the [start_date, end_date) ranges
    df = df.copy()
    for start_date, end_date in dates:
        mask = (df[date_c] >= start_date) & (df[date_c] < end_date)
        df = df[~mask]
    df.reset_index(drop=True, inplace=True)
    return df

In [None]:
# minimum duration
delta = { 'hours': 12 * 4}

# list of these
filter_dates_when_constants(train_b, delta=delta)

# Deletion B
train_b_delete = delete_date_range_from_df(train_b, filter_dates_when_constants(train_b, delta=delta))
plt.figure(figsize=(10,5))
plt.scatter(train_b_delete["time"],train_b_delete["pv_measurement"],s=0.2)
plt.grid()
plt.title("B without constant periods")
plt.show()

# Deletion C
train_c_delete = delete_date_range_from_df(train_c, filter_dates_when_constants(train_c, delta=delta))
plt.figure(figsize=(10,5))
plt.scatter(train_c_delete["time"],train_c_delete["pv_measurement"],s=0.2)
plt.grid()
plt.title("C without constant periods")
plt.show()
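
The detection of constant stretches can also be written without an explicit loop. The following is only a sketch of that idea on invented data, with a hypothetical 3-hour threshold. It measures each run from its first sample, whereas `filter_dates_when_constants` starts the period at the second identical sample, so the two are not strictly equivalent.

```python
import pandas as pd

# Invented hourly measurements containing one long constant run (value 5.0)
toy = pd.DataFrame({
    'time': pd.date_range('2021-01-01', periods=8, freq='h'),
    'pv_measurement': [0.0, 5.0, 5.0, 5.0, 5.0, 2.0, 2.0, 0.0],
})

# A new run starts every time the value differs from the previous one
run_id = (toy['pv_measurement'] != toy['pv_measurement'].shift()).cumsum()

# First date, last date and length of every run
runs = toy.groupby(run_id)['time'].agg(['min', 'max', 'count'])

# Keep the runs lasting at least 3 hours (threshold chosen for the example)
long_runs = runs[(runs['max'] - runs['min']) >= pd.Timedelta(hours=3)]
print(long_runs)
```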

### 3.6. Interpolation of the output values

One of our ideas, which worked pretty well, was to interpolate the values of the output. We tried different interpolations, and the linear one looked the best. It is mathematically defined as follows, and the function that does the job is `interpolate_output_values`.

$$y(t + 1/4) = 0.75 * y(t) + 0.25 * y(t+1)$$
$$y(t + 1/2) = 0.5 * y(t) + 0.5 * y(t+1)$$
$$y(t + 3/4) = 0.25 * y(t) + 0.75 * y(t+1)$$
where $t$ is a round hour (such as 12:00:00, 13:00:00, etc.).

In [None]:
train_a, train_b, train_c, X_train_estimated_a, X_train_estimated_b, X_train_estimated_c, X_train_observed_a, X_train_observed_b, X_train_observed_c, X_test_estimated_a, X_test_estimated_b, X_test_estimated_c = read_files()

interpolation_method = 'linear'
def interpolate_output_values(df):
    freq = '15min'  # same as '15T', which is deprecated in recent pandas versions
    df = df.set_index('time').resample(freq).asfreq()
    df['pv_measurement'] = df['pv_measurement'].interpolate(method=interpolation_method)
    df = df.reset_index()
    return df

train_a_interpolated = interpolate_output_values(train_a)
train_b_interpolated = interpolate_output_values(train_b)
train_c_interpolated = interpolate_output_values(train_c)

basic_range = train_a['time'].loc[3000:3023].index
interpolation_range = train_a['time'].loc[4 * 3000: 4 * 3023].index
plt.figure(figsize=(20,10))
plt.plot(train_a['time'][basic_range], train_a['pv_measurement'][basic_range], label='pv_measurement values on A')
plt.scatter(train_a_interpolated['time'][interpolation_range], train_a_interpolated['pv_measurement'][interpolation_range], color='orange', label='quarter-hourly linear-interpolation')
plt.scatter(train_a['time'][basic_range], train_a['pv_measurement'][basic_range], label='pv_measurement values on A')
plt.title('pv_measurement on A and its quarter-hourly linear interpolation')
plt.legend()

This interpolation allows us to create 4 times more data for every location, which stays, to a limited extent, similar to the original curve.
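
The weights in the formulas above can be checked on a toy example. The sketch below uses two invented hourly values and the same resampling steps as `interpolate_output_values`.

```python
import numpy as np
import pandas as pd

# Two invented hourly measurements
hourly = pd.DataFrame({
    'time': pd.to_datetime(['2021-06-01 12:00:00', '2021-06-01 13:00:00']),
    'pv_measurement': [100.0, 200.0],
})

# Insert the quarter-hour timestamps, then fill them linearly
quarterly = hourly.set_index('time').resample('15min').asfreq()
quarterly['pv_measurement'] = quarterly['pv_measurement'].interpolate(method='linear')
print(quarterly['pv_measurement'].tolist())
```

For instance, the value at 12:15 is $0.75 * 100 + 0.25 * 200 = 125$.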

### 3.7. Input reshaping

We also thought about ways to look at the data from a different point of view. One was to create more inputs from the data of the past hour. First, we had to select which of the 4 quarter-hourly rows is the most important to predict the output. We assumed that the solar panels are quite reactive, which means that, for every hour, we have to take into account the values at the current hour. For example, on any day, for an output at 12:00:00, we consider the values of X at 12:00:00. And, because we also want to consider what happened during the last hour, we add, as new columns, the values at 11:15:00, 11:30:00 and 11:45:00. The code that creates such a matrix is `reshape_frame_to_match_output` (we advise against running this function because it takes a long time to reshape the inputs; we recognize that it is not well optimized).

We had to take some other assumptions into account as well. For example, in the estimated test sets, the timeline is not continuous: some days are skipped. So we designed the function so that it duplicates the values of 23:45:00 every day (in order not to get NaN values), and the input at 00:00:00 is repeated 4 times every day. We considered that duplicating the data at this point is not very important, because the solar values are nil at that time (there is no sun during the night, as far as we have observed), so it should not influence the result much.

In [None]:
from datetime import datetime

def reshape_frame_to_match_output(frame_input):
    frame = frame_input[1:]
    groups = [frame[i:i+4] for i in range(0, len(frame), 4)]

    aggregated_groups = []
    
    first_input = pd.DataFrame(frame_input.loc[0])

    group = pd.concat([first_input, first_input, first_input, first_input])
    group_without_nan = group.fillna('')
    new_input = group_without_nan.stack().reset_index(drop=True)
    aggregated_groups.append(new_input)     
    
    for group in tqdm(groups):
        if (len(np.array(group['date_forecast'])) > 3):
            if ((np.array(group['date_forecast']))[3].astype(datetime).time() == datetime(year=1970, month=1, day=1, hour=0, minute=0, second=0).time()):
                group.reset_index(drop=True, inplace=True)
                temp_group = group.loc[3]
                group.loc[3] = group.loc[2]

                group_without_nan = group.fillna('')
                new_input = group_without_nan.stack().reset_index(drop=True)
                aggregated_groups.append(new_input)

                group.loc[0] = temp_group
                group.loc[1] = temp_group
                group.loc[2] = temp_group
                group.loc[3] = temp_group

                group_without_nan = group.fillna('')
                new_input = group_without_nan.stack().reset_index(drop=True)
                aggregated_groups.append(new_input)
            else:
                group_without_nan = group.fillna('')
                new_input = group_without_nan.stack().reset_index(drop=True)
                # if group == groups[0]: print(len(new_input))
                aggregated_groups.append(new_input)
        else:
            group.reset_index(drop=True, inplace=True)
            for k in range(4 - len(group)):
                group = pd.concat([group, pd.DataFrame(group.loc[len(group) - 1]).T], ignore_index=True)
            group_without_nan = group.fillna('')
            new_input = group_without_nan.stack().reset_index(drop=True)
            aggregated_groups.append(new_input)
            
    new_table = pd.concat(aggregated_groups, axis=1, ignore_index=False).T

    columns = []
    for k in range(4):
        [columns.append(f"{c}_{k}") for c in frame.keys()]

    new_table.columns = columns
    new_table.reset_index(drop=True, inplace=True)
    return new_table

The function is called as follows (commented out because it takes too much time to run). We also added code to save these dataframes so that we do not have to run it every time.

In [None]:
# X_train_estimated_a_reshaped = reshape_frame_to_match_output(X_train_estimated_a)
# X_train_observed_a_reshaped = reshape_frame_to_match_output(X_train_observed_a)
# X_test_estimated_a_reshaped = reshape_frame_to_match_output(X_test_estimated_a)

# X_train_estimated_b_reshaped = reshape_frame_to_match_output(X_train_estimated_b)
# X_train_observed_b_reshaped = reshape_frame_to_match_output(X_train_observed_b)
# X_test_estimated_b_reshaped = reshape_frame_to_match_output(X_test_estimated_b)

# X_train_estimated_c_reshaped = reshape_frame_to_match_output(X_train_estimated_c)
# X_train_observed_c_reshaped = reshape_frame_to_match_output(X_train_observed_c)
# X_test_estimated_c_reshaped = reshape_frame_to_match_output(X_test_estimated_c)

# files_names = ['X_train_estimated.parquet', 'X_train_observed.parquet', 'X_test_estimated.parquet']
# folders = [ f"{loc}_reshaped/" for loc in locations ]
# X_reshaped = [ X_train_estimated_a_reshaped, X_train_observed_a_reshaped, X_test_estimated_a_reshaped, X_train_estimated_b_reshaped, X_train_observed_b_reshaped, X_test_estimated_b_reshaped, X_train_estimated_c_reshaped, X_train_observed_c_reshaped, X_test_estimated_c_reshaped]

# for folder in folders:
#     for file in files_names:
#         index_in_X = folders.index(folder) * len(files_names) + files_names.index(file)
#         X_reshaped[index_in_X].replace('', np.nan).to_parquet(folder + file)
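
Since the reshaping above is slow, here is a sketch of how the core step (putting 4 consecutive rows side by side) could be vectorized with NumPy. It is a simplified illustration under strong assumptions: `widen_by_blocks_of_four` is a hypothetical helper, the frame is assumed to be continuous, and the offset of the first row, the special handling of midnight and the padding of incomplete groups done by `reshape_frame_to_match_output` are not reproduced.

```python
import numpy as np
import pandas as pd

def widen_by_blocks_of_four(frame):
    """Hypothetical helper: put each block of 4 consecutive rows side by side."""
    n_blocks = len(frame) // 4
    # Keep only complete blocks, then lay out each block on a single row
    values = frame.to_numpy()[: n_blocks * 4]
    wide = values.reshape(n_blocks, 4 * frame.shape[1])
    # Same naming scheme as above: <column>_<position in the block>
    columns = [f"{c}_{k}" for k in range(4) for c in frame.columns]
    return pd.DataFrame(wide, columns=columns)

# Invented quarter-hourly frame with 8 rows, i.e. 2 blocks
toy = pd.DataFrame({'t': np.arange(8.0), 'rad': np.arange(8.0) * 10})
wide = widen_by_blocks_of_four(toy)
print(wide)
```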

### 3.8. Preprocessing function

In [None]:
def partial_preprocess(A = [X_train_observed_a, X_train_estimated_a], B = [X_train_observed_b, X_train_estimated_b], C = [X_train_observed_c, X_train_estimated_c], drop_columns = True):
    X_total_a = pd.concat(A, ignore_index=True)
    X_total_b = pd.concat(B, ignore_index=True)
    X_total_c = pd.concat(C, ignore_index=True)

    # Each location is merged with its own target
    X_total_a_y = pd.merge(X_total_a, train_a.rename(columns={'time': 'date_forecast'}).dropna(subset='pv_measurement'), on='date_forecast', how='inner')
    X_total_b_y = pd.merge(X_total_b, train_b.rename(columns={'time': 'date_forecast'}).dropna(subset='pv_measurement'), on='date_forecast', how='inner')
    X_total_c_y = pd.merge(X_total_c, train_c.rename(columns={'time': 'date_forecast'}).dropna(subset='pv_measurement'), on='date_forecast', how='inner')
    
    X_total_a_y_nan = gestion_nan(X_total_a_y)
    X_total_b_y_nan = gestion_nan(X_total_b_y)
    X_total_c_y_nan = gestion_nan(X_total_c_y)

    if drop_columns:
        X_total_a_y_nan = gestion_constant_columns(X_total_a_y_nan, 'A')
        X_total_b_y_nan = gestion_constant_columns(X_total_b_y_nan, 'B')
        X_total_c_y_nan = gestion_constant_columns(X_total_c_y_nan, 'C')
    return X_total_a_y_nan, X_total_b_y_nan, X_total_c_y_nan

X_total_a_y_nan, X_total_b_y_nan, X_total_c_y_nan = partial_preprocess()


### 3.9. Concatenate all locations in one

Another pre-processing idea that we tried was to build one model instead of three (one for each location). The following code does this job.

In [None]:
def concat_df_from_differents_locations(dfs):
    dfs_to_concat = { k: df.copy() for (k, df) in dfs.items() }
    for key, df in dfs_to_concat.items():
        for loc in locations:
            df[f'location_{loc}'] = 0
        df[f'location_{key}'] = 1
    dfs_to_concat_listed = [ df for _, df in dfs_to_concat.items() ]
    df_to_return = pd.concat(dfs_to_concat_listed, ignore_index=True)
    return df_to_return

def drop_columns_for_concatenated_input(df):
    # Union of the columns that are constant in the test set of at least one location
    columns_to_drop = []
    for X_test in (X_test_estimated_a, X_test_estimated_b, X_test_estimated_c):
        stds = X_test.std()
        for column in stds[stds == 0].keys():
            if column not in columns_to_drop: columns_to_drop.append(column)
    return df.drop(columns=columns_to_drop)

# THESE FUNCTIONS WILL BE EXPLAINED IN A FURTHER PART
start_2019 = pd.to_datetime("2019-04-21")
end_2019 = pd.to_datetime("2019-07-22")

start_2020 = pd.to_datetime("2020-04-21")
end_2020 = pd.to_datetime("2020-07-22")

start_2021 = pd.to_datetime("2021-04-21")
end_2021 = pd.to_datetime("2021-07-22")

start_2022 = pd.to_datetime("2022-04-21")
end_2022 = pd.to_datetime("2022-07-22")

def create_mask_for_split_training_and_testing(df, start_estimated, time_column = 'date_forecast'):
    mask_2020 = ((df[time_column] >= start_2020) & (df[time_column] < end_2020))
    mask_2021 = ((df[time_column] >= start_2021) & (df[time_column] < end_2021))
    mask_2022 = ((df[time_column] >= start_2022) & (df[time_column] < end_2022))
    mask_estimated = (df[time_column] >= start_estimated)
    return mask_2020, mask_2021, mask_2022,mask_estimated

def split_training_testing_set(df, start_estimated, random_state=42, test_size = .1, time_column = 'date_forecast'):
    df_to_split = df.copy()
    mask_2020, mask_2021, mask_2022, mask_estimated = create_mask_for_split_training_and_testing(df_to_split, start_estimated)
    df_summers = df_to_split[ mask_2020 | mask_2021 | mask_2022 | mask_estimated]
    df_not_summer = df_to_split[~(mask_2020 | mask_2021 | mask_2022 | mask_estimated)]
    test_size = test_size * (len(df_summers) + len(df_not_summer)) / len(df_summers)
    train_data_summer, pv_test_not_ordered = train_test_split(df_summers, test_size=test_size, random_state=random_state)

    pv_train_not_ordered = pd.concat([train_data_summer, df_not_summer])
    pv_train = pv_train_not_ordered.sort_values(by='date_forecast')
    pv_test = pv_test_not_ordered.sort_values(by='date_forecast')
    return pv_train, pv_test

def concat_to_location_and_pre_process_for_training(
        X_trains = { 'A': [X_train_observed_a, X_train_estimated_a], 'B': [X_train_observed_b, X_train_estimated_b], 'C': [X_train_observed_c, X_train_estimated_c]}, 
        X_tests =  { 'A': X_test_estimated_a, 'B': X_test_estimated_b, 'C': X_test_estimated_c}, 
        y_trains = { 'A': train_a, 'B': train_b, 'C': train_c}, 
        y_scalers = { 'A': StandardScaler(), 'B': StandardScaler(), 'C': StandardScaler()}
):      
        test_size, random_state = 0.1, 42
        
        X_total_a_y_nan, X_total_b_y_nan, X_total_c_y_nan = partial_preprocess(drop_columns=False)

        X_total_a_y_nan = drop_columns_for_concatenated_input(X_total_a_y_nan)
        X_total_b_y_nan = drop_columns_for_concatenated_input(X_total_b_y_nan)
        X_total_c_y_nan = drop_columns_for_concatenated_input(X_total_c_y_nan)

        pv_train_a, pv_test_a = split_training_testing_set(X_total_a_y_nan, X_train_estimated_a["date_forecast"].mean(), test_size=test_size, random_state=random_state)
        pv_train_b, pv_test_b = split_training_testing_set(X_total_b_y_nan, X_train_estimated_b["date_forecast"].mean(), test_size=test_size, random_state=random_state)
        pv_train_c, pv_test_c = split_training_testing_set(X_total_c_y_nan, X_train_estimated_c["date_forecast"].mean(), test_size=test_size, random_state=random_state)

        X_train_a, y_train_a = create_features(pv_train_a, label='pv_measurement')
        X_test_a, y_test_a = create_features(pv_test_a, label='pv_measurement')

        X_train_b, y_train_b = create_features(pv_train_b, label='pv_measurement')
        X_test_b, y_test_b = create_features(pv_test_b, label='pv_measurement')

        X_train_c, y_train_c = create_features(pv_train_c, label='pv_measurement')
        X_test_c, y_test_c = create_features(pv_test_c, label='pv_measurement')

        # Feature scalers: fitted on the training features only
        X_train_a_norm, scaler_a = sklearn_z_score_normalize_dataframe(X_train_a,return_scaler=True)
        X_train_b_norm, scaler_b = sklearn_z_score_normalize_dataframe(X_train_b,return_scaler=True)
        X_train_c_norm, scaler_c = sklearn_z_score_normalize_dataframe(X_train_c,return_scaler=True)

        # fit=False: reuse the training statistics on the test features instead of re-fitting
        X_test_a_norm = sklearn_z_score_normalize_dataframe(X_test_a,return_scaler=False,scaler=scaler_a,fit=False)
        X_test_b_norm = sklearn_z_score_normalize_dataframe(X_test_b,return_scaler=False,scaler=scaler_b,fit=False)
        X_test_c_norm = sklearn_z_score_normalize_dataframe(X_test_c,return_scaler=False,scaler=scaler_c,fit=False)

        X_train_to_concat = { 'A': X_train_a_norm, 'B': X_train_b_norm, 'C': X_train_c_norm }
        X_train_norm = concat_df_from_differents_locations(X_train_to_concat)

        X_test_to_concat = { 'A': X_test_a_norm, 'B': X_test_b_norm, 'C': X_test_c_norm }
        X_test_norm = concat_df_from_differents_locations(X_test_to_concat)

        # Target scalers: the ones passed in y_scalers, fitted on the training targets
        y_train_a, y_scaler_a = sklearn_z_score_normalize_dataframe(pd.DataFrame(y_train_a),return_scaler=True,scaler=y_scalers['A'])
        y_train_b, y_scaler_b = sklearn_z_score_normalize_dataframe(pd.DataFrame(y_train_b),return_scaler=True,scaler=y_scalers['B'])
        y_train_c, y_scaler_c = sklearn_z_score_normalize_dataframe(pd.DataFrame(y_train_c),return_scaler=True,scaler=y_scalers['C'])

        y_test_a = pd.DataFrame(y_scaler_a.transform(pd.DataFrame(y_test_a)))
        y_test_b = pd.DataFrame(y_scaler_b.transform(pd.DataFrame(y_test_b)))
        y_test_c = pd.DataFrame(y_scaler_c.transform(pd.DataFrame(y_test_c)))

        y_train = pd.concat([y_train_a, y_train_b, y_train_c], ignore_index=True)
        y_test= pd.concat([y_test_a, y_test_b, y_test_c], ignore_index=True)
        
        y_scalers = { 'A': y_scaler_a, 'B': y_scaler_b, 'C': y_scaler_c}

        return X_train_norm, X_test_norm, y_train, y_test, y_scalers

def postprocess_concatenated_input(preds, y_scalers):
    pred_a = y_scalers['A'].inverse_transform(pd.DataFrame({'pv_measurement_prediction': preds['A']})).reshape(-1)
    pred_b = y_scalers['B'].inverse_transform(pd.DataFrame({'pv_measurement_prediction': preds['B']})).reshape(-1)
    pred_c = y_scalers['C'].inverse_transform(pd.DataFrame({'pv_measurement_prediction': preds['C']})).reshape(-1)
    return np.concatenate((np.concatenate((pred_a,pred_b)), pred_c))
     
X_train_norm, X_test_norm, y_train, y_test, y_scalers = concat_to_location_and_pre_process_for_training()
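
The location flags built by `concat_df_from_differents_locations` amount to a one-hot encoding of the location. As an illustration only, the sketch below obtains the same kind of columns with `pd.get_dummies` on invented frames; it is an equivalent formulation, not the code used in the pipeline above.

```python
import pandas as pd

# Invented per-location frames sharing the same feature column
toy = {
    'A': pd.DataFrame({'direct_rad:W': [1.0, 2.0]}),
    'B': pd.DataFrame({'direct_rad:W': [3.0]}),
    'C': pd.DataFrame({'direct_rad:W': [4.0]}),
}

# Tag every row with its location, stack the frames, then one-hot encode the tag
stacked = pd.concat([df.assign(location=loc) for loc, df in toy.items()], ignore_index=True)
stacked = pd.get_dummies(stacked, columns=['location'], dtype=int)
print(stacked)
```

Each row has exactly one flag set to 1, which lets a single model learn location-specific behaviour.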


## 4. Pre-processing ideas that didn't work

### 4.1 Polynomial Features

In order to find new relationships between features, we're going to apply a method called `Polynomial Features`, which consists of adding new features to the dataset that are the products of those already present.

In our dataset, we currently have over 50 features. We can't take all the combinations, as too many features would prevent the model from finding good combinations (with very long training times). So we're going to run a first model without `Polynomial Features` to get a ranking of the importance of the features, so as to take only a certain number. This is what the `recup_var` function does.

We then add to our dataframe the polynomial combinations of these features to the desired degree.

In [None]:
def recup_var(reg,number=10):
  feature_importance = reg.get_booster().get_fscore()
  sorted_feature_importance = sorted(feature_importance.items(), key=lambda x: x[1], reverse=True)
  top_features = [feature for feature, _ in sorted_feature_importance[:number]]
  return top_features

def column_filter(df,columns):
  df_filtre = df[columns]
  return df_filtre

def creation_df_polynomial_feature(df,degre=1):

  # Creation of a PolynomialFeatures object
  poly = PolynomialFeatures(degre, include_bias=False)

  # Apply the polynomial transformation to the data
  poly_features = poly.fit_transform(df)

  # Create a new dataframe
  poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(input_features=df.columns))

  return poly_df

Unfortunately, the results obtained by testing several combinations of features did not improve the score. We therefore decided to shelve this avenue.

If you'd like to implement it and test it with our code, all you have to do is train a model and pass it to the `recup_var` function, then retrieve the filtered dataframe with the `column_filter` function. Finally, pass the result to the `creation_df_polynomial_feature` function.

### 4.2. Mean of features

We have seen that the dataframes containing the features are sampled every quarter of an hour, whereas those containing the targets are sampled every hour.

To take into account the weather during the whole hour (as it impacts `pv_measurement`), we tried to average the different weather components over each hour.

Please note that in the test file, the days are taken at random and our predictions start at 0h, so we need to take this into account. This is why our code applies the following algorithm to every 24h block:
* keep the weather at 0h as is
* remove the rows at 0h and 23h15/30/45
* replace each group of 4 remaining rows by its average
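
The steps above can be illustrated on a single day and a single weather variable. This is a minimal pure-Python sketch, assuming a complete day of 96 quarter-hour values; the function name `hourly_means_for_one_day` and the toy signal are ours and are not part of the project code.

```python
from statistics import mean

def hourly_means_for_one_day(quarter_hours):
    # quarter_hours: 96 values, index 0 is 0h00, index 95 is 23h45
    assert len(quarter_hours) == 96
    # 0h00 is kept as is
    hourly = [quarter_hours[0]]
    # Rows 1..92 (0h15 to 23h00) are averaged in groups of 4;
    # rows 93..95 (23h15/30/45) are dropped
    for start in range(1, 93, 4):
        hourly.append(mean(quarter_hours[start:start + 4]))
    return hourly

day = list(range(96))  # toy signal: value = quarter-hour index
hourly = hourly_means_for_one_day(day)
print(len(hourly), hourly[:3], hourly[-1])
```

Each hourly value (except 0h) is thus the mean of the four quarter-hours ending at that hour, and one day yields exactly 24 rows, matching the hourly targets.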

In [None]:
def traitement_df(df_input):

    # Drop the calculation date if present
    if "date_calc" in df_input.columns:
        df_input = df_input.drop(["date_calc"], axis=1)

    # Keep one timestamp per hour (every 4th quarter-hour row);
    # this assumes a default integer index starting at 0
    date_col = df_input.loc[df_input.index % 4 == 0, "date_forecast"].reset_index(drop=True)

    df_input = df_input.drop(["date_forecast"], axis=1)

    parts = []
    # Process one day (96 quarter-hour rows) at a time
    for i in range(0, len(df_input), 96):
        df_etude = df_input.iloc[i:i+96]

        # 0h: keep the raw weather
        parts.append(df_etude.iloc[[0]])

        # Drop 0h and 23h15/30/45, then average each group of 4 rows
        df_etude = df_etude.iloc[1:-3].reset_index(drop=True)
        parts.append(df_etude.groupby(df_etude.index // 4).mean())

    df_output = pd.concat(parts, ignore_index=True)
    df_output["date_forecast"] = date_col

    return df_output

## 5. Separation of training and test sets

During our study, we tried several ways of splitting the data into training and test sets, and found that some worked better than others. We first used a "classical approach", which consisted of taking the last 10% of the whole training set (observed and estimated) for each location. We then realized that, because our predictions cover a spring/summer date range, we had to include some dates from that period in our test set. Moreover, our predictions are based on estimated values, so the test set also has to contain that kind of data in order to reflect the distribution of estimated values.

We therefore first present our earlier split, which was less effective, and then the second one.

### 5.1. Split training and test sets on estimated set dates

In [None]:
def split_training_testing_sets_on_test_set_dates():
    # Train on everything up to the first quartile of the estimated dates,
    # test on the remaining (most recent) estimated data
    quantile = .25
    split_a = X_train_estimated_a['date_forecast'].quantile(quantile)
    pv_train_a = X_total_a_y_nan[X_total_a_y_nan['date_forecast'] <= split_a].copy()
    pv_test_a = X_total_a_y_nan[X_total_a_y_nan['date_forecast'] > split_a].copy()

    split_b = X_train_estimated_b['date_forecast'].quantile(quantile)
    pv_train_b = X_total_b_y_nan[X_total_b_y_nan['date_forecast'] <= split_b].copy()
    pv_test_b = X_total_b_y_nan[X_total_b_y_nan['date_forecast'] > split_b].copy()

    split_c = X_train_estimated_c['date_forecast'].quantile(quantile)
    pv_train_c = X_total_c_y_nan[X_total_c_y_nan['date_forecast'] <= split_c].copy()
    pv_test_c = X_total_c_y_nan[X_total_c_y_nan['date_forecast'] > split_c].copy()
    return pv_train_a, pv_test_a, pv_train_b, pv_test_b, pv_train_c, pv_test_c
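
To make the effect of this split concrete, here is a toy sketch in which dates are replaced by integer day numbers. The day ranges are hypothetical, and the `quantile` helper is ours: it mimics the linear interpolation that pandas uses by default, but it is not the project code.

```python
def quantile(sorted_values, q):
    # Linear interpolation between the two closest ranks
    pos = q * (len(sorted_values) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (pos - low) * (sorted_values[high] - sorted_values[low])

observed_days = list(range(0, 100))     # toy "observed" period
estimated_days = list(range(100, 120))  # toy "estimated" period, more recent

split_day = quantile(estimated_days, 0.25)
all_days = observed_days + estimated_days
train_days = [d for d in all_days if d <= split_day]
test_days = [d for d in all_days if d > split_day]

print(split_day, len(train_days), len(test_days))
```

In this toy setting the test set contains only estimated data (the most recent three quarters of it), while the training set contains all the observed data plus the first quarter of the estimated period.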

### 5.2. Split training and test sets on prediction dates range and estimated set

In [None]:
start_2020 = pd.to_datetime("2020-04-21")
end_2020 = pd.to_datetime("2020-07-22")

start_2021 = pd.to_datetime("2021-04-21")
end_2021 = pd.to_datetime("2021-07-22")

start_2022 = pd.to_datetime("2022-04-21")
end_2022 = pd.to_datetime("2022-07-22")

def create_mask_for_split_training_and_testing(df, start_estimated, time_column = 'date_forecast'):
    # One mask per spring/summer prediction window, plus one for the estimated period
    mask_2020 = ((df[time_column] >= start_2020) & (df[time_column] < end_2020))
    mask_2021 = ((df[time_column] >= start_2021) & (df[time_column] < end_2021))
    mask_2022 = ((df[time_column] >= start_2022) & (df[time_column] < end_2022))
    mask_estimated = (df[time_column] >= start_estimated)
    return mask_2020, mask_2021, mask_2022, mask_estimated

def split_training_testing_set(df, start_estimated, random_state=42, test_size = .1, time_column = 'date_forecast'):
    df_to_split = df.copy()
    mask_2020, mask_2021, mask_2022, mask_estimated = create_mask_for_split_training_and_testing(df_to_split, start_estimated, time_column)
    df_summers = df_to_split[ mask_2020 | mask_2021 | mask_2022 | mask_estimated]
    df_not_summer = df_to_split[~(mask_2020 | mask_2021 | mask_2022 | mask_estimated)]
    # The test set is drawn only from df_summers: rescale test_size so that
    # it still represents `test_size` of the whole dataset
    test_size = test_size * (len(df_summers) + len(df_not_summer)) / len(df_summers)
    train_data_summer, pv_test_not_ordered = train_test_split(df_summers, test_size=test_size, random_state=random_state)

    # Everything outside the selected periods goes to the training set
    pv_train_not_ordered = pd.concat([train_data_summer, df_not_summer])
    pv_train = pv_train_not_ordered.sort_values(by=time_column)
    pv_test = pv_test_not_ordered.sort_values(by=time_column)
    return pv_train, pv_test

random_state = 42
pv_train_a, pv_test_a = split_training_testing_set(X_total_a_y_nan, X_train_estimated_a["date_forecast"].mean(), random_state=random_state)
pv_train_b, pv_test_b = split_training_testing_set(X_total_b_y_nan, X_train_estimated_b["date_forecast"].mean(), random_state=random_state)
pv_train_c, pv_test_c = split_training_testing_set(X_total_c_y_nan, X_train_estimated_c["date_forecast"].mean(), random_state=random_state)

print("train_a :",pv_train_a.shape)
print("test_a :",pv_test_a.shape)
print("Ratio test/total :", round(pv_test_a.shape[0]/(pv_test_a.shape[0]+pv_train_a.shape[0]),3)*100, '%')
print("train_b :",pv_train_b.shape)
print("test_b :",pv_test_b.shape)
print("Ratio test/total :", round(pv_test_b.shape[0]/(pv_test_b.shape[0]+pv_train_b.shape[0]),3)*100, '%')
print("train_c :",pv_train_c.shape)
print("test_c :",pv_test_c.shape)
print("Ratio test/total :", round(pv_test_c.shape[0]/(pv_test_c.shape[0]+pv_train_c.shape[0]),3)*100, '%')
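
The rescaling of `test_size` inside `split_training_testing_set` can be checked with a small worked example. The row counts below are hypothetical and the helper `rescaled_test_size` is ours; it only reproduces the arithmetic, not the split itself.

```python
def rescaled_test_size(n_summer, n_not_summer, test_size=0.1):
    # Fraction to draw from the summer/estimated rows so that the test set
    # represents `test_size` of the whole dataset
    return test_size * (n_summer + n_not_summer) / n_summer

n_summer, n_not_summer = 400, 600  # hypothetical row counts
fraction = rescaled_test_size(n_summer, n_not_summer)
n_test = round(fraction * n_summer)

print(fraction, n_test, n_test / (n_summer + n_not_summer))
```

Here a quarter of the 400 summer/estimated rows is sampled, which gives 100 test rows, that is 10% of the 1000 rows overall. Note that if the summer/estimated rows made up less than `test_size` of the data, the rescaled fraction would reach 1 or more and `train_test_split` would reject it.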

## 6. Models

### 6.1. Models performance comparison

To simplify the reading of our study, we will only detail our results for location A.

What's more, A has the highest target values (4 to 5 times higher than B and C), which gives us a good idea of the trend to follow.

In [None]:
# Load the data
train_a, train_b, train_c, X_train_estimated_a, X_train_estimated_b, X_train_estimated_c, X_train_observed_a, X_train_observed_b, X_train_observed_c, X_test_estimated_a, X_test_estimated_b, X_test_estimated_c = read_files()

# observed + estimated
X_total_a = pd.concat([X_train_observed_a,X_train_estimated_a])
X_total_b = pd.concat([X_train_observed_b,X_train_estimated_b])
X_total_c = pd.concat([X_train_observed_c,X_train_estimated_c])

# Rename + drop rows without pv_measurement
train_a.rename(columns={'time': 'date_forecast'}, inplace=True)
train_b.rename(columns={'time': 'date_forecast'}, inplace=True)
train_c.rename(columns={'time': 'date_forecast'}, inplace=True)

train_a.dropna(subset=['pv_measurement'], inplace=True)
train_b.dropna(subset=['pv_measurement'], inplace=True)
train_c.dropna(subset=['pv_measurement'], inplace=True)

# Match between features and target
X_total_a_y = pd.merge(X_total_a, train_a, on='date_forecast', how='inner')
X_total_b_y = pd.merge(X_total_b, train_b, on='date_forecast', how='inner')
X_total_c_y = pd.merge(X_total_c, train_c, on='date_forecast', how='inner')

# Nan
X_total_a_y_nan = gestion_nan(X_total_a_y)
X_total_b_y_nan = gestion_nan(X_total_b_y)
X_total_c_y_nan = gestion_nan(X_total_c_y)

# Train/Test split
split_date_a = X_train_estimated_a['date_forecast'].quantile(0.25)
split_date_b = X_train_estimated_b['date_forecast'].quantile(0.25)
split_date_c = X_train_estimated_c['date_forecast'].quantile(0.25)

pv_train_a, pv_test_a, pv_train_b, pv_test_b, pv_train_c, pv_test_c = split_training_testing_sets_on_test_set_dates()

# Date_forecast type
pv_train_a['date_forecast'] = pd.to_datetime(pv_train_a['date_forecast'])
pv_test_a['date_forecast'] = pd.to_datetime(pv_test_a['date_forecast'])

pv_train_b['date_forecast'] = pd.to_datetime(pv_train_b['date_forecast'])
pv_test_b['date_forecast'] = pd.to_datetime(pv_test_b['date_forecast'])

pv_train_c['date_forecast'] = pd.to_datetime(pv_train_c['date_forecast'])
pv_test_c['date_forecast'] = pd.to_datetime(pv_test_c['date_forecast'])


#feature Engineering
X_train_a, y_train_a = create_features(pv_train_a, label='pv_measurement')
X_test_a, y_test_a = create_features(pv_test_a, label='pv_measurement')

X_train_b, y_train_b = create_features(pv_train_b, label='pv_measurement')
X_test_b, y_test_b = create_features(pv_test_b, label='pv_measurement')

X_train_c, y_train_c = create_features(pv_train_c, label='pv_measurement')
X_test_c, y_test_c = create_features(pv_test_c, label='pv_measurement')

# Normalisation
scaler_a = StandardScaler()
scaler_b = StandardScaler()
scaler_c = StandardScaler()

X_train_a_norm,scaler_a = sklearn_z_score_normalize_dataframe(X_train_a,return_scaler=True, scaler=scaler_a)
X_train_b_norm,scaler_b = sklearn_z_score_normalize_dataframe(X_train_b,return_scaler=True, scaler=scaler_b)
X_train_c_norm,scaler_c = sklearn_z_score_normalize_dataframe(X_train_c,return_scaler=True, scaler=scaler_c)

X_test_a_norm = sklearn_z_score_normalize_dataframe(X_test_a,return_scaler=False,scaler=scaler_a, fit=False)
X_test_b_norm = sklearn_z_score_normalize_dataframe(X_test_b,return_scaler=False,scaler=scaler_b, fit=False)
X_test_c_norm = sklearn_z_score_normalize_dataframe(X_test_c,return_scaler=False,scaler=scaler_c, fit=False)

# Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42) #, max_depth = 10
rf_model.fit(X_train_a_norm, y_train_a)

# Gradient Boosting
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42) #learning_rate=0.1,
gb_model.fit(X_train_a_norm, y_train_a)

# Create an XGBoost model
# (recent XGBoost versions expect early_stopping_rounds in the constructor rather than in fit)
reg_a = xgb.XGBRegressor(n_estimators=100)
reg_a.fit(X_train_a_norm, y_train_a,
          eval_set=[(X_train_a_norm, y_train_a), (X_test_a_norm, y_test_a)],
          early_stopping_rounds=50,
          verbose=True,
          )

# GRADIENT BOOSTING
# List storing the test error at each iteration
valid_errors = []

for i, y_pred in enumerate(gb_model.staged_predict(X_test_a_norm)):
    valid_errors.append(mean_squared_error(y_test_a, y_pred))

# Create the learning-curve figure
fig = plt.figure(figsize=(10, 6))
plt.subplot(1, 1, 1)
plt.title("Learning curve Gradient Boosting")
plt.plot(
    np.arange(100) + 1,
    gb_model.train_score_,
    "b-",
    label="Training Set Deviance",
)
plt.plot(
    np.arange(100) + 1, valid_errors, "r-", label="Test Set Deviance"
)
plt.legend(loc="upper right")
plt.xlabel("Boosting Iterations")
plt.ylabel("Deviance")
fig.tight_layout()
plt.show()

# XGBOOST
train_errors = []
test_errors = []

results = reg_a.evals_result()
train_errors = results['validation_0']["rmse"]
test_errors = results['validation_1']["rmse"]

# Plot the learning curves
plt.figure(figsize=(10, 6))
plt.plot(train_errors, label='Train')
plt.plot(test_errors, label='Test')
plt.xlabel('Boosting Rounds')
plt.ylabel("rmse")
plt.legend()
plt.title('Learning Curves XGBoost')
plt.show()


In [None]:
# Prediction
rf_predictions = rf_model.predict(X_test_a_norm)
gb_predictions = gb_model.predict(X_test_a_norm)
reg_a_predictions = reg_a.predict(X_test_a_norm)

# Evaluation
rf_mse = mean_squared_error(y_test_a, rf_predictions)
gb_mse = mean_squared_error(y_test_a, gb_predictions)
reg_a_mse = mean_squared_error(y_test_a, reg_a_predictions)
print(f"Random Forest MSE: {round(rf_mse,0)}")
print(f"Gradient Boosting MSE: {round(gb_mse,0)}")
print(f"XGBoost MSE: {round(reg_a_mse,0)}")

print("\n")

mae_rf = np.mean(np.abs(y_test_a - rf_predictions))
mae_gb = np.mean(np.abs(y_test_a - gb_predictions))
mae_reg_a = np.mean(np.abs(y_test_a - reg_a_predictions))
print(f"Random Forest MAE: {round(mae_rf,2)}")
print(f"Gradient Boosting MAE: {round(mae_gb,2)}")
print(f"XGBoost MAE: {round(mae_reg_a,2)}")

With these first results, we can see that the models have converged well (see learning curves above). We first chose to train our models using RMSE, as it penalizes large errors more heavily and therefore takes better account of outliers (the distribution of `pv_measurement` shows that the target can reach large values, but quite rarely).
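
The difference between the two metrics can be seen on a toy example. The error values below are made up for illustration: both lists have the same mean absolute error, but one of them contains a single large miss.

```python
from math import sqrt

def mae(errors):
    # Mean absolute error
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    # Root mean squared error
    return sqrt(sum(e * e for e in errors) / len(errors))

regular = [2, 2, 2, 2, 2]       # five moderate errors
with_outlier = [1, 1, 1, 1, 6]  # same total absolute error, one large miss

print(mae(regular), rmse(regular))
print(mae(with_outlier), round(rmse(with_outlier), 3))
```

MAE is 2 in both cases, whereas RMSE rises from 2 to about 2.83 for the list containing the large miss, which is why RMSE pushes a model to avoid rare but large errors.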

In the evaluation phase, we can see that on both measures, the XGBoost model performs much better than the other two models.

In addition to this study, we can look at the relative importance of each variable on the model by plotting the importance graph generated by the "tree" type models.

In [None]:
plot_importance(reg_a, height=.9)
plt.gcf().set_size_inches(14, 8)

In this ranking, we can see a predominance of 3 variables: `absolute_humidity`,`diffuse_rad`,`direct_rad`.

This seems coherent, as these variables are directly linked to the sun's rays. The radiation variables relate directly to sunshine characteristics, to which the power generated by the solar panels is proportional.

As for humidity, we can deduce that water droplets in the air have a major impact on light transmission and reflection, and therefore on the power generated.

### 6.2. XGBoost hyper-parameters

We thus decided to focus on an XGBoost model, which implies creating models (for A, B and C) with the best hyper-parameters. We wrote the following code to find them. The problem is that this search has an exponential complexity in the number of hyper-parameters we want to try.
To make it converge, we took the best values found by the last search, tried their neighbouring values, and repeated the process.
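
The idea of repeatedly searching around the last best values can be sketched as follows. This is only an illustration under strong assumptions: the candidate lists are hypothetical, `toy_score` is a stand-in for a validation RMSE (no model is trained), and the helper names `grid_size`, `neighbours` and `refine` are ours.

```python
from itertools import product

def grid_size(grid):
    # Number of combinations of a full grid: product of the list lengths
    size = 1
    for values in grid.values():
        size *= len(values)
    return size

def neighbours(values, best):
    # Keep the best value and its immediate neighbours in the ordered list
    i = values.index(best)
    return values[max(i - 1, 0): i + 2]

def refine(candidates, best_params, score):
    # One refinement round: evaluate the small grid around best_params
    local_grid = {k: neighbours(candidates[k], v) for k, v in best_params.items()}
    names = list(local_grid)
    combos = [dict(zip(names, c)) for c in product(*local_grid.values())]
    return min(combos, key=score), len(combos)

candidates = {
    'max_depth': [3, 4, 5, 6, 7],
    'eta': [0.005, 0.01, 0.05, 0.1],
    'min_child_weight': [3, 5, 10, 15, 20],
}
target = {'max_depth': 4, 'eta': 0.005, 'min_child_weight': 15}

def toy_score(p):
    # Stand-in for a validation error: distance (in grid steps) to `target`
    return sum(abs(candidates[k].index(p[k]) - candidates[k].index(target[k])) for k in p)

best = {'max_depth': 6, 'eta': 0.05, 'min_child_weight': 5}
rounds = 0
while True:
    new_best, n_evaluated = refine(candidates, best, toy_score)
    rounds += 1
    if new_best == best:
        break
    best = new_best

print(grid_size(candidates), best, rounds)
```

Each round evaluates at most 27 combinations instead of the 100 of the full grid, at the price of possibly stopping in a local optimum when the real score is less regular than this toy one.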

In [None]:
# Load the data
train_a, train_b, train_c, X_train_estimated_a, X_train_estimated_b, X_train_estimated_c, X_train_observed_a, X_train_observed_b, X_train_observed_c, X_test_estimated_a, X_test_estimated_b, X_test_estimated_c = read_files()

# Interpolate values
train_a = interpolate_output_values(train_a)
train_b = interpolate_output_values(train_b)
train_c = interpolate_output_values(train_c)

# observed + estimated
X_total_a = pd.concat([X_train_observed_a,X_train_estimated_a])
X_total_b = pd.concat([X_train_observed_b,X_train_estimated_b])
X_total_c = pd.concat([X_train_observed_c,X_train_estimated_c])

# Rename + drop rows without pv_measurement
train_a.rename(columns={'time': 'date_forecast'}, inplace=True)
train_b.rename(columns={'time': 'date_forecast'}, inplace=True)
train_c.rename(columns={'time': 'date_forecast'}, inplace=True)

train_a.dropna(subset=['pv_measurement'], inplace=True)
train_b.dropna(subset=['pv_measurement'], inplace=True)
train_c.dropna(subset=['pv_measurement'], inplace=True)
# Match between features and target
X_total_a_y = pd.merge(X_total_a, train_a, on='date_forecast', how='inner')
X_total_b_y = pd.merge(X_total_b, train_b, on='date_forecast', how='inner')
X_total_c_y = pd.merge(X_total_c, train_c, on='date_forecast', how='inner')

# Nan
X_total_a_y_nan = gestion_nan(X_total_a_y)
X_total_b_y_nan = gestion_nan(X_total_b_y)
X_total_c_y_nan = gestion_nan(X_total_c_y)

# Train/Test split
pv_train_a, pv_test_a = split_training_testing_set(X_total_a_y_nan, X_train_estimated_a["date_forecast"].mean(), random_state=random_state)
pv_train_b, pv_test_b = split_training_testing_set(X_total_b_y_nan, X_train_estimated_b["date_forecast"].mean(), random_state=random_state)
pv_train_c, pv_test_c = split_training_testing_set(X_total_c_y_nan, X_train_estimated_c["date_forecast"].mean(), random_state=random_state)

# Date_forecast type
pv_train_a['date_forecast'] = pd.to_datetime(pv_train_a['date_forecast'])
pv_test_a['date_forecast'] = pd.to_datetime(pv_test_a['date_forecast'])

pv_train_b['date_forecast'] = pd.to_datetime(pv_train_b['date_forecast'])
pv_test_b['date_forecast'] = pd.to_datetime(pv_test_b['date_forecast'])

pv_train_c['date_forecast'] = pd.to_datetime(pv_train_c['date_forecast'])
pv_test_c['date_forecast'] = pd.to_datetime(pv_test_c['date_forecast'])


#feature Engineering
X_train_a, y_train_a = create_features(pv_train_a, label='pv_measurement')
X_test_a, y_test_a = create_features(pv_test_a, label='pv_measurement')

X_train_b, y_train_b = create_features(pv_train_b, label='pv_measurement')
X_test_b, y_test_b = create_features(pv_test_b, label='pv_measurement')

X_train_c, y_train_c = create_features(pv_train_c, label='pv_measurement')
X_test_c, y_test_c = create_features(pv_test_c, label='pv_measurement')

X_train_a = gestion_constant_columns(X_train_a, 'A')
X_test_a = gestion_constant_columns(X_test_a, 'A')

X_train_b = gestion_constant_columns(X_train_b, 'B')
X_test_b = gestion_constant_columns(X_test_b, 'B')

X_train_c = gestion_constant_columns(X_train_c, 'C')
X_test_c = gestion_constant_columns(X_test_c, 'C')

# Normalisation
scaler_a = StandardScaler()
scaler_b = StandardScaler()
scaler_c = StandardScaler()

X_train_a_norm,scaler_a = sklearn_z_score_normalize_dataframe(X_train_a,return_scaler=True, scaler=scaler_a)
X_train_b_norm,scaler_b = sklearn_z_score_normalize_dataframe(X_train_b,return_scaler=True, scaler=scaler_b)
X_train_c_norm,scaler_c = sklearn_z_score_normalize_dataframe(X_train_c,return_scaler=True, scaler=scaler_c)

X_test_a_norm = sklearn_z_score_normalize_dataframe(X_test_a,return_scaler=False,scaler=scaler_a, fit=False)
X_test_b_norm = sklearn_z_score_normalize_dataframe(X_test_b,return_scaler=False,scaler=scaler_b, fit=False)
X_test_c_norm = sklearn_z_score_normalize_dataframe(X_test_c,return_scaler=False,scaler=scaler_c, fit=False)

# THE BEST HYPER-PARAMETERS WE FOUND OUT
eval_metric = "rmse"
hyper_params_a = {'subsample': 0.6, 'refresh_leaf': 1, 'n_estimators': 4000, 'min_child_weight': 15, 'max_depth': 4, 'gamma': 0.5, 'eta': 0.005, 'colsample_bytree': 0.8}
hyper_params_b = {'n_estimators': 4000, 'subsample': 0.6, 'refresh_leaf': 1, 'min_child_weight': 5, 'max_depth': 5, 'gamma': 0.5, 'eta': 0.01, 'colsample_bytree': 0.6}
hyper_params_c = {'n_estimators': 4000, 'subsample': 0.6, 'refresh_leaf': 1, 'min_child_weight': 5, 'max_depth': 5, 'gamma': 0.5, 'eta': 0.05, 'colsample_bytree': 1.0}

In [None]:
reg_a = xgb.XGBRegressor(
    eval_metric = eval_metric,
    **hyper_params_a
    )

reg_b = xgb.XGBRegressor(
    eval_metric = eval_metric,
    **hyper_params_b
    )

reg_c = xgb.XGBRegressor(
    eval_metric = eval_metric,
    **hyper_params_c
    )

In [None]:
# An example of hyper-parameter ranges used to search for the best combination

gen_params = {
        'gamma': [0.5, 1, 5],
        'subsample': [0.6, 1.0],
        'max_depth': [4, 5],
        'eta': [ 0.005, 0.01, 0.05],
        'n_estimators': [ 3000, 4000 ]
        }

params_a = {
        'min_child_weight': [ 10, 12, 15 ],
        'colsample_bytree': [0.8, 1.0],
        'refresh_leaf': [ .7, 1 ],
        **gen_params
        }

params_b = {
        'min_child_weight': [3, 5, 7],
        'colsample_bytree': [.7, 0.6, .5],
        'refresh_leaf': [ 1 ],
        **gen_params
        }

params_c = {
        'min_child_weight': [3, 5, 7],
        'colsample_bytree': [ 0.6 ],
        'refresh_leaf': [ 1 ],
        **gen_params
        }

# An example of hyper-parameter ranges, based on the best ones found,
# to run the program more quickly

gen_params = {
        'gamma': [0.5],
        'subsample': [0.6],
        'n_estimators': [5000]
        }

params_a = {
        'min_child_weight': [ 15, 17 ],
        'colsample_bytree': [0.8],
        'max_depth': [4],
        'eta': [ 0.005 ],
        'refresh_leaf': [ 1 ],
        **gen_params
        }

params_b = {
        'min_child_weight': [5],
        'colsample_bytree': [0.6],
        'max_depth': [5],
        'eta': [ 0.01 ],
        'refresh_leaf': [ 1 ],
        **gen_params
        }

params_c = {
        'min_child_weight': [5],
        'colsample_bytree': [1],
        'max_depth': [5],
        'eta': [ 0.05 ],
        'refresh_leaf': [ 1 ],
        **gen_params
        }

In [None]:
param_comb_a = reduce(lambda x,y: x*y, [len(v) for v in params_a.values()])
param_comb_b = reduce(lambda x,y: x*y, [len(v) for v in params_b.values()])
param_comb_c = reduce(lambda x,y: x*y, [len(v) for v in params_c.values()])
n_jobs = 10


random_search_a = RandomizedSearchCV(reg_a, param_distributions=params_a, n_iter=param_comb_a, scoring='neg_mean_squared_error', n_jobs=n_jobs, cv=2, verbose=3, random_state=1001 )
random_search_b = RandomizedSearchCV(reg_b, param_distributions=params_b, n_iter=param_comb_b, scoring='neg_mean_squared_error', n_jobs=n_jobs, cv=2, verbose=3, random_state=1001 )
random_search_c = RandomizedSearchCV(reg_c, param_distributions=params_c, n_iter=param_comb_c, scoring='neg_mean_squared_error', n_jobs=n_jobs, cv=2, verbose=3, random_state=1001 )

random_search_a.fit(
    X_train_a_norm, y_train_a,
    eval_set=[(X_train_a_norm, y_train_a), (X_test_a_norm, y_test_a)],
    early_stopping_rounds=50,
    verbose=True,
    )

print('\n All results:')
print(random_search_a.cv_results_)
print('\n Best estimator:')
print(random_search_a.best_estimator_)
print('\n Best hyperparameters:')
print(random_search_a.best_params_)

random_search_b.fit(
    X_train_b_norm, y_train_b,
    eval_set=[(X_train_b_norm, y_train_b), (X_test_b_norm, y_test_b)],
    early_stopping_rounds=50,
    verbose=True,
    )

print('\n All results:')
print(random_search_b.cv_results_)
print('\n Best estimator:')
print(random_search_b.best_estimator_)
print('\n Best hyperparameters:')
print(random_search_b.best_params_)

random_search_c.fit(
    X_train_c_norm, y_train_c,
    eval_set=[(X_train_c_norm, y_train_c), (X_test_c_norm, y_test_c)],
    early_stopping_rounds=50,
    verbose=True,
    )

print('\n All results:')
print(random_search_c.cv_results_)
print('\n Best estimator:')
print(random_search_c.best_estimator_)
print('\n Best hyperparameters:')
print(random_search_c.best_params_)

reg_a = random_search_a.best_estimator_
reg_b = random_search_b.best_estimator_
reg_c = random_search_c.best_estimator_

reg_a.fit(X_train_a_norm, y_train_a,
          eval_set=[(X_train_a_norm, y_train_a), (X_test_a_norm, y_test_a)],
          early_stopping_rounds=50,
          verbose=True,
          )
reg_b.fit(X_train_b_norm, y_train_b,
          eval_set=[(X_train_b_norm, y_train_b), (X_test_b_norm, y_test_b)],
          early_stopping_rounds=50,
          verbose=True,
          )
reg_c.fit(X_train_c_norm, y_train_c,
          eval_set=[(X_train_c_norm, y_train_c), (X_test_c_norm, y_test_c)],
          early_stopping_rounds=50,
          verbose=True,
          )

def plot_learning_curves(reg, location):
    # Retrieve the train and test errors recorded after each boosting round
    results = reg.evals_result()
    train_errors = results['validation_0'][eval_metric]
    test_errors = results['validation_1'][eval_metric]

    # Plot the learning curves
    plt.figure(figsize=(10, 6))
    plt.plot(train_errors, label='Train')
    plt.plot(test_errors, label='Test')
    plt.xlabel('Boosting Rounds')
    plt.ylabel(eval_metric)
    plt.legend()
    plt.title(f'Learning Curves (location {location})')
    plt.show()

    # Lowest error reached on the test set
    return min(test_errors)

min_error_a = plot_learning_curves(reg_a, 'A')
min_error_b = plot_learning_curves(reg_b, 'B')
min_error_c = plot_learning_curves(reg_c, 'C')

As already mentioned, we discovered during our study that the hyper-parameters found by this search were strongly tied to the chosen test set: our Kaggle scores depended a lot on which test set we chose. Looking for the best test set and then for the best hyper-parameters on it was very tough and required a lot of time.

In [None]:
plot_importance(reg_a, height=.9)
plt.gcf().set_size_inches(14, 8)
plot_importance(reg_b, height=0.9)
plt.gcf().set_size_inches(14, 8)
plot_importance(reg_c, height=0.9)
plt.gcf().set_size_inches(14, 8)

### 6.3. Post-processing

When we look at the results of the last model, we see that some predicted values are below zero, whereas the training set contains no such values (which makes sense). So we decided to add a small post-processing step that replaces all the values below 10 by 0.

In [None]:
def preprocessing_test(df, scaler, loc):
  X_test = df.drop(["id","location","prediction"],axis=1)
  X_test = gestion_constant_columns(X_test, loc)
  X_test = create_features(X_test, None)
  X_test = gestion_nan(X_test)
  X_test = sklearn_z_score_normalize_dataframe(X_test,return_scaler=False,scaler=scaler, fit=False)
  return X_test

test = pd.read_csv("test.csv")
test_copy = test.copy()
test.rename(columns={'time': 'date_forecast'}, inplace=True)
test["date_forecast"] = pd.to_datetime(test["date_forecast"])

merged_df_a = pd.merge(X_test_estimated_a, test, on='date_forecast', how='inner')
merged_df_b = pd.merge(X_test_estimated_b, test, on='date_forecast', how='inner')
merged_df_c = pd.merge(X_test_estimated_c, test, on='date_forecast', how='inner')

X_test_a_test = preprocessing_test(merged_df_a, scaler_a, 'A')
X_test_b_test = preprocessing_test(merged_df_b, scaler_b, 'B')
X_test_c_test = preprocessing_test(merged_df_c, scaler_c, 'C')

result_A = reg_a.predict(X_test_a_test)
result_B = reg_b.predict(X_test_b_test)
result_C = reg_c.predict(X_test_c_test)

results = np.concatenate((np.concatenate((result_A,result_B)), result_C))
results[results <  10] = 0

plt.plot(results)


### 6.4. Some other models which did not succeed

AutoML

In [None]:
# Reduced parameters to run quickly
epochs = 500
max_trial = 1
batch_size = 256
metrics = [tf.keras.metrics.RootMeanSquaredError(), tf.keras.metrics.MeanSquaredError(), tf.keras.metrics.MeanAbsoluteError()]
# objective='val_root_mean_squared_error'

# max_trials sets how many candidate models are tried (reduced to 1 here to run quickly)
reg_a = ak.StructuredDataRegressor(
    max_trials=max_trial,
    metrics=metrics,
    overwrite=True,
    project_name='regA1',
    # objective=objective
    )

reg_a.fit(
    X_train_a_norm, 
    y_train_a, 
    validation_data=(X_test_a_norm, y_test_a),
    epochs=epochs,
    batch_size=batch_size
    )

predicted_y = reg_a.predict(X_test_a_norm)

print(reg_a.evaluate(X_test_a_norm, y_test_a))

## 7. Areas for improvement

In this section, we present the improvements we thought of but did not have the time to implement. We nevertheless think they are worth considering, either to develop this project further or for our future projects.

- Management of the training and test set:
    
    A first thing, which we already mentioned and which was a big challenge, was the split between the training and test sets. We tried many splits, and some changes that were an improvement on one test set were not on another, so it was really hard to deal with. This is probably a common problem in machine learning, but we were still struggling with it. One idea, to fit the data we want to predict as closely as possible, was to build our test set from days whose values are similar to the ones we want to estimate. On the other hand, we wanted our model to be as general as possible, so we did not do it at first.

- Management of the observed/estimated set:

    A big issue we faced was how to take into account the difference in distribution between the observed and estimated sets. Our biggest sets are the observed ones, whereas we have to predict on estimated sets. Our only improvement in the submitted models was to include some estimated values in both the training and test sets. We could have improved this part in many ways, but some required too much time. For example, we thought about building a kind of GAN (generative adversarial network) to learn a converter from estimated values to observed ones, based on the `pv_measurement` values. However, that kind of idea requires too much time to be made efficient, so we did not focus much on it.

- Location combination:

    We experimented with a single model instead of three. It did not work well, but we figured out at the last moment that it could have been interesting to build two models, one for A and one for B and C, because the hyper-parameters of these two locations were very similar.

- Model combination:

    One idea to improve our predictions was to combine several different models. Intuitively, we thought that some models, with some hyper-parameters, may be better at predicting in bad weather, while others, with other hyper-parameters, may be better in good weather. Weather is only one aspect, and many more could be taken into consideration. To start with, given for example two models that each predict a value, $\hat{y}_1$ and $\hat{y}_2$, our final prediction would be a linear combination of them: $$\hat{y} = \alpha \hat{y}_1 + \beta \hat{y}_2$$
    This could be further improved with a model that predicts the coefficients $\alpha$ and $\beta$ from the inputs, so that the coefficients would not be the same depending on the weather.

    Note that this idea is not restricted to two models: many more could be combined.

- Reduce the number of the features on our reshaped model:

    Another idea we thought about was to reduce the number of features of our reshaped model based on their importance. We would thus train successive models, reducing the number of features little by little.

- Trying other Machine Learning models:

    Many more Machine Learning models exist nowadays, but we mainly focused on XGBoost for lack of time, having spent too much time at first on more difficult approaches.
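
As an illustration of the model combination idea above, constant coefficients $\alpha$ and $\beta$ could be estimated by least squares on a validation set. This is a minimal sketch that we did not implement in the project: the function name `fit_blend_weights` and the toy predictions are hypothetical, and the weather-dependent version would require a further model.

```python
def fit_blend_weights(pred_1, pred_2, target):
    # Least-squares fit of target ~ alpha * pred_1 + beta * pred_2,
    # using the 2x2 normal equations solved with Cramer's rule
    s11 = sum(a * a for a in pred_1)
    s22 = sum(b * b for b in pred_2)
    s12 = sum(a * b for a, b in zip(pred_1, pred_2))
    s1y = sum(a * y for a, y in zip(pred_1, target))
    s2y = sum(b * y for b, y in zip(pred_2, target))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("the two prediction vectors are collinear")
    alpha = (s1y * s22 - s2y * s12) / det
    beta = (s2y * s11 - s1y * s12) / det
    return alpha, beta

# Toy predictions of two models, and a target built as a known blend of them
pred_1 = [0.0, 10.0, 50.0, 120.0]
pred_2 = [5.0, 20.0, 40.0, 100.0]
target = [0.3 * a + 0.7 * b for a, b in zip(pred_1, pred_2)]

alpha, beta = fit_blend_weights(pred_1, pred_2, target)
print(round(alpha, 6), round(beta, 6))
```

On this toy target the fit recovers the weights 0.3 and 0.7 used to build it. On real predictions the weights would have to be fitted on data that neither model was trained on, otherwise the blend would favour the model that overfits the most.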

## Conclusion

In this part we summarize what we did in this project, take a critical look at our methods, at what worked well and what failed, and draw conclusions for our future projects.

In this report we tried to present how we went through this project, as chronologically as possible. Firstly, we analysed the data we were given, trying to understand it and to extract the most relevant information for the model we were going to develop. Secondly, we presented one of our biggest research leads, which we tried to develop at the beginning of the project. We had high hopes for it, but the methods we had to develop to carry it through were too complicated, so it was not a success. We thus restarted from scratch with a more classical approach, which consisted of using ready-made models such as Random Forest, XGBoost and AutoML and trying to improve them. This approach was really efficient at the beginning: as soon as we started to use it, we got really satisfying results. Then, we tried more and more pre-processing ideas, which are presented in the parts above. Some ideas that we thought were improvements did not work well when we tried them. What was actually missing in that approach was to tune the hyper-parameters along with each change. We had already tried earlier to find the best hyper-parameters, but that first attempt was not successful. We got a lot of improvement from the method developed in [XGBoost hyper-parameters](#62-xgboost-hyper-parameters). However, we developed this method quite late compared to the deadline of the project, so we could not take it as far as we wanted: it takes a lot of time to run because of the exponential complexity of the algorithm. So we could not try all the improvements we thought about.

We are quite satisfied with our final results, considering the short time in which we developed the final model. However, we are quite disappointed that our first approach was not successful and that we lost a lot of time on it. In future projects we would not proceed the same way. Building simple features first is far more efficient than attempting one big idea we are not sure about. This is well known, but it has never been so true for us. We bet a lot on the first approach and it was not worth it, while the step-by-step approach was far more successful in the restricted time we had to develop it. Our first experiment with XGBoost, our main program, started two weeks before the deadline, and our biggest improvements came in the last few days. So, with a few more days, we could probably have improved it with what we described in [Areas for improvement](#7-areas-for-improvement). As a note for our future selves, we could have managed our time better by managing our research leads better. That is how we conclude this project, which was quite challenging but very exciting, both technically and in terms of project management. Even if our results are not the best, and could be seen as a failure from a specific point of view, we learnt a lot from it.