# ⚡ Machine Learning Project: Electricity Price Explanation ⚙️📉

**LANOTTE-MORO Alix, CHANDECLERC Antoine, GILLES Julien, FLEURY Nathan**  
**November 14, 2024**

## Project Overview

This project focuses on analyzing daily variations in electricity futures prices within the European market, specifically for France and Germany. Given the complex interplay of meteorological, energy production, and geopolitical factors impacting price fluctuations, the objective is to build a model that explains these movements rather than predicting exact prices. Leveraging Spearman correlation as the primary evaluation metric, we will employ a range of models—from linear regression benchmarks to advanced machine learning approaches (e.g., Random Forests, XGBoost, and RNNs)—to capture both linear and non-linear influences on price variations. This model aims to support energy producers, traders, and policymakers by providing insights to optimize trading strategies, manage risks, and strengthen the resilience of the European energy grid.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

from sklearn.preprocessing import StandardScaler

from scipy.stats import spearmanr

import warnings 
warnings.filterwarnings("ignore")

## Data Import and First Preview

In [None]:
X_df =pd.read_csv('X_train_NHkHMNU.csv', delimiter= ',')
y_df =pd.read_csv('y_train_ZAN5mwg.csv', delimiter= ',')
X_test_df =pd.read_csv('X_test_final.csv', delimiter= ',')

df = pd.merge(X_df,y_df,on='ID')

print(df.shape)
df.head(10)

### Percentage of missing values by column

In [None]:
nb_missing = df.isna().sum()
rate_missing = nb_missing / df.ID.nunique()
fig, ax = plt.subplots(figsize=(4,6))
ax1 = ax
rate_missing.plot(kind="barh", ax=ax1)
ax1.grid()

We notice that there are a certain number of missing values, particularly for DE_NET_IMPORT and DE_NET_EXPORT, where the rate exceeds 8%. These data will need to be handled. Several options are available to us: filling them with the mean, replacing them with zeros, etc.

We chose to fill the missing values in each column with the respective column mean. Since the dataset is relatively small (around 1500 rows), dropping rows with missing values would significantly reduce the available data. By imputing missing values with the mean, we retain as many rows as possible, ensuring that the dataset remains comprehensive.

In [None]:
#List of Columns with missing values
columns_to_fill = [
    "DE_FR_EXCHANGE", "FR_DE_EXCHANGE", "DE_NET_EXPORT", "FR_NET_EXPORT", 
    "DE_NET_IMPORT", "FR_NET_IMPORT", "DE_RAIN", "FR_RAIN", 
    "DE_WIND", "FR_WIND", "DE_TEMP", "FR_TEMP"
]
# Note: df_general is an alias of df (not a copy), so df is filled as well
df_general = df

# Fill missing values with the column mean
for column in columns_to_fill:
    if column in df_general.columns:
        df_general[column] = df_general[column].fillna(df_general[column].mean())

### Plot of the distribution of each column of the data

In [None]:
features = [feature for feature in df.columns if feature != "COUNTRY"]

nb_col = 6
nb_row = -(-len(features) // nb_col)  # ceiling division
fig, ax = plt.subplots(nb_row, nb_col, figsize=(14,14))

for i, feature in enumerate(features):
    i_col = i % nb_col
    i_row = i // nb_col
    ax1 = ax[i_row, i_col]
    
    ax1.set_title(feature)
    ax1.grid()
    df[feature].hist(bins= 30, ax=ax1, alpha=0.7)

plt.tight_layout()

fig, ax = plt.subplots()
ax1 = ax
y_df["TARGET"].hist(bins= 30, ax=ax1, alpha=0.7)
ax1.set_title("TARGET")

In [None]:
#Distribution data points by country
print("Nb of data points by country:")
print(df.COUNTRY.value_counts())

We aim to predict prices for France and Germany. To better understand the factors influencing price in each country, we will plot the data separately. This allows us to evaluate whether the factors driving price variations are consistent or differ between the two countries.

In [None]:
#Divide the data by country, drop the country column and fill missing values with the mean
df_de = df[df['COUNTRY'] == 'DE']
df_de = df_de.drop(columns=['COUNTRY'])
df_de.fillna(df_de.mean(), inplace=True)

df_fr = df[df.COUNTRY == "FR"]
df_fr = df_fr.drop(columns=['COUNTRY'])
df_fr.fillna(df_fr.mean(), inplace=True)

### Plot of the distribution of each column by country

In [None]:
nb_col = 6
nb_row = -(-len(features) // nb_col)  # ceiling division
fig, ax = plt.subplots(nb_row, nb_col, figsize=(14,14))

for i, feature in enumerate(features):
    i_col = i % nb_col
    i_row = i // nb_col
    ax1 = ax[i_row, i_col]
    
    ax1.set_title(feature)
    ax1.grid()
    df_fr[feature].hist(bins= 30, ax=ax1, alpha=0.7, label= "FR")
    df_de[feature].hist(bins= 30, ax=ax1, alpha=0.6, label= "DE")
    ax1.legend()

plt.tight_layout()

We observe that the **distributions of various features** are generally similar between **France** and **Germany**. However, the data associated with **Germany** exhibit greater **variability** across several variables. One possible explanation is a more **comprehensive** and **dense database** on the French side.  

It could be relevant to separate the data into **two distinct sets** to develop **specific models** for each country. Indeed, the **economic differences** between France and Germany are significant and can influence their respective behaviors, particularly regarding **electricity prices**. These differences are attributable to several factors, including:  

- **The structure of the energy mix**: France relies heavily on **nuclear energy**, whereas Germany has gradually reduced its use of nuclear energy in favor of **renewable sources** such as wind and solar. This creates differing dynamics in electricity supply and **sensitivity to weather conditions**.  

- **Energy policies**: The energy strategies differ between the two countries, with a faster energy transition in **Germany**, accompanied by **subsidy and regulatory policies** to promote renewables. In France, dependence on nuclear energy leads to a policy focus on **supply stability**.  

- **Climatic and geographical conditions**: Climatic variations, such as differences in **temperature** and **precipitation**, differently affect energy consumption in each country. For instance, France may experience higher heating demand in winter, while Germany, with its significant **wind power infrastructure**, is more impacted by wind variations.  

These differences suggest that a single model might not adequately capture the unique dynamics of each country. Therefore, an approach using **separate models** for **France** and **Germany** could better reflect their economic and energy-specific characteristics.


### Analysis of Data Availability 

In [None]:
F = df.DAY_ID.value_counts().sort_index()
F = F.reindex(range(F.index.max()))
F = F.fillna(0)
F = F.value_counts()
F /= F.sum()
fig, ax = plt.subplots(figsize=(4,1))
ax1 = ax
ax1.set_title("Number of points per day in time period")
F.plot(kind="barh", ax=ax1)
ax1.grid()

The distribution of days based on data availability (53% of days with data for both countries, 30% without data, and the remainder with partial data available for only one of the two countries) provides important insights into the quality and representativeness of the sample used in our analysis.  

This distribution highlights the need to adopt strategies for handling missing data. Two approaches can be considered:  

- **Separate models by country** for days where data is available for only one country.  
- **Imputation or exclusion of days without data**, depending on their proportion and the potential impact on the analysis.  

These observations call for caution when interpreting the results, as irregular data availability may affect conclusions regarding temporal phenomena or direct comparisons between France and Germany.



In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(x="COUNTRY", y="TARGET", data=df)
plt.title("Distribution of the target variable 'TARGET' by country")
plt.xlabel("COUNTRY")
plt.ylabel("TARGET")
plt.grid(True)
plt.show()

Examining the distribution of the target variable for both countries reveals that the median is similar and close to zero for both France and Germany. This suggests that the target values are well **centered around zero** in both cases, indicating a generally balanced distribution of positive and negative values. This centrality can be interpreted as follows:  

- **Absence of trend bias**: The proximity of the median to zero for both countries suggests there is no systematic bias in the target values. In other words, the values do not skew significantly toward positive or negative levels, which is favorable for comparative analyses or symmetric predictive models.  
- **Interquartile range**: The interquartile range is slightly narrower for France, meaning that most TARGET values are more concentrated around the median compared to Germany. However, the presence of numerous outliers in France significantly extends the range of extreme values, potentially requiring special handling to prevent undue influence on the models.  

#### Presence of Outliers  

Another striking aspect is the presence of **numerous outliers in the French data**, whereas extreme values are less frequent in Germany. This could reflect several dynamics:  

- **Increased variability in French data**: The higher frequency of extreme values in France might indicate greater volatility in the market or factors influencing the target variable.  
- **Influence of country-specific conditions**: France and Germany differ in infrastructure and energy mix (e.g., France's heavier reliance on nuclear energy compared to Germany). These structural differences could lead to extreme variations in certain energy indicators, especially during crises or periods of exceptional demand.  
- **Potential impact on predictive models**: These outliers could complicate modeling for the French data. Extreme values might influence the mean and variance, which could affect models sensitive to significant deviations. Techniques for managing outliers (e.g., data transformation or smoothing extreme values) may be necessary to enhance model robustness.  

#### Implications for Analysis  

The observations regarding centrality and outliers have several implications for analysis and modeling:  

1. **Strategies for managing extreme values**: The outliers in the French data may require special handling, such as outlier removal, normalization, or a variance-reducing transformation. Because TARGET takes both positive and negative values, a plain logarithm cannot be applied directly; a signed variant would be needed.  
2. **Country-specific approaches**: Since the target variable's distribution differs between France and Germany (particularly regarding outliers), it may be advisable to build separate models or apply differentiated preprocessing for each country.  
3. **Use of robust models**: In the presence of numerous outliers, more robust models, such as those based on the median (quantile regression) or ensemble algorithms (e.g., random forest or gradient boosting), might be better suited than standard linear models.  

#### Summary  

Although the French and German distributions of the target variable exhibit similar centrality, the higher frequency of outliers in France calls for a cautious approach. Proper handling of these outliers is essential to ensure the quality of results and the robustness of predictive models.
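
As a minimal sketch of the outlier handling discussed above, the snippet below flags values outside the usual 1.5 × IQR fences and clips them to those fences. The data are synthetic, and the helper name `iqr_bounds` and the factor `k=1.5` are illustrative assumptions rather than choices made in this project.

```python
import numpy as np
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the lower and upper fences Q1 - k*IQR and Q3 + k*IQR (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Synthetic target: mostly small values centered on zero, plus three extreme points
rng = np.random.default_rng(42)
target = pd.Series(
    np.concatenate([rng.normal(0.0, 0.5, 500), [8.0, -6.0, 5.5]]),
    name="TARGET",
)

low, high = iqr_bounds(target)
n_outliers = int(((target < low) | (target > high)).sum())

# Clipping keeps every row but caps extreme values at the fences
clipped = target.clip(lower=low, upper=high)

print(f"fences: [{low:.2f}, {high:.2f}], flagged points: {n_outliers}")
```

Clipping changes the values but largely preserves their ordering, so it tends to affect MSE-type metrics more than a rank-based metric such as Spearman's correlation.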


### Pearson's Correlation Matrix

In [None]:
plt.figure(figsize=(16, 10))
df_cleaned = df_general.drop(columns=['ID', 'DAY_ID','COUNTRY'])
pearson_corr = df_cleaned.corr(method='pearson')
sns.heatmap(pearson_corr, annot=True, fmt=".2f", cmap="coolwarm", cbar=True, annot_kws={"size": 8})
plt.title("Pearson's Correlation Matrix of the initial dataset")
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.yticks(fontsize=8)
plt.show()


### Spearman's Correlation Matrix

In [None]:
plt.figure(figsize=(16, 10))
df_cleaned = df.drop(columns=['ID', 'DAY_ID','COUNTRY'])
spearman_corr = df_cleaned.corr(method='spearman')
sns.heatmap(spearman_corr, annot=True, fmt=".2f", cmap="coolwarm", cbar=True, annot_kws={"size": 8})
plt.title("Spearman's Correlation Matrix of the initial dataset")   
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.yticks(fontsize=8)
plt.show()

By examining the **correlation matrices**, it is noticeable that the **Spearman** matrix shows stronger correlations with the '**TARGET**' row compared to the **Pearson** matrix. This difference arises because **Spearman** evaluates the **monotonic relationship** between variables, whether linear or not, based on their **ranks** rather than their absolute values. This method better captures relationships when the data is **nonlinear** or contains **outliers**, which might distort the **Pearson** correlation.
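
To illustrate the difference on a case where the answer is known, the sketch below compares both coefficients on a synthetic, strictly increasing but strongly non-linear relationship. The data are made up for the illustration and are unrelated to the project dataset.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Strictly increasing, strongly non-linear relationship
x = np.linspace(0, 5, 50)
y = np.exp(x)

pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)

# The ranks of x and y coincide, so Spearman is 1,
# while Pearson is pulled below 1 by the curvature.
print(f"Pearson: {pearson_r:.3f}, Spearman: {spearman_rho:.3f}")
```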


### Energy Production Distribution by Country

In [None]:
energy_sources = ['DE_COAL', 'FR_COAL', 'DE_GAS', 'FR_GAS', 'DE_NUCLEAR', 'FR_NUCLEAR', 'DE_WINDPOW', 'FR_WINDPOW']

n_rows = len(energy_sources) // 2

plt.figure(figsize=(14, n_rows * 5))

for i, source in enumerate(energy_sources, 1):
    plt.subplot(n_rows, 2, i)
    sns.histplot(df_general[source], bins=30, kde=True)
    plt.title(f'Histogram of {source}')
    plt.xlabel(source)
    plt.ylabel('Frequency')
    
plt.tight_layout()
plt.show()

These charts show that while certain energy sources, such as **wind power**, appear to be similarly distributed between **France** and **Germany**, the two countries clearly adopt **distinct energy strategies**. Notably, **France** stands out for its greater **stability** in **nuclear energy** production, while its **coal** production remains comparatively **low**.


### Spearman's Correlation Matrix for Data Related to France  

In [None]:
df_fr_cleaned=df_fr.drop(['ID','DAY_ID'],axis=1)
corr_matrix = df_fr_cleaned.corr()
plt.figure(figsize=(16, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
spearman_corr = df_fr_cleaned.corr(method='spearman')
sns.heatmap(spearman_corr, annot=True,mask=mask, fmt=".2f", cmap="magma", cbar=True, annot_kws={"size": 8})
plt.title("Triangular Spearman correlation matrix (France)")
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.yticks(fontsize=8)
plt.show()

### Spearman's Correlation Matrix for Data Related to Germany

In [None]:
df_de_cleaned=df_de.drop(['ID','DAY_ID'],axis=1)
corr_matrix = df_de_cleaned.corr()
plt.figure(figsize=(16, 10))

mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

spearman_corr = df_de_cleaned.corr(method='spearman')
sns.heatmap(spearman_corr,mask=mask, annot=True, fmt=".3f", cmap="magma", cbar=True, annot_kws={"size": 8})
plt.title("Triangular Spearman correlation matrix (Germany)")
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.yticks(fontsize=8)
plt.show()

## Model Evaluation

### Spearman's Correlation in Our Project

In this project, we aim to explain daily variations in electricity futures prices in Europe. Given the complex and non-linear relationships between factors like energy production, weather conditions, and market prices, **Spearman's correlation** is a suitable tool to evaluate our model's performance, alongside other metrics like R², MAE, and MSE.

### Why Use Spearman?

1. **Monotonicity**: The relationships between variables, such as the impact of gas prices on electricity prices, may be monotonic but non-linear. Spearman captures these monotonic trends, which Pearson, a measure of linear association, can understate.

2. **Robustness to Outliers**: Since the data can contain extreme values (e.g., sudden price spikes), Spearman is less sensitive to outliers, providing a more stable evaluation.

3. **Capturing Non-linear Relationships**: Variables like renewable energy production or electricity exchanges may have a non-linear effect. Spearman helps measure the monotonic relationship, even if it's not strictly linear.

### Spearman's Rank Correlation Formula

When there are no tied ranks, the Spearman rank correlation coefficient ($\rho$) can be calculated as:

$$
\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
$$

Where:
- $d_i$ is the difference between the ranks of corresponding values in the two datasets.
- $n$ is the total number of data pairs.
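
As a small worked example, the sketch below applies the formula to five made-up pairs without ties and compares the result with `scipy.stats.spearmanr`. The arrays `a` and `b` are illustrative values, not project data.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

a = np.array([0.3, -1.2, 2.5, 0.9, -0.4])
b = np.array([0.1, -0.8, 1.0, 1.9, -0.2])

rank_a = rankdata(a)  # ranks: [3, 1, 5, 4, 2]
rank_b = rankdata(b)  # ranks: [3, 1, 4, 5, 2]
d = rank_a - rank_b   # rank differences: [0, 0, 1, -1, 0]
n = len(a)

# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)) = 1 - 12 / 120 = 0.9
rho_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
rho_scipy, _ = spearmanr(a, b)

print(rho_formula, rho_scipy)
```

With tied values the shortcut formula is no longer exact; `spearmanr` handles that case by computing Pearson's correlation on average ranks.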

### Additional Evaluation Metrics

- **R²**: This metric gives the proportion of the variance in the target variable (the daily price variation) explained by the model. A higher R² indicates a better fit.
  
- **MAE (Mean Absolute Error)**: This measures the average magnitude of errors in predictions, providing insight into how close the model's predictions are to the actual values, with smaller values being preferable.
  
- **MSE (Mean Squared Error)**: This is the average of the squared prediction errors, which penalizes large errors more heavily. Like MAE, smaller values indicate better model performance.

Spearman's correlation is ideal for our project because it allows us to evaluate complex, non-linear trends in electricity prices while being robust to outliers. Alongside Spearman, we also use R², MAE, and MSE to provide a comprehensive view of our model's performance and ensure it captures the underlying dynamics of the electricity market effectively.
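
The four metrics can be computed side by side as in the sketch below, which uses made-up values for `y_true` and `y_pred`. It also shows why the metrics are complementary: the predictions rank the observations perfectly (Spearman of 1) even though they are not exact (R² below 1).

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative values only
y_true = np.array([0.5, -0.2, 1.0, -1.5, 0.1])
y_pred = np.array([0.4, 0.0, 0.7, -1.0, 0.3])

rho, _ = spearmanr(y_true, y_pred)
metrics = {
    "Spearman": rho,
    "R2": r2_score(y_true, y_pred),
    "MAE": mean_absolute_error(y_true, y_pred),
    "MSE": mean_squared_error(y_true, y_pred),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```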


### K-Fold Cross-Validation and Model Evaluation

We use **k-fold cross-validation** to assess model performance on multiple data splits. The procedure involves:

1. **Data Preparation**: Features (`X`) and target (`y`) are separated from the dataset.
2. **K-Fold Cross-Validation**: Data is split into `k` folds, with each fold used for testing once.
3. **Standardization**: Features are scaled to have zero mean and unit variance using **StandardScaler**.
4. **Model Training**: The model is trained on the training data and predictions are made on the test set.
5. **Evaluation Metrics**: We calculate **Spearman's correlation**, **R²**, **MAE**, and **MSE** to assess model accuracy.
6. **Result Aggregation**: Metrics from each fold are collected in a DataFrame and returned.

This method provides a robust evaluation of the model’s performance across different subsets of data.


In [None]:
from sklearn.model_selection import KFold
# Import timeseries split
from sklearn.model_selection import TimeSeriesSplit

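def kfoldCrossValidation(df, M, k):
    """Run k-fold cross-validation of model M on df.

    Returns the per-fold metrics as a DataFrame, together with the true and
    predicted targets of the last fold only.
    """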
    X = df.drop(columns=['ID', 'TARGET'], errors='ignore')
    y = df['TARGET']

    kfold = KFold(n_splits=k, shuffle=True, random_state=42)
    result = pd.DataFrame(columns=['Spearman', 'R2', 'MAE', 'MSE'])
  
    X = np.array(X)
    y = np.array(y)
  
    # Cross-validation loop
    for train_index, test_index in kfold.split(X):
        X_train, X_test = X[train_index], X[test_index]     
        y_train, y_test = y[train_index], y[test_index]

        # Data Standardization
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        # Training and prediction   
        M.fit(X_train, y_train)
        y_pred_test = M.predict(X_test)
    
        # Metrics calculation
        result_temp = pd.DataFrame([{
            'Spearman': spearmanr(y_test, y_pred_test).correlation, 
            'R2': M.score(X_test, y_test), 
            'MAE': np.mean(np.abs(y_test - y_pred_test)), 
            'MSE': np.mean((y_test - y_pred_test) ** 2)
        }])
    
        if not result_temp.isna().all().all():  
            result = pd.concat([result, result_temp], ignore_index=True)
    
    return result, y_test, y_pred_test
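
`kfoldCrossValidation` returns the true and predicted targets of its last fold only. If predictions for every row are wanted, one option is to collect out-of-fold predictions. The sketch below is an assumption about how this could be done with scikit-learn's `cross_val_predict` and a pipeline, on synthetic data; it is not the procedure used in this notebook.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy @ np.array([1.0, -0.5, 0.0, 0.25]) + rng.normal(scale=0.5, size=120)

# The pipeline refits the scaler on the training part of each fold,
# so no information from the held-out fold leaks into the scaling
pipeline = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=6, shuffle=True, random_state=42)

# Each row is predicted by the model trained on the folds that exclude it
y_oof = cross_val_predict(pipeline, X_toy, y_toy, cv=cv)
oof_spearman, _ = spearmanr(y_toy, y_oof)

print(f"Out-of-fold Spearman: {oof_spearman:.3f}")
```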


## Aggregation Model Function



Since we have separate models for each country, a function is required to aggregate the predictions from these two distinct machine learning models, each designed specifically to predict electricity prices in France and Germany. Each model is developed independently to account for the unique market dynamics, energy sources, and economic factors in each country.

### 1. Split the dataset by Country

The function split_dataset_by_country, which takes the original dataset as a parameter, enables splitting the dataset into two new datasets: df_fr (for the rows concerning France) and df_de (for the rows concerning Germany).

In [None]:
def split_dataset_by_country(df):
    df_fr = df[df['COUNTRY'] == 'FR']
    df_de = df[df['COUNTRY'] == 'DE']
    return df_fr, df_de

### 2. Drop Irrelevant Columns

We decided, for each country (and thus for each model), to focus on the columns specifically related to each country. The function drop_irrelevant_columns allows us to remove the columns related to Germany from the df_fr dataset and the columns related to France from the df_de dataset.

In [None]:
def drop_irrelevant_columns(df, country):
    if country == 'FR':
        columns_to_drop = ['COUNTRY', 'DE_CONSUMPTION', 'DE_FR_EXCHANGE', 'FR_NET_EXPORT',
                           'DE_NET_EXPORT', 'DE_NET_IMPORT', 'DE_GAS', 'DE_COAL', 'DE_HYDRO',
                           'DE_NUCLEAR', 'DE_SOLAR', 'DE_WINDPOW', 'DE_LIGNITE', 'DE_RESIDUAL_LOAD',
                           'DE_RAIN', 'DE_WIND', 'DE_TEMP']
    elif country == 'DE':
        columns_to_drop = ['COUNTRY', 'FR_CONSUMPTION', 'FR_DE_EXCHANGE', 'DE_NET_EXPORT',
                           'FR_NET_EXPORT', 'FR_NET_IMPORT', 'FR_GAS', 'FR_COAL', 'FR_HYDRO',
                           'FR_NUCLEAR', 'FR_SOLAR', 'FR_WINDPOW', 'FR_RESIDUAL_LOAD',
                           'FR_RAIN', 'FR_WIND', 'FR_TEMP']
    else:
        raise ValueError(f"Unknown country code: {country!r} (expected 'FR' or 'DE')")

    columns_to_drop = [col for col in columns_to_drop if col in df.columns]

    return df.drop(columns=columns_to_drop)

### 3. Fill Missing Values

Each dataset contains a certain number of missing values (as previously noted), and given the relatively limited amount of data, we couldn’t simply remove the rows with missing values. Therefore, we use the fill_missing_values function on df_fr and df_de individually to replace the missing values with the average of the column where the missing value is located.

In [None]:
def fill_missing_values(df):
    return df.fillna(df.mean())

### 4. Merging the results

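Although we trained models and made predictions separately for France and Germany, the final model to submit at the end of the challenge is a single model. The function `combine_results` computes the same metrics as before, this time on the French and German predictions taken together. Note that `kfoldCrossValidation` returns the true and predicted values of its last fold only, so the combined metrics are computed on those held-out folds rather than on the full dataset.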

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

def combine_results(y_test_fr, y_pred_fr, y_test_de, y_pred_de):
    # Stack the French and German values, resetting the index so that
    # true and predicted values stay aligned position by position
    y_true_combined = pd.concat([pd.Series(y_test_fr), pd.Series(y_test_de)], ignore_index=True)
    y_pred_combined = pd.concat([pd.Series(y_pred_fr), pd.Series(y_pred_de)], ignore_index=True)

    combined_r2 = r2_score(y_true_combined, y_pred_combined)
    combined_mse = mean_squared_error(y_true_combined, y_pred_combined)
    combined_spearman = spearmanr(y_true_combined, y_pred_combined).correlation
    combined_mae = np.mean(np.abs(y_true_combined - y_pred_combined))
    return combined_r2, combined_mse, combined_mae, combined_spearman

### 5. Final Function

Finally, we implement our global function, which sequentially utilizes the previously defined functions (1 to 4). This function provides the model's performance metrics for France, Germany, and the entire dataset.

The parameters `FeatureEngineering` and `DropIrrelevantColumns` are used later in the project and explained at that point.

In [None]:
def Final_model(df, model_fr, model_de, FeatureEngineering=None, DropIrrelevantColumns=True):

    # Step 1: split the dataset by country
    df_fr, df_de = split_dataset_by_country(df)

    # Step 2: drop irrelevant columns, or apply the feature engineering function if one is given
    if FeatureEngineering is None:
        if DropIrrelevantColumns:
            df_fr = drop_irrelevant_columns(df_fr, 'FR')
            df_de = drop_irrelevant_columns(df_de, 'DE')
        else:
            df_fr = df_fr.drop(columns=['COUNTRY'])
            df_de = df_de.drop(columns=['COUNTRY'])

    else:
        df_fr = df_fr.drop(columns=['COUNTRY'])
        df_de = df_de.drop(columns=['COUNTRY'])
        df_fr, df_de = FeatureEngineering(df_fr, df_de)

    # Step 3: replace missing values
    df_fr = fill_missing_values(df_fr)
    df_de = fill_missing_values(df_de)

    # Step 4: cross-validate one model per country
    result_fr, y_test_fr, y_pred_fr = kfoldCrossValidation(df_fr, model_fr, 6)
    result_de, y_test_de, y_pred_de = kfoldCrossValidation(df_de, model_de, 6)

    # Step 5: combine the results and evaluate
    combined_r2, combined_mse, combined_mae, combined_spearman = combine_results(y_test_fr, y_pred_fr, y_test_de, y_pred_de)
    result_global = pd.DataFrame([{
        'R2': combined_r2,
        'MSE': combined_mse,
        'MAE': combined_mae,
        'Spearman': combined_spearman
    }])

    return result_fr, result_de, result_global

## First Model Using Linear Regression

### Model only with country separation

Building on our prior efforts, it’s now straightforward to test and evaluate different models.

We first train a simple `LinearRegression` model for each country.

In [None]:
result_basic = Final_model(df, LinearRegression(), LinearRegression(), DropIrrelevantColumns=False)
result_basic[2]

We observe that our scores (Spearman correlation, R², and MSE) are relatively weak. However, context matters: in financial markets, price movements are driven by a very large number of factors with complex, non-linear relationships, so a model that fully explains electricity prices is out of reach today. For reference, the best model from the ENS QRT challenge addressing this problem achieved a Spearman correlation of 0.6. Our objective is to improve these evaluation metrics as much as possible.

Display the results by country:

In [None]:
print("Results for France:")
print(result_basic[0],"\n")
print("Mean results for France:")
result_basic[0].mean()

In [None]:
print("Results for Germany:")
print(result_basic[1],"\n")
print("Mean results for Germany:")
result_basic[1].mean()

As observed previously, it comes as no surprise that the German model outperforms the French one. This aligns with our earlier findings, where the features in the German dataset exhibited higher Spearman and Pearson correlation scores compared to those in the French dataset. Consequently, our model demonstrates better performance in Germany than in France.

### Model with country separation and removal of irrelevant columns

Next, we explore whether the model's performance improves when features specific to France are removed for the German model, and vice versa. This is done through the `DropIrrelevantColumns` parameter of `Final_model`, which calls `drop_irrelevant_columns` to remove these features.

In [None]:
result = Final_model(df,LinearRegression(), LinearRegression(), DropIrrelevantColumns=True)
print("Results for the overall data:")
result[2]

We observe a slightly improved score after applying this approach, so we will retain it for the remainder of the project.

Results for each country:

In [None]:
print("Results for France:")
print(result[0],"\n")
print("Mean results for France:")
result[0].mean()

In [None]:
print("Results for Germany:")
print(result[1],"\n")
print("Mean results for Germany:")
result[1].mean()

## Feature engineering

To give the models more information, we add new columns through feature engineering.

We first compute rolling statistics to capture trends and fluctuations over time. 
The rolling mean smooths the data to highlight longer-term trends, while the rolling standard deviation measures volatility within each window (7 and 30 observations, standing for roughly weekly and monthly periods). 
The slope, obtained by fitting a straight line to each rolling window, captures the direction and rate of change, indicating whether values are generally increasing or decreasing over time.

In [None]:
# Function to calculate rolling statistics (mean, std, median, min, max, slope) for DE_ and FR_ columns
def add_statistics(df, variables, windows):
    def slope(y):
        return np.polyfit(range(len(y)), y, 1)[0] if len(y) > 0 else np.nan
    
    df = df.sort_values(by='DAY_ID')
    
    for var in variables:
        for window in windows:
            for country in ['DE_', 'FR_']:
                col = f'{country}{var}'
                df[f'{col}_MEAN_{window}D'] = df[col].rolling(window=window).mean()
                df[f'{col}_STD_{window}D'] = df[col].rolling(window=window).std()
                df[f'{col}_MEDIAN_{window}D'] = df[col].rolling(window=window).median()
                df[f'{col}_MIN_{window}D'] = df[col].rolling(window=window).min()
                df[f'{col}_MAX_{window}D'] = df[col].rolling(window=window).max()
                df[f'{col}_SLOPE_{window}D'] = df[col].rolling(window=window).apply(slope, raw=True)
    return df
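
To make the rolling statistics concrete, here is a minimal sketch on a five-row toy frame. The values are invented, and the helper name `window_slope` is hypothetical. Rows are sorted by `DAY_ID` first so that each window covers consecutive days.

```python
import numpy as np
import pandas as pd


def window_slope(values):
    """Slope of the least-squares line fitted to one window (hypothetical helper)."""
    return np.polyfit(np.arange(len(values)), values, 1)[0]


toy = pd.DataFrame({
    "DAY_ID": [3, 1, 2, 5, 4],
    "FR_CONSUMPTION": [3.0, 1.0, 2.0, 5.0, 4.0],
})

# Rolling windows follow row order, so sort chronologically first
toy = toy.sort_values("DAY_ID").reset_index(drop=True)

toy["MEAN_3D"] = toy["FR_CONSUMPTION"].rolling(window=3).mean()
toy["SLOPE_3D"] = toy["FR_CONSUMPTION"].rolling(window=3).apply(window_slope, raw=True)

print(toy)
```

The first two rows of each new column are `NaN` because a window of 3 is not yet full. After sorting, the series is 1, 2, 3, 4, 5, so the rolling means are 2, 3, 4 and every slope is 1.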


Next, we add technical indicators widely used in finance to analyze time-series data, such as volatility and moving averages. Volatility, typically measured by a rolling standard deviation, quantifies the level of fluctuation or uncertainty in the data. Moving averages, especially exponential moving averages (EMA), smooth short-term fluctuations and highlight long-term trends or cycles, making it easier to detect underlying patterns in the data.

In [None]:
def add_indicators(df):
    for commodity in ['GAS_RET', 'COAL_RET', 'CARBON_RET']:
        df[f'{commodity}_VOLATILITY_WEEKLY'] = df[commodity].rolling(window=7).std()
        df[f'{commodity}_VOLATILITY_MONTHLY'] = df[commodity].rolling(window=30).std()
        df[f'{commodity}_EMA_MONTHLY'] = df[commodity].ewm(span=30, adjust=False).mean()
    
    return df
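
As a quick check of what these indicators compute, the sketch below applies them to a three-point toy series with deliberately small windows. With `adjust=False`, pandas uses the recursion `ema[t] = (1 - alpha) * ema[t-1] + alpha * x[t]` with `alpha = 2 / (span + 1)`. The values and the span of 3 are chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

returns = pd.Series([1.0, 3.0, 5.0], name="GAS_RET")

# span=3 gives alpha = 2 / (3 + 1) = 0.5
ema = returns.ewm(span=3, adjust=False).mean()
# ema[0] = 1.0
# ema[1] = 0.5 * 1.0 + 0.5 * 3.0 = 2.0
# ema[2] = 0.5 * 2.0 + 0.5 * 5.0 = 3.5

# Rolling volatility: sample standard deviation (ddof=1) over 2 points
volatility = returns.rolling(window=2).std()
# std of [1, 3] and of [3, 5] is sqrt(2)

print(ema.tolist(), volatility.tolist())
```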

Finally, we add domain-specific columns such as energy ratios, weather effects, and clustering, which provide further insight into underlying patterns. Weather effects, such as temperature or wind, are crucial in energy demand prediction, as they directly influence consumption and production.

In [None]:
# Function to add energy source ratios and effects
def add_energy(df):
    energy_sources = ['GAS', 'COAL', 'HYDRO', 'NUCLEAR', 'SOLAR', 'WINDPOW']
    
    for country in ['DE_', 'FR_']:
        total_energy = sum(df[f'{country}{source}'] for source in energy_sources)
        
        for source in energy_sources:
            df[f'{country}{source}_RATIO'] = df[f'{country}{source}'] / total_energy

        df[f'{country}WIND_SOLAR'] = df[f'{country}WINDPOW'] + df[f'{country}SOLAR']
        df[f'{country}TEMP_EFFECT'] = df[f'{country}TEMP'] * df[f'{country}CONSUMPTION']
        df[f'{country}WIND_EFFECT'] = df[f'{country}WIND'] * df[f'{country}WINDPOW']
        df[f'{country}SOLAR_EFFECT'] = (df[f'{country}SOLAR'] / df[f'{country}TEMP']).replace([np.inf, -np.inf], np.nan).fillna(0)
    
    return df

Now that all the column-adding functions are defined, we create a single function that calls them in turn.

In [None]:
# Main function to add custom features by combining all the previous functions
def add_columns(df):
    periods = [7, 30]  # Weekly and monthly
    variables = ['CONSUMPTION', 'GAS', 'COAL', 'HYDRO', 'NUCLEAR', 'SOLAR', 'WINDPOW', 'TEMP', 'RAIN', 'WIND']

    df = add_statistics(df, variables, periods)
    df = add_energy(df)
    df = add_indicators(df)

    df.fillna(df.mean(), inplace=True)
    return df

The following function computes the Spearman correlation matrix and flags features for removal when they are either highly correlated with another feature or weakly correlated with the target variable. Features meeting either condition are dropped, and the function returns the remaining ones.

In [None]:
def select_features_based_on_correlation(df, target_column, multicollinear_threshold, correlation_threshold):
    # Calculate the Spearman correlation matrix because it is the metric that we choose
    corr_matrix=df.corr(method='spearman')
    # Identify columns that are highly correlated with each other
    # (excluding the target variable correlation)
    high_corr_var=np.where(corr_matrix > multicollinear_threshold)
    high_corr_var=[(corr_matrix.index[x], corr_matrix.columns[y]) 
                     for x, y in zip(*high_corr_var) 
                     if x!=y and x < y]
    # Extract the names of columns to drop based on multicollinearity
    multicollinear_features=set([item for sublist in high_corr_var for item in sublist])
    # Identify features that have a low correlation with the target variable
    low_corr_with_target=corr_matrix[target_column][abs(corr_matrix[target_column]) < correlation_threshold].index.tolist()
    # Combine features to drop due to multicollinearity and low correlation with target
    features_to_drop=multicollinear_features.union(low_corr_with_target)
    # Determine the final list of features to keep
    features_to_keep=[feature for feature in df.columns if feature not in features_to_drop and feature!=target_column]
    
    return features_to_keep
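
As written, the function above places both members of every highly correlated pair in the drop set, and it compares the signed correlation to the threshold, so strong negative correlations are not flagged. A possible variant, sketched below under those observations, keeps the first feature of each pair and uses absolute correlations. The function name `drop_one_of_each_correlated_pair` and the toy data are hypothetical and not part of the project.

```python
import numpy as np
import pandas as pd

def drop_one_of_each_correlated_pair(df, target_column, threshold):
    """Hypothetical variant: drop only the later column of each correlated pair."""
    features = df.drop(columns=[target_column])
    corr = features.corr(method='spearman').abs()
    # Keep the strict upper triangle so that each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it is highly correlated with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return [col for col in features.columns if col not in to_drop]

# Toy data: A_COPY is almost identical to A, B is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'A': a,
    'A_COPY': a + rng.normal(scale=0.01, size=200),
    'B': rng.normal(size=200),
})
toy['TARGET'] = toy['A'] + toy['B']

kept = drop_one_of_each_correlated_pair(toy, 'TARGET', 0.93)
print(kept)  # ['A', 'B']
```

Keeping one representative per pair preserves the information carried by the pair, whereas dropping both can remove a useful signal entirely.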

Here, for each country's dataset (France and Germany), we drop the columns belonging to the other country, remove redundant or weakly correlated features, and keep some of the original columns, which can still have an impact on our prediction even if they are less correlated than the ‘artificial’ columns from feature engineering.

### Final function that will be used for the feature engineering

In [None]:
def FeatureEngineering(df_fr, df_de, df_fr_test=None, df_de_test=None):

    # We retain the columns in the main dataset, but not the columns with a high correlation with the others or a lower correlation with TARGET
    col_fr_best = select_features_based_on_correlation(df_fr, 'TARGET', 0.93, 0.08)

    # Select columns that do not start with 'DE' (to avoid mixing German data)
    df_no_de = df_fr[df_fr.columns[~df_fr.columns.str.startswith('DE')]]
    col_fr = [col for col in df_no_de.columns if col in col_fr_best]
    df_fr.fillna(df_fr.mean(), inplace=True)
    if df_fr_test is not None:
        df_fr_test.fillna(df_fr_test.mean(), inplace=True)

    # We retain the columns in the main dataset, but not the columns with a high correlation with the others or a lower correlation with TARGET
    col_de_best = select_features_based_on_correlation(df_de, 'TARGET', 0.93, 0.08)

    # Select columns that do not start with 'FR' (to avoid mixing French data)
    df_no_fr = df_de[df_de.columns[~df_de.columns.str.startswith('FR')]]
    col_de = [col for col in df_no_fr.columns if col in col_de_best]
    df_de.fillna(df_de.mean(), inplace=True)
    if df_de_test is not None:
        df_de_test.fillna(df_de_test.mean(), inplace=True)

    original_fr = df_fr[col_fr]
    original_de = df_de[col_de]

    # Apply feature engineering
    df_fr = add_columns(df_fr)
    df_de = add_columns(df_de)

    if df_fr_test is not None:
        original_fr_test = df_fr_test[col_fr]
        df_fr_test = add_columns(df_fr_test)
    
    if df_de_test is not None:
        original_de_test = df_de_test[col_de]
        df_de_test = add_columns(df_de_test)

    # Select the best columns based on correlation
    best_features_fr = select_features_based_on_correlation(df_fr, 'TARGET', 0.8, 0.08)
    best_features_de = select_features_based_on_correlation(df_de, 'TARGET', 0.8, 0.15)

    # Keep the selected columns while adding the original columns
    df_fr = pd.concat([original_fr, df_fr[best_features_fr+['TARGET']]], axis=1)
    df_de = pd.concat([original_de, df_de[best_features_de+['TARGET']]], axis=1)

    df_fr_test = pd.concat([original_fr_test, df_fr_test[best_features_fr]], axis=1) if df_fr_test is not None else None
    df_de_test = pd.concat([original_de_test, df_de_test[best_features_de]], axis=1) if df_de_test is not None else None

    if df_fr_test is not None and df_de_test is not None:
        return df_fr, df_de, df_fr_test, df_de_test
    else:
        return df_fr, df_de

## Model Using LinearRegression and Feature Engineering

#### Linear Regression

In [None]:
result = Final_model(df,LinearRegression(), LinearRegression(),FeatureEngineering, DropIrrelevantColumns=True)
result[2]

The Spearman correlation improved substantially, from 0.22 to almost 0.3: feature engineering improves the performance of our model.

In [None]:
result[0].mean()

In [None]:
result[1].mean()

## Model Using RandomForestRegressor

Let's experiment with the **RandomForestRegressor**, which is highly effective for capturing non-linear relationships. To maximize its potential, we use **GridSearchCV**, a technique that systematically tests multiple combinations of hyperparameters for the Random Forest and selects the best configuration for our model.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

rf = RandomForestRegressor(random_state=42)

grid_search_1 = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search_2 = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)

result = Final_model(df, grid_search_1, grid_search_2, FeatureEngineering, True)
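
`GridSearchCV` ranks candidates with the estimator's default score (R² for regressors) unless a `scoring` argument is given. Since this project is evaluated with Spearman correlation, a custom scorer may be a better fit. The sketch below runs on synthetic data; `spearman_score`, `X_demo`, `y_demo` and the reduced grid are illustrative names and values, not part of the project.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def spearman_score(y_true, y_pred):
    """Spearman rank correlation between targets and predictions."""
    return spearmanr(y_true, y_pred)[0]

# Higher correlation is better, which is make_scorer's default convention
spearman_scorer = make_scorer(spearman_score)

# Synthetic non-linear regression problem (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Deliberately small grid so that the example runs quickly
small_grid = {'n_estimators': [20, 50], 'max_depth': [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    small_grid,
    scoring=spearman_scorer,
    cv=3,
)
search.fit(X_demo, y_demo)

# best_score_ is now the mean cross-validated Spearman correlation
print(search.best_params_, round(search.best_score_, 3))
```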


In [None]:
print("GridSearchCV_1 best parameters:", grid_search_1.best_params_)
print("GridSearchCV_2 best parameters:", grid_search_2.best_params_)

In [None]:
result[2]

The **RandomForestRegressor** achieves a higher Spearman correlation than **Linear Regression**. It also improves the **R² score** (0.06 versus 0.035) and the **MSE**. Although the absolute R² remains low, these gains suggest that a non-linear model captures patterns in the data that a linear model misses.

Moving forward, we plan to improve the model’s performance by fine-tuning the **RandomForestRegressor** and optimizing feature engineering specifically tailored to this model. These enhancements are expected to further reinforce its advantage over Linear Regression, particularly in terms of Spearman Correlation, by fully leveraging its ability to capture non-linear relationships between variables and the target. Additionally, we intend to investigate more advanced and sophisticated models to drive further gains in predictive accuracy and overall performance.

## Output for Challenge submission

Since our project is part of a challenge, we need to apply our model to the `X_test` dataset and submit the resulting predictions. The function below handles this process and returns the predictions in the format expected for submission.

In [None]:
def output_for_submission(df, X_test_df,model_fr,model_de, feature_engineering = False):
    
    # Step 1 : datasets splitting
    df_fr, df_de = split_dataset_by_country(df)
    df_fr_test, df_de_test = split_dataset_by_country(X_test_df)

    start_df_fr_test = df_fr_test.copy()
    start_df_de_test = df_de_test.copy()


    if not feature_engineering:
        # Step 2: Dropping unnecessary columns
        df_fr = drop_irrelevant_columns(df_fr, 'FR')
        df_de = drop_irrelevant_columns(df_de, 'DE')
        df_fr_test = drop_irrelevant_columns(df_fr_test, 'FR')
        df_de_test = drop_irrelevant_columns(df_de_test, 'DE')

    else:
        df_fr = df_fr.drop(columns=['COUNTRY'])
        df_de = df_de.drop(columns=['COUNTRY'])

        df_fr_test = df_fr_test.drop(columns=['COUNTRY'])
        df_de_test = df_de_test.drop(columns=['COUNTRY'])

    # Step 3 : Missing values replacement
    df_fr = fill_missing_values(df_fr)
    df_de = fill_missing_values(df_de)

    df_fr_test = fill_missing_values(df_fr_test)
    df_de_test = fill_missing_values(df_de_test)

    if feature_engineering:
        df_fr, df_de, df_fr_test, df_de_test = FeatureEngineering(df_fr,df_de, df_fr_test, df_de_test) 

    # Separate the features from the target before scaling
    X_fr = df_fr.drop(columns=['TARGET'])
    X_de = df_de.drop(columns=['TARGET'])

    # Fit the scalers on the training features only
    scaler_fr = StandardScaler().fit(X_fr)
    scaler_de = StandardScaler().fit(X_de)
    df_fr_scaled = pd.DataFrame(scaler_fr.transform(X_fr), columns=X_fr.columns)
    df_de_scaled = pd.DataFrame(scaler_de.transform(X_de), columns=X_de.columns)

    model_fr.fit(df_fr_scaled, df_fr['TARGET'])
    model_de.fit(df_de_scaled, df_de['TARGET'])

    # Reuse the training scalers so that train and test share the same scale
    df_fr_test_scaled = pd.DataFrame(scaler_fr.transform(df_fr_test), columns=df_fr_test.columns)
    df_de_test_scaled = pd.DataFrame(scaler_de.transform(df_de_test), columns=df_de_test.columns)

    y_pred_fr = pd.DataFrame(model_fr.predict(df_fr_test_scaled))
    y_pred_de = pd.DataFrame(model_de.predict(df_de_test_scaled))

    y_pred_fr['ID'] = start_df_fr_test['ID'].values
    y_pred_de['ID'] = start_df_de_test['ID'].values

    #Rename columns
    y_pred_fr.columns = ['TARGET', 'ID']
    y_pred_de.columns = ['TARGET', 'ID']

    # Change the order of columns for the submission
    y_pred_fr = y_pred_fr[['ID', 'TARGET']]
    y_pred_de = y_pred_de[['ID', 'TARGET']]

    y_pred = pd.concat([y_pred_fr, y_pred_de], axis=0)
   
    return y_pred
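
A scikit-learn `Pipeline` is one way to bundle scaling and model so that the statistics learned on the training data are reused at prediction time. The sketch below uses made-up columns `A`, `B`, `C` and synthetic IDs; it illustrates the pattern and is not the project's submission code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic train and test sets (hypothetical columns, for illustration only)
rng = np.random.default_rng(0)
features = ['A', 'B', 'C']
train = pd.DataFrame(rng.normal(loc=5, scale=2, size=(100, 3)), columns=features)
train['TARGET'] = 2 * train['A'] - train['B'] + rng.normal(scale=0.1, size=100)
test = pd.DataFrame(rng.normal(loc=5, scale=2, size=(20, 3)), columns=features)
test['ID'] = np.arange(1000, 1020)

# The scaler is fitted inside the pipeline, on the training features only
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(train[features], train['TARGET'])

# predict() applies the already-fitted scaler to the test features
submission = pd.DataFrame({
    'ID': test['ID'].to_numpy(),
    'TARGET': pipe.predict(test[features]),
})
scaler = pipe.named_steps['standardscaler']
print(submission.head())
```

Because the scaler lives inside the pipeline, the same object can also be passed to `GridSearchCV`, which then refits the scaler on each training fold.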

In [None]:
output = output_for_submission(df, X_test_df,LinearRegression(), LinearRegression(),True)

In [None]:
output.to_csv('submission.csv', index=False)

## Future of the Project 🚀

The future of the project focuses on improving our model's performance and prediction accuracy. We plan to **fine-tune the RandomForest** 🌲, as it has the potential to capture non-linear relationships between the features and the target variable, which should give it a bigger edge over linear regression 📉. 

Additionally, we will explore more sophisticated models 🧠, such as **ensemble methods**, **gradient boosting**, and **neural networks** 🤖, to push the performance even further. As we continue, we’ll refine the **feature engineering** 🔧 process, optimize **hyperparameters** ⚙️, and incorporate **domain-specific insights** 📊 to enhance the model’s robustness and reliability.

Our ultimate goal 🎯 is to create a predictive model that can effectively handle the complex dynamics of the problem and achieve higher accuracy in forecasting 🌟.


In [None]:

def get_best_grid_parameters(data,model,parametre, FeatureEngineering=None, DropIrrelevantColumns=True):

    # Step 1 : datasets splitting
    df_fr, df_de = split_dataset_by_country(data)

    if FeatureEngineering is None:
        if DropIrrelevantColumns: 
            df_fr = drop_irrelevant_columns(df_fr, 'FR')
            df_de = drop_irrelevant_columns(df_de, 'DE')
        else:
            df_fr = df_fr.drop(columns=['COUNTRY'])
            df_de = df_de.drop(columns=['COUNTRY'])

    else:
        df_fr = df_fr.drop(columns=['COUNTRY'])
        df_de = df_de.drop(columns=['COUNTRY'])
        df_fr, df_de = FeatureEngineering(df_fr,df_de)

    # Step 3 : Missing values replacement
    df_fr = fill_missing_values(df_fr)
    df_de = fill_missing_values(df_de)
    
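    # Use Spearman correlation as the CV score so that best_score_ matches the
    # project metric (GridSearchCV falls back to the estimator's R² score otherwise)
    from sklearn.metrics import make_scorer
    spearman_scorer = make_scorer(lambda y_true, y_pred: spearmanr(y_true, y_pred)[0])

    grid1 = GridSearchCV(estimator=model, param_grid=parametre, scoring=spearman_scorer, cv=5, n_jobs=1)
    grid2 = GridSearchCV(estimator=model, param_grid=parametre, scoring=spearman_scorer, cv=5, n_jobs=1)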
    
    X_fr=df_fr.drop('TARGET', axis=1)
    X_de=df_de.drop('TARGET', axis=1)
    y_fr=df_fr['TARGET']
    y_de=df_de['TARGET']
    
    #X_train_fr, X_test_fr, y_train_fr, y_test_fr = train_test_split(X_fr, y_fr, test_size=0.2, random_state=42)
    #X_train_de, X_test_de, y_train_de, y_test_de = train_test_split(X_de, y_de, test_size=0.2, random_state=42)
    
    grid1.fit(X_fr,y_fr)
    grid2.fit(X_de,y_de)
    
    best_params_fr = grid1.best_params_
    best_spearman_fr = grid1.best_score_
    
    best_params_de = grid2.best_params_
    best_spearman_de = grid2.best_score_
    
    print("Best parameters for France: ", best_params_fr)
    print("Best score for France: ", best_spearman_fr)
    print("Best parameters for Germany: ", best_params_de)
    print("Best score for Germany: ", best_spearman_de)
    
    return best_spearman_fr, best_params_fr,best_spearman_de, best_params_de
    


In [None]:
# Load the data
X_df = pd.read_csv('X_train_NHkHMNU.csv', delimiter=',')
y_df = pd.read_csv('y_train_ZAN5mwg.csv', delimiter=',')

# Merge the training features and targets on the 'ID' column
df = pd.merge(X_df, y_df, on='ID')


In [None]:
df

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
rf = RandomForestRegressor(random_state=42)

best_spearman_fr, best_params_fr,best_spearman_de, best_params_de = get_best_grid_parameters(df,rf,param_grid)

In [None]:
%pip install optuna

In [None]:
import optuna
from sklearn.metrics import mean_squared_error

In [None]:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from scipy.stats import spearmanr  # Used to compute the Spearman correlation


In [None]:
def objectivefr(trial):
    # 1. Suggest hyperparameters
    lr = trial.suggest_float('learning_rate', 1e-4, 1e-1, log=True)
    batch_size = trial.suggest_int('batch_size', 16, 128)

    # 2. Build and train the model
    model = create_nn_model(input_dim=X_train_fr.shape[1], learning_rate=lr)
    model.fit(X_train_fr, y_train_fr, batch_size=batch_size, epochs=10, verbose=0)

    # 3. Predict on the held-out data
    y_pred_fr = model.predict(X_test_fr).flatten()

    # 4. Compute the Spearman correlation
    spearman_corr, _ = spearmanr(y_test_fr, y_pred_fr)

    # The study is created with direction='maximize', so return the correlation
    # itself (returning its negative would select the worst trial)
    return spearman_corr

In [None]:
def objectivede(trial):
    # 1. Suggest hyperparameters
    lr = trial.suggest_float('learning_rate', 1e-4, 1e-1, log=True)
    batch_size = trial.suggest_int('batch_size', 16, 128)

    # 2. Build and train the model
    model = create_nn_model(input_dim=X_train_de.shape[1], learning_rate=lr)
    model.fit(X_train_de, y_train_de, batch_size=batch_size, epochs=10, verbose=0)

    # 3. Predict on the held-out data
    y_pred_de = model.predict(X_test_de).flatten()

    # 4. Compute the Spearman correlation
    spearman_corr, _ = spearmanr(y_test_de, y_pred_de)

    # The study is created with direction='maximize', so return the correlation
    # itself (returning its negative would select the worst trial)
    return spearman_corr

In [None]:
def create_nn_model(input_dim, learning_rate):
    """
    Build a neural network model with configurable parameters.

    Args:
        input_dim (int): Number of input features.
        learning_rate (float): Learning rate for the optimizer.

    Returns:
        keras.Model: Compiled neural network model.
    """
    model = Sequential()
    model.add(Dense(64, input_dim=input_dim, activation='relu'))  # ReLU lets the network model non-linear relationships
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='linear'))  # Regression (continuous output)
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss='mse')  # Adam is the optimization algorithm used for training
    return model

In [None]:
# Load the data
X_df = pd.read_csv('X_train_NHkHMNU.csv', delimiter=',')
y_df = pd.read_csv('y_train_ZAN5mwg.csv', delimiter=',')

# Merge the training features and targets on the 'ID' column
df = pd.merge(X_df, y_df, on='ID')

In [None]:
# Function that splits a dataset by country
def split_dataset_by_country(dataframe):
    df_fr = dataframe[dataframe['COUNTRY'] == 'FR']
    df_de = dataframe[dataframe['COUNTRY'] == 'DE']
    return df_fr, df_de

In [None]:
# Split the data by country
df_fr, df_de = split_dataset_by_country(df)

In [None]:
df_fr = fill_missing_values(df_fr)
df_de = fill_missing_values(df_de)


In [None]:
# Keep the original test sets for later verification
#start_df_fr_test = df_fr_test.copy()
#start_df_de_test = df_de_test.copy()

In [None]:
df_fr = drop_irrelevant_columns(df_fr, 'FR')
df_de = drop_irrelevant_columns(df_de, 'DE')

In [None]:
# Extract the targets (TARGET) before standardization
y_fr = df_fr.pop('TARGET')
y_de = df_de.pop('TARGET')

In [None]:
# Standardize the data
scaler_fr = StandardScaler()
scaler_de = StandardScaler()

In [None]:
y_fr.shape

In [None]:
df_fr_scaled = pd.DataFrame(scaler_fr.fit_transform(df_fr), columns=df_fr.columns)
df_de_scaled = pd.DataFrame(scaler_de.fit_transform(df_de), columns=df_de.columns)

In [None]:
# Feature matrices for each country (the targets were extracted earlier)
X_fr = df_fr_scaled
X_de = df_de_scaled

In [None]:
df_fr_scaled.head()

In [None]:
X_fr.shape

In [None]:
# Split into training and test sets (80% / 20%)
X_train_fr, X_test_fr, y_train_fr, y_test_fr = train_test_split(X_fr, y_fr, test_size=0.2, random_state=42)
X_train_de, X_test_de, y_train_de, y_test_de = train_test_split(X_de, y_de, test_size=0.2, random_state=42)

In [None]:
# Run the Optuna optimization
study = optuna.create_study(direction='maximize')  # 'maximize': Optuna looks for the largest objective value
study.optimize(objectivefr, n_trials=100)  # 100 trials

# Display the best hyperparameters
print("Best hyperparameters for France:", study.best_params)
#print("Best objective value:", study.best_value)

In [None]:
# Run the Optuna optimization
study = optuna.create_study(direction='maximize')  # 'maximize': Optuna looks for the largest objective value
study.optimize(objectivede, n_trials=100)  # 100 trials

# Display the best hyperparameters
print("Best hyperparameters for Germany:", study.best_params)
#print("Best objective value:", study.best_value)
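
Tuning hyperparameters on the same split that is later used to report the score tends to make that score optimistic. A common alternative is a three-way split: tune on a validation set and report once on an untouched test set. The sketch below illustrates the idea with a plain random search over a `Ridge` penalty on synthetic data. It does not use Optuna or the neural network, and every name and value in it is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption, for illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 0.3]) + rng.normal(scale=0.5, size=300)

# 60% train, 20% validation, 20% test
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)

def objective(alpha):
    """Validation Spearman correlation for a given penalty (higher is better)."""
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    return spearmanr(y_val, model.predict(X_val))[0]

# Log-uniform sampling, similar in spirit to suggest_float(..., log=True)
alphas = 10 ** rng.uniform(-4, 2, size=30)
scores = [objective(alpha) for alpha in alphas]
best_alpha = alphas[int(np.argmax(scores))]  # maximize the correlation directly

# Refit on train + validation, then evaluate once on the untouched test set
final_model = Ridge(alpha=best_alpha).fit(X_tmp, y_tmp)
test_spearman = spearmanr(y_test, final_model.predict(X_test))[0]
print(f"best alpha: {best_alpha:.4g}, test Spearman: {test_spearman:.3f}")
```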

In [None]:
model_fr = create_nn_model(X_fr.shape[1],0.0001355209836354015)
model_de = create_nn_model(X_de.shape[1],0.20411449016100183)

In [None]:
model_fr.fit(X_train_fr, y_train_fr, epochs=20, batch_size=97, validation_data=(X_test_fr, y_test_fr), verbose=1)
model_de.fit(X_train_de, y_train_de, epochs=20, batch_size=97, validation_data=(X_test_de, y_test_de), verbose=1)

In [None]:
# Predictions
y_pred_fr = model_fr.predict(X_test_fr)
y_pred_de = model_de.predict(X_test_de)
print(y_pred_fr.size)


In [None]:
# Format the results for submission
y_pred_fr = pd.DataFrame(y_pred_fr, columns=['TARGET'])
y_pred_de = pd.DataFrame(y_pred_de, columns=['TARGET'])
#y_pred_fr['ID'] = start_df_fr_test['ID'].values
#y_pred_de['ID'] = start_df_de_test['ID'].values


In [None]:
y_pred_fr['ID'] = X_test_fr['ID'].values
y_pred_de['ID'] = X_test_de['ID'].values

In [None]:
# Reorder the columns for submission
y_pred_fr = y_pred_fr[['ID', 'TARGET']]
y_pred_de = y_pred_de[['ID', 'TARGET']]

In [None]:
# Combine the predictions from both countries
#y_pred = pd.concat([y_pred_fr, y_pred_de], axis=0)


In [None]:
# Sanity check on the shapes (optional)
print(f"Size of y_pred_fr: {y_pred_fr['TARGET'].shape}")
print(f"Size of y_fr: {y_fr.shape}")
print(f"Size of y_pred_de: {y_pred_de['TARGET'].shape}")
print(f"Size of y_de: {y_de.shape}")

In [None]:
# Spearman correlation
# Compute the Spearman correlation for France
correlation_fr, _ = spearmanr(y_pred_fr['TARGET'], y_test_fr)

# Compute the Spearman correlation for Germany
correlation_de, _ = spearmanr(y_pred_de['TARGET'], y_test_de)

# Display the Spearman correlation results
print(f"Spearman correlation for France: {correlation_fr:.4f}")
print(f"Spearman correlation for Germany: {correlation_de:.4f}")
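
Spearman correlation depends only on the ranks of the values, so a model is rewarded for ordering observations correctly rather than for matching their magnitude. The toy example below (made-up numbers) shows that a monotonic distortion of the target keeps the Spearman correlation at 1 while the Pearson correlation drops.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical targets and a monotonic distortion of them
y_true = np.array([-1.2, -0.3, 0.1, 0.8, 2.5])
y_pred = np.exp(y_true)  # same ordering, very different scale

rho = spearmanr(y_true, y_pred)[0]  # rank correlation
r = pearsonr(y_true, y_pred)[0]     # linear correlation

print(f"Spearman: {rho:.3f}, Pearson: {r:.3f}")
```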