## 1. Preliminary

### 1.1 Context

You work for the city of Seattle. To reach its goal of becoming a carbon-neutral city by 2050, your team is looking closely at the energy consumption and emissions of non-residential buildings.

City agents took careful readings in 2016. Here are [the data](https://s3.eu-west-1.amazonaws.com/course.oc-static.com/projects/Data_Scientist_P4/2016_Building_Energy_Benchmarking.csv) and [their source](https://data.seattle.gov/dataset/2016-Building-Energy-Benchmarking/2bpz-gwpy). These readings are expensive to obtain, however, so starting from those already collected, **you want to try to predict the CO2 emissions and total energy consumption of non-residential buildings** for which they have not yet been measured.

<div class="alert alert-block alert-info">
Your prediction will rely on the buildings' structural data (size and use of the buildings, construction date, geographic location, ...)
</div>

You also want to **assess the value of the "[ENERGY STAR Score](https://www.energystar.gov/buildings/facility-owners-and-managers/existing-buildings/use-portfolio-manager/interpret-your-results/what)" for predicting emissions**, since it is tedious to compute with the approach your team currently uses. You will include it in the modeling and judge whether it is worth keeping.

You have just come out of a briefing meeting with your team. Here is a summary of your mission:

 1) Carry out a short exploratory analysis.
 2) Test different prediction models to address the problem as well as possible.

Before leaving the briefing room, Douglas, the project lead, gives you a few pointers and pitfalls to avoid:

> Douglas: The goal is to do without future annual consumption readings (watch out for data leakage). In any case, we will take a baseline reading in the first year for every new building, so nothing stops you from deriving structural variables for the buildings from it, for example the nature and proportions of the energy sources used.
Pay close attention to how you handle the different variables, both to uncover new information (can anything interesting be inferred from a simple address?) and to improve performance by applying simple transformations to the variables (normalization, log transform, etc.).
Set up a rigorous evaluation of the regression's performance, and tune the hyperparameters and the choice of ML algorithms using cross-validation.
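
The log transform mentioned in the brief can be applied to a target without managing the inverse transform by hand. The sketch below is an illustration on synthetic data, not the notebook's pipeline: it assumes a strictly positive target and uses scikit-learn's `TransformedTargetRegressor` with `np.log` and `np.exp`. The names `X_demo`, `y_demo` and `log_model` are hypothetical.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive target that is linear in log space (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(1.0, 10.0, size=(200, 2))
y_demo = np.exp(0.5 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0.0, 0.1, size=200))

# The regressor is fitted on log(y); predictions are mapped back with exp
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
log_model.fit(X_demo, y_demo)

# Predictions come back on the original scale of the target
y_demo_pred = log_model.predict(X_demo)
print(y_demo_pred[:3])
```

With this wrapper, metrics can be computed on either scale, depending on whether the predictions or their logs are compared with the target.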


### 1.2 Requirements

In [1]:
package_list = ("pandas", "numpy", "matplotlib", "seaborn", "scikit-learn", "mlflow")

In [2]:
!python3 -V

Python 3.10.13


In [3]:
txt = !python3 -m pip freeze
check = lambda i: any([(pack in i) for pack in package_list])
txt = [i for i in txt if check(i)]
txt

['matplotlib==3.8.2',
 'matplotlib-inline==0.1.6',
 'mlflow==2.11.3',
 'numpy @ file:///Users/runner/miniforge3/conda-bld/numpy_1704280780572/work/dist/numpy-1.26.3-cp310-cp310-macosx_11_0_arm64.whl#sha256=f96d0b051b72345dbc317d793b2b34c7c4b7f41b0b791ffc93e820c45ba6a91c',
 'pandas==2.2.0',
 'scikit-learn==1.4.0',
 'seaborn==0.13.2']
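
The substring filter above also picks up `matplotlib-inline`, because `"matplotlib"` occurs inside that name. A possible alternative, sketched here under the assumption of Python 3.8 or later, queries each distribution by its exact name through the standard library's `importlib.metadata`. The helper `installed_versions` is hypothetical and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Same package list as in the notebook
package_list = ("pandas", "numpy", "matplotlib", "seaborn", "scikit-learn", "mlflow")
print(installed_versions(package_list))
```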

### 1.3 Imports

In [4]:
# built in
import os, warnings
import time

# data
import pandas as pd
import numpy as np

# metrics
from sklearn.metrics import mean_squared_error, r2_score

# estimators
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, Perceptron
from sklearn.neural_network import MLPRegressor

# model selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

# mlflow
import mlflow
import mlflow.sklearn

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# exceptions
from sklearn.exceptions import ConvergenceWarning

# pipeline and preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import KNNImputer



### 1.4 Graphics and options

In [5]:
# warnings.filterwarnings('ignore')
warnings.filterwarnings(action='once')

# Suppress specific warnings
warnings.filterwarnings("ignore", category=UserWarning, module='_distutils_hack')
warnings.filterwarnings("ignore", category=DeprecationWarning, module='importlib')

# Ignore ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

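<div class="alert alert-block alert-info">
Each warning is shown only once, and the known noisy ones (<code>_distutils_hack</code> user warnings, <code>importlib</code> deprecations, scikit-learn convergence warnings) are silenced.</div>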
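
The filters above apply to the whole session. When a warning should be silenced around a single call only, the standard library's `warnings.catch_warnings()` context manager restores the previous filters on exit. A minimal sketch, where `noisy_fit` is a hypothetical stand-in for any call that emits a warning:

```python
import warnings


def noisy_fit():
    # Stand-in for a call that emits a warning (hypothetical helper)
    warnings.warn("solver did not converge", UserWarning)
    return 42


# Filters set inside the block are undone when the block exits
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    result = noisy_fit()

print(result)
```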

### 1.5 Data

In [6]:
# !tree

In [7]:
# os.listdir()

In [8]:
path="./data/cleaned/"
filename="df_SiteEnergyUseWN_0.csv"

In [9]:
df = pd.read_csv(path+filename)
df.head()

| | Neighborhood | GroupedPrimaryPropertyTypes | YearBuilt | NumberofBuildings | NumberofFloors | PropertyGFABuilding(s) | ENERGYSTARScore | DistanceToDowntown | Log_SiteEnergyUseWN(kBtu) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DOWNTOWN | Bâtiments d'Hébergement | 1927.0 | 1.0 | 12.0 | 88434.0 | 60.0 | 0.864611 | 15.824652 |
| 1 | DOWNTOWN | Bâtiments d'Hébergement | 1996.0 | 1.0 | 11.0 | 88502.0 | 61.0 | 0.907278 | 15.974742 |
| 2 | DOWNTOWN | Bâtiments d'Hébergement | 1969.0 | 1.0 | 41.0 | 759392.0 | 43.0 | 1.047606 | 18.118725 |
| 3 | DOWNTOWN | Bâtiments d'Hébergement | 1926.0 | 1.0 | 10.0 | 61320.0 | 56.0 | 1.038057 | 15.753792 |
| 4 | DOWNTOWN | Bâtiments d'Hébergement | 1980.0 | 1.0 | 18.0 | 113580.0 | 75.0 | 1.100255 | 16.500395 |


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1379 entries, 0 to 1378
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Neighborhood                 1379 non-null   object 
 1   GroupedPrimaryPropertyTypes  1379 non-null   object 
 2   YearBuilt                    1379 non-null   float64
 3   NumberofBuildings            1379 non-null   float64
 4   NumberofFloors               1379 non-null   float64
 5   PropertyGFABuilding(s)       1379 non-null   float64
 6   ENERGYSTARScore              924 non-null    float64
 7   DistanceToDowntown           1379 non-null   float64
 8   Log_SiteEnergyUseWN(kBtu)    1379 non-null   float64
dtypes: float64(7), object(2)
memory usage: 97.1+ KB
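
According to the `df.info()` output, `ENERGYSTARScore` is the only column with missing values: 924 of 1379 rows are filled, so 455 rows (about 33%) have no score. The sketch below reproduces that arithmetic on a stand-in frame with the same counts; the placeholder values are not real data, and on the actual frame the equivalent call would be `df.isna().mean()`.

```python
import numpy as np
import pandas as pd

# Stand-in column reproducing the counts reported by df.info():
# 1379 rows, of which 924 have an ENERGYSTARScore
n_rows, n_present = 1379, 924
scores = np.full(n_rows, np.nan)
scores[:n_present] = 50.0  # placeholder value, not real data
demo = pd.DataFrame({"ENERGYSTARScore": scores})

# Fraction of missing values per column
missing_share = demo.isna().mean()
print(missing_share.round(3))
```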


## 2. Data Preparation

### 2.1 X & y

In [11]:
X = df.drop(columns="Log_SiteEnergyUseWN(kBtu)")
y = df['Log_SiteEnergyUseWN(kBtu)']

In [12]:
X.head()

| | Neighborhood | GroupedPrimaryPropertyTypes | YearBuilt | NumberofBuildings | NumberofFloors | PropertyGFABuilding(s) | ENERGYSTARScore | DistanceToDowntown |
|---|---|---|---|---|---|---|---|---|
| 0 | DOWNTOWN | Bâtiments d'Hébergement | 1927.0 | 1.0 | 12.0 | 88434.0 | 60.0 | 0.864611 |
| 1 | DOWNTOWN | Bâtiments d'Hébergement | 1996.0 | 1.0 | 11.0 | 88502.0 | 61.0 | 0.907278 |
| 2 | DOWNTOWN | Bâtiments d'Hébergement | 1969.0 | 1.0 | 41.0 | 759392.0 | 43.0 | 1.047606 |
| 3 | DOWNTOWN | Bâtiments d'Hébergement | 1926.0 | 1.0 | 10.0 | 61320.0 | 56.0 | 1.038057 |
| 4 | DOWNTOWN | Bâtiments d'Hébergement | 1980.0 | 1.0 | 18.0 | 113580.0 | 75.0 | 1.100255 |


In [13]:
y.head()

0    15.824652
1    15.974742
2    18.118725
3    15.753792
4    16.500395
Name: Log_SiteEnergyUseWN(kBtu), dtype: float64
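
The target is stored on a log scale, so every metric computed later is in log units. To read a value back in kBtu, the transform has to be inverted. The sketch below assumes the column was built with the natural logarithm and no offset; the column name does not say which, so this is an assumption to check against the cleaning notebook.

```python
import numpy as np

# First five target values shown by y.head()
y_log = np.array([15.824652, 15.974742, 18.118725, 15.753792, 16.500395])

# Assumption: the column was built with the natural log, so exp() inverts it
y_kbtu = np.exp(y_log)
print(y_kbtu)
```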

## 3. Models for different k-values (knn imputation)

In [14]:
# Configure MLflow
mlflow.set_experiment('Gradient_Boosting_Energy_k_val')

# Initialize the results DataFrame
results_df = pd.DataFrame(columns=['Model Name', 'Best Parameters', 'MSE Train', 'MSE Test', 'R2 Train', 'R2 Test', 'Best CV R2 Test', 'Fit Time'])

def train_eval_model(model, X_train, y_train, X_test, y_test, model_name, best_params=None, best_cv_score=None):
    # Set model parameters if provided
    if best_params:
        model.set_params(**best_params)

    # Record the start time
    start_time = time.time()
    # Train the model
    model.fit(X_train, y_train)
    # Compute the fitting time
    fit_time = time.time() - start_time

    # Predict on the training and test data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Compute the performance metrics
    mse_train = mean_squared_error(y_train, y_train_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)
    r2_train = r2_score(y_train, y_train_pred)
    r2_test = r2_score(y_test, y_test_pred)

    # Log the model's performance with MLflow
    with mlflow.start_run(run_name=f"{model_name}"):
        if best_params:
            mlflow.log_params(best_params)
        mlflow.log_metric("mse_train", mse_train)
        mlflow.log_metric("mse_test", mse_test)
        mlflow.log_metric("r2_train", r2_train)
        mlflow.log_metric("r2_test", r2_test)
        if best_cv_score is not None:
            mlflow.log_metric("best_cv_r2_test", best_cv_score)
        mlflow.sklearn.log_model(model, f"model_{model_name}")

        print(f"{model_name} - MSE Train: {mse_train}, MSE Test: {mse_test}, R2 Train: {r2_train}, R2 Test: {r2_test}")

    # # Plotting the predicted vs real values
    # plt.figure(figsize=(10, 6))
    # plt.scatter(y_test, y_test_pred, alpha=0.5)
    # plt.xlabel('Real Values')
    # plt.ylabel('Predicted Values')
    # plt.title(f'Comparison of Real and Predicted Values for {model_name}')
    # plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)  # Diagonal line
    # plt.savefig(f'predicted_vs_real_{model_name}_v1.png')
    # plt.show()
    
    
    # # Compute the residuals
    # residuals = y_test - y_test_pred

    # # Residual plot
    # plt.figure(figsize=(10, 6))
    # plt.scatter(y_test_pred, residuals, alpha=0.5)
    # plt.xlabel('Predicted Values')
    # plt.ylabel('Residuals')
    # plt.title(f'Residual Plot for {model_name}')
    # plt.axhline(y=0, color='k', linestyle='--')  # Horizontal line at zero
    # plt.savefig(f'residual_plot_{model_name}_v1.png')
    # plt.show()

    # # Histogram of the residuals
    # plt.figure(figsize=(10, 6))
    # plt.hist(residuals, bins=30, edgecolor='k', alpha=0.7)
    # plt.xlabel('Residuals')
    # plt.ylabel('Frequency')
    # plt.title(f'Histogram of Residuals for {model_name}')
    # plt.savefig(f'residuals_histogram_{model_name}_v1.png')
    # plt.show()

    # # KDE plot of the residuals
    # plt.figure(figsize=(10, 6))
    # sns.kdeplot(residuals, fill=True)  # `fill` replaces the deprecated `shade`
    # plt.xlabel('Residuals')
    # plt.ylabel('Density')
    # plt.title(f'KDE Plot of Residuals for {model_name}')
    # plt.savefig(f'residuals_kde_{model_name}_v1.png')
    # plt.show()


    # Record the results in the shared DataFrame
    global results_df
    new_row = pd.DataFrame([{
        'Model Name': model_name,
        'Best Parameters': best_params,
        'MSE Train': mse_train,
        'MSE Test': mse_test,
        'R2 Train': r2_train,
        'R2 Test': r2_test,
        'Best CV R2 Test': best_cv_score,
        'Fit Time': fit_time
    }])
    # Concatenating onto an empty frame raises a pandas FutureWarning, so the first row seeds the frame
    results_df = new_row if results_df.empty else pd.concat([results_df, new_row], ignore_index=True)


2024/04/24 07:36:47 INFO mlflow.tracking.fluent: Experiment with name 'Gradient_Boosting_Energy_k_val' does not exist. Creating a new experiment.
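
`train_eval_model` appends to a global `results_df`, which ties the function to the notebook's state. One possible alternative, sketched here without the MLflow logging and on synthetic data, is to return the metrics as a plain dict and build the DataFrame once from a list of rows. The helper `evaluate_model` and the demo variables are hypothetical.

```python
import time

import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, y_train, X_test, y_test, model_name):
    """Fit the model and return its metrics as a plain dict (no global state)."""
    start = time.time()
    model.fit(X_train, y_train)
    fit_time = time.time() - start
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    return {
        "Model Name": model_name,
        "MSE Train": mean_squared_error(y_train, train_pred),
        "MSE Test": mean_squared_error(y_test, test_pred),
        "R2 Train": r2_score(y_train, train_pred),
        "R2 Test": r2_score(y_test, test_pred),
        "Fit Time": fit_time,
    }


# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Collect one dict per model, then build the frame in a single step
rows = [
    evaluate_model(DummyRegressor(), X_tr, y_tr, X_te, y_te, "Dummy"),
    evaluate_model(LinearRegression(), X_tr, y_tr, X_te, y_te, "Linear"),
]
results_demo = pd.DataFrame(rows)
print(results_demo)
```

Building the frame from a list also avoids repeated `pd.concat` calls inside a loop.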


In [15]:
for k in range(15, 30):
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=0.3,
                                                        random_state=42)

    numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = X_train.select_dtypes(include=['object', 'category']).columns

    pipe_trans_num = Pipeline(steps=[('scaler', StandardScaler())])
    pipe_trans_cat = Pipeline(steps=[('ohe', OneHotEncoder())])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', pipe_trans_num, numeric_features),
            ('cat', pipe_trans_cat, categorical_features)
        ])

    k_val = k

    model_pipelines = {
        f"GradientBoosting_k_{k}": Pipeline([("preprocessor", preprocessor), ("imputer", KNNImputer(n_neighbors=k_val)), ("model", GradientBoostingRegressor())])
    }
    
    param_grids = {
        f"GradientBoosting_k_{k}": {'model__n_estimators': [150], 'model__learning_rate': [0.1], 
                            'model__max_depth': [2], 
                            'model__min_samples_split': [20], 'model__min_samples_leaf':[10], 
                            'model__max_features': [.75]}
    }

    # Search for the best hyperparameters and record the best cross-validation score
    best_params = {}
    best_cv_scores = {}

    for model_name, param_grid in param_grids.items():
        model = model_pipelines[model_name]

        if param_grid:
            grid_search = GridSearchCV(model, param_grid, cv=5, scoring='r2', verbose=1)
            grid_search.fit(X_train, y_train)
            best_params[model_name] = grid_search.best_params_
            # best_score_ is the mean cross-validated score of the best candidate
            best_cv_scores[model_name] = grid_search.best_score_
        else:
            best_params[model_name] = None
            best_cv_scores[model_name] = None

    # Train, evaluate and log the best models
    for model_name, params in best_params.items():
        model = model_pipelines[model_name]

        train_eval_model(model, X_train, y_train, X_test, y_test, model_name, params, best_cv_scores[model_name])



Fitting 5 folds for each of 1 candidates, totalling 5 fits
GradientBoosting_k_15 - MSE Train: 0.30143983241375194, MSE Test: 0.39119495406397103, R2 Train: 0.8258603851685943, R2 Test: 0.7314639295655931
Fitting 5 folds for each of 1 candidates, totalling 5 fits




GradientBoosting_k_16 - MSE Train: 0.3034979684838686, MSE Test: 0.3948662163516336, R2 Train: 0.8246714148203464, R2 Test: 0.7289437888070758
Fitting 5 folds for each of 1 candidates, totalling 5 fits
GradientBoosting_k_17 - MSE Train: 0.3000495293912457, MSE Test: 0.3976022437016386, R2 Train: 0.8266635531869001, R2 Test: 0.7270656407749023
Fitting 5 folds for each of 1 candidates, totalling 5 fits
GradientBoosting_k_18 - MSE Train: 0.3076554600365407, MSE Test: 0.394858580801859, R2 Train: 0.8222696619668834, R2 Test: 0.7289490302359612
Fitting 5 folds for each of 1 candidates, totalling 5 fits
GradientBoosting_k_19 - MSE Train: 0.30332287267236563, MSE Test: 0.39245120817488527, R2 Train: 0.8247725664064842, R2 Test: 0.730601572986329
Fitting 5 folds for each of 1 candidates, totalling 5 fits
GradientBoosting_k_20 - MSE Train: 0.2994447314892801, MSE Test: 0.3989778344373744, R2 Train: 0.8270129405683082, R2 Test: 0.7261213654797767
Fitting 5 folds for each of 1 candidates, totalli
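
The loop above rebuilds the pipeline for each `k` and runs a single-candidate grid search every time. Because the imputer is a pipeline step, `k` could instead be placed in the parameter grid as `imputer__n_neighbors`, so that it is compared within one cross-validation. The sketch below illustrates this on synthetic numeric data with a reduced grid; it is not the notebook's run, and the demo names and values are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data with missing values in one column (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0.0, 0.1, size=200)
X_demo[rng.random(200) < 0.3, 1] = np.nan

pipe = Pipeline([
    ("scaler", StandardScaler()),  # ignores NaN when fitting, keeps it in the output
    ("imputer", KNNImputer()),
    ("model", GradientBoostingRegressor(n_estimators=50, random_state=0)),
])

# k is tuned like any other hyperparameter, inside the same cross-validation
param_grid = {"imputer__n_neighbors": [5, 15, 25]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

Fixing `random_state` on the regressor, as done here, keeps the comparison between `k` values from being blurred by the model's own randomness.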

In [16]:
results_df.head(50)

| | Model Name | Best Parameters | MSE Train | MSE Test | R2 Train | R2 Test | Best CV R2 Test | Fit Time |
|---|---|---|---|---|---|---|---|---|
| 0 | GradientBoosting_k_15 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.30144 | 0.391195 | 0.82586 | 0.731464 | 0.764807 | 0.163615 |
| 1 | GradientBoosting_k_16 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.303498 | 0.394866 | 0.824671 | 0.728944 | 0.761582 | 0.369276 |
| 2 | GradientBoosting_k_17 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.30005 | 0.397602 | 0.826664 | 0.727066 | 0.765888 | 0.372341 |
| 3 | GradientBoosting_k_18 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.307655 | 0.394859 | 0.82227 | 0.728949 | 0.762644 | 0.339549 |
| 4 | GradientBoosting_k_19 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.303323 | 0.392451 | 0.824773 | 0.730602 | 0.762513 | 0.342522 |
| 5 | GradientBoosting_k_20 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.299445 | 0.398978 | 0.827013 | 0.726121 | 0.761957 | 0.339938 |
| 6 | GradientBoosting_k_21 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.305761 | 0.396127 | 0.823364 | 0.728079 | 0.758803 | 0.382661 |
| 7 | GradientBoosting_k_22 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.299345 | 0.393628 | 0.827071 | 0.729794 | 0.757642 | 0.30842 |
| 8 | GradientBoosting_k_23 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.300743 | 0.398525 | 0.826263 | 0.726432 | 0.761511 | 0.315599 |
| 9 | GradientBoosting_k_24 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.301701 | 0.389311 | 0.82571 | 0.732757 | 0.760058 | 0.328517 |
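
In the rows displayed above, `R2 Test` barely moves as `k` changes. Since `GradientBoostingRegressor()` is created without a `random_state` and `max_features=.75` samples features at each split, part of that variation probably comes from the model's own randomness rather than from `k`. The sketch below only quantifies the spread of the ten displayed values; it does not rerun any model.

```python
import numpy as np

# "R2 Test" values copied from the ten rows displayed above (k = 15 to 24)
r2_test_by_k = np.array([
    0.731464, 0.728944, 0.727066, 0.728949, 0.730602,
    0.726121, 0.728079, 0.729794, 0.726432, 0.732757,
])

# Range between the best and worst test score across k
spread = r2_test_by_k.max() - r2_test_by_k.min()
print(f"min={r2_test_by_k.min():.6f} max={r2_test_by_k.max():.6f} spread={spread:.6f}")
```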


In [17]:
# Retrieve the current MLflow experiment name
mlflow_experiment_name = mlflow.get_experiment_by_name('Gradient_Boosting_Energy_k_val').name

# Construct a filename using the experiment name
filename = f"{mlflow_experiment_name}_results.csv"

# Save the DataFrame
results_df.to_csv(filename, index=False)
print(f"Results saved as {filename}")

Results saved as Gradient_Boosting_Energy_k_val_results.csv


## 4. Models for different train_test_split

In [18]:
# Configure MLflow
mlflow.set_experiment('Gradient_Boosting_Energy_train_test')

# Reset the results DataFrame for this experiment
results_df = pd.DataFrame(columns=['Model Name', 'Best Parameters', 'MSE Train', 'MSE Test', 'R2 Train', 'R2 Test', 'Best CV R2 Test', 'Fit Time'])

# train_eval_model is reused unchanged from In [14] (section 3)


2024/04/24 07:38:07 INFO mlflow.tracking.fluent: Experiment with name 'Gradient_Boosting_Energy_train_test' does not exist. Creating a new experiment.


In [19]:
for k in [15]:
    for i in range(50):
        X_train, X_test, y_train, y_test = train_test_split(X,
                                                            y,
                                                            test_size=0.3,
                                                            random_state=i
                                                            )

        numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns
        categorical_features = X_train.select_dtypes(include=['object', 'category']).columns

        pipe_trans_num = Pipeline(steps=[('scaler', StandardScaler())])
        pipe_trans_cat = Pipeline(steps=[('ohe', OneHotEncoder())])

        preprocessor = ColumnTransformer(
            transformers=[
                ('num', pipe_trans_num, numeric_features),
                ('cat', pipe_trans_cat, categorical_features)
            ])

        k_val = k

        model_pipelines = {
            f"GradientBoosting_train_test_k_{k_val}_{i}": Pipeline([("preprocessor", preprocessor), ("imputer", KNNImputer(n_neighbors=k_val)), ("model", GradientBoostingRegressor())])
        }
        
        param_grids = {
            f"GradientBoosting_train_test_k_{k_val}_{i}": {'model__n_estimators': [150], 'model__learning_rate': [0.1], 
                            'model__max_depth': [2], 
                            'model__min_samples_split': [20], 'model__min_samples_leaf':[10], 
                            'model__max_features': [.75]}
                    }

        # Search for the best hyperparameters and record the best cross-validation score
        best_params = {}
        best_cv_scores = {}

        for model_name, param_grid in param_grids.items():
            model = model_pipelines[model_name]

            if param_grid:
                grid_search = GridSearchCV(model, param_grid, cv=5, scoring='r2', verbose=1)
                grid_search.fit(X_train, y_train)
                best_params[model_name] = grid_search.best_params_
                # best_score_ is the mean cross-validated score of the best candidate
                best_cv_scores[model_name] = grid_search.best_score_
            else:
                best_params[model_name] = None
                best_cv_scores[model_name] = None

        # Train, evaluate and log the best models
        for model_name, params in best_params.items():
            model = model_pipelines[model_name]

            train_eval_model(model, X_train, y_train, X_test, y_test, model_name, params, best_cv_scores[model_name])



Fitting 5 folds for each of 1 candidates, totalling 5 fits


GradientBoosting_train_test_k_15_0 - MSE Train: 0.3111165240969761, MSE Test: 0.3779214267737809, R2 Train: 0.8142197669113527, R2 Test: 0.7636348877462965
Fitting 5 folds for each of 1 candidates, totalling 5 fits




GradientBoosting_train_test_k_15_1 - MSE Train: 0.3060617484125429, MSE Test: 0.36698360600074414, R2 Train: 0.8214908312489038, R2 Test: 0.7562956193510273
Fitting 5 folds for each of 1 candidates, totalling 5 fits
GradientBoosting_train_test_k_15_2 - MSE Train: 0.2840678465890933, MSE Test: 0.4281828139920785, R2 Train: 0.8275924082301028, R2 Test: 0.7415187751723378
Fitting 5 folds for each of 1 candidates, totalling 5 fits
GradientBoosting_train_test_k_15_3 - MSE Train: 0.2912765879509667, MSE Test: 0.4337219071594727, R2 Train: 0.8236002477509149, R2 Test: 0.7376678415597557
Fitting 5 folds for each of 1 candidates, totalling 5 fits
GradientBoosting_train_test_k_15_4 - MSE Train: 0.30659395791166416, MSE Test: 0.4036678962937263, R2 Train: 0.8201598693462702, R2 Test: 0.7359000384009782
Fitting 5 folds for each of 1 candidates, totalling 5 fits
GradientBoosting_train_test_k_15_5 - MSE Train: 0.31081352464905104, MSE Test: 0.39597168486354184, R2 Train: 0.8138524211724316, R2 Test:
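
Looping over `random_state` amounts to repeated random train/test splitting. scikit-learn expresses the same idea with `ShuffleSplit` passed as the `cv` argument of `cross_val_score`. The sketch below uses synthetic data, fewer splits, and no inner grid search or MLflow logging, so it is a simplified illustration rather than a replacement for the cell above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] - X_demo[:, 1] + rng.normal(0.0, 0.2, size=200)

# 10 independent 70/30 splits (the notebook's loop uses 50)
splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
model = GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=0)

# One R2 score per split, each computed on that split's held-out 30%
scores = cross_val_score(model, X_demo, y_demo, cv=splitter, scoring="r2")
print(f"R2 test: mean={scores.mean():.3f}, std={scores.std(ddof=1):.3f}")
```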

In [20]:
results_df.head(50)

| | Model Name | Best Parameters | MSE Train | MSE Test | R2 Train | R2 Test | Best CV R2 Test | Fit Time |
|---|---|---|---|---|---|---|---|---|
| 0 | GradientBoosting_train_test_k_15_0 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.311117 | 0.377921 | 0.81422 | 0.763635 | 0.741929 | 0.371997 |
| 1 | GradientBoosting_train_test_k_15_1 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.306062 | 0.366984 | 0.821491 | 0.756296 | 0.742771 | 0.327622 |
| 2 | GradientBoosting_train_test_k_15_2 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.284068 | 0.428183 | 0.827592 | 0.741519 | 0.75414 | 0.318077 |
| 3 | GradientBoosting_train_test_k_15_3 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.291277 | 0.433722 | 0.8236 | 0.737668 | 0.763727 | 0.316853 |
| 4 | GradientBoosting_train_test_k_15_4 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.306594 | 0.403668 | 0.82016 | 0.7359 | 0.757692 | 0.343059 |
| 5 | GradientBoosting_train_test_k_15_5 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.310814 | 0.395972 | 0.813852 | 0.754116 | 0.741709 | 0.320227 |
| 6 | GradientBoosting_train_test_k_15_6 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.269955 | 0.469309 | 0.831632 | 0.733817 | 0.757759 | 0.273239 |
| 7 | GradientBoosting_train_test_k_15_7 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.29216 | 0.430513 | 0.819153 | 0.74877 | 0.750147 | 0.321368 |
| 8 | GradientBoosting_train_test_k_15_8 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.301501 | 0.389795 | 0.814177 | 0.772903 | 0.730891 | 0.328207 |
| 9 | GradientBoosting_train_test_k_15_9 | {'model__learning_rate': 0.1, 'model__max_dept... | 0.303869 | 0.415779 | 0.812034 | 0.760241 | 0.757855 | 0.338388 |
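
The rows above show how much `R2 Test` depends on the split alone, with the hyperparameters held fixed. A short sketch summarizing the ten displayed values (only those shown, not the full 50 runs) gives a sample standard deviation of about 0.013. That is larger than the whole range observed across `k` values in section 3, which suggests the choice of `k` matters less than the luck of the split.

```python
import numpy as np

# "R2 Test" values copied from the ten rows displayed above (random_state 0 to 9)
r2_test_by_split = np.array([
    0.763635, 0.756296, 0.741519, 0.737668, 0.7359,
    0.754116, 0.733817, 0.74877, 0.772903, 0.760241,
])

mean_r2 = r2_test_by_split.mean()
std_r2 = r2_test_by_split.std(ddof=1)  # sample standard deviation
print(f"mean={mean_r2:.4f} std={std_r2:.4f}")
```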


In [21]:
# Retrieve the current MLflow experiment name
mlflow_experiment_name = mlflow.get_experiment_by_name('Gradient_Boosting_Energy_train_test').name

# Construct a filename using the experiment name
filename = f"{mlflow_experiment_name}_results.csv"

# Save the DataFrame
results_df.to_csv(filename, index=False)
print(f"Results saved as {filename}")

Results saved as Gradient_Boosting_Energy_train_test_results.csv


In [22]:
# !mlflow ui