## Notebook 2: Machine Learning, Regression: CO2 Pollution

#### Importing libraries

In [1]:
import pandas as pd
import numpy as np
#-------------------------------------------------

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

#-------------------------------------------------
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import TransformedTargetRegressor, make_column_transformer

#--------------------------------------------------------
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor

#--------------------------------------------------------
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.inspection import permutation_importance
#------------------------------------------------
from xgboost import XGBRegressor

#-----------------------------------------------------------
from sklearn.model_selection import GridSearchCV

#-----------------------------------------------------------
from joblib import dump

## Checklist

Since 2001, **ADEME** has acquired this data every year from the **Union Technique de l'Automobile, du motocycle et du Cycle (UTAC)**, which is responsible for vehicle type approval before sale, in agreement with the Ministry of Sustainable Development.
For each vehicle, the original data (supplied by UTAC) are as follows:

* **Fuel consumption**

* **Carbon dioxide (CO2) emissions**

* **Emissions of air pollutants** (regulated under the Euro standard)

* **All technical characteristics of the vehicles** (range, make, model, CNIT number, energy type ...)



# Inventory of relevant variables

The data include the following relevant variables:

* **lib_mrq_utac**: the make; there are 12.

* **lib_mod**: the commercial model; there are 20.

* **cod_cbr**: the fuel type; there are 5.

* **hybride**: flag identifying hybrid vehicles (O/N)

* **puiss_max**: maximum power

* **typ_boite_nb_rapp**: gearbox type and number of gears

* **conso_urb**: urban fuel consumption (in l/100 km)

* **conso_exurb**: extra-urban fuel consumption (in l/100 km)

* **conso_mixte**: combined fuel consumption (in l/100 km)

* **co2**: CO2 emissions (in g/km)

* **masse_ordma_min**: minimum mass in running order

* **masse_ordma_max**: maximum mass in running order

* **Carrosserie**: body type

* **gamme**: vehicle range



# Objective

Our main objective in this Notebook 2 is to:

Predict vehicle **CO2** emissions from selected information (explanatory variables)

* Using 6 different models

* Comparing the scores

* Choosing the best model


# Data description

Link to the data: https://www.data.gouv.fr/fr/datasets/emissions-de-co2-et-de-polluants-des-vehicules-commercialises-en-france/


### Loading, reading, previewing and inspecting the data

In [2]:
# import data
data_model = pd.read_csv("data_model.csv")

In [3]:
# check if import is correct
data_model.sample(7)

| | lib_mrq | cnit | cod_cbr | hybride | puiss_max | Carrosserie | gamme | co2 | masse_ordma_min | masse_ordma_max | Type_boite | Nb_rapp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7485 | MERCEDES | M10MCDVPF12S772 | GO | non | 95 | MINIBUS | MOY-INFER | 216.0 | 2186.0 | 2355.0 | A | 7.0 |
| 50871 | VOLKSWAGEN | M10VWGVPC47M876 | GO | non | 100 | MINIBUS | MOY-INFER | 209.0 | 2219.0 | 2815.0 | M | 6.0 |
| 23164 | MERCEDES | M10MCDVP6058635 | GO | non | 120 | MINIBUS | MOY-SUPER | 193.0 | 2186.0 | 2350.0 | M | 6.0 |
| 4266 | MERCEDES | M10MCDVPFY6C198 | GO | non | 125 | COUPE | SUPERIEURE | 123.0 | 1615.0 | 1615.0 | M | 6.0 |
| 28833 | MERCEDES | M10MCDVP025C892 | ES | non | 190 | MINIBUS | MOY-SUPER | 284.0 | 2076.0 | 2185.0 | A | 5.0 |
| 6632 | MERCEDES | M1GMCDVPX557807 | GO | non | 190 | TS TERRAINS/CHEMINS | LUXE | 189.0 | 2175.0 | 2175.0 | A | 7.0 |
| 38630 | MERCEDES | M10MCDVP372U987 | GO | non | 165 | MINIBUS | MOY-INFER | 226.0 | 2356.0 | 2450.0 | A | 5.0 |


In [4]:
# informations about the dataframe
data_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55028 entries, 0 to 55027
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   lib_mrq          55028 non-null  object 
 1   cnit             55028 non-null  object 
 2   cod_cbr          55028 non-null  object 
 3   hybride          55028 non-null  object 
 4   puiss_max        55028 non-null  object 
 5   Carrosserie      55028 non-null  object 
 6   gamme            55028 non-null  object 
 7   co2              55028 non-null  float64
 8   masse_ordma_min  55028 non-null  float64
 9   masse_ordma_max  55028 non-null  float64
 10  Type_boite       55028 non-null  object 
 11  Nb_rapp          55028 non-null  float64
dtypes: float64(4), object(8)
memory usage: 5.0+ MB


### Selecting the most important features

In [5]:
# most important features
New_Data = data_model[['Carrosserie', 'masse_ordma_min', 'masse_ordma_max', 'co2']]

In [6]:
# y holds the target (co2); X holds the remaining columns, used as features
y = New_Data['co2']                  # Target
X = New_Data.drop(['co2'], axis=1)   # Features

### Processing the Carrosserie column

In [7]:
# encode 'Carrosserie' categorical feature into numerical
le = LabelEncoder()
X['Carrosserie'] = le.fit_transform(X['Carrosserie'])
dump(le,'Encoder.pkl')

['Encoder.pkl']

### CO2 prediction

For each of our models:
   * **DummyRegressor**,
   * **LinearRegression**,
   * **SGDRegressor**,
   * **RandomForestRegressor**,
   * **GradientBoostingRegressor**,
   * **XGBRegressor**,

We will predict the level of **CO2** emissions, then compare the performance of each model.

    Apply a GridSearch to optimize the hyperparameters of each model
    Configure GridSearch with scores suited to regression (R2, MAE and RMSE)
    Identify the best model
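
To make the three regression scores concrete, here is a minimal sketch that computes R2, MAE and RMSE from their definitions in plain Python. The helper name `regression_scores` and the three sample values are made up for illustration; the notebook itself relies on the scikit-learn functions imported at the top.

```python
import math

def regression_scores(y_true, y_pred):
    """Compute R2, MAE and RMSE from their textbook definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }

# Made-up CO2 values in g/km, for illustration only
y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 150.0, 190.0]
scores = regression_scores(y_true, y_pred)
print(scores)

# A model that always predicts the mean of y_true gets R2 = 0 by construction
baseline = regression_scores(y_true, [150.0, 150.0, 150.0])
print(baseline["r2"])
```

In `GridSearchCV`, the error metrics are requested as `'neg_mean_absolute_error'` and `'neg_mean_squared_error'`: the search always maximizes its score, so errors are negated to make "higher is better" hold for every metric.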

In [8]:
# Create the regressors and their respective hyperparameters grids
# regressors = [DummyRegressor(), LinearRegression(), SGDRegressor(), RandomForestRegressor(), GradientBoostingRegressor(), XGBRegressor()]

# param_grids =[{'strategy': ['mean', 'median', 'quantile'], 'quantile' : [0.8]},
#                 {'fit_intercept': [True, False], 'copy_X': [True, False], 'n_jobs': [None, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
#                 {'loss' : ['squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'], 'penalty' : ['l2', 'l1', 'elasticnet'], 'alpha' : [0.0001, 0.001, 0.01, 0.1, 1]},
#                 {'n_estimators' : [100, 200, 300, 400, 500], 'criterion' : ['mse', 'mae'], 'max_depth' : [None, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
#                 {'loss' : ['ls', 'lad', 'huber', 'quantile'], 'learning_rate' : [0.001, 0.01, 0.1, 1], 'n_estimators' : [100, 200, 300, 400, 500], 'criterion' : ['friedman_mse', 'mse', 'mae']},
#                 {'objective' : ['reg:squarederror', 'reg:squaredlogerror', 'reg:logistic', 'reg:pseudohubererror'], 'learning_rate' : [0.001, 0.01, 0.1, 1], 'n_estimators' : [100, 200, 300, 400, 500]}
#                 ]

regressors = {'DummyRegressor' : DummyRegressor(),
              'LinearRegression' : LinearRegression(),
              'SGDRegressor' : SGDRegressor(),
              'RandomForestRegressor' : RandomForestRegressor(),
              'GradientBoostingRegressor' : GradientBoostingRegressor(),
              'XGBRegressor' : XGBRegressor()
              }

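# Hyperparameter grids, one per model.
# NOTE: 'squared_loss' (SGDRegressor) and 'ls' (GradientBoostingRegressor) were renamed
# 'squared_error' in scikit-learn 1.0 and are rejected from 1.2 onward, so the candidates
# using them fail to fit (see the "fits failed" warnings in the evaluation output).
param_grids = {'DummyRegressor' : {'strategy': ['mean', 'median', 'quantile'], 'quantile' : [0.8]},
               'LinearRegression' : {'fit_intercept': [True, False], 'copy_X': [True, False], 'n_jobs': [None, 1, 6]},
               'SGDRegressor' : {'loss' : ['squared_loss', 'huber'], 'penalty' : ['l2', 'l1'], 'alpha' : [0.01, 0.1]},
               'RandomForestRegressor' : {'n_estimators' : [20, 50], 'criterion' : ['poisson', 'squared_error']},
               'GradientBoostingRegressor' : {'loss' : ['ls', 'huber'], 'learning_rate' : [0.01, 0.1], 'n_estimators' : [50]},
               'XGBRegressor' : {'learning_rate' : [0.001, 1], 'n_estimators' : [50, 100]}
               }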



In [None]:
# Useful for SGDRegressor
scaler = StandardScaler()  # Create a StandardScaler object
X = scaler.fit_transform(X)  # Standardize the features in matrix X
dump(scaler, 'Scaler.pkl')  # Save the scaler for later use
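
The cell above fits the scaler on the whole of `X` before the train/test split, so the test rows influence the scaling statistics. `LabelEncoder` also maps each body type to an arbitrary integer, which linear models read as an ordering. One possible alternative, sketched below with the `Pipeline`, `ColumnTransformer` and `OneHotEncoder` classes already imported at the top of the notebook, is to put the preprocessing inside a pipeline so that it is fitted on the training rows only. The seven rows are copied from the sample shown earlier; the step names and the choice of `LinearRegression` are assumptions made for this illustration, not the notebook's method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Seven rows taken from the data_model.sample(7) output earlier in the notebook
toy = pd.DataFrame({
    "Carrosserie": ["MINIBUS", "MINIBUS", "MINIBUS", "COUPE",
                    "MINIBUS", "TS TERRAINS/CHEMINS", "MINIBUS"],
    "masse_ordma_min": [2186.0, 2219.0, 2186.0, 1615.0, 2076.0, 2175.0, 2356.0],
    "masse_ordma_max": [2355.0, 2815.0, 2350.0, 1615.0, 2185.0, 2175.0, 2450.0],
    "co2": [216.0, 209.0, 193.0, 123.0, 284.0, 189.0, 226.0],
})
X_toy = toy.drop(columns="co2")
y_toy = toy["co2"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=2, random_state=42)

# One-hot encode the body type, standardize the two masses
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Carrosserie"]),
    ("num", StandardScaler(), ["masse_ordma_min", "masse_ordma_max"]),
])
pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

pipe.fit(X_tr, y_tr)       # encoder and scaler see the training rows only
pred = pipe.predict(X_te)  # the same fitted transforms are applied to the test rows
print(pred)
```

Such a pipeline can be passed to `GridSearchCV` in place of the bare model; grid keys are then prefixed with the step name, for example `model__fit_intercept`.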

In [9]:
# Split the data in a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, shuffle=True)

# Scoring metrics for evaluation
scoring_metrics = ['r2', 'neg_mean_absolute_error', 'neg_mean_squared_error']

# Model Evaluation Loop
for model_name, model in regressors.items():
    print(f"evaluation of {model_name} running")

    # Hyperparameter Tuning and Cross-Validation
    param_grid = param_grids[model_name]  # Get the hyperparameter grid for the current model
    grid_search = GridSearchCV(model, param_grid, scoring=scoring_metrics, refit='r2', cv=5)
    grid_search.fit(X_train, y_train)  # Perform grid search and cross-validation
    
    best_model = grid_search.best_estimator_  # Get the best model from the grid search
    y_pred = best_model.predict(X_test)  # Predict target values using the best model

    # Calculate evaluation metrics    
    r2 = r2_score(y_test, y_pred) # R-squared
    mae = mean_absolute_error(y_test, y_pred) # Mean Absolute Error
    rmse = np.sqrt(mean_squared_error(y_test, y_pred)) # Root Mean Squared Error (the squared=False argument is deprecated in recent scikit-learn releases)

    print("Best Parameters:", grid_search.best_params_)
    print("R2:", r2)
    print("MAE:", mae)
    print("RMSE:", rmse)
    print("=" * 20)  # display 20 times '='

evaluation of DummyRegressor running
Best Parameters: {'quantile': 0.8, 'strategy': 'mean'}
R2: -7.69110327403233e-06
MAE: 22.302972549351892
RMSE: 34.04974648960275
evaluation of LinearRegression running
Best Parameters: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None}
R2: 0.430097227675622
MAE: 17.050194358060917
RMSE: 25.70470343021974
evaluation of SGDRegressor running


20 fits failed out of a total of 40.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/home/zaphyra/Documents/Vscode/iadev-py/numpy_py/.venvnumpy/lib64/python3.11/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/zaphyra/Documents/Vscode/iadev-py/numpy_py/.venvnumpy/lib64/python3.11/site-packages/sklearn/linear_model/_stochastic_gradient.py", line 1582, in fit
    self._validate_params()
  File "/home/zaphyra/Documents/Vscode/iadev-py/numpy_py/.venvnumpy/lib64/python3.11/site-packages/sklearn/base.py", line 600, in _validate_params
    validate_parameter_constr

Best Parameters: {'alpha': 0.01, 'loss': 'huber', 'penalty': 'l1'}
R2: 0.3972281836248005
MAE: 17.11932292161953
RMSE: 26.43557016105933
evaluation of RandomForestRegressor running
Best Parameters: {'criterion': 'poisson', 'n_estimators': 50}
R2: 0.6668328112513786
MAE: 12.423483892999414
RMSE: 19.653654849518563
evaluation of GradientBoostingRegressor running


10 fits failed out of a total of 20.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/home/zaphyra/Documents/Vscode/iadev-py/numpy_py/.venvnumpy/lib64/python3.11/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/zaphyra/Documents/Vscode/iadev-py/numpy_py/.venvnumpy/lib64/python3.11/site-packages/sklearn/ensemble/_gb.py", line 420, in fit
    self._validate_params()
  File "/home/zaphyra/Documents/Vscode/iadev-py/numpy_py/.venvnumpy/lib64/python3.11/site-packages/sklearn/base.py", line 600, in _validate_params
    validate_parameter_constraints(
  File "/home/z

Best Parameters: {'learning_rate': 0.1, 'loss': 'huber', 'n_estimators': 50}
R2: 0.5341833360009189
MAE: 14.296621205864147
RMSE: 23.239122386473806
evaluation of XGBRegressor running
Best Parameters: {'learning_rate': 1, 'n_estimators': 50}
R2: 0.6693427183096918
MAE: 12.503925002302173
RMSE: 19.579484726044143
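
The "fits failed" counts in the output above can be checked by hand: `GridSearchCV` fits every combination of the grid once per fold (`cv=5` in the loop). The sketch below redoes that arithmetic for the two grids that triggered warnings. `count_fits` is a hypothetical helper written for this illustration, and it assumes the failures come from the loss names `'squared_loss'` and `'ls'`, which scikit-learn renamed to `'squared_error'` in version 1.0 and stopped accepting in 1.2.

```python
from itertools import product

CV_FOLDS = 5  # same value as cv=5 in the GridSearchCV call

# Grids copied from the notebook
sgd_grid = {'loss': ['squared_loss', 'huber'], 'penalty': ['l2', 'l1'], 'alpha': [0.01, 0.1]}
gbr_grid = {'loss': ['ls', 'huber'], 'learning_rate': [0.01, 0.1], 'n_estimators': [50]}

def count_fits(param_grid, rejected_losses, cv=CV_FOLDS):
    """Return (total fits, fits whose 'loss' value is in rejected_losses)."""
    names = list(param_grid)
    candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
    failing = [c for c in candidates if c['loss'] in rejected_losses]
    return len(candidates) * cv, len(failing) * cv

sgd_total, sgd_failed = count_fits(sgd_grid, {'squared_loss'})
gbr_total, gbr_failed = count_fits(gbr_grid, {'ls'})
print(f"SGDRegressor: {sgd_failed} of {sgd_total} fits use a rejected loss")
print(f"GradientBoostingRegressor: {gbr_failed} of {gbr_total} fits use a rejected loss")
```

Both results match the warnings (20 of 40 and 10 of 20). Replacing the two names with `'squared_error'` should let every candidate fit on recent scikit-learn versions, though the selected best parameters might then differ from those reported above.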


In [10]:
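# Report the model and hyperparameters held by grid_search.
# NOTE: grid_search is the object left over from the last iteration of the loop above
# (XGBRegressor); it is not chosen by comparing scores, although XGBRegressor does have
# the highest test R2 in the results above.
best_model_name = type(grid_search.best_estimator_).__name__
print(f"The best model is {best_model_name}")

best_params = grid_search.best_params_
print("Hyperparameters used by the best model:")
for param, value in best_params.items():
    print(f"{param}: {value}")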

The best model is XGBRegressor
Hyperparameters used by the best model:
learning_rate: 1
n_estimators: 50
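
The cell above reads `grid_search` after the loop has finished, so it reports whichever model was evaluated last. A minimal sketch of an explicit comparison is given below: it stores one score per model and takes the maximum. The R2 values are copied from the evaluation output; the dictionary name `test_r2` is an assumption made for this illustration. In the loop itself, the same dictionary could be filled with `test_r2[model_name] = r2`.

```python
# Test-set R2 values copied from the evaluation output above
test_r2 = {
    'DummyRegressor': -7.69110327403233e-06,
    'LinearRegression': 0.430097227675622,
    'SGDRegressor': 0.3972281836248005,
    'RandomForestRegressor': 0.6668328112513786,
    'GradientBoostingRegressor': 0.5341833360009189,
    'XGBRegressor': 0.6693427183096918,
}

# Pick the model with the highest R2 and list all models from best to worst
best_name = max(test_r2, key=test_r2.get)
ranking = sorted(test_r2.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: R2 = {score:.4f}")
print(f"Best model by test R2: {best_name}")
```

With these numbers the explicit comparison also selects XGBRegressor, narrowly ahead of RandomForestRegressor.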


Train/test split at 30%: use this 30% for the grid search (to limit search time).
This makes it possible to test the best model; once it has been chosen, KFold can be applied to the full dataset for that model.
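
The KFold step mentioned in this note could look like the sketch below, which scores a single chosen model on the whole dataset with 5-fold cross-validation. The data here are synthetic and `LinearRegression` stands in for the selected model; both are assumptions made so that the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the full feature matrix and target
rng = np.random.default_rng(42)
X_all = rng.normal(size=(200, 3))
y_all = X_all @ np.array([5.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Five shuffled folds: each row is used for validation exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
fold_r2 = cross_val_score(LinearRegression(), X_all, y_all, cv=cv, scoring="r2")

print("R2 per fold:", fold_r2)
print("Mean R2:", fold_r2.mean(), "Std:", fold_r2.std())
```

Reporting the mean and spread across folds gives a steadier estimate than a single train/test split, at the cost of fitting the model once per fold.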

In [11]:
# Save the trained model (best_model is the estimator from the last loop iteration, i.e. XGBRegressor)
dump(best_model, 'Model.pkl')

['Model.pkl']

### Web application

Develop a Streamlit application with the following options:

    1- The user selects the body type (Carrosserie) from a drop-down menu
    2- The user enters 'masse_ordma_min' and 'masse_ordma_max' in two separate input fields
    3- Add a button that triggers the CO2 prediction
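
The three steps above can be sketched as a small Streamlit script. This is an outline under several assumptions, not a finished application: it supposes that `Encoder.pkl`, `Scaler.pkl` and `Model.pkl` were all saved by this notebook, that the scaling cell was actually run before training, and that the model expects the columns in the order Carrosserie, masse_ordma_min, masse_ordma_max. The helper `build_features`, the default input values and the widget labels are hypothetical.

```python
import sys

def build_features(encoder, scaler, carrosserie, masse_min, masse_max):
    """Encode and scale one vehicle, in the column order used for training."""
    code = encoder.transform([carrosserie])[0]
    row = [[float(code), float(masse_min), float(masse_max)]]
    return scaler.transform(row)

def main():
    # Third-party imports are kept local so build_features can be reused without them
    import streamlit as st
    from joblib import load

    encoder = load('Encoder.pkl')
    scaler = load('Scaler.pkl')
    model = load('Model.pkl')

    st.title('CO2 emission prediction')
    # 1- Drop-down menu listing the body types known to the encoder
    carrosserie = st.selectbox('Carrosserie', list(encoder.classes_))
    # 2- Two separate numeric input fields (default values are arbitrary)
    masse_min = st.number_input('masse_ordma_min', min_value=0.0, value=1500.0)
    masse_max = st.number_input('masse_ordma_max', min_value=0.0, value=1600.0)
    # 3- Button that triggers the prediction
    if st.button('Predict CO2'):
        features = build_features(encoder, scaler, carrosserie, masse_min, masse_max)
        prediction = model.predict(features)[0]
        st.write(f'Predicted CO2: {prediction:.1f} g/km')

# Build the UI only when the script is launched through `streamlit run`
if __name__ == '__main__' and 'streamlit' in sys.modules:
    main()
```

Saved as, say, `app.py` next to the three `.pkl` files, the script would be started with `streamlit run app.py`. Keeping the feature construction in a separate function makes it possible to check the encoding and scaling order without starting the web interface.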