A réaliser : 
- Une analyse descriptive des données, y compris une explication du sens des colonnes gardées, des arguments derrière la suppression de lignes ou de colonnes, des statistiques descriptives et des visualisations pertinentes.

## Import des modules

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

In [2]:

bc_after_eda = pd.read_csv("data/bc_after_eda.csv", index_col='Unnamed: 0')
bc_after_eda

Unnamed: 0,3LargestGFA,ListOfAllPropertyUseTypes,FirstUseType,SecondLargestPropertyUseType,YearBuilt,NumberofFloors,NumberofBuildings,Latitude,Longitude,Neighborhood,SiteEnergyUse(kBtu),Electricity(kBtu),NaturalGas(kBtu),SteamUse(kBtu)
0,88434.0,Hotel,Hotel,,1927,12,1.0,47.61220,-122.33799,DOWNTOWN,7.226362e+06,3.946027e+06,1.276453e+06,2003882.00
1,103566.0,"Hotel, Parking, Restaurant",Hotel,Parking,1996,11,1.0,47.61317,-122.33393,DOWNTOWN,8.387933e+06,3.242851e+06,5.145082e+06,0.00
2,756493.0,Hotel,Hotel,,1969,41,1.0,47.61393,-122.33810,DOWNTOWN,7.258702e+07,4.952666e+07,1.493800e+06,21566554.00
3,61320.0,Hotel,Hotel,,1926,10,1.0,47.61412,-122.33664,DOWNTOWN,6.794584e+06,2.768924e+06,1.811213e+06,2214446.25
4,191454.0,"Hotel, Parking, Swimming Pool",Hotel,Parking,1980,18,1.0,47.61375,-122.34047,DOWNTOWN,1.417261e+07,5.368607e+06,8.803998e+06,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3371,12294.0,Office,Office,,1990,1,1.0,47.56722,-122.31154,GREATER DUWAMISH,8.497457e+05,5.242709e+05,3.254750e+05,0.00
3372,16000.0,Other - Recreation,Other,,2004,1,1.0,47.59625,-122.32283,DOWNTOWN,9.502762e+05,3.965461e+05,5.537300e+05,0.00
3373,13157.0,"Fitness Center/Health Club/Gym, Other - Recrea...",Other,Fitness Center/Health Club/Gym,1974,1,1.0,47.63644,-122.35784,MAGNOLIA / QUEEN ANNE,5.765898e+06,1.792159e+06,3.973739e+06,0.00
3374,13586.0,"Fitness Center/Health Club/Gym, Food Service, ...",Mixed Use Property,Fitness Center/Health Club/Gym,1989,1,1.0,47.52832,-122.32431,GREATER DUWAMISH,7.194712e+05,3.488702e+05,3.706010e+05,0.00


## Import des modules 

In [3]:
#Selection
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV, 
    cross_validate,
)
from sklearn.metrics import r2_score, mean_absolute_error , root_mean_squared_error, mean_absolute_percentage_error
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline

#Preprocess
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler,FunctionTransformer

#Modèles
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor


## Feature Engineering

A réaliser : Enrichir le jeu de données actuel avec de nouvelles features issues de celles existantes. 

### Préparation des features pour la modélisation

A réaliser :
* Si ce n'est pas déjà fait, supprimer toutes les colonnes peu pertinentes pour la modélisation.
* Tracer la distribution de la cible pour vous familiariser avec l'ordre de grandeur. En cas d'outliers, mettez en place une démarche pour les supprimer.
* Débarrassez-vous des features redondantes en utilisant une matrice de corrélation.
* Réalisez différents graphiques pour comprendre le lien entre vos features et la target (boxplots, scatterplots, pairplot si votre nombre de features numériques n'est pas très élevé).
*  Séparez votre jeu de données en un Pandas DataFrame X (ensemble de feautures) et Pandas Series y (votre target).
* Si vous avez des features catégorielles, il faut les encoder pour que votre modèle fonctionne.

#### Modes énergétiques

In [4]:
bc_after_eda['UseGas'] = (bc_after_eda['NaturalGas(kBtu)'].notna()) & (bc_after_eda['NaturalGas(kBtu)'] != 0)
bc_after_eda['UseSteam'] = (bc_after_eda['SteamUse(kBtu)'].notna()) & (bc_after_eda['SteamUse(kBtu)'] != 0)
bc_after_eda['UseElectricity'] = (bc_after_eda['Electricity(kBtu)'].notna()) & (bc_after_eda['Electricity(kBtu)'] != 0)


#### Age de la propriété

In [5]:
bc_after_eda['AgeProperty']= 2016 - bc_after_eda['YearBuilt']
bc_after_eda['AgeCategory'] = pd.cut(bc_after_eda['AgeProperty'], 
                                  bins=[0, 20, 40, 70, 200], 
                                  labels=['New', 'Recent', 'Old', 'Historic'])


#### Ere de  construction de la propriété

In [6]:
bc_after_eda['EnergyEra'] = pd.cut(bc_after_eda['YearBuilt'], 
                                bins=[1800, 1980, 2000, 2025], 
                                labels=['Pre-Energy-Crisis', 'Modern', 'Contemporary'])

#### distance du centre ville

In [7]:
seattle_position = (47.6085965,-122.5049456)
bc_after_eda['CityDistance'] = np.sqrt(
    (bc_after_eda['Latitude'] - seattle_position[0])**2 + 
    (bc_after_eda['Longitude'] - seattle_position[1])**2)
bc_after_eda['CityDistance'].describe()


count    1641.000000
mean        0.178664
std         0.026149
min         0.099400
25%         0.164800
50%         0.176047
75%         0.192523
max         0.303815
Name: CityDistance, dtype: float64

#### Utilisation multiple

In [8]:
bc_after_eda['MultipleUseType'] = bc_after_eda['ListOfAllPropertyUseTypes'].str.count('s')+1
bc_after_eda['MultipleUseType'].value_counts()

MultipleUseType
1    971
2    537
3     93
4     32
6      3
7      2
5      2
8      1
Name: count, dtype: int64

In [None]:
bc_after_eda.columns

Index(['3LargestGFA', 'ListOfAllPropertyUseTypes', 'FirstUseType',
       'SecondLargestPropertyUseType', 'YearBuilt', 'NumberofFloors',
       'NumberofBuildings', 'Latitude', 'Longitude', 'Neighborhood',
       'SiteEnergyUse(kBtu)', 'Electricity(kBtu)', 'NaturalGas(kBtu)',
       'SteamUse(kBtu)', 'UseGas', 'UseSteam', 'UseElectricity', 'AgeProperty',
       'AgeCategory', 'EnergyEra', 'CityDistance', 'MultipleUseType'],
      dtype='object')

## Split train/test

In [9]:
predict_values = ['3LargestGFA', 'MultipleUseType', 'UseSteam', 'UseElectricity', 'UseGas', 'EnergyEra',
       'AgeCategory', 'NumberofFloors', 'NumberofBuildings', 'CityDistance', 'Neighborhood', 'FirstUseType']
X = bc_after_eda[predict_values]
y = bc_after_eda['SiteEnergyUse(kBtu)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
print("Index X_train avant pipeline :", X_train.index.equals(y_train.index))


Index X_train avant pipeline : True


## Finalisation des features

In [11]:
# ========================
# ÉTAPE 1: PREPROCESSING PERSONNALISÉ 
# ========================

def fix_floors_and_discretize(df):
    """Fonction qui fait tout votre preprocessing d'un coup"""
    df = df.copy()
    
    # 1. Corriger NumberofFloors
    mask = (df['NumberofFloors'] < 1)
    OneBuildingMeanFloor = df[df['NumberofBuildings']==1]["NumberofFloors"].mean()
    OneBuildingMeanFloor = int(OneBuildingMeanFloor.round(0))
    df.loc[mask,'NumberofFloors'] = OneBuildingMeanFloor
    
    # 2. Créer AvgFloor (comme vous faisiez)
    #df['AvgFloor'] = df['NumberofFloors']/df['NumberofBuildings']
    
    
    # 5. PropertySize (3LargestGFA) – quantiles auto sur train
    if not hasattr(fix_floors_and_discretize, 'size_bins'):
        _, fix_floors_and_discretize.size_bins = pd.qcut(
            df['3LargestGFA'], q=4, retbins=True, duplicates='drop'
        )
    df['PropertySize'] = pd.cut(df['3LargestGFA'],
                                bins=fix_floors_and_discretize.size_bins,
                                labels=['Small', 'Mid', 'Large', 'XLarge'],
                                include_lowest=True)
    
    # 6. HeightCategory (NumberofFloors) – quantiles auto sur train
    if not hasattr(fix_floors_and_discretize, 'floor_bins'):
        _, fix_floors_and_discretize.floor_bins = pd.qcut(
            df['NumberofFloors'], q=3, retbins=True, duplicates='drop'
        )
    df['HeightCategory'] = pd.cut(df['NumberofFloors'],
                                  bins=fix_floors_and_discretize.floor_bins,
                                  labels=['Low', 'Mid', 'High'],
                                  include_lowest=True)
    return df

# ========================
# ÉTAPE 2: PIPELINE COMPLET
# ========================

# Colonnes après votre preprocessing
categorical_features = ['FirstUseType', 'PropertySize',
                       'Neighborhood','AgeCategory','EnergyEra','HeightCategory']  # Ajoutez vos autres catégories ici

numerical_features = ['3LargestGFA',  
                     'CityDistance', 'MultipleUseType', 'NumberofFloors','NumberofBuildings']

# Pipeline complet
full_pipeline = Pipeline([
    # Étape 1: Preprocessing personnalisé
    ('preprocessing', FunctionTransformer(fix_floors_and_discretize, validate=False)),
    
    # Étape 2: Encodage + Normalisation
    ('encoder', ColumnTransformer([
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), 
         categorical_features),
        ('num', StandardScaler(), numerical_features)
    ], remainder='passthrough'))
])

# ========================
# ÉTAPE 3: APPLICATION 
# ========================

# Fit sur train et transform train/test
X_train_transformed = full_pipeline.fit_transform(X_train)
X_test_transformed = full_pipeline.transform(X_test)

# ========================
# ÉTAPE 4: CRÉER VOS DataFrames _final
# ========================

# Récupérer les noms des colonnes
onehot = full_pipeline.named_steps['encoder'].named_transformers_['cat']
onehot_names = onehot.get_feature_names_out(categorical_features)
num_names = [f"scaled_{col}" for col in numerical_features]

# Colonnes restantes (passthrough)
all_cols_after_preprocessing = fix_floors_and_discretize(X_train).columns
remaining_cols = [col for col in all_cols_after_preprocessing 
                 if col not in categorical_features + numerical_features]

# Noms finaux
final_feature_names = list(onehot_names) + num_names + remaining_cols

# Vos DataFrames finaux
X_train_final = pd.DataFrame(X_train_transformed, columns=final_feature_names,index=X_train.index )
X_test_final = pd.DataFrame(X_test_transformed, columns=final_feature_names,index=X_test.index)

# Conversion en numérique
for col in X_train_final.columns:
    X_train_final[col] = pd.to_numeric(X_train_final[col], errors='coerce')
    X_test_final[col] = pd.to_numeric(X_test_final[col], errors='coerce')

#X_train_final.drop(columns=['NumberofFloors', 'NumberofBuildings'], inplace=True)
#X_test_final.drop(columns=['NumberofFloors', 'NumberofBuildings'], inplace=True)

print(f"✅ Pipeline terminé!")
print(f"Shape finale: Train {X_train_final.shape}, Test {X_test_final.shape}")


✅ Pipeline terminé!
Shape finale: Train (1312, 72), Test (329, 72)




In [12]:
print("Index X_train_final après pipeline :", X_train_final.index.equals(y_train.index))

Index X_train_final après pipeline : True


In [13]:
X_train_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1312 entries, 2338 to 1965
Data columns (total 72 columns):
 #   Column                                                        Non-Null Count  Dtype  
---  ------                                                        --------------  -----  
 0   FirstUseType_Automobile Dealership                            1312 non-null   float64
 1   FirstUseType_Bank Branch                                      1312 non-null   float64
 2   FirstUseType_Courthouse                                       1312 non-null   float64
 3   FirstUseType_Data Center                                      1312 non-null   float64
 4   FirstUseType_Distribution Center                              1312 non-null   float64
 5   FirstUseType_Financial Office                                 1312 non-null   float64
 6   FirstUseType_Fire Station                                     1312 non-null   float64
 7   FirstUseType_Fitness Center/Health Club/Gym                   1312 non-

In [16]:
# CODE COMPARAISON DES MODELES

models = {
    'DummyRegressor': DummyRegressor(strategy='mean'),
    'LinearRegression': LinearRegression(),
    'SVR': SVR(),
    'GradientBoosting': GradientBoostingRegressor(random_state=42),
    'RandomForest': RandomForestRegressor(random_state=42, n_jobs=-1)
}

scoring = ['neg_root_mean_squared_error','r2','neg_mean_absolute_error']
cv_results = {}

print("=== COMPARAISON MODÈLES AVEC PIPELINE COMPLET ===")
for name, model in models.items():
    # Pipeline complet + modèle final
    full_estimator = Pipeline([
        ('preprocess', full_pipeline.named_steps['preprocessing']),
        ('encode_scale', full_pipeline.named_steps['encoder']),
        ('model', model)
    ])
    scores = cross_validate(full_estimator, X_train, y_train, 
                            cv=5, scoring=scoring, n_jobs=-1)
    cv_results[name] = {
        'RMSE': -scores['test_neg_root_mean_squared_error'].mean(),
        'R2': scores['test_r2'].mean(),
        'MAE': -scores['test_neg_mean_absolute_error'].mean()
    }
    print(f"{name} → R²: {cv_results[name]['R2']:.3f}, RMSE: {cv_results[name]['RMSE']:.0f}")
    
    

=== COMPARAISON MODÈLES AVEC PIPELINE COMPLET ===


  return _ForkingPickler.loads(res)


DummyRegressor → R²: -0.017, RMSE: 29013479
LinearRegression → R²: 0.494, RMSE: 15320664
SVR → R²: -0.067, RMSE: 29634019




GradientBoosting → R²: -0.424, RMSE: 28769810
RandomForest → R²: 0.172, RMSE: 24054996




### Optimisation et interprétation du modèle

A réaliser :
* Reprennez le meilleur algorithme que vous avez sécurisé via l'étape précédente, et réalisez une GridSearch de petite taille sur au moins 3 hyperparamètres.
* Si le meilleur modèle fait partie de la famille des modèles à arbres (RandomForest, GradientBoosting) alors utilisez la fonctionnalité feature importance pour identifier les features les plus impactantes sur la performance du modèle. Sinon, utilisez la méthode Permutation Importance de sklearn. 

In [638]:
# CODE OPTIMISATION ET INTERPRETATION DU MODELE
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
# Définir une petite grille d’hyperparamètres à tester
param_grid = {
    'n_estimators': [100, 200],          # nombre d’arbres
    'max_depth': [10, 20, None],         # profondeur max des arbres
    'min_samples_split': [2, 5]          # nombre min d’échantillons pour split
}

# Créer et configurer la GridSearch
gs = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42, n_jobs=-1),
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',  # RMSE négatif car maximisation par GridSearch
    cv=3,                                  # validation croisée 3 folds pour rapidité
    n_jobs=-1,
    verbose=1
)

# Exécuter la recherche sur le train
gs.fit(X_train_final, y_train)

# Afficher les meilleurs paramètres et score
print("Best parameters:", gs.best_params_)
print("Best RMSE (neg):", gs.best_score_)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best parameters: {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 100}
Best RMSE (neg): -22389105.06051686


### Résultat V1 'main'


Recalculated RMSE: 8594300.258724326  
Recalculated R2: 0.7820465641162211  
Recalculated MAE: 3339004.45232972  
Recalculated MAPE: 0.8307252249062986  



In [639]:
best_model = gs.best_estimator_

importances = best_model.feature_importances_
feature_names = X_train_final.columns

# Affichage du top 15 features les plus importantes
sorted_idx = importances.argsort()[::-1]
print("Top 10 features by importance:")
for idx in sorted_idx[:30]:
    print(f"- {feature_names[idx]}: {importances[idx]:.4f}")

Top 10 features by importance:
- scaled_3LargestGFA: 0.4418
- scaled_NumberofBuildings: 0.3338
- FirstUseType_Hospital: 0.0543
- scaled_CityDistance: 0.0338
- FirstUseType_Data Center: 0.0313
- scaled_NumberofFloors: 0.0176
- FirstUseType_Mixed Use Property: 0.0118
- HeightCategory_Mid: 0.0115
- Neighborhood_NORTHEAST: 0.0061
- scaled_MultipleUseType: 0.0057
- FirstUseType_University: 0.0048
- UseGas: 0.0046
- EnergyEra_Modern: 0.0046
- FirstUseType_Other: 0.0045
- AgeCategory_Recent: 0.0044
- UseSteam: 0.0034
- Neighborhood_EAST: 0.0030
- FirstUseType_Laboratory: 0.0029
- EnergyEra_Pre-Energy-Crisis: 0.0023
- HeightCategory_Low: 0.0023
- FirstUseType_Hotel: 0.0018
- Neighborhood_LAKE UNION: 0.0016
- FirstUseType_Supermarket / Grocery Store: 0.0014
- Neighborhood_GREATER DUWAMISH: 0.0014
- AgeCategory_Old: 0.0013
- FirstUseType_Medical Office: 0.0011
- AgeCategory_New: 0.0009
- Neighborhood_DOWNTOWN: 0.0009
- FirstUseType_Large Office: 0.0008
- FirstUseType_Warehouse: 0.0005


In [640]:
y_pred = best_model.predict(X_test_final)
rmse = root_mean_squared_error(y_test, y_pred)  
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)

print(f"Recalculated RMSE: {rmse}")
print(f"Recalculated R2: {r2}")
print(f"Recalculated MAE: {mae}")
print(f"Recalculated MAPE: {mape}")

Recalculated RMSE: 8887739.117017826
Recalculated R2: 0.7669091255711366
Recalculated MAE: 3301680.4566533007
Recalculated MAPE: 0.8048223870314912
