In this notebook, we use the building type (`PrimaryPropertyType`) but not the main use types.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

import sklearn
sklearn.set_config(transform_output = "pandas")

from sklearn.preprocessing import OneHotEncoder, TargetEncoder, MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score, mean_absolute_error as mae, mean_absolute_percentage_error as mape, mean_squared_error as mse
from sklearn.linear_model import LinearRegression

import warnings
#warnings.filterwarnings("ignore")

from utils import *

In [2]:
df = pd.read_csv('df_clean.csv', index_col=0)
print(df.shape)

(1578, 20)


In [3]:
df.head()

Unnamed: 0,PrimaryPropertyType,CouncilDistrictCode,Neighborhood,YearBuilt,NumberofFloors,PropertyGFATotal,LargestPropertyUseType,LargestPropertyUseTypeGFA,SecondLargestPropertyUseType,SecondLargestPropertyUseTypeGFA,ThirdLargestPropertyUseType,ThirdLargestPropertyUseTypeGFA,ENERGYSTARScore,SiteEUIWN(kBtu/sf),SiteEnergyUseWN(kBtu),SteamUse(kBtu),Electricity(kBtu),NaturalGas(kBtu),TotalGHGEmissions,GHGEmissionsIntensity
0,Hotel,7,DOWNTOWN,1927,12,88434,Hotel,88434.0,,0.0,,0.0,60.0,84.300003,7456910.0,2003882.0,3946027.0,1276453.0,249.98,2.83
1,Hotel,7,DOWNTOWN,1996,11,103566,Hotel,83880.0,Parking,15064.0,Restaurant,4622.0,61.0,97.900002,8664479.0,0.0,3242851.0,5145082.0,295.86,2.86
2,Hotel,7,DOWNTOWN,1969,41,956110,Hotel,756493.0,,0.0,,0.0,43.0,97.699997,73937112.0,21566554.0,49526664.0,1493800.0,2089.28,2.19
3,Hotel,7,DOWNTOWN,1926,10,61320,Hotel,61320.0,,0.0,,0.0,56.0,113.300003,6946800.5,2214446.25,2768924.0,1811213.0,286.43,4.67
4,Hotel,7,DOWNTOWN,1980,18,175580,Hotel,123445.0,Parking,68009.0,Swimming Pool,0.0,75.0,118.699997,14656503.0,0.0,5368607.0,8803998.0,505.01,2.88


# Prediction of the non-normalized consumption

We start with only a few variables for modeling, then add more step by step.

## Train/Test split

In [4]:
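# Keep the target and three explanatory variables; .copy() so that pop() acts on an independent frame
X = df[['SiteEnergyUseWN(kBtu)', 'PrimaryPropertyType', 'CouncilDistrictCode', 'PropertyGFATotal']].copy()
# Remove the target column from X and keep it as y
y = X.pop('SiteEnergyUseWN(kBtu)')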
print('X shape :', X.shape)
print('y shape :', y.shape)
display(X.head())
display(y.head())

X shape : (1578, 3)
y shape : (1578,)


Unnamed: 0,PrimaryPropertyType,CouncilDistrictCode,PropertyGFATotal
0,Hotel,7,88434
1,Hotel,7,103566
2,Hotel,7,956110
3,Hotel,7,61320
4,Hotel,7,175580


0     7456910.0
1     8664479.0
2    73937112.0
3     6946800.5
4    14656503.0
Name: SiteEnergyUseWN(kBtu), dtype: float64

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
print('X_train shape :', X_train.shape)
print('y_train shape :', y_train.shape)
print('X_test shape :', X_test.shape)
print('y_test shape :', y_test.shape)

X_train shape : (1262, 3)
y_train shape : (1262,)
X_test shape : (316, 3)
y_test shape : (316,)


## Preprocessing

We apply one-hot encoding to the categorical variables and Min-Max scaling to the numerical variables.
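As a small illustration of the scaling step, the sketch below compares `MinMaxScaler` with the formula `(x - min) / (max - min)` computed by hand. The floor areas are made-up values, not rows from `df_clean.csv`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy floor areas (made-up values, not taken from df_clean.csv)
gfa = np.array([[20_000.0], [50_000.0], [100_000.0], [220_000.0]])

# The scaler learns the min and max of the column during fit
scaler = MinMaxScaler().fit(gfa)
scaled = scaler.transform(gfa)

# Same result computed by hand: (x - min) / (max - min)
manual = (gfa - gfa.min()) / (gfa.max() - gfa.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Because the minimum and maximum come from the data seen during `fit`, test values outside the training range are mapped outside [0, 1] unless `clip=True` is set.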

Encoders to try:
- [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
- [TargetEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html#sklearn.preprocessing.TargetEncoder)
- [GLMMEncoder](https://contrib.scikit-learn.org/category_encoders/glmm.html)
- [LeaveOneOutEncoder](https://contrib.scikit-learn.org/category_encoders/leaveoneout.html)
- [MEstimateEncoder](https://contrib.scikit-learn.org/category_encoders/mestimate.html)
- [HashingEncoder](https://contrib.scikit-learn.org/category_encoders/hashing.html)

Useful links:
- https://www.kaggle.com/code/shahules/an-overview-of-encoding-techniques
- https://towardsdatascience.com/6-ways-to-encode-features-for-machine-learning-algorithms-21593f6238b0
- https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
- https://contrib.scikit-learn.org/category_encoders/

Let's check that every category is present in the training set.

In [6]:
print('X', X[['PrimaryPropertyType', 'CouncilDistrictCode']].nunique(), sep='\n')
print()
print('X_train', X_train[['PrimaryPropertyType', 'CouncilDistrictCode']].nunique(), sep='\n')

X
PrimaryPropertyType    19
CouncilDistrictCode     7
dtype: int64

X_train
PrimaryPropertyType    19
CouncilDistrictCode     7
dtype: int64
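Comparing `nunique()` works here because the training set is a subset of the full data, so equal counts imply identical category sets. A set difference makes the check explicit and names any missing category. The helper `missing_categories` below is a hypothetical illustration on made-up data.

```python
import pandas as pd


def missing_categories(full: pd.DataFrame, subset: pd.DataFrame, columns: list[str]) -> dict[str, set]:
    """Return, per column, the categories present in `full` but absent from `subset`."""
    return {col: set(full[col].unique()) - set(subset[col].unique()) for col in columns}


# Toy data (made up for the example)
full = pd.DataFrame({
    "PrimaryPropertyType": ["Hotel", "Hospital", "Warehouse", "Hotel"],
    "CouncilDistrictCode": [7, 3, 2, 7],
})
subset = full.iloc[[0, 2, 3]]  # drops the only "Hospital" row

missing = missing_categories(full, subset, ["PrimaryPropertyType", "CouncilDistrictCode"])
print(missing)
```

An empty set for every column means the subset covers all categories.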


In [7]:
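# Name one-hot columns after the category only
# (e.g. "Hotel" instead of "PrimaryPropertyType_Hotel")
def custom_combiner(feature, category):
    return str(category)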

col_transformer = make_column_transformer(
    (OneHotEncoder(sparse_output=False, dtype=np.int8, feature_name_combiner=custom_combiner), ['PrimaryPropertyType']),
    (OneHotEncoder(sparse_output=False, dtype=np.int8), ['CouncilDistrictCode']),
    remainder=MinMaxScaler(),
    verbose_feature_names_out=False
)
col_transformer

In [8]:
X_transform = col_transformer.fit_transform(X_train)
print('X_transform shape :', X_transform.shape)
X_transform.head()

X_transform shape : (1262, 27)


Unnamed: 0,Distribution Center,Hospital,Hotel,K-12 School,Laboratory,Large Office,Medical Office,Mixed Use Property,Other,Refrigerated Warehouse,Restaurant,Retail Store,Self-Storage Facility,Senior Care Community,Small- and Mid-Sized Office,Supermarket / Grocery Store,University,Warehouse,Worship Facility,CouncilDistrictCode_1,CouncilDistrictCode_2,CouncilDistrictCode_3,CouncilDistrictCode_4,CouncilDistrictCode_5,CouncilDistrictCode_6,CouncilDistrictCode_7,PropertyGFATotal
300,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0.018968
2207,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0.011862
2998,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0.005809
3187,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0.145617
257,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0.053691


In [9]:
# dtypes
X_transform.dtypes.value_counts()

int8       26
float64     1
Name: count, dtype: int64

In [10]:
# memory usage in kB
X_transform.memory_usage().sum() / 1000

53.004
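The figure above can be reproduced by hand, which also shows what the `dtype=np.int8` choice saves compared with the encoder's default `float64` output. The sketch assumes a 64-bit integer index (8 bytes per row), which `memory_usage()` includes by default.

```python
n_rows = 1262
n_int8_cols, n_float64_cols = 26, 1

# Assumption: the index is stored as int64, i.e. 8 bytes per row
index_bytes = n_rows * 8

# Layout used in the notebook: 26 one-hot columns at 1 byte, 1 scaled column at 8 bytes
int8_layout = n_rows * (n_int8_cols * 1 + n_float64_cols * 8) + index_bytes

# Layout with every column stored as float64 (8 bytes each)
float64_layout = n_rows * (n_int8_cols + n_float64_cols) * 8 + index_bytes

print(int8_layout / 1000)     # 53.004
print(float64_layout / 1000)  # 282.688
```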

## Training a model in a pipeline

We test a simple linear regression, wrapped in a pipeline so that the preprocessing is applied before the model is fitted.

In [11]:
pipeline = make_pipeline(col_transformer, LinearRegression())
pipeline

In [12]:
pipeline.fit(X_train, y_train)

Results on the training set

In [13]:
y_pred = pipeline.predict(X_train)
print('R2 :', r2_score(y_train, y_pred))
print('MSE :', mse(y_train, y_pred))
print('MAE :', mae(y_train, y_pred))
print('MAPE :', mape(y_train, y_pred))

R2 : 0.6556901461092943
MSE : 148550314295494.22
MAE : 4439146.705522163
MAPE : 1.6380159038984732


Results on the test set

In [14]:
y_pred = pipeline.predict(X_test)
print('R2 :', r2_score(y_test, y_pred))
print('MSE :', mse(y_test, y_pred))
print('MAE :', mae(y_test, y_pred))
print('MAPE :', mape(y_test, y_pred))

R2 : 0.1735762246629149
MSE : 675508587785177.9
MAE : 5095258.379670126
MAPE : 1.4621210770328374


This simple model clearly overfits: R2 drops from 0.66 on the training set to 0.17 on the test set, and the MSE is about 4.5 times higher.

## Model comparison

In [15]:
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RANSACRegressor, QuantileRegressor, SGDRegressor
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.cross_decomposition import PLSRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import (RandomForestRegressor, BaggingRegressor, AdaBoostRegressor, 
                              GradientBoostingRegressor, HistGradientBoostingRegressor)
from xgboost import XGBRegressor
from sklearn.neural_network import MLPRegressor

import time

import warnings
warnings.filterwarnings("ignore")

In [16]:
models = {'dummy': DummyRegressor(strategy='mean'),
          'lin_reg': LinearRegression(),
          'ridge': Ridge(random_state=42),
          'lasso': Lasso(random_state=42),
          #'ransac': RANSACRegressor(random_state=42),
          'quantile': QuantileRegressor(solver='highs'),
          'sgd': SGDRegressor(random_state=42),
          'lin_svm': LinearSVR(random_state=42),
          'knn': KNeighborsRegressor(),
          'pls1': PLSRegression(),
          'gaussian': GaussianProcessRegressor(random_state=42),
          'forest': RandomForestRegressor(random_state=42),
          'bagging': BaggingRegressor(random_state=42),
          'adaboost': AdaBoostRegressor(random_state=42),
          'gradboost': GradientBoostingRegressor(random_state=42),
          'histgboost': HistGradientBoostingRegressor(random_state=42),
          'xgb': XGBRegressor(random_state=42),
          'neur_net': MLPRegressor(hidden_layer_sizes=[10]*10, random_state=42)
         }

model_results = model_comparison(models, X_train, y_train, X_test, y_test, col_transformer)

_____________________________________________________________________________________
model			 train_score	 test_score	 fit_time	 predict_time
dummy       		 0.000		 -0.000		 0.009s		 0.005s
-------------------------------------------------------------------------------------
lin_reg     		 0.656		 0.174		 0.01s		 0.007s
-------------------------------------------------------------------------------------
ridge       		 0.654		 0.168		 0.018s		 0.007s
-------------------------------------------------------------------------------------
lasso       		 0.658		 0.177		 0.031s		 0.005s
-------------------------------------------------------------------------------------


quantile    		 -0.071		 -0.042		 0.095s		 0.005s
-------------------------------------------------------------------------------------
sgd         		 0.657		 0.170		 0.134s		 0.006s
-------------------------------------------------------------------------------------
lin_svm     		 -0.157		 -0.089		 0.014s		 0.006s
-------------------------------------------------------------------------------------
knn         		 0.700		 0.098		 0.009s		 0.084s
-------------------------------------------------------------------------------------
pls1        		 0.646		 0.161		 0.021s		 0.011s
-------------------------------------------------------------------------------------
gaussian    		 0.891		 -173.375		 0.15s		 0.028s
-------------------------------------------------------------------------------------
forest      		 0.952		 0.196		 0.479s		 0.014s
-------------------------------------------------------------------------------------
bagging     		 0.930		 0.179		 0.06s		 0.007s
-----------------

In [17]:
model_results.sort_values(by="test_score", ascending=False)

model,train_score,test_score,fit_time,predict_time
histgboost,0.675,0.2429,0.204,0.0069
adaboost,0.8367,0.2213,0.0337,0.0067
xgb,0.9954,0.199,0.1903,0.0074
forest,0.9525,0.1957,0.4791,0.0142
bagging,0.9301,0.1789,0.0605,0.007
lasso,0.6583,0.1768,0.0312,0.0055
gradboost,0.9407,0.1743,0.1281,0.0064
lin_reg,0.6557,0.1736,0.0096,0.0067
sgd,0.6569,0.1699,0.1345,0.0064
ridge,0.6544,0.1675,0.0177,0.0071
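The `model_comparison` helper comes from `utils` and its code is not shown in this notebook. The sketch below is a hypothetical stand-in, consistent with the columns of the table above (`train_score`, `test_score`, `fit_time`, `predict_time`). The actual implementation may differ, and the name `model_comparison_sketch` and the toy data are assumptions.

```python
import time

import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def model_comparison_sketch(models, X_train, y_train, X_test, y_test, preprocessor):
    """Hypothetical stand-in for utils.model_comparison: fit each model behind
    the same preprocessor and collect R2 scores and timings."""
    rows = []
    for name, model in models.items():
        # Clone so that each model gets its own unfitted preprocessor
        pipe = make_pipeline(clone(preprocessor), clone(model))

        start = time.perf_counter()
        pipe.fit(X_train, y_train)
        fit_time = time.perf_counter() - start

        start = time.perf_counter()
        pipe.predict(X_test)
        predict_time = time.perf_counter() - start

        rows.append({
            "model": name,
            "train_score": pipe.score(X_train, y_train),  # R2 for regressors
            "test_score": pipe.score(X_test, y_test),
            "fit_time": fit_time,
            "predict_time": predict_time,
        })
    return pd.DataFrame(rows).set_index("model")


# Toy usage on synthetic data (assumption)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"PropertyGFATotal": rng.uniform(20_000, 500_000, size=200)})
y_toy = 60 * X_toy["PropertyGFATotal"] + rng.normal(0, 1e6, size=200)
toy_results = model_comparison_sketch(
    {"dummy": DummyRegressor(strategy="mean"), "lin_reg": LinearRegression()},
    X_toy.iloc[:160], y_toy.iloc[:160], X_toy.iloc[160:], y_toy.iloc[160:],
    MinMaxScaler(),
)
print(toy_results.sort_values(by="test_score", ascending=False))
```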


## Adding the `YearBuilt` and `NumberofFloors` variables

We discretize the `YearBuilt` and `NumberofFloors` variables (see [this link](https://scikit-learn.org/stable/modules/preprocessing.html#discretization) for more information).
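A minimal sketch of such a discretization with scikit-learn's `KBinsDiscretizer` is given below. The values are made up, and the number of bins, the `uniform` strategy and the ordinal encoding are assumptions for illustration, not the settings used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Toy values (made up; the bin count and strategy are assumptions)
X_toy = pd.DataFrame({
    "YearBuilt": [1900, 1927, 1969, 1980, 1996, 2010],
    "NumberofFloors": [1, 2, 10, 12, 18, 41],
})

# Three equal-width bins per column, encoded as ordinal integers 0, 1, 2
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = discretizer.fit_transform(X_toy)

print(X_binned)
print(discretizer.bin_edges_)  # one array of bin edges per column
```

For a skewed column such as `NumberofFloors`, `strategy="quantile"` is an alternative that places roughly the same number of buildings in each bin, and `encode="onehot-dense"` would turn each bin into its own indicator column.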

# Prediction of the log of the non-normalized target