# Preprocessor Notebook: Social Housing, annual RPLS file

This notebook processes the Excel file of the annual RPLS, which contains data on social housing.
The goal is to extract the following datasets from the XLSX file downloaded from the website of the Ministry of Sustainable Development:
 - Data by region
 - Data by department
 - Data by EPCI
 - Data by commune

 ### Parameters
 This notebook takes input parameters, defined in a dedicated cell.
 That cell is tagged "parameters", which allows papermill to pass values to it.
 - filepath: the path to the Excel file to process
 - model_name: the name of the source model
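
As an illustration of how these two parameters could be supplied from outside, here is a minimal sketch using `papermill.execute_notebook`. The helper names `build_parameters` and `run_notebook`, as well as the notebook paths in the commented call, are hypothetical and not part of the project.

```python
from typing import Any, Dict


def build_parameters(filepath: str, model_name: str) -> Dict[str, Any]:
    """Build the dict of values injected into the cell tagged "parameters"."""
    return {"filepath": filepath, "model_name": model_name}


def run_notebook(input_path: str, output_path: str, parameters: Dict[str, Any]) -> None:
    """Execute a notebook with papermill, overriding its parameters."""
    # Imported lazily so that build_parameters has no third-party dependency
    import papermill as pm

    pm.execute_notebook(input_path, output_path, parameters=parameters)


# Same values as the defaults defined in this notebook
params = build_parameters(
    filepath="data/imports/logement/logement.logements_sociaux_1.xlsx",
    model_name="logement.logements_sociaux",
)

# Hypothetical notebook paths, shown for illustration only:
# run_notebook("preprocessor.ipynb", "preprocessor_out.ipynb", params)
```

Only plain strings are passed here, which matches the two parameters this notebook declares.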

 ### Principle
 This notebook extracts 4 sheets from the input Excel file: REGION, DEPARTEMENT, EPCI, COMMUNES.
 Each sheet is loaded into a DataFrame, converted to JSON, then loaded into the Bronze layer.
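
The DataFrame-to-JSON step can be sketched on a tiny in-memory table. One detail worth knowing: with pandas, `to_dict(orient="records")` does not include the index, so a column used as `index_col` is absent from the records unless the index is reset first. Whether the code column should be kept is a design decision this sketch does not settle; the sample values are taken from the REGION output of this notebook.

```python
import pandas as pd

# Two rows in the shape of the REGION sheet, indexed by the region code
df = pd.DataFrame(
    {
        "REG": [1, 2],
        "LIBREG": ["Guadeloupe", "Martinique"],
        "nb_loues": [35657, 33491],
    }
).set_index("REG")

# orient="records" yields one dict per row and leaves the index out
records_without_index = df.to_dict(orient="records")

# reset_index() turns the index back into a regular column before conversion
records_with_index = df.reset_index().to_dict(orient="records")

print(records_without_index[0])
print(records_with_index[0])
```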

## Initialization

The following cells import the required modules and set up the common variables used throughout the processing.

In [1]:
# Baseline imports
import pandas as pd
import os
import sys
import datetime

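# Workaround: walk up the directory tree until the 13_odis project root is reached,
# so that the common odis modules can be imported when the notebook is not executed from 13_odis
current_dir = os.getcwd()
while not current_dir.endswith("13_odis"):
    parent_dir = os.path.dirname(current_dir)
    if parent_dir == current_dir:
        # Filesystem root reached: stop instead of looping forever
        raise RuntimeError("13_odis project root not found above the working directory")
    print("changing to parent dir")
    os.chdir(parent_dir)
    current_dir = parent_dir

print(os.getcwd())
sys.path.append(current_dir)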

changing to parent dir
/Users/alex/dev/13_odis


In [2]:
# additional imports
from common.config import load_config
from common.data_source_model import DataSourceModel
from common.utils.file_handler import FileHandler
from common.utils.interfaces.data_handler import OperationType

## Notebook parameters
Parameters that papermill can pass as input.

Only built-in types (str, int, etc.) seem to work; specific classes and richer objects (datetime...) seem to make papermill crash.

Official papermill documentation: Parameterize [https://papermill.readthedocs.io/en/latest/usage-parameterize.html]
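
Given that restriction, one possible workaround is to pass a date as an ISO 8601 string and parse it inside the notebook. This is only a sketch: the `run_date` parameter is hypothetical and does not exist in this notebook.

```python
import datetime

# Value as it would be set in the cell tagged "parameters": a plain str
run_date = "2025-04-15T09:37:26+00:00"  # hypothetical parameter

# Convert to a timezone-aware datetime once inside the notebook
parsed_run_date = datetime.datetime.fromisoformat(run_date)

print(parsed_run_date.isoformat())
```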

In [3]:
# Define parameters for papermill. 
filepath = 'data/imports/logement/logement.logements_sociaux_1.xlsx'
model_name = "logement.logements_sociaux"


## Useful variables and functions

A few utility variables and functions are defined here.
The utility functions will later be factored out into dedicated Python classes.

In [4]:
# Initialize common variables
dataframes = {}
artifacts = []

start_time = datetime.datetime.now(tz=datetime.timezone.utc)
config = load_config("datasources.yaml", response_model=DataSourceModel)
model = config.get_model( model_name = model_name )
# Instantiate File Handler for file loads and dumps
handler = FileHandler()

In [5]:
import math

# Utility function to cleanup JSON data exported from a dataframe, before dumping it to json
def clean_json(obj):
    """
    Cleans JSON data by replacing invalid values (NaN, INF, "NA", empty strings) with None.
    
    :param obj: JSON object
    :return: Cleaned JSON object
    """
    if isinstance(obj, dict):
        return {k: clean_json(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [clean_json(v) for v in obj]
    elif isinstance(obj, float):
        return None if math.isinf(obj) or math.isnan(obj) else obj
    elif isinstance(obj, str):
        return None if obj.upper() in ("INF", "NA", "NAN", "") else obj
    return obj
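
To illustrate why this kind of cleanup matters, the sketch below shows how the standard `json` module treats a NaN value. It says nothing about how `handler.artifact_dump` serializes data internally, which is not shown in this notebook; the sample record is made up for the example.

```python
import json
import math

record = {"nb_loues": float("nan"), "LIBREG": "Guadeloupe"}

# By default the standard library writes the non-standard token NaN
default_dump = json.dumps(record)

# With allow_nan=False the same value is rejected with a ValueError
try:
    json.dumps(record, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

# Replacing non-finite floats with None first yields a valid JSON null
safe_record = {
    key: (None if isinstance(value, float) and not math.isfinite(value) else value)
    for key, value in record.items()
}
safe_dump = json.dumps(safe_record, allow_nan=False)

print(default_dump)
print(safe_dump)
```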

## Data processing
From here on, the Excel file is loaded into pandas and the sheets to extract are processed one by one.
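
The sheet cells that follow each drop the year columns 2019 to 2023 through a hard-coded list, described in the code comments as columns whose label is not a str. As a sketch of an alternative, those columns could be selected by label type instead. `drop_non_string_columns` is a hypothetical helper, not an existing project function.

```python
import pandas as pd


def drop_non_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the columns whose label is a str."""
    keep = [isinstance(label, str) for label in df.columns]
    return df.loc[:, keep]


# Small sample mixing string labels and integer year labels
sample = pd.DataFrame(
    {"LIBREG": ["Guadeloupe"], 2019: [0], 2023: [0], "nb_ls": [40059]}
)
trimmed = drop_non_string_columns(sample)

print(list(trimmed.columns))
```

Selecting by type avoids editing the list when a new yearly file adds or removes a year column, at the cost of also dropping any other non-string label that might appear.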

In [6]:
# Load workbook to pandas
wb = pd.ExcelFile(
    filepath,
    engine = 'openpyxl'
)

In [7]:
# Load excel sheet for Regions
sheet_name = "REGION"
df_region = pd.read_excel(wb, 
                    sheet_name = "REGION",
                    index_col = "REG",
                    header = 5
                    )
dataframes["REGION"] = df_region

discard_cols = [ 2019, 2020, 2021, 2022, 2023]
df_region = df_region.drop(labels = discard_cols, axis = 1)

# Dump into a JSON artifact
region_json = df_region.to_dict(orient = 'records')
region_artifact = handler.artifact_dump( region_json, "REGION", model, format = "json" )
artifacts.append(region_artifact)

df_region.head()

2025-04-15 09:37:26,266 - main - INFO :: file_handler.py :: logement.logements_sociaux -> results saved to : 'data/imports/logement/logement.logements_sociaux_REGION.json'


REG,LIBREG,nb_loues,nb_vacants,nb_vides,nb_asso,nb_occup_finan,nb_occup_temp,nb_ls,parc_non_conv,nb_lgt_tot,...,ener_A_new,ener_B_new,ener_C_new,ener_D_new,ener_E_new,ener_F_new,ener_G_new,ener_NR_new,nb_dpe_realise,perc_dpe_realise
1,Guadeloupe,35657,1508,1350,161,1383,0,40059,0,40059,...,0,0,0,0,0,0,0,0,0,0.0
2,Martinique,33491,1156,269,19,506,0,35441,0,35441,...,0,0,0,0,0,0,0,0,0,0.0
3,Guyane,19585,1213,405,0,559,0,21762,0,21762,...,0,0,0,0,0,0,0,0,0,0.0
4,La Réunion,80140,1082,1583,188,470,0,83463,0,85944,...,0,0,0,0,0,0,0,0,0,0.0
6,Mayotte,2234,248,78,60,321,0,2941,0,2941,...,0,0,0,0,0,0,0,0,0,0.0


In [8]:
# Load excel sheet for Departments
df_department = pd.read_excel(wb, 
                    sheet_name = "DEPARTEMENT",
                    index_col = "DEP",
                    header = 5
                    )
dataframes["DEPARTEMENT"] = df_department

discard_cols = [ 2019, 2020, 2021, 2022, 2023]
df_department = df_department.drop(labels = discard_cols, axis = 1)

# Dump into a JSON artifact
department_json = df_department.to_dict( orient = 'records' )
department_artifact = handler.artifact_dump( department_json, "DEPARTEMENT", model, format = "json" )
artifacts.append(department_artifact)

df_department.head()

2025-04-15 09:37:27,090 - main - INFO :: file_handler.py :: logement.logements_sociaux -> results saved to : 'data/imports/logement/logement.logements_sociaux_DEPARTEMENT.json'


DEP,Unnamed: 1,nb_loues,nb_vacants,nb_vides,nb_asso,nb_occup_finan,nb_occup_temp,nb_ls,parc_non_conv,nb_lgt_tot,...,ener_A_new,ener_B_new,ener_C_new,ener_D_new,ener_E_new,ener_F_new,ener_G_new,ener_NR_new,nb_dpe_realise,perc_dpe_realise
1,Ain,46145,1200,1063,203,997,0,49608,1663,51271,...,105,544,5514,3751,1060,170,37,41,31971,64.447267
2,Aisne,37628,1383,1371,220,600,15,41217,170,41387,...,29,299,1912,2769,2870,873,287,453,39397,95.584346
3,Allier,17862,963,810,42,176,0,19853,116,19969,...,29,131,2000,2983,1136,473,176,1,16974,85.498413
4,Alpes-de-Haute-Provence,7549,159,90,22,68,0,7888,44,7932,...,116,162,291,375,546,69,16,1,7604,96.399594
5,Hautes-Alpes,7607,298,30,87,28,0,8050,8,8058,...,3,46,226,401,332,81,10,1,4556,56.596273


In [9]:
# Load excel sheet for EPCI
df_epci = pd.read_excel(wb, 
                    sheet_name = "EPCI",
                    index_col = "EPCI_DEP",
                    header = 5
                    )

dataframes["EPCI"] = df_epci

# discard cols that don't have a str as label
discard_cols = [ 2019, 2020, 2021, 2022, 2023 ]
df_epci = df_epci.drop(labels = discard_cols, axis = 1)

# Dump into a JSON artifact
epci_json = df_epci.to_dict( orient = 'records' )
epci_artifact = handler.artifact_dump( epci_json, "EPCI", model, format = "json" )
artifacts.append(epci_artifact)

df_epci.head()

2025-04-15 09:37:28,904 - main - INFO :: file_handler.py :: logement.logements_sociaux -> results saved to : 'data/imports/logement/logement.logements_sociaux_EPCI.json'


EPCI_DEP,DEP,LIBEPCI,nb_loues,nb_vacants,nb_vides,nb_asso,nb_occup_finan,nb_occup_temp,nb_ls,parc_non_conv,...,ener_A_new,ener_B_new,ener_C_new,ener_D_new,ener_E_new,ener_F_new,ener_G_new,ener_NR_new,nb_dpe_realise,perc_dpe_realise
200029999 - (01),1,CC Rives de l'Ain - Pays du Cerdon,600,17,6,0,14,0,637,0,...,0,6,89,75,9,3,0,0,547,85.871272
200040350 - (01),1,CC Bugey Sud,1758,65,76,0,37,0,1936,22,...,0,26,184,152,71,21,0,0,1063,54.907025
200040590 - (01),1,CA Villefranche Beaujolais Saône,688,13,5,0,7,0,713,14,...,0,4,31,67,14,0,0,0,370,51.893408
200042497 - (01),1,CC Dombes Saône Vallée,1899,29,21,0,27,0,1976,37,...,2,8,76,93,33,1,0,0,1334,67.510121
200042935 - (01),1,CA Haut - Bugey Agglomération,7320,285,311,34,228,0,8178,115,...,20,2,300,414,175,39,3,0,3814,46.63732


In [10]:
# Load excel sheet for COMMUNES
df_communes = pd.read_excel(wb, 
                    sheet_name = "COMMUNES",
                    index_col = "DEPCOM_ARM",
                    header = 5
                    )

dataframes["COMMUNES"] = df_communes

# discard cols that don't have a str as label
discard_cols = [ 2019, 2020, 2021, 2022, 2023 ]
df_communes = df_communes.drop(labels = discard_cols, axis = 1)

# Dump into a JSON artifact
communes_json = df_communes.to_dict( orient = 'records' )
communes_artifact = handler.artifact_dump( communes_json, "COMMUNES", model, format = "json" )
artifacts.append(communes_artifact)

df_communes.head()

2025-04-15 09:37:54,370 - main - INFO :: file_handler.py :: logement.logements_sociaux -> results saved to : 'data/imports/logement/logement.logements_sociaux_COMMUNES.json'


DEPCOM_ARM,REG,DEP,LIBCOM_DEP,LIBCOM,nb_loues,nb_vacants,nb_vides,nb_asso,nb_occup_finan,nb_occup_temp,...,ener_A_new,ener_B_new,ener_C_new,ener_D_new,ener_E_new,ener_F_new,ener_G_new,ener_NR_new,nb_dpe_realise,perc_dpe_realise
1001,84,1,L'Abergement-Clémenciat (01),L'Abergement-Clémenciat,31.0,1.0,0.0,0.0,0.0,0.0,...,0,14,0,4,2,4,0,0,32,100.0
1004,84,1,Ambérieu-en-Bugey (01),Ambérieu-en-Bugey,1906.0,96.0,59.0,1.0,47.0,0.0,...,27,13,60,0,1,1,0,0,847,40.161214
1005,84,1,Ambérieux-en-Dombes (01),Ambérieux-en-Dombes,107.0,3.0,2.0,0.0,1.0,0.0,...,0,0,0,0,0,0,0,0,81,71.681416
1007,84,1,Ambronay (01),Ambronay,120.0,3.0,0.0,0.0,6.0,0.0,...,0,0,28,2,0,0,0,0,42,32.55814
1008,84,1,Ambutrix (01),Ambutrix,16.0,1.0,0.0,0.0,0.0,0.0,...,0,4,0,0,0,0,0,0,17,100.0


## Saving metadata
The process metadata is saved locally, to keep a history and to be able to resume after an error if needed.
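
The metadata calls in this notebook pass `complete = True` and `errors = 0` as constants. As a sketch of one possible convention, both values could instead be derived from the artifacts, assuming each artifact exposes a boolean `success` field as the printed artifact dumps suggest. `ArtifactStub` and `summarize` are hypothetical stand-ins, not project classes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ArtifactStub:
    """Hypothetical stand-in for an artifact log entry with a success flag."""

    name: str
    success: bool


def summarize(artifacts: List[ArtifactStub]) -> Tuple[bool, int]:
    """Return (complete, errors), counting artifacts whose success flag is False."""
    errors = sum(1 for artifact in artifacts if not artifact.success)
    return errors == 0, errors


complete, errors = summarize(
    [ArtifactStub("REGION", success=False), ArtifactStub("DEPARTEMENT", success=True)]
)

print(complete, errors)
```

With this convention a failed artifact would show up in the saved metadata, which is what resuming after an error relies on.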

In [11]:
for artifact in artifacts:
    print(artifact.model_dump( mode = "json" ))

preprocess_metadata = handler.dump_metadata(
    model = model,
    operation = OperationType.PREPROCESS,
    start_time = start_time,
    complete = True,
    errors = 0,
    artifacts = artifacts,
    pages = []
)

{'name': 'REGION', 'storage_info': {'location': 'data/imports/logement', 'format': 'json', 'file_name': 'logement.logements_sociaux_REGION.json', 'encoding': 'utf-8'}, 'load_to_bronze': True, 'success': True}
{'name': 'DEPARTEMENT', 'storage_info': {'location': 'data/imports/logement', 'format': 'json', 'file_name': 'logement.logements_sociaux_DEPARTEMENT.json', 'encoding': 'utf-8'}, 'load_to_bronze': True, 'success': True}
{'name': 'EPCI', 'storage_info': {'location': 'data/imports/logement', 'format': 'json', 'file_name': 'logement.logements_sociaux_EPCI.json', 'encoding': 'utf-8'}, 'load_to_bronze': True, 'success': True}
{'name': 'COMMUNES', 'storage_info': {'location': 'data/imports/logement', 'format': 'json', 'file_name': 'logement.logements_sociaux_COMMUNES.json', 'encoding': 'utf-8'}, 'load_to_bronze': True, 'success': True}
2025-04-15 09:38:03,274 - main - INFO :: file_handler.py :: logement.logements_sociaux -> results saved to : 'data/imports/logement/logement.logements_soc

## Loading into the Bronze layer
A JsonLoader is instantiated to load all artifacts into the database.

In [13]:
from common.utils.factory.loader_factory import create_loader

# Instantiate the JSON loader. 'format' is important here as model.format = 'xlsx'
loader = create_loader(config, model, handler=FileHandler(), format= "json")

results = []

for result_artifact in loader.load_artifacts(artifacts):    
    print(result_artifact)
    results.append(result_artifact)

# Finally : dump the artifact load metadata
load_metadata = handler.dump_metadata(
    model = model,
    operation = OperationType.LOAD,
    start_time = start_time,
    complete = True,
    errors = 0,
    artifacts = results,
    pages = []
)

2025-04-15 09:38:59,413 - main - INFO :: json_loader.py :: Creating table bronze.logement_logements_sociaux_REGION
2025-04-15 09:38:59,428 - main - INFO :: json_loader.py :: Table bronze.logement_logements_sociaux_REGION created successfully
2025-04-15 09:38:59,429 - main - DEBUG :: file_handler.py :: loading JSON file : data/imports/logement/logement.logements_sociaux_REGION.json
2025-04-15 09:38:59,433 - main - INFO :: json_loader.py :: did not find a datapath indication for logement.logements_sociaux: Loading JSON data as-is.
2025-04-15 09:38:59,437 - main - INFO :: json_loader.py :: Inserting artifact table REGION from model logement_logements_sociaux
2025-04-15 09:38:59,439 - main - ERROR :: json_loader.py :: Error loading data for page REGION: cursor already closed
Traceback (most recent call last):
  File "/Users/alex/dev/13_odis/common/utils/loaders/json_loader.py", line 145, in load_artifacts
    load_success = self.load_to_db(payload, suffix = artifact_log.name)
             

2025-04-15 09:38:59,473 - main - INFO :: json_loader.py :: Table bronze.logement_logements_sociaux_DEPARTEMENT created successfully
2025-04-15 09:38:59,474 - main - DEBUG :: file_handler.py :: loading JSON file : data/imports/logement/logement.logements_sociaux_DEPARTEMENT.json
2025-04-15 09:38:59,477 - main - INFO :: json_loader.py :: did not find a datapath indication for logement.logements_sociaux: Loading JSON data as-is.
2025-04-15 09:38:59,478 - main - INFO :: json_loader.py :: Inserting artifact table DEPARTEMENT from model logement_logements_sociaux
2025-04-15 09:38:59,480 - main - ERROR :: json_loader.py :: Error loading data for page DEPARTEMENT: cursor already closed
Traceback (most recent call last):
  File "/Users/alex/dev/13_odis/common/utils/loaders/json_loader.py", line 145, in load_artifacts
    load_success = self.load_to_db(payload, suffix = artifact_log.name)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/dev/13_odis/.ven