# Preprocessor Notebook: Social Housing, annual RPLS file

This notebook processes the Excel file of the annual RPLS, which contains data on social housing.
The goal is to extract the following datasets from the XLSX file downloaded from the website of the French Ministry of Sustainable Development (ministère du Développement Durable):
 - Data by region
 - Data by department
 - Data by EPCI
 - Data by commune

### Parameters
This notebook takes input parameters, defined in the cell tagged "parameters" further down (cell In [3]).
That tag allows papermill to pass values to the cell.
- filepath: the path to the Excel file to process
- model_name: the name of the source model
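
As an illustration, the two parameters above could be overridden from the papermill command line with `-p NAME VALUE`. The sketch below only builds and prints such a command; the notebook and output file names are hypothetical, and the parameter values are the defaults used in this notebook.

```python
import shlex


def build_papermill_command(notebook, output, parameters):
    """Build a papermill CLI call that overrides the tagged parameters."""
    command = ["papermill", notebook, output]
    for name, value in parameters.items():
        # "-p NAME VALUE" sets one parameter of the "parameters" cell
        command += ["-p", name, str(value)]
    return command


command = build_papermill_command(
    "logements_sociaux.ipynb",      # hypothetical notebook file name
    "logements_sociaux.out.ipynb",  # hypothetical executed-notebook path
    {
        "filepath": "data/imports/logement_social/logement_social.logements_sociaux_1.xlsx",
        "model_name": "logement_social.logements_sociaux",
    },
)

# Print a shell-ready version of the command
print(shlex.join(command))
```

Passing everything as strings on the command line fits the restriction to simple built-in types described in the parameters section of this notebook.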

### Principle
This notebook extracts 4 sheets from the input Excel file: `REGION`, `DEPARTEMENT`, `EPCI` and `COMMUNES`.
Each sheet is loaded into a DataFrame, saved as .xlsx, and then loaded into the Bronze database.
These DataFrames need little reprocessing: only the "departement" table needs one column renamed.
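
The column rename mentioned above is still marked as a TODO in the department cell. A minimal sketch of what it could look like, assuming the department label column is read by pandas as `Unnamed: 1` (as the `keep_columns_departments` list suggests) and that `LIBDEP` is the desired name (an assumption made by analogy with `LIBREG` and `LIBEPCI`). The small DataFrame is a stand-in for the real sheet.

```python
import pandas as pd

# Stand-in for the "DEPARTEMENT" sheet: the label column has no header in
# this hypothetical case, so pandas names it "Unnamed: 1".
df_department = pd.DataFrame(
    {"Unnamed: 1": ["Ain", "Aisne"], "nb_ls": [48874, 41136]},
    index=pd.Index([1, 2], name="DEP"),
)

# "LIBDEP" is an assumed target name, not confirmed by the notebook.
df_department = df_department.rename(columns={"Unnamed: 1": "LIBDEP"})

print(df_department.columns.tolist())
```

`rename` leaves the DataFrame untouched if the source column is absent, so the step is harmless when the sheet already provides a proper header.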

## Initialization

The following cells import the required modules and prepare the common variables used during processing.

In [1]:
# Baseline imports
import pandas as pd
import os
import sys
import datetime

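# Workaround to make the common odis modules importable when the notebook
# is not executed from 13_odis: walk up the tree until that folder is reached
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
while not current_dir.endswith("13_odis"):
    if parent_dir == current_dir:
        # Filesystem root reached: stop instead of looping forever
        raise RuntimeError("13_odis not found in any parent directory")
    print("changing to parent dir")
    os.chdir(parent_dir)
    current_dir = parent_dir
    parent_dir = os.path.dirname(current_dir)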

print(os.getcwd())
sys.path.append(current_dir)

changing to parent dir
/home/jbn/13_odis


In [2]:
# additional imports
from common.config import load_config
from common.data_source_model import DataSourceModel
from common.utils.file_handler import FileHandler
from common.utils.interfaces.data_handler import OperationType

## Notebook parameters
Parameters that can be passed as input by papermill.

Only built-in types (str, int, etc.) seem to work; custom classes and richer objects (datetime, etc.) seem to make papermill crash.

Official papermill documentation: [Parameterize](https://papermill.readthedocs.io/en/latest/usage-parameterize.html)
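
Given that restriction, one possible workaround is to pass a date as an ISO 8601 string and convert it back to a `datetime` inside the notebook. The sketch below assumes a hypothetical `run_date` parameter, which does not exist in this notebook.

```python
import datetime

# Value as papermill would inject it into the parameters cell: a plain string.
# "run_date" is a hypothetical parameter name used for illustration only.
run_date = "2025-07-23T15:24:55+00:00"

# Converted back to a timezone-aware datetime in a later cell
run_datetime = datetime.datetime.fromisoformat(run_date)

print(run_datetime.isoformat())
```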

In [3]:
# Define parameters for papermill. 
filepath = 'data/imports/logement_social/logement_social.logements_sociaux_1.xlsx'
model_name = "logement_social.logements_sociaux"


In [4]:
# Initialize common variables
dataframes = {}
artifacts = []

start_time = datetime.datetime.now(tz=datetime.timezone.utc)
config = load_config("datasources.yaml", response_model=DataSourceModel)
model = config.get_model( model_name = model_name )
# Instantiate File Handler for file loads and dumps
handler = FileHandler()

## Data processing
From this point on, the Excel file is loaded into pandas and the sheets to extract are processed one by one.
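
Every sheet is read with `header = 5` and a named `index_col`: the sixth row of the sheet holds the column names, the rows above it are ignored, and one column becomes the index. The sketch below illustrates that behaviour with `pd.read_csv` on in-memory text as a stand-in for `pd.read_excel`, so that it runs without a workbook; the five preamble lines are invented, and only the two data rows reuse values from the `REGION` sheet.

```python
import io

import pandas as pd

# Five invented preamble lines, then the header on row index 5 (zero-based)
raw = "\n".join(
    [
        "preamble line 1",
        "preamble line 2",
        "preamble line 3",
        "preamble line 4",
        "preamble line 5",
        "REG,LIBREG,nb_ls",
        "1,Guadeloupe,37505",
        "2,Martinique,34929",
    ]
)

# header=5: column names come from row 5, everything above is discarded
# index_col="REG": the region code becomes the DataFrame index
df = pd.read_csv(io.StringIO(raw), header=5, index_col="REG")

print(df)
```

If the layout of the ministry file changes from one year to the next, `header` is the first value to check.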

In [5]:
# Load workbook to pandas
wb = pd.ExcelFile(
    filepath,
    engine = 'openpyxl'
)

In [6]:
# Load excel sheet for Regions
sheet_name = "REGION"
keep_columns_region = [
    'LIBREG',
    'densite',
    'nb_ls',
    'tx_vac',
    'tx_mob'
]


df_region = pd.read_excel(wb, 
                    sheet_name = "REGION",
                    index_col = "REG",
                    header = 5
                    )

# df_region = df_region[keep_columns_region]
dataframes["REGION"] = df_region

region_artifact = handler.artifact_dump( df_region, "REGION", model)
artifacts.append(region_artifact)

df_region.head()

2025-07-23 15:24:55,218 - DEBUG :: file_handler.py :: dump (130) :: dumping: data/imports/logement_social/logement_social.logements_sociaux_REGION.xlsx
2025-07-23 15:24:55,321 - DEBUG :: file_handler.py :: file_dump (273) :: logement_social.logements_sociaux -> results saved to : 'data/imports/logement_social/logement_social.logements_sociaux_REGION.xlsx'


| REG | LIBREG | nb_loues | nb_vacants | nb_vides | nb_asso | nb_occup_finan | nb_occup_temp | nb_ls | parc_non_conv | nb_lgt_tot | ... | ener_A_new | ener_B_new | ener_C_new | ener_D_new | ener_E_new | ener_F_new | ener_G_new | ener_NR_new | nb_dpe_realise | perc_dpe_realise |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Guadeloupe | 33731 | 1403 | 928 | 145 | 1298 | 0 | 37505 | 0 | 37505 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 2 | Martinique | 32653 | 1129 | 363 | 84 | 700 | 0 | 34929 | 0 | 34929 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 3 | Guyane | 18950 | 893 | 595 | 0 | 617 | 0 | 21055 | 0 | 21055 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 4 | La Réunion | 80832 | 1025 | 1965 | 173 | 433 | 25 | 84453 | 0 | 84453 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 6 | Mayotte | 2093 | 154 | 88 | 63 | 280 | 0 | 2678 | 0 | 2678 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |


In [7]:
# Load excel sheet for Departments
keep_columns_departments = [
    'Unnamed: 1',
    'densite',
    'nb_ls',
    'tx_vac',
    'tx_mob'
]

df_department = pd.read_excel(wb, 
                    sheet_name = "DEPARTEMENT",
                    index_col = "DEP",
                    header = 5
                    )

# df_department = df_department[keep_columns_departments]

# TODO : rename column for Unnamed: 1

dataframes["DEPARTEMENT"] = df_department

department_artifact = handler.artifact_dump( df_department, "DEPARTEMENT", model)
artifacts.append(department_artifact)

df_department.head()

2025-07-23 15:24:55,567 - DEBUG :: file_handler.py :: dump (130) :: dumping: data/imports/logement_social/logement_social.logements_sociaux_DEPARTEMENT.xlsx
2025-07-23 15:24:55,894 - DEBUG :: file_handler.py :: file_dump (273) :: logement_social.logements_sociaux -> results saved to : 'data/imports/logement_social/logement_social.logements_sociaux_DEPARTEMENT.xlsx'


| DEP | LIBDEP | nb_loues | nb_vacants | nb_vides | nb_asso | nb_occup_finan | nb_occup_temp | nb_ls | parc_non_conv | nb_lgt_tot | ... | ener_A_new | ener_B_new | ener_C_new | ener_D_new | ener_E_new | ener_F_new | ener_G_new | ener_NR_new | nb_dpe_realise | perc_dpe_realise |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ain | 45634 | 1277 | 905 | 144 | 914 | 0 | 48874 | 1641 | 50515 | ... | 69 | 364 | 1242 | 1232 | 628 | 85 | 16 | 21 | 46727 | 95.607071 |
| 2 | Aisne | 37707 | 1286 | 1360 | 147 | 617 | 19 | 41136 | 170 | 41306 | ... | 1 | 35 | 630 | 1106 | 1265 | 174 | 76 | 5 | 38716 | 94.117075 |
| 3 | Allier | 17860 | 1056 | 933 | 37 | 90 | 0 | 19976 | 116 | 20092 | ... | 0 | 0 | 48 | 112 | 88 | 19 | 3 | 2 | 16775 | 83.975771 |
| 4 | Alpes-de-Haute-Provence | 7411 | 274 | 66 | 16 | 51 | 0 | 7818 | 44 | 7862 | ... | 116 | 91 | 60 | 107 | 91 | 6 | 1 | 1 | 7045 | 90.112561 |
| 5 | Hautes-Alpes | 7619 | 214 | 30 | 81 | 25 | 0 | 7969 | 8 | 7977 | ... | 0 | 16 | 6 | 154 | 145 | 31 | 5 | 5 | 4460 | 55.966872 |


In [8]:
# Load excel sheet for EPCI
keep_columns_epci = [
    'LIBEPCI',
    'densite',
    'nb_ls',
    'tx_vac',
    'tx_mob'
]

df_epci = pd.read_excel(wb, 
                    sheet_name = "EPCI",
                    index_col = "EPCI_DEP",
                    header = 5
                    )

# df_epci = df_epci[keep_columns_epci]

dataframes["EPCI"] = df_epci

epci_artifact = handler.artifact_dump( df_epci, "EPCI", model)
artifacts.append(epci_artifact)

df_epci.head()

2025-07-23 15:24:58,231 - DEBUG :: file_handler.py :: dump (130) :: dumping: data/imports/logement_social/logement_social.logements_sociaux_EPCI.xlsx
2025-07-23 15:25:02,493 - DEBUG :: file_handler.py :: file_dump (273) :: logement_social.logements_sociaux -> results saved to : 'data/imports/logement_social/logement_social.logements_sociaux_EPCI.xlsx'


| EPCI_DEP | DEP | LIBEPCI | nb_loues | nb_vacants | nb_vides | nb_asso | nb_occup_finan | nb_occup_temp | nb_ls | parc_non_conv | ... | ener_A_new | ener_B_new | ener_C_new | ener_D_new | ener_E_new | ener_F_new | ener_G_new | ener_NR_new | nb_dpe_realise | perc_dpe_realise |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 200000172 - (74) | 74 | CC Faucigny-Glières | 1686 | 55 | 143 | 21 | 40 | 0 | 1945 | 8 | ... | 0 | 13 | 125 | 2 | 1 | 0 | 0 | 0 | 1913 | 98.354756 |
| 200000438 - (44) | 44 | CC du Pays de Pontchâteau St-Gildas-des-Bois | 547 | 17 | 3 | 7 | 1 | 0 | 575 | 0 | ... | 0 | 15 | 59 | 14 | 0 | 0 | 0 | 0 | 541 | 94.086957 |
| 200000545 - (10) | 10 | CC des Portes de Romilly sur Seine | 2105 | 67 | 77 | 10 | 65 | 0 | 2324 | 26 | ... | 0 | 10 | 66 | 63 | 31 | 8 | 0 | 0 | 2284 | 98.27883 |
| 200000628 - (84) | 84 | CC Rhône Lez Provence | 961 | 21 | 15 | 0 | 20 | 0 | 1017 | 229 | ... | 0 | 0 | 0 | 9 | 16 | 21 | 0 | 0 | 665 | 65.388397 |
| 200000800 - (41) | 41 | CC Coeur de Sologne | 679 | 20 | 136 | 0 | 0 | 0 | 835 | 0 | ... | 0 | 0 | 17 | 163 | 26 | 2 | 0 | 0 | 826 | 98.922156 |


In [9]:
# Load excel sheet for COMMUNES
keep_columns_communes = [
    'LIBCOM_DEP',
    'densite',
    'nb_ls',
    'tx_vac',
    'tx_mob'
]

df_communes = pd.read_excel(wb, 
                    sheet_name = "COMMUNES",
                    index_col = "DEPCOM_ARM",
                    header = 5
                    )

# df_communes = df_communes[keep_columns_communes]

dataframes["COMMUNES"] = df_communes

communes_artifact = handler.artifact_dump( df_communes, "COMMUNES", model )
artifacts.append(communes_artifact)

df_communes.head()

2025-07-23 15:25:33,865 - DEBUG :: file_handler.py :: dump (130) :: dumping: data/imports/logement_social/logement_social.logements_sociaux_COMMUNES.xlsx
2025-07-23 15:26:32,381 - DEBUG :: file_handler.py :: file_dump (273) :: logement_social.logements_sociaux -> results saved to : 'data/imports/logement_social/logement_social.logements_sociaux_COMMUNES.xlsx'


| DEPCOM_ARM | REG | DEP | LIBCOM_DEP | LIBCOM | nb_loues | nb_vacants | nb_vides | nb_asso | nb_occup_finan | nb_occup_temp | ... | ener_A_new | ener_B_new | ener_C_new | ener_D_new | ener_E_new | ener_F_new | ener_G_new | ener_NR_new | nb_dpe_realise | perc_dpe_realise |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | 84 | 1 | L'Abergement-Clémenciat (01) | L'Abergement-Clémenciat | 21.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 22 | 100.0 |
| 1004 | 84 | 1 | Ambérieu-en-Bugey (01) | Ambérieu-en-Bugey | 1930.0 | 55.0 | 48.0 | 1.0 | 50.0 | 0.0 | ... | 27 | 13 | 0 | 0 | 1 | 1 | 0 | 0 | 2017 | 96.785029 |
| 1005 | 84 | 1 | Ambérieux-en-Dombes (01) | Ambérieux-en-Dombes | 100.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 99 | 95.192308 |
| 1007 | 84 | 1 | Ambronay (01) | Ambronay | 118.0 | 6.0 | 0.0 | 0.0 | 5.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 129 | 100.0 |
| 1008 | 84 | 1 | Ambutrix (01) | Ambutrix | 17.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17 | 100.0 |


## Saving the metadata
The process metadata is saved locally, to keep a history and to allow resuming after an error if needed.

In [10]:
for artifact in artifacts:
    print(artifact.model_dump( mode = "json" ))

preprocess_metadata = handler.dump_metadata(
    model = model,
    operation = OperationType.PREPROCESS,
    start_time = start_time,
    complete = True,
    errors = 0,
    artifacts = artifacts,
    pages = []
)

{'name': 'REGION', 'storage_info': {'location': 'data/imports/logement_social', 'format': 'xlsx', 'file_name': 'logement_social.logements_sociaux_REGION.xlsx', 'encoding': 'utf-8'}, 'load_to_bronze': True, 'success': True}
{'name': 'DEPARTEMENT', 'storage_info': {'location': 'data/imports/logement_social', 'format': 'xlsx', 'file_name': 'logement_social.logements_sociaux_DEPARTEMENT.xlsx', 'encoding': 'utf-8'}, 'load_to_bronze': True, 'success': True}
{'name': 'EPCI', 'storage_info': {'location': 'data/imports/logement_social', 'format': 'xlsx', 'file_name': 'logement_social.logements_sociaux_EPCI.xlsx', 'encoding': 'utf-8'}, 'load_to_bronze': True, 'success': True}
{'name': 'COMMUNES', 'storage_info': {'location': 'data/imports/logement_social', 'format': 'xlsx', 'file_name': 'logement_social.logements_sociaux_COMMUNES.xlsx', 'encoding': 'utf-8'}, 'load_to_bronze': True, 'success': True}
2025-07-23 15:26:32,441 - DEBUG :: file_handler.py :: dump (130) :: dumping: data/imports/logement

## Loading into the Bronze layer
A SQLAlchemy engine is created to load all the datasets into the database.

In [11]:
from dotenv import dotenv_values
import sqlalchemy
from sqlalchemy import text

# prepare db client
vals = dotenv_values()

conn_str = "postgresql://{}:{}@{}:{}/{}".format(
    vals["PG_DB_USER"],
    vals["PG_DB_PWD"],
    vals["PG_DB_HOST"],
    vals["PG_DB_PORT"],
    vals["PG_DB_NAME"]
)

dbengine = sqlalchemy.create_engine(conn_str)
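
The connection string above is assembled with plain string formatting, which can produce an invalid URL if the password contains characters such as `@`, `/` or `:`. A possible hardening, sketched here with only the standard library, is to percent-encode the user and password before formatting. The helper name `build_pg_url` and the credentials are hypothetical.

```python
from urllib.parse import quote_plus


def build_pg_url(user, password, host, port, database):
    """Build a PostgreSQL URL, percent-encoding the user and password."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )


# Hypothetical credentials, for illustration only
url = build_pg_url("odis", "p@ss/word", "localhost", 5432, "odis")

print(url)
```

The resulting string can be passed to `sqlalchemy.create_engine` exactly like `conn_str`.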

In [12]:
# insert all DataFrames into the bronze schema
# make the final table name lowercase to avoid quoting issues in PostgreSQL

for name, dataframe in dataframes.items():

    subtable_name = f"{model.table_name}_{name.lower()}"
    query_str = f"DROP TABLE IF EXISTS bronze.{subtable_name} CASCADE"

    # dropping existing table with cascade
    with dbengine.connect() as con:
        print(f"Dropping if exists: {subtable_name}")
        result = con.execute(text(query_str))
        con.commit()

    print(f"Inserting DataFrame {subtable_name}")
    dataframe.to_sql(
        name = subtable_name,
        con = dbengine,
        schema = 'bronze',
        index = True,
        if_exists = 'replace'
    )


Dropping if exists: logement_social_logements_sociaux_region
Inserting DataFrame logement_social_logements_sociaux_region
Dropping if exists: logement_social_logements_sociaux_departement
Inserting DataFrame logement_social_logements_sociaux_departement
Dropping if exists: logement_social_logements_sociaux_epci
Inserting DataFrame logement_social_logements_sociaux_epci
Dropping if exists: logement_social_logements_sociaux_communes
Inserting DataFrame logement_social_logements_sociaux_communes
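
The lowercasing of the table names matters because PostgreSQL folds unquoted identifiers to lowercase: a table created with uppercase letters in its name would have to be double-quoted in every later query. The naming rule used in the loop can be isolated as a small helper, sketched below; the helper name is hypothetical and the base table name is taken from the log output above.

```python
def bronze_table_name(base_table_name, sheet_name):
    """Return the lowercase Bronze table name for one Excel sheet."""
    return f"{base_table_name}_{sheet_name.lower()}"


sheets = ["REGION", "DEPARTEMENT", "EPCI", "COMMUNES"]
table_names = [
    bronze_table_name("logement_social_logements_sociaux", sheet)
    for sheet in sheets
]

for table_name in table_names:
    print(f"bronze.{table_name}")
```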
