# 💻 Step 0: Setup and Libraries

üá¨üáß

- Objective: The purpose of this notebook is to cross-reference real estate transaction data (DVF) with demographic and income data (INSEE). By doing so, we aim to identify geographical areas where the market is affordable enough to implement modular housing projects.
- The core hypothesis is that modular ownership (Tiny Houses/Container homes) should significantly increase the household's "left-to-live" income (disposable income) compared to traditional rentals. To ensure financial viability, we will integrate standard banking metrics, such as the 30% debt-to-income ratio.

üá´üá∑ 

- Objectif : L'objectif de ce notebook est de croiser les donn√©es de transactions immobili√®res (DVF) avec les donn√©es d√©mographiques et de revenus (INSEE). Ce faisant, nous visons √† identifier les zones g√©ographiques o√π le march√© est suffisamment abordable pour mettre en ≈ìuvre des projets d'habitat modulaire.
- L'hypoth√®se centrale est que l'accession modulaire (Tiny Houses/Containers) doit augmenter de mani√®re significative le "reste √† vivre" (revenu disponible) des m√©nages par rapport √† la location classique. Pour garantir la viabilit√© financi√®re, nous int√©grerons les m√©triques bancaires standards, comme le taux d'endettement de 30 %
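The two metrics named above can be sketched in a few lines. This is a minimal illustration, not the notebook's scoring method: the function names and all household figures are assumptions chosen for the example.

```python
def max_monthly_payment(net_monthly_income: float, debt_ratio: float = 0.30) -> float:
    """Largest monthly housing payment allowed under the debt-to-income cap."""
    return net_monthly_income * debt_ratio


def residual_income(net_monthly_income: float, housing_cost: float) -> float:
    """Income left after paying for housing (the 'reste à vivre')."""
    return net_monthly_income - housing_cost


# Hypothetical household: figures are illustrative, not taken from INSEE data.
income = 1800.0          # net monthly income in euros (assumed)
rent = 650.0             # traditional rental cost (assumed)
modular_payment = 450.0  # loan repayment for a modular home (assumed)

cap = max_monthly_payment(income)
is_bankable = modular_payment <= cap
gain = residual_income(income, modular_payment) - residual_income(income, rent)

print(f"Payment cap at 30%: {cap:.2f} EUR")
print(f"Loan within cap: {is_bankable}")
print(f"Extra residual income vs renting: {gain:.2f} EUR")
```

With these assumed figures the repayment stays under the cap and the household keeps 200 EUR more per month than when renting.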

In [42]:
# Import core libraries for data manipulation and math
import pandas as pd
import numpy as np

pd.options.display.float_format = '{:,.2f}'.format

# Configure pandas to display all columns for deep inspection
pd.set_option('display.max_columns', None)

# Confirm that the environment is ready
print("Setup Complete: Ready for Market Analysis & Data Engineering.")

Setup Complete: Ready for Market Analysis & Data Engineering.


# 💻 Step 1: Loading Large Datasets (DVF)
Strategy: the DVF file is large (several GB), so we pass the `usecols` parameter to `pd.read_csv` to load only the variables relevant to our modular housing study. This saves RAM and lowers the risk of the IDE crashing.

![alt text](image.png)

In [43]:
# 1. Define the target columns based on business needs (price, location, type,
# surface) so that only these are loaded into memory
target_columns = [
    'Date mutation', 'Nature mutation', 'Valeur fonciere', 
    'Commune', 'Code departement', 'Code commune', 
    'Nombre de lots', 'Code type local', 'Type local', 'Surface reelle bati', 
    'Nature culture', 'Surface terrain'
]

# Load the DVF file (using '|' separator as it's a .txt file from Government)
df_dvf = pd.read_csv(
    r'C:\Users\Utilisateur\IronHack\final_project\ValeursFoncieres-2025-S1.txt',
    # nrows=100, 
    usecols=target_columns,
    sep='|', 
    low_memory=False
)

# Display the loaded columns as an inventory check
print(df_dvf.columns.tolist())


['Date mutation', 'Nature mutation', 'Valeur fonciere', 'Commune', 'Code departement', 'Code commune', 'Nombre de lots', 'Code type local', 'Type local', 'Surface reelle bati', 'Nature culture', 'Surface terrain']


In [44]:
def clean_column_names(df):
    """
    Standardize DataFrame column names: 1. Convert to string / 2. Strip leading/trailing whitespaces
    3. Replace internal spaces with underscores / 4. Convert to lowercase
    """
    df.columns = (
        df.columns.astype(str)
                  .str.strip()
                  .str.replace(' ', '_', regex=False)
                  .str.lower()
    )

    return df

clean_column_names(df_dvf)
print(df_dvf.columns.tolist())

['date_mutation', 'nature_mutation', 'valeur_fonciere', 'commune', 'code_departement', 'code_commune', 'nombre_de_lots', 'code_type_local', 'type_local', 'surface_reelle_bati', 'nature_culture', 'surface_terrain']


In [45]:
def check_data_quality(df):
    """Print the DataFrame shape and return per-column quality statistics."""
    print(f"DataFrame format (rows, cols): {df.shape}\n")

    stats = pd.DataFrame({
        'Type': df.dtypes,
        'Manquants': df.isna().sum(),                               # missing values
        '% Manquants': (df.isna().sum() / len(df) * 100).round(2),  # share missing
        'Uniques': df.nunique(),
        # Number of repeated values within each column
        'Doublons (per column)': df.apply(lambda x: x.duplicated().sum())
    })

    return stats

check_data_quality(df_dvf)

DataFrame format (rows, cols): (1387077, 12)



Unnamed: 0,Type,Manquants,% Manquants,Uniques,Doublons (per column)
date_mutation,object,0,0.0,173,1386904
nature_mutation,object,0,0.0,6,1387071
valeur_fonciere,object,13995,1.01,71189,1315887
commune,object,0,0.0,28603,1358474
code_departement,object,0,0.0,97,1386980
code_commune,int64,0,0.0,900,1386177
nombre_de_lots,int64,0,0.0,56,1387021
code_type_local,float64,572338,41.26,4,1387072
type_local,object,572338,41.26,4,1387072
surface_reelle_bati,float64,573187,41.32,2733,1384343


In [46]:
# Standardizing INSEE codes at the source
# 1. Ensure codes are strings to preserve leading zeros
# Dept: 2 digits (e.g. 1 -> 01) | Commune: 3 digits (e.g. 65 -> 065)
df_dvf['code_departement'] = df_dvf['code_departement'].astype(str).str.zfill(2)
df_dvf['code_commune'] = df_dvf['code_commune'].astype(str).str.zfill(3)

# 2. Concatenate them into a single insee_code
# Note: overseas departments have 3-digit codes (971-974), so this yields
# 6 characters there (e.g. '974022') instead of the official 5
df_dvf['insee_code'] = df_dvf['code_departement'] + df_dvf['code_commune']
# 3. Visual check: expect codes like '01001' or '64065'
print("✅ Insee codes standardized in main DVF.")
df_dvf

✅ Insee codes standardized in main DVF.


Unnamed: 0,date_mutation,nature_mutation,valeur_fonciere,commune,code_departement,code_commune,nombre_de_lots,code_type_local,type_local,surface_reelle_bati,nature_culture,surface_terrain,insee_code
0,07/01/2025,Vente,46800000,FARGES,01,158,0,,,,J,78.00,01158
1,07/01/2025,Vente,46800000,FARGES,01,158,0,1.00,Maison,111.00,S,133.00,01158
2,07/01/2025,Vente,46800000,FARGES,01,158,0,3.00,Dépendance,0.00,S,133.00,01158
3,06/01/2025,Vente,18000000,MONTANGES,01,257,0,,,,S,46.00,01257
4,06/01/2025,Vente,18000000,MONTANGES,01,257,0,,,,J,17.00,01257
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1387072,27/06/2025,Vente,55000000,PARIS 13,75,113,2,2.00,Appartement,61.00,,,75113
1387073,27/06/2025,Vente,55038610,PARIS 05,75,105,2,2.00,Appartement,47.00,,,75105
1387074,27/06/2025,Vente,55038610,PARIS 05,75,105,2,3.00,Dépendance,0.00,,,75105
1387075,25/06/2025,Vente,2441758000,PARIS 15,75,115,0,,,,,,75115
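Later outputs in this notebook show six-character codes such as `974022` for overseas communes, whereas INSEE commune codes have five characters. The sketch below is one possible way to build five-character codes everywhere. The helper `build_insee_code` is hypothetical, and it assumes that three-digit overseas departments pair with a two-digit commune suffix.

```python
import pandas as pd


def build_insee_code(dept: pd.Series, commune: pd.Series) -> pd.Series:
    """Concatenate department and commune codes into a 5-character code.

    Assumption: 2-character departments (01-95, 2A, 2B) take a 3-digit
    commune suffix, while 3-digit overseas departments take a 2-digit one.
    """
    dept = dept.astype(str).str.strip().str.zfill(2)
    commune = commune.astype(str).str.strip()
    is_overseas = dept.str.len() == 3
    # Keep the 3-digit padding for mainland rows, use 2 digits overseas
    suffix = commune.str.zfill(3).where(~is_overseas, commune.str.zfill(2))
    return dept + suffix


# Invented rows covering mainland, Corsica, Paris and an overseas department
demo = pd.DataFrame({
    "code_departement": ["1", "2A", "75", "974"],
    "code_commune": [158, 4, 113, 22],
})
demo["insee_code"] = build_insee_code(demo["code_departement"], demo["code_commune"])
print(demo)
```

It is worth checking the result against an official commune list before relying on it for joins.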


## 🏷️ 1.1 Data Categorization & Cleaning
---
***Target Zone Identification for Modular Housing***

To maximize affordability (target: land at €50/m² or less), this study isolates the "Nature culture" codes with the lowest acquisition costs that remain buildable or adaptable. The L (Landes, moorland) and BR (Bruyères, heath) segments are identified as key levers for social mobility, while S (Sols, ground) and J (Jardins, gardens) provide the baseline for urban integration.
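To make the selection rule concrete, here is a minimal sketch that combines the four target codes with the €50/m² ceiling. The rows are invented, and applying the price ceiling at row level is an assumption made for illustration (the notebook itself works with medians per commune later on).

```python
import pandas as pd

TARGET_CODES = ["S", "L", "J", "BR"]   # codes named in the text
MAX_PRICE_M2 = 50.0                    # affordability target in EUR per m2

# Invented rows for illustration only.
plots = pd.DataFrame({
    "nature_culture": ["S", "L", "T", "BR", "J"],
    "valeur_fonciere": [40000.0, 3000.0, 10000.0, 1200.0, 30000.0],
    "surface_terrain": [400.0, 1000.0, 5000.0, 600.0, 200.0],
})

plots["price_m2"] = plots["valeur_fonciere"] / plots["surface_terrain"]

# Keep rows that are both a target land type and under the price ceiling
affordable = plots[
    plots["nature_culture"].isin(TARGET_CODES) & (plots["price_m2"] <= MAX_PRICE_M2)
]
print(affordable)
```

Only the L and BR rows pass both tests here: the T row is cheap but not a target code, while the S and J rows exceed the ceiling.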

In [47]:
print(df_dvf['nature_culture'].value_counts())

nature_culture
S     454677
T     138926
P      76202
J      45998
BT     44334
L      38317
AG     36377
AB     30780
VI     15610
BR     14226
VE     10870
BS      7498
PA      7197
B       5148
E       3665
BP      3517
BF      2219
PP      1029
PC       697
PH       608
BM       557
CH       409
CA       326
LB       106
PE        68
TP        64
BO        28
Name: count, dtype: int64


In [48]:
# 1. Land-use codes targeted by the project
nature_cibles = ['S', 'L', 'J', 'BR']

# 2. Strip stray whitespace before matching, then build the filtered DataFrame
nature_clean = df_dvf['nature_culture'].str.strip()
df_dvf_taget = df_dvf[nature_clean.isin(nature_cibles)].copy()
df_dvf_taget['nature_culture'] = df_dvf_taget['nature_culture'].str.strip()

# 3. Report the share of each code in the filtered set
print("--- Analyse du Gisement Foncier Social ---")
stats_culture = df_dvf_taget['nature_culture'].value_counts()
for code, count in stats_culture.items():
    pct = (count / len(df_dvf_taget) * 100)
    print(f"Code {code}: {count} parcelles ({round(pct, 2)}%)")

print(f"\nTotal parcelles exploitables pour le projet : {len(df_dvf_taget)}")

--- Analyse du Gisement Foncier Social ---
Code S: 454677 parcelles (82.19%)
Code J: 45998 parcelles (8.31%)
Code L: 38317 parcelles (6.93%)
Code BR: 14226 parcelles (2.57%)

Total parcelles exploitables pour le projet : 553218


In [49]:
df_dvf_taget.info()

<class 'pandas.core.frame.DataFrame'>
Index: 553218 entries, 0 to 1386932
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   date_mutation        553218 non-null  object 
 1   nature_mutation      553218 non-null  object 
 2   valeur_fonciere      545700 non-null  object 
 3   commune              553218 non-null  object 
 4   code_departement     553218 non-null  object 
 5   code_commune         553218 non-null  object 
 6   nombre_de_lots       553218 non-null  int64  
 7   code_type_local      367383 non-null  float64
 8   type_local           367383 non-null  object 
 9   surface_reelle_bati  366740 non-null  float64
 10  nature_culture       553218 non-null  object 
 11  surface_terrain      553218 non-null  float64
 12  insee_code           553218 non-null  object 
dtypes: float64(3), int64(1), object(9)
memory usage: 59.1+ MB


In [50]:
print(df_dvf_taget['nature_mutation'].value_counts())

nature_mutation
Vente                                 546447
Echange                                 5191
Vente terrain à bâtir                    645
Vente en l'état futur d'achèvement       476
Adjudication                             370
Expropriation                             89
Name: count, dtype: int64


In [51]:
# Keep only genuine sale transactions
mutations_cibles = [
    'Vente', 
    "Vente en l'état futur d'achèvement", 
    'Vente terrain à bâtir'
]

# Filter the DataFrame down to the relevant transaction types
df_dvf_mutation = df_dvf_taget[df_dvf_taget['nature_mutation'].isin(mutations_cibles)]
# Summary of the filtering
print(f"Volume initial : {len(df_dvf_taget)}")
print(f"Volume après filtrage : {len(df_dvf_mutation)}")
print(f"Lignes supprimées : {len(df_dvf_taget) - len(df_dvf_mutation)}")
# Overview of the remaining transaction types
print("\nRépartition après filtrage :")
print(df_dvf_mutation['nature_mutation'].value_counts())

Volume initial : 553218
Volume après filtrage : 547568
Lignes supprimées : 5650

Répartition après filtrage :
nature_mutation
Vente                                 546447
Vente terrain à bâtir                    645
Vente en l'état futur d'achèvement       476
Name: count, dtype: int64


In [52]:
df_dvf_mutation

Unnamed: 0,date_mutation,nature_mutation,valeur_fonciere,commune,code_departement,code_commune,nombre_de_lots,code_type_local,type_local,surface_reelle_bati,nature_culture,surface_terrain,insee_code
0,07/01/2025,Vente,46800000,FARGES,01,158,0,,,,J,78.00,01158
1,07/01/2025,Vente,46800000,FARGES,01,158,0,1.00,Maison,111.00,S,133.00,01158
2,07/01/2025,Vente,46800000,FARGES,01,158,0,3.00,Dépendance,0.00,S,133.00,01158
3,06/01/2025,Vente,18000000,MONTANGES,01,257,0,,,,S,46.00,01257
4,06/01/2025,Vente,18000000,MONTANGES,01,257,0,,,,J,17.00,01257
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1386865,10/06/2025,Vente,1218904200,PARIS 13,75,113,0,,,,S,38.00,75113
1386867,10/06/2025,Vente,1218904200,PARIS 13,75,113,0,,,,S,120.00,75113
1386930,18/06/2025,Vente,147500000,PARIS 14,75,114,0,2.00,Appartement,42.00,S,83.00,75114
1386931,18/06/2025,Vente,147500000,PARIS 14,75,114,0,4.00,Local industriel. commercial ou assimilé,132.00,S,83.00,75114


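In [53]:
# Drop administrative columns that are not needed for the analysis.
# The names below are the raw DVF labels; since usecols already excluded them
# at load time, this step acts as a safeguard and changes nothing here.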
colonnes_a_supprimer = [
    'Identifiant de document', 'Reference document', 
    '1 Articles CGI', '2 Articles CGI', '3 Articles CGI', '4 Articles CGI', '5 Articles CGI', 
    'No Volume', 'No plan',
    '1er lot', 'Surface Carrez du 1er lot', 
    '2eme lot', 'Surface Carrez du 2eme lot', 
    '3eme lot', 'Surface Carrez du 3eme lot',
    '4eme lot', 'Surface Carrez du 4eme lot', 
    '5eme lot', 'Surface Carrez du 5eme lot', 'Nature culture speciale', 'Identifiant local', 'Prefixe de section', 'B/T/Q', 
    'No voie', 'Type de voie', 'Code voie', 'Voie', 'Code postal', 'Nombre pieces principales','Section','No disposition'
]

# Safe removal: errors='ignore' avoids a KeyError when a listed column is absent
df_immo = df_dvf_mutation.drop(columns=colonnes_a_supprimer, errors='ignore')
df_immo.head()

Unnamed: 0,date_mutation,nature_mutation,valeur_fonciere,commune,code_departement,code_commune,nombre_de_lots,code_type_local,type_local,surface_reelle_bati,nature_culture,surface_terrain,insee_code
0,07/01/2025,Vente,46800000,FARGES,1,158,0,,,,J,78.0,1158
1,07/01/2025,Vente,46800000,FARGES,1,158,0,1.0,Maison,111.0,S,133.0,1158
2,07/01/2025,Vente,46800000,FARGES,1,158,0,3.0,Dépendance,0.0,S,133.0,1158
3,06/01/2025,Vente,18000000,MONTANGES,1,257,0,,,,S,46.0,1257
4,06/01/2025,Vente,18000000,MONTANGES,1,257,0,,,,J,17.0,1257


In [54]:
df_immo.info()

<class 'pandas.core.frame.DataFrame'>
Index: 547568 entries, 0 to 1386932
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   date_mutation        547568 non-null  object 
 1   nature_mutation      547568 non-null  object 
 2   valeur_fonciere      540121 non-null  object 
 3   commune              547568 non-null  object 
 4   code_departement     547568 non-null  object 
 5   code_commune         547568 non-null  object 
 6   nombre_de_lots       547568 non-null  int64  
 7   code_type_local      366198 non-null  float64
 8   type_local           366198 non-null  object 
 9   surface_reelle_bati  365558 non-null  float64
 10  nature_culture       547568 non-null  object 
 11  surface_terrain      547568 non-null  float64
 12  insee_code           547568 non-null  object 
dtypes: float64(3), int64(1), object(9)
memory usage: 58.5+ MB


In [55]:
df_immo.columns.tolist()

['date_mutation',
 'nature_mutation',
 'valeur_fonciere',
 'commune',
 'code_departement',
 'code_commune',
 'nombre_de_lots',
 'code_type_local',
 'type_local',
 'surface_reelle_bati',
 'nature_culture',
 'surface_terrain',
 'insee_code']

## 🔧 1.2 Formatting & Structural Optimization
### 1.2.1 Formatting
**Objective:** To improve readability and data integrity, I reordered the columns to put key metrics first. I also cast all identifiers (INSEE, department) to strings to prevent any mathematical operations on geographical codes.

In [56]:
# üá¨üáß Step 1.4: Reordering columns and finalizing dtypes
# We want numbers as floats and identifiers/text as strings

# 1. Define the new order of columns
new_order = [
    'insee_code',
    'commune', 
    'valeur_fonciere',
    'nature_mutation', 
    'surface_terrain', 
    'type_local',
    'surface_reelle_bati',
    'nombre_de_lots',
    'nature_culture',
    'code_departement', 
    'code_commune'    
]

# 2. Reorder the DataFrame
df_immo = df_immo[new_order]
    # Replace commas by dots in the data (for text columns)
    # We use stack/unstack for a global replacement on the whole dataframe
df_immo = df_immo.replace(',', '.', regex=True)
# 3. Force string type for identifiers and labels
cols_to_str = ['insee_code', 'commune', 'nature_mutation', 'nature_culture', 'code_departement', 'code_commune']
df_immo[cols_to_str] = df_immo[cols_to_str].astype(str)
cols_to_float = ['valeur_fonciere', 'surface_terrain', 'surface_reelle_bati']
df_immo[cols_to_float] = df_immo[cols_to_float].astype(float)
# 4. Final check of types and organization
print("‚úÖ DataFrame successfully reorganized and types finalized.")
print(df_immo.dtypes)
display(df_immo.head())

‚úÖ DataFrame successfully reorganized and types finalized.
insee_code              object
commune                 object
valeur_fonciere        float64
nature_mutation         object
surface_terrain        float64
type_local              object
surface_reelle_bati    float64
nombre_de_lots           int64
nature_culture          object
code_departement        object
code_commune            object
dtype: object


Unnamed: 0,insee_code,commune,valeur_fonciere,nature_mutation,surface_terrain,type_local,surface_reelle_bati,nombre_de_lots,nature_culture,code_departement,code_commune
0,1158,FARGES,468000.0,Vente,78.0,,,0,J,1,158
1,1158,FARGES,468000.0,Vente,133.0,Maison,111.0,0,S,1,158
2,1158,FARGES,468000.0,Vente,133.0,Dépendance,0.0,0,S,1,158
3,1257,MONTANGES,180000.0,Vente,46.0,,,0,S,1,257
4,1257,MONTANGES,180000.0,Vente,17.0,,,0,J,1,257


### 1.2.2 Optimization and organization
---
**Objective:** 
Instead of removing data points, we categorize the land plots by size to address different social needs. 
- **Micro-Plots (< 50 m²):** Urban solutions for single young workers (Social & Mobility).
- **Standard Plots (50 m² to 5,000 m²):** Classical modular housing projects.
- **Large Estates (> 5,000 m²):** Potential for collective "eco-villages" or "Hameaux Légers" in rural areas.

The categories are stored under their French labels (`Micro-terrains`, `Terrains Standards`, `Grands Domaines`). The `valeur_fonciere` column was already converted from object to float in the previous step, which makes the price calculations possible.
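The thresholds above can be checked at their boundaries with a small helper. The function name `categorize_plot` is hypothetical. The logic mirrors the `np.select` call used in the next cell, including the `Inconnu` fallback for missing surfaces.

```python
import numpy as np
import pandas as pd


def categorize_plot(surface: pd.Series) -> pd.Series:
    """Label each plot by size using the thresholds from the text."""
    conditions = [
        surface < 50,
        (surface >= 50) & (surface <= 5000),
        surface > 5000,
    ]
    choices = ["Micro-terrains", "Terrains Standards", "Grands Domaines"]
    # Rows matching no condition (missing surface) fall back to 'Inconnu'.
    labels = np.select(conditions, choices, default="Inconnu")
    return pd.Series(labels, index=surface.index, name="categorie_terrain")


# Values placed just below, on, and just above each threshold
edge_cases = pd.Series([49.99, 50.0, 5000.0, 5000.01, np.nan])
print(categorize_plot(edge_cases).tolist())
```

Both 50 and 5000 land in the standard class, and a missing surface is labelled `Inconnu` rather than silently assigned to a size class.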

In [57]:
# Conditions defining each size category
conditions = [
    (df_immo['surface_terrain'] < 50),
    (df_immo['surface_terrain'] >= 50) & (df_immo['surface_terrain'] <= 5000),
    (df_immo['surface_terrain'] > 5000)
]

# Category names, then creation of the column
choices = ['Micro-terrains', 'Terrains Standards', 'Grands Domaines']

df_immo['categorie_terrain'] = np.select(conditions, choices, default='Inconnu')

# Display the first rows to verify
df_immo.head()

Unnamed: 0,insee_code,commune,valeur_fonciere,nature_mutation,surface_terrain,type_local,surface_reelle_bati,nombre_de_lots,nature_culture,code_departement,code_commune,categorie_terrain
0,1158,FARGES,468000.0,Vente,78.0,,,0,J,1,158,Terrains Standards
1,1158,FARGES,468000.0,Vente,133.0,Maison,111.0,0,S,1,158,Terrains Standards
2,1158,FARGES,468000.0,Vente,133.0,Dépendance,0.0,0,S,1,158,Terrains Standards
3,1257,MONTANGES,180000.0,Vente,46.0,,,0,S,1,257,Micro-terrains
4,1257,MONTANGES,180000.0,Vente,17.0,,,0,J,1,257,Micro-terrains


### 1.2.3 Statistical Robustness (MVP 1 & 2)
**Bare land identification and outlier neutralization.** To ensure the viability of the "Modular Housing" project, we create a binary indicator `is_bare_land` that isolates plots without existing buildings. To neutralize the impact of cadastral entry errors (extreme prices), we use the median when calculating accessibility scores by zone. This approach provides a conservative and realistic view of debt capacity for low-income workers. The analysis is then refined by calculating the median price per m² for each category (Micro-terrains, Terrains Standards, Grands Domaines) within each INSEE code. This granularity isolates the real land cost accessible to low-income households and avoids the statistical bias of large agricultural or industrial plots.
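A short numeric example shows why the median is preferred here and how the dispersion index used below (interquartile range divided by the median) behaves. The price series is invented; the last value imitates a data entry error.

```python
import pandas as pd

# Invented prices per m2 for one commune; the last value mimics an entry error.
prices = pd.Series([40.0, 45.0, 50.0, 55.0, 60.0, 5000.0])

mean_price = prices.mean()
median_price = prices.median()

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
dispersion_index = (q3 - q1) / median_price  # IQR relative to the median

print(f"mean={mean_price:.2f}  median={median_price:.2f}")
print(f"dispersion index={dispersion_index:.2f}")
```

The single extreme value pulls the mean to 875 while the median stays at 52.5, which is why the median gives the more realistic picture of what a household would actually pay.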

In [58]:
# --- STEP A: FEATURE ENGINEERING (Mandatory before aggregation) ---

# 1. Create the 'is_bare_land' flag: 1 if no building, 0 otherwise
df_immo['is_bare_land'] = (df_immo['surface_reelle_bati'].fillna(0) == 0).astype(int)

# 2. Keep plots with a positive surface (avoids division by zero), then compute price_m2
df_immo = df_immo[df_immo['surface_terrain'] > 0].copy()
df_immo['price_m2'] = df_immo['valeur_fonciere'] / df_immo['surface_terrain']

# 3. Global Outlier Cleaning (1% - 99%)
low, high = df_immo['price_m2'].quantile([0.01, 0.99])
df_robust = df_immo[(df_immo['price_m2'] >= low) & (df_immo['price_m2'] <= high)].copy()

# --- STEP B: ADVANCED AGGREGATION ---

# 4. Grouping with Quartiles and is_bare_land
df_stats = df_robust.groupby(['insee_code', 'categorie_terrain']).agg(
    commune_name=('commune', 'first'),
    median_price_m2=('price_m2', 'median'),
    q1_price_m2=('price_m2', lambda x: x.quantile(0.25)),
    q3_price_m2=('price_m2', lambda x: x.quantile(0.75)),
    nb_transactions=('price_m2', 'count'),
    perc_bare_land=('is_bare_land', 'mean')
).reset_index()

# --- STEP C: RELIABILITY FILTERING ---

# 5. Apply the threshold (n >= 5) so that each median rests on enough data points
df_stats_reliable = df_stats[df_stats['nb_transactions'] >= 5].copy()

# 6. Dispersion index (IQR / median)
# Low index = consistent prices / High index = volatile market
df_stats_reliable['dispersion_index'] = (df_stats_reliable['q3_price_m2'] - df_stats_reliable['q1_price_m2']) / df_stats_reliable['median_price_m2']

# Audit check
print("✅ Audit Terminé.")
print(f"Segments validés (n>=5) : {len(df_stats_reliable)} sur {len(df_stats)}")
print(df_stats_reliable[['insee_code', 'categorie_terrain', 'median_price_m2', 'nb_transactions']].head())

✅ Audit Terminé.
Segments valid√©s (n>=5) : 21999 sur 45831
  insee_code   categorie_terrain  median_price_m2  nb_transactions
0      01001  Terrains Standards           534.52                7
2      01002  Terrains Standards           580.67               16
3      01004      Micro-terrains         3,426.75                8
4      01004  Terrains Standards           621.41              118
6      01005  Terrains Standards           311.72                9


In [59]:
df_stats_reliable = df_stats_reliable.round(2)


In [60]:
df_stats_reliable

Unnamed: 0,insee_code,categorie_terrain,commune_name,median_price_m2,q1_price_m2,q3_price_m2,nb_transactions,perc_bare_land,dispersion_index
0,01001,Terrains Standards,L'ABERGEMENT-CLEMENCIAT,534.52,429.00,580.00,7,0.43,0.28
2,01002,Terrains Standards,ABERGEMENT-DE-VAREY (L ),580.67,266.46,1506.10,16,0.81,2.13
3,01004,Micro-terrains,AMBERIEU-EN-BUGEY,3426.75,1784.36,8454.55,8,0.75,1.95
4,01004,Terrains Standards,AMBERIEU-EN-BUGEY,621.41,333.33,1178.80,118,0.61,1.36
6,01005,Terrains Standards,AMBERIEUX-EN-DOMBES,311.72,244.37,321.02,9,0.44,0.25
...,...,...,...,...,...,...,...,...,...
45823,974022,Micro-terrains,LE TAMPON,31847.83,18430.00,41611.70,8,1.00,0.73
45824,974022,Terrains Standards,LE TAMPON,443.12,264.90,672.69,290,0.32,0.92
45827,974023,Terrains Standards,LES TROIS BASSINS,425.09,237.16,644.03,22,0.59,0.96
45828,974024,Grands Domaines,CILAOS,1.00,1.00,1.00,5,1.00,0.00


In [61]:
df_stats_reliable.columns.tolist()

['insee_code',
 'categorie_terrain',
 'commune_name',
 'median_price_m2',
 'q1_price_m2',
 'q3_price_m2',
 'nb_transactions',
 'perc_bare_land',
 'dispersion_index']

### 1.2.4 Consolidation of Multi-Parcel Mutations

**Consolidation of multi-parcel mutations and flow management.** To prevent artificial inflation of transaction volumes, we group rows by mutation (sale). The surfaces of parcels belonging to the same sale are summed to reflect the actual land unit. Rows with missing values in the pivot variables (price, area) are discarded to protect the integrity of the median price calculation.
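The consolidation can be sketched on a handful of invented rows. Two points in this sketch are assumptions rather than the notebook's method: the sale is identified by commune, date, and price taken together, and the size category is recomputed from the summed surface instead of being taken from the first parcel.

```python
import numpy as np
import pandas as pd

# Invented rows: the first three belong to one sale split across parcels.
rows = pd.DataFrame({
    "insee_code": ["01158", "01158", "01158", "01257"],
    "date_mutation": ["07/01/2025", "07/01/2025", "07/01/2025", "06/01/2025"],
    "valeur_fonciere": [90000.0, 90000.0, 90000.0, 18000.0],
    "surface_terrain": [30.0, 40.0, 20.0, 46.0],
})

# Assumed proxy for a sale: same commune, same date, same total price.
sale_keys = ["insee_code", "date_mutation", "valeur_fonciere"]
sales = rows.groupby(sale_keys, as_index=False).agg(
    surface_terrain=("surface_terrain", "sum"),
    nb_rows=("surface_terrain", "size"),
)

# Recompute the size category from the summed surface of each sale.
total = sales["surface_terrain"]
sales["categorie_terrain"] = np.select(
    [total < 50, total <= 5000],
    ["Micro-terrains", "Terrains Standards"],
    default="Grands Domaines",
)
sales["price_m2"] = sales["valeur_fonciere"] / sales["surface_terrain"]
print(sales)
```

The three 01158 parcels are each under 50 m², yet the sale as a whole covers 90 m² and belongs to the standard class, so keeping the first parcel's label would misclassify it.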

In [62]:
# 1. Build a proxy mutation key (no unique transaction ID is available here)
# The key combines the commune code and the sale price
df_immo['id_mutation'] = df_immo['insee_code'] + "_" + df_immo['valeur_fonciere'].astype(str)

# 2. Aggregation
# Sum the surfaces and keep a representative value for the other columns
df_grouped = df_immo.groupby(['insee_code', 'commune', 'valeur_fonciere', 'nature_mutation']).agg({
    'surface_terrain': 'sum',
    'surface_reelle_bati': 'sum',
    'nature_culture': lambda x: ', '.join(set(x.astype(str))),
    'categorie_terrain': 'first' # category of the first parcel in the group
}).reset_index()

# 3. Handling missing values after grouping
# Final drop of rows that still lack critical information
df_grouped = df_grouped.dropna(subset=['valeur_fonciere', 'surface_terrain'])

# 4. Global rounding
df_grouped = df_grouped.round(2)

print(f"✅ Groupement terminé : {len(df_immo)} lignes réduites à {len(df_grouped)} mutations réelles.")

✅ Groupement terminé : 547558 lignes réduites à 218658 mutations réelles.


In [63]:
# 1. PRE-CLEANING AND MISSING VALUES
# Strategy: drop rows without price/surface, fill other gaps with "Inconnue"
df_immo = df_immo.dropna(subset=['valeur_fonciere', 'surface_terrain']).copy()
df_immo['nature_culture'] = df_immo['nature_culture'].fillna('Inconnue')

# 2. GROUPING BY MUTATION (handles multi-parcel duplicates)
# Rows sharing the same commune and price are treated as one sale; surfaces are summed
df_mutations = df_immo.groupby(['insee_code', 'commune', 'valeur_fonciere', 'code_departement']).agg({
    'surface_terrain': 'sum',
    'surface_reelle_bati': 'sum',
    'categorie_terrain': 'first',
    'nature_culture': lambda x: ', '.join(set(x.astype(str)))
}).reset_index()

# 3. TARGET VARIABLES
df_mutations['is_bare_land'] = (df_mutations['surface_reelle_bati'].fillna(0) == 0).astype(int)
df_mutations['price_m2'] = df_mutations['valeur_fonciere'] / df_mutations['surface_terrain']

# 4. OUTLIER FILTERING (1% - 99%)
low, high = df_mutations['price_m2'].quantile([0.01, 0.99])
df_robust = df_mutations[(df_mutations['price_m2'] >= low) & (df_mutations['price_m2'] <= high)].copy()

# 5. BUILD DF_GROUPED (master dataset with all required columns)
df_grouped = df_robust.groupby(['insee_code', 'categorie_terrain']).agg(
    commune_name=('commune', 'first'),
    median_price_m2=('price_m2', 'median'),
    q1_price_m2=('price_m2', lambda x: x.quantile(0.25)),
    q3_price_m2=('price_m2', lambda x: x.quantile(0.75)),
    nb_transactions=('price_m2', 'count'),
    perc_bare_land=('is_bare_land', 'mean')
).reset_index()

# 6. CONFIDENCE THRESHOLD (n >= 5) AND DISPERSION INDEX
df_grouped = df_grouped[df_grouped['nb_transactions'] >= 5].copy()
df_grouped['dispersion_index'] = (df_grouped['q3_price_m2'] - df_grouped['q1_price_m2']) / df_grouped['median_price_m2']

# 7. FINAL ROUNDING AND CLEANUP
df_grouped = df_grouped.round(2)
df_grouped = df_grouped.replace([np.inf, -np.inf], np.nan)

# Audit check
print("✅ df_grouped est prêt avec toutes les colonnes demandées.")
print(df_grouped.columns.tolist())
display(df_grouped.head())

✅ df_grouped est prêt avec toutes les colonnes demandées.
['insee_code', 'categorie_terrain', 'commune_name', 'median_price_m2', 'q1_price_m2', 'q3_price_m2', 'nb_transactions', 'perc_bare_land', 'dispersion_index']


Unnamed: 0,insee_code,categorie_terrain,commune_name,median_price_m2,q1_price_m2,q3_price_m2,nb_transactions,perc_bare_land,dispersion_index
4,1004,Terrains Standards,AMBERIEU-EN-BUGEY,244.72,80.31,530.99,42,0.21,1.84
5,1005,Terrains Standards,AMBERIEUX-EN-DOMBES,160.51,103.91,244.37,5,0.0,0.88
9,1007,Terrains Standards,AMBRONAY,193.32,119.8,259.88,19,0.32,0.72
13,1010,Terrains Standards,ANGLEFORT,199.02,112.5,615.0,9,0.56,2.52
17,1014,Terrains Standards,ARBENT,96.18,27.63,106.87,9,0.22,0.82


In [64]:
df_grouped.columns.tolist()

['insee_code',
 'categorie_terrain',
 'commune_name',
 'median_price_m2',
 'q1_price_m2',
 'q3_price_m2',
 'nb_transactions',
 'perc_bare_land',
 'dispersion_index']

In [65]:
check_data_quality(df_grouped)

DataFrame format (rows, cols): (11450, 9)



Unnamed: 0,Type,Manquants,% Manquants,Uniques,Doublons (per column)
insee_code,object,0,0.0,10894,556
categorie_terrain,object,0,0.0,3,11447
commune_name,object,0,0.0,10705,745
median_price_m2,float64,0,0.0,9976,1474
q1_price_m2,float64,0,0.0,9294,2156
q3_price_m2,float64,0,0.0,10541,909
nb_transactions,int64,0,0.0,156,11294
perc_bare_land,float64,0,0.0,83,11367
dispersion_index,float64,0,0.0,823,10627


# 💻 Step 2: Loading BPE
## 🏥 2.1 Services & Accessibility (BPE)
---
**Objective:** Load the Base Permanente des Équipements (BPE, the permanent database of facilities) to evaluate neighborhood attractiveness. 
We will aggregate individual facility records to calculate the total number of services available per municipality or IRIS.
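The aggregation described here could look like the sketch below. It uses the BPE column names shown in this notebook's output, but the rows are invented, and treating `OBS_VALUE` as a count of facilities to be summed is an assumption based on the `FACILITIES` measure label.

```python
import pandas as pd

# Invented extract using the BPE column names shown in this notebook.
bpe = pd.DataFrame({
    "GEO": ["010010000", "010010000", "010040101", "010040102", "97617_IND"],
    "FACILITY_DOM": ["A", "F", "A", "A", "F"],
    "OBS_VALUE": [1, 3, 2, 4, 1],
})

# The first five characters of the IRIS identifier give the commune code.
bpe["insee_code"] = bpe["GEO"].str[:5]

# Total number of facilities per commune, across all IRIS and facility types
services_per_commune = (
    bpe.groupby("insee_code", as_index=False)["OBS_VALUE"]
       .sum()
       .rename(columns={"OBS_VALUE": "nb_services"})
)
print(services_per_commune)
```

Grouping on `FACILITY_DOM` as well would give a breakdown by service domain instead of a single total.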

In [66]:
# üá¨üáß Step 2.1: Loading BPE data (Neighborhood services)

# Load the file - We specify the dtype for 'DEPCOM' (Municipality code) to match 'insee_code'
df_bpe = pd.read_csv(
    'BPE_iris_geo_2024.csv', 
    sep=';', # Or ';' depending on your file, check it if it fails
    dtype={'DEPCOM': str, 'DCIRIS': str}, 
    low_memory=False
)
df_bpe

Unnamed: 0,TIME_PERIOD,GEO,FACILITY_TYPE,FACILITY_DOM,FACILITY_SDOM,OBS_VALUE,BPE_MEASURE,GEO_OBJECT
0,2024,010010000,A129,A,A1,1,FACILITIES,IRIS
1,2024,010010000,A401,A,A4,3,FACILITIES,IRIS
2,2024,010010000,A402,A,A4,1,FACILITIES,IRIS
3,2024,010010000,A403,A,A4,1,FACILITIES,IRIS
4,2024,010010000,A404,A,A4,3,FACILITIES,IRIS
...,...,...,...,...,...,...,...,...
1233147,2024,97617_IND,F118,F,F1,1,FACILITIES,IRIS
1233148,2024,97617_IND,F121,F,F1,3,FACILITIES,IRIS
1233149,2024,97617_IND,F124,F,F1,1,FACILITIES,IRIS
1233150,2024,97617_IND,F307,F,F3,1,FACILITIES,IRIS


In [67]:
# üá¨üáß Let's inspect the first 2 rows to see the column names and content

print("List of columns in BPE:")
print(df_bpe.columns.tolist())

print("\nFirst row sample:")
display(df_bpe.head(2))

List of columns in BPE:
['TIME_PERIOD', 'GEO', 'FACILITY_TYPE', 'FACILITY_DOM', 'FACILITY_SDOM', 'OBS_VALUE', 'BPE_MEASURE', 'GEO_OBJECT']

First row sample:


Unnamed: 0,TIME_PERIOD,GEO,FACILITY_TYPE,FACILITY_DOM,FACILITY_SDOM,OBS_VALUE,BPE_MEASURE,GEO_OBJECT
0,2024,010010000,A129,A,A1,1,FACILITIES,IRIS
1,2024,010010000,A401,A,A4,3,FACILITIES,IRIS


### 🔄 2.1.1 BPE Column Mapping / Mapping des colonnes BPE
---
🇬🇧 **Adjustment:** The BPE dataset uses international labels (GEO, FACILITY_TYPE). I mapped them to our project's standard (`iris_code`, `TYPEQU`) and extracted the `insee_code` from the first 5 characters of the geographic identifier.

🇫🇷 **Ajustement :** Le dataset BPE utilise des labels internationaux (GEO, FACILITY_TYPE). Je les ai renommés selon nos standards (`iris_code`, `TYPEQU`) et j'ai extrait l'`insee_code` des 5 premiers caractères de l'identifiant géographique.

In [68]:
# üá¨üáß Step 2.1: Harmonizing BPE columns (Modern/International format)

# 1. Rename columns to match our standard
df_bpe = df_bpe.rename(columns={
    'GEO': 'iris_code',
    'FACILITY_TYPE': 'TYPEQU'
})

# 2. Extract insee_code from iris_code (the first 5 digits)
# We make sure it's a string first
df_bpe['insee_code'] = df_bpe['iris_code'].astype(str).str[:5]

# 3. Final check
print("✅ BPE mapping complete.")
print(f"Columns now available: {df_bpe.columns.tolist()}")
display(df_bpe[['insee_code', 'iris_code', 'TYPEQU', 'OBS_VALUE']].head())

✅ BPE mapping complete.
Columns now available: ['TIME_PERIOD', 'iris_code', 'TYPEQU', 'FACILITY_DOM', 'FACILITY_SDOM', 'OBS_VALUE', 'BPE_MEASURE', 'GEO_OBJECT', 'insee_code']


Unnamed: 0,insee_code,iris_code,TYPEQU,OBS_VALUE
0,01001,010010000,A129,1
1,01001,010010000,A401,3
2,01001,010010000,A402,1
3,01001,010010000,A403,1
4,01001,010010000,A404,3


In [69]:
df_bpe['iris_code'].value_counts()

iris_code
97611_IND    140
371320000    122
151870102    121
171970000    120
540990000    119
            ... 
450740000      1
080770000      1
590090202      1
590100000      1
974230103      1
Name: count, Length: 51166, dtype: int64

### 📊 2.1.2 Data Integrity Audit / Audit d'intégrité des données
---
🇬🇧 **Objective:** Before cleaning, I perform a global audit of unique values to identify anomalies (like the `_IND` suffix) or unexpected data types across all columns.

🇫🇷 **Objectif :** Avant le nettoyage, j'effectue un audit global des valeurs uniques pour identifier les anomalies (comme le suffixe `_IND`) ou des types de données inattendus sur l'ensemble des colonnes.

In [70]:
# Audit each column: dtype, number of unique values, number of missing values, and a sample of values
audit_bpe = pd.DataFrame({
    'Type': df_bpe.dtypes,
    'Valeurs Uniques': df_bpe.nunique(),
    'Valeurs Manquantes': df_bpe.isnull().sum(),
    'Echantillon': [df_bpe[col].unique()[:3] for col in df_bpe.columns]  # first 3 unique values
})

display(audit_bpe)

Unnamed: 0,Type,Valeurs Uniques,Valeurs Manquantes,Echantillon
TIME_PERIOD,int64,1,0,[2024]
iris_code,object,51166,0,"[010010000, 010020000, 010040101]"
TYPEQU,object,229,0,"[A129, A401, A402]"
FACILITY_DOM,object,7,0,"[A, C, F]"
FACILITY_SDOM,object,27,0,"[A1, A4, A5]"
OBS_VALUE,int64,167,0,"[1, 3, 7]"
BPE_MEASURE,object,1,0,[FACILITIES]
GEO_OBJECT,object,1,0,[IRIS]
insee_code,object,34977,0,"[01001, 01002, 01004]"


In [71]:
df_bpe['TYPEQU'].value_counts()


TYPEQU
A129    34975
A504    33023
A404    32375
A403    31733
A402    31140
        ...  
A125       47
A136       43
A105       35
D105       28
C505       19
Name: count, Length: 229, dtype: int64

## 📖 2.2 Equipment Labeling / Libellé des équipements
---
🇬🇧 **Objective:** Use the HTML dictionary to translate equipment codes (e.g., A129) into human-readable labels (e.g., MAIRIE, the town hall). This is crucial for the final business analysis.

🇫🇷 **Objectif :** Utiliser le dictionnaire HTML pour traduire les codes d'équipements (ex : A129) en libellés compréhensibles (ex : MAIRIE). C'est une étape cruciale pour l'analyse métier finale.

In [72]:
# üá¨üáß Step 2.2: Mapping codes to real names using the HTML file
# 1. Read the HTML file (Pandas will find all <table> tags)
try:
    tables = pd.read_html('BPE_liste_hierarchisee_TYPEQU_2024.html')
    
    # Usually, the main mapping table is the first or second one
    # Let's assume it's the one containing 'Type d'√©quipement'
    df_labels = tables[0] 
    
    # 2. Rename columns to match our BPE data
    # We look for the column that contains the codes (A101, etc.)
    # Based on the file, it's often the first column
    df_labels = df_labels.rename(columns={
        df_labels.columns[0]: 'TYPEQU',
        df_labels.columns[1]: 'Equipment_Name'
    })

    # 3. Merge the labels with our main BPE dataframe
    df_bpe = pd.merge(df_bpe, df_labels[['TYPEQU', 'Equipment_Name']], on='TYPEQU', how='left')

    print("‚úÖ Labels successfully merged.")
    display(df_bpe[['insee_code', 'TYPEQU', 'Equipment_Name']].head())

except Exception as e:
    print(f"‚ùå Error reading HTML: {e}")

✅ Labels successfully merged.


Unnamed: 0,insee_code,TYPEQU,Equipment_Name
0,01001,A129,MAIRIE
1,01001,A401,MAÇON
2,01001,A402,PLÂTRIER PEINTRE
3,01001,A403,MENUISIER CHARPENTIER SERRURIER
4,01001,A404,PLOMBIER COUVREUR CHAUFFAGISTE


In [73]:
df_labels.describe()

Unnamed: 0,TYPEQU,Equipment_Name,Description
count,263,263,263
unique,263,262,263
top,G105,TOURISME,Il s’agit des résidences hôtelières de tourism...
freq,1,2,1


In [74]:
df_labels['TYPEQU'].value_counts()


TYPEQU
G105    1
A       1
A1      1
A101    1
A104    1
       ..
A125    1
A124    1
A122    1
A121    1
A120    1
Name: count, Length: 263, dtype: int64

## 🏷️ 2.3 User-Centric Service Mapping / Mapping des Services Orienté Usager
---
🇬🇧 **Objective:** Translate raw INSEE codes into business-relevant categories. Following our strategic scoping, Education is limited to Primary/Elementary levels (C1, C2) as older students are mobile. Taxis are excluded from Transport to focus on affordable public mobility.

🇫🇷 **Objectif :** Traduire les codes bruts de l'INSEE en catégories métier pertinentes. Selon notre cadrage, l'Éducation est limitée au Primaire/Maternelle (C1, C2) car les plus grands sont mobiles. Les taxis sont exclus des Transports pour se concentrer sur la mobilité publique abordable.

In [79]:
def map_bpe_fidele(code):
    # 1. Health (Santé): explicit list of D2xx and D3xx codes
    if code.startswith(('D201', 'D204', 'D231', 'D221', 'D202', 'D301', 'D307')):
        return 'Sante_Health'

    # 2. Education: prefixes C1, C2 and C3
    # (C3 is matched here as well, which is broader than the C1/C2 scope stated above)
    if code.startswith(('C1', 'C2', 'C3')):
        return 'Education'

    # 3. Employment (Emploi)
    if code in ['A122', 'A503']:
        return 'Emploi_Employment'

    # 4. Transport (E101, E107, E108, E109)
    if code in ['E101', 'E107', 'E108', 'E109']:
        return 'Transport'

    # 5a. Daily commerce (B201 to B210); the string comparison is lexicographic
    if code.startswith('B2') and code <= 'B210':
        return 'Commerce_Proximite'

    # 5b. Large retail stores
    if code in ['B104', 'B105']:
        return 'Grandes_Surfaces'

    # 6. Admin services: A1xx and A2xx (A122 is already captured by rule 3)
    if code.startswith(('A1', 'A2')):
        return 'Admin_Services'

    # 7. Social services
    if code.startswith('D7'):
        return 'Social_Services'

    # 8. Practical services (A503 is already captured by rule 3)
    if code.startswith(('A3', 'A4', 'A5')):
        return 'Practical_Services'

    return 'Divers_Other'

# 2. Apply the mapping to the dataframe
df_bpe['categorie'] = df_bpe['TYPEQU'].apply(map_bpe_fidele)

# 3. Quick check of the new categories
print("✅ Categorization complete. New 'categorie' column added.")
print(df_bpe['categorie'].value_counts())

✅ Categorization complete. New 'categorie' column added.
categorie
Divers_Other          601305
Practical_Services    355619
Admin_Services         81039
Commerce_Proximite     76857
Education              54490
Transport              23004
Sante_Health           15677
Grandes_Surfaces       11150
Social_Services         8467
Emploi_Employment       5544
Name: count, dtype: int64


## 📈 2.4 Final Score Calculation / Calcul du Score Final

🇬🇧 **Objective:** Aggregate facility counts per municipality (`insee_code`) and calculate the weighted accessibility score. Each category count is capped at a threshold and rescaled to 0-100 so that categories are comparable. Weights prioritize Health (30%) and Education (25%) for social impact.

🇫🇷 **Objectif :** Agréger les équipements par commune (`insee_code`) et calculer le score d'accessibilité pondéré. Chaque catégorie est plafonnée à un seuil puis ramenée sur une échelle 0-100 pour rendre les catégories comparables. Les poids priorisent la Santé (30%) et l'Éducation (25%) pour l'impact social.

In [81]:

# 1. Updated mapping function with the project-specific reference codes
def map_bpe_fidele_v2(code):
    # Health: local everyday care (project-specific list)
    if code in ['D201', 'D204', 'D231', 'D221', 'D202', 'D301', 'D307']:
        return 'Sante_Health'
    # Education: full cycle (nursery, primary + lower secondary)
    if code in ['C101', 'C104', 'C201']:
        return 'Education'
    if code in ['A122', 'A503']: return 'Emploi_Employment'
    if code in ['E107', 'E108', 'E109']: return 'Transport'
    if code.startswith('B2') and code <= 'B210': return 'Commerce_Proximite'
    if code in ['B104', 'B105']: return 'Grandes_Surfaces'
    if code.startswith(('A1', 'A2')): return 'Admin_Services'
    if code.startswith('D7'): return 'Social_Services'
    if code.startswith(('A3', 'A4', 'A5')): return 'Practical_Services'
    return 'Divers_Other'

# 2. Apply the mapping and derive the INSEE municipality code
df_bpe['categorie'] = df_bpe['TYPEQU'].apply(map_bpe_fidele_v2)
# The BPE file has no 'DEPCOM' column: the municipality code is taken from
# the first 5 characters of the IRIS code, as in step 2.1.1
df_bpe['insee_code'] = df_bpe['iris_code'].astype(str).str[:5]

# 3. Pivot by municipality
scores_equip = df_bpe.pivot_table(index='insee_code',
                                  columns='categorie',
                                  values='OBS_VALUE',
                                  aggfunc='sum',
                                  fill_value=0)

# 4. Normalization with a THRESHOLD (capping)
# 🇬🇧 We cap the count at 5: for a family, having 5 GPs is the same as having 50
# 🇫🇷 On plafonne à 5 : avoir 5 généralistes ou 50 revient au même pour le confort
thresholds = {'Sante_Health': 5, 'Education': 3, 'Commerce_Proximite': 5, 'Transport': 2}

# Iterate over a copy of the column list, since new columns are added inside the loop
for col in list(scores_equip.columns):
    cap = thresholds.get(col, 5)  # default cap: 5
    scores_equip[f'{col}_norm'] = (np.minimum(scores_equip[col], cap) / cap * 100)

# 5. Final score (project weights, summing to 1)
weights = {
    'Sante_Health_norm': 0.30, 'Education_norm': 0.25, 'Emploi_Employment_norm': 0.17,
    'Transport_norm': 0.12, 'Commerce_Proximite_norm': 0.08, 'Grandes_Surfaces_norm': 0.02,
    'Admin_Services_norm': 0.025, 'Social_Services_norm': 0.025, 'Practical_Services_norm': 0.01
}

scores_equip['score_access_mvp1'] = sum(
    scores_equip.get(col, 0) * weight for col, weight in weights.items()
).round(2)

print("✅ Social viability score computed per municipality (0-100 scale).")
display(scores_equip[['score_access_mvp1']].sort_values(by='score_access_mvp1', ascending=False).head())

# 💻 3.0 Cross-Analysis: Linking Prices and Services / Analyse Croisée : Prix et Services
### 3.1 Data Merging Strategy / Stratégie de Fusion
---
üá¨üáß **Objective:** Merge real estate price data (Commune level) with accessibility scores (IRIS level). Since price data is less granular, we aggregate the IRIS scores to the Municipality level (Mean) to allow for a direct comparison with `Price_m2`. We use the `insee_code` as the unique join key. This is more robust than using municipality names, as it avoids issues with duplicate names or spelling variations across datasets.


üá´üá∑ **Objectif :** Fusionner les donn√©es de prix immobiliers (√©chelle Commune) avec les scores d'accessibilit√© (√©chelle IRIS). Les prix √©tant moins granulaires, nous agr√©geons les scores IRIS √† la Commune (Moyenne) pour permettre une comparaison directe avec le `Price_m2`. Nous utilisons l' `insee_code` comme cl√© de jointure unique. C'est plus robuste que d'utiliser les noms de communes, car cela √©vite les probl√®mes de doublons ou de variantes d'orthographe entre les jeux de donn√©es.