# 🏙️ Extracting Communes with 5000+ Inhabitants


## 📌 Overview
This notebook matches the **communes with 5000+ inhabitants** listed in the STMT dataset against a **communes shapefile**, then drops the geospatial columns and exports the cleaned dataset in preparation for mapping.

In [None]:
# IMPORT LIBRARIES
import matplotlib.pyplot as plt
import pandas as pd
import re # For regular expressions
import geopandas as gpd # To read geospatial data
from pathlib import Path # To set relative paths
import unidecode # To standardize strings
import py7zr # To unzip files

In [None]:
# GETTING PROJECT'S ROOT DIRECTORY
base_folder = Path().resolve()  # CURRENT WORKING DIRECTORY
main_folder = base_folder.parent

In [None]:
# EXTRACTING COMMUNES ZIPPED SHAPEFILE
seven_zip_path = main_folder / "data" / "shapefiles" / "Communes" / "communes-20220101.shp.7z"
extract_dir = main_folder / "data" / "shapefiles" / "Communes"

with py7zr.SevenZipFile(seven_zip_path, mode='r') as archive:
    archive.extractall(path=extract_dir)

In [None]:
# SETTING ALL NECESSARY DIRECTORIES
shapefile_path = main_folder / "data" / "shapefiles" / "Communes" / "communes-20220101.shp"
stmt_path = main_folder / "data" / "2- Formatted Data" / "full_stmt_dataset_cleaned.csv"
output_path = main_folder / "data" / "2- Formatted Data" / "full_stmt_dataset_cleaned_v2.csv"

In [None]:
# IMPORTING FILES
communes_shp = gpd.read_file(shapefile_path)
stmt_df = pd.read_csv(stmt_path)

In [None]:
# EXPLORING FILES AND SMALL ADJUSTMENTS
communes_shp.rename(columns={"nom": "commune"}, inplace=True)
communes_shp.head(5)

In [None]:
# EXPLORING FILES AND SMALL ADJUSTMENTS
stmt_df.rename(columns={"Commune de plus de 5000 hab": "commune_5000"}, inplace=True)
stmt_df.head(5)

### The objective is to match our "commune_5000" column with the "commune" column to align geospatial information for future mapping.

1- STANDARDIZATION OF THE COMMUNE COLUMN IN BOTH DATASETS

In [2]:
# Standardization function
def standardize_commune(name):
    if pd.isna(name):
        return None
    name = unidecode.unidecode(name.lower().strip())  # Remove accents & lowercase
    name = re.sub(r"[-'’]", " ", name)  # Replace hyphens & apostrophes with spaces
    name = re.sub(r"\bst(?:\.\s*|\s+)", "saint ", name)  # Standardize "St" / "St." -> "saint"
    return re.sub(r"\s+", " ", name).strip()  # Collapse repeated spaces

# Apply standardization
stmt_df["commune_5000"] = stmt_df["commune_5000"].apply(standardize_commune)
communes_shp["commune"] = communes_shp["commune"].apply(standardize_commune)

# Merge datasets
df_merged = stmt_df.merge(communes_shp, left_on="commune_5000", right_on="commune", how="left")

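
To make the goal of the standardization concrete, here is a minimal, self-contained sketch showing several spellings of one commune collapsing to a single key. It is a stand-in under stated assumptions, not the notebook's function: accents are stripped with the standard library's `unicodedata` instead of `unidecode`, and `normalize_name` and the sample spellings are illustrative.

```python
import re
import unicodedata


def normalize_name(name):
    """Illustrative stand-in for standardize_commune (standard library only)."""
    if name is None:
        return None
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name.lower().strip())
    name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace hyphens and apostrophes with spaces
    name = re.sub(r"[-'’]", " ", name)
    # Expand the abbreviation "st" / "st." to "saint"
    name = re.sub(r"\bst(?:\.\s*|\s+)", "saint ", name)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", name).strip()


# Illustrative spellings of the same commune
variants = ["Saint-Étienne", "ST. ETIENNE", "st etienne", " saint étienne "]
keys = {normalize_name(v) for v in variants}
print(keys)
```

If the set printed at the end contains more than one key, the two datasets would fail to match on that commune, which is the situation the merge step is trying to avoid.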

2- WHICH COMMUNES WERE NOT MATCHED AND WHY?

In [None]:
# Extract unique unmatched commune names as a list
unique_unmatched_communes_list = df_merged.loc[df_merged["commune"].isna(), "commune_5000"].dropna().drop_duplicates().tolist()

# Print the list
print(unique_unmatched_communes_list)

# Define a function that looks for potential matches using a regular expression
def regex_search(commune_name, commune_list):
    # Escape each word so special characters are matched literally,
    # then allow any characters between consecutive words
    pattern = ".*".join(re.escape(word) for word in commune_name.split())
    matches = [c for c in commune_list if isinstance(c, str) and re.search(pattern, c, re.IGNORECASE)]
    return matches

# Check possible regex matches
possible_matches = {c: regex_search(c, communes_shp["commune"].tolist()) for c in unique_unmatched_communes_list}

# Print potential matches
for key, value in possible_matches.items():
    if value:
        print(f"🔍 Possible match for '{key}': {value}")
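
The word-by-word search above can be illustrated on a toy list. This is a minimal sketch, assuming names are already standardized; `token_search` and the commune names are hypothetical and only mirror the idea of turning the spaces of a name into wildcards.

```python
import re


def token_search(name, candidates):
    """Return the candidates that contain every word of `name`, in order."""
    # Escape each word so characters such as "." or "(" are matched literally
    pattern = ".*".join(re.escape(word) for word in name.split())
    return [c for c in candidates if c and re.search(pattern, c, re.IGNORECASE)]


# Hypothetical standardized names, for illustration only
shapefile_names = ["bourg neuf en exemple", "neuf bourg", "autre ville", None]
found = token_search("bourg neuf", shapefile_names)
print(found)
```

Only the name that contains both words in the same order is returned, which is why a commune whose name was lengthened after a merger shows up as a candidate for its older, shorter name.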

#### An online check of the unmatched communes shows that the mismatch comes from the vintage of the data: the shapefile dates from 2022, while the dataset is from 2020. Some communes have merged over the years, which lengthened their names. Most of these mergers took place before 2020, yet the STMT data still contains the old commune names, probably because administrative and employment service records still use them. We could attempt to map the old communes to the merged ones, but for simplicity we drop these approximately 40 communes (0.9% of our sample).
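
As a rough illustration of how such a share can be computed, the sketch below lists the names absent from a reference list and their proportion. It is an assumption-based example: `unmatched_report` and all names are hypothetical, and the share is taken over unique commune names, which may differ from how the 0.9% figure was obtained.

```python
def unmatched_report(dataset_names, shapefile_names):
    """Return the sorted unmatched names and their share of the unique names."""
    unique_names = {n for n in dataset_names if n is not None}
    known = set(shapefile_names)
    unmatched = sorted(unique_names - known)
    share = len(unmatched) / len(unique_names) if unique_names else 0.0
    return unmatched, share


# Hypothetical names, for illustration only
dataset_names = ["ville a", "ville b", "ville c", "ancienne ville", "ville a"]
shapefile_names = ["ville a", "ville b", "ville c", "ancienne ville nouvelle"]

unmatched, share = unmatched_report(dataset_names, shapefile_names)
print(unmatched, f"{share:.1%}")
```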

3- QUICK VISUALISATION

In [None]:
# Reconvert into geospatial dataframe
gdf = gpd.GeoDataFrame(df_merged, geometry="geometry", crs=communes_shp.crs)  # Reuse the shapefile's CRS, which the merge does not carry over

# Define approximate bounding box for mainland France & Corsica
france_bounds = (-5, 10, 41, 52)  # (xmin, xmax, ymin, ymax)

# Keep only the geometries that intersect this bounding box
gdf_mainland = gdf.cx[france_bounds[0]:france_bounds[1], france_bounds[2]:france_bounds[3]]

# Plot with very thin edges
fig, ax = plt.subplots(figsize=(8, 10))
gdf_mainland.plot(ax=ax, edgecolor="black", linewidth=0.1, alpha=0.5)

# Remove axis labels for cleaner visualization
ax.set_xticks([])
ax.set_yticks([])
ax.set_title("Communes with > 5000 inhabitants in Metropolitan France")

plt.show()
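
The bounding-box filter used above keeps the geometries that overlap a rectangle. The sketch below shows the underlying idea in plain Python on rectangles only. It is a simplification, since `.cx` works on the actual geometries. `bbox_intersects` is a hypothetical helper, the coordinates are rough values in degrees rather than values read from the shapefile, and the tuples are ordered `(xmin, ymin, xmax, ymax)`, unlike `france_bounds`.

```python
def bbox_intersects(geom_bounds, box):
    """Check whether two (xmin, ymin, xmax, ymax) rectangles overlap."""
    gxmin, gymin, gxmax, gymax = geom_bounds
    bxmin, bymin, bxmax, bymax = box
    return gxmin <= bxmax and gxmax >= bxmin and gymin <= bymax and gymax >= bymin


# Approximate box for mainland France and Corsica (longitude/latitude)
mainland_box = (-5, 41, 10, 52)

# Rough, illustrative bounding boxes (assumed values)
paris = (2.22, 48.81, 2.47, 48.91)
saint_denis_reunion = (55.40, -20.95, 55.55, -20.85)

in_paris = bbox_intersects(paris, mainland_box)
in_reunion = bbox_intersects(saint_denis_reunion, mainland_box)
print(in_paris, in_reunion)
```

Overseas communes fall far outside the box, so they are excluded from the map without any need for a list of department codes.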

4- SAVE AND EXPORT

In [None]:
# DROPPING UNNECESSARY COLUMNS FOR SAVING (KEEPING "WIKIPEDIA" FOR POTENTIAL REMERGING)
df_merged.drop(columns=["commune", "surf_ha", "insee", "geometry"], inplace=True)

# EXPORT DATASET 
df_merged.to_csv(output_path, index=False)