## ⚙️ ROME TO FAP LINKING TABLE PROCESSING
---

### Purpose
This script processes the ROME to FAP linking table, structuring occupational classification codes for analysis.

### Steps
1. **Load Data:** Import the Excel file and select the relevant sheet.
2. **Extract Categories:** Assign hierarchical FAP codes (`fap22`, `fap87`, `fap225`).
3. **Handle Missing Data:** Remove empty rows and forward-fill missing values.
4. **Assign Professional Families:** Map `famille_pro` names based on category levels.
5. **Clean & Filter:** Keep relevant rows, drop unnecessary columns, and reorder data.
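
Steps 2 to 4 can be sketched on a toy table. This is a minimal illustration, assuming a layout where header rows carry a FAP code and the ROME rows below them carry none; the column names `code` and `label`, the label values, and the helper `split_fap_code` are hypothetical and not taken from the real Excel file.

```python
import pandas as pd

# Toy stand-in for the linking table (layout and labels are assumptions).
toy = pd.DataFrame({
    "code": ["A", "A0Z", "A0Z00", None, None],
    "label": ["Family A", "Sub-family A0Z", "Detail A0Z00", None, None],
    "ROME": [None, None, None, "A1416", "A1403"],
})

def split_fap_code(code):
    """Return (fap22, fap87, fap225) for a hierarchical FAP code string."""
    if pd.isna(code):
        return (None, None, None)
    if len(code) == 1:
        return (code, None, None)        # level 1 only
    if len(code) == 3:
        return (code[0], code, None)     # levels 1 and 2
    return (code[0], code[:3], code)     # all three levels

levels = pd.DataFrame(
    [split_fap_code(c) for c in toy["code"]],
    columns=["fap22", "fap87", "fap225"],
    index=toy.index,
)
toy = pd.concat([toy, levels], axis=1)

# Forward-fill so that ROME rows inherit the codes of the header rows above them.
fap_cols = ["fap22", "fap87", "fap225"]
toy[fap_cols] = toy[fap_cols].ffill()

# Keep only the rows that actually carry a ROME code.
rome_rows = toy[toy["ROME"].notna()]
print(rome_rows[fap_cols + ["ROME"]])
```

Both ROME rows end up attached to `A`, `A0Z` and `A0Z00`, which is the effect the forward-fill step relies on.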

### Output
A structured dataset with FAP codes (`fap22`, `fap87`), professional family labels, and ROME codes with their labels, ready for analysis and merging.


## ⚙️ Step 1: Workflow

In [3]:
# =================================================
# Install Necessary Packages (For all the project)
# =================================================
# Silent install of missing third-party packages
import subprocess, sys, importlib

for pkg in ["dash", "geopandas", "jenkspy", "matplotlib", "numpy", "pandas", "plotly", "py7zr", "unidecode", "contextily"]:
    try: importlib.import_module(pkg)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# ===============================
# IMPORT LIBRARIES
# ===============================
from pathlib import Path 
import pandas as pd

# ===============================
# DEFINE PROJECT PATHS
# ===============================
base_dir = Path().resolve()
project_root = base_dir.parent

# Define input and output file paths
fap_to_rome = project_root / "data" / "linking tables" / "Rome-V3 vers Fap-2009.xls"
output_path = project_root / "data" / "linking tables" / "Rome_to_Fap_processed.csv"

# ===============================
# LOAD FAP–ROME LINKING TABLE
# ===============================
link = pd.read_excel(fap_to_rome, sheet_name=1)

# ===============================
# EXTRACT FAP CATEGORIES
# ===============================
# Identify column containing the FAP codes
category_col = link.columns[0]

# Function to split hierarchical FAP codes into levels
def extract_categories(code):
    if pd.isna(code):
        return pd.Series([None, None, None])
    elif len(code) == 1:
        return pd.Series([code, None, None])       # Level 1: fap22
    elif len(code) == 3:
        return pd.Series([code[0], code, None])    # Level 2: fap87
    else:
        return pd.Series([code[0], code[:3], code])  # Level 3: fap225

# Apply function to generate category columns
link[["fap22", "fap87", "fap225"]] = link[category_col].apply(extract_categories)

# ===============================
# CLEAN & STRUCTURE DATA
# ===============================

# Drop rows where all relevant columns are missing
link.dropna(subset=["fap22", "fap87", "fap225", "ROME"], how="all", inplace=True)

# Flag rows with missing ROME codes (to drop later)
link["no_rome_code"] = link["ROME"].isna()

# Assign professional family labels at each FAP level
link["famille_pro22"] = link["Familles professionnelles"].where(
    link["fap22"].notna() & link["fap87"].isna() & link["fap225"].isna()
)
link["famille_pro87"] = link["Familles professionnelles"].where(
    link["fap22"].notna() & link["fap87"].notna() & link["fap225"].isna()
)
link["famille_pro225"] = link["Familles professionnelles"].where(
    link["fap22"].notna() & link["fap87"].notna() & link["fap225"].notna()
)

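# Forward-fill every column so that ROME rows inherit the FAP codes and
# family labels of the header rows above them
# (fillna(method="ffill") is deprecated in recent pandas versions)
link.ffill(inplace=True)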

# ===============================
# FINALIZE AND FORMAT OUTPUT
# ===============================

# Keep only relevant rows and columns (copy to avoid chained-assignment warnings)
link = link[~link["no_rome_code"]]
link = link[
    ["fap22", "famille_pro22", "fap87", "famille_pro87", "ROME",
     "Répertoire Opérationnel des Métiers et des Emplois"]
].copy()

# Rename columns
link.rename(columns={
    "ROME": "rome",
    "Répertoire Opérationnel des Métiers et des Emplois": "rome_label"
}, inplace=True)

# Remove duplicates
link.drop_duplicates(inplace=True)

# Clean up asterisks in professional family names
link["famille_pro22"] = link["famille_pro22"].str.replace("*", "", regex=False)
link["famille_pro87"] = link["famille_pro87"].str.replace("*", "", regex=False)

# ===============================
# DISPLAY AND EXPORT
# ===============================
link.to_csv(output_path, index=False)
link

| | fap22 | famille_pro22 | fap87 | famille_pro87 | rome | rome_label |
|---|---|---|---|---|---|---|
| 25 | A | Agriculture, marine, pêche | A0Z | Agriculteurs, éleveurs, sylviculteurs, bûcherons | A1416 | Polyculture, élevage |
| 27 | A | Agriculture, marine, pêche | A0Z | Agriculteurs, éleveurs, sylviculteurs, bûcherons | A1403 | Aide d'élevage agricole et aquacole |
| 28 | A | Agriculture, marine, pêche | A0Z | Agriculteurs, éleveurs, sylviculteurs, bûcherons | A1407 | Élevage bovin ou équin |
| 29 | A | Agriculture, marine, pêche | A0Z | Agriculteurs, éleveurs, sylviculteurs, bûcherons | A1408 | Élevage d'animaux sauvages ou de compagnie |
| 30 | A | Agriculture, marine, pêche | A0Z | Agriculteurs, éleveurs, sylviculteurs, bûcherons | A1409 | Élevage de lapins et volailles |
| ... | ... | ... | ... | ... | ... | ... |
| 1401 | W | Enseignement, formation | W0Z | Enseignants | K2109 | Enseignement technique et professionnel |
| 1405 | W | Enseignement, formation | W0Z | Enseignants | K2103 | Direction d'établissement et d'enseignement |
| 1410 | W | Enseignement, formation | W0Z | Enseignants | K2108 | Enseignement supérieur |
| 1416 | W | Enseignement, formation | W1Z | Formateurs | K2110 | Formation en conduite de véhicules |
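
A few sanity checks could be run on the processed table before it is merged elsewhere. The sketch below is only a suggestion, run on toy rows that mimic the columns `fap22`, `fap87` and `rome`; the helper `check_processed` is hypothetical, and the ROME pattern (one capital letter followed by four digits) is inferred from the codes displayed above.

```python
import pandas as pd

# Toy rows mimicking the processed table (same column names, three sample rows).
processed = pd.DataFrame({
    "fap22": ["A", "A", "W"],
    "fap87": ["A0Z", "A0Z", "W0Z"],
    "rome": ["A1416", "A1403", "K2109"],
})

def check_processed(df):
    """Return a dict of boolean sanity checks (hypothetical helper)."""
    return {
        # ROME codes look like one capital letter followed by four digits.
        "rome_format_ok": bool(df["rome"].str.fullmatch(r"[A-Z]\d{4}").all()),
        # Each fap87 code should start with its parent fap22 letter.
        "fap87_under_fap22": bool((df["fap87"].str[0] == df["fap22"]).all()),
        # No missing values should remain after forward-filling and filtering.
        "no_missing": bool(df.notna().all().all()),
        # Exact duplicate rows should have been dropped.
        "no_duplicate_rows": not df.duplicated().any(),
    }

checks = check_processed(processed)
print(checks)
```

On the real data, the same function could be called as `check_processed(link)` right after the export.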


## 📉 Step 2: Checks/Graphs

The issue is that several `fap225` categories can share the same **ROME code**. The only thing that differentiates them is their associated "qualification code", which is available in neither our STMT nor our JOCAS dataset, where we only have the ROME code.

In [4]:
# NUMBER OF DUPLICATED ROME (WITHOUT THE FIRST VALUE)
print(f"Duplicated ROME with different FAP : {link[link['rome'].duplicated()]['rome'].nunique()}")

Duplicated ROME with different FAP : 168
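
The count above says how many ROME codes are ambiguous, but not which ones. A possible way to list them, shown here on a toy table with made-up codes (`X0001`, `Y0002`) rather than on `link`, is to count the distinct FAP families per ROME code:

```python
import pandas as pd

# Toy linking table: the made-up code "X0001" is attached to two FAP families.
toy_link = pd.DataFrame({
    "fap87": ["A0Z", "A1Z", "B0Z"],
    "rome": ["X0001", "X0001", "Y0002"],
})

# Number of distinct fap87 families attached to each ROME code.
fap_per_rome = toy_link.groupby("rome")["fap87"].nunique()

# ROME codes that map to more than one family.
ambiguous = fap_per_rome[fap_per_rome > 1]
print(ambiguous)

# Same count as the duplicated()-based expression used in the cell above.
n_dup = toy_link[toy_link["rome"].duplicated()]["rome"].nunique()
print(n_dup)
```

Because exact duplicate rows were dropped earlier, both approaches should agree on the number of ambiguous codes.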


We have 168 duplicated ROME codes out of 714. We assume that people with the same ROME code (even across different FAP categories) have the same capabilities and could therefore apply to any job with this ROME code. As a consequence, the same ROME code appears on several rows, once for each FAP it is linked to.
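
The practical effect of this assumption shows up when the linking table is merged onto data keyed by ROME code only. The sketch below uses toy frames with made-up codes; `toy_obs` and its `obs_id` column are hypothetical stand-ins for STMT or JOCAS rows, not their real structure.

```python
import pandas as pd

# Toy linking table: the made-up code "X0001" maps to two FAP families.
toy_link = pd.DataFrame({
    "fap87": ["A0Z", "A1Z", "B0Z"],
    "rome": ["X0001", "X0001", "Y0002"],
})

# Toy observations keyed by ROME code only (hypothetical columns).
toy_obs = pd.DataFrame({
    "obs_id": [1, 2, 3],
    "rome": ["X0001", "Y0002", "X0001"],
})

# Each observation is repeated once per FAP family linked to its ROME code.
merged = toy_obs.merge(toy_link, on="rome", how="left", validate="many_to_many")
print(len(toy_obs), "->", len(merged))  # 3 -> 5
print(merged)
```

Observations with an ambiguous ROME code are counted once per FAP family after the merge, so row counts taken after the merge are not directly comparable with counts taken before it.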