## MERGING JOCAS (JOB SUPPLY) DATA

The JOCAS data available to us from Progedo (2020) is organized as follows:

```plaintext
JOCAS/
│── Website_1/
│   ├── January/
│   │   ├── 01.xlsx
│   │   ├── 02.xlsx
│   │   ├── ...
│   │   ├── 31.xlsx
│   ├── ...
│   ├── December/
│   │   ├── 01.xlsx
│   │   ├── 02.xlsx
│   │   ├── ...
│   │   ├── 31.xlsx
│
│── Website_2/
│   ├── January/
│   │   ├── 01.xlsx
│   │   ├── 02.xlsx
│   │   ├── ...
│   │   ├── 31.xlsx
│   ├── ...
│
│── ...
```
The objective of this notebook is to extract the data of interest (job offers in communes with more than 5,000 inhabitants) from the whole JOCAS database into a single Excel file.
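
Extracting from the whole database means visiting every daily file under every website and month folder. The sketch below shows one way to enumerate those files, assuming the two-level `website/month/day` layout drawn above. The helper name `list_daily_files` and the temporary folder are hypothetical and used only for illustration; the real root would be whatever `jocas_dir` points to.

```python
from pathlib import Path
import tempfile

def list_daily_files(root, pattern="*.xlsx"):
    """Return every daily file found at root/<website>/<month>/, sorted by path."""
    root = Path(root)
    # Exactly two directory levels (website, month) sit above each daily file
    return sorted(p for p in root.glob(f"*/*/{pattern}") if p.is_file())

# Build a tiny throwaway tree that mimics the layout (hypothetical folder names)
with tempfile.TemporaryDirectory() as tmp:
    jocas_root = Path(tmp) / "JOCAS"
    for website in ("Website_1", "Website_2"):
        for month in ("January", "December"):
            folder = jocas_root / website / month
            folder.mkdir(parents=True)
            for day in ("01", "02"):
                (folder / f"{day}.xlsx").touch()

    files = list_daily_files(jocas_root)
    n_files = len(files)
    first_file = files[0].relative_to(jocas_root).as_posix()

print(n_files, first_file)
```

The `pattern` argument is there because the file extension may differ from the one in the diagram; passing `"*.csv"` would select CSV files instead.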

In [7]:
# IMPORT LIBRARIES
import matplotlib.pyplot as plt
import pandas as pd
import os
import re # For regular expressions
import geopandas as gpd # To read geospatial data
from pathlib import Path # To set relative paths
import unidecode # To standardize strings
import py7zr # To unzip files

In [8]:
# GETTING PROJECT'S ROOT DIRECTORY
base_folder = Path().resolve()  # CURRENT WORKING DIRECTORY
main_folder = base_folder.parent

# JOCAS DIRECTORY 
jocas_dir = ""

# IMPORTING FILES
commune_names_path = main_folder / "data" / "2- Formatted Data" / "name_communes_5000.csv"
commune_names = pd.read_csv(commune_names_path).squeeze().tolist()  
rome_fap_path = main_folder / "data" / "linking tables" / "Rome_to_Fap_processed.csv"
rome_fap = pd.read_csv(rome_fap_path)

In [19]:
# READ DATASET
df = pd.read_csv("/Users/alfonso/Desktop/JOCAS DATA/leboncoin/2020-01/20200101_offers.csv", sep=";")

# RENAME COLUMNS OF INTEREST
df.rename(columns={"location_label": "commune", "job_ROME_code": "rome_code"}, inplace=True)

# COMMUNE NAME STANDARDIZATION FUNCTION
def standardize_commune(name):
    if pd.isna(name):
        return None
    name = unidecode.unidecode(name.lower().strip())  # Remove accents & lowercase
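    name = re.sub(r"[-'’]", " ", name)  # Replace hyphens & apostrophes with spaces
    name = re.sub(r"\bst[ .]", "saint ", name)  # Standardize "St." -> "Saint"
    name = re.sub(r"\s+", " ", name).strip()  # Collapse repeated spaces (e.g. left by "st. ")
    return name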

# APPLY STANDARDIZATION
df["commune"] = df["commune"].apply(standardize_commune)

# FILTER FOR COMMUNES WITH MORE THAN 5000 INHABITANTS
df_filtered = df[df["commune"].isin(commune_names)]

# MERGE ROME CODE WITH LINKING TABLE (TO FAP)
df_grouped = df_filtered.merge(rome_fap, left_on="rome_code", right_on="ROME", how="left")

# GROUP BY COMMUNE AND FAP225 CODE AND COUNT JOB OFFERS
df_grouped = df_grouped.groupby(["commune", "fap225"]).size().reset_index(name="job_offers_fap225")

Unnamed: 0,commune,fap225,job_offers_fap225
0,agen,B3Z20,4
1,agen,B4Z41,4
2,aire sur la lys,B0Z20,1
3,aix en provence,D3Z20,4
4,aix en provence,D4Z41,4
5,aix en provence,J0Z20,9
6,aix en provence,J1Z40,9
7,aix en provence,J1Z80,9
8,aix les bains,J0Z20,4
9,aix les bains,J1Z40,4
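
The commune names in this output are lowercase, unaccented and space-separated because of the standardization step. The sketch below runs the same sequence of transformations on a few sample names to make the effect visible. It is a stand-alone illustration, not the notebook's function: `strip_accents` is a standard-library substitute for `unidecode.unidecode` (an assumption made so the example has no third-party dependency), and the name `standardize_commune_sketch` is hypothetical. It also collapses repeated whitespace, so that an input like "St. Étienne" yields single spaces.

```python
import re
import unicodedata

def strip_accents(text):
    # Stand-in for unidecode.unidecode: decompose characters, drop combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def standardize_commune_sketch(name):
    if name is None:
        return None
    name = strip_accents(name.lower().strip())   # lowercase, trim, remove accents
    name = re.sub(r"[-'’]", " ", name)           # hyphens and apostrophes become spaces
    name = re.sub(r"\bst[ .]", "saint ", name)   # "st." or "st " becomes "saint "
    return re.sub(r"\s+", " ", name).strip()     # collapse repeated spaces

examples = ["Aix-en-Provence", "St. Étienne", "L’Haÿ-les-Roses", " Aire-sur-la-Lys "]
standardized = [standardize_commune_sketch(n) for n in examples]
print(standardized)
```

Matching against `commune_names` only works if that list was standardized with the same rules, since `isin` compares strings exactly.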


In [12]:
rome_fap.head(5)

Unnamed: 0,fap22,famille_pro22,fap87,famille_pro87,fap225,famille_pro225,PCS,Professions et catégories socioprofessionnelles,ROME,Répertoire Opérationnel des Métiers et des Emplois
0,A,"Agriculture, marine, pêche",A0Z,"Agriculteurs, éleveurs, sylviculteurs, bûcherons",A0Z00,Agriculteurs indépendants,111a,Agriculteurs sur petite exploitation de céréal...,,
1,A,"Agriculture, marine, pêche",A0Z,"Agriculteurs, éleveurs, sylviculteurs, bûcherons",A0Z01,Éleveurs indépendants,111d,"Éleveurs d'herbivores, sur petite exploitation",,
2,A,"Agriculture, marine, pêche",A0Z,"Agriculteurs, éleveurs, sylviculteurs, bûcherons",A0Z02,"Bûcherons, sylviculteurs indépendants",122b,"Exploitants forestiers indépendants, de 0 à 9 ...",,
3,A,"Agriculture, marine, pêche",A0Z,"Agriculteurs, éleveurs, sylviculteurs, bûcherons",A0Z40,Agriculteurs salariés,691e,Ouvriers agricoles sans spécialisation particu...,A1416,"Polyculture, élevage"
4,A,"Agriculture, marine, pêche",A0Z,"Agriculteurs, éleveurs, sylviculteurs, bûcherons",A0Z41,Éleveurs salariés,691b,Ouvriers de l'élevage,A1403,Aide d'élevage agricole et aquacole
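
One point to check when merging on this table: if a ROME code appears on several rows (that is, it maps to more than one `fap225` code), a left merge repeats each offer once per matching row. The identical counts across several `fap225` codes for the same commune in the grouped output are consistent with this, though that is an inference and not something the notebook states. The sketch below reproduces the effect on toy data; the codes `X0001`, `F0001` and `F0002` are invented for illustration and do not come from the real linking table.

```python
import pandas as pd

# Toy offers: three postings in one commune, all with the same (made-up) ROME code
offers = pd.DataFrame({
    "commune": ["agen", "agen", "agen"],
    "rome_code": ["X0001", "X0001", "X0001"],
})

# Toy linking table in which one ROME code maps to two fap225 codes
link = pd.DataFrame({
    "ROME": ["X0001", "X0001"],
    "fap225": ["F0001", "F0002"],
})

# Same merge and grouping pattern as in the notebook
merged = offers.merge(link, left_on="rome_code", right_on="ROME", how="left")
counts = (
    merged.groupby(["commune", "fap225"])
    .size()
    .reset_index(name="job_offers_fap225")
)

n_offers = len(offers)                                  # original number of offers
n_counted = int(counts["job_offers_fap225"].sum())      # offers counted after the merge
print(counts)
print(n_offers, n_counted)
```

Here each of the three offers is counted under both FAP codes, so the counts sum to twice the number of offers. If that is not the intended behavior, passing `validate="many_to_one"` to `merge` makes pandas raise an error when the right-hand key is duplicated. Note also that `groupby` drops rows whose `fap225` is missing by default, so offers with no match in the linking table do not appear in the counts.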
