## MERGING JOCAS (JOB SUPPLY) DATA

The JOCAS data available to us from Progedo (2020) is organized as follows:

```plaintext
JOCAS/
│── Website_1/
│   ├── January/
│   │   ├── 01.xlsx
│   │   ├── 02.xlsx
│   │   ├── ...
│   │   ├── 31.xlsx
│   ├── ...
│   ├── December/
│   │   ├── 01.xlsx
│   │   ├── 02.xlsx
│   │   ├── ...
│   │   ├── 31.xlsx
│
│── Website_2/
│   ├── January/
│   │   ├── 01.xlsx
│   │   ├── 02.xlsx
│   │   ├── ...
│   │   ├── 31.xlsx
│   ├── ...
│
│── ...
```
The objective of this notebook is to extract the data of interest from the whole JOCAS database (job offers located in communes with more than 5,000 inhabitants) into a single Excel file.
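
As a rough illustration of how the tree above could be traversed, here is a minimal sketch that lists every daily file together with its website and period folder. The function name `iter_daily_files` and the `pattern` argument are hypothetical, not part of the notebook; the pattern is left as a parameter because the actual file extension may differ from the one shown in the tree.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_daily_files(root: Path, pattern: str = "*.xlsx") -> Iterator[Tuple[str, str, Path]]:
    """Yield (website, period, file_path) for each daily file under root.

    Assumes the two-level layout root/<website>/<period>/<daily file>.
    Folders are visited in alphabetical order, not chronological order.
    """
    for site_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for period_dir in sorted(p for p in site_dir.iterdir() if p.is_dir()):
            for file_path in sorted(period_dir.glob(pattern)):
                yield site_dir.name, period_dir.name, file_path
```

Each yielded path could then be read and reduced to per-commune counts before the results are combined into a single table.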

In [2]:
# IMPORT LIBRARIES
import matplotlib.pyplot as plt
import pandas as pd
import os
import re # For regular expressions
import geopandas as gpd # To read geospatial data
from pathlib import Path # To set relative paths
import unidecode # To standardize strings
import py7zr # To unzip files

In [4]:
# GETTING PROJECT'S ROOT DIRECTORY
base_folder = Path().resolve()  # CURRENT WORKING DIRECTORY
main_folder = base_folder.parent

# JOCAS DIRECTORY 
jocas_dir = ""

# IMPORTING FILES
commune_names_path = main_folder / "data" / "2- Formatted Data" / "name_communes_5000.csv"
commune_names = pd.read_csv(commune_names_path).squeeze().tolist()  
rome_fap_path = main_folder / "data" / "linking tables" / "Rome_to_Fap_processed.csv"
rome_fap = pd.read_csv(rome_fap_path)

In [9]:
# READ ONE DAILY FILE (ABSOLUTE PATH IS SPECIFIC TO THE AUTHOR'S MACHINE)
df = pd.read_csv("/Users/alfonso/Desktop/JOCAS DATA/leboncoin/2020-01/20200101_offers.csv", sep=";")

# RENAME COLUMNS OF INTEREST
df.rename(columns={"location_label": "commune", "job_ROME_code": "rome_code"}, inplace=True)

# COMMUNE NAME STANDARDIZATION FUNCTION
def standardize_commune(name):
    if pd.isna(name):
        return None
    name = unidecode.unidecode(name.lower().strip())  # Remove accents & lowercase
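    name = re.sub(r"[-'’]", " ", name)  # Replace hyphens & apostrophes with spaces
    name = re.sub(r"\bst(?:\.\s*|\s+)", "saint ", name)  # Standardize "St." / "St" -> "saint" without a double space
    name = re.sub(r"\s+", " ", name).strip()  # Collapse repeated whitespace
    return name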

# APPLY STANDARDIZATION
df["commune"] = df["commune"].apply(standardize_commune)

# FILTER FOR COMMUNES WITH MORE THAN 5000 INHABITANTS
df_filtered = df[df["commune"].isin(commune_names)]

# GROUP BY COMMUNE AND ROME CODE AND COUNT JOB OFFERS
df_grouped = df_filtered.groupby(["commune", "rome_code"]).size().reset_index(name="job_offers")
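
The `isin` filter above only keeps a row when the standardized label is spelled exactly like an entry in the reference list, so both sides need the same normalization. The sketch below illustrates that idea on a few made-up labels and reports which ones fail to match. It is a stand-alone illustration, not the notebook's code: `normalize_name` and `split_matches` are hypothetical helpers, and accent removal uses the standard-library `unicodedata` module as a stand-in for `unidecode` (the two can differ on characters such as ligatures).

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents, and unify separators and the 'St' abbreviation."""
    # Stand-in for unidecode: decompose characters, then drop combining accents
    text = unicodedata.normalize("NFKD", name.lower().strip())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[-'’]", " ", text)                   # separators -> spaces
    text = re.sub(r"\bst(?:\.\s*|\s+)", "saint ", text)  # "st." / "st" -> "saint"
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace


def split_matches(labels, reference):
    """Split labels into those found in the reference list and those not found."""
    reference_keys = {normalize_name(r) for r in reference}
    matched, unmatched = [], []
    for label in labels:
        (matched if normalize_name(label) in reference_keys else unmatched).append(label)
    return matched, unmatched


# Illustrative values only, not taken from the JOCAS files
reference_names = ["Saint-Étienne", "Aix-en-Provence", "Alès"]
labels = ["St. Etienne", "aix en provence", "ALES", "Paris 15e"]

matched, unmatched = split_matches(labels, reference_names)
print(matched)    # ['St. Etienne', 'aix en provence', 'ALES']
print(unmatched)  # ['Paris 15e']
```

Inspecting the unmatched labels is a cheap way to spot spellings the standardization does not yet cover before they are silently dropped by the filter.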

In [13]:
df_grouped.head(15)

| | commune | rome_code | job_offers |
|---|---|---|---|
| 0 | agen | F1603 | 1 |
| 1 | aire sur la lys | F1704 | 1 |
| 2 | aix en provence | H3404 | 1 |
| 3 | aix en provence | N1103 | 1 |
| 4 | aix les bains | N1105 | 1 |
| 5 | albertville | M1302 | 1 |
| 6 | alencon | F1602 | 1 |
| 7 | alencon | I1604 | 1 |
| 8 | ales | H2102 | 1 |
| 9 | ales | I1607 | 1 |
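
The counts above cover a single daily file. One possible way to merge several days into one table is to stack the per-day results and sum `job_offers` per commune and ROME code. The sketch below assumes each daily result has the same three columns as `df_grouped`; the helper name `combine_daily_counts` is hypothetical and the two small frames are toy values for illustration.

```python
import pandas as pd


def combine_daily_counts(daily_frames):
    """Stack per-day counts and sum them per (commune, rome_code)."""
    stacked = pd.concat(daily_frames, ignore_index=True)
    return stacked.groupby(["commune", "rome_code"], as_index=False)["job_offers"].sum()


# Toy per-day results with the same columns as df_grouped
day_1 = pd.DataFrame(
    {"commune": ["agen", "ales"], "rome_code": ["F1603", "H2102"], "job_offers": [1, 1]}
)
day_2 = pd.DataFrame(
    {"commune": ["agen", "albertville"], "rome_code": ["F1603", "M1302"], "job_offers": [2, 1]}
)

total = combine_daily_counts([day_1, day_2])
print(total)
```

The combined frame could then be written out with `total.to_excel(..., index=False)`, which requires an Excel writer engine such as `openpyxl` to be installed.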
