
dataversegh/engine1-catboost

DataverseGH-GLMIS — Engine 1: Micro-Nowcasting (CatBoost)

Part of the DataverseGH-GLMIS ecosystem — a modular Labour Market Information System for Ghana.
Engine 1 is the demographic nowcasting foundation. Engine 2 (NLP Demand Engine) is built in parallel.


Overview

This module answers the question:

"Can we predict the employment status of a Ghanaian individual using only socio-demographic characteristics observable in administrative and census data?"

Two models are built on AHIES 2022–2024 (Annual Household Income and Expenditure Survey, Ghana Statistical Service). Both are trained on 2022Q1–2023Q2, early-stopped on 2023Q3–2023Q4, and evaluated on 2024 as a genuine forward prediction. This temporal holdout is what supports the micro-nowcasting framing.


Models

Model 1 — 5-Class ICLS Employment Status

| Detail | Value |
| --- | --- |
| Target | emp_class: Employee, Employer, Own-Account Worker, Unemployed, Outside Labour Force |
| Macro F1 (Validation) | 0.46 |
| Macro F1 (2024 Holdout) | 0.46 |
| Nowcast Drift | −0.03 ✅ |
| Key finding | Structural positions (NLF, OAW) are demographically predictable; behavioural states (Unemployed) are not |

Per-class results:

| Class | F1 | Interpretation |
| --- | --- | --- |
| Not in Labour Force | 0.89 | Strongly demographic: age, sex, household role |
| Own Account Worker | 0.81 | Regional and education patterns captured well |
| Employee | 0.77 | Learnable but noisy boundary with OAW |
| Unemployed | 0.27 | Behavioural, not demographic; expected ceiling |
| Employer | 0.08 | Too rare (1% of sample), similar profile to Employee |

[Figure: 5-class confusion matrix]


Model 2 — 3-Class Labour Force Participation

| Detail | Value |
| --- | --- |
| Target | economic_activity_status: Employed, Unemployed, Not Active (OLF) |
| Macro F1 (Validation) | 0.63 |
| Macro F1 (2024 Holdout) | 0.63 |
| Nowcast Drift | 0.00 ✅ |
| Key finding | Age and household position are the primary demographic drivers of labour force participation in Ghana |

[Figure: 3-class confusion matrix]

Per-class results:

| Class | F1 |
| --- | --- |
| Not Active | 0.86 |
| Employed | 0.79 |
| Unemployed | 0.24 |
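
The macro F1 and nowcast drift figures reported for both models could be computed roughly as follows. This is a minimal sketch: it assumes drift means holdout macro F1 minus validation macro F1 (the README does not state the formula), and the function name `macro_f1_and_drift` is hypothetical. The last lines check by hand that the 3-class per-class scores above average to the reported macro F1.

```python
import numpy as np
from sklearn.metrics import f1_score


def macro_f1_and_drift(y_val, pred_val, y_hold, pred_hold):
    """Return (validation macro F1, holdout macro F1, drift).

    Assumption: drift = holdout macro F1 - validation macro F1.
    """
    f1_val = f1_score(y_val, pred_val, average="macro")
    f1_hold = f1_score(y_hold, pred_hold, average="macro")
    return f1_val, f1_hold, f1_hold - f1_val


# Macro F1 is the unweighted mean of the per-class F1 scores, so the
# 3-class results table can be verified directly.
per_class_f1 = {"Not Active": 0.86, "Employed": 0.79, "Unemployed": 0.24}
macro_from_table = float(np.mean(list(per_class_f1.values())))
print(round(macro_from_table, 2))  # 0.63
```

Because the mean is unweighted, the weak Unemployed class pulls the macro score down as much as a large class would, which is why macro F1 rather than accuracy is the headline metric.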

Data

Source: AHIES 2022–2024, Ghana Statistical Service
Access: GSS Microdata Portal
Reference ID: DDI-GHA-GSS-AHIES-2022-2024-v1.1
Total observations: ~595,000 across 12 quarters (2022Q1–2024Q4)

Feature Set

features = [
    'age',               # continuous
    'sex',               # categorical
    'marital_status',    # categorical
    'region',            # categorical — 16 regions (post-2019)
    'relation_to_head',  # categorical
    'urban_rural',       # categorical
    'education_level',   # ordinal categorical
    'disability_status', # binary — 0: no disability, 1: has disability
    'quarter',           # categorical — e.g. '2022Q1', temporal drift signal
]
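
The `cat_features` list passed to CatBoost later in this README can be derived from the feature list. The sketch below assumes that every feature except `age` is treated as categorical and that missing values are replaced by a `'missing'` label; both choices, and the helper name `prepare_features`, are illustrative rather than taken from the repository code.

```python
import pandas as pd

features = [
    "age", "sex", "marital_status", "region", "relation_to_head",
    "urban_rural", "education_level", "disability_status", "quarter",
]

# Assumption: 'age' is the only numeric feature; everything else is categorical.
numeric_features = ["age"]
cat_features = [c for c in features if c not in numeric_features]


def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Select the model features and cast categoricals to strings."""
    X = df[features].copy()
    for col in cat_features:
        # CatBoost rejects NaN and floats in categorical columns, so every
        # value becomes a string and missing entries get an explicit label.
        X[col] = X[col].map(lambda v: "missing" if pd.isna(v) else str(v))
    return X
```

Selecting `df[features]` also keeps `population_weight` and the excluded columns out of the design matrix by construction.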

Confirmed Exclusions

| Column | Reason |
| --- | --- |
| economic_activity_status | Perfect leakage for 5-class model (maps directly to target) |
| population_weight | Used as sample_weight in model.fit(), not a feature |
| sector, industry, occupation | Consequences of employment, not predictors (tautological) |

Temporal Validation Strategy

2022Q1 ──────────── 2023Q2 | 2023Q3 ── 2023Q4 | 2024Q1 ──── 2024Q4
        TRAIN (learn)       |  TEST (early stop) |  HOLDOUT (nowcast)
  • Train: 2022Q1 → 2023Q2 (~306k rows)
  • Validation: 2023Q3 → 2023Q4 (~98k rows), shown as TEST in the diagram above and passed as eval_set; used for the early-stopping signal only
  • Holdout: 2024Q1 → 2024Q4 (~188k rows) — untouched until final evaluation

Random splits are explicitly avoided. Temporal ordering is respected throughout to legitimise the nowcasting claim.
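
A split like the one described above can be sketched with pandas. The boundaries come from the diagram; the function name `temporal_split` and the assumption that the combined dataframe has a `quarter` column of `'YYYYQn'` strings are illustrative.

```python
import pandas as pd

TRAIN_END = "2023Q2"   # last quarter used for training
VALID_END = "2023Q4"   # last quarter used for early stopping


def temporal_split(df: pd.DataFrame, quarter_col: str = "quarter"):
    """Split rows into train / validation / holdout by survey quarter."""
    # 'YYYYQn' strings sort chronologically, so string comparison is enough.
    q = df[quarter_col].astype(str)
    train = df[q <= TRAIN_END]
    valid = df[(q > TRAIN_END) & (q <= VALID_END)]
    holdout = df[q > VALID_END]
    return train, valid, holdout
```

Because the three masks are mutually exclusive and exhaustive, no row can appear in more than one partition, and no 2024 row can reach training.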


Model Configuration

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    depth=8,
    learning_rate=0.0528,
    l2_leaf_reg=3,
    bagging_temperature=0.153,
    iterations=1000,
    class_weights=class_weights,       # computed via sklearn balanced weighting
    cat_features=cat_features,         # string/category columns
    eval_metric='TotalF1',
    loss_function='MultiClass',
    random_seed=42,
)

model.fit(
    X_train, y_train,
    sample_weight=train_weights,       # population_weight — national representativeness
    eval_set=(X_test, y_test),
    early_stopping_rounds=100,
)

Hyperparameters selected via Optuna (30 trials, temporal split preserved across all trials).
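
The `class_weights` argument in the configuration above is described as sklearn balanced weighting. One way to produce it is sketched below; the helper name `balanced_class_weights` is hypothetical, and computing the weights from the training labels only is an assumption made here so that validation and holdout class frequencies do not influence the fit.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight


def balanced_class_weights(y_train) -> dict:
    """Map each class label to n_samples / (n_classes * class_count)."""
    y = np.asarray(y_train)
    classes = np.unique(y)
    weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
    # Plain Python types keep the dict easy to log or serialise.
    return dict(zip(classes.tolist(), weights.tolist()))
```

The resulting dict can be passed as `class_weights`. Rare classes such as Employer and Unemployed receive the largest weights, which is the intended counterweight to the imbalance.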


SHAP Interpretability

Three levels of interpretation are provided:

| Level | Tool | Purpose |
| --- | --- | --- |
| Global | summary_plot (bar) | Which features matter most overall |
| Class-level | summary_plot (dot) | Direction and magnitude per class |
| Local | force_plot / waterfall | Why a specific individual got a specific prediction |

Key SHAP Findings

5-class model:

  • Sector, industry, and occupation dominate — but are excluded from the nowcast model as tautological
  • Among demographic features: age, relation to head, and apprenticeship status are most predictive

3-class LFP model:

  • age is the single most powerful predictor — 3x more important than any other feature
  • Younger individuals are significantly more likely to be Unemployed or Outside the Labour Force
  • relation_to_head and marital_status capture gendered household roles driving NLF classification
  • quarter confirms real temporal drift across waves — model learns year-specific patterns
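
A global ranking such as "age is the most powerful predictor" is typically read off the mean absolute SHAP value per feature. The sketch below shows that aggregation with NumPy only. It assumes the SHAP values have already been computed and are laid out as `(n_samples, n_features, n_classes)`; other layouts would need transposing first, and the function name `global_importance` is hypothetical.

```python
import numpy as np


def global_importance(shap_values: np.ndarray, feature_names: list) -> list:
    """Rank features by mean |SHAP|, summed over classes.

    shap_values: array of shape (n_samples, n_features, n_classes).
    Returns a list of (feature, importance) pairs, largest first.
    """
    mean_abs = np.abs(shap_values).mean(axis=0)   # (n_features, n_classes)
    total = mean_abs.sum(axis=1)                  # (n_features,)
    order = np.argsort(total)[::-1]
    return [(feature_names[i], float(total[i])) for i in order]
```

Dividing the top score by the second one gives the kind of ratio quoted above for age.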

[Figure: SHAP summary plot]

[Figure: SHAP dot plot]

Repo Structure

engine1-catboost/
│
├── data/
│   ├── raw/                        # Original AHIES CSVs (not committed — see .gitignore)
│   │   ├── ahies_2022.csv
│   │   ├── ahies_2023.csv
│   │   └── ahies_2024.csv
│   └── processed/
│       └── ahies_combined.csv      # Concatenated + cleaned dataframe
│
├── notebooks/
│   ├── 01_data_preparation.ipynb   # Loading, merging, feature engineering
│   ├── 02_model_3class.ipynb       # 3-class LFP model — training + evaluation
│   ├── 03_model_5class.ipynb       # 5-class ICLS model — training + evaluation
│   └── 04_shap_analysis.ipynb      # SHAP interpretability — both models
│
├── models/
│   ├── catboost_5class_nowcast.cbm # Saved 5-class model
│   └── catboost_3class_lfp.cbm     # Saved 3-class model
│
├── outputs/
│   ├── figures/                    # Confusion matrices, SHAP plots
│   └── metrics/                    # Classification reports (JSON/CSV)
│
├── src/
│   ├── data_prep.py                # Data loading, cleaning, feature harmonisation
│   ├── train.py                    # Model training pipeline
│   ├── evaluate.py                 # Evaluation metrics and confusion matrix
│   └── shap_analysis.py            # SHAP computation and plotting
│
├── requirements.txt
├── .gitignore
└── README.md

Requirements

catboost>=1.2
shap>=0.44
scikit-learn>=1.3
pandas>=2.0
numpy>=1.24
optuna>=3.4
matplotlib>=3.7
seaborn>=0.12

Install:

pip install -r requirements.txt

Key Methodological Notes

  1. Class imbalance — Own-Account Workers dominate Ghana's labour market (~40% of sample). Balanced class weights and per-class F1 reporting are used throughout. Accuracy is not reported as a primary metric.

  2. Unemployed vs OLF boundary — The hardest classification boundary across both models. This is a behavioural distinction (active job search) not a demographic one. The model's failure here is theoretically expected and explicitly flagged as a limitation.

  3. quarter as categorical — quarter is treated as a string categorical (e.g. '2022Q1'), not a continuous numeric. This allows CatBoost to learn wave-specific patterns without imposing a linear time assumption.

  4. population_weight as sample_weight — Makes model predictions nationally representative, directly strengthening the nowcasting framing. It is never used as a feature.

  5. Tautological features excluded — Sector, industry, and occupation were tested and improved F1 significantly but were excluded from final models because they are consequences of employment status, not predictors of it. Including them would make the model a data consistency checker, not a nowcast.


Findings Summary

"Using only pre-employment socio-demographic characteristics, the model achieves a macro F1 of 0.63 on the 2024 holdout — with zero temporal drift — demonstrating that labour force participation in Ghana is meaningfully predictable from demographic profiles alone. Performance is strongest for structurally determined positions (Not Active: F1=0.86) and weakest for behaviourally defined states (Unemployed: F1=0.24), consistent with the theoretical limits of demographic nowcasting. Age is the single most powerful predictor, followed by household position and marital status — reflecting the lifecycle and gendered household dynamics that shape Ghana's labour market."


Context

This is Engine 1 of the DataverseGH-GLMIS ecosystem:

| Engine | Description | Status |
| --- | --- | --- |
| Engine 1 | Micro-Nowcasting (CatBoost) | ✅ Complete |
| Engine 2 | NLP Demand Engine (NER Pipeline) | 🔄 In Progress |

Data source: Ghana Statistical Service. Research conducted for academic purposes.
