DataverseGH-GLMIS — Engine 1: Micro-Nowcasting (CatBoost)

Part of the DataverseGH-GLMIS ecosystem — a modular Labour Market Information System for Ghana.
Engine 1 is the demographic nowcasting foundation. Engine 2 (NLP Demand Engine) is being built in parallel.

Overview

This module answers the question:

"Can we predict the employment status of a Ghanaian individual using only socio-demographic characteristics observable in administrative and census data?"

Two models are trained on AHIES 2022–2024 (Annual Household Income and Expenditure Survey, Ghana Statistical Service) using a temporal holdout strategy that treats 2024 as a genuine forward prediction — validating the micro-nowcasting framing.

Models

Model 1 — 5-Class ICLS Employment Status

Detail	Value
Target	`emp_class` — Employee, Employer, Own-Account Worker, Unemployed, Outside Labour Force
Macro F1 (Validation)	0.46
Macro F1 (2024 Holdout)	0.46
Nowcast Drift	−0.03 ✅
Key finding	Structural positions (NLF, OAW) are demographically predictable; behavioural states (Unemployed) are not

Per-class results:

Class	F1	Interpretation
Not in Labour Force	0.89	Strongly demographic — age, sex, household role
Own Account Worker	0.81	Regional and education patterns captured well
Employee	0.77	Learnable but noisy boundary with OAW
Unemployed	0.27	Behavioural, not demographic — expected ceiling
Employer	0.08	Too rare (1% of sample), similar profile to Employee

Model 2 — 3-Class Labour Force Participation

Detail	Value
Target	`economic_activity_status` — Employed, Unemployed, Not Active (OLF)
Macro F1 (Validation)	0.63
Macro F1 (2024 Holdout)	0.63
Nowcast Drift	0.00 ✅
Key finding	Age and household position are the primary demographic drivers of labour force participation in Ghana

Per-class results:

Class	F1
Not Active	0.86
Employed	0.79
Unemployed	0.24
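The headline figures for the 3-class model can be cross-checked against the per-class table. The sketch below assumes macro F1 is the unweighted mean of per-class F1 scores and that "Nowcast Drift" means holdout score minus validation score; the README does not spell out the drift formula, so that definition and both function names are assumptions.

```python
def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    scores = list(per_class_f1.values())
    return sum(scores) / len(scores)


def nowcast_drift(validation_f1, holdout_f1):
    """Assumed definition: holdout score minus validation score."""
    return holdout_f1 - validation_f1


# Per-class F1 values copied from the 3-class table above.
lfp_per_class = {"Not Active": 0.86, "Employed": 0.79, "Unemployed": 0.24}

lfp_macro = macro_f1(lfp_per_class)
lfp_drift = nowcast_drift(validation_f1=0.63, holdout_f1=0.63)

print(round(lfp_macro, 2))
print(round(lfp_drift, 2))
```

Because macro averaging gives every class equal weight, the weak Unemployed score pulls the headline figure down regardless of how small that class is.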

Data

Source: AHIES 2022–2024, Ghana Statistical Service
Access: GSS Microdata Portal
Reference ID: DDI-GHA-GSS-AHIES-2022-2024-v1.1
Total observations: ~595,000 across 12 quarters (2022Q1–2024Q4)

Feature Set

features = [
    'age',               # continuous
    'sex',               # categorical
    'marital_status',    # categorical
    'region',            # categorical — 16 regions (post-2019)
    'relation_to_head',  # categorical
    'urban_rural',       # categorical
    'education_level',   # ordinal categorical
    'disability_status', # binary — 0: no disability, 1: has disability
    'quarter',           # categorical — e.g. '2022Q1', temporal drift signal
]

Confirmed Exclusions

Column	Reason
`economic_activity_status`	Perfect leakage for 5-class model (maps directly to target)
`population_weight`	Used as `sample_weight` in `model.fit()`, not a feature
`sector`, `industry`, `occupation`	Consequences of employment, not predictors — tautological
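A small guard can make these exclusions hard to violate by accident. The following is a minimal sketch, not the project's actual code: `check_no_leakage` is a hypothetical helper, and the column names are taken from the feature list and exclusion table above.

```python
FEATURES = [
    "age", "sex", "marital_status", "region", "relation_to_head",
    "urban_rural", "education_level", "disability_status", "quarter",
]

# Columns that must never appear in the feature matrix.
EXCLUDED = {
    "economic_activity_status",  # leaks the 5-class target
    "population_weight",         # used as sample weight only
    "sector", "industry", "occupation",  # consequences of employment
}


def check_no_leakage(features, excluded=EXCLUDED):
    """Raise if any excluded column has slipped into the feature list."""
    leaked = sorted(set(features) & set(excluded))
    if leaked:
        raise ValueError(f"Excluded columns used as features: {leaked}")
    return True


check_no_leakage(FEATURES)
```

Running a check like this before training would fail fast if a later refactor reintroduced one of the excluded columns.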

Temporal Validation Strategy

2022Q1 ──────────── 2023Q2 | 2023Q3 ── 2023Q4 | 2024Q1 ──── 2024Q4
        TRAIN (learn)       |  TEST (early stop) |  HOLDOUT (nowcast)

Train: 2022Q1 → 2023Q2 (~306k rows)
Validation: 2023Q3 → 2023Q4 (~98k rows) — used for early stopping signal only
Holdout: 2024Q1 → 2024Q4 (~188k rows) — untouched until final evaluation

Random splits are explicitly avoided. Temporal ordering is respected throughout to legitimise the nowcasting claim.
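The split boundaries above can be expressed directly on the `quarter` labels. This is a minimal sketch, assuming quarters are stored as fixed-width strings such as '2022Q1' (so plain string comparison matches chronological order); `assign_split` and the constant names are hypothetical.

```python
TRAIN_END = "2023Q2"  # last quarter in the training window
VALID_END = "2023Q4"  # last quarter in the validation window


def assign_split(quarter):
    """Map a 'YYYYQn' label to train, validation, or holdout."""
    if quarter <= TRAIN_END:
        return "train"
    if quarter <= VALID_END:
        return "validation"
    return "holdout"


# All 12 quarters covered by the data.
quarters = [f"{year}Q{q}" for year in (2022, 2023, 2024) for q in (1, 2, 3, 4)]
splits = {quarter: assign_split(quarter) for quarter in quarters}

for quarter, split in splits.items():
    print(quarter, split)
```

Applied row by row to the `quarter` column, the same function yields the three partitions without any random shuffling.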

Model Configuration

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    depth=8,
    learning_rate=0.0528,
    l2_leaf_reg=3,
    bagging_temperature=0.153,
    iterations=1000,
    class_weights=class_weights,       # computed via sklearn balanced weighting
    cat_features=cat_features,         # string/category columns
    eval_metric='TotalF1',
    loss_function='MultiClass',
    random_seed=42,
)

model.fit(
    X_train, y_train,
    sample_weight=train_weights,       # population_weight — national representativeness
    eval_set=(X_test, y_test),         # validation window (2023Q3 to 2023Q4), early stopping only
    early_stopping_rounds=100,
)

Hyperparameters were selected via Optuna (30 trials, temporal split preserved across all trials).
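The configuration comment says `class_weights` comes from sklearn's balanced weighting. That heuristic assigns each class the weight n_samples / (n_classes × class_count). The sketch below reimplements the formula in plain Python for illustration; `balanced_class_weights` is a hypothetical helper and the toy labels are invented, not AHIES data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Invented toy labels with a deliberately rare class.
toy_labels = ["Employed"] * 6 + ["Not Active"] * 3 + ["Unemployed"] * 1
weights = balanced_class_weights(toy_labels)

for cls, weight in weights.items():
    print(cls, round(weight, 3))
```

Rare classes receive proportionally larger weights, which is what keeps a minority class such as Unemployed from being ignored by the loss.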

SHAP Interpretability

Three levels of interpretation are provided:

Level	Tool	Purpose
Global	`summary_plot (bar)`	Which features matter most overall
Class-level	`summary_plot (dot)`	Direction and magnitude per class
Local	`force_plot` / `waterfall`	Why a specific individual got a specific prediction
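The global bar view ranks features by their mean absolute SHAP value across individuals. The sketch below shows that aggregation on its own, assuming the SHAP values for one class are already available as a rows-by-features matrix; `global_importance` is a hypothetical helper and the numbers are invented for illustration, not model output.

```python
def global_importance(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    n_rows = len(shap_rows)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_rows) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda item: item[1], reverse=True)


# Invented SHAP values: one row per individual, one column per feature.
toy_features = ["age", "sex", "region"]
toy_shap = [
    [0.9, -0.1, 0.2],
    [-0.6, 0.2, -0.1],
    [0.3, -0.3, 0.0],
]

ranking = global_importance(toy_shap, toy_features)
for name, score in ranking:
    print(name, round(score, 2))
```

Taking absolute values first matters: a feature that pushes predictions strongly in both directions would otherwise average out to near zero.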

Key SHAP Findings

5-class model:

Sector, industry, and occupation dominate — but are excluded from the nowcast model as tautological
Among demographic features: age, relation to head, and apprenticeship status are most predictive

3-class LFP model:

age is the single most powerful predictor — 3x more important than any other feature
Younger individuals are significantly more likely to be Unemployed or Outside the Labour Force
relation_to_head and marital_status capture gendered household roles driving NLF classification
quarter confirms real temporal drift across waves — model learns year-specific patterns

Repo Structure

engine1-catboost/
│
├── data/
│   ├── raw/                        # Original AHIES CSVs (not committed — see .gitignore)
│   │   ├── ahies_2022.csv
│   │   ├── ahies_2023.csv
│   │   └── ahies_2024.csv
│   └── processed/
│       └── ahies_combined.csv      # Concatenated + cleaned dataframe
│
├── notebooks/
│   ├── 01_data_preparation.ipynb   # Loading, merging, feature engineering
│   ├── 02_model_3class.ipynb       # 3-class LFP model: training + evaluation
│   ├── 03_model_5class.ipynb       # 5-class ICLS model: training + evaluation
│   └── 04_shap_analysis.ipynb      # SHAP interpretability — both models
│
├── models/
│   ├── catboost_5class_nowcast.cbm # Saved 5-class model
│   └── catboost_3class_lfp.cbm     # Saved 3-class model
│
├── outputs/
│   ├── figures/                    # Confusion matrices, SHAP plots
│   └── metrics/                    # Classification reports (JSON/CSV)
│
├── src/
│   ├── data_prep.py                # Data loading, cleaning, feature harmonisation
│   ├── train.py                    # Model training pipeline
│   ├── evaluate.py                 # Evaluation metrics and confusion matrix
│   └── shap_analysis.py            # SHAP computation and plotting
│
├── requirements.txt
├── .gitignore
└── README.md

Requirements

catboost>=1.2
shap>=0.44
scikit-learn>=1.3
pandas>=2.0
numpy>=1.24
optuna>=3.4
matplotlib>=3.7
seaborn>=0.12

Install:

pip install -r requirements.txt

Key Methodological Notes

Class imbalance — Own-Account Workers dominate Ghana's labour market (~40% of sample). Balanced class weights and per-class F1 reporting are used throughout. Accuracy is not reported as a primary metric.
Unemployed vs OLF boundary — The hardest classification boundary across both models. This is a behavioural distinction (active job search) not a demographic one. The model's failure here is theoretically expected and explicitly flagged as a limitation.
quarter as categorical: quarter is treated as a string categorical (e.g. '2022Q1'), not a continuous numeric. This allows CatBoost to learn wave-specific patterns without imposing a linear time assumption.
population_weight as sample_weight — Makes model predictions nationally representative, directly strengthening the nowcasting framing. It is never used as a feature.
Tautological features excluded — Sector, industry, and occupation were tested and improved F1 significantly but were excluded from final models because they are consequences of employment status, not predictors of it. Including them would make the model a data consistency checker, not a nowcast.
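The note on `quarter` can be made concrete by contrasting the two encodings. This is a minimal sketch, assuming the raw data exposes a survey year and a quarter number; both helper names are hypothetical, and the numeric index is shown only as the alternative the README avoids.

```python
def quarter_label(year, quarter):
    """Build a string label such as '2022Q1' for use as a categorical."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"quarter must be 1-4, got {quarter}")
    return f"{year}Q{quarter}"


def linear_index(year, quarter, base_year=2022):
    """Numeric alternative: treats waves as evenly spaced, ordered steps."""
    return (year - base_year) * 4 + (quarter - 1)


labels = [quarter_label(y, q) for y in (2022, 2023, 2024) for q in (1, 2, 3, 4)]
indices = [linear_index(y, q) for y in (2022, 2023, 2024) for q in (1, 2, 3, 4)]

print(labels)
print(indices)
```

With the string form, each wave is simply a distinct category; the numeric form would instead ask the model to treat time as a single ordered axis.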

Findings Summary

"Using only pre-employment socio-demographic characteristics, the model achieves a macro F1 of 0.63 on the 2024 holdout — with zero temporal drift — demonstrating that labour force participation in Ghana is meaningfully predictable from demographic profiles alone. Performance is strongest for structurally determined positions (Not Active: F1=0.86) and weakest for behaviourally defined states (Unemployed: F1=0.24), consistent with the theoretical limits of demographic nowcasting. Age is the single most powerful predictor, followed by household position and marital status — reflecting the lifecycle and gendered household dynamics that shape Ghana's labour market."

Context

This is Engine 1 of the DataverseGH-GLMIS ecosystem:

Engine	Description	Status
Engine 1	Micro-Nowcasting (CatBoost)	✅ Complete
Engine 2	NLP Demand Engine (NER Pipeline)	🔄 In Progress

Data source: Ghana Statistical Service. Research conducted for academic purposes.
