Part of the DataverseGH-GLMIS ecosystem — a modular Labour Market Information System for Ghana.
Engine 1 is the demographic nowcasting foundation. Engine 2 (NLP Demand Engine) is built in parallel.
This module answers the question:
"Can we predict the employment status of a Ghanaian individual using only socio-demographic characteristics observable in administrative and census data?"
Two models are trained on AHIES 2022–2024 (Annual Household Income and Expenditure Survey, Ghana Statistical Service) using a temporal holdout strategy that treats 2024 as a genuine forward prediction — validating the micro-nowcasting framing.

5-class model:

| Detail | Value |
|---|---|
| Target | emp_class — Employee, Employer, Own-Account Worker, Unemployed, Outside Labour Force |
| Macro F1 (Validation) | 0.46 |
| Macro F1 (2024 Holdout) | 0.46 |
| Nowcast Drift | −0.03 ✅ |
| Key finding | Structural positions (NLF, OAW) are demographically predictable; behavioural states (Unemployed) are not |
Per-class results:
| Class | F1 | Interpretation |
|---|---|---|
| Not in Labour Force | 0.89 | Strongly demographic — age, sex, household role |
| Own Account Worker | 0.81 | Regional and education patterns captured well |
| Employee | 0.77 | Learnable but noisy boundary with OAW |
| Unemployed | 0.27 | Behavioural, not demographic — expected ceiling |
| Employer | 0.08 | Too rare (1% of sample), similar profile to Employee |

3-class LFP model:

| Detail | Value |
|---|---|
| Target | economic_activity_status — Employed, Unemployed, Not Active (OLF) |
| Macro F1 (Validation) | 0.63 |
| Macro F1 (2024 Holdout) | 0.63 |
| Nowcast Drift | 0.00 ✅ |
| Key finding | Age and household position are the primary demographic drivers of labour force participation in Ghana |
Per-class results:
| Class | F1 |
|---|---|
| Not Active | 0.86 |
| Employed | 0.79 |
| Unemployed | 0.24 |
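
The macro F1, per-class F1 and Nowcast Drift figures above could be reproduced along the following lines. This is a minimal sketch with toy labels, not the project's evaluation code: the function names are hypothetical, and it assumes drift is defined as holdout macro F1 minus validation macro F1, which the tables imply but do not state.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def nowcast_drift(f1_validation, f1_holdout):
    """Assumed definition: holdout macro F1 minus validation macro F1."""
    return f1_holdout - f1_validation


# Toy labels for illustration only (not AHIES data).
y_true = ["Employed", "Employed", "Not Active", "Unemployed"]
y_pred = ["Employed", "Not Active", "Not Active", "Employed"]
toy_scores = per_class_f1(y_true, y_pred)
toy_macro = macro_f1(y_true, y_pred)
```

`sklearn.metrics.f1_score(y_true, y_pred, average="macro")` computes the same quantity. The pure-Python version is shown only to make the averaging explicit: a class with F1 near zero (such as Unemployed) pulls the macro score down regardless of how rare it is.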
Source: AHIES 2022–2024, Ghana Statistical Service
Access: GSS Microdata Portal
Reference ID: DDI-GHA-GSS-AHIES-2022-2024-v1.1
Total observations: ~595,000 across 12 quarters (2022Q1–2024Q4)
```python
features = [
    'age',                # continuous
    'sex',                # categorical
    'marital_status',     # categorical
    'region',             # categorical, 16 regions (post-2019)
    'relation_to_head',   # categorical
    'urban_rural',        # categorical
    'education_level',    # ordinal categorical
    'disability_status',  # binary: 0 = no disability, 1 = has disability
    'quarter',            # categorical, e.g. '2022Q1'; temporal drift signal
]
```

| Column | Reason |
|---|---|
| `economic_activity_status` | Perfect leakage for 5-class model (maps directly to target) |
| `population_weight` | Used as `sample_weight` in `model.fit()`, not a feature |
| `sector`, `industry`, `occupation` | Consequences of employment, not predictors (tautological) |
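
One way to keep the excluded columns out of a training run is a small guard that fails fast when a feature list contains any of them. The sketch below is illustrative only: `check_feature_list` and the two constants are hypothetical names, not part of the repository.

```python
# Demographic features listed above.
DEMOGRAPHIC_FEATURES = [
    "age", "sex", "marital_status", "region", "relation_to_head",
    "urban_rural", "education_level", "disability_status", "quarter",
]

# Columns from the exclusion table: leakage, weights, or tautological.
EXCLUDED_COLUMNS = {
    "economic_activity_status",
    "population_weight",
    "sector",
    "industry",
    "occupation",
}


def check_feature_list(features, target):
    """Return the feature list unchanged, or raise if it would leak."""
    leaked = sorted(set(features) & EXCLUDED_COLUMNS)
    if leaked:
        raise ValueError(f"Excluded columns used as features: {leaked}")
    if target in features:
        raise ValueError(f"Target column {target!r} used as a feature")
    return list(features)


# Passes: only demographic features, target kept separate.
safe_features = check_feature_list(DEMOGRAPHIC_FEATURES, target="emp_class")
```

Calling the guard with `DEMOGRAPHIC_FEATURES + ["sector"]` raises a `ValueError`, which is the intended behaviour for a nowcast model.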
```
2022Q1 ──────────── 2023Q2 | 2023Q3 ── 2023Q4 | 2024Q1 ──── 2024Q4
       TRAIN (learn)       | TEST (early stop)| HOLDOUT (nowcast)
```
- Train: 2022Q1 → 2023Q2 (~306k rows)
- Validation: 2023Q3 → 2023Q4 (~98k rows) — used for early stopping signal only
- Holdout: 2024Q1 → 2024Q4 (~188k rows) — untouched until final evaluation
Random splits are explicitly avoided. Temporal ordering is respected throughout to legitimise the nowcasting claim.
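
The split boundaries above can be expressed as a simple rule on the quarter label. This is a sketch under the assumption that every row carries a `quarter` string in the `'YYYYQn'` format, which sorts correctly as plain text. `assign_split` and the two constants are hypothetical names.

```python
TRAIN_END = "2023Q2"       # last quarter used for training
VALIDATION_END = "2023Q4"  # last quarter used for early stopping


def assign_split(quarter):
    """Map a 'YYYYQn' label to 'train', 'validation' or 'holdout'.

    Labels in this format compare correctly as strings, so no date
    parsing is needed.
    """
    if quarter <= TRAIN_END:
        return "train"
    if quarter <= VALIDATION_END:
        return "validation"
    return "holdout"


quarters = ["2022Q1", "2023Q2", "2023Q3", "2023Q4", "2024Q1", "2024Q4"]
splits = {q: assign_split(q) for q in quarters}
```

Because the rule depends only on time, no 2024 row can enter training or early stopping, which is what separates this setup from a random split.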
```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    depth=8,
    learning_rate=0.0528,
    l2_leaf_reg=3,
    bagging_temperature=0.153,
    iterations=1000,
    class_weights=class_weights,  # computed via sklearn balanced weighting
    cat_features=cat_features,    # string/category columns
    eval_metric='TotalF1',
    loss_function='MultiClass',
    random_seed=42,
)

model.fit(
    X_train, y_train,
    sample_weight=train_weights,  # population_weight, for national representativeness
    eval_set=(X_test, y_test),
    early_stopping_rounds=100,
)
```

Hyperparameters were selected via Optuna (30 trials, temporal split preserved across all trials).
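
The `class_weights` argument is described above as sklearn balanced weighting. A minimal sketch of that formula, `n_samples / (n_classes * class_count)`, is shown below in plain Python. The helper name and toy labels are hypothetical, and the project itself may call `sklearn.utils.class_weight.compute_class_weight` instead.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    This is the 'balanced' heuristic: rare classes get large weights,
    common classes get weights below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels for illustration only: three Employees, one Employer.
toy_labels = ["Employee", "Employee", "Employee", "Employer"]
toy_weights = balanced_class_weights(toy_labels)
```

On the toy labels the rare class receives three times the weight of the common one, so each class contributes equally to the loss in aggregate.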
Three levels of interpretation are provided:

| Level | Tool | Purpose |
|---|---|---|
| Global | `summary_plot` (bar) | Which features matter most overall |
| Class-level | `summary_plot` (dot) | Direction and magnitude per class |
| Local | `force_plot` / waterfall | Why a specific individual got a specific prediction |
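
The global ranking in a bar-style summary plot is the mean absolute SHAP value per feature. The sketch below shows that aggregation, assuming SHAP values for one class have already been computed as a rows-by-features matrix. The numbers are made up for illustration and `mean_abs_shap` is a hypothetical helper.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean |SHAP value| across rows, largest first."""
    n_rows = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)  # sign is dropped: only magnitude matters here
    importance = {name: total / n_rows for name, total in zip(feature_names, totals)}
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Made-up SHAP values for two individuals and two features.
toy_shap = [
    [0.6, -0.1],
    [-0.4, 0.2],
]
ranking = mean_abs_shap(toy_shap, ["age", "sex"])
```

The dot-style plot keeps the sign that this aggregation discards, which is why it is the one used for direction of effect per class.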
5-class model:
- Sector, industry, and occupation dominate — but are excluded from the nowcast model as tautological
- Among demographic features: age, relation to head, and apprenticeship status are most predictive
3-class LFP model:
- `age` is the single most powerful predictor, 3x more important than any other feature
- Younger individuals are significantly more likely to be Unemployed or Outside the Labour Force
- `relation_to_head` and `marital_status` capture gendered household roles driving NLF classification
- `quarter` confirms real temporal drift across waves; the model learns year-specific patterns
```
engine1-catboost/
│
├── data/
│   ├── raw/                          # Original AHIES CSVs (not committed; see .gitignore)
│   │   ├── ahies_2022.csv
│   │   ├── ahies_2023.csv
│   │   └── ahies_2024.csv
│   └── processed/
│       └── ahies_combined.csv        # Concatenated + cleaned dataframe
│
├── notebooks/
│   ├── 01_data_preparation.ipynb     # Loading, merging, feature engineering
│   ├── 02_model_3class.ipynb         # 3-class LFP model: training + evaluation
│   ├── 03_model_5class.ipynb         # 5-class ICLS model: training + evaluation
│   └── 04_shap_analysis.ipynb        # SHAP interpretability for both models
│
├── models/
│   ├── catboost_5class_nowcast.cbm   # Saved 5-class model
│   └── catboost_3class_lfp.cbm       # Saved 3-class model
│
├── outputs/
│   ├── figures/                      # Confusion matrices, SHAP plots
│   └── metrics/                      # Classification reports (JSON/CSV)
│
├── src/
│   ├── data_prep.py                  # Data loading, cleaning, feature harmonisation
│   ├── train.py                      # Model training pipeline
│   ├── evaluate.py                   # Evaluation metrics and confusion matrix
│   └── shap_analysis.py              # SHAP computation and plotting
│
├── requirements.txt
├── .gitignore
└── README.md
```
```
catboost>=1.2
shap>=0.44
scikit-learn>=1.3
pandas>=2.0
numpy>=1.24
optuna>=3.4
matplotlib>=3.7
seaborn>=0.12
```

Install:

```bash
pip install -r requirements.txt
```

- Class imbalance: Own-Account Workers dominate Ghana's labour market (~40% of sample). Balanced class weights and per-class F1 reporting are used throughout. Accuracy is not reported as a primary metric.
- Unemployed vs OLF boundary: The hardest classification boundary across both models. This is a behavioural distinction (active job search), not a demographic one. The model's failure here is theoretically expected and explicitly flagged as a limitation.
- `quarter` as categorical: `quarter` is treated as a string categorical (e.g. `'2022Q1'`), not a continuous numeric. This allows CatBoost to learn wave-specific patterns without imposing a linear time assumption.
- `population_weight` as `sample_weight`: Makes model predictions nationally representative, directly strengthening the nowcasting framing. It is never used as a feature.
- Tautological features excluded: Sector, industry, and occupation were tested and improved F1 significantly, but were excluded from the final models because they are consequences of employment status, not predictors of it. Including them would make the model a data consistency checker, not a nowcast.
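
The string-categorical `quarter` label described in the list above could be derived as follows. This sketch assumes an interview or reference date is available for each record, which the source does not state. `quarter_label` is a hypothetical helper.

```python
from datetime import date


def quarter_label(d):
    """Format a date as a 'YYYYQn' string, e.g. date(2022, 1, 15) -> '2022Q1'."""
    quarter = (d.month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, and so on
    return f"{d.year}Q{quarter}"


first_wave = quarter_label(date(2022, 1, 15))
last_wave = quarter_label(date(2024, 12, 31))
```

Keeping the label as a string means CatBoost treats each wave as its own category, rather than reading later quarters as "larger" values on a numeric scale.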
"Using only pre-employment socio-demographic characteristics, the model achieves a macro F1 of 0.63 on the 2024 holdout — with zero temporal drift — demonstrating that labour force participation in Ghana is meaningfully predictable from demographic profiles alone. Performance is strongest for structurally determined positions (Not Active: F1=0.86) and weakest for behaviourally defined states (Unemployed: F1=0.24), consistent with the theoretical limits of demographic nowcasting. Age is the single most powerful predictor, followed by household position and marital status — reflecting the lifecycle and gendered household dynamics that shape Ghana's labour market."
This is Engine 1 of the DataverseGH-GLMIS ecosystem:
| Engine | Description | Status |
|---|---|---|
| Engine 1 | Micro-Nowcasting (CatBoost) | ✅ Complete |
| Engine 2 | NLP Demand Engine (NER Pipeline) | 🔄 In Progress |
Data source: Ghana Statistical Service. Research conducted for academic purposes.



