# create_good_model.ipynb — Purposefully **Good** Model (Fairness-Aware Feature Use)

This notebook trains a model that is “good” in the assignment’s sense:
- It aims to **reduce undesirable biased patterns** by excluding sensitive/proxy/leakage features.
- It still reports classical ML metrics.

**Critical interface requirement**:
The independent tester will provide inputs with the original full feature set.
So we train/export a **pipeline** that:
1) accepts **all original features**,
2) internally drops the disallowed columns,
3) runs the classifier on the remaining columns.

It then:
- Exports the model to ONNX,
- Randomly assigns it to `model_1.onnx` or `model_2.onnx` using the shared `model_assignment.json`,
- Reports performance:
  1) classical metrics (Accuracy, ROC-AUC, PR-AUC)
  2) part-2 tests (gender partition + gender-flip metamorphic), executed via ONNX runtime.
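
The two part-2 tests named above can be sketched as follows. This is a minimal illustration, not the notebook's actual test code: `gender_flip_rate`, `partition_positive_rates`, and `toy_predict` are hypothetical names, and the sketch assumes the gender column is a binary 0/1 indicator. In the notebook's setting, `predict` would presumably wrap an ONNX Runtime session instead of the toy function used here.

```python
import numpy as np

def gender_flip_rate(predict, X, gender_idx):
    """Metamorphic test: share of rows whose predicted label changes
    when only the binary gender column is flipped (0 <-> 1)."""
    X = np.asarray(X, dtype=np.float32)
    X_flipped = X.copy()
    X_flipped[:, gender_idx] = 1.0 - X_flipped[:, gender_idx]
    return float(np.mean(predict(X) != predict(X_flipped)))

def partition_positive_rates(predict, X, gender_idx):
    """Partition test: predicted positive rate per gender value."""
    X = np.asarray(X, dtype=np.float32)
    labels = predict(X)
    rates = {}
    for g in (0, 1):
        mask = X[:, gender_idx] == g
        if mask.any():  # skip empty partitions
            rates[g] = float(labels[mask].mean())
    return rates

# Toy stand-in for the exported model (assumption): it ignores column 0
# (gender) and thresholds column 1, so flipping gender cannot change it.
def toy_predict(X):
    return (X[:, 1] >= 0.5).astype(int)

X_demo = np.array([[0, 0.2], [1, 0.9], [0, 0.7], [1, 0.1]], dtype=np.float32)
flip_rate = gender_flip_rate(toy_predict, X_demo, gender_idx=0)
rates = partition_positive_rates(toy_predict, X_demo, gender_idx=0)
print("flip rate:", flip_rate, "positive rate per gender:", rates)
```

A model that drops the gender column internally should give a flip rate of zero. The partition rates can still differ if the remaining features correlate with gender.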


In [15]:
import os, json, random, shutil
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score
from sklearn.compose import ColumnTransformer

rstate = 1
target = "checked"



data_path = "data/investigation_train_large_checked.csv"


df = pd.read_csv(data_path)
print("Loaded:", data_path, "shape:", df.shape)

y = df[target].astype(int).values

drop_cols = [target, "Ja", "Nee"]

X = df.drop(columns=drop_cols)

print("Features:", X.shape[1], "Positive rate:", y.mean().round(4))

Loaded: data/investigation_train_large_checked.csv shape: (130000, 318)
Features: 315 Positive rate: 0.15


In [16]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=rstate, stratify=y
)
X_train.shape, X_test.shape

((104000, 315), (26000, 315))

## Model choice (Good model)

We use **Logistic Regression** because:
- It is a strong, stable baseline and easier to justify in policy-sensitive contexts.
- With a fairness-aware feature set, it reduces the risk of exploiting complex proxy interactions.
- It converts cleanly to ONNX.

The “goodness” comes primarily from the **feature policy** below.


In [17]:
# Build the set of columns to drop (fairness-aware policy)
cols = list(X_train.columns)
drop = set()


# Gender (directly sensitive attribute)
if "persoon_geslacht_vrouw" in cols:
    drop.add("persoon_geslacht_vrouw")

# Language / integration features (indirect proxies for ethnicity)
for col in cols:
    if col.startswith("persoonlijke_eigenschappen_spreektaal"):
        drop.add(col)
    if col.startswith("persoonlijke_eigenschappen_taaleis"):
        drop.add(col)
    if "inburger" in col:  # also matches "inburgering"
        drop.add(col)
    if col == "belemmering_hist_taal":
        drop.add(col)
    if col.startswith("contacten_onderwerp_") and "taal" in col:  # also matches "taaleis"
        drop.add(col)

# Spatial proxies (district / neighbourhood of the most recent address)
for col in cols:
    if col.startswith("adres_recentste_wijk_") or col.startswith("adres_recentste_buurt_"):
        drop.add(col)

drop = sorted(drop)
keep_cols = [col for col in cols if col not in drop]

print("Dropping:", len(drop))
print("Keeping :", len(keep_cols))
drop[:20]

Dropping: 25
Keeping : 290


['adres_recentste_buurt_groot_ijsselmonde',
 'adres_recentste_buurt_nieuwe_westen',
 'adres_recentste_buurt_other',
 'adres_recentste_buurt_oude_noorden',
 'adres_recentste_buurt_vreewijk',
 'adres_recentste_wijk_charlois',
 'adres_recentste_wijk_delfshaven',
 'adres_recentste_wijk_feijenoord',
 'adres_recentste_wijk_ijsselmonde',
 'adres_recentste_wijk_kralingen_c',
 'adres_recentste_wijk_noord',
 'adres_recentste_wijk_other',
 'adres_recentste_wijk_prins_alexa',
 'adres_recentste_wijk_stadscentru',
 'belemmering_hist_taal',
 'contacten_onderwerp_beoordelen_taaleis',
 'contacten_onderwerp_boolean_beoordelen_taaleis',
 'contacten_onderwerp_boolean_taaleis___voldoet',
 'persoon_geslacht_vrouw',
 'persoonlijke_eigenschappen_spreektaal']
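
The drop rules from the cell above can also be written as a single predicate, which makes the policy easy to unit-test on individual column names. This is a sketch that mirrors those rules. `is_disallowed` is a hypothetical helper, and `hypothetical_neutral_feature` is a made-up column name used only for illustration.

```python
def is_disallowed(col: str) -> bool:
    """Return True if the fairness-aware policy would drop this column."""
    language_prefixes = (
        "persoonlijke_eigenschappen_spreektaal",
        "persoonlijke_eigenschappen_taaleis",
    )
    spatial_prefixes = ("adres_recentste_wijk_", "adres_recentste_buurt_")
    return (
        col == "persoon_geslacht_vrouw"            # gender
        or col == "belemmering_hist_taal"          # language barrier history
        or col.startswith(language_prefixes)       # spoken language / language requirement
        or "inburger" in col                       # civic integration
        or (col.startswith("contacten_onderwerp_") and "taal" in col)
        or col.startswith(spatial_prefixes)        # district / neighbourhood
    )

examples = [
    "persoon_geslacht_vrouw",
    "adres_recentste_wijk_noord",
    "contacten_onderwerp_beoordelen_taaleis",
    "hypothetical_neutral_feature",  # made-up name, not from the dataset
]
flags = {c: is_disallowed(c) for c in examples}
for name, dropped in flags.items():
    print(f"{name}: {'drop' if dropped else 'keep'}")
```

With such a predicate, `keep_cols` could be derived as `[c for c in cols if not is_disallowed(c)]`, which should select the same columns as the loops above.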

In [18]:
# Pipeline that keeps the full input interface but drops columns internally (ONNX-friendly)
# ColumnTransformer, Pipeline and LogisticRegression are already imported in the first cell
from sklearn.model_selection import GridSearchCV

# ColumnTransformer works with column indices; derive indices of keep_cols in the original column order
col_index = {c: i for i, c in enumerate(X_train.columns)}
keep_idx = [col_index[c] for c in keep_cols]

preprocess = ColumnTransformer(
    transformers=[("keep", "passthrough", keep_idx)],
    remainder="drop",
    verbose_feature_names_out=False,
)

clf = LogisticRegression(max_iter=2000, n_jobs=-1)

good_pipe = Pipeline([
    ("preprocess", preprocess),
    ("clf", clf),
])


# Light tuning grid (kept small so it runs quickly)
param_grid = {
    "clf__C": [0.5, 1.0, 2],
    "clf__class_weight": [None],
}

search = GridSearchCV(
    estimator=good_pipe,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,
    verbose=1
)

search.fit(X_train, y_train)

good_model = search.best_estimator_
print("Best params:", search.best_params_)
print("Best CV ROC-AUC:", search.best_score_)

p = good_model.predict_proba(X_test)[:, 1]
yhat = (p >= 0.5).astype(int)

acc = accuracy_score(y_test, yhat)
roc = roc_auc_score(y_test, p)
prauc = average_precision_score(y_test, p)

print(f"GOOD model — Accuracy: {acc:.4f}")
print(f"GOOD model — ROC-AUC:  {roc:.4f}")
print(f"GOOD model — PR-AUC:   {prauc:.4f}")

# Metrics noted from a different run (they do not match the output printed below):
#   Accuracy: 0.8628
#   ROC-AUC:  0.8069
#   PR-AUC:   0.4646

Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best params: {'clf__C': 1.0, 'clf__class_weight': None}
Best CV ROC-AUC: 0.8577820792773186
GOOD model — Accuracy: 0.8769
GOOD model — ROC-AUC:  0.8377
GOOD model — PR-AUC:   0.5743
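
Because the positive rate is about 0.15, a classifier that always predicts the negative class would already reach roughly 0.85 accuracy. The reported accuracy is therefore best read against that baseline, and ROC-AUC and PR-AUC are the more informative numbers. A minimal sketch of the baseline calculation, using synthetic labels with the same 15% positive rate (`majority_baseline_accuracy` and `y_demo` are illustrative names, not part of the notebook):

```python
from collections import Counter

def majority_baseline_accuracy(y):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(y)
    return max(counts.values()) / len(y)

# Synthetic labels (assumption): 15 positives out of 100, mirroring the 0.15 rate
y_demo = [1] * 15 + [0] * 85
baseline = majority_baseline_accuracy(y_demo)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

In the notebook the same function could be applied to `list(y_test)` to get the exact baseline for the held-out split.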


In [19]:
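# Export the whole pipeline (column selection + classifier) to ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

onnx_tmp = "model/good_model_tmp.onnx"
os.makedirs(os.path.dirname(onnx_tmp), exist_ok=True)  # make sure the output folder exists

# One float tensor spanning all original features, so callers can pass the full feature set
initial_type = [("float_input", FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(good_model, initial_types=initial_type)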

with open(onnx_tmp, "wb") as f:
    f.write(onnx_model.SerializeToString())

print("Saved temporary ONNX:", onnx_tmp)

Saved temporary ONNX: model/good_model_tmp.onnx
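
The random assignment to `model_1.onnx` or `model_2.onnx` could look like the sketch below. The layout of `model_assignment.json` is not shown in this notebook, so the schema used here (a mapping from a role name such as `"good"` to a slot filename) is an assumption, as are the helper name `assign_model_slot` and the role names. The demo runs in a temporary directory with a placeholder file instead of a real ONNX model.

```python
import json
import os
import random
import shutil
import tempfile

SLOTS = ("model_1.onnx", "model_2.onnx")

def assign_model_slot(onnx_path, model_dir, role,
                      assignment_file="model_assignment.json", rng=random):
    """Copy onnx_path into a randomly chosen free slot and record the mapping.

    Assumed JSON schema (hypothetical): {"<role>": "<slot filename>", ...}.
    A role that already has a slot keeps it, so reruns are stable.
    """
    assignment_path = os.path.join(model_dir, assignment_file)
    assignment = {}
    if os.path.exists(assignment_path):
        with open(assignment_path) as f:
            assignment = json.load(f)

    if role not in assignment:
        free = [s for s in SLOTS if s not in assignment.values()]
        if not free:
            raise RuntimeError("both model slots are already assigned")
        assignment[role] = rng.choice(free)

    shutil.copyfile(onnx_path, os.path.join(model_dir, assignment[role]))
    with open(assignment_path, "w") as f:
        json.dump(assignment, f, indent=2)
    return assignment[role]

# Demo with a placeholder file (not a real ONNX model)
with tempfile.TemporaryDirectory() as d:
    tmp_model = os.path.join(d, "good_model_tmp.onnx")
    with open(tmp_model, "wb") as f:
        f.write(b"placeholder")
    slot_good = assign_model_slot(tmp_model, d, role="good")
    slot_other = assign_model_slot(tmp_model, d, role="other")
    slot_good_again = assign_model_slot(tmp_model, d, role="good")
    with open(os.path.join(d, "model_assignment.json")) as f:
        recorded = json.load(f)

print("good ->", slot_good, "| other ->", slot_other)
```

Persisting the mapping in the shared file means a second notebook that writes the other model would take the remaining slot instead of overwriting this one.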
