# Modeling - Credit risk (binary PD)

Objective: train the reference logit and the comparison ML models on data
prepared without information leakage (preprocessing fitted on the training set only).

## 1) Setup

In [1]:
from pathlib import Path
import sys
import os


def find_project_root(start: Path) -> Path:
    """Return the first directory at or above `start` that contains `src/`."""
    for path in [start, *start.parents]:
        if (path / "src").is_dir():
            return path
    raise FileNotFoundError("Project root with 'src' not found")


# Work from the project root so relative paths and `src` imports resolve
root = find_project_root(Path.cwd())
os.chdir(root)
if str(root) not in sys.path:
    sys.path.append(str(root))

import pandas as pd
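
The root lookup above walks from the current directory up through its parents until it finds a folder containing `src`. A minimal sketch of that behaviour on a throwaway directory tree follows; the `demo_project` and `notebooks` folder names are hypothetical and only serve the illustration.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path) -> Path:
    """Return the first directory at or above `start` that contains `src/`."""
    for path in [start, *start.parents]:
        if (path / "src").is_dir():
            return path
    raise FileNotFoundError("Project root with 'src' not found")


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: demo_project/src and demo_project/notebooks
    project = Path(tmp).resolve() / "demo_project"
    (project / "src").mkdir(parents=True)
    notebooks = project / "notebooks"
    notebooks.mkdir()

    # Starting from a sub-folder, the search climbs to the folder holding src/
    found = find_project_root(notebooks)
    found_matches = found == project
    print(found_matches)
```

Because the search starts at `start` itself, the notebook can be launched either from the project root or from any sub-folder.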


## 2) Loading the prepared data

In [2]:
from src.data.load_data import read_processed_train_csv, read_processed_test_csv

train_df = read_processed_train_csv()
test_df = read_processed_test_csv()

print('train shape:', train_df.shape)
print('test shape:', test_df.shape)
train_df.head()

train shape: (800, 39)
test shape: (200, 39)


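| | Duree_credit | Montant_credit | Epargne | Anciennete_emploi | Taux_effort | Anciennete_domicile | Age | Nb_credits | Nb_pers_charge | Comptes_A12 | ... | Autres_credits_A143 | Autres_credits_Other | Statut_domicile_A152 | Statut_domicile_A153 | Type_emploi_A173 | Type_emploi_A174 | Type_emploi_Other | Telephone_A192 | Etranger_Other | Cible |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 8335 | 0 | 4 | 3 | 4 | 47 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 12 | 804 | 1 | 4 | 4 | 4 | 38 | 1 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 36 | 5371 | 1 | 2 | 3 | 2 | 28 | 2 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 36 | 3990 | 0 | 1 | 3 | 2 | 29 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 48 | 8487 | 0 | 3 | 1 | 2 | 24 | 1 | 1 | 1 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |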


## 2.1) Method note: leakage-free preprocessing

Preprocessing is fitted on the training sample only (80/20 split) and then applied unchanged to the test set. No test-set information reaches the fitted transformations, so the test metrics are not optimistically biased by the preprocessing step.
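
As an illustration of the fit-on-train, apply-to-test rule, here is a minimal sketch using a simple standardisation. The project's actual preprocessing is not shown in this notebook, so the helpers `fit_scaler` and `apply_scaler` and the toy values are assumptions made for the example only.

```python
from statistics import mean, pstdev

# Toy values for one numeric feature (illustrative, not taken from the dataset)
train_values = [12.0, 24.0, 36.0, 48.0]
test_values = [6.0, 60.0]


def fit_scaler(values):
    """Learn standardisation parameters from the training sample only."""
    return {"mean": mean(values), "std": pstdev(values)}


def apply_scaler(values, params):
    """Apply previously fitted parameters without re-estimating them."""
    return [(v - params["mean"]) / params["std"] for v in values]


params = fit_scaler(train_values)                 # fit on train only
train_scaled = apply_scaler(train_values, params)
test_scaled = apply_scaler(test_values, params)   # reuse the train parameters

print(params)
print(test_scaled)
```

The test values are never passed to `fit_scaler`, so the mean and standard deviation used on the test set come from the training sample alone.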


## 3) Training the reference logit

In [3]:
from src.models.logit import train_logit, save_model, save_summary

logit_model = train_logit(train_df)
logit_path = save_model(logit_model)
summary_path = save_summary(logit_model)

print('Saved logit:', logit_path)
print('Saved summary:', summary_path)
logit_model.summary()

Saved logit: /home/apollinaire_12/memoire_M1/artifacts/logit_model.joblib
Saved summary: /home/apollinaire_12/memoire_M1/reports/draft/logit_summary.txt


| | | | |
|---|---|---|---|
| Dep. Variable: | Cible | No. Observations: | 800 |
| Model: | Logit | Df Residuals: | 761 |
| Method: | MLE | Df Model: | 38 |
| Date: | Mon, 22 Dec 2025 | Pseudo R-squ.: | 0.2377 |
| Time: | 21:38:30 | Log-Likelihood: | -372.52 |
| converged: | True | LL-Null: | -488.69 |
| Covariance Type: | nonrobust | LLR p-value: | 9.633e-30 |
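
The pseudo R-squared reported for a statsmodels logit is McFadden's, computed from the two log-likelihoods in the summary. The short check below recomputes it from the displayed values, along with the likelihood-ratio statistic behind the LLR p-value. The variable names are chosen for the example.

```python
# Values copied from the summary above
log_likelihood = -372.52   # Log-Likelihood of the fitted model
ll_null = -488.69          # LL-Null: intercept-only model

# McFadden's pseudo R-squared: 1 - LL(model) / LL(null)
pseudo_r2 = 1 - log_likelihood / ll_null

# Likelihood-ratio statistic; its p-value comes from a chi-square
# distribution with Df Model (38) degrees of freedom
llr_stat = 2 * (log_likelihood - ll_null)

print(round(pseudo_r2, 4))
print(round(llr_stat, 2))
```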

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.6010 | 0.923 | 0.651 | 0.515 | -1.208 | 2.410 |
| Duree_credit | 0.0230 | 0.011 | 2.167 | 0.030 | 0.002 | 0.044 |
| Montant_credit | 0.0001 | 5.32e-05 | 2.057 | 0.040 | 5.18e-06 | 0.000 |
| Epargne | -0.0410 | 0.104 | -0.396 | 0.692 | -0.244 | 0.162 |
| Anciennete_emploi | -0.1808 | 0.086 | -2.104 | 0.035 | -0.349 | -0.012 |
| Taux_effort | 0.3308 | 0.094 | 3.505 | 0.000 | 0.146 | 0.516 |
| Anciennete_domicile | -0.0004 | 0.093 | -0.004 | 0.997 | -0.183 | 0.182 |
| Age | -0.0025 | 0.010 | -0.261 | 0.794 | -0.022 | 0.017 |
| Nb_credits | 0.2218 | 0.220 | 1.010 | 0.312 | -0.209 | 0.652 |
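
A logit coefficient is the change in the log-odds of default for a one-unit increase in the variable, other variables held fixed. Exponentiating it gives an odds ratio, which is often easier to read. The sketch below does this for three coefficients copied from the table above (rows with `P>|z|` below 0.05, leaving out `Montant_credit`, whose displayed coefficient is rounded to 0.0001). What "one unit" means depends on how each variable was encoded during preparation, which this notebook does not show.

```python
import math

# Coefficients copied from the summary table above
coefficients = {
    "Duree_credit": 0.0230,
    "Anciennete_emploi": -0.1808,
    "Taux_effort": 0.3308,
}

# Odds ratio = exp(coefficient); above 1 raises the odds of Cible = 1, below 1 lowers them
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```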


## 4) Training the logit without sensitive variables

In [4]:
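from src.models.logit import train_logit, save_model as save_logit

# Prefixes of the one-hot columns derived from the sensitive variables
SENSITIVE_PREFIXES = ('Situation_familiale_', 'Etranger_')

def drop_sensitive(df):
    # str.startswith accepts a tuple and matches any of its prefixes
    cols = [c for c in df.columns if c.startswith(SENSITIVE_PREFIXES)]
    return df.drop(columns=cols)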

train_nosensitive = drop_sensitive(train_df)
logit_ns = train_logit(train_nosensitive)
ns_path = save_logit(logit_ns, filename='logit_model_nosensitive.joblib')
print('Saved logit (no sensitive):', ns_path)

Saved logit (no sensitive): /home/apollinaire_12/memoire_M1/artifacts/logit_model_nosensitive.joblib
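
The column filter in `drop_sensitive` relies on `str.startswith` accepting a tuple of prefixes. Here is a small sketch of that selection on a plain list of names. `Situation_familiale_A93` is a hypothetical column name added for the illustration, since the truncated `head()` output does not show the actual ones.

```python
SENSITIVE_PREFIXES = ("Situation_familiale_", "Etranger_")

# Sample of column names; "Situation_familiale_A93" is hypothetical
columns = [
    "Duree_credit",
    "Age",
    "Telephone_A192",
    "Etranger_Other",
    "Situation_familiale_A93",
    "Cible",
]

# A column is dropped as soon as it starts with any of the prefixes
dropped = [c for c in columns if c.startswith(SENSITIVE_PREFIXES)]
kept = [c for c in columns if c not in dropped]

print("dropped:", dropped)
print("kept:", kept)
```

Matching on prefixes rather than exact names means every dummy column generated from a sensitive variable is removed, whatever its category suffix.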


## 5) Training the comparison ML models

In [5]:
from src.models.xgboost_model import train_xgboost, save_model as save_xgb
from src.models.lightgbm_model import train_lightgbm, save_model as save_lgbm
from src.models.random_forest_model import train_random_forest, save_model as save_rf

xgb_model = train_xgboost(train_df)
xgb_path = save_xgb(xgb_model)
print('Saved XGBoost:', xgb_path)

lgbm_model = train_lightgbm(train_df)
lgbm_path = save_lgbm(lgbm_model)
print('Saved LightGBM:', lgbm_path)

rf_model = train_random_forest(train_df)
rf_path = save_rf(rf_model)
print('Saved Random Forest:', rf_path)

Saved XGBoost: /home/apollinaire_12/memoire_M1/artifacts/xgb_model.joblib
[LightGBM] [Info] Number of positive: 240, number of negative: 560
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001226 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 412
[LightGBM] [Info] Number of data points in the train set: 800, number of used features: 37
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.300000 -> initscore=-0.847298
[LightGBM] [Info] Start training from score -0.847298


Saved LightGBM: /home/apollinaire_12/memoire_M1/artifacts/lgbm_model.joblib


Saved Random Forest: /home/apollinaire_12/memoire_M1/artifacts/rf_model.joblib
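
The LightGBM log above reports `pavg=0.300000 -> initscore=-0.847298`. For a binary objective, that initial score is the log-odds of the positive-class rate in the training set. A quick check from the class counts printed in the log:

```python
import math

# Class counts copied from the LightGBM log above
n_positive = 240
n_negative = 560

# Average positive rate in the training set
pavg = n_positive / (n_positive + n_negative)

# Log-odds of the positive rate, the starting score before any tree is added
init_score = math.log(pavg / (1 - pavg))

print(round(pavg, 6))
print(round(init_score, 6))
```

The 30% default rate also shows that the classes are imbalanced, which is worth keeping in mind when reading the metrics in the results notebook.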


## 6) Reproducibility notes
Metrics and plots are produced in the Results notebook (03_results).