Given the specifics of our problem (imbalanced data, asymmetric misclassification costs, and a mix of categorical and numerical features with low correlation), I recommend using **Gradient Boosting Machines**, specifically **XGBoost** or **LightGBM**.

**Arguments for this choice**:

- **Performance**: GBMs often outperform other algorithms on structured (tabular) data.
- **Imbalanced data handling**: Built-in options for dealing with class imbalance, such as class weighting.
- **Custom loss functions**: Make it possible to incorporate the 10:1 cost ratio for false negatives.
- **Feature importance**: Provide feature importance metrics.
- **Efficiency**: Efficient on large datasets.
- **Feature type handling**: Handle both categorical and numerical features effectively.
- **Low feature correlation**: Fewer issues related to multicollinearity.
- **Mitigation techniques**: Early stopping to prevent overfitting, and SHAP values or LIME for better interpretability.
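
As a rough illustration of the class-imbalance point, the sketch below computes the negative-to-positive ratio that is commonly used as a starting value for the `scale_pos_weight` parameter of XGBoost and LightGBM. The class counts are the ones printed in the LightGBM training log further down in this notebook; using this ratio as a weight is an assumption for illustration, not something this notebook does.

```python
# Class counts taken from the LightGBM training log in this notebook
# (one training split of the 5-fold cross-validation).
n_negative = 180_848
n_positive = 15_956

# Ratio of negatives to positives: a common starting value for
# `scale_pos_weight` (assumption: not used in this notebook's runs).
scale_pos_weight = n_negative / n_positive

# Share of positives; this should agree with the `pavg` value in the log.
positive_rate = n_positive / (n_negative + n_positive)

print(f"scale_pos_weight ~ {scale_pos_weight:.2f}")
print(f"positive rate    ~ {positive_rate:.6f}")
```

The ratio is close to the 10:1 cost ratio, so weighting by imbalance and weighting by cost would give similar emphasis to the positive class here.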

# Validate model selection after first feature selection

## Load libraries

In [21]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, precision_recall_curve, auc, confusion_matrix, f1_score, precision_score, recall_score
import mlflow
import numpy as np
import pandas as pd

import re

## Set global parameters

In [22]:
data_path = 'C:/Users/Z478SG/Desktop/Ecole/OpenClassrooms-Projet-7/modeling/data/03_primary/df_agg.csv'
test_size = 0.2
random_state = 18
cost_fn = 10
cost_fp = 1

## Load data

In [23]:
# Load data
data = pd.read_csv(data_path)

In [24]:
# Remove lines with TARGET = NaN
data = data.dropna(subset=["TARGET"])

In [25]:
data.dtypes.value_counts()

float64    606
bool       141
int64       50
Name: count, dtype: int64

In [26]:
# Return the smallest signed integer dtype whose maximum can hold max_value.
# NOTE: only the upper bound is checked; a column whose minimum falls below the
# chosen dtype's lower bound would overflow on conversion.
def determine_int_type(max_value):
    if max_value <= np.iinfo(np.int8).max:
        return np.int8
    elif max_value <= np.iinfo(np.int16).max:
        return np.int16
    elif max_value <= np.iinfo(np.int32).max:
        return np.int32
    else:
        return np.int64

# Downcast each int64 column to the smallest integer dtype that fits its maximum
for col in data.select_dtypes(include=[np.int64]).columns:
    max_value = data[col].max()
    new_type = determine_int_type(max_value)
    data[col] = data[col].astype(new_type)

float64_cols = data.select_dtypes(include=[np.float64]).columns
data[float64_cols] = data[float64_cols].astype(np.float32)

In [27]:
data.dtypes.value_counts()

float32    606
bool       141
int8        49
int32        1
Name: count, dtype: int64

### Process infinite values

In [28]:
inf_cols_mask = np.isinf(data).any()

In [29]:
# Print column names with infinite values
inf_cols = data.columns.to_series()[inf_cols_mask].tolist()
inf_cols

[]

In [30]:
# Print how many rows have infinite values
inf_rows_mask = np.isinf(data).any(axis=1)
len(data[inf_rows_mask])

0

In [31]:
# Print the rows that have infinite values
data[inf_rows_mask][inf_cols]

In [32]:
for col in inf_cols:
    print(data[col].describe())

In [33]:
# Replace +inf with the column's largest finite value and -inf with its smallest
for col in inf_cols:
    finite_values = data[col][np.isfinite(data[col])]  # excludes +inf, -inf and NaN
    data[col] = data[col].replace({np.inf: finite_values.max(), -np.inf: finite_values.min()})

In [34]:
inf_cols_mask = np.isinf(data).any()
# Print column names with infinite values
inf_cols = data.columns.to_series()[inf_cols_mask].tolist()
inf_cols

[]

### Infinite values processed!

In [35]:
X = data.drop("TARGET", axis=1)
y = data["TARGET"]

In [36]:
# ## TEST
# # Separate the rows with TARGET equal to 0 from those with TARGET equal to 1
# df_0 = data[data['TARGET'] == 0]
# df_1 = data[data['TARGET'] == 1]

# # Compute the number of rows needed for each class
# n_0 = int(1000 * 0.92)  # 92% of 1000
# n_1 = 1000 - n_0        # 8% of 1000

# # Randomly sample the required number of rows for each class
# df_0_sampled = df_0.sample(n=n_0, random_state=42)
# df_1_sampled = df_1.sample(n=n_1, random_state=42)

# # Concatenate the sampled rows to build the final DataFrame
# df_sampled = pd.concat([df_0_sampled, df_1_sampled])

# # Shuffle the rows to get a randomly ordered final DataFrame
# df_sampled = df_sampled.sample(frac=1, random_state=42).reset_index(drop=True)

# X = df_sampled.drop("TARGET", axis=1)
# y = df_sampled["TARGET"]

In [37]:
X.shape

(307507, 796)

## Split data

In [38]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
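
The split above is purely random. With about 8% positives, a stratified split keeps the class ratio the same in both partitions. The sketch below is a hand-rolled, standard-library illustration of that idea on toy labels. The helper name `stratified_split_indices` is hypothetical; in practice scikit-learn's `train_test_split` accepts a `stratify` argument for the same purpose.

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, test_size=0.2, seed=18):
    """Return (train_idx, test_idx) so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with 8% positives (hypothetical data, not the notebook's).
labels = [1] * 80 + [0] * 920
train_idx, test_idx = stratified_split_indices(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```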

## Create model

### Select a model

Models to test:
- RandomForestClassifier
- XGBoost
- LightGBM
- Logistic Regression with class weighting

In [39]:
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

In [40]:
mlflow.set_tracking_uri(uri="file:///C:/Users/Z478SG/Desktop/Ecole/OpenClassrooms-Projet-7/modeling/mlruns")
mlflow.set_experiment("OCR7")

2024/12/08 21:53:06 INFO mlflow.tracking.fluent: Experiment with name 'OCR7' does not exist. Creating a new experiment.


<Experiment: artifact_location='file:///C:/Users/Z478SG/Desktop/Ecole/OpenClassrooms-Projet-7/modeling/mlruns/842049509272676280', creation_time=1733691186762, experiment_id='842049509272676280', last_update_time=1733691186762, lifecycle_stage='active', name='OCR7', tags={}>

In [41]:
def log_model_metrics(model_name, features, metrics):
    with mlflow.start_run() as run:
        mlflow.log_param("model_name", model_name)
        mlflow.log_param("num_features", len(features))
        mlflow.log_param("features", features)
        mlflow.log_metric("AUC-ROC", metrics['auc-roc'])
        mlflow.log_metric("F1", metrics['f1'])
        mlflow.log_metric("Precision", metrics['precision'])
        mlflow.log_metric("Recall", metrics['recall'])


In [42]:
def assess_model(y_pred_proba, y_test, threshold=0.5):
    y_pred = (y_pred_proba > threshold).astype(int)
    return {
        'auc-roc': roc_auc_score(y_test, y_pred_proba),
        'f1': f1_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
    }
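
`assess_model` applies a fixed threshold of 0.5. With roughly 8% positives, predicted probabilities may rarely exceed 0.5, so precision, recall and F1 can collapse to zero even when the ranking (AUC-ROC) is reasonable. The sketch below shows one possible way to pick a threshold from the 10:1 cost ratio instead. The helper names and the toy scores are hypothetical and not part of this notebook.

```python
COST_FN = 10  # cost of a missed positive (false negative), as in `cost_fn`
COST_FP = 1   # cost of a false alarm (false positive), as in `cost_fp`

def total_cost(y_true, y_proba, threshold):
    """Sum the misclassification costs at a given decision threshold."""
    cost = 0
    for truth, proba in zip(y_true, y_proba):
        predicted = int(proba > threshold)
        if truth == 1 and predicted == 0:
            cost += COST_FN
        elif truth == 0 and predicted == 1:
            cost += COST_FP
    return cost

def best_threshold(y_true, y_proba, candidates):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(y_true, y_proba, t))

# Hypothetical toy scores: positives rank high but stay below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_proba = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.25, 0.40]

candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
threshold = best_threshold(y_true, y_proba, candidates)
print("best threshold:", threshold)
print("cost at best threshold:", total_cost(y_true, y_proba, threshold))
print("cost at 0.5:", total_cost(y_true, y_proba, 0.5))
```

On real data the threshold would be tuned on a validation split rather than on the test set.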
    

In [43]:
# Random forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=random_state,)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Mean CV AUC-ROC: {np.mean(cv_scores)}, Std CV AUC-ROC: {np.std(cv_scores)}")

# Train the model on the full training set
model.fit(X_train, y_train)

# Predict probabilities on the test set
y_pred_proba = model.predict_proba(X_test)[:, 1]

metrics = assess_model(y_pred_proba, y_test)
print(f"Test AUC-ROC: {metrics['auc-roc']}")
print(f"Test F1: {metrics['f1']}")
print(f"Test Precision: {metrics['precision']}")
print(f"Test Recall: {metrics['recall']}")

log_model_metrics('RandomForestClassifier', X_train.columns.tolist(), metrics)


Mean CV AUC-ROC: 0.6822085250052969, Std CV AUC-ROC: 0.0028411781978164162
Test AUC-ROC: 0.6910631117285699
Test F1: 0.0
Test Precision: 0.0
Test Recall: 0.0





In [44]:
# XGBoost classifier for binary classification outputting probabilities
model = XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Mean CV AUC-ROC: {np.mean(cv_scores)}, Std CV AUC-ROC: {np.std(cv_scores)}")

# Train the model on the full training set
model.fit(X_train, y_train)

# Predict probabilities on the test set
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Compute the metrics on the test set
metrics = assess_model(y_pred_proba, y_test)
print(f"Test AUC-ROC: {metrics['auc-roc']}")
print(f"Test F1: {metrics['f1']}")
print(f"Test Precision: {metrics['precision']}")
print(f"Test Recall: {metrics['recall']}")

log_model_metrics('XGBClassifier', X_train.columns.tolist(), metrics)


Mean CV AUC-ROC: 0.6926229244684583, Std CV AUC-ROC: 0.002501769040956962


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Test AUC-ROC: 0.6912438237237336
Test F1: 0.0
Test Precision: 0.0
Test Recall: 0.0




In [45]:
# XGBoost classifier with the hinge objective, which outputs hard 0/1 predictions
model = XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1, objective='binary:hinge')
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Mean CV AUC-ROC: {np.mean(cv_scores)}, Std CV AUC-ROC: {np.std(cv_scores)}")

# Train the model on the full training set
model.fit(X_train, y_train)

# Predict on the test set (with binary:hinge these are 0/1 labels, not probabilities)
y_pred_proba = model.predict_proba(X_test)[:, 1]

metrics = assess_model(y_pred_proba, y_test)
print(f"Test AUC-ROC: {metrics['auc-roc']}")
print(f"Test F1: {metrics['f1']}")
print(f"Test Precision: {metrics['precision']}")
print(f"Test Recall: {metrics['recall']}")

log_model_metrics('XGBClassifier-binary:hinge', X_train.columns.tolist(), metrics)


Mean CV AUC-ROC: 0.5, Std CV AUC-ROC: 0.0


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Test AUC-ROC: 0.5
Test F1: 0.0
Test Precision: 0.0
Test Recall: 0.0
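
An AUC-ROC of exactly 0.5 with zero variance is what one would expect if every row receives the same hard label, which is plausible here because `binary:hinge` yields 0/1 predictions rather than graded scores. The sketch below illustrates this with a small pairwise (Mann-Whitney style) AUC on toy data. The function name `auc_from_scores` and the data are hypothetical.

```python
def auc_from_scores(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (Mann-Whitney U formulation).
    """
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 0, 1, 1]

# Graded scores keep ranking information between rows.
auc_graded = auc_from_scores(y_true, [0.1, 0.3, 0.2, 0.4, 0.25])

# The same hard label for every row: every positive/negative pair is a tie.
auc_constant = auc_from_scores(y_true, [0, 0, 0, 0, 0])

print(f"graded scores:   {auc_graded:.3f}")
print(f"constant labels: {auc_constant:.3f}")
```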




In [46]:
# LightGBM classifier
model = LGBMClassifier(n_estimators=100, random_state=random_state)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Mean CV AUC-ROC: {np.mean(cv_scores)}, Std CV AUC-ROC: {np.std(cv_scores)}")

# Train the model on the full training set
model.fit(X_train, y_train)

# Predict probabilities on the test set
y_pred_proba = model.predict_proba(X_test)[:, 1]

metrics = assess_model(y_pred_proba, y_test)
print(f"Test AUC-ROC: {metrics['auc-roc']}")
print(f"Test F1: {metrics['f1']}")
print(f"Test Precision: {metrics['precision']}")
print(f"Test Recall: {metrics['recall']}")

log_model_metrics('LGBMClassifier', X_train.columns.tolist(), metrics)


[LightGBM] [Info] Number of positive: 15956, number of negative: 180848
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.980974 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 92971
[LightGBM] [Info] Number of data points in the train set: 196804, number of used features: 772
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.081076 -> initscore=-2.427822
[LightGBM] [Info] Start training from score -2.427822
[LightGBM] [Info] Number of positive: 15956, number of negative: 180848
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.346470 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 93095
[LightGBM] [Info] Number of data points in the train set: 196804, number of used features: 772
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.081076 -> initscore=-2.4



Test AUC-ROC: 0.7606255692770754
Test F1: 0.0406231512522185
Test Precision: 0.5392670157068062
Test Recall: 0.02110655737704918




In [47]:
# Percentage of missing values per column of X
missing_percentage = X.isnull().mean() * 100
missing_table = pd.DataFrame({'Column': X.columns, 'Missing Percentage': missing_percentage})

# Sort by ascending percentage of missing values
missing_table = missing_table.sort_values(by='Missing Percentage', ascending=True)

# Display the sorted table
print("Table sorted by ascending percentage of missing values:")
missing_table

Table sorted by ascending percentage of missing values:


| Column | Missing Percentage |
| --- | --- |
| SK_ID_CURR | 0.000000 |
| OCCUPATION_TYPE_Privateservicestaff | 0.000000 |
| OCCUPATION_TYPE_Realtyagents | 0.000000 |
| OCCUPATION_TYPE_Salesstaff | 0.000000 |
| OCCUPATION_TYPE_Secretaries | 0.000000 |
| ... | ... |
| CC_CNT_DRAWINGS_POS_CURRENT_VAR | 99.318064 |
| CC_AMT_DRAWINGS_ATM_CURRENT_VAR | 99.318064 |
| CC_CNT_DRAWINGS_ATM_CURRENT_VAR | 99.318064 |
| CC_AMT_DRAWINGS_POS_CURRENT_VAR | 99.318064 |


In [48]:

# Count the columns falling in each range of missing-value percentage
missing_counts = {
    '100%': (missing_table['Missing Percentage'] == 100).sum(),
    '90-100%': ((missing_table['Missing Percentage'] >= 90) & (missing_table['Missing Percentage'] < 100)).sum(),
    '75-90%': ((missing_table['Missing Percentage'] >= 75) & (missing_table['Missing Percentage'] < 90)).sum(),
    '50-75%': ((missing_table['Missing Percentage'] >= 50) & (missing_table['Missing Percentage'] < 75)).sum(),
    '25-50%': ((missing_table['Missing Percentage'] >= 25) & (missing_table['Missing Percentage'] < 50)).sum(),
    '<25%': (missing_table['Missing Percentage'] < 25).sum()
}

# Print the number of columns in each range
print("\nNumber of columns per range of missing-value percentage:")
for key, value in missing_counts.items():
    print(f"{key}: {value} columns")



Number of columns per range of missing-value percentage:
100%: 0 columns
90-100%: 125 columns
75-90%: 42 columns
50-75%: 80 columns
25-50%: 332 columns
<25%: 217 columns


In [49]:
# Select the columns with more than 90% missing values
columns_to_drop = missing_table[missing_table['Missing Percentage'] > 90]['Column']

# Drop these columns from the DataFrame
X_less_na = X.drop(columns=columns_to_drop)

In [50]:
# Percentage of missing values per column of X_less_na
missing_percentage = X_less_na.isnull().mean() * 100
missing_table = pd.DataFrame({'Column': X_less_na.columns, 'Missing Percentage': missing_percentage})

# Sort by ascending percentage of missing values
missing_table = missing_table.sort_values(by='Missing Percentage', ascending=True)

# Display the sorted table
print("Table sorted by ascending percentage of missing values:")
missing_table

Table sorted by ascending percentage of missing values:


| Column | Missing Percentage |
| --- | --- |
| SK_ID_CURR | 0.000000 |
| WEEKDAY_APPR_PROCESS_START_THURSDAY | 0.000000 |
| WEEKDAY_APPR_PROCESS_START_TUESDAY | 0.000000 |
| WEEKDAY_APPR_PROCESS_START_WEDNESDAY | 0.000000 |
| ORGANIZATION_TYPE_Advertising | 0.000000 |
| ... | ... |
| REFUSED_AMT_DOWN_PAYMENT_MIN | 89.549181 |
| REFUSED_AMT_DOWN_PAYMENT_MEAN | 89.549181 |
| REFUSED_RATE_DOWN_PAYMENT_MEAN | 89.549181 |
| REFUSED_RATE_DOWN_PAYMENT_MIN | 89.549181 |


In [51]:
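# Compute the mean of each column
# NOTE: the means are computed on the full dataset, before the train/test split,
# so test rows contribute to the imputation values.
means = X_less_na.mean()

# Replace missing values with the mean of each column
X_no_na = X_less_na.fillna(means)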
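
Because the imputation above uses means computed before the split, information from the test rows leaks into the training features. A minimal sketch of the alternative, learning the means on the training rows only and reusing them for the test rows, is shown below with NumPy on toy data. The helper names and the arrays are hypothetical.

```python
import numpy as np

def fit_column_means(train):
    """Column means of the training rows, ignoring NaN."""
    return np.nanmean(train, axis=0)

def impute_with_means(values, means):
    """Return a copy where NaN entries are filled with the per-column means."""
    filled = values.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = means[cols]
    return filled

# Hypothetical toy data: two features with some missing entries.
train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
test = np.array([[np.nan, 100.0], [5.0, np.nan]])

means = fit_column_means(train)  # learned from training rows only
train_filled = impute_with_means(train, means)
test_filled = impute_with_means(test, means)

print("train means:", means)
print("imputed test rows:")
print(test_filled)
```

The value 100.0 in the test rows does not influence the imputation means, which is the point of fitting on the training partition only.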


In [52]:
# Percentage of missing values per column of X_no_na
missing_percentage = X_no_na.isnull().mean() * 100
missing_table = pd.DataFrame({'Column': X_no_na.columns, 'Missing Percentage': missing_percentage})

# Sort by ascending percentage of missing values
missing_table = missing_table.sort_values(by='Missing Percentage', ascending=True)

# Display the sorted table
print("Table sorted by ascending percentage of missing values:")
missing_table

Table sorted by ascending percentage of missing values:


| Column | Missing Percentage |
| --- | --- |
| SK_ID_CURR | 0.0 |
| PREV_NAME_PAYMENT_TYPE_XNA_MEAN | 0.0 |
| PREV_NAME_PAYMENT_TYPE_nan_MEAN | 0.0 |
| PREV_CODE_REJECT_REASON_CLIENT_MEAN | 0.0 |
| PREV_CODE_REJECT_REASON_HC_MEAN | 0.0 |
| ... | ... |
| FONDKAPREMONT_MODE_notspecified | 0.0 |
| FONDKAPREMONT_MODE_orgspecaccount | 0.0 |
| FONDKAPREMONT_MODE_regoperaccount | 0.0 |
| HOUSETYPE_MODE_blockofflats | 0.0 |


In [53]:
X_no_na_train, X_no_na_test, y_train, y_test = train_test_split(X_no_na, y, test_size=test_size, random_state=random_state)

In [54]:
# Logistic regression classifier with class weights
# NOTE: as written, class 0 is weighted by cost_fn (10) and class 1 by cost_fp (1);
# penalizing false negatives 10:1 would call for {0: cost_fp, 1: cost_fn}.
# The results below were produced with the mapping as written.
model = LogisticRegression(class_weight={0: cost_fn, 1: cost_fp}, random_state=random_state, max_iter=1000)
cv_scores = cross_val_score(model, X_no_na_train, y_train, cv=5, scoring='roc_auc')
print(f"Mean CV AUC-ROC: {np.mean(cv_scores)}, Std CV AUC-ROC: {np.std(cv_scores)}")

# Train the model on the full training set
model.fit(X_no_na_train, y_train)

# Predict probabilities on the test set
y_pred_proba = model.predict_proba(X_no_na_test)[:, 1]

metrics = assess_model(y_pred_proba, y_test)
print(f"Test AUC-ROC: {metrics['auc-roc']}")
print(f"Test F1: {metrics['f1']}")
print(f"Test Precision: {metrics['precision']}")
print(f"Test Recall: {metrics['recall']}")

# Log the features this model was actually trained on
log_model_metrics('LogisticRegression', X_no_na_train.columns.tolist(), metrics)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Mean CV AUC-ROC: 0.5242008662812383, Std CV AUC-ROC: 0.00430157480949822


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Test AUC-ROC: 0.5286756624749345
Test F1: 0.0
Test Precision: 0.0
Test Recall: 0.0
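
The direction of the `class_weight` mapping passed to `LogisticRegression` above matters: each row's loss is multiplied by the weight of its true class. The sketch below compares the penalty for a confidently missed positive with the penalty for an equally confident false alarm under both orientations. The helper name `weighted_penalty` and the probabilities are hypothetical, chosen only to illustrate the effect.

```python
import math

def weighted_penalty(truth, proba_positive, class_weight):
    """Log loss of one row, multiplied by the weight of its true class."""
    if truth == 1:
        row_loss = -math.log(proba_positive)
    else:
        row_loss = -math.log(1.0 - proba_positive)
    return class_weight[truth] * row_loss

# Mapping as written in the notebook cell: class 0 gets the weight 10.
as_written = {0: 10, 1: 1}
# Mapping that matches a 10:1 cost for false negatives: class 1 gets the weight 10.
fn_costly = {0: 1, 1: 10}

for name, weights in [("as written", as_written), ("fn costly", fn_costly)]:
    missed_positive = weighted_penalty(1, 0.1, weights)  # true 1, predicted ~0
    false_alarm = weighted_penalty(0, 0.9, weights)      # true 0, predicted ~1
    print(f"{name}: missed positive / false alarm = {missed_positive / false_alarm:.2f}")

ratio_as_written = weighted_penalty(1, 0.1, as_written) / weighted_penalty(0, 0.9, as_written)
ratio_fn_costly = weighted_penalty(1, 0.1, fn_costly) / weighted_penalty(0, 0.9, fn_costly)
```

Under the mapping as written, a missed positive is penalized ten times less than a false alarm, which is consistent with the model never predicting the positive class at the 0.5 threshold.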




We therefore select LGBMClassifier, which has the best test AUC-ROC: 0.76.