<a href="https://colab.research.google.com/github/ach224/Prediction_eligibilite_pret_bancaire/blob/Optimisation_A%C3%AFcha/Optimisation_A%C3%AFcha.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 4: Optimization

* Feature engineering (Income-to-Loan-Ratio)
* Hyperparameter tuning (GridSearchCV, RandomizedSearchCV)
* Handling class imbalance (SMOTE, class_weight)

## Recap
We keep the random forest as the main model for this phase, since it achieved the best metrics during the modeling phase.

In [5]:
# 0. Imports
import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Load the raw dataset
df = pd.read_csv("/content/drive/MyDrive/DATA SCIENCES/PROJET DATA SCIENCES/loan_prediction.csv")
print(df.shape)
df.head()

# 2. Cleaning & preparation
# a) Drop the identifier (not useful for prediction)
df = df.drop(columns=["Loan_ID"])
# b) Impute missing values
colonnes_cat = ["Gender", "Married", "Dependents", "Education",
                "Self_Employed", "Property_Area", "Loan_Status"]
colonnes_num = ["ApplicantIncome", "CoapplicantIncome",
                "LoanAmount", "Loan_Amount_Term", "Credit_History"]
# Categorical -> most frequent value (mode)
for col in colonnes_cat:
    df[col] = df[col].fillna(df[col].mode()[0])
# Numerical -> median
for col in colonnes_num:
    df[col] = df[col].fillna(df[col].median())

# 3. Variable encoding
# Binary encoding (as in the first notebook)
df["Gender"]        = df["Gender"].map({"Male": 0, "Female": 1})
df["Married"]       = df["Married"].map({"No": 0, "Yes": 1})
df["Education"]     = df["Education"].map({"Not Graduate": 0, "Graduate": 1})
df["Self_Employed"] = df["Self_Employed"].map({"No": 0, "Yes": 1})
df["Credit_History"]= df["Credit_History"].astype(int)  # already 0/1
df["Loan_Status"]   = df["Loan_Status"].map({"N": 0, "Y": 1})
# Dependents: replace "3+" with 3 and cast to int
df["Dependents"] = df["Dependents"].replace("3+", 3).astype(int)
# Property_Area: simple 0/1/2 encoding
df["Property_Area"] = df["Property_Area"].map({"Rural": 0, "Semiurban": 1, "Urban": 2})
# Cast the relevant columns to int
cols_int = ["Gender", "Married", "Dependents", "Education",
            "Self_Employed", "Credit_History", "Property_Area", "Loan_Status"]
df[cols_int] = df[cols_int].astype(int)
df.info()

# 4. Split features / target
X = df.drop(columns=["Loan_Status"])
y = df["Loan_Status"]

# 5. Train / Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
print("X_train:", X_train.shape)
print("X_test :", X_test.shape)
print("y_train:", y_train.shape)
print("y_test :", y_test.shape)

# 6. Modeling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Random Forest before optimization
rf_base = RandomForestClassifier(random_state=42)
rf_base.fit(X_train, y_train)
y_pred_base  = rf_base.predict(X_test)
y_proba_base = rf_base.predict_proba(X_test)[:, 1]
RF_before_opt = pd.DataFrame([{
    "Modèle":   "Random Forest (avant optimisation)",
    "Accuracy":  accuracy_score(y_test, y_pred_base),
    "Precision": precision_score(y_test, y_pred_base),
    "Recall":    recall_score(y_test, y_pred_base),
    "F1-score":  f1_score(y_test, y_pred_base),
    "ROC-AUC":   roc_auc_score(y_test, y_proba_base),
}])
display(RF_before_opt.round(3))

(614, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             614 non-null    int64  
 1   Married            614 non-null    int64  
 2   Dependents         614 non-null    int64  
 3   Education          614 non-null    int64  
 4   Self_Employed      614 non-null    int64  
 5   ApplicantIncome    614 non-null    int64  
 6   CoapplicantIncome  614 non-null    float64
 7   LoanAmount         614 non-null    float64
 8   Loan_Amount_Term   614 non-null    float64
 9   Credit_History     614 non-null    int64  
 10  Property_Area      614 non-null    int64  
 11  Loan_Status        614 non-null    int64  
dtypes: float64(3), int64(9)
memory usage: 57.7 KB
X_train: (491, 11)
X_test : (123, 11)
y_train: (491,)
y_test : (123,)


|   | Modèle | Accuracy | Precision | Recall | F1-score | ROC-AUC |
|---|--------|----------|-----------|--------|----------|---------|
| 0 | Random Forest (avant optimisation) | 0.829 | 0.848 | 0.918 | 0.881 | 0.789 |


Now that all the code from the previous steps has been restored, we can move on to the optimization steps themselves.

## Step 1: Feature engineering
Feature engineering consists of creating one or more new variables from existing ones, with the goal of improving the model's predictions.
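
As an illustration of the Income-to-Loan-Ratio feature named in the outline, here is a minimal sketch on a few invented rows. The helper `add_income_to_loan_ratio` and the column names `TotalIncome` and `Income_to_Loan_Ratio` are hypothetical. Summing applicant and co-applicant income before dividing is one possible definition, assumed here and not necessarily the one this notebook uses.

```python
import numpy as np
import pandas as pd

# Toy rows reusing the dataset's column names; the values are invented for illustration.
toy = pd.DataFrame({
    "ApplicantIncome":   [5000, 3000, 4000],
    "CoapplicantIncome": [0.0, 1500.0, 1000.0],
    "LoanAmount":        [100.0, 150.0, 0.0],
})

def add_income_to_loan_ratio(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with two extra columns (hypothetical names):
    TotalIncome and Income_to_Loan_Ratio."""
    out = frame.copy()
    # Assumption: household income = applicant + co-applicant income.
    out["TotalIncome"] = out["ApplicantIncome"] + out["CoapplicantIncome"]
    # Treat a zero loan amount as missing so the division gives NaN instead of inf.
    loan = out["LoanAmount"].replace(0, np.nan)
    out["Income_to_Loan_Ratio"] = out["TotalIncome"] / loan
    return out

toy_fe = add_income_to_loan_ratio(toy)
print(toy_fe)
```

Because the ratio is computed row by row, the same function can be applied separately to `X_train` and `X_test` without leaking information from the test set into training.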