
# Project 1: Predicting Bank Loan Eligibility
## 🎯 Project objective
Predict whether a customer is eligible for a bank loan using machine learning models (e.g. Gradient Boosting, Logistic Regression). The goal is to reproduce the bank credit-scoring process on a public dataset.
## 📂 Dataset
Name: Loan Prediction Dataset

Source: https://www.kaggle.com/datasets/ninzaami/loan-predication

## 🛠️ Project steps
**1. Data understanding and preparation**
- Load the dataset
- Understand the columns
- Check for missing values, duplicates, and outliers
- Encode the categorical variables
- Normalize / standardize the numeric variables if needed
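
The checks listed in this step can be sketched as follows. This is a minimal sketch on a small hypothetical sample: the `sample` DataFrame and its values are made up for illustration, and the 1.5 × IQR rule is one common outlier convention assumed here, not something the project prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking a few columns of the dataset (illustrative values).
sample = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male", "Male", "Male"],
    "ApplicantIncome": [3000, 4000, 3500, 4500, 3000, 50000],
    "LoanAmount": [100.0, np.nan, 120.0, 130.0, 100.0, 150.0],
})

# Number of missing values in each column.
missing_counts = sample.isna().sum()

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(sample.duplicated().sum())

# Flag outliers in a numeric column with the 1.5 * IQR rule (assumed convention).
q1, q3 = sample["ApplicantIncome"].quantile([0.25, 0.75])
iqr = q3 - q1
income = sample["ApplicantIncome"]
outlier_mask = (income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)
outliers = sample[outlier_mask]

print(missing_counts)
print("duplicates:", n_duplicates)
print(outliers)
```

On the real data, the same calls would be applied to `df` once it has been loaded.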

**2. Exploratory data analysis (EDA)**
- Visualize the distribution of the variables
- Compare income between eligible and non-eligible applicants
- Study the impact of Credit_History and Education
- Check for class imbalance in Loan_Status
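
A few of these comparisons can be computed numerically before plotting anything. The sketch below uses a hypothetical `sample` DataFrame with made-up values; the column names follow the dataset, and `Loan_Status` is assumed to take the values `Y` and `N`.

```python
import pandas as pd

# Hypothetical sample; the values are illustrative, not real dataset rows.
sample = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    "ApplicantIncome": [5000, 4000, 6000, 2500, 3000, 3500, 2000, 4500],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "N", "Y", "Y"],
})

# Class balance of the target, expressed as proportions.
class_share = sample["Loan_Status"].value_counts(normalize=True)

# Approval rate for each Credit_History value.
approved = sample["Loan_Status"].eq("Y")
approval_by_credit = approved.groupby(sample["Credit_History"]).mean()

# Median applicant income for eligible vs. non-eligible applicants.
median_income = sample.groupby("Loan_Status")["ApplicantIncome"].median()

print(class_share)
print(approval_by_credit)
print(median_income)
```

The same grouping can be repeated with `Education` in place of `Credit_History`.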

**3. Modeling**
- Define the target variable: Loan_Status
- Split the data into train and test sets
- Try several models (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting)
- Compare performance (Accuracy, Precision, Recall, F1-score, ROC-AUC)
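
As an illustration of this step, here is a minimal sketch of a train/test split followed by one model and a few metrics, using scikit-learn. The data is synthetic (an assumption standing in for the encoded features), and only Logistic Regression is shown; the other models listed above could be swapped in the same way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic features and labels (assumption: not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# The label depends mostly on the first feature, plus a little noise.
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

# A stratified split keeps class proportions similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Hard predictions for accuracy/F1, probabilities for ROC-AUC.
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_proba),
}
print(scores)
```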

**4. Optimization**
- Feature engineering (Income-to-Loan-Ratio)
- Hyperparameter tuning (GridSearchCV, RandomizedSearchCV)
- Handling class imbalance (SMOTE, class_weight)
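
The Income-to-Loan-Ratio feature could be built along these lines. This is a sketch under assumptions: the plan does not say how the ratio is defined, so total household income (applicant plus co-applicant) divided by `LoanAmount` is assumed here, and the column name `Income_to_Loan_Ratio` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with illustrative values.
sample = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, 4000],
    "CoapplicantIncome": [0.0, 1500.0, 1000.0],
    "LoanAmount": [100.0, 150.0, np.nan],
})

# Assumption: the ratio uses total income (applicant + co-applicant).
total_income = sample["ApplicantIncome"] + sample["CoapplicantIncome"]

# A missing LoanAmount propagates as NaN in the new feature.
sample["Income_to_Loan_Ratio"] = total_income / sample["LoanAmount"]

print(sample)
```

Because missing loan amounts produce NaN in the ratio, this feature is best computed after the missing values have been handled.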

**5. Final evaluation**
- Compare results on the test set
- Select the final model
- Interpret the important features (feature importance, SHAP values)

**6. Reporting**
- Write a clear report covering the objective, methodology, results, and recommendations

**7. (Optional) Application**
- Build a dashboard with Streamlit or Gradio that lets a user enter a customer's information and predict their eligibility


# Step 1: Data understanding and preparation

In [1]:
# Import the data
from google.colab import files
files.upload()

Saving loan_prediction.csv to loan_prediction.csv



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('loan_prediction.csv')
df.shape

(614, 13)

In [3]:
df.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [4]:
# Column names
df.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [7]:
df["Property_Area"].unique()

array(['Urban', 'Rural', 'Semiurban'], dtype=object)

In [8]:
df['Credit_History'].unique()

array([ 1.,  0., nan])

In [9]:
for i in df.columns:
  print(i, df[i].unique())

Loan_ID ['LP001002' 'LP001003' 'LP001005' 'LP001006' 'LP001008' 'LP001011'
 'LP001013' 'LP001014' 'LP001018' 'LP001020' 'LP001024' 'LP001027'
 'LP001028' 'LP001029' 'LP001030' 'LP001032' 'LP001034' 'LP001036'
 'LP001038' 'LP001041' 'LP001043' 'LP001046' 'LP001047' 'LP001050'
 'LP001052' 'LP001066' 'LP001068' 'LP001073' 'LP001086' 'LP001087'
 'LP001091' 'LP001095' 'LP001097' 'LP001098' 'LP001100' 'LP001106'
 'LP001109' 'LP001112' 'LP001114' 'LP001116' 'LP001119' 'LP001120'
 'LP001123' 'LP001131' 'LP001136' 'LP001137' 'LP001138' 'LP001144'
 'LP001146' 'LP001151' 'LP001155' 'LP001157' 'LP001164' 'LP001179'
 'LP001186' 'LP001194' 'LP001195' 'LP001197' 'LP001198' 'LP001199'
 'LP001205' 'LP001206' 'LP001207' 'LP001213' 'LP001222' 'LP001225'
 'LP001228' 'LP001233' 'LP001238' 'LP001241' 'LP001243' 'LP001245'
 'LP001248' 'LP001250' 'LP001253' 'LP001255' 'LP001256' 'LP001259'
 'LP001263' 'LP001264' 'LP001265' 'LP001266' 'LP001267' 'LP001273'
 'LP001275' 'LP001279' 'LP001280' 'LP001282' 'LP001289

We will convert the categorical columns to numeric variables to make the analysis easier.
* Column ['Gender']: 0 = Male; 1 = Female;
* Column ['Married']: 0 = No; 1 = Yes;
* Column ['Dependents']: change the type to int (the value '3+' must be replaced by a number first, since it cannot be cast directly);
* Column ['Education']: 0 = Not Graduate; 1 = Graduate;
* Column ['Self_Employed']: 0 = No; 1 = Yes;
* Column ['Credit_History']: 0 = 0.; 1 = 1. (float to int);
* Column ['Loan_Status']: 0 = N (No); 1 = Y (Yes).

Before doing that, we need to remove the missing values.
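
The mapping above, applied after removing missing values, might look like the sketch below. It runs on a hypothetical `sample` DataFrame with made-up rows. Dropping incomplete rows with `dropna()` and treating `'3+'` as 3 are assumptions made for illustration, not choices confirmed by the notebook.

```python
import pandas as pd

# Hypothetical sample (illustrative values) with the columns listed above.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "Married": ["No", "Yes", "Yes", "Yes"],
    "Dependents": ["0", "3+", "1", "2"],
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
    "Self_Employed": ["No", "Yes", "No", "No"],
    "Credit_History": [1.0, 0.0, 1.0, 1.0],
    "Loan_Status": ["Y", "N", "Y", "N"],
})

# Remove incomplete rows first so the integer casts below cannot hit NaN.
clean = sample.dropna().copy()

# Shared mapping for the Yes/No columns.
yes_no = {"No": 0, "Yes": 1}
clean["Gender"] = clean["Gender"].map({"Male": 0, "Female": 1})
clean["Married"] = clean["Married"].map(yes_no)
clean["Self_Employed"] = clean["Self_Employed"].map(yes_no)
clean["Education"] = clean["Education"].map({"Not Graduate": 0, "Graduate": 1})
clean["Loan_Status"] = clean["Loan_Status"].map({"N": 0, "Y": 1})
clean["Credit_History"] = clean["Credit_History"].astype(int)

# Assumption: "3+" is treated as 3 so the column can be cast to int.
clean["Dependents"] = clean["Dependents"].replace("3+", "3").astype(int)

print(clean)
```

Dropping rows is the simplest option but discards data; imputing the missing values instead would keep all 614 rows.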