Exercise 1

Defining the Problem and Data Collection
Project Title: Loan Default Prediction

Problem Statement
Loan default poses a significant financial risk to lending institutions. Predicting whether a loan applicant is likely to default enables banks and financial services to make informed lending decisions, manage risk proactively, and minimize losses.

Objective:
Develop a machine learning model to predict the likelihood of a loan applicant defaulting on their loan, using historical applicant and loan data.

Key Question:
Given a loan applicant’s profile and financial history, how likely is the applicant to default on the loan?

Data Requirements
To build an effective predictive model, we need a comprehensive dataset with both applicant-level and loan-level features. Below is a breakdown of the required data types:

 A. Personal & Demographic Information

| **Feature**          | **Description**                          |
| -------------------- | ---------------------------------------- |
| Age                  | Applicant’s age                          |
| Gender               | Male/Female/Other                        |
| Marital Status       | Married, Single, etc.                    |
| Number of Dependents | Family size that may impact finances     |
| Education Level      | Highest qualification                    |
| Employment Status    | Employed, Unemployed, Self-Employed      |
| Occupation Type      | Profession (e.g., tech, labor, services) |

 B. Financial History & Creditworthiness
| **Feature**          | **Description**                       |
| -------------------- | ------------------------------------- |
| Credit Score         | Score from credit bureau (e.g., FICO) |
| Existing Loans/Debts | Number and amount of ongoing loans    |
| Annual Income        | Total yearly income                   |
| Debt-to-Income Ratio | Key indicator of repayment ability    |
| Bank Account Status  | Balance trends, overdraft history     |

 C. Loan Information
| **Feature**        | **Description**                      |
| ------------------ | ------------------------------------ |
| Loan Amount        | Amount requested                     |
| Loan Term          | Duration (e.g., 36 months)           |
| Interest Rate      | Rate offered for the loan            |
| Loan Purpose       | E.g., Home, Car, Education, Personal |
| Collateral Details | If secured loan                      |
| Loan Approval Date | When the loan was granted            |

D. Repayment History (Target Variable)
| **Feature**            | **Description**                 |
| ---------------------- | ------------------------------- |
| On-Time Payments       | Number of on-time payments made |
| Missed Payments        | Missed or late payments         |
| Default Status (Label) | 0 = No default, 1 = Default     |

3. Data Sources
Data can be collected from the following internal and external sources:

A. Internal Sources (from the financial institution):
Loan Application Records: Contains applicant details and loan amounts.

Payment History Databases: For repayment and delinquency information.

CRM Systems: Customer demographics and interaction logs.

B. External Sources:
Credit Bureaus (e.g., Equifax, TransUnion, Experian): Credit scores, credit reports, delinquencies, public records.

Government Databases: Employment/unemployment records, income tax data.

Open Banking APIs: Transaction history, account balances, income verification.

Third-Party Data Providers: Social media signals, spending behavior (if permitted and ethical).

4. Ethical and Legal Considerations
Ensure data privacy and security (e.g., GDPR, CCPA compliance).

Avoid bias in model predictions (e.g., race, gender-based decisions).

Collect only necessary and consented data.

Summary Table

| **Category**          | **Examples of Data**                 | **Potential Sources**                      |
| --------------------- | ------------------------------------ | ------------------------------------------ |
| Personal Information  | Age, Gender, Marital Status          | Loan applications, CRM                     |
| Financial History     | Credit score, Income, Existing debts | Credit bureaus, bank APIs                  |
| Loan Details          | Loan amount, term, interest rate     | Internal loan records                      |
| Repayment Behavior    | Payment history, Default status      | Loan servicing systems                     |
| External Verification | Employment, tax filings              | Government databases, payroll integrations |





Exercise 2: Feature Selection and Model Choice for Loan Default Prediction

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats 
import numpy as np

df = pd.read_csv("train.csv")

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


From this dataset, identify which features might be most relevant for predicting loan defaults.
Justify your choice of features.

Loan_Status is the target variable. It is binary: Y (loan approved) or N (loan not approved). The column records the approval decision, so approval serves here as a proxy for default risk.

10 most likely relevant features for predicting loan defaults

| **Feature**            | **Why it is relevant**                                                                                      |
| ---------------------- | ----------------------------------------------------------------------------------------------------------- |
| **ApplicantIncome**    | Higher income usually means a better ability to repay the loan, and hence a higher approval probability.    |
| **CoapplicantIncome**  | Similar logic; joint income improves repayment capability.                                                  |
| **LoanAmount**         | A larger loan relative to income is harder to afford and repay.                                             |
| **Loan\_Amount\_Term** | A longer term can reduce monthly payments but increase total interest; affects affordability.               |
| **Credit\_History**    | Previous credit performance is one of the strongest predictors of future repayment behavior.                |
| **Education**          | May indirectly indicate job stability or earning potential.                                                 |
| **Married**            | Dual-income households may be viewed as more financially stable.                                            |
| **Self\_Employed**     | Self-employed individuals might be seen as higher risk due to income variability.                           |
| **Property\_Area**     | Lending practices might vary by region (urban vs. rural, etc.).                                             |
| **Dependents**         | More dependents could imply greater financial burden.                                                       |


To sum up:
Top predictors: Credit_History, ApplicantIncome, LoanAmount, CoapplicantIncome, Loan_Amount_Term

Moderate predictors (contextual/demographic):
Education, Married, Self_Employed, Property_Area, Dependents

Exclude:
Loan_ID: just an identifier, not predictive.

Might not be significant:
Gender: might not be significant on its own; worth checking whether it correlates with income or coapplicant status.

Missing values:
Credit_History, LoanAmount, Self_Employed, and Dependents have missing values (as do Gender, Married, and Loan_Amount_Term, according to df.info()), so imputation or cleaning will be necessary before modeling.
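
A minimal sketch of one way to handle these gaps, assuming median imputation for numeric columns and most-frequent imputation for categorical ones. The small frame below is a hypothetical stand-in for `train.csv` (same column names, made-up values), and `impute_missing` is an illustrative helper, not a pandas function.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset: real column names, made-up values.
sample = pd.DataFrame({
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
    "Self_Employed": ["No", None, "No", "Yes"],
    "Dependents": ["0", "1", None, "0"],
})

def impute_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the median and categorical gaps with the mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Median is less sensitive to outliers than the mean.
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() ignores missing values; take the most frequent category.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

clean = impute_missing(sample)
print(clean.isna().sum())  # every column should report 0 missing values
```

In a real workflow the imputation values should be learned on the training split only and then applied to the test split, to avoid leaking information.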


Exercise 3

Training, Evaluating, and Optimizing the Model
Instructions
Which model(s) would you pick for loan prediction?
Outline the steps to evaluate the model’s performance, mentioning specific metrics that would be relevant to evaluate the model.


| **Model**                                     | **Characteristics**                                                                       | **Advantages**                                               | **Drawbacks**                                     |
| --------------------------------------------- | ----------------------------------------------------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------- |
| **Logistic Regression**                       | Simple, interpretable, good baseline model.                                               | Fast to train, outputs probabilities, easy to explain.       | Limited to linear relationships.                  |
| **Decision Tree**                             | Easy to interpret and visualize, partially handles missing data.                          | Intuitive, no normalization needed.                          | Prone to overfitting (unless pruned).             |
| **Random Forest**                             | Handles non-linearities and interactions, robust, generalizes better than a single tree.  | Reduces overfitting, good accuracy, resistant to noise.      | Less interpretable than logistic regression.      |
| **Gradient Boosting (XGBoost, LightGBM)**     | Performs very well on tabular data, handles missing values and class imbalance well.      | High accuracy, excellent for Kaggle competitions.            | Complex, requires tuning, less transparent.       |
| **Support Vector Machine (SVM)** *(optional)* | Good for small to medium datasets, effective in high dimensions.                          | Performs well on complex data, fairly robust to overfitting. | Less interpretable, sensitive to hyperparameters. |


For Loan Prediction (binary classification problem where Loan_Status is typically Y or N), we need a model that is:

Good at classification

Handles imbalanced classes well (if present)

Interpretable (in real-world loan decisions, explainability is important)

Model evaluation steps
Step 1: Data preparation
Handle missing values (imputation or removal).
Encode categorical variables (label encoding / one-hot encoding).
Scale numeric variables if necessary (for models such as logistic regression and SVM).
Check for class imbalance (Loan_Status Y/N).

Step 2: Train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3: Train the model(s)
Try 2 to 3 models: start with Logistic Regression, then add Random Forest or XGBoost.
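
Steps 1 to 3 can be sketched as a single scikit-learn pipeline. This is an illustration under stated assumptions, not a tuned solution: the data frame is synthetic (real column names, made-up values and a made-up labelling rule), and only a subset of the features is used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for train.csv (values are made up for illustration).
rng = np.random.default_rng(42)
n = 200
data = pd.DataFrame({
    "ApplicantIncome": rng.integers(1500, 10000, n).astype(float),
    "LoanAmount": rng.integers(50, 400, n).astype(float),
    "Credit_History": rng.choice([0.0, 1.0], n, p=[0.2, 0.8]),
    "Property_Area": rng.choice(["Urban", "Semiurban", "Rural"], n),
})
# Made-up label rule: a good credit history usually leads to approval.
y = np.where((data["Credit_History"] == 1.0) & (rng.random(n) < 0.9), "Y", "N")

numeric = ["ApplicantIncome", "LoanAmount", "Credit_History"]
categorical = ["Property_Area"]
preprocess = ColumnTransformer([
    # Numeric columns: impute with the median, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # Categorical columns: one-hot encode, ignoring unseen categories.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Stratify so both splits keep the same Y/N proportions.
X_train, X_test, y_train, y_test = train_test_split(
    data, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
scores = {}
for name, model in models.items():
    clf = Pipeline([("prep", preprocess), ("model", model)])
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # accuracy on the held-out split
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```

Wrapping preprocessing and model in one `Pipeline` keeps imputation and scaling fitted on the training split only, which avoids leakage into the test split.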

Step 4: Evaluate with relevant metrics

| **Metric**               | **Why it matters**                                                                                          |
| ------------------------ | ----------------------------------------------------------------------------------------------------------- |
| **Accuracy**             | Measures overall correctness, but can be misleading when classes are imbalanced.                            |
| **Precision**            | Measures how many “Yes” predictions are actually correct; crucial when false positives are costly.          |
| **Recall (Sensitivity)** | Measures how many actual “Yes” cases are correctly predicted; important when false negatives are risky.     |
| **F1-Score**             | Harmonic mean of precision and recall; a good overall performance indicator.                                |
| **AUC-ROC**              | Measures the model’s ability to rank a positive above a negative; useful for binary classification.         |
| **Confusion Matrix**     | Shows true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).             |
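
As a small worked example, the metrics above can be computed with `sklearn.metrics`. The labels and scores below are made up for illustration (1 = positive class), and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical ground truth and predicted probabilities for 10 applicants.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.6, 0.2, 0.95, 0.1, 0.55]
# Turn probabilities into class predictions with a 0.5 threshold.
y_pred = [int(s >= 0.5) for s in y_score]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# AUC uses the raw scores, not the thresholded predictions.
auc = roc_auc_score(y_true, y_score)

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```

Here one positive is missed and one negative is wrongly flagged, so precision and recall are both 5/6 while accuracy is 0.8.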

| **Phase**           | **Action**                                                                  |
| ------------------- | --------------------------------------------------------------------------- |
| **Data Prep**       | Clean, encode, split the data, normalize if necessary.                      |
| **Model Choices**   | Choose among Logistic Regression, Random Forest, XGBoost.                   |
| **Key Metrics**     | Evaluate with Accuracy, Precision, Recall, F1-score, AUC-ROC.               |
| **Final Selection** | Select the model based on test-set and cross-validation performance.        |
| **Explainability**  | Use SHAP or feature importance (especially in finance).                     |


Task: Predicting Stock Prices
 Proposed Solution: Use a Supervised Learning model – specifically a Time Series Regression model with Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks.
 Why Supervised Learning?
Stock price prediction involves using historical price data (time-stamped) to predict future price values. Since we have labeled data (prices for past time points), it's a regression problem with clear input-output mappings. This fits the supervised learning paradigm.
Recommended Model: LSTM (Long Short-Term Memory Network)
Reasons for choosing LSTM:
| **Criteria**           | **Justification**                                                                  |
| ---------------------- | ---------------------------------------------------------------------------------- |
| Temporal dependencies  | Stock prices depend on historical data — LSTM can learn long-term patterns         |
| Non-linearity          | Markets are nonlinear — LSTMs can model complex, nonlinear relationships           |
| Sequence data handling | LSTM is explicitly designed for sequential data like time series                   |
| Memory of past data    | Unlike traditional RNNs, LSTMs handle vanishing gradients and remember long trends |

Architecture Example:
Input: Historical stock prices (e.g., past 60 days)

LSTM layers (1 or 2 stacked)

Dense layer (fully connected)

Output: Predicted price for next day or next week
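
The input step of this architecture can be sketched with NumPy alone. The snippet below only prepares the data (it does not build or train the LSTM): it slices a price series into 60-day windows with the next day's price as target. `make_windows` is a hypothetical helper and the series is a synthetic random walk, not real market data.

```python
import numpy as np

def make_windows(prices, lookback=60):
    """Turn a 1-D price series into (samples, lookback, 1) inputs and next-day targets."""
    prices = np.asarray(prices, dtype=float)
    # Window i covers days i .. i+lookback-1; its target is day i+lookback.
    X = np.stack([prices[i:i + lookback] for i in range(len(prices) - lookback)])
    y = prices[lookback:]
    # Add a trailing feature axis, the 3-D layout recurrent layers usually expect.
    return X[..., np.newaxis], y

# Synthetic random-walk series standing in for real closing prices.
rng = np.random.default_rng(0)
series = 100 + np.cumsum(rng.normal(0, 1, 300))

X, y = make_windows(series, lookback=60)
print(X.shape, y.shape)  # (240, 60, 1) (240,)

# Chronological split: never shuffle, so the test set lies strictly in the future.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```

The chronological split matters for time series: a random split would let the model see future prices during training.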
Evaluation Metrics:

| **Metric**                            | **Why it's useful**                          |
| ------------------------------------- | -------------------------------------------- |
| MAE (Mean Absolute Error)             | Measures average absolute prediction error   |
| RMSE (Root Mean Squared Error)        | Penalizes larger errors more significantly   |
| MAPE (Mean Absolute Percentage Error) | Good for interpreting performance in % terms |
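
For reference, the three metrics can be computed in a few lines of NumPy. This is a minimal sketch: `regression_errors` is an illustrative helper, the prices are made up, and MAPE assumes no actual value is zero.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE, RMSE and MAPE (in percent) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    # Percentage error relative to the actual value; undefined if y_true has zeros.
    mape = float(np.mean(np.abs(err / y_true)) * 100)
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}

# Hypothetical actual and predicted closing prices.
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 100.0, 103.0, 105.0]

errors = regression_errors(actual, predicted)
print(errors)  # MAE is 1.25 and RMSE is 1.5 for these values
```

RMSE (1.5) exceeds MAE (1.25) here because the two errors of size 2 are weighted more heavily once squared.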

 Challenges to Consider:

| **Challenge**             | **Impact**                                                        |
| ------------------------- | ----------------------------------------------------------------- |
| Market noise & randomness | Stock prices have high volatility and noise                       |
| External factors          | News, events, macroeconomics aren't always captured in price data |
| Overfitting risk          | Especially with deep models and limited historical data           |
| Non-stationarity          | Market patterns can change over time                              |

Enhancements (Optional):
Use additional features: trading volume, moving averages, sentiment analysis (news, tweets)

Apply hybrid models: Combine LSTM with ARIMA or attention mechanisms

Use ensemble methods for improved robustness

Summary

| **Aspect**   | **Choice**                       |
| ------------ | -------------------------------- |
| Model Type   | Supervised Learning (Regression) |
| Algorithm    | LSTM (Long Short-Term Memory)    |
| Input        | Historical price time series     |
| Output       | Future stock price               |
| Alternatives | ARIMA, Prophet, GRU, Transformer |



Exercise 5

Designing an Evaluation Strategy for Different ML Models
Instructions
Select three types of machine learning models: one from supervised learning (e.g., a classification model), one from unsupervised learning (e.g., a clustering model), and one from reinforcement learning. For the supervised model, outline a strategy to evaluate its performance, including the choice of metrics (like accuracy, precision, recall, F1-score) and methods (like cross-validation, ROC curves).
For the unsupervised model, describe how you would assess the effectiveness of the model, considering techniques like silhouette score, elbow method, or cluster validation metrics.
For the reinforcement learning model, discuss how you would measure its success, considering aspects like cumulative reward, convergence, and exploration vs. exploitation balance.
Address the challenges and limitations of evaluating models in each category.


| **Metric**   | **When to use it**                       |
| ------------ | ---------------------------------------- |
| Accuracy     | When classes are balanced                |
| Precision    | When false positives are costly          |
| Recall       | When false negatives are costly          |
| F1-score     | Good trade-off between precision and recall |
| ROC-AUC      | Measures separability between classes    |


| **Method**                  | **Description**                                               |
| --------------------------- | ------------------------------------------------------------- |
| Train/Test Split            | Basic data split (e.g., 80/20)                                |
| K-Fold Cross-Validation     | Robust performance estimate across several splits             |
| Stratified Cross-Validation | Keeps the class proportions in each fold                      |
| Confusion Matrix            | Shows TP, FP, FN, TN                                          |
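
A minimal sketch of stratified cross-validation for the supervised model, assuming a logistic regression classifier and a synthetic imbalanced dataset from `make_classification` (both are illustrative choices, not requirements of the exercise).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary dataset with a roughly 70/30 class imbalance.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.7, 0.3], random_state=0)

# Each of the 5 folds keeps approximately the same class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# F1 is used instead of accuracy because the classes are imbalanced.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="f1")
print(f"F1 per fold: {scores.round(3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows how sensitive the estimate is to the particular split.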



| **Challenge**          | **Description**                                                |
| ---------------------- | -------------------------------------------------------------- |
| Imbalanced classes     | Accuracy is misleading; F1-score and ROC-AUC are more informative |
| Overfitting            | Especially with many features                                  |
| Data leakage           | Watch for target leakage during cross-validation               |



| **Metric / Technique**     | **Purpose**                                                              |
| -------------------------- | ------------------------------------------------------------------------ |
| Silhouette Score           | Measures cluster cohesion and separation (near 1 = good, near -1 = poor) |
| Elbow Method               | Choose the optimal K using the within-cluster sum of squares (SSE)       |
| Davies-Bouldin Index       | Lower values = better cluster separation                                 |
| Visualizations (PCA/t-SNE) | Dimensionality reduction for 2D visualization                            |
| Cluster Purity             | Measures agreement with the true labels (if available)                   |
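
A short sketch of how the elbow method and the silhouette score can be used together to choose K, assuming K-Means as the clustering model. The data are three synthetic, well-separated blobs generated with `make_blobs`; the centers are arbitrary illustration values.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic clusters (centers chosen for illustration).
X, _ = make_blobs(n_samples=300, centers=[(-5, -5), (0, 5), (5, -5)],
                  cluster_std=0.6, random_state=0)
# Scale first: K-Means relies on Euclidean distances.
X = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares used by the elbow method.
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))
    print(f"k={k}: SSE={results[k][0]:.1f}, silhouette={results[k][1]:.3f}")

# Pick the K with the highest silhouette score.
best_k = max(results, key=lambda k: results[k][1])
print("best k by silhouette:", best_k)
```

On real data the elbow is often less clear-cut than here, which is why the two criteria are usually read together rather than in isolation.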



| **Challenge**             | **Description**                                    |
| ------------------------- | -------------------------------------------------- |
| No ground truth           | Hard to evaluate objectively without labels        |
| Subjectivity              | Clusters require human interpretation              |
| Sensitivity to scale      | Distances are sensitive to feature scaling         |
| Initial conditions        | K-Means depends on the initial centroids           |



| **Metric**                   | **Description**                                                                   |
| ---------------------------- | --------------------------------------------------------------------------------- |
| Cumulative Reward            | Total accumulated reward; the main success metric                                 |
| Average Reward per Episode   | Tracks learning progress                                                          |
| Episode Length               | Fewer steps = better path                                                         |
| Convergence Time             | Number of episodes before the policy stabilizes                                   |
| Exploration vs. Exploitation | Tracked via ε in the ε-greedy strategy (ε = probability of taking a random action) |
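
To make these metrics concrete, here is a minimal sketch of an ε-greedy agent on a three-armed bandit, tracking cumulative reward and how often each arm is chosen. The bandit setting, the reward means, and the `run_bandit` helper are all illustrative assumptions, not part of the exercise.

```python
import random

def run_bandit(true_means, steps=2000, epsilon=0.1, noise=0.1, seed=0):
    """Run an epsilon-greedy agent and return (cumulative rewards, estimates, counts)."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)       # how many times each arm was pulled
    estimates = [0.0] * len(true_means)  # running mean reward per arm
    cumulative = 0.0
    history = []
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick a random arm with probability epsilon.
            arm = rng.randrange(len(true_means))
        else:
            # Exploit: pick the arm with the best current estimate.
            arm = max(range(len(true_means)), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], noise)
        counts[arm] += 1
        # Incremental update of the running mean.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        cumulative += reward
        history.append(cumulative)
    return history, estimates, counts

history, estimates, counts = run_bandit([0.2, 0.5, 1.0])
print("cumulative reward:", round(history[-1], 1))
print("average reward per step:", round(history[-1] / len(history), 3))
print("pulls per arm:", counts)
```

Plotting `history` over time shows convergence: the slope approaches the best arm's mean once the agent has identified it, with a small gap caused by continued exploration.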



| **Challenge**                | **Description**                                  |
| ---------------------------- | ------------------------------------------------ |
| Stochastic environments      | Results must be averaged over several episodes   |
| Reward shaping               | Poor reward design can harm learning             |
| Slow convergence             | Often requires many iterations                   |
| Sparse rewards               | Hard to learn without a frequent signal          |



| **Model**     | **Metrics & Methods**                                      | **Main Challenges**                                         |
| ------------- | ---------------------------------------------------------- | ----------------------------------------------------------- |
| Supervised    | Accuracy, Precision, Recall, F1, ROC-AUC, Cross-Validation | Imbalanced data, overfitting, data leakage                  |
| Unsupervised  | Silhouette, Elbow, Cluster Purity, PCA                     | No labels, subjectivity, sensitivity to parameters          |
| Reinforcement | Cumulative reward, Average reward, Convergence             | Sparse rewards, exploration, slow convergence               |