# Exercises XP: Student Notebook

For each exercise, the **Instructions** from the platform are provided, and the **Guidance** explains exactly what you must do to complete the task.

## What you will learn
- How to clearly define and articulate a machine learning problem statement.

- The process of data collection, including identifying relevant data types and potential data sources.

- Skills in feature selection and justification for machine learning models, particularly in the context of loan default prediction.

- Understanding of different types of machine learning models and their suitability for various real-world scenarios.

- Techniques and strategies for evaluating the performance of different machine learning models, including choosing appropriate metrics and understanding their implications.

## What you will create
- A detailed problem statement and data collection plan for a loan default prediction project, including identification of key data types and sources.
- A comprehensive feature selection analysis for a hypothetical loan default prediction dataset.
- A theoretical evaluation strategy for three different types of machine learning models, addressing the unique challenges and metrics relevant to each model type.
- Thoughtful analyses and justifications for choosing specific machine learning approaches for varied scenarios such as stock price prediction, library organization, and robot navigation.
- A document or presentation that showcases your understanding and approach to evaluating and optimizing machine learning models in diverse contexts.

## 🌟 Exercise 1: Defining the Problem and Data Collection for Loan Default Prediction

### Instructions
- Write a clear problem statement for predicting loan defaults.
- Identify and list the types of data you would need for this project (e.g., personal details of applicants, credit scores, loan amounts, repayment history).
- Discuss the sources where you can collect this data (e.g., financial institution's internal records, credit bureaus).

**Expected Output:** A document detailing the problem statement and a comprehensive plan for data collection, including data types and sources.

### Guidance
- Please write your answer as a short document. Begin by stating the prediction objective in a complete sentence that names the target variable and the decision it will support. Then, describe the data types you would collect in complete sentences. For each data type, explain in one sentence why it could help predict loan defaults.

- After that, name realistic data sources in complete sentences, and briefly describe how you would obtain or integrate each source.

- Finally, include one paragraph that explains risks and constraints such as privacy, regulation, data quality, sampling bias, and governance.

### Your answer
> The objective of this project is to predict whether a loan applicant will default on their loan, using historical, personal, and financial information. The target variable is a binary indicator of default (Yes/No), and the prediction will support risk-based decision-making for loan approvals.

> To build this model, several categories of data are required. Personal and socio-economic details such as age, gender, marital status, number of dependents, employment status, income level, and region of residence help estimate financial stability and repayment capacity. Financial and credit data such as credit scores, existing debts, monthly financial obligations, and, if available, bank account history are essential to evaluate creditworthiness and financial discipline. Loan-specific attributes, including loan amount, duration, interest rate, and loan purpose, further influence the risk associated with the credit request. Finally, repayment history, including past late payments or previous defaults, provides one of the strongest predictors of future repayment behavior.

> These data can be collected from institutional records (client portfolios, repayment logs, past applications), customer-provided application forms, and public socio-economic datasets that complement demographic and regional information.

> Working with personal and financial data introduces critical risks and constraints that must be managed carefully. The institution must comply with privacy and data protection regulations, such as CNDP guidelines and general rules governing personally identifiable information. Potential challenges include data quality issues, missing or inconsistent values, biased historical decisions that may unintentionally propagate discrimination, and sampling bias that could distort the model's predictions. Strong data governance is therefore essential, ensuring secure storage, restricted access, anonymization where appropriate, and ethical handling of sensitive data throughout the entire model lifecycle.

## 🌟 Exercise 2: Feature Selection and Model Choice for Loan Default Prediction

### Instructions
From this dataset, identify which features might be most relevant for predicting loan defaults.
Justify your choice of features.

### Guidance
- First, identify the features that you believe are most relevant, and write their names in a sentence.
Then, provide a justification in complete sentences that explains how each selected feature relates to the likelihood of default.

- If you decide to exclude common features, write one sentence for each excluded feature to explain why it is not appropriate in this context.

- Conclude with two complete sentences that explain how you would encode categorical features and how you would impute missing values.

```python
# This piece of code is already prefilled; run it to see the results.
# It provides a simple template you can modify while writing your justification.

import pandas as pd

# This placeholder DataFrame allows the cell to run even if you did not load a dataset yet.
example_columns = [
    "age", "employment_length", "annual_income", "credit_score", "loan_amount", "interest_rate",
    "debt_to_income", "num_delinquencies", "num_open_accounts", "total_utilization", "home_ownership",
    "purpose", "term", "application_type", "state", "zip_code",
]
df = pd.DataFrame(columns=example_columns)

# Please replace this list with the actual columns that you select.
selected_features = [
    # e.g., "credit_score", "debt_to_income", "annual_income", "loan_amount", "interest_rate",
    # "employment_length", "num_delinquencies", "total_utilization"
]

print("You will now justify the selected features in complete sentences below.")
```

```
You will now justify the selected features in complete sentences below.
```


### Your justification
> The features that I believe are most relevant are Applicant Income (which measures the applicant's revenue), Loan Amount, Credit History, and Self Employed.

Also explain, in two complete sentences, how you would encode categorical variables and how you would impute missing values.

### Feature Selection and Model Choice for Loan Default Prediction

> The most relevant features for predicting loan default are ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Dependents, Education, Self_Employed, Married, Property_Area, and Gender. Income-related features such as ApplicantIncome and CoapplicantIncome help evaluate the borrower's repayment capacity. LoanAmount and Loan_Amount_Term are directly tied to affordability and monthly repayment pressure. Credit_History is one of the strongest indicators of financial discipline and past borrower behavior. Additionally, demographic and socio-economic attributes like Dependents, Education, Married status, Gender, and Self_Employed provide valuable context on financial responsibilities, job stability, and the applicant's overall repayment reliability. Property_Area may also influence default risk due to economic variations between urban, semi-urban, and rural zones.

> Loan_ID is excluded from modeling because it is only an identifier and contains no predictive value. Loan_Status is not considered a feature since it represents the target variable that the model aims to predict.

> Categorical variables such as Gender, Married, Education, Self_Employed, and Property_Area will be encoded using one-hot encoding or label encoding, depending on the needs of the algorithm. Missing values will be handled through median imputation for numerical variables and mode imputation for categorical variables, ensuring consistent and reliable data preprocessing.

## 🌟 Exercise 3: Training, Evaluating, and Optimizing the Model

### Instructions
Which model(s) would you pick for loan default prediction?
Outline the steps to evaluate the model's performance, mentioning specific metrics that would be relevant.

### Guidance
- Begin by naming one or two candidate models in a complete sentence and explain why each model is suitable for this problem.

- Next, describe an evaluation plan in complete sentences that covers the data split, the cross-validation strategy, the metrics you will report, and how you will choose a decision threshold.

- Then, explain in complete sentences how you will address class imbalance using stratification, class weights, or resampling.

- Finally, state in one or two complete sentences how you would iterate on hyperparameters to improve performance while avoiding data leakage.

```python
# This piece of code is already prefilled; run it to see the results.
# It demonstrates standard classification metrics for binary loan default prediction.

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, confusion_matrix, classification_report,
)

# Please replace these placeholders with your true labels and predicted probabilities.
y_true = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]  # placeholder labels
y_pred_proba = [0.05, 0.80, 0.10, 0.65, 0.20, 0.15, 0.70, 0.30, 0.85, 0.25]  # placeholder probabilities

# You should set a decision threshold that reflects the precision–recall trade-off for your business case.
threshold = 0.5
y_pred = [1 if p >= threshold else 0 for p in y_pred_proba]

print("Accuracy:", round(accuracy_score(y_true, y_pred), 4))
print("Precision:", round(precision_score(y_true, y_pred, zero_division=0), 4))
print("Recall:", round(recall_score(y_true, y_pred, zero_division=0), 4))
print("F1-score:", round(f1_score(y_true, y_pred, zero_division=0), 4))
print("ROC-AUC:", round(roc_auc_score(y_true, y_pred_proba), 4))
print("PR-AUC (Average Precision):", round(average_precision_score(y_true, y_pred_proba), 4))
print("\nConfusion matrix:\n", confusion_matrix(y_true, y_pred))
print("\nClassification report:\n", classification_report(y_true, y_pred, zero_division=0))
```

```
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-score: 1.0
ROC-AUC: 1.0
PR-AUC (Average Precision): 1.0

Confusion matrix:
 [[6 0]
 [0 4]]

Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         6
           1       1.00      1.00      1.00         4

    accuracy                           1.00        10
   macro avg       1.00      1.00      1.00        10
weighted avg       1.00      1.00      1.00        10
```


### Your answer
> For the loan default prediction task, I would choose Logistic Regression and Random Forest as candidate models. Logistic Regression is suitable because it provides interpretable coefficients and performs well on binary classification problems with structured data. Random Forest is also appropriate because it captures non-linear relationships, handles mixed feature types, and is robust to noise and missing values.

> To evaluate the models, I would first split the dataset into training and test sets using a stratified split to preserve the proportion of default vs. non-default cases. I would then apply cross-validation, such as 5-fold stratified cross-validation, to obtain stable performance estimates. The main metrics I would report are recall, precision, F1-score, ROC-AUC, and the confusion matrix, since predicting defaults is a high-risk decision. I would also tune and interpret the decision threshold by analyzing the precision-recall tradeoff to reduce false negatives, which are costly in loan approval scenarios.
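
A minimal sketch of this evaluation plan is shown below. It assumes a synthetic dataset from `make_classification` in place of real loan data, and the 0.80 recall target is a hypothetical business rule, not something stated in the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import (
    StratifiedKFold, cross_val_predict, cross_val_score, train_test_split,
)

# Synthetic stand-in for a loan dataset (assumption): roughly 20% of rows are defaults (class 1).
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42)

# Stratified hold-out split keeps the default rate similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 5-fold stratified cross-validation on the training part only.
cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print("CV ROC-AUC: mean", round(cv_auc.mean(), 3), "std", round(cv_auc.std(), 3))

# Out-of-fold probabilities let us pick a threshold without touching the test set.
oof_proba = cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
precision, recall, thresholds = precision_recall_curve(y_train, oof_proba)

# Hypothetical rule: keep recall at or above 0.80, then take the highest such threshold.
target_recall = 0.80
meets_target = recall[:-1] >= target_recall  # recall has one more entry than thresholds
chosen_threshold = thresholds[meets_target].max()
print("Chosen threshold:", round(float(chosen_threshold), 3))
```

Taking the highest threshold that still meets the recall target limits false negatives first and then reduces false positives as far as that constraint allows.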

> To address class imbalance, I would use stratified sampling during splitting and cross-validation, and I would test both class weights and resampling techniques such as SMOTE or undersampling. These methods help ensure the model does not become biased toward the majority class.
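
As an illustration of the class-weight option, the sketch below compares a plain and a class-weighted Logistic Regression on synthetic imbalanced data (an assumption, not the loan dataset). Resampling with SMOTE is left out because it needs the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): about 10% of rows belong to the positive "default" class.
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0)

# stratify=y preserves the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# class_weight="balanced" weights each class inversely to its frequency in y_train.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print("Default rate train/test:", round(y_train.mean(), 3), round(y_test.mean(), 3))
print("Recall without class weights:", round(recall_plain, 3))
print("Recall with class weights:", round(recall_weighted, 3))
```

Class weighting typically raises recall on the minority class at the cost of some precision, so the two models should be compared on both metrics before choosing one.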

> Finally, I would optimize performance by tuning hyperparameters through GridSearchCV or RandomizedSearchCV, ensuring that all preprocessing steps occur inside a pipeline to avoid data leakage. Hyperparameter search would be repeated only on the training folds to maintain fairness and validity of the evaluation.
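
One way to set this up is sketched below, again on synthetic data. The grid of `C` values and the choice of average precision as the search metric are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# The scaler lives inside the pipeline, so it is refitted on each training fold
# and never sees the validation fold or the test set.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid
search = GridSearchCV(
    pipe,
    param_grid,
    scoring="average_precision",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV average precision:", round(search.best_score_, 3))

# The test set is used once, after the search, for a final estimate.
test_score = search.score(X_test, y_test)
print("Test average precision:", round(test_score, 3))
```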

## 🌟 Exercise 4: Designing Machine Learning Solutions for Specific Problems

### Instructions
For each of these scenarios, decide which type of machine learning would be most suitable, and explain your choice.

- Predicting stock prices: predict future prices.
- Organizing a library of books: group books into genres or categories based on similarities.
- Programming a robot to navigate and find the shortest path in a maze.

### Guidance
Please identify the appropriate machine learning paradigm for each scenario in complete sentences and justify your choice.

For each scenario, write one complete sentence that describes the input data, one complete sentence that describes the output, and one complete sentence that describes the learning signal or objective.

### Your answer
#### 1. Predicting Stock Prices

Type: Supervised learning (regression).

Input: Historical stock prices and financial indicators.

Output: A numerical prediction of the future stock price.

Learning signal: The model minimizes the error between predicted and actual prices.
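
To make the regression framing concrete, here is a small sketch that turns a price series into lagged inputs and a next-step target. The series is a synthetic random walk (an assumption, not market data), and the use of three lags and a linear model is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0, 1, size=300))  # synthetic random walk

n_lags = 3
# Row j holds prices[j], prices[j+1], prices[j+2]; the target is prices[j+3].
X = np.column_stack([prices[i:len(prices) - n_lags + i] for i in range(n_lags)])
y = prices[n_lags:]

# Chronological split: train on the past, evaluate on the most recent 20%, no shuffling.
split = int(0.8 * len(y))
model = LinearRegression().fit(X[:split], y[:split])

mae = mean_absolute_error(y[split:], model.predict(X[split:]))
print("Mean absolute error on the held-out period:", round(mae, 3))
```

The mean absolute error printed at the end is the kind of error signal the model minimizes during training, here measured on data it has not seen.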

#### 2. Organizing a Library of Books

Type: Unsupervised learning (clustering).

Input: Text descriptions or metadata of books.

Output: Groups of similar books.

Learning signal: The model identifies natural patterns without labels.
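
A toy version of this clustering setup might look as follows. The four book descriptions are invented for illustration, and TF-IDF with k-means is just one reasonable pairing among several.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented descriptions (assumption): two crime stories and two science-fiction stories.
descriptions = [
    "A detective investigates a murder in a quiet village",
    "The detective solves a murder case in London",
    "A spaceship crew explores a distant alien planet",
    "Astronauts travel to an alien planet on a spaceship",
]

# Turn each description into a TF-IDF vector; no genre labels are provided.
X = TfidfVectorizer(stop_words="english").fit_transform(descriptions)

# Ask for two groups; the cluster ids themselves are arbitrary.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for text, label in zip(descriptions, labels):
    print(label, text)
```

The algorithm only sees word overlap, so the resulting groups still need a human to name them as genres.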

#### 3. Robot Navigating a Maze

Type: Reinforcement learning.

Input: The robot's current state and surrounding environment.

Output: The next action the robot should take.

Learning signal: Rewards for actions that lead to faster or successful navigation.
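
The state, action, and reward framing for the maze can be sketched with tabular Q-learning. The 4x4 maze layout, the reward values, and the hyperparameters below are all hypothetical choices made for this illustration.

```python
import random

# Hypothetical maze: S = start, G = goal, # = wall, . = free cell.
MAZE = [
    "S.#.",
    ".##.",
    "....",
    "#.#G",
]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
START, GOAL = (0, 0), (3, 3)


def step(state, action):
    """Apply an action; hitting a wall or the border leaves the state unchanged."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < 4 and 0 <= c < 4) or MAZE[r][c] == "#":
        r, c = state
    done = (r, c) == GOAL
    reward = 10.0 if done else -1.0  # every step costs 1, so shorter paths earn more
    return (r, c), reward, done


random.seed(0)
Q = {}  # maps (state, action_index) to the estimated return
alpha, gamma, epsilon = 0.5, 0.95, 0.2


def best_action(state):
    return max(range(4), key=lambda a: Q.get((state, a), 0.0))


for episode in range(500):
    state = START
    for _ in range(100):  # cap on episode length
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        a = random.randrange(4) if random.random() < epsilon else best_action(state)
        nxt, reward, done = step(state, ACTIONS[a])
        future = 0.0 if done else gamma * max(Q.get((nxt, b), 0.0) for b in range(4))
        old = Q.get((state, a), 0.0)
        Q[(state, a)] = old + alpha * (reward + future - old)
        state = nxt
        if done:
            break

# Follow the learned greedy policy from the start cell.
state, path = START, [START]
for _ in range(20):
    if state == GOAL:
        break
    state, _, _ = step(state, ACTIONS[best_action(state)])
    path.append(state)
print(path)
```

The per-step penalty is what ties the reward signal to path length: the agent is never told the shortest route, it only observes that longer routes yield lower total reward.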

## 🌟 Exercise 5: Designing an Evaluation Strategy for Different ML Models

### Instructions
- Select three types of machine learning models: one from supervised learning (e.g., a classification model), one from unsupervised learning (e.g., a clustering model), and one from reinforcement learning.
- For the supervised model, outline a strategy to evaluate its performance, including the choice of metrics (like accuracy, precision, recall, F1-score) and methods (like cross-validation, ROC curves).
- For the unsupervised model, describe how you would assess the effectiveness of the model, considering techniques like silhouette score, elbow method, or cluster validation metrics.
- For the reinforcement learning model, discuss how you would measure its success, considering aspects like cumulative reward, convergence, and exploration vs. exploitation balance.
- Address the challenges and limitations of evaluating models in each category.

### Guidance
- Please write a separate paragraph for each of the three model categories.
- In the supervised paragraph, describe your validation plan and list the metrics you will report in complete sentences.
- In the unsupervised paragraph, explain how you would measure cluster quality or structure in complete sentences and mention any diagnostic plots.
- In the reinforcement learning paragraph, describe how you would track cumulative reward, assess convergence, and balance exploration and exploitation using complete sentences.
- Conclude with one complete sentence per category that states a key evaluation challenge.

```python
# This piece of code is already prefilled; run it to see the results.
# Supervised classification metrics template with placeholders.

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
)

# Replace these placeholders with your real outputs.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
y_pred_proba = [0.1, 0.7, 0.8, 0.2, 0.6, 0.3, 0.4, 0.9, 0.2, 0.85]
threshold = 0.5
y_pred = [1 if p >= threshold else 0 for p in y_pred_proba]

print("Accuracy:", round(accuracy_score(y_true, y_pred), 4))
print("Precision:", round(precision_score(y_true, y_pred, zero_division=0), 4))
print("Recall:", round(recall_score(y_true, y_pred, zero_division=0), 4))
print("F1-score:", round(f1_score(y_true, y_pred, zero_division=0), 4))
print("ROC-AUC:", round(roc_auc_score(y_true, y_pred_proba), 4))
print("PR-AUC (Average Precision):", round(average_precision_score(y_true, y_pred_proba), 4))
```

```
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-score: 1.0
ROC-AUC: 1.0
PR-AUC (Average Precision): 1.0
```


```python
# This piece of code is already prefilled; run it to see the results.
# Unsupervised clustering metrics template with synthetic data.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init="auto", random_state=42)
labels = kmeans.fit_predict(X)
sil = silhouette_score(X, labels)
print("Silhouette score (higher is better):", round(sil, 4))

print("Please explain in complete sentences when you would use the elbow method and how you would interpret it.")
```

```
Silhouette score (higher is better): 0.848
Please explain in complete sentences when you would use the elbow method and how you would interpret it.
```




### Your answer
#### Supervised Learning Model (Classification)
> To evaluate a supervised classification model, I would begin by splitting the data into training and test sets and applying cross-validation to ensure stable performance across folds. I would report accuracy, precision, recall, and F1-score to capture both overall correctness and class-specific performance, and I would plot the ROC curve and compute the AUC to evaluate ranking quality. I would also examine the confusion matrix to identify systematic classification errors. A key challenge in evaluating supervised models is dealing with class imbalance, which can make accuracy misleading.

#### Unsupervised Learning Model (Clustering)
> To assess an unsupervised clustering model, I would measure cluster structure using metrics such as the silhouette score to quantify separation and cohesion. I would also use the elbow method to inspect how the within-cluster variance changes with different numbers of clusters and rely on diagnostic plots like silhouette diagrams or inertia curves to visually judge cluster quality. If ground truth labels exist, I could additionally compute adjusted Rand index or mutual information. A major challenge in evaluating unsupervised models is the absence of true labels, which makes objective evaluation difficult.
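
The elbow method mentioned above can be sketched on the same kind of synthetic blobs used in the prefilled cell. The range of `k` values and the use of `n_init=10` are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups, as in the prefilled cell.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Inertia is the within-cluster sum of squared distances to the nearest centroid.
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

# The "elbow" is the k after which adding a cluster stops reducing inertia by much.
for k in range(2, 8):
    drop = inertias[k - 1] - inertias[k]
    print("k =", k, "inertia =", round(inertias[k], 1), "drop from previous k =", round(drop, 1))
```

On data like this, the drop in inertia should be large up to three clusters and small afterwards, which is the pattern one would read as an elbow at k = 3. On real data the bend is often less clear, so it is usually read together with the silhouette score.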

#### Reinforcement Learning Model
> To evaluate a reinforcement learning model, I would track cumulative reward over episodes to measure how effectively the agent learns to maximize long-term returns. I would check for convergence by observing whether the learning curve stabilizes and ensure an appropriate exploration–exploitation balance by monitoring how often the agent explores new actions versus exploiting known profitable ones. I would also test the learned policy in multiple environments or seeds to ensure stability and robustness. A key challenge in evaluating reinforcement learning models is that performance can vary widely depending on stochastic environments and exploration behavior.
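
These tracking ideas can be illustrated on a very small problem. The sketch below uses an epsilon-greedy agent on a hypothetical three-armed bandit (the win probabilities, step count, and epsilon are assumptions) and reports reward early versus late in training, the observed exploration rate, and the spread across seeds.

```python
import random


def run_bandit(seed, steps=2000, epsilon=0.1):
    """Run an epsilon-greedy agent; return the reward history and the exploration rate."""
    rng = random.Random(seed)
    win_prob = [0.2, 0.5, 0.8]  # hypothetical payout probability of each arm
    counts = [0, 0, 0]
    values = [0.0, 0.0, 0.0]  # running mean reward per arm
    rewards, explored = [], 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(3)  # explore: pick a random arm
            explored += 1
        else:
            arm = max(range(3), key=lambda a: values[a])  # exploit the best estimate
        reward = 1.0 if rng.random() < win_prob[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        rewards.append(reward)
    return rewards, explored / steps


results = []
for seed in range(3):
    rewards, explore_rate = run_bandit(seed)
    early = sum(rewards[:200]) / 200   # average reward at the start of learning
    late = sum(rewards[-200:]) / 200   # average reward once learning has settled
    results.append((sum(rewards), early, late, explore_rate))
    print("seed", seed, "cumulative reward", sum(rewards),
          "early avg", round(early, 3), "late avg", round(late, 3),
          "exploration rate", round(explore_rate, 3))
```

Comparing the early and late averages gives a rough convergence check, and running several seeds shows how much of the difference between runs comes from randomness rather than from the agent.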