# Exercises XP: Student Notebook

For each exercise, the **Instructions** from the platform are provided, and the **Guidance** explains exactly what you must do to complete the task.

## What you will learn
- How to clearly define and articulate a machine learning problem statement.

- The process of data collection, including identifying relevant data types and potential data sources.

- Skills in feature selection and justification for machine learning models, particularly in the context of loan default prediction.

- Understanding of different types of machine learning models and their suitability for various real-world scenarios.

- Techniques and strategies for evaluating the performance of different machine learning models, including choosing appropriate metrics and understanding their implications.

## What you will create
- A detailed problem statement and data collection plan for a loan default prediction project, including identification of key data types and sources.
- A comprehensive feature selection analysis for a hypothetical loan default prediction dataset.
- A theoretical evaluation strategy for three different types of machine learning models, addressing the unique challenges and metrics relevant to each model type.
- Thoughtful analyses and justifications for choosing specific machine learning approaches for varied scenarios such as stock price prediction, library organization, and robot navigation.
- A document or presentation that showcases your understanding and approach to evaluating and optimizing machine learning models in diverse contexts.

## 🌟 Exercise 1 : Defining the Problem and Data Collection for Loan Default Prediction

### Instructions
- Write a clear problem statement for predicting loan defaults.
- Identify and list the types of data you would need for this project (e.g., personal details of applicants, credit scores, loan amounts, repayment history).
- Discuss the sources where you can collect this data (e.g., financial institution's internal records, credit bureaus).

**Expected Output:** A document detailing the problem statement and a comprehensive plan for data collection, including data types and sources.

### Guidance
- Please write your answer as a short document. Begin by stating the prediction objective in a complete sentence that names the target variable and the decision it will support. Then, describe the data types you would collect in complete sentences. For each data type, explain in one sentence why it could help predict loan defaults.

- After that, name realistic data sources in complete sentences, and briefly describe how you would obtain or integrate each source.

- Finally, include one paragraph that explains risks and constraints such as privacy, regulation, data quality, sampling bias, and governance.

### Your answer
The objective is to predict whether a loan applicant will default on their loan (that is, fail to complete repayment after initiating it) in order to help the lending institution make better approval decisions and reduce financial losses.

**Data Types to Collect**

- **Personal financial profile.** This includes the applicant's income, employment status, and account balance at the time of application. Account balance is particularly relevant because a low or negative balance suggests the applicant may already be struggling financially before the loan is even granted.
- **Credit history.** This covers whether the applicant has taken out loans in the past, how many, and whether they were repaid in full. A history of incomplete or late repayments is one of the strongest predictors of future default behavior.
- **Loan characteristics.** This includes the amount requested, the loan duration, and the interest rate. Higher loan amounts relative to income increase the repayment burden and therefore the default risk.
- **Transaction and account activity.** This refers to the pattern of deposits, withdrawals, and overdrafts in the applicant's bank account over time. Irregular or declining account activity can signal financial instability even when the current balance appears acceptable.

**Data Sources**

Internal bank records are the primary source and can be accessed directly from the institution's own database. They contain account balances, transaction history, existing loan records, and application details for current and past clients.


## 🌟 Exercise 2 : Feature Selection and Model Choice for Loan Default Prediction

### Instructions
From this dataset, identify which features might be most relevant for predicting loan defaults.
Justify your choice of features.

### Guidance
- First, identify the features that you believe are most relevant, and write their names in a sentence.
Then, provide a justification in complete sentences that explains how each selected feature relates to the likelihood of default.

- If you decide to exclude common features, write one sentence for each excluded feature to explain why it is not appropriate in this context.

- Conclude with two complete sentences that explain how you would encode categorical features and how you would impute missing values.

In [None]:

# This piece of code is already prefilled, run it to execute it and see the results.
# It provides a simple template you can modify while writing your justification.

import pandas as pd

# This placeholder DataFrame allows the cell to run even if you did not load a dataset yet.
example_columns = [
    "age","employment_length","annual_income","credit_score","loan_amount","interest_rate",
    "debt_to_income","num_delinquencies","num_open_accounts","total_utilization","home_ownership",
    "purpose","term","application_type","state","zip_code"
]
df = pd.DataFrame(columns=example_columns)

# Please replace this list with the actual columns that you select.
selected_features = [
    # e.g., "credit_score","debt_to_income","annual_income","loan_amount","interest_rate",
    # "employment_length","num_delinquencies","total_utilization"
]

print("You will now justify the selected features in complete sentences below.")

### Your justification
> - `Credit_History` is arguably the single most important feature. It indicates whether the applicant has successfully repaid debts in the past. An applicant with a poor credit history has already shown difficulty honoring financial commitments, which raises the likelihood of another default, while an applicant with no credit history gives the lender no evidence either way.
> - `LoanAmount` matters because the larger the sum borrowed, the heavier the monthly repayment burden. A person borrowing more than their financial situation can handle is more likely to default, especially if their income is limited.
> - `Loan_Amount_Term` (the duration of the loan in months) affects how much the applicant pays each month. A short term on a large loan means high monthly payments and therefore higher default risk, while a very long term exposes the lender to risk over a longer period of uncertainty.
> - `Self_Employed` is relevant because self-employed applicants typically have more irregular, less predictable income than salaried employees. This income instability makes it harder to guarantee consistent monthly repayments, increasing default risk.
> - `Loan_Status` is not a feature: it is the target variable, the thing the model is trying to predict. It must not be included as an input to the model; its role is to be the label the model learns from.
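
The selection described above can be sketched in code. This is a minimal illustration, not a definitive pipeline: the four rows of data are invented, and the choices of median imputation, most-frequent imputation, and one-hot encoding are assumptions made for the example. Only the column names come from the justification.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows; only the column names come from the justification above.
loans = pd.DataFrame({
    "Credit_History": [1.0, 0.0, np.nan, 1.0],
    "LoanAmount": [120.0, np.nan, 66.0, 200.0],
    "Loan_Amount_Term": [360.0, 180.0, 360.0, np.nan],
    "Self_Employed": pd.Series(["No", "Yes", np.nan, "No"], dtype=object),
    "Loan_Status": ["Y", "N", "Y", "N"],
})

numeric_features = ["Credit_History", "LoanAmount", "Loan_Amount_Term"]
categorical_features = ["Self_Employed"]

# The target is kept apart from the inputs so it can never leak into them.
X = loans[numeric_features + categorical_features]
y = loans["Loan_Status"]

preprocess = ColumnTransformer([
    # Assumed strategy: fill numeric gaps with the column median.
    ("num", SimpleImputer(strategy="median"), numeric_features),
    # Assumed strategy: fill categorical gaps with the most frequent value,
    # then one-hot encode the categories.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

X_ready = preprocess.fit_transform(X)
print("Prepared feature matrix shape:", X_ready.shape)
```

In a real project the preprocessing would be fitted on the training split only, so that statistics such as the median are not computed from test rows.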

## 🌟 Exercise 3 : Training, Evaluating, and Optimizing the Model

### Instructions
Which model(s) would you pick for loan default prediction?
Outline the steps to evaluate the model's performance, mentioning specific metrics that would be relevant.

### Guidance
- Begin by naming one or two candidate models in a complete sentence and explain why each model is suitable for this problem.

- Next, describe an evaluation plan in complete sentences that covers the data split, the cross-validation strategy, the metrics you will report, and how you will choose a decision threshold.

- Then, explain in complete sentences how you will address class imbalance using stratification, class weights, or resampling.

- Finally, state in one or two complete sentences how you would iterate on hyperparameters to improve performance while avoiding data leakage.

In [None]:

# This piece of code is already prefilled, run it to execute it and see the results.
# It demonstrates standard classification metrics for binary loan default prediction.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, average_precision_score, confusion_matrix, classification_report

# Please replace these placeholders with your true labels and predicted probabilities.
y_true = [0,1,0,1,0,0,1,0,1,0]            # placeholder labels
y_pred_proba = [0.05,0.80,0.10,0.65,0.20,0.15,0.70,0.30,0.85,0.25]  # placeholder probabilities

# You should set a decision threshold that reflects the precision-recall trade-off for your business case.
threshold = 0.5
y_pred = [1 if p >= threshold else 0 for p in y_pred_proba]

print("Accuracy:", round(accuracy_score(y_true, y_pred), 4))
print("Precision:", round(precision_score(y_true, y_pred, zero_division=0), 4))
print("Recall:", round(recall_score(y_true, y_pred, zero_division=0), 4))
print("F1-score:", round(f1_score(y_true, y_pred, zero_division=0), 4))
print("ROC-AUC:", round(roc_auc_score(y_true, y_pred_proba), 4))
print("PR-AUC (Average Precision):", round(average_precision_score(y_true, y_pred_proba), 4))
print("\nConfusion matrix:\n", confusion_matrix(y_true, y_pred))
print("\nClassification report:\n", classification_report(y_true, y_pred, zero_division=0))

### Your answer
> Logistic regression is a sensible starting point. It is simple, fast, and, most importantly, interpretable: a bank needs to be able to explain to a regulator or a customer why a loan was rejected. Logistic regression gives you clear coefficients showing the direction and weight of each feature (the weights are comparable in magnitude once the features are standardized).
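
As a rough sketch of how this choice could be evaluated, the example below fits a logistic regression on synthetic data and prints its coefficients. The dataset from `make_classification`, the 90/10 class ratio, the `class_weight="balanced"` setting, and the 5-fold stratified split are all assumptions made for illustration, not part of the answer above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for a loan dataset (roughly 10% positives).
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own
# training part only, which avoids leaking test-fold statistics.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("ROC-AUC per fold:", scores.round(3))

# Fit once on all data to inspect the sign and size of each coefficient.
model.fit(X, y)
coefficients = model.named_steps["logisticregression"].coef_[0]
for index, weight in enumerate(coefficients):
    print(f"feature_{index}: {weight:+.3f}")
```

A positive coefficient pushes the prediction toward the positive class and a negative one pushes it away, which is the kind of explanation a lender can pass on to a customer.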

## 🌟 Exercise 4 : Designing Machine Learning Solutions for Specific Problems

### Instructions
For each of these scenarios, decide which type of machine learning would be most suitable, and explain your choice.

- Predicting stock prices: predict future prices.
- Organizing a library of books: group books into genres or categories based on similarities.
- Programming a robot to navigate and find the shortest path in a maze.

### Guidance
Please identify the appropriate machine learning paradigm for each scenario in complete sentences and justify your choice.

For each scenario, write one complete sentence that describes the input data, one complete sentence that describes the output, and one complete sentence that describes the learning signal or objective.

### Your answer
> **Predicting Stock Prices** → Supervised Learning (Regression)
>
> This is a supervised learning problem because you have historical data where both the input (past prices, trading volume, economic indicators) and the output (the actual price that occurred) are known. The model learns the relationship between inputs and outputs from this labeled historical data, then uses that relationship to predict future values.
>
> **Organizing a Library of Books** → Unsupervised Learning (Clustering)
>
> Here you have no predefined labels. Nobody has already tagged each book with its correct genre; the goal is precisely to discover natural groupings based on similarities in the content, writing style, or vocabulary. This is the definition of unsupervised learning: finding structure in data without being told what the answer should be.
>
> **Robot Navigating a Maze** → Reinforcement Learning
>
> This scenario fits neither supervised nor unsupervised learning, because there is no dataset to learn from upfront. Instead, the robot must learn by trial and error through interaction with its environment. It tries a path, hits a wall, receives a negative reward, tries another direction, reaches the exit faster, receives a positive reward, and gradually learns which actions lead to the best outcomes.

## 🌟 Exercise 5 : Designing an Evaluation Strategy for Different ML Models

### Instructions
- Select three types of machine learning models: one from supervised learning (e.g., a classification model), one from unsupervised learning (e.g., a clustering model), and one from reinforcement learning.
- For the supervised model, outline a strategy to evaluate its performance, including the choice of metrics (like accuracy, precision, recall, F1-score) and methods (like cross-validation, ROC curves).
- For the unsupervised model, describe how you would assess the effectiveness of the model, considering techniques like silhouette score, elbow method, or cluster validation metrics.
- For the reinforcement learning model, discuss how you would measure its success, considering aspects like cumulative reward, convergence, and exploration vs. exploitation balance.
- Address the challenges and limitations of evaluating models in each category.

### Guidance
- Please write a separate paragraph for each of the three model categories.
- In the supervised paragraph, describe your validation plan and list the metrics you will report in complete sentences.
- In the unsupervised paragraph, explain how you would measure cluster quality or structure in complete sentences and mention any diagnostic plots.
- In the reinforcement learning paragraph, describe how you would track cumulative reward, assess convergence, and balance exploration and exploitation using complete sentences.
- Conclude with one complete sentence per category that states a key evaluation challenge.

In [None]:

# This piece of code is already prefilled, run it to execute it and see the results.
# Supervised classification metrics template with placeholders.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, average_precision_score

# Replace these placeholders with your real outputs.
y_true = [0,1,1,0,1,0,0,1,0,1]
y_pred_proba = [0.1,0.7,0.8,0.2,0.6,0.3,0.4,0.9,0.2,0.85]
threshold = 0.5
y_pred = [1 if p >= threshold else 0 for p in y_pred_proba]

print("Accuracy:", round(accuracy_score(y_true, y_pred), 4))
print("Precision:", round(precision_score(y_true, y_pred, zero_division=0), 4))
print("Recall:", round(recall_score(y_true, y_pred, zero_division=0), 4))
print("F1-score:", round(f1_score(y_true, y_pred, zero_division=0), 4))
print("ROC-AUC:", round(roc_auc_score(y_true, y_pred_proba), 4))
print("PR-AUC (Average Precision):", round(average_precision_score(y_true, y_pred_proba), 4))

In [None]:

# This piece of code is already prefilled, run it to execute it and see the results.
# Unsupervised clustering metrics template with synthetic data.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init="auto", random_state=42)
labels = kmeans.fit_predict(X)
sil = silhouette_score(X, labels)
print("Silhouette score (higher is better):", round(sil, 4))

print("Please explain in complete sentences when you would use the elbow method and how you would interpret it.")

### Your answer
#### 1. Supervised Learning: Random Forest Classifier

**Model Choice**

Random Forest is a solid classification model for problems like loan default prediction or fraud detection, where you have labeled data and need reliable results. It is less transparent than a single decision tree or a linear model, but its feature importances still give some insight into what drives the predictions.

**Evaluation Strategy**

Data splitting is the first step. You divide your dataset into a training set (80%) and a test set (20%). The model learns exclusively on the training set and is evaluated on the test set, which it has never seen, to simulate real-world performance.

Cross-validation adds reliability. Instead of relying on a single split, k-fold cross-validation (typically k=5) divides the data into 5 subsets, trains on 4 and tests on 1, rotating each time. You average the results across all 5 runs, which gives a much more stable estimate of true performance than a single train/test split.

Metrics to use:

- Accuracy gives the overall percentage of correct predictions but is misleading on imbalanced datasets, so it should never be used alone.
- Precision measures how many of the predicted positives were actually positive, which is important when false alarms are costly.
- Recall measures how many real positives the model caught, which is critical when missing a positive case is expensive (like missing a fraudster or a defaulter).
- F1-Score balances precision and recall into one number, useful when both matter.
- ROC-AUC is the area under the ROC curve, which plots the true positive rate against the false positive rate across all thresholds. As a rough rule of thumb, a score above 0.85 indicates a strong model, although the acceptable level depends on the domain.
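
The split, the 5-fold cross-validation, and the metrics listed above can be combined in a short sketch. The synthetic dataset, the 80/20 class ratio, the `stratify=y` argument, and the forest size are assumptions chosen for the example rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split

# Synthetic stand-in data with about 20% positives.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0)

# 80/20 split; the test set is held back and not used during cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]

# cross_validate trains on 4 folds and scores on the 5th, rotating each time.
results = cross_validate(forest, X_train, y_train, cv=cv, scoring=metrics)

for metric in metrics:
    fold_scores = results[f"test_{metric}"]
    print(f"{metric}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

Reporting the standard deviation next to the mean shows how much the estimate moves from fold to fold, which a single split cannot reveal.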

**Challenges**

The biggest challenge in supervised learning evaluation is class imbalance: if 95% of your data belongs to one class, a model predicting that class every time scores 95% accuracy while being completely useless. Another challenge is data leakage, where information from the test set accidentally influences training, making results artificially optimistic.
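
The 95% example above can be reproduced with a few lines. This is a toy illustration: the labels are constructed by hand, and `DummyClassifier` stands in for the "always predict the majority class" model.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hand-built labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
X_dummy = [[0]] * 100  # the dummy model ignores the features

# A baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)
print("Accuracy:", accuracy)  # 0.95
print("Recall:", recall)      # 0.0
```

The baseline never identifies a single positive case, which is why recall, precision, or PR-AUC should be reported alongside accuracy on imbalanced data.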

#### 2. Unsupervised Learning: K-Means Clustering

**Model Choice**

K-Means is a natural fit for tasks like grouping customers by behavior, organizing documents by topic, or segmenting products: situations where no labels exist and you want to discover hidden structure.

**Evaluation Strategy**

Since there are no labels to compare against, you evaluate unsupervised models by measuring the internal quality of the clusters themselves.

The Elbow Method helps you choose the right number of clusters (k). You run K-Means for k = 1, 2, 3, …, n and plot the inertia (the sum of squared distances from each point to its cluster center) against k. The curve bends sharply near a good value of k, forming an "elbow": this is where adding more clusters stops meaningfully improving the groupings.
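
A minimal sketch of the elbow computation is shown below, using the same kind of synthetic blobs as the prefilled cell. The range of k values and `n_init=10` are assumptions made for the example; the inertia values are printed rather than plotted to keep the block short.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-Means for several values of k and record the inertia of each fit.
inertias = {}
for k in range(1, 8):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = kmeans.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

On data like this, the inertia should fall steeply up to the true number of groups and only slowly afterwards; the value of k where the drop flattens is the candidate "elbow".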

The Silhouette Score evaluates how well each data point fits its assigned cluster compared to neighboring clusters. The score ranges from -1 to +1. A score close to +1 means the point is well matched to its own cluster and far from the others, a score near 0 means it sits on a boundary between clusters, and a negative score means it may have been assigned to the wrong cluster. You want the average silhouette score to be as high as possible.
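
For reference, the standard per-point definition behind this score can be written as follows. The symbols $a(i)$ and $b(i)$ are introduced here for illustration: $a(i)$ is the mean distance from point $i$ to the other points in its own cluster, and $b(i)$ is the smallest mean distance from point $i$ to the points of any other cluster.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

When a point is much closer to its own cluster than to the nearest other cluster, $a(i) \ll b(i)$ and $s(i)$ approaches +1; when the reverse holds, $s(i)$ becomes negative. The overall score is the mean of $s(i)$ over all points.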

Visual inspection using dimensionality reduction techniques like PCA or t-SNE allows you to plot clusters in 2D and visually confirm that they are well separated and meaningful.

**Challenges**

The core challenge is that there is no ground truth to validate against, so you cannot calculate accuracy or F1-score. Evaluation is inherently subjective, and two different analysts might prefer different numbers of clusters. K-Means also assumes clusters are roughly spherical and equal in size, which is often not true in real data, and it is sensitive to outliers and poorly shaped groups.

#### 3. Reinforcement Learning: Q-Learning (Maze Navigation)

**Model Choice**

Q-Learning is a classic reinforcement learning algorithm well suited to navigation and decision-making tasks, such as training a robot to find the shortest path through a maze. The agent learns a table of values (Q-values) representing the expected future reward of each action in each state.
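
The table of Q-values is usually updated with the standard Q-learning rule shown below. The notation is introduced here for illustration: $s$ and $a$ are the current state and action, $r$ is the reward received, $s'$ is the next state, $\alpha$ is the learning rate, and $\gamma$ is the discount factor.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

The bracketed term is the gap between the current estimate and a one-step look-ahead target, so each update moves $Q(s, a)$ a fraction $\alpha$ of the way toward that target.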

**Evaluation Strategy**

Cumulative reward is the primary metric. After each training episode (one complete attempt at the maze), you sum all the rewards and penalties the agent received. Over time, you expect this number to increase as the agent learns better strategies. A stable, high cumulative reward signals that the agent has learned an effective policy.

Convergence tells you whether the agent has finished learning. You monitor the Q-values across episodes: when they stop changing significantly between episodes, the model has converged and found a stable policy. Plotting the reward per episode over time should show an upward trend that eventually flattens.

Exploration vs. exploitation balance is controlled by the epsilon (ε) parameter. Early in training, epsilon is high, meaning the agent tries random actions to explore the environment. Over time, epsilon decreases so the agent relies more on what it has already learned. Monitoring how epsilon decays and how it affects reward progression is an important part of evaluating whether training is healthy.

Path quality is a task-specific metric: once training is complete, you measure the actual length of the path the agent takes and compare it to the known optimal path (shortest possible path). A well-trained agent should get very close to optimal.
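
The four checks above can be tried on a toy problem. The sketch below is an assumption-heavy illustration, not a reference implementation: the 4x4 maze layout, the rewards (-1 per move, -5 for bumping into a wall, +10 at the goal), the hyperparameters, and all function names are invented for the example.

```python
import random
from collections import deque

# Hypothetical 4x4 maze: S = start, G = goal, # = wall, . = free cell.
MAZE = ["S.#.",
        ".#..",
        "...#",
        "#..G"]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
START, GOAL = (0, 0), (3, 3)


def is_free(cell):
    row, col = cell
    return 0 <= row < 4 and 0 <= col < 4 and MAZE[row][col] != "#"


def step(state, action):
    """Return (next_state, reward, done) for one deterministic move."""
    target = (state[0] + action[0], state[1] + action[1])
    if not is_free(target):
        return state, -5.0, False  # bumped into a wall or the border
    if target == GOAL:
        return target, 10.0, True
    return target, -1.0, False


def shortest_path_length():
    """Breadth-first search gives the known optimal number of moves."""
    queue, seen = deque([(START, 0)]), {START}
    while queue:
        cell, dist = queue.popleft()
        if cell == GOAL:
            return dist
        for action in ACTIONS:
            nxt = (cell[0] + action[0], cell[1] + action[1])
            if is_free(nxt) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def greedy_action(q, state):
    return max(range(4), key=lambda a: q.get((state, a), 0.0))


def train(episodes=2000, alpha=0.5, gamma=0.95, seed=0):
    rng = random.Random(seed)
    q, rewards, epsilon = {}, [], 1.0
    for _ in range(episodes):
        state, total = START, 0.0
        for _ in range(100):  # cap on moves per episode
            if rng.random() < epsilon:
                a = rng.randrange(4)         # explore
            else:
                a = greedy_action(q, state)  # exploit
            nxt, reward, done = step(state, ACTIONS[a])
            best_next = 0.0 if done else max(q.get((nxt, b), 0.0) for b in range(4))
            old = q.get((state, a), 0.0)
            q[(state, a)] = old + alpha * (reward + gamma * best_next - old)
            state, total = nxt, total + reward
            if done:
                break
        rewards.append(total)                 # cumulative reward of this episode
        epsilon = max(0.05, epsilon * 0.995)  # decay exploration over time
    return q, rewards


def greedy_path_length(q, max_moves=50):
    """Follow the learned policy without exploration and count the moves."""
    state, moves = START, 0
    while state != GOAL and moves < max_moves:
        state, _, _ = step(state, ACTIONS[greedy_action(q, state)])
        moves += 1
    return moves if state == GOAL else None


q_table, episode_rewards = train()
print("Mean reward, first 50 episodes:", sum(episode_rewards[:50]) / 50)
print("Mean reward, last 50 episodes:", sum(episode_rewards[-50:]) / 50)
print("Greedy path length:", greedy_path_length(q_table))
print("Optimal path length:", shortest_path_length())
```

Comparing the mean reward of early and late episodes is a crude convergence check, and comparing the greedy path length with the breadth-first result is the path-quality check described above.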

**Challenges**

Reinforcement learning is particularly hard to evaluate because results are highly sensitive to hyperparameters like the learning rate, discount factor, and epsilon decay schedule. A poorly tuned agent may appear to converge but actually be stuck in a suboptimal policy. Another challenge is sample efficiency: in large environments the agent may need a very large number of episodes, sometimes millions, to learn a good policy, which is computationally expensive. Finally, performance in a simulated training environment does not always transfer to the real world, a problem known as the sim-to-real gap.