**exercise 1**
Exercise 1 : Defining the Problem and Data Collection for Loan Default Prediction
Instructions:

1. Write a clear problem statement for predicting loan defaults.
2. Identify and list the types of data you would need for this project (e.g., personal details of applicants, credit scores, loan amounts, repayment history).
3. Discuss the sources where you can collect this data (e.g., financial institution’s internal records, credit bureaus).
Expected Output: A document detailing the problem statement and a comprehensive plan for data collection, including data types and sources.

answer:

-1. The goal of this project is to predict whether a loan applicant will default on their loan based on their personal, financial, and historical data. Loan default occurs when a borrower fails to repay a loan according to the agreed terms. Early identification of high-risk applicants can help financial institutions minimize losses, improve credit policies, and offer more tailored lending options.

-2. Types of data:
1. Personal and Demographic Data
	•	Age
	•	Gender
	•	Marital status
	•	Number of dependents
	•	Education level
	•	Employment status and job type
	•	Residential status (own/rent)
	•	Duration at current address

2. Financial and Credit Data
	•	Monthly income
	•	Monthly expenses
	•	Total debt
	•	Credit score (from agencies like Experian, Equifax, etc.)
	•	Number of open credit accounts
	•	Debt-to-income ratio
	•	Past bankruptcies or foreclosures

3. Loan-Specific Data
	•	Loan amount requested
	•	Loan type (personal, car, mortgage, etc.)
	•	Interest rate
	•	Loan term (duration)
	•	Repayment schedule (monthly, bi-weekly, etc.)
	•	Collateral value (if applicable)
	•	Purpose of the loan

4. Repayment History
	•	On-time vs. late payments
	•	Number of missed payments
	•	Past loan defaults
	•	Total amount repaid so far
	•	Prepayment behavior

5. Behavioral or Derived Features (optional but useful)
	•	Recent credit inquiries
	•	Changes in employment/income over time
	•	Historical trends in credit score
	•	Geolocation and regional default rates

-3. Sources:

Internal analytics platforms

Third-party credit monitoring tools

--Conclusion:

Collecting and consolidating the above data from various sources will enable the development of a robust predictive model. The accuracy of the model largely depends on the quality, relevance, and completeness of the data. Special attention must also be given to data privacy and compliance with relevant regulations such as GDPR or local financial data laws.
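
Some of the derived features listed above, such as the debt-to-income ratio, can be computed directly from raw fields once the data is collected. The sketch below is only an illustration: the column names (`monthly_income`, `monthly_debt_payments`, `missed_payments`, `total_payments_due`) and all values are hypothetical and do not come from any real source.

```python
import pandas as pd

# Hypothetical applicant records; column names and values are illustrative only.
applicants = pd.DataFrame({
    "monthly_income": [4000.0, 2500.0, 6000.0],
    "monthly_debt_payments": [1000.0, 1500.0, 900.0],
    "missed_payments": [0, 3, 1],
    "total_payments_due": [24, 12, 36],
})

# Debt-to-income ratio: share of monthly income already committed to debt.
applicants["debt_to_income"] = (
    applicants["monthly_debt_payments"] / applicants["monthly_income"]
)

# Missed-payment rate: fraction of scheduled payments that were missed.
applicants["missed_payment_rate"] = (
    applicants["missed_payments"] / applicants["total_payments_due"]
)

print(applicants[["debt_to_income", "missed_payment_rate"]])
```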

In [55]:
problem_statement = """
Predicting Loan Defaults:
The goal is to develop a machine learning model that can accurately predict whether an individual will default on a loan. This model will help financial institutions in assessing the risk associated with lending.
"""

# Data Types and Sources
data_types = ['personal details of applicants', 'credit scores', 'loan amounts', 'repayment history']
data_sources = ['financial institution’s internal records', 'credit bureaus']

# Printing out the problem statement, data types, and sources
print(problem_statement)
print("Data Types:", data_types)
print("Data Sources:", data_sources)


Predicting Loan Defaults:
The goal is to develop a machine learning model that can accurately predict whether an individual will default on a loan. This model will help financial institutions in assessing the risk associated with lending.

Data Types: ['personal details of applicants', 'credit scores', 'loan amounts', 'repayment history']
Data Sources: ['financial institution’s internal records', 'credit bureaus']


Exercise 2 : Feature Selection and Model Choice for Loan Default Prediction
Instructions
From this dataset, identify which features might be most relevant for predicting loan defaults.
Justify your choice of features.

In [39]:
import pandas as pd
import numpy as np
df = pd.read_csv('train_u6lujuX_CVtuZ9i (1).csv')
df.head()

,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


Answer to exercise 2: The most relevant features for predicting loan defaults are:
- Credit_History (most critical)
- ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term
- Married, Dependents, Self_Employed, Education, Property_Area

These features cover key aspects such as financial capability, socio-economic factors, and credit behavior, which are essential for accurate default prediction.
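
One way to back up a choice such as "Credit_History is most critical" is to compare the outcome rate within each group of the feature. The sketch below uses a small made-up frame with the same column names as the dataset; the values are invented for illustration, so the printed rates say nothing about the real data.

```python
import pandas as pd

# Toy rows that mimic the dataset's columns; the values are made up for illustration.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    "Loan_Status":    ["Y", "Y", "N", "N", "N", "Y", "Y", "Y"],
})

# Row-normalized cross-tabulation: share of each Loan_Status within each Credit_History group.
rates = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")
print(rates)
```

On the real data the same call would be `pd.crosstab(df["Credit_History"], df["Loan_Status"], normalize="index")`; a large gap between the two rows would support ranking the feature highly.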


Exercise 3 : Training, Evaluating, and Optimizing the Model
Instructions

Which model(s) would you pick for a Loan Prediction ?
Outline the steps to evaluate the model’s performance, mentioning specific metrics that would be relevant to evaluate the model.

In [41]:
#Step 1: Split the Dataset

from sklearn.model_selection import train_test_split
X = df.drop(['Loan_Status', 'Loan_ID'], axis=1)
y = df['Loan_Status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) # Added random_state for reproducibility


Since Loan Prediction is a question of Yes or No, which is a binary classification problem, classification models will be used.

The dataset has only 614 entries, and the features chosen in Exercise 2 are a mix of numerical and categorical columns (such as Education and Property_Area). I decided to use an SVM (Support Vector Machine) with an RBF kernel, since I am not sure whether the relationship between the features and the target is linear.

Steps to evaluate the model's performance:
1. Split the dataset (Step 1 above).
2. Preprocess the data before training:
	•	Impute missing values (e.g., with SimpleImputer)
	•	Encode categorical variables (e.g., OneHotEncoder)
	•	Scale numeric values (e.g., StandardScaler), since SVM is sensitive to feature scale.
3. Train the SVM model.
4. Make predictions on the test set.
5. Evaluate the model with metrics (accuracy, precision, recall, F1-score, confusion matrix).
6. Use cross-validation for a more robust evaluation.


In [42]:
#Step 2: Preprocessing
from sklearn.impute import SimpleImputer
print("Missing values per column:\n", df.isnull().sum())


Missing values per column:
 Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64


In [43]:
# Define categorical and numerical columns with missing values
cat_cols_with_nan = ['Married', 'Dependents', 'Self_Employed']
num_cols_with_nan = ['LoanAmount', 'Loan_Amount_Term', 'Credit_History']

# Imputer for categorical columns: most frequent
cat_imputer = SimpleImputer(strategy='most_frequent')
df[cat_cols_with_nan] = cat_imputer.fit_transform(df[cat_cols_with_nan])

# Imputer for numerical columns: mean
num_imputer = SimpleImputer(strategy='mean')
df[num_cols_with_nan] = num_imputer.fit_transform(df[num_cols_with_nan])

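# Check remaining missing values. Gender is not in the lists above, so its NaNs remain in df;
# the pipeline defined below imputes them.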
print("\nMissing values after imputation:\n", df.isnull().sum())


Missing values after imputation:
 Loan_ID               0
Gender               13
Married               0
Dependents            0
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term      0
Credit_History        0
Property_Area         0
Loan_Status           0
dtype: int64


In [48]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Define categorical and numerical features for preprocessing
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=np.number).columns

# Build the preprocessing step for numerical and categorical features
# Numerical features: mean-impute only (no scaling is applied in this pipeline)
# Categorical features: impute with the most frequent value, then one-hot encode
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='mean'), numerical_features), # Impute numerical NaNs
        ('cat', Pipeline(steps=[ # Create a sub-pipeline for categorical features
            ('imputer', SimpleImputer(strategy='most_frequent')), # Impute categorical NaNs
            ('onehot', OneHotEncoder(handle_unknown='ignore')) # One-hot encode
        ]), categorical_features)
    ],
    remainder='passthrough' # Keep any other columns (if any)
)

# Create the full pipeline including preprocessing and the SVC model
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('classifier', SVC(kernel='rbf'))])


In [49]:
#Step 3: train the model
model_pipeline.fit(X_train, y_train)

In [51]:
# Step 4: Make Predictions

y_pred = model_pipeline.predict(X_test)

In [52]:
# Step 5: Evaluate the Model with Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label='Y'))
print("Recall:", recall_score(y_test, y_pred, pos_label='Y'))
print("F1 Score:", f1_score(y_test, y_pred, pos_label='Y'))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.645320197044335
Precision: 0.645320197044335
Recall: 1.0
F1 Score: 0.7844311377245509
Confusion Matrix:
 [[  0  72]
 [  0 131]]
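
The confusion matrix shows that the model predicted 'Y' for all 203 test rows, which is why recall is 1.0 and accuracy equals precision. As a check, the reported metrics can be recomputed by hand from the four counts. This assumes rows are the true classes and columns the predicted classes, both in the order N, Y, with 'Y' as the positive class.

```python
# Counts read from the confusion matrix above (rows = true N, Y; columns = predicted N, Y),
# treating 'Y' as the positive class, as in the metric calls.
tn, fp = 0, 72
fn, tp = 0, 131

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all rows predicted correctly
precision = tp / (tp + fp)                   # share of predicted 'Y' that are truly 'Y'
recall = tp / (tp + fn)                      # share of true 'Y' that were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Because no row is ever predicted as 'N', these numbers describe a majority-class baseline, not a model that separates the two classes.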


In [54]:
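# Step 6: Use Cross-Validation for Robust Evaluation (using the pipeline)
# Note: X and y are the full dataset, before the train/test split. X was created before df was
# imputed, so it still contains NaNs; the pipeline imputes and encodes within each fold.
# cross_val_score handles the splitting and refits the pipeline on each training fold.
# scoring='f1_macro' averages the F1 of both classes, so it is not comparable to the 'Y'-only F1 above.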
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model_pipeline, X, y, cv=5, scoring='f1_macro')
print("Average F1 Score:", scores.mean())

Average F1 Score: 0.407333777099501
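
A macro F1 of about 0.41 is what one would expect if the model predicts a single class in every fold, since the other class then contributes an F1 of 0. A likely cause is that the numeric features are not scaled, although the source does not confirm this. The sketch below shows one possible variant of the pipeline that adds `StandardScaler`; the helper name `build_scaled_svm`, the median imputation, the `class_weight="balanced"` setting, and the toy data are all assumptions, and no result on the real dataset is claimed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC


def build_scaled_svm(numerical_features, categorical_features):
    """SVC pipeline that imputes, scales numeric columns, and one-hot encodes categoricals."""
    numeric = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    categorical = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocessor = ColumnTransformer(transformers=[
        ("num", numeric, list(numerical_features)),
        ("cat", categorical, list(categorical_features)),
    ])
    # class_weight="balanced" is an optional assumption to counter class imbalance.
    return Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("classifier", SVC(kernel="rbf", class_weight="balanced")),
    ])


# Tiny synthetic frame (made-up values) just to show the pipeline runs end to end.
rng = np.random.default_rng(0)
n = 60
toy_X = pd.DataFrame({
    "ApplicantIncome": rng.integers(1500, 10000, size=n).astype(float),
    "LoanAmount": rng.integers(50, 400, size=n).astype(float),
    "Credit_History": np.tile([1.0, 1.0, 0.0], n // 3),
    "Property_Area": rng.choice(["Urban", "Rural", "Semiurban"], size=n).tolist(),
})
toy_X.loc[0, "LoanAmount"] = np.nan  # one missing numeric value for the imputer
toy_y = np.where(toy_X["Credit_History"] == 1.0, "Y", "N")

scaled_svm = build_scaled_svm(
    ["ApplicantIncome", "LoanAmount", "Credit_History"], ["Property_Area"]
)
scaled_svm.fit(toy_X, toy_y)
toy_pred = scaled_svm.predict(toy_X)
print(sorted(set(toy_pred)))
```

With the notebook's variables, the equivalent call would be `build_scaled_svm(numerical_features, categorical_features)`, evaluated with the same `cross_val_score` line as above.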


In [56]:
# Hypothetical Dataset Columns
dataset_columns = ['age', 'income', 'loan amount', 'repayment history', 'credit score']

# Feature Selection
selected_features = ['income', 'loan amount', 'repayment history', 'credit score'] # Justify why you chose these features

# Justification for Feature Selection
feature_justification = """
Selected Features Justification:
- Income: Higher income might correlate with a lower risk of default.
- Loan Amount: Larger loan amounts could be riskier and more likely to lead to defaults.
- Repayment History: Past repayment behavior is often a good predictor of future behavior.
- Credit Score: Indicates the borrower's creditworthiness and history of managing debts.
"""

print(feature_justification)


Selected Features Justification:
- Income: Higher income might correlate with a lower risk of default.
- Loan Amount: Larger loan amounts could be riskier and more likely to lead to defaults.
- Repayment History: Past repayment behavior is often a good predictor of future behavior.
- Credit Score: Indicates the borrower's creditworthiness and history of managing debts.



Exercise 4 : Designing Machine Learning Solutions for Specific Problems
Instructions

For each of these scenarios, decide which type of machine learning would be most suitable. Explain.

Predicting Stock Prices : predict future prices

**Answer**: Supervised Learning, because historical stock data provides both the features (date, previous prices, volume, etc.) and the known prices, which serve as labels. The model is trained on past data to predict future values (a regression task). Common Algorithms: Linear Regression, LSTM (for time series), Random Forest Regressor.

Organizing a Library of Books : group books into genres or categories based on similarities.
**Answer**: Unsupervised Learning, because there are no predefined labels (genres), but patterns can be discovered (for example, from similar content or style). The model groups the books into clusters according to similarity. Common Algorithms: K-Means, Hierarchical Clustering; PCA can be applied beforehand to reduce dimensionality, but it is not itself a clustering algorithm.

Program a robot to navigate and find the shortest path in a maze.
**Answer**: Reinforcement Learning, because the robot interacts with the environment (the maze), learns by trial and error, and receives rewards or penalties based on its actions. The goal is to learn the optimal policy, which here corresponds to the shortest path. Common Algorithms: Q-Learning, Deep Q-Networks (DQN), SARSA.

In [57]:
# Scenario Solutions
scenarios = {
    "Predicting Stock Prices": "Supervised Learning - Regression Model",
    "Organizing a Library of Books": "Unsupervised Learning - Clustering Model",
    "Program a Robot to Navigate a Maze": "Reinforcement Learning"
}

# Print out the solutions for each scenario
for scenario, solution in scenarios.items():
    print(f"{scenario}: {solution}")

Predicting Stock Prices: Supervised Learning - Regression Model
Organizing a Library of Books: Unsupervised Learning - Clustering Model
Program a Robot to Navigate a Maze: Reinforcement Learning


Exercise 5 : Designing an Evaluation Strategy for Different ML Models

Instructions

Select three types of machine learning models: one from supervised learning (e.g., a classification model), one from unsupervised learning (e.g., a clustering model), and one from reinforcement learning. For the supervised model, outline a strategy to evaluate its performance, including the choice of metrics (like accuracy, precision, recall, F1-score) and methods (like cross-validation, ROC curves).
For the unsupervised model, describe how you would assess the effectiveness of the model, considering techniques like silhouette score, elbow method, or cluster validation metrics.
For the reinforcement learning model, discuss how you would measure its success, considering aspects like cumulative reward, convergence, and exploration vs. exploitation balance.
Address the challenges and limitations of evaluating models in each category.

In [58]:
# Supervised Learning Evaluation Strategy
supervised_evaluation = """
- Use cross-validation to assess the model's performance.
- Evaluate using metrics like accuracy, precision, recall, F1-score.
- Analyze the ROC curves to understand the trade-off between the true positive rate and false positive rate.
"""

# Unsupervised Learning Evaluation Strategy
unsupervised_evaluation = """
- Assess the effectiveness using silhouette score or the elbow method.
- Validate clusters by examining their interpretability and relevance to the problem statement.
"""

# Reinforcement Learning Evaluation Strategy
reinforcement_evaluation = """
- Measure success by the cumulative reward achieved by the model.
- Evaluate the balance between exploration and exploitation.
- Monitor the convergence of the model to ensure learning stability.
"""

print("Supervised Learning Evaluation Strategy:", supervised_evaluation)
print("Unsupervised Learning Evaluation Strategy:", unsupervised_evaluation)
print("Reinforcement Learning Evaluation Strategy:", reinforcement_evaluation)

Supervised Learning Evaluation Strategy: 
- Use cross-validation to assess the model's performance.
- Evaluate using metrics like accuracy, precision, recall, F1-score.
- Analyze the ROC curves to understand the trade-off between the true positive rate and false positive rate.

Unsupervised Learning Evaluation Strategy: 
- Assess the effectiveness using silhouette score or the elbow method.
- Validate clusters by examining their interpretability and relevance to the problem statement.

Reinforcement Learning Evaluation Strategy: 
- Measure success by the cumulative reward achieved by the model.
- Evaluate the balance between exploration and exploitation.
- Monitor the convergence of the model to ensure learning stability.



**ANSWER TO EX 5**

1. Supervised Learning Evaluation Strategy

Model Example: Random Forest Classifier
Model Overview:

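Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions (majority vote for classification, averaging for regression) to get a more accurate and stable result. It is widely used for both classification and regression because it is robust to outliers and noisy features. Whether it can handle missing values directly depends on the implementation, so imputation is often still needed.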

Evaluation Strategy

📌 A. Train-Test Split or Cross-Validation
	•	Train/Test Split: Divide dataset into training (e.g., 80%) and test set (20%).
	•	K-Fold Cross-Validation: Often used (e.g., StratifiedKFold) to reduce variance in evaluation.

📌 B. Metrics to Use
	•	Accuracy: Suitable when classes are balanced.
	•	Precision: Key when false positives are costly.
	•	Recall: Crucial when missing positives is risky (e.g., loan default).
	•	F1-score: Balances precision and recall; useful for imbalanced data.
	•	ROC-AUC: Shows how well the model distinguishes classes.
	•	Confusion Matrix: Gives detailed insight into true/false positives and negatives.

📌 C. Model Interpretation Tools
	•	Feature Importance: Random Forest provides a ranking of feature importance.
	•	Permutation Importance: Measures effect on model performance when feature values are shuffled.
	•	Partial Dependence Plots: Visualize the effect of a feature on predictions.


2. Unsupervised Learning Evaluation Strategy

🔹 Model Example: K-Means Clustering for Customer Segmentation

🔹 Evaluation Techniques:
	•	Silhouette Score: Measures how similar an object is to its own cluster vs. others (range: -1 to 1).
	•	Elbow Method: Helps decide the optimal number of clusters by plotting within-cluster SSE vs. number of clusters.
	•	Davies-Bouldin Index, Calinski-Harabasz Index: Internal metrics to assess clustering quality.
	•	Cluster Visualization: (e.g., PCA plots) can give intuition, especially in 2D/3D.

🔹 Challenges:
	•	No ground truth → hard to “prove” clustering quality.
	•	Interpretation of clusters can be subjective.
	•	Sensitive to initialization and scale of data.
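
A minimal sketch of the elbow method and silhouette score, assuming synthetic blob data in place of real customer features. The range of `k` and the blob parameters are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic points standing in for customer features; scaled because K-Means is scale-sensitive.
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
X_scaled = StandardScaler().fit_transform(X_blobs)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = km.inertia_                                 # within-cluster SSE (elbow method)
    silhouettes[k] = silhouette_score(X_scaled, km.labels_)   # in [-1, 1], higher is better

best_k = max(silhouettes, key=silhouettes.get)
for k in inertias:
    print(k, round(inertias[k], 2), round(silhouettes[k], 3))
print("k with highest silhouette:", best_k)
```

Setting `n_init=10` reruns K-Means from several starting points, which reduces the sensitivity to initialization noted above.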

3. Reinforcement Learning Evaluation Strategy

🔹 Model Example: Q-Learning for Robot Maze Navigation

🔹 Evaluation Criteria:
	•	Cumulative Reward: Total reward gained over an episode; higher is better.
	•	Convergence of Policy: Whether the learning stabilizes and no longer changes significantly.
	•	Episode Length: Shorter paths or fewer steps to goal indicate learning success.
	•	Exploration vs. Exploitation Balance: Ensure model doesn’t get stuck exploiting sub-optimal paths.

🔹 Challenges:
	•	Long training times.
	•	Delayed rewards make evaluation difficult.
	•	Reward function design critically affects behavior.
	•	Randomness in environment affects repeatability.
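
The criteria above can be tracked with a small tabular Q-learning loop. The sketch below assumes a hypothetical one-dimensional corridor in place of a real maze, with a reward of -1 per move and +10 at the goal; the environment and the hyperparameters are illustrative choices.

```python
import numpy as np

# Hypothetical 1-D corridor "maze": states 0..4, start at 0, goal at 4.
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # index 0 = move left, index 1 = move right


def step(state, action_idx):
    """Apply an action; reward is -1 per move and +10 on reaching the goal."""
    next_state = min(max(state + ACTIONS[action_idx], 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (10.0 if done else -1.0), done


rng = np.random.default_rng(0)
q = np.zeros((N_STATES, len(ACTIONS)))
alpha, gamma, epsilon = 0.5, 0.9, 0.1
episode_rewards, episode_lengths = [], []

for episode in range(200):
    state, total, steps, done = 0, 0.0, 0, False
    while not done and steps < 100:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if rng.random() < epsilon:
            action = int(rng.integers(len(ACTIONS)))
        else:
            action = int(np.argmax(q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update toward the reward plus the discounted best next value.
        target = reward + (0.0 if done else gamma * np.max(q[next_state]))
        q[state, action] += alpha * (target - q[state, action])
        state, total, steps = next_state, total + reward, steps + 1
    episode_rewards.append(total)    # cumulative reward per episode
    episode_lengths.append(steps)    # episode length (steps to goal)

greedy_policy = [int(np.argmax(q[s])) for s in range(GOAL)]
print("mean reward, last 20 episodes:", np.mean(episode_rewards[-20:]))
print("mean length, last 20 episodes:", np.mean(episode_lengths[-20:]))
print("greedy policy (1 = right):", greedy_policy)
```

Plotting `episode_rewards` and `episode_lengths` over episodes gives a view of convergence. Fixing the random seed, as done here, addresses the repeatability problem only for a single run, so results are usually averaged over several seeds.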
