# Credit Score / Loan Approval Modelling (Portfolio Version)

This notebook trains and evaluates machine learning models to support **credit risk / loan approval decisions**.

**How to run**
1. Clone the repository
2. Make sure the dataset file is located at: `data/loan_approval_data.csv`
3. Run cells from top to bottom

> Note: This is a cleaned version for GitHub (outputs and embedded images removed to keep the repo lightweight).


In [None]:
#Importing all the libraries we are going to use in the model
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#Load the data into a DataFrame and show the first rows to explore the dataset
DATA_PATH = 'data/loan_approval_data.csv'
try:
    data_frame = pd.read_csv(DATA_PATH)
except FileNotFoundError as e:
    raise FileNotFoundError(
        f"Could not find {DATA_PATH}. Place the dataset in the repo under /data/ or update DATA_PATH."
    ) from e
data_frame.head()
#First we have to observe our data set

In [None]:
#We need the real name of the columns to be sure we are going to drop the correct ID column
data_frame.columns


In [None]:
#We need to see how many variables and rows we have, so we can judge whether missing values are going to affect the model or not.
data_frame.shape

In [None]:
#In case the ID was attached to the index we reset the index
data_frame.index.name
data_frame.reset_index(drop=True, inplace=True)

In [None]:
#We drop the id and print the head of the table to see if the job was done
data_frame.drop('id', axis=1, inplace=True)
data_frame.head()

In [None]:
#We print the names of the columns again to be sure that ID is gone
print(list(data_frame.columns))

Exploratory Data Analysis (EDA)

Here we explain the problem with each variable one by one and how we approach it. We also include some plots to visualize the problems in the data. The images are numbered so the reader can better understand what happened to the data set and how we resolved each problem.


(Image removed for GitHub version.)

In [None]:
data_frame.dtypes

In [None]:
#This is a statistical summary of all the variables.
#AGE = Some ages are written as text instead of numbers, e.g. "twenty" instead of 20.
#SEX = Only 221 of 58,645 records have a value, so it will not help the prediction. We have to drop it.
#EDUCATION = "Unknown" is the most common value in the education qualifications variable. We have to drop it.
#INCOME (IMAGE EDA 1) = Income is too spread out, so we create categories and assign each record to a range that we define.
#OWNERSHIP = Home ownership has no issues.
#EMPLOYMENT = Employment length has outliers: the max shows 150 years of employment, which is not possible and is probably a typo. There are 3 values above 100 years, so we are going to drop them too.
#LOAN INTENT = Loan intent is fine; it has 6 categories.
#LOAN AMOUNT (IMAGE EDA 2) = Loan amount is left-skewed, as we can observe in Image EDA 2.
#LOAN INTEREST RATE (IMAGE EDA 3) = There are 3 outliers (values below zero and values above 100). They could affect the rate prediction, and since there are only 3, dropping them will not affect the data set.
#LOAN INCOME RATIO = This one is OK.
#PAYMENT DEFAULT = The values are not standardized: there are 4 different types of answers where there should be just YES or NO. We are going to apply a standard.
#CREDIT HISTORY = This one is OK.
#LOAN APPROVAL = We have to drop it because it can affect the classification problem in the model.
#MAX LOAN = This variable is the target output of the regression model in Part B. If we included it in the model we would get 100% accuracy. We have to drop it.
#APPLICATION ACCEPTANCE = This variable is the target output of the classification model in Part A. If we included it in the model we would get 100% accuracy. We have to drop it from the input features.

data_frame.describe(include='all')
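
The text-encoded ages noted in the summary can be located programmatically rather than by eye. A minimal sketch, assuming a small made-up `age` column (the values below are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy column mixing numbers and spelled-out ages (illustrative values only)
ages = pd.Series([25, "Thirty", 41, "Twenty Three", 33], name="age")

# Entries that fail numeric conversion become NaN with errors="coerce"
as_numbers = pd.to_numeric(ages, errors="coerce")

# Keep the original entries that could not be parsed as numbers
text_entries = ages[as_numbers.isna() & ages.notna()]
print(text_entries.tolist())
```

Listing the unparsed entries first makes it easier to build an explicit replacement mapping before coercing the column.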

In [None]:
#Here we can observe the wide range of the income distribution, so it is better to create categories for the model. With 200 bins we can see how widely the values are spread.
plt.figure(figsize=(8,4))
sns.histplot(data_frame['income'], bins=200)
plt.xlim(0, 150000)   # optional: zoom in to see the distribution better
plt.title('Income Distribution (IMAGE EDA 1)')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()

In [None]:
#Loan amount
plt.figure(figsize=(8,4))
sns.histplot(data_frame['loan_amount'], bins=5)
plt.title('Loan Amount (IMAGE EDA 2)')
plt.xlabel('Loan Amount')
plt.ylabel('Frequency')
plt.show()

In [None]:
#Loan Interest Rate
plt.figure(figsize=(8,4))
sns.histplot(data_frame['loan_interest_rate'], bins=20)
plt.xlim(0,30)
plt.title('Loan Interest Rate (IMAGE EDA 3)')
plt.xlabel('Loan Interest Rate')
plt.ylabel('Frequency')
plt.show()

In [None]:
#The target variable is the application acceptance in the dataset
sns.countplot(x='Credit_Application_Acceptance', data=data_frame)
plt.title('Distribution of Target Variable: Credit Application (1 = Declined, 0 = Approved)')
plt.xlabel('Application Status')
plt.ylabel('Count')
plt.show()



In [None]:
data_frame['Credit_Application_Acceptance'].value_counts(normalize=True) * 100

In [None]:
#Here we can see how many missing values are in each variable. I also replaced the "Unknown" value with NaN, because it was the top value in education, so it is counted as missing here.
data_frame.replace("Unknown", np.nan, inplace=True)
data_frame.isnull().sum()

In [None]:
#Payment default
data_frame['payment_default_on_file'].unique()

In [None]:
#standardization of the variables

#Some ages are written as text instead of numbers.
data_frame['age'] = data_frame['age'].replace({'Twenty Seven': 27, 'Thirty': 30, 'Twenty Three':23})

#In the payment default on file there are also some inconsistencies.
data_frame['payment_default_on_file'] = data_frame['payment_default_on_file'].replace({'N': 'NO', 'Y': 'YES'})

In [None]:
#Here we are going to create 6 categories so we can have a better classification system.
data_frame.loc[data_frame['income'] < 20000, 'income_category'] = 'Very Low'

data_frame.loc[(data_frame['income'] >= 20000) & (data_frame['income'] < 40000), 'income_category'] = 'Low'

data_frame.loc[(data_frame['income'] >= 40000) & (data_frame['income'] < 60000), 'income_category'] = 'Medium'

data_frame.loc[(data_frame['income'] >= 60000) & (data_frame['income'] < 80000), 'income_category'] = 'High'

data_frame.loc[(data_frame['income'] >= 80000) & (data_frame['income'] < 100000), 'income_category'] = 'Very High'

data_frame.loc[data_frame['income'] >= 100000, 'income_category'] = 'Extremely High'
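
The same six ranges can also be expressed in a single call. A minimal sketch using `pd.cut` on made-up incomes (the sample values are illustrative; `right=False` is chosen so each bin includes its lower bound, matching the `>=` / `<` rules above):

```python
import numpy as np
import pandas as pd

# Illustrative incomes, including the boundary values 20000, 60000 and 100000
income = pd.Series([15000, 20000, 45000, 60000, 99999, 100000])

# Left-closed bins: (-inf, 20000), [20000, 40000), ..., [100000, inf)
bins = [-np.inf, 20000, 40000, 60000, 80000, 100000, np.inf]
labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High', 'Extremely High']

income_category = pd.cut(income, bins=bins, labels=labels, right=False)
print(income_category.tolist())
```

Unlike the `.loc` assignments, `pd.cut` returns an ordered categorical column, so the category order does not have to be repeated when plotting.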

In [None]:
#LOAN APPROVAL STATUS
data_frame['loan_approval_status'].unique()

In [None]:
data_frame['loan_approval_status'].describe()

In [None]:
data_frame['payment_default_on_file'].describe()

In [None]:
labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High', 'Extremely High']
plt.figure(figsize=(8,4))
sns.countplot(x='income_category', data=data_frame, order=labels)
plt.title('Income Categories Distribution')
plt.xlabel('Income Category')
plt.ylabel('Count')
plt.show()

In [None]:
data_frame['max_allowed_loan'].describe()

In [None]:
#Values below 0
(data_frame['max_allowed_loan'] < 0).sum()

In [None]:
data_frame[data_frame['max_allowed_loan'] < 0]['max_allowed_loan']

In [None]:
data_frame[data_frame['max_allowed_loan'] < 0]

In [None]:
cat_cols = [col for col in data_frame.columns if data_frame[col].dtype == 'object']
cat_cols


In [None]:
data_frame['age'] = pd.to_numeric(data_frame['age'], errors='coerce')


In [None]:
cat_cols = [col for col in data_frame.columns if data_frame[col].dtype == 'object']
cat_cols

In [None]:
#Dropping variables
data_frame.drop('Sex', axis=1, inplace=True)
data_frame.drop('Education_Qualifications', axis=1, inplace=True)
data_frame.drop('max_allowed_loan', axis=1, inplace=True)
data_frame.drop('loan_approval_status', axis=1, inplace=True)
#data_frame.drop('Credit_Application_Acceptance', axis=1, inplace=True)
data_frame.head()

In [None]:
# Convert age to numeric (values that cannot be parsed become NaN)
data_frame['age'] = pd.to_numeric(data_frame['age'], errors='coerce')


In [None]:
# Drop age below 0 or above 100 (age < 0 or age > 100)
data_frame = data_frame[(data_frame['age'] >= 0) & (data_frame['age'] <= 100)]

# Drop employment_length above 50 years
data_frame = data_frame[data_frame['emplyment_length'] <= 50]

# Drop loan_interest_rate below 0
data_frame = data_frame[data_frame['loan_interest_rate'] >= 0]

In [None]:
#We need to detect all the categorical variables and convert them to numerical data
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
cat_cols = ['home_ownership', 'loan_intent', 'payment_default_on_file', 'income_category']
for col in cat_cols:
    data_frame[col] = le.fit_transform(data_frame[col])
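
Reusing one `LabelEncoder` works for encoding, but after the loop it only remembers the last column. A minimal sketch, assuming a toy frame with made-up category values, that keeps one encoder per column so the integer codes can be mapped back to labels (the `encoders` dictionary is an illustrative name, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two of the categorical columns (values are illustrative)
toy = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "RENT", "MORTGAGE"],
    "loan_intent": ["EDUCATION", "MEDICAL", "EDUCATION", "PERSONAL"],
})

encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # keep the fitted encoder for this column

# classes_ is sorted, so code i maps back to classes_[i]
print(dict(enumerate(encoders["home_ownership"].classes_)))
decoded = encoders["home_ownership"].inverse_transform(toy["home_ownership"])
print(list(decoded))
```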


THIS IS THE CLEAN DATA SET

(Image removed for GitHub version.)

In [None]:
#This is the new stats of the data set
data_frame.describe(include='all').transpose()

In [None]:
data_frame.columns

In [None]:
#Before we had (58645, 16) and now we have 58,621 rows and 11 variables
data_frame.shape

Modelling: Create Predictive Classification Models


(Image removed for GitHub version.)

In [None]:
# Select all valid input features (numerical + categorical)
X = data_frame[['age','income','home_ownership','emplyment_length','loan_intent', 'loan_amount','loan_interest_rate','loan_income_ratio','payment_default_on_file','credit_history_length','income_category']]

# Output variable
y = data_frame['Credit_Application_Acceptance']

# Display feature names
print("Feature Names Used in the Classification Models:")
print(list(X.columns))

# Display data shapes
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)


In [None]:
# We removed a very small number of missing values, which has no impact on the dataset distribution and ensures that the train-test split works without errors
#Remove NaNs from X
X = X.dropna()

# Align y with cleaned X
y = y.loc[X.index]

# Also drop NaNs in y (just 1)
y = y.dropna()
X = X.loc[y.index]   # Align again
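
The two-step alignment above can also be done in one pass by dropping incomplete rows from the joined features and target. A minimal sketch on made-up data (the names `features`, `target`, `X_clean` and `y_clean` are illustrative):

```python
import numpy as np
import pandas as pd

# Toy data: one NaN in a feature, one NaN in the target (illustrative values)
features = pd.DataFrame({"age": [25, np.nan, 40, 31],
                         "income": [30000, 52000, 61000, 45000]})
target = pd.Series([0, 1, np.nan, 1], name="target")

# Join features and target, drop incomplete rows once, then split again
complete = pd.concat([features, target], axis=1).dropna()
X_clean = complete[features.columns]
y_clean = complete[target.name]

print(X_clean.index.tolist())
```

Because both pieces come from the same filtered frame, their indexes cannot drift apart.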

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)
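
The `stratify=y` argument keeps the class proportions the same in the train and test sets, which matters when the target is imbalanced. A minimal sketch on a made-up 80/20 target (the data and variable names are illustrative only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 80 rows of class 0 and 20 rows of class 1
strat_y = pd.Series([0] * 80 + [1] * 20)
strat_X = pd.DataFrame({"feature": range(100)})

_, _, y_tr, y_te = train_test_split(
    strat_X, strat_y, test_size=0.2, random_state=42, stratify=strat_y
)

# Share of class 1 in each split
print(y_tr.mean(), y_te.mean())
```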


In [None]:
#A lender has agreed to offer a loan (Yes: 0 and No:1)
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

print("Confusion Matrix: Naive Bayes")
ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_nb)).plot()



In [None]:
# Logistic Regression model
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)


In [None]:
#A lender has agreed to offer a loan (Yes: 0 and No:1)
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

cm_lr = confusion_matrix(y_test, y_pred_lr)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_lr)
disp.plot(cmap='viridis')


In [None]:
#Metrics for LR
#A lender has agreed to offer a loan (Yes: 0 and No:1)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_lr))


In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,   # number of trees
    random_state=42,
    max_depth=None     # Can be adjusted
)

rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)


In [None]:
#A lender has agreed to offer a loan (Yes: 0 and No:1)
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm_rf = confusion_matrix(y_test, y_pred_rf)
disp_rf = ConfusionMatrixDisplay(confusion_matrix=cm_rf)
disp_rf.plot(cmap='viridis')


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def get_scores(y_true, y_pred, model_name):
    print(f"---- {model_name} ----")
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("Recall:", recall_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("F1-Score:", f1_score(y_true, y_pred))

    # ROC AUC needs predicted probabilities for the positive class. This function
    # only receives hard class labels, so AUC is reported by get_scores_with_auc
    # in the next cell instead.
    print("AUC-ROC: Not available (needs predicted probabilities, not class labels)")
    print("\n")

# NB
get_scores(y_test, y_pred_nb, "Naive Bayes")

# LR
get_scores(y_test, y_pred_lr, "Logistic Regression")

# RF
get_scores(y_test, y_pred_rf, "Random Forest")


In [None]:
#AUC-ROC is not shown above because get_scores only receives predicted class labels, while roc_auc_score needs continuous scores such as predicted probabilities. This version takes the fitted model and calls predict_proba, so the AUC reflects how well the model ranks the positive class.
from sklearn.metrics import roc_auc_score

def get_scores_with_auc(model, X_test, y_test, model_name):
    print(f"{model_name}")

    # Predict classes
    y_pred = model.predict(X_test)

    # Metrics with class predictions
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("F1-Score:", f1_score(y_test, y_pred))

    # Calculate AUC if possible
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_proba)
        print("AUC-ROC:", auc)
    else:
        print("AUC-ROC: Not available (model has no predict_proba)")


In [None]:
get_scores_with_auc(nb, X_test, y_test, "Naive Bayes")
get_scores_with_auc(lr, X_test, y_test, "Logistic Regression")
get_scores_with_auc(rf, X_test, y_test, "Random Forest")
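
The difference between scoring with class labels and scoring with probabilities can be seen on a tiny example. A minimal sketch with made-up values (none of these numbers come from the loan data):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]                      # illustrative probabilities for class 1
labels_at_half = [int(s >= 0.5) for s in scores]   # hard labels at a 0.5 threshold

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels_at_half)
print(auc_from_scores, auc_from_labels)
```

On this toy input the probability-based AUC is 0.75, while the label-based one is 0.5, because thresholding discards the ranking information that AUC measures.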


In [None]:
#Evidence of overfitting or underfitting
print("Train Accuracy:", rf.score(X_train, y_train))
print("Test Accuracy:", rf.score(X_test, y_test))


In [None]:
#Question E
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define hyperparameters
param_grid = {'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'max_features': ['sqrt', 'log2']}

rf = RandomForestClassifier(random_state=42)

# Define GridSearchCV (i)
grid = GridSearchCV(estimator=rf,
    param_grid=param_grid,
    cv=3,scoring='f1',n_jobs=-1)

# Training
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)
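
Beyond the single best combination, `cv_results_` holds the cross-validated score of every parameter combination, which helps judge how much the tuning actually matters. A minimal sketch on synthetic data with a deliberately small grid (the data, the grid and the `demo_grid` name are stand-ins, not the notebook's real search):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the loan data (not the real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={"max_depth": [2, None], "max_features": ["sqrt", "log2"]},
    cv=3,
    scoring="f1",
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best rank first
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
```

If the top-ranked rows differ by less than their `std_test_score`, the simpler configuration is usually a reasonable choice.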


In [None]:
best_rf = grid.best_estimator_
best_rf.fit(X_train, y_train)


In [None]:
y_pred_best = best_rf.predict(X_test)

get_scores(y_test, y_pred_best, "Random Forest Optimized")


In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Generate confusion matrix
cm_best_rf = confusion_matrix(y_test, y_pred_best)

# Display confusion matrix
print("Confusion Matrix: Random Forest Optimized")
disp = ConfusionMatrixDisplay(confusion_matrix=cm_best_rf)
disp.plot()


In [None]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def print_metrics(title, y_true, y_pred):
    print(f"---- {title} ----")
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("Recall:", recall_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("F1-Score:", f1_score(y_true, y_pred))
    print("\n")

# BEFORE TUNING (Random Forest original)
print_metrics("Random Forest (Before Tuning)", y_test, y_pred_rf)

# AFTER TUNING (GridSearchCV best estimator)
print_metrics("Random Forest (After Tuning)", y_test, y_pred_best)

Maximum Loan Amount Prediction PART (B)

(Image removed for GitHub version.)

In [None]:
#Task 1:Regression

# Show dataset dimensions (subset retained for regression)
print("Shape of the retained dataset:", data_frame.shape)

# Select the features retained for regression ('emplyment_length' is the spelling used for this column elsewhere in the notebook)
retained_features = ['age', 'income', 'emplyment_length', 'loan_interest_rate']

# Show the selected features
print("Retained features for the regression model:")
print(retained_features)

# Target variable
target = 'loan_amount'
print("Target variable:", target)


In [None]:
#Task 2: Data Understanding: Producing Your Experimental Design

import matplotlib.pyplot as plt

retained_features = ['age', 'income', 'emplyment_length', 'loan_interest_rate']
target = 'loan_amount'

# Plot each retained feature
for col in retained_features + [target]:
    plt.figure(figsize=(5,4))
    plt.hist(data_frame[col], bins=30, color='skyblue', edgecolor='black')
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel("Frequency")
    plt.show()


In [None]:
#Task 3 — Data Preprocessing: Transforming your data
# Look at summary statistics
data_frame[retained_features + [target]].describe()

In [None]:
#Task 4: Split 80/20
from sklearn.model_selection import train_test_split

# X_model1 = only numerical retained features
X_model1 = data_frame[['age', 'income', 'emplyment_length', 'loan_interest_rate']]

# Target variable
y = data_frame['loan_amount']

# Train-test split
X1_train, X1_test, y1_train, y1_test = train_test_split(X_model1, y, test_size=0.2, random_state=42)


In [None]:
#Step 2 (Task 4)
# Select only numeric features for Model 1 (DT1)
numeric_features = [
    'age',
    'income',
    'emplyment_length',
    'loan_interest_rate',
    'loan_income_ratio']


In [None]:
# Every column except the regression target
all_features = [col for col in data_frame.columns if col != 'loan_amount']


In [None]:
#Step 2 (Task 4)
# All retained features for Model 2 (DT2)
categorical_features = [
    'home_ownership',
    'loan_intent',
    'payment_default_on_file',
    'credit_history_length',
    'income_category'
]

df_encoded = data_frame.copy()
le = LabelEncoder()

for col in categorical_features:
    df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))


In [None]:
# Model 1 (Task 4)
print("DT1 feature count:", len(numeric_features))

X1 = data_frame[numeric_features]
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y, test_size=0.2, random_state=42)


print("DT1 train shape:", X1_train.shape)
print("DT1 test shape:", X1_test.shape)


In [None]:
# Model 2 (DT2) - (Task 4)
print("DT2 feature count:", len(all_features))

X2 = data_frame[all_features]
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y, test_size=0.2, random_state=42
)

print("DT2 train shape:", X2_train.shape)
print("DT2 test shape:", X2_test.shape)


Task (5) – Evaluation: How good are your models

In [None]:
from sklearn.preprocessing import LabelEncoder

df_encoded = data_frame.copy()
le = LabelEncoder()

for col in categorical_features:
    df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))


In [None]:
from sklearn.tree import DecisionTreeRegressor

# Create DT1 (numeric-only model)
dt1 = DecisionTreeRegressor(random_state=42)
dt1.fit(X1_train, y1_train)

print("DT1 model trained successfully.")
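
`regression_scores` is called in the following cells, but its definition is not part of this version of the notebook. A minimal sketch of such a helper, assuming it reports MAE, RMSE and R² (the metric set and the returned dictionary are assumptions, not the original implementation):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

def regression_scores(model, X_test, y_test, model_name):
    """Print and return MAE, RMSE and R2 for a fitted regressor (assumed metric set)."""
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"---- {model_name} ----")
    print("MAE:", mae)
    print("RMSE:", rmse)
    print("R2:", r2)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Quick self-check on toy data (illustrative values only)
toy_X = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_y = np.array([10.0, 20.0, 30.0, 40.0])
toy_model = DecisionTreeRegressor(random_state=0).fit(toy_X, toy_y)
toy_scores = regression_scores(toy_model, toy_X, toy_y, "toy")
```

The self-check scores the tree on its own training data, so it only confirms that the helper runs; the cells below pass the held-out test sets instead.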

In [None]:
regression_scores(dt1, X1_test, y1_test, "DT1")


In [None]:
# Create DT2
from sklearn.tree import DecisionTreeRegressor
dt2 = DecisionTreeRegressor(random_state=42)
dt2.fit(X2_train, y2_train)

print("DT2 model trained successfully.")

In [None]:
regression_scores(dt2, X2_test, y2_test, "DT2")

In [None]:
from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

# Re-train best model (DT1) with pre-pruning (max depth = 4)
dt1_pruned = DecisionTreeRegressor(max_depth=4, random_state=42)
dt1_pruned.fit(X1_train, y1_train)

# Predictions
y_pred_pruned = dt1_pruned.predict(X1_test)

# Evaluate pruned model
print("DT1 Pruned (max_depth=4)")
regression_scores(dt1_pruned, X1_test, y1_test, "DT1 Pruned")

# Plot the pruned tree
plt.figure(figsize=(20, 10))
plot_tree(dt1_pruned, feature_names=X1_train.columns, filled=True, rounded=True)
plt.show()


In [None]:
print("Features used in DT1:")
print(X1_train.columns.tolist())
print("Number of features:", len(X1_train.columns))


In [None]:
# Client 60256 values
client_DT1 = {
    "age": 56,
    "income": 57000,
    "emplyment_length": 15,
    "loan_interest_rate": 23,
    "loan_income_ratio": 10
}

client_df = pd.DataFrame([client_DT1])[X1_train.columns]

pred_60256 = dt1_pruned.predict(client_df)[0]
pred_60256
