# Loan Approval Prediction - End-to-End Workflow

**🎯 Goal**: We will apply a Supervised Learning workflow on the Loan Approval dataset to predict whether a loan application will be approved or not.

**🗒️ Scenario**

The Loan Approval dataset contains information about loan applicants, including their personal details, financial information, and loan characteristics. Our task is to build a model that can predict loan approval status.

**⚡ Task**

## 1. Imports and Data Loading

First, we import the necessary libraries and load the dataset from the provided URL. We then inspect the dataframe to understand its structure and check for missing values, which determines our preprocessing strategy.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the dataset from the provided URL
loan_data = pd.read_csv('https://raw.githubusercontent.com/prasertcbs/basic-dataset/refs/heads/master/Loan-Approval-Prediction.csv')

# Display the shape of the dataset
loan_data.shape

(614, 13)

In [3]:
# Display the first 5 rows to inspect data types and example values
loan_data.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [4]:
# Check the distribution of the target variable to see if classes are balanced
loan_data['Loan_Status'].value_counts()

Loan_Status
Y    422
N    192
Name: count, dtype: int64
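
With 422 approvals against 192 rejections, the classes are imbalanced, so accuracy needs a reference point. As a rough sketch (the counts are copied from the output above; the variable names are illustrative), the accuracy of a model that always predicts the majority class can be computed like this:

```python
import pandas as pd

# Class counts copied from the value_counts() output above
class_counts = pd.Series({"Y": 422, "N": 192})

# Share of each class in the dataset
class_proportions = class_counts / class_counts.sum()

# A model that always predicts the majority class reaches this accuracy
baseline_accuracy = class_proportions.max()

print(class_proportions.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should beat this baseline for its accuracy to be meaningful.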

In [5]:
# Identify columns with missing values to decide on imputation strategies
loan_data.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
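
The missing values fall in both numeric columns (such as LoanAmount) and categorical columns (such as Gender), which suggests two imputation strategies: the median for numbers and the most frequent value for categories. A minimal sketch on a made-up toy frame (the values are invented for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with gaps; values are invented for illustration
toy_missing = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
    "Gender": ["Male", "Female", np.nan, "Male"],
})

# Median imputation for the numeric column
num_imputer = SimpleImputer(strategy="median")
loan_filled = num_imputer.fit_transform(toy_missing[["LoanAmount"]])

# Most-frequent imputation for the categorical column
cat_imputer = SimpleImputer(strategy="most_frequent")
gender_filled = cat_imputer.fit_transform(toy_missing[["Gender"]])

print(loan_filled.ravel())    # the gap is replaced by the median of 100, 120, 140
print(gender_filled.ravel())  # the gap is replaced by the most common category
```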

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc

## 2. Feature Selection and Splitting

We separate the data into the feature matrix (X) and the target vector (y) as specified in the assignment. We will identify numerical and categorical features to build appropriate preprocessing pipelines.

In [None]:
# Split features and target as specified in assignment.md
X = loan_data.drop('Loan_Status', axis=1)
y = loan_data['Loan_Status']

# Display feature names and data types
print("Feature columns:")
print(X.columns.tolist())
print("\nData types:")
print(X.dtypes)

# Encode target variable from 'Y'/'N' to 1/0 for sklearn compatibility
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

print(f"\nTarget variable encoding: {dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))}")
print(f"Target distribution after encoding:\n{pd.Series(y).value_counts()}")

Feature columns:
['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area']

Data types:
Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
dtype: object
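
`LabelEncoder` assigns integer codes in sorted order of the class labels, so `'N'` maps to 0 and `'Y'` maps to 1. A small sketch on a hand-written sample (the sample list and variable names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written sample of target labels, invented for illustration
sample_labels = ["Y", "N", "Y", "Y", "N"]

toy_encoder = LabelEncoder()
encoded_labels = toy_encoder.fit_transform(sample_labels)

# classes_ is sorted, so index 0 -> 'N' and index 1 -> 'Y'
label_mapping = {label: code for code, label in enumerate(toy_encoder.classes_)}

print(encoded_labels)
print(label_mapping)

# inverse_transform recovers the original string labels
print(toy_encoder.inverse_transform(encoded_labels))
```

Because approval (`'Y'`) becomes 1, it is treated as the positive class by metrics that default to `pos_label=1`.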


In [None]:
# Identify numerical and categorical features
# Numerical features are typically int64 or float64
# Categorical features are typically object type
# Note: Exclude Loan_ID as it's just an identifier and not useful for prediction

numerical_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

# Remove Loan_ID from categorical features if present
if 'Loan_ID' in categorical_features:
    categorical_features.remove('Loan_ID')

print("Numerical features:", numerical_features)
print("Categorical features:", categorical_features)

Numerical features: ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']
Categorical features: ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']


## 3. Building the Pre-Processing Pipeline

This is the core of the workflow. We use Pipeline to chain sequential steps (like imputation and scaling) and ColumnTransformer to apply these different pipelines to specific columns (numerical vs. categorical) simultaneously.

- **Numerical Data**: We fill missing values with the median and standardize each feature to zero mean and unit variance using StandardScaler.

- **Categorical Data**: We fill missing values with the most frequent value and convert text categories into binary vectors (One-Hot Encoding).

`sklearn`'s `Pipeline` assembles several steps into a single object. It applies a list of transformers in sequence, followed by a final estimator. Intermediate steps must be transformers, that is, they must implement `fit` and `transform`; the final estimator only needs to implement `fit`.
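
To make the fit/transform mechanics concrete, here is a minimal sketch on invented numbers (the arrays and the name `toy_steps` are assumptions for illustration). The statistics are learned from the training array only and then reused on new data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data; values are invented for illustration
toy_train = np.array([[1.0], [np.nan], [3.0], [5.0]])
toy_test = np.array([[np.nan], [7.0]])

toy_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# fit_transform learns the median, then the mean/std, from the training data only
train_out = toy_steps.fit_transform(toy_train)

# transform reuses those learned statistics on new data
test_out = toy_steps.transform(toy_test)

print("Learned median:", toy_steps.named_steps["imputer"].statistics_)
print("Learned mean:", toy_steps.named_steps["scaler"].mean_)
print(train_out.ravel())
print(test_out.ravel())
```

The missing test value is filled with the training median, so it lands exactly on the training mean after scaling. No information from the test data leaks into the preprocessing.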

In [9]:
# 1. Define pipeline for numerical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), # Fill missing values with median
    ('scaler', StandardScaler())                   # Standardize features (mean=0, variance=1)
])

# 2. Define pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # Fill missing with mode
    ('onehot', OneHotEncoder(handle_unknown='ignore'))    # Convert categories to binary vectors
])

# 3. Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        # Apply numerical pipeline to numerical columns
        ('num', numerical_transformer, numerical_features),
        # Apply categorical pipeline to categorical columns
        ('cat', categorical_transformer, categorical_features)
    ])

# 4. Create the full end-to-end pipeline including the model
# This ensures raw data flows through preprocessing directly into the model
model = LogisticRegression()
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)
                          ])

## 4. Training and Evaluation (Logistic Regression)

Next, we split the data into training and testing sets. We fit the entire pipeline on the training data and evaluate its performance on the unseen test data using several classification metrics.

In [10]:
# Split data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the pipeline
pipeline.fit(X_train, y_train)

# Generate predictions
preds = pipeline.predict(X_test)

# Calculate classification metrics
accuracy = accuracy_score(y_test, preds)
precision = precision_score(y_test, preds)
recall = recall_score(y_test, preds)
f1 = f1_score(y_test, preds)

# Calculate ROC curve and AUC score
# Note: We use predict_proba for ROC/AUC to get probability scores instead of class labels
fpr, tpr, thresholds = roc_curve(y_test, pipeline.predict_proba(X_test)[:,1])
roc_auc = auc(fpr, tpr)

# Output results
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')
print(f"AUC: {roc_auc:.2f}")

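**Note:** If this cell is run while `y` still holds the original `'Y'`/`'N'` strings (that is, before the label-encoding cell in Section 2 has been executed), the metric calls fail with:

ValueError: pos_label=1 is not a valid label. It should be one of ['N', 'Y']

`precision_score`, `recall_score`, and `f1_score` default to `pos_label=1`, which does not exist among string labels. Running the encoding cell first, so that `y` contains 0/1, avoids the error.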
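
The binary metrics used here need to know which class counts as positive. As a hedged sketch on hand-written labels (the lists and the `toy_` names are invented for illustration), string targets work once `pos_label` is named explicitly, while the default of 1 raises the error shown above:

```python
from sklearn.metrics import precision_score, recall_score

# Hand-written labels and predictions, invented for illustration
y_true_toy = ["Y", "N", "Y", "Y", "N", "Y"]
y_pred_toy = ["Y", "Y", "Y", "N", "N", "Y"]

# The default pos_label=1 is not among the string labels, so this raises
try:
    precision_score(y_true_toy, y_pred_toy)
except ValueError as err:
    print("Default pos_label failed:", err)

# Naming the positive class explicitly works with string labels
toy_precision = precision_score(y_true_toy, y_pred_toy, pos_label="Y")
toy_recall = recall_score(y_true_toy, y_pred_toy, pos_label="Y")

# 3 true positives, 1 false positive, 1 false negative
print(f"Precision: {toy_precision:.2f}")
print(f"Recall: {toy_recall:.2f}")
```

Encoding the target to 0/1 up front, as done in Section 2, is the simpler route because every metric then works with its defaults.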

## 5. Alternative Model (KNN with MinMaxScaler)

Now, we recreate the workflow but use min-max scaling for numerical features and KNN classifier for the model. This allows us to compare different preprocessing and modeling approaches.

In [None]:
# 1. Redefine numerical pipeline with Min-Max Scaling
# Min-Max scaling maps each feature linearly to the fixed range [0, 1], which preserves the shape of the original distribution
numerical_transformer_knn = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), # Fill missing values
    ('scaler', MinMaxScaler())                     # Scale to range [0, 1]
])

# 2. Update the ColumnTransformer
# We reuse the 'categorical_transformer' defined in the previous section (Imputer + OneHotEncoder)
preprocessor_knn = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer_knn, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# 3. Define the new model: K-Nearest Neighbors
model_knn = KNeighborsClassifier()

# 4. Create the new pipeline
pipeline_knn = Pipeline(steps=[('preprocessor', preprocessor_knn),
                               ('model', model_knn)
                              ])
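
KNN classifies by distance, so a feature with a large numeric range can dominate the neighbor search unless features are rescaled. A small sketch with invented values (the array and the helper `euclidean` are assumptions for illustration) shows the effect of `MinMaxScaler` on pairwise distances:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two toy features on very different scales: an income and a 0/1 flag
X_toy = np.array([
    [2000.0, 1.0],
    [2100.0, 0.0],
    [9000.0, 1.0],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.linalg.norm(a - b))

# Unscaled: income dominates, so row 1 looks far closer to row 0 than row 2 does
raw_01 = euclidean(X_toy[0], X_toy[1])
raw_02 = euclidean(X_toy[0], X_toy[2])

# Scaled: both features lie in [0, 1], so the flag difference now matters
X_toy_scaled = MinMaxScaler().fit_transform(X_toy)
scaled_01 = euclidean(X_toy_scaled[0], X_toy_scaled[1])
scaled_02 = euclidean(X_toy_scaled[0], X_toy_scaled[2])

print(f"Raw distances:    0-1 = {raw_01:.2f}, 0-2 = {raw_02:.2f}")
print(f"Scaled distances: 0-1 = {scaled_01:.4f}, 0-2 = {scaled_02:.4f}")
```

After scaling, the two distances are nearly equal, because a flip in the binary flag counts as much as the full income range.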

**Training and Evaluation**

We fit this new pipeline to the same training data used previously and evaluate its performance. This allows for a direct comparison between the Logistic Regression (StandardScaler) approach and this KNN (MinMaxScaler) approach.

In [None]:
# Train the KNN pipeline
pipeline_knn.fit(X_train, y_train)

# Generate predictions
preds_knn = pipeline_knn.predict(X_test)

# Calculate classification metrics
accuracy_knn = accuracy_score(y_test, preds_knn)
precision_knn = precision_score(y_test, preds_knn)
recall_knn = recall_score(y_test, preds_knn)
f1_knn = f1_score(y_test, preds_knn)

# Calculate ROC/AUC
# Note: KNN supports predict_proba, which allows us to calculate AUC just like Logistic Regression
fpr_knn, tpr_knn, thresholds_knn = roc_curve(y_test, pipeline_knn.predict_proba(X_test)[:,1])
roc_auc_knn = auc(fpr_knn, tpr_knn)

# Output results
print(f'Accuracy: {accuracy_knn:.2f}')
print(f'Precision: {precision_knn:.2f}')
print(f'Recall: {recall_knn:.2f}')
print(f'F1 Score: {f1_knn:.2f}')
print(f"AUC: {roc_auc_knn:.2f}")
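
The AUC here is computed in two steps, `roc_curve` followed by `auc`. As a sanity check, the sketch below uses invented labels and scores (all `toy_` names are assumptions) to show that this matches scikit-learn's one-step `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Invented labels and predicted probabilities for the positive class
toy_y_true = np.array([0, 0, 1, 1, 0, 1])
toy_y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Two-step route: build the ROC curve, then integrate it
toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_y_true, toy_y_score)
auc_from_curve = auc(toy_fpr, toy_tpr)

# One-step route
auc_direct = roc_auc_score(toy_y_true, toy_y_score)

print(f"AUC via roc_curve + auc: {auc_from_curve:.4f}")
print(f"AUC via roc_auc_score:   {auc_direct:.4f}")
```

In this toy example 8 of the 9 positive/negative pairs are ranked correctly, so both routes give 8/9.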

## 6. Model Comparison

Let's compare the performance of both models side by side:

In [None]:
# Create a comparison DataFrame
comparison = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC'],
    'Logistic Regression': [accuracy, precision, recall, f1, roc_auc],
    'KNN (MinMaxScaler)': [accuracy_knn, precision_knn, recall_knn, f1_knn, roc_auc_knn]
})

print("Model Comparison:")
print(comparison.to_string(index=False))