# Loan Approval Classification

## Introduction

In this practice exercise, you will explore and analyze a dataset related to loan approval decisions. This dataset contains various features that might influence the likelihood of a loan being approved, such as personal income, employment experience, credit history and more. Your goal is to **build a deep learning model** using a neural network to predict whether a loan will be approved based on these features. This will give you practical experience in **data preprocessing**, **feature selection** and **model training** using PyTorch.

The dataset can be downloaded from Kaggle:[Loan Approval Classification Data](https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data).

This exercise will be divided into two parts:

- **Part 1**: You will use **all available features** to build and train the model.
- **Part 2**: You will refine the model by using only those **features that show a strong correlation** (greater than 25%) with the target variable, which should help in improving the model's performance. 

![Alt Text](./images/lp2.jpg)

## Import Libraries

This code imports necessary libraries and modules. Pandas is used for data manipulation, **PyTorch** for building and training the neural network and **scikit-learn** for data scaling and splitting.

In [65]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset

## Load Data and Perform One-hot Encoding

In this cell, the loan dataset is loaded from a CSV file. It also performs **one-hot encoding** on categorical variables to convert them into a format that can be processed by neural networks.

In [64]:
data_path = './data/loan_data.csv'
loan_data = pd.read_csv(data_path)

# One-hot encode non-numerical data
loan_data_encoded = pd.get_dummies(loan_data)

In [29]:
loan_data_encoded

Unnamed: 0,person_age,person_income,person_emp_exp,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,credit_score,loan_status,person_gender_female,...,person_home_ownership_OWN,person_home_ownership_RENT,loan_intent_DEBTCONSOLIDATION,loan_intent_EDUCATION,loan_intent_HOMEIMPROVEMENT,loan_intent_MEDICAL,loan_intent_PERSONAL,loan_intent_VENTURE,previous_loan_defaults_on_file_No,previous_loan_defaults_on_file_Yes
0,22.0,71948.0,0,35000.0,16.02,0.49,3.0,561,1,True,...,False,True,False,False,False,False,True,False,True,False
1,21.0,12282.0,0,1000.0,11.14,0.08,2.0,504,0,True,...,True,False,False,True,False,False,False,False,False,True
2,25.0,12438.0,3,5500.0,12.87,0.44,3.0,635,1,True,...,False,False,False,False,False,True,False,False,True,False
3,23.0,79753.0,0,35000.0,15.23,0.44,2.0,675,1,True,...,False,True,False,False,False,True,False,False,True,False
4,24.0,66135.0,1,35000.0,14.27,0.53,4.0,586,1,False,...,False,True,False,False,False,True,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44995,27.0,47971.0,6,15000.0,15.66,0.31,3.0,645,1,False,...,False,True,False,False,False,True,False,False,True,False
44996,37.0,65800.0,17,9000.0,14.07,0.14,11.0,621,1,True,...,False,True,False,False,True,False,False,False,True,False
44997,33.0,56942.0,7,2771.0,10.02,0.05,10.0,668,1,False,...,False,True,True,False,False,False,False,False,True,False
44998,29.0,33164.0,4,12000.0,13.23,0.36,6.0,604,1,False,...,False,True,False,True,False,False,False,False,True,False


## Part 1: Utilizing All Features

This cell prepares the data by separating the **features** (independent variables) and the **target** (dependent variable *loan_status*).

In [66]:
features = loan_data_encoded.drop('loan_status', axis=1)
target = loan_data_encoded['loan_status']

After executing the cell that separates the features and the target variable for the entire dataset, you have the flexibility to directly proceed to the "*Splitting and Scaling the Data*" section. This allows you to **test the model using all available features** without applying any feature selection based on correlation. When you're ready to **experiment with a potentially more efficient model** that only includes highly correlated features, you can return and run the intermediate cells that perform **feature selection** based on the correlation threshold. This approach lets you **compare the performance** of the neural network when using all features versus using a subset of features with strong correlations to the target variable, enhancing your understanding of feature importance and model optimization.

## Part 2: Using Highly Correlated Features

 This cell identifies and selects features that have a **strong correlation** (absolute value greater than 25%) with the target variable *loan_status*. The selected features are expected to have more impact on the model's predictions.

In [79]:
correlation_matrix = loan_data_encoded.corr()
target_correlation = correlation_matrix['loan_status']

significant_features = target_correlation[abs(target_correlation) > 0.25].index.tolist()
significant_features.remove('loan_status')  # Exclude the target variable itself
print(significant_features)
filtered_data = loan_data_encoded[significant_features + ['loan_status']]

['loan_int_rate', 'loan_percent_income', 'person_home_ownership_RENT', 'previous_loan_defaults_on_file_No', 'previous_loan_defaults_on_file_Yes']


## Data Preparation for Training

Here, the dataset is split into **filtered features** (independent variables) and the target (dependent variable, *loan_status*).

In [68]:
filtered_data = loan_data[['loan_status', 'person_income', 'credit_score', 'loan_percent_income','loan_intent']]
filtered_data = pd.get_dummies(filtered_data,columns=['loan_intent'])

In [69]:
features = filtered_data.drop('loan_status', axis=1)
target = filtered_data['loan_status']

In [62]:
print(target)

0        1
1        0
2        1
3        1
4        1
        ..
44995    1
44996    1
44997    1
44998    1
44999    1
Name: loan_status, Length: 45000, dtype: int64


## Splitting and Scaling the Data

The data is divided into training and validation sets. The features are then **scaled** using standard scaling, which normalizes the data by subtracting the mean and scaling to unit variance, facilitating more efficient training of neural networks.

In [70]:
X_train, X_val, y_train, y_val = train_test_split(features, target, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

## Converting Data to PyTorch Tensors and Creating Data Loaders

This cell converts the scaled training and validation sets into PyTorch **tensors** and creates **data loaders**. Data loaders facilitate batch processing, shuffling for the training set and sequential access for the validation set.

In [75]:
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32)
X_val_tensor = torch.tensor(X_val_scaled, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val.values, dtype=torch.float32)

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False)

In [63]:
X_train_tensor

tensor([[ 0.1942, -0.0324,  0.1191,  ..., -0.4834,  2.2294, -0.4573],
        [-0.0900,  0.5247,  0.2339,  ..., -0.4834, -0.4486,  2.1866],
        [ 2.8282,  1.0022, -0.7999,  ..., -0.4834,  2.2294, -0.4573],
        ...,
        [-0.2610,  0.3854,  0.4637,  ..., -0.4834, -0.4486,  2.1866],
        [ 1.1450, -0.5697, -0.3404,  ..., -0.4834, -0.4486, -0.4573],
        [ 1.5225,  0.6242, -0.7999,  ..., -0.4834,  2.2294, -0.4573]])

## Defining the NN Model

A **sequential neural network model** is defined here, featuring several layers with ReLU activations and a final sigmoid layer to output a probability for binary classification.

In [76]:
model = nn.Sequential(
    nn.Linear(features.shape[1], 32),
    nn.ReLU(),
    nn.Linear(32, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid()
)

## Setting Loss Function and Optimizer

**Binary Cross-Entropy Loss** is used as the loss function, which is suitable for binary classification problems. The **Adam** optimizer is used to adjust model weights based on this loss.

In [77]:
criterion = nn.BCEWithLogitsLoss() 
optimizer = optim.Adam(model.parameters(), lr=0.0015)

## Training and Validation of the Model

This final code cell contains the training loop where the model is trained for a specified number of epochs, and its performance is evaluated on both the training and validation datasets. Loss and accuracy metrics are calculated and printed for each epoch to monitor the training progress and effectiveness of the model.

In [78]:
num_epochs = 40
for epoch in range(num_epochs):
    model.train()
    loss = 0
    train_correct = 0
    train_total = 0
    for inputs, labels in train_loader:
        outputs = model(inputs).squeeze()
        loss = criterion(outputs, labels)
        predicted = nn.functional.sigmoid(outputs) > 0.5
        train_total += labels.size(0)
        train_correct += (predicted == labels).sum().item()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    train_accuracy = 100 * train_correct / train_total
    print(f'Epoch {epoch+1}, Loss: {loss.item()}, Train Acc: {train_accuracy}%')
model.eval()
with torch.no_grad():
    outputs = model(X_train_tensor)
    y_pred = nn.functional.sigmoid(outputs) > 0.5
    y_pred_correct = y_pred.type(torch.float32) == y_train_tensor
    print(y_pred_correct.type(torch.float32).mean())

Epoch 1, Loss: 0.6931177377700806, Train Acc: 22.81388888888889%
Epoch 2, Loss: 0.6735822558403015, Train Acc: 37.205555555555556%
Epoch 3, Loss: 0.6763483285903931, Train Acc: 71.11111111111111%
Epoch 4, Loss: 0.681469202041626, Train Acc: 74.46944444444445%
Epoch 5, Loss: 0.6768664121627808, Train Acc: 74.73611111111111%
Epoch 6, Loss: 0.6877772212028503, Train Acc: 76.1361111111111%
Epoch 7, Loss: 0.681305468082428, Train Acc: 76.65555555555555%
Epoch 8, Loss: 0.6457577347755432, Train Acc: 76.08333333333333%
Epoch 9, Loss: 0.703721284866333, Train Acc: 76.51388888888889%
Epoch 10, Loss: 0.6459529995918274, Train Acc: 76.34722222222223%
Epoch 11, Loss: 0.6578864455223083, Train Acc: 76.21111111111111%
Epoch 12, Loss: 0.6747952103614807, Train Acc: 76.96388888888889%
Epoch 13, Loss: 0.6579850912094116, Train Acc: 76.91666666666667%
Epoch 14, Loss: 0.6931791305541992, Train Acc: 77.23333333333333%
Epoch 15, Loss: 0.6459723114967346, Train Acc: 78.30833333333334%
Epoch 16, Loss: 0.6593

## Comparative Analysis of Model Performance

In our hands-on exercise on loan approval predictions using PyTorch, we observed a significant result concerning model accuracy. When comparing the performance of the two models—one trained with all available features and the other with only the features highly correlated (greater than 25%) with the target variable—a notable pattern emerged:

**Models' Performance Insight**:

- The model trained with **all available features** demonstrated **higher accuracy** on the **validation dataset** compared to the model that used only the highly correlated features.

## Why Does Using All Features Sometimes Lead to Better Performance?

1. **Comprehensive Data Representation**: When all features are used, the model has access to a broader spectrum of information. This comprehensive data can include subtle but valuable patterns that might be overlooked when only selecting features based on correlation.

2. **Interactions Between Features**: Some features may not show a strong individual correlation with the target variable but could be influential when combined with other features. Using all available data allows the model to capture these complex interactions.

3. **Reduced Risk of Overfitting to Correlation**: Focusing solely on highly correlated features can sometimes lead to models that are too tailored to the specifics of the training data, including potential noise and anomalies specific to that subset of features. This can affect the model's ability to generalize to new, unseen data.

## Conclusion

The findings suggest that while feature selection based on correlation can **simplify the model and reduce training time**, it **does not always guarantee** improved performance. This underscores the importance of experimenting with different feature subsets and using validation performance as a benchmark to judge the best approach. In this case, utilizing the complete dataset provided a more accurate and robust model for predicting loan approvals, highlighting that more data often leads to better learning and prediction capabilities.

In practice, it's beneficial to **evaluate models both with and without feature selection** to determine the **optimal** approach for specific datasets and objectives. This exercise not only enhances your technical skills but also deepens your understanding of strategic model development in data science.