
Data Dictionary and Overview (from Kaggle https://www.kaggle.com/datasets/udaymalviya/bank-loan-data?resource=download)

This dataset contains 45,000 records of loan applicants. Each record has 13 attributes covering personal demographics, financial status, and loan details, plus one binary target variable.

Personal Information

person_age: Age of the applicant (in years).

person_gender: Gender of the applicant (male, female).

person_education: Educational background (High School, Bachelor, Master, etc.).

person_income: Annual income of the applicant (in USD).

person_emp_exp: Years of employment experience.

person_home_ownership: Type of home ownership (RENT, OWN, MORTGAGE).


Loan Details

loan_amnt: Loan amount requested (in USD).

loan_intent: Purpose of the loan (PERSONAL, EDUCATION, MEDICAL, etc.).

loan_int_rate: Interest rate on the loan (percentage).

loan_percent_income: Ratio of loan amount to annual income (loan_amnt / person_income), stored as a fraction such as 0.49 rather than a percentage.
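
As a quick sanity check on this definition, the sketch below recomputes the ratio for three rows copied from the DataFrame preview later in this notebook. The two-decimal rounding is an assumption inferred from those sample values; the data dictionary does not state it.

```python
import numpy as np
import pandas as pd

# Three rows copied from the notebook's DataFrame preview (rows 0, 1, 2)
sample = pd.DataFrame({
    "person_income": [71948.0, 12282.0, 12438.0],
    "loan_amnt": [35000.0, 1000.0, 5500.0],
    "loan_percent_income": [0.49, 0.08, 0.44],
})

# Recompute the ratio; rounding to 2 decimals is an assumption
recomputed = (sample["loan_amnt"] / sample["person_income"]).round(2)

# Compare with the stored column, allowing for floating-point noise
matches = np.isclose(recomputed, sample["loan_percent_income"]).all()
print(recomputed.tolist(), matches)
```

If the stored column disagreed with the recomputed one on the full data, that would be worth investigating before modeling, since the two features would then carry inconsistent information.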


Credit & Loan History

cb_person_cred_hist_length: Length of the applicant's credit history (in years).

credit_score: Credit score of the applicant.

previous_loan_defaults_on_file: Whether the applicant has previous loan defaults (Yes or No).



Target Variable

loan_status: 1 if the loan was repaid successfully, 0 if the applicant defaulted.


In [None]:
# Step 1: Clean the data
import pandas as pd

df = pd.read_csv("loan_data.csv")

# Count the rows that contain at least one null value
rows_with_nulls = df.isna().any(axis=1).sum()
print(f"Number of rows with null values: {rows_with_nulls}")

# The count is 0, so no rows need to be removed

Number of rows with null values: 0
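
Had the count been nonzero, a per-column breakdown would show where the gaps are before any rows are dropped. The following is a minimal sketch on a small made-up frame; the `toy` values are illustrative and are not taken from loan_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two deliberately missing values
toy = pd.DataFrame({
    "person_age": [22.0, np.nan, 25.0],
    "person_income": [71948.0, 12282.0, np.nan],
    "credit_score": [561, 504, 635],
})

# Same row-level check as in the cell above
rows_with_nulls = int(toy.isna().any(axis=1).sum())

# Per-column counts show which features are affected
nulls_per_column = toy.isna().sum()

# Drop every row that has at least one null
cleaned = toy.dropna()

print(rows_with_nulls)
print(nulls_per_column)
print(len(cleaned))
```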


In [4]:
# Step 2: Basic EDA (summary statistics for every column)
print(df.describe(include='all'))

          person_age person_gender person_education  person_income  \
count   45000.000000         45000            45000   4.500000e+04   
unique           NaN             2                5            NaN   
top              NaN          male         Bachelor            NaN   
freq             NaN         24841            13399            NaN   
mean       27.764178           NaN              NaN   8.031905e+04   
std         6.045108           NaN              NaN   8.042250e+04   
min        20.000000           NaN              NaN   8.000000e+03   
25%        24.000000           NaN              NaN   4.720400e+04   
50%        26.000000           NaN              NaN   6.704800e+04   
75%        30.000000           NaN              NaN   9.578925e+04   
max       144.000000           NaN              NaN   7.200766e+06   

        person_emp_exp person_home_ownership     loan_amnt loan_intent  \
count     45000.000000                 45000  45000.000000       45000   
unique     

In [5]:
# View df
df

Unnamed: 0,person_age,person_gender,person_education,person_income,person_emp_exp,person_home_ownership,loan_amnt,loan_intent,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,credit_score,previous_loan_defaults_on_file,loan_status
0,22.0,female,Master,71948.0,0,RENT,35000.0,PERSONAL,16.02,0.49,3.0,561,No,1
1,21.0,female,High School,12282.0,0,OWN,1000.0,EDUCATION,11.14,0.08,2.0,504,Yes,0
2,25.0,female,High School,12438.0,3,MORTGAGE,5500.0,MEDICAL,12.87,0.44,3.0,635,No,1
3,23.0,female,Bachelor,79753.0,0,RENT,35000.0,MEDICAL,15.23,0.44,2.0,675,No,1
4,24.0,male,Master,66135.0,1,RENT,35000.0,MEDICAL,14.27,0.53,4.0,586,No,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44995,27.0,male,Associate,47971.0,6,RENT,15000.0,MEDICAL,15.66,0.31,3.0,645,No,1
44996,37.0,female,Associate,65800.0,17,RENT,9000.0,HOMEIMPROVEMENT,14.07,0.14,11.0,621,No,1
44997,33.0,male,Associate,56942.0,7,RENT,2771.0,DEBTCONSOLIDATION,10.02,0.05,10.0,668,No,1
44998,29.0,male,Bachelor,33164.0,4,RENT,12000.0,EDUCATION,13.23,0.36,6.0,604,No,1


In [27]:
# Step 3: Preprocess
import numpy as np
from sklearn import preprocessing

# Convert the Yes/No column to a boolean column vector of shape (n, 1)
loans_bool = (df["previous_loan_defaults_on_file"] == "Yes").to_numpy().reshape(-1, 1)

# Separate out the dependent variable
y = df["loan_status"]


# Numerical Variables
numerical = df[[
    "person_age", "person_income", "person_emp_exp", "loan_amnt",
    "loan_int_rate", "loan_percent_income", "cb_person_cred_hist_length",
    "credit_score",
]]

# Min-max scale the numerical data so each column lies in [0, 1]
scaler = preprocessing.MinMaxScaler()
numerical_scaled = scaler.fit_transform(numerical)


# Categorical variables: one-hot encode, dropping the first level of each column
categorical = df[['person_gender', 'person_education', 'person_home_ownership', 'loan_intent']]
categorical_dummies = pd.get_dummies(categorical, drop_first=True).to_numpy()

# Create the design matrix Phi from the numerical, categorical, and boolean
# variables, plus a final column of ones for the intercept term
Phi = np.hstack((numerical_scaled, categorical_dummies, loans_bool, np.ones((df.shape[0], 1))))

print(f"Number of columns after preprocessing: {Phi.shape[1]}")

Number of columns after preprocessing: 23
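
To make the scaling step concrete, the sketch below applies the min-max formula, x' = (x - min) / (max - min), by hand. This is the transformation MinMaxScaler performs with its default feature range of (0, 1). The five input values are the minimum, quartiles, and maximum of person_age from the describe() output above, used here as a stand-in for the full column.

```python
import numpy as np

# min, 25%, 50%, 75%, max of person_age, taken from the describe() output
ages = np.array([20.0, 24.0, 26.0, 30.0, 144.0]).reshape(-1, 1)

# Column-wise min-max scaling to the range [0, 1]
col_min = ages.min(axis=0)
col_max = ages.max(axis=0)
scaled = (ages - col_min) / (col_max - col_min)

print(scaled.ravel().round(3))
```

Because the maximum age of 144 sets the top of the range, the median age of 26 lands near 0.05 after scaling, so most applicants are squeezed into a narrow band. The source does not say whether that maximum is a data-entry error, so checking the extreme values before scaling may be worthwhile.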
