Business Scenario
A fintech company provides instant credit limits to customers using a mobile app.
Instead of using complex models, the company wants a similarity-based system that works like:
“Show me customers similar to this new customer, and decide risk accordingly.”
Challenges:
• Customers are not easily separable by straight lines
• Decisions depend on nearness, not formulas
• Feature scale (income vs age) matters a lot
• The system must be interactive and explainable


• Load the dataset and explore customer attributes such as:
    - Age
    - Income
    - Loan amount
    - Credit history
• Identify which features should be used to measure customer similarity.

• Prepare the data so that distance-based comparison is meaningful.
• Explain why preprocessing is necessary for this algorithm.
• Build a classification model that:
    - Assigns a customer to High Risk or Low Risk
    - Makes decisions based on nearest neighbors
• Train the model using historical customer data.
• Experiment with different values of K.

• Analyze:
    - What happens when K is very small?
    - What happens when K is very large?

• Identify the value of K that gives balanced performance.
• Predict the risk category for unseen customers.
• Show how the prediction changes when K changes.
• Evaluate the model using:
    - Accuracy
    - Confusion Matrix
• Analyze:
    - How many risky customers were correctly identified?
    - How many safe customers were misclassified?
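
To see why feature scale matters for a nearness-based decision, the sketch below compares Euclidean distances before and after standardization. The three reference customers and the new customer are made-up values for illustration only, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age, annual income]. Values are invented for illustration.
customers = np.array([
    [60, 50300],   # index 0: very different age, almost identical income
    [26, 52000],   # index 1: similar age, somewhat higher income
    [45, 47000],   # index 2: different age and income
], dtype=float)
new_customer = np.array([[24, 50200]], dtype=float)

# Raw Euclidean distance: income gaps (hundreds to thousands) swamp age gaps (tens).
raw_dist = np.linalg.norm(customers - new_customer, axis=1)
raw_nearest = int(np.argmin(raw_dist))

# Standardize each feature to mean 0 and unit variance, then measure distance again.
scaler = StandardScaler().fit(customers)
scaled_dist = np.linalg.norm(
    scaler.transform(customers) - scaler.transform(new_customer), axis=1
)
scaled_nearest = int(np.argmin(scaled_dist))

print("nearest on raw features:   ", raw_nearest)
print("nearest on scaled features:", scaled_nearest)
```

On the raw values the 60-year-old counts as the nearest customer, simply because the income gap is small in absolute terms. After scaling, the 26-year-old with a moderately higher income becomes the nearest one, which is closer to what "similar customer" is meant to capture.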

In [50]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load & Inspect

In [51]:
path = '../Data_Source/Kaggle/credit_risk_dataset.csv'
df = pd.read_csv(path)

In [52]:
df.shape

(32581, 12)

In [53]:
df.head()

,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4


In [54]:
df.columns

Index(['person_age', 'person_income', 'person_home_ownership',
       'person_emp_length', 'loan_intent', 'loan_grade', 'loan_amnt',
       'loan_int_rate', 'loan_status', 'loan_percent_income',
       'cb_person_default_on_file', 'cb_person_cred_hist_length'],
      dtype='object')

In [55]:
df.dtypes

person_age                      int64
person_income                   int64
person_home_ownership          object
person_emp_length             float64
loan_intent                    object
loan_grade                     object
loan_amnt                       int64
loan_int_rate                 float64
loan_status                     int64
loan_percent_income           float64
cb_person_default_on_file      object
cb_person_cred_hist_length      int64
dtype: object

In [56]:
df.isna().sum()

person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              895
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 3116
loan_status                      0
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64

# Handling Missing Values

In [57]:
# percentage of missing values
perc1 = df['person_emp_length'].isna().sum() / df.shape[0] * 100
perc2 = df['loan_int_rate'].isna().sum() / df.shape[0] * 100

print(f'person_emp_length: {perc1}')
print(f'loan_int_rate: {perc2}')

person_emp_length: 2.7469997851508547
loan_int_rate: 9.563856235229121


In [58]:
# Drop every row that has a missing value in any column
# (person_emp_length and loan_int_rate are the only columns affected)
df = df.dropna()

In [59]:
df.isna().sum()

person_age                    0
person_income                 0
person_home_ownership         0
person_emp_length             0
loan_intent                   0
loan_grade                    0
loan_amnt                     0
loan_int_rate                 0
loan_status                   0
loan_percent_income           0
cb_person_default_on_file     0
cb_person_cred_hist_length    0
dtype: int64

In [60]:
df.shape

(28638, 12)
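
Dropping rows reduces the data from 32581 to 28638 rows. If losing those rows were a concern, one alternative (not used in this notebook) would be to fill the gaps instead. The sketch below shows median imputation with scikit-learn's `SimpleImputer` on a tiny made-up frame; the values are illustrative, and in a real run the imputer would be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up frame with gaps in the same two columns as the dataset.
sample = pd.DataFrame({
    "person_emp_length": [1.0, np.nan, 5.0, 8.0],
    "loan_int_rate": [11.14, 12.87, np.nan, 15.23],
})

# Replace each missing value with the median of its own column.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(sample), columns=sample.columns)

print(filled)
```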

# Label Encoding

In [61]:
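# Encode each text column as integer codes. The fitted encoders are kept
# so the same mapping can be applied to unseen customers later.
encoders = {}
for col in df.select_dtypes(include='object').columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])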

In [62]:
df.dtypes

person_age                      int64
person_income                   int64
person_home_ownership           int32
person_emp_length             float64
loan_intent                     int32
loan_grade                      int32
loan_amnt                       int64
loan_int_rate                 float64
loan_status                     int64
loan_percent_income           float64
cb_person_default_on_file       int32
cb_person_cred_hist_length      int64
dtype: object

In [63]:
df.head()

,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,3,123.0,4,3,35000,16.02,1,0.59,1,3
1,21,9600,2,5.0,1,1,1000,11.14,0,0.1,0,2
2,25,9600,0,1.0,3,2,5500,12.87,1,0.57,0,3
3,23,65500,3,4.0,3,2,35000,15.23,1,0.53,0,2
4,24,54400,3,8.0,3,2,35000,14.27,1,0.55,1,4
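
One caveat for a distance-based model: `LabelEncoder` assigns codes alphabetically (in the preview above, MORTGAGE=0, OWN=2, RENT=3), so Euclidean distance will treat RENT as farther from MORTGAGE than OWN is, even though the categories have no natural order. A possible alternative, shown here only as a sketch on the first four preview rows, is one-hot encoding with `pd.get_dummies`. It is not what this notebook does, and it would change the reported results.

```python
import pandas as pd

# First four rows of the preview, restricted to two columns for brevity.
sample = pd.DataFrame({
    "person_age": [22, 21, 25, 23],
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
})

# One indicator column per category, so no artificial ordering is implied.
encoded = pd.get_dummies(sample, columns=["person_home_ownership"], dtype=int)

print(encoded)
```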


# Feature Separation

In [64]:
# Features: every column except the target
X = df.drop('loan_status', axis=1)
# Target: loan_status (0 or 1)
y = df['loan_status']

# Train-Test Split

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Standardization

In [66]:
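# Fit the scaler on the training data only, then apply the same
# mean and standard deviation to the test data to avoid leakage.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)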

# Model Building

In [67]:
classifier = KNeighborsClassifier(n_neighbors=25)
classifier.fit(X_train_scaled, y_train)

KNeighborsClassifier(n_neighbors=25)

Parameter      Value
n_neighbors    25
weights        'uniform'
algorithm      'auto'
leaf_size      30
p              2
metric         'minkowski'
metric_params  None
n_jobs         None
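
The model above fixes K at 25. To experiment with other values, one approach is to refit the classifier over a range of K and compare training and test accuracy. The sketch below does this on synthetic data from `make_classification`, used as a stand-in so the example runs on its own; the sample size, class balance, and list of K values are assumptions, and the numbers it prints say nothing about the credit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer data (assumption: roughly 80/20 class balance).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.8, 0.2], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

results = {}
for k in [1, 5, 25, 101, 401]:
    # Scaling lives inside the pipeline so it is fitted on training data only.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_tr, y_tr)
    results[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"K={k:>3}  train acc={results[k][0]:.3f}  test acc={results[k][1]:.3f}")
```

With K=1 every training point is its own nearest neighbor, so training accuracy is perfect while the model stays sensitive to noise. As K grows very large, votes are increasingly dominated by the majority class. A balanced K is usually picked where test accuracy stops improving; for the real data, that has to be read off an actual run with `X_train_scaled` and `X_test_scaled`.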


In [68]:
y_pred = classifier.predict(X_test_scaled)

In [69]:
# accuracy: share of test customers whose predicted class matches the actual class
accuracy_score(y_test, y_pred)

0.8681913407821229
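
Accuracy summarizes the whole test set. For a single unseen customer, it is often more informative to look at the neighbors behind the decision and at how the decision shifts with K. The sketch below uses a handful of made-up, already-standardized points (not dataset rows), with 1 standing for high risk and 0 for low risk.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up, already-standardized points in two features; 1 = high risk, 0 = low risk.
X_hist = np.array([
    [0.1, 0.0],                            # a single high-risk customer very close by
    [0.5, 0.0], [-0.5, 0.0],               # low-risk customers a little farther out
    [0.0, 0.6], [0.0, -0.6],
    [3.0, 3.0], [3.2, 3.0], [3.0, 3.2],    # a distant high-risk cluster
])
y_hist = np.array([1, 0, 0, 0, 0, 1, 1, 1])
new_customer = np.array([[0.0, 0.0]])

predictions = {}
for k in [1, 3, 5]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_hist, y_hist)
    predictions[k] = int(knn.predict(new_customer)[0])
    # kneighbors returns the distances and row indices of the K closest customers,
    # which serves as the "similar customers" explanation for the decision.
    _, idx = knn.kneighbors(new_customer)
    print(f"K={k}: predicted={predictions[k]}, neighbor labels={y_hist[idx[0]].tolist()}")
```

With K=1 the single closest customer decides and the prediction is high risk. From K=3 onward the surrounding low-risk customers outvote it. On the real model, the same `kneighbors` call on a scaled customer row would list the historical customers that drove the decision.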

In [70]:
# confusion matrix: rows = actual class, columns = predicted class (order 0, 1)
confusion_matrix(y_test, y_pred)

array([[4305,  138],
       [ 617,  668]], dtype=int64)
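
The two analysis questions can be answered directly from this matrix. The sketch below unpacks the four cells and derives the rates, assuming `loan_status = 1` marks the high-risk class and `0` the low-risk class; the counts are copied from the output above.

```python
import numpy as np

# Confusion matrix reported above (rows = actual, columns = predicted, labels 0 then 1).
cm = np.array([[4305, 138],
               [617, 668]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall_risky = tp / (tp + fn)          # share of actual class-1 customers caught
false_alarm_rate = fp / (fp + tn)      # share of actual class-0 customers flagged

print(f"class 1 correctly identified: {tp} of {tp + fn} ({recall_risky:.1%})")
print(f"class 0 misclassified as 1:   {fp} of {fp + tn} ({false_alarm_rate:.1%})")
print(f"accuracy: {accuracy:.4f}")
```

Under that assumption, the model catches 668 of 1285 risky customers (about 52%) and wrongly flags 138 of 4443 safe customers (about 3%). The 86.8% accuracy is therefore carried largely by the majority class, and 617 risky customers are still predicted as low risk.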