
**Problem Statement:**

A financial institution wants to predict whether a customer will default on a loan before approving it. Early identification of risky customers helps reduce financial loss.
You are working as a Machine Learning Analyst and must build a classification model using the K-Nearest Neighbors (KNN) algorithm to predict loan default.


This case introduces:

•	Mixed feature types

•	Financial risk interpretation

•	Class imbalance awareness

Age, AnnualIncome (lakhs), CreditScore (300-900), LoanAmount (lakhs), LoanTerm (years), EmploymentType, Loan (1 = default, 0 = no default)

28,6.5,720,5,5,Salaried,0

45,12,680,10,10,Self-Employed,1

35,8,750,6,7,Salaried,0

50,15,640,12,15,Self-Employed,1

30,7,710,5,5,Salaried,0

42,10,660,9,10,Salaried,1

26,5.5,730,4,4,Salaried,0

48,14,650,11,12,Self-Employed,1

38,9,700,7,8,Salaried,0

55,16,620,13,15,Self-Employed,1




**Interpretation:**

**1.	Identify high-risk customers.**

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

data = {

        'Age': [28,45,35,50,30,42,26,48,38,55],
        'AnnualIncome': [6.5,12,8,15,7,10,5.5,14,9,16],
        'CreditScore': [720,680,750,640,710,660,730,650,700,620],
        'LoanAmount': [5,10,6,12,5,9,4,11,7,13],
        'LoanTerm': [5,10,7,15,5,10,4,12,8,15],
        'EmploymentType': ['Salaried','Self-Employed','Salaried','Self-Employed','Salaried','Salaried','Salaried','Self-Employed','Salaried','Self-Employed'],
        'Loan': [0,1,0,1,0,1,0,1,0,1]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Age,AnnualIncome,CreditScore,LoanAmount,LoanTerm,EmploymentType,Loan
0,28,6.5,720,5,5,Salaried,0
1,45,12.0,680,10,10,Self-Employed,1
2,35,8.0,750,6,7,Salaried,0
3,50,15.0,640,12,15,Self-Employed,1
4,30,7.0,710,5,5,Salaried,0
5,42,10.0,660,9,10,Salaried,1
6,26,5.5,730,4,4,Salaried,0
7,48,14.0,650,11,12,Self-Employed,1
8,38,9.0,700,7,8,Salaried,0
9,55,16.0,620,13,15,Self-Employed,1


In [None]:
le = LabelEncoder()
df['EmploymentType'] = le.fit_transform(df['EmploymentType'])

X = df.drop('Loan', axis=1)
y = df['Loan']

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)


In [None]:
df["PredictedRisk"] = knn.predict(X_scaled)
high_risk_customers = df[df["PredictedRisk"] == 1]
print(high_risk_customers)

   Age  AnnualIncome  CreditScore  LoanAmount  LoanTerm  EmploymentType  Loan  \
1   45          12.0          680          10        10               1     1   
3   50          15.0          640          12        15               1     1   
5   42          10.0          660           9        10               0     1   
7   48          14.0          650          11        12               1     1   
9   55          16.0          620          13        15               1     1   

   PredictedRisk  
1              1  
3              1  
5              1  
7              1  
9              1  
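
The predictions above are made on the same rows the model was trained on, so they say little about unseen customers. A minimal sketch of a fairer check, assuming leave-one-out cross-validation is acceptable for a 10-row dataset (the pipeline and the variable names `model`, `loo_pred`, and `cm` are illustrative, not part of the original notebook):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "AnnualIncome": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "CreditScore": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "LoanAmount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "LoanTerm": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "EmploymentType": ["Salaried", "Self-Employed", "Salaried", "Self-Employed", "Salaried",
                       "Salaried", "Salaried", "Self-Employed", "Salaried", "Self-Employed"],
    "Loan": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

X = df.drop(columns="Loan")
# Encode the binary category as 0/1 (Self-Employed = 1), matching the notebook's encoding
X["EmploymentType"] = (X["EmploymentType"] == "Self-Employed").astype(int)
y = df["Loan"]

# Scaling lives inside the pipeline, so it is re-fitted on each training fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# Each customer is predicted by a model that never saw that customer
loo_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y, loo_pred, labels=[0, 1])
print(cm)
```

Keeping the scaler inside the pipeline avoids leaking the held-out row's statistics into the training fold.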


**2. What patterns lead to loan default?**

In [None]:
#Pattern 1 : Credit Score

df = pd.DataFrame(data)
defaulted = df[df['Loan'] == 1]
non_defaulted = df[df['Loan'] == 0]

print(df.groupby('Loan')["CreditScore"].mean())
print("Lower credit score, higher probability of default")

Loan
0    722.0
1    650.0
Name: CreditScore, dtype: float64
Lower credit score, higher probability of default


In [None]:
# Pattern 2 : Loan amount VS Income

df["Loan_To_Income"] = df["LoanAmount"] / df["AnnualIncome"]
print(df.groupby("Loan")["Loan_To_Income"].mean())

print("Defaulters have higher loan burden relative to income")

Loan
0    0.747713
1    0.826310
Name: Loan_To_Income, dtype: float64
Defaulters have higher loan burden relative to income


In [None]:
# Pattern 3: Employment Type
print(df.groupby("EmploymentType")["Loan"].mean())
print("Self-employed customers default more frequently")

EmploymentType
Salaried         0.166667
Self-Employed    1.000000
Name: Loan, dtype: float64
Self-employed customers default more frequently
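
The default rates above rest on very few customers per group. A short sketch, using `pd.crosstab` to show the raw counts behind those rates (the variable name `counts` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "EmploymentType": ["Salaried", "Self-Employed", "Salaried", "Self-Employed", "Salaried",
                       "Salaried", "Salaried", "Self-Employed", "Salaried", "Self-Employed"],
    "Loan": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Rows: employment type; columns: Loan (0 = no default, 1 = default)
counts = pd.crosstab(df["EmploymentType"], df["Loan"])
print(counts)
```

With only four self-employed customers in the sample, the 100% default rate should be read as a pattern in this dataset rather than a general rule.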


In [None]:
# Pattern 4: Loan Tenure
print(df.groupby("Loan")["LoanTerm"].mean())
print("Longer loan terms lead to higher default rates")

Loan
0     5.8
1    12.4
Name: LoanTerm, dtype: float64
Longer loan terms lead to higher default rates


In [None]:
# Pattern 5: Age
print(df.groupby("Loan")["Age"].mean())
print("Defaulters are common in higher age brackets")

Loan
0    31.4
1    48.0
Name: Age, dtype: float64
Defaulters are common in higher age brackets
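
The per-feature group means printed in the cells above can also be collected into a single table. A minimal sketch, assuming the same data as the notebook (the names `numeric_cols` and `summary` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "AnnualIncome": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "CreditScore": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "LoanAmount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "LoanTerm": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "Loan": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

numeric_cols = ["Age", "AnnualIncome", "CreditScore", "LoanAmount", "LoanTerm"]

# One row per class (0 = no default, 1 = default), one column per feature
summary = df.groupby("Loan")[numeric_cols].mean()
print(summary)
```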


**3.	How do credit scores and income influence predictions?**

In [None]:
X = df[["CreditScore", "AnnualIncome"]]
y = df["Loan"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)

In [None]:
df["Pred_using_CS_Income"] = knn.predict(X_scaled)
df[["CreditScore","AnnualIncome","Loan","Pred_using_CS_Income"]]
print(df)
print("Model correctly flags low credit score customers as high-risk")

   Age  AnnualIncome  CreditScore  LoanAmount  LoanTerm EmploymentType  Loan  \
0   28           6.5          720           5         5       Salaried     0   
1   45          12.0          680          10        10  Self-Employed     1   
2   35           8.0          750           6         7       Salaried     0   
3   50          15.0          640          12        15  Self-Employed     1   
4   30           7.0          710           5         5       Salaried     0   
5   42          10.0          660           9        10       Salaried     1   
6   26           5.5          730           4         4       Salaried     0   
7   48          14.0          650          11        12  Self-Employed     1   
8   38           9.0          700           7         8       Salaried     0   
9   55          16.0          620          13        15  Self-Employed     1   

   Loan_To_Income  Pred_using_CS_Income  
0        0.769231                     0  
1        0.833333                  
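
To see how credit score and income drive a prediction for someone outside the training data, the two-feature model can be applied to new applicants. A sketch under stated assumptions: both applicants below are hypothetical, and the pipeline is one convenient way to reuse the fitted scaler, not necessarily how the notebook would do it.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "CreditScore": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "AnnualIncome": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "Loan": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
features = ["CreditScore", "AnnualIncome"]

# Scale, then classify by the 3 nearest neighbours in the scaled feature space
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(df[features], df["Loan"])

# Hypothetical applicants: one with a low credit score, one with a high credit score
applicants = pd.DataFrame({"CreditScore": [645, 740], "AnnualIncome": [13.0, 6.0]})
applicants["PredictedRisk"] = model.predict(applicants[features])
print(applicants)
```

In this dataset the low-score applicant sits among customers who defaulted and the high-score applicant among customers who did not, so the predictions follow the labels of those neighbours.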

**4.	Suggest banking policies based on model output**

In [None]:
df = pd.DataFrame(data)
le = LabelEncoder()
df['EmploymentType'] = le.fit_transform(df['EmploymentType'])
X = df.drop(["Loan"],axis=1)
y=df['Loan']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)



In [None]:
df["PredictedRisk"] = knn.predict(X_scaled)
df["RiskCategory"] = np.where(df["PredictedRisk"] == 1, "High Risk", "Low Risk")
df[["CreditScore","AnnualIncome","LoanAmount","PredictedRisk","RiskCategory"]]

Unnamed: 0,CreditScore,AnnualIncome,LoanAmount,PredictedRisk,RiskCategory
0,720,6.5,5,0,Low Risk
1,680,12.0,10,1,High Risk
2,750,8.0,6,0,Low Risk
3,640,15.0,12,1,High Risk
4,710,7.0,5,0,Low Risk
5,660,10.0,9,1,High Risk
6,730,5.5,4,0,Low Risk
7,650,14.0,11,1,High Risk
8,700,9.0,7,0,Low Risk
9,620,16.0,13,1,High Risk


In [None]:
# Policy 1: Loan Approval Rules:

def approval_policy(row):
  if row["PredictedRisk"] == 1:
    return "Conditional / Manual Review"
  else:
    return "Approved"

df["ApprovalDecision"] = df.apply(approval_policy, axis=1)
print(df)

   Age  AnnualIncome  CreditScore  LoanAmount  LoanTerm EmploymentType  Loan  \
0   28           6.5          720           5         5       Salaried     0   
1   45          12.0          680          10        10  Self-Employed     1   
2   35           8.0          750           6         7       Salaried     0   
3   50          15.0          640          12        15  Self-Employed     1   
4   30           7.0          710           5         5       Salaried     0   
5   42          10.0          660           9        10       Salaried     1   
6   26           5.5          730           4         4       Salaried     0   
7   48          14.0          650          11        12  Self-Employed     1   
8   38           9.0          700           7         8       Salaried     0   
9   55          16.0          620          13        15  Self-Employed     1   

   PredictedRisk RiskCategory             ApprovalDecision  
0              0     Low Risk                     Approved

In [None]:
# Policy 2: Interest Rate Pricing (Risk-Based):
def interest_rates(row):
  if row["PredictedRisk"] == 1:
    return "Base Rate + 3%"
  else:
    return "Base Rate"

df["InterestPolicy"] = df.apply(interest_rates, axis=1)
print(df)


   Age  AnnualIncome  CreditScore  LoanAmount  LoanTerm EmploymentType  Loan  \
0   28           6.5          720           5         5       Salaried     0   
1   45          12.0          680          10        10  Self-Employed     1   
2   35           8.0          750           6         7       Salaried     0   
3   50          15.0          640          12        15  Self-Employed     1   
4   30           7.0          710           5         5       Salaried     0   
5   42          10.0          660           9        10       Salaried     1   
6   26           5.5          730           4         4       Salaried     0   
7   48          14.0          650          11        12  Self-Employed     1   
8   38           9.0          700           7         8       Salaried     0   
9   55          16.0          620          13        15  Self-Employed     1   

   PredictedRisk RiskCategory             ApprovalDecision  InterestPolicy  
0              0     Low Risk             

In [None]:
# Policy 3: Loan amount capping:

def loan_cap(row):
  if row["PredictedRisk"] == 1:
    return row["LoanAmount"] * 0.7
  else:
    return row["LoanAmount"]

df["ApprovedLoanAmount"] = df.apply(loan_cap, axis=1)
print(df)

Unnamed: 0,Age,AnnualIncome,CreditScore,LoanAmount,LoanTerm,EmploymentType,Loan,PredictedRisk,RiskCategory,ApprovalDecision,InterestPolicy,ApprovedLoanAmount
0,28,6.5,720,5,5,Salaried,0,0,Low Risk,Approved,Base Rate,5.0
1,45,12.0,680,10,10,Self-Employed,1,1,High Risk,Conditional / Manual Review,Base Rate + 3%,7.0
2,35,8.0,750,6,7,Salaried,0,0,Low Risk,Approved,Base Rate,6.0
3,50,15.0,640,12,15,Self-Employed,1,1,High Risk,Conditional / Manual Review,Base Rate + 3%,8.4
4,30,7.0,710,5,5,Salaried,0,0,Low Risk,Approved,Base Rate,5.0
5,42,10.0,660,9,10,Salaried,1,1,High Risk,Conditional / Manual Review,Base Rate + 3%,6.3
6,26,5.5,730,4,4,Salaried,0,0,Low Risk,Approved,Base Rate,4.0
7,48,14.0,650,11,12,Self-Employed,1,1,High Risk,Conditional / Manual Review,Base Rate + 3%,7.7
8,38,9.0,700,7,8,Salaried,0,0,Low Risk,Approved,Base Rate,7.0
9,55,16.0,620,13,15,Self-Employed,1,1,High Risk,Conditional / Manual Review,Base Rate + 3%,9.1


In [None]:
# Policy 4 : Tenure Restriction

def tenure_policy(row):
  if row["PredictedRisk"] == 1 and row["LoanTerm"] > 10:
    return 10
  else:
    return row["LoanTerm"]

df["ApprovedTenure"] = df.apply(tenure_policy, axis=1)
print(df)

   Age  AnnualIncome  CreditScore  LoanAmount  LoanTerm EmploymentType  Loan  \
0   28           6.5          720           5         5       Salaried     0   
1   45          12.0          680          10        10  Self-Employed     1   
2   35           8.0          750           6         7       Salaried     0   
3   50          15.0          640          12        15  Self-Employed     1   
4   30           7.0          710           5         5       Salaried     0   
5   42          10.0          660           9        10       Salaried     1   
6   26           5.5          730           4         4       Salaried     0   
7   48          14.0          650          11        12  Self-Employed     1   
8   38           9.0          700           7         8       Salaried     0   
9   55          16.0          620          13        15  Self-Employed     1   

   PredictedRisk RiskCategory             ApprovalDecision  InterestPolicy  \
0              0     Low Risk            

In [None]:
# Policy 5 : Collateral & Guarantor requirement

def collat_policy(row):
  if row["PredictedRisk"] == 1 :
    return "Collateral / Guarantor required"
  else:
    return "Collateral / Guarantor not required"

df["CollateralPolicy"] = df.apply(collat_policy, axis=1)
print(df)
df

   Age  AnnualIncome  CreditScore  LoanAmount  LoanTerm EmploymentType  Loan  \
0   28           6.5          720           5         5       Salaried     0   
1   45          12.0          680          10        10  Self-Employed     1   
2   35           8.0          750           6         7       Salaried     0   
3   50          15.0          640          12        15  Self-Employed     1   
4   30           7.0          710           5         5       Salaried     0   
5   42          10.0          660           9        10       Salaried     1   
6   26           5.5          730           4         4       Salaried     0   
7   48          14.0          650          11        12  Self-Employed     1   
8   38           9.0          700           7         8       Salaried     0   
9   55          16.0          620          13        15  Self-Employed     1   

   PredictedRisk RiskCategory             ApprovalDecision  InterestPolicy  \
0              0     Low Risk            

Unnamed: 0,Age,AnnualIncome,CreditScore,LoanAmount,LoanTerm,EmploymentType,Loan,PredictedRisk,RiskCategory,ApprovalDecision,InterestPolicy,ApprovedLoanAmount,ApprovedTenure,CollateralPolicy
0,28,6.5,720,5,5,Salaried,0,0,Low Risk,Approved,Base Rate,5.0,5,Collateral / Guarantor not required
1,45,12.0,680,10,10,Self-Employed,1,1,High Risk,Conditional / Manual Review,Base Rate + 3%,7.0,10,Collateral / Guarantor required
2,35,8.0,750,6,7,Salaried,0,0,Low Risk,Approved,Base Rate,6.0,7,Collateral / Guarantor not required
3,50,15.0,640,12,15,Self-Employed,1,1,High Risk,Conditional / Manual Review,Base Rate + 3%,8.4,10,Collateral / Guarantor required
4,30,7.0,710,5,5,Salaried,0,0,Low Risk,Approved,Base Rate,5.0,5,Collateral / Guarantor not required
5,42,10.0,660,9,10,Salaried,1,1,High Risk,Conditional / Manual Review,Base Rate + 3%,6.3,10,Collateral / Guarantor required
6,26,5.5,730,4,4,Salaried,0,0,Low Risk,Approved,Base Rate,4.0,4,Collateral / Guarantor not required
7,48,14.0,650,11,12,Self-Employed,1,1,High Risk,Conditional / Manual Review,Base Rate + 3%,7.7,10,Collateral / Guarantor required
8,38,9.0,700,7,8,Salaried,0,0,Low Risk,Approved,Base Rate,7.0,8,Collateral / Guarantor not required
9,55,16.0,620,13,15,Self-Employed,1,1,High Risk,Conditional / Manual Review,Base Rate + 3%,9.1,10,Collateral / Guarantor required
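
The five policies above all branch on `PredictedRisk`, so they can be applied in one pass. A sketch of a vectorised version using `np.where` instead of row-wise `apply`; the function name `apply_policies` and the two-row `scored` frame are hypothetical, while the thresholds (70% loan cap, 10-year tenure cap, 3% rate premium) are the ones used in the cells above.

```python
import numpy as np
import pandas as pd

def apply_policies(scored):
    """Return a copy of `scored` with one column per policy (hypothetical helper)."""
    out = scored.copy()
    high_risk = out["PredictedRisk"] == 1

    out["ApprovalDecision"] = np.where(high_risk, "Conditional / Manual Review", "Approved")
    out["InterestPolicy"] = np.where(high_risk, "Base Rate + 3%", "Base Rate")
    # High-risk customers are approved for 70% of the requested amount
    out["ApprovedLoanAmount"] = np.where(high_risk, out["LoanAmount"] * 0.7, out["LoanAmount"])
    # High-risk customers are limited to a 10-year tenure
    out["ApprovedTenure"] = np.where(high_risk, out["LoanTerm"].clip(upper=10), out["LoanTerm"])
    out["CollateralPolicy"] = np.where(
        high_risk, "Collateral / Guarantor required", "Collateral / Guarantor not required"
    )
    return out

# Two illustrative customers: one low risk, one high risk
scored = pd.DataFrame({"LoanAmount": [5, 12], "LoanTerm": [5, 15], "PredictedRisk": [0, 1]})
result = apply_policies(scored)
print(result)
```

Keeping the rules in one function makes it easier to change a threshold in a single place and re-run it on any scored DataFrame.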


**5.	Compare KNN with Decision Trees for this problem.**

In [None]:
# KNN Model

df = pd.DataFrame(data)
le = LabelEncoder()
df['EmploymentType'] = le.fit_transform(df['EmploymentType'])
X = df.drop("Loan", axis=1)
y = df["Loan"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)
df["KNN_Prediction"] = knn.predict(X_scaled)
print(df)

Unnamed: 0,Age,AnnualIncome,CreditScore,LoanAmount,LoanTerm,EmploymentType,Loan,KNN_Prediction
0,28,6.5,720,5,5,0,0,0
1,45,12.0,680,10,10,1,1,1
2,35,8.0,750,6,7,0,0,0
3,50,15.0,640,12,15,1,1,1
4,30,7.0,710,5,5,0,0,0
5,42,10.0,660,9,10,0,1,1
6,26,5.5,730,4,4,0,0,0
7,48,14.0,650,11,12,1,1,1
8,38,9.0,700,7,8,0,0,0
9,55,16.0,620,13,15,1,1,1


In [None]:
# Decision Tree Model:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=3,random_state=42)
dt.fit(X_scaled, y)
df["DT_Prediction"] = dt.predict(X_scaled)
print(df)

Unnamed: 0,Age,AnnualIncome,CreditScore,LoanAmount,LoanTerm,EmploymentType,Loan,KNN_Prediction,DT_Prediction
0,28,6.5,720,5,5,0,0,0,0
1,45,12.0,680,10,10,1,1,1,1
2,35,8.0,750,6,7,0,0,0,0
3,50,15.0,640,12,15,1,1,1,1
4,30,7.0,710,5,5,0,0,0,0
5,42,10.0,660,9,10,0,1,1,1
6,26,5.5,730,4,4,0,0,0,0
7,48,14.0,650,11,12,1,1,1,1
8,38,9.0,700,7,8,0,0,0,0
9,55,16.0,620,13,15,1,1,1,1


In [None]:
print("Both models correctly identify low credit score + high loan as risk")

Both models correctly identify low credit score + high loan as risk
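
Both models agree on the training rows, so the comparison above cannot separate them. A sketch of two further angles, assuming leave-one-out accuracy is a reasonable yardstick for 10 rows: held-out accuracy for each model, and the decision tree's rules printed with `export_text`, which KNN has no equivalent for. The names `models`, `loo_accuracy`, and `rules` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "AnnualIncome": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "CreditScore": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "LoanAmount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "LoanTerm": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "EmploymentType": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],  # 1 = Self-Employed
    "Loan": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X = df.drop(columns="Loan")
y = df["Loan"]

models = {
    # KNN is distance-based, so it needs scaling; the tree does not
    "KNN (k=3, scaled)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    "Decision tree (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=42),
}

# Mean accuracy over 10 folds, each holding out a single customer
loo_accuracy = {
    name: cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
    for name, model in models.items()
}
print(loo_accuracy)

# A fitted tree can be read as explicit if/else rules on the original feature values
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```

The printed rules are the main practical difference: a tree yields thresholds a credit officer can read, whereas KNN can only point to similar past customers.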


**6.	What happens if Loan Amount dominates distance calculation?**

In [None]:
# KNN without scaling: features with the largest numeric range dominate the distance

df = pd.DataFrame(data)
le = LabelEncoder()
df['EmploymentType'] = le.fit_transform(df['EmploymentType'])
X = df.drop("Loan", axis=1)
y = df["Loan"]

knn_raw = KNeighborsClassifier(n_neighbors=3)
knn_raw.fit(X, y)

df["Pred_No_Scaling"] = knn_raw.predict(X)
print(df)
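print("KNN without scaling: the widest-range feature dominates the distance (CreditScore here; LoanAmount if it were recorded in rupees instead of lakhs)")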

   Age  AnnualIncome  CreditScore  LoanAmount  LoanTerm  EmploymentType  Loan  \
0   28           6.5          720           5         5               0     0   
1   45          12.0          680          10        10               1     1   
2   35           8.0          750           6         7               0     0   
3   50          15.0          640          12        15               1     1   
4   30           7.0          710           5         5               0     0   
5   42          10.0          660           9        10               0     1   
6   26           5.5          730           4         4               0     0   
7   48          14.0          650          11        12               1     1   
8   38           9.0          700           7         8               0     0   
9   55          16.0          620          13        15               1     1   

   Pred_No_Scaling  
0                0  
1                1  
2                0  
3                1  
4  

In [None]:
# Fix : Feature Scaling (Equal Influence)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

knn_scaled = KNeighborsClassifier(n_neighbors=3)
knn_scaled.fit(X_scaled, y)
df["Pred_With_Scaling"] = knn_scaled.predict(X_scaled)
df

Unnamed: 0,Age,AnnualIncome,CreditScore,LoanAmount,LoanTerm,EmploymentType,Loan,Pred_No_Scaling,Pred_With_Scaling
0,28,6.5,720,5,5,0,0,0,0
1,45,12.0,680,10,10,1,1,1,1
2,35,8.0,750,6,7,0,0,0,0
3,50,15.0,640,12,15,1,1,1,1
4,30,7.0,710,5,5,0,0,0,0
5,42,10.0,660,9,10,0,1,1,1
6,26,5.5,730,4,4,0,0,0,0
7,48,14.0,650,11,12,1,1,1,1
8,38,9.0,700,7,8,0,0,0,0
9,55,16.0,620,13,15,1,1,1,1


In [None]:
# Compare Predictions:
df[["CreditScore","AnnualIncome","LoanAmount","Loan","Pred_No_Scaling","Pred_With_Scaling"]]

Unnamed: 0,CreditScore,AnnualIncome,LoanAmount,Loan,Pred_No_Scaling,Pred_With_Scaling
0,720,6.5,5,0,0,0
1,680,12.0,10,1,1,1
2,750,8.0,6,0,0,0
3,640,15.0,12,1,1,1
4,710,7.0,5,0,0,0
5,660,10.0,9,1,1,1
6,730,5.5,4,0,0,0
7,650,14.0,11,1,1,1
8,700,9.0,7,0,0,0
9,620,16.0,13,1,1,1


In [None]:
print("If loan amount dominates distance calculation in KNN, the model becomes biased towards loan size and ignores critical indicators like credit scores")

If loan amount dominates distance calculation in KNN, the model becomes biased towards loan size and ignores critical indicators like credit scores
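
The predictions with and without scaling are identical on this dataset, so the tables above do not show domination directly. A sketch that measures it instead: the share of the squared Euclidean distance between two customers that each feature contributes. The helper `distance_shares` is hypothetical, and the rupee conversion (1 lakh = 100,000) is an illustrative assumption about how LoanAmount might be stored.

```python
import pandas as pd

X = pd.DataFrame({
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "AnnualIncome": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "CreditScore": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "LoanAmount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "LoanTerm": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "EmploymentType": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],  # 1 = Self-Employed
})

def distance_shares(features, i, j):
    """Fraction of the squared Euclidean distance between rows i and j per feature."""
    squared_diff = (features.iloc[i] - features.iloc[j]) ** 2
    return squared_diff / squared_diff.sum()

# Raw units as in the notebook (amounts in lakhs)
raw_share = distance_shares(X, 0, 1)

# Same data, but LoanAmount expressed in rupees
X_rupees = X.assign(LoanAmount=X["LoanAmount"] * 100_000)
rupee_share = distance_shares(X_rupees, 0, 1)

# Standardised features (zero mean, unit variance), equivalent to StandardScaler
X_scaled = (X - X.mean()) / X.std(ddof=0)
scaled_share = distance_shares(X_scaled, 0, 1)

print(pd.DataFrame({"raw": raw_share, "rupees": rupee_share, "scaled": scaled_share}).round(3))
```

In the raw lakh units CreditScore carries most of the distance between customers 0 and 1; once LoanAmount is expressed in rupees it accounts for almost all of it, and after standardisation no single feature accounts for even half.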


**7.	Should KNN be used in real-time loan approval systems?**

In [None]:
import time

import numpy as np

sizes = [1_000, 10_000, 50_000]

for n in sizes:
  # Synthetic data: 6 random features and random binary labels
  X = np.random.rand(n, 6)
  y = np.random.randint(0, 2, n)

  scaler = StandardScaler()
  X_scaled = scaler.fit_transform(X)

  knn = KNeighborsClassifier(n_neighbors=5)
  knn.fit(X_scaled, y)

  new_customer = scaler.transform(np.random.rand(1, 6))

  # Time a single prediction; perf_counter is a monotonic, high-resolution clock
  start = time.perf_counter()
  knn.predict(new_customer)
  elapsed = time.perf_counter() - start

  print(f"Records: {n}, Prediction time: {elapsed:.6f} sec")


In [None]:
print("KNN can only explain a rejection as 'similar past customers defaulted', and every prediction must search the stored training records, so KNN is not well suited to real-time loan approval systems.")

KNN can only explain a rejection as 'similar past customers defaulted', and every prediction must search the stored training records, so KNN is not well suited to real-time loan approval systems.
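
The "similar customers defaulted" explanation can at least be made concrete by listing the neighbours behind a decision. A minimal sketch using `KNeighborsClassifier.kneighbors`; the applicant below is hypothetical, and the names `neighbours` and `prediction` are illustrative.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "AnnualIncome": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "CreditScore": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "LoanAmount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "LoanTerm": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "EmploymentType": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],  # 1 = Self-Employed
    "Loan": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
features = ["Age", "AnnualIncome", "CreditScore", "LoanAmount", "LoanTerm", "EmploymentType"]

scaler = StandardScaler().fit(df[features])
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(df[features]), df["Loan"])

# Hypothetical self-employed applicant
applicant = pd.DataFrame([{
    "Age": 47, "AnnualIncome": 13.0, "CreditScore": 655,
    "LoanAmount": 11, "LoanTerm": 12, "EmploymentType": 1,
}])[features]
applicant_scaled = scaler.transform(applicant)

prediction = knn.predict(applicant_scaled)[0]

# Distances and row positions of the 3 closest training customers, nearest first
distances, indices = knn.kneighbors(applicant_scaled)
neighbours = df.iloc[indices[0]].assign(Distance=distances[0])
print("Predicted risk:", prediction)
print(neighbours)
```

Listing the neighbours shows what the decision rests on, but it still gives the applicant no rule or threshold to act on, which is the limitation noted above.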
