# Alternative Models: KNN


We have already explored logistic regression and found that it reached an accuracy of 86%. While this is a strong result, we don't yet know how it compares to other candidate models. For this reason, this part explores using a k-nearest neighbors (KNN) classifier to predict whether someone would be approved for a loan.

### Setup and Goals
The goal of this part is to train a KNN classifier and compare it to our regularized logistic regression model from part 2. We will use the same train/test split and preprocessing for a fair comparison, and we will primarily use accuracy to compare the models' performance.

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv("data/cleaned_data.csv")
df.head(5)

Unnamed: 0,age,occupation_status,years_employed,annual_income,credit_score,credit_history_years,savings_assets,current_debt,defaults_on_file,delinquencies_last_2yrs,derogatory_marks,product_type,loan_intent,loan_amount,interest_rate,debt_to_income_ratio,loan_to_income_ratio,payment_to_income_ratio,loan_status
0,40,Employed,17.2,25579,692,5.3,895,10820,0,0,0,Credit Card,Business,600,17.02,0.423,0.023,0.008,1
1,33,Employed,7.3,43087,627,3.5,169,16550,0,1,0,Personal Loan,Home Improvement,53300,14.1,0.384,1.237,0.412,0
2,42,Student,1.1,20840,689,8.4,17,7852,0,0,0,Credit Card,Debt Consolidation,2100,18.33,0.377,0.101,0.034,1
3,53,Student,0.5,29147,692,9.8,1480,11603,0,1,0,Credit Card,Business,2900,18.74,0.398,0.099,0.033,1
4,32,Employed,12.5,63657,630,7.2,209,12424,0,0,0,Personal Loan,Education,99600,13.92,0.195,1.565,0.522,1


### One-Hot Encoding

In [3]:
df = pd.get_dummies(df, columns = ['loan_intent', 'product_type', 'occupation_status'], drop_first = True)
df.columns

Index(['age', 'years_employed', 'annual_income', 'credit_score',
       'credit_history_years', 'savings_assets', 'current_debt',
       'defaults_on_file', 'delinquencies_last_2yrs', 'derogatory_marks',
       'loan_amount', 'interest_rate', 'debt_to_income_ratio',
       'loan_to_income_ratio', 'payment_to_income_ratio', 'loan_status',
       'loan_intent_Debt Consolidation', 'loan_intent_Education',
       'loan_intent_Home Improvement', 'loan_intent_Medical',
       'loan_intent_Personal', 'product_type_Line of Credit',
       'product_type_Personal Loan', 'occupation_status_Self-Employed',
       'occupation_status_Student'],
      dtype='object')
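
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration (the category names are borrowed from the data preview); it is not part of the original pipeline.

```python
import pandas as pd

# Toy frame (hypothetical); category names mirror those in the data preview.
toy = pd.DataFrame({
    "product_type": ["Credit Card", "Personal Loan", "Line of Credit", "Credit Card"],
    "loan_amount": [600, 53300, 2100, 2900],
})

encoded = pd.get_dummies(toy, columns=["product_type"], drop_first=True)

# "Credit Card" sorts first, so it becomes the dropped baseline category:
# a row where both dummy columns are False/0 is a Credit Card row.
print(encoded.columns.tolist())
print(encoded)
```

Dropping one level per categorical variable avoids redundant columns, which is why `product_type_Credit Card` does not appear in the column list above.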

### Train/Test split

In [4]:
X = df.drop(columns=['loan_status'])
y = df['loan_status']

X_train, X_test = X.iloc[:-10000], X.iloc[-10000:]
y_train, y_test = y.iloc[:-10000], y.iloc[-10000:]


### Standardize features

In [5]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
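
KNN relies on distances between rows, so a feature on a large scale (such as income) would otherwise dominate one on a small scale (such as a ratio). The sketch below uses synthetic stand-in data (an assumption, not the loan data) to show that the scaler learns its statistics from the training set only and reuses them on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins: column 0 is income-like (large scale), column 1 is ratio-like (small scale).
X_train_toy = np.column_stack([rng.normal(50_000, 15_000, 500), rng.normal(0.3, 0.1, 500)])
X_test_toy = np.column_stack([rng.normal(50_000, 15_000, 100), rng.normal(0.3, 0.1, 100)])

scaler_toy = StandardScaler()
train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns mean/std from the training rows only
test_scaled = scaler_toy.transform(X_test_toy)        # reuses the training statistics

# Training columns end up with mean 0 and standard deviation 1.
print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
# Test columns are close to, but generally not exactly, mean 0.
print(test_scaled.mean(axis=0).round(3))
```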

### Find optimal k value

In [6]:
k_values = range(3, 21, 2)
results = []

# iterate over each considered k value
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    knn.fit(X_train_scaled, y_train)

    y_pred = knn.predict(X_test_scaled)

    acc = accuracy_score(y_test, y_pred)

    results.append((k, acc))

results_df = pd.DataFrame(results, columns=['k', 'accuracy'])
results_df.head()

Unnamed: 0,k,accuracy
0,3,0.8542
1,5,0.8616
2,7,0.8664
3,9,0.8646
4,11,0.8667
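
The loop above scores each k on the test set, so the same 10,000 rows are used both to tune k and to report final accuracy, which can make the reported figure slightly optimistic. One possible alternative, sketched here on synthetic data (the dataset and the names `pipe`, `search`, and `cv_best_k` are assumptions for illustration), is to choose k by cross-validation on the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (illustration only).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled
# using only the rows that fold trains on.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(weights="distance"))
param_grid = {"kneighborsclassifier__n_neighbors": list(range(3, 21, 2))}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

cv_best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("k chosen by 5-fold CV:", cv_best_k)
print("mean CV accuracy:", round(search.best_score_, 4))
```

With this approach the test set would be touched only once, to evaluate the model refit with the chosen k.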


In [7]:
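# extract the k value with the best accuracy
best_row = results_df.loc[results_df['accuracy'].idxmax()]
best_k = int(best_row['k'])
# report best_k rather than the loop variable k, which always holds the last value tried
print("optimal k value: ", best_k, " with accuracy: ", best_row['accuracy'])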

optimal k value:  19  with accuracy: 


### Fitting final model with the optimal k

In [8]:
knn = KNeighborsClassifier(n_neighbors=best_k, weights="distance")
knn.fit(X_train_scaled, y_train)

y_pred = knn.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)

print("model accuracy: ", accuracy)

model accuracy:  0.8681
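
An accuracy figure is easier to interpret next to a trivial baseline and a confusion matrix. The sketch below uses small hypothetical label arrays (`y_true_demo` and `y_pred_demo` are made up, not results from this notebook) to show how both could be computed.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, not results from this notebook.
y_true_demo = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
y_pred_demo = np.array([1, 1, 0, 0, 1, 1, 1, 1, 0, 1])

# A baseline that always predicts the most frequent class; features are ignored.
dummy_X = np.zeros((len(y_true_demo), 1))
baseline = DummyClassifier(strategy="most_frequent").fit(dummy_X, y_true_demo)
baseline_acc = accuracy_score(y_true_demo, baseline.predict(dummy_X))

model_acc = accuracy_score(y_true_demo, y_pred_demo)
cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true class, columns: predicted class

print("baseline accuracy:", baseline_acc)
print("model accuracy:   ", model_acc)
print(cm)
```

In the notebook, the same calls could be applied to `y_test` and `y_pred` to see how far 0.8681 sits above the majority-class rate.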


### Takeaways

In the end, the test accuracy of our KNN model (0.868) is essentially the same as that of our logistic regression model (0.868). This suggests there is no meaningful predictive gain from using KNN on this data. Note also that k was selected using the same test set, so the KNN figure is, if anything, slightly optimistic. In practice, the logistic regression model provides clear, interpretable coefficients, which let us directly assess how each factor influences a person's loan approval odds. Because KNN is instance-based and has no fitted coefficients, it offers little direct insight into feature importance.
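
Although KNN exposes no coefficients, a model-agnostic technique such as permutation importance can give a rough view of which features its predictions depend on. The following is a minimal sketch on synthetic data (the dataset, the choice of k=11, and all variable names are assumptions for illustration), not an analysis of the loan model.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustration only).
X_pi, y_pi = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_pi, y_pi, test_size=0.3, random_state=0)

scaler_pi = StandardScaler().fit(X_tr)
knn_pi = KNeighborsClassifier(n_neighbors=11, weights="distance")
knn_pi.fit(scaler_pi.transform(X_tr), y_tr)

# Shuffle one feature at a time on held-out data and measure the drop in accuracy.
result = permutation_importance(
    knn_pi, scaler_pi.transform(X_te), y_te,
    scoring="accuracy", n_repeats=10, random_state=0,
)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

These scores describe how much the model relies on each feature, not the direction of its effect, so they complement rather than replace the logistic regression coefficients.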