K-Nearest Neighbours Analysis

Introduction

This notebook reproduces the KNN credit score algorithm analysis by Willian Trindade Leite and Gui Gui.


Steps taken:
    Prepare the data
    Split the data into train and test sets using the original split settings
    Evaluate the model results
    Repeat the steps above using a different split of the data
    Compare the results between the original and our data set

In [16]:
# Import necessary libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import seaborn as sns

In [17]:
# Read data from .csv file
data = pd.read_csv("/Users/elgun/Desktop/Reproducible-Research_Project/data/Credit.csv")
credit = data.copy()

In [18]:
# Define the feature columns and the target column
features = ['duration', 'amount', 'installment', 'residence', 'age', 'cards', 'liable']
target = 'Default'

In [19]:
# Extract the feature matrix and target vector
X = data[features]
y = data[target]

In [20]:
# Standardize the numerical variables
# (note: the scaler is fit on the full data set, before the train/test split)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [21]:
# Split the data into training and testing sets with a 75% training split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, train_size=0.75, random_state=42)
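
Because the scaler above is fit on all rows before splitting, the test rows influence the means and standard deviations used for scaling. A minimal sketch of the alternative, fitting the scaler on the training rows only, is shown below. It uses synthetic data as a stand-in for `Credit.csv`; the names `X_demo`, `y_demo`, `demo_scaler` and the random values are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (hypothetical values, 7 features)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 7))
y_demo = rng.integers(0, 2, size=1000)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training features have mean 0 and standard deviation 1 by construction, while the test features are only approximately standardized, which mirrors how unseen data would be treated.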


In [22]:
# Display the shapes of the resulting datasets
print(f'X_train shape: {X_train.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_test shape: {y_test.shape}')

X_train shape: (750, 7)
y_train shape: (750,)
X_test shape: (250, 7)
y_test shape: (250,)


In [23]:
# Candidate values of k for the K-Nearest Neighbors (KNN) analysis
k_values = [1, 3, 5, 7, 9, 11, 13, 15, 17]

In [24]:
# Initialize a dictionary to store the models and results
knn_models = {}
pcp_values = {}

In [25]:
# Evaluate KNN for different values of k
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)  # Train the model
    knn_models[k] = knn  # Store the trained model
    y_pred = knn.predict(X_test)  # Predict on the test set
    pcp_values[k] = accuracy_score(y_test, y_pred)  # Compute accuracy
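
The PCP stored in the loop above is the share of test rows whose predicted label matches the true label. A small sketch with made-up labels (the arrays `y_true_demo` and `y_pred_demo` are hypothetical) shows that computing this share by hand agrees with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for five observations
y_true_demo = np.array([0, 1, 1, 0, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1])

# Proportion of correct classifications: number of matches divided by total
pcp_manual = np.mean(y_true_demo == y_pred_demo)
pcp_sklearn = accuracy_score(y_true_demo, y_pred_demo)

print(pcp_manual, pcp_sklearn)
```

Four of the five hypothetical predictions match, so both values come to 0.8.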

In [26]:
# Printing proportion of correct classifications for each k
print("Proportion of Correct Classifications (PCP):")
for k, pcp in pcp_values.items():
    print(f"K={k}: {pcp:.5f}")

Proportion of Correct Classifications (PCP):
K=1: 0.70800
K=3: 0.67200
K=5: 0.66400
K=7: 0.66400
K=9: 0.68400
K=11: 0.69600
K=13: 0.68800
K=15: 0.69200
K=17: 0.70400


In [27]:
# Cross-validation results
cv_results = {}

In [28]:
# Evaluate each k with 10-fold cross-validation on the full scaled data set
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores = cross_val_score(knn, X_scaled, y, cv=10)  # 10-fold cross-validation
    cv_results[k] = cv_scores.mean()
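
In the loop above, every fold sees features that were scaled using all rows. One way to keep the scaling inside each fold is to wrap the scaler and the classifier in a pipeline, so that `cross_val_score` refits both on the training part of each fold. The sketch below assumes synthetic data from `make_classification` in place of the credit data and a shorter list of k values; it is an illustration, not the original authors' procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (hypothetical, 7 features)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, n_informative=4, random_state=42
)

cv_demo = {}
for k in [1, 5, 9, 13, 17]:
    # The scaler is refit on the training part of every fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=10)
    cv_demo[k] = scores.mean()

for k, score in cv_demo.items():
    print(f"K={k}: {score:.5f}")
```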

In [29]:
# Print cross-validation results
print("Cross-Validation Accuracy (Mean):")
for k, cv_score in cv_results.items():
    print(f"K={k}: {cv_score:.5f}")

Cross-Validation Accuracy (Mean):
K=1: 0.62700
K=3: 0.66500
K=5: 0.65600
K=7: 0.66400
K=9: 0.67500
K=11: 0.68100
K=13: 0.68300
K=15: 0.69600
K=17: 0.69500
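
To make the comparison concrete, the two sets of printed accuracies can be placed side by side. The sketch below re-enters the values printed above by hand (the names `k_list`, `holdout` and `cv_mean` are illustrative) and reports the best k under each evaluation.

```python
# Accuracies copied from the printed output above
k_list = [1, 3, 5, 7, 9, 11, 13, 15, 17]
holdout = dict(zip(k_list, [0.708, 0.672, 0.664, 0.664, 0.684, 0.696, 0.688, 0.692, 0.704]))
cv_mean = dict(zip(k_list, [0.627, 0.665, 0.656, 0.664, 0.675, 0.681, 0.683, 0.696, 0.695]))

# k with the highest accuracy under each evaluation scheme
best_holdout = max(holdout, key=holdout.get)
best_cv = max(cv_mean, key=cv_mean.get)

print(f"Best k (hold-out): {best_holdout} ({holdout[best_holdout]:.3f})")
print(f"Best k (10-fold CV): {best_cv} ({cv_mean[best_cv]:.3f})")

# Difference between hold-out and cross-validated accuracy for each k
for k in k_list:
    print(f"K={k}: {holdout[k] - cv_mean[k]:+.3f}")
```

K=1 scores highest on the single hold-out split but lowest under cross-validation, which suggests that the hold-out figure for K=1 may depend heavily on the particular split.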


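The results are similar, but not identical, to those of the original model. Across the tested values of k, hold-out accuracy ranges from 0.664 to 0.708, and mean 10-fold cross-validation accuracy ranges from 0.627 to 0.696.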