# Exercise 2: Classification system with KNN - To Loan or Not To Loan
**Oscar Savioz, Daniel Ribeiro Cabral & Bastien Veuthey**

## Imports

Import the libraries used in this notebook.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from scipy.stats import mode

## a. Getting started

### Data loading

The original dataset comes from Kaggle's [Loan Prediction](https://www.kaggle.com/ninzaami/loan-predication) problem. The provided dataset has already been preprocessed: some columns and invalid rows were removed. Pandas is used to read the CSV file.

In [2]:
data = pd.read_csv("loandata.csv")

Display the head of the data.

In [3]:
data.head()

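| Unnamed: 0 | Gender | Married | Education | TotalIncome | LoanAmount | CreditHistory | LoanStatus |
|---|---|---|---|---|---|---|---|
| 0 | Male | Yes | Graduate | 6091.0 | 128.0 | 1.0 | N |
| 1 | Male | Yes | Graduate | 3000.0 | 66.0 | 1.0 | Y |
| 2 | Male | Yes | Not Graduate | 4941.0 | 120.0 | 1.0 | Y |
| 3 | Male | No | Graduate | 6000.0 | 141.0 | 1.0 | Y |
| 4 | Male | Yes | Graduate | 9613.0 | 267.0 | 1.0 | Y |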


Columns of the dataset:
* **Gender:** Applicant gender (Male/Female)
* **Married:** Is the applicant married? (Yes/No)
* **Education:** Applicant education (Graduate/Not Graduate)
* **TotalIncome:** Applicant total income (sum of `ApplicantIncome` and `CoapplicantIncome` columns in the original dataset)
* **LoanAmount:** Loan amount in thousands
* **CreditHistory:** Credit history meets guidelines
* **LoanStatus** (Target)**:** Loan approved (Y/N)

### Data preprocessing

Define a list of categorical columns to encode.

In [4]:
categorical_columns = ["Gender", "Married", "Education", "LoanStatus"]

Encode the categorical columns using scikit-learn's [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html).

In [5]:
data[categorical_columns] = OrdinalEncoder().fit_transform(data[categorical_columns])
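
To make the encoding step concrete, here is a minimal sketch on a toy frame. The values in `toy` are hypothetical and only mimic two of the categorical columns; the point is that `OrdinalEncoder` numbers the labels of each column in sorted order, so `N` becomes 0 and `Y` becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame (hypothetical values) mimicking two of the categorical columns
toy = pd.DataFrame({
    "Married": ["Yes", "No", "Yes"],
    "LoanStatus": ["N", "Y", "Y"],
})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy)

# categories_ lists, per column, the labels in the order they are numbered
for column, categories in zip(toy.columns, encoder.categories_):
    print(column, {label: index for index, label in enumerate(categories)})
```

Inspecting `categories_` is a simple way to check which integer stands for which label before interpreting predictions.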

Split into `X` and `y`.

In [6]:
X = data.drop(columns="LoanStatus")
y = data.LoanStatus

Standardize the features using scikit-learn's [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [7]:
X[X.columns] = StandardScaler().fit_transform(X[X.columns])
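
Standardization rescales each feature to zero mean and unit variance, $z = (x - \mu) / \sigma$. The notebook fits the scaler on the whole dataset before splitting. The sketch below shows a common variation, not what the notebook does: the statistics are computed on the training rows only and then reused for the test rows, so the test data does not influence the scaling. The income values are hypothetical.

```python
import numpy as np

# Hypothetical income values, split into "train" and "test" rows
train = np.array([[3000.0], [4941.0], [6000.0], [9613.0]])
test = np.array([[6091.0]])

# Statistics are estimated on the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

With `StandardScaler`, the same idea would be to call `fit` on the training split and `transform` on both splits.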

Convert `y` to `int`.

In [8]:
y = y.astype(int)

Split dataset into train and test sets.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
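
The `stratify=y` argument asks `train_test_split` to keep the class proportions of `y` in both subsets. The following sketch illustrates this on hypothetical, imbalanced toy labels (`X_toy` and `y_toy` are made up for the example).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 positives, 30 negatives
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# Both subsets should keep roughly the 70/30 ratio of the full set
print(y_tr.mean(), y_te.mean())
```

Without stratification, a small test set could end up with a noticeably different class ratio, which makes accuracy harder to compare across runs.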

## b. Dummy classifier

Build a dummy classifier that makes its decisions randomly.

In [10]:
class DummyClassifier():
    
    def __init__(self):
        """
        Initialize the class.
        """
        self.x = None
        self.y = None
    
    def fit(self, X, y):
        """
        Fit the dummy classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas DataFrame of shape (n_samples,)
            Target values.
        """
        # Store the data only for API consistency; predictions ignore it
        self.x = X
        self.y = y
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array or Pandas DataFrame of shape (n_queries,)
            Class labels for each data sample.
        """
        preds = np.random.randint(2, size=len(X))
        return preds

Implement a function to evaluate the performance of a classification by computing the accuracy ($N_{correct}/N$).

In [11]:
def accuracy_score(y_true, y_pred):
    # Compare all y_true and y_pred values to check if they match, and sum them
    correct_predictions = sum(1 for true, pred in zip(y_true, y_pred) if true == pred)
    total_samples = len(y_true)

    # Compute accuracy
    return correct_predictions / total_samples
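
As an alternative sketch, the same ratio $N_{correct}/N$ can be computed with NumPy in a vectorized way. The name `accuracy_score_vectorized` is hypothetical and chosen to avoid clashing with the function above.

```python
import numpy as np

def accuracy_score_vectorized(y_true, y_pred):
    """Fraction of positions where the two label sequences agree."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same length")
    return float(np.mean(y_true == y_pred))

print(accuracy_score_vectorized([1, 0, 1, 1], [1, 1, 1, 0]))  # 2 of 4 match
```

The explicit shape check is a design choice: `zip` in the loop version silently stops at the shorter sequence, whereas this version raises an error on a length mismatch.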

Compute the performance of the dummy classifier using the provided test set.

In [12]:
dummy_classifier = DummyClassifier()
dummy_classifier.fit(X_train, y_train)
preds = dummy_classifier.predict(X_test)

print(f"Accuracy : {accuracy_score(y_test, preds)}")

Accuracy : 0.5729166666666666
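
Because the predictions are drawn uniformly at random, this score changes from run to run. For two classes it should hover around 0.5 whatever the class balance, and a single run on a small test set can land several points away. The simulation below is a rough illustration; the sample count (96) and the class ratio are assumptions, not values taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is repeatable

# Hypothetical test labels: 96 samples, about 70% in class 1
y_toy = (rng.random(96) < 0.7).astype(int)

# Accuracy of uniform random guesses over many repetitions
accuracies = np.array([
    np.mean(rng.integers(0, 2, size=len(y_toy)) == y_toy)
    for _ in range(2000)
])

print(f"mean={accuracies.mean():.3f}, std={accuracies.std():.3f}")
```

The spread of the simulated accuracies gives a sense of how far a KNN score must be from 0.5 before it clearly beats random guessing.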


## c. K-Nearest Neighbors classifier

Build a K-Nearest Neighbors classifier using the Euclidean distance and a simple majority voting criterion.

In [13]:
class KNNClassifier():
    
    def __init__(self, n_neighbors=3):
        """
        Initialize the class.
        
        Parameters
        ----------
        n_neighbors : int, default=3
            Number of neighbors to use by default.
        """
        self.n_neighbors = n_neighbors
        self.X_train = None
        self.y_train = None
    
    def fit(self, X, y):
        """
        Fit the k-nearest neighbors classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas DataFrame of shape (n_samples,)
            Target values.
        """
        self.X_train = np.array(X)
        self.y_train = np.array(y)

    @staticmethod
    def _euclidian_distance(a, b):
        """
        Utility function to compute the Euclidean distance.
        
        Parameters
        ----------
        a : Numpy array or Pandas DataFrame
            First operand.
        b : Numpy array or Pandas DataFrame
            Second operand.
        """
        return np.sqrt(np.sum((a-b)**2))
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array or Pandas DataFrame of shape (n_queries,)
            Class labels for each data sample.
        """
        X = np.array(X)
        y_preds = []

        # Loop on each test sample
        for item in X:
            item_distances = []
            # Loop on each train sample
            for item_train in self.X_train:
                # Compute euclidean distance of the test sample and the current train sample
                euclidean_dist = self._euclidian_distance(item_train, item)
                item_distances.append(euclidean_dist)

            # Sort distances (ascending) and get indexes of the k smallest values
            n_distances_indexes = np.argsort(item_distances)[:self.n_neighbors]
            y_train_labels = self.y_train[n_distances_indexes]
            # Add the most common class as the prediction for the test sample
            # keepdims=False (SciPy >= 1.9) makes mode return a scalar label
            y_preds.append(mode(y_train_labels, keepdims=False)[0])

        return np.array(y_preds)
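
The two nested loops above compute one distance at a time. As a sketch of an alternative, the same Euclidean distances and majority vote can be obtained with NumPy broadcasting. The function `knn_predict_vectorized` is a hypothetical helper, not part of the class above, and it assumes the labels are non-negative integers (which holds for the 0/1 labels used here).

```python
import numpy as np

def knn_predict_vectorized(X_train, y_train, X_query, n_neighbors=3):
    """Hypothetical helper: Euclidean KNN with majority vote via broadcasting."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=int)
    X_query = np.asarray(X_query, dtype=float)

    # Pairwise differences: shape (n_queries, n_train, n_features)
    diff = X_query[:, None, :] - X_train[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=2))

    # Indexes of the k closest training samples for each query
    nearest = np.argsort(distances, axis=1)[:, :n_neighbors]
    neighbor_labels = y_train[nearest]

    # Majority vote; argmax returns the smallest label when counts tie
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])

X_tr = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.9, 1.0]])
y_tr = np.array([0, 0, 1, 1, 1])
preds_toy = knn_predict_vectorized(X_tr, y_tr, [[0.0, 0.1], [1.0, 0.9]])
print(preds_toy)
```

The broadcasted array holds `n_queries * n_train * n_features` values, so this trades memory for speed; for a dataset of this size that is unlikely to matter.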

Compute the performance of the system as a function of $k = 1, \dots, 7$.

In [14]:
k_range = np.arange(1, 8, 1)

for k in k_range:
    model = KNNClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"Accuracy for k={k} : {accuracy_score(y_test, preds)}")

Accuracy for k=1 : 0.6979166666666666
Accuracy for k=2 : 0.6354166666666666
Accuracy for k=3 : 0.7916666666666666
Accuracy for k=4 : 0.7395833333333334
Accuracy for k=5 : 0.8125
Accuracy for k=6 : 0.78125
Accuracy for k=7 : 0.8020833333333334


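The best accuracy (0.8125) is obtained with k=5. Accuracy does not increase steadily with k: each even value (k=2, 4, 6) scores lower than its odd neighbors, possibly because ties can occur between the two classes when k is even. Considering too many neighbors may also lead to less accurate predictions.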

Run the KNN algorithm using only the features `TotalIncome` and `CreditHistory`.

In [15]:
features = ['TotalIncome', 'CreditHistory']

model = KNNClassifier(n_neighbors=3)
model.fit(X_train[features], y_train)
preds = model.predict(X_test[features])
print(f"Accuracy for k={3} : {accuracy_score(y_test, preds)}")

Accuracy for k=3 : 0.78125


Re-run the KNN algorithm using the features `TotalIncome`, `CreditHistory` and `Married`.

In [16]:
features = ['TotalIncome', 'CreditHistory', 'Married']

model = KNNClassifier(n_neighbors=3)
model.fit(X_train[features], y_train)
preds = model.predict(X_test[features])
print(f"Accuracy for k={3} : {accuracy_score(y_test, preds)}")

Accuracy for k=3 : 0.8645833333333334


Re-run the KNN algorithm using all features.

In [17]:
model = KNNClassifier(n_neighbors=3)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f"Accuracy for k={3} : {accuracy_score(y_test, preds)}")


Accuracy for k=3 : 0.7916666666666666


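With the same number of neighbors (k=3), the features [`TotalIncome`, `CreditHistory`, `Married`] give the best accuracy (about 0.865), ahead of all the features (about 0.792) and of [`TotalIncome`, `CreditHistory`] alone (about 0.781). This suggests that some of the remaining features add noise to the distance computation rather than useful information.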

When both classes receive the same number of votes (which can happen when k is even), the `scipy.stats.mode()` function returns the smallest of the tied values, so the prediction is class 0 (`N` after ordinal encoding).
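
The tie behaviour can be checked directly with a short sketch. The vote array below is hypothetical (two votes per class, as could happen with k=4), and `np.ravel` is used only so that the snippet copes with SciPy versions that return the mode as a one-element array as well as those that return a scalar.

```python
import numpy as np
from scipy.stats import mode

# Labels of k=4 hypothetical neighbors: two votes for each class
tied_votes = np.array([1, 0, 1, 0])

# np.ravel copes with both the array and the scalar result shapes
# that different SciPy versions return
winner = int(np.ravel(mode(tied_votes).mode)[0])
print(winner)
```

With two classes, choosing an odd k is one simple way to avoid such ties altogether.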
