##### Authors: Rafael Dousse, Eva Ray, Massimo Stefani

# Exercise 2: Classification system with KNN - To Loan or Not To Loan

## Imports

Import the libraries used throughout the notebook.

In [313]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split

## a. Getting started

### Data loading

The original dataset comes from Kaggle's [Loan Prediction](https://www.kaggle.com/ninzaami/loan-predication) problem. The provided dataset has already been preprocessed: some columns and invalid rows were removed. Pandas is used to read the CSV file.

In [314]:
url = "https://raw.githubusercontent.com/evarayHEIG/machle_labs/main/pw2/ex2-to-loan-or-not-to-loan/loandata.csv"
data = pd.read_csv(url)

Display the head of the data.

In [315]:
data.head()

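|   | Gender | Married | Education    | TotalIncome | LoanAmount | CreditHistory | LoanStatus |
|---|--------|---------|--------------|-------------|------------|---------------|------------|
| 0 | Male   | Yes     | Graduate     | 6091.0      | 128.0      | 1.0           | N          |
| 1 | Male   | Yes     | Graduate     | 3000.0      | 66.0       | 1.0           | Y          |
| 2 | Male   | Yes     | Not Graduate | 4941.0      | 120.0      | 1.0           | Y          |
| 3 | Male   | No      | Graduate     | 6000.0      | 141.0      | 1.0           | Y          |
| 4 | Male   | Yes     | Graduate     | 9613.0      | 267.0      | 1.0           | Y          |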


Data's columns:
* **Gender:** Applicant gender (Male/Female)
* **Married:** Is the applicant married? (Yes/No)
* **Education:** Applicant education (Graduate/Not Graduate)
* **TotalIncome:** Applicant total income (sum of `ApplicantIncome` and `CoapplicantIncome` columns in the original dataset)
* **LoanAmount:** Loan amount in thousands
* **CreditHistory:** Credit history meets guidelines
* **LoanStatus** (Target)**:** Loan approved (Y/N)

### Data preprocessing

Define a list of categorical columns to encode.

In [316]:
categorical_columns = ["Gender", "Married", "Education", "LoanStatus"]

Encode the categorical columns using scikit-learn's [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html).

In [317]:
data[categorical_columns] = OrdinalEncoder().fit_transform(data[categorical_columns])
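
As a quick check of what the encoder does, the sketch below runs `OrdinalEncoder` on a small hand-made frame (the `toy` data is hypothetical, not taken from the loan dataset). With the default settings the categories of each column are sorted, and the position of a category in that sorted list becomes its code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame mimicking two of the categorical columns
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "LoanStatus": ["Y", "N", "Y"],
})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy)

# categories_ lists, per column, the sorted categories; the position is the code
gender_categories = list(encoder.categories_[0])
status_categories = list(encoder.categories_[1])
print(gender_categories, status_categories)
print(encoded)
```

Under this default ordering, `N` is encoded as 0 and `Y` as 1, so a value of 1 in the target stands for an approved loan.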

Split into `X` and `y`.

In [318]:
X = data.drop(columns="LoanStatus")
y = data.LoanStatus

Standardize the features using scikit-learn's [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [319]:
X[X.columns] = StandardScaler().fit_transform(X[X.columns])
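
The following minimal sketch shows what this scaling computes, using made-up numbers (the `values` array is hypothetical). Each column is centered on its mean and divided by its population standard deviation, which the manual computation reproduces.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two numeric columns (income, loan amount)
values = np.array([[3000.0, 66.0],
                   [6000.0, 141.0],
                   [9000.0, 267.0]])

scaled = StandardScaler().fit_transform(values)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

This step matters for a distance-based method such as KNN: without it, a column with large values like `TotalIncome` would dominate the Euclidean distance.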

Convert `y` to type `int`.

In [320]:
y = y.astype(int)

Split dataset into train and test sets.

In [321]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
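
To illustrate the effect of `stratify=y`, here is a small sketch on a made-up imbalanced target (`X_toy` and `y_toy` are hypothetical). The class ratio of the full target is preserved in both the training and the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 70 approved (1) and 30 rejected (0)
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# Both splits keep the 70/30 ratio of the full target
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(len(y_tr), train_ratio, len(y_te), test_ratio)
```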

## b. Dummy classifier

Build a dummy classifier that takes decisions randomly.

In [322]:
class DummyClassifier():
    
    def __init__(self):
        """
        Initialize the class.
        """
        self.X_train = None
        self.y_train = None
    
    def fit(self, X, y):
        """
        Fit the dummy classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas DataFrame of shape (n_samples,)
            Target values.
        """
        self.X_train = X
        self.y_train = y
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array or Pandas DataFrame of shape (n_queries,)
            Class labels for each data sample.
        """
        return np.random.choice(self.y_train.unique(), size=len(X))

Implement a function to evaluate the performance of a classifier by computing the accuracy ($N_{correct}/N$).

In [323]:
def accuracy_score(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

Compute the performance of the dummy classifier using the provided test set.

In [324]:
dummy_class = DummyClassifier()
dummy_class.fit(X_train, y_train)
y_pred = dummy_class.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Dummy classifier accuracy: {acc:.2f}")

Dummy classifier accuracy: 0.48
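
An accuracy close to 0.5 is what one expects from uniform random guessing between two classes, whatever the class balance: each guess matches the true label with probability 1/2. The simulation below is a minimal sketch with a made-up target (the 70% share of positives is an assumption chosen for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary target with about 70% positives
n = 100_000
y_true = (rng.random(n) < 0.7).astype(int)

# Uniform random guesses, as the dummy classifier does with two classes
y_guess = rng.choice([0, 1], size=n)

accuracy = np.mean(y_true == y_guess)
print(f"Simulated accuracy: {accuracy:.3f}")
```

With a small test set, a single run can land a few points away from 0.5, as with the 0.48 reported above.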


## c. K-Nearest Neighbors classifier

Build a K-Nearest Neighbors classifier using the Euclidean distance and a simple majority voting criterion.

In [337]:
class KNNClassifier():
    
    def __init__(self, n_neighbors=3):
        """
        Initialize the class.
        
        Parameters
        ----------
        n_neighbors : int, default=3
            Number of neighbors to use by default.
        """
        self.k = n_neighbors
        self.X_train = None
        self.y_train = None
    
    def fit(self, X, y):
        """
        Fit the k-nearest neighbors classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas DataFrame of shape (n_samples,)
            Target values.
        """
        self.X_train = X
        self.y_train = y
    
    @staticmethod
    def _euclidian_distance(a, b):
        """
        Utility function to compute the Euclidean distance between each row of `a` and `b`.
        
        Parameters
        ----------
        a : Numpy array or Pandas DataFrame
            First operand.
        b : Numpy array or Pandas DataFrame
            Second operand.
        """
        return np.sqrt(np.sum((a - b) ** 2, axis=1))
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array or Pandas DataFrame of shape (n_queries,)
            Class labels for each data sample.
        """
        y_pred = []
        for i in range(len(X)):
            distances = self._euclidian_distance(self.X_train.values, X.iloc[i].values)
            # Find the k nearest neighbors
            k_indices = np.argsort(distances)[:self.k]
            k_nearest_labels = self.y_train.iloc[k_indices]
            # Count the occurrences of each class in the k nearest neighbors
            counts = k_nearest_labels.value_counts()
            # Find the maximum occurrences
            max_count = counts.max()
            # Find the class(es) with the maximum occurrences
            candidates = counts[counts == max_count].index.tolist()

            predicted_class = None

            if len(candidates) == 1:
                # If there is only one candidate class, choose it
                predicted_class = candidates[0]
            else:
                # If there is more than one candidate class, choose the one that has the closest neighbor
                for idx in k_indices:
                    label = self.y_train.iloc[idx]
                    if label in candidates:
                        predicted_class = label
                        break
            y_pred.append(predicted_class)
        return np.array(y_pred)

    def score(self, X, y):
        y_pred = self.predict(X)
        return accuracy_score(y, y_pred)
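
To make the `predict` loop concrete, here is a minimal NumPy-only sketch of a single prediction on made-up points (`train_X`, `train_y` and `query` are hypothetical). It follows the same steps as the class above: Euclidean distances, `argsort` to get the k nearest points, then a majority vote.

```python
import numpy as np
from collections import Counter

# Hypothetical training set: 4 points in 2D with binary labels
train_X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
train_y = np.array([0, 1, 1, 0])
query = np.array([0.0, 0.0])
k = 3

# Broadcasting subtracts the query from every row; axis=1 sums over features
distances = np.sqrt(np.sum((train_X - query) ** 2, axis=1))

# Indices of the k closest training points, nearest first
k_indices = np.argsort(distances)[:k]
k_labels = train_y[k_indices]

# Majority vote among the k labels (no tie possible here: k is odd, two classes)
predicted = Counter(k_labels.tolist()).most_common(1)[0][0]
print(distances, k_indices, predicted)
```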


Compute the performance of the system as a function of $k = 1, \dots, 7$.

In [326]:
def evaluate_knn(X_train, X_test, y_train, y_test, k_values):
    accuracies = []
    for k in k_values:
        knn = KNNClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        y_pred = knn.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        accuracies.append(acc)
        print(f"KNN classifier accuracy with k={k}: {acc:.4f}")
    return accuracies

Run the KNN algorithm using only the features `TotalIncome` and `CreditHistory`.

In [327]:
k_values = range(1, 8)

features_c = ["TotalIncome", "CreditHistory"]
X_train_c = X_train[features_c]
X_test_c = X_test[features_c]
accuracies_c = evaluate_knn(X_train_c, X_test_c, y_train, y_test, k_values)

KNN classifier accuracy with k=1: 0.7708
KNN classifier accuracy with k=2: 0.7708
KNN classifier accuracy with k=3: 0.7812
KNN classifier accuracy with k=4: 0.7917
KNN classifier accuracy with k=5: 0.8229
KNN classifier accuracy with k=6: 0.8125
KNN classifier accuracy with k=7: 0.8125


Re-run the KNN algorithm using the features `TotalIncome`, `CreditHistory` and `Married`.

In [328]:
features_d = ["TotalIncome", "CreditHistory", "Married"]
X_train_d = X_train[features_d]
X_test_d = X_test[features_d]
accuracies_d = evaluate_knn(X_train_d, X_test_d, y_train, y_test, k_values)

KNN classifier accuracy with k=1: 0.7917
KNN classifier accuracy with k=2: 0.7917
KNN classifier accuracy with k=3: 0.8646
KNN classifier accuracy with k=4: 0.8438
KNN classifier accuracy with k=5: 0.7917
KNN classifier accuracy with k=6: 0.8021
KNN classifier accuracy with k=7: 0.8021


Re-run the KNN algorithm using all features.

In [329]:
accuracies_e = evaluate_knn(X_train, X_test, y_train, y_test, k_values)

KNN classifier accuracy with k=1: 0.6979
KNN classifier accuracy with k=2: 0.6979
KNN classifier accuracy with k=3: 0.7917
KNN classifier accuracy with k=4: 0.8125
KNN classifier accuracy with k=5: 0.8125
KNN classifier accuracy with k=6: 0.8229
KNN classifier accuracy with k=7: 0.8021


## Question e

What do you notice? What can you tell about the relationship between the performance, the number of samples and the number of features when using this algorithm?

**Answer:**

The best accuracy obtained for each feature set is as follows:

| Features used                              | k | Accuracy |
|--------------------------------------------|---|----------|
| `TotalIncome`, `CreditHistory`             | 5 | 0.82     |
| `TotalIncome`, `CreditHistory`, `Married`  | 3 | 0.86     |
| All features                               | 6 | 0.82     |

Among these best scores, the highest accuracy (0.86) is obtained with k=3 using the `TotalIncome`, `CreditHistory` and `Married` features. The two other feature sets peak lower, at 0.82: with k=5 for `TotalIncome` and `CreditHistory`, and with k=6 for all features. This observation leads to several conclusions:

- Having more information doesn't always lead to better performance. This is due to the fact that some features may be irrelevant or even detrimental to the classification task, introducing noise and confusion.
- Some features provide useful information for the classification task. In this case, the `Married` feature seems to be relevant, since the performance improves when it is added to the feature set.
- Using a larger k does not necessarily lead to better performance. A smaller k is more sensitive to local patterns in the data (and to noise), while a larger k smooths out these patterns and can lose information.
- The optimal value of k varies depending on the feature set used. This indicates that the choice of k is influenced by the characteristics of the data and the features selected.
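
The first point can be illustrated with a tiny hand-made case (all points below are hypothetical, and `nearest_label` is a helper written for this sketch). Adding an irrelevant feature with a large spread changes which training point is nearest, and therefore the 1-NN prediction.

```python
import numpy as np

def nearest_label(train_X, train_y, query):
    """Return the label of the training point closest to the query (1-NN)."""
    distances = np.sqrt(np.sum((train_X - query) ** 2, axis=1))
    return train_y[np.argmin(distances)]

train_y = np.array([0, 1])

# One informative feature: the query (0.2) is closest to the class-0 point (0.0)
informative_X = np.array([[0.0], [1.0]])
label_informative = nearest_label(informative_X, train_y, np.array([0.2]))

# Same points plus an irrelevant second feature with a large spread
noisy_X = np.array([[0.0, 5.0], [1.0, 0.0]])
label_noisy = nearest_label(noisy_X, train_y, np.array([0.2, 0.0]))

print(label_informative, label_noisy)
```

With the informative feature alone the query is assigned class 0; once the irrelevant feature is added, the class-1 point becomes the nearest one.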

## Question f

How does your system make decisions when both classes receive an equal number of votes, with values of k = 2, 4, 6?

**Answer:**

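To handle ties in the voting process, we choose, among the tied classes, the one that owns the neighbor closest to the query point. A consequence is that k=2 always gives the same prediction as k=1: either both neighbors agree, or the tie is resolved in favor of the nearest one. This matches the identical accuracies reported for k=1 and k=2 in every run above.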
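
The tie-breaking rule can be sketched in isolation as follows. The helper `vote_with_tie_break` is hypothetical and written only for this illustration; it takes the neighbor labels already sorted from nearest to farthest, as `k_indices` is in the classifier.

```python
from collections import Counter

def vote_with_tie_break(labels_by_distance):
    """Majority vote over labels sorted from nearest to farthest.

    On a tie, return the tied class that appears first in the list,
    i.e. the tied class owning the nearest neighbor.
    """
    counts = Counter(labels_by_distance)
    max_count = max(counts.values())
    candidates = {label for label, c in counts.items() if c == max_count}
    for label in labels_by_distance:
        if label in candidates:
            return label

tie_k2 = vote_with_tie_break([1, 0])        # k=2, one vote each
tie_k4 = vote_with_tie_break([0, 1, 1, 0])  # k=4, two votes each
clear_k3 = vote_with_tie_break([0, 1, 1])   # k=3, clear majority
print(tie_k2, tie_k4, clear_k3)
```

In the two tied cases the label of the nearest neighbor wins; in the k=3 case the majority class wins even though the nearest neighbor belongs to the other class.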

## Let's do the same using cross-validation

In [330]:
from sklearn.model_selection import KFold

def cross_val_knn(X, y, k_values, n_folds=5):
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    results = {k: [] for k in k_values}
    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        for k in k_values:
            knn = KNNClassifier(n_neighbors=k)
            knn.fit(X_train, y_train)
            y_pred = knn.predict(X_test)
            acc = accuracy_score(y_test, y_pred)
            results[k].append(acc)
    # Mean accuracy for each k
    print(f"Cross-validation results over {n_folds} folds:")
    for k in k_values:
        print(f"KNN (k={k}) - Mean accuracy over {n_folds} folds: {np.mean(results[k]):.4f}")
    return results

In [333]:
accuracies_c_crossval = cross_val_knn(X[features_c], y, k_values)
accuracies_d_crossval = cross_val_knn(X[features_d], y, k_values)
accuracies_e_crossval = cross_val_knn(X, y, k_values)

Cross-validation results over 5 folds:
KNN (k=1) - Mean accuracy over 5 folds: 0.6937
KNN (k=2) - Mean accuracy over 5 folds: 0.6937
KNN (k=3) - Mean accuracy over 5 folds: 0.7438
KNN (k=4) - Mean accuracy over 5 folds: 0.7396
KNN (k=5) - Mean accuracy over 5 folds: 0.7646
KNN (k=6) - Mean accuracy over 5 folds: 0.7646
KNN (k=7) - Mean accuracy over 5 folds: 0.7812
Cross-validation results over 5 folds:
KNN (k=1) - Mean accuracy over 5 folds: 0.6813
KNN (k=2) - Mean accuracy over 5 folds: 0.6813
KNN (k=3) - Mean accuracy over 5 folds: 0.7646
KNN (k=4) - Mean accuracy over 5 folds: 0.7500
KNN (k=5) - Mean accuracy over 5 folds: 0.7708
KNN (k=6) - Mean accuracy over 5 folds: 0.7729
KNN (k=7) - Mean accuracy over 5 folds: 0.7854
Cross-validation results over 5 folds:
KNN (k=1) - Mean accuracy over 5 folds: 0.7292
KNN (k=2) - Mean accuracy over 5 folds: 0.7292
KNN (k=3) - Mean accuracy over 5 folds: 0.7521
KNN (k=4) - Mean accuracy over 5 folds: 0.7604
K
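
The fold mechanics used in `cross_val_knn` can be illustrated on a made-up dataset of 10 row indices (the `indices` array is hypothetical). With `KFold(n_splits=5, shuffle=True, random_state=42)`, each sample ends up in the test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical dataset of 10 samples, identified by their row index
indices = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
test_folds = [test_idx for _, test_idx in kf.split(indices)]

for fold, test_idx in enumerate(test_folds):
    print(f"Fold {fold}: test indices {test_idx.tolist()}")

# Every sample is used for testing exactly once across the folds
all_test = np.sort(np.concatenate(test_folds))
```

Averaging the accuracy over the folds, as done above, makes the estimate less dependent on one particular train/test split.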

In [347]:
# Test with scikit-learn's functions to see the difference
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

k_values = list(range(1, 31))
scores = []

# Note: X was already standardized above, so this second scaling is redundant;
# it also turns X from a DataFrame into a NumPy array.
scaler = StandardScaler()
X = scaler.fit_transform(X)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5)
    scores.append(np.mean(score))
    print(f"KNN (k={k}) - Mean accuracy over 5 folds: {np.mean(score):.4f}")

KNN (k=1) - Mean accuracy over 5 folds: 0.7167
KNN (k=2) - Mean accuracy over 5 folds: 0.6542
KNN (k=3) - Mean accuracy over 5 folds: 0.7667
KNN (k=4) - Mean accuracy over 5 folds: 0.7354
KNN (k=5) - Mean accuracy over 5 folds: 0.7854
KNN (k=6) - Mean accuracy over 5 folds: 0.7750
KNN (k=7) - Mean accuracy over 5 folds: 0.8042
KNN (k=8) - Mean accuracy over 5 folds: 0.7937
KNN (k=9) - Mean accuracy over 5 folds: 0.8000
KNN (k=10) - Mean accuracy over 5 folds: 0.7917
KNN (k=11) - Mean accuracy over 5 folds: 0.7958
KNN (k=12) - Mean accuracy over 5 folds: 0.7917
KNN (k=13) - Mean accuracy over 5 folds: 0.8042
KNN (k=14) - Mean accuracy over 5 folds: 0.8083
KNN (k=15) - Mean accuracy over 5 folds: 0.8083
KNN (k=16) - Mean accuracy over 5 folds: 0.8083
KNN (k=17) - Mean accuracy over 5 folds: 0.8104
KNN (k=18) - Mean accuracy over 5 folds: 0.8083
KNN (k=19) - Mean accuracy over 5 folds: 0.8083
KNN (k=20) - Mean accuracy over 5 folds: 0.8083
KNN (k=21