## CS-471: Machine Learning
### **Submitted By**:
#### **Name**: Ayesh Ahmad
#### **CMS**: 365966
#### **Class**: BESE-12A
---
## Lab 5
#### Implement the k-Nearest Neighbor Classifier. In the training and test data files, each row contains data about one instance of a plant category where four predictors/attributes are recorded for each plant (namely, leaf length, leaf width, flower length, and flower width), while “plant” is the target class which could be any one of the following at a time: “Arctica” or “Harlequin” or “Caroliniana”.

##### Imports

In [62]:
import time
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

##### Data Preprocessing and Analysis

In [40]:

def load_and_preprocess_data(train_file, test_file):
    train_data = pd.read_excel(train_file)
    test_data = pd.read_excel(test_file)
    
    # Fill missing test labels with 'Unknown' (assign back rather than calling fillna inplace on a column selection)
    test_data['plant'] = test_data['plant'].fillna('Unknown')
 
    # Standardize the four features; note that the scaler is fit on train and test rows combined
    feature_cols = ['leaf.length', 'leaf.width', 'flower.length', 'flower.width']
    n_train = len(train_data)
    combined_data = pd.concat([train_data, test_data], axis=0)
    scaler = StandardScaler()
    combined_data[feature_cols] = scaler.fit_transform(combined_data[feature_cols])

    # Split the scaled rows back into the original train and test sets
    train_data = combined_data.iloc[:n_train].reset_index(drop=True)
    test_data = combined_data.iloc[n_train:].reset_index(drop=True)
    
    # Separate features and labels
    X_train = train_data[feature_cols]
    y_train = train_data['plant']
    X_test = test_data[feature_cols]
    y_test = test_data['plant']
    
    return X_train, y_train, X_test, y_test

train_file = 'TrainingSet.xlsx'
test_file = 'TestingSet.xlsx'
X_train, y_train, X_test, y_test = load_and_preprocess_data(train_file, test_file)

print("Training data:")
print(X_train.head())
print("\nTraining labels:")
print(y_train.head())
print("\nTest data:")
print(X_test.head())
print("\nTest labels:")
print(y_test.head())

Training data:
   leaf.length  leaf.width  flower.length  flower.width
0    -0.537178    1.479398      -1.283389     -1.315444
1    -1.264185    0.788808      -1.226552     -1.315444
2    -1.264185   -0.131979      -1.340227     -1.447076
3    -1.870024   -0.131979      -1.510739     -1.447076
4    -0.052506    2.169988      -1.453901     -1.315444

Training labels:
0    Arctica
1    Arctica
2    Arctica
3    Arctica
4    Arctica
Name: plant, dtype: object

Test data:
   leaf.length  leaf.width  flower.length  flower.width
0    -1.748856   -0.362176      -1.340227     -1.315444
1    -1.506521    0.098217      -1.283389     -1.315444
2    -1.506521    0.788808      -1.340227     -1.183812
3    -1.385353    0.328414      -1.397064     -1.315444
4    -1.143017   -0.131979      -1.340227     -1.315444

Test labels:
0    Unknown
1    Unknown
2    Unknown
3    Unknown
4    Unknown
Name: plant, dtype: object
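
The preprocessing cell above fits the scaler on the training and test rows together. A common alternative is to compute the scaling statistics from the training rows only and reuse them for the test rows. The following is a minimal NumPy sketch of that idea; the function name `standardize_with_train_stats` and the toy values are hypothetical and are not taken from the lab's spreadsheets.

```python
import numpy as np

def standardize_with_train_stats(X_train, X_test):
    """Scale both splits using the mean and std of the training rows only."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)   # population std (ddof=0), the convention StandardScaler uses
    std[std == 0] = 1.0         # avoid division by zero for constant columns
    return (X_train - mean) / std, (X_test - mean) / std

# Hypothetical toy values, for illustration only
toy_train = np.array([[5.0, 3.0], [6.0, 3.5], [7.0, 4.0]])
toy_test = np.array([[6.0, 3.0]])
train_scaled, test_scaled = standardize_with_train_stats(toy_train, toy_test)
print(train_scaled.mean(axis=0))  # close to 0 in every column
print(test_scaled)
```

With this approach the test rows never influence the mean and standard deviation, so the scaled values would differ slightly from those printed above.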




##### Custom k-NN Classifier Implementation

In [41]:
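class KNNClassifier:
    """Minimal k-nearest-neighbour classifier using Euclidean distance and majority vote."""

    def __init__(self, k):
        self.k = k

    def fit(self, X_train, y_train):
        # k-NN is a lazy learner: fitting only stores the training data
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        predictions = []
        for i in range(len(X_test)):
            # Euclidean distance from test row i to every training row
            distances = np.sqrt(np.sum((self.X_train - X_test.iloc[i])**2, axis=1))
            # Positions of the k closest training rows
            nearest_indices = np.argsort(distances)[:self.k]
            nearest_labels = self.y_train.iloc[nearest_indices]
            # Majority vote among the k neighbours
            most_common = Counter(nearest_labels).most_common(1)
            predictions.append(most_common[0][0])
        return predictions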

##### Running the Custom k-NN Classifier

In [42]:
custom_preds = [] 
k_values = [3, 5, 7]

for k in k_values:
    knn = KNNClassifier(k)
    knn.fit(X_train, y_train)
    start_time = time.time()
    y_pred = knn.predict(X_test)
    end_time = time.time()
    print(f"Predictions for k={k}:")
    print(f"{y_pred}")
    custom_preds.append(y_pred)
    print(f"Time taken: {end_time - start_time:.6f} seconds\n")

Predictions for k=3:
['Arctica', 'Arctica', 'Arctica', 'Arctica', 'Arctica', 'Arctica', 'Harlequin', 'Harlequin', 'Arctica', 'Arctica', 'Arctica', 'Harlequin', 'Arctica', 'Harlequin', 'Harlequin', 'Carolinian', 'Harlequin', 'Carolinian', 'Carolinian', 'Harlequin', 'Harlequin', 'Carolinian', 'Harlequin', 'Carolinian', 'Harlequin', 'Harlequin', 'Carolinian', 'Carolinian', 'Carolinian', 'Carolinian']
Time taken: 0.019622 seconds

Predictions for k=5:
['Arctica', 'Arctica', 'Arctica', 'Arctica', 'Arctica', 'Arctica', 'Harlequin', 'Harlequin', 'Arctica', 'Arctica', 'Arctica', 'Harlequin', 'Arctica', 'Harlequin', 'Harlequin', 'Carolinian', 'Harlequin', 'Carolinian', 'Carolinian', 'Harlequin', 'Harlequin', 'Carolinian', 'Harlequin', 'Carolinian', 'Harlequin', 'Harlequin', 'Carolinian', 'Carolinian', 'Carolinian', 'Carolinian']
Time taken: 0.017727 seconds

Predictions for k=7:
['Arctica', 'Arctica', 'Arctica', 'Arctica', 'Arctica', 'Arctica', 'Harlequin', 'Harlequin', 'Arctica', 'Arctica', 'A

##### Running the scikit-learn k-NN Classifier

In [43]:
scikit_preds = []
k_values = [3, 5, 7]

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    start_time = time.time()
    y_pred = knn.predict(X_test)
    end_time = time.time()
    print(f"Predictions for k={k}:")
    print(f"{y_pred}")
    scikit_preds.append(y_pred)
    print(f"Time taken: {end_time - start_time:.6f} seconds\n")

Predictions for k=3:
['Arctica' 'Arctica' 'Arctica' 'Arctica' 'Arctica' 'Arctica' 'Harlequin'
 'Harlequin' 'Arctica' 'Arctica' 'Arctica' 'Harlequin' 'Arctica'
 'Harlequin' 'Harlequin' 'Carolinian' 'Harlequin' 'Carolinian'
 'Carolinian' 'Harlequin' 'Harlequin' 'Carolinian' 'Harlequin'
 'Carolinian' 'Harlequin' 'Harlequin' 'Carolinian' 'Carolinian'
 'Carolinian' 'Carolinian']
Time taken: 0.001112 seconds

Predictions for k=5:
['Arctica' 'Arctica' 'Arctica' 'Arctica' 'Arctica' 'Arctica' 'Harlequin'
 'Harlequin' 'Arctica' 'Arctica' 'Arctica' 'Harlequin' 'Arctica'
 'Harlequin' 'Harlequin' 'Carolinian' 'Harlequin' 'Carolinian'
 'Carolinian' 'Harlequin' 'Harlequin' 'Carolinian' 'Harlequin'
 'Carolinian' 'Harlequin' 'Harlequin' 'Carolinian' 'Carolinian'
 'Carolinian' 'Carolinian']
Time taken: 0.001063 seconds

Predictions for k=7:
['Arctica' 'Arctica' 'Arctica' 'Arctica' 'Arctica' 'Arctica' 'Harlequin'
 'Harlequin' 'Arctica' 'Arctica' 'Arctica' 'Harlequin' 'Arctica'
 'Harlequin' 'Harlequin' 'C
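
Each "Time taken" value above comes from a single `time.time()` measurement, which can be noisy at the millisecond scale. A hedged sketch of a more stable measurement is shown below: it repeats the call several times and uses `time.perf_counter()`. The helper name `time_call` and the stand-in workload are hypothetical; in the notebook the timed call would be something like `knn.predict` with `X_test`.

```python
import time

def time_call(fn, *args, repeats=20):
    """Return (best, mean) wall-clock seconds over `repeats` calls of fn(*args)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    return min(durations), sum(durations) / len(durations)

# Stand-in workload for illustration; replace with the classifier's predict call
best, mean = time_call(sum, range(10_000))
print(f"best: {best:.6f} s, mean: {mean:.6f} s")
```

Reporting the best or mean of several runs makes comparisons between the two classifiers less sensitive to one-off delays.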

##### Comparing Custom and scikit-learn k-NN Classifier Predictions

In [61]:
if (custom_preds[0] == custom_preds[1]) and (custom_preds[1] == custom_preds[2]):
    print("Intra-Results for custom implementation match.")
if (scikit_preds[0] == scikit_preds[1]).all() and (scikit_preds[1] == scikit_preds[2]).all():
    print("Intra-Results for scikit learn implementation match.")
if (np.array_equal(custom_preds, scikit_preds)):
    print("Inter-Results match.")

Intra-Results for custom implementation match.
Intra-Results for scikit learn implementation match.
Inter-Results match.


##### Results and Conclusion

| k-value | Custom Execution Time (s) | scikit-learn Execution Time (s) | Absolute Difference (s) | Time Reduction Relative to Custom (%) |
| ------- | ------------------------- | ------------------------------- | ----------------------- | ------------------------------------- |
| 3       | 0.019622                  | 0.001112                        | 0.018510                | 94.33                                 |
| 5       | 0.017727                  | 0.001063                        | 0.016664                | 94.00                                 |
| 7       | 0.022873                  | 0.005037                        | 0.017836                | 77.98                                 |
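
The derived columns of the table can be recomputed from the two timing columns. The sketch below assumes that the percentage is measured relative to the custom implementation's time (the table does not state the base explicitly) and also reports the corresponding speedup factor. The helper name `compare` is hypothetical.

```python
# Times in seconds, copied from the results table
timings = {
    3: (0.019622, 0.001112),
    5: (0.017727, 0.001063),
    7: (0.022873, 0.005037),
}

def compare(custom_s, scikit_s):
    """Return (absolute difference, % reduction relative to custom, speedup factor)."""
    abs_diff = custom_s - scikit_s
    pct_reduction = 100 * abs_diff / custom_s
    speedup = custom_s / scikit_s
    return abs_diff, pct_reduction, speedup

for k, (custom_s, scikit_s) in timings.items():
    abs_diff, pct, speedup = compare(custom_s, scikit_s)
    print(f"k={k}: diff={abs_diff:.6f} s, reduction={pct:.2f} %, speedup={speedup:.1f}x")
```

Because the inputs are already rounded to six decimals, the recomputed percentages may differ in the last digit from values derived from unrounded timings.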

Because the predictions are identical for every k-value tested, we can infer that changing k (among 3, 5, and 7) had no effect on the predictions for this test set. This was cross-checked against scikit-learn's own k-NN classifier, which produced exactly the same predictions as the custom k-NN classifier.

What did differ between scikit-learn's implementation and our own is the execution time: scikit-learn's prediction step took up to ~94% less time (roughly 17 times faster for k = 3 and k = 5) while producing exactly the same results.
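
One likely contributor to the gap is that the custom `predict` loops over test rows in Python and performs pandas arithmetic on every iteration. The following is a minimal sketch of a vectorized variant that computes all pairwise distances in one NumPy broadcasting step. It is not scikit-learn's algorithm, and the function name `knn_predict_vectorized` and the toy data are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict_vectorized(X_train, y_train, X_test, k):
    """Predict labels by majority vote among the k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    # Pairwise squared Euclidean distances, shape (n_test, n_train)
    diffs = X_test[:, None, :] - X_train[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k nearest training rows for each test row
    nearest = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    # Majority vote over the neighbours' labels
    return [Counter(y_train[row].tolist()).most_common(1)[0][0] for row in nearest]

# Hypothetical toy data: two well-separated clusters
toy_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y = ["A", "A", "A", "B", "B", "B"]
toy_preds = knn_predict_vectorized(toy_X, toy_y, [[0.2, 0.1], [5.5, 5.2]], k=3)
print(toy_preds)
```

Skipping the square root does not change the neighbour ranking, since the square root is monotonic. Whether this variant closes the gap to scikit-learn on the lab's data would need to be measured.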