## Part 1(a) - Implementing the simple Nearest Neighbour algorithm

This assignment code implements a simple Nearest Neighbour algorithm to classify objects as 'Rock' (R) or 'Metal Cylinder' (M), based on the sonar data provided.

In [1]:
# Importing Essential Libraries 
import pandas as pd 
import numpy as np 
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

## Load Datasets 
sonartrain_data = pd.read_csv("sonar_train.csv")
sonartest_data = pd.read_csv("sonar_test.csv")

## Data Checking Preview 
print("Training Data:")
print(sonartrain_data.head())
print(sonartrain_data.info())

print("\nTest Data:")
print(sonartest_data.head())
print(sonartest_data.info())

## Preparing the Features and Labels 

# For the training data set 
trainY = sonartrain_data['Class'].values
trainX = sonartrain_data.drop(['Class'],axis=1).values

# for the test data set 
testY = sonartest_data['Class'].values
testX = sonartest_data.drop(['Class'],axis=1).values


Training Data:
       A1      A2      A3      A4      A5      A6      A7      A8      A9  \
0  0.0079  0.0086  0.0055  0.0250  0.0344  0.0546  0.0528  0.0958  0.1009   
1  0.0599  0.0474  0.0498  0.0387  0.1026  0.0773  0.0853  0.0447  0.1094   
2  0.0093  0.0269  0.0217  0.0339  0.0305  0.1172  0.1450  0.0638  0.0740   
3  0.0151  0.0320  0.0599  0.1050  0.1163  0.1734  0.1679  0.1119  0.0889   
4  0.0317  0.0956  0.1321  0.1408  0.1674  0.1710  0.0731  0.1401  0.2083   

      A10  ...     A52     A53     A54     A55     A56     A57     A58  \
0  0.1240  ...  0.0176  0.0127  0.0088  0.0098  0.0019  0.0059  0.0058   
1  0.0351  ...  0.0013  0.0005  0.0227  0.0209  0.0081  0.0117  0.0114   
2  0.1360  ...  0.0212  0.0091  0.0056  0.0086  0.0092  0.0070  0.0116   
3  0.1205  ...  0.0061  0.0015  0.0084  0.0128  0.0054  0.0011  0.0019   
4  0.3513  ...  0.0201  0.0248  0.0131  0.0070  0.0138  0.0092  0.0143   

      A59     A60  Class  
0  0.0059  0.0032      R  
1  0.0112  0.0100      

## Step 1: Implement the Minkowski Distance 
The distance between two points (x) and (y) is calculated using the standard formula for the Minkowski distance.
https://www.ibm.com/content/dam/connectedassets-adobe-cms/worldwide-content/cdp/cf/ul/g/9e/a8/MinkowskiDistance.png -- Formula 
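
For reference, the formula behind the linked image can be written out as follows. This is the standard Minkowski definition, written here with the exponent named q to match the `q` parameter in the code; the linked image may use a different symbol. Here n is the number of features (the columns A1 to A60 in this dataset).

```latex
D(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{q} \right)^{1/q}
```

Setting q = 1 gives the Manhattan distance and q = 2 gives the Euclidean distance.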

In [2]:
## Computing the Minkowski Distance 
def minkowski_distance(x1, x2, q=2):
    return np.sum(np.abs(x1 - x2) ** q) ** (1 / q)
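
As a small sanity check, the function can be tried on two hand-picked points where the answer is easy to verify by hand. The points `a` and `b` below are illustrative values only and are not taken from the sonar data; the function is repeated so the snippet runs on its own.

```python
import numpy as np

def minkowski_distance(x1, x2, q=2):
    # Same formula as the cell above, repeated so this example is self-contained.
    return np.sum(np.abs(x1 - x2) ** q) ** (1 / q)

# Two illustrative points (assumed values, not from the dataset).
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d_manhattan = minkowski_distance(a, b, q=1)  # |3| + |4| = 7
d_euclidean = minkowski_distance(a, b, q=2)  # sqrt(3^2 + 4^2) = 5

print(d_manhattan, d_euclidean)
```

The Manhattan distance sums the absolute differences per feature, while the Euclidean distance is the straight-line distance, so the first value is never smaller than the second.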

## Step 2: Implement the Nearest Neighbour Algorithm
For each record in the test dataset:
1. Compute the distance to all training records.
2. Identify the training record with the smallest distance.
3. Assign the class of the nearest neighbour to the test record.

In [3]:
## Implementing the Nearest Neighbour Algorithm 
def nearest_neighbour(train_data, train_labels, test_data, q=2):

    predictions = []
    for test_sample in test_data:
        distances = [minkowski_distance(test_sample, train_sample, q) for train_sample in train_data]
        
        nearest_index = np.argmin(distances)
        predictions.append(train_labels[nearest_index])
    return predictions
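
The loop above computes one distance at a time. A possible alternative, sketched below under the assumption that the inputs are 2-D NumPy arrays, is to compute all test-to-train distances at once with broadcasting. The name `nearest_neighbour_vectorised` and the toy arrays are hypothetical and used only for illustration; the sketch returns a NumPy array rather than a list.

```python
import numpy as np

def nearest_neighbour_vectorised(train_data, train_labels, test_data, q=2):
    # Broadcasting gives an array of shape (n_test, n_train, n_features).
    diffs = np.abs(test_data[:, None, :] - train_data[None, :, :])
    # Minkowski distance for every test/train pair: shape (n_test, n_train).
    distances = np.sum(diffs ** q, axis=2) ** (1 / q)
    # Index of the closest training record for each test record.
    nearest = np.argmin(distances, axis=1)
    return np.asarray(train_labels)[nearest]

# Toy data for illustration only (not the sonar data).
toy_trainX = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_trainY = np.array(["R", "R", "M"])
toy_testX = np.array([[0.2, 0.1], [4.0, 4.5]])

toy_pred = nearest_neighbour_vectorised(toy_trainX, toy_trainY, toy_testX, q=2)
print(toy_pred)
```

This trades memory for speed, since the intermediate array holds one value per test record, training record and feature. For a dataset of this size that is unlikely to matter, but the loop version may be preferable for much larger inputs.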

## Step 3: Running the algorithm for Manhattan and Euclidean Distance
1. Run the algorithm with (q=1), the Manhattan distance metric.
2. Run the algorithm with (q=2), the Euclidean distance metric.

Each test record is assigned to the class of its single nearest neighbour in the training set.

3. Output: display the accuracy, recall, precision and F1 score for the class 'M' (metal).
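
To make clear what the four reported scores mean when 'M' is treated as the positive class, here is a minimal sketch that computes them from true/false positive counts. The helper `binary_metrics` and the toy labels are hypothetical and only for illustration; the notebook itself uses the scikit-learn functions imported earlier. Returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def binary_metrics(y_true, y_pred, pos_label="M"):
    pairs = list(zip(y_true, y_pred))
    # Count outcomes relative to the positive class.
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

# Toy labels for illustration only (not the sonar data).
demo_true = ["M", "M", "M", "R", "R"]
demo_pred = ["M", "M", "R", "M", "R"]

acc, prec, rec, f1 = binary_metrics(demo_true, demo_pred, pos_label="M")
print(acc, prec, rec, f1)
```

In the toy example, two of the three true 'M' records are found (recall 2/3) and two of the three 'M' predictions are correct (precision 2/3), while three of five records are classified correctly overall (accuracy 0.6).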

In [4]:
# Euclidean Distance -- (q=2)
y_pred_euclidean = nearest_neighbour(trainX, trainY, testX, q=2)

accuracy_euclidean = accuracy_score(testY, y_pred_euclidean)
recall_euclidean = recall_score(testY, y_pred_euclidean, pos_label="M")
precision_euclidean = precision_score(testY, y_pred_euclidean, pos_label="M")
f1_euclidean = f1_score(testY, y_pred_euclidean, pos_label="M")

print("Nearest Neighbour with Euclidean Distance (q=2) Results:")
print(f"Accuracy: {accuracy_euclidean:.4f}")
print(f"Recall (for class 'M'): {recall_euclidean:.4f}")
print(f"Precision (for class 'M'): {precision_euclidean:.4f}")
print(f"F1 Score (for class 'M'): {f1_euclidean:.4f}")

Nearest Neighbour with Euclidean Distance (q=2) Results:
Accuracy: 0.8986
Recall (for class 'M'): 0.9730
Precision (for class 'M'): 0.8571
F1 Score (for class 'M'): 0.9114


In [5]:
# Manhattan Distance -- (q=1)
y_pred_manhattan = nearest_neighbour(trainX, trainY, testX, q=1)

accuracy_manhattan = accuracy_score(testY, y_pred_manhattan)
recall_manhattan = recall_score(testY, y_pred_manhattan, pos_label="M")
precision_manhattan = precision_score(testY, y_pred_manhattan, pos_label="M")
f1_manhattan = f1_score(testY, y_pred_manhattan, pos_label="M")

print("\nNearest Neighbour with Manhattan Distance (q=1) Results:")
print(f"Accuracy: {accuracy_manhattan:.4f}")
print(f"Recall (for class 'M'): {recall_manhattan:.4f}")
print(f"Precision (for class 'M'): {precision_manhattan:.4f}")
print(f"F1 Score (for class 'M'): {f1_manhattan:.4f}")


Nearest Neighbour with Manhattan Distance (q=1) Results:
Accuracy: 0.8841
Recall (for class 'M'): 0.9459
Precision (for class 'M'): 0.8537
F1 Score (for class 'M'): 0.8974


## Discussion of Results: Analysis and Insights

When evaluating the algorithm, two distance metrics were used: Manhattan and Euclidean. The results are as follows:

(q = 1)  Manhattan Distance Results:
Accuracy: 0.8841
Recall (for class 'M'): 0.9459
Precision (for class 'M'): 0.8537
F1 Score (for class 'M'): 0.8974

(q = 2) Euclidean Distance Results:
Accuracy: 0.8986
Recall (for class 'M'): 0.9730
Precision (for class 'M'): 0.8571
F1 Score (for class 'M'): 0.9114

The results show that Euclidean distance performs slightly better than Manhattan distance on every reported metric for this test set, so it is a reasonable default for this application. Nonetheless, the difference is small (for example, 0.8986 versus 0.8841 accuracy), which suggests that Manhattan distance is still a good option.
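
As a quick consistency check on the figures above, the F1 score can be recomputed from the reported precision and recall using F1 = 2PR / (P + R). The sketch below only uses the rounded values printed in this notebook, so small rounding differences in the last digit are possible; the dictionary names are hypothetical.

```python
# Precision, recall and F1 for class 'M', copied from the results above.
reported = {
    "Manhattan (q=1)": {"precision": 0.8537, "recall": 0.9459, "f1": 0.8974},
    "Euclidean (q=2)": {"precision": 0.8571, "recall": 0.9730, "f1": 0.9114},
}

recomputed = {}
for name, m in reported.items():
    p, r = m["precision"], m["recall"]
    # Harmonic mean of precision and recall.
    recomputed[name] = 2 * p * r / (p + r)
    print(f"{name}: reported F1 = {m['f1']:.4f}, recomputed F1 = {recomputed[name]:.4f}")
```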