
In [4]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

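rng = np.random.default_rng(seed=42)  # seeded generator so the random splits are reproducible
dataset = datasets.fetch_california_housing()
# Truncate the continuous target (median house value, in units of $100,000) to an integer
# so that it can be used as a class label
dataset.target = dataset.target.astype(int)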

In [5]:
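def NN1(traindata, trainlabel, query):
    """
    Predict the label of a single query point with the 1 nearest neighbour rule

    traindata: numpy array of shape (n,d)
    trainlabel: numpy array of shape (n,)
    query: numpy array of shape (d,)

    returns: the label of the training sample closest to the query point
    """
    diff = traindata - query  # difference per feature; NumPy broadcasts query across all n rows
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # squared Euclidean distance to each training sample
    label = trainlabel[np.argmin(dist)]  # label of the closest sample
    return label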

def NN3(traindata, trainlabel, query):
    """
    This function takes in the training data, training labels and a query point
    and returns the predicted label for the query point using the 3 nearest neighbours algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) of non-negative integer labels, where n is the number of samples
    query: numpy array of shape (d,) where d is the number of features

    returns: the predicted label for the query point, which is the most common label among the 3 training samples closest to the query point
    """
    diff = traindata - query  # difference per feature; NumPy broadcasts query across all n rows
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # squared Euclidean distance to each training sample
    label = trainlabel[np.argsort(dist)[:3]]  # labels of the 3 closest samples
    final_label = np.bincount(label).argmax()  # majority vote; a tie goes to the smallest label
    return final_label

def NN(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: numpy array of shape (m,) with the predicted labels, each being the label of the training sample closest to the corresponding test point
    """
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel

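def nn(traindata, trainlabel, testdata):
    """Same as NN, but predicts each test label with NN3 (majority vote of the 3 nearest neighbours)"""
    predlabel = np.array([NN3(traindata, trainlabel, i) for i in testdata])
    return predlabel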
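
NN1 and NN3 above differ only in how many neighbours take part in the vote, so they could be folded into a single function with a `k` parameter. The following is a minimal sketch of that idea, not part of the original notebook: `knn_predict` and the toy arrays are hypothetical names introduced for illustration, and the labels are assumed to be non-negative integers (as required by `np.bincount`).

```python
import numpy as np


def knn_predict(traindata, trainlabel, query, k=3):
    """Predict one query label by majority vote among its k nearest training points."""
    diff = traindata - query                    # broadcasting: (n, d) - (d,) -> (n, d)
    dist = (diff * diff).sum(axis=1)            # squared Euclidean distance to every sample
    nearest = np.argsort(dist)[:k]              # indices of the k closest training samples
    votes = np.bincount(trainlabel[nearest])    # assumes non-negative integer labels
    return votes.argmax()                       # a tie goes to the smallest label


# Toy data made up for illustration: two small clusters with labels 0 and 1
toy_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
toy_label = np.array([0, 0, 0, 1, 1])
toy_query = np.array([4.9, 5.0])

pred_k1 = knn_predict(toy_train, toy_label, toy_query, k=1)
pred_k3 = knn_predict(toy_train, toy_label, toy_query, k=3)
print(pred_k1, pred_k3)
```

With `k=1` this behaves like NN1, and with `k=3` like NN3.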

In [6]:
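def split(data, label, percent):
    """
    Randomly split data and label into two parts; each sample goes to the first part
    with probability percent (a fraction between 0 and 1), so the split sizes are approximate
    """
    # generate a random number in [0, 1) for each sample
    rnd = rng.random(len(label))
    split1 = rnd < percent  # boolean mask of samples in the first part
    split2 = rnd >= percent  # boolean mask of the remaining samples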

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label
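
Because `split` draws an independent random number for every sample, the resulting proportions are only approximately equal to `percent`. If exact split sizes are wanted, one possible alternative is to shuffle the indices once and cut them at a fixed position. The sketch below is an assumption-laden illustration rather than the notebook's method: `split_exact` and the `demo_*` names are hypothetical.

```python
import numpy as np


def split_exact(data, label, fraction, rng):
    """Shuffle sample indices once; the first part gets round(fraction * n) samples."""
    n = len(label)
    order = rng.permutation(n)          # random ordering of 0..n-1
    cut = int(round(fraction * n))      # exact size of the first part
    first, second = order[:cut], order[cut:]
    return data[first], label[first], data[second], label[second]


# Tiny made-up dataset: 10 samples with 2 features each
demo_rng = np.random.default_rng(seed=0)
demo_data = np.arange(20, dtype=float).reshape(10, 2)
demo_label = np.arange(10)

a_data, a_label, b_data, b_label = split_exact(demo_data, demo_label, 0.2, demo_rng)
print(len(a_label), len(b_label))
```

Every sample still lands in exactly one of the two parts; only the way the sizes are decided changes.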

In [7]:
def Accuracy(gtlabel, predlabel):
    """
    This function takes in the ground-truth labels and predicted labels
    and returns the accuracy of the classifier

    gtlabel: numpy array of shape (n,) where n is the number of samples
    predlabel: numpy array of shape (n,) where n is the number of samples

    returns: the accuracy of the classifier which is the number of correct predictions divided by the total number of predictions
    """
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)

In [8]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %


In [9]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

In [10]:
# predict the training labels using 1 nearest neighbour and compute the training accuracy
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm :", trainAccuracy*100, "%")

# predict the training labels using 3 nearest neighbours and compute the training accuracy
trainpred1 = nn(traindata, trainlabel, traindata)
trainAccuracy1 = Accuracy(trainlabel, trainpred1)
print("Training accuracy using 3 nearest neighbour algorithm :", trainAccuracy1*100, "%")

Training accuracy using nearest neighbour algorithm : 100.0 %
Training accuracy using 3 nearest neighbour algorithm : 61.35996119016818 %


In [11]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy of nearest neighbour :", testAccuracy*100, "%")

testpred1 = nn(alltraindata, alltrainlabel, testdata)
testAccuracy1 = Accuracy(testlabel, testpred1)

print("Test accuracy of 3 nearest neighbours :", testAccuracy1*100, "%")

Test accuracy of nearest neighbour : 34.91795366795367 %
Test accuracy of 3 nearest neighbours : 36.05212355212355 %


Exercise:


Q1) How does the accuracy of the 3 nearest neighbour classifier change with the number of splits?

A1) The accuracy increases as the number of splits increases.

Q2) How is it affected by the split size?

A2) The accuracy increases with a larger split size.

Q3) Compare the results with the 1 nearest neighbour classifier.

A3) Increasing the split size and the number of splits has a smaller effect on the 1 nearest neighbour classifier and a more pronounced effect on the 3 nearest neighbour classifier.

>Questions

Does averaging the validation accuracy across multiple splits give more consistent results?

Yes. Averaging over multiple splits smooths out the variation between individual splits, so the result is more consistent.
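
Averaging validation accuracy over repeated random splits can be sketched roughly as follows. This is a minimal illustration on synthetic two-class data, not the notebook's own experiment: `nn1_predict`, `average_validation_accuracy`, and the `demo_*` names are hypothetical, and the Gaussian blobs are made up purely so the block runs without downloading a dataset.

```python
import numpy as np


def nn1_predict(traindata, trainlabel, testdata):
    """1 nearest neighbour prediction for every row of testdata."""
    # pairwise squared distances between test and training points, shape (m, n)
    d2 = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d2.argmin(axis=1)]


def average_validation_accuracy(data, label, train_fraction, iterations, rng):
    """Repeat a random train/validation split and return each validation accuracy."""
    accuracies = []
    for _ in range(iterations):
        in_train = rng.random(len(label)) < train_fraction   # new random split each time
        pred = nn1_predict(data[in_train], label[in_train], data[~in_train])
        accuracies.append((pred == label[~in_train]).mean())
    return np.array(accuracies)


# Synthetic data (an assumption for illustration): two Gaussian blobs
demo_rng = np.random.default_rng(seed=0)
class0 = demo_rng.normal(loc=0.0, scale=1.0, size=(100, 2))
class1 = demo_rng.normal(loc=2.0, scale=1.0, size=(100, 2))
demo_data = np.vstack([class0, class1])
demo_label = np.array([0] * 100 + [1] * 100)

accs = average_validation_accuracy(demo_data, demo_label, 0.75, 10, demo_rng)
print("per-split accuracies:", accs)
print("mean:", accs.mean(), "std:", accs.std())
```

The spread of the per-split accuracies shows how much a single split can vary, while the mean is the averaged estimate.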


Does it give more accurate estimate of test accuracy?

Yes, provided enough splits are used, as in cross-validation.


What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?

Yes, more iterations give a better estimate, but at a higher computational cost.


Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?

No. The dataset needs to be large enough to produce sufficiently varied training and validation splits; increasing the iterations alone cannot make up for too little data.

> Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.
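
One way to approach the exercise with scikit-learn is sketched below. It uses `KNeighborsClassifier`, `accuracy_score`, and `train_test_split` as named in or implied by the exercise text. The Iris dataset is only a stand-in chosen because it ships with scikit-learn and needs no download, so the numbers it prints are not comparable to the California housing results above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (an assumption for illustration, not the notebook's data)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for k in (1, 3):
    model = KNeighborsClassifier(n_neighbors=k)   # k nearest neighbours vote on the label
    model.fit(X_train, y_train)
    results[k] = accuracy_score(y_test, model.predict(X_test))
    print(f"{k} nearest neighbour test accuracy: {results[k] * 100:.2f} %")
```

To repeat the comparison on the notebook's data, the same loop could be run with `alltraindata`, `alltrainlabel`, `testdata`, and `testlabel` in place of the Iris arrays.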
