## Introduction

K-nearest neighbors (KNN) is a type of supervised learning algorithm used for both regression and classification. It relies on the idea that similar data points tend to have similar labels or values. 

During the training phase, the KNN algorithm stores the entire training dataset as a reference. When making predictions, it calculates the distance between the input data point and all the training examples, using a chosen distance metric such as Euclidean distance.

Next, the algorithm identifies the K nearest neighbors to the input data point based on their distances. In the case of classification, the algorithm assigns the most common class label among the K neighbors as the predicted label for the input data point. For regression, it calculates the average or weighted average of the target values of the K neighbors to predict the value for the input data point.

Let's look at the example below for a better understanding.

![Screenshot%202024-05-25%20at%2012.41.46%E2%80%AFPM.png](attachment:Screenshot%202024-05-25%20at%2012.41.46%E2%80%AFPM.png)

Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know which of the two it is. We can use the KNN algorithm for this identification because it works on a similarity measure. The KNN model compares the features of the new image with those of the cat and dog images and, based on the most similar features, places it in either the cat or the dog category.

![Screenshot%202024-05-25%20at%2012.42.46%E2%80%AFPM.png](attachment:Screenshot%202024-05-25%20at%2012.42.46%E2%80%AFPM.png)

## Types of K-NN

1. K-NN Classifier: Used for classification tasks. The output is a class membership.

2. K-NN Regressor: Used for regression tasks. The output is the average or weighted average of the values of the K nearest neighbors.
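To make the regressor's "average or weighted average" concrete, here is a minimal sketch in plain Python. The function name `knn_regress`, the inverse-distance weighting scheme, and the toy data are illustrative assumptions, not taken from any particular library.

```python
import math

def knn_regress(train_X, train_y, query, k, weighted=False):
    """Predict a numeric value for `query` from its k nearest training points."""
    # Pair each training target with its Euclidean distance to the query.
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    if not weighted:
        # Plain average of the neighbors' target values.
        return sum(y for _, y in neighbors) / k
    # Inverse-distance weights (an assumed scheme): closer neighbors count more.
    # The small epsilon avoids division by zero for an exact match.
    weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# Hypothetical data: one feature per point and a numeric target.
train_X = [(1.0,), (2.0,), (3.0,), (6.0,)]
train_y = [100.0, 200.0, 300.0, 600.0]

uniform_pred = knn_regress(train_X, train_y, (2.5,), k=2)
weighted_pred = knn_regress(train_X, train_y, (2.2,), k=2, weighted=True)
print(uniform_pred, weighted_pred)
```

With the plain average, both neighbors contribute equally; with weighting, the prediction is pulled toward the target of the closer neighbor.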

## Why Do We Need a K-NN Algorithm?

K-NN is needed for its simplicity and effectiveness in various scenarios:

* It requires no training phase other than storing the training dataset, which makes it computationally inexpensive during the training phase.

* It adapts to the data distribution as it considers the actual data points in making predictions.

* It's particularly useful when you have a limited amount of data and when the decision boundary is very irregular.

Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories does this data point belong to? K-NN is well suited to this type of problem: it identifies the category or class of a new data point from the points closest to it. Consider the diagram below:

![Screenshot%202024-05-25%20at%2012.46.58%E2%80%AFPM.png](attachment:Screenshot%202024-05-25%20at%2012.46.58%E2%80%AFPM.png)

## When Do We Use the K-NN Algorithm?

* When the data set is small and contains few features.

* When the decision boundary is nonlinear.

* When interpretability of the decision process is crucial.

* For recommendation systems, anomaly detection, and pattern recognition tasks.

## How does K-NN work?

The working of K-NN can be explained with the following algorithm:

Step-1: Select the number K of neighbors.

Step-2: Calculate the Euclidean distance between the new data point and every training point.

Step-3: Take the K nearest neighbors according to the calculated Euclidean distances.

Step-4: Among these K neighbors, count the number of data points in each category.

Step-5: Assign the new data point to the category with the largest number of neighbors.

Step-6: The new data point is now classified.

Suppose we have a new data point and we need to put it in the required category. Consider the below image:

![Screenshot%202024-05-25%20at%2012.55.32%E2%80%AFPM.png](attachment:Screenshot%202024-05-25%20at%2012.55.32%E2%80%AFPM.png)

First, we choose the number of neighbors; here we take K=5.

Next, we calculate the Euclidean distance between the new data point and each existing data point. The Euclidean distance is the straight-line distance between two points, familiar from geometry. It can be calculated as:

![Screenshot%202024-05-25%20at%2012.56.44%E2%80%AFPM.png](attachment:Screenshot%202024-05-25%20at%2012.56.44%E2%80%AFPM.png)

By calculating the Euclidean distances, we find the five nearest neighbors: three in category A and two in category B. Consider the image below:

![Screenshot%202024-05-25%20at%2012.57.24%E2%80%AFPM.png](attachment:Screenshot%202024-05-25%20at%2012.57.24%E2%80%AFPM.png)

Since 3 of the 5 nearest neighbors are from category A, the new data point is assigned to category A.
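The walk-through above can be sketched in a few lines of plain Python. The coordinates below are hypothetical, chosen so that the five nearest neighbors split three to two in favor of category A; they are not read off the figures.

```python
import math
from collections import Counter

# Hypothetical 2-D training points labeled with category "A" or "B".
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"), ((0.5, 0.5), "A"),
    ((4.0, 3.8), "B"), ((3.5, 3.0), "B"), ((5.0, 5.0), "B"), ((4.5, 3.5), "B"),
]
new_point = (2.5, 2.5)

# Step 1: choose the number of neighbors.
k = 5

# Step 2: Euclidean distance from the new point to every training point.
distances = [(math.dist(point, new_point), label) for point, label in training]

# Step 3: keep the k nearest neighbors.
nearest = sorted(distances)[:k]

# Step 4: count how many of those neighbors fall in each category.
votes = Counter(label for _, label in nearest)

# Step 5: assign the category with the most votes.
predicted = votes.most_common(1)[0][0]
print(dict(votes), predicted)
```

Using an odd K with two categories, as here, avoids tied votes.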

## How to Choose a K Value?

Choosing the value of K is critical:

Small K: Can be noisy and lead to overfitting.

Large K: Can smooth out the decision boundary and lead to underfitting.

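The K value indicates the number of nearest neighbors considered. To make a prediction, we have to compute the distances between each test point and the labeled training points. KNN does no work at training time beyond storing the data and defers all of this distance computation to prediction time, which is why it is called a lazy learning algorithm and why prediction can be computationally expensive.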

![Screenshot%202024-05-25%20at%201.00.35%E2%80%AFPM.png](attachment:Screenshot%202024-05-25%20at%201.00.35%E2%80%AFPM.png)

As you can verify from the above image, if we proceed with K=3, we predict that the test input belongs to class B, and if we continue with K=7, we predict that the test input belongs to class A.

This shows that the K value has a strong effect on KNN performance.
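The same flip can be reproduced on a toy example. The sketch below uses hypothetical one-dimensional data and an illustrative helper, `knn_label`, arranged so that the two closest points are class B while class A dominates slightly farther out.

```python
from collections import Counter

def knn_label(training, query, k):
    """Majority label among the k training points closest to `query` (1-D data)."""
    nearest = sorted(training, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 1-D training set: (feature value, class label).
training = [(1, "B"), (2, "B"), (3, "A"), (4, "A"), (5, "A"), (6, "A"), (7, "B")]
query = 0

predictions = {k: knn_label(training, query, k) for k in (3, 7)}
print(predictions)  # {3: 'B', 7: 'A'}
```

With K=3 the vote is two B against one A; with K=7 it becomes four A against three B, so the predicted class changes.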

## How to Select the Optimal K Value?

Cross-Validation: Use cross-validation to evaluate the performance of different K values on a validation set.

Error Analysis: Plot the error rate for different K values and choose the K value with the lowest error rate.

* There is no predefined statistical method for finding the most favorable value of K.
* Start with an initial K value and evaluate the model.
* Choosing a small value of K leads to unstable decision boundaries.
* A larger K value smooths the decision boundaries, which often helps classification, although too large a value leads to underfitting.
* Plot the error rate against K over a defined range, then choose the K value with the minimum error rate.

Implementing the model will give you a feel for choosing the optimal K value.
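One way to carry out this search is leave-one-out cross-validation: classify each point using all the others and record the error rate for each candidate K. The sketch below is a plain-Python illustration under stated assumptions; the helper names, the candidate K values, and the toy data (two clusters, each containing one point labeled with the other class) are all hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Majority label among the k points in `train` closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_error_rate(data, k):
    """Leave-one-out error: classify each point using all the other points."""
    errors = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_classify(rest, point, k) != label:
            errors += 1
    return errors / len(data)

# Hypothetical data: two clusters, each with one noisy point of the other class.
data = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((2.0, 2.0), "A"), ((1.5, 1.5), "A"), ((1.5, 1.2), "B"),
    ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B"),
    ((6.0, 6.0), "B"), ((5.5, 5.5), "B"), ((5.5, 5.2), "A"),
]

# Odd K values avoid tied votes between the two classes.
error_by_k = {k: loo_error_rate(data, k) for k in (1, 3, 5, 7, 11)}
best_k = min(error_by_k, key=error_by_k.get)
print(error_by_k, best_k)
```

On this toy data, K=1 is misled by the noisy points, while K=11 uses every remaining point and simply votes for whichever class has more points left; the intermediate values give the lowest error.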

## Calculating Distance

The first step is to calculate the distance between the new point and each training point. There are various methods for calculating this distance; the most commonly used are Euclidean and Manhattan distance (for continuous variables) and Hamming distance (for categorical variables).

1. Euclidean Distance: Euclidean distance is calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (y).

![Screenshot%202024-05-25%20at%201.05.20%E2%80%AFPM.png](attachment:Screenshot%202024-05-25%20at%201.05.20%E2%80%AFPM.png)

2. Manhattan Distance: This is the distance between real-valued vectors, calculated as the sum of their absolute differences.

![Screenshot%202024-05-25%20at%201.06.07%E2%80%AFPM.png](attachment:Screenshot%202024-05-25%20at%201.06.07%E2%80%AFPM.png)

3. Hamming Distance: It is used for categorical variables. If the value (x) and the value (y) are the same, the distance D is 0; otherwise D=1. For points with several categorical features, the total distance is the number of features that differ.

![Screenshot%202024-05-25%20at%201.06.49%E2%80%AFPM.png](attachment:Screenshot%202024-05-25%20at%201.06.49%E2%80%AFPM.png)
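The three metrics can be written directly from their definitions. This is a minimal sketch in plain Python; the function names and sample points are illustrative.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions where the categorical values differ."""
    return sum(a != b for a, b in zip(x, y))

print(euclidean((1, 2), (4, 6)))  # 5.0
print(manhattan((1, 2), (4, 6)))  # 7
print(hamming(("red", "small", "round"), ("red", "large", "round")))  # 1
```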

## Ways to Perform K-NN 

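In scikit-learn, the `KNeighborsClassifier` class exposes the neighbor search strategy through its `algorithm` parameter:

`KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)`

`algorithm`: {`'auto'`, `'ball_tree'`, `'kd_tree'`, `'brute'`}, default=`'auto'`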

Brute Force: Calculate the distance between the test sample and all training samples, sort them, and select the top K.

Efficient Search Methods: Use data structures like KD-Trees or Ball Trees to reduce the number of distance calculations and speed up the search for nearest neighbors.
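As a rough illustration, the sketch below fits the same classifier with each search strategy and checks that they agree. It assumes scikit-learn is installed; the toy data and query points are hypothetical. The choice of `algorithm` affects how neighbors are found, not which neighbors are returned, so the predictions should match.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated groups in 2-D.
X = [[1, 1], [1, 2], [2, 1], [2, 2], [6, 6], [6, 7], [7, 6], [7, 7]]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]
queries = [[1.5, 1.5], [6.5, 6.5]]

predictions = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    # Same model, different neighbor search strategy.
    model = KNeighborsClassifier(n_neighbors=3, algorithm=algorithm)
    model.fit(X, y)
    predictions[algorithm] = [str(label) for label in model.predict(queries)]

print(predictions)
```

On a dataset this small the speed difference is negligible; tree-based search tends to pay off as the number of training samples grows and the number of features stays modest.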

https://medium.com/swlh/k-nearest-neighbor-ca2593d7a3c4