## KNN

K-Nearest Neighbors (KNN) is a simple, yet powerful, supervised machine learning algorithm used for both classification and regression tasks. It operates on the principle that similar instances exist in close proximity to each other.

### KNN Classifier

KNN can be used for both classification and regression problems, although in practice it is more widely used for classification. To evaluate any technique, we generally look at three important aspects:

1. Ease of interpreting output

2. Calculation time

3. Predictive power

Let us take a few examples to place KNN on this scale:

<img src="Model-comparison32221.png" width="450">


The KNN classifier fares well across all three criteria. It is commonly used for its ease of interpretation and low calculation time (it has no explicit training phase, although prediction can become expensive on large datasets).

## How Does the KNN Algorithm Work?
Let’s take a simple case to understand this algorithm. The following is a spread of red circles (RC) and green squares (GS):

<img src="image-20.webp" width="450">

You want to find the class of the blue star (BS). BS can be either RC or GS and nothing else. The “K” in KNN is the number of nearest neighbors whose votes we take. Let’s say K = 3. We then draw a circle centered on BS that is just big enough to enclose exactly three data points on the plane. Refer to the following diagram for more details:

<img src="scenario2.png" width="450">

The three closest points to BS are all RC. Hence, with good confidence, we can say that BS belongs to the class RC. Here the choice is obvious because all three votes from the closest neighbors went to RC. The choice of the parameter K is crucial in this algorithm. Next, we will look at the factors to consider when choosing the best K.
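To make the vote concrete, here is a minimal sketch that turns the labels of the K nearest neighbors into vote shares. The neighbor labels are hypothetical, and `vote_shares` is an illustrative helper name, not part of any library.

```python
from collections import Counter

def vote_shares(neighbor_labels):
    """Return each class's share of the votes among the K nearest neighbors."""
    counts = Counter(neighbor_labels)
    total = len(neighbor_labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels of the K = 3 points closest to the blue star.
unanimous = vote_shares(["RC", "RC", "RC"])
# A less clear-cut neighborhood, for comparison.
split = vote_shares(["RC", "GS", "RC"])

print(unanimous)  # {'RC': 1.0}
print(split)
```

A unanimous vote gives RC a share of 1.0, while a 2-to-1 split still predicts RC but with a weaker majority.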

## Key Concepts of KNN:
1. ***Instance-Based Learning:*** KNN is a type of instance-based learning where the model makes predictions based on the entire training dataset. Unlike other algorithms, it doesn’t explicitly learn a model from the training data.

2. ***Lazy Learning:*** KNN is considered a lazy learner because it doesn't build a model until a prediction is requested. It simply stores the training dataset and performs computations at the time of prediction.

3. ***Distance Metric:*** The core idea behind KNN is to find the 'k' nearest neighbors to a given query point. The distance between instances is typically measured using metrics like Euclidean distance, Manhattan distance, or Minkowski distance.
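The three distance metrics named above can be sketched in a few lines. This is an illustrative implementation using only the standard library; the function names are chosen for this example. Euclidean and Manhattan distance are the special cases of Minkowski distance with order p = 2 and p = 1.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean_distance(a, b):
    """Straight-line distance: Minkowski with p = 2."""
    return minkowski_distance(a, b, 2)

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences: Minkowski with p = 1."""
    return minkowski_distance(a, b, 1)

a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean_distance(a, b))  # 5.0
print(manhattan_distance(a, b))  # 7.0
```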

## How KNN Works:

1. Choose the number of 'K' neighbors: Decide the number of neighbors, 'k', to consider for making the prediction. This is a crucial hyperparameter that can significantly impact the model's performance.

2. Compute Distance: Calculate the distance between the query point and all the points in the training dataset.

3. Identify Nearest Neighbors: Select the 'k' points that are closest to the query point.

4. Predict Output:

   * ***For classification:*** The class label of the query point is determined by the majority class among the 'k' nearest neighbors.
   * ***For regression:*** The predicted value is the average (or weighted average) of the values of the 'k' nearest neighbors.
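The four steps above can be sketched as a brute-force classifier. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_classify` and the toy data are hypothetical. Ties in the vote are broken arbitrarily, in favor of the label encountered first.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Step 2: compute the distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: red circles (RC) and green squares (GS).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["RC", "RC", "RC", "GS", "GS", "GS"]

# Step 1: choose k.
print(knn_classify(points, labels, (2.0, 2.0), k=3))  # RC
print(knn_classify(points, labels, (6.0, 6.5), k=3))  # GS
```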

### Advantages of KNN: 

* Simplicity: KNN is easy to understand and implement.
* No Training Phase: Since it’s a lazy learner, there’s no explicit training phase, making it suitable for scenarios where the training data is frequently updated.
* Versatility: Can be used for both classification and regression problems.

### Disadvantages of KNN:

* Computational Cost: KNN can be computationally expensive, especially with large datasets, because it requires calculating the distance between the query point and all points in the training set.
* Storage Requirements: Since KNN stores all training data, it can require significant storage space.
* Sensitivity to Irrelevant Features: KNN’s performance can degrade if the dataset contains irrelevant or redundant features, and features measured on larger scales dominate the distance calculation. Feature scaling and selection are therefore important pre-processing steps.
* Curse of Dimensionality: The algorithm can struggle in high-dimensional spaces, where the distance between points becomes less meaningful.
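To illustrate why feature scaling matters, the sketch below finds the nearest neighbor of one row before and after standardizing each column. The data (age in years, income in dollars) is made up, and `standardize` and `nearest_index` are hypothetical helper names. Z-score standardization is one common choice of scaling, used here as an assumption.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 so a constant column does not cause division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]

def nearest_index(rows, query_index):
    """Index of the row closest (Euclidean) to rows[query_index], excluding itself."""
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], query))

# Hypothetical (age in years, income in dollars) rows.
rows = [(25, 50_000), (60, 50_500), (27, 52_000), (45, 90_000)]

print(nearest_index(rows, 0))               # 1: income differences dominate
print(nearest_index(standardize(rows), 0))  # 2: age now counts as well
```

On the raw data the 25-year-old is matched with the 60-year-old because their incomes differ by only 500; after scaling, the 27-year-old with a similar income becomes the nearest neighbor.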

### Applications of KNN:
* Recommendation Systems: Used in collaborative filtering to recommend products based on user similarity.
* Image Recognition: KNN can classify images based on similarity to other images.
* Medical Diagnosis: Assists in diagnosing diseases by finding patients with similar symptoms and medical history.

---------------------------------------------

## Flow of KNN Algorithm 

<img src="knn-algorithm.png" width="300">

<img src="euclidean.png" width="470">

<img src="Flowchart-of-KNN-Method.ppm" width="363">

<img src="manhattan_distance.jpg" width="450">

<img src="manhattan_euclidean-distance.png" width="450">

---------------------------------------------
### KNN Regressor
---------------------------------------------

The prediction for a new data point is the average of the target values of its K nearest data points.
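A minimal sketch of this averaging, assuming Euclidean distance; the function name `knn_regress` and the toy data are hypothetical. The optional `weighted` flag uses inverse-distance weights, which is one common way to form a weighted average and is an assumption of this example.

```python
import math

def knn_regress(train_points, train_targets, query, k, weighted=False):
    """Predict a numeric value as the (optionally weighted) mean of the k nearest targets."""
    nearest = sorted(
        ((math.dist(point, query), target)
         for point, target in zip(train_points, train_targets)),
        key=lambda pair: pair[0],
    )[:k]
    if not weighted:
        return sum(target for _, target in nearest) / len(nearest)
    # Inverse-distance weighting: an exact match returns its own target.
    for distance, target in nearest:
        if distance == 0:
            return target
    weights = [1 / distance for distance, _ in nearest]
    return sum(w * target for w, (_, target) in zip(weights, nearest)) / sum(weights)

# Hypothetical one-feature data set.
points = [(1.0,), (2.0,), (3.0,), (10.0,)]
targets = [10.0, 20.0, 30.0, 100.0]

print(knn_regress(points, targets, (2.0,), k=3))                   # 20.0
print(knn_regress(points, targets, (2.25,), k=2, weighted=True))   # closer to 20 than to 30
```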

### How to Select the Optimal K Value?

* There is no pre-defined statistical method for finding the most favorable value of K.
* Start with an initial K value (for example, a randomly chosen one) and compute the error.
* Choosing a small value of K leads to unstable decision boundaries.
* A larger K value smooths the decision boundaries, which often helps classification, although a K that is too large can blur the distinction between classes.
* Plot the error rate against K over a defined range, then choose the K value with the minimum error rate.
* A very small K value is usually unsuitable for classification because the prediction becomes sensitive to noise.
* ***A common rule of thumb is to start with K near the square root of N, where N is the total number of samples.***
* Use an error plot or accuracy plot to find the most favorable K value.
* KNN handles problems with multiple classes well, but it is sensitive to outliers.
* KNN is used broadly in the area of pattern recognition and analytical evaluation.
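The error-versus-K search can be sketched as follows. This example assumes leave-one-out evaluation (each point is predicted from all the others) as the way to estimate the error rate, which is one option among several. The data set is made up and includes one deliberately mislabeled point inside the RC cluster, and the function names are hypothetical. The error rates are printed rather than plotted to keep the sketch dependency-free.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k (point, label) pairs nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_error_rate(data, k):
    """Leave-one-out error: predict each point from all the other points."""
    mistakes = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, point, k) != label:
            mistakes += 1
    return mistakes / len(data)

# Hypothetical data: two clusters plus one noisy label at (1.5, 1.5).
data = [
    ((1, 1), "RC"), ((1, 2), "RC"), ((2, 1), "RC"), ((2, 2), "RC"),
    ((6, 6), "GS"), ((6, 7), "GS"), ((7, 6), "GS"), ((7, 7), "GS"),
    ((1.5, 1.5), "GS"),
]

candidate_ks = [1, 3, 5, 7]
errors = {k: loo_error_rate(data, k) for k in candidate_ks}
best_k = min(candidate_ks, key=lambda k: errors[k])

for k in candidate_ks:
    print(k, round(errors[k], 3))
print("best K:", best_k)  # best K: 3
```

On this toy data K = 1 follows the noisy point and K = 7 lets the other cluster outvote the true neighbors, so the error rate is lowest for the values in between.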

### Variants of KNN

***KD-Tree KNN***

KD-Tree is a data structure that partitions the space to organize points in a k-dimensional space. This variant of KNN uses a KD-Tree to efficiently query the nearest neighbors, significantly reducing the computational cost for large datasets.

A KD-Tree (k-dimensional tree) is a data structure that recursively subdivides the space into regions associated with specific data points. The primary objective of a KD-Tree is to facilitate efficient multidimensional search operations, particularly nearest-neighbor searches. The algorithm constructs a binary tree in which each node represents a region in the multidimensional space, and the associated hyperplane is aligned with one of the coordinate axes. At each level of the tree, the algorithm selects a dimension to split the data, creating two child nodes. This process continues recursively until a termination condition is met, such as a predefined depth or a threshold number of points per node.
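The construction and search described above can be sketched in pure Python. This is a simplified illustration, not a production implementation: it assumes the split dimension cycles with depth, splits at the median, stores one point per node, and recurses until no points remain rather than stopping at a depth or leaf-size threshold. The names `Node`, `build_kdtree`, and `nearest` are hypothetical.

```python
import math

class Node:
    """One KD-Tree node: a point, its split axis, and two subtrees."""
    def __init__(self, point, axis, left=None, right=None):
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right

def build_kdtree(points, depth=0):
    """Recursively split the points at the median of the current axis."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(
        points[mid],
        axis,
        build_kdtree(points[:mid], depth + 1),
        build_kdtree(points[mid + 1:], depth + 1),
    )

def nearest(node, query, best=None):
    """Return the stored point closest to query (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(node.point, query) < math.dist(best, query):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best match.
    if abs(diff) < math.dist(best, query):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))  # (8, 1)
```

The pruning test in `nearest` is what saves work: whole subtrees are skipped when the splitting plane is farther from the query than the best point found so far.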

### Reference: How a KD-Tree Works
https://www.analyticsvidhya.com/blog/2017/11/information-retrieval-using-kdtree/ 

***Ball Tree KNN***

Similar to KD-Tree, Ball Tree is another data structure for organizing points in a metric space. It is often more efficient than a KD-Tree for high-dimensional data. Ball Tree KNN uses this structure to quickly find nearest neighbors.