## Definition:

The k-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm that can be used for both classification and regression problems.

## KNN is a lazy learner:

Source: [Link to source](https://sebastianraschka.com/faq/docs/lazy-knn.html#:~:text=K%2DNN%20is%20a%20lazy,(parameters)%20during%20training%20time.&text=A%20lazy%20learner%20does%20not%20have%20a%20training%20phase.)

K-NN is a lazy learner because it <b>doesn’t learn a discriminative function from the training data but “memorizes” the training dataset instead.</b>

For example, the logistic regression algorithm learns its model weights (parameters) during training time. In contrast, there is no training time in K-NN. Although this may sound very convenient, this property doesn’t come without a cost: the “prediction” step in K-NN is relatively expensive. Each time we want to make a prediction, K-NN searches for the nearest neighbor(s) in the entire training set. (Note that data structures such as ball trees and k-d trees can speed this search up.)

To summarize: An eager learner has a model fitting or training step. A lazy learner does not have a training phase.

## Working of KNN:

Source: [Link to Source](https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_knn_algorithm_finding_nearest_neighbors.htm)

The k-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict values for new data points: a new data point is assigned a value based on how closely it matches the points in the training set. Its working can be broken down into the following steps:

1. Load the data. KNN needs a labelled training set as well as the test data to be predicted.

2. Choose the value of K, i.e. the number of nearest data points to consider. K can be any positive integer.

3. For each point in the test data, do the following:

    3.1 Calculate the distance between the test point and each row of the training data using a distance metric such as Euclidean, Manhattan or Hamming distance. Euclidean is the most commonly used.

    3.2 Sort the training rows by distance in ascending order.

    3.3 Choose the top K rows from the sorted array.

    3.4 Assign the test point the most frequent class among these rows (for regression, the average of their target values is commonly used instead).
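
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names `euclidean` and `knn_predict` and the toy data are made up for this example, and it uses a brute-force search over every training row.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest training rows."""
    # Step 3.1: distance from the query to every training row
    distances = [(euclidean(row, query), label) for row, label in zip(train_X, train_y)]
    # Step 3.2: sort by distance, ascending
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: keep the labels of the k closest rows
    nearest_labels = [label for _, label in distances[:k]]
    # Step 3.4: the most frequent class among them wins
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (invented for illustration): two small clusters
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_X, train_y, (2.0, 2.0), k=3))  # A
print(knn_predict(train_X, train_y, (6.0, 6.5), k=3))  # B
```

Note that there is no separate training function: all the work happens inside `knn_predict`, which is what makes the algorithm "lazy".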
    
<img src="https://machinelearningknowledge.ai/wp-content/uploads/2018/08/KNN-Classification.gif">

## Distance metrics:
<img src="https://miro.medium.com/max/945/1*Lh6R4QArolCRdzF7jjSzDw.jpeg">

Euclidean distance is the most widely used measure for calculating the distance between test samples and training data points.
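
As a rough sketch, the three metrics mentioned in the working steps (Euclidean, Manhattan and Hamming) can be written as small Python functions. The function names and sample inputs are chosen for this example only.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


p, q = (1, 2), (4, 6)
print(euclidean(p, q))                # 5.0
print(manhattan(p, q))                # 7
print(hamming("karolin", "kathrin"))  # 3
```

Hamming distance is typically the one to reach for with categorical or binary features, where subtracting values is not meaningful.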

## Assumptions of KNN:

Source: https://www.listendata.com/2017/12/k-nearest-neighbor-step-by-step-tutorial.html

KNN is non-parametric. Non-parametric means that the method makes no assumptions about the underlying data distribution. Non-parametric methods do not have a fixed number of parameters in the model. In KNN, the number of model parameters effectively grows with the training data set: you can imagine each training case as a "parameter" in the model.


## Data Cleaning:

1. Standardization: k-NN performs much better when all features are on the same scale, because distances are otherwise dominated by the features with the largest ranges.
2. Dimensionality: k-NN works well with a small number of input variables, but struggles when the number of inputs is very large (the curse of dimensionality).
3. Outliers: a low K value is sensitive to outliers, while a higher K value is more resilient because more voters decide each prediction.
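
To illustrate the standardization point, here is a small sketch showing how the nearest neighbor of a point can change once features are put on the same scale. The data (age in years, income in dollars) and the helpers `standardize` and `nearest_index` are hypothetical, and z-scoring is only one of several possible scaling choices.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by the column std.

    Assumes no column is constant (a zero std would cause a division by zero).
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def nearest_index(rows, query_index):
    """Index of the row closest (Euclidean) to rows[query_index], excluding itself."""
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], query))


# Hypothetical rows: (age in years, income in dollars)
people = [(25, 50_000), (60, 50_500), (27, 53_000), (45, 80_000)]

print(nearest_index(people, 0))               # 1
print(nearest_index(standardize(people), 0))  # 2
```

On the raw data the 60-year-old is the "nearest" neighbor of the 25-year-old simply because income differences are numerically much larger than age differences; after standardization the 27-year-old is closer.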

## Cost Function:

Source: https://stats.stackexchange.com/questions/420416/does-knn-have-a-loss-function/420425#420425

k-NN <b>does not</b> have a loss function that can be minimized during training. In fact, this algorithm is not trained at all. The only "training" that happens for k-NN, is memorising the data (creating a local copy), so that during prediction you can do a search and majority vote. Technically, no function is fitted to the data, and so, no optimization is done (it cannot be trained using gradient descent).

## How to choose a K value?

Source: https://towardsdatascience.com/how-to-find-the-optimal-value-of-k-in-knn-35d936e554eb

<img src="https://miro.medium.com/max/547/0*FakkqTKdMPDb3gof.jpg">

- As the image above shows, with K=3 we predict that the test input belongs to class B, while with K=7 we predict that it belongs to class A.
- This shows that the choice of K has a strong effect on KNN's predictions.
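
The same effect can be reproduced with a handful of made-up points. In this sketch the coordinates, labels and the `vote` helper are invented for illustration; they are arranged so that the two closest points to the query are class B but class A is more numerous slightly further out.

```python
import math
from collections import Counter

# Made-up labelled points around a query at the origin
points = [
    ((1.0, 0.0), "B"), ((0.0, 1.2), "B"), ((-1.5, 0.0), "A"),
    ((0.0, -2.0), "A"), ((2.2, 0.0), "A"), ((0.0, 2.5), "B"), ((-3.0, 0.0), "A"),
]
query = (0.0, 0.0)


def vote(k):
    """Majority class among the k points nearest to the query."""
    nearest = sorted(points, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


print(vote(3))  # B
print(vote(7))  # A
```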

### Then how to select the optimal K value?

- There is no pre-defined statistical method for finding the most favorable value of K.
- Start with an arbitrary K value and evaluate the results.
- A small K value leads to unstable decision boundaries that are sensitive to noise.
- A larger K value smooths the decision boundaries, which often helps classification, but a K that is too large can blur the distinction between classes.
- Plot the error rate (measured on validation data) against K for values in a defined range, then choose the K with the minimum error rate.
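
A minimal sketch of the error-rate search might look like the following. Everything here is an assumption made for illustration: the synthetic two-cluster data, the train/validation split, the range of K values tried, and the helper names. It prints a small table instead of drawing a plot so that it needs no plotting library.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k):
    """train is a list of (features, label) pairs; majority vote over the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def error_rate(train, validation, k):
    """Fraction of validation points that are misclassified for a given k."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in validation)
    return wrong / len(validation)


def make_blob(rng, center, label, n):
    """Synthetic 2-D Gaussian cluster, used only to have some data to evaluate on."""
    return [((rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)), label) for _ in range(n)]


rng = random.Random(0)  # fixed seed so repeated runs use the same data
train = make_blob(rng, (0, 0), "A", 40) + make_blob(rng, (2, 2), "B", 40)
validation = make_blob(rng, (0, 0), "A", 20) + make_blob(rng, (2, 2), "B", 20)

# Odd values of k avoid tied votes in a two-class problem
candidate_ks = range(1, 16, 2)
errors = {k: error_rate(train, validation, k) for k in candidate_ks}
best_k = min(errors, key=errors.get)  # first k with the lowest validation error

for k, err in errors.items():
    print(f"k={k:2d}  error={err:.3f}")
print("best k:", best_k)
```

With a single validation split the chosen K can be noisy; cross-validation is a common way to make the estimate more stable.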
          
## Pros vs Cons:

### Pros
1. It is a very simple algorithm to understand and interpret.
2. It is very useful for nonlinear data because it makes no assumptions about the data.
3. It is versatile: it can be used for classification as well as regression.
4. It has relatively high accuracy, although there are much better supervised learning models than KNN.

### Cons
1. It is computationally expensive at prediction time because every prediction is computed against all the stored training data.
2. It requires more memory than most other supervised learning algorithms, since the whole training set must be kept.
3. Prediction is slow when the number of training points N is large.
4. It is very sensitive to the scale of the data and to irrelevant features.