## What is the k-NN Algorithm in Machine Learning?

k-NN (k-nearest neighbors) is a non-parametric supervised learning technique that assigns a data point to a category using a labeled training set. In simple terms, it stores all training cases and classifies new cases based on their similarity to those stored cases.

>Predictions are made for a new instance (x) by searching through the entire training set for the K most similar cases (neighbors) and summarizing the output variable for those K cases. In classification this is the mode (or most common) class value.

## How to calculate K Nearest Neighbor (KNN)?

Suppose we have the height, weight and T-shirt size of some customers, and we need to predict the T-shirt size of a new customer given only their height and weight. The data, including height, weight and T-shirt size, is shown below:

![image-6.png](attachment:image-6.png)

### Step 1 : Calculate Similarity based on distance function

There are many distance functions, but Euclidean distance is the most commonly used. It is mainly used when the data is continuous. Manhattan distance is also very common for continuous variables.

![image-8.png](attachment:image-8.png)

The idea behind using a distance measure is to compute the distance between the new sample and each training case (a smaller distance means greater similarity), and then find the k customers closest to the new customer in terms of height and weight. The new customer, 'Monica', has height 161cm and weight 61kg. The Euclidean distance between the first observation and the new observation (Monica) is as follows:

`=SQRT((161-158)^2+(61-58)^2)`

which is approximately 4.24. Similarly, we calculate the distance from every training case to the new case and rank the cases by distance. The smallest distance is ranked 1 and is considered the nearest neighbor.
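The distance calculation above can be sketched in a few lines of Python. This is a minimal illustration using only the two points given in the text. The function names `euclidean` and `manhattan` are chosen here for illustration and are not from the original article.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

monica = (161, 61)          # (height in cm, weight in kg) from the text
first_customer = (158, 58)  # first observation from the text

d_euclid = euclidean(monica, first_customer)
d_manhattan = manhattan(monica, first_customer)

print(round(d_euclid, 2))  # 4.24
print(d_manhattan)         # 6
```

Repeating this call for every training row and sorting the results gives the distance ranking described above.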

### Step 2 : Find K-Nearest Neighbors

Let k be 5. The algorithm then searches for the 5 customers closest to Monica, i.e. most similar to Monica in terms of attributes, and sees which categories those 5 customers are in. If 4 of them have 'Medium T-shirt size' and 1 has 'Large T-shirt size', then the best guess for Monica is 'Medium T-shirt size'. See the calculation shown in the snapshot below:

![image-9.png](attachment:image-9.png)

In the graph below, the binary dependent variable (T-shirt size) is displayed in blue and orange: 'Medium T-shirt size' is in blue and 'Large T-shirt size' is in orange. The new customer is shown as a yellow circle. Four highlighted blue data points and one highlighted orange data point are closest to the yellow circle, so the prediction for the new case is the majority class, blue, i.e. Medium T-shirt size.

![image-10.png](attachment:image-10.png)
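The ranking and majority vote from Steps 1 and 2 can be sketched as follows. The training rows here are hypothetical values made up for illustration: only the first row's height and weight (158, 58) and Monica's measurements come from the text, and the actual table is in the image above. The helper name `knn_predict` is also an assumption of this sketch.

```python
import math
from collections import Counter

# Hypothetical (height_cm, weight_kg, size) rows, NOT the article's table.
training = [
    (158, 58, "M"),
    (160, 59, "M"),
    (160, 60, "M"),
    (163, 60, "M"),
    (163, 61, "M"),
    (165, 61, "L"),
    (165, 62, "L"),
    (168, 62, "L"),
    (170, 64, "L"),
]

def knn_predict(rows, query, k):
    """Return the most common label among the k rows closest to query."""
    # Rank all training rows by Euclidean distance to the query point.
    ranked = sorted(rows, key=lambda r: math.dist(r[:2], query))
    # Count the labels of the k nearest rows and take the mode.
    votes = Counter(label for _, _, label in ranked[:k])
    return votes.most_common(1)[0][0]

prediction = knn_predict(training, (161, 61), k=5)
print(prediction)  # M
```

With this made-up data, four of the five nearest rows are "M" and one is "L", which mirrors the 4-to-1 vote described in the text.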

## Assumptions of k-NN Algorithm

### 1. Standardization

When the independent variables in the training data are measured in different units, it is important to standardize them before calculating distances. For example, if one variable is height in cm and the other is weight in kg, the variable with the larger numeric spread (here, height) will have more influence on the distance calculation. To make them comparable, we need to standardize them, which can be done by any of the following methods:

![image-11.png](attachment:image-11.png)

After standardization, the 5th closest value changed, because height was dominating the distance before standardization. Hence, it is important to standardize predictors before running the K-nearest neighbor algorithm.

![image-12.png](attachment:image-12.png)
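As a rough sketch of what standardization does, the snippet below applies two commonly used rescaling methods, z-score and min-max scaling. Whether these are exactly the methods shown in the figure above is an assumption, and the height values are hypothetical, chosen only for illustration.

```python
import statistics

# Hypothetical heights in cm, for illustration only.
heights = [158, 160, 163, 165, 170]

def z_score(values):
    """Rescale to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def min_max(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

height_z = z_score(heights)
height_mm = min_max(heights)

print([round(v, 2) for v in height_z])
print([round(v, 2) for v in height_mm])
```

Applying the same transformation to every predictor puts height and weight on a comparable scale, so neither dominates the distance purely because of its units.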

### 2. Outlier

A low k-value is sensitive to outliers, whereas a higher k-value is more resilient to outliers because it considers more voters when deciding the prediction.
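A small sketch can make this concrete. The points below are hypothetical: a single "B" point sits inside a cluster of "A" points and acts as the outlier. The helper `knn_predict` is an illustrative name, not part of the original article.

```python
import math
from collections import Counter

# Hypothetical 2-D points with class labels.
points = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.9), "A"),
    ((0.9, 1.1), "A"),
    ((1.1, 1.2), "A"),
    ((1.05, 1.0), "B"),  # outlier: a lone "B" inside the "A" cluster
    ((5.0, 5.0), "B"),
    ((5.1, 4.9), "B"),
    ((4.9, 5.2), "B"),
]

def knn_predict(data, query, k):
    """Majority label among the k points closest to query."""
    ranked = sorted(data, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

query = (1.04, 1.0)
pred_k1 = knn_predict(points, query, k=1)
pred_k5 = knn_predict(points, query, k=5)

print(pred_k1)  # B  (the single nearest point is the outlier)
print(pred_k5)  # A  (four "A" neighbors outvote the outlier)
```

With k=1 the prediction follows the outlier, while with k=5 the surrounding "A" points outvote it.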

## Why is KNN non-parametric?

Non-parametric means not making any assumptions about the underlying data distribution. Non-parametric methods do not have a fixed number of parameters in the model. Similarly, in KNN the number of model parameters actually grows with the training data set: you can imagine each training case as a "parameter" in the model.

## Difference between k-NN and k-Means Algorithms

**Many people confuse these two techniques, k-Means and K-nearest neighbor. The differences between them are listed below:**

- k-Means is an unsupervised learning technique (no dependent variable), whereas KNN is a supervised learning algorithm (a dependent variable exists).
- k-Means is a clustering technique that tries to split data points into K clusters such that the points in each cluster tend to be near each other, whereas K-nearest neighbor determines the classification of a point by combining the classifications of its K nearest points.

## Can KNN be used for regression?

Yes, k-Nearest Neighbor can be used for regression. In other words, the K-nearest neighbor algorithm can be applied when the dependent variable is continuous. In this case, the predicted value is the average of the values of its k nearest neighbors.
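A minimal sketch of k-NN regression is shown below, assuming a single predictor (height) and a continuous target (weight). The data values and the helper name `knn_regress` are hypothetical and serve only to illustrate the averaging step.

```python
import math

# Hypothetical ((height_cm,), weight_kg) pairs, for illustration only.
rows = [
    ((158,), 58.0),
    ((160,), 59.0),
    ((163,), 61.0),
    ((165,), 62.0),
    ((170,), 64.0),
]

def knn_regress(data, query, k):
    """Predict the mean target value of the k rows closest to query."""
    ranked = sorted(data, key=lambda r: math.dist(r[0], query))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / k

predicted_weight = knn_regress(rows, (161,), k=3)
print(round(predicted_weight, 2))  # 59.33
```

Here the three nearest heights to 161 are 160, 163 and 158, so the prediction is the mean of their weights, (59 + 61 + 58) / 3.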

## Pros and Cons of KNN

### 1. Pros

- Easy to understand
- No assumptions about data
- Can be applied to both classification and regression
- Works easily on multi-class problems

### 2. Cons

- Memory intensive / computationally expensive
- Sensitive to the scale of the data
- Does not work well on a rare-event (skewed) target variable
- Struggles with a high number of independent variables