<h1 align="center">Nearest Neighbors Methods - The KNN Algorithm</h1>

### KNN Algorithm
- Basic Idea
- Formal Definition
- KNN Decision Boundary
- A supervised, non-parametric algorithm
- Used for classification and regression
- An Instance-based learning algorithm
- A lazy learning algorithm
- Characteristics of kNN
- Practical issues

### Similarity/Distance Metrics 
- Constraints/Properties on Distance Metrics
- Euclidean Distance
- Manhattan Distance
- Minkowski distance
- Chebyshev Distance
- Norm of a vector and Its Properties
- Cosine Distance 
- Practical Issues in computing distance

### The KNN algorithm and Implementation
- KNN regression and classification with examples
- Space and Time complexity
- Choosing the value of K - The Theory.
- Tuning the hyperparameter K - The method
- KNN: The good, the bad and the ugly
    
### Algorithm Convergence
- Error Convergence
- Learning Problem
- Bayes Optimal Classifier
- 1-NN Error as n → ∞


### KNN Enhancements
- Parzen Windows and Kernels (Fast KNN) 
- Performance of KNN Algorithm
- K-D Trees
- Locality-sensitive Hashing
- Inverted Lists

### The Curse of Dimensionality
- KNN Assumption
- Demonstration
- How does KNN work at all?

### Dimensionality Reduction (Optional)
- Why? and Benefits.
- Difference between Feature Selection and Feature Extraction
- Feature Selection methods
- Feature Extraction
- Principal Component Analysis
    - Geometric Intuition
    - Mathematical Formulation
    - How do we choose K?
    - Practical Consideration and Limitations

### Model Evaluation Techniques
- Classification Accuracy (0/1 Loss)
- TP, TN, FP and FN
- Confusion Matrix
- Sensitivity, Specificity, Precision Trade-offs, ROC, AUC
- F1-Score and Matthews Correlation Coefficient
- Multi-class Classification, Evaluation, Micro, Macro Averaging

**KNN: Python Implementation**    
**KNN: Scikit-learn implementation**    
**Interview Questions.**   

# KNN Algorithm

### Basic Idea

- K-Nearest Neighbours (KNN) is one of the simplest machine learning algorithms and belongs to the supervised learning family.
- It assumes that similar data points lie close together: a new data point is assigned to the category whose stored examples it most resembles. The 'k' in KNN is a parameter giving the number of nearest neighbours that take part in the majority vote.
- It stores all the available data and classifies a new data point based on similarity. When new data appears, it can be assigned to the best-suited category using the KNN algorithm.
- **Example:** Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know which of the two it is. We can use the KNN algorithm for this identification, as it works on a similarity measure. Our KNN model compares the features of the new image with those of the cat and dog images and places it in the category whose images are most similar.

<img src="images/knn14.PNG" height=600px width=600px align='center'>

### Formal Definition

"The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning algorithm, which uses proximity to make classifications or predictions about the grouping of an individual data point." **(IBM)**

**Formal (and borderline incomprehensible) definition of k-NN:**

Test point: x

Define the set of the k nearest neighbors of x as $S_x$. Formally, $S_x$ is defined as $S_x \subseteq D$ such that $|S_x| = k$ and, for every $(x', y') \in D \setminus S_x$,

$$\text{dist}(x, x') \ge \max_{(x'', y'') \in S_x} \text{dist}(x, x'')$$

(i.e. every point in $D$ but not in $S_x$ is at least as far away from $x$ as the furthest point in $S_x$). We can then define the classifier $h(\cdot)$ as a function returning the most common label in $S_x$:

$$h(x) = \text{mode}(\{y'' : (x'', y'') \in S_x\})$$

where mode(⋅) means to select the label of the highest occurrence.
(Hint: In case of a draw, a good solution is to return the result of k-NN with smaller k)
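
The tie-breaking hint can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `knn_vote_with_tie_break`, the `(point, label)` data layout and the toy data are hypothetical, and Euclidean distance is assumed as `dist`.

```python
import math
from collections import Counter

def knn_vote_with_tie_break(train, query, k):
    """train: non-empty list of (point, label) pairs. Returns the majority label
    among the k nearest points, retrying with a smaller k whenever the vote is tied."""
    if k < 1 or not train:
        raise ValueError("need k >= 1 and at least one training point")
    # Sort the whole training set by distance to the query (closest first)
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    while True:
        counts = Counter(label for _, label in ranked[:k]).most_common()
        # Unique winner: only one label present, or the top count beats the runner-up
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]
        k -= 1  # draw: fall back to k-NN with a smaller k

# Hypothetical toy data on a line, two labels
train = [((0.0, 0.0), "a"), ((1.0, 0.0), "b"), ((3.0, 0.0), "a"), ((4.0, 0.0), "b")]
label_left = knn_vote_with_tie_break(train, (0.4, 0.0), k=4)
label_right = knn_vote_with_tie_break(train, (3.6, 0.0), k=2)
print(label_left, label_right)
```

With k = 1 there is always a unique winner, so the loop is guaranteed to stop.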

<img src="images/knn1.PNG" height=600px width=600px align='center'>

### How does basic K-NN work?
The working of K-NN can be explained with the following algorithm:

**Step-1:** Select the number K of neighbors.

**Step-2:** Calculate the Euclidean distance from the new data point to every point in the training data.

**Step-3:** Take the K nearest neighbors as per the calculated Euclidean distance.

**Step-4:** Among these K neighbors, count the number of data points in each category.

**Step-5:** Assign the new data point to the category with the largest number of neighbors.

**Step-6:** The new data point is now classified.
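
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `knn_classify` and the toy points are hypothetical, and ties in the vote are not handled specially.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4-5: count the labels among the neighbours and return the most common one
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
prediction = knn_classify(points, labels, query=(2, 2), k=3)
print(prediction)
```

Note that there is no separate training step: all the work happens when a query arrives, which is why KNN is called a lazy learner.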

### KNN Decision Boundary

A decision boundary or decision surface is a hypersurface that partitions the underlying vector space into two or more sets, one for each class. The classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class.

- We can draw a decision boundary for any classification algorithm.
- A decision boundary can be linear (if the decision surface is a hyperplane, the classification problem is linear and the classes are linearly separable, as with a linear SVM) or non-linear (as with a decision tree).

**Value of k**
- When K is small, we are restraining the region of a given prediction and forcing our classifier to be “more blind” to the overall distribution. A small value for K provides the most flexible fit, which will have low bias but high variance. Graphically, our decision boundary will be more jagged. 
- On the other hand, a higher K averages more voters in each prediction and hence is more resilient to outliers. Larger values of K will have smoother decision boundaries which means lower variance but increased bias.
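
A small experiment can make this trade-off concrete. The sketch below uses hypothetical one-dimensional toy data with a single mislabelled point at 2.5; the helper name `knn_1d` is an assumption for illustration only.

```python
from collections import Counter

def knn_1d(train, query, k):
    """train: list of (value, label). Majority vote among the k closest values."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Class "A" lives around 0..3, class "B" around 10..13,
# except for one noisy "B" point at 2.5 inside the "A" region.
train = [(0.0, "A"), (1.0, "A"), (2.0, "A"), (3.0, "A"), (2.5, "B"),
         (10.0, "B"), (11.0, "B"), (12.0, "B"), (13.0, "B")]

small_k = knn_1d(train, 2.6, k=1)  # follows the single noisy neighbour
large_k = knn_1d(train, 2.6, k=5)  # averages over more voters
print(small_k, large_k)
```

With k = 1 the prediction near 2.6 follows the outlier, while k = 5 outvotes it, which is the jagged-versus-smooth behaviour described above.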

#### k=1
<img src="images/knn15.PNG" height=400px width=400px align='center'>

#### k=15

<img src="images/knn16.PNG" height=350px width=350px align='center'>

### Supervised and Non-parametric Algorithm

Let’s first start by establishing some definitions and notations. We will use **x** to denote a feature (aka. predictor, attribute) and **y** to denote the target (aka. label, class) we are trying to predict.

KNN falls in the supervised learning family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations **(x,y)** and would like to capture the relationship between x and y. More formally, our goal is to learn a function **h:X→Y** so that given an unseen observation x, **h(x)** can confidently predict the corresponding output y.

K-NN is a **non-parametric** algorithm: it makes no explicit assumptions about the functional form of h, avoiding the dangers of mismodeling the underlying distribution of the data. For example, suppose our data is highly non-Gaussian but the learning model we choose assumes a Gaussian form. In that case, our algorithm would make extremely poor predictions.


### Classification and Regression

The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems. During the training phase, KNN just stores the dataset; when it receives new data, it assigns that data to the category it is most similar to.

Regression problems use a similar concept to classification problems, but in this case the average of the k nearest neighbors' target values is taken as the prediction. The main distinction is that classification is used for discrete values, whereas regression is used for continuous ones.

**Example:** 

- **Classification:** KNN can be used in a banking system to predict whether an individual is fit for loan approval: does that individual have characteristics similar to those of known defaulters?

- **Regression:** KNN can be used to estimate an individual’s credit rating by comparing them with people who have similar traits.
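
KNN regression, where the prediction is the average of the k nearest targets, can be sketched like this. The function name `knn_regress` and the toy numbers are hypothetical, and plain (unweighted) averaging with Euclidean distance is assumed.

```python
import math

def knn_regress(train_points, train_targets, query, k):
    """Predict a continuous value as the mean target of the k nearest training points."""
    ranked = sorted(
        zip(train_points, train_targets),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_targets = [target for _, target in ranked[:k]]
    return sum(nearest_targets) / len(nearest_targets)

# Hypothetical toy data: one feature, one continuous target
points = [(1,), (2,), (3,), (10,)]
targets = [10.0, 20.0, 30.0, 100.0]
estimate = knn_regress(points, targets, query=(2,), k=3)
print(estimate)  # mean of 20.0, 10.0 and 30.0 -> 20.0
```

The far-away point at 10 does not influence the estimate because it is not among the three nearest neighbours.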

# Distance Metrics

Distance metrics are a key part of several machine learning algorithms. These distance metrics are used in both supervised and unsupervised learning, generally to calculate the similarity between data points.
An effective distance metric improves the performance of our machine learning model, whether that’s for classification tasks or clustering.

For the algorithm to work best on a particular dataset, we need to choose the most appropriate distance metric. There are many distance metrics available, but we will only cover a few widely used ones. Euclidean distance is the most popular of them; it is also the default in scikit-learn's KNN classifier (Minkowski distance with p = 2).

Let’s start with the most commonly used distance metric – Euclidean Distance.

### Euclidean Distance

Euclidean Distance represents the shortest (straight-line) distance between two points.

Many machine learning algorithms, including K-Means, use this distance metric to measure the similarity between observations. Let’s say we have two points as shown below:

<img src="images/knn4.PNG" height=500px width=500px align='center'>

So, the Euclidean Distance between these two points A and B will be:

<img src="images/knn5.PNG" height=500px width=500px align='center'>

Here’s the formula for Euclidean Distance:

<img src="images/knn7.PNG" height=300px width=300px align='center'>

We use this formula when we are dealing with 2 dimensions. We can generalize this for an n-dimensional space as:

<img src="images/knn8.PNG" height=300px width=300px align='center'>

Where,
- n = number of dimensions
- pi, qi = the i-th coordinates of the two data points p and q

### Example

Consider two attributes of a chemical: one is its acid durability and the other is its strength. On the basis of these two features we classify a chemical as good or bad.

In [4]:
import pandas as pd
import numpy as np

In [31]:
data={'Acid Durability':[7,7,3,1],'Strenght':[7,4,4,4], 'Output':['Bad','Bad','Good','Good']}
df=pd.DataFrame(data)
df

|   | Acid Durability | Strenght | Output |
|---|---|---|---|
| 0 | 7 | 7 | Bad |
| 1 | 7 | 4 | Bad |
| 2 | 3 | 4 | Good |
| 3 | 1 | 4 | Good |


In [32]:
x1 = df['Acid Durability'].values
x1

array([7, 7, 3, 1], dtype=int64)

In [33]:
x2 = df['Strenght'].values
x2

array([7, 4, 4, 4], dtype=int64)

Let's set k=3, take a new point (3,7), and determine its output using the Euclidean distance formula.

In [34]:
def computeED1(x1,x2):
    # Euclidean distance from each training point (x1, x2) to the new point (3, 7)
    return np.sqrt((x1-3)**2+(x2-7)**2)

In [35]:
result = computeED1(x1,x2)
result

array([4.        , 5.        , 3.        , 3.60555128])

In [36]:
result.sort()
result

array([3.        , 3.60555128, 4.        , 5.        ])

In [37]:
k=3
print("The", k, "smallest distances:")
print(result[:k])

The 3 smallest distances:
[3.         3.60555128 4.        ]


In [39]:
df["Euclidean Distance"] = computeED1(x1,x2)
df

|   | Acid Durability | Strenght | Output | Euclidean Distance |
|---|---|---|---|---|
| 0 | 7 | 7 | Bad | 4.0 |
| 1 | 7 | 4 | Bad | 5.0 |
| 2 | 3 | 4 | Good | 3.0 |
| 3 | 1 | 4 | Good | 3.605551 |


In [40]:
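# Rank 1 = closest to the new point; computed instead of typed in by hand
df["Rank ED"] = df["Euclidean Distance"].rank().astype(int)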
df

|   | Acid Durability | Strenght | Output | Euclidean Distance | Rank ED |
|---|---|---|---|---|---|
| 0 | 7 | 7 | Bad | 4.0 | 3 |
| 1 | 7 | 4 | Bad | 5.0 | 4 |
| 2 | 3 | 4 | Good | 3.0 | 1 |
| 3 | 1 | 4 | Good | 3.605551 | 2 |


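With k=3, the three nearest neighbours are two "Good" points (distances 3.0 and 3.61) and one "Bad" point (distance 4.0). The majority vote is "Good", so the new point (3,7) is classified as a Good chemical.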

### Manhattan Distance

Manhattan Distance is the sum of absolute differences between points across all the dimensions. Equivalently, the distance between two points is the sum of the absolute differences of their Cartesian coordinates.

This distance is also known as taxicab distance or city block distance because of the way it is calculated: it measures the path a taxi would drive along a grid of streets. It is often preferred over Euclidean distance when the data is high-dimensional.

Again consider the above points. We can represent Manhattan Distance as:

<img src="images/knn6.PNG" height=500px width=500px align='center'>

Since the above representation is 2 dimensional, to calculate Manhattan Distance, we will take the sum of absolute distances in both the x and y directions. So, the Manhattan distance in a 2-dimensional space is given as:

<img src="images/knn9.PNG" height=300px width=300px align='center'>

And the generalized formula for an n-dimensional space is given as:

<img src="images/knn10.PNG" height=300px width=300px align='center'>

Where,
- n = number of dimensions
- pi, qi = the i-th coordinates of the two data points p and q

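In [29]:
def computeMD(x1,x2):
    # Manhattan distance from each training point (x1, x2) to the new point (3, 7):
    # sum of absolute coordinate differences (not squared differences)
    return np.abs(x1-3)+np.abs(x2-7)

In [41]:
df["Manhattan Distance"] = computeMD(x1,x2)
df

|   | Acid Durability | Strenght | Output | Euclidean Distance | Rank ED | Manhattan Distance |
|---|---|---|---|---|---|---|
| 0 | 7 | 7 | Bad | 4.0 | 3 | 4 |
| 1 | 7 | 4 | Bad | 5.0 | 4 | 7 |
| 2 | 3 | 4 | Good | 3.0 | 1 | 3 |
| 3 | 1 | 4 | Good | 3.605551 | 2 | 5 |


In [42]:
df["Rank MD"] = df["Manhattan Distance"].rank().astype(int)
df

|   | Acid Durability | Strenght | Output | Euclidean Distance | Rank ED | Manhattan Distance | Rank MD |
|---|---|---|---|---|---|---|---|
| 0 | 7 | 7 | Bad | 4.0 | 3 | 4 | 2 |
| 1 | 7 | 4 | Bad | 5.0 | 4 | 7 | 4 |
| 2 | 3 | 4 | Good | 3.0 | 1 | 3 | 1 |
| 3 | 1 | 4 | Good | 3.605551 | 2 | 5 | 3 |

With k=3, the nearest neighbours under Manhattan distance are Good (3), Bad (4) and Good (5), so the new point (3,7) is again classified as Good. Note that the ranking differs from the Euclidean one: rows 0 and 3 swap places.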


### Minkowski Distance
Minkowski Distance is the generalized form of Euclidean and Manhattan Distance.

It is a metric intended for real-valued vector spaces. Minkowski distance is defined on a normed vector space, that is, a space in which every vector has a length (norm) and that length cannot be negative.

The formula for Minkowski Distance is given as:

<img src="images/knn11.PNG" height=300px width=300px align='center'>

The above formula for Minkowski distance is in generalized form, and we can vary it to obtain different distance metrics.

The p value in the formula can be changed to give us different distances:

- p = 1, when p is set to 1 we get Manhattan distance
- p = 2, when p is set to 2 we get Euclidean distance
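
The two special cases can be checked with a short sketch. The function name `minkowski_distance` and the sample points are hypothetical; the points are assumed to be equal-length sequences of numbers and `p` is assumed to be at least 1.

```python
def minkowski_distance(u, v, p):
    """Minkowski distance of order p between two equal-length points u and v."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

u, v = (1, 2), (4, 6)  # coordinate differences are 3 and 4

manhattan = minkowski_distance(u, v, p=1)  # |3| + |4|
euclidean = minkowski_distance(u, v, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)  # 7.0 5.0
```

Larger values of p give more weight to the largest coordinate difference.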

## Algorithm Convergence

### Error Convergence


As the number of data points in the training data increases, the error rate approaches a limiting value.
In other words, the error of the 1-NN classifier converges as the number of data points grows.

### Learning Problem


For this purpose we denote the entire data set as:

**D = {(x1,y1),(x2,y2),...,(xn,yn)} ⊆ X^d × Y**

We want to predict the label of an input whose label is unknown.

We assume the data points (xi,yi) are drawn from an unknown distribution.

### Bayes Optimal Classifier

Assume we knew P(y|x). The best prediction would then be:

**y∗ = hopt(x) = argmax_y P(y|x)**

Error of the Bayes optimal classifier:

**ϵBayesOpt = 1−P(hopt(x)|x) = 1−P(y∗|x)**

You can never do better than the Bayes Optimal Classifier.
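
As a small illustration, the sketch below applies the two formulas to a single test point. The posterior values are made up for the example, and the function name `bayes_optimal` is hypothetical; in practice P(y|x) is not known.

```python
def bayes_optimal(posterior):
    """posterior: dict mapping each label y to P(y | x) for one fixed x.
    Returns the most likely label and the resulting error 1 - P(y* | x)."""
    best_label = max(posterior, key=posterior.get)
    error = 1.0 - posterior[best_label]
    return best_label, error

# Hypothetical posterior for one test point
posterior = {"spam": 0.8, "ham": 0.2}
label, error = bayes_optimal(posterior)
print(label, round(error, 3))
```

Even this ideal classifier is wrong whenever the less likely label occurs, which is why its error acts as a lower bound.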

### 1-NN Error as n → ∞

As n→∞, the 1-NN error is no more than twice the error of the Bayes Optimal classifier.

Let xNN be the nearest neighbor of our test point xt. As **n→∞**, **dist(xNN,xt)→0**, i.e. **xNN→xt**. (This means the nearest neighbor becomes identical to xt.) You return the label of xNN. What is the probability that this is not the label of xt? (This is the probability of drawing two different labels for xt.)

**ϵNN = P(y∗|xt)(1−P(y∗|xNN)) + P(y∗|xNN)(1−P(y∗|xt)) ≤ (1−P(y∗|xNN)) + (1−P(y∗|xt)) = 2(1−P(y∗|xt)) = 2ϵBayesOpt,**

where the inequality follows from P(y∗|xt)≤1 and P(y∗|xNN)≤1, and the second-to-last equality uses xNN→xt.
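
The bound can be checked numerically for the binary case. In the limit xNN→xt both posteriors are equal, so writing p for P(y∗|xt) the 1-NN error becomes 2p(1−p). The function names and the grid of p values below are assumptions for illustration.

```python
def one_nn_error(p_star):
    """Asymptotic 1-NN error when P(y*|xt) = P(y*|xNN) = p_star (two classes)."""
    return p_star * (1 - p_star) + p_star * (1 - p_star)

def bayes_error(p_star):
    """Error of the Bayes optimal classifier: 1 - P(y*|xt)."""
    return 1 - p_star

# y* is the most likely label, so p_star is at least 0.5 with two classes
grid = [0.5, 0.6, 0.75, 0.9, 1.0]
bound_holds = all(one_nn_error(p) <= 2 * bayes_error(p) for p in grid)

for p in grid:
    print(p, round(one_nn_error(p), 4), round(2 * bayes_error(p), 4))
print("bound holds on the grid:", bound_holds)
```

The gap between the two columns shrinks as p approaches 1, where both errors go to zero.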

## KNN Enhancements

### Parzen Windows and Kernels (Fast KNN)


Instead of fixing the number of neighbours, Parzen windows fix the size of the region: all training points within a fixed radius of the test point take part in the prediction.
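
A fixed-radius vote can be sketched as follows. This is a simplified illustration that uses a hard window (every point inside the radius gets an equal vote) rather than a smooth kernel; the function name `radius_classify` and the toy data are hypothetical.

```python
import math
from collections import Counter

def radius_classify(train, query, radius):
    """train: list of (point, label). Majority vote among all points within `radius`."""
    inside = [label for point, label in train if math.dist(point, query) <= radius]
    if not inside:
        return None  # no training point falls inside the window
    return Counter(inside).most_common(1)[0][0]

# Hypothetical toy data
train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "B"), ((5, 5), "B")]
wide = radius_classify(train, (0.2, 0.1), radius=1.0)    # three points inside
narrow = radius_classify(train, (0.2, 0.1), radius=0.1)  # window is empty
print(wide, narrow)
```

Unlike KNN, the number of voters now varies from query to query, and an empty window has to be handled explicitly.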

<img src="images/knn17.PNG" height=600px width=600px align='center'>

### Performance of KNN Algorithm

- No assumption about training data
- Non parametric approach
- Need to handle missing values
- Sensitive to outliers
- Computationally expensive: O(nd) per query, where n = number of samples and d = number of dimensions
- Slow at testing time

### K-D Trees
We can make KNN faster by reducing **n** and **d**. K-D trees let you find m potential neighbours, where **m << n**, when the data is low-dimensional and real-valued.

- Query cost: **O(d log2 n)**
- Inexact technique, as you may miss neighbours
- Only works well when **d << n**

#### Steps:
- Pick random dimension
- Find median
- Split data
- Repeat
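
The steps above can be sketched as a small recursive builder plus a lookup that only inspects the query's own leaf (the inexact variant described here). The dictionary-based node layout, the function names and the `leaf_size` parameter are assumptions made for this illustration.

```python
import random

def build_kd_tree(points, leaf_size=2, rng=None):
    """Recursively split `points` (equal-length tuples) at the median of a random dimension."""
    if leaf_size < 1:
        raise ValueError("leaf_size must be at least 1")
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    if len(points) <= leaf_size:
        return {"leaf": True, "points": points}
    dim = rng.randrange(len(points[0]))               # pick a random dimension
    ordered = sorted(points, key=lambda pt: pt[dim])
    mid = len(ordered) // 2
    median = ordered[mid][dim]                        # find the median along it
    return {
        "leaf": False, "dim": dim, "median": median,
        "left": build_kd_tree(ordered[:mid], leaf_size, rng),   # split the data
        "right": build_kd_tree(ordered[mid:], leaf_size, rng),  # and repeat
    }

def candidate_neighbours(tree, query):
    """Walk down to the leaf whose region contains `query` and return its points."""
    node = tree
    while not node["leaf"]:
        side = "left" if query[node["dim"]] < node["median"] else "right"
        node = node[side]
    return node["points"]

# Hypothetical toy data
points = [(1, 1), (2, 2), (3, 3), (10, 10), (11, 11), (12, 12)]
tree = build_kd_tree(points, leaf_size=2)
print(candidate_neighbours(tree, (2.4, 2.4)))
```

Only the points in the returned leaf are compared with the query, which is where the speed-up comes from and also why a true nearest neighbour just across a split can be missed.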

<img src="images/knn19.PNG" height=600px width=600px align='center'>

<img src="images/knn18.PNG" height=600px width=600px align='center'>

### Inverted List
Use when your data is high-dimensional and sparse.

- Query cost: **O(n'd')**
- Exact technique, as you don't miss neighbours
- It is a data structure used by search engines
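
A minimal sketch of the idea, assuming sparse vectors are stored as dictionaries of their non-zero features and that only rows sharing at least one non-zero feature with the query can be relevant (true for dot-product style similarities). The function names and sample rows are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(sparse_vectors):
    """Map each feature to the set of row ids in which it is non-zero."""
    index = defaultdict(set)
    for row_id, vector in enumerate(sparse_vectors):
        for feature in vector:
            index[feature].add(row_id)
    return index

def candidate_rows(index, query):
    """Row ids that share at least one non-zero feature with the query."""
    found = set()
    for feature in query:
        found |= index.get(feature, set())
    return found

# Hypothetical sparse rows: feature name -> value
rows = [{"cat": 1, "dog": 2}, {"fish": 3}, {"dog": 1, "bird": 1}]
index = build_inverted_index(rows)
print(candidate_rows(index, {"dog": 1}))
```

Distances then only need to be computed against the candidate rows and only over the features that are actually present.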

### Locality-Sensitive Hashing
Use when your data is high-dimensional and real-valued.

- Query cost: **O(n'd')**
- Inexact technique, as you may miss neighbours
- Only works when **n' << n**

#### Steps:
- Draw k random hyperplanes
- The space is sliced into up to 2^k regions; each region has linear sides, and the regions are mutually exclusive
- When a point x comes, compare it only to the training points in the same region
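
The steps above can be sketched with random hyperplanes through the origin (an assumption made to keep the example short). Each point gets a k-bit code saying on which side of every hyperplane it lies, and points with the same code share a bucket. All function names and the sample points are hypothetical.

```python
import random

def make_hyperplanes(k, dim, seed=0):
    """Draw k random normal vectors, one per hyperplane through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]

def region_code(point, hyperplanes):
    """k-bit signature: 1 if the point is on the positive side of a hyperplane, else 0."""
    return tuple(
        int(sum(w * x for w, x in zip(plane, point)) >= 0) for plane in hyperplanes
    )

def build_buckets(points, hyperplanes):
    """Group the training points by the region they fall into."""
    buckets = {}
    for point in points:
        buckets.setdefault(region_code(point, hyperplanes), []).append(point)
    return buckets

k = 3
planes = make_hyperplanes(k, dim=2)
points = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0), (3.0, -1.0), (-2.0, 0.5)]
buckets = build_buckets(points, planes)

# A query is compared only with the training points in its own region
query = (1.5, 3.0)
candidates = buckets.get(region_code(query, planes), [])
print(len(buckets), candidates)
```

Because close neighbours can end up on opposite sides of a hyperplane, practical setups often repeat this with several independent sets of hyperplanes and merge the candidates.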

<img src="images/knn20.PNG" height=500px width=500px align='center'>

In [1]:
from sklearn import datasets

In [7]:
datasets.make_classification(n_samples=50,n_features=4, n_classes=2)

(array([[ 2.46860857e+00, -1.83256468e+00,  2.92913502e+00,
         -1.58752889e+00],
        [ 1.49560407e+00, -5.03016146e-01,  3.34799662e-01,
          1.41306917e+00],
        [-5.87512817e-01, -6.99464894e-03,  3.53586145e-01,
         -1.35523642e+00],
        [ 1.62804920e+00, -1.56355157e+00,  2.77343570e+00,
         -2.43525035e+00],
        [-1.07637151e+00,  8.97026797e-01, -1.50950251e+00,
          1.07541303e+00],
        [-1.07301040e+00,  3.55946035e-01, -2.28488201e-01,
         -1.03311348e+00],
        [ 1.86421051e+00, -5.72796908e-01,  2.88819837e-01,
          1.97327676e+00],
        [-7.97059544e-01,  6.51590924e-01, -1.08777124e+00,
          7.46826653e-01],
        [-3.43040211e-01,  1.28589509e-01, -1.08124996e-01,
         -2.72427209e-01],
        [ 2.91591039e+00, -2.20158151e+00,  3.54752570e+00,
         -2.01974420e+00],
        [-1.41274837e+00,  7.75902302e-01, -1.02935939e+00,
         -1.58565165e-01],
        [ 9.84334951e-01, -3.66231453e-01, 

In [1]:
import math

def euclidean_distance(x1, y1, x2, y2):
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

def manhattan_distance(x1, y1, x2, y2):
    return abs(x1 - x2) + abs(y1 - y2)

def cosine_similarity(x1, y1, x2, y2):
    dot_product = x1 * x2 + y1 * y2
    norm1 = math.sqrt(x1 ** 2 + y1 ** 2)
    norm2 = math.sqrt(x2 ** 2 + y2 ** 2)
    return dot_product / (norm1 * norm2)

# calculate Euclidean distance
distance = euclidean_distance(1, 2, 3, 4)
print("Euclidean distance:", distance)

# calculate Manhattan distance
distance = manhattan_distance(1, 2, 3, 4)
print("Manhattan distance:", distance)

# calculate cosine similarity
similarity = cosine_similarity(1, 2, 3, 4)
print("Cosine similarity:", similarity)


Euclidean distance: 2.8284271247461903
Manhattan distance: 4
Cosine similarity: 0.9838699100999074
