# K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a simple yet powerful supervised machine learning algorithm used primarily for classification and regression tasks. It operates on the principle that similar data points are close to each other in the feature space. 

## How KNN Works
1. **Choose the Number of Neighbors (K)**: The first step is to determine the value of K, which represents the number of nearest neighbors to consider for making predictions.
2. **Calculate Distance**: For a given test instance, calculate its distance to every training instance using a distance metric. The most common metric is Euclidean distance, $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where $n$ is the number of features.
3. **Identify Neighbors**: Sort the distances and select the K closest instances from the training dataset.
4. **Predict from the Neighbors**: For classification tasks, the predicted class for the test instance is determined by majority vote among the K neighbors. For regression tasks, the prediction is the average (mean) of the K neighbors' target values.
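
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN: the function name `knn_predict`, its `task` parameter, and the toy data are all made up for this example, and ties in the vote are simply resolved in favor of the class encountered first.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict the target of one query point from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_query = np.asarray(x_query, dtype=float)

    # Step 2: Euclidean distance from the query to every training instance.
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))

    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(distances, kind="stable")[:k]

    # Step 4: majority vote (classification) or mean (regression).
    if task == "classification":
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    return float(y_train[nearest].mean())


# Toy data (hypothetical): two well-separated clusters in 2-D.
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_class = [0, 0, 0, 1, 1, 1]
y_value = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]

print(knn_predict(X_toy, y_class, [1.1, 1.0], k=3))                     # 0
print(knn_predict(X_toy, y_value, [5.0, 5.0], k=3, task="regression"))  # 51.0
```

Computing every distance for every query is what makes plain KNN slow on large training sets; libraries typically speed up the neighbor search with tree-based indexes.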


> The value of K is typically chosen through experimentation, often using cross-validation. For binary classification, an odd K avoids ties in the vote; with three or more classes (as in the Iris example below), ties remain possible even when K is odd.
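
One way to run that experiment is sketched below with scikit-learn's `cross_val_score`. The list of candidate K values and the choice of 5 folds are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values of K (an illustrative choice).
candidate_ks = [1, 3, 5, 7, 9, 11, 15]

mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation; with a classifier and an integer cv,
    # scikit-learn uses stratified folds.
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()

# Pick the K with the highest mean accuracy across folds.
best_k = max(mean_scores, key=mean_scores.get)

for k, score in mean_scores.items():
    print(f"K={k:2d}  mean accuracy={score:.3f}")
print("Best K:", best_k)
```

On a dataset as small as Iris the differences between neighboring K values are often within noise, so the "best" K from a run like this should be read as a rough guide.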

## KNN using the Iris Dataset

- 4 features (numerical) 
    - Petal length
    - Petal width
    - Sepal length
    - Sepal width
- 1 target (categorical)
    - Iris setosa
    - Iris versicolor
    - Iris virginica

In [1]:
from sklearn import datasets
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
iris = datasets.load_iris()

print(iris.feature_names)
print(iris.target_names)
print(iris.data)
print(iris.target)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 ...

In [3]:
X, y = iris.data, iris.target

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
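
Because `train_test_split` shuffles the data randomly, the split above (and therefore the predictions and accuracy printed in later cells) will differ from run to run. If reproducible numbers are wanted, a variant like the following could be used instead. The seed value `42` is arbitrary, and the variable names `X_tr`, `X_te`, `y_tr`, `y_te` are chosen here only to avoid overwriting the notebook's own variables.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle so every run yields the same split;
# stratify=y keeps the three classes equally represented in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)  # (105, 4) (45, 4)
```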

In [5]:
classifier = neighbors.KNeighborsClassifier(n_neighbors=3)

In [6]:
classifier.fit(X_train, y_train)

In [7]:
y_predict = classifier.predict(X_test)

print(y_test)
print(y_predict)

[0 0 2 2 1 1 0 0 1 2 1 0 0 0 1 1 0 2 0 1 0 0 1 2 2 2 2 1 2 2 2 2 1 1 2 2 2
 0 0 2 1 0 2 0 1]
[0 0 2 2 1 1 0 0 1 1 1 0 0 0 1 1 0 2 0 1 0 0 1 2 2 1 2 1 2 2 2 2 1 1 2 2 2
 0 0 2 1 0 2 0 2]


In [8]:
accuracy_score(y_test, y_predict)

0.9333333333333333
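
A single accuracy figure does not show which classes were confused with each other. As a small sketch, the two label arrays printed by cell `In [7]` can be fed to scikit-learn's `confusion_matrix`; the arrays below are copied from that output, so the result describes that particular run only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# True and predicted labels, copied from the output of cell In [7].
y_test = np.array([
    0, 0, 2, 2, 1, 1, 0, 0, 1, 2, 1, 0, 0, 0, 1, 1, 0, 2, 0, 1, 0, 0, 1,
    2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 0, 0, 2, 1, 0, 2, 0, 1,
])
y_predict = np.array([
    0, 0, 2, 2, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 2, 0, 1, 0, 0, 1,
    2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 0, 0, 2, 1, 0, 2, 0, 2,
])

# Rows are true classes, columns are predicted classes
# (0 = setosa, 1 = versicolor, 2 = virginica).
cm = confusion_matrix(y_test, y_predict)
print(cm)
# [[15  0  0]
#  [ 0 12  1]
#  [ 0  2 15]]

print(accuracy_score(y_test, y_predict))  # 42 correct out of 45
```

In this run all three mistakes are between versicolor and virginica, while every setosa sample is classified correctly.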

In [9]:
features = iris.data
print(features[0])

labels = iris.target
print(labels[0])

print(features[1], labels[1])

[5.1 3.5 1.4 0.2]
0
[4.9 3.  1.4 0.2] 0


In [10]:
result = classifier.predict([[3, 5, 4, 2]])
print(result)

[1]


In [11]:
result = classifier.predict([[5, 4, 2, 1]])
print(result)

[0]


In [12]:
result = classifier.predict([[4, 2, 5, 3]])
print(result)

[2]


In [13]:
print(iris.target_names[result])

['virginica']
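
To see why a particular class was predicted, it can help to look at the neighbors themselves. The sketch below uses `kneighbors` and `predict_proba` from `KNeighborsClassifier`. One assumption to note: it fits on the full Iris dataset rather than on the random 70% split used above, so that it is deterministic, which means its output may not match the notebook's cells exactly. The name `clf` is chosen to avoid clashing with `classifier`.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Assumption: fit on all 150 samples (the notebook fits on a random 70% split).
clf = KNeighborsClassifier(n_neighbors=3).fit(iris.data, iris.target)

query = [[4, 2, 5, 3]]

# Distances to, and training-set indices of, the 3 nearest neighbors.
distances, indices = clf.kneighbors(query)
neighbor_labels = iris.target[indices[0]]

print("distances:      ", distances[0])
print("neighbor classes:", iris.target_names[neighbor_labels])

# With the default uniform weights, each probability is the
# fraction of the K neighbors that belong to that class.
proba = clf.predict_proba(query)[0]
print("class shares:   ", dict(zip(iris.target_names, proba)))
print("prediction:     ", iris.target_names[clf.predict(query)])
```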


KNN is widely used in various applications, including:
- **Recommendation Systems**: To recommend items based on user preferences.
- **Image Classification**: For classifying images based on visual similarity.
- **Anomaly Detection**: To identify outliers in datasets.
- **Pattern Recognition**: Used in handwriting recognition and similar tasks.

| Advantages | Disadvantages |
|------------|---------------|
| **Simplicity**: KNN is easy to understand and implement, making it a good choice for beginners. | **Computational Complexity**: KNN requires distance calculations between the test instance and all training instances, leading to high computational costs, especially with large datasets. |
| **No Explicit Training Phase**: KNN is a lazy learner. Fitting only stores the training data, and the real work is deferred to prediction time. | **Storage Requirements**: Since KNN stores the entire training dataset, it can consume a significant amount of memory. |
| **Adaptability**: The algorithm can be adapted for both classification and regression tasks. | **Sensitivity to Noise**: KNN can be affected by noisy data and outliers, which may lead to incorrect predictions. |
| **Flexibility**: KNN handles multi-class classification naturally and makes no assumption about the shape of the decision boundary. | **Choosing K**: The choice of K can significantly affect performance. A small K may overfit to noise in the training data, while a large K may oversmooth the class boundaries and underfit. |
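
The "Choosing K" trade-off in the table can be observed by comparing training and test accuracy across a few values of K. This is a rough sketch: the K values, the seed `0`, and the stratified 70/30 split are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y
)

results = {}
for k in (1, 5, 25, 75):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # score() returns mean accuracy on the given data.
    results[k] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"K={k:2d}  train accuracy={results[k][0]:.3f}  test accuracy={results[k][1]:.3f}")
```

With K=1 every training point is its own nearest neighbor, so training accuracy is perfect by construction and says little about generalization. As K approaches the size of the training set, the vote is increasingly dominated by points far from the query.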
