# K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a simple yet powerful supervised machine learning algorithm used primarily for classification and regression tasks. It operates on the principle that similar data points are close to each other in the feature space. 

### **How KNN Works**
1. **Choose the Number of Neighbors (K)**: The first step is to determine the value of K, which represents the number of nearest neighbors to consider for making predictions.
2. **Calculate Distance**: For a given test instance, calculate the distance from the test instance to all training instances using a distance metric. The most common metric is Euclidean distance, $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.   
3. **Identify Neighbors**: Sort the distances and select the K closest instances from the training dataset.
4. **Make the Prediction**: For classification tasks, the predicted class for the test instance is the majority vote among the K neighbors. For regression tasks, it is the average of the K neighbors' target values.
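
The four steps can be sketched from scratch with NumPy. This is a minimal illustration rather than how scikit-learn implements the algorithm; the function name `knn_predict` and the toy data are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Step 1: K is supplied by the caller as the argument k
    # Step 2: Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote over the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters in 2-D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))  # query near the first cluster
print(knn_predict(X_toy, y_toy, [5.1, 5.0], k=3))  # query near the second cluster
```

For regression, the vote in the last step would be replaced by the mean of the neighbors' target values.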


> The value of K is typically chosen through experimentation, often with cross-validation. For binary classification, an odd K avoids ties in the vote; with three or more classes, ties can still occur even when K is odd.
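
One way to run that experiment is sketched below with scikit-learn's `cross_val_score`. The candidate values of K and the use of 5 folds are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Candidate values of K (an illustrative choice)
candidate_ks = [1, 3, 5, 7, 9, 11]

mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: each fold reports accuracy on its held-out part
    scores = cross_val_score(model, X_iris, y_iris, cv=5)
    mean_scores[k] = scores.mean()
    print(f"K={k:2d}  mean accuracy={mean_scores[k]:.3f}")

# Pick the K with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print("Best K by mean CV accuracy:", best_k)
```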

## KNN using the Iris Dataset

- 4 features (numerical) 
    - Petal length
    - Petal width
    - Sepal length
    - Sepal width
- 1 target (categorical)
    - Iris setosa
    - Iris versicolor
    - Iris virginica

```python
# 1. Import the libraries
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
```

```python
# 2. Load the dataset
iris = datasets.load_iris()

print(iris.feature_names)
print(iris.target_names)
print(iris.data[:10])
print(iris.target[:10])
```

```
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
[0 0 0 0 0 0 0 0 0 0]
```


```python
# 3. Separate the data into features and target
X, y = iris.data, iris.target

# 4. Train-test split: 70% training, 30% testing
# No random_state is set, so the split (and the results below) change from run to run
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
```
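
If repeatable results matter, the split can be seeded and stratified. The sketch below is a variation on the split above, not what the notebook ran: `random_state=42` is an arbitrary seed, and `stratify` keeps the three species in equal proportion in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify preserves the class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, train_size=0.7, random_state=42, stratify=y_all
)

print(X_tr.shape, X_te.shape)                # sizes of the two parts
print(np.bincount(y_tr), np.bincount(y_te))  # samples per class in each part
```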

```python
# 5. Create the model
classifier = KNeighborsClassifier(n_neighbors=3)

# 6. Train the model
classifier.fit(X_train, y_train)
```

```python
# Make predictions on the test data
y_predict = classifier.predict(X_test)

for i in range(10):
    print(f"Predicted: {iris.target_names[y_predict[i]]}, Actual: {iris.target_names[y_test[i]]}")
```

```
Predicted: virginica, Actual: virginica
Predicted: setosa, Actual: setosa
Predicted: versicolor, Actual: versicolor
Predicted: versicolor, Actual: versicolor
Predicted: setosa, Actual: setosa
Predicted: setosa, Actual: setosa
Predicted: setosa, Actual: setosa
Predicted: virginica, Actual: virginica
Predicted: setosa, Actual: setosa
Predicted: setosa, Actual: setosa
```


## Classification Metrics

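Classification metrics evaluate a model by comparing its predictions with the actual labels. The formulas below are written for binary classification, in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).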

#### 1. Accuracy

The ratio of correctly predicted instances to the total number of instances.

$$
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
$$

While simple and widely used, accuracy can be misleading on imbalanced datasets, where one class is much more frequent than the others.
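
A small made-up example shows the problem: a model that always predicts the majority class still scores well. The labels below are hypothetical.

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class
y_pred = [0] * 100

# High accuracy, although not a single positive case was found
majority_accuracy = accuracy_score(y_true, y_pred)
print(majority_accuracy)
```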

#### 2. Precision
   
The ratio of correctly predicted positive observations to the total predicted positives.

$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} 
$$

Precision is important when the cost of false positives is high, as it shows how many of the positive predictions made by the model are actually correct.
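
As a sketch, precision can be counted directly from the definition and checked against scikit-learn's `precision_score`. The labels are hypothetical.

```python
from sklearn.metrics import precision_score

# Hypothetical binary labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Count true positives and false positives from the definition
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

manual_precision = tp / (tp + fp)
print(manual_precision)                 # TP / (TP + FP)
print(precision_score(y_true, y_pred))  # same value from scikit-learn
```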

#### 3. Recall (Sensitivity or True Positive Rate)
   
The ratio of correctly predicted positive observations to all actual positives.

$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} 
$$

Recall is crucial when the cost of false negatives is high (e.g., in medical diagnoses), as it measures how well the model captures all actual positive cases.
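
Recall usually trades off against precision. The sketch below assumes a model that outputs a score per sample (the scores are made up) and shows how lowering the decision threshold raises recall while precision falls.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and model scores (higher = more likely positive)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1])

recalls, precisions = [], []
for threshold in (0.75, 0.5, 0.35):
    # Predict positive whenever the score reaches the threshold
    y_pred = (scores >= threshold).astype(int)
    recalls.append(recall_score(y_true, y_pred))
    precisions.append(precision_score(y_true, y_pred))
    print(f"threshold={threshold:.2f}  recall={recalls[-1]:.2f}  precision={precisions[-1]:.2f}")
```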

#### 4. F1 Score

The harmonic mean of precision and recall.

$$
\text{F1 Score} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

The F1 score balances precision and recall in a single number. Because the harmonic mean is dominated by the smaller of the two values, F1 is high only when both are high, which makes it more informative than accuracy on imbalanced classes.
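
The effect of the harmonic mean is easy to see numerically. The helper `f1_from` below is a hypothetical name used only for this illustration.

```python
def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A model with perfect precision but very low recall
precision, recall = 1.0, 0.1
arithmetic_mean = (precision + recall) / 2
f1 = f1_from(precision, recall)

print(f"arithmetic mean = {arithmetic_mean:.3f}")  # looks acceptable
print(f"F1 score        = {f1:.3f}")               # pulled toward the low recall
```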

#### 5. Specificity (True Negative Rate)

The ratio of correctly predicted negative observations to all actual negatives.

$$
\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} 
$$

Specificity is useful when the cost of false positives is high, focusing on the model’s ability to correctly identify negative cases.
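
scikit-learn has no dedicated specificity function, so a common approach is to derive it from the confusion matrix counts. Specificity is also the recall of the negative class, which gives a second way to compute it. The labels are hypothetical.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical binary labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For 0/1 labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
print(specificity)

# Equivalent: recall computed with the negative class treated as "positive"
print(recall_score(y_true, y_pred, pos_label=0))
```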

#### 6. ROC Curve and AUC (Area Under Curve)

- **ROC Curve**: plots the True Positive Rate (Recall) against the False Positive Rate (1 - Specificity) at different classification thresholds.
- **AUC (Area Under the ROC Curve)**: summarizes the model's overall ability to distinguish between the positive and negative classes.

The closer the AUC is to 1, the better the model separates positives from negatives across thresholds; an AUC of 0.5 corresponds to random guessing. AUC is particularly useful for comparing multiple models.
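
A minimal sketch with scikit-learn's `roc_curve` and `roc_auc_score`, using made-up scores for a binary problem:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and model scores (higher = more likely positive)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1])

# One (FPR, TPR) point per threshold; together they trace the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, scores)
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")

# AUC equals the fraction of (positive, negative) pairs ranked in the right order
auc = roc_auc_score(y_true, scores)
print(f"AUC = {auc:.3f}")
```

Here 22 of the 24 positive-negative pairs are ranked correctly, which is where the AUC value comes from.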

#### 7. Confusion Matrix

A matrix displaying the counts of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) classifications.

-|Predicted <span style="color:cyan">P</span>ositive | Predicted <span style="color:orange">N</span>egative
---| --- | ---
**Actual Positive** | <span style="color:green">True <span style="color:cyan">P</span>ositives</span> | <span style="color:red">False <span style="color:orange">N</span>egatives</span>
**Actual Negative** | <span style="color:red">False <span style="color:cyan">P</span>ositives</span> | <span style="color:green">True <span style="color:orange">N</span>egatives</span>
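
With three Iris classes the matrix becomes 3×3. The sketch below is self-contained and uses its own split, so its numbers need not match the notebook's; `random_state=0` is an arbitrary seed chosen only to make it repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, train_size=0.7, random_state=0
)

demo_model = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
y_hat = demo_model.predict(X_te)

# Rows are actual classes, columns are predicted classes (scikit-learn's convention)
cm = confusion_matrix(y_te, y_hat, labels=[0, 1, 2])
print(cm)

# Correct predictions sit on the diagonal, so accuracy is trace / total
print(cm.trace() / cm.sum(), accuracy_score(y_te, y_hat))
```

Note that scikit-learn sorts labels in ascending order, so for 0/1 labels its binary matrix is laid out as `[[TN, FP], [FN, TP]]`, with the negative class first, unlike the table above.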

```python
# 7. Evaluate the model
accuracy_score(y_test, y_predict)
```

```
0.9555555555555556
```

```python
# Make predictions for new samples
# Feature order: sepal length, sepal width, petal length, petal width (cm)
result = classifier.predict([[3, 5, 4, 2]])
print(iris.target_names[result])

result = classifier.predict([[5, 4, 2, 1]])
print(iris.target_names[result])

result = classifier.predict([[4, 2, 5, 3]])
print(iris.target_names[result])
```

```
['versicolor']
['setosa']
['virginica']
```


KNN is widely used in various applications, including:
- **Recommendation Systems**: To recommend items based on user preferences.
- **Image Classification**: For classifying images based on visual similarity.
- **Anomaly Detection**: To identify outliers in datasets.
- **Pattern Recognition**: Used in handwriting recognition and similar tasks.

| Advantages | Disadvantages |
|------------|---------------|
| **Simplicity**: KNN is easy to understand and implement, making it a good choice for beginners. | **Computational Complexity**: KNN requires distance calculations between the test instance and all training instances, leading to high computational costs, especially with large datasets. |
| **No Training Phase**: KNN does not require a separate training phase, as it simply stores the training data. | **Storage Requirements**: Since KNN stores the entire training dataset, it can consume a significant amount of memory. |
| **Adaptability**: The algorithm can be adapted for both classification and regression tasks. | **Sensitivity to Noise**: KNN can be affected by noisy data and outliers, which may lead to incorrect predictions. |
| **Flexibility**: KNN handles multi-class classification naturally and, being non-parametric, can model decision boundaries of arbitrary shape. | **Choosing K**: The choice of K can significantly affect performance. A small value may lead to overfitting, while a large value may oversmooth the class boundaries and underfit. |
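
As a sketch of the adaptability to regression, scikit-learn's `KNeighborsRegressor` predicts the average target of the K nearest neighbors (with its default uniform weights). The one-dimensional data below is made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D training data
X_reg = np.array([[1.0], [2.0], [3.0], [10.0]])
y_reg = np.array([10.0, 20.0, 30.0, 100.0])

regressor = KNeighborsRegressor(n_neighbors=2)
regressor.fit(X_reg, y_reg)

# The two nearest neighbors of 2.4 are x=2.0 and x=3.0,
# so the prediction is the mean of their targets (20.0 and 30.0)
prediction = regressor.predict([[2.4]])
print(prediction)
```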
