# K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is an instance-based learning algorithm that can be used for both classification and regression. Its principle is simple: it assumes that similar data points lie close to each other in the feature space, so the label or value of a new point can be inferred from the points nearest to it.

## How KNN Works

1. **Choose the number of `k` neighbors**: The number `k` in KNN represents the number of nearest neighbors we wish to take a vote from when predicting the label of an unseen data point.

2. **Calculate the Distance**: Calculate the distance between the input (query) point and every data point in the training set. Common choices are the Euclidean and Manhattan distances, both of which are special cases of the Minkowski distance.

3. **Find the Nearest Neighbors**: Sort the distances in ascending order and select the `k` training points with the smallest distances.

4. **Make a Decision**:
   - **For Classification**: Take a majority vote from the `k` neighbors. The class that has the highest number of votes will be the predicted class for the input data point.
   - **For Regression**: Take the average of the `k` neighbors' output values to get the predicted output for the input data point.
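
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function names `minkowski_distances` and `knn_predict`, the parameter `p`, and the toy data are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def minkowski_distances(X_train, x_query, p=2):
    """Distance from x_query to every row of X_train (p=2 Euclidean, p=1 Manhattan)."""
    return np.sum(np.abs(X_train - x_query) ** p, axis=1) ** (1.0 / p)


def knn_predict(X_train, y_train, x_query, k=3, p=2, task="classification"):
    """Predict for a single query point by following the four steps above."""
    # Step 2: distance from the query to every training point
    distances = minkowski_distances(X_train, x_query, p=p)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    # Step 4: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))


# Tiny made-up dataset: two clusters in 2D
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 14.0, 50.0, 52.0, 54.0])

label = knn_predict(X_toy, y_class, np.array([1.8, 1.6]), k=3)
value = knn_predict(X_toy, y_value, np.array([6.4, 6.6]), k=3, task="regression")
print(label, value)
```

The sketch compares the query against every training point, which is the brute-force strategy; libraries such as scikit-learn can also use tree-based indexes to speed up the neighbor search.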

In the next sections, we'll dive deeper into the practical implementation of KNN using Python and explore its strengths and weaknesses.

# Practical Example with Python

In this section, we'll work with a sample dataset to demonstrate the application of the KNN algorithm for classification. We'll use the famous Iris dataset, which contains measurements for 150 iris flowers from three different species.

The steps we'll follow are:

1. **Data Loading and Visualization**
2. **Data Preprocessing**
3. **Model Training using KNN**
4. **Model Evaluation and Hyperparameter Tuning**

Let's start by loading and visualizing the Iris dataset.

In [None]:
from sklearn.datasets import load_iris
import seaborn as sns

import pandas as pd
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
iris_df['species'] = iris_df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Visualize the first few rows of the dataset
iris_df.head()

In [None]:
# Visualize pairwise relationships between all four features, colored by species
sns.pairplot(iris_df, hue='species', height=2.5)
plt.suptitle('Pairplot of Iris Dataset', y=1.02)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Splitting the dataset into training and testing sets
X = iris_df.drop('species', axis=1)
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled[:5], X_test_scaled[:5]

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# Training the KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)

# Predicting on the test set
y_pred = knn.predict(X_test_scaled)

# Evaluating the classifier
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, target_names=iris.target_names)

accuracy, classification_rep
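
The plan above lists hyperparameter tuning as its last step, while the classifier so far uses a fixed `n_neighbors=3`. One possible way to choose `k` is a cross-validated grid search, sketched below. The candidate range for `k`, the 5-fold setting, and the use of a `Pipeline` are illustrative choices, not something prescribed by the original walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values of k (odd values in an arbitrary illustrative range)
param_grid = {"knn__n_neighbors": list(range(1, 22, 2))}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
test_accuracy = search.score(X_test, y_test)
print("Best k:", best_k)
print("Mean CV accuracy:", round(search.best_score_, 3))
print("Test accuracy:", round(test_accuracy, 3))
```

Because the test set is only scored once, after `k` has been selected on the training folds, the reported test accuracy is not biased by the tuning itself.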

## KNN for Classification

This section revisits the Iris dataset, which is commonly used in pattern recognition literature, this time training KNN directly on the unscaled features and inspecting the confusion matrix. The dataset contains three classes of 50 instances each, where each class refers to a type of iris plant. Each instance has four attributes:

1. Sepal length (cm)
2. Sepal width (cm)
3. Petal length (cm)
4. Petal width (cm)

Our goal is to classify the iris plants into one of the three species based on these attributes.

Let's start by reloading and visualizing the Iris dataset.

In [None]:
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
iris_df['species'] = iris_df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Visualize the dataset
sns.pairplot(iris_df, hue='species')
plt.show()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Split the data into training and test sets
X = iris_df.drop('species', axis=1)
y = iris_df['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)

# Evaluate the classifier
confusion = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
confusion, report

## KNN for Regression

KNN isn't just limited to classification tasks. It can also be used for regression to predict a continuous value. The principle remains the same, but instead of voting for the most frequent class, the algorithm averages the values of the `k` nearest neighbors to predict a continuous output.
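
To make the averaging concrete, the small sketch below fits scikit-learn's `KNeighborsRegressor` on made-up one-dimensional data and checks that its prediction matches the hand-computed mean of the three nearest targets. The data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up 1D data purely for illustration
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_toy, y_toy)

query = np.array([[2.2]])
prediction = reg.predict(query)[0]

# Recover the neighbors the model used and average their targets by hand
_, neighbor_idx = reg.kneighbors(query)
manual_mean = y_toy[neighbor_idx[0]].mean()
print(prediction, manual_mean)
```

The three training points closest to 2.2 are 2.0, 3.0 and 1.0, so both values equal the mean of 20, 30 and 10, which is 20.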

For this demonstration, we'll use the "Hours Studied vs. Exam Score" dataset that we used earlier for linear regression. We'll train a KNN regressor and compare its performance with the linear regression model.
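
The cells that follow rely on variables created in the earlier linear regression section (`X_train_regression`, `X_test_regression`, `y_train_regression`, `y_test_regression`, and `y_pred_linear`). If that section is not available, a synthetic stand-in along the following lines could be used to run them. The data here is hypothetical and generated at random; it is not the original dataset, and the slope, intercept, and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: NOT the dataset from the earlier section
rng = np.random.default_rng(0)
hours = rng.uniform(1, 10, size=100).reshape(-1, 1)           # hours studied
scores = 40 + 5 * hours.ravel() + rng.normal(0, 5, size=100)  # noisy exam scores

X_train_regression, X_test_regression, y_train_regression, y_test_regression = (
    train_test_split(hours, scores, test_size=0.2, random_state=42)
)

# Linear baseline whose predictions are reused in the comparison plot
linear_model = LinearRegression().fit(X_train_regression, y_train_regression)
y_pred_linear = linear_model.predict(X_test_regression)
```

Any numbers obtained with this stand-in describe the synthetic data only and should not be read as results for the original dataset.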

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Train a KNN regressor
# (the *_regression variables come from the earlier linear regression section)
knn_regressor = KNeighborsRegressor(n_neighbors=3)
knn_regressor.fit(X_train_regression, y_train_regression)

# Predict on the test set
y_pred_regression = knn_regressor.predict(X_test_regression)

# Evaluate the regressor
mse_knn = mean_squared_error(y_test_regression, y_pred_regression)
mse_knn

In [None]:
plt.figure(figsize=(10, 6))

# Plotting the actual values
plt.scatter(X_test_regression, y_test_regression, color='blue', label='Actual Values')

# Plotting the linear regression predictions
# (y_pred_linear comes from the earlier linear regression section)
plt.plot(X_test_regression, y_pred_linear, color='red', label='Linear Regression Predictions')

# Plotting the KNN regressor predictions
plt.scatter(X_test_regression, y_pred_regression, color='green', label='KNN Regressor Predictions')

plt.title('Hours Studied vs. Exam Score')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.legend()
plt.grid(True)
plt.show()

## Advantages and Disadvantages of KNN

Like every algorithm, KNN has its strengths and weaknesses. Let's discuss some of the advantages and disadvantages of the KNN algorithm.

### Advantages of KNN

1. **Simplicity**: KNN is straightforward and easy to understand. The algorithm relies on the basic principle that similar data points are close to each other in the feature space.

2. **No Training Phase**: KNN is a lazy learner: fitting the model only stores the training data, and all distance computations are deferred to prediction time. Training is therefore essentially free, at the cost of slower predictions.

3. **Adaptability**: KNN makes no assumptions about the underlying data distribution, making it suitable for datasets that don't follow any specific distribution.

4. **Multifunctional**: KNN can be used for both classification and regression tasks.

5. **Robust to Noisy Data**: With a suitable choice of `k`, KNN can be robust to noisy data since it relies on the majority voting or averaging mechanism.
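
The point about noisy data can be illustrated with a small experiment. The sketch below uses synthetic data with an assumed 20% of training labels flipped at random, then compares `k=1` with `k=15` on clean test labels. The cluster positions, noise rate, and values of `k` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)


def make_blobs_2d(n_per_class):
    """Two well-separated Gaussian clusters (synthetic data for illustration)."""
    X0 = rng.normal(loc=-3.0, scale=1.0, size=(n_per_class, 2))
    X1 = rng.normal(loc=3.0, scale=1.0, size=(n_per_class, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


X_train, y_train_clean = make_blobs_2d(200)
X_test, y_test = make_blobs_2d(200)

# Flip roughly 20% of the training labels to simulate label noise
y_train_noisy = y_train_clean.copy()
flip = rng.random(y_train_noisy.shape[0]) < 0.2
y_train_noisy[flip] = 1 - y_train_noisy[flip]

accuracies = {}
for k in (1, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train_noisy)
    accuracies[k] = model.score(X_test, y_test)
    print(f"k={k}: accuracy on clean test labels = {accuracies[k]:.3f}")
```

With `k=1`, every mislabeled training point directly corrupts the predictions around it, whereas with a larger `k` a single flipped label is usually outvoted by its correctly labeled neighbors.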

### Disadvantages of KNN

1. **Computationally Intensive**: Since KNN computes distances between data points during the prediction phase, it can be computationally intensive, especially for large datasets.

2. **Sensitive to Irrelevant Features**: KNN relies on distances, so it's sensitive to irrelevant or redundant features. Feature scaling and feature selection become crucial when working with KNN.

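3. **Optimal k Determination**: Choosing the right value of `k` is critical. A small value of `k` makes the model sensitive to noise (overfitting), while a large value oversmooths the decision boundary and lets the predictions drift toward the majority class or the overall mean (underfitting). In practice, `k` is usually chosen by cross-validation.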

4. **Storage Requirements**: KNN requires storing the entire dataset, which can be a challenge for large datasets in terms of memory.

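5. **Categorical Data**: Handling categorical data can be tricky with KNN because standard metrics such as Euclidean distance are not meaningful for unordered categories. Such features typically need to be encoded (for example, one-hot encoded) or compared with a metric designed for them, such as the Hamming distance.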
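
The sensitivity to feature scale mentioned in the list above can be seen in a small synthetic experiment. In this sketch, one feature carries the class signal and a second, irrelevant feature is pure noise on a much larger numeric scale. All values (class means, noise scale, `k=5`) are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)

# Informative feature: class means 0 and 4, small numeric range
informative = rng.normal(loc=4.0 * y, scale=1.0)
# Irrelevant feature: pure noise on a much larger numeric scale
noise = rng.normal(loc=0.0, scale=10_000.0, size=n)
X = np.column_stack([informative, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same classifier, with and without standardization
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)

acc_raw = raw_knn.score(X_test, y_test)
acc_scaled = scaled_knn.score(X_test, y_test)
print(f"Unscaled: {acc_raw:.3f}  Scaled: {acc_scaled:.3f}")
```

Without scaling, the distances are dominated by the large-scale noise feature, so the neighbors are effectively chosen at random with respect to the class. Standardization puts both features on a comparable scale, although the irrelevant feature still contributes noise to the distances, which is why feature selection remains useful as well.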