# KNN Classifier and Regressor

## Introduction to K-Nearest Neighbor Classifier and Regressor

In this lecture, we implement the k-nearest neighbor (KNN) classifier and regressor and explore their most important parameters. Using simple synthetic datasets, we solve a binary classification problem and then a regression problem. These concepts will be revisited in end-to-end projects, where hyperparameter tuning is also necessary.

--- 
# Part 1: K-Nearest Neighbor (KNN) Classifier

### Step 1: Import Libraries
We begin by importing the necessary libraries for classification.

In [1]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Step 2: Create the Dataset
Using `make_classification`, we create a dataset with two classes. The dataset contains 1000 samples, each with three features. A random state is set to ensure reproducibility.

In [2]:
X, y = make_classification(
    n_samples=1000, 
    n_features=3, 
    n_informative=3, 
    n_redundant=0, 
    n_classes=2, 
    random_state=42
)

### Step 3: Examine the Dataset
Examining the feature matrix `X`, we observe three features per data point.

In [3]:
print("Feature matrix shape:", X.shape)

Feature matrix shape: (1000, 3)


In [4]:
print("First 5 data points:\n", X[:5])

First 5 data points:
 [[ 0.52170639  0.72685697  2.50410248]
 [-1.01971039 -0.74986039 -0.71346315]
 [-1.89222477 -1.03217141  1.54176061]
 [ 0.62580061 -0.86628346 -0.35092274]
 [ 1.03461232  1.64138782  3.18254045]]


In [5]:
print("Target vector shape:", y.shape)

Target vector shape: (1000,)


### Step 4: Perform Train-Test Split
Next, we perform a train-test split on the dataset.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Understanding KNN Parameters

The `KNeighborsClassifier` has several important parameters. The default number of neighbors, denoted as `n_neighbors`, is five. Other important parameters include `weights`, `algorithm`, and `p`.

### Understanding the `p` Parameter
The `p` parameter is the power parameter of the Minkowski distance, which is the default metric, and its default value is 2. Its value determines the distance metric used:
* When `p=2`, the **Euclidean distance** is used.
* When `p=1`, the **Manhattan distance** is used.

Selecting the appropriate `p` value depends on the dataset and requires hyperparameter tuning.
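
To make the role of `p` concrete, here is a minimal sketch of the Minkowski distance written with NumPy. The helper `minkowski_distance` and the two sample points are hypothetical and only serve as an illustration; this is not scikit-learn's internal implementation.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Two hypothetical points with three features each.
point_a = [0.0, 0.0, 0.0]
point_b = [3.0, 4.0, 0.0]

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4| + |0| = 7
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2 + 0^2) = 5

print("Manhattan (p=1):", manhattan)
print("Euclidean (p=2):", euclidean)
```

The same pair of points is 7 units apart under `p=1` and 5 units apart under `p=2`, so the choice of `p` can change which training points count as the nearest neighbors.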

### Algorithms for Neighbor Search
The `algorithm` parameter accepts `auto`, `ball_tree`, `kd_tree`, or `brute`:
* `ball_tree` and `kd_tree` build tree structures over the training data so that a query can skip many distance calculations, which reduces query time on larger datasets.
* `brute` computes the distance from the query point to every training point.
* `auto` attempts to select the most appropriate algorithm based on the data passed to the `fit` method.
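
The search algorithm affects how the neighbors are found, not which neighbors are found. The sketch below, which recreates the dataset and split from the earlier steps, fits one classifier per algorithm and compares their test predictions. On this data the three are expected to agree, although exact ties in distance could in principle be broken differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same synthetic dataset and split as in the steps above.
X, y = make_classification(
    n_samples=1000, n_features=3, n_informative=3,
    n_redundant=0, n_classes=2, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit one model per search algorithm and store its test predictions.
predictions = {}
for algorithm in ["brute", "kd_tree", "ball_tree"]:
    model = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    model.fit(X_train, y_train)
    predictions[algorithm] = model.predict(X_test)

# Compare every algorithm's predictions against the brute-force baseline.
same_as_brute = {
    name: bool(np.array_equal(pred, predictions["brute"]))
    for name, pred in predictions.items()
}
print(same_as_brute)
```

In practice the choice mainly matters for speed on large datasets, which is why leaving `algorithm='auto'` is a reasonable starting point.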

### `weights` Parameter
The `weights` parameter can be set to `uniform` or `distance`. This affects how neighbor contributions are weighted during classification or regression. 
* `uniform`: All neighbors are weighted equally.
* `distance`: Closer neighbors have a greater influence.
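
The difference between the two options can be illustrated with a hand-computed vote. The distances and labels below are hypothetical values for the three nearest neighbors of a single query point. With `distance`, each vote is weighted by the inverse of the neighbor's distance.

```python
import numpy as np

# Hypothetical 3 nearest neighbors of one query point.
distances = np.array([0.5, 2.0, 2.5])  # distance from the query to each neighbor
labels = np.array([1, 0, 0])           # class label of each neighbor

# uniform: every neighbor casts exactly one vote.
uniform_votes = np.bincount(labels, minlength=2)

# distance: every neighbor's vote is weighted by 1 / distance.
distance_votes = np.bincount(labels, weights=1.0 / distances, minlength=2)

uniform_prediction = int(np.argmax(uniform_votes))
distance_prediction = int(np.argmax(distance_votes))

print("Votes (uniform): ", uniform_votes, "-> class", uniform_prediction)
print("Votes (distance):", distance_votes, "-> class", distance_prediction)
```

With uniform weights, class 0 wins two votes to one. With distance weights, the single very close neighbor of class 1 contributes a weight of 2.0, which outweighs the combined 0.9 of the two farther neighbors, so the prediction flips to class 1.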

## Implementing K-Nearest Neighbor Classifier

### Step 5: Initialize the Classifier
We instantiate `KNeighborsClassifier` with `n_neighbors=5`, `algorithm='auto'`, and the default `p=2` for Euclidean distance.

In [7]:
classifier = KNeighborsClassifier(n_neighbors=5, algorithm='auto', p=2)

### Step 6: Fit the Classifier
We then fit the classifier on the training data.

In [8]:
classifier.fit(X_train, y_train)

### Step 7: Perform Predictions
Predictions are made on the test set.

In [9]:
y_pred = classifier.predict(X_test)

### Step 8: Evaluate the Model (Confusion Matrix)
We assess performance with several evaluation metrics, starting with the confusion matrix. Each row corresponds to a true class and each column to a predicted class.

In [10]:
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

Confusion Matrix:
 [[152   8]
 [  5 135]]


### Step 9: Evaluate the Model (Accuracy Score)
Next, we check the accuracy. An accuracy of approximately 95.7% is achieved.

In [11]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy Score:", accuracy)

Accuracy Score: 0.9566666666666667
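
As a sanity check, the accuracy can be recovered from the confusion matrix printed in Step 8. The sketch below hard-codes that matrix, so it runs without refitting the model. Accuracy is the sum of the diagonal (correct predictions) divided by the total number of test samples.

```python
import numpy as np

# Confusion matrix copied from the Step 8 output.
conf_matrix = np.array([[152, 8],
                        [5, 135]])

correct = np.trace(conf_matrix)  # 152 + 135 correct predictions
total = conf_matrix.sum()        # all test samples
accuracy_from_matrix = correct / total

print(f"{correct} / {total} = {accuracy_from_matrix:.4f}")
```

This gives 287 / 300, which matches the value reported by `accuracy_score`.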


### Step 10: Evaluate the Model (Classification Report)
Finally, we view the detailed classification report.

In [12]:
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.95      0.96       160
           1       0.94      0.96      0.95       140

    accuracy                           0.96       300
   macro avg       0.96      0.96      0.96       300
weighted avg       0.96      0.96      0.96       300
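
The per-class numbers in the report can also be derived by hand from the confusion matrix in Step 8. The following sketch hard-codes that matrix and relies on the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the Step 8 output.
conf_matrix = np.array([[152, 8],
                        [5, 135]])

true_positives = np.diag(conf_matrix)
predicted_counts = conf_matrix.sum(axis=0)  # column sums: times each class was predicted
support = conf_matrix.sum(axis=1)           # row sums: true samples per class

precision = true_positives / predicted_counts
recall = true_positives / support
f1 = 2 * precision * recall / (precision + recall)

for cls in range(2):
    print(
        f"class {cls}: precision={precision[cls]:.2f} "
        f"recall={recall[cls]:.2f} f1={f1[cls]:.2f} support={support[cls]}"
    )
```

Rounded to two decimals, these values reproduce the rows for classes 0 and 1 in the report.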



## Task: Hyperparameter Tuning with GridSearchCV
You are encouraged to perform hyperparameter tuning using `GridSearchCV` to find the optimal value of `k` (i.e., `n_neighbors`). For example, changing `k` from 5 to 6 may affect accuracy. Let's test that specific case.

In [13]:
# Example: Testing k=6
classifier_k6 = KNeighborsClassifier(n_neighbors=6)
classifier_k6.fit(X_train, y_train)
y_pred_k6 = classifier_k6.predict(X_test)
accuracy_k6 = accuracy_score(y_test, y_pred_k6)
print(f"Accuracy with k=6: {accuracy_k6:.4f}")

Accuracy with k=6: 0.9600
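
One possible way to approach the task is sketched below. The parameter grid (values of `n_neighbors` from 1 to 15, both `weights` options, and `p` of 1 or 2) and the choice of 5-fold cross-validation are assumptions made for illustration, not settings prescribed by the lecture. The dataset and split are recreated so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same synthetic dataset and split as in Part 1.
X, y = make_classification(
    n_samples=1000, n_features=3, n_informative=3,
    n_redundant=0, n_classes=2, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Assumed search space; adjust the ranges to suit your own data.
param_grid = {
    "n_neighbors": list(range(1, 16)),
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}

# Cross-validation runs on the training set only, so the test set stays unseen.
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# After fitting, the grid object predicts with the refitted best estimator.
test_accuracy = grid.score(X_test, y_test)

print("Best parameters:", grid.best_params_)
print(f"Best cross-validated accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy of the best model: {test_accuracy:.4f}")
```

Selecting `k` through cross-validation on the training data, rather than by comparing test accuracies directly, keeps the test set as an unbiased final check.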


--- 
# Part 2: K-Nearest Neighbor (KNN) Regressor

### Step 1: Import Additional Libraries
Now we import the libraries needed for regression.

In [14]:
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

### Step 2: Create Regression Dataset
Similarly, for regression tasks, we create a dataset with two features and 1000 data points.

In [15]:
X_reg, y_reg = make_regression(
    n_samples=1000, 
    n_features=2, 
    noise=10, 
    random_state=42
)

### Step 3: Train-Test Split for Regression
After creating the data, we perform a train-test split.

In [16]:
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

### Step 4: Initialize the Regressor
We instantiate the `KNeighborsRegressor` with `n_neighbors=6` and `algorithm='auto'`.

In [17]:
regressor = KNeighborsRegressor(n_neighbors=6, algorithm='auto')

### Step 5: Fit the Regressor

In [18]:
regressor.fit(X_reg_train, y_reg_train)

### Step 6: Make Predictions

In [19]:
y_reg_pred = regressor.predict(X_reg_test)
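
To see what `predict` does for regression, the sketch below recreates the regression dataset and checks that, with the default `weights='uniform'`, each prediction equals the mean target value of the six nearest training points. It uses the `kneighbors` method to look up the neighbor indices. The variable names `neighbor_idx`, `manual_pred`, and `model_pred` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Same regression dataset and split as in the steps above.
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=2, noise=10, random_state=42
)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

regressor = KNeighborsRegressor(n_neighbors=6, algorithm="auto")
regressor.fit(X_reg_train, y_reg_train)

# Indices of the 6 nearest training points for the first 3 test points.
_, neighbor_idx = regressor.kneighbors(X_reg_test[:3])

# Average the neighbors' targets by hand and compare with predict().
manual_pred = y_reg_train[neighbor_idx].mean(axis=1)
model_pred = regressor.predict(X_reg_test[:3])

print("Manual mean of neighbors:", manual_pred)
print("regressor.predict:       ", model_pred)
```

With `weights='distance'`, the plain mean would be replaced by an average weighted by inverse distance.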

### Step 7: Evaluate the Regressor (R-squared)
With predictions made on the test set, we calculate the evaluation metrics, starting with R-squared. An R-squared value of approximately 0.92 is achieved.

In [20]:
r2 = r2_score(y_reg_test, y_reg_pred)
print(f"R-squared (R2) Score: {r2:.4f}")

R-squared (R2) Score: 0.9194


### Step 8: Evaluate the Regressor (MAE)
We also check the Mean Absolute Error.

In [21]:
mae = mean_absolute_error(y_reg_test, y_reg_pred)
print(f"Mean Absolute Error (MAE): {mae:.4f}")

Mean Absolute Error (MAE): 9.1528


### Step 9: Evaluate the Regressor (MSE)
Finally, we compute the Mean Squared Error.

In [22]:
mse = mean_squared_error(y_reg_test, y_reg_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

Mean Squared Error (MSE): 129.8552
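
For reference, the three regression metrics can be written out directly from their definitions. The sketch below uses a small set of hypothetical target and prediction values, not the lecture's dataset, so that each number is easy to verify by hand.

```python
import numpy as np

# Hypothetical true targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))  # mean absolute error
mse = np.mean(errors ** 2)     # mean squared error
rmse = np.sqrt(mse)            # same units as the target

ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"MAE:  {mae:.4f}")
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R2:   {r2:.4f}")
```

Because MSE is expressed in squared units of the target, its square root (RMSE) is often easier to compare with MAE.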


--- 
## Summary

In this lecture, we covered the fundamentals of k-nearest neighbor classifiers and regressors. Understanding parameters such as `n_neighbors`, `weights`, `algorithm`, and `p` is crucial for effective model implementation. Hyperparameter tuning plays a vital role in optimizing model performance. These concepts will be applied in end-to-end projects involving various classification and regression algorithms to identify the best-performing models.

## Key Takeaways

* The k-nearest neighbor (KNN) algorithm uses parameters such as number of neighbors, weights, algorithm type, and the p value to influence classification and regression outcomes.
* The `p` value determines the distance metric: `p=2` corresponds to Euclidean distance, and `p=1` corresponds to Manhattan distance.
* Ball tree and KD tree speed up neighbor searches by reducing the number of distance calculations, whereas brute force compares the query point with every training point.
* Hyperparameter tuning, such as using `GridSearchCV` to find the best `k` value, is essential for improving model accuracy.