#Q1: What is the KNN algorithm?

K-Nearest Neighbors (KNN) is a simple, effective supervised learning algorithm for classification and regression. To predict for a test point, it finds the K training points closest to that point and returns the majority class among them (classification) or the average of their target values (regression).
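
To make the principle concrete, here is a minimal from-scratch sketch of a single KNN prediction. The function name `knn_predict`, its `task` argument, and the toy data are made up for illustration; Euclidean distance is assumed, and ties are not handled in any special way.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    if task == "classification":
        # Majority vote among the neighbors' labels
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(y_train[nearest]))


# Toy data (hypothetical): two well-separated groups
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_class = ["a", "a", "a", "b", "b", "b"]
y_value = [1.0, 1.2, 0.8, 5.0, 5.5, 4.5]

label = knn_predict(X_toy, y_class, [1.1, 1.0], k=3)
value = knn_predict(X_toy, y_value, [5.0, 5.0], k=3, task="regression")
print(label)  # a
print(value)  # 5.0
```

There is no training step beyond storing the data; all the work happens at prediction time.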

In [1]:
#1
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Create KNN classifier with K=3
knn_classifier = KNeighborsClassifier(n_neighbors=3)

# Train the classifier
knn_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn_classifier.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 1.0


#Q2: How do you choose the value of K in KNN?

Choosing K in KNN:
The value of K strongly affects KNN performance. A small K makes the model sensitive to noise (high variance), while a large K averages over more neighbors and can smooth away real structure (high bias). K is usually chosen by cross-validation: evaluate a range of values and keep the one with the best mean validation score.

In [2]:
#2
from sklearn.model_selection import cross_val_score

# Define a range of K values to try
k_values = list(range(1, 21))

# Perform 10-fold cross-validation for each K
cv_scores = []
for k in k_values:
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn_classifier, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# Find the optimal K
optimal_k = k_values[cv_scores.index(max(cv_scores))]

print(f"Optimal K: {optimal_k}")

Optimal K: 11
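
The sensitivity of a small K can be seen by comparing training and test accuracy. The sketch below uses synthetic, overlapping two-class data (made up for illustration, not the iris data above). With K=1 every training point is its own nearest neighbor, so training accuracy is perfect by construction, which says nothing about how well the model generalizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, overlapping two-class data (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 2)), rng.normal(1.0, 1.0, size=(150, 2))])
y = np.array([0] * 150 + [1] * 150)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

train_acc = {}
test_acc = {}
for k in (1, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_acc[k] = model.score(X_tr, y_tr)  # accuracy on data the model has stored
    test_acc[k] = model.score(X_te, y_te)   # accuracy on held-out data
    print(f"K={k}: train accuracy={train_acc[k]:.2f}, test accuracy={test_acc[k]:.2f}")
```

This is why K is selected on validation folds rather than on the training score.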


#Q3: What is the difference between KNN classifier and KNN regressor?

Difference between KNN Classifier and Regressor:

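KNN Classifier: Used for classification tasks. The output is a class label, chosen by majority vote among the K nearest neighbors.
KNN Regressor: Used for regression tasks. The output is a continuous value, computed as the average of the K nearest neighbors' targets.

In scikit-learn, both share the same interface; you choose between them by the class you import: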

from sklearn.neighbors import KNeighborsClassifier  # for classification

from sklearn.neighbors import KNeighborsRegressor  # for regression
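
As a small illustration of the difference, the sketch below fits both estimators on the same one-dimensional toy inputs (made up for illustration) and queries the same point. The neighbors found are identical; only the way their targets are combined differs.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy inputs (hypothetical) with one label target and one continuous target
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_label = np.array([0, 0, 0, 1, 1, 1])
y_cont = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_label)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_cont)

# The three nearest training points to 2.1 are x=2, x=3 and x=1
query = np.array([[2.1]])
class_pred = clf.predict(query)[0]  # vote over labels 0, 1, 0
value_pred = reg.predict(query)[0]  # mean of 2.0, 3.0, 1.0
print(class_pred)  # 0
print(value_pred)  # 2.0
```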

#Q4: How do you measure the performance of KNN?

Measuring KNN Performance:
Common metrics for classification include accuracy, precision, recall, and F1 score. For regression, Mean Squared Error (MSE) and R-squared are typical.

In [6]:
#4
from sklearn.metrics import classification_report, confusion_matrix

# Load the iris dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Create KNN classifier with K=3
knn_classifier = KNeighborsClassifier(n_neighbors=3)

# Train the classifier
knn_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn_classifier.predict(X_test)

# Classification report and confusion matrix
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Confusion Matrix:
 [[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
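
The regression metrics named above can be computed by hand in a few lines. The sketch below uses made-up true and predicted values (not output from any model in this notebook) to show what MSE and R-squared measure.

```python
import numpy as np

# Hypothetical true targets and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 8.0])

# MSE: average squared residual
mse = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 minus the ratio of residual variation to total variation
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE: {mse}")       # MSE: 0.5625
print(f"R^2: {r2:.3f}")    # R^2: 0.847
```

`sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score` compute the same quantities.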


#Q5: What is the curse of dimensionality in KNN?

Curse of Dimensionality in KNN:
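As the number of features (dimensions) increases, the data become sparse and pairwise distances concentrate: the nearest and farthest neighbors of a point end up almost equally far away, which makes the notion of proximity that KNN relies on less meaningful.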

In [8]:
#5
# Fit KNN on a synthetic dataset with many more features (1000) than samples (100)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Generate a synthetic dataset
X_high_dimension, y_high_dimension = make_classification(n_samples=100, n_features=1000, n_informative=10, n_classes=2, random_state=42)

# Split the data into training and testing sets
X_train_high_dimension, X_test_high_dimension, y_train_high_dimension, y_test_high_dimension = train_test_split(
    X_high_dimension, y_high_dimension, test_size=0.2, random_state=42
)

# Use KNN on the high-dimensional dataset
knn_classifier_high_dimension = KNeighborsClassifier(n_neighbors=3)
knn_classifier_high_dimension.fit(X_train_high_dimension, y_train_high_dimension)
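
The loss of meaningful proximity can be measured directly. The sketch below draws uniform random points (an assumption chosen for illustration; real data will behave differently in detail) and reports the relative contrast between the farthest and nearest point from a random query as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
contrast = {}
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))  # 500 points uniform in the unit hypercube
    query = rng.random(d)
    dist = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest
    contrast[d] = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:4d}  relative contrast={contrast[d]:.2f}")
```

A contrast near zero means the "nearest" neighbor is barely closer than any other point, so the neighbors carry little information.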

#Q6: How do you handle missing values in KNN?

Handling Missing Values in KNN:
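Impute missing values before applying KNN, because distances cannot be computed on rows with missing entries. Simple options are mean or median imputation; KNN-based imputation, which fills a gap from the most similar rows, is another.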

In [10]:
#6
from sklearn.impute import SimpleImputer

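# Assuming X_train has missing values (encoded as NaN)
imputer = SimpleImputer(strategy='mean')
# Learn the column means on the training set, then reuse them for the test set
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)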
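
For the KNN-based option mentioned above, scikit-learn provides `KNNImputer`. The sketch below is a minimal example on a small made-up array; the choice of `n_neighbors=2` is arbitrary.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small hypothetical array with two missing entries
X_missing = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is filled from the 2 most similar rows that have that feature
knn_imputer = KNNImputer(n_neighbors=2)
X_filled = knn_imputer.fit_transform(X_missing)
print(X_filled)
```

Observed entries are left unchanged; only the NaN positions are filled.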

#Q7: Compare and contrast the performance of the KNN classifier and regressor. Which one is better for which type of problem?

Comparison of KNN Classifier and Regressor:

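Neither is better in general; the type of target variable decides which one applies. Their scores are not directly comparable either, since accuracy and mean squared error measure different things.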

Use KNN Classifier for categorical target variables.

Use KNN Regressor for continuous target variables.

In [18]:
#7

# Example for classification
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Initialize KNN Classifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = knn_classifier.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')


# Example for regression

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Load the diabetes dataset
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2, random_state=42)

# Initialize KNN Regressor
knn_regressor = KNeighborsRegressor(n_neighbors=3)

# Train the model
knn_regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = knn_regressor.predict(X_test)

# Evaluate mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

Accuracy: 1.0
Mean Squared Error: 3364.3932584269664


#Q8: What are the strengths and weaknesses of the KNN algorithm for classification and regression tasks, and how can these be addressed? 

Strengths and Weaknesses of KNN:

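Strengths: Simple, easy to understand, no explicit training phase, and effective for small to medium-sized datasets.

Weaknesses: Sensitive to outliers, irrelevant features, and feature scale; prediction is computation-intensive for large datasets because each query is compared against the stored training points.

These can be addressed by removing or down-weighting outliers, scaling features, reducing dimensionality, and using tree-based neighbor search (KD-tree or ball tree) to speed up queries.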

In [19]:
#8
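# Example of handling outliers with LocalOutlierFactor
from sklearn.neighbors import LocalOutlierFactor

# fit_predict labels each training point: 1 for inliers, -1 for outliers
outlier_detector = LocalOutlierFactor(n_neighbors=20)
outliers = outlier_detector.fit_predict(X_train)

# Keep only the inliers (features and targets filtered with the same mask)
X_train_no_outliers = X_train[outliers == 1]
y_train_no_outliers = y_train[outliers == 1]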
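
The computational cost on large datasets can be reduced with a tree-based neighbor search, which scikit-learn exposes through the `algorithm` parameter. The sketch below, on synthetic data made up for illustration, checks that a KD-tree index gives the same predictions as a brute-force search; it does not measure speed.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic low-dimensional data (hypothetical); trees help most in few dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_query = rng.normal(size=(200, 5))

# Same model, two neighbor-search strategies
brute = KNeighborsClassifier(n_neighbors=5, algorithm="brute").fit(X, y)
tree = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X, y)

same = np.array_equal(brute.predict(X_query), tree.predict(X_query))
print(f"Identical predictions: {same}")
```

The search strategy changes how neighbors are found, not which neighbors are found, so accuracy is unaffected.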

#Q9: What is the difference between Euclidean distance and Manhattan distance in KNN?

Difference between Euclidean and Manhattan distance:

Euclidean distance: Straight-line distance between two points.

Manhattan distance: Sum of the absolute differences between corresponding coordinates.

In [21]:
#9 
# Example using scipy for distance calculation
from scipy.spatial.distance import euclidean, cityblock

point1 = [1, 2, 3]
point2 = [4, 5, 6]

euclidean_distance = euclidean(point1, point2)
manhattan_distance = cityblock(point1, point2)

print(f"Euclidean Distance: {euclidean_distance}")
print(f"Manhattan Distance: {manhattan_distance}")

Euclidean Distance: 5.196152422706632
Manhattan Distance: 9
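
Both distances are special cases of the Minkowski distance with exponent p: p=1 gives Manhattan and p=2 gives Euclidean. The helper `minkowski_distance` below is a hypothetical name written for illustration; it reproduces the two values printed above.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance: (sum of |a_i - b_i|^p) raised to the power 1/p."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


point1 = [1, 2, 3]
point2 = [4, 5, 6]

manhattan = minkowski_distance(point1, point2, p=1)
euclid = minkowski_distance(point1, point2, p=2)
print(f"p=1 (Manhattan): {manhattan}")     # 9.0
print(f"p=2 (Euclidean): {euclid:.6f}")    # 5.196152
```

In scikit-learn, `KNeighborsClassifier` and `KNeighborsRegressor` take the same exponent through their `p` parameter (default 2), so `KNeighborsClassifier(n_neighbors=3, p=1)` uses Manhattan distance.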


#Q10: What is the role of feature scaling in KNN?

Role of Feature Scaling in KNN:
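KNN relies on distances, so a feature with a large numeric range dominates the result. Feature scaling puts all features on a comparable scale so that each contributes comparably to the distance computation.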
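
A small numeric sketch shows the effect. The data below are made up for illustration (height in metres, income in dollars), and standardization is done by hand with the column mean and standard deviation, which is the same transform `StandardScaler` applies.

```python
import numpy as np

# Hypothetical data: column 0 is height in metres, column 1 is income in dollars
X = np.array([
    [1.81, 51000.0],
    [1.50, 50100.0],
    [1.65, 48000.0],
    [1.95, 53000.0],
])
query = np.array([1.80, 50000.0])


def nearest_index(points, q):
    """Index of the row of `points` closest to `q` in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))


# Standardize with statistics computed from X, and apply them to the query too
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std
query_scaled = (query - mean) / std

nearest_raw = nearest_index(X, query)
nearest_scaled = nearest_index(X_scaled, query_scaled)
print(nearest_raw)     # 1: income differences swamp the 0.30 m height gap
print(nearest_scaled)  # 0: both features now count
```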

In [23]:
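#10
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Reload iris: X_train and y_train currently hold the diabetes regression data from #7
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN on scaled data
knn_classifier_scaled = KNeighborsClassifier(n_neighbors=3)
knn_classifier_scaled.fit(X_train_scaled, y_train)
accuracy_scaled = knn_classifier_scaled.score(X_test_scaled, y_test)
print(f"Accuracy with Scaling: {accuracy_scaled}")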
