# Assignment 8: K-Nearest Neighbours (KNN) Classification

## Dataset: Zoo Animal Classification

**Objective:** Classify animals into types based on their features.

**Topics Covered:**
- KNN Algorithm
- Distance Metrics
- Choosing the Value of K
- Model Evaluation

---
## Step 1: Import Libraries and Load Data

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

# Load the dataset
df = pd.read_csv('Zoo.csv')

print("Dataset loaded successfully!")
print("Shape:", df.shape)
print("\nFirst 5 rows:")
df.head()

---
## Step 2: Data Analysis and Visualization

In [None]:
# Check data info
print("=== Dataset Info ===")
print(df.dtypes)
print("\n=== Missing Values ===")
print(df.isnull().sum())

In [None]:
# Target variable distribution (animal type)
print("=== Animal Type Distribution ===")
type_counts = df['type'].value_counts().sort_index()
print(type_counts)

# Animal types: 1=Mammal, 2=Bird, 3=Reptile, 4=Fish, 5=Amphibian, 6=Bug, 7=Invertebrate
type_names = ['Mammal', 'Bird', 'Reptile', 'Fish', 'Amphibian', 'Bug', 'Invertebrate']

plt.figure(figsize=(10, 6))
plt.bar(type_names, type_counts.values, color='steelblue', edgecolor='black')
plt.xlabel('Animal Type')
plt.ylabel('Count')
plt.title('Distribution of Animal Types')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('type_distribution.png')
plt.show()

In [None]:
# Feature analysis - check binary features
feature_cols = ['hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic', 
                'predator', 'toothed', 'backbone', 'breathes', 'venomous', 
                'fins', 'legs', 'tail', 'domestic', 'catsize']

print("=== Feature Summary ===")
for col in feature_cols:
    unique_vals = df[col].unique()
    print(col + ":", sorted(unique_vals))

In [None]:
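# Correlation heatmap
# Note: 'type' is a nominal class code (1-7), so its correlations with the
# features are only a rough guide, not a measure of predictive strength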
plt.figure(figsize=(14, 10))
correlation = df[feature_cols + ['type']].corr()
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_matrix.png')
plt.show()

In [None]:
# Check for outliers in 'legs' column (only non-binary feature)
print("=== Legs Distribution ===")
print(df['legs'].value_counts().sort_index())

plt.figure(figsize=(8, 5))
plt.boxplot(df['legs'])
plt.title('Boxplot of Legs')
plt.ylabel('Number of Legs')
plt.savefig('legs_boxplot.png')
plt.show()

---
## Step 3: Data Preprocessing

In [None]:
# Prepare features and target
print("=== Preparing Data ===")

# Features (X) - all columns except 'animal name' and 'type'
X = df[feature_cols]

# Target (y)
y = df['type']

print("Features shape:", X.shape)
print("Target shape:", y.shape)

In [None]:
# Split data into training and testing sets (80-20) before scaling,
# so that the scaler never sees the test rows (avoids data leakage)
print("=== Train-Test Split ===")

X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set size:", len(X_train_raw))
print("Testing set size:", len(X_test_raw))

In [None]:
# Scale features (important for KNN since it uses distance)
print("=== Feature Scaling ===")

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)  # fit on the training rows only
X_test = scaler.transform(X_test_raw)        # reuse the training mean and std

legs_idx = feature_cols.index('legs')

print("Before scaling (training set):")
print("  Mean:", round(X_train_raw['legs'].mean(), 2))
print("  Std:", round(X_train_raw['legs'].std(ddof=0), 2))

print("\nAfter scaling (training set):")
print("  Mean:", round(X_train[:, legs_idx].mean(), 2))
print("  Std:", round(X_train[:, legs_idx].std(), 2))
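
Scaling matters for KNN because a feature with a wide range, such as `legs` (0 to 8), can dominate the distance over the 0/1 features. The sketch below uses a made-up five-row table (hypothetical values, not rows from `Zoo.csv`) to show how the nearest neighbour of a two-legged mammal can change once the columns are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical mini-dataset (assumption): columns are [milk, feathers, legs]
X_demo = np.array([
    [1, 0, 4],   # mammal-like, 4 legs
    [1, 0, 2],   # mammal-like, 2 legs
    [0, 1, 2],   # bird-like, 2 legs
    [0, 0, 0],   # fish-like, 0 legs
    [0, 0, 8],   # invertebrate-like, 8 legs
], dtype=float)

def euclidean_distance(a, b):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Unscaled distances from row 1 (2-legged mammal)
raw_to_mammal = euclidean_distance(X_demo[1], X_demo[0])  # differs in legs only
raw_to_bird = euclidean_distance(X_demo[1], X_demo[2])    # differs in milk and feathers

# Same distances after standardizing every column
X_demo_scaled = StandardScaler().fit_transform(X_demo)
scaled_to_mammal = euclidean_distance(X_demo_scaled[1], X_demo_scaled[0])
scaled_to_bird = euclidean_distance(X_demo_scaled[1], X_demo_scaled[2])

print("Unscaled: to mammal =", round(raw_to_mammal, 3), "| to bird =", round(raw_to_bird, 3))
print("Scaled:   to mammal =", round(scaled_to_mammal, 3), "| to bird =", round(scaled_to_bird, 3))
```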

---
## Step 4: KNN Implementation and K Selection

In [None]:
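# Find optimal K value
# Caveat: K is chosen here using test-set accuracy, so the accuracy reported
# later for the same test set is an optimistic estimate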
print("=== Finding Optimal K Value ===")

k_values = range(1, 21)
accuracies = []

for k in k_values:
    # Create KNN model
    knn = KNeighborsClassifier(n_neighbors=k)
    
    # Train
    knn.fit(X_train, y_train)
    
    # Predict and evaluate
    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    
    print("K =", k, "| Accuracy:", round(acc, 4))

# Find best K
best_k = k_values[accuracies.index(max(accuracies))]
print("\nBest K:", best_k, "with accuracy:", round(max(accuracies), 4))
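
Choosing K by test-set accuracy, as in the loop above, lets the test set influence the model. One common alternative is to pick K by cross-validation on the training data only. The following is a minimal sketch of that idea, assuming a synthetic dataset from `make_classification` as a stand-in for the training split; the names `X_demo`, `y_demo`, `cv_means`, and `cv_best_k` are illustrative and not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (assumption: not the Zoo data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0
)

k_grid = range(1, 21)
cv_means = []

for k in k_grid:
    # The pipeline refits the scaler inside each fold, so validation rows
    # never contribute to the scaling statistics
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means.append(scores.mean())

# argmax returns the first K that reaches the highest mean accuracy
cv_best_k = k_grid[int(np.argmax(cv_means))]
print("Best K by 5-fold CV:", cv_best_k)
```

With this approach the test set is used once, at the end, to report the accuracy of the chosen K.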

In [None]:
# Plot K vs Accuracy
plt.figure(figsize=(10, 6))
plt.plot(k_values, accuracies, marker='o', color='blue')
plt.axvline(x=best_k, color='red', linestyle='--', label='Best K = ' + str(best_k))
plt.xlabel('K Value')
plt.ylabel('Accuracy')
plt.title('K Value vs Accuracy')
plt.legend()
plt.grid(True)
plt.savefig('k_vs_accuracy.png')
plt.show()

In [None]:
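# Compare different distance metrics
# Note: 'minkowski' with the default p=2 is identical to 'euclidean',
# so p=3 is used here to make the third option a distinct metric
print("=== Comparing Distance Metrics ===")

metric_configs = {
    'Euclidean': {'metric': 'euclidean'},
    'Manhattan': {'metric': 'manhattan'},
    'Minkowski (p=3)': {'metric': 'minkowski', 'p': 3},
}

for name, params in metric_configs.items():
    knn = KNeighborsClassifier(n_neighbors=best_k, **params)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(name + ":", round(acc, 4))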

---
## Step 5: Train Final Model and Evaluate

In [None]:
# Train final model with best K
print("=== Training Final Model ===")

final_knn = KNeighborsClassifier(n_neighbors=best_k, metric='euclidean')
final_knn.fit(X_train, y_train)

# Make predictions
y_pred = final_knn.predict(X_test)

print("Model trained with K =", best_k)

In [None]:
# Evaluate model
print("=== Model Evaluation ===")

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", round(accuracy, 4))
print("Precision:", round(precision, 4))
print("Recall:", round(recall, 4))
print("F1-Score:", round(f1, 4))
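
The `average='weighted'` setting used above weights each class by its number of test samples, so rare classes contribute little to the score. The sketch below, on made-up labels (not Zoo predictions), compares it with `average='macro'`, which treats every class equally. Weighted recall always equals accuracy, which is why the two figures printed above coincide.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels (assumption): class 1 is common, class 2 is rare
y_true_demo = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2]
y_pred_demo = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

# Per-class recall: class 1 = 8/8, class 2 = 1/2
macro_recall = recall_score(y_true_demo, y_pred_demo, average='macro')
weighted_recall = recall_score(y_true_demo, y_pred_demo, average='weighted')
demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print("Macro recall:   ", round(macro_recall, 4))
print("Weighted recall:", round(weighted_recall, 4))
print("Accuracy:       ", round(demo_accuracy, 4))
```

On an imbalanced dataset such as this one, reporting the macro average alongside the weighted one shows whether the small classes are being classified well.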

In [None]:
# Confusion Matrix
print("=== Confusion Matrix ===")

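# Pass all class labels explicitly so the matrix stays 7x7 and lines up with
# type_names even if a class is missing from the test split
cm = confusion_matrix(y_test, y_pred, labels=list(range(1, len(type_names) + 1)))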

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=type_names, yticklabels=type_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.savefig('confusion_matrix.png')
plt.show()

In [None]:
# Classification Report
print("=== Classification Report ===")
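# Explicit labels keep target_names aligned when a class is absent from the
# test split; otherwise classification_report raises a ValueError
print(classification_report(
    y_test, y_pred,
    labels=list(range(1, len(type_names) + 1)),
    target_names=type_names,
    zero_division=0
))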

---
## Step 6: Decision Boundary Visualization

In [None]:
# For visualization, use only 2 features (simplification)
print("=== Decision Boundary (2D Visualization) ===")

# Use two illustrative features (not a ranked selection); both are binary,
# so the data occupy only four distinct points in this plane
X_2d = X[['milk', 'feathers']].values

# Scale
scaler_2d = StandardScaler()
X_2d_scaled = scaler_2d.fit_transform(X_2d)

# Split
X_train_2d, X_test_2d, y_train_2d, y_test_2d = train_test_split(
    X_2d_scaled, y, test_size=0.2, random_state=42
)

# Train KNN with 2D data
knn_2d = KNeighborsClassifier(n_neighbors=best_k)
knn_2d.fit(X_train_2d, y_train_2d)

In [None]:
# Create meshgrid for decision boundary
x_min = X_2d_scaled[:, 0].min() - 1
x_max = X_2d_scaled[:, 0].max() + 1
y_min = X_2d_scaled[:, 1].min() - 1
y_max = X_2d_scaled[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

# Predict on meshgrid
Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundary
plt.figure(figsize=(12, 8))
plt.contourf(xx, yy, Z, alpha=0.4, cmap='viridis')
plt.scatter(X_2d_scaled[:, 0], X_2d_scaled[:, 1], c=y, cmap='viridis', edgecolors='black')
plt.xlabel('Milk (scaled)')
plt.ylabel('Feathers (scaled)')
plt.title('KNN Decision Boundary (K = ' + str(best_k) + ')')
plt.colorbar(label='Animal Type')
plt.savefig('decision_boundary.png')
plt.show()

---
## Interview Questions

### 1. Key Hyperparameters in KNN

| Hyperparameter | Description |
|----------------|-------------|
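| **n_neighbors (K)** | Number of neighbors that vote on each prediction. Small K gives a flexible, jagged boundary that is sensitive to noise (risk of overfitting); large K gives a smoother boundary that can underfit |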
| **weights** | 'uniform' (all equal) or 'distance' (closer neighbors have more weight) |
| **metric** | Distance metric to use (euclidean, manhattan, etc.) |
| **p** | Power parameter for Minkowski metric (p=1 is Manhattan, p=2 is Euclidean) |
| **algorithm** | Algorithm used to find nearest neighbors (auto, ball_tree, kd_tree, brute); 'auto' is the default |
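
The hyperparameters in the table can be tuned together rather than one at a time. The following is a rough sketch using `GridSearchCV`, assuming a synthetic dataset as a stand-in for the training data; the grid values and the step names `scale` and `knn` are illustrative choices, not settings taken from this assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (assumption: not the Zoo data)
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=1
)

# Scaling lives inside the pipeline so it is refit on each training fold
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Illustrative grid: p applies to the default Minkowski metric
# (p=1 is Manhattan, p=2 is Euclidean)
param_grid = {
    'knn__n_neighbors': [1, 3, 5, 7, 9],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 4))
```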

### 2. Distance Metrics in KNN

| Metric | Formula | Use Case |
|--------|---------|----------|
| **Euclidean** | sqrt(sum((x1-x2)^2)) | Most common, good for continuous data |
| **Manhattan** | sum(abs(x1-x2)) | Less sensitive than Euclidean to a large difference in a single feature; often used for high-dimensional or grid-like data |
| **Minkowski** | (sum(abs(x1-x2)^p))^(1/p) | Generalization of Euclidean and Manhattan |
| **Hamming** | Number (or fraction) of positions where x1 and x2 differ | Good for binary/categorical data |
| **Cosine** | 1 - cos(angle) | Good for text/document similarity |
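
The formulas in the table can be checked by hand with NumPy. The sketch below evaluates each metric on two made-up binary vectors (hypothetical values, not rows from the dataset). For 0/1 features, the Manhattan distance equals the Hamming count, and Minkowski reduces to Manhattan at p=1 and to Euclidean at p=2.

```python
import numpy as np

# Two hypothetical binary feature vectors (assumption: illustrative only)
a = np.array([1, 0, 0, 1], dtype=float)
b = np.array([0, 1, 1, 1], dtype=float)

def minkowski_distance(u, v, p):
    """Minkowski distance: (sum(|u - v|^p))^(1/p)."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

d_euclidean = float(np.sqrt(np.sum((a - b) ** 2)))
d_manhattan = float(np.sum(np.abs(a - b)))
d_hamming_count = int(np.sum(a != b))          # number of differing positions
d_hamming_fraction = d_hamming_count / len(a)  # proportion of differing positions
d_cosine = 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("Euclidean:", round(d_euclidean, 4))
print("Manhattan:", d_manhattan)
print("Minkowski p=1:", round(minkowski_distance(a, b, 1), 4))
print("Minkowski p=2:", round(minkowski_distance(a, b, 2), 4))
print("Hamming (count, fraction):", d_hamming_count, d_hamming_fraction)
print("Cosine distance:", round(d_cosine, 4))
```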

---
## Summary

In this assignment, we:

1. **Loaded and analyzed** the Zoo dataset
2. **Preprocessed data** by standardizing features and splitting them 80/20 into training and test sets
3. **Found optimal K** by testing values from 1 to 20
4. **Compared distance metrics** (Euclidean, Manhattan, Minkowski)
5. **Evaluated the model** using accuracy, precision, recall, F1-score
6. **Visualized decision boundaries** using a 2D model trained on two features (milk and feathers)