# Module 2: K-Nearest Neighbors Practice

## Introduction
In this notebook, you'll learn how to implement a K-Nearest Neighbors (KNN) classifier using scikit-learn on a simple 2D dataset.

## Initial Knowledge Check
1. What is the main idea behind K-Nearest Neighbors?
2. How does the choice of `k` influence model bias and variance?
3. Describe one advantage and one disadvantage of KNN.

## 1. Load the Data
Load the dataset into a DataFrame and separate the two feature columns from the target.


In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv('./data/knearest.csv')
X = df[['feature1', 'feature2']]
y = df['target']
df.head()

## 2. Exploratory Data Analysis
Visualize the two features and color points by their class.


In [None]:
import matplotlib.pyplot as plt

plt.scatter(X['feature1'], X['feature2'], c=y, cmap='viridis', edgecolor='k')
plt.title('2D Scatter Plot of KNN Demo Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

## 3. Train a KNN Classifier
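We'll train a KNN model with `k=3` and evaluate it on the same data it was trained on. This training accuracy is optimistic: each point counts among its own nearest neighbors, so the score overstates how well the model generalizes.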


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Instantiate and train
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Predict on the training data and report training accuracy
y_pred = knn.predict(X)
acc = accuracy_score(y, y_pred)
print(f"Training accuracy with k=3: {acc:.2f}")
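
To see why accuracy measured on the training data should be read with caution, the sketch below fits KNN for a few values of `k` and scores each model on its own training set. It uses a synthetic dataset from `make_blobs` as a stand-in (an assumption, not the notebook's `knearest.csv`), and the `_demo` names are hypothetical. With `k=1` and no duplicate points, every sample is its own nearest neighbor, so the training accuracy is perfect regardless of how well the model would do on new data.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's data (assumption: two overlapping 2D classes)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

# Fit and score on the same data for several values of k
demo_scores = {}
for k in (1, 3, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)
    demo_scores[k] = accuracy_score(y_demo, model.predict(X_demo))
    print(f"k={k:>2}: training accuracy = {demo_scores[k]:.2f}")
```

A perfect score at `k=1` says nothing about generalization, which is why the bonus exercise below asks for a held-out test set.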

## 4. Exercise for the Student
**Task:**  
Find the best value of `k` between 1 and 20 (inclusive).  
- Use a simple loop to train a KNN model for each `k`.  
- Compute the training accuracy for each model.  
- Plot accuracy vs. `k`.  
- Which `k` gives the highest accuracy? Would you trust that choice on unseen data?  
- **Bonus:** Split the data into a train/test set (e.g., 80/20) and compare training vs. test accuracy curves.


## 5. Solution
Below is one possible solution, including the bonus part.


In [None]:
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Search for best k
train_acc = []
test_acc = []
ks = range(1, 21)

for k in ks:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    train_acc.append(accuracy_score(y_train, model.predict(X_train)))
    test_acc.append(accuracy_score(y_test,  model.predict(X_test)))

# Plot results
plt.plot(ks, train_acc, label='Train Accuracy')
plt.plot(ks, test_acc,  label='Test Accuracy')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.title('KNN: Train vs Test Accuracy')
plt.legend()
plt.show()

# Best k: index() returns the first maximum, so ties go to the smallest k
best_k = ks[test_acc.index(max(test_acc))]
print(f"Best k on the test set: {best_k}")
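
The solution above picks `k` by test accuracy, which means the test set is no longer an untouched estimate of generalization. One common alternative, sketched here, is to choose `k` with cross-validation on the training split and score the test set only once at the end. This is a minimal illustration rather than part of the original solution: it assumes synthetic `make_blobs` data and 5-fold cross-validation, and names such as `cv_means` and `best_k_cv` are hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's data (assumption, not knearest.csv)
X_cv, y_cv = make_blobs(n_samples=300, centers=2, cluster_std=3.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_cv, y_cv, test_size=0.2, random_state=42)

# Mean 5-fold cross-validated accuracy on the training split for each k
cv_means = {}
for k in range(1, 21):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5)
    cv_means[k] = scores.mean()

# Pick the k with the highest mean CV accuracy (ties go to the smallest k)
best_k_cv = max(cv_means, key=cv_means.get)

# Refit on the full training split and touch the test set only once
final_model = KNeighborsClassifier(n_neighbors=best_k_cv).fit(X_tr, y_tr)
test_score = final_model.score(X_te, y_te)
print(f"k chosen by cross-validation: {best_k_cv}, test accuracy: {test_score:.2f}")
```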

---
### Next Steps
- Revisit the theory section on distance metrics: try passing `metric="manhattan"` or `metric="chebyshev"` to `KNeighborsClassifier` instead of using the default Euclidean distance.
- Prepare for Decision Trees by thinking about how KNN and tree-based models differ in handling non-linear boundaries.
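
As a starting point for the distance-metric experiment, the following sketch compares three metrics through the `metric` parameter of `KNeighborsClassifier`. It assumes synthetic `make_blobs` data and `k=3`, and the `_m` names and `metric_scores` dictionary are hypothetical. Which metric scores best depends on the data, so treat the printed numbers as an illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's data (assumption, not knearest.csv)
X_m, y_m = make_blobs(n_samples=300, centers=2, cluster_std=3.0, random_state=0)
X_tr_m, X_te_m, y_tr_m, y_te_m = train_test_split(
    X_m, y_m, test_size=0.2, random_state=42
)

# Same k, different distance metrics
metric_scores = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    model = KNeighborsClassifier(n_neighbors=3, metric=metric)
    model.fit(X_tr_m, y_tr_m)
    metric_scores[metric] = model.score(X_te_m, y_te_m)
    print(f"{metric:>10}: test accuracy = {metric_scores[metric]:.2f}")
```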
