# k-NN Classification on the Iris Dataset

This notebook walks through a beginner-friendly k-NN classification pipeline using scikit-learn.

## Intuition
- k-NN is a distance-based classifier: a new point takes the label favored by its k nearest training points.
- We use **Euclidean distance** and weight neighbors by inverse distance so closer points matter more.
- **Scaling is essential** because features with larger numeric ranges can dominate distance calculations. We standardize each feature to zero mean and unit variance with `StandardScaler`.
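
The points above can be sketched directly in scikit-learn. The internals of `create_knn_pipeline` are not shown in this notebook, so `build_knn_pipeline` below is a hypothetical stand-in, and `n_neighbors=5` is an assumed default rather than a value taken from the project.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_knn_pipeline(n_neighbors: int = 5) -> Pipeline:
    """Hypothetical stand-in for create_knn_pipeline (n_neighbors=5 is assumed)."""
    return Pipeline([
        # Standardize each feature to zero mean and unit variance
        ("scaler", StandardScaler()),
        # Euclidean distance, neighbors weighted by inverse distance
        ("knn", KNeighborsClassifier(
            n_neighbors=n_neighbors,
            weights="distance",
            metric="euclidean",
        )),
    ])


X_demo, y_demo = load_iris(return_X_y=True)
sketch_pipeline = build_knn_pipeline().fit(X_demo, y_demo)

# Inspect what the scaler step does to the feature columns
X_scaled = sketch_pipeline.named_steps["scaler"].transform(X_demo)
print("column means:", X_scaled.mean(axis=0).round(6))
print("column stds: ", X_scaled.std(axis=0).round(6))
```

Putting the scaler inside the pipeline means its mean and variance are learned from the training data only whenever the pipeline is fit, which avoids leaking test-set statistics into the model.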

```python
import pandas as pd
from app.data import load_iris_dataset
from app.preprocess import split_data
from app.model import create_knn_pipeline
from app.evaluate import evaluate_classification
from app.visualize import plot_confusion_matrix, plot_decision_boundary

# Load features and labels, then hold out a test set
X, y = load_iris_dataset()
X_train, X_test, y_train, y_test = split_data(X, y)

# Fit the scaling + k-NN pipeline on the training split only
pipeline = create_knn_pipeline()
pipeline.fit(X_train, y_train)

# Score predictions on the held-out test split
y_pred = pipeline.predict(X_test)
metrics, cm = evaluate_classification(y_test.to_numpy(), y_pred)

print('Metrics (macro-averaged):')
for name, value in metrics.items():
    print(f'- {name}: {value:.3f}')

# Class names are assumed to follow the label encoding order (0, 1, 2)
class_names = ['setosa', 'versicolor', 'virginica']
plot_confusion_matrix(cm, class_names, 'outputs/notebook_confusion_matrix.svg')
plot_decision_boundary(X.to_numpy(), y.to_numpy(), pipeline, 'outputs/notebook_decision_boundary.svg')

# Display the metrics as a one-row table
pd.DataFrame(metrics, index=['score'])
```
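
For readers without access to `app.evaluate`, the macro-averaged metrics and confusion matrix used above can be approximated with scikit-learn's metric functions. This is a sketch under assumptions: `evaluate_sketch` is a hypothetical name, and the exact set of metrics returned by `evaluate_classification` is not confirmed by this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def evaluate_sketch(y_true, y_hat):
    """Hypothetical stand-in: macro-averaged scores plus a confusion matrix."""
    scores = {
        "accuracy": accuracy_score(y_true, y_hat),
        # average="macro" computes the metric per class, then takes the unweighted mean
        "precision": precision_score(y_true, y_hat, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_hat, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_hat, average="macro", zero_division=0),
    }
    # Rows are true classes, columns are predicted classes
    return scores, confusion_matrix(y_true, y_hat)


# Toy labels: one class-1 sample is misclassified as class 2
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 2, 2]
toy_scores, toy_cm = evaluate_sketch(toy_true, toy_pred)

for name, value in toy_scores.items():
    print(f"- {name}: {value:.3f}")
print(toy_cm)
```

Macro averaging gives each class equal weight regardless of how many samples it has, which suits Iris because its three classes are balanced.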

## Visualizations
- `outputs/notebook_confusion_matrix.svg`: Confusion matrix heatmap.
- `outputs/notebook_decision_boundary.svg`: 2D PCA projection with k-NN decision regions.
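
One common way to obtain decision regions on a 2D PCA projection is sketched below. It is not necessarily what `plot_decision_boundary` does internally: here a separate k-NN classifier is fit in the projected space purely for illustration, and `n_neighbors=5`, the grid resolution, and the padding are assumed values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Standardize, then project the four features onto the first two principal components
X_2d = make_pipeline(StandardScaler(), PCA(n_components=2)).fit_transform(X_iris)

# Assumption: a k-NN model fit in the 2D space, used only to draw the regions
knn_2d = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_2d, y_iris)

# Predict a class for every point of a regular grid covering the projected data
pad = 0.5
xs = np.linspace(X_2d[:, 0].min() - pad, X_2d[:, 0].max() + pad, 200)
ys = np.linspace(X_2d[:, 1].min() - pad, X_2d[:, 1].max() + pad, 200)
xx, yy = np.meshgrid(xs, ys)
regions = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

print(regions.shape)
```

The arrays `xx`, `yy`, and `regions` can then be passed to a filled-contour plot (for example matplotlib's `contourf`), with the projected samples overlaid as a scatter plot colored by `y_iris`.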

## Conclusions
- Scaling is required for reliable distance comparisons.
- k-NN performs well on this small, low-dimensional dataset.
- Try experimenting with `n_neighbors`, `weights`, and `metric` to see how decision boundaries change.
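
A cross-validated grid search is one way to run the experiment suggested in the last point. The sketch below loads Iris straight from scikit-learn instead of using the project's `app` helpers; the parameter grid, the 80/20 split, and `random_state=42` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42
)

search = GridSearchCV(
    Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())]),
    # Step-name prefixes ("knn__") route each parameter to the classifier step
    param_grid={
        "knn__n_neighbors": [1, 3, 5, 7, 9],
        "knn__weights": ["uniform", "distance"],
        "knn__metric": ["euclidean", "manhattan"],
    },
    cv=5,
    scoring="f1_macro",
)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
# search.score uses the same scorer as the search, so this is macro F1
print(f"held-out macro F1: {search.score(X_te, y_te):.3f}")
```

Because the scaler sits inside the pipeline, it is refit on each cross-validation training fold, so the validation folds never influence the scaling statistics.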