# K-Nearest Neighbors (KNN) - Wine Quality Dataset

## Overview
K-Nearest Neighbors is a non-parametric, instance-based supervised learning algorithm.
It predicts the majority class (for classification) or the average target value (for regression)
of the k nearest neighbors in the feature space.

This notebook uses KNN to classify red wines as high-quality (quality score of 6 or higher) or low-quality (score below 6).
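
To make the voting and averaging rules concrete, here is a minimal pure-Python sketch of KNN prediction. It is an illustration only, not the scikit-learn implementation used in this notebook: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are hypothetical, and plain Euclidean distance is assumed.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    distances = [(math.dist(row, query), i) for i, row in enumerate(train_X)]
    distances.sort()  # equal distances fall back to the lower index
    return [i for _, i in distances[:k]]


def knn_classify(train_X, train_y, query, k=3):
    """Majority vote over the labels of the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    """Mean of the target values of the k nearest neighbors."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / len(idx)


# Toy 2-D data, made up for illustration (not taken from the wine dataset).
toy_X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(toy_X, toy_labels, (0.5, 0.5)))  # query near the first cluster
print(knn_regress(toy_X, toy_values, (5.5, 5.5)))   # query near the second cluster
```

There is no training step beyond storing the data, which is why KNN is called instance-based: all the work happens at prediction time.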


# 1. Import Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# 2. Load Dataset

In [3]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, sep=';')

X = data.drop('quality', axis=1).values
y = (data['quality'] >= 6).astype(int).values  # Binary target: 1 = quality >= 6, 0 = quality < 6

data.head()


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


# 3. Preprocessing

In [4]:
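# Split the dataset, then standardize the features. KNN is distance-based, so
# features on larger numeric scales would otherwise dominate the distance.
# The scaler is fit on the training split only to avoid leaking test statistics.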
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
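
As a rough illustration of why scaling matters, the sketch below takes two features (total sulfur dioxide and pH) from the first four rows of `data.head()` and measures how much each one contributes to the squared Euclidean distance between rows 0 and 1, before and after z-scoring. The helper names are hypothetical, and the mean and standard deviation are computed from just these four rows for demonstration, whereas the notebook fits `StandardScaler` on the full training split.

```python
import statistics

# (total sulfur dioxide, pH) for rows 0-3 of data.head()
rows = [(34.0, 3.51), (67.0, 3.20), (54.0, 3.26), (60.0, 3.16)]


def standardize(rows):
    """Z-score each column with its mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]


def squared_diff_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(sq)
    return [s / total for s in sq]


raw_share = squared_diff_share(rows[0], rows[1])
scaled_rows = standardize(rows)
scaled_share = squared_diff_share(scaled_rows[0], scaled_rows[1])

print("raw share    (SO2, pH):", [round(s, 4) for s in raw_share])
print("scaled share (SO2, pH):", [round(s, 4) for s in scaled_share])
```

On the raw values, pH contributes well under 0.1% of the squared distance, so it has almost no influence on which neighbors are selected. After standardization the two features contribute on a comparable scale.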


# 4. Train KNN Model

In [5]:
# Initialize and train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predictions
y_pred = knn.predict(X_test_scaled)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN Test Accuracy: {accuracy:.4f}")


KNN Test Accuracy: 0.7406




# 5. Evaluation

In [6]:
# Confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))


Confusion Matrix:
 [[108  41]
 [ 42 129]]

Classification Report:

              precision    recall  f1-score   support

           0       0.72      0.72      0.72       149
           1       0.76      0.75      0.76       171

    accuracy                           0.74       320
   macro avg       0.74      0.74      0.74       320
weighted avg       0.74      0.74      0.74       320
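
The figures in the report can be recomputed by hand from the confusion matrix, which may help when reading it. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable and function names are illustrative and chosen so they do not overwrite the notebook's `cm` or `accuracy`.

```python
# Confusion matrix as printed above: rows = true class, columns = predicted class.
cm_reported = [[108, 41],
               [42, 129]]

n_samples = sum(sum(row) for row in cm_reported)
acc_from_cm = (cm_reported[0][0] + cm_reported[1][1]) / n_samples


def per_class_metrics(matrix, c):
    """Precision, recall, and F1 for class `c` of a square confusion matrix."""
    true_pos = matrix[c][c]
    predicted_c = sum(row[c] for row in matrix)  # column sum
    actual_c = sum(matrix[c])                    # row sum (support)
    precision = true_pos / predicted_c
    recall = true_pos / actual_c
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


print(f"accuracy: {acc_from_cm:.4f}")
for c in (0, 1):
    p, r, f1 = per_class_metrics(cm_reported, c)
    print(f"class {c}: precision={p:.2f} recall={r:.2f} "
          f"f1={f1:.2f} support={sum(cm_reported[c])}")

# Accuracy of always predicting the more frequent test class, for context.
majority_baseline = max(sum(row) for row in cm_reported) / n_samples
print(f"majority-class baseline: {majority_baseline:.4f}")
```

The majority-class baseline gives a reference point: any classifier should be judged against the accuracy obtained by always predicting the larger class.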



# 6. Testing

In [7]:
# 1. Shape checks
assert X_train_scaled.shape[0] == y_train.shape[0], "Training data mismatch"
assert X_test_scaled.shape[0] == y_test.shape[0], "Testing data mismatch"

# 2. Prediction shape
assert y_pred.shape == y_test.shape, "Prediction shape mismatch"

# 3. Accuracy sanity
assert 0.5 <= accuracy <= 1.0, f"Unexpected accuracy: {accuracy}"

# 4. k neighbors check
assert knn.n_neighbors == 5, "Incorrect number of neighbors"

# 5. Reproducibility: refitting on the same data must give identical predictions
knn2 = KNeighborsClassifier(n_neighbors=5)
knn2.fit(X_train_scaled, y_train)
y_pred2 = knn2.predict(X_test_scaled)

np.testing.assert_array_equal(y_pred, y_pred2)

print("All tests passed ✅")


All tests passed ✅




# 7. Summary & Discussion

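- KNN with k = 5 reached 74% test accuracy (macro F1 of 0.74) on the red Wine Quality dataset, against a majority-class baseline of about 53% (171 of the 320 test wines are high-quality).
- Features were standardized because KNN is distance-based. The effect of scaling was not measured here, since no unscaled model was trained for comparison.
- The model is simple and non-parametric, and each prediction can be traced back to the neighbors that produced it.
- KNN can be sensitive to noisy or irrelevant features, and its prediction cost and memory use grow with the size of the training set.
- The inline checks confirm consistent array shapes, above-chance accuracy, and deterministic predictions. They do not by themselves establish that the model is correct or robust.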

