# KNN Proof-of-Concept (PoC)

This repository demonstrates a **K-Nearest Neighbors (KNN)** classification example using a small synthetic dataset. The goal is to show how KNN can predict whether a customer is likely to buy a product based on simple features.

---

## Dataset

The dataset (`knn_poc_dataset.csv`) contains:

| Feature       | Description                              |
|---------------|------------------------------------------|
| Age           | Customer age (years)                     |
| AnnualIncome  | Customer annual income (in thousands)   |
| SpendingScore | Customer spending score (1–100)         |
| BuysProduct   | Target label (0 = No, 1 = Yes)          |

Sample:

```csv
Age,AnnualIncome,SpendingScore,BuysProduct
22,45,77,1
25,29,65,1
47,54,23,0
52,88,17,0
46,76,45,0
```

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("/content/knn_poc_dataset.csv")

print(df.head())
# each row holds a shopper's features plus the label: whether or not they made a purchase

   Age  AnnualIncome  SpendingScore  BuysProduct
0   22            45             77            1
1   25            29             65            1
2   47            54             23            0
3   52            88             17            0
4   46            76             45            0


In [None]:
# split data into features (X) and target (y)
X = df[["Age", "AnnualIncome", "SpendingScore"]]
y = df["BuysProduct"]

# split into train and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
# scale X_train and X_test since each feature has different scales
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # make sure to use .transform() so it applies the same scaling WITHOUT refitting

# K-Nearest Neighbors (KNN) - Plain Language Explanation

K-Nearest Neighbors (KNN) is a **simple, intuitive machine learning algorithm** used for classification (and sometimes regression). Here’s how it works in plain language:

---

## How KNN Works

1. **Think of points on a map**  
   - Each data point is like a dot in space, based on its features (e.g., age, income, spending score).  
   - Points that are “closer together” are more similar.

2. **New data comes in**  
   - Suppose we have a new point (e.g., a new customer).  
   - We want to predict its label (e.g., will they buy the product?).

3. **Find the nearest neighbors**  
   - Measure the distance between the new point and all existing points.  
   - Common distance metric: **Euclidean distance** (straight-line distance).  
   - Pick the **k closest points**. These are the “nearest neighbors.”

4. **Vote among the neighbors**  
   - For classification:
     - Each neighbor “votes” for its class (e.g., buys = 1, doesn’t buy = 0).  
     - The class with the most votes is assigned to the new point.  
   - For regression:
     - Take the average of the neighbors’ values.

5. **Output the prediction**  
   - The new point is labeled based on the majority vote (classification) or average (regression).
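
The steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN internally: the function name `knn_predict` and the toy points are made up for this example, and ties are resolved by whichever label `Counter` encounters first.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 3: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Indices of the k smallest distances (the "nearest neighbors")
    nearest = np.argsort(distances)[:k]

    # Step 4: each neighbor votes for its own label; the most common label wins
    votes = Counter(np.asarray(y_train)[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy points, already on comparable scales (two features each)
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.2], [3.1, 2.9]]
y_toy = [1, 1, 1, 0, 0]

print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))  # all three neighbors have label 1
print(knn_predict(X_toy, y_toy, [3.0, 3.0], k=3))  # two of three neighbors have label 0
```

For regression, the same neighbor search applies; only the last step changes, from a vote to an average of the neighbors' values.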

---

## Key Points

- **Choice of k**:  
  - Small k → sensitive to noise (may overfit).  
  - Large k → smoother predictions, but may miss small patterns (may underfit).

- **Scaling matters**:  
  - KNN uses distances. Features with larger scales dominate unless we scale all features.

- **Non-parametric**:  
  - KNN doesn’t assume any underlying distribution. It “learns” from the data directly.

- **Lazy learner**:  
  - KNN doesn’t really train a model in the traditional sense. It just stores the data and computes distances at prediction time.
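
To make the scaling point concrete, here is a small sketch with made-up customers whose income is expressed in dollars rather than thousands (an assumption chosen to exaggerate the scale gap; the helper `nearest_to_first` is hypothetical). It compares which customer is closest to customer A before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age in years, income in dollars] (made-up values)
customers = np.array([
    [25, 50_000],  # A: the query customer
    [60, 50_200],  # B: very different age, similar income
    [27, 51_000],  # C: similar age, somewhat different income
    [45, 48_000],  # D
], dtype=float)


def nearest_to_first(points):
    """Return the row index (1..n-1) of the point closest to row 0 (Euclidean)."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1


raw_nearest = nearest_to_first(customers)
scaled_nearest = nearest_to_first(StandardScaler().fit_transform(customers))

names = "ABCD"
print("Nearest to A on raw features:   ", names[raw_nearest])
print("Nearest to A on scaled features:", names[scaled_nearest])
```

On these values the raw distance is driven almost entirely by income, so B (35 years older) counts as A's nearest neighbor; after standardization both features contribute comparably and C becomes the nearest.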

---

## Intuition

- Imagine a group of people in a park, and you want to predict if a new person likes soccer.  
- You look at the **nearest k people** to them: if most of them like soccer, you predict that the new person probably likes soccer too.  
- That’s essentially how KNN works, just with numbers instead of people.



In [None]:
# build KNN model
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

In [None]:
# make predictions
y_pred = knn.predict(X_test_scaled)

In [None]:
# now let's check accuracy and the classification report
from sklearn.metrics import accuracy_score, classification_report

# accuracy: fraction of predictions that are correct (1.0 = 100%)
acc = accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

# classification report: shows precision, recall, f1-score, and support metrics
# precision: of the samples predicted as a class (e.g., label 1), what fraction truly belong to it? TP / (TP + FP)
# recall: of the samples that actually belong to a class, what fraction did the model find? TP / (TP + FN)
# f1-score: harmonic mean of precision and recall: 2 * (precision * recall) / (precision + recall)
# support: number of actual samples of each class in y_test
cr = classification_report(y_test, y_pred)
print("\nClassification Report:\n", cr)

Accuracy:  1.0

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         3

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4
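
Because a perfect score hides how the metrics differ, the following sketch recomputes them by hand from confusion-matrix counts. The labels are made up for illustration and are not this notebook's test set; the `_demo` names are hypothetical.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration only
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels {0, 1}, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support_1 = tp + fn  # number of actual class-1 samples

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support(1)={support_1}")

# Cross-check the hand-computed values against scikit-learn's metric functions
assert abs(precision - precision_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(recall - recall_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(f1 - f1_score(y_true_demo, y_pred_demo)) < 1e-12
```

These are the class-1 rows of what `classification_report` would print for the same labels; the class-0 rows follow by treating 0 as the positive class.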



In [None]:
# next, let's experiment with different k values, not just k=5

for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    print(f"k={k}, Accuracy = {accuracy_score(y_test, y_pred)}")

# the test set has only 4 samples, so a perfect score at every k says little;
# repeat this on a larger, non-synthetic dataset to see realistic differences between k values

k=1, Accuracy = 1.0
k=2, Accuracy = 1.0
k=3, Accuracy = 1.0
k=4, Accuracy = 1.0
k=5, Accuracy = 1.0
k=6, Accuracy = 1.0
k=7, Accuracy = 1.0
k=8, Accuracy = 1.0
k=9, Accuracy = 1.0
k=10, Accuracy = 1.0
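
One possible way to get a more informative comparison of k values is cross-validation on a larger dataset. The sketch below uses the breast cancer dataset bundled with scikit-learn purely as a stand-in (an assumption; it is not this repository's data) and wraps scaling and KNN in a pipeline so the scaler is refit on the training part of each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset shipped with scikit-learn (not the PoC's CSV)
X_real, y_real = load_breast_cancer(return_X_y=True)

cv_scores = {}
for k in range(1, 11):
    # The pipeline refits the scaler inside each fold, so held-out data never influences scaling
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_real, y_real, cv=5).mean()

for k, score in cv_scores.items():
    print(f"k={k}, mean CV accuracy = {score:.3f}")

best_k = max(cv_scores, key=cv_scores.get)
print("Best k by mean CV accuracy:", best_k)
```

Averaging over five folds gives each k a score based on every sample rather than on a handful of test rows, which makes differences between k values easier to trust.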
