# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### K-Nearest Neighbors (KNN) - Interactive Deep Dive
**Instructor:** Amir Charkhi | **Dataset:** Wine Recognition (UCI)

---

## 📚 What You'll Learn

- How KNN makes predictions using neighbors
- Distance metrics and their impact
- Finding optimal K value
- Feature scaling importance
- Real-world application: Wine classification

---

## 🎯 The Big Idea: Vote by Your Neighbors

Imagine you move to a new neighborhood and want to know if you'll like it:

```
You: üè†?       Ask K=3 nearest neighbors:
              
              üòä (Happy) - 50m away
              üòä (Happy) - 80m away  
              üòä (Happy) - 120m away
              üò¢ (Unhappy) - 500m away
              
Vote: 3 Happy, 0 Unhappy ‚Üí You'll probably be happy! üòä
```

**KNN is that simple:**
1. Find K closest training examples
2. Let them vote
3. Majority wins!

**No training phase!** The model simply stores the training data and does all the distance computations at prediction time.
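The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it: the helper name `knn_predict` and the toy arrays are assumptions made up for this example, and the toy data simply mirrors the neighborhood picture (distances in meters).

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training example
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: house positions (meters from you) and their owners' moods
X_toy = np.array([[50.0], [80.0], [120.0], [500.0]])
y_toy = np.array(["happy", "happy", "happy", "unhappy"])

print(knn_predict(X_toy, y_toy, np.array([0.0]), k=3))  # happy
```

Note that there is no `fit` step at all: the "model" is just `X_toy` and `y_toy`.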

---

## 1. Setup

In [None]:
# Core libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Plotly for interactive visualizations
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

---
## 2. Load the Wine Data

**Dataset:** Wine Recognition Data (UCI, loaded with scikit-learn's `load_wine`)

**Context:** Chemical analysis of wines grown in the same region of Italy but derived from 3 different cultivars:
- 13 features: alcohol, malic acid, color intensity, etc.
- **Goal:** Classify each wine into one of 3 cultivars
- **178 samples** across the 3 classes (59, 71, and 48)

**Real-world impact:** Wine quality control, authenticity verification 🍷

In [None]:
# Load data
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='wine_class')

# Create readable labels
y_labels = y.map({0: 'Class 0', 1: 'Class 1', 2: 'Class 2'})

# Display info
info_df = pd.DataFrame({
    'Metric': ['Total Samples', 'Features', 'Classes', 'Class 0 (Samples)', 'Class 1 (Samples)', 'Class 2 (Samples)'],
    'Value': [
        len(X),
        X.shape[1],
        len(np.unique(y)),
        (y == 0).sum(),
        (y == 1).sum(),
        (y == 2).sum()
    ]
})

print("🍷 Wine Classification Dataset")
print("="*50)
for _, row in info_df.iterrows():
    print(f"{row['Metric']:.<30} {row['Value']}")
print("="*50)

In [None]:
# Preview features
X.head()

### 📊 Class Distribution

In [None]:
# Interactive class distribution
class_counts = y_labels.value_counts().sort_index()

fig = go.Figure(data=[
    go.Bar(
        x=class_counts.index,
        y=class_counts.values,
        text=class_counts.values,
        textposition='auto',
        marker_color=['#E74C3C', '#3498DB', '#2ECC71']
    )
])

fig.update_layout(
    title='Wine Class Distribution',
    xaxis_title='Wine Class',
    yaxis_title='Number of Samples',
    template='plotly_white',
    height=400
)

fig.show()

---
## 3. Feature Analysis

Let's explore how different chemical properties separate wine classes.

In [None]:
import plotly.express as px

# Create visualization dataframe
viz_df = X[['alcohol', 'flavanoids', 'color_intensity', 'proline']].copy()
viz_df['wine_class'] = y_labels

# Convert categorical labels to numeric codes
viz_df['wine_class_num'] = viz_df['wine_class'].astype('category').cat.codes

# Parallel coordinates plot (continuous color only)
fig = px.parallel_coordinates(
    viz_df,
    dimensions=['alcohol', 'flavanoids', 'color_intensity', 'proline'],
    color='wine_class_num',
    color_continuous_scale=['#E74C3C', '#3498DB', '#2ECC71'],  # red → blue → green
    labels={'wine_class_num': 'Wine Class'},
    title='Feature Patterns Across Wine Classes'
)

fig.update_layout(height=500)
fig.show()


**💡 Observation:** Different classes show distinct patterns! KNN will use these patterns.

### 🔍 Key Features Scatter Plot

In [None]:
# 3D scatter for top features
fig = px.scatter_3d(
    viz_df,
    x='alcohol',
    y='flavanoids',
    z='proline',
    color='wine_class',
    title='3D Feature Space: Can Neighbors Help?',
    opacity=0.7
)

fig.update_layout(height=600)
fig.show()

**💡 Rotate the plot!** Notice how classes cluster together. This is perfect for KNN!

---
## 4. Prepare Data

### 🎯 Critical: Feature Scaling for KNN

**Why is scaling ESSENTIAL for KNN?**

KNN uses distance. Features with large ranges dominate!

```
Example (without scaling):
  Alcohol: 12.5 vs 13.0    → Difference: 0.5
  Proline: 500 vs 1000     → Difference: 500
  
Distance calculation dominated by Proline!
Alcohol's contribution is negligible!

After scaling (illustrative values, all features on a comparable scale):
  Alcohol: 0.2 vs 0.3      → Difference: 0.1  
  Proline: 0.4 vs 0.8      → Difference: 0.4
  
Both features now contribute on comparable terms!
```
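To put numbers on the example above, the sketch below computes how much of the squared Euclidean distance each feature accounts for, before and after scaling. The helper name `squared_distance_shares` is made up for this illustration, and the "scaled" values are the illustrative ones from the example, not actual `StandardScaler` output.

```python
import numpy as np


def squared_distance_shares(a, b):
    """Return each feature's share of the squared Euclidean distance between a and b."""
    squared_diffs = (np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2
    return squared_diffs / squared_diffs.sum()


# (alcohol, proline) pairs taken from the example above
raw_shares = squared_distance_shares([12.5, 500.0], [13.0, 1000.0])
scaled_shares = squared_distance_shares([0.2, 0.4], [0.3, 0.8])

print(f"Raw:    alcohol {raw_shares[0]:.6%}, proline {raw_shares[1]:.6%}")
print(f"Scaled: alcohol {scaled_shares[0]:.1%}, proline {scaled_shares[1]:.1%}")
```

On the raw values, alcohol accounts for roughly one part in a million of the squared distance; on the scaled values its share rises to a few percent, so a difference in alcohol can actually change which neighbors are closest.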

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### 📊 Visualize Scaling Impact

In [None]:
# Compare before/after scaling
comparison_df = pd.DataFrame({
    'Feature': X.columns[:6],
    'Original Range': [X_train[col].max() - X_train[col].min() for col in X.columns[:6]],
    'Scaled Range': [X_train_scaled[:, i].max() - X_train_scaled[:, i].min() for i in range(6)]
})

fig = go.Figure()

fig.add_trace(go.Bar(
    name='Before Scaling',
    x=comparison_df['Feature'],
    y=comparison_df['Original Range'],
    marker_color='lightcoral'
))

fig.add_trace(go.Bar(
    name='After Scaling',
    x=comparison_df['Feature'],
    y=comparison_df['Scaled Range'],
    marker_color='lightgreen'
))

fig.update_layout(
    title='Feature Ranges: Before vs After Scaling',
    yaxis_title='Range',
    barmode='group',
    template='plotly_white',
    height=400
)

fig.show()

**💡 After scaling:** All features have similar ranges → Fair contribution to distance!

---
## 5. Understanding Distance Metrics

### 📏 How to Measure "Closeness"?

**Euclidean Distance (Default):** Straight-line distance
```
d = √[(x₁-x₂)² + (y₁-y₂)²]

Point A: (1, 2)
Point B: (4, 6)  

d = √[(1-4)² + (2-6)²] = √[9 + 16] = 5
```

**Manhattan Distance:** City-block distance (taxi cab route)
```
d = |x₁-x₂| + |y₁-y₂|

d = |1-4| + |2-6| = 3 + 4 = 7
```

**Minkowski Distance:** Generalization (p=1 → Manhattan, p=2 → Euclidean)
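The worked numbers above can be checked with a short sketch. The function name `minkowski` is just a label chosen for this example; it computes the Minkowski distance directly from its definition, so setting `p=1` or `p=2` should reproduce the Manhattan and Euclidean results for points A and B.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


point_a = np.array([1.0, 2.0])
point_b = np.array([4.0, 6.0])

print(minkowski(point_a, point_b, p=1))  # Manhattan: 7.0
print(minkowski(point_a, point_b, p=2))  # Euclidean: 5.0
```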

In [None]:
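# Compare distance metrics
# Note: 'minkowski' with scikit-learn's default p=2 is the same as 'euclidean',
# so those two bars are expected to match exactly
distance_metrics = ['euclidean', 'manhattan', 'minkowski']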
results = []

for metric in distance_metrics:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train_scaled, y_train)
    accuracy = knn.score(X_test_scaled, y_test)
    results.append({'Metric': metric, 'Accuracy': accuracy})

results_df = pd.DataFrame(results)

fig = go.Figure(data=[
    go.Bar(
        x=results_df['Metric'],
        y=results_df['Accuracy'],
        text=[f"{x:.2%}" for x in results_df['Accuracy']],
        textposition='auto',
        marker_color=['#E74C3C', '#3498DB', '#2ECC71']
    )
])

fig.update_layout(
    title='Distance Metric Comparison (K=5)',
    xaxis_title='Distance Metric',
    yaxis_title='Accuracy',
    yaxis_range=[0, 1],
    template='plotly_white',
    height=400
)

fig.show()

**💡 Insight:** Euclidean typically works well for continuous features (like ours).

---
## 6. Finding Optimal K

### 🎯 The K Dilemma

**Small K (e.g., K=1):**
```
✅ Captures fine details
❌ Sensitive to noise
❌ Overfitting risk

Example: One noisy neighbor can mislead!
```

**Large K (e.g., K=50):**
```
✅ Robust to noise
✅ Smooth boundaries
❌ Misses local patterns
❌ Underfitting risk

Example: Too many neighbors → majority class always wins!
```

**Goal:** Find the "sweet spot" K!

In [None]:
# Test different K values
k_values = range(1, 31)
train_scores = []
test_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    
    train_scores.append(knn.score(X_train_scaled, y_train))
    test_scores.append(knn.score(X_test_scaled, y_test))

In [None]:
# Interactive K vs Accuracy plot
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=list(k_values),
    y=train_scores,
    mode='lines+markers',
    name='Training Accuracy',
    line=dict(color='lightblue', width=3),
    marker=dict(size=8)
))

fig.add_trace(go.Scatter(
    x=list(k_values),
    y=test_scores,
    mode='lines+markers',
    name='Test Accuracy',
    line=dict(color='orange', width=3),
    marker=dict(size=8)
))

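# Mark the K with the highest test accuracy (the smallest such K if several tie).
# Caveat: choosing K on the test set makes the reported test accuracy optimistic;
# cross-validation on the training data is the safer way to pick K.
optimal_k = list(k_values)[np.argmax(test_scores)]
optimal_accuracy = max(test_scores)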

fig.add_vline(
    x=optimal_k,
    line_dash="dash",
    line_color="red",
    annotation_text=f"Optimal K={optimal_k}",
    annotation_position="top"
)

fig.update_layout(
    title='Finding Optimal K: Bias-Variance Trade-off',
    xaxis_title='K (Number of Neighbors)',
    yaxis_title='Accuracy',
    template='plotly_white',
    height=500,
    hovermode='x unified'
)

fig.show()

print(f"\n🎯 Optimal K: {optimal_k}")
print(f"📊 Test Accuracy: {optimal_accuracy:.2%}")

**💡 Key Observations:**

- **K=1:** Perfect training accuracy (memorization: each training point is its own nearest neighbor), but test accuracy is usually lower
- **Small K:** More variance, sensitive to noise
- **Optimal K:** Best test performance (generalization)
- **Large K:** Training and test accuracy converge as the model becomes too simple
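Because the sweep above picks K by looking at test accuracy, the reported score can be optimistic. One common alternative is sketched below: select K with 5-fold cross-validation on the training split only, with scaling wrapped in a pipeline so each fold is scaled from its own training part. The variable names (`X_cv`, `cv_means`, `best_k_cv`) are chosen for this sketch so they do not overwrite the notebook's variables, and the fold count of 5 is an assumption, not something the notebook prescribes.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same split settings as the notebook, reloaded here so the sketch is self-contained
X_cv, y_cv = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_cv, y_cv, test_size=0.25, random_state=42, stratify=y_cv
)

cv_means = {}
for k in range(1, 31):
    # Scaler inside the pipeline: refit on the training part of every fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means[k] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

# K with the highest mean cross-validated accuracy (smallest K wins ties)
best_k_cv = max(cv_means, key=cv_means.get)
print(f"Best K by 5-fold CV: {best_k_cv} (mean accuracy {cv_means[best_k_cv]:.2%})")
```

The test split (`X_te`, `y_te`) is untouched during selection, so scoring the chosen model on it afterwards gives a cleaner estimate of generalization.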

---
## 7. Train Final Model

In [None]:
# Train with optimal K
final_knn = KNeighborsClassifier(n_neighbors=optimal_k)
final_knn.fit(X_train_scaled, y_train)

# Predictions
y_pred = final_knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

### 📊 Confusion Matrix

In [None]:
# Interactive confusion matrix
cm = confusion_matrix(y_test, y_pred)

fig = go.Figure(data=go.Heatmap(
    z=cm,
    x=['Class 0', 'Class 1', 'Class 2'],
    y=['Class 0', 'Class 1', 'Class 2'],
    text=cm,
    texttemplate='%{text}',
    textfont={"size": 16},
    colorscale='Blues',
    showscale=True
))

fig.update_layout(
    title=f'KNN Confusion Matrix (K={optimal_k})<br>Overall Accuracy: {accuracy:.2%}',
    xaxis_title='Predicted Class',
    yaxis_title='Actual Class',
    height=500
)

fig.show()

### 📈 Performance by Class

In [None]:
# Classification report visualization
report = classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1', 'Class 2'], output_dict=True)

metrics_df = pd.DataFrame({
    'Class': ['Class 0', 'Class 1', 'Class 2'],
    'Precision': [report['Class 0']['precision'], report['Class 1']['precision'], report['Class 2']['precision']],
    'Recall': [report['Class 0']['recall'], report['Class 1']['recall'], report['Class 2']['recall']],
    'F1-Score': [report['Class 0']['f1-score'], report['Class 1']['f1-score'], report['Class 2']['f1-score']]
})

fig = go.Figure()

for metric in ['Precision', 'Recall', 'F1-Score']:
    fig.add_trace(go.Bar(
        name=metric,
        x=metrics_df['Class'],
        y=metrics_df[metric],
        text=[f"{x:.2%}" for x in metrics_df[metric]],
        textposition='auto'
    ))

fig.update_layout(
    title='Performance Metrics by Wine Class',
    xaxis_title='Wine Class',
    yaxis_title='Score',
    yaxis_range=[0, 1.1],
    barmode='group',
    template='plotly_white',
    height=450
)

fig.show()

---
## 8. Decision Boundaries Visualization

Let's see how KNN creates boundaries in 2D feature space.

In [None]:
# Use two features for visualization
feature_1, feature_2 = 'alcohol', 'proline'
idx1 = list(X.columns).index(feature_1)
idx2 = list(X.columns).index(feature_2)

X_train_2d = X_train_scaled[:, [idx1, idx2]]
X_test_2d = X_test_scaled[:, [idx1, idx2]]

# Train KNN on 2D
knn_2d = KNeighborsClassifier(n_neighbors=optimal_k)
knn_2d.fit(X_train_2d, y_train)

# Create mesh
x_min, x_max = X_train_2d[:, 0].min() - 1, X_train_2d[:, 0].max() + 1
y_min, y_max = X_train_2d[:, 1].min() - 1, X_train_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))

# Predict on mesh
Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

In [None]:
# Interactive decision boundary
fig = go.Figure()

# Decision regions
fig.add_trace(go.Contour(
    x=xx[0],
    y=yy[:, 0],
    z=Z,
    colorscale=[
        [0, 'rgba(231, 76, 60, 0.3)'],
        [0.5, 'rgba(52, 152, 219, 0.3)'],
        [1, 'rgba(46, 204, 113, 0.3)']
    ],
    showscale=False,
    hoverinfo='skip',
    contours=dict(start=0, end=2, size=1)
))

# Training points
for class_val, class_name, color in [(0, 'Class 0', '#E74C3C'), (1, 'Class 1', '#3498DB'), (2, 'Class 2', '#2ECC71')]:
    mask = y_train == class_val
    fig.add_trace(go.Scatter(
        x=X_train_2d[mask, 0],
        y=X_train_2d[mask, 1],
        mode='markers',
        name=class_name,
        marker=dict(color=color, size=10, line=dict(width=1, color='white'))
    ))

fig.update_layout(
    title=f'KNN Decision Boundaries (K={optimal_k})<br>{feature_1} vs {feature_2}',
    xaxis_title=f'{feature_1} (scaled)',
    yaxis_title=f'{feature_2} (scaled)',
    template='plotly_white',
    height=600,
    hovermode='closest'
)

fig.show()

**💡 Observations:**
- Boundaries are **irregular** (not straight lines like logistic regression)
- Boundaries adapt to local data structure
- "Majority vote" creates these regions

---
## 9. Visualize Prediction Process

Let's see how KNN makes a prediction for a single test point!

In [None]:
# Pick a test point
test_point_idx = 0
test_point = X_test_2d[test_point_idx:test_point_idx+1]
true_label = y_test.iloc[test_point_idx]

# Find K nearest neighbors
distances, indices = knn_2d.kneighbors(test_point)
neighbor_labels = y_train.iloc[indices[0]]

# Make prediction
prediction = knn_2d.predict(test_point)[0]

In [None]:
# Visualize the prediction process
fig = go.Figure()

# All training points (faded)
for class_val, color in [(0, '#E74C3C'), (1, '#3498DB'), (2, '#2ECC71')]:
    mask = y_train == class_val
    fig.add_trace(go.Scatter(
        x=X_train_2d[mask, 0],
        y=X_train_2d[mask, 1],
        mode='markers',
        name=f'Class {class_val} (train)',
        marker=dict(color=color, size=6, opacity=0.3),
        showlegend=False
    ))

# K nearest neighbors (highlighted)
neighbors_2d = X_train_2d[indices[0]]
for i, (neighbor, label, dist) in enumerate(zip(neighbors_2d, neighbor_labels, distances[0])):
    color = {0: '#E74C3C', 1: '#3498DB', 2: '#2ECC71'}[label]
    fig.add_trace(go.Scatter(
        x=[neighbor[0]],
        y=[neighbor[1]],
        mode='markers',
        name=f'Neighbor {i+1} (Class {label}, d={dist:.2f})',
        marker=dict(color=color, size=15, line=dict(width=2, color='black'))
    ))
    
    # Draw line to test point
    fig.add_trace(go.Scatter(
        x=[test_point[0, 0], neighbor[0]],
        y=[test_point[0, 1], neighbor[1]],
        mode='lines',
        line=dict(color='gray', width=1, dash='dash'),
        showlegend=False,
        hoverinfo='skip'
    ))

# Test point
fig.add_trace(go.Scatter(
    x=[test_point[0, 0]],
    y=[test_point[0, 1]],
    mode='markers',
    name=f'Test Point (True: Class {true_label}, Pred: Class {prediction})',
    marker=dict(
        color='yellow',
        size=20,
        symbol='star',
        line=dict(width=2, color='black')
    )
))

# Vote count
vote_counts = neighbor_labels.value_counts().to_dict()
vote_text = ', '.join([f"Class {k}: {v} votes" for k, v in sorted(vote_counts.items())])

fig.update_layout(
    title=f'KNN Prediction Process (K={optimal_k})<br>{vote_text}<br>Prediction: Class {prediction}',
    xaxis_title=f'{feature_1} (scaled)',
    yaxis_title=f'{feature_2} (scaled)',
    template='plotly_white',
    height=600,
    hovermode='closest'
)

fig.show()

**💡 This is how KNN works:**
1. Find the K closest neighbors (connected by gray lines)
2. Each neighbor "votes" for its class
3. Majority class wins!
4. No complex math, just distance and voting
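The voting step can also be checked numerically. With the default uniform weights, `predict_proba` in scikit-learn's `KNeighborsClassifier` returns the fraction of the K neighbors that belong to each class, so counting the neighbors' labels by hand should give the same numbers. The sketch below is self-contained and uses its own names (`X_demo`, `knn_demo`, `k_demo = 5`); for brevity it scales the whole dataset at once, which is fine for a demonstration but not for model evaluation.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)  # demo only: scaled on all rows

# Hold out the first wine as the query and fit on the rest
k_demo = 5
X_rest, y_rest = X_demo[1:], y_demo[1:]
knn_demo = KNeighborsClassifier(n_neighbors=k_demo).fit(X_rest, y_rest)
query = X_demo[:1]

# Manual vote: look up the neighbors' labels and count them per class
_, neighbor_idx = knn_demo.kneighbors(query)
manual_votes = np.bincount(y_rest[neighbor_idx[0]], minlength=3)
manual_fractions = manual_votes / k_demo

print("Votes per class:       ", manual_votes)
print("Vote fractions:        ", manual_fractions)
print("predict_proba (sklearn):", knn_demo.predict_proba(query)[0])
```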

---
## 10. Computational Considerations

### ⚡ KNN Trade-offs

**Training Time:** Very fast! ✅
- Just stores data
- No model "learning"
- No optimization loop: the cost is at most storing the data (plus building a KD-tree or ball-tree index, if one is used)

**Prediction Time:** Can be slow! ⚠️
- Brute-force search calculates the distance to ALL training points
- O(n × d) per query, where n=samples, d=features
- Problem for large datasets
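scikit-learn lets you choose the neighbor-search strategy through the `algorithm` parameter (`'brute'`, `'kd_tree'`, `'ball_tree'`, or the default `'auto'`). The sketch below uses synthetic data (the sizes and the seed are arbitrary choices for this example) to show that the tree-based searches return the same neighbor distances as brute force; they only change how the neighbors are found, not which ones are found.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, made up for this illustration
rng = np.random.default_rng(0)
X_big = rng.normal(size=(2000, 5))
y_big = (X_big[:, 0] > 0).astype(int)
X_query = rng.normal(size=(100, 5))

neighbor_dists = {}
for algorithm in ["brute", "kd_tree", "ball_tree"]:
    knn_alg = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    knn_alg.fit(X_big, y_big)
    # Distances to the 5 nearest training points for every query row
    neighbor_dists[algorithm], _ = knn_alg.kneighbors(X_query)

same_kd = np.allclose(neighbor_dists["brute"], neighbor_dists["kd_tree"], atol=1e-6)
same_ball = np.allclose(neighbor_dists["brute"], neighbor_dists["ball_tree"], atol=1e-6)
print("kd_tree matches brute force:  ", same_kd)
print("ball_tree matches brute force:", same_ball)
```

Tree indexes tend to help most when the number of features is small; in high dimensions their advantage over brute force shrinks.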

In [None]:
import time

# Compare prediction times
timing_results = []

for k in [1, 5, 15, 50]:
    knn_temp = KNeighborsClassifier(n_neighbors=k)
    knn_temp.fit(X_train_scaled, y_train)
    
    # Time predictions
    start = time.perf_counter()  # monotonic, higher resolution than time.time()
    for _ in range(100):
        _ = knn_temp.predict(X_test_scaled)
    avg_time = (time.perf_counter() - start) / 100
    
    timing_results.append({'K': k, 'Avg Prediction Time (ms)': avg_time * 1000})

timing_df = pd.DataFrame(timing_results)

fig = go.Figure(data=[
    go.Bar(
        x=timing_df['K'].astype(str),
        y=timing_df['Avg Prediction Time (ms)'],
        text=[f"{x:.2f} ms" for x in timing_df['Avg Prediction Time (ms)']],
        textposition='auto',
        marker_color='lightcoral'
    )
])

fig.update_layout(
    title='Prediction Time vs K',
    xaxis_title='K (Number of Neighbors)',
    yaxis_title='Avg Time (milliseconds)',
    template='plotly_white',
    height=400
)

fig.show()

---
## 11. Key Takeaways

### ✅ What We Learned:

**1. KNN Core Concepts:**
- **Lazy learning:** No training phase, just memorize data
- **Instance-based:** Uses actual training examples
- **Non-parametric:** No assumptions about data distribution
- **Majority voting:** Prediction based on neighbors

**2. Key Hyperparameters:**
- **K (neighbors):**
  - Small K → Complex boundary, overfitting risk
  - Large K → Simple boundary, underfitting risk
  - Odd K avoids ties in binary classification
- **Distance metric:**
  - Euclidean (default, works well)
  - Manhattan (for grid-like spaces)
  - Minkowski (generalization)

**3. Critical Requirements:**
- ⚠️ **MUST scale features!**
- Clean data (outliers and mislabeled points can strongly affect predictions)
- Consider the curse of dimensionality

**4. When to Use KNN:**
- ✅ Small to medium datasets (< 10K samples)
- ✅ Low dimensional data (< 20 features)
- ✅ Non-linear boundaries
- ✅ Need a simple, interpretable model
- ✅ Continuous numeric features

**5. When NOT to Use KNN:**
- ❌ Large datasets (slow predictions)
- ❌ High dimensional data (curse of dimensionality)
- ❌ Need fast predictions
- ❌ Categorical features (distance unclear)
- ❌ Unbalanced classes (the majority class tends to dominate the vote)

**6. Advantages:**
- Simple to understand
- Negligible training time
- Naturally handles multi-class
- Adapts to new data easily

**7. Disadvantages:**
- Slow predictions
- Memory intensive (stores all data)
- Sensitive to irrelevant features
- Requires feature scaling

---

### 🍷 Real-World Application:

Our KNN model classified the held-out wines with the test accuracy computed in Section 7 (the `accuracy` variable):
- Successfully distinguished 3 wine types
- Used the K selected in Section 6 (the `optimal_k` variable)
- Clear decision boundaries based on chemical properties

This demonstrates KNN's effectiveness for:
- Quality control
- Product authentication
- Multi-class problems with clear clusters

---

### 🎯 Best Practices:

1. **Always scale features** - KNN is distance-based
2. **Try odd K values** - Rules out ties for two classes; with three or more classes (as here) ties are still possible
3. **Use cross-validation** - Find optimal K
4. **Consider dataset size** - KNN doesn't scale well
5. **Remove irrelevant features** - They add noise to distance
6. **Use KD-trees or Ball-trees** - For faster neighbor search (if needed)

---

**AI Tech Institute** | *Building Tomorrow's AI Engineers Today*