# Tier 2: k-Nearest Neighbors (k-NN)

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** f14d64a5-c6b6-4cca-b462-101d78787dab

---

## Citation
Brandon Deloatch, "Tier 2: k-Nearest Neighbors (k-NN)," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+, seaborn (imported in the first cell; no minimum version stated)
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** f14d64a5-c6b6-4cca-b462-101d78787dab
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.
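
One way to capture execution metadata programmatically is sketched below, assuming only the standard library. The helper name `capture_execution_metadata` and the default package list are illustrative choices, not part of the notebook.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_execution_metadata(packages=("numpy", "pandas", "scikit-learn", "plotly")):
    """Collect interpreter, platform, and package versions for a run log.

    Hypothetical helper: the name and the default package list are assumptions.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


run_info = capture_execution_metadata()
print(json.dumps(run_info, indent=2))
```

Writing `run_info` next to the saved results makes it easier to check later which library versions produced a given set of numbers.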

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [1]:
# Import Essential Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# Scikit-learn imports
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, roc_curve, auc
from sklearn.metrics import precision_recall_curve, mean_absolute_error
from sklearn.datasets import make_classification, make_regression

# Distance metrics
from sklearn.metrics.pairwise import euclidean_distances, manhattan_distances
from sklearn.neighbors import NearestNeighbors

import warnings
warnings.filterwarnings('ignore')

print(" Tier 2: k-Nearest Neighbors (k-NN) - Libraries Loaded Successfully!")
print("=" * 75)
print("Available k-NN Techniques:")
print("• k-NN Classification - Instance-based categorical prediction")
print("• k-NN Regression - Local averaging for continuous targets")
print("• Distance Metrics - Euclidean, Manhattan, Minkowski analysis")
print("• Optimal k Selection - Cross-validation and elbow method")
print("• Feature Scaling - Standardization and normalization impact")
print("• Neighborhood Analysis - Local pattern and density exploration")

 Tier 2: k-Nearest Neighbors (k-NN) - Libraries Loaded Successfully!
Available k-NN Techniques:
• k-NN Classification - Instance-based categorical prediction
• k-NN Regression - Local averaging for continuous targets
• Distance Metrics - Euclidean, Manhattan, Minkowski analysis
• Optimal k Selection - Cross-validation and elbow method
• Feature Scaling - Standardization and normalization impact
• Neighborhood Analysis - Local pattern and density exploration
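
To make "instance-based prediction" and "local averaging" concrete, here is a minimal from-scratch sketch of k-NN in NumPy. It is an illustration, not scikit-learn's implementation: the function name `knn_predict`, the toy arrays, and the brute-force Euclidean distance are all assumptions chosen for brevity.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3, task="classification"):
    """Brute-force k-NN: majority vote (classification) or mean (regression)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)

    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the k closest training points for each query row
    nearest = np.argsort(dists, axis=1)[:, :k]
    neighbor_targets = y_train[nearest]

    if task == "regression":
        # Local averaging of the neighbors' target values
        return neighbor_targets.mean(axis=1)

    # Majority vote over non-negative integer labels;
    # ties resolve toward the smallest label
    return np.array([np.bincount(row).argmax() for row in neighbor_targets])


# Toy data: two well-separated groups
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_labels = [0, 0, 0, 1, 1, 1]
y_values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
queries = [[0.2, 0.1], [5.5, 5.5]]

labels = knn_predict(X_toy, y_labels, queries, k=3)
values = knn_predict(X_toy, y_values, queries, k=3, task="regression")
print(labels, values)
```

There is no training step beyond storing the data; all of the work happens at prediction time, which is why distance choice and feature scaling matter so much in the cells that follow.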


In [4]:
# Generate Comprehensive Datasets for k-NN Analysis
np.random.seed(42)

def generate_knn_datasets():
    """Generate datasets optimized for k-NN analysis with local patterns"""

    # 1. REGRESSION DATASET - House Price Prediction with Spatial Components
    n_samples = 1000

    # Geographic coordinates (normalized to 0-100 range)
    latitude = np.random.uniform(40.0, 45.0, n_samples)
    longitude = np.random.uniform(-75.0, -70.0, n_samples)

    # Normalize coordinates
    lat_norm = (latitude - latitude.min()) / (latitude.max() - latitude.min()) * 100
    lon_norm = (longitude - longitude.min()) / (longitude.max() - longitude.min()) * 100

    # House characteristics
    house_size = np.random.gamma(shape=2, scale=1200, size=n_samples) + 800
    bedrooms = np.random.poisson(lam=3, size=n_samples) + 1
    bathrooms = np.random.poisson(lam=2, size=n_samples) + 1
    age_years = np.random.exponential(scale=15, size=n_samples) + 1

    # Neighborhood quality (spatially correlated)
    neighborhood_centers = [(20, 20), (50, 80), (80, 30), (30, 70)]
    neighborhood_quality = np.zeros(n_samples)

    for i in range(n_samples):
        # Distance to nearest high-quality neighborhood center
        distances = [np.sqrt((lat_norm[i] - center[0])**2 + (lon_norm[i] - center[1])**2)
                     for center in neighborhood_centers]
        min_distance = min(distances)

        # Quality decreases with distance (local similarity)
        neighborhood_quality[i] = max(0, 10 - min_distance/5) + np.random.normal(0, 1)
        neighborhood_quality[i] = np.clip(neighborhood_quality[i], 1, 10)

    # School district ratings (spatially clustered)
    school_rating = np.zeros(n_samples)
    school_centers = [(40, 40), (70, 70)]

    for i in range(n_samples):
        distances = [np.sqrt((lat_norm[i] - center[0])**2 + (lon_norm[i] - center[1])**2)
                     for center in school_centers]
        min_distance = min(distances)

        school_rating[i] = max(1, 10 - min_distance/8) + np.random.normal(0, 0.5)
        school_rating[i] = np.clip(school_rating[i], 1, 10)

    # Generate prices with local effects (perfect for k-NN)
    price = (house_size * 150 +
             bedrooms * 15000 +
             bathrooms * 10000 -
             age_years * 2000 +
             neighborhood_quality * 25000 +
             school_rating * 20000 +
             np.random.normal(0, 30000, n_samples))

    price = np.maximum(price, 100000) # Minimum price

    # Create regression DataFrame
    regression_df = pd.DataFrame({
        'latitude': lat_norm,
        'longitude': lon_norm,
        'house_size': house_size,
        'bedrooms': bedrooms,
        'bathrooms': bathrooms,
        'age_years': age_years,
        'neighborhood_quality': neighborhood_quality,
        'school_rating': school_rating,
        'price': price
    })

    # 2. CLASSIFICATION DATASET - Customer Segment Prediction
    # Generate customer features with local clustering patterns
    customer_age = np.random.normal(45, 15, n_samples)
    customer_age = np.clip(customer_age, 18, 80)

    annual_income = np.random.lognormal(mean=10.8, sigma=0.6, size=n_samples)
    annual_income = np.clip(annual_income, 30000, 200000)

    spending_score = np.random.beta(a=2, b=2, size=n_samples) * 100

    # Create clustered segments based on age and income
    segments = np.zeros(n_samples, dtype=int)

    for i in range(n_samples):
        # Young, high income, high spending
        if customer_age[i] < 35 and annual_income[i] > 70000 and spending_score[i] > 60:
            segments[i] = 0 # "Premium Young"
        # Middle age, moderate income, family-oriented
        elif 35 <= customer_age[i] <= 55 and 50000 <= annual_income[i] <= 90000:
            segments[i] = 1 # "Family Focused"
        # Older, high income, conservative spending
        elif customer_age[i] > 55 and annual_income[i] > 60000 and spending_score[i] < 50:
            segments[i] = 2 # "Conservative Savers"
        # Budget conscious across all ages
        else:
            segments[i] = 3 # "Budget Conscious"

    # Add some noise to make it more realistic
    flip_probability = 0.1
    noise_mask = np.random.random(n_samples) < flip_probability
    segments[noise_mask] = np.random.randint(0, 4, size=noise_mask.sum())

    # Additional features
    years_customer = np.random.exponential(scale=3, size=n_samples) + 0.5
    monthly_purchases = np.random.poisson(lam=8, size=n_samples) + 1

    # Purchase categories (related to segments)
    electronics_purchases = np.zeros(n_samples)
    clothing_purchases = np.zeros(n_samples)
    grocery_purchases = np.zeros(n_samples)

    for i in range(n_samples):
        if segments[i] == 0: # Premium Young
            electronics_purchases[i] = np.random.poisson(5) + 2
            clothing_purchases[i] = np.random.poisson(4) + 1
            grocery_purchases[i] = np.random.poisson(3) + 1
        elif segments[i] == 1: # Family Focused
            electronics_purchases[i] = np.random.poisson(2) + 1
            clothing_purchases[i] = np.random.poisson(6) + 2
            grocery_purchases[i] = np.random.poisson(8) + 3
        elif segments[i] == 2: # Conservative Savers
            electronics_purchases[i] = np.random.poisson(1) + 1
            clothing_purchases[i] = np.random.poisson(2) + 1
            grocery_purchases[i] = np.random.poisson(4) + 2
        else: # Budget Conscious
            electronics_purchases[i] = np.random.poisson(1) + 1
            clothing_purchases[i] = np.random.poisson(3) + 1
            grocery_purchases[i] = np.random.poisson(6) + 2

    # Create classification DataFrame
    classification_df = pd.DataFrame({
        'customer_age': customer_age,
        'annual_income': annual_income,
        'spending_score': spending_score,
        'years_customer': years_customer,
        'monthly_purchases': monthly_purchases,
        'electronics_purchases': electronics_purchases,
        'clothing_purchases': clothing_purchases,
        'grocery_purchases': grocery_purchases,
        'segment': segments
    })

    return regression_df, classification_df

# Generate datasets
print(" Generating k-NN optimized datasets...")
regression_df, classification_df = generate_knn_datasets()

print(f"Regression Dataset Shape: {regression_df.shape}")
print(f"Classification Dataset Shape: {classification_df.shape}")

print("\nRegression Dataset (House Price Prediction):")
print(regression_df.head())
print("\nRegression Target Statistics:")
print(regression_df['price'].describe())

print("\nClassification Dataset (Customer Segmentation):")
print(classification_df.head())
print("\nSegment Distribution:")
segment_names = ['Premium Young', 'Family Focused', 'Conservative Savers', 'Budget Conscious']
segment_counts = classification_df['segment'].value_counts().sort_index()
for seg_id, count in segment_counts.items():
    # Look up the name by segment id so labels stay correct even if a segment is absent
    print(f"• {segment_names[seg_id]}: {count} ({count/len(classification_df):.1%})")

 Generating k-NN optimized datasets...
Regression Dataset Shape: (1000, 9)
Classification Dataset Shape: (1000, 9)

Regression Dataset (House Price Prediction):
    latitude  longitude   house_size  bedrooms  bathrooms  age_years  \
0  37.173493  18.260941  1724.876868         4          3   5.784113   
1  95.075462  54.073995  1773.031403         5          4   6.406421   
2  73.095408  87.304912  4575.424578         3          5  15.213369   
3  59.696013  73.179075  1795.921752         4          3   9.652686   
4  15.213426  80.641091  2767.080219         3          3  20.459690   

   neighborhood_quality  school_rating         price  
0              6.495781       7.961903  6.323413e+05  
1              4.600409       6.409431  5.936042e+05  
2              4.810393       8.277374  1.083134e+06  
3              7.856057       7.794546  6.735220e+05  
4              5.764633       4.388408  6.840693e+05  

Regression Target Statistics:
count    1.000000e+03
mean     8.031014e+05
s
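
The per-sample loops that compute the distance to the nearest neighborhood or school center can also be written with NumPy broadcasting. The sketch below checks that both forms agree on the same points. It draws its own points from a separate `default_rng(0)` generator, so it does not reproduce the notebook's dataset; the names `points`, `centers`, and `quality` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # separate generator, not the notebook's seed
points = rng.uniform(0, 100, size=(1000, 2))
centers = np.array([(20, 20), (50, 80), (80, 30), (30, 70)], dtype=float)

# Loop version, mirroring the structure of the generation cell
loop_min = np.array([
    min(np.sqrt((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) for c in centers)
    for p in points
])

# Broadcast version: (n_points, 1, 2) - (1, n_centers, 2) -> (n_points, n_centers, 2)
diffs = points[:, None, :] - centers[None, :, :]
vec_min = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

# Quality decays with distance, plus noise, clipped to the 1-10 scale
quality = np.clip(
    np.maximum(0, 10 - vec_min / 5) + rng.normal(0, 1, len(points)), 1, 10
)

print(np.allclose(loop_min, vec_min))
```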

In [6]:
# 1. k-NN REGRESSION ANALYSIS
print(" 1. k-NN REGRESSION ANALYSIS")
print("=" * 33)

# Prepare regression data
reg_features = ['latitude', 'longitude', 'house_size', 'bedrooms', 'bathrooms',
 'age_years', 'neighborhood_quality', 'school_rating']
X_reg = regression_df[reg_features]
y_reg = regression_df['price']

# Split data
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
 X_reg, y_reg, test_size=0.2, random_state=42
)

print(f"Training set: {X_reg_train.shape}")
print(f"Test set: {X_reg_test.shape}")

# Feature scaling comparison
scalers = {
 'StandardScaler': StandardScaler(),
 'MinMaxScaler': MinMaxScaler(),
 'No Scaling': None
}

scaling_results = {}

for scaler_name, scaler in scalers.items():
    if scaler is not None:
        X_reg_train_scaled = scaler.fit_transform(X_reg_train)
        X_reg_test_scaled = scaler.transform(X_reg_test)
    else:
        X_reg_train_scaled = X_reg_train.values
        X_reg_test_scaled = X_reg_test.values

    # Fit k-NN with default k=5
    knn_reg = KNeighborsRegressor(n_neighbors=5)
    knn_reg.fit(X_reg_train_scaled, y_reg_train)

    # Predictions
    y_reg_pred = knn_reg.predict(X_reg_test_scaled)

    # Metrics
    mse = mean_squared_error(y_reg_test, y_reg_pred)
    r2 = r2_score(y_reg_test, y_reg_pred)
    mae = mean_absolute_error(y_reg_test, y_reg_pred)

    scaling_results[scaler_name] = {
        'MSE': mse,
        'R2': r2,
        'MAE': mae,
        'RMSE': np.sqrt(mse)
    }

print(" Feature Scaling Impact on k-NN Regression:")
scaling_df = pd.DataFrame(scaling_results).T
print(scaling_df.round(4))

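# Choose best scaling method
# Note: the ranking uses test-set R², so test metrics reported after this
# selection are optimistically biased.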
best_scaler_name = scaling_df['R2'].idxmax()
best_scaler = scalers[best_scaler_name]

print(f"\n Best scaling method: {best_scaler_name}")

# Scale data with best scaler
if best_scaler is not None:
    X_reg_train_final = best_scaler.fit_transform(X_reg_train)
    X_reg_test_final = best_scaler.transform(X_reg_test)
else:
    X_reg_train_final = X_reg_train.values
    X_reg_test_final = X_reg_test.values

# Visualize scaling impact
fig_scaling = go.Figure()

metrics = ['R2', 'MSE', 'MAE']
colors = ['blue', 'red', 'green']

for i, metric in enumerate(metrics):
    # Put every metric on a "higher is better" scale for plotting.
    # R² is plotted as-is; MSE and MAE are mapped to 1 / (1 + value / max),
    # which lies in [0.5, 1) and is larger for smaller errors.
    if metric == 'R2':
        values = scaling_df[metric]
    else:
        values = 1 / (1 + scaling_df[metric] / scaling_df[metric].max())

    # All traces share the default y-axis so the grouped bars are comparable
    fig_scaling.add_trace(
        go.Bar(
            x=scaling_df.index,
            y=values,
            name=metric,
            marker_color=colors[i],
            opacity=0.7,
            offsetgroup=i
        )
    )

fig_scaling.update_layout(
 title="Feature Scaling Impact on k-NN Performance",
 xaxis_title="Scaling Method",
 barmode='group',
 height=500
)
fig_scaling.show()

# k-value optimization
print(f"\n Optimal k Selection:")

k_values = range(1, 31)
k_scores_cv = []
k_scores_train = []
k_scores_test = []

for k in k_values:
    knn_reg = KNeighborsRegressor(n_neighbors=k)

    # Cross-validation score
    cv_scores = cross_val_score(knn_reg, X_reg_train_final, y_reg_train, cv=5, scoring='r2')
    k_scores_cv.append(cv_scores.mean())

    # Training and test scores
    knn_reg.fit(X_reg_train_final, y_reg_train)
    k_scores_train.append(knn_reg.score(X_reg_train_final, y_reg_train))
    k_scores_test.append(knn_reg.score(X_reg_test_final, y_reg_test))

# Find optimal k
optimal_k_cv = k_values[np.argmax(k_scores_cv)]
optimal_k_test = k_values[np.argmax(k_scores_test)]

print(f"• Optimal k (CV): {optimal_k_cv} (R² = {max(k_scores_cv):.4f})")
print(f"• Optimal k (Test): {optimal_k_test} (R² = {max(k_scores_test):.4f})")

# Plot k optimization
fig_k_opt = go.Figure()

fig_k_opt.add_trace(
    go.Scatter(
        x=list(k_values),
        y=k_scores_train,
        mode='lines+markers',
        name='Training R²',
        line=dict(color='blue'),
        hovertemplate="k: %{x}<br>Training R²: %{y:.4f}<extra></extra>"
    )
)

fig_k_opt.add_trace(
    go.Scatter(
        x=list(k_values),
        y=k_scores_cv,
        mode='lines+markers',
        name='CV R²',
        line=dict(color='green'),
        hovertemplate="k: %{x}<br>CV R²: %{y:.4f}<extra></extra>"
    )
)

fig_k_opt.add_trace(
    go.Scatter(
        x=list(k_values),
        y=k_scores_test,
        mode='lines+markers',
        name='Test R²',
        line=dict(color='red'),
        hovertemplate="k: %{x}<br>Test R²: %{y:.4f}<extra></extra>"
    )
)

# Mark optimal k
fig_k_opt.add_vline(
    x=optimal_k_cv,
    line_dash="dash",
    line_color="green",
    annotation_text=f"Optimal k (CV) = {optimal_k_cv}"
)

fig_k_opt.update_layout(
    title="k-NN Regression: Optimal k Selection",
    xaxis_title="Number of Neighbors (k)",
    yaxis_title="R² Score",
    height=500
)
fig_k_opt.show()

# Fit final model with optimal k
knn_reg_optimal = KNeighborsRegressor(n_neighbors=optimal_k_cv)
knn_reg_optimal.fit(X_reg_train_final, y_reg_train)

# Final predictions
y_reg_pred_optimal = knn_reg_optimal.predict(X_reg_test_final)

# Final metrics
final_mse = mean_squared_error(y_reg_test, y_reg_pred_optimal)
final_r2 = r2_score(y_reg_test, y_reg_pred_optimal)
final_mae = mean_absolute_error(y_reg_test, y_reg_pred_optimal)

print(f"\n Final k-NN Regression Performance:")
print(f"• k = {optimal_k_cv}")
print(f"• Test R²: {final_r2:.4f}")
print(f"• Test RMSE: ${np.sqrt(final_mse):,.0f}")
print(f"• Test MAE: ${final_mae:,.0f}")

# Actual vs Predicted plot
fig_pred_reg = go.Figure()

fig_pred_reg.add_trace(
 go.Scatter(
 x=y_reg_test,
 y=y_reg_pred_optimal,
 mode='markers',
 marker=dict(color='blue', opacity=0.6),
 name='Predictions',
 hovertemplate="Actual: $%{x:,.0f}<br>Predicted: $%{y:,.0f}<extra></extra>"
 )
)

# Perfect prediction line
min_price = min(y_reg_test.min(), y_reg_pred_optimal.min())
max_price = max(y_reg_test.max(), y_reg_pred_optimal.max())

fig_pred_reg.add_trace(
 go.Scatter(
 x=[min_price, max_price],
 y=[min_price, max_price],
 mode='lines',
 line=dict(color='red', dash='dash'),
 name='Perfect Prediction',
 hovertemplate="Perfect Line<extra></extra>"
 )
)

fig_pred_reg.update_layout(
 title=f"k-NN Regression: Actual vs Predicted (k={optimal_k_cv})",
 xaxis_title="Actual Price ($)",
 yaxis_title="Predicted Price ($)",
 height=500
)
fig_pred_reg.show()

 1. k-NN REGRESSION ANALYSIS
Training set: (800, 8)
Test set: (200, 8)
 Feature Scaling Impact on k-NN Regression:
                         MSE      R2          MAE         RMSE
StandardScaler  1.412389e+10  0.8020   93426.0394  118843.9618
MinMaxScaler    3.494309e+10  0.5100  143271.4442  186930.7152
No Scaling      6.454660e+09  0.9095   61705.2943   80340.8977

 Best scaling method: No Scaling



 Optimal k Selection:
• Optimal k (CV): 4 (R² = 0.8743)
• Optimal k (Test): 3 (R² = 0.9105)



 Final k-NN Regression Performance:
• k = 4
• Test R²: 0.9101
• Test RMSE: $80,055
• Test MAE: $62,340
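
In the cell above, the scaler is ranked by test-set R² while k is chosen by cross-validation. A likely reason "No Scaling" scores highest is that `house_size` has by far the largest numeric range and also carries the largest term in the price formula, so the unscaled distance is dominated by the most predictive feature. One way to choose the scaler and k together without touching the test set is to place both inside a `Pipeline` and search with `GridSearchCV`. The sketch below uses a stand-in dataset from `make_regression`, so its numbers are not comparable to the notebook's; the names `X_demo`, `pipe`, and `search` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Stand-in data (assumption): replace with X_reg / y_reg in the notebook
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is refit inside every CV fold, so validation folds never leak into it
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsRegressor())])
param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler(), "passthrough"],  # "passthrough" = no scaling
    "knn__n_neighbors": list(range(1, 16)),
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

print("Selected parameters:", search.best_params_)
print(f"CV R²: {search.best_score_:.4f}")
print(f"Held-out R²: {search.score(X_te, y_te):.4f}")
```

With this arrangement the held-out score is computed once, after every modeling choice has been fixed on the training folds.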


In [7]:
# 2. k-NN CLASSIFICATION ANALYSIS
print(" 2. k-NN CLASSIFICATION ANALYSIS")
print("=" * 35)

# Prepare classification data
class_features = ['customer_age', 'annual_income', 'spending_score', 'years_customer',
 'monthly_purchases', 'electronics_purchases', 'clothing_purchases', 'grocery_purchases']
X_class = classification_df[class_features]
y_class = classification_df['segment']

# Split data
X_class_train, X_class_test, y_class_train, y_class_test = train_test_split(
 X_class, y_class, test_size=0.2, random_state=42, stratify=y_class
)

print(f"Training set: {X_class_train.shape}")
print(f"Test set: {X_class_test.shape}")

# Check class distribution
print("\nClass distribution (training):")
train_distribution = y_class_train.value_counts().sort_index()
segment_names = ['Premium Young', 'Family Focused', 'Conservative Savers', 'Budget Conscious']
for seg_id, count in train_distribution.items():
    # Look up the name by segment id so labels stay correct even if a segment is absent
    print(f"• {segment_names[seg_id]}: {count} ({count/len(y_class_train):.1%})")

# Scale features for classification
scaler_class = StandardScaler()
X_class_train_scaled = scaler_class.fit_transform(X_class_train)
X_class_test_scaled = scaler_class.transform(X_class_test)

# k-value optimization for classification
print(f"\n Optimal k Selection for Classification:")

k_values = range(1, 31)
k_accuracy_cv = []
k_accuracy_train = []
k_accuracy_test = []

for k in k_values:
 knn_class = KNeighborsClassifier(n_neighbors=k)

 # Cross-validation score
 cv_scores = cross_val_score(knn_class, X_class_train_scaled, y_class_train, cv=5, scoring='accuracy')
 k_accuracy_cv.append(cv_scores.mean())

 # Training and test scores
 knn_class.fit(X_class_train_scaled, y_class_train)
 k_accuracy_train.append(knn_class.score(X_class_train_scaled, y_class_train))
 k_accuracy_test.append(knn_class.score(X_class_test_scaled, y_class_test))

# Find optimal k
optimal_k_class_cv = k_values[np.argmax(k_accuracy_cv)]
optimal_k_class_test = k_values[np.argmax(k_accuracy_test)]

print(f"• Optimal k (CV): {optimal_k_class_cv} (Accuracy = {max(k_accuracy_cv):.4f})")
print(f"• Optimal k (Test): {optimal_k_class_test} (Accuracy = {max(k_accuracy_test):.4f})")

# Plot k optimization for classification
fig_k_class = go.Figure()

fig_k_class.add_trace(
 go.Scatter(
 x=list(k_values),
 y=k_accuracy_train,
 mode='lines+markers',
 name='Training Accuracy',
 line=dict(color='blue'),
 hovertemplate="k: %{x}<br>Training Accuracy: %{y:.4f}<extra></extra>"
 )
)

fig_k_class.add_trace(
 go.Scatter(
 x=list(k_values),
 y=k_accuracy_cv,
 mode='lines+markers',
 name='CV Accuracy',
 line=dict(color='green'),
 hovertemplate="k: %{x}<br>CV Accuracy: %{y:.4f}<extra></extra>"
 )
)

fig_k_class.add_trace(
 go.Scatter(
 x=list(k_values),
 y=k_accuracy_test,
 mode='lines+markers',
 name='Test Accuracy',
 line=dict(color='red'),
 hovertemplate="k: %{x}<br>Test Accuracy: %{y:.4f}<extra></extra>"
 )
)

# Mark optimal k
fig_k_class.add_vline(
 x=optimal_k_class_cv,
 line_dash="dash",
 line_color="green",
 annotation_text=f"Optimal k (CV) = {optimal_k_class_cv}"
)

fig_k_class.update_layout(
 title="k-NN Classification: Optimal k Selection",
 xaxis_title="Number of Neighbors (k)",
 yaxis_title="Accuracy Score",
 height=500
)
fig_k_class.show()

# Fit final classification model
knn_class_optimal = KNeighborsClassifier(n_neighbors=optimal_k_class_cv)
knn_class_optimal.fit(X_class_train_scaled, y_class_train)

# Predictions
y_class_pred = knn_class_optimal.predict(X_class_test_scaled)
y_class_proba = knn_class_optimal.predict_proba(X_class_test_scaled)

# Classification metrics
class_accuracy = accuracy_score(y_class_test, y_class_pred)
print(f"\n Final k-NN Classification Performance:")
print(f"• k = {optimal_k_class_cv}")
print(f"• Test Accuracy: {class_accuracy:.4f}")

print(f"\nDetailed Classification Report:")
print(classification_report(y_class_test, y_class_pred, target_names=segment_names))

# Confusion Matrix
cm = confusion_matrix(y_class_test, y_class_pred)

# Create interactive confusion matrix
fig_cm = ff.create_annotated_heatmap(
 z=cm,
 x=segment_names,
 y=segment_names,
 annotation_text=cm,
 colorscale='Blues',
 showscale=True
)

fig_cm.update_layout(
 title=f"k-NN Classification Confusion Matrix (k={optimal_k_class_cv})",
 xaxis_title="Predicted Segment",
 yaxis_title="Actual Segment",
 height=500
)
fig_cm.show()

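# Feature impact analysis (spread-based proxy)
print(f"\n Feature Impact Analysis:")

# Share of total standard deviation contributed by each feature.
# Note: X_class_train_scaled is StandardScaler output, so every column has
# unit standard deviation and each of the 8 features gets 1/8 = 0.125 by
# construction. The proxy therefore does not rank features by predictive value.
feature_ranges = X_class_train_scaled.std(axis=0)
feature_importance_proxy = feature_ranges / feature_ranges.sum()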

feature_importance_df = pd.DataFrame({
 'Feature': class_features,
 'Importance_Proxy': feature_importance_proxy
}).sort_values('Importance_Proxy', ascending=False)

print("Feature Impact (based on standard deviation):")
for _, row in feature_importance_df.iterrows():
 print(f"• {row['Feature']}: {row['Importance_Proxy']:.3f}")

# Plot feature importance
fig_feat_imp = go.Figure()

fig_feat_imp.add_trace(
 go.Bar(
 x=feature_importance_df['Feature'],
 y=feature_importance_df['Importance_Proxy'],
 marker_color='skyblue',
 hovertemplate="Feature: %{x}<br>Importance: %{y:.3f}<extra></extra>"
 )
)

fig_feat_imp.update_layout(
 title="k-NN Feature Impact Analysis",
 xaxis_title="Features",
 yaxis_title="Importance Proxy (Std. Deviation)",
 xaxis_tickangle=-45,
 height=500
)
fig_feat_imp.show()

 2. k-NN CLASSIFICATION ANALYSIS
Training set: (800, 8)
Test set: (200, 8)

Class distribution (training):
• Premium Young: 30 (3.8%)
• Family Focused: 126 (15.8%)
• Conservative Savers: 58 (7.2%)
• Budget Conscious: 586 (73.2%)

 Optimal k Selection for Classification:
• Optimal k (CV): 10 (Accuracy = 0.8700)
• Optimal k (Test): 4 (Accuracy = 0.9100)



 Final k-NN Classification Performance:
• k = 10
• Test Accuracy: 0.8850

Detailed Classification Report:
                     precision    recall  f1-score   support

      Premium Young       0.83      0.71      0.77         7
     Family Focused       0.83      0.78      0.81        32
Conservative Savers       1.00      0.36      0.53        14
   Budget Conscious       0.89      0.97      0.93       147

           accuracy                           0.89       200
          macro avg       0.89      0.70      0.76       200
       weighted avg       0.89      0.89      0.87       200




 Feature Impact Analysis:
Feature Impact (based on standard deviation):
• annual_income: 0.125
• spending_score: 0.125
• monthly_purchases: 0.125
• electronics_purchases: 0.125
• clothing_purchases: 0.125
• grocery_purchases: 0.125
• years_customer: 0.125
• customer_age: 0.125
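
Every feature receives 0.125 above because standardized columns all have unit standard deviation, so the proxy cannot separate useful features from uninformative ones. Permutation importance is one model-agnostic alternative that works with k-NN: shuffle one column at a time and measure how much the held-out score drops. The sketch below uses a stand-in dataset from `make_classification` in which only the first three columns are informative; the data and the names `X_demo`, `model`, and `result` are assumptions, not the notebook's customer data.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): with shuffle=False, columns 0-2 are informative
# and columns 3-5 are pure noise
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
model.fit(X_tr, y_tr)

# Mean drop in held-out accuracy when each column is shuffled
result = permutation_importance(
    model, X_te, y_te, n_repeats=10, random_state=42, scoring="accuracy"
)

for j in result.importances_mean.argsort()[::-1]:
    print(f"feature_{j}: {result.importances_mean[j]:.3f} ± {result.importances_std[j]:.3f}")
```

Given the class imbalance reported above (about 73% Budget Conscious), passing `scoring="balanced_accuracy"` may give a more informative picture than plain accuracy.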


In [9]:
# 3. DISTANCE METRICS COMPARISON
print(" 3. DISTANCE METRICS COMPARISON")
print("=" * 33)

# Test different distance metrics
distance_metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
metric_results = {}

print("Testing different distance metrics on classification task:")

for metric in distance_metrics:
    if metric == 'minkowski':
        # Test different p values for Minkowski distance
        for p in [1, 2, 3]:
            knn_metric = KNeighborsClassifier(
                n_neighbors=optimal_k_class_cv,
                metric='minkowski',
                p=p
            )

            knn_metric.fit(X_class_train_scaled, y_class_train)
            accuracy = knn_metric.score(X_class_test_scaled, y_class_test)

            metric_name = f'minkowski_p{p}'
            metric_results[metric_name] = accuracy
            print(f"• {metric_name}: {accuracy:.4f}")
    else:
        knn_metric = KNeighborsClassifier(
            n_neighbors=optimal_k_class_cv,
            metric=metric
        )

        knn_metric.fit(X_class_train_scaled, y_class_train)
        accuracy = knn_metric.score(X_class_test_scaled, y_class_test)

        metric_results[metric] = accuracy
        print(f"• {metric}: {accuracy:.4f}")

# Find best metric
best_metric = max(metric_results, key=metric_results.get)
print(f"\n Best distance metric: {best_metric} (Accuracy: {metric_results[best_metric]:.4f})")

# Visualize distance metrics comparison
fig_metrics = go.Figure()

fig_metrics.add_trace(
    go.Bar(
        x=list(metric_results.keys()),
        y=list(metric_results.values()),
        marker_color='lightcoral',
        hovertemplate="Metric: %{x}<br>Accuracy: %{y:.4f}<extra></extra>"
    )
)

fig_metrics.update_layout(
    title="Distance Metrics Comparison (k-NN Classification)",
    xaxis_title="Distance Metric",
    yaxis_title="Test Accuracy",
    xaxis_tickangle=-45,
    height=500
)
fig_metrics.show()

# Distance visualization for a sample of points
print(f"\n Distance Calculation Examples:")

# Take first 5 test samples
sample_X = X_class_test_scaled[:5]
sample_y = y_class_test.iloc[:5]

# Fit the two 3-NN classifiers once; they do not depend on the query sample
knn_euclidean = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric='manhattan')
knn_euclidean.fit(X_class_train_scaled, y_class_train)
knn_manhattan.fit(X_class_train_scaled, y_class_train)

for i, (x_sample, y_true) in enumerate(zip(sample_X, sample_y)):
    print(f"\nSample {i+1} (True segment: {segment_names[y_true]}):")

    # Euclidean distances from this sample to every training point
    euclidean_dist = euclidean_distances([x_sample], X_class_train_scaled)[0]
    nearest_euclidean = np.argsort(euclidean_dist)[:3]

    # Manhattan distances from this sample to every training point
    manhattan_dist = manhattan_distances([x_sample], X_class_train_scaled)[0]
    nearest_manhattan = np.argsort(manhattan_dist)[:3]

    print(f" Euclidean - 3 nearest neighbors: {y_class_train.iloc[nearest_euclidean].values}")
    print(f" Manhattan - 3 nearest neighbors: {y_class_train.iloc[nearest_manhattan].values}")

    # Prediction with each metric
    pred_euclidean = knn_euclidean.predict([x_sample])[0]
    pred_manhattan = knn_manhattan.predict([x_sample])[0]

    print(f" Euclidean prediction: {segment_names[pred_euclidean]}")
    print(f" Manhattan prediction: {segment_names[pred_manhattan]}")

 3. DISTANCE METRICS COMPARISON
Testing different distance metrics on classification task:
• euclidean: 0.8850
• manhattan: 0.9000
• chebyshev: 0.8450
• minkowski_p1: 0.9000
• minkowski_p2: 0.8850
• minkowski_p3: 0.8850

 Best distance metric: manhattan (Accuracy: 0.9000)



 Distance Calculation Examples:

Sample 1 (True segment: Budget Conscious):
 Euclidean - 3 nearest neighbors: [3 3 3]
 Manhattan - 3 nearest neighbors: [3 3 3]
 Euclidean prediction: Budget Conscious
 Manhattan prediction: Budget Conscious

Sample 2 (True segment: Budget Conscious):
 Euclidean - 3 nearest neighbors: [3 3 3]
 Manhattan - 3 nearest neighbors: [3 3 3]
 Euclidean prediction: Budget Conscious
 Manhattan prediction: Budget Conscious

Sample 3 (True segment: Premium Young):
 Euclidean - 3 nearest neighbors: [0 0 3]
 Manhattan - 3 nearest neighbors: [0 0 0]
 Euclidean prediction: Premium Young
 Manhattan prediction: Premium Young

Sample 4 (True segment: Budget Conscious):
 Euclidean - 3 nearest neighbors: [3 3 1]
 Manhattan - 3 nearest neighbors: [3 1 3]
 Euclidean prediction: Budget Conscious
 Manhattan prediction: Budget Conscious

Sample 5 (True segment: Budget Conscious):
 Euclidean - 3 nearest neighbors: [3 3 3]
 Manhattan - 3 nearest neighbors: [3 3 3]
 Euclidean predi
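
In the metric comparison above, `minkowski_p1` matches `manhattan` (0.9000) and `minkowski_p2` matches `euclidean` (0.8850). That is expected: Manhattan and Euclidean are the p = 1 and p = 2 members of the Minkowski family, and Chebyshev is the limit as p grows without bound. A minimal NumPy sketch of the relationship follows; the function name `minkowski_distance` and the sample points are illustrative.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance of order p; p = inf gives the Chebyshev distance."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)


a, b = [0.0, 0.0], [3.0, 4.0]

manhattan = minkowski_distance(a, b, 1)       # |3| + |4|
euclidean = minkowski_distance(a, b, 2)       # sqrt(3^2 + 4^2)
chebyshev = minkowski_distance(a, b, np.inf)  # max(|3|, |4|)

print(manhattan, euclidean, chebyshev)
```

Larger p puts more weight on the single largest coordinate difference, which is one reason the metrics can rank neighbors differently on the same data.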

In [11]:
# 4. DIMENSIONALITY AND CURSE OF DIMENSIONALITY
print(" 4. DIMENSIONALITY ANALYSIS")
print("=" * 29)

# Test performance with different numbers of features
feature_subsets = {
    '2D': ['annual_income', 'spending_score'],
    '3D': ['annual_income', 'spending_score', 'customer_age'],
    '4D': ['annual_income', 'spending_score', 'customer_age', 'years_customer'],
    '6D': ['annual_income', 'spending_score', 'customer_age', 'years_customer',
           'monthly_purchases', 'electronics_purchases'],
    'Full (8D)': class_features
}

dimensionality_results = {}

print("Testing k-NN performance across different dimensionalities:")

for dim_name, features in feature_subsets.items():
    # Prepare data
    X_dim = X_class_train[features]
    X_dim_test = X_class_test[features]

    # Scale
    scaler_dim = StandardScaler()
    X_dim_scaled = scaler_dim.fit_transform(X_dim)
    X_dim_test_scaled = scaler_dim.transform(X_dim_test)

    # Fit k-NN
    knn_dim = KNeighborsClassifier(n_neighbors=optimal_k_class_cv)
    knn_dim.fit(X_dim_scaled, y_class_train)

    # Evaluate
    accuracy = knn_dim.score(X_dim_test_scaled, y_class_test)

    # Cross-validation for robustness
    cv_scores = cross_val_score(knn_dim, X_dim_scaled, y_class_train, cv=5)

    dimensionality_results[dim_name] = {
        'test_accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'n_features': len(features)
    }

    print(f"• {dim_name}: Test Acc = {accuracy:.4f}, CV = {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# Visualize dimensionality impact
dim_df = pd.DataFrame(dimensionality_results).T
dim_df['dimension_label'] = dim_df.index

fig_dim = make_subplots(
    rows=1, cols=2,
    subplot_titles=['Test Accuracy vs Dimensions', 'Cross-Validation Performance'],
    specs=[[{"secondary_y": False}, {"secondary_y": False}]]
)

# Test accuracy plot
fig_dim.add_trace(
    go.Scatter(
        x=dim_df['n_features'],
        y=dim_df['test_accuracy'],
        mode='lines+markers',
        name='Test Accuracy',
        line=dict(color='blue'),
        hovertemplate="Dimensions: %{x}<br>Accuracy: %{y:.4f}<extra></extra>"
    ),
    row=1, col=1
)

# CV performance with error bars
fig_dim.add_trace(
    go.Scatter(
        x=dim_df['n_features'],
        y=dim_df['cv_mean'],
        error_y=dict(type='data', array=dim_df['cv_std'], visible=True),
        mode='lines+markers',
        name='CV Mean ± Std',
        line=dict(color='red'),
        hovertemplate="Dimensions: %{x}<br>CV Mean: %{y:.4f}<extra></extra>"
    ),
    row=1, col=2
)

fig_dim.update_layout(
    title="k-NN Performance vs Dimensionality",
    height=500
)

fig_dim.update_xaxes(title_text="Number of Features", row=1, col=1)
fig_dim.update_xaxes(title_text="Number of Features", row=1, col=2)
fig_dim.update_yaxes(title_text="Accuracy", row=1, col=1)
fig_dim.update_yaxes(title_text="CV Accuracy", row=1, col=2)

fig_dim.show()

# Analyze curse of dimensionality
print(f"\n Curse of Dimensionality Analysis:")
print("As dimensions increase:")

best_2d_acc = dimensionality_results['2D']['test_accuracy']
best_full_acc = dimensionality_results['Full (8D)']['test_accuracy']
performance_change = ((best_full_acc - best_2d_acc) / best_2d_acc) * 100

print(f"• 2D performance: {best_2d_acc:.4f}")
print(f"• Full dimensional performance: {best_full_acc:.4f}")
print(f"• Performance change: {performance_change:+.1f}%")

if performance_change > 5:
    print("• Result: Adding dimensions IMPROVED performance")
    print("• Interpretation: Additional features contain valuable signal")
elif performance_change < -5:
    print("• Result: Adding dimensions HURT performance")
    print("• Interpretation: Curse of dimensionality effect observed")
else:
    print("• Result: Minimal impact from additional dimensions")
    print("• Interpretation: Marginal information in extra features")

# Distance concentration analysis
print(f"\n Distance Concentration in High Dimensions:")

# Calculate average distances in different dimensions
for dim_name, features in feature_subsets.items():
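    if len(features) >= 2: # Skip if too few features
        # Note: this slices the first len(features) columns of the scaled matrix,
        # so the dimension count matches but the columns are not necessarily the named subset
        X_sample = X_class_train_scaled[:100, :len(features)] # First 100 samples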

        # Calculate pairwise distances
        distances = euclidean_distances(X_sample, X_sample)
        # Remove diagonal (zero distances)
        distances = distances[np.triu_indices_from(distances, k=1)]

        mean_dist = distances.mean()
        std_dist = distances.std()
        cv_dist = std_dist / mean_dist # Coefficient of variation

        print(f"• {dim_name}: Mean distance = {mean_dist:.3f}, CV = {cv_dist:.3f}")

print("\nNote: Lower CV indicates distance concentration (curse of dimensionality)")

 4. DIMENSIONALITY ANALYSIS
Testing k-NN performance across different dimensionalities:
• 2D: Test Acc = 0.7400, CV = 0.7300 ± 0.0127
• 3D: Test Acc = 0.8700, CV = 0.8450 ± 0.0083
• 4D: Test Acc = 0.8600, CV = 0.8225 ± 0.0064
• 6D: Test Acc = 0.8400, CV = 0.8138 ± 0.0165
• Full (8D): Test Acc = 0.8850, CV = 0.8700 ± 0.0073



 Curse of Dimensionality Analysis:
As dimensions increase:
• 2D performance: 0.7400
• Full dimensional performance: 0.8850
• Performance change: +19.6%
• Result: Adding dimensions IMPROVED performance
• Interpretation: Additional features contain valuable signal

 Distance Concentration in High Dimensions:
• 2D: Mean distance = 1.699, CV = 0.578
• 3D: Mean distance = 2.255, CV = 0.431
• 4D: Mean distance = 2.690, CV = 0.398
• 6D: Mean distance = 3.280, CV = 0.358
• Full (8D): Mean distance = 3.804, CV = 0.303

Note: Lower CV indicates distance concentration (curse of dimensionality)
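
The falling coefficient of variation above is the distance-concentration effect: as the number of features grows, pairwise distances get larger on average while their relative spread shrinks, so the nearest and farthest neighbors become harder to tell apart. A minimal sketch on synthetic standard-normal data shows the same trend in isolation. It does not use the customer data, and the function name `distance_cv`, the sample size, and the seed are assumptions chosen for illustration.

```python
import numpy as np

def distance_cv(n_points, n_dims, seed=0):
    """Return (mean, CV) of pairwise Euclidean distances for N(0, 1) data."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_points, n_dims))
    # Pairwise squared distances via ||x||^2 + ||y||^2 - 2 x.y
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    # Keep each unordered pair once and drop the zero diagonal
    dists = np.sqrt(sq_dists[np.triu_indices(n_points, k=1)])
    return dists.mean(), dists.std() / dists.mean()

results = {d: distance_cv(200, d) for d in (2, 8, 32, 128)}
for d, (mean_dist, cv) in results.items():
    print(f"{d:>3}D: mean distance = {mean_dist:.3f}, CV = {cv:.3f}")
```

Real features are usually correlated, so the effect on the customer data can be weaker than on independent Gaussian columns; the sketch only illustrates the direction of the trend.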


In [14]:
# 5. NEIGHBORHOOD ANALYSIS AND VISUALIZATION
print(" 5. NEIGHBORHOOD ANALYSIS AND VISUALIZATION")
print("=" * 44)

# Analyze neighborhood composition for different values of k
print("Analyzing neighborhood composition for customer segmentation:")

# Select a few interesting test samples for analysis
interesting_samples = []
for segment in range(4):
    # Find samples of each segment
    segment_indices = np.where(y_class_test == segment)[0]
    if len(segment_indices) > 0:
        interesting_samples.append(segment_indices[0])

print(f"Analyzing {len(interesting_samples)} representative samples...")

# Fit nearest neighbors model for analysis
nn_analyzer = NearestNeighbors(n_neighbors=15, metric='euclidean')
nn_analyzer.fit(X_class_train_scaled)

neighborhood_analysis = {}

for i, sample_idx in enumerate(interesting_samples):
    sample = X_class_test_scaled[sample_idx:sample_idx+1]
    true_segment = y_class_test.iloc[sample_idx]

    # Find nearest neighbors
    distances, indices = nn_analyzer.kneighbors(sample)
    neighbor_segments = y_class_train.iloc[indices[0]]

    print(f"\nSample {i+1} - True Segment: {segment_names[true_segment]}")

    # Analyze neighborhood composition for different k values
    for k in [1, 3, 5, 10, 15]:
        k_neighbors = neighbor_segments[:k]
        segment_counts = k_neighbors.value_counts().sort_index()

        print(f" k={k}: ", end="")
        composition = []
        for seg in range(4):
            count = segment_counts.get(seg, 0)
            if count > 0:
                composition.append(f"{segment_names[seg][:3]}({count})")
        print(" | ".join(composition))

        # Make prediction for this k
        majority_vote = k_neighbors.mode()[0] if len(k_neighbors.mode()) > 0 else k_neighbors.iloc[0]
        correct = "(correct)" if majority_vote == true_segment else "(incorrect)"
        print(f" Prediction: {segment_names[majority_vote]} {correct}")

    neighborhood_analysis[f"Sample_{i+1}"] = {
        'true_segment': true_segment,
        'neighbors': neighbor_segments,
        'distances': distances[0]
    }

# Visualize neighborhood diversity
print(f"\n Neighborhood Diversity Analysis:")

# Calculate neighborhood purity for different k values
k_values_analysis = [1, 3, 5, 7, 10, 15, 20]
purity_scores = []

for k in k_values_analysis:
    total_purity = 0
    n_samples = min(100, len(X_class_test_scaled)) # Limit for performance

    for i in range(n_samples):
        sample = X_class_test_scaled[i:i+1]
        true_segment = y_class_test.iloc[i]

        # Find k nearest neighbors
        distances, indices = nn_analyzer.kneighbors(sample, n_neighbors=k)
        neighbor_segments = y_class_train.iloc[indices[0]]

        # Calculate purity (fraction of neighbors with same segment as majority)
        majority_segment = neighbor_segments.mode()[0]
        purity = (neighbor_segments == majority_segment).sum() / k
        total_purity += purity

    avg_purity = total_purity / n_samples
    purity_scores.append(avg_purity)
    print(f"• k={k}: Average neighborhood purity = {avg_purity:.3f}")

# Plot neighborhood purity
fig_purity = go.Figure()

fig_purity.add_trace(
    go.Scatter(
        x=k_values_analysis,
        y=purity_scores,
        mode='lines+markers',
        name='Neighborhood Purity',
        line=dict(color='purple'),
        hovertemplate="k: %{x}<br>Purity: %{y:.3f}<extra></extra>"
    )
)

fig_purity.update_layout(
    title="Neighborhood Purity vs k Value",
    xaxis_title="Number of Neighbors (k)",
    yaxis_title="Average Neighborhood Purity",
    height=400
)
fig_purity.show()

# Distance distribution analysis
print(f"\n Distance Distribution Analysis:")

# Analyze distance distributions for each segment
segment_distances = {seg: [] for seg in range(4)}

# Sample 50 points from each segment for analysis
for segment in range(4):
    segment_mask = y_class_train == segment
    segment_data = X_class_train_scaled[segment_mask]

    if len(segment_data) > 1:
        # Sample points
        sample_size = min(50, len(segment_data))
        sampled_indices = np.random.choice(len(segment_data), sample_size, replace=False)
        sampled_data = segment_data[sampled_indices]

        # Calculate pairwise distances within segment
        distances = euclidean_distances(sampled_data, sampled_data)
        # Get upper triangle (avoid duplicates and zeros)
        distances = distances[np.triu_indices_from(distances, k=1)]
        segment_distances[segment] = distances

# Plot distance distributions
fig_dist = go.Figure()

colors = ['blue', 'red', 'green', 'orange']
for segment in range(4):
    if len(segment_distances[segment]) > 0:
        fig_dist.add_trace(
            go.Histogram(
                x=segment_distances[segment],
                name=segment_names[segment],
                opacity=0.7,
                marker_color=colors[segment],
                nbinsx=30
            )
        )

fig_dist.update_layout(
    title="Intra-Segment Distance Distributions",
    xaxis_title="Euclidean Distance",
    yaxis_title="Frequency",
    barmode='overlay',
    height=500
)
fig_dist.show()

# Calculate and display statistics
print(f"\nIntra-segment distance statistics:")
for segment in range(4):
    if len(segment_distances[segment]) > 0:
        distances = segment_distances[segment]
        mean_dist = distances.mean()
        std_dist = distances.std()
        print(f"• {segment_names[segment]}: Mean = {mean_dist:.3f}, Std = {std_dist:.3f}")

 5. NEIGHBORHOOD ANALYSIS AND VISUALIZATION
Analyzing neighborhood composition for customer segmentation:
Analyzing 4 representative samples...

Sample 1 - True Segment: Premium Young
 k=1: Pre(1)
 Prediction: Premium Young 
 k=3: Pre(2) | Bud(1)
 Prediction: Premium Young 
 k=5: Pre(4) | Bud(1)
 Prediction: Premium Young 
 k=10: Pre(6) | Fam(3) | Bud(1)
 Prediction: Premium Young 
 k=15: Pre(10) | Fam(3) | Bud(2)
 Prediction: Premium Young 

Sample 2 - True Segment: Family Focused
 k=1: Fam(1)
 Prediction: Family Focused 
 k=3: Fam(3)
 Prediction: Family Focused 
 k=5: Fam(5)
 Prediction: Family Focused 
 k=10: Fam(10)
 Prediction: Family Focused 
 k=15: Fam(13) | Bud(2)
 Prediction: Family Focused 

Sample 3 - True Segment: Conservative Savers
 k=1: Con(1)
 Prediction: Conservative Savers 
 k=3: Con(1) | Bud(2)
 Prediction: Budget Conscious 
 k=5: Con(3) | Bud(2)
 Prediction: Conservative Savers 
 k=10: Fam(1) | Con(4) | Bud(5)
 Prediction: Budget Conscious 
 k=15: Fam(1) | Con(5) | 


 Distance Distribution Analysis:



Intra-segment distance statistics:
• Premium Young: Mean = 3.731, Std = 1.091
• Family Focused: Mean = 3.603, Std = 1.027
• Conservative Savers: Mean = 3.261, Std = 0.966
• Budget Conscious: Mean = 3.564, Std = 0.980
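
The purity loop in the cell above queries the neighbor index once per test sample and per value of k. The same quantity can be read off a single table of neighbors sorted by distance. The sketch below is a NumPy-only illustration on made-up two-cluster data: `neighborhood_purity` and all arrays in it are hypothetical, and it uses brute-force distances rather than the notebook's `NearestNeighbors` object.

```python
import numpy as np

def neighborhood_purity(X_train, y_train, X_query, k_values):
    """Average share of the k nearest training labels that match their majority label."""
    # Brute-force Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=2))
    # Training indices sorted from nearest to farthest for every query row
    order = np.argsort(dists, axis=1)
    n_classes = int(y_train.max()) + 1
    purity = {}
    for k in k_values:
        labels = y_train[order[:, :k]]  # labels of the k nearest neighbors
        # Per-query count of each label, then the majority share
        counts = np.stack([(labels == c).sum(axis=1) for c in range(n_classes)], axis=1)
        purity[k] = float((counts.max(axis=1) / k).mean())
    return purity

rng = np.random.default_rng(42)
# Two made-up, partially overlapping clusters with integer labels 0 and 1
X_train = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(2.0, 1.0, (60, 2))])
y_train = np.array([0] * 60 + [1] * 60)
X_query = rng.normal(1.0, 1.5, (30, 2))

purity = neighborhood_purity(X_train, y_train, X_query, [1, 3, 5, 15])
for k, value in purity.items():
    print(f"k={k}: average neighborhood purity = {value:.3f}")
```

Purity at k=1 is always 1.0 by construction, so the informative part of the curve starts at k=3.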


In [16]:
# 6. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS
print(" 6. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS")
print("=" * 54)

# Model interpretability analysis
print(" k-NN Model Interpretability Analysis:")

# Feature importance based on prediction sensitivity
def calculate_feature_sensitivity(model, X_baseline, feature_names, n_samples=100):
    """Calculate feature sensitivity by perturbing each feature"""

    # Sample baseline predictions
    sample_indices = np.random.choice(len(X_baseline), n_samples, replace=False)
    X_sample = X_baseline[sample_indices]
    baseline_predictions = model.predict(X_sample)

    sensitivities = {}

    for i, feature in enumerate(feature_names):
        # Perturb this feature by adding noise
        X_perturbed = X_sample.copy()
        noise_std = X_sample[:, i].std() * 0.1 # 10% of feature std
        X_perturbed[:, i] += np.random.normal(0, noise_std, len(X_sample))

        # Get new predictions
        perturbed_predictions = model.predict(X_perturbed)

        # Calculate sensitivity (prediction change rate).
        # With the default n_samples=100, one flipped prediction changes the rate by 0.01.
        change_rate = (baseline_predictions != perturbed_predictions).mean()
        sensitivities[feature] = change_rate

    return sensitivities

# Calculate feature sensitivities
sensitivities = calculate_feature_sensitivity(
    knn_class_optimal, X_class_train_scaled, class_features
)

print("Feature Sensitivity Analysis (prediction change rate with 10% noise):")
sensitivity_df = pd.DataFrame(list(sensitivities.items()),
    columns=['Feature', 'Sensitivity']).sort_values('Sensitivity', ascending=False)

for _, row in sensitivity_df.iterrows():
    print(f"• {row['Feature']}: {row['Sensitivity']:.3f}")

# Visualize feature sensitivity
fig_sensitivity = go.Figure()

fig_sensitivity.add_trace(
    go.Bar(
        x=sensitivity_df['Feature'],
        y=sensitivity_df['Sensitivity'],
        marker_color='lightgreen',
        hovertemplate="Feature: %{x}<br>Sensitivity: %{y:.3f}<extra></extra>"
    )
)

fig_sensitivity.update_layout(
    title="k-NN Feature Sensitivity Analysis",
    xaxis_title="Features",
    yaxis_title="Prediction Change Rate",
    xaxis_tickangle=-45,
    height=500
)
fig_sensitivity.show()

# Segment-specific insights
print(f"\n Segment-Specific Business Insights:")

# Analyze prediction confidence by segment
segment_confidence = {}
segment_sample_analysis = {}

for segment in range(4):
    # Get test samples for this segment
    segment_mask = y_class_test == segment
    if segment_mask.sum() > 0:
        segment_X = X_class_test_scaled[segment_mask]
        segment_predictions = knn_class_optimal.predict(segment_X)
        segment_probabilities = knn_class_optimal.predict_proba(segment_X)

        # Calculate confidence (max probability)
        confidences = segment_probabilities.max(axis=1)
        accuracy = (segment_predictions == segment).mean()

        segment_confidence[segment] = {
            'accuracy': accuracy,
            'avg_confidence': confidences.mean(),
            'min_confidence': confidences.min(),
            'max_confidence': confidences.max()
        }

        # Sample characteristics analysis
        segment_features = X_class_test[segment_mask].mean()
        segment_sample_analysis[segment] = segment_features

print("Segment Prediction Performance:")
for segment in range(4):
    if segment in segment_confidence:
        conf = segment_confidence[segment]
        print(f"\n• {segment_names[segment]}:")
        print(f" - Accuracy: {conf['accuracy']:.3f}")
        print(f" - Avg Confidence: {conf['avg_confidence']:.3f}")
        print(f" - Confidence Range: {conf['min_confidence']:.3f} - {conf['max_confidence']:.3f}")

# Strategic recommendations based on analysis
print(f"\n STRATEGIC RECOMMENDATIONS:")

print(f"\n1. OPTIMAL MODEL CONFIGURATION:")
print(f" • Use k = {optimal_k_class_cv} neighbors for customer segmentation")
print(f" • Apply StandardScaler for feature preprocessing")
print(f" • Expected classification accuracy: {class_accuracy:.1%}")
print(f" • Use {best_metric} distance metric for optimal performance")

print(f"\n2. FEATURE ENGINEERING INSIGHTS:")
# Note: when several features tie on sensitivity, their row order is arbitrary
most_sensitive = sensitivity_df.iloc[0]['Feature']
least_sensitive = sensitivity_df.iloc[-1]['Feature']
print(f" • Most impactful feature: {most_sensitive}")
print(f" • Least impactful feature: {least_sensitive}")
print(f" • Consider feature selection to reduce dimensionality")
print(f" • Focus data collection efforts on high-sensitivity features")

print(f"\n3. SEGMENT-SPECIFIC STRATEGIES:")

# Find most/least predictable segments
if segment_confidence:
    segment_accuracies = {seg: conf['accuracy'] for seg, conf in segment_confidence.items()}
    most_predictable = max(segment_accuracies, key=segment_accuracies.get)
    least_predictable = min(segment_accuracies, key=segment_accuracies.get)

    print(f" • Most predictable segment: {segment_names[most_predictable]} ({segment_accuracies[most_predictable]:.1%} accuracy)")
    print(f" - Implement automated targeting for this segment")
    print(f" - High confidence in model recommendations")

    print(f" • Least predictable segment: {segment_names[least_predictable]} ({segment_accuracies[least_predictable]:.1%} accuracy)")
    print(f" - Requires additional data collection")
    print(f" - Consider manual review for predictions")

print(f"\n4. IMPLEMENTATION RECOMMENDATIONS:")
print(f" • Real-time prediction latency: Fast at this training-set size; per-query cost grows with the number of stored samples")
print(f" • Memory requirements: Store all training data ({len(X_class_train)} samples)")
print(f" • Model updates: Retrain when significant data drift detected")
print(f" • Scalability: Consider approximate nearest neighbors for large datasets")

print(f"\n5. BUSINESS VALUE DRIVERS:")

# Calculate potential business impact
baseline_accuracy = max(y_class_train.value_counts()) / len(y_class_train) # Majority class baseline
improvement = (class_accuracy - baseline_accuracy) * 100

print(f" • Model improves over the majority-class baseline by {improvement:.1f} percentage points")
print(f" • Enables personalized marketing campaigns")
print(f" • Supports customer lifetime value prediction")
print(f" • Facilitates inventory planning by segment")

# ROI estimation
print(f"\n6. ROI ESTIMATION:")
print(f" • If applied to customer base of 10,000:")
print(f" - Correctly classified customers: ~{int(class_accuracy * 10000)}")
print(f" - Misclassified customers: ~{int((1-class_accuracy) * 10000)}")
print(f" • Assuming $50 value per correct classification:")
print(f" - Annual value: ${int(class_accuracy * 10000 * 50):,}")
print(f" • Model development cost amortized over high-volume predictions")

print(f"\n7. NEXT STEPS:")
print(f" • Deploy model for A/B testing on customer subset")
print(f" • Monitor prediction confidence and flag low-confidence cases")
print(f" • Collect feedback to validate segment assignments")
print(f" • Consider ensemble methods combining k-NN with other algorithms")
print(f" • Implement automated model retraining pipeline")

print(f"\n" + "="*75)
print(f" k-NN LEARNING SUMMARY:")
print(f" Mastered instance-based learning principles")
print(f" Optimized k-value selection through cross-validation")
print(f" Analyzed impact of distance metrics and scaling")
print(f" Understood curse of dimensionality effects")
print(f" Performed neighborhood analysis and interpretability")
print(f" Generated actionable business insights and ROI estimates")
print(f"="*75)

 6. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS
 k-NN Model Interpretability Analysis:
Feature Sensitivity Analysis (prediction change rate with 10% noise):
• customer_age: 0.010
• annual_income: 0.010
• spending_score: 0.010
• years_customer: 0.010
• monthly_purchases: 0.010
• clothing_purchases: 0.010
• grocery_purchases: 0.010
• electronics_purchases: 0.000



 Segment-Specific Business Insights:
Segment Prediction Performance:

• Premium Young:
 - Accuracy: 0.714
 - Avg Confidence: 0.714
 - Confidence Range: 0.500 - 0.900

• Family Focused:
 - Accuracy: 0.781
 - Avg Confidence: 0.766
 - Confidence Range: 0.400 - 1.000

• Conservative Savers:
 - Accuracy: 0.357
 - Avg Confidence: 0.707
 - Confidence Range: 0.500 - 1.000

• Budget Conscious:
 - Accuracy: 0.966
 - Avg Confidence: 0.880
 - Confidence Range: 0.500 - 1.000

 STRATEGIC RECOMMENDATIONS:

1. OPTIMAL MODEL CONFIGURATION:
 • Use k = 10 neighbors for customer segmentation
 • Apply StandardScaler for feature preprocessing
 • Expected classification accuracy: 88.5%
 • Use manhattan distance metric for optimal performance

2. FEATURE ENGINEERING INSIGHTS:
 • Most impactful feature: customer_age
 • Least impactful feature: electronics_purchases
 • Consider feature selection to reduce dimensionality
 • Focus data collection efforts on high-sensitivity features

3. SEGMENT-SPECIFIC STRATEGI