# SageMaker K-Nearest Neighbors (k-NN) Exercise

This notebook demonstrates Amazon SageMaker's **k-Nearest Neighbors (k-NN)** algorithm for classification and regression.

## What You'll Learn
1. How to prepare data for k-NN
2. How to train a k-NN model with indexing
3. How to use k-NN for classification and regression

## What is k-NN?

k-NN is a **non-parametric** algorithm that predicts from the training points closest to a query point:
- **Classification**: Predicts the class by majority vote among the k nearest neighbors
- **Regression**: Predicts the value as the average over the k nearest neighbors

**SageMaker's Implementation:**
- Uses efficient index structures for fast querying
- Supports dimension reduction for large feature spaces
- Scales to large datasets
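
The majority-vote and averaging rules described above can be sketched in a few lines of plain Python. This is a brute-force illustration on made-up toy data, not SageMaker's indexed implementation; the helper names (`nearest_indices`, `knn_classify`, `knn_regress`) are hypothetical.

```python
import math
from collections import Counter


def nearest_indices(train_X, query, k):
    """Indices of the k training points closest to `query` (Euclidean distance)."""
    ranked = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in ranked[:k]]


def knn_classify(train_X, train_y, query, k):
    """Majority vote over the labels of the k nearest neighbors."""
    votes = Counter(train_y[i] for i in nearest_indices(train_X, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k):
    """Mean of the target values of the k nearest neighbors."""
    neighbors = nearest_indices(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / len(neighbors)


# Toy data (made up for illustration): two well-separated groups
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
labels = [0, 0, 0, 1, 1]
values = [1.0, 2.0, 3.0, 10.0, 12.0]

predicted_class = knn_classify(points, labels, (0.1, 0.1), k=3)
predicted_value = knn_regress(points, values, (0.1, 0.1), k=3)
print(predicted_class, predicted_value)
```

Sorting every distance costs time proportional to the training set size per query, which is why an index structure becomes useful as the data grows.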

## Use Cases

| Application | Type |
|-------------|------|
| Product recommendation | Classification/Similarity |
| Anomaly detection | Classification |
| Image similarity search | Nearest neighbor search |
| Price prediction | Regression |

---

## Step 1: Setup and Imports

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
import pandas as pd
import numpy as np
import json
import os
from datetime import datetime
from dotenv import load_dotenv
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load environment variables from .env file
load_dotenv()

# Configure AWS session from environment variables
aws_profile = os.getenv('AWS_PROFILE')
aws_region = os.getenv('AWS_REGION', 'us-west-2')
sagemaker_role = os.getenv('SAGEMAKER_ROLE_ARN')

if aws_profile:
    boto3.setup_default_session(profile_name=aws_profile, region_name=aws_region)
else:
    boto3.setup_default_session(region_name=aws_region)

# SageMaker session and role
sagemaker_session = sagemaker.Session()

if sagemaker_role:
    role = sagemaker_role
else:
    role = get_execution_role()

region = sagemaker_session.boto_region_name

print(f"AWS Profile: {aws_profile or 'default'}")
print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"SageMaker SDK Version: {sagemaker.__version__}")

In [None]:
# Configuration
BUCKET_NAME = sagemaker_session.default_bucket()
PREFIX = "knn"

# Dataset parameters
NUM_SAMPLES = 5000
NUM_FEATURES = 20
NUM_CLASSES = 3
RANDOM_STATE = 42

print(f"S3 Bucket: {BUCKET_NAME}")
print(f"S3 Prefix: {PREFIX}")

## Step 2: Generate Synthetic Classification Data

In [None]:
# Generate classification dataset
X, y = make_classification(
    n_samples=NUM_SAMPLES,
    n_features=NUM_FEATURES,
    n_informative=15,
    n_redundant=3,
    n_classes=NUM_CLASSES,
    n_clusters_per_class=2,
    random_state=RANDOM_STATE
)

# Convert to float32
X = X.astype(np.float32)
y = y.astype(np.float32)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {NUM_FEATURES}")
print(f"Classes: {NUM_CLASSES}")
print(f"\nClass distribution (train):")
unique, counts = np.unique(y_train, return_counts=True)
for c, n in zip(unique, counts):
    print(f"  Class {int(c)}: {n}")

In [None]:
# Visualize first two dimensions
fig, ax = plt.subplots(figsize=(10, 7))

for c in range(NUM_CLASSES):
    mask = y_train == c
    ax.scatter(X_train[mask, 0], X_train[mask, 1], 
              label=f'Class {c}', alpha=0.6)

ax.set_xlabel('Feature 0')
ax.set_ylabel('Feature 1')
ax.set_title('Training Data (First 2 Features)')
ax.legend()
plt.show()

## Step 3: Prepare Data for k-NN

SageMaker k-NN accepts CSV input with the label in the first column, the features in the remaining columns, and no header row.

In [None]:
# Create CSV data with label first
os.makedirs('data/knn', exist_ok=True)

# Training data: label, features
train_data = np.column_stack([y_train, X_train])
np.savetxt('data/knn/train.csv', train_data, delimiter=',')

# Test data: label, features  
test_data = np.column_stack([y_test, X_test])
np.savetxt('data/knn/test.csv', test_data, delimiter=',')

print("Data files created:")
for f in os.listdir('data/knn'):
    size = os.path.getsize(f'data/knn/{f}') / 1024
    print(f"  data/knn/{f} ({size:.1f} KB)")
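
Before uploading, it can help to confirm that every row has one label plus the expected number of feature columns. The checker below is a minimal sketch using only the standard library; `check_label_first_csv` is a hypothetical helper, and the sample text is made up.

```python
import csv
import io


def check_label_first_csv(text, feature_dim):
    """Return the number of data rows; raise ValueError on a malformed row."""
    row_count = 0
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != feature_dim + 1:
            raise ValueError(
                f"line {line_no}: expected {feature_dim + 1} columns, got {len(row)}"
            )
        for cell in row:
            float(cell)  # raises ValueError if a cell is not numeric
        row_count += 1
    return row_count


# Made-up sample: label first, then two features per row
sample = "1.0,0.5,0.25\n0.0,1.5,2.5\n"
print(check_label_first_csv(sample, feature_dim=2))

# With the files from this step, a call could look like:
# check_label_first_csv(open('data/knn/train.csv').read(), NUM_FEATURES)
```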

In [None]:
# Upload to S3
s3_client = boto3.client('s3')

for split in ['train', 'test']:
    s3_key = f"{PREFIX}/{split}/{split}.csv"
    s3_client.upload_file(f'data/knn/{split}.csv', BUCKET_NAME, s3_key)
    print(f"Uploaded: s3://{BUCKET_NAME}/{s3_key}")

train_uri = f"s3://{BUCKET_NAME}/{PREFIX}/train"
test_uri = f"s3://{BUCKET_NAME}/{PREFIX}/test"

## Step 4: Train k-NN Model

### Key Hyperparameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `k` | Number of neighbors | Required |
| `predictor_type` | `classifier` or `regressor` | Required |
| `sample_size` | Number of training points sampled to build the index | Required |
| `feature_dim` | Number of features | Required |
| `index_type` | `faiss.Flat`, `faiss.IVFFlat`, `faiss.IVFPQ` | faiss.Flat |
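| `dimension_reduction_type` | `sign` (random projection) or `fjlt` (fast Johnson-Lindenstrauss transform) | None (no reduction) |
| `dimension_reduction_target` | Target dimension after reduction | Required when `dimension_reduction_type` is set |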
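
To give a feel for what a `sign`-style reduction does, here is a conceptual sketch of a random sign projection in plain Python. It is an illustration of the general technique only and makes no claim about SageMaker's internal implementation; `make_sign_matrix` and `project` are hypothetical names, and the target dimension of 8 is an arbitrary choice.

```python
import random


def make_sign_matrix(feature_dim, target_dim, seed=0):
    """Build a target_dim x feature_dim matrix whose entries are +1 or -1."""
    rng = random.Random(seed)
    return [
        [rng.choice((-1.0, 1.0)) for _ in range(feature_dim)]
        for _ in range(target_dim)
    ]


def project(vector, sign_matrix):
    """Project `vector` to len(sign_matrix) dimensions.

    Scaling by 1/sqrt(target_dim) keeps squared lengths unchanged in expectation.
    """
    scale = 1.0 / len(sign_matrix) ** 0.5
    return [
        scale * sum(sign * value for sign, value in zip(row, vector))
        for row in sign_matrix
    ]


# Reduce a 20-dimensional point (same width as this notebook's data) to 8 dimensions
matrix = make_sign_matrix(feature_dim=20, target_dim=8)
point = [float(i) for i in range(20)]
reduced = project(point, matrix)
print(len(reduced))
```

The same matrix has to be applied to training points and queries so that distances are compared in the same reduced space.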

In [None]:
# Get k-NN container image
knn_image = retrieve(
    framework='knn',
    region=region,
    version='1'
)

print(f"k-NN Image URI: {knn_image}")

In [None]:
# Create k-NN estimator
knn_estimator = Estimator(
    image_uri=knn_image,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/output',
    sagemaker_session=sagemaker_session,
    base_job_name='knn'
)

In [None]:
# Set hyperparameters
hyperparameters = {
    "k": 5,                          # Number of neighbors
    "predictor_type": "classifier",  # or "regressor"
    "feature_dim": NUM_FEATURES,
    "sample_size": len(X_train),
    "index_type": "faiss.Flat",      # Exact search
}

knn_estimator.set_hyperparameters(**hyperparameters)

print("k-NN hyperparameters:")
for name, value in hyperparameters.items():  # avoid reusing `k` as a loop variable
    print(f"  {name}: {value}")
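
A typo in a hyperparameter only surfaces once the training job has started, so a quick local check can save a few minutes. The sketch below validates only the values listed in the table above; `find_hyperparameter_problems` is a hypothetical helper, and it treats `sample_size` as required because this notebook always sets it explicitly.

```python
REQUIRED_KEYS = ("k", "predictor_type", "feature_dim", "sample_size")
PREDICTOR_TYPES = {"classifier", "regressor"}
INDEX_TYPES = {"faiss.Flat", "faiss.IVFFlat", "faiss.IVFPQ"}


def find_hyperparameter_problems(params):
    """Return a list of problems found; an empty list means none were found."""
    problems = [f"missing '{key}'" for key in REQUIRED_KEYS if key not in params]

    # Counts must be positive integers
    for key in ("k", "feature_dim", "sample_size"):
        if key in params:
            value = params[key]
            if not (isinstance(value, int) and value > 0):
                problems.append(f"'{key}' must be a positive integer, got {value!r}")

    if "predictor_type" in params and params["predictor_type"] not in PREDICTOR_TYPES:
        problems.append(f"unknown predictor_type {params['predictor_type']!r}")

    # index_type is optional; the table lists faiss.Flat as the default
    if params.get("index_type", "faiss.Flat") not in INDEX_TYPES:
        problems.append(f"unknown index_type {params['index_type']!r}")

    return problems


good = {"k": 5, "predictor_type": "classifier", "feature_dim": 20,
        "sample_size": 4000, "index_type": "faiss.Flat"}
bad = {"k": 0, "predictor_type": "cluster", "feature_dim": 20}

print(find_hyperparameter_problems(good))
print(find_hyperparameter_problems(bad))
```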

In [None]:
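# Start training
from sagemaker.inputs import TrainingInput

print("Starting k-NN training job...")
print("This will take approximately 3-5 minutes.\n")

# Declare the CSV content type explicitly instead of relying on the channel default
knn_estimator.fit(
    {
        'train': TrainingInput(train_uri, content_type='text/csv'),
        'test': TrainingInput(test_uri, content_type='text/csv')
    },
    wait=True,
    logs=True
)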

In [None]:
# Get training job info
job_name = knn_estimator.latest_training_job.name
print(f"Training job completed: {job_name}")
print(f"Model artifacts: {knn_estimator.model_data}")

## Step 5: Deploy and Test Model

In [None]:
# Deploy the model
print("Deploying k-NN model...")
print("This will take approximately 5-7 minutes.\n")

knn_predictor = knn_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name=f'knn-{datetime.now().strftime("%Y%m%d%H%M")}'
)

print(f"\nEndpoint deployed: {knn_predictor.endpoint_name}")

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Configure predictor
knn_predictor.serializer = CSVSerializer()
knn_predictor.deserializer = JSONDeserializer()

def predict(data, predictor, batch_size=100):
    """
    Send `data` to the endpoint in batches of `batch_size` rows and
    return the predicted labels as a NumPy array.
    """
    predictions = []
    
    for i in range(0, len(data), batch_size):
        batch = data[i:i+batch_size]
        response = predictor.predict(batch)
        
        for pred in response['predictions']:
            predictions.append(pred['predicted_label'])
    
    return np.array(predictions)
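
The batching logic can be exercised without a live endpoint by swapping in a stub that returns the same response shape the helper above reads (`predictions` holding a list of `predicted_label` entries). This is a sketch for local testing only: `StubPredictor` and `predict_in_batches` are hypothetical names, and the stub always answers with label `0.0`.

```python
class StubPredictor:
    """Stand-in for an endpoint predictor; records the size of each request."""

    def __init__(self):
        self.batch_sizes = []

    def predict(self, batch):
        self.batch_sizes.append(len(batch))
        return {"predictions": [{"predicted_label": 0.0} for _ in batch]}


def predict_in_batches(data, predictor, batch_size=100):
    """Same slicing as the notebook helper, but returns a plain list."""
    labels = []
    for start in range(0, len(data), batch_size):
        response = predictor.predict(data[start:start + batch_size])
        labels.extend(item["predicted_label"] for item in response["predictions"])
    return labels


# 250 made-up rows should be sent as batches of 100, 100, and 50
rows = [[0.0, 1.0] for _ in range(250)]
stub = StubPredictor()
stub_labels = predict_in_batches(rows, stub)
print(len(stub_labels), stub.batch_sizes)
```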

In [None]:
# Get predictions on test set
print("Getting predictions...")
y_pred = predict(X_test, knn_predictor)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)

print("\n" + "=" * 50)
print("CLASSIFICATION RESULTS")
print("=" * 50)
print(f"\nAccuracy: {accuracy:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

## Step 6: Effect of k Value

In [None]:
# Note: k is fixed when the model is trained, so comparing k values
# means training one model per value. This cell only summarizes the trade-offs.

print("""
Effect of k Value:
==================

| k Value | Characteristics |
|---------|----------------|
| Small k (1-3) | More sensitive to noise, sharper boundaries |
| Medium k (5-10) | Balanced, good for most cases |
| Large k (20+) | Smoother boundaries, may miss local patterns |

Choosing k:
- Use cross-validation to find optimal k
- Rule of thumb: k = sqrt(n) for classification
- Odd k avoids ties in binary classification
- Larger datasets can use larger k

For this dataset:
- Training samples: {}
- Suggested k range: 5-20
- sqrt(n) ≈ {:.0f}
""".format(len(X_train), np.sqrt(len(X_train))))
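
The guidance on choosing k can be tried locally before paying for several training jobs. The sketch below combines the square-root rule of thumb (nudged to an odd number) with leave-one-out accuracy, a simple form of cross-validation. It runs in plain Python on made-up toy data rather than against the SageMaker endpoint, and `sqrt_rule_k` and `loo_accuracy` are hypothetical helpers.

```python
import math
from collections import Counter


def sqrt_rule_k(n_samples):
    """Round sqrt(n) to the nearest integer, then bump even values to the next odd."""
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: classify each point using all the other points."""
    correct = 0
    for i, query in enumerate(X):
        others = sorted((math.dist(x, query), j) for j, x in enumerate(X) if j != i)
        votes = Counter(y[j] for _, j in others[:k])
        correct += votes.most_common(1)[0][0] == y[i]
    return correct / len(X)


# Toy data: two tight clusters of four points each
X_toy = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

for k in (1, 3, 7):
    print(k, loo_accuracy(X_toy, y_toy, k))

print(sqrt_rule_k(4000))
```

On this toy set, k=7 pulls in all four points of the other cluster and outvotes the three same-cluster neighbors, which illustrates how an overly large k can miss local structure.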

## Step 7: Clean Up Resources

In [None]:
# Delete the endpoint
print(f"Deleting endpoint: {knn_predictor.endpoint_name}")
knn_predictor.delete_endpoint()
print("Endpoint deleted successfully!")

---

## Summary

In this exercise, you learned:

1. **Data Format**: CSV with label in first column

2. **Key Hyperparameters**:
   - `k`: Number of neighbors
   - `predictor_type`: classifier or regressor
   - `index_type`: Index structure for search

3. **Index Types**:
   - `faiss.Flat`: Exact search (small datasets)
   - `faiss.IVFFlat`: Approximate (medium datasets)
   - `faiss.IVFPQ`: Approximate + compression (large datasets)

4. **Output**:
   - Classification: Predicted label
   - Regression: Predicted value

### Instance Recommendations

| Task | Instance Types |
|------|----------------|
| Training | ml.m5.large, ml.c5.xlarge (CPU), ml.p2.xlarge (GPU) |
| Inference | ml.m5.large, ml.c5.large |

### When to Use k-NN

| Good for | Not ideal for |
|----------|---------------|
| Small-medium datasets | Very large datasets |
| Non-linear boundaries | High-dimensional sparse data |
| When interpretability matters | When low-latency inference is critical |
| Multi-class classification | Streaming data |

### Next Steps

- Try different k values with cross-validation
- Use dimension reduction for high-dimensional data
- Experiment with different index types
- Apply to regression problems