# K-Nearest Neighbors Example
This notebook demonstrates K-Nearest Neighbors (KNN), an instance-based learning method, by applying it to both a classification task and a regression task using implementations from the rice_ml library.

In [16]:
import numpy as np
from sklearn.datasets import load_iris, make_regression
from rice_ml.supervised_learning.k_nearest_neighbors import KNNClassifier, KNNRegressor
from rice_ml.processing.post_processing import accuracy_score, mse, r2_score
from rice_ml.processing.preprocessing import train_test_split

## Part 1: KNN Classification

### 2.1 Load Data and Data Preparation

In [4]:
# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

print(f"Total Samples in Dataset: {X.shape[0]}")
print(f"Number of Features: {X.shape[1]}")
print(f"Target Classes: {iris.target_names}")

Total Samples in Dataset: 150
Number of Features: 4
Target Classes: ['setosa' 'versicolor' 'virginica']


### 2.2 Data Pre-Processing: Scaling and Splitting
KNN is highly sensitive to feature scaling. Therefore, we standardize the features (mean = 0, variance = 1) before splitting the data for training and testing. This puts all features on a comparable scale, so that no single feature dominates the distance calculation merely because of its units.

In [8]:
# Simple Standardization (Z-score scaling)
X_mean = np.mean(X, axis = 0)
X_std = np.std(X, axis = 0)

# Avoid division by zero for constant features
X_std[X_std == 0] = 1
X_scaled = (X - X_mean) / X_std

# Split the dataset into training (80%) and testing (20%) sets
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(
    X_scaled, y, test_size = 0.2, random_state=67
)

# Verify the split integrity
print(f"\nTraining Set Size: {X_train_cls.shape[0]} samples")
print(f"Testing Set Size: {X_test_cls.shape[0]} samples")


Training Set Size: 120 samples
Testing Set Size: 30 samples
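
To illustrate why scaling matters for a distance-based method, here is a minimal sketch on hypothetical toy data (the `points` array and the `nearest_to_first` helper are made up for illustration and are not part of rice_ml). One column has much larger values than the other, so it dominates the raw Euclidean distance.

```python
import numpy as np

# Hypothetical toy data: column 0 spans a few units, column 1 spans hundreds.
points = np.array([
    [1.0, 1000.0],  # row 0 is the query point
    [1.1, 1200.0],
    [3.0, 1010.0],
    [2.0, 1900.0],
])

def nearest_to_first(data):
    """Return the index of the row closest (Euclidean) to row 0, excluding row 0."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

# Same z-score standardization as in the cell above.
scaled = (points - points.mean(axis=0)) / points.std(axis=0)

raw_nearest = nearest_to_first(points)
scaled_nearest = nearest_to_first(scaled)
print(f"Nearest row on raw data: {raw_nearest}")
print(f"Nearest row on standardized data: {scaled_nearest}")
```

On the raw data, row 2 is nearest because the large-valued column decides almost everything. After standardization, row 1 becomes nearest because both columns contribute.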


### 2.3 Initialize and Train the Model
Initialize the KNNClassifier using the standard Euclidean metric.\
KNN is considered a lazy learner because training involves nothing more than storing the training data; the computational work is deferred entirely to the prediction phase.

In [9]:
# 1. Initialize the KNearestNeighbors Classifier
# Set k=5 (a common odd number), use the Euclidean distance (default), and weight neighbors by distance.
knn_cls = KNNClassifier(n_neighbors=5, metric='euclidean', weights='distance')

print("\nKNN is a 'lazy' model. Training involves storing the data...")

# 2. Fit the model to the training data
# Validates and stores X_train and y_train internally.
knn_cls.fit(X_train_cls, y_train_cls)

print("Training Complete. Data stored for prediction.")


KNN is a 'lazy' model. Training involves storing the data...
Training Complete. Data stored for prediction.
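
The "lazy learner" idea can be sketched in a few lines. The `TinyKNN` class below is a hypothetical illustration, not the rice_ml implementation: `fit` only stores the arrays, and all distance computation happens in `predict`. For simplicity it uses an unweighted majority vote.

```python
import numpy as np
from collections import Counter

class TinyKNN:
    """Illustrative lazy learner (hypothetical; not the rice_ml class)."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # "Training" only stores the data; nothing is computed here.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pairwise Euclidean distances, shape (n_queries, n_train).
        dists = np.linalg.norm(X[:, None, :] - self.X_[None, :, :], axis=2)
        # Indices of the k closest training points for each query.
        nearest = np.argsort(dists, axis=1)[:, :self.n_neighbors]
        # Unweighted majority vote among the neighbors' labels.
        return np.array(
            [Counter(self.y_[row].tolist()).most_common(1)[0][0] for row in nearest]
        )

# Two well-separated toy clusters.
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_model = TinyKNN(n_neighbors=3).fit(X_toy, y_toy)
toy_pred = toy_model.predict([[0.2, 0.2], [5.5, 5.5]])
print(toy_pred)
```

Because every prediction scans all stored points, the cost of `predict` grows with the size of the training set, which is the trade-off for the trivial `fit` step.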


### 2.4 Prediction and Evaluation
Use the fitted KNN model to find the 5 nearest neighbors of each test sample, predict its class by a distance-weighted vote among them (since `weights='distance'` was set), and evaluate the accuracy.

In [10]:
# 1. Generate predictions on the test set
print("Starting prediction on test set...")
y_pred_cls = knn_cls.predict(X_test_cls)
print("Prediction Complete.")

# 2. Calculate the Accuracy Score
accuracy = accuracy_score(y_test_cls, y_pred_cls)

print(f"\n--- Evaluation Results (k=5) ---")
print(f"KNN Accuracy: {accuracy:.4f}")

Starting prediction on test set...
Prediction Complete.

--- Evaluation Results (k=5) ---
KNN Accuracy: 1.0000
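
The classifier above was created with `weights='distance'`. A common way to implement such weighting is to let each neighbor vote with weight 1/distance. The sketch below assumes that scheme; whether rice_ml uses exactly this formula is an assumption, and `weighted_vote` is a hypothetical helper.

```python
import numpy as np

def weighted_vote(distances, labels, eps=1e-12):
    """Pick the label with the largest sum of inverse-distance weights.

    Assumes weight = 1 / distance; eps guards against a zero distance.
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    weights = 1.0 / (distances + eps)
    classes = np.unique(labels)
    totals = np.array([weights[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]

# Hypothetical neighbors of one query: two very close points of class 0,
# three farther points of class 1.
neighbor_distances = [0.1, 0.2, 1.0, 1.1, 1.2]
neighbor_labels = [0, 0, 1, 1, 1]

weighted_label = weighted_vote(neighbor_distances, neighbor_labels)
majority_label = np.bincount(neighbor_labels).argmax()
print(f"Distance-weighted vote: {weighted_label}")
print(f"Plain majority vote: {majority_label}")
```

In this toy case the two schemes disagree: the plain majority picks class 1 (three votes to two), while the weighted vote picks class 0 because its two neighbors are much closer.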


## Part 2: KNN Regression
Use a synthetic regression dataset to train KNNRegressor, demonstrating prediction of continuous values.

### 3.1 Load Data and Preparation
Generate a simple dataset where the target variable is continuous. Also, standardize the features.

In [13]:
# Create synthetic regression data
X_reg, y_reg = make_regression(n_samples=150, n_features=3, noise=10.0, random_state=67)

# Standardization for regression features
X_reg_scaled = (X_reg - np.mean(X_reg, axis=0)) / np.std(X_reg, axis=0)

print(f"Total Regression Samples: {X_reg_scaled.shape[0]}")

Total Regression Samples: 150


### 3.2 Splitting the Dataset
The scaled regression data is split into training and test sets so that the model can be evaluated on data it has not stored.

In [14]:
# Split the regression dataset
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg_scaled, y_reg, test_size=0.2, random_state=67
)

# Verification
print(f"\nTraining Set Size: {X_train_reg.shape[0]} samples")


Training Set Size: 120 samples


### 3.3 Initialize and Train the Model
Initialize the KNNRegressor, which predicts by averaging the target values of the nearest neighbors. With `weights='uniform'`, as used here, every neighbor counts equally.

In [15]:
# 1. Initialize the KNNRegressor
# Using k=10 with the Manhattan distance and uniform weighting
knn_reg = KNNRegressor(
    n_neighbors=10, 
    metric='manhattan', # Testing a different metric
    weights='uniform' 
) 

print("\nInitializing KNN Regressor...")

# 2. Fit the model (store data)
knn_reg.fit(X_train_reg, y_train_reg)

print("Training Complete.")


Initializing KNN Regressor...
Training Complete.
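
As a rough illustration of what a uniform-weight KNN regressor with the Manhattan metric computes for a single query, consider the sketch below. The `knn_regress` function and the toy data are hypothetical and are not taken from rice_ml.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict by averaging the targets of the k nearest points (Manhattan distance)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Manhattan (L1) distance: sum of absolute coordinate differences.
    dists = np.abs(X_train - x_query).sum(axis=1)
    nearest = np.argsort(dists)[:k]
    # Uniform weights: a plain mean of the neighbors' targets.
    return float(y_train[nearest].mean())

X_toy_reg = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
y_toy_reg = [10.0, 20.0, 30.0, 100.0]

estimate = knn_regress(X_toy_reg, y_toy_reg, [0.2, 0.2], k=3)
print(f"Predicted value: {estimate:.1f}")
```

The three closest points have targets 10, 20, and 30, so the prediction is their mean, 20. The distant point with target 100 has no influence.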


### 3.4 Prediction and Evaluation
Predict the continuous targets and evaluate performance using Mean Squared Error (MSE) and R-squared ($R^2$).

In [17]:
# 1. Generate continuous predictions
y_pred_reg = knn_reg.predict(X_test_reg)

# 2. Calculate evaluation metrics

mean_squared_error = mse(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"\n--- Regression Results (k=10, Uniform Weights) ---")
print(f"Mean Squared Error (MSE): {mean_squared_error:.2f}")
print(f"R-squared (R2) Score: {r2:.4f}")


--- Regression Results (k=10, Uniform Weights) ---
Mean Squared Error (MSE): 1770.64
R-squared (R2) Score: 0.8486
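
For reference, the two metrics follow the standard definitions: MSE is the mean of the squared residuals, and $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below implements those definitions in NumPy; the `mse_sketch` and `r2_sketch` names are hypothetical, and the rice_ml functions are assumed (not confirmed) to follow the same formulas.

```python
import numpy as np

def mse_sketch(y_true, y_pred):
    """Mean of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2_sketch(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical toy values, unrelated to the dataset above.
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_pred_toy = [1.5, 2.0, 2.5, 4.0]

toy_mse = mse_sketch(y_true_toy, y_pred_toy)
toy_r2 = r2_sketch(y_true_toy, y_pred_toy)
print(f"MSE: {toy_mse:.3f}")
print(f"R2: {toy_r2:.3f}")
```

MSE is expressed in squared target units, so its magnitude depends on the scale of the target, whereas $R^2$ is unitless and measures the share of target variance that the predictions explain.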
