# Lab 5: KNN Regression
----------------------------------

**Goals**:
 - Practice KNN regression in preparation for this week's homework.
 - Practice using cross-validation to find the optimal hyperparameter (useful for competitions).

For this lab, we will use the archived Kaggle Flood Prediction competition:

https://www.kaggle.com/competitions/playground-series-s4e5

Please join the competition, download the `train.csv` and `test.csv` files, and place them in the same folder as this notebook.

## Data Preprocessing

First, we need to standardize the features. Please fill in the code marked `TODO`.
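
KNN predicts from distances between samples, so a feature measured on a larger scale can dominate the neighbor search. As a minimal sketch of what standardization does, the toy example below applies `StandardScaler` to a made-up array (the values are illustrative assumptions, not taken from the competition data) and checks that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])

# Learn each column's mean and standard deviation, then rescale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# After scaling, each column has mean 0 and (population) standard deviation 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

When the scaler is placed inside a `Pipeline`, it is re-fit on the training portion of each cross-validation fold, so the validation fold does not leak into the scaling statistics.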

In [None]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
import numpy as np
import time
import matplotlib.pyplot as plt

# Load train and test CSV files
# We'll only fit the model to the first 10,000 samples of the training data to save time.
train_data = pd.read_csv('train.csv')[:10000]
test_data = pd.read_csv('test.csv')

# Drop id column from the training dataset
train_data = train_data.drop(['id'], axis=1)

# Separate features (X) and target (y) from training data
X_train = train_data.drop('FloodProbability', axis=1)
y_train = train_data['FloodProbability']

# Create a Pipeline object with StandardScaler and KNeighborsRegressor
# model = TODO

# Train the model on the training data
model.fit(X_train, y_train)

## Cross Validation

Next, we want to estimate the cross-validated R^2 score for a range of values of `k` (the number of neighbors, `n_neighbors` in scikit-learn).

Complete the following code to plot the average validation R^2 vs. `k` and the average evaluation time per sample vs. `k`.
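
To make concrete what 5-fold cross-validation computes, here is a hand-written sketch on synthetic data (the `X_demo`/`y_demo` arrays and the fixed `n_neighbors=5` are assumptions for illustration only). It mirrors, in simplified form, the unshuffled `KFold` split that `cross_val_score` uses for a regressor when `cv=5`; in the lab itself, a single `cross_val_score` call replaces this loop.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (hypothetical): y depends linearly on two features plus small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Split the data into 5 consecutive folds
kf = KFold(n_splits=5)
fold_scores = []
for train_idx, val_idx in kf.split(X_demo):
    # Fit on four folds, then score on the held-out fold
    knn = KNeighborsRegressor(n_neighbors=5)
    knn.fit(X_demo[train_idx], y_demo[train_idx])
    val_pred = knn.predict(X_demo[val_idx])
    fold_scores.append(r2_score(y_demo[val_idx], val_pred))

# The cross-validation score is the average over the 5 held-out folds
mean_r2 = float(np.mean(fold_scores))
print(f'Per-fold R^2: {np.round(fold_scores, 3)}')
print(f'Mean validation R^2: {mean_r2:.3f}')
```

Every sample is used for validation exactly once, which is why the lab code divides the elapsed time by `len(X_train)` to get a per-sample figure.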

In [None]:
# Define the hyperparameter range for n_neighbors
n_neighbors_values = range(1, 21)  # Testing for neighbors from 1 to 20

# Store results
n_neighbors_list = []
r2_scores = []
times_per_sample = []

# Perform cross-validation over different values of n_neighbors
for n_neighbors in n_neighbors_values:
    # Create a Pipeline object with StandardScaler and KNeighborsRegressor
    # model = TODO

    # Measure the time taken for cross-validation
    start_time = time.time()

    # Run 5-fold cross-validation with the cross_val_score function
    # and take the mean of the per-fold R² scores
    #
    # mean_validation_score = TODO

    # Calculate elapsed time and seconds per sample
    elapsed_time = time.time() - start_time
    seconds_per_sample = elapsed_time / len(X_train)

    # Store results for plotting
    n_neighbors_list.append(n_neighbors)
    r2_scores.append(mean_validation_score)
    times_per_sample.append(seconds_per_sample)

    # Print the validation R², the value of n_neighbors, and the timing
    print(f'Validation R²: {mean_validation_score:.4f} with n_neighbors={n_neighbors}')
    print(f'Time taken: {elapsed_time:.2f} seconds, Time per sample: {seconds_per_sample:.6f} seconds')

# Plotting results
fig, axs = plt.subplots(1, 2, figsize=(12, 6))

# Left plot: R² score vs number of neighbors
axs[0].plot(n_neighbors_list, r2_scores, marker='o', linestyle='-', color='b')
axs[0].set_xlabel('Number of Neighbors')
axs[0].set_ylabel('Validation R² Score')
axs[0].set_title('Validation R² Score vs. Number of Neighbors')

# Right plot: Time per sample vs number of neighbors
axs[1].plot(n_neighbors_list, times_per_sample, color='r')
axs[1].set_xlabel('Number of Neighbors')
axs[1].set_ylabel('Time per Sample (seconds)')
axs[1].set_title('Time per Sample vs. Number of Neighbors')

plt.tight_layout()
plt.show()

# Output the best hyperparameter
best_n_neighbors = n_neighbors_list[np.argmax(r2_scores)]
best_score = np.max(r2_scores)
print(f'Best n_neighbors: {best_n_neighbors} with Validation R²: {best_score:.4f}')


## Submit to Kaggle

Using the optimal `k` you found, complete the following code to generate predictions on the test data.

Then, submit the `submission.csv` to Kaggle. Please show the TA both the plots generated above and the Kaggle submission result.

In [None]:
# Create a model using the optimal k value
# model = TODO

# Train the model on the training data
model.fit(X_train, y_train)

# Drop the 'id' column from the test features, but keep the ids for the submission file
X_test = test_data.drop(['id'], axis=1)
test_ids = test_data['id']

# Make predictions on the test data
y_pred = model.predict(X_test)

# Create the submission DataFrame
submission = pd.DataFrame({'id': test_ids, 'FloodProbability': y_pred})

# Save the submission DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("Submission file 'submission.csv' has been created!")