### K-Nearest Neighbors Regression

We chose K-Nearest Neighbors (KNN) regression because it is simple to apply and, as a non-parametric method, makes no assumptions about how the data are distributed. That suits our dataset, where the assumptions behind parametric models may not hold. KNN also has few hyperparameters to tune, which lets us focus on refining our feature set.

We are particularly interested in the impact of the features we engineered. KNN does not provide feature importance scores, but we can still gauge how much each feature matters by comparing KNN's performance across different feature configurations. This shows which features contribute most to predicting the target variable and guides further refinement of the model.
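
One way to compare feature configurations is a drop-one-feature ablation: refit KNN with each feature removed and see how far R^2 falls. The sketch below is a minimal illustration on synthetic data, not the analysis run in this notebook; the column names (`signal_a`, `signal_b`, `noise`) and the helper `knn_r2` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (hypothetical): the target depends on
# "signal_a" and "signal_b" but not on "noise".
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "signal_a": rng.normal(size=2000),
    "signal_b": rng.normal(size=2000),
    "noise": rng.normal(size=2000),
})
target = 3.0 * demo["signal_a"] + 2.0 * demo["signal_b"] + rng.normal(scale=0.1, size=2000)


def knn_r2(features, target, columns, n_neighbors=4):
    """Fit KNN on the given columns and return R^2 on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features[columns], target, test_size=0.20, random_state=42
    )
    model = KNeighborsRegressor(n_neighbors=n_neighbors).fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


all_columns = list(demo.columns)
baseline = knn_r2(demo, target, all_columns)

# Drop one feature at a time; a large fall in R^2 marks an informative feature.
r2_drop = {}
for column in all_columns:
    kept = [c for c in all_columns if c != column]
    r2_drop[column] = baseline - knn_r2(demo, target, kept)

for column, drop in sorted(r2_drop.items(), key=lambda item: -item[1]):
    print(f"Dropping {column}: R^2 changes by {-drop:+.4f}")
```

With the real data, the same loop could be run over the columns of `X`. Note that removing an uninformative feature can raise R^2, since it no longer distorts the distance computation.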

In [40]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [54]:
df = pd.read_csv('../data/data_feature_engineering.csv')
df.head()

| | price | name | distance | source | destination | precipIntensity | humidity | temperatureHigh | apparentTemperatureHigh | uvIndex | precipIntensityMax | temperatureMax | apparentTemperatureMax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.0 | 4 | 1.11 | 6 | 11 | 0.0 | 0.6 | 42.52 | 40.53 | 0 | 0.0003 | 42.52 | 40.53 |
| 1 | 16.0 | 0 | 1.11 | 6 | 11 | 0.0 | 0.66 | 33.83 | 32.85 | 0 | 0.0001 | 33.83 | 32.85 |
| 2 | 7.5 | 3 | 1.11 | 6 | 11 | 0.0 | 0.56 | 33.83 | 32.85 | 0 | 0.0001 | 33.83 | 32.85 |
| 3 | 7.5 | 5 | 1.11 | 6 | 11 | 0.0567 | 0.86 | 43.83 | 38.38 | 0 | 0.1252 | 43.83 | 38.38 |
| 4 | 26.0 | 1 | 1.11 | 6 | 11 | 0.0 | 0.64 | 33.83 | 32.85 | 0 | 0.0001 | 33.83 | 32.85 |


In [55]:
X = df.drop('price', axis=1)
y = df['price']

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [33]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(264454, 12) (66114, 12) (264454,) (66114,)


#### Determining K Value

We tested k values of 3, 4, 5, 7, 9, and 11 to span the trade-off between model complexity and generalization. With few neighbors, the model captures fine-grained patterns but may overfit by tracking noise in the training set. As the number of neighbors increases, each prediction averages over a broader neighborhood, which reduces variance but can underfit by smoothing the predictions too much.
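
To make the role of k concrete: with uniform weights, which is the default for scikit-learn's `KNeighborsRegressor`, the prediction for a query point is the mean target of its k nearest training points. The notation below is ours, introduced only for illustration: $N_k(x)$ denotes the indices of the k training points closest to $x$.

$$
\hat{y}(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i
$$

A small k averages over only a few targets, so noise in any one of them moves the prediction noticeably. A large k averages over many targets, which damps noise but also pulls in points that are farther from $x$.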

These values give a range of models from fine-grained to more generalized. The range avoids both extremes, too few neighbors (sensitive to noise) and too many (relevant local structure is averaged away), so we can look for the k that performs best on unseen data as measured by R-squared, MSE, and RMSE.
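
The two extremes can be seen on a toy problem. The sketch below uses synthetic one-dimensional data (a noisy sine curve, not our dataset) and compares training and test R^2 for a very small k, a moderate k, and a k equal to the full training set; all variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 1-D data (hypothetical): a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 6, size=(2000, 1))
y_demo = np.sin(x_demo[:, 0]) + rng.normal(scale=0.3, size=2000)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.20, random_state=42
)

train_r2, test_r2 = {}, {}
for k in [1, 10, len(x_tr)]:
    model = KNeighborsRegressor(n_neighbors=k).fit(x_tr, y_tr)
    train_r2[k] = model.score(x_tr, y_tr)  # R^2 on the data the model was fit on
    test_r2[k] = model.score(x_te, y_te)   # R^2 on held-out data
    print(f"k={k:>4}: train R^2 = {train_r2[k]:.3f}, test R^2 = {test_r2[k]:.3f}")
```

With k = 1 each training point is its own nearest neighbor, so the training targets are reproduced exactly while the test score lags behind. With k equal to the training set size, every prediction is the global mean of the training targets.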


In [53]:
neighbors_settings = [3, 4, 5, 7, 9, 11]

for n_neighbors in neighbors_settings:
    # Build the model
    knn = KNeighborsRegressor(n_neighbors=n_neighbors)
    
    # Train the model
    knn.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred = knn.predict(X_test)
    
    # Test the model
    # R-squared
    r2 = r2_score(y_test, y_pred)
    # Mean Squared Error (MSE)
    mse = mean_squared_error(y_test, y_pred)
    # Root Mean Squared Error (RMSE)
    rmse = np.sqrt(mse) 
    
    print(f"R^2 for {n_neighbors} neighbors: {r2}")
    print(f"MSE for {n_neighbors} neighbors: {mse}")
    print(f"RMSE for {n_neighbors} neighbors: {rmse}")

R^2 for 3 neighbors: 0.9167395562237907
MSE for 3 neighbors: 6.026877396954081
RMSE for 3 neighbors: 2.4549699380957968
R^2 for 4 neighbors: 0.917296154510147
MSE for 4 neighbors: 5.986587560879693
RMSE for 4 neighbors: 2.4467504083742773
R^2 for 5 neighbors: 0.9160188768906401
MSE for 5 neighbors: 6.079044377892731
RMSE for 5 neighbors: 2.4655718156023627
R^2 for 7 neighbors: 0.9128714870104566
MSE for 7 neighbors: 6.306870847077373
RMSE for 7 neighbors: 2.511348412123928
R^2 for 9 neighbors: 0.9091185582051136
MSE for 9 neighbors: 6.578529761724698
RMSE for 9 neighbors: 2.5648644723892717
R^2 for 11 neighbors: 0.9058390007629915
MSE for 11 neighbors: 6.815923291274751
RMSE for 11 neighbors: 2.6107323285382495


Across the tested values, the model with 4 neighbors performs best on every metric. It reaches an R^2 of 0.9173, meaning it explains about 92% of the variance in the target variable on the test set, and it has the lowest Mean Squared Error (MSE) and Root Mean Squared Error (RMSE), 5.9866 and 2.4468 respectively. The margin is small, though: k = 3 and k = 5 trail by less than 0.002 in R^2 (0.9167 and 0.9160), and performance declines steadily from k = 5 to k = 11.

Taken together, these results suggest that, among the values tested, 4 neighbors gives the best balance between capturing the underlying patterns in the data and generalizing to unseen data, which makes it our preferred setting for prediction.
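
One caveat: the loop above compares k values on the test split, so the score reported for the winning k is likely to be slightly optimistic. A common alternative, sketched here on synthetic data rather than our dataset, is to choose k by cross-validation on the training split and touch the test split only once at the end. The data and the names `X_demo`, `y_demo`, `best_k`, `cv_rmse`, and `test_rmse` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the real feature matrix and target (hypothetical data).
X_demo, y_demo = make_regression(
    n_samples=1000, n_features=5, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

neighbors_settings = [3, 4, 5, 7, 9, 11]

# 5-fold cross-validation on the training split only; the test split stays untouched.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": neighbors_settings},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
# Square root of the mean cross-validated MSE for the best k.
cv_rmse = np.sqrt(-search.best_score_)

# Evaluate the refitted best model once on the held-out test split.
test_rmse = np.sqrt(mean_squared_error(y_te, search.predict(X_te)))
print(f"Best k by CV: {best_k}, CV RMSE: {cv_rmse:.3f}, test RMSE: {test_rmse:.3f}")
```

On a dataset as large as ours, 5-fold cross-validation over six k values means 30 additional fits, so the runtime cost is worth weighing against the cleaner estimate.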

#### Drawbacks of the Model

Although the KNN regression model performs well, especially with 4 neighbors, it has limitations. Some possible drawbacks are:
1. KNN can be computationally intensive, particularly as the dataset grows. In a brute-force search, each prediction requires computing the distance from the query point to every training point to identify the nearest neighbors, so scalability can become an issue for large datasets.

2. KNN's performance heavily relies on the choice of distance metric and feature relevance. Irrelevant or highly correlated features can significantly degrade the model's accuracy.

3. KNN doesn't handle categorical variables well and requires pre-processing to convert them into a suitable numeric format.

4. Because KNN makes predictions based solely on the nearest neighbors, it can be sensitive to noise in the data, which can lead to overfitting, especially when k is very low.
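
Drawbacks 2 and 3 are usually addressed with preprocessing. Because KNN relies on distances, features with large numeric ranges can dominate the result, and integer-coded categories impose an ordering that may not exist. The sketch below shows one possible setup, standardizing numeric columns and one-hot encoding a categorical column inside a scikit-learn pipeline. It uses synthetic data, and the column names and values are hypothetical; it is not the preprocessing applied in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names and values are hypothetical.
rng = np.random.default_rng(42)
n = 1500
rides = pd.DataFrame({
    "distance": rng.uniform(0.5, 8.0, size=n),
    "temperature": rng.normal(40.0, 10.0, size=n),
    "source": rng.choice(["A", "B", "C"], size=n),
})
offset = rides["source"].map({"A": 0.0, "B": 5.0, "C": 10.0})
fare = 2.5 * rides["distance"] + offset + rng.normal(scale=0.5, size=n)

preprocess = ColumnTransformer(
    transformers=[
        # Put numeric features on a common scale so none dominates the distance.
        ("numeric", StandardScaler(), ["distance", "temperature"]),
        # One-hot encode categories so no artificial ordering enters the distance.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["source"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsRegressor(n_neighbors=4)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    rides, fare, test_size=0.20, random_state=42
)
pipeline.fit(X_tr, y_tr)
demo_r2 = pipeline.score(X_te, y_te)
print(f"Demo pipeline R^2: {demo_r2:.3f}")
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training split only, which avoids leaking test-set statistics into the model.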
