# k-Nearest Neighbors Regression
## DS-3001: Machine Learning 1

Content adapted from Terence Johnson (UVA)

**Notebook Summary**: In this notebook, we discuss KNN regression. This is a simple machine learning technique for predicting a continuous numerical outcome variable. We discuss the similarities and differences between KNN regression and classification. We also discuss how to quantify our errors in a regression task and how to use that quantification for hyperparameter selection.

#### Setting Up Our Environment

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os # For changing directory

# To mount your google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Update the path to your folder for the class,
# where you stored the data from the previous notebook, for example:
# path_to_DS_3001_folder = '/content/drive/MyDrive/DS-3001/02_Intro_to_ML_Algorithms'
path_to_DS_3001_folder = ''

# Only change directory once a path has been filled in
# (calling os.chdir with an empty string raises an error)
if path_to_DS_3001_folder:
  os.chdir(path_to_DS_3001_folder)

### Recap From Last Time and Changes for Today

- In the KNN Classification notebook, we were introduced to a supervised learning task where we used some features/covariates $X$ to predict some outcome $y$.
- The task we looked at was predicting diabetes outcome, which was a categorical feature, making it a **classification** task.
- We looked at our model's performance using the confusion matrix and accuracy measures.
- Today, we'll look at a slightly different task: **regression.** For regression, we are interested in predicting a numeric outcome variable. In other words, $y$ will be a continuous variable instead of a categorical class label.
- Today, we will look at tools to model data in a regression setting and how to measure our performance.

# 1. Preparing the Data
### Introduction to the Data for Today

- Today we're looking at a data set about cars. Each observation is a unique make and model of a car. We have a variety of features to choose from. We're going to attempt to predict the `baseline sales` ($y$) of the car given the car's `baseline price` and `baseline mpg` ($X$). A more complete description of our variables of interest is given below:
  - `baseline price`: The market price of the car.
  - `baseline mpg`: The manufacturer's claim of miles per gallon.
  - `baseline sales`: The predicted sales of the car.

In [None]:
# Load in our data set as a Pandas Data Frame
cars_df = pd.read_csv('data/cars_env.csv')
cars_df.head()

#### Quick Data Cleaning for our `baseline price` and `baseline mpg` variables

In [None]:
# Look at the distribution of the price
# Should we make any changes?


In [None]:
# Look at the distribution of MPG
# should we make any changes?


### How do our chosen variables vary with the `baseline sales`?

In [None]:
# How can we look at a comparison between the two numeric values visually?



### How do our chosen variables vary with one another, and is there a distinct separation of the sales value?

In [None]:
import matplotlib as mpl

In [None]:
# Create a scatterplot between the baseline mpg and price, colored by the baseline sales
ax = sns.scatterplot(
    x = cars_df['baseline mpg'],
    y = cars_df['baseline price'],
    hue = cars_df['baseline sales'].values,
    palette = 'crest',
    alpha = 0.5,
    legend = False
)
plt.xlabel('Baseline MPG')
plt.ylabel('Baseline Price')

# Creating a colorbar
norm = mpl.colors.Normalize(
    vmin=cars_df["baseline sales"].min(),
    vmax=cars_df["baseline sales"].max()
)

# Create scalar mappable
sm = mpl.cm.ScalarMappable(norm=norm, cmap="crest")
sm.set_array([])

# Add colorbar
ax.figure.colorbar(sm, ax=ax, label="Baseline Sales")

plt.show()

# 2. Regression

## k-NN Regression

* In KNN regression, we use a similar setup to the one we had for KNN classification, but with a couple of tweaks.
* We are still working with covariates/features $X = [x_1, x_2, ..., x_L]$, consisting of $N$ observations of $L$ variables, and an observed outcome/target variable $Y$. Because we have switched to regression, our outcome $Y$ is **numeric,** not categorical.
* We are interested in what happens when we get a new case $\hat{x} = (\hat{x}_1,...,\hat{x}_L)$. We want to predict what **numeric value** the outcome will likely take, $\hat{y}$.
* With this set up, the $k$ Nearest Neighbor Regression Algorithm works as follows:
  1. Compute the distance from $\hat{x}$ to each observation $x_i$ in the data set.
  2. Find the $k$ "nearest neighbors" $x_1^*$, $x_2^*$, ..., $x_k^*$ to $\hat{x}$ in the data in terms of distance, with values $y_1^*$, $y_2^*$, ..., $y_k^*$.
  3. Return the *average* of the neighbor values,
  \begin{gather}
    \hat{y}(\hat{x}) = \dfrac{y_1^* + y_2^* + ... + y_k^*}{k} = \frac{1}{k} \sum_{i=1}^k y_i^*
  \end{gather}
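
The three steps above can be sketched from scratch in NumPy. This is only an illustrative sketch: the function name `knn_regress`, the toy data, and the choice of Euclidean distance are assumptions made for the example (the steps above do not fix a particular distance measure).

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k):
    """Predict a numeric outcome for x_new by averaging its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 1: distance from x_new to every observation (Euclidean, by assumption)
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))

    # Step 2: indices of the k observations with the smallest distances
    nearest = np.argsort(distances)[:k]

    # Step 3: average the outcome values of those neighbors
    return y_train[nearest].mean()

# Toy data made up for illustration: 5 observations of 2 features
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0, 110.0])

# The three closest points to (0.2, 0.1) have outcomes 10, 20, and 30
prediction = knn_regress(X_toy, y_toy, x_new=[0.2, 0.1], k=3)
print(prediction)
```

With $k = 3$, the prediction is the average of the three nearby outcomes; the two far-away observations have no influence on it.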

### Implementing k-NN Regression in Scikit-Learn

* To implement KNN regression in Scikit-Learn, you can import the model similarly to how we did for the KNN Classifier:
  - `from sklearn.neighbors import KNeighborsRegressor`
* We will use the same workflow we discussed for the KNN Classifier when using Scikit-Learn:
  1. Create an untrained model object with a fixed $k$:
    - `model = KNeighborsRegressor(n_neighbors=k)`
  2. Fit that object to the data, $(X,y)$:
    - `fitted_model = model.fit(X, y)`
  3. Use the fitted object to make predictions for new cases $\hat{x}$:
    - `y_hat = fitted_model.predict(x_hat)`
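
As a minimal sketch of this three-step workflow, here it is on a tiny made-up data set. The `toy_` names and values are hypothetical and unrelated to the cars data. Note that `predict` expects a 2-D input (one row per new case), even for a single case.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up data: one feature, with outcome equal to the feature squared
toy_X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
toy_y = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])

# 1. Create an untrained model object with a fixed k
toy_model = KNeighborsRegressor(n_neighbors=2)

# 2. Fit that object to the data
toy_fitted = toy_model.fit(toy_X, toy_y)

# 3. Predict for a new case; its two nearest neighbors are x = 2 and x = 3
toy_x_hat = np.array([[2.4]])
toy_y_hat = toy_fitted.predict(toy_x_hat)
print(toy_y_hat)
```

The prediction is the average of the two neighbors' outcomes, $(4 + 9) / 2 = 6.5$.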

### Separating our data into a Train/Test Split

* We discussed in the KNN classification notebook the need to separate our data into a Train/Test split. We train our model using the training data and test the model (selecting hyperparameters, evaluating performance, etc.) using the test data set.
* This allows us to simulate how our model may perform on unseen data that were created using the same data generation process.
* We care more about how our model does on unseen data than we do about how it performs on the data it was trained on.
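
To make the split concrete, here is a small sketch using `train_test_split` on made-up arrays (the `toy_` names are hypothetical). It shows that 20% of the rows are held out and that each feature row stays paired with its own outcome.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 observations of 2 features; row i is [2i, 2i + 1]
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# Hold out 20% of the observations for testing
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=123
)

print(toy_X_train.shape, toy_X_test.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set each time the cell is run.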

### Implementing KNN Using our Cars Data Set

#### 1. Imports and Function Definitions

In [None]:
# Our necessary imports for SKLearn
from sklearn.neighbors import KNeighborsRegressor # The model object
from sklearn.model_selection import train_test_split # Creating a train test split

# Our MinMaxScaler function from last time
# scaling the variables is critical since we are working with distances
def MinMaxScaler(x):

  # Pre-compute the min and max of the variable
  min_x = np.min(x)
  max_x = np.max(x)

  # Calculate the newly scaled version of the variable
  u = (x - min_x) / (max_x - min_x)

  # Return the scaled version of the value
  return u
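
To see why scaling matters when we work with distances, consider the following sketch. The four cars and the helper names (`min_max_scale`, `nearest_to_first`) are made up for illustration, and Euclidean distance is assumed. On the raw scale, price differences in dollars swamp mpg differences, so the nearest neighbor of the first car changes once both variables are scaled to $[0, 1]$.

```python
import numpy as np
import pandas as pd

def min_max_scale(x):
    # Same formula as the MinMaxScaler function: (x - min) / (max - min)
    return (x - np.min(x)) / (np.max(x) - np.min(x))

# Made-up cars: mpg is in the tens, price is in the tens of thousands
toy_cars = pd.DataFrame({
    'mpg': [20.0, 40.0, 21.0, 30.0],
    'price': [20000.0, 20100.0, 20500.0, 30000.0],
})

def nearest_to_first(df):
    # Euclidean distance from row 0 to every other row
    diffs = df.iloc[1:].to_numpy() - df.iloc[0].to_numpy()
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Shift by 1 because row 0 itself was excluded
    return int(np.argmin(distances)) + 1

raw_neighbor = nearest_to_first(toy_cars)
scaled_neighbor = nearest_to_first(toy_cars.apply(min_max_scale))

print('Nearest to car 0 (raw):', raw_neighbor)
print('Nearest to car 0 (scaled):', scaled_neighbor)
```

On the raw data, car 1 (very different mpg, nearly the same price) is the nearest neighbor of car 0. After scaling, car 2 (similar mpg and similar price) is the nearest.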

#### 2. Normalizing our Variables and Creating a Train/Test Split

In [None]:
# Select our outcome variable, baseline sales in this case
y = cars_df['baseline sales']

# Select our variables of interest
var1 = 'baseline mpg'
var2 = 'baseline price'
x = cars_df.loc[:, [var1, var2]] # Creating our x pandas DataFrame

# Scale our variables so that they are on the same scale, a crucial step
u = x.apply(MinMaxScaler)

# Create a train test split
u_train, u_test, y_train, y_test = train_test_split(u, y, test_size = 0.2, random_state = 123)


In [None]:
# Look at our scaled distributions
sns.histplot(
  u['baseline mpg'],
  label = 'Baseline MPG',
  color = 'dodgerblue',
  bins = 50,
  alpha = 0.5
)
sns.histplot(
  u['baseline price'],
  label = 'Baseline Price',
  color = 'firebrick',
  bins = 50,
  alpha = 0.5
)

plt.legend()
plt.show()

#### 4. Fit the KNN Regression Model and Make Predictions

In [None]:
# Pick a value of k to start
k = 5

# Create a model instance
model = KNeighborsRegressor(n_neighbors = k)

# Fit the model to our TRAIN data
model = model.fit(u_train, y_train)

# Make predictions on the test data set
y_hat = model.predict(u_test)

#### 5. Visualize our Predicted Outcome vs the True Outcome

In [None]:
# Creating a Scatter Plot
ax = sns.scatterplot(
    x = y_test,
    y = y_hat
)
ax.set_aspect('equal')

# Setting what the y and x limits should be
ax_min = 0
ax_max = max(y_test.max(), y_hat.max())
plt.ylim([ax_min, ax_max])
plt.xlabel('True Test Outcome')
plt.ylabel('Predicted Test Outcome')
plt.title('Comparison of Predicted vs. True Outcome on Test Data Set')
plt.xticks(rotation = 45) # Change the rotation of the tick labels so that they don't overlap
plt.show()

**Question:** What shape would we like to see in the plot above? In other words, if our model were perfect at predicting the outcome, what would the graph look like?

In [None]:
# Add that shape as a reference to the plot above


# 3. Residuals and MSE

## Residuals

* The **residual** is the difference between the true value ($y$) and the predicted value ($\hat{y}$).

\begin{gather}
  \underbrace{r_i}_{\text{Residual, error}} = \underbrace{y_i}_{\text{True}} - \underbrace{\hat{y}(x_i)}_{\text{predicted}}
\end{gather}

* This tells us how far our predicted value is from the true value. This can be interpreted as the error of our prediction.

* This is the performance metric for regression in the same way that the confusion matrix was for classification.

* It is helpful to look at the residuals to understand how our model did.
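
As a quick illustration of the residual formula, here is a sketch with made-up true and predicted values (the `toy_` names are hypothetical and not from the cars data).

```python
import numpy as np

# Made-up true outcomes and predictions for three observations
toy_y_true = np.array([100.0, 150.0, 200.0])
toy_y_pred = np.array([110.0, 140.0, 200.0])

# Residual: true value minus predicted value
toy_residuals = toy_y_true - toy_y_pred
print(toy_residuals)
```

A negative residual means the model over-predicted, a positive residual means it under-predicted, and a residual of zero means the prediction was exact.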

In [None]:
# Let's visualize our residuals
import numpy as np
import matplotlib.pyplot as plt

# Turn our pandas series into Numpy arrays
y_test = np.asarray(y_test)
y_hat = np.asarray(y_hat)

# Add in our data points
plt.scatter(
    y_test,
    y_hat,
    label = 'Our Data'
)

# Compute the range for the perfect prediction line (where y_hat = y)
ax_min = min(y_test.min(), y_hat.min())
ax_max = max(y_test.max(), y_hat.max())
line = np.linspace(ax_min, ax_max, 1000)

# Visualizing the Perfect Prediction Line
plt.plot(
    line,
    line,
    color = 'black',
    linestyle = '--',
    alpha = 0.2,
    label = r'y = $\hat{y}$'
)

# Plot the Residuals: vertical lines y_i - y_hat_i
for a, b in zip(y_test, y_hat):
    plt.vlines(
        a, a, b,
        color='red', linewidth=.5,
        alpha=0.5, linestyle = '--'
    )

# Re-draw the last residual line with a label so it appears in the legend
plt.vlines(
    a, a, b,
    color='red', linewidth=.5,
    alpha=0.5, linestyle = '--',
    label = r'$r = y - \hat{y}$'
)

plt.xlabel('True Y Outcome')
plt.ylabel('Predicted Y Outcome')
plt.title('Plotting Residuals')
plt.xticks(rotation = 45)
plt.legend()
plt.show()


#### Plotting the Residuals on the y-axis instead of our predicted outcome

In [None]:
# Let's visualize our residuals
import numpy as np
import matplotlib.pyplot as plt

# Turn our pandas series into Numpy arrays
y_test = np.asarray(y_test)
y_hat = np.asarray(y_hat)

# Calculate our residual
residual = y_test - y_hat

# Add in our data points
plt.scatter(
    y_test,
    residual
)

# Add a horizontal line at 0
plt.axhline(
    y = 0,
    linestyle = '--',
    color = 'black',
    alpha = 0.2,
    label = 'Perfect Prediction'
)

# Plot the Residuals: vertical lines y_i - y_hat_i
for a, b in zip(y_test, residual):
    plt.vlines(
        a, b, 0,
        color='red', linewidth=.5,
        alpha=0.5, linestyle = '--'
    )

# Re-draw the last residual line with a label so it appears in the legend
plt.vlines(
    a, b, 0,
    color='red', linewidth=.5,
    alpha=0.5, linestyle = '--',
    label = r'$r = y - \hat{y}$'
)

plt.xlabel('True Y Outcome')
plt.ylabel(r'Residual ($r = y - \hat{y}$)')
plt.title('Plotting Residuals')
plt.legend()
plt.show()


## Loss Function: Mean Squared Error

* Similar to how accuracy was a one-number summary of the confusion matrix for classification, **mean squared error** is a one-number summary of how we did at regression.
* We compute the **mean squared error (MSE)** as

\begin{gather}
  \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
\end{gather}

* An alternative measure is the **root mean squared error (RMSE)** which is computed as:

\begin{gather}
  \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
\end{gather}

* The MSE is the average squared distance between the true values and the predicted ones. The RMSE takes the square root of the MSE, so the error is expressed in the same units as $y$. As the number of observations ($n$) gets large, these values typically approach some fixed value.
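
* As a small worked example (the numbers are made up for illustration and are not from the cars data), suppose $n = 3$ with true values $y = (100, 150, 200)$ and predictions $\hat{y} = (110, 140, 200)$. Then:

\begin{gather}
  \text{MSE} = \frac{(100 - 110)^2 + (150 - 140)^2 + (200 - 200)^2}{3} = \frac{100 + 100 + 0}{3} \approx 66.67 \\
  \text{RMSE} = \sqrt{66.67} \approx 8.16
\end{gather}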

### Creating a Mean Squared Error Function

* In the classification notebook we used `model.score(X_test, y_test)` to understand how our model performed. For a regression model, `model.score(X_test, y_test)` does not return either the MSE or RMSE. Instead, it returns the $R^2$ value, which we'll cover in the future.

* Because of this, we need to write our own function for the MSE, or we could use the alternative sklearn function:
  - `from sklearn.metrics import mean_squared_error`

* For today, we're going to write our own function to see how we can implement the math formula in Python.
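
As a quick sanity check on the scikit-learn alternative mentioned above, this sketch compares `mean_squared_error` with the formula computed directly in NumPy. The arrays are made up for illustration and the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true outcomes and predictions
toy_y_true = np.array([3.0, 5.0, 7.0, 9.0])
toy_y_pred = np.array([2.0, 5.0, 8.0, 12.0])

# MSE straight from the formula: the mean of the squared residuals
manual_mse = np.mean((toy_y_true - toy_y_pred) ** 2)

# The same quantity from scikit-learn
sklearn_mse = mean_squared_error(toy_y_true, toy_y_pred)

# RMSE is the square root of the MSE
manual_rmse = np.sqrt(manual_mse)

print('Manual MSE:', manual_mse)
print('sklearn MSE:', sklearn_mse)
print('RMSE:', manual_rmse)
```

The squared errors here are 1, 0, 1, and 9, so both approaches give an MSE of $11 / 4 = 2.75$.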

In [None]:
# First, let's create our own function

def mse(y_test, y_hat):

  # Calculate the squared errors
  squared_errors = (y_hat - y_test)**2

  # Calculate the mean of the squared errors
  mean_squared_errors = np.sum(squared_errors) / len(y_test)

  return mean_squared_errors

In [None]:
# Apply the function to our current case
test_mse = mse(y_test, y_hat)
test_rmse = np.sqrt(test_mse)

print('Test MSE:', test_mse)
print('Test RMSE:', test_rmse)

### Picking k (our hyperparameter)

* We can make a comparison between the performance of the model on the train and test data for different values of k. This will let us know how our model performs for different values of k.
* By plotting both the train and test MSE, we can see if the model is overfitting or underfitting on our training data.
* We can also identify the value of k that minimizes the MSE on the test data set and use that value of k for our model going forward.
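
The sketch below illustrates these points on synthetic data. The `_syn` names, the noisy sine-curve data, and the small grid of k values are all assumptions made for this example and are unrelated to the cars data. With $k = 1$, each training point is its own nearest neighbor, so the training MSE is zero even though the test MSE is not. That gap is a sign of overfitting.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy sine curve with one feature
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1, size=(200, 1))
y_syn = np.sin(2 * np.pi * X_syn[:, 0]) + rng.normal(0, 0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=123
)

# A small grid of k values (each must be no larger than the training set size)
k_grid_syn = [1, 5, 15, 45, 135]
train_mse_syn, test_mse_syn = [], []

for k_syn in k_grid_syn:
    knn = KNeighborsRegressor(n_neighbors=k_syn).fit(X_tr, y_tr)
    train_mse_syn.append(np.mean((y_tr - knn.predict(X_tr)) ** 2))
    test_mse_syn.append(np.mean((y_te - knn.predict(X_te)) ** 2))

# Pick the k with the smallest test MSE
best_k_syn = k_grid_syn[int(np.argmin(test_mse_syn))]

print('Train MSE at k = 1:', train_mse_syn[0])
print('Test MSE by k:', dict(zip(k_grid_syn, test_mse_syn)))
print('Best k on the test set:', best_k_syn)
```

`np.argmin` returns the position of the smallest test MSE in the list, which we then use to look up the corresponding k in the grid.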

In [None]:
k_grid = [ (2 * k + 1) for k in range(100) ] # Look at odd k's

mse_train_by_k = []
mse_test_by_k = []

# Loop over the values of k
for k in k_grid:

  # Create a model instance
  model = KNeighborsRegressor(n_neighbors = k)

  # Fit the model to our TRAIN data
  model = model.fit(u_train, y_train)

  # Make predictions on the train and test data set for comparison
  y_hat_train = model.predict(u_train)
  y_hat_test = model.predict(u_test)

  # Calculate the mse for the train and test
  mse_train = mse(y_train, y_hat_train)
  mse_test = mse(y_test, y_hat_test)

  # Append the train and test mse to the lists
  mse_train_by_k.append(mse_train)
  mse_test_by_k.append(mse_test)


In [None]:
# Visualize the results

# Plotting the train MSE
sns.lineplot(
    x = k_grid,
    y = mse_train_by_k,
    label = 'Training MSE'
)

# Plotting the test MSE
sns.lineplot(
    x = k_grid,
    y = mse_test_by_k,
    label = 'Testing MSE'
)

plt.xlabel('k')
plt.ylabel('MSE')
plt.title('MSE vs. k')
plt.legend()

plt.show()

**Question:** What value of k minimizes the MSE for the test data set?

In [None]:
# Find the value for k that minimizes the MSE on the test data set


## Conclusion

* Through the past notebooks, we have completed a full data science loop:
  1. Wrangle the data
  2. Explore (EDA) and visualize the data to view relationships
  3. KNN Regression for numeric response data, and KNN classification to predict categorical response data
  4. Train/Test split for hyperparameter selection

* We will continue to iterate on this process throughout the course.