In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# KNN-Regression

The goal of this notebook is to help you understand how the KNN algorithm for regression works.

The KNN regression algorithm relies on the proximity of data points in the feature space to make predictions. It assumes that points with similar features in the predictor space will have similar response values. 

The main idea is to first pick the points in the *predictor* space where we want predictions.
+ For each such point in the predictor space:
    + Find the k nearest points in your data set.
    + Aggregate the values of the *response* for those k nearest points to generate a *predicted value*. This could be a *mean*, a *weighted mean*, a *median*, or something similar.
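
As a rough illustration of the aggregation step, the sketch below compares a mean, a median, and an inverse-distance weighted mean over the same three neighbors. The five houses, the query value `x_query`, and the choice of `1 / distance` as the weight are all assumptions made for this example, not part of the notebook's data.

```python
import numpy as np

# Hypothetical toy data: five houses (values invented for illustration)
sqft = np.array([1000.0, 1200.0, 1500.0, 1800.0, 2500.0])
price = np.array([100000.0, 130000.0, 150000.0, 185000.0, 260000.0])

x_query = 1450.0   # where we want a prediction
k = 3

# Distance from the query to every known point, then the k closest
dist = np.abs(sqft - x_query)
nearest = np.argsort(dist)[:k]

# Three ways to aggregate the responses of the k nearest points
mean_pred = price[nearest].mean()
median_pred = np.median(price[nearest])

# Assumed weighting scheme: closer neighbors count more.
# This breaks if a neighbor sits exactly on the query (distance 0).
weights = 1.0 / dist[nearest]
weighted_pred = np.average(price[nearest], weights=weights)

print(mean_pred, median_pred, weighted_pred)
```

Here the weighted mean is pulled toward the price of the closest house, while the plain mean treats all three neighbors equally.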


Let's try this out with simple 2D examples. In each one, a single *predictor*, `sqft`, is used to predict the *response* variable, `price`.

## Choose a Data Set

Pick out one data set to work with.

+ Linear Relationship
+ Quadratic Relationship
+ Sin(x) Relationship

In [None]:
pd.options.display.float_format = '{:.2f}'.format

In [None]:
# Linear

# Generate square footage values from 800 to 4000
sqft = np.linspace(800, 4000, num=100)
# Generate corresponding prices with a linear relationship
price = 80000 + 100 * (sqft - 800)

# Add some random noise to make it more realistic
noise_level = 15000
price += np.random.normal(0, noise_level, size=len(price))

# Use the name `data` so we do not shadow Python's built-in `dict`
data = {'price': price, 'sqft': sqft}

df_linear = pd.DataFrame(data)
df_linear.plot.scatter(x='sqft',y='price')
plt.ticklabel_format(style='plain', axis='y')



In [None]:
# Quadratic

# Generate square footage values from 800 to 4000
sqft = np.linspace(800, 4000, num=100)

# Generate a non-linear relationship for price
price = 80000 + 0.1 * (sqft - 800) ** 2 + np.random.normal(0, 15000, size=len(sqft))

# Ensure that prices are reasonable
price = np.maximum(price, 80000)  # Set a minimum price

# Create a DataFrame
data = {'price': price, 'sqft': sqft}
df_quadratic = pd.DataFrame(data)
df_quadratic.plot.scatter(x='sqft',y='price')
plt.ticklabel_format(style='plain', axis='y')

In [None]:
# Sin(x) example
# Generate square footage values from 800 to 4000
sqft = np.linspace(800, 4000, num=100)

# Generate a non-linear relationship for price using a sine function
periodic_factor = 2 * np.pi / 1000  # Adjust this to complete a full period every 1000 sqft
price_amplitude = 30000  # Adjust this to control the amplitude of the sine wave
price = 80000 + price_amplitude * np.sin(periodic_factor * sqft) + np.random.normal(0, 5000, size=len(sqft))

# Create a DataFrame
data = {'price': price, 'sqft': sqft}
df_sin = pd.DataFrame(data)

# Plot the scatter plot on an explicitly created axes so that figsize is applied
fig, ax = plt.subplots(figsize=(8, 6))
df_sin.plot.scatter(x='sqft', y='price', ax=ax)
plt.xlabel('Square Footage (sqft)')
plt.ylabel('Price')
plt.title('Sinusoidal Relationship between Square Footage and Price')

# Disable scientific notation on the y-axis
plt.ticklabel_format(style='plain', axis='y')


In [None]:
# Set the dataframe for the rest of the notebook.
# Copy it so that columns added later do not modify the original data set.
df = df_linear.copy()

### The Goal of KNN

The **goal** is to find a **curve of best fit** through these data points: for any `sqft`, the curve gives the predicted `price`.

### KNN - Regression Algorithm

Given a `sqft` where we want to make a prediction, we:

1. Calculate the distance between the new data point and all known data points in the dataset.
2. Select the *k* closest datapoints.
3. Average the target variable for the *k* closest data points.

There are many ways to measure the *distance* between two points. For this example, let's use the formula for *Euclidean distance* in one dimension, which simplifies to the absolute difference:

$dist(x_1, x_2) = \sqrt{(x_1 - x_2)^2} = |x_1 - x_2|$

#### Euclidean Distance Pros & Cons
Pros: 
- Easy to understand (straight-line distance between two points in Euclidean space)
- Works with 1 or more dimensions

Cons:
- Sensitive to scaling: features measured in different units can skew the result
- Suffers from the curse of dimensionality
- Doesn't account for covariance (how much variables change together).  Sometimes, other distance measures, such as Mahalanobis distance, may be more appropriate.
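
To make the scaling issue concrete, here is a small sketch with two features. The houses, the `bedrooms` feature, and the helper `euclidean` are hypothetical and exist only for this illustration; the notebook itself uses a single predictor.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

# Hypothetical houses as (sqft, bedrooms); values invented for illustration
query = np.array([1500.0, 2.0])
house_a = np.array([1520.0, 5.0])   # similar size, three more bedrooms
house_b = np.array([1600.0, 2.0])   # 100 sqft larger, same bedrooms

# Area measured in square feet: the sqft column dominates the distance
d_a_sqft = euclidean(query, house_a)
d_b_sqft = euclidean(query, house_b)

# Same houses, area measured in thousands of square feet
scale = np.array([1 / 1000, 1.0])
d_a_ksqft = euclidean(query * scale, house_a * scale)
d_b_ksqft = euclidean(query * scale, house_b * scale)

print(f"sqft units:  A={d_a_sqft:.2f}, B={d_b_sqft:.2f}")
print(f"ksqft units: A={d_a_ksqft:.2f}, B={d_b_ksqft:.2f}")
```

With area in square feet, house A is the nearest neighbor; with area in thousands of square feet, house B is. Nothing about the houses changed, only the units, which is why features are commonly put on a comparable scale before computing distances.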

In [None]:
# Create helper function

def euclidean_distance_x(x_target, x_actual):
    return np.sqrt((x_target-x_actual)**2)

# Test it
euclidean_distance_x(x_target = np.array([1500]), x_actual = np.array([1700,1000]))

Now we can create a function for our algorithm:

1. Pass in our data `df`, a value `k` indicating how many neighbors we want to consider, and the `sqft` value where we want to make a prediction.
1. Calculate the distance from that value to every `sqft` value in the data.
1. Get the closest `k` of these points.
1. Aggregate the response value `price` for the closest `k` points; this is our prediction.
    


In [None]:
def knn_2d_specific_x(df, k, sqft, visualize=False):
    
    """
    To demonstrate how the KNN algorithm produces a prediction in a 2D case
    
    Input:
    
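        df - DataFrame with sqft, price columns
        k  - number of neighbors to consider
        sqft - value of sqft where we want to make a prediction
        visualize - if True, plot the data, the neighbors, and the prediction
    
    Output:
        Prediction (mean price of the k nearest neighbors)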
    """
        
    df_temp = df.copy()
    df_temp['distance'] = euclidean_distance_x(np.array(sqft), df_temp.sqft)
    # Select the k nearest rows once, then average their prices
    df_knn = df_temp.loc[df_temp.distance.nsmallest(k).index]
    y_pred = df_knn.price.mean()
             
    # Visualize
    if visualize:
        prediction = pd.DataFrame({'sqft':sqft, 'y':y_pred}, index=['0'])
        figure, axes = plt.subplots(1,1,figsize=(10,10))
        axes.axvline(sqft, color='lightgrey', linestyle='dashed')
        axes.scatter(df.sqft,df.price, color='lightblue')
        axes.scatter(df_knn.sqft,df_knn.price, color='red', label = f'Nearest {k} Neighbors to X={sqft}')
        axes.scatter(
            prediction.sqft,
            prediction.y, 
            color='black', 
            marker='x', 
            s = 75,
            label = f'Prediction = Mean of {k}-Nearest Neighbors'
        )
        axes.set_title(
            f'Making a prediction at x={sqft}\nUsing the Nearest k={k} Neighbors',
            fontsize=20)
        axes.legend()
        
    return y_pred
    

In [None]:
# Test the function
knn_2d_specific_x(df, k=10, sqft=2500, visualize=True)

---

We can try out our function on our randomly generated data and different values of `k`.

In [None]:
# for x in range(1000,2000,100):
#     knn_2d_specific_x(df, k=5, sqft=x, visualize=True)

## Apply Your Prediction Function to Entire DataFrame

We can run our algorithm over all points in our data and then assess the *model fit*.

In the code cell below, we calculate a new column `predictions_k5` by using the `apply()` method to *apply* the `knn_2d_specific_x()` function over all `x` in the data. 
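
For comparison, the same "predict at every x" idea can be sketched without `apply()` by computing all pairwise distances at once with NumPy broadcasting. The toy arrays below are invented stand-ins for the `sqft` and `price` columns, and this is only one possible approach, not the one the notebook uses.

```python
import numpy as np

# Hypothetical toy data standing in for the sqft and price columns
sqft = np.array([800.0, 1000.0, 1300.0, 2000.0, 3000.0])
price = np.array([80000.0, 100000.0, 130000.0, 200000.0, 300000.0])
k = 2

# All pairwise distances at once: entry [i, j] is |sqft[i] - sqft[j]|
dist = np.abs(sqft[:, None] - sqft[None, :])

# For each row, the column indices of its k nearest points
nearest = np.argsort(dist, axis=1)[:, :k]

# Average the k neighbor prices for each row
preds = price[nearest].mean(axis=1)
print(preds)
```

Note that each point is at distance 0 from itself, so it is always one of its own `k` neighbors when we predict at points that are already in the data.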

In [None]:
# code to apply the knn_2d_specific_x() function to each x in df
df['predictions_k5'] = df.apply(
    lambda row: knn_2d_specific_x(
        df, 
        k=5, 
        sqft=row['sqft'], 
        visualize=False),
    axis=1
)
df.head()                         

## Visualize Your Predictions

Now that you can make a prediction for every x-value in the data, create a plot!

+ Plot the `predictions` as a line
+ Plot the original data as a scatterplot

In [None]:
fig,ax = plt.subplots(1,1,figsize=(10,10))

ax.scatter(df.sqft, df.price)
ax.plot(df.sqft, df.predictions_k5)
ax.set_title('KNN Model Fit for k = 5', fontsize=15)

## Find the Optimal Value of k

How do we determine which value of `k` is best?

We can calculate various functions of the data that might help us to compare predictions across models.

Here we will use **root mean squared prediction error (RMSPE)** = $\sqrt{\frac{1}{n}\sum_{i=1}^n(\hat{y}_i - y_i)^2}$

+ Calculate Residual = Actual Value - Predicted Value
+ Square the Residuals
+ Average the Squared Residuals
+ Take the square root
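
The steps can be traced on a tiny example. The actual and predicted prices below are invented for illustration. The sketch averages the squared residuals before taking the square root, which matches the `.mean()` call used in the notebook's own `calculate_rmspe()` function.

```python
import numpy as np

# Hypothetical actual and predicted prices (invented for illustration)
y = np.array([100000.0, 150000.0, 200000.0, 250000.0])
y_hat = np.array([110000.0, 140000.0, 200000.0, 270000.0])

residuals = y - y_hat              # actual minus predicted
squared = residuals ** 2           # square each residual
mean_squared = squared.mean()      # average over the n points
rmspe = np.sqrt(mean_squared)      # back to the units of price
print(rmspe)
```

Because of the final square root, the result is in the same units as `price`, so it can be read as a typical size of the prediction error.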


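Now that you can calculate the `RMSPE` for a single value of `k`, calculate this quantity for a variety of `k` and then create a plot with `k` on the x-axis and `RMSPE` on the y-axis. This plot will help you to determine a value of `k` that is *optimal*. Keep in mind that each prediction here is made at a point that is itself in the data, so every point is its own nearest neighbor and the RMSPE at `k = 1` is zero; look for where the curve bends rather than only for its lowest point.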

In [None]:
# Create new DataFrame containing two columns, k and rmspe
# Try k = 2, 5, 10, 15, 20

def get_preds_for_multiple_k(ks):

    for k in ks:
        
        # Create new column of predictions
        df[f'predictions_k{k}'] = df.apply(
            lambda row: knn_2d_specific_x(
                df, 
                k=k, 
                sqft=row['sqft'], 
                visualize=False),
            axis=1
        )
        
def get_residual(k):
    
    df[f'residuals_k{k}']= df['price'] - df[f'predictions_k{k}']
    return df
    
def calculate_rmspe(ks):
    rmspe = []
    for k in ks:
        model_data = get_residual(k)
        rmspe.append(np.sqrt((model_data[f'residuals_k{k}']**2).mean()))
        
    return pd.DataFrame(zip(ks, rmspe), columns=['k','rmspe'])


In [None]:
# Use functions

ks = list(range(1, 20))
get_preds_for_multiple_k(ks)
rmspe = calculate_rmspe(ks)

In [None]:
# Create plot

fig, ax = plt.subplots(1,1, figsize=(10,10))
ax.plot(rmspe.k, rmspe.rmspe)
ax.scatter(rmspe.k, rmspe.rmspe,c='black')
ax.set_xlabel('k = # nearest neighbors')
ax.set_ylabel('RMSPE')
ax.set_title('Elbow Plot - Which Value of k Minimizes the RMSPE?')

---

One cool thing about KNN is that it is very flexible. Try re-running the notebook with a different data set at the top!

## What is to come for Data Science students?

Of course, all of these algorithms have already been implemented. The `sklearn` library contains many common machine learning models that all have a similar API. However, you are NOT allowed to use it for your final project. I am including it here as a teaser for future courses.

There is usually:

1. A call to a constructor to make a machine learning model object. `neigh = KNeighborsRegressor(n_neighbors=k)`
1. A fit method call where you pass your data `X` and target `y`. `neigh.fit(X, y)`
1. A predict method call where you pass the values where you want to predict. `neigh.predict(X)`

In [None]:
# Import necessary libraries

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

In [None]:
def fit_knn_sklearn(df, k, visualize=True):

    neigh = KNeighborsRegressor(n_neighbors=k)
    neigh.fit(X=df[['sqft']], y=df.price)
    df[f'predictions_k{k}'] = neigh.predict(df[['sqft']]) 
    
    RMSE = round(np.sqrt(mean_squared_error(df.price, df[f'predictions_k{k}'])), 4)

    if visualize:
        
        figure, axes = plt.subplots(1,1,figsize=(10,10))

        axes.scatter(
            df.sqft,
            df.price, 
            color='lightblue',
            label='Data')

        axes.plot(
            df['sqft'], 
            df[f'predictions_k{k}'],
            color='black',
            label='Predicted Values')

        axes.set_title(
            f'Prediction Line\nUsing the Nearest k={k}',
            fontsize=20)
        axes.legend()
    
    return RMSE

In [None]:
# Try different k-values and look for optimal 

ks = list(range(2, 30))
rmse = []
for k in ks:
    rmse.append(fit_knn_sklearn(df, k, visualize=False))
   
fig, ax = plt.subplots(1,1, figsize=(10,10))
ax.scatter(ks, rmse)
ax.set_xlabel('k = # nearest neighbors')
ax.set_ylabel('RMSE')

In [None]:
fit_knn_sklearn(df, k=2)

In [None]:
# fit_knn_sklearn() returns the RMSE, not the predictions
rmse_k5 = fit_knn_sklearn(df, k=5)

In [None]:
rmse_k10 = fit_knn_sklearn(df, k=10)

In [None]:
rmse_k20 = fit_knn_sklearn(df, k=20)