<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>
<h2>Script 07 | K-Nearest Neighbors and Distance Standardization</h2>
<br>
Written by Chase Kusterer<br>
<a href="https://github.com/chase-kusterer">GitHub</a> | <a href="https://www.linkedin.com/in/kusterer/">LinkedIn</a>
<br><br><br>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<h3>Part I: Preparing for Model Building</h3><br>
In this script, we will move into distance-based modeling with k-Nearest Neighbors (KNN). Like OLS regression, KNN is a widely used model type because:

* Predictions are based on an intuitive concept.
* It works in both regression and classification settings.
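
To make the intuition concrete, here is a minimal sketch of how a KNN regression prediction could be computed by hand: find the k closest training observations and average their targets. The toy arrays and the helper name `knn_predict` are illustrative assumptions; they are not part of the housing dataset or of scikit-learn.

```python
import numpy as np

# toy training data (made-up values for illustration only)
x_toy = np.array([[1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0])


def knn_predict(x_new, x_train, y_train, k = 2):
    """Hypothetical helper: averages the targets of the k nearest rows."""

    # Euclidean distance from the new observation to every training row
    distances = np.sqrt(((x_train - x_new) ** 2).sum(axis = 1))

    # positions of the k smallest distances
    nearest = np.argsort(distances)[:k]

    # the prediction is the mean target of those neighbors
    return y_train[nearest].mean()


# predicting for a new observation at x = 2.4
toy_pred = knn_predict(x_new   = np.array([2.4]),
                       x_train = x_toy,
                       y_train = y_toy,
                       k       = 2)

print(toy_pred)
```

With `k = 2`, the two closest rows are the ones at x = 2 and x = 3, so the prediction is the average of their targets.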

<br><br><strong>a)</strong> Imports and Loading the Dataset

1. Import the following packages:
    * pandas (as pd)
    * matplotlib.pyplot (as plt)
    * seaborn (as sns)
    * numpy (as np)
    * train_test_split (from sklearn.model_selection)<br><br>

2. Load the 'housing_feature_rich.xlsx' dataset into Python as <em>housing</em>.

In [None]:
# importing libraries
_____


# new libraries
from sklearn.neighbors import KNeighborsRegressor # KNN for Regression
from sklearn.preprocessing import StandardScaler  # standard scaler


# setting print options for pandas and numpy
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
np.set_printoptions(suppress=True)


# specifying file names
file     = _____


# reading into Python
housing     = pd.read_excel(io         = file,
                            header     = 0   ,
                            sheet_name = 0   )


# this code will not produce an output

In [None]:
# importing libraries
import pandas as pd                                  # data science essentials
import matplotlib.pyplot as plt                      # data visualization
import seaborn as sns                                # enhanced data viz
import numpy as np                                   # mathematical essentials
from sklearn.model_selection import train_test_split # train/test split


# new libraries
from sklearn.neighbors import KNeighborsRegressor # KNN for Regression
from sklearn.preprocessing import StandardScaler  # standard scaler


# setting print options for pandas and numpy
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
np.set_printoptions(suppress=True)


# specifying file name
file     = './datasets/housing_feature_rich.xlsx'

# reading into Python
housing     = pd.read_excel(io         = file,
                            header     = 0   ,
                            sheet_name = 0   )


# this code will not produce an output

<br>

In [None]:
# checking the dataset
housing.head(n = 5)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>b)</strong> Fill in the blanks to separate features for modeling.

In [None]:
# preparing X-features
housing_data   = housing.drop(['Sale_Price',
                               'log_Sale_Price',
                               'property_id'],
                                axis = 1)


# preparing y-feature
housing_target = housing.loc[ : , _____]


# this code will not produce an output

In [None]:
# preparing X-features
housing_data   = housing.drop(['Sale_Price',
                               'log_Sale_Price',
                               'property_id'],
                                axis = 1)


# preparing y-feature
housing_target = housing.loc[ : , 'Sale_Price']


# this code will not produce an output

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

Run the following code to create a <strong>standardized</strong> version of the dataset.
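
As a rough sketch of what happens behind the scenes, `StandardScaler` subtracts each column's mean and divides by its standard deviation (the population version, `ddof = 0`). The small array below is made up for illustration and is not taken from the housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy data with two features on very different scales (illustrative values)
toy = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 5000.0]])

# standardizing by hand: (x - mean) / standard deviation
toy_manual = (toy - toy.mean(axis = 0)) / toy.std(axis = 0, ddof = 0)

# standardizing with scikit-learn
toy_sklearn = StandardScaler().fit_transform(toy)

# checking that both approaches agree
print(np.allclose(toy_manual, toy_sklearn))

# checking the mean and variance of each standardized column
print(toy_manual.mean(axis = 0).round(2))
print(toy_manual.var(axis = 0).round(2))
```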

In [None]:
# INSTANTIATING a StandardScaler() object
scaler = StandardScaler()


# FITTING the scaler with the data
scaler.fit(housing_data)


# TRANSFORMING our data after fit
x_scaled = scaler.transform(housing_data)


# converting scaled data into a DataFrame
x_scaled_df = pd.DataFrame(x_scaled)


# checking the results
x_scaled_df.describe(include = 'number').round(decimals = 2)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

Notice that the headers (feature names) have disappeared. Let's add them back and then analyze how variance has changed after scaling.
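
As an aside, the labels disappear because the scaler returns a NumPy array by default rather than a DataFrame. One possible alternative, sketched here on a small made-up DataFrame (`toy_df` and its column names are assumptions), is to supply the labels when the DataFrame is created.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy DataFrame with hypothetical column names
toy_df = pd.DataFrame({'feature_a': [1500, 2000, 2500],
                       'feature_b': [1, 2, 3]})

# scaling returns a NumPy array, which has no column labels
toy_scaled = StandardScaler().fit_transform(toy_df)

# passing the original labels and index while building the DataFrame
toy_scaled_df = pd.DataFrame(toy_scaled,
                             columns = toy_df.columns,
                             index   = toy_df.index)

print(toy_scaled_df.columns.tolist())
```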

In [None]:
# adding labels to the scaled DataFrame
x_scaled_df.columns = housing_data.columns

# checking variance pre- and post-scaling
print(f"""
Dataset BEFORE Scaling
----------------------
{np.var(housing_data.iloc[ : , 0:5 ],
        axis = 0)}


Dataset AFTER Scaling
----------------------
{np.var(x_scaled_df.iloc[ : , 0:5 ],
        axis = 0)}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h3>Correlation Analysis: Pre- and Post-Standardization</h3><br>
Let's observe what happens to correlation after standardizing the dataset. In order to best see the results, we will limit our analysis to a small set of features.

In [None]:
##############################################################################
# Unscaled Dataset
##############################################################################

# subsetting the original dataset
housing_subset = housing_data.loc[ : , ['Garage_Cars',
                                        'Overall_Qual',
                                        'Total_Bsmt_SF',
                                        'NridgHt',
                                        'Kitchen_AbvGr',
                                        'has_Second_Flr']]


# UNSCALED correlation matrix
df_corr = housing_subset.corr().round(2)


# heatmap of UNSCALED correlations
sns.heatmap(df_corr,
            cmap = 'coolwarm',
            square = True,
            annot = True,
            cbar = False,
            linecolor  = 'black', 
            linewidths = 0.5)


# titling the plot
plt.title("BEFORE Standardization")
plt.show()

##############################################################################
# Scaled Dataset
##############################################################################

# SCALED correlation matrix
df_scaled_corr = x_scaled_df.loc[ : , ['Garage_Cars',
                                       'Overall_Qual',
                                       'Total_Bsmt_SF',
                                       'NridgHt',
                                       'Kitchen_AbvGr',
                                       'has_Second_Flr']].corr().round(2)


# heatmap of SCALED correlations
sns.heatmap(df_scaled_corr,
            cmap = 'coolwarm',
            square = True,
            annot = True,
            cbar = False,
            linecolor  = 'black',
            linewidths = 0.5)


# titling the plot
plt.title("AFTER Standardization")
plt.show()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

Notice how the correlations remain unchanged. Standardization only shifts and rescales each feature, so not a single linear relationship has changed. However, it has profound effects on distance-based algorithms, as we will discover below.
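
A small sketch of both claims, using made-up numbers: the correlation matrix is identical before and after standardization, yet the nearest neighbor of a point can change once the features share a common scale. The helper `nearest_to_first` is a hypothetical name used only for this illustration.

```python
import numpy as np

# toy data: column 0 is small-scale, column 1 is large-scale (illustrative)
toy = np.array([[1.0, 1000.0],
                [5.0, 1010.0],
                [1.2, 1200.0],
                [3.0, 1500.0]])

# standardizing by hand
toy_scaled = (toy - toy.mean(axis = 0)) / toy.std(axis = 0)

# correlations before and after scaling
corr_before = np.corrcoef(toy, rowvar = False)
corr_after  = np.corrcoef(toy_scaled, rowvar = False)
print(np.allclose(corr_before, corr_after))


def nearest_to_first(data):
    """Hypothetical helper: index of the row closest to row 0."""

    # Euclidean distance from row 0 to every other row
    distances = np.sqrt(((data[1:] - data[0]) ** 2).sum(axis = 1))

    # adding 1 because row 0 was excluded from the comparison
    return int(np.argmin(distances)) + 1


# nearest neighbor of row 0 before and after scaling
print(nearest_to_first(toy))
print(nearest_to_first(toy_scaled))
```

Before scaling, the large-scale column dominates the distance calculation. After scaling, both columns contribute on equal footing, and a different row becomes the closest one.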
<br><br>
<h2>Part III: k-Nearest Neighbors with Non-Standardized Data</h2><br>
<strong>a)</strong> Develop training and testing sets using the non-standardized dataset.

In [None]:
# this is the exact code we were using before
_____, _____, _____, _____ = _____(
            _____,
            _____,
            test_size    = 0.25,
            random_state = 702 )


# this code will not produce an output

In [None]:
# this is the exact code we were using before
x_train, x_test, y_train, y_test = train_test_split(
            housing_data,
            housing_target,
            test_size    = 0.25,
            random_state = 702 )


# this code will not produce an output

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h3>KNN with Non-Standardized Data</h3><br>
<strong>b)</strong> Fill in the blanks below to develop a k-Nearest Neighbors model.

In [None]:
# INSTANTIATING a KNN model object
knn_reg = KNeighborsRegressor(algorithm = 'auto',
                              n_neighbors = 1)


# FITTING to the training data
knn_fit = knn_reg._____(_____, _____)


# PREDICTING on new data
knn_reg_pred = knn_fit._____(_____)


# SCORING the results
knn_reg_score_train = round(knn_reg._____(_____, _____), ndigits = 4)
knn_reg_score_test  = round(knn_reg._____(_____, _____), ndigits = 4)
knn_reg_test_gap = round(abs(knn_reg_score_train - knn_reg_score_test), ndigits = 4)


# checking results
print(f"""
K-Nearest Neighbors
-------------------
Training Score: {knn_reg_score_train}
Testing Score : {knn_reg_score_test}
Train-Test Gap: {knn_reg_test_gap}
""")

In [None]:
# INSTANTIATING a KNN model object
knn_reg = KNeighborsRegressor(algorithm   = 'auto',
                              n_neighbors = 1     )


# FITTING to the training data
knn_fit = knn_reg.fit(x_train, y_train)


# PREDICTING on new data
knn_reg_pred = knn_fit.predict(x_test)


# SCORING the results
knn_reg_score_train = round(knn_reg.score(x_train, y_train), ndigits = 4)
knn_reg_score_test  = round(knn_reg.score(x_test, y_test), ndigits = 4)
knn_reg_test_gap = round(abs(knn_reg_score_train - knn_reg_score_test), ndigits = 4)


# checking results
print(f"""
K-Nearest Neighbors
-------------------
Training Score: {knn_reg_score_train}
Testing Score : {knn_reg_score_test}
Train-Test Gap: {knn_reg_test_gap}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>c)</strong> How Many Neighbors?<br>
We can spend time testing out several different neighbor values, but it would be much more efficient to develop a function to automate this, as in the code below.

In [None]:
## optimal neighbors ##
def opt_neighbors(x_train   = x_train,
                  y_train   = y_train,
                  x_test    = x_test,
                  y_test    = y_test,
                  max_neighbors = 50):
    
    """
    This function visualizes R-Square values for the K-Nearest Neighbors
    algorithm.
    
    
    Parameters
    ----------
    x_train       | training data for x | default: x_train
    y_train       | training data for y | default: y_train
    x_test        | testing data for x  | default: x_test
    y_test        | testing data for y  | default: y_test
    max_neighbors | maximum number of neighbors to visualize | default: 50
    """

    # lists to store metrics
    train_rsq = []
    test_rsq  = []
    tt_gap    = []
    
    
    # creating range object for neighbors (1 through max_neighbors)
    neighbors = range(1, max_neighbors + 1)
    
    
    # looping over each candidate number of neighbors
    for n_neighbors in neighbors:

        # instantiating KNN
        knn = KNeighborsRegressor(n_neighbors = n_neighbors, p = 1)

        # fitting to the data
        knn.fit(x_train, y_train)

        # scoring on both sets (R-Square)
        train_score = knn.score(x_train, y_train)
        test_score  = knn.score(x_test, y_test)

        # storing the training and testing R-Square
        train_rsq.append(train_score)
        test_rsq.append(test_score)

        # storing the train-test gap
        tt_gap.append(abs(train_score - test_score))


    # plotting the visualization
    fig, ax = plt.subplots(figsize=(12,8))
    plt.plot(neighbors, train_rsq, label = "R-Square (Training Set)")
    plt.plot(neighbors, test_rsq,  label = "R-Square (Testing Set)")
    plt.ylabel(ylabel = "Coefficient of Determination")
    plt.xlabel(xlabel = "Number of Neighbors")
    plt.legend()
    plt.show()


    # finding the optimal number of neighbors
    opt_neighbors = tt_gap.index(min(tt_gap)) + 1
    print(f"""The optimal number of neighbors is {opt_neighbors}""")
    
    return train_rsq, test_rsq, tt_gap

<br>
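
The function returns three lists, so the results can also be inspected numerically. The sketch below uses short made-up lists in place of real output (the values are placeholders, not results from the housing data) to show that the smallest train-test gap and the highest testing score do not necessarily occur at the same number of neighbors.

```python
import numpy as np

# placeholder scores for k = 1 through 5 (illustrative values only)
train_rsq = [1.00, 0.90, 0.85, 0.80, 0.70]
test_rsq  = [0.60, 0.72, 0.75, 0.74, 0.68]

# train-test gap for each value of k
tt_gap = [abs(tr - te) for tr, te in zip(train_rsq, test_rsq)]

# list positions start at 0, so 1 is added to recover k
k_smallest_gap = int(np.argmin(tt_gap)) + 1
k_best_test    = int(np.argmax(test_rsq)) + 1

print(f"Smallest gap at k = {k_smallest_gap}")
print(f"Highest test score at k = {k_best_test}")
```

Which criterion to prefer is a judgment call: a small gap suggests stable performance, but it is worth checking that the testing score at that value is still acceptable.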

In [None]:
# visualizing KNN results
x = opt_neighbors(x_train       = x_train,
                  y_train       = y_train,
                  x_test        = x_test,
                  y_test        = y_test,
                  max_neighbors = 50)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>d)</strong> Fill in the blanks below to develop a KNN model using the optimal number of neighbors.

In [None]:
# INSTANTIATING a model with the optimal number of neighbors
knn_opt = _____(algorithm   = 'auto',
                n_neighbors = _____,
                p           = 1)  # Manhattan distance, as in opt_neighbors()



# FITTING the model based on the training data
knn_opt_fit = knn_opt._____(_____, _____)



# PREDICTING on new data
knn_opt_pred = _____._____(x_test)


# SCORING the results
knn_opt_score_train = round(knn_opt.score(_____, _____), ndigits = 4)
knn_opt_score_test  = round(knn_opt.score(_____, _____), ndigits = 4)
knn_opt_test_gap    = round(abs(knn_opt_score_train - knn_opt_score_test), ndigits = 4)


# checking results
print(f"""
K-Nearest Neighbors
-------------------
Training Score: {knn_opt_score_train}
Testing Score : {knn_opt_score_test}
Train-Test Gap: {knn_opt_test_gap}
""")

In [None]:
# INSTANTIATING a model with the optimal number of neighbors
knn_opt = KNeighborsRegressor(algorithm   = 'auto',
                              n_neighbors = 47,
                              p           = 1)  # Manhattan distance, as in opt_neighbors()



# FITTING the model based on the training data
knn_opt_fit = knn_opt.fit(x_train, y_train)



# PREDICTING on new data
knn_opt_pred = knn_opt_fit.predict(x_test)


# SCORING the results
knn_opt_score_train = round(knn_opt.score(x_train, y_train), ndigits = 4)
knn_opt_score_test  = round(knn_opt.score(x_test, y_test), ndigits = 4)
knn_opt_test_gap    = round(abs(knn_opt_score_train - knn_opt_score_test), ndigits = 4)


# checking results
print(f"""
K-Nearest Neighbors
-------------------
Training Score: {knn_opt_score_train}
Testing Score : {knn_opt_score_test}
Train-Test Gap: {knn_opt_test_gap}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part IV: k-Nearest Neighbors with Standardized Data</h2><br>
<strong>a)</strong> Develop training and testing sets using the standardized dataset.
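
This script standardizes the full dataset before splitting. A common alternative, shown here only as a sketch on made-up data, is to split first and fit the scaler on the training rows alone, so that the testing set plays no part in computing the means and standard deviations. All `toy_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# made-up feature matrix and target (illustrative only)
rng   = np.random.default_rng(seed = 702)
toy_x = rng.normal(loc = 50, scale = 10, size = (100, 3))
toy_y = rng.normal(size = 100)

# splitting BEFORE scaling
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
            toy_x,
            toy_y,
            test_size    = 0.25,
            random_state = 702 )

# fitting the scaler on the training rows only
toy_scaler = StandardScaler()
toy_scaler.fit(toy_x_train)

# applying the training-set means and standard deviations to both sets
toy_x_train_scaled = toy_scaler.transform(toy_x_train)
toy_x_test_scaled  = toy_scaler.transform(toy_x_test)

# comparing column means of the scaled training and testing sets
print(toy_x_train_scaled.mean(axis = 0).round(2))
print(toy_x_test_scaled.mean(axis = 0).round(2))
```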

In [None]:
# this is the exact code we were using before
x_train_STAND, x_test_STAND, y_train_STAND, y_test_STAND = _____(
            _____,
            _____,
            test_size = 0.25,
            random_state = 702)


# this code will not produce an output

In [None]:
# this is the exact code we were using before
x_train_STAND, x_test_STAND, y_train_STAND, y_test_STAND = train_test_split(
            x_scaled_df,
            housing_target,
            test_size = 0.25,
            random_state = 702)


# this code will not produce an output

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>b)</strong> Complete the code below and determine the optimal number of neighbors.

In [None]:
# visualizing KNN results on standardized data
opt_neighbors(x_train   = _____,
              y_train   = _____,
              x_test    = _____,
              y_test    = _____,
              max_neighbors = 50)

In [None]:
# visualizing KNN results on standardized data
opt_neighbors(x_train   = x_train_STAND,
              y_train   = y_train_STAND,
              x_test    = x_test_STAND,
              y_test    = y_test_STAND,
              max_neighbors = 50)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>c)</strong> Run the code below to develop a KNN model based on the standardized data.<br>
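
The `p = 1` argument in the model below selects the distance measure. With scikit-learn's default Minkowski metric, `p = 1` corresponds to Manhattan distance and `p = 2` (the default) to Euclidean distance. A minimal sketch of the difference, using two made-up points:

```python
import numpy as np

# two made-up points in a two-feature space
point_a = np.array([0.0, 0.0])
point_b = np.array([3.0, 4.0])

# Manhattan distance (p = 1): sum of absolute differences
manhattan = np.abs(point_a - point_b).sum()

# Euclidean distance (p = 2): square root of summed squared differences
euclidean = np.sqrt(((point_a - point_b) ** 2).sum())

print(manhattan)
print(euclidean)
```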

In [None]:
# INSTANTIATING a model with the optimal number of neighbors
knn_stand = KNeighborsRegressor(algorithm   = 'auto',
                                n_neighbors = 13,
                                p = 1)



# FITTING the model based on the training data
knn_stand_fit = knn_stand.fit(x_train_STAND, y_train_STAND)



# PREDICTING on new data
knn_stand_pred = knn_stand_fit.predict(x_test_STAND)



# SCORING the results
knn_stand_score_train = round(knn_stand.score(x_train_STAND, y_train_STAND), ndigits = 4)
knn_stand_score_test  = round(knn_stand.score(x_test_STAND, y_test_STAND), ndigits = 4)
knn_stand_test_gap = round(abs(knn_stand_score_train - knn_stand_score_test), ndigits = 4)


# checking results
print(f"""
K-Nearest Neighbors
-------------------
Training Score: {knn_stand_score_train}
Testing Score : {knn_stand_score_test}
Train-Test Gap: {knn_stand_test_gap}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>


<strong>d)</strong> Fill in the blanks below to create a summary output for each KNN model's performance.<br>

In [None]:
# comparing results

print(f"""
KNN Model             Neighbors     Train Score      Test Score
----------------      ---------     ----------       ----------
Non-Standardized      _____          {_____}           {_____}
Non-Standardized      _____          {_____}           {_____}
Standardized          _____          {_____}           {_____}
""")

In [None]:
## Sample Solution ##

# comparing results

print(f"""
KNN Model             Neighbors     Train Score      Test Score
----------------      ---------     ----------       ----------
Non-Standardized      1             {knn_reg_score_train}              {knn_reg_score_test}
Non-Standardized      47            {knn_opt_score_train}           {knn_opt_score_test}
Standardized          13            {knn_stand_score_train}           {knn_stand_score_test}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

A great <a href="https://www.youtube.com/watch?v=HVXime0nQeI">video on KNN can be found here</a>.
Also, more linear model types can be found in <a href="https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model">scikit-learn's linear model documentation</a>.
<br>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

~~~


____ ____ ___ ___ _ _  _ ____          
[__  |___  |   |  | |\ | | __          
___] |___  |   |  | | \| |__]          
                                       
___ _  _ ____                          
 |  |__| |___                          
 |  |  | |___                          
                                       
____ ___ ____ _  _ ___  ____ ____ ___  
[__   |  |__| |\ | |  \ |__| |__/ |  \ 
___]  |  |  | | \| |__/ |  | |  \ |__/ 
                                       
_  _ _ ____ _  _   /                   
|__| | | __ |__|  /                    
|  | | |__] |  | .                     
                                       



~~~

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>