# Regression & K-Nearest Neighbors: Programming Practice

COSC 410: Applied Machine Learning\
Colgate University\
*Prof. Apthorpe*

## Overview

This notebook will give you practice with the following topics:
  * Training linear regression models using gradient descent
  * Plotting learning curves to measure overfitting
  * Training KNN models and using KNN to provide interpretable ML
  
We will be using a dataset published by the University of Mons in Belgium that records the energy use of household appliances in a research subject's home along with the local weather conditions. We will attempt to use this data to train a model that can predict energy use given weather conditions alone. This type of prediction could be useful for energy companies making automatic decisions about managing the power supply, or for climate researchers interested in modeling future carbon use based on possible weather patterns.

In [2]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 6]
import seaborn as sns
import sklearn
import sklearn.preprocessing
import sklearn.linear_model
import sklearn.model_selection  # needed for cross_val_score and cross_validate below
import sklearn.neighbors

## Data Preparation

In the cell below, import the `energydata.csv` dataset. Go to the UCI Machine Learning Repository website where this dataset is hosted to read about each of the features: https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction.

In [3]:
# Use the pandas read_csv function to import energydata.csv
energy = pd.read_csv("energydata.csv")


In [4]:
# print the shape (# rows, # columns) of the dataset
energy.shape

(19735, 25)

In [5]:
# print the first 5 rows of the data using the .head method
energy.head(5)

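| Unnamed: 0 | Appliances | T1 | RH_1 | T2 | RH_2 | T3 | RH_3 | T4 | RH_4 | T5 | ... | T8 | RH_8 | T9 | RH_9 | T_out | Press_mm_hg | RH_out | Windspeed | Visibility | Tdewpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 60 | 19.89 | 47.596667 | 19.2 | 44.79 | 19.79 | 44.73 | 19.0 | 45.566667 | 17.166667 | ... | 18.2 | 48.9 | 17.033333 | 45.53 | 6.6 | 733.5 | 92.0 | 7.0 | 63.0 | 5.3 |
| 1 | 60 | 19.89 | 46.693333 | 19.2 | 44.7225 | 19.79 | 44.79 | 19.0 | 45.9925 | 17.166667 | ... | 18.2 | 48.863333 | 17.066667 | 45.56 | 6.483333 | 733.6 | 92.0 | 6.666667 | 59.166667 | 5.2 |
| 2 | 50 | 19.89 | 46.3 | 19.2 | 44.626667 | 19.79 | 44.933333 | 18.926667 | 45.89 | 17.166667 | ... | 18.2 | 48.73 | 17.0 | 45.5 | 6.366667 | 733.7 | 92.0 | 6.333333 | 55.333333 | 5.1 |
| 3 | 50 | 19.89 | 46.066667 | 19.2 | 44.59 | 19.79 | 45.0 | 18.89 | 45.723333 | 17.166667 | ... | 18.1 | 48.59 | 17.0 | 45.4 | 6.25 | 733.8 | 92.0 | 6.0 | 51.5 | 5.0 |
| 4 | 60 | 19.89 | 46.333333 | 19.2 | 44.53 | 19.79 | 45.0 | 18.89 | 45.53 | 17.2 | ... | 18.1 | 48.59 | 17.0 | 45.4 | 6.133333 | 733.9 | 92.0 | 5.666667 | 47.666667 | 4.9 |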


All of the features are already numeric, so we do not need to do any additional feature encoding. 
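One way to check a claim like this, rather than eyeballing the first few rows, is to inspect the column dtypes. The sketch below uses a small made-up DataFrame (`toy` and its `room` column are hypothetical and not part of the energy data); on the real data, the same call would be made on `energy`.

```python
import pandas as pd

# Hypothetical stand-in for the energy data, with one deliberately non-numeric column
toy = pd.DataFrame({
    "Appliances": [60, 60, 50],
    "T1": [19.89, 19.89, 19.89],
    "room": ["kitchen", "living", "laundry"],
})

# Columns whose dtype is not numeric would need encoding before training
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

An empty list from the same call on `energy` would confirm that no feature encoding is needed.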

Now, separate the data into labels `y` with column `Appliances` and features `X` with the rest of the columns. 

In [6]:
# Index just the "Appliances" column into a new variable y
y = energy["Appliances"]

# Remove the "Appliances" column from the data using the .drop method and set the result as new variable X
X = energy.drop("Appliances", axis=1)


Finally, standardize the examples in `X` using a `StandardScaler`.

In [7]:
# Create a StandardScaler object and use it to standardize X
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
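
To make concrete what standardization does, here is a minimal sketch on made-up data (`X_demo` and its two columns are assumptions for illustration, not the energy features). It compares the output of `StandardScaler` with a by-hand z-score: subtract each column's mean, then divide by each column's standard deviation.

```python
import numpy as np
import sklearn.preprocessing

rng = np.random.default_rng(0)
# Made-up features on very different scales (for example, degrees vs. mm Hg)
X_demo = np.column_stack([
    rng.normal(20, 2, size=200),
    rng.normal(750, 8, size=200),
])

scaler_demo = sklearn.preprocessing.StandardScaler()
X_scaled = scaler_demo.fit_transform(X_demo)  # fit and transform in one call

# Same result computed by hand: subtract the column mean, divide by the column std
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, every column has mean close to 0 and standard deviation close to 1, which matters for both gradient descent and distance-based models such as KNN.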

## Linear Regression & Learning Curves

Now that our data is prepared, we will start by training a standard linear regression model using stochastic gradient descent. In Scikit-Learn, this corresponds to the `SGDRegressor` class for regression tasks (or `SGDClassifier` for classification tasks). If you look at the documentation for `SGDRegressor`, you should now understand nearly all of the keyword arguments and be able to connect them to material from Chapter 4 in the textbook and class (even if the notation is different): https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html

As with most of the Scikit-Learn classes, it's worth double-checking the default argument values before doing any training. In this case, we see that `SGDRegressor` uses mean-squared loss (as we did in class). It also uses an L2 penalty, meaning that by default, this class technically performs *Ridge Regression* rather than vanilla linear regression. You can change the `penalty` argument to `None`, `l1`, or `elasticnet` to perform Linear Regression, Lasso Regression, or Elastic Net. Note that the `SGDRegressor` class is different from the `LinearRegression`, `Ridge`, and `Lasso` classes: those classes minimize the training objective with dedicated solvers (for example, a direct least-squares solve for `LinearRegression` and coordinate descent for `Lasso`), while `SGDRegressor` performs stochastic gradient descent.
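As a rough illustration of these options, the sketch below fits `SGDRegressor` with each value of its `penalty` keyword on made-up data and prints the learned coefficients next to those from `LinearRegression`. The data (`X_demo`, `y_demo`, `true_coef`) are assumptions for illustration only, and exact numbers will vary with the Scikit-Learn version and random seed.

```python
import numpy as np
import sklearn.linear_model

rng = np.random.default_rng(0)
# Made-up regression problem with known coefficients
X_demo = rng.normal(size=(500, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=500)

# Solver-based ordinary least squares
ols = sklearn.linear_model.LinearRegression().fit(X_demo, y_demo)

# Iterative SGD with different regularization penalties
sgd_coefs = {}
for penalty in [None, "l2", "l1", "elasticnet"]:
    model = sklearn.linear_model.SGDRegressor(penalty=penalty, random_state=0)
    model.fit(X_demo, y_demo)
    sgd_coefs[penalty] = model.coef_

print("OLS:", ols.coef_.round(2))
for penalty, coef in sgd_coefs.items():
    print(penalty, coef.round(2))
```

With the small default regularization strength, all four SGD fits should land close to the least-squares coefficients on a well-conditioned problem like this one; the differences become more visible as `alpha` grows.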

Before we get fancy, let's try it the easy way by creating a default instance of the `SGDRegressor` class and using the `cross_val_score()` function to train and test performance.

In [8]:
# Create a SGDRegressor object
sgd = sklearn.linear_model.SGDRegressor()

# Use the cross_val_score function to perform 5-fold cross validation. Use negative mean absolute error as the performance metric (keyword argument "scoring")
# Negative mean abs error: the negative of the mean of the absolute differences between predicted and actual labels
scores = sklearn.model_selection.cross_val_score(sgd, X, y, cv=5, scoring="neg_mean_absolute_error")
# print the average score
print("Average MAE: ", -np.mean(scores))

Average MAE:  58.20794473593484


**DISCUSSION:** Do you think this is good or bad? Look at the range of Watt-hours in the actual labels `y` to see whether or not this error represents a significant portion of the label range.
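One way to put an MAE in context is to compare it with the label range and with a trivial baseline that ignores the features. The sketch below does this on made-up, right-skewed labels (`y_demo` and `X_demo` are stand-ins, not the energy data), using Scikit-Learn's `DummyRegressor` to always predict the median label. On the real data, the same steps would use `X` and `y`.

```python
import numpy as np
import sklearn.dummy
import sklearn.model_selection

rng = np.random.default_rng(0)
# Made-up, right-skewed labels standing in for appliance energy use
y_demo = rng.lognormal(mean=4.0, sigma=0.6, size=1000)
X_demo = rng.normal(size=(1000, 3))  # features are ignored by the baseline

label_range = y_demo.max() - y_demo.min()

# Baseline that always predicts the median training label
baseline = sklearn.dummy.DummyRegressor(strategy="median")
baseline_scores = sklearn.model_selection.cross_val_score(
    baseline, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
baseline_mae = -np.mean(baseline_scores)

print("Label range:", label_range)
print("Baseline MAE:", baseline_mae)
print("Baseline MAE as a fraction of the range:", baseline_mae / label_range)
```

If a trained model's MAE is not clearly below the baseline's, the features are contributing little to the predictions.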

When you have a model that performs questionably well, it helps to make as many visualizations as you can to understand what's going wrong. We will start by plotting a learning curve over the size of the training set.

In [9]:
# Create a list with the number of examples you will use for training, ranging from 100 to the full training set in steps of 100
n_examples = range(100, X.shape[0], 100)

# Create two lists to hold 1) the training errors and 2) the validation error
train_errors = []
val_errors = []


# Loop over each number of examples n
for n in n_examples:
    print(n, end=' ')

    # Select the first n training examples and training labels
    X_curr = X[:n, :]
    y_curr = y.iloc[:n]

    # Create a SGDRegressor object (note the parentheses: this creates an instance, not a reference to the class)
    sgd = sklearn.linear_model.SGDRegressor()

    # Use the cross_validate function (NOT cross_val_score) to perform 5-fold cross-validation and return the 
    #     negative mean absolute error on both the training and the validation set. Look this function up in the docs for details!
    # Pass the subset (X_curr, y_curr), not the full dataset, so the training size actually varies
    scores = sklearn.model_selection.cross_validate(sgd, X_curr, y_curr, cv=5, scoring="neg_mean_absolute_error", return_train_score=True)
    
    # Compute the average training and validation scores across all folds
    avg_train_score = -scores["train_score"].mean()
    avg_val_score = -scores["test_score"].mean()
    
    # Append the average scores into the accumulator lists
    train_errors.append(avg_train_score)
    val_errors.append(avg_val_score)

100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700 4800 4900 5000 5100 5200 5300 5400 5500 5600 5700 5800 5900 6000 6100 6200 6300 6400 6500 6600 6700 6800 6900 7000 7100 7200 7300 7400 7500 7600 7700 7800 7900 8000 8100 8200 8300 8400 8500 8600 8700 8800 8900 9000 9100 9200 9300 9400 9500 9600 9700 9800 9900 10000 10100 10200 10300 10400 10500 10600 10700 10800 10900 11000 11100 11200 11300 11400 11500 11600 11700 11800 11900 12000 12100 12200 12300 12400 12500 12600 12700 12800 12900 13000 13100 13200 13300 13400 13500 13600 13700 13800 13900 14000 14100 14200 14300 14400 14500 14600 14700 14800 14900 15000 15100 15200 15300 15400 15500 15600 15700 15800 15900 16000 16100 16200 16300 16400 16500 16600 16700 16800 16900 17000 17100 17200 17300 17400 17500 17600 17700 17800 17900 18000 18100 18200 18300 18400 1850

In [10]:
# Use Matplotlib to plot the training and validation errors (y-axis) against the number of training examples (x-axis)
plt.plot(n_examples, train_errors, label="Training error")
plt.plot(n_examples, val_errors, label="Validation error")

# add a x-axis label and a y-axis label
plt.xlabel("Number of training examples")
plt.ylabel("Mean absolute error (Wh)")

# add a legend and grid lines
plt.legend()
plt.grid(True)

This looks like a fairly typical learning curve. The error goes up initially as you get more than a few datapoints, but then back down as the model learns to generalize.

**DISCUSSION:** Look at the vertical space between the validation error and the training error at the end of the curve. Do you think that this is an example of *overfitting* or *underfitting*?
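As an alternative to writing the loop by hand, Scikit-Learn provides a `learning_curve` function in `sklearn.model_selection`. The sketch below runs it on made-up data (`X_demo` and `y_demo` are assumptions for illustration). Note that it is not identical to the manual loop above: it keeps the validation folds fixed and varies only how much of each training fold is used, rather than cross-validating within the first `n` examples.

```python
import numpy as np
import sklearn.linear_model
import sklearn.model_selection

rng = np.random.default_rng(0)
# Made-up linear regression problem
X_demo = rng.normal(size=(600, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(size=600)

sizes, train_scores, val_scores = sklearn.model_selection.learning_curve(
    sklearn.linear_model.SGDRegressor(random_state=0),
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5),  # fractions of each training fold
    cv=5,
    scoring="neg_mean_absolute_error",
)

# Scores are negative MAE with shape (n_sizes, n_folds); negate and average over folds
train_errors_demo = -train_scores.mean(axis=1)
val_errors_demo = -val_scores.mean(axis=1)
print(sizes)
print(train_errors_demo.round(3))
print(val_errors_demo.round(3))
```

The returned `sizes` array holds the absolute number of training examples used at each step, so it can be passed directly as the x-axis of a plot.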

## K-Nearest Neighbors

Now that we have tried out linear regression, let's see whether a nearest neighbors regressor does better or worse on this dataset. 

**DISCUSSION:** What are two reasons why this dataset might be more amenable to KNN than linear regression?

In [11]:
# Create a KNeighborsRegressor object to perform 5-nearest neighbors
knn = sklearn.neighbors.KNeighborsRegressor(n_neighbors=5)

# train and evaluate using the cross_val_score function and the negative mean absolute error metric
scores = sklearn.model_selection.cross_val_score(knn, X, y, cv=5, scoring="neg_mean_absolute_error")

# print the average score
print("Average MAE: ", -np.mean(scores))

Average MAE:  71.90787940207751
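
The overview mentions using KNN for interpretable predictions. One way to inspect a single KNN prediction, sketched here on made-up data, is to ask the fitted model which training examples it averaged. The arrays `X_train_demo`, `y_train_demo`, and `x_query` are hypothetical; the `kneighbors` method is part of `KNeighborsRegressor`.

```python
import numpy as np
import sklearn.neighbors

rng = np.random.default_rng(0)
# Made-up training set and a single query point
X_train_demo = rng.normal(size=(50, 2))
y_train_demo = rng.uniform(10, 100, size=50)
x_query = np.array([[0.0, 0.0]])

knn_demo = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
knn_demo.fit(X_train_demo, y_train_demo)

# Distances to, and row indices of, the 3 nearest training examples
distances, indices = knn_demo.kneighbors(x_query)
prediction = knn_demo.predict(x_query)[0]

# With the default uniform weights, the prediction is the mean of the neighbors' labels
neighbor_labels = y_train_demo[indices[0]]
print(indices[0], neighbor_labels.round(2), round(prediction, 2))
```

Looking up the rows at `indices[0]` in the original data shows exactly which past observations drove the prediction, which is something a table of linear regression coefficients cannot show for an individual example.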


The primary hyperparameter for KNN is the number of neighbors. When training a model with just a few hyperparameters, a good strategy is often to just try a reasonable range of values (e.g. 1-10) and see which value gives the best cross-validation score (here, the lowest mean absolute error). In the following cell, write a loop that creates and trains a `KNeighborsRegressor` on the energy dataset for a range of values of `n_neighbors` and decides which value is the best choice.

In [14]:
# Create list to store average scores
avg_score = []

# Loop over neighbors n = 1 to 10
for n in range(1, 11):
    print(n, end= " ")
    
    # Create a KNeighborsRegressor object for the current number of neighbors
    knn = sklearn.neighbors.KNeighborsRegressor(n_neighbors=n)
    # train and evaluate using the cross_val_score function and the negative mean absolute error metric
    scores = sklearn.model_selection.cross_val_score(knn, X, y, cv=5, scoring="neg_mean_absolute_error")
    # store the average score
    avg_score.append(-np.mean(scores))

1 2 3 
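
The same search over `n_neighbors` can also be expressed with Scikit-Learn's `GridSearchCV`, which runs the cross-validation for every candidate value and records the best one. The following is a minimal sketch on made-up data (`X_demo` and `y_demo` are assumptions for illustration), not a replacement for writing the loop yourself.

```python
import numpy as np
import sklearn.model_selection
import sklearn.neighbors

rng = np.random.default_rng(0)
# Made-up nonlinear regression problem
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)

search = sklearn.model_selection.GridSearchCV(
    sklearn.neighbors.KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 11))},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_demo, y_demo)

# best_score_ is a negative MAE, so negate it to report an error
best_k = search.best_params_["n_neighbors"]
best_mae = -search.best_score_
print(best_k, best_mae)
```

The per-value averages are available in `search.cv_results_["mean_test_score"]`, which can be negated and plotted against the number of neighbors.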

In [None]:
# plot the average scores against the number of neighbors


# Add axis labels


Finally, create a learning curve for the KNN model and your choice of `n_neighbors`. You can copy and paste code from the `SGDRegressor` learning curve above. Note that KNNs take longer to perform predictions the larger the training dataset, so you may want to reduce the number of iterations to speed up your code.
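To see how prediction cost grows with the training set, the sketch below times KNN scoring at a few training-set sizes using the `score_time` values that `cross_validate` returns. The data (`X_demo`, `y_demo`) and the chosen sizes are made up for illustration, and the timings will differ from machine to machine.

```python
import numpy as np
import sklearn.model_selection
import sklearn.neighbors

rng = np.random.default_rng(0)
# Made-up data; only the sizes matter for this timing comparison
X_demo = rng.normal(size=(8000, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=8000)

score_times = {}
for n in [500, 2000, 8000]:
    knn_demo = sklearn.neighbors.KNeighborsRegressor(n_neighbors=5)
    results = sklearn.model_selection.cross_validate(
        knn_demo, X_demo[:n], y_demo[:n], cv=5, scoring="neg_mean_absolute_error"
    )
    # score_time is the time (in seconds) spent predicting on each validation fold
    score_times[n] = results["score_time"].mean()

for n, seconds in score_times.items():
    print(n, round(seconds, 4))
```

Setting `return_train_score=True`, as the learning curve requires, adds a prediction pass over every training fold as well, which is one reason a coarser step size helps here.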

In [None]:
# Create a list with the number of examples you will use for training, ranging from 100 to the full training set in steps of 1000


# Create two lists to hold 1) the training errors and 2) the validation error


# Loop over each number of examples n

    
    # Select the first n training examples and training labels 

    
    # Create a KNeighborsRegressor object


    # Use the cross_validate function (NOT cross_val_score) to perform 5-fold cross-validation and return the 
    #     negative mean absolute error on both the training and the validation set. Look this function up in the docs for details!
    
    
    # Compute the average training and validation scores across all folds

    
    # Append the average scores into the accumulator lists


In [None]:
# Use Matplotlib to plot the training and validation errors (y-axis) against the number of training examples (x-axis)


# add a x-axis label and a y-axis label


# add a legend and grid lines


**DISCUSSION:** Look at the vertical space between the validation error and the training error at the end of the curve. Do you think that this is an example of *overfitting* or *underfitting*?

## *(Optional)* Regularization & Hyperparameters for Linear Regression

The `SGDRegressor` class has many constructor keyword arguments that change the behavior of the model. Let's see how modifying these parameters affects model performance. Perform the following tasks in the cells below:
1. Choose an argument to `SGDRegressor` that you think will impact model performance
2. Try multiple (at least 3) different options for this argument and use `cross_val_score()` to test the performance of the model with each of these options.
3. Plot a bar chart comparing model performance for each of the argument options

**DISCUSSION:** Once you have finished the tasks above, discuss whether the options you chose made a substantial difference in performance and why you think this might be the case.
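The three tasks above can be sketched as follows, using the regularization strength `alpha` as the example argument. The data (`X_demo`, `y_demo`) and the particular `alpha` values are assumptions chosen for illustration; any other `SGDRegressor` keyword argument could be substituted in the same pattern.

```python
import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model
import sklearn.model_selection

rng = np.random.default_rng(0)
# Made-up linear regression problem
X_demo = rng.normal(size=(400, 4))
y_demo = X_demo @ np.array([5.0, -3.0, 0.0, 1.0]) + rng.normal(size=400)

alphas = [0.0001, 0.01, 1.0]  # illustrative regularization strengths
mae_by_alpha = []
for alpha in alphas:
    model = sklearn.linear_model.SGDRegressor(alpha=alpha, random_state=0)
    scores = sklearn.model_selection.cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
    )
    mae_by_alpha.append(-np.mean(scores))

# Bar chart with one bar per option; labels are strings so the bars are evenly spaced
plt.bar([str(a) for a in alphas], mae_by_alpha)
plt.xlabel("alpha")
plt.ylabel("Cross-validated MAE")
```

On this synthetic problem the largest `alpha` shrinks the coefficients well away from their true values, so its bar should be noticeably taller; whether the same holds on the energy data is part of the discussion question.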

In [None]:
# Your Code Here