# DSCI 100: Introduction to Data Science

## Tutorial 8 - Regression I (K-nearest neighbours): Class activity

In [None]:
library(tidyverse)
library(tidymodels)

Let's revisit the avocado data from week 3 and try to use the volume of small Hass avocado sales to predict the volume of large Hass avocado sales. To reduce the size of the dataset, we will keep only the observations from 2015.

In [None]:
# run this
avocado <- read_csv("data/avocado_prices.csv") |>
    filter(yr == 2015)
head(avocado)

In the readings, we looked at both RMSE and RMSPE and their differences.<br>
* <b>RMSE</b> refers to the root mean squared error: the error in the predictions made for the training data. We look at this property when we evaluate how well our model fits the data it was trained on.
<br>
* <b>RMSPE</b> refers to the root mean squared <b>prediction</b> error: the error in the predictions made for the testing data. We look at this property when we want to evaluate the quality of our future predictions on new data we haven't seen before.
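
Both quantities use the same formula; what changes is which observations are plugged in. The base R sketch below illustrates this with made-up numbers. The helper `rms_error()` and all four vectors are hypothetical and are not part of the avocado data or of tidymodels.

```r
# Hypothetical helper: root mean squared difference between truth and estimate.
rms_error <- function(truth, estimate) {
  # Square the residuals, average them, then take the square root.
  sqrt(mean((truth - estimate)^2))
}

# Made-up observed values and predictions for rows used to fit a model ...
train_truth    <- c(10, 12, 15, 20)
train_estimate <- c(11, 12, 14, 18)

# ... and for rows the model never saw while it was being fit.
test_truth    <- c(9, 14, 22)
test_estimate <- c(12, 13, 18)

rmse  <- rms_error(train_truth, train_estimate)  # error on the data used for fitting
rmspe <- rms_error(test_truth, test_estimate)    # error on unseen data

rmse
rmspe
```

In this toy example the error on unseen data is larger than the error on the training data, which is common but not guaranteed.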

Let's take a look at their differences, and at which point in our workflow we might need one over the other.

In [None]:
# Split the data into training and testing
set.seed(1234)
avo_split <- initial_split(avocado, prop = 0.5, strata = large_hass_volume)
avo_train <- training(avo_split)
avo_test <- testing(avo_split)

In [None]:
# Set the seed. Don't remove this!
set.seed(3456) 
# Create a recipe, model specification, and workflow
avo_recipe <- recipe(large_hass_volume ~ small_hass_volume, data = avo_train) |>
                  step_scale(all_predictors()) |>
                  step_center(all_predictors())

avo_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
                  set_engine("kknn") |>
                  set_mode("regression")

avo_workflow <- workflow() |>
                 add_recipe(avo_recipe) |>
                 add_model(avo_spec)
avo_workflow
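
To make the model specification above more concrete: with `weight_func = "rectangular"`, each of the k nearest neighbours contributes equally, so a prediction is the plain average of their response values. The base R sketch below shows this idea for a single predictor. The helper `knn_predict_one()` and the toy vectors are hypothetical; this is not how `kknn` is implemented, and it ignores details such as tie handling.

```r
# Hypothetical helper: predict one new point with unweighted k-nearest neighbours.
knn_predict_one <- function(x_train, y_train, x_new, k) {
  distances <- abs(x_train - x_new)        # distance to every training point
  nearest <- order(distances)[seq_len(k)]  # indices of the k closest points
  mean(y_train[nearest])                   # equal-weight average of their responses
}

# Made-up training data with one predictor and one response.
x_train <- c(1, 2, 3, 10, 11)
y_train <- c(5, 7, 6, 20, 22)

# Predict at x = 2.4 using the 3 nearest training points (x = 2, 3 and 1).
knn_predict_one(x_train, y_train, x_new = 2.4, k = 3)
```

Changing `k` changes how many training points are averaged, which is why the number of neighbours is the parameter left to tune.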

Here we've provided most of the initial setup: splitting the data into training and testing sets, making the recipe and the model specification, and adding them to the workflow. Now let's perform cross-validation with **3 folds** and take a look at the RMSE values. (This might take a bit to run!)

In [None]:
set.seed(1234)

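# Split the training data into 3 folds, stratified by the response
avo_vfold <- vfold_cv(avo_train, v = 3, strata = large_hass_volume)

# Candidate numbers of neighbours to try: 1, 2, ..., 200
gridvals <- tibble(neighbors = seq(1, 200))

# Tune the workflow over the grid on the folds, then collect the metrics averaged across folds
training_results <- avo_workflow |>
                       tune_grid(resamples = avo_vfold, grid = gridvals) |>
                       collect_metrics()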

head(training_results)

You will see that each number of neighbors has an `rmse` metric and an `rsq` metric.

**Question:** Is the `rmse` reported here RMSE or RMSPE?

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

Now find the k value that gives the minimum RMSE.

In [None]:
avo_min <- training_results |>
               filter(.metric == 'rmse') |>
               filter(mean == min(mean))
avo_min

Our optimal k value is 18!

Using k = 18, fit the model on our training set, then predict on our testing set and return the summary statistics.

In [None]:
# run this
set.seed(1234)

avo_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 18) |>
                  set_engine("kknn") |>
                  set_mode("regression")

avo_fit <- workflow() |>
           add_recipe(avo_recipe) |>
           add_model(avo_spec) |>
           fit(data = avo_train)

In [None]:
avo_summary <- avo_fit |> 
           predict(avo_test) |>
           bind_cols(avo_test) |>
           metrics(truth = large_hass_volume, estimate = .pred) 
avo_summary

Once again, we see the metric `rmse` in one of the rows.

**Question:** Is the `rmse` reported here RMSE or RMSPE?

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.
