# DSCI 100: Introduction to Data Science

## Tutorial 9 - Regression (continued): Class activity

In [None]:
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 6)

Let's return to the avocado data from week 3 and try to use the small Hass avocado volumes to predict the large Hass avocado volumes. To reduce the size of the dataset, let's also keep only the observations from 2015.

In [None]:
# run this
avocado <- read_csv("data/avocado_prices.csv") %>%
    filter(yr == 2015)

We can measure the quality of our regression model using the RMSPE value, just as we used accuracy to evaluate our K-NN classification models.

In the readings, we looked at both RMSE and RMSPE and the difference between them.<br>
* <b>RMSE</b> refers to the root mean squared error: the error we get when we predict, and evaluate prediction quality, on the same training data the model was fit to. <br>
* <b>RMSPE</b> refers to the root mean squared <b>prediction</b> error: the error in predictions made on data the model did not see during fitting, such as a validation set or the testing data. We look at this quantity when we evaluate the quality of our final predictions.
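
As a reminder, both quantities are computed with the same formula. The notation here is an assumption of this note rather than something defined in the worksheet: $n$ is the number of observations, $y_i$ is the observed value of the target for observation $i$, and $\hat{y}_i$ is the model's prediction for it.

$$\text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

The only thing that changes between RMSE and RMSPE is which observations are plugged in: the data the model was fit to (RMSE) or data it has not seen (RMSPE).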

Let's take a look at how they differ, and at which points in our workflow we might need one rather than the other. Let's split our data into a training set and a testing set using a 50-50 split.

In [None]:
# run this
set.seed(1234)
avo_split <- initial_split(avocado, prop = 0.5, strata = large_hass_volume)
avo_train <- training(avo_split)
avo_test <- testing(avo_split)

Now let's set up our recipe, model specification and workflow.

In [None]:
# run this
avo_recipe <- recipe(large_hass_volume ~ small_hass_volume, data = avo_train) %>%
                  step_scale(all_predictors()) %>%
                  step_center(all_predictors())

avo_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
                  set_engine("kknn") %>%
                  set_mode("regression")

avo_workflow <- workflow() %>%
                 add_recipe(avo_recipe) %>%
                 add_model(avo_spec)
avo_workflow
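
To build some intuition for what the model specification above asks for, here is a minimal base-R sketch of K-NN regression with one predictor. It is only an illustration under stated assumptions: `knn_predict` is a hypothetical helper (not part of kknn or tidymodels), the data are made up, and the real engine may differ in details such as tie handling.

```r
# Minimal K-NN regression for a single predictor, using base R only.
# `knn_predict` is a hypothetical helper written for this illustration.
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    # distance from the new point to every training point
    dists <- abs(x_train - x0)
    # indices of the k closest training points
    nearest <- order(dists)[seq_len(k)]
    # "rectangular" weighting: every neighbour counts equally,
    # so the prediction is the plain average of their responses
    mean(y_train[nearest])
  })
}

# Toy data (made up for illustration)
x_train <- c(1, 2, 3, 4, 5)
y_train <- c(10, 20, 30, 40, 50)

# The 3 nearest neighbours of 2.9 are x = 3, 2 and 4
pred <- knn_predict(x_train, y_train, x_new = 2.9, k = 3)
pred
```

With `neighbors = tune()`, the value of `k` is left open so that it can be chosen later by cross-validation.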

Here we've provided most of the initial setup: the data is split into training and testing sets, and the recipe and model specification are built and added to the workflow. Now let's perform cross-validation with 3 folds and take a look at the RMSPE values. (This might take a bit to run!)

In [None]:
set.seed(1234)

avo_vfold <- vfold_cv(avo_train, v = 3, strata = large_hass_volume)

gridvals <- tibble(neighbors = seq(1,200))

training_results <- avo_workflow %>%
                       tune_grid(resamples = avo_vfold, grid = gridvals) %>%
                       collect_metrics() 

training_results
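
The `mean` column reported for each number of neighbors is an average over the folds. The base-R sketch below shows that idea by hand on made-up data. It is a simplification, not what `tune_grid` does internally: it uses a simple linear model instead of K-NN to stay short, and it assigns folds at random without stratification.

```r
# A base-R sketch of 3-fold cross-validation on made-up data.
set.seed(1)
n <- 30
x <- runif(n, 0, 10)
y <- 2 * x + rnorm(n)
toy <- data.frame(x = x, y = y)

v <- 3
# randomly assign each row to one of v equally sized folds
fold_id <- sample(rep(seq_len(v), length.out = n))

fold_rmspe <- sapply(seq_len(v), function(f) {
  fit_rows <- toy[fold_id != f, ]   # folds used to fit the model
  val_rows <- toy[fold_id == f, ]   # held-out validation fold
  fit <- lm(y ~ x, data = fit_rows)
  preds <- predict(fit, newdata = val_rows)
  # prediction error on data the model did not see
  sqrt(mean((val_rows$y - preds)^2))
})

# the cross-validated estimate is the average over the folds
cv_rmspe <- mean(fold_rmspe)
cv_rmspe
```

Each fold takes one turn as the validation set, so every observation is used for evaluation exactly once.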

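Take a look inside the .metric column and you'll find that each number of neighbors has one row per metric (rmse and rsq). Because the rmse here was computed on held-out validation folds, it is an estimate of the RMSPE. Now let's find the k value that gives the minimum RMSPE.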

In [None]:
avo_min <- training_results %>%
               filter(.metric == 'rmse') %>%
               filter(mean == min(mean))
avo_min

Our optimal k value is 18!

Using k = 18, fit the model on our training set, then predict on our testing set and return the summary statistics.

In [None]:
# run this
avo_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 18) %>%
                  set_engine("kknn") %>%
                  set_mode("regression")

avo_fit <- workflow() %>%
           add_recipe(avo_recipe) %>%
           add_model(avo_spec) %>%
           fit(data = avo_train)

In [None]:
avo_summary <- avo_fit %>% 
           predict(avo_test) %>%
           bind_cols(avo_test) %>%
           metrics(truth = large_hass_volume, estimate = .pred) 

avo_summary

Don't be fooled here: the number in the "rmse" row now signifies the <b>RMSPE</b>, the error computed when our model is used to predict the values of the testing set, i.e. from an <b>out-of-sample</b> prediction. Remember that this value has no universal scale: it is measured in the same units as the target variable (here, large_hass_volume), so it should be interpreted relative to the typical values of that variable!
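
The small sketch below shows why the size of this number depends on the units of the target variable. The values are made up, and `rmse` is a hypothetical helper written for this illustration (it is not the yardstick function used by `metrics`).

```r
# `rmse` is a hypothetical helper written for this illustration.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Made-up observed values and predictions
truth    <- c(100, 200, 300, 400)
estimate <- c(110, 190, 320, 380)

# Error in the original units
err_units <- rmse(truth, estimate)

# The same predictions, expressed in thousands of units
err_thousands <- rmse(truth / 1000, estimate / 1000)

err_units
err_thousands
```

The predictions are equally good in both cases; only the units changed, and the error shrank by the same factor of 1000.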

Let's do the same thing with linear regression: find the RMSE and the RMSPE, and compare our results to K-NN regression.

In [None]:
set.seed(1234)
lm_spec <- linear_reg() %>%
                set_engine("lm") %>%
                set_mode("regression")

lm_recipe <- recipe(large_hass_volume ~ small_hass_volume, data = avo_train)

lm_fit <- workflow() %>%
                add_recipe(lm_recipe) %>%
                add_model(lm_spec) %>%
                fit(data = avo_train)

# RMSE: error on the training data the model was fit to (in-sample)
lm_rmse <- lm_fit %>%
                predict(avo_train) %>%
                bind_cols(avo_train) %>%
                metrics(truth = large_hass_volume, estimate = .pred) %>%
                filter(.metric == 'rmse') %>%
                select(.estimate) %>%
                pull()

lm_rmse

# RMSPE: error on the held-out testing data (out-of-sample)
lm_rmspe <- lm_fit %>%
                predict(avo_test) %>%
                bind_cols(avo_test) %>%
                metrics(truth = large_hass_volume, estimate = .pred) %>%
                filter(.metric == 'rmse') %>%
                select(.estimate) %>%
                pull()
lm_rmspe
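
For a sense of what the `"lm"` engine computes when there is a single predictor, here is a minimal base-R sketch on made-up data. It uses the standard least-squares formulas for the slope and intercept and checks them against R's built-in `lm()`. The variable names are illustrative only.

```r
# Toy data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares slope and intercept for one predictor
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same coefficients from R's built-in lm()
lm_coefs <- coef(lm(y ~ x))

c(intercept = intercept, slope = slope)
lm_coefs
```

Unlike K-NN regression, the fitted model is summarised by just these two numbers, so there is no number of neighbors to tune.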