# **Tutorial 07: Goodness of Fit beyond MLR and Stepwise Selection**

#### Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:

1. List model metrics that are suitable for evaluation of a statistical model developed to make inferences about the data-generating mechanism (e.g., $R^2$, $\text{AIC}$, Likelihood ratio test/$F$-test), their strengths and limitations, as well as how they are calculated.
2. Identify appropriate goodness-of-fit metrics for MLR, logistic and Poisson regressions.
3. Compute appropriate residuals of logistic and Poisson regressions.
4. Explain how an $F$-test to compare nested models can be used as a variable selection method.
5. Write a computer script to calculate these model metrics. Interpret and communicate the results from that computer script.

In [None]:
# Loading Libraries

library(MASS)
library(broom)
library(tidymodels)
library(repr)
library(mltools)
library(leaps)
library(tidyverse)
library(modelr)
source("tests_tutorial_07.R")

# Stepwise Model Selection

In this tutorial we will focus on selecting a subset of variables to include in a predictive model. Do we actually need all the available input variables? Some datasets contain *many* variables, but not all of them are relevant. To decide if a variable (or set of variables) is relevant, we need to choose an evaluation metric. 

The evaluation metric used depends on the goal of the analysis. So, what is your goal? *Inference or prediction?*

## Variable selection for generative models

In previous worksheets, we learned different selection and estimation methods when the goal is to *estimate and make inferences about* the model that generated the data. We referred to these models as *generative models*.

For a LR with an intercept and estimated by LS:

- The $\mathbf{R^2}$, coefficient of determination, can be used to measure the part of the variation in the response explained by the estimated model


- The **Adjusted $\mathbf{R^2}$** can be used to compare the fit of estimated models of different sizes


- The **MSE** (based on in-sample data) can be used to compare the observed values with those predicted by the estimated model  


- The $\mathbf{F}$-tests can be used to select variables by comparing nested models
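
As a minimal sketch of the metrics listed above, the code below fits two nested linear models and extracts each quantity. The built-in `mtcars` data and the chosen covariates are assumptions made only for illustration; they are not part of this tutorial's datasets.

```r
# Illustrative data only: mtcars is a built-in R dataset, not the tutorial's data
reduced <- lm(mpg ~ wt, data = mtcars)
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# R^2 and adjusted R^2 are stored in the model summary
r2_full     <- summary(full)$r.squared
adj_r2_full <- summary(full)$adj.r.squared
adj_r2_red  <- summary(reduced)$adj.r.squared

# In-sample MSE: average squared residual on the data used to fit the model
mse_in_sample <- mean(residuals(full)^2)

# F-test comparing the nested models
# (H0: the coefficients of the additional variables are all zero)
f_test <- anova(reduced, full)
f_test
```

The two adjusted $R^2$ values can be compared directly even though the models have different sizes, while the `anova()` table reports the $F$ statistic and p-value for the extra variables.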

## Variable selection for predictive models

How do we evaluate the predictive performance of a model? For regression models, two common choices are:

- **Mean Squared Error (MSE)**: $$\text{MSE}_{\text{test}} = \frac{1}{n_{\text{new}}}\sum_{i=1}^{n_{\text{new}}}(y^{\text{new}}_i - \widehat{y}^{\text{new}}_i)^2$$
<font color='darkred'>where $y^{\text{new}}_i$ are **new responses from the test set**</font>, $\widehat{y}^{\text{new}}_i$ are the predicted values using the LR estimated with the training data but with the input data from the test set, and $n_{\text{new}}$ is the number of data points in the test set. You *do not want* to use the data in the training set to evaluate your model. 

- **Root Mean Squared Error (RMSE)**: this is the square root of MSE.
$$\text{RMSE}_{\text{test}} = \sqrt{\text{MSE}_{\text{test}}} = \sqrt{\frac{1}{n_{\text{new}}}\sum_{i=1}^{n_{\text{new}}}(y^{\text{new}}_i - \widehat{y}^{\text{new}}_i)^2}$$
Once again, remember that $y^{\text{new}}_i$ are observations in the test set and weren't used to train the model. 
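
A minimal sketch of both formulas is shown below, assuming a simple random split of the built-in `mtcars` data. The dataset, the seed, the split size, and the covariates are illustrative choices, not the ones used later in this tutorial.

```r
# Illustrative data only: mtcars is a built-in R dataset
set.seed(1)                                   # arbitrary seed, for reproducibility
train_idx <- sample(nrow(mtcars), size = 20)  # 20 of the 32 rows go to training
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Estimate the model with the training data only
fit <- lm(mpg ~ wt + hp, data = train)

# Predict the responses of the test set from its input variables
pred_new <- predict(fit, newdata = test)

# Test MSE and RMSE, following the formulas above
mse_test  <- mean((test$mpg - pred_new)^2)
rmse_test <- sqrt(mse_test)
c(MSE_test = mse_test, RMSE_test = rmse_test)
```

The RMSE is in the same units as the response, which usually makes it easier to interpret than the MSE.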

<br>

<br>

Another possibility, not really common for prediction, is the $R^2$.
- **$R^2$**: remember that $R^2$ can be computed for new responses in a test set.
$$R^2 = cor(\boldsymbol{y}^{\text{new}}, \widehat{\boldsymbol{y}}^{\text{new}})^2$$
Some functions compute the $R^2$ from a validation set or using cross-validation (perhaps seen in other courses). However, note that it is ***no longer the coefficient of determination***. It measures the squared correlation between the true and the predicted responses *in a test set*.   
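
To illustrate the point, the sketch below computes the test-set $R^2$ as a squared correlation and contrasts it with the quantity $1 - \text{SSE}/\text{SST}$ evaluated on the same test set. The built-in `mtcars` data, the seed, and the split are assumptions made for illustration.

```r
# Illustrative data only: mtcars is a built-in R dataset
set.seed(1)
train_idx <- sample(nrow(mtcars), size = 20)
test <- mtcars[-train_idx, ]

fit <- lm(mpg ~ wt + hp, data = mtcars[train_idx, ])
pred_new <- predict(fit, newdata = test)

# Squared correlation between observed and predicted test responses
r2_test <- cor(test$mpg, pred_new)^2

# For contrast: 1 - SSE/SST on the test set; out of sample, this need not
# equal the squared correlation and can even be negative
r2_alt <- 1 - sum((test$mpg - pred_new)^2) / sum((test$mpg - mean(test$mpg))^2)

c(squared_correlation = r2_test, one_minus_sse_sst = r2_alt)
```

In-sample, for an LS fit with an intercept, the two quantities coincide; in a test set they generally differ.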

<br> 
<hr>
<br>

There are other common metrics that have been proposed to approximate the *test MSE* but are computed with the training set only. You can use these measures to select variables of predictive models, even without using a test set.

- $C_p$

- $AIC$

- $BIC$ 
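
A rough sketch of how these three measures can be obtained from the training data is given below. It assumes the built-in `mtcars` data as a stand-in training set and uses the $C_p$ definition from ISLR, $C_p = (\text{RSS} + 2 d \hat{\sigma}^2)/n$; other software (for example `leaps`) reports $C_p$ on a different scale.

```r
# Illustrative data only: mtcars plays the role of a training set
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)  # largest model considered
cand <- lm(mpg ~ wt + hp, data = mtcars)                # candidate model

n   <- nrow(mtcars)
d   <- length(coef(cand)) - 1          # number of input variables in the candidate
rss <- sum(residuals(cand)^2)

# Error variance estimated from the largest model
sigma2_hat <- summary(full)$sigma^2

# Cp as defined in ISLR (an assumption: other definitions differ by scaling)
cp_islr <- (rss + 2 * d * sigma2_hat) / n

c(Cp = cp_islr, AIC = AIC(cand), BIC = BIC(cand))
```

All three add a penalty for model size to a measure of training error, so smaller values are preferred.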

## 1. Dataset: the [Ames `Housing` dataset](https://www.kaggle.com/c/home-data-for-ml-course/)

In this section, we will work with a real estate dataset, the [Ames `Housing` dataset](https://www.kaggle.com/c/home-data-for-ml-course/), compiled by Dean De Cock. It has 79 input variables on different characteristics of residential houses in Ames, Iowa, USA, that can be used to predict the property's final price, `SalePrice`. We will use the following numerical input variables:

- `LotFrontage`: Linear $\text{ft}$ of street connected to the house.
- `LotArea`: Lot size in $\text{ft}^2$.
- `MasVnrArea`: Masonry veneer area in $\text{ft}^2$.
- `TotalBsmtSF`: Total $\text{ft}^2$ of basement area.
- `GrLivArea`: Above grade (ground) living area in $\text{ft}^2$.
- `BsmtFullBath`: Number of full bathrooms in the basement.
- `BsmtHalfBath`: Number of half bathrooms in the basement.
- `FullBath`: Number of full bathrooms above grade.
- `HalfBath`: Number of half bathrooms above grade.
- `BedroomAbvGr`: Number of bedrooms above grade (it does not include basement bedrooms).
- `KitchenAbvGr`: Number of kitchens above grade.
- `Fireplaces`: Number of fireplaces.
- `GarageArea`: Garage's area in $\text{ft}^2$.
- `WoodDeckSF`: Wood deck area in $\text{ft}^2$.
- `OpenPorchSF`: Open porch area in $\text{ft}^2$.
- `EnclosedPorch`: Enclosed porch area in $\text{ft}^2$.
- `ScreenPorch`: Screen porch area in $\text{ft}^2$.
- `PoolArea`: Pool area in $\text{ft}^2$.

The following variables will be used to construct a variable `ageSold`
- `YearBuilt`: Original construction date.
- `YrSold`: Year sold.

Run this code to prepare a working dataset

In [None]:
# Run this cell
housing <- 
    read_csv("data/Housing.csv") %>%
    mutate(ageSold = YrSold - YearBuilt) %>%
    select(LotFrontage, LotArea, MasVnrArea, TotalBsmtSF,
           GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, 
           HalfBath, BedroomAbvGr, KitchenAbvGr, Fireplaces,
           GarageArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, 
           ScreenPorch, PoolArea, ageSold, SalePrice) %>%
    drop_na() %>%
    filter(LotArea < 20000)

str(housing)

We'll first split this dataset into a training and a test set using the tidymodels package.

Run this code to split the dataset `housing`.

In [None]:
#run this cell
set.seed(1234)

housing_split <- 
    housing %>%
    initial_split(prop = 0.6, strata = SalePrice)

training_housing <- training(housing_split)
testing_housing <- testing(housing_split)

In [None]:
head(training_housing, 3)
cat('\nTraining data has', nrow(training_housing), 'rows.\n')

In [None]:
# You don't want to even look at the test data. 
# Set it aside.

cat('\nTest data has', nrow(testing_housing), 'rows.')

### 1.1 Estimating an MLR and Predicting

In DSCI100, you have learned how to use `tidymodels` to build a model and use it to predict. You can write your own script to perform these steps or use a `linear_reg` model specification with the `lm` engine in `tidymodels`. Below, you will use the usual tidymodels workflow to predict each house's sale price in the test set.

Run this code to fit a MLR using `tidymodels` and call it `housing_full_tidy`.

In [None]:
lm_spec <- 
    linear_reg() %>% 
    set_engine("lm") %>% 
    set_mode("regression")

lm_recipe <- recipe(SalePrice ~ ., data = training_housing)

housing_full_tidy <- 
    workflow() %>% 
    add_recipe(lm_recipe) %>% 
    add_model(lm_spec) %>% 
    fit(data = training_housing)

Recall that we can extract the estimated coefficients from the workflow using the `extract_fit_parsnip()` function. We then use `tidy()` to arrange them into a data frame.

Run the code below to get the estimated coefficients:

In [None]:
coeffs <- 
    housing_full_tidy %>% 
    extract_fit_parsnip() %>% 
    tidy()

coeffs

**Question 1.0**
<br>{points: 1}

Write your own code and compare the results obtained with those using `tidymodels`

- Fit a MLR using data from `training_housing`. Store the output in an object named `housing_full_OLS`.

- Use `tidy()` to obtain a summary table of the estimated `housing_full_OLS`. Call it `housing_full_OLS_results`.

- Use the `modelr::add_predictions()` and `housing_full_OLS` to obtain the **out-of-sample predictions** for `testing_housing`. Store them as a new column in `testing_housing` called `pred_full_OLS`.

> **Note**: if you enter the input variables manually in `lm`, follow the order in the dataset. This is not important for results, just to pass the tests of autograding.

In [None]:
# Your code goes here. 

# your code here
fail() # No Answer - remove if you provide an answer

housing_full_OLS_results
head(testing_housing)

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 1}

Running the code below, you can check whether the estimated coefficients obtained using the `tidymodels` workflow are the same as those obtained using your code. 

Are they the same? 

*Assign your answer to an object called `answer1.1`. Your answer should be either `"true"` or `"false"`, surrounded by quotes.*

In [None]:
# Run this cell to compare the estimates
tibble(estimates_your_code = housing_full_OLS_results$estimate,
       estimate_tidymodels = coeffs$estimate, 
       difference = housing_full_OLS_results$estimate - coeffs$estimate)

In [None]:
# answer1.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

We can use the function `metrics()` to compute the root mean squared error, the $R^2$ and the mean absolute error for the predicted `SalePrice` of the test samples. 

Alternatively, we can write our own code to compute these measures. In this exercise, write your code to calculate the RMSE. 

Create a tibble, called `housing_RMSE_models` to store the computed RMSE. We will later compare it with the RMSE of a reduced model. The new tibble will have 2 columns:

- `Model`: Name of the estimated model from which we will obtain the prediction accuracy.
- `RMSE`: The $\text{RMSE}_{\text{test}}$ corresponding to the estimated model.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# 1. compute the RMSE (the template below is for inspiration, but you can calculate it
#    any way you want).

# rmse_full <-
#     ...
#     ...
#     ...
#     ...

# 2. store it in a tibble:

# housing_RMSE_models <- tibble(
#   Model = "OLS Full Regression",
#   RMSE = ...)

# your code here
fail() # No Answer - remove if you provide an answer

housing_RMSE_models

In [None]:
test_1.2()

Run the code below to compare the results with those obtained with `metrics()`.

In [None]:
housing_test_metrics <- 
    testing_housing %>%
    metrics(truth = SalePrice, estimate = pred_full_OLS)

housing_test_metrics

<font color='darkred'>**Tip:** In practice, refrain from creating your own functions if a reliable package with the desired function is available. This not only saves time but also minimizes the risk of bugs and errors, as these functions are widely tested.</font>

## 1.2 An automated procedure for model selection

When we don't know which variables should be included in the model, ideally, you want to select the best model out of *all possible models* of all possible sizes. 

For example, if the dataset has 2 explanatory variables $X_1$ and $X_2$, there are 4 models to compare: 
    
1. an intercept-only model; 
2. a model with only $X_1$; 
3. a model with only $X_2$; and 
4. a model with both $X_1$ and $X_2$. 

Unfortunately, the number of *all possible* models becomes too large rapidly, even for a small subset of variables. In fact, from a set of $p$ variables, we can fit a total of $2^p$ different models. For example, if $p = 20$ (i.e., 20 available explanatory variables), we would need to evaluate more than a million models. 
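
The count can be checked directly, and for a small case the candidate models can be listed. In the sketch below, the response name `y` and the input names `X1` and `X2` are hypothetical placeholders.

```r
# Number of candidate models with p = 20 input variables
p <- 20
2^p

# Enumerate all subsets of two hypothetical inputs, X1 and X2
inputs <- c("X1", "X2")
include <- expand.grid(rep(list(c(FALSE, TRUE)), length(inputs)))

# Build one formula (as text) per row of the inclusion table
formulas <- apply(include, 1, function(keep) {
  rhs <- if (any(keep)) paste(inputs[keep], collapse = " + ") else "1"
  paste("y ~", rhs)
})
unname(formulas)
```

Each additional input variable doubles the number of rows in `include`, which is why an exhaustive search quickly becomes impractical.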

There are methods to search more efficiently for a good model (although it may not find the "best" one out of all possible):

### 1.2.1 Forward selection algorithm
Image from [ISLR](https://www.statlearning.com)
![](https://github.com/UBC-STAT/stat-301/blob/master/supplementary-material/img/forward.png?raw=true)

1. **Step 1:** Start with the intercept-only model: $y_i = \beta_0 + \varepsilon_i$ (remember that in this case, $\hat{\beta}_0 = \bar{y}$ from the training samples, so $\hat{y} = \bar{y}$ for any observation from the training or the test set)

2. **Step 2:** Evaluate all models of size 1, choose the "best" model with 1 covariate (based on RSS, equal size models), and call it $\mathcal{M}_1$. 

3. **Step 3:** *Starting with the best size 1 model*, add 1 variable to create an (expanded) model of size 2. Repeat for all remaining variables and evaluate all expanded models of size 2. Choose the best model of size 2 (based on RSS) and call it $\mathcal{M}_2$. (*Note that there are more models of size 2 that we are not evaluating since 1 variable has already been chosen in the previous step*).


$\quad \quad \vdots$ 

continue until you reach a predetermined model size or the full model, $\mathcal{M}_p$. Note that the full model is unique. 


Now, we have to select the best out of the $p$ selected models: 
- $\mathcal{M}_1$ (the best model of size 1);
- $\mathcal{M}_2$ (the best-expanded model of size 2),
- $\ \ \vdots$
- $\mathcal{M}_p$ (the full model of size $p$). 

Unfortunately, we cannot use the RSS to compare models of different sizes. In fact, the metric will depend on the study goal. For generative models, the adjusted $R^2$ can be helpful. If the objective is predictions, then the test MSE, $C_p$, AIC, or BIC are useful.
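
The steps of the forward algorithm can be sketched by hand as follows. This is only an illustration under stated assumptions: it uses a few columns of the built-in `mtcars` data, and the object names (`dat`, `selected`, `forward_path`) are hypothetical. It is not how `leaps` implements the search internally.

```r
# Illustrative data only: a few columns of the built-in mtcars dataset
dat <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]
response <- "mpg"
candidates <- setdiff(names(dat), response)

selected <- character(0)   # variables chosen so far (Step 1: none)
path <- list()

for (size in seq_along(candidates)) {
  remaining <- setdiff(candidates, selected)

  # RSS of every model that adds one remaining variable to the current selection
  rss <- sapply(remaining, function(v) {
    fit <- lm(reformulate(c(selected, v), response = response), data = dat)
    sum(residuals(fit)^2)
  })

  # Keep the addition with the smallest RSS: the best model of this size
  selected <- c(selected, remaining[which.min(rss)])
  fit <- lm(reformulate(selected, response = response), data = dat)

  path[[size]] <- data.frame(size   = size,
                             added  = selected[size],
                             rss    = sum(residuals(fit)^2),
                             adj_r2 = summary(fit)$adj.r.squared)
}

forward_path <- do.call(rbind, path)
forward_path
```

Within each size the RSS is enough to pick a winner, but the final choice across sizes should use a size-adjusted measure such as the `adj_r2` column.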
 
You can learn more about these measures in [ISLR](https://www.statlearning.com)

Other selection procedures include:

- **Backward selection**: start with the full model and remove variables, one at a time


- **Hybrid selection**: after adding a variable, the method may also remove variables 


### 1.2.2 Selecting a smaller model in R

The OLS model estimates a generative model using all input variables. However, as we see from the results table, not all the terms in this regression are statistically significant, and this may not be the best predictive model either. You might want to select a smaller subset of variables, either to better explain the variation in `SalePrice` or to predict it. In the following questions, you will use the forward selection algorithm to select a smaller model. We will compute different metrics to examine different types of models.

#### **R functions**

Both the **forward** and **backward** selection algorithms are implemented in R by the function `regsubsets()` from library `leaps`. 

- The argument `x` of `regsubsets()` is analogous to `formula` in `lm()`. 

- The argument `nvmax` indicates the maximum number of variables to be used in the variable selection.

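For each model size, this function identifies the subset of input variables that gives the best model found by the search. Choosing among the resulting models of different sizes is a separate step, done with the metrics stored in its `summary()`.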


**Forward selection in the `housing`  dataset**

Let's select some of the input variables in the `housing` dataset using the **forward selection** algorithm, aiming for a strong generative model. 

Create one object using `regsubsets()` with `training_housing`: `housing_forward_sel`. This object has to indicate the selected models for each model size, from **1 to 19 input variables** (check argument `nvmax`).

*Run the code below to select the best nested models of each size*

In [None]:
housing_forward_sel <- regsubsets(x = SalePrice ~ ., nvmax = 19,
                                  data = training_housing,
                                  method = "forward")

housing_forward_summary <- summary(housing_forward_sel)
housing_forward_summary

You can see that: 

- variables are selected one at a time.

- once the variable is in the model, it stays, and another variable is selected

- the algorithm continues until it builds a model of size `nvmax`

**Final selection**

Out of the 19 possible models obtained with forward selection and stored in `housing_forward_sel`, we can select the best one in terms of its *goodness of fit*. 

Let's store and examine different evaluation metrics contained in `housing_forward_summary`. Construct a tibble called `housing_forward_summary_df`. This object should contain the following columns:

- `n_input_variables`: the number of input variables in each selected model (from 1 to 19).

- `RSQ`: the $R^2$ of each model

- `RSS`: the RSS of each model

- `ADJ_R2`: the adjusted $R^2$ of each model

- `Cp`: the $C_p$ of each model

- `BIC`: the Bayesian Information Criterion of each model

*Run the following code to evaluate the best models of each size*

In [None]:
housing_forward_summary_df <- tibble(
    n_input_variables = 1:19,
    RSQ = housing_forward_summary$rsq,
    RSS = housing_forward_summary$rss,
    ADJ_R2 = housing_forward_summary$adjr2,
    Cp = housing_forward_summary$cp,
    BIC = housing_forward_summary$bic,
)
housing_forward_summary_df

You can see how the $R^2$ increases with more variables in the model. However, its adjusted version will start decreasing after 13 variables are selected. 

**The forward algorithm would select a generative model with 13 variables using the adjusted $R^2$**.

**The forward algorithm would select a predictive model with 11 variables using BIC**.

We can **visualize** how these measures change as variables are added to the selected model with the function `plot()`. 

Run this code to plot the $C_p$ of the models selected by the forward selection algorithm. 

In [None]:
plot(summary(housing_forward_sel)$cp,
     main = "Cp for forward selection",
     xlab = "Number of Input Variables", 
     ylab = "Cp",
     type = "b",
     pch = 19,
     col = "red"
)

#### **Prediction performance of the selected predictive model**

In this problem, you will select the model that minimizes the $C_p$. Once we have a selected model, we can train it using `lm()` with the training dataset and use it to predict the sale prices of the residences in the test set. 

Run this code to obtain the name of the variables selected.

In [None]:
cp_min = which.min(housing_forward_summary$cp) 

selected_var <- names(coef(housing_forward_sel, cp_min))[-1]
selected_var

Run this code to subset only the predictors selected from the full dataset.

In [None]:
# all_of() makes explicit that selected_var is an external character vector
training_subset <- 
    training_housing %>% 
    select(all_of(selected_var), SalePrice)

testing_subset <- 
    testing_housing %>% 
    select(all_of(selected_var), SalePrice)

Run this code to train the selected model with the training set.

In [None]:
# Estimation

housing_red_OLS <- lm(SalePrice ~ ., data = training_subset)

**Question 1.3**
<br>{points: 1}

Use the new reduced predictive model, `housing_red_OLS`, to predict the response in the test set `testing_subset`. Use the resulting predictive values to compute the error and the $\text{RMSE}_{\text{test}}$ of the predictive values. Add this metric as another row in the tibble `housing_RMSE_models` and store the expanded `housing_RMSE_models` in an object called `housing_RMSE_models_expanded`. Identify the new row as `"OLS Reduced Regression"` (in column `Model`) and enter the corresponding $\text{RMSE}_{\text{test}}$ in the column `RMSE`.

> Note: since you are adding a row to an existing object, you may need to restart the kernel or rerun the cell with the original data frame to avoid extra concatenation.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# [You are all grown up now, do your own coding :) ]

# rmse_red <- ... (several rows of code are needed)


# housing_RMSE_models_expanded <- 
#     bind_rows(
#         housing_RMSE_models,
#         tibble(Model = "OLS Reduced Regression",
#                RMSE = ...)
#     )

# your code here
fail() # No Answer - remove if you provide an answer

housing_RMSE_models_expanded

In [None]:
test_1.3()

While we selected a reduced model with an expected better prediction performance, for this test set, the RMSE of the full model is lower than that of the reduced one. Note that this is only one estimate of the true test RMSE based on a random data split. A different split may give a different result.

## 2. Dataset: Cars Selling Price

In this section, we will work with a different dataset that contains several categorical variables with multiple levels, because not all selection algorithms handle such variables appropriately.

The [Vehicle dataset](https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho/data) from Kaggle contains 20 input variables on different characteristics of cars sold in India that can be used to predict the cars' selling price, `Price`. We will select only some of these inputs in this example (see code below). 

Although the description of the dataset does not contain too much information about the variables (e.g., units), the names of most variables are self-explanatory and can be used to illustrate and address the problem. For example,

- `FuelType`: a character vector with values 'Petrol', 'Diesel', 'CNG', 'CNG + CNG', 'LPG', 'Hybrid', 'Petrol + CNG', but only 'Petrol', 'Diesel', and 'CNG' have enough observations for an analysis

> Note that even if this is not a factor variable, R will create 6 dummy variables to include a character vector into the model.

In [None]:
# Run this cell
cars_price <- 
    read_csv("data/cars_price.csv") %>%
    dplyr::select(-Make,-Model,-Location,-Color,-Engine,-MaxPower,-MaxTorque)%>%
    subset(FuelType %in% c('Petrol','Diesel','CNG'))%>%
    drop_na() 

unique(cars_price$FuelType)

## 2.1 Stepwise Selection with Categorical Variables

As we learned in previous lectures, dummy variables are needed to include categorical covariates in statistical models. To represent a categorical variable with $q$ levels, we need $q-1$ dummy variables. 

The function `regsubsets()` evaluates the contribution of each of these dummy variables separately and may select only a subset of them for the final model. However, this selection depends on the reference level used, which is usually arbitrary (by default, R uses the first level in alphabetical order). In general, we are interested in the contribution of the *whole* categorical variable, not of individual levels relative to an arbitrary reference level. 
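
The sketch below shows how R codes a categorical variable with $q = 3$ levels using $q - 1 = 2$ dummy variables, and how the dummy variables change with the reference level. The small data frame `fuel` is made up for illustration and only borrows the level names of `FuelType`.

```r
# Made-up data frame with a 3-level categorical variable
fuel <- data.frame(FuelType = c("Petrol", "Diesel", "CNG", "Petrol", "Diesel"))

# Default coding: the first level in alphabetical order (CNG) is the reference
X_default <- model.matrix(~ FuelType, data = fuel)
colnames(X_default)

# Changing the reference level changes which comparisons the dummies represent
fuel$FuelType <- relevel(factor(fuel$FuelType), ref = "Petrol")
X_petrol <- model.matrix(~ FuelType, data = fuel)
colnames(X_petrol)
```

With the default coding, each dummy variable compares a fuel type with CNG; after `relevel()`, each one compares a fuel type with Petrol. A procedure that keeps or drops single dummy variables can therefore give different answers depending on this choice.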

Statistically, we need to evaluate the joint contribution of all dummy variables from each categorical variable at once, instead of evaluating each dummy variable at a time. The function `stepAIC()` in the package `MASS` iteratively adds (`direction = "forward"`) and/or removes (`direction = "backward"`) predictors that decrease an information criterion (AIC or BIC).

In the following problems you will illustrate the limitation of `regsubsets()` and use `stepAIC()` to select a model using stepwise algorithms.

We'll first split this dataset into a training and a test set using the tidymodels package.

Run this code to split the dataset `cars_price`.

In [None]:
#run this cell
set.seed(301)

cars_price_split <- 
    cars_price %>%
    initial_split(prop = 0.5, strata = Price)

training_cars <- training(cars_price_split)
testing_cars <- testing(cars_price_split)

In [None]:
# Run this cell to see a few rows of the data

head(training_cars, 3)
cat('\nTraining data has', nrow(training_cars), 'rows.\n')

In [None]:
# You don't want to even look at the test data. 
# Set it aside.

cat('\nTest data has', nrow(testing_cars), 'rows.')

**Question 2.0**
<br>{points: 2}

Write your own code to fit the following 2 linear models. You can use `tidymodels` if you want. We'll test the estimated coefficients obtained in both cases.

- Fit an intercept-only model using data from `training_cars` and `Price` as a response. Store the output in an object named `cars_null` and the estimated coefficients in an object called `cars_null_results`.
  
- Fit a MLR using data from `training_cars` and `Price` as a response. Store the output in an object named `cars_full` and the estimated coefficients in an object called `cars_full_results`.

In [None]:
# Your code goes here. 

#cars_null <- ...
#cars_null_results <- ...

#cars_full <- ...
#cars_full_results <- ...

# your code here
fail() # No Answer - remove if you provide an answer


In [None]:
test_2.0.0()
test_2.0.1()

### 2.1.1 Stepwise selection using `regsubsets()`

**Question 2.1**
<br>{points: 1}

Let's use a **forward selection** algorithm to select a predictive model of the selling price of a car.

Use `regsubsets()` with `training_cars` and call the resulting object `cars_forward_sel`. This object has to indicate  selected models for each model size, from **1 to 17 input variables** (specify it in the argument `nvmax`).

The code below can be used to select the best nested models of each size.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# cars_forward_sel <- ...(x = ... ~ ., nvmax = ...,
#                                   data = training_cars,
#                                   method = "...")

# cars_forward_summary <- ...(cars_forward_sel)

# your code here
fail() # No Answer - remove if you provide an answer

cars_forward_summary

In [None]:
test_2.1()

**Question 2.2**
<br>{points: 1}

Create a tibble to store the performance measures of each model selected by `regsubsets()`. Call the resulting table `cars_forward_performance`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# cars_forward_performance <- tibble(
#     n_input_variables = ...,
#     RSQ = cars_forward_summary$...,
#     RSS = ...,
#     ADJ_R2 = ...,
#     Cp = ...,
#     BIC = ...,
# )

# your code here
fail() # No Answer - remove if you provide an answer
cars_forward_performance

In [None]:
test_2.2()

Run the code below to extract the name of the variables in the model with the lowest BIC. 

In [None]:
# Run this cell

bic_min = which.min(cars_forward_summary$bic) 

selected_var <- names(coef(cars_forward_sel, bic_min))[-1]
selected_var

**Question 2.3**
<br>{points: 1}

When the BIC is used to select a predictive model, only 11 input variables are selected. Some of these variables correspond to dummy variables of categorical variables in the dataset. 

Examine the set of variables selected and explain the results of the selection process for the categorical variables.

In particular, check if some dummy variables were selected. For those selected, check if all the set of dummy variables from the corresponding categorical variable were selected as well. 

- Taking one of these as an example, describe which comparison was selected as important by the algorithm and which one was discarded. 

- Reflect on and explain the possible drawbacks of this procedure and its results in the context of an inference problem.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

### 2.1.2 Stepwise selection using `stepAIC()`

In this section we'll use a different function `stepAIC()` which evaluates all dummy variables from categorical variables together in the stepwise search.

The function `stepAIC()` in the package `MASS` iteratively adds (`direction = "forward"`) and/or removes (`direction = "backward"`) predictors that decrease an information criterion (AIC or BIC). You can also set `direction = "both"` (default) to perform a forward-backward search that, at each step, decides whether to add or remove a predictor. 

The penalty is controlled by the argument `k`: with $n$ equal to the sample size, setting $k = \log(n)$ gives the BIC, while $k = 2$ (the default) gives the AIC.
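
The role of `k` can be checked with `extractAIC()`, the function that `stepAIC()` relies on to score each candidate model. The sketch below assumes the built-in `mtcars` data and two arbitrary nested models. The values differ from those of `AIC()` and `BIC()` only by an additive constant that is the same for every model fitted to the same data, so the ranking of models is unchanged.

```r
# Illustrative data only: mtcars is a built-in R dataset
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + qsec, data = mtcars)
n <- nrow(mtcars)

# extractAIC() returns c(edf, criterion); the second element is the criterion
crit_aic <- c(extractAIC(fit_small, k = 2)[2],
              extractAIC(fit_large, k = 2)[2])
crit_bic <- c(extractAIC(fit_small, k = log(n))[2],
              extractAIC(fit_large, k = log(n))[2])

# Offsets with respect to AIC() and BIC(): constant across the two models
aic_offset <- c(AIC(fit_small), AIC(fit_large)) - crit_aic
bic_offset <- c(BIC(fit_small), BIC(fit_large)) - crit_bic

rbind(aic_offset, bic_offset)
```

Because only differences between models matter in the search, the criterion values printed by the stepwise search may not match those from `AIC()` or `BIC()`, even though both lead to the same selection.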

**Question 2.4**
<br>{points: 1}

Use the function `stepAIC()` and `training_cars` to perform a "backward" search, starting from the full model (`cars_full`) and ending at the intercept-only model (`cars_null`) or when the BIC cannot be further reduced. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# n <- ...
# modAIC_back <- stepAIC(..., direction = "...", k = log(n))

# your code here
fail() # No Answer - remove if you provide an answer

summary(modAIC_back)

In [None]:
test_2.4()

<font style='color:darkred'>**Note that the `Df` column in the output of the search algorithm indicates the number of coefficients that were simultaneously added or removed by the algorithm, so all the dummy variables of a categorical variable enter or leave the model together!**</font>

**Question 2.5**
<br>{points: 1}

Describe the model selected using the search in *Question 2.4*. Is the model selected the same as that in *Question 2.1*? How do they differ?

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 2.6**
<br>{points: 1}

In this question you will change the direction to start the search at the intercept-only model and add variables sequentially until the minimum BIC is obtained.

> Note: the `scope` argument needs to indicate the model where the algorithm starts (`lower`) and the model where it ends (`upper`); otherwise, the algorithm will not do any search.

In [None]:
# modAIC_forward <- MASS::stepAIC(cars_null, direction = "forward",
#               scope = list(lower = cars_null, upper = cars_full), k = ...)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.6()

**Question 2.7: True or False**
<br>{points: 1}

The backward and the forward algorithms selected the same model for this dataset.

Are they the same?

*Assign your answer to an object called answer2.7. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer2.7 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.7()

The function `summary()` gives the results of the fitted model on the same dataset that was used to select the variables. We will see later in the course that the inference results may not be valid if the same data is used to select and test.

Run the code below to compare the results of the individual hypothesis tests when an independent dataset is used to test the model selected by the  algorithm going backwards. 

<font style='color:darkred'>**You can see that variables that were statistically significant when tested using the training set tend not to be significant in the test set, even when both sets have the same size.**</font>

We'll generalize this result later using a simulation study!

In [None]:
pval_test <- tidy(lm(Price ~ Year + FuelType + Owner + Drivetrain + Length + 
    Width + Height + SeatingCapacity + FuelTankCapacity, data = testing_cars)) %>%
            select(term, p.value) %>%
            rename(p.value.test = p.value)

pval_train <- tidy(lm(Price ~ Year + FuelType + Owner + Drivetrain + Length + 
    Width + Height + SeatingCapacity + FuelTankCapacity, data = training_cars)) %>%
            select(term, p.value) %>%
            rename(p.value.train = p.value)

full_join(pval_train, pval_test, by = "term")