In [17]:
library(tidyverse)
library(broom)

The goal is to trade a little goodness-of-fit for a smaller set of predictors, which makes the model easier to explain and reduces overfitting.

# Why?

So far, we have seen criteria such as $R^2$ and RMSE for assessing quality of fit. However, both of these have a fatal flaw: as the model grows, that is, as predictors are added, they can at worst fail to improve. It is impossible to add a predictor to a model and make $R^2$ or (training) RMSE worse. That means that if we were to use either of these to choose between models, we would always simply choose the larger model. Eventually we would simply be fitting to noise.
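
The claim above can be checked with a small experiment. This is a minimal sketch, assuming the built-in `mtcars` data as a stand-in for any regression data set; the `noise` column and the `rmse()` helper are introduced here only for illustration.

```r
# Assumption: mtcars (built into R) stands in for any regression data set.
set.seed(42)
dat <- mtcars
dat$noise <- rnorm(nrow(dat))   # a predictor unrelated to mpg by construction

small_mod <- lm(mpg ~ wt + hp, data = dat)
big_mod   <- lm(mpg ~ wt + hp + noise, data = dat)

# Training RMSE: square root of the mean squared residual (hypothetical helper)
rmse <- function(model) sqrt(mean(resid(model)^2))

r2_small <- summary(small_mod)$r.squared
r2_big   <- summary(big_mod)$r.squared

# The larger model never looks worse on either criterion
c(r2_small = r2_small, r2_big = r2_big)
c(rmse_small = rmse(small_mod), rmse_big = rmse(big_mod))
```

Even though `noise` carries no information about `mpg`, the larger model should report an $R^2$ at least as high and a training RMSE at least as low.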

This suggests that we need a quality criterion that takes into account the size of the model, since our preference is for small models that still fit well. We are willing to sacrifice a small amount of “goodness-of-fit” to obtain a smaller model. (Here we use “goodness-of-fit” to simply mean how far the data is from the model: the smaller the errors, the better. Often in statistics, goodness-of-fit can have a more precise meaning.) We will look at three criteria that do this explicitly: AIC, BIC, and Adjusted $R^2$. We will also look at one, Cross-Validated RMSE, which implicitly considers the size of the model.

# 1. Quality Criteria

Read the book for a comprehensive treatment of each criterion.


## 1.1 AIC (Akaike Information Criterion)

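$$\text{AIC} = 2p - 2\log(\hat{L})$$

Here $p$ is the number of parameters in the model and $\hat{L}$ is the maximized likelihood. Lower values are better.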
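
For a linear model with normal errors, the maximized log-likelihood has a closed form, so AIC can be written in terms of the residual sum of squares. A short derivation, assuming the usual model with independent $N(0, \sigma^2)$ errors and the maximum likelihood estimate $\hat{\sigma}^2 = \text{RSS}/n$:

$$\log(\hat{L}) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\left(\frac{\text{RSS}}{n}\right) - \frac{n}{2}$$

$$\text{AIC} = 2p - 2\log(\hat{L}) = n + n\log(2\pi) + n\log\left(\frac{\text{RSS}}{n}\right) + 2p$$

Only the last two terms change between models fitted to the same data, so a smaller RSS lowers AIC while each extra parameter raises it by 2.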


## 1.2 BIC (Bayesian Information Criterion)

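$$\text{BIC} = \log(n)p - 2\log(\hat{L})$$

BIC has the same form as AIC, with the penalty per parameter changed from 2 to $\log(n)$. Both are special cases of $kp - 2\log(\hat{L})$: AIC uses $k = 2$ and BIC uses $k = \log(n)$, which is why `step()` switches to BIC when given `k = log(n)`.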
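
Both formulas can be reproduced from the log-likelihood that R reports. The sketch below assumes the built-in `mtcars` data and an arbitrary example model. One caveat: R's `logLik()` counts the error variance as a parameter, so its parameter count is the number of coefficients plus one.

```r
# Assumption: mtcars and this particular formula are only an illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n     <- nobs(fit)
ll    <- as.numeric(logLik(fit))    # maximized log-likelihood
n_par <- attr(logLik(fit), "df")    # coefficients plus the error variance

aic_by_hand <- 2 * n_par - 2 * ll
bic_by_hand <- log(n) * n_par - 2 * ll

# Compare with the built-in functions
all.equal(aic_by_hand, AIC(fit))
all.equal(bic_by_hand, BIC(fit))
```

Since $\log(n) > 2$ whenever $n \geq 8$, BIC penalizes each parameter more heavily than AIC on all but tiny data sets and therefore tends to select smaller models.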

## 1.3 Adjusted-R-squared
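
Adjusted $R^2$ rescales $R^2$ by the degrees of freedom: $R_a^2 = 1 - (1 - R^2)\frac{n - 1}{n - p}$, where $p$ counts the coefficients including the intercept. Unlike $R^2$, it can decrease when a weak predictor is added. A minimal check of the formula against `summary()`, assuming the built-in `mtcars` data as an example:

```r
# Assumption: mtcars and this formula are only an illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n  <- nobs(fit)
p  <- length(coef(fit))          # number of coefficients, intercept included
r2 <- summary(fit)$r.squared

adj_r2_by_hand <- 1 - (1 - r2) * (n - 1) / (n - p)

# Should agree with the value reported by summary()
all.equal(adj_r2_by_hand, summary(fit)$adj.r.squared)
```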

## 1.4 Cross validation RMSE
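
For linear models, leave-one-out cross-validated (LOOCV) RMSE does not require refitting the model $n$ times: each leave-one-out error equals the ordinary residual divided by $1 - h_i$, where $h_i$ is the leverage. The sketch below assumes the built-in `mtcars` data; the helper name `calc_loocv_rmse` is chosen here for illustration. It also runs the brute-force version to confirm that the shortcut agrees.

```r
# Hypothetical helper: LOOCV RMSE of an lm fit via the leverage shortcut
calc_loocv_rmse <- function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model)))^2))
}

# Assumption: mtcars and this formula are only an illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)
loocv_shortcut <- calc_loocv_rmse(fit)

# Brute force: refit without observation i, then predict observation i
loo_errors <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
})
loocv_brute <- sqrt(mean(loo_errors^2))

all.equal(loocv_shortcut, loocv_brute)
```

Because each observation is predicted by a model that never saw it, adding useless predictors tends to make this metric worse, which is how it accounts for model size implicitly.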

# 2. Selection Procedures

Read the book to see how these methods work

- backward search
- forward search
- stepwise
- exhaustive search

> Use the **`step`** function to perform the search.

In [5]:
library(faraway)
hipcenter_mod = lm(hipcenter ~ ., data = seatpos)
coef(hipcenter_mod)

## 2.1 Backward search

Backward selection procedures start with all possible predictors in the model, then consider how deleting a single predictor will affect a chosen metric.

---

Start from a complex model. At each step, try removing each predictor in turn and calculate the metric (e.g. AIC) of the resulting model. The predictor that is actually removed is the one whose removal gives the lowest metric. The search stops when no removal lowers the metric any further.
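
A single step of this procedure can be sketched by hand. The example assumes the built-in `mtcars` data and an arbitrary starting model; it uses `extractAIC()`, the same quantity that `step()` prints.

```r
# Assumption: mtcars and this starting model are only an illustration.
full_mod   <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
predictors <- attr(terms(full_mod), "term.labels")

# AIC (on the scale used by step()) after dropping each predictor in turn
aic_after_drop <- sapply(predictors, function(term) {
  reduced <- update(full_mod, as.formula(paste(". ~ . -", term)))
  extractAIC(reduced)[2]
})

aic_full  <- extractAIC(full_mod)[2]
best_drop <- names(which.min(aic_after_drop))

# Drop the predictor only if doing so lowers AIC; otherwise the search stops
if (min(aic_after_drop) < aic_full) {
  next_mod <- update(full_mod, as.formula(paste(". ~ . -", best_drop)))
} else {
  next_mod <- full_mod
}
formula(next_mod)
```

The built-in `drop1(full_mod)` reports the same AIC values in one table, and `step()` simply repeats this comparison until `<none>` is the best row.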

In [8]:
hipcenter_mod_backward <- step(hipcenter_mod, direction = "backward")
hipcenter_mod_backward

Start:  AIC=283.62
hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + 
    Leg

          Df Sum of Sq   RSS    AIC
- Ht       1      5.01 41267 281.63
- Weight   1      8.99 41271 281.63
- Seated   1     28.64 41290 281.65
- HtShoes  1    108.43 41370 281.72
- Arm      1    164.97 41427 281.78
- Thigh    1    262.76 41525 281.87
<none>                 41262 283.62
- Age      1   2632.12 43894 283.97
- Leg      1   2654.85 43917 283.99

Step:  AIC=281.63
hipcenter ~ Age + Weight + HtShoes + Seated + Arm + Thigh + Leg

          Df Sum of Sq   RSS    AIC
- Weight   1     11.10 41278 279.64
- Seated   1     30.52 41297 279.66
- Arm      1    160.50 41427 279.78
- Thigh    1    269.08 41536 279.88
- HtShoes  1    971.84 42239 280.51
<none>                 41267 281.63
- Leg      1   2664.65 43931 282.01
- Age      1   2808.52 44075 282.13

Step:  AIC=279.64
hipcenter ~ Age + HtShoes + Seated + Arm + Thigh + Leg

          Df Sum of Sq   RSS    AIC
- Seated   1     35.10 4131


Call:
lm(formula = hipcenter ~ Age + HtShoes + Leg, data = seatpos)

Coefficients:
(Intercept)          Age      HtShoes          Leg  
   456.2137       0.5998      -2.3023      -6.8297  


The variable `hipcenter_mod_backward` stores the model chosen by this procedure:

In [9]:
coef(hipcenter_mod_backward)

In [19]:
hipcenter_mod_backward %>% glance()

r.squared,adj.r.squared,sigma,statistic,p.value,df,logLik,AIC,BIC,deviance,df.residual,nobs
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>
0.6812662,0.6531427,35.12909,24.22403,1.437319e-08,3,-187.0495,384.099,392.2869,41957.79,34,38


---
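We can also use **BIC** instead of **AIC** to choose the model by setting `k = log(n)`. Note that the `step()` output still labels the column `AIC`, even though the values shown are BIC.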

In [15]:
n <- nrow(seatpos)
hip_center_mod_backward_bic <- step(hipcenter_mod, k = log(n), direction = "backward")

Start:  AIC=298.36
hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + 
    Leg

          Df Sum of Sq   RSS    AIC
- Ht       1      5.01 41267 294.73
- Weight   1      8.99 41271 294.73
- Seated   1     28.64 41290 294.75
- HtShoes  1    108.43 41370 294.82
- Arm      1    164.97 41427 294.88
- Thigh    1    262.76 41525 294.97
- Age      1   2632.12 43894 297.07
- Leg      1   2654.85 43917 297.09
<none>                 41262 298.36

Step:  AIC=294.73
hipcenter ~ Age + Weight + HtShoes + Seated + Arm + Thigh + Leg

          Df Sum of Sq   RSS    AIC
- Weight   1     11.10 41278 291.10
- Seated   1     30.52 41297 291.12
- Arm      1    160.50 41427 291.24
- Thigh    1    269.08 41536 291.34
- HtShoes  1    971.84 42239 291.98
- Leg      1   2664.65 43931 293.47
- Age      1   2808.52 44075 293.59
<none>                 41267 294.73

Step:  AIC=291.1
hipcenter ~ Age + HtShoes + Seated + Arm + Thigh + Leg

          Df Sum of Sq   RSS    AIC
- Seated   1     35.10 41313

In [16]:
coef(hip_center_mod_backward_bic)

In [18]:
hip_center_mod_backward_bic %>% glance()

r.squared,adj.r.squared,sigma,statistic,p.value,df,logLik,AIC,BIC,deviance,df.residual,nobs
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>
0.6345658,0.6244149,36.5549,62.51296,2.206673e-09,1,-189.6474,385.2947,390.2075,48105.38,36,38


> use **`extractAIC`** to get AIC of a model

In [58]:
# the first value is the equivalent degrees of freedom (number of estimated coefficients), the second value is the AIC
extractAIC(hipcenter_mod_backward) %>% print()

[1]   4.0000 274.2597
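
The value returned by `extractAIC()` (and printed by `step()`) differs from the `AIC` column of `glance()`, which matches `AIC()`. For `lm` fits, `extractAIC()` computes $n\log(\text{RSS}/n) + 2p$ and drops the constant terms of the log-likelihood, while `AIC()` keeps them and also counts the error variance as a parameter. The sketch below, assuming the built-in `mtcars` data as an example, shows that the two differ by a constant that depends only on $n$.

```r
# Assumption: mtcars and this formula are only an illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)

step_aic <- extractAIC(fit)[2]   # n * log(RSS / n) + 2 * (number of coefficients)
full_aic <- AIC(fit)             # -2 * logLik + 2 * (number of coefficients + 1)

# Constant gap between the two definitions
offset <- n * (1 + log(2 * pi)) + 2

all.equal(full_aic, step_aic + offset)
```

With $n = 38$ the offset is about 109.84, which accounts for the gap between 274.26 here and 384.10 in the `glance()` output above. Because the offset is the same for every model fitted to the same data, both versions rank models identically.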


In [38]:
# compare adjusted R-squared across the different models
# plain R-squared is not used because it does not take model complexity into account
summary(hipcenter_mod)$adj.r.squared
summary(hipcenter_mod_backward)$adj.r.squared
summary(hip_center_mod_backward_bic)$adj.r.squared

## 2.2 Forward Search

Forward selection is the exact opposite of backward selection. Start from a simple model. At each step, try adding each candidate predictor in turn and calculate the metric (e.g. AIC) of the resulting model. The predictor that is actually added is the one whose addition gives the lowest metric. The search stops when no addition lowers the metric any further.
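
A single forward step can be inspected with the built-in `add1()` function, which reports the AIC that would result from adding each candidate. This sketch assumes the built-in `mtcars` data and an arbitrary set of candidate predictors.

```r
# Assumption: mtcars and these candidate predictors are only an illustration.
null_mod <- lm(mpg ~ 1, data = mtcars)

# One row per candidate, plus a <none> row for the current model
candidates <- add1(null_mod, scope = ~ wt + hp + qsec + drat)
candidates

# The row with the lowest AIC is the move a forward search would make;
# if that row is <none>, the search stops
best_add <- rownames(candidates)[which.min(candidates$AIC)]
best_add
```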

In [26]:
hipcenter_mod_start = lm(hipcenter ~ 1, data = seatpos)
hipcenter_mod_forw_aic = step(
  hipcenter_mod_start, 
  scope = hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg, 
  direction = "forward")

Start:  AIC=311.71
hipcenter ~ 1

          Df Sum of Sq    RSS    AIC
+ Ht       1     84023  47616 275.07
+ HtShoes  1     83534  48105 275.45
+ Leg      1     81568  50071 276.98
+ Seated   1     70392  61247 284.63
+ Weight   1     53975  77664 293.66
+ Thigh    1     46010  85629 297.37
+ Arm      1     45065  86574 297.78
<none>                 131639 311.71
+ Age      1      5541 126098 312.07

Step:  AIC=275.07
hipcenter ~ Ht

          Df Sum of Sq   RSS    AIC
+ Leg      1   2781.10 44835 274.78
<none>                 47616 275.07
+ Age      1   2353.51 45262 275.14
+ Weight   1    195.86 47420 276.91
+ Seated   1    101.56 47514 276.99
+ Arm      1     75.78 47540 277.01
+ HtShoes  1     25.76 47590 277.05
+ Thigh    1      4.63 47611 277.06

Step:  AIC=274.78
hipcenter ~ Ht + Leg

          Df Sum of Sq   RSS    AIC
+ Age      1   2896.60 41938 274.24
<none>                 44835 274.78
+ Arm      1    522.72 44312 276.33
+ Weight   1    445.10 44390 276.40
+ HtShoes  1    

In [30]:
coef(hipcenter_mod_forw_aic)

---
Again, by default R uses AIC as its quality metric when using the `step()` function. Also note that the rows now begin with a `+`, which indicates the addition of a predictor to the current model at that step.

In [31]:
# using BIC; `direction` is not set, so with `scope` given step() defaults to "both"
hipcenter_mod_forward_bic <- step(
  hipcenter_mod_start,
  scope = hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg,
  k = log(n))

Start:  AIC=313.35
hipcenter ~ 1

          Df Sum of Sq    RSS    AIC
+ Ht       1     84023  47616 278.34
+ HtShoes  1     83534  48105 278.73
+ Leg      1     81568  50071 280.25
+ Seated   1     70392  61247 287.91
+ Weight   1     53975  77664 296.93
+ Thigh    1     46010  85629 300.64
+ Arm      1     45065  86574 301.06
<none>                 131639 313.35
+ Age      1      5541 126098 315.35

Step:  AIC=278.34
hipcenter ~ Ht

          Df Sum of Sq    RSS    AIC
<none>                  47616 278.34
+ Leg      1      2781  44835 279.69
+ Age      1      2354  45262 280.05
+ Weight   1       196  47420 281.82
+ Seated   1       102  47514 281.90
+ Arm      1        76  47540 281.92
+ HtShoes  1        26  47590 281.96
+ Thigh    1         5  47611 281.98
- Ht       1     84023 131639 313.35


In [29]:
coef(hipcenter_mod_forward_bic)

In [36]:
# compare the adjusted R-squared of the 3 models
# we do not use R-squared because it does not take model complexity into account
summary(hipcenter_mod)$adj.r.squared
summary(hipcenter_mod_forw_aic)$adj.r.squared
summary(hipcenter_mod_forward_bic)$adj.r.squared

## 2.3 Stepwise search

Stepwise search checks going both backwards and forwards at every step. It considers the addition of any variable not currently in the model, as well as the removal of any variable currently in the model.

Here we perform stepwise search using AIC as our metric. We start with the model `hipcenter ~ 1` and search up to `hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg`. Notice that at many of the steps, some rows begin with `-`, while others begin with `+`.

In [39]:
hipcenter_mod_stepwise <- step(hipcenter_mod_start, 
                              scope = hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg,
                              direction = "both")


Start:  AIC=311.71
hipcenter ~ 1

          Df Sum of Sq    RSS    AIC
+ Ht       1     84023  47616 275.07
+ HtShoes  1     83534  48105 275.45
+ Leg      1     81568  50071 276.98
+ Seated   1     70392  61247 284.63
+ Weight   1     53975  77664 293.66
+ Thigh    1     46010  85629 297.37
+ Arm      1     45065  86574 297.78
<none>                 131639 311.71
+ Age      1      5541 126098 312.07

Step:  AIC=275.07
hipcenter ~ Ht

          Df Sum of Sq    RSS    AIC
+ Leg      1      2781  44835 274.78
<none>                  47616 275.07
+ Age      1      2354  45262 275.14
+ Weight   1       196  47420 276.91
+ Seated   1       102  47514 276.99
+ Arm      1        76  47540 277.01
+ HtShoes  1        26  47590 277.05
+ Thigh    1         5  47611 277.06
- Ht       1     84023 131639 311.71

Step:  AIC=274.78
hipcenter ~ Ht + Leg

          Df Sum of Sq   RSS    AIC
+ Age      1    2896.6 41938 274.24
<none>                 44835 274.78
- Leg      1    2781.1 47616 275.07
+ Arm 

## 2.4 Exhaustive search

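Exhaustive search considers every possible subset of the predictors and keeps the best model of each size. With 8 candidate predictors, that is $2^8 = 256$ possible models.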
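
For a handful of predictors, the idea can be sketched by brute force. The example below assumes the built-in `mtcars` data with four candidate predictors and uses adjusted $R^2$ as the criterion; it is an illustration of the concept, not how `leaps` works internally.

```r
# Assumption: mtcars and these four candidates are only an illustration.
predictors <- c("wt", "hp", "qsec", "drat")

# Every non-empty subset of the candidates: 2^4 - 1 = 15 of them
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)
length(subsets)

# Fit each subset and record its adjusted R-squared
adj_r2 <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  summary(fit)$adj.r.squared
})

# The subset with the highest adjusted R-squared
best_subset <- subsets[[which.max(adj_r2)]]
best_subset
```

The number of fits doubles with every additional predictor, which is why brute force only works for small problems.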

> use the `regsubsets()` function in the R package `leaps`.



In [59]:
library(leaps)

A few points about the next line of code. First, note that we immediately call `summary()` and store those results. That is simply the intended use of `regsubsets()`. Second, inside `regsubsets()` we specify the model `hipcenter ~ .`. This will be the largest model considered, that is, the model using all first-order predictors, and R will check all possible subsets.

In [45]:
all_hip_center_mod <- summary(regsubsets(hipcenter ~ ., data = seatpos))

all_hip_center_mod

Subset selection object
Call: regsubsets.formula(hipcenter ~ ., data = seatpos)
8 Variables  (and intercept)
        Forced in Forced out
Age         FALSE      FALSE
Weight      FALSE      FALSE
HtShoes     FALSE      FALSE
Ht          FALSE      FALSE
Seated      FALSE      FALSE
Arm         FALSE      FALSE
Thigh       FALSE      FALSE
Leg         FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
         Age Weight HtShoes Ht  Seated Arm Thigh Leg
1  ( 1 ) " " " "    " "     "*" " "    " " " "   " "
2  ( 1 ) " " " "    " "     "*" " "    " " " "   "*"
3  ( 1 ) "*" " "    " "     "*" " "    " " " "   "*"
4  ( 1 ) "*" " "    "*"     " " " "    " " "*"   "*"
5  ( 1 ) "*" " "    "*"     " " " "    "*" "*"   "*"
6  ( 1 ) "*" " "    "*"     " " "*"    "*" "*"   "*"
7  ( 1 ) "*" "*"    "*"     " " "*"    "*" "*"   "*"
8  ( 1 ) "*" "*"    "*"     "*" "*"    "*" "*"   "*"

In [50]:
names(all_hip_center_mod)

In [51]:
str(all_hip_center_mod)

List of 8
 $ which : logi [1:8, 1:9] TRUE TRUE TRUE TRUE TRUE TRUE ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:8] "1" "2" "3" "4" ...
  .. ..$ : chr [1:9] "(Intercept)" "Age" "Weight" "HtShoes" ...
 $ rsq   : num [1:8] 0.638 0.659 0.681 0.685 0.686 ...
 $ rss   : num [1:8] 47616 44835 41938 41485 41313 ...
 $ adjr2 : num [1:8] 0.628 0.64 0.653 0.647 0.637 ...
 $ cp    : num [1:8] -0.534 -0.489 -0.525 1.157 3.036 ...
 $ bic   : num [1:8] -31.4 -30 -28.9 -25.7 -22.2 ...
 $ outmat: chr [1:8, 1:8] " " " " "*" "*" ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:8] "1  ( 1 )" "2  ( 1 )" "3  ( 1 )" "4  ( 1 )" ...
  .. ..$ : chr [1:8] "Age" "Weight" "HtShoes" "Ht" ...
 $ obj   :List of 28
  ..$ np       : int 9
  ..$ nrbar    : int 36
  ..$ d        : num [1:9] 38 47371 359 208 273 ...
  ..$ rbar     : num [1:36] 155.6 89 32.2 38.7 36.3 ...
  ..$ thetab   : num [1:9] -164.88 -1.07 -7.12 -3.17 -2.69 ...
  ..$ first    : int 2
  ..$ last     : int 9
  ..$ vorder   : int [

In [52]:
# best model of each size
# row i marks the predictors included in the best model with i predictors
all_hip_center_mod$which

| | (Intercept) | Age | Weight | HtShoes | Ht | Seated | Arm | Thigh | Leg |
|---|---|---|---|---|---|---|---|---|---|
| 1 | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE |
| 2 | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE |
| 3 | TRUE | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE |
| 4 | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE |
| 5 | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | TRUE | TRUE | TRUE |
| 6 | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE |
| 7 | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE |
| 8 | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |


In [48]:
# adjusted R-squared of each model
all_hip_center_mod$adjr2

In [49]:
# residual sum of squares of each model
all_hip_center_mod$rss

In [55]:
# extract model with highest adjusted R squared
all_hip_center_mod$which[which.max(all_hip_center_mod$adjr2), ] %>% print()

(Intercept)         Age      Weight     HtShoes          Ht      Seated 
       TRUE        TRUE       FALSE       FALSE        TRUE       FALSE 
        Arm       Thigh         Leg 
      FALSE       FALSE        TRUE 
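
The logical vector above only says which predictors were selected. To work with the model itself, it has to be refit. A minimal sketch, where `selected` is typed in by hand to mimic the row printed above, so that the snippet does not depend on `leaps` or `faraway`:

```r
# Assumption: `selected` is entered by hand to mimic one row of
# all_hip_center_mod$which; in the notebook it would be that row itself.
selected <- c("(Intercept)" = TRUE, Age = TRUE, Weight = FALSE, HtShoes = FALSE,
              Ht = TRUE, Seated = FALSE, Arm = FALSE, Thigh = FALSE, Leg = TRUE)

# Names of the selected predictors, without the intercept
chosen <- setdiff(names(selected)[selected], "(Intercept)")

# Build the model formula from those names
best_formula <- reformulate(chosen, response = "hipcenter")
best_formula

# With faraway loaded, the model could then be refit with:
# lm(best_formula, data = seatpos)
```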


# 3. Higher order terms

In [60]:
autompg = read.table(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
  quote = "\"",
  comment.char = "",
  stringsAsFactors = FALSE)
colnames(autompg) = c("mpg", "cyl", "disp", "hp", "wt", "acc", 
                      "year", "origin", "name")
autompg = subset(autompg, autompg$hp != "?")
autompg = subset(autompg, autompg$name != "plymouth reliant")
rownames(autompg) = paste(autompg$cyl, "cylinder", autompg$year, autompg$name)
autompg$hp = as.numeric(autompg$hp)
autompg$domestic = as.numeric(autompg$origin == 1)
autompg = autompg[autompg$cyl != 5,]
autompg = autompg[autompg$cyl != 3,]
autompg$cyl = as.factor(autompg$cyl)
autompg$domestic = as.factor(autompg$domestic)
autompg = subset(autompg, select = c("mpg", "cyl", "disp", "hp", 
                                     "wt", "acc", "year", "domestic"))

In [62]:
# Read the book for the explanation of the formula
autompg_big_mod = lm(
  log(mpg) ~ . ^ 2 + I(disp ^ 2) + I(hp ^ 2) + I(wt ^ 2) + I(acc ^ 2), 
  data = autompg)
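
In the formula above, `. ^ 2` expands to every predictor plus every two-way interaction between predictors, while `I(x ^ 2)` adds a squared term (the `I()` wrapper keeps `^` from being read as formula syntax). A small sketch of the expansion, assuming a toy data frame with made-up columns `a`, `b`, and `c`:

```r
# Assumption: `toy` and its column names are invented for illustration.
set.seed(1)
toy <- data.frame(y = rnorm(6), a = rnorm(6), b = rnorm(6), c = rnorm(6))

# Expand the formula against the data and list the resulting terms
term_labels <- attr(terms(y ~ .^2 + I(a^2), data = toy), "term.labels")
term_labels
length(term_labels)   # 3 main effects + 3 two-way interactions + 1 squared term
```

With 7 predictors, one of them a factor with several levels, the same expansion produces a model with a large number of coefficients, which is why a selection procedure is useful here.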


In [63]:
# large number of coefficients
length(coef(autompg_big_mod))

In [65]:
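# backward search using BIC and AIC as the metric
# suppress the step-by-step messages by setting trace = 0
# for BIC, n must be the number of observations in autompg (not seatpos)
n_autompg <- nrow(autompg)
autompg_mod_back_bic <- step(autompg_big_mod, direction = "backward", k = log(n_autompg), trace = 0)
autompg_mod_back_aic <- step(autompg_big_mod, direction = "backward", trace = 0)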

# 4. Explanation versus Prediction

Read the book for details.
- When is the goal **explanation**? We prefer a small model that is easy to interpret. We need to care about model assumptions in order to do inference, and about the difference between correlation and causation.
- When is the goal **prediction**? A larger model is acceptable as long as it predicts well. If we only care about prediction, we don’t need to worry about correlation vs causation, and we don’t need to worry about model assumptions.