# MATH 3375 Examples Notebook #7

# Variable and Model Selection

When choosing one of several possible models, there are many factors to consider. We will use the **cars2004** data set.


In [None]:
# Load the data set and preview the first rows
car_data <- read.csv("cars2004.csv", stringsAsFactors=TRUE)
head(car_data)

## 1. Significance of Predictors

When the predictors of one model are a SUBSET of the predictors of another (that is, the models are nested), we can test whether the bigger model is significantly better by performing a _**Partial F-Test**_ using ANOVA, as shown below.


In [None]:
# Three nested models: each one adds a predictor to the previous model
model_Invoice1 <- lm(Invoice ~ HP, data=car_data)
model_Invoice2 <- lm(Invoice ~ HP + City.MPG, data=car_data)
model_Invoice3 <- lm(Invoice ~ HP + City.MPG + Hwy.MPG, data=car_data)
# Coefficient tables and fit statistics for each model
summary(model_Invoice1)
summary(model_Invoice2)
summary(model_Invoice3)


In [None]:
anova(model_Invoice1,model_Invoice2)

In [None]:
anova(model_Invoice2,model_Invoice3)
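
The F statistic reported by `anova()` can also be rebuilt by hand from the residual sums of squares of the two nested models. The sketch below is only an illustration: it uses R's built-in `mtcars` data rather than cars2004, and the names `small_fit`, `big_fit`, and `f_manual` are made up for this example.

```r
# Illustration on the built-in mtcars data (an assumption, not the cars2004 file)
small_fit <- lm(mpg ~ hp, data = mtcars)        # reduced model
big_fit   <- lm(mpg ~ hp + wt, data = mtcars)   # adds one predictor

# Residual sums of squares for each model
rss_small <- sum(resid(small_fit)^2)
rss_big   <- sum(resid(big_fit)^2)

# Number of extra predictors in the bigger model
df_extra <- df.residual(small_fit) - df.residual(big_fit)

# Partial F statistic: drop in RSS per added predictor,
# divided by the bigger model's mean squared error
f_manual <- ((rss_small - rss_big) / df_extra) / (rss_big / df.residual(big_fit))
p_manual <- pf(f_manual, df_extra, df.residual(big_fit), lower.tail = FALSE)

# Compare with the value reported by anova()
f_table <- anova(small_fit, big_fit)
c(manual = f_manual, anova = f_table$F[2])
```

The two numbers should agree, which shows that the ANOVA comparison is measuring how much the residual sum of squares drops when the extra predictor is added.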

### CAUTION - An Incorrect Strategy: 

**_Question: Why can't we just start with a model that uses every available predictor, then discard the ones that are not significant?_**

_Answer:_ The significance of each predictor in a given model is **CONDITIONAL**. A predictor's p-value measures its contribution _given_ the other predictors in the model. Recall the interpretation of each coefficient: "Holding all other predictors constant...". That interpretation, and with it the significance, can change when other predictors are added to or removed from the model.
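
A small simulation can make the conditional nature of significance concrete. This is a sketch on made-up data, not the cars2004 data: `x1`, `x2`, and `y` are hypothetical variables, the response is generated from `x1` only, and `x2` is built to be strongly correlated with `x1`.

```r
set.seed(3375)  # arbitrary seed so the illustration is reproducible

n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # x2 is strongly correlated with x1
y  <- 2 * x1 + rnorm(n)         # y depends on x1 only

# p-value for x2 when it is the only predictor
p_alone <- summary(lm(y ~ x2))$coefficients["x2", "Pr(>|t|)"]

# p-value for x2 when x1 is also in the model
p_joint <- summary(lm(y ~ x1 + x2))$coefficients["x2", "Pr(>|t|)"]

c(x2_alone = p_alone, x2_with_x1 = p_joint)
```

On its own, `x2` looks highly significant because it acts as a stand-in for `x1`. Once `x1` is in the model, the p-value for `x2` is typically much larger (the exact value depends on the seed), even though the data have not changed.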


## 2. Multicollinearity

Next, we examine potential issues with multicollinearity, starting with a subset of possible predictors for horsepower (HP).

In [None]:
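# Scatterplot matrix of four candidate predictors, selected by column position
# (the indices are assumed to match MSRP, Invoice, City.MPG, and Hwy.MPG)
plot(car_data[,c(4,5,9,10)])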

#### Correlated Predictors

As we might expect, MSRP and Invoice have a very strong linear relationship, and the relationship between City.MPG and Hwy.MPG is also strong and linear. We will create the model using these 4 predictors and then examine a measure of collinearity, the Variance Inflation Factor (VIF).

In [None]:
hp_model1 <- lm(HP ~ City.MPG + Hwy.MPG + Invoice + MSRP, data=car_data)
summary(hp_model1)

#### R Library for Computing VIF

The cell below loads a library with the VIF function. The _**install**_ command is commented out and should only be executed if the library is not found. (Un-comment the line, run the cell, then comment the line back out.)

In [None]:
#ONLY run install if library fails to load (uncomment install line, run the cell, then put comment back)
#install.packages("regclass")
library("regclass")

In [None]:

VIF(hp_model1)
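
For intuition, the standard VIF of a numeric predictor $j$ is $1/(1-R_j^2)$, where $R_j^2$ comes from regressing that predictor on the other predictors. The sketch below computes this directly. It is an illustration under stated assumptions: `vif_by_hand` is a hypothetical helper (not part of `regclass`), and the built-in `mtcars` data stand in for cars2004.

```r
# Hypothetical helper: VIF for each numeric predictor, computed from its definition
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on all of the other predictors
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

preds <- c("disp", "hp", "wt")   # three numeric columns of mtcars
vifs  <- vif_by_hand(mtcars, preds)
vifs
```

A predictor that is well explained by the others has $R_j^2$ close to 1, so its VIF becomes very large.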

#### Evaluating the VIF numbers

The original model has $R^2 = 0.7987$. Therefore, the threshold is:

$\max \left( 10,\frac{1}{1-0.7987} \right) = \max \left( 10,4.968 \right) = 10$
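
The threshold arithmetic can be checked in R. In this sketch the value of `r2` is typed in from the summary output quoted above, and the names `r2` and `threshold` are chosen only for this example.

```r
r2 <- 0.7987                          # R-squared quoted above for hp_model1

model_based <- 1 / (1 - r2)           # the second argument of the max
threshold   <- max(10, model_based)   # rule used in these notes

round(model_based, 3)                 # 4.968
threshold                             # 10
```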

The City.MPG and Hwy.MPG variables have VIFs just under the threshold of 10, so they are considered acceptable (although City.MPG is perilously close to 10). The Invoice and MSRP variables both greatly exceed the threshold, indicating an unacceptable level of multicollinearity between each of these variables and one or more of the other predictors in the model.

This means we should not use all of these variables as predictors at the same time. Let's drop MSRP and try the model again.

In [None]:
hp_model2 <- lm(HP ~ City.MPG + Hwy.MPG + Invoice, data=car_data)
summary(hp_model2)

VIF(hp_model2)

### Another Consequence of Multicollinearity

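If the linear relationship among certain predictors is perfect (or so close to perfect that R cannot distinguish it numerically), the model cannot estimate a separate coefficient for each of them. In that case `lm()` reports `NA` for the redundant coefficient, as illustrated below.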

In [None]:
car_data$MPG_diff <- car_data$Hwy.MPG - car_data$City.MPG
head(car_data)

#### Notice that MPG_diff is simply a linear combination of Hwy.MPG and City.MPG

In [None]:
hp_model3 <- lm(HP ~ City.MPG + Hwy.MPG + MPG_diff + Invoice, data=car_data)
summary(hp_model3)
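
The same behavior can be reproduced without the cars2004 file. The sketch below uses the built-in `mtcars` data as a stand-in; `cars_copy`, `gap`, and `singular_fit` are names invented for this example.

```r
cars_copy <- mtcars
cars_copy$gap <- cars_copy$hp - cars_copy$disp   # exact linear combination

singular_fit <- lm(mpg ~ hp + disp + gap, data = cars_copy)

coef(singular_fit)    # the coefficient for gap is NA
alias(singular_fit)   # reports which term depends on the others
```

Because `gap` carries no information beyond `hp` and `disp`, R fits the model with the first two and leaves the redundant coefficient undefined.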

## 3. Model Complexity

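There is a trade-off between model complexity and goodness of fit: adding predictors never worsens the fit to the data at hand, but it can hurt predictions on new data. AIC and BIC scores penalize extra parameters, so they can help compare models and choose the one with the better "balance".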

### AIC Comparison

In [None]:
hp_model4 <- lm(HP ~ City.MPG + Invoice, data=car_data)
hp_model5 <- lm(HP ~ City.MPG, data=car_data)
hp_model6 <- lm(HP ~ Invoice, data=car_data)

AIC(hp_model2, k=2)
AIC(hp_model4, k=2)
AIC(hp_model5, k=2)
AIC(hp_model6, k=2)

#### Interpretation

Based on AIC (lower is better), Model 4 (`hp_model4`) offers the best balance of fit and model complexity among the four models compared.
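
To see where the score comes from, AIC can be rebuilt from the log-likelihood as $-2\log L + k \cdot p$, where $p$ is the number of estimated parameters and $k = 2$. This sketch uses the built-in `mtcars` data as a stand-in, and `aic_fit`, `n_par`, and `aic_manual` are names chosen for this example.

```r
aic_fit <- lm(mpg ~ hp + wt, data = mtcars)

ll    <- logLik(aic_fit)
n_par <- attr(ll, "df")   # intercept + 2 slopes + error variance = 4

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
aic_manual <- -2 * as.numeric(ll) + 2 * n_par

c(manual = aic_manual, builtin = AIC(aic_fit, k = 2))
```

The penalty term grows with every added parameter, so a bigger model must improve the log-likelihood enough to pay for its extra complexity.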

### BIC Comparison

Compare the same 4 models using BIC:

In [None]:
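# Sample size for the BIC penalty; if lm() dropped rows with missing values, nobs(model) gives the n actually used
n <- nrow(car_data)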

AIC(hp_model2, k=log(n))
AIC(hp_model4, k=log(n))
AIC(hp_model5, k=log(n))
AIC(hp_model6, k=log(n))
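
R also provides a `BIC()` function that applies the `log(n)` penalty directly. The sketch below checks that the two routes agree. It is shown on the built-in `mtcars` data as a stand-in, with `bic_fit` and `n_used` as example names.

```r
bic_fit <- lm(mpg ~ hp + wt, data = mtcars)

n_used <- nobs(bic_fit)   # number of observations actually used in the fit

c(via_AIC = AIC(bic_fit, k = log(n_used)),
  via_BIC = BIC(bic_fit))
```

Since `log(n)` exceeds 2 once `n` is 8 or more, BIC penalizes each extra parameter more heavily than AIC and tends to favor smaller models.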

### Stepwise Regression

In [None]:
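# Intercept-only model: the starting point for forward selection
hp_model0 <- lm(HP ~ 1, data=car_data)

# Largest model the search is allowed to reach
hp_model_full <- lm(HP ~ MSRP+Invoice+EngineSize+Cylinders+City.MPG+Hwy.MPG+Weight+WheelBase, data=car_data)
# Forward selection by AIC (k=2): add one predictor at a time while AIC improves
hp_model_best <- step(hp_model0, scope=list(lower=hp_model0, upper=hp_model_full), direction = "forward", k=2)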


In [None]:
summary(hp_model_best)
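
To see what a single forward step does, `add1()` lists the AIC that would result from adding each candidate predictor to the current model. This is a sketch on the built-in `mtcars` data, not cars2004, and `null_fit`, `full_fit`, and `best_first` are names chosen for this example.

```r
null_fit <- lm(mpg ~ 1, data = mtcars)                      # intercept only
full_fit <- lm(mpg ~ hp + wt + disp + qsec, data = mtcars)  # largest model allowed

# AIC after adding each candidate predictor, one at a time
candidates <- add1(null_fit, scope = formula(full_fit))
candidates

# Candidate with the lowest AIC: the one forward selection would add first
best_first <- rownames(candidates)[which.min(candidates$AIC)]
best_first

# Full forward search, with the step-by-step printout suppressed
stepped <- step(null_fit, scope = list(lower = ~1, upper = formula(full_fit)),
                direction = "forward", trace = 0)
formula(stepped)
```

Forward selection repeats this comparison after each addition and stops once no remaining candidate lowers the AIC.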