In [None]:
library(tidyverse)
library(olsrr) ## "pretty" ols plots
library(magrittr) ## more pipe operators like  %<>% 
library(car) ## for the vif() function
bb <- readRDS("bbdata.rds")
crime <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/crime_simple.txt", sep = "\t", header = TRUE)
# Assign more meaningful variable names
colnames(crime) <- c("crime.per.million", "young.males", "is.south", "average.ed",
                     "exp.per.cap.1960", "exp.per.cap.1959", "labor.part",
                     "male.per.fem", "population", "nonwhite",
                     "unemp.youth", "unemp.adult", "median.assets", "num.low.salary")

## Multicollinearity
You may recall that there was an additional issue/assumption about OLS we discussed, which was not relevant with simple linear regression where we only had one IV.  This issue is multicollinearity - when two of our IVs are *too* correlated with each other.  They "overlap" in their prediction of the DV.

#### What's the big deal?
- redundancy in predicted variance - both IVs are trying to explain the same variation in the DV.
- Low correlation in IVs = little to no effect on the model
- Medium correlation in IVs = affects regression coefficients
- High correlation = near perfect redundancy - the model "blows up"

I've loaded the datasets we've used previously for our simple and multiple regression notebooks.  Let's look at some examples of multicollinearity.

### 1. Improper dummy coding
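We saw this previously in the simple linear regression notebook: we can't include dummy codes for all levels in our model. We need to "leave one out" to serve as the reference level. The model below includes a dummy for every cylinder level, so R cannot estimate one of them and reports its coefficient as `NA`.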

In [None]:
mpg2 <- mpg  %>% select(cyl, hwy)  %>%  
    mutate(cyl4 = ifelse(cyl == 4, 1, 0))  %>% 
    mutate(cyl5 = ifelse(cyl == 5, 1, 0))  %>% 
    mutate(cyl6 = ifelse(cyl == 6, 1, 0))  %>% 
    mutate(cyl8 = ifelse(cyl == 8, 1, 0))  %>% 
    select (-cyl) ## remove cyl from the new df
summary(lm(hwy ~ ., data = mpg2))

What if I instead made a mistake in my dummy coding and had overlapping groups?

In [None]:
mpg2 <- mpg  %>% select(cyl, hwy)  %>%  
    mutate(cyl4 = ifelse(cyl == 4, 1, 0))  %>% 
    mutate(cyl5 = ifelse(cyl <= 5, 1, 0))  %>% ## less than or equal to 5 overlaps with cyl == 4
    mutate(cyl6 = ifelse(cyl == 6, 1, 0))  %>% 
    select (-cyl) ## remove cyl from the new df
summary(lm(hwy ~ ., data = mpg2))

The coefficient for cyl4 is no longer significant - because cyl4 and cyl5 overlap.  This is why data cleaning is important preparation for model fitting.
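One way to avoid hand-coding mistakes like this is to let R build the dummy variables from a factor. The sketch below is only an illustration: it uses the built-in `mtcars` data (where `cyl` is 4, 6, or 8) rather than `mpg`, so that it runs without any packages.

```r
## mtcars ships with base R; it stands in for mpg here (an assumption for illustration)
cars <- transform(mtcars, cyl = factor(cyl))

## model.matrix() shows the design matrix lm() would build:
## an intercept plus one dummy for each non-reference level
head(model.matrix(mpg ~ cyl, data = cars))

fit_factor <- lm(mpg ~ cyl, data = cars)
coef(fit_factor)   # cyl6 and cyl8 are differences from the 4-cylinder reference
```

Because R drops the first level automatically, the groups cannot overlap and the reference level is always left out.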

### 2. Including X that is computed from other Xs
Let's look at our Big Brother data again.  We have total_hoh, total_vetos, AND total_wins.  Total wins is total comp wins, which *includes* HOH wins and veto wins.  Let's look at a small scatterplot matrix and correlation matrix.

In [None]:
colnames(bb)

In [None]:
pairs(bb[c("days_in_game", "total_hoh", "total_vetos", "total_wins")])

In [None]:
cor(bb[c("days_in_game", "total_hoh", "total_vetos", "total_wins")])

Look at that 0.75 correlation between total_hoh and total_wins.  Let's see what happens when we put all three variables in the same model.

In [None]:
mcol <- lm(days_in_game ~ total_hoh + total_wins + total_vetos, data=bb) 
summary(mcol)

In [None]:
mod_w <- lm(days_in_game ~ total_hoh + total_vetos, data=bb) 
summary(mod_w)

In [None]:
anova(mod_w, mcol)

### 3. Using more than 1 X that estimates similar/same measure
For this one we go back to the crime data.

In [None]:
pairs(crime[c(1, 5:6)])
cor(crime[c(1, 5:6)])

In [None]:
mod_exp <- lm(crime.per.million ~ exp.per.cap.1959 + exp.per.cap.1960, data=crime) 
summary(mod_exp)

In [None]:
mod_exp_small <- lm(crime.per.million ~ exp.per.cap.1960, data=crime) 
summary(mod_exp_small)

In [None]:
anova(mod_exp_small, mod_exp)

Because exp.per.cap.1960 and exp.per.cap.1959 are the same measure taken in two consecutive years, they carry almost the same information and don't belong in the same model.  Either one will work to explain the relationship between expenditure and crimes reported.

### Consequences of Multicollinearity

OLS estimators are still unbiased, but
- They have large variances and covariances, which leads to
- Larger standard errors, which lead to
- Smaller coefficient t-scores and
- Wider confidence intervals, which go along with
- Larger p-values, which lead to
- Not rejecting the “zero null hypothesis”, which leads to
- A greater risk of Type II errors
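The "unbiased but noisy" point can be seen in a small simulation. This is a minimal sketch on made-up data: the helper `sim_coef`, the true coefficient of 2, and the two correlation levels are all assumptions chosen for illustration.

```r
set.seed(1)

## Simulate one dataset where x1 and x2 have (population) correlation r,
## fit y ~ x1 + x2, and return the estimated coefficient on x1 (true value: 2)
sim_coef <- function(r, n = 100) {
  x1 <- rnorm(n)
  x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)
  y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)
  unname(coef(lm(y ~ x1 + x2))["x1"])
}

b_low  <- replicate(500, sim_coef(0.10))
b_high <- replicate(500, sim_coef(0.95))

## Both averages sit near the true value of 2 (unbiased) ...
c(mean_low = mean(b_low), mean_high = mean(b_high))
## ... but the estimates scatter far more when the predictors are highly correlated
c(sd_low = sd(b_low), sd_high = sd(b_high))
```

The larger spread in `b_high` is what shows up as a larger standard error in any single fit.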

How can we check for multicollinearity?

#### VIF (Variance Inflation Factor)

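VIF estimates how much the variance of a coefficient is inflated by multicollinearity in the model. For predictor $j$, VIF is $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all of the other predictors. The higher the VIF, the less reliable that coefficient's estimate.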
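To see where the number comes from, VIF can be computed by hand: regress each predictor on the others and plug the resulting R-squared into 1 / (1 - R-squared). The helper `vif_by_hand` below is hypothetical, and the `mtcars` predictors are chosen only so the sketch runs without extra data.

```r
## Hypothetical helper: VIF for each predictor via an auxiliary regression
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    ## regress predictor p on the remaining predictors
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_by_hand(mtcars, c("disp", "hp", "wt"))
vifs
```

For a model with only numeric predictors, such as `lm(mpg ~ disp + hp + wt, data = mtcars)`, these values should agree with what `car::vif()` reports.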

In [None]:
## calculating VIF
vif(mcol)

In [None]:
vif(mod_exp)

### Interpreting VIF
- Scores start at 1 – all clear
- Scores >1 to 4: Still safe
- Score of 5: Threshold for VIF issues (prefer 4 & under)
- Score of 10 & up: Need to do something about it, massive multicollinearity

A VIF of 4 means the standard errors, which determine the width of the confidence intervals, are twice as large as they would be if that predictor were uncorrelated with the others (because $\sqrt{4} = 2$).
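The same arithmetic applies to the other rule-of-thumb cutoffs, since standard errors scale with the square root of the VIF. A quick sketch:

```r
## Standard errors are multiplied by sqrt(VIF)
vif_values <- c(1, 4, 5, 10)
se_table <- data.frame(vif = vif_values,
                       se_multiplier = round(sqrt(vif_values), 2))
se_table   # se_multiplier: 1.00, 2.00, 2.24, 3.16
```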

In [None]:
mod_exp_vars <- lm(crime.per.million ~ exp.per.cap.1959 + exp.per.cap.1960 + median.assets + average.ed, data=crime)
summary(mod_exp_vars)

In [None]:
vif(mod_exp_vars)

##### When is multicollinearity ok?
1. control variables
2. powers and products
3. dummy groups with small frequencies
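The "powers and products" case is easy to illustrate. In this sketch (with a made-up predictor `x` and response `y`), `x` and `x^2` are almost perfectly correlated, but centering `x` removes that correlation without changing the fitted curve. That suggests the collinearity here comes from how the model is written rather than from the data.

```r
x <- 1:20                      # hypothetical predictor, always positive
cor(x, x^2)                    # very high: x and x^2 rise together

x_c <- x - mean(x)             # center x before squaring
cor(x_c, x_c^2)                # essentially zero for this symmetric x

## The fitted values are the same either way; only the parameterization changes
set.seed(3)
y <- 5 + 0.5 * x + 0.2 * x^2 + rnorm(20)
fit_raw      <- lm(y ~ x + I(x^2))
fit_centered <- lm(y ~ x_c + I(x_c^2))
all.equal(fitted(fit_raw), fitted(fit_centered))
```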

##### So if we can fix it, why not just fix it?
1. Std. Errors & p-values are ok before fix
2. ‘Good’ reason to keep the offending variables (theory)
3. Interpretive ease

### *“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.”*  
– George Box

<img src="images/box1.jpg" width="400" height="400">

## Other error-based model evaluation measures
Beyond R-squared and the F-test of our model, which tell us about model fit, we can also look at a variety of error measures to evaluate our model.

- RSE: Residual Standard Error
- MSE: Mean Squared Error
- RMSE: Root Mean Squared Error
- MAE: Mean Absolute Error

RSE is based on the residual sum of squares.  It is the default error measure printed in the lm() summary.

RSS = $\sum (residuals)^2$ = $\sum (y_i - \hat{y}_i)^2$

RSE is the $\sqrt{RSS/df}$

By dividing RSS by the degrees of freedom (which is n - # of predictors - 1) and then taking the square root, we return the error to the *units* of Y, instead of squared units.

In [None]:
## calculate RSS
RSS <- sum(residuals(mod_exp_vars)^2) ## mod_exp_vars is the last crime model we just fit
RSS

In [None]:
## use RSS to calculate RSE
RSE <- sqrt(RSS / mod_exp_vars$df.residual)
RSE

What does this mean? It essentially works out to the standard deviation of our residuals, where RSS divided by the degrees of freedom is the variance.
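The degrees-of-freedom bookkeeping can be checked by hand. This sketch uses a small `mtcars` model (an assumption, chosen so it runs on its own) and compares the manual calculation with `sigma()`, which returns the residual standard error of a fitted `lm`.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)   # number of observations
p <- 2              # number of predictors (wt and hp), not counting the intercept

rss <- sum(residuals(fit)^2)
rse <- sqrt(rss / (n - p - 1))

## The hand calculation and R's own value should match
c(by_hand = rse, from_lm = sigma(fit))
```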

General interpretation - smaller error is better.  We can compare RSE between models to see which predicts our y variable the best (which has the smallest standard deviation of the residuals).

Moving to MSE - Mean Squared Error - it is the average of the residuals squared.

RMSE is Root Mean Squared Error. Just as RSE is the square root of RSS divided by the degrees of freedom, RMSE is the square root of MSE.
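Written side by side, with $y_i$ the observed value, $\hat{y}_i$ the fitted value, $n$ the number of observations, and $p$ the number of predictors (notation assumed here for illustration), the four measures are:

$$RSE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - p - 1}} \qquad MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$RMSE = \sqrt{MSE} \qquad MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

The only difference between RSE and RMSE is the divisor: RSE divides by the residual degrees of freedom, while RMSE (as computed with `mean()`) divides by $n$.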

In [None]:
## calculate MSE
MSE <- mean(residuals(mod_exp_vars)^2)
MSE

In [None]:
RMSE <- sqrt(MSE)
RMSE

RMSE is another measure of the error: roughly the typical magnitude of the residuals.  Because the residuals are squared before they are averaged (and then we ultimately take the square root), it puts extra emphasis on larger errors.  This can be helpful when large errors are particularly concerning.

And finally, MAE (Mean Absolute Error).  This time instead of squaring our residuals so that the positives don't cancel out the negatives, we use absolute value instead.  Therefore we just directly take the mean of the absolute value of the residuals - no need to square and then take the square root.

In [None]:
## calculate MAE
MAE <- mean(abs(residuals(mod_exp_vars)))
MAE

Notice that MAE is lower than RMSE.  Unlike RMSE, MAE gives each residual equal weight, so large residuals do not pull it up as much.

**In general, when interpreting any of these error measures, smaller is better.  We typically use these to compare two models, so look at the value of the error measure for each model and compare.**  RSE, RMSE, and MAE are in the units of Y (MSE is in squared units), so we can sort of "eyeball" them and use our expert opinion to decide whether the difference is "large enough" for us to care.
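A side-by-side comparison might look like the following. The helper `error_measures` is hypothetical, and the two `mtcars` models are stand-ins picked only so the sketch is self-contained.

```r
## Hypothetical helper: collect the four error measures for one fitted lm
error_measures <- function(fit) {
  res <- residuals(fit)
  c(RSE  = sqrt(sum(res^2) / df.residual(fit)),
    MSE  = mean(res^2),
    RMSE = sqrt(mean(res^2)),
    MAE  = mean(abs(res)))
}

small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp, data = mtcars)

## One column per model, so the measures can be read side by side
comparison <- cbind(small = error_measures(small), large = error_measures(large))
round(comparison, 3)
```

Keep in mind that these are computed on the same data used to fit the models, so a larger nested model will never look worse on MSE or RMSE here.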

These measures become important when we begin to talk about model building based on reducing error.

## Stepwise Selection
Every good model starts with theory and careful review of the relevant literature, right?  If, instead, we want to fit the *best* model according to the fit statistics, we can ask R to add variables one by one until we reach some stopping threshold where adding another variable will not improve the model in a significant way.

Is it a good way to build a model for the purposes of describing a phenomenon and understanding important relationships between variables? No. Use theory.

Is it a way to build a strong model for predictive purposes? Yes.  A lot of supervised machine learning follows the same logic as stepwise regression: choose what goes into the model in order to increase predictive power and reduce error.

Stepwise regression with OLS == MACHINE LEARNING!

<img src="images/ml.jpg" width="400" height="400">

Stepwise regression means adding or removing variables based on their contribution to the model until the "perfect" model is achieved.  This can be done by starting with an empty model and adding variables ("forwards"), starting with a full model and removing variables ("backwards") or by sequentially adding and/or removing variables ("stepwise").

The most basic way of doing this is adding/removing variables based on the p-value of the coefficient of those variables.  `olsrr` lets us do this with its variable selection functions.
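To make the idea concrete before handing it over to `olsrr`, here is a simplified sketch of forward selection by p-value in base R. It is not `olsrr`'s implementation: the function `forward_by_p`, its `p_enter` threshold, and the `mtcars` example are all assumptions for illustration.

```r
## Simplified forward selection by p-value (a sketch, not olsrr's algorithm)
forward_by_p <- function(data, response, candidates, p_enter = 0.05) {
  selected <- character(0)
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    rhs <- if (length(selected) > 0) selected else "1"   # start from intercept only
    current <- lm(reformulate(rhs, response = response), data = data)
    ## add1() tests each remaining variable added on its own to the current model
    tests <- add1(current, scope = reformulate(c(selected, remaining)), test = "F")
    pvals <- tests[["Pr(>F)"]][-1]            # drop the "<none>" row
    names(pvals) <- rownames(tests)[-1]
    if (min(pvals) >= p_enter) break          # nothing left clears the threshold
    selected <- c(selected, names(which.min(pvals)))
  }
  selected
}

chosen <- forward_by_p(mtcars, "mpg", c("wt", "hp", "disp", "qsec", "drat"))
chosen
```

Each round adds the single variable with the smallest p-value and stops once no remaining variable falls below `p_enter`.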

First, we'll look at all possible models and plot the various error measures.  We'll use the crime dataset.

In [None]:
names(crime)

In [None]:
model <- lm(crime.per.million ~ average.ed + is.south + median.assets + 
            num.low.salary + exp.per.cap.1960 + exp.per.cap.1959 , data = crime) 
## named all_models so we don't mask base R's all() function
all_models <- ols_step_all_possible(model)
plot(all_models)

Looking at all possible combinations of predictors is pretty computationally intensive.  Instead, let's use forward regression, adding the "best" predictor based on the lowest p-value at each round.

In [None]:
# stepwise forward regression
## remember that ~ . means include all possible IVs in the dataset
model <- lm(crime.per.million ~ ., data = crime)
ols_step_forward_p(model)

If we instead add everything to the model and work backward based on removing the variable with the highest p-value, do we get the same model?

In [None]:
# stepwise backward regression
## same model
ols_step_backward_p(model)

We end up with 8 variables in the model, instead of the 6 in our forward regression, and essentially the same RMSE.  Recalling parsimony, which model should we keep?

Finally, we can use stepwise regression in which we can go either direction at each step of the model fitting process.

In [None]:
ols_step_both_p(model)

In this case we ended up with the same model as forward selection, since no variables were removed after they were added.

### Overfitting
Overfitting our models is less of a concern if our goal is simply to investigate the relationship between some variables and see if one variable is a significant predictor of the other, the way a social scientist would use a model to see which characteristics of people are related to their rating of feminists on a scale from 0-100.  If the goal is instead to predict individuals' ratings of feminists, based on the characteristics of previous individuals' ratings, we then have a prediction problem.

When modeling for the purpose of prediction, we want the model with the best predictive accuracy, but we don't want to make the model so specific to the data used to fit it that its predictive accuracy is poor when we apply it to other data.  By forming models based purely on model performance measures, we run the risk of "overfitting" models.
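A small simulation shows what overfitting looks like in practice. Everything here is made up for illustration: `y` truly depends only on `x1`, the 20 `noise` columns are pure noise, and the 40/20 train/test split is arbitrary.

```r
set.seed(2024)
n <- 60

## Hypothetical data: y depends only on x1; noise1..noise20 are unrelated to y
dat <- data.frame(x1 = rnorm(n), matrix(rnorm(n * 20), n, 20))
names(dat)[-1] <- paste0("noise", 1:20)
dat$y <- 1 + 2 * dat$x1 + rnorm(n)

train <- dat[1:40, ]    # data used to fit the models
test  <- dat[41:60, ]   # held-out data the models never see

rmse <- function(fit, data) sqrt(mean((data$y - predict(fit, newdata = data))^2))

fit_small <- lm(y ~ x1, data = train)
fit_big   <- lm(y ~ ., data = train)   # x1 plus all 20 noise columns

rmse_table <- rbind(
  small = c(train = rmse(fit_small, train), test = rmse(fit_small, test)),
  big   = c(train = rmse(fit_big, train),   test = rmse(fit_big, test))
)
round(rmse_table, 3)
```

The big model always has the lower training RMSE, because extra predictors can only reduce in-sample error. On the held-out rows it will typically do worse than the small model, which is the signature of overfitting.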