### What is multicollinearity?
* Multicollinearity exists when two or more independent variables are highly correlated.
* When two X variables are highly correlated, they both convey essentially the same
information.

### Why is multicollinearity a problem?
* If the goal is simply to predict Y from a set of X variables, then multicollinearity is not
a serious problem. If the goal is to understand how the various X variables impact Y,
then multicollinearity is a big problem.
* Automated regression algorithms tend to overfit in the presence of multicollinearity.
* It leads to unreasonable coefficient estimates, large standard errors, large p-values and,
consequently, poor interpretation and inference.
* It can lead to misspecification of the model.
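
The contrast between prediction and interpretation can be seen in a small simulation. The sketch below is not part of the original analysis: the variable names, sample size and coefficients are assumptions chosen for illustration. It fits the same true model twice, once with an unrelated second predictor and once with a near-copy of the first.

```r
# Simulated illustration (all names and numbers are made up for this sketch).
set.seed(42)
n  <- 200
x1 <- rnorm(n)

x2_indep <- rnorm(n)                  # unrelated to x1
x2_coll  <- x1 + rnorm(n, sd = 0.05)  # almost a copy of x1

# Same true coefficients (2 and 3) in both data sets
y_indep <- 1 + 2 * x1 + 3 * x2_indep + rnorm(n)
y_coll  <- 1 + 2 * x1 + 3 * x2_coll  + rnorm(n)

fit_indep <- lm(y_indep ~ x1 + x2_indep)
fit_coll  <- lm(y_coll  ~ x1 + x2_coll)

# Standard errors of the two slope estimates in each fit
se_indep <- summary(fit_indep)$coefficients[-1, "Std. Error"]
se_coll  <- summary(fit_coll)$coefficients[-1, "Std. Error"]
print(round(se_indep, 3))
print(round(se_coll, 3))

# Residual standard error, a rough measure of in-sample fit
print(c(indep = sigma(fit_indep), coll = sigma(fit_coll)))
```

Under these assumptions the slope standard errors in the collinear fit should come out many times larger, while the residual standard error stays close to the noise level in both fits.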

### Causes of multicollinearity
* Improper use of dummy variables (e.g. failure to exclude one category)
* Including a variable that is computed from other variables in the equation
(e.g. family income = husband's income + wife's income, and the regression
includes all 3 income measures)
* In effect, including the same or almost the same variable twice (height in
feet and height in inches; or, more commonly, two different
operationalizations of the same identical concept)
* The above all imply some sort of error on the analyst's part. But it may
just be that variables really and truly are highly correlated.
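
The dummy-variable case is the easiest to reproduce. The following is a minimal sketch on made-up data (the `region` factor and its levels are hypothetical): when every category gets its own dummy and the model also has an intercept, the dummies sum to the intercept column, and `lm()` reports `NA` for the coefficient it cannot estimate.

```r
# Hypothetical data: a three-level factor and an unrelated response.
set.seed(1)
region <- factor(rep(c("north", "south", "west"), each = 20))
y <- rnorm(60)

# One 0/1 dummy per category
d_north <- as.numeric(region == "north")
d_south <- as.numeric(region == "south")
d_west  <- as.numeric(region == "west")

# All three dummies plus the intercept: the dummies add up to the
# intercept column, so the design matrix is rank deficient.
fit_trap <- lm(y ~ d_north + d_south + d_west)
print(coef(fit_trap))   # one coefficient is reported as NA

# Passing the factor itself lets R drop a reference category automatically.
fit_ok <- lm(y ~ region)
print(coef(fit_ok))
```

Perfect collinearity like this is flagged by the fitting routine itself. The near-perfect cases in the other bullets are harder to spot because the model still fits without complaint.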

### Detection of multicollinearity
* High pairwise correlations between the X variables. (But three or more X
variables can be multicollinear together without having high pairwise
correlations.)
* Regression coefficients change drastically when adding or deleting an X
variable.
* A regression coefficient has the opposite sign from what is expected.
* Formally, variance inflation factors (VIF) measure how much the variance of
each estimated coefficient is inflated relative to the case of no correlation
among the X variables. If no two X variables are correlated, then all the VIFs
will be 1. “A general rule is that the VIF should not exceed 10.”
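
For reference, the VIF is conventionally defined through an auxiliary regression of each predictor on all the others. The symbols $R_j^2$ and $p$ below are notation introduced here, not taken from the text above.

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \qquad j = 1, \dots, p
$$

Here $R_j^2$ is the coefficient of determination from regressing $X_j$ on the remaining $p - 1$ predictors. If $X_j$ is uncorrelated with the others, $R_j^2 = 0$ and $\mathrm{VIF}_j = 1$. The threshold of 10 corresponds to $R_j^2 = 0.9$. The standard error of the coefficient is inflated by $\sqrt{\mathrm{VIF}_j}$, so a VIF of 10 means a standard error roughly 3.2 times larger than it would be without collinearity. The `vif_func` code later in this notebook fits exactly these auxiliary regressions, one per variable.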

### Dealing with multicollinearity
* Increase the sample size. This is a common first step, since standard errors shrink as
the sample grows, which partly offsets the inflation caused by collinearity.
* Combine variables into a composite variable. Example: if height and weight are collinear
independent variables, remove height and weight from the model, and use surface area
(calculated from height and weight) instead.
* The easiest solution: remove the most intercorrelated variable(s) from the analysis. This
approach is misguided if the variables were included because the theory behind the model
calls for them, which they should have been.
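
The composite-variable option can be sketched on simulated data. Everything in the block is an assumption made for illustration: the variable names, units and effect sizes are invented, and body surface area is computed with the Mosteller approximation, which the text above does not specify. The helper `vif_one` is a hypothetical name for the usual auxiliary-regression VIF.

```r
# Simulated data; names, units and effect sizes are illustrative only.
set.seed(7)
n      <- 300
height <- rnorm(n, mean = 170, sd = 10)                  # cm
weight <- 70 + 0.9 * (height - 170) + rnorm(n, sd = 3)   # kg, tied to height
age    <- rnorm(n, mean = 40, sd = 12)                   # years, independent

# Mosteller approximation of body surface area in m^2 (an assumption here)
bsa <- sqrt(height * weight / 3600)

# VIF of one predictor: 1 / (1 - R^2) from regressing it on the others
vif_one <- function(formula, data) {
  1 / (1 - summary(lm(formula, data = data))$r.squared)
}

dat <- data.frame(height, weight, age, bsa)
vif_height <- vif_one(height ~ weight + age, dat)  # original specification
vif_bsa    <- vif_one(bsa ~ age, dat)              # composite specification
print(c(height = vif_height, bsa = vif_bsa))
```

The trade-off is interpretive: the composite removes the collinearity, but the model can no longer say anything about height and weight separately.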

In [11]:
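# Stepwise VIF selection: repeatedly drop the variable with the largest VIF
# until every remaining variable has VIF < thresh. Returns the retained names.
# Extra arguments (...) are passed on to lm().
vif_func<-function(in_frame,thresh=10,trace=TRUE,...){
  
  require(fmsb) # provides VIF()
  
  # inherits() is safe when class() returns more than one element
  if(!inherits(in_frame, 'data.frame')) in_frame<-data.frame(in_frame)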
  
  #get initial vif value for all comparisons of variables
  vif_init<-NULL
  var_names <- names(in_frame)
  for(val in var_names){
    regressors <- var_names[-which(var_names == val)]
    form <- paste(regressors, collapse = '+')
    form_in <- formula(paste(val, '~', form))
    vif_init<-rbind(vif_init, c(val, VIF(lm(form_in, data = in_frame, ...))))
  }
  vif_max<-max(as.numeric(vif_init[,2]), na.rm = TRUE)
  
  if(vif_max < thresh){
    if(trace==T){ #print output of each iteration
      prmatrix(vif_init,collab=c('var','vif'),rowlab=rep('',nrow(vif_init)),quote=F)
      cat('\n')
      cat(paste('All variables have VIF < ', thresh,', max VIF ',round(vif_max,2), sep=''),'\n\n')
    }
    return(var_names)
  }
  else{
    
    in_dat<-in_frame
    
    #backwards selection of explanatory variables, stops when all VIF values are below 'thresh'
    while(vif_max >= thresh){
      
      vif_vals<-NULL
      var_names <- names(in_dat)
      
      for(val in var_names){
        regressors <- var_names[-which(var_names == val)]
        form <- paste(regressors, collapse = '+')
        form_in <- formula(paste(val, '~', form))
        vif_add<-VIF(lm(form_in, data = in_dat, ...))
        vif_vals<-rbind(vif_vals,c(val,vif_add))
      }
      # index of the first largest VIF, compared numerically rather than as strings
      max_row<-which.max(as.numeric(vif_vals[,2]))
      
      vif_max<-as.numeric(vif_vals[max_row,2])
      
      if(vif_max<thresh) break
      
      if(trace==T){ #print output of each iteration
        prmatrix(vif_vals,collab=c('var','vif'),rowlab=rep('',nrow(vif_vals)),quote=F)
        cat('\n')
        cat('removed: ',vif_vals[max_row,1],vif_max,'\n\n')
        flush.console()
      }
      
      # drop = FALSE keeps in_dat a data frame even if a single column remains
      in_dat<-in_dat[,!names(in_dat) %in% vif_vals[max_row,1], drop = FALSE]
      
    }
    
    return(names(in_dat))
    
  }
  
}

In [12]:
df <- read.csv('vif.csv')

In [13]:
eqn <- "y ~ X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12+X13+X14+X15"

In [14]:
mod1<-lm(eqn,data=df)

In [15]:
summary(mod1)


Call:
lm(formula = eqn, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-65.855 -13.150   0.647  13.548  53.344 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    0.631      1.489   0.424  0.67233   
X1             5.523      2.345   2.356  0.01954 * 
X2            -7.089      3.389  -2.091  0.03787 * 
X3             2.512      1.685   1.490  0.13788   
X4             3.811      5.430   0.702  0.48362   
X5            -4.776      1.766  -2.704  0.00749 **
X6            11.374      8.777   1.296  0.19665   
X7            -3.426      6.207  -0.552  0.58166   
X8            -0.310      7.855  -0.039  0.96857   
X9            -6.424      3.279  -1.959  0.05160 . 
X10           11.559      5.761   2.007  0.04626 * 
X11            6.768      3.546   1.909  0.05782 . 
X12            4.830      3.235   1.493  0.13712   
X13            2.271      1.465   1.550  0.12294   
X14           -1.182      3.619  -0.327  0.74433   
X15            2.849   

In [17]:
vif_func(in_frame=subset( df, select = -y ),thresh=5,trace=T)

 var vif             
 X1  27.7352782121335
 X2  36.894719655662 
 X3  12.5694198119223
 X4  50.7385544946882
 X5  8.35069942576805
 X6  114.685122257738
 X7  67.3415419903541
 X8  153.59701278562 
 X9  48.2266628032823
 X10 50.7371404020434
 X11 33.9720046914933
 X12 43.2541022327502
 X13 12.0823286945155
 X14 74.6186892748866
 X15 29.8722458974499

removed:  X8 153.597 

 var vif             
 X1  6.67306561905865
 X2  7.98347501276441
 X3  4.56187657664034
 X4  8.03048468539905
 X5  7.70736760798638
 X6  19.6743072243842
 X7  52.9521669899093
 X9  17.8683960667889
 X10 46.2484642812219
 X11 18.247944611281 
 X12 42.1336977950459
 X13 10.8973377459043
 X14 37.929695266819 
 X15 21.5847028895811

removed:  X7 52.95217 

 var vif             
 X1  6.54376168047226
 X2  7.68236114724478
 X3  4.04873004990446
 X4  5.08958904355854
 X5  2.65685239969468
 X6  9.126853848419  
 X9  2.89940351016053
 X10 4.24712217527482
 X11 4.45202381076983
 X12 12.8835110865084
 X13 1.92759852497866
 X14 