# MATH 3375 Examples Notebook #9

# Dimension Reduction and Regularization

## LASSO and Ridge Regression

We continue looking at methods for improving a linear regression model. 


In [None]:
#Look at data set
car_data <- read.csv("cars2004.csv", stringsAsFactors=TRUE)
head(car_data,3)
tail(car_data,3)

In [None]:
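# Convert Length and Width to integers (as.character first, in case they were read as factors).
# Entries that are not valid numbers become NA, and R issues a coercion warning.
car_data$Length <- as.integer(as.character(car_data$Length))
car_data$Width <- as.integer(as.character(car_data$Width))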
tail(car_data,3)

In [None]:
# Replace missing Length and Width values with the median of the observed values
car_data$Length[is.na(car_data$Length)] <- as.integer(median(car_data$Length, na.rm=TRUE))
car_data$Width[is.na(car_data$Width)] <- as.integer(median(car_data$Width, na.rm=TRUE))
head(car_data,3)
tail(car_data,3)

## 1. Stepwise Regression

Recall that we used stepwise regression as a method for selecting variables to include in the model. We review this method below.

In [None]:
hp_model0 <- lm(HP ~ 1, data=car_data)

hp_model_full <- lm(HP ~ MSRP+Invoice+EngineSize+Cylinders+City.MPG+Hwy.MPG+Weight+WheelBase+Length+Width, data=car_data)
# Forward selection from the intercept-only model; k=2 corresponds to the AIC penalty
hp_model_step <- step(hp_model0, scope=list(lower=hp_model0, upper=hp_model_full), direction = "forward", k=2)


In [None]:
summary(hp_model_step)

### Pros and Cons of Stepwise 

#### Pros
* Easy and quick
* Algorithm helps balance bias and variance
* Algorithm uses a metric (e.g., AIC) to avoid adding variables that do not improve this balance

#### Cons
* Greedy algorithm does not necessarily choose the **_optimal_** model
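
As a self-contained sketch of the procedure reviewed above, the code below runs forward selection on R's built-in `mtcars` data. That data set, the response `mpg`, and the candidate predictors are assumptions chosen only so the example runs without `cars2004.csv`.

```r
# Forward stepwise selection on the built-in mtcars data (illustrative only)
null_model <- lm(mpg ~ 1, data = mtcars)

# k = 2 uses the AIC penalty; trace = 0 suppresses the step-by-step printout
step_model <- step(null_model,
                   scope = list(lower = ~ 1,
                                upper = ~ wt + hp + disp + qsec + drat),
                   direction = "forward", k = 2, trace = 0)

# Predictors chosen by the greedy search
selected <- attr(terms(step_model), "term.labels")
print(selected)

# Forward selection only adds a variable when AIC improves,
# so the selected model's AIC is never worse than the starting model's
print(c(null = AIC(null_model), step = AIC(step_model)))
```

Because the search adds one variable at a time, it can miss a combination of predictors that would score better jointly. This is the sense in which the result is not guaranteed to be optimal.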

## 2. LASSO Regression

LASSO (**L**east **A**bsolute **S**hrinkage and **S**election **O**perator) is a regression _regularization_ method with the following properties:

* Instead of purely minimizing the SSE (sum of squared errors, i.e., the sum of squared residuals) to determine the model coefficients, the algorithm minimizes:
<center>
$SSE + \lambda\sum_{j=1}^{p}\left | \beta_j \right |$
</center>
    
* The penalty term limits the sum of the coefficient magnitudes
* The effect is to _**shrink**_ coefficients
* This can drive some coefficients exactly to zero, which eliminates those variables from the model
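
To see why the penalty can set coefficients exactly to zero, consider the special case where the predictors are orthonormal ($X^TX = I$). Under that assumption, minimizing $SSE + \lambda\sum|\beta_j|$ has a simple solution: each OLS coefficient is pulled toward zero by $\lambda/2$, and set to zero if its magnitude is smaller than that. The sketch below illustrates this; the function name `soft_threshold` and the example coefficients are made up for illustration, and real predictors are rarely orthonormal.

```r
# Soft-thresholding: the LASSO solution when predictors are orthonormal
# (illustrative helper, not a glmnet function)
soft_threshold <- function(b_ols, lambda) {
  sign(b_ols) * pmax(abs(b_ols) - lambda / 2, 0)
}

b_ols <- c(3.0, -1.5, 0.4, -0.2)   # hypothetical OLS coefficients

# lambda = 0 leaves the OLS coefficients unchanged
soft_threshold(b_ols, lambda = 0)

# lambda = 1 shrinks each coefficient by 0.5 in magnitude;
# those smaller than 0.5 in magnitude become exactly 0
soft_threshold(b_ols, lambda = 1)
```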


#### Points of Caution

* The value of $\lambda$ must be determined through trials with cross-validation to see which $\lambda$ value is most effective.
* Note that when $\lambda = 0$, the model is the same as an ordinary least squares (OLS) regression model.
* Unlike ordinary least squares (OLS) regression, there is no "closed form" mathematical equation to compute the model coefficients.
* A _numerical algorithm_ is applied to minimize the sum.
* Data MUST be standardized before the LASSO algorithm is applied, so that the penalty treats every coefficient on the same scale.
* After LASSO has been used to **_select_** the variables, use those variables again in an OLS (not LASSO) model to obtain the final regression model.

### R Libraries to Implement LASSO

There are multiple libraries that will implement LASSO regression. Below we show an implementation with the **glmnet** library.

_Only un-comment the install line if the library fails to load. Run the install once and re-comment the line to avoid running the install again._

In [None]:
#install.packages("glmnet")
library(glmnet)

#### Features of glmnet

* **glmnet** standardizes the predictors for you by default (`standardize=TRUE`) and reports the coefficients on the original scale
* The **cv.glmnet** function uses cross-validation to help you select the best $\lambda$ value

#### Notes about the R code below

* The first step is to find the best $\lambda$
* Because there is cross-validation involved, note that we use **set.seed** before starting to ensure reproducible results.
* We are using 5-fold cross-validation by specifying **nfolds=5**
* We specify **alpha=1** to indicate that we are using **LASSO**

In [None]:
set.seed(3375)
hp_model_lasso <- cv.glmnet(x=as.matrix(car_data[c(4:7,9:14)]),y=as.matrix(car_data[8]),alpha=1,nfolds=5)
plot(hp_model_lasso)

### Interpreting the Graph

* Each red dot is the MSE obtained for a certain $\lambda$ value
* The vertical line on the LEFT at $\log(\lambda)$ close to $-3$ represents the minimum MSE
* Both vertical lines together give an interval where we can remain within one _standard error_ of that minimum
* Below we display the lambda values for these two thresholds
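
The one-standard-error rule behind the second vertical line can be sketched in base R. The numbers below are invented for illustration and are not output from `cv.glmnet`. The sketch is a simplified description of the rule, not the package's own code.

```r
# Hypothetical cross-validation results for five lambda values
lambda <- c(0.01, 0.1, 1, 5, 10)
cv_mse <- c(105, 100, 103, 108, 130)  # mean cross-validated MSE
cv_se  <- c(6, 5, 5, 6, 8)            # standard error of each MSE estimate

# Lambda with the smallest cross-validated MSE
i_min <- which.min(cv_mse)
lambda_min <- lambda[i_min]

# Largest lambda whose MSE is within one SE of the minimum MSE
threshold <- cv_mse[i_min] + cv_se[i_min]
lambda_1se <- max(lambda[cv_mse <= threshold])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
```

Choosing the larger of the two values trades a small, statistically insignificant increase in MSE for more shrinkage.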

In [None]:
#Lambda with minimum MSE
hp_model_lasso$lambda.min

In [None]:
#Lambda threshold for MSE within 1 SE of minimum
hp_model_lasso$lambda.1se

### Viewing Model Coefficients

Below we view the LASSO model coefficients for the $\lambda$ with minimum MSE

In [None]:
coef(hp_model_lasso,s=hp_model_lasso$lambda.min)

#### Reduce Dimensions (Number of Predictors)

In this example, no coefficients were eliminated from the model, but the **Invoice** coefficient is very small.

Now we view the LASSO model coefficients for the largest possible $\lambda$ that still gives results within 1 SE of the minimum MSE. This is how LASSO can be used to reduce the complexity of the model.

In [None]:
coef(hp_model_lasso,s=hp_model_lasso$lambda.1se)

#### This is the 'default' LASSO model, as shown below

In [None]:
coef(hp_model_lasso)

#### Using the LASSO Result

The default LASSO model with the most shrinkage (largest acceptable $\lambda$) has dropped all predictors except MSRP, EngineSize, Cylinders, and City.MPG.

To use these results, we create an OLS model with the predictors that were not dropped by the LASSO model.  

In [None]:
hp_model_ols_lasso <- lm(HP ~ MSRP+EngineSize+Cylinders+City.MPG, data=car_data)
summary(hp_model_ols_lasso)
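
The predictor list in the formula above was typed by hand. As a hedged sketch, the formula could also be built from the names of the nonzero coefficients. The coefficient values below are made up and merely stand in for LASSO output; with a real `glmnet` result, the sparse matrix returned by `coef` would likely need to be converted first (for example with `as.matrix`).

```r
# Hypothetical coefficients standing in for LASSO output (values are made up)
lasso_coefs <- c("(Intercept)" = 12.3, MSRP = 0.002, Invoice = 0,
                 EngineSize = 20.1, Cylinders = 15.4, City.MPG = -1.7,
                 Weight = 0)

# Keep predictors with nonzero coefficients, ignoring the intercept
kept <- names(lasso_coefs)[lasso_coefs != 0]
kept <- setdiff(kept, "(Intercept)")

# Build the OLS formula from the surviving predictor names
ols_formula <- reformulate(kept, response = "HP")
ols_formula
```

The resulting formula can then be passed to `lm` together with the data frame.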

## 3. Ridge Regression

Another regularization method for regression models is **Ridge Regression**. It is _**not**_ used for variable selection, but it does have the potential to better balance bias and variance to give a model with better predictions.

Ridge Regression computes model coefficients that minimize:

<center>
$SSE + \lambda\sum_{j=1}^{p} {\beta_j}^2$
</center>
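
Unlike the LASSO objective, this one has a closed-form minimizer, $\hat\beta = (X^TX + \lambda I)^{-1}X^Ty$. The sketch below illustrates it in base R, assuming the predictors and response are centered so that no intercept is needed. The helper name `ridge_coef` is hypothetical, and the built-in `mtcars` data is used only so the code runs on its own. `glmnet` scales its objective differently, so these $\lambda$ values are not directly comparable to its output.

```r
# Closed-form ridge coefficients for centered data (illustrative helper)
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

# Center and scale the predictors, center the response
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg - mean(mtcars$mpg)

b_ols   <- ridge_coef(X, y, lambda = 0)    # lambda = 0 reproduces OLS
b_ridge <- ridge_coef(X, y, lambda = 10)   # lambda > 0 shrinks the coefficients

rbind(ols = b_ols, ridge = b_ridge)
```

The coefficients shrink toward zero as $\lambda$ grows, but none of them is set exactly to zero. This is why Ridge is not used for variable selection.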


### Using glmnet for Ridge Regression

The process is very similar to the one we used for LASSO, but **alpha=0** indicates we are using Ridge Regression instead of LASSO.

In [None]:
set.seed(3375)
hp_model_ridge <- cv.glmnet(x=as.matrix(car_data[c(4:7,9:14)]),y=as.matrix(car_data[8]),alpha=0,nfolds=5)
plot(hp_model_ridge)


### Interpreting the Plot

Again, we can examine the $\lambda$ with minimum MSE and the largest acceptable $\lambda$ that stays within 1 standard error of that minimum MSE.  Notice again that the model with the larger $\lambda$ is the default.

In [None]:
coef(hp_model_ridge, s=hp_model_ridge$lambda.min)

In [None]:
coef(hp_model_ridge, s=hp_model_ridge$lambda.1se)

In [None]:
coef(hp_model_ridge)

### Notes About the Ridge Model

Unlike LASSO, we keep our selected Ridge Model coefficients as our model.  The predictions are computed using these coefficients rather than those in the OLS model with the same predictors. 

In [None]:
#Predict HP for first 5 rows of original data set, using the lambda with minimum MSE
predict(hp_model_ridge, s="lambda.min", newx=as.matrix(car_data[1:5,c(4:7,9:14)]))

In [None]:
#Predict HP for first 5 rows of original data set, using the 1-SE lambda
predict(hp_model_ridge, s="lambda.1se", newx=as.matrix(car_data[1:5,c(4:7,9:14)]))

### Considerations for Using Ridge Regression

Since Ridge does not eliminate variables that are highly correlated, it is advisable to select a reasonable subset of variables **first** (to avoid multicollinearity) and then apply Ridge Regression.
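
One way to screen for highly correlated predictors before fitting is sketched below, using the built-in `mtcars` data as a stand-in for the course data. The choice of columns is illustrative. The variance inflation factor computed by hand here is one common diagnostic, offered as an assumption about what such a screening step could look like.

```r
# Pairwise correlations among candidate predictors (mtcars used for illustration)
predictors <- mtcars[, c("wt", "hp", "disp", "cyl")]
cor_matrix <- cor(predictors)
round(cor_matrix, 2)

# Variance inflation factor for one predictor: 1 / (1 - R^2),
# where R^2 comes from regressing that predictor on the others
r2_disp <- summary(lm(disp ~ wt + hp + cyl, data = mtcars))$r.squared
vif_disp <- 1 / (1 - r2_disp)
vif_disp
```

Correlations close to 1 or -1, or a large variance inflation factor, suggest that one of the variables involved could be dropped before Ridge is applied.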

### Suggestion

Compare the Ridge regression predictions with those of the other models (stepwise and OLS model from LASSO dimension reduction). Compute the in-sample MSE for each.
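
A minimal sketch of the MSE computation follows, with a hypothetical helper named `mse` and the built-in `mtcars` data standing in for the course data. For the `glmnet` models, `predict` returns a one-column matrix, so wrapping the predictions in `as.numeric` before passing them to the helper is probably the simplest approach.

```r
# In-sample mean squared error (illustrative helper)
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Example with an OLS model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
mse_fit <- mse(mtcars$mpg, fitted(fit))
mse_fit

# Equivalent: residual sum of squares divided by the number of rows
sum(resid(fit)^2) / nrow(mtcars)
```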