---
title: "Chapter 6 - Linear Model Selection and Regularisation"
author: "Dan"
date: "11 February 2018"
output: html_document
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


Lab 1 - Subset Selection Methods

Best Subset Selection

```{r}
## We want to predict Salary on the basis of various statistics associated with performance in the previous year

library(ISLR)
data("Hitters")
str(Hitters)
summary(Hitters)
names(Hitters)
```

There are 59 NAs in Salary.
```{r}
sum(is.na(Hitters$Salary))
nrow(Hitters)
```
In this case we will just remove the players with NAs.
 
```{r}
Hitters <- na.omit(Hitters)
sum(is.na(Hitters$Salary))

```

The regsubsets() function from the leaps library performs best subset selection by identifying the best model that contains a given number of predictors, where best is defined as having the lowest RSS. The syntax is the same as for lm(). The summary() command outputs the best set of variables for each model size.

```{r}
library(leaps)
regfit.full <- regsubsets(Salary ~ ., data = Hitters) # Running best subset selection up to 8 variables max. Trying every possible combo and returning the one with the lowest RSS for each possible size 1-8 variables
class(regfit.full)

summary(regfit.full)
```

Here it outputs the best subset of each size up to an eight-predictor model. A star in a given row means that the variable is included in the best model with that number of predictors.

By default regsubsets() only goes up to eight predictors, but the nvmax argument can be used to raise this limit.

```{r}
regfit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
reg.summary <- summary(regfit.full)
reg.summary
```
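
To make "best" concrete, the following sketch brute-forces every two-variable model on a small problem and checks that regsubsets() picks the same pair. The built-in mtcars data, the five candidate predictors and the bs_* names are illustrative assumptions, not part of the lab.

```{r}
library(leaps)

bs_vars <- c("wt", "hp", "disp", "drat", "qsec")   # hypothetical candidate predictors for mpg

# Training RSS of a least squares fit that uses the given predictors
bs_rss <- function(vars) {
  sum(resid(lm(reformulate(vars, response = "mpg"), data = mtcars))^2)
}

bs_pairs <- combn(bs_vars, 2, simplify = FALSE)    # every two-variable combination
bs_best <- bs_pairs[[which.min(sapply(bs_pairs, bs_rss))]]

bs_fit <- regsubsets(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)
bs_leaps <- names(coef(bs_fit, 2))[-1]             # drop "(Intercept)"

bs_best
setequal(bs_best, bs_leaps)                        # do the two searches agree?
```

With 19 predictors the same idea means comparing 2^19 models, which is why an efficient search such as the one in leaps is needed.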

The summary also returns R2, RSS, Adjusted R2, Cp and BIC. We can examine these to try and select the best overall model.

```{r}
names(reg.summary)
```

```{r}
reg.summary$rsq
```

We see that the R2 statistic increases from 32% when only one variable is included in the model, to almost 55% when all variables are included. As expected, the R2 statistic increases monotonically as more variables are included.
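
As a small illustration of why R2 alone cannot be used to pick a model size, this sketch fits three nested models and computes adjusted R2 by hand as 1 - (1 - R2)(n - 1)/(n - d - 1), where d is the number of predictors. The mtcars data and the adj_* names are assumptions made for the example.

```{r}
adj_forms <- list(mpg ~ wt, mpg ~ wt + hp, mpg ~ wt + hp + disp)  # nested models with d = 1, 2, 3
adj_n <- nrow(mtcars)

adj_tab <- t(sapply(seq_along(adj_forms), function(d) {
  s <- summary(lm(adj_forms[[d]], data = mtcars))
  c(d = d,
    rsq = s$r.squared,
    adjr2_by_hand = 1 - (1 - s$r.squared) * (adj_n - 1) / (adj_n - d - 1),
    adjr2_lm = s$adj.r.squared)
}))
adj_tab
```

R2 can never fall when a predictor is added to a nested model, whereas adjusted R2 pays a price for each extra variable.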

Plotting RSS, adjusted R2, Cp and BIC for all of the models at once will help us decide which model to select. Note the type="l" option tells R to connect the plotted points with lines

```{r}
par(mfrow=c(2,2))
plot(reg.summary$rss ,xlab="Number of Variables ",ylab="RSS",
type="l")
```

```{r}
plot(reg.summary$adjr2, xlab="Number of Variables", ylab="Adjusted RSq", type="l")

```

The points() command works just like the plot() command, except that it puts points on a plot that has already been created, instead of creating a new plot. The which.max() function can be used to identify the location of the maximum point of a vector. We will now plot a red dot to indicate the model with the largest adjusted R2 statistic.

```{r}
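plot(reg.summary$adjr2, xlab="Number of Variables", ylab="Adjusted RSq", type="l") # redraw the curve so points() has a plot to add to when this chunk runs on its own
which.max(reg.summary$adjr2) # = 11
points(11,reg.summary$adjr2[11], col="red",cex=2,pch =20)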
```

Similarly, we can plot the Cp and BIC statistics, and indicate the models with the smallest statistic using which.min()

```{r}
plot(reg.summary$cp, type="l", xlab="Number of Variables", ylab="Cp") 
which.min(reg.summary$cp) # 10
points(10, reg.summary$cp[10], col="red", cex=2, pch=4)# points(x, y)

plot(reg.summary$bic, xlab="Number of Variables", ylab="BIC", type="l")
which.min(reg.summary$bic)
points(6,reg.summary$bic[6], col="red", cex=2, pch=4) #points (x,y)
```

The regsubsets() function has a built-in plot() command which can be used to display the selected variables for the best model with a given number of predictors, ranked according to R2, adjusted R2, Cp, or BIC.

```{r}
plot(regfit.full, scale="r2")
plot(regfit.full, scale="adjr2")
plot(regfit.full, scale="Cp")
plot(regfit.full, scale="bic")
```

The top row of each plot contains a black square for each variable selected according to the optimal model associated with the statistic. For instance, we see that several models show a BIC close to -150. However, the model with the lowest BIC is the six-variable model that only contains AtBat, Hits, Walks, CRBI, DivisionW, and PutOuts. We can use the coef() function to see the coefficients associated with this model. 

```{r}
coef(regfit.full,6)
```

Forward and Backward Stepwise Selection

We can also use the regsubsets() function to perform forward stepwise or backward stepwise selection, using the argument method="forward" or method="backward"

```{r}
regfit.fwd <- regsubsets(Salary~., data=Hitters, nvmax = 19, method="forward")
summary(regfit.fwd)

regfit.bwd <- regsubsets(Salary~.,data=Hitters, nvmax = 19, method="backward")
summary(regfit.bwd)
```
For instance, we see that using forward stepwise selection, the best one-variable model contains only CRBI, and the best two-variable model additionally includes Hits. For this data, the best one-variable through six-variable models are each identical for best subset and forward selection. However, the best seven-variable models identified by forward stepwise selection, backward stepwise selection and best subset selection are different.

```{r}
coef(regfit.full, 7) # best subset selection
coef(regfit.fwd, 7) # forward stepwise 
coef(regfit.bwd, 7) # backward stepwise


```
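
The greedy nature of forward stepwise selection can be sketched in a few lines of base R: at each step, add whichever remaining variable lowers the training RSS the most, and never remove a variable once it is in. This is a simplified illustration on the built-in mtcars data with hypothetical fwd_* names, not the algorithm as implemented in leaps.

```{r}
fwd_response <- "mpg"
fwd_candidates <- setdiff(names(mtcars), fwd_response)
fwd_selected <- character(0)

# Training RSS of the least squares fit on a given set of predictors
fwd_rss <- function(vars) {
  sum(resid(lm(reformulate(vars, response = fwd_response), data = mtcars))^2)
}

for (step in 1:3) {
  remaining <- setdiff(fwd_candidates, fwd_selected)
  # RSS of each model formed by adding one remaining variable to the current set
  step_rss <- sapply(remaining, function(v) fwd_rss(c(fwd_selected, v)))
  fwd_selected <- c(fwd_selected, names(which.min(step_rss)))
}

fwd_selected  # variables in the order they entered the model
```

Because each model contains the previous one, forward selection can miss the best subset of a given size, which is what happens with the seven-variable models above.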

Choosing Among Models Using the Validation Set Approach and Cross-Validation

We just saw that it is possible to choose among a set of models of different sizes using Cp, BIC and adjusted R2. We will now consider how to do this using validation set and cross-validation approaches.

In order for these approaches to yield accurate estimates of the test error, we must use only the training observations to perform *all* aspects of model fitting, including variable selection. 

Therefore, the determination of which model of a given size is best must be made using *only* the training observations. This point is subtle but important. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error.

In order to use the validation set approach, we begin by splitting the observations into a training set and a test set. We do this by creating a random vector, train, of elements equal to TRUE if the corresponding observation is in the training set and FALSE otherwise. The vector test has a TRUE if the observation is in the test set and FALSE otherwise. We set a random seed so that the user will obtain the same training/test set split

```{r}
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Hitters), replace=TRUE) # creates a vector of length nrow(Hitters) of randomly distributed TRUEs and FALSEs

test <- !train

regfit.best <- regsubsets(Salary ~ ., data=Hitters[train,], nvmax=19) # fits the lowest RSS model of each size

#model.matrix creates a model matrix so that we can test the best subset selection fits. It expands factors to dummy variables. Here, the 'league' 'division' and 'newleague' factors, each with two levels are turned into dummy variables

test.matrix <- model.matrix(Salary~., data=Hitters[test,])

val.errors <- rep(NA, 19)

# regsubsets returns the best model for each model size (number of variables 1-19). The loop extracts the coefficients from each 'best' model, multiplies them into the correct column of the test matrix to form the predictions, and computes the test MSE

for (i in 1:19) {
  coefi <- coef(regfit.best, id=i) # id is the model size
  pred <- test.matrix[,names(coefi)] %*% coefi
  val.errors[i] <- mean((Hitters$Salary[test]-pred)^2)
}

val.errors

which.min(val.errors) # = 10, so the 10 variable model has the smallest test MSE

coef(regfit.best,10)
```
This was a little tedious, partly because there is no predict() method for regsubsets(). Since we will be using these steps again, we can capture them in our own predict method.

```{r}
predict.regsubsets <- function (object,newdata,id,...) {
  formula <- as.formula(object$call[[2]]) # extracts the model formula from the regsubsets() function
  test.mat <- model.matrix(formula,newdata)
  coefi <- coef(object, id=id) # id = number of variables in the model
  xvars <- names(coefi)
  test.mat[,xvars]%*%coefi
}

predict.regsubsets(regfit.best,Hitters[test,],1 )

# Finally, we perform best subset selection on the full data set, and select the best ten-variable model. It is important now that we have determined which size model to use that we make use of the full data set in order to obtain the most accurate coefficient estimates. 

# Note that we perform best subset selection on the full data set and select the best ten-variable model, rather than simply using the variables that were obtained from the training set, because the best ten-variable model on the full data set may differ from the best ten-variable model on the training set.

regfit.best <- regsubsets(Salary~., data=Hitters, nvmax=19)
coef(regfit.best,10) # These are the coefficients and their estimates that we have selected using the validation method

## Cross Validation Method ##
```
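
The name predict.regsubsets matters: R's S3 dispatch calls a function named predict.<class> whenever predict() is given an object of that class, so predict(regfit.best, Hitters[test,], id=1) should give the same result as the direct call above. The toy "meanmodel" class below is hypothetical and only illustrates the dispatch mechanism.

```{r}
# Hypothetical toy class: a "model" that predicts the training mean for every new row
fit_meanmodel <- function(y) structure(list(mu = mean(y)), class = "meanmodel")

# Naming the function predict.<class> registers it as the predict() method for that class
predict.meanmodel <- function(object, newdata, ...) rep(object$mu, nrow(newdata))

toy_fit <- fit_meanmodel(c(2, 4, 6))
toy_pred <- predict(toy_fit, data.frame(x = 1:3))  # dispatches to predict.meanmodel()
toy_pred
```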

We now try to choose among the models of different sizes using cross-validation. This approach is somewhat involved, as we must perform best subset selection within each of the k training sets. Despite this, we see that with its clever subsetting syntax, R makes this quite easy. First, we create a vector that allocates each observation to one of k=10 folds, and we create a matrix in which to store the results.

```{r}
k=10
set.seed(1)
folds <- sample(1:k, nrow(Hitters), replace=TRUE) # 263 element vector of 1:10s to decide which observation is in which fold

cv.errors <- matrix(NA, k, 19, dimnames=list(NULL,(1:19))) # creates a matrix of NAs, of k rows and 19 columns, with NULL row names and 1:19 as column names
```
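
Sampling fold labels with replacement, as above, gives folds of unequal size. A common variation, shown here only as an assumption-labelled alternative and not what this lab uses, is to shuffle a repeating 1..k pattern so that fold sizes differ by at most one. The n_obs, k_folds and *_folds names are illustrative.

```{r}
set.seed(1)
n_obs <- 263     # number of rows in Hitters after na.omit()
k_folds <- 10

random_folds <- sample(1:k_folds, n_obs, replace = TRUE)        # the approach used in this lab
balanced_folds <- sample(rep(1:k_folds, length.out = n_obs))    # shuffled repeating pattern

range(table(random_folds))    # smallest and largest fold under random allocation
range(table(balanced_folds))  # smallest and largest fold under balanced allocation
```
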
Now we can write a for loop that performs cross-validation. In the jth fold, the elements of folds that equal j are in the test set, and the remainder are in the training set. We make our predictions for each model size (using our new predict() method), compute the test errors on the appropriate subset, and store them in the appropriate slot in the matrix cv.errors.

```{r}
for (j in 1:k) {
  best.fit <- regsubsets(Salary~., data=Hitters[folds!=j,], nvmax=19)
  for (i in 1:19) {
    pred <- predict.regsubsets(best.fit, Hitters[folds==j,],id=i)
    cv.errors[j,i] <- mean((Hitters$Salary[folds==j]-pred)^2)
  }
}

# First iteration fits the 19 best models using best subset selection on folds 2-10. Fold 1 is now the test set for each of the 19 models and their MSE is returned. We end up with a matrix of the MSE of each different size model with each different fold having a turn as the test set.

mean.cv.errors <- apply(cv.errors, MARGIN = 2, FUN = mean) # apply function passes a function over a matrix, here we apply the mean function over the matrix cv.errors. MARGIN = 2 tells apply to only apply the specified function to the columns of the matrix. Returns a vector of the mean of each column of the matrix

mean.cv.errors
which.min(mean.cv.errors) # cross-validation tells us that the 11-variable model has the smallest test MSE
```
Just as with the validation set method, we apply this finding to the whole data set in order to get the most accurate coefficient estimates

```{r}
best.fit <- regsubsets(Salary ~ .,data = Hitters, nvmax = 19)
coef(best.fit, 11) # This is the result that the cross validation method picked
```

Lab 2 - Ridge Regression and the Lasso

We will use the glmnet package to perform ridge regression and the lasso. The main function of this package is glmnet(), which can be used to fit ridge regression models, lasso models and more.

This function has slightly different syntax from other model fitting functions that we have encountered thus far. In particular, we must pass in an x matrix as well as a y vector, and we do not use the y~x syntax.

We will now perform ridge regression and the lasso in order to predict Salary on the Hitters data.

Before proceeding, ensure that the missing values have been removed from the data

```{r}
sum(is.na(Hitters)) # = 0
x <- model.matrix(Salary~., data=Hitters)[,-1] # predictor matrix; [,-1] drops the intercept column of ones
y <- Hitters$Salary
```
The model.matrix() function is particularly useful for creating x; not only does it produce a matrix corresponding to the 19 predictors but it also automatically transforms any qualitative variables into dummy variables. This is necessary because glmnet() can only take numerical inputs.
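
A tiny example shows the dummy-variable expansion. The data frame and the mm_* names are made up for illustration.

```{r}
mm_demo <- data.frame(y = c(1, 2, 3, 4),
                      hits = c(10, 20, 15, 30),
                      league = factor(c("A", "N", "A", "N")))

mm_x <- model.matrix(y ~ ., data = mm_demo)[, -1]  # drop the intercept column
mm_x            # league becomes a 0/1 column named leagueN
is.numeric(mm_x)
```

The first factor level ("A") is the baseline, so only one dummy column is needed for a two-level factor.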

Ridge Regression

The glmnet() function has an alpha argument that determines what type of model is fit. If alpha=0 a ridge regression model is fit, and if alpha=1 a lasso model is fit.

```{r}
library(glmnet)
grid <- 10^seq(10,-2, length=100) # 10 to the powers of 10 (huge) to -2 (0.01, tiny) in 100 steps
ridge.mod <- glmnet(x, y, alpha=0, lambda=grid) # x already has its intercept column removed, so it is passed in whole; dropping another column would discard the first predictor (AtBat)
```

By default, the glmnet() function performs ridge regression for an automatically selected range of λ values. However, here we have chosen to implement the function over a grid of values ranging from λ=10^10 to λ=10^-2, essentially covering the full range of scenarios from the null model containing only the intercept, to the least squares fit. As we will see, we can also compute model fits for a particular value of λ that is not one of the original grid values. Note that by default, the glmnet() function standardizes the variables so that they are on the same scale. To turn off this default setting, use the argument standardize = FALSE.

Associated with each value of λ is a vector of ridge regression coefficients, stored in a matrix that can be accessed by coef(). In this case, it is a 20x100 matrix, with 20 rows (one for each predictor, plus an intercept) and 100 columns (one for each value of λ).

```{r}
names(ridge.mod)
dim(coef(ridge.mod)) 
```

We expect the coefficient estimates to be much smaller, in terms of l2 norm, when a large value of λ is used, as compared to when a small value of λ is used. These are the coefficients when λ = 11,498, along with their l2 norm.

```{r}
ridge.mod$lambda[50] # the 50th value of lambda
coef(ridge.mod)[,50] # the coefficients associated with the 50th value for lambda
sqrt(sum(coef(ridge.mod)[-1,50]^2)) # l2 norm of those coefficients, excluding the intercept

options(digits=4) # reduce the number of decimal places shown globally
```

In contrast, here are the coefficients when λ = 705, along with their l2 norm. Note the much larger l2 norm of the coefficients associated with this smaller value of λ.

```{r}
ridge.mod$lambda[60] # the 60th value for lambda
coef(ridge.mod)[,60] # the coefficients associated with the 60th value of lambda
sqrt(sum(coef(ridge.mod)[-1,60]^2)) # l2 norm of those coefficients, excluding the intercept
```
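
The shrinking l2 norm can also be seen from the closed-form ridge solution, (X'X + λI)^-1 X'y, on standardised predictors. This is a base R sketch on the built-in mtcars data with hypothetical ridge_* names; glmnet() scales its objective differently, so these λ values are not comparable with the grid above.

```{r}
ridge_X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))  # centred and scaled predictors
ridge_y <- mtcars$mpg - mean(mtcars$mpg)                      # centred response, so no intercept is needed

# Closed-form ridge coefficients for a given penalty
ridge_coef <- function(lambda) {
  solve(crossprod(ridge_X) + lambda * diag(ncol(ridge_X)), crossprod(ridge_X, ridge_y))
}

ridge_lambdas <- c(0, 1, 10, 100, 1000)
ridge_norms <- sapply(ridge_lambdas, function(l) sqrt(sum(ridge_coef(l)^2)))
rbind(lambda = ridge_lambdas, l2_norm = ridge_norms)          # the norm falls as lambda grows
```
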
We can use the predict() function for a number of purposes. For instance, we can obtain the ridge regression coefficients for a new value of λ, say 50:

```{r}
predict(ridge.mod, s=50, type="coefficients")[1:20,] # s is our chosen lambda value. Without the subsetting, this returns a 20x1 matrix of the coefficient estimates. Subsetting [1:20,] displays them in a nicer tabular format
```

We now split the samples into a training set and a test set in order to estimate the test error of ridge regression and the lasso. There are two common ways to randomly split a data set.

1) Produce a random vector of TRUE and FALSE elements and select the observations corresponding to TRUE for the training data

2) Randomly choose a subset of numbers from 1 to n; these can be used as the indices for the training observations

The two approaches work equally well. We used the former method in the previous lab so here we will use the latter method.
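
Side by side, the two approaches look like this on a toy data frame of ten rows. The split_* names are illustrative only.

```{r}
set.seed(1)
split_n <- 10
split_demo <- data.frame(id = 1:split_n)

# 1) logical vector: TRUE marks a training observation, so the training set size is random
split_lgl <- sample(c(TRUE, FALSE), split_n, replace = TRUE)

# 2) index vector: exactly half of the row numbers, drawn without replacement
split_idx <- sample(1:split_n, split_n / 2)

# The test set is the complement in each case: !split_lgl or -split_idx
n_lgl <- nrow(split_demo[split_lgl, , drop = FALSE]) + nrow(split_demo[!split_lgl, , drop = FALSE])
n_idx <- nrow(split_demo[split_idx, , drop = FALSE]) + nrow(split_demo[-split_idx, , drop = FALSE])
c(n_lgl, n_idx)  # both splits account for every row
```

The index approach fixes the training set size in advance, while the logical approach only fixes it on average.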

```{r}
set.seed(1)

x <- model.matrix(Salary~.,data=Hitters)[,-1] # remove useless 'intercept=1' term
y <- Hitters$Salary

train <- sample(1:nrow(x), nrow(x)/2) #take 263/2 samples of 1 to 263 with no replacement
test <- x[-train,]
y.test <- y[-train]
```

Next we fit a ridge regression model on the training set, and evaluate its test MSE using λ = 4. We use the predict() function again, but instead of type="coefficients" we pass the test observations through the newx argument.

```{r}
ridge.mod <- glmnet(x=x[train,], y=y[train], alpha=0, lambda=grid, thresh = 1e-12)
ridge.pred <- predict(ridge.mod, newx = test, s=4)
mean((ridge.pred - y.test)^2)
```

Test MSE is 101,037. Note that if we had instead simply fit a model with just an intercept, we would have predicted each test observation using the mean of the training observations. In that case, we could compute the test set MSE like this:

```{r}
mean((mean(y[train])-y.test)^2) # = 193,253
```

We could also get the same result by fitting a ridge regression model with a very large value of λ. Note that 1e10 = 10^10.

```{r}
ridge.pred <- predict(ridge.mod, s=1e10, newx=test)
mean((ridge.pred - y.test)^2) # = 193,253
```

So fitting a ridge regression model with λ=4 leads to a much lower test MSE than fitting a model with just an intercept. We now check whether there is any benefit to performing ridge regression with λ=4 instead of just performing least squares regression. Recall that least squares is simply λ=0.

```{r}
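ridge.pred <- predict(ridge.mod, s=0, newx=test, exact=TRUE, x=x[train,], y=y[train]) # exact=TRUE stops R interpolating over the grid of lambda values; recent glmnet versions also need the original x and y so the model can be refit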

mean((ridge.pred - y.test)^2) # = 114,724 using least squares
```

In general, instead of arbitrarily choosing λ=4, it would be better to do this using the built-in cross-validation function, cv.glmnet(). By default, the function performs ten-fold cross-validation, though this can be changed using the argument nfolds.
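
As a self-contained sketch of what cv.glmnet() returns, the following runs it on simulated data and pulls out the two λ values it stores: lambda.min, the value with the lowest cross-validation error, and lambda.1se, the largest value whose error is within one standard error of that minimum. The simulated data and the sim_* names are assumptions for illustration.

```{r}
library(glmnet)
set.seed(1)

sim_x <- matrix(rnorm(100 * 5), ncol = 5)                # 100 observations, 5 predictors
sim_y <- 2 * sim_x[, 1] - sim_x[, 2] + rnorm(100)        # only the first two predictors matter

sim_cv <- cv.glmnet(sim_x, sim_y, alpha = 0, nfolds = 5) # ridge with five-fold cross-validation
c(lambda.min = sim_cv$lambda.min, lambda.1se = sim_cv$lambda.1se)
```

Choosing lambda.1se instead of lambda.min gives a more heavily regularised model at a small cost in estimated error.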

```{r}
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha=0) # creates a big old list

par(mfrow=c(1,1))

plot(cv.out)
bestlam <- cv.out$lambda.min
bestlam # 212 is the value of lambda that produces the lowest cross-validation MSE, so we plug this back into the prediction model to get the test MSE associated with this lambda value

ridge.pred <- predict(ridge.mod, newx = test, s=bestlam)
mean((ridge.pred - y.test)^2) # 96,061
```

This represents a further improvement over the test MSE that we got using λ=4. Finally, we refit our ridge regression model on the full data set, using the value of λ chosen by cross-validation, and examine the coefficient estimates.

```{r}
out <- glmnet(x,y,alpha=0)
predict(out, type="coefficients", s=bestlam)
predict(out, type="coefficients", s=bestlam)[1:20,]
```

As expected, none of the coefficients are zero - ridge regression doesn't perform variable selection!

The Lasso

We saw that ridge regression with a wise choice of λ can outperform least squares as well as the null model on the Hitters data set. We now ask whether the lasso can yield either a more accurate or a more interpretable model than ridge regression. In order to fit a lasso model, we once again use the glmnet() function; however, this time we use the argument alpha=1. Other than that change, we proceed just as we did with fitting the ridge model.

```{r}
x <- model.matrix(Salary~., data=Hitters)[,-1] # predictor matrix without the intercept column
y <- Hitters$Salary

set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)
test <- (-train)
y.test <- y[test]

 
library(glmnet)
lasso.mod <- glmnet(x[train,], y[train], alpha=1, lambda=grid)

plot(lasso.mod)
```
We can see from the coefficient plot that depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero. We now perform cross validation and compute the associated error

```{r}
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha=1)
plot(cv.out)
bestlam <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod, s=bestlam, newx=x[test,])
mean((lasso.pred - y.test)^2) # 100,743
```

This is substantially lower than the test MSE of the null model and of least squares, and very similar to the MSE of ridge regression with λ chosen by cross-validation.

However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse. Here we see that 12 of the 19 coefficients are exactly zero. So the lasso model with λ chosen by cross-validation contains only 7 variables.

```{r}
out <- glmnet(x,y,alpha = 1, lambda=grid) # putting the whole data set in to get the most accurate coefficient estimates possible

lasso.coef <- predict(out, type="coefficients", s=bestlam)[1:20,]
lasso.coef
lasso.coef[lasso.coef!=0] # These are our lasso coefficients

```
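
The exact zeros come from the way the lasso penalty acts on each coefficient. In the simplest case of a single observation y and one coefficient β, minimising (y - β)^2 + λ|β| gives a soft-thresholded estimate: y is pulled towards zero by λ/2 and set to exactly zero when |y| is at most λ/2. The function name soft_threshold and the example values are hypothetical.

```{r}
# One-coefficient lasso solution: shrink y towards zero by lambda/2, truncating at zero
soft_threshold <- function(y, lambda) sign(y) * pmax(abs(y) - lambda / 2, 0)

soft_y <- c(-5, -1, 0.5, 1.4, 5)
data.frame(y = soft_y, lasso = soft_threshold(soft_y, lambda = 3))
```

Ridge regression, by contrast, rescales each coefficient by a constant factor in this setting, so it shrinks but never reaches zero.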

Lab 3 - PCR and PLS Regression

Principal Components Regression (PCR) can be performed using the pcr() function from the pls library. We now apply PCR to the Hitters data, in order to predict Salary. Again, ensure that the missing values have been removed from the data.


```{r}
library(pls)
set.seed(2)
pcr.fit=pcr(Salary~., data=Hitters,scale=TRUE,validation="CV")
```

The syntax is similar to lm() with a few additional options. Setting scale=TRUE has the effect of standardising each predictor prior to generating the principal components, so that the scale on which each variable is measured will not have an effect. Setting validation="CV" causes PCR to compute the 10-fold cross validation error for each possible value of M, the number of principal components used. The resulting fit can be analysed using summary()

```{r}
summary(pcr.fit)
```

The CV score is provided for each possible number of components, ranging from M=0 onwards. Note that pcr() reports the *root* MSE; in order to obtain the usual MSE, we must square the quantity. 

One can plot the cross-validation scores using the validationplot() function. Using val.type="MSEP" will cause the cross-validation MSE to be plotted

```{r}
validationplot(pcr.fit, val.type = "MSEP")
```

We see that the smallest cross-validation error occurs when M=16 components are used. This is barely fewer than M=19, which amounts to simply performing least squares, because when all of the components are used in PCR no dimension reduction occurs. However, from the plot we also see that the cross-validation error is roughly the same when only one component is included in the model. This suggests that a model that uses just a small number of components might suffice.

The summary() function also provides the *percentage of variance explained* in the predictors and in the response using different numbers of components. This is discussed in greater detail in Chapter 10. Briefly, we can think of this as the amount of information about the predictors or the response that is captured using M principal components.

For example, setting M=1 only captures 38.31% of all the variance, or information, in the predictors. In contrast, M=6 increases the value to 88.63%. If we were to use M=p=19 components, this would increase to 100%. We now perform PCR on the training data and evaluate its test performance.

```{r}
set.seed(1)
pcr.fit=pcr(Salary~., data=Hitters,subset=train,scale=TRUE, validation="CV")
validationplot(pcr.fit,val.type="MSEP")
```

We find that the lowest cross-validation error occurs when M=7 components are used. We compute the test MSE as follows

```{r}
pcr.pred=predict(pcr.fit,x[test,],ncomp=7)
mean((pcr.pred-y.test)^2)
```

The test MSE is competitive with the result obtained using ridge regression and the lasso. However, as a result of the way PCR is implemented, the final model is more difficult to interpret because it does not perform any kind of variable selection or even directly produce coefficient estimates.

Finally, we fit PCR on the full data set using M=7, the number of components identified by cross-validation

```{r}
pcr.fit <- pcr(y~x, scale=TRUE, ncomp=7)
summary(pcr.fit)
```
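
What pcr() does with M components can be sketched by hand with prcomp() and lm(): compute the principal components of the standardised predictors, then regress the response on the first M score vectors. This is an illustration on the built-in mtcars data with hypothetical pc_* names, not the pls implementation.

```{r}
pc_X <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat")])
pc_fit <- prcomp(pc_X, center = TRUE, scale. = TRUE)   # components of the standardised predictors
pc_var <- cumsum(pc_fit$sdev^2) / sum(pc_fit$sdev^2)   # cumulative share of predictor variance explained
pc_var

# Regress the response on the first M component scores
pcr_by_hand <- function(M) lm(mtcars$mpg ~ pc_fit$x[, 1:M, drop = FALSE])
ols_fit <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)

# With M = p = 4 components no dimension reduction occurs, so the fit matches least squares
pc_matches_ols <- all.equal(unname(fitted(pcr_by_hand(4))), unname(fitted(ols_fit)))
pc_matches_ols
```

The components are chosen without looking at the response, which is why a small M can explain most of the predictor variance yet much less of the variance in the response.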

Partial Least Squares

We implement partial least squares (PLS) using the plsr() function, also in the pls library. The syntax is just like that of the pcr() function

```{r}
set.seed(1)
pls.fit <- plsr(Salary~., data=Hitters, subset=train, scale=TRUE, validation="CV")
summary(pls.fit)
validationplot(pls.fit)
```

The lowest cross-validation error occurs when only M=2 partial least squares directions are used. We now evaluate the corresponding test set MSE

```{r}
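pls.pred <- predict(pls.fit, x[test,], ncomp=2) # predictions from the training-set fit using M=2 directions
mean((pls.pred - y.test)^2) # test set MSE

pls.fit <- plsr(Salary~.,data=Hitters, scale=TRUE, ncomp=2) # refit on the full data set using M=2
summary(pls.fit)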

```

Notice that the percentage of variance in Salary that the two-component PLS fit explains, 46.6%, is almost as much as that explained by the final seven-component PCR fit, 46.69%. This is because PCR only attempts to maximise the amount of variance explained in the predictors, while PLS searches for the directions that explain variance in both the predictors and the response.
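
The difference between the two first directions can be sketched in base R. The first PLS direction weights each standardised predictor in proportion to its covariance with the response, whereas the first principal component ignores the response entirely. The mtcars data and the pls_* and pc_z1 names are assumptions for illustration; this is not the pls package's code.

```{r}
pls_X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "drat")]))  # standardised predictors
pls_y <- mtcars$mpg - mean(mtcars$mpg)                              # centred response

pls_w <- crossprod(pls_X, pls_y)          # first PLS weights, proportional to X'y
pls_w <- pls_w / sqrt(sum(pls_w^2))       # normalise to unit length
pls_z1 <- drop(pls_X %*% pls_w)           # first PLS direction
pc_z1 <- prcomp(pls_X)$x[, 1]             # first principal component

# Share of variance in the response explained by each single direction
pls_r2 <- c(PLS = cor(pls_z1, pls_y)^2, PCR = cor(pc_z1, pls_y)^2)
pls_r2
```

The first PLS direction always explains at least as much of the response as the first principal component, at the price of explaining less of the predictor variance.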

```{r}
###############
## Exercises ##
###############

#1a) Best subset has the smallest training RSS because it can try every combination of predictors to find the one with the smallest training RSS. It is the most flexible of the methods.

#b) We have no way to tell without a test set // Best subset because it can use combinations that forward and backward stepwise cannot

#ci) True
#ii) True
#iii) False
#iv) False
#v) False

#2a) iii. Lasso is less flexible because it reduces some of the coefficients to zero. Fewer coefficients means less variance but slightly more bias

#b) iii. Ridge regression shrinks the coefficients given by least squares in order to reduce the variance of the coefficient estimates. 

#c) ii. Least Squares is less flexible than non-linear methods and so can lead to very high bias when modelling highly non-linear relationships. Non-linear methods can show a big improvement in prediction accuracy by trading a small increase in variance for a big decrease in model bias

#3a) iii. s=lambda so lambda = 0 starts as just the least squares fit. This is fit to minimise training RSS so any adjustment will increase training RSS.

### Incorrect ###

#s is not lambda! It is the budget for the lasso. So as the budget gets bigger we become LESS constrained, the model can fit the training data more closely, so training RSS decreases steadily (iv)

#b) iv. The test RSS will likely decrease if the true form of f(X) is anywhere close to linear as the lasso budget gets larger and we move away from just an intercept.

### Incorrect ###

# test RSS will decrease initially but as the model gets more and more flexible it will start to overfit and increase the test RSS

#c) iii. Variance will increase monotonically as the s increases and the model becomes more flexible because more and more coefficients are re-introduced. 

#d) iv. Squared bias will steadily decrease as the model becomes more and more flexible with increasing values of the budget

#e) v. The model fitting approach has no bearing on the irreducible error

#4a) iii. lambda = 0 is the least squares fit so as lambda increases we start to reduce the coefficients towards zero. The training RSS will steadily increase as the model becomes less flexible.

#b) ii. Test RSS will initially decrease as the tuning parameter reduces the variance for only a slight increase in bias. As the tuning parameter gets larger the increase in bias will eventually cause the test RSS to increase again

#c)iv. Variance will decrease monotonically as the model becomes more constrained

#d) iii. Bias will increase monotonically as the model becomes more constrained and cannot hug the training data so closely

#e) v. Different model approaches only affect the reducible error

#6a)

# p=1, y1 = 5, lambda = 3
# B1 = x
# Equation 6.15 says that Betahat(lasso) for y > lambda/2 is minimised at y-lambda/2 = 5-3/2 = 3.5

beta <- seq(-10,10,length=100)
lasso.eq <- function (beta, y=5, lambda=3) (y-beta)^2 + lambda*abs(beta)
plot(beta, lasso.eq(beta))

points(3.5, lasso.eq(3.5), pch=16, col="red", cex=2)
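
# A numerical cross-check of the point marked above (an added sketch, not part of the exercise answer).
# optimize() searches the interval for the minimiser of the same p=1 lasso objective,
# rewritten here as lasso.obj (a hypothetical helper) with y=5 and lambda=3 fixed.
lasso.obj <- function(b) (5 - b)^2 + 3*abs(b)
lasso.check <- optimize(lasso.obj, interval = c(-10, 10))
lasso.check$minimum # should agree with y - lambda/2 up to optimize()'s tolerance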

#8a)

set.seed(1)
X <- rnorm(100)
e <- rnorm(100)

B0 = 5
B1 = 10
B2 = -3
B3 = 2

Y <- B0 + B1*X + B2*X^2 + B3*X^3 + e

plot(X,Y)

library(leaps)
mydf <- data.frame(Y,X)

reg.full <- regsubsets(Y~poly(X, 10, raw=T), data=mydf, nvmax = 10)
reg.summary <- summary(reg.full)
reg.summary

names(reg.summary)

par(mfrow=c(2,2))

plot(reg.summary$cp, type="l")
which.min(reg.summary$cp) # = 4
points(4, reg.summary$cp[4], col="red", pch=16)

plot(reg.summary$bic, type="l")
which.min(reg.summary$bic) # = 3
points(3,reg.summary$bic[3], pch=16, col="red")


plot(reg.summary$adjr2, type="l")
which.max(reg.summary$adjr2) # =4
points(4,reg.summary$adjr2[4], pch=16, col="red")

coef(reg.full, 3)

## Best subset selection with the lowest BIC statistic gives the model with 3 variables with coefficients B1 = 9.975, B2 = -3.124 and B3 = 2.018 ##

## Forward Stepwise ##

reg.fwd <- regsubsets(Y~poly(X, 10, raw=T), data=mydf, nvmax = 10, method="forward")

summary.fwd <- summary(reg.fwd)
summary.fwd

plot(summary.fwd$cp, type="l")
which.min(summary.fwd$cp) # = 4
points(4,summary.fwd$cp[4], col="red", pch=16)


plot(summary.fwd$bic, type="l")
which.min(summary.fwd$bic) # = 3
points(3, summary.fwd$bic[3], col="red", pch=16)

plot(summary.fwd$adjr2, type="l")
which.max(summary.fwd$adjr2) # = 4
points(4,summary.fwd$adjr2[4], col="red", pch=16)

coef(reg.fwd, 3)

## Forward stepwise picks the same model as best subset ##

reg.bwd <- regsubsets(Y~poly(X,10,raw=T), data=mydf, nvmax=10, method="backward")

summary.bwd <- summary(reg.bwd)
summary.bwd

## Same model ##

## Lasso Model ##

library(glmnet)

# glmnet takes a matrix of x, a vector of Y and a grid of lambdas

# Use model.matrix to make the matrix

xmat <- model.matrix(Y~poly(X, 10, raw=T))[,-1] # take off the first column of ones

cv.out <- cv.glmnet(xmat,Y,alpha=1) # use cross-validation to choose a value of lambda

par(mfrow=c(1,1))
plot(cv.out)

bestlambda <- cv.out$lambda.min # lambda.min is stored on the cv.glmnet object, not on a glmnet fit

lasso.mod <- glmnet(xmat, Y, alpha=1, lambda = bestlambda)

lasso.coef <- predict(lasso.mod, type="coefficients", s=bestlambda)
lasso.coef[1:11,]

## The lasso keeps X, X2, X3 and X5 in the model ##
## X5 has a much smaller coefficient than the others ##

# New response vector #

Y2 <- 5 + 1.5*X^7 + e

## Best Subset ##

reg.full <- regsubsets(Y2 ~ poly(X, 10, raw=T), data=data.frame(Y2,X), nvmax=10)
reg.summary <- summary(reg.full)
reg.summary

par(mfrow=c(3,1))

plot(reg.summary$cp, type="l", xlab="Number of Variables", ylab = "Cp")
cp.min <- which.min(reg.summary$cp)
points(cp.min, reg.summary$cp[cp.min], col="red", pch=16)

plot(reg.summary$bic, type="l", xlab="Number of Variables", ylab="BIC")
bic.min <- which.min(reg.summary$bic)
points(bic.min, reg.summary$bic[bic.min], col="red", pch=16)

plot(reg.summary$adjr2, type="l", xlab="Number of Variables", ylab="Adj R2")
adjr2.min <- which.max(reg.summary$adjr2) # = 4
points(adjr2.min, reg.summary$adjr2[adjr2.min], col="red", pch=16)

coef(reg.full,cp.min)
coef(reg.full,bic.min)
coef(reg.full,adjr2.min)

## Best subset doesn't know wtf to do, wants to use between 1 and 4 predictors 

## Lasso ##

xmat <- model.matrix(Y2~poly(X, 10, raw=T))[,-1]

## Find the best value of lambda to use

cv.out <- cv.glmnet(xmat, Y2, alpha=1)
plot(cv.out)
names(cv.out)

bestlam <- cv.out$lambda.min

lasso.mod <- glmnet(xmat, Y2, alpha=1, lambda = bestlam)

lasso.pred <- predict(lasso.mod, s=bestlam, type="coefficients")

lasso.pred[1:11,]

# Only wants one predictor, X7 #

#9a)

data(College)
dim(College)

set.seed(1)
trainid <- sample(1:nrow(College), nrow(College)/2)
train <- College[trainid,]
test <- College[-trainid,]

#b)

fit.lm <- lm(Apps ~ ., data=train) # train and test are the data frames created in part a)
pred.lm <- predict(fit.lm, newdata = test)


err.lm <- mean((pred.lm-test$Apps)^2)
err.lm # MSE = 1,108,531

#c) # Ridge Regression #

# Make x matrix
xmat.train <- model.matrix(Apps~., data=train)[,-1]
xmat.test <- model.matrix(Apps~., data=test)[,-1]

# Make Y vector
Y <- train$Apps
# Find best value for lambda

library(glmnet)
fit.ridge <- cv.glmnet(xmat.train, Y, alpha=0) # alpha=0 for ridge, 1 for lasso
best.lam <- fit.ridge$lambda.min
best.lam # 450.7435

# Fit the model

pred.ridge <- predict(fit.ridge, s=best.lam, newx=xmat.test)

err.ridge <- mean((pred.ridge-test$Apps)^2)
err.ridge

# MSE = 1,037,616

#d) # Lasso Model #

xmat.train <- model.matrix(Apps~., data=train)[,-1]
xmat.test <- model.matrix(Apps~., data=test)[,-1]


fit.lasso <- cv.glmnet(xmat.train, train$Apps, alpha=1)
lambda <- fit.lasso$lambda.min  # optimal lambda

predict.lasso <- predict(fit.lasso, newx = xmat.test, s=lambda)
err.lasso <- mean((predict.lasso - test$Apps)^2)
err.lasso # MSE = 1,025,248

coef.lasso <- predict(fit.lasso, s=lambda, type="coefficients")
length(coef.lasso[coef.lasso!=0])

# MSE lower than ridge regression model
# 14 non-zero coefficient estimates
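
# Aside: why the lasso can set coefficients exactly to zero. In the special case
# of an orthonormal design, the lasso estimate is the least squares estimate
# passed through a soft-thresholding operator. soft_threshold() is a hypothetical
# helper for illustration, not a glmnet function, and the threshold used below is
# an arbitrary value rather than one chosen by cross-validation.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

demo.ols <- c(-3, -0.5, 0.2, 2)   # made-up least squares coefficients
soft_threshold(demo.ols, 1)       # small coefficients become exactly 0, larger ones move towards 0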

#e) # PCR - Principal Components Regression

library(pls)
set.seed(1)
pcr.fit <- pcr(Apps~., data=train,scale=TRUE,validation="CV")
validationplot(pcr.fit, val.type="MSEP")

## Cross-validation shows that the smallest MSE is at M=16 ##

pcr.pred <- predict(pcr.fit, newdata = test, ncomp = 16)
err.pcr <- mean((pcr.pred - test$Apps)^2)
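
# Aside: a minimal sketch of what pcr() does conceptually, assuming scaled
# predictors: compute the principal components of X, then regress the response
# on the first M score vectors. pcr_by_hand() is a hypothetical helper for
# illustration and uses the built-in mtcars data, not the College data.
pcr_by_hand <- function(x, y, M) {
  pc <- prcomp(x, center = TRUE, scale. = TRUE)  # principal components of scaled X
  scores <- pc$x[, 1:M, drop = FALSE]            # first M component scores
  fitted(lm(y ~ scores))                         # least squares on the scores
}

demo.pcx <- as.matrix(mtcars[, -1])
demo.pcr.mse <- sapply(1:ncol(demo.pcx), function(M) {
  mean((pcr_by_hand(demo.pcx, mtcars$mpg, M) - mtcars$mpg)^2)
})
demo.pcr.mse  # training MSE by number of components; using all components matches ordinary least squares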

# Fitting the 16 component model to the full data

pcr.fit <- pcr(Apps~., data=College, ncomp=16)
summary(pcr.fit) # 16 component model explains 92.8% of the variance in Apps

#f) PLS model

pls.fit <- plsr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pls.fit, val.type="MSEP")

# Cross validation selects M=10 #

pls.pred <- predict(pls.fit, newdata = test, ncomp=10)
err.pls <- mean((pls.pred-test$Apps)^2)

# Test MSE = 1,132,021
# Fit to the whole data

pls.fit <- plsr(Apps~.,data=College, ncomp=10)
summary(pls.fit)

# 10 component model explains 92.28% of the variance in Apps

#g)

err.total <- c(err.lm, err.ridge, err.lasso, err.pcr, err.pls)
names(err.total) <- c("Least Squares","Ridge","Lasso","PCR","PLS")

err.total
barplot(err.total)

# Lasso gave the smallest test error, PCR and PLS are no better than the full linear regression model
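
# Aside: test MSE is hard to judge on its own scale, so one option is to convert
# it to a test R-squared, the fraction of test-set variance in the response that
# the predictions explain. test_r2() is a hypothetical helper for illustration.
test_r2 <- function(pred, actual) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}

test_r2(c(1, 2, 3), c(1, 2, 3))   # perfect predictions
test_r2(rep(2, 3), c(1, 2, 3))    # predicting the mean explains nothing
# e.g. test_r2(predict.lasso, test$Apps) for the lasso fit from part d)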

#10a)

set.seed(1)
eps <- rnorm(1000)
xmat <- matrix(rnorm(1000*20), ncol=20) # creates a 1000x20 matrix of normally distributed random variables with mean=0, sd=1
betas <- sample(-5:5, 20, replace=TRUE)
betas[c(3,6,7,10,13,17)] <- 0 # we've made some of the beta values zero to make the model size with the minimum test MSE !=20

betas

y <- xmat %*% betas + eps


#b) 

set.seed(1)
trainid <- sample(1:1000, 100, replace=FALSE)
xmat.train <- xmat[trainid,]
xmat.test <- xmat[-trainid,]
y.train <- y[trainid,]
y.test <- y[-trainid,]

train <- data.frame(y=y.train, xmat.train)
test <- data.frame(y=y.test, xmat.test)

#c)

# Fit the best subset selection for the training data
regfit.full <- regsubsets(y~., data=train, nvmax=20)

# Generate empty vector to store the training MSEs for the best model for each size generated by best subset selection
err.full <- rep(NA,20)

# Create model matrix with the data we want to run the best subset MSEs for

mat <- model.matrix(y~., train)

# For each model size i, we pull out the coefficient values and the names of those coefficients. We multiply the coefficients by the matching columns of the matrix created above to get the predicted values, then store the training MSE in the vector created above.

for (i in 1:20) {

  coefi <- coef(regfit.full, id=i)
  xvars <- names(coefi)
  pred.full <- mat[,xvars]%*%coefi
  err.full[i] <- mean((pred.full-train$y)^2)
}

err.full

# Plot the errors
plot(seq_along(err.full), err.full, main="Training MSEs", xlab="Number of Variables", ylab="Training MSE", type="b")
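
# Aside: the loop above can be wrapped in a reusable function. This is a sketch
# along the lines of the prediction helper used in the ISLR lab; predict_subset()
# is a name chosen here for illustration, and it assumes the regsubsets object
# was fitted with a formula as its first argument.
library(leaps)
predict_subset <- function(object, newdata, id) {
  form <- as.formula(object$call[[2]])   # recover the formula from the original call
  mat <- model.matrix(form, newdata)     # design matrix for the new data
  coefi <- coef(object, id = id)         # coefficients of the best model of size id
  drop(mat[, names(coefi)] %*% coefi)
}

# Illustration on the built-in mtcars data
demo.subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
demo.pred <- predict_subset(demo.subsets, mtcars, id = 3)
mean((demo.pred - mtcars$mpg)^2)   # training MSE of the best 3-variable model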

#d) 

mat <- model.matrix(y~., test)
err.test <- rep(NA,20)

for (i in 1:20) {
  coefi <- coef(regfit.full, id=i)
  pred.full <- mat[,names(coefi)] %*% coefi
  err.test[i] <- mean((pred.full - test$y)^2)
}

err.test

plot(seq_along(err.test), err.test, type="b", main="Test MSEs", xlab="Number of Variables", ylab="Test MSE")

#e)

which.min(err.test) # The 13 variable model has the smallest test MSE

#f) 

best.mod.coefs <- coef(regfit.full, 13)
best.mod.coefs
betas

# The coefficients that the model found are very close to the true values of the betas

#g) 
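# For each model size r, compute sqrt(sum_j (beta_j - betahat_j^r)^2), taking betahat_j^r = 0 for predictors left out of the model
error.coefs <- rep(NA, 20)
beta.names <- names(train)[-1]
for (r in 1:20) {
  coefr <- coef(regfit.full, id=r)[-1] # drop the intercept
  beta.hat <- setNames(rep(0, 20), beta.names)
  beta.hat[names(coefr)] <- coefr
  error.coefs[r] <- sqrt(sum((betas - beta.hat)^2))
}

plot(1:20, error.coefs, type="b", xlab="Number of Variables", ylab="Coefficient Error")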

#11a)

# Predicting per capita crime rate (crim) using all methods #

library(MASS)
data(Boston)
str(Boston)

## Splitting data into training and test set ##

dim(Boston)
set.seed(1)
trainid <- sample(1:506, 253, replace=FALSE)

train <- Boston[trainid,]
test <- Boston[-trainid,]

dim(train)
dim(test)

# Best Subset Selection #


regfit.full <- regsubsets(crim~., data=train, nvmax = 13)
regfit.summary <- summary(regfit.full)

# Model selection # 

par(mfrow=c(3,1))

plot(regfit.summary$cp, type="l", xlab="Number Of Variables", ylab="Best Subset Cp")
min.cp <- which.min(regfit.summary$cp)
points(min.cp, regfit.summary$cp[min.cp], pch=16, col="red")

plot(regfit.summary$bic, type="l", xlab="Number Of Variables", ylab="Best Subset BIC")
min.bic <- which.min(regfit.summary$bic)
points(min.bic, regfit.summary$bic[min.bic], pch=16, col="red")

plot(regfit.summary$adjr2, type="l", xlab="Number Of Variables", ylab="Best Subset Adj R2")
max.adjr2 <- which.max(regfit.summary$adjr2)
points(max.adjr2, regfit.summary$adjr2[max.adjr2], pch=16, col="red")

## Let's try the 2 variable model and the 8 variable model ##

eightvar.coefs <- coef(regfit.full, 8)
twovar.coefs <- coef(regfit.full, 2)

## Computing the test MSE for both ##

mat <- model.matrix(crim~., data=test)
pred <- mat[,names(eightvar.coefs)] %*% eightvar.coefs
err.8var <- mean((pred - test$crim)^2)
err.8var # = 39.567

pred2 <- mat[,names(twovar.coefs)] %*% twovar.coefs
err.2var <- mean((pred2 - test$crim)^2)
err.2var # = 39.467

coef(regfit.full, 8) # Using these coefficients

## Ridge Regression ##

xmat.train <- model.matrix(crim~., data=train)[,-1]
y.train <- train$crim
xmat.test <- model.matrix(crim~., data=test) [,-1]
y.test <- test$crim

## Find best value for lambda

ridge.fit <- cv.glmnet(xmat.train, y.train, alpha=0)

best.lambda <- ridge.fit$lambda.min
best.lambda

ridge.pred <- predict(ridge.fit, newx = xmat.test, s=best.lambda)
ridge.err <- mean((ridge.pred-y.test)^2)
ridge.err # = 38.364

ridge.coefs <- predict(ridge.fit, s=best.lambda, type="coefficients")

## Lasso Fit ##

lasso.fit <- cv.glmnet(xmat.train, y.train, alpha=1)

best.lambda <- lasso.fit$lambda.min
best.lambda

lasso.pred <- predict(lasso.fit, s=best.lambda, newx = xmat.test)
lasso.err <- mean((lasso.pred-y.test)^2)
lasso.err # = 38.38

lasso.coefs <- predict(lasso.fit, s=best.lambda, type="coefficients")

lasso.coefs # age and tax coefficients reduced to zero

## PCR ##

pcr.fit <- pcr(crim~., data=train, validation="CV", scale=TRUE)

validationplot(pcr.fit, val.type = "MSEP")

# 10 components yields the smallest cross-validated MSE

pcr.pred <- predict(pcr.fit, newdata = test, ncomp = 10)

pcr.err <- mean((pcr.pred-y.test)^2)

pcr.err # = 39.67

err.all <- c(err.8var, ridge.err, lasso.err, pcr.err)
names(err.all) <- c("Best Subset", "Ridge","Lasso","PCR")


barplot(err.all, main="MSEs", ylim=c(0,40))

# All methods yield very similar test MSE, ridge regression comes out slightly on top but I would go with the lasso since it offers greater interpretability in return for a negligible difference in test MSE #

## Model Proposal ##

# Lasso with all features except age and tax, with the following coefficients

lasso.coefs
```