---
title: "Chapter 7 - Beyond Linearity"
author: "Dan"
date: "16 February 2018"
output: html_document
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
library(ISLR)
attach(Wage)

# Polynomial Regression and Step Functions #

fit <- lm(wage~poly(age, 4), data=Wage)
coef(summary(fit))

# raw = FALSE (the default) makes poly() return orthogonal polynomials: each column is a linear combination of age, age^2, age^3 and age^4, constructed so that the columns are uncorrelated with one another. This avoids the near-collinearity of the raw powers, where age^k and age^(k+1) are almost indistinguishable as predictors.

# raw = TRUE returns the raw powers age, age^2, age^3 and age^4 themselves, not linear combinations of them.

poly(1:10, 2)
poly(1:10, 2, raw=T)
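
# Illustrative check (not part of the original lab): a minimal sketch of how the two
# bases differ. The names demo.x, demo.raw and demo.orth are made up for this example,
# and demo.x is just an assumed age-like grid of integers.
demo.x <- 18:80
demo.raw <- poly(demo.x, 4, raw = TRUE)   # columns are x, x^2, x^3, x^4
demo.orth <- poly(demo.x, 4)              # orthogonal polynomial columns
round(cor(demo.raw), 3)                   # raw powers are almost perfectly correlated
round(crossprod(demo.orth), 3)            # identity matrix: columns are orthonormal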

# Later we see that this does not affect the model in a meaningful way: although the choice of basis clearly affects the coefficient estimates, it does not affect the fitted values obtained.

fit2 <- lm(wage~poly(age, 4, raw=TRUE), data=Wage)
coef(summary(fit2))

# We now create a grid of values for age at which we want predictions, and then call the generic predict() function, specifying that we want standard errors as well.

agelims <- range(age) # returns 2 element vector of the limits
age.grid <- seq(agelims[1], agelims[2]) # vector from beginning to end of agelims

preds <- predict(fit, newdata = list(age=age.grid), se=TRUE) # Not using list(age=age.grid) returns an error

se.bands <- cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit) # standard error bands are the predictions (preds$fit) +/- 2*SE (preds$se.fit)
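
# Aside (a sketch on the built-in cars data, not the Wage fit; demo.* names are
# hypothetical): predict.lm() can also return an exact t-based confidence interval
# via interval = "confidence". The +/- 2*SE bands are a close approximation to it.
demo.fit <- lm(dist ~ poly(speed, 2), data = cars)
demo.grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed)))
demo.se <- predict(demo.fit, newdata = demo.grid, se.fit = TRUE)
demo.ci <- predict(demo.fit, newdata = demo.grid, interval = "confidence", level = 0.95)
demo.approx <- cbind(demo.se$fit - 2 * demo.se$se.fit, demo.se$fit + 2 * demo.se$se.fit)
max(abs(demo.approx - demo.ci[, c("lwr", "upr")]))  # small compared with the standard errors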

# Finally, we plot the data and add the fit from the degree-4 polynomial

par(mfrow=c(1,2), mar=c(4.5,4.5,1,1), oma=c(0,0,4,0)) # the mar and oma arguments to par() control the inner and outer margins of the plot, see ?par
plot(age, wage, xlim=agelims, cex=.5, col="darkgrey") 
title("Degree 4 Polynomial", outer=T) # adds title , outer=T titles the whole pane
lines(age.grid, preds$fit, lwd=2, col="blue") # adds a line with x values of the age limits and y values from the predict() function
matlines(age.grid, se.bands, lwd=1, col="blue", lty=3) # adds the standard error bands, an approximate 95% CI. matlines() plots the columns of a matrix; here se.bands is the matrix we created earlier

# We mentioned earlier that whether or not an orthogonal set of basis functions is produced in the poly() function will not affect the model obtained in a meaningful way. What do we mean by this? The fitted values obtained in either case are identical:

preds2 <- predict(fit2, newdata=list(age=age.grid), se=TRUE)
max(abs(preds$fit - preds2$fit)) # Almost no difference

# In performing a polynomial regression, we must decide on the degree of polynomial to use. One way to do this is by using hypothesis tests. We now fit models ranging from linear to a degree-5 polynomial and seek to determine the simplest model which is sufficient to explain the relationship between wage and age.

# We use the anova() function, which performs an analysis of variance (ANOVA, using an F-test) in order to test the null hypothesis that a model M1 is sufficient to explain the data against the alternative hypothesis that a more complex model M2 is required. In order to use the anova() function, M1 and M2 must be nested models: the predictors in M1 must be a subset of the predictors in M2. In this case, we fit five different models and sequentially compare the simpler model to the more complex model

fit.1 <- lm(wage~age, data=Wage)
fit.2 <- lm(wage~poly(age, 2), data=Wage)
fit.3 <- lm(wage~poly(age, 3), data=Wage)
fit.4 <- lm(wage~poly(age, 4), data=Wage)
fit.5 <- lm(wage~poly(age, 5), data=Wage)

anova(fit.1, fit.2, fit.3, fit.4, fit.5)
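
# How anova() arrives at its F-statistic, sketched by hand on the built-in cars data
# (demo.* names are hypothetical). For nested models M1 inside M2:
#   F = ((RSS1 - RSS2) / (df1 - df2)) / (RSS2 / df2),  df = residual degrees of freedom.
demo.m1 <- lm(dist ~ speed, data = cars)
demo.m2 <- lm(dist ~ poly(speed, 2), data = cars)
demo.rss1 <- sum(resid(demo.m1)^2)
demo.rss2 <- sum(resid(demo.m2)^2)
demo.df1 <- df.residual(demo.m1)
demo.df2 <- df.residual(demo.m2)
demo.F <- ((demo.rss1 - demo.rss2) / (demo.df1 - demo.df2)) / (demo.rss2 / demo.df2)
demo.p <- pf(demo.F, demo.df1 - demo.df2, demo.df2, lower.tail = FALSE)
c(F = demo.F, p = demo.p)
anova(demo.m1, demo.m2)   # the second row reports the same F and Pr(>F)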

# The p-value comparing the linear Model 1 to the quadratic Model 2 is essentially zero, indicating that a linear fit is not sufficient: at a p-value below 0.05 we reject the null hypothesis that the simpler model is sufficient in favour of the alternative that a more complex model is required.

# Similarly, the p-value comparing the quadratic Model 2 to the cubic Model 3 is very low, so the quadratic fit is also insufficient. The p-value comparing the cubic and degree-4 polynomials, Model 3 and Model 4, is approximately 5%, while the degree-5 polynomial Model 5 seems unnecessary because its p-value is 0.37. Hence, either a cubic or a quartic polynomial appears to provide a reasonable fit to the data, but lower- or higher-order models are not justified.

# In this case, instead of using the anova() function we could have obtained the p-values more succinctly by exploiting the fact that poly() creates orthogonal polynomials.

coef(summary(fit.5))

# Notice that the p-values are the same, and in fact the squares of the t-statistics are equal to the F-statistics from the anova() function; for example:

(-11.983)^2 # = 143.59 (t-stat from quadratic coef = F-stat from anova)
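
# The same identity without copying numbers by hand: a sketch on the built-in cars data
# (hypothetical demo.* names). The squared t-statistic of the highest-degree term equals
# the F-statistic for adding that term to the next-smaller model.
demo.lin <- lm(dist ~ poly(speed, 1), data = cars)
demo.quad <- lm(dist ~ poly(speed, 2), data = cars)
demo.tval <- coef(summary(demo.quad))[3, "t value"]   # row 3 is the quadratic term
demo.Fq <- anova(demo.lin, demo.quad)$F[2]
c(t.squared = demo.tval^2, F = demo.Fq)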

# However, the ANOVA method works whether or not we use the orthogonal polynomials; it also works when we have other terms in the model as well. For example, we can use anova() to compare these three models

fit.1 <- lm(wage~education+age, data=Wage)
fit.2 <- lm(wage~education+poly(age,2), data=Wage)
fit.3 <- lm(wage~education+poly(age,3), data=Wage)

anova(fit.1, fit.2, fit.3)

# As an alternative to hypothesis tests and ANOVA, we could choose the polynomial degree using cross-validation, as discussed in Chapter 5.

# Next we consider the task of predicting whether an individual earns more than $250,000 per year. We proceed much as before, except that first we create the appropriate response vector, and then apply the glm() function using family="binomial" in order to fit a polynomial logistic regression model

fit <- glm(I(wage>250)~poly(age, 4), data=Wage, family="binomial")

# We use the wrapper I() to create the binary response on the fly rather than having to create dummy variables in the data set. The expression wage>250 evaluates to a logical variable which glm() coerces to binary

# Once again, we make predictions using the predict() function

preds <- predict(fit, newdata=list(age=age.grid), se=T)

# However, calculating the confidence intervals is slightly more involved than in the linear case. The default prediction type for a glm() model is type="link", which is what we use here. This means we get predictions for the logit (log odds): that is, we have fit a model of the form 

# log(Pr(Y=1|X) / (1 - Pr(Y=1|X))) = XB

# and the predictions given are also of the form XB. The standard errors given are also in this form. In order to obtain confidence intervals for Pr(Y=1|X), we use the transformation

# Pr(Y=1|X) = exp(XB) / (1 + exp(XB))

pfit <- exp(preds$fit)/ (1+exp(preds$fit))

se.bands.logit <- cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)

se.bands <- exp(se.bands.logit) / (1+exp(se.bands.logit))
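
# A sketch with made-up numbers (demo.* names and values are hypothetical):
# exp(x)/(1+exp(x)) is the inverse logit, available in base R as plogis().
# Transforming the band endpoints from the logit scale keeps both limits inside
# (0, 1), which fit +/- 2*SE on the probability scale would not guarantee.
demo.eta <- c(-6, -3, 0, 3)      # example linear-predictor values (log odds)
demo.se.eta <- 1.5               # an assumed standard error on the logit scale
all.equal(exp(demo.eta) / (1 + exp(demo.eta)), plogis(demo.eta))
demo.band <- cbind(lower = plogis(demo.eta - 2 * demo.se.eta),
                   upper = plogis(demo.eta + 2 * demo.se.eta))
demo.band                        # every entry lies strictly between 0 and 1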

# Finally, the right-hand plot from Fig 7.1 was made as follows:

plot(age, I(wage>250), xlim = agelims, type="n", ylim=c(0,0.2)) # type="n" draws an empty plot; the points and lines are added separately. max(pfit) is about 0.09, so we set the y limit to 0.2

points(jitter(age), I((wage>250)/5), cex=0.5, pch="|", col="darkgrey") # the indicator is divided by 5 so that the 1s are drawn at 0.2, the top of the y-axis
lines(age.grid, pfit, lwd=2, col="blue")
matlines(age.grid, se.bands, lwd=1, col="blue", lty=3)

# In order to fit a step function, we can use the cut() function

table(cut(age,4))
fit <- lm(wage~cut(age,4), data=Wage)
coef(summary(fit))

# Here cut() automatically picked the cut points of age. We could also have specified our own cutpoints directly using the breaks option. The function cut() returns a factor whose levels are the age intervals in increasing order; the lm() function then creates a set of dummy variables for use in the regression. The age<33.5 category is left out, so the intercept coefficient of $94,158 can be interpreted as the mean wage for those under 33.5 years of age, and the other coefficients can be interpreted as the average additional wage for those in other age groups. We can produce predictions and plots just as we did in the polynomial fit.
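
# A sketch of the breaks option on a small made-up vector (demo.age and the cut
# points are illustrative, not taken from the Wage fit).
demo.age <- c(18, 25, 33, 34, 49, 50, 64, 65, 80)
demo.bins <- cut(demo.age, breaks = c(17, 33, 49, 64, 80))  # intervals are (a, b] by default
table(demo.bins)
levels(demo.bins)
# In a regression the first level is the baseline absorbed into the intercept;
# model.matrix() shows the dummy columns that lm() would build for the other levels.
head(model.matrix(~ demo.bins))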

pred <- predict(fit, newdata = list(age=age.grid), se=TRUE)

se.bands <- cbind(pred$fit + 2*pred$se.fit, pred$fit - 2*pred$se.fit) # upper and lower bands: fit +/- 2*SE

plot(age, wage, cex=.5, col="darkgrey")
lines(age.grid, pred$fit, lwd=3)
matlines(age.grid, se.bands, lwd=1, lty=3, col="blue")

# Splines #

# In order to fit regression splines in R, we use the splines library. In Section 7.4, we saw that regression splines can be fit by constructing an appropriate matrix of basis functions. The bs() function generates the entire matrix of basis functions for splines with the specified set of knots. By default, cubic splines are produced. Fitting wage to age using a regression spline is simple:

library(splines)
fit <- lm(wage~bs(age, knots=c(25,40,60)), data=Wage)
pred <- predict(fit, newdata = list(age=age.grid), se=T)
plot(age, wage, col="gray")
lines(age.grid, pred$fit, lwd=2)
lines(age.grid, pred$fit+2*pred$se.fit, lty="dashed")
lines(age.grid, pred$fit-2*pred$se.fit, lty="dashed")

# Here we have pre-specified knots at ages 25, 40 and 60. This produces a spline with six basis functions (Recall that a cubic spline with three knots has seven degrees of freedom; these degrees of freedom are used up by an intercept, plus six basis functions. )

dim(bs(age, knots=c(25,40,60))) # 6 column matrix for the 6 basis functions

dim(bs(age, df=6))
attr(bs(age, df=6),"knots") # extracts the "knots" attribute from the bs() call, = 33.75, 42 and 51

# In this case, R chooses knots at ages 33.8, 42 and 51, which correspond with the 25th, 50th and 75th percentiles of age. The function bs() also has a degree argument, so we can fit splines of any degree, rather than the default degree of 3 (which yields a cubic spline).
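
# A quick check of the counting argument on an assumed age-like grid
# (demo.x.sp is a hypothetical name; the splines package ships with R).
library(splines)
demo.x.sp <- 18:80
ncol(bs(demo.x.sp, knots = c(25, 40, 60)))              # 3 knots + degree 3 = 6 columns
attr(bs(demo.x.sp, df = 6), "knots")                    # df = 6 leaves 3 interior knots
quantile(demo.x.sp, c(0.25, 0.5, 0.75))                 # the quartiles, for comparison
ncol(bs(demo.x.sp, knots = c(25, 40, 60), degree = 1))  # piecewise linear: 3 + 1 = 4 columns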

# In order to instead fit a natural spline, we use the ns() function. Here we fit a natural spline with four degrees of freedom.

fit2 <- lm(wage~ns(age, df=4), data=Wage)
pred2 <- predict(fit2, newdata = list(age=age.grid), se=T)
lines(age.grid, pred2$fit, col="red", lwd=2)
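
# A sketch of what "natural" adds, on the built-in cars data (demo.* names are
# hypothetical): beyond the boundary knots a natural spline is constrained to be
# linear, so second differences of predictions on an equally spaced grid vanish there.
library(splines)
demo.ns <- lm(dist ~ ns(speed, df = 4), data = cars)
demo.beyond <- data.frame(speed = seq(26, 30, by = 1))  # all above max(cars$speed) = 25
demo.pred.ns <- predict(demo.ns, newdata = demo.beyond)
diff(demo.pred.ns, differences = 2)                     # numerically zero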

# As with the bs() function, we could instead specify the knots directly using the knots argument.

# In order to fit a smoothing spline, we use the smooth.spline() function. Figure 7.8 was produced with the following code:

plot(age, wage, xlim=agelims, cex=0.5, col="darkgrey")
title ("Smoothing Spline")
fit <- smooth.spline(age, wage, df=16)
fit2 <- smooth.spline(age, wage, cv=TRUE) # cross validation to choose best value for df
fit2$df
lines(fit,col="red", lwd=2)
lines(fit2 ,col="blue",lwd=2)
legend("topright", legend=c("16 DF", "6.8 DF"), col=c("red","blue"), lty=1, lwd = 2, cex=.8)

# Notice that in the first call to smooth.spline() we specified df=16. The function then determines which value of lambda leads to 16 degrees of freedom. In the second call to smooth.spline(), we select smoothness level by cross-validation; this results in a value of lambda that yields 6.8 degrees of freedom. 
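
# A sketch of the df/lambda relationship on the built-in cars data (demo.* names are
# hypothetical): asking for more effective degrees of freedom makes smooth.spline()
# settle on a smaller penalty lambda, i.e. a wigglier fit.
demo.ss5 <- smooth.spline(cars$speed, cars$dist, df = 5)
demo.ss10 <- smooth.spline(cars$speed, cars$dist, df = 10)
c(df = demo.ss5$df, lambda = demo.ss5$lambda)
c(df = demo.ss10$df, lambda = demo.ss10$lambda)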

# In order to perform local regression, we use the loess() function

plot(age,wage,xlim=agelims, cex=0.5, col="darkgrey")
title("Local Regression")
fit <- loess(wage~age, span=.2, data=Wage)
fit2 <- loess(wage~age, span=.5, data=Wage)

lines(age.grid, predict(fit, newdata = data.frame(age=age.grid)), col="red", lwd=2) 

lines(age.grid, predict(fit2, newdata = data.frame(age=age.grid)), col="blue", lwd=2)

legend("topright", legend=c("Span=0.2","Span=0.5"), col=c("red","blue"), lty=1, lwd=2, cex=.8)

# Here we have performed local regression using spans of 0.2 and 0.5: that is, each neighbourhood consists of 20% or 50% of the observations. The larger the span, the smoother the fit. Note that loess() fits local quadratics by default (degree=2); set degree=1 for a local linear fit. The locfit library can also be used to fit local regression models in R.
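
# A sketch of the span trade-off on deterministic made-up data (demo.* names are
# hypothetical; no random numbers are drawn). A smaller span uses more equivalent
# parameters and follows the data more closely.
demo.t <- seq(0, 10, length.out = 200)
demo.y <- sin(demo.t) + 0.3 * sin(25 * demo.t)   # smooth signal plus a fast wiggle
demo.lo2 <- loess(demo.y ~ demo.t, span = 0.2)
demo.lo5 <- loess(demo.y ~ demo.t, span = 0.5)
c(span0.2 = demo.lo2$enp, span0.5 = demo.lo5$enp)                       # equivalent number of parameters
c(span0.2 = sum(resid(demo.lo2)^2), span0.5 = sum(resid(demo.lo5)^2))   # training RSS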

## GAMs ##

# We now fit a generalised additive model to predict wage using natural spline functions of year and age, treating education as a qualitative predictor, as in (7.16). Since this is just a big linear regression model using an appropriate choice of basis functions, we can simply do this by using the lm() function

gam1 <- lm(wage~ns(year, df=4)+ns(age, df=5)+education, data=Wage)

# We now fit the model (7.16) using smoothing splines rather than natural splines. In order to fit more general sorts of GAMs using smoothing splines or other components that cannot be expressed in terms of basis functions and then fit using least squares regression, we will need to use the 'gam' library. 

# The s() function is used to indicate that we would like to use a smoothing spline. We specify that the function of year should have 4 degrees of freedom, and that the function of age will have 5 degrees of freedom. Since education is qualitative, we leave it as it is, and it is converted into four dummy variables. We use the gam() function in order to fit a GAM using these components. All of the terms in (7.16) are fit simultaneously, taking each other into account to explain the response.

library(gam)
gam.m3 <- gam(wage~s(year, 4)+s(age,5)+education, data=Wage)
par(mfrow=c(1,3))
plot(gam.m3, se=TRUE, col="blue")

# The generic plot() function recognises that gam.m3 is an object of class gam, and invokes the appropriate plot.gam() method. Conveniently, even though gam1 is not of class gam but rather of class lm, we can still use plot.gam() on it. Figure 7.11 was produced using the following expression:

plot.gam(gam1, se=TRUE, col="red")

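# Notice here that we had to use plot.gam() rather than the generic plot() function. In more recent releases of the gam package the class and its methods were renamed (class Gam, method plot.Gam()), so there the call is plot.Gam(gam1, se=TRUE, col="red").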

# In these plots, the function of year looks rather linear. We can perform a series of ANOVA tests in order to determine which of these three models is best: a GAM that excludes year (M1), a GAM that uses a linear function of year (M2), or a GAM that uses a spline function of year (M3). 

gam.m1 <- gam(wage~s(age, 5)+education, data=Wage)
gam.m2 <- gam(wage~year+s(age, 5)+education, data=Wage)

anova(gam.m1, gam.m2, gam.m3, test="F")

# We find that there is compelling evidence that a GAM with a linear function of year is better than a GAM that does not include year at all (p-value=0.00014). However, there is no evidence that a non-linear function of year is needed (p-value=0.35). In other words, based on the results of this ANOVA, M2 is preferred.

# The summary() function produces a summary of the GAM fit

summary(gam.m3)

# The p-values for year and age correspond to the null hypothesis of a linear relationship versus the alternative of a non-linear relationship. The large p-value for year reinforces our conclusion from the ANOVA test that a linear function is adequate for this term. However, there is very clear evidence that a non-linear term is required for age.

# We can make predictions from gam objects, just like from lm objects, using the predict() method for the class gam. Here we make predictions on the training set

preds <- predict(gam.m2, newdata=Wage)

# We can also use local regression fits as building blocks in a GAM, using the lo() function

gam.lo <- gam(wage~s(year, df=4)+lo(age, span=0.7)+education, data=Wage)

plot.gam(gam.lo, se=TRUE, col="green")

# Here we have used local regression for the age term, with a span of 0.7. We can also use the lo() function to create interactions before calling the gam() function. For example:

gam.lo.i <- gam(wage~lo(year, age, span=0.5)+education, data=Wage)

# fits a two-term model, in which the first term is an interaction between year and age, fit by a local regression surface. We can plot the resulting two-dimensional surface if we first install the akima package.

library(akima)
plot(gam.lo.i)

# In order to fit a logistic regression GAM, we once again use the I() function in constructing the binary response variable, and set family="binomial"

gam.lr <- gam(I(wage>250)~year+s(age, df=5)+education, family="binomial", data=Wage)
par(mfrow=c(1,3))
plot(gam.lr, se=TRUE, col="green")

# It is easy to see from the huge SE that there are no high earners in the <HS category of education

table(education, I(wage>250))

# Hence, we fit a logistic regression GAM using all but this category. This provides more sensible results

gam.lr.s <- gam(I(wage>250)~year+s(age, df=5)+education, family="binomial", data=Wage, subset=(education!="1. < HS Grad"))
plot(gam.lr.s, se=T, col="green")

#############
# Exercises #
#############

# 3)

X <- seq(-2, 2, 0.1)
Y <- 1 + X + (-2*(X-1)^2)*I(X>=1)

par(mfrow=c(1,1))
plot(X,Y, type="b")
abline(v=0, lty="dashed")
abline(h=0, lty="dashed")

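# y intercept = 1 (at X = 0)
# x intercept = -1 (on the linear piece)
# for X < 1 the curve is the line Y = 1 + X (slope 1); for X >= 1 it is the quadratic Y = 1 + X - 2(X-1)^2 = -2X^2 + 5X - 1, which peaks at X = 1.25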
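
# A sketch that checks the piecewise description of the curve numerically
# (demo.* names are hypothetical): expanding 1 + X - 2(X-1)^2 gives -2X^2 + 5X - 1,
# so the indicator form and the explicit two-piece form should agree everywhere.
demo.X <- seq(-2, 2, 0.1)
demo.Y <- 1 + demo.X - 2 * (demo.X - 1)^2 * (demo.X >= 1)
demo.piece <- ifelse(demo.X < 1, 1 + demo.X, -2 * demo.X^2 + 5 * demo.X - 1)
all.equal(demo.Y, demo.piece)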

#4) 

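# b1(X) = I(0<=X<=2) - (X-1)I(1<=X<=2); b2(X) is zero on [-2, 2], so only b1 contributes on this range
Y <- 1 + I(X>=0 & X<=2) - (X-1)*I(X>=1 & X<=2)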

plot(X,Y, type="b", ylim=c(-2,2))
abline(v=0, lty="dashed")
abline(h=0, lty="dashed")

#####
# Applied
#####

#6a)

library(boot)
set.seed(1)
deltas <- rep(NA,10)

for (i in 1:10) {
  fit <- glm(wage~poly(age, i), data=Wage) 
  deltas[i] <- cv.glm(Wage, fit, K=10)$delta[1]
}
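
# What cv.glm() is estimating, sketched by hand on the built-in cars data.
# Assumptions: the demo.* names are hypothetical, and folds are assigned round-robin
# rather than at random so that this aside does not disturb the random number stream.
demo.k <- 5
demo.fold <- rep(seq_len(demo.k), length.out = nrow(cars))
demo.cv <- sapply(1:4, function(d) {
  fold.mse <- sapply(seq_len(demo.k), function(f) {
    train.fit <- lm(dist ~ poly(speed, d), data = cars[demo.fold != f, ])
    held.out <- cars[demo.fold == f, ]
    mean((held.out$dist - predict(train.fit, newdata = held.out))^2)
  })
  mean(fold.mse)   # average held-out MSE over the folds (equal fold sizes here)
})
demo.cv            # one CV estimate per polynomial degree 1..4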

which.min(deltas) # = 4th degree

plot(deltas, type="b")

fit.poly1 <- lm(wage~age, data=Wage)
fit.poly2 <- lm(wage~poly(age, 2), data=Wage)
fit.poly3 <- lm(wage~poly(age, 3), data=Wage)
fit.poly4 <- lm(wage~poly(age, 4), data=Wage)
fit.poly5 <- lm(wage~poly(age, 5), data=Wage)

anova(fit.poly1, fit.poly2, fit.poly3, fit.poly4, fit.poly5)

# ANOVA also picks the 4 degree polynomial #

agerange <- range(age)
agelims <- seq(agerange[1],agerange[2])

plot(age, wage, col="darkgrey", cex=.5)

pred <- predict(fit.poly4, newdata = list(age=agelims), se=T)

lines(agelims, pred$fit, lwd=2, col="blue")

se.bands <- cbind(pred$fit - 2*pred$se.fit, pred$fit + 2*pred$se.fit)

matlines(agelims, se.bands, lty="dashed", lwd=1)
title("4th Degree Polynomial")

## Find optimal number of cuts for step function using CV ## 

cv.err <- rep(NA,10) # indexed by number of cuts (2 to 10); entry 1 stays NA

for (i in 2:10) {
  Wage$age.cut <- cut(Wage$age, i)
  glm.fit <- glm(wage~age.cut, data=Wage)
  cv.err[i] <- cv.glm(Wage, glm.fit, K=10)$delta[1]
}

plot(cv.err, type="b")
which.min(cv.err) # cutting into 8 shows the smallest CV error

glm.fit <- glm(wage~cut(age, 8), data=Wage)
pred <- predict(glm.fit, newdata = list(age=agelims), se=T)

se.bands <- cbind(pred$fit - 2*pred$se.fit, pred$fit + 2*pred$se.fit)

plot(age,wage, cex=.5, col="darkgrey")
lines(agelims, pred$fit, lwd=3, col="blue")
matlines(agelims, se.bands, lty="dashed", col="red", lwd=1)

#7)

## Is a linear model sufficient? ANOVA hypothesis test or cross-validation ##

## Exploring the relationships between maritl and wage, jobclass and wage ##

plot(maritl,wage) # Married seems to have the highest wage overall but this could be driven by age

plot(jobclass,wage) # People in the Information job class tend to have higher wages than those in the Industrial job class

## Lets see if adding maritl or jobclass to the models adds anything ##

library(gam)

gam.age <- gam(wage~ns(age, 5), data=Wage) # named gam.age so the object does not share a name with the gam() function
gam2.1 <- gam(wage~ns(age, 5)+maritl, data=Wage)
gam2.2 <- gam(wage~ns(age, 5)+jobclass, data=Wage)
gam3 <- gam(wage~ns(age, 5)+jobclass+maritl, data=Wage)

## The models need to be nested in an ANOVA test, so we compare a to a+b to a+b+c, not a to a+b to a+c to a+b+c

anova(gam.age,gam2.1,gam3)

## In an ANOVA test, the null is that the current model M1 is sufficient; the alternative is that M2 is better. Here the tiny p-values say that adding maritl is an improvement, as is adding jobclass.

anova(gam.age, gam2.2, gam3)

# Same results

## This shows us that both maritl and jobclass are significant predictors of wage even with age already included in the model

par(mfrow=c(1,3))
plot.gam(gam3, se=T, col="blue")

gam3$coefficients

#8)

data(Auto)

pairs(Auto)
attach(Auto)

# horsepower and mpg seem to have a non-linear relationship

# Fit linear model for demonstration purposes

horsepowerlims <- range(horsepower)
horsepower.grid <- seq(horsepowerlims[1], horsepowerlims[2]) # the grid must exist before predict() uses it

plot(horsepower, mpg, cex=.5, col="darkgrey") # plot() takes no data argument here; horsepower and mpg come from attach(Auto)

fit.lm <- lm(mpg~horsepower, data=Auto)
pred.lm <- predict(fit.lm, newdata = list(horsepower=horsepower.grid), se=T)
lines(horsepower.grid, pred.lm$fit, lwd=2, col="red")
se.bands <- cbind(pred.lm$fit - 2*pred.lm$se.fit, pred.lm$fit + 2*pred.lm$se.fit)
matlines(horsepower.grid, se.bands, lty="dashed", col="red")

MSE.lm <- mean((fit.lm$residuals)^2)
MSE.lm

# work out optimal number of degrees of freedom
cv.err.ns <- rep(NA, 20)

for (i in 1:20) {
fit <- glm(mpg~ns(horsepower, i), data=Auto)
cv.err.ns[i] <- cv.glm(Auto, glmfit = fit, K=10)$delta[1]
}

plot(cv.err.ns, type="b")
best.df <- which.min(cv.err.ns)
best.df # = 16
points(best.df, cv.err.ns[best.df], pch=16, col="red")

# Smallest CV error is at df=16

# Plotting the linear approximation vs the natural spline with df=16

plot(horsepower, mpg, cex=.5, col="darkgrey") # start a fresh scatterplot; the current device still shows the CV curve
lines(horsepower.grid, predict(fit.lm, newdata = list(horsepower=horsepower.grid)), lwd=2, col="red")

fit.glm <- glm(mpg~ns(horsepower, 16), data=Auto)
pred.glm <- predict(fit.glm, newdata = list(horsepower=horsepower.grid), se=T)
lines(horsepower.grid, pred.glm$fit, lwd=2, col="blue")
se.bands.glm <- cbind(pred.glm$fit - 2*pred.glm$se.fit, pred.glm$fit + 2*pred.glm$se.fit)
matlines(horsepower.grid, se.bands.glm, lty="dashed", col="blue")

MSE.ns <- mean((fit.glm$residuals)^2)
MSE.ns

legend("topright", legend=c("Linear Model","Natural Spline"), col=c("red","blue"), cex=0.8, lty=1)

# Step function approximation for horsepower

# Cross-validation to find the optimal number of cuts

cv.err.cuts <- rep(NA,20)

for (i in 2:20) {
Auto$horsepower.cuts <- cut(Auto$horsepower, i)
fit.step <- glm(mpg~horsepower.cuts, data = Auto)
cv.err.cuts[i] <- cv.glm(Auto, fit.step, K=10)$delta[1]
}

plot(cv.err.cuts, type="b")
best.cv.err.cuts <- which.min(cv.err.cuts) 
best.cv.err.cuts
cv.err.cuts[15] # 15 cuts shows the smallest CV error

glm.step.fit <- glm(mpg~cut(horsepower,15), data=Auto)
glm.step.pred <- predict(glm.step.fit, newdata = list(horsepower = horsepower.grid), se=T)

MSE.step <- mean((glm.step.fit$residuals)^2)
MSE.step

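plot(horsepower, mpg, cex=.5, col="darkgrey") # redraw the scatterplot; the current device shows the CV curve
lines(horsepower.grid, glm.step.pred$fit, lwd=2, col="green")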

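## Natural spline: repeat the CV over df, this time with a labelled plot ##

cv.err.ns <- rep(NA,20)

data(Auto) # reload Auto, which drops the horsepower.cuts column added above

for (i in 1:20) {
fit <- glm(mpg~ns(horsepower, df=i), data=Auto)
cv.err.ns[i] <- cv.glm(Auto, glmfit = fit, K=10)$delta[1]
}

plot(cv.err.ns, type="b", main="Natural Spline CV degrees of freedom", xlab="Degrees of Freedom", ylab="CV Error")

## Smoothing spline ##

fit.ss <- smooth.spline(horsepower, mpg, cv=TRUE) # x value first in this call
fit.ss$df # = 5.78

plot(horsepower, mpg, cex=.5, col="darkgrey") # fresh scatterplot for the smoothing spline and local regression curves
lines(fit.ss, col="red", lwd=2)

gam.ss <- gam(mpg~s(horsepower, df=6.06), data=Auto) # note: this df is hard-coded and differs from the fit.ss$df value reported above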

MSE.ss <- mean((gam.ss$residuals)^2)
MSE.ss

gam.lo <- gam(mpg~lo(horsepower, span = 0.75), data=Auto)

lines(lowess(mpg~horsepower, f=.75), col="blue", lwd=2)

MSE.lo <- mean((gam.lo$residuals)^2)
MSE.lo


MSE.lm <- mean((fit.lm$residuals)^2)
MSE.lm

errors <- c(MSE.ns, MSE.ss, MSE.lm, MSE.lo)
names(errors) <- c("Natural Spline","Smoothing Spline","Linear Model","Local Regression")
barplot(errors, ylim=c(0,25))

# ANOVA to compare the models

anova(fit.lm, fit.glm) # natural spline is large improvement over linear model

fit1 <- gam(mpg~ns(horsepower, 16), data=Auto)
fit2.1 <- gam(mpg~ns(horsepower, 16)+cylinders, data=Auto)
fit2.2 <- gam(mpg~ns(horsepower, 16)+weight, data=Auto)
fit3 <- gam(mpg~ns(horsepower, 16)+cylinders+weight, data=Auto)

anova(fit1, fit2.1, fit3)
# adding cylinders and weight improves the model once horsepower is included

anova(fit1, fit2.2, fit3)

# confirms that both cylinders and weight add to the model

par(mfrow=c(1,3))
plot.gam(fit3, se=T, col="blue")

Auto$cylinders <- factor(Auto$cylinders)

#9)

library(MASS)
data(Boston)
attach(Boston)

# Fit cubic polynomial, report regression output, plot fit

fit.poly <- glm(nox~poly(dis,3), data=Boston)
summary(fit.poly)

par(mfrow=c(1,1))

plot(dis, nox)

dislims <- range(round(dis))
dis.grid <- seq(dislims[1], dislims[2], 0.1)

pred <- predict(fit.poly, newdata = list(dis=dis.grid), se=T)
lines(dis.grid, pred$fit, lwd=2, col="red")
se.bands <- cbind(pred$fit - 2*pred$se.fit, pred$fit + 2*pred$se.fit)
matlines(dis.grid, se.bands, lty=3, lwd=1)


# Plot polynomial fits 1-10 and report RSS

RSS.poly <- rep(NA, 10)

for (i in 1:10) {
  poly.mod <- glm(nox~poly(dis,i), data=Boston)
  RSS.poly[i] <- sum((poly.mod$residuals)^2)
}

plot(RSS.poly, type="b")


# Cross validation to select the optimal degree for the polynomial

cv.errors.poly <- rep(NA,10)

for (i in 1:10) {
  poly.mod <- glm(nox~poly(dis,i), data=Boston)
  cv.errors.poly[i] <- cv.glm(Boston, poly.mod, K=10)$delta[1]
}

plot(cv.errors.poly, type="b")
best.degree <- which.min(cv.errors.poly)

best.degree

# The degree with the smallest cross-validated prediction error is the third. High-degree polynomials behave erratically near the ends of the range of dis, which inflates their CV error.

# Use bs() function to fit regression spline. Report output using df=4. How did you choose the knots? Plot the resulting fit


dim(bs(dis,df=4))
attr(bs(dis, df=4),"knots")

# with df=4, bs() places boundary knots at the range of dis and a single interior knot at the median of dis, 3.207

fit.bs <- glm(nox~bs(dis, df=4), data=Boston)
pred <- predict(fit.bs, newdata = list(dis=dis.grid), se=T)
se.bands <- cbind(pred$fit - 2*pred$se.fit, pred$fit + 2*pred$se.fit)

plot(dis,nox) # draw the scatterplot first, then add the fit and its bands
lines(dis.grid, pred$fit, lwd=2, col="blue") # predict() returned a list, so plot its fit component
matlines(dis.grid, se.bands, lty=3, col="blue")

# knots chosen by bs() function

## Fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.

spline.RSS <- rep(NA,20)

for (i in 3:20) { # a cubic bs() basis needs df >= 3; smaller values trigger a warning and are treated as 3
  fit.spline <- glm(nox~bs(dis, df=i), data=Boston)
  spline.RSS[i] <- sum((fit.spline$residuals)^2)
}

spline.RSS

plot(spline.RSS, type="b")

# As expected, as the spline becomes more flexible the training RSS decreases as the spline starts to over-fit the data
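
# A sketch of the same effect for nested polynomial fits on the built-in cars data
# (hypothetical demo.* names). For nested models the training RSS can never increase
# with flexibility; bs() bases with different df are not nested, so their training RSS
# usually, but not necessarily always, decreases.
demo.rss <- sapply(1:6, function(d) sum(resid(lm(dist ~ poly(speed, d), data = cars))^2))
round(demo.rss)
all(diff(demo.rss) <= 1e-6)   # RSS is non-increasing in the degree, up to rounding error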

# Perform CV to select best DF

cv.err.df <- rep(NA, 20)

for (i in 3:20) { # a cubic bs() basis needs df >= 3; smaller values trigger a warning and are treated as 3
  fit <- glm(nox~bs(dis, df=i), data=Boston)
  cv.err.df[i] <- cv.glm(Boston, fit, K=10)$delta[1]
  
}
which.min(cv.err.df)

# 5 df has the smallest CV prediction error

#10a)

data(College)
attach(College)

dim(College)

set.seed(1)
trainid <- sample(1:777, 777/2, replace=F)

train <- College[trainid,]
test <- College[-trainid,]

library(leaps)

modelmat <- model.matrix(Outstate~., data=train)[,-1]
y.train <- train$Outstate


regfit.full <- regsubsets(modelmat, data=train, y=y.train, nvmax=ncol(train), method = "forward")
regfit.summary <- summary(regfit.full)

par(mfrow=c(3,1))
plot(regfit.summary$bic, type="b")
min.bic <- which.min(regfit.summary$bic)
points(min.bic, regfit.summary$bic[min.bic], pch=16, col="red")

# Six variable model shows the lowest BIC statistic

plot(regfit.summary$cp, type="b")
min.cp <- which.min(regfit.summary$cp)
points(min.cp, regfit.summary$cp[min.cp], pch=16, col="red")

# 13 variable model has the smallest Cp score but past 6 they are all pretty similar. Still looking at 6.

plot(regfit.summary$adjr2, type="b")
max.adjr2 <- which.max(regfit.summary$adjr2)
points(max.adjr2, regfit.summary$adjr2[max.adjr2], pch=16, col="red")

# 13 variable model has the largest Adjusted R2 score but above 6 they are all very similar so we're going with 6.

# The 6-variable model has the following predictors:

coef(regfit.full, 6)

# PrivateYes, Room.Board, Terminal, perc.alumni, Expend and Grad.Rate

mod.fwd <- glm(Outstate~Private+Room.Board+Terminal+perc.alumni+Expend+Grad.Rate, data=train)

mod.gam <- gam(Outstate~Private+s(Room.Board,3)+s(Terminal,3)+s(perc.alumni,3)+s(Expend,3)+s(Grad.Rate,3), data=train)

par(mfrow=c(3,2))

plot.gam(mod.gam, se=T, col="blue") # the fitted GAM is stored in mod.gam; no object called mod exists

# Being private, or having a larger value of any of the selected predictors, is associated with a higher predicted out-of-state tuition fee.

# Using test set

test.y <- test$Outstate

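pred.fwd <- predict(mod.fwd, newdata = test)
fwd.testMSE <- mean((pred.fwd-test.y)^2)
fwd.testMSE # = 4,357,411 (in squared dollars, since this is a mean squared error)

pred.gam <- predict(mod.gam, newdata=test)
gam.testMSE <- mean((pred.gam-test.y)^2)
gam.testMSE # = 3,841,676 (squared dollars)

# Express the error on the scale of the response: root mean squared error relative to the mean test tuition.
# Dividing an MSE by sum(College$Outstate) mixes squared dollars with dollars and does not give a percentage error.

pct.err.fwd <- sqrt(fwd.testMSE)/mean(test.y)
pct.err.fwd

pct.err.gam <- sqrt(gam.testMSE)/mean(test.y)
pct.err.gam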

summary(mod.gam)

# Strong evidence of non-linear effects for Expend, some evidence for Grad.Rate and Room.Board, no evidence for Terminal or perc.alumni

#11a)

set.seed(1)
X1 <- rnorm(100)
X2 <- rnorm(100)
beta_0 <- -3.8
beta_1 <- 0.3
beta_2 <- 4.1
eps <- rnorm(100, sd = 1)
Y <- beta_0 + beta_1*X1 + beta_2*X2 + eps
par(mfrow=c(1,1))
plot(Y)

bhat_1 <- 1

a <- Y - bhat_1*X1
(bhat_2 <- lm(a~X2)$coef[2])

#c) 



bhat_0 <- bhat_1 <- bhat_2 <- rep(0, 1000)
for (i in 1:1000) {
  a <- Y - bhat_1[i] * X1
  bhat_2[i] <- lm(a ~ X2)$coef[2]
  a <- Y - bhat_2[i] * X2
  bhat_0[i] <- lm(a ~ X1)$coef[1]
  # bhat_1 will end up with 1001 terms
  bhat_1[i+1] <- lm(a ~ X1)$coef[2]
}
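
# Sketch of the comparison asked for in part (e), re-simulated with its own
# (hypothetical) demo.* names so that it runs independently of the loop above:
# backfitting should converge to the multiple regression coefficients.
set.seed(1)
demo.X1 <- rnorm(100)
demo.X2 <- rnorm(100)
demo.Yb <- -3.8 + 0.3 * demo.X1 + 4.1 * demo.X2 + rnorm(100)
demo.b1 <- 1   # arbitrary starting value
for (demo.i in 1:50) {
  demo.b2 <- coef(lm(I(demo.Yb - demo.b1 * demo.X1) ~ demo.X2))[2]
  demo.step <- lm(I(demo.Yb - demo.b2 * demo.X2) ~ demo.X1)
  demo.b0 <- coef(demo.step)[1]
  demo.b1 <- coef(demo.step)[2]
}
rbind(backfit = c(demo.b0, demo.b1, demo.b2),
      lm      = coef(lm(demo.Yb ~ demo.X1 + demo.X2)))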

```


