---
title: "Chapter 8 - Tree Based Models"
author: "Dan"
date: "19 February 2018"
output: html_document
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
# The tree library is used to construct classification and regression trees

library(tree)

# We first use classification trees to analyse the Carseats data. In these data, Sales is a continuous variable, and so we begin by recoding it as a binary variable. We use the ifelse() function to create a variable called High, which takes on the value Yes if the Sales variable exceeds 8, and takes on a value of No otherwise.

library(ISLR)
attach(Carseats)

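# tree() needs a factor response to fit a classification tree; since R 4.0, data.frame() no longer converts strings to factors automatically
High <- factor(ifelse(Sales <= 8, "No", "Yes"))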

# Finally, we merge High with the rest of the Carseats data

Carseats <- data.frame(Carseats, High)

# We now use the tree() function to fit a classification tree in order to predict High using all of the variables except Sales. The syntax of tree() is quite similar to that of lm()

tree.carseats <- tree(High~.-Sales, data=Carseats)

# The summary() function lists the variables that are used in the internal nodes in the tree, the number of terminal nodes, and the (training) error rate. 

summary(tree.carseats)

# We see that the training error rate is 9%. For classification trees, the deviance reported in the output of summary() is given on page 339. A small deviance indicates a tree that provides a good fit to the (training) data. The residual mean deviance reported is simply the deviance divided by n-|T0|, which in this case is 400-27=373.
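# As a rough illustration of that calculation, here is a minimal sketch of the classification-tree deviance, -2 * sum over terminal nodes m and classes k of n_mk * log(p_mk), and of the residual mean deviance. The helper name class_tree_deviance() and the toy node counts are made up for this example; they are not part of the tree package.

class_tree_deviance <- function(node_counts) {
  # node_counts: one row per terminal node, one column per class
  props <- node_counts / rowSums(node_counts)
  # Treat 0 * log(0) as 0 so that pure nodes contribute nothing
  terms <- ifelse(node_counts > 0, node_counts * log(props), 0)
  -2 * sum(terms)
}

toy_counts <- matrix(c(8, 2,
                       1, 9), nrow = 2, byrow = TRUE)
toy_deviance <- class_tree_deviance(toy_counts)
toy_deviance / (sum(toy_counts) - nrow(toy_counts)) # residual mean deviance = deviance / (n - |T0|)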

# One of the most attractive properties of trees is that they can be graphically displayed. We use the plot() function to display the tree structure, and the text() function to display the node labels. The argument pretty=0 instructs R to include the category names for any qualitative predictors, rather than simply displaying a letter for each category.

plot(tree.carseats)
text(tree.carseats, pretty=0)

# The most important indicator of Sales appears to be shelving location, since the first branch differentiates Good locations from Bad and Medium locations.

# If we just type the name of the tree object, R prints output corresponding to each branch of the tree. R displays the split criterion (e.g. Price<92.5), the number of observations in that branch, the deviance, the overall prediction for the branch (Yes or No), and the fraction of observations in that branch that take on values of Yes and No. Branches that lead to terminal nodes are indicated using asterisks

tree.carseats

# In order to properly evaluate the performance of a classification tree on these data, we must estimate the test error rather than simply computing the training error. We split the observations into a training set and a test set, build the tree using the training set, and evaluate its performance on the test data. The predict() function can be used for this purpose. In the case of a classification tree, the argument type="class" instructs R to return the actual class prediction. This approach leads to correct predictions for around 71.5 % of the locations in the test data set.

set.seed(2)
train <- sample(1:nrow(Carseats), 200)

Carseats.test <- Carseats[-train,]
High.test <- High[-train]

tree.carseats <- tree(High~.-Sales, data=Carseats, subset=train)

tree.pred <- predict(tree.carseats, newdata = Carseats.test, type="class")

table(tree.pred, High.test)

(86+57)/200 # = 71.5% correct predictions
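# The accuracy above is typed in from the table by hand. A small sketch of the same calculation done directly from a confusion table follows; confusion_accuracy() and the toy labels are hypothetical names and data used only for illustration.

confusion_accuracy <- function(tab) {
  # Correct predictions lie on the diagonal when rows and columns share the same level order
  sum(diag(tab)) / sum(tab)
}

toy_pred  <- factor(c("No", "No", "Yes", "Yes", "No"), levels = c("No", "Yes"))
toy_truth <- factor(c("No", "Yes", "Yes", "Yes", "No"), levels = c("No", "Yes"))
toy_accuracy <- confusion_accuracy(table(toy_pred, toy_truth))
toy_accuracy # 4 of the 5 toy predictions are correct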

# Next, we consider whether pruning the tree might lead to improved results. The function cv.tree() performs cross-validation in order to determine the optimal level of tree complexity; cost complexity pruning is used in order to select a sequence of trees for consideration.  We use the argument FUN=prune.misclass in order to indicate that we want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance. The cv.tree() function reports the number of terminal nodes of each tree considered (size) as well as the corresponding error rate and the value of the cost-complexity parameter used (k, which corresponds to alpha in (8.4)).

set.seed(3)
cv.carseats <- cv.tree(tree.carseats, FUN=prune.misclass)
names(cv.carseats)
cv.carseats
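# To make the role of k (alpha) concrete, here is a minimal sketch of the cost-complexity criterion: for a given alpha, the preferred subtree minimises (training error + alpha * number of terminal nodes). The helper pick_subtree_size() and the error counts below are invented for illustration and are not output from cv.tree().

pick_subtree_size <- function(sizes, errors, alpha) {
  penalised <- errors + alpha * sizes
  sizes[which.min(penalised)]
}

toy_sizes  <- c(1, 2, 4, 8, 16)      # number of terminal nodes in each candidate subtree
toy_errors <- c(80, 60, 45, 38, 35)  # training misclassifications for each subtree

pick_subtree_size(toy_sizes, toy_errors, alpha = 0)  # no penalty: the largest tree wins
pick_subtree_size(toy_sizes, toy_errors, alpha = 2)  # moderate penalty: a mid-sized tree wins
pick_subtree_size(toy_sizes, toy_errors, alpha = 50) # heavy penalty: the root-only tree wins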

# Note that, despite the name, dev corresponds to the cross-validation error rate in this instance. The tree with 9 terminal nodes results in the lowest cross-validation error rate, with 50 cross-validation errors. We plot the error rate as a function of both size and k.

par(mfrow=c(1,2))

plot(cv.carseats$size, cv.carseats$dev, type="b")
plot(cv.carseats$k, cv.carseats$dev, type="b")

# We now apply the prune.misclass() function in order to prune the tree to obtain the nine-node tree

par(mfrow=c(1,1))

pruned.carseats <- prune.misclass(tree.carseats, best=9)
plot(pruned.carseats)
text(pruned.carseats, pretty=0)

# How well does this pruned tree perform on the test data? Once again we apply the predict() function

tree.pred <- predict(pruned.carseats, newdata = Carseats.test, type="class")
table(tree.pred, High.test)

(94+60)/200 # = 77% correctly classified. Not only has the pruning process produced a more interpretable tree, it has also improved the classification accuracy

# If we increase the value of best, we obtain a larger pruned tree with lower classification accuracy.

##########
# Fitting Regression Trees
##########

# Here we fit a regression tree to the Boston data set. First, we create a training set, and fit the tree to the training data.

library(MASS)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston <- tree(medv~., data=Boston, subset=train)
summary(tree.boston)

# Notice that the output of summary() indicates that only three of the variables have been used in constructing the tree. In the context of a regression tree, the deviance is simply the sum of squared errors for the tree.

plot(tree.boston)
text(tree.boston, pretty=0)

# The variable lstat measures the percentage of individuals with lower socioeconomic status. The tree indicates that lower values of lstat correspond to more expensive houses. The tree predicts a median house price of $46,400 for larger homes in suburbs in which residents have high socioeconomic status (rm>=7.437 and lstat<9.715)

# Now we use the cv.tree() function to see whether pruning the tree can improve performance

cv.boston <- cv.tree(tree.boston) # No need to specify FUN=prune.misclass here because deviance is the criterion we want in regression problems

cv.boston
plot(cv.boston$size, cv.boston$dev, type="b")

# Here the most complex tree is chosen by cross validation. However, if we wish to prune the tree, we could do so as follows, using the prune.tree() function

pruned.boston <- prune.tree(tree.boston, best=5)
plot(pruned.boston)
text(pruned.boston, pretty=0)

# In keeping with the cross validation results, we use the unpruned tree to make predictions on the test set.

yhat <- predict(tree.boston, newdata = Boston[-train,])
boston.test <- Boston[-train,"medv"]
plot(yhat, boston.test)
abline(0,1)
mean((yhat-boston.test)^2) # = 25.05

# In other words, the test MSE associated with the regression tree is 25.05. The square root of the MSE is therefore around 5.005, indicating that this model leads to test predictions that are within around $5,005 of the true median home value for the suburb.
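# The arithmetic behind that statement, as a short sketch: medv is recorded in thousands of dollars, so the root mean squared error is converted to dollars by multiplying by 1000. The variable names here are chosen for this example only.

example.mse <- 25.05               # test MSE reported above
example.rmse <- sqrt(example.mse)  # about 5.005, in units of $1000
example.rmse * 1000                # roughly $5,005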

## Bagging and Random Forests ##

# Here we apply bagging and random forests to the Boston data, using the randomForest package. Recall that bagging is simply a special case of a random forest with m=p. Therefore, the randomForest() function can be used to perform both random forests and bagging.

## Bagging ##

library(randomForest)
set.seed(1)
bag.boston <- randomForest(medv~., data=Boston, subset=train, mtry=13, importance=TRUE)

bag.boston

# The argument mtry=13 indicates that all 13 predictors should be considered for each split of the tree - in other words, that bagging should be done. How well does this bagged model perform on the test set?

yhat.bag <- predict(bag.boston, newdata = Boston[-train,])
plot(yhat.bag, boston.test)
abline(0,1)
mean((yhat.bag-boston.test)^2) # MSE = 13.47
sqrt(13.47) # = 3.67, so the bagged predictions are typically off by about $3,670

# The test MSE associated with the bagged regression tree is almost half that obtained using an optimally-pruned single tree. We could change the number of trees grown by randomForest() using the ntree argument:

bag.boston <- randomForest(medv~.,data=Boston, subset=train, mtry=13, ntree=25)

yhat.bag <- predict(bag.boston, newdata = Boston[-train,])
mean((yhat.bag-boston.test)^2) # = 13.43

# Growing a random forest proceeds in exactly the same way, except that we use a smaller value for the mtry argument. By default, randomForest uses p/3 variables when building a forest of regression trees, and sqrt(p) variables when building a random forest of classification trees. Here we use mtry=6
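# A small sketch of those default rules, assuming the conventions documented for randomForest(): floor(p/3) (at least 1) for regression and floor(sqrt(p)) for classification. default_mtry() is a hypothetical helper written for this note, not a function from the package.

default_mtry <- function(p, classification = FALSE) {
  if (classification) floor(sqrt(p)) else max(floor(p / 3), 1)
}

default_mtry(13)                         # regression with the 13 Boston predictors
default_mtry(13, classification = TRUE)  # classification with 13 predictors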

set.seed(1)
rf.boston <- randomForest(medv~.,data=Boston, subset=train, mtry=6, importance=TRUE)

yhat.rf <- predict(rf.boston, newdata = Boston[-train,])
mean((yhat.rf-boston.test)^2) # = 11.48 which is an improvement on the bagging case

# Using the importance() function, we can view the importance of each variable

importance(rf.boston)

# Two measures of variable importance are reported. The former is based upon the mean decrease in accuracy of predictions on the out-of-bag samples when the values of a given variable are randomly permuted (effectively removing it from the model). The latter is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees (this was plotted in Fig 8.9). In the case of regression trees, the node impurity is measured by the training RSS, and for classification trees by the deviance. Plots of these importance measures can be produced using the varImpPlot() function.

varImpPlot(rf.boston)

# The results indicate that across all of the trees considered in the random forest, the wealth level of the community (lstat) and the house size (rm) are by far the two most important variables.

## Boosting ##

# Here we use the gbm package, and within it the gbm() function, to fit boosted regression trees to the Boston data set. We run gbm() with the option distribution="gaussian" since this is a regression problem; if it were a binary classification problem, we would use distribution="bernoulli". The argument n.trees=5000 indicates that we want 5000 trees, and the option interaction.depth=4 limits the depth of each tree.

library(gbm)
set.seed(1)
boost.boston <- gbm(medv~., data=Boston[train,], distribution="gaussian", n.trees=5000, interaction.depth = 4)

# The summary() function produces a relative influence plot and also outputs the relative influence statistics.

summary(boost.boston)

# We see that lstat and rm are by far the most important variables. We can also produce partial dependence plots for these two variables. These plots illustrate the marginal effect of the selected variables on the response after integrating out the other variables. In this case, as we might expect, median house prices are increasing with rm and decreasing with lstat

par(mfrow=c(1,2))
plot(boost.boston, i="rm")
plot(boost.boston, i="lstat")

# We now use the boosted model to predict medv on the test set:

yhat.boost <- predict(boost.boston, newdata=Boston[-train,], n.trees = 5000)
mean((yhat.boost-boston.test)^2) # = 11.84, which is similar to the test MSE for random forests and superior to that for bagging

# If we want to, we can perform boosting with a different value of the shrinkage parameter lambda in (8.10). The default value was 0.001 in the gbm version these notes are based on (newer gbm releases appear to default to 0.1, so set shrinkage explicitly if the value matters), but this is easily modified. Here we take lambda=0.2

boost.boston <- gbm(medv~., data=Boston[train,], distribution = "gaussian", n.trees=5000, interaction.depth = 4, shrinkage = 0.2, verbose = FALSE) # verbose=TRUE would print progress to the console

yhat.boost <- predict(boost.boston, newdata = Boston[-train,], n.trees = 5000)

mean((yhat.boost-boston.test)^2) # = 11.43

# In this case, using lambda=0.2 leads to a slightly lower test MSE than lambda=0.001
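
# A hand-rolled sketch of the boosting update behind the shrinkage parameter: each new stump is fitted to the current residuals and added to the model after scaling by lambda. This is a simplified illustration with one predictor and made-up data, not how gbm() is implemented; fit_stump() and boost_stumps() are hypothetical helpers.

fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cutpoints <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between neighbouring x values
  best <- NULL
  for (cutpoint in cutpoints) {
    left <- x < cutpoint
    rss <- sum((r[left] - mean(r[left]))^2) + sum((r[!left] - mean(r[!left]))^2)
    if (is.null(best) || rss < best$rss) {
      best <- list(cut = cutpoint, left = mean(r[left]), right = mean(r[!left]), rss = rss)
    }
  }
  best
}

boost_stumps <- function(x, y, n_trees, lambda) {
  fhat <- rep(0, length(y))  # start from f(x) = 0
  r <- y                     # residuals start as the response itself
  for (b in seq_len(n_trees)) {
    stump <- fit_stump(x, r)
    update <- ifelse(x < stump$cut, stump$left, stump$right)
    fhat <- fhat + lambda * update  # add a shrunken copy of the new stump
    r <- r - lambda * update        # update the residuals
  }
  fhat
}

boost_toy_x <- 1:10
boost_toy_y <- c(1, 1, 2, 2, 3, 7, 8, 8, 9, 9)
mse_slow <- mean((boost_toy_y - boost_stumps(boost_toy_x, boost_toy_y, n_trees = 20, lambda = 0.01))^2)
mse_fast <- mean((boost_toy_y - boost_stumps(boost_toy_x, boost_toy_y, n_trees = 20, lambda = 0.2))^2)
c(slow = mse_slow, fast = mse_fast) # with only 20 stumps, the larger lambda fits the training data much more closely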

#######
# Exercises
#######

#3)

p <- seq(0,1,0.01)

par(mfrow=c(1,1))

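gini <- 2 * p * (1 - p) # Gini index for two classes
entropy <- -1*(p*log(p) + (1-p)*log(1-p)) # NaN at p=0 and p=1, so those two points are not drawn
class.error <- 1-pmax(p,1-p)

# Draw the three curves as lines so that they match the legend
matplot(p, cbind(gini, entropy, class.error), type="l", lty=1, col=c("red","green","blue"), xlab="p", ylab="Value")

# Legend colours follow the column order passed to matplot()
legend("topright", legend=c("Gini","Entropy","Classification"), col=c("red","green","blue"), lty=1)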

# 7) 
# Test MSE as a function of the number of trees (1 to 500) for each value of mtry.
# This fits 5 x 500 forests, so it is slow to run.
mtry.values <- c(13, 8, 5, 3, 2)
mtry.errors <- matrix(NA, nrow = 500, ncol = length(mtry.values))

for (j in seq_along(mtry.values)) {
  for (i in 1:500) {
    boston.rf <- randomForest(medv~., data=Boston[train,], mtry=mtry.values[j], ntree=i)
    boston.rf.pred <- predict(boston.rf, newdata=Boston[-train,])
    mtry.errors[i, j] <- mean((boston.rf.pred-boston.test)^2)
  }
}

# Columns are in the order mtry = 13, 8, 5, 3, 2
matplot(mtry.errors, type="l", lty=1, col=c("green","blue","brown","red","black"), lwd=3)

legend("topright", legend=c("mtry=13","mtry=8","mtry=5","mtry=3","mtry=2"), col=c("green","blue","brown","red","black"), lty=1, cex=1)

##################
# An alternative approach reads the test MSE from randomForest() itself: supplying x, y, xtest and ytest makes the fitted object record the test MSE after each tree in $test$mse

set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
Boston.train <- Boston[train, -14]
Boston.test <- Boston[-train, -14]
Y.train <- Boston[train, 14]
Y.test <- Boston[-train, 14]
rf.boston1 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = ncol(Boston) - 1, ntree = 500)
rf.boston2 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = (ncol(Boston) - 1) / 2, ntree = 500)
rf.boston3 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = sqrt(ncol(Boston) - 1), ntree = 500)
plot(1:500, rf.boston1$test$mse, col = "green", type = "l", xlab = "Number of Trees", ylab = "Test MSE", ylim = c(10, 19))
lines(1:500, rf.boston2$test$mse, col = "red", type = "l")
lines(1:500, rf.boston3$test$mse, col = "blue", type = "l")
legend("topright", c("m = p", "m = p/2", "m = sqrt(p)"), col = c("green", "red", "blue"), cex = 1, lty = 1)

#8a)

data("Carseats")
set.seed(1)

trainid <- sample(1:nrow(Carseats), nrow(Carseats)/2)

carseats.train <- Carseats[trainid,]
carseats.test <- Carseats[-trainid,]

tree.carseats <- tree(Sales~., data=carseats.train)
pred.tree <- predict(tree.carseats, newdata = carseats.test)
mean((pred.tree-carseats.test$Sales)^2)
# = 4.149

cv.carseats <- cv.tree(tree.carseats)
cv.carseats$size[which.min(cv.carseats$dev)] # which.min() returns a position, so index size with it; the tree with 8 terminal nodes has the smallest CV error

plot(cv.carseats, type="b")

prune.carseats <- prune.tree(tree.carseats, best = 8)
prune.preds <- predict(prune.carseats, newdata = carseats.test)
mean((prune.preds-carseats.test$Sales)^2) # MSE increased to 5.09

carseats.train <- Carseats[trainid,]
carseats.test <- Carseats[-trainid,]

y.train <- Carseats$Sales[trainid]
y.test <- Carseats$Sales[-trainid]


bag.carseats <- randomForest(Sales~., data=carseats.train,  mtry=ncol(carseats.train)-1, ntree=500, importance=T)

yhat.bag <- predict(bag.carseats, newdata = carseats.test)

mean((yhat.bag-y.test)^2) # MSE = 2.64

importance(bag.carseats)

# Price and ShelveLoc are the two most important variables

rf.carseats <- randomForest(Sales~., data=carseats.train, mtry=3)

pred.rf <- predict(rf.carseats, newdata = carseats.test)

mean((pred.rf-y.test)^2) # = 3.28
importance(rf.carseats) # Still Price and ShelveLoc. Decreasing m from 10 to 3 raised the test MSE from 2.64 to 3.28, so bagging does better here

#9a)

data(OJ)
attach(OJ)

set.seed(1)
trainid <- sample(1:nrow(OJ), 800, replace=FALSE)

train <- OJ[trainid,]
test <- OJ[-trainid,]

#b)

oj.tree <- tree(Purchase~., data=train)
summary(oj.tree)

# Tree uses 4 variables, training error rate = 16.5%
# Tree has 8 terminal nodes

#c)
oj.tree

plot(oj.tree)
text(oj.tree, pretty=0)

# Interpreting node 11: the split criterion is PriceDiff > 0.195. There are 101 observations in that terminal node, its deviance is 139.2, and 45.5% of those observations are CH.

# LoyalCH is the most important criterion. Customers with brand loyalty to Citrus Hill >= 0.51 all chose Citrus Hill. Among the less loyal customers, all of those with loyalty < 0.26 chose Minute Maid. Customers with CH loyalty between 0.26 and 0.5 chose CH unless the price difference was < 0.195 and there was no special offer on CH.

# Customers with very low loyalty to CH, or customers with low brand loyalty, a very small price difference and no special offer, chose to buy MM

#e)

test.y <- test$Purchase

pred.tree <- predict(oj.tree, newdata = test, type = "class")

table(pred.tree, test.y)

dim(test)

(12+49)/270 # = test error rate of 22.6%

#f)
oj.cv <- cv.tree(oj.tree, FUN=prune.misclass)

# Here dev is the cross-validation error. The value is the same for all tree sizes down to two, so we will use the two-node tree for ease of interpretation. The plot suggests something different, though; it is always good practice to plot.

#g)
plot(oj.cv$size, oj.cv$dev, type="b")

#h) Optimal tree size = 5

#i)
oj.prune <- prune.tree(oj.tree, best=2)

plot(oj.prune)
text(oj.prune, pretty=0)

summary(oj.tree) # = 16.5% training error rate
summary(oj.prune)# = 18.25% training error rate

prune.pred <- predict(oj.prune, test, type="class")
table(prune.pred, test$Purchase)

1-(119+81)/270 # = 26% test error rate vs 22.6% test error rate, so the pruning decreased the model performance.

#10a)
library(ISLR)
data(Hitters)
attach(Hitters)

summary(Hitters)

hitters2 <- na.omit(Hitters)
hitters2$Salary <- log(hitters2$Salary) # Salary is strongly right-skewed; the log transformation brings it closer to normality

#b)
dim(hitters2) 

trainid <- 1:200

train <- hitters2[trainid,]
test <- hitters2[-trainid,]

#c)
library(gbm)

set.seed(1)
powers <- seq(-10,-0.2,by=0.1)
lambdas <- 10^powers

train.err <- rep(NA,99)
y.train <- hitters2$Salary[trainid]

for(i in 1:99) {
  
  boost.hitters <- gbm(Salary~., data=train, n.trees = 1000, shrinkage = lambdas[i], distribution = "gaussian")
  boost.pred <- predict(boost.hitters, newdata = train, n.trees = 1000)
  train.err[i] <- mean((boost.pred-y.train)^2)
  
}

plot(lambdas, train.err, type="b")


#d)

powers <- seq(-3,-1,by=0.1)
lambdas <- 10^powers

test.err <- rep(NA,length(lambdas))
y.test <- hitters2$Salary[-trainid]

for(i in 1:length(lambdas)) {
  
  boost.hitters <- gbm(Salary~., data=train, n.trees = 1000, shrinkage = lambdas[i], distribution = "gaussian")
  boost.pred <- predict(boost.hitters, newdata = test, n.trees = 1000)
  test.err[i] <- mean((boost.pred-y.test)^2)
  
}

plot(lambdas, test.err, type="b")

which.min(test.err)
lambdas[which.min(test.err)] # about 0.025 is the best value for lambda

#e)

boost.hitters <- gbm(Salary~., data=train, n.trees = 1000, shrinkage = 0.025, distribution = "gaussian")
boost.pred <- predict(boost.hitters, newdata = test, n.trees = 1000)
mse.boost <- mean((boost.pred-y.test)^2) # MSE = 0.264
mse.boost

## Comparing the MSE for boosting vs MSE for linear regression lasso and KNN models ##

## Linear regression

modelmat.train <- as.data.frame(model.matrix(Salary~., data=train))
modelmat.test <- as.data.frame(model.matrix(Salary~., data=test))

train <- hitters2[trainid,]
test <- hitters2[-trainid,]
test.y <- hitters2$Salary[-trainid]

hitters.lm <- lm(Salary~., data=train)
hitters.pred <- predict(hitters.lm, newdata=test)
mse.lm <- mean((hitters.pred-test.y)^2) # MSE = 0.49

## Ridge Regression ##

modelmat.train <- model.matrix(Salary~., data=train)[,-1]

library(glmnet)

## docs say to use a grid for possible lambda values ##

grid <- 10^seq(10, -2, length=100)

ridge.mod <- glmnet(modelmat.train, train$Salary, alpha=0, lambda=grid) # alpha=0 for ridge, 1 for lasso

## Choosing optimal lambda ##

ridge.cv.out <- cv.glmnet(modelmat.train, train$Salary, alpha=0)
plot(ridge.cv.out)
bestlam <- ridge.cv.out$lambda.min
bestlam # = 0.0634

modelmat.test <- model.matrix(Salary~., data=test)[,-1]

ridge.mod <- glmnet(modelmat.train, train$Salary, alpha=0)
ridge.pred <- predict(ridge.mod, newx = modelmat.test, s=bestlam) # s = the value of lambda at which to predict, here the best lambda found by cross-validation above
mse.ridge <- mean((ridge.pred-test.y)^2)
mse.ridge # = 0.4568

## Lasso ##

# Finding the best value for lambda ##

lasso.cv.out <- cv.glmnet(modelmat.train, train$Salary, alpha=1)
plot(lasso.cv.out)
bestlam.lasso <- lasso.cv.out$lambda.min
bestlam.lasso

lasso.mod <- glmnet(modelmat.train, train$Salary, alpha=1, lambda=bestlam.lasso)
lasso.pred.coefs <- predict(lasso.mod, type="coefficients", newx = modelmat.test)
lasso.pred.response <- predict(lasso.mod, newx = modelmat.test)
mse.lasso <- mean((lasso.pred.response-test.y)^2)
mse.lasso # = 0.469

## Best subset selection ##

library(leaps)

regfit.full <- regsubsets(Salary~., data=train, nvmax=ncol(train)-1)
summary.regfit.full <- summary(regfit.full)
summary.regfit.full

par(mfrow=c(3,1))

plot(1:19, summary.regfit.full$bic, type="b")
min.bic <- which.min(summary.regfit.full$bic)
points(min.bic, summary.regfit.full$bic[min.bic], col="red", pch=16)

plot(1:19, summary.regfit.full$cp, type="b")
min.cp <- which.min(summary.regfit.full$cp)
points(min.cp, summary.regfit.full$cp[min.cp], col="red", pch=16)

plot(1:19, summary.regfit.full$adjr2, type="b")
max.adjr2 <- which.max(summary.regfit.full$adjr2)
points(max.adjr2, summary.regfit.full$adjr2[max.adjr2], col="red", pch=16)

# choosing 7 variables as a happy medium

regfit.best <- regsubsets(Salary~., data=train, nvmax=7)

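# Prediction helper for regsubsets objects, which have no predict() method of their own
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])  # recover the model formula from the original call
  mat <- model.matrix(form, newdata)    # design matrix for the new data
  coefi <- coef(object, id = id)        # coefficients of the best model with id variables
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}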

bss.pred <- predict.regsubsets(regfit.full, test, 7)
mse.bss <- mean((bss.pred-y.test)^2)
mse.bss # = 0.479

# Plotting the MSEs #

mse.methods <- c(mse.boost, mse.lm, mse.ridge, mse.lasso, mse.bss)

barplot(mse.methods, names=c("boost","linear","ridge","lasso","bss"), xlab="Method", ylab="Test MSE")

#f)



summary(boost.hitters)

# CAtBat is the most important factor, followed by CRBI, CRuns, CHits and CWalks

#g) # Bagging on the training set

library(randomForest)
bag.hitters <- randomForest(Salary~., data=train, mtry=ncol(train)-1)
pred.bag <- predict(bag.hitters, newdata = test)
bag.mse <- mean((pred.bag-y.test)^2)
bag.mse # = 0.233

#11)
#a)
data("Caravan")

dim(Caravan)

trainid <- 1:1000

train <- Caravan[trainid,]
test <- Caravan[-trainid,]

#b)

Caravan$Purchase <- ifelse(Caravan$Purchase=="Yes",1,0)
table(Caravan$Purchase)

train <- Caravan[trainid,]
test <- Caravan[-trainid,]

caravan.boost <- gbm(Purchase~., data=train, distribution = "bernoulli", n.trees = 1000, shrinkage=0.01) ## Would usually cross-validate for the shrinkage value, but here it is provided

summary(caravan.boost)

#PPERSAUT is the most important, followed by MKOOPKLA and MOPLHOOG

#c)

boost.preds <- predict(caravan.boost, newdata = test, n.trees = 1000, type="response")

head(boost.preds)

boost.preds <- ifelse(boost.preds > 0.2, 1, 0)
table(boost.preds, test$Purchase)

# The model predicts that 144 people will make a purchase, and 31 actually do for a correct prediction rate of 21.5%
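
# The 21.5% figure is the fraction of predicted purchasers who actually purchase (the precision). A minimal sketch of that calculation for an arbitrary cut-off follows; precision_at_cutoff() and the toy probabilities are hypothetical and only illustrate the arithmetic.

precision_at_cutoff <- function(probs, truth, cutoff) {
  predicted_yes <- probs > cutoff
  # Among observations predicted to purchase, the share that really did
  sum(truth[predicted_yes] == 1) / sum(predicted_yes)
}

toy_probs <- c(0.05, 0.30, 0.25, 0.10, 0.60)
toy_buys  <- c(0, 1, 0, 0, 1)
precision_at_cutoff(toy_probs, toy_buys, cutoff = 0.2) # 2 of the 3 flagged toy customers purchased

31 / 144 # the same ratio for the boosted model above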

## KNN ##

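# Choosing k by comparing the number of test-set errors (a validation-set approach rather than true cross-validation)

library(class) # provides knn()

knn.errs <- rep(NA,20)

# Drop the Purchase column by name; indexing with -train$Purchase would not remove it
x.train <- train[, names(train) != "Purchase"]
x.test <- test[, names(test) != "Purchase"]
y.train <- train$Purchase
y.test <- test$Purchase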

for(i in 1:20) {

caravan.knn <- knn(x.train, x.test, y.train, k=i)
knn.errs[i] <- sum(caravan.knn!=y.test)

}

par(mfrow=c(1,1))
plot(knn.errs, type="b")

# k=11 seems to show the smallest error

caravan.knn <- knn(x.train, x.test, y.train, k=11)

head(caravan.knn) # KNN predicts that nobody would make a purchase

## Logistic Regression ##

logistic.mod <- glm(Purchase~., data=train, family=binomial)
logistic.pred <- predict(logistic.mod, newdata = test, type="response")
summary(logistic.pred) # distribution of the predicted purchase probabilities

logistic.pred <- ifelse(logistic.pred > 0.2, 1, 0)

table(logistic.pred, y.test)
# Logistic regression predicts that 408 customers would buy and 58 of them did, a correct prediction rate of 14.2%

#12)

data(Weekly)

## Predicting Direction based on all other variables using bagging, boosting, random forest, logistic regression ##

dim(Weekly)

set.seed(1)
trainid <- sample(1:nrow(Weekly), nrow(Weekly)/2, replace=F)

Weekly$Direction <- ifelse(Weekly$Direction =="Up", 1, 0)
Weekly <- Weekly[,-c(1,8)]

train <- Weekly[trainid,]
test <- Weekly[-trainid,]

# Logistic Regression #

y.test <- test$Direction


dir.glm <- glm(Direction~., data=train, family="binomial") # Today was removed above because it perfectly predicts Direction, which makes the fit return an error
glm.probs <- predict(dir.glm, newdata = test, type = "response")
glm.pred <- ifelse(glm.probs >=0.5,1,0)

table(y.test, glm.pred) # Classification error rate = 233/545 = 42.75%. Random guessing would be expected to have a classification error rate of 50%.

logit.accuracy <- mean(glm.pred == y.test) # accuracy computed directly from the test predictions

# Boosting #

# Cross validate for best lambda value

powers <- seq(-3,-0.1,0.1)
lambdas <- 10^powers
errors <- rep(NA,length(powers))


for(i in seq_along(lambdas)) {

  dir.boost <- gbm(Direction~., data=train, n.trees=1000, distribution = "bernoulli", shrinkage=lambdas[i])

  pred <- predict(dir.boost, newdata = test, n.trees = 1000, type="response")
  pred.binary <- ifelse(pred >= 0.5, 1, 0)
  
  errors[i] <- sum(pred.binary!=y.test)
}

plot(errors, type="b") 

points(which.min(errors), errors[which.min(errors)], pch=16, col="red")

best.lambda <- lambdas[which.min(errors)]
best.lambda # = 0.1585

# Cross validating for best value for interaction depth

cv.errors <- rep(NA, 10)

for (i in 1:10) {
  
  dir.boost <- gbm(Direction~., data=train, n.trees=1000, distribution = "bernoulli", shrinkage=best.lambda, interaction.depth = i)

  pred <- predict(dir.boost, newdata = test, n.trees = 1000, type="response")
  pred.binary <- ifelse(pred >= 0.5, 1, 0)
  
  cv.errors[i] <- sum(pred.binary!=y.test)
}

plot(cv.errors, type="b") 
points(which.min(cv.errors), cv.errors[which.min(cv.errors)], pch=16, col="red")
best.interaction.depth <- which.min(cv.errors)
best.interaction.depth

# Fitting the final model on the training set and evaluating it on the test set

dir.boost <- gbm(Direction~., data=train, n.trees=1000, distribution = "bernoulli", shrinkage=best.lambda, interaction.depth = best.interaction.depth)
boost.probs <- predict(dir.boost, newdata = test, n.trees = 1000, type="response")
boost.preds <- ifelse(boost.probs >= 0.5, 1, 0)

table(boost.preds, y.test) # Prediction accuracy of 299/545 = 54.86%

boost.accuracy <- 0.5486

# Bagging #

dir.bag <- randomForest(Direction~., data=train, mtry=ncol(train)-1)
probs.bag <- predict(dir.bag, newdata = test, type="response")
preds.bag <- ifelse(probs.bag >= 0.5, 1, 0)
table(preds.bag, y.test) # Prediction accuracy of 302/545 = 55.41%

bag.accuracy <- 0.5541

# Random Forest #

dir.rf <- randomForest(Direction~., data=train, mtry=(ncol(train)-1)/2)
dir.rf2 <- randomForest(Direction~., data=train, mtry=(ncol(train)-1)/3)
dir.rf3 <- randomForest(Direction~., data=train, mtry=sqrt(ncol(train)))
dir.rf4 <- randomForest(Direction~., data=train, mtry=2, ntree=500)

probs.rf <- predict(dir.rf, newdata = test, type = "response")
probs.rf2 <- predict(dir.rf2, newdata = test, type = "response")
probs.rf3 <- predict(dir.rf3, newdata = test, type = "response")
probs.rf4 <- predict(dir.rf4, newdata = test, type = "response")

preds.rf <- ifelse(probs.rf >= 0.5, 1, 0)
preds.rf2 <- ifelse(probs.rf2 >= 0.5, 1, 0)
preds.rf3 <- ifelse(probs.rf3 >= 0.5, 1, 0)
preds.rf4 <- ifelse(probs.rf4 >= 0.5, 1, 0)

table(preds.rf, y.test) # 300/545
table(preds.rf2, y.test) # 298/545
table(preds.rf3, y.test) # 290/545
table(preds.rf4, y.test)

# mtry=(ncol(train)-1)/2 has the best predictive results

rf.accuracy <- 0.5504

results <- c(0.5, logit.accuracy, boost.accuracy, bag.accuracy, rf.accuracy)

barplot(results, main="Prediction Accuracy", names=c("Chance","Logit","Boosting","Bagging","Random Forest"), ylim=c(0.4,0.6))

# Bagging shows the greatest predictive capability 



```

