# MATH 3375 Examples Notebook #20

# Another Ensemble Method: Boosting

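Whereas **_bagging_** fits multiple independent models in parallel and combines them, **_boosting_** builds the model sequentially from several weaker models. At each iteration, the observations or errors are re-weighted according to how the weaker models so far have performed, so that the next weak model concentrates on the cases that are still predicted poorly.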

This has the effect of:
* Reducing bias
* Improving (reducing) the training error 

However, because each additional iteration keeps fitting the training data more closely, the method can lead to overfitting if it runs for too many iterations.  This should be controlled with effective **_parameter tuning_** and **_cross-validation_**.
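
The sequential idea described above can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions, not the algorithm as implemented in any package: it uses squared-error loss on a simulated regression problem (not the spam data), one-split "stumps" as the weak models, and a fixed learning rate. The names `fit_stump`, `predict_stump`, `nu`, and `M` are made up for this example.

```r
# Fit a one-split regression stump to residuals r using a single predictor x
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (s in unique(x)) {
    left <- x <= s
    if (all(left) || !any(left)) next          # skip splits that leave one side empty
    pl <- mean(r[left])
    pr <- mean(r[!left])
    sse <- sum((r[left] - pl)^2) + sum((r[!left] - pr)^2)
    if (sse < best$sse) best <- list(sse = sse, split = s, left = pl, right = pr)
  }
  best
}

predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

# Simulated data (hypothetical, for illustration only)
set.seed(3375)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

M  <- 100    # number of boosting iterations
nu <- 0.1    # learning rate (shrinkage)

F_hat <- rep(mean(y), length(y))   # start from a constant prediction
train_mse <- numeric(M)

for (m in seq_len(M)) {
  resid <- y - F_hat                              # what the current model still gets wrong
  stump <- fit_stump(x, resid)                    # weak model fit to those errors
  F_hat <- F_hat + nu * predict_stump(stump, x)   # small step toward correcting them
  train_mse[m] <- mean((y - F_hat)^2)
}

# Training error after 1, 10 and 100 iterations
train_mse[c(1, 10, 100)]
```

In this sketch the training error can only go down as iterations are added, which is exactly why the number of iterations has to be chosen with held-out data rather than with the training error.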

We will illustrate with the **spambase** data set.


In [None]:
#Look at data set
spam_data <- read.csv("spambase.csv")
head(spam_data)

The **gbm** package offers an implementation of boosting.  The acronym **gbm** stands for **G**radient **B**oosted **M**achine.

In [None]:
#install.packages("gbm")
library(gbm)

In [None]:
# Hold out a random 20% of the rows as a test set; train on the rest

set.seed(3375)  # fixed seed so the split is reproducible
testsize <- round(0.2 * nrow(spam_data))
test_rows <- sample(seq_len(nrow(spam_data)), testsize)
spam_test <- spam_data[test_rows, ]
spam_train <- spam_data[-test_rows, ]

nrow(spam_train)
nrow(spam_test)

In [None]:
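# Fit a boosted model to predict whether a record is spam.
# distribution='bernoulli' expects is_spam to be coded 0/1.
# n.trees sets the number of boosting iterations; cv.folds=5 adds 5-fold cross-validation.
spam_model_boost_01 <- gbm(is_spam ~ ., data=spam_train, distribution='bernoulli', n.trees=500, cv.folds=5)
# Relative influence of each predictor
summary(spam_model_boost_01)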

In [None]:
# Find the optimal value of M (the number of trees) by minimizing the cross-validation error

gbm.perf(spam_model_boost_01, method="cv")

## Using the Boosted Model for Prediction

Using the test set that we set aside, we will see how the boosted model can be used for prediction.  We are using the optimal value obtained above for number of trees (value of M).

In [None]:
spam_test


In [None]:
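# Use the cross-validated optimum for M (488 in the original run) rather than a hard-coded value
best_M <- gbm.perf(spam_model_boost_01, method="cv", plot.it=FALSE)
test_prob <- predict(spam_model_boost_01, spam_test, n.trees=best_M, type="response")
head(test_prob)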

### Types of response predictions

Like logistic regression, the binary classifier returns the log odds of a positive response by default (**type='link'**). If **type='response'** is specified, the prediction is a probability instead. It is up to the user to find the best cutoff for this probability to make the final prediction.
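
To make the relationship between the two prediction scales concrete, the small sketch below converts a few log odds values to probabilities and back using base R's logistic functions. The values in `log_odds` are made up for illustration and are not output from the model above.

```r
# Hypothetical log odds values (illustration only)
log_odds <- c(-2, 0, 2)

# Logistic transform: probability = 1 / (1 + exp(-log_odds))
prob <- plogis(log_odds)
prob

# plogis() agrees with the formula written out by hand
all.equal(prob, 1 / (1 + exp(-log_odds)))

# qlogis() goes back from probability to log odds
all.equal(qlogis(prob), log_odds)
```

A log odds of 0 corresponds to a probability of 0.5, so a cutoff of 0.5 on the probability scale is the same as a cutoff of 0 on the log odds scale.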

In [None]:
test_pred <- as.integer(test_prob > 0.5)
head(test_pred)
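
The cutoff of 0.5 used above is only a starting point. One simple way to compare cutoffs, sketched here with made-up labels and probabilities rather than the spam data, is to compute the accuracy at each candidate value. The vectors `actual` and `prob` are hypothetical; in practice the comparison would normally be done on a validation set or by cross-validation, not on the test set used for the final assessment.

```r
# Hypothetical validation labels and predicted probabilities (illustration only)
actual <- c(0, 0, 0, 1, 0, 1, 1, 0, 1, 1)
prob   <- c(0.05, 0.20, 0.35, 0.40, 0.45, 0.55, 0.60, 0.65, 0.80, 0.95)

# Candidate cutoffs to compare
cutoffs <- seq(0.1, 0.9, by = 0.1)

# Accuracy of the rule "predict 1 when prob > cutoff" at each cutoff
accuracy <- sapply(cutoffs, function(k) mean(as.integer(prob > k) == actual))

data.frame(cutoff = cutoffs, accuracy = accuracy)

# First cutoff that attains the highest accuracy
best_cutoff <- cutoffs[which.max(accuracy)]
best_cutoff
```

Accuracy is only one possible criterion. For spam filtering, where a false positive (a legitimate message marked as spam) may be more costly than a false negative, a different criterion may be more appropriate.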

### Comparing Predictions with Actual Values

In [None]:
data.frame(Actual=spam_test$is_spam,Probability=test_prob,Predicted=test_pred)

### Confusion Matrix for Predictions on Test Set

In [None]:
table(Actual=spam_test$is_spam,Predicted=test_pred)
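
A confusion matrix is usually summarized with a few standard rates. The sketch below shows how they can be read off a table of the same shape, using small made-up vectors (`actual` and `predicted` here are hypothetical, not the spam test set).

```r
# Hypothetical labels and predictions (illustration only)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 1)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1)

# Fixing the factor levels guarantees a 2x2 table even if a class is never predicted
cm <- table(Actual    = factor(actual,    levels = c(0, 1)),
            Predicted = factor(predicted, levels = c(0, 1)))
cm

TP <- cm["1", "1"]   # spam correctly flagged
TN <- cm["0", "0"]   # non-spam correctly passed
FP <- cm["0", "1"]   # non-spam wrongly flagged
FN <- cm["1", "0"]   # spam missed

accuracy    <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)   # proportion of actual spam that was caught
specificity <- TN / (TN + FP)   # proportion of non-spam that was passed
precision   <- TP / (TP + FP)   # proportion of flagged records that really are spam

c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, precision = precision)
```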

## Parameters for Fine Tuning gbm Models

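Ideally, you can try several different values of each tuning parameter and compare the cross-validation error to see what yields the best results. The main ones are the number of trees (**n.trees**), the depth of each tree (**interaction.depth**), the learning rate (**shrinkage**), the minimum number of observations in a terminal node (**n.minobsinnode**), and the fraction of the training data sampled at each iteration (**bag.fraction**).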
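
One possible way to organize such a comparison is a small grid search, sketched below. It assumes the **gbm** package is installed and uses simulated toy data (`toy` is hypothetical) so that it runs quickly; the candidate values in `grid` are illustrative choices, not recommendations. To apply the same idea to the spam data, replace `toy` and `y` with `spam_train` and `is_spam`.

```r
library(gbm)

# Simulated toy data standing in for the training set (illustration only)
set.seed(1)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1 - toy$x2))

# Candidate parameter combinations (illustrative values)
grid <- expand.grid(interaction.depth = c(1, 3),
                    shrinkage = c(0.1, 0.01))
grid$best_trees <- NA
grid$cv_error   <- NA

for (i in seq_len(nrow(grid))) {
  fit <- gbm(y ~ ., data = toy, distribution = "bernoulli",
             n.trees = 300,
             interaction.depth = grid$interaction.depth[i],
             shrinkage = grid$shrinkage[i],
             cv.folds = 5, n.cores = 1)
  # cv.error holds the cross-validation error after each iteration
  grid$best_trees[i] <- which.min(fit$cv.error)
  grid$cv_error[i]   <- min(fit$cv.error)
}

# Combinations ordered from lowest to highest cross-validation error
grid[order(grid$cv_error), ]
```

A smaller value of **shrinkage** generally needs more trees to reach its best cross-validation error, so if `best_trees` is equal to `n.trees` for some row, that combination probably needs a larger `n.trees` before the comparison is fair.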

The documentation for **gbm** gives more detail on the options (parameters) and on what is stored in the model that is returned.

In [None]:
?gbm