# MATH 3375 Examples Notebook #19

# Bootstrap Samples and Ensemble Methods: Bagging and Random Forests

**_Ensemble methods_** combine the predictions of multiple models:
* For a continuous (quantitative) response variable: the mean or weighted mean of the predicted values from all models
* For classification: a majority "vote" among the predicted classes from all models

This has the effect of:
* Reducing variance
* Reducing the effect of any one model being overfit 
* Improving prediction overall
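
The two ways of combining predictions can be sketched in base R. The prediction values below are made up purely for illustration (they do not come from fitted models), and the helper name `majority_vote` is hypothetical.

```r
# Hypothetical predictions from three models for the same four observations
pred_numeric <- cbind(model1 = c(1.4, 4.7, 5.9, 4.1),
                      model2 = c(1.5, 4.5, 6.1, 4.0),
                      model3 = c(1.3, 4.9, 5.7, 4.5))

# Continuous response: average across models (one row per observation)
ensemble_mean <- rowMeans(pred_numeric)
ensemble_mean

# Hypothetical class predictions from three models for four observations
pred_class <- cbind(model1 = c("setosa", "versicolor", "virginica", "versicolor"),
                    model2 = c("setosa", "versicolor", "virginica", "virginica"),
                    model3 = c("setosa", "virginica", "virginica", "versicolor"))

# Classification: most frequent class in each row
# (a tie would go to the class that comes first alphabetically)
majority_vote <- function(votes) names(which.max(table(votes)))
ensemble_class <- apply(pred_class, 1, majority_vote)
ensemble_class
```

Notice that a single model disagreeing with the other two (rows 2 and 4 of the class predictions) does not change the ensemble's answer.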

We will illustrate with the **iris** data set.


In [None]:
#Look at data set
head(iris)

## Generating More Samples for Training: Bootstrap Samples

To create multiple models, we need more than one training data set. Rather than subdivide the one data set we have into smaller samples, we can leverage the power of **bootstrapping** to create many samples that are just as large as our full original data set. Each 'bootstrap sample' is created by sampling _**with replacement**_ from the original data set. It represents what _another_ sample from the same population **_could_** look like.

An example is given below. The iris data set has 150 records. We sample a vector of 150 row numbers (**idx**) and use it to build another data set of 150 records. The new data set is not identical to the original, because some rows may be chosen multiple times, while others may not be chosen at all. The ability to choose rows more than once is the result of sampling with replacement.

### Sample Row Numbers with Replacement

The first step is to sample the possible row numbers with replacement to identify which rows will be included in the bootstrap sample. (Note that the row numbers are NOT selected in order, but they are displayed in order to make it easier to see which row numbers were included multiple times, and which were not included at all.)

In [None]:
#Choose rows for bootstrap sample
set.seed(3375)
idx <- sample(1:nrow(iris),nrow(iris),replace=TRUE)
sort(idx)
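
Rather than scanning the sorted row numbers by eye, we can count the repeats and the omissions directly. This is a minimal sketch that redraws the same sample so it runs on its own. The names `draw_counts`, `repeated_rows`, and `oob_rows` are introduced here for illustration, and "out-of-bag" is the usual term for rows that a bootstrap sample leaves out.

```r
set.seed(3375)
n <- nrow(iris)
idx <- sample(1:n, n, replace = TRUE)

# How many times was each selected row drawn?
draw_counts <- table(idx)
repeated_rows <- as.integer(names(draw_counts)[draw_counts > 1])

# Rows never drawn ("out-of-bag" rows)
oob_rows <- setdiff(1:n, idx)

length(unique(idx))    # number of distinct rows in the bootstrap sample
length(repeated_rows)  # number of rows drawn more than once
length(oob_rows)       # number of rows left out
length(oob_rows) / n   # proportion left out in this sample
(1 - 1/n)^n            # probability that any given row is left out
```

The last line is the chance that a particular row is missed in all n draws. For large n it approaches exp(-1), or about 0.37, so roughly a third of the rows are typically absent from any one bootstrap sample.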

### Use Row Numbers to Create Sample

In [None]:
boot1 <- iris[idx,]
head(boot1)

### Compare Bootstrap Sample to Original Data Set 

The comparison below further illustrates that the bootstrap sample is the same size as our original data set, but has a slightly different composition.

In [None]:
#Size of data sets
nrow(iris)
nrow(boot1)

#Summary of all variables in the bootstrap sample
summary(boot1)

In [None]:
#Summary of data set variables
summary(iris$Species)
summary(boot1$Species)

In [None]:
summary(iris$Petal.Length)
summary(boot1$Petal.Length)

## Bagging

Bagging (**B**ootstrap **agg**regat**ing**) is a process in which several bootstrap samples are created, and a model (such as a tree) is fit to each sample. The predictions from all models are then combined (averaged for a continuous response, or decided by majority vote for classification) to give the prediction for each data point.
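
To make the process concrete, here is a minimal sketch of bagging done by hand for classification. It assumes the **rpart** package (a recommended package distributed with R) for the individual trees. The choice of 25 trees and the names `bag_train`, `bag_test`, `votes`, and `bagged_pred` are illustrative assumptions, not part of any package.

```r
library(rpart)  # recommended package distributed with R

set.seed(3375)
n_trees <- 25                         # illustrative choice
test_rows <- c(14, 23, 80, 119, 123)  # same rows the notebook holds out later
bag_train <- iris[-test_rows, ]
bag_test  <- iris[test_rows, ]

# Fit one classification tree per bootstrap sample and predict the test rows
votes <- sapply(1:n_trees, function(b) {
  idx  <- sample(1:nrow(bag_train), nrow(bag_train), replace = TRUE)
  tree <- rpart(Species ~ ., data = bag_train[idx, ], method = "class")
  as.character(predict(tree, bag_test, type = "class"))
})

# votes has one row per test observation and one column per tree;
# the bagged prediction is the most frequent class in each row
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
data.frame(Actual = bag_test$Species, Bagged = bagged_pred)
```

For a continuous response, the same loop would collect numeric predictions and the final step would use `rowMeans(votes)` instead of a vote.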

## Random Forest

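A random forest uses the principle of bagging with decision trees, but also _**randomly selects which features to consider as predictors**_ at each split within each tree (this is the **mtry** parameter described later in this notebook). Thus, each tree has a different training set, and each split within a tree chooses from a different random subset of the features.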

We will implement random forests with the **randomForest** package.

In [None]:
#install.packages("randomForest")
library(randomForest)

In [None]:
#Create test and train set

test_rows <- c(14,23,80,119,123)
iris_test <- iris[test_rows,]
iris_train <- iris[-test_rows,]

In [None]:
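#Create model to predict species
#(set a seed first: the forest is built from random samples, so results vary between runs)
set.seed(3375)
iris_model_forest_01 <- randomForest(Species~.,data=iris_train)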
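
Printing a fitted forest gives a quick check of how well it is doing. Each tree is trained on a bootstrap sample, so the rows left out of that sample (the "out-of-bag" rows) act as a built-in test set for that tree. The sketch below refits the same model under the name `rf_fit` (a name chosen here so the block runs on its own) and inspects a few components of the returned object.

```r
library(randomForest)

set.seed(3375)
test_rows  <- c(14, 23, 80, 119, 123)
iris_train <- iris[-test_rows, ]
rf_fit <- randomForest(Species ~ ., data = iris_train)

print(rf_fit)             # includes the out-of-bag (OOB) error estimate and a confusion matrix
rf_fit$ntree              # number of trees grown
rf_fit$mtry               # number of variables tried at each split
tail(rf_fit$err.rate, 1)  # OOB and per-class error rates after the final tree
```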

In [None]:
#Display some information available from model 
#Variable Importance (Average across all trees) 

importance(iris_model_forest_01)
varImpPlot(iris_model_forest_01)

In [None]:
#Create model to predict petal length (continuous response)
set.seed(3375)
iris_model_forest_02 <- randomForest(Petal.Length~.,data=iris_train)

#Variable importance (continuous response variable)

importance(iris_model_forest_02)
varImpPlot(iris_model_forest_02)

## Using the Random Forest Models for Prediction

Using the test set that we set aside, we will see how each random forest can be used for prediction.  

In [None]:
iris_test


In [None]:
test_pred_species <- predict(iris_model_forest_01,iris_test)
test_pred_species

In [None]:
test_pred_length <- predict(iris_model_forest_02,iris_test)
test_pred_length

### Comparing Predictions with Actual Values

In [None]:
data.frame(Actual=iris_test$Species,Predicted=test_pred_species)
data.frame(Actual=iris_test$Petal.Length,Predicted=test_pred_length)
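
The side-by-side tables can be summarized with a single number for each model. This sketch refits both forests so it runs on its own, then computes the proportion of correct classifications and the root mean squared error (RMSE). The names `rf_species`, `rf_length`, `accuracy`, and `rmse` are chosen here for illustration. With only five test rows, these numbers are rough indications rather than reliable estimates.

```r
library(randomForest)

set.seed(3375)
test_rows  <- c(14, 23, 80, 119, 123)
iris_test  <- iris[test_rows, ]
iris_train <- iris[-test_rows, ]

rf_species <- randomForest(Species ~ ., data = iris_train)
rf_length  <- randomForest(Petal.Length ~ ., data = iris_train)

pred_species <- predict(rf_species, iris_test)
pred_length  <- predict(rf_length, iris_test)

# Classification: confusion table and proportion correct
table(Actual = iris_test$Species, Predicted = pred_species)
accuracy <- mean(pred_species == iris_test$Species)
accuracy

# Regression: root mean squared error, in the same units as Petal.Length
rmse <- sqrt(mean((iris_test$Petal.Length - pred_length)^2))
rmse
```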

## Parameters for Fine Tuning Random Forests

Just as decision trees can be fine-tuned (e.g., pruned to a desired level), so can random forests. Some of the most important tuning parameters are:

* ntree = number of trees to grow. The default is 500.
* mtry = number of variables randomly sampled as candidates at each split. 
  The default is sqrt(p) for classification and p/3 for regression (both rounded down), where p is the number of predictors
* nodesize = minimum size of terminal nodes. 
  The default value is 1 for classification and 5 for regression
  
Ideally, you can try several different values of each parameter to see what yields the best results.
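
One simple way to compare parameter values is to fit a forest for each candidate and record its out-of-bag (OOB) error, which is the error rate on the rows each tree did not see in its bootstrap sample. The sketch below does this for **mtry** only. The names `mtry_values` and `oob_error` and the choice to reuse one seed are illustrative assumptions.

```r
library(randomForest)

test_rows  <- c(14, 23, 80, 119, 123)
iris_train <- iris[-test_rows, ]

mtry_values <- 1:4   # iris has p = 4 predictors of Species
oob_error <- sapply(mtry_values, function(m) {
  set.seed(3375)     # same seed for each fit so the comparison is repeatable
  fit <- randomForest(Species ~ ., data = iris_train,
                      mtry = m, ntree = 500, nodesize = 1)
  fit$err.rate[fit$ntree, "OOB"]   # OOB error after the final tree
})

data.frame(mtry = mtry_values, oob_error = oob_error)
```

The same pattern works for **ntree** or **nodesize** by changing which argument varies. On a data set as small and well separated as iris, the differences between settings are likely to be small.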

The documentation for randomForest gives more detail on the options (parameters) and on what is stored in the model that is returned.

In [None]:
?randomForest