## Random Forests


Random Forest is often the algorithm of choice when no other algorithm is an obvious fit for a dataset, or when one wants to learn about the data before applying a more specialized or complex algorithm. It is widely regarded as a strong general-purpose method for many data science problems.

Random Forests can perform both regression and classification tasks. They also help with dimensionality reduction, the handling of missing values and outliers, and other essential steps of data exploration, and they do a fairly good job at each. Before getting into the details of random forests, it helps to understand how decision trees work.

**Decision trees** are a type of supervised learning algorithm mostly used in classification problems. They work for both categorical and continuous input and output variables. The main idea behind the algorithm is to split the population or sample into two or more sub-populations based on the most significant differentiator among the input variables.

<img src="../images/decision_tree.png">


image source: [AnalyticsVidhya](https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/)

**Example:** Consider the problem of predicting whether a customer will repay a loan (yes or no). The customer's income is the deciding variable in this case, but the company doesn't have income details for all customers. Based on the insight that income drives this decision, a decision tree can be built to predict a customer's income from occupation, education level, sex and various other variables. Here a continuous variable is being predicted.

A decision tree algorithm builds a single tree, whether for classification or regression, using the CART (Classification and Regression Trees) model, whereas the random forest algorithm builds multiple trees. To classify an object based on its attributes, each tree in the forest gives a classification and thereby votes for a class. The forest chooses the classification with the most votes (over all the trees in the forest); in the case of regression, it takes the average of the outputs of the different trees.
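The aggregation step described above can be sketched in a few lines of base R. The per-tree outputs below are made-up values for illustration only; they are not produced by a fitted forest.

```r
# Hypothetical class votes from five trees for a single observation
tree_votes <- c("setosa", "versicolor", "versicolor", "virginica", "versicolor")

# Classification: the forest returns the class with the most votes
vote_counts  <- table(tree_votes)
forest_class <- names(vote_counts)[which.max(vote_counts)]
forest_class

# Hypothetical numeric outputs from five regression trees
tree_outputs <- c(2.1, 1.9, 2.4, 2.0, 2.2)

# Regression: the forest returns the average of the tree outputs
forest_value <- mean(tree_outputs)
forest_value
```

Here three of the five trees vote for versicolor, so that class wins; the regression result is simply the mean of the five values.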

Let us illustrate this idea on the iris dataset and compare the results. First, load the built-in iris dataset.

In [None]:
data(iris)          # load the built-in iris dataset
iris_data <- iris   # keep a working copy
head(iris_data)

In [None]:
# Visually inspect the data on a graph

library(ggplot2)
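# qplot() is deprecated in recent ggplot2 releases, so use ggplot() directly
ggplot(iris_data, aes(x=Petal.Length, y=Petal.Width, colour=Species)) + geom_point()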

In [None]:
# Load the packages needed to build a CART model (install them first if necessary).
library(rpart)
library(caret)

**Reference:**

- [rpart](https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
- [caret](https://cran.r-project.org/web/packages/caret/vignettes/caret.pdf)
- [Tree based models](http://www.statmethods.net/advstats/cart.html)

Divide the population into training and testing sets. Then compare the predictive power of the decision tree and the random forest on the testing set.

In [None]:
# Create a vector called flag so that 70% of the data goes into the training set and the rest into the testing set.
# flag holds the row numbers of the observations that go into the training set; the remaining rows of iris_data
# go into the testing set.
flag <- createDataPartition(y=iris_data$Species,p=0.7,list=FALSE)

# training will have rows from iris_data for the row numbers present in flag vector.
training <- iris_data[flag,]
nrow(training)

# testing will have rows from iris_data which are not present in flag vector.
testing <- iris_data[-flag,]
nrow(testing)

So we have 105 observations in the training set and 45 in the testing set.
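For intuition, the stratified 70/30 split that `createDataPartition()` performs on a factor outcome can be approximated in base R. This is only a sketch of the idea, not caret's implementation; the seed and the `*_sketch` names are assumptions introduced for this example.

```r
set.seed(42)  # arbitrary seed, assumed here only for reproducibility
data(iris)

# For each species, sample 70% of its row indices so class proportions are preserved
train_idx <- unlist(lapply(
  split(seq_len(nrow(iris)), iris$Species),
  function(rows) sample(rows, size = round(0.7 * length(rows)))
))

training_sketch <- iris[train_idx, ]
testing_sketch  <- iris[-train_idx, ]

table(training_sketch$Species)  # rows per species in the training part
nrow(training_sketch)
nrow(testing_sketch)
```

Because each species has 50 rows, sampling 70% within each species yields 35 rows per class, which is where the 105/45 split comes from.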

Build a CART model. The "caret" and "rpart" packages will be used to build the model. To draw a more graphically appealing tree in R, a plotting package such as "rpart.plot" (used in the commented code in the next cell) or "rattle" can be used; both produce clean trees that are easy to interpret.

In [None]:
# install.packages("rpart.plot",repo="http://cran.mtu.edu/")

fit <- train(Species~.,method="rpart",data=training)

# Code for generating decision tree plot
# rpart_fit <- rpart(Species~.,method="class",data=training) 
# library(rpart.plot)
# rpart.plot(rpart_fit)

Now check the predictive power of the CART model that was just built, using the number of misclassifications as the evaluation criterion.

In [None]:
train.pred<-predict(fit,newdata=training)
table(train.pred,training$Species)

In [None]:
# Misclassification rate = 4/105
4/105

There are 4 misclassifications out of 105 observations. The misclassification rate indicates the model's predictive power. Once the model is built, it should be validated on a test set to see how well it performs on unseen data. This helps determine whether the model is overfitted: if it is, validation will show a sharp decline in predictive power.
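Rather than counting errors by eye, the misclassification rate can be computed from the confusion table itself. The matrix below is a hypothetical table with 4 errors out of 105, chosen to match the counts quoted above; the actual cell values depend on the random split.

```r
classes <- c("setosa", "versicolor", "virginica")

# Hypothetical confusion matrix: rows are predictions, columns are true classes
conf_mat <- matrix(c(35,  0,  0,
                      0, 33,  2,
                      0,  2, 33),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(predicted = classes, actual = classes))

# Correct predictions sit on the diagonal; everything else is an error
misclassified <- sum(conf_mat) - sum(diag(conf_mat))
misclass_rate <- misclassified / sum(conf_mat)

misclassified
misclass_rate
```

The same two lines work on the output of `table(train.pred, training$Species)`, since `table()` returns a matrix-like object with predictions in rows and true classes in columns.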

In [None]:
test.pred<-predict(fit,newdata=testing)
table(test.pred,testing$Species)

In [None]:
# Misclassification rate = 3/45
3/45

The predictive power decreased on the testing set compared with the training set. This is generally the case, because the model is fitted to the training data and then merely applied to the testing data, which it has not seen before.

### Random Forest

Run the random forest algorithm on iris_data to compare the results with the CART model.

In [None]:
library(randomForest)
# library(randomForestSRC)

In [None]:
# randomForest() infers classification from the factor response, so no method argument is needed
RandomForest_fit <- randomForest(Species~.,data=training,importance=TRUE)


plot(RandomForest_fit)
legend("topright", colnames(RandomForest_fit$err.rate),col=1:4,cex=0.8,fill=1:4)

The plot shows how the error rate changes with the number of trees constructed. Try varying the number of trees (the `ntree` argument) to see its effect.

In [None]:
varImpPlot(RandomForest_fit)

#### Gini importance

Every time a node is split on variable m, the Gini impurity criterion for the two descendant nodes is lower than for the parent node. Adding up the Gini decreases for each individual variable over all trees in the forest gives a fast variable importance measure that is often very consistent with the permutation importance measure.

**Reference:** [Variable importance](https://en.wikipedia.org/wiki/Random_forest#Variable_importance)
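To make the Gini decrease concrete, the sketch below computes it for one split of the full iris data. The helper `gini_impurity()` and the threshold `Petal.Length < 2.45` are assumptions chosen for illustration; a random forest evaluates many candidate splits on bootstrap samples and sums the decreases per variable.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class proportions
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

data(iris)
parent <- iris$Species

# Illustrative split on Petal.Length (threshold assumed for this example)
left  <- iris$Species[iris$Petal.Length <  2.45]
right <- iris$Species[iris$Petal.Length >= 2.45]

# Impurity of the children, weighted by the share of observations in each
weighted_children <- (length(left)  * gini_impurity(left) +
                      length(right) * gini_impurity(right)) / length(parent)

gini_decrease <- gini_impurity(parent) - weighted_children
gini_decrease
```

The split isolates all 50 setosa flowers in a pure left node, so the parent impurity of 2/3 falls to a weighted child impurity of 1/3.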

In [None]:
importance(RandomForest_fit)

In [None]:
# install.packages("party",repo="http://cran.mtu.edu/")

# library(party)
 
# ct = ctree(Species~., data = training)
# plot(ct, main="Tree")
 
# #Table of prediction errors
# table(predict(ct), training$Species)
 
# # Estimated class probabilities
# train.pred = predict(ct, newdata=training, type="prob")

In [None]:
RF_fit <- train(Species~ .,method="rf",data=training)

In [None]:
train_RF_pred <- predict(RF_fit,training)

In [None]:
table(train_RF_pred,training$Species)

The misclassification rate on the training data is 0/105. Validate the model on the test data to make sure that it is not overfitted to the training data.

In [None]:
test_RF_pred<-predict(RF_fit,newdata=testing)

In [None]:
table(test_RF_pred,testing$Species)

There are 3 misclassified observations out of 45, which is similar to the predictive power of the CART model. There is a noticeable drop in predictive power compared with the training misclassification rate.

**Feature Reduction using ANOVA, MANOVA and Random Forests.**

Apply Random Forests and the techniques we have seen in other lab notebooks to the bikeshare dataset.

In [None]:
bikeshare_data = read.csv("../../../datasets/bikeshare/hour.csv")
head(bikeshare_data)

In [None]:
str(bikeshare_data)

In [None]:
bikeshare_data$hr = factor(bikeshare_data$hr)
bikeshare_data$weekday = factor(bikeshare_data$weekday)

In [None]:
fit <- aov(mnth ~ hr, data=bikeshare_data)
summary(fit)

The p-value suggests that the mean of `mnth` does not vary significantly by hour.
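When the p-value needs to be used programmatically rather than read off the printed summary, it can be extracted from the ANOVA table. The sketch below uses the built-in iris data instead of the bikeshare file so that it runs on its own; the variable names are assumptions for this example.

```r
data(iris)

# One-way ANOVA: does mean petal length differ between species?
fit_sketch <- aov(Petal.Length ~ Species, data = iris)

# summary() returns a list whose first element is the ANOVA table
anova_table <- summary(fit_sketch)[[1]]
anova_table

# The first row belongs to the grouping factor; "Pr(>F)" holds its p-value
p_value <- anova_table[["Pr(>F)"]][1]
p_value < 0.05  # TRUE means the group means differ significantly at the 5% level
```

A small p-value indicates that at least one group mean differs from the others; a large one, as in the `mnth ~ hr` fit above, indicates no detectable difference between the group means.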

In [None]:
fit <- aov(mnth ~ weekday, data=bikeshare_data)
summary(fit)

In [None]:
fit <- aov(temp ~ hr, data=bikeshare_data)
summary(fit)

In [None]:
# Do pairwise comparison between group means for each hour
pairwise.t.test(bikeshare_data$temp, bikeshare_data$hr, p.adjust.method="bonferroni")

Although the temperature does not differ much between some hours, it varies considerably between others. Let's find the mean temperature for each hour using tapply().

In [None]:
t(tapply(bikeshare_data$temp,bikeshare_data$hr,mean))

In [None]:
# Do a MANOVA on variables temp,hum,windspeed,holiday and weathersit using hr and weekday variables
summary(manova(cbind(temp,hum,windspeed,holiday,weathersit) ~ hr * weekday,
               data = bikeshare_data), test = "Hotelling-Lawley")

According to the MANOVA, the variables temp, hum, windspeed, holiday and weathersit vary by hr and weekday. Let's run the same analysis for the rest of the variables.
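As with `aov()`, the MANOVA test result can be pulled out of the summary object instead of being read from the printout. This sketch uses the built-in iris data as a stand-in for the bikeshare file, with two response variables chosen arbitrarily for illustration.

```r
data(iris)

# MANOVA: do sepal and petal length jointly differ between species?
manova_fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
manova_summary <- summary(manova_fit, test = "Hotelling-Lawley")

# The stats matrix has one row per model term plus a Residuals row
manova_summary$stats

# p-value of the first term (Species)
manova_p <- manova_summary$stats[1, "Pr(>F)"]
manova_p < 0.05
```

For a model with an interaction such as `hr * weekday`, the stats matrix has one row for each main effect and one for the interaction, so each can be inspected separately.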

In [None]:
names(bikeshare_data)

In [None]:
summary(manova(cbind(season,mnth) ~ hr * weekday,
               data = bikeshare_data), test = "Hotelling-Lawley")

The variables mnth and season do not contribute much, as they show little variation in the data.

### Measuring variable importance using Random Forests

#### Gini importance

Gini importance is the mean Gini gain produced by a feature over all trees. Let `RF` be the random forest model fitted on the data:

`RF <- randomForest(..., importance=TRUE)`

There are two ways of checking the importance:

* `RF$importance`, **column**: MeanDecreaseGini

* `importance(RF, type=2)`

Note: When variables are of different types, there is a bias in favor of continuous variables and variables with many categories.

#### Permutation importance
 
Permutation importance is the mean decrease in classification accuracy, over all trees, after permuting the feature.

`RF <- randomForest(..., importance=TRUE)`

- `RF$importance`, **column**: MeanDecreaseAccuracy
- `importance(RF, type=1)`

With `cforest()` from the party package, the corresponding function is `varimp()`:

- `obj <- cforest(...)`
- `varimp(obj)`

Note: When variables are of different types, the measure is unbiased only when subsampling is used, as in `cforest(..., controls = cforest_unbiased())`.
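The idea behind permutation importance can be sketched without any forest at all: fit a model, shuffle one feature, and see how much accuracy is lost. The example below is a simplified, model-agnostic variant that uses a nearest-centroid classifier as a stand-in model on the iris data; `randomForest` instead measures the drop on each tree's out-of-bag samples. All function and variable names here are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed so the permutations are reproducible
data(iris)
features <- setdiff(names(iris), "Species")

# Stand-in classifier: assign each row to the species with the nearest mean
centroids <- aggregate(iris[, features], by = list(Species = iris$Species), FUN = mean)

predict_nearest_centroid <- function(newdata) {
  X <- as.matrix(newdata[, features])
  C <- as.matrix(centroids[, features])
  # Squared Euclidean distance from every row to every centroid
  dists <- sapply(seq_len(nrow(C)), function(k) rowSums(sweep(X, 2, C[k, ])^2))
  as.character(centroids$Species)[apply(dists, 1, which.min)]
}

accuracy <- function(data) {
  mean(predict_nearest_centroid(data) == as.character(data$Species))
}

baseline <- accuracy(iris)

# Importance of a feature = accuracy lost when that feature is shuffled
perm_importance <- sapply(features, function(f) {
  permuted <- iris
  permuted[[f]] <- sample(permuted[[f]])
  baseline - accuracy(permuted)
})

sort(perm_importance, decreasing = TRUE)
```

A feature whose shuffling barely changes the accuracy carries little information the model relies on, while a large drop marks an important feature.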

In [None]:
# Train a model across all the training data and plot the variable importance
rf <- randomForest(bikeshare_data[,c("season","holiday","workingday","weathersit","temp","atemp","hum","windspeed","hr")], 
                                  bikeshare_data$count, ntree=50, importance=TRUE)
imp <- importance(rf, type=2)
Imp_features <- data.frame(Feature=row.names(imp), Importance=imp[,1])

p <- ggplot(Imp_features, aes(x=reorder(Feature, Importance), y=Importance)) +
     geom_bar(stat="identity", fill="blue") +
     coord_flip() + 
     theme_light(base_size=20) +
     xlab("") +             # x holds the feature names (drawn vertically after coord_flip)
     ylab("Importance") +   # y holds the importance values
     ggtitle("Random Forest Feature Importance\n") +
     theme(plot.title=element_text(size=18))
p

In [None]:
cbind(importance(rf, type=1),importance(rf, type=2))

In [None]:
varImpPlot(rf)

From the plots, hr is the most important variable, followed by holiday, atemp and so on, according to the MeanDecreaseAccuracy measure of importance. hum is the most important variable according to the MeanDecreaseGini measure.