# Machine Learning (and Deep Learning) in just over 200 minutes with R
Created by [Ajay Hemanth](https://www.linkedin.com/in/ajayhemanth/?originalSubdomain=au) with [Yoni Nazarathy](https://yoninazarathy.com/). 

See more material at the [workshop's GitHub page](https://github.com/ajayhemanth/Machine-Learning-Workshop).

--- 
# Activity D: Random Forests with H2O
---

# H2O.ai

[H2O.ai](https://www.h2o.ai/) is one of the leading tools for machine learning. It provides a wide range of machine learning algorithms and executes them quickly. <br>

H2O.ai provides API support for R (along with many other tools) and also provides an easy-to-use user interface. R programmers can simply load the H2O package and use it like any other R package.<br>

Below we show the positioning of H2O.ai compared to other market leaders in the machine learning space. <br>

<img src="img/h2oGartner.png" width="50%"/>

We will explore tree-based machine learning algorithms with H2O.ai.<br>

### Getting started with the H2O engine
Running the [h2o.init](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/starting-h2o.html#from-r) function starts a new H2O process, which by default can be accessed at `localhost:54321`. Once an H2O process is running, any number of users can connect to it and collaborate on model building. <br>
There is no need to run `h2o.init()` repeatedly: by default it first tries to connect to an instance already running at that address and starts a new one only if none is found.

In [None]:
# install.packages('h2o')
library('h2o')
h2o.init() # start a local H2O instance, or connect to one that is already running
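As a minimal sketch of starting H2O with explicit settings and confirming that the cluster is reachable, the cell below can be used. The `nthreads` and `max_mem_size` values are illustrative assumptions, not settings taken from the workshop.

```r
library(h2o)

# Start a local H2O instance, or connect to one already running at this address.
# nthreads = -1 uses all available cores; "2G" is an assumed memory cap.
h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = "2G")

# TRUE when the cluster answers requests
cluster_up = h2o.clusterIsUp()
print(cluster_up)

# Print version, node count, and memory of the connected cluster
h2o.clusterInfo()

# When all H2O work is finished, the instance can be stopped with:
# h2o.shutdown(prompt = FALSE)
```

Stopping the instance discards every frame and model held in it, so it is usually left running for the whole session.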

The welcome page of H2O at `localhost:54321` is below. Apart from providing a collection of machine learning models, H2O also provides an interface to
- cleanse the data
- create a series of steps to perform data mining, prediction and analysis
- manage job statuses

and more, all of which can be explored from the interface.


<img src="img/h2oWelcome.PNG" width="500"/>

## Data Handling

The same covertype dataset will be used for working with Random Forest.

In [None]:
# Overload + so that it also concatenates strings. Don't worry about this cell.
"+" = function(x,y) { if(is.character(x) || is.character(y)) return(paste(x , y, sep="")) else .Primitive("+")(x,y) }

In [None]:
d1 = read.csv("dataset/covtype.data", header=FALSE, stringsAsFactors=FALSE)
colnames(d1) = c('Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points', 'WildernessArea'+(1:4), 'SoilType'+(1:40), 'CoverType')
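The column names in the cell above rely on the overloaded `+`, which pastes a prefix onto each number in a vector. As a small sketch of the same idea without operator overloading, base R's `paste0` gives the same kind of names:

```r
# paste0 is vectorised: one prefix combined with several numbers gives several names
wilderness_cols = paste0("WildernessArea", 1:4)
print(wilderness_cols)

soil_cols = paste0("SoilType", 1:40)
print(head(soil_cols, 3))
print(length(soil_cols))
```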

An MLP in Keras requires categorical data to be in binary (one-hot) matrix format. However, tree-based algorithms such as decision trees and gradient boosting machines can work with categorical data as is.<br>
The code below converts the binary indicator columns back into single categorical columns. This reduces the number of columns from 55 to 13 (12 features plus the response `CoverType`).

In [None]:
ncol(d1)

d1$WildernessArea = apply(d1[,'WildernessArea'+(1:4)], 1, function(x) sum(x*(1:4)))
d1$SoilType = apply(d1[,'SoilType'+(1:40)], 1, function(x) sum(x*(1:40)))
d1 = d1[,-(11:54)]

ncol(d1)
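To see why multiplying by the column positions and summing recovers the category, here is a toy sketch on a hypothetical three-column indicator block (the `Area1` to `Area3` names are made up for illustration). It also shows an equivalent matrix product, which avoids a row-by-row `apply`.

```r
# Toy one-hot block: each row has exactly one 1 across the three indicator columns
toy = data.frame(
    Area1 = c(1, 0, 0, 0),
    Area2 = c(0, 0, 1, 0),
    Area3 = c(0, 1, 0, 1)
)

# Same idea as the cell above: weight each indicator by its column position and sum
by_apply = apply(toy, 1, function(x) sum(x * (1:3)))

# Equivalent vectorised form: a single matrix-vector product
by_matmul = as.vector(as.matrix(toy) %*% (1:3))

print(by_apply)
print(by_matmul)
```

Both forms assume every row contains exactly one 1; a row of all zeros would map to category 0.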

**Note:** The type of the response variable determines the type of prediction: if the response is a factor or string, H2O performs classification; if it is numeric, H2O performs regression.

In [None]:
d1$CoverType = as.factor(d1$CoverType)
d1$WildernessArea = as.factor(d1$WildernessArea)
d1$SoilType = as.factor(d1$SoilType)

In [None]:
str(d1)
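The effect of the response type can be checked directly. The sketch below is an illustration on the built-in `iris` data (an assumption made to keep it self-contained, not the covertype data): the same function returns a classification model for a factor response and a regression model for a numeric one.

```r
library(h2o)
h2o.init()

iris_h2o = as.h2o(iris)   # Species is a factor, Sepal.Length is numeric
predictors = c("Sepal.Width", "Petal.Length", "Petal.Width")

# Factor response: H2O fits a classifier
clf = h2o.randomForest(x = predictors, y = "Species",
                       training_frame = iris_h2o, ntrees = 5, seed = 1)

# Numeric response: H2O fits a regressor
reg = h2o.randomForest(x = predictors, y = "Sepal.Length",
                       training_frame = iris_h2o, ntrees = 5, seed = 1)

print(class(clf))
print(class(reg))
```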

#### Import R object to the H2O cluster
The R data frame `d1` can be imported into H2O using the `as.h2o` function. It takes the data frame as its first argument and the name to give the frame inside H2O as its second argument, and returns a handle to the H2O object. The handle can then be used from R to run machine learning algorithms in H2O.

In [None]:
test_index   = sample(x=1:nrow(d1), size=round(0.2*nrow(d1)), replace=FALSE)
test_handle  = as.h2o(d1[ test_index,], "test_data") 
train_handle = as.h2o(d1[-test_index,], "train_data") 
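Because `sample` is called without a seed, the cell above draws a different test set on every run. A minimal sketch of a repeatable 80/20 split follows; the seed value and the row count `n` are assumptions used only for illustration.

```r
set.seed(1)  # assumed seed value; any fixed integer makes the split repeatable

n = 1000                                   # stand-in for nrow(d1)
test_index = sample(x = 1:n, size = round(0.2 * n), replace = FALSE)
train_index = setdiff(1:n, test_index)     # the rows that d1[-test_index, ] would keep

print(length(test_index))
print(length(train_index))
print(length(intersect(test_index, train_index)))  # rows shared by both sets
```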

Data imported into H2O can be viewed in the H2O UI under `Data > List All Frames`.<br>
Clicking on an H2O data frame shows its structure.

<img src="img/h2oData.png" width="800"/>

### Building Model
The list of models in H2O can be accessed from the `Model` tab. <br>
Let's build a Random Forest via `Model > Distributed Random Forest`. <br>
It takes a number of parameters, a few of which are shown below. <br>
The train and test datasets created previously are available in the drop-downs for "training_frame" and "validation_frame" respectively. Let's set only the training frame, validation frame, response column, and seed as shown, leave all other parameters at their defaults, and train the model.
<img src="img/DecisionTreePara1.PNG" /> <img src="img/DecisionTreePara2.PNG" /> 
<div align="center">
    .................................................. <br>
    .................................................. <br>
    .................................................. <br>
    .................................................. <br>
</div>

Once the parameters are chosen, select `Build Model`.

Once the job is complete, clicking the `View` button opens the results. The first result shows the model performance as a function of the number of trees, for the chosen hyperparameters.

<img src="img/ModelPerformanceTrees.PNG" width="300"/>

Variable importance measures how much each feature contributes to the prediction. It is useful for identifying the features that are most informative about the response variable.

<img src="img/VariableImportance.PNG" width="600"/>

<img src="img/ConfusionMatrix.PNG" width="700"/>

### Distributed Random Forest using R
The same operation can be carried out from R with the `h2o.randomForest` function.

In [None]:
model = h2o.randomForest(
    x = colnames(train_handle)[colnames(train_handle) != "CoverType"],
    y = "CoverType",
    training_frame = train_handle,
    validation_frame = test_handle,
    seed = 1
)
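The results shown in the UI (performance metrics, confusion matrix, variable importance) can also be retrieved from R. The following is a self-contained sketch that uses the built-in `iris` data as a stand-in for the covertype frames; the names `fit`, `train_h2o`, and `valid_h2o` are assumptions for illustration.

```r
library(h2o)
h2o.init()

# Stand-in data: iris replaces the covertype frames so the sketch runs on its own
set.seed(1)
idx = sample(1:nrow(iris), size = round(0.2 * nrow(iris)))
valid_h2o = as.h2o(iris[idx, ])
train_h2o = as.h2o(iris[-idx, ])

fit = h2o.randomForest(
    x = setdiff(colnames(train_h2o), "Species"),
    y = "Species",
    training_frame = train_h2o,
    validation_frame = valid_h2o,
    ntrees = 20,
    seed = 1
)

# Metrics and confusion matrix computed on the validation frame
print(h2o.performance(fit, valid = TRUE))
print(h2o.confusionMatrix(fit, valid = TRUE))

# Variable importance as a table, one row per feature
vi = as.data.frame(h2o.varimp(fit))
print(vi)

# Predicted class and per-class probabilities for the validation rows
pred = as.data.frame(h2o.predict(fit, valid_h2o))
print(head(pred))
```

With the workshop objects, the same calls would take `model` and `test_handle` in place of `fit` and `valid_h2o`.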

## Auto-Train Hyperparameters
The hyperparameters were left at their defaults for the previous Distributed Random Forest model. Choosing hyperparameters can be tricky and time consuming. H2O provides an `AutoML` option that iterates through a set of hyperparameters and algorithms to identify a well-performing combination. This can be done via `Model > Run AutoML`. <br> 
One of the parameters for AutoML is the set of algorithms to be trained, as shown below. Here only Distributed Random Forest has been selected.
<img src="img/AutoAlgo.PNG" width="700"/>
Once the parameters are chosen, select `Build Models`.

#### Leaderboard
After `AutoML` has finished, the leaderboard shows all the models that have been created along with their performance. Selecting the model at the top shows the results of the best model. Its output is similar to that of the DRF model built previously.

<img src="img/LeadershipBoard.PNG" width="800"/>
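The AutoML run and its leaderboard can also be driven from R with `h2o.automl`. The sketch below is a hedged illustration on the built-in `iris` data rather than the covertype frames; the `max_models` budget is an assumed value chosen to keep the run short, and `include_algos` is assumed to be available in the installed H2O version.

```r
library(h2o)
h2o.init()

train_h2o = as.h2o(iris)   # stand-in for train_handle

aml = h2o.automl(
    y = "Species",
    training_frame = train_h2o,
    include_algos = c("DRF"),   # restrict the search to Distributed Random Forest
    max_models = 3,             # assumed small budget to keep the run short
    seed = 1
)

# Leaderboard: one row per trained model, best model first
lb = as.data.frame(aml@leaderboard)
print(lb)

# The top-ranked model
best = aml@leader
print(best@model_id)
```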

#### Model Hyperparameters
The hyperparameters of every model built using AutoML can be viewed under `Output - Model Summary`.
<img src="img/ModelSummary.PNG" width="300"/>
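The same information is available on a model object in R. As a rough sketch (again on `iris`, with assumed values for `ntrees` and `max_depth`), the slots below hold the parameters a model was trained with and a summary of the fitted forest.

```r
library(h2o)
h2o.init()

fit = h2o.randomForest(y = "Species", training_frame = as.h2o(iris),
                       ntrees = 10, max_depth = 5, seed = 1)

# Parameters explicitly set by the caller
print(fit@parameters)

# Every parameter, including the defaults H2O filled in
all_params = fit@allparameters
print(all_params$ntrees)
print(all_params$max_depth)

# Summary of the fitted forest (number of trees, tree depths, leaves)
print(fit@model$model_summary)
```

For a model listed on an AutoML leaderboard, the model object can be fetched by its id with `h2o.getModel` and inspected the same way.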

### H2O with R
All the operations in H2O can be executed entirely through its R API, and the functionality maps closely onto the UI. <br>
The [R Booklet](https://www.h2o.ai/wp-content/uploads/2018/01/RBooklet.pdf) is a good starting point for working with H2O from R.