<h1 style="display: inline;" >0. Preliminaries (BONUS)</h1>

This section covers the prerequisites for installing and running the following examples from scratch. These notebooks are currently hosted on nodes where this setup has already been done, so you should not need to repeat it:

**Setting up a test environment:**

We set up a test environment using two AWS EC2 c4.x nodes, which gives us the minimal cluster for testing H2O. The following basic packages are needed:

~~~~
$ sudo yum -y update
$ sudo yum -y install java-1.7.0-openjdk*
$ sudo rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo yum -y install gcc
$ sudo yum -y install unzip
$ sudo yum -y install wget
~~~~

**Setting up Jupyter notebooks:**

```
$ sudo pip install jupyter   # needs pip; see "Installing Python packages" below
$ jupyter notebook --generate-config
```
Then, in the file `.jupyter/jupyter_notebook_config.py`, set the following to allow remote access of Jupyter:

```
c.NotebookApp.allow_origin = '*'  # allow all origins
c.NotebookApp.ip = '0.0.0.0'  # listen on all IPs
```
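
When several nodes need the same configuration, the two settings can be appended from a script instead of edited by hand. The following is a minimal Python 3 sketch, assuming the config file was generated at its default location; `append_remote_access_settings` is a hypothetical helper name, not part of Jupyter.

```python
from pathlib import Path

# The two remote-access settings described in the text above.
REMOTE_ACCESS_SETTINGS = (
    "c.NotebookApp.allow_origin = '*'  # allow all origins",
    "c.NotebookApp.ip = '0.0.0.0'  # listen on all IPs",
)


def append_remote_access_settings(config_path):
    """Append any settings not already in the file; return the lines added."""
    path = Path(config_path)
    existing = path.read_text() if path.exists() else ""
    missing = [line for line in REMOTE_ACCESS_SETTINGS if line not in existing]
    if missing:
        with path.open("a") as handle:
            handle.write("\n" + "\n".join(missing) + "\n")
    return missing
```

A call such as `append_remote_access_settings(Path.home() / ".jupyter" / "jupyter_notebook_config.py")` would add the lines once; running it again should leave the file unchanged.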

**Installing Python packages:**

```
$ sudo yum -y install python-pip
$ sudo pip install --upgrade pip
$ sudo yum -y install python-devel
$ sudo pip install requests tabulate scikit-learn colorama future
```

**Installing R:**

```
$ sudo yum install R
```
If you encounter dependency problems, you may need to enable additional repositories manually; see <a href="https://superuser.com/questions/841270/installing-r-on-rhel-7">here</a>.

To make the R kernel available to Jupyter, run the following in R:

```
install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'devtools', 'uuid', 'digest'))
devtools::install_github('IRkernel/IRkernel')
IRkernel::installspec()
```

Any R packages that the examples need can be installed the same way, for example:

```
install.packages('GGally')
```

**Installing H2O:**

Follow the instructions <a href="https://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/6/index.html">here</a> for 'Download and Run', 'Install in Python', and 'Install in R.'

To set up a multi-node H2O cluster, create a file 'flatfile.txt' containing the <em>private</em> IP address and port of each node, one per line:

```
172.31.43.155:54321
172.31.47.194:54321
```
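
For clusters with more than a couple of nodes, the flatfile can be generated rather than typed by hand. The following is a minimal Python 3 sketch, assuming every node listens on the same port; `build_flatfile` and `write_flatfile` are hypothetical helper names, not part of H2O.

```python
from pathlib import Path


def build_flatfile(private_ips, port=54321):
    """Return the flatfile contents: one 'ip:port' entry per line."""
    return "".join(f"{ip}:{port}\n" for ip in private_ips)


def write_flatfile(private_ips, path="flatfile.txt", port=54321):
    """Write the flatfile to `path` and return the path written."""
    target = Path(path)
    target.write_text(build_flatfile(private_ips, port))
    return target


# The two private addresses used in the example above.
print(build_flatfile(["172.31.43.155", "172.31.47.194"]), end="")
```

The same file must be present on every node, so `write_flatfile` would be run (or its output copied) on each machine before starting H2O.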

Then start H2O on both nodes using:

```
$ java -jar h2o.jar -flatfile flatfile.txt
```

**Installing Hadoop:**

For the most part I followed the instructions <a href="https://dzone.com/articles/setting-up-multi-node-hadoop-cluster-just-got-easy-2">here</a>.

To transfer files from the local filesystem to HDFS:

```
$ hadoop fs -put file.csv /dest_folder/
```

**Reading and writing from H2O to Hadoop:**

Assuming our H2O nodes are still running, the following should recognize a two-node cluster: 

In [9]:
library(h2o)
h2o.init(nthreads = -1)

 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         14 seconds 690 milliseconds 
    H2O cluster version:        3.14.0.6 
    H2O cluster version age:    9 days  
    H2O cluster name:           ec2-user 
    H2O cluster total nodes:    2 
    H2O cluster total memory:   1.55 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.4.1 (2017-06-30) 
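
The node count can also be checked outside of R by querying a running node over HTTP. The following is a rough Python 3 sketch using only the standard library. It assumes the node exposes a `/3/Cloud` REST endpoint whose JSON includes `cloud_size` and `cloud_healthy` fields; treat those names as assumptions to verify against the REST documentation for your H2O version. `fetch_cloud_status` and `is_expected_cluster` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def fetch_cloud_status(base_url="http://localhost:54321"):
    """GET /3/Cloud from a running H2O node and return the parsed JSON."""
    with urlopen(f"{base_url}/3/Cloud") as response:
        return json.load(response)


def is_expected_cluster(status, expected_nodes=2):
    """True if the status reports the expected node count and is healthy."""
    return (
        status.get("cloud_size") == expected_nodes
        and bool(status.get("cloud_healthy"))
    )
```

With both nodes up, `is_expected_cluster(fetch_cloud_status())` would be expected to return `True`, mirroring the "total nodes: 2" and "healthy: TRUE" lines in the output above.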



Here we import the iris data from HDFS, convert its four measurements (lengths and widths) from cm to inches, and export the result back to HDFS:

In [2]:
iris_hex = h2o.importFile("hdfs://ec2-34-204-73-232.compute-1.amazonaws.com:9000/iris.csv")




In [3]:
iris_hex['sepal_length'] = iris_hex['sepal_length'] / 2.54
iris_hex['petal_length'] = iris_hex['petal_length'] / 2.54
iris_hex['sepal_width'] = iris_hex['sepal_width'] / 2.54
iris_hex['petal_width'] = iris_hex['petal_width'] / 2.54
iris_hex

  sepal_length sepal_width petal_length petal_width species
1     2.007874    1.377953    0.5511811  0.07874016  setosa
2     1.929134    1.181102    0.5511811  0.07874016  setosa
3     1.850394    1.259843    0.5118110  0.07874016  setosa
4     1.811024    1.220472    0.5905512  0.07874016  setosa
5     1.968504    1.417323    0.5511811  0.07874016  setosa
6     2.125984    1.535433    0.6692913  0.15748031  setosa

[150 rows x 5 columns] 
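
As a sanity check on the conversion, the first row of the output can be reproduced by hand. This short Python 3 sketch assumes that `iris.csv` holds the standard iris data, whose first row is 5.1, 3.5, 1.4, 0.2 cm; `cm_to_inches` is a hypothetical helper name.

```python
CM_PER_INCH = 2.54  # exact by definition of the inch


def cm_to_inches(value_cm):
    """Convert a length in centimetres to inches."""
    return value_cm / CM_PER_INCH


# First row of the standard iris data, in cm (assumed to match iris.csv).
first_row_cm = {
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
}

first_row_in = {name: cm_to_inches(value) for name, value in first_row_cm.items()}
for name, value in first_row_in.items():
    # Seven significant digits, similar to the R printout above.
    print(f"{name}: {value:.7g}")
```

The printed values should agree with row 1 of the H2O frame shown above.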

In [6]:
h2o.exportFile(iris_hex, "hdfs://ec2-34-204-73-232.compute-1.amazonaws.com:9000/iris_inches.csv", force=TRUE)
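
For readers using the Python client instead of R, the same import, convert, and export round trip might look like the sketch below. It relies on `h2o.import_file` and `h2o.export_file` from the H2O Python module. `convert_to_inches` is a hypothetical wrapper, and the `h2o` module is passed in as an argument purely so that the function can be exercised without a live cluster.

```python
CM_PER_INCH = 2.54
MEASUREMENT_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def convert_to_inches(h2o, source_uri, dest_uri, columns=MEASUREMENT_COLUMNS):
    """Import a frame, divide the given columns by 2.54, and export it.

    `h2o` is the imported h2o module (or any object offering import_file
    and export_file with the same call shape).
    """
    frame = h2o.import_file(source_uri)
    for column in columns:
        frame[column] = frame[column] / CM_PER_INCH
    h2o.export_file(frame, dest_uri, force=True)  # overwrite if it exists
    return frame
```

After `import h2o` and `h2o.init()`, it would be called as `convert_to_inches(h2o, "hdfs://ec2-34-204-73-232.compute-1.amazonaws.com:9000/iris.csv", "hdfs://ec2-34-204-73-232.compute-1.amazonaws.com:9000/iris_inches.csv")`.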

