# Install H2O

In [1]:
! apt-get install default-jre
! java -version
! pip install h2o

Reading package lists... Done
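H2O's server is started from `h2o.jar`, so a Java runtime has to be on the `PATH` before `h2o.init()` can launch it. Here is a minimal sketch, using only the Python standard library, of checking for one from inside the notebook. The helper name `java_version_banner` is hypothetical and is not part of H2O.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the `java -version` banner, or None if no `java` is on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run(
        [java, "-version"], capture_output=True, text=True, timeout=30
    )
    return (result.stderr or result.stdout).strip() or None


banner = java_version_banner()
print(banner if banner else "No Java runtime found on PATH")
```

If the message about a missing runtime is printed, install Java (for example with the `apt-get` command above) before continuing.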

# H2O AutoML Regression Demo

This is a [Jupyter](https://jupyter.org/) Notebook. When you execute code within the notebook, the results appear beneath the code. To execute a code chunk, place your cursor on the cell and press *Shift+Enter*. 

### Start H2O

Import the **h2o** Python module and `H2OAutoML` class and initialize a local H2O cluster.

In [2]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_171"; OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.17.10.1-b11); OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
  Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp_79z602w
  JVM stdout: /tmp/tmp_79z602w/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmp_79z602w/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


| Property | Value |
|---|---|
| H2O cluster uptime: | 01 secs |
| H2O cluster timezone: | Etc/UTC |
| H2O data parsing timezone: | UTC |
| H2O cluster version: | 3.20.0.8 |
| H2O cluster version age: | 5 days |
| H2O cluster name: | H2O_from_python_unknownUser_dje8wx |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 2.827 Gb |
| H2O cluster total cores: | 2 |
| H2O cluster allowed cores: | 2 |
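The cluster above was started with default settings. If you want to control how much memory and how many cores the local cluster may use, `h2o.init()` accepts `max_mem_size` and `nthreads` arguments. The following is a sketch only: the wrapper name `start_h2o` and the `"2G"` value are assumptions chosen for illustration.

```python
def start_h2o(max_mem_size="2G", nthreads=-1):
    """Start (or connect to) a local H2O cluster with explicit resource limits."""
    # Imported inside the function so that defining it does not require h2o.
    import h2o

    # max_mem_size caps the memory of a newly started server;
    # nthreads=-1 asks H2O to use all available cores.
    h2o.init(max_mem_size=max_mem_size, nthreads=nthreads)
    return h2o.cluster()


# Example usage (requires h2o and Java to be installed):
# cluster = start_h2o(max_mem_size="2G")
# cluster.show_status()
```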


### Load Data

For the AutoML regression demo, we use the [Combined Cycle Power Plant](http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant) dataset.  The goal here is to predict the energy output (in megawatts), given the temperature, ambient pressure, relative humidity and exhaust vacuum values.  In this demo, you will use H2O's AutoML to outperform the [state of the art results](https://www.sciencedirect.com/science/article/pii/S0142061514000908) on this task.

In [4]:
data_path = "https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/powerplant_output.csv"


# Load data into H2O
df = h2o.import_file(data_path)

Parse progress: |█████████████████████████████████████████████████████████| 100%


Let's take a look at the data.

In [5]:
df.describe()

Rows:9568
Cols:5




| | TemperatureCelcius | ExhaustVacuumHg | AmbientPressureMillibar | RelativeHumidity | HourlyEnergyOutputMW |
|---|---|---|---|---|---|
| type | real | real | real | real | real |
| mins | 1.81 | 25.36 | 992.89 | 25.56 | 420.26 |
| mean | 19.651231187290957 | 54.3058037207358 | 1013.2590781772578 | 73.30897784280936 | 454.36500940635455 |
| maxs | 37.11 | 81.56 | 1033.3 | 100.16 | 495.76 |
| sigma | 7.452473229611082 | 12.707892998326807 | 5.93878370581162 | 14.600268756728957 | 17.066994999803423 |
| zeros | 0 | 0 | 0 | 0 | 0 |
| missing | 0 | 0 | 0 | 0 | 0 |
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |


Next, let's identify the response column and save the column name as `y`.  In this dataset, we will use all columns except the response as predictors, so we can skip setting the `x` argument explicitly.

In [0]:
y = "HourlyEnergyOutputMW"
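If you did want to set `x` explicitly, for example to drop a column from the predictors, the list can be built from the column names. The sketch below uses a plain Python list copied from the `df.describe()` output as a stand-in; with a live H2O frame the same list would come from `df.columns`.

```python
# Column names as shown by df.describe(); a stand-in for `df.columns`.
columns = [
    "TemperatureCelcius",
    "ExhaustVacuumHg",
    "AmbientPressureMillibar",
    "RelativeHumidity",
    "HourlyEnergyOutputMW",
]

y = "HourlyEnergyOutputMW"

# Every column except the response becomes a predictor.
x = [name for name in columns if name != y]
print(x)
```

The resulting list could then be passed to AutoML as `aml.train(x = x, y = y, training_frame = train)`.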

Lastly, let's split the data into two frames: a `train` frame (80%) and a `test` frame (20%).  The `test` frame will be used to score the leaderboard and to demonstrate how to generate predictions using an AutoML leader model.

In [0]:
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
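H2O's documentation describes `split_frame` as a probabilistic split rather than an exact one, so the two frames will be close to, but not exactly, 80% and 20% of the 9568 rows. The sketch below illustrates the idea with the standard library only. It is not H2O's implementation, and it will not select the same rows as `seed = 1` does in H2O; the function name `probabilistic_split` is hypothetical.

```python
import random


def probabilistic_split(n_rows, train_ratio=0.8, seed=1):
    """Assign each row index to train or test with an independent random draw."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < train_ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx


train_idx, test_idx = probabilistic_split(9568)
# The observed train fraction is near 0.8 but generally not exactly 0.8.
print(len(train_idx), len(test_idx), len(train_idx) / 9568)
```

To see the actual sizes in the notebook, check `train.nrow` and `test.nrow` after the split.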

## Run AutoML 

Run AutoML, stopping after 60 seconds.  The `max_runtime_secs` argument limits the AutoML run by time.  With a time-limited stopping criterion, the number of models trained will vary between runs: on different hardware, or on the same machine with different compute resources available, AutoML may be able to train more models in one run than in another.

The `test` frame is passed explicitly to the `leaderboard_frame` argument here, which means that instead of using cross-validated metrics, we use test set metrics for generating the leaderboard.

In [8]:
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "powerplant_lb_frame")
aml.train(y = y, training_frame = train, leaderboard_frame = test)

AutoML progress: |████████████████████████████████████████████████████████| 100%


*Note: If you see the following error, it means that you need to install the pandas module.*
```
H2OTypeError: Argument `python_obj` should be a None | list | tuple | dict | numpy.ndarray | pandas.DataFrame | scipy.sparse.issparse, got H2OTwoDimTable 
``` 
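The run above stops on a time limit, so the set of models it trains depends on the machine. `H2OAutoML` also accepts a `max_models` argument, which stops the run after a fixed number of models instead. The following is a sketch under assumptions: the wrapper name `run_automl_with_model_cap`, the value `10`, and the project name `"powerplant_max_models"` are made up for illustration.

```python
def run_automl_with_model_cap(train, y, max_models=10, seed=1):
    """Run AutoML with a model-count limit instead of a time limit."""
    # Imported inside the function so that defining it does not require h2o.
    from h2o.automl import H2OAutoML

    aml = H2OAutoML(
        max_models=max_models,
        seed=seed,
        project_name="powerplant_max_models",  # hypothetical project name
    )
    aml.train(y=y, training_frame=train)
    return aml


# Example usage (requires a running H2O cluster and the `train` frame):
# aml_capped = run_automl_with_model_cap(train, y)
# aml_capped.leaderboard.head()
```

A model-count limit usually takes longer than 60 seconds on small hardware, so it trades a predictable runtime for a more predictable set of models.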

For demonstration purposes, we will also execute a second AutoML run, this time providing the original, full dataset, `df`, without passing a `leaderboard_frame`.  This is a more efficient use of our data, since we can use 100% of the data for training rather than 80% as we did above.  This time the leaderboard will use cross-validated metrics.

*Note: Using an explicit `leaderboard_frame` for scoring may be useful in some cases, which is why the option is available.*  

In [9]:
aml2 = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "powerplant_full_data")
aml2.train(y = y, training_frame = df)

AutoML progress: |████████████████████████████████████████████████████████| 100%


*Note: We specify a `project_name` here for clarity.*

## Leaderboard

Next, we will view the AutoML Leaderboard.  Because we passed a `leaderboard_frame` to `H2OAutoML.train()` in the first run, that leaderboard scores and ranks the models by their performance on the `test` frame.

After viewing the `"powerplant_lb_frame"` AutoML project leaderboard, we compare it to the leaderboard for the `"powerplant_full_data"` project.  The metrics are better when the full dataset is used for training, though the comparison is indirect: the first leaderboard reports test set metrics and the second reports cross-validated metrics.

A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally and the leaderboard will be sorted by that metric.  In the case of regression, the default ranking metric is mean residual deviance.  In the future, the user will be able to specify any of the H2O metrics so that different metrics can be used to generate rankings on the leaderboard.

In [10]:
aml.leaderboard.head()

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| GBM_grid_0_AutoML_20180927_022342_model_3 | 11.6763 | 3.41706 | 11.6763 | 2.42422 | 0.00751043 |
| StackedEnsemble_AllModels_0_AutoML_20180927_022342 | 11.7943 | 3.43428 | 11.7943 | 2.43361 | 0.00755406 |
| StackedEnsemble_BestOfFamily_0_AutoML_20180927_022342 | 11.8503 | 3.44243 | 11.8503 | 2.43643 | 0.00757504 |
| GBM_grid_0_AutoML_20180927_022342_model_2 | 11.855 | 3.44311 | 11.855 | 2.43869 | 0.00757132 |
| GBM_grid_0_AutoML_20180927_022342_model_1 | 12.2221 | 3.49601 | 12.2221 | 2.50889 | 0.00768751 |
| GBM_grid_0_AutoML_20180927_022342_model_0 | 12.6301 | 3.55388 | 12.6301 | 2.53241 | 0.00780808 |
| DRF_0_AutoML_20180927_022342 | 13.5288 | 3.67816 | 13.5288 | 2.62764 | 0.00810127 |
| XRT_0_AutoML_20180927_022342 | 13.6271 | 3.69149 | 13.6271 | 2.62194 | 0.00812536 |
| GBM_grid_0_AutoML_20180927_022342_model_4 | 14.016 | 3.74379 | 14.016 | 2.72194 | 0.00822983 |
| DeepLearning_0_AutoML_20180927_022342 | 20.6733 | 4.54679 | 20.6733 | 3.46012 | 0.0100312 |
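The leaderboard columns are related to one another: `rmse` is the square root of `mse`, and in every row `mean_residual_deviance` equals `mse`, which is what we would expect when the deviance is the squared error. The sketch below computes these metrics in plain Python using their standard definitions. It is not H2O's code, the function name `regression_metrics` is hypothetical, and the predictions are made up; only the three actual values come from the `df.describe()` output.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE and RMSLE from two equal-length sequences."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # RMSLE compares log(1 + value), so it measures relative error.
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    rmsle = math.sqrt(sum(squared_log_errors) / n)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "rmsle": rmsle}


# First three responses from the data preview; predictions are illustrative.
actual = [463.26, 444.37, 488.56]
predicted = [465.00, 443.00, 485.00]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because the response is in the hundreds of megawatts and the errors are a few megawatts, RMSLE comes out very small, which matches the scale of the `rmsle` column in the leaderboard.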




Now we will view a snapshot of the top models.  Here we should see the two Stacked Ensembles at or near the top of the leaderboard.  Stacked Ensembles can almost always outperform a single model.

In [11]:
aml2.leaderboard.head()

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| StackedEnsemble_AllModels_0_AutoML_20180927_022507 | 10.6686 | 3.26628 | 10.6686 | 2.36449 | 0.00717045 |
| StackedEnsemble_BestOfFamily_0_AutoML_20180927_022507 | 10.8772 | 3.29806 | 10.8772 | 2.38175 | 0.00724116 |
| GBM_grid_0_AutoML_20180927_022507_model_3 | 10.9183 | 3.30429 | 10.9183 | 2.38846 | 0.00725076 |
| GBM_grid_0_AutoML_20180927_022507_model_2 | 10.9535 | 3.3096 | 10.9535 | 2.4085 | 0.00726313 |
| GBM_grid_0_AutoML_20180927_022507_model_0 | 11.0288 | 3.32096 | 11.0288 | 2.42324 | 0.00728685 |
| GBM_grid_0_AutoML_20180927_022507_model_1 | 11.3415 | 3.36772 | 11.3415 | 2.46744 | 0.00739167 |
| DRF_0_AutoML_20180927_022507 | 12.3201 | 3.51 | 12.3201 | 2.56916 | 0.00770918 |
| XRT_0_AutoML_20180927_022507 | 12.3938 | 3.52049 | 12.3938 | 2.5644 | 0.00772779 |
| GBM_grid_0_AutoML_20180927_022507_model_4 | 12.9893 | 3.60407 | 12.9893 | 2.67657 | 0.00790625 |
| GLM_grid_0_AutoML_20180927_022507_model_0 | 20.8802 | 4.56949 | 20.8802 | 3.64814 | 0.0100502 |




This dataset comes from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant).  The data was used in a [publication](https://www.sciencedirect.com/science/article/pii/S0142061514000908) in the *International Journal of Electrical Power & Energy Systems* in 2014.  In the paper, the authors achieved a mean absolute error (MAE) of 2.818 and a root mean squared error (RMSE) of 3.787 with their best model.  The top models on both leaderboards above have a lower MAE and RMSE than that, so with H2O's AutoML, we've already beaten the state of the art in just 60 seconds of compute time!
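To put the comparison in numbers, the sketch below computes how much lower the test set metrics of the `"powerplant_lb_frame"` leader are than the published values. All four numbers are copied from this notebook; the helper name `relative_improvement` is hypothetical. Keep in mind that the paper's evaluation protocol may differ from the 80/20 split used here, so treat the percentages as indicative.

```python
# Best published results quoted above.
published = {"mae": 2.818, "rmse": 3.787}

# Test set metrics of the leader on the "powerplant_lb_frame" leaderboard.
automl_test = {"mae": 2.42422, "rmse": 3.41706}


def relative_improvement(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline


improvement = {
    metric: relative_improvement(published[metric], automl_test[metric])
    for metric in published
}

for metric, pct in improvement.items():
    print(f"{metric}: {pct:.1f}% lower than the published value")
```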

## Predict Using Leader Model

If you need to generate predictions on a test set, you can call `predict()` on the `H2OAutoML` object directly, or on the leader model object (`aml.leader`).

In [12]:
pred = aml.predict(test)
pred.head()

gbm prediction progress: |████████████████████████████████████████████████| 100%


predict
485.036
473.125
466.435
452.65
446.836
469.328
443.004
462.974
442.507
434.325
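The cell above predicts through the `H2OAutoML` object. The other route, predicting through the leader model object, can be sketched as follows. The wrapper name `predict_with_leader` is hypothetical; `aml.leader` and `predict()` are the same attribute and method used elsewhere in this notebook.

```python
def predict_with_leader(aml, frame):
    """Score `frame` with the top-ranked model of a trained H2OAutoML object."""
    # `aml.leader` is the model at the top of the leaderboard, so this uses
    # the same model that `aml.predict(frame)` delegates to.
    return aml.leader.predict(frame)


# Example usage (requires the trained `aml` object and the `test` frame):
# pred = predict_with_leader(aml, test)
# pred.head()
```

Working with the leader directly is convenient when you also want to inspect or save that specific model.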




If needed, the standard `model_performance()` method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.

In [13]:
perf = aml.leader.model_performance(test)
perf


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 11.676303771453584
RMSE: 3.417060691801301
MAE: 2.424221360046135
RMSLE: 0.007510425406461126
Mean Residual Deviance: 11.676303771453584
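The printed report is handy for reading, but the individual values can also be pulled out of the performance object through accessor methods such as `rmse()` and `mae()`. Here is a small sketch that collects them into a dictionary; the function name `summarize_performance` is hypothetical.

```python
def summarize_performance(perf):
    """Collect the regression metrics of an H2O performance object in a dict."""
    return {
        "mse": perf.mse(),
        "rmse": perf.rmse(),
        "mae": perf.mae(),
        "rmsle": perf.rmsle(),
        "mean_residual_deviance": perf.mean_residual_deviance(),
    }


# Example usage (requires the `perf` object created above):
# summary = summarize_performance(perf)
# print(summary["rmse"], summary["mae"])
```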


