
#H2O AutoML Regression Demo
##This is a Jupyter Notebook. When you execute code within the notebook, the results appear beneath the code. To execute a code chunk, place your cursor on the cell and press Shift+Enter.

#Start H2O
##Import the h2o Python module and H2OAutoML class and initialize a local H2O cluster.

In [12]:
!pip install h2o
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O_cluster_uptime:,1 hour 57 mins
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.0.2
H2O_cluster_version_age:,1 month and 9 days
H2O_cluster_name:,H2O_from_python_unknownUser_vnrbfl
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.161 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2



#Load Data
##For the AutoML regression demo, we use the Combined Cycle Power Plant dataset. The goal is to predict the energy output (in megawatts) given the temperature, ambient pressure, relative humidity, and exhaust vacuum values. In this demo, you will use H2O's AutoML to outperform the state-of-the-art results on this task.

In [13]:
# Use local data file or download from GitHub
import os
docker_data_path = "/home/h2o/data/automl/powerplant_output.csv"
if os.path.isfile(docker_data_path):
    data_path = docker_data_path
else:
    data_path = "https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/powerplant_output.csv"


# Load data into H2O
df = h2o.import_file(data_path)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [14]:
df.head()

TemperatureCelcius,ExhaustVacuumHg,AmbientPressureMillibar,RelativeHumidity,HourlyEnergyOutputMW
14.96,41.76,1024.07,73.17,463.26
25.18,62.96,1020.04,59.08,444.37
5.11,39.4,1012.16,92.14,488.56
20.86,57.32,1010.24,76.64,446.48
10.82,37.5,1009.23,96.62,473.9
26.27,59.44,1012.23,58.77,443.67
15.89,43.96,1014.02,75.24,467.35
9.48,44.71,1019.12,66.43,478.42
14.64,45.0,1021.78,41.25,475.98
11.74,43.56,1015.14,70.72,477.5




Next, let's identify the response column and save the column name as y. In this dataset, we will use all columns except the response as predictors, so we can skip setting the x argument explicitly.




In [15]:
y = "HourlyEnergyOutputMW"
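
The rule described above (every column except the response is a predictor) can be sketched in plain Python. The column names are copied from the df.head() output; on a live cluster, df.columns would supply the same list, and the resulting x could be passed as aml.train(x = x, y = y, ...). This only illustrates the default behavior and is not a step the demo requires.

```python
# Column names as shown in the df.head() output above
columns = [
    "TemperatureCelcius",
    "ExhaustVacuumHg",
    "AmbientPressureMillibar",
    "RelativeHumidity",
    "HourlyEnergyOutputMW",
]
y = "HourlyEnergyOutputMW"

# Predictors: every column that is not the response
x = [name for name in columns if name != y]
print(x)
```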


Lastly, let's split the data into two frames: a training frame (roughly 80%) and a test frame (roughly 20%). The test frame will be used to score the leaderboard and to demonstrate how to generate predictions using an AutoML leader model.



In [16]:

splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
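
As a rough illustration of what a ratio-based split does, the sketch below assigns each row to train or test by drawing one uniform random number per row. It is plain Python, not H2O's implementation; the helper name split_rows and the row count of 10000 are assumptions made for the example. Because assignment is random per row, the train share lands close to, but usually not exactly on, 80%; split_frame() is similar in that its splits are approximate.

```python
import random

def split_rows(n_rows, ratio, seed):
    """Assign row indices to train/test with one uniform draw per row."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx

train_idx, test_idx = split_rows(n_rows=10000, ratio=0.8, seed=1)
train_share = len(train_idx) / 10000
print(f"train share: {train_share:.3f}")
```

Fixing the seed makes the assignment reproducible, which is the role seed = 1 plays in the cell above.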

#Run AutoML
##Run AutoML, stopping after 60 seconds. The max_runtime_secs argument limits the AutoML run by time. With a time-limited stopping criterion, the number of models trained will vary between runs: on different hardware, or on the same machine when the available compute resources differ between runs, AutoML may be able to train more models in one run than in another.

##The test frame is passed explicitly to the leaderboard_frame argument here, which means that instead of using cross-validated metrics, we use test set metrics for generating the leaderboard.

In [17]:
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "powerplant_lb_frame")
aml.train(y = y, training_frame = train, leaderboard_frame = test)

H2OResponseError: ignored

##For demonstration purposes, we will also execute a second AutoML run, this time providing the original, full dataset, df (without passing a leaderboard_frame). This is a more efficient use of our data, since we can use 100% of the data for training rather than 80% as we did above. This time our leaderboard will use cross-validated metrics.

##Note: Using an explicit leaderboard_frame for scoring may be useful in some cases, which is why the option is available.

In [None]:

aml2 = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "powerplant_full_data")
aml2.train(y = y, training_frame = df)
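
##Cross-validated metrics are computed by holding out each fold in turn and scoring the model trained on the remaining folds. The plain-Python sketch below shows the bookkeeping only. The helper name kfold_indices, the fold count of 5, and the modulo assignment are assumptions for illustration; the folds AutoML actually builds depend on its settings.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, holdout_idx) pairs, one per fold."""
    for fold in range(n_folds):
        # Row i belongs to the holdout of fold (i mod n_folds)
        holdout = [i for i in range(n_rows) if i % n_folds == fold]
        train = [i for i in range(n_rows) if i % n_folds != fold]
        yield train, holdout

folds = list(kfold_indices(n_rows=10, n_folds=5))
for train_idx, holdout_idx in folds:
    print("holdout rows:", holdout_idx)
```

Every row is held out exactly once, so every row contributes to both training and scoring without a separate test frame.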

#Leaderboard
##Next, we will view the AutoML Leaderboard. Since we specified a leaderboard_frame in the H2OAutoML.train() method for scoring and ranking the models, the AutoML leaderboard uses the performance on this data to rank the models.

##After viewing the "powerplant_lb_frame" AutoML project leaderboard, we compare that to the leaderboard for the "powerplant_full_data" project. We can see that the results are better when the full dataset is used for training.

##A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally, and the leaderboard is sorted by that metric. In the case of regression, the default ranking metric is mean residual deviance. To rank the leaderboard by a different metric, set the sort_metric argument of H2OAutoML.

In [None]:
aml.leaderboard.head()
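
##To make the ranking rule concrete, here is a small plain-Python sketch that sorts models by mean residual deviance, which for a Gaussian (squared-error) response is the mean of the squared residuals. The actual values are the first five responses from the df.head() output; the model names and their predictions are made up for illustration and are not results from this run.

```python
actual = [463.26, 444.37, 488.56, 446.48, 473.9]

# Hypothetical predictions from two hypothetical models (not real output)
predictions = {
    "model_a": [462.0, 445.0, 487.0, 447.0, 474.0],
    "model_b": [460.0, 448.0, 485.0, 450.0, 470.0],
}

def mean_residual_deviance(y_true, y_pred):
    """Mean squared residual (Gaussian deviance divided by row count)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

scores = {name: mean_residual_deviance(actual, p) for name, p in predictions.items()}

# Lower deviance is better, so sort ascending
ranking = sorted(scores, key=scores.get)
for name in ranking:
    print(name, round(scores[name], 4))
```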


##Now we will view a snapshot of the top models. Here we should see the two Stacked Ensembles at or near the top of the leaderboard. Stacked Ensembles can almost always outperform a single model.


##This dataset comes from the UCI Machine Learning Repository. The data was used in a publication in the International Journal of Electrical Power & Energy Systems in 2014. In the paper, the authors achieved a mean absolute error (MAE) of 2.818 and a root mean squared error (RMSE) of 3.787 with their best model. So, with H2O's AutoML, we've already beaten the state of the art in just 60 seconds of compute time!
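
##For reference, the two metrics quoted from the paper can be computed as shown below. This is a plain-Python sketch with toy numbers, intended only to show the definitions; an H2O regression performance object reports the same quantities through its mae() and rmse() methods.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy values chosen for easy arithmetic (not power plant data)
y_true = [1.0, 2.0, 3.0]
y_pred = [2.0, 2.0, 5.0]
print(mae(y_true, y_pred))
print(rmse(y_true, y_pred))
```

##RMSE is never smaller than MAE on the same predictions because squaring weights large residuals more heavily, which is consistent with the paper's 3.787 versus 2.818.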

#Predict Using Leader Model
##If you need to generate predictions on a test set, you can make predictions on the "H2OAutoML" object directly, or on the leader model object.



In [None]:

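# Predict with the AutoML object, which delegates to the leader model
pred = aml.predict(test)
# Alternatively, predict on the leader model object directly:
# pred = aml.leader.predict(test)
pred.head()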


##If needed, the standard model_performance() method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.

In [None]:

perf = aml.leader.model_performance(test)
perf