# AutoML

In this notebook, we use Sparkling Water's AutoML to search for the best-performing model for the [Power Emission](https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant) dataset by automatically training and comparing several types of models.

## Library Isolation

We can use [library utilities](https://docs.azuredatabricks.net/user-guide/dev-tools/dbutils.html#dbutils-library) to install Python libraries and create an environment scoped to a notebook session.

Required libraries (PyPI):
* colorama==0.3.8
* h2o-pysparkling-2.4

In [3]:
%python
dbutils.library.installPyPI("colorama", "0.3.8")
dbutils.library.installPyPI("h2o-pysparkling-2.4")

## Import Data

In [5]:
# File location and type
file_location = "/FileStore/tables/powerplant.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ";"

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location) \
  .na.drop()

display(df.take(3))

time,AT,V,AP,RH,PE
2017-01-01T00:00:00.000+0000,14.96,41.76,1024.07,73.17,463.26
2017-01-01T01:00:00.000+0000,25.18,62.96,1020.04,59.08,444.37
2017-01-01T02:00:00.000+0000,5.11,39.4,1012.16,92.14,488.56


## Train-Test Split

To compare apples to apples, we perform the same train-test split (80/20, seed 42) before getting started with H2O.

In [7]:
(trainDF, testDF) = df.randomSplit([.8, .2], seed=42)
display(trainDF.take(2))

time,AT,V,AP,RH,PE
2017-01-01T00:00:00.000+0000,14.96,41.76,1024.07,73.17,463.26
2017-01-01T04:00:00.000+0000,10.82,37.5,1009.23,96.62,473.9
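The role of the seed in the split above can be illustrated with a minimal plain-Python sketch. It is not Spark's implementation: the helper `random_split` and the toy `rows` list are hypothetical, and the sketch only assumes that each row is assigned to a split by an independent random draw, which is why the resulting sizes are approximately, not exactly, 80/20.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split using one uniform draw per row (illustrative only)."""
    total = sum(weights)
    # Cumulative boundaries of the normalized weights, e.g. [0.8, 1.0]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)  # a fixed seed makes the assignment repeatable
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any floating-point remainder
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))  # stand-in for the rows of a DataFrame
train_a, test_a = random_split(rows, [0.8, 0.2], seed=42)
train_b, test_b = random_split(rows, [0.8, 0.2], seed=42)

print(len(train_a), len(test_a))  # roughly 800 / 200, not exactly
print(train_a == train_b)         # same seed, same split
```

Reusing the same seed on the same input is what keeps the comparison between models fair: every model sees identical training and test rows.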


## Setting up H2O

[H2O documentation in Azure Databricks](https://docs.azuredatabricks.net/spark/latest/mllib/third-party-libraries.html#h2o-sparkling-water).

In [9]:
from pysparkling import *
from pyspark.sql import SparkSession
import h2o
from h2o.automl import H2OAutoML

# Set up the H2O configuration
spark = SparkSession.builder.appName("SparklingWaterApp").getOrCreate()
h2oConf = H2OConf(spark).set("spark.ui.enabled", "false")
hc = H2OContext.getOrCreate(spark, conf=h2oConf)

## Convert DF to H2O Frame

In [11]:
trainH2O = hc.as_h2o_frame(trainDF)
testH2O = hc.as_h2o_frame(testDF)

trainH2O[:,0:6].describe()

## AutoML

Now we use AutoML to train the various models. This may take a few minutes, since `max_runtime_secs` is set to 300.

In [13]:
x = trainH2O.columns
y = "PE"
x.remove(y)
# Defaults to 5-fold cross-validation
aml = H2OAutoML(max_runtime_secs = 300, seed = 42, project_name = "Power Emission Prediction", stopping_metric="RMSE", sort_metric="RMSE")
aml.train(x=x, y=y, training_frame=trainH2O)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)
# Note: deep learning model results are not reproducible

## Predict Test Data

In [15]:
import pandas as pd
predictPE = aml.predict(testH2O).as_data_frame()
actual = testH2O['PE'].as_data_frame()
# Place the predictions and the actual PE values side by side
predict_vs_actual = pd.concat([predictPE, actual], axis=1)
predict_vs_actual = spark.createDataFrame(predict_vs_actual)
display(predict_vs_actual.take(10))

predict,PE
445.544677734375,444.37
486.4376525878906,488.56
447.6765747070313,446.48
444.5905456542969,451.28
465.8669128417969,467.54
454.4395751953125,450.69
432.3283081054688,430.12
481.7620239257813,473.62
443.4638366699219,442.99
442.8514709472656,446.22


In [16]:
perf = aml.leader.model_performance(testH2O)
perf
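Since the leaderboard is sorted by RMSE, it can help to see how that metric is computed. The sketch below recomputes it by hand on the ten `predict`/`PE` rows displayed above. It covers only those ten rows, so its value is not the test-set RMSE that `model_performance` reports; the helper name `rmse` is illustrative and not part of H2O.

```python
import math

# (predict, PE) pairs copied from the ten rows displayed above
pairs = [
    (445.544677734375, 444.37),
    (486.4376525878906, 488.56),
    (447.6765747070313, 446.48),
    (444.5905456542969, 451.28),
    (465.8669128417969, 467.54),
    (454.4395751953125, 450.69),
    (432.3283081054688, 430.12),
    (481.7620239257813, 473.62),
    (443.4638366699219, 442.99),
    (442.8514709472656, 446.22),
]


def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    squared_errors = [(predicted - actual) ** 2 for predicted, actual in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


sample_rmse = rmse(pairs)
print(f"RMSE on the 10 displayed rows: {sample_rmse:.3f}")
```

Because the errors are squared before averaging, a single large miss (such as the eighth row) weighs more heavily than several small ones.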

In [17]:
aml.leader

## Save Model

In [19]:
# NOTE: `userhome` is not defined in this notebook; it must be set to a base path beforehand
h2o.save_model(aml.leader, path=userhome + "/machine-learning-p/h2o")
