Note: this example may take a long time to run, and incur significant charges in its use of GPUs, depending upon how its parameters are configured.
The performance of a machine learning model is often crucially dependent on the choice of good hyperparameters. For models of any complexity, relying on trial and error to find good values for these parameters does not scale. This tutorial shows how to use Cloud AI Platform Pipelines in conjunction with Keras Tuner to build a hyperparameter-tuning workflow that uses distributed HP search.
Cloud AI Platform Pipelines, currently in Beta, provides a way to deploy robust, repeatable machine learning pipelines along with monitoring, auditing, version tracking, and reproducibility, and gives you an easy-to-install, secure execution environment for your ML workflows. AI Platform Pipelines is based on Kubeflow Pipelines (KFP) installed on a Google Kubernetes Engine (GKE) cluster, and can run pipelines specified via both the KFP and TFX SDKs. See this blog post for more detail on the Pipelines tech stack. You can create an AI Platform Pipelines installation with just a few clicks. After installing, you access AI Platform Pipelines by visiting the AI Platform Panel in the Cloud Console.
Keras Tuner is an easy-to-use, distributable hyperparameter optimization framework. Keras Tuner makes it easy to define a search space and leverage included algorithms to find the best hyperparameter values. Keras Tuner comes with several search algorithms built-in, and is also designed to be easy for researchers to extend in order to experiment with new search algorithms. It is straightforward to run the tuner in distributed search mode, which we’ll leverage for this example.
The intent of an HP tuning search is typically not to do full training for each parameter combination, but to find the best starting points. The number of epochs run in the HP search trials is typically smaller than the number used in full training. So, an HP tuning-based workflow could include:
- perform a distributed HP tuning search, and obtain the results
- do concurrent model training runs for each of the best N parameter configurations, and export the model for each
- serve the results (often after model evaluation).
As mentioned above, a Cloud AI Platform (KFP) Pipeline runs under the hood on a GKE cluster. This makes it straightforward to implement this workflow, including the distributed HP search and model serving, so that you just need to launch a pipeline job to kick it off. This tutorial shows how to do that. It also shows how to use preemptible GPU-enabled VMs for the HP search to reduce costs, and how to use TF-serving to deploy the trained model(s) on the same cluster for serving. As part of the process, we’ll see how GKE provides a scalable, resilient platform with easily-configured use of accelerators.
The Cloud Public Datasets Program makes available public datasets that are useful for experimenting with machine learning. Just as we did in our “Explaining model predictions on structured data” post, we’ll use data that is essentially a join of two public datasets stored in BigQuery: London Bike rentals and NOAA weather data, with some additional processing to clean up outliers and derive additional GIS and day-of-week fields.
We’ll use this dataset to build a Keras regression model to predict the duration of a bike rental based on information about the start and end stations, the day of the week, the weather on that day, and other data. If we were running a bike rental company, for example, these predictions—and their explanations—could help us anticipate demand and even plan how to stock each location.
We’ll then use the Keras Tuner package to do an HP search using this model.
With the Keras Tuner, you set up an HP tuning search along these lines (the code is from this example; other search algorithms are supported in addition to 'random'):
tuner = RandomSearch(
    create_model,
    objective='val_mae',
    max_trials=args.max_trials,
    distribution_strategy=STRATEGY,
    executions_per_trial=args.executions_per_trial,
    directory=args.tuner_dir,
    project_name=args.tuner_proj
)
In the above, the create_model call takes an argument hp from which you can sample hyperparameters. For this example, we're varying the number of hidden layers, the number of nodes per hidden layer, and the learning rate in the HP search. There are many other hyperparameters that you might also want to vary in your search.
def create_model(hp):
    inputs, sparse, real = bwmodel.get_layers()
    ...
    model = bwmodel.wide_and_deep_classifier(
        inputs,
        linear_feature_columns=sparse.values(),
        dnn_feature_columns=real.values(),
        # Each hp.* call declares one dimension of the search space
        num_hidden_layers=hp.Int('num_hidden_layers', 2, 5),
        dnn_hidden_units1=hp.Int('hidden_size', 32, 256, step=32),
        learning_rate=hp.Choice('learning_rate',
                                values=[1e-1, 1e-2, 1e-3, 1e-4])
    )
    # The model-building function must return the model to the tuner
    return model
Then, call tuner.search(...). See the Keras Tuner docs for more.
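To make the search space concrete, here is a minimal, library-free sketch of what a random search draws for each trial. It mirrors the three ranges declared in create_model above but does not use Keras Tuner itself. The names SEARCH_SPACE and sample_hyperparameters are hypothetical, the upper bounds are treated as inclusive, and Keras Tuner's own sampling may differ in detail.

```python
import random

# Search space mirroring the hp.Int / hp.Choice declarations in create_model.
# Each "int" entry is (kind, min, max, step), with max treated as inclusive.
# (Hypothetical structure for illustration; not Keras Tuner's internal format.)
SEARCH_SPACE = {
    "num_hidden_layers": ("int", 2, 5, 1),
    "hidden_size": ("int", 32, 256, 32),
    "learning_rate": ("choice", [1e-1, 1e-2, 1e-3, 1e-4]),
}


def sample_hyperparameters(space, rng):
    """Draw one hyperparameter combination uniformly at random."""
    sample = {}
    for name, spec in space.items():
        if spec[0] == "int":
            _, low, high, step = spec
            # randrange excludes its stop value, so add 1 to include `high`
            sample[name] = rng.randrange(low, high + 1, step)
        else:
            sample[name] = rng.choice(spec[1])
    return sample


rng = random.Random(0)  # fixed seed so repeated runs draw the same trials
trials = [sample_hyperparameters(SEARCH_SPACE, rng) for _ in range(4)]
for trial in trials:
    print(trial)
```

Each printed dictionary corresponds to one trial's parameter set; in the real workflow, the tuner builds and trains a model for each such set.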
The Keras Tuner supports running a hyperparameter search in distributed mode. Google Kubernetes Engine (GKE) makes it straightforward to configure and run a distributed HP tuning search. GKE is a good fit not only because it lets you easily distribute the HP tuning workload, but because you can leverage autoscaling to boost node pools for a large job, then scale down when the resources are no longer needed. It’s also easy to deploy trained models for serving onto the same GKE cluster, using TF-serving. In addition, the Keras Tuner works well with preemptible VMs, making it even cheaper to run your workloads.
With the Keras Tuner’s distributed config, you specify one node as the ‘chief’, which coordinates the search, and ‘tuner’ nodes that do the actual work of running model training jobs using a given param set (the trials). When you set up an HP search, you indicate the max number of trials to run, and how many ‘executions’ to run per trial. The Kubeflow pipeline allows dynamic specification of the number of tuners to use for a given HP search— this determines how many trials you can run concurrently— as well as the max number of trials and number of executions.
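Keras Tuner's distributed mode is configured through three environment variables: KERASTUNER_TUNER_ID (set to "chief" for the chief and to a unique ID for each worker), KERASTUNER_ORACLE_IP, and KERASTUNER_ORACLE_PORT. The sketch below shows how the per-process settings might be generated for one chief and a variable number of tuners. The helper name tuner_env, the host name ktuner-chief, and port 8000 are assumptions for illustration; they are not taken from this example's yaml templates.

```python
def tuner_env(role_index, oracle_host, oracle_port=8000):
    """Return the Keras Tuner distributed-mode environment for one process.

    role_index=None builds the chief's environment; an integer builds the
    environment for that numbered tuner worker.
    """
    tuner_id = "chief" if role_index is None else f"tuner{role_index}"
    return {
        "KERASTUNER_TUNER_ID": tuner_id,
        # Every process points at the chief, which hosts the oracle
        "KERASTUNER_ORACLE_IP": oracle_host,
        "KERASTUNER_ORACLE_PORT": str(oracle_port),
    }


num_tuners = 3  # how many trials can run concurrently
envs = [tuner_env(None, "ktuner-chief")]
envs += [tuner_env(i, "ktuner-chief") for i in range(num_tuners)]

for env in envs:
    print(env)
```

Raising num_tuners adds workers without changing the chief's settings, which is what lets the number of tuners be chosen per pipeline run.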
We’ll define the tuner components as Kubernetes jobs, each specified to have 1 replica. This means that if a tuner job pod is terminated for some reason prior to job completion, Kubernetes will start up another replica. Thus, the Keras Tuner’s HP search is a good fit for use of preemptible VMs. Because the HP search bookkeeping— orchestrated by the tuner ‘chief’, via an ‘oracle’ file— tracks the state of the trials, the configuration is robust to a tuner pod terminating unexpectedly— say, due to a preemption— and a new one being restarted. The new job pod will get its instructions from the ‘oracle’ and continue running trials. The example uses GCS for the tuners’ shared file system.
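The reason shared bookkeeping makes preemption harmless can be illustrated with a toy model. The sketch below is not Keras Tuner's actual oracle protocol: the FileOracle class, its file format, and the choice to resume an interrupted trial are all assumptions made for illustration, and a local temp file stands in for GCS. It is single-process and does no locking.

```python
import json
import os
import tempfile


class FileOracle:
    """Toy stand-in for the search bookkeeping: all state lives in one file."""

    def __init__(self, path, max_trials):
        self.path = path
        # Only initialize state if no earlier process has already done so
        if not os.path.exists(path):
            self._save({"max_trials": max_trials, "running": {}, "done": []})

    def _load(self):
        with open(self.path) as f:
            return json.load(f)

    def _save(self, state):
        with open(self.path, "w") as f:
            json.dump(state, f)

    def next_trial(self, worker):
        """Hand out a trial id, or None once the trial budget is used up."""
        state = self._load()
        # A restarted worker first picks up the trial it had been assigned
        if worker in state["running"]:
            return state["running"][worker]
        started = len(state["done"]) + len(state["running"])
        if started >= state["max_trials"]:
            return None
        state["running"][worker] = started
        self._save(state)
        return started

    def finish(self, worker):
        """Record that the worker's current trial has completed."""
        state = self._load()
        state["done"].append(state["running"].pop(worker))
        self._save(state)


path = os.path.join(tempfile.mkdtemp(), "oracle.json")
oracle = FileOracle(path, max_trials=5)
first = oracle.next_trial("tuner0")  # tuner0 starts a trial...

# ...and is preempted. Its replacement reads the same file and carries on.
replacement = FileOracle(path, max_trials=5)
resumed = replacement.next_trial("tuner0")

completed = []
trial = resumed
while trial is not None:
    replacement.finish("tuner0")
    completed.append(trial)
    trial = replacement.next_trial("tuner0")
print(completed)
```

Because nothing about the search lives only in the worker's memory, the replacement pod needs no hand-off from the pod that was terminated.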
Once the HP search has finished, any of the tuners can obtain information on the N best parameter sets (as well as export the best model(s)).
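Keras Tuner exposes the best parameter sets through tuner.get_best_hyperparameters(num_trials=N). Conceptually, this amounts to ranking the trials by the search objective, as in the library-free sketch below. The best_n helper and the record layout are hypothetical, and the val_mae numbers are made-up placeholders, not results from this example.

```python
def best_n(trial_results, n, metric="val_mae"):
    """Return the n trials with the lowest metric value (lower MAE is better)."""
    ranked = sorted(trial_results, key=lambda trial: trial[metric])
    return ranked[:n]


# Placeholder trial records; the metric values are invented for illustration.
trial_results = [
    {"trial_id": "a", "val_mae": 410.0,
     "hps": {"num_hidden_layers": 2, "hidden_size": 64, "learning_rate": 1e-2}},
    {"trial_id": "b", "val_mae": 385.5,
     "hps": {"num_hidden_layers": 3, "hidden_size": 128, "learning_rate": 1e-3}},
    {"trial_id": "c", "val_mae": 402.1,
     "hps": {"num_hidden_layers": 4, "hidden_size": 96, "learning_rate": 1e-3}},
]

top_two = best_n(trial_results, 2)
for trial in top_two:
    print(trial["trial_id"], trial["hps"])
```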
The definition of the pipeline itself is here, specified using the KFP SDK. It’s then compiled to an archive file and uploaded to AI Platform Pipelines. (To compile it yourself, you’ll need to have the KFP SDK installed.) Pipeline steps are container-based, and you can find the Dockerfiles and underlying code for the steps under the components directory.
The example pipeline first runs a distributed HP tuning search using a specified number of tuner workers, then obtains the best N parameter combinations—by default, two. The pipeline step itself does not do the heavy lifting, but rather launches all the tuner jobs on GKE, which run concurrently, and monitors for their completion. (Unsurprisingly, this stage of the pipeline may run for quite a long time, depending upon how many HP search trials were specified and how many tuners you’re using).
Concurrently to the Keras Tuner runs, the pipeline sets up a TensorBoard visualization component, its log directory set to the GCS path under which we’ll run the full training jobs. The output of this step—the log dir info— is consumed by the training step.
The pipeline then runs full training jobs, concurrently, for each of the N best parameter sets (by default, 2). It does this via the KFP loop construct, allowing the pipeline to support dynamic specification of N.
We’ll be able to compare the training jobs to each other using TensorBoard, both while they’re running and after they’ve completed.
Then, the trained models are deployed for serving on the GKE cluster, using TF-serving. Each deployed model has its own cluster service endpoint. (While not shown in this example, one could insert a step for model evaluation before deploying to TF-serving.)
For example, here is the DAG for a pipeline execution that did training and then deployed prediction services for the two best parameter configurations.
The DAG for a Keras Tuner pipeline execution. Here the two best parameter configurations were used for full training.
Note: this example may take a long time to run, and incur significant charges in its use of GPUs, depending upon how its parameters are configured.
To run the example, first create a Cloud AI Platform Pipelines installation, as described in the Pipelines documentation. Be sure to tick the box that sets up the GKE cluster with full access to GCP APIs. It’s necessary to additionally configure the installation’s underlying GKE cluster as described below, before running the pipeline.
Once the Pipelines installation has finished, grab the credentials for the underlying GKE cluster as follows:
gcloud container clusters get-credentials <cluster-name> --zone <cluster-zone> --project <your-project>
Note that you can reconstruct this command via the “Connect” button in the GKE cluster listing in the Cloud Console. If you don’t have gcloud installed locally, you can run the commands in this section via the Cloud Shell.
Next, give the pipeline-runner service account permissions to launch new Kubernetes resources:
kubectl create clusterrolebinding sa-admin --clusterrole=cluster-admin --serviceaccount=kubeflow:pipeline-runner
Then, apply an Nvidia daemonset that will install Nvidia drivers on any GPU-enabled cluster nodes:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
Next, create GPU node pools. Just for the purposes of this example, we’ll create two: a pool of preemptible nodes with one GPU each (the lighter-weight Keras Tuner jobs can select this pool), and a pool of nodes with two GPUs each, which we’ll use for full training. (The pipeline is defined so that the full training steps are placed on nodes with at least 2 GPUs, though you can change this value and recompile the pipeline if you like.)
We’ll configure both node pools to use autoscaling and to scale down to zero when not in use. This means that when you run the example, you may see pauses while a node pool scales up.
Note: Before you run these commands, you may need to increase the GPU quota for your project.
Create the pool with preemptible nodes (edit this command with your cluster’s config info first):
gcloud container node-pools create preempt-gpu-pool \
    --cluster=<your-cluster> \
    --zone <cluster-zone> \
    --enable-autoscaling --max-nodes=8 --min-nodes=0 \
    --machine-type n1-highmem-8 \
    --preemptible \
    --scopes cloud-platform --verbosity error \
    --accelerator=type=nvidia-tesla-k80,count=1
Create the non-preemptible pool (again, first edit with your info):
gcloud container node-pools create gpu-pool \
    --cluster=<your-cluster> \
    --zone <cluster-zone> \
    --enable-autoscaling --max-nodes=2 --min-nodes=0 \
    --machine-type n1-highmem-8 \
    --scopes cloud-platform --verbosity error \
    --accelerator=type=nvidia-tesla-k80,count=2
(If you have quota for more powerful accelerators, you can optionally specify them instead of the k80s, though of course that will increase the expense.)
Note: There is no reason we couldn’t make all our GPU-enabled node pools preemptible, and run the full training steps on preemptible nodes as well. While for simplicity we’re not doing that as part of this example, see this blog post for information on how to define a KFP pipeline that supports step retries on interruption.
Upload the compiled ‘Keras Tuner’ pipeline to the Kubeflow Pipelines dashboard. You can use this URL: https://storage.googleapis.com/aju-dev-demos-codelabs/KF/compiled_pipelines/bw_ktune.py.tar.gz, or if you’ve checked out or downloaded the repo, you can upload the compiled archive file directly. (To compile the bw_ktune.py pipeline file yourself, you’ll need to have the KFP SDK installed).
For the pipeline parameters, fill in the name of a bucket (under which the HP tuning bookkeeping will be written) as well as a working_dir path (under which the info for the model full training will be written). These don’t need to be the same bucket, but of course both must be accessible to the pipelines installation.
You can adjust the other params if you like. E.g., you may want to lower the max_trials number.
The num_best_hps and num_best_hps_list params specify the N top param sets to use for full training, and must be consistent with each other, with the latter the list of the N first indices. (This redundancy is a bit hacky, but a loop in the pipeline spec makes use of this latter param). If you change these values, recall that the full training jobs are configured to each run on nodes with at least two GPUs, and so your node pool config must reflect this. You can edit this line in the pipeline spec: train.set_gpu_limit(2) and then recompile to change that constraint.
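Since the two parameters must agree, it can help to derive one from the other rather than typing both by hand. The sketch below assumes, based on the description above, that num_best_hps_list is simply the indices 0 through N-1; the helper names are hypothetical, and the exact format the pipeline UI expects for the list is an assumption.

```python
import json


def best_hps_params(num_best_hps):
    """Build a consistent (num_best_hps, num_best_hps_list) pair."""
    if num_best_hps < 1:
        raise ValueError("num_best_hps must be at least 1")
    return {
        "num_best_hps": num_best_hps,
        # The list holds the first N indices: [0, 1, ..., N-1]
        "num_best_hps_list": list(range(num_best_hps)),
    }


def is_consistent(params):
    """Check that the list matches the count it is supposed to mirror."""
    expected = list(range(params["num_best_hps"]))
    return params["num_best_hps_list"] == expected


params = best_hps_params(2)
print(json.dumps(params))
```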
The 'keras tuner' pipeline parameters.
Note: The pipeline specification hardwires the number of executions per trial to 2 (the --executions-per-trial arg). To change that, edit the pipeline definition and recompile it.
Once you launch this pipeline, it may run for quite a long time, depending upon how many trials you’re doing. The first pipeline step (ktune) launches the Keras Tuner workers and chief as Kubernetes jobs, so its logs just show that the tuners have been deployed; the step then waits for them all to finish before proceeding.
You can track the pods via kubectl, e.g.:
kubectl get pods -A --watch=true -o wide
(or via the Cloud Console). You can view the output for each of the tuner pods in their Stackdriver logs (or via kubectl until the node pool is scaled back down).
You may notice GPU preemptions in the pod listing while the tuners are running, since we set up that node pool to be preemptible. The pod status will look like this: OutOfnvidia.com/gpu; then you’ll see a replacement pod start up in its place, since we defined the tuner jobs to require 1 replica each. The new pod will communicate with the HP search oracle to get its instructions, and continue running trials.
After the HP search has completed, the first (ktune) step will obtain the N best parameter sets, and start full training jobs using those parameters.
After a full training step has finished, the exported model will be deployed to the cluster for serving, using TF-serving. (For simplicity, we’re not including model eval in the workflow, but typically you’d want to do that as well. For TF/Keras models, the TFMA library can be useful with such analyses.)
The full training runs use the tf.keras.callbacks.TensorBoard callback, so you can view and compare training info in TensorBoard during or after training.
To do this, we’re using a prebuilt KFP component for TensorBoard visualization. In the pipeline specification, we run this step before we launch the full training job(s), pointing the TB log dir to the parent directory of the training runs. By including this component, we can view the logs during training. (The training component also generates metadata to create a TB visualization—pointing to the same directory— but this viz will not be available until the training step completes. Because it points to the same directory, the visualization generated by the training step is redundant to the TB viz pipeline step in this pipeline).
Start up TensorBoard from the pipeline's Run Output panel.
After the full models have been trained and deployed for serving, you can request predictions from the TF-serving services. For this example, we’re not putting the services behind external IP addresses, so we’ll port-forward to connect to them.
Find the TF-serving service names by running this command (edit the following if you deployed to a different namespace):
kubectl -n default get services -l apptype=tf-serving
By default, you should see two such services per pipeline run. The service names will look something like bikeswxxxxxxxxxx. Port-forward to a service as follows, first editing to use your service name:
kubectl -n default port-forward svc/bikeswxxxxxxxxxx 8500:8500
Then, send the TF-serving service a prediction request, formatted as follows:
curl -d '{"instances": [{"end_station_id": "333", "ts": 1435774380.0, "day_of_week": "4", "start_station_id": "160", "euclidean": 4295.88, "loc_cross": "POINT(-0.13 51.51)POINT(-0.19 51.51)", "prcp": 0.0, "max": 94.5, "min": 58.9, "temp": 81.8, "dewp": 59.5 }]}' \
    -X POST http://localhost:8500/v1/models/bikesw:predict
You should get a response that looks something like this, where the prediction value is the rental duration:
{
  "predictions": [[1493.55908]]
}
You can find the KFP component code in the components subdirectory, and the component Dockerfile definitions are under the components/kubeflow-resources/containers subdirectory.
In the bikesw_training directory, deploy_tuner.py implements the KFP component that launches the tuner ‘workers’ and ‘chief’ jobs, using the yaml templates in the same directory.
bw_hptune_standalone.py is executed in the tuner pod containers. This is where the tuner search is set up and run. For this example, we’re using the Keras Tuner’s RandomSearch, but other options are supported as well. Then, the best N results are written to a given GCS path, and this info is used to kick off the full training runs.
bikes_weather_limited.py is used for full training with the given HP tuning parameters. The bwmodel module contains core model code used by both.
Note: With Keras Tuner, you can do both data-parallel and trial-parallel distribution. That is, you can use tf.distribute.Strategy to run each Model on multiple GPUs, and you can also search over multiple different hyperparameter combinations in parallel on different workers. The template yamls specify just one GPU, but it would be easy to modify the code to support the former.
The TF-Serving component is in the tf-serving directory, which contains both the code that launches the TF-serving service, and the yaml template used to do so.
After you’re done running the example, you probably want to do some cleanup. (If you set up the GPU node pools using autoscaling, they should scale down to zero on their own after a period of inactivity).
You can delete the ‘chief’ job when you’re done with it via this command:
kubectl delete jobs -l apptype=ktuner-chief
Then delete the chief service as well:
kubectl delete services -l apptype=ktuner-chief
You can delete the tuner jobs as follows (if the jobs have completed, this will tear down their pods; if the jobs are still running, this will terminate them):
kubectl delete jobs -l app=ktuner-tuner
To take down the TF-serving deployments and services (edit the following if you deployed to a different namespace):
kubectl delete deployment -n default -l apptype=tf-serving
kubectl delete services -n default -l apptype=tf-serving
You can also take down your Cloud AI Platform Pipelines installation, optionally deleting its GKE cluster too, via the Pipelines panel in the Cloud Console.