# Performing Hyperparameter Tuning

**Learning Objectives**
1. Learn how to use `cloudml-hypertune` to report the results for Cloud hyperparameter tuning trial runs
2. Learn how to configure the `.yaml` file for submitting a Cloud hyperparameter tuning job
3. Submit a hyperparameter tuning job to Cloud AI Platform

## Introduction

Let's see if we can improve the performance of our taxi fare model by tuning its hyperparameters.

Hyperparameters are parameters that are set *prior* to training a model, as opposed to parameters which are learned *during* training.

These include learning rate and batch size, but also model design parameters such as type of activation function and number of hidden units.
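
To make the distinction concrete, here is a minimal sketch in plain Python. It is not part of the taxifare trainer: the toy data and the helper name `fit_slope` are assumptions made for illustration. `learning_rate` and `num_steps` are hyperparameters fixed before training, while the weight `w` is a parameter learned by gradient descent.

```python
def fit_slope(xs, ys, learning_rate, num_steps):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0  # parameter: learned during training
    n = len(xs)
    for _ in range(num_steps):
        # Gradient of mean((w * x - y)^2) with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with a true slope of 2

# Hyperparameters: chosen before training starts
w_good = fit_slope(xs, ys, learning_rate=0.05, num_steps=200)
w_slow = fit_slope(xs, ys, learning_rate=0.0001, num_steps=200)
print(w_good, w_slow)
```

With the same data and the same number of steps, the larger learning rate reaches the true slope while the smaller one is still far from it. Differences like this are what tuning tries to find automatically.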

Here are the four most common ways of finding the ideal hyperparameters:

1. Manually
2. Grid Search
3. Random Search
4. Bayesian Optimisation

**1. Manual**

Traditionally, hyperparameter tuning is a manual trial-and-error process. An ML engineer has some intuition about suitable hyperparameters, which they use as a starting point. They then observe the result and use that information to choose a new set of hyperparameters that might beat the existing performance.

Pros

- Educational, builds up your intuition as an ML engineer
- Inexpensive because only one trial is conducted at a time

Cons

- Requires a lot of time and patience

**2. Grid Search**

At the other extreme, we can use grid search. Define a discrete set of values to try for each hyperparameter, then try every possible combination.

Pros

- Can run hundreds of trials in parallel using the cloud
- Guaranteed to find the best solution within the search space

Cons

- Expensive
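
As a rough illustration of the procedure, the sketch below enumerates a small grid with `itertools.product` and keeps the combination with the lowest loss. The search space values and the `toy_loss` function are made up for the example. In practice each trial would train and evaluate a model.

```python
import itertools


def toy_loss(lr, batch_size):
    """Hypothetical stand-in for a validation loss (lowest at lr=0.01, batch_size=32)."""
    return (lr - 0.01) ** 2 * 1e4 + (batch_size - 32) ** 2 / 1e3


search_space = {
    "lr": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
}

names = list(search_space)
# Every combination of candidate values: 3 * 3 = 9 trials
trials = [
    dict(zip(names, combo))
    for combo in itertools.product(*search_space.values())
]

# Each trial is independent, so they could all run in parallel
best = min(trials, key=lambda hp: toy_loss(**hp))
print(len(trials), best)
```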

**4. Bayesian Optimisation**

Unlike Grid Search and Random Search, Bayesian Optimisation takes into account information from past trials to select parameters for future trials. More information can be found [here](https://cloud.google.com/blog/products/gcp/hyperparameter-tuning-cloud-machine-learning-engine-using-bayesian-optimization).

Pros

- Picks values intelligently based on results from past trials
- Less expensive because it requires fewer trials to get a good result

Cons

- Requires sequential trials for best results

**AI Platform HyperTune**

AI Platform HyperTune, powered by [Google Vizier](https://ai.google/research/pubs/pub46180), uses Bayesian Optimisation by default, but [also supports](https://cloud.google.com/ml-engine/docs/tensorflow/hyperparameter-tuning-overview#search_algorithms) Grid Search and Random Search.

When tuning just a few hyperparameters (say, fewer than 4), Grid Search and Random Search work well, but when tuning many hyperparameters over a large search space, Bayesian Optimisation is better.
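
Simple arithmetic shows why an exhaustive grid stops being practical as the number of hyperparameters grows. The sketch below assumes, purely for illustration, five candidate values per hyperparameter. The helper name `grid_trials` is hypothetical.

```python
def grid_trials(values_per_hyperparameter, num_hyperparameters):
    """Number of combinations an exhaustive grid has to evaluate."""
    return values_per_hyperparameter ** num_hyperparameters


# The trial count grows exponentially with the number of hyperparameters
for n in (2, 3, 4, 6, 8):
    print(n, grid_trials(5, n))
```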

In [None]:
# Importing the necessary modules
import os

from google.cloud import bigquery

In [None]:
# Change with your own bucket and project below
BUCKET = '<BUCKET>'
PROJECT = '<PROJECT>'
REGION = '<YOUR REGION>'

OUTDIR = 'gs://{bucket}/taxifare/data'.format(bucket=BUCKET)

os.environ['BUCKET'] = BUCKET
os.environ['OUTDIR'] = OUTDIR
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '2.3'

In [None]:
%%bash
# Setting up Cloud SDK properties
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

## Make code compatible with AI Platform Training Service

In order to make our code compatible with AI Platform Training Service we need to make the following changes:

1. Upload data to Google Cloud Storage
2. Move code into a trainer Python package
3. Submit training job with `gcloud` to train on AI Platform

## Upload data to Google Cloud Storage (GCS)

Cloud services don't have access to our local files, so we need to upload them to a location the Cloud servers can read from. In this case we'll use GCS.

## Create BigQuery tables

In [None]:
bq = bigquery.Client(project=PROJECT)
dataset = bigquery.Dataset(bq.dataset('taxifare'))

# Create a data set
try:
    bq.create_dataset(dataset)
    print('Dataset created')
except:
    print('Dataset already exists')

Let's create a training table with 1 million examples.

In [None]:
%%bigquery

CREATE OR REPLACE TABLE taxifare.feateng_training_data AS

SELECT
    (tolls_amount + fare_amount) AS fare_amount,
    pickup_datetime,
    pickup_longitude AS pickuplon,
    pickup_latitude AS pickuplat,
    dropoff_longitude AS dropofflon,
    dropoff_latitude AS dropofflat,
    passenger_count*1.0 AS passengers,
    "unused" AS key
FROM 
    `nyc-tlc.yellow.trips`
WHERE
    ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 1000)) = 1
AND
    trip_distance > 0
AND
    fare_amount >= 2.5
AND
    pickup_longitude > -78
AND
    pickup_longitude < -70
AND
    dropoff_longitude > -78
AND
    dropoff_longitude < -70
AND
    pickup_latitude > 37
AND
    pickup_latitude < 45
AND
    dropoff_latitude > 37
AND
    dropoff_latitude < 45
AND
    passenger_count > 0

Make the validation data set be 1/10 the size of the training data set.

In [None]:
%%bigquery

CREATE OR REPLACE TABLE taxifare.feateng_valid_data AS

SELECT
    (tolls_amount + fare_amount) AS fare_amount,
    pickup_datetime,
    pickup_longitude AS pickuplon,
    pickup_latitude AS pickuplat,
    dropoff_longitude AS dropofflon,
    dropoff_latitude AS dropofflat,
    passenger_count*1.0 AS passengers,
    "unused" AS key
FROM
    `nyc-tlc.yellow.trips`
WHERE
    ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 10000)) = 2
AND
    trip_distance > 0
AND
    fare_amount >= 2.5
AND
    pickup_longitude > -78
AND
    pickup_longitude < -70
AND
    dropoff_longitude > -78
AND
    dropoff_longitude < -70
AND
    pickup_latitude > 37
AND
    pickup_latitude < 45
AND
    dropoff_latitude > 37
AND
    dropoff_latitude < 45
AND
    passenger_count > 0
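
The `WHERE` clauses in the two queries above use `ABS(MOD(FARM_FINGERPRINT(...), N)) = k` to keep a repeatable sample of roughly 1 in `N` rows. The sketch below illustrates the same idea in plain Python. It uses `hashlib.sha256` as a stand-in for `FARM_FINGERPRINT` (an assumption, since FarmHash is not in the standard library), and the row keys are hypothetical.

```python
import hashlib


def bucket(key, num_buckets):
    """Deterministically map a string key to an integer in [0, num_buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


# Hypothetical row keys standing in for pickup timestamps
keys = ["trip-{}".format(i) for i in range(100000)]

train = [k for k in keys if bucket(k, 1000) == 1]    # about 1 in 1,000 rows
valid = [k for k in keys if bucket(k, 10000) == 2]   # about 1 in 10,000 rows

print(len(train), len(valid))
# A hash equal to 2 modulo 10000 is also 2 modulo 1000, never 1,
# so the two samples cannot share a row
print(set(train) & set(valid))
```

Because the bucket depends only on the key, rerunning the query selects the same rows, and the training and validation samples stay disjoint.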

## Export the tables as CSV files

In [None]:
%%bash

echo "Deleting current contents of $OUTDIR"
gsutil -m -q rm -rf $OUTDIR

echo "Extracting training data to $OUTDIR"
bq --location=US extract \
    --destination_format CSV \
    --field_delimiter "," --noprint_header \
    taxifare.feateng_training_data \
    $OUTDIR/taxi-train-*.csv
    
echo "Extracting validation data to $OUTDIR"
bq --location=US extract \
    --destination_format CSV \
    --field_delimiter "," --noprint_header \
    taxifare.feateng_valid_data \
    $OUTDIR/taxi-valid-*.csv
    
# List the files of the bucket
gsutil ls -l $OUTDIR

In [None]:
# Display the first two rows of the first training file
!gsutil cat gs://$BUCKET/taxifare/data/taxi-train-000000000000.csv | head -2

If all ran smoothly, you should be able to list the data bucket by running the following command:

In [None]:
# List the files of the bucket
!gsutil ls gs://$BUCKET/taxifare/data

## Move code into Python package

Here, we moved our code into a Python package for training on Cloud AI Platform. Let's just check that the files are there. You should see the following files in the `taxifare/trainer` directory:

- `__init__.py`

- `model.py`

- `task.py`

In [None]:
# It will list all the files in the mentioned directory with a long listing format
!ls -la taxifare/trainer

To use hyperparameter tuning in your training job you must perform the following steps:

1. Specify the hyperparameter tuning configuration for your training job by including a `HyperparameterSpec` in your `TrainingInput` object.

2. Include the following code in your training application:


- Parse the command-line arguments representing the hyperparameters you want to tune, and use the values to set the hyperparameters for your training trial. Add your hyperparameter metric to the summary for your graph.

- To submit a hyperparameter tuning job, we must modify `model.py` and `task.py` to expose any variables we want to tune as command line arguments
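
The argument-parsing step can be sketched as follows. This is a minimal, standalone illustration rather than the trainer's actual `task.py`. The helper name `parse_hparams` and the example values are assumptions.

```python
import argparse


def parse_hparams(argv):
    """Parse tunable hyperparameters from a list of command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--nbuckets", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=32)
    # Flags this sketch does not define are ignored instead of raising an error
    args, _ = parser.parse_known_args(argv)
    return vars(args)


# Arguments as one tuning trial might receive them (values are made up)
hparams = parse_hparams(["--lr", "0.01", "--nbuckets", "17", "--job-dir", "unused"])
print(hparams)
```

Each trial then receives a different set of flag values, so any hyperparameter exposed this way can be varied without changing the code.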

## Modify `model.py`

In [None]:
%%writefile ./taxifare/trainer/model.py

# Importing the necessary modules
import datetime
import hypertune
import logging
import os
import shutil

import numpy as np
import tensorflow as tf

from tensorflow.keras import activations
from tensorflow.keras import callbacks
from tensorflow.keras import layers
from tensorflow.keras import models

from tensorflow import feature_column as fc

logging.info(tf.__version__)

CSV_COLUMNS = [
    "fare_amount",
    "pickup_datetime",
    "pickup_longitude",
    "pickup_latitude",
    "dropoff_longitude",
    "dropoff_latitude",
    "passenger_count",
    "key"
]
LABEL_COLUMN = "fare_amount"
DEFAULTS = [[0.0], ["na"], [0.0], [0.0], [0.0], [0.0], [0.0], ["na"]]
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

# Split features and labels from feature dictionary
def features_and_labels(row_data):
    for unwanted_col in ["key"]:
        row_data.pop(unwanted_col)
    label = row_data.pop(LABEL_COLUMN)
    return row_data, label

# Load data set using the `tf.data` API from CSV files
def load_dataset(pattern, batch_size, num_repeat):
    dataset = tf.data.experimental.make_csv_dataset(
        file_pattern=pattern,
        batch_size=batch_size,
        column_names=CSV_COLUMNS,
        column_defaults=DEFAULTS,
        num_epochs=num_repeat)
    return dataset.map(features_and_labels)

# Prefetch overlaps the preprocessing and model execution of a training step
def create_train_dataset(pattern, batch_size):
    dataset = load_dataset(pattern, batch_size, num_repeat=None)
    return dataset.prefetch(1)

def create_eval_dataset(pattern, batch_size, num_repeat=1):
    dataset = load_dataset(pattern, batch_size, num_repeat=1)
    return dataset.prefetch(1)

# Parse a string and return a `datetime.datetime`
def parse_datetime(s):
    if type(s) is not str:
        s = s.numpy().decode("utf-8")
    return datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S %Z")

# Here, `tf.sqrt` computes element-wise square root of the input tensor
def euclidean(params):
    lon1, lat1, lon2, lat2 = params
    londiff = lon2 - lon1
    latdiff = lat2 - lat1
    return tf.sqrt(londiff**2 + latdiff**2)

# `timestamp.weekday()` function returns the day of the week represented by the date in the given `timestamp`
def get_dayofweek(s):
    ts = parse_datetime(s)
    return DAYS[ts.weekday()]

# `tf.py_function` wraps a Python function into a TensorFlow op that executes it eagerly
@tf.function
def dayofweek(ts_in):
    return tf.map_fn(
        lambda s: tf.py_function(get_dayofweek, inp=[s], Tout=tf.string),
        ts_in)

def transform(inputs, NUMERIC_COLS, STRING_COLS, nbuckets):
    # Pass-through columns
    transformed = inputs.copy()
    del transformed["pickup_datetime"]

    feature_columns = {
        colname: tf.feature_column.numeric_column(colname)
        for colname in NUMERIC_COLS
    }
    
    # Scaling longitude from range [-78, -70] to [0, 1]
    for lon_col in ["pickup_longitude", "dropoff_longitude"]:
        transformed[lon_col] = tf.keras.layers.Lambda(
            lambda x: (x + 78)/8.0,
            name="scale_{}".format(lon_col)
        )(inputs[lon_col])
        
    # Scaling latitude from range [37, 45] to [0, 1]
    for lat_col in ["pickup_latitude", "dropoff_latitude"]:
        transformed[lat_col] = tf.keras.layers.Lambda(
            lambda x: (x - 37)/8.0,
            name="scale_{}".format(lat_col)
        )(inputs[lat_col])
        
    # Adding Euclidean distance (no need to be accurate: NN will calibrate it)
    transformed["euclidean"] = tf.keras.layers.Lambda(euclidean, name="euclidean")([
        inputs["pickup_longitude"],
        inputs["pickup_latitude"],
        inputs["dropoff_longitude"],
        inputs["dropoff_latitude"]
    ])
    feature_columns["euclidean"] = tf.feature_column.numeric_column("euclidean")
    
    # Get hour of day from timestamp of form "2010-02-08 09:17:00+00:00"
    transformed["hourofday"] = tf.keras.layers.Lambda(
        lambda x: tf.strings.to_number(
        tf.strings.substr(x, 11, 2), out_type=tf.dtypes.int32),
        name="hourofday"
    )(inputs["pickup_datetime"])
    feature_columns["hourofday"] = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_identity(
            "hourofday", num_buckets=24))
    
    latbuckets = np.linspace(0, 1, nbuckets).tolist()
    lonbuckets = np.linspace(0, 1, nbuckets).tolist()
    b_plat = tf.feature_column.bucketized_column(
        feature_columns["pickup_latitude"], latbuckets
    )
    b_dlat = tf.feature_column.bucketized_column(
        feature_columns["dropoff_latitude"], latbuckets
    )
    b_plon = tf.feature_column.bucketized_column(
        feature_columns["pickup_longitude"], lonbuckets
    )
    b_dlon = tf.feature_column.bucketized_column(
        feature_columns["dropoff_longitude"], lonbuckets
    )
    ploc = tf.feature_column.crossed_column(
        [b_plat, b_plon], nbuckets * nbuckets
    )
    dloc = tf.feature_column.crossed_column(
        [b_dlat, b_dlon], nbuckets * nbuckets
    )
    pd_pair = tf.feature_column.crossed_column([ploc, dloc], nbuckets ** 4)
    feature_columns["pickup_and_dropoff"] = tf.feature_column.embedding_column(
        pd_pair, 100
    )
    
    return transformed, feature_columns

# Here, `tf.sqrt` computes element-wise square root of the input tensor
def rmse(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))

def build_dnn_model(nbuckets, nnsize, lr):
    # Input layer is all float except for `pickup_datetime` which is a string
    STRING_COLS = ["pickup_datetime"]
    NUMERIC_COLS = (
        set(CSV_COLUMNS) - set([LABEL_COLUMN, "key"]) - set(STRING_COLS)
    )
    inputs = {
        colname: layers.Input(name=colname, shape=(), dtype="float32")
        for colname in NUMERIC_COLS
    }
    inputs.update({
        colname: layers.Input(name=colname, shape=(), dtype="string")
        for colname in STRING_COLS
    })
    
    # Transforms
    transformed, feature_columns = transform(
        inputs, NUMERIC_COLS, STRING_COLS, nbuckets=nbuckets)
    dnn_inputs = layers.DenseFeatures(feature_columns.values())(transformed)
    
    x = dnn_inputs
    for layer, nodes in enumerate(nnsize):
        x = layers.Dense(nodes, activation="relu", name="h{}".format(layer))(x)
    output = layers.Dense(1, name="fare")(x)
    
    model = models.Model(inputs, output)
    lr_optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    model.compile(optimizer=lr_optimizer, loss="mse", metrics=[rmse, "mse"])
    
    return model

# Define train and evaluate method to evaluate performance of the model
def train_and_evaluate(hparams):
    batch_size = hparams["batch_size"]
    eval_data_path = hparams["eval_data_path"]
    nnsize = hparams["nnsize"]
    nbuckets = hparams["nbuckets"]
    lr = hparams["lr"]
    num_evals = hparams["num_evals"]
    num_examples_to_train_on = hparams["num_examples_to_train_on"]
    output_dir = hparams["output_dir"]
    train_data_path = hparams["train_data_path"]
    
    if tf.io.gfile.exists(output_dir):
        tf.io.gfile.rmtree(output_dir)
        
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    savedmodel_dir = os.path.join(output_dir, "savedmodel")
    model_export_path = os.path.join(savedmodel_dir, timestamp)
    checkpoint_path = os.path.join(output_dir, "checkpoints")
    tensorboard_path = os.path.join(output_dir, "tensorboard")
    
    dnn_model = build_dnn_model(nbuckets, nnsize, lr)
    logging.info(dnn_model.summary())
    
    trainds = create_train_dataset(train_data_path, batch_size)
    evalds = create_eval_dataset(eval_data_path, batch_size)
    
    steps_per_epoch = num_examples_to_train_on // (batch_size * num_evals)
    
    checkpoint_cb = callbacks.ModelCheckpoint(checkpoint_path,
                                              save_weights_only=True,
                                              verbose=1)
    
    tensorboard_cb = callbacks.TensorBoard(tensorboard_path,
                                           histogram_freq=1)
    
    history = dnn_model.fit(
        trainds,
        validation_data=evalds,
        epochs=num_evals,
        steps_per_epoch=max(1, steps_per_epoch),
        verbose=2, # 0=silent, 1=progress bar, 2=one line per epoch
        callbacks=[checkpoint_cb, tensorboard_cb]
    )
    
    # Exporting the model with default serving function
    tf.saved_model.save(dnn_model, model_export_path)
    
    hp_metric = history.history["val_rmse"][num_evals-1]
    
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="rmse",
        metric_value=hp_metric,
        global_step=num_evals
    )
    
    return history
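
`train_and_evaluate` spreads a fixed budget of training examples over `num_evals` epochs, so the model is evaluated `num_evals` times during training. The arithmetic is sketched below as a standalone helper. The function name `steps_per_epoch` mirrors the variable in `model.py`, but the helper itself is only an illustration.

```python
def steps_per_epoch(num_examples_to_train_on, batch_size, num_evals):
    """Training steps between two evaluations, never fewer than one."""
    return max(1, num_examples_to_train_on // (batch_size * num_evals))


# Tiny budget: 100 // (15 * 10) is 0, so the value is clamped to 1
small = steps_per_epoch(100, batch_size=15, num_evals=10)

# Larger budget: 1,000,000 // (50 * 10) steps per epoch
large = steps_per_epoch(1000000, batch_size=50, num_evals=10)
print(small, large)
```

Note that with a very small budget the clamp to one step per epoch means the job trains on `batch_size * num_evals` examples, which can be more than `num_examples_to_train_on`.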

## Modify `task.py`

In [None]:
%%writefile taxifare/trainer/task.py
# Importing the necessary modules
import argparse
import json
import os

from trainer import model

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--batch_size",
        help = "Batch size for training steps",
        type = int,
        default = 32
    )
    parser.add_argument(
        "--nnsize",
        help = "Hidden layer sizes (provide space-separated sizes)",
        nargs = "+",
        type = int,
        default = [32, 8]
    )
    parser.add_argument(
        "--nbuckets",
        help = "Number of buckets to divide lat and lon width",
        type = int,
        default = 10
    )
    parser.add_argument(
        "--lr",
        help = "Learning rate for optimizer",
        type = float,
        default = 0.001
    )
    parser.add_argument(
        "--num_evals",
        help = "Number of times to evaluate model on eval data during training",
        type = int,
        default = 5
    )
    parser.add_argument(
        "--num_examples_to_train_on",
        help = "Number of examples to train on",
        type = int,
        default = 100
    )
    parser.add_argument(
        "--output_dir",
        help = "GCS location to write checkpoints and export models",
        required = True
    )
    parser.add_argument(
        "--eval_data_path",
        help = "GCS location pattern of eval files",
        required = True
    )
    parser.add_argument(
        "--train_data_path",
        help = "GCS location pattern of train files",
        required = True
    )
    parser.add_argument(
        "--job-dir",
        help = "This model ignores this field, but it is required by gcloud",
        default = "junk"
    )
    
    args, _ = parser.parse_known_args()
    
    hparams = args.__dict__
    hparams["output_dir"] = os.path.join(
        hparams["output_dir"],
        json.loads(
            os.environ.get("TF_CONFIG", "{}")
        ).get("task", {}).get("trial", "")
    )
    print("output_dir", hparams["output_dir"])
    model.train_and_evaluate(hparams)
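
The last lines of `task.py` append the trial identifier to `output_dir` so that each trial writes to its own subdirectory. The sketch below isolates that logic. The helper name `trial_output_dir`, the example `TF_CONFIG` value and the bucket path are hypothetical.

```python
import json
import os


def trial_output_dir(base_dir, tf_config_json):
    """Append the trial ID found in a TF_CONFIG JSON string, if any."""
    trial = json.loads(tf_config_json or "{}").get("task", {}).get("trial", "")
    return os.path.join(base_dir, trial)


# Hypothetical TF_CONFIG value for the third trial of a tuning job
tuning_config = json.dumps({"task": {"type": "master", "index": 0, "trial": "3"}})

with_trial = trial_output_dir("gs://my-bucket/taxifare/model", tuning_config)
without_trial = trial_output_dir("gs://my-bucket/taxifare/model", "")
print(with_trial)
print(without_trial)
```

When `TF_CONFIG` is absent, as in a local run, the trial ID is an empty string and the base directory is used as is.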

## Create `config.yaml` file

Specify the hyperparameter tuning configuration for your training job. Create a `HyperparameterSpec` object to hold the hyperparameter tuning configuration for your training job, and add the `HyperparameterSpec` as the hyperparameters object in your `TrainingInput` object.

In your `HyperparameterSpec`, set the `hyperparameterMetricTag` to a value representing your chosen metric. If you don't specify a `hyperparameterMetricTag`, AI Platform Training looks for a metric with the name `training/hptuning/metric`. The following example shows how to create a configuration for a metric named `rmse`:

In [None]:
%%writefile hptuning_config.yaml

# Setting parameters for hptuning_config.yaml
trainingInput:
    scaleTier: BASIC
    hyperparameters:
        goal: MINIMIZE
        maxTrials: 10
        maxParallelTrials: 2
        hyperparameterMetricTag: rmse
        enableTrialEarlyStopping: True
        params:
        - parameterName: lr
          type: DOUBLE
          minValue: 0.0001
          maxValue: 0.1
          scaleType: UNIT_LOG_SCALE
        - parameterName: nbuckets
          type: INTEGER
          minValue: 10
          maxValue: 25
          scaleType: UNIT_LINEAR_SCALE
        - parameterName: batch_size
          type: DISCRETE
          discreteValues:
          - 15
          - 30
          - 50
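
The `scaleType` field controls how a parameter's range is scaled before it is searched. The sketch below only illustrates the idea of mapping a range onto the unit interval linearly versus logarithmically. It is not the service's implementation, and the helper names `unit_linear` and `unit_log` are assumptions.

```python
import math


def unit_linear(value, min_value, max_value):
    """Position of value in [min_value, max_value], scaled linearly to [0, 1]."""
    return (value - min_value) / (max_value - min_value)


def unit_log(value, min_value, max_value):
    """Position of value in [min_value, max_value], scaled logarithmically to [0, 1]."""
    return (math.log(value) - math.log(min_value)) / (
        math.log(max_value) - math.log(min_value)
    )


# Learning-rate range from the configuration above
for lr in (0.0001, 0.001, 0.01, 0.1):
    print(lr, round(unit_linear(lr, 0.0001, 0.1), 4), round(unit_log(lr, 0.0001, 0.1), 4))
```

On a linear scale almost all of the unit interval covers learning rates above 0.01, whereas the log scale gives each order of magnitude an equal share. That is why a log scale suits a parameter such as `lr`.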

**Report your hyperparameter metric to AI Platform Training**

The way to report your hyperparameter metric to the AI Platform Training service depends on whether you are using TensorFlow for training or not. It also depends on whether you are using a runtime version or a custom container for training.

We recommend that your training code reports your hyperparameter metric to AI Platform Training frequently in order to take advantage of early stopping.

*TensorFlow with a runtime version*: If you use an AI Platform Training runtime version and train with TensorFlow, then you can report your hyperparameter metric to AI Platform Training by writing the metric to a TensorFlow summary. In this notebook, `model.py` reports the metric explicitly with the `cloudml-hypertune` package.

You may need to install `cloudml-hypertune` on your machine to run this code locally.

In [None]:
# Installing the latest version of the package
!pip install cloudml-hypertune

In [None]:
%%bash

# Testing your training code locally
EVAL_DATA_PATH=./taxifare/tests/data/taxi-valid*
TRAIN_DATA_PATH=./taxifare/tests/data/taxi-train*
OUTPUT_DIR=./taxifare-model

rm -rf ${OUTPUT_DIR}
export PYTHONPATH=${PYTHONPATH}:${PWD}/taxifare

python3 -m trainer.task \
--eval_data_path $EVAL_DATA_PATH \
--output_dir $OUTPUT_DIR \
--train_data_path $TRAIN_DATA_PATH \
--batch_size 5 \
--num_examples_to_train_on 100 \
--num_evals 1 \
--nbuckets 10 \
--lr 0.001 \
--nnsize 32 8

In [None]:
ls taxifare-model/tensorboard

In [None]:
%%bash

PROJECT_ID=$(gcloud config list project --format "value(core.project)")
BUCKET=$PROJECT_ID
REGION="us-central1"
TFVERSION="2.1"

# Output directory and job ID
OUTDIR=gs://${BUCKET}/taxifare/trained_model_$(date -u +%y%m%d_%H%M%S)
JOBID=taxifare_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBID}
gsutil -m rm -rf ${OUTDIR}

# Model and training hyperparameters
BATCH_SIZE=15
NUM_EXAMPLES_TO_TRAIN_ON=100
NUM_EVALS=10
NBUCKETS=10
LR=0.001
NNSIZE="32 8"

# GCS path
GCS_PROJECT_PATH=gs://$BUCKET/taxifare
DATA_PATH=$GCS_PROJECT_PATH/data
TRAIN_DATA_PATH=$DATA_PATH/taxi-train*
EVAL_DATA_PATH=$DATA_PATH/taxi-valid*

gcloud ai-platform jobs submit training $JOBID \
    --module-name=trainer.task \
    --package-path=taxifare/trainer \
    --staging-bucket=gs://${BUCKET} \
    --config=hptuning_config.yaml \
    --python-version=3.7 \
    --runtime-version=${TFVERSION} \
    --region=${REGION} \
    -- \
    --eval_data_path $EVAL_DATA_PATH \
    --output_dir $OUTDIR \
    --train_data_path $TRAIN_DATA_PATH \
    --batch_size $BATCH_SIZE \
    --num_examples_to_train_on $NUM_EXAMPLES_TO_TRAIN_ON \
    --num_evals $NUM_EVALS \
    --nbuckets $NBUCKETS \
    --lr $LR \
    --nnsize $NNSIZE