# 4_train_model_on_Cloud_AI_Platform
Having built a good Wide & Deep model on a subset of the dataset (`2_prototype_model.ipynb`) and created a clean full dataset as CSV files (`3_create_ML_dataset_using_Dataflow.ipynb`), we can now train the model on the full dataset. In this notebook, we do so using [Cloud AI Platform](https://console.cloud.google.com/ai-platform).

Training on Cloud AI Platform requires:
* Making the code a Python package
* Using `gcloud` to submit the training code to Cloud AI Platform

## Set up environment variables and load necessary libraries

In [1]:
!pip3 install cloudml-hypertune



In [2]:
import os
import numpy as np
import tensorflow as tf
print(tf.__version__)

2.3.4


In [3]:
PROJECT = "predict-babyweight-10142021"
BUCKET = PROJECT
REGION = "us-central1"

os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET 
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = "2.1"
os.environ["PYTHONVERSION"] = "3.7"

## Package the TensorFlow code up as a Python package

A Python package is simply a collection of one or more `.py` files along with an `__init__.py` file. Here, we're going to make a Python package named `trainer` that includes three `.py` files: 
- `__init__.py`: is used to mark directories on disk as Python package directories. In our case, we make it empty.
- `task.py`: defines the model's parameters as command-line flags, parsed with the `argparse` module, and starts training
- `model.py`: contains the code we wrote for the Wide & Deep model.
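
To see why this layout is importable, the sketch below builds a throwaway package with the same three files in a temporary directory and imports it. The package name `trainer_demo` and the function bodies are hypothetical stand-ins, chosen so the example does not touch the real `trainer` folder.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package in a temporary directory.
# "trainer_demo" is a hypothetical name, not the notebook's real package.
root = Path(tempfile.mkdtemp())
pkg = root / "trainer_demo"
pkg.mkdir()
(pkg / "__init__.py").write_text("")  # empty file marks the directory as a package
(pkg / "model.py").write_text(
    "def describe():\n"
    "    return 'model code lives here'\n"
)
(pkg / "task.py").write_text(
    "from trainer_demo import model\n"
    "\n"
    "def main():\n"
    "    return model.describe()\n"
)

# Putting the parent directory on the import path plays the role of PYTHONPATH.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
task = importlib.import_module("trainer_demo.task")
print(task.main())
```

The import inside `task.py` mirrors the `from trainer import model` line used later in this notebook.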

In [4]:
%%bash
mkdir -p trainer
touch trainer/__init__.py

We then use the `%%writefile` magic to write the contents of the cell below to a file called `task.py` in the `trainer` folder.

### Create `task.py` file to hold hyperparameter argparsing code.

In [5]:
%%writefile trainer/task.py
import argparse
import json
import os

from trainer import model

import tensorflow as tf

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--job-dir",
        help="this model ignores this field, but it is required by gcloud",
        default="junk"
    )
    parser.add_argument(
        "--train_data_path",
        help="GCS location of training data",
        required=True
    )
    parser.add_argument(
        "--eval_data_path",
        help="GCS location of evaluation data",
        required=True
    )
    parser.add_argument(
        "--output_dir",
        help="GCS location to write checkpoints and export models",
        required=True
    )
    parser.add_argument(
        "--batch_size",
        help="Number of examples to compute gradient over.",
        type=int,
        default=512
    )
    parser.add_argument(
        "--nnsize",
        help="Hidden layer sizes for DNN -- provide space-separated layers",
        nargs="+",
        type=int,
        default=[128, 32, 4]
    )
    parser.add_argument(
        "--nembeds",
        help="Embedding size of a cross of n key real-valued parameters",
        type=int,
        default=3
    )
    parser.add_argument(
        "--num_epochs",
        help="Number of epochs to train the model.",
        type=int,
        default=10
    )
    parser.add_argument(
        "--train_examples",
        help="""Number of examples (in thousands) to run the training job over.
        If this is more than actual # of examples available, it cycles through
        them. So specifying 1000 here when we have only 100k examples makes
        this 10 epochs.""",
        type=int,
        default=5000
    )
    parser.add_argument(
        "--eval_steps",
        help="""Positive number of steps for which to evaluate model. Default
        to None, which means to evaluate until input_fn raises an end-of-input
        exception""",
        type=int,
        default=None
    )

    # Parse all arguments
    args = parser.parse_args()
    arguments = args.__dict__

    # Unused args provided by service
    arguments.pop("job_dir", None)
    arguments.pop("job-dir", None)

    # Modify some arguments
    arguments["train_examples"] *= 1000

    # Append trial_id to path if we are doing hptuning
    # This code can be removed if we are not using hyperparameter tuning
    arguments["output_dir"] = os.path.join(
        arguments["output_dir"],
        json.loads(
            os.environ.get("TF_CONFIG", "{}")
        ).get("task", {}).get("trial", "")
    )

    # Run the training job
    model.train_and_evaluate(arguments)

Overwriting trainer/task.py
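
Three details of the parsing code above are easy to miss: `argparse` stores `--job-dir` under the key `job_dir`, `nargs="+"` always yields a list, and `train_examples` is multiplied by 1000 after parsing. The following is a minimal sketch with a cut-down parser (only three of the flags), fed an explicit argument list so it runs without a command line.

```python
import argparse

# Cut-down version of the parser in task.py (three flags only).
parser = argparse.ArgumentParser()
parser.add_argument("--job-dir", default="junk")
parser.add_argument("--nnsize", nargs="+", type=int, default=[128, 32, 4])
parser.add_argument("--train_examples", type=int, default=5000)

# Parse an explicit list instead of sys.argv so the example runs anywhere.
args = parser.parse_args(
    ["--job-dir=./tmp", "--nnsize", "64", "16", "--train_examples=1"]
)
arguments = vars(args)  # the same dictionary as args.__dict__

# Dashes in flag names become underscores, so the key is "job_dir".
arguments.pop("job_dir", None)

# The flag is given in thousands of examples.
arguments["train_examples"] *= 1000

print(arguments)  # {'nnsize': [64, 16], 'train_examples': 1000}

# A single value passed with "=" still comes back as a one-element list.
single = parser.parse_args(["--nnsize=205"])
print(single.nnsize)  # [205]
```

Because the key is always `job_dir`, the extra `arguments.pop("job-dir", None)` in `task.py` is a harmless no-op.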


### Create trainer module's model.py to hold Keras model code.

To create our `model.py`, we'll use the code we wrote for the Wide & Deep model.
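
One step in that model, bucketizing `mother_age` and `gestation_weeks` so they can feed the wide side, can be illustrated without TensorFlow. The sketch below mimics, for a single value, what a bucketized column wrapped in an indicator column produces. The helper names are hypothetical, and the half-open bucket convention `(-inf, b0), [b0, b1), ..., [b_last, +inf)` is my understanding of `tf.feature_column.bucketized_column`.

```python
from bisect import bisect_right

# Same boundaries as np.arange(15, 45, 1).tolist() in model.py: 15, 16, ..., 44
AGE_BOUNDARIES = list(range(15, 45))


def bucket_index(value, boundaries):
    """Return the index of the half-open bucket that contains value."""
    return bisect_right(boundaries, value)


def one_hot(index, size):
    """Return a list of length size with 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


num_buckets = len(AGE_BOUNDARIES) + 1  # 30 boundaries give 31 buckets

for age in (14.0, 15.0, 29.5, 60.0):
    idx = bucket_index(age, AGE_BOUNDARIES)
    print(age, "-> bucket", idx)

encoded = one_hot(bucket_index(29.5, AGE_BOUNDARIES), num_buckets)
```

Ages below 15 all share bucket 0 and ages of 44 and above share the last bucket, so the one-hot vector has a fixed width regardless of outliers.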

In [6]:
%%writefile trainer/model.py
import datetime
import os
import shutil
import numpy as np
import tensorflow as tf
import hypertune

# Determine CSV, label, and key columns
CSV_COLUMNS = ["weight_pounds",
               "is_male",
               "mother_age",
               "plurality",
               "gestation_weeks",
               "cigarette_use",
               "alcohol_use"]
LABEL_COLUMN = "weight_pounds"

# Set default values for each CSV column.
DEFAULTS = [[0.0], ["null"], [0.0], ["null"], [0.0], ["null"], ["null"]]

# Make dataset of features and label from CSV files
def features_and_labels(row_data):
    """Splits features and labels from feature dictionary.
    """
    label = row_data.pop(LABEL_COLUMN)
    return row_data, label  # features, label


def load_dataset(pattern, batch_size=1, mode='eval'):
    """Loads dataset using the tf.data API from CSV files.
    """
    # Make a CSV dataset
    dataset = tf.data.experimental.make_csv_dataset(
        file_pattern=pattern,
        batch_size=batch_size,
        column_names=CSV_COLUMNS,
        column_defaults=DEFAULTS,
        ignore_errors=True)

    # Map dataset to features and label
    dataset = dataset.map(map_func=features_and_labels)  # features, label

    # Shuffle and repeat for training
    if mode == 'train':
        dataset = dataset.shuffle(buffer_size=1000).repeat()

    # Prefetch one batch so that input loading overlaps with training
    # (pass tf.data.experimental.AUTOTUNE to let TensorFlow pick the buffer size)
    dataset = dataset.prefetch(buffer_size=1)

    return dataset


# Create input layers for raw features.
def create_input_layers():
    """Creates dictionary of input layers for each feature.
    """
    deep_inputs = {
        colname: tf.keras.layers.Input(name=colname, shape=(), dtype="float32")
        for colname in ["mother_age", "gestation_weeks"]
    }

    wide_inputs = {
        colname: tf.keras.layers.Input(name=colname, shape=(), dtype="string")
        for colname in ["is_male", "plurality", "cigarette_use", "alcohol_use"]
    }

    inputs = {**wide_inputs, **deep_inputs}

    return inputs


# Create feature columns for inputs.
def categorical_fc(name, values):
    """Helper function to wrap categorical feature by indicator column.
    """
    cat_column = tf.feature_column.categorical_column_with_vocabulary_list(
            key=name, vocabulary_list=values)
    ind_column = tf.feature_column.indicator_column(categorical_column=cat_column)

    return cat_column, ind_column


def create_feature_columns(nembeds):
    """Creates wide and deep dictionaries of feature columns from inputs.
    """
    deep_fc = {
        colname: tf.feature_column.numeric_column(key=colname)
        for colname in ["mother_age", "gestation_weeks"]
    }
    wide_fc = {}
    is_male, wide_fc["is_male"] = categorical_fc("is_male", 
                                                 ["true", "false", "Unknown"])
    cigarette_use, wide_fc["cigarette_use"] = categorical_fc("cigarette_use", 
                                                             ["true", "false", "Unknown"])
    alcohol_use, wide_fc["alcohol_use"] = categorical_fc("alcohol_use", 
                                                         ["true", "false", "Unknown"])
    plurality, wide_fc["plurality"] = categorical_fc("plurality", 
                                                     ["Single(1)", "Twins(2)", "Triplets(3)",
                                                      "Quadruplets(4)", "Quintuplets(5)", "Multiple(2+)"])

    # Bucketize the float fields. This makes them wide
    age_buckets = tf.feature_column.bucketized_column(
        source_column=deep_fc["mother_age"],
        boundaries=np.arange(15, 45, 1).tolist())
    wide_fc["age_buckets"] = tf.feature_column.indicator_column(
        categorical_column=age_buckets)

    gestation_buckets = tf.feature_column.bucketized_column(
        source_column=deep_fc["gestation_weeks"],
        boundaries=np.arange(17, 47, 1).tolist())
    wide_fc["gestation_buckets"] = tf.feature_column.indicator_column(
        categorical_column=gestation_buckets)

    # Cross all the wide columns, have to do the crossing before we one-hot
    crossed = tf.feature_column.crossed_column(
        keys=[age_buckets, gestation_buckets],
        hash_bucket_size=1000)
    deep_fc["crossed_embeds"] = tf.feature_column.embedding_column(
        categorical_column=crossed, dimension=nembeds)

    return wide_fc, deep_fc


# Create DNN dense hidden layers and output layer.
def get_model_outputs(wide_inputs, deep_inputs, dnn_hidden_units):
    """Creates model architecture and returns outputs.
    """
    # Hidden layers for the deep side
    layers = [int(x) for x in dnn_hidden_units]
    deep = deep_inputs
    for layerno, numnodes in enumerate(layers):
        deep = tf.keras.layers.Dense(units=numnodes,
                                     activation="relu",
                                     name=f"dnn_{layerno+1}")(deep)
    deep_out = deep

    # Linear model for the wide side
    wide_out = tf.keras.layers.Dense(
        units=10, activation="relu", name="linear")(wide_inputs)

    # Concatenate the two sides
    both = tf.keras.layers.concatenate(
        inputs=[deep_out, wide_out], name="both")

    # Final output is a linear activation because this is regression
    output = tf.keras.layers.Dense(
        units=1, activation="linear", name="weight")(both)

    return output


# Create custom evaluation metric
def rmse(y_true, y_pred):
    """Calculates RMSE evaluation metric.
    """
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))

def r_squared(y, y_pred):
    """Calculates R^2 evaluation metric.
    """
    residual = tf.reduce_sum(tf.square(tf.subtract(y, y_pred)))
    total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
    r2 = tf.subtract(1.0, tf.divide(residual, total))
    return r2

# Build DNN model tying all of the pieces together
def build_wide_deep_model(dnn_hidden_units=[64, 32], nembeds=3):
    """Builds wide and deep model using Keras Functional API.
    """
    # Create input layers
    inputs = create_input_layers()

    # Create feature columns for both wide and deep
    wide_fc, deep_fc = create_feature_columns(nembeds)

    # The constructor for DenseFeatures takes a list of numeric columns
    # The Functional API in Keras requires: LayerConstructor()(inputs)
    wide_inputs = tf.keras.layers.DenseFeatures(
        feature_columns=wide_fc.values(), name="wide_inputs")(inputs)
    deep_inputs = tf.keras.layers.DenseFeatures(
        feature_columns=deep_fc.values(), name="deep_inputs")(inputs)

    # Get output of model given inputs
    output = get_model_outputs(wide_inputs, deep_inputs, dnn_hidden_units)

    # Build model and compile it all together
    model = tf.keras.models.Model(inputs=inputs, outputs=output)
    model.compile(optimizer="adam", loss="mse", metrics=["mse",rmse, r_squared])

    return model


# Train and evaluate
def train_and_evaluate(args):
    model = build_wide_deep_model(args["nnsize"], args["nembeds"])
    print("*** Here is our Wide-and-Deep architecture so far:\n")
    print(model.summary())

    trainds = load_dataset(args["train_data_path"],args["batch_size"],'train')
    evalds = load_dataset(args["eval_data_path"], 1000, 'eval')
    if args["eval_steps"]:
        evalds = evalds.take(count=args["eval_steps"])

    num_batches = args["batch_size"] * args["num_epochs"]
    steps_per_epoch = args["train_examples"] // num_batches
    
    checkpoint_path = os.path.join(args["output_dir"], "checkpoints/babyweight")
    cp_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_path, verbose=1, save_weights_only=True)

    history = model.fit(
        trainds,
        validation_data=evalds,
        epochs=args["num_epochs"],
        steps_per_epoch=steps_per_epoch,
        verbose=2,  # 0=silent, 1=progress bar, 2=one line per epoch
        callbacks=[cp_callback])

    EXPORT_PATH = os.path.join(
        args["output_dir"], datetime.datetime.now().strftime("%Y%m%d%H%M%S"))
    tf.saved_model.save(obj=model, export_dir=EXPORT_PATH)  # with default serving function
    
    hp_metric = history.history['val_rmse'][-1]

    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag='rmse',
        metric_value=hp_metric,
        global_step=args['num_epochs'])

    print(f"*** Exported trained model to {EXPORT_PATH}")

Overwriting trainer/model.py
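
The step count in `train_and_evaluate` deserves a worked example, because `--train_examples` is the total number of examples over the whole job (in thousands), not per epoch. The sketch below repeats the same integer arithmetic in a hypothetical helper, `steps_per_epoch`, and evaluates it for the three flag combinations used in this notebook.

```python
def steps_per_epoch(train_examples_thousands, batch_size, num_epochs):
    """Mirror the arithmetic in task.py and model.py.

    task.py multiplies --train_examples by 1000, then model.py spreads that
    total over num_epochs epochs of batches with batch_size examples each.
    """
    total_examples = train_examples_thousands * 1000
    return total_examples // (batch_size * num_epochs)


# Local smoke test: --train_examples=1 --batch_size=10 --num_epochs=1
print(steps_per_epoch(1, 10, 1))        # 100

# First cloud job: --train_examples=10000 --batch_size=32 --num_epochs=10
print(steps_per_epoch(10000, 32, 10))   # 31250

# Tuned cloud job: --train_examples=10000 --batch_size=20 --num_epochs=10
print(steps_per_epoch(10000, 20, 10))   # 50000
```

The last value matches the `50000/50000` step counter in the training log of the tuned job.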


## Run trainer module package locally

After moving the code to a package, make sure it works standalone. We can run a very small training job over a single file with a batch size of 10, 1 epoch, 1,000 training examples (`--train_examples=1`; the flag is given in thousands), and 1 eval step.

Note that even for this small subset, the run takes about *5 minutes* to finish, with no output in the meantime.

In [7]:
%%bash
OUTDIR=model_trained_locally
rm -rf ${OUTDIR}
export PYTHONPATH=${PYTHONPATH}:${PWD}
python3 -m trainer.task \
    --job-dir=./tmp \
    --train_data_path=gs://predict-babyweight-10142021/datasets_preprocessed_Dataflow/train.csv*  \
    --eval_data_path=gs://predict-babyweight-10142021/datasets_preprocessed_Dataflow/eval.csv*  \
    --output_dir=${OUTDIR} \
    --batch_size=10 \
    --num_epochs=1 \
    --train_examples=1 \
    --eval_steps=1

*** Here is our Wide-and-Deep architecture so far:

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
alcohol_use (InputLayer)        [(None,)]            0                                            
__________________________________________________________________________________________________
cigarette_use (InputLayer)      [(None,)]            0                                            
__________________________________________________________________________________________________
gestation_weeks (InputLayer)    [(None,)]            0                                            
__________________________________________________________________________________________________
is_male (InputLayer)            [(None,)]            0                                            
___________________________________

2021-10-23 06:38:13.282863: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199995000 Hz
2021-10-23 06:38:13.283384: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5626c3c7be40 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-10-23 06:38:13.283444: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-10-23 06:38:13.283641: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
2021-10-23 06:38:19.431031: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Instructions for updating:
This property should not be used in 

## Training on Cloud AI Platform

Now that we see everything is working locally, it's time to train on the cloud! 

To submit to the Cloud we use [`gcloud ai-platform jobs submit training [jobname]`](https://cloud.google.com/sdk/gcloud/reference/ml-engine/jobs/submit/training) and simply specify some additional parameters for AI Platform Training Service:
- `jobname`: A unique identifier for the Cloud job. We usually append the system time to ensure uniqueness.
- `job-dir`: A GCS location for the job's outputs. It is also forwarded to `task.py` as `--job-dir`. When `--staging-bucket` is set, as it is below, the Python package is uploaded there instead.
- `runtime-version`: Version of TF to use.
- `python-version`: Version of Python to use. Currently only Python 3.7 is supported for TF 2.1.
- `region`: Cloud region to train in. See [here](https://cloud.google.com/ml-engine/docs/tensorflow/regions) for supported AI Platform Training Service regions.

Below the `-- \` we add in the arguments for our `task.py` file.
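
The job name has to be unique, which is why the submission cell appends a UTC timestamp using `date`. For readers who script submissions from Python, here is a minimal sketch of the same naming scheme. `make_job_id` is a hypothetical helper, not part of this notebook or of `gcloud`.

```python
from datetime import datetime, timezone


def make_job_id(prefix="babyweight", now=None):
    """Build a job ID like the shell's babyweight_$(date -u +%y%m%d_%H%M%S)."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now:%y%m%d_%H%M%S}"


# A fixed timestamp makes the result reproducible.
fixed = datetime(2021, 10, 23, 6, 41, 39, tzinfo=timezone.utc)
print(make_job_id(now=fixed))  # babyweight_211023_064139
```

Using UTC keeps job IDs ordered consistently no matter where the notebook runs.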

In [8]:
%%bash

OUTDIR=gs://${BUCKET}/model_trained_Dataflow
JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)

gcloud ai-platform jobs submit training ${JOBID} \
    --region=${REGION} \
    --module-name=trainer.task \
    --package-path=$(pwd)/trainer \
    --job-dir=${OUTDIR} \
    --staging-bucket=gs://${BUCKET} \
    --master-machine-type=n1-standard-8 \
    --scale-tier=CUSTOM \
    --runtime-version=${TFVERSION} \
    --python-version=${PYTHONVERSION} \
    -- \
    --train_data_path=gs://predict-babyweight-10142021/datasets_preprocessed_Dataflow/train.csv*  \
    --eval_data_path=gs://predict-babyweight-10142021/datasets_preprocessed_Dataflow/eval.csv*  \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=10000 \
    --eval_steps=100 \
    --batch_size=32 \
    --nembeds=8

jobId: babyweight_211023_064139
state: QUEUED


Job [babyweight_211023_064139] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe babyweight_211023_064139

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs babyweight_211023_064139


The training job should complete within 10 to 15 minutes. Once it's done, we can check the directory structure of the outputs of our trained model.

In [9]:
%%bash
gsutil ls gs://${BUCKET}/model_trained_Dataflow

gs://predict-babyweight-10142021/model_trained_Dataflow/
gs://predict-babyweight-10142021/model_trained_Dataflow/20211023065439/
gs://predict-babyweight-10142021/model_trained_Dataflow/checkpoints/
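
The timestamped directory is the SavedModel export (named with `%Y%m%d%H%M%S` in `model.py`), while `checkpoints/` holds the weights saved during training. If several exports accumulate, the newest one can be picked out of the listing by its name. A minimal sketch, assuming the listing has already been captured as a list of strings; `latest_export_dir` is a hypothetical helper.

```python
import re

# Matches a trailing 14-digit timestamp directory such as .../20211023065439/
EXPORT_DIR = re.compile(r"/(\d{14})/$")


def latest_export_dir(paths):
    """Return the path with the newest timestamped directory, or None."""
    stamped = []
    for path in paths:
        match = EXPORT_DIR.search(path)
        if match:
            stamped.append((match.group(1), path))
    # Fixed-width timestamps sort chronologically as plain strings.
    return max(stamped)[1] if stamped else None


listing = [
    "gs://predict-babyweight-10142021/model_trained_Dataflow/",
    "gs://predict-babyweight-10142021/model_trained_Dataflow/20211023065439/",
    "gs://predict-babyweight-10142021/model_trained_Dataflow/checkpoints/",
]
print(latest_export_dir(listing))
```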


## Hyperparameter tuning
To do hyperparameter tuning, create `hyperparam/hyperparam.yaml` and pass it with `--config=hyperparam/hyperparam.yaml` when submitting a training job on Cloud AI Platform.

In [10]:
%%writefile hyperparam/hyperparam.yaml
trainingInput:
    scaleTier: STANDARD_1
    hyperparameters:
        hyperparameterMetricTag: rmse
        goal: MINIMIZE
        maxTrials: 20
        maxParallelTrials: 5
        enableTrialEarlyStopping: True
        params:
        - parameterName: batch_size
          type: INTEGER
          minValue: 8
          maxValue: 512
          scaleType: UNIT_LOG_SCALE
        - parameterName: nembeds
          type: INTEGER
          minValue: 3
          maxValue: 30
          scaleType: UNIT_LINEAR_SCALE
        - parameterName: nnsize
          type: INTEGER
          minValue: 64
          maxValue: 512
          scaleType: UNIT_LOG_SCALE

Overwriting hyperparam.yaml
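
The two `scaleType` values used above differ in how a parameter's range is stretched before the tuner searches it. The sketch below only illustrates the idea, assuming a position `u` between 0 and 1 inside the range; it is not the service's actual search algorithm, and both function names are hypothetical.

```python
import math


def linear_scale(u, min_value, max_value):
    """Map u in [0, 1] onto the range with equal spacing."""
    return min_value + u * (max_value - min_value)


def log_scale(u, min_value, max_value):
    """Map u in [0, 1] onto the range with equal spacing in log space."""
    return min_value * (max_value / min_value) ** u


# batch_size ranges over 8..512: the log-scale midpoint is 64, while a
# linear midpoint would be 260.
print(log_scale(0.5, 8, 512))
print(linear_scale(0.5, 8, 512))

# nembeds ranges over 3..30 on a linear scale.
print(linear_scale(0.5, 3, 30))
```

On a log scale, half of the search range for `batch_size` lies between 8 and 64, which suits parameters whose effect depends on their order of magnitude.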


In [11]:
%%bash

OUTDIR=gs://${BUCKET}/hyperparam
JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBID
gsutil -m rm -rf $OUTDIR

gcloud ai-platform jobs submit training ${JOBID} \
    --region=${REGION} \
    --module-name=trainer.task \
    --package-path=$(pwd)/trainer \
    --job-dir=${OUTDIR} \
    --staging-bucket=gs://${BUCKET} \
    --master-machine-type=n1-standard-8 \
    --scale-tier=CUSTOM \
    --config=hyperparam/hyperparam.yaml \
    --runtime-version=${TFVERSION} \
    --python-version=${PYTHONVERSION} \
    -- \
    --train_data_path=gs://predict-babyweight-10142021/datasets_preprocessed_Dataflow/train.csv*  \
    --eval_data_path=gs://predict-babyweight-10142021/datasets_preprocessed_Dataflow/eval.csv*  \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=10000 \
    --eval_steps=100

gs://predict-babyweight-10142021/hyperparam us-central1 babyweight_211023_071146
jobId: babyweight_211023_071146
state: QUEUED


CommandException: 1 files/objects could not be removed.
Job [babyweight_211023_071146] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe babyweight_211023_071146

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs babyweight_211023_071146
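
Each trial of the tuning job writes to its own subdirectory, because `task.py` appends the trial ID found in the `TF_CONFIG` environment variable to `output_dir`. The sketch below isolates that logic in a hypothetical function, `trial_output_dir`. It assumes the trial ID arrives as a string, and it uses `posixpath` so the result is the same on any operating system (`task.py` itself uses `os.path.join`, which behaves the same way on Linux). The bucket name is a placeholder.

```python
import json
import posixpath


def trial_output_dir(output_dir, tf_config_json):
    """Append the trial ID from a TF_CONFIG JSON string, if there is one."""
    config = json.loads(tf_config_json or "{}")
    trial = config.get("task", {}).get("trial", "")
    return posixpath.join(output_dir, trial)


base = "gs://my-bucket/hyperparam"  # placeholder bucket name

# During tuning (assumed shape of TF_CONFIG): one directory per trial.
print(trial_output_dir(base, '{"task": {"trial": "3"}}'))
# gs://my-bucket/hyperparam/3

# Outside tuning there is no trial key, so only a trailing slash is added.
print(trial_output_dir(base, "{}"))
# gs://my-bucket/hyperparam/
```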


## Repeat training with the new hyperparameters found from the tuning

The tuning job completed after about 1 hour 20 minutes. The job output reports the best-performing trial as follows:

    finalMetric:
      objectiveValue: 1.0123
      trainingStep: '10'
    hyperparameters:
      batch_size: '20'
      nembeds: '27'
      nnsize: '205'
      
Let's use these new hyperparameter values for the final training.
      

In [13]:
%%bash

OUTDIR=gs://${BUCKET}/model_trained_Dataflow_tuned
JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)

gcloud ai-platform jobs submit training ${JOBID} \
    --region=${REGION} \
    --module-name=trainer.task \
    --package-path=$(pwd)/trainer \
    --job-dir=${OUTDIR} \
    --staging-bucket=gs://${BUCKET} \
    --master-machine-type=n1-standard-8 \
    --scale-tier=CUSTOM \
    --runtime-version=${TFVERSION} \
    --python-version=${PYTHONVERSION} \
    -- \
    --train_data_path=gs://predict-babyweight-10142021/datasets_preprocessed_Dataflow/train.csv*  \
    --eval_data_path=gs://predict-babyweight-10142021/datasets_preprocessed_Dataflow/eval.csv*  \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=10000 \
    --eval_steps=100 \
    --batch_size=20 \
    --nembeds=27 \
    --nnsize=205

jobId: babyweight_211023_143616
state: QUEUED


Job [babyweight_211023_143616] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe babyweight_211023_143616

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs babyweight_211023_143616


Again, once the job is done (after about 30 minutes), we can check the directory structure of the outputs of our final trained model.

In [14]:
%%bash
gsutil ls gs://${BUCKET}/model_trained_Dataflow_tuned

gs://predict-babyweight-10142021/model_trained_Dataflow_tuned/
gs://predict-babyweight-10142021/model_trained_Dataflow_tuned/20211023150249/
gs://predict-babyweight-10142021/model_trained_Dataflow_tuned/checkpoints/


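The log entry below, from the tuned job `babyweight_211023_143616`, shows the training and validation metrics at the end of training:

{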
  "insertId": "7824ocg40s6z9o",
  "jsonPayload": {
    "levelname": "INFO",
    "created": 1635001368.7409413,
    "message": "50000/50000 - 154s - loss: 0.9519 - mse: 0.9519 - rmse: 0.9589 - r_squared: -3.2733e-02 - val_loss: 0.9253 - val_mse: 0.9253 - val_rmse: 0.9616 - val_r_squared: 0.0170",
    "lineno": 328,
    "pathname": "/runcloudml.py"
  },
  "resource": {
    "type": "ml_job",
    "labels": {
      "project_id": "predict-babyweight-10142021",
      "job_id": "babyweight_211023_143616",
      "task_name": "master-replica-0"
    }
  },
  "timestamp": "2021-10-23T15:02:48.740941285Z",
  "severity": "INFO",
  "labels": {
    "compute.googleapis.com/resource_name": "cmle-training-741771706156483001",
    "ml.googleapis.com/trial_type": "",
    "ml.googleapis.com/job_id/log_area": "root",
    "compute.googleapis.com/zone": "us-central1-c",
    "ml.googleapis.com/trial_id": "",
    "compute.googleapis.com/resource_id": "7947812095131230988"
  },
  "logName": "projects/predict-babyweight-10142021/logs/master-replica-0",
  "receiveTimestamp": "2021-10-23T15:02:50.415808996Z"
}
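
Two things in that log line can look odd: the logged `rmse` (0.9589) is not the square root of the logged `mse` (0.9519, whose square root is about 0.9757), and the training `r_squared` is slightly negative. A likely explanation, stated here as an assumption about how Keras reports function-based metrics, is that `rmse` and `r_squared` are computed per batch and then averaged. The sketch below reproduces both effects in plain Python on two tiny made-up batches.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of one batch."""
    squared = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """R^2 of one batch: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    residual = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    total = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - residual / total


# Made-up (labels, predictions) pairs, in pounds.
batch_a = ([7.0, 8.0], [8.0, 9.0])  # every prediction is off by 1
batch_b = ([6.0, 9.0], [9.0, 6.0])  # every prediction is off by 3

mean_of_batch_rmse = (rmse(*batch_a) + rmse(*batch_b)) / 2
overall_rmse = rmse(batch_a[0] + batch_b[0], batch_a[1] + batch_b[1])

print(mean_of_batch_rmse)   # 2.0
print(overall_rmse)         # sqrt(5), about 2.236
print(r_squared(*batch_b))  # -3.0
```

The average of per-batch RMSE values can never exceed the RMSE over all examples, and on a batch of only 20 examples R^2 turns negative whenever the predictions fit worse than the batch mean would.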

## Lab Summary
In this notebook, we set up the environment, created the trainer module's `task.py` to hold the hyperparameter argument-parsing code, created the trainer module's `model.py` to hold the Keras model code, ran the trainer package locally, submitted a training job to Cloud AI Platform, submitted a hyperparameter tuning job, and retrained the model with the tuned hyperparameters.