# LAB 5a:  Training Keras model on Cloud AI Platform.

**Learning Objectives**

1. Set up the environment
1. Create trainer module's task.py to hold hyperparameter argparsing code
1. Create trainer module's model.py to hold Keras model code
1. Run trainer module package locally
1. Submit training job to Cloud AI Platform
1. Submit hyperparameter tuning job to Cloud AI Platform


## Introduction
After having tested our training pipeline both locally and in the cloud on a subset of the data, we can submit another (much larger) training job to the cloud. It is also a good idea to run a hyperparameter tuning job to make sure we have optimized the hyperparameters of our model.

In this notebook, we'll be training our Keras model at scale using Cloud AI Platform.

In this lab, we will set up the environment, create the trainer module's task.py to hold hyperparameter argparsing code, create the trainer module's model.py to hold Keras model code, run the trainer module package locally, submit a training job to Cloud AI Platform, and submit a hyperparameter tuning job to Cloud AI Platform.

Each learning objective will correspond to a __#TODO__ in this student lab notebook -- try to complete this notebook first and then review the [solution notebook](../solutions/5a_train_keras_ai_platform_babyweight.ipynb).

## Set up environment variables and load necessary libraries

Import necessary libraries.

In [1]:
import os

### Lab Task #1: Set environment variables.

Set environment variables so that we can use them throughout the entire lab. We will be using our project name for our bucket, so you only need to change your project and region.

In [7]:
%%bash
export PROJECT=$(gcloud config list project --format "value(core.project)")
echo "Your current GCP Project Name is: "${PROJECT}

Your current GCP Project Name is: qwiklabs-gcp-00-0db9b1bc58c6


In [11]:
# TODO: Change these to try this notebook out
PROJECT = 'qwiklabs-gcp-00-0db9b1bc58c6'  # Replace with your PROJECT
BUCKET = PROJECT  # defaults to PROJECT
REGION = "us-central1"  # Replace with your REGION

print(PROJECT, BUCKET, REGION)

qwiklabs-gcp-00-0db9b1bc58c6 qwiklabs-gcp-00-0db9b1bc58c6 us-central1


In [12]:
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = "2.1"
os.environ["PYTHONVERSION"] = "3.7"
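
The values are copied into `os.environ` because `%%bash` cells run in a child process, which inherits the notebook kernel's environment. A minimal sketch of that inheritance, using a hypothetical variable name `DEMO_REGION` and a Python child process in place of bash so that it runs without a shell:

```python
import os
import subprocess
import sys

# DEMO_REGION is a hypothetical name used only for this illustration.
os.environ["DEMO_REGION"] = "us-central1"

# Child processes inherit the parent's environment by default.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['DEMO_REGION'])"],
    capture_output=True,
    text=True,
    check=True,
)
inherited_value = child.stdout.strip()
print(inherited_value)
```

Plain Python variables such as `PROJECT` are not visible to `%%bash` cells, which is why the lab sets both.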

In [13]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


In [14]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}; then
    gsutil mb -l ${REGION} gs://${BUCKET}
fi

## Check data exists

Verify that you previously created the CSV files we'll be using for training and evaluation. If not, go back to lab [1b_prepare_data_babyweight.ipynb](../solutions/1b_prepare_data_babyweight.ipynb) to create them.

In [15]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/data/*000000000000.csv

gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/data/eval000000000000.csv
gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/data/train000000000000.csv


Now that we have the [Keras wide-and-deep code](../solutions/4c_keras_wide_and_deep_babyweight.ipynb) working on a subset of the data, we can package the TensorFlow code up as a Python module and train it on Cloud AI Platform.

## Train on Cloud AI Platform

Training on Cloud AI Platform requires:
* Making the code a Python package
* Using gcloud to submit the training code to [Cloud AI Platform](https://console.cloud.google.com/ai-platform)

**Ensure that the Cloud AI Platform API is enabled by going to this [link](https://console.developers.google.com/apis/library/ml.googleapis.com).**

### Move code into a Python package

A Python package is simply a collection of one or more `.py` files along with an `__init__.py` file to identify the containing directory as a package. The `__init__.py` sometimes contains initialization code but for our purposes an empty file suffices.

The bash command `mkdir -p` creates the `babyweight/trainer` directory if it does not already exist, and `touch` creates an empty file in the specified location.
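
To make the package idea concrete, here is a minimal sketch that builds a throwaway package in a temporary directory and imports a module from it. The names `demo_pkg`, `hello`, and `greet` are hypothetical and are not part of the lab's trainer package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package in a temporary directory.
# "demo_pkg", "hello" and "greet" are hypothetical names for illustration only.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
(pkg_dir / "__init__.py").touch()           # same effect as `touch`
(pkg_dir / "hello.py").write_text("def greet():\n    return 'hello'\n")

# Putting the parent directory on sys.path plays the role of PYTHONPATH.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
hello = importlib.import_module("demo_pkg.hello")
greeting = hello.greet()
print(greeting)
```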

In [16]:
%%bash
mkdir -p babyweight/trainer
touch babyweight/trainer/__init__.py

We then use the `%%writefile` magic to write the contents of the cell below to a file called `task.py` in the `babyweight/trainer` folder.

### Lab Task #2: Create trainer module's task.py to hold hyperparameter argparsing code.

The cell below writes the file `babyweight/trainer/task.py`, which sets up our training job. Here is where we determine which parameters of our model to pass as flags during training, using the `argparse` module. Look at how `batch_size` is passed to the model in the code below. Use this as an example to parse arguments for the following variables:
- `nnsize` which represents the hidden layer sizes to use for DNN feature columns
- `nembeds` which represents the embedding size of a cross of n key real-valued parameters
- `train_examples` which represents the number of examples (in thousands) to run the training job
- `eval_steps` which represents the positive number of steps for which to evaluate model

Be sure to include a default value for the parsed arguments above and specify the `type` if necessary.
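
As a reference for how defaults, `type`, and list-valued flags behave, here is a small standalone sketch. It passes an explicit argument list to `parse_args` instead of reading the command line, so it can be run in any cell.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=512)
# nargs="+" collects one or more space-separated values into a list.
parser.add_argument("--nnsize", nargs="+", type=int, default=[32, 8])

# An explicit list is parsed here instead of sys.argv to keep the example self-contained.
defaults = parser.parse_args([])
custom = parser.parse_args(["--batch_size", "32", "--nnsize", "64", "16", "4"])
print(defaults.batch_size, defaults.nnsize)
print(custom.batch_size, custom.nnsize)
```

Without `type=int`, the parsed values would be strings, which would break arithmetic such as the `train_examples` scaling in `task.py`.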

In [41]:
%%writefile babyweight/trainer/task.py
import argparse
import json
import os

from trainer import model

import tensorflow as tf

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--job-dir",
        help="this model ignores this field, but it is required by gcloud",
        default="junk"
    )
    parser.add_argument(
        "--train_data_path",
        help="GCS location of training data",
        required=True
    )
    parser.add_argument(
        "--eval_data_path",
        help="GCS location of evaluation data",
        required=True
    )
    parser.add_argument(
        "--output_dir",
        help="GCS location to write checkpoints and export models",
        required=True
    )
    parser.add_argument(
        "--batch_size",
        help="Number of examples to compute gradient over.",
        type=int,
        default=512
    )

    # TODO: Add nnsize argument
    parser.add_argument(
        "--nnsize",
        help="Hidden layer sizes (provide space-separated sizes)",
        nargs="+",
        type=int,
        default=[32, 8]
    )

    # TODO: Add nembeds argument
    parser.add_argument(
        "--nembeds",
        help="Embedding size of a cross of n key real-valued parameters",
        type=int,
        default=8
    )

    # TODO: Add num_epochs argument
    parser.add_argument(
        "--num_epochs",
        help="Number of epochs to train the model",
        type=int,
        default=1
    )

    # TODO: Add train_examples argument
    parser.add_argument(
        "--train_examples",
        help="Number of examples (in thousands) to run the training job over",
        type=int,
        default=1
    )

    # TODO: Add eval_steps argument
    parser.add_argument(
        "--eval_steps",
        help="Positive number of steps for which to evaluate the model",
        type=int,
        default=8
    )

    # Parse all arguments
    args = parser.parse_args()
    arguments = args.__dict__

    # Unused args provided by service
    arguments.pop("job_dir", None)
    arguments.pop("job-dir", None)

    # Modify some arguments
    arguments["train_examples"] *= 1000

    # Append trial_id to path if we are doing hptuning
    # This code can be removed if you are not using hyperparameter tuning
    arguments["output_dir"] = os.path.join(
        arguments["output_dir"],
        json.loads(
            os.environ.get("TF_CONFIG", "{}")
        ).get("task", {}).get("trial", "")
    )

    # Run the training job
    model.train_and_evaluate(arguments)

Overwriting babyweight/trainer/task.py


In the same way we can write to the file `model.py` the model that we developed in the previous notebooks. 

### Lab Task #3: Create trainer module's model.py to hold Keras model code.

Complete the TODOs in the code cell below to create our `model.py`. We'll use the code we wrote for the Wide & Deep model. Look back at your [4c_keras_wide_and_deep_babyweight](../solutions/4c_keras_wide_and_deep_babyweight.ipynb) notebook and copy/paste the necessary code from that notebook into its place in the cell below.
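
As a reminder of what `features_and_labels` has to do, here is a sketch on a plain Python dict. The real function receives a dict of tensors from `tf.data`; the helper name and the row values below are made up for illustration.

```python
LABEL_COLUMN = "weight_pounds"


def split_features_and_label(row):
    """Remove the label from a copy of the row and return (features, label)."""
    features = dict(row)  # copy so the caller's dict is left untouched
    label = features.pop(LABEL_COLUMN)
    return features, label


# Hypothetical row, not taken from the dataset.
row = {
    "weight_pounds": 7.5,
    "is_male": "true",
    "mother_age": 28.0,
    "plurality": "Single(1)",
    "gestation_weeks": 39.0,
}
features, label = split_features_and_label(row)
print(label, sorted(features))
```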

In [None]:
%%writefile babyweight/trainer/model.py
import datetime
import os
import shutil
import numpy as np
import tensorflow as tf
import hypertune

# Determine CSV, label, and key columns
# TODO: Add CSV_COLUMNS and LABEL_COLUMN
CSV_COLUMNS = ["weight_pounds", "is_male", "mother_age", "plurality", "gestation_weeks"]
LABEL_COLUMN = 'weight_pounds'
NUMERIC_COLUMNS = ["mother_age", "gestation_weeks"]
STRING_COLUMNS = ["is_male", "plurality"]
VOC_IS_MALE = ['true', 'false', 'Unknown']
VOC_PLURALITY = ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)', 'Multiple(2+)']

# Set default values for each CSV column.
# Treat is_male and plurality as strings.
# TODO: Add DEFAULTS
DEFAULTS = [[0.], ['null'], [0.], ['null'], [0.]]


def features_and_labels(row_data):
    # TODO: Add your code here
    label = row_data.pop(LABEL_COLUMN)
    feature = row_data
    return feature, label


def load_dataset(pattern, batch_size=1, mode='eval'):
    # TODO: Add your code here
    ds = tf.data.experimental.make_csv_dataset(pattern, batch_size, CSV_COLUMNS, DEFAULTS)
    ds = ds.map(features_and_labels).cache()
    
    if mode == 'train':
        ds = ds.shuffle(buffer_size=1000).repeat()
    ds = ds.prefetch(buffer_size=1)
    return ds


def create_input_layers():
    # TODO: Add your code here
    inputs = {
        colname: tf.keras.Input(name=colname, shape=(), dtype='float32') for colname in NUMERIC_COLUMNS 
    }
    inputs.update({
        colname: tf.keras.Input(name=colname, shape=(), dtype='string') for colname in STRING_COLUMNS 
    })
    return inputs


def categorical_fc(name, values):
    # TODO: Add your code here
    cat = tf.feature_column.categorical_column_with_vocabulary_list(name, values)
    return tf.feature_column.indicator_column(cat)


def create_feature_columns(nembeds):
    # TODO: Add your code here
    deep_fc = {
        colname: tf.feature_column.numeric_column(colname) for colname in NUMERIC_COLUMNS
    }
    
    wide_fc = {}
    colname = 'is_male'
    wide_fc[colname] = categorical_fc(colname, VOC_IS_MALE)
    colname = 'plurality'
    wide_fc[colname] = categorical_fc(colname, VOC_PLURALITY)
    
    age_bkt = tf.feature_column.bucketized_column(deep_fc['mother_age'], 
                                                  boundaries=np.arange(15, 45, 1).tolist()
                                                 )
    wide_fc['age_bkt'] = tf.feature_column.indicator_column(age_bkt)
    gestation_bkt = tf.feature_column.bucketized_column(deep_fc['gestation_weeks'], boundaries=np.arange(17, 47, 1).tolist())
    wide_fc['gestation_bkt'] = tf.feature_column.indicator_column(gestation_bkt)
    
    crossed = tf.feature_column.crossed_column([age_bkt, gestation_bkt], 1000)
    deep_fc['crossed'] = tf.feature_column.embedding_column(crossed, dimension=nembeds)

    return wide_fc, deep_fc


def get_model_outputs(wide_inputs, deep_inputs, dnn_hidden_units):
    # TODO: Add your code here
    deep = deep_inputs
    for i, units in enumerate(dnn_hidden_units):
        deep = tf.keras.layers.Dense(units, name=f'deep_h{i+1}', activation='relu')(deep)
        
    wide = tf.keras.layers.Dense(10, name='wide_h', activation='relu')(wide_inputs)
    
    both = tf.keras.layers.concatenate(inputs=[deep, wide], name="both")
        
    out = tf.keras.layers.Dense(1, name='output', activation='linear')(both)
    
    return out


def rmse(y_true, y_pred):
    # TODO: Add your code here
    return tf.sqrt(tf.reduce_mean((y_true - y_pred)**2))


def build_wide_deep_model(dnn_hidden_units=[64, 32], nembeds=3):
    # TODO: Add your code here
    inputs = create_input_layers()
    wide_fc, deep_fc = create_feature_columns(nembeds)
    
    wide_inputs = tf.keras.layers.DenseFeatures(
        feature_columns=wide_fc.values(), name="wide_inputs")(inputs)
    deep_inputs = tf.keras.layers.DenseFeatures(
        feature_columns=deep_fc.values(), name="deep_inputs")(inputs)

    output = get_model_outputs(wide_inputs, deep_inputs, dnn_hidden_units)
    
    model = tf.keras.models.Model(inputs=inputs, outputs=output)

    model.compile(loss='mse', optimizer='adam', metrics=[rmse, 'mse'])
    
    return model

def build_summary():
    model = build_wide_deep_model()
    model.summary()

def train_and_evaluate(args):
    model = build_wide_deep_model(args["nnsize"], args["nembeds"])
    print("Here is our Wide-and-Deep architecture so far:\n")
    print(model.summary())

    trainds = load_dataset(
        args["train_data_path"],
        args["batch_size"],
        'train')

    evalds = load_dataset(
        args["eval_data_path"], 1000, 'eval')
    if args["eval_steps"]:
        evalds = evalds.take(count=args["eval_steps"])

    num_batches = args["batch_size"] * args["num_epochs"]
    steps_per_epoch = args["train_examples"] // num_batches

    checkpoint_path = os.path.join(args["output_dir"], "checkpoints/babyweight")
    cp_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_path, verbose=1, save_weights_only=True)

    history = model.fit(
        trainds,
        validation_data=evalds,
        epochs=args["num_epochs"],
        steps_per_epoch=steps_per_epoch,
        verbose=2,  # 0=silent, 1=progress bar, 2=one line per epoch
        callbacks=[cp_callback])
    
    hp_metric = history.history['val_rmse'][-1]

    hptune = hypertune.HyperTune()
    hptune.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag='rmse', 
        metric_value=hp_metric,
        global_step=args['num_epochs']
    )

    EXPORT_PATH = os.path.join(
        args["output_dir"], datetime.datetime.now().strftime("%Y%m%d%H%M%S"))
    tf.saved_model.save(
        obj=model, export_dir=EXPORT_PATH)  # with default serving function
    print("Exported trained model to {}".format(EXPORT_PATH))

## Train locally

After moving the code to a package, make sure it works as a standalone. Note, we incorporated the `--train_examples` flag so that we don't try to train on the entire dataset while we are developing our pipeline. Once we are sure that everything is working on a subset, we can change it so that we can train on all the data. Even for this subset, this takes about *3 minutes* in which you won't see any output ...

### Lab Task #4: Run trainer module package locally.

Fill in the missing code in the TODOs below so that we can run a very small training job over a single file with a small batch size, 1 epoch, `--train_examples=1` (that is, 1,000 examples, because `task.py` multiplies the value by 1000), and a small number of eval steps.
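
To see how the flags translate into training steps, the following sketch mirrors the arithmetic in `task.py` and `model.py` above: `train_examples` is multiplied by 1000, then divided by `batch_size * num_epochs`. The function name is ours and is not part of the trainer package.

```python
def steps_per_epoch(train_examples_thousands, batch_size, num_epochs):
    """Mirror the step arithmetic used by the trainer package."""
    train_examples = train_examples_thousands * 1000  # done in task.py
    num_batches = batch_size * num_epochs              # done in model.py
    return train_examples // num_batches


local_steps = steps_per_epoch(1, batch_size=32, num_epochs=1)
print(local_steps)
```

Because the division is an integer division, any remainder of examples that does not fill a whole batch is dropped.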

In [65]:
%%bash
OUTDIR=babyweight_trained
rm -rf ${OUTDIR}
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight
python3 -m trainer.task \
    --job-dir=./tmp \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --batch_size=32 \
    --num_epochs=1 \
    --train_examples=1 \
    --eval_steps=32

Here is our Wide-and-Deep architecture so far:

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
gestation_weeks (InputLayer)    [(None,)]            0                                            
__________________________________________________________________________________________________
is_male (InputLayer)            [(None,)]            0                                            
__________________________________________________________________________________________________
mother_age (InputLayer)         [(None,)]            0                                            
__________________________________________________________________________________________________
plurality (InputLayer)          [(None,)]            0                                            
______________________________________________

2021-06-02 01:37:21.615038: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-06-02 01:37:21.615171: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-06-02 01:37:21.615189: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use t

## Training on Cloud AI Platform

Now that we see everything is working locally, it's time to train on the cloud! 

To submit to the Cloud we use [`gcloud ai-platform jobs submit training [jobname]`](https://cloud.google.com/sdk/gcloud/reference/ml-engine/jobs/submit/training) and simply specify some additional parameters for AI Platform Training Service:
- jobname: A unique identifier for the Cloud job. We usually append system time to ensure uniqueness
- job-dir: A GCS location to upload the Python package to
- runtime-version: Version of TF to use.
- python-version: Version of Python to use. Currently only Python 3.7 is supported for TF 2.1.
- region: Cloud region to train in. See [here](https://cloud.google.com/ml-engine/docs/tensorflow/regions) for supported AI Platform Training Service regions

Below the `-- \` we add in the arguments for our `task.py` file.
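
The job name in the cell below is built by the shell with `date -u`. For reference, here is a Python sketch that produces a name in the same `babyweight_YYMMDD_HHMMSS` pattern. The helper name `make_job_id` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Pattern of the names produced by `babyweight_$(date -u +%y%m%d_%H%M%S)`.
JOB_ID_PATTERN = re.compile(r"^babyweight_\d{6}_\d{6}$")


def make_job_id(prefix="babyweight"):
    """Append a UTC timestamp to the prefix (hypothetical helper)."""
    stamp = datetime.now(timezone.utc).strftime("%y%m%d_%H%M%S")
    return f"{prefix}_{stamp}"


job_id = make_job_id()
print(job_id, bool(JOB_ID_PATTERN.match(job_id)))
```

Note that the timestamp has one-second resolution, so two jobs submitted within the same second would get the same name.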

In [66]:
%%bash

OUTDIR=gs://${BUCKET}/babyweight/trained_model
JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)

gcloud ai-platform jobs submit training ${JOBID} \
    --region=${REGION} \
    --module-name=trainer.task \
    --package-path=$(pwd)/babyweight/trainer \
    --job-dir=${OUTDIR} \
    --staging-bucket=gs://${BUCKET} \
    --master-machine-type=n1-standard-8 \
    --scale-tier=CUSTOM \
    --runtime-version=${TFVERSION} \
    --python-version=${PYTHONVERSION} \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=10000 \
    --eval_steps=100 \
    --batch_size=32 \
    --nembeds=8

jobId: babyweight_210602_013729
state: QUEUED


Job [babyweight_210602_013729] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe babyweight_210602_013729

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs babyweight_210602_013729


The training job should complete within 10 to 15 minutes. You do not need to wait for this training job to finish before moving forward in the notebook, but you will need a trained model to complete our next lab.

## Dockerized module

Since we are using TensorFlow 2.3 and it is new, we will use a container image to run the code on AI Platform.

Once TensorFlow 2.3 is natively supported on AI Platform, you will be able to simply do (without having to build a container):
<pre>
gcloud ai-platform jobs submit training ${JOBNAME} \
    --region=${REGION} \
    --module-name=trainer.task \
    --package-path=$(pwd)/babyweight/trainer \
    --job-dir=${OUTDIR} \
    --staging-bucket=gs://${BUCKET} \
    --scale-tier=STANDARD_1 \
    --runtime-version=${TFVERSION} \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=20000 \
    --eval_steps=100 \
    --batch_size=32 \
    --nembeds=8
</pre>

### Create Dockerfile

We need to create a container with everything we need to be able to run our model. This includes our trainer module package, python3, and the libraries we use, such as TensorFlow (pinned to 2.1 in the Dockerfile below) and `cloudml-hypertune`.

In [67]:
%%writefile babyweight/Dockerfile
FROM gcr.io/deeplearning-platform-release/tf2-cpu
COPY trainer /babyweight/trainer
RUN apt update && \
    apt install --yes python3-pip && \
    pip3 install --upgrade --quiet tensorflow==2.1 && \
    pip3 install --upgrade --quiet cloudml-hypertune

ENV PYTHONPATH ${PYTHONPATH}:/babyweight
ENTRYPOINT ["python3", "babyweight/trainer/task.py"]

Overwriting babyweight/Dockerfile


### Build and push container image to repo

Now that we have created our Dockerfile, we need to build and push our container image to our project's container repo. To do this, we'll create a small shell script that we can call from bash.

In [68]:
%%writefile babyweight/push_docker.sh
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=babyweight_training_container
export IMAGE_URI=gcr.io/${PROJECT_ID}/${IMAGE_REPO_NAME}

echo "Building  $IMAGE_URI"
docker build -f Dockerfile -t ${IMAGE_URI} ./
echo "Pushing $IMAGE_URI"
docker push ${IMAGE_URI}

Overwriting babyweight/push_docker.sh


**Note:** If you get a permissions/stat error when running push_docker.sh from Notebooks, do it from CloudShell:

Open CloudShell on the GCP Console
* git clone https://github.com/GoogleCloudPlatform/training-data-analyst
* cd training-data-analyst/courses/machine_learning/deepdive2/structured/solutions/babyweight
* bash push_docker.sh

This step takes 5-10 minutes to run.

In [73]:
%%bash
cd babyweight
bash push_docker.sh

Building  gcr.io/qwiklabs-gcp-00-0db9b1bc58c6/babyweight_training_container
Sending build context to Docker daemon  23.55kB
Step 1/5 : FROM gcr.io/deeplearning-platform-release/tf2-cpu
 ---> 4412d1fc4180
Step 2/5 : COPY trainer /babyweight/trainer
 ---> Using cache
 ---> b7e383fa5b58
Step 3/5 : RUN apt update &&     apt install --yes python3-pip &&     pip3 install --upgrade --quiet tensorflow==2.1 &&     pip3 install --upgrade --quiet cloudml-hypertune
 ---> Running in 7c98df809936
Get:1 http://packages.cloud.google.com/apt gcsfuse-bionic InRelease [5394 B]
Get:2 http://archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:4 http://packages.cloud.google.com/apt cloud-sdk-bionic InRelease [6780 B]
Get:5 http://packages.cloud.google.com/apt gcsfuse-bionic/main amd64 Packages [2649 B]
Get:6 http://packages.cloud.google.com/apt cloud-sdk-bionic/main amd64 Packages [185 kB]
Get:7 http://security.ubuntu

Kindly ignore the incompatibility errors.

### Test container locally

Before we submit our training job to Cloud AI Platform, let's make sure our container that we just built and pushed to our project's container repo works perfectly. We can do that by calling our container in bash and passing the necessary user_args for our task.py's parser.

In [74]:
%%bash
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=babyweight_training_container
export IMAGE_URI=gcr.io/${PROJECT_ID}/${IMAGE_REPO_NAME}
echo "Running  $IMAGE_URI"
docker run ${IMAGE_URI} \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=gs://${BUCKET}/babyweight/trained_model \
    --batch_size=10 \
    --num_epochs=10 \
    --train_examples=1 \
    --eval_steps=1

Running  gcr.io/qwiklabs-gcp-00-0db9b1bc58c6/babyweight_training_container
Here is our Wide-and-Deep architecture so far:

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
gestation_weeks (InputLayer)    [(None,)]            0                                            
__________________________________________________________________________________________________
is_male (InputLayer)            [(None,)]            0                                            
__________________________________________________________________________________________________
mother_age (InputLayer)         [(None,)]            0                                            
__________________________________________________________________________________________________
plurality (InputLayer)          [(None,)]            0                

2021-06-02 01:42:25.609124: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2021-06-02 01:42:25.609343: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2021-06-02 01:42:25.609367: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
2021-06-02 01:42:28.605868: W ten

## Lab Task #5: Train on Cloud AI Platform.

Once the code works in standalone mode, you can run it on Cloud AI Platform. Because this is on the entire dataset, it will take a while. The training run took about <b> two hours </b> for me. You can monitor the job from the GCP console in the Cloud AI Platform section. Complete the __#TODO__s to make sure you have the necessary user_args for our task.py's parser.

In [78]:
!gsutil ls gs://{BUCKET}/babyweight/data

gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/data/eval000000000000.csv
gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/data/eval000000000001.csv
gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/data/train000000000000.csv
gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/data/train000000000001.csv
gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/data/train000000000002.csv
gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/data/train000000000003.csv
gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/data/train000000000004.csv
gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/data/train000000000005.csv


In [79]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model
JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBID}
# gsutil -m rm -rf ${OUTDIR}

IMAGE=gcr.io/${PROJECT}/babyweight_training_container

gcloud ai-platform jobs submit training ${JOBID} \
    --staging-bucket=gs://${BUCKET} \
    --region=${REGION} \
    --master-image-uri=${IMAGE} \
    --master-machine-type=n1-standard-4 \
    --scale-tier=CUSTOM \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=2000000 \
    --eval_steps=32 \
    --batch_size=32 \
    --nembeds=32

gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/trained_model us-central1
jobId: babyweight_210602_014638
state: QUEUED


Job [babyweight_210602_014638] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe babyweight_210602_014638

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs babyweight_210602_014638


In [84]:
!gcloud ai-platform jobs stream-logs babyweight_210602_014638

INFO	2021-06-02 01:46:39 +0000	service		Validating job requirements...
INFO	2021-06-02 01:46:39 +0000	service		Job creation request has been successfully validated.
INFO	2021-06-02 01:46:40 +0000	service		Waiting for job to be provisioned.
INFO	2021-06-02 01:46:40 +0000	service		Job babyweight_210602_014638 is queued.
INFO	2021-06-02 01:46:41 +0000	service		Waiting for training program to start.
INFO	2021-06-02 01:47:07 +0000	master-replica-0		
INFO	2021-06-02 01:47:07 +0000	master-replica-0		
INFO	2021-06-02 01:47:07 +0000	master-replica-0		
INFO	2021-06-02 01:47:07 +0000	master-replica-0		
INFO	2021-06-02 01:47:07 +0000	master-replica-0		
ERROR	2021-06-02 01:51:10 +0000	master-replica-0		2021-06-02 01:51:10.064896: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
ERROR	2021-06-02 01:51:10 +0000	master-replica-0		2021-06-02 01:51:10.06515

When I ran it, I used train_examples=2000000. When training finished, I filtered in the Stackdriver log on the word "dict" and saw that the last line was:
<pre>
Saving dict for global step 5714290: average_loss = 1.06473, global_step = 5714290, loss = 34882.4, rmse = 1.03186
</pre>
The final RMSE was 1.03 pounds.
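
The `rmse` value in that log line is consistent with being the square root of `average_loss` (the mean squared error). A quick check of that relationship, plus a plain-Python version of the `rmse` metric defined in `model.py`. The sample values are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error over two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Square root of the average_loss reported in the log line above.
rmse_from_log = math.sqrt(1.06473)
print(round(rmse_from_log, 5))

# Hypothetical weights in pounds, only to exercise the function.
sample_rmse = rmse([7.0, 8.0, 6.5], [7.5, 7.0, 6.5])
print(sample_rmse)
```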

## Lab Task #6: Hyperparameter tuning.

All of these are command-line parameters to my program.  To do hyperparameter tuning, create `hyperparam.yaml` and pass it as `--config hyperparam.yaml`.
This step will take <b>up to 2 hours</b> -- you can increase `maxParallelTrials` or reduce `maxTrials` to get it done faster.  Since `maxParallelTrials` is the number of initial seeds to start searching from, you don't want it to be too large; otherwise, all you have is a random search. Complete __#TODO__s in yaml file and gcloud training job bash command so that we can run hyperparameter tuning.
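
During hyperparameter tuning each trial needs its own output directory. `task.py` above handles this by reading the trial ID from the `TF_CONFIG` environment variable and appending it to `output_dir`. A minimal sketch of that logic follows. The helper name, the bucket `my-bucket`, and the JSON string are illustrative assumptions and were not captured from a real job.

```python
import json
import os


def trial_output_dir(output_dir, tf_config_json):
    """Append the trial ID found in a TF_CONFIG-style JSON string, if any."""
    tf_config = json.loads(tf_config_json or "{}")
    trial = tf_config.get("task", {}).get("trial", "")
    return os.path.join(output_dir, trial)


# Hypothetical values: a tuning trial, and a plain training job with no TF_CONFIG.
with_trial = trial_output_dir("gs://my-bucket/hyperparam", '{"task": {"trial": "3"}}')
without_trial = trial_output_dir("gs://my-bucket/hyperparam", "")
print(with_trial)
print(without_trial)
```

When no trial is present, the empty string leaves the base directory effectively unchanged, so the same code path works for ordinary training jobs.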

In [80]:
%%writefile hyperparam.yaml
trainingInput:
    scaleTier: STANDARD_1
    hyperparameters:
        hyperparameterMetricTag: rmse # TODO: Add metric we want to optimize
        goal: MINIMIZE # TODO: MAXIMIZE or MINIMIZE?
        maxTrials: 20
        maxParallelTrials: 5  # Running trials sequentially helps optimize later trials but is slower; running them in parallel is the opposite
        enableTrialEarlyStopping: True
        params:
        - parameterName: batch_size
          type: INTEGER
          minValue: 10
          maxValue: 50
          scaleType: UNIT_LINEAR_SCALE 
        - parameterName: nembeds
          type: INTEGER
          minValue: 4
          maxValue: 64
          scaleType: UNIT_LINEAR_SCALE 


Overwriting hyperparam.yaml
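
As a rough guide to the speed trade-off, and assuming every trial takes about the same time (early stopping can shorten some of them), the number of sequential rounds is `ceil(maxTrials / maxParallelTrials)`. The function name below is ours.

```python
import math


def sequential_rounds(max_trials, max_parallel_trials):
    """Estimate how many back-to-back batches of trials a tuning job needs."""
    return math.ceil(max_trials / max_parallel_trials)


rounds = sequential_rounds(max_trials=20, max_parallel_trials=5)
print(rounds)
```

With the settings in `hyperparam.yaml`, later rounds can use results from earlier ones, whereas setting `maxParallelTrials` equal to `maxTrials` leaves a single round with nothing to learn from.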


In [81]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/hyperparam
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBNAME}
gsutil -m rm -rf ${OUTDIR}

IMAGE=gcr.io/${PROJECT}/babyweight_training_container

gcloud ai-platform jobs submit training ${JOBNAME} \
    --staging-bucket=gs://${BUCKET} \
    --region=${REGION} \
    --master-image-uri=${IMAGE} \
    --master-machine-type=n1-standard-4 \
    --scale-tier=CUSTOM \
    --config=hyperparam.yaml \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=20000 \
    --eval_steps=100

gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/hyperparam us-central1 babyweight_210602_014807
jobId: babyweight_210602_014807
state: QUEUED


CommandException: 1 files/objects could not be removed.
Job [babyweight_210602_014807] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe babyweight_210602_014807

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs babyweight_210602_014807


In [83]:
!gcloud ai-platform jobs stream-logs babyweight_210602_014807

INFO	2021-06-02 01:48:09 +0000	service		Validating job requirements...
INFO	2021-06-02 01:48:10 +0000	service		Job creation request has been successfully validated.
INFO	2021-06-02 01:48:10 +0000	service		Job babyweight_210602_014807 is queued.
INFO	2021-06-02 01:48:17 +0000	service	3	Waiting for job to be provisioned.
INFO	2021-06-02 01:48:17 +0000	service	4	Waiting for job to be provisioned.
INFO	2021-06-02 01:48:17 +0000	service	2	Waiting for job to be provisioned.
INFO	2021-06-02 01:48:17 +0000	service	5	Waiting for job to be provisioned.
INFO	2021-06-02 01:48:17 +0000	service	1	Waiting for job to be provisioned.
INFO	2021-06-02 01:48:18 +0000	service	3	Waiting for training program to start.
INFO	2021-06-02 01:48:18 +0000	service	1	Waiting for training program to start.
INFO	2021-06-02 01:48:18 +0000	service	2	Waiting for training program to start.
INFO	2021-06-02 01:48:19 +0000	service	4	Waiting for training program to start.
INFO	2021-06-02 01:48:19 +0000	service	5	Waiting for tr

## Repeat training

This time, pass the tuned values for `batch_size` and `nembeds` to the trainer as explicit flags.
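
One way to obtain the tuned values is to read them from the finished tuning job. Running `gcloud ai-platform jobs describe <job> --format=json` returns a job description whose `trainingOutput.trials` list carries each trial's `hyperparameters` and `finalMetric.objectiveValue`. The sketch below picks the best trial from such a structure; `best_trial` is a hypothetical helper, and `sample_job` is a made-up stand-in for the parsed JSON, not real results from this lab.

```python
def best_trial(job, goal="MINIMIZE"):
    """Return the finished trial with the best objective value."""
    trials = [
        t for t in job.get("trainingOutput", {}).get("trials", [])
        if "objectiveValue" in t.get("finalMetric", {})
    ]
    if not trials:
        raise ValueError("no completed trials with a final metric")
    pick = min if goal == "MINIMIZE" else max
    return pick(trials, key=lambda t: t["finalMetric"]["objectiveValue"])


# Made-up example shaped like the job description; not real results.
sample_job = {
    "trainingOutput": {
        "trials": [
            {"trialId": "1",
             "hyperparameters": {"batch_size": "40", "nembeds": "16"},
             "finalMetric": {"objectiveValue": 1.25}},
            {"trialId": "2",
             "hyperparameters": {"batch_size": "20", "nembeds": "32"},
             "finalMetric": {"objectiveValue": 1.10}},
            # A trial without a final metric (e.g. stopped early) is skipped.
            {"trialId": "3",
             "hyperparameters": {"batch_size": "12", "nembeds": "8"}},
        ]
    }
}

best = best_trial(sample_job)
flags = [f"--{name}={value}" for name, value in sorted(best["hyperparameters"].items())]
print(best["trialId"], flags)
```

The resulting flags have the same form as the `--batch_size` and `--nembeds` arguments passed to the training job below.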

In [82]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model_tuned
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBNAME}
gsutil -m rm -rf ${OUTDIR}

IMAGE=gcr.io/${PROJECT}/babyweight_training_container

gcloud ai-platform jobs submit training ${JOBNAME} \
    --staging-bucket=gs://${BUCKET} \
    --region=${REGION} \
    --master-image-uri=${IMAGE} \
    --master-machine-type=n1-standard-4 \
    --scale-tier=CUSTOM \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=20000 \
    --eval_steps=100 \
    --batch_size=32 \
    --nembeds=8

gs://qwiklabs-gcp-00-0db9b1bc58c6/babyweight/trained_model_tuned us-central1 babyweight_210602_014810
jobId: babyweight_210602_014810
state: QUEUED


CommandException: 1 files/objects could not be removed.
Job [babyweight_210602_014810] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe babyweight_210602_014810

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs babyweight_210602_014810


In [85]:
!gcloud ai-platform jobs stream-logs babyweight_210602_014810

INFO	2021-06-02 01:48:12 +0000	service		Validating job requirements...
INFO	2021-06-02 01:48:13 +0000	service		Job creation request has been successfully validated.
INFO	2021-06-02 01:48:13 +0000	service		Waiting for job to be provisioned.
INFO	2021-06-02 01:48:13 +0000	service		Job babyweight_210602_014810 is queued.
INFO	2021-06-02 01:48:14 +0000	service		Waiting for training program to start.
INFO	2021-06-02 01:48:40 +0000	master-replica-0		
INFO	2021-06-02 01:48:40 +0000	master-replica-0		
INFO	2021-06-02 01:48:40 +0000	master-replica-0		
INFO	2021-06-02 01:48:40 +0000	master-replica-0		
INFO	2021-06-02 01:48:40 +0000	master-replica-0		
ERROR	2021-06-02 01:52:40 +0000	master-replica-0		2021-06-02 01:52:40.704514: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
ERROR	2021-06-02 01:52:40 +0000	master-replica-0		2021-06-02 01:52:40.70480

## Lab Summary
In this lab, we set up the environment, created the trainer module's task.py to hold hyperparameter argparsing code, created the trainer module's model.py to hold Keras model code, ran the trainer module package locally, submitted a training job to Cloud AI Platform, and submitted a hyperparameter tuning job to Cloud AI Platform.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.