# Training on Cloud ML Engine

**Learning Objectives**
- Use CMLE to run a distributed training job

## Introduction 
Having tested our training pipeline both locally and in the cloud on a subset of the data, we can now submit another (much larger) training job to the cloud. It is also a good idea to run a hyperparameter tuning job to make sure we have optimized the hyperparameters of our model.

This notebook illustrates how to do distributed training and hyperparameter tuning on Cloud ML Engine. 

To start, we'll set up our environment variables as before.

In [88]:
PROJECT = "qwiklabs-gcp-636667ae83e902b6"  # Replace with your PROJECT
BUCKET =  "qwiklabs-gcp-636667ae83e902b6_al"  # Replace with your BUCKET
REGION = "us-east1"            # Choose an available region for AI Platform  
TFVERSION = "1.13"                # TF version for AI Platform

In [89]:
import os
os.environ["BUCKET"] = BUCKET
os.environ["PROJECT"] = PROJECT
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = TFVERSION

In [90]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


Next, we'll look for the preprocessed data for the babyweight model and copy it over if it's not there. 

In [91]:
%%bash
if ! gsutil ls -r gs://$BUCKET | grep -q gs://$BUCKET/babyweight/preproc; then
    gsutil mb -l ${REGION} gs://${BUCKET}
    # copy canonical set of preprocessed files if you didn't do previous notebook
    gsutil -m cp -R gs://cloud-training-demos/babyweight gs://${BUCKET}
fi

In [92]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/preproc/*-00000*

gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/preproc/eval.csv-00000-of-00013
gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/preproc/train.csv-00000-of-00188


In [93]:
"weight_pounds,is_male,mother_age,mother_race,father_race,cigarette_use,mother_married,ever_born,plurality,weight_gain_pounds,gestation_weeks"

'weight_pounds,is_male,mother_age,mother_race,father_race,cigarette_use,mother_married,ever_born,plurality,weight_gain_pounds,gestation_weeks'

Make sure I have the extra fields I need

In [94]:
%%bash
gsutil cat gs://${BUCKET}/babyweight/preproc/eval.csv-00000-of-00013 | head -5

9.6452239625,Unknown,30,1,1,false,true,2,Single(1),45,38
9.6452239625,false,30,1,1,false,true,2,Single(1),45,38
5.81358984894,Unknown,30,1,1,false,true,2,Single(1),30,38
5.81358984894,false,30,1,1,false,true,2,Single(1),30,38
3.5163730789,Unknown,30,1,1,false,true,2,Multiple(2+),33,38


In [95]:
%%bash
gsutil cat gs://${BUCKET}/babyweight/preproc/train.csv-00000-of-00188 | head -5

7.09668021378,Unknown,18,1,99,None,false,2,Single(1),99,40
7.09668021378,false,18,1,99,None,false,2,Single(1),99,40
8.6751900097,Unknown,42,1,99,false,false,1,Single(1),25,41
8.6751900097,false,42,1,99,false,false,1,Single(1),25,41
7.9917569975,Unknown,22,1,99,false,false,2,Single(1),40,39


In the previous labs we developed our TensorFlow model and got it working on a subset of the data. Now we can package the TensorFlow code up as a Python module and train it on Cloud ML Engine.

## Train on Cloud ML Engine

Training on Cloud ML Engine requires two things:
- Configuring our code as a Python package
- Using gcloud to submit the training code to Cloud ML Engine
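
The second step can be sketched in Python as building the argument list for `gcloud`. This is only an illustration: the helper name `build_submit_command` and the bucket and job names are hypothetical, the flags are the ones used later in this notebook, and the command is printed rather than executed. Everything after the bare `--` is handed to `trainer.task` instead of being interpreted by gcloud.

```python
def build_submit_command(job_name, region, bucket, outdir, tf_version):
    """Return the gcloud argv for a training job (hypothetical helper)."""
    gcloud_flags = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--region={}".format(region),
        "--module-name=trainer.task",
        "--package-path=babyweight/trainer",
        "--job-dir={}".format(outdir),
        "--staging-bucket=gs://{}".format(bucket),
        "--scale-tier=STANDARD_1",
        "--runtime-version={}".format(tf_version),
    ]
    # Arguments after the bare "--" go to trainer/task.py
    user_args = [
        "--bucket={}".format(bucket),
        "--output_dir={}".format(outdir),
    ]
    return gcloud_flags + ["--"] + user_args


cmd = build_submit_command(
    "babyweight_demo", "us-east1", "my-bucket",
    "gs://my-bucket/babyweight/trained_model", "1.13")
print(" ".join(cmd))
```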

### Move code into a Python package

A Python package is simply a collection of one or more `.py` files along with an `__init__.py` file to identify the containing directory as a package. The `__init__.py` sometimes contains initialization code but for our purposes an empty file suffices.

The bash command `touch` creates an empty file in the specified location. The directory `babyweight/trainer` should already exist.

In [96]:
%%bash
touch babyweight/trainer/__init__.py
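
To see why the empty `__init__.py` matters, here is a minimal sketch that builds a throwaway package in a temporary directory and imports a module from it. The names `demo_trainer` and `GREETING` are made up for the illustration and are not part of the babyweight code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package in a temporary directory (names are illustrative)
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_trainer"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").touch()   # creates an empty file, like bash `touch`
(pkg_dir / "task.py").write_text("GREETING = 'hello from task'\n")

# Put the parent directory on the import path, as PYTHONPATH does
sys.path.insert(0, str(root))
task = importlib.import_module("demo_trainer.task")
print(task.GREETING)
```

This mirrors what happens later in the notebook, where the `babyweight` directory is added to `PYTHONPATH` and the job is started with `python -m trainer.task`.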

We then use the `%%writefile` magic to write the contents of the cell below to a file called `task.py` in the `babyweight/trainer` folder.

#### **Exercise 1**

The cell below writes the file `babyweight/trainer/task.py`, which sets up our training job. Here is where we determine which parameters of our model to pass as flags during training, using the `argparse` module. Look at how `batch_size` is passed to the model in the code below. Use this as an example to parse arguments for the following variables:
- `nnsize` which represents the hidden layer sizes to use for DNN feature columns
- `nembeds` which represents the embedding size of a cross of n key real-valued parameters
- `train_examples` which represents the number of examples (in thousands) to run the training job
- `eval_steps` which represents the positive number of steps for which to evaluate model
- `pattern` which specifies a pattern that has to be in input files. For example '00001-of' would process only one shard. For this variable, set 'of' to be the default. 

Be sure to include a default value for the parsed arguments above and specify the `type` if necessary.
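
As a small, self-contained sketch of the two points above (defaults and `type`), the parser below defines a few of the same flags. It is not the full `task.py`; the helper name `build_parser` is an assumption for the example. Values given on the command line arrive as strings unless `type` is set, and `argparse` stores a flag such as `--job-dir` under the key `job_dir`.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # type=int converts the command-line string to an integer
    parser.add_argument("--batch_size", type=int, default=512)
    parser.add_argument("--eval_steps", type=int, default=100)
    # No type needed: the pattern is used as a string
    parser.add_argument("--pattern", default="of")
    # The hyphen becomes an underscore in the parsed result
    parser.add_argument("--job-dir", default="junk")
    return parser


args = build_parser().parse_args(["--batch_size=64", "--job-dir=./tmp"])
arguments = vars(args)
print(arguments)
```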

In [97]:
%%writefile babyweight/trainer/task.py
import argparse
import json
import os

from . import model

import tensorflow as tf

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--bucket",
        help = "GCS path to data. We assume that data is in \
        gs://BUCKET/babyweight/preproc/",
        required = True
    )
    parser.add_argument(
        "--output_dir",
        help = "GCS location to write checkpoints and export models",
        required = True
    )
    parser.add_argument(
        "--batch_size",
        help = "Number of examples to compute gradient over.",
        type = int,
        default = 512
    )
    parser.add_argument(
        "--job-dir",
        help = "this model ignores this field, but it is required by gcloud",
        default = "junk"
    )
    
    parser.add_argument(
        "--nnsize",
        help = "Hidden layer sizes to use for DNN (string, comma-separated)",
        default="[10,10]"
    )
    
    parser.add_argument(
        "--nembeds",
        help = "Embedding size of a cross of n key parameters - this will be a small integer",
        default = 3)
    
    parser.add_argument(
        "--ntrees",
        help = "Number of trees",
        default = 100)
    
    parser.add_argument(
        "--maxdepth",
        help = "Depth of trees",
        default = 6)
    
    parser.add_argument(
        "--train_examples",
        help="Number of examples (in thousands) to run the training job",
        default=1)
    
    parser.add_argument(
        "--eval_steps",
        help="Steps for which to evaluate model",
        default=100)
    
    parser.add_argument(
        "--pattern",
        help = "Pattern that appears in filename",
        default = "of")
        
    # Parse arguments
    args = parser.parse_args()
    arguments = args.__dict__

    # Pop args that gcloud requires but the model does not use
    # (argparse stores "--job-dir" under the key "job_dir")
    arguments.pop("job_dir", None)

    # Assign the arguments to the model variables
    output_dir = arguments.pop("output_dir")
    model.BUCKET     = arguments.pop("bucket")
    model.BATCH_SIZE = int(arguments.pop("batch_size"))
    model.TRAIN_STEPS = 200000 
    #model.TRAIN_STEPS = (int(arguments.pop("train_examples") * 1000)) / model.BATCH_SIZE
    model.EVAL_STEPS = int(arguments.pop("eval_steps"))

    #print ("Will train for {} steps using batch_size={}".format(model.TRAIN_STEPS, model.BATCH_SIZE))
    model.PATTERN = arguments.pop("pattern")
    model.NEMBEDS= arguments.pop("nembeds")
    model.NNSIZE = arguments.pop("nnsize")
    #print ("Will use DNN size of {}".format(model.NNSIZE))
  
    model.MAXDEPTH = int(arguments.pop("maxdepth")) 
    model.NTREES = int(arguments.pop("ntrees"))
    
    print ("Will train on {} trees with max depth of {}".format(model.NTREES, model.MAXDEPTH))
    # Append trial_id to path if we are doing hptuning
    # This code can be removed if you are not using hyperparameter tuning
    output_dir = os.path.join(
        output_dir,
        json.loads(
            os.environ.get("TF_CONFIG", "{}")
        ).get("task", {}).get("trial", "")
    )

    # Run the training job
    model.train_and_evaluate_gbt(output_dir)

Overwriting babyweight/trainer/task.py
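
The last step of `task.py`, appending the trial id to the output directory, can be tried in isolation. The sketch below repeats the same lookup on a `TF_CONFIG` string. The helper name `trial_output_dir` and the sample `TF_CONFIG` value are assumptions for the illustration, not values captured from a real job.

```python
import json
import os


def trial_output_dir(output_dir, tf_config):
    """Append the trial id from a TF_CONFIG JSON string, if there is one."""
    trial = json.loads(tf_config or "{}").get("task", {}).get("trial", "")
    return os.path.join(output_dir, trial)


# Hypothetical TF_CONFIG value for a tuning trial
tuning_config = json.dumps({"task": {"type": "master", "index": 0, "trial": "3"}})
print(trial_output_dir("gs://my-bucket/babyweight/hyperparam", tuning_config))

# Without TF_CONFIG (a local run) the directory is unchanged apart from a
# trailing separator
print(trial_output_dir("babyweight_trained", ""))
```

Giving each trial its own subdirectory keeps the checkpoints of different trials from overwriting each other.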


In the same way we can write to the file `model.py` the model that we developed in the previous notebooks. 

#### **Exercise 2**

Complete the TODOs in the code cell below to create our `model.py`. We'll use the code we wrote for the Wide & Deep model. Look back at your `3_tensorflow_wide_deep` notebook and copy/paste the necessary code from that notebook into its place in the cell below.

In [106]:
%%writefile babyweight/trainer/model.py
import shutil
import numpy as np
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

BUCKET = None  # set from task.py
PATTERN = "of" # gets all files

# Determine CSV and label columns
CSV_COLUMNS = 'weight_pounds,is_male,mother_age,mother_race,father_race,cigarette_use,mother_married,ever_born,plurality,weight_gain_pounds,gestation_weeks'.split(',')
LABEL_COLUMN = 'weight_pounds'

# Set default values for each CSV column
CSV_DEFAULTS = [[0.0], ['Unknown'], [0], ['0'], ['0'], ['False'], ['True'], ['1'], ['Single(1)'], [30], [0]]

# Define some hyperparameters
TRAIN_STEPS = 10000
EVAL_STEPS = None
BATCH_SIZE = 512
NEMBEDS = 3
NNSIZE = [64, 16, 4]
NTREES = 100   # default; overridden from task.py
MAXDEPTH = 6   # default; overridden from task.py


def decode_csv(line_of_text):
    fields = tf.decode_csv(records = line_of_text, record_defaults = CSV_DEFAULTS, na_value='None')
    features = dict(zip(CSV_COLUMNS, fields))
    features['mother_race'] = tf.cast(features['mother_race'], 'string')
    features['father_race'] = tf.cast(features['father_race'], 'string')
    features['plurality'] = tf.cast(features['plurality'], 'string')
    features['weight_gain_pounds'] = tf.cast(features['weight_gain_pounds'], 'int32')
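    # Replace a weight gain of 99 with 30. A Python `if` on a tensor is not
    # evaluated per record in graph mode, so use tf.where instead.
    features['weight_gain_pounds'] = tf.where(
        tf.equal(features['weight_gain_pounds'], 99),
        tf.constant(30, dtype = tf.int32),
        features['weight_gain_pounds'])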
    label = features.pop(LABEL_COLUMN) # remove label from features and store
    return features, label

# Create an input function reading a file using the Dataset API
# Then provide the results to the Estimator API
def read_dataset(filename_pattern, mode, batch_size = 512):
    def _input_fn():
    
        path_to_files = 'gs://{}/babyweight/preproc/'.format(BUCKET)
        # Create list of files that match pattern.  Does support internal wildcarding e.g. "babyweight*.csv"
        file_list = tf.gfile.Glob(path_to_files + filename_pattern + "*" + PATTERN + "*")

        print(filename_pattern)
        print(file_list)
        # Create dataset from file list
        dataset = tf.data.TextLineDataset(filenames = file_list).skip(count = 1)
        dataset = dataset.map(map_func = decode_csv)

        # In training mode, shuffle the dataset and repeat indefinitely
        if mode == tf.estimator.ModeKeys.TRAIN:
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)
            num_epochs = None 
        else:
            num_epochs = 1 

        # This will now return batches of features, label
        dataset = dataset.repeat(count = num_epochs).batch(batch_size = batch_size)
        return dataset
    return _input_fn

# Define feature columns
def get_categorical(name, values):
    return tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list(key=name, vocabulary_list=values))

    
num_cols = ['mother_age', 'gestation_weeks', 'weight_gain_pounds']
cat_cols = ['is_male', 'mother_race', 'father_race', 'cigarette_use', 'mother_married', 'plurality', 'ever_born']


cat_vocab = {
            'is_male': ['True', 'False', 'Unknown'], 
             'cigarette_use': ['True', 'False', 'None'], 
             'mother_married': ['True', 'False'], 
             'mother_race': ['1', '7', '2', '0', '3', '18', '28', '5', '48', '4', '68', '9', '78',
        '6', '38', '58'], 
             'father_race': ['1', '7', '2', '0', '3', '18', '28', '5', '48', '4', '68', '9', '78',
        '6', '38', '58'], 
             'plurality': ['Single(1)', 'Twins(2)', 'Multiple(2+)', 'Triplets(3)',
       'Quintuplets(5)', 'Quadruplets(4)'] ,
              'ever_born': ['1', '2', '3', '4', '5']
            }

def get_cols(num_cols, cat_cols, cat_vocab):
    all_cols = []
    for col in num_cols:
        all_cols.append(tf.feature_column.numeric_column(key = col))
    for col in cat_cols:
        all_cols.append(get_categorical(col, cat_vocab[col]))

    #fc_crossed_race = tf.feature_column.crossed_column(keys = ['mother_race', 'father_race'], hash_bucket_size = 100)
    
    #all_cols.append(tf.feature_column.indicator_column(categorical_column = fc_crossed_race))
    return all_cols


# Create serving input function to be able to serve predictions later using provided inputs
def serving_input_fn():
    num_placeholders = {col: tf.placeholder(dtype=tf.float32, shape=[None], name=col) for col in num_cols}     
    cat_placeholders = {col: tf.placeholder(dtype=tf.string, shape=[None], name=col) for col in cat_cols}
    
    feature_placeholders = {**num_placeholders, **cat_placeholders}
    
    features = {
        key: tf.expand_dims(input = tensor, axis = -1)
        for key, tensor in feature_placeholders.items()
    }
    
    return tf.estimator.export.ServingInputReceiver(features = features, receiver_tensors = feature_placeholders)

# create metric for hyperparameter tuning
def my_rmse(labels, predictions):
    pred_values = predictions["predictions"]
    return {"rmse": tf.metrics.root_mean_squared_error(labels = labels, predictions = pred_values)}

# Create estimator to train and evaluate
def train_and_evaluate_dnn(output_dir):
    EVAL_INTERVAL = 300
    run_config = tf.estimator.RunConfig(
        save_checkpoints_secs = EVAL_INTERVAL,
        tf_random_seed=42,
        keep_checkpoint_max = 3)

    estimator = tf.estimator.DNNRegressor(model_dir=output_dir,
                                         feature_columns = get_cols(num_cols, cat_cols, cat_vocab),
                                         hidden_units = [64,32],
                                         config=run_config)
    
    estimator = tf.contrib.estimator.add_metrics(estimator, my_rmse)
    train_spec = tf.estimator.TrainSpec(input_fn = read_dataset("train", mode = tf.estimator.ModeKeys.TRAIN),
        max_steps = TRAIN_STEPS)
    
    exporter = tf.estimator.LatestExporter(name = "exporter", serving_input_receiver_fn = serving_input_fn)
    eval_spec = tf.estimator.EvalSpec(input_fn = read_dataset("eval", mode=tf.estimator.ModeKeys.EVAL), exporters=exporter)
        
    tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)
    
def train_and_evaluate_gbt(output_dir):
    EVAL_INTERVAL = 300
    run_config = tf.estimator.RunConfig(
        save_checkpoints_secs = EVAL_INTERVAL,
        tf_random_seed=42,
        keep_checkpoint_max = 3)

    estimator = tf.estimator.BoostedTreesRegressor(model_dir=output_dir,
                                                   n_batches_per_layer = 1,
                                         feature_columns = get_cols(num_cols, cat_cols, cat_vocab),
                                         n_trees=NTREES,
                                         max_depth=MAXDEPTH,   
                                         learning_rate=0.05,          
                                         config=run_config)
    
    estimator = tf.contrib.estimator.add_metrics(estimator, my_rmse)
    train_spec = tf.estimator.TrainSpec(input_fn = read_dataset("train", mode = tf.estimator.ModeKeys.TRAIN),
        max_steps = TRAIN_STEPS)
    
    exporter = tf.estimator.BestExporter(name = "exporter", serving_input_receiver_fn = serving_input_fn)
    eval_spec = tf.estimator.EvalSpec(input_fn = read_dataset("eval", mode=tf.estimator.ModeKeys.EVAL), exporters=exporter)
                  
    tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)    
    

Overwriting babyweight/trainer/model.py
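
To get a feel for how `read_dataset` selects shards, the sketch below builds the same glob string and matches it against a few file names using Python's `fnmatch`. This only approximates `tf.gfile.Glob`; the bucket name is hypothetical and the shard names follow the style listed earlier in this notebook.

```python
import fnmatch

PATTERN = "00000-of-"   # value passed with --pattern
path_to_files = "gs://my-bucket/babyweight/preproc/"   # hypothetical bucket

# A few shard names in the style listed earlier in this notebook
shards = [
    path_to_files + "train.csv-00000-of-00188",
    path_to_files + "train.csv-00001-of-00188",
    path_to_files + "eval.csv-00000-of-00013",
]


def match_shards(filename_pattern):
    """Return the shards matched by the glob that read_dataset builds."""
    glob = path_to_files + filename_pattern + "*" + PATTERN + "*"
    return fnmatch.filter(shards, glob)


print(match_shards("train"))
print(match_shards("eval"))
```

With the default pattern `"of"`, every shard name contains the pattern, so all files for the given prefix would be read.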


## Train locally

After moving the code to a package, make sure it works as a standalone. Note, we incorporated the `--pattern` and `--train_examples` flags so that we don't try to train on the entire dataset while we are developing our pipeline. Once we are sure that everything is working on a subset, we can change the pattern so that we can train on all the data. Even for this subset, this takes about *3 minutes* in which you won't see any output ...
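
The commented-out line in `task.py` suggests how `--train_examples` (given in thousands) was meant to translate into training steps. The sketch below works through that arithmetic. The function name `train_steps` is an assumption, and note that `task.py` as written above sets `model.TRAIN_STEPS = 200000` directly, so the flag currently has no effect on the number of steps.

```python
def train_steps(train_examples_thousands, batch_size):
    """Number of whole batches needed to see the requested examples once."""
    return (train_examples_thousands * 1000) // batch_size


print(train_steps(1, 512))      # default of 1 (thousand) examples
print(train_steps(2000, 512))   # --train_examples=2000
```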

#### **Exercise 3**

Fill in the missing code in the TODOs below so that we can run a very small training job over a single file (i.e. set `pattern` to "00000-of-") with 1 train step and 1 eval step.

In [107]:
%%bash
echo "bucket=${BUCKET}"
rm -rf babyweight_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight
/usr/bin/python3.5 -m trainer.task \
    --bucket=$BUCKET \
    --output_dir=babyweight_trained \
    --job-dir=./tmp \
    --pattern="00000-of-"\
    --eval_steps=1

bucket=qwiklabs-gcp-636667ae83e902b6_al
Will train on 100 trees with max depth of 6

For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

train
['gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/preproc/train.csv-00000-of-00188']
eval
['gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/preproc/eval.csv-00000-of-00013']


INFO:tensorflow:Using config: {'_tf_random_seed': 42, '_model_dir': 'babyweight_trained/', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 3, '_service': None, '_evaluation_master': '', '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_task_id': 0, '_log_step_count_steps': 100, '_num_worker_replicas': 1, '_eval_distribute': None, '_experimental_distribute': None, '_task_type': 'worker', '_save_checkpoints_secs': 300, '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_train_distribute': None, '_num_ps_replicas': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe330c4f908>, '_master': '', '_global_id_in_cluster': 0, '_device_fn': None, '_save_summary_steps': 100}
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
INFO:tensorflow:Using config: {'_tf_random_seed': 42, 

## Making predictions

The JSON below represents an input into your prediction model. Write the `inputs.json` file with the next cell, then run the prediction locally to assess whether it produces predictions correctly.

In [108]:
%%writefile inputs.json
{"is_male": "True",  "mother_age": 26.0,  "mother_race": "1",  "father_race": "1", "cigarette_use": "False", "mother_married": "True", "ever_born": "1", "plurality": "Single(1)", "gestation_weeks": 39, "weight_gain_pounds": 30}
{"is_male": "False",  "mother_age": 26.0,  "mother_race": "1",  "father_race": "1", "cigarette_use": "False", "mother_married": "True", "ever_born": "1", "plurality": "Single(1)", "gestation_weeks": 39, "weight_gain_pounds": 30}

Overwriting inputs.json
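
Before sending the file to the model, it can help to check that each line is valid JSON and names exactly the features the serving function expects. The sketch below does that on one sample line. The helper `check_instances` is hypothetical, and the expected keys are copied from `num_cols` and `cat_cols` in `model.py`.

```python
import json

# Feature names taken from num_cols and cat_cols in model.py
EXPECTED_KEYS = {
    "mother_age", "gestation_weeks", "weight_gain_pounds",
    "is_male", "mother_race", "father_race", "cigarette_use",
    "mother_married", "plurality", "ever_born",
}


def check_instances(text):
    """Parse newline-delimited JSON and return (line_number, keys) problems."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        instance = json.loads(line)
        missing = EXPECTED_KEYS - set(instance)
        extra = set(instance) - EXPECTED_KEYS
        if missing or extra:
            problems.append((number, sorted(missing | extra)))
    return problems


sample = ('{"is_male": "True", "mother_age": 26.0, "mother_race": "1", '
          '"father_race": "1", "cigarette_use": "False", '
          '"mother_married": "True", "ever_born": "1", '
          '"plurality": "Single(1)", "gestation_weeks": 39, '
          '"weight_gain_pounds": 30}')
print(check_instances(sample))
```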


#### **Exercise 4**

Finish the code in the cell below to run a local prediction job on the `inputs.json` file we just created. You will need to provide two additional flags:
- one for `model-dir` specifying the location of the model binaries
- one for `json-instances` specifying the location of the json file on which you want to predict

In [109]:
%%bash
MODEL_LOCATION=$(ls -d $(pwd)/babyweight_trained/export/exporter/* | tail -1)
echo $MODEL_LOCATION
gcloud ai-platform local predict --model-dir=$MODEL_LOCATION --json-instances=inputs.json




ls: cannot access '/home/jupyter/training-data-analyst/courses/machine_learning/deepdive/05_review/labs/babyweight_trained/export/exporter/*': No such file or directory
ERROR: (gcloud.ai-platform.local.predict) Traceback (most recent call last):
  File "lib/googlecloudsdk/command_lib/ml_engine/local_predict.py", line 184, in <module>
    main()
  File "lib/googlecloudsdk/command_lib/ml_engine/local_predict.py", line 179, in main
    signature_name=args.signature_name)
  File "/usr/lib/google-cloud-sdk/lib/third_party/ml_sdk/cloud/ml/prediction/prediction_lib.py", line 98, in local_predict
    client = create_client(framework, model_dir, **kwargs)
  File "/usr/lib/google-cloud-sdk/lib/third_party/ml_sdk/cloud/ml/prediction/prediction_lib.py", line 91, in create_client
    return create_client_fn(model_path, **kwargs)
  File "/usr/lib/google-cloud-sdk/lib/third_party/ml_sdk/cloud/ml/prediction/frameworks/tf_prediction_lib.py", line 522, in create_tf_session_client
    return SessionClien

CalledProcessError: Command 'b'MODEL_LOCATION=$(ls -d $(pwd)/babyweight_trained/export/exporter/* | tail -1)\necho $MODEL_LOCATION\ngcloud ai-platform local predict --model-dir=$MODEL_LOCATION --json-instances=inputs.json\n'' returned non-zero exit status 1
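
The error above starts with `ls` finding no export directory, so `MODEL_LOCATION` ends up empty before gcloud is even called. A more defensive lookup can be sketched in Python as below. The helper `latest_export` is hypothetical, and it assumes, as `tail -1` does, that sorting the export subdirectory names puts the newest one last. The directory names in the demonstration are made up.

```python
import tempfile
from pathlib import Path


def latest_export(export_root):
    """Return the last export directory by name, or None if there is none."""
    root = Path(export_root)
    if not root.is_dir():
        return None
    candidates = sorted(p for p in root.iterdir() if p.is_dir())
    return candidates[-1] if candidates else None


# Demonstration on a temporary directory with made-up export names
demo_root = Path(tempfile.mkdtemp()) / "export" / "exporter"
print(latest_export(demo_root))   # nothing exported yet
for stamp in ("1564099100", "1564099300"):
    (demo_root / stamp).mkdir(parents=True)
print(latest_export(demo_root).name)
```

Checking for `None` before calling the prediction command gives a clearer message than the traceback when training has not yet produced an export.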

## Training on the Cloud with CMLE

Once the code works in standalone mode, you can run it on Cloud ML Engine.  Because this is on the entire dataset, it will take a while. The training run took about <b> an hour </b> for me. You can monitor the job from the GCP console in the Cloud Machine Learning Engine section.

In [104]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model_gbt_cc
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME

gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/trained_model_gbt_cc us-east1 babyweight_190726_124915


In [105]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model_gbt_cc
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
    --region=$REGION \
    --module-name=trainer.task \
    --package-path=$(pwd)/babyweight/trainer \
    --job-dir=$OUTDIR \
    --staging-bucket=gs://$BUCKET \
    --scale-tier=STANDARD_1 \
    --runtime-version=$TFVERSION \
    --python-version=3.5 \
    -- \
    --bucket=${BUCKET} \
    --output_dir=${OUTDIR} \
    --train_examples=2000

gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/trained_model_gbt_cc us-east1 babyweight_190726_124919
jobId: babyweight_190726_124919
state: QUEUED


CommandException: 1 files/objects could not be removed.
Job [babyweight_190726_124919] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe babyweight_190726_124919

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs babyweight_190726_124919


When I ran it, I used train_examples=2000000. When training finished, I filtered in the Stackdriver log on the word "dict" and saw that the last line was:
<pre>
Saving dict for global step 5714290: average_loss = 1.06473, global_step = 5714290, loss = 34882.4, rmse = 1.03186
</pre>
The final RMSE was 1.03 pounds.
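
As a quick sanity check on that log line, the sketch below pulls the metrics out with a regular expression and compares `rmse` with the square root of `average_loss`. It assumes the reported `average_loss` is the mean squared error, which is consistent with the numbers shown.

```python
import math
import re

line = ("Saving dict for global step 5714290: average_loss = 1.06473, "
        "global_step = 5714290, loss = 34882.4, rmse = 1.03186")

# Collect every "name = number" pair from the log line
metrics = {name: float(value)
           for name, value in re.findall(r"(\w+) = ([0-9.]+)", line)}

print(metrics["rmse"])
print(round(math.sqrt(metrics["average_loss"]), 5))
```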

In [60]:
# DNN:
# Saving dict for global step 200008: 
# average_loss = 1.037256, global_step = 200008, label/mean = 7.438621, loss = 531.0751, prediction/mean = 7.523035, 
# rmse = 1.0184577

In [62]:
# GBT:
# Saving dict for global step 197872: average_loss = 0.9814496, global_step = 197872, label/mean = 7.438621, 
# loss = 502.5022, prediction/mean = 7.4479456, rmse = 0.9906814

<h2> Optional: Hyperparameter tuning </h2>
<p>
All of these are command-line parameters to my program.  To do hyperparameter tuning, create `hyperparam.yaml` and pass it to gcloud with `--config`.
This step will take <b>1 hour</b> -- you can increase maxParallelTrials or reduce maxTrials to get it done faster.  Since maxParallelTrials is the number of initial seeds to start searching from, you don't want it to be too large; otherwise, all you have is a random search.
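
A rough way to reason about that trade-off is to count how many back-to-back batches of trials are needed. This is a simplification, since trials need not start and finish in lockstep, and the function name is an assumption for the example.

```python
import math


def sequential_rounds(max_trials, max_parallel_trials):
    """Rough number of back-to-back batches of trials."""
    return math.ceil(max_trials / max_parallel_trials)


print(sequential_rounds(20, 5))    # settings used in this notebook
print(sequential_rounds(20, 10))   # more parallelism, fewer rounds
```

Fewer rounds finish sooner, but later trials then have fewer completed results to learn from.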


#### **Exercise 6**

We need to create a .yaml file to pass with our hyperparameter tuning job. Fill in the TODOs below for each of the parameters we want to include in our hyperparameter search.

%%writefile hyperparam.yaml
trainingInput:
    scaleTier: STANDARD_1
    hyperparameters:
        hyperparameterMetricTag: rmse
        goal: MINIMIZE
        maxTrials: 20
        maxParallelTrials: 5
        enableTrialEarlyStopping: True
        params:
        - parameterName: batch_size
          type: INTEGER
          minValue: 8
          maxValue: 512
          scaleType: UNIT_LOG_SCALE
        - parameterName: nembeds
          type: INTEGER
          minValue: 3
          maxValue: 30
          scaleType: UNIT_LINEAR_SCALE
        - parameterName: nnsize
          type: INTEGER
          minValue: 64
          maxValue: 512
          scaleType: UNIT_LOG_SCALE
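
One way to read the `scaleType` settings is as a mapping from a unit interval onto the parameter range, either linearly or logarithmically. The sketch below shows that interpretation for the `batch_size` range. It is an illustration of the idea, not the tuning service's actual search code, and the function names are assumptions.

```python
import math


def from_unit_linear(u, min_value, max_value):
    """Map u in [0, 1] onto [min_value, max_value] linearly."""
    return min_value + u * (max_value - min_value)


def from_unit_log(u, min_value, max_value):
    """Map u in [0, 1] onto [min_value, max_value] on a log scale."""
    return min_value * (max_value / min_value) ** u


# batch_size range from hyperparam.yaml: 8 to 512
print(from_unit_linear(0.5, 8, 512))
print(from_unit_log(0.5, 8, 512))
```

On a log scale the midpoint of the search space sits much closer to the small end of the range, which suits parameters such as batch size that are usually varied by factors rather than by fixed increments.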

In [None]:
# First run for trees did best at 500 trees which was my max - try more

In [80]:
%%writefile hyperparam.yaml
trainingInput:
    scaleTier: STANDARD_1
    hyperparameters:
        hyperparameterMetricTag: rmse
        goal: MINIMIZE
        maxTrials: 20
        maxParallelTrials: 5
        enableTrialEarlyStopping: True
        params:
        - parameterName: ntrees
          type: INTEGER
          minValue: 400
          maxValue: 1000
          scaleType: UNIT_LINEAR_SCALE
        - parameterName: maxdepth
          type: INTEGER
          minValue: 2
          maxValue: 5
          scaleType: UNIT_LINEAR_SCALE


Overwriting hyperparam.yaml


In [81]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/hyperparam3
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
    --region=$REGION \
    --module-name=trainer.task \
    --package-path=$(pwd)/babyweight/trainer \
    --job-dir=$OUTDIR \
    --staging-bucket=gs://$BUCKET \
    --scale-tier=STANDARD_1 \
    --config=hyperparam.yaml \
    --runtime-version=$TFVERSION \
    --python-version=3.5 \
    -- \
    --bucket=${BUCKET} \
    --output_dir=${OUTDIR} \
    --eval_steps=10 \
    --train_examples=20000

gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/hyperparam3 us-east1 babyweight_190725_195701
jobId: babyweight_190725_195701
state: QUEUED


CommandException: 1 files/objects could not be removed.
Job [babyweight_190725_195701] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe babyweight_190725_195701

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs babyweight_190725_195701


In [82]:
import tensorflow as tf
tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file

In [83]:
OUTDIR="gs://{}/babyweight/hyperparam3".format(BUCKET)

get_ipython().system_raw(
    "tensorboard --logdir {} --host 0.0.0.0 --port 6006 &"
    .format(OUTDIR))


get_ipython().system_raw("/home/jupyter/training-data-analyst/courses/machine_learning/asl/02_tensorflow/assets/ngrok http 6006 &")

In [84]:
!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

http://f674f68d.ngrok.io


<h2> Repeat training </h2>
<p>
This time with tuned parameters (note last line)

My best parameters:

* batch_size = 56
* nembeds = 21
* nnsize = 180

RMSE was 0.99!

Try 8, 30, 511 - second-best but the other run gave worse RMSE

In [85]:
## For trees the best is 584 trees, depth of 2

In [87]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model_tuned_gbt
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
    --region=$REGION \
    --module-name=trainer.task \
    --package-path=$(pwd)/babyweight/trainer \
    --job-dir=$OUTDIR \
    --staging-bucket=gs://$BUCKET \
    --scale-tier=STANDARD_1 \
    --runtime-version=$TFVERSION \
    --python-version=3.5 \
    -- \
    --bucket=${BUCKET} \
    --output_dir=${OUTDIR} \
    --train_examples=2000 --ntrees=584 --maxdepth=2

gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/trained_model_tuned_gbt us-east1 babyweight_190726_000521
jobId: babyweight_190726_000521
state: QUEUED


Removing gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/trained_model_tuned_gbt/checkpoint#1564099299318799...
Removing gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/trained_model_tuned_gbt/eval/#1564099306714990...
Removing gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/trained_model_tuned_gbt/#1564099296918706...
Removing gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/trained_model_tuned_gbt/events.out.tfevents.1564099179.cmle-training-master-1b12a2b510-0-p6mt6#1564099180415429...
Removing gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/trained_model_tuned_gbt/eval/events.out.tfevents.1564099306.cmle-training-master-1b12a2b510-0-p6mt6#1564099307946883...
Removing gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/trained_model_tuned_gbt/model.ckpt-0.index#1564099216962366...
Removing gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/trained_model_tuned_gbt/model.ckpt-1168.data-00000-of-00003#1564099298287164...
Removing gs://qwiklabs-gcp-636667ae83e902b6_al/babyweight/trained_

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License