# Using Vertex AI to train an image classification model

<div class="alert alert-block alert-info">
Run the <a href="xxx"><code>00_pcam_setup.ipynb</code></a> notebook before running this one.  You'll need the settings info from that notebook.
</div>

## Introduction

This notebook shows some examples of how to use [Vertex AI](https://cloud.google.com/vertex-ai/docs) for training a machine learning model. 

It shows how to define and submit a **model training job**; then how to upload and deploy that model for serving; and then how to send prediction requests to the deployed model's *Endpoint*. It also shows how to create and use a Managed Tensorboard instance during training, and how to log information about the training run to the Vertex Experiments API.

A follow-on notebook, `02_2_vertex_ai_pcam`, shows how to set up a [**hyperparameter tuning**](https://en.wikipedia.org/wiki/Hyperparameter_optimization) job using the same model, and how to set up and run a **distributed multi-node training** job.

A further set of notebooks shows how to use Vertex Pipelines to define ML workflows for data preprocessing, training, model evaluation, and deployment.

Currently, the code is on GitHub here: https://github.com/verily-src/terra-solutions-ml.

### Estimated cost of running this notebook

The dataset used for the examples in this notebook is fairly large, as is the base model architecture, and training using the notebook's default configurations will take about 1.5 hours.

Model training works best with GPUs. It also runs fine on CPUs only, but training then takes even longer. The notebook instance itself doesn't need GPUs; they are used by Vertex AI instead.

This example should cost < $2 in Vertex AI charges to run (billed to your ['native' GCP project](https://support.terra.bio/hc/en-us/articles/360051229072-Accessing-advanced-GCP-features-in-Terra)), not including the cost of the notebook instance.

### Running on a [Terra](http://app.terra.bio) notebook

This example requires that TensorFlow >= 2.6 be installed, and does not require GPUs; instead the example uses GPUs on Vertex AI Training.
You can use the default GATK image. 

You will need to use a ['native' GCP project](https://support.terra.bio/hc/en-us/articles/360051229072-Accessing-advanced-GCP-features-in-Terra) to connect to the Vertex AI services.  The `00_pcam_setup.ipynb` notebook, which should be run before this one, will walk you through that setup.

<div class="alert alert-block alert-info">
This notebook is not designed for running all the cells at once via 'Run All'. You will need to wait until training has finished before running the later parts of the example.

If you like, you can shut down the notebook instance/Cloud Environment while the training job runs, monitoring its progress in the Cloud Console UI, and then restart the notebook instance when the job is finished to complete the example. If you do this, you'll need to do a bit of additional work to redefine some imports and config after the notebook is restarted.
</div>

To monitor the logs for a training job while it is running, click on the links output to the notebook when you start the training job.  You can also visit the [Vertex AI tab in the Cloud Console](https://console.cloud.google.com/vertex-ai/training/custom-jobs) for your 'native' GCP project, and click on 'Training', then 'CUSTOM JOBS'.  From that list of jobs, click into any of them (look for your username), then click on the 'Logs' link in the detailed view.
<img src="https://storage.googleapis.com/amy-jo/images/terra/CleanShot%202022-02-18%20at%2013.48.34%402x.png" width="90%"/>


### About the ML task and dataset

This notebook shows an example of training an _image classification_ [Keras](https://keras.io/) model.

The [PatchCamelyon benchmark](https://www.tensorflow.org/datasets/catalog/patch_camelyon) consists of 327,680 color images (96 x 96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating presence of metastatic tissue.

The model uses one of Keras' prebuilt model architectures, [Xception](https://keras.io/api/applications/xception/). The training uses [_transfer learning_](https://en.wikipedia.org/wiki/Transfer_learning), bootstrapping from model weights trained on the ['imagenet'](https://en.wikipedia.org/wiki/ImageNet) dataset.

<img src="https://storage.googleapis.com/tfds-data/visualization/fig/patch_camelyon-2.0.0.png" width="60%">

## Config and setup

We'll first do some configuration and set some variables.


In [None]:
import json
import os
import time
from datetime import datetime

import IPython
import numpy as np
import PIL
import tensorflow as tf
from google.cloud import aiplatform
from google.cloud.aiplatform import gapic as aip
from PIL import Image

IMAGE_HEIGHT = 96
IMAGE_WIDTH = 96

IMAGE_SIZE = (IMAGE_HEIGHT, IMAGE_WIDTH)

LABELS = ["non_metastatic", "metastatic"]
BATCH_SIZE = 32
NB_NUM = "02-1"
print(tf.__version__)

We'll set some variables using Workspace Data.  

In [None]:
OWNER_EMAIL = ""
USER = ""

if (
    "GOOGLE_PROJECT" in os.environ
):  # This env var is set when running in a Terra workspace
    from firecloud import api as fapi

    WORKSPACE_NAME = os.environ["WORKSPACE_NAME"]
    WORKSPACE_NAMESPACE = os.environ["WORKSPACE_NAMESPACE"]
    OWNER_EMAIL = os.environ["OWNER_EMAIL"]
    # WORKSPACE_ATTRIBUTES contains key-value pairs from the "Workspace Data" section of the Workspace "Data" tab.
    WORKSPACE_ATTRIBUTES = (
        fapi.get_workspace(WORKSPACE_NAMESPACE, WORKSPACE_NAME)
        .json()
        .get("workspace", {})
        .get("attributes", {})
    )

    # set a variable from the workspace attributes
    PYTHON_PACKAGE_GCS_URI_WS = WORKSPACE_ATTRIBUTES["PYTHON_PACKAGE_GCS_URI_WS"]
    print(f"PYTHON_PACKAGE_GCS_URI_WS: {PYTHON_PACKAGE_GCS_URI_WS}")
else:
    print(
        "Not running on Terra: you will need to set some variables manually. See below."
    )

if OWNER_EMAIL:
    USER = OWNER_EMAIL.split("@")[0].replace('.','-')

### Set some variables


**Edit the cell below before running it**.  **Replace the values with the ones for your 'native' GCP project** generated when running the `00_pcam_setup.ipynb` notebook.

In [1]:
PROJECT_ID = "your-project-id"
# The service account you've set up for these Vertex AI examples
TRAINING_SA = "your-sa-name@your-project-id.iam.gserviceaccount.com"
BUCKET_NAME = (
    "your-bucket-name"  # don't include the 'gs://' prefix; that is added below
)
# The TensorBoard instance you created: optional but useful
TENSORBOARD_INSTANCE = (
    "projects/xxxxxxxxxxxx/locations/us-central1/tensorboards/xxxxxxxxxxxxxxxxxxx"
)

The `USER` value will be used to create Vertex resource and job names, so that you can locate your info more easily in the GCP Cloud Console.

In [None]:
if USER == "" or USER is None:
    USER = "your-username"  # <-- CHANGE THIS

Make sure `USER` was set correctly:

In [None]:
print(f"USER: {USER}")
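
The way `USER` is derived from the workspace owner's email in the cell further up can be sketched as a small standalone helper. This is only an illustration: `user_from_email` is a hypothetical name, and the sample address is made up.

```python
def user_from_email(email: str) -> str:
    """Return the part before '@', with dots replaced by hyphens."""
    local_part = email.split("@")[0]
    return local_part.replace(".", "-")


# Made-up address, for illustration only.
print(user_from_email("jane.doe@example.com"))
```

If the printed value of `USER` above doesn't look like this kind of short, hyphenated name, set it by hand before continuing.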

In [None]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

### Ensure that the PROJECT_ID is set correctly and set your region

Ensure that your project ID has been set correctly. This should be the project ID of the ['native' GCP project](https://support.terra.bio/hc/en-us/articles/360051229072-Accessing-advanced-GCP-features-in-Terra).  (This is different from the project for your workspace).

In [None]:
print(PROJECT_ID)
LOCATION = "us-central1"

### Check the service account used for some of the Vertex AI calls

You'll use the service account that you set up in your native GCP project. Ensure that it's set properly.


In [None]:
TRAINING_SA

### Set a Cloud Storage bucket to use for this example


In [None]:
BUCKET = f"gs://{BUCKET_NAME}"
print(BUCKET)

Copy the Python package with the training code to your bucket. This is necessary because the package needs to be in a GCS bucket accessible to Vertex AI in your 'native' GCP project.

In [None]:
PYTHON_PACKAGE_GCS_URI = BUCKET + "/pcam/dist/trainer-0.7.tar.gz"
print(PYTHON_PACKAGE_GCS_URI)

In [None]:
!gsutil cp $PYTHON_PACKAGE_GCS_URI_WS $PYTHON_PACKAGE_GCS_URI

In [None]:
!gsutil ls $PYTHON_PACKAGE_GCS_URI

### Initialize the Vertex AI SDK with your project, location, and bucket settings

In [None]:
aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET)

## Optional: Create an Experiment for tracking training related metadata

The Vertex AI Experiments API is useful for tracking information about your training runs.  You can retrieve the logged information via a pandas dataframe for analysis and comparison.

We'll start by creating an `Experiment`.  Then, in the following sections, we'll define Experiment `runs` and log information about the training jobs to them.

In [None]:
EXPERIMENT_NAME = f"{USER}-pcam-{NB_NUM}-{TIMESTAMP}"
print(f"experiment name: {EXPERIMENT_NAME}")
aiplatform.init(experiment=EXPERIMENT_NAME)

## Train the model on Vertex AI using the Vertex AI SDK

Now we'll define the training code that we'll run on Vertex AI.  You can give the file path to a Python script when defining the training job, or alternatively package your code as a module, upload it to GCS, and give that URI instead.  We'll show examples of both below.

Using the defaults, this training job will take about 2.5 hours to run. 

### Define a training script

This script is a simplified version of the package checked in to GitHub [here](https://github.com/DataBiosphere/terra-example-notebooks).

In [None]:
%%writefile pcamtask.py

import argparse
import json
import logging
import os
import subprocess
from datetime import datetime
from pathlib import Path

import hypertune
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_datasets as tfds


LABELS = ['non_metastatic', 'metastatic']


def generate_camelyon_datasets(batch_size):
  ds, ds_info = tfds.load('patch_camelyon', with_info = True, as_supervised = True)

  #Get the train, validation and test datasets
  training_data = ds['train']
  validation_data = ds['validation']
  test_data = ds['test']

  # shuffle train_data
  buffer_size = 1000
  training_data = training_data.shuffle(buffer_size)

  # batch and prefetch
  training_data = training_data.batch(batch_size).prefetch(1)
  validation_data = validation_data.batch(batch_size).prefetch(1)
  test_data = test_data.batch(batch_size).prefetch(1)

  return(training_data, validation_data, test_data, ds_info)


def get_compiled_model(lr, image_height, image_width):
  base_model = keras.applications.Xception(
      weights="imagenet",
      input_shape=(image_height, image_width, 3),
      include_top=False,
  )

  base_model.trainable = False

  inputs = keras.Input(shape=(image_height, image_width, 3))

  x = layers.Rescaling(1.0 / 255)(inputs)
  x = base_model(x, training=False)
  x = keras.layers.GlobalAveragePooling2D()(x)
  # x = keras.layers.Dropout(0.2)(x)
  outputs = keras.layers.Dense(len(LABELS), activation="softmax")(x)

  model = keras.Model(inputs, outputs)
  loss = tf.keras.losses.SparseCategoricalCrossentropy()

  model.compile(
      optimizer=keras.optimizers.Adam(learning_rate=lr),
      loss=loss,
      metrics=["accuracy"],
  )
  return model


def define_callbacks(log_dir, checkpoint_dir):
  tensorboard_callback = tf.keras.callbacks.TensorBoard(
      log_dir=log_dir, update_freq=300
  )
  model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
      filepath=checkpoint_dir,
      #     save_weights_only=True,
      monitor="val_accuracy",
      mode="max",
      save_freq="epoch"
      #     save_best_only=True
  )
  return (tensorboard_callback, model_checkpoint_callback)


def generate_metrics(validation_set, model):
  ma = tf.keras.metrics.AUC()
  mp = tf.keras.metrics.Precision()
  mr = tf.keras.metrics.Recall()

  all_preds = []
  all_labels = []

  for images, labels in validation_set.take(len(validation_set)):
    predictions = model.predict(images)
    y_preds = np.argsort(predictions, axis=1)[:, -1:]

    all_preds += list(y_preds.flatten())
    all_labels += list(labels.numpy())
    onehot_labels = tf.keras.utils.to_categorical(
        labels, num_classes=len(LABELS)
    )
    ma.update_state(onehot_labels, predictions)
    mp.update_state(onehot_labels, predictions)
    mr.update_state(onehot_labels, predictions)
  return (ma, mp, mr, all_preds, all_labels)


def main():

  logging.getLogger().setLevel(logging.INFO)
  parser = argparse.ArgumentParser(description="ML Trainer")
  parser.add_argument("--epochs", type=int, default=4)
  parser.add_argument("--batch-size", type=int, default=32)
  parser.add_argument("--lr", type=float, default=1e-3)
  parser.add_argument("--image-height", type=int, default=96)
  parser.add_argument("--image-width", type=int, default=96)

  parser.add_argument("--gcs-workdir", required=True)
  parser.add_argument("--gcs-model-savedir", required=True)
  parser.add_argument("--gcs-metrics-path", required=True)
  # required for consistency with the training package args
  parser.add_argument("--ml-task", default='patchcamelyon')
    

  parser.add_argument(
      "--input-data-dir",
      default='/gcs/terra-solutions-ml-exs-debug/data/model_data/training_data/',
  )
  parser.add_argument(
      "--input-data",
      default="gs://terra-solutions-ml-exs-debug/data/model_data.zip",
  )
  parser.add_argument("--copy-data", default=False, action="store_true")
  # dest must match the attribute created by --copy-data (args.copy_data)
  parser.add_argument("--use-fuse", dest="copy_data", action="store_false")

  parser.add_argument("--multi-node", default=False, action="store_true")
  # dest must match the attribute created by --multi-node (args.multi_node)
  parser.add_argument(
      "--single-node", dest="multi_node", action="store_false"
  )

  parser.add_argument("--hptune", default=False, action="store_true")
  parser.add_argument("--non-hptune", dest="hptune", action="store_false")

  args = parser.parse_args()
  logging.info("Tensorflow version %s", tf.__version__)

  timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
  print(f"timestamp: {timestamp}")

  log_dir = os.environ["AIP_TENSORBOARD_LOG_DIR"]
  print(f" using (autogenerated) tb log dir: {log_dir}")

  checkpoint_dir = f"{args.gcs_workdir}/checkpoints/{timestamp}/checkpoints"
  if args.hptune:  # add the trial id to the dir path
    trial_id = os.environ.get("CLOUD_ML_TRIAL_ID")
    checkpoint_dir = f"{checkpoint_dir}/{trial_id}"
  print(f"checkpoint dir: {checkpoint_dir}")

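  if args.copy_data:
    # NOTE: copy_data() is not defined in this simplified script, so passing
    # --copy-data would raise a NameError here.
    # copy and unzip the dataset to the local file system
    local_dir = "."
    copy_data(args.input_data, local_dir)
    data_dir = f"{local_dir}/model_data/training_data"
    print(f"training data dir: {data_dir}")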

  # define and compile the model
  print("creating the model..")
  if args.multi_node:
    print("using MultiWorkerMirroredStrategy")
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
  else:
    strategy = tf.distribute.MirroredStrategy()

  print("Number of devices: {}".format(strategy.num_replicas_in_sync))

  image_size = (int(args.image_height), int(args.image_width))
  print(f'using image size: {image_size}')
  (training_set, validation_set, _, _) = generate_camelyon_datasets(args.batch_size)


  if strategy.num_replicas_in_sync > 1:
    print("Using mirrored strategy.")
    with strategy.scope():
      model = get_compiled_model(args.lr, int(args.image_height), int(args.image_width))
  else:
    model = get_compiled_model(args.lr, int(args.image_height), int(args.image_width))

  model.summary()

  # define callbacks
  (tensorboard_callback, model_checkpoint_callback) = define_callbacks(
      log_dir, checkpoint_dir
  )

  # train the model
  print(f"training the model with lr {args.lr}")
  model.fit(
      training_set,
      epochs=args.epochs,
      callbacks=[tensorboard_callback, model_checkpoint_callback],
      validation_data=validation_set,
  )
    
  ## fine-tuning for patchcamelyon
  print("Now fine-tuning the PatchCamelyon model")
  fine_tuning_epochs = 5
  for layer in model.layers:
    layer.trainable = True
  model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])
  model.fit(training_set, epochs=fine_tuning_epochs, callbacks=[tensorboard_callback,
            model_checkpoint_callback], validation_data=validation_set)    

  print("saving the model to GCS")
  if args.hptune:
    model_path_gcs = f"{args.gcs_model_savedir}/{trial_id}"
  else:
    if args.gcs_model_savedir == "AIP_MODEL_DIR" and os.getenv("AIP_MODEL_DIR"):
      model_path_gcs = os.getenv("AIP_MODEL_DIR")
    else:
      model_path_gcs = args.gcs_model_savedir

  print(f"GCS saved model path: {model_path_gcs}")
  model.save(model_path_gcs)

  # get some metrics info
  print(f"model history: {model.history.history}")
  val_accuracy = (model.history.history["val_accuracy"])[-1]
  val_loss = (model.history.history["val_loss"])[-1]

  (ma, mp, mr, all_preds, all_labels) = generate_metrics(
      validation_set, model
  )

  # write out metrics info for 'eval' component
  metrics_info = {
      "val_accuracy": val_accuracy,
      "val_loss": val_loss,
      "auc": f"{ma.result().numpy()}",
      "precision": f"{mp.result().numpy()}",
      "recall": f"{mr.result().numpy()}",
      "all_labels": f"{all_labels}",
      "all_preds": f"{all_preds}",
      "num_classes": len(LABELS),
  }
  print(f"AUC: {ma.result().numpy()}")
  print(f"Precision: {mp.result().numpy()}")
  print(f"Recall: {mr.result().numpy()}")

  metrics_info_str = json.dumps(metrics_info)
  print(f"metrics info json string: {metrics_info_str}")

  Path(args.gcs_metrics_path).mkdir(parents=True, exist_ok=True)
  metrics_info_file = f"{args.gcs_metrics_path}/metrics.json"

  print(f"writing metrics: {metrics_info_str} to {metrics_info_file}")
  with open(metrics_info_file, "w") as f:
    f.write(metrics_info_str)

  hpt = hypertune.HyperTune()
  hpt.report_hyperparameter_tuning_metric(
      hyperparameter_metric_tag="accuracy",
      metric_value=val_accuracy,
      global_step=args.epochs,
  )


if __name__ == "__main__":
  main()
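
The script defines several paired on/off flags, for example `--copy-data` and `--use-fuse`. For such a pair to toggle one setting, both arguments have to write to the same `dest`, and `argparse` derives the default `dest` by replacing hyphens with underscores. A minimal standalone sketch of the pattern, using only the standard library (this small parser is illustrative and is not the full trainer):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="paired-flag sketch")
    # --copy-data stores True in args.copy_data (argparse maps '-' to '_').
    parser.add_argument("--copy-data", default=False, action="store_true")
    # --use-fuse names the same dest explicitly so it can switch the value off.
    parser.add_argument("--use-fuse", dest="copy_data", action="store_false")
    return parser


parser = build_parser()
print(parser.parse_args([]).copy_data)
print(parser.parse_args(["--copy-data"]).copy_data)
print(parser.parse_args(["--copy-data", "--use-fuse"]).copy_data)
```

When both flags are given, the one that appears last on the command line wins.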


Set some variables that we'll use to configure the Vertex AI SDK calls. 

We'll use GPUs for training, and CPUs for serving.  The images below need to be consistent with that.

In [None]:
TRAIN_IMAGE = "us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-6:latest"
DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-6:latest"
# alternately, to serve the model with GPUs: us-docker.pkg.dev/vertex-ai/prediction/tf2-gpu.2-6:latest

In [None]:
ts = int(time.time())

MODEL_DISPLAY_NAME = f"{USER}-pcam{NB_NUM}-{ts}"
EPOCHS = (
    2  # set this to 3 or 4 for greater accuracy, or reduce to 1 for quicker training
)
GCS_WORKDIR = f"gs://{BUCKET_NAME}/{MODEL_DISPLAY_NAME}"

# GCS_MODEL_SAVEDIR = f'{GCS_WORKDIR}/{ts}'
GCS_MODEL_SAVEDIR = "AIP_MODEL_DIR"  # indicate to use Vertex AI-generated dir in order to automate the model upload

GCS_METRICS_PATH = f"/gcs/{BUCKET_NAME}/{MODEL_DISPLAY_NAME}/metrics/{ts}"
print(f"MODEL_DISPLAY_NAME: {MODEL_DISPLAY_NAME}")
print(f"model savedir: {GCS_MODEL_SAVEDIR}, GCS_METRICS_PATH: {GCS_METRICS_PATH}")

CMDARGS = [
    "--epochs",
    str(EPOCHS),
    "--gcs-workdir",
    GCS_WORKDIR,
    "--gcs-model-savedir",
    GCS_MODEL_SAVEDIR,
    "--gcs-metrics-path",
    GCS_METRICS_PATH,
    "--image-height",
    str(IMAGE_HEIGHT),
    "--image-width",
    str(IMAGE_WIDTH),
    "--ml-task",
    "patchcamelyon",
]
print(CMDARGS)

Here we'll indicate the machine type and accelerator type and number for the training job.

In [None]:
TRAIN_GPU, TRAIN_NGPU = (aip.AcceleratorType.NVIDIA_TESLA_T4, 2)
TRAIN_COMPUTE = "n1-highmem-8"

### Create and run a training job

Now, we're ready to define and run a training job on Vertex AI. 

#### Log information about the training job to the Vertex Experiments API

Before submitting the job, we'll log some information about this training job to the Experiment we created above, starting a new 'run' for the job.

In [None]:
aiplatform.start_run("train-run-1")

Gather info about the job parameters and log that to the Experiments run info.

In [None]:
args_dict = {CMDARGS[i]: CMDARGS[i + 1] for i in range(0, len(CMDARGS), 2)}
print(args_dict)
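
The dictionary comprehension above relies on `CMDARGS` being a flat list of flag/value pairs. A standalone sketch of the same pairing, with a guard for the case where a value-less flag such as `--hptune` is added to the list (the helper name `pair_cli_args` and the sample list are illustrative):

```python
def pair_cli_args(cmdargs):
    """Turn ['--flag', 'value', ...] into {'--flag': 'value', ...}."""
    if len(cmdargs) % 2 != 0:
        raise ValueError("expected an even number of items (flag/value pairs)")
    return {cmdargs[i]: cmdargs[i + 1] for i in range(0, len(cmdargs), 2)}


sample_args = ["--epochs", "2", "--image-height", "96", "--image-width", "96"]
print(pair_cli_args(sample_args))
```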

In [None]:
# you can comment out the addition of the TENSORBOARD_INSTANCE if that value is not set
HYPERPARAMS = {
    "model_display_name": MODEL_DISPLAY_NAME,
    "tensorboard_instance": TENSORBOARD_INSTANCE,
}
HYPERPARAMS = {**HYPERPARAMS, **args_dict}
print(HYPERPARAMS)

In [None]:
aiplatform.log_params(HYPERPARAMS)

Just for fun, we can see what we have logged about the Experiments run so far:

In [None]:
dataframe = aiplatform.get_experiment_df(experiment=EXPERIMENT_NAME)
dataframe

#### Define and submit the training job

In [None]:
job = aiplatform.CustomTrainingJob(
    display_name=MODEL_DISPLAY_NAME,
    script_path="pcamtask.py",
    container_uri=TRAIN_IMAGE,
    requirements=["cloudml-hypertune"],
    model_serving_container_image_uri=DEPLOY_IMAGE,
)

The alternate version below uses a 'package', not a script:

In [None]:
PYTHON_PACKAGE_GCS_URI

In [None]:
# job = aiplatform.CustomPythonPackageTrainingJob(
#    display_name=MODEL_DISPLAY_NAME,
#    python_package_gcs_uri=PYTHON_PACKAGE_GCS_URI,
#    python_module_name='trainer.task',
#    container_uri=TRAIN_IMAGE,
#    model_serving_container_image_uri=DEPLOY_IMAGE,
# )

If you didn't set up a Managed TensorBoard instance, you can comment out the `tensorboard` arg below before you run the training job.

In [None]:
# confirm the service account used for training
TRAINING_SA

In [None]:
model = job.run(
    model_display_name=MODEL_DISPLAY_NAME,
    args=CMDARGS,
    replica_count=1,
    machine_type=TRAIN_COMPUTE,
    accelerator_type=TRAIN_GPU.name,
    accelerator_count=TRAIN_NGPU,
    tensorboard=TENSORBOARD_INSTANCE,
    service_account=TRAINING_SA,
    sync=False,
)

You can click on the 'custom job' link generated when you submit the training job, to view information about the job in the Cloud Console. From that page you can click on the job logs to see output of the training job as it runs.

The `sync=False` arg to the training job means that the call is non-blocking; it will return even though training is still running. (However, you'll see it generate some output to the current notebook cell while it runs).

In [None]:
print("#---you will need the following if your notebook loses context.")
print(f'GCS_METRICS_PATH = "{GCS_METRICS_PATH}"')
print(f'MODEL_DISPLAY_NAME = "{MODEL_DISPLAY_NAME}"')
print(f'EXPERIMENT_NAME = "{EXPERIMENT_NAME}"')
print(f'aiplatform.init(experiment="{EXPERIMENT_NAME}")')
print("#----------\n")
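
As an alternative to copying the printed lines by hand, the same values could be written to a small JSON file on a disk that survives the restart, and read back afterwards. A minimal sketch, assuming a hypothetical file name `pcam_context.json` and placeholder values; after reloading you would still need to call `aiplatform.init(experiment=...)` yourself.

```python
import json
from pathlib import Path


def save_context(path, **values):
    """Write the given name/value pairs to a JSON file."""
    Path(path).write_text(json.dumps(values, indent=2))


def load_context(path):
    """Read the name/value pairs back as a dict."""
    return json.loads(Path(path).read_text())


# Placeholder values; in the notebook these would be the real variables.
save_context(
    "pcam_context.json",
    GCS_METRICS_PATH="/gcs/your-bucket-name/your-model-name/metrics/0",
    MODEL_DISPLAY_NAME="your-model-name",
    EXPERIMENT_NAME="your-experiment-name",
)
restored = load_context("pcam_context.json")
print(restored["MODEL_DISPLAY_NAME"])
```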

**Wait until training has finished to proceed with the rest of this example**.

## After training finishes

**Wait until training has completed** to run this section.


### If you have lost notebook context during training

If you've lost notebook context during training, do two things after training finishes:

1) First, **re-run the "Config and setup" section** of this notebook.
2) Then, copy and run the output from the cell above (the output that includes: `you will need the following...`), to re-set some of the training config before proceeding. Note that you're not only setting some variables, but also re-setting your Experiment context via the Vertex AI SDK in order to log some more information to it. Double check that you're only copying and pasting the output of the print statements.

If you have any issues running the following sections, double check that you've completed the steps above.

## Retrieve and save the training metrics to the Experiments `run` info

**Wait until training has completed** to run this section.

If you have trouble, you can skip this part of the example, which is just grabbing and logging some information about the training metrics.

In [None]:
# Ensure that the GCS_METRICS_PATH var is set correctly.
GCS_METRICS_PATH

Grab the file with metrics info that was generated as part of the model training process.

In [None]:
metrics_file = f"{GCS_METRICS_PATH}/metrics.json".replace("/gcs/", "gs://")
metrics_file
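
Inside a Vertex AI training job, Cloud Storage buckets are mounted under `/gcs/` via Cloud Storage FUSE, which is why the training script can write `metrics.json` with ordinary file calls, while `gsutil` in this notebook needs the `gs://` form. The conversion done above with `str.replace` can be sketched as a pair of helpers that only touch the prefix (the function names and the sample path are illustrative):

```python
FUSE_PREFIX = "/gcs/"
URI_PREFIX = "gs://"


def fuse_path_to_gcs_uri(path: str) -> str:
    """Convert '/gcs/bucket/obj' to 'gs://bucket/obj'."""
    if not path.startswith(FUSE_PREFIX):
        raise ValueError(f"not a {FUSE_PREFIX} path: {path}")
    return URI_PREFIX + path[len(FUSE_PREFIX):]


def gcs_uri_to_fuse_path(uri: str) -> str:
    """Convert 'gs://bucket/obj' to '/gcs/bucket/obj'."""
    if not uri.startswith(URI_PREFIX):
        raise ValueError(f"not a {URI_PREFIX} URI: {uri}")
    return FUSE_PREFIX + uri[len(URI_PREFIX):]


# Made-up path, for illustration only.
print(fuse_path_to_gcs_uri("/gcs/your-bucket-name/run/metrics/metrics.json"))
```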

Parse the metrics info.  For the purposes of experiment logging, we won't include the label and prediction arrays that were generated for building a confusion matrix.

In [None]:
!gsutil cat $metrics_file > temp_metrics.json

In [None]:
with open("temp_metrics.json") as fp:
    metrics = json.load(fp)
    _ = metrics.pop("all_labels")
    _ = metrics.pop("all_preds")

In [None]:
metrics = {k: float(v) for k, v in metrics.items()}
metrics
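
To make the shape of `metrics.json` concrete, here is a self-contained sketch of the same parsing steps applied to a sample document. The field names match what the training script writes; the numeric values are made up for illustration, and `scalar_metrics` is a hypothetical helper name.

```python
import json

# Array-valued fields that are dropped before logging to the Experiment run.
ARRAY_FIELDS = ("all_labels", "all_preds")


def scalar_metrics(metrics_json: str) -> dict:
    """Parse the metrics JSON, drop array fields, and convert values to float."""
    raw = json.loads(metrics_json)
    return {k: float(v) for k, v in raw.items() if k not in ARRAY_FIELDS}


# Made-up values; the script stores some metrics as strings.
sample = json.dumps(
    {
        "val_accuracy": 0.5,
        "val_loss": 0.25,
        "auc": "0.75",
        "precision": "0.5",
        "recall": "0.25",
        "all_labels": "[0, 1]",
        "all_preds": "[0, 0]",
        "num_classes": 2,
    }
)
print(scalar_metrics(sample))
```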

Ensure we're using the correct 'run' context within the Experiment:

In [None]:
aiplatform.start_run("train-run-1")

Log the metrics info to the run:

In [None]:
aiplatform.log_metrics(metrics)

We can take a look at the info we've logged so far to the Experiment run:

In [None]:
dataframe = aiplatform.get_experiment_df(experiment=EXPERIMENT_NAME)
dataframe

## Deploy the trained model to an endpoint

**Wait until training has completed** to run this section.

In [None]:
TRAFFIC_SPLIT = {"0": 100}

MIN_NODES = 1
MAX_NODES = 1
DEPLOY_COMPUTE = "n1-standard-4"
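
In `TRAFFIC_SPLIT`, the key `"0"` stands for the model being deployed by this call, and, as described in the Vertex AI documentation, the percentages across all keys must add up to 100. A small sanity check along those lines might look like the sketch below; `check_traffic_split` is a hypothetical helper, not part of the SDK.

```python
def check_traffic_split(traffic_split: dict) -> None:
    """Raise ValueError unless the percentages are valid and sum to 100."""
    for model_id, percent in traffic_split.items():
        if not isinstance(percent, int) or not 0 <= percent <= 100:
            raise ValueError(f"invalid percentage for {model_id!r}: {percent!r}")
    total = sum(traffic_split.values())
    if total != 100:
        raise ValueError(f"traffic percentages sum to {total}, expected 100")


check_traffic_split({"0": 100})  # all traffic to the newly deployed model
print("traffic split OK")
```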

In [None]:
# check whether your model variable is set
model

### If your `model` variable is no longer set

If you've lost notebook context and your `model` variable is no longer set, you can reconstitute it before running the `deploy` method below.  The code below attempts to do this automatically (uncomment before running), but you can also re-set the model variable via its model ID.

In [None]:
## uncomment the code in this cell to attempt to get the model object automatically.
# modellist = aiplatform.Model.list()
# for model in modellist:
#     if f'{USER}-pcam{NB_NUM}' in model.display_name:
#         print(f'found a match for USER {USER}: {model}, {model.display_name}')
#         break

# print(f'model is: {model}')

If the above didn't work, then you can recreate the model object via its ID.
To find the model ID, one easy way is to visit the list of models in the Cloud Console: https://console.cloud.google.com/vertex-ai/models.  The name of the model should include your username and the 'pcam' string.  Copy the listed ID for that model. Then, edit and run the cell below.

In [None]:
# You can use this code to re-set your model variable if you've lost notebook context after training.
# model = aiplatform.Model('xxxxxxxxxxxxxxxxxxx')  # <-- CHANGE this to your model ID
# model

In [None]:
# Ensure that MODEL_DISPLAY_NAME is set to the value used for the training request.
MODEL_DISPLAY_NAME

In [None]:
# make sure the 'model' var and 'MODEL_DISPLAY_NAME' are set before running this cell.
endpoint = model.deploy(
    deployed_model_display_name=MODEL_DISPLAY_NAME,
    traffic_split=TRAFFIC_SPLIT,
    machine_type=DEPLOY_COMPUTE,
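    # To serve with GPUs, set an accelerator type and a nonzero accelerator_count, e.g.:
    # accelerator_type=aip.AcceleratorType.NVIDIA_TESLA_T4.name,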
    accelerator_count=0,
    min_replica_count=MIN_NODES,
    max_replica_count=MAX_NODES,
)

## Prediction

Now, you can send prediction requests to the deployed model at the endpoint you created.

Make sure that the `endpoint` var is set.  If you've lost notebook context, you can reconstitute it by specifying the Endpoint ID as shown below. To find the endpoint ID, one easy way is to visit the list of endpoints in the Cloud Console: https://console.cloud.google.com/vertex-ai/endpoints.

In [None]:
# endpoint = aiplatform.Endpoint('xxxxxxxxxxxxxxxxxxx')
endpoint

In [None]:
LABELS = ["non_metastatic", "metastatic"]

Copy a test image.  This one has label 0 (non_metastatic).

In [None]:
!gsutil cp gs://fc-b60eeef5-8162-47a8-8114-d8dd82b65653/data/patch_camelyon/label_0/download.png .

Resize the image, and render it as a sanity check.

In [None]:
image_file = "./download.png"
display(IPython.display.Image(image_file))

img1 = Image.open(image_file)
# The model was built for 96 x 96 inputs; PIL's resize takes (width, height).
img2 = img1.resize((IMAGE_WIDTH, IMAGE_HEIGHT), resample=PIL.Image.NEAREST)

In [None]:
image_data = np.array(img2)
img_array = np.float32(image_data)[:, :, :3]

In [None]:
img_array2 = img_array.tolist()
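
Because the model was built with a fixed input shape (`IMAGE_HEIGHT` x `IMAGE_WIDTH` x 3 channels), it can help to check the shape of the nested list before sending it to the Endpoint. A minimal pure-Python sketch; `nested_shape` is a hypothetical helper and the all-zero image is only a stand-in for `img_array2`.

```python
def nested_shape(data):
    """Return the shape of a regularly nested list, e.g. (96, 96, 3)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0] if data else None
    return tuple(shape)


# Stand-in for img_array2: a 96 x 96 image with 3 channels of zeros.
dummy_image = [[[0.0] * 3 for _ in range(96)] for _ in range(96)]
print(nested_shape(dummy_image))
```

In the notebook, the equivalent check would be to compare `nested_shape(img_array2)` with `(IMAGE_HEIGHT, IMAGE_WIDTH, 3)`.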

Send the image data to the Endpoint where your model was deployed, for online prediction.

In [None]:
predictions = endpoint.predict(instances=[img_array2])

View the prediction results:

In [None]:
predictions

In [None]:
image_predictions = predictions.predictions[0]
image_predictions

Check whether the image prediction matches its label (that is, its actual class):

In [None]:
print(
    f"image is predicted to be: {LABELS[image_predictions.index(max(image_predictions))]}"
)
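
The lookup above picks the index of the largest probability and maps it to a label. The same step can be sketched as a helper that also returns the score, shown here with made-up probabilities (`top_label` is a hypothetical name):

```python
LABELS = ["non_metastatic", "metastatic"]


def top_label(probabilities, labels=LABELS):
    """Return (label, probability) for the highest-scoring class."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best], probabilities[best]


# Made-up softmax output, for illustration only.
print(top_label([0.9, 0.1]))
```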

## Cleanup

Delete the endpoint and model that you created.  You'll need to first undeploy the model from the endpoint before you can delete either.

The training instances are automatically torn down after the job completes.

If the GCS bucket that you used is not set to automatically delete old files, then you can clean up your GCS bucket as well.  An easy way to do this is via the [Cloud Console UI](https://console.cloud.google.com/storage/browser).

In [None]:
# print(f"Endpoint: {endpoint}\n{endpoint.list_models()}")
# print(f"Model: {model}")

In [None]:
endpoint.undeploy_all()

In [None]:
# Delete the model
# if you've lost notebook context, you can first reconstruct the model object as follows.
# Edit the model ID for your model.
# model = aiplatform.Model('xxxxxxxxxxxxxxxxxxx')  # <-- CHANGE THIS
model.delete()

In [None]:
# Delete the endpoint
# if you've lost notebook context, you can first reconstruct the Endpoint object as follows.
# Edit the ID for your endpoint.
# endpoint = aiplatform.Endpoint('xxxxxxxxxxxxxxxxxxx')  # <-- CHANGE THIS
endpoint.delete()

In [None]:
# Delete the Experiment
# This code requires google-cloud-aiplatform >=1.8
c = aiplatform.metadata._Context(EXPERIMENT_NAME)
c.delete()

## Provenance

In [None]:
import datetime as dt  # alias avoids shadowing the `datetime` class imported earlier

print(dt.datetime.now())

In [None]:
!pip3 freeze

--------------------------------
Copyright 2021 Verily Life Sciences LLC

Use of this source code is governed by a BSD-style  
license that can be found in the LICENSE file or at  
https://developers.google.com/open-source/licenses/bsd