In [None]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/amygdala/code-snippets/blob/master/ml/vertex_pipelines/pytorch/cifar/pytorch_cifar10_vertex_pipelines.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/amygdala/code-snippets/blob/master/ml/vertex_pipelines/pytorch/cifar/pytorch_cifar10_vertex_pipelines.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/ai/platform/notebooks/deploy-notebook?download_url=https://raw.githubusercontent.com/amygdala/code-snippets/master/ml/vertex_pipelines/pytorch/cifar/pytorch_cifar10_vertex_pipelines.ipynb">
      Open in Google Cloud Notebooks
    </a>
  </td>
</table>

# Vertex Pipelines: PyTorch ResNet CIFAR10 end-to-end example

## Overview

This notebook shows two variants of a PyTorch ResNet [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) end-to-end example using [Vertex Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines).  The example code is available [on GitHub](https://github.com/amygdala/code-snippets/tree/master/ml/vertex_pipelines/pytorch/cifar).

Thanks to the PyTorch team at Facebook for some of the underlying code and much helpful advice.

The first variant trains the model directly as a Vertex Pipelines step, using 1 GPU. The second variant trains the model using Vertex AI custom training, using 2 GPUs by default.

In both cases, after training, the model is uploaded to Vertex AI and deployed to an endpoint, so that it can be used for prediction.



### Set up your local development environment

**If you are using Colab or Google Cloud Notebooks**, your environment already meets
all the requirements to run this notebook. You can skip this step.

**Otherwise**, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development
environment](https://cloud.google.com/python/setup) and the [Jupyter
installation guide](https://jupyter.org/install) provide detailed instructions
for meeting these requirements. The following steps provide a condensed set of
instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

1. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

1. [Install
   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3. Activate the virtual environment.

1. To install Jupyter, run `pip install jupyter` on the
command-line in a terminal shell.

1. To launch Jupyter, run `jupyter notebook` on the command-line in a terminal shell.

1. Open this notebook in the Jupyter Notebook Dashboard.

### Install additional packages



In [None]:
PROJECT_ID = 'your-project-id'  # <---CHANGE THIS

In [None]:
!gcloud config set project {PROJECT_ID}

On Colab, authenticate first:

In [None]:
import sys
if 'google.colab' in sys.modules:
  from google.colab import auth
  auth.authenticate_user()

Then, install the libraries.

In [None]:
import sys
if 'google.colab' in sys.modules:
  USER_FLAG = ''
else:
  USER_FLAG = '--user'

In [None]:
!python3 -m pip install {USER_FLAG} torch scikit-learn webdataset torchvision pytorch-lightning boto3 google-cloud-build --upgrade

In [None]:
!pip3 install {USER_FLAG} google-cloud-aiplatform==1.0.0 --upgrade
!pip3 install {USER_FLAG} kfp google-cloud-pipeline-components --upgrade

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

Check the versions of the packages you installed.  The KFP SDK version should be >=1.6.

In [None]:
!python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"
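
The version requirement above can also be checked programmatically. The following is a minimal sketch, not part of the original notebook: `meets_minimum` is a hypothetical helper that compares only the major and minor components of a version string, which is enough for the `>=1.6` check.

```python
import itertools


def _leading_int(piece: str) -> int:
    """Return the leading digits of a version component as an int (0 if none)."""
    digits = "".join(itertools.takewhile(str.isdigit, piece))
    return int(digits) if digits else 0


def meets_minimum(installed: str, minimum: str = "1.6") -> bool:
    """Compare only the (major, minor) parts of two dotted version strings."""

    def major_minor(version: str) -> tuple:
        parts = version.split(".")[:2]
        parts += ["0"] * (2 - len(parts))  # pad "2" to "2.0"
        return tuple(_leading_int(p) for p in parts)

    return major_minor(installed) >= major_minor(minimum)


# Hypothetical version strings, used only for illustration.
for candidate in ["1.4.1", "1.6.4", "1.10.0"]:
    print(candidate, meets_minimum(candidate))
```

In the notebook, you could call `meets_minimum(kfp.__version__)` after the kernel restart.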

## Before you begin

This notebook does not require a GPU runtime.

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the Vertex AI API and Compute Engine API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,compute_component). 
Also [enable the Cloud Build API](https://console.cloud.google.com/flows/enableapi?apiid=cloudbuild.googleapis.com).

1. If you are running this notebook locally, you will need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
import os
PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    # gcloud prints nothing when no default project is configured
    PROJECT_ID = shell_output[0] if shell_output else ""
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "python-docs-samples-tests"  # @param {type:"string"}

### Authenticate your Google Cloud account

**If you are using AI Platform Notebooks**, your environment is already
authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and
   click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "AI Platform"
into the filter box, and select
   **AI Platform Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

5. Click **Create**. A JSON file that contains your key downloads to your
local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [None]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# If on AI Platform, then don't execute this code
if not os.path.exists("/opt/deeplearning/metadata/env_version"):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

### Create a Cloud Storage bucket as necessary

You will need a Cloud Storage bucket for this example.  If you don't have one that you want to use, you can make one now.


Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets.

You may also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. Make sure to [choose a region where AI Platform (Unified) services are
available](https://cloud.google.com/ai-platform-unified/docs/general/locations#available_regions). You may
not use a Multi-Regional Storage bucket for training with AI Platform.

**Change the bucket name below** before running the next cell.

In [None]:
BUCKET_NAME = "gs://[your-bucket-name]"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "gs://[your-bucket-name]":
    from datetime import datetime
    TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")  
    BUCKET_NAME = "gs://" + PROJECT_ID + "aip-" + TIMESTAMP
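
Because the bucket name must be globally unique and follow Cloud Storage naming rules, it can help to sanity-check the value before trying to create the bucket. The sketch below is an illustration only: `default_bucket_uri` and `looks_like_bucket_uri` are hypothetical helpers, the regular expression covers only a simplified subset of the naming rules, and the hyphen before `aip-` is an assumption added for readability (the cell above concatenates without it).

```python
import re
from datetime import datetime
from typing import Optional

# Simplified rule: 3-63 characters, lowercase letters, digits, '.', '_' or '-',
# starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def default_bucket_uri(project_id: str, now: Optional[datetime] = None) -> str:
    """Build a timestamped bucket URI similar to the fallback in the cell above."""
    timestamp = (now or datetime.now()).strftime("%Y%m%d%H%M%S")
    return f"gs://{project_id}-aip-{timestamp}"


def looks_like_bucket_uri(uri: str) -> bool:
    """Return True if the URI has a gs:// scheme and a plausible bucket name."""
    if not uri.startswith("gs://"):
        return False
    return _BUCKET_NAME_RE.match(uri[len("gs://"):]) is not None


# "my-project" is a placeholder project ID.
candidate = default_bucket_uri("my-project")
print(candidate, looks_like_bucket_uri(candidate))
```

A check like this only catches malformed names; it cannot tell you whether the name is already taken by another project.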

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION $BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_NAME

### Import libraries and define constants



Define some constants. 


In [None]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

USER = 'your-user-name' # <---CHANGE THIS
PIPELINE_ROOT = '{}/pipeline_root/{}'.format(BUCKET_NAME, USER)

PIPELINE_ROOT

Do some imports:

In [None]:
import json
from typing import NamedTuple

from kfp.v2 import compiler
from kfp.v2 import dsl
from kfp.v2.dsl import (
    component,
    InputPath,
    OutputPath,
    Input,
    Output,
    Artifact,
    Dataset,
    Model,
    ClassificationMetrics,
    Metrics,
)

from kfp.v2.google.client import AIPlatformClient

from google_cloud_pipeline_components import aiplatform as gcc_aip
from google.cloud import aiplatform

## Define the pipeline **components**

This notebook shows two variants of an end-to-end PyTorch pipeline. They differ in the training *component* (that is, pipeline step). 

Some of the components used in these pipelines are drawn from the prebuilt set of components defined in [`google_cloud_pipeline_components`](https://github.com/kubeflow/pipelines/tree/master/components/google-cloud). These make it easy to access Vertex AI services.

Others are 'custom' components defined directly in this notebook, as Python-function-based components. Lightweight Python function-based components make it easier to iterate quickly by letting you build your component code as a Python function and generating the component specification for you.

You will notice a `@component` decorator arg named `output_component_file`.  When the components are evaluated, a component `yaml` spec file is generated. While we don't show it in this example, the component yaml files can be shared & placed under version control, and used later to define a pipeline step.

All of the custom components are defined in this section, with the exception of the second version of the training step, which is defined in a section below.
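
To make the idea of "generating the component specification for you" more concrete, here is a toy sketch of a decorator that inspects a function's signature and records its inputs. It is not the KFP implementation: `toy_component` and `add_suffix` are hypothetical names, and the real `@component` decorator produces a much richer spec (container image, command, typed artifacts).

```python
import inspect
import json


def toy_component(func):
    """Toy stand-in for a component decorator: attach a spec derived from the signature."""
    signature = inspect.signature(func)
    func.spec = {
        "name": func.__name__,
        "inputs": [
            {
                "name": name,
                # Use the annotation's class name when available, e.g. "str".
                "type": getattr(param.annotation, "__name__", str(param.annotation)),
            }
            for name, param in signature.parameters.items()
        ],
    }
    return func


@toy_component
def add_suffix(text: str, repeat: int) -> str:
    return text + "!" * repeat


# The function still works as plain Python, and now carries a serializable spec.
print(json.dumps(add_suffix.spec, indent=2))
```

The point of the sketch is only that the spec is derived from the function definition, so the two cannot drift apart.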



We'll start by setting the container images that we'll use for some of the components.  You can find the Dockerfiles for these images in the example repo: [Dockerfile](https://github.com/amygdala/code-snippets/blob/master/ml/vertex_pipelines/pytorch/cifar/Dockerfile) and [Dockerfile-gpu](https://github.com/amygdala/code-snippets/blob/master/ml/vertex_pipelines/pytorch/cifar/Dockerfile-gpu), respectively.


In [None]:
CONTAINER_URI = "gcr.io/google-samples/pytorch-pl:v2"
GPU_CONTAINER_URI = "gcr.io/google-samples/pytorch-pl-gpu:v5"

### Define the 'preprocess' component

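This component downloads the CIFAR10 dataset, splits the training set into training and validation subsets (80/20, stratified by class), and writes the train, validation, and test sets as WebDataset tar shards under the output `Dataset` artifact's path.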

In [None]:
@component(
    base_image=CONTAINER_URI,
    output_component_file="cifar_preproc.yaml",
)
def cifar_preproc(
    cifar_dataset: Output[Dataset],
):

    import subprocess
    import logging
    from pathlib import Path

    import torchvision
    import webdataset as wds
    from sklearn.model_selection import train_test_split

    logging.getLogger().setLevel(logging.INFO)
    logging.info("Dataset path is: %s", cifar_dataset.path)
    output_pth = cifar_dataset.path

    Path(output_pth).mkdir(parents=True, exist_ok=True)

    trainset = torchvision.datasets.CIFAR10(
        root="./", train=True, download=True
    )
    testset = torchvision.datasets.CIFAR10(
        root="./", train=False, download=True
    )

    Path(output_pth + "/train").mkdir(parents=True, exist_ok=True)
    Path(output_pth + "/val").mkdir(parents=True, exist_ok=True)
    Path(output_pth + "/test").mkdir(parents=True, exist_ok=True)

    random_seed = 25
    y = trainset.targets
    trainset, valset, y_train, y_val = train_test_split(
        trainset,
        y,
        stratify=y,
        shuffle=True,
        test_size=0.2,
        random_state=random_seed,
    )

    for split, name in [(trainset, "train"), (valset, "val"), (testset, "test")]:
        with wds.ShardWriter(
            f"{output_pth}/{name}/{name}-%d.tar",
            maxcount=1000,
        ) as sink:
            for index, (image, cls) in enumerate(split):
                sink.write(
                    {"__key__": "%06d" % index, "ppm": image, "cls": cls}
                )

    entry_point = ["ls", "-R", output_pth]
    run_code = subprocess.run(entry_point, stdout=subprocess.PIPE)
    print(run_code.stdout)
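
As a rough check on what the preprocessing step should produce, the sketch below estimates the shard files written for each split. It assumes the standard CIFAR10 sizes (50,000 training and 10,000 test images), the 80/20 split used above, that shard numbering starts at 0, and that shards roll over only on the `maxcount=1000` limit; `expected_shards` is a hypothetical helper, not part of the component.

```python
import math


def expected_shards(num_samples: int, split: str, maxcount: int = 1000) -> list:
    """List the shard paths (relative to the dataset root) expected for one split."""
    num_shards = math.ceil(num_samples / maxcount)
    return [f"{split}/{split}-{i}.tar" for i in range(num_shards)]


# 50,000 training images split 80/20, plus the 10,000-image test set.
split_sizes = {"train": 40_000, "val": 10_000, "test": 10_000}
for split, size in split_sizes.items():
    shards = expected_shards(size, split)
    print(split, len(shards), shards[0], shards[-1])
```

Comparing these counts with the `ls -R` output printed by the component is a quick way to spot a truncated run.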


### Define a component to create torchserve `Dockerfile` and `config.properties` files from the pipeline params

This component creates configuration files that will be used to deploy the trained model.  It can be run concurrently with other work.

The `config.properties` file will be used to create the model archive after training.

For this example, the torchserve-based container is using a GPU base image, and we will serve the model using a GPU-enabled instance.


In [None]:
@component(
    output_component_file="cifar_config.yaml",
)
def cifar_config(
    mar_model_name: str,
    version: str,
    port: int,
    cifar_config: Output[Artifact],
):
    import os
    from pathlib import Path

    Path(cifar_config.path).mkdir(parents=True, exist_ok=True)

    config_properties = f"""inference_address=http://0.0.0.0:{port}
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
enable_metrics_api=true
metrics_format=prometheus
number_of_netty_threads=4
job_queue_size=10
service_envelope=kfserving
model_store=/home/model-server/model-store
model_snapshot={{"name":"startup.cfg","modelCount":1,"models":{{"{mar_model_name}":{{"{version}":{{"defaultVersion":true,"marName":"{mar_model_name}.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":5000,"responseTimeout":120}}}}}}}}
"""

    # write to artifact dir
    properties_path = os.path.join(cifar_config.path, "config.properties")
    with open(properties_path, "w") as f:
        f.write(config_properties)

    torchserve_dockerfile_str = f"""FROM pytorch/torchserve:0.4.0-gpu

RUN pip install --upgrade pip
RUN pip install grpcio==1.32.0
RUN pip install pytorch-lightning

COPY config.properties /home/model-server/config.properties
COPY {mar_model_name}.mar /home/model-server/model-store/
"""
    # write to artifact dir
    dockerfile_path = os.path.join(cifar_config.path, "Dockerfile")
    with open(dockerfile_path, "w") as f:
        f.write(torchserve_dockerfile_str)
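
The doubled braces in the `model_snapshot` line above are f-string escapes for literal JSON braces, which can be hard to read. As an alternative sketch (not the notebook's approach), the same value can be built as a dictionary and serialized with `json.dumps`; `model_snapshot_json` is a hypothetical helper name.

```python
import json


def model_snapshot_json(mar_model_name: str, version: str) -> str:
    """Serialize the TorchServe model snapshot used in config.properties."""
    snapshot = {
        "name": "startup.cfg",
        "modelCount": 1,
        "models": {
            mar_model_name: {
                version: {
                    "defaultVersion": True,
                    "marName": f"{mar_model_name}.mar",
                    "minWorkers": 1,
                    "maxWorkers": 5,
                    "batchSize": 1,
                    "maxBatchDelay": 5000,
                    "responseTimeout": 120,
                }
            }
        },
    }
    # Compact separators keep the value on a single properties-file line.
    return json.dumps(snapshot, separators=(",", ":"))


line = "model_snapshot=" + model_snapshot_json("cifar10", "1.0")
print(line)
```

Building the value this way also guarantees that the snapshot is valid JSON, whatever characters the model name contains.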


### Define Version 1 of the `train` component: train on the pipeline step node

The train component takes as input the `Dataset` artifact generated by the preprocessing component above, uses it as the data source, and writes its training output (model checkpoints and TensorBoard logs) to the `Model` artifact's GCSFuse path.  As a result, the trained model files are stored in GCS.  

This component is configured to train on 1 GPU. If you want to train on CPU, remove the `gpus` arg from the `trainer_args` definition. You'll also need to edit the pipeline definition below to remove the requirement that the training step run on a GPU-enabled instance.  

> Note: For this variant of the training step, you cannot use more than 1 GPU. (This constraint is tied to how the pipeline steps are launched, and will probably change in the future.)
See the second variant of the training step below, which uses Vertex AI custom training, for a scenario that allows multiple GPUs.
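
Artifact paths inside a pipeline step use the GCSFuse mount point (`/gcs/<bucket>/...`), while tools such as `gsutil` expect `gs://` URIs; the components in this notebook convert between the two with a string replace. The sketch below shows the same mapping with explicit prefix checks. The helper names and the `my-bucket` path are hypothetical.

```python
GCSFUSE_PREFIX = "/gcs/"
GS_PREFIX = "gs://"


def gcsfuse_to_gs_uri(path: str) -> str:
    """Map a GCSFuse path such as /gcs/bucket/obj to gs://bucket/obj."""
    if not path.startswith(GCSFUSE_PREFIX):
        raise ValueError(f"not a GCSFuse path: {path}")
    return GS_PREFIX + path[len(GCSFUSE_PREFIX):]


def gs_uri_to_gcsfuse(uri: str) -> str:
    """Map a gs://bucket/obj URI to its GCSFuse path /gcs/bucket/obj."""
    if not uri.startswith(GS_PREFIX):
        raise ValueError(f"not a gs:// URI: {uri}")
    return GCSFUSE_PREFIX + uri[len(GS_PREFIX):]


# Placeholder path, for illustration only.
local_path = "/gcs/my-bucket/pipeline_root/cifar_model/tensorboard"
print(gcsfuse_to_gs_uri(local_path))
```

Checking the prefix avoids silently rewriting a `/gcs/` substring that appears in the middle of a path.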

In [None]:
@component(
    base_image=GPU_CONTAINER_URI,
    output_component_file="cifar_train.yaml",
)
def cifar_train(
    model_name: str,
    max_epochs: int,
    model_display_name: str,
    tensorboard_instance:str,
    cifar_dataset: Input[Dataset],
    cifar_model: Output[Model],
):

    import pytorch_lightning as pl
    import logging
    import os
    from subprocess import Popen, DEVNULL
    import sys

    from pytorch_pipeline.components.trainer.component import Trainer
    from argparse import ArgumentParser
    from pytorch_lightning.loggers import TensorBoardLogger
    from pytorch_lightning.callbacks import (
        EarlyStopping,
        LearningRateMonitor,
        ModelCheckpoint,
    )

    logging.getLogger().setLevel(logging.INFO)
    logging.info("dataset root path: %s", cifar_dataset.path)
    logging.info("model root path: %s", cifar_model.path)
    model_output_root = cifar_model.path

    # Argument parser for user defined paths
    parser = ArgumentParser()

    parser.add_argument(
        "--tensorboard_root",
        type=str,
        default=f"{model_output_root}/tensorboard",
        help="Tensorboard Root path (default: output/tensorboard)",
    )

    parser.add_argument(
        "--checkpoint_dir",
        type=str,
        default=f"{model_output_root}/train/models",
        help="Path to save model checkpoints ",
    )

    parser.add_argument(
        "--dataset_path",
        type=str,
        default=cifar_dataset.path,
        help="Cifar10 Dataset path (default: output/processing)",
    )

    parser.add_argument(
        "--model_name",
        type=str,
        default="resnet.pth",
        help="Name of the model to be saved as (default: resnet.pth)",
    )

    sys.argv = sys.argv[:1]

    parser = pl.Trainer.add_argparse_args(parent_parser=parser)
    args = vars(parser.parse_args())

    # Enabling Tensorboard Logger, ModelCheckpoint, Earlystopping
    lr_logger = LearningRateMonitor()
    tboard = TensorBoardLogger(f"{model_output_root}/tensorboard")

    early_stopping = EarlyStopping(
        monitor="val_loss", mode="min", patience=5, verbose=True
    )
    checkpoint_callback = ModelCheckpoint(
        dirpath=f"{model_output_root}/train/models",
        filename="cifar10_{epoch:02d}",
        save_top_k=1,
        verbose=True,
        monitor="val_loss",
        mode="min",
    )

    # Setting the trainer-specific arguments
    trainer_args = {
        "logger": tboard,
        "profiler": "pytorch",
        "checkpoint_callback": True,
        "max_epochs": max_epochs,
        "callbacks": [lr_logger, early_stopping, checkpoint_callback],
        "gpus": 1,
    }

    # Setting the datamodule specific arguments
    data_module_args = {"train_glob": cifar_dataset.path}

    if tensorboard_instance:
      try:
        logging.warning('setting up Vertex tensorboard experiment')
        tb_gs = f"{model_output_root}/tensorboard".replace("/gcs/", "gs://")
        logging.info('tb gs path: %s', tb_gs)
        tb_args = ["/opt/conda/bin/tb-gcp-uploader", "--tensorboard_resource_name", tensorboard_instance, 
                        "--logdir", tb_gs, "--experiment_name", model_display_name,
                        # '--one_shot=True'
                        ]
        logging.warning('tb args: %s', tb_args)
        Popen(tb_args, stdout=DEVNULL, stderr=DEVNULL)
      except Exception as e:
        logging.warning(e)

    # Initiating the training process
    logging.info("about to call the Trainer...")

    trainer = Trainer(
        module_file="cifar10_train.py",
        data_module_file="cifar10_datamodule.py",
        module_file_args=parser,
        data_module_args=data_module_args,
        trainer_args=trainer_args,
    )

 


### Define the 'mar' component

This component generates the [model archive file](https://github.com/pytorch/serve/blob/master/model-archiver/README.md) from the training results.

In [None]:
@component(
    base_image=CONTAINER_URI,
    output_component_file="mar.yaml",
)
def generate_mar_file(
    model_name: str,
    mar_model_name: str,
    handler: str,
    version: str,
    cifar_model: Input[Model],
    cifar_mar: Output[Model],
):

    import logging
    import pytorch_lightning as pl
    import os
    import subprocess

    from pathlib import Path

    def _validate_mar_config(mar_config):
        mandatory_args = [
            "MODEL_NAME",
            "SERIALIZED_FILE",
            "MODEL_FILE",
            "HANDLER",
            "VERSION",
        ]
        missing_list = []
        for key in mandatory_args:
            if key not in mar_config:
                missing_list.append(key)

        if missing_list:
            logging.warning(
                "The following Mandatory keys are missing in the config file {} ".format(
                    missing_list
                )
            )
            raise Exception(
                "Following Mandatory keys are missing in the config file {} ".format(
                    missing_list
                )
            )

    logging.getLogger().setLevel(logging.INFO)

    model_output_root = cifar_model.path
    mar_output_root = cifar_mar.path
    export_path = f"{mar_output_root}/model-store"
    try:
        Path(export_path).mkdir(parents=True, exist_ok=True)
    except Exception as e:
        logging.warning(e)
        # retry after pause
        import time

        time.sleep(2)
        Path(export_path).mkdir(parents=True, exist_ok=True)

    mar_config = {
        "MODEL_NAME": mar_model_name,
        "MODEL_FILE": "pytorch_pipeline/examples/cifar10/cifar10_train.py",
        "HANDLER": handler,
        "SERIALIZED_FILE": os.path.join(
            f"{model_output_root}/train/models",
            model_name,
        ),
        "VERSION": version,
        "EXPORT_PATH": f"{cifar_mar.path}/model-store",
    }
    logging.warning("mar_config: %s", mar_config)
    print(f"mar_config: {mar_config}")
    try:
        logging.info("validating config")
        _validate_mar_config(mar_config)
    except Exception as e:
        logging.warning(e)

    archiver_cmd = "torch-model-archiver --force --model-name {MODEL_NAME} --serialized-file {SERIALIZED_FILE} --model-file {MODEL_FILE} --handler {HANDLER} -v {VERSION}".format(
        MODEL_NAME=mar_config["MODEL_NAME"],
        SERIALIZED_FILE=mar_config["SERIALIZED_FILE"],
        MODEL_FILE=mar_config["MODEL_FILE"],
        HANDLER=mar_config["HANDLER"],
        VERSION=mar_config["VERSION"],
    )
    if "EXPORT_PATH" in mar_config:
        archiver_cmd += " --export-path {EXPORT_PATH}".format(
            EXPORT_PATH=mar_config["EXPORT_PATH"]
        )

    if "EXTRA_FILES" in mar_config:
        archiver_cmd += " --extra-files {EXTRA_FILES}".format(
            EXTRA_FILES=mar_config["EXTRA_FILES"]
        )

    if "REQUIREMENTS_FILE" in mar_config:
        archiver_cmd += " -r {REQUIREMENTS_FILE}".format(
            REQUIREMENTS_FILE=mar_config["REQUIREMENTS_FILE"]
        )

    print("Running Archiver cmd: ", archiver_cmd)
    logging.warning("archiver command: %s", archiver_cmd)

    try:
        return_code = subprocess.Popen(archiver_cmd, shell=True).wait()
        if return_code != 0:
            error_msg = (
                "Error running command {archiver_cmd} {return_code}".format(
                    archiver_cmd=archiver_cmd, return_code=return_code
                )
            )
            print(error_msg)
    except Exception as e:
        logging.warning(e)
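
The component above assembles the `torch-model-archiver` call as a shell string. A possible alternative, sketched here under the assumption that the same `mar_config` keys are used, is to build an argument list, which avoids quoting problems with paths that contain spaces. `archiver_args` is a hypothetical helper and the paths are placeholders.

```python
import shlex


def archiver_args(mar_config: dict) -> list:
    """Build the torch-model-archiver argument list from a mar_config dict."""
    args = [
        "torch-model-archiver",
        "--force",
        "--model-name", mar_config["MODEL_NAME"],
        "--serialized-file", mar_config["SERIALIZED_FILE"],
        "--model-file", mar_config["MODEL_FILE"],
        "--handler", mar_config["HANDLER"],
        "-v", mar_config["VERSION"],
    ]
    # Optional settings are appended only when present in the config.
    optional_flags = {
        "EXPORT_PATH": "--export-path",
        "EXTRA_FILES": "--extra-files",
        "REQUIREMENTS_FILE": "-r",
    }
    for key, flag in optional_flags.items():
        if key in mar_config:
            args += [flag, mar_config[key]]
    return args


example = archiver_args(
    {
        "MODEL_NAME": "cifar10",
        "SERIALIZED_FILE": "/gcs/my-bucket/cifar_model/train/models/resnet.pth",
        "MODEL_FILE": "pytorch_pipeline/examples/cifar10/cifar10_train.py",
        "HANDLER": "image_classifier",
        "VERSION": "1.0",
        "EXPORT_PATH": "/gcs/my-bucket/cifar_mar/model-store",
    }
)
print(shlex.join(example))
```

Passing such a list to `subprocess.run(args, check=True)` would raise `CalledProcessError` on a non-zero exit code, so a failed archive step would fail the pipeline step instead of only printing a message.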


### Define the component to build a torchserve docker image

This component uses the results of the 'config' component as well as the model archive file.  It builds a torchserve image using [Cloud Build](https://cloud.google.com/build/docs).


In [None]:
@component(
    base_image="gcr.io/deeplearning-platform-release/tf2-gpu.2-3:latest",
    output_component_file="build_image.yaml",
)
def build_torchserve_image(
    model_name: str,
    cifar_mar: Input[Model],
    cifar_config: Input[Artifact],
    project: str,
) -> NamedTuple("Outputs", [("serving_container_uri", str),],):

    from datetime import datetime
    import logging
    import os

    import google.auth
    from google.cloud.devtools import cloudbuild_v1

    logging.getLogger().setLevel(logging.INFO)
    credentials, project_id = google.auth.default()
    client = cloudbuild_v1.services.cloud_build.CloudBuildClient()

    mar_model_name = f"{model_name}.mar"
    build_version = datetime.now().strftime("%Y%m%d%H%M%S")

    dockerfile_path = os.path.join(cifar_config.path, "Dockerfile")
    gs_dockerfile_path = dockerfile_path.replace("/gcs/", "gs://")
    config_prop_path = os.path.join(cifar_config.path, "config.properties")
    gs_config_prop_path = config_prop_path.replace("/gcs/", "gs://")

    export_path = f"{cifar_mar.path}/model-store"
    model_path = os.path.join(export_path, mar_model_name)
    gs_model_path = model_path.replace("/gcs/", "gs://")
    logging.warning("gs_model_path: %s", gs_model_path)

    image_uri = f"gcr.io/{project}/torchservetest:{build_version}"
    logging.info("image uri: %s", image_uri)

    build = cloudbuild_v1.Build(images=[image_uri])
    build.steps = [
        {
            "name": "gcr.io/cloud-builders/gsutil",
            "args": [
                "cp",
                gs_config_prop_path,
                "config.properties",
            ],
        },
        {
            "name": "gcr.io/cloud-builders/gsutil",
            "args": ["cp", f"{gs_model_path}", f"{mar_model_name}"],
        },
        {
            "name": "gcr.io/cloud-builders/gsutil",
            "args": [
                "cp",
                gs_dockerfile_path,
                "Dockerfile",
            ],
        },
        {
            "name": "gcr.io/cloud-builders/docker",
            "args": ["build", "-t", image_uri, "."],
        },
    ]
    operation = client.create_build(project_id=project, build=build)
    print("IN PROGRESS:")
    print(operation.metadata)

    result = operation.result()
    # Print the completed status
    print("RESULT:", result.status)
    return (image_uri,)
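
The Cloud Build steps in the component follow a simple pattern: three `gsutil cp` steps stage the files into the build workspace, then one `docker build` step produces the image. The sketch below expresses that pattern with plain dictionaries so it can be inspected without calling Cloud Build; the helper names and URIs are hypothetical.

```python
def gsutil_copy_step(source_uri: str, destination: str) -> dict:
    """One build step that copies a GCS object into the build workspace."""
    return {
        "name": "gcr.io/cloud-builders/gsutil",
        "args": ["cp", source_uri, destination],
    }


def torchserve_build_steps(
    config_uri: str,
    mar_uri: str,
    dockerfile_uri: str,
    mar_file_name: str,
    image_uri: str,
) -> list:
    """Stage config, model archive and Dockerfile, then build the image."""
    return [
        gsutil_copy_step(config_uri, "config.properties"),
        gsutil_copy_step(mar_uri, mar_file_name),
        gsutil_copy_step(dockerfile_uri, "Dockerfile"),
        {
            "name": "gcr.io/cloud-builders/docker",
            "args": ["build", "-t", image_uri, "."],
        },
    ]


# Placeholder URIs, for illustration only.
steps = torchserve_build_steps(
    "gs://my-bucket/cifar_config/config.properties",
    "gs://my-bucket/cifar_mar/model-store/cifar10.mar",
    "gs://my-bucket/cifar_config/Dockerfile",
    "cifar10.mar",
    "gcr.io/my-project/torchservetest:20210601120000",
)
for step in steps:
    print(step["name"], step["args"])
```

The local file names matter because the generated Dockerfile copies `config.properties` and the `.mar` file from the build context by name.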


## Optional: Create a Vertex TensorBoard instance

If you like, you can configure the pipeline to upload the training logs to the Vertex TensorBoard service.  To do this, you will need to pre-create a Vertex TensorBoard instance.  Follow the [instructions here](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview#create_a_instance).

As described in the docs, you will need the Vertex TensorBoard instance name (it will look something like: `projects/123/locations/us-central1/tensorboards/456`) that will be printed at the end of the create command output. Make note of that instance name, and you will use it to set a parameter when submitting the pipeline run.
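
Since a mistyped instance name is only noticed once the training step is already running, a quick format check can be useful. The pattern below is inferred from the example name in the text above and is an assumption, not an official specification; `parse_tensorboard_name` is a hypothetical helper.

```python
import re
from typing import Optional

# Shape taken from the example: projects/<project>/locations/<region>/tensorboards/<id>
_TENSORBOARD_NAME_RE = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)/tensorboards/(?P<tensorboard>[^/]+)$"
)


def parse_tensorboard_name(name: str) -> Optional[dict]:
    """Return the name's components, or None if it does not match the expected shape."""
    match = _TENSORBOARD_NAME_RE.match(name)
    return match.groupdict() if match else None


print(parse_tensorboard_name("projects/123/locations/us-central1/tensorboards/456"))
```

The `location` component should match the `REGION` used for the pipeline run.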

## Define and run the Pipeline Version 1

Define a pipeline that uses these components. 
Before you evaluate the pipeline, **edit the GPU type** for both the `cifar_train_task` and the `model_deploy_op` depending upon what GPU quota you have available.  **You may need to request more GPU quota first**.

The pipeline will look like this:

<a href="https://storage.googleapis.com/amy-jo/images/mp/pytorch_train1.png" target="_blank"><img src="https://storage.googleapis.com/amy-jo/images/mp/pytorch_train1.png" width="95%"/></a>


Define some constants:

In [None]:
from datetime import datetime
ts = datetime.now().strftime("%Y%m%d%H%M%S")
MODEL_NAME = f'resnet{ts}'
PORT = 8080
MAR_MODEL_NAME = 'cifar10'

In [None]:
print(MODEL_NAME)

In [None]:

@dsl.pipeline(
    name="pytorch-cifar-pipeline",
    pipeline_root=PIPELINE_ROOT,
)
def pytorch_cifar_pipeline(
    project: str = PROJECT_ID,
    model_name: str = "resnet.pth",
    model_display_name: str = MODEL_NAME,
    max_epochs: int = 1,
    mar_model_name: str = MAR_MODEL_NAME,
    handler: str = "image_classifier",
    version: str = "1.0",
    port: int = PORT,
    tensorboard_instance: str = ''
):
    cifar_config_task = cifar_config(mar_model_name, version, port)
    cifar_preproc_task = cifar_preproc()

    cifar_train_task = cifar_train(
        model_name=model_name,
        max_epochs=max_epochs,
        model_display_name=model_display_name,
        tensorboard_instance=tensorboard_instance,
        cifar_dataset=cifar_preproc_task.outputs["cifar_dataset"],
    ).set_gpu_limit(1).set_memory_limit('32G')
    cifar_train_task.add_node_selector_constraint(
        # You can change this to use a different accelerator. Ensure you have quota for it.
        "cloud.google.com/gke-accelerator", "nvidia-tesla-v100"
    )

    cifar_mar_task = generate_mar_file(
        model_name,
        mar_model_name,
        handler,
        version,
        cifar_train_task.outputs["cifar_model"],
    )

    build_image_task = build_torchserve_image(
        mar_model_name, cifar_mar_task.outputs["cifar_mar"], 
        cifar_config_task.outputs['cifar_config'],
        project
    )

    model_upload_op = gcc_aip.ModelUploadOp(
        project=project,
        display_name=model_display_name,
        serving_container_image_uri=build_image_task.outputs['serving_container_uri'],
        serving_container_predict_route="/predictions/{}".format(MAR_MODEL_NAME),
        serving_container_health_route="/ping",
        serving_container_ports=[PORT]        
    )
    
    endpoint_create_op = gcc_aip.EndpointCreateOp(
        project=project,
        display_name=model_display_name,
    )

    model_deploy_op = gcc_aip.ModelDeployOp(
        project=project,
        endpoint=endpoint_create_op.outputs["endpoint"],
        model=model_upload_op.outputs["model"],
        deployed_model_display_name=model_display_name,
        machine_type="n1-standard-4",
        accelerator_type='NVIDIA_TESLA_P100',  # CHANGE THIS as necessary
        accelerator_count=1        
    )
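
The order in which the steps can run follows from the data dependencies between them in the pipeline definition above. The sketch below reproduces those dependencies by hand and groups the steps into "waves" whose members have no unmet dependencies. The step labels are chosen for this sketch, and the grouping is a simplification: the pipeline backend starts each step as soon as its own inputs are ready rather than waiting for a whole wave.

```python
# Dependencies read off the pipeline definition: step -> steps it consumes outputs from.
DEPENDENCIES = {
    "cifar_config": set(),
    "cifar_preproc": set(),
    "cifar_train": {"cifar_preproc"},
    "generate_mar_file": {"cifar_train"},
    "build_torchserve_image": {"generate_mar_file", "cifar_config"},
    "model_upload": {"build_torchserve_image"},
    "endpoint_create": set(),
    "model_deploy": {"endpoint_create", "model_upload"},
}


def execution_waves(dependencies: dict) -> list:
    """Group steps into waves; each wave depends only on earlier waves."""
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    waves = []
    while remaining:
        ready = sorted(step for step, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for step in ready:
            del remaining[step]
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves


for number, wave in enumerate(execution_waves(DEPENDENCIES), start=1):
    print(number, wave)
```

This makes it visible that the config, preprocessing, and endpoint creation steps have no upstream dependencies and can start right away.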



Compile the pipeline:

In [None]:
from kfp.v2 import compiler as v2compiler
v2compiler.Compiler().compile(pipeline_func=pytorch_cifar_pipeline,
                              package_path='pytorch_pipeline_spec.json')

Create a Pipelines client object:

In [None]:
from kfp.v2.google.client import AIPlatformClient  # noqa: F811

api_client = AIPlatformClient(
    project_id=PROJECT_ID, 
    region=REGION, 
    )

**Edit the following cell** if you would like to upload training logs to a Vertex TensorBoard instance.

In [None]:
TENSORBOARD_INSTANCE = 'projects/123/locations/us-central1/tensorboards/456' # CHANGE THIS TO YOUR INSTANCE NAME

Run the pipeline.  If you set up a TensorBoard instance, **edit the cell above to use your instance name, then uncomment the `tensorboard_instance` line below before evaluating the cell.**

In [None]:
result = api_client.create_run_from_job_spec(
    job_spec_path="pytorch_pipeline_spec.json",
    pipeline_root=PIPELINE_ROOT,
    # enable_caching=False,
    parameter_values={
        "model_name": "resnet.pth", "max_epochs": 5,
        "project": PROJECT_ID, "model_display_name": MODEL_NAME,
        # "tensorboard_instance": TENSORBOARD_INSTANCE
    },
)

You can view the running pipeline in the Cloud Console by clicking the generated link above.
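
As an alternative to commenting and uncommenting the `tensorboard_instance` line, the parameter dictionary can be built conditionally. This is a minimal sketch; `build_parameter_values` is a hypothetical helper, and the argument values in the usage lines are placeholders.

```python
from typing import Optional


def build_parameter_values(
    project_id: str,
    model_display_name: str,
    max_epochs: int = 5,
    model_name: str = "resnet.pth",
    tensorboard_instance: Optional[str] = None,
) -> dict:
    """Assemble pipeline parameters, adding TensorBoard only when configured."""
    values = {
        "model_name": model_name,
        "max_epochs": max_epochs,
        "project": project_id,
        "model_display_name": model_display_name,
    }
    if tensorboard_instance:
        values["tensorboard_instance"] = tensorboard_instance
    return values


# Placeholder values for illustration.
without_tb = build_parameter_values("my-project", "resnet20210101000000")
with_tb = build_parameter_values(
    "my-project",
    "resnet20210101000000",
    tensorboard_instance="projects/123/locations/us-central1/tensorboards/456",
)
print(sorted(without_tb))
print(sorted(with_tb))
```

The resulting dictionary could then be passed as `parameter_values` to `create_run_from_job_spec`.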

### Viewing model training information using TensorBoard

If you set up a Vertex TensorBoard instance and configured the pipeline to use it, then once the pipeline training step is underway, you can view the TensorBoard server by navigating to 'Vertex AI > Experiments' in the Cloud Console.  Click on '`OPEN TENSORBOARD`' next to the newly created "experiment", which is named after the `MODEL_NAME` generated above.

### Using the PyTorch profiler with TensorBoard

The training code writes profiler information, which you can view in TensorBoard by installing a plugin.  The Vertex TensorBoard service does not support adding arbitrary plugins, but you can view this information locally as follows (you will need the `gcloud` SDK installed):

- On your local machine, ideally within a virtual environment, run: `pip install -U torch_tb_profiler` and `pip install -U tensorboard`.
- Find the link to the TensorBoard logs produced during training.  You can do this by navigating to the pipeline in the Cloud Console and clicking on the Model Artifact produced as output by the training step.  In the right panel you will see a `URI` link that starts with `gs://` and ends with `cifar_model`.  Append `/tensorboard` to that URI, which should result in a URI like this:  
`gs://<your-bucket>/.../cifar_model/tensorboard`.
- Copy the TensorBoard logs to your local machine, e.g. (replacing with your URI):  
`gsutil cp -r gs://<your-bucket>/.../cifar_model/tensorboard /tmp`
- Run the TensorBoard server: `tensorboard --logdir=/tmp/tensorboard`
- Visit the TensorBoard server at the given localhost port.
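
The URI handling in the steps above can be sketched in Python. This is an illustration only: the helper names are hypothetical, and the bucket path in the usage line is a placeholder rather than a real artifact URI. The `-r` flag is included because the logs are a directory tree.

```python
import shlex


def tensorboard_log_uri(model_artifact_uri: str) -> str:
    """Append the tensorboard subdirectory to the model artifact URI."""
    return model_artifact_uri.rstrip("/") + "/tensorboard"


def copy_command(model_artifact_uri: str, dest: str = "/tmp") -> list:
    """Build a recursive gsutil copy command for the TensorBoard logs."""
    return ["gsutil", "cp", "-r", tensorboard_log_uri(model_artifact_uri), dest]


# Placeholder URI; substitute the one shown in the Cloud Console.
cmd = copy_command("gs://my-bucket/pipeline-root/cifar_model")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` if you prefer to run the copy from Python.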

## Using your deployed model to get predictions

First, [download this](https://github.com/amygdala/code-snippets/blob/master/ml/vertex_pipelines/pytorch/cifar/input.json) `input.json` file.

Then, find the 'endpoint' artifact output by the 'endpoint-create' step in the pipeline and click on it.  Look for the endpoint URI in the right-hand panel. Copy the last part of that URI (a long number).  That is the endpoint ID. (You can also find this information in the  'Vertex AI > Endpoints' panel in the Cloud Console.)
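
If you would rather not copy the number by hand, the ID can be taken from the last path segment of the endpoint URI. This is a minimal sketch; `endpoint_id_from_uri` is a hypothetical helper, and the URI in the usage line is a made-up placeholder.

```python
def endpoint_id_from_uri(uri: str) -> str:
    """Return the trailing numeric segment of an endpoint URI."""
    endpoint_id = uri.rstrip("/").rsplit("/", 1)[-1]
    if not endpoint_id.isdigit():
        raise ValueError(f"No numeric endpoint ID at the end of {uri!r}")
    return endpoint_id


# Placeholder URI for illustration only.
example_uri = "projects/123/locations/us-central1/endpoints/4567890123456789"
print(endpoint_id_from_uri(example_uri))
```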

**Change the cell below to use your endpoint ID before you run it.**

In [None]:
ENDPOINT_ID = 'xxxxxxxxxxxx'  # <---- CHANGE THIS

In [None]:
!gcloud ai endpoints predict {ENDPOINT_ID} --region={REGION} --json-request=input.json
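
To send your own image instead of the provided `input.json`, you need to build a request body yourself. The sketch below assumes that the request wraps base64-encoded image bytes as `{"instances": [{"data": {"b64": ...}}]}`. The `instances` list is required by Vertex AI prediction requests, but the inner `data`/`b64` layout is an assumption about what the TorchServe handler expects, so compare it against the downloaded `input.json` before relying on it. `build_request` is a hypothetical helper.

```python
import base64
import json


def build_request(image_bytes: bytes) -> dict:
    """Wrap raw image bytes in an assumed prediction request body."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {"instances": [{"data": {"b64": encoded}}]}


# Stand-in bytes; in practice, read them from an image file opened in "rb" mode.
request = build_request(b"not a real image")
request_json = json.dumps(request)
print(request_json[:60])
```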

## Define Version 2 of the training component

This version of the training component uses Vertex AI Custom Training. It does single-node, multi-GPU training, using 2 GPUs by default.

A prebuilt custom training container is used, which includes the training code.  You can view the code and [Dockerfile definition](https://github.com/amygdala/code-snippets/blob/master/ml/vertex_pipelines/pytorch/cifar/Dockerfile-gpu-ct) in the [example repo](https://github.com/amygdala/code-snippets/tree/master/ml/vertex_pipelines/pytorch/cifar).

So, this component uses the Vertex AI SDK to define and launch the custom training job, then waits for it to complete.


In [None]:
@component(
    base_image="gcr.io/deeplearning-platform-release/tf2-cpu.2-3:latest",
    output_component_file="cifar_vertex_train.yaml",
    packages_to_install=["google-cloud-aiplatform"],
)
def cifar_vertex_train(
    project: str,
    region: str,
    staging_bucket: str,
    custom_container_uri: str,
    display_name: str,
    model_name: str,
    max_epochs: int,
    num_gpus: int,
    accelerator_type: str,
    tensorboard_instance: str,
    cifar_dataset: Input[Dataset],
    cifar_model: Output[Model],
):

    import logging
    import os
    import sys
    import subprocess

    from google.cloud import aiplatform

    logging.getLogger().setLevel(logging.INFO)
    gs_dataset_path = cifar_dataset.path
    gs_model_path = cifar_model.path
    gs_dataset_path = gs_dataset_path.replace("/gcs/", "gs://")
    gs_model_path = gs_model_path.replace("/gcs/", "gs://")
    logging.info('dataset root path: %s', gs_dataset_path)
    logging.info('model root path: %s', gs_model_path)

    aiplatform.init(
        project=project, location=region, 
        staging_bucket=staging_bucket,
    )
    custom_job = aiplatform.CustomContainerTrainingJob(
        display_name=display_name,
        container_uri=custom_container_uri,
    )
    # Flags passed to the training container; each flag is followed by its value.
    trainer_args = [
        '--gcs_tensorboard_root', f"{gs_model_path}/tensorboard",
        '--gcs_checkpoint_dir', f"{gs_model_path}/train/models",
        '--gcs_dataset_path', gs_dataset_path,
        '--gcs_mar_dir', f"{gs_model_path}/model-store",
        '--vertex_num_gpus', num_gpus,
        '--vertex_max_epochs', max_epochs,
        '--gcs_tensorboard_instance', tensorboard_instance,
    ]
    logging.info('trainer_args: %s', trainer_args)
    custom_model = custom_job.run(
        replica_count=1,
        args=trainer_args,
        sync=False,
        machine_type="n1-standard-8",
        # accelerator_type='NVIDIA_TESLA_P100',
        accelerator_type=accelerator_type,
        accelerator_count=int(num_gpus)
    )
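
The component above receives artifact paths that start with `/gcs/` and rewrites them as `gs://` URIs before passing them to the training container. The following standalone sketch shows that conversion and the argument list in isolation. The helper names are hypothetical. Unlike the component, it only rewrites a leading `/gcs/` prefix and converts every argument to a string, which is a design choice for this illustration rather than a requirement stated in the example.

```python
def to_gs_uri(path: str) -> str:
    """Rewrite a leading /gcs/ prefix as gs://; leave other paths unchanged."""
    prefix = "/gcs/"
    if path.startswith(prefix):
        return "gs://" + path[len(prefix):]
    return path


def build_trainer_args(model_path, dataset_path, num_gpus, max_epochs,
                       tensorboard_instance=""):
    """Build the flag/value list for the training container as strings."""
    model_uri = to_gs_uri(model_path)
    args = [
        "--gcs_tensorboard_root", f"{model_uri}/tensorboard",
        "--gcs_checkpoint_dir", f"{model_uri}/train/models",
        "--gcs_dataset_path", to_gs_uri(dataset_path),
        "--gcs_mar_dir", f"{model_uri}/model-store",
        "--vertex_num_gpus", num_gpus,
        "--vertex_max_epochs", max_epochs,
        "--gcs_tensorboard_instance", tensorboard_instance,
    ]
    return [str(arg) for arg in args]


# Placeholder paths for illustration.
example_args = build_trainer_args(
    "/gcs/my-bucket/run/cifar_model", "/gcs/my-bucket/run/cifar_dataset", 2, 5
)
print(example_args)
```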


## Define and run Pipeline Version 2

This pipeline differs from Version 1 only in the training component; its other components are the same.

The pipeline will look like this:

<a href="https://storage.googleapis.com/amy-jo/images/mp/pytorch_train2.png" target="_blank"><img src="https://storage.googleapis.com/amy-jo/images/mp/pytorch_train2.png" width="95%"/></a>

In [None]:
from datetime import datetime
ts = datetime.now().strftime("%Y%m%d%H%M%S")
MODEL_NAME = f'resnet{ts}'
PORT = 8080
MAR_MODEL_NAME = 'cifar10'

In [None]:
print(MODEL_NAME)

In [None]:

@dsl.pipeline(
    name="pytorch-cifar-customtrain-pipeline",
    pipeline_root=PIPELINE_ROOT,
)
def pytorch_cifar_pipeline(
    project: str = PROJECT_ID,
    region: str = REGION,
    staging_bucket: str = PIPELINE_ROOT,
    model_name: str = "resnet.pth",
    model_display_name: str = MODEL_NAME,
    max_epochs: int = 1,
    mar_model_name: str = MAR_MODEL_NAME,
    handler: str = "image_classifier",
    version: str = "1.0",
    port: int = PORT,
    num_train_gpus: int = 2,
    accelerator_train_type: str = 'NVIDIA_TESLA_P100',
    tensorboard_instance: str = '',
    custom_container_uri: str = 'gcr.io/google-samples/pytorch-pl-gpu-ct:v4',
):
    cifar_config_task = cifar_config(mar_model_name, version, port)
    cifar_preproc_task = cifar_preproc()

    cifar_train_task = cifar_vertex_train(
        project=project,
        region=region,
        staging_bucket=staging_bucket,
        custom_container_uri=custom_container_uri,
        display_name = model_display_name,
        model_name=model_name,
        max_epochs=max_epochs,
        num_gpus=num_train_gpus,
        accelerator_type=accelerator_train_type,
        tensorboard_instance=tensorboard_instance,
        cifar_dataset=cifar_preproc_task.outputs["cifar_dataset"],
    )

    cifar_mar_task = generate_mar_file(
        model_name,
        mar_model_name,
        handler,
        version,
        cifar_train_task.outputs["cifar_model"],
    )

    build_image_task = build_torchserve_image(
        mar_model_name, cifar_mar_task.outputs["cifar_mar"], 
        cifar_config_task.outputs['cifar_config'],
        project
    )

    model_upload_op = gcc_aip.ModelUploadOp(
        project=project,
        display_name=model_display_name,
        serving_container_image_uri=build_image_task.outputs['serving_container_uri'],
        serving_container_predict_route="/predictions/{}".format(MAR_MODEL_NAME),
        serving_container_health_route="/ping",
        serving_container_ports=[PORT]        
    )
    
    endpoint_create_op = gcc_aip.EndpointCreateOp(
        project=project,
        display_name=model_display_name,
    )

    model_deploy_op = gcc_aip.ModelDeployOp(
        project=project,
        endpoint=endpoint_create_op.outputs["endpoint"],
        model=model_upload_op.outputs["model"],
        deployed_model_display_name=model_display_name,
        machine_type="n1-standard-4",
        accelerator_type='NVIDIA_TESLA_P100',
        accelerator_count=1        
    )



Compile the pipeline:

In [None]:
from kfp.v2 import compiler as v2compiler
v2compiler.Compiler().compile(pipeline_func=pytorch_cifar_pipeline,
                              package_path='pytorch_ct_pipeline_spec.json')

Create a Pipelines client object:

In [None]:
from kfp.v2.google.client import AIPlatformClient  # noqa: F811

api_client = AIPlatformClient(
    project_id=PROJECT_ID, 
    region=REGION, 
    )

**Edit the following cell** if you would like to upload training logs to a Vertex TensorBoard instance.  See the "Pipeline Version 1" section on Vertex TensorBoard for more information.

In [None]:
TENSORBOARD_INSTANCE = 'projects/123/locations/us-central1/tensorboards/456' # CHANGE THIS TO YOUR INSTANCE NAME

Run the pipeline.  If you set up a TensorBoard instance, **edit the cell above to use your instance name, then uncomment the `tensorboard_instance` line below before evaluating the cell.**

In [None]:
result = api_client.create_run_from_job_spec(
    job_spec_path="pytorch_ct_pipeline_spec.json",
    pipeline_root=PIPELINE_ROOT,
    # enable_caching=False,
    parameter_values={
        "model_name": "resnet.pth", "max_epochs": 5,
        "project": PROJECT_ID, "model_display_name": MODEL_NAME,
        "num_train_gpus": 2,
        "custom_container_uri": 'gcr.io/google-samples/pytorch-pl-gpu-ct:v4',
        # "tensorboard_instance": TENSORBOARD_INSTANCE
    },
)

You can view the running pipeline in the Cloud Console by following the generated link above.

You can send prediction requests to the deployed model in the same way as described above for the 'Version 1' pipeline.

### Viewing model training information using TensorBoard

See the TensorBoard sections above, under 'Pipeline Version 1', for information on using TensorBoard.  For this version, the TensorBoard *experiment* won't be created until training has completed.

## Cleanup

When you're done with the example, you may want to undeploy your model.  One way to do this is via the Cloud Console: visit the 'Vertex AI > Endpoints' panel, click on the endpoint(s) to which your model(s) were deployed, and undeploy those models.  Once the models are undeployed, you can delete the endpoints as well.

Then, visit the 'Vertex AI > Notebooks' panel and remove the "tensorboard notebook" created by the pipeline.

You may also want to do other cleanup by removing the GCS artifacts used by the pipeline and by removing the GCR image builds.
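
The endpoint cleanup can also be done from Python. The following is a minimal sketch using the `google-cloud-aiplatform` SDK, assuming the endpoints were created with the pipeline's `model_display_name`. The helper names are hypothetical, and the SDK import is kept inside the function so the first helper can be used with any object that offers `undeploy_all()` and `delete()`.

```python
def undeploy_and_delete(endpoints) -> list:
    """Undeploy all models from each endpoint, then delete the endpoint.

    Returns the display names of the endpoints that were removed.
    """
    removed = []
    for endpoint in endpoints:
        endpoint.undeploy_all()  # models must be undeployed before deletion
        endpoint.delete()
        removed.append(endpoint.display_name)
    return removed


def cleanup_by_display_name(project: str, region: str, display_name: str) -> list:
    """Find endpoints by display name and remove them (needs GCP credentials)."""
    from google.cloud import aiplatform  # pip install google-cloud-aiplatform

    aiplatform.init(project=project, location=region)
    endpoints = aiplatform.Endpoint.list(filter=f'display_name="{display_name}"')
    return undeploy_and_delete(endpoints)
```

For example, `cleanup_by_display_name(PROJECT_ID, REGION, MODEL_NAME)` would remove the endpoints created by a run that used `MODEL_NAME` as its display name. Uploaded models, GCS artifacts, and container images are not touched by this sketch.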



