Reference: https://cloud.google.com/blog/topics/developers-practitioners/use-vertex-pipelines-build-automl-classification-end-end-workflow

# Vertex Pipelines: AutoML Tabular pipelines using google-cloud-pipeline-components


## Overview


This notebook shows how to use the components defined in [`google_cloud_pipeline_components`](https://github.com/kubeflow/pipelines/tree/master/components/google-cloud) to build an AutoML Tabular workflow on [Vertex Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines).

You'll build a pipeline that looks like this:

<a href="AutoML_Tabular_DAG.png" target="_blank"><img src="AutoML_Tabular_DAG.png" width="95%"/></a>

### Costs 

This notebook uses billable components of Google Cloud Platform:

* Vertex AI Training and Serving
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### Install additional packages


In [1]:
import os

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# Google Cloud Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

In [2]:
# !pip3 install {USER_FLAG} google-cloud-aiplatform --upgrade
# !pip3 install {USER_FLAG} kfp google-cloud-pipeline-components --upgrade

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

Check the versions of the packages you installed.  The KFP SDK version should be >=1.6.

In [3]:
!python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"
!python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"

KFP SDK version: 1.6.4
google_cloud_pipeline_components version: 0.1.3


In [4]:
import os

PROJECT_ID = "kubeflow-1-0-2"  # <---CHANGE THIS

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users, create a timestamp for each session and append it to the names of the resources you create in this tutorial.

In [5]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
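
As a minimal sketch of how the timestamp can be used, the helper below appends it to a base resource name. The helper `unique_name` and the base name `card-fraud` are illustrative assumptions, not part of the original notebook.

```python
from datetime import datetime


def unique_name(base: str, timestamp: str) -> str:
    """Append a session timestamp to a base resource name (hypothetical helper)."""
    return f"{base}-{timestamp}"


# Same format as the TIMESTAMP defined above: year, month, day, hour, minute, second.
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

# "card-fraud" is only an example base name.
EXAMPLE_NAME = unique_name("card-fraud", TIMESTAMP)
print(EXAMPLE_NAME)
```

Two users who start their sessions at different seconds get different resource names, which is what prevents collisions in a shared project.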

### Create a Cloud Storage bucket as necessary

You will need a Cloud Storage bucket for this example.  If you don't have one that you want to use, you can make one now.


Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets.

You may also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. Make sure to [choose a region where Vertex AI services are
available](https://cloud.google.com/ai-platform-unified/docs/general/locations#available_regions). You may
not use a Multi-Regional Storage bucket for training with Vertex AI.

In [31]:
BUCKET_NAME = "gs://kubeflow-1-0-2-kubeflowpipelines-default"  # <---CHANGE THIS
REGION = "us-central1"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [32]:
# ! gsutil mb -l $REGION $BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [33]:
! gsutil ls -al $BUCKET_NAME

### Import libraries and define constants

Define some constants.


In [34]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

USER = "anurag.bhatia"  # <---CHANGE THIS
PIPELINE_ROOT = "{}/pipeline_root/{}".format(BUCKET_NAME, USER)

PIPELINE_ROOT

env: PATH=/usr/local/cuda/bin:/opt/conda/bin:/opt/conda/condabin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/jupyter/.local/bin:/home/jupyter/.local/bin:/home/jupyter/.local/bin


'gs://kubeflow-1-0-2-kubeflowpipelines-default/pipeline_root/anurag.bhatia'

Do some imports:

In [35]:
from typing import NamedTuple

import kfp
# from google.cloud import aiplatform
from google_cloud_pipeline_components import aiplatform as gcc_aip
from kfp.v2 import dsl
from kfp.v2.dsl import (ClassificationMetrics, Metrics, 
                        Input, Output,
                        component, Model)
from kfp.v2.google.client import AIPlatformClient

## Define a metrics eval custom component

For most of the pipeline steps, we'll be using prebuilt components for Vertex AI services, but we'll define one custom component.  

We'll define the new component as a Python-function-based component. 
Lightweight Python function-based components make it easier to iterate quickly by letting you build your component code as a Python function and generating the component specification for you. 

Note the `@component` decorator.  When you evaluate the `classif_model_eval_metrics` function, the component is compiled to what is essentially a task factory function that can be used in the pipeline definition. 

In addition, a `tables_eval_component.yaml` component definition file will be generated.  The component `yaml` file can be shared & placed under version control, and used later to define a pipeline step. 

You can also see that the component definition specifies a base image for the component to use (if not specified, the default is Python 3.7), and specifies that the `google-cloud-aiplatform` package should be installed. 

This component retrieves the classification model evaluation generated by the AutoML Tabular training process, does some parsing, and uses that info to render the ROC curve and confusion matrix for the model. It also uses given metrics threshold information and compares that to the evaluation results to determine whether the model is sufficiently accurate to deploy.

(Note that if this had been a regression model, the evaluation information would have a different structure.  So, this custom component is specific to an AutoML Tabular classification task).
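
To make the deployment check concrete, here is a standalone sketch of the threshold comparison, runnable without any Vertex AI dependencies. The metric values are made up for illustration, and `passes_thresholds` is a hypothetical name that mirrors the logic of the component's inner check rather than reproducing it exactly.

```python
import json

# Metrics for which a higher value is better; others are ignored by this check.
HIGHER_IS_BETTER = ("auRoc", "auPrc")


def passes_thresholds(metrics: dict, thresholds: dict) -> bool:
    """Return False as soon as any higher-is-better metric falls below its threshold."""
    for name, minimum in thresholds.items():
        if name in HIGHER_IS_BETTER and metrics[name] < minimum:
            return False
    return True


# Made-up evaluation values, for illustration only.
sample_metrics = {"auRoc": 0.97, "auPrc": 0.81, "logLoss": 0.04}

# Thresholds arrive as a JSON string, as in the pipeline parameter.
thresholds = json.loads('{"auRoc": 0.95}')

print(passes_thresholds(sample_metrics, thresholds))                      # True
print(passes_thresholds(sample_metrics, {"auRoc": 0.95, "auPrc": 0.9}))  # False
```

Passing the thresholds as a JSON string keeps the pipeline parameter a plain `str`, at the cost of a `json.loads` call inside the component.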

In [36]:
@component(
           base_image="gcr.io/deeplearning-platform-release/tf2-cpu.2-3:latest",
           output_component_file="tables_eval_component.yaml",
           packages_to_install=["google-cloud-aiplatform"],
          )
def classif_model_eval_metrics(
                                project: str,
                                location: str,  # "us-central1",
                                api_endpoint: str,  # "us-central1-aiplatform.googleapis.com",
                                thresholds_dict_str: str,
                                model: Input[Model],
                                metrics: Output[Metrics],
                                metricsc: Output[ClassificationMetrics],
                              ) -> NamedTuple("Outputs", [("dep_decision", str)]):  # Return parameter.

    """This function renders evaluation metrics for an AutoML Tabular classification model.
    It retrieves the classification model evaluation generated by the AutoML Tabular training
    process, does some parsing, and uses that info to render the ROC curve and confusion matrix
    for the model. It also uses given metrics threshold information and compares that to the
    evaluation results to determine whether the model is sufficiently accurate to deploy.
    """
    import json
    import logging

    from google.cloud import aiplatform

    # Fetch model eval info
    def get_eval_info(client, model_name):
        from google.protobuf.json_format import MessageToDict

        response = client.list_model_evaluations(parent=model_name)
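        metrics_list = []
        metrics_string_list = []
        eval_name = None  # set inside the loop; stays None if no evaluations exist
        for evaluation in response:
            eval_name = evaluation.name
            print("model_evaluation")
            print(" name:", evaluation.name)
            print(" metrics_schema_uri:", evaluation.metrics_schema_uri)
            metrics = MessageToDict(evaluation._pb.metrics)
            for metric in metrics.keys():
                logging.info("metric: %s, value: %s", metric, metrics[metric])
            metrics_str = json.dumps(metrics)
            metrics_list.append(metrics)
            metrics_string_list.append(metrics_str)

        # Fail with a clear message instead of a NameError when nothing was returned.
        if eval_name is None:
            raise ValueError("No evaluations found for model {}".format(model_name))

        return (
                eval_name,
                metrics_list,
                metrics_string_list,
                )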

    # Use the given metrics threshold(s) to determine whether the model is accurate enough to deploy.
    def classification_thresholds_check(metrics_dict, thresholds_dict):
        for k, v in thresholds_dict.items():
            logging.info("k {}, v {}".format(k, v))
            if k in ["auRoc", "auPrc"]:  # higher is better
                if metrics_dict[k] < v:  # if under threshold, don't deploy
                    logging.info("{} < {}; returning False".format(metrics_dict[k], v))
                    return False
        logging.info("threshold checks passed.")
        return True

    def log_metrics(metrics_list, metricsc):
        test_confusion_matrix = metrics_list[0]["confusionMatrix"]
        logging.info("rows: %s", test_confusion_matrix["rows"])

        # log the ROC curve
        fpr = []
        tpr = []
        thresholds = []
        for item in metrics_list[0]["confidenceMetrics"]:
            fpr.append(item.get("falsePositiveRate", 0.0))
            tpr.append(item.get("recall", 0.0))
            thresholds.append(item.get("confidenceThreshold", 0.0))
        print(f"fpr: {fpr}")
        print(f"tpr: {tpr}")
        print(f"thresholds: {thresholds}")
        metricsc.log_roc_curve(fpr, tpr, thresholds)

        # log the confusion matrix
        annotations = []
        for item in test_confusion_matrix["annotationSpecs"]:
            annotations.append(item["displayName"])
        logging.info("confusion matrix annotations: %s", annotations)
        metricsc.log_confusion_matrix(
                                      annotations,
                                      test_confusion_matrix["rows"],
                                     )

        # log textual metrics info as well
        for metric in metrics_list[0].keys():
            if metric != "confidenceMetrics":
                val_string = json.dumps(metrics_list[0][metric])
                metrics.log_metric(metric, val_string)
        # metrics.metadata["model_type"] = "AutoML Tabular classification"

    logging.getLogger().setLevel(logging.INFO)
    aiplatform.init(project=project, location=location)
    # extract the model resource name from the input Model Artifact
    model_resource_path = model.uri.replace("aiplatform://v1/", "")
    logging.info("model path: %s", model_resource_path)

    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    client = aiplatform.gapic.ModelServiceClient(client_options=client_options)
    eval_name, metrics_list, metrics_str_list = get_eval_info(
                                                              client, 
                                                              model_resource_path
                                                             )
    logging.info("got evaluation name: %s", eval_name)
    logging.info("got metrics list: %s", metrics_list)
    log_metrics(metrics_list, metricsc)

    thresholds_dict = json.loads(thresholds_dict_str)
    
    # Conditional deployment: Whether or not to bless (for inference/deployment) the trained model
    deploy = classification_thresholds_check(metrics_list[0], 
                                             thresholds_dict)
    if deploy:
        dep_decision = "true"
    else:
        dep_decision = "false"
    logging.info("deployment decision is %s", dep_decision)

    return (dep_decision,)
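
The ROC-curve step of the component can be tried in isolation. The sketch below applies the same extraction to a small hand-written list shaped like the `confidenceMetrics` entries the component reads. The sample values are invented, and `roc_points` is a hypothetical helper name.

```python
def roc_points(confidence_metrics):
    """Collect false positive rate, recall (TPR), and threshold from each entry.

    Missing keys default to 0.0, matching what the component does.
    """
    fpr, tpr, thresholds = [], [], []
    for item in confidence_metrics:
        fpr.append(item.get("falsePositiveRate", 0.0))
        tpr.append(item.get("recall", 0.0))
        thresholds.append(item.get("confidenceThreshold", 0.0))
    return fpr, tpr, thresholds


# Invented entries; some keys are deliberately absent to exercise the defaults.
sample = [
    {"recall": 1.0, "falsePositiveRate": 1.0},
    {"confidenceThreshold": 0.5, "recall": 0.8, "falsePositiveRate": 0.1},
    {"confidenceThreshold": 1.0},
]

fpr, tpr, thresholds = roc_points(sample)
print(fpr)         # [1.0, 0.1, 0.0]
print(tpr)         # [1.0, 0.8, 0.0]
print(thresholds)  # [0.0, 0.5, 1.0]
```

The three lists have equal length and are in the order that `metricsc.log_roc_curve(fpr, tpr, thresholds)` receives them in the component.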

## Define an AutoML Tabular classification pipeline that uses components from `google_cloud_pipeline_components`


Create a managed tabular dataset from a BigQuery table, then train a classification model on it using AutoML Tabular training.

Generate a model display name to use for the deployment.

In [37]:
import time

DISPLAY_NAME = "card-fraud{}".format(str(int(time.time())))
print(DISPLAY_NAME)

card-fraud1626122557


Define the pipeline:

In [42]:
@kfp.dsl.pipeline(name="automl-card-fraud",  # Don't make this label/name too long, to avoid error later
                  pipeline_root=PIPELINE_ROOT)
def pipeline(
            bq_source: str = "bq://kubeflow-1-0-2:credit_card_fraud.train",
            display_name: str = DISPLAY_NAME,
            project: str = PROJECT_ID,
            gcp_region: str = "us-central1",
            api_endpoint: str = "us-central1-aiplatform.googleapis.com",
            thresholds_dict_str: str = '{"auRoc": 0.95}',
            ):
    dataset_create_op = gcc_aip.TabularDatasetCreateOp(
                                                       project=project, 
                                                       display_name=display_name, 
                                                       bq_source=bq_source
                                                       )

    training_op = gcc_aip.AutoMLTabularTrainingJobRunOp(
                                                        project=project,
                                                        display_name=display_name,
                                                        optimization_prediction_type="classification",
                                                        optimization_objective="minimize-log-loss",  # sets the metric AutoML optimizes; it does not restrict the model family
                                                        budget_milli_node_hours=8000,  # training budget: 8,000 milli node hours = 8 node hours
                                                        column_transformations=[
#                                                                                 {"numeric": {"column_name": "TransactionID"}},
                                                                                {"numeric": {"column_name": "isFraud"}},
                                                                                {"numeric": {"column_name": "TransactionDT"}},
                                                                                {"numeric": {"column_name": "TransactionAmt"}},
                                                                                {"numeric": {"column_name": "card1"}},
                                                                                {"numeric": {"column_name": "card2"}},
                                                                                {"numeric": {"column_name": "card3"}},
                                                                                {"numeric": {"column_name": "C1"}},
                                                                                {"numeric": {"column_name": "C2"}},
                                                                                {"numeric": {"column_name": "C11"}},
                                                                                {"numeric": {"column_name": "C12"}},
                                                                                {"numeric": {"column_name": "C13"}},
                                                                                {"numeric": {"column_name": "C14"}},
                                                                                {"numeric": {"column_name": "D8"}},
                                                                                {"numeric": {"column_name": "V45"}},
                                                                                {"numeric": {"column_name": "V87"}},
                                                                                {"numeric": {"column_name": "V258"}},
                                                                                {"categorical": {"column_name": "ProductCD"}},
                                                                                {"categorical": {"column_name": "card6"}},
                                                                                {"categorical": {"column_name": "emaildomain"}},
#                                                                                 {"categorical": {"column_name": "R_emaildomain"}},
                                                                                ],
                                                        dataset=dataset_create_op.outputs["dataset"],
                                                        target_column="isFraud",  # Whether fraudulent or genuine transaction
                                                        )
    
    model_eval_task = classif_model_eval_metrics(
                                                 project,
                                                 gcp_region,
                                                 api_endpoint,
                                                 thresholds_dict_str,
                                                 training_op.outputs["model"],
                                                )

    with dsl.Condition(
                       model_eval_task.outputs["dep_decision"] == "true",
                       name="deploy_decision",
                      ):

        deploy_op = gcc_aip.ModelDeployOp(
                                          model=training_op.outputs["model"],
                                          project=project,
                                          machine_type="n1-standard-4",
                                         )
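
The training budget in the pipeline is expressed in milli node hours, where 1,000 milli node hours equal one node hour. A small conversion sketch, with hypothetical helper names, makes the value of `budget_milli_node_hours` easier to reason about:

```python
def node_hours_to_milli(node_hours: float) -> int:
    """Convert node hours to milli node hours (1 node hour = 1,000 milli node hours)."""
    return int(round(node_hours * 1000))


def milli_to_node_hours(milli_node_hours: int) -> float:
    """Convert milli node hours back to node hours."""
    return milli_node_hours / 1000


print(node_hours_to_milli(8))     # 8000, the value used in the pipeline above
print(milli_to_node_hours(8000))  # 8.0
print(node_hours_to_milli(1.5))   # 1500
```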

## Compile and run the pipeline

Now, you're ready to compile the pipeline:

In [43]:
from kfp.v2 import compiler

compiler.Compiler().compile(
                            pipeline_func=pipeline, 
                            package_path="tabular_data_classification_pipeline.json"  # to be written
                            )

The pipeline compilation generates the JSON job spec file.
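
Because the job spec is plain JSON, it can be inspected with the standard library before submitting a run. The sketch below only lists the top-level keys and makes no assumption about what they are, since the layout depends on the KFP SDK version. `top_level_keys` is a hypothetical helper.

```python
import json
import os


def top_level_keys(job_spec_path: str) -> list:
    """Load a compiled job spec and return its top-level keys, sorted."""
    with open(job_spec_path) as f:
        spec = json.load(f)
    return sorted(spec.keys())


JOB_SPEC = "tabular_data_classification_pipeline.json"

# Guarded so the cell also runs before the pipeline has been compiled.
if os.path.exists(JOB_SPEC):
    print(top_level_keys(JOB_SPEC))
else:
    print(f"{JOB_SPEC} not found; compile the pipeline first")
```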

Next, instantiate an API client object:

In [44]:
from kfp.v2.google.client import AIPlatformClient

api_client = AIPlatformClient(project_id=PROJECT_ID, 
                              region=REGION)

Then, you run the defined pipeline like this: 

In [45]:
response = api_client.create_run_from_job_spec(
                                                "tabular_data_classification_pipeline.json",  # to be used
                                                pipeline_root=PIPELINE_ROOT,
                                                parameter_values={"project": PROJECT_ID, 
                                                                  "display_name": DISPLAY_NAME},
                                              )
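
The `parameter_values` dictionary can override any pipeline parameter, including the deployment threshold. The sketch below builds such a dictionary with the threshold serialized to the JSON string that the pipeline's `thresholds_dict_str` parameter expects. The helper `build_parameter_values` and the sample values are assumptions for illustration.

```python
import json


def build_parameter_values(project: str, display_name: str, thresholds: dict) -> dict:
    """Assemble run parameters; thresholds are serialized to a JSON string."""
    return {
        "project": project,
        "display_name": display_name,
        "thresholds_dict_str": json.dumps(thresholds),
    }


# Example values only; substitute your own project ID and display name.
params = build_parameter_values("my-project", "card-fraud-demo", {"auRoc": 0.97})
print(params["thresholds_dict_str"])  # {"auRoc": 0.97}
```

The resulting dictionary could then be passed as `parameter_values=params` in the `create_run_from_job_spec` call above. A stricter threshold makes the conditional deployment step less likely to run.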

Click on the generated link to see your run in the Cloud Console.  

<!-- It should look something like this as it is running:

<a href="https://storage.googleapis.com/amy-jo/images/mp/automl_tabular_classif.png" target="_blank"><img src="https://storage.googleapis.com/amy-jo/images/mp/automl_tabular_classif.png" width="40%"/></a> -->

## Performance metrics

Confusion Matrix: (Snapshot from Vertex AI Pipelines Console)

<a href="AutoML_Tabular_Confusion_Matrix.png" target="_blank"><img src="AutoML_Tabular_Confusion_Matrix.png" width="95%"/></a>

Similarly, ROC curve:

<a href="AutoML_Tabular_SDK_ROC.png" target="_blank"><img src="AutoML_Tabular_SDK_ROC.png" width="95%"/></a>

Area Under Curve (AUC): ROC as well as Precision-Recall curves

<a href="AutoML_Tabular_AUC_ROC_PR.png" target="_blank"><img src="AutoML_Tabular_AUC_ROC_PR.png" width="95%"/></a>

Not bad at all, right? :)