# MLOps Stage 3: Automation: Creating a Kubeflow Pipeline

## Overview

In this notebook, we create a Vertex AI Pipeline that trains and deploys an XGBoost model and uses Vertex AI Experiments to log training parameters and metrics.

## Objective

Here, we use prebuilt components in Vertex AI Pipelines to train and deploy a custom XGBoost model, and we use Vertex AI Experiments to log the corresponding training parameters and metrics from within the training package.

This notebook uses the following Google Cloud ML services:
- Google Cloud Pipeline Components
- Vertex AI Training
- Vertex AI Pipelines
- Vertex AI Experiments

The steps performed include:
- Construct an XGBoost training package.
- Add experiment tracking.
- Construct a pipeline to train and deploy an XGBoost model.
- Execute the pipeline.

## Dataset

The dataset used in this example is the Synthetic Financial Fraud dataset from Kaggle, which was generated with the PaySim simulator. PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that operates the mobile financial service, which currently runs in more than 14 countries around the world.

## Installation

Install the following packages for executing this notebook.

In [1]:
import os
import IPython

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME") and not os.getenv("VIRTUAL_ENV")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

In [2]:
!pip3 install {USER_FLAG} --upgrade \
google-cloud-aiplatform \
google-cloud-pipeline-components \
kfp && touch pip_installed



## Restart the Kernel

Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.

In [3]:
# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Import Libraries

In [25]:
import json
import os
import kfp

from google_cloud_pipeline_components.experimental.custom_job import utils
import google.cloud.aiplatform as aip
from kfp.v2 import compiler, dsl
from kfp.v2.dsl import component

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

kfp.__version__

'1.8.18'

## Set up Project Information

In [5]:
PROJECT_ID = "bq-experiments-350102"
REGION = "us-central1"
BUCKET_NAME = "bq-experiments-fraud" 
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [11]:
! gsutil ls -al $BUCKET_URI

 493534783  2022-08-25T16:24:56Z  gs://bq-experiments-fraud/synthetic-fraud.csv#1661444696515532  metageneration=1
      2131  2022-12-26T20:22:37Z  gs://bq-experiments-fraud/trainer_fraud.tar.gz#1672086157845914  metageneration=1
                                 gs://bq-experiments-fraud/1skm4wti/
                                 gs://bq-experiments-fraud/k49hwjyi/
                                 gs://bq-experiments-fraud/mqmcvfd2/
                                 gs://bq-experiments-fraud/phk9joqs/
                                 gs://bq-experiments-fraud/pipeline_root/
                                 gs://bq-experiments-fraud/pipelines/
                                 gs://bq-experiments-fraud/q0pjoruv/
                                 gs://bq-experiments-fraud/vy5rkufq/
TOTAL: 2 objects, 493536914 bytes (470.67 MiB)


## Initialize Vertex AI SDK

In [6]:
aip.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)
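
The overview mentions logging training parameters and metrics to Vertex AI Experiments, but the cells in this notebook do not show that step. The following is a minimal sketch of what it could look like. It assumes the `start_run`, `log_params`, `log_metrics`, and `end_run` functions of the Vertex AI SDK, and an experiment already selected through the `experiment` argument of `aip.init`. The helper name `log_training_run` is hypothetical. The tracker is passed in as an argument so that the helper can be exercised without cloud access.

```python
from typing import Any, Dict


def log_training_run(
    tracker: Any,
    run_name: str,
    params: Dict[str, Any],
    metrics: Dict[str, float],
) -> None:
    """Log one training run through a tracker exposing the Vertex AI SDK calls.

    `tracker` is assumed to provide start_run, log_params, log_metrics and
    end_run, as the google.cloud.aiplatform module does.
    """
    tracker.start_run(run_name)
    try:
        tracker.log_params(params)
        tracker.log_metrics(metrics)
    finally:
        # Close the run even if logging fails part-way.
        tracker.end_run()
```

In this notebook the call might look like `log_training_run(aip, "run-1", {"max_depth": 3, "learning_rate": 0.1}, {"f1_score": f1})`, where the run name is a placeholder and `f1` would come from the evaluation step.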

## Service Account Access for Pipelines

In [10]:
SERVICE_ACCOUNT = "402374189238-compute@developer.gserviceaccount.com"

# Give storage access permissions
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

No changes made to gs://bq-experiments-fraud/
No changes made to gs://bq-experiments-fraud/


## Set Pre-built Containers

In [16]:
TRAIN_VERSION = "xgboost-cpu.1-1"
DEPLOY_VERSION = "xgboost-cpu.1-1"

TRAIN_IMAGE = "{}-docker.pkg.dev/vertex-ai/training/{}:latest".format(
    REGION.split("-")[0], TRAIN_VERSION
)

DEPLOY_IMAGE = "{}-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(
    REGION.split("-")[0], DEPLOY_VERSION
)

print(TRAIN_IMAGE)
print(DEPLOY_IMAGE)

us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.1-1:latest
us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-1:latest
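
Both URIs above follow one pattern: the multi-region prefix taken from `REGION`, the repository kind (`training` or `prediction`), and the framework version. The pattern can be sketched as a small helper. The function name `prebuilt_image_uri` is hypothetical, and the default `latest` tag simply mirrors the cell above.

```python
def prebuilt_image_uri(region: str, kind: str, version: str, tag: str = "latest") -> str:
    """Build a pre-built container URI in the format used in the cell above."""
    if kind not in ("training", "prediction"):
        raise ValueError(f"kind must be 'training' or 'prediction', got {kind!r}")
    multi_region = region.split("-")[0]  # "us-central1" -> "us"
    return f"{multi_region}-docker.pkg.dev/vertex-ai/{kind}/{version}:{tag}"


print(prebuilt_image_uri("us-central1", "training", "xgboost-cpu.1-1"))
# us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.1-1:latest
```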


## Set Machine Type

In [17]:
MACHINE_TYPE = "n1-standard"
VCPU = "8"
TRAIN_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Train machine type", TRAIN_COMPUTE)

Train machine type n1-standard-8


### Helper to generate UUIDs

In [18]:
import random
import string

# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))

UUID = generate_uuid()
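
Note that `generate_uuid` returns a short random suffix rather than a standards-compliant UUID, so collisions are unlikely but not impossible. A sketch of how such a suffix is typically used, namely to keep resource display names distinct across runs, is shown below. The helper `unique_name` and the example prefix are hypothetical, and `generate_uuid` is repeated so that the block runs on its own.

```python
import random
import string


def generate_uuid(length: int = 8) -> str:
    """Return a random lowercase alphanumeric suffix (same logic as the cell above)."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


def unique_name(prefix: str, length: int = 8) -> str:
    """Append a random suffix so repeated runs do not reuse the same name."""
    return f"{prefix}-{generate_uuid(length)}"


print(unique_name("custom-xgboost-fraud"))  # the suffix differs on every call
```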

## Create Model Training Single Component

In [133]:
# Training component: loads data from BigQuery, then trains and evaluates the model
@component(
    base_image="python:3.9",
    packages_to_install=[
        "xgboost", "pandas", "scikit-learn==1.0.2", "fsspec", "gcsfs",
        "google-cloud-bigquery", "db_dtypes",
    ],
)
def custom_train_model(
    model_dir: str,
    bq_table: str,
    project_id: str,
    max_depth: int = 3,
    learning_rate: float = 0.1,
):
    # Imports live inside the function because the component runs in its own container
    import json
    import logging
    import os

    import pandas as pd
    import xgboost as xgb

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score
    from google.cloud import bigquery
    
    
    # Function to retrieve data from BigQuery
    def get_data(project_id, bq_table):
        logging.info("Downloading training data from BigQuery: {}, {}".format(project_id, bq_table))
        logging.info("Creating BigQuery client")
        bqclient = bigquery.Client(project=project_id)

        logging.info("Loading table data")
        table = bigquery.TableReference.from_string(bq_table)
        rows = bqclient.list_rows(table)
        dataframe = rows.to_dataframe()

        logging.info("Preparing data for training")
        dataframe["isFraud"] = dataframe["isFraud"].astype(int)
        dataframe.drop(['nameOrig','nameDest','isFlaggedFraud'],axis=1,inplace=True)
        X = pd.concat([dataframe.drop('type', axis=1), pd.get_dummies(dataframe['type'])], axis=1)
        y = X[['isFraud']]
        X = X.drop(['isFraud'],axis=1)

        print("Splitting data for training")
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,random_state=42, shuffle=True)

        logging.info("Finishing get_data")
        return X_train, X_test, y_train, y_test

    # Function to train the model
    def train_model(X_train, y_train, max_depth, learning_rate):
        logging.info("Start training ...")
        model = xgb.XGBClassifier(
                scale_pos_weight=734,
                max_depth=max_depth,
                learning_rate=learning_rate
        )
        model.fit(X_train, y_train)

        logging.info("Training completed")
        return model

    # Function to evaluate the model
    def evaluate_model(model, X_test, y_test):
        # The scikit-learn wrapper predicts directly on the DataFrame
        logging.info("Getting test predictions ...")
        y_pred = model.predict(X_test)

        logging.info("Evaluating predictions ...")
        f1 = f1_score(y_test, y_pred, average='weighted')
        logging.info(f"Evaluation completed with weighted f1 score: {f1}")

        logging.info("Finishing ...")
        return f1

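    # Start of function
    # The root logger defaults to WARNING, so enable INFO to see the messages below
    logging.basicConfig(level=logging.INFO)
    logging.info("Component start")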
    
    X_train, X_test, y_train, y_test = get_data(project_id, bq_table)
    print("Training the model")
    model = train_model(X_train, y_train, max_depth, learning_rate)
    print("Evaluating the model")
    f1 = evaluate_model(model, X_test, y_test)
    metric_dict = {'f1_score': f1}

    # GCSFuse conversion: map the gs:// URI to the mounted /gcs/ path
    gs_prefix = 'gs://'
    gcsfuse_prefix = '/gcs/'
    if model_dir.startswith(gs_prefix):
        model_dir = model_dir.replace(gs_prefix, gcsfuse_prefix)
    # Make sure the output directory exists before writing to it
    os.makedirs(model_dir, exist_ok=True)

    # Export the classifier to a file
    gcs_model_path = os.path.join(model_dir, 'model.bst')
    logging.info("Saving model artifacts to {}".format(gcs_model_path))
    model.save_model(gcs_model_path)

    # Save the evaluation metrics next to the model
    gcs_metrics_path = os.path.join(model_dir, 'metrics.json')
    logging.info("Saving metrics to {}".format(gcs_metrics_path))
    with open(gcs_metrics_path, "w") as f:
        f.write(json.dumps(metric_dict))
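
The path handling at the end of the component relies on Vertex AI custom training mounting Cloud Storage buckets under `/gcs/` through Cloud Storage FUSE, so a `gs://bucket/path` URI corresponds to the local path `/gcs/bucket/path`. Below is a minimal sketch of that conversion together with the metrics write, runnable locally. The helper names `to_gcsfuse_path` and `write_metrics` and the bucket name `my-bucket` are hypothetical, and a temporary directory stands in for the mounted bucket.

```python
import json
import os
import tempfile


def to_gcsfuse_path(uri: str) -> str:
    """Map gs://bucket/path to /gcs/bucket/path; leave other paths unchanged."""
    prefix = "gs://"
    if uri.startswith(prefix):
        return "/gcs/" + uri[len(prefix):]
    return uri


def write_metrics(model_dir: str, metrics: dict) -> str:
    """Write metrics.json under model_dir, creating the directory if needed."""
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "metrics.json")
    with open(path, "w") as f:
        json.dump(metrics, f)
    return path


print(to_gcsfuse_path("gs://my-bucket/model"))  # /gcs/my-bucket/model

# Local stand-in for the mounted bucket.
with tempfile.TemporaryDirectory() as tmp:
    out = write_metrics(os.path.join(tmp, "model"), {"f1_score": 0.5})
    with open(out) as f:
        print(json.load(f))  # {'f1_score': 0.5}
```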

In [134]:
custom_job_training_op = utils.create_custom_training_job_op_from_component(
    custom_train_model, replica_count=1
)
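
The training component hard-codes `scale_pos_weight=734` to compensate for how rare fraud cases are. The XGBoost documentation suggests the ratio of negative to positive examples as a typical starting value for this parameter. A sketch of computing that ratio from the training labels, rather than fixing it, follows. The helper name `compute_scale_pos_weight` is hypothetical, and the toy labels are illustrative only, not drawn from the dataset.

```python
from typing import Iterable


def compute_scale_pos_weight(labels: Iterable[int]) -> float:
    """Return (# negative examples) / (# positive examples) for binary labels."""
    labels = list(labels)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples; the ratio is undefined")
    return negatives / positives


toy_labels = [0] * 9 + [1]  # illustrative labels, not real data
print(compute_scale_pos_weight(toy_labels))  # 9.0
```

Inside the component, the same helper could be applied to the `isFraud` column of `y_train` before the classifier is constructed.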

## Construct Custom Training Pipeline

Construct a pipeline that runs the training component as a Vertex AI custom training job, as follows:

1. Pipeline arguments:
- **model_dir:** The Cloud Storage location to store model artifacts.
- **bq_table:** The BigQuery table that holds the training data.
- **project_id:** The project ID used to create the BigQuery client.
- **max_depth:** The maximum tree depth of the XGBoost model.
- **learning_rate:** The learning rate of the XGBoost model.

2. Use custom_job_training_op, created above from the custom_train_model component, to train the model, where:
- The dataset is read from BigQuery inside the training component.
- The hyperparameters are passed in as pipeline arguments.
- The project, region, and base output directory of the custom job are set in the call to the component.
- The component writes the model and its metrics to model_dir.

Note: Since each component is executed as a graph node in its own execution context, you pass the parameter project for each component op, in contrast to calling aip.init(project=project) once, as you would in a Python script that calls the SDK methods directly within the same execution context.

In [135]:
PIPELINE_ROOT = "{}/pipeline_root/custom_xgboost_training".format(BUCKET_URI)
MODEL_DIR = BUCKET_URI + "/model"
BQ_TABLE = "bq-experiments-350102.synthetic_financial_fraud.fraud_data"
MAX_DEPTH = 1
LEARNING_RATE = 6e-8

@dsl.pipeline(
    name="fraud-xgboost",
    description="Train and deploy a custom XGBoost model for fraud detection",
)
def pipeline(
    model_dir: str = MODEL_DIR,
    bq_table: str = BQ_TABLE,
    project_id: str = PROJECT_ID,
    max_depth: int = MAX_DEPTH,
    learning_rate: float = LEARNING_RATE,
):
    # Run the training component as a Vertex AI custom job
    _ = custom_job_training_op(
        model_dir=model_dir,
        bq_table=bq_table,
        learning_rate=learning_rate,
        max_depth=max_depth,
        project_id=project_id,
        project=PROJECT_ID,
        location=REGION,
        base_output_directory=PIPELINE_ROOT,
    )

## Compile and Execute The Pipeline

In [136]:
compiler.Compiler().compile(pipeline_func=pipeline, package_path="custom_xgboost_training.json")
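
Compilation writes the pipeline definition to a local JSON file, which can be inspected before the job is submitted. The sketch below lists the task names found in such a file. It assumes that the tasks sit under `root.dag.tasks`, possibly wrapped in a top-level `pipelineSpec` key; that layout is an assumption about the compiler output rather than something shown in this notebook. The helper name `list_pipeline_tasks` is hypothetical.

```python
import json
from typing import List


def list_pipeline_tasks(path: str) -> List[str]:
    """Return the sorted task names found in a compiled pipeline JSON file."""
    with open(path) as f:
        doc = json.load(f)
    # Assumption: the spec may be nested under a "pipelineSpec" key.
    spec = doc.get("pipelineSpec", doc)
    tasks = spec.get("root", {}).get("dag", {}).get("tasks", {})
    return sorted(tasks)
```

For the file compiled above, the call would be `list_pipeline_tasks("custom_xgboost_training.json")`.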

In [None]:
# Use a separate name so the pipeline function defined above is not shadowed
pipeline_job = aip.PipelineJob(
    display_name="custom_xgboost_fraud",
    template_path="custom_xgboost_training.json",
    pipeline_root=PIPELINE_ROOT,
)

pipeline_job.run(service_account=SERVICE_ACCOUNT)

Creating PipelineJob
PipelineJob created. Resource name: projects/402374189238/locations/us-central1/pipelineJobs/fraud-xgboost-20221227185850
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/402374189238/locations/us-central1/pipelineJobs/fraud-xgboost-20221227185850')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/fraud-xgboost-20221227185850?project=402374189238
PipelineJob projects/402374189238/locations/us-central1/pipelineJobs/fraud-xgboost-20221227185850 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/402374189238/locations/us-central1/pipelineJobs/fraud-xgboost-20221227185850 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/402374189238/locations/us-central1/pipelineJobs/fraud-xgboost-20221227185850 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/402374189238/locations/us-central1/pipelineJobs/fraud-xgboost-