# MLOps Stage 3: Automation: Creating a Kubeflow Pipeline

## Overview

In this notebook, we create a Vertex AI Pipeline that trains and deploys an XGBoost model and uses Vertex AI Experiments to log training parameters and metrics.

## Objective

Here, we use prebuilt components in Vertex AI Pipelines to train and deploy a custom XGBoost model. The training package itself logs its training parameters and metrics to Vertex AI Experiments.

This notebook uses the following Google Cloud ML services:
- Google Cloud Pipeline Components
- Vertex AI Training
- Vertex AI Pipelines
- Vertex AI Experiments

The steps performed include:
- Construct an XGBoost training package.
- Add experiment tracking to the training package.
- Construct a pipeline to train and deploy an XGBoost model.
- Execute the pipeline.
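
The experiment-tracking step can be sketched as follows. This is a minimal sketch under stated assumptions, not the notebook's actual training package: the helper name `log_training_run` and all argument values are hypothetical. The `tracker` argument is assumed to be the `google.cloud.aiplatform` module, whose `init`, `start_run`, `log_params`, `log_metrics`, and `end_run` functions are called. Passing the module in as an argument keeps the helper importable without cloud credentials.

```python
from typing import Any, Dict


def log_training_run(
    tracker: Any,
    project: str,
    location: str,
    experiment: str,
    run_name: str,
    params: Dict[str, Any],
    metrics: Dict[str, float],
) -> None:
    """Log one training run's parameters and metrics to an experiment.

    `tracker` is assumed to be the `google.cloud.aiplatform` module.
    """
    # Bind the SDK to a project, region, and experiment.
    tracker.init(project=project, location=location, experiment=experiment)
    # Open a run, record the values, and always close the run afterwards.
    tracker.start_run(run=run_name)
    try:
        tracker.log_params(params)
        tracker.log_metrics(metrics)
    finally:
        tracker.end_run()
```

Inside a training script, a call might look like `log_training_run(aiplatform, project, region, "fraud-xgboost", "run-1", {"max_depth": 6}, {"accuracy": accuracy})`, where the experiment name, run name, and logged values are illustrative only.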

## Dataset

The dataset used in this example is the Synthetic Financial Fraud dataset from Kaggle, generated with PaySim. PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs of a mobile money service operating in an African country. The original logs were provided by a multinational company that runs the mobile financial service in more than 14 countries.

## Installation

Install the packages required to run this notebook.

In [1]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME") and not os.getenv("VIRTUAL_ENV")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install {USER_FLAG} --upgrade --quiet google-cloud-aiplatform \
                                             google-cloud-pipeline-components \
                                             kfp 

## Restart the Kernel

Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.

In [2]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
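
After the kernel restarts, one way to confirm that the install succeeded is to query the installed distribution versions. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical and is not part of the original notebook.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence

# Distribution names installed by the pip command in this notebook.
REQUIRED_PACKAGES = (
    "google-cloud-aiplatform",
    "google-cloud-pipeline-components",
    "kfp",
)


def installed_versions(
    packages: Sequence[str] = REQUIRED_PACKAGES,
) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` line suggests the kernel is using a different environment from the one `pip3` installed into.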

## Set up Project Information

In [1]:
PROJECT_ID = "bq-experiments-350102"

In [2]:
REGION = "us-central1"

In [3]:
from datetime import datetime
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

In [4]:
BUCKET_NAME = "bq-experiments-fraud" 
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [5]:
! gsutil ls -al $BUCKET_URI

 493534783  2022-08-25T16:24:56Z  gs://bq-experiments-fraud/synthetic-fraud.csv#1661444696515532  metageneration=1
      2133  2022-11-11T15:17:22Z  gs://bq-experiments-fraud/trainer_fraud.tar.gz#1668179842539274  metageneration=1
                                 gs://bq-experiments-fraud/mqmcvfd2/
                                 gs://bq-experiments-fraud/pipelines/
                                 gs://bq-experiments-fraud/q0pjoruv/
                                 gs://bq-experiments-fraud/vy5rkufq/
TOTAL: 2 objects, 493536916 bytes (470.67 MiB)
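
The listing shows a `pipelines/` prefix in the bucket. A common pattern is to derive run-specific names and Cloud Storage paths from a timestamp so that repeated runs do not overwrite each other. The sketch below illustrates that pattern; the helpers `pipeline_root` and `run_name` and the `fraud-xgboost` prefix are hypothetical, and the notebook's actual paths may differ.

```python
from datetime import datetime


def pipeline_root(bucket_uri: str, timestamp: str) -> str:
    """Return a run-specific Cloud Storage prefix for pipeline artifacts."""
    return f"{bucket_uri.rstrip('/')}/pipelines/{timestamp}"


def run_name(prefix: str, timestamp: str) -> str:
    """Return a display name that is unique per timestamp."""
    return f"{prefix}-{timestamp}"


# Same format as the TIMESTAMP variable defined in this notebook.
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
print(pipeline_root("gs://bq-experiments-fraud", timestamp))
print(run_name("fraud-xgboost", timestamp))
```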


## Import Libraries

In [10]:
import json
import os

import google.cloud.aiplatform as aip
import tensorflow as tf
from kfp import dsl
from kfp.v2 import compiler

## Initialize Vertex AI SDK

In [7]:
# Pass the region explicitly so every resource is created in REGION
aip.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

## Set Pre-built Containers

In [8]:
TRAIN_VERSION = "xgboost-cpu.1-1"
DEPLOY_VERSION = "xgboost-cpu.1-1"

TRAIN_IMAGE = "{}-docker.pkg.dev/vertex-ai/training/{}:latest".format(
    REGION.split("-")[0], TRAIN_VERSION
)


print(TRAIN_IMAGE)

us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.1-1:latest
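
The cell above builds only `TRAIN_IMAGE`, although `DEPLOY_VERSION` is also defined. A small helper can build both URIs the same way. This is a sketch: the function name `prebuilt_image_uri` is hypothetical, and the `prediction` repository path is an assumption that mirrors the `training` path shown above, so check it against the pre-built container list before relying on it.

```python
def prebuilt_image_uri(region: str, kind: str, version: str) -> str:
    """Build a Vertex AI pre-built container URI.

    `kind` is "training" or "prediction". The multi-region host prefix
    (for example "us") is the first segment of `region`.
    """
    if kind not in ("training", "prediction"):
        raise ValueError(f"unknown image kind: {kind!r}")
    multi_region = region.split("-")[0]
    return f"{multi_region}-docker.pkg.dev/vertex-ai/{kind}/{version}:latest"


train_image = prebuilt_image_uri("us-central1", "training", "xgboost-cpu.1-1")
# Assumed layout for the serving image; not confirmed by the notebook output.
deploy_image = prebuilt_image_uri("us-central1", "prediction", "xgboost-cpu.1-1")
print(train_image)
print(deploy_image)
```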


## Set Machine Type

In [11]:
# Allow a test harness to override the machine family via the environment
MACHINE_TYPE = os.getenv("IS_TESTING_TRAIN_MACHINE") or "n1-standard"

VCPU = "4"
TRAIN_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Train machine type", TRAIN_COMPUTE)

Train machine type n1-standard-4
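
The machine type and training image typically end up in the worker pool specification of a Vertex AI custom training job. The sketch below assembles such a specification as a plain dictionary. The helper name `worker_pool_spec` and the module name `trainer.task` are hypothetical; the package URI is taken from the bucket listing earlier in this notebook, but whether the pipeline uses it in exactly this way is an assumption.

```python
from typing import Any, Dict, Sequence


def worker_pool_spec(
    machine_type: str,
    image_uri: str,
    package_uri: str,
    python_module: str,
    args: Sequence[str] = (),
) -> Dict[str, Any]:
    """Return a single-replica worker pool spec for a Python training package."""
    return {
        "machine_spec": {"machine_type": machine_type},
        "replica_count": 1,
        "python_package_spec": {
            "executor_image_uri": image_uri,
            "package_uris": [package_uri],
            "python_module": python_module,
            "args": list(args),
        },
    }


spec = worker_pool_spec(
    machine_type="n1-standard-4",
    image_uri="us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.1-1:latest",
    package_uri="gs://bq-experiments-fraud/trainer_fraud.tar.gz",
    python_module="trainer.task",  # hypothetical module name
)
print(spec["machine_spec"])
```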
