# MLOps stage 2: Experimentation: Vertex AI Training for XGBoost with Hyperparameter Tuning

## Overview

This notebook demonstrates how to use Vertex AI for end-to-end (E2E) MLOps on Google Cloud in production. It covers stage 2, experimentation: Vertex AI Training for XGBoost with hyperparameter tuning.

## Objective

In this tutorial, you learn how to use Vertex AI Training to train a custom XGBoost model.

This tutorial uses the following Google Cloud ML services:
- Vertex AI Training
- Vertex AI Model resource
- Vizier hyperparameter tuning

The steps performed include:
- Train using a Python package.
- Report the evaluation metric (specificity in this notebook) for hyperparameter tuning.
- Save the model artifacts to Cloud Storage using GCSFuse.
- Create a Vertex AI Model resource.

## Dataset

The dataset used in this example is the Synthetic Financial Fraud dataset from Kaggle. PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that provides the mobile financial service, which currently runs in more than 14 countries around the world.

## Installations

In [1]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME") and not os.getenv("VIRTUAL_ENV")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install --upgrade google-cloud-aiplatform $USER_FLAG -q

## Restart Kernel

In [2]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Set up Project Information

In [1]:
PROJECT_ID = "bq-experiments-350102"

In [2]:
REGION = "us-central1"

In [3]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

In [4]:
BUCKET_NAME = "bq-experiments-fraud" 
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [5]:
! gsutil ls -al $BUCKET_URI

 493534783  2022-08-25T16:24:56Z  gs://bq-experiments-fraud/synthetic-fraud.csv#1661444696515532  metageneration=1
                                 gs://bq-experiments-fraud/pipelines/
TOTAL: 1 objects, 493534783 bytes (470.67 MiB)


## Initialize AIP SDK

In [6]:
import google.cloud.aiplatform as aip

In [7]:
aip.init(project=PROJECT_ID, staging_bucket=BUCKET_URI)

## Set Hardware Accelerators

In [8]:
import os

if os.getenv("IS_TESTING_TRAIN_GPU"):
    TRAIN_GPU, TRAIN_NGPU = (
        aip.gapic.AcceleratorType.NVIDIA_TESLA_K80,
        int(os.getenv("IS_TESTING_TRAIN_GPU")),
    )
else:
    TRAIN_GPU, TRAIN_NGPU = (None, None)

if os.getenv("IS_TESTING_DEPLOY_GPU"):
    DEPLOY_GPU, DEPLOY_NGPU = (
        aip.gapic.AcceleratorType.NVIDIA_TESLA_K80,
        int(os.getenv("IS_TESTING_DEPLOY_GPU")),
    )
else:
    DEPLOY_GPU, DEPLOY_NGPU = (None, None)

## Set Pre-built Containers

In [9]:
TRAIN_VERSION = "xgboost-cpu.1-1"
DEPLOY_VERSION = "xgboost-cpu.1-1"

TRAIN_IMAGE = "{}-docker.pkg.dev/vertex-ai/training/{}:latest".format(
    REGION.split("-")[0], TRAIN_VERSION
)
DEPLOY_IMAGE = "{}-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(
    REGION.split("-")[0], DEPLOY_VERSION
)
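
The image URIs above combine the multi-region prefix of REGION (for example `us` from `us-central1`) with the container version. As a minimal sketch, the same string construction can be wrapped in a helper. The name `prebuilt_image_uri` is hypothetical and used here only for illustration; it is not part of the Vertex AI SDK.

```python
def prebuilt_image_uri(region: str, kind: str, version: str, tag: str = "latest") -> str:
    """Build a pre-built container URI with the same pattern as the cell above.

    `kind` is either "training" or "prediction".
    """
    if kind not in ("training", "prediction"):
        raise ValueError(f"Unsupported container kind: {kind}")
    # "us-central1" -> "us", "europe-west4" -> "europe"
    multi_region = region.split("-")[0]
    return f"{multi_region}-docker.pkg.dev/vertex-ai/{kind}/{version}:{tag}"


if __name__ == "__main__":
    print(prebuilt_image_uri("us-central1", "training", "xgboost-cpu.1-1"))
    # prints us-docker.pkg.dev/vertex-ai/training/xgboost-cpu.1-1:latest
```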

## Set Machine Type

In [10]:
if os.getenv("IS_TESTING_TRAIN_MACHINE"):
    MACHINE_TYPE = os.getenv("IS_TESTING_TRAIN_MACHINE")
else:
    MACHINE_TYPE = "n1-standard"

VCPU = "4"
TRAIN_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Train machine type", TRAIN_COMPUTE)

Train machine type n1-standard-4


# Scikit-learn Training

Once you have trained a scikit-learn model, save it to a Cloud Storage location so that it can subsequently be uploaded to a Vertex AI Model resource. The general steps are:
- Save the in-memory model to the local filesystem in pickle format (e.g., model.pkl).
- Create a Cloud Storage client.
- Upload the pickle file as a blob to the specified Cloud Storage location using the Cloud Storage client.

The training script in this notebook uses the XGBoost scikit-learn wrapper (XGBClassifier) and takes a shorter route: it writes the model as model.bst directly to the GCSFuse-mounted path of the model directory. You can also do hyperparameter tuning with an XGBoost model.
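
The pickle-based steps listed above could be sketched roughly as follows. The helper names (`split_gcs_uri`, `save_model_locally`, `upload_to_gcs`) are hypothetical, and the upload step assumes the `google-cloud-storage` client library is installed and credentials are available; it is imported lazily so the other helpers run without it.

```python
import pickle
from typing import Any, Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/blob' into ('bucket', 'path/to/blob')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"Not a Cloud Storage URI: {uri}")
    bucket, _, blob = uri[len("gs://"):].partition("/")
    return bucket, blob


def save_model_locally(model: Any, local_path: str = "model.pkl") -> str:
    """Step 1: serialize the in-memory model to a local pickle file."""
    with open(local_path, "wb") as f:
        pickle.dump(model, f)
    return local_path


def upload_to_gcs(local_path: str, gcs_uri: str) -> None:
    """Steps 2 and 3: create a storage client and upload the file as a blob."""
    from google.cloud import storage  # assumption: google-cloud-storage is installed

    bucket_name, blob_name = split_gcs_uri(gcs_uri)
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)
```

A call such as `upload_to_gcs(save_model_locally(model), f"{BUCKET_URI}/model.pkl")` would chain the steps, where `model` is any picklable estimator.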

## Training Package Description

**Package layout**
Before you start the training, you need to look at how a Python package is assembled for a custom training job. When unarchived, the package contains the following directory/file layout.
- PKG-INFO
- README.md
- setup.cfg
- setup.py
- trainer
  - __init__.py
  - task.py

The files setup.cfg and setup.py are the instructions for installing the package into the operating environment of the Docker image.

The file trainer/task.py is the Python script for executing the custom training job. Note that when it is referenced as the Python module name of the training job, the directory slash is replaced with a dot (trainer.task) and the file suffix (.py) is dropped.
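
The path-to-module conversion described above can be illustrated with a short sketch. The function name `module_name_from_path` is hypothetical and exists only to show the rule; the notebook itself simply writes the module name by hand.

```python
from pathlib import PurePosixPath


def module_name_from_path(script_path: str) -> str:
    """Turn a package-relative script path into a dotted module name."""
    path = PurePosixPath(script_path)
    if path.suffix != ".py":
        raise ValueError(f"Expected a .py file, got: {script_path}")
    # Drop the ".py" suffix and join the path components with dots.
    return ".".join(path.with_suffix("").parts)


if __name__ == "__main__":
    print(module_name_from_path("trainer/task.py"))  # prints trainer.task
```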

## Create UUID

In [19]:
import random
import string


# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

## Package Assembly

The following cells will assemble the training package.

In [11]:
# Make folder for Python training script
! rm -rf custom
! mkdir custom

# Add package information
! touch custom/README.md

setup_cfg = "[egg_info]\n\ntag_build =\n\ntag_date = 0"
! echo "$setup_cfg" > custom/setup.cfg

setup_py = "import setuptools\n\nsetuptools.setup(\n\n    install_requires=[\n\n        'cloudml-hypertune',\n\n    ],\n\n    packages=setuptools.find_packages())"
! echo "$setup_py" > custom/setup.py

pkg_info = "Metadata-Version: 1.0\n\nName: Financial Fraud Classification\n\nVersion: 0.0.0\n\nSummary: Demonstration training script\n\nHome-page: www.google.com\n\nAuthor: Google\n\nAuthor-email: bryanfreeman@google.com\n\nLicense: Public\n\nDescription: Demo\n\nPlatform: Vertex"
! echo "$pkg_info" > custom/PKG-INFO

# Make the training subfolder
! mkdir custom/trainer
! touch custom/trainer/__init__.py

## Create the task script for the training package

Next, create the task.py script that drives the training package. Some notable steps include:

Command-line arguments:
- model-dir: The location to save the trained model. When using Vertex AI custom training, the location defaults to the value of the environment variable AIP_MODEL_DIR.
- project-id: The project ID used to create the BigQuery client.
- bq-table: The BigQuery table that holds the training data.

Data preprocessing (get_data()):
- Loads the table from BigQuery, one-hot encodes the transaction type, and splits the data into training and test sets.

Training (train_model()):
- Trains the model.

Evaluation (evaluate_model()):
- Evaluates the model.
- Reports specificity as the hyperparameter tuning metric.

Model artifact saving:
- Saves the model artifacts and evaluation metrics to the Cloud Storage location specified by model-dir.
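
The metric that task.py computes in evaluate_model() is specificity, TN / (TN + FP): the share of legitimate transactions that the model correctly leaves unflagged. As a minimal sketch of that calculation without scikit-learn, assuming binary labels where 1 means fraud (the sample labels below are made up for illustration):

```python
from typing import Sequence


def specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return TN / (TN + FP) for binary labels (0 = negative, 1 = positive)."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tn + fp == 0:
        raise ValueError("Specificity is undefined without negative examples")
    return tn / (tn + fp)


if __name__ == "__main__":
    y_true = [0, 0, 0, 0, 1, 1]
    y_pred = [0, 0, 0, 1, 1, 0]
    # 3 true negatives and 1 false positive among the 4 negatives
    print(specificity(y_true, y_pred))  # prints 0.75
```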

In [135]:
%%writefile custom/trainer/task.py
import datetime
import os
import subprocess
import sys
import pandas as pd
import xgboost as xgb
import hypertune
import argparse
import logging
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from google.cloud import bigquery

# SET UP TRAINING SCRIPT ARGUMENTS
parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', dest='model_dir',
                    default=os.getenv('AIP_MODEL_DIR'), type=str, help='Model dir.')
parser.add_argument("--project-id", dest="project_id",
                    type=str, help="Project id for bigquery client.")
parser.add_argument("--bq-table", dest="bq_table",
                    type=str, help="Download url for the training data.")
args = parser.parse_args()

logging.getLogger().setLevel(logging.INFO)

# Function to retrieve data from BigQuery
def get_data():
    logging.info("Downloading training data from BigQuery: {}, {}".format(args.project_id, args.bq_table))
    logging.info("Creating BigQuery client")
    bqclient = bigquery.Client(project=args.project_id)
    
    logging.info("Loading table data")
    table = bigquery.TableReference.from_string(args.bq_table)
    rows = bqclient.list_rows(table)
    dataframe = rows.to_dataframe()
    
    logging.info("Preparing data for training")
    dataframe["isFraud"] = dataframe["isFraud"].astype(int)
    dataframe.drop(['nameOrig','nameDest','isFlaggedFraud'],axis=1,inplace=True)
    X = pd.concat([dataframe.drop('type', axis=1), pd.get_dummies(dataframe['type'])], axis=1)
    y = X[['isFraud']]
    X = X.drop(['isFraud'],axis=1)
    
    logging.info("Splitting data for training")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,random_state=42, shuffle=True)
    
    logging.info("Finishing get_data")
    return X_train, X_test, y_train, y_test

# Function to train the model
def train_model(X_train, y_train):
    logging.info("Start training ...")
    model = xgb.XGBClassifier(scale_pos_weight=734)
    model.fit(X_train, y_train)
    
    logging.info("Training completed")
    return model

# Function to evaluate the model
def evaluate_model(model, X_test, y_test):
    logging.info("Preparing test data ...")
    data_test = xgb.DMatrix(X_test)
    
    logging.info("Getting test predictions ...")
    y_pred = model.predict(X_test)
    
    logging.info("Evaluating predictions ...")
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    specificity = tn / (tn+fp)
    logging.info(f"Evaluation completed with model specificity: {specificity}")

    logging.info("Report metric for hyperparameter tuning ...")
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag='specificity',
        metric_value=specificity
    )
    
    logging.info("Finishing ...")
    return specificity

X_train, X_test, y_train, y_test = get_data()
model = train_model(X_train, y_train)
specificity = evaluate_model(model, X_test, y_test)

import json

# GCSFuse conversion: map gs://bucket/path to the /gcs/bucket/path mount
gs_prefix = 'gs://'
gcsfuse_prefix = '/gcs/'
if args.model_dir.startswith(gs_prefix):
    args.model_dir = args.model_dir.replace(gs_prefix, gcsfuse_prefix)
# Create the model directory itself if it does not exist yet
os.makedirs(args.model_dir, exist_ok=True)

# Export the classifier to a file
gcs_model_path = os.path.join(args.model_dir, 'model.bst')
logging.info("Saving model artifacts to {}".format(gcs_model_path))
model.save_model(gcs_model_path)

# Write the evaluation metric as valid JSON
gcs_metrics_path = os.path.join(args.model_dir, 'metrics.json')
logging.info("Saving metrics to {}".format(gcs_metrics_path))
with open(gcs_metrics_path, "w") as f:
    json.dump({"specificity": float(specificity)}, f)

Overwriting custom/trainer/task.py
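
The GCSFuse conversion in the script relies on Vertex AI custom training exposing Cloud Storage buckets under /gcs/ through Cloud Storage FUSE, so a gs:// URI can be rewritten into an ordinary file path. A minimal sketch of that mapping as a standalone helper (the name `to_gcsfuse_path` is hypothetical):

```python
GS_PREFIX = "gs://"
GCSFUSE_PREFIX = "/gcs/"


def to_gcsfuse_path(uri: str) -> str:
    """Map gs://bucket/path to /gcs/bucket/path; leave other paths unchanged."""
    if uri.startswith(GS_PREFIX):
        return GCSFUSE_PREFIX + uri[len(GS_PREFIX):]
    return uri


if __name__ == "__main__":
    print(to_gcsfuse_path("gs://bq-experiments-fraud/q0pjoruv/model"))
    # prints /gcs/bq-experiments-fraud/q0pjoruv/model
```

Only the leading prefix is replaced here, so a path that is already local passes through untouched.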


## Store Training Script to Cloud Storage Bucket

In [136]:
! rm -f custom.tar custom.tar.gz
! tar cvf custom.tar custom
! gzip custom.tar
! gsutil cp custom.tar.gz $BUCKET_URI/trainer_fraud.tar.gz

custom/
custom/setup.py
custom/setup.cfg
custom/README.md
custom/PKG-INFO
custom/trainer/
custom/trainer/task.py
custom/trainer/__init__.py
Copying file://custom.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  2.0 KiB/  2.0 KiB]                                                
Operation completed over 1 objects/2.0 KiB.                                      


## Create Custom Training Job

In [137]:
DISPLAY_NAME = "fraud_" + UUID

job = aip.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    python_package_gcs_uri=f"{BUCKET_URI}/trainer_fraud.tar.gz",
    python_module_name="trainer.task",
    container_uri=TRAIN_IMAGE,
    model_serving_container_image_uri=DEPLOY_IMAGE,
    project=PROJECT_ID,
)

## Prepare Command Line Arguments

In [138]:
MODEL_DIR = "{}/{}".format(BUCKET_URI, UUID)
BQ_TABLE = "bq-experiments-350102.synthetic_financial_fraud.fraud_data"
ROUNDS = 20

DIRECT = False
if DIRECT:
    CMDARGS = [
        "--project-id=" + PROJECT_ID,
        "--bq-table=" + BQ_TABLE,
        "--model-dir=" + MODEL_DIR,
    ]
else:
    CMDARGS = [
        "--project-id=" + PROJECT_ID,
        "--bq-table=" + BQ_TABLE,
    ]
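
Because argparse matches option strings literally, the flag for the model directory must be spelled `--model-dir`, exactly as task.py declares it; a `--model_dir` spelling is not recognized. When the flag is omitted, the value falls back to AIP_MODEL_DIR. The sketch below rebuilds the script's parser to show both behaviors; `build_parser` is a hypothetical wrapper and the argument values are placeholders.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Recreate the three arguments that task.py declares."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", dest="model_dir",
                        default=os.getenv("AIP_MODEL_DIR"), type=str)
    parser.add_argument("--project-id", dest="project_id", type=str)
    parser.add_argument("--bq-table", dest="bq_table", type=str)
    return parser


if __name__ == "__main__":
    parser = build_parser()

    # Hyphenated flag: parsed into args.model_dir
    args = parser.parse_args(["--model-dir=gs://example-bucket/run"])
    print(args.model_dir)

    # Underscore spelling: left over as an unrecognized argument
    _, unknown = parser.parse_known_args(["--model_dir=gs://example-bucket/run"])
    print(unknown)
```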

## Run Custom Training Job

In [139]:
if TRAIN_GPU:
    model = job.run(
        model_display_name="fraud_" + UUID,
        args=CMDARGS,
        replica_count=1,
        machine_type=TRAIN_COMPUTE,
        accelerator_type=TRAIN_GPU.name,
        accelerator_count=TRAIN_NGPU,
        base_output_dir=MODEL_DIR,
        sync=False,
    )
else:
    model = job.run(
        model_display_name="fraud_" + UUID,
        args=CMDARGS,
        replica_count=1,
        machine_type=TRAIN_COMPUTE,
        base_output_dir=MODEL_DIR,
        sync=False,
    )

model_path_to_deploy = MODEL_DIR

Training Output directory:
gs://bq-experiments-fraud/q0pjoruv 


## List Custom Training Job

In [140]:
_job = job.list(filter=f"display_name={DISPLAY_NAME}")
print(_job)

View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/324854833595023360?project=402374189238
[<google.cloud.aiplatform.training_jobs.CustomPythonPackageTrainingJob object at 0x7f2dc42b9c10> 
resource name: projects/402374189238/locations/us-central1/trainingPipelines/324854833595023360, <google.cloud.aiplatform.training_jobs.CustomPythonPackageTrainingJob object at 0x7f2dc42245d0> 
resource name: projects/402374189238/locations/us-central1/trainingPipelines/9030875863255613440, <google.cloud.aiplatform.training_jobs.CustomPythonPackageTrainingJob object at 0x7f2dc420c410> 
resource name: projects/402374189238/locations/us-central1/trainingPipelines/1616824976696934400, <google.cloud.aiplatform.training_jobs.CustomPythonPackageTrainingJob object at 0x7f2dc4211210> 
resource name: projects/402374189238/locations/us-central1/trainingPipelines/4118574569701244928, <google.cloud.aiplatform.training_jobs.CustomPythonPackageTrainingJob object at 0x7f2dc42

## Wait for Custom Training Job to Complete

In [141]:
model.wait()

CustomPythonPackageTrainingJob projects/402374189238/locations/us-central1/trainingPipelines/324854833595023360 current state:
PipelineState.PIPELINE_STATE_RUNNING
View backing custom job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/6015944611149643776?project=402374189238
CustomPythonPackageTrainingJob project