# Using the gapic Vertex Vizier API

Run this inside a Virtual Environment (even if in a Notebook instance) to isolate the different library versions (and in particular, Tensorflow).

__[Instructions](https://janakiev.com/blog/jupyter-virtual-envs/)__ heree.

In [1]:
from time import time
start=time()

This Notebook is built to run in Google Cloud Notebooks, where you are already authenticated.

In [2]:
import sys
import imp
import os 
assert os.path.exists("/opt/deeplearning/metadata/env_version"), "Should be running in Notebook"

In [3]:
! pip install --user --upgrade google-cloud-aiplatform
! pip install --user xgboost==1.2
! pip install --user witwidget
! pip install --user wheel
! pip install --user pandas



Download the file from Joshua's bucket.

In [4]:
csvfile='fraud-header-withy.csv'
gs_csvfile = f'gs://joshuafraud/{csvfile}' 
if not os.path.exists(csvfile):
  !gsutil cp $gs_csvfile .

In [5]:
import pandas as pd
import xgboost as xgb
import numpy as np
import collections
import witwidget
import datetime
import json

from google.cloud import aiplatform

from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.utils import shuffle
from witwidget.notebook.visualization import WitWidget, WitConfigBuilder

2021-12-26 12:19:12.375716: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-12-26 12:19:12.375765: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In Google Cloud Notebooks, you are already authenticated.

In [6]:
REGION = "us-west1"  
 
shell_output = !gcloud config get-value project
PROJECT_ID = shell_output[0]
assert PROJECT_ID

New Study ID each time to avoid collisions.

In [7]:
proj = PROJECT_ID.replace("-", "")
date_s= datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
STUDY_DISPLAY_NAME = f"{proj}_study_{date_s}"
ENDPOINT = REGION + "-aiplatform.googleapis.com"
PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"

print("ENDPOINT", ENDPOINT,"REGION", REGION, "PARENT", PARENT,"PROJECT_ID", PROJECT_ID)

ENDPOINT us-west1-aiplatform.googleapis.com REGION us-west1 PARENT projects/joshua-playground/locations/us-west1 PROJECT_ID joshua-playground


This is ordinary setup for a ML training job. 

In [8]:
#[0]
COLUMN_NAMES =  {
    "isFraud": np.int8,
    "type_CASH_IN": np.int8,
    "type_CASH_OUT": np.int8,
    "type_DEBIT": np.int8,
    "type_PAYMENT": np.int8,
    "type_TRANSFER": np.int8,
    "amount_nml": np.float64,
    "oldBalanceOrigin_nml": np.float64,
    "newBalanceOrigin_nml": np.float64,
    "oldBalanceDestination_nml": np.float64,
    "newBalanceDestination_nml": np.float64,
    "peakHours":np.int8
}

In [9]:
data = pd.read_csv(
  csvfile,
  index_col=None,
  dtype=COLUMN_NAMES
)
data = data.dropna()
data = shuffle(data, random_state=2)

data.head()

Unnamed: 0,isFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,amount_nml,oldBalanceOrigin_nml,newBalanceOrigin_nml,oldBalanceDestination_nml,newBalanceDestination_nml,peakHours
4801897,0,0,1,0,0,0,0.070289,-0.28214,-0.292442,-0.323814,-0.272905,0
2779475,0,0,1,0,0,0,0.324777,-0.284135,-0.292442,-0.323814,-0.231079,0
784465,0,0,1,0,0,0,-0.105303,-0.276696,-0.292442,-0.323814,-0.319018,0
5635648,0,0,1,0,0,0,-0.211367,-0.288381,-0.292442,-0.323814,-0.307404,0
301298,0,1,0,0,0,0,-0.101613,5.590381,5.555191,-0.095958,0.011459,0


In [10]:
labels = data['isFraud'].values
data = data.drop(columns=['isFraud'])
data.head()

Unnamed: 0,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,amount_nml,oldBalanceOrigin_nml,newBalanceOrigin_nml,oldBalanceDestination_nml,newBalanceDestination_nml,peakHours
4801897,0,1,0,0,0,0.070289,-0.28214,-0.292442,-0.323814,-0.272905,0
2779475,0,1,0,0,0,0.324777,-0.284135,-0.292442,-0.323814,-0.231079,0
784465,0,1,0,0,0,-0.105303,-0.276696,-0.292442,-0.323814,-0.319018,0
5635648,0,1,0,0,0,-0.211367,-0.288381,-0.292442,-0.323814,-0.307404,0
301298,1,0,0,0,0,-0.101613,5.590381,5.555191,-0.095958,0.011459,0


In [11]:
print("length of data", len(data))

length of data 6362620


Define hyperparameters

In [12]:
#[1]
# eta is learning rate
param_eta = {"parameter_id": "eta", "double_value_spec": {"min_value": 0, "max_value": 1}}
param_min_child_weight = {"parameter_id": "min_child_weight", "double_value_spec": {"min_value": 0, "max_value": 3}}
# Alpha is L1 regularization term on weights
param_alpha = {"parameter_id": "alpha", "double_value_spec": {"min_value": 0, "max_value": 3}}
param_max_depth = {"parameter_id": "max_depth", "integer_value_spec": {"min_value": 3, "max_value": 9}}

Because it is an unbalanced dataset, we use Balanced Accuracy as our target metric: The Average of recall on each class (positive and negative).

In [13]:
#[2]
balanced_accuracy = {"metric_id": "balanced_accuracy", "goal": "MAXIMIZE"}
parameter_defs=[  param_eta,param_min_child_weight,param_alpha,param_max_depth ]
study = {
    "display_name": STUDY_DISPLAY_NAME,
    "study_spec": {
        "algorithm": "ALGORITHM_UNSPECIFIED",
        "parameters": parameter_defs,
        "metrics": [balanced_accuracy],
    },
}

Here we have a Python class specific to the API.

In [14]:
#[3]
vizier_client = aiplatform.gapic.VizierServiceClient(client_options=dict(api_endpoint=ENDPOINT))
study = vizier_client.create_study(parent=PARENT, study=study)

"Creating the metric" (balanced accuracy) actually means running an entire training job.

In [15]:
#[7]
def create_metric(trial_id, params, data, labels):
    print(round(time()-start,1), f"seconds elapsed; Trial #{trial_id.split('/')[-1]}:" , params )

    x,y = data.values, labels
    x_train, x_test, y_train, y_test = train_test_split(x,y)

    model = xgb.XGBClassifier(objective='reg:logistic',   **params)

    model.fit(x_train, y_train)
    #[8]
    y_pred = model.predict(x_test)
    balanced_accuracy = balanced_accuracy_score(y_test, y_pred.round())
    print("balanced_accuracy", balanced_accuracy)
    
    balanced_accuracy_metric = {"metric_id": "balanced_accuracy", "value": balanced_accuracy}
 
    return [balanced_accuracy_metric]

In [16]:
def trial_params_to_dict_with_rounded_ints(trial_params):
    ret = {}
    int_params = { p["parameter_id"] for p in parameter_defs if "integer_value_spec" in p  }
    for p in trial_params:
        if p.parameter_id in int_params:
            ret[p.parameter_id] = int(round(p.value))
        else:
            ret[p.parameter_id]=p.value
    return ret      

Run a batch of suggested trials.

In [17]:
#[5]
def do_trials(trial_id, client_id, suggestions_per_request, study_name):
    suggest_response = vizier_client.suggest_trials(
        {
            "parent": study_name,
            "suggestion_count": suggestions_per_request,
            "client_id": client_id,
        }
    )

    for suggested_trial in suggest_response.result().trials:
        trial_id = suggested_trial.name.split("/")[-1]
        trial = vizier_client.get_trial({"name": suggested_trial.name})

        if trial.state in ["COMPLETED", "INFEASIBLE"]:
            continue
        
        parameters = trial_params_to_dict_with_rounded_ints(trial.parameters) 
        
        tries = 0
        while tries < 3:
            tries +=1
            try:
                vizier_client.add_trial_measurement(
                        {
                            "trial_name": suggested_trial.name,
    
                            "measurement": {
                                "metrics": create_metric(suggested_trial.name, parameters, data, labels)  #[6]
                            },
                        }
                    )
            except Exception as e:
                print("Try number", tries, "failed", e)
                time.sleep(2**tries)#exp backoff
                
        response = vizier_client.complete_trial(
                {"name": suggested_trial.name, "trial_infeasible": False} )
        
        
    return trial_id

Because Vizier stores trials in a database, you may want to distinguish multiple clients interacting with it.

In [18]:
client_id = "client1"  

Beware! Like all ML training, this can get expensive, so limit the number of suggestions from Vizier per request, and the total number of trials, as set in the variables below.

In [19]:
suggestions_per_request = 1
trials_to_do = 2

In [19]:
print("client_id:", client_id, ", suggestion_count_per_request:", suggestions_per_request, ", trials_to_do:", trials_to_do)
print("Before running trials", round(time()-start,1), "seconds elapsed")

client_id: client1 , suggestion_count_per_request: 1 , trials_to_do: 2


Outer loop: Get suggestions from Vizier repeatedly until we have enough trials. 

In [21]:
trial_id = 0
#[4]
# trial_id can be string, parsed from the suggested_trial.name, or else an int returned from do_trials()
while int(trial_id) < trials_to_do:
    trial_id = do_trials(trial_id, client_id,suggestions_per_request, study.name) 

25.4 seconds elapsed; Trial #1: {'alpha': 1.5, 'eta': 0.5, 'max_depth': 6, 'min_child_weight': 1.5}
balanced_accuracy 0.9268996320227156
464.3 seconds elapsed; Trial #1: {'alpha': 1.5, 'eta': 0.5, 'max_depth': 6, 'min_child_weight': 1.5}
balanced_accuracy 0.9294196162867241
878.8 seconds elapsed; Trial #1: {'alpha': 1.5, 'eta': 0.5, 'max_depth': 6, 'min_child_weight': 1.5}
balanced_accuracy 0.9269447325957195
1298.6 seconds elapsed; Trial #2: {'alpha': 2.135904004004592, 'eta': 0.4630913399905392, 'max_depth': 8, 'min_child_weight': 2.1357746227218115}
balanced_accuracy 0.9314622047207861
1846.5 seconds elapsed; Trial #2: {'alpha': 2.135904004004592, 'eta': 0.4630913399905392, 'max_depth': 8, 'min_child_weight': 2.1357746227218115}
balanced_accuracy 0.9366840852473732
2366.5 seconds elapsed; Trial #2: {'alpha': 2.135904004004592, 'eta': 0.4630913399905392, 'max_depth': 8, 'min_child_weight': 2.1357746227218115}
balanced_accuracy 0.9365132817389556


Here we ask Vizier for the best trials (multiple, because it can return multiple best trials, along a Pareto frontier).

In [22]:
#[9]
optimal_trials = vizier_client.list_optimal_trials({"parent": study.name})

print(f"optimal_trials: {optimal_trials}")

optimal_trials: optimal_trials {
  name: "projects/401966870909/locations/us-west1/studies/1816083338154/trials/2"
  state: SUCCEEDED
  parameters {
    parameter_id: "alpha"
    value {
      number_value: 2.135904004004592
    }
  }
  parameters {
    parameter_id: "eta"
    value {
      number_value: 0.4630913399905392
    }
  }
  parameters {
    parameter_id: "max_depth"
    value {
      number_value: 8.0
    }
  }
  parameters {
    parameter_id: "min_child_weight"
    value {
      number_value: 2.1357746227218115
    }
  }
  final_measurement {
    metrics {
      metric_id: "balanced_accuracy"
      value: 0.9365132817389556
    }
  }
  measurements {
    metrics {
      metric_id: "balanced_accuracy"
      value: 0.9365132817389556
    }
  }
  start_time {
    seconds: 1640522438
  }
  end_time {
    seconds: 1640524015
  }
  client_id: "client1"
}



In [23]:
#vizier_client.delete_study({"name": study.name})