# Lab 4: Machine Learning with Vertex AI

Authors:
* Fabian Hirschmann
* Wissem Khlifi

Welcome back 👋😍. During this lab, you will train a machine learning model on the data set you already know. Then, we will create an MLOps pipeline to automate this process.

In a Jupyter Notebook, you can press `Shift + Return` to execute the current code chunk and jump to the next one.

First, let's set a few variables and perform some Python imports:

In [108]:
import random
random.seed(1337)
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import pandas as pd
import tensorflow as tf

from google.cloud import aiplatform, bigquery

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


project = !gcloud config get-value project
PROJECT_ID = project[0]

REGION = "us-central1"
BQ_DATASET = "ml_datasets"
BQ_TABLE = "ulb_fraud_detection_dataproc"
BQ_SOURCE = f"{PROJECT_ID}.{BQ_DATASET}.{BQ_TABLE}"

If you get warnings about GPUs not being available -- that's fine.

The requirements were automatically installed by the `bootstrap_workbench.sh` script we specified when we created this Workbench instance. However, if the above chunk fails due to import errors, uncomment the next chunk, run it, and then restart the kernel (`Kernel > Restart Kernel` in the menu).

In [17]:
# !pip install -r requirements.txt

The next chunk prints the fully qualified name of the BigQuery source table we will be working with.

In [18]:
BQ_SOURCE

'astute-ace-336608.ml_datasets.ulb_fraud_detection_dataproc'

Next, initialize the Vertex AI SDK and a BigQuery client, then load the table into a pandas DataFrame.

In [109]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"{PROJECT_ID}-bucket")
bq = bigquery.Client(project=PROJECT_ID, location=REGION)

In [20]:
data = bq.query(f"SELECT * FROM `{BQ_SOURCE}`").to_dataframe()

Let's have a look at the data set in more detail.

In [21]:
data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,282.0,-0.356466,0.725418,1.971749,0.831343,0.369681,-0.107776,0.751610,-0.120166,-0.420675,...,0.020804,0.424312,-0.015989,0.466754,-0.809962,0.657334,-0.043150,-0.046401,0.00,0
1,14332.0,1.071950,0.340678,1.784068,2.846396,-0.751538,0.403028,-0.734920,0.205807,1.092726,...,-0.169632,-0.113604,0.067643,0.468669,0.223541,-0.112355,0.014015,0.021504,0.00,0
2,32799.0,1.153477,-0.047859,1.358363,1.480620,-1.222598,-0.481690,-0.654461,0.128115,0.907095,...,0.125514,0.480049,-0.025964,0.701843,0.417245,-0.257691,0.060115,0.035332,0.00,0
3,35799.0,-0.769798,0.622325,0.242491,-0.586652,0.527819,-0.104512,0.209909,0.669861,-0.304509,...,0.152738,0.255654,-0.130237,-0.660934,-0.493374,0.331855,-0.011101,0.049089,0.00,0
4,36419.0,1.047960,0.145048,1.624573,2.932652,-0.726574,0.690451,-0.627288,0.278709,0.318434,...,0.078499,0.658942,-0.067810,0.476882,0.526830,0.219902,0.070627,0.028488,0.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,154599.0,0.667714,3.041502,-5.845112,5.967587,0.213863,-1.462923,-2.688761,0.677764,-3.447596,...,0.329760,-0.941383,-0.006075,-0.958925,0.239298,-0.067356,0.821048,0.426175,6.74,1
284803,90676.0,-2.405580,3.738235,-2.317843,1.367442,0.394001,1.919938,-3.106942,-10.764403,3.353525,...,10.005998,-2.454964,1.684957,0.118263,-1.531380,-0.695308,-0.152502,-0.138866,6.99,1
284804,34634.0,0.333499,1.699873,-2.596561,3.643945,-0.585068,-0.654659,-2.275789,0.675229,-2.042416,...,0.469212,-0.144363,-0.317981,-0.769644,0.807855,0.228164,0.551002,0.305473,18.96,1
284805,96135.0,-1.952933,3.541385,-1.310561,5.955664,-1.003993,0.983049,-4.587235,-4.892184,-2.516752,...,-1.998091,1.133706,-0.041461,-0.215379,-0.865599,0.212545,0.532897,0.357892,18.96,1


## Train model on Vertex AI Workbench (JupyterLab)

In [22]:
target = data["Class"].astype(int)
data.drop("Class", axis=1, inplace=True)

In [23]:
target.value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

In [24]:
# random.seed() does not affect scikit-learn, so fix the split with random_state
x_train, x_test, y_train, y_test = train_test_split(data, target, train_size=0.80, random_state=1337)

In [25]:
scaler = StandardScaler()
x_train = pd.DataFrame(scaler.fit_transform(x_train), index=x_train.index, columns=x_train.columns)
x_test = pd.DataFrame(scaler.transform(x_test), index=x_test.index, columns=x_test.columns)

In [26]:
BUCKET = f"{PROJECT_ID}-bucket"

In [27]:
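# Sanity check: read back the CSV files that are written to Cloud Storage in the
# "Train and serve model on Vertex AI" section (they must already exist)
x_train_csv = pd.read_csv(f"gs://{BUCKET}/vertex-data/x_train.csv")
y_train_csv = pd.read_csv(f"gs://{BUCKET}/vertex-data/y_train.csv")["Class"]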

In [28]:
model = Sequential([
    Input(shape=(x_train.shape[1],)),
    Dense(16, activation='relu'),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')
])

In [29]:
model.compile(optimizer='adam', 
              loss='binary_crossentropy', 
              metrics=['auc'])

In [None]:
%%time
history = model.fit(x_train.values, y_train.values, 
                    epochs=4, 
                    batch_size=32, 
                    validation_split=0.2, 
                    verbose=1)

In [33]:
loss, auc = model.evaluate(x_test, y_test, verbose=0)

In [35]:
preds = model.predict(x_test)

[1m1781/1781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step


The next chunk evaluates a binary classifier using the area under the ROC curve. The ROC curve plots the true positive rate (TPR) vs. the false positive rate (FPR) at various thresholds. The AUC (area under the curve) is a single value between 0 and 1 that summarizes the model's performance, with higher values indicating better class separation.

In [45]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, preds)
metrics.auc(fpr, tpr)

0.9587367603907067

That's not bad!
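To build intuition for what that number means, here is a minimal pure-Python sketch that computes the ROC AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. The function name `pairwise_auc` and the toy labels and scores are made up for illustration; scikit-learn arrives at the same quantity differently, by integrating the ROC curve.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: two legitimate (0) and two fraudulent (1) transactions
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four pairs are ranked correctly
```

The double loop is quadratic in the number of examples, so it is only meant for small illustrations like this one, not for the full test set.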

What do you think about the performance of the model? Can you improve it?

## Train and serve model on Vertex AI

Custom training jobs (`CustomJob` resources in the Vertex AI API) are the basic way to run your custom machine learning (ML) training code in Vertex AI. In this lab, we will use a `CustomContainerTrainingJob`, which runs a `CustomJob` and registers our model in the Vertex AI Model Registry. From the registry, a model can be deployed to an endpoint for online prediction or used for batch prediction.

Vertex AI offers [prebuilt containers](https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers) to serve predictions and explanations from models trained using the following machine learning (ML) frameworks:

* TensorFlow
* PyTorch
* XGBoost
* scikit-learn

To use one of these prebuilt containers, you must save your model as one or more model artifacts that comply with the requirements of the prebuilt container. These requirements apply whether or not your model artifacts are created on Vertex AI. In our case, this means we will upload the serialized model with file name `model.keras` to a location in Cloud Storage specified by the Vertex AI infrastructure (`AIP_MODEL_DIR`).
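As a rough sketch of what that means in practice, the snippet below splits a `gs://` URI into a bucket name and an object path and derives the object path for `model.keras`. The helper names `split_gcs_uri` and `model_blob_path` and the example URI are hypothetical. The assumption that `AIP_MODEL_DIR` is a `gs://` URI that may end in a trailing slash should be checked against the environment variables printed by your own training job.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/some/path/' into ('bucket', 'some/path')."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    bucket, _, path = uri[len(prefix):].partition("/")
    return bucket, path.strip("/")


def model_blob_path(model_dir, filename="model.keras"):
    """Return (bucket, object path) for a model file stored under model_dir."""
    bucket, directory = split_gcs_uri(model_dir)
    blob = f"{directory}/{filename}" if directory else filename
    return bucket, blob


# Hypothetical value; the real one is injected by Vertex AI at run time
example_dir = "gs://my-project-bucket/some-training-run/model/"
print(model_blob_path(example_dir))
# ('my-project-bucket', 'some-training-run/model/model.keras')
```

Stripping the slashes before joining means the result is the same whether or not the directory ends in a trailing slash.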

To train the model on Vertex AI, let's first write our preprocessed data to Cloud Storage. There are a variety of ways to load data in Vertex AI training jobs. We opt for the simplest one and just upload it to Cloud Storage. For large data sets, however, you can write TFRecord files and stream them into your training job through a Cloud Storage FUSE mount.

In [63]:
x_train.to_csv(f"gs://{PROJECT_ID}-bucket/vertex-data/x_train.csv", index=False)
y_train.to_frame().to_csv(f"gs://{PROJECT_ID}-bucket/vertex-data/y_train.csv", index=False)

In [114]:
dataset = aiplatform.TabularDataset.create(
    display_name="Bootkon Dataset",
    gcs_source=f"gs://{PROJECT_ID}-bucket/vertex-data/x_train.csv"
)

Creating TabularDataset
Create TabularDataset backing LRO: projects/888342260584/locations/us-central1/datasets/2745650409103163392/operations/9156516129248641024
TabularDataset created. Resource name: projects/888342260584/locations/us-central1/datasets/2745650409103163392
To use this TabularDataset in another session:
ds = aiplatform.TabularDataset('projects/888342260584/locations/us-central1/datasets/2745650409103163392')


In [99]:
!mkdir -p src

In [121]:
%%writefile src/train.py
#!/usr/bin/env python
# Train a simple neural network classifier on Vertex AI

import os
from pprint import pprint
pprint(dict(os.environ))
import random
random.seed(1337)

import pandas as pd

from google.cloud import aiplatform, storage

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

BUCKET = os.environ["AIP_MODEL_DIR"].split("/")[2]

aiplatform.init(project=os.environ["CLOUD_ML_PROJECT_ID"],
                location=os.environ["CLOUD_ML_REGION"],
                staging_bucket=BUCKET)

# Load data
x_train = pd.read_csv(f"gs://{BUCKET}/vertex-data/x_train.csv")
y_train = pd.read_csv(f"gs://{BUCKET}/vertex-data/y_train.csv")["Class"]

# Train model
model = Sequential([
    Input(shape=(x_train.shape[1],)),
    Dense(16, activation='relu'),
    Dense(8, activation='relu'),  # Hidden layer with 8 neurons
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])

model.compile(optimizer='adam', 
              loss='binary_crossentropy', 
              metrics=['auc'])

history = model.fit(x_train.values, y_train.values, 
                    epochs=4, 
                    batch_size=32, 
                    validation_split=0.2, 
                    verbose=1)

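# Save the model locally, then upload it to the directory Vertex AI expects
model.save("model.keras")
client = storage.Client(project=os.environ["CLOUD_ML_PROJECT_ID"])
bucket = client.get_bucket(BUCKET)
# Object path of the model directory inside the bucket, without a trailing slash
model_dir = os.environ["AIP_MODEL_DIR"].removeprefix(f"gs://{BUCKET}/").rstrip("/")
blob = bucket.blob(f"{model_dir}/model.keras")
blob.upload_from_filename("model.keras")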

Overwriting src/train.py


In [122]:
%%writefile src/serve.py

print("foo")

Overwriting src/serve.py


In [123]:
%%writefile src/Dockerfile
FROM us-docker.pkg.dev/deeplearning-platform-release/gcr.io/tf2-cpu.2-17.py310

WORKDIR /root

COPY train.py /root/train.py
COPY serve.py /root/serve.py

EXPOSE 8080

Overwriting src/Dockerfile


In [124]:
IMAGE_URI = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/bootkon/bootkon:latest"

In [125]:
!gcloud artifacts repositories create bootkon --repository-format=docker --location={REGION} --description="Bootkon repository"

ERROR: (gcloud.artifacts.repositories.create) ALREADY_EXISTS: the repository already exists


In [126]:
%%time
!cd src && gcloud builds submit --region={REGION} --tag={IMAGE_URI} --timeout=1h

Creating temporary archive of 3 file(s) totalling 1.7 KiB before compression.
Uploading tarball of [.] to [gs://astute-ace-336608_cloudbuild/source/1737449826.339065-ace128e5b57d40a5ac858f5b98afb0bf.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/astute-ace-336608/locations/us-central1/builds/13bcf128-7726-4710-b8c8-4878e9990c17].
Logs are available at [ https://console.cloud.google.com/cloud-build/builds;region=us-central1/13bcf128-7726-4710-b8c8-4878e9990c17?project=888342260584 ].
Waiting for build to complete. Polling interval: 1 second(s).
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "13bcf128-7726-4710-b8c8-4878e9990c17"

FETCHSOURCE
Fetching storage object: gs://astute-ace-336608_cloudbuild/source/1737449826.339065-ace128e5b57d40a5ac858f5b98afb0bf.tgz#1737449826634993
Copying gs://astute-ace-336608_cloudbuild/source/1737449826.339065-ace128e5b57d40a5ac85

In [127]:
run_id = 1 if "run_id" not in vars() else run_id + 1

job = aiplatform.CustomContainerTrainingJob(
    display_name=f"bootkon-run-{run_id}",
    container_uri=IMAGE_URI,
    command=["python", "train.py"],
    model_serving_container_image_uri=IMAGE_URI,
    model_serving_container_command=["python", "serve.py"]
)

Let's run the training job! It will take around 6 minutes, but the provisioning of the job may take longer if there are a lot of people requesting resources at the moment.

In [None]:
%%time
vertex_model = job.run(
    machine_type="n2-standard-4",
    replica_count=1,
    dataset=dataset
)

Training Output directory:
gs://astute-ace-336608-bucket/aiplatform-custom-training-2025-01-21-09:07:39.362 
No dataset split provided. The service will use a default split.
View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/1337978623050645504?project=888342260584
CustomContainerTrainingJob projects/888342260584/locations/us-central1/trainingPipelines/1337978623050645504 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomContainerTrainingJob projects/888342260584/locations/us-central1/trainingPipelines/1337978623050645504 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomContainerTrainingJob projects/888342260584/locations/us-central1/trainingPipelines/1337978623050645504 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomContainerTrainingJob projects/888342260584/locations/us-central1/trainingPipelines/1337978623050645504 current state:
PipelineState.PIPELINE_ST

While it is running, go to [`Training Jobs` in the Vertex AI Console](https://console.cloud.google.com/vertex-ai/training/training-pipelines) and click on the training job whose status is **Training**. The training job builds on a more general concept called `CustomJob` and adds functionality such as automatic model upload to Cloud Storage and registration of the model in the Model Registry. Hence, to see details about the running job, click on the **Custom Job** and then **View Logs**.

Once the model has been trained, navigate to the [`Model Registry` in Vertex AI](https://console.cloud.google.com/vertex-ai/models). Click on `bootkon-model`. Can you find your newly created model artifact? Open the `VERSION DETAILS` tab and try to find your `model.keras` artifact on Cloud Storage.

Next, create an endpoint to perform online predictions. The creation will take around 6 minutes.

In [None]:
endpoint = vertex_model.deploy(machine_type="n2-standard-2")

Go to [`Endpoints` in Vertex AI](https://console.cloud.google.com/vertex-ai/endpoints). Notice how the above command first creates an `Endpoint` and then deploys our `Model` to this endpoint.

Let's make a prediction:

Interested in how to make a prediction using the standard REST API? Go back to [`Endpoints` in Vertex AI](https://console.cloud.google.com/vertex-ai/endpoints) and click on `SAMPLE REQUEST` next to your endpoint.
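As a sketch of what such a request looks like, the snippet below builds the URL and JSON body for the `predict` method of the Vertex AI REST API using only the standard library, without sending anything. The project ID, endpoint ID, and feature values are placeholders, and the shape of each instance (here a flat list of scaled feature values) is an assumption that depends on how your serving container parses its input.

```python
import json


def build_predict_request(project_id, region, endpoint_id, instances):
    """Return the URL and JSON body for an online prediction request."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{region}/"
        f"endpoints/{endpoint_id}:predict"
    )
    # The REST API expects the inputs under the top-level "instances" key
    body = json.dumps({"instances": instances})
    return url, body


# Placeholder identifiers and a shortened, made-up feature vector
url, body = build_predict_request(
    project_id="my-project",
    region="us-central1",
    endpoint_id="1234567890",
    instances=[[0.0, -0.36, 0.73]],
)
print(url)
print(body)
```

To actually send the request, you would POST `body` to `url` with a `Content-Type: application/json` header and an `Authorization: Bearer` header carrying an access token, for example one obtained with `gcloud auth print-access-token`.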

## Train and serve model on Vertex AI through Vertex AI Pipelines