# Lab 4: Machine Learning with Vertex AI

Author: 
* Fabian Hirschmann <<fhirschmann@google.com>>

Welcome back 👋😍. During this lab, you will train a machine learning model on the data set you already know. We will deploy it to Vertex AI and finally construct a machine learning pipeline to perform the training process automatically.

We will do this at three levels of maturity:

1. Deploy a locally trained model to Vertex AI using prebuilt containers
2. Train and deploy a model on Vertex AI using custom containers
3. Use Vertex AI Pipelines to train and deploy the model

In this Jupyter Notebook, you can press `Shift + Return` to execute the current code chunk and jump to the next one.

## Step 1: Import Dependencies and Set Environment Variables

Before we begin, let's import the necessary Python libraries and set a few environment variables for our project.

In [1]:
import random
random.seed(1337)
import os
import logging

import pandas as pd
from google.cloud import aiplatform, bigquery
from sklearn.metrics import roc_curve, auc as auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

import joblib

import google.cloud.logging
google.cloud.logging.Client().setup_logging(log_level=logging.WARNING)

project = !gcloud config get-value project
PROJECT_ID = project[0]

REGION = "us-central1"
BQ_DATASET = "ml_datasets"
BQ_TABLE = "ulb_fraud_detection_dataproc"
BQ_SOURCE = f"{PROJECT_ID}.{BQ_DATASET}.{BQ_TABLE}"

If you get warnings about GPUs not being available -- that's fine.

The requirements were automatically installed by the `bootstrap_workbench.sh` script we specified when we created this Workbench instance. However, if the cell above fails due to import errors, uncomment the next chunk, run it, and then restart the kernel (`Kernel > Restart Kernel` in the menu).

In [2]:
# !pip install -r requirements.txt

## Step 2: Create dataset for ML

We initialize the Vertex AI SDK and a BigQuery client to interact with Google Cloud services.

In [3]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"{PROJECT_ID}-bucket")
bq = bigquery.Client(project=PROJECT_ID, location=REGION)

The BigQuery table we'll be working with is as follows:

In [4]:
BQ_SOURCE

'astute-ace-336608.ml_datasets.ulb_fraud_detection_dataproc'

We execute a query to fetch the dataset from BigQuery and store it in a Pandas DataFrame.

In [5]:
data = bq.query(f"SELECT * FROM `{BQ_SOURCE}`").to_dataframe()

Let's have a look at the data set in more detail.

In [6]:
data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,282.0,-0.356466,0.725418,1.971749,0.831343,0.369681,-0.107776,0.751610,-0.120166,-0.420675,...,0.020804,0.424312,-0.015989,0.466754,-0.809962,0.657334,-0.043150,-0.046401,0.00,0
1,14332.0,1.071950,0.340678,1.784068,2.846396,-0.751538,0.403028,-0.734920,0.205807,1.092726,...,-0.169632,-0.113604,0.067643,0.468669,0.223541,-0.112355,0.014015,0.021504,0.00,0
2,32799.0,1.153477,-0.047859,1.358363,1.480620,-1.222598,-0.481690,-0.654461,0.128115,0.907095,...,0.125514,0.480049,-0.025964,0.701843,0.417245,-0.257691,0.060115,0.035332,0.00,0
3,35799.0,-0.769798,0.622325,0.242491,-0.586652,0.527819,-0.104512,0.209909,0.669861,-0.304509,...,0.152738,0.255654,-0.130237,-0.660934,-0.493374,0.331855,-0.011101,0.049089,0.00,0
4,36419.0,1.047960,0.145048,1.624573,2.932652,-0.726574,0.690451,-0.627288,0.278709,0.318434,...,0.078499,0.658942,-0.067810,0.476882,0.526830,0.219902,0.070627,0.028488,0.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,154599.0,0.667714,3.041502,-5.845112,5.967587,0.213863,-1.462923,-2.688761,0.677764,-3.447596,...,0.329760,-0.941383,-0.006075,-0.958925,0.239298,-0.067356,0.821048,0.426175,6.74,1
284803,90676.0,-2.405580,3.738235,-2.317843,1.367442,0.394001,1.919938,-3.106942,-10.764403,3.353525,...,10.005998,-2.454964,1.684957,0.118263,-1.531380,-0.695308,-0.152502,-0.138866,6.99,1
284804,34634.0,0.333499,1.699873,-2.596561,3.643945,-0.585068,-0.654659,-2.275789,0.675229,-2.042416,...,0.469212,-0.144363,-0.317981,-0.769644,0.807855,0.228164,0.551002,0.305473,18.96,1
284805,96135.0,-1.952933,3.541385,-1.310561,5.955664,-1.003993,0.983049,-4.587235,-4.892184,-2.516752,...,-1.998091,1.133706,-0.041461,-0.215379,-0.865599,0.212545,0.532897,0.357892,18.96,1


We separate the target variable (`Class`), which we want to predict, from the features (all other columns). The `Class` column indicates whether a transaction is fraudulent (1) or legitimate (0).

In [7]:
target = data["Class"].astype(int)
data.drop("Class", axis=1, inplace=True)

Fraud detection datasets are typically highly imbalanced, meaning the majority of transactions are legitimate. We check the distribution of our classes.

In [8]:
target.value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

We split our dataset into two parts:

- Training set (80%): Used to train the machine learning model.
- Testing set (20%): Used to evaluate the performance of the trained model.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(data, target, train_size = 0.80)
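
Because fraud cases are so rare, a purely random split can leave the test set with noticeably more or fewer fraud rows than expected. One common remedy is a stratified split, which preserves the class ratio in both parts. The following is a minimal pure-Python sketch of the idea on toy labels; the function name `stratified_split` and the toy data are illustrative assumptions, not part of the lab. In scikit-learn, passing `stratify=target` to `train_test_split` has the same effect.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_size=0.8, seed=0):
    """Return (train_idx, test_idx) so that each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        cut = round(len(indices) * train_size)    # per-class cut point
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (assumption): 990 legitimate (0) and 10 fraudulent (1) transactions.
toy_labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(toy_labels)
fraud_in_train = sum(toy_labels[i] for i in train_idx)
fraud_in_test = sum(toy_labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), fraud_in_train, fraud_in_test)
```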

Let's also save the training set to Cloud Storage so that the custom training job can read it later.

In [10]:
X_train.to_csv(f"gs://{PROJECT_ID}-bucket/data/vertex/X_train.csv", index=False)
y_train.to_frame().to_csv(f"gs://{PROJECT_ID}-bucket/data/vertex/y_train.csv", index=False)

## Train a Random Forest classifier

We use a `RandomForestClassifier`, which is an ensemble learning method that creates multiple decision trees and aggregates their predictions. This helps improve accuracy and robustness.
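
To make the aggregation step concrete, here is a small sketch with made-up numbers. scikit-learn's random forest combines its trees by averaging their predicted class probabilities and then picking the most probable class; the per-tree probabilities below are invented for illustration, and `forest_predict` is a hypothetical helper.

```python
def forest_predict(tree_fraud_probs, threshold=0.5):
    """Average the per-tree fraud probabilities and apply a decision threshold."""
    mean_prob = sum(tree_fraud_probs) / len(tree_fraud_probs)
    return mean_prob, int(mean_prob > threshold)

# Invented fraud probabilities from five trees for a single transaction.
tree_fraud_probs = [0.9, 0.6, 0.2, 0.7, 0.6]
mean_prob, label = forest_predict(tree_fraud_probs)
print(round(mean_prob, 2), label)
```

A single tree that disagrees (0.2 here) is outvoted by the others, which is where the robustness of the ensemble comes from.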

In [11]:
model = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=8, verbose=1)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:   39.9s
[Parallel(n_jobs=8)]: Done  50 out of  50 | elapsed:   55.3s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done  50 out of  50 | elapsed:    0.1s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done  50 out of  50 | elapsed:    0.1s finished


We calculate the accuracy of the model, which measures the proportion of correctly classified instances.

For a highly imbalanced data set, the accuracy is often meaningless, because a simple classifier that always says ***not fraud*** will have an accuracy close to 1 already.
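
As a quick check of that claim, the class counts shown earlier (284315 legitimate, 492 fraudulent) let us compute the accuracy of a trivial classifier that labels every transaction as legitimate. The counts are for the full data set rather than the test split, so treat the result as an approximation.

```python
n_legit = 284315   # class 0 count from target.value_counts()
n_fraud = 492      # class 1 count from target.value_counts()

# A classifier that always answers "not fraud" is right on every legitimate row
# and wrong on every fraudulent one.
baseline_accuracy = n_legit / (n_legit + n_fraud)
baseline_recall = 0 / n_fraud  # it never catches a single fraud case

print(f"accuracy={baseline_accuracy:.4f}, fraud recall={baseline_recall:.1f}")
```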

In [12]:
accuracy_score(y_test, y_pred)

0.9996313331694814

We compute the ROC AUC (Receiver Operating Characteristic - Area Under the Curve) score. This metric evaluates the model's ability to distinguish between classes. A score closer to 1 indicates better performance.
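
One way to read the ROC AUC is as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The sketch below computes it that way on four invented examples; `pairwise_auc` is a hypothetical helper for illustration, and in the lab itself `roc_auc_score` does the work.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented labels and scores for illustration.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```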

In [13]:
roc_auc_score(y_test, y_pred_prob)

0.9484577010437408

We save the trained model to a local file so we can deploy it later.

In [14]:
joblib.dump(model, "model.joblib")

['model.joblib']

We copy the trained model to Cloud Storage, from where Vertex AI can import it.

In [15]:
!gsutil cp model.joblib gs://{PROJECT_ID}-bucket/model/

Copying file://model.joblib [Content-Type=application/octet-stream]...
/ [1 files][  1.3 MiB/  1.3 MiB]                                                
Operation completed over 1 objects/1.3 MiB.                                      


## Serve locally trained model on Vertex AI

The Vertex AI Model Registry is a centralized repository in Google Cloud's Vertex AI platform where machine learning (ML) models are stored, managed, and versioned. It allows data scientists and ML engineers to track different model versions, store metadata, and deploy models seamlessly to Vertex AI endpoints for inference.

Key features of the Model Registry include:

* Model Versioning: Track multiple versions of a model.
* Metadata Management: Store details such as model parameters, training data, and performance metrics.
* Deployment & Serving: Deploy registered models to Vertex AI Endpoints, Batch Predictions, or export them for external use.
* Model Governance: Manage access control, approval workflows, and lineage tracking.
* Integration with Pipelines: Automate model registration via Vertex AI Pipelines.

We can register the model we just trained in this notebook as follows:

In [16]:
vertex_model_upload = aiplatform.Model.upload(
    display_name="bootkon-upload-model",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-5:latest",
    artifact_uri=f"gs://{PROJECT_ID}-bucket/model/",
    is_default_version=True,
    version_aliases=["v1"],
)

Creating Model
Create Model backing LRO: projects/888342260584/locations/us-central1/models/2743691629138280448/operations/9127100894670749696
Model created. Resource name: projects/888342260584/locations/us-central1/models/2743691629138280448@1
To use this Model in another session:
model = aiplatform.Model('projects/888342260584/locations/us-central1/models/2743691629138280448@1')


Once the model has been uploaded, navigate to the [`Model Registry` in Vertex AI](https://console.cloud.google.com/vertex-ai/models). Click on `bootkon-upload-model`. Can you find your newly created model version? Open the `VERSION DETAILS` tab and try to find your model artifact on Cloud Storage.

Let's deploy the model to an endpoint for online prediction.

In [17]:
endpoint_upload = aiplatform.Endpoint.create(display_name="bootkon-endpoint-upload")

Creating Endpoint
Create Endpoint backing LRO: projects/888342260584/locations/us-central1/endpoints/4438641579913117696/operations/8052147958612754432
Endpoint created. Resource name: projects/888342260584/locations/us-central1/endpoints/4438641579913117696
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/888342260584/locations/us-central1/endpoints/4438641579913117696')


The next code chunk takes around 10 minutes to complete. We don't want to wait for that, so we set `sync=False` and look at the result later.

In [18]:
vertex_model_upload.deploy(
    deployed_model_display_name="bootkon-model-upload",
    endpoint=endpoint_upload,
    machine_type="n2-standard-2",
    sync=False
)

Deploying model to Endpoint : projects/888342260584/locations/us-central1/endpoints/4438641579913117696


<google.cloud.aiplatform.models.Endpoint object at 0x7f722ebaf040> 
resource name: projects/888342260584/locations/us-central1/endpoints/4438641579913117696

Deploy Endpoint model backing LRO: projects/888342260584/locations/us-central1/endpoints/4438641579913117696/operations/3505482659805528064


The next chunk lists the currently deployed models. While the model is still deploying, it won't show up.

In [19]:
endpoint_upload.list_models()

[]

## Train and serve model using custom containers

In this section, we will train a `RandomForestClassifier` using **custom containers** on Vertex AI and deploy it for real-time predictions. Instead of using pre-built containers, we will package our training and prediction logic into Docker containers, allowing for **full control over dependencies, runtime environments, and scalability**. 

The process consists of two main steps:
1. **Model Training:** We will preprocess the dataset, train a model and save it as a serialized `joblib` file. The trained model will be uploaded to Cloud Storage for deployment.
2. **Model Serving:** Using a separate container, the stored model will be loaded from Cloud Storage, and an API will be exposed via Flask (or **FastAPI** in production) to handle inference requests.

By leveraging Vertex AI’s custom training and prediction services, we can achieve a **scalable, managed ML workflow** while keeping complete flexibility over the training and deployment pipeline.

We will create the following files:

- `train/Dockerfile`: Dockerfile for the training container
- `train/train.py`: Training script
- `predict/Dockerfile`: Dockerfile for the prediction container
- `predict/predict.py`: Prediction script

In [20]:
mkdir -p train predict



In [21]:
%%writefile train/Dockerfile
FROM python:3.10-slim

WORKDIR /app
COPY train.py /app/train.py

RUN pip install --no-cache-dir --quiet pandas scikit-learn==1.5.2 google-cloud-storage fsspec gcsfs

ENTRYPOINT ["python", "/app/train.py"]

Overwriting train/Dockerfile


In [22]:
%%writefile predict/Dockerfile
FROM python:3.10-slim

WORKDIR /app
COPY predict.py /app/predict.py

RUN pip install --no-cache-dir --quiet pandas scikit-learn==1.5.2 google-cloud-storage google-cloud-aiplatform fsspec gcsfs flask
EXPOSE 8080
ENTRYPOINT ["python", "/app/predict.py"]

Overwriting predict/Dockerfile


The `train.py` script trains a `RandomForestClassifier` using scikit-learn, saves it as a `joblib` file, and uploads it to Cloud Storage. It reads the training data (`X_train` and `y_train`) from CSV files provided as command-line arguments and retrieves the target storage directory from the `AIP_MODEL_DIR` environment variable. The trained model is stored in GCS for later deployment on Vertex AI.
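
The bucket and object path have to be extracted from `AIP_MODEL_DIR` before the upload. Here is a minimal sketch of that parsing, assuming the variable holds a `gs://bucket/path/` style URI; the helper name `split_gcs_uri` and the example URI are illustrative and not taken from the lab.

```python
import posixpath

def split_gcs_uri(uri, filename):
    """Split a gs:// directory URI into (bucket, object name) for a file stored inside it."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a Cloud Storage URI: {uri}")
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    # posixpath.join copes with a prefix with or without a trailing slash.
    return bucket, posixpath.join(prefix, filename)

# Hypothetical value; Vertex AI sets the real one inside the training container.
example_dir = "gs://my-project-bucket/aiplatform-custom-training/model/"
print(split_gcs_uri(example_dir, "model.joblib"))
print(split_gcs_uri(example_dir.rstrip("/"), "model.joblib"))
```

Both calls yield the same bucket and object name, so the upload does not depend on whether the directory URI ends with a slash.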


In [23]:
%%writefile train/train.py
import os
import sys

import joblib
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from google.cloud import storage

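# Paths to the training features and labels are passed as command-line arguments.
X_train = pd.read_csv(sys.argv[1])
# Select the label column as a 1-D Series, which is what scikit-learn expects for y.
y_train = pd.read_csv(sys.argv[2])["Class"]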

model = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=8, verbose=1)
model.fit(X_train, y_train)

joblib.dump(model, "model.joblib")
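# AIP_MODEL_DIR is set by Vertex AI and has the form gs://<bucket>/<path>/.
model_dir = os.environ["AIP_MODEL_DIR"].rstrip("/")
bucket_name, _, prefix = model_dir[len("gs://"):].partition("/")
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
# Build the object name so it is correct with or without a trailing slash.
blob = bucket.blob(f"{prefix}/model.joblib" if prefix else "model.joblib")
blob.upload_from_filename("model.joblib")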

Overwriting train/train.py


The `predict.py` script is a Flask-based prediction server designed for deployment on Vertex AI using custom containers. It retrieves the model artifacts from Cloud Storage using `prediction_utils.download_model_artifacts()`, loads the model with `joblib`, and exposes two API endpoints:

- **`/predict`** for inference  
- **`/health`** for monitoring the service status  

The script reads environment variables such as `AIP_STORAGE_URI` for downloading the model and `AIP_PREDICT_ROUTE` for defining the prediction route dynamically. 

⚠ **In production,** it is recommended to use **FastAPI** instead of Flask due to its superior performance, asynchronous capabilities, and built-in request validation.
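
The request and response bodies of the prediction route follow a simple JSON convention: the client sends a list of feature rows under `instances` and receives one result per row under `predictions`. The sketch below shows that contract in isolation, without Flask or a real model; `handle_predict` and `always_legit` are hypothetical stand-ins.

```python
import json

def handle_predict(raw_body, predict_fn):
    """Decode a prediction request, run the model function, and encode the response."""
    payload = json.loads(raw_body)
    instances = payload["instances"]          # one list of feature values per row
    predictions = [predict_fn(row) for row in instances]
    return json.dumps({"predictions": predictions})

def always_legit(row):
    """Stand-in for a trained model: labels every transaction as not fraud (0)."""
    return 0

# Invented feature rows for illustration.
request_body = json.dumps({"instances": [[0.1, 0.2, 0.3], [1.5, -0.7, 2.0]]})
response_body = handle_predict(request_body, always_legit)
print(response_body)
```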


In [24]:
%%writefile predict/predict.py
import os
import joblib

import flask
import numpy as np
from google.cloud.aiplatform.utils import prediction_utils

prediction_utils.download_model_artifacts(os.environ["AIP_STORAGE_URI"])
model = joblib.load("model.joblib")

app = flask.Flask(__name__)

@app.route(os.environ.get("AIP_PREDICT_ROUTE", "/predict"), methods=["POST"])
def predict():
    data = flask.request.get_json()
    inputs = np.array(data["instances"])
    predictions = model.predict(inputs).tolist()
    return flask.jsonify({"predictions": predictions})

@app.route(os.environ.get("AIP_HEALTH_ROUTE", "/health"), methods=["GET"])
def health_check():
    print("Received health check")
    return flask.jsonify({"status": "healthy"}), 200

    
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("AIP_HTTP_PORT", 8080)))

Overwriting predict/predict.py


In [25]:
!gcloud artifacts repositories create bootkon --repository-format=docker --location={REGION}

ERROR: (gcloud.artifacts.repositories.create) ALREADY_EXISTS: the repository already exists


In [26]:
TRAIN_IMAGE_URI=f"{REGION}-docker.pkg.dev/{PROJECT_ID}/bootkon/bootkon-train:latest"

In [27]:
PREDICT_IMAGE_URI=f"{REGION}-docker.pkg.dev/{PROJECT_ID}/bootkon/bootkon-predict:latest"

In [None]:
!cd train && gcloud builds submit --region={REGION} --tag={TRAIN_IMAGE_URI} --timeout=1h --quiet

Creating temporary archive of 2 file(s) totalling 809 bytes before compression.
Uploading tarball of [.] to [gs://astute-ace-336608_cloudbuild/source/1738672584.702657-9b8aa96a022548cfbd47c778f7bbf51b.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/astute-ace-336608/locations/us-central1/builds/ff552883-06d5-442f-8d3c-f8dc27292bec].
Logs are available at [ https://console.cloud.google.com/cloud-build/builds;region=us-central1/ff552883-06d5-442f-8d3c-f8dc27292bec?project=888342260584 ].
Waiting for build to complete. Polling interval: 1 second(s).
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "ff552883-06d5-442f-8d3c-f8dc27292bec"

FETCHSOURCE
Fetching storage object: gs://astute-ace-336608_cloudbuild/source/1738672584.702657-9b8aa96a022548cfbd47c778f7bbf51b.tgz#1738672585015672
Copying gs://astute-ace-336608_cloudbuild/source/1738672584.702657-9b8aa96a022548cfbd47c778f7bbf51b.tgz#1738672585015672...
/ [1 files][  677.0 B/  6

In [None]:
!cd predict && gcloud builds submit --region={REGION} --tag={PREDICT_IMAGE_URI} --timeout=1h --quiet

With both images built, we define a custom training job that runs the training container and registers the resulting model with the prediction container as its serving image.

In [None]:
job = aiplatform.CustomContainerTrainingJob(
    display_name = "bootkon-custom",
    container_uri = TRAIN_IMAGE_URI,
    model_serving_container_image_uri = PREDICT_IMAGE_URI
)

In [None]:
vertex_model_custom = job.run(
    args=[
        f"gs://{PROJECT_ID}-bucket/data/vertex/X_train.csv",
        f"gs://{PROJECT_ID}-bucket/data/vertex/y_train.csv",
    ]
)

In [None]:
endpoint_custom = aiplatform.Endpoint.create(display_name="bootkon-endpoint-custom")

In [None]:
vertex_model_custom.deploy(
    deployed_model_display_name="bootkon-model-custom",
    endpoint=endpoint_custom,
    machine_type="n2-standard-2",
    sync=False
)

## Train and deploy models using Vertex Pipelines

**Vertex AI Pipelines** is a managed orchestration service in Vertex AI that enables the automation of machine learning (ML) workflows. It is built on **Kubeflow Pipelines (KFP)** and integrates seamlessly with **Vertex AI**, allowing for end-to-end ML lifecycle management, from data preparation to model training, evaluation, deployment, and monitoring.

Key Features of Vertex AI Pipelines:

- **Fully Managed Orchestration**: Vertex AI Pipelines automates ML workflows without the need to manage Kubernetes clusters manually. Google Cloud handles scaling, logging, and monitoring.

- **Composable and Reusable Pipelines**: Pipelines are defined using Python-based Kubeflow Pipelines (KFP) SDK, allowing components to be modular, reusable, and easily shareable across different ML projects.

- **Integration with Vertex AI Services**: Pipelines integrate seamlessly with Vertex AI services, such as Custom Training, Hyperparameter Tuning, Feature Store, and Model Deployment, enabling a streamlined ML workflow.

- **Scalability and Parallel Execution**: Supports distributed execution, allowing multiple pipeline steps (e.g., data preprocessing, model training, and evaluation) to run in parallel, optimizing resource utilization.

- **Artifact and Metadata Tracking**: All pipeline runs, datasets, and models are automatically tracked in Vertex ML Metadata, enabling easy debugging, reproducibility, and model lineage tracking.


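Once the custom-container model has finished deploying, you can make predictions by using the `.predict` method of the endpoint instance. The next chunk predicts the first 2000 examples of our test set.

In [None]: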
response = endpoint_custom.predict(instances=X_test.head(2000).values.tolist())

Most of them are ***not fraud***.

In [None]:
response.predictions[:10]

But there are also a few fraud cases.

In [None]:
sum(response.predictions)

## Challenge lab (optional)

Can you make a prediction using the REST API? Go to the [Vertex Console](https://console.cloud.google.com/vertex-ai/online-prediction) and click on `SAMPLE REQUEST`. You can then open a Terminal (File -> New -> Terminal) and try it out.
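
If you want a starting point, the sketch below assembles the pieces of such a request without sending it. The URL pattern is the regional Vertex AI `predict` REST endpoint; the project, endpoint ID, and feature values are placeholders you would replace with your own, and `build_predict_request` is a hypothetical helper.

```python
import json

def build_predict_request(project_id, region, endpoint_id, instances):
    """Return the URL and JSON body for an online prediction REST call."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{region}/endpoints/{endpoint_id}:predict"
    )
    body = json.dumps({"instances": instances})
    return url, body

# Placeholder values for illustration only; a real row needs all feature columns.
url, body = build_predict_request("my-project", "us-central1", "1234567890", [[0.0, 1.0]])
print(url)
print(body)
```

The request additionally needs an `Authorization: Bearer` header carrying an access token, for example one obtained from `gcloud auth print-access-token`.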