# Lab 4: Machine Learning with Vertex AI

Authors:
* Fabian Hirschmann
* Wissem Khlifi

Welcome back 👋😍. During this lab, you will train a machine learning model on the data set you already know. Then, we will create an MLOps pipeline to automate this process.

In a Jupyter Notebook, you can press `Shift + Return` to execute the current code chunk and jump to the next one.

First, let's set a few variables and perform some Python imports:

In [22]:
import random
random.seed(1337)
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import pandas as pd
import tensorflow as tf

from google.cloud import aiplatform, bigquery


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import metrics
from sklearn.metrics import roc_curve, auc as auc_score

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


project = !gcloud config get-value project
PROJECT_ID = project[0]

REGION = "us-central1"
BQ_DATASET = "ml_datasets"
BQ_TABLE = "ulb_fraud_detection_dataproc"
BQ_SOURCE = f"{PROJECT_ID}.{BQ_DATASET}.{BQ_TABLE}"

In [2]:
tf.__version__

'2.15.0'

If you get warnings about GPUs not being available -- that's fine.

The requirements were automatically installed by the `bootstrap_workbench.sh` script we specified when we created this Workbench instance. However, if the above command fails due to import errors, uncomment the next chunk, run it, and then restart the kernel (`Kernel > Restart Kernel` in the menu).

In [3]:
# !pip install -r requirements.txt

The next chunk shows the fully qualified name of the BigQuery source table we will be working with.

In [4]:
BQ_SOURCE

'astute-ace-336608.ml_datasets.ulb_fraud_detection_dataproc'

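Let's have a look at the data. To do so, we first initialize the Vertex AI SDK and a BigQuery client and then load the table into a pandas DataFrame.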

In [5]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"{PROJECT_ID}-bucket")
bq = bigquery.Client(project=PROJECT_ID, location=REGION)

In [6]:
data = bq.query(f"SELECT * FROM `{BQ_SOURCE}`").to_dataframe()

Let's have a look at the data set in more detail.

In [7]:
data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,282.0,-0.356466,0.725418,1.971749,0.831343,0.369681,-0.107776,0.751610,-0.120166,-0.420675,...,0.020804,0.424312,-0.015989,0.466754,-0.809962,0.657334,-0.043150,-0.046401,0.00,0
1,14332.0,1.071950,0.340678,1.784068,2.846396,-0.751538,0.403028,-0.734920,0.205807,1.092726,...,-0.169632,-0.113604,0.067643,0.468669,0.223541,-0.112355,0.014015,0.021504,0.00,0
2,32799.0,1.153477,-0.047859,1.358363,1.480620,-1.222598,-0.481690,-0.654461,0.128115,0.907095,...,0.125514,0.480049,-0.025964,0.701843,0.417245,-0.257691,0.060115,0.035332,0.00,0
3,35799.0,-0.769798,0.622325,0.242491,-0.586652,0.527819,-0.104512,0.209909,0.669861,-0.304509,...,0.152738,0.255654,-0.130237,-0.660934,-0.493374,0.331855,-0.011101,0.049089,0.00,0
4,36419.0,1.047960,0.145048,1.624573,2.932652,-0.726574,0.690451,-0.627288,0.278709,0.318434,...,0.078499,0.658942,-0.067810,0.476882,0.526830,0.219902,0.070627,0.028488,0.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,154599.0,0.667714,3.041502,-5.845112,5.967587,0.213863,-1.462923,-2.688761,0.677764,-3.447596,...,0.329760,-0.941383,-0.006075,-0.958925,0.239298,-0.067356,0.821048,0.426175,6.74,1
284803,90676.0,-2.405580,3.738235,-2.317843,1.367442,0.394001,1.919938,-3.106942,-10.764403,3.353525,...,10.005998,-2.454964,1.684957,0.118263,-1.531380,-0.695308,-0.152502,-0.138866,6.99,1
284804,34634.0,0.333499,1.699873,-2.596561,3.643945,-0.585068,-0.654659,-2.275789,0.675229,-2.042416,...,0.469212,-0.144363,-0.317981,-0.769644,0.807855,0.228164,0.551002,0.305473,18.96,1
284805,96135.0,-1.952933,3.541385,-1.310561,5.955664,-1.003993,0.983049,-4.587235,-4.892184,-2.516752,...,-1.998091,1.133706,-0.041461,-0.215379,-0.865599,0.212545,0.532897,0.357892,18.96,1


## Train model on Vertex AI Workbench (JupyterLab)

In [8]:
target = data["Class"].astype(int)
data.drop("Class", axis=1, inplace=True)

In [9]:
target.value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

In [36]:
X_train, X_test, y_train, y_test = train_test_split(data, target, train_size = 0.80)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
import joblib

model = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=8, verbose=1)
model.fit(X_train, y_train)

# Predicted class labels (0 or 1) and predicted probability of the fraud class (1)
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.


In [None]:
accuracy_score(y_test, y_pred)
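
Accuracy needs some care on this data set. As a rough, standalone illustration (not part of the original notebook), the class counts printed by `target.value_counts()` imply that a hypothetical classifier which always predicts "not fraud" would already reach an accuracy of roughly 99.8% on the full data set, while detecting no fraud at all:

```python
# Class counts copied from the `target.value_counts()` output.
n_legit = 284315
n_fraud = 492

# Hypothetical baseline: always predict class 0 (not fraud).
baseline_accuracy = n_legit / (n_legit + n_fraud)

# The same baseline never flags a fraudulent transaction, so its recall is zero.
baseline_recall = 0 / n_fraud

print(f"baseline accuracy:        {baseline_accuracy:.4f}")
print(f"baseline recall on fraud: {baseline_recall:.4f}")
```

Metrics that focus on the fraud class, such as ROC AUC, precision, or recall, therefore tend to say more about a model here than accuracy alone.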

The next chunk evaluates a binary classifier using the area under the ROC curve. The ROC curve plots the true positive rate (TPR) vs. the false positive rate (FPR) at various thresholds. The AUC (area under the curve) is a single value between 0 and 1 that summarizes the model's performance, with higher values indicating better class separation.

In [None]:
roc_auc_score(y_test, y_pred_prob)
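
To build some intuition for what this number means, the sketch below computes the AUC from its probabilistic interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting half. The function name `pairwise_auc` and the toy data are made up for illustration; for real data, `roc_auc_score` is the better choice because this pairwise loop scales poorly.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count half
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): 3 of the 4 positive/negative pairs are ranked correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))
```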

In [None]:
joblib.dump(model, "model.joblib")

In [63]:
!gsutil cp model.joblib gs://{PROJECT_ID}-bucket/model/

Copying file://model.joblib [Content-Type=application/octet-stream]...
/ [1 files][250.3 KiB/250.3 KiB]                                                
Operation completed over 1 objects/250.3 KiB.                                    


That's not bad!

What do you think about the performance of the model? Can you improve it?

## Serve model on Vertex AI

In [66]:
vertex_model = aiplatform.Model.upload(
    display_name="bootkon-model",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-5:latest",
    artifact_uri=f"gs://{PROJECT_ID}-bucket/model/",
    is_default_version=True,
    version_aliases=["v1"],
)

Creating Model

Create Model backing LRO: projects/888342260584/locations/us-central1/models/1618073197272367104/operations/4948297002145284096

Model created. Resource name: projects/888342260584/locations/us-central1/models/1618073197272367104@1

To use this Model in another session:

model = aiplatform.Model('projects/888342260584/locations/us-central1/models/1618073197272367104@1')


In [67]:
endpoint = vertex_model.deploy(machine_type="n2-standard-2")

Creating Endpoint

Create Endpoint backing LRO: projects/888342260584/locations/us-central1/endpoints/2471870364519497728/operations/7919546856302968832

Endpoint created. Resource name: projects/888342260584/locations/us-central1/endpoints/2471870364519497728

To use this Endpoint in another session:

endpoint = aiplatform.Endpoint('projects/888342260584/locations/us-central1/endpoints/2471870364519497728')

Deploying model to Endpoint : projects/888342260584/locations/us-central1/endpoints/2471870364519497728

Deploy Endpoint model backing LRO: projects/888342260584/locations/us-central1/endpoints/2471870364519497728/operations/7254140011358978048

Endpoint model deployed. Resource name: projects/888342260584/locations/us-central1/endpoints/2471870364519497728


In [105]:
response = endpoint.predict(instances=X_test.head(2000).values.tolist())

In [106]:
response.predictions[:10]

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

In [107]:
sum(response.predictions)

2.0
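
The sum tells us how many of the 2000 instances the endpoint classified as fraud, but not whether those predictions were correct. A minimal sketch of how predictions could be compared against the true labels follows. The helper `confusion_counts` and the toy lists are hypothetical; with the real data you would pass something like `y_test.head(2000).tolist()` and `response.predictions` instead.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1          # fraud correctly flagged
        elif p == 1 and t == 0:
            fp += 1          # false alarm
        elif p == 0 and t == 1:
            fn += 1          # missed fraud
        else:
            tn += 1          # legitimate transaction correctly passed
    return tp, fp, fn, tn


# Toy labels and predictions (made up); predictions are floats like the endpoint output.
toy_true = [0, 0, 1, 1, 0, 1]
toy_pred = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]

tp, fp, fn, tn = confusion_counts(toy_true, toy_pred)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"tp={tp} fp={fp} fn={fn} tn={tn} precision={precision:.2f} recall={recall:.2f}")
```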

## Train and serve model on Vertex AI

Custom training jobs (`CustomJob` resources in the Vertex AI API) are the basic way to run your custom machine learning (ML) training code in Vertex AI. In this lab, we will use a `CustomContainerTrainingJob`, which runs a `CustomJob` and registers our model in the Vertex AI Model Registry. From the registry, a model can be deployed to an endpoint for online prediction or used for batch prediction.

Vertex AI offers [prebuilt containers](https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers) to serve predictions and explanations from models trained using the following machine learning (ML) frameworks:

* TensorFlow
* PyTorch
* XGBoost
* scikit-learn

To use one of these prebuilt containers, you must save your model as one or more model artifacts that comply with the requirements of the prebuilt container. These requirements apply whether or not your model artifacts are created on Vertex AI. In our case, this means we will upload the serialized model to the Cloud Storage location that Vertex AI passes to the training job in the `AIP_MODEL_DIR` environment variable.
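
For illustration, `AIP_MODEL_DIR` holds a Cloud Storage URI of the form `gs://<bucket>/<path>`. The sketch below shows one way a training script could split such a URI into bucket name and object prefix. The helper name `split_gcs_uri` and the example URI are made up, and the fallback value is only there so the snippet also runs outside a Vertex AI training job.

```python
import os


def split_gcs_uri(uri):
    """Split 'gs://bucket/some/path' into ('bucket', 'some/path')."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    bucket, _, path = uri[len(prefix):].partition("/")
    return bucket, path


# Hypothetical fallback URI; inside a training job Vertex AI sets AIP_MODEL_DIR itself.
model_dir = os.environ.get("AIP_MODEL_DIR", "gs://example-bucket/some-run/model/")
bucket, path = split_gcs_uri(model_dir)
print(f"bucket={bucket} path={path}")
```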

To train the model on Vertex AI, let's first write our preprocessed data to Cloud Storage. There are a variety of ways to load data into Vertex AI training jobs. We opt for the simplest one and just upload CSV files to Cloud Storage. For large data sets, however, you can write TFRecord files and stream them into your training job using a FUSE-mounted Cloud Storage directory.

In [56]:
X_train.to_csv(f"gs://{PROJECT_ID}-bucket/vertex-data/x_train.csv", index=False)
y_train.to_frame().to_csv(f"gs://{PROJECT_ID}-bucket/vertex-data/y_train.csv", index=False)

In [58]:
!mkdir -p src

In [68]:
%%writefile src/train.py
#!/usr/bin/env python
# Train a simple neural network classifier on Vertex AI

import os
from pprint import pprint
pprint(dict(os.environ))
import random
random.seed(1337)

import pandas as pd

from google.cloud import aiplatform, storage

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import metrics


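# AIP_MODEL_DIR has the form gs://<bucket>/<path>, so the third element of the split is the bucket name.
BUCKET = os.environ["AIP_MODEL_DIR"].split("/")[2]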

aiplatform.init(project=os.environ["CLOUD_ML_PROJECT_ID"],
                location=os.environ["CLOUD_ML_REGION"],
                staging_bucket=BUCKET)

# Load data
x_train = pd.read_csv(f"gs://{BUCKET}/vertex-data/x_train.csv")
y_train = pd.read_csv(f"gs://{BUCKET}/vertex-data/y_train.csv")["Class"]

# Train model
model = Sequential([
    Input(shape=(x_train.shape[1],)),
    Dense(16, activation='relu'),
    Dense(8, activation='relu'),  # Hidden layer with 8 neurons
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])

model.compile(optimizer='adam', 
              loss='binary_crossentropy', 
              metrics=[metrics.AUC()])

history = model.fit(x_train.values, y_train.values, 
                    epochs=4, 
                    batch_size=32, 
                    validation_split=0.2, 
                    verbose=1)

# Upload model to Cloud Storage
model.save(os.environ["AIP_MODEL_DIR"])

Overwriting src/train.py


In [71]:
# Increment the run counter each time this cell is executed so every job gets a unique name.
run_id = 1 if "run_id" not in vars() else run_id + 1

# NOTE: IMAGE_URI (the URI of the training container image) must be defined before running this cell.
job = aiplatform.CustomContainerTrainingJob(
    display_name=f"bootkon-run-{run_id}",
    container_uri=IMAGE_URI,
    command=["python", "train.py"],
    # Container image used to serve the trained model.
    model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-17:latest",
)

Let's run the training job! It will take around 6 minutes, but the provisioning of the job may take longer if there are a lot of people requesting resources at the moment.

While it is running, please go to the [`Training Jobs` in the Vertex AI Console](https://console.cloud.google.com/vertex-ai/training/training-pipelines) and click on the training job where you see **Status: Training**. The training job builds on a more general concept called `CustomJob` and adds functionality such as automatically uploading the model to Cloud Storage and registering it in the Model Registry. Hence, to see details about the running job, click on the **Custom Job** and then **View Logs**.

***
<font color="red">While you wait for the next code chunk to complete (10-20min), you can already start going through the notebook named `bootkon_vertex_pipelines.ipynb`. Come back here regularly and continue when it has finished.</font>
***

Once the model has been trained, navigate to the [`Model Registry` in Vertex AI](https://console.cloud.google.com/vertex-ai/models). Click on `bootkon-model`. Can you find your newly created model artifact? Open the `VERSION DETAILS` tab and try to find your model artifact on Cloud Storage.

We also deploy the model to an endpoint. Go to [`Online Predictions` in Vertex AI](https://console.cloud.google.com/vertex-ai/endpoints). Notice how the `deploy` call first creates an `Endpoint` and then deploys our `Model` to this endpoint.


In [72]:
vertex_model = job.run(
    machine_type="n2-standard-4",
    replica_count=1
)

endpoint = vertex_model.deploy(machine_type="n2-standard-2")

Training Output directory:
gs://astute-ace-336608-bucket/aiplatform-custom-training-2025-01-30-07:27:38.140 

View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/9160833430374580224?project=888342260584

CustomContainerTrainingJob projects/888342260584/locations/us-central1/trainingPipelines/9160833430374580224 current state:
PipelineState.PIPELINE_STATE_RUNNING

View backing custom job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/2871693989705154560?project=888342260584

CustomContainerTrainingJob run completed. Resource name: projects/888342260584/locations/us-central1/trainingPipelines/9160833430374580224

Model available at projects/888342260584/locations/us-central1/models/6459442796695650304


In [73]:
endpoint = vertex_model.deploy(machine_type="n2-standard-2")

Creating Endpoint

Create Endpoint backing LRO: projects/888342260584/locations/us-central1/endpoints/314646143009030144/operations/5900737954589966336

Endpoint created. Resource name: projects/888342260584/locations/us-central1/endpoints/314646143009030144

To use this Endpoint in another session:

endpoint = aiplatform.Endpoint('projects/888342260584/locations/us-central1/endpoints/314646143009030144')

Deploying model to Endpoint : projects/888342260584/locations/us-central1/endpoints/314646143009030144

Deploy Endpoint model backing LRO: projects/888342260584/locations/us-central1/endpoints/314646143009030144/operations/5885538305847590912


FailedPrecondition: 400 Model server exited unexpectedly. Model server logs can be found at https://console.cloud.google.com/logs/viewer?project=888342260584&resource=aiplatform.googleapis.com%2FEndpoint&advancedFilter=resource.type%3D%22aiplatform.googleapis.com%2FEndpoint%22%0Aresource.labels.endpoint_id%3D%22314646143009030144%22%0Aresource.labels.location%3D%22us-central1%22.

Let's make a prediction:

Interested in how to make a prediction using the standard REST API? Go back to [`Endpoints` in Vertex AI](https://console.cloud.google.com/vertex-ai/endpoints) and click on `SAMPLE REQUEST` next to your endpoint.
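
As a rough sketch of what such a request looks like, the snippet below only builds the URL and JSON body for the endpoint's `predict` method and does not send anything. The project, endpoint ID, and feature values are placeholders, and the helper name `build_predict_request` is made up. An actual call would additionally need an OAuth 2.0 access token in the `Authorization` header.

```python
import json


def build_predict_request(project, region, endpoint_id, instances):
    """Build the URL and JSON body for an online prediction request (nothing is sent)."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/endpoints/{endpoint_id}:predict"
    )
    # Each instance is one row of feature values, in the order the model expects.
    body = json.dumps({"instances": instances})
    return url, body


# Placeholder values for illustration only.
url, body = build_predict_request(
    project="my-project",
    region="us-central1",
    endpoint_id="1234567890",
    instances=[[0.0, 1.5, -0.3]],
)
print(url)
print(body)
```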

## Train and serve model on Vertex AI through Vertex AI Pipelines