# Lab 4: Machine Learning with Vertex AI

Author: 
* Fabian Hirschmann

Welcome back 👋😍. During this lab, you will train a machine learning model on the data set you already know. We will deploy it to Vertex AI and finally construct a machine learning pipeline to automate the training process.

In this Jupyter Notebook, you can press `Shift + Return` to execute the current code chunk and jump to the next one.

## Step 1: Import Dependencies and Set Environment Variables

Before we begin, let's import the necessary Python libraries and set a few environment variables for our project.

In [117]:
import random
random.seed(1337)
import os

import pandas as pd

from google.cloud import aiplatform, bigquery
from sklearn.metrics import roc_curve, auc as auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

import joblib


project = !gcloud config get-value project
PROJECT_ID = project[0]

REGION = "us-central1"
BQ_DATASET = "ml_datasets"
BQ_TABLE = "ulb_fraud_detection_dataproc"
BQ_SOURCE = f"{PROJECT_ID}.{BQ_DATASET}.{BQ_TABLE}"

If you get warnings about GPUs not being available -- that's fine.

The requirements were automatically installed by the `bootstrap_workbench.sh` script we specified when we created this Workbench instance. However, if the cell above fails with import errors, uncomment the next chunk, run it, and then restart the kernel (`Kernel > Restart Kernel` in the menu).

In [118]:
# !pip install -r requirements.txt

## Step 2: Create dataset for ML

We initialize the Vertex AI SDK (the `aiplatform` package) and a BigQuery client to interact with Google Cloud services.

In [119]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"{PROJECT_ID}-bucket")
bq = bigquery.Client(project=PROJECT_ID, location=REGION)

The BigQuery table we'll be working with is as follows:

In [120]:
BQ_SOURCE

'astute-ace-336608.ml_datasets.ulb_fraud_detection_dataproc'

We execute a query to fetch the dataset from BigQuery and store it in a Pandas DataFrame.

In [121]:
data = bq.query(f"SELECT * FROM `{BQ_SOURCE}`").to_dataframe()

Let's have a look at the data set in more detail.

In [7]:
data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,282.0,-0.356466,0.725418,1.971749,0.831343,0.369681,-0.107776,0.751610,-0.120166,-0.420675,...,0.020804,0.424312,-0.015989,0.466754,-0.809962,0.657334,-0.043150,-0.046401,0.00,0
1,14332.0,1.071950,0.340678,1.784068,2.846396,-0.751538,0.403028,-0.734920,0.205807,1.092726,...,-0.169632,-0.113604,0.067643,0.468669,0.223541,-0.112355,0.014015,0.021504,0.00,0
2,32799.0,1.153477,-0.047859,1.358363,1.480620,-1.222598,-0.481690,-0.654461,0.128115,0.907095,...,0.125514,0.480049,-0.025964,0.701843,0.417245,-0.257691,0.060115,0.035332,0.00,0
3,35799.0,-0.769798,0.622325,0.242491,-0.586652,0.527819,-0.104512,0.209909,0.669861,-0.304509,...,0.152738,0.255654,-0.130237,-0.660934,-0.493374,0.331855,-0.011101,0.049089,0.00,0
4,36419.0,1.047960,0.145048,1.624573,2.932652,-0.726574,0.690451,-0.627288,0.278709,0.318434,...,0.078499,0.658942,-0.067810,0.476882,0.526830,0.219902,0.070627,0.028488,0.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,154599.0,0.667714,3.041502,-5.845112,5.967587,0.213863,-1.462923,-2.688761,0.677764,-3.447596,...,0.329760,-0.941383,-0.006075,-0.958925,0.239298,-0.067356,0.821048,0.426175,6.74,1
284803,90676.0,-2.405580,3.738235,-2.317843,1.367442,0.394001,1.919938,-3.106942,-10.764403,3.353525,...,10.005998,-2.454964,1.684957,0.118263,-1.531380,-0.695308,-0.152502,-0.138866,6.99,1
284804,34634.0,0.333499,1.699873,-2.596561,3.643945,-0.585068,-0.654659,-2.275789,0.675229,-2.042416,...,0.469212,-0.144363,-0.317981,-0.769644,0.807855,0.228164,0.551002,0.305473,18.96,1
284805,96135.0,-1.952933,3.541385,-1.310561,5.955664,-1.003993,0.983049,-4.587235,-4.892184,-2.516752,...,-1.998091,1.133706,-0.041461,-0.215379,-0.865599,0.212545,0.532897,0.357892,18.96,1


We separate the target variable (`Class`), which we want to predict, from the features (all other columns). The `Class` column indicates whether a transaction is fraudulent (1) or legitimate (0).

In [8]:
target = data["Class"].astype(int)
data.drop("Class", axis=1, inplace=True)

Fraud detection datasets are typically highly imbalanced, meaning the majority of transactions are legitimate. We check the distribution of our classes.

In [9]:
target.value_counts()

Class
0    284315
1       492
Name: count, dtype: int64
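
To put the imbalance in numbers, the following minimal sketch computes the share of fraudulent transactions from the counts printed above. It uses plain Python so that it runs without the DataFrame; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
n_legit = 284315
n_fraud = 492

n_total = n_legit + n_fraud
fraud_rate = n_fraud / n_total

print(f"Total transactions: {n_total}")
print(f"Fraud share: {fraud_rate:.4%}")
print(f"Legitimate transactions per fraud case: {n_legit / n_fraud:.0f}")
```

Fewer than two in a thousand transactions are fraudulent, which is worth keeping in mind when we split the data and interpret metrics later on.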

We split our dataset into two parts:

- Training set (80%): Used to train the machine learning model.
- Testing set (20%): Used to evaluate the performance of the trained model.

In [36]:
X_train, X_test, y_train, y_test = train_test_split(data, target, train_size = 0.80)
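
Note that `random.seed(1337)` only seeds Python's built-in `random` module, so the split above may differ between runs. With so few fraud cases, a purely random split can also distribute them unevenly. `train_test_split` accepts `random_state=` and `stratify=` arguments for this purpose. The sketch below shows the idea behind stratification in plain Python; `stratified_split` is a hypothetical helper written for illustration, not part of scikit-learn, and the toy labels are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_size=0.8, seed=1337):
    """Return (train_idx, test_idx) with each class split at the same ratio.

    Hypothetical helper for illustration; scikit-learn offers the same idea
    via train_test_split(..., stratify=labels, random_state=seed).
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(round(len(indices) * train_size))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels)

train_fraud = sum(labels[i] for i in train_idx)
test_fraud = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), train_fraud, test_fraud)
```

Because each class is shuffled and cut separately, both parts keep the same fraud share as the full data set.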

Let's also save the training set to Cloud Storage.

In [122]:
X_train.to_csv(f"gs://{PROJECT_ID}-bucket/data/vertex/x_train.csv", index=False)
y_train.to_frame().to_csv(f"gs://{PROJECT_ID}-bucket/data/vertex/y_train.csv", index=False)

## Train a Random Forest classifier

We use a `RandomForestClassifier`, which is an ensemble learning method that creates multiple decision trees and aggregates their predictions. This helps improve accuracy and robustness.

In [108]:
model = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=8, verbose=1)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:   42.5s
[Parallel(n_jobs=8)]: Done  50 out of  50 | elapsed:   58.6s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done  50 out of  50 | elapsed:    0.1s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done  50 out of  50 | elapsed:    0.1s finished
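
To make the aggregation idea concrete: a random forest in scikit-learn averages the class probabilities predicted by its individual trees. The sketch below mimics this with three hand-written rules standing in for trees. The rules, feature names, and thresholds are invented for illustration; real trees are learned from the training data.

```python
# Three hypothetical "trees", each a simple rule on (amount, v1) that
# returns an estimated fraud probability.
def tree_a(amount, v1):
    return 0.9 if amount > 1000 else 0.1


def tree_b(amount, v1):
    return 0.8 if v1 < -2.0 else 0.2


def tree_c(amount, v1):
    return 0.7 if amount > 500 and v1 < 0 else 0.0


TREES = [tree_a, tree_b, tree_c]


def forest_predict_proba(amount, v1):
    """Average the per-tree fraud probabilities."""
    return sum(tree(amount, v1) for tree in TREES) / len(TREES)


def forest_predict(amount, v1, threshold=0.5):
    """Label a transaction as fraud (1) if the averaged probability exceeds the threshold."""
    return int(forest_predict_proba(amount, v1) > threshold)


suspicious = forest_predict_proba(amount=1500.0, v1=-3.0)
ordinary = forest_predict_proba(amount=20.0, v1=0.5)
print(suspicious, forest_predict(1500.0, -3.0))
print(ordinary, forest_predict(20.0, 0.5))
```

Averaging smooths out the mistakes of individual trees, which is why `predict_proba` returns graded scores rather than only 0 or 1.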


We calculate the accuracy of the model, which measures the proportion of correctly classified instances.

For a highly imbalanced data set, the accuracy is often meaningless, because a simple classifier that always says ***not fraud*** will have an accuracy close to 1 already.

In [109]:
accuracy_score(y_test, y_pred)

0.9994908886626171
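
To see why this number needs context, the following sketch computes the accuracy of a trivial classifier that labels every transaction as ***not fraud***. It uses the class counts of the full data set rather than the test split, so the result is an approximation of what such a baseline would score on `y_test`.

```python
# Class counts for the full data set, taken from value_counts() above.
n_legit, n_fraud = 284315, 492

# A trivial "classifier" that labels every transaction as legitimate (0).
true_negatives = n_legit   # legitimate transactions labelled correctly
false_negatives = n_fraud  # fraud cases that are missed
true_positives = 0         # no transaction is ever flagged as fraud

baseline_accuracy = true_negatives / (n_legit + n_fraud)
baseline_recall = true_positives / (true_positives + false_negatives)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall on fraud: {baseline_recall:.4f}")
```

The baseline already scores above 0.998 in accuracy while catching no fraud at all, so accuracy alone says little about how useful our model is.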

We compute the ROC AUC (Receiver Operating Characteristic - Area Under the Curve) score. This metric evaluates the model's ability to distinguish between classes. A score closer to 1 indicates better performance.

In [110]:
roc_auc_score(y_test, y_pred_prob)

0.942475910675364
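
One way to read this score: ROC AUC equals the probability that a randomly chosen fraudulent transaction receives a higher predicted score than a randomly chosen legitimate one. The sketch below computes it that way on a toy example. `pairwise_auc` is a hypothetical helper for illustration; `roc_auc_score` uses a more efficient method.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (fraud, legit) pairs that are ranked correctly.

    Ties count as half a correct pair. The nested loop is quadratic and
    meant for illustration only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted fraud probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
print(auc)
```

Of the four (fraud, legit) pairs in the toy data, three are ranked correctly, so the score is 0.75. Unlike accuracy, this measure does not depend on how rare the fraud class is.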

We save the trained model to a local file so we can deploy it later.

In [111]:
joblib.dump(model, "model.joblib")

['model.joblib']

We copy the model file to Cloud Storage, from where Vertex AI can load it when we register the model.

In [63]:
!gsutil cp model.joblib gs://{PROJECT_ID}-bucket/model/

Copying file://model.joblib [Content-Type=application/octet-stream]...
/ [1 files][250.3 KiB/250.3 KiB]                                                
Operation completed over 1 objects/250.3 KiB.                                    


## Serve model on Vertex AI

The Vertex AI Model Registry is a centralized repository in Google Cloud's Vertex AI platform where machine learning (ML) models are stored, managed, and versioned. It allows data scientists and ML engineers to track different model versions, store metadata, and deploy models seamlessly to Vertex AI endpoints for inference.

Key features of the Model Registry include:

* Model Versioning: Track multiple versions of a model.
* Metadata Management: Store details such as model parameters, training data, and performance metrics.
* Deployment & Serving: Deploy registered models to Vertex AI Endpoints, Batch Predictions, or export them for external use.
* Model Governance: Manage access control, approval workflows, and lineage tracking.
* Integration with Pipelines: Automate model registration via Vertex AI Pipelines.

We can register the model we just trained in this notebook as follows:

In [66]:
vertex_model = aiplatform.Model.upload(
    display_name="bootkon-model",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-5:latest",
    artifact_uri=f"gs://{PROJECT_ID}-bucket/model/",
    is_default_version=True,
    version_aliases=["v1"],
)

Creating Model
Create Model backing LRO: projects/888342260584/locations/us-central1/models/1618073197272367104/operations/4948297002145284096
Model created. Resource name: projects/888342260584/locations/us-central1/models/1618073197272367104@1
To use this Model in another session:
model = aiplatform.Model('projects/888342260584/locations/us-central1/models/1618073197272367104@1')


Once the model has been uploaded, navigate to the [`Model Registry` in Vertex AI](https://console.cloud.google.com/vertex-ai/models). Click on `bootkon-model`. Can you find your newly created model artifact? Open the `VERSION DETAILS` tab and try to find your model artifact on Cloud Storage.

Let's deploy the model to an endpoint for online prediction.
***
<font color="red">While you wait for the next code chunk to complete (10-20min), you can already start going through the notebook named `bootkon_vertex_part2.ipynb`. Come back here regularly and continue when it has finished.</font>
***

In [67]:
endpoint = vertex_model.deploy(machine_type="n2-standard-2")

Creating Endpoint
Create Endpoint backing LRO: projects/888342260584/locations/us-central1/endpoints/2471870364519497728/operations/7919546856302968832
Endpoint created. Resource name: projects/888342260584/locations/us-central1/endpoints/2471870364519497728
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/888342260584/locations/us-central1/endpoints/2471870364519497728')
Deploying model to Endpoint : projects/888342260584/locations/us-central1/endpoints/2471870364519497728
Deploy Endpoint model backing LRO: projects/888342260584/locations/us-central1/endpoints/2471870364519497728/operations/7254140011358978048
Endpoint model deployed. Resource name: projects/888342260584/locations/us-central1/endpoints/2471870364519497728


Once the endpoint has been deployed, navigate to [Online Predictions](https://cloud.google.com/vertex-ai/online-prediction) in the Vertex AI Console and click on your endpoint. You can see that 100% of the traffic goes to the model you just deployed (`traffic split`) and also some performance metrics. 
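
An endpoint's traffic split maps each deployed model's ID to the percentage of requests it receives, and the percentages must add up to 100. The sketch below simulates this routing locally in plain Python. The model ID and both helper functions are hypothetical and do not call Vertex AI.

```python
import random


def validate_traffic_split(traffic_split):
    """Check that percentages are non-negative integers summing to 100."""
    for model_id, percent in traffic_split.items():
        if not isinstance(percent, int) or percent < 0:
            raise ValueError(f"invalid percentage for {model_id}: {percent}")
    total = sum(traffic_split.values())
    if total != 100:
        raise ValueError(f"traffic split must sum to 100, got {total}")


def route_request(traffic_split, rng):
    """Pick a deployed model ID at random, weighted by its traffic share."""
    model_ids = list(traffic_split)
    weights = [traffic_split[m] for m in model_ids]
    return rng.choices(model_ids, weights=weights, k=1)[0]


# Hypothetical deployed model ID; real IDs are assigned by Vertex AI.
single_model = {"deployed-model-v1": 100}
validate_traffic_split(single_model)

rng = random.Random(1337)
routed = [route_request(single_model, rng) for _ in range(1000)]
print(set(routed))
```

With a single deployed model holding 100% of the traffic, every request is routed to it. Deploying a second model version to the same endpoint would let you shift a share of the traffic gradually.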

You can make predictions by using the `.predict` function of the endpoint instance. The next chunk predicts the first 2000 examples of our test set.

In [105]:
response = endpoint.predict(instances=X_test.head(2000).values.tolist())

Most of them are ***not fraud***.

In [106]:
response.predictions[:10]

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

But there are also a few fraud cases.

In [107]:
sum(response.predictions)

2.0
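
The sum tells us how many transactions the endpoint flagged, but not whether those flags are correct. One way to check is to compare the predictions with the true labels, for example `y_test.head(2000)`. The sketch below does this with short toy lists standing in for `response.predictions` and the labels; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        flagged = int(pred) == 1
        if truth == 1 and flagged:
            tp += 1
        elif truth == 0 and flagged:
            fp += 1
        elif truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Toy stand-ins: true labels and endpoint predictions (floats, as above).
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(tp, fp, tn, fn)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision answers how many flagged transactions were really fraud, and recall answers how many fraud cases were caught. Both are more informative than accuracy on an imbalanced data set.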