<a href="https://colab.research.google.com/github/cmplx-xyttmt/mlops/blob/main/starter_notebook_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## MLOps workflow

This is version 2 of the MLOps workflow. It is adapted from the [vertex-ai-samples](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/community-content/pytorch_text_classification_using_vertex_sdk_and_gcloud/pytorch-text-classification-vertex-ai-train-tune-deploy.ipynb) notebook from Google. More details are in the articles below:
- [Train and tune PyTorch models on Vertex AI](https://cloud.google.com/blog/topics/developers-practitioners/pytorch-google-cloud-how-train-and-tune-pytorch-models-vertex-ai)
- [Deploy PyTorch models on Vertex AI](https://cloud.google.com/blog/topics/developers-practitioners/pytorch-google-cloud-how-deploy-pytorch-models-vertex-ai)
- [Orchestrating PyTorch ML Workflows on Vertex AI Pipelines](https://cloud.google.com/blog/topics/developers-practitioners/orchestrating-pytorch-ml-workflows-vertex-ai-pipelines)
- [Scalable ML Workflows using PyTorch on Kubeflow Pipelines and Vertex Pipelines](https://cloud.google.com/blog/topics/developers-practitioners/scalable-ml-workflows-using-pytorch-kubeflow-pipelines-and-vertex-pipelines)

## Setting up the project

### Setting up GCP
Log in to Google Cloud and set the working project.

In [1]:
from IPython.display import clear_output

In [2]:
# Login to gcloud
!gcloud auth login
clear_output()  # don't dox yourself when this is uploaded to github

In [3]:
# set the working project
!gcloud config set project sb-gcp-project-01

Updated property [core/project].


### Installing requirements

In [4]:
!pip install transformers
!pip install transformers[torch]
!pip install datasets
!pip install tqdm
!pip install cloudml-hypertune
clear_output()

In [5]:
# install Vertex AI SDK for Python
!pip install google-cloud-aiplatform
clear_output()

### GCS Storage


In [6]:
PROJECT_ID = "sb-gcp-project-01"
BUCKET_NAME = "gs://vertexai-pytorch-sample"
REGION = "europe-west1"

In [7]:
print(f"PROJECT_ID = {PROJECT_ID}")
print(f"BUCKET_NAME = {BUCKET_NAME}")
print(f"REGION = {REGION}")

PROJECT_ID = sb-gcp-project-01
BUCKET_NAME = gs://vertexai-pytorch-sample
REGION = europe-west1


In [8]:
# run this only if your bucket doesn't already exist
!gsutil mb -l $REGION $BUCKET_NAME

Creating gs://vertexai-pytorch-sample/...
ServiceException: 409 A Cloud Storage bucket named 'vertexai-pytorch-sample' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


In [9]:
# validate access to your Cloud Storage bucket by examining its contents
!gsutil ls -al $BUCKET_NAME

### Import libraries and define constants

In [10]:
import base64
import json
import os
import random
import sys

import google.auth
from google.cloud import aiplatform
from google.cloud.aiplatform import gapic as aip
from google.cloud.aiplatform import hyperparameter_tuning as hpt
from google.protobuf.json_format import MessageToDict

In [11]:
from IPython.display import HTML, display

In [12]:
import datasets
import numpy as np
import pandas as pd
import torch
import transformers
from datasets import ClassLabel, Sequence, load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer, EvalPrediction, Trainer, TrainingArguments, default_data_collator)

In [13]:
print(f"Notebook runtime: {'GPU' if torch.cuda.is_available() else 'CPU'}")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"Datasets version: {datasets.__version__}")

Notebook runtime: GPU
PyTorch version: 2.0.1+cu118
Transformers version: 4.30.2
Datasets version: 2.13.1


In [14]:
APP_NAME = "finetuned-bert-classifier"

In [15]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Training

### Loading the dataset
This notebook uses the [IMDB movie review dataset](https://huggingface.co/datasets/imdb) from Hugging Face Datasets for a sentiment classification task.

In [16]:
dataset = load_dataset("imdb")
dataset




DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [17]:
print(
    "Total # of rows in training dataset {} and size {:5.2f} MB".format(dataset["train"].shape[0], dataset["train"].size_in_bytes / (1024 * 1024))
)

print(
    "Total # of rows in test dataset {} and size {:5.2f} MB".format(dataset["test"].shape[0], dataset["test"].size_in_bytes / (1024 * 1024))
)

Total # of rows in training dataset 25000 and size 207.25 MB
Total # of rows in test dataset 25000 and size 207.25 MB


In [18]:
dataset["train"][65]

{'text': "SWING! is an important film because it's one of the remaining Black-produced and acted films from the 1930s. Many of these films have simply deteriorated so badly that they are unwatchable, but this one is in fairly good shape. It's also a nice chance to see many of the talented Black performers of the period just after the heyday of the old Cotton Club--a time all but forgotten today.<br /><br />Unfortunately, while the film is historically important and has some lovely performances, it's also a mess. The main plot is very similar to the Hollywood musicals of the era--including a prima donna who is going to ruin the show and the surprise unknown who appears from no where to save the day. However, the writing is just god-awful and a bit trashy at times--and projects images of Black America that some might find a bit demeaning. This is because before the plot really gets going, you are treated to a no-account bum who lives off his hard working wife (a popular stereotype of the

In [19]:
label_list = dataset["train"].unique("label")
label_list

[0, 1]

In [20]:
def show_random_elements(dataset, num_examples=2):
  assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset"
  # sample distinct row indices without replacement
  picks = random.sample(range(len(dataset)), num_examples)

  df = pd.DataFrame(dataset[picks])
  # replace integer class ids with their label names
  for column, typ in dataset.features.items():
    if isinstance(typ, ClassLabel):
      df[column] = df[column].transform(lambda i: typ.names[i])
    elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
      df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
  # render the table once, after all label columns have been converted
  display(HTML(df.to_html()))

In [21]:
show_random_elements(dataset["train"])

Unnamed: 0,text,label
0,"After watching about half of this movie I noticed something peculiar ... I found myself constantly switching through tv-channels to see what else is on - not exactly a good movie trait.<br /><br />This movie is listed as being in a number of genres, and I must say it mostly failed misserably in every one of them. 80% through the movie I switched over to watch an old rerun instead. Bottom line - the whole movie felt as if the ones making it didn't exactly know what to make and ended up in a concoction with no discernable taste.",neg
1,"As others have noted, this should have been an excellent Hammer-style film, and it seems to me that that's how most of the actors were instructed to play it... but the screenplay is so leaden, poorly paced, and filled with a lot of dull soliloquies (poor Timothy Dalton is saddled with most of them) that it's all too overblown and self-important. This is an uncharacteristically weak performance from Dalton, although he quietly nails the climactic scene where Dr. Rock finally realizes what he's done. The only actor who comes off really well is Patrick Stewart who is a most welcome sight. Freddie Francis may have been a great cinematographer, but he was a lousy director.",neg


In [22]:
batch_size = 16
max_seq_length = 128
model_name_or_path = "bert-base-cased"

In [23]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    use_fast=True
)

In [24]:
dataset = load_dataset("imdb")
label_to_id = {1: 1, 0: 0, -1: 0}

def preprocess_function(examples):
  """
  Tokenize the input example texts
  NOTE: The same preprocessing step(s) will be applied at the time of inference as well.
  """
  args = (examples["text"],)
  result = tokenizer(
      *args, padding="max_length", max_length=max_seq_length, truncation=True
  )

  if label_to_id is not None and "label" in examples:
    result["label"] = [label_to_id[example] for example in examples["label"]]

  return result

# apply the preprocessing function to the input examples
dataset = dataset.map(preprocess_function, batched=True, load_from_cache_file=True)
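
The effect of `padding="max_length"` together with `truncation=True` can be illustrated without the tokenizer. The sketch below is a plain-Python illustration, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, it ignores special-token handling, and it assumes the padding id is 0.

```python
# Plain-Python illustration of padding to a fixed length with truncation.
# Assumption: the padding token id is 0; all ids below are made up.
PAD_ID = 0


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop anything past max_length
    n_pad = max_length - len(kept)           # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is what lets `default_data_collator` stack them into a batch. The attention mask marks which positions are padding.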






### Fine-tuning the model

In [25]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name_or_path, num_labels=len(label_list)
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initi

In [26]:
args = TrainingArguments(
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    output_dir="/tmp/cls"
)

In [28]:
def compute_metrics(p: EvalPrediction):
  preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
  preds = np.argmax(preds, axis=1)
  return {
      "accuracy": (preds == p.label_ids).astype(np.float32).mean().item()
  }
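
To make the metric concrete, here is a minimal pure-Python sketch of the same computation: take the argmax over the logits of each example, then the fraction of predictions that match the labels. The helper names, logits and labels are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, label_ids):
    """Fraction of examples whose argmax prediction equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, label_ids))
    return correct / len(label_ids)


# Made-up logits for four reviews, two classes each (0 = neg, 1 = pos).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)  # predictions are [0, 1, 1, 0], so 3 of 4 are correct
```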

In [29]:
trainer = Trainer(
    model,
    args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=default_data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [30]:
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.3048,0.2867,0.87748


TrainOutput(global_step=1563, training_loss=0.35353916361022286, metrics={'train_runtime': 755.052, 'train_samples_per_second': 33.11, 'train_steps_per_second': 2.07, 'total_flos': 1644444096000000.0, 'train_loss': 0.35353916361022286, 'epoch': 1.0})
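
The numbers in `TrainOutput` can be checked against the dataset size and batch size. The sketch below assumes a single device and no gradient accumulation. With that, one epoch over 25,000 examples at batch size 16 takes `ceil(25000 / 16)` optimizer steps, and the throughput figures follow from the reported runtime.

```python
import math

num_train_examples = 25000   # rows in dataset["train"]
batch_size = 16              # per_device_train_batch_size
num_epochs = 1               # num_train_epochs
train_runtime_s = 755.052    # 'train_runtime' reported above

# The last batch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_train_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(total_steps, round(samples_per_second, 2), round(steps_per_second, 2))
# prints: 1563 33.11 2.07
```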

In [31]:
saved_model_local_path = "./models"
!mkdir -p ./models

In [32]:
trainer.save_model(saved_model_local_path)

In [33]:
history = trainer.evaluate()

In [34]:
history

{'eval_loss': 0.2866995632648468,
 'eval_accuracy': 0.8774799704551697,
 'eval_runtime': 176.2069,
 'eval_samples_per_second': 141.879,
 'eval_steps_per_second': 8.87,
 'epoch': 1.0}

### Run predictions locally with sample examples

In [37]:
model_name_or_path = "bert-base-cased"
label_text = {0: "Negative", 1: "Positive"}
saved_model_path = saved_model_local_path

def predict(input_text, saved_model_path):
  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

  # apply the same preprocessing as during training
  predict_input = tokenizer(
      input_text,
      padding="max_length",
      max_length=max_seq_length,
      truncation=True,
      return_tensors="pt"
  )

  loaded_model = AutoModelForSequenceClassification.from_pretrained(saved_model_path)

  # pass the attention mask along with the input ids so padding tokens are ignored
  with torch.no_grad():
    output = loaded_model(**predict_input)

  label_id = torch.argmax(output.logits, dim=1)

  print(f"Review text: {input_text}")
  print(f"Sentiment: {label_text[label_id.item()]}\n")
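
The function above reports only the predicted label. If a confidence score is also wanted, the logits can be turned into probabilities with a softmax. The following is a pure-Python sketch on made-up logits; `softmax` and `describe` are hypothetical helpers and not part of the notebook.

```python
import math

LABEL_TEXT = {0: "Negative", 1: "Positive"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits):
    """Return (label text, probability of that label) for one example."""
    probs = softmax(logits)
    label_id = max(range(len(probs)), key=lambda i: probs[i])
    return LABEL_TEXT[label_id], probs[label_id]


# Made-up logits for a single review: [negative, positive].
label, confidence = describe([-1.2, 2.3])
print(f"Sentiment: {label} ({confidence:.2%})")
```

The argmax of the probabilities is the same as the argmax of the logits, so the predicted label does not change. The softmax only adds a score between 0 and 1.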

In [38]:
review_text = (
    """Jaw dropping visual effects and action! One of the best I have seen to date."""
)
predict(review_text, saved_model_path)

Review text: Jaw dropping visual effects and action! One of the best I have seen to date.
Sentiment: Positive



In [39]:
review_text = """Take away the CGI and the A-list cast and you end up with film with less punch."""
predict(review_text, saved_model_path)

Review text: Take away the CGI and the A-list cast and you end up with film with less punch.
Sentiment: Negative



### Training on Vertex AI
Local experimentation can be done on Notebooks instances. However, larger datasets or models often require vertically scaled compute or horizontally distributed training.

Reasons to use the Vertex AI custom training service:
- **Automatically provision and de-provision resources**: A training job on Vertex AI automatically provisions compute resources, performs the training task, and deletes the compute resources once the job is finished.
- **Reusability and portability**: You can package training code with its parameters and dependencies into a container and create a portable component. This container can then be run in different scenarios such as hyperparameter tuning, different data sources and more.
- **Training at scale**: You can run a distributed training job with Vertex AI, training models in a cluster across multiple nodes in parallel and reducing training time.
- **Logging and Monitoring**: The training service logs messages from the job to Cloud Logging, so the job can be monitored while it is running.

There are three steps to run a training job:
- **STEP 1**: Determine the training code structure: package it as a Python source distribution or as a custom container image.
- **STEP 2**: Choose a custom training method: custom job, hyperparameter tuning job or training pipeline.
- **STEP 3**: Run the training job


#### Custom training methods
There are three types of Vertex AI resources you can create to train custom models on Vertex AI:
- Custom jobs: With a custom job, you configure the settings to run your training code on Vertex AI, such as the worker pool specs: machine types, accelerators, and a Python package spec or custom container spec.
- Hyperparameter tuning jobs: Hyperparameter tuning jobs automate the tuning of your model's hyperparameters based on criteria you configure, such as the goal/metric to optimize, the hyperparameter values and the number of trials to run.
- Training pipelines: A training pipeline orchestrates custom training jobs or hyperparameter tuning jobs, with additional steps after the training job has completed successfully.
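
As an illustration of what a worker pool spec for a custom job looks like, the sketch below builds one as a plain Python dictionary for the Python-package case. The helper `build_worker_pool_spec`, the machine type, the accelerator and the `--num-epochs` flag are assumptions chosen for illustration. The image URI, package URI and module name are the ones this notebook uses for its pre-built container run.

```python
def build_worker_pool_spec(
    executor_image_uri,
    package_uri,
    python_module,
    args=None,
    machine_type="n1-standard-8",        # assumption: example machine type
    accelerator_type="NVIDIA_TESLA_T4",  # assumption: check availability in your region
    accelerator_count=1,
    replica_count=1,
):
    """Return one worker pool spec for a Python-package custom job."""
    return {
        "machine_spec": {
            "machine_type": machine_type,
            "accelerator_type": accelerator_type,
            "accelerator_count": accelerator_count,
        },
        "replica_count": replica_count,
        "python_package_spec": {
            "executor_image_uri": executor_image_uri,
            "package_uris": [package_uri],
            "python_module": python_module,
            "args": list(args or []),
        },
    }


worker_pool_specs = [
    build_worker_pool_spec(
        executor_image_uri="europe-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13.py310:latest",
        package_uri=(
            "gs://vertexai-pytorch-sample/pytorch-on-gcp/"
            "finetuned-bert-classifier/train/python_package/trainer-0.1.tar.gz"
        ),
        python_module="trainer.task",
        args=["--num-epochs", "1"],  # hypothetical flag, depends on your task.py
    )
]
print(worker_pool_specs[0]["machine_spec"])
```

A list of this shape is what the Vertex AI SDK's `aiplatform.CustomJob` takes as its `worker_pool_specs` argument. Adjust the machine and accelerator settings to what is available in your region.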

### Packaging the training application
To run the training job on Vertex AI, the training application code and any dependencies must be packaged and uploaded to a Cloud Storage bucket, Container Registry, or Artifact Registry that your GCP project can access.

There are two ways to package your application:
1. Create a Python source distribution with the training code and dependencies to use with a pre-built container on Vertex AI.
2. Use a custom container to package dependencies with Docker.

Recommended project structure:
```
.
├── custom_container
│   ├── Dockerfile
│   ├── README.md
│   ├── scripts
│   │   └── train-cloud.sh
│   └── trainer -> ../python_package/trainer/
├── python_package
│   ├── README.md
│   ├── scripts
│   │   └── train-cloud.sh
│   ├── setup.py
│   └── trainer
│       ├── __init__.py
│       ├── experiment.py
│       ├── metadata.py
│       ├── model.py
│       ├── task.py
│       └── utils.py
└── pytorch-text-classification-vertex-ai-train-tune-deploy.ipynb
```

1. The main project directory contains your `setup.py` file or `Dockerfile` with the dependencies.
2. Use a subdirectory named `trainer` to store your main application module, along with scripts to submit training jobs locally or in the cloud.
3. Inside the `trainer` directory:
    - `task.py`: Main application module. It 1) initializes and parses the task arguments (hyperparameters), and 2) serves as the entry point to the trainer.
    - `model.py`: Includes a function to create a model with a sequence classification head from a pre-trained model.
    - `experiment.py`: Runs the model training and evaluation experiment, and exports the final model.
    - `metadata.py`: Defines metadata for the classification task, such as the predefined model, dataset name and target labels.
    - `utils.py`: Includes utility functions, such as data input functions to read data and a function to save the model to a GCS bucket.
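
The role of `task.py` can be sketched with `argparse`. This is a minimal, hypothetical outline, not the sample's actual module: the flag names are assumptions, and the defaults mirror the values used earlier in this notebook.

```python
import argparse


def get_args(argv=None):
    """Parse task arguments (hyperparameters); all flag names are hypothetical."""
    parser = argparse.ArgumentParser(description="Sketch of a trainer.task entry point")
    parser.add_argument("--model-name", default="bert-base-cased")
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--learning-rate", type=float, default=2e-5)
    parser.add_argument("--num-epochs", type=int, default=1)
    parser.add_argument("--weight-decay", type=float, default=0.01)
    parser.add_argument("--job-dir", default="/tmp/cls")
    return parser.parse_args(argv)


def main(argv=None):
    args = get_args(argv)
    # A real task.py would hand the parsed arguments to experiment.py here.
    print(f"Training {args.model_name} for {args.num_epochs} epoch(s), "
          f"batch size {args.batch_size}, lr {args.learning_rate}")
    return args


# Example invocation with an explicit argument list instead of sys.argv.
parsed = main(["--batch-size", "32"])
```

Keeping hyperparameters as command-line flags lets the same package be launched with different values. That is how a hyperparameter tuning job varies them between trials.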

### Run custom job on Vertex AI Training with a pre-built container

In [40]:
# initialize the variables that define the pre-built container image, the location of the training application and the training module
PRE_BUILD_TRAINING_CONTAINER_IMAGE_URI = (
    "europe-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13.py310:latest"
)

PYTHON_PACKAGE_APPLICATION_DIR = "python-package"

source_package_file_name = f"{PYTHON_PACKAGE_APPLICATION_DIR}/dist/trainer-0.1.tar.gz"
python_package_gcs_uri = (
    f"{BUCKET_NAME}/pytorch-on-gcp/{APP_NAME}/train/python_package/trainer-0.1.tar.gz"
)
python_module_name = "trainer.task"

In [42]:
%%writefile ./{PYTHON_PACKAGE_APPLICATION_DIR}/setup.py

from setuptools import find_packages
from setuptools import setup


REQUIRED_PACKAGES = [
    'transformers',
    'datasets',
    'tqdm',
    'cloudml-hypertune'
]

setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='Vertex AI | Training | PyTorch | Text Classification | Python Package'
)


Writing ./python-package/setup.py


In [43]:
!cd {PYTHON_PACKAGE_APPLICATION_DIR} && python3 setup.py sdist --formats=gztar

running sdist
running egg_info
creating trainer.egg-info
writing trainer.egg-info/PKG-INFO
writing dependency_links to trainer.egg-info/dependency_links.txt
writing requirements to trainer.egg-info/requires.txt
writing top-level names to trainer.egg-info/top_level.txt
writing manifest file 'trainer.egg-info/SOURCES.txt'
reading manifest file 'trainer.egg-info/SOURCES.txt'
writing manifest file 'trainer.egg-info/SOURCES.txt'

running check
creating trainer-0.1
creating trainer-0.1/trainer.egg-info
copying files to trainer-0.1...
copying setup.py -> trainer-0.1
copying trainer.egg-info/PKG-INFO -> trainer-0.1/trainer.egg-info
copying trainer.egg-info/SOURCES.txt -> trainer-0.1/trainer.egg-info
copying trainer.egg-info/dependency_links.txt -> trainer-0.1/trainer.egg-info
copying trainer.egg-info/requires.txt -> trainer-0.1/trainer.egg-info
copying trainer.egg-info/top_level.txt -> trainer-0.1/trainer.egg-info
Writing trainer-0.1/setup.cfg
creating dist
Creating tar archive
removing 'train

In [44]:
# Upload the source distribution with training application to GCS bucket
!gsutil cp {source_package_file_name} {python_package_gcs_uri}

Copying file://python-package/dist/trainer-0.1.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  914.0 B/  914.0 B]
Operation completed over 1 objects/914.0 B.                                      


In [45]:
# Validate the source distribution exists on GCS
!gsutil ls -l {python_package_gcs_uri}

       914  2023-06-22T22:25:19Z  gs://vertexai-pytorch-sample/pytorch-on-gcp/finetuned-bert-classifier/train/python_package/trainer-0.1.tar.gz
TOTAL: 1 objects, 914 bytes (914 B)


In [46]:
# (Optional) Run custom training job locally
!cd {PYTHON_PACKAGE_APPLICATION_DIR} && python -m trainer.task

/usr/bin/python3: Error while finding module specification for 'trainer.task' (ModuleNotFoundError: No module named 'trainer')
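
The `ModuleNotFoundError` above, the 914-byte archive, and an `sdist` log that copies only `setup.py` and the egg-info files together suggest that no `trainer` directory was present under `python-package` when the package was built. A minimal sketch for checking an archive before uploading it is shown below. The helper names are hypothetical, and the member list is taken from the (truncated) build log above.

```python
import tarfile


def sdist_members(path):
    """List the member names of a .tar.gz source distribution."""
    with tarfile.open(path, "r:gz") as tar:
        return tar.getnames()


def contains_package(members, package="trainer"):
    """True if the archive holds Python files under <root>/<package>/."""
    return any(
        name.split("/")[1:2] == [package] and name.endswith(".py")
        for name in members
    )


# Member names as reported by the sdist log in this notebook.
logged_members = [
    "trainer-0.1/setup.py",
    "trainer-0.1/setup.cfg",
    "trainer-0.1/trainer.egg-info/PKG-INFO",
    "trainer-0.1/trainer.egg-info/SOURCES.txt",
    "trainer-0.1/trainer.egg-info/dependency_links.txt",
    "trainer-0.1/trainer.egg-info/requires.txt",
    "trainer-0.1/trainer.egg-info/top_level.txt",
]

print(contains_package(logged_members))  # no trainer/*.py in the logged files
```

On a real archive, `contains_package(sdist_members("python-package/dist/trainer-0.1.tar.gz"))` performs the same check. If it returns `False`, `find_packages()` found nothing to include, and a job that runs `trainer.task` from that archive would be expected to fail in the same way.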
