# Training a PyTorch Text Summarization Model on [Vertex AI](https://cloud.google.com/vertex-ai) using a Hugging Face model
## Fine-tuning a pre-trained [mT5](https://huggingface.co/google/mt5-small) model for a text summarization task in **Spanish**

# Overview

We will be fine-tuning two pre-trained models, **`mBART`** and **`mT5`**, for a text summarization task in Spanish.
You can find the details about the mT5 model on the [Hugging Face Hub](https://huggingface.co/google/mt5-small).

For more notebooks with state-of-the-art PyTorch/TensorFlow/JAX examples, you can explore [Hugging Face Notebooks](https://huggingface.co/transformers/notebooks.html).

### Dataset

We will be using the [MLSUM Dataset](https://huggingface.co/datasets/mlsum) from [Hugging Face Datasets](https://huggingface.co/datasets).

MLSUM is the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages: French, German, Spanish, Russian and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset which can enable new research directions for the text summarization community. The dataset authors report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multilingual dataset.

### Objective

How to **Build, Train and Tune a PyTorch model on [Vertex AI](https://cloud.google.com/vertex-ai)**. 

### Install additional packages

**If you are not going to run the code locally, you do not need to install these packages.**

The Python dependencies required for this notebook, [Transformers](https://pypi.org/project/transformers/), [Datasets](https://pypi.org/project/datasets/) and [hypertune](https://github.com/GoogleCloudPlatform/cloudml-hypertune), will be installed in the Notebooks instance itself.

In [2]:
import os

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# Google Cloud Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

If you want to run the training script locally, run the next cell.

In [None]:
!pip -q install {USER_FLAG} --upgrade transformers
!pip -q install {USER_FLAG} --upgrade datasets
!pip -q install {USER_FLAG} --upgrade tqdm
!pip -q install {USER_FLAG} --upgrade cloudml-hypertune

We will be using [Vertex AI SDK for Python](https://cloud.google.com/vertex-ai/docs/start/client-libraries#python) to interact with Vertex AI services. The high-level `aiplatform` library is designed to simplify common data science workflows by using wrapper classes and opinionated defaults. 

#### Install Vertex AI SDK for Python

In [3]:
!pip -q install {USER_FLAG} --upgrade google-cloud-aiplatform

### Restart the Kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [3]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Prepare the environment and GCP elements



### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). 
1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).
1. Enable the following APIs in your project, which are required for running the tutorial:
    - [Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com)
    - [Cloud Storage API](https://console.cloud.google.com/flows/enableapi?apiid=storage.googleapis.com)
    - [Container Registry API](https://console.cloud.google.com/flows/enableapi?apiid=containerregistry.googleapis.com)
    - [Cloud Build API](https://console.cloud.google.com/flows/enableapi?apiid=cloudbuild.googleapis.com)
1. If you are running this notebook locally, you will need to install the [Cloud SDK](https://cloud.google.com/sdk).
1. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud` or `google.auth`.

In [4]:
PROJECT_ID = "gcp-ml-project"  # <---CHANGE THIS TO YOUR PROJECT

import os

# Get your Google Cloud project ID using google.auth
if not os.getenv("IS_TESTING"):
    import google.auth

    _, PROJECT_ID = google.auth.default()
    print("Project ID: ", PROJECT_ID)

# validate PROJECT_ID
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "gcp-ml-project":
    print(
        f"Please set your project id before proceeding to next step. Currently it's set as {PROJECT_ID}"
    )

Project ID:  gcp-ml-projects
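The validation above only checks for an empty value or the placeholder. As a minimal sketch, the check could also reject strings that do not look like a project ID at all. The helper name `is_usable_project_id` is hypothetical, and the format rule encoded in the regular expression (6 to 30 characters, lowercase letters, digits and hyphens, starting with a letter and not ending with a hyphen) is our reading of the Google Cloud naming rules, so treat it as an assumption to verify.

```python
import re

# Assumed format rule: 6-30 characters, lowercase letters, digits and hyphens,
# starting with a letter and not ending with a hyphen.
_PROJECT_ID_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")

# Placeholder value used in the cell above
PLACEHOLDER_PROJECT_ID = "gcp-ml-project"


def is_usable_project_id(project_id):
    """Return True if project_id looks well formed and is not the placeholder."""
    if not project_id or project_id == PLACEHOLDER_PROJECT_ID:
        return False
    return _PROJECT_ID_RE.fullmatch(project_id) is not None


print(is_usable_project_id("gcp-ml-projects"))
print(is_usable_project_id("gcp-ml-project"))
```

A format check like this cannot tell whether the project exists or whether you have access to it; it only catches obvious typos early.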


#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append it onto the name of resources you create in this tutorial.

In [5]:
from datetime import datetime


def get_timestamp():
    return datetime.now().strftime("%Y%m%d%H%M%S")
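To illustrate how the timestamp is used, here is a small sketch that appends it to a resource name. The helper `unique_name` is hypothetical (it is not part of the notebook) and accepts an optional `now` argument only so that the result is reproducible in the example.

```python
from datetime import datetime


def unique_name(prefix, now=None):
    """Append a YYYYMMDDHHMMSS timestamp to prefix to avoid name collisions."""
    now = now or datetime.now()
    return f"{prefix}-{now.strftime('%Y%m%d%H%M%S')}"


# A fixed datetime makes the example deterministic
print(unique_name("mt5-small-summ-mlsum-es", datetime(2022, 3, 26, 18, 30, 26)))
# prints: mt5-small-summ-mlsum-es-20220326183026
```

Two users who create resources within the same second would still collide, so a shared project may need an additional per-user suffix.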

### Authenticate your Google Cloud account

If you are using Google Cloud Notebooks, your environment is already authenticated. Otherwise, you need to authenticate to your GCP account.

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. Vertex AI runs the code from this package. In this tutorial, Vertex AI also saves the trained model that results from your job in the same bucket. Using this model artifact, you can then create Vertex AI model and endpoint resources in order to serve online predictions.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

You may also change the `REGION` variable, which is used for operations throughout the rest of this notebook. Make sure to [choose a region where Vertex AI services are available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions). You may not use a Multi-Regional Storage bucket for training with Vertex AI.

In [6]:
BUCKET_NAME = "gs://gcp-ml-projects-bucket"  # <---CHANGE THIS TO YOUR BUCKET
REGION = "us-central1"  # @param {type:"string"}

In [7]:
print(f"PROJECT_ID = {PROJECT_ID}")
print(f"BUCKET_NAME = {BUCKET_NAME}")
print(f"REGION = {REGION}")

PROJECT_ID = gcp-ml-projects
BUCKET_NAME = gs://gcp-ml-projects-bucket
REGION = us-central1


---

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

---

In [None]:
! gsutil mb -l $REGION $BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [5]:
! gsutil ls -al $BUCKET_NAME

                                 gs://gcp-ml-projects-bucket/aiplatform-custom-job-2022-03-10-15:53:48.692/
                                 gs://gcp-ml-projects-bucket/aiplatform-custom-job-2022-03-10-18:42:21.830/
                                 gs://gcp-ml-projects-bucket/aiplatform-custom-job-2022-03-11-08:17:57.154/
                                 gs://gcp-ml-projects-bucket/aiplatform-custom-training-2022-03-06-20:52:30.161/
                                 gs://gcp-ml-projects-bucket/aiplatform-custom-training-2022-03-07-18:01:30.625/
                                 gs://gcp-ml-projects-bucket/aiplatform-custom-training-2022-03-08-08:00:26.038/
                                 gs://gcp-ml-projects-bucket/aiplatform-custom-training-2022-03-10-15:24:33.751/
                                 gs://gcp-ml-projects-bucket/aiplatform-custom-training-2022-03-11-23:00:26.528/
                                 gs://gcp-ml-projects-bucket/datasets/
                                 gs://gc

### Import libraries and define constants

In [8]:
import google.auth
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

In [7]:
#from IPython.display import HTML, display

In [9]:
# Set a global variable to identify our experiment
APP_NAME = "mt5-small-summ-mlsum-es"

In [10]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Training on Vertex AI

You can do local experimentation on your Notebooks instance. However, larger datasets or models often require vertically scaled compute or horizontally distributed training. An effective way to perform this task is to leverage the [Vertex AI custom training service](https://cloud.google.com/vertex-ai/docs/training/custom-training) for the following reasons:

- **Automatically provision and de-provision resources**: A training job on Vertex AI automatically provisions computing resources, performs the training task and ensures the compute resources are deleted once the training job is finished.
- **Reusability and portability**: You can package training code with its parameters and dependencies into a container and create a portable component. This container can then be run with different scenarios such as hyperparameter tuning, different data sources and more.
- **Training at scale**: You can run a [distributed training job](https://cloud.google.com/vertex-ai/docs/training/distributed-training) with Vertex AI, allowing you to train models in a cluster across multiple nodes in parallel, resulting in faster training time.
- **Logging and Monitoring**: The training service logs messages from the job to [Cloud Logging](https://cloud.google.com/logging/docs) and can be monitored while the job is running.

In this part of the notebook, we show how to scale the training job with Vertex AI by packaging the code and creating a training pipeline to orchestrate a training job. There are three steps to run a training job using the [Vertex AI custom training service](https://cloud.google.com/vertex-ai/docs/training/custom-training):

- **STEP 1**: Determine training code structure - Packaging as a Python source distribution or as a custom container image
- **STEP 2**: Choose a custom training method - custom job, hyperparameter tuning job or training pipeline
- **STEP 3**: Run the training job


#### Custom training methods

There are three types of Vertex AI resources you can create to train custom models on Vertex AI:

- **[Custom jobs](https://cloud.google.com/vertex-ai/docs/training/create-custom-job):** With a custom job you configure the settings to run your training code on Vertex AI such as worker pool specs - machine types, accelerators, Python training spec or custom container spec. 
- **[Hyperparameter tuning jobs](https://cloud.google.com/vertex-ai/docs/training/using-hyperparameter-tuning):** Hyperparameter tuning jobs automate tuning of hyperparameters of your model based on the criteria you configure such as goal/metric to optimize, hyperparameters values and number of trials to run.
- **[Training pipelines](https://cloud.google.com/vertex-ai/docs/training/create-training-pipeline):** Orchestrates custom training jobs or hyperparameter tuning jobs with additional steps after the training job is successfully completed.

Please refer to the [documentation](https://cloud.google.com/vertex-ai/docs/training/custom-training-methods) for further details.

In this notebook, we will cover Custom Jobs and Hyperparameter tuning jobs.
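To make the "worker pool specs" mentioned above more concrete, the sketch below builds one as a plain dictionary. The helper `build_worker_pool_spec` is hypothetical, the bucket path is a placeholder, and the field names follow our understanding of the Vertex AI `WorkerPoolSpec` message, so check them against the API reference before relying on them. The machine type, accelerator and container image are the ones used later in this notebook.

```python
def build_worker_pool_spec(
    machine_type,
    accelerator_type,
    accelerator_count,
    executor_image_uri,
    package_uri,
    python_module,
    args,
    replica_count=1,
):
    """Describe one worker pool that runs a Python package on a pre-built container."""
    return {
        "machine_spec": {
            "machine_type": machine_type,
            "accelerator_type": accelerator_type,
            "accelerator_count": accelerator_count,
        },
        "replica_count": replica_count,
        "python_package_spec": {
            "executor_image_uri": executor_image_uri,
            "package_uris": [package_uri],
            "python_module": python_module,
            "args": list(args),
        },
    }


spec = build_worker_pool_spec(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_V100",
    accelerator_count=1,
    executor_image_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest",
    package_uri="gs://your-bucket/trainer-0.1.tar.gz",  # placeholder path
    python_module="trainer.task",
    args=["--epochs", "2"],
)
print(spec["machine_spec"])
```

The higher-level classes used in the rest of this notebook assemble an equivalent specification for you, which is why we never write this dictionary by hand below.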

### Packaging the training application

Before running the training job on Vertex AI, the training application code and any dependencies must be packaged and uploaded to a Cloud Storage bucket, Container Registry or Artifact Registry that your Google Cloud project can access. This section shows how to package and stage your application in the cloud.

There are two ways to package your application and dependencies and train on Vertex AI:

1. [Create a Python source distribution](https://cloud.google.com/vertex-ai/docs/training/create-python-pre-built-container) with the training code and dependencies to use with a [pre-built container](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers) on Vertex AI
2. Use [custom containers](https://cloud.google.com/ai-platform/training/docs/custom-containers-training) to package dependencies using Docker containers

## SELECT THE MAIN OPTION
**This notebook shows both packaging options to run a custom training job on Vertex AI.**

#### Recommended Training Application Structure

You can structure your training application in any way you like. However, the [following structure](https://cloud.google.com/vertex-ai/docs/training/create-python-pre-built-container#structure) is commonly used in Vertex AI samples, and having your project's organization be similar to the samples can make it easier for you to follow the samples.

We have two directories, `python_package` and `custom_container`, showing both packaging approaches. The `README.md` file inside each directory has details on the directory structure and instructions on how to run the application locally and in the cloud.

```
.
├── custom_container
│   ├── Dockerfile
│   ├── README.md
│   ├── scripts
│   │   └── train-cloud.sh
│   └── trainer -> ../python_package/trainer/
├── python_package
│   ├── README.md
│   ├── scripts
│   │   └── train-cloud.sh
│   ├── setup.py
│   └── trainer
│       ├── __init__.py
│       ├── experiment.py
│       ├── metadata.py
│       ├── model.py
│       ├── task.py
│       └── utils.py
└── pytorch-text-classification-vertex-ai-train-tune-deploy.ipynb    --> This notebook
```

1. The main project directory contains your `setup.py` file or `Dockerfile` with the dependencies.
2. Use a subdirectory named `trainer` to store your main application module and `scripts` to submit training jobs locally or in the cloud.
3. Inside the `trainer` directory:
    - `task.py` - Main application module: 1) initializes and parses the task arguments (hyperparameters), and 2) serves as the entry point to the trainer
    - `experiment.py` - Runs the model training and evaluation experiment, and exports the final model.
    - `utils.py` - Includes utility functions such as data input functions to read data and save the model to a GCS bucket
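The argument-parsing role of `task.py` described above can be sketched with `argparse`. This is not the actual `task.py` of this project: the flag names are taken from the training arguments used later in this notebook, while the types, default values and the function name `get_args` are assumptions made for illustration.

```python
import argparse


def get_args(argv=None):
    """Parse the training arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Fine-tune a summarization model")
    parser.add_argument("--model-name", default="google/mt5-small")
    parser.add_argument("--dataset_name", default="mlsum")
    parser.add_argument("--train_split", default="train")
    parser.add_argument("--val_split", default="validation")
    parser.add_argument("--test_split", default="test")
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--lr", type=float, default=1e-5)
    parser.add_argument("--warmup_steps", type=int, default=100)
    parser.add_argument("--train_batch_size", type=int, default=8)
    parser.add_argument("--push_to_hub", choices=["y", "n"], default="n")
    parser.add_argument("--wandb-api-key", default="no-logging")
    parser.add_argument("--trained-model-name", default="finetuned-model")
    return parser.parse_args(argv)


# argparse maps hyphens in flag names to underscores in attribute names
args = get_args(["--epochs", "2", "--model-name", "google/mt5-small",
                 "--train_split", "train[:3%]", "--lr", "1e-5"])
print(args.epochs, args.model_name, args.train_split)
```

Because Vertex AI passes every argument as a string, declaring `type=int` or `type=float` in the parser is what turns `"2"` and `"1e-5"` into numbers inside the training code.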

### Run Custom Job on Vertex AI Training with a pre-built container

Vertex AI provides Docker container images that can be run as [pre-built containers](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers#available_container_images) for custom training. These containers include common dependencies used in training code based on the Machine Learning framework and framework version.

In this notebook, we are using Hugging Face Datasets and fine-tuning a transformer model from the Hugging Face Transformers library for a text summarization task using PyTorch. We will use the [pre-built container for PyTorch](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers#pytorch) and package the training application code by adding standard Python dependencies - `transformers`, `datasets`, `nltk` and `rouge_score` - in the `setup.py` file.


Initialize the variables to define pre-built container image, location of training application and training module.

In [22]:
PRE_BUILT_TRAINING_CONTAINER_IMAGE_URI = (
    "us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest"
)

PYTHON_PACKAGE_APPLICATION_DIR = "python_package"

source_package_file_name = f"{PYTHON_PACKAGE_APPLICATION_DIR}/dist/trainer-0.1.tar.gz"
python_package_gcs_uri = (
    f"{BUCKET_NAME}/pytorch-on-gcp/{APP_NAME}/train/python_package/trainer-0.1.tar.gz"
)
python_module_name = "trainer.task"

Run the following command to create a source distribution, `dist/trainer-0.1.tar.gz`:

In [23]:
!cd {PYTHON_PACKAGE_APPLICATION_DIR} && python3 setup.py sdist --formats=gztar

running sdist
running egg_info
writing trainer.egg-info/PKG-INFO
writing dependency_links to trainer.egg-info/dependency_links.txt
writing requirements to trainer.egg-info/requires.txt
writing top-level names to trainer.egg-info/top_level.txt
reading manifest file 'trainer.egg-info/SOURCES.txt'
writing manifest file 'trainer.egg-info/SOURCES.txt'
running check


creating trainer-0.1
creating trainer-0.1/trainer
creating trainer-0.1/trainer.egg-info
copying files to trainer-0.1...
copying README.md -> trainer-0.1
copying setup.py -> trainer-0.1
copying trainer/__init__.py -> trainer-0.1/trainer
copying trainer/experiment.py -> trainer-0.1/trainer
copying trainer/task.py -> trainer-0.1/trainer
copying trainer/utils.py -> trainer-0.1/trainer
copying trainer.egg-info/PKG-INFO -> trainer-0.1/trainer.egg-info
copying trainer.egg-info/SOURCES.txt -> trainer-0.1/trainer.egg-info
copying trainer.egg-info/dependency_links.txt -> trainer-0.1/trainer.egg-info
copying trainer.egg-info/requires.txt 

Now upload the source distribution with the training application to the Cloud Storage bucket.

In [24]:
!gsutil cp {source_package_file_name} {python_package_gcs_uri}

Copying file://python_package/dist/trainer-0.1.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  6.4 KiB/  6.4 KiB]                                                
Operation completed over 1 objects/6.4 KiB.                                      


Validate that the source distribution exists in the Cloud Storage bucket.

In [25]:
!gsutil ls -l {python_package_gcs_uri}

      6529  2022-03-26T18:04:35Z  gs://gcp-ml-projects-bucket/pytorch-on-gcp/testing-mt5-summ-es/train/python_package/trainer-0.1.tar.gz
TOTAL: 1 objects, 6529 bytes (6.38 KiB)


#### **Run custom training job on Vertex AI**

We use [Vertex AI SDK for Python](https://cloud.google.com/vertex-ai/docs/start/client-libraries#client_libraries) to create and submit training job to the Vertex AI training service.

##### **Initialize the Vertex AI SDK for Python**

In [14]:
aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_NAME)

##### **Configure and submit Custom Job to Vertex AI Training service**

Configure a [Custom Job](https://cloud.google.com/vertex-ai/docs/training/create-custom-job) with the [pre-built container](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers) image for PyTorch and training code packaged as Python source distribution. 

**NOTE:** When using Vertex AI SDK for Python for submitting a training job, it creates a [Training Pipeline](https://cloud.google.com/vertex-ai/docs/training/create-training-pipeline) which launches the Custom Job on Vertex AI Training service.

In [26]:
print(f"APP_NAME={APP_NAME}")
print(
    f"PRE_BUILT_TRAINING_CONTAINER_IMAGE_URI={PRE_BUILT_TRAINING_CONTAINER_IMAGE_URI}"
)
print(f"python_package_gcs_uri={python_package_gcs_uri}")
print(f"python_module_name={python_module_name}")

APP_NAME=testing-mt5-summ-es
PRE_BUILT_TRAINING_CONTAINER_IMAGE_URI=us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest
python_package_gcs_uri=gs://gcp-ml-projects-bucket/pytorch-on-gcp/testing-mt5-summ-es/train/python_package/trainer-0.1.tar.gz
python_module_name=trainer.task


In [35]:
JOB_NAME = f"{APP_NAME}-pytorch-pkg-ar-{get_timestamp()}"
print(f"JOB_NAME={JOB_NAME}")

JOB_NAME=testing-mt5-summ-es-pytorch-pkg-ar-20220326183026


Create a Python package training job to train our model

In [36]:
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=f"{JOB_NAME}",
    python_package_gcs_uri=python_package_gcs_uri,
    python_module_name=python_module_name,
    container_uri=PRE_BUILT_TRAINING_CONTAINER_IMAGE_URI,
)

Next, we set the training arguments, which depend on the initial pre-trained model we choose. We have selected mBART and mT5 as our initial models, so run whichever of the next two cells matches the model you want to fine-tune.

In [68]:
# mBART training arguments
training_args = ["--epochs", "5", "--model-name", "mrm8488/mbart-large-finetuned-opus-en-es-translation","--dataset_name","mlsum","--train_split","train","--val_split","validation","--test_split","test",
                "--trained-model-name", "mbart-large-finetune-mlsum-es",]

INFO:google.cloud.aiplatform.training_jobs:Training Output directory:
gs://gcp-ml-projects-bucket/aiplatform-custom-training-2022-03-11-23:00:26.528 
INFO:google.cloud.aiplatform.training_jobs:View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/7738873559438589952?project=42794247013
INFO:google.cloud.aiplatform.training_jobs:CustomPythonPackageTrainingJob projects/42794247013/locations/us-central1/trainingPipelines/7738873559438589952 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training_jobs:View backing custom job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/4280599977559851008?project=42794247013
INFO:google.cloud.aiplatform.training_jobs:CustomPythonPackageTrainingJob projects/42794247013/locations/us-central1/trainingPipelines/7738873559438589952 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training_jobs:CustomPythonPackageTrainingJob proj

In [37]:
# mT5 training arguments
training_args = ["--epochs", "2", "--model-name", "google/mt5-small", "--dataset_name","mlsum",
                 "--train_split","train[:3%]","--val_split","validation[:5%]","--test_split","test[:5%]",
                 "--lr", "1e-5", "--warmup_steps", "100", "--train_batch_size", "8", 
                 "--wandb-api-key", "no-logging","--push_to_hub","n",
                "--trained-model-name", "mT5-base-finetune-mlsum-es"]

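The flat lists above alternate flag names and values, and every element must be a string. As a small sketch, the same list can be generated from a dictionary, which is easier to edit when switching between models. The helper `to_cli_args` is hypothetical and the dictionary only repeats a subset of the mT5 arguments shown above.

```python
def to_cli_args(params):
    """Flatten {flag: value} into [flag, str(value), ...], keeping insertion order."""
    cli_args = []
    for flag, value in params.items():
        cli_args.extend([flag, str(value)])
    return cli_args


mt5_params = {
    "--epochs": 2,
    "--model-name": "google/mt5-small",
    "--dataset_name": "mlsum",
    "--train_split": "train[:3%]",
    "--train_batch_size": 8,
}

example_args = to_cli_args(mt5_params)
print(example_args)
```

The result is stored in `example_args` rather than `training_args` so that running this cell does not overwrite the arguments defined above.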
In [38]:
model = job.run(
    replica_count=1,
    # For mBART training
    #machine_type="a2-highgpu-1g",
    #accelerator_type="NVIDIA_TESLA_A100",
    # For mT5 training
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_V100",
    accelerator_count=1,
    args=training_args,
    sync=False,
)

INFO:google.cloud.aiplatform.training_jobs:Training Output directory:
gs://gcp-ml-projects-bucket/aiplatform-custom-training-2022-03-26-18:30:45.169 
INFO:google.cloud.aiplatform.training_jobs:View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/6643786366452760576?project=42794247013
INFO:google.cloud.aiplatform.training_jobs:CustomPythonPackageTrainingJob projects/42794247013/locations/us-central1/trainingPipelines/6643786366452760576 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training_jobs:View backing custom job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/6700679496120401920?project=42794247013
INFO:google.cloud.aiplatform.training_jobs:CustomPythonPackageTrainingJob projects/42794247013/locations/us-central1/trainingPipelines/6643786366452760576 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training_jobs:CustomPythonPackageTrainingJob proj

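Validate the model artifacts written to GCS by the training code after the job completes successfully. Because the job was submitted with `sync=False`, `job.run()` returns immediately, so wait for the job to finish before reading its resource.

In [61]:
from google.protobuf.json_format import MessageToDict

# Block until the asynchronous training job has finished
job.wait()

job_response = MessageToDict(job._gca_resource._pb)
gcs_model_artifacts_uri = job_response["trainingTaskInputs"]["baseOutputDirectory"][
    "outputUriPrefix"
]
print(f"Model artifacts are available at {gcs_model_artifacts_uri}")

If this cell runs before the training pipeline resource has been created, it fails with `AttributeError: 'NoneType' object has no attribute '_pb'`.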
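The lookup above indexes three nested keys and raises a `KeyError` if any of them is missing. A more defensive variant, sketched below, returns `None` instead. The helper `get_output_uri_prefix` is hypothetical, and the `sample` dictionary is hand-written to mimic the shape of the job response, using the output directory printed in the training log above.

```python
def get_output_uri_prefix(job_response):
    """Return trainingTaskInputs.baseOutputDirectory.outputUriPrefix, or None if absent."""
    node = job_response
    for key in ("trainingTaskInputs", "baseOutputDirectory", "outputUriPrefix"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Hand-written sample that mimics the structure read in the cell above
sample = {
    "trainingTaskInputs": {
        "baseOutputDirectory": {
            "outputUriPrefix": "gs://gcp-ml-projects-bucket/aiplatform-custom-training-2022-03-26-18:30:45.169"
        }
    }
}

print(get_output_uri_prefix(sample))
print(get_output_uri_prefix({}))
```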

In [None]:
!gsutil ls -lr $gcs_model_artifacts_uri/

## Run Custom Job on Vertex AI Training with custom container

To create a [training job with custom container](https://cloud.google.com/vertex-ai/docs/training/create-custom-container), you define a `Dockerfile` to install or add the dependencies required for the training job. Then, you build and test your Docker image locally, push the image to Container Registry and submit a Custom Job to the Vertex AI Training service.

#### **Build your container using Dockerfile with Training Code and Dependencies**

In the previous section, we wrapped the training application code and dependencies as a Python source distribution. An alternative way to package the training application and dependencies is to [create a custom container](https://cloud.google.com/vertex-ai/docs/training/create-custom-container) using a Dockerfile. We create a Dockerfile with a [pre-built PyTorch container image provided by Vertex AI](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers#available_container_images) as the base image, install the dependencies - `transformers`, `datasets`, `nltk`, `rouge-score`, `wandb` and `cloudml-hypertune` - and copy the training application code.

In [21]:
"""
%%writefile ./custom_container/Dockerfile

# Use pytorch GPU base image
FROM us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-10:latest

# set working directory
WORKDIR /app

# Install required packages
RUN pip install google-cloud-storage transformers datasets nltk rouge-score sentencepiece cloudml-hypertune

# Copies the trainer code to the docker image.
COPY ./trainer/__init__.py /app/trainer/__init__.py
COPY ./trainer/experiment.py /app/trainer/experiment.py
COPY ./trainer/utils.py /app/trainer/utils.py
COPY ./trainer/task.py /app/trainer/task.py

# Set up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]
"""

Overwriting ./custom_container/Dockerfile


In [11]:
!ls

custom_container  scripts			       train.py
python_package	  summarization_mlsum_es_vertex.ipynb


**If your custom container was built previously, just run the next cell; otherwise, run the cell that runs `docker build`**.

In [12]:
CUSTOM_TRAIN_IMAGE_URI = f"gcr.io/{PROJECT_ID}/pytorch_gpu_train_{APP_NAME}"

Build the image and tag it with the Container Registry path (gcr.io) that you will push to.

In [12]:
CUSTOM_TRAIN_IMAGE_URI = f"gcr.io/{PROJECT_ID}/pytorch_gpu_train_{APP_NAME}"

!cd ./custom_container/ && docker build -f Dockerfile -t $CUSTOM_TRAIN_IMAGE_URI ../python_package

Sending build context to Docker daemon  64.05kB
Step 1/9 : FROM us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-10:latest
 ---> 3210fbe7551d
Step 2/9 : WORKDIR /app
 ---> Using cache
 ---> 2e0346acbd78
Step 3/9 : RUN pip install google-cloud-storage transformers==4.17.0 datasets==2.0.0 nltk rouge-score sentencepiece cloudml-hypertune wandb==0.12.11
 ---> Using cache
 ---> dc4c79fd467e
Step 4/9 : RUN   apt-get update &&   apt-get install -y sudo curl git &&   curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash &&   sudo apt-get install git-lfs=3.1.2 && git lfs install
 ---> Using cache
 ---> a239344213d0
Step 5/9 : COPY ./trainer/__init__.py /app/trainer/__init__.py
 ---> Using cache
 ---> 78dcc1c8dda9
Step 6/9 : COPY ./trainer/experiment.py /app/trainer/experiment.py
 ---> 220d04dd68c0
Step 7/9 : COPY ./trainer/utils.py /app/trainer/utils.py
 ---> 377dff866431
Step 8/9 : COPY ./trainer/task.py /app/trainer/task.py
 ---> 07ba6eefdabc
Step 9/9

##### **Push the container to Container Registry**

Push your container image with training application code and dependencies to your Container Registry. 

In [13]:
!docker push $CUSTOM_TRAIN_IMAGE_URI

Using default tag: latest
The push refers to repository [gcr.io/gcp-ml-projects/pytorch_gpu_train_mt5-small-summ-mlsum-es]


Validate the custom container image in Container Registry

In [14]:
!gcloud container images describe $CUSTOM_TRAIN_IMAGE_URI

image_summary:
  digest: sha256:baa6099c4e4a27af1d68f32a995e57fafc5b7ba1e585715c685787cfaa8b22d6
  fully_qualified_digest: gcr.io/gcp-ml-projects/pytorch_gpu_train_mt5-small-summ-mlsum-es@sha256:baa6099c4e4a27af1d68f32a995e57fafc5b7ba1e585715c685787cfaa8b22d6
  registry: gcr.io
  repository: gcp-ml-projects/pytorch_gpu_train_mt5-small-summ-mlsum-es


##### **Initialize the Vertex AI SDK for Python**

In [15]:
aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_NAME)

In [16]:
JOB_NAME = f"{APP_NAME}-pytorch-cstm-cntr-{get_timestamp()}"

print(f"APP_NAME={APP_NAME}")
print(f"CUSTOM_TRAIN_IMAGE_URI={CUSTOM_TRAIN_IMAGE_URI}")
print(f"JOB_NAME={JOB_NAME}")

APP_NAME=mt5-small-summ-mlsum-es
CUSTOM_TRAIN_IMAGE_URI=gcr.io/gcp-ml-projects/pytorch_gpu_train_mt5-small-summ-mlsum-es
JOB_NAME=mt5-small-summ-mlsum-es-pytorch-cstm-cntr-20220327160104


### Get the Weights & Biases API key to log training and validation

The W&B API key should be stored in a text file called `secrets.txt`, on the first line of the file. **This file must not be uploaded to your source code repository**.

In [36]:
# Path to the text file with the wandb API key
path_to_wandb_api = "secrets.txt"
# Read the API key; if the file cannot be read, fall back to the no-logging strategy
try:
    with open(path_to_wandb_api) as f:
        # Strip the trailing newline so the key is passed cleanly as an argument
        mywandb_api_key = f.readline().strip()
except OSError:
    mywandb_api_key = "no-logging"

### Log in to the Hugging Face account and get the token

In [18]:
from getpass import getpass

hf_username = "edumunozsala" # your username on huggingface.co
hf_email = "edumunozsala@gmail.com" # email used for commit
# Define the final model name
mymodel_name='mT5-small-finetune-mlsum-es'
repository_name = mymodel_name
password = getpass("Enter your password:") # creates a prompt for entering password

Enter your password: ············


In [19]:
from huggingface_hub import HfApi, Repository

# get hf token
token = HfApi().login(username=hf_username, password=password)

  from .autonotebook import tqdm as notebook_tqdm


ERROR:root:HfApi.login: This method is deprecated in favor of `set_access_token`.


In [21]:
# define training code arguments
training_args = ["--epochs", "5", "--model-name", "google/mt5-small", "--dataset_name","mlsum",
                 "--train_split","train","--val_split","validation","--test_split","test",
                 "--lr", "1e-5", "--warmup_steps", "1000", "--weight-decay", "0.01", "--train_batch_size", "8", 
                 "--max_input_length", "512", "--max_target_length", "64",
                 "--wandb-api-key", mywandb_api_key,"--push_to_hub","y","--hub_model_id", "mT5-small-finetune-mlsum-es", "--hub_token", token,
                "--trained-model-name", "mT5-small-finetune-mlsum-es"]

##### **Configure and submit Custom Job to Vertex AI Training service**

Configure a [Custom Job](https://cloud.google.com/vertex-ai/docs/training/create-custom-job) with the [custom container](https://cloud.google.com/vertex-ai/docs/training/create-custom-container) image that holds the training code and other dependencies.

**NOTE:** When using Vertex AI SDK for Python for submitting a training job, it creates a [Training Pipeline](https://cloud.google.com/vertex-ai/docs/training/create-training-pipeline) which launches the Custom Job to train on Vertex AI Training.

In [22]:
# configure the job with container image spec
job = aiplatform.CustomContainerTrainingJob(
    display_name=f"{JOB_NAME}", container_uri=f"{CUSTOM_TRAIN_IMAGE_URI}"
)

In [23]:
# submit the custom job to Vertex AI training service
model = job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_V100",
    accelerator_count=1,
    args=training_args,
    sync=False,
)

INFO:google.cloud.aiplatform.training_jobs:Training Output directory:
gs://gcp-ml-projects-bucket/aiplatform-custom-training-2022-03-27-16:11:24.559 
INFO:google.cloud.aiplatform.training_jobs:View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/2922827911826243584?project=42794247013
INFO:google.cloud.aiplatform.training_jobs:CustomContainerTrainingJob projects/42794247013/locations/us-central1/trainingPipelines/2922827911826243584 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training_jobs:View backing custom job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/4684227947337351168?project=42794247013
INFO:google.cloud.aiplatform.training_jobs:CustomContainerTrainingJob projects/42794247013/locations/us-central1/trainingPipelines/2922827911826243584 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training_jobs:CustomContainerTrainingJob projects/4279424

# Hyperparameter Tuning

The training application code for fine-tuning a transformer model for the text summarization task uses hyperparameters such as learning rate and weight decay. These hyperparameters control the behavior of the training algorithm and can have a significant effect on the performance of the resulting model. This part of the notebook shows how you can automate tuning these hyperparameters with the Vertex AI Training service.

We submit a [Hyperparameter Tuning job](https://cloud.google.com/vertex-ai/docs/training/hyperparameter-tuning-overview) to the Vertex AI Training service by packaging the training application code and dependencies in a Docker container and pushing the container to Google Container Registry, similar to running a Custom Job on Vertex AI with a custom container.

### How does hyperparameter tuning work in Vertex AI?

The high-level steps involved in running a Hyperparameter Tuning job on the Vertex AI Training service are:

- You define the hyperparameters to tune, along with the metric (or goal) to optimize.
- Vertex AI runs multiple trials of your training application with the hyperparameters and limits you specified: the maximum number of trials to run and the number of parallel trials.
- Vertex AI keeps track of the results from each trial and makes adjustments for subsequent trials. This requires your training application to report the metrics to Vertex AI using the Python package [`cloudml-hypertune`](https://github.com/GoogleCloudPlatform/cloudml-hypertune).
- When the job is finished, you get a summary of all the trials, along with the most effective configuration of values based on the criteria you configured.

Refer to the [Vertex AI documentation](https://cloud.google.com/vertex-ai/docs/training/hyperparameter-tuning-overview) to understand how to configure and select hyperparameters for tuning, how to configure the tuning strategy, and how Vertex AI optimizes hyperparameter tuning jobs. The default tuning strategy uses results of previous trials to inform the assignment of values in subsequent trials.

### Changes to training application code for hyperparameter tuning

There are a few [requirements](https://cloud.google.com/vertex-ai/docs/training/hyperparameter-tuning-overview#the_flow_of_hyperparameter_values) specific to hyperparameter tuning in Vertex AI:

1. To pass the hyperparameter values to the training code, you must define a command-line argument in the main training module for each tuned hyperparameter. Use the value passed in those arguments to set the corresponding hyperparameter in the training application's code.
1. You must pass metrics from the training application to Vertex AI to evaluate the effectiveness of a trial. You can use the [`cloudml-hypertune` Python package](https://github.com/GoogleCloudPlatform/cloudml-hypertune) to report metrics.
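
The first requirement can be sketched with `argparse`. This is a minimal illustration, not the actual contents of `trainer.task`: the flag names match the arguments used in this notebook, but the `parse_hyperparameters` helper, the defaults, and the assumption that a discrete value might arrive formatted as `4.0` are all hypothetical.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the tuned hyperparameters from the command line (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=1e-5)
    parser.add_argument("--weight-decay", dest="weight_decay", type=float, default=0.01)
    # Assumption: tolerate "4.0" as well as "4", in case a discrete value
    # is passed to the container formatted as a float.
    parser.add_argument("--train_batch_size", type=lambda v: int(float(v)), default=8)
    # "--hp-tune y" switches metric reporting on in the training code.
    parser.add_argument("--hp-tune", dest="hp_tune", default="n")
    # Ignore any other arguments the full training module would handle.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments one tuning trial might receive.
args = parse_hyperparameters(
    ["--lr=0.0001", "--weight-decay=0.001", "--train_batch_size=4", "--hp-tune=y"]
)
print(args.lr, args.weight_decay, args.train_batch_size, args.hp_tune)
```

Each parsed value would then be forwarded to the corresponding training argument, for example the learning rate and weight decay of the `Trainer` configuration.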

Previously, in the training application code that fine-tunes the transformer model for the text summarization task, we instantiated [`Trainer`](https://huggingface.co/transformers/main_classes/trainer.html) with hyperparameters passed as training arguments (`training_args`). These hyperparameters are passed as command-line arguments to the training module `trainer.task`, which then passes them on to `training_args`. Refer to the `./python_package/trainer` module for the training application code.

To report metrics to Vertex AI when hyperparameter tuning is enabled, we call the [`cloudml-hypertune` Python package](https://github.com/GoogleCloudPlatform/cloudml-hypertune) after the evaluation phase, using a [callback](https://huggingface.co/transformers/main_classes/callback.html#transformers.trainer_callback.TrainerCallback) added to the `trainer`. The `trainer` object passes the metrics computed by the last evaluation phase to the callback, and the `hypertune` library reports them to Vertex AI for evaluating trials.
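
A callback of this kind might look like the sketch below. It is an illustration under stated assumptions rather than the code in `./python_package/trainer`: the class name `HPTuneCallback` and the injectable `hpt_client` are hypothetical, and the lookup assumes the evaluation metrics carry the `eval_` prefix that `Trainer` adds by default. The stand-in `RecordingClient` only exists so the sketch can run without `cloudml-hypertune` installed.

```python
from types import SimpleNamespace

try:
    from transformers import TrainerCallback
except ImportError:  # keeps the sketch runnable without transformers installed
    TrainerCallback = object


class HPTuneCallback(TrainerCallback):
    """Reports one evaluation metric to Vertex AI after each evaluation (sketch)."""

    def __init__(self, metric_tag, hpt_client=None):
        self.metric_tag = metric_tag
        self._hpt_client = hpt_client  # hypothetical hook, used for testing

    def _client(self):
        if self._hpt_client is None:
            import hypertune  # provided by the cloudml-hypertune package

            self._hpt_client = hypertune.HyperTune()
        return self._hpt_client

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        metrics = metrics or {}
        # Assumption: metrics are keyed as "eval_<name>", else as "<name>".
        value = metrics.get(f"eval_{self.metric_tag}", metrics.get(self.metric_tag))
        if value is None:
            return
        self._client().report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag=self.metric_tag,
            metric_value=value,
            global_step=state.global_step,
        )


class RecordingClient:
    """Stand-in for hypertune.HyperTune that records calls instead of reporting."""

    def __init__(self):
        self.calls = []

    def report_hyperparameter_tuning_metric(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
callback = HPTuneCallback("rouge2", hpt_client=client)
callback.on_evaluate(None, SimpleNamespace(global_step=500), None, metrics={"eval_rouge2": 7.4})
print(client.calls)
```

In the training code, such a callback would be passed to the `Trainer` through its `callbacks` argument, and only when the `--hp-tune` argument is set to `y`.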

### Run Hyperparameter Tuning Job on Vertex AI

Before submitting the hyperparameter tuning job, push the custom container image with the training application to Google Container Registry. We will use the same image that we used to run the Custom Job on the Vertex AI Training service.

Validate the custom container image in Container Registry

The `setup.py` file for the training application uses the `find_packages()` function, which includes the `trainer` directory in the package because it contains `__init__.py`. That file tells [Python Setuptools](https://setuptools.readthedocs.io/en/latest/) to treat the directory as a package.

##### **Initialize the Vertex AI SDK for Python**

In [24]:
aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_NAME)

##### **Configure and submit Hyperparameter Tuning Job to Vertex AI Training service**

Configure a [Hyperparameter Tuning Job](https://cloud.google.com/vertex-ai/docs/training/using-hyperparameter-tuning) with the [custom container](https://cloud.google.com/vertex-ai/docs/training/create-custom-container) image with training code and other dependencies.

When configuring and submitting a Hyperparameter Tuning job, you need to attach a Custom Job definition with worker pool specs defining machine type, accelerators and URI for container image representing the custom container.

In [26]:
JOB_NAME = f"{APP_NAME}-pytorch-hptune-{get_timestamp()}"

print(f"APP_NAME={APP_NAME}")
print(f"CUSTOM_TRAIN_IMAGE_URI={CUSTOM_TRAIN_IMAGE_URI}")
#print(f"PRE_BUILT_TRAIN_IMAGE_URI={PRE_BUILT_TRAINING_CONTAINER_IMAGE_URI}") 
print(f"JOB_NAME={JOB_NAME}")

APP_NAME=finetuned-mt5-summ-es
CUSTOM_TRAIN_IMAGE_URI=gcr.io/gcp-ml-projects/pytorch_gpu_train_finetuned-mt5-summ-es
JOB_NAME=finetuned-mt5-summ-es-pytorch-hptune-20220315141748


Define the training arguments with the `--hp-tune` argument set to `y` so that the training application code reports metrics to Vertex AI.

In [27]:
training_args = [
    "--epochs", "4", "--model-name", "google/mt5-small", "--dataset_name", "mlsum",
    "--train_split", "train[:15%]", "--val_split", "validation[:50%]", "--test_split", "test",
    "--trained-model-name", "mt5-small-finetune-mlsum-es", "--hp-tune", "y",
]

Create a **`CustomJob`** with worker pool specs that define the machine type, accelerators and the custom container spec with the training application code.

In [28]:
# The spec of the worker pools including machine type and Docker image
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_V100",
            "accelerator_count": 2,
        },
        "replica_count": 1,
        "container_spec": {"image_uri": CUSTOM_TRAIN_IMAGE_URI, "args": training_args},
    }
]

In [29]:
custom_job = aiplatform.CustomJob(
    display_name=JOB_NAME, worker_pool_specs=worker_pool_specs
)

Define the `parameter_spec` as a Python dictionary with the search space, i.e. the parameters to search and optimize. The key is the hyperparameter name passed as a command-line argument to the training code, and the value is the parameter specification. The specification must state the hyperparameter's data type as an instance of a parameter value specification.

Refer to the [documentation](https://cloud.google.com/vertex-ai/docs/training/hyperparameter-tuning-overview#hyperparameters) on selecting the hyperparameters to tune and how to define the parameter specification.

In [30]:
# Dictionary representing parameters to optimize.
# The dictionary key is the parameter_id, which is passed into your training
# job as a command line argument,
# and the dictionary value is the specification of that hyperparameter.
parameter_spec = {
    "lr": hpt.DoubleParameterSpec(min=1e-7, max=5e-4, scale="log"),
    "weight-decay": hpt.DiscreteParameterSpec(
        values=[0.001, 0.01], scale=None
    ),
    "train_batch_size": hpt.DiscreteParameterSpec(
        values=[2, 4, 8], scale=None
    ),
}
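
To make `scale="log"` concrete: a log scale places values on the unit interval by their logarithm, so each decade of learning rate gets roughly equal room in the search space. The sketch below only illustrates that mapping for the `lr` range above. The helper names are hypothetical, and this is not how Vertex AI chooses trial values, which is the job of the service's search algorithm.

```python
import math

LR_MIN, LR_MAX = 1e-7, 5e-4  # same bounds as the "lr" spec above


def to_unit_log(value, low=LR_MIN, high=LR_MAX):
    """Map a strictly positive value in [low, high] to [0, 1] on a log scale."""
    return (math.log(value) - math.log(low)) / (math.log(high) - math.log(low))


def from_unit_log(unit, low=LR_MIN, high=LR_MAX):
    """Inverse mapping: a position in [0, 1] back to a value in [low, high]."""
    return math.exp(math.log(low) + unit * (math.log(high) - math.log(low)))


# Each tenfold step in learning rate moves the same distance on the unit interval.
for lr in (1e-7, 1e-6, 1e-5, 1e-4, 5e-4):
    print(f"lr={lr:.0e} -> unit position {to_unit_log(lr):.2f}")
```

On a linear scale, by contrast, almost the whole interval would be taken up by values above `1e-5`, which is why a log scale is a common choice for learning rates.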

Define the `metric_spec` with name and goal of metric to optimize. The goal specifies whether you want to tune your model to maximize or minimize the value of this metric.

In [31]:
# Dictionary representing metrics to optimize.
# The dictionary key is the metric_id, which is reported by your training job,
# And the dictionary value is the optimization goal of the metric.
metric_spec = {"rouge2": "maximize"}

Configure and submit a Hyperparameter Tuning Job with the Custom Job, metric spec, parameter spec and trial limits.

- **`max_trial_count`**: Maximum number of trials run by the service. We recommend starting with a smaller value to understand the impact of the chosen hyperparameters before scaling up.
- **`parallel_trial_count`**: Number of trials to run in parallel. We recommend starting with a smaller value because Vertex AI uses results from previous trials to inform the assignment of values in subsequent trials. With a large number of parallel trials, many trials start without the benefit of results from trials that are still running.
- **`search_algorithm`**: Search algorithm specified for the Study. If you do not specify an algorithm, Vertex AI applies Bayesian optimization by default to search the parameter space.

Refer to the [documentation](https://cloud.google.com/vertex-ai/docs/training/using-hyperparameter-tuning#configuration) to understand the hyperparameter tuning job configuration.
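
The trade-off between the two trial limits can be seen with some rough arithmetic. The sketch below assumes trials run in lockstep waves of `parallel_trial_count`, which is a simplification (in practice a new trial can start as soon as a slot frees up), and the helper name `sequential_waves` is hypothetical.

```python
import math


def sequential_waves(max_trial_count, parallel_trial_count):
    """Number of back-to-back waves needed if trials ran in lockstep."""
    return math.ceil(max_trial_count / parallel_trial_count)


max_trial_count = 15
for parallel_trial_count in (1, 2, 5, 15):
    waves = sequential_waves(max_trial_count, parallel_trial_count)
    # Under the lockstep assumption, only trials after the first wave
    # can draw on results from earlier trials.
    informed = max(0, max_trial_count - parallel_trial_count)
    print(
        f"parallel={parallel_trial_count}: {waves} waves, "
        f"{informed} trials can use earlier results"
    )
```

More parallelism shortens the wall-clock time but leaves fewer trials that can benefit from earlier results. At the extreme, running all 15 trials at once gives the search algorithm nothing to learn from.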

In [32]:
hp_job = aiplatform.HyperparameterTuningJob(
    display_name=JOB_NAME,
    custom_job=custom_job,
    metric_spec=metric_spec,
    parameter_spec=parameter_spec,
    max_trial_count=15,
    parallel_trial_count=2,
    search_algorithm=None,
)

In [33]:
# run() returns None for a tuning job; results are read from hp_job afterwards
hp_job.run(sync=False)

INFO:google.cloud.aiplatform.jobs:Creating HyperparameterTuningJob
INFO:google.cloud.aiplatform.jobs:HyperparameterTuningJob created. Resource name: projects/42794247013/locations/us-central1/hyperparameterTuningJobs/2753564693799895040
INFO:google.cloud.aiplatform.jobs:To use this HyperparameterTuningJob in another session:
INFO:google.cloud.aiplatform.jobs:hpt_job = aiplatform.HyperparameterTuningJob.get('projects/42794247013/locations/us-central1/hyperparameterTuningJobs/2753564693799895040')
INFO:google.cloud.aiplatform.jobs:View HyperparameterTuningJob:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/2753564693799895040?project=42794247013
INFO:google.cloud.aiplatform.jobs:HyperparameterTuningJob projects/42794247013/locations/us-central1/hyperparameterTuningJobs/2753564693799895040 current state:
JobState.JOB_STATE_PENDING
INFO:google.cloud.aiplatform.jobs:HyperparameterTuningJob projects/42794247013/locations/us-central1/hyperparameterTuningJobs/27535

##### **Monitoring progress of the Hyperparameter Tuning Job**

You can monitor the hyperparameter tuning job from the Cloud Console by following [this link](https://console.cloud.google.com/vertex-ai/training/hyperparameter-tuning-jobs/), or with the gcloud CLI command [`gcloud beta ai custom-jobs stream-logs`](https://cloud.google.com/sdk/gcloud/reference/beta/ai/custom-jobs/stream-logs).

After the job is finished, you can view and format the results of the hyperparameter tuning trials (run by the Vertex AI Training service) as a Pandas dataframe.