In [1]:

# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

### Install additional packages

This notebook requires the Python packages [Transformers](https://pypi.org/project/transformers/), [Datasets](https://pypi.org/project/datasets/), and [hypertune](https://github.com/GoogleCloudPlatform/cloudml-hypertune), which are installed in the Notebooks instance itself.

In [2]:
import os

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# Google Cloud Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

In [5]:
!pip -q install {USER_FLAG} --upgrade transformers

In [6]:
!pip -q install {USER_FLAG} --upgrade datasets

In [7]:
!pip -q install {USER_FLAG} --upgrade tqdm

In [8]:
!pip -q install {USER_FLAG} --upgrade cloudml-hypertune

We will be using [Vertex AI SDK for Python](https://cloud.google.com/vertex-ai/docs/start/client-libraries#python) to interact with Vertex AI services. The high-level `aiplatform` library is designed to simplify common data science workflows by using wrapper classes and opinionated defaults. 

#### Install Vertex AI SDK for Python

In [9]:
!pip -q install {USER_FLAG} --upgrade google-cloud-aiplatform

### Restart the Kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [11]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### Select a GPU runtime

**Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select "Runtime > Change runtime type > GPU".**

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.
1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).
1. Enable the following APIs in your project, which are required to run the tutorial:
    - [Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com)
    - [Cloud Storage API](https://console.cloud.google.com/flows/enableapi?apiid=storage.googleapis.com)
    - [Container Registry API](https://console.cloud.google.com/flows/enableapi?apiid=containerregistry.googleapis.com)
    - [Cloud Build API](https://console.cloud.google.com/flows/enableapi?apiid=cloudbuild.googleapis.com)
1. If you are running this notebook locally, you will need to install the [Cloud SDK](https://cloud.google.com/sdk).
1. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud` or `google.auth`.

In [2]:
PROJECT_ID = "spheric-rhythm-234515"

import os

# Get your Google Cloud project ID using google.auth
if not os.getenv("IS_TESTING"):
    import google.auth

    _, PROJECT_ID = google.auth.default()
    print("Project ID: ", PROJECT_ID)

# validate PROJECT_ID
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "spheric-rhythm-234515":
    print(
        f"Please set your project id before proceeding to next step. Currently it's set as {PROJECT_ID}"
    )

Project ID:  spheric-rhythm-234515
Please set your project id before proceeding to next step. Currently it's set as spheric-rhythm-234515


**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key** page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).
2. Click **Create service account**.
3. In the **Service account name** field, enter a name, and click **Create**.
4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI" into the filter box, and select **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.
5. Click **Create**. A JSON file that contains your key downloads to your local environment.
6. Enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [4]:
# service account already created with the following command (from a previous laboratory)
# https://github.com/garcianava/vertex-ai-training/tree/main/02_vertex_ai_qwikstart

# vertex-custom-training-sa@spheric-rhythm-234515.iam.gserviceaccount.com

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. Vertex AI runs the code from this package. In this tutorial, Vertex AI also saves the trained model that results from your job in the same bucket. Using this model artifact, you can then create Vertex AI model and endpoint resources in order to serve online predictions.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

You may also change the `REGION` variable, which is used for operations throughout the rest of this notebook. Make sure to [choose a region where Vertex AI services are available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions). You may not use a Multi-Regional Storage bucket for training with Vertex AI.

In [6]:
BUCKET_NAME = "gs://pytorch-text-classification-bucket"
REGION = "us-west4"
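
Because bucket names must be globally unique, a hard-coded name can collide with an existing bucket. One option is to derive the name from the project ID plus a timestamp. The following is a minimal sketch, not part of the original notebook: the helper name `make_bucket_uri` and the `suffix` parameter are hypothetical, and the checks cover only the basic Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit).

```python
import re
from datetime import datetime, timezone


def make_bucket_uri(project_id: str, suffix: str = "") -> str:
    """Hypothetical helper: build a gs:// URI that is unlikely to collide."""
    # A UTC timestamp (14 digits) makes repeated runs produce different names.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")

    # Join the non-empty parts and replace characters outside the allowed set.
    prefix = "-".join(part for part in (project_id, suffix) if part).lower()
    prefix = re.sub(r"[^a-z0-9_.-]", "-", prefix)

    # Leave room for "-<timestamp>" within the 63-character limit, and make
    # sure the prefix does not start or end with a separator.
    prefix = prefix[: 63 - len(timestamp) - 1].strip("-_.")

    name = f"{prefix}-{timestamp}" if prefix else timestamp
    return f"gs://{name}"


print(make_bucket_uri("my-project-id", "aiplatform"))
```

The returned value could be assigned to `BUCKET_NAME` in place of the literal string above.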

In [7]:
print(f"PROJECT_ID = {PROJECT_ID}")
print(f"BUCKET_NAME = {BUCKET_NAME}")
print(f"REGION = {REGION}")

PROJECT_ID = spheric-rhythm-234515
BUCKET_NAME = gs://pytorch-text-classification-bucket
REGION = us-west4


In [8]:
! gsutil mb -l $REGION $BUCKET_NAME

Creating gs://pytorch-text-classification-bucket/...


In [9]:
! gsutil ls -al $BUCKET_NAME

### Import libraries and define constants

In [10]:
import base64
import json
import os
import random
import sys

import google.auth
from google.cloud import aiplatform
from google.cloud.aiplatform import gapic as aip
from google.cloud.aiplatform import hyperparameter_tuning as hpt
from google.protobuf.json_format import MessageToDict

In [11]:
from IPython.display import HTML, display

In [12]:
import datasets
import numpy as np
import pandas as pd
import torch
import transformers
from datasets import ClassLabel, Sequence, load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EvalPrediction, Trainer, TrainingArguments,
                          default_data_collator)

RuntimeError: Failed to import transformers.trainer because of the following error (look up to see its traceback):
/opt/conda/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNK5torch8autograd4Node4nameB5cxx11Ev

In [13]:
print(f"Notebook runtime: {'GPU' if torch.cuda.is_available() else 'CPU'}")
print(f"PyTorch version : {torch.__version__}")
print(f"Transformers version : {transformers.__version__}")
print(f"Datasets version : {datasets.__version__}")

Notebook runtime: CPU
PyTorch version : 2.0.1+cu117
Transformers version : 4.31.0
Datasets version : 2.14.2
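
The `undefined symbol` error raised from `_XLAC` while importing `transformers.trainer` often indicates that an installed extension (here most likely `torch_xla`) was built against a different PyTorch version. When an import fails like this, installed versions can still be inspected without importing the packages. A minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution names in the loop are assumptions about what is installed in this environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name: str) -> str:
    """Hypothetical helper: return the installed version, or a placeholder."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return "not installed"


# Distribution names are assumptions; adjust them to your environment.
for name in ("torch", "torch-xla", "transformers", "datasets"):
    print(f"{name:<14}: {installed_version(name)}")
```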


In [14]:
APP_NAME = "finetuned-bert-classifier"

In [15]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Training

In this section, we will train a PyTorch model by fine-tuning a pre-trained model from [Hugging Face Transformers](https://github.com/huggingface/transformers). We will train the model locally first and then on the [Vertex AI training service](https://cloud.google.com/vertex-ai/docs/training/custom-training).

## Training locally in the notebook

### Loading the dataset

For this example, we will use the [IMDB movie review dataset](https://huggingface.co/datasets/imdb) from [Hugging Face Datasets](https://huggingface.co/datasets/) for a sentiment classification task. We use the [Hugging Face Datasets](https://github.com/huggingface/datasets) library to download the data with the `load_dataset` function.

In [16]:
dataset = load_dataset("imdb")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

The `dataset` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split. For IMDB, the splits are `train`, `test`, and `unsupervised`.

In [17]:
print(
    "Total # of rows in training dataset {} and size {:5.2f} MB".format(
        dataset["train"].shape[0], dataset["train"].size_in_bytes / (1024 * 1024)
    )
)
print(
    "Total # of rows in test dataset {} and size {:5.2f} MB".format(
        dataset["test"].shape[0], dataset["test"].size_in_bytes / (1024 * 1024)
    )
)

Total # of rows in training dataset 25000 and size 207.25 MB
Total # of rows in test dataset 25000 and size 207.25 MB


To access an actual element, you need to select a split first, then give an index:

In [19]:
dataset["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

We use the `unique` method to extract the list of labels. This lets us experiment with other datasets without hard-coding the labels.

In [20]:
label_list = dataset["train"].unique("label")
label_list

[0, 1]
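
The label list can also be turned into the label mappings that Transformers model configurations accept (`num_labels`, `id2label`, `label2id`). A minimal sketch, not taken from the original notebook: the names `neg` and `pos` are an assumption about the IMDB `ClassLabel` names; in the notebook they could be read from `dataset["train"].features["label"].names` instead of being written out.

```python
# Value returned by dataset["train"].unique("label") in the cell above.
label_list = [0, 1]

# Assumption: human-readable names for the IMDB labels, indexed by label id.
label_names = ["neg", "pos"]

# Map each integer id to its name, and each name back to its id.
id2label = {label_id: label_names[label_id] for label_id in sorted(label_list)}
label2id = {name: label_id for label_id, name in id2label.items()}
num_labels = len(label_list)

print(num_labels, id2label, label2id)
```

Deriving these values from `label_list` keeps the rest of the code independent of the specific dataset.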

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing).

In [21]:
def show_random_elements(dataset, num_examples=2):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(
                lambda x: [typ.feature.names[i] for i in x]
            )
    display(HTML(df.to_html()))
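
The retry loop in `show_random_elements` draws indices until it has enough distinct ones. The standard library's `random.sample` does the same in a single call, since it samples without replacement. A minimal sketch of that alternative; the helper name `pick_unique_indices` and its `seed` parameter are hypothetical and not part of the original notebook.

```python
import random


def pick_unique_indices(dataset_length, num_examples=2, seed=None):
    """Hypothetical helper: choose distinct row indices without a retry loop."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # A dedicated generator keeps the choice reproducible when a seed is given.
    rng = random.Random(seed)
    # random.sample draws without replacement, so the indices are unique.
    return rng.sample(range(dataset_length), num_examples)


picks = pick_unique_indices(25000, num_examples=2, seed=0)
print(picks)
```

The resulting list could replace `picks` inside `show_random_elements` before the `pd.DataFrame(dataset[picks])` call.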

In [22]:
show_random_elements(dataset["train"])

Unnamed: 0,text,label
0,"First off, I hadn't seen ""The Blob"" since I was 7 or 8 and viewing it as an adult was an incredible experience. Pages could be written on its influence on horror films even today. And even more could be written on its social subtext with the 50s ""fear of teenagers"". But this simple little tale of interplanetary horror is still a damn fine scary movie if you let it be.<br /><br />Sure, it looks cheesy as all get out in our modern world. But ""The Blob"" packs in some genuinely frightening moments as a band of kids track the unstoppable creature when then adults don't believe them. In fact, there are even some pretty bleak moments in its candy-colored world. And Steve McQueen gives so much more than the story deserved on paper that we the viewers really get caught in the moment and believe in him.<br /><br />To sum up, if you can take off your postmodern irony filter, there's a lot more to love here than meets the eye.",pos
1,"I thoroughly enjoyed this true to form take on the Dick Tracy persona. This is a well done product that used modern technology to craft a imagery filled comic era story. If you are a fan of or recently watched some of the old Dick Tracy b&w movies then you're sure to get a kick out of this rendition. The pastel colors and larger than life characters rendered in a painstakingly authentic take on an era gone by is entertainment as it's meant to be. I personally find Madonna's musical element to be a major part of this film-the CD featuring her music from this movie is one I've listened to often over the years, it's just so well done and performed musically and tuned to that era. In my mind, Madonna's finest moment both on-screen but especially musically. This is sure to bring out the ""kid"" in you.",pos
