In [1]:
%pip install pandas datasets deepspeed==0.9.1 py-cpuinfo==9.0.0 transformers accelerate tensorboard

Note: you may need to restart the kernel to use updated packages.


In [2]:
import torch
import pandas as pd
import time
import os
import tempfile
import transformers as tr
from datasets import load_dataset

In [3]:
# Define a cache directory
cache_dir = 'cache_dir'
username = os.environ['USER']
working_dir = os.environ['PWD']

print(f"Current user is {username}")
print(f"Current working dir is {working_dir}")

Current user is antonio
Current working dir is /linux-data/llm-course


In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
# Check whether a GPU is available and set the device (GPU or CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
num_gpus = torch.cuda.device_count()

if num_gpus > 0:
    print(f"GPUs available: {num_gpus}")
    for i in range(num_gpus):
        gpu_name = torch.cuda.get_device_name(i)
        print(f"GPU {i}: {gpu_name}")
else:
    print("There is no GPU available. Using CPU")

GPUs available: 1
GPU 0: NVIDIA RTX A4000


In [6]:
# Create a local temporary directory.
# It serves as the root directory for the intermediate model checkpoints created during training.
# The final model is persisted separately, under the working directory.

tmpdir = tempfile.TemporaryDirectory()
local_training_root = tmpdir.name

print(local_training_root)

/tmp/tmpvf5_phaq


### Fine-Tuning


#### Step 1 - Data Preparation

The first step of the fine-tuning process is to identify a specific task and a supporting dataset. In this notebook, the task is classifying movie reviews. This is a generally simple task: a movie review is provided as plain text, and we would like to determine whether the review is positive or negative.

The [IMDB dataset](https://huggingface.co/datasets/imdb) can be leveraged as a supporting dataset for this task. It conveniently provides labeled training and test splits with binary sentiment labels, as well as a split of unlabeled data.

In [7]:
imdb_ds = load_dataset('imdb')

#### Step 2 - Select pre-trained model

The next step of the fine-tuning process is to select a pre-trained model. We will consider using the [T5](https://huggingface.co/docs/transformers/model_doc/t5) [[paper]](https://arxiv.org/pdf/1910.10683.pdf) family of models for our fine-tuning purposes. The T5 models are text-to-text transformers that have been trained on a multi-task mixture of unsupervised and supervised tasks. They are well suited for tasks such as summarization, translation, text classification, question answering, and more.

The `t5-small` version of the T5 models has roughly 60 million parameters. This slimmed-down version is sufficient for our purposes.

In [8]:
model_checkpoint = 't5-small'

In [9]:
# load the tokenizer that was used for the t5-small model

tokenizer = tr.AutoTokenizer.from_pretrained(
    model_checkpoint,
    cache_dir=cache_dir
)

As mentioned above, the IMDB dataset is a binary sentiment dataset. Its labels are encoded as integers: `0` for negative, `1` for positive, and `-1` for unknown (used in the unlabeled split). To use this dataset with a text-to-text model like T5, the labels need to be represented as strings. There are a number of ways to accomplish this. Here, we simply translate each label id to its corresponding string value.
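
The label translation can be sketched in isolation, without the tokenizer. The following is a minimal illustration, assuming the same id-to-string mapping used in this notebook; the helper name `labels_to_text` and the two sample reviews are hypothetical and only serve the example.

```python
# Hypothetical helper for illustration; the mapping mirrors the one used in this notebook.
LABEL_LOOKUP = {0: "negative", 1: "positive", -1: "unknown"}


def labels_to_text(label_ids, label_map=LABEL_LOOKUP):
    """Translate a batch of integer label ids into their string targets."""
    return [label_map[label_id] for label_id in label_ids]


# A made-up batch shaped like the IMDB columns ("text" and "label").
batch = {"text": ["Great film!", "Terrible pacing."], "label": [1, 0]}
targets = labels_to_text(batch["label"])
print(targets)  # ['positive', 'negative']
```

These strings are what the text-to-text model is trained to generate for each review.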

In [10]:
def to_tokens(tokenizer: tr.models.t5.tokenization_t5_fast.T5TokenizerFast, label_map: dict) -> callable:
    """
    Given a `tokenizer` and a `label_map`, return a closure `apply()` that tokenizes a batch `x`.
    The closure is meant to be mapped over a dataset; it returns input ids, an attention mask, and labels.
    """

    def apply(x) -> tr.tokenization_utils_base.BatchEncoding:
        """Create a batch encoding `token_res` from a batch `x` of the dataset."""
        target_labels = [label_map[y] for y in x["label"]]
        token_res = tokenizer(
            x["text"],
            text_target=target_labels,
            return_tensors="pt",
            truncation=True,
            padding=True,
        )
        return token_res

    return apply


imdb_label_lookup = {0: "negative", 1: "positive", -1: "unknown"}

In [11]:
imdb_to_tokens = to_tokens(tokenizer, imdb_label_lookup)
tokenized_dataset = imdb_ds.map(imdb_to_tokens, batched=True, remove_columns=['text', 'label'])
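
To make the effect of `padding=True` and `truncation=True` more concrete, here is a toy sketch in plain Python. It does not call the Hugging Face tokenizer: `pad_batch`, the `max_length` value, and the token ids are made up for illustration. The pad id of `0` matches T5's pad token id.

```python
def pad_batch(sequences, pad_id=0, max_length=8):
    """Truncate each sequence to `max_length`, then pad to the longest one in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Arbitrary token ids standing in for two tokenized reviews of different lengths.
encoded = pad_batch([[71, 307, 1], [71, 1]])
print(encoded["input_ids"])       # [[71, 307, 1], [71, 1, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

Padding to the longest sequence in a batch is what lets sequences of different lengths be stacked into a single tensor.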

#### Step 3 - Setup Training

The model training process is highly configurable. The [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) class effectively exposes the configurable aspects of the process allowing one to customize them accordingly. Here, we will focus on setting up a training process that performs a single epoch of training with a batch size of 16. We will also leverage `adamw_torch` as the optimizer.
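
As a rough sanity check on these settings, the number of optimizer steps in one epoch can be estimated from the dataset size and the batch size. The sketch below assumes the 25,000-example IMDB training split, a single GPU, and no gradient accumulation; `steps_per_epoch` is a hypothetical helper, not part of `transformers`.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1, grad_accum_steps=1):
    """Estimate optimizer steps per epoch for simple data-parallel training."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_examples / effective_batch)


steps = steps_per_epoch(num_examples=25_000, per_device_batch_size=16)
print(steps)  # 1563
```

With the default logging interval of 500 steps, a run of this length would report the training loss at steps 500, 1000, and 1500.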


In [12]:
checkpoint_name = 'test-trainer'
local_checkpoint_path = os.path.join(local_training_root, checkpoint_name)

training_args = tr.TrainingArguments(
    local_checkpoint_path,
    num_train_epochs = 1, # default number of epochs to train is 3
    per_device_train_batch_size = 16,
    optim = 'adamw_torch',
    report_to = ['tensorboard']
)

In [13]:
model = tr.AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint,
    cache_dir = cache_dir
)

In [14]:
# Used to assist the trainer in batching the data

data_collator = tr.DataCollatorWithPadding(tokenizer=tokenizer)
trainer = tr.Trainer(
    model,
    training_args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['test'],
    tokenizer = tokenizer,
    data_collator = data_collator
)

Before starting the training process, let's turn on TensorBoard. This will allow us to monitor the training process as logs are written to the checkpoint directory.

In [15]:
tensorboard_display_dir = f"{local_checkpoint_path}/runs"

In [16]:
%load_ext tensorboard
%tensorboard --logdir '{tensorboard_display_dir}'

#### Fine-Tuning the t5-small model

In [17]:
trainer.train()

trainer.save_model()
trainer.save_state()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.6134
1000,0.1406
1500,0.1278


In [18]:
# Persist the fine-tuned model under the working directory

final_model_path = f"{working_dir}/llm04_fine_tuning/{checkpoint_name}"
trainer.save_model(output_dir = final_model_path)

#### Predict

In [20]:
fine_tuned_model = tr.AutoModelForSeq2SeqLM.from_pretrained(final_model_path)

In [27]:
reviews = [
""" 'Despicable Me' is a cute and funny movie, but the plot is predictable and the characters are not very well-developed. Overall, it's a good movie for kids, but adults might find it a bit boring.""",
""" 'The Batman' is a dark and gritty take on the Caped Crusader, starring Robert Pattinson as Bruce Wayne. The film is a well-made crime thriller with strong performances and visuals, but it may be too slow-paced and violent for some viewers.""",
""" The Phantom Menace is a visually stunning film with some great action sequences, but the plot is slow-paced and the dialogue is often wooden. It is a mixed bag that will appeal to some fans of the Star Wars franchise, but may disappoint others.""",
""" I'm not sure if The Matrix and the two sequels were meant to have a tigh consistency but I don't think they quite fit together. They seem to have a reasonably solid arc but the features from the first aren't in the second and third as much, instead the second and third focus more on CGI battles and more visuals. I like them but for different reasons, so if I'm supposed to rate the trilogy I'm not sure what to say.""",
""" What the heck was that? Superb. Disgusting. No idea if I like it or hate it """
]

inputs = tokenizer(
    reviews,
    return_tensors = 'pt',
    truncation = True,
    padding = True
)
pred = fine_tuned_model.generate(
    input_ids = inputs['input_ids'], 
    attention_mask = inputs['attention_mask'],
    max_new_tokens = 256
)

In [26]:
pdf = pd.DataFrame(zip(reviews, tokenizer.batch_decode(pred, skip_special_tokens = True)), columns = ['review', 'classification'])
display(pdf)

,review,classification
0,"'Despicable Me' is a cute and funny movie, bu...",positive
1,'The Batman' is a dark and gritty take on the...,positive
2,The Phantom Menace is a visually stunning fil...,positive
3,I'm not sure if The Matrix and the two sequel...,negative
4,What the heck was that? Superb. Disgusting. N...,negative
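
Because the model emits free text rather than class ids, downstream code often maps the decoded strings back to labels. Here is a minimal sketch, assuming the same label strings as above; `text_to_label` and the fallback value `-1` for unexpected generations are illustrative choices, not part of the original notebook.

```python
# Inverse of the label mapping used during tokenization (assumed for illustration).
TEXT_TO_ID = {"negative": 0, "positive": 1}


def text_to_label(decoded, unknown_id=-1):
    """Map a decoded generation to a label id, falling back to `unknown_id`."""
    return TEXT_TO_ID.get(decoded.strip().lower(), unknown_id)


# The five classifications shown in the table above.
decoded_preds = ["positive", "positive", "positive", "negative", "negative"]
print([text_to_label(p) for p in decoded_preds])  # [1, 1, 1, 0, 0]
print(text_to_label("not sure"))                  # -1
```

The fallback matters because a generative model is not constrained to the label vocabulary and can, in principle, produce other text.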
