# | NLP | LLM | Fine-tuning | Trainer |

## Natural Language Processing (NLP) and Large Language Models (LLM) with Fine-Tuning LLM and Trainer

![Learning](https://t3.ftcdn.net/jpg/06/14/01/52/360_F_614015247_EWZHvC6AAOsaIOepakhyJvMqUu5tpLfY.jpg)


# <b>1 <span style='color:#78D118'>|</span> Overview</b>

In this notebook, we fine-tune a large language model (LLM):

<img src="https://github.com/YanSte/NLP-LLM-Fine-tuning-Trainer/blob/main/img_2.png?raw=true" alt="Learning" width="50%">

Many LLMs are general-purpose models trained on a broad range of data and use cases. This enables them to perform well in a variety of applications, as shown in previous modules. It is not uncommon, though, to find situations where a general-purpose model performs unacceptably on a specific dataset or use case. This often does not mean that the model is unusable: with some new data and additional training, it could be improved, or fine-tuned, so that it produces acceptable results for that use case.

<img src="https://github.com/YanSte/NLP-LLM-Fine-tuning-Trainer/blob/main/img_1.png?raw=true" alt="Learning" width="50%">

Fine-tuning uses a pre-trained model as a base and continues to train it on a new, task-specific dataset. Conceptually, fine-tuning leverages what the model has already learned and focuses that knowledge on a specific task.

It is important to recognize that fine-tuning is still model training. The process remains resource-intensive and time-consuming, although training time is greatly shortened by starting from a pre-trained model.
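To make the point about shortened training time concrete, here is a deliberately tiny sketch. It is a toy stand-in, not an LLM: a one-parameter linear model trained with plain gradient descent, once from a zero weight ("from scratch") and once from a weight already close to the target ("pre-trained"). The function name `steps_to_fit`, the learning rate, and the tolerance are all assumptions chosen for illustration.

```python
def steps_to_fit(w_start, xs, ys, lr=0.05, tol=1e-4, max_steps=10_000):
    """Gradient descent on y = w * x; return the number of updates until the loss is below tol."""
    w = w_start
    n = len(xs)
    for step in range(max_steps):
        # Mean squared error of the current weight
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        if loss < tol:
            return step
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return max_steps

# Toy data generated by the "true" weight w = 3
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

steps_from_scratch = steps_to_fit(0.0, xs, ys)      # start far from the solution
steps_from_pretrained = steps_to_fit(2.8, xs, ys)   # start close to the solution

print("from scratch:", steps_from_scratch)
print("from 'pre-trained' start:", steps_from_pretrained)
```

The run that starts near the solution needs fewer updates, but it still runs the same training loop. The same holds for fine-tuning: it is cheaper than training from scratch, not free.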

<img src="https://github.com/YanSte/NLP-LLM-Fine-tuning-Trainer/blob/main/img_3.png?raw=true" alt="Learning" width="50%">


## Learning Objectives

By the end of this notebook, you will be able to:
1. Prepare a novel dataset
2. Fine-tune the `t5-small` model to classify movie reviews.

### Setup


In [3]:
import os

os.environ["HUGGINGFACEHUB_API_TOKEN"] = "Fill"

In [4]:
%%capture
!pip install py-cpuinfo==9.0.0
#!pip install chromadb==0.4.10 tiktoken==0.3.3 sqlalchemy==2.0.15
#!pip install langchain==0.0.249
!pip install --force-reinstall pydantic==1.10.6 
#!pip install sentence_transformers

In [5]:
import os
import pandas as pd
import transformers as tr
from datasets import load_dataset

In [6]:
cache_dir = "./cache"

In [7]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_seq_items', None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', True)

In [8]:
# Create a local temporary directory.
# It serves as the root directory for the intermediate model checkpoints created during training. The final model is saved separately under `cache_dir`.
import tempfile

tmpdir = tempfile.TemporaryDirectory()
local_training_root = tmpdir.name

# <b>2 <span style='color:#78D118'>|</span> Fine-Tuning</b>

### Step 1 - Data Preparation

The first step of the fine-tuning process is to identify a specific task and a supporting dataset. In this notebook, the task is classifying movie reviews. This is a generally simple task: a movie review is provided as plain text, and we would like to determine whether the review is positive or negative.

The [IMDB dataset](https://huggingface.co/datasets/imdb) can be leveraged as a supporting dataset for this task. The dataset conveniently provides both a training and testing dataset with labeled binary sentiments, as well as a dataset of unlabeled data.





In [9]:
imdb_ds = load_dataset("imdb")


In [10]:
imdb_ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

### Step 2 - Select pre-trained model


The next step of the fine-tuning process is to select a pre-trained model. We will consider using the [T5](https://huggingface.co/docs/transformers/model_doc/t5) [[paper]](https://arxiv.org/pdf/1910.10683.pdf) family of models for our fine-tuning purposes. The T5 models are text-to-text transformers that have been trained on a multi-task mixture of unsupervised and supervised tasks. They are well suited for tasks such as summarization, translation, text classification, question answering, and more.

The `t5-small` version of the T5 models has 60 million parameters. This slimmed down version will be sufficient for our purposes.
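As a rough back-of-the-envelope check on why a 60-million-parameter model is convenient here, the sketch below estimates the memory taken by the weights alone. The helper `param_memory_mb` is hypothetical, the parameter count is the rounded figure quoted above, and the estimate ignores activations and framework overhead.

```python
def param_memory_mb(num_params, bytes_per_param=4):
    """Approximate memory for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6

t5_small_params = 60_000_000  # rounded parameter count quoted in the text

fp32_mb = param_memory_mb(t5_small_params)                      # 4 bytes per float32 weight
fp16_mb = param_memory_mb(t5_small_params, bytes_per_param=2)   # 2 bytes per float16 weight

# Training needs more than the weights: gradients plus AdamW's two moment
# estimates add roughly three more copies in float32 (activations excluded).
train_state_mb = 4 * fp32_mb

print(f"fp32 weights: {fp32_mb:.0f} MB")
print(f"fp16 weights: {fp16_mb:.0f} MB")
print(f"weights + grads + AdamW state (fp32): {train_state_mb:.0f} MB")
```

These are order-of-magnitude figures only. Actual memory use during fine-tuning also depends on batch size and sequence length.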


In [11]:
model_checkpoint = "t5-small"

Hugging Face provides the [Auto*](https://huggingface.co/docs/transformers/model_doc/auto) suite of objects to conveniently instantiate the various components associated with a pre-trained model. Here, we use the [AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer) to load in the tokenizer that is associated with the `t5-small` model.

In [12]:
# load the tokenizer that was used for the t5-small model
tokenizer = tr.AutoTokenizer.from_pretrained(
    model_checkpoint, 
    cache_dir=cache_dir
)  # Use a pre-cached model


The IMDB dataset is a binary sentiment dataset, with labels encoded as integers: `0` for negative and `1` for positive (the unlabeled split uses `-1`, which we treat as unknown). To use this dataset with a text-to-text model like T5, each label must be represented as a string. There are several ways to do this; here, we simply translate each label id to its corresponding string value.
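The label translation can be sketched on a toy batch without any tokenizer. The batch below is made up for illustration, and `labels_to_text` is a hypothetical helper. The dictionary layout (column name mapped to a list of values) mirrors what a batched `map` call hands to the mapped function.

```python
label_lookup = {0: "negative", 1: "positive", -1: "unknown"}

def labels_to_text(batch, lookup):
    """Translate each integer label id in a batch to its string form."""
    return [lookup[label_id] for label_id in batch["label"]]

# Toy batch in column-oriented form: one list per column
toy_batch = {
    "text": ["Loved it.", "A waste of two hours.", "An unlabeled review."],
    "label": [1, 0, -1],
}

targets = labels_to_text(toy_batch, label_lookup)
print(targets)
```

These strings are what the text-to-text model is trained to generate, so the classification task becomes a (very short) sequence generation task.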

In [14]:
import torch

def to_tokens(tokenizer, label_map):
    """
    Build a function for `Dataset.map` that tokenizes a batch `x`.
    The returned closure `apply()` encodes the review text as input ids and
    attention mask, and the mapped label strings as target `labels`.
    """
    def apply(x):
        """Create the batch encoding `token_res` from a batch `x` with `text` and `label` columns."""
        target_labels = [label_map[y] for y in x["label"]]
        token_res = tokenizer(
            x["text"],
            text_target=target_labels,
            return_tensors="pt",
            truncation=True,
            padding=True,
        )
        # Convert tensors to plain lists so that `Dataset.map` can store them
        for key, value in token_res.items():
            if isinstance(value, torch.Tensor):
                token_res[key] = value.tolist()
        return token_res
    return apply

imdb_label_lookup = {0: "negative", 1: "positive", -1: "unknown"}
imdb_to_tokens = to_tokens(tokenizer, imdb_label_lookup)
tokenized_dataset = imdb_ds.map(
    imdb_to_tokens,
    batched=True,  # the mapped function receives a batch and must return a dict of lists or numpy arrays
    remove_columns=["text", "label"]
)


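To show what `truncation=True` and `padding=True` do to a batch, here is a minimal pure-Python sketch on made-up token ids. The helper `pad_batch` is hypothetical, and `pad_id=0` is an assumption for illustration. A real tokenizer also handles special tokens such as the end-of-sequence marker, which this sketch ignores.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Truncate to max_length (if given), then right-pad every sequence to the longest one."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should not attend to
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_sequences = [[27, 3, 5], [8, 1], [9, 9, 9, 9, 9]]
batch = pad_batch(toy_sequences, max_length=4)

for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Every row ends up with the same length, which is what allows a batch to be stacked into a single tensor.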

In [16]:
def test_tokenized_dataset(tokenized_dataset, num_samples=1):
    # Print the first few samples from the tokenized dataset
    for i in range(num_samples):
        sample = tokenized_dataset[i]
        print(f"Sample {i + 1}:")
        print("Input IDs:", sample["input_ids"])
        print("Attention Mask:", sample["attention_mask"])
        print("Labels:", sample["labels"])
        print("=" * 50)

# Inspect the first sample of the training split
test_tokenized_dataset(tokenized_dataset['train'])

Sample 1:
Input IDs: [27, 3, 20907, 27, 5422, 205, 22182, 17854, 18, 476, 3577, 20573, 45, 82, 671, 1078, 250, 13, 66, 8, 21760, 24, 3, 8623, 34, 116, 34, 47, 166, 1883, 16, 18148, 5, 27, 92, 1943, 24, 44, 166, 34, 47, 3, 27217, 57, 412, 5, 134, 5, 1653, 7, 3, 99, 34, 664, 1971, 12, 2058, 48, 684, 6, 2459, 271, 3, 9, 1819, 13, 4852, 1702, 96, 23862, 2660, 23, 138, 121, 27, 310, 141, 12, 217, 48, 21, 1512, 5, 2, 115, 52, 3, 87, 3155, 2, 115, 52, 3, 87, 3155, 634, 5944, 19, 3, 12809, 300, 3, 9, 1021, 16531, 6616, 1236, 2650, 312, 29, 9, 113, 2746, 12, 669, 762, 255, 54, 81, 280, 5, 86, 1090, 255, 2746, 12, 992, 160, 1388, 7, 12, 492, 128, 1843, 13, 12481, 30, 125, 8, 1348, 7320, 15, 221, 816, 81, 824, 1827, 807, 224, 38, 8, 8940, 1602, 11, 1964, 807, 16, 8, 907, 1323, 5, 86, 344, 3558, 13446, 11, 9495, 177, 23, 1847, 7, 13, 23964, 81, 70, 8479, 30, 6525, 6, 255, 65, 3, 7, 994, 28, 160, 6616, 3145, 6, 28345, 6, 11, 4464, 1076, 5, 2, 115, 52, 3, 87, 3155, 2, 115, 52, 3, 87, 3155, 5680, 578

### Step 3 - Setup Training

The model training process is highly configurable. The [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) class effectively exposes the configurable aspects of the process allowing one to customize them accordingly. Here, we will focus on setting up a training process that performs a single epoch of training with a batch size of 16. We will also leverage `adamw_torch` as the optimizer.
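To get a feel for what one epoch at batch size 16 means, the sketch below counts optimizer steps for the 25,000-example training split. It assumes a single device, no gradient accumulation, and that the last partial batch is kept. The helper `total_steps` is hypothetical, and the 500-step logging interval is the `Trainer` default as far as we know.

```python
import math

def total_steps(num_examples, batch_size, epochs=1, num_devices=1):
    """Optimizer steps when the last partial batch is kept and there is no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    return steps_per_epoch * epochs

steps = total_steps(num_examples=25_000, batch_size=16, epochs=1)

# Steps at which a loss value would be logged with a 500-step interval
logging_interval = 500
logged_at = list(range(logging_interval, steps + 1, logging_interval))

print("total steps:", steps)
print("loss logged at steps:", logged_at)
```

Doubling the batch size would halve the step count, at the cost of more memory per step.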


In [17]:
checkpoint_name = "test-trainer"
local_checkpoint_path = os.path.join(local_training_root, checkpoint_name)
training_args = tr.TrainingArguments(
    local_checkpoint_path,
    num_train_epochs=1,  # default number of epochs to train is 3
    per_device_train_batch_size=16,
    optim="adamw_torch",
    report_to=["tensorboard"],
)

The pre-trained `t5-small` model can be loaded using the [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSeq2SeqLM) class.

In [18]:
# load the pre-trained model
model = tr.AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint, 
    cache_dir=cache_dir
)  # Use a pre-cached model



In [19]:
# Used to assist the trainer in batching the data
data_collator = tr.DataCollatorWithPadding(tokenizer=tokenizer)

trainer = tr.Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)



### Step 4 - Train

Before starting the training process, let's turn on TensorBoard. This will allow us to monitor the training process as checkpoint logs are created.

In [27]:
tensorboard_display_dir = f"{local_checkpoint_path}/runs"

%load_ext tensorboard
%tensorboard --logdir '{tensorboard_display_dir}'




Start the fine-tuning process.

In [21]:
trainer.train()

# save model to the local checkpoint
trainer.save_model()
trainer.save_state()

# persist the fine-tuned model under `cache_dir`
final_model_path = f"{cache_dir}/llm04_fine_tuning/{checkpoint_name}"
trainer.save_model(output_dir=final_model_path)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500 | 0.6229 |
| 1000 | 0.1395 |
| 1500 | 0.1314 |


### Step 5 - Predict

In [23]:
fine_tuned_model = tr.AutoModelForSeq2SeqLM.from_pretrained(final_model_path)

reviews = [
    """
    'Despicable Me' is a cute and funny movie, but the plot is predictable and the characters are not very well-developed. Overall, it's a good movie for kids, but adults might find it a bit boring.
    """,
    """ 
    'The Batman' is a dark and gritty take on the Caped Crusader, starring Robert Pattinson as Bruce Wayne. The film is a well-made crime thriller with strong performances and visuals, but it may be too slow-paced and violent for some viewers.
    """,
    """
    The Phantom Menace is a visually stunning film with some great action sequences, but the plot is slow-paced and the dialogue is often wooden. It is a mixed bag that will appeal to some fans of the Star Wars franchise, but may disappoint others.
    """,
    """
    I'm not sure if The Matrix and the two sequels were meant to have a tigh consistency but I don't think they quite fit together. They seem to have a reasonably solid arc but the features from the first aren't in the second and third as much, instead the second and third focus more on CGI battles and more visuals. I like them but for different reasons, so if I'm supposed to rate the trilogy I'm not sure what to say.
    """,
    """
    I'm not sure if The Matrix and the two sequels were meant to have a tigh consistency but I don't think they quite fit together. They seem to have a reasonably solid arc but the features from the first aren't in the second and third as much, instead the second and third focus more on CGI battles and more visuals. I like them but for different reasons, so if I'm supposed to rate the trilogy I'm not sure what to say.
    """,
    """
    This captivating film delivers a compelling story and outstanding performances, placing it among the best of the year.
    """,
    """
    While the special effects may impress, the lack of a coherent storyline makes this film disappointing for those seeking deeper substance.
    """
]

inputs = tokenizer(
    reviews, 
    return_tensors="pt", 
    truncation=True, 
    padding=True
)

pred = fine_tuned_model.generate(
    input_ids=inputs["input_ids"], 
    attention_mask=inputs["attention_mask"]
)

pdf = pd.DataFrame(
    zip(reviews, tokenizer.batch_decode(pred, skip_special_tokens=True)),
    columns=["review", "classification"],
)
display(pdf)




| | review | classification |
|---|---|---|
| 0 | 'Despicable Me' is a cute and funny movie, but the plot is predictable and the characters are not very well-developed. Overall, it's a good movie for kids, but adults might find it a bit boring. | negative |
| 1 | 'The Batman' is a dark and gritty take on the Caped Crusader, starring Robert Pattinson as Bruce Wayne. The film is a well-made crime thriller with strong performances and visuals, but it may be too slow-paced and violent for some viewers. | positive |
| 2 | The Phantom Menace is a visually stunning film with some great action sequences, but the plot is slow-paced and the dialogue is often wooden. It is a mixed bag that will appeal to some fans of the Star Wars franchise, but may disappoint others. | positive |
| 3 | I'm not sure if The Matrix and the two sequels were meant to have a tigh consistency but I don't think they quite fit together. They seem to have a reasonably solid arc but the features from the first aren't in the second and third as much, instead the second and third focus more on CGI battles and more visuals. I like them but for different reasons, so if I'm supposed to rate the trilogy I'm not sure what to say. | negative |
| 4 | I'm not sure if The Matrix and the two sequels were meant to have a tigh consistency but I don't think they quite fit together. They seem to have a reasonably solid arc but the features from the first aren't in the second and third as much, instead the second and third focus more on CGI battles and more visuals. I like them but for different reasons, so if I'm supposed to rate the trilogy I'm not sure what to say. | negative |
| 5 | This captivating film delivers a compelling story and outstanding performances, placing it among the best of the year. | positive |
| 6 | While the special effects may impress, the lack of a coherent storyline makes this film disappointing for those seeking deeper substance. | negative |
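Because the model generates free text, nothing strictly guarantees that every output is one of the expected label strings. A minimal sketch for mapping decoded strings back to label ids, with a fallback for unexpected generations, could look like this. The helper `to_label_ids` and the sample strings are hypothetical.

```python
label_lookup = {0: "negative", 1: "positive", -1: "unknown"}
# Invert the lookup: label string -> label id
text_to_label = {text: label_id for label_id, text in label_lookup.items()}

def to_label_ids(decoded, mapping, fallback=-1):
    """Map generated strings to label ids; anything unrecognized gets the fallback id."""
    return [mapping.get(text.strip().lower(), fallback) for text in decoded]

# Made-up decoder outputs, including stray whitespace and an off-vocabulary answer
decoded = ["negative", "positive", " Positive ", "great"]
label_ids = to_label_ids(decoded, text_to_label)

print(label_ids)
```

With integer ids in hand, the predictions can be compared against the labeled test split using any standard classification metric.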
