# Fine-tune a pretrained model

I'm currently working on a project that analyzes IMDB reviews using a dataset from Hugging Face. To do this, I'm fine-tuning a pretrained BERT model. I tried this before with a dataset from Kaggle, but I couldn't get it to work because the data was in a Pandas DataFrame. So now I'm giving it another shot with a different dataset.

## Prepare a dataset

Before you can fine-tune a pretrained model, download a dataset and prepare it for training. 

In [1]:
from datasets import load_dataset

dataset = load_dataset("imdb")
dataset["train"][100]

Found cached dataset imdb (C:/Users/Warmtebron/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
100%|██████████| 3/3 [00:00<00:00, 39.66it/s]


{'text': "Terrible movie. Nuff Said.<br /><br />These Lines are Just Filler. The movie was bad. Why I have to expand on that I don't know. This is already a waste of my time. I just wanted to warn others. Avoid this movie. The acting sucks and the writing is just moronic. Bad in every way. The only nice thing about the movie are Deniz Akkaya's breasts. Even that was ruined though by a terrible and unneeded rape scene. The movie is a poorly contrived and totally unbelievable piece of garbage.<br /><br />OK now I am just going to rag on IMDb for this stupid rule of 10 lines of text minimum. First I waste my time watching this offal. Then feeling compelled to warn others I create an account with IMDb only to discover that I have to write a friggen essay on the film just to express how bad I think it is. Totally unnecessary.",
 'label': 0}

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## Tokenize the data

Now I'm going to use a tokenizer to process the text. Because the reviews vary in length, the tokenizer also needs a strategy for handling different sequence lengths: padding short reviews and truncating long ones.

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(data):
    return tokenizer(data["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at C:\Users\Warmtebron\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-2ce6438a8b000ae7.arrow
Loading cached processed dataset at C:\Users\Warmtebron\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-3929a300281ee32b.arrow
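
To make padding and truncation concrete, here is a minimal pure-Python sketch of the idea. It is not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, the pad id of 0 is an assumption, and the special handling a real tokenizer applies to tokens such as `[SEP]` during truncation is ignored.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Hypothetical helper: force a list of token ids to exactly max_length."""
    kept = token_ids[:max_length]      # truncation: drop everything past the limit
    n_pad = max_length - len(kept)     # padding: number of filler ids still needed
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Made-up token ids, only for illustration
short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```

Every example ends up with the same length, so the examples can be stacked into one batch tensor.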
                                                                   

Next, I'll take a smaller subset of the full dataset for fine-tuning. This reduces the time the fine-tuning process takes.

In [4]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Loading cached shuffled indices for dataset at C:\Users\Warmtebron\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-33939c877e15977f.arrow
Loading cached shuffled indices for dataset at C:\Users\Warmtebron\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-739bb2015668809b.arrow
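
The shuffle-then-select pattern above can be sketched with the standard library. This only illustrates the idea: `shuffled_subset` is a hypothetical helper, and the indices it picks will not match the ones `datasets` picks for the same seed, because the two use different random number generators.

```python
import random
from typing import List


def shuffled_subset(n_rows: int, k: int, seed: int) -> List[int]:
    """Hypothetical helper: pick k distinct row indices via a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same order every run
    return indices[:k]                    # keep the first k shuffled indices


first_run = shuffled_subset(25000, 1000, seed=42)
second_run = shuffled_subset(25000, 1000, seed=42)
print(first_run == second_run)
```

Shuffling before selecting matters because taking the first 1000 rows of an unshuffled dataset could give a subset dominated by one label, and fixing the seed makes the subset reproducible.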


## Train the model

Next, I'm going to train the model with the PyTorch Trainer.

I'll start by loading the model and specifying the number of expected labels. In this case there are 2 (negative and positive).

In [5]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification

## Training hyperparameters

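The next step is to create a TrainingArguments object, which holds all the hyperparameters that can be adjusted and the flags for enabling different training options. Here I only set output_dir, the folder where checkpoints are saved, and evaluation_strategy="epoch", so that the model is evaluated at the end of every epoch. Everything else keeps its default value.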

In [6]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

## Evaluate

During the training process, the Trainer does not automatically assess the performance of the model. Therefore, it's necessary to provide the Trainer with a function to compute and report the metrics.

In [7]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

To calculate the accuracy of the predictions, call the compute function on the metric. The model returns logits, not predictions, so before passing anything to compute, the logits must first be converted to predicted labels by taking the index of the highest logit for each example.

In [8]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
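
To check what this function does, the same logic can be run by hand on a tiny batch. This is a pure-Python sketch with made-up logits and labels, not the code of the `evaluate` library. The `argmax` helper is hypothetical and mirrors what `np.argmax(logits, axis=-1)` does for each row.

```python
from typing import List


def argmax(row: List[float]) -> int:
    """Hypothetical helper: index of the largest value in one row of logits."""
    return max(range(len(row)), key=row.__getitem__)


# Made-up logits for 4 reviews, one column per label (0 = negative, 1 = positive)
logits = [[2.1, -0.3], [-1.0, 0.4], [0.2, 0.9], [1.5, 1.1]]
labels = [0, 1, 0, 0]

predictions = [argmax(row) for row in logits]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)
```

The third review is predicted as positive while its label is negative, so 3 out of 4 predictions are correct.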

## Trainer

Now I'm going to create a Trainer object with my model, training arguments, training and test datasets, and evaluation function.

In [9]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [10]:
trainer.train()

  7%|▋         | 25/375 [10:23<2:26:08, 25.05s/it]
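
The 375 steps in the progress bar can be reproduced with a bit of arithmetic. This sketch assumes the TrainingArguments defaults that I did not override (a batch size of 8 per device and 3 epochs) and a single device. The 25.05 seconds per step is read from the progress bar above.

```python
import math

train_rows = 1000        # size of small_train_dataset
batch_size = 8           # assumption: default per_device_train_batch_size, one device
epochs = 3               # assumption: default num_train_epochs

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

steps_done = 25          # from the progress bar
seconds_per_step = 25.05 # from the progress bar
remaining_seconds = (total_steps - steps_done) * seconds_per_step

hours, rest = divmod(int(remaining_seconds), 3600)
minutes = rest // 60
print(total_steps, hours, minutes)
```

At this speed the remaining 350 steps take roughly 2 hours and 26 minutes, which matches the estimate in the progress bar and is the reason for training on a 1000-row subset and not the full 25000 rows.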