# Hugging Face Transformers Tutorial

## Lab session for the course "Introduction to Natural Language Processing" (F21NL)

This notebook will give an introduction to the Hugging Face Transformers Python library and some common patterns that you can use to take advantage of it. It is most useful for using or fine-tuning pretrained transformer models for your projects.


[Hugging Face](https://huggingface.co/) provides access to:
1. [models](https://huggingface.co/docs/transformers/index) containing both the code that implements them and their weights.
2. [model-specific tokenizers](https://huggingface.co/docs/tokenizers/index).
3. [pipelines](https://huggingface.co/docs/transformers/en/main_classes/pipelines) for common NLP tasks
4. [datasets](https://huggingface.co/docs/datasets/en/index) as a separate `datasets` package to load existing datasets and build your own dataset.
5. [metrics](https://huggingface.co/docs/evaluate/index) as a separate `evaluate` package to evaluate your model using common NLP metrics.
6. [training](https://huggingface.co/docs/accelerate/en/index) functionalities supporting GPU hardware as a separate `accelerate` package.

all implemented mostly using PyTorch!

We're going to go through a few use cases:
* Overview of Tokenizers and Models
* Finetuning - for your own task. We'll use a sentiment-classification example.


Some of the benefits of using the Hugging Face ecosystem:

1. Applying an existing pre-trained model to a new application or task and exploring how to approach/solve it.
2. Implementing a new or complex neural architecture and demonstrating its performance on some data.
3. Analyzing the behavior of a model: how it represents linguistic knowledge, what kinds of phenomena it can handle, or which errors it makes.

Of these, `transformers` is most helpful for (1) and (3). As we saw in previous labs, (2) involves a bit of a learning curve, but once you master it you will find it very convenient to design a model based on the existing ones provided by Hugging Face. We won't cover it here; please refer to [this example](https://huggingface.co/docs/transformers/en/custom_models).


Additional Links:

* [Hugging Face Docs](https://huggingface.co/docs/transformers/index)
  * Clear documentation.
  * Tutorials, walk-throughs, and example notebooks.
  * List of available models.
* [Hugging Face Course](https://huggingface.co/course/)
    * Deep-dive into Large Language Models
* [Hugging Face Examples](https://github.com/huggingface/transformers/tree/main/examples/pytorch)
    * You can find very similar code structures across very different downstream tasks/models using Hugging Face.



In [None]:
!pip install transformers datasets accelerate bitsandbytes -U

In [None]:
from collections import defaultdict, Counter
from typing import Any
import json

from matplotlib import pyplot as plt
import numpy as np
import torch

def print_encoding(model_inputs, indent=4):
    indent_str = " " * indent
    print("{")
    for k, v in model_inputs.items():
        print(indent_str + k + ":")
        print(indent_str + indent_str + str(v))
    print("}")

## Part 1: Common Pattern for using Hugging Face Transformers

We're going to start off with a common usage pattern for Hugging Face Transformers, using the example of Sentiment Analysis.

Given a sentence, the goal is to predict the sentiment of that sentence (either positive or negative):

* This movie is awesome, I can watch it all the time, without getting bored -> Positive
* This movie is horrible, I cannot believe I wasted my time watching it -> Negative

First, find a model on [the hub](https://huggingface.co/models). For the purpose of this tutorial, we are going to use a model from [this paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3489963).

Then, there are two objects that need to be initialized - a **tokenizer**, and a **model**

* Tokenizer converts strings to lists of vocabulary ids that the model requires
* Model takes the vocabulary ids and produces a prediction


<div align="center">

![full_nlp_pipeline.png](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)


From [https://huggingface.co/course/chapter2/2?fw=pt](https://huggingface.co/course/chapter2/2?fw=pt)

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "siebert/sentiment-roberta-large-english"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Initialize the model
model = AutoModelForSequenceClassification.from_pretrained(model_name)
print(model)

# The models from Hugging Face are torch.nn.Modules!
print(isinstance(model, torch.nn.Module))

As we can see from the above cell, this model `RobertaForSequenceClassification` is composed of three submodules: `RobertaModel`, `RobertaEncoder` and a `RobertaClassificationHead`:

The model accepts as input sequences containing input ids from a vocabulary of 50265 tokens (see the `word_embeddings`) and classifies the sequences into 2 classes (see the final layer of the classifier).

In particular:

1. The `RobertaModel` embeds these sequences with `word_embeddings` and adds positional information to every token in the sequence.
2. The `RobertaEncoder` contains several layers that transform the embedded sequences.
3. The `RobertaClassificationHead` is an MLP that accepts the transformed output from the final layer of the `RobertaEncoder` and produces a 2-dimensional logit vector for each sequence.


Do not worry if you do not fully understand the flow of the input or some individual components of the model for now. We will talk about them in the next lab where we analyze the anatomy of a transformer.

In [None]:
# Let's tokenize an example sentence and perform a forward pass on the model

# @markdown Type a sentence here and get the sentiment of that sentence from the model!
inputs = "This movie sucks" # @param {"type":"string"}
tokenized_inputs = tokenizer(inputs, return_tensors="pt")

# Perform a forward pass on the model
outputs = model(**tokenized_inputs)

labels = ['NEGATIVE', 'POSITIVE']
prediction = torch.argmax(outputs.logits)

print("Input:")
print(inputs)
print()
print("Tokenized Inputs:")
print_encoding(tokenized_inputs)
print()
print("Model Outputs:")
print(outputs)
print()
print(f"The prediction is {labels[prediction]}")

## Tokenizers

Pretrained models are implemented along with **tokenizers** that are used to preprocess their inputs. The tokenizers take raw strings or lists of strings and output what are effectively dictionaries that contain the model inputs.


You can access tokenizers either with the Tokenizer class specific to the model you want to use (here DistilBERT), or with the AutoTokenizer class.
Fast Tokenizers are written in Rust, while their slow versions are written in Python.

In [None]:
from transformers import DistilBertTokenizer, DistilBertTokenizerFast, AutoTokenizer
name = "distilbert/distilbert-base-cased"
# name = `user/name` when loading from the Hugging Face hub
# name = `local_path` when loading from local

# Tokenizer written in Python (slow)
tokenizer = DistilBertTokenizer.from_pretrained(name)
print(tokenizer)
# Tokenizer written in Rust (fast)
tokenizer = DistilBertTokenizerFast.from_pretrained(name)
print(tokenizer)
# Convenient! Defaults to Fast
tokenizer = AutoTokenizer.from_pretrained(name)
print(tokenizer)

We can see that this particular tokenizer has several attributes:

1. It has a `vocab_size` of 28996 tokens.
2. The `model_max_length` is 1000000000000000019884624838656. This is the maximum length of a sequence before the tokenizer truncates it: any sequence with more than `model_max_length` tokens is truncated to that length.
3. `padding_side` is set to right. We have not talked about padding yet, but when preparing a batch of sentences, some of them will have different lengths. Padding ensures that all sequences within the batch have the same length and can therefore be processed by the model at the same time.
4. `truncation_side` is set to right, so long sequences are cut at the end.
5. The tokenizer has several special tokens (`[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`), just like the `<START>` and `<END>` tokens that we used in our bi-gram language model.
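
To make padding and truncation concrete, here is a minimal pure-Python sketch of what right-side truncation and right-side padding do to a batch of token ids. The function `pad_and_truncate`, the ids, and `pad_id=0` are made up for illustration; the real tokenizer also takes care of special tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Right-truncate and right-pad lists of token ids to a common length."""
    # Truncate on the right: keep only the first `max_length` ids.
    truncated = [ids[:max_length] for ids in batch]
    # Pad every sequence to the longest one that remains.
    target_len = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target_len - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (target_len - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


example = pad_and_truncate(
    [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 9, 102]],
    max_length=5,
)
print(example["input_ids"])
print(example["attention_mask"])
```

The third sequence loses its last two ids, and the two shorter ones are filled up with `pad_id` on the right.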

In [None]:
# This is how you call the tokenizer
input_str = "Hugging Face Transformers is great!"
tokenized_inputs = tokenizer(input_str)


print("Vanilla Tokenization")
print_encoding(tokenized_inputs)
print()

# Accessing outputs of the tokenizer:
print(tokenized_inputs.input_ids)
print(tokenized_inputs["input_ids"])

### How does tokenization work under the hood?

Tokenization happens in a few steps:

In [None]:
# Step 1: Converting input sentences into token ids
input_tokens = tokenizer.tokenize(input_str)
input_ids = tokenizer.convert_tokens_to_ids(input_tokens)
# Step 2: Appending start and end of sequence tokens
input_ids_special_tokens = [tokenizer.cls_token_id] + input_ids + [tokenizer.sep_token_id]

# Converting back token ids into sentences
decoded_str = tokenizer.decode(input_ids_special_tokens)

print("start:                ", input_str)
print("tokenize:             ", input_tokens)
print("convert_tokens_to_ids:", input_ids)
print("add special tokens:   ", input_ids_special_tokens)
print("--------")
print("decode:               ", decoded_str)
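
DistilBERT's tokenizer is a WordPiece tokenizer, so the `tokenize` step above splits unknown words into smaller pieces from the vocabulary. The sketch below illustrates the greedy longest-match-first idea on a single word. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and much simpler than the real implementation, which also normalizes text and splits on whitespace and punctuation first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for this example, not the real DistilBERT vocabulary)
toy_vocab = {"Hu": 0, "##gging": 1, "Face": 2, "great": 3, "!": 4, "[UNK]": 5}
tokens = wordpiece_tokenize("Hugging", toy_vocab)
ids = [toy_vocab[t] for t in tokens]
print(tokens, ids)
```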

The tokenizer can also return PyTorch tensors:

In [None]:
model_inputs = tokenizer("Hugging Face Transformers is great!", return_tensors="pt")
print("PyTorch Tensors:")
print_encoding(model_inputs)

### Passing Multiple inputs at the same time

In [None]:
# You can pass multiple strings into the tokenizer and pad them as you need
model_inputs = tokenizer(
    [
        "Hugging Face Transformers is great!",
        "The quick brown fox jumps over the lazy dog. Then the dog got up and ran away because she didn't like foxes.",
    ],
    return_tensors="pt",
    padding=True,
    truncation=True
)
print(f"Pad token: {tokenizer.pad_token} | Pad token id: {tokenizer.pad_token_id}")
print("Padding:")

# Notice how the tokenizer automatically converts the inputs of variable length into sequences by padding!
print(model_inputs.input_ids.shape)
print_encoding(model_inputs)

In [None]:
# You can also decode a whole batch at once:
print("Batch Decode:")
print(tokenizer.batch_decode(model_inputs.input_ids))
print()
print("Batch Decode: (no special characters)")
print(tokenizer.batch_decode(model_inputs.input_ids, skip_special_tokens=True))

For more information about tokenizers, you can look at:
[Hugging Face Transformers Docs](https://huggingface.co/docs/transformers/main_classes/tokenizer) and the [Hugging Face Tokenizers Library](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) (For the Fast Tokenizers). The Tokenizers Library even lets you train your own tokenizers!

### The role of the attention mask

Perhaps you have noticed that along with the `input_ids` the tokenizer also returns an `attention_mask`. What is this `attention_mask`?

The vast majority of the models that are being developed nowadays are based on the Transformer architecture. At its core, the Transformer relies on the `self attention` operation (you will learn about it in more detail in this week's lecture and the follow-up lab), where the individual tokens within the sequence look at other tokens in the sequence to find useful information.

However, because we apply padding, we do not want the real tokens in the sequence to look at the padding tokens: the padding token is only a **hack that enables batch processing, where the whole batch is multiplied with the weights of a linear layer in one operation**.

Suppose we have the two following examples in a batch that are tokenized with whitespace:

1. The plot was bad, I was bored in the end. (10 tokens)

    [`The` `plot` `was` `bad` `I` `was` `bored` `in` `the` `end`]

2. This movie was nice, it made me feel happy. (9 tokens + 1 padding token for batch processing)

    [`This` `movie` `was` `nice` `it` `made` `me` `feel` `happy` `<pad>`]


For the second sentence, we would most likely expect the word `happy` to look at other words like `nice` or `feel` rather than at the `padding` token.


We will come back to the role of attention and its implementation later on. For now, this section only explains the `attention_mask` output of the tokenizer.
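
As a preview, here is a minimal sketch of one common way an attention mask can be applied: the scores of padded positions are set to minus infinity before the softmax, so they receive zero weight. The scores are made-up numbers, and actual model implementations differ in the details (for example, some add a large negative number instead).

```python
import torch

# Hypothetical raw attention scores for one query token over 4 key positions;
# the last position is padding.
scores = torch.tensor([[2.0, 1.0, 0.5, 3.0]])
attention_mask = torch.tensor([[1, 1, 1, 0]])

# Set the scores of padded positions to -inf so softmax gives them weight 0.
masked_scores = scores.masked_fill(attention_mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights)
```

Without the mask, the padding position would have received the largest weight here, since its raw score is the highest.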

## Models


Initializing models is very similar to initializing tokenizers. You can either use the model class specific to your model or you can use an [AutoModel](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) class.


While most pretrained transformers have a similar architecture, there are additional weights, called "heads", that you have to train if you're doing sequence classification, question answering, or some other task. Hugging Face automatically sets up the architecture you need when you specify the model class.

For example, we are doing sentiment analysis, so we are going to use `DistilBertForSequenceClassification`. If we were going to continue training DistilBERT on its masked-language modeling training objective, we would use `DistilBertForMaskedLM`, and if we just wanted the model's representations, maybe for our own downstream task, we could just use `DistilBertModel`.
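
To give an idea of what a "head" is, here is a minimal sketch of a classification head placed on top of the hidden states of a base model. The class `TinyClassificationHead` is hypothetical and simpler than the heads Hugging Face actually uses; the random tensor stands in for the output of a base model with hidden size 768, as in DistilBERT.

```python
import torch
from torch import nn


class TinyClassificationHead(nn.Module):
    """Hypothetical head: maps the first token's hidden state to class logits."""

    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: (batch, sequence_length, hidden_size)
        first_token = hidden_states[:, 0]    # (batch, hidden_size)
        return self.classifier(first_token)  # (batch, num_labels)


# Random tensor standing in for the output of a base model such as DistilBertModel.
hidden_states = torch.randn(3, 9, 768)
head = TinyClassificationHead(hidden_size=768, num_labels=2)
logits = head(hidden_states)
print(logits.shape)
```

The weights of such a head are not part of the pretrained checkpoint, which is why they have to be trained for the task.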


Here's a stylized picture of a model recreated from one found here: [https://huggingface.co/course/chapter2/2?fw=pt](https://huggingface.co/course/chapter2/2?fw=pt).
![model_illustration.png](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/transformer_and_head.svg)



There are three types of models:
* Encoder-only (e.g. BERT)
* Decoder-only (e.g. GPT2)
* Encoder-Decoder models (e.g. BART or T5)

The task-specific classes you have available depend on what type of model you're dealing with.


A full list of choices is available in the [docs](https://huggingface.co/docs/transformers/model_doc/auto).


In [None]:
from transformers import AutoModelForSequenceClassification, DistilBertForSequenceClassification, DistilBertModel
print("Loading base model")
base_model = DistilBertModel.from_pretrained("distilbert-base-cased")

print("Loading classification model from base model's checkpoint")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=2)


You can also initialize a model with random weights:

In [None]:
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Initializing a DistilBERT configuration
configuration = DistilBertConfig()
configuration.num_labels=2
# Initializing a model (with random weights) from the configuration
model = DistilBertForSequenceClassification(configuration)

# Accessing the model configuration
configuration = model.config

When loading the classification model from the base checkpoint with `from_pretrained`, we get a warning because the sequence classification parameters haven't been trained yet.

Passing inputs to the model is super easy: models take their inputs as keyword arguments.

In [None]:
model_inputs = tokenizer(input_str, return_tensors="pt")

# Option 1: using the keys directly as arguments
model_outputs = model(input_ids=model_inputs.input_ids, attention_mask=model_inputs.attention_mask)

# Option 2: the keys of the dictionary the tokenizer returns are the same as the keyword arguments the model expects
# f({k1: v1, k2: v2}) = f(k1=v1, k2=v2)
# And so we can use the ** operator, which passes the values directly as keyword arguments to the model
model_outputs = model(**model_inputs)

print(model_inputs)
print()
print(model_outputs)
print()
print(f"Distribution over labels: {torch.softmax(model_outputs.logits, dim=1)}")

You may notice that it's a bit odd to have two classes for a binary classification task: you could just as well have a single output and choose a threshold. It's set up this way because of how Hugging Face models calculate the loss. This increases the number of parameters we have, but shouldn't otherwise affect performance.
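
The equivalence between the two setups can be checked with a small sketch: the softmax probability of the positive class over two logits equals a sigmoid applied to the difference of the logits. The logits are made-up numbers and the helper functions are written out by hand for illustration.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability.
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


logits = [-0.3, 1.2]  # made-up logits for [NEGATIVE, POSITIVE]
p_positive_two_class = softmax(logits)[1]
p_positive_one_logit = sigmoid(logits[1] - logits[0])
print(p_positive_two_class, p_positive_one_logit)
```

Predicting POSITIVE when the second logit is larger is therefore the same as thresholding the single-output probability at 0.5.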

These models are just PyTorch Modules!

* You can calculate the loss with your `loss_func` and call `loss.backward`.
* You can use any of the optimizers or learning rate schedulers that you have used before.

In [None]:
# You can calculate the loss like normal
label = torch.tensor([1])
loss = torch.nn.functional.cross_entropy(model_outputs.logits, label)
print(loss)
loss.backward()

# You can get the parameters
list(model.named_parameters())[0]
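
As a reminder of what `cross_entropy` computes, the following sketch compares it with the negative log-probability of the correct class written out by hand. The logits and the label are made-up values rather than actual model outputs.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.2, 1.5]])  # made-up logits for one example, two classes
label = torch.tensor([1])

builtin_loss = F.cross_entropy(logits, label)
# Cross-entropy is the negative log-probability of the correct class.
manual_loss = -torch.log_softmax(logits, dim=-1)[0, label.item()]
print(builtin_loss.item(), manual_loss.item())
```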

Hugging Face provides an additional easy way to calculate the loss as well:

In [None]:
# To calculate the loss, we need to pass in a label:
model_inputs = tokenizer(input_str, return_tensors="pt")

labels = ['NEGATIVE', 'POSITIVE']
model_inputs['labels'] = torch.tensor([1])

model_outputs = model(**model_inputs)


print(model_outputs)
print()
print(f"Model predictions: {labels[model_outputs.logits.argmax()]}")

One final note - you can get the hidden states and attention weights from the models really easily. This is particularly helpful if you're working on an analysis project. (For example, see [What does BERT look at?](https://arxiv.org/abs/1906.04341)).

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-cased", output_attentions=True, output_hidden_states=True)
model.eval()

model_inputs = tokenizer(input_str, return_tensors="pt")
with torch.no_grad():
    model_output = model(**model_inputs)


print("Hidden state size (per layer):  ", model_output.hidden_states[0].shape)
# attentions has one entry per layer, each of shape (batch, head, query_word_idx, key_word_idx); in the plots below the y-axis is the query and the x-axis is the key
print("Attention head size (per layer):", model_output.attentions[0].shape)

# print(model_output)

In [None]:
tokens = tokenizer.convert_ids_to_tokens(model_inputs.input_ids[0])
print(tokens)


n_layers = len(model_output.attentions)
n_heads = model_output.attentions[0].shape[1]
n_tokens = len(tokens)
# One subplot per (layer, head): 6 layers x 12 heads for DistilBERT
fig, axes = plt.subplots(n_layers, n_heads)
fig.set_size_inches(18.5*2, 10.5*2)
for layer in range(n_layers):
    for i in range(n_heads):
        ax = axes[layer, i]
        ax.imshow(model_output.attentions[layer][0, i])
        ax.set_xticks(list(range(n_tokens)))
        ax.set_xticklabels(labels=tokens, rotation="vertical")
        ax.set_yticks(list(range(n_tokens)))
        ax.set_yticklabels(labels=tokens)

        if layer == n_layers - 1:
            ax.set(xlabel=f"head={i}")
        if i == 0:
            ax.set(ylabel=f"layer={layer}")

plt.subplots_adjust(wspace=0.3)
plt.show()

## Part 2: Finetuning

For your projects, you are much more likely to want to finetune a pretrained model. This is a little bit more involved, but is still quite easy.

### 2.1 Loading in a dataset

In addition to having models, [the hub](https://huggingface.co/datasets) also has datasets.

In [None]:
from datasets import load_dataset, DatasetDict

# DataLoader(zip(list1, list2))
dataset_name = "stanfordnlp/imdb"

imdb_dataset = load_dataset(dataset_name)


# Just take the first 50 whitespace-separated words for speed/running on cpu
def truncate(example: dict[str, Any]) -> dict[str, Any]:
    return {
        'text': " ".join(example['text'].split()[:50]),
        'label': example['label']
    }

imdb_dataset

In [None]:

# Take 128 random examples for train and 32 validation
small_imdb_dataset = DatasetDict(
    train=imdb_dataset['train'].shuffle(seed=1111).select(range(128)).map(truncate),
    val=imdb_dataset['train'].shuffle(seed=1111).select(range(128, 160)).map(truncate),
)
small_imdb_dataset

In [None]:
small_imdb_dataset['train'][:10]

In [None]:
# Prepare the dataset - this tokenizes the dataset in batches of 16 examples.
small_tokenized_dataset = small_imdb_dataset.map(
    lambda example: tokenizer(example['text'], padding=True, truncation=True), # https://huggingface.co/docs/transformers/pad_truncation
    batched=True,
    batch_size=16
)

small_tokenized_dataset = small_tokenized_dataset.remove_columns(["text"])
small_tokenized_dataset = small_tokenized_dataset.rename_column("label", "labels")
small_tokenized_dataset.set_format("torch")

In [None]:
small_tokenized_dataset['train'][0:2]

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_tokenized_dataset['train'], batch_size=16)
eval_dataloader = DataLoader(small_tokenized_dataset['val'], batch_size=16)

### 2.2 Training

To train your models, you can just use the same kind of training loop that you would use in PyTorch.

Hugging Face models are also `torch.nn.Module`s so backpropagation happens the same way and you can even use the same optimizers. Hugging Face also includes learning rate schedules that were used to train Transformer models, so you can use these too.

For optimization, we're using the AdamW Optimizer, and a linear learning rate scheduler, which reduces the learning rate a little bit after each training step over the course of training.
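
The shape of a linear schedule with warm-up can be sketched in a few lines of plain Python. The function `linear_schedule_factor` is a hypothetical stand-in that follows the behaviour described above (linear warm-up, then linear decay to zero); it is meant as an illustration, not as the library's implementation.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Warm-up: grow linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay: shrink linearly from 1 to 0 at the last step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
# No warm-up and 8 training steps, as an example
schedule = [base_lr * linear_schedule_factor(s, 0, 8) for s in range(9)]
print(schedule)
```

With no warm-up, the learning rate starts at its base value, is halved by the middle of training, and reaches zero at the final step.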

In [None]:
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW
from tqdm.notebook import tqdm

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=2)

num_epochs = 1
num_training_steps = num_epochs * len(train_dataloader)
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
print(optimizer)
lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

best_val_loss = float("inf")
progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_epochs):
    # training
    model.train()
    for batch_i, batch in enumerate(train_dataloader):
        # batch is a dict of tensors with keys input_ids, attention_mask and labels

        # Step 1: Forward pass (the loss is computed because the batch contains labels)
        output = model(**batch)

        # Step 2: Zero the gradients of all parameters
        optimizer.zero_grad()
        # Step 3: Backpropagate the loss
        output.loss.backward()
        # Step 4: Update the weights
        optimizer.step()
        # Step 5: Adjust the learning rate
        lr_scheduler.step()
        progress_bar.update(1)

    # validation
    model.eval()
    val_loss = 0.0  # reset the accumulated validation loss for this epoch
    for batch_i, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            output = model(**batch)
        val_loss += output.loss.item()

    avg_val_loss = val_loss / len(eval_dataloader)
    print(f"Validation loss: {avg_val_loss}")
    if avg_val_loss < best_val_loss:
        print("Saving checkpoint!")
        best_val_loss = avg_val_loss
        # torch.save({
        #     'epoch': epoch,
        #     'model_state_dict': model.state_dict(),
        #     'optimizer_state_dict': optimizer.state_dict(),
        #     'val_loss': best_val_loss,
        #     },
        #     f"checkpoints/epoch_{epoch}.pt"
        # )

In [None]:
batch['input_ids'].max()

While you can use PyTorch to train your models as we did above, Hugging Face offers a powerful [`Trainer`](https://huggingface.co/docs/transformers/en/main_classes/trainer) class to handle most needs.

In [None]:
imdb_dataset = load_dataset("stanfordnlp/imdb")

small_imdb_dataset = DatasetDict(
    train=imdb_dataset['train'].shuffle(seed=1111).select(range(128)).map(truncate),
    val=imdb_dataset['train'].shuffle(seed=1111).select(range(128, 160)).map(truncate),
)

small_tokenized_dataset = small_imdb_dataset.map(
    lambda example: tokenizer(example['text'], truncation=True),
    batched=True,
    batch_size=16
)

`TrainingArguments` specifies different training parameters like how often to evaluate and save model checkpoints, where to save them, etc. There are **many** aspects you can customize and it's worth checking them out [here](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

Some things you can control include:
* learning rate, weight decay, gradient clipping,
* checkpointing, logging, and evaluation frequency

The `Trainer` actually performs the training. You can pass it the `TrainingArguments`, model, the datasets, tokenizer, optimizer, and even model checkpoints to resume training from. The `compute_metrics` function is called at the end of evaluation/validation to calculate evaluation metrics.

In [None]:
import os
from transformers import TrainingArguments, Trainer

# Don't use wandb to log the experiment
os.environ["WANDB_DISABLED"] = "true"

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=2)

arguments = TrainingArguments(
    output_dir="sample_hf_trainer",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    eval_strategy="epoch", # run validation at the end of each epoch
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
    seed=224,
    log_level="debug",
)


def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculates the accuracy
    return {"accuracy": np.mean(predictions == labels)}


trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=small_tokenized_dataset['train'],
    eval_dataset=small_tokenized_dataset['val'], # change to test when you do your final evaluation!
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

#### Callbacks: Logging and Early Stopping


Hugging Face Transformers also allows you to write `Callbacks` if you want certain things to happen at different points during training (e.g. after evaluation or after an epoch has finished). For example, there is a callback for early stopping, and I usually write one for logging as well.

For more information on callbacks see [here](https://huggingface.co/docs/transformers/main_classes/callback#transformers.TrainerCallback).
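
The idea behind early stopping can be sketched in plain Python. The function `should_stop` is hypothetical and assumes that the monitored metric is a validation loss (lower is better); its `patience` and `threshold` arguments play the role of `early_stopping_patience` and `early_stopping_threshold`.

```python
def should_stop(eval_losses, patience=1, threshold=0.0):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best - loss > threshold:
            best = loss       # improvement: remember it and reset the counter
            bad_evals = 0
        else:
            bad_evals += 1    # no sufficient improvement at this evaluation
        if bad_evals >= patience:
            return i
    return None


print(should_stop([0.70, 0.65, 0.66, 0.60]))  # stops at the third evaluation (index 2)
print(should_stop([0.70, 0.65, 0.60]))        # keeps improving, never stops
```

Note that with a patience of 1, training stops at the first evaluation without improvement, even if a later epoch would have been better.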

In [None]:
from transformers import TrainerCallback, EarlyStoppingCallback

class LoggingCallback(TrainerCallback):
    def __init__(self, log_path):
        self.log_path = log_path

    # on_log is called at each logging step, as set by TrainingArguments.logging_steps
    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("total_flos", None)
        if state.is_local_process_zero:
            with open(self.log_path, "a") as f:
                f.write(json.dumps(logs) + "\n")
    # Other hooks, e.g. on_epoch_end(...), can be overridden in the same way


trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=1, early_stopping_threshold=0.0))
trainer.add_callback(LoggingCallback("sample_hf_trainer/log.jsonl"))

In [None]:
# train the model
trainer.train()

In [None]:
# evaluating the model is very easy
# just gets evaluation metrics
results = trainer.evaluate()
results = trainer.predict(small_tokenized_dataset['val']) # also gives you predictions

In [None]:
results

In [None]:
# To load our saved model, we can pass the path to the checkpoint into the `from_pretrained` method:
test_str = "I enjoyed the movie!"

finetuned_model = AutoModelForSequenceClassification.from_pretrained("sample_hf_trainer/checkpoint-8")
model_inputs = tokenizer(test_str, return_tensors="pt")
prediction = torch.argmax(finetuned_model(**model_inputs).logits)
print(["NEGATIVE", "POSITIVE"][prediction])

Included here are also some practical tips for fine-tuning:

**Good default hyperparameters.** The hyperparameters you choose will depend on your task and dataset. You should do a hyperparameter search to find the best ones. That said, here are some good initial values for fine-tuning.
* Epochs: {2, 3, 4} (larger amounts of data need fewer epochs)
* Batch size (bigger is better: as large as you can make it)
* Optimizer: AdamW
* AdamW learning rate: {2e-5, 5e-5}
* Learning rate scheduler: linear warm up for first {3-10%} steps of training
* weight_decay (l2 regularization): {0, 0.01, 0.1}

You should monitor your validation loss to decide when you've found good hyperparameters.
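
To turn the warm-up percentage above into a number of steps, you first need the total number of optimizer steps. Here is a small sketch of that arithmetic; the function `training_steps` is hypothetical and assumes a single device without gradient accumulation. The example uses 25,000 examples, the size of the IMDB training split.

```python
import math


def training_steps(num_examples, batch_size, num_epochs, warmup_ratio):
    """Total number of optimizer steps and how many of them are warm-up steps."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = int(total_steps * warmup_ratio)
    return total_steps, warmup_steps


total_steps, warmup_steps = training_steps(
    num_examples=25000, batch_size=16, num_epochs=3, warmup_ratio=0.06
)
print(total_steps, warmup_steps)
```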

## Part 3: Generation

In the example above we finetuned the model on a classification task, but you can also finetune models on language modeling tasks, where we predict the probability distribution of the next token in a sequence.

The [`generate`](https://huggingface.co/docs/transformers/v4.56.1/en/main_classes/text_generation#transformers.GenerationMixin.generate) function makes it easy to generate from these models.

In [None]:
from transformers import AutoModelForCausalLM

gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

gpt2 = AutoModelForCausalLM.from_pretrained('distilgpt2')
gpt2.config.pad_token_id = gpt2.config.eos_token_id  # Prevents warning during decoding

In [None]:
prompt = "Once upon a time"

tokenized_prompt = gpt2_tokenizer(prompt, return_tensors="pt")

for i in range(10):
    output = gpt2.generate(
        **tokenized_prompt,
        max_length=50,
        do_sample=True,
        top_p=0.9
    )

    print(f"{i + 1}) {gpt2_tokenizer.batch_decode(output)[0]}")
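
The `top_p=0.9` argument above enables nucleus (top-p) sampling: at each step, the model samples only from the most likely tokens whose probabilities together reach `top_p`. Here is a minimal sketch of that filtering step on a made-up distribution; the function `top_p_filter` is hypothetical and works on plain Python lists rather than on logits.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalise the kept probabilities so they sum to 1.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


next_token_probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution over a 4-token vocabulary
nucleus = top_p_filter(next_token_probs, top_p=0.9)
# Sample the next token id from the renormalised nucleus.
sampled = random.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(nucleus, sampled)
```

The least likely token is dropped before sampling, which is why the ten generations above differ from each other while still avoiding very unlikely continuations.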

## Part 4: Defining Custom Datasets

There are a few ways to go about defining datasets, but I'm going to show an example using PyTorch `Dataset`s and `DataLoader`s. This example uses the [E2E Dataset](https://arxiv.org/abs/1706.09254), a dataset for encoder-decoder models that maps structured information about restaurants to natural language descriptions.

In [None]:
!wget https://raw.githubusercontent.com/tuetschek/e2e-dataset/refs/heads/master/trainset.csv

In [None]:
import pandas as pd
from datasets import Dataset

df = pd.read_csv("trainset.csv")
custom_dataset = Dataset.from_pandas(df)

In [None]:
import csv
from torch.utils.data import Dataset, DataLoader

class E2EDataset(Dataset):
    """Tokenize data when we call __getitem__"""
    def __init__(self, path, tokenizer):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader) # skip the heading
            self.data = [{"source": row[0], "target": row[1]} for row in reader]
        self.tokenizer = tokenizer

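    def __len__(self):
        # Needed so that a DataLoader knows how many examples there are
        return len(self.data)

    def __getitem__(self, i):
        inputs = self.tokenizer(self.data[i]['source'])
        labels = self.tokenizer(self.data[i]['target'])
        inputs['labels'] = labels.input_ids
        return inputs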


In [None]:
bart_tokenizer = AutoTokenizer.from_pretrained('facebook/bart-base')

In [None]:
dataset = E2EDataset("trainset.csv", bart_tokenizer)

In [None]:
dataset[0]

## Part 5: Pipelines

For some standard NLP tasks like sentiment classification or question answering, pre-trained (and fine-tuned!) models are already available through Hugging Face Transformers' [_Pipeline_](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/pipelines#transformers.pipeline) interface.

For your projects, you likely won't be using it too much, but it's still worth knowing about!

Here's an example with Sentiment Analysis:

In [None]:
from transformers import pipeline

sentiment_analysis = pipeline("sentiment-analysis", model="siebert/sentiment-roberta-large-english")

You can run the pipeline by just calling it on a string:

In [None]:
sentiment_analysis("Hugging Face Transformers is really cool!")

Or on a list of strings:

In [None]:
sentiment_analysis(
    [
        "I didn't know if I would like Hákarl, but it turned out pretty good.",
        "I didn't know if I would like Hákarl, and it was just as bad as I'd heard."
    ]
)

You can find more information on pipelines (including which ones are available) [here](https://huggingface.co/docs/transformers/main_classes/pipelines).