<a href="https://colab.research.google.com/github/fpgmina/DeepNLP/blob/main/HF_Transformers_Overview_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hugging Face 🤗 Transformers Overview**
---

**Teaching Assistant:** Giuseppe Gallipoli

**Credits:** Moreno La Quatra

In [None]:
%%capture
! pip install transformers[torch]
! pip install accelerate -U

# Pipelines

The easiest way to use the 🤗 Transformers library is through [pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines). A pipeline wraps all the steps required to analyze input text:
- Pre-processing
- Model inference
- Post-processing
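
To make the three stages concrete, here is a minimal pure-Python sketch of a sentiment "pipeline". Everything in it (the toy vocabulary, the per-token scores, the function names) is a made-up assumption for illustration; a real pipeline uses a trained tokenizer and a neural network instead.

```python
import math

# Hypothetical toy vocabulary and per-token (negative, positive) scores
VOCAB = {"[UNK]": 0, "i": 1, "like": 2, "don't": 3, "nlp": 4}
TOKEN_SCORES = {0: (0.0, 0.0), 1: (0.0, 0.0), 2: (-1.0, 2.0), 3: (3.0, -2.0), 4: (0.0, 0.0)}
LABELS = ["NEGATIVE", "POSITIVE"]

def preprocess(text):
    """Pre-processing: split text into tokens and map them to integer IDs."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]

def infer(input_ids):
    """Model inference: produce one raw score (logit) per class."""
    neg = sum(TOKEN_SCORES[i][0] for i in input_ids)
    pos = sum(TOKEN_SCORES[i][1] for i in input_ids)
    return [neg, pos]

def postprocess(logits):
    """Post-processing: softmax over the logits, then pick the best label."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}

def toy_pipeline(text):
    return postprocess(infer(preprocess(text)))

print(toy_pipeline("I like NLP"))
print(toy_pipeline("I don't like NLP"))
```

The output has the same shape as a real sentiment pipeline result (a label plus a score), which is why the stages can be swapped for real components without changing the caller.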

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg" width="75%"/>

In [None]:
from transformers import pipeline

## Sentiment analysis pipeline (Encoder-only models)

In [None]:
sentiment_analyzer = pipeline("sentiment-analysis", device='cuda')
res = sentiment_analyzer(["I like Deep NLP course", "I don't like Deep NLP course!"])
print(f'\n{res}')

In [None]:
# Specifying a model explicitly: browse the Model Hub at https://huggingface.co/models

sentiment_analyzer = pipeline("sentiment-analysis", model="finiteautomata/bertweet-base-sentiment-analysis", device='cuda')
res = sentiment_analyzer(["TLDR: the movie was amazing", "What a mess! The plot was awful"])
print(f'\n{res}')

## Text Generation pipeline (Decoder-only models)

In [None]:
text_generator = pipeline("text-generation", device='cuda')
res = text_generator("The meaning of life is")
print(f"\n{res}\n")
print(res[0]['generated_text'])

In [None]:
# Specifying a model explicitly: browse the Model Hub at https://huggingface.co/models

text_generator = pipeline("text-generation", model="GroNLP/gpt2-small-italian", device='cuda')
res = text_generator("Il senso della vita è")
print(f"\n{res}\n")
print(res[0]['generated_text'])

## Text summarization pipeline (Encoder-Decoder models)

In [None]:
summarizer = pipeline("summarization", device='cuda')
res = summarizer("Scientific articles can be annotated with short sentences, called highlights, providing readers with an at-a-glance overview of the main findings. Highlights are usually manually specified by the authors. This paper presents a supervised approach, based on regression techniques, with the twofold aim at automatically extracting highlights of past articles with missing annotations and simplifying the process of manually annotating new articles. To this end, regression models are trained on a variety of features extracted from previously annotated articles. The proposed approach extends existing extractive approaches by predicting a similarity score, based on n-gram co-occurrences, between article sentences and highlights. The experimental results, achieved on a benchmark collection of articles ranging over heterogeneous topics, show that the proposed regression models perform better than existing methods, both supervised and not.")
print(f"\n{res}\n")
print(res[0]['summary_text'])

In [None]:
# Specifying a model explicitly: browse the Model Hub at https://huggingface.co/models

summarizer = pipeline("summarization", model="shamikbose89/mt5-small-finetuned-arxiv-cs-finetuned-arxiv-cs-full", device='cuda')
res = summarizer("Scientific articles can be annotated with short sentences, called highlights, providing readers with an at-a-glance overview of the main findings. Highlights are usually manually specified by the authors. This paper presents a supervised approach, based on regression techniques, with the twofold aim at automatically extracting highlights of past articles with missing annotations and simplifying the process of manually annotating new articles. To this end, regression models are trained on a variety of features extracted from previously annotated articles. The proposed approach extends existing extractive approaches by predicting a similarity score, based on n-gram co-occurrences, between article sentences and highlights. The experimental results, achieved on a benchmark collection of articles ranging over heterogeneous topics, show that the proposed regression models perform better than existing methods, both supervised and not.")
print(f"\n{res}\n")
print(res[0]['summary_text'])

# Tokenizers

Text is split into tokens before being passed to the model. Each token is mapped to an integer ID that represents it. A tokenizer is essentially a large vocabulary that maps tokens to integer identifiers.

[Tokenizers](https://huggingface.co/docs/transformers/master/en/main_classes/tokenizer) contain all the pre-processing tools used to split text into tokens. Once a tokenizer is trained, its vocabulary is fixed (although it can still be updated with additional training).
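
The token-to-ID mapping can be sketched in a few lines of plain Python. The mini-vocabulary below is hypothetical, and the splitting rule is a simplified greedy longest-match in the WordPiece style (where `##` marks a piece that continues a word); unlike `bert-base-cased`, this toy version lowercases the text and splits only on whitespace.

```python
# Hypothetical mini-vocabulary; "##" marks a piece that continues a word
vocab = ["[UNK]", "learn", "##ing", "deep", "nl", "##p"]
token_to_id = {tok: idx for idx, tok in enumerate(vocab)}
id_to_token = {idx: tok for tok, idx in token_to_id.items()}

def tokenize_word(word):
    """Greedy longest-match-first split of one word into known sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in token_to_id:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text):
    """Return the sub-word tokens and their integer IDs."""
    tokens = [p for w in text.lower().split() for p in tokenize_word(w)]
    return tokens, [token_to_id[t] for t in tokens]

tokens, ids = encode("learning deep NLP")
print(tokens)
print(ids)
print([id_to_token[i] for i in ids])  # the mapping works in both directions
```

Because the vocabulary is fixed, a word with no matching pieces falls back to `[UNK]` rather than receiving a new ID.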

**NB**: Auto classes such as `AutoTokenizer` (and `AutoModel`) build the right tokenizer (or model) object from a checkpoint name, so you do not have to instantiate the model-specific class yourself.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokens_w = tokenizer.tokenize("I'm learning Deep NLP")
print(f'{tokens_w}\n')

tokens = tokenizer ("I'm learning Deep NLP")

print(tokens.keys())
ids, types, attn = tokens.values()

print()
print(tokens)
print()

for i, t, a in zip(ids, types, attn):
  print(f'token id {i:<5} -> {tokenizer.convert_ids_to_tokens(i):<10} | type = {t} | attention = {a}')

Each model configuration has a maximum number of tokens it can process, and input sentences usually have different lengths. The following tokenizer parameters handle this:

- `max_length` sets the maximum number of tokens
- `truncation` enables truncation of sentences longer than `max_length`
- `padding` enables padding of sentences shorter than `max_length`

The tokenizer also returns an `attention_mask`, which lets the model compute attention weights only for real tokens (and not for padding).
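
The interaction between these parameters can be sketched without the library. The helper below is a hypothetical illustration, not the tokenizer's actual implementation: it ignores special tokens and uses made-up IDs, with `0` assumed as the padding ID.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0, truncation=True):
    """Return fixed-length IDs plus the matching attention mask."""
    if truncation:
        input_ids = input_ids[:max_length]
    attention_mask = [1] * len(input_ids)          # 1 = real token
    n_pad = max(0, max_length - len(input_ids))    # 0 if already long enough
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding
    }

short_ids = [101, 7, 8, 9, 102]   # hypothetical IDs of a short sentence
long_ids = list(range(1, 11))     # hypothetical IDs of a 10-token sentence

print(pad_or_truncate(short_ids, max_length=8))
print(pad_or_truncate(long_ids, max_length=8))
print(pad_or_truncate(long_ids, max_length=8, truncation=False))
```

The last call shows why `truncation` matters: padding alone never shortens a sequence, so an input longer than `max_length` keeps its full length.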

In [None]:
tokens = tokenizer ("I'm learning Deep NLP", padding='max_length', max_length=16)
print (tokens)

In [None]:
tokens = tokenizer ("I'm learning NLP at Politecnico di Torino. I'm a MSc student", padding='max_length', max_length=16)
ids, types, attn = tokens.values()
for i, t, a in zip(ids, types, attn):
  print(f'token id {i:<5} -> {tokenizer.convert_ids_to_tokens(i):<10} | type = {t} | attention = {a}')

In [None]:
tokens = tokenizer ("I'm learning NLP at Politecnico di Torino. I'm a MSc student", padding='max_length', max_length=16, truncation=True)
ids, types, attn = tokens.values()
for i, t, a in zip(ids, types, attn):
  print(f'token id {i:<5} -> {tokenizer.convert_ids_to_tokens(i):<10} | type = {t} | attention = {a}')

In [None]:
tokens = tokenizer ("I'm learning NLP at Politecnico di Torino. I'm a MSc student", padding='max_length', max_length=32, truncation=True)
ids, types, attn = tokens.values()
for i, t, a in zip(ids, types, attn):
  print(f'token id {i:<5} -> {tokenizer.convert_ids_to_tokens(i):<10} | type = {t} | attention = {a}')

Tokenizers not only encode text into IDs; they can also decode IDs back into text.

In [None]:
text = tokenizer.decode(tokens.input_ids, skip_special_tokens=False)
print (text)

# [CLS] special token for encoder model, used for classification/regression tasks
# [SEP] special token to separate multiple sentences
# [PAD] special token for padding

# Models

Transformer-based models each have their own class in 🤗 Transformers. As with `AutoTokenizer`, the `AutoModel` class takes care of instantiating the correct class for the model we want to use.

Because the same backbone architecture can serve different tasks (e.g., BERT can be used for both sequence classification and token-level classification), the Auto class should be instantiated with the task appended to its name (e.g., `AutoModelForSequenceClassification`).

In [None]:
from transformers import AutoModelForSequenceClassification
bert_model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

However, the pre-trained BERT model has not been fine-tuned for any specific task: its classification head is newly initialized, which is the reason behind the warning. To use this model, we first need to fine-tune it (or we can use another model that has already been fine-tuned for the task).

[Model Hub](https://huggingface.co/models)

In [None]:
from transformers import AutoModelForSequenceClassification
bert_model_sc = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

In [None]:
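import numpy as np
from transformers import AutoTokenizer

# Use the tokenizer shipped with the checkpoint so token IDs match the model's vocabulary
finbert_tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

sentences = ["Google stocks went up suddenly, I earned 30B$"]
tokenized_sentence = finbert_tokenizer(sentences, return_tensors="pt", padding="max_length", truncation=True, max_length=16)
pred = bert_model_sc(**tokenized_sentence)
# Read class names from the model config rather than hard-coding their order
logits = pred.logits[0].detach().numpy()
pred_id = int(np.argmax(logits))
print(logits, pred_id, bert_model_sc.config.id2label[pred_id])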
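
The values printed by the model are raw scores (logits), not probabilities. A softmax turns them into a probability distribution over the classes. The sketch below uses made-up logits and placeholder class names, so it runs without downloading any model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.2, 0.3, 2.5]            # hypothetical model output
example_labels = {0: "class_a", 1: "class_b", 2: "class_c"}  # placeholder names

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], example_labels[best])
```

The argmax of the probabilities is the same as the argmax of the logits, so softmax only matters when you need a confidence value.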

# Fine-tuning a pre-trained model

The pre-training + fine-tuning paradigm is key to the success of the 🤗 Transformers library. The [Model Hub](https://huggingface.co/models) hosts many pre-trained models that can be used as they are or fine-tuned on new datasets.

The [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) lets users easily fine-tune the selected model for the task at hand.

In [None]:
# Your own data
import pandas as pd

!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/transformers_overview/Corona_NLP_train.csv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/transformers_overview/Corona_NLP_test.csv

df_train = pd.read_csv("Corona_NLP_train.csv")
df_test  = pd.read_csv("Corona_NLP_test.csv")

df_train = df_train.dropna(how = 'any')
df_test  = df_test.dropna (how = 'any')

train_sentences = df_train["OriginalTweet"].tolist()
train_y = df_train["Sentiment"].tolist()

print(f"Train set: {len(train_sentences)}, {len(train_y)}")

eval_samples = int(0.05*len(train_sentences))


eval_sentences = train_sentences[:eval_samples]
eval_y = train_y[:eval_samples]

train_sentences = train_sentences[eval_samples:]
train_y = train_y[eval_samples:]

test_sentences = df_test["OriginalTweet"].tolist()
test_y = df_test["Sentiment"].tolist()

print(f"Train set: {len(train_sentences)}, {len(train_y)}")
print(f"Eval set: {len(eval_sentences)}, {len(eval_y)}")
print(f"Test set: {len(test_sentences)}, {len(test_y)}")

In [None]:
# Examples for Sequence Classification

# tokenizer and model
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels = len(set(train_y)))

# Tokenization step
tokenized_train = tokenizer(train_sentences, padding="max_length", truncation=True, max_length=64)
tokenized_test  = tokenizer(test_sentences, padding="max_length", truncation=True, max_length=64)
tokenized_eval  = tokenizer(eval_sentences, padding="max_length", truncation=True, max_length=64)

# Label encoding step
from sklearn.preprocessing import LabelEncoder

def label_encoding(labels, le):
    # map string labels to integer ids with an already fitted LabelEncoder
    y = le.transform(labels)
    return y

all_labels = []
for label in set(train_y):
    all_labels.append(label)

le = LabelEncoder()
le.fit(all_labels)

train_y = label_encoding(train_y, le)
test_y = label_encoding(test_y, le)
eval_y = label_encoding(eval_y, le)

import torch
class SCDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


train_ds = SCDataset(tokenized_train, train_y)
eval_ds = SCDataset(tokenized_eval, eval_y)
test_ds = SCDataset(tokenized_test, test_y)
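
The label-encoding step above maps each distinct string label to an integer. A minimal sketch of the same idea in plain Python follows; it sorts the distinct labels before numbering them, which is also what `LabelEncoder` does. The label strings are illustrative examples, and the function names are hypothetical.

```python
def fit_label_mapping(labels):
    """Assign consecutive integer ids to the sorted distinct labels."""
    classes = sorted(set(labels))
    return {label: idx for idx, label in enumerate(classes)}

def encode_labels(labels, mapping):
    return [mapping[label] for label in labels]

example_labels = ["Positive", "Negative", "Neutral", "Positive", "Extremely Negative"]
mapping = fit_label_mapping(example_labels)
encoded = encode_labels(example_labels, mapping)

# Inverse mapping, useful to turn predicted ids back into readable labels
id_to_label = {idx: label for label, idx in mapping.items()}

print(mapping)
print(encoded)
print([id_to_label[i] for i in encoded])
```

Fitting the mapping once and reusing it for the train, eval and test splits keeps the ids consistent across all three.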

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Compute metrics function should return a dictionary with the metrics computed for the task (e.g., accuracy)

def compute_metrics(pred):
    predictions = np.argmax(pred.predictions, axis=-1)
    labels = pred.label_ids
    return {
        "acc": accuracy_score(labels, predictions),
        "f1_macro": f1_score(labels, predictions, average="macro"),
        "f1_weight": f1_score(labels, predictions, average="weighted")
    }
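
To see what the three reported numbers mean, here is a small pure-Python sketch that computes accuracy, macro F1 and weighted F1 by hand on made-up labels and predictions. It is only meant to illustrate the definitions; in practice the `sklearn` functions used above are the better choice.

```python
def accuracy(labels, predictions):
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def f1_for_class(labels, predictions, cls):
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == cls and p == cls)
    fp = sum(1 for y, p in pairs if y != cls and p == cls)
    fn = sum(1 for y, p in pairs if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_macro(labels, predictions):
    # Unweighted mean of per-class F1: every class counts the same
    classes = sorted(set(labels) | set(predictions))
    return sum(f1_for_class(labels, predictions, c) for c in classes) / len(classes)

def f1_weighted(labels, predictions):
    # Per-class F1 weighted by how often each class appears in the labels
    classes = sorted(set(labels) | set(predictions))
    return sum(f1_for_class(labels, predictions, c) * labels.count(c) for c in classes) / len(labels)

y_true = [0, 0, 0, 1, 1, 2]   # hypothetical gold labels
y_hat = [0, 0, 1, 1, 1, 0]    # hypothetical predictions

print(accuracy(y_true, y_hat), f1_macro(y_true, y_hat), f1_weighted(y_true, y_hat))
```

In this example the rare class `2` is never predicted correctly, which pulls macro F1 well below weighted F1: macro F1 is the stricter metric on imbalanced data.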

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=10,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
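    logging_steps=10,                # log the training loss every 10 steps
    eval_strategy="epoch",           # evaluate at the end of each epoch
    save_strategy="epoch",           # save a checkpoint at the end of each epoch
    eval_steps=100,                  # only used when eval_strategy="steps"
    save_steps=100,                  # only used when save_strategy="steps"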
    load_best_model_at_end=True,
    report_to='none'
)
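
The effect of `warmup_steps` is easiest to see by computing the learning rate step by step. The sketch below assumes a linear schedule (linear warmup followed by linear decay to zero) and a base learning rate of `5e-5`; the total number of steps is a made-up value, since it depends on the dataset size and batch size.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

base_lr = 5e-5        # assumed base learning rate
warmup_steps = 10     # same value as in the training arguments above
total_steps = 1000    # hypothetical number of optimization steps

for step in (0, 5, 10, 500, 1000):
    print(step, linear_schedule_lr(step, base_lr, warmup_steps, total_steps))
```

Starting from a very small learning rate avoids large, noisy updates while the newly initialized classification head is still far from a good solution.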

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_ds,              # training dataset
    eval_dataset=eval_ds,                # evaluation dataset
    compute_metrics=compute_metrics,
)

trainer.train()

In [None]:
# The Trainer API can also be used to test the model
preds = trainer.predict(test_ds)
print(preds)

Additional information on how to fine-tune a pre-trained model, either with the Trainer API or with standard PyTorch/TensorFlow (Keras), can be found [here](https://huggingface.co/docs/transformers/training).

Some additional notebooks that may be useful for the project (if you want to, and are allowed to, use HF):

- PyTorch + HF: https://github.com/huggingface/transformers/tree/master/notebooks#pytorch-examples
- TensorFlow + HF: https://github.com/huggingface/transformers/tree/master/notebooks#tensorflow-examples

# Datasets & Metrics

Hugging Face also provides separate packages for [datasets](https://huggingface.co/datasets) and [metrics](https://huggingface.co/metrics).

In [None]:
%%capture
! pip install datasets evaluate

In [None]:
# example on how to use a metric https://huggingface.co/docs/evaluate/choosing_a_metric

from evaluate import load

metric = load("accuracy")

In [None]:
y_pred = preds.predictions.argmax(-1)

metric.add_batch(predictions=y_pred, references=test_y)

print (metric.compute())

In [None]:
# using a dataset

from datasets import load_dataset

dataset = load_dataset("abisee/cnn_dailymail", '3.0.0')

In [None]:
print(dataset["train"][0])
print("Source text:", dataset["train"][0]["article"])
print("Target text:", dataset["train"][0]["highlights"])

In [None]:
max_input_length = 512
max_output_length = 64

def preprocess_function(examples):
    # With a non-batched `map`, `examples` is a single example, so "article" is already one string
    inputs = examples["article"]
    targets = examples["highlights"]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=max_output_length, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
dataset = load_dataset("abisee/cnn_dailymail", '3.0.0', split='train[:1000]')

dataset = dataset.map(preprocess_function)
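
Called without `batched=True`, `map` runs the function once per example and merges the returned dictionary into that example as new columns. The sketch below mimics this behaviour on a plain list of dictionaries; `map_examples` and `add_word_count` are hypothetical stand-ins, not part of the `datasets` API.

```python
def map_examples(rows, fn):
    """Apply fn to each example and merge the returned keys into it."""
    return [{**row, **fn(row)} for row in rows]

def add_word_count(example):
    # Hypothetical per-example function, standing in for preprocess_function
    return {"n_words": len(example["article"].split())}

rows = [
    {"article": "A short news story", "highlights": "Short story"},
    {"article": "Another slightly longer news story", "highlights": "Longer story"},
]

mapped = map_examples(rows, add_word_count)
print(mapped[0])
print(mapped[1])
```

The original columns are kept alongside the new ones, which is why the mapped dataset still contains `article` and `highlights` next to `input_ids` and `labels`.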

In [None]:
print(dataset[0])
print("Source IDs:", dataset[0]["input_ids"])
print("Target IDs:", dataset[0]["labels"])

In [None]:
columns_to_return = ['input_ids', 'labels', 'attention_mask']
dataset.set_format(type='torch', columns=columns_to_return)

In [None]:
print(dataset[0])
print("Source IDs:", dataset[0]["input_ids"])
print("Target IDs:", dataset[0]["labels"])

### Let's have a look at the Hugging Face 🤗 website!

- [Models](https://huggingface.co/models)
- [Datasets](https://huggingface.co/datasets)
- [Spaces](https://huggingface.co/spaces)
- [Docs](https://huggingface.co/docs)