# Fine-Tuned BERT for Named-Entity Recognition (NER) using Transformers and PyTorch

This notebook explores how to fine-tune a pre-trained BERT-based model (DistilBERT) for Named-Entity Recognition (NER). NER is a fundamental task in Natural Language Processing (NLP) that involves identifying entities such as persons, locations, and organizations in text.

**Why Fine-Tune BERT for NER?**

Pre-trained models like BERT and DistilBERT are general-purpose language models trained on vast amounts of text data. However, for domain-specific tasks like NER, fine-tuning is essential to *adapt the model to the specific task* requirements and dataset. This enables the model to achieve higher accuracy by leveraging task-specific labeled data.

**How to Fine-Tune BERT?**

We will first utilize the **Trainer API** from the **Transformers library** to simplify the fine-tuning process. Next, we will implement a **custom training loop using PyTorch** for greater flexibility and control. Finally, we will evaluate the fine-tuned models and make predictions using the trained models.

**Steps:**

0) Install the Transformers, Datasets, and Evaluate libraries.

1) Load and understand the dataset for the NER task.

2) Preprocess the data to make it compatible with the model.

3) Fine-tune the model using the Trainer API of the Transformers library.

4) Fine-tune the model using a custom PyTorch implementation.

5) Make predictions using the pipeline API on the Trainer API fine-tuned model.

6) Make predictions using the PyTorch fine-tuned model.


The model fine-tuned with the Trainer API is published on my Hugging Face account; you can load and use it directly (step 5 shows how):

[Wencho/distilledbert-finetuned-ner](https://huggingface.co/Wencho/distilledbert-finetuned-ner)

### 0) Install the Transformers, Datasets, and Evaluate libraries

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate -U

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3
Collecting accelerate
  Downloading accelerate-1.2.1-py3-none-any.whl.metadata (19 kB)
Downloading accelerate-1.2.1-py3-none-any.whl (336 kB)
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.34.2
    Uninstalling accelerate-0.34.2:
      Successfully uninstalled accelerate-0.34.2
Successfully installed accelerate-1.2.1


### 1) Load and understand the dataset for the NER task

In this section, we load the **CoNLL-2003 dataset** using the **datasets library** from **Hugging Face**. This dataset is a widely used benchmark for Named Entity Recognition (NER) tasks and contains labeled tokens **with four types of named entities**:
- LOC (locations)
- ORG (organizations)
- PER (persons)
- MISC (miscellaneous entities).


**The dataset is divided into three parts:**

- Train: Contains 14,041 examples for model training. This is the primary dataset used to fine-tune the model.
- Validation: Includes 3,250 examples used to evaluate the model during training and tune hyperparameters, ensuring it generalizes well.
- Test: Comprises 3,453 examples to assess the model's performance after training.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("conll2003")

The repository for conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


Let's take a first look at the dataset.

Each example contains the tokens and, for each token, labels for three NLP tasks: POS tagging, chunking, and NER.

**Only the NER labels are used in this notebook.**


**IMPORTANT:** The input texts **are not presented as sentences or documents**, but as **lists of words (a pre-tokenized dataset)**.

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [None]:
# Printing the first training example.
## For the first example we print the tokens and the NER tags that we'll use for the training part.

print(raw_datasets["train"][0])
print(raw_datasets["train"][0]["tokens"])
print(raw_datasets["train"][0]["ner_tags"])

{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}
['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
[3, 0, 7, 0, 0, 0, 7, 0, 0]


**O** means the word doesn’t correspond to any entity (it is encoded as the integer 0).

**B-PER/I-PER** means the word corresponds to the beginning of/is inside a person entity.

**B-ORG/I-ORG** means the word corresponds to the beginning of/is inside an organization entity.

**B-LOC/I-LOC** means the word corresponds to the beginning of/is inside a location entity.

**B-MISC/I-MISC** means the word corresponds to the beginning of/is inside a miscellaneous entity.

Each integer in **ner_tags** is an index into the list of label names, available as **ner_feature.feature.names**.



**For example:** 3 refers to B-ORG and 7 refers to B-MISC.

In [None]:
ner_feature = raw_datasets["train"].features["ner_tags"]
print(ner_feature)

label_names = ner_feature.feature.names
print(label_names)

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


In the example below:
"EU rejects German call to boycott British lamb."

EU is B-ORG

German is B-MISC

British is B-MISC

and the rest is O.

In [None]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["ner_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

EU    rejects German call to boycott British lamb . 
B-ORG O       B-MISC O    O  O       B-MISC  O    O 
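To make the B-/I- scheme concrete, the sketch below groups word-level tags into entity spans. The helper name `bio_to_spans` is hypothetical (it is not part of the notebook or of any library), and the words and tags are the first eight entries of training example 10, which is used again later in the notebook.

```python
def bio_to_spans(words, tags):
    """Group word-level B-/I-/O tags into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_words.append(word)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if tag != "O":
            # B-XXX (or a stray I-XXX) opens a new span.
            current_type, current_words = tag[2:], [word]
    if current_type is not None:
        spans.append((current_type, " ".join(current_words)))
    return spans


label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
words = ["Spanish", "Farm", "Minister", "Loyola", "de", "Palacio", "had", "earlier"]
ner_tags = [7, 0, 0, 1, 2, 2, 0, 0]

spans = bio_to_spans(words, [label_names[t] for t in ner_tags])
print(spans)
```

The B- tag marks where an entity starts, and the I- tags that follow attach further words to it, so "Loyola de Palacio" comes out as a single PER entity.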


### 2) Preprocess the data to make it compatible with the model

Before the **DistilBERT model** can process the dataset, the **text data must be converted into token IDs**. This transformation allows the model to understand and process the input effectively. To accomplish this, we use a pretrained **DistilBERT tokenizer**, which will tokenize the input text and generate corresponding token IDs.

Since the **CoNLL-2003 dataset is pre-tokenized** (containing individual words rather than entire sentences), we utilize the tokenizer with the parameter **is_split_into_words=True**. This ensures that the tokenizer correctly processes the pre-tokenized input without treating each word as a separate sentence.

However, tokenization introduces **challenges** when aligning the labels with the tokens. For example:

1) The tokenizer may split a word into multiple subwords.

2) Special tokens like [CLS] (at the beginning) and [SEP] (at the end) are added automatically.

This mismatch means that the *number of tokens may not match the number of labels*. To address this, we **align the labels with the tokenized input** while accounting for special tokens and subword splits. This alignment ensures that each token is correctly associated with its corresponding label, enabling the model to learn effectively during training.

#### A) Using AutoTokenizer with the model_checkpoint

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.




We can check that the tokenizer object we are using is indeed backed by the Tokenizers library (a "fast" tokenizer), which is what provides the **word_ids()** method used later:

In [None]:
tokenizer.is_fast

True

#### B) Example of tokenization and why we must align the labels

**Key notes on the tokenization:**

To tokenize a pre-tokenized input, we call the tokenizer with **is_split_into_words=True**.

As mentioned before, there are a few challenges to take into consideration after the tokenization:

**1)** The tokenizer adds the **special tokens** used by BERT models ([CLS] at the beginning and [SEP] at the end).

**2)** Some words are tokenized as **subwords** (for example, unjustified becomes un ##just ##ified).

This creates a **discrepancy** between our inputs and the labels: for the example below, the label list contains only 27 elements, while the tokenized input includes 35 tokens.

Handling the special tokens is straightforward since we know their fixed positions at the beginning and end of the input. However, *aligning the remaining labels with the corresponding words becomes more complex*, especially when a word is split into multiple subword tokens.

In [None]:
# Display the raw tokens from the dataset
print("Raw tokens from the dataset:")
print(raw_datasets["train"][10]["tokens"])

# Tokenize the input while keeping track of word boundaries
inputs = tokenizer(raw_datasets["train"][10]["tokens"], is_split_into_words=True)

# Display the tokens generated by the tokenizer
print("\nTokens generated by the tokenizer:")
print(inputs.tokens())

# Display the word IDs corresponding to each token
print("\nWord IDs corresponding to each token:")
print(inputs.word_ids())


Raw tokens from the dataset:
['Spanish', 'Farm', 'Minister', 'Loyola', 'de', 'Palacio', 'had', 'earlier', 'accused', 'Fischler', 'at', 'an', 'EU', 'farm', 'ministers', "'", 'meeting', 'of', 'causing', 'unjustified', 'alarm', 'through', '"', 'dangerous', 'generalisation', '.', '"']

Tokens generated by the tokenizer:
['[CLS]', 'spanish', 'farm', 'minister', 'loyola', 'de', 'pal', '##acio', 'had', 'earlier', 'accused', 'fis', '##ch', '##ler', 'at', 'an', 'eu', 'farm', 'ministers', "'", 'meeting', 'of', 'causing', 'un', '##just', '##ified', 'alarm', 'through', '"', 'dangerous', 'general', '##isation', '.', '"', '[SEP]']

Word IDs corresponding to each token:
[None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 19, 19, 20, 21, 22, 23, 24, 24, 25, 26, None]


#### C) Aligning Labels with new Tokens

The **align_labels_with_tokens** function is used to align the labels from the dataset with the tokens generated by the tokenizer.

For each **word_id** (which indicates which word a token belongs to):
<br>
<br>

**A) New Word:** If the word_id changes (indicating the start of a new word), the corresponding label from the original list is used.
  
**B) Special Token:** If the word_id is None, it indicates a special token, and -100 is appended to new_labels.

--> -100 tells the loss function to ignore these tokens, so they do not affect training (it is the default **ignore_index** of PyTorch's cross-entropy loss).

**C) Subword of the Same Word:** If the word_id remains the same as the previous one, it means the token is a subword of the current word. In this case:
The original label is reused.

-->If the label represents the beginning of an entity (e.g., B-XXX), it is changed to the inside label (I-XXX) to maintain consistency across subwords.


In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels
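The `label % 2 == 1` test in the function above relies on the order of the label list: every B-XXX label sits at an odd index and is immediately followed by its I-XXX counterpart. The short check below verifies that property for the label names printed earlier; `b_to_i` is a hypothetical helper written only for this illustration.

```python
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

for idx, name in enumerate(label_names):
    if name.startswith("B-"):
        # Every B-XXX is at an odd index, directly followed by I-XXX.
        assert idx % 2 == 1
        assert label_names[idx + 1] == "I-" + name[2:]


def b_to_i(label_id):
    """Return the I-XXX id for a B-XXX id; leave O and I-XXX ids unchanged."""
    return label_id + 1 if label_id % 2 == 1 else label_id


converted = [label_names[b_to_i(i)] for i in range(len(label_names))]
print(converted)
```

With a dataset whose labels are ordered differently, this shortcut would not hold, and an explicit B-to-I mapping would be the safer choice.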

In [None]:
# As an example, we use element number 10 of the training set

labels = raw_datasets["train"][10]["ner_tags"]
word_ids = inputs.word_ids()
print(inputs.tokens())
print(labels)
print(align_labels_with_tokens(labels, word_ids))

['[CLS]', 'spanish', 'farm', 'minister', 'loyola', 'de', 'pal', '##acio', 'had', 'earlier', 'accused', 'fis', '##ch', '##ler', 'at', 'an', 'eu', 'farm', 'ministers', "'", 'meeting', 'of', 'causing', 'un', '##just', '##ified', 'alarm', 'through', '"', 'dangerous', 'general', '##isation', '.', '"', '[SEP]']
[7, 0, 0, 1, 2, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[-100, 7, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 2, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
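A common alternative, which is not the strategy used in this notebook, keeps the label only on the first subword of each word and masks the remaining subwords with -100, so that each word contributes to the loss exactly once. The sketch below assumes hand-written word IDs for the fragment "[CLS] loyola de pal ##acio [SEP]"; the function name `align_first_subword_only` is hypothetical.

```python
def align_first_subword_only(labels, word_ids):
    """Label the first subword of each word; mask everything else with -100."""
    new_labels = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP]
            new_labels.append(-100)
        elif word_id != previous_word:
            # First subword of a new word keeps the word's label
            new_labels.append(labels[word_id])
        else:
            # Later subwords of the same word are ignored by the loss
            new_labels.append(-100)
        previous_word = word_id
    return new_labels


word_ids = [None, 0, 1, 2, 2, None]  # hand-written for illustration
labels = [1, 2, 2]                   # B-PER, I-PER, I-PER
aligned = align_first_subword_only(labels, word_ids)
print(aligned)
```

Which strategy works better depends on the task. The version used in this notebook trains on every subword, while this variant keeps the number of scored positions equal to the number of words.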


#### D) Tokenization of the whole dataset with map()

To preprocess our whole dataset, we need to **tokenize all the inputs** and apply **align_labels_with_tokens()** on all the labels.

We use the **Dataset.map()** method with **batched=True**: to take advantage of the speed of the fast tokenizer, it is best to tokenize many texts at once, so the function receives a batch of examples (a dictionary of lists) rather than a single example.

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

### 3) Fine-tune the model using the Trainer API of the Transformers library

#### A) Data Collation

A **Data Collator** plays a crucial role during batch preparation by ensuring that **data samples are correctly formatted and padded to the same length**. This standardization allows efficient processing by the model.

The data collator dynamically **pads both inputs and labels** **within a batch** to **match the length of the longest sequence in that batch**. This approach avoids issues caused by sequences of varying lengths.

When padding, labels must be padded in the same way as inputs to maintain consistency and ensure they remain the same size. **To handle padded positions** in the labels, a special value, typically **-100, is used**. This ***value is ignored during the loss computation***, meaning it doesn’t affect the model’s learning process.


**NOTE:** This data_collator will be used for the PyTorch training as well.

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
# EXAMPLE

## To test this on a few samples, we can just call it on a list of 2 examples from our tokenized training set:

## We can see that the second example of the tensor is padded with -100 to match the size of the first one.

batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100]])
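The padding behaviour shown above can be mimicked by hand. The following sketch is a plain-Python approximation, not the library's implementation: it returns lists rather than tensors, the token IDs are placeholders, and `pad_token_id=0` is an assumption about the tokenizer's padding ID. The labels are the ones from the two examples above.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padded label positions get -100 so the loss ignores them
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    {"input_ids": [101, 11, 12, 13, 14, 15, 16, 17, 18, 19, 102],  # placeholder IDs
     "labels": [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]},
    {"input_ids": [101, 21, 22, 102],                              # placeholder IDs
     "labels": [-100, 1, 2, -100]},
]

batch = pad_batch(features)
print(batch["labels"][1])
print(batch["attention_mask"][1])
```

Because padding is computed per batch, a batch of short sentences stays short instead of being padded to the longest sentence in the whole dataset.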

#### B) Metrics

To have the Trainer compute metrics at every evaluation (here, once per epoch), we must define a **compute_metrics()** function that takes the arrays of predictions and labels and returns a dictionary with the metric names and values.

The traditional framework used to evaluate token classification prediction is **seqeval**.

In [None]:
!pip install seqeval



In [None]:
import evaluate

metric = evaluate.load("seqeval")

This **compute_metrics()** function first takes the **argmax of the logits** to **convert them to predictions**.

Then it converts both labels and predictions from **integers to strings**.

All positions where the label is -100 are removed, and the results are passed to **metric.compute()**.

In [None]:
import numpy as np


def compute_metrics_trainer(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
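seqeval scores whole entities rather than individual tokens, which is why precision, recall, and F1 can be noticeably lower than token accuracy. The sketch below is a simplified illustration of entity-level scoring under that assumption, not seqeval's actual implementation; `extract_entities` and `entity_f1` are hypothetical helper names, and the tag sequences are made up.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one B-/I-/O tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                entities.add((etype, start, i))
            if tag == "O":
                start, etype = None, None
            else:
                start, etype = i, tag[2:]
    return entities


def entity_f1(true_tags, pred_tags):
    """Precision, recall and F1 where an entity counts only if type and span both match."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


true_tags = ["B-PER", "I-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]  # a single token is wrong

token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
precision, recall, f1 = entity_f1(true_tags, pred_tags)
print(token_accuracy, precision, recall, f1)
```

One wrong token out of five still gives 80% token accuracy, but it truncates the PER entity, so only one of the two predicted entities matches and the entity-level F1 drops to 0.5.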

#### C) Model

Since we are working on a **token classification problem**, we will use the **AutoModelForTokenClassification** class
- It automatically selects the right architecture for the checkpoint, which here is **"distilbert-base-uncased"**.

We also pass **id2label** and **label2id**, which contain the mappings from ID to label and vice versa, so the model knows how many labels there are and what they are called.

In [None]:
id2label_trainer = {i: label for i, label in enumerate(label_names)}
label2id_trainer = {v: k for k, v in id2label_trainer.items()}

In [None]:
from transformers import AutoModelForTokenClassification

model_trainer = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label_trainer,
    label2id=label2id_trainer,
)

print(model_trainer.config.num_labels)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


9


Some hyperparameters:
- Learning rate
- The number of epochs to train for
- Weight decay


In [None]:
## Log in, since the model will be pushed to my Hugging Face account
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from transformers import TrainingArguments

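args = TrainingArguments(
    "distilledbert-finetuned-ner",  # Output directory; also used as the repository name on the Hub
    report_to="none",  # Disable logging integrations
    evaluation_strategy="epoch",  # Evaluate at the end of every epoch
    save_strategy="epoch",  # Save a checkpoint at the end of every epoch
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,  # Upload the model to the Hugging Face Hub
)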



**Instantiation of the Trainer.**

We pass the
- model
- the arguments
- the training and evaluation dataset
- Data Collator
- Metrics
- Tokenizer.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model_trainer,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics_trainer,
    tokenizer=tokenizer,
)
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0769 | 0.062221 | 0.907273 | 0.923763 | 0.915444 | 0.98235 |
| 2 | 0.0359 | 0.059273 | 0.920528 | 0.939583 | 0.929958 | 0.984908 |
| 3 | 0.023 | 0.061334 | 0.927076 | 0.939246 | 0.933122 | 0.985114 |


TrainOutput(global_step=5268, training_loss=0.06606481395712746, metrics={'train_runtime': 200.3955, 'train_samples_per_second': 210.199, 'train_steps_per_second': 26.288, 'total_flos': 445994355589020.0, 'train_loss': 0.06606481395712746, 'epoch': 3.0})

In [None]:
trainer.push_to_hub(commit_message="Training complete")

CommitInfo(commit_url='https://huggingface.co/Wencho/distilledbert-finetuned-ner/commit/8c0ef74985e8658adc4ebb213d7179675add180e', commit_message='Training complete', commit_description='', oid='8c0ef74985e8658adc4ebb213d7179675add180e', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Wencho/distilledbert-finetuned-ner', endpoint='https://huggingface.co', repo_type='model', repo_id='Wencho/distilledbert-finetuned-ner'), pr_revision=None, pr_num=None)

#### D) Evaluation

In [None]:
trainer.evaluate()

{'eval_loss': 0.061334069818258286,
 'eval_precision': 0.9270764119601329,
 'eval_recall': 0.9392460451026591,
 'eval_f1': 0.9331215515800034,
 'eval_accuracy': 0.9851144613722655,
 'eval_runtime': 4.9002,
 'eval_samples_per_second': 663.239,
 'eval_steps_per_second': 83.058,
 'epoch': 3.0}

### 4) Fine-tune the model using a custom PyTorch implementation

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

We define two **DataLoader** objects: one for the training data and one for the evaluation data. The DataLoader is used to efficiently load data in batches during training and evaluation.

**train_dataloader:** This is used to load the training data. We set the *shuffle=True* parameter to randomly shuffle the data for each epoch, which helps prevent the model from memorizing the order of the data. The collate_fn=data_collator ensures that the input data is appropriately padded and batched before being passed to the model.
We also define the batch size as 8.

**eval_dataloader:** This is used for the evaluation dataset. It uses the same collate_fn=data_collator for padding, but without shuffling, since the order of the examples does not affect the evaluation metrics.



In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)

In [None]:
from transformers import AutoModelForTokenClassification

id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Here, we define the optimizer used for training: the **AdamW** optimizer from **PyTorch**. The AdamW optimizer is commonly used in NLP tasks and provides adaptive learning rates for each parameter.

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)
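To illustrate what "adaptive learning rates for each parameter" means, the sketch below performs one AdamW update on a single scalar parameter in plain Python. It is a simplified illustration and not PyTorch's implementation; the hyperparameters other than `lr` are assumed to mirror the defaults of `torch.optim.AdamW`, and `adamw_step` is a hypothetical name.

```python
import math


def adamw_step(param, grad, m, v, t, lr=2e-5, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (t is the 1-based step count)."""
    beta1, beta2 = betas
    param = param * (1 - lr * weight_decay)    # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Two parameters starting at 0.0 with very different gradient magnitudes
small, _, _ = adamw_step(0.0, grad=1e-3, m=0.0, v=0.0, t=1)
large, _, _ = adamw_step(0.0, grad=1e2, m=0.0, v=0.0, t=1)
print(small, large)
```

Although the two gradients differ by five orders of magnitude, both parameters move by roughly `lr`, because each update is normalized by that parameter's own gradient history.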

**Metric function to evaluate the model** (using seqeval)

In [None]:
!pip install seqeval

import numpy as np
import evaluate

metric = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=c14e8167d6536623f82d3297f5041bb1f526ca11d521c6f69263f1d4806f863b
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2

#### Explanation of the Training and Evaluation Loop

In this cell, we **train** and **evaluate** the model over multiple epochs. The key steps are as follows:

**1) Device Setup:** The model and data batches are moved to the appropriate device (cuda or cpu) for computation.

**2) Training Loop:** For each epoch, the model is set to training mode, *processes the batches*, computes the *loss*, and *updates the weights* using backpropagation.

**3) Evaluation Loop:** The model is set to evaluation mode, and *predictions are made on the validation* dataset *without updating the weights*. Special token predictions are removed to align with the true labels.

**Progress Tracking:** A progress bar (tqdm) visually tracks the training progress across all steps.

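The **predictions and labels from the evaluation pass of every epoch** are accumulated in the same two lists, so the metrics computed in the next cell pool all three epochs rather than describing only the final model.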

In [None]:
import torch
from tqdm.auto import tqdm

# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# For tracking metrics
all_predictions = []
all_labels = []

num_train_epochs = 3  # Define number of epochs
num_training_steps = num_train_epochs * len(train_dataloader)
progress_bar = tqdm(range(num_training_steps))

# Training and evaluation loop
for epoch in range(num_train_epochs):
    print(f"Epoch {epoch + 1}/{num_train_epochs}")

    # Training
    model.train()
    for batch in train_dataloader:
        # Move batch to the appropriate device
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in eval_dataloader:
        # Move batch to the appropriate device
        batch = {k: v.to(device) for k, v in batch.items()}

        with torch.no_grad():
            outputs = model(**batch)

        # Collect predictions and labels
        predictions = outputs.logits.argmax(dim=-1).detach().cpu().numpy()
        labels = batch["labels"].detach().cpu().numpy()

        # Remove special token predictions
        true_predictions = [
            [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
            for prediction, label in zip(predictions, labels)
        ]
        true_labels = [
            [label_names[l] for l in label if l != -100] for label in labels
        ]

        all_predictions.extend(true_predictions)
        all_labels.extend(true_labels)

Epoch 1/3
Epoch 2/3
Epoch 3/3


#### Metrics of the PyTorch model

In [None]:
# Compute metrics at the end of training
final_metrics = metric.compute(predictions=all_predictions, references=all_labels)
print("Final Evaluation Metrics:")
print({
    "precision": final_metrics["overall_precision"],
    "recall": final_metrics["overall_recall"],
    "f1": final_metrics["overall_f1"],
    "accuracy": final_metrics["overall_accuracy"],
})

Final Evaluation Metrics:
{'precision': 0.9128012795764161, 'recall': 0.9284191630203075, 'f1': 0.9205439830909142, 'accuracy': 0.9831816183985469}
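In the training cell, `all_predictions` and `all_labels` are created once, before the epoch loop, and are extended during every evaluation pass. The metrics above therefore pool the predictions of all three epochs. The toy example below uses made-up predictions to show how a pooled score can differ from the score of the last epoch alone.

```python
def accuracy(preds, golds):
    """Fraction of positions where prediction and gold label agree."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


gold = ["B-PER", "O", "B-LOC", "O"]

# Hypothetical predictions on the same validation sentence after each epoch
per_epoch_preds = [
    ["O", "O", "O", "O"],          # epoch 1: 2 of 4 correct
    ["B-PER", "O", "O", "O"],      # epoch 2: 3 of 4 correct
    ["B-PER", "O", "B-LOC", "O"],  # epoch 3: 4 of 4 correct
]

pooled_preds, pooled_gold = [], []
for preds in per_epoch_preds:
    pooled_preds.extend(preds)  # never reset between epochs
    pooled_gold.extend(gold)

pooled = accuracy(pooled_preds, pooled_gold)
last_epoch = accuracy(per_epoch_preds[-1], gold)
print(pooled, last_epoch)
```

Resetting the two lists at the start of each evaluation pass would give per-epoch metrics that are directly comparable with the table produced by the Trainer.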


### 5) Make predictions using the pipeline API on the Trainer API fine-tuned model

Below we import the saved model "distilledbert-finetuned-ner" (the one trained with the Trainer API) in order to make predictions with it.

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="Wencho/distilledbert-finetuned-ner")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


**Let's make the first prediction** by sending the text "Messi plays footbal for Argentina" to the pipeline. The raw output is hard to read, so the function defined below presents it in a cleaner way.

In [None]:
result = pipe("Messi plays footbal for Argentina")
print(result)

[{'entity': 'B-PER', 'score': 0.998197, 'index': 1, 'word': 'mess', 'start': 0, 'end': 4}, {'entity': 'I-PER', 'score': 0.9980804, 'index': 2, 'word': '##i', 'start': 4, 'end': 5}, {'entity': 'B-LOC', 'score': 0.99566746, 'index': 7, 'word': 'argentina', 'start': 24, 'end': 33}]


In [None]:
## Better representation of the results

def process_ner_result(result):
    """
    Process the result from a token classification pipeline and group entities by type.

    Args:
        result (list): List of token-level predictions from a NER pipeline.

    Returns:
        list: List of grouped entities with their labels and confidence scores.
    """
    entities = []
    current_entity = None

    for token in result:
        entity_type = token["entity"]
        word = token["word"]
        score = token["score"]

        if entity_type.startswith("B-") or current_entity is None:
            # Start a new entity
            if current_entity:
                entities.append(current_entity)
            current_entity = {
                "entity": entity_type[2:],  # Remove "B-" or "I-"
                "word": word,
                "score": score
            }
        else:
            # Continue the current entity
            current_entity["word"] += word.replace("##", "")
            current_entity["score"] = min(current_entity["score"], score)

    # Append the last entity
    if current_entity:
        entities.append(current_entity)

    return entities


def print_ner_entities(entities):
    """
    Print the detected entities in a human-readable format.

    Args:
        entities (list): List of grouped entities with their labels and confidence scores.
    """
    print("Detected Entities:")
    for entity in entities:
        print(f"  - {entity['word']} ({entity['entity']}, confidence: {entity['score']:.2f})")

Now that we have a function that displays the model's output more clearly, **let's test it with another example**.

In [None]:
result = pipe("Ronaldo plays football for Portugal, Messi for Argentina, and Mbappe for Real Madrid")  # Token-level output of the pipeline

# Process the NER result
entities = process_ner_result(result)

# Print the detected entities
print_ner_entities(entities)

Detected Entities:
  - ronaldo (PER, confidence: 1.00)
  - portugal (LOC, confidence: 0.99)
  - messi (PER, confidence: 0.99)
  - argentina (LOC, confidence: 0.99)
  - mbappe (PER, confidence: 1.00)
  - realmadrid (ORG, confidence: 1.00)
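
The grouped words above are lowercased and joined without spaces (for example `realmadrid`), because they are rebuilt from the tokenizer's word pieces. Each pipeline prediction also carries `start` and `end` character offsets, so the original spelling can be recovered by slicing the input text. The function below is a minimal sketch of that idea; `group_entities_by_offsets` is a hypothetical name, and the sample predictions are copied from the first pipeline output in this notebook.

```python
def group_entities_by_offsets(text, result):
    """Group token-level predictions and read each entity's text from `text`."""
    entities = []
    current = None

    for token in result:
        # "B-PER" -> prefix "B", entity_type "PER"
        prefix, _, entity_type = token["entity"].partition("-")
        continues = (
            current is not None
            and prefix == "I"
            and entity_type == current["entity"]
        )
        if continues:
            # Extend the character span and keep the lowest token score
            current["end"] = token["end"]
            current["score"] = min(current["score"], token["score"])
        else:
            if current is not None:
                entities.append(current)
            current = {
                "entity": entity_type,
                "start": token["start"],
                "end": token["end"],
                "score": token["score"],
            }

    if current is not None:
        entities.append(current)

    # Slice the original text so casing and inner spaces are preserved
    for entity in entities:
        entity["word"] = text[entity["start"]:entity["end"]]
    return entities


sample_text = "Messi plays footbal for Argentina"
sample_result = [
    {"entity": "B-PER", "score": 0.998197, "index": 1, "word": "mess", "start": 0, "end": 4},
    {"entity": "I-PER", "score": 0.9980804, "index": 2, "word": "##i", "start": 4, "end": 5},
    {"entity": "B-LOC", "score": 0.99566746, "index": 7, "word": "argentina", "start": 24, "end": 33},
]

grouped = group_entities_by_offsets(sample_text, sample_result)
for entity in grouped:
    print(f"{entity['word']} ({entity['entity']})")
# Prints:
# Messi (PER)
# Argentina (LOC)
```

The `pipeline` function also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs a similar grouping inside the library; it is not used in this notebook.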


### 6) Make predictions using the PyTorch fine-tuned model.

In this section, we use the DistilBERT model fine-tuned with the custom PyTorch loop to make predictions on a given text. The prediction steps differ from those used for the Trainer API model: this model was not saved, and we do not rely on the library's high-level abstractions, so tokenization, inference, and decoding are done by hand.

**Input Tokenization:** The input text is tokenized using the tokenizer associated with the model. Padding and truncation are applied as needed.

**Model Inference:** The tokenized input is passed through the fine-tuned NER model to obtain logits (raw predictions).

**Prediction Decoding:** Logits are converted into class predictions by selecting the label with the highest score for each token.

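**Mapping Tokens to Labels:** The predicted labels are mapped to their corresponding tokens. Special tokens are excluded, and the SentencePiece word marker (`▁`) is stripped if present. DistilBERT uses a WordPiece tokenizer, so continuation pieces appear with a `##` prefix instead.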

**Result Display:** The final token-label pairs are displayed, showing the entities identified in the input text.

In [None]:
# Example text
text = "Ronaldo plays football for Portugal, Messi for Argentina, and Mbappe for Real Madrid"

# Tokenize the input text
encoding = tokenizer(text, return_tensors="pt", truncation=True, padding=True, is_split_into_words=False)
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

# Pass the tokenized input through the model
model.eval()
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits

# Convert logits to predictions
predictions = torch.argmax(logits, dim=-1).squeeze().tolist()

# Decode predictions and map them to the original text
tokens = tokenizer.convert_ids_to_tokens(input_ids.squeeze().tolist())
labels = [id2label[pred] for pred in predictions]

# Post-process to combine tokens and predictions
result = []
for token, label in zip(tokens, labels):
    if token.startswith("▁"):  # SentencePiece word marker; a no-op for DistilBERT's WordPiece tokens
        token = token[1:]
    if token not in tokenizer.all_special_tokens:  # Exclude special tokens ([CLS], [SEP], etc.)
        result.append((token, label))

# Display the NER results
for token, label in result:
    print(f"{token}: {label}")

ronald: B-PER
##o: I-PER
plays: O
football: O
for: O
portugal: B-LOC
,: O
mess: B-PER
##i: I-PER
for: O
argentina: B-LOC
,: O
and: O
mba: B-PER
##ppe: I-PER
for: O
real: B-ORG
madrid: I-ORG
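
The output above is at word-piece level, so names such as "ronaldo" are split into `ronald` and `##o`. A minimal sketch of merging the pieces back into whole words is shown below. `merge_wordpieces` is a hypothetical helper, and keeping the label of the first piece for the whole word is an assumption of this sketch, not something the model enforces.

```python
def merge_wordpieces(pairs):
    """Merge WordPiece continuation tokens ("##...") into whole words.

    Args:
        pairs (list): (token, label) tuples, special tokens already removed.

    Returns:
        list: (word, label) tuples, one per word, labelled by the first piece.
    """
    words = []
    for token, label in pairs:
        if token.startswith("##") and words:
            # Attach the continuation piece to the previous word
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


# A few token-label pairs copied from the output above
sample_pairs = [
    ("ronald", "B-PER"), ("##o", "I-PER"), ("plays", "O"),
    ("mba", "B-PER"), ("##ppe", "I-PER"),
    ("real", "B-ORG"), ("madrid", "I-ORG"),
]

for word, label in merge_wordpieces(sample_pairs):
    print(f"{word}: {label}")
# Prints:
# ronaldo: B-PER
# plays: O
# mbappe: B-PER
# real: B-ORG
# madrid: I-ORG
```

In the notebook, the same helper could be applied directly to the `result` list built in the cell above.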


As with the Trainer API model, we define helper functions to present the results more clearly.

In [None]:
### Better representation

def process_ner_result(tokens, labels, scores=None):
    """
    Process tokens and labels into grouped entities by type.

    Args:
        tokens (list): List of tokens.
        labels (list): List of predicted labels for each token.
        scores (list, optional): List of confidence scores for each token (default: None).

    Returns:
        list: List of grouped entities with their labels and optional confidence scores.
    """
    entities = []
    current_entity = None

    for idx, (token, label) in enumerate(zip(tokens, labels)):
        if label.startswith("B-"):
            # Start a new entity
            if current_entity:
                entities.append(current_entity)
            current_entity = {
                "entity": label[2:],  # Remove "B-" prefix
                "word": token.replace("##", ""),  # Clean token
                "score": scores[idx] if scores else None  # Initialize with score if provided
            }
        elif label.startswith("I-") and current_entity and current_entity["entity"] == label[2:]:
            # Continue the current entity (pieces are joined without a space)
            current_entity["word"] += token.replace("##", "")
            if scores:
                current_entity["score"] = min(current_entity["score"], scores[idx])
        else:
            # Outside of any entity or a mismatch
            if current_entity:
                entities.append(current_entity)
                current_entity = None

    # Append the last entity
    if current_entity:
        entities.append(current_entity)

    return entities

def print_ner_entities(entities):
    """
    Print the detected entities in a human-readable format.

    Args:
        entities (list): List of grouped entities with their labels and confidence scores.
    """
    print("Detected Entities:")
    for entity in entities:
        if entity["score"] is not None:
            print(f"  - {entity['word']} ({entity['entity']}, confidence: {entity['score']:.2f})")
        else:
            print(f"  - {entity['word']} ({entity['entity']})")


# No scores are passed, so the entities are printed without confidence values
entities = process_ner_result(tokens, labels)
print_ner_entities(entities)

Detected Entities:
  - ronaldo (PER)
  - portugal (LOC)
  - messi (PER)
  - argentina (LOC)
  - mbappe (PER)
  - realmadrid (ORG)
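
Unlike the pipeline output, these entities have no confidence values, because `process_ner_result` was called without `scores`. One way to obtain them is to apply a softmax to each token's logits and keep the probability of the predicted label. The sketch below does this in plain Python on a nested list. `token_confidences` is a hypothetical helper, and the commented usage assumes `logits` still has the shape (1, sequence length, number of labels) from the inference cell.

```python
import math


def token_confidences(logit_rows):
    """Return, for each token, the softmax probability of its top-scoring label."""
    confidences = []
    for row in logit_rows:
        peak = max(row)
        # Subtract the maximum before exponentiating for numerical stability
        exps = [math.exp(value - peak) for value in row]
        # The top label contributes exp(0) == 1 to the numerator
        confidences.append(1.0 / sum(exps))
    return confidences


# Toy logits for two tokens and three labels (illustrative values only)
toy_logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
print([round(c, 3) for c in token_confidences(toy_logits)])

# Assumed usage with the variables from the cells above:
# scores = token_confidences(logits[0].tolist())
# entities = process_ner_result(tokens, labels, scores)
# print_ner_entities(entities)
```

With PyTorch tensors the same values could be computed with `torch.softmax(logits, dim=-1).max(dim=-1).values`; the plain-Python version is used here only to keep the sketch self-contained.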
