# DataCollator
The `DataCollator` is a core component of the Hugging Face `transformers` library that builds batches during model training and inference. It takes already-tokenized examples of varying lengths (like text sequences), pads them to a common length, and collates them into tensors that a model can process. Understanding its details gives you more control over how data is batched and fed into transformer models.


### 1. **What is a Data Collator?**
A `DataCollator` in `transformers` is a function or class responsible for combining several samples (like tokenized text) into a batch during training or inference. It typically handles tasks such as:
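- Padding sequences of varying lengths to a common length.
- Extending attention masks so that padded positions are ignored.
- Preparing task-specific tensors, such as masked inputs and labels for language modeling (special tokens like `[CLS]` and `[SEP]` are left unmasked).
- Converting the batch to framework tensors (e.g., PyTorch).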
  
It operates at the level between raw tokenized data and the input to the model. The transformers library comes with pre-built collators, but you can also create custom ones.

### 2. **Pre-Built Data Collators in the Transformers Library**

The library provides several types of `DataCollator` classes:

#### 2.1. `DataCollatorWithPadding`
This collator pads each sequence in a batch to the length of the longest sequence in that batch, so that all the input tensors have the same shape. It is the usual choice for tasks such as sequence classification, where only the inputs (not the labels) need padding.

- **Key Parameters**:
  - `tokenizer`: The tokenizer used to convert text to tokens.
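  - `padding`: Defines the padding strategy: `True` or `'longest'` (pad to the longest sequence in the batch, the default), `'max_length'` (pad to `max_length`), or `False` (no padding).
  - `max_length`: The length to pad to when `padding='max_length'`. The collator does not truncate; truncation has to happen when tokenizing.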
  - `pad_to_multiple_of`: If set, it will pad sequences to a multiple of this value, which is useful for optimized GPU performance.

- **Example Usage**:
  ```python
  from transformers import DataCollatorWithPadding
  data_collator = DataCollatorWithPadding(tokenizer)
  ```

- **Behind the Scenes**:
  When a batch of tokenized sequences is passed, it calls the tokenizer’s `pad()` method so that all sequences are padded to the same length. The attention mask is extended with zeros at the padded positions so that the model ignores them.
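
The padding step can be illustrated without the library. The sketch below is a simplified stand-in for what happens with right-side padding; the helper name `pad_batch` and the pad ID of 0 are assumptions made for illustration and are not part of `transformers`.

```python
def pad_batch(features, pad_token_id=0):
    """Right-pad every sequence to the longest one in the batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Real tokens get 1, padded positions get 0
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch


features = [{"input_ids": [101, 7270, 102]}, {"input_ids": [101, 1943, 14428, 1840, 102]}]
batch = pad_batch(features)
print(batch["input_ids"])       # [[101, 7270, 102, 0, 0], [101, 1943, 14428, 1840, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The real collator additionally pads any other tokenizer outputs (such as `token_type_ids`) and converts the result to tensors.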

#### 2.2. `DataCollatorForLanguageModeling`
This collator is designed for language modeling tasks (like BERT's masked language modeling). It randomly masks tokens in the input sequence and creates labels for those masked positions.

- **Key Parameters**:
  - `tokenizer`: Tokenizer used to encode the data.
  - `mlm`: Whether to use masked language modeling (default is `True`).
  - `mlm_probability`: Probability of masking a token in the sequence.

- **Example Usage**:
  ```python
  from transformers import DataCollatorForLanguageModeling
  data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
  ```

- **Behind the Scenes**:
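  It expects inputs that are already tokenized, then randomly selects tokens (excluding special tokens) with probability `mlm_probability`. Of the selected tokens, 80% are replaced with the `[MASK]` token, 10% are replaced with a random token, and 10% are left unchanged. The labels hold the original token at the selected positions and -100 everywhere else, so the loss is computed only on the selected tokens.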
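
The selection and replacement rule can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `mask_tokens`, the seed, and the token IDs (101 and 102 for `[CLS]`/`[SEP]`, 103 for `[MASK]`, a vocabulary of 28996 entries) are assumptions chosen for the example.

```python
import random


def mask_tokens(input_ids, special_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one sequence, following the 80/10/10 rule."""
    rng = random.Random(seed)
    masked, labels = list(input_ids), [-100] * len(input_ids)
    for i, token in enumerate(input_ids):
        # Special tokens are never selected; others are selected with mlm_probability
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the loss is computed only on selected positions
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


ids = [101, 7270, 22961, 1528, 1840, 102]  # [CLS] ... [SEP]
masked, labels = mask_tokens(
    ids, special_ids={101, 102}, mask_id=103, vocab_size=28996, mlm_probability=0.5
)
print(masked)
print(labels)
```

Positions that were not selected keep a label of -100, which is why unmasked tokens do not contribute to the loss.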

#### 2.3. `DataCollatorForSeq2Seq`
This collator is specifically designed for sequence-to-sequence models (like BART, T5). It ensures the input and output sequences are padded separately.

- **Key Parameters**:
  - `tokenizer`: Tokenizer used to encode the input and output data.
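  - `model`: The sequence-to-sequence model. If it is provided and implements `prepare_decoder_input_ids_from_labels`, the collator uses it to build `decoder_input_ids` from the labels.
  - `padding`: Strategy to pad input and target sequences.
  - `max_length`: Length to pad to when `padding='max_length'`.
  - `label_pad_token_id`: ID used to pad the labels (default `-100`, which is ignored by PyTorch loss functions).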

- **Example Usage**:
  ```python
  from transformers import DataCollatorForSeq2Seq
  data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)
  ```

- **Behind the Scenes**:
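  When you provide a batch of tokenized source and target sequences, it pads them independently: inputs are padded with the tokenizer's pad token, and labels are padded with `label_pad_token_id` so that padded positions do not contribute to the loss. If a `model` was passed, the collator can also prepare the `decoder_input_ids` by shifting the labels to the right, which models like T5 and BART need for their decoder.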
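
A rough sketch of the two label-side steps, written in plain Python. The helper names `pad_labels` and `shift_right` and the token IDs (start and pad token 0, end-of-sequence token 1, as in T5-style vocabularies) are assumptions for illustration; the library delegates the shift to the model itself.

```python
def pad_labels(label_seqs, label_pad_token_id=-100):
    """Right-pad target sequences with an ID that the loss ignores."""
    longest = max(len(seq) for seq in label_seqs)
    return [seq + [label_pad_token_id] * (longest - len(seq)) for seq in label_seqs]


def shift_right(labels, decoder_start_token_id, pad_token_id):
    """Build decoder inputs: prepend the start token, drop the last label, replace -100."""
    shifted = [decoder_start_token_id] + labels[:-1]
    return [pad_token_id if tok == -100 else tok for tok in shifted]


labels = pad_labels([[71, 72, 1], [81, 1]])
decoder_input_ids = [shift_right(seq, decoder_start_token_id=0, pad_token_id=0) for seq in labels]
print(labels)             # [[71, 72, 1], [81, 1, -100]]
print(decoder_input_ids)  # [[0, 71, 72], [0, 81, 1]]
```

At each position the decoder sees the previous target token as input and is trained to predict the current one.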

#### 2.4. `DataCollatorForTokenClassification`
This collator is used in token classification tasks (e.g., Named Entity Recognition). It ensures that both the input sequences and the labels (tags for each token) are padded correctly.

- **Key Parameters**:
  - `tokenizer`: The tokenizer to handle input sequences.
  - `padding`: Strategy to pad inputs.
  - `label_pad_token_id`: ID used to pad the labels (default `-100`, which is ignored by PyTorch loss functions).

- **Example Usage**:
  ```python
  from transformers import DataCollatorForTokenClassification
  data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, label_pad_token_id=-100)
  ```

- **Behind the Scenes**:
  It ensures that token labels (tags) are padded along with the tokenized input sequences. This is important when dealing with models that predict labels for each token in the input (e.g., BERT for token classification).

### 3. **Creating Custom Data Collators**
While Hugging Face provides several built-in data collators, there are cases when you may need to create a custom one. Custom collators allow you to control how data is batched and processed before it reaches the model.

- **Custom Collator Example**:
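  ```python
  import torch


  class MyCustomCollator:
      def __init__(self, tokenizer, max_length=None, label_pad_token_id=-100):
          self.tokenizer = tokenizer
          self.max_length = max_length
          self.label_pad_token_id = label_pad_token_id

      def __call__(self, batch):
          # Pad the already-tokenized inputs; tokenizer.pad() also builds the attention mask.
          # max_length only takes effect with padding="max_length".
          inputs_padded = self.tokenizer.pad(
              {"input_ids": [item["input_ids"] for item in batch]},
              padding="max_length" if self.max_length else True,
              max_length=self.max_length,
              return_attention_mask=True,
              return_tensors="pt",
          )

          # Pad the labels by hand with label_pad_token_id so the loss ignores the padding.
          # This assumes right-side padding.
          labels = [list(item["labels"]) for item in batch]
          target_len = max(len(seq) for seq in labels)
          labels_padded = torch.tensor(
              [seq + [self.label_pad_token_id] * (target_len - len(seq)) for seq in labels]
          )

          return {
              "input_ids": inputs_padded["input_ids"],
              "attention_mask": inputs_padded["attention_mask"],
              "labels": labels_padded,
          }
  ```

  In this example, we define a collator that takes already-tokenized examples and pads the inputs and the labels separately. The inputs are padded with `tokenizer.pad()`, which also builds the attention mask. The labels are padded with -100 rather than the tokenizer's pad token, so that the loss ignores the padded positions. The collator returns PyTorch tensors (`return_tensors="pt"`), making it compatible with models that use PyTorch as a backend.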

### 4. **Collating with PyTorch's DataLoader**

The `DataCollator` is typically used in conjunction with PyTorch's `DataLoader`. Here's how it fits into a training loop:

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# Define your dataset (usually a Dataset object)
train_dataset = ...

# Initialize the collator and data loader
data_collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=32, collate_fn=data_collator)

# Training loop
for batch in train_dataloader:
    input_ids = batch['input_ids']
    attention_mask = batch['attention_mask']
    # Pass these to your model and compute loss, etc.
```

### 5. **Special Cases and Advanced Collation**

- **Dynamic Padding**: Data collators like `DataCollatorWithPadding` pad sequences dynamically within a batch, which is useful when sequences vary in length. It reduces wasted computation by avoiding padding to a fixed maximum length.
- **Separate Source and Target Processing**: In a source-target setup such as translation, the inputs and the labels are tokenized separately during preprocessing, before collation. The `DataCollatorForSeq2Seq` does not tokenize; it pads the encoder inputs and the labels separately, since their lengths usually differ.
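
To make the saving from dynamic padding concrete, here is a small arithmetic sketch. The sequence lengths, batch size, and fixed maximum length are made-up values for illustration.

```python
lengths = [12, 4, 9, 30, 7, 5, 11, 6]  # made-up token counts for eight examples
batch_size, max_length = 4, 128

# Fixed padding: every sequence is padded to max_length
fixed_pad = sum(max_length - n for n in lengths)

# Dynamic padding: each batch is padded only to its own longest sequence
dynamic_pad = 0
for start in range(0, len(lengths), batch_size):
    batch = lengths[start:start + batch_size]
    dynamic_pad += sum(max(batch) - n for n in batch)

print(fixed_pad, dynamic_pad)  # 940 80
```

With these numbers, fixed padding adds 940 padding tokens while dynamic padding adds 80, so the model processes far fewer positions that carry no information.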

# Model Evaluation Metrics: Perplexity, BLEU, and ROUGE

When evaluating models, particularly in Natural Language Processing (NLP), we often use a variety of metrics to assess the quality of model outputs. Below, I'll explain **perplexity**, **BLEU**, and **ROUGE**, their use cases, and how to implement them using Python and NumPy. I will also show how to use the `evaluate` library to compute these metrics, and how you can create custom evaluation metrics.

### 1. **Perplexity**

#### Explanation
**Perplexity** is commonly used to evaluate language models. It measures how well a probability distribution or probability model predicts a sample. A lower perplexity indicates the model is better at predicting the next word in a sequence. Mathematically, it is the exponentiated cross-entropy loss.

For a language model predicting the next word $ w_t $ in a sequence, perplexity is defined as:

$$
\text{Perplexity}(P) = \exp\left( - \frac{1}{N} \sum_{t=1}^{N} \log P(w_t | w_1, \dots, w_{t-1}) \right)
$$

Where:
- $ N $ is the number of words in the sequence.
- $ P(w_t | w_1, \dots, w_{t-1}) $ is the probability assigned by the model to the true next word.

#### Utilization with the `evaluate` library
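The `evaluate` library provides a `perplexity` metric. It does not take probabilities: it takes raw text strings as `predictions` and a `model_id` naming a causal language model on the Hub, and it scores the text with that model.

```python
import evaluate

# Load perplexity metric
perplexity_metric = evaluate.load("perplexity", module_type="metric")

# The texts to score (the metric tokenizes them and runs the model itself)
predictions = ["this is a test", "the quick brown fox jumps over the lazy dog"]

# Compute perplexity (downloads the model named by model_id)
results = perplexity_metric.compute(model_id="gpt2", predictions=predictions)
print("Perplexity:", results["mean_perplexity"])
```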

#### Implementation of Perplexity using NumPy

```python
import numpy as np

def perplexity(probs):
    N = len(probs)
    cross_entropy = -np.sum(np.log(probs)) / N
    return np.exp(cross_entropy)

# Example: probabilities assigned by a language model
probs = np.array([0.2, 0.3, 0.5, 0.7, 0.6])

# Calculate perplexity
pp = perplexity(probs)
print(f"Perplexity: {pp}")
```
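
As a sanity check on the definition, a model that spreads its probability uniformly over a vocabulary of size V should have a perplexity of V, and a confident model should be close to 1. The sketch below verifies this from log-probabilities; the function name and the numbers are illustrative assumptions.

```python
import math


def perplexity_from_log_probs(log_probs):
    """Perplexity is exp of the mean negative log-likelihood (the cross-entropy loss)."""
    mean_nll = -sum(log_probs) / len(log_probs)
    return math.exp(mean_nll)


vocab_size = 50
uniform = [math.log(1 / vocab_size)] * 10  # a model that guesses uniformly
confident = [math.log(0.9)] * 10           # a model that puts 0.9 on every true token

print(perplexity_from_log_probs(uniform))    # approximately 50
print(perplexity_from_log_probs(confident))  # approximately 1.11
```

This is also why the perplexity of a trained model can be read directly from its evaluation loss: it is the exponential of the mean cross-entropy.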

### 2. **BLEU (Bilingual Evaluation Understudy Score)**

#### Explanation
**BLEU** is a precision-based metric for evaluating machine translation and text generation models. It compares the n-grams (sequences of words) in the predicted sentence to those in the reference sentence(s). BLEU ranges from 0 to 1, where higher values indicate better translations.

The formula for BLEU is:

$$
\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
$$

Where:
- $ BP $ is the brevity penalty to penalize short translations.
- $ w_n $ are the weights (usually equal) for each n-gram.
- $ p_n $ is the precision for n-grams of size $ n $.
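
For reference, the brevity penalty in the standard BLEU definition compares the candidate length $ c $ with the reference length $ r $ (these two symbols are introduced here for the formula):

$$
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
\exp\left( 1 - \frac{r}{c} \right) & \text{if } c \le r
\end{cases}
$$

With the usual choice of $ N = 4 $ and equal weights $ w_n = \frac{1}{4} $, the exponential term is the geometric mean of the four n-gram precisions.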

#### Utilization with the `evaluate` library

```python
import evaluate

# Load BLEU metric
bleu_metric = evaluate.load("bleu")

# Example: predictions are strings; each prediction has a list of reference strings
predictions = ["this is a test"]
references = [["this is a test"]]

# Compute BLEU score
results = bleu_metric.compute(predictions=predictions, references=references)
print("BLEU Score:", results)
```

#### Implementation of BLEU using NumPy

```python
from collections import Counter
import numpy as np

def n_gram_precision(reference, candidate, n):
    ref_ngrams = Counter([tuple(reference[i:i+n]) for i in range(len(reference)-n+1)])
    cand_ngrams = Counter([tuple(candidate[i:i+n]) for i in range(len(candidate)-n+1)])
    
    match_count = sum((cand_ngrams & ref_ngrams).values())
    total_count = sum(cand_ngrams.values())
    
    return match_count / total_count if total_count > 0 else 0

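def brevity_penalty(candidate, reference):
    c = len(candidate)
    r = len(reference)
    if c == 0:
        return 0.0  # an empty candidate gets no credit (and avoids dividing by zero)
    return np.exp(1 - r / c) if c < r else 1.0

def bleu_score(reference, candidate, n_gram_weights=(0.25, 0.25, 0.25, 0.25)):
    bp = brevity_penalty(candidate, reference)
    p_n = [n_gram_precision(reference, candidate, n) for n in range(1, len(n_gram_weights) + 1)]

    # Without smoothing, a zero precision at any order makes the geometric mean zero
    if min(p_n) == 0:
        return 0.0

    score = bp * np.exp(np.sum([w * np.log(p) for w, p in zip(n_gram_weights, p_n)]))
    return score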

# Example usage
reference = ["this", "is", "a", "test"]
candidate = ["this", "is", "a", "test"]

bleu = bleu_score(reference, candidate)
print(f"BLEU Score: {bleu}")
```
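
The precision used by BLEU is clipped: an n-gram in the candidate is credited at most as many times as it occurs in the reference. The sketch below uses a well-known degenerate candidate to show why this matters; the sentences are illustrative.

```python
from collections import Counter

reference = "the cat is on the mat".split()
candidate = "the the the the the the the".split()

cand_counts = Counter(candidate)
ref_counts = Counter(reference)

# Unclipped: every "the" in the candidate appears somewhere in the reference
unclipped = sum(count for word, count in cand_counts.items() if word in ref_counts) / len(candidate)

# Clipped: each word is credited at most as often as it occurs in the reference
clipped = sum((cand_counts & ref_counts).values()) / len(candidate)

print(unclipped)  # 1.0
print(clipped)    # 2/7, about 0.2857
```

The `Counter` intersection (`&`) takes the minimum count per n-gram, which is exactly the clipping step.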

### 3. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**

#### Explanation
**ROUGE** measures the overlap between the n-grams in the generated text and the reference text. Unlike BLEU, ROUGE focuses on **recall** rather than precision. There are different versions of ROUGE, but the most common are:
- **ROUGE-N**: Measures n-gram overlap.
- **ROUGE-L**: Measures the longest common subsequence (LCS).
- **ROUGE-S**: Measures skip-bigrams (word pairs that occur in the same order, but not necessarily consecutively).
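
ROUGE-L can be sketched with a standard dynamic-programming computation of the longest common subsequence. This is a minimal illustration that reports a plain F1 score; the function names are assumptions, and published implementations may weight precision and recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(reference, candidate):
    lcs = lcs_length(reference, candidate)
    recall = lcs / len(reference) if reference else 0.0
    precision = lcs / len(candidate) if candidate else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_l("this is the test".split(), "this is a test".split())
print(scores)  # {'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

Here the longest common subsequence is "this is test" (3 words), so both precision and recall are 3/4.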

#### Utilization with the `evaluate` library

```python
import evaluate

# Load ROUGE metric
rouge_metric = evaluate.load("rouge")

# Example: predictions and references
predictions = ["this is a test"]
references = ["this is the test"]

# Compute ROUGE score
results = rouge_metric.compute(predictions=predictions, references=references)
print("ROUGE Score:", results)
```

#### Implementation of ROUGE-N in Python

```python
from collections import Counter

def n_gram_overlap(reference, candidate, n):
    ref_ngrams = Counter([tuple(reference[i:i+n]) for i in range(len(reference)-n+1)])
    cand_ngrams = Counter([tuple(candidate[i:i+n]) for i in range(len(candidate)-n+1)])
    
    match_count = sum((cand_ngrams & ref_ngrams).values())
    total_ref_count = sum(ref_ngrams.values())
    
    return match_count / total_ref_count if total_ref_count > 0 else 0

# Example usage
reference = ["this", "is", "the", "test"]
candidate = ["this", "is", "a", "test"]

rouge_1 = n_gram_overlap(reference, candidate, 1)
print(f"ROUGE-1 Score: {rouge_1}")
```

### Creating Custom Evaluation Metrics with `evaluate`

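You can also create custom evaluation metrics with the `evaluate` library by subclassing `evaluate.Metric`: the `_info()` method declares the expected inputs, and the `_compute()` method implements the metric.

#### Custom Metric Example

```python
import datasets
import evaluate


# Define a custom metric as a subclass of evaluate.Metric
class CustomAccuracy(evaluate.Metric):
    def _info(self):
        # Describe the metric and the type of each input
        return evaluate.MetricInfo(
            description="Fraction of predictions that match the references.",
            citation="",
            inputs_description="predictions and references are lists of integer labels.",
            features=datasets.Features(
                {"predictions": datasets.Value("int32"), "references": datasets.Value("int32")}
            ),
        )

    def _compute(self, predictions, references):
        correct = sum(p == r for p, r in zip(predictions, references))
        return {"accuracy": correct / len(predictions)}


# Create custom evaluation metric
custom_metric = CustomAccuracy()

# Example: predictions and references
predictions = [1, 0, 1, 1, 0]
references = [1, 0, 0, 1, 0]

# Compute custom accuracy
results = custom_metric.compute(predictions=predictions, references=references)
print("Custom Accuracy:", results)
```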

### Summary

- **Perplexity** is used for evaluating language models and measures uncertainty.
- **BLEU** is used for evaluating machine translation or text generation based on n-gram precision.
- **ROUGE** is more recall-oriented and is useful for summarization tasks.

The `evaluate` library makes it easy to compute these metrics and customize your own.

# Fine-tuning a model with the Trainer API

### Step 1: Preparing the Data

First things first, we need a dataset suitable for token classification. In this section we will use the [CoNLL-2003](https://huggingface.co/datasets/conll2003) dataset, which contains news stories from Reuters.

In [None]:
!pip install datasets



In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("eriktks/conll2003")

Inspecting this object shows us the columns present and the split between the training, validation, and test sets:

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In particular, we can see the dataset contains labels for the three tasks we mentioned earlier: NER, POS, and chunking. A big difference from other datasets is that the input texts are not presented as sentences or documents, but as lists of words (the input column is called `tokens`, but it contains words in the sense that these are pre-tokenized inputs that still need to go through the tokenizer for subword tokenization).



Let’s have a look at the first element of the training set:

In [None]:
raw_datasets["train"][0]["tokens"]

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

Since we want to perform named entity recognition, we will look at the NER tags:

In [None]:
raw_datasets["train"][0]["ner_tags"]

[3, 0, 7, 0, 0, 0, 7, 0, 0]

Those are the labels as integers ready for training, but they’re not necessarily useful when we want to inspect the data. Like for text classification, we can access the correspondence between those integers and the label names by looking at the features attribute of our dataset:

In [None]:
ner_feature = raw_datasets["train"].features["ner_tags"]
ner_feature

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

So this column contains elements that are sequences of ClassLabels. The type of the elements of the sequence is in the feature attribute of this ner_feature, and we can access the list of names by looking at the names attribute of that feature:



In [None]:
label_names = ner_feature.feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

We already saw these labels when digging into the token-classification pipeline in Chapter 6, but for a quick refresher:

- O means the word doesn’t correspond to any entity.
- B-PER/I-PER means the word corresponds to the beginning of/is inside a person entity.
- B-ORG/I-ORG means the word corresponds to the beginning of/is inside an organization entity.
- B-LOC/I-LOC means the word corresponds to the beginning of/is inside a location entity.
- B-MISC/I-MISC means the word corresponds to the beginning of/is inside a miscellaneous entity.

Now decoding the labels we saw earlier gives us this:

In [None]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["ner_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

EU    rejects German call to boycott British lamb . 
B-ORG O       B-MISC O    O  O       B-MISC  O    O 


As usual, our texts need to be converted to token IDs before the model can make sense of them.

To begin, let’s create our `tokenizer` object. As we said before, we will be using a BERT pretrained model, so we’ll start by downloading and caching the associated tokenizer:



In [None]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)




To tokenize a pre-tokenized input, we can use our `tokenizer` as usual and just add `is_split_into_words=True`:



In [None]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()

['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

As we can see, the tokenizer added the special tokens used by the model (`[CLS]` at the beginning and `[SEP]` at the end) and left most of the words untouched. The word lamb, however, was tokenized into two subwords, la and ##mb. This introduces a mismatch between our inputs and the labels: the list of labels has only 9 elements, whereas our input now has 12 tokens. Accounting for the special tokens is easy (we know they are at the beginning and the end), but we also need to make sure we align all the labels with the proper words. Because we are using a fast tokenizer, we can map each token back to the index of the word it came from with `word_ids()`:

In [None]:
inputs.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

With a tiny bit of work, we can then expand our label list to match the tokens. The first rule we’ll apply is that special tokens get a label of -100. This is because by default -100 is an index that is ignored in the loss function we will use (cross entropy). Then, each token gets the same label as the token that started the word it’s inside, since they are part of the same entity. For tokens inside a word but not at the beginning, we replace the B- with I- (since the token does not begin the entity):

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

Let’s try it out on our first sentence:



In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[3, 0, 7, 0, 0, 0, 7, 0, 0]
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]


As we can see, our function added the -100 for the two special tokens at the beginning and the end, and a new 0 for our word that was split into two tokens.
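
The first sentence has no entity word that is split into several tokens, so the B- to I- rule is not exercised there. The sketch below repeats the same logic in a compact form so that it runs on its own, and applies it to a made-up case: a person name (label 1, `B-PER`) split into three sub-tokens. The labels and word IDs are hypothetical.

```python
def align_labels_with_tokens(labels, word_ids):
    # Equivalent to the function defined earlier, written compactly for a standalone run
    new_labels, current_word = [], None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)             # special token
        elif word_id != current_word:
            new_labels.append(labels[word_id])  # first token of a word
        else:
            label = labels[word_id]
            # Odd IDs are B-XXX labels; the matching I-XXX label is the next ID
            new_labels.append(label + 1 if label % 2 == 1 else label)
        current_word = word_id
    return new_labels


labels = [1, 0]                         # hypothetical: B-PER, O
word_ids = [None, 0, 0, 0, 1, None]     # word 0 was split into three sub-tokens
print(align_labels_with_tokens(labels, word_ids))  # [-100, 1, 2, 2, 0, -100]
```

Only the first sub-token keeps `B-PER` (1); the continuation tokens become `I-PER` (2). The trick relies on the label list alternating B- and I- labels after `O`.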



To preprocess our whole dataset, we need to tokenize all the inputs and apply align_labels_with_tokens() on all the labels. To take advantage of the speed of our fast tokenizer, it’s best to tokenize lots of texts at the same time, so we’ll write a function that processes a list of examples and use the Dataset.map() method with the option batched=True. The only thing that is different from our previous example is that the word_ids() function needs to get the index of the example we want the word IDs of when the inputs to the tokenizer are lists of texts (or in our case, list of lists of words), so we add that too:



In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

Note that we haven’t padded our inputs yet; we’ll do that later, when creating the batches with a data collator.

We can now apply all that preprocessing in one go on all the splits of our dataset:

In [None]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)


In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})

In [None]:
tokenized_datasets["train"][0:2]

{'input_ids': [[101,
   7270,
   22961,
   1528,
   1840,
   1106,
   21423,
   1418,
   2495,
   12913,
   119,
   102],
  [101, 1943, 14428, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]],
 'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100], [-100, 1, 2, -100]]}

We can’t just use a DataCollatorWithPadding like in Chapter 3 because that only pads the inputs (input IDs, attention mask, and token type IDs). Here our labels should be padded the exact same way as the inputs so that they stay the same size, using -100 as a value so that the corresponding predictions are ignored in the loss computation.

This is all done by a DataCollatorForTokenClassification. Like the DataCollatorWithPadding, it takes the tokenizer used to preprocess the inputs:



In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

To test this on a few samples, we can just call it on a list of examples from our tokenized training set:

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])

Let’s compare this to the labels for the first and second elements in our dataset:

In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
[-100, 1, 2, -100]


As we can see, the second set of labels has been padded to the length of the first one using -100s.

### Step 2: Defining Metrics

To have the Trainer compute a metric every epoch, we will need to define a `compute_metrics()` function that takes the arrays of predictions and labels, and returns a dictionary with the metric names and values.

The traditional framework used to evaluate token classification prediction is `seqeval`. To use this metric, we first need to install the `seqeval` library:

In [None]:
!pip install seqeval



In [None]:
!pip install evaluate



We can then load it via the `evaluate.load()` function:

In [None]:
import evaluate

metric = evaluate.load("seqeval")


This metric does not behave like the standard accuracy: it will actually take the lists of labels as strings, not integers, so we will need to fully decode the predictions and labels before passing them to the metric. Let’s see how it works. First, we’ll get the labels for our first training example:

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]
labels

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

We can then create fake predictions for those by just changing the value at index 2:



In [None]:
predictions = labels.copy()
predictions[2] = "O"
metric.compute(predictions=[predictions], references=[labels])

{'MISC': {'precision': 1.0,
  'recall': 0.5,
  'f1': 0.6666666666666666,
  'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 0.6666666666666666,
 'overall_f1': 0.8,
 'overall_accuracy': 0.8888888888888888}

Note that the metric takes a list of predictions (not just one) and a list of labels.


This is sending back a lot of information! We get the precision, recall, and F1 score for each separate entity, as well as overall. For our metric computation we will only keep the overall score, but feel free to tweak the compute_metrics() function to return all the metrics you would like reported.
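
The overall numbers above are entity-level scores, not token-level ones: an entity counts as correct only if its type and span both match. The sketch below reproduces them by hand for the same fake prediction. It is a simplified stand-in for `seqeval` (the helper `extract_entities` is an assumption and ignores stray `I-` tags), not its actual implementation.

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a BIO-tagged sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and tag[2:] == current
        if current is not None and not inside:
            entities.add((current, start, i))
            current = None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return entities


gold = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
pred = ["B-ORG", "O", "O", "O", "O", "O", "B-MISC", "O", "O"]

gold_entities, pred_entities = extract_entities(gold), extract_entities(pred)
correct = len(gold_entities & pred_entities)
precision = correct / len(pred_entities)   # 2 of 2 predicted entities are right
recall = correct / len(gold_entities)      # 2 of 3 gold entities were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)  # token-level
print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
# 1.0 0.6667 0.8 0.8889
```

Missing one of the three entities costs recall but not precision, which is why the overall F1 is 0.8 while the token-level accuracy stays at 8/9.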

This `compute_metrics()` function first takes the argmax of the logits to convert them to predictions (as usual, the logits and the probabilities are in the same order, so we don’t need to apply the softmax). Then we have to convert both labels and predictions from integers to strings. We remove all the values where the label is -100, then pass the results to the `metric.compute()` method:

In [None]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
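
To see the argmax and filtering steps in isolation, here is a small sketch on toy data. The logits, the labels, and the shortened label set are made up for illustration, and the call to the metric is left out.

```python
import numpy as np

label_names = ["O", "B-PER", "I-PER"]  # shortened, made-up label set

# One sequence of four tokens and three classes: shape (batch, tokens, classes)
logits = np.array([[[2.0, 0.1, 0.3], [0.2, 3.0, 0.1], [0.1, 0.4, 2.5], [1.5, 0.2, 0.1]]])
labels = np.array([[-100, 1, 2, -100]])  # first and last positions are special tokens

# Highest logit per token; applying softmax first would not change the result
predictions = np.argmax(logits, axis=-1)

# Drop positions labeled -100 and convert IDs to label strings
true_labels = [[label_names[l] for l in row if l != -100] for row in labels]
true_predictions = [
    [label_names[p] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(predictions, labels)
]
print(true_labels)       # [['B-PER', 'I-PER']]
print(true_predictions)  # [['B-PER', 'I-PER']]
```

The prediction at a special-token position is discarded even when the model outputs something there, because the filter is driven by the labels, not by the predictions.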

Now that this is done, we are almost ready to define our `Trainer`. We just need a `model` to fine-tune!



### Step 3: Defining the model


Since we are working on a token classification problem, we will use the `AutoModelForTokenClassification` class. The main thing to remember when defining this model is to pass along some information on the number of labels we have. The easiest way to do this is to pass that number with the `num_labels` argument, but if we want a nice inference widget working like the one we saw at the beginning of this section, it’s better to set the correct label correspondences instead.



They should be set by two dictionaries, `id2label` and `label2id`, which contain the mappings from ID to label and vice versa:



In [None]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

Now we can just pass them to the `AutoModelForTokenClassification.from_pretrained()` method, and they will be set in the model’s configuration and then properly saved and uploaded to the Hub:



In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let’s double-check that our model has the right number of labels:

In [None]:
model.config.num_labels

9

### Step 4: Fine-tuning the model

We are now ready to train our model! We just need to do two last things before we define our `Trainer`: log in to Hugging Face and define our training arguments. If you’re working in a notebook, the `notebook_login()` convenience function takes care of the login.



Calling it displays a widget where you can enter your Hugging Face login credentials.



In [None]:
from huggingface_hub import notebook_login

notebook_login()
# Outside a notebook, run `huggingface-cli login` in a terminal instead

Once this is done, we can define our `TrainingArguments`:



In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    "finetuned-geeks",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True
)



You’ve seen most of those before: we set some hyperparameters (like the learning rate, the number of epochs to train for, and the weight decay), we set `evaluation_strategy="epoch"` and `save_strategy="epoch"` to evaluate and save the model at the end of every epoch, and we specify `push_to_hub=True` to indicate that we want to upload our results to the Model Hub. Newer `transformers` releases call the first of these arguments `eval_strategy`.

Note that you can specify the name of the repository you want to push to with the `hub_model_id` argument (in particular, you will have to use this argument to push to an organization). By default, the repository used will be in your namespace and named after the output directory you set.
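
The naming rule described above can be sketched in a few lines of plain Python. This is only an illustration of the rule, not the library's own code: `resolve_repo_id` is a hypothetical helper, `my-org` is a made-up organization, and the namespace is passed in explicitly rather than being looked up from your login.

```python
from pathlib import Path


def resolve_repo_id(output_dir, namespace, hub_model_id=None):
    """Hypothetical helper: which Hub repository would the model be pushed to?"""
    if hub_model_id is not None:
        # An explicit ID wins; "org/name" targets an organization,
        # while a bare name stays in the user's own namespace.
        if "/" in hub_model_id:
            return hub_model_id
        return f"{namespace}/{hub_model_id}"
    # Default: the repository is named after the output directory.
    return f"{namespace}/{Path(output_dir).name}"


print(resolve_repo_id("finetuned-geeks", "sampurnr"))
# sampurnr/finetuned-geeks
print(resolve_repo_id("finetuned-geeks", "sampurnr", hub_model_id="my-org/bert-ner"))
# my-org/bert-ner
```

With the arguments used here, the output directory `finetuned-geeks` leads to a repository of the same name in the user's namespace.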

Finally, we just pass everything to the `Trainer` and launch the training:



In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.072 | 0.064491 | 0.905599 | 0.936385 | 0.920735 | 0.982619 |
| 2 | 0.0328 | 0.06708 | 0.928796 | 0.946146 | 0.937391 | 0.985195 |
| 3 | 0.0215 | 0.061545 | 0.933443 | 0.951195 | 0.942236 | 0.986446 |
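
As a quick sanity check on these metrics, the F1 column should be the harmonic mean of precision and recall. The sketch below recomputes it from the reported values. The variable and function names are ours, and tiny differences are expected because the inputs are rounded to six decimals.

```python
# (precision, recall, reported F1) per epoch, copied from the results above.
results = {
    1: (0.905599, 0.936385, 0.920735),
    2: (0.928796, 0.946146, 0.937391),
    3: (0.933443, 0.951195, 0.942236),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {epoch: f1_score(p, r) for epoch, (p, r, _) in results.items()}

for epoch, (_, _, reported) in results.items():
    print(f"epoch {epoch}: recomputed F1 = {recomputed[epoch]:.6f}, reported = {reported:.6f}")
```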


TrainOutput(global_step=5268, training_loss=0.05452226017493594, metrics={'train_runtime': 687.525, 'train_samples_per_second': 61.268, 'train_steps_per_second': 7.662, 'total_flos': 920771584279074.0, 'train_loss': 0.05452226017493594, 'epoch': 3.0})
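
The numbers in this `TrainOutput` are enough to work out how the run was laid out. The arithmetic below is a rough sketch that assumes a single device and no gradient accumulation, neither of which is stated in the output itself.

```python
import math

# Values copied from the TrainOutput above.
global_step = 5268
num_epochs = 3
train_runtime = 687.525        # seconds
samples_per_second = 61.268
steps_per_second = 7.662

steps_per_epoch = global_step // num_epochs                      # 1756
samples_seen = samples_per_second * train_runtime                # about 42,123
samples_per_epoch = round(samples_seen / num_epochs)             # 14041
implied_batch_size = round(samples_per_epoch / steps_per_epoch)  # 8

print(steps_per_epoch, samples_per_epoch, implied_batch_size)

# Cross-checks: the last batch of each epoch is only partially filled,
# and throughput times runtime recovers the total step count.
print(math.ceil(samples_per_epoch / implied_batch_size))  # 1756
print(round(steps_per_second * train_runtime))            # 5268
```

An implied batch size of 8 is consistent with the `TrainingArguments` above, which leave the per-device batch size at its default.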

Note that while the training happens, each time the model is saved (here, every epoch) it is uploaded to the Hub in the background. This way, you will be able to resume your training on another machine if necessary.

Once the training is complete, we use the `push_to_hub()` method to make sure we upload the most recent version of the model:


In [None]:
trainer.push_to_hub(commit_message="Training complete")

CommitInfo(commit_url='https://huggingface.co/sampurnr/finetuned-geeks/commit/0be48b648361ad591766a3a00af0e1b1f571dcb9', commit_message='Training complete', commit_description='', oid='0be48b648361ad591766a3a00af0e1b1f571dcb9', pr_url=None, pr_revision=None, pr_num=None)

This command returns a `CommitInfo` object that includes the URL of the commit it just created, in case you want to inspect it.

The `Trainer` also drafts a model card with all the evaluation results and uploads it. At this stage, you can use the inference widget on the Model Hub to test your model and share it with your friends. You have successfully fine-tuned a model on a token classification task. Congratulations!

### Step 5: Using the fine-tuned model


In [None]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "sampurnr/finetuned-geeks"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
token_classifier("My name is Sampurn and I work at GFG in India.")


[{'entity_group': 'PER',
  'score': 0.9981095,
  'word': 'Sampurn',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.99874216,
  'word': 'GFG',
  'start': 33,
  'end': 36},
 {'entity_group': 'LOC',
  'score': 0.9994679,
  'word': 'India',
  'start': 40,
  'end': 45}]
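
The output above contains one entry per entity rather than one per token because of `aggregation_strategy="simple"`, which merges neighbouring tokens of the same entity type. The sketch below is a simplified illustration of that idea, not the pipeline's actual implementation. The helper names, the WordPiece split of "Sampurn", and the per-token scores are all hypothetical.

```python
from statistics import mean


def split_tag(label):
    """Split 'B-PER' into ('B', 'PER'); a bare label such as 'O' gets prefix 'I'."""
    if label.startswith(("B-", "I-")):
        return label[0], label[2:]
    return "I", label


def join_wordpieces(tokens):
    """Glue WordPiece tokens back together: '##' marks a continuation."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]
        else:
            text += (" " if text else "") + token
    return text


def group_entities(token_predictions):
    """Merge consecutive tokens that share an entity type into one span."""
    groups = []
    for pred in token_predictions:
        prefix, tag = split_tag(pred["entity"])
        same_type = bool(groups) and split_tag(groups[-1][-1]["entity"])[1] == tag
        if same_type and prefix != "B":
            groups[-1].append(pred)  # continue the current span
        else:
            groups.append([pred])    # a B- tag or a new type starts a new span

    results = []
    for group in groups:
        tag = split_tag(group[0]["entity"])[1]
        if tag == "O":
            continue  # non-entity tokens are dropped from the output
        results.append({
            "entity_group": tag,
            "score": mean(p["score"] for p in group),  # average token score
            "word": join_wordpieces(p["word"] for p in group),
            "start": group[0]["start"],
            "end": group[-1]["end"],
        })
    return results


# Hypothetical token-level predictions for part of the sentence above.
token_predictions = [
    {"entity": "B-PER", "score": 0.998, "word": "Sam", "start": 11, "end": 14},
    {"entity": "I-PER", "score": 0.999, "word": "##pur", "start": 14, "end": 17},
    {"entity": "I-PER", "score": 0.997, "word": "##n", "start": 17, "end": 18},
    {"entity": "O", "score": 0.999, "word": "and", "start": 19, "end": 22},
    {"entity": "B-ORG", "score": 0.9987, "word": "GFG", "start": 33, "end": 36},
    {"entity": "O", "score": 0.999, "word": "in", "start": 37, "end": 39},
    {"entity": "B-LOC", "score": 0.9995, "word": "India", "start": 40, "end": 45},
]

entities = group_entities(token_predictions)
for entity in entities:
    print(entity["entity_group"], entity["word"], entity["start"], entity["end"])
```

The three hypothetical pieces of "Sampurn" collapse into a single `PER` span covering characters 11 to 18, which matches the shape of the pipeline output shown above.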