<a href="https://colab.research.google.com/github/chineidu/NLP-Tutorial/blob/main/notebook/06_Transformers/06a_tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training A New Tokenizer From An Old Tokenizer

- See [chapter 6 of the Hugging Face NLP course](https://huggingface.co/learn/nlp-course/chapter6/2?fw=pt) for details on training a new tokenizer from an existing one.

In [1]:
!pip install rich
!pip install transformers[torch]
!pip install torch datasets evaluate

In [2]:
# Built-in library
import re
import json
from typing import Any, Dict, List, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
import pandas as pd
from rich import print

# Visualization
import matplotlib.pyplot as plt


# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

warnings.filterwarnings("ignore")

# Black code formatter (Optional)
# %load_ext lab_black

# auto reload imports
# %load_ext autoreload
# %autoreload 2

<hr><br>

## Batch Encoding Using Fast Tokenizers

In [3]:
from transformers import AutoTokenizer


CHECKPOINT: str = "bert-base-cased"
tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
example: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."
encoding: dict[str, Any] = tokenizer(example)  # dict-like BatchEncoding

print(type(encoding))

In [4]:
print(encoding)

# Access the tokens (w/o converting the IDs back to tokens)
print(encoding.tokens())

In [5]:
# Get the index of the word each token comes from.
# The special tokens [CLS] and [SEP] are represented as None.
print(encoding.word_ids())

In [6]:
# Try another tokenizer!
CHECKPOINT: str = "roberta-base"
tokenizer_2: AutoTokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
example_2: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."
encoding_2: dict[str, Any] = tokenizer_2(example_2)  # dict-like BatchEncoding

print(encoding_2)

# Access the tokens (w/o converting the IDs back to tokens)
print(encoding_2.tokens())

In [7]:
print(example)
print(encoding.word_ids())

# Access the tokens (w/o converting the IDs back to tokens)
print(encoding.tokens())

```text
- We can map any word or token to its character span in the original text, and vice versa, via:
  - word_to_chars() or token_to_chars()
  - char_to_word() or char_to_token()

- The word_ids() method told us that ##ei is part of the word at index 3. To find out which word that is in the sentence, we can do this:
```

In [8]:
start, end = encoding.word_to_chars(3)
example[start:end]

'Chineidu'
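
The word-to-character lookup above can be mimicked without loading a tokenizer. The sketch below is only an illustration: it assumes that splitting on runs of word characters and single punctuation marks is close enough to BERT's pre-tokenization for this sentence, and the helper names `word_to_chars` and `char_to_word` are hypothetical stand-ins for the `BatchEncoding` methods, not their implementation.

```python
import re
from typing import Optional, Tuple

example = "My name is Chineidu and I work at Hugging Face In Brooklyn."

# Assumption: words are runs of word characters or single punctuation marks.
# This only approximates the tokenizer's real pre-tokenization step.
word_spans = [m.span() for m in re.finditer(r"\w+|[^\w\s]", example)]


def word_to_chars(word_index: int) -> Tuple[int, int]:
    """Return the (start, end) character span of the word at `word_index`."""
    return word_spans[word_index]


def char_to_word(char_index: int) -> Optional[int]:
    """Return the index of the word covering `char_index` (None for whitespace)."""
    for idx, (start, end) in enumerate(word_spans):
        if start <= char_index < end:
            return idx
    return None


start, end = word_to_chars(3)
print(example[start:end])  # Chineidu
print(char_to_word(36))    # 8, the word "Hugging"
```

Keeping the spans as `(start, end)` pairs with an exclusive end means they can be used directly as Python slices, which is the same convention the `start`/`end` fields of the pipeline output follow.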

<hr><br>

## [Token Classification Pipeline](https://huggingface.co/learn/nlp-course/chapter6/3?fw=pt)

```text
- Using a token classification pipeline, we can get some results to compare manually.
- The model used by default is dbmdz/bert-large-cased-finetuned-conll03-english and it performs NER on sentences.
```

In [9]:
from transformers import pipeline


TASK: str = "token-classification"  # Named Entity Recognition (NER)
token_classifier: pipeline = pipeline(task=TASK)
example: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."

token_classifier(example)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity': 'I-PER',
  'score': 0.99802446,
  'index': 4,
  'word': 'Chin',
  'start': 11,
  'end': 15},
 {'entity': 'I-PER',
  'score': 0.96976656,
  'index': 5,
  'word': '##ei',
  'start': 15,
  'end': 17},
 {'entity': 'I-PER',
  'score': 0.99290186,
  'index': 6,
  'word': '##du',
  'start': 17,
  'end': 19},
 {'entity': 'I-ORG',
  'score': 0.99207014,
  'index': 11,
  'word': 'Hu',
  'start': 34,
  'end': 36},
 {'entity': 'I-ORG',
  'score': 0.99378514,
  'index': 12,
  'word': '##gging',
  'start': 36,
  'end': 41},
 {'entity': 'I-ORG',
  'score': 0.9924396,
  'index': 13,
  'word': 'Face',
  'start': 42,
  'end': 46},
 {'entity': 'I-LOC',
  'score': 0.9217939,
  'index': 15,
  'word': 'Brooklyn',
  'start': 50,
  'end': 58}]

<br>

#### Comment

```text
- The model correctly identified every token of “Chineidu” as a person, every token of “Hugging Face” as an organization, and the token “Brooklyn” as a location. We can also ask the pipeline to group the tokens that belong to the same entity:
```

In [10]:
from transformers import pipeline


TASK: str = "token-classification"  # Named Entity Recognition (NER)

# With "simple" the score is just the mean of the scores of each token in the
# given entity: e.g., the score of “Chineidu” is the mean of the scores
# we saw in the previous example for the tokens Chin, ##ei, and ##du
token_classifier: pipeline = pipeline(task=TASK, aggregation_strategy="simple")
example: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."

token_classifier(example)


[{'entity_group': 'PER',
  'score': 0.98689765,
  'word': 'Chineidu',
  'start': 11,
  'end': 19},
 {'entity_group': 'ORG',
  'score': 0.99276495,
  'word': 'Hugging Face',
  'start': 34,
  'end': 46},
 {'entity_group': 'LOC',
  'score': 0.9217939,
  'word': 'Brooklyn',
  'start': 50,
  'end': 58}]

#### Other Strategies:

```text
- "first", where the score of each entity is the score of the first token of that entity (so for “Chineidu” it would be 0.99802446, the score of the token Chin)

- "max", where the score of each entity is the maximum score of the tokens in that entity (so for “Hugging Face” it would be 0.99378514, the score of the token ##gging)

- "average", where the score of each entity is the average of the scores of the words composing that entity (so for “Chineidu” there would be no difference from the "simple" strategy, but “Hugging Face” would have a score of about 0.9927, the average of the scores for “Hugging”, about 0.9929 (the mean of Hu and ##gging), and “Face”, 0.9924)
```
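
To make the difference between the strategies concrete, the sketch below applies the four descriptions literally to the per-token scores from the un-aggregated pipeline output. It is an illustration under that assumption and is not the library's own aggregation code; the `aggregate` helper and the `ENTITIES` layout are hypothetical.

```python
from statistics import mean

# Token scores copied from the un-aggregated pipeline output.
# Each entity maps its words to the scores of their sub-word tokens.
ENTITIES = {
    "Chineidu": {"Chineidu": [0.99802446, 0.96976656, 0.99290186]},
    "Hugging Face": {"Hugging": [0.99207014, 0.99378514], "Face": [0.9924396]},
}


def aggregate(words: dict, strategy: str) -> float:
    """Score one entity following the textual description of each strategy."""
    token_scores = [score for scores in words.values() for score in scores]
    if strategy == "simple":   # mean over all tokens
        return mean(token_scores)
    if strategy == "first":    # score of the first token
        return token_scores[0]
    if strategy == "max":      # highest token score
        return max(token_scores)
    if strategy == "average":  # mean of the per-word means
        return mean([mean(scores) for scores in words.values()])
    raise ValueError(f"unknown strategy: {strategy}")


for name, words in ENTITIES.items():
    for strategy in ("simple", "first", "max", "average"):
        print(f"{name:<13} {strategy:<8} {aggregate(words, strategy):.4f}")
```

For a single-word entity such as “Chineidu”, "simple" and "average" coincide; they differ only when an entity spans several words that are split into different numbers of tokens.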

In [11]:
from transformers import AutoTokenizer, AutoModelForTokenClassification


model_checkpoint: str = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model: AutoModelForTokenClassification = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."
inputs: dict[str, Any] = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)



In [15]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [14]:
print(inputs["input_ids"].shape)
print(outputs.logits.shape)

#### Comment:

```text
- The output is a batch with 1 sequence of 18 tokens and the model has 9 different labels, so the output of the model has a shape of 1 x 18 x 9.

- As with the text classification pipeline, a softmax function converts those logits to probabilities, and the argmax gives the predictions (we can take the argmax directly on the logits because softmax is monotonic and therefore preserves their order).
```
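
The claim that softmax does not change the order can be checked with a few lines of plain Python. The logits below are made-up values for a single token over 9 labels, not the model's actual output, and `softmax` and `argmax` are small helper functions written for this sketch.

```python
import math


def softmax(logits: list) -> list:
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: list) -> int:
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits for one token over 9 labels.
logits = [-1.2, 0.3, -0.5, 0.1, 6.4, -2.0, 1.1, -0.7, 0.9]
probs = softmax(logits)

print(argmax(logits), argmax(probs))  # same index both times
print(round(sum(probs), 6))           # probabilities sum to 1
```

Because the exponential is strictly increasing and every value is divided by the same total, the ranking of the labels is identical before and after the softmax.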

In [17]:
import torch.nn.functional as F


probabilities: list[list[float]] = F.softmax(outputs.logits, dim=-1)[0].tolist()
predictions: list[int] = outputs.logits.argmax(dim=-1)[0].tolist()
print(predictions)

In [19]:
# The model.config.id2label attribute contains the mapping of indexes to labels
# that we can use to make sense of the predictions:
print(model.config.id2label)

In [23]:
print(probabilities)

#### Note:

```text
- There are 9 labels:
  - O is the label for the tokens that are not in any named entity (it stands for “outside”), and we then have two labels for each type of entity (miscellaneous, person, organization, and location).
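  - The label B-XXX indicates the token is at the beginning of an entity XXX and the label I-XXX indicates the token is inside the entity XXX. For instance, in the current example we might expect the model to classify the token `Chin` as B-PER (beginning of a person entity) and the tokens ##ei and ##du as I-PER (inside a person entity).
  - The pipeline output above nevertheless labels `Chin` as I-PER. That is not an error: this model was fine-tuned on data in the IOB1 format, where B- labels are only used to separate two adjacent entities of the same type.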
```

In [None]:
# With this map, we can reproduce (almost entirely) the results of the first pipeline:
# grab the score and label of each token that was not classified as O.
results: list[dict[str, Any]] = []
tokens: list[str] = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":  # skip tokens that are outside any named entity
        results.append(
            {"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
        )

print(results)
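
The per-token labels can then be merged into whole entities, which is roughly what the grouped pipeline output shows. The sketch below is a simplified illustration, not the pipeline's implementation: it takes the labels, token indexes, and character offsets from the first pipeline output, and assumes that consecutive tokens with the same entity type belong together unless a B- label starts a new entity. The `group_entities` helper is hypothetical.

```python
example = "My name is Chineidu and I work at Hugging Face In Brooklyn."

# Labels, token indexes, and offsets copied from the first pipeline output.
token_results = [
    {"entity": "I-PER", "index": 4, "start": 11, "end": 15},
    {"entity": "I-PER", "index": 5, "start": 15, "end": 17},
    {"entity": "I-PER", "index": 6, "start": 17, "end": 19},
    {"entity": "I-ORG", "index": 11, "start": 34, "end": 36},
    {"entity": "I-ORG", "index": 12, "start": 36, "end": 41},
    {"entity": "I-ORG", "index": 13, "start": 42, "end": 46},
    {"entity": "I-LOC", "index": 15, "start": 50, "end": 58},
]


def group_entities(results: list, text: str) -> list:
    """Merge consecutive tokens of the same entity type into one span."""
    groups = []
    for tok in results:
        prefix, label = tok["entity"].split("-", 1)
        last = groups[-1] if groups else None
        continues = (
            last is not None
            and prefix != "B"                      # B- always opens a new entity
            and last["entity_group"] == label
            and tok["index"] == last["index"] + 1  # no O token in between
        )
        if continues:
            last["end"], last["index"] = tok["end"], tok["index"]
        else:
            groups.append({"entity_group": label, "index": tok["index"],
                           "start": tok["start"], "end": tok["end"]})
    # Recover the surface text from the character offsets, not from the tokens.
    return [{"entity_group": g["entity_group"], "word": text[g["start"]:g["end"]],
             "start": g["start"], "end": g["end"]} for g in groups]


for entity in group_entities(token_results, example):
    print(entity)
```

Slicing the original text with the merged offsets avoids having to strip the `##` prefixes and guess where spaces belong, which is why the word for the organization comes out as "Hugging Face".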

In [22]:
model.config.id2label[0]

'O'