# NLP Course

Please see the [Hugging Face NLP Course page](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt).

## 6. The 🤗 Tokenizers library

#### Assembling a corpus


> The repository for code_search_net contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/code_search_net.
> You can avoid this prompt in future by passing the argument `trust_remote_code=True`.
>
> Do you wish to run the custom code? [y/N]


In [1]:
from datasets import load_dataset

# This can take a few minutes to load, so grab a coffee or tea while you wait!
raw_datasets = load_dataset("code_search_net", "python", trust_remote_code=True)
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 412178
    })
    test: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 22176
    })
    validation: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 23107
    })
})

In [2]:
raw_datasets["train"]

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})

In [3]:
print(raw_datasets["train"][123456]["whole_func_string"])

def _compress_for_distribute(max_vol, plan, **kwargs):
    """
    Combines as many dispenses as can fit within the maximum volume
    """
    source = None
    new_source = None
    a_vol = 0
    temp_dispenses = []
    new_transfer_plan = []
    disposal_vol = kwargs.get('disposal_vol', 0)
    max_vol = max_vol - disposal_vol

    def _append_dispenses():
        nonlocal a_vol, temp_dispenses, new_transfer_plan, source
        if not temp_dispenses:
            return
        added_volume = 0
        if len(temp_dispenses) > 1:
            added_volume = disposal_vol
        new_transfer_plan.append({
            'aspirate': {
                'location': source,
                'volume': a_vol + added_volume
            }
        })
        for d in temp_dispenses:
            new_transfer_plan.append({
                'dispense': {
                    'location': d['location'],
                    'volume': d['volume']
                }
            })
        a_vol = 0
        temp

> Using a Python generator, we can avoid Python loading anything into memory
> until it’s actually necessary. To create such a generator,
> <span style="background-color:#33ffff;">you just need to replace the brackets with parentheses</span>.

In [4]:
gen = (i for i in range(10))
print(list(gen))
print(list(gen))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]


In [5]:
def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )


training_corpus = get_training_corpus()

> You can also define your generator inside a for loop by using the yield statement...
> which will produce the exact same generator as before,
> <span style="background-color:#33ffff">but allows you to use more complex logic than you can in a list comprehension</span>
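
As a minimal sketch of the `yield` form, the function below mirrors `get_training_corpus()` but walks a small stand-in list instead of `raw_datasets["train"]`. The `toy_dataset` name, its contents, and the `dataset` / `batch_size` parameters are assumptions made for illustration only.

```python
def get_training_corpus(dataset, batch_size=1000):
    """Yield successive batches of texts from `dataset` (any sliceable sequence)."""
    for start_idx in range(0, len(dataset), batch_size):
        # Each `yield` hands back one batch and pauses until the next one is requested
        yield dataset[start_idx : start_idx + batch_size]


# Hypothetical stand-in for the "whole_func_string" column of the real dataset
toy_dataset = [f"def f{i}(): pass" for i in range(5)]

batches = list(get_training_corpus(toy_dataset, batch_size=2))
print([len(batch) for batch in batches])
```

With the real dataset, the loop body would slice `raw_datasets["train"]` and yield the `"whole_func_string"` column of each slice; the function form leaves room for extra logic (filtering, cleaning) between the slice and the `yield`.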

#### Training a new tokenizer

In [6]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [7]:
type(old_tokenizer)

transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast

In [8]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
print(tokens)

['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


In [9]:
old_tokenizer.vocab_size

50257

##### Please read

> Note that AutoTokenizer.train_new_from_iterator() only works if the tokenizer you are using is a “fast” tokenizer. 

* [API documentation for `PreTrainedTokenizerFast.train_new_from_iterator`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.train_new_from_iterator)
* Stack Overflow question from alvas: [How to add new tokens to an existing Huggingface AutoTokenizer?](https://stackoverflow.com/questions/76198051/how-to-add-new-tokens-to-an-existing-huggingface-tokenizer)

In [10]:
%%time
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)




CPU times: user 6min 8s, sys: 10.7 s, total: 6min 19s
Wall time: 1min 42s


In [11]:
tokens = tokenizer.tokenize(example)
print(tokens)

['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


In [12]:
print(len(tokens))
print(len(old_tokenizer.tokenize(example)))

27
36


In [13]:
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """

print(tokenizer.tokenize(example))

['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weights', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']


In [14]:
tokenizer.save_pretrained("code-search-net-tokenizer")

('code-search-net-tokenizer/tokenizer_config.json',
 'code-search-net-tokenizer/special_tokens_map.json',
 'code-search-net-tokenizer/vocab.json',
 'code-search-net-tokenizer/merges.txt',
 'code-search-net-tokenizer/added_tokens.json',
 'code-search-net-tokenizer/tokenizer.json')

In [15]:
!ls -la code-search-net-tokenizer

total 4864
drwxr-xr-x 2 so_olliphant so_olliphant    4096 Feb 18 05:13 .
drwxr-xr-x 8 so_olliphant so_olliphant    4096 Feb 18 05:13 ..
-rw-r--r-- 1 so_olliphant so_olliphant  466894 Feb 18 05:13 merges.txt
-rw-r--r-- 1 so_olliphant so_olliphant      99 Feb 18 05:13 special_tokens_map.json
-rw-r--r-- 1 so_olliphant so_olliphant 3673415 Feb 18 05:13 tokenizer.json
-rw-r--r-- 1 so_olliphant so_olliphant     471 Feb 18 05:13 tokenizer_config.json
-rw-r--r-- 1 so_olliphant so_olliphant  822037 Feb 18 05:13 vocab.json


...

In [16]:
# Reload the tokenizer from the local directory written by save_pretrained() above
tokenizer = AutoTokenizer.from_pretrained("code-search-net-tokenizer")

#### Fast tokenizers’ special powers

> <span style="background-color:#33FFFF"><b>Slow</b> tokenizers</span> are those written in Python inside the 🤗 Transformers library<p/>
> <span style="background-color:#AFFF33">the <b>fast</b> versions</span> are the ones provided by 🤗 Tokenizers, which are written in Rust

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(f"tokenizer.is_fast ? {tokenizer.is_fast}")

tokenizer.is_fast ? True


In [18]:
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)

print(f"example: {example}")
print(type(encoding))
print()

print(f"encoding.is_fast ? {encoding.is_fast}\n")
print(f"tokens:\n{encoding.tokens()}\n")
print(f"word IDs:\n{encoding.word_ids()}")

example: My name is Sylvain and I work at Hugging Face in Brooklyn.
<class 'transformers.tokenization_utils_base.BatchEncoding'>

encoding.is_fast ? True

tokens:
['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']

word IDs:
[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]


...

In [19]:
tokenizer_roberta = AutoTokenizer.from_pretrained("roberta-base")

check_this = "81s"

encoding_bert = tokenizer(check_this)
encoding_roberta = tokenizer_roberta(check_this)

print("bert-base-cased")
print(f"tokens: {encoding_bert.tokens()}")
print(f"word IDs: {encoding_bert.word_ids()}")

print()

print("roberta-base")
print(f"tokens: {encoding_roberta.tokens()}")
print(f"word IDs: {encoding_roberta.word_ids()}")

bert-base-cased
tokens: ['[CLS]', '81', '##s', '[SEP]']
word IDs: [None, 0, 0, None]

roberta-base
tokens: ['<s>', '81', 's', '</s>']
word IDs: [None, 0, 1, None]


...

In [20]:
start, end = encoding.word_to_chars(3)
example[start:end]

'Sylvain'

...

##### There is no `sentence_ids` method on the encodings returned by `tokenizer(input)`; the method is called `sequence_ids`!

In [21]:
sentence_1 = "I am the son and heir of a shyness that is criminally vulgar. I am the son and the heir of nothing in particular. Really, I am."
sentence_2 = "If it's not love, then it's the Bomb that'll bring us together."

encoding_sentence_1 = tokenizer(sentence_1)
encoding_sentence_2 = tokenizer(sentence_2)

In [22]:
print(dir(encoding_sentence_1))

['_MutableMapping__marker', '__abstractmethods__', '__class__', '__class_getitem__', '__contains__', '__copy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__ior__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__or__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__ror__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_encodings', '_n_sequences', 'char_to_token', 'char_to_word', 'clear', 'convert_to_tensors', 'copy', 'data', 'encodings', 'fromkeys', 'get', 'is_fast', 'items', 'keys', 'n_sequences', 'pop', 'popitem', 'sequence_ids', 'setdefault', 'to', 'token_to_chars', 'token_to_sequence', 'token_to_word', 'tokens', 'update', 'values', 'word_ids', 'word_to_chars', 'word_to_toke

...

#### Inside the token-classification pipeline

> First, let’s grab a token classification pipeline so we can get some results to compare manually. The model used by default is [`dbmdz/bert-large-cased-finetuned-conll03-english`](https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english); it performs NER on sentences.


In [23]:
from transformers import pipeline

token_classifier = pipeline("token-classification")

for ntt in token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn."):
    print(ntt)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0

{'entity': 'I-PER', 'score': 0.99938285, 'index': 4, 'word': 'S', 'start': 11, 'end': 12}
{'entity': 'I-PER', 'score': 0.99815494, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14}
{'entity': 'I-PER', 'score': 0.9959072, 'index': 6, 'word': '##va', 'start': 14, 'end': 16}
{'entity': 'I-PER', 'score': 0.99923277, 'index': 7, 'word': '##in', 'start': 16, 'end': 18}
{'entity': 'I-ORG', 'score': 0.9738931, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35}
{'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40}
{'entity': 'I-ORG', 'score': 0.9887976, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45}
{'entity': 'I-LOC', 'score': 0.9932106, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}


In [24]:
from transformers import pipeline

token_classifier = pipeline("token-classification", aggregation_strategy="simple")
#                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

for ntt in token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn."):
    print(ntt)


{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18}
{'entity_group': 'ORG', 'score': 0.9796019, 'word': 'Hugging Face', 'start': 33, 'end': 45}
{'entity_group': 'LOC', 'score': 0.9932106, 'word': 'Brooklyn', 'start': 49, 'end': 57}


##### From inputs to predictions

... let's try doing the same without using `pipeline`.

In [25]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

print(model)
print()
print(model.config)

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024

In [26]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
print(f"tokenizer.is_fast ? {tokenizer.is_fast}")

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")

outputs = model(**inputs)

tokenizer.is_fast ? True


In [27]:
print(inputs["input_ids"].shape)
print(outputs.logits.shape)

torch.Size([1, 19])
torch.Size([1, 19, 9])


##### NOTE

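* 1st `dim` is the batch size (here 1)
* 2nd `dim` is the sequence length (here 19 tokens, including `[CLS]` and `[SEP]`)
* 3rd `dim` is the number of labels (here 9), with one logit per label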

We use PyTorch's [`torch.nn.functional.softmax`](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html#torch-nn-functional-softmax) to convert the logits to probabilities, and `argmax` over the last dimension to get the predicted NER label for each token.
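
To make the softmax-then-argmax step concrete, here is a small pure-Python sketch for a single token. The `toy_logits` values and the three-label `toy_id2label` mapping are made up for illustration and are not taken from the model above.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability and leaves the result unchanged
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one token over three labels
toy_id2label = {0: "O", 1: "I-PER", 2: "I-ORG"}
toy_logits = [0.5, 3.0, -1.0]

probs = softmax(toy_logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # argmax over the labels
print(toy_id2label[pred], round(probs[pred], 3))
```

Because softmax preserves the ordering of its inputs, taking `argmax` of the logits gives the same index as taking `argmax` of the probabilities; the probabilities are only needed for the reported score.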

In [28]:
import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
predictions = outputs.logits.argmax(dim=-1)[0].tolist()

In [29]:
predictions_labels = [
    model.config.id2label[p]
    for p in predictions
]

In [30]:
for tok, pred, label in zip(inputs.tokens(), predictions, predictions_labels):
    print(tok, pred, label)

[CLS] 0 O
My 0 O
name 0 O
is 0 O
S 4 I-PER
##yl 4 I-PER
##va 4 I-PER
##in 4 I-PER
and 0 O
I 0 O
work 0 O
at 0 O
Hu 6 I-ORG
##gging 6 I-ORG
Face 6 I-ORG
in 0 O
Brooklyn 8 I-LOC
. 0 O
[SEP] 0 O


In [31]:
results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
        )

results

[{'entity': 'I-PER', 'score': 0.9993828535079956, 'word': 'S'},
 {'entity': 'I-PER', 'score': 0.9981548190116882, 'word': '##yl'},
 {'entity': 'I-PER', 'score': 0.995907187461853, 'word': '##va'},
 {'entity': 'I-PER', 'score': 0.9992327690124512, 'word': '##in'},
 {'entity': 'I-ORG', 'score': 0.9738931059837341, 'word': 'Hu'},
 {'entity': 'I-ORG', 'score': 0.9761149883270264, 'word': '##gging'},
 {'entity': 'I-ORG', 'score': 0.9887974858283997, 'word': 'Face'},
 {'entity': 'I-LOC', 'score': 0.99321049451828, 'word': 'Brooklyn'}]

In [32]:
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
#                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^

inputs_with_offsets["offset_mapping"]

[(0, 0),
 (0, 2),
 (3, 7),
 (8, 10),
 (11, 12),
 (12, 14),
 (14, 16),
 (16, 18),
 (19, 22),
 (23, 24),
 (25, 29),
 (30, 32),
 (33, 35),
 (35, 40),
 (41, 45),
 (46, 48),
 (49, 57),
 (57, 58),
 (0, 0)]

In [33]:
example[12:14]

'yl'

In [34]:
results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )

results

[{'entity': 'I-PER',
  'score': 0.9993828535079956,
  'word': 'S',
  'start': 11,
  'end': 12},
 {'entity': 'I-PER',
  'score': 0.9981548190116882,
  'word': '##yl',
  'start': 12,
  'end': 14},
 {'entity': 'I-PER',
  'score': 0.995907187461853,
  'word': '##va',
  'start': 14,
  'end': 16},
 {'entity': 'I-PER',
  'score': 0.9992327690124512,
  'word': '##in',
  'start': 16,
  'end': 18},
 {'entity': 'I-ORG',
  'score': 0.9738931059837341,
  'word': 'Hu',
  'start': 33,
  'end': 35},
 {'entity': 'I-ORG',
  'score': 0.9761149883270264,
  'word': '##gging',
  'start': 35,
  'end': 40},
 {'entity': 'I-ORG',
  'score': 0.9887974858283997,
  'word': 'Face',
  'start': 41,
  'end': 45},
 {'entity': 'I-LOC',
  'score': 0.99321049451828,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

In [35]:
example[33:45]

'Hugging Face'

In [36]:
import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label != "O":
        # Remove the B- or I-
        label = label[2:]
        start, _ = offsets[idx]

        # Grab all the tokens labeled with I-label
        all_scores = []
        while (
            idx < len(predictions)
            and model.config.id2label[predictions[idx]] == f"I-{label}"
        ):
            all_scores.append(probabilities[idx][pred])
            _, end = offsets[idx]
            idx += 1

        # The score is the mean of all the scores of the tokens in that grouped entity
        score = np.mean(all_scores).item()
        word = example[start:end]
        results.append(
            {
                "entity_group": label,
                "score": score,
                "word": word,
                "start": start,
                "end": end,
            }
        )
    idx += 1

results

[{'entity_group': 'PER',
  'score': 0.998169407248497,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796018600463867,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.99321049451828,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
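
The grouping loop above only continues a span on `I-` labels, which suits this checkpoint's predictions. As a hedged sketch of a variant that also copes with an entity opening on a `B-` label, the pure-Python function below works on plain lists. The `group_entities` name and the toy `labels`, `scores` and `offsets` are assumptions written by hand, not model output, and this is not the pipeline's own implementation.

```python
def group_entities(text, labels, scores, offsets):
    """Merge consecutive tokens that share an entity type into one span."""
    results = []
    idx = 0
    while idx < len(labels):
        if labels[idx] == "O":
            idx += 1
            continue
        entity_type = labels[idx][2:]  # strip the "B-" or "I-" prefix
        start, end = offsets[idx]
        span_scores = [scores[idx]]
        idx += 1
        # Only I- tokens of the same type extend the span; a B- label opens a new one
        while idx < len(labels) and labels[idx] == f"I-{entity_type}":
            span_scores.append(scores[idx])
            end = offsets[idx][1]
            idx += 1
        results.append(
            {
                "entity_group": entity_type,
                "score": sum(span_scores) / len(span_scores),  # mean token score
                "word": text[start:end],
                "start": start,
                "end": end,
            }
        )
    return results


# Toy inputs, written by hand for illustration
text = "Hugging Face in Brooklyn"
labels = ["B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
scores = [0.9, 0.8, 1.0, 0.99, 0.95]
offsets = [(0, 2), (2, 7), (8, 12), (13, 15), (16, 24)]

entities = group_entities(text, labels, scores, offsets)
for entity in entities:
    print(entity)
```

Taking the first token unconditionally and then extending on `I-` labels means the start and end offsets are always defined, whichever prefix opens the entity.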

#### Fast tokenizers in the QA pipeline

##### QA using `pipeline`

In [37]:
from transformers import pipeline

question = "Which deep learning libraries back 🤗 Transformers?"

context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""

question_answerer = pipeline("question-answering")
question_answerer(question=question, context=context)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cuda:0


{'score': 0.98026043176651,
 'start': 78,
 'end': 106,
 'answer': 'Jax, PyTorch, and TensorFlow'}

In [38]:
long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
#                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# ... see now how the answer is almost at the very end of this
#     long, long context ...?

question_answerer(question=question, context=long_context)

{'score': 0.9714871048927307,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

##### Using a model for question answering

... where we do things the hard way.

> The checkpoint used by default for the question-answering pipeline is [`distilbert-base-cased-distilled-squad`](https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad)

> Models for question answering work a little differently from the models we’ve seen up to now. Using the picture above as an example, the model has been trained to predict the index of the token starting the answer (here 21) and the index of the token where the answer ends (here 24). This is why those models don’t return one tensor of logits but two: one for the logits corresponding to the start token of the answer, and one for the logits corresponding to the end token of the answer.

In [39]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

##### For QA, we have logits for both the start and the end of the answer span.

... and some models use the `[CLS]` token to indicate an impossible answer.

In [40]:
start_logits = outputs.start_logits
end_logits = outputs.end_logits

print(f"num. of tokens? {len(inputs.tokens())}")
print(start_logits.shape, end_logits.shape)

num. of tokens? 67
torch.Size([1, 67]) torch.Size([1, 67])


> To convert those logits into probabilities, we will apply a softmax function, but before that, we need to make sure we mask the indices that are not part of the context. Our input is `[CLS] question [SEP] context [SEP]`, so we need to mask the tokens of the question as well as the `[SEP]` tokens. We’ll keep the `[CLS]` token, however, as some models use it to indicate that the answer is not in the context.

##### To clarify

* We want to calculate the probabilities for the `start` and `end` tokens using <u>only the context</u> and not the question.
* We do that by forcing the probabilities of the question tokens and of the `[SEP]` tokens to 0.
* Some models use `[CLS]` to indicate an impossible answer (the answer could not be found in the context), so we need to let that token through when calculating probabilities.
* So, we need to create a mask of `1`s (or `True`) at the positions of the question tokens and the `[SEP]` tokens.
* We can then set the logits at those positions to a large negative number, since $\mathrm{softmax}(x_{i}) = \frac{\exp(x_{i})}{\sum_j{\exp(x_{j})}}$ and the exponential of a large negative number $x_{i}$ is effectively 0.
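
The effect of the large negative fill value can be checked with a short pure-Python sketch. The `masked_softmax` helper, the eight-position toy layout and the logit values are assumptions for illustration; only the fill value of `-10000` matches the notebook.

```python
import math


def masked_softmax(logits, mask, fill_value=-10000.0):
    """Softmax over `logits`, with masked positions set to `fill_value` first."""
    filled = [fill_value if m else x for x, m in zip(logits, mask)]
    shift = max(filled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in filled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy layout: [CLS], two question tokens, [SEP], three context tokens, [SEP]
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
toy_logits = [0.1, 4.0, 3.5, 2.0, 1.0, 2.5, 0.5, 1.5]

# True marks positions to hide: everything that is not part of the context...
mask = [sid != 1 for sid in toy_sequence_ids]
# ...except [CLS], which stays available to signal "no answer"
mask[0] = False

probs = masked_softmax(toy_logits, mask)
print([round(p, 3) for p in probs])
```

Even though the question tokens carry the largest raw logits in this toy input, their probabilities collapse to zero after masking, so the most likely position is guaranteed to be `[CLS]` or a context token.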

We will use the `sequence_ids` values to build the mask that sets the probabilities of the question tokens (and the `[SEP]` special tokens) to 0.

`sequence_ids` values are:
* `None` for `[CLS]`, `[SEP]` special BERT tokens
* `0` for the question
* `1` for the context

In [41]:
import torch

sequence_ids = inputs.sequence_ids()
#print(inputs.sequence_ids())
#print()

list(zip(inputs.tokens(), sequence_ids))

[('[CLS]', None),
 ('Which', 0),
 ('deep', 0),
 ('learning', 0),
 ('libraries', 0),
 ('back', 0),
 ('[UNK]', 0),
 ('Transformers', 0),
 ('?', 0),
 ('[SEP]', None),
 ('[UNK]', 1),
 ('Transformers', 1),
 ('is', 1),
 ('backed', 1),
 ('by', 1),
 ('the', 1),
 ('three', 1),
 ('most', 1),
 ('popular', 1),
 ('deep', 1),
 ('learning', 1),
 ('libraries', 1),
 ('—', 1),
 ('Jax', 1),
 (',', 1),
 ('P', 1),
 ('##y', 1),
 ('##T', 1),
 ('##or', 1),
 ('##ch', 1),
 (',', 1),
 ('and', 1),
 ('Ten', 1),
 ('##sor', 1),
 ('##F', 1),
 ('##low', 1),
 ('—', 1),
 ('with', 1),
 ('a', 1),
 ('sea', 1),
 ('##m', 1),
 ('##less', 1),
 ('integration', 1),
 ('between', 1),
 ('them', 1),
 ('.', 1),
 ('It', 1),
 ("'", 1),
 ('s', 1),
 ('straightforward', 1),
 ('to', 1),
 ('train', 1),
 ('your', 1),
 ('models', 1),
 ('with', 1),
 ('one', 1),
 ('before', 1),
 ('loading', 1),
 ('them', 1),
 ('for', 1),
 ('in', 1),
 ('##ference', 1),
 ('with', 1),
 ('the', 1),
 ('other', 1),
 ('.', 1),
 ('[SEP]', None)]

In [42]:
# Mask indicating the positions of the question tokens and [SEP]
mask = [i != 1 for i in sequence_ids]
#print(mask)

# Unmask the [CLS] token
mask[0] = False
#print(mask)
#print()

# N.B., the mask needs to be the same shape
#       as the tensor of output logits
#mask = torch.tensor(mask)[None]
mask = torch.tensor(mask).unsqueeze(dim=0)
#mask = torch.tensor(mask)
print(mask.shape)

torch.Size([1, 67])


In [43]:
start_logits[mask] = -10000
end_logits[mask] = -10000

In [44]:
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

##### Create a matrix of scores

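`start_probabilities[:, None] * end_probabilities[None, :]` (equivalently, `unsqueeze`-ing the two tensors before multiplying) creates an $n \times n$ tensor (matrix) of scores, where entry $(i, j)$ is the probability of an answer that starts at token $i$ and ends at token $j$.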

In [45]:
scores = start_probabilities[:, None] * end_probabilities[None, :]
scores.shape

torch.Size([67, 67])

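An answer cannot end before it starts, so we zero out every entry below the main diagonal with `torch.triu`. See the [API documentation for `torch.triu`](https://pytorch.org/docs/stable/generated/torch.triu.html#torch-triu).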

In [46]:
scores = torch.triu(scores)
scores

tensor([[9.4340e-13, 0.0000e+00, 0.0000e+00,  ..., 1.1023e-12, 1.6345e-12,
         0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        ...,
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 9.1136e-14, 1.3514e-13,
         0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 1.2744e-13,
         0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00]], grad_fn=<TriuBackward0>)
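
To see the score matrix and the upper-triangular filter on something small enough to read, here is a sketch in plain Python. The probabilities are invented for a hypothetical 3-token input; the list comprehension only imitates what the broadcasted product and `torch.triu` do above.

```python
# Hypothetical start/end probabilities for a 3-token input
start_probs = [0.1, 0.7, 0.2]
end_probs = [0.3, 0.1, 0.6]
n = len(start_probs)

# scores[i][j] = P(start at i) * P(end at j), kept only where j >= i
scores = [
    [start_probs[i] * end_probs[j] if j >= i else 0.0 for j in range(n)]
    for i in range(n)
]

for row in scores:
    print([round(s, 2) for s in row])
```

Every entry below the diagonal is 0, so a span whose end precedes its start can never win the `argmax`.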

> Now we just have to get the index of the maximum. Since PyTorch will return the index in the flattened tensor, we need to use the floor division `//` and modulus `%` operations to get the `start_index` and `end_index`.

##### Clarification: using `//` and `%` to get the start and end indices

The explanation above is rather terse, so here is some more detail.

From [a response to the "Get indices of the max of a 2D Tensor" question at discuss.pytorch.org](https://discuss.pytorch.org/t/get-indices-of-the-max-of-a-2d-tensor/82150/5):

> Since `argmax()` gives you the index in a flattened tensor, you can infer the position in your 2D tensor from size of the last dimension.
> E.g. if `argmax()` returns 10 and you’ve got 4 columns, you know it’s on row 2, column 2.
> You can use Python's (built-in function) [`divmod`](https://docs.python.org/3/library/functions.html#divmod) for this
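
The example from the quote can be checked in plain Python. This sketch assumes a made-up 3 × 4 matrix whose maximum sits at flattened position 10; no PyTorch is needed to see the arithmetic.

```python
# A 3 x 4 matrix stored row by row, as argmax() sees it after flattening
matrix = [
    [0.0, 0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6, 0.7],
    [0.8, 0.9, 5.0, 0.2],
]
n_cols = len(matrix[0])
flat = [value for row in matrix for value in row]

flat_index = flat.index(max(flat))     # position of the maximum in the flattened list
row, col = divmod(flat_index, n_cols)  # same as (flat_index // n_cols, flat_index % n_cols)

print(flat_index, row, col)
```

The row is how many full rows fit before the flattened position, and the column is what is left over, which is exactly what `divmod` returns.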

In [47]:
max_index = scores.argmax().item()
print(type(max_index))

start_index = max_index // scores.shape[1]
end_index   = max_index %  scores.shape[1]
#start_index, end_index = divmod(max_index, scores.shape[1])

print(f"start_index: {start_index}")
print(f"end_index: {end_index}")

print(scores[start_index, end_index])

<class 'int'>
start_index: 23
end_index: 35
tensor(0.9803, grad_fn=<SelectBackward0>)


##### Slight detour for extra credit

> ✏️ Try it out! Compute the start and end indices for the five most likely answers.

Please see the API documentation for [`torch.topk`](https://pytorch.org/docs/stable/generated/torch.topk.html).

In [48]:
values, indices = torch.topk(scores.flatten(), 5)
start_and_end = [
    divmod(max_index, scores.shape[1])
    for max_index in indices.tolist()
]
start_and_end

[(23, 35), (23, 36), (16, 35), (23, 29), (25, 35)]
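
The same top-k idea can be sketched without PyTorch, which makes the flatten-then-`divmod` step easier to follow. The 3 × 3 score matrix is invented, and `heapq.nlargest` stands in for `torch.topk` here.

```python
import heapq

# Hypothetical upper-triangular score matrix (rows = start token, columns = end token)
scores = [
    [0.02, 0.40, 0.05],
    [0.00, 0.10, 0.30],
    [0.00, 0.00, 0.13],
]
n_cols = len(scores[0])
flat = [s for row in scores for s in row]

# Indices of the 3 largest scores in the flattened list, best first
top_flat = heapq.nlargest(3, range(len(flat)), key=flat.__getitem__)

# Convert each flattened index back to a (start, end) pair
top_spans = [divmod(i, n_cols) for i in top_flat]
print(top_spans)
```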

...

In [49]:
inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
print(inputs_with_offsets)

{'input_ids': [101, 5979, 1996, 3776, 9818, 1171, 100, 25267, 136, 102, 100, 25267, 1110, 5534, 1118, 1103, 1210, 1211, 1927, 1996, 3776, 9818, 783, 13612, 117, 153, 1183, 1942, 1766, 1732, 117, 1105, 5157, 21484, 2271, 6737, 783, 1114, 170, 2343, 1306, 2008, 9111, 1206, 1172, 119, 1135, 112, 188, 21546, 1106, 2669, 1240, 3584, 1114, 1141, 1196, 10745, 1172, 1111, 1107, 16792, 1114, 1103, 1168, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 5), (6, 10), (11, 19), (20, 29), (30, 34), (35, 36), (37, 49), (49, 50), (0, 0), (1, 2), (3, 15), (16, 18), (19, 25), (26, 28), (29, 32), (33, 38), (39, 43), (44, 51), (52, 56), (57, 65), (66, 75), (76, 77), (78, 81), (81, 82), (83, 84), (84, 85), (85, 86), (86, 88), (88, 90), (90, 91), (92, 95), (96, 99), (99, 102), (102, 103), (103, 10

In [50]:
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]
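
To make the offset lookup concrete, here is a toy version with hand-written offsets. The sentence, the offsets and the chosen token span are all hypothetical values written for this sketch, not tokenizer output.

```python
context = "Jax, PyTorch, and TensorFlow are supported."

# Hand-written (start_char, end_char) pairs, one per token; end is exclusive
offsets = [
    (0, 3),    # Jax
    (3, 4),    # ,
    (5, 12),   # PyTorch
    (12, 13),  # ,
    (14, 17),  # and
    (18, 28),  # TensorFlow
    (29, 32),  # are
    (33, 42),  # supported
    (42, 43),  # .
]

start_index, end_index = 0, 5          # hypothetical best token span

start_char, _ = offsets[start_index]   # first character of the start token
_, end_char = offsets[end_index]       # one past the last character of the end token
answer = context[start_char:end_char]
print(answer)
```

Only the start of the first token and the end of the last token matter; everything between them, including whitespace that belongs to no token, comes along in the slice.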

In [51]:
result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}
result

{'answer': 'Jax, PyTorch, and TensorFlow',
 'start': 78,
 'end': 106,
 'score': tensor(0.9803, grad_fn=<SelectBackward0>)}

...

In [52]:
question_answerer(question=question, context=context, top_k=5)

[{'score': 0.98026043176651,
  'start': 78,
  'end': 106,
  'answer': 'Jax, PyTorch, and TensorFlow'},
 {'score': 0.008247777819633484,
  'start': 78,
  'end': 108,
  'answer': 'Jax, PyTorch, and TensorFlow —'},
 {'score': 0.001367696444503963,
  'start': 78,
  'end': 90,
  'answer': 'Jax, PyTorch'},
 {'score': 0.00038108485750854015,
  'start': 83,
  'end': 106,
  'answer': 'PyTorch, and TensorFlow'},
 {'score': 0.00021684444800484926,
  'start': 96,
  'end': 106,
  'answer': 'TensorFlow'}]

#### Handling long contexts

In [53]:
inputs = tokenizer(question, long_context)

print(len(inputs["input_ids"]))

461


In [54]:
inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")

In [55]:
print(tokenizer.decode(inputs["input_ids"]))

[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] [UNK] Transformers : State of the Art NLP [UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more in over 100 languages. Its aim is to make cutting - edge NLP easier to use for everyone. [UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine - tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments. Why should I use transformers? 1. Easy - to - use state - of - the - art models : - High performance on NLU and NLG tasks. - Low barrier to entry for educators and practitioners. - Few user - facing abstractions with just three classes to learn. - A unified A

In [56]:
sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] This sentence is not [SEP]
[CLS] is not too long [SEP]
[CLS] too long but we [SEP]
[CLS] but we are going [SEP]
[CLS] are going to split [SEP]
[CLS] to split it anyway [SEP]
[CLS] it anyway. [SEP]
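
The overlap pattern in the output above can be imitated with a small pure-Python helper. This is a sketch under simplifying assumptions: `chunk_with_overlap` is a hypothetical function, whitespace-separated words stand in for tokens, and the window is 4 because `max_length=6` leaves room for 4 real tokens once `[CLS]` and `[SEP]` are added.

```python
def chunk_with_overlap(tokens, window, overlap):
    """Split `tokens` into windows of at most `window` items, where each
    window repeats the last `overlap` items of the previous one."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return chunks

tokens = "This sentence is not too long but we are going to split it anyway .".split()
chunks = chunk_with_overlap(tokens, window=4, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

With a window of 4 and an overlap of 2, each chunk starts 2 tokens after the previous one, which yields the same seven pieces as the tokenizer output above.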


In [57]:
print(inputs.keys())

dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])


In [58]:
print(inputs["overflow_to_sample_mapping"])

[0, 0, 0, 0, 0, 0, 0]


In [59]:
sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
    "If it's not Love, then it's the Bomb that'll bring us together!"
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]


...

In [60]:
inputs = tokenizer(
    question,
    long_context,
    stride=128,                        # number of tokens shared by consecutive chunks
    max_length=384,                    # maximum length of each chunk, in tokens
    padding="longest",                 # so we can treat this as a batch to build tensors
    truncation="only_second",          # only truncate the second input, the long_context
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

In [61]:
long_context

"\n🤗 Transformers: State of the Art NLP\n\n🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,\nquestion answering, summarization, translation, text generation and more in over 100 languages.\nIts aim is to make cutting-edge NLP easier to use for everyone.\n\n🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and\nthen share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and\ncan be modified to enable quick research experiments.\n\nWhy should I use transformers?\n\n1. Easy-to-use state-of-the-art models:\n  - High performance on NLU and NLG tasks.\n  - Low barrier to entry for educators and practitioners.\n  - Few user-facing abstractions with just three classes to learn.\n  - A unified API for using all our pretrained models.\n  - Lower compute costs, 

In [62]:
_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)

torch.Size([2, 384])


In [63]:
outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)        # since our long_context was split into 2, we have 2 starts and 2 ends

torch.Size([2, 384]) torch.Size([2, 384])


In [64]:
sequence_ids = inputs.sequence_ids()

# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]

# Unmask the [CLS] token
mask[0] = False

# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

In [65]:
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

In [66]:
candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    #start_idx = idx // scores.shape[1]
    #end_idx = idx % scores.shape[1]
    start_idx, end_idx = divmod(idx, scores.shape[1])
    
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)

[(0, 18, 0.33866992592811584), (173, 184, 0.9714868664741516)]


In [67]:
print(question)
print()

for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)

Which deep learning libraries back 🤗 Transformers?

{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33866992592811584}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.9714868664741516}


...

##### ✏️ Try it out! (extra-curricular work)

> Adapt the code above to return the scores and spans for the five most likely answers (in total, not per chunk).

> Use the best scores you computed before to show the five most likely answers
> (for the whole context, not each chunk).
> To check your results, go back to the first pipeline and pass in `top_k=5` when calling it.

##### Naive implementation for `top_k=5` for QA

In [68]:
results = []

for i, (start_probs, end_probs) in enumerate(list(zip(start_probabilities, end_probabilities))):
    
    scores = start_probs[:, None] * end_probs[None, :]
    
    values, indices = torch.topk(scores.flatten(), 5)
    
    start_and_end = [
        divmod(max_index, scores.shape[1])
        for max_index in indices.tolist()
    ]
    
    for (start_token, end_token), score in zip(start_and_end, values):
        if start_token != 0 and end_token != 0:
            # Map token indices to characters using this chunk's own offsets
            start_char, _ = offsets[i][start_token]
            _, end_char = offsets[i][end_token]
            results.append({
                'score': score.item(),
                'start': start_char,
                'end': end_char,
                'answer': long_context[start_char:end_char]
            })

results = sorted(results, key=lambda r: r['score'], reverse=True)[:5]
results

[{'score': 0.9714868664741516,
  'start': 1892,
  'end': 1919,
  'answer': 'Jax, PyTorch and TensorFlow'},
 {'score': 0.14949451386928558,
  'start': 17,
  'end': 37,
  'answer': 'State of the Art NLP'},
 {'score': 0.015565164387226105,
  'start': 1892,
  'end': 1921,
  'answer': 'Jax, PyTorch and TensorFlow —'},
 {'score': 0.013705423101782799,
  'start': 34,
  'end': 37,
  'answer': 'NLP'},
 {'score': 0.007162676192820072,
  'start': 1847,
  'end': 1919,
  'answer': 'three most popular deep learning libraries — Jax, PyTorch and TensorFlow'}]

##### Specifying `top_k=5` in `pipeline("question-answering")`

In [69]:
question_answerer(question=question, context=long_context, top_k=5)

[{'score': 0.9714871048927307,
  'start': 1892,
  'end': 1919,
  'answer': 'Jax, PyTorch and TensorFlow'},
 {'score': 0.1494959443807602,
  'start': 17,
  'end': 37,
  'answer': 'State of the Art NLP'},
 {'score': 0.015565137378871441,
  'start': 1892,
  'end': 1921,
  'answer': 'Jax, PyTorch and TensorFlow —'},
 {'score': 0.013705498538911343, 'start': 34, 'end': 37, 'answer': 'NLP'},
 {'score': 0.010596820153295994,
  'start': 3,
  'end': 37,
  'answer': 'Transformers: State of the Art NLP'}]

...

#### Normalization and pre-tokenization

> Before splitting a text into subtokens (according to its model), the tokenizer performs two steps: _normalization_ and _pre-tokenization._

##### Normalization

> ... general cleanup, such as removing needless whitespace, lowercasing, and/or removing accents

* Fast tokenizers provide access to the underlying tokenization implementation via the `backend_tokenizer` attribute
* `backend_tokenizer.normalizer.normalize_str` method lets us see how an input string is normalized

In [70]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer.backend_tokenizer))
print()
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

<class 'tokenizers.Tokenizer'>

hello how are u?
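
The uncased result above is consistent with lowercasing plus accent stripping. A rough approximation using only Python's standard `unicodedata` module is sketched below; `lowercase_and_strip_accents` is a hypothetical helper, and this is not the actual normalizer implementation, which may perform further cleanup steps.

```python
import unicodedata

def lowercase_and_strip_accents(text):
    """Lowercase `text` and drop combining accent marks (a rough approximation)."""
    # NFD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents
    without_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return without_marks.lower()

print(lowercase_and_strip_accents("Héllò hôw are ü?"))
```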


> ✏️ Try it out! Load a tokenizer from the `bert-base-cased` checkpoint and pass the same example to it.<br/> What are the main differences you can see between the cased and uncased versions of the tokenizer (normalization)?

In [71]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(type(tokenizer.backend_tokenizer))
print()
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

<class 'tokenizers.Tokenizer'>

Héllò hôw are ü?


##### Unicode normalization

Some tokenizer schemes also do Unicode normalization. There are four Normalization Forms:

1. Normalization Form D (NFD): canonical decomposition
2. Normalization Form C (NFC): canonical decomposition, followed by canonical composition
3. Normalization Form KD (NFKD): compatibility decomposition
4. Normalization Form KC (NFKC): compatibility decomposition, followed by canonical composition

Please see [Normalization Forms](https://www.unicode.org/reports/tr15/#Norm_Forms) at unicode.org.

If this matters for your task, **make sure you choose a tokenizer that applies the normalization you need**.

In [72]:
print('\u00E7')

ç


In [73]:
print('\u0063', '\u0327')
print()
print('\u0063\u0327')

c ̧

ç
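
The two spellings of `ç` above look identical but are different strings. Python's standard `unicodedata` module can convert between the forms; the ligature example at the end is an extra illustration added for this sketch and does not come from the course.

```python
import unicodedata

composed = "\u00E7"          # single code point: LATIN SMALL LETTER C WITH CEDILLA
decomposed = "\u0063\u0327"  # "c" followed by COMBINING CEDILLA

print(composed == decomposed)                                 # different code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)   # composition merges them
print(unicodedata.normalize("NFD", composed) == decomposed)   # decomposition splits them

# Compatibility forms also rewrite "look-alike" characters such as ligatures
ligature = "\uFB01"          # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFC", ligature) == ligature)     # canonical form leaves it alone
print(unicodedata.normalize("NFKC", ligature) == "fi")        # compatibility form expands it
```

Two texts that render the same can therefore tokenize differently unless the tokenizer normalizes them to a common form first.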


...

##### Pre-tokenization

Performed after the normalization step, pre-tokenization applies rules for an initial division of the input text. These rules do not need to be learned.

* A Fast tokenizer's `backend_tokenizer.pre_tokenizer.pre_tokenize_str` method lets us see how an input string is initially divided into tokens

In [74]:
text = "3.2.1: let's get started!"

In [75]:
# gpt2
AutoTokenizer.from_pretrained("gpt2").backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)

[('3', (0, 1)),
 ('.', (1, 2)),
 ('2', (2, 3)),
 ('.', (3, 4)),
 ('1', (4, 5)),
 (':', (5, 6)),
 ('Ġlet', (6, 10)),
 ("'s", (10, 12)),
 ('Ġget', (12, 16)),
 ('Ġstarted', (16, 24)),
 ('!', (24, 25))]

In [76]:
# albert-base-v1
AutoTokenizer.from_pretrained("albert-base-v1").backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)

[('▁3.2.1:', (0, 6)),
 ("▁let's", (7, 12)),
 ('▁get', (13, 16)),
 ('▁started!', (17, 25))]

In [77]:
# bert-base-uncased
AutoTokenizer.from_pretrained("bert-base-uncased").backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)

[('3', (0, 1)),
 ('.', (1, 2)),
 ('2', (2, 3)),
 ('.', (3, 4)),
 ('1', (4, 5)),
 (':', (5, 6)),
 ('let', (7, 10)),
 ("'", (10, 11)),
 ('s', (11, 12)),
 ('get', (13, 16)),
 ('started', (17, 24)),
 ('!', (24, 25))]
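
Because these rules are fixed rather than learned, the BERT-style split above can be imitated with a single regular expression. This is only an approximation written for illustration (`pre_tokenize` is a hypothetical helper, and the regex treats `_` as a word character); it is not how the library implements its pre-tokenizers.

```python
import re

def pre_tokenize(text):
    """Split on whitespace and isolate punctuation, keeping character offsets."""
    # \w+ grabs runs of word characters; [^\w\s] grabs one punctuation character
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

for piece in pre_tokenize("3.2.1: let's get started!"):
    print(piece)
```

On this input the sketch produces the same twelve pieces and offsets as the `bert-base-uncased` pre-tokenizer output above; whitespace is dropped, whereas the GPT-2 and ALBERT pre-tokenizers keep it as `Ġ` or `▁`.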

...

##### SentencePiece

In [78]:
AutoTokenizer.from_pretrained("albert-base-v1").backend_tokenizer.pre_tokenizer.pre_tokenize_str('日本語だよ、これが。')

[('▁日本語だよ、これが。', (0, 10))]

#### Byte-Pair Encoding tokenization

##### Implementing BPE

... a naive approach...

In [79]:
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

In [80]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [81]:
from collections import defaultdict

word_freqs = defaultdict(int)

for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)

defaultdict(<class 'int'>, {'This': 3, 'Ġis': 2, 'Ġthe': 1, 'ĠHugging': 1, 'ĠFace': 1, 'ĠCourse': 1, '.': 4, 'Ġchapter': 1, 'Ġabout': 1, 'Ġtokenization': 1, 'Ġsection': 1, 'Ġshows': 1, 'Ġseveral': 1, 'Ġtokenizer': 1, 'Ġalgorithms': 1, 'Hopefully': 1, ',': 1, 'Ġyou': 1, 'Ġwill': 1, 'Ġbe': 1, 'Ġable': 1, 'Ġto': 1, 'Ġunderstand': 1, 'Ġhow': 1, 'Ġthey': 1, 'Ġare': 1, 'Ġtrained': 1, 'Ġand': 1, 'Ġgenerate': 1, 'Ġtokens': 1})


In [82]:
alphabet = []

for word in word_freqs.keys():
    for letter in word:
        if letter not in alphabet:
            alphabet.append(letter)
alphabet.sort()

print(alphabet)

[',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ']


In [83]:
vocab = ["<|endoftext|>"] + alphabet.copy()

print(vocab)

['<|endoftext|>', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ']


In [84]:
splits = {word: [c for c in word] for word in word_freqs.keys()}

print(splits)

{'This': ['T', 'h', 'i', 's'], 'Ġis': ['Ġ', 'i', 's'], 'Ġthe': ['Ġ', 't', 'h', 'e'], 'ĠHugging': ['Ġ', 'H', 'u', 'g', 'g', 'i', 'n', 'g'], 'ĠFace': ['Ġ', 'F', 'a', 'c', 'e'], 'ĠCourse': ['Ġ', 'C', 'o', 'u', 'r', 's', 'e'], '.': ['.'], 'Ġchapter': ['Ġ', 'c', 'h', 'a', 'p', 't', 'e', 'r'], 'Ġabout': ['Ġ', 'a', 'b', 'o', 'u', 't'], 'Ġtokenization': ['Ġ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n'], 'Ġsection': ['Ġ', 's', 'e', 'c', 't', 'i', 'o', 'n'], 'Ġshows': ['Ġ', 's', 'h', 'o', 'w', 's'], 'Ġseveral': ['Ġ', 's', 'e', 'v', 'e', 'r', 'a', 'l'], 'Ġtokenizer': ['Ġ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'r'], 'Ġalgorithms': ['Ġ', 'a', 'l', 'g', 'o', 'r', 'i', 't', 'h', 'm', 's'], 'Hopefully': ['H', 'o', 'p', 'e', 'f', 'u', 'l', 'l', 'y'], ',': [','], 'Ġyou': ['Ġ', 'y', 'o', 'u'], 'Ġwill': ['Ġ', 'w', 'i', 'l', 'l'], 'Ġbe': ['Ġ', 'b', 'e'], 'Ġable': ['Ġ', 'a', 'b', 'l', 'e'], 'Ġto': ['Ġ', 't', 'o'], 'Ġunderstand': ['Ġ', 'u', 'n', 'd', 'e', 'r', 's', 't', 'a', 'n', 'd'], 'Ġh

In [85]:
def compute_pair_freqs(splits):
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            pair_freqs[pair] += freq
    return pair_freqs

In [86]:
pair_freqs = compute_pair_freqs(splits)

for i, key in enumerate(pair_freqs.keys()):
    print(f"{key}: {pair_freqs[key]}")
    if i >= 5:
        break

('T', 'h'): 3
('h', 'i'): 3
('i', 's'): 5
('Ġ', 'i'): 2
('Ġ', 't'): 7
('t', 'h'): 3


In [87]:
best_pair = ""
max_freq = None

for pair, freq in pair_freqs.items():
    if max_freq is None or max_freq < freq:
        best_pair = pair
        max_freq = freq

print(best_pair, max_freq)

('Ġ', 't') 7


In [88]:
sorted_pair_freqs = sorted(pair_freqs.items(), key=lambda kv: kv[1])
sorted_pair_freqs[-5:][::-1]

[(('Ġ', 't'), 7),
 (('Ġ', 'a'), 5),
 (('e', 'r'), 5),
 (('i', 's'), 5),
 (('e', 'n'), 4)]
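
As a side note, the manual search for the most frequent pair can also be written with the built-in `max`. The dictionary below copies the six pair counts printed earlier; it is a small excerpt for illustration, not the full `pair_freqs`.

```python
# Excerpt of the pair counts printed above
pair_freqs = {
    ("T", "h"): 3,
    ("h", "i"): 3,
    ("i", "s"): 5,
    ("Ġ", "i"): 2,
    ("Ġ", "t"): 7,
    ("t", "h"): 3,
}

# max() with a key returns the first key that has the highest count
best_pair = max(pair_freqs, key=pair_freqs.get)
print(best_pair, pair_freqs[best_pair])
```

On ties, `max` keeps the first pair it encounters, which matches the strict `<` comparison in the explicit loop.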

In [89]:
merges = {("Ġ", "t"): "Ġt"}

vocab.append("Ġt")
print(vocab)

['<|endoftext|>', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ', 'Ġt']


In [90]:
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue

        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [a + b] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

In [91]:
splits = merge_pair("Ġ", "t", splits)

print(splits["Ġtrained"])

['Ġt', 'r', 'a', 'i', 'n', 'e', 'd']


In [92]:
vocab_size = 50

while len(vocab) < vocab_size:
    pair_freqs = compute_pair_freqs(splits)
    best_pair = ""
    max_freq = None
    for pair, freq in pair_freqs.items():
        if max_freq is None or max_freq < freq:
            best_pair = pair
            max_freq = freq
    splits = merge_pair(*best_pair, splits)
    merges[best_pair] = best_pair[0] + best_pair[1]
    vocab.append(best_pair[0] + best_pair[1])

In [93]:
print(vocab)

['<|endoftext|>', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ', 'Ġt', 'is', 'er', 'Ġa', 'Ġto', 'en', 'Th', 'This', 'ou', 'se', 'Ġtok', 'Ġtoken', 'nd', 'Ġis', 'Ġth', 'Ġthe', 'in', 'Ġab', 'Ġtokeni']


In [94]:
print(merges)

{('Ġ', 't'): 'Ġt', ('i', 's'): 'is', ('e', 'r'): 'er', ('Ġ', 'a'): 'Ġa', ('Ġt', 'o'): 'Ġto', ('e', 'n'): 'en', ('T', 'h'): 'Th', ('Th', 'is'): 'This', ('o', 'u'): 'ou', ('s', 'e'): 'se', ('Ġto', 'k'): 'Ġtok', ('Ġtok', 'en'): 'Ġtoken', ('n', 'd'): 'nd', ('Ġ', 'is'): 'Ġis', ('Ġt', 'h'): 'Ġth', ('Ġth', 'e'): 'Ġthe', ('i', 'n'): 'in', ('Ġa', 'b'): 'Ġab', ('Ġtoken', 'i'): 'Ġtokeni'}


In [95]:
def tokenize(text):
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    splits = [[l for l in word] for word in pre_tokenized_text]
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2 :]
                else:
                    i += 1
            splits[idx] = split

    return sum(splits, [])

In [96]:
tokenize("This is not a token.")

['This', 'Ġis', 'Ġ', 'n', 'o', 't', 'Ġa', 'Ġtoken', '.']

...

> ⚠️ Our implementation will throw an error if there is an unknown character since we didn’t do anything to handle them. GPT-2 doesn’t actually have an unknown token (it’s impossible to get an unknown character when using byte-level BPE), but this could happen here because we did not include all possible bytes in the initial vocabulary.

From "Byte Level Text Representation", "Encoding Byte-Level Representation", page 1 of [Neural Machine Translation with Byte-Level Subwords, Wang, Cho & Gu, 2019](https://arxiv.org/pdf/1909.03341):
> While there are 138K Unicode characters covering over 150 languages, we represent a sentence in any language as a sequence of UTF-8 bytes (248 out of 256 possible bytes).
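
A short sketch of why a byte-level base vocabulary cannot run into an unknown character: any string, whatever its script, becomes a sequence of values between 0 and 255 once it is encoded as UTF-8. The example string is arbitrary, and `base_vocab` is a hypothetical name used only here.

```python
text = "\U0001F917 Transformers"   # the hugging-face emoji followed by a word

byte_ids = list(text.encode("utf-8"))
print(len(text), "characters ->", len(byte_ids), "bytes")
print(byte_ids[:4])                # the emoji alone takes four bytes

# A base vocabulary with one entry per possible byte value covers every input
base_vocab = [bytes([b]) for b in range(256)]
print(all(bytes([b]) in base_vocab for b in byte_ids))

# Nothing is lost: the bytes decode back to the original text
print(bytes(byte_ids).decode("utf-8") == text)
```

The toy BPE above starts from the characters seen in the corpus instead, which is why an unseen character would break it.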

...

#### WordPiece tokenization

> ⚠️ Google never open-sourced its implementation of the training algorithm of WordPiece, so what follows is our best guess based on the published literature.

In [97]:
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

In [98]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [99]:
from collections import defaultdict

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

word_freqs

defaultdict(int,
            {'This': 3,
             'is': 2,
             'the': 1,
             'Hugging': 1,
             'Face': 1,
             'Course': 1,
             '.': 4,
             'chapter': 1,
             'about': 1,
             'tokenization': 1,
             'section': 1,
             'shows': 1,
             'several': 1,
             'tokenizer': 1,
             'algorithms': 1,
             'Hopefully': 1,
             ',': 1,
             'you': 1,
             'will': 1,
             'be': 1,
             'able': 1,
             'to': 1,
             'understand': 1,
             'how': 1,
             'they': 1,
             'are': 1,
             'trained': 1,
             'and': 1,
             'generate': 1,
             'tokens': 1})

In [100]:
alphabet = []
for word in word_freqs.keys():
    if word[0] not in alphabet:
        alphabet.append(word[0])
    for letter in word[1:]:
        if f"##{letter}" not in alphabet:
            alphabet.append(f"##{letter}")

alphabet.sort()

print(alphabet)

['##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##r', '##s', '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u', 'w', 'y']


In [101]:
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()

In [102]:
splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs.keys()
}

In [103]:
def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

In [104]:
pair_scores = compute_pair_scores(splits)
for i, key in enumerate(pair_scores.keys()):
    print(f"{key}: {pair_scores[key]}")
    if i >= 5:
        break

('T', '##h'): 0.125
('##h', '##i'): 0.03409090909090909
('##i', '##s'): 0.02727272727272727
('i', '##s'): 0.1
('t', '##h'): 0.03571428571428571
('##h', '##e'): 0.011904761904761904
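
The values printed above follow from the score that `compute_pair_scores` computes. Written as a formula (the $\mathrm{freq}$ notation is introduced here for illustration and is not the course's):

$$\mathrm{score}(a, b) = \frac{\mathrm{freq}(a, b)}{\mathrm{freq}(a) \times \mathrm{freq}(b)}$$

As a check against the corpus: the pair (`T`, `##h`) occurs 3 times (in `This`), `T` occurs 3 times, and `##h` occurs 8 times (3 times in `This`, and once each in `the`, `chapter`, `shows`, `algorithms` and `they`), so

$$\mathrm{score}(\texttt{T}, \texttt{\#\#h}) = \frac{3}{3 \times 8} = 0.125$$

which matches the first line of the output. Dividing by the individual frequencies penalizes pairs whose parts are common on their own, so a rare but consistent pairing can outrank a frequent one.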


In [105]:
best_pair = ""
max_score = None
for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        best_pair = pair
        max_score = score

print(best_pair, max_score)

('a', '##b') 0.2


In [106]:
vocab.append("ab")

In [107]:
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                merge = a + b[2:] if b.startswith("##") else a + b
                split = split[:i] + [merge] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

In [108]:
splits = merge_pair("a", "##b", splits)
splits["about"]

['ab', '##o', '##u', '##t']

In [109]:
vocab_size = 70
while len(vocab) < vocab_size:
    scores = compute_pair_scores(splits)
    best_pair, max_score = "", None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    splits = merge_pair(*best_pair, splits)
    new_token = (
        best_pair[0] + best_pair[1][2:]
        if best_pair[1].startswith("##")
        else best_pair[0] + best_pair[1]
    )
    vocab.append(new_token)

In [110]:
print(vocab)

['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', '##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##r', '##s', '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u', 'w', 'y', 'ab', '##fu', 'Fa', 'Fac', '##ct', '##ful', '##full', '##fully', 'Th', 'ch', '##hm', 'cha', 'chap', 'chapt', '##thm', 'Hu', 'Hug', 'Hugg', 'sh', 'th', 'is', '##thms', '##za', '##zat', '##ut']


...

In [111]:
def encode_word(word):
    tokens = []
    while len(word) > 0:
        i = len(word)
        while i > 0 and word[:i] not in vocab:
            i -= 1
        if i == 0:
            return ["[UNK]"]
        tokens.append(word[:i])
        word = word[i:]
        if len(word) > 0:
            word = f"##{word}"
    return tokens

In [112]:
print(encode_word("Hugging"))
print(encode_word("HOgging"))

['Hugg', '##i', '##n', '##g']
['[UNK]']


In [113]:
def tokenize(text):
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    encoded_words = [encode_word(word) for word in pre_tokenized_text]
    return sum(encoded_words, [])

In [114]:
tokenize("This is the Hugging Face course!")

['Th',
 '##i',
 '##s',
 'is',
 'th',
 '##e',
 'Hugg',
 '##i',
 '##n',
 '##g',
 'Fac',
 '##e',
 'c',
 '##o',
 '##u',
 '##r',
 '##s',
 '##e',
 '[UNK]']

...

#### Building a tokenizer, block by block

##### Acquiring a corpus

In [115]:
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

In [116]:
with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n") 

In [117]:
!head wikitext-2.txt


 = Valkyria Chronicles III = 


 Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . 

 The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series 

...

##### Building a WordPiece tokenizer from scratch

The general steps are:

1. Instantiate a `Tokenizer` object with a `model`.
2. Set its `normalizer`.
3. Set its `pre_tokenizer`.
4. Set its `post_processor`.
5. Set its `decoder`.

Each of these components is an attribute of the `Tokenizer` object. Also:

> We have to specify the `unk_token` so the model knows what to return when it encounters characters it hasn’t seen before. 

In [118]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

Next comes normalization:

> The library provides a `Lowercase` normalizer and a `StripAccents` normalizer,
> and you can compose several normalizers using a `Sequence`...<br/>
> We're also using an `NFD` Unicode normalizer, as otherwise the `StripAccents` normalizer
> won’t properly recognize the accented characters and thus won’t strip them out.

In [119]:
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(), 
    normalizers.Lowercase(), 
    normalizers.StripAccents()
])

In [120]:
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

hello how are u?
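
To see why the `NFD` step matters, here is a minimal sketch that uses only Python's `unicodedata` module instead of 🤗 Tokenizers. The helper `strip_accents` is hypothetical: it drops combining marks, which only approximates what `StripAccents` does.

```python
import unicodedata

def strip_accents(text):
    # Drop combining marks (Unicode category "Mn").
    # A precomposed letter such as "é" is a single code point, so it is left alone.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Force the precomposed (NFC) form so the example does not depend on the editor.
text = unicodedata.normalize("NFC", "Héllò hôw are ü?")

# Without NFD there are no separate combining marks to remove.
without_nfd = strip_accents(text)

# NFD splits "é" into "e" plus a combining accent, which can then be stripped.
with_nfd = strip_accents(unicodedata.normalize("NFD", text)).lower()

print(without_nfd)
print(with_nfd)
```

Stripping without decomposing first leaves the text unchanged, which is the failure mode the quote above describes.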


> ... the `Whitespace` pre-tokenizer splits on whitespace and all characters that are not letters,
> digits, or the underscore character, so it technically splits on whitespace and punctuation

In [121]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

In [122]:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre', (14, 17)),
 ('-', (17, 18)),
 ('tokenizer', (18, 27)),
 ('.', (27, 28))]
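
The behavior of `Whitespace` can be imitated with a single regular expression, `\w+|[^\w\s]+`. The sketch below uses Python's `re` module, and `whitespace_pre_tokenize` is a hypothetical helper name rather than part of the library.

```python
import re

def whitespace_pre_tokenize(text):
    # \w+      : a run of letters, digits or underscores
    # [^\w\s]+ : a run of anything else that is not whitespace (punctuation)
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]+", text)]

pieces = whitespace_pre_tokenize("Let's test my pre-tokenizer.")
for piece in pieces:
    print(piece)
```

Each match carries its character span, which is where the offsets in the output above come from.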

In [123]:
pre_tokenizer = pre_tokenizers.WhitespaceSplit()

pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[("Let's", (0, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre-tokenizer.', (14, 28))]

> As with ... normalizers, you can use a `Sequence` to compose several pre-tokenizers

In [124]:
pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(), 
    pre_tokenizers.Punctuation()
])

pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre', (14, 17)),
 ('-', (17, 18)),
 ('tokenizer', (18, 27)),
 ('.', (27, 28))]

In [125]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

In [126]:
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)






...

In [127]:
encoding = tokenizer.encode("Let's test this tokenizer.")

print(encoding.tokens)

['let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.']


In [128]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

print(cls_token_id, sep_token_id)

2 3


In [129]:
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

In [130]:
encoding = tokenizer.encode("Let's test this tokenizer.")

print(encoding.tokens)

['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.', '[SEP]']


In [131]:
encoding = tokenizer.encode("Let's test this tokenizer.",  "Again!")

print(encoding.tokens)
print(encoding.type_ids)

['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.', '[SEP]', 'again', '!', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
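
To make the template syntax concrete, here is a rough pure-Python sketch of how a template string expands into tokens and type IDs. `apply_template` is a hypothetical helper written for illustration; it is not how `TemplateProcessing` is implemented.

```python
def apply_template(template, sequences):
    """Expand a template such as "[CLS]:0 $A:0 [SEP]:0" into (tokens, type_ids)."""
    tokens, type_ids = [], []
    for piece in template.split():
        name, type_id = piece.rsplit(":", 1)
        if name.startswith("$"):
            # "$A" / "$B" stand for the first / second input sequence.
            sequence = sequences[name[1:]]
            tokens.extend(sequence)
            type_ids.extend([int(type_id)] * len(sequence))
        else:
            # Anything else is a special token inserted as-is.
            tokens.append(name)
            type_ids.append(int(type_id))
    return tokens, type_ids

sequences = {
    "A": ["let", "'", "s", "test", "this", "tok", "##eni", "##zer", "."],
    "B": ["again", "!"],
}
tokens, type_ids = apply_template("[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1", sequences)
print(tokens)
print(type_ids)
```

The number after each colon is the type ID given to every token that piece produces, which is why the second sentence and its closing `[SEP]` get `1`.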


In [132]:
tokenizer.decoder = decoders.WordPiece(prefix="##")

In [133]:
tokenizer.decode(encoding.ids)

"let ' s test this tokenizer. again!"

In [134]:
tokenizer.save("tokenizer.json")

In [135]:
!ls -la

total 638956
drwxr-xr-x 8 so_olliphant so_olliphant      4096 Feb 18 05:14 .
drwxr-xr-x 3 so_olliphant so_olliphant      4096 Feb 18 02:34 ..
drwxr-xr-x 8 so_olliphant so_olliphant      4096 Feb 18 02:28 .git
drwxr-xr-x 2 so_olliphant so_olliphant      4096 Feb 18 05:10 .ipynb_checkpoints
-rw-r--r-- 1 so_olliphant so_olliphant     14239 Feb 18 02:28 Ch2_p38.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant     26388 Feb 18 02:51 HF_NLP_Ch2.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant     58778 Feb 18 03:05 HF_NLP_Ch3.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant      7373 Feb 18 03:06 HF_NLP_Ch4.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant    293035 Feb 18 05:09 HF_NLP_Ch5.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant    115721 Feb 18 05:14 HF_NLP_Ch6.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant        15 Feb 18 02:28 README.md
-rw-r--r-- 1 so_olliphant so_olliphant   8385528 Feb 18 03:06 SQuAD_it-test.json
-rw-r--r-- 1 so_olliphant so_olliphant   1051245 Feb 18 03:06 SQuAD_it-test.json.gz
-r

In [136]:
!head tokenizer.json

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 0,
      "content": "[UNK]",
      "single_word": false,
      "lstrip": false,


In [137]:
new_tokenizer = Tokenizer.from_file("tokenizer.json")

In [138]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

...

##### Building a BPE tokenizer from scratch

> We also <span style="background-color: #33ffff">don’t need to specify an `unk_token` because GPT-2 uses byte-level BPE</span>, which doesn't require it

In [139]:
tokenizer = Tokenizer(models.BPE())
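
A small standard-library illustration of why no unknown token is needed: any string, including characters never seen during training, is just a sequence of UTF-8 bytes, and there are only 256 possible byte values. This is a sketch of the idea, not of the library's internals.

```python
text = "héllo 🤗"

# Every character, however rare, becomes one or more bytes in the range 0..255.
byte_symbols = list(text.encode("utf-8"))
print(byte_symbols)

# The mapping is lossless: the bytes decode back to the original text.
round_trip = bytes(byte_symbols).decode("utf-8")
print(round_trip)
```

A base vocabulary of 256 byte symbols therefore covers every possible input.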

> GPT-2 does not use a `normalizer`, so we skip that step and go directly to the pre-tokenization

In [140]:
from transformers import AutoTokenizer

# yes, it's true... there's no normalizer set
foo = AutoTokenizer.from_pretrained("gpt2")

print(foo.backend_tokenizer.normalizer)

None


In [141]:
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

tokenizer.pre_tokenizer.pre_tokenize_str("Let's test pre-tokenization!")

[('Let', (0, 3)),
 ("'s", (3, 5)),
 ('Ġtest', (5, 10)),
 ('Ġpre', (10, 14)),
 ('-', (14, 15)),
 ('tokenization', (15, 27)),
 ('!', (27, 28))]

In [142]:
foo.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Let's test pre-tokenization!")

[('Let', (0, 3)),
 ("'s", (3, 5)),
 ('Ġtest', (5, 10)),
 ('Ġpre', (10, 14)),
 ('-', (14, 15)),
 ('tokenization', (15, 27)),
 ('!', (27, 28))]
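
The `Ġ` in these outputs is how the byte-level pre-tokenizer displays a space. The sketch below rebuilds the byte-to-character table, modeled on the `bytes_to_unicode` helper from the original GPT-2 code. It is reproduced for illustration and is not part of this notebook.

```python
def bytes_to_unicode():
    # Bytes that already correspond to printable characters keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Whitespace and control bytes are shifted to code points from 256 upwards.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))

byte_to_char = bytes_to_unicode()
visible = "".join(byte_to_char[b] for b in " test".encode("utf-8"))
print(visible)
```

The space is byte 32, the 33rd non-printable byte, so it lands on code point 288, which is `Ġ`.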

> Next is the model, which needs training. For GPT-2, the only special token is the end-of-text token

In [143]:
foo.special_tokens_map

{'bos_token': '<|endoftext|>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<|endoftext|>'}

In [144]:
%%time

trainer = trainers.BpeTrainer(
    vocab_size=25000, 
    special_tokens=["<|endoftext|>"]
)

tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)




CPU times: user 22.6 s, sys: 35.5 ms, total: 22.6 s
Wall time: 22.6 s


...

In [145]:
encoding = tokenizer.encode("Let's test this tokenizer.")

print(encoding.tokens)

['L', 'et', "'", 's', 'Ġtest', 'Ġthis', 'Ġto', 'ken', 'izer', '.']


In [146]:
tokens = foo.tokenize("Let's test this tokenizer.")

print(tokens)

['Let', "'s", 'Ġtest', 'Ġthis', 'Ġtoken', 'izer', '.']


In [147]:
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

In [148]:
sentence = "Let's test this tokenizer."
encoding = tokenizer.encode(sentence)
start, end = encoding.offsets[4]
sentence[start:end]

' test'

In [149]:
tokenizer.decoder = decoders.ByteLevel()

In [150]:
tokenizer.decode(encoding.ids)

"Let's test this tokenizer."

...

In [151]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)

...

##### Building a Unigram tokenizer from scratch

> the model starts with [`tokenizers.models.Unigram`](https://huggingface.co/docs/tokenizers/api/models#tokenizers.models.Unigram)

In [152]:
tokenizer = Tokenizer(models.Unigram())

> For the normalization, XLNet uses a few replacements (which come from SentencePiece)
>
> This replaces <code>``</code> and <code>''</code> with <code>"</code> and any sequence of two or more spaces with a single space, as well as removing the accents in the texts to tokenize.

In [153]:
from tokenizers import Regex

tokenizer.normalizer = normalizers.Sequence([
    normalizers.Replace("``", '"'),
    normalizers.Replace("''", '"'),
    normalizers.NFKD(),
    normalizers.StripAccents(),
    normalizers.Replace(Regex(" {2,}"), " "),
])
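
The same sequence of steps can be approximated with the standard library. The function name `xlnet_like_normalize` is hypothetical, and dropping combining marks is only an approximation of `StripAccents`.

```python
import re
import unicodedata

def xlnet_like_normalize(text):
    # 1. Replace LaTeX-style quotes with a plain double quote.
    text = text.replace("``", '"').replace("''", '"')
    # 2. Decompose characters so accents become separate combining marks.
    text = unicodedata.normalize("NFKD", text)
    # 3. Drop the combining marks (the accents).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # 4. Collapse runs of two or more spaces into one.
    return re.sub(r" {2,}", " ", text)

normalized = xlnet_like_normalize("``H\u00e9llo''   w\u00f6rld")
print(normalized)
```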

> The pre-tokenizer to use for any `SentencePiece` tokenizer is [`Metaspace`](https://huggingface.co/docs/tokenizers/api/pre-tokenizers#tokenizers.pre_tokenizers.Metaspace):
>
> ... replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (Same as in SentencePiece)
>
> ... add(s) a space to the first word if there isn't already one. This lets us treat _hello_ exactly like _say hello_

In [154]:
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

In [155]:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer!")

[("▁Let's", (0, 5)),
 ('▁test', (5, 10)),
 ('▁the', (10, 14)),
 ('▁pre-tokenizer!', (14, 29))]

In [156]:
tokenizer.pre_tokenizer.pre_tokenize_str("hello")

[('▁hello', (0, 5))]

In [157]:
tokenizer.pre_tokenizer.pre_tokenize_str("say hello")

[('▁say', (0, 3)), ('▁hello', (3, 9))]
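
As a rough sketch of what `Metaspace` does to the text (ignoring offsets), the hypothetical helper below adds a leading space if there is none, swaps spaces for `▁`, and cuts before each `▁`.

```python
import re

def metaspace_split(text, replacement="\u2581"):
    # Add a leading space so the first word is treated like any other word.
    if not text.startswith(" "):
        text = " " + text
    # Replace every space with the meta symbol.
    text = text.replace(" ", replacement)
    # Each piece starts at a meta symbol and runs up to the next one.
    marker = re.escape(replacement)
    return re.findall(f"{marker}[^{marker}]*", text)

print(metaspace_split("hello"))
print(metaspace_split("say hello"))
```

Because of the added prefix, `hello` on its own and `hello` after `say` both come out as `▁hello`.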

The special tokens for XLNet are:
* `<cls>`
* `<sep>`
* `<unk>`
* `<pad>`
* `<mask>`
* `<s>`
* `</s>`

We pass these to the trainer and also tell it which one is the unknown token (`unk_token`).

In [158]:
special_tokens = [
    "<cls>", 
    "<sep>", 
    "<unk>",
    "<pad>", 
    "<mask>", 
    "<s>", 
    "</s>"
]

trainer = trainers.UnigramTrainer(
    vocab_size=25000, special_tokens=special_tokens, unk_token="<unk>"
)

tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)





...

In [159]:
encoding = tokenizer.encode("Let’s  test this tokenizer.")

print(encoding.tokens)

['▁Let', '’', 's', '▁test', '▁this', '▁to', 'ken', 'izer', '.']


> A peculiarity of XLNet is that it puts the `<cls>` token at the end of the sentence,
> with a type ID of 2 (to distinguish it from the other tokens). It’s padding on the left,
> as a result.
> We can deal with all the special tokens and token type IDs with a (post-processor) template,
> like for BERT, but first we have to get the IDs of the `<cls>` and `<sep>` tokens

In [160]:
cls_token_id = tokenizer.token_to_id("<cls>")
sep_token_id = tokenizer.token_to_id("<sep>")

print(cls_token_id, sep_token_id)

0 1


In [161]:
tokenizer.post_processor = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)],
)

In [162]:
encoding = tokenizer.encode(
    "Let's test this tokenizer...", 
    "on a pair of sentences!"
)

print(encoding.tokens)
print(encoding.type_ids)

['▁Let', "'", 's', '▁test', '▁this', '▁to', 'ken', 'izer', '.', '.', '.', '<sep>', '▁', 'on', '▁', 'a', '▁pair', '▁of', '▁sentence', 's', '!', '<sep>', '<cls>']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]


> Finally, we add a [`tokenizers.decoders.Metaspace`](https://huggingface.co/docs/tokenizers/api/decoders#tokenizers.decoders.Metaspace) decoder

In [163]:
tokenizer.decoder = decoders.Metaspace()

> We can save the tokenizer like before, and wrap it in a `PreTrainedTokenizerFast`
> or `XLNetTokenizerFast` if we want to use it in 🤗 Transformers. One thing to note
> when using `PreTrainedTokenizerFast` is that on top of the special tokens,
> <span style="background-color:#33FFFF">we need to tell the 🤗 Transformers library to pad on the left</span>

In [164]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",        # XLNet pads on the left; the default for PreTrainedTokenizerFast is "right"
)
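
A toy illustration of why the padding side matters when `<cls>` sits at the end of the sequence. The `pad` helper and all token ids below are made up for this example.

```python
def pad(ids, length, pad_id, side="right"):
    # Build the padding once, then attach it on the requested side.
    padding = [pad_id] * max(0, length - len(ids))
    return padding + ids if side == "left" else ids + padding

# Hypothetical ids: 1 = <sep>, 0 = <cls>, 3 = <pad>, others are ordinary tokens.
encoded = [11, 12, 1, 0]

left_padded = pad(encoded, 6, pad_id=3, side="left")
right_padded = pad(encoded, 6, pad_id=3, side="right")
print(left_padded)
print(right_padded)
```

With left padding, `<cls>` stays in the last position of every sequence in a batch. With right padding it would be followed by padding tokens.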


### Tokenizers, check!

1. When should you train a new tokenizer?

> When your dataset is different from the one used by an existing pretrained model, and you want to pretrain a new model.<br/>
> In this case there's no advantage to using the same tokenizer.

2. What is the advantage of using a generator of lists of texts compared to a list of lists of texts when using `train_new_from_iterator()`?

> Each batch of texts will be released from memory when you iterate, and the gain will be especially visible if you use 🤗 Datasets to store your texts.
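
A small sketch of the difference, using plain Python lists in place of a real dataset (the names and sizes are illustrative only):

```python
def batches_as_generator(texts, batch_size):
    # Yields one batch at a time; nothing is built until the consumer asks for it.
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]

texts = [f"document {i}" for i in range(10)]

# Eager version: every batch exists in memory at the same time.
batches_as_list = [texts[i : i + 4] for i in range(0, len(texts), 4)]

# Lazy version: each batch is created on demand and can be freed afterwards.
lazy_batches = batches_as_generator(texts, 4)
first_batch = next(lazy_batches)
remaining = list(lazy_batches)
```

One caveat: a generator can be consumed only once, so it has to be recreated before each training run.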


3. What are the advantages of using a “fast” tokenizer?

> It can process inputs faster than a slow tokenizer when you batch lots of inputs together.<br/>
> ...Thanks to parallelism implemented in Rust, it will be faster on batches of inputs.
>
> It has some additional features allowing you to map tokens to the span of text that created them.<br/>
> ...offset mappings. That's not the only advantage, though

4. How does the token-classification pipeline handle entities that span over several tokens?

> There is a label for the beginning of an entity and a label for the continuation of an entity.
>
> In a given word, as long as the first token has the label of the entity, the whole word is considered labeled with that entity.
>
> When a token has the label of a given entity, any other following token with the same label is considered part of the same entity, unless it's labeled as the start of a new entity.
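
The last rule can be sketched in a few lines. `group_entities` is a hypothetical helper that assumes labels of the form `B-XXX`, `I-XXX` and `O`; it is a simplification, not the pipeline's actual code.

```python
def group_entities(tokens, labels):
    entities = []
    current = None
    for token, label in zip(tokens, labels):
        if label == "O":
            # Outside any entity: close the current group.
            current = None
            continue
        prefix, entity = label.split("-", 1)
        if current is not None and prefix == "I" and current["entity"] == entity:
            # Continuation of the same entity.
            current["tokens"].append(token)
        else:
            # A "B-" label, or a change of entity type, starts a new group.
            current = {"entity": entity, "tokens": [token]}
            entities.append(current)
    return entities

tokens = ["Hugging", "Face", "is", "in", "New", "York", "Brooklyn"]
labels = ["B-ORG", "I-ORG", "O", "O", "B-LOC", "I-LOC", "B-LOC"]
groups = group_entities(tokens, labels)
print(groups)
```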

5. How does the question-answering pipeline handle long contexts?

> It splits the context into several parts (with overlap) and finds the maximum score for an answer in each part.
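
The splitting step can be sketched as a sliding window. `split_with_overlap`, `max_len` and `overlap` are hypothetical names, and the integers stand in for context tokens.

```python
def split_with_overlap(tokens, max_len, overlap):
    # Each new window starts (max_len - overlap) tokens after the previous one.
    assert 0 <= overlap < max_len
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the context
    return chunks

context_tokens = list(range(10))
chunks = split_with_overlap(context_tokens, max_len=4, overlap=2)
print(chunks)
```

The overlap gives an answer that straddles a chunk boundary a chance to appear whole in at least one chunk.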

6. What is normalization?

> It's any cleanup the tokenizer performs on the texts in the initial stages.

7. What is pre-tokenization for a subword tokenizer?

> It's the step before the tokenizer model is applied, to split the input into words.

8. Select the sentences that apply to the BPE model of tokenization.

> BPE is a subword tokenization algorithm that starts with a small vocabulary and learns merge rules.
>
> BPE tokenizers learn merge rules by merging the pair of tokens that is the most frequent.
>
> BPE tokenizes words into subwords by splitting them into characters and then applying the merge rules.
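
The three statements can be sketched on a toy corpus of word frequencies. This is a simplified character-level version for illustration, and the helper names are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the merged symbol.
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(word_freqs, num_merges):
    # Start from single characters and repeatedly merge the most frequent pair.
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        pair_freqs = Counter()
        for word, freq in word_freqs.items():
            for pair in zip(splits[word], splits[word][1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)
        splits = {word: merge_pair(symbols, best) for word, symbols in splits.items()}
    return merges

def bpe_tokenize(word, merges):
    # Split into characters, then replay the learned merges in order.
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(bpe_tokenize("hugs", merges))
```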

9. Select the sentences that apply to the WordPiece model of tokenization.

> WordPiece is a subword tokenization algorithm that starts with a small vocabulary and learns merge rules.
>
> A WordPiece tokenizer learns a merge rule by merging the pair of tokens that maximizes a score that privileges frequent pairs with less frequent individual parts.
>
> WordPiece tokenizes words into subwords by finding the longest subword starting from the beginning that is in the vocabulary, then repeating the process for the rest of the text.
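
The score in the second statement is the pair's frequency divided by the product of the frequencies of its two parts. Here is a minimal sketch on a toy corpus (the helper name is hypothetical):

```python
from collections import Counter

def wordpiece_pair_scores(word_freqs):
    symbol_freqs, pair_freqs = Counter(), Counter()
    for word, freq in word_freqs.items():
        # The first character stands alone; the rest carry the "##" prefix.
        symbols = [word[0]] + [f"##{ch}" for ch in word[1:]]
        for symbol in symbols:
            symbol_freqs[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freqs[pair] += freq
    scores = {
        pair: freq / (symbol_freqs[pair[0]] * symbol_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores, pair_freqs

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
scores, pair_freqs = wordpiece_pair_scores(word_freqs)

best_by_score = max(scores, key=scores.get)
most_frequent = max(pair_freqs, key=pair_freqs.get)
print(best_by_score, most_frequent)
```

On this corpus the most frequent pair and the best-scoring pair differ. The score favors a pair whose parts rarely occur apart, which is the point of the second statement.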

10. Select the sentences that apply to the Unigram model of tokenization.

> Unigram is a subword tokenization algorithm that starts with a big vocabulary and progressively removes tokens from it.
>
> Unigram adapts its vocabulary by minimizing a loss computed over the whole corpus.
>
> Unigram tokenizes words into subwords by finding the most likely segmentation into tokens, according to the model.
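
Finding the most likely segmentation is a dynamic-programming (Viterbi-style) search. The sketch below uses made-up token probabilities and a hypothetical helper name; a real Unigram model learns its probabilities from the corpus.

```python
import math

def viterbi_segment(word, token_probs):
    # best[i] = (best log-probability of word[:i], start index of its last token)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            token = word[start:end]
            if token in token_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(token_probs[token])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None  # no segmentation exists with this vocabulary
    # Walk back from the end to recover the tokens.
    tokens, end = [], len(word)
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1]

# Hypothetical probabilities, chosen only for illustration.
token_probs = {
    "h": 0.1, "u": 0.1, "g": 0.1, "s": 0.1, "p": 0.1,
    "hu": 0.15, "ug": 0.2, "hug": 0.05,
}
print(viterbi_segment("hugs", token_probs))
print(viterbi_segment("pug", token_probs))
```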