<a href="https://colab.research.google.com/github/educatorsRlearners/hugging_face_course/blob/main/06_the_%F0%9F%A4%97_Tokenizers_library.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install datasets transformers[sentencepiece]
from transformers import AutoTokenizer

When fine-tuning a model, it only makes sense to use the same tokenizer that it was trained on. But what do you do when you want to create a model from scratch? 

Well, that's exactly what we're going to do in this chapter. 

# Training a new tokenizer from an old one

Key point: if no language model is available in our target language or, more likely in my case, our corpus differs significantly from the one an existing model was trained on, then we'll want to train a model from scratch using a tokenizer adapted to our data. 

For instance, if we want to tokenize a simple sentence like "I went shopping with my mother last week," the standard BERT tokenizer works well:

In [None]:
checkpoint = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sample_sentence = "I went shopping with my mother last week."

print(tokenizer.tokenize(sample_sentence))

['i', 'went', 'shopping', 'with', 'my', 'mother', 'last', 'week', '.']


However, if we pass it a highly technical, academic, or archaic text, the results aren't nearly as good: 

In [None]:
medical = "the medical vocabulary is divided into many sub-token: paracetamol, pharyngitis, and oxycodone."

print(tokenizer.tokenize(medical))

['the', 'medical', 'vocabulary', 'is', 'divided', 'into', 'many', 'sub', '-', 'token', ':', 'para', '##ce', '##tam', '##ol', ',', 'ph', '##ary', '##ng', '##itis', ',', 'and', 'ox', '##y', '##co', '##don', '##e', '.']
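
To put a number on that fragmentation, the printed tokens can be regrouped into words. The sketch below is plain Python over the output above and assumes only the WordPiece convention that a ```##``` prefix marks a continuation of the previous piece; the variable names are my own.

```python
# Tokens copied from the output above.
tokens = ['the', 'medical', 'vocabulary', 'is', 'divided', 'into', 'many',
          'sub', '-', 'token', ':', 'para', '##ce', '##tam', '##ol', ',',
          'ph', '##ary', '##ng', '##itis', ',', 'and', 'ox', '##y', '##co',
          '##don', '##e', '.']

# Regroup the pieces: a "##" prefix means "glue me to the previous piece".
words = []
for token in tokens:
    if token.startswith("##") and words:
        words[-1].append(token[2:])
    else:
        words.append([token])

# Keep only the words that needed more than one piece.
fragmented = {"".join(pieces): len(pieces) for pieces in words if len(pieces) > 1}
print(fragmented)
# {'paracetamol': 4, 'pharyngitis': 4, 'oxycodone': 5}
```

Every one of the three drug and disease names costs four or five tokens, while the everyday words cost one each.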


To that end, training a tokenizer consists of four steps: 
- building a corpus
- selecting the tokenizer architecture
- training the tokenizer on the corpus
- saving the result

## [Assembling a corpus](https://huggingface.co/course/chapter6/2?fw=pt#assembling-a-corpus)

Once we have the corpus, we can call the ```train_new_from_iterator()``` method on a tokenizer loaded with ```AutoTokenizer``` so that the new tokenizer has the same characteristics as the one for the model we wish to emulate. 

What do I mean by that? 

Basically, if we're going to use the ```GPT-2``` model architecture, we want our tokenizer to tokenize in the same manner as the ```GPT-2``` tokenizer. 

For this walkthrough, I'm just going to follow along using the [CodeSearchNet](https://huggingface.co/datasets/code_search_net) dataset but, in the future, I'm going to try something more classical (like Shakespeare) or possibly more exotic. 

But, for now...

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("code_search_net", "python")

Let's have a look at the columns we're working with. 

In [None]:
raw_datasets['train']

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})

OK, so the docstrings are separated from the code and the dataset recommends tokenizing both of them. 

Let's have a look at an example to see what we're working with:

In [None]:
print(raw_datasets["train"][123456]['whole_func_string'])

def rank(self, issue, next_issue):
        """Rank an issue before another using the default Ranking field, the one named 'Rank'.

        :param issue: issue key of the issue to be ranked before the second one.
        :param next_issue: issue key of the second issue.
        """
        if not self._rank:
            for field in self.fields():
                if field['name'] == 'Rank':
                    if field['schema']['custom'] == "com.pyxis.greenhopper.jira:gh-lexo-rank":
                        self._rank = field['schema']['customId']
                        break
                    elif field['schema']['custom'] == "com.pyxis.greenhopper.jira:gh-global-rank":
                        # Obsolete since JIRA v6.3.13.1
                        self._rank = field['schema']['customId']

        if self._options['agile_rest_path'] == GreenHopperResource.AGILE_BASE_REST_PATH:
            url = self._get_url('issue/rank', base=self.AGILE_BASE_URL)
            payload = {'issues': [i

OK, now the key is to transform the dataset into an *iterator*. 

Why? Because if our dataset is an iterator, we can feed it to our function in batches as opposed to all at once. 

Why does that matter? If we pass it to our function all at once, we need to load the ***entire dataset into memory***, which will most likely crash our computer. 

For example, doing the following would be bad 🙁

In [None]:
# Don't uncomment the following lines unless your dataset is small!

# training_corpus = [
#     raw_datasets["train"][i : i + 1000]["whole_func_string"]
#     for i in range(0, len(raw_datasets["train"]), 1000)
# ]

Instead, we want to create a generator like this: 

In [None]:
training_corpus = (
    raw_datasets["train"][i : i + 1000]["whole_func_string"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)

So what's the difference between the two? 

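Instead of brackets we use parentheses, which turns the list comprehension into a generator expression: batches are produced one at a time as they are requested rather than all up front. (Fun fact: if you've ever wondered why you can't write a tuple comprehension the way you write a list comprehension, now you know 😆)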
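
A quick way to see the difference is to build both forms over the same batch offsets. This is a minimal sketch in plain Python; the row count is the ```num_rows``` value printed earlier, and nothing here touches the real dataset.

```python
import sys

num_rows = 412178  # num_rows of the train split shown earlier

# Brackets: every element is computed and stored right away.
starts_list = [i for i in range(0, num_rows, 1000)]

# Parentheses: a generator expression; elements are produced on demand.
starts_gen = (i for i in range(0, num_rows, 1000))

print(type(starts_list).__name__, len(starts_list))  # list 413
print(type(starts_gen).__name__)                     # generator

# The generator object stays tiny no matter how many items it will produce.
print(sys.getsizeof(starts_list) > sys.getsizeof(starts_gen))

# If you really do want a tuple, wrap a generator expression in tuple().
first_three = tuple(i for i in range(0, 3000, 1000))
print(first_three)  # (0, 1000, 2000)
```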

Now the important thing to remember about generator objects is that they can only be used once, as this shows: 

In [None]:
gen = (l for l in "abcdefg")
print(list(gen))
print(list(gen))

['a', 'b', 'c', 'd', 'e', 'f', 'g']
[]


So what do we do if we want to use a generator more than once? 

Simply write a function which returns a generator 😀

If the object is straightforward, we can use comprehension syntax like above: 

In [None]:
def get_training_corpus():
  return(
      raw_datasets['train'][i: i + 1000]["whole_func_string"]
      for i in range(0, len(raw_datasets['train']), 1000)
  )

training_corpus = get_training_corpus()

Now if we're going to do something more complicated, then a better idea is to use a ```yield``` statement instead of ```return```.

In [None]:
def get_training_corpus():
  dataset = raw_datasets["train"]
  for start_idx in range(0, len(dataset), 1000):
    samples = dataset[start_idx : start_idx + 1000]
    yield samples["whole_func_string"]

training_corpus = get_training_corpus()
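
To convince yourself that each call hands back a fresh generator, here is the same pattern on a toy list. The names ```get_batches``` and ```toy_corpus``` are made up for this sketch and are not part of the course code.

```python
def get_batches(records, batch_size=3):
    """Yield successive slices of `records`, batch_size items at a time."""
    for start in range(0, len(records), batch_size):
        yield records[start : start + batch_size]

toy_corpus = [f"def f{i}(): pass" for i in range(7)]

first_pass = list(get_batches(toy_corpus))
second_pass = list(get_batches(toy_corpus))  # a new call, so a new generator

print([len(batch) for batch in first_pass])  # [3, 3, 1]
print(first_pass == second_pass)             # True
```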

## [Training a new tokenizer](https://huggingface.co/course/chapter6/2?fw=pt#training-a-new-tokenizer)

Now that we've created a function which will generate batches of text to feed into our tokenizer, we need to load the tokenizer we'd like to emulate. 

In [None]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

> _So if we're creating a new tokenizer, why not just start it from scratch?_

Solid question. 

The answer is that it's best to stand on the shoulders of giants: rather than defining every aspect of the tokenizer ourselves (e.g., the special tokens), we reuse the existing tokenizer's setup and just retrain it on our specific vocabulary. 

Now let's create a baseline by identifying how the standard ```gpt2``` tokenizer would tokenize the following: 

In [None]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
print(tokens)

['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


We can see some real problems: it splits ```numbers``` in the function name into 'n' and 'umbers', and it spends a separate token on each space of indentation. 

Let's see if a tokenizer trained on this corpus performs better:

In [None]:
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)

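Now, without getting too much into the weeds, 🤗 Transformers offers both fast and slow tokenizers. Fast tokenizers are backed by the 🤗 Tokenizers library, which is written in Rust, whereas slow tokenizers are written in pure Python. Note that ```train_new_from_iterator()``` is only available on fast tokenizers. 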

> _Why does that matter?_ 

Training a tokenizer in pure Python is exceptionally slow. 

As such, check [here](https://huggingface.co/transformers/#supported-frameworks) to see if the model your tokenizer is based on has a fast version.  

In [None]:
tokens = tokenizer.tokenize(example)
print(tokens)

['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


Much better! Our tokenizer recognizes an indentation 'ĊĠĠĠ', a docstring 'Ġ"""', and properly splits the function name on the underscore. 
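
In case the ```Ġ``` and ```Ċ``` symbols look odd: GPT-2's byte-level tokenizer displays a space as ```Ġ``` and a newline as ```Ċ```. The sketch below maps them back by hand to confirm that the new tokens cover the original string exactly. The ```readable``` helper is my own and only handles these two symbols, which is all this example needs.

```python
# Tokens copied from the output of the newly trained tokenizer above.
tokens = ['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ',
          'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand',
          'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']

def readable(token):
    """Turn the two display symbols back into a real space and newline."""
    return token.replace("Ġ", " ").replace("Ċ", "\n")

# Both symbols are the original byte value shifted up by 256 code points.
print(ord("Ġ") - 256 == ord(" "))   # True
print(ord("Ċ") - 256 == ord("\n"))  # True

restored = "".join(readable(t) for t in tokens)
print(restored)
```

So 'ĊĠĠĠ' followed by a token starting with 'Ġ' is simply a newline plus four spaces of indentation.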

Furthermore, to get a sense of how many fewer tokens we'd create by correctly identifying white space as well as docstring markers, we can simply do the following: 

In [None]:
print(len(tokens))
print(len(old_tokenizer.tokenize(example)))

27
36
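
Put as a percentage, using the two counts just printed (trivial arithmetic, but handy when comparing tokenizers):

```python
old_count = 36  # tokens from the original gpt2 tokenizer
new_count = 27  # tokens from the tokenizer trained on code

reduction = (old_count - new_count) / old_count
print(f"{reduction:.0%} fewer tokens")  # 25% fewer tokens
```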


And, since all learning is repetition, let's see another example: 

In [None]:
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """

print(tokenizer.tokenize(example))

['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weights', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']


Oh very nice! We can see that camel-cased names are correctly tokenized as well as dunder methods. 

## Saving the Tokenizer 

Now there is no point in doing all that work only to redo it at a later date. Additionally, if we save and share the tokenizer on the Hub, others will benefit from our hard work. 

To that end, be sure to save your tokenizer by using the ```save_pretrained()``` method like so: 

In [None]:
# Be sure to uncomment the line below and pass in a unique name

# tokenizer.save_pretrained("name-of-tokenizer")

Now let's push it to the hub. 

If you're working in a notebook: 

In [None]:
#from huggingface_hub import notebook_login

#notebook_login()

If not: 

In [None]:
# huggingface-cli login

Now that you're logged in, you can simply push it like so: 

In [None]:
# tokenizer.push_to_hub("name-of-tokenizer")

And with that, your tokenizer lives in the Hub where anyone can load it like so: 

In [None]:
# Replace "huggingface-course" below with your actual namespace to use your own tokenizer

# tokenizer = AutoTokenizer.from_pretrained("huggingface-course/name-of-tokenizer")

# [Fast tokenizers' special powers](https://huggingface.co/course/chapter6/3?fw=pt#fast-tokenizers-special-powers)

As previously mentioned, "slow" tokenizers are slow because they are written in pure Python whereas "fast" tokenizers are written in Rust. 

Let's see how much faster the "fast" tokenizer truly is by using both to tokenize the MNLI subset of the GLUE dataset.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mnli")

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 392702
    })
    validation_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9815
    })
    validation_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9832
    })
    test_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9796
    })
    test_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9847
    })
})

In [None]:
from transformers import AutoTokenizer

checkpoint = 'bert-base-cased'

fast_tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_with_fast(examples):
   return fast_tokenizer(
       examples['premise'],
       examples["hypothesis"],
       truncation=True
       )

In [None]:
slow_tokenizer = AutoTokenizer.from_pretrained(checkpoint,
                                               use_fast=False)

def tokenize_with_slow(examples):
  return slow_tokenizer(
      examples['premise'],
      examples["hypothesis"],
      truncation=True
  )

In [None]:
#Fast batched
%time tokenized_datasets = raw_datasets.map(tokenize_with_fast, batched=True)

CPU times: user 1min 25s, sys: 870 ms, total: 1min 26s
Wall time: 1min 6s


In [None]:
%time tokenized_datasets = raw_datasets.map(tokenize_with_slow, batched=True)

CPU times: user 5min 56s, sys: 1.7 s, total: 5min 58s
Wall time: 5min 59s


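Clearly, it pays to use a fast tokenizer whenever possible: here the batched fast run finished in about a minute of wall time, versus about six minutes for the slow one. 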
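
For the exact ratio, the two wall times printed above can be converted to seconds. Keep in mind these are the timings from a single Colab run, so your numbers will differ.

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds

fast_wall = to_seconds(1, 6)   # "Wall time: 1min 6s"
slow_wall = to_seconds(5, 59)  # "Wall time: 5min 59s"

print(f"{slow_wall / fast_wall:.1f}x")  # 5.4x
```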

### [Batch encoding](https://huggingface.co/course/chapter6/3?fw=pt#batch-encoding)

The output of a tokenizer is a ```BatchEncoding``` object which is a special subclass of a dictionary.

In short, it contains a lot more than just the tokens and their id mappings. How much more? Let's look at an example: 

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
example = "My name is Evan and I work at Engoo in Split."
encoding = tokenizer(example)
print(type(encoding))

<class 'transformers.tokenization_utils_base.BatchEncoding'>


```AutoTokenizer``` selects the fast tokenizer by default whenever one is available. Since we set the checkpoint to 'bert-base-cased' earlier, we _should_ be getting a fast tokenizer, but it's always good to double check. 


In [None]:
tokenizer.is_fast

True

Good! Now we can explore the capabilities of a fast tokenizer. 

To begin with, we can access the tokens without converting the IDs back to tokens: 

In [None]:
print(encoding.tokens())

['[CLS]', 'My', 'name', 'is', 'Evan', 'and', 'I', 'work', 'at', 'En', '##go', '##o', 'in', 'Split', '.', '[SEP]']


We can see that '##go' and '##o' are part of Engoo in the original sentence. 

If we want to get the index of each word, we can use the ```word_ids()``` method like so: 

In [None]:
encoding.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 8, 8, 9, 10, 11, None]

However, that might be a bit difficult to read so we can just zip the two list objects together like this: 

In [None]:
for e in zip(encoding.tokens(), encoding.word_ids()):
  print(e)

('[CLS]', None)
('My', 0)
('name', 1)
('is', 2)
('Evan', 3)
('and', 4)
('I', 5)
('work', 6)
('at', 7)
('En', 8)
('##go', 8)
('##o', 8)
('in', 9)
('Split', 10)
('.', 11)
('[SEP]', None)


Much easier to read 😀
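
Going one step further, the word IDs make it easy to collect the tokens that belong to the same word. This sketch works directly on the two lists printed above, so it runs without loading a tokenizer; the variable names are my own.

```python
from collections import defaultdict

# Copied from the outputs of encoding.tokens() and encoding.word_ids() above.
tokens = ['[CLS]', 'My', 'name', 'is', 'Evan', 'and', 'I', 'work', 'at',
          'En', '##go', '##o', 'in', 'Split', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 8, 8, 9, 10, 11, None]

tokens_by_word = defaultdict(list)
for token, word_id in zip(tokens, word_ids):
    if word_id is not None:  # special tokens have no word ID
        tokens_by_word[word_id].append(token)

# Words that were split into more than one token.
multi_token_words = {w: t for w, t in tokens_by_word.items() if len(t) > 1}
print(multi_token_words)  # {8: ['En', '##go', '##o']}
```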

Fast tokenizers also keep track of character offsets: ```word_to_chars()``` takes a word index and returns the start and end positions of that word in the original string, so getting the text back is just a slice. 

In [None]:
start, end = encoding.word_to_chars(8)
example[start:end]

'Engoo'

Note, the above method only works for selecting a single word. Each call returns its own (start, end) span, so using two of those spans as slice bounds throws an error:

In [None]:
start, end = encoding.word_to_chars(8), encoding.word_to_chars(10)
example[start:end]

TypeError: ignored
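
One way around this is to take the start of the first word and the end of the last. The sketch below hard-codes the two spans rather than calling the tokenizer, so it runs without downloading anything. I'm assuming ```word_to_chars(8)``` and ```word_to_chars(10)``` unpack to these (start, end) pairs, which matches the character positions of 'Engoo' and 'Split' in the sentence.

```python
example = "My name is Evan and I work at Engoo in Split."

# Assumed stand-ins for encoding.word_to_chars(8) and encoding.word_to_chars(10).
span_engoo = (30, 35)
span_split = (39, 44)

start, _ = span_engoo  # keep only where the first word starts
_, end = span_split    # keep only where the last word ends

print(example[start:end])  # Engoo in Split
```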

### [Inside the ```token-classification``` pipeline](https://huggingface.co/course/chapter6/3?fw=pt#inside-the-tokenclassification-pipeline)

The ```token-classification``` pipeline works the same way as the text classification pipeline: 

tokenization -> model -> postprocessing 

How can we combine all three steps into one? Simply use ```pipeline```.




In [None]:
# Base Model 
from transformers import pipeline

token_classifier = pipeline("token-classification")

example = "My name is Evan and I work at Engoo in Split."

for entity in token_classifier(example):
  print(entity)


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


{'entity': 'I-PER', 'score': 0.9990989, 'index': 4, 'word': 'Evan', 'start': 11, 'end': 15}
{'entity': 'I-ORG', 'score': 0.99787736, 'index': 9, 'word': 'En', 'start': 30, 'end': 32}
{'entity': 'I-ORG', 'score': 0.98045117, 'index': 10, 'word': '##go', 'start': 32, 'end': 34}
{'entity': 'I-ORG', 'score': 0.9792252, 'index': 11, 'word': '##o', 'start': 34, 'end': 35}
{'entity': 'I-LOC', 'score': 0.99367493, 'index': 13, 'word': 'Split', 'start': 39, 'end': 44}


Now, we can easily see that 'Engoo' has been broken up into three tokens but wouldn't it be better if the output printed this out for us? 

Well, to do so, we set the ```aggregation_strategy``` to 'simple' like this: 

In [None]:
token_classifier = pipeline("token-classification",
                            aggregation_strategy='simple')

example = "My name is Evan and I work at Engoo in Split."

for entity in token_classifier(example):
  print(entity)


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


{'entity_group': 'PER', 'score': 0.9990989, 'word': 'Evan', 'start': 11, 'end': 15}
{'entity_group': 'ORG', 'score': 0.9858512, 'word': 'Engoo', 'start': 30, 'end': 35}
{'entity_group': 'LOC', 'score': 0.99367493, 'word': 'Split', 'start': 39, 'end': 44}


Excellent!
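
To get a feel for what the grouping involves, here is a rough sketch that rebuilds the grouped result from the per-token predictions printed earlier. It merges tokens that sit next to each other and share a label, then averages their scores. This is my own approximation for this one example, not the pipeline's actual code, and ```group_entities``` is a made-up name.

```python
example = "My name is Evan and I work at Engoo in Split."

# Per-token predictions copied from the first pipeline output above.
token_entities = [
    {"entity": "I-PER", "score": 0.9990989, "index": 4, "start": 11, "end": 15},
    {"entity": "I-ORG", "score": 0.99787736, "index": 9, "start": 30, "end": 32},
    {"entity": "I-ORG", "score": 0.98045117, "index": 10, "start": 32, "end": 34},
    {"entity": "I-ORG", "score": 0.9792252, "index": 11, "start": 34, "end": 35},
    {"entity": "I-LOC", "score": 0.99367493, "index": 13, "start": 39, "end": 44},
]

def group_entities(entities, text):
    """Merge neighbouring tokens that share a label and average their scores."""
    groups = []
    for ent in entities:
        label = ent["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        last = groups[-1] if groups else None
        if last and last["label"] == label and ent["index"] == last["index"] + 1:
            # Same label and directly adjacent: extend the current group.
            last["scores"].append(ent["score"])
            last["end"] = ent["end"]
            last["index"] = ent["index"]
        else:
            groups.append({"label": label, "scores": [ent["score"]],
                           "start": ent["start"], "end": ent["end"],
                           "index": ent["index"]})
    return [
        {"entity_group": g["label"],
         "score": sum(g["scores"]) / len(g["scores"]),
         "word": text[g["start"]:g["end"]],
         "start": g["start"], "end": g["end"]}
        for g in groups
    ]

grouped = group_entities(token_entities, example)
for group in grouped:
    print(group["entity_group"], group["word"], round(group["score"], 4))
# PER Evan 0.9991
# ORG Engoo 0.9859
# LOC Split 0.9937
```

The averaged score for 'Engoo' lines up with the 0.9858512 reported by the pipeline, and the word is recovered by slicing the original string with the merged offsets.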

Now, if you wanted to do the same from scratch, you would follow the steps outlined [here](https://huggingface.co/course/chapter6/3?fw=pt#from-inputs-to-predictions).

# Fast tokenizers in the QA pipeline

Again, the QA pipeline follows the same steps as other pipelines, with the key difference being that the model produces two sets of logits: start logits and end logits. 

> _Why?_ 

Because we need to know where the answer begins (i.e., start logits) and finishes (i.e., end logits). 
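
Picking the answer is then a matter of finding the (start, end) pair with the highest combined probability, subject to the start not coming after the end. Here is a minimal sketch of that search in plain Python. The logits are made up for a five-token context, and ```softmax``` and ```best_span``` are helper names of my own, so treat it as an illustration rather than what the pipeline runs.

```python
import math

def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits):
    """Return (start, end, score) maximising P(start) * P(end) with start <= end."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best = (0, 0, 0.0)
    for s, p_start in enumerate(start_probs):
        for e in range(s, len(end_probs)):  # the end may not precede the start
            score = p_start * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Made-up logits for a five-token context (not real model output).
start_logits = [0.1, 2.0, 0.3, 4.0, 0.2]
end_logits = [0.0, 0.5, 3.5, 0.1, 3.0]

start, end, score = best_span(start_logits, end_logits)
print(start, end)  # 3 4
```

Taking the two argmaxes independently would give start 3 and end 2 here, which is not a valid span. That is why the pairs are scored jointly.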

However, we're just going to make use of the ```pipeline``` function to automate this process like this: 


In [None]:
from transformers import pipeline

question_answerer = pipeline("question-answering")

# Where to find the answer
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""

question = "Which deep learning libraries back 🤗 Transformers?"

question_answerer(question=question,
                  context=context)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'answer': 'Jax, PyTorch, and TensorFlow',
 'end': 106,
 'score': 0.9802599549293518,
 'start': 78}

BTW, something which makes the QA pipeline different from the others is its ability to handle contexts longer than the model's maximum length (typically 512 tokens) by splitting them into several chunks. 

For example, 

In [None]:
long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question_answerer(question=question, 
                  context=long_context)

{'answer': 'Jax, PyTorch and TensorFlow',
 'end': 1919,
 'score': 0.9714912176132202,
 'start': 1892}
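
The idea behind the splitting is to cut the context into overlapping windows, so that an answer falling on a boundary still appears whole in a neighbouring chunk. Here is a minimal sketch of that windowing on a toy list. The function and parameter names are my own, and the real pipeline also has to make room for the question and the special tokens, which this sketch ignores.

```python
def split_with_overlap(tokens, max_len, overlap):
    """Cut `tokens` into windows of at most max_len that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start : start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end
        start += max_len - overlap
    return chunks

context_tokens = list(range(10))  # stand-in for ten context token IDs
chunks = split_with_overlap(context_tokens, max_len=4, overlap=2)
for chunk in chunks:
    print(chunk)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```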

Again, if you want to dive into how to write these steps by hand rather than relying on pipelines, see [here](https://huggingface.co/course/chapter6/3b?fw=pt#using-a-model-for-question-answering).

# [Normalization and pre-tokenization](https://huggingface.co/course/chapter6/4?fw=pt#normalization-and-pretokenization)

Normalization is general cleanup applied to the raw text before it is split into tokens: depending on the tokenizer, extra whitespace and accents may be removed and the text may be lowercased. 

What does that look like? Something like this: 



In [None]:
from transformers import AutoTokenizer

checkpoint = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
example_text = "Héllò hôw are ü?"
print(tokenizer.backend_tokenizer.normalizer.normalize_str(example_text))


hello how are u?
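
For this particular input, the same result can be reproduced with nothing but the standard library: lowercase the text, decompose each accented letter into a base letter plus a combining mark, then drop the marks. This is only my approximation of what the normalizer does to this sentence, not the library's implementation, and the function name is made up.

```python
import unicodedata

def lowercase_and_strip_accents(text):
    """Lowercase `text` and drop combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(lowercase_and_strip_accents("Héllò hôw are ü?"))  # hello how are u?
```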


Did you notice our checkpoint above was uncased? 

What happens when we switch it out with one that is cased? 

In [None]:
checkpoint_cased = 'bert-base-cased'

tokenizer_cased = AutoTokenizer.from_pretrained(checkpoint_cased)

In [None]:
print("Uncased")
print(tokenizer.backend_tokenizer.normalizer.normalize_str(example_text))

print()
print()

print("Cased")
print(tokenizer_cased.backend_tokenizer.normalizer.normalize_str(example_text))


Uncased
hello how are u?


Cased
Héllò hôw are ü?


Needless to say, the differences are striking: the cased tokenizer leaves both the capitalization and the accents untouched. 

## Pre-Tokenization

Simply put, pre-tokenization is the process by which raw text is split into smaller entities. 

> _How?_

Well, that depends on the tokenizer: 

- BERT tokenizers split on whitespace and punctuation
- GPT-2 does the same but replaces the space before a token with a Ġ symbol
- T5 uses a ▁ symbol (which looks like an underscore) instead of a Ġ symbol and only splits on whitespace  

In [None]:
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")

In [None]:
print("BERT")
print(bert_tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?"))
print()
print("GPT-2")
print(gpt2_tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?"))
print()
print("T5")
print(t5_tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?"))


BERT
[('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]

GPT-2
[('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (6, 10)), ('Ġare', (10, 14)), ('Ġ', (14, 15)), ('Ġyou', (15, 19)), ('?', (19, 20))]

T5
[('▁Hello,', (0, 6)), ('▁how', (7, 10)), ('▁are', (11, 14)), ('▁you?', (16, 20))]
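
To make the BERT behaviour concrete, a single regular expression gets the same result on this sentence: runs of word characters become one piece and every other non-space character becomes its own piece. It is only a rough imitation under that assumption (for instance, the regex treats an underscore as a word character), and ```pre_tokenize_like_bert``` is a name I made up.

```python
import re

def pre_tokenize_like_bert(text):
    """Split on whitespace and punctuation, keeping character offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

print(pre_tokenize_like_bert("Hello, how are  you?"))
# [('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]
```

Note how the double space simply disappears, with the offsets jumping from 14 to 16, whereas GPT-2 keeps it as a separate 'Ġ' piece.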


# [Byte-Pair Encoding tokenization](https://huggingface.co/course/chapter6/5?fw=pt#bytepair-encoding-tokenization)

This algorithm is powerful yet simple. Why? Because all it does is: 
- split the text into words during pre-tokenization (e.g., on whitespace) 
- identify 
  - the characters which make up each word (the base vocabulary)
  - the most frequent pair of adjacent symbols across those words

Once it finds the most frequent pair, it merges that pair into a new token, adds the token to the vocabulary, and repeats the process on the updated words until the vocabulary reaches the desired size. For example: 

In [2]:
corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [4]:
from collections import defaultdict

word_freqs = defaultdict(int)

for text in corpus: 
  words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
  new_words = [word for word, offset in words_with_offsets]
  for word in new_words:
    word_freqs[word] += 1

print(word_freqs)

defaultdict(<class 'int'>, {'This': 3, 'Ġis': 2, 'Ġthe': 1, 'ĠHugging': 1, 'ĠFace': 1, 'Ġcourse': 1, '.': 4, 'Ġchapter': 1, 'Ġabout': 1, 'Ġtokenization': 1, 'Ġsection': 1, 'Ġshows': 1, 'Ġseveral': 1, 'Ġtokenizer': 1, 'Ġalgorithms': 1, 'Hopefully': 1, ',': 1, 'Ġyou': 1, 'Ġwill': 1, 'Ġbe': 1, 'Ġable': 1, 'Ġto': 1, 'Ġunderstand': 1, 'Ġhow': 1, 'Ġthey': 1, 'Ġare': 1, 'Ġtrained': 1, 'Ġand': 1, 'Ġgenerate': 1, 'Ġtokens': 1})


Now we identify the base vocabulary (i.e., all the characters) in the corpus: 

In [5]:
alphabet = []

for word in word_freqs.keys():
  for letter in word: 
    if letter not in alphabet:
      alphabet.append(letter)

alphabet.sort()

print(alphabet)

[',', '.', 'F', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ']


Don't forget that a model's tokenizer usually also has special tokens (for example, to mark the beginning or end of a sequence), and these need to be in the vocabulary too. For ```GPT-2```, the only special token is ```<|endoftext|>```:

In [6]:
vocab = ["<|endoftext|>"] + alphabet.copy()

Next we need to decompose the words into their individual characters in order to start training. 

In [9]:
splits = {word: [c for c in word] for word in word_freqs.keys()}

OK, here comes the fun part: we're going to write a function which computes the frequency of each pair. 

In [10]:
def compute_pair_freqs(splits):
  pair_freqs = defaultdict(int)
  for word, freq in word_freqs.items():
    split = splits[word]
    if len(split) == 1:
      continue
    for i in range(len(split) - 1):
      pair = (split[i], split[i+1])
      pair_freqs[pair] += freq
  return pair_freqs


In [11]:
pair_freqs = compute_pair_freqs(splits)

for i, key in enumerate(pair_freqs.keys()):
  print(f"{key}: {pair_freqs[key]}")
  if i >= 5:
    break

('T', 'h'): 3
('h', 'i'): 3
('i', 's'): 5
('Ġ', 'i'): 2
('Ġ', 't'): 7
('t', 'h'): 3


Now, to find the most frequent pair, we just call ```max``` on the dictionary items and tell it to compare them by their value (the frequency): 

In [14]:

max(pair_freqs.items(), key=lambda k: k[1])


(('Ġ', 't'), 7)
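
The natural next step is to merge that pair everywhere it occurs. Here is a small self-contained sketch of such a merge on a handful of the words from the corpus. The ```merge_pair``` helper is written for illustration only, and the course's implementation at the link below may differ in detail.

```python
# A few of the pre-tokenized words from above, split into characters.
words = ["Ġthe", "Ġtokenization", "Ġto", "This"]
splits = {word: list(word) for word in words}

def merge_pair(a, b, splits):
    """Return new splits where every adjacent (a, b) is fused into a + b."""
    merged = {}
    for word, symbols in splits.items():
        out = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # fuse the pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[word] = out
    return merged

best_pair = ("Ġ", "t")  # the most frequent pair found above
merges = {best_pair: "".join(best_pair)}  # remember the learned merge rule
splits = merge_pair(*best_pair, splits)

print(splits["Ġthe"])  # ['Ġt', 'h', 'e']
print(splits["This"])  # ['T', 'h', 'i', 's']
```

After the merge, 'Ġt' would be appended to the vocabulary and the pair frequencies recomputed on the updated splits before looking for the next pair.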

Continue from [here](https://huggingface.co/course/chapter6/5?fw=pt#implementing-bpe)