https://huggingface.co/learn/nlp-course/chapter6/4#normalization-and-pre-tokenization

Normalization is, in a nutshell, a set of operations you apply to a raw string to make it less random or “cleaner”. Common operations include stripping whitespace, removing accents, and lowercasing all text. Unicode normalization (such as NFD, used below) is also a very common normalization operation, applied in most tokenizers.

Each normalization operation is represented in the 🤗 Tokenizers library by a Normalizer, and you can combine several of those by using a normalizers.Sequence. Here is a normalizer applying NFD Unicode normalization and removing accents as an example:

### TODO Recording:

- Go to https://colab.research.google.com/
- Log in with your account, create a new notebook, and give it the name of this notebook
- Show that you are using the regular CPU runtime
- Set up a Hugging Face secret token to access HF from Colab (this is optional right now but will become compulsory later)
- Go to https://huggingface.co/ (you should already be logged in)
- Click on the account icon at the top-right -> go to Settings
- Click on Access Tokens from the left
- Create a new token with WRITE privileges
- Copy the token over and come to this notebook
- Click on the key on the left of the screen
- Add Secret_Name = `HF_TOKEN`
- Add Value = `<token generated>`
- Enable notebook access
- Now you can write code

In [None]:
from tokenizers import normalizers

from tokenizers.normalizers import NFD, StripAccents, Lowercase

We can manually test a normalizer by applying it to any string:

In [None]:
normalizer = normalizers.Sequence([Lowercase()])

normalizer.normalize_str("Café culture is prominent in many cities around the world.")

'café culture is prominent in many cities around the world.'

In [None]:
normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])

normalizer.normalize_str("Café culture is prominent in many cities around the world.")

'cafe culture is prominent in many cities around the world.'

In [None]:
normalizer.normalize_str("The protagonist had déjà vu when he entered the old mansion.")

'the protagonist had deja vu when he entered the old mansion.'

In [None]:
normalizer.normalize_str("Héllò hôw are ü?")

'hello how are u?'
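
To see what this sequence is doing, we can approximate it with the standard library alone. This is only a sketch: the helper name `strip_accents_lowercase` is made up for illustration, and treating accents as Unicode combining marks of category `Mn` is an assumption that may not match the 🤗 Tokenizers implementation in every edge case.

```python
import unicodedata


def strip_accents_lowercase(text: str) -> str:
    """Rough stdlib stand-in for Sequence([NFD(), StripAccents(), Lowercase()])."""
    # NFD splits each accented character into a base character
    # followed by one or more combining marks (for example, e + U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    # Assumption: dropping category "Mn" (nonspacing marks) removes the accents.
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_marks.lower()


print(strip_accents_lowercase("Héllò hôw are ü?"))
# hello how are u?
```

The order matters: without the NFD step, an accented letter is a single precomposed character, so there is no separate combining mark to strip.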

In [None]:
from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()

pre_tokenizer.pre_tokenize_str("She can't attend the meeting due to prior commitments.")

[('She', (0, 3)),
 ('can', (4, 7)),
 ("'", (7, 8)),
 ('t', (8, 9)),
 ('attend', (10, 16)),
 ('the', (17, 20)),
 ('meeting', (21, 28)),
 ('due', (29, 32)),
 ('to', (33, 35)),
 ('prior', (36, 41)),
 ('commitments', (42, 53)),
 ('.', (53, 54))]
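
The library documentation describes the Whitespace pre-tokenizer as splitting with the regex `\w+|[^\w\s]+`. The sketch below applies that same pattern with Python's `re` module to show where the pieces and offsets come from. The helper name `whitespace_pre_tokenize` is hypothetical, and Python's regex engine may treat some non-ASCII characters differently from the library's.

```python
import re


def whitespace_pre_tokenize(text):
    """Split into runs of word characters or runs of punctuation, with offsets."""
    # \w+      matches a run of letters, digits, or underscores
    # [^\w\s]+ matches a run of characters that are neither word chars nor spaces
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]+", text)]


pieces = whitespace_pre_tokenize(
    "She can't attend the meeting due to prior commitments."
)
print(pieces[:4])
# [('She', (0, 3)), ('can', (4, 7)), ("'", (7, 8)), ('t', (8, 9))]
```

Whitespace itself never appears in a piece. It only shows up as a gap between the end offset of one piece and the start offset of the next.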

We can combine any PreTokenizers together. For instance, here is a pre-tokenizer that splits on whitespace, punctuation, and digits. With individual_digits=False a number is kept as a single piece; with individual_digits=True it is separated into its individual digits:

### TODO Recording:

- First run the code with False, then change False -> True and run the code with True

In [None]:
from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Digits

pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=False)])

pre_tokenizer.pre_tokenize_str("I am calling you on 93457654")

[('I', (0, 1)),
 ('am', (2, 4)),
 ('calling', (5, 12)),
 ('you', (13, 16)),
 ('on', (17, 19)),
 ('93457654', (20, 28))]
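
The effect of the individual_digits flag can be sketched without the library. The function below is a hypothetical stand-in, not the library's code: it first splits with the Whitespace regex, then separates digits from non-digits inside each piece, either as whole runs or one digit at a time.

```python
import re


def pre_tokenize_with_digits(text, individual_digits=False):
    """Two-stage split: words/punctuation first, then digits within each piece."""
    # One digit per match, or a whole run of digits per match.
    digit_pattern = r"\d" if individual_digits else r"\d+"
    pieces = []
    for word in re.finditer(r"\w+|[^\w\s]+", text):
        for part in re.finditer(rf"{digit_pattern}|\D+", word.group()):
            # Shift the offset inside the word back to an offset in the full text.
            start = word.start() + part.start()
            pieces.append((part.group(), (start, start + len(part.group()))))
    return pieces


sentence = "I am calling you on 93457654"
print(pre_tokenize_with_digits(sentence)[-1])
# ('93457654', (20, 28))
print(pre_tokenize_with_digits(sentence, individual_digits=True)[-2:])
# [('5', (26, 27)), ('4', (27, 28))]
```

Splitting numbers into single digits keeps the vocabulary from filling up with arbitrary multi-digit strings, at the cost of longer token sequences.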

# Tokenizers used by different transformer models

### TODO Recording:

- Go to https://huggingface.co/
- Search for bert-base-uncased, show the model card
- Come back here to the code


"##" means that the rest of the token should be attached to the previous one, without space (for decoding or reversal of the tokenization).

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer.tokenize("I have a new SAMSUNG GLITE")

['i', 'have', 'a', 'new', 'samsung', 'g', '##lite']
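
The "##" convention is what makes this tokenization reversible at the word level. As a minimal sketch (the helper `merge_wordpieces` is made up for illustration and is not how the library decodes), we can glue continuation pieces back onto the preceding token:

```python
def merge_wordpieces(tokens):
    """Reattach '##' continuation pieces to the token that precedes them."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: append it, without the '##' marker and without a space.
            words[-1] += token[2:]
        else:
            # Start of a new word.
            words.append(token)
    return words


print(merge_wordpieces(['i', 'have', 'a', 'new', 'samsung', 'g', '##lite']))
# ['i', 'have', 'a', 'new', 'samsung', 'glite']
```

Note that the original casing is not recovered, because the uncased model lowercased the text during normalization.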

In [None]:
tokenizer.tokenize("Hello, y'all! How   are you 😁 ?")

['hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?']

The pre-tokenizer for BERT can be accessed through the tokenizer's backend_tokenizer attribute.

Notice how the pre-tokenizer already keeps track of the offsets, which is how it can give us the offset mapping we used in the previous section. Here the pre-tokenizer ignores the three consecutive spaces between How and are, but the offsets jump from 17 to 20 to account for them.

Since we’re using a BERT tokenizer, the pre-tokenization involves splitting on whitespace and punctuation.

In [None]:
tokenizer.backend_tokenizer

<tokenizers.Tokenizer at 0x57619b550a10>

In [None]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("I have a new SAMSUNG GLITE")

[('I', (0, 1)),
 ('have', (2, 6)),
 ('a', (7, 8)),
 ('new', (9, 12)),
 ('SAMSUNG', (13, 20)),
 ('GLITE', (21, 26))]

In [None]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, y'all! How   are you 😁 ?")

[('Hello', (0, 5)),
 (',', (5, 6)),
 ('y', (7, 8)),
 ("'", (8, 9)),
 ('all', (9, 12)),
 ('!', (12, 13)),
 ('How', (14, 17)),
 ('are', (20, 23)),
 ('you', (24, 27)),
 ('😁', (28, 29)),
 ('?', (30, 31))]
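
Each offset pair is a (start, end) slice into the original string, so we can check the output above by slicing. This is a small sanity check written for illustration, using a few of the pairs copied from that output:

```python
text = "Hello, y'all! How   are you 😁 ?"

# A few (piece, (start, end)) pairs copied from the pre-tokenizer output.
pieces = [("How", (14, 17)), ("are", (20, 23)), ("you", (24, 27))]

for piece, (start, end) in pieces:
    # Slicing the original text with the offsets gives back the piece.
    assert text[start:end] == piece

# The jump from 17 to 20 is exactly the run of three spaces that was skipped.
gap = text[17:20]
print(repr(gap))
# '   '
```

This is why the skipped spaces are not lost: anything between two consecutive pieces can still be read from the original text.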

In [None]:
tokenizer.backend_tokenizer.normalizer.normalize_str("The protagonist had déjà vu when he entered the old mansion.")

'the protagonist had deja vu when he entered the old mansion.'

In [None]:
tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?")

'hello how are u?'

Let's use the GPT-2 tokenizer.

It splits on whitespace and punctuation as well, but it keeps the spaces and replaces them with a Ġ symbol, which makes it possible to recover the original spaces when we decode the tokens.

Also note that, unlike the BERT tokenizer, this tokenizer does not ignore the extra spaces: the run of three spaces between How and are shows up as ĠĠ followed by Ġare.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokenizer.tokenize("I have a new SAMSUNG GLITE")

['I', 'Ġhave', 'Ġa', 'Ġnew', 'ĠSAM', 'S', 'UN', 'G', 'ĠGL', 'ITE']

In [None]:
tokenizer.tokenize("Hello, y'all! How   are you 😁 ?")

['Hello',
 ',',
 'Ġy',
 "'",
 'all',
 '!',
 'ĠHow',
 'Ġ',
 'Ġ',
 'Ġare',
 'Ġyou',
 'ĠðŁĺ',
 'ģ',
 'Ġ?']

In [None]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("I have a new SAMSUNG GLITE")

[('I', (0, 1)),
 ('Ġhave', (1, 6)),
 ('Ġa', (6, 8)),
 ('Ġnew', (8, 12)),
 ('ĠSAMSUNG', (12, 20)),
 ('ĠGLITE', (20, 26))]

In [None]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, y'all! How   are you 😁 ?")

[('Hello', (0, 5)),
 (',', (5, 6)),
 ('Ġy', (6, 8)),
 ("'", (8, 9)),
 ('all', (9, 12)),
 ('!', (12, 13)),
 ('ĠHow', (13, 17)),
 ('ĠĠ', (17, 19)),
 ('Ġare', (19, 23)),
 ('Ġyou', (23, 27)),
 ('ĠðŁĺģ', (27, 29)),
 ('Ġ?', (29, 31))]
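
The Ġ and the odd-looking ðŁĺģ come from GPT-2's byte-level scheme, where every UTF-8 byte is displayed as one printable character. The sketch below rebuilds that byte-to-character table in the style of the published GPT-2 encoder. The function names are ours, and it only illustrates the display mapping, not the BPE merges.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character."""
    # Bytes that are already printable keep their own character.
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    mapping = {b: chr(b) for b in printable}
    # All other bytes (space, control bytes, ...) are shifted past code point 255.
    n = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + n)
            n += 1
    return mapping


byte_map = bytes_to_unicode()


def to_byte_level(text):
    """Show a string the way a byte-level tokenizer displays it."""
    return "".join(byte_map[b] for b in text.encode("utf-8"))


print(to_byte_level(" How"))
# ĠHow
print(to_byte_level(" 😁"))
# ĠðŁĺģ
```

The emoji takes four bytes in UTF-8, so it is displayed as four symbols, even though the offsets (27, 29) show that the piece covers only two characters of the original text: the space and the emoji.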

Let's use the T5-small tokenizer (this uses the SentencePiece algorithm).

Like the GPT-2 tokenizer, this one keeps spaces and replaces them with a specific symbol (▁), but the T5 tokenizer only splits on whitespace, not punctuation. Also note that it adds a space by default at the beginning of the sentence (before Hello) and ignores the extra spaces between How and are.



In [None]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")

tokenizer.tokenize("I have a new SAMSUNG GLITE")

['▁I', '▁have', '▁', 'a', '▁new', '▁S', 'AMS', 'UNG', '▁', 'GL', 'ITE']

In [None]:
tokenizer.tokenize("Hello, y'all! How   are you 😁 ?")

['▁Hello',
 ',',
 '▁',
 'y',
 "'",
 'all',
 '!',
 '▁How',
 '▁are',
 '▁you',
 '▁',
 '😁',
 '▁',
 '?']

In [None]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("I have a new SAMSUNG GLITE")

[('▁I', (0, 1)),
 ('▁have', (2, 6)),
 ('▁a', (7, 8)),
 ('▁new', (9, 12)),
 ('▁SAMSUNG', (13, 20)),
 ('▁GLITE', (21, 26))]

In [None]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, y'all! How   are you 😁 ?")

[('▁Hello,', (0, 6)),
 ("▁y'all!", (7, 13)),
 ('▁How', (14, 17)),
 ('▁are', (20, 23)),
 ('▁you', (24, 27)),
 ('▁😁', (28, 29)),
 ('▁?', (30, 31))]
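
The pre-tokenization shown above can be mimicked with a few lines of standard-library code. This is a sketch under assumptions: `metaspace_pre_tokenize` is a made-up helper that reproduces the output format, not the library's implementation.

```python
import re


def metaspace_pre_tokenize(text):
    """Split on whitespace only and mark the start of each piece with '▁'."""
    # \S+ matches a run of non-space characters, so punctuation stays attached
    # to its word and runs of several spaces collapse into a single split.
    return [("▁" + m.group(), m.span()) for m in re.finditer(r"\S+", text)]


print(metaspace_pre_tokenize("Hello, y'all! How   are you 😁 ?")[:3])
# [('▁Hello,', (0, 6)), ("▁y'all!", (7, 13)), ('▁How', (14, 17))]
```

The offsets still point at the original text, so the ▁ marker is not counted in them. That is why ▁Hello, spans (0, 6) even though the piece has seven symbols.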