In [1]:
from transformers import AutoTokenizer



# Using Hugging Face Tokenizers

## Loading a Tokenizer

In this notebook, we'll explore Hugging Face's tokenizers by using a pretrained
model. Hugging Face has many tokenizers available that have already been trained
for specific models and tasks!

In [2]:
# Choose a pretrained tokenizer to use
my_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')



## Encoding: Text to Tokens

### Tokens: String Representations

In [4]:
# Simplest approach: split the text into token strings (no special tokens are added)
raw_text = '''Harry Potter and the Sorcerer's Stone (chapter 1)
CHAPTER ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.
They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. 
Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache.
Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. 
little tyke," chortled Mr.
'''
tokens = my_tokenizer.tokenize(raw_text)

print(tokens)

['Harry', 'Potter', 'and', 'the', 'So', '##rcerer', "'", 's', 'Stone', '(', 'chapter', '1', ')', 'CHAPTER', 'ONE', 'THE', 'B', '##O', '##Y', 'WHO', 'L', '##IVE', '##D', 'Mr', '.', 'and', 'Mrs', '.', 'Du', '##rs', '##ley', ',', 'of', 'number', 'four', ',', 'P', '##rive', '##t', 'Drive', ',', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', ',', 'thank', 'you', 'very', 'much', '.', 'They', 'were', 'the', 'last', 'people', 'you', "'", 'd', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', ',', 'because', 'they', 'just', 'didn', "'", 't', 'hold', 'with', 'such', 'nonsense', '.', 'Mr', '.', 'Du', '##rs', '##ley', 'was', 'the', 'director', 'of', 'a', 'firm', 'called', 'G', '##run', '##ning', '##s', ',', 'which', 'made', 'drill', '##s', '.', 'He', 'was', 'a', 'big', ',', 'beef', '##y', 'man', 'with', 'hardly', 'any', 'neck', ',', 'although', 'he', 'did', 'have', 'a', 'very', 'large', 'must', '##ache', '.', 'Mrs', '.', 'Du', '##rs', 

In [5]:
# Calling the tokenizer directly also adds special tokens (such as [CLS]), depending on the pretrained tokenizer
detailed_tokens = my_tokenizer(raw_text).tokens()

print(detailed_tokens)

['[CLS]', 'Harry', 'Potter', 'and', 'the', 'So', '##rcerer', "'", 's', 'Stone', '(', 'chapter', '1', ')', 'CHAPTER', 'ONE', 'THE', 'B', '##O', '##Y', 'WHO', 'L', '##IVE', '##D', 'Mr', '.', 'and', 'Mrs', '.', 'Du', '##rs', '##ley', ',', 'of', 'number', 'four', ',', 'P', '##rive', '##t', 'Drive', ',', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', ',', 'thank', 'you', 'very', 'much', '.', 'They', 'were', 'the', 'last', 'people', 'you', "'", 'd', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', ',', 'because', 'they', 'just', 'didn', "'", 't', 'hold', 'with', 'such', 'nonsense', '.', 'Mr', '.', 'Du', '##rs', '##ley', 'was', 'the', 'director', 'of', 'a', 'firm', 'called', 'G', '##run', '##ning', '##s', ',', 'which', 'made', 'drill', '##s', '.', 'He', 'was', 'a', 'big', ',', 'beef', '##y', 'man', 'with', 'hardly', 'any', 'neck', ',', 'although', 'he', 'did', 'have', 'a', 'very', 'large', 'must', '##ache', '.', 'Mrs', '.', 'Du',

### Tokens: Integer ID Representations

In [6]:
# Encode the text directly into integer token IDs
print(my_tokenizer.encode(raw_text))

[101, 3466, 11434, 1105, 1103, 1573, 25989, 112, 188, 4118, 113, 6073, 122, 114, 8203, 24497, 7462, 139, 2346, 3663, 23750, 149, 26140, 2137, 1828, 119, 1105, 2823, 119, 12786, 1733, 1926, 117, 1104, 1295, 1300, 117, 153, 17389, 1204, 6877, 117, 1127, 6884, 1106, 1474, 1115, 1152, 1127, 6150, 2999, 117, 6243, 1128, 1304, 1277, 119, 1220, 1127, 1103, 1314, 1234, 1128, 112, 173, 5363, 1106, 1129, 2017, 1107, 1625, 4020, 1137, 8198, 117, 1272, 1152, 1198, 1238, 112, 189, 2080, 1114, 1216, 17466, 119, 1828, 119, 12786, 1733, 1926, 1108, 1103, 1900, 1104, 170, 3016, 1270, 144, 10607, 3381, 1116, 117, 1134, 1189, 15227, 1116, 119, 1124, 1108, 170, 1992, 117, 14413, 1183, 1299, 1114, 6374, 1251, 2455, 117, 1780, 1119, 1225, 1138, 170, 1304, 1415, 1538, 12804, 119, 2823, 119, 12786, 1733, 1926, 1108, 4240, 1105, 9853, 1105, 1125, 2212, 3059, 1103, 4400, 2971, 1104, 2455, 117, 1134, 1338, 1107, 1304, 5616, 1112, 1131, 2097, 1177, 1277, 1104, 1123, 1159, 172, 23851, 2118, 1166, 4605, 25617, 117,

In [7]:
print(detailed_tokens)

# Tokenizer method to get the IDs if we already have the tokens as strings
detailed_ids = my_tokenizer.convert_tokens_to_ids(detailed_tokens)
print(detailed_ids)

['[CLS]', 'Harry', 'Potter', 'and', 'the', 'So', '##rcerer', "'", 's', 'Stone', '(', 'chapter', '1', ')', 'CHAPTER', 'ONE', 'THE', 'B', '##O', '##Y', 'WHO', 'L', '##IVE', '##D', 'Mr', '.', 'and', 'Mrs', '.', 'Du', '##rs', '##ley', ',', 'of', 'number', 'four', ',', 'P', '##rive', '##t', 'Drive', ',', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', ',', 'thank', 'you', 'very', 'much', '.', 'They', 'were', 'the', 'last', 'people', 'you', "'", 'd', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', ',', 'because', 'they', 'just', 'didn', "'", 't', 'hold', 'with', 'such', 'nonsense', '.', 'Mr', '.', 'Du', '##rs', '##ley', 'was', 'the', 'director', 'of', 'a', 'firm', 'called', 'G', '##run', '##ning', '##s', ',', 'which', 'made', 'drill', '##s', '.', 'He', 'was', 'a', 'big', ',', 'beef', '##y', 'man', 'with', 'hardly', 'any', 'neck', ',', 'although', 'he', 'did', 'have', 'a', 'very', 'large', 'must', '##ache', '.', 'Mrs', '.', 'Du',
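Conceptually, converting between token strings and IDs is a dictionary lookup against the tokenizer's vocabulary. The sketch below is a toy illustration, not the library's implementation: `toy_vocab` and the two `toy_convert_*` helpers are hypothetical names, the handful of IDs are copied from the outputs above, and the `[UNK]` ID of 100 is an assumption about `bert-base-cased`.

```python
# Toy vocabulary: a few entries copied from the notebook outputs.
# ASSUMPTION: 100 as the [UNK] id is not shown in this notebook.
toy_vocab = {
    "[UNK]": 100,
    "[CLS]": 101,
    "Harry": 3466,
    "Potter": 11434,
    "and": 1105,
    "the": 1103,
}

# Inverse mapping, used to go from IDs back to token strings.
toy_ids_to_tokens = {idx: tok for tok, idx in toy_vocab.items()}


def toy_convert_tokens_to_ids(tokens):
    """Look up each token, falling back to the [UNK] id when it is missing."""
    unk_id = toy_vocab["[UNK]"]
    return [toy_vocab.get(tok, unk_id) for tok in tokens]


def toy_convert_ids_to_tokens(token_ids):
    """Look up each id, falling back to the [UNK] string when it is missing."""
    return [toy_ids_to_tokens.get(i, "[UNK]") for i in token_ids]


toy_ids = toy_convert_tokens_to_ids(["[CLS]", "Harry", "Potter", "and", "the", "🥱"])
print(toy_ids)                             # [101, 3466, 11434, 1105, 1103, 100]
print(toy_convert_ids_to_tokens(toy_ids))  # the emoji comes back as '[UNK]'
```

Note that the round trip is lossy for anything outside the vocabulary: once a string has been mapped to the unknown ID, the original text cannot be recovered from the IDs alone.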

Calling the tokenizer directly can look a little more complex, but the extra
information it returns is useful when working with tokenizers for certain tasks.

In [8]:
# Returns an object that has a few different keys available
my_tokenizer(raw_text)

{'input_ids': [101, 3466, 11434, 1105, 1103, 1573, 25989, 112, 188, 4118, 113, 6073, 122, 114, 8203, 24497, 7462, 139, 2346, 3663, 23750, 149, 26140, 2137, 1828, 119, 1105, 2823, 119, 12786, 1733, 1926, 117, 1104, 1295, 1300, 117, 153, 17389, 1204, 6877, 117, 1127, 6884, 1106, 1474, 1115, 1152, 1127, 6150, 2999, 117, 6243, 1128, 1304, 1277, 119, 1220, 1127, 1103, 1314, 1234, 1128, 112, 173, 5363, 1106, 1129, 2017, 1107, 1625, 4020, 1137, 8198, 117, 1272, 1152, 1198, 1238, 112, 189, 2080, 1114, 1216, 17466, 119, 1828, 119, 12786, 1733, 1926, 1108, 1103, 1900, 1104, 170, 3016, 1270, 144, 10607, 3381, 1116, 117, 1134, 1189, 15227, 1116, 119, 1124, 1108, 170, 1992, 117, 14413, 1183, 1299, 1114, 6374, 1251, 2455, 117, 1780, 1119, 1225, 1138, 170, 1304, 1415, 1538, 12804, 119, 2823, 119, 12786, 1733, 1926, 1108, 4240, 1105, 9853, 1105, 1125, 2212, 3059, 1103, 4400, 2971, 1104, 2455, 117, 1134, 1338, 1107, 1304, 5616, 1112, 1131, 2097, 1177, 1277, 1104, 1123, 1159, 172, 23851, 2118, 1166, 460

In [9]:
# Focus on `input_ids`, the integer IDs associated with the tokens
print(my_tokenizer(raw_text).input_ids)

[101, 3466, 11434, 1105, 1103, 1573, 25989, 112, 188, 4118, 113, 6073, 122, 114, 8203, 24497, 7462, 139, 2346, 3663, 23750, 149, 26140, 2137, 1828, 119, 1105, 2823, 119, 12786, 1733, 1926, 117, 1104, 1295, 1300, 117, 153, 17389, 1204, 6877, 117, 1127, 6884, 1106, 1474, 1115, 1152, 1127, 6150, 2999, 117, 6243, 1128, 1304, 1277, 119, 1220, 1127, 1103, 1314, 1234, 1128, 112, 173, 5363, 1106, 1129, 2017, 1107, 1625, 4020, 1137, 8198, 117, 1272, 1152, 1198, 1238, 112, 189, 2080, 1114, 1216, 17466, 119, 1828, 119, 12786, 1733, 1926, 1108, 1103, 1900, 1104, 170, 3016, 1270, 144, 10607, 3381, 1116, 117, 1134, 1189, 15227, 1116, 119, 1124, 1108, 170, 1992, 117, 14413, 1183, 1299, 1114, 6374, 1251, 2455, 117, 1780, 1119, 1225, 1138, 170, 1304, 1415, 1538, 12804, 119, 2823, 119, 12786, 1733, 1926, 1108, 4240, 1105, 9853, 1105, 1125, 2212, 3059, 1103, 4400, 2971, 1104, 2455, 117, 1134, 1338, 1107, 1304, 5616, 1112, 1131, 2097, 1177, 1277, 1104, 1123, 1159, 172, 23851, 2118, 1166, 4605, 25617, 117,

## Decoding: Tokens to Text

We can of course use the tokenizer to go the other way: from token IDs back to tokens and to text!

In [10]:
# Integer IDs for tokens
ids = my_tokenizer.encode(raw_text)

# The inverse of the .encode() method: .decode()
my_tokenizer.decode(ids)

'[CLS] Harry Potter and the Sorcerer\'s Stone ( chapter 1 ) CHAPTER ONE THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you\'d expect to be involved in anything strange or mysterious, because they just didn\'t hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. little tyke, " chortled Mr. [SEP]'

In [11]:
# Drop special tokens such as [CLS] and [SEP] from the decoded text (which ones exist depends on the pretrained tokenizer)
my_tokenizer.decode(ids, skip_special_tokens=True)

'Harry Potter and the Sorcerer\'s Stone ( chapter 1 ) CHAPTER ONE THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you\'d expect to be involved in anything strange or mysterious, because they just didn\'t hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. little tyke, " chortled Mr.'

In [12]:
# List of tokens as strings instead of one long string
my_tokenizer.convert_ids_to_tokens(ids)

['[CLS]',
 'Harry',
 'Potter',
 'and',
 'the',
 'So',
 '##rcerer',
 "'",
 's',
 'Stone',
 '(',
 'chapter',
 '1',
 ')',
 'CHAPTER',
 'ONE',
 'THE',
 'B',
 '##O',
 '##Y',
 'WHO',
 'L',
 '##IVE',
 '##D',
 'Mr',
 '.',
 'and',
 'Mrs',
 '.',
 'Du',
 '##rs',
 '##ley',
 ',',
 'of',
 'number',
 'four',
 ',',
 'P',
 '##rive',
 '##t',
 'Drive',
 ',',
 'were',
 'proud',
 'to',
 'say',
 'that',
 'they',
 'were',
 'perfectly',
 'normal',
 ',',
 'thank',
 'you',
 'very',
 'much',
 '.',
 'They',
 'were',
 'the',
 'last',
 'people',
 'you',
 "'",
 'd',
 'expect',
 'to',
 'be',
 'involved',
 'in',
 'anything',
 'strange',
 'or',
 'mysterious',
 ',',
 'because',
 'they',
 'just',
 'didn',
 "'",
 't',
 'hold',
 'with',
 'such',
 'nonsense',
 '.',
 'Mr',
 '.',
 'Du',
 '##rs',
 '##ley',
 'was',
 'the',
 'director',
 'of',
 'a',
 'firm',
 'called',
 'G',
 '##run',
 '##ning',
 '##s',
 ',',
 'which',
 'made',
 'drill',
 '##s',
 '.',
 'He',
 'was',
 'a',
 'big',
 ',',
 'beef',
 '##y',
 'man',
 'with',
 'hardly'
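The `##` prefix in the token list marks a piece that continues the previous token, which is how decoding can glue `Du`, `##rs`, and `##ley` back into `Dursley`. Here is a minimal pure-Python sketch of that merging step. The helper name `merge_wordpieces` and its `special_tokens` default are hypothetical, and the sketch deliberately skips the spacing cleanup around punctuation that the real `.decode()` performs.

```python
def merge_wordpieces(tokens, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Join WordPiece-style tokens into a string (illustrative only).

    Pieces starting with '##' are appended to the previous word;
    special tokens are dropped, similar in spirit to skip_special_tokens=True.
    """
    words = []
    for token in tokens:
        if token in special_tokens:
            continue
        if token.startswith("##") and words:
            # Continuation piece: strip the marker and extend the last word.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


pieces = ["[CLS]", "Mr", ".", "Du", "##rs", "##ley", "made", "drill", "##s", ".", "[SEP]"]
print(merge_wordpieces(pieces))  # Mr . Dursley made drills .
```

The leftover spaces before each period show why a real decoder needs an extra cleanup pass on top of piece merging.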

## A Note on the Unknown

> One thing to consider is what happens when a string falls outside of the
> tokenizer's vocabulary. It is then mapped to an "unknown" token.
>
> Unknown tokens are typically represented with `[UNK]` or
> some similar variant.


<!--
If the tokenizer encoded the text so each character was a token (which is
actually not as easy as it sounds), then it would be impossible to have an
"unknown" token. Word-based tokenization will always be in danger of producing
"unknown" tokens since it's virtually impossible to have every possible word
(and "non-word") in its vocabulary!

And so you might think that subword tokenization wouldn't have an issue with
"unknown" tokens. Although they are rarer than with word-based tokenization,
they do still happen!

--------------------------------------------------------------------------------

Tokenizers are specific, so it's important to use a tokenizer that will recognize
most of the text you're working with! For example, a lot of tokenizers might not
treat emoji as tokens, but emoji could be really important if they are especially
numerous in your data (like a corpus of chat messages)!

If you're seeing a lot of "unknown" tokens in the text you're working with, you
might consider using a different tokenizer appropriate for the task. It's also
possible to fine-tune a pretrained model or train one from scratch!

-->
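To see where an unknown token can come from even with subword tokenization, here is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece tokenizers such as BERT's use on each word. The function `toy_wordpiece` and the tiny vocabulary `toy_pieces` are hypothetical and only for illustration; a real tokenizer also normalizes the text and splits it into words and punctuation first.

```python
def toy_wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into pieces.

    If any part of the word cannot be matched, the whole word
    becomes a single unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark continuation pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# ASSUMPTION: a made-up vocabulary, far smaller than any real one.
toy_pieces = {"w", "##ow", "ta", "##cos", "drill", "##s"}

for word in ["wow", "tacos", "drills", "💀"]:
    print(word, toy_wordpiece(word, toy_pieces))
```

With this toy vocabulary, `wow` splits into `w` and `##ow`, while the emoji matches nothing at all and falls back to `[UNK]`.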

In [13]:
phrase = '🥱 the dog next door kept barking all night!!'
ids = my_tokenizer.encode(phrase)
print(phrase)
print(my_tokenizer.convert_ids_to_tokens(ids))
print(my_tokenizer.decode(ids))

🥱 the dog next door kept barking all night!!
['[CLS]', '[UNK]', 'the', 'dog', 'next', 'door', 'kept', 'barking', 'all', 'night', '!', '!', '[SEP]']
[CLS] [UNK] the dog next door kept barking all night!! [SEP]


In [14]:
phrase = '''wow my dad thought mcdonalds sold tacos \N{SKULL}'''
ids = my_tokenizer.encode(phrase)
print(phrase)
print(my_tokenizer.convert_ids_to_tokens(ids))
print(my_tokenizer.decode(ids))

wow my dad thought mcdonalds sold tacos 💀
['[CLS]', 'w', '##ow', 'my', 'dad', 'thought', 'm', '##c', '##don', '##ald', '##s', 'sold', 'ta', '##cos', '[UNK]', '[SEP]']
[CLS] wow my dad thought mcdonalds sold tacos [UNK] [SEP]
