# Tokenizers (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## 1. Word tokenization

Take a shot at tokenizing the text below by splitting it up by **word**. You should only need to call one standard Python method.

In [None]:
original_text = "Jim Henson was a puppeteer"

tokenized_text = "" # ... tokenize it here!

print(tokenized_text)  # Should print ['Jim', 'Henson', 'was', 'a', 'puppeteer']
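
If you get stuck, here is one possible solution, a minimal sketch that assumes the intended "one standard Python method" is `str.split`, which splits on runs of whitespace when called with no arguments:

```python
original_text = "Jim Henson was a puppeteer"

# str.split() with no arguments splits on any run of whitespace
tokenized_text = original_text.split()

print(tokenized_text)  # ['Jim', 'Henson', 'was', 'a', 'puppeteer']
```

Note that this naive approach leaves punctuation attached to the neighbouring word, which is one reason real tokenizers do more than split on spaces.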




## 2. Character tokenization

Do the same thing, but split the string up by **character**.

In [None]:
original_text = "Jim Henson was a puppeteer"

tokenized_text = "" # ... tokenize it here!

print(tokenized_text)
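
One possible solution, again only a sketch: passing a string to `list` yields one element per character, spaces included.

```python
original_text = "Jim Henson was a puppeteer"

# list() iterates over the string one character at a time
tokenized_text = list(original_text)

print(tokenized_text[:5])   # ['J', 'i', 'm', ' ', 'H']
print(len(tokenized_text))  # 26
```

Character tokenization gives a tiny vocabulary but long sequences: 26 tokens here versus 5 for the word-level split.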




## 3. Subword tokenization

You can load any Transformer model's tokenizer by calling `from_pretrained` on its architecture's tokenizer class, named `XTokenizer` (for example, `BertTokenizer` for BERT).

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

...but it's WAY more convenient to just use the `AutoTokenizer` class, which looks at the model's `config.json` to work out which tokenizer class to load.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

We can use the tokenizer by just calling it on the string we want to tokenize. Run the cell! What are you getting out of it? Try passing more than one string to the tokenizer, each as a separate argument. What do you think `token_type_ids` is? (See https://huggingface.co/docs/transformers/glossary#token-type-ids for an explanation!)

In [None]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
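
To build some intuition for `token_type_ids` without downloading anything, here is a plain-Python sketch of how a BERT-style tokenizer lays out a pair of sequences. The helper `toy_pair_inputs` is hypothetical and is not part of Transformers. The IDs 101 and 102 are the `[CLS]` and `[SEP]` special tokens in the `bert-base-cased` vocabulary, and the other IDs are copied from the output above.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in bert-base-cased

def toy_pair_inputs(ids_a, ids_b):
    """Hypothetical helper: mimic the layout [CLS] A [SEP] B [SEP]."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the last [SEP]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

ids_a = [7993, 170, 13809, 23763, 2443, 1110, 3014]  # "Using a Transformer network is simple"
ids_b = [1110, 3014]                                 # "is simple"

pair_input_ids, pair_token_type_ids = toy_pair_inputs(ids_a, ids_b)
print(pair_input_ids)
print(pair_token_type_ids)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```

With a single sequence there is no second segment, which is why every entry of `token_type_ids` in the output above is 0.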

To illustrate the encoding process, it's helpful to look at the steps of the tokenizer. First, let's run `.tokenize()` on a sequence. What do you get out of it?

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
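
The `##` prefix marks a piece that continues the previous word. As a rough illustration of where such pieces come from, here is a simplified greedy longest-match-first splitter in the spirit of WordPiece. It is a sketch, not the library's implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical, the vocabulary contains only the seven tokens printed above, and punctuation handling is skipped.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (assumption): just the tokens from the output above
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}

sequence = "Using a Transformer network is simple"
toy_tokens = [p for w in sequence.split() for p in wordpiece_tokenize(w, toy_vocab)]
print(toy_tokens)  # ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
```

A word with no matching pieces in this toy vocabulary, such as `"puppeteer"`, comes back as `['[UNK]']`.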


Then, we can call `.convert_tokens_to_ids()` on the `tokens`.

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
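
Conceptually this step is a dictionary lookup. The sketch below rebuilds it with a toy vocabulary whose entries pair the tokens and IDs printed above (`toy_vocab` is an illustration, not the real vocabulary file). It also shows why calling the tokenizer directly returned two more IDs: 101 and 102 are the `[CLS]` and `[SEP]` special tokens in `bert-base-cased`, which `.convert_tokens_to_ids()` does not add.

```python
# Toy vocabulary: token -> ID, copied from the two outputs above
toy_vocab = {
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014,
}
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in bert-base-cased

tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]

# Converting tokens to IDs is one lookup per token
toy_ids = [toy_vocab[token] for token in tokens]
print(toy_ids)  # [7993, 170, 13809, 23763, 2443, 1110, 3014]

# Wrapping the IDs in special tokens reproduces the earlier input_ids
with_special = [CLS_ID] + toy_ids + [SEP_ID]
print(with_special)  # [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102]
```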


We can also go backwards from the IDs to the original sentence with `.decode()`. Mess around a bit with the original sequence – is the encoding-decoding process lossless?

In [None]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

Using a Transformer network is simple
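
As a hint for the lossless question, here is a plain-Python sketch of the merging step that decoding has to perform, plus one case where information is lost. `join_wordpieces` is a hypothetical helper, and splitting on whitespace stands in for the real tokenizer's pre-processing, so treat it as an analogy rather than the library's exact behaviour.

```python
def join_wordpieces(tokens):
    """Glue ## continuation pieces onto the previous word, then join with spaces."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
print(join_wordpieces(tokens))  # Using a Transformer network is simple

# Extra spaces vanish once the text is split into tokens, so they cannot come back
spaced = "Using  a   Transformer"
round_trip = join_wordpieces(spaced.split())
print(round_trip == spaced)  # False
```

Other things worth trying with the real tokenizer include unusual characters and words it has never seen.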


And lastly, the tokenizer takes a bunch of optional parameters. For example, you can tell the tokenizer to truncate, pad, and return the encodings as PyTorch tensors, NumPy arrays, or TensorFlow tensors. Note that we'll be using PyTorch models, so we'll pass `return_tensors="pt"`.

In [None]:
tokenizer(["Hello world, what's up!"], return_tensors="pt")

{'input_ids': tensor([[ 101, 8667, 1362,  117, 1184,  112,  188, 1146,  106,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
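
Padding and truncation matter because a tensor needs rows of equal length. The sketch below imitates the idea in plain Python so it runs without a model download. `toy_batch` is a hypothetical helper, not the Transformers API (the real call takes arguments such as `padding`, `truncation`, and `max_length`). The token IDs are copied from the outputs in this notebook, and 0 is the `[PAD]` ID in `bert-base-cased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-cased

def toy_batch(id_lists, max_length=None):
    """Hypothetical helper: truncate, add special tokens, pad to the longest row."""
    rows = []
    for ids in id_lists:
        if max_length is not None:
            ids = ids[: max_length - 2]  # leave room for [CLS] and [SEP]
        rows.append([CLS_ID] + ids + [SEP_ID])
    longest = max(len(row) for row in rows)
    input_ids = [row + [PAD_ID] * (longest - len(row)) for row in rows]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(row) + [0] * (longest - len(row)) for row in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

first = [7993, 170, 13809, 23763, 2443, 1110, 3014]    # "Using a Transformer network is simple"
second = [8667, 1362, 117, 1184, 112, 188, 1146, 106]  # "Hello world, what's up!"

batch = toy_batch([first, second])
print(batch["input_ids"][0])       # [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

short = toy_batch([first, second], max_length=6)
print(short["input_ids"][0])       # [101, 7993, 170, 13809, 23763, 102]
```

The shorter row gets one `[PAD]` ID and a matching 0 in its attention mask, while truncation keeps the special tokens and drops tokens from the end of the sequence.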