
In [1]:
%%capture
!pip install "transformers[sentencepiece]"

## BERT Tokenization

In [2]:
from transformers import AutoTokenizer

In [40]:
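# Load the WordPiece tokenizer that matches the bert-base-uncased checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Step 1: split the text into subword tokens ("##" marks a word continuation).
tokens = tokenizer.tokenize("Let's try to tokenize and call it tokenization!")
print(tokens)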

['let', "'", 's', 'try', 'to', 'token', '##ize', 'and', 'call', 'it', 'token', '##ization', '!']
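
The `token` / `##ize` split above comes from WordPiece's greedy longest-match-first lookup. Below is a minimal sketch of that lookup for a single word. The function name `wordpiece` and the tiny `toy_vocab` are hypothetical and exist only for illustration; the real tokenizer also normalizes and pre-tokenizes the text and uses the full bert-base-uncased vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece-style tokens by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {"token", "##ize", "##ization", "try", "call"}

print(wordpiece("tokenize", toy_vocab))      # ['token', '##ize']
print(wordpiece("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```

Because the longest match wins at every position, `tokenization` is split into `token` + `##ization` rather than into several shorter pieces.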


In [41]:
# Step 2: map each token to its vocabulary ID.
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

[2292, 1005, 1055, 3046, 2000, 19204, 4697, 1998, 2655, 2009, 19204, 3989, 999]


In [42]:
# Step 3: add the special tokens the model expects ([CLS] and [SEP] for BERT).
final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs["input_ids"])

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 1998, 2655, 2009, 19204, 3989, 999, 102]
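
For a single sentence, the only visible effect of `prepare_for_model` here is that ID 101 is prepended and ID 102 is appended. The sketch below reproduces just that wrapping, with the IDs copied from the outputs above. The helper name `add_special_tokens` is hypothetical, and the real method also handles sentence pairs, truncation, and padding, which this sketch ignores.

```python
def add_special_tokens(ids, start_id, end_id):
    """Wrap a single sequence of token IDs with a start and an end ID."""
    return [start_id] + list(ids) + [end_id]


# IDs printed by convert_tokens_to_ids for bert-base-uncased.
bert_ids = [2292, 1005, 1055, 3046, 2000, 19204, 4697,
            1998, 2655, 2009, 19204, 3989, 999]

# 101 and 102 are the IDs that surround the sequence in the output above.
wrapped = add_special_tokens(bert_ids, start_id=101, end_id=102)
print(wrapped)
```

The same pattern applies to other checkpoints with different IDs, so the two special-token IDs are passed in rather than hard-coded.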


In [43]:
# Calling the tokenizer directly runs all three steps at once; decode() maps the IDs back to text.
inputs = tokenizer("Let's try to tokenize and call it tokenization!")
print(tokenizer.decode(inputs["input_ids"]))

[CLS] let's try to tokenize and call it tokenization! [SEP]


In [44]:
# The full encoding also carries token_type_ids and attention_mask.
print(inputs)

{'input_ids': [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 1998, 2655, 2009, 19204, 3989, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
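
With one unpadded sentence, `token_type_ids` is all zeros (everything belongs to the first segment) and `attention_mask` is all ones (every position is a real token). The sketch below rebuilds those two lists from `input_ids` and shows how the mask changes once padding is added. The helper `build_features` and its `max_length` parameter are hypothetical; `pad_id=0` is assumed to be the `[PAD]` ID for bert-base-uncased.

```python
def build_features(input_ids, max_length=None, pad_id=0):
    """Build single-sequence features, optionally right-padded to max_length."""
    ids = list(input_ids)
    mask = [1] * len(ids)  # 1 = real token
    if max_length is not None and len(ids) < max_length:
        n_pad = max_length - len(ids)
        ids += [pad_id] * n_pad
        mask += [0] * n_pad  # 0 = padding, to be ignored by attention
    return {
        "input_ids": ids,
        "token_type_ids": [0] * len(ids),  # single segment only
        "attention_mask": mask,
    }


bert_input_ids = [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697,
                  1998, 2655, 2009, 19204, 3989, 999, 102]

features = build_features(bert_input_ids)
padded = build_features(bert_input_ids, max_length=18)

print(features["attention_mask"])
print(padded["input_ids"][-3:], padded["attention_mask"][-3:])
```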


## ALBERT Tokenization

In [4]:
from transformers import AutoTokenizer

In [31]:
# ALBERT uses a SentencePiece tokenizer; "▁" marks the start of a new word.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
tokens = tokenizer.tokenize("Let's try to tokenize and call it tokenization!")
print(tokens)

['▁let', "'", 's', '▁try', '▁to', '▁to', 'ken', 'ize', '▁and', '▁call', '▁it', '▁to', 'ken', 'ization', '!']
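
Unlike WordPiece, which marks continuations with `##`, SentencePiece marks word starts with `▁`. A minimal sketch of what that implies, using the token list printed above: the pieces can be grouped back into words, or joined into text, with plain string operations. The helper names `group_by_word` and `join_sentencepiece` are hypothetical and are not part of the library.

```python
albert_tokens = ['▁let', "'", 's', '▁try', '▁to', '▁to', 'ken', 'ize',
                 '▁and', '▁call', '▁it', '▁to', 'ken', 'ization', '!']


def group_by_word(pieces, marker="▁"):
    """Group pieces so that each group starts at a word-start marker."""
    words = []
    for piece in pieces:
        if piece.startswith(marker) or not words:
            words.append([piece])
        else:
            words[-1].append(piece)
    return words


def join_sentencepiece(pieces, marker="▁"):
    """Concatenate pieces and turn each marker back into a space."""
    return "".join(pieces).replace(marker, " ").strip()


print(group_by_word(albert_tokens)[3])    # ['▁to', 'ken', 'ize']
print(join_sentencepiece(albert_tokens))  # let's try to tokenize and call it tokenization!
```

The fourth group shows that this vocabulary splits "tokenize" into three pieces, where BERT's vocabulary needed two.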


In [32]:
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

[408, 22, 18, 1131, 20, 20, 2853, 2952, 17, 645, 32, 20, 2853, 1829, 187]


In [33]:
# ALBERT also wraps the sequence in [CLS] ... [SEP], but with different IDs than BERT.
final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs["input_ids"])

You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[2, 408, 22, 18, 1131, 20, 20, 2853, 2952, 17, 645, 32, 20, 2853, 1829, 187, 3]


In [34]:
inputs = tokenizer("Let's try to tokenize and call it tokenization!")
print(tokenizer.decode(inputs["input_ids"]))

[CLS] let's try to tokenize and call it tokenization![SEP]


## RoBERTa Tokenization

In [26]:
from transformers import AutoTokenizer

In [36]:
# RoBERTa uses byte-level BPE; "Ġ" marks a token preceded by a space, and case is preserved.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
tokens = tokenizer.tokenize("Let's try to tokenize and call it tokenization!")
print(tokens)

['Let', "'s", 'Ġtry', 'Ġto', 'Ġtoken', 'ize', 'Ġand', 'Ġcall', 'Ġit', 'Ġtoken', 'ization', '!']
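
The `Ġ` prefix is not an arbitrary marker: in GPT-2-style byte-level BPE, every byte is first mapped to a printable character, and the space byte ends up as `Ġ`. The sketch below is a re-implementation of such a byte-to-character table for illustration, not the library's own code. It assumes RoBERTa uses the same mapping as GPT-2, which is consistent with the output above.

```python
def bytes_to_unicode():
    """Map every byte value 0..255 to a printable unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (including the space, 0x20) are moved past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


table = bytes_to_unicode()
space_marker = table[ord(" ")]
print(space_marker)  # Ġ

roberta_tokens = ['Let', "'s", 'Ġtry', 'Ġto', 'Ġtoken', 'ize',
                  'Ġand', 'Ġcall', 'Ġit', 'Ġtoken', 'ization', '!']

# Undo the mapping for the space byte to recover the original text.
text = "".join(roberta_tokens).replace(space_marker, " ")
print(text)  # Let's try to tokenize and call it tokenization!
```

Because the space is part of the token that follows it, `Ġtoken` (after a space) and `token` (at the start of a string) are different vocabulary entries.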


In [37]:
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

[7939, 18, 860, 7, 19233, 2072, 8, 486, 24, 19233, 1938, 328]


In [38]:
# RoBERTa wraps the sequence in <s> ... </s> instead of [CLS] ... [SEP].
final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs["input_ids"])

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[0, 7939, 18, 860, 7, 19233, 2072, 8, 486, 24, 19233, 1938, 328, 2]


In [39]:
inputs = tokenizer("Let's try to tokenize and call it tokenization!")
print(tokenizer.decode(inputs["input_ids"]))

<s>Let's try to tokenize and call it tokenization!</s>
