# Tokenizer Example
Ashok Kumar Pant

In [1]:
from transformers import AutoTokenizer

# Load a pretrained tokenizer (e.g., BERT tokenizer)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers convert words to numbers!"

# Tokenize the input
tokens = tokenizer(text)

print("Input IDs:", tokens["input_ids"])
print("Token Type IDs:", tokens["token_type_ids"])
print("Attention Mask:", tokens["attention_mask"])
print("Tokens:", tokenizer.convert_ids_to_tokens(tokens["input_ids"]))

Input IDs: [101, 19204, 17629, 2015, 10463, 2616, 2000, 3616, 999, 102]
Token Type IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Tokens: ['[CLS]', 'token', '##izer', '##s', 'convert', 'words', 'to', 'numbers', '!', '[SEP]']


## Explanation of the Output
#### Tokens: ['[CLS]', 'token', '##izer', '##s', 'convert', 'words', 'to', 'numbers', '!', '[SEP]']
These are the subword tokens produced by the tokenizer (here, BERT's WordPiece tokenizer).
- '[CLS]': special classification token added at the beginning of the input (used by BERT for classification tasks).
- 'token', '##izer', '##s': BERT uses WordPiece tokenization, which splits a word that is not in its vocabulary into smaller subwords that are.
  - The input was lowercased first (this is the uncased model), so 'Tokenizers' became 'tokenizers', which was then split into ['token', '##izer', '##s'].
  - The ## prefix means "this piece continues the previous token"; it attaches to that token without a space.
- 'convert', 'words', 'to', 'numbers', '!': each of these exists in the vocabulary as a whole, so it is kept as a single token.
- '[SEP]': special separator token that marks the end of a sentence or separates two sentences.
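
To make the subword split concrete, here is a minimal sketch of the greedy longest-match-first idea behind WordPiece. It is an illustration only, not the library's implementation: `TOY_VOCAB` and `wordpiece_split` are hypothetical names, the vocabulary holds just enough entries to reproduce the split above, and the real tokenizer also normalizes text and separates punctuation before this step.

```python
# Hypothetical toy vocabulary: only the pieces needed for this example.
TOY_VOCAB = {"token", "##izer", "##s", "convert", "words", "to", "numbers", "!"}


def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lowercased word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the previous piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece_split("tokenizers", TOY_VOCAB))  # ['token', '##izer', '##s']
print(wordpiece_split("convert", TOY_VOCAB))     # ['convert']
print(wordpiece_split("xyz", TOY_VOCAB))         # ['[UNK]']
```

With the real `bert-base-uncased` vocabulary (roughly 30,000 entries) the same procedure rarely needs `[UNK]`, because most words can be covered by some combination of pieces.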


#### Input IDs: [101, 19204, 17629, 2015, 10463, 2616, 2000, 3616, 999, 102]

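These are the integer IDs of the tokens listed above, in the same order. The tokenizer looks up each token in its vocabulary, which maps every token to a unique integer.
- 101 is the ID for [CLS]
- 102 is the ID for [SEP]
- The remaining IDs map one-to-one to the other tokens: 19204 is 'token', 17629 is '##izer', and so on. These values are specific to the bert-base-uncased vocabulary.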
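
The token-to-ID mapping is a plain dictionary lookup. The sketch below rebuilds a small slice of the vocabulary from the output printed in this notebook; `token_to_id`, `encode`, and `decode` are hypothetical helper names used for illustration, not the tokenizer's API.

```python
# Tokens and IDs copied from the notebook output above.
tokens = ['[CLS]', 'token', '##izer', '##s', 'convert', 'words', 'to', 'numbers', '!', '[SEP]']
ids = [101, 19204, 17629, 2015, 10463, 2616, 2000, 3616, 999, 102]

# Forward and reverse lookup tables for this slice of the vocabulary.
token_to_id = dict(zip(tokens, ids))
id_to_token = {i: t for t, i in token_to_id.items()}


def encode(token_list):
    """Map each token to its integer ID."""
    return [token_to_id[t] for t in token_list]


def decode(id_list):
    """Map each integer ID back to its token."""
    return [id_to_token[i] for i in id_list]


print(encode(['[CLS]', 'to', '[SEP]']))  # [101, 2000, 102]
print(decode([19204, 17629, 2015]))      # ['token', '##izer', '##s']
```

In the library itself, `tokenizer.convert_ids_to_tokens` (used in the cell above) performs the reverse lookup over the full vocabulary.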

#### Token Type IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

These segment IDs tell the model which sequence each token belongs to in sentence-pair tasks such as question answering. Since this input is a single sentence, all values are 0.

If you pass two segments, for example:
tokenizer("Question?", "Answer.")
you get 0s for the first segment (including [CLS] and the first [SEP]) and 1s for the second segment (including the final [SEP]).
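
As a rough sketch of how the segment IDs line up with BERT's pair layout `[CLS] A [SEP] B [SEP]`, the hypothetical helper below assembles the token list and the matching token type IDs from two already-tokenized segments. It mimics the layout only and is not how the library builds its outputs internally.

```python
def build_pair_inputs(tokens_a, tokens_b=None):
    """Return (tokens, token_type_ids) for one segment or a pair of segments."""
    # First segment: [CLS] + A + [SEP], all marked 0.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        # Second segment: B + [SEP], all marked 1.
        tokens = tokens + tokens_b + ["[SEP]"]
        type_ids = type_ids + [1] * (len(tokens_b) + 1)
    return tokens, type_ids


tokens, type_ids = build_pair_inputs(["question", "?"], ["answer", "."])
print(tokens)    # ['[CLS]', 'question', '?', '[SEP]', 'answer', '.', '[SEP]']
print(type_ids)  # [0, 0, 0, 0, 1, 1, 1]
```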

#### Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

This tells the model which tokens should be attended to (i.e., not padding).
- 1 = real token (should be processed)
- 0 = padding (ignore this token)

Here, since there is no padding, every token is marked 1.
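
Zeros only show up in the mask once sequences of different lengths are padded to a common length. The sketch below pads a small batch by hand to show the idea; `pad_batch` is a hypothetical helper, the IDs are taken from this notebook's outputs, and the padding ID of 0 assumes the `[PAD]` entry of the bert-base-uncased vocabulary.

```python
PAD_ID = 0  # assumed ID of [PAD] in bert-base-uncased


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in batch_ids)
    padded, masks = [], []
    for seq in batch_ids:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, masks


batch = [
    [101, 19204, 17629, 2015, 10463, 2616, 2000, 3616, 999, 102],  # 10 tokens
    [101, 2023, 2003, 3835, 1012, 102],                            # 6 tokens
]
padded, masks = pad_batch(batch)
print(padded[1])  # [101, 2023, 2003, 3835, 1012, 102, 0, 0, 0, 0]
print(masks[1])   # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

In practice the tokenizer can do this itself when called on a list of texts with `padding=True`, returning the padded `input_ids` together with the matching `attention_mask`.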

In [6]:
seq1 = "Tokenizers convert words to numbers."
seq2 = "This is going very nice."

tokens = tokenizer(seq1, seq2)

print("Input IDs:", tokens["input_ids"])
print("Token Type IDs:", tokens["token_type_ids"])
print("Attention Mask:", tokens["attention_mask"])
print("Tokens:", tokenizer.convert_ids_to_tokens(tokens["input_ids"]))

Input IDs: [101, 19204, 17629, 2015, 10463, 2616, 2000, 3616, 1012, 102, 2023, 2003, 2183, 2200, 3835, 1012, 102]
Token Type IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Tokens: ['[CLS]', 'token', '##izer', '##s', 'convert', 'words', 'to', 'numbers', '.', '[SEP]', 'this', 'is', 'going', 'very', 'nice', '.', '[SEP]']
