### What is tokenization?

- Tokenization is the process of breaking a sequence of text (such as a sentence or document) into smaller units called "tokens."
- These tokens can be words, subwords, characters, or sentences, depending on the level of granularity chosen.

### Types of Tokenization:

- Word-level Tokenization:
    - Breaks text into individual words.
    - Example: "The cat sat on the mat." → ["The", "cat", "sat", "on", "the", "mat"] (punctuation dropped here; some tokenizers keep "." as its own token)

- Subword Tokenization:
    - Breaks words into smaller units; used by transformer-based models such as BERT and GPT.
    - Example: "unhappiness" → ["un", "happiness"] or ["un", "##happi", "##ness"] (where a special marker like ## flags a piece that continues the previous one).

- Character-level Tokenization:
    - Breaks text into individual characters.
    - Example: "The cat" → ["T", "h", "e", " ", "c", "a", "t"]

- Sentence-level Tokenization:
    - Splits text into individual sentences.
    - Example: "I love NLP. It's fascinating!" → ["I love NLP.", "It's fascinating!"]
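
The word-, character-, and sentence-level examples above can be reproduced with a few lines of plain Python. This is only a rough sketch built on regular expressions: the function names are made up for illustration, and real tokenizers handle many more edge cases (abbreviations, hyphens, Unicode punctuation).

```python
import re

def word_tokenize(text):
    # Keep runs of word characters (with internal apostrophes); punctuation is dropped.
    return re.findall(r"\w+(?:'\w+)*", text)

def char_tokenize(text):
    # Every character, including spaces, becomes its own token.
    return list(text)

def sentence_tokenize(text):
    # Split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize("The cat sat on the mat."))
print(char_tokenize("The cat"))
print(sentence_tokenize("I love NLP. It's fascinating!"))
```

The sentence splitter is deliberately naive: it would also split after an abbreviation such as "Dr.", which is why dedicated sentence tokenizers exist.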

### Tokenization in Different Models:
- Bag-of-Words Models: Use word-level tokenization.

- Transformer Models (BERT, GPT, etc.): Use subword tokenization methods like WordPiece, Byte Pair Encoding (BPE), or Unigram.

- Recurrent Neural Networks (RNN): Often use word-level or subword tokenization.
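
To make the subword idea concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece-style tokenizers apply to a single word. The vocabulary below is a made-up toy set, and the function name and `[UNK]` fallback are illustrative assumptions rather than any library's actual API.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, invented for this example.
toy_vocab = {"un", "##happi", "##ness", "happy"}
print(wordpiece_tokenize("unhappiness", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

BPE and Unigram reach a similar result by different means (learned merge rules and a probabilistic model, respectively), so this sketch covers only the WordPiece-style lookup.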

### Loading a Tokenizer

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

In [18]:
text = "Hi, I love NLP"

In [19]:
tokens = tokenizer.encode(text)
tokens

[0, 30086, 6, 38, 657, 234, 21992, 2]

In [20]:
len(text.split()), len(tokens)

(4, 8)

In [21]:
for token in tokens:
    print(f'{token} --> {tokenizer.convert_ids_to_tokens(token)}')

0 --> <s>
30086 --> Hi
6 --> ,
38 --> ĠI
657 --> Ġlove
234 --> ĠN
21992 --> LP
2 --> </s>


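Note: the special symbols Ġ and Ċ in the vocabulary stand for a space and a newline, respectively. A token such as ĠI is therefore the word "I" preceded by a space.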
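
These symbols come from the byte-to-character table used by GPT-2-style byte-level BPE, which the RoBERTa tokenizer also uses: every byte is mapped to a printable Unicode character so that spaces and control characters can appear inside vocabulary entries. The sketch below is a standalone reimplementation of that table written for illustration (it does not call the library), and the helper name `to_visible` is made up.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Remaining bytes (space, newline, control chars) move to 256 and above.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))

mapping = bytes_to_unicode()

def to_visible(text):
    # Show how a string looks after the byte-level mapping.
    return "".join(mapping[b] for b in text.encode("utf-8"))

print(mapping[ord(" ")], mapping[ord("\n")])
print(to_visible(" love"))
```

Byte 32 (space) lands on code point 288 and byte 10 (newline) on 266, which is why a leading space shows up as Ġ in tokens such as Ġlove.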