# Tokenization

&copy; 2025-2026 by [Damir Cavar](http://damir.cavar.me/)

The following code examples show how tokenization works with several different algorithms and libraries.

## Word Tokenizer



### NLTK Tokenizer


In [42]:
from nltk.tokenize import word_tokenize

In [43]:
text = "Our friend Grungle Svonitovich is going to Paris. The students are going to New York."

In [44]:
tokens = word_tokenize(text)
print(tokens)

['Our', 'friend', 'Grungle', 'Svonitovich', 'is', 'going', 'to', 'Paris', '.', 'The', 'students', 'are', 'going', 'to', 'New', 'York', '.']


NLTK also offers a RegexpTokenizer, which tokenizes text using a regular expression:

In [45]:
from nltk.tokenize import RegexpTokenizer

In [46]:
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokens = tokenizer.tokenize(text)

In [47]:
print(tokens)

['Our', 'friend', 'Grungle', 'Svonitovich', 'is', 'going', 'to', 'Paris', '.', 'The', 'students', 'are', 'going', 'to', 'New', 'York', '.']
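The example sentence contains no currency amounts, so the middle alternative of the pattern never fires. The sketch below applies the same pattern with Python's re module to a hypothetical sample sentence, assuming that RegexpTokenizer in its default mode returns the non-overlapping matches of the pattern. At each position the alternatives are tried from left to right: a run of word characters, then a dollar amount, then any other run of non-whitespace characters.

```python
import re

# Same pattern as in the RegexpTokenizer example.
pattern = re.compile(r'\w+|\$[\d\.]+|\S+')

# Hypothetical sample sentence with a currency amount (not from the text above).
sample = "The ticket costs $12.50, right?"

regex_tokens = pattern.findall(sample)
print(regex_tokens)
# ['The', 'ticket', 'costs', '$12.50', ',', 'right', '?']
```

The dollar amount stays in one token because the second alternative accepts digits and dots after the dollar sign, while the comma and the question mark are picked up by the last alternative.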


The NLTK WhitespaceTokenizer splits text on whitespace only, so punctuation stays attached to the adjacent word:

In [48]:
from nltk.tokenize import WhitespaceTokenizer

In [49]:
tokenizer = WhitespaceTokenizer()
tokens = tokenizer.tokenize(text)

In [50]:
print(tokens)

['Our', 'friend', 'Grungle', 'Svonitovich', 'is', 'going', 'to', 'Paris.', 'The', 'students', 'are', 'going', 'to', 'New', 'York.']
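For comparison, Python's built-in str.split() without arguments also splits on runs of whitespace. The following minimal sketch uses a shortened, hypothetical sample sentence and should behave like the whitespace tokenizer on ordinary text:

```python
# Hypothetical sample sentence, shortened from the example above.
sample = "Our friend is going to Paris. The students are going to New York."

# split() without arguments splits on any run of whitespace.
ws_tokens = sample.split()
print(ws_tokens)
```

As in the output above, the periods remain part of the tokens 'Paris.' and 'York.'.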


Another NLTK tokenizer splits text based on blank lines (e.g., text into paragraphs):

In [51]:
from nltk.tokenize import BlanklineTokenizer

In [52]:
text = """Our friend Grungle Svonitovich is going to Paris.

The students are going to New York."""

In [53]:
tokenizer = BlanklineTokenizer()
tokens = tokenizer.tokenize(text)

In [54]:
print(tokens)

['Our friend Grungle Svonitovich is going to Paris.', 'The students are going to New York.']
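The same idea can be sketched with a regular expression that treats two line breaks, optionally surrounded by other whitespace, as a separator. The pattern and the sample text below are assumptions chosen for illustration, not necessarily what NLTK uses internally:

```python
import re

# Hypothetical sample text: three paragraphs; the second separator
# is a line that contains only a space.
sample = "First paragraph,\nstill the first.\n\nSecond paragraph.\n \nThird paragraph."

# Split where two newlines occur, with optional whitespace around them.
paragraphs = [p for p in re.split(r'\s*\n\s*\n\s*', sample) if p]
print(paragraphs)
# ['First paragraph,\nstill the first.', 'Second paragraph.', 'Third paragraph.']
```

A single line break inside a paragraph does not trigger a split, only a blank (or whitespace-only) line does.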


## WordPiece Tokenizer

WordPiece is the subword tokenization algorithm used in BERT. Words that are not in its vocabulary are split into smaller pieces that are.

In [None]:
!pip install word-piece-tokenizer

In [55]:
from word_piece_tokenizer import WordPieceTokenizer

Create a tokenizer object:

In [56]:
tokenizer = WordPieceTokenizer()

In [57]:
text = 'Our friend Grungle Svonitovich is going to Paris. The students are going to New York.'

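Tokenizing a sentence returns token IDs, including the IDs of the special tokens [CLS] (101) and [SEP] (102) that mark the start and the end of the input: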

In [58]:
token_ids = tokenizer.tokenize(text)

In [59]:
print(token_ids)

[101, 2256, 2767, 24665, 5575, 2571, 17917, 10698, 26525, 7033, 2003, 2183, 2000, 3000, 1012, 1996, 2493, 2024, 2183, 2000, 2047, 2259, 1012, 102]


To convert the IDs to string tokens, use:

In [60]:
tokens = tokenizer.convert_ids_to_tokens(token_ids)

In [61]:
print(tokens)

['[CLS]', 'our', 'friend', 'gr', '##ung', '##le', 'sv', '##oni', '##tov', '##ich', 'is', 'going', 'to', 'paris', '.', 'the', 'students', 'are', 'going', 'to', 'new', 'york', '.', '[SEP]']


Converting the tokens back to a text:

In [62]:
tokenizer.convert_tokens_to_string(tokens)

'[CLS] our friend grungle svonitovich is going to paris . the students are going to new york . [SEP]'
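To illustrate where pieces like 'gr', '##ung', '##le' come from, here is a minimal sketch of greedy longest-match-first splitting, the lookup strategy commonly described for WordPiece at tokenization time. The function names and the toy vocabulary are hypothetical and are not part of the word-piece-tokenizer package. Real tokenizers use a vocabulary of about 30,000 entries and also handle lowercasing and punctuation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark word-internal pieces
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def join_pieces(pieces):
    """Glue '##' continuation pieces back onto the preceding piece."""
    return " ".join(pieces).replace(" ##", "")


# Toy vocabulary (an assumption for this sketch only).
toy_vocab = {"gr", "##ung", "##le", "sv", "##oni", "##tov", "##ich", "is"}

print(wordpiece_split("grungle", toy_vocab))      # ['gr', '##ung', '##le']
print(wordpiece_split("svonitovich", toy_vocab))  # ['sv', '##oni', '##tov', '##ich']
print(wordpiece_split("paris", toy_vocab))        # ['[UNK]']
print(join_pieces(["gr", "##ung", "##le", "is"])) # grungle is
```

With the toy vocabulary, 'paris' maps to the unknown token because none of its prefixes is listed. In the BERT vocabulary it is a single token, as the output above shows.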

### BERT Tokenizer


In [63]:
from transformers import BertTokenizer

In [64]:
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

In [65]:
tokenizer.tokenize(text)

['our',
 'friend',
 'gr',
 '##ung',
 '##le',
 'sv',
 '##oni',
 '##tov',
 '##ich',
 'is',
 'going',
 'to',
 'paris',
 '.',
 'the',
 'students',
 'are',
 'going',
 'to',
 'new',
 'york',
 '.']

## Byte Pair Encoding (BPE)

### Tiktoken

Tiktoken is an OpenAI library that provides a fast BPE tokenizer for OpenAI models.

In [66]:
import tiktoken

Load a specific encoding by name:

In [67]:
enc = tiktoken.get_encoding("o200k_base")

Load the encoding for a specific OpenAI model:

In [68]:
enc = tiktoken.encoding_for_model("gpt-4o")

In [69]:
print(enc)

<Encoding 'o200k_base'>


An encoding maps between text and the integer token IDs that a model consumes. The encodings cl100k_base and o200k_base differ as follows:

| Feature | o200k_base (Newer) | cl100k_base (Older) |
| --- | --- | --- |
| Associated Models | GPT-4o, GPT-4o-mini | GPT-4, GPT-3.5-Turbo, Text-Embedding-Ada-002 |
| Vocabulary Size | approx. 200,000 tokens | approx. 100,000 tokens |
| Token Compression | Significantly higher and more efficient. | Good, but less efficient than o200k_base. |
| Multilingual Support | Highly optimized for non-English and non-Latin scripts (e.g., Tamil, Chinese, Japanese). | Less efficient for many non-English languages, often splitting diacritics and letters into more tokens. |
| Tokenization Rules | Features a major upgrade in the regex pattern for handling word boundaries, specifically to better group diacritics (â, ê, î) and other Unicode character categories. | Uses an older, less sophisticated regex pattern for word boundaries. |
| Cost & Performance | Generally results in a lower token count for the same body of text (especially non-English), which can lead to lower API costs and faster processing. | Results in a higher token count for non-English texts, potentially increasing cost and context window usage. |


There are several different encodings available in tiktoken:

| Encoding Name | Associated OpenAI Models (Examples) |
| --- | --- |
| o200k_base    | gpt-4o, gpt-4o-mini, gpt-4.5-*, gpt-4.1-* (Newer models) |
| cl100k_base   | gpt-4, gpt-3.5-turbo, text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large (GPT-4 generation and embedding models) |
| p50k_base     | Codex models (for code), text-davinci-002, text-davinci-003 |
| r50k_base     | GPT-3 models (like davinci, curie, babbage, ada), also often referred to as gpt2 |
| p50k_edit     | Older edit models like text-davinci-edit-001, code-davinci-edit-001 |

We can look up the encoding for a specific model:

In [70]:
enc = tiktoken.encoding_for_model("gpt-4o")

Check that encoding a text and decoding the result returns the original text:

In [71]:
assert enc.decode(enc.encode("hello world")) == "hello world"

Tokenize the following text:

In [72]:
text_to_tokenize = "Our friend Grungle Svonitovich is going to Paris. The students are going to New York."

Get the IDs for the individual tokens:

In [73]:
token_ids = enc.encode(text_to_tokenize)

In [74]:
print(token_ids)

[7942, 5168, 2502, 44361, 42625, 263, 278, 106793, 382, 2966, 316, 12650, 13, 623, 4501, 553, 2966, 316, 2036, 6175, 13]


Get the strings for the token IDs:

In [75]:
token_strings = [enc.decode_single_token_bytes(token_id) for token_id in token_ids]

In [76]:
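# errors='replace' avoids a UnicodeDecodeError when a token holds only part of a multi-byte character
print([x.decode('utf-8', errors='replace') for x in token_strings])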

['Our', ' friend', ' Gr', 'ungle', ' Sv', 'on', 'it', 'ovich', ' is', ' going', ' to', ' Paris', '.', ' The', ' students', ' are', ' going', ' to', ' New', ' York', '.']


We can print each token ID together with its token string:

In [77]:
for t in zip(token_ids, token_strings):
    print(f"{t[0]}:\t '{t[1].decode('utf-8')}'")

7942:	 'Our'
5168:	 ' friend'
2502:	 ' Gr'
44361:	 'ungle'
42625:	 ' Sv'
263:	 'on'
278:	 'it'
106793:	 'ovich'
382:	 ' is'
2966:	 ' going'
316:	 ' to'
12650:	 ' Paris'
13:	 '.'
623:	 ' The'
4501:	 ' students'
553:	 ' are'
2966:	 ' going'
316:	 ' to'
2036:	 ' New'
6175:	 ' York'
13:	 '.'
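The subword pieces above come from a learned list of merges. As a rough illustration of the BPE idea, the following sketch learns merges on a tiny hypothetical word list by repeatedly fusing the most frequent adjacent symbol pair. It works on characters rather than bytes, breaks ties alphabetically, and uses made-up function names, so it is a toy under these assumptions and not how tiktoken is implemented.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair by the fused symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to num_merges merges from a list of words."""
    # Each word starts as a tuple of single characters with a frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the alphabetically smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy training data.
words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = learn_bpe(words, num_merges=4)
print(merges)
# [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
print(apply_bpe("lowest", merges))  # ['low', 'est']
print(apply_bpe("newest", merges))  # ['n', 'e', 'w', 'est']
```

Frequent sequences such as 'low' and 'est' end up as single tokens, while rarer material stays split into smaller pieces, which mirrors how 'Grungle' and 'Svonitovich' are broken up above.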

