<div style="text-align: center;">
        <img src="./static/tokenizer_header.png" width="400px" style="height: auto;">
</div>

---

This notebook demonstrates how to train a `WordPieceTokenizer` on a text corpus, use it to tokenize, encode, and decode text, and then save and reload it.

#### 📦 Importing dependencies

In [26]:
from babybert.tokenizer import TokenizerConfig, WordPieceTokenizer
from babybert.data import load_corpus

#### 📖 Loading corpus

In [27]:
corpus = load_corpus("./data/corpus.txt")

#### ⚙️ Instantiating tokenizer

In [28]:
config = TokenizerConfig(
    target_vocab_size=5000
)

tokenizer = WordPieceTokenizer(config)

#### 🏋️ Training tokenizer

In [29]:
tokenizer.train(corpus)

In [30]:
print(f"Tokenizer vocab size: {tokenizer.vocab_size}")

Tokenizer vocab size: 5000


In [31]:
print(f"First ten tokens in vocab: {tokenizer.vocab[:10]}")

First ten tokens in vocab: ['##i', ',', 'u', 's', '##p', '##b', '##-', '##1', '##+', '5']
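
In the WordPiece convention, a token that starts with `##` continues the previous token, while a token without the prefix starts a new word. The sketch below shows how such pieces can be glued back into words. It is independent of `babybert`, and `merge_wordpieces` is a hypothetical helper name used only for illustration.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into words, gluing '##' pieces onto the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: strip the prefix and extend the last word.
            words[-1] += token[2:]
        else:
            # Word-initial piece (or punctuation): start a new word.
            words.append(token)
    return words


pieces = ["to", "##d", "##ay", "is", "a", "sentence", "."]
print(merge_wordpieces(pieces))  # ['today', 'is', 'a', 'sentence', '.']
```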


#### 🚀 Using trained tokenizer

In [32]:
examples = [
    "Hello, world!",
    "Here is a sentence.",
    "How are you today?"
]

for example in examples:
    tokenized_example = tokenizer.tokenize(example)
    print(f"Original sentence: {example}")
    print(f"Tokenized sentence: {tokenized_example}")

Original sentence: Hello, world!
Tokenized sentence: ['h', '##e', '##ll', '##o', ',', 'world', '!']
Original sentence: Here is a sentence.
Tokenized sentence: ['h', '##e', '##r', '##e', 'is', 'a', 'sentence', '.']
Original sentence: How are you today?
Tokenized sentence: ['how', 'ar', '##e', 'you', 'to', '##d', '##ay', '[UNK]']
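
The splits above are consistent with greedy longest-match-first lookup, the scheme commonly used for WordPiece inference: at each position, take the longest vocabulary entry that matches, and fall back to `[UNK]` when nothing matches (as happens for `?` here). Whether `WordPieceTokenizer` works exactly this way is an assumption. The following is a minimal sketch with a toy vocabulary, and `wordpiece_split` is a hypothetical function name.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the '##' prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration, not the trained vocab).
toy_vocab = {"h", "##e", "##ll", "##o", "world", "to", "##d", "##ay"}
print(wordpiece_split("hello", toy_vocab))  # ['h', '##e', '##ll', '##o']
print(wordpiece_split("today", toy_vocab))  # ['to', '##d', '##ay']
print(wordpiece_split("?", toy_vocab))      # ['[UNK]']
```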


In [33]:
token_ids = tokenizer.encode(examples[0])
print(f"Token IDs: {token_ids}")

Token IDs: [53, 61, 3671, 41, 1, 4498, 78]


In [34]:
# decode maps IDs back to token strings; it does not join them into a sentence
tokens = tokenizer.decode(token_ids)
print(f"Decoded tokens: {tokens}")

Decoded tokens: ['h', '##e', '##ll', '##o', ',', 'world', '!']
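
In the encoded output, `,` has ID 1, which matches its position in the vocab list printed earlier. That is consistent with token IDs being indices into the vocab list, although the notebook does not confirm it. Under that assumption, encoding and decoding reduce to a pair of lookups. The sketch below uses a toy vocabulary, and `encode_tokens` and `decode_ids` are hypothetical helpers, not part of `babybert`.

```python
# Toy vocabulary (an assumption for illustration); a token's ID is its list index.
toy_vocab = ["[UNK]", ",", "h", "##e", "##ll", "##o", "world", "!"]
token_to_id = {token: index for index, token in enumerate(toy_vocab)}


def encode_tokens(tokens, unk_token="[UNK]"):
    """Map token strings to IDs, sending unseen tokens to the unknown ID."""
    unk_id = token_to_id[unk_token]
    return [token_to_id.get(token, unk_id) for token in tokens]


def decode_ids(ids):
    """Map IDs back to token strings by indexing into the vocabulary."""
    return [toy_vocab[i] for i in ids]


tokens = ["h", "##e", "##ll", "##o", ",", "world", "!"]
ids = encode_tokens(tokens)
print(ids)                        # [2, 3, 4, 5, 1, 6, 7]
print(decode_ids(ids) == tokens)  # True
```

The round trip is lossless only for tokens that are in the vocabulary; anything mapped to the unknown ID decodes back to `[UNK]`.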


#### 💾 Saving trained tokenizer

In [35]:
path = "./my_tokenizer"
tokenizer.save_pretrained(path)

In [36]:
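# Discard the in-memory tokenizer so the next result can only come from disk
del tokenizer

# Reload from the saved path and check that tokenization is unchanged
tokenizer = WordPieceTokenizer.from_pretrained(path)
print(f"Tokenized text: {tokenizer.tokenize(examples[0])}")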

Tokenized text: ['h', '##e', '##ll', '##o', ',', 'world', '!']
