# Filipino-English BPE Tokenizer Test

This notebook trains a BPE tokenizer on a Filipino-English translation dataset and then tests it.

## 1. Install Dependencies

Run this cell if packages are not already installed:

In [None]:
# Uncomment if you need to install packages
# !pip install datasets tokenizers transformers torch

## 2. Import Libraries

In [1]:
from datasets import load_dataset, concatenate_datasets
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast
import torch

## 3. Load Dataset

In [2]:
# Load the Filipino-English translation dataset
ds = load_dataset("rhyliieee/tagalog-filipino-english-translation")

print("Dataset structure:")
print(ds)
print("\nFirst training example:")
print(ds["train"][0])

Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['tagalog', 'english'],
        num_rows: 84177
    })
    test: Dataset({
        features: ['tagalog', 'english'],
        num_rows: 21057
    })
})

First training example:
{'tagalog': ' Ilarawan kung ano ang makikita mo kung pupunta ka sa Grand Canyon.', 'english': 'Describe what you would see if you went to the Grand Canyon.'}


## 4. Explore Dataset Statistics

In [3]:
# Check dataset sizes
print(f"Training samples: {len(ds['train'])}")
print(f"Test samples: {len(ds['test'])}")
print(f"Total samples: {len(ds['train']) + len(ds['test'])}")

# Sample a few examples
print("\n=== Sample Translations ===")
for i in range(3):
    print(f"\nExample {i+1}:")
    print(f"Tagalog: {ds['train'][i]['tagalog']}")
    print(f"English: {ds['train'][i]['english']}")

Training samples: 84177
Test samples: 21057
Total samples: 105234

=== Sample Translations ===

Example 1:
Tagalog:  Ilarawan kung ano ang makikita mo kung pupunta ka sa Grand Canyon.
English: Describe what you would see if you went to the Grand Canyon.

Example 2:
Tagalog: Saang bansa ipinanganak si Pangulong Roosevelt?
English: In what country was President Roosevelt born?

Example 3:
Tagalog:  Dahil sa pangalan ng kanta, hulaan ang genre ng kanta.
English: Given a song name, predict the genre of the song.
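
Two of the three Tagalog samples above start with a stray space. The sketch below shows one possible cleanup step, assuming that stripping surrounding whitespace is acceptable for this data. `clean_pair` is a hypothetical helper, not part of the notebook. The `Whitespace` pre-tokenizer used later discards such spaces anyway, so this mainly matters when comparing original and decoded strings.

```python
def clean_pair(row):
    """Return a copy of a translation pair with surrounding whitespace removed."""
    return {key: value.strip() for key, value in row.items()}

# First training example, copied from the output above
row = {
    "tagalog": " Ilarawan kung ano ang makikita mo kung pupunta ka sa Grand Canyon.",
    "english": "Describe what you would see if you went to the Grand Canyon.",
}
cleaned = clean_pair(row)
print(repr(cleaned["tagalog"]))
```

With the `datasets` library, a function of this shape could be applied to every row through `ds.map(clean_pair)`.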


## 5. Train BPE Tokenizer

In [4]:
# Combine train and test text for tokenizer training.
# (This widens vocabulary coverage; the translation model itself is still trained on the train split only.)
all_text = concatenate_datasets([ds["train"], ds["test"]])

def text_generator():
    """Generator to yield all text samples from both languages"""
    for row in all_text:
        yield row["tagalog"]
        yield row["english"]

print(f"Training tokenizer on {len(all_text) * 2} text samples...")

Training tokenizer on 210468 text samples...


In [5]:
# Initialize tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"]
)

# Train the tokenizer
print("Training in progress...")
tokenizer.train_from_iterator(text_generator(), trainer=trainer)
print("✓ Training complete!")

Training in progress...

✓ Training complete!


In [6]:
# Wrap as a transformers-compatible tokenizer
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="[BOS]",
    eos_token="[EOS]",
    unk_token="[UNK]",
    pad_token="[PAD]"
)

print("Tokenizer wrapped and ready to use!")

Tokenizer wrapped and ready to use!
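
To make the training step in this section less of a black box, here is a toy pure-Python sketch of how BPE merges are learned: count adjacent symbol pairs, merge the most frequent pair, repeat. It is a simplified illustration only. The real `BpeTrainer` is implemented in the `tokenizers` library and has additional options that are not modelled here. The word counts are made up for the example.

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} mapping."""
    # Start with every word split into single characters
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

# Hypothetical word frequencies (not taken from the dataset)
toy_counts = {"kung": 5, "kuya": 2, "ang": 6, "ng": 4}
merges, segmented = learn_bpe_merges(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

In this toy corpus the pair `('n', 'g')` is merged first because it is the most frequent, then `('k', 'u')`, then `('a', 'ng')`. Training with `vocab_size=32000` continues this kind of process until the vocabulary budget is used up.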


## 6. Save and Reload Tokenizer

In [7]:
# Save the tokenizer
save_path = "fil_en_bpe_tokenizer"
hf_tokenizer.save_pretrained(save_path)
print(f"✓ Tokenizer saved to '{save_path}'")

# Reload to verify it works
tokenizer = PreTrainedTokenizerFast.from_pretrained(save_path)
print("✓ Tokenizer reloaded successfully")

✓ Tokenizer saved to 'fil_en_bpe_tokenizer'
✓ Tokenizer reloaded successfully


## 7. Basic Tokenizer Tests

In [8]:
# Test vocabulary size
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Special tokens: {tokenizer.all_special_tokens}")
print(f"Special token IDs: {tokenizer.all_special_ids}")

Vocabulary size: 32000
Special tokens: ['[BOS]', '[EOS]', '[UNK]', '[PAD]']
Special token IDs: [2, 3, 1, 0]


In [9]:
# Inspect some tokens from the vocabulary (dict order is arbitrary, not sorted by ID)
vocab = tokenizer.get_vocab()
print(f"\nFirst 30 tokens in vocabulary:")
print(list(vocab.keys())[:30])


First 30 tokens in vocabulary:
['Reduces', 'currency', 'napakagandang', 'makat', 'Div', 'particulate', '外', 'Contact', 'oss', 'iproseso', 'Bry', 'bumili', 'PDF', 'learns', 'igas', 'just', 'kad', 'pamamahagi', 'clo', 'katangiang', 'weekends', 'quantities', 'hospital', 'columns', 'siguradong', 'kilograms', 'churn', 'spin', 'magpapahusay', 'rooms']
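
Judging from the output above, the dictionary returned by `get_vocab()` is not ordered by token ID, so slicing its keys shows an arbitrary sample. A minimal sketch of sorting by ID instead, using a small hand-written excerpt whose IDs are copied from outputs elsewhere in this notebook (the excerpt itself is an illustration, not the full vocabulary):

```python
# Toy excerpt of a token -> ID mapping; the IDs are copied from outputs in this notebook
vocab_excerpt = {
    "sa": 641,
    "?": 34,
    "[EOS]": 3,
    ".": 17,
    "[PAD]": 0,
    "ka": 665,
    "[BOS]": 2,
    "!": 4,
    "[UNK]": 1,
    ",": 15,
}

# Sort by ID so the lowest IDs (here the special tokens) come first
by_id = sorted(vocab_excerpt.items(), key=lambda item: item[1])
print([token for token, _ in by_id[:4]])
```

The same `sorted(..., key=lambda item: item[1])` call should work on the full `vocab` dictionary.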


## 8. Test on Sample Sentences

In [10]:
# Test sentences: Filipino, English, and code-mixed
test_sentences = [
    "Kumusta ka?",  # Filipino - How are you?
    "Hello, how are you?",  # English
    "Ang weather ngayon ay maganda.",  # Code-mixed - The weather today is beautiful
    "Mahal kita.",  # Filipino - I love you
    "Good morning!",  # English
    "Salamat sa iyong tulong.",  # Filipino - Thank you for your help
]

print("=== Tokenization Tests ===")
for sent in test_sentences:
    tokens = tokenizer.tokenize(sent)
    token_ids = tokenizer.encode(sent, add_special_tokens=False)
    print(f"\nOriginal: {sent}")
    print(f"Tokens: {tokens}")
    print(f"Token IDs: {token_ids}")
    print(f"Token count: {len(tokens)}")

=== Tokenization Tests ===

Original: Kumusta ka?
Tokens: ['Kumusta', 'ka', '?']
Token IDs: [14043, 665, 34]
Token count: 3

Original: Hello, how are you?
Tokens: ['Hello', ',', 'how', 'are', 'you', '?']
Token IDs: [5432, 15, 1527, 742, 839, 34]
Token count: 6

Original: Ang weather ngayon ay maganda.
Tokens: ['Ang', 'weather', 'ngayon', 'ay', 'maganda', '.']
Token IDs: [710, 3612, 3009, 649, 7444, 17]
Token count: 6

Original: Mahal kita.
Tokens: ['Mahal', 'kita', '.']
Token IDs: [11085, 2303, 17]
Token count: 3

Original: Good morning!
Tokens: ['Good', 'morning', '!']
Token IDs: [7155, 6821, 4]
Token count: 3

Original: Salamat sa iyong tulong.
Tokens: ['Salamat', 'sa', 'iyong', 'tulong', '.']
Token IDs: [5897, 641, 845, 3299, 17]
Token count: 5
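
The punctuation marks come out as separate tokens because the `Whitespace` pre-tokenizer splits the text before BPE is applied. The `tokenizers` documentation describes this pre-tokenizer with the regex `\w+|[^\w\s]+`. The sketch below approximates it with Python's `re` module, which may differ from the library's Rust regex engine on unusual Unicode input.

```python
import re

# Regex given in the tokenizers documentation for the Whitespace pre-tokenizer
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")

def pre_tokenize(text):
    """Approximate Whitespace pre-tokenization: runs of word characters or of punctuation."""
    return WHITESPACE_PATTERN.findall(text)

print(pre_tokenize("Hello, how are you?"))
print(pre_tokenize("Ang weather ngayon ay maganda."))
```

For these common sentences the pre-tokenized pieces already match the final tokens, which suggests each word exists as a single entry in the 32,000-token vocabulary.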


## 9. Test Encoding and Decoding

In [11]:
# Test full encode-decode cycle
sample_text = ds["train"][0]["tagalog"]

print(f"Original text: {sample_text}")
print()

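# Encode to PyTorch tensors (the output below contains no [BOS]/[EOS] IDs,
# so special tokens are evidently not inserted by this tokenizer as configured)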
encoded = tokenizer(sample_text, return_tensors="pt")
print(f"Input IDs shape: {encoded['input_ids'].shape}")
print(f"Input IDs: {encoded['input_ids']}")
print(f"Attention mask: {encoded['attention_mask']}")
print()

# Decode back to text
decoded = tokenizer.decode(encoded['input_ids'][0])
print(f"Decoded text: {decoded}")

Original text:  Ilarawan kung ano ang makikita mo kung pupunta ka sa Grand Canyon.

Input IDs shape: torch.Size([1, 13])
Input IDs: tensor([[ 2401,   854,  1011,   636,  5539,   723,   854, 17411,   665,   641,
          9994, 12598,    17]])
Attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

Decoded text: Ilarawan kung ano ang makikita mo kung pupunta ka sa Grand Canyon .
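
The encoded output above has 13 IDs and contains neither the `[BOS]` ID (2) nor the `[EOS]` ID (3), which suggests that no special tokens are inserted during encoding. A minimal sketch of adding them by hand, assuming the IDs reported in this notebook; `add_bos_eos` is a hypothetical helper:

```python
def add_bos_eos(token_ids, bos_id=2, eos_id=3):
    """Wrap a list of token IDs with BOS and EOS IDs."""
    return [bos_id] + list(token_ids) + [eos_id]

# IDs of the first Tagalog example, copied from the output above
ids = [2401, 854, 1011, 636, 5539, 723, 854, 17411, 665, 641, 9994, 12598, 17]
wrapped = add_bos_eos(ids)
print(len(ids), len(wrapped))
print(wrapped[0], wrapped[-1])
```

The `tokenizers` library also provides post-processors (for example `TemplateProcessing`) that can do this inside the tokenizer. That route is not shown here.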


## 10. Batch Encoding Test

In [12]:
# Test batch encoding with padding
batch_texts = [
    "Mabuhay!",
    "Kumusta ka ngayong umaga?",
    "Ang ganda ng panahon."
]

# Encode batch with padding
batch_encoded = tokenizer(
    batch_texts, 
    padding=True, 
    truncation=True,
    return_tensors="pt"
)

print("Batch encoding results:")
print(f"Input IDs shape: {batch_encoded['input_ids'].shape}")
print(f"\nInput IDs:\n{batch_encoded['input_ids']}")
print(f"\nAttention mask:\n{batch_encoded['attention_mask']}")

# Decode each sequence
print("\nDecoded sequences:")
for i, ids in enumerate(batch_encoded['input_ids']):
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    print(f"{i+1}. {decoded}")

Batch encoding results:
Input IDs shape: torch.Size([3, 5])

Input IDs:
tensor([[ 9732, 11115,     4,     0,     0],
        [14043,   665,  8781,  7467,    34],
        [  710, 17996,   634,  1391,    17]])

Attention mask:
tensor([[1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])

Decoded sequences:
1. Mab uhay !
2. Kumusta ka ngayong umaga ?
3. Ang ganda ng panahon .
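
The padded tensor and attention mask above can be reproduced with a few lines of plain Python. This is only a sketch of what `padding=True` appears to do for right-padding, assuming a pad ID of 0; `pad_batch` is a hypothetical helper.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad ID lists to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Unpadded IDs, copied from the batch output above
batch = [
    [9732, 11115, 4],
    [14043, 665, 8781, 7467, 34],
    [710, 17996, 634, 1391, 17],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[0])
print(attention_mask[0])
```

The mask marks real tokens with 1 and padding with 0, so a model can ignore the padded positions.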


## 11. Calculate Tokenizer Statistics

In [13]:
# Calculate average tokens per sentence
import numpy as np

tagalog_lengths = []
english_lengths = []

# Use the first 100 training examples
for i in range(min(100, len(ds['train']))):
    tag_tokens = tokenizer.tokenize(ds['train'][i]['tagalog'])
    eng_tokens = tokenizer.tokenize(ds['train'][i]['english'])
    tagalog_lengths.append(len(tag_tokens))
    english_lengths.append(len(eng_tokens))

print("=== Tokenizer Statistics (100 samples) ===")
print(f"\nTagalog:")
print(f"  Average tokens: {np.mean(tagalog_lengths):.2f}")
print(f"  Min tokens: {np.min(tagalog_lengths)}")
print(f"  Max tokens: {np.max(tagalog_lengths)}")

print(f"\nEnglish:")
print(f"  Average tokens: {np.mean(english_lengths):.2f}")
print(f"  Min tokens: {np.min(english_lengths)}")
print(f"  Max tokens: {np.max(english_lengths)}")

print(f"\nTagalog/English token ratio: {np.mean(tagalog_lengths) / np.mean(english_lengths):.2f}")

=== Tokenizer Statistics (100 samples) ===

Tagalog:
  Average tokens: 12.76
  Min tokens: 5
  Max tokens: 26

English:
  Average tokens: 11.06
  Min tokens: 5
  Max tokens: 28

Tagalog/English token ratio: 1.15
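
The ratio printed above divides the mean Tagalog length by the mean English length (12.76 / 11.06). This ratio of means can differ from the mean of per-sentence ratios. The sketch below contrasts the two on made-up token counts; `length_ratio_stats` is a hypothetical helper.

```python
from statistics import mean

def length_ratio_stats(src_lengths, tgt_lengths):
    """Return (ratio of mean lengths, mean of per-pair length ratios)."""
    ratio_of_means = mean(src_lengths) / mean(tgt_lengths)
    mean_of_ratios = mean([s / t for s, t in zip(src_lengths, tgt_lengths)])
    return ratio_of_means, mean_of_ratios

# Hypothetical token counts for three sentence pairs (not taken from the dataset)
tagalog_counts = [4, 12, 30]
english_counts = [2, 12, 32]
rom, mor = length_ratio_stats(tagalog_counts, english_counts)
print(f"ratio of means: {rom:.4f}, mean of ratios: {mor:.4f}")

# The means reported in this notebook reproduce the printed value
print(f"{12.76 / 11.06:.2f}")
```

Short sentences weigh more heavily in the mean of ratios, so which statistic is more informative depends on whether per-sentence or corpus-level length matters.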


## 12. Test Unknown Token Handling

In [14]:
# Test with rare/unusual words
unusual_texts = [
    "Supercalifragilisticexpialidocious",  # Made-up English word
    "Pakikipagsapalaran",  # Complex Filipino word
    "🎉🎊",  # Emojis
    "test123",  # Mixed alphanumeric
]

print("=== Unknown/Rare Token Handling ===")
for text in unusual_texts:
    tokens = tokenizer.tokenize(text)
    print(f"\nText: {text}")
    print(f"Tokens: {tokens}")
    print(f"Contains [UNK]: {'[UNK]' in tokens}")

=== Unknown/Rare Token Handling ===

Text: Supercalifragilisticexpialidocious
Tokens: ['Super', 'cal', 'if', 'rag', 'ilis', 'tice', 'x', 'p', 'ial', 'ido', 'cious']
Contains [UNK]: False

Text: Pakikipagsapalaran
Tokens: ['Pakikipagsapalaran']
Contains [UNK]: False

Text: 🎉🎊
Tokens: ['🎉', '🎊']
Contains [UNK]: False

Text: test123
Tokens: ['test', '123']
Contains [UNK]: False
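
None of the inputs above produced `[UNK]`, presumably because every character in them (including the emoji) occurs somewhere in the training text. The toy sketch below illustrates the general idea: BPE falls back to smaller pieces for unseen words and only needs an unknown token for characters outside its base alphabet. The merge list and alphabet are hypothetical, and the procedure is a simplification of what the `tokenizers` library does.

```python
def bpe_segment(word, merges, alphabet, unk="[UNK]"):
    """Segment one word with an ordered merge list; unseen characters become `unk`."""
    symbols = [ch if ch in alphabet else unk for ch in word]
    for left, right in merges:  # apply merges in the order they were learned
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merges and base alphabet for illustration
merges = [("n", "g"), ("k", "u"), ("a", "ng")]
alphabet = set("angkuy")

print(bpe_segment("kung", merges, alphabet))
print(bpe_segment("kuyang", merges, alphabet))
print(bpe_segment("kung🎉", merges, alphabet))
```

The unseen word `kuyang` is still covered by known pieces, while the emoji, which is outside this toy alphabet, maps to `[UNK]`.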


## 13. Integration Test with Model Config

In [15]:
# Test tokenizer configuration for encoder-decoder model
from transformers import EncoderDecoderConfig

print("=== Model Configuration Test ===")
print(f"Vocab size: {tokenizer.vocab_size}")
print(f"PAD token ID: {tokenizer.pad_token_id}")
print(f"BOS token ID: {tokenizer.bos_token_id}")
print(f"EOS token ID: {tokenizer.eos_token_id}")
print(f"UNK token ID: {tokenizer.unk_token_id}")

# Verify all special token IDs are set
assert tokenizer.pad_token_id is not None, "PAD token ID not set!"
assert tokenizer.bos_token_id is not None, "BOS token ID not set!"
assert tokenizer.eos_token_id is not None, "EOS token ID not set!"
assert tokenizer.unk_token_id is not None, "UNK token ID not set!"

print("\n✓ All special tokens properly configured!")

=== Model Configuration Test ===
Vocab size: 32000
PAD token ID: 0
BOS token ID: 2
EOS token ID: 3
UNK token ID: 1

✓ All special tokens properly configured!
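
The assertions above only confirm that each special token ID is set. A slightly stricter check, sketched here in plain Python, also verifies that the IDs are distinct and fall inside the vocabulary range. `check_special_ids` is a hypothetical helper, and the values are the ones reported in this notebook.

```python
def check_special_ids(special_ids, vocab_size):
    """Return a list of problems found with the special token IDs (empty if none)."""
    ids = list(special_ids.values())
    problems = []
    if len(set(ids)) != len(ids):
        problems.append("special token IDs are not distinct")
    for name, token_id in special_ids.items():
        if not 0 <= token_id < vocab_size:
            problems.append(f"{name} ID {token_id} is outside [0, {vocab_size})")
    return problems

# IDs and vocabulary size as printed above
reported = {"pad": 0, "unk": 1, "bos": 2, "eos": 3}
print(check_special_ids(reported, vocab_size=32000))
```

An empty list means the IDs can be passed to a model configuration without colliding with each other or exceeding the embedding size.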


## 14. Summary

This notebook demonstrated:
- Loading the Filipino-English dataset
- Training a BPE tokenizer on both languages
- Saving and reloading the tokenizer
- Testing tokenization on various inputs
- Batch encoding with padding
- Computing tokenizer statistics
- Verifying model integration readiness

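The tokenizer can now be paired with an encoder-decoder translation model. As configured here, it does not insert [BOS]/[EOS] automatically (see the encoded IDs in section 9), so those tokens still need to be added before training.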