### 🧑‍🏫 Instructions

FLORES-200 dataset

1. `dev/` and `devtest/` contain the sentences in the same order for each language.
2. `metadata_dev.tsv` and `metadata_devtest.tsv` contain tab-separated metadata for each sentence. The metadata rows are in the same order as the sentences in `dev/<lang>.dev` and `devtest/<lang>.devtest`.
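
Because the sentence files and the metadata files share one ordering, a sentence can be paired with its metadata purely by position. The sketch below illustrates that idea. The helper name `load_aligned` is hypothetical, and it assumes the TSV begins with a header row naming its columns; if the file has no header, the first data row would be consumed as column names.

```python
import csv


def load_aligned(sentence_path, metadata_path):
    """Pair each sentence with the metadata row at the same position."""
    with open(sentence_path, encoding="utf-8") as f:
        # One sentence per line; strip only the trailing newline.
        sentences = [line.rstrip("\n") for line in f]

    with open(metadata_path, encoding="utf-8", newline="") as f:
        # Assumption: the first row of the TSV is a header.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = list(reader)

    # A length mismatch means the positional pairing cannot be trusted.
    if len(sentences) != len(rows):
        raise ValueError(
            f"{len(sentences)} sentences but {len(rows)} metadata rows"
        )
    return list(zip(sentences, rows))
```

Calling it with one language file from `dev/` and `metadata_dev.tsv` should give a list of `(sentence, metadata_dict)` tuples. The same call with a different language file should return the same metadata in the same positions.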


In [32]:
%pip install -q -r requirements.txt

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.
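
The `huggingface/tokenizers` warning above names the environment variable that controls this behavior. One way to silence it in a notebook, shown here as a minimal sketch, is to set the variable explicitly in a cell that runs before `tokenizers` is first used. Choosing `"false"` is an assumption that matches the fallback the warning already applies; `"true"` is the other accepted value.

```python
import os

# Set this before `tokenizers` is first used in the process, so the
# library does not have to guess after a fork (e.g. a `%pip` call).
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```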


In [33]:
import glob
import json
import os

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Initialize the tokenizer and trainer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# No vocab_size is passed, so BpeTrainer falls back to its default of 30000
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.pre_tokenizer = Whitespace()

# Get all the .dev files in the inputs/flores200_dataset/dev directory
files = glob.glob("inputs/flores200_dataset/dev/*.dev")

# Train the tokenizer on all the files
tokenizer.train(files, trainer)

# Save the trained tokenizer
tokenizer.save("results/tokenizer.json")
print("Tokenizer saved to results/tokenizer.json")

# Extract the language code from a file name, e.g. "nld_Latn.dev" -> "nld_Latn"
def get_language_from_filename(filename):
    return os.path.basename(filename).split('.')[0]

language_vocabularies = {}

# The global vocabulary is the same for every file, so fetch it once
global_vocab = tokenizer.get_vocab()

# Iterate over files to get the vocabulary for each language
for file in files:
    # Extract language code from the file name
    language_code = get_language_from_filename(file)

    # Tokenize the content of the file
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()

    encoded = tokenizer.encode(text)
    unique_tokens = set(encoded.tokens)

    # Keep only the global-vocabulary entries that occur in this file
    vocab = {token: global_vocab[token] for token in unique_tokens if token in global_vocab}
    language_vocabularies[language_code] = vocab

with open("results/language_vocabularies.json", 'w', encoding='utf-8') as f:
    json.dump(language_vocabularies, f, ensure_ascii=False, indent=4)

print("Language vocabularies saved to results/language_vocabularies.json")

Tokenizer saved to results/tokenizer.json
Language vocabularies saved to results/language_vocabularies.json


# 📈 Analysis


In [34]:
# Get the size of the tokenizer vocabulary
vocab_size = len(tokenizer.get_vocab())
print("Vocabulary size:", vocab_size)

Vocabulary size: 30000
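
All per-language vocabularies in `language_vocabularies` are subsets of this one shared vocabulary, so two languages can be compared by the tokens they have in common. The sketch below is one possible way to do that. The helper names `shared_tokens` and `jaccard` are hypothetical, and the toy dictionaries are made-up illustration data, not results from the FLORES run.

```python
def shared_tokens(vocab_a, vocab_b):
    """Return the tokens present in both per-language vocabularies."""
    return set(vocab_a) & set(vocab_b)


def jaccard(vocab_a, vocab_b):
    """Share of the combined token set that both vocabularies contain."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    # Two empty vocabularies have no overlap to measure.
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data in the same {token: id} shape as language_vocabularies
toy = {
    "xx": {"de": 5, "en": 6, "[UNK]": 0},
    "yy": {"de": 5, "la": 9},
}
print(shared_tokens(toy["xx"], toy["yy"]))  # {'de'}
print(jaccard(toy["xx"], toy["yy"]))        # 1 shared / 4 total = 0.25
```

With the real data, the same functions could be called on any two entries of `language_vocabularies`, for example two languages that share a script.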


In [35]:
# Print the vocabulary size and average token length for each language
for language, vocab in language_vocabularies.items():
    print(f"Language: {language}, Vocab Size: {len(vocab)}, Vocab Size Percentage: {len(vocab)/vocab_size*100:.2f}%, Average token length: {sum(len(token) for token in vocab)/len(vocab):.2f}")

Language: bam_Latn, Vocab Size: 2809, Vocab Size Percentage: 9.36%, Average token length: 3.97
Language: vec_Latn, Vocab Size: 3168, Vocab Size Percentage: 10.56%, Average token length: 4.07
Language: prs_Arab, Vocab Size: 1495, Vocab Size Percentage: 4.98%, Average token length: 3.02
Language: tuk_Latn, Vocab Size: 2604, Vocab Size Percentage: 8.68%, Average token length: 3.57
Language: tgl_Latn, Vocab Size: 3343, Vocab Size Percentage: 11.14%, Average token length: 4.18
Language: fij_Latn, Vocab Size: 2443, Vocab Size Percentage: 8.14%, Average token length: 4.19
Language: nld_Latn, Vocab Size: 3094, Vocab Size Percentage: 10.31%, Average token length: 4.07
Language: luo_Latn, Vocab Size: 2992, Vocab Size Percentage: 9.97%, Average token length: 4.06
Language: arb_Latn, Vocab Size: 2832, Vocab Size Percentage: 9.44%, Average token length: 3.89
Language: run_Latn, Vocab Size: 2918, Vocab Size Percentage: 9.73%, Average token length: 4.04
Language: eus_Latn, Vocab Size: 2900, Vocab Siz