# Preliminaries

Write the installed package versions to a requirements file every time you run the notebook, in case you later need to go back and recover the dependencies.

The requirements for each notebook are hosted in the companion GitHub repo, and can be pulled down and installed here if needed. The companion GitHub repo is located at https://github.com/azunre/transfer-learning-for-nlp

In [1]:
!ls ../input/jw300entw/jw300.en-tw.en

../input/jw300entw/jw300.en-tw.en


In [2]:
!pip freeze > kaggle_image_requirements.txt
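
Roughly the same snapshot can be taken from Python when shelling out to pip is not convenient. The sketch below is an illustration rather than part of the original notebook: the helper `pinned_requirements` is a hypothetical name, it relies only on the standard-library `importlib.metadata` module, and its output may differ in detail from `pip freeze` (for example, editable installs get no special treatment).

```python
import tempfile
from importlib import metadata
from pathlib import Path


def pinned_requirements():
    """Return sorted 'name==version' lines for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") if dist.metadata else None
        if name:  # skip distributions whose metadata cannot be read
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


lines = pinned_requirements()

# Written to a temporary directory here; in the notebook the working directory would do.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "kaggle_image_requirements.txt"
    out_path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    written = out_path.read_text(encoding="utf-8").splitlines()

print(f"Recorded {len(written)} packages")
```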

# Train Twi Tokenizer From Scratch

In [3]:
from tokenizers import BertWordPieceTokenizer

In [4]:
paths = ['../input/jw300entw/jw300.en-tw.tw']

tokenizer = BertWordPieceTokenizer() # Initialize a tokenizer

In [5]:
# Customize training and carry it out
tokenizer.train(
    paths,
    vocab_size=10000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"], # standard BERT special tokens
    limit_alphabet=1000,
    wordpieces_prefix="##",
)

# Save tokenizer to disk
!mkdir twibert
tokenizer.save("twibert")

['twibert/vocab.txt']
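
To make the role of `wordpieces_prefix="##"` concrete, the sketch below shows greedy longest-match-first WordPiece segmentation of a single word in plain Python. It is an illustration only: the helper `wordpiece_split` and the toy vocabulary are hypothetical, the pieces are not taken from the trained `twibert/vocab.txt`, and the real tokenizer also normalizes and pre-tokenizes the text before this step.

```python
def wordpiece_split(word, vocab, prefix="##", unk_token="[UNK]"):
    """Split one word into WordPiece units by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"[UNK]", "as", "asɔre", "##fie", "##ɔ", "##re", "kɛse"}

print(wordpiece_split("asɔrefie", toy_vocab))  # ['asɔre', '##fie']
print(wordpiece_split("kɛse", toy_vocab))      # ['kɛse']
print(wordpiece_split("xyz", toy_vocab))       # ['[UNK]']
```

Pieces that continue a word are stored with the prefix, which is why the toy vocabulary holds `##fie` rather than `fie`.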

# Fine-tune mBERT on Monolingual Twi Data (with Twi Tokenizer Trained From Scratch)

To load the tokenizer we just saved, we only need to execute the following. Note that we use a maximum sequence length of 512 to be consistent with the previous subsection; this is also what the pre-trained mBERT uses.

In [6]:
from transformers import BertTokenizerFast



In [7]:
tokenizer = BertTokenizerFast.from_pretrained("twibert", max_len=512) #  use the language-specific tokenizer we just trained



In [8]:
from transformers import BertForMaskedLM, BertConfig

model = BertForMaskedLM(BertConfig()) # Don't initialize to pretrained, create a fresh one

print("Number of parameters in mBERT model:")
print(model.num_parameters())

Number of parameters in mBERT model:
110104890
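
The printed total can be reproduced by hand, which also shows where the parameters sit. The arithmetic below is a sketch under stated assumptions: it uses the default `BertConfig` values (vocabulary 30522, hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 segment types), treats the masked-LM decoder weights as tied to the word embeddings, and includes a pooler layer, which is an assumption about the library version that produced the output above.

```python
# Default BertConfig values (assumed, since the notebook passes no overrides).
vocab_size = 30522
hidden = 768
layers = 12
intermediate = 3072
max_positions = 512
type_vocab = 2


def dense(n_in, n_out):
    """Parameters of a fully connected layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and segment embeddings, followed by a layer norm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

encoder_layer = (
    3 * dense(hidden, hidden)      # query, key and value projections
    + dense(hidden, hidden)        # attention output projection
    + layer_norm
    + dense(hidden, intermediate)  # feed-forward expansion
    + dense(intermediate, hidden)  # feed-forward contraction
    + layer_norm
)

pooler = dense(hidden, hidden)  # assumed to be present in this library version

# Masked-LM head: transform layer, layer norm and output bias.
# The decoder weight matrix is tied to the word embeddings, so it is not counted again.
mlm_head = dense(hidden, hidden) + layer_norm + vocab_size

total = embeddings + layers * encoder_layer + pooler + mlm_head
print(total)  # 110104890
```

Under this accounting the word-embedding table is sized by the configuration's default vocabulary of 30522 entries, not by the 10000-entry vocabulary trained above; passing `vocab_size` to `BertConfig` would be expected to shrink it.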


From here, the steps are the same as in section 5.4.2: https://www.kaggle.com/azunre/tl-for-nlp-section5-4-3

In [9]:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="../input/jw300entw/jw300.en-tw.tw",
    block_size=128,
)
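
Conceptually, a line-by-line dataset treats every non-blank line of the file as one training example and truncates it to `block_size` tokens. The sketch below mimics that behaviour in plain Python. `build_line_examples` and the whitespace-based `toy_tokenize` are hypothetical stand-ins, not the library's implementation, and the sample sentences reuse the fill-mask example from this notebook.

```python
def toy_tokenize(line):
    """Stand-in tokenizer: lowercases and splits on whitespace."""
    return line.lower().split()


def build_line_examples(lines, tokenize, block_size):
    """Turn each non-blank line into one token list of at most block_size items."""
    examples = []
    for line in lines:
        if not line.strip():
            continue  # blank lines produce no example
        # Reserve two positions for the [CLS] and [SEP] special tokens.
        tokens = ["[CLS]"] + tokenize(line)[: block_size - 2] + ["[SEP]"]
        examples.append(tokens)
    return examples


sample_lines = [
    "Eyi de ɔhaw kɛse baa yɛn hɔ.",
    "",
    "   ",
    "Eyi de ɔhaw kɛse baa me hɔ.",
]
examples = build_line_examples(sample_lines, toy_tokenize, block_size=6)

for tokens in examples:
    print(len(tokens), tokens)
```

With `block_size=128`, as in the cell above, only lines longer than 126 tokens would be cut in this sketch.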

In [10]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True, mlm_probability=0.15
)
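
The masking step can be sketched without the library. The version below follows the recipe from the original BERT paper, which the collator is understood to follow: about 15% of tokens are selected as prediction targets, and of those roughly 80% are replaced by `[MASK]`, 10% by a random vocabulary token, and 10% are left unchanged. The function `mask_tokens`, the toy vocabulary, and the use of `None` for positions that are not predicted are illustrative assumptions; the library works on token ids rather than strings.

```python
import random

SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}


def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if token in SPECIAL_TOKENS or rng.random() >= mlm_probability:
            corrupted.append(token)
            labels.append(None)  # not a prediction target
            continue
        labels.append(token)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # random replacement
        else:
            corrupted.append(token)  # left unchanged, but still predicted
    return corrupted, labels


toy_vocab = ["eyi", "de", "ɔhaw", "kɛse", "baa", "yɛn", "hɔ"]
sentence = ["[CLS]"] + toy_vocab * 20 + ["[SEP]"]
corrupted, labels = mask_tokens(sentence, toy_vocab)

targets = sum(label is not None for label in labels)
print(f"{targets} of {len(sentence)} tokens selected as prediction targets")
```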

In [11]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="twimbert",
    overwrite_output_dir=True,
    num_train_epochs=2, # how about 2 epochs?
    per_gpu_train_batch_size=16, # later transformers releases deprecate this name in favor of per_device_train_batch_size
    save_total_limit=1,
)

In [12]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,
)

In [13]:
import time
start = time.time()
trainer.train()
end = time.time()
print("Number of seconds for training:")
print((end-start))

[Progress bars: Epoch (max 2), Iteration (max 37562)]

{"loss": 6.3489798774719235, "learning_rate": 4.966721686811139e-05, "epoch": 0.013311325275544433, "step": 500}
{"loss": 5.609932363510132, "learning_rate": 4.933443373622278e-05, "epoch": 0.026622650551088867, "step": 1000}
{"loss": 5.416386514663697, "learning_rate": 4.900165060433417e-05, "epoch": 0.0399339758266333, "step": 1500}
{"loss": 5.377104565143585, "learning_rate": 4.866886747244556e-05, "epoch": 0.05324530110217773, "step": 2000}
{"loss": 5.226606401920319, "learning_rate": 4.8336084340556944e-05, "epoch": 0.06655662637772217, "step": 2500}
{"loss": 5.202029550552369, "learning_rate": 4.800330120866834e-05, "epoch": 0.0798679516532666, "step": 3000}
{"loss": 5.124593429565429, "learning_rate": 4.767051807677972e-05, "epoch": 0.09317927692881103, "step": 3500}
{"loss": 5.082736471652985, "learning_rate": 4.7337734944891116e-05, "epoch": 0.10649060220435547, "step": 4000}
{"loss": 5.045186289310456, "learning_rate": 4.7004951813002505e-05, "epoch": 0.1198019274798999, "step": 4500}
{"loss": 4.96710503578186, "learning_rate": 4.667216868111389e-05, "epoch": 0.13311325275544433, "step"


{"loss": 2.92047173500061, "learning_rate": 2.470848197646558e-05, "epoch": 1.0116607209413768, "step": 38000}
{"loss": 2.9104557859897615, "learning_rate": 2.4375698844576967e-05, "epoch": 1.0249720462169214, "step": 38500}
{"loss": 2.9423716962337494, "learning_rate": 2.404291571268836e-05, "epoch": 1.0382833714924657, "step": 39000}
{"loss": 2.9362151432037353, "learning_rate": 2.3710132580799748e-05, "epoch": 1.0515946967680103, "step": 39500}
{"loss": 2.8901157331466676, "learning_rate": 2.3377349448911136e-05, "epoch": 1.0649060220435547, "step": 40000}
{"loss": 2.8899089720249176, "learning_rate": 2.3044566317022524e-05, "epoch": 1.078217347319099, "step": 40500}
{"loss": 2.8940025260448454, "learning_rate": 2.2711783185133913e-05, "epoch": 1.0915286725946436, "step": 41000}
{"loss": 2.8580290224552156, "learning_rate": 2.23790000532453e-05, "epoch": 1.104839997870188, "step": 41500}
{"loss": 2.847791908502579, "learning_rate": 2.204621692135669e-05, "epoch": 1.1181513231457323,
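
The logged learning rates are consistent with a linear decay from an initial rate of 5e-5 down to zero over the whole run, with no warmup. The sketch below recomputes two of the logged values under those assumptions. The initial rate and the absence of warmup are inferred from the numbers rather than stated in the notebook, and the total step count is taken as 2 epochs of 37562 iterations, the maxima reported by the progress output.

```python
initial_lr = 5e-5        # assumed starting learning rate
steps_per_epoch = 37562  # iteration maximum from the progress output
total_steps = 2 * steps_per_epoch


def linear_decay_lr(step):
    """Learning rate after `step` updates, decaying linearly to zero."""
    return initial_lr * (total_steps - step) / total_steps


def epoch_fraction(step):
    """Number of epochs completed after `step` updates."""
    return step / steps_per_epoch


for step in (500, 38000):
    print(step, linear_decay_lr(step), epoch_fraction(step))
```

Steps 500 and 38000 give rates of about 4.9667e-05 and 2.4708e-05, in line with the corresponding log entries.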

In [14]:
trainer.save_model("twimbert") # save model

In [15]:
from transformers import pipeline # test model

fill_mask = pipeline(
    "fill-mask",
    model="twimbert",
    tokenizer=tokenizer
)

In [16]:
# same example as before
print(fill_mask("Eyi de ɔhaw kɛse baa [MASK] hɔ."))

[{'sequence': '[CLS] eyi de ɔhaw kɛse baa yɛn hɔ. [SEP]', 'score': 0.05942384898662567, 'token': 269}, {'sequence': '[CLS] eyi de ɔhaw kɛse baa yehowa hɔ. [SEP]', 'score': 0.045544251799583435, 'token': 291}, {'sequence': '[CLS] eyi de ɔhaw kɛse baa dwumadibea hɔ. [SEP]', 'score': 0.044391192495822906, 'token': 1647}, {'sequence': '[CLS] eyi de ɔhaw kɛse baa asɔrefie hɔ. [SEP]', 'score': 0.03953441604971886, 'token': 1512}, {'sequence': '[CLS] eyi de ɔhaw kɛse baa me hɔ. [SEP]', 'score': 0.03777942806482315, 'token': 277}]
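
Since the pipeline returns a list of dictionaries, the candidates are easier to compare once reduced to the predicted word and its score. The snippet below is a small post-processing sketch: `predictions` copies the `sequence` fields from the output above with the scores rounded to four decimal places, and the helper `filled_word` is hypothetical.

```python
# Fixed context around the mask position, as it appears in the output sequences.
template = "[CLS] eyi de ɔhaw kɛse baa {} hɔ. [SEP]"
prefix, suffix = template.split("{}")

predictions = [
    {"sequence": "[CLS] eyi de ɔhaw kɛse baa yɛn hɔ. [SEP]", "score": 0.0594},
    {"sequence": "[CLS] eyi de ɔhaw kɛse baa yehowa hɔ. [SEP]", "score": 0.0455},
    {"sequence": "[CLS] eyi de ɔhaw kɛse baa dwumadibea hɔ. [SEP]", "score": 0.0444},
    {"sequence": "[CLS] eyi de ɔhaw kɛse baa asɔrefie hɔ. [SEP]", "score": 0.0395},
    {"sequence": "[CLS] eyi de ɔhaw kɛse baa me hɔ. [SEP]", "score": 0.0378},
]


def filled_word(sequence):
    """Strip the fixed context to recover the word predicted for the mask."""
    if not (sequence.startswith(prefix) and sequence.endswith(suffix)):
        raise ValueError("sequence does not match the expected template")
    return sequence[len(prefix): len(sequence) - len(suffix)]


ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
words = [filled_word(p["sequence"]) for p in ranked]

for word, p in zip(words, ranked):
    print(f"{word:<12} {p['score']:.4f}")
```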
