# Preliminaries

Write the installed package versions to a requirements file on every run, so that the exact dependencies can be recovered later if needed.

The requirements for each notebook are hosted in the companion GitHub repository, from which they can be downloaded and installed if needed. The repository is located at https://github.com/azunre/transfer-learning-for-nlp

In [1]:
!ls ../input/jw300entw/jw300.en-tw.en

../input/jw300entw/jw300.en-tw.en


In [2]:
!pip freeze > kaggle_image_requirements.txt

# Fine-tune mBERT on Monolingual Twi Data (multilingual mBERT Tokenizer)

Initialize the BERT tokenizer from the mBERT checkpoint.

In [3]:
from transformers import BertTokenizerFast # this is just a faster version of BertTokenizer, which you could use instead



In [4]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased") # use pre-trained mBERT tokenizer





Having prepared the tokenizer, load the mBERT checkpoint into a BERT masked language model.

In [5]:
from transformers import BertForMaskedLM # use masked language modeling

model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased") # initialize to mBERT checkpoint

print("Number of parameters in mBERT model:")
print(model.num_parameters())



Number of parameters in mBERT model:
178565115
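
As a rough cross-check, the parameter count reported above can be reproduced by hand. The sketch below assumes the published configuration of `bert-base-multilingual-cased` (a vocabulary of 119,547 WordPiece tokens, hidden size 768, 12 layers, feed-forward size 3,072, 512 positions, 2 segment types), a masked-LM head whose output weights are tied to the word embeddings, and a pooler layer that is included in the count. None of these values are stated in this notebook, so treat them as assumptions; the helper names are illustrative.

```python
# Assumed configuration values for bert-base-multilingual-cased (not stated in this notebook).
VOCAB_SIZE = 119547
HIDDEN = 768
LAYERS = 12
INTERMEDIATE = 3072
MAX_POSITIONS = 512
TYPE_VOCAB = 2


def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters in a LayerNorm: scale plus shift."""
    return 2 * n


def count_mbert_mlm_parameters():
    """Return a per-component parameter breakdown under the assumptions above."""
    embeddings = (
        VOCAB_SIZE * HIDDEN        # word embeddings
        + MAX_POSITIONS * HIDDEN   # position embeddings
        + TYPE_VOCAB * HIDDEN      # segment embeddings
        + layer_norm(HIDDEN)
    )
    encoder_layer = (
        4 * linear(HIDDEN, HIDDEN)       # query, key, value, attention output
        + layer_norm(HIDDEN)
        + linear(HIDDEN, INTERMEDIATE)   # feed-forward expansion
        + linear(INTERMEDIATE, HIDDEN)   # feed-forward projection
        + layer_norm(HIDDEN)
    )
    pooler = linear(HIDDEN, HIDDEN)  # assumed to be included in the count
    mlm_head = (
        linear(HIDDEN, HIDDEN)  # transform before the output layer
        + layer_norm(HIDDEN)
        + VOCAB_SIZE            # output bias only; output weights are tied to the word embeddings
    )
    return {
        "embeddings": embeddings,
        "encoder": LAYERS * encoder_layer,
        "pooler": pooler,
        "mlm_head": mlm_head,
    }


breakdown = count_mbert_mlm_parameters()
total = sum(breakdown.values())
print(total)
print(round(breakdown["embeddings"] / total, 3))
```

Under these assumptions the total comes to 178,565,115, matching the printed count, and just over half of the parameters sit in the embedding layers, chiefly the multilingual word-embedding matrix. Library versions that build the masked-LM model without a pooler may report 590,592 fewer parameters.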


Build the monolingual Twi dataset with the tokenizer, using the `LineByLineTextDataset` class included with transformers.

In [6]:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="../input/jw300entw/jw300.en-tw.tw",
    block_size=128,  # maximum number of tokens per line; longer lines are truncated
)
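
Conceptually, a line-by-line dataset treats every non-empty line of the file as one training example, tokenizes it, and truncates it to at most `block_size` tokens. The sketch below mimics that behaviour in plain Python. The whitespace tokenizer and the function name `read_line_by_line` are stand-ins invented for illustration, not the library's implementation, which uses the WordPiece tokenizer and adds special tokens.

```python
import os
import tempfile


def toy_tokenize(line):
    """Hypothetical stand-in for a real subword tokenizer: split on whitespace."""
    return line.split()


def read_line_by_line(file_path, block_size):
    """Return one token list per non-empty line, truncated to block_size tokens."""
    examples = []
    with open(file_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            examples.append(toy_tokenize(line)[:block_size])
    return examples


# Tiny demonstration file built from the Twi sentence used later in this notebook.
with tempfile.NamedTemporaryFile("w", suffix=".tw", delete=False, encoding="utf-8") as tmp:
    tmp.write("Eyi de ɔhaw kɛse baa sukuu hɔ.\n\nEyi de ɔhaw\n")
    demo_path = tmp.name

examples = read_line_by_line(demo_path, block_size=4)
os.remove(demo_path)
print(examples)
```

The blank line is skipped, the first line is cut to four tokens, and the shorter line is kept whole, so `block_size` caps the length of each example rather than the number of lines read.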

We also need a "data collator": a helper that assembles a batch of tokenized lines (each at most block_size tokens long) into tensors that PyTorch can consume for neural network training. For masked language modeling, it also randomly masks tokens and builds the corresponding labels.

In [7]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,  # use masked language modeling
    mlm_probability=0.15,  # select each token for masking with probability 0.15
)
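
For intuition, the masking step can be sketched in plain Python. The sketch follows the standard BERT recipe, which is assumed here to match the collator's behaviour: each token is selected with probability `mlm_probability`; a selected token is replaced by `[MASK]` 80% of the time, by a random vocabulary token 10% of the time, and left unchanged otherwise. Unselected positions get the label `-100`, which PyTorch's cross-entropy loss ignores by default. The function name, toy vocabulary, and string tokens are illustrative; the real collator works on tensors of token ids and does not mask special tokens.

```python
import random

IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def mask_tokens(tokens, vocab, mlm_probability=0.15, rng=random):
    """Return (corrupted_tokens, labels) following the 80/10/10 BERT masking recipe."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mlm_probability:
            # Not selected: keep the token and exclude it from the loss.
            corrupted.append(token)
            labels.append(IGNORE_INDEX)
            continue
        labels.append(token)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: leave unchanged
    return corrupted, labels


rng = random.Random(0)  # fixed seed so the demonstration is repeatable
sentence = "Eyi de ɔhaw kɛse baa sukuu hɔ .".split()
toy_vocab = sorted(set(sentence))  # hypothetical vocabulary for the demonstration
corrupted, labels = mask_tokens(sentence * 50, toy_vocab, rng=rng)
selected = sum(label != IGNORE_INDEX for label in labels)
print(selected / len(labels))
```

The printed fraction should land near 0.15. Only the selected positions contribute to the training loss, which is why the labels carry the original token there and `-100` everywhere else.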

Define standard training arguments, then combine them with the previously defined dataset and collator to create a "trainer". We train for one epoch, i.e. one pass over all 600,000+ examples (37,562 steps at a batch size of 16).

In [8]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="twimbert",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=16,  # later transformers releases deprecate this in favor of per_device_train_batch_size
    save_total_limit=1,
)

In [9]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
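    prediction_loss_only=True,  # return only the loss when evaluating; later transformers releases expect this option in TrainingArguments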
)

Train the model and time the run.

In [10]:
import time
start = time.time()
trainer.train()
end = time.time()
print("Number of seconds for training:")
print(end - start)


{"loss": 2.6203167583942415, "learning_rate": 4.933443373622278e-05, "epoch": 0.013311325275544433, "step": 500}




{"loss": 1.929805256009102, "learning_rate": 4.866886747244556e-05, "epoch": 0.026622650551088867, "step": 1000}
{"loss": 1.7221835536956787, "learning_rate": 4.800330120866834e-05, "epoch": 0.0399339758266333, "step": 1500}
{"loss": 1.6270063619613648, "learning_rate": 4.7337734944891116e-05, "epoch": 0.05324530110217773, "step": 2000}
{"loss": 1.5580790455937386, "learning_rate": 4.667216868111389e-05, "epoch": 0.06655662637772217, "step": 2500}
{"loss": 1.4647048667669296, "learning_rate": 4.600660241733668e-05, "epoch": 0.0798679516532666, "step": 3000}
{"loss": 1.4247039886713029, "learning_rate": 4.5341036153559454e-05, "epoch": 0.09317927692881103, "step": 3500}
{"loss": 1.3765649322271347, "learning_rate": 4.467546988978223e-05, "epoch": 0.10649060220435547, "step": 4000}
{"loss": 1.3284565920829774, "learning_rate": 4.400990362600501e-05, "epoch": 0.1198019274798999, "step": 4500}
{"loss": 1.3167774492502213, "learning_rate": 4.334433736222779e-05, "epoch": 0.13311325275544433
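
The `learning_rate` and `epoch` fields in this log can be reproduced with a little arithmetic, assuming the `TrainingArguments` defaults of an initial learning rate of 5e-5 with linear decay to zero and no warmup (the notebook does not set these explicitly). The total of 37,562 steps follows from the logged `epoch` value at step 500 and corresponds to the number of examples divided by the batch size of 16, rounded up. The function names are illustrative.

```python
INITIAL_LR = 5e-5      # assumed: TrainingArguments default learning rate
TOTAL_STEPS = 37562    # steps in one epoch, implied by the logged epoch fractions
BATCH_SIZE = 16


def linear_decay_lr(step, total_steps=TOTAL_STEPS, initial_lr=INITIAL_LR):
    """Learning rate after `step` updates under linear decay to zero with no warmup."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)


def epoch_fraction(step, total_steps=TOTAL_STEPS):
    """Fraction of the epoch completed after `step` updates."""
    return step / total_steps


# Smallest and largest dataset sizes consistent with 37,562 steps of 16 examples each.
min_examples = (TOTAL_STEPS - 1) * BATCH_SIZE + 1
max_examples = TOTAL_STEPS * BATCH_SIZE
print(min_examples, max_examples)

for step in (500, 1000, 1500):
    print(step, linear_decay_lr(step), epoch_fraction(step))
```

The values printed for steps 500, 1000, and 1500 should agree with the logged entries up to floating-point rounding, and the implied dataset size of roughly 601,000 lines is consistent with the 600,000+ examples mentioned earlier.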

In [11]:
trainer.save_model("twimbert") # save model

Test the model on a "fill-in-the-blank" task: take a sentence, mask a word, and predict a completion with the pipelines API.

In [12]:
# Define fill-in-the-blanks pipeline
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="twimbert",
    tokenizer=tokenizer
)

In [13]:
# Mask one word of the sentence: "Eyi de ɔhaw kɛse baa sukuu hɔ." => "Eyi de ɔhaw kɛse baa [MASK] hɔ."
# Predict the masked token
print(fill_mask("Eyi de ɔhaw kɛse baa [MASK] hɔ."))

[{'sequence': '[CLS] Eyi de ɔhaw kɛse baa me hɔ. [SEP]', 'score': 0.13256989419460297, 'token': 10911}, {'sequence': '[CLS] Eyi de ɔhaw kɛse baa Israel hɔ. [SEP]', 'score': 0.06816119700670242, 'token': 12991}, {'sequence': '[CLS] Eyi de ɔhaw kɛse baa ne hɔ. [SEP]', 'score': 0.06106790155172348, 'token': 10554}, {'sequence': '[CLS] Eyi de ɔhaw kɛse baa Europa hɔ. [SEP]', 'score': 0.05116277188062668, 'token': 11313}, {'sequence': '[CLS] Eyi de ɔhaw kɛse baa Eden hɔ. [SEP]', 'score': 0.033920999616384506, 'token': 35409}]
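
Each entry in the returned list is one candidate completion, with `score` giving the model's probability for that token at the masked position. The sketch below pulls the predicted word out of each `sequence` string by comparing it with the masked template. The helper name `predicted_word` is hypothetical, and the `results` list is copied from the output above with scores rounded to four decimals and the `token` ids omitted.

```python
def predicted_word(sequence, template, mask_token="[MASK]"):
    """Recover the word that replaced mask_token in template, given the filled sequence."""
    prefix, suffix = template.split(mask_token)
    # Drop the special tokens that the pipeline adds around the sentence.
    text = sequence.replace("[CLS]", "").replace("[SEP]", "").strip()
    if not (text.startswith(prefix) and text.endswith(suffix)):
        raise ValueError("sequence does not match the template")
    return text[len(prefix):len(text) - len(suffix)]


template = "Eyi de ɔhaw kɛse baa [MASK] hɔ."
results = [
    {"sequence": "[CLS] Eyi de ɔhaw kɛse baa me hɔ. [SEP]", "score": 0.1326},
    {"sequence": "[CLS] Eyi de ɔhaw kɛse baa Israel hɔ. [SEP]", "score": 0.0682},
    {"sequence": "[CLS] Eyi de ɔhaw kɛse baa ne hɔ. [SEP]", "score": 0.0611},
    {"sequence": "[CLS] Eyi de ɔhaw kɛse baa Europa hɔ. [SEP]", "score": 0.0512},
    {"sequence": "[CLS] Eyi de ɔhaw kɛse baa Eden hɔ. [SEP]", "score": 0.0339},
]

words = [predicted_word(result["sequence"], template) for result in results]
for word, result in zip(words, results):
    print(f"{word:<8} {result['score']:.4f}")
```

Read this way, the top five candidates are "me", "Israel", "ne", "Europa", and "Eden". The original word "sukuu" is not among them, and the best candidate has a probability of only about 0.13.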
