# Preliminaries

Write the installed package versions to a file each time you run this notebook, in case you need to go back and recover the dependencies.

Requirements are hosted for each notebook in the companion GitHub repo and can be pulled down and installed here if needed. The companion repo is located at https://github.com/azunre/transfer-learning-for-nlp

In [1]:
!ls ../input/jw300entw/jw300.en-tw.en

../input/jw300entw/jw300.en-tw.en


In [2]:
!pip freeze > kaggle_image_requirements.txt

# Fine-tune DistilmBERT on Monolingual Twi Data (multilingual mBERT Tokenizer)

Initialize the tokenizer from the pretrained DistilmBERT checkpoint.

In [3]:
from transformers import DistilBertTokenizerFast # this is just a faster version of DistilBertTokenizer, which you could use instead



In [4]:
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-multilingual-cased") # use pre-trained DistilmBERT tokenizer


With the tokenizer prepared, load the DistilmBERT checkpoint into a DistilBERT masked language model and print its parameter count.

In [5]:
from transformers import DistilBertForMaskedLM # use masked language modeling

model = DistilBertForMaskedLM.from_pretrained("distilbert-base-multilingual-cased") # initialize to DistilmBERT checkpoint

print("Number of parameters in DistilmBERT model:")
print(model.num_parameters())

Number of parameters in DistilmBERT model:
135445755


Build the monolingual Twi dataset with the tokenizer, using the LineByLineTextDataset class included with transformers. Each non-empty line of the file becomes one training example.

In [6]:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="../input/jw300entw/jw300.en-tw.tw",
    block_size=128,  # maximum number of tokens per line; longer lines are truncated
)
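As a rough illustration of what a line-by-line dataset does, the sketch below treats each non-empty line as one example and truncates it to block_size tokens. It is a simplification rather than the library's implementation: toy_tokenize is a hypothetical whitespace tokenizer standing in for the DistilmBERT subword tokenizer, and the temporary file stands in for the JW300 Twi file.

```python
import tempfile
from pathlib import Path


def toy_tokenize(line):
    """Hypothetical stand-in for the subword tokenizer: split on whitespace."""
    return ["[CLS]"] + line.split() + ["[SEP]"]


def line_by_line_examples(file_path, block_size):
    """Return one token list per non-empty line, truncated to block_size tokens."""
    examples = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():  # skip empty or whitespace-only lines
                continue
            tokens = toy_tokenize(line.strip())
            if len(tokens) > block_size:
                # truncate, but keep the closing [SEP] marker
                tokens = tokens[: block_size - 1] + ["[SEP]"]
            examples.append(tokens)
    return examples


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.tw"  # hypothetical file name
    path.write_text(
        "Eyi de ɔhaw kɛse baa sukuu hɔ.\n\nEyi de ɔhaw kɛse baa fie hɔ.\n",
        encoding="utf-8",
    )
    examples = line_by_line_examples(path, block_size=6)

for tokens in examples:
    print(len(tokens), tokens)
```

With block_size=6 both toy lines are cut short, which mirrors how block_size=128 caps the length of long lines in the real dataset.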

We will also need a "data collator". This is a helper that assembles a batch of tokenized sample lines (each at most block_size tokens long) into a special object that PyTorch can consume for neural network training. For masked language modeling, it also chooses which tokens to mask.

In [7]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True, mlm_probability=0.15) # use masked language modeling, and select each token for masking with probability 0.15
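To make the masking step concrete, here is a minimal pure-Python sketch of the standard BERT masking recipe that this kind of collator follows: each non-special token is selected with probability mlm_probability, and a selected token is replaced by [MASK] 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. The function name, the toy vocabulary, and the use of string tokens instead of tensor ids are assumptions made for illustration only.

```python
import random

MASK = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=0):
    """Return (inputs, labels) for one example.

    labels[i] holds the original token where a prediction is required and
    None elsewhere (the library marks ignored positions with label id -100).
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # never select special tokens; select others with mlm_probability
        if tok in SPECIAL_TOKENS or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


tokens = "[CLS] Eyi de ɔhaw kɛse baa sukuu hɔ. [SEP]".split()
toy_vocab = ["fie", "me", "ne", "no"]  # hypothetical vocabulary
inputs, labels = mask_tokens(tokens, toy_vocab, mlm_probability=0.15)
print(inputs)
print(labels)
```

Because only the selected positions carry labels, the loss is computed on roughly 15% of the tokens in each batch.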

Define standard training arguments, then combine them with the dataset and collator defined above into a "trainer". Training runs for one epoch, i.e. one pass over all 600000+ examples.

In [8]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="twidistilmbert",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=16,  # deprecated in later transformers releases in favor of per_device_train_batch_size
    save_total_limit=1,
)

In [9]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,  # accepted by the transformers version used here; newer releases set this in TrainingArguments
)

Train the model and time the run.

In [10]:
import time
start = time.time()
trainer.train()
end = time.time()
print("Number of seconds for training:")
print(end - start)

{"loss": 2.6965276935100557, "learning_rate": 4.933443373622278e-05, "epoch": 0.013311325275544433, "step": 500}
{"loss": 1.9868937261104584, "learning_rate": 4.866886747244556e-05, "epoch": 0.026622650551088867, "step": 1000}
{"loss": 1.7659727020263671, "learning_rate": 4.800330120866834e-05, "epoch": 0.0399339758266333, "step": 1500}
{"loss": 1.6668763043880463, "learning_rate": 4.7337734944891116e-05, "epoch": 0.05324530110217773, "step": 2000}
{"loss": 1.6016384425163268, "learning_rate": 4.667216868111389e-05, "epoch": 0.06655662637772217, "step": 2500}
{"loss": 1.5188227939605712, "learning_rate": 4.600660241733668e-05, "epoch": 0.0798679516532666, "step": 3000}
{"loss": 1.4741685634851456, "learning_rate": 4.5341036153559454e-05, "epoch": 0.09317927692881103, "step": 3500}
{"loss": 1.416521461725235, "learning_rate": 4.467546988978223e-05, "epoch": 0.10649060220435547, "step": 4000}
{"loss": 1.374769216656685, "learning_rate": 4.400990362600501e-05, "epoch": 0.1198019274798999, "step": 4500}
{"loss": 1.3641278388500213, "learning_rate": 4.334433736222779e-05, "epoch": 0.13311325275544433,
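The learning_rate and epoch values in this log are consistent with a linear decay to zero with no warmup. The sketch below reproduces them under two assumptions that are not stated in the notebook: an initial learning rate of 5e-5 (the TrainingArguments default) and 37562 optimizer steps in the single epoch, which is the step count implied by the logged epoch fractions (500 / 0.0133113... ≈ 37562).

```python
INITIAL_LR = 5e-5    # assumed: default learning rate of TrainingArguments
TOTAL_STEPS = 37562  # assumed: steps per epoch implied by the logged epoch fractions


def linear_decay_lr(step, initial_lr=INITIAL_LR, total_steps=TOTAL_STEPS):
    """Learning rate after `step` optimizer steps, decaying linearly to zero."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)


def epoch_fraction(step, total_steps=TOTAL_STEPS):
    """Fraction of the single epoch completed after `step` optimizer steps."""
    return step / total_steps


for step in (500, 1000, 4500):
    print(step, linear_decay_lr(step), epoch_fraction(step))
```

The printed values should agree with the logged entries for steps 500, 1000, and 4500 up to floating-point rounding. With a batch size of 16, 37562 steps also correspond to roughly 601000 examples, in line with the 600000+ figure above.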

In [11]:
trainer.save_model("twidistilmbert") # save model

Test the model on a "fill-in-the-blank" task: take a sentence, mask one word, and predict completions with the pipelines API.

In [12]:
# Define fill-in-the-blanks pipeline
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="twidistilmbert",
    tokenizer=tokenizer
)

In [13]:
# We masked one word of a sentence: "Eyi de ɔhaw kɛse baa sukuu hɔ." => "Eyi de ɔhaw kɛse baa [MASK] hɔ."
# Predict masked token 
print(fill_mask("Eyi de ɔhaw kɛse baa [MASK] hɔ."))

[{'sequence': '[CLS] Eyi de ɔhaw kɛse baa fie hɔ. [SEP]', 'score': 0.31311026215553284, 'token': 29959}, {'sequence': '[CLS] Eyi de ɔhaw kɛse baa me hɔ. [SEP]', 'score': 0.09322386980056763, 'token': 10911}, {'sequence': '[CLS] Eyi de ɔhaw kɛse baa ne hɔ. [SEP]', 'score': 0.05879712104797363, 'token': 10554}, {'sequence': '[CLS] Eyi de ɔhaw kɛse baa too hɔ. [SEP]', 'score': 0.052420321851968765, 'token': 16683}, {'sequence': '[CLS] Eyi de ɔhaw kɛse baa no hɔ. [SEP]', 'score': 0.04025224596261978, 'token': 10192}]
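The raw pipeline output is easier to read once each candidate word is pulled out of its sequence. The sketch below does that for the five predictions printed above, which are copied verbatim. The filled_word helper is hypothetical and assumes every sequence matches the masked template apart from the [CLS] and [SEP] markers.

```python
# Predictions copied from the fill_mask output above
predictions = [
    {"sequence": "[CLS] Eyi de ɔhaw kɛse baa fie hɔ. [SEP]", "score": 0.31311026215553284, "token": 29959},
    {"sequence": "[CLS] Eyi de ɔhaw kɛse baa me hɔ. [SEP]", "score": 0.09322386980056763, "token": 10911},
    {"sequence": "[CLS] Eyi de ɔhaw kɛse baa ne hɔ. [SEP]", "score": 0.05879712104797363, "token": 10554},
    {"sequence": "[CLS] Eyi de ɔhaw kɛse baa too hɔ. [SEP]", "score": 0.052420321851968765, "token": 16683},
    {"sequence": "[CLS] Eyi de ɔhaw kɛse baa no hɔ. [SEP]", "score": 0.04025224596261978, "token": 10192},
]

TEMPLATE = "Eyi de ɔhaw kɛse baa [MASK] hɔ."


def filled_word(sequence, template=TEMPLATE):
    """Hypothetical helper: return the text that replaced [MASK] in the template."""
    prefix, suffix = template.split("[MASK]")
    text = sequence.replace("[CLS]", "").replace("[SEP]", "").strip()
    return text[len(prefix): len(text) - len(suffix)]


# List candidates from most to least likely
for p in sorted(predictions, key=lambda p: p["score"], reverse=True):
    print(f"{filled_word(p['sequence']):>4}  {p['score']:.3f}")
```

The top candidate is "fie" with a score of about 0.31, rather than the original "sukuu", which does not appear among the five candidates listed.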
