In [36]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm

from transformers import (
    pipeline,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
)
import transformers
import tokenizers
import torch

import nlpsig

from load_data import data_folder, seed, corpus_df, english_train

# Training a masked language model on a corpus of English words

In this notebook, we train a masked language model on our corpus of English words. In particular, we mask out individual letters and ask the model to predict each masked letter.
We do this using the [`nlpsig.TextEncoder`](https://nlpsig.readthedocs.io/en/latest/encode_text.html#nlpsig.encode_text.TextEncoder) class which provides a wrapper around the `transformers` library.

See the [`alphabet_analysis.ipynb`](alphabet_analysis.ipynb) notebook for an introduction to the overall task we're tackling.

In [2]:
from huggingface_hub import notebook_login

notebook_login()


In [4]:
ALPHABET_FILE = f"{data_folder}/alphabet.txt"
with open(ALPHABET_FILE) as f:
    alphabet = f.read().splitlines()
print(alphabet)

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


## Set up Tokenizer for word corpora

If we were to fine-tune an existing pretrained transformer, we could use the same tokenizer that the model was pretrained with. However, in this notebook, we will train a transformer from scratch, so using a tokenizer that was pretrained on a corpus that looks quite different from ours would be suboptimal. In this example, we want to tokenize our words into characters, so we need to train a _character-based_ tokenizer.

Here, we need to use the [`tokenizers`](https://huggingface.co/docs/tokenizers/index) library to set up and train a new tokenizer for our text.

In [5]:
# initialise character based tokenizer
tokenizer = tokenizers.CharBPETokenizer()
tokenizer.train(files=[ALPHABET_FILE],
                show_progress=False)

# save the tokenizer to the "char-roberta/" folder (created if it does not exist)
os.makedirs("char-roberta", exist_ok=True)

tokenizer.save_model("char-roberta")

['char-roberta/vocab.json', 'char-roberta/merges.txt']
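
To illustrate what character-level tokenization amounts to conceptually, here is a minimal pure-Python sketch. It is not how `tokenizers.CharBPETokenizer` is implemented: the helper names (`build_char_vocab`, `encode_chars`) and the id assignment are hypothetical, and the ids need not match those stored in `vocab.json`.

```python
def build_char_vocab(alphabet, unk_token="<unk>"):
    """Map each character to an integer id, reserving id 0 for unknown characters."""
    vocab = {unk_token: 0}
    for ch in alphabet:
        vocab.setdefault(ch, len(vocab))
    return vocab


def encode_chars(word, vocab, unk_token="<unk>"):
    """Split a word into characters and look up the id of each character."""
    return [vocab.get(ch, vocab[unk_token]) for ch in word]


alphabet = list("abcdefghijklmnopqrstuvwxyz")
vocab = build_char_vocab(alphabet)
ids = encode_chars("knots", vocab)
print(ids)
```

Any character outside the alphabet falls back to the unknown-token id in this sketch, which is one simple way of handling unseen symbols.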

## Training a language model

We want to train a masked language model on our corpus of English words: we mask out individual letters and ask the model to predict each masked letter.

Here, we load our character-level tokenizer, set up a data collator (with padding), and define our transformer model by specifying its config. We use the [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) architecture described in [[1]](https://arxiv.org/abs/1907.11692).

In [6]:
corpus_df["word"].apply(len).max()

39

As the longest word in our corpus has 39 characters, we set the maximum sequence length of the transformer to 50 to leave some headroom.

In [7]:
# set the maximum sequence length (the longest word has 39 characters, so 50 leaves headroom)
max_length = 50

# set dimension of hidden states for Transformer
hidden_size = 768

# load in tokenizer for architecture
vocab = "char-roberta/vocab.json"
merges = "char-roberta/merges.txt"
tokenizer = RobertaTokenizerFast(vocab, merges, max_len=max_length)

# set up data_collator to use (initially just one that adds padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# initialise transformer architecture (random weights)
config_args = {
    "vocab_size": tokenizer.backend_tokenizer.get_vocab_size(),
    "hidden_size": hidden_size,
    "max_length": max_length,
    "max_position_embeddings": max_length + 2,
    "intermediate_size": 4*hidden_size,
    "hidden_dropout_prob": 0.1,
    "num_attention_heads": 12,
    "num_hidden_layers": 6,
    "type_vocab_size": 1
}

config = RobertaConfig(**config_args)
model = RobertaForMaskedLM(config=config)
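
The `max_length + 2` used for `max_position_embeddings` reflects how RoBERTa-style models number positions in the Hugging Face implementation: position ids start at `padding_idx + 1` rather than 0, and `padding_idx` itself is reserved for padding tokens. The plain-Python sketch below mimics that numbering. It assumes `padding_idx = 1` (the `RobertaConfig` default for `pad_token_id`), and the function name is ours rather than part of `transformers`.

```python
def roberta_style_position_ids(input_ids, padding_idx=1):
    """Number non-padding tokens from padding_idx + 1; padding tokens keep padding_idx."""
    position_ids = []
    seen = 0
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


max_length = 50
full_sequence = [5] * max_length  # 50 hypothetical non-padding token ids
largest_position = max(roberta_style_position_ids(full_sequence))
required_table_size = largest_position + 1  # position ids index rows starting from 0
print(largest_position, required_table_size)
```

Under these assumptions, a sequence of 50 tokens needs a position-embedding table with `50 + 2` rows.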

In [8]:
tokenizer.tokenize("signatures")

['s', 'i', 'g', 'n', 'a', 't', 'u', 'r', 'e', 's']

In [9]:
model_name = "english-char-roberta"

## Using the `TextEncoder` class

The `TextEncoder` class in the `nlpsig` package takes a dataframe with a column of text. We can use it to obtain embeddings for the input text, or to train a model on that text.

In this example, we will first use the class to train our transformer model with the corpus of English words, which we have stored in the `english_train` dataframe:

In [11]:
english_train.head()

Unnamed: 0,word,language
0,knots,en
1,stalemating,en
2,whoops,en
3,implantation,en
4,levers,en


To initialise the object, we pass in the dataframe, `english_train`, and the column name that stores our text, `"word"` in this case. We pass in our model, config, tokenizer and data collator which are necessary to train our model.

Note that if we are not training a model, we could instead just pass a string to the `model_name` argument, either naming a model in the [Huggingface model hub](https://huggingface.co/models), e.g. [`"bert-base-uncased"`](https://huggingface.co/bert-base-uncased), or giving the path where a model is stored, e.g. `"char-roberta_trained"`. We can then load the pretrained model using the [`nlpsig.TextEncoder.load_pretrained_model`](https://nlpsig.readthedocs.io/en/latest/encode_text.html#nlpsig.encode_text.TextEncoder.load_pretrained_model) method. We will see this later in [alphabet_analysis.ipynb](alphabet_analysis.ipynb), where we load this model.

In [12]:
text_encoder = nlpsig.TextEncoder(
    df=english_train,
    feature_name="word",
    model=model,
    config=config,
    tokenizer=tokenizer,
    data_collator=data_collator
)

We can tokenize the text with the [`nlpsig.TextEncoder.tokenize_text`](https://nlpsig.readthedocs.io/en/latest/encode_text.html#nlpsig.encode_text.TextEncoder.tokenize_text) method, which tokenizes each entry in the text column of the dataframe that we passed in. Here, each entry is a single word and we tokenize by character, so the call below tokenizes each string in the `word` column of the `english_train` dataframe.

In [13]:
text_encoder.tokenize_text()

[INFO] Setting return_special_tokens_mask=True
[INFO] Tokenizing the dataset...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Map:   0%|          | 0/70615 [00:00<?, ? examples/s]

[INFO] Saving the tokenized text for each sentence into `.df['tokens']`...


Map:   0%|          | 0/70615 [00:00<?, ? examples/s]

[INFO] Creating tokenized dataframe and setting in `.tokenized_df` attribute...
[INFO] Note: 'text_id' is the column name for denoting the corresponding text id


Dataset({
    features: ['word', 'language', 'input_ids', 'attention_mask', 'special_tokens_mask', 'tokens'],
    num_rows: 70615
})

Note that the `text_encoder` object (an instance of [`nlpsig.TextEncoder`](https://nlpsig.readthedocs.io/en/latest/encode_text.html#nlpsig.encode_text.TextEncoder)) also keeps the data as a [Huggingface `Dataset`](https://huggingface.co/docs/datasets/index) object, stored in its `.dataset` attribute:

In [14]:
text_encoder.dataset

Dataset({
    features: ['word', 'language', 'input_ids', 'attention_mask', 'special_tokens_mask', 'tokens'],
    num_rows: 70615
})

Note that when initialising the `TextEncoder` object, we could optionally have passed in the data as a `Dataset` object using the `dataset` argument. So if your dataset is already in that form, there is no need to convert it to a dataframe before using the class.

We can see that we have tokenized this as there are `input_ids`, `attention_mask`, `special_tokens_mask`, and `tokens` features in the `Dataset` object.

Let's have a look at the first word in this dataset:

In [15]:
text_encoder.dataset["word"][0]

'knots'

In [16]:
text_encoder.dataset["input_ids"][0]

[53, 11, 14, 15, 20, 19, 54]

We can see that this word has been tokenized by character:

In [17]:
text_encoder.dataset["tokens"][0]

['k', 'n', 'o', 't', 's']
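
The `input_ids` list has two more entries than `tokens` because the RoBERTa tokenizer wraps each sequence in a beginning-of-sequence (`<s>`) and an end-of-sequence (`</s>`) token. The sketch below lines the ids up with the characters and then shows, in plain Python, what padding a batch to a common length looks like. The `pad_batch` helper, the `pad_id=0` value and the second sequence are hypothetical and only serve as illustration.

```python
input_ids = [53, 11, 14, 15, 20, 19, 54]  # ids shown above for "knots"
tokens = ["k", "n", "o", "t", "s"]

# the first and last ids are the special tokens; the rest align with the characters
bos_id, *char_ids, eos_id = input_ids
aligned = list(zip(tokens, char_ids))
print(aligned)


def pad_batch(batch, pad_id):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    masks = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return padded, masks


# pad_id=0 and the second sequence are hypothetical values for illustration
padded, masks = pad_batch([input_ids, [53, 6, 1, 20, 54]], pad_id=0)
print(padded[1], masks[1])
```

The attention mask marks padded positions with 0 so that the model can ignore them.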

We can also see that we have saved the tokenized text in the `'tokens'` column of the dataframe stored in `.df`:

In [18]:
text_encoder.df

Unnamed: 0,word,language,tokens
0,knots,en,"[k, n, o, t, s]"
1,stalemating,en,"[s, t, a, l, e, m, a, t, i, n, g]"
2,whoops,en,"[w, h, o, o, p, s]"
3,implantation,en,"[i, m, p, l, a, n, t, a, t, i, o, n]"
4,levers,en,"[l, e, v, e, r, s]"
...,...,...,...
70610,forcefulness,en,"[f, o, r, c, e, f, u, l, n, e, s, s]"
70611,fat,en,"[f, a, t]"
70612,creakier,en,"[c, r, e, a, k, i, e, r]"
70613,ramming,en,"[r, a, m, m, i, n, g]"


We also store the tokens in the `.tokens` attribute.

In [19]:
text_encoder.tokens

Dataset({
    features: ['input_ids', 'attention_mask', 'special_tokens_mask'],
    num_rows: 70615
})

After applying the [`nlpsig.TextEncoder.tokenize_text`](https://nlpsig.readthedocs.io/en/latest/encode_text.html#nlpsig.encode_text.TextEncoder.tokenize_text) method, we store a tokenized dataframe in the `.tokenized_df` attribute. It has one row per token in our corpus, along with the associated `'text_id'` (the index of the corresponding text in the original dataframe that we passed in):

In [20]:
text_encoder.tokenized_df

Unnamed: 0,text_id,language,tokens
0,0,en,k
1,0,en,n
2,0,en,o
3,0,en,t
4,0,en,s
...,...,...,...
601944,70614,en,m
601945,70614,en,i
601946,70614,en,l
601947,70614,en,e


So if we look at `text_id == 0`, we recover the characters of the first word:

In [21]:
text_encoder.tokenized_df[text_encoder.tokenized_df["text_id"]==0]

Unnamed: 0,text_id,language,tokens
0,0,en,k
1,0,en,n
2,0,en,o
3,0,en,t
4,0,en,s
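
The relationship between the word-level dataframe and this token-level view can be sketched in plain Python. This only mimics the layout of `.tokenized_df` and is not the `nlpsig` implementation; the two-word list and the resulting `text_id` values are illustrative.

```python
words = ["knots", "fat"]  # two words that appear in the dataframe above

# one row per token, keeping the index of the originating word as text_id
rows = [
    {"text_id": text_id, "tokens": char}
    for text_id, word in enumerate(words)
    for char in word
]

# selecting text_id == 0 recovers the characters of the first word
first_word_tokens = [row["tokens"] for row in rows if row["text_id"] == 0]
print(first_word_tokens)
```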


## Training the model

At this point, any embeddings obtained from the model would not be useful for a downstream task, as the model has randomly initialised weights and has not yet been trained on the text. To train it, we use other methods of the [`nlpsig.TextEncoder`](https://nlpsig.readthedocs.io/en/latest/encode_text.html#nlpsig.encode_text.TextEncoder) class, which rely on the [Huggingface trainer API](https://huggingface.co/docs/transformers/main_classes/trainer).

Note that if you're re-running this notebook after pre-training the model previously, you can skip this section.

Otherwise, we first need to set up a data collator for training. We train the model on the masked language modelling task, so we use the `DataCollatorForLanguageModeling` class, which randomly selects tokens for masking with a given probability (`mlm_probability`).

In [22]:
# set up data_collator for language modelling (has dynamic padding)
data_collator_for_LM = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)
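
To make the masking step concrete, the sketch below imitates what we understand to be the default behaviour of `DataCollatorForLanguageModeling`: each non-special token is selected with probability `mlm_probability`; of the selected tokens, 80% are replaced by the mask token, 10% by a random token and 10% are left unchanged. Unselected positions get the label `-100` so that the loss ignores them. The function name and the token ids are hypothetical, and this is a simplified illustration rather than the library code.

```python
import random


def mask_for_mlm(input_ids, special_ids, mask_id, vocab_ids, mlm_probability=0.15, rng=None):
    """Return (corrupted_ids, labels) following the BERT-style 80/10/10 scheme."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for token_id in input_ids:
        # special tokens are never selected; unselected positions get label -100
        if token_id in special_ids or rng.random() >= mlm_probability:
            corrupted.append(token_id)
            labels.append(-100)
            continue
        labels.append(token_id)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(mask_id)  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab_ids))  # 10%: replace with a random token
        else:
            corrupted.append(token_id)  # 10%: keep the original token
    return corrupted, labels


# hypothetical ids: 53/54 as sequence delimiters, 55 as the mask token
ids = [53, 11, 14, 15, 20, 19, 54]
corrupted, labels = mask_for_mlm(
    ids, special_ids={53, 54}, mask_id=55, vocab_ids=list(range(1, 27)),
    mlm_probability=0.15, rng=random.Random(0),
)
print(corrupted, labels)
```

Because the selection is random, a short word may end up with no masked character at all in a given batch; over many epochs each position is nevertheless masked many times.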

To train our model, we split the dataset into train, validation and test sets with the [`nlpsig.TextEncoder.split_dataset`](https://nlpsig.readthedocs.io/en/latest/encode_text.html#nlpsig.encode_text.TextEncoder.split_dataset) method. This stores the split `Dataset` objects in the `.dataset_split` attribute.

In [23]:
text_encoder.split_dataset(random_state=seed)

[INFO] Splitting up dataset into train / validation / test sets, and saving to `.dataset_split`.


DatasetDict({
    train: Dataset({
        features: ['word', 'language', 'input_ids', 'attention_mask', 'special_tokens_mask', 'tokens'],
        num_rows: 37849
    })
    test: Dataset({
        features: ['word', 'language', 'input_ids', 'attention_mask', 'special_tokens_mask', 'tokens'],
        num_rows: 14123
    })
    validation: Dataset({
        features: ['word', 'language', 'input_ids', 'attention_mask', 'special_tokens_mask', 'tokens'],
        num_rows: 18643
    })
})
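
In the output above, the test set holds 14,123 of the 70,615 rows, i.e. exactly 20%. As a rough illustration of a seeded three-way split, here is a plain-Python sketch. It is not the `split_dataset` implementation: the function name, the use of `round` and the fractions passed in are assumptions made for the example.

```python
import random


def three_way_split(items, test_size, validation_size, seed):
    """Shuffle once, then carve off a test set and a validation set; the rest is train."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    n_validation = round((len(shuffled) - n_test) * validation_size)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return {"train": train, "validation": validation, "test": test}


# fractions here are illustrative, not necessarily the defaults of `split_dataset`
splits = three_way_split(range(1000), test_size=0.2, validation_size=0.25, seed=2023)
print({name: len(part) for name, part in splits.items()})
```

Fixing the seed makes the split reproducible, which is why `random_state=seed` is passed in the cell above.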

In [24]:
type(text_encoder.dataset_split)

datasets.dataset_dict.DatasetDict

In [25]:
text_encoder.dataset_split.push_to_hub("rchan26/english_char_split")

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/38 [00:00<?, ?ba/s]

Deleting unused files from dataset repository:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/15 [00:00<?, ?ba/s]

Deleting unused files from dataset repository:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/19 [00:00<?, ?ba/s]

Deleting unused files from dataset repository:   0%|          | 0/1 [00:00<?, ?it/s]

We can set up the trainer's arguments with [`nlpsig.TextEncoder.set_up_training_args`](https://nlpsig.readthedocs.io/en/latest/encode_text.html#nlpsig.encode_text.TextEncoder.set_up_training_args) which sets up a `TrainingArguments` object (from the `transformers` package) and stores it in the `.training_args` attribute of the `text_encoder` object:

In [26]:
text_encoder.set_up_training_args(
    output_dir=model_name,
    num_train_epochs=600,
    per_device_train_batch_size=128,
    disable_tqdm=False,
    save_strategy="steps",
    save_steps=10000,
    seed=seed
)

[INFO] Setting up TrainingArguments object and saving to `.training_args`.


TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_mod

In [27]:
type(text_encoder.training_args)

transformers.training_args.TrainingArguments
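
As a quick sanity check on these settings, we can work out how many optimisation steps the run involves and how often a checkpoint is written. The arithmetic below assumes a single device, `gradient_accumulation_steps=1` and `dataloader_drop_last=False`, as printed in the (truncated) output above.

```python
import math

train_rows = 37849   # size of the train split shown earlier
batch_size = 128     # per_device_train_batch_size
num_epochs = 600     # num_train_epochs
save_steps = 10000

# with one device, no gradient accumulation and no dropped last batch,
# each epoch takes ceil(rows / batch_size) optimisation steps
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
num_checkpoints = total_steps // save_steps
epochs_between_checkpoints = save_steps / steps_per_epoch

print(steps_per_epoch, total_steps, num_checkpoints, round(epochs_between_checkpoints, 1))
```

So under these assumptions a checkpoint is saved roughly every 34 epochs.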

And lastly, we set up a `Trainer` object (from the `transformers` package) and store it in the `.trainer` attribute in the `text_encoder` object:

In [28]:
text_encoder.set_up_trainer(data_collator=data_collator_for_LM)

[INFO] Setting up Trainer object, and saving to `.trainer`.


<transformers.trainer.Trainer at 0x2ac165090>

In [29]:
type(text_encoder.trainer)

transformers.trainer.Trainer

Once everything is set up, we train our model by calling the `.fit_transformer_with_trainer_api()` method.

In [30]:
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device count: {torch.cuda.device_count()}")

CUDA available: False
CUDA device count: 0


In [31]:
print(f"MPS available: {torch.backends.mps.is_available()}")

MPS available: True


In [32]:
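# prefer MPS, then CUDA, and fall back to the CPU if neither is available
device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")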
print(f"Using device {device}")

Using device mps


Fitting the model using the Huggingface Trainer API:

In [33]:
# only report errors to avoid excessive logging (40 corresponds to logging.ERROR)
transformers.utils.logging.set_verbosity(40)

# fit the model
text_encoder.fit_transformer_with_trainer_api()

[INFO] Training model with 43205433 parameters...


Epoch,Training Loss,Validation Loss
1,No log,0.202479
2,0.172200,0.183065
3,0.172200,0.169292
4,0.140800,0.16345
5,0.140800,0.160615
6,0.131700,0.154655
7,0.125800,0.155584
8,0.125800,0.151262
9,0.122100,0.147288
10,0.122100,0.148563


[INFO] Training completed!
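
One plausible reading of the "No log" entry and the repeated training-loss values in the table is that the training loss is only logged every `logging_steps` optimisation steps, while validation runs at the end of every epoch. The sketch below assumes `logging_steps=500`, the `TrainingArguments` default (the value is not visible in the truncated output above), and 296 steps per epoch (37,849 training rows in batches of 128).

```python
import math

steps_per_epoch = math.ceil(37849 / 128)  # from the train split size and batch size
logging_steps = 500                       # assumed: the `TrainingArguments` default

# for each epoch, find the most recent step at which the training loss was logged
last_logged = []
for epoch in range(1, 11):
    steps_done = epoch * steps_per_epoch
    logs_so_far = steps_done // logging_steps
    last_logged.append(logs_so_far * logging_steps if logs_so_far else None)

print(last_logged)
```

Under this assumption, epochs that share the same most recent logging step show the same training loss, which matches the pattern in the table (epochs 2 and 3, 4 and 5, 7 and 8, 9 and 10).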


## Evaluating the trained model

We now evaluate how well the model predicts masked letters on the test dataset. For each word in the test dataset, we mask each letter in turn and ask the model to predict it. So for a 5-letter word, we have 5 predictions to make, one for each letter given the other letters.

For our tokenizer, we see that `<mask>` is used as the mask token:

In [42]:
text_encoder.tokenizer.special_tokens_map

{'bos_token': '<s>',
 'eos_token': '</s>',
 'unk_token': '<unk>',
 'sep_token': '</s>',
 'pad_token': '<pad>',
 'cls_token': '<s>',
 'mask_token': '<mask>'}

In [34]:
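def compute_masked_character_accuracy(fill_mask, words):
    """Mask each character of each word in turn and check the top prediction."""
    was_correct = []
    print(f"Evaluating with {len(words)} words")
    for word in tqdm(words):
        # one masked copy of the word per character position
        masked_strings = [word[:i] + '<mask>' + word[i+1:] for i in range(len(word))]
        # take the highest-scoring completion for each masked string
        predictions = [fill_mask(masked)[0]['sequence'] for masked in masked_strings]
        was_correct += [pred == word for pred in predictions]

    # fraction of correct predictions over all masked characters
    acc = np.sum(was_correct) / len(was_correct)
    print(f"Accuracy: {acc}")
    return acc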
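
To make the masking scheme used in `compute_masked_character_accuracy` concrete, the small helper below (a hypothetical name, written only for illustration) lists the masked strings generated for a single word:

```python
def mask_each_character(word, mask_token="<mask>"):
    """Return one masked copy of `word` per character position."""
    return [word[:i] + mask_token + word[i + 1:] for i in range(len(word))]


masked = mask_each_character("knots")
print(masked)
```

A prediction counts as correct only if filling in the mask reproduces the original word exactly. Since every character position contributes one prediction, longer words carry more weight in the overall accuracy.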

In [37]:
fill_mask = pipeline(
    "fill-mask",
    model=text_encoder.model,
    tokenizer=text_encoder.tokenizer,
    device=device,
)

compute_masked_character_accuracy(fill_mask, text_encoder.dataset_split["test"]["word"])

Evaluating with 14123 words


100%|███████████████████████████████████████████████████████████████████████| 14123/14123 [20:30<00:00, 11.48it/s]

Accuracy: 0.7835766362745586





0.7835766362745586

We achieve about 78% accuracy on the test set of words!

## Saving our model

In [38]:
text_encoder.trainer.save_model(model_name)

Uploading our trained model to the Huggingface model hub:

In [40]:
text_encoder.trainer.push_to_hub()

training_args.bin:   0%|          | 0.00/4.47k [00:00<?, ?B/s]

'https://huggingface.co/rchan26/english-char-roberta/tree/main/'

## Acknowledgements

The computations described in this notebook were developed using the Baskerville Tier 2 HPC service (https://www.baskerville.ac.uk/). Baskerville was funded by the EPSRC and UKRI through the World Class Labs scheme (EP/T022221/1) and the Digital Research Infrastructure programme (EP/W032244/1) and is operated by Advanced Research Computing at the University of Birmingham.

## References

[1] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V., 2019. RoBERTa: A robustly optimized BERT pretraining approach. _arXiv preprint arXiv:1907.11692_.