In [1]:
import os
import pandas as pd

from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
)
import transformers
import tokenizers
import torch

import nlpsig

from load_data import data_folder, seed, corpus_df, english_train

loading in english_train from data/english_train.pkl
loading in corpus_sample_df from data/corpus_sample.pkl


In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [2]:
ALPHABET_FILE = f"{data_folder}/alphabet.txt"
with open(ALPHABET_FILE) as f:
    alphabet = f.read().splitlines()
print(alphabet)

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


## Set up Tokenizer for word corpora

If we were to fine-tune an existing pretrained transformer, we could use the same tokenizer that the model was pretrained with. However, in this notebook, we will train a Transformer from scratch, and so using a tokenizer that was pretrained on a corpus that looks quite different to ours is suboptimal. In this example, we want to tokenize our words into characters, and so we need to train a _character-based_ tokenizer that is able to do this.

Here, we need to use the [`tokenizers`](https://huggingface.co/docs/tokenizers/index) library to set up and train a new tokenizer for our text.

In [3]:
# initialise character based tokenizer
tokenizer = tokenizers.CharBPETokenizer()
tokenizer.train(files=[ALPHABET_FILE],
                show_progress=False)

# save the tokenizer to "char-roberta/" folder (created if it does not exist)
os.makedirs("char-roberta", exist_ok=True)

tokenizer.save_model("char-roberta")

['char-roberta/vocab.json', 'char-roberta/merges.txt']

## Training a language model

We want to train a masked language model for our corpus of English words. In particular, we mask out particular letters and ask our model to try to predict the masked letter.

Here, we initialise our tokenizer (which tokenizes by character) and a data collator (with padding), and set up our transformer model by specifying its config. We use the [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) architecture described in [[2]](https://arxiv.org/abs/1907.11692).

In [4]:
corpus_df["word"].apply(len).max()

39

As the longest word in our corpus has 39 characters, we will set the maximum sequence length in the transformer to 50 for a bit of headroom.

In [5]:
# set the maximum sequence length (the longest word has 39 characters, plus headroom)
max_length = 50

# set dimension of hidden states for Transformer
hidden_size = 768

# load in tokenizer for architecture
vocab = "char-roberta/vocab.json"
merges = "char-roberta/merges.txt"
tokenizer = RobertaTokenizerFast(vocab, merges, max_len=max_length)

# set up data_collator to use (initially just one that adds padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# initialise transformer architecture (random weights)
config_args = {
    "vocab_size": tokenizer.backend_tokenizer.get_vocab_size(),
    "hidden_size": hidden_size,
    "max_length": max_length,
    "max_position_embeddings": max_length + 2,
    "intermediate_size": 4*hidden_size,
    "hidden_dropout_prob": 0.1,
    "num_attention_heads": 12,
    "num_hidden_layers": 6,
    "type_vocab_size": 1
}

config = RobertaConfig(**config_args)
model = RobertaForMaskedLM(config=config)
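
The `max_position_embeddings` value of `max_length + 2` follows the RoBERTa convention in `transformers`, where position ids start counting after the padding index rather than at zero. The sketch below reimplements that numbering in plain Python to show why two extra rows are needed. It assumes the default `padding_idx` of 1 from `RobertaConfig`, and the helper name `roberta_style_position_ids` is made up for this illustration.

```python
def roberta_style_position_ids(input_ids, padding_idx=1):
    """Number non-padding tokens from padding_idx + 1; padding keeps padding_idx."""
    position_ids = []
    seen = 0
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


max_length = 50
# a full-length sequence of 50 non-padding ids (the value 5 is an arbitrary placeholder)
full_sequence = [5] * max_length
largest_position = max(roberta_style_position_ids(full_sequence))
# the embedding table is indexed from 0, so it needs largest_position + 1 rows
rows_needed = largest_position + 1
print(largest_position, rows_needed)
```

With a full-length input, the largest position id is `max_length + 1`, so the position embedding table needs `max_length + 2` rows.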

In [6]:
tokenizer.tokenize("signatures")

['s', 'i', 'g', 'n', 'a', 't', 'u', 'r', 'e', 's']

In [7]:
model_name = "english-char-roberta"

If you have already run this notebook and trained the transformer previously, you can load in the pretrained transformer using the line below - just uncomment it to load in the model weights.

In [8]:
# model = RobertaForMaskedLM.from_pretrained(model_name)

## Using the `TextEncoder` class

The `TextEncoder` class in the `nlpsig` package is able to take a dataframe with a column of text. We can use this class to obtain embeddings for the input text, or to train the model with the input text.

In this example, we will first use the class to train our transformer model with the corpus of English words, which we have stored in the `english_train` dataframe:

In [9]:
english_train.head()

Unnamed: 0,word,language
0,confounds,en
1,raglan,en
2,commit,en
3,reattains,en
4,curviest,en


To initialise the object, we pass in the dataframe, `english_train`, and the column name that stores our text, `"word"` in this case. We pass in our model, config, tokenizer and data collator which are necessary to train our model.

We note that in the case where we are not training a model, we could instead just pass a string to the `model_name` argument, either naming a model in the [Huggingface model hub](https://huggingface.co/models), e.g. [`"bert-base-uncased"`](https://huggingface.co/bert-base-uncased), or giving the path to a folder in which a model is stored, e.g. `"char-roberta_trained"`. We can then load in our pretrained model using the `.load_pretrained_model()` method - we will see this later in [alphabet_analysis.ipynb](alphabet_analysis.ipynb) where we load this model.

In [10]:
text_encoder = nlpsig.TextEncoder(
    df=english_train,
    feature_name="word",
    model=model,
    config=config,
    tokenizer=tokenizer,
    data_collator=data_collator
)

We can tokenize the text with the `.tokenize_text()` method, which tokenizes each of the sentences in the column of the dataframe that we passed in (note here that we just have words and we are tokenizing on the characters). So in the above, we tokenize each string in the `word` column of the `english_train` dataframe.

In [11]:
text_encoder.tokenize_text()

[INFO] Setting return_special_tokens_mask=True
[INFO] Tokenizing the dataset...


[INFO] Saving the tokenized text for each sentence into `.df['tokens']`...


[INFO] Creating tokenized dataframe and setting in `.tokenized_df` attribute...
[INFO] Note: 'text_id' is the column name for denoting the corresponding text id


Dataset({
    features: ['word', 'language', 'input_ids', 'attention_mask', 'special_tokens_mask', 'tokens'],
    num_rows: 70641
})

Note that the `text_encoder` object (an instance of `TextEncoder`) also keeps the data as a [Huggingface `Dataset`](https://huggingface.co/docs/datasets/index) object, which is stored in the `.dataset` attribute of the object:

In [12]:
text_encoder.dataset

Dataset({
    features: ['word', 'language', 'input_ids', 'attention_mask', 'special_tokens_mask', 'tokens'],
    num_rows: 70641
})

Note that when initialising the `TextEncoder` object, we could have optionally passed in the data as a `Dataset` object using the `dataset` argument. So if the dataset that you want to use is already in that form, there is no need to first convert it to a dataframe before using the class.

We can see that we have tokenized this as there are `input_ids`, `attention_mask`, `special_tokens_mask`, and `tokens` features in the `Dataset` object.
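
As a rough illustration of what these features encode, the sketch below builds an `attention_mask` and a `special_tokens_mask` by hand for two short sequences that are padded to a common length. All ids here (`BOS_ID`, `EOS_ID`, `PAD_ID` and the character ids) are hypothetical placeholders rather than values read from our tokenizer.

```python
BOS_ID, EOS_ID, PAD_ID = 0, 1, 2  # hypothetical special-token ids


def encode(char_ids):
    """Wrap character ids with start and end tokens."""
    return [BOS_ID] + char_ids + [EOS_ID]


def pad_batch(sequences):
    """Pad every sequence to the longest one and build the two masks."""
    longest = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "special_tokens_mask": []}
    for seq in sequences:
        n_pad = longest - len(seq)
        padded = seq + [PAD_ID] * n_pad
        batch["input_ids"].append(padded)
        # 1 for real tokens, 0 for padding
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # 1 for special tokens (start, end, padding), 0 for characters
        batch["special_tokens_mask"].append(
            [1 if tok in (BOS_ID, EOS_ID, PAD_ID) else 0 for tok in padded]
        )
    return batch


batch = pad_batch([encode([10, 11, 12, 13]), encode([10, 11])])
for name, rows in batch.items():
    print(name, rows)
```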

Let's have a look at the first word in this dataset:

In [13]:
text_encoder.dataset["word"][0]

'confounds'

In [14]:
text_encoder.dataset["input_ids"][0]

[0, 7, 19, 18, 10, 19, 25, 18, 8, 23, 1]

We can see that this word has been tokenized by character:

In [15]:
text_encoder.dataset["tokens"][0]

['c', 'o', 'n', 'f', 'o', 'u', 'n', 'd', 's']
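
To connect the two views, the sketch below pairs the `input_ids` printed above with the characters of `'confounds'`. It assumes that the first and last ids (`0` and `1`) are the start and end tokens added by the tokenizer, which is consistent with there being two more ids than characters.

```python
word = "confounds"
input_ids = [0, 7, 19, 18, 10, 19, 25, 18, 8, 23, 1]  # copied from the output above

# drop the first and last ids (assumed to be the start and end tokens)
char_ids = input_ids[1:-1]
assert len(char_ids) == len(word)

# build a character -> id lookup and check that repeated letters share an id
char_to_id = {}
for char, token_id in zip(word, char_ids):
    assert char_to_id.setdefault(char, token_id) == token_id

print(char_to_id)
```

The repeated letters `o` and `n` map to the same id each time they occur, as expected for a character-level vocabulary.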

We can also see that we have saved the tokenized text in the `'tokens'` column of the dataframe stored in `.df`:

In [16]:
text_encoder.df

Unnamed: 0,word,language,tokens
0,confounds,en,"[c, o, n, f, o, u, n, d, s]"
1,raglan,en,"[r, a, g, l, a, n]"
2,commit,en,"[c, o, m, m, i, t]"
3,reattains,en,"[r, e, a, t, t, a, i, n, s]"
4,curviest,en,"[c, u, r, v, i, e, s, t]"
...,...,...,...
70636,ague,en,"[a, g, u, e]"
70637,peremptory,en,"[p, e, r, e, m, p, t, o, r, y]"
70638,trapezoid,en,"[t, r, a, p, e, z, o, i, d]"
70639,adagios,en,"[a, d, a, g, i, o, s]"


We also store the tokens in the `.tokens` attribute.

In [17]:
text_encoder.tokens

Dataset({
    features: ['input_ids', 'attention_mask', 'special_tokens_mask'],
    num_rows: 70641
})

After applying the `.tokenize_text()` method, we store a tokenized dataframe in the `.tokenized_df` attribute. Here, we have each token in our corpus and its associated `'text_id'` (which is just the index it was given in the original dataframe that we passed in):

In [18]:
text_encoder.tokenized_df

Unnamed: 0,text_id,language,tokens
0,0,en,c
1,0,en,o
2,0,en,n
3,0,en,f
4,0,en,o
...,...,...,...
601993,70640,en,g
601994,70640,en,a
601995,70640,en,b
601996,70640,en,b


So if we looked at `text_id==0`:

In [19]:
text_encoder.tokenized_df[text_encoder.tokenized_df["text_id"]==0]

Unnamed: 0,text_id,language,tokens
0,0,en,c
1,0,en,o
2,0,en,n
3,0,en,f
4,0,en,o
5,0,en,u
6,0,en,n
7,0,en,d
8,0,en,s
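
The relationship between the original rows and the tokenized dataframe can be mimicked in plain Python: each word is expanded into one row per character, tagged with the index of the word it came from. This is only a sketch of the idea using the first three words shown earlier, not the code that `nlpsig` runs internally.

```python
words = ["confounds", "raglan", "commit"]  # first rows of english_train

# one (text_id, token) row per character, where text_id is the word's index
rows = [
    {"text_id": text_id, "tokens": char}
    for text_id, word in enumerate(words)
    for char in word
]

# filtering on text_id recovers the characters of a single word
first_word = "".join(row["tokens"] for row in rows if row["text_id"] == 0)
print(len(rows), first_word)
```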


## Training the model

Any embeddings obtained from the model at this point would not be useful for a downstream task, as the model still has randomly initialised weights and has not been trained on the text. To train it, we will use other methods in the `TextEncoder` class, which allow us to do this using the [Huggingface trainer API](https://huggingface.co/docs/transformers/main_classes/trainer).

Note that if you're re-running this notebook after pre-training the model previously, you can skip this section.

Otherwise, to train the model, we need to set up a data collator for training our model. We train the model on the masked language modelling task and so use the `DataCollatorForLanguageModeling` class which masks tokens with a certain probability.

In [20]:
# set up data_collator for language modelling (has dynamic padding)
data_collator_for_LM = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)
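
For intuition about what this collator does to each batch, here is a simplified pure-Python sketch of the masking rule used for masked language modelling in `transformers`. Each non-special token is selected with probability `mlm_probability`. Of the selected tokens, roughly 80% are replaced by the mask token, 10% by a random token and 10% are left unchanged, and the label is `-100` (ignored by the loss) everywhere else. The ids `MASK_ID` and `VOCAB_IDS` and the function name `mask_tokens` are hypothetical, and the real collator works on tensors rather than lists.

```python
import random

MASK_ID = 4                      # hypothetical id of the mask token
VOCAB_IDS = list(range(5, 31))   # hypothetical ids of the 26 letters
IGNORE_INDEX = -100              # label value ignored by the loss


def mask_tokens(input_ids, special_tokens_mask, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token_id, is_special in zip(input_ids, special_tokens_mask):
        # special tokens are never selected; others are selected with mlm_probability
        if is_special or rng.random() >= mlm_probability:
            masked.append(token_id)
            labels.append(IGNORE_INDEX)
            continue
        labels.append(token_id)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK_ID)                # replace with the mask token
        elif roll < 0.9:
            masked.append(rng.choice(VOCAB_IDS))  # replace with a random token
        else:
            masked.append(token_id)               # keep the original token
    return masked, labels


input_ids = [0, 7, 19, 18, 10, 19, 25, 18, 8, 23, 1]
special_tokens_mask = [1] + [0] * 9 + [1]
masked, labels = mask_tokens(input_ids, special_tokens_mask, rng=random.Random(0))
print(masked)
print(labels)
```

Because the masking is sampled every time a batch is built, the same word is masked differently across epochs.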

To train our model, we split the dataset into train, validation and test sets with the `.split_dataset()` method. This stores the split `Dataset` objects in the `.dataset_split` attribute.

In [21]:
text_encoder.split_dataset(random_state=seed)

[INFO] Splitting up dataset into train / validation / test sets, and saving to `.dataset_split`.


DatasetDict({
    train: Dataset({
        features: ['word', 'language', 'input_ids', 'attention_mask', 'special_tokens_mask', 'tokens'],
        num_rows: 37863
    })
    test: Dataset({
        features: ['word', 'language', 'input_ids', 'attention_mask', 'special_tokens_mask', 'tokens'],
        num_rows: 14129
    })
    validation: Dataset({
        features: ['word', 'language', 'input_ids', 'attention_mask', 'special_tokens_mask', 'tokens'],
        num_rows: 18649
    })
})
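
The sizes printed above can be turned into proportions with a little arithmetic. The numbers are copied from the output. The interpretation in the comments (about 20% held out for testing, then about 33% of the remainder for validation) is inferred from these sizes and is not taken from the `nlpsig` documentation.

```python
sizes = {"train": 37863, "validation": 18649, "test": 14129}
total = sum(sizes.values())  # 70641, the number of words in english_train

for name, n_rows in sizes.items():
    print(f"{name}: {n_rows} rows ({n_rows / total:.1%} of the data)")

# the test set is about 20% of everything ...
test_fraction = sizes["test"] / total
# ... and validation is about 33% of what is left after removing the test set
validation_fraction_of_rest = sizes["validation"] / (total - sizes["test"])
print(f"{test_fraction:.3f} {validation_fraction_of_rest:.3f}")
```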

In [22]:
type(text_encoder.dataset_split)

datasets.dataset_dict.DatasetDict

In [None]:
text_encoder.dataset_split.push_to_hub("rchan26/english_char_split")

We can set up the trainer's arguments with `.set_up_training_args()` which sets up a `TrainingArguments` object (from the `transformers` package) and stores it in the `.training_args` attribute of the `text_encoder` object:

In [23]:
text_encoder.set_up_training_args(
    output_dir=model_name,
    num_train_epochs=600,
    per_device_train_batch_size=128,
    disable_tqdm=False,
    save_strategy="steps",
    save_steps=10000,
    seed=seed
)

[INFO] Setting up TrainingArguments object and saving to `.training_args`.


TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_mod

In [24]:
type(text_encoder.training_args)

transformers.training_args.TrainingArguments
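
It can help to translate these settings into a number of optimisation steps. The arithmetic below assumes a single device, no gradient accumulation and no dropped final batch, which matches `_n_gpu=1`, `gradient_accumulation_steps=1` and `dataloader_drop_last=False` in the printed arguments, and it uses the training split size of 37863.

```python
import math

train_rows = 37863      # size of the train split
batch_size = 128        # per_device_train_batch_size
num_epochs = 600        # num_train_epochs
save_steps = 10000      # save_steps

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
checkpoints = total_steps // save_steps

print(steps_per_epoch, total_steps, checkpoints)
```

Under these assumptions, a checkpoint is written roughly every 34 epochs.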

And lastly, we set up a `Trainer` object (from the `transformers` package) and store it in the `.trainer` attribute in the `text_encoder` object:

In [25]:
text_encoder.set_up_trainer(data_collator=data_collator_for_LM)

[INFO] Setting up Trainer object, and saving to `.trainer`.


<transformers.trainer.Trainer at 0x14d5a62e6290>

In [26]:
type(text_encoder.trainer)

transformers.trainer.Trainer

Once everything is set up, we train our model by calling the `.fit_transformer_with_trainer_api()` method.

In [27]:
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device count: {torch.cuda.device_count()}")

CUDA available: True
CUDA device count: 1


In [28]:
print(f"MPS available: {torch.backends.mps.is_available()}")

MPS available: False


In [29]:
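# prefer MPS, then CUDA, falling back to CPU if neither is available
fallback = "cuda" if torch.cuda.is_available() else "cpu"
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device(fallback)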
print(f"Using device {device}")

Using device cuda


Fitting the model using the Huggingface Trainer API:

In [30]:
# set to only report errors to avoid excessive logging
transformers.utils.logging.set_verbosity_error()

# fit the model
text_encoder.fit_transformer_with_trainer_api()

[INFO] Training model with 43205433 parameters...
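
The reported parameter count can be reproduced by hand from the config. The sketch below adds up the weights and biases of a RoBERTa masked language model with tied input and output embeddings and no pooling layer. The vocabulary size of 57 is an assumption: it is not printed in this notebook, but it is the value for which the total matches the reported 43205433.

```python
hidden = 768
intermediate = 4 * hidden
layers = 6
max_positions = 50 + 2   # max_position_embeddings
type_vocab = 1           # type_vocab_size
vocab_size = 57          # assumed; not shown in the notebook output


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift

# word, position and token-type embeddings, followed by a layer norm
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
# query, key, value and output projections, followed by a layer norm
attention = 4 * linear(hidden, hidden) + layer_norm
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)
# LM head: dense + layer norm + output bias (output weights are tied to the embeddings)
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + encoder + lm_head
print(total)
```

Almost all of the parameters sit in the six encoder layers. With such a small vocabulary, the embedding table contributes very little.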


Epoch,Training Loss,Validation Loss
1,No log,2.169436
2,2.333500,1.912144
3,2.333500,
4,1.872700,1.715129
5,1.872700,
6,1.728300,
7,1.646500,
8,1.646500,1.541534
9,1.586300,1.524111
10,1.586300,1.488707


[INFO] Training completed!
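
Since the reported losses are cross-entropy values in nats, exponentiating them gives a perplexity-like number that is easier to interpret: roughly how many equally likely characters the model is choosing between at a masked position. The values below are copied from the table above. Reading them this way is an interpretation added here rather than something the notebook computes.

```python
import math

# validation losses copied from the training log (epochs 1, 2 and 10)
validation_loss = {1: 2.169436, 2: 1.912144, 10: 1.488707}

perplexity = {epoch: math.exp(loss) for epoch, loss in validation_loss.items()}
for epoch, value in perplexity.items():
    print(f"epoch {epoch}: loss {validation_loss[epoch]:.3f} -> perplexity {value:.2f}")
```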


Saving our model:

In [31]:
text_encoder.trainer.save_model(model_name)

Uploading our trained model to the Huggingface model hub:

In [34]:
text_encoder.trainer.push_to_hub()

training_args.bin:   0%|          | 0.00/4.54k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/173M [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

'https://huggingface.co/rchan26/english-char-roberta/tree/main/'