In [None]:
import os
import pandas as pd

from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
)
import transformers
import tokenizers
import torch

import nlpsig

from load_data import data_folder, seed, corpus_df, english_train

In [None]:
ALPHABET_FILE = f"{data_folder}/alphabet.txt"
with open(ALPHABET_FILE) as f:
    alphabet = f.read().splitlines()
print(alphabet)

## Set up Tokenizer for word corpora

If we were to fine-tune an existing pretrained transformer, we could use the same tokenizer that the model was pretrained with. However, in this notebook, we will train a Transformer from scratch, so a tokenizer that was pretrained on a corpus quite different from ours would be suboptimal. In this example, we want to tokenize our words into characters, so we need to train a _character-based_ tokenizer.

Here, we need to use the [`tokenizers`](https://huggingface.co/docs/tokenizers/index) library to set up and train a new tokenizer for our text.
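
To make the idea of character-level tokenization concrete, here is a minimal pure-Python sketch. It is not how the `tokenizers` library works internally (for example, it ignores end-of-word markers and learned merges); `build_vocab` and `encode` are hypothetical helpers, and the id assignment is an assumption made purely for illustration.

```python
# Special tokens, in the same order as they are passed to the tokenizer in this notebook.
SPECIAL_TOKENS = ["<s>", "</s>", "<unk>", "<pad>", "<mask>"]


def build_vocab(alphabet):
    """Map special tokens, then each distinct character, to consecutive integer ids."""
    tokens = SPECIAL_TOKENS + sorted(set(alphabet))
    return {token: idx for idx, token in enumerate(tokens)}


def encode(word, vocab):
    """Split a word into characters and wrap the ids in <s> ... </s>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(char, unk) for char in word]
    return [vocab["<s>"]] + ids + [vocab["</s>"]]


toy_vocab = build_vocab("abcdefghijklmnopqrstuvwxyz")
toy_ids = encode("cat", toy_vocab)
print(toy_ids)
```

Characters that are not in the vocabulary fall back to the `<unk>` id, which is why the alphabet file should cover every character that appears in the corpus.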

In [None]:
# initialise character based tokenizer
tokenizer = tokenizers.CharBPETokenizer()
tokenizer.train(files=[ALPHABET_FILE],
                show_progress=False,
                special_tokens=['<s>', '</s>', '<unk>', '<pad>', '<mask>'])

# save the tokenizer to "char-bert/" folder (create it if it does not exist)
os.makedirs("char-bert", exist_ok=True)

tokenizer.save_model("char-bert")

## Training a language model

We want to train a masked language model for our corpus of English words. In particular, we mask out particular letters and ask our model to try to predict the masked letter.

Here, we load our tokenizer (which tokenizes by character), set up a data collator (with padding), and set up our transformer model by specifying its config. We use the [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) architecture described in [[2]](https://arxiv.org/abs/1907.11692).

In [None]:
corpus_df["word"].apply(len).max()

As the longest word in our corpus has 39 characters, we will set the maximum sequence length in the transformer to 50 for a bit of headroom.

In [None]:
# set the maximum sequence length (longest word is 39 characters, plus headroom)
max_length = 50

# set dimension of hidden states for Transformer
hidden_size = 768

# load in tokenizer for architecture
tokenizer = RobertaTokenizerFast.from_pretrained('char-bert/', model_max_length=max_length)

# set up data_collator to use (initially just one that adds padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# initialise transformer architecture (random weights)
config_args = {
    "vocab_size": tokenizer.backend_tokenizer.get_vocab_size(),
    "hidden_size": hidden_size,
    "max_length": max_length,
    "max_position_embeddings": max_length + 2,
    "intermediate_size": 4*hidden_size,
    "hidden_dropout_prob": 0.1,
    "num_attention_heads": 12,
    "num_hidden_layers": 6,
    "type_vocab_size": 1
}

config = RobertaConfig(**config_args)
model = RobertaForMaskedLM(config=config)
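
The `+ 2` in `max_position_embeddings` is easy to miss. As far as we understand the RoBERTa implementation in `transformers`, position indices for real tokens start at `padding_idx + 1`, while padding tokens keep position `padding_idx`. The sketch below mimics that numbering in pure Python, assuming `padding_idx = 1` (the `RobertaConfig` default); `roberta_position_ids` is a hypothetical helper, not a library function.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Number non-padding tokens from padding_idx + 1; padding keeps padding_idx."""
    position_ids = []
    seen = 0  # how many non-padding tokens we have passed so far
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


# A sequence of 50 non-padding tokens uses positions 2, ..., 51,
# so the position embedding table needs 52 = 50 + 2 rows.
full_length_positions = roberta_position_ids([5] * 50)
print(max(full_length_positions) + 1)
```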

In [None]:
model_name = "english-char-bert"

If you have already run this notebook and trained the transformer previously, you can load the pretrained transformer using the line below: just uncomment it to load the model weights.

In [None]:
# model = RobertaForMaskedLM.from_pretrained(model_name)

## Using the `TextEncoder` class

The `TextEncoder` class in the `nlpsig` package is able to take a dataframe with a column of text. We can use this class to obtain embeddings for the input text, or to train the model with the input text.

In this example, we will first use the class to train our transformer model with the corpus of English words, which we have stored in the `english_train` dataframe:

In [None]:
english_train.head()

To initialise the object, we pass in the dataframe, `english_train`, and the column name that stores our text, `"word"` in this case. We pass in our model, config, tokenizer and data collator which are necessary to train our model.

Note that if we are not training a model, we could instead pass a string to the `model_name` argument, specifying either a model in the [Huggingface model hub](https://huggingface.co/models), e.g. [`"bert-base-uncased"`](https://huggingface.co/bert-base-uncased), or a path where a model is stored, e.g. `"char-bert_trained"`. We can then load our pretrained model using the `.load_pretrained_model()` method. We will see this later on when we use this class again to obtain embeddings for the words in `corpus_sample_df`.

In [None]:
text_encoder = nlpsig.TextEncoder(
    df=english_train,
    feature_name="word",
    model=model,
    config=config,
    tokenizer=tokenizer,
    data_collator=data_collator
)

We can tokenize the text with the `.tokenize_text()` method, which tokenizes each of the sentences in the column of the dataframe that we passed in (note here that we just have words and we are tokenizing on the characters). So in the above, we tokenize each string in the `word` column of the `english_train` dataframe.

In [None]:
text_encoder.tokenize_text()

Note that the `text_encoder` object (an instance of `TextEncoder`) also keeps the data as a [Huggingface `Dataset`](https://huggingface.co/docs/datasets/index) object, which is stored in the `.dataset` attribute of the object:

In [None]:
text_encoder.dataset

Note that when initialising the `TextEncoder` object, we could have optionally passed in the data as a `Dataset` object using the `dataset` argument. So if the dataset that you want to use is already in that form, there is no need to first convert it to a dataframe before using the class.

We can see that we have tokenized this as there are `input_ids`, `attention_mask`, `special_tokens_mask`, and `tokens` features in the `Dataset` object.
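
To make the meaning of these features concrete, the following pure-Python sketch pads a toy batch to its longest sequence and builds the corresponding masks. It only illustrates the idea behind padding and masking; `pad_batch` and `special_tokens_mask` are hypothetical helpers, and the token ids are made up for the example.

```python
def pad_batch(batch, pad_id):
    """Pad every sequence to the longest one and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def special_tokens_mask(ids, special_ids):
    """Flag special tokens (e.g. <s>, </s>, <pad>) with 1 and ordinary tokens with 0."""
    return [1 if token_id in special_ids else 0 for token_id in ids]


# Two toy "words" of different lengths; ids 0-4 are assumed to be special tokens.
toy_batch = pad_batch([[0, 7, 5, 24, 1], [0, 5, 1]], pad_id=3)
toy_special = special_tokens_mask(toy_batch["input_ids"][1], special_ids={0, 1, 2, 3, 4})
print(toy_batch)
print(toy_special)
```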

Let's have a look at the first word in this dataset:

In [None]:
text_encoder.dataset["word"][0]

In [None]:
text_encoder.dataset["input_ids"][0]

We can see that this word has been tokenized by character:

In [None]:
text_encoder.dataset["tokens"][0]

We can also see that we have saved the tokenized text in the `'token'` column of the dataframe stored in `.df`:

In [None]:
text_encoder.df

We also store the tokens in the `.tokens` attribute.

In [None]:
text_encoder.tokens

After applying the `.tokenize_text()` method, we store a tokenized dataframe in the `.tokenized_df` attribute. Here, we have each token in our corpus and its associated `'text_id'` (which is just the index it was given in the original dataframe that we passed in):
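
As a rough illustration of this "one row per token" layout, the sketch below flattens a list of token lists into rows that carry the index of the text they came from. This is not the `nlpsig` implementation; `to_long_format` is a hypothetical helper, and the `"tokens"` key is an assumed column name.

```python
def to_long_format(token_lists):
    """Flatten per-text token lists into one row per token, tagged with its text_id."""
    rows = []
    for text_id, tokens in enumerate(token_lists):
        for token in tokens:
            rows.append({"text_id": text_id, "tokens": token})
    return rows


toy_rows = to_long_format([["c", "a", "t"], ["o", "x"]])
for row in toy_rows:
    print(row)
```

Filtering the rows on `text_id` then recovers the tokens of a single word.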

In [None]:
text_encoder.tokenized_df

So if we looked at `text_id==0`:

In [None]:
text_encoder.tokenized_df[text_encoder.tokenized_df["text_id"]==0]

## Training the model

Embeddings obtained from the model at this point would not be good for any downstream task, as the model itself has not yet been trained on the text. To train it, we will use other methods in the `TextEncoder` class, which allow us to do this using the [Huggingface trainer API](https://huggingface.co/docs/transformers/main_classes/trainer).

Note that if you're re-running this notebook after pre-training the model previously, you can skip this section.

Otherwise, to train the model, we need to set up a data collator for training our model. We train the model on the masked language modelling task and so use the `DataCollatorForLanguageModeling` class which masks tokens with a certain probability.
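
To give a feel for what masking "with a certain probability" means, here is a pure-Python sketch of the usual BERT-style recipe: each non-special token is selected with probability `mlm_probability`, and a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. We believe this mirrors what `DataCollatorForLanguageModeling` does, but the code is only an illustrative sketch; `mask_tokens` and all ids used here are hypothetical.

```python
import random


def mask_tokens(input_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels); labels are -100 wherever no prediction is required."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 is the label value ignored by the loss
    for pos, token_id in enumerate(input_ids):
        # never mask special tokens, and skip tokens that are not selected
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[pos] = token_id  # the model must recover the original token here
        draw = rng.random()
        if draw < 0.8:
            masked[pos] = mask_id  # 80%: replace with the mask token
        elif draw < 0.9:
            masked[pos] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


toy_word = [0, 7, 5, 24, 9, 12, 1]  # made-up ids, with <s> = 0 and </s> = 1
toy_masked, toy_labels = mask_tokens(
    toy_word, special_ids={0, 1, 2, 3, 4}, mask_id=4, vocab_size=31,
    rng=random.Random(0),
)
print(toy_masked)
print(toy_labels)
```

Because the masking is redrawn every time a batch is built, the model sees different masked positions for the same word across epochs.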

In [None]:
# set up data_collator for language modelling (has dynamic padding)
data_collator_for_LM = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

To train our model, we will split the dataset into train, validation and test sets with the `.split_dataset()` method. This stores the split `Dataset` objects in the `.dataset_split` attribute.
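
Conceptually, a seeded split just shuffles the row indices reproducibly and cuts them into three parts. The sketch below shows this in pure Python; it is not the `nlpsig` implementation, `split_indices` is a hypothetical helper, and the 80/10/10 proportions are assumptions chosen for illustration only.

```python
import random


def split_indices(n_items, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle indices with a fixed seed and cut them into train/validation/test."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # same seed gives the same split
    n_train = int(n_items * train_frac)
    n_valid = int(n_items * valid_frac)
    return {
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_valid],
        "test": indices[n_train + n_valid:],
    }


toy_split = split_indices(100, seed=42)
print({name: len(idx) for name, idx in toy_split.items()})
```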

In [None]:
text_encoder.split_dataset(random_state=seed)

In [None]:
type(text_encoder.dataset_split)

We can set up the trainer's arguments with `.set_up_training_args()` which sets up a `TrainingArguments` object (from the `transformers` package) and stores it in the `.training_args` attribute of the `text_encoder` object:

In [None]:
text_encoder.set_up_training_args(
    output_dir=model_name,
    num_train_epochs=600,
    per_device_train_batch_size=128,
    disable_tqdm=False,
    save_strategy="steps",
    save_steps=10000,
    seed=seed
)
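
These settings determine how many optimisation steps are run and how many checkpoints are written. The sketch below works through the arithmetic, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `training_schedule` is a hypothetical helper and the training-set size of 70,000 is a made-up number, not the size of our corpus.

```python
import math


def training_schedule(n_train, batch_size=128, epochs=600, save_steps=10000):
    """Return (steps per epoch, total steps, number of checkpoints saved by step count)."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    n_checkpoints = total_steps // save_steps
    return steps_per_epoch, total_steps, n_checkpoints


# Hypothetical training set of 70,000 words.
toy_schedule = training_schedule(70_000)
print(toy_schedule)
```

If checkpoints take up too much disk space, increasing `save_steps` reduces how many are written.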

In [None]:
type(text_encoder.training_args)

And lastly, we set up a `Trainer` object (from the `transformers` package) and store it in the `.trainer` attribute in the `text_encoder` object:

In [None]:
text_encoder.set_up_trainer(data_collator=data_collator_for_LM)

In [None]:
type(text_encoder.trainer)

Once everything is set up, we train our model by calling the `.fit_transformer_with_trainer_api()` method.

In [None]:
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device count: {torch.cuda.device_count()}")

In [None]:
print(f"MPS available: {torch.backends.mps.is_available()}")

In [None]:
# prefer MPS, then CUDA, and fall back to CPU if neither is available
device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device {device}")

In [None]:
# set to only report errors (40 = ERROR level) to avoid excessive logging
transformers.utils.logging.set_verbosity(40)

# fit the model
text_encoder.fit_transformer_with_trainer_api()

Saving our model:

In [None]:
text_encoder.trainer.save_model(model_name)

Uploading our trained model to the Huggingface model hub:

In [None]:
text_encoder.trainer.push_to_hub()