<div style="text-align: center;">
        <img src="./static/pretraining_header.png" width="289px" style="height: auto;"></img>
</div>

---

In this notebook, we'll look at how we can pretrain BabyBERT!

#### 📦 Import dependencies

Let's begin by importing the dependencies we'll need for training BabyBERT.

In [1]:
from babybert.data import CollatorForMLM, LanguageModelingDataset, load_corpus
from babybert.model import BabyBERT, BabyBERTConfig, BabyBERTForMLM
from babybert.tokenizer import WordPieceTokenizer
from babybert.trainer import Trainer, TrainerConfig

#### 📖 Loading our pretrained tokenizer

Next, let's load a pretrained tokenizer. We'll use the one we trained in the previous notebook.

In [2]:
tokenizer = WordPieceTokenizer.from_pretrained("./checkpoints/toy-model")

#### 📚 Assembling our training corpus

Properly pretraining a language model requires a vast, diverse corpus of unstructured text. To keep this example small, we'll use a text file containing around 1,000 raw English sentences.

We'll also encode the corpus using our pretrained tokenizer, which converts each sentence into a list of token IDs and a matching attention mask. These token IDs and masks will serve as training examples for our model!

In [None]:
corpus = load_corpus("./data/corpus.txt")
encoded = tokenizer.batch_encode(corpus)
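To make "token IDs and attention masks" concrete, here is a minimal sketch of what a padded batch encoding might look like. It does not use BabyBERT's `WordPieceTokenizer`: the toy vocabulary, the whitespace splitting, the `toy_batch_encode` helper, and the dictionary keys `token_ids` and `attention_mask` are all assumptions made for illustration.

```python
# Toy vocabulary (hypothetical); a real WordPiece vocabulary is learned from data.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "cat": 5, "sat": 6, "dogs": 7, "bark": 8}


def toy_batch_encode(sentences, vocab=TOY_VOCAB):
    """Encode whitespace-split sentences and pad them to a common length."""
    rows = []
    for sentence in sentences:
        ids = [vocab.get(word, vocab["[UNK]"]) for word in sentence.lower().split()]
        rows.append([vocab["[CLS]"]] + ids + [vocab["[SEP]"]])

    max_len = max(len(row) for row in rows)
    token_ids, attention_mask = [], []
    for row in rows:
        padding = max_len - len(row)
        token_ids.append(row + [vocab["[PAD]"]] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(row) + [0] * padding)
    return {"token_ids": token_ids, "attention_mask": attention_mask}


toy_encoded = toy_batch_encode(["The cat sat", "Dogs bark"])
print(toy_encoded["token_ids"])       # [[2, 4, 5, 6, 3], [2, 7, 8, 3, 0]]
print(toy_encoded["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]]
```

The shorter sentence is padded with `[PAD]`, and its attention mask ends in `0` so that the padded position can be ignored.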

Let's create a dataset object to store the encoded corpus; thankfully, `LanguageModelingDataset` has a built-in `from_dict` method we can use to do this!

In [None]:
dataset = LanguageModelingDataset.from_dict(encoded)

#### ⚙️ Instantiating the BabyBERT model

Here, we define the configuration settings for our BabyBERT model and instantiate it.

In [None]:
model_cfg = BabyBERTConfig(
    vocab_size=tokenizer.vocab_size,
    block_size=dataset.seq_length,
)

model = BabyBERT(model_cfg)

For pretraining, we need to add a masked language modeling head to BabyBERT.

In [None]:
mlm_model = BabyBERTForMLM(model)

#### 💪 Instantiating the trainer

Let's create a collator that will mask our input sequences for us.

In [None]:
collator = CollatorForMLM(tokenizer)
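For intuition about what masking involves, here is a minimal sketch of the masking scheme from the original BERT recipe: roughly 15% of non-special tokens are selected, and of those, 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. Whether `CollatorForMLM` follows exactly this recipe is an assumption; the token IDs, the vocabulary size, and the `toy_mask_tokens` helper are hypothetical.

```python
import random

MASK_ID = 4          # hypothetical ID of the [MASK] token
VOCAB_SIZE = 100     # hypothetical vocabulary size
SPECIAL_IDS = {0, 1, 2, 3, 4}  # hypothetical IDs for [PAD], [UNK], [CLS], [SEP], [MASK]
IGNORE_INDEX = -100  # label for positions that should not contribute to the loss


def toy_mask_tokens(token_ids, rng, mask_prob=0.15):
    """Return (masked_inputs, labels) for one sequence of token IDs."""
    inputs, labels = list(token_ids), []
    for position, token in enumerate(token_ids):
        if token in SPECIAL_IDS or rng.random() >= mask_prob:
            labels.append(IGNORE_INDEX)  # not selected: no prediction target
            continue
        labels.append(token)  # selected: the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[position] = MASK_ID  # 80%: replace with [MASK]
        elif roll < 0.9:
            # 10%: replace with a random non-special token
            inputs[position] = rng.randrange(len(SPECIAL_IDS), VOCAB_SIZE)
        # remaining 10%: keep the original token unchanged
    return inputs, labels


rng = random.Random(0)  # seeded so the sketch is reproducible
sequence = [2] + list(range(10, 30)) + [3]
masked, labels = toy_mask_tokens(sequence, rng)
print(masked)
print(labels)
```

Positions labeled `-100` are skipped by PyTorch's cross-entropy loss by default, so only the selected tokens act as prediction targets.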

We'll use that collator as part of the configuration for our trainer! The trainer calls the collator each time it assembles a batch of samples, so masking happens automatically.

In [None]:
trainer_cfg = TrainerConfig(
    collator=collator, batch_size=16, num_workers=4, num_samples=1000
)

trainer = Trainer(mlm_model, trainer_cfg)
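The progress bar in the training output reports a total of 1008 samples, although `num_samples` is set to 1000. One reading consistent with these numbers, offered only as an assumption about how the trainer counts, is that it rounds up to a whole number of batches:

```python
import math

batch_size = 16
num_samples = 1000

# Round up to a whole number of batches, then count the samples they contain.
num_batches = math.ceil(num_samples / batch_size)  # 63
samples_processed = num_batches * batch_size        # 1008
print(num_batches, samples_processed)
```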

#### 🏋️ Training BabyBERT

Everything is ready to go now, so let's train our model!

In [None]:
trainer.run(dataset)

Training:   0%|          | 0/1008 [00:02<?, ?samples/s]


TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "c:\Users\drewe\Documents\learning\babybert\.venv\Lib\site-packages\torch\utils\data\_utils\worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\drewe\Documents\learning\babybert\.venv\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 55, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\drewe\Documents\learning\babybert\babybert\data.py", line 82, in __call__
    batched_token_ids = torch.stack(token_ids)
                        ^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected Tensor as element 0 in argument 0, but got list
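The traceback above shows `torch.stack` inside the collator receiving Python lists rather than tensors. The sketch below reproduces that failure in isolation and shows one way to avoid it, by converting each sequence to a tensor before stacking. The sample data is made up, and where such a conversion belongs in BabyBERT (the dataset, `from_dict`, or the collator) is not something this notebook establishes.

```python
import torch

# Two already-padded sequences stored as plain Python lists,
# which is what the traceback suggests the collator received.
token_ids = [[2, 4, 5, 6, 3], [2, 7, 8, 3, 0]]

try:
    torch.stack(token_ids)
except TypeError as error:
    print(f"stack failed: {error}")

# Converting each sequence to a tensor first lets torch.stack succeed.
batched_token_ids = torch.stack([torch.tensor(ids) for ids in token_ids])
print(batched_token_ids.shape)  # torch.Size([2, 5])
```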


#### 💾 Saving the pretrained BabyBERT model

Finally, we save our pretrained model for later use.

In [None]:
model.save_pretrained("./checkpoints/toy-model")