# ActivateBaby - training LM
Based on [How to train a new language model from scratch using Transformers and Tokenizers](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb#scrollTo=M1oqh0F6W3ad)


## 0. Init

In [1]:
conda env list

# conda environments:
#
base                  *  /root/miniconda3
sum                      /root/miniconda3/envs/sum
vicuna                   /root/miniconda3/envs/vicuna


Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import re
import time
from os.path import join as osj
from pathlib import Path
from collections import defaultdict
import random
import pickle

import nltk
import pandas as pd
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt
import torch
import transformers
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import Trainer, TrainingArguments
from datasets import Dataset

from config import default_args
from lmprobs import TrigramSurprisalSpace



In [3]:
transformers.__version__

'4.29.2'

## 1. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a task of *masked language modeling*, i.e. predicting how to fill arbitrary tokens that we randomly mask in the dataset. In this notebook the masking is handled by the `DataCollatorForLanguageModeling` defined further down.
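
To make the masking step concrete, here is a minimal pure-Python sketch of the usual BERT-style recipe: roughly 15% of positions are selected, and of those about 80% become the mask token, 10% become a random token, and 10% stay unchanged. The function name `mask_tokens`, the toy token ids, and `mask_id=4` are assumptions for illustration only; the notebook itself relies on the library's collator rather than on code like this.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(input_ids, mask_id, vocab_size, special_ids=(), mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos, tok in enumerate(input_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            continue  # position not selected: no loss is computed here
        labels[pos] = tok  # the model has to recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = mask_id  # about 80%: replace with the mask token
        elif roll < 0.9:
            masked[pos] = rng.randrange(vocab_size)  # about 10%: random token
        # remaining ~10%: keep the original token in the input
    return masked, labels


# Toy sequence: <s>=0 and </s>=2 wrap thirty ordinary token ids (hypothetical values).
ids = [0] + list(range(10, 40)) + [2]
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=52_000, special_ids={0, 1, 2})
```

Positions whose label is `-100` contribute nothing to the loss, so the model is only scored on the tokens it was asked to reconstruct.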


In [4]:
# Check that we have a GPU
# !nvidia-smi

In [5]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

In [6]:
# os.environ["CUDA_VISIBLE_DEVICES"] = "3"
# os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = "cuda:0"

### Parameters of the official baseline

"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.29.2",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 50265

### We'll define the following config for the model

In [7]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1,
    
    attention_probs_dropout_prob=0.1,
    bos_token_id=0,
    classifier_dropout=None,
    eos_token_id=2,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    hidden_size=768,
    initializer_range=0.02,
    intermediate_size=3072,
    layer_norm_eps=1e-05,
    model_type="roberta",
    pad_token_id=1,
    position_embedding_type="absolute",
    torch_dtype="float32",
    transformers_version="4.29.2",
    use_cache=True,
)

Now let's re-create our tokenizer in transformers

In [8]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(default_args['tokenizer_path'], max_len=512)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LongformerTokenizer'. 
The class this function is called from is 'RobertaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LongformerTokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.


Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [9]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config).to(device)

In [10]:
model.num_parameters()
# => about 126 million parameters with this 12-layer config

126031648
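
The reported count can be reproduced by hand. The sketch below assumes the standard RoBERTa masked-LM layout (output projection tied to the word embeddings, no pooler); the helper name `roberta_mlm_param_count` is made up for this example.

```python
def roberta_mlm_param_count(vocab_size, hidden, layers, intermediate, max_positions, type_vocab=1):
    """Count trainable parameters of a RoBERTa masked LM with tied input/output embeddings."""
    layer_norm = 2 * hidden  # weight + bias
    # word, position and token-type embedding tables, followed by a LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # query, key, value and output projections (each hidden x hidden + bias), then LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # two dense layers (hidden -> intermediate -> hidden), then LayerNorm
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden) + layer_norm
    encoder = layers * (attention + feed_forward)
    # LM head: dense + LayerNorm + output bias; the decoder matrix reuses the word embeddings
    lm_head = (hidden * hidden + hidden) + layer_norm + vocab_size
    return embeddings + encoder + lm_head


total = roberta_mlm_param_count(52_000, 768, 12, 3072, 514)
six_layer = roberta_mlm_param_count(52_000, 768, 6, 3072, 514)
print(total, six_layer)
```

With the config above this gives 126,031,648, matching `model.num_parameters()`. Halving the depth to 6 layers would give roughly 83.5 million under the same assumptions.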

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

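Here, as we only have one text file, we don't even need to customize our `Dataset`; `LineByLineTextDataset` would work out of the box. That cell is left commented out below, because the sampling loop further down builds a `datasets.Dataset` from a pandas DataFrame instead.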

In [11]:
from transformers import DataCollatorForLanguageModeling, LineByLineTextDataset

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

In [12]:
# %%time
# from transformers import LineByLineTextDataset

# dataset = LineByLineTextDataset(
#     tokenizer=tokenizer,
#     file_path="../dataset/babylm_10M_sents.txt",
#     block_size=512,
# )

Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need a `data_collator`; it was defined in the cell above.

This is just a small helper that will help us batch different samples of the dataset together into an object that PyTorch knows how to perform backprop on.
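
As a rough picture of the batching part, the sketch below pads a few token-id lists to a common length and builds the matching attention mask. `pad_batch` is a hypothetical helper, not a library function, and `pad_id=1` simply mirrors the `pad_token_id` used in the config above.

```python
def pad_batch(sequences, pad_id):
    """Pad token-id lists to the longest one and build the matching attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that attention should ignore
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[0, 11, 12, 2], [0, 21, 2]], pad_id=1)
```

Padding only to the longest sequence in a batch is usually cheaper than padding everything to 512 tokens, as the tokenization function later in this notebook does.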

### surprisal space

In [13]:
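# Load the pickled surprisal space and the training sentences, closing both files afterwards
with open(default_args['tss_path'], "rb") as f:
    tss = pickle.load(f)

with open(default_args['train_data_path'], "r") as f:
    all_sents = f.readlines()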

In [14]:
train_data_df = pd.read_csv("/root/xhong/babylm/dataset/babylm_10M.csv")
train_data_df.shape

(1180291, 4)

In [26]:
INITIAL_SAMPLE = 100000
SAMPLE_SIZE = 50000
MAX_ITERATION = 22
encoder_max_length = 512
batch_size = 8

In [16]:
pool = train_data_df['line_idx'].to_numpy()
pool.shape

(1180291,)

In [17]:
def process_data_to_model_inputs(batch):
    # tokenize the input lines (MLM labels are created later by the data collator)
    inputs = tokenizer(
        batch["line"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    return batch

In [None]:
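# Sample without replacement so that no sentence is drawn twice
initial_indices = np.random.choice(pool, INITIAL_SAMPLE, replace=False)
# pool holds line_idx values, so drop them by value (np.delete would treat them as positions)
pool = pool[~np.isin(pool, initial_indices)]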
tss.remove_from_space(initial_indices)
sampled_train_data_df = train_data_df.loc[initial_indices,:]

iteration = 0
save_dir = "../ckpt/ABRoBERTa_10M_10ep/"
convergence_criterion_not_met = True
while convergence_criterion_not_met: # another miracle
    # dataset = LineByLineTextDataset(
    #     tokenizer=tokenizer,
    #     file_path=training_filename, #REPLACE WITH CURRENT TRAINING SET
    #     block_size=512,
    # )
    dataset = Dataset.from_pandas(sampled_train_data_df)
    
    # map train data
    train_set = dataset.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=batch_size,
        remove_columns=['Unnamed: 0', 'line_idx', 'token'],
    )

    training_args = TrainingArguments(
        output_dir="../ckpt/ABRoBERTa_10M_10ep",
        overwrite_output_dir=True,
        num_train_epochs=1,
        per_device_train_batch_size=8,
        save_steps=10_000,
        save_total_limit=2,
        prediction_loss_only=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_set,
    )

    # ### Start training
    trainer.train()

    # #### 🎉 Save final model (+ tokenizer + config) to disk
    save_path = os.path.join("../ckpt/ABRoBERTa_10M_10ep/", str(iteration))
    trainer.save_model(save_path) # TODO: rename checkpoints per iteration
    
    # Assume a miracle where we know the specific index of the highest perplexity sentence from
    # the training set.    
    # That miracle we will call most_confused_index
    # I.e., for every sentence in the training set, we get the perplexity according to the trained model.
    # find the index of the maximum.
    sampled_indices = np.random.choice(len(train_data_df), SAMPLE_SIZE)
    surprisal_by_group = []
    with torch.no_grad():
        for idx in tqdm(sampled_indices):
            line_idx = train_data_df.loc[idx, 'line_idx']
            tokens = train_data_df.loc[idx, 'line']

            # Tokenize the sentences and convert to tensor
            inputs = tokenizer(
                tokens,  
                padding="max_length", 
                truncation=True,
                max_length=encoder_max_length,
                return_tensors='pt').to(device)

            # Perform a forward pass through the model
            outputs = model(**inputs, labels=inputs['input_ids'])

            # outputs.loss is one scalar: the cross-entropy averaged over every position of this
            # sentence (padding included, since the labels are the unmasked input_ids).
            # It serves here as the sentence's surprisal (mean negative log-likelihood).
            surprisals = outputs.loss.tolist()
            
            surprisal_by_group.append(surprisals)
        surprisal_array = np.array(surprisal_by_group)
        max_surprisal_idx = surprisal_array.argmax()
        most_confused_index = sampled_indices[max_surprisal_idx]
        
        print('most_confused_index', most_confused_index)

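    # The neighbour query fails when k exceeds the number of points left in the space, so cap it
    k = min(SAMPLE_SIZE, tss.nnfinder.data.shape[0])
    _, indices, _ = tss.find_index(most_confused_index, k=k) #TODO: k is a hyperparameter
    pool = pool[~np.isin(pool, indices)]  # remove by value; np.delete removes by position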
    # Take things out of the space.
    tss.remove_from_space(indices)
    sampled_train_data_df = train_data_df.loc[indices,:]
    
    iteration += 1
    if iteration > MAX_ITERATION or pool.size == 0:
        convergence_criterion_not_met = False

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,8.3286
1000,7.8081
1500,7.7545
2000,7.7063
2500,7.7263
3000,7.673
3500,7.729
4000,7.6838
4500,7.71
5000,7.6338



most_confused_index 1125521




Step,Training Loss
500,7.7188
1000,7.7328
1500,7.7444



100%|██████████| 50000/50000 [20:21<00:00, 40.93it/s]


most_confused_index 628677




Step,Training Loss
500,7.5779
1000,7.613
1500,7.5923
2000,7.594
2500,7.5796
3000,7.4895
3500,7.5913
4000,7.707
4500,7.5553
5000,7.5824



100%|██████████| 50000/50000 [21:12<00:00, 39.28it/s]


most_confused_index 354738




Step,Training Loss
500,7.4911
1000,7.4466
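
The scoring step in the loop above uses the mean cross-entropy of a sentence as its surprisal. The sketch below shows the same quantity computed from per-token probabilities, averaged over real tokens only. `mean_surprisal` and the toy probabilities are assumptions for illustration, and this is not the loop's exact computation, since the loop also averages over padded positions.

```python
import math


def mean_surprisal(token_probs, attention_mask):
    """Average -log p(token), in nats, over the positions where attention_mask is 1."""
    kept = [-math.log(p) for p, m in zip(token_probs, attention_mask) if m == 1]
    return sum(kept) / len(kept)


# Toy probabilities the model assigns to the correct token at each position.
probs = [0.5, 0.25, 0.125, 1.0, 1.0]
mask = [1, 1, 1, 0, 0]  # the last two positions are padding

score = mean_surprisal(probs, mask)                 # real tokens only
diluted = mean_surprisal(probs, [1] * len(probs))   # padding counted as well
perplexity = math.exp(score)                        # perplexity is exp(mean surprisal)
```

In this toy case, counting the padded positions lowers the average, so short sentences padded to 512 tokens can look less surprising than they are. Restricting the average to real tokens avoids that length effect.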



In [21]:
len(train_set)

50000

In [23]:
pool.shape

(34406,)

In [24]:
MAX_ITERATION

20

In [27]:
iteration = 21
convergence_criterion_not_met = True
while convergence_criterion_not_met: # another miracle
    # dataset = LineByLineTextDataset(
    #     tokenizer=tokenizer,
    #     file_path=training_filename, #REPLACE WITH CURRENT TRAINING SET
    #     block_size=512,
    # )
    dataset = Dataset.from_pandas(sampled_train_data_df)
    
    # map train data
    train_set = dataset.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=batch_size,
        remove_columns=['Unnamed: 0', 'line_idx', 'token'],
    )

    training_args = TrainingArguments(
        output_dir="../ckpt/ABRoBERTa_10M_10ep",
        overwrite_output_dir=True,
        num_train_epochs=1,
        per_device_train_batch_size=8,
        save_steps=10_000,
        save_total_limit=2,
        prediction_loss_only=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_set,
    )

    # ### Start training
    trainer.train()

    # #### 🎉 Save final model (+ tokenizer + config) to disk
    save_path = os.path.join("../ckpt/ABRoBERTa_10M_10ep/", str(iteration))
    trainer.save_model(save_path) # TODO: rename checkpoints per iteration
    
    # Assume a miracle where we know the specific index of the highest perplexity sentence from
    # the training set.    
    # That miracle we will call most_confused_index
    # I.e., for every sentence in the training set, we get the perplexity according to the trained model.
    # find the index of the maximum.
    sampled_indices = np.random.choice(len(train_data_df), SAMPLE_SIZE)
    surprisal_by_group = []
    with torch.no_grad():
        for idx in tqdm(sampled_indices):
            line_idx = train_data_df.loc[idx, 'line_idx']
            tokens = train_data_df.loc[idx, 'line']

            # Tokenize the sentences and convert to tensor
            inputs = tokenizer(
                tokens,  
                padding="max_length", 
                truncation=True,
                max_length=encoder_max_length,
                return_tensors='pt').to(device)

            # Perform a forward pass through the model
            outputs = model(**inputs, labels=inputs['input_ids'])

            # outputs.loss is one scalar: the cross-entropy averaged over every position of this
            # sentence (padding included, since the labels are the unmasked input_ids).
            # It serves here as the sentence's surprisal (mean negative log-likelihood).
            surprisals = outputs.loss.tolist()
            
            surprisal_by_group.append(surprisals)
        surprisal_array = np.array(surprisal_by_group)
        max_surprisal_idx = surprisal_array.argmax()
        most_confused_index = sampled_indices[max_surprisal_idx]
        
        print('most_confused_index', most_confused_index)

    _, indices, _ = tss.find_index(most_confused_index, k=SAMPLE_SIZE) #TODO: k is a hyperparameter
    pool = np.delete(pool, indices)
    # Take things out of the space.
    tss.remove_from_space(indices)
    sampled_train_data_df = train_data_df.loc[indices,:]
    
    iteration += 1
    if iteration > MAX_ITERATION or pool.size == 0:
        convergence_criterion_not_met = False



Step,Training Loss
500,6.9714
1000,6.8029
1500,6.7664
2000,6.894
2500,6.8152
3000,6.7387
3500,6.8488
4000,6.8437
4500,6.8321
5000,6.94


100%|██████████| 50000/50000 [18:57<00:00, 43.97it/s]


most_confused_index 587583


ValueError: k must be less than or equal to the number of training points
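
This error appears when more neighbours are requested than points remain in the space; the `tss.nnfinder.data.shape[0]` cell further down reports 30291 remaining points, fewer than `SAMPLE_SIZE`. One possible guard is to cap `k` before querying. The brute-force search below is only a self-contained stand-in for `tss.find_index`, whose internals are not shown in this notebook, and `nearest_neighbours` is a hypothetical name.

```python
import math


def nearest_neighbours(points, query, k):
    """Brute-force k-nearest-neighbour search that never asks for more points than exist."""
    k = min(k, len(points))  # cap k at the number of points still available
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return order[:k]


space = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
# Asking for 50,000 neighbours from a 3-point space simply returns all 3, nearest first.
neighbours = nearest_neighbours(space, query=(0.9, 0.9), k=50_000)
```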

In [74]:
tss.dims

7

In [71]:
dists, indices = tss.nnfinder.query(np.array([[1,2,3,4,5,6,7]]), k=5)

In [86]:
tss.find_index(1, k=5)[-1]

(array([11.57198041,  7.68966767,  2.88667137,  0.5849625 ,  1.34917822,
         0.94563687,  0.23640922]),
 array([4.22634629, 4.55387522, 5.89305967, 1.5849625 , 5.73436519,
        5.71417052, 0.63706166]),
 array([6.59307593, 5.802712  , 2.6412563 , 2.19178893, 1.98492687,
        0.76431367, 0.45916037]),
 array([8.53102277, 7.17128543, 1.73233609, 4.78408363, 7.96476085,
        3.5572489 , 2.45537091]),
 array([6.7934593 , 7.41206322, 5.55625147, 3.70043972, 2.11453698,
        0.52863425, 1.05726849]))

In [72]:
dists

array([[8.8943542 , 9.26913703, 9.33674916, 9.4271919 , 9.4271919 ]])

In [73]:
indices

array([[29107, 28763, 26851, 21290, 21466]])

In [31]:
len(indices)

50000

In [29]:
pool = np.delete(pool, indices)

IndexError: index 40373 is out of bounds for axis 0 with size 34406
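
The likely cause: `np.delete(arr, obj)` removes the elements at the *positions* given in `obj`, whereas `indices` holds `line_idx` *values*. Once the pool has shrunk, many of those values are larger than its length. A value-based removal avoids this; the sketch below uses plain lists and a hypothetical helper `remove_by_value`, and the NumPy equivalent would be `pool[~np.isin(pool, indices)]`.

```python
def remove_by_value(pool, selected):
    """Return the pool without the selected line indices, matching on value, not position."""
    selected = set(selected)  # set lookup keeps the filter linear in len(pool)
    return [idx for idx in pool if idx not in selected]


pool = [10, 11, 12, 13, 14]
pool = remove_by_value(pool, [13, 14])  # three entries are left
pool = remove_by_value(pool, [12, 99])  # 12 is removed; 99 is not in the pool and is ignored
```

Deleting position 13 from a five-element array would raise the same kind of `IndexError` shown above, while the value-based filter leaves `[10, 11]`.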

In [32]:
from sklearn.neighbors import KDTree

In [35]:
t = KDTree([[1,2,3,4,5]])
t

<sklearn.neighbors._kd_tree.KDTree at 0x55816faa6c80>

In [44]:
t.get_arrays()[0].size

5

In [77]:
tss.nnfinder.data.shape[0]

30291

## 2. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.
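
Conceptually, the ranking behind a fill-mask prediction is a softmax over the vocabulary at the masked position followed by a top-k selection. The sketch below illustrates this with a four-word toy vocabulary. `top_k_fills`, the logits, and the words are invented for the example and are not output from the trained model.

```python
import math


def top_k_fills(logits, vocab, k=3):
    """Softmax the logits at the masked position and return the k most probable tokens."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    scored = zip(vocab, (e / total for e in exps))
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


vocab = ["shines", "rises", "sets", "banana"]
fills = top_k_fills([2.0, 1.0, 0.5, -3.0], vocab, k=2)  # list of (token, probability) pairs
```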



In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./ABRoBERTa",
    tokenizer="./ABRoBERTa"
)

In [None]:
# Esperanto original from the tutorial: "La suno <mask>."
fill_mask("The sun <mask>.")

Ok, simple syntax/grammar works. Let’s try a slightly more interesting prompt:



In [None]:
# Esperanto original from the tutorial: "Jen la komenco de bela <mask>."
fill_mask("This is the beginning of a beautiful <mask>.")