# RoBERTa Pre-Training with the *wikismall* dataset

We will pre-train a RoBERTa model from scratch on the masked language modeling (MLM) objective.

1. Load the dataset [acloudfan/wikismall](https://huggingface.co/datasets/acloudfan/wikismall)
2. Tokenize the dataset using *RobertaTokenizerFast*
3. Pre-process the data to get it ready for training: group the data into chunks of fixed size
4. Set up DataCollatorForLanguageModeling to (a) create batches (b) randomly mask tokens
5. Setup TrainingArguments
6. Create model with random weights
7. Setup trainer & run training
8. Push trained model to the hub

In [None]:
# Needed for Google Colab
# !pip install datasets torch transformers accelerate

In [None]:
from datasets import DownloadConfig, load_dataset, DatasetDict
from transformers import (AutoTokenizer, 
                          DataCollatorForLanguageModeling, 
                          TrainingArguments, 
                          Trainer, 
                          AutoConfig, 
                          AutoModelForMaskedLM)

## 1. Load Dataset 

The dataset is already split into 'train' and 'validation'


In [None]:
# Dataset name
dataset_name='acloudfan/wikismall'

# Load
raw_datasets = load_dataset(dataset_name)

# Shuffling is optional
# raw_datasets = raw_datasets.shuffle(seed=23)

raw_datasets

## 2. Tokenize the dataset

In [None]:
# Create the tokenizer
tokenizer_name = 'roberta-base'

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# Maximum sequence length supported by the model
max_seq_length = tokenizer.model_max_length

tokenizer

In [None]:
# Column to tokenize
text_column_name = "text" 

# Tokenize without padding or truncation; the sequence lengths are fixed later by chunking (block_size)
# The DataCollator optimizes the processing if the special tokens mask is provided (return_special_tokens_mask=True)
def tokenize_function(examples):
    return tokenizer(examples[text_column_name], 
                     return_special_tokens_mask=True)
    
#     return tokenizer(examples[text_column_name], 
#                      return_special_tokens_mask=True, 
#                      padding='max_length', 
#                      max_length=max_seq_length) 

In [None]:
# Get the features from the datasets - these are column names
# In the tokenized dataset we will remove all columns from the original dataset
# The tokenizer may warn that some sequences are longer than the model's maximum length.
# This warning can be ignored - we will take care of adjusting the sequence lengths later (block_size)
column_names = list(raw_datasets["train"].features)

tokenized_datasets = raw_datasets.map(
                    tokenize_function,
                    batched=True,
                    remove_columns=column_names,
                )

# Note down the number of rows
tokenized_datasets

## 3. Data pre-processing to get it ready for training : Grouping of data to chunks of fixed size

### Data Pre-processing 

Since the data is already clean, we don't need to carry out any cleaning. The only thing we need to do is split the data into fixed-size chunks that will be fed to the model for training. Chunking makes the training efficient because every example has the same length, so no compute is spent on padding.

Two steps:

#### 1. Define a chunking function

The function receives a batch of tokenized rows (input_ids, attention_mask, special_tokens_mask). It concatenates each feature across the rows and then breaks the concatenated sequence into chunks of fixed length (=block_size). If the last chunk is shorter than block_size, it gets dropped; this is done to keep the logic simple.

1. Decide on a chunk or block size
2. Function argument = a batch of tokenized dataset rows (N rows)
3. In the tokenized batch:
   * Concatenate the values of each feature in tokenized_datasets
   * Split each concatenated feature into chunks of length = block_size
   * Discard the last chunk if its length < block_size
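
The steps above can be illustrated on a toy batch. This is only a minimal sketch: the token ids are made up, `toy_block_size` is a hypothetical value chosen small for readability, and `itertools.chain` is used for the concatenation instead of the `sum(..., [])` idiom in the notebook (both produce the same flat list).

```python
from itertools import chain

# Toy "tokenized" batch: three rows of made-up token ids (illustrative only)
toy_batch = {
    "input_ids": [[0, 11, 12, 2], [0, 21, 22, 23, 2], [0, 31, 32, 33, 2]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]],
}
toy_block_size = 4  # hypothetical value; the notebook uses 128

# Concatenate the rows of each feature into one flat list (14 tokens here)
flat = {k: list(chain.from_iterable(v)) for k, v in toy_batch.items()}

# Largest multiple of the block size that fits; the remainder is dropped
usable = (len(flat["input_ids"]) // toy_block_size) * toy_block_size

# Slice every feature into blocks of equal length
chunks = {
    k: [v[i : i + toy_block_size] for i in range(0, usable, toy_block_size)]
    for k, v in flat.items()
}

print(chunks["input_ids"])
# [[0, 11, 12, 2], [0, 21, 22, 23], [2, 0, 31, 32]]
```

Note that a block may span two original rows, and that the last two tokens (33 and 2) are discarded because they do not fill a complete block.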
   
#### 2. Use the Dataset.map function to apply chunking

Check out the documentation to understand how the *map* function is applied to a batch of rows in a Dataset.

https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.Dataset.map
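
Because each batch passed by *map* is chunked on its own, the leftover tokens at the end of every batch are lost. The sketch below estimates that loss for two batch sizes. The helper `tokens_dropped` and the dataset of 10,000 rows with 50 tokens each are hypothetical, used only to show the arithmetic.

```python
def tokens_dropped(row_lengths, batch_size, block_size):
    """Count tokens discarded when every batch is chunked independently."""
    dropped = 0
    for start in range(0, len(row_lengths), batch_size):
        batch_total = sum(row_lengths[start : start + batch_size])
        # Tokens that do not fill a complete block are thrown away
        dropped += batch_total % block_size
    return dropped

# Hypothetical dataset: 10,000 rows of 50 tokens each
row_lengths = [50] * 10_000

small = tokens_dropped(row_lengths, batch_size=10, block_size=128)
large = tokens_dropped(row_lengths, batch_size=1000, block_size=128)
print(small, large)
# 116000 800
```

Under these assumptions, a batch size of 10 discards 116,000 of the 500,000 tokens, while a batch size of 1000 discards only 800.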


In [None]:
# 1. Define chunking function

def group_text_tokens(examples):
    
    block_size = 128
      
    # Concatenate input_ids, special_tokens_mask, attention_mask 
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    
    # Get the length of concatenated arrays/tensors
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    
    # Split the concatenated keys into arrays of size = block_size
    # Size of last block may be < block_size
    #    1. You can either drop the last block
    #    2. Or you can pad input_ids to make its len = block_size, do not forget the attention_mask/special_tokens
    
    # To keep things simple - we will drop the last block
    total_length = (total_length // block_size) * block_size
    
    # Split into chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    
    # Copy the input_ids to a new column 'labels'
    # With MLM, the data collator later overwrites 'labels' with the targets for the masked tokens
    result["labels"] = result["input_ids"].copy()
    return result

In [None]:
# 2. Apply function to batches

# Take advantage of multiple CPU cores by setting *num_proc* > 1
# Batch size may be adjusted - a larger batch means fewer leftover tokens are dropped

lm_datasets = tokenized_datasets.map(
    group_text_tokens,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

# Note down the number of rows
lm_datasets

## 4. Create the data collator

Since we are training RoBERTa for MLM, we will use DataCollatorForLanguageModeling, which has the built-in capability to mask tokens.

https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForLanguageModeling
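
To make the masking step concrete, here is a simplified, pure-Python sketch of the idea for a single sequence. It is not the library's implementation, which works on batched tensors. The function name `mask_tokens_sketch` is hypothetical, and the 80/10/10 split (mask token / random token / unchanged) is the commonly documented default behaviour, stated here as an assumption. Positions that are not selected get the label -100 so that the loss ignores them.

```python
import random

def mask_tokens_sketch(input_ids, special_tokens_mask, mask_id, vocab_size,
                       mlm_probability=0.15, seed=0):
    """Simplified illustration of MLM masking for one sequence."""
    rng = random.Random(seed)
    masked_ids = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 = position ignored by the loss

    for pos, (token, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        # Special tokens are never selected; others are selected with mlm_probability
        if is_special or rng.random() >= mlm_probability:
            continue
        labels[pos] = token  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked_ids[pos] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            masked_ids[pos] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked_ids, labels

# Made-up sequence: <s> + 20 tokens + </s>
ids = [0] + list(range(100, 120)) + [2]
special = [1] + [0] * 20 + [1]

# 50264 and 50265 are assumed to be the mask token id and vocabulary size of
# roberta-base; check tokenizer.mask_token_id and len(tokenizer) to confirm
masked, labels = mask_tokens_sketch(ids, special, mask_id=50264, vocab_size=50265)
print(masked)
print(labels)
```

This also shows why `return_special_tokens_mask=True` is useful during tokenization: the collator needs to know which positions must never be masked.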

In [None]:
# Create the collator for use by Trainer

MLM_PROBABILITY = 0.15

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=MLM_PROBABILITY)

## 5. Setup training arguments

https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments

#### Adjust the training arguments to experiment



In [None]:
training_args = TrainingArguments(
    "wikismall-roberta-mlm-training",
    num_train_epochs=5,
    evaluation_strategy = "epoch",  # newer transformers releases name this argument eval_strategy
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps = 50,
    log_level = 'warning',
    use_cpu = False
)

## 6. Create model with randomly initialized weights

1. Create the configuration from checkpoint = 'roberta-base'
2. Use the configuration to create an instance of the untrained RoBERTa model


In [None]:
model_checkpoint='roberta-base'

config = AutoConfig.from_pretrained(model_checkpoint)

model = AutoModelForMaskedLM.from_config(config)

## 7. Setup trainer & run training

Note:

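The Training Loss column may show "No log". This happens when an evaluation runs before the first logging step, that is, when an epoch has fewer training steps than logging_steps. It is typical when low volumes of data are used for training.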
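
Whether a training loss has been logged by the end of the first epoch can be estimated with simple arithmetic. The sketch below assumes a single device, the Trainer default per-device batch size of 8, and no gradient accumulation. The helper `steps_per_epoch` and the row count are hypothetical.

```python
import math

def steps_per_epoch(num_rows, per_device_batch_size=8, num_devices=1,
                    gradient_accumulation_steps=1):
    """Estimate the number of optimizer steps in one epoch."""
    batches = math.ceil(num_rows / (per_device_batch_size * num_devices))
    return math.ceil(batches / gradient_accumulation_steps)

logging_steps = 50       # value used in TrainingArguments above
hypothetical_rows = 300  # made-up number of rows in lm_datasets["train"]

steps = steps_per_epoch(hypothetical_rows)

# If the epoch ends before the first logging step, no training loss
# has been recorded yet when the epoch-level evaluation is reported
print(steps, steps < logging_steps)
# 38 True
```

In that situation, lowering `logging_steps` or training on more rows makes the training loss appear.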

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

In [None]:
trainer.train()

## 8. Push the model to hub

In [None]:
import os

# Read the access token from the environment; never hard-code a token in a shared notebook
HF_TOKEN = os.environ.get("HF_TOKEN")

model_name = 'wikismall-roberta'

model.push_to_hub(model_name, token=HF_TOKEN)