#### Training the language model.


In this step we will fine-tune the DistilCamemBERT model on our dataset, using masked language modeling as the training objective.

With masked language modeling, we mask some of the tokens in the input text and train the model to predict them.
Once the model has been trained on that task, we will use it for other downstream tasks.

In [None]:
from transformers import AutoTokenizer

In [None]:
model_checkpoint = "cmarkea/distilcamembert-base"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
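def tokenize_function(data):
    """Tokenize the `content` column of a batch of examples.

    Args:
        data (dict-like): batch of examples; data['content'] is a list of strings.

    Returns:
        The tokenizer output, with each example truncated to at most 512 tokens.
    """
    result = tokenizer(data['content'], max_length=512, truncation=True)
    return result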

In [None]:
tokenized_dataset = dataset.map(tokenize_function, 
                                batched=True, 
                                remove_columns=["title", "summary", "posted_at", "website_origin", "content", '__index_level_0__'])

Map:   0%|          | 0/36597 [00:00<?, ? examples/s]

### Dataset Masking

After tokenizing the dataset, we will use a data collator to mask random tokens in each batch.
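
As a rough illustration of what the collator does to each sequence, the sketch below reimplements the masking rule in plain Python. It assumes the default behaviour of `DataCollatorForLanguageModeling`: each token is selected with probability `mlm_probability`, a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time, and labels are set to `-100` at unselected positions so the loss ignores them. The function name `mask_tokens_sketch`, the mask id `4`, and the toy token ids are hypothetical; the real collator works on tensors and also avoids masking special tokens.

```python
import random

IGNORE_INDEX = -100  # label value that the cross-entropy loss ignores


def mask_tokens_sketch(input_ids, mask_token_id, vocab_size,
                       mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    masked_ids, labels = [], []
    for token_id in input_ids:
        if rng.random() >= mlm_probability:
            # Not selected: keep the token and exclude it from the loss.
            masked_ids.append(token_id)
            labels.append(IGNORE_INDEX)
            continue
        # Selected: the model must predict the original token here.
        labels.append(token_id)
        draw = rng.random()
        if draw < 0.8:
            masked_ids.append(mask_token_id)              # replace with mask token
        elif draw < 0.9:
            masked_ids.append(rng.randrange(vocab_size))  # replace with a random token
        else:
            masked_ids.append(token_id)                   # leave unchanged
    return masked_ids, labels


toy_ids = list(range(100, 140))  # hypothetical token ids
masked_ids, labels = mask_tokens_sketch(toy_ids, mask_token_id=4, vocab_size=1000)
print(masked_ids)
print(labels)
```

Because the masking is random, the same sequence gets a different mask each time it is collated, unless a fixed seed is used as in this sketch.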

In [None]:
from transformers import DataCollatorForLanguageModeling

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


#### Concatenate and Chunk Dataset

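I have not fully understood this step yet and will come back to it later. In short, the function below concatenates all the tokenized texts in a batch and splits the result into fixed-size chunks of 128 tokens, dropping the last chunk if it is incomplete.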

In [None]:
def concat_chunk_dataset(data):
    chunk_size = 128
    # concatenate the texts of the batch, column by column
    concatenated_sequences = {key: sum(value, []) for key, value in data.items()}
    # compute the length of the concatenated texts
    total_concat_length = len(concatenated_sequences[list(data.keys())[0]])

    # drop the last chunk if it is smaller than the chunk size
    total_length = (total_concat_length // chunk_size) * chunk_size

    # split the concatenated sequences into chunks of chunk_size tokens
    result = {
        k: [t[i: i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_sequences.items()
    }

    # Create a labels column as a copy of input_ids. It serves as the
    # ground truth for the masked language model to learn from.
    result["labels"] = result["input_ids"].copy()

    return result


In [None]:
padded_dataset = tokenized_dataset.map(concat_chunk_dataset, batched=True)

Map:   0%|          | 0/36597 [00:00<?, ? examples/s]
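
To make the concatenate-and-chunk step more concrete, here is a small self-contained sketch that applies the same logic to a toy batch, with a chunk size of 4 instead of 128. The function name `concat_and_chunk` and the toy values are made up for illustration.

```python
def concat_and_chunk(batch, chunk_size):
    """Concatenate every column of a batch, then cut it into equal chunks."""
    # Flatten each column: [[1, 2, 3], [4, 5]] -> [1, 2, 3, 4, 5]
    concatenated = {
        key: [item for seq in seqs for item in seq] for key, seqs in batch.items()
    }
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole chunk.
    usable_length = (total_length // chunk_size) * chunk_size
    chunks = {
        key: [values[i:i + chunk_size] for i in range(0, usable_length, chunk_size)]
        for key, values in concatenated.items()
    }
    # Labels start as a copy of the inputs.
    chunks["labels"] = [list(chunk) for chunk in chunks["input_ids"]]
    return chunks


toy_batch = {
    "input_ids": [[5, 10, 11, 6], [5, 20, 21, 22, 6], [5, 30, 31, 6]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1]],
}
result = concat_and_chunk(toy_batch, chunk_size=4)
print(result["input_ids"])
# [[5, 10, 11, 6], [5, 20, 21, 22], [6, 5, 30, 31]]
```

Three texts of lengths 4, 5 and 4 become three chunks of 4 tokens, and the 13th token is dropped. Chunks can span the boundary between two texts, and since every chunk has the same length, no padding is needed afterwards.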

#### Training Process

In [None]:
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM
from torch.optim import AdamW
from accelerate import Accelerator
from transformers import get_scheduler


In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm_probability = 0.15)

In [None]:
batch_size = 16

train_dataloader = DataLoader(padded_dataset['train'],
                              shuffle=True,
                              batch_size=batch_size,
                              collate_fn=data_collator)
eval_dataloader = DataLoader(padded_dataset['test'],
                             shuffle=False,
                             batch_size=batch_size,
                             collate_fn=data_collator)

In [None]:
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [None]:
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=1e-2)

In [None]:
accelerator = Accelerator()
device = accelerator.device
model.to(device)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(model, optimizer, train_dataloader, eval_dataloader,
                                                                          device_placement=[True, True, True, True])

In [None]:
train_epochs = 5
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = train_epochs * num_update_steps_per_epoch

In [None]:
lr_scheduler = get_scheduler("linear", 
                             optimizer=optimizer,
                             num_warmup_steps=0,
                             num_training_steps=num_training_steps)
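
For intuition about the `"linear"` schedule, the sketch below computes the learning rate by hand. It assumes the usual definition of this schedule: a linear ramp from 0 to the base learning rate during warmup, followed by a linear decay to 0 at `num_training_steps`. The function name `linear_schedule_lr` and the step count of 1000 are hypothetical.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


total_steps = 1000  # hypothetical; the notebook uses train_epochs * len(train_dataloader)
for step in (0, 250, 500, 1000):
    print(step, linear_schedule_lr(step, base_lr=5e-5, num_training_steps=total_steps))
```

With `num_warmup_steps=0`, as in this notebook, training starts at the full learning rate of `5e-5` and the rate shrinks a little after every optimizer step.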


### Training the Model

In [None]:
import torch
import math
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

# directory to save the models
output_dir = "trained_models"

for epoch in range(train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(padded_dataset["test"])]

    # perplexity: the metric used for masked language model training
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")
    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save model
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)

  0%|          | 0/52395 [00:00<?, ?it/s]

KeyboardInterrupt: 
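
The perplexity reported at the end of each epoch is the exponential of the mean evaluation loss. The minimal sketch below shows the same computation on plain Python floats; the function name `perplexity_from_losses` and the loss values are hypothetical and are not results from this training run.

```python
import math


def perplexity_from_losses(losses):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    mean_loss = sum(losses) / len(losses)
    try:
        return math.exp(mean_loss)
    except OverflowError:
        # A very large loss overflows exp(); report infinity instead.
        return float("inf")


toy_losses = [2.1, 1.9, 2.0]  # hypothetical per-batch evaluation losses
print(perplexity_from_losses(toy_losses))
```

A lower perplexity is better: a mean loss of 2.0 gives a perplexity of about 7.4, while a loss of 0 would give the minimum value of 1.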

We trained the model in a Google Colab notebook and downloaded it to the local path `trained_models/congo-news-masked-language-model`. We will now load it and use it to predict masked tokens.

In [1]:
from pathlib import Path

In [2]:
trained_model_path = Path.cwd().joinpath("trained_models", "congo-news-masked-language-model")

In [3]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained(trained_model_path)

model = AutoModelForMaskedLM.from_pretrained(trained_model_path) 






In [12]:
tokenizer.mask_token

'<mask>'

In [20]:
inputs = tokenizer("Dans son speech, il n'a pas écarté la possibilité d'aligner sa candidature à la prochaine élection <mask> ", return_tensors="pt")


In [21]:
inputs.get('input_ids').shape

torch.Size([1, 27])

In [25]:
with torch.no_grad():
    logits = model(**inputs).logits

# retrieve the index of the <mask> token
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]

# take the highest-scoring vocabulary id at the masked position and decode it
predicted_token_id = logits[0, mask_token_index].argmax(dim=-1)
output = tokenizer.decode(predicted_token_id)
print(output)

présidentielle


In [27]:
mask_token_index

tensor([24])
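
The cell above keeps only the single best prediction. To look at several candidates and their probabilities, one option is to apply a softmax to the logits at the masked position and keep the top `k` entries. The sketch below does this in plain Python on a toy list of scores; the function name `top_k_from_logits` and the toy values are hypothetical. With the real model, the input would be the row of `logits` at the masked position, and each returned id could be turned back into text with `tokenizer.decode`.

```python
import math


def top_k_from_logits(logits, k=5):
    """Return the k (index, probability) pairs with the highest softmax probability."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


toy_logits = [0.1, 2.0, -1.0, 3.5, 0.7]  # hypothetical scores for a 5-token vocabulary
for token_id, prob in top_k_from_logits(toy_logits, k=3):
    print(token_id, round(prob, 3))
```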

With this quick test we can see that our model is working. The next step will be to use it for topic modelling or news classification tasks.