## Causal Language Modeling (CLM) - Preprocessing, Training, and Inference

In this lab, we will explore **Causal Language Modeling (CLM)**, a core task in training autoregressive language models like GPT-2. CLM is the task of predicting the next token in a sequence, given the previous tokens. This type of modeling forms the backbone of text generation, where the model learns to generate coherent text by conditioning only on the preceding tokens in the sequence.
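
As a minimal sketch of what "only previous tokens" means, the snippet below builds a causal mask for a short sequence: row `i` marks which positions the token at position `i` is allowed to see. The helper `causal_mask` is a hypothetical illustration in plain Python, not a function from `transformers`; real models apply this kind of mask inside their attention layers.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix: entry [i][j] is 1 if position i may see position j."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Each position sees itself and everything before it, but nothing after it.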

The lab is divided into three major sections:
1. **Preprocessing**: Preparing a dataset and the labels for the CLM task and tokenizing them.
2. **Training**: Fine-tuning a pre-trained language model like GPT-2 on a specific dataset using the CLM task.
3. **Inference**: Evaluating the model’s performance by generating text based on input prompts.

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset

### 1. Preprocessing

1. **Load the Domain-Specific Dataset**:
   - The first step in fine-tuning GPT-2 is to load a dataset that is specific to the domain of interest. In this case, we are using a publicly available **medical dataset** from the **PubMed** collection. PubMed contains a vast number of medical articles, and fine-tuning on such a dataset can help GPT-2 generate more accurate and context-specific medical text.
   - The dataset we're using is `"japhba/pubmed_simple"`, which is a simplified version of PubMed data. This dataset can be easily accessed using the `datasets` library from Hugging Face.


In [2]:
ds = load_dataset("japhba/pubmed_simple", split="train")
# Shuffle once, then take disjoint slices for training and evaluation
shuffled = ds.shuffle(seed=42)
train_dataset = shuffled.select(range(1000))
eval_dataset = shuffled.select(range(1000, 1500))

We can check the contents of any one of the examples in the dataset to understand the structure of the data.

In [3]:
train_dataset[0]

{'abstract': 'Purpose The identification of abnormalities that are relatively rare within otherwise normal anatomy is a major challenge for deep learning in the semantic segmentation of medical images. The small number of samples of the minority classes in the training data makes the learning of optimal classification challenging, while the more frequently occurring samples of the majority class hamper the generalization of the classification boundary between infrequently occurring target objects and classes. In this paper, we developed a novel generative multi-adversarial network, called Ensemble-GAN, for mitigating this class imbalance problem in the semantic segmentation of abdominal images.Method The Ensemble-GAN framework is composed of a single-generator and a multi-discriminator variant for handling the class imbalance problem to provide a better generalization than existing approaches. The ensemble model aggregates the estimates of multiple models by training from different ini

We see that each entry is a dictionary with two keys:
- "abstract": The abstract of the medical article
- "country": The country where the article was published (we will not use this information in this lab)

2. **Tokenize the Dataset**:
   - The next step is to **tokenize** the data so that it can be processed by GPT-2.
   - GPT-2 does not natively define a padding token, since it does not require fixed-length inputs. For this reason, we substitute the EOS token for it, as suggested by the `transformers` library. Remember that we also pass an attention mask, so the model does not attend to the padded positions, whatever token fills them.
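
The relationship between padding and the attention mask can be sketched in plain Python. The helper `pad_with_mask` below is hypothetical and only mimics what a tokenizer returns with truncation and `padding="max_length"`; it assumes padding is added on the right, and the short token ids are made up. The value 50256 is the id of GPT-2's EOS token.

```python
EOS_ID = 50256  # id of GPT-2's EOS token, reused here as the padding token


def pad_with_mask(token_ids, max_length, pad_id=EOS_ID):
    """Truncate to max_length, pad on the right, and mark real tokens with 1 in the mask."""
    token_ids = token_ids[:max_length]
    n_pad = max_length - len(token_ids)
    return {
        "input_ids": token_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(token_ids) + [0] * n_pad,
    }


encoded = pad_with_mask([11, 22, 33], max_length=5)
print(encoded["input_ids"])       # [11, 22, 33, 50256, 50256]
print(encoded["attention_mask"])  # [1, 1, 1, 0, 0]
```

The zeros in the mask mark the padded positions, independently of which token id was used to fill them.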

In [4]:
# Tokenize the Dataset
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # Set padding token to EOS token

def tokenize_function(samples):
    return tokenizer(samples['abstract'], truncation=True, padding="max_length")

tokenized_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_dataset_eval = eval_dataset.map(tokenize_function, batched=True)




3. **Add labels to the Dataset for Next Token Prediction**:
   - In this step, we will add the **labels** that will be used during the next token prediction task. 
   - In autoregressive language modeling, the **labels** are the input sequence shifted by one position: the target at each position is the token that follows it. This is because the model is trained to predict the next token in the sequence given the previous tokens.
   - The shift is handled automatically inside the model's forward pass, so we do not need to do it ourselves. We just add an extra column named `labels` to the dataset; when the model receives this argument, it computes the loss for us.
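
The shift described above can be pictured as pairing each position with the token that follows it. The function below is a hypothetical plain-Python illustration of that pairing, not library code, and the token ids are made up.

```python
def shift_for_next_token(input_ids, labels):
    """Pair the token at each position with the label one position later."""
    contexts = input_ids[:-1]  # the last token has no successor to predict
    targets = labels[1:]       # the first label is never a prediction target
    return list(zip(contexts, targets))


token_ids = [464, 5827, 10969, 351]  # made-up ids for a four-token sequence
pairs = shift_for_next_token(token_ids, token_ids)
print(pairs)  # [(464, 5827), (5827, 10969), (10969, 351)]
```

This is why `labels` can simply be a copy of `input_ids`: a sequence of four tokens yields three prediction targets once the shift is applied.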

In [5]:
# Add labels to the dataset
def add_labels(samples):
    samples["labels"] = samples["input_ids"]
    return samples

tokenized_dataset = tokenized_dataset.map(add_labels, batched=True)
tokenized_dataset_eval = tokenized_dataset_eval.map(add_labels, batched=True)
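
Note that copying `input_ids` into `labels` means the loss is also computed at the padded positions. A common variant, which is not the one used in this lab, replaces the labels at padded positions with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below assumes the batch contains the `attention_mask` column produced by the tokenizer; the name `add_masked_labels` is hypothetical. Using it would change the loss values reported later in this lab.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def add_masked_labels(samples):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    samples["labels"] = [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(samples["input_ids"], samples["attention_mask"])
    ]
    return samples


toy_batch = {
    "input_ids": [[10, 11, 12, 50256, 50256]],
    "attention_mask": [[1, 1, 1, 0, 0]],
}
masked = add_masked_labels(toy_batch)
print(masked["labels"])  # [[10, 11, 12, -100, -100]]
```

The padding is identified through the attention mask rather than the token id because the padding token and the EOS token share the same id here. The function has the same shape as `add_labels`, so it could be passed to `map` with `batched=True` in the same way.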

### 2. Training

1. **Fine-Tune the GPT-2 Model**:
   - Set up the model and fine-tune it on the medical dataset.
   - The pipeline is the same one we have already seen in the previous lab (`lab03 - 01-bert`).

In [6]:
# Set Training Parameters
training_args = TrainingArguments(
    output_dir="./gpt2-medical-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=6,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=100,
    eval_steps=10,
    eval_strategy="steps",
)

# Initialize GPT-2 Model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

# Fine-Tune the Model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset_eval,
)

trainer.train()

# Save the Fine-Tuned Model
model.save_pretrained("./gpt2-medical-finetuned")




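| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 10   | No log        | 0.734506        |
| 20   | No log        | 0.726209        |
| 30   | No log        | 0.722615        |
| 40   | No log        | 0.720809        |
| 50   | No log        | 0.719125        |
| 60   | No log        | 0.718181        |
| 70   | No log        | 0.71751         |
| 80   | No log        | 0.71708         |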


### 3. Inference

1. **Compare Text Generation Before and After Fine-Tuning**:
   - Generate text using both the original pre-trained GPT-2 model and the fine-tuned model.
   - Provide the same input prompt and observe the differences in the outputs.

In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pre-trained and fine-tuned GPT-2 models
pretrained_model = AutoModelForCausalLM.from_pretrained("gpt2")
finetuned_model = AutoModelForCausalLM.from_pretrained("./gpt2-medical-finetuned")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

prompt = "The patient presents with chest pain and shortness of breath."

inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True)
input_ids = inputs['input_ids']



In [8]:
# Generate Output of the Model Before Fine-Tuning
# Pass the attention mask explicitly: since the pad token is the EOS token, generate() cannot infer it
output_pretrained = pretrained_model.generate(input_ids, attention_mask=inputs['attention_mask'], do_sample=True, max_length=100, temperature=1.0, pad_token_id=tokenizer.eos_token_id)
generated_pretrained = tokenizer.decode(output_pretrained[0], skip_special_tokens=True)

print("**Before Fine-Tuning (Pre-Trained GPT-2)**:")
print(generated_pretrained)


**Before Fine-Tuning (Pre-Trained GPT-2)**:
The patient presents with chest pain and shortness of breath. When the patient begins to cry it does not occur to the specialist that the patient's own blood pressure is too low. If your heart rate is very low then that is a sign of cardiac arrest/suicide and is probably at least temporarily preventable.

Patients may require immediate cardiac treatment after cardiac arrest or suicide or when the patient dies because it is not possible to keep control on blood pressure. Please consult your medical history carefully


In [9]:
# Generate Output of the Model After Fine-Tuning
output_finetuned = finetuned_model.generate(input_ids, attention_mask=inputs['attention_mask'], do_sample=True, max_length=100, temperature=1.0, pad_token_id=tokenizer.eos_token_id)
generated_finetuned = tokenizer.decode(output_finetuned[0], skip_special_tokens=True)

print("**After Fine-Tuning (Fine-Tuned on Medical Dataset)**:")
print(generated_finetuned)


**After Fine-Tuning (Fine-Tuned on Medical Dataset)**:
The patient presents with chest pain and shortness of breath. Patients who report breathing difficulties may display respiratory depression, hypoxia, and hyperalgesia and often use a nasal apnea and airway obstruction.


<span style="color:red">Extra stuff!</span>

The abstracts in the dataset have very different lengths, yet so far every sequence has been padded to the maximum length. This is inefficient, since the model also has to process all of the padding tokens.

To avoid this, we can use a technique called **Dynamic Padding**: each batch is padded only to the length of its longest sequence, rather than to a fixed global maximum. When sequences of similar length are also grouped into the same batch, the amount of padding required can drop significantly.
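
Padding a batch only to its own longest sequence can be sketched in plain Python as follows. The helpers `pad_batch` and `count_pads` are hypothetical, the toy sequences are made up, and a fixed length of 8 stands in for the model's maximum length.

```python
def pad_batch(batch, pad_id, target_len=None):
    """Pad every sequence to target_len, or to the longest sequence in the batch if None."""
    if target_len is None:
        target_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (target_len - len(seq)) for seq in batch]


def count_pads(batch, padded):
    """Number of padding tokens added when going from batch to padded."""
    return sum(len(p) for p in padded) - sum(len(s) for s in batch)


toy_batch = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

fixed = pad_batch(toy_batch, pad_id=0, target_len=8)  # fixed maximum length
dynamic = pad_batch(toy_batch, pad_id=0)              # longest sequence in the batch

print(count_pads(toy_batch, fixed))    # 15
print(count_pads(toy_batch, dynamic))  # 3
```

On this toy batch, padding to the batch maximum adds 3 padding tokens instead of 15.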

As a first exercise, quantify the number of pad tokens being used in various situations:
1. You pad all batches to the maximum allowed sequence length (1024 for GPT-2; this is what we have used so far)
2. You pad the entire batch to the length of the longest sequence in the batch (generate the batches by randomly sampling sequences)
3. You pad the entire batch to the length of the longest sequence in the batch (generate the batches by placing sequences of similar lengths together)

Next, introduce dynamic padding and compare the training time with and without it.

You can use the following resources to help you with this exercise:
- `group_by_length` parameter ([TrainingArguments](https://huggingface.co/docs/transformers/v4.46.0/en/main_classes/trainer#transformers.TrainingArguments.group_by_length)) (parameter to group together samples with similar lengths)
- `DataCollatorForSeq2Seq` ([DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForSeq2Seq)) (collator function that aggregates samples into batches and pads them to the maximum length of the batch)

Note: the validation losses you observe may differ from the previous ones. This is because the cross-entropy loss is computed as an average across tokens, and the number of tokens that enter this average depends on the padding strategy used.
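
To see why the average can move, consider the toy computation below. The per-token losses are made up, `mean_loss` is a hypothetical helper, and the value `-100` is the label that PyTorch's cross-entropy loss ignores by default. The sketch assumes the last two positions are padding that the model predicts almost perfectly.

```python
def mean_loss(per_token_losses, labels, ignore_index=-100):
    """Average the losses over the positions whose label is not ignore_index."""
    kept = [loss for loss, lab in zip(per_token_losses, labels) if lab != ignore_index]
    return sum(kept) / len(kept)


losses = [2.0, 4.0, 0.0, 0.0]            # made-up values; the last two positions are padding
labels_with_pad = [5, 6, 50256, 50256]   # padding positions keep a real token id
labels_masked = [5, 6, -100, -100]       # padding positions are excluded from the loss

print(mean_loss(losses, labels_with_pad))  # 1.5
print(mean_loss(losses, labels_masked))    # 3.0
```

The model's predictions are identical in both cases; only the set of positions included in the average changes.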