## Causal Language Modeling (CLM) - Preprocessing, Training and Inference

In this lab, we will explore **Causal Language Modeling (CLM)**, the core task used to train autoregressive language models like GPT-2. CLM is the process of predicting the next token in a sequence given the previous tokens. This type of modeling forms the backbone of text generation, where the model learns to generate coherent text by attending only to earlier tokens in the sequence.
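
As a toy illustration of the prediction task, the sketch below lists the (context, target) pairs that a causal model is trained on. It assumes whitespace-separated words stand in for GPT-2's subword tokens, and `next_token_pairs` is a hypothetical helper, not part of any library.

```python
# Toy illustration: whitespace "tokens" stand in for GPT-2's subword tokens.
def next_token_pairs(tokens):
    """Return (context, target) pairs: each target is predicted from all tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the patient presents with chest pain".split()
pairs = next_token_pairs(tokens)

# Each line shows what the model sees and the token it must predict next
for context, target in pairs:
    print(" ".join(context), "->", target)
```

A sequence of n tokens therefore yields n - 1 training targets, all computed in a single forward pass.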

The lab is divided into three major sections:
1. **Preprocessing**: Preparing a dataset and the labels for the CLM task and tokenizing them.
2. **Training**: Fine-tuning a pre-trained language model like GPT-2 on a specific dataset using the CLM task.
3. **Inference**: Evaluating the model’s performance by generating text based on input prompts.

In [1]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import load_dataset



### 1. Preprocessing

1. **Load the Domain-Specific Dataset**:
   - The first step in fine-tuning GPT-2 is to load a dataset that is specific to the domain of interest. In this case, we are using a publicly available **medical dataset** from the **PubMed** collection. PubMed contains a vast number of medical articles, and fine-tuning on such a dataset can help GPT-2 generate more accurate and context-specific medical text.
   - The dataset we're using is `"japhba/pubmed_simple"`, which is a simplified version of PubMed data. This dataset can be easily accessed using the `datasets` library from Hugging Face.


In [2]:
train_dataset = load_dataset("japhba/pubmed_simple", split="train")
train_dataset = train_dataset.shuffle(seed=42).select(range(500))

Generating train split: 100%|██████████| 3116832/3116832 [00:08<00:00, 357352.07 examples/s]


2. **Tokenize the Dataset**:
   - The next step is to **tokenize** the data so that it can be processed by GPT-2.
   - GPT-2 has no native padding token, since it does not require fixed-length inputs. Padding sequences to a common length for batching does need one, so we substitute the EOS token, as suggested by the transformers library.

In [17]:
# Tokenize the Dataset
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # Set padding token to EOS token

def tokenize_function(samples):
    return tokenizer(samples['abstract'], padding="max_length", truncation=True)

tokenized_dataset = train_dataset.map(tokenize_function, batched=True)
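
As a rough sketch of what `padding="max_length"` and `truncation=True` do to each abstract, the plain-Python function below pads with the EOS id and builds the matching attention mask. The helper `pad_and_truncate` and the example token ids are hypothetical, and right-side padding is assumed. The real tokenizer handles this internally and, with no explicit `max_length` argument, should pad up to the model's maximum length (1024 tokens for GPT-2).

```python
EOS_ID = 50256  # GPT-2's end-of-text token id, reused here as the padding id

def pad_and_truncate(ids, max_length, pad_id=EOS_ID):
    """Truncate to max_length, then right-pad; the mask marks real tokens with 1."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

# Hypothetical token ids, chosen only for illustration
short = pad_and_truncate([11, 22, 33], max_length=5)
long = pad_and_truncate(list(range(10)), max_length=5)
print(short)
print(long)
```

The attention mask is what lets the model tell padded EOS tokens apart from a genuine end-of-text token.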

3. **Add labels to the Dataset for Next Token Prediction**:
   - In this step, we will add the **labels** that will be used during the next token prediction task. 
   - In autoregressive language modeling, the **labels** are the input sequence shifted by one position: the target at position *t* is the input token at position *t + 1*. This is because the model is trained to predict the next token in the sequence given the previous tokens.
   - We do not need to shift the tokens ourselves. When `labels` are passed, `GPT2LMHeadModel` shifts them internally before computing the loss, so the labels can simply be a copy of `input_ids`.

In [19]:
# Add labels to the dataset
def add_labels(samples):
    samples["labels"] = samples["input_ids"].copy()
    return samples

tokenized_dataset = tokenized_dataset.map(add_labels, batched=True)
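
The shift between predictions and labels can be sketched in plain Python for a single sequence. The function names below are hypothetical, and this is only an approximation of what happens inside the model: the scores at position t are compared with the label at position t + 1, and labels equal to -100 are skipped (the default ignore index of PyTorch's cross-entropy loss). The `mask_padding` helper shows an optional variant that excludes padded positions from the loss; the lab code above does not do this, so it also trains on the padded EOS tokens.

```python
import math

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default

def causal_lm_loss(logits, labels):
    """Mean cross-entropy where the scores at position t are compared with labels[t + 1]."""
    total, count = 0.0, 0
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # masked position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[target]  # negative log-probability of the target
        count += 1
    return total / count

def mask_padding(input_ids, attention_mask):
    """Optional: copy input_ids into labels, replacing padded positions with -100."""
    return [tok if m == 1 else IGNORE_INDEX for tok, m in zip(input_ids, attention_mask)]

# Toy vocabulary of 3 tokens; all-zero scores give a uniform prediction at every position
logits = [[0.0, 0.0, 0.0]] * 4
labels = mask_padding([0, 2, 1, 1], [1, 1, 1, 0])
loss = causal_lm_loss(logits, labels)
print(labels, loss)
```

With uniform scores over three tokens, the loss equals log(3), the value expected from a model that has learned nothing.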

### 2. Training

1. **Fine-Tune the GPT-2 Model**:
   - Set up the model and fine-tune it using the medical dataset.
   - The pipeline is the same one we used in the previous lab (`lab03 - 01-bert`).

In [4]:
# Set Training Parameters
training_args = TrainingArguments(
    output_dir="./gpt2-medical-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=6,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=100,
)

# Initialize GPT-2 Model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

# Fine-Tune the Model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()

# Save the Fine-Tuned Model
model.save_pretrained("./gpt2-medical-finetuned")
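
As a quick sanity check on these settings, the arithmetic below estimates how many optimizer steps the run performs. It is a sketch that assumes a single device, no gradient accumulation, and that the final partial batch is kept; `total_optimizer_steps` is a hypothetical helper, not a `Trainer` method.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Steps for one device, no gradient accumulation, last partial batch kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Values taken from the cells above: 500 samples, batch size 6, 2 epochs
steps = total_optimizer_steps(num_examples=500, batch_size=6, num_epochs=2)
logged = steps // 100  # logging_steps=100
print(steps, logged)  # 168 1
```

Under these assumptions the run takes 168 steps, so with `logging_steps=100` the training loss would be reported only once. A smaller value such as 10 would give a more useful loss curve for a run this short.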





### 3. Inference

1. **Compare Text Generation Before and After Fine-Tuning**:
   - Generate text using both the original pre-trained GPT-2 model and the fine-tuned model.
   - Provide the same input prompt and observe the differences in the outputs.

In [24]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained and fine-tuned GPT-2 models
pretrained_model = GPT2LMHeadModel.from_pretrained("gpt2")
finetuned_model = GPT2LMHeadModel.from_pretrained("./gpt2-medical-finetuned")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

prompt = "The patient presents with chest pain and shortness of breath."

inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True)
input_ids = inputs['input_ids']

In [25]:
# Generate Text Before Fine-Tuning
output_pretrained = pretrained_model.generate(input_ids, do_sample=True, max_length=100, temperature=0.7, pad_token_id=tokenizer.eos_token_id)
generated_pretrained = tokenizer.decode(output_pretrained[0], skip_special_tokens=True)

# Print the Output for Comparison
print("**Before Fine-Tuning (Pre-Trained GPT-2)**:")
print(generated_pretrained)

**Before Fine-Tuning (Pre-Trained GPT-2)**:
The patient presents with chest pain and shortness of breath. The patient's left arm is twisted and the patient is bleeding. The patient is bleeding from the lower back. The patient is bleeding from the upper back. The patient is bleeding from the left side of his body. The patient is bleeding from his left hip. A chest pain is evident in the right chest. There is a pulse that is the same as the left pulse. The patient is bleeding from the left side of the chest. The


In [26]:
# Generate Text After Fine-Tuning
output_finetuned = finetuned_model.generate(input_ids, do_sample=True, max_length=100, temperature=0.7, pad_token_id=tokenizer.eos_token_id)
generated_finetuned = tokenizer.decode(output_finetuned[0], skip_special_tokens=True)

print("\n**After Fine-Tuning (Fine-Tuned on Medical Dataset)**:")
print(generated_finetuned)


**After Fine-Tuning (Fine-Tuned on Medical Dataset)**:
The patient presents with chest pain and shortness of breath. He presents with a paresthesias and nausea. He has diarrhea and is irritable bowel syndrome (IBS). He is mildly irritable bowel syndrome (LBS).
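
Both `generate` calls use `do_sample=True` with `temperature=0.7`. The sketch below shows, in plain Python, how temperature reshapes the next-token distribution before sampling. The function names and the toy logits are hypothetical, and the sketch ignores any other filtering (such as top-k) that `generate` may apply.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=0.7, rng=random):
    """Draw one token index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]  # toy scores for a 3-token vocabulary
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.7)
token = sample_next_token(logits, temperature=0.7)
print(p_default, p_sharp, token)
```

A temperature below 1 sharpens the distribution toward the highest-scoring token, which makes the output more conservative. Because sampling is random, the generated texts above will differ from run to run.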
