# Finetune DistilGPT2 on the ELI5-Category dataset

## Introduction
Language modeling can be categorized into two types: causal and masked. This guide focuses on causal language modeling, commonly used in **text generation** tasks. Causal language models are ideal for applications such as **interactive storytelling** or **intelligent coding assistants** like Copilot and CodeParrot.

In causal language modeling, the model predicts the next token in a sequence based on previous tokens, without access to future tokens. This approach ensures that the model's predictions are generated in a left-to-right manner. GPT-2 is a prominent example of a causal language model.
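To make the left-to-right constraint concrete, here is a minimal sketch in plain Python. The toy token list and both helper names are illustrative assumptions, not part of any library. It builds the (context, next token) pairs a causal model learns from, and the lower-triangular mask that hides future positions.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: each target is predicted from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def causal_mask(n):
    """Row i holds 1 where position i may look at position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["The", "cat", "sat", "down"]  # toy, hypothetical example
pairs = next_token_pairs(tokens)
mask = causal_mask(len(tokens))

for context, target in pairs:
    print(context, "->", target)
for row in mask:
    print(row)
```

Each row of the mask only exposes earlier positions, which is why the model never sees the token it is asked to predict.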

## Setup

In [9]:
!pip install transformers evaluate datasets



In [11]:
# Login to HuggingFace Hub
from huggingface_hub import notebook_login

notebook_login()


## Load ELI5

**Dataset Card:**

The ELI5-Category dataset is a smaller, newer, categorized version of the original ELI5 dataset. It is an English-language dataset of questions and answers gathered from the [r/explainlikeimfive](https://www.reddit.com/r/explainlikeimfive/) subreddit, where users ask factual questions that require paragraph-length or longer answers. After 2017, the subreddit introduced a tagging system so that questions can be categorized into topics according to their tags. Because the training and validation sets are built from questions in different topics, the dataset is expected to alleviate the train/validation overlap issue in the original [ELI5 dataset](https://huggingface.co/datasets/eli5).

In [12]:
# Load 5000 examples from ELI5-Category dataset
from datasets import load_dataset

eli5 = load_dataset("eli5_category", split="train[:5000]")

The repository for eli5_category contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/eli5_category.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Generating train split:   0%|          | 0/91772 [00:00<?, ? examples/s]

Generating validation1 split:   0%|          | 0/5446 [00:00<?, ? examples/s]

Generating validation2 split:   0%|          | 0/2375 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5411 [00:00<?, ? examples/s]

In [13]:
eli5

Dataset({
    features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
    num_rows: 5000
})

In [14]:
# Split the dataset into a train and test set
eli5 = eli5.train_test_split(test_size=0.2)
eli5

DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 1000
    })
})
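Conceptually, `train_test_split(test_size=0.2)` shuffles the rows and holds out 20% of them. The sketch below mimics that idea on a list of row indices in plain Python. It is an illustration only, not the library's implementation, and `split_indices` is a hypothetical helper. Because the split is random, the rows in each subset differ between runs unless a seed is fixed; `train_test_split` accepts a `seed` argument for that purpose.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a test_size fraction of them."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(5000, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))  # 4000 1000
```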

In [15]:
# Take a look at an example
eli5["train"][0]

{'q_id': '7fp7ku',
 'title': 'why do laptop and PC manufacturers install all of this useless software that slows the computer down so much?',
 'selftext': '',
 'category': 'Technology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dqddoru'],
  'text': ["$$$ You'll notice that a lot of the software has a premium version and it tries to get you to upgrade after the trial period is over. These software companies pay the OEMs to load the trials onto the prebuilt machines in hopes that a small percentage of those who buy them will buy the premium version when the trial runs out. HP, Dell, and other OEMs load the software on the machine because it lets them charge less for the machine while still making the same amount of money and lets them move more machines that way."],
  'score': [27],
  'text_urls': [[]]},
 'title_urls': ['url'],
 'selftext_urls': ['url']}

## Preprocess the data

In [16]:
# Load a DistilGPT2 tokenizer to process text subfield
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

In [17]:
# Extract the text subfield from nested structure
eli5 = eli5.flatten()
eli5["train"][0]

{'q_id': '7fp7ku',
 'title': 'why do laptop and PC manufacturers install all of this useless software that slows the computer down so much?',
 'selftext': '',
 'category': 'Technology',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dqddoru'],
 'answers.text': ["$$$ You'll notice that a lot of the software has a premium version and it tries to get you to upgrade after the trial period is over. These software companies pay the OEMs to load the trials onto the prebuilt machines in hopes that a small percentage of those who buy them will buy the premium version when the trial runs out. HP, Dell, and other OEMs load the software on the machine because it lets them charge less for the machine while still making the same amount of money and lets them move more machines that way."],
 'answers.score': [27],
 'answers.text_urls': [[]],
 'title_urls': ['url'],
 'selftext_urls': ['url']}
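The effect of `flatten()` on a single record can be sketched in plain Python: nested dictionaries are unrolled into one level, and the key names are joined with a dot. The helper `flatten_record` below is a hypothetical illustration of that behaviour, not the method the `datasets` library uses internally.

```python
def flatten_record(record, parent_key=""):
    """Recursively turn nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, full_key))
        else:
            flat[full_key] = value  # lists and scalars are kept as they are
    return flat


example = {
    "q_id": "7fp7ku",
    "answers": {"a_id": ["dqddoru"], "score": [27]},
}
flat = flatten_record(example)
print(flat)
```

After flattening, the answer text is reachable directly as `answers.text`, which is the column the preprocessing function reads.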

In [18]:
# Function to join the list of strings for each example and tokenize the results
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

# Apply the preprocess function over the entire dataset
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (2788 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3304 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1088 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2166 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1069 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1164 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1102 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1334 > 1024). Running this sequence through the model will result in indexing errors


In [19]:
tokenized_eli5

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [20]:
# Concatenate all sequences and split the concatenated sequences into shorter chunks
block_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # Drop the trailing partial block (only when at least one full block exists)
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size

    # Split by chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }

    result["labels"] = result["input_ids"].copy()
    return result
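To see what the grouping step does, the sketch below runs the same logic on a toy batch with a block size of 4 instead of 128. `group_texts_demo` is a self-contained copy of the function above that takes `block_size` as a parameter; the token ids are made up for illustration.

```python
def group_texts_demo(examples, block_size):
    # Concatenate all sequences of each column into one long list
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])

    # Drop the trailing partial block
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size

    # Split into chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
grouped = group_texts_demo(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The ten input tokens become two chunks of four, and the last two tokens are discarded. Chunks can span the boundary between two answers, and `labels` is an unshifted copy of `input_ids`; the one-position shift for next-token prediction is applied inside the model when it computes the loss. Since every chunk is at most `block_size` tokens long, the "longer than the specified maximum sequence length" warnings printed during tokenization do not affect training.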

In [21]:
# Apply the group_texts function over the entire dataset
lm_dataset = tokenized_eli5.map(
    group_texts,
    batched=True,
    num_proc=4,
)

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

In [22]:
# Create a batch of examples with data collator
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
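With `mlm=False`, the collator (as far as I know, in current `transformers` releases) builds the labels by copying `input_ids` and replacing padding positions with `-100`, the value the loss function ignores. The sketch below reproduces that idea in plain Python. It is an illustration rather than the library code, and `collate_causal` is a hypothetical helper. The pad id 50256 is GPT-2's end-of-sequence id, which this notebook reuses as the pad token.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def collate_causal(sequences, pad_id):
    """Pad to the longest sequence and build labels that ignore pad positions."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    labels = [[IGNORE_INDEX if tok == pad_id else tok for tok in row] for row in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_causal([[11, 12, 13], [21, 22]], pad_id=50256)
print(batch["labels"])  # [[11, 12, 13], [21, 22, -100]]
```

One consequence of reusing the end-of-sequence token for padding is that genuine end-of-sequence tokens are excluded from the loss too. In this notebook the chunks produced by `group_texts` almost always share the same length, so padding is rarely needed in practice.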

## Train the Model

In [23]:
# Load DistilGPT2 model
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

In [33]:
# Define the training hyperparameters
training_args = TrainingArguments(
    output_dir="gpt2-funetuned-eli5",
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

In [34]:
# Define trainer with the training arguments, model, datasets and data collator
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

In [35]:
# Finetune the Model
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.8522 | 3.830748 |
| 2 | 3.8093 | 3.827997 |
| 3 | 3.7661 | 3.826901 |


TrainOutput(global_step=3867, training_loss=3.8167276039646367, metrics={'train_runtime': 720.6736, 'train_samples_per_second': 42.914, 'train_steps_per_second': 5.366, 'total_flos': 1010140575694848.0, 'train_loss': 3.8167276039646367, 'epoch': 3.0})
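The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes the `Trainer` defaults that this notebook does not override: 3 epochs, a per-device batch size of 8, a single device, and no gradient accumulation. Under those assumptions the throughput figures imply roughly 10,309 training chunks of 128 tokens each.

```python
import math

global_step = 3867           # from TrainOutput
epochs = 3                   # reported as 'epoch': 3.0
batch_size = 8               # assumed default per_device_train_batch_size, one device
samples_per_second = 42.914  # 'train_samples_per_second'
runtime_seconds = 720.6736   # 'train_runtime'

steps_per_epoch = global_step // epochs
samples_per_epoch = round(samples_per_second * runtime_seconds / epochs)
expected_steps = math.ceil(samples_per_epoch / batch_size)

print(steps_per_epoch)    # 1289
print(samples_per_epoch)  # 10309
print(expected_steps)     # 1289
```

The two step counts agree, which suggests the run did use the default batch size.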

In [36]:
# Evaluate the model
import math

eval_results = trainer.evaluate()

print(f"Perplexity: {math.exp(eval_results['eval_loss']): .2f}")

Perplexity:  45.92
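Perplexity is the exponential of the average cross-entropy loss, so it can be read as the effective number of equally likely choices the model is weighing at each token. The sketch below shows the definition on made-up token probabilities and then recomputes the reported value from the epoch 3 validation loss; `perplexity` is a hypothetical helper written for this illustration.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that gives probability 1/4 to every true token has perplexity close to 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(uniform_ppl)

# The epoch 3 validation loss reported during training gives the value printed above.
reported_ppl = round(math.exp(3.826901), 2)
print(reported_ppl)  # 45.92
```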


In [37]:
# Push the model to the HuggingFace Hub
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/ashaduzzaman/gpt2-funetuned-eli5/commit/f1010fe1d255ce780ba849e930ea337b91e4d430', commit_message='End of training', commit_description='', oid='f1010fe1d255ce780ba849e930ea337b91e4d430', pr_url=None, pr_revision=None, pr_num=None)

## Inference

### Inference with a pipeline()

In [38]:
from transformers import pipeline

generator = pipeline("text-generation", model="ashaduzzaman/gpt2-funetuned-eli5")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [41]:
prompt = "Somatic hypermutation allows the immune system to"
generator(prompt)

[{'generated_text': 'Somatic hypermutation allows the immune system to generate more antibodies. Because your immune system has such a strong negative correlation with the immune system, such a correlation causes the immune system to produce antibodies that can kill you. If we use an entire dose'}]

In [43]:
# Tokenize the text and return the input_ids as PyTorch tensors
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ashaduzzaman/gpt2-funetuned-eli5")
inputs = tokenizer(prompt, return_tensors="pt").input_ids

In [47]:
# Use the generate() method to generate text.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("ashaduzzaman/gpt2-funetuned-eli5")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
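The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The sketch below illustrates the idea on a made-up next-token distribution in plain Python: keep the `top_k` most likely tokens, renormalise, then keep the smallest set of tokens whose cumulative probability reaches `top_p`. It is a simplified illustration under those assumptions, not the implementation in `generate()`, and the vocabulary and probabilities are invented.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]  # renormalise after top-k

    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}  # renormalise after top-p


# Hypothetical next-token distribution
probs = {"antibodies": 0.5, "cells": 0.3, "proteins": 0.15, "banana": 0.05}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.8)
print(filtered)

# Sample one token from the filtered distribution
choice = random.Random(0).choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(choice)
```

Here `top_k=3` removes the least likely token, and `top_p=0.8` then keeps only the two most likely ones. With the notebook's settings (`top_k=50`, `top_p=0.95`) the filter is much more permissive, which is one reason the sampled text can drift off topic.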


In [48]:
# Decode the generated text
tokenizer.batch_decode(outputs, skip_special_tokens=True)

["Somatic hypermutation allows the immune system to replicate the type and target molecules by binding onto them without breaking the immune system's immune system's protective barrier. When you look at a person with a genetic predisposition, they have to start with the genetic template that your heart is attached to. In this case, the virus has been successfully transgressed, so it's in the DNA of all the different organisms that you need to fight against. If the genes you need to fight against aren't present, then the immune system will try to remove"]