# Supervised Fine-Tuning of GPT2

https://github.com/omidiu/GPT-2-Fine-Tuning/blob/main/README.md

Why we only use the questions and not the answers in that fine tuning process?

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
import math

## Loading the SQuAD dataset

In [2]:
dataset = load_dataset("squad")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

## Loading the DistilGPT-2 tokenizer

In [3]:
model_id = 'distilgpt2'
tokenizer = AutoTokenizer.from_pretrained(model_id) #use_fast=True)

special_tokens = tokenizer.special_tokens_map
print(special_tokens)

{'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}




## 5) Preprocessing the dataset
Since we are going to use `distilgpt2` as our tokenizer, we should add the corresponding special tokens to the dataset. The special tokens are added to the dataset using the `map` function.

bos_token (str or tokenizers.AddedToken, optional) â€” A special token representing the beginning of a sentence. Will be associated to self.bos_token and self.bos_token_id.

eos_token (str or tokenizers.AddedToken, optional) â€” A special token representing the end of a sentence. Will be associated to self.eos_token and self.eos_token_id.

unk_token (str or tokenizers.AddedToken, optional) â€” A special token representing an out-of-vocabulary token. Will be associated to self.unk_token and self.unk_token_id.

In [4]:
def add_end_token_to_question(input_dict):
    input_dict['question'] += special_tokens['bos_token']
    return input_dict

dataset = dataset.remove_columns(['id', 'title', 'context', 'answers'])
dataset = dataset.map(add_end_token_to_question)

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['question'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['question'],
        num_rows: 10570
    })
})

## Tokenizing the dataset using the tokenizer

In [6]:
def tokenize_function(input_dict):
    return tokenizer(input_dict['question'], truncation=True)
tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=['question'])
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 10570
    })
})

In [7]:
for item in tokenized_dataset['train']:
    print(item['input_ids'])
    print(tokenizer.decode(item['input_ids']))
    break

[2514, 4150, 750, 262, 5283, 5335, 7910, 1656, 287, 1248, 3365, 287, 406, 454, 8906, 4881, 30, 50256]
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?<|endoftext|>


## Grouping Tokenized Text

The grouping tokenized text process involves dividing a tokenized text into fixed-length blocks or chunks to efficiently process large datasets during NLP tasks. By splitting the tokenized sequence into smaller segments, each of equal size, it becomes easier to handle and process the data in parallel, making it ideal for tasks like language modeling and text generation.

In [8]:
max_block_length = 128

def divide_tokenized_text(tokenized_text_dict, block_size):
    """
    Divides the tokenized text in the examples into fixed-length blocks of size block_size.

    Parameters:
    -----------
    tokenized_text_dict: dict
        A dictionary containing tokenized text as values for different keys.

    block_size: int
        The desired length of each tokenized block.

    Returns:
    -----------
        dict: A dictionary with tokenized text divided into fixed-length blocks.
    """
    concatenated_examples = {k: sum(tokenized_text_dict[k], []) for k in tokenized_text_dict.keys()}
    total_length = len(concatenated_examples[list(tokenized_text_dict.keys())[0]])
    total_length = (total_length // block_size) * block_size

    result = {
        k: [t[i: i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }

    result['labels'] = result['input_ids'].copy()
    return result


# see: https://huggingface.co/docs/datasets/nlp_process

lm_dataset = tokenized_dataset.map(
    lambda tokenized_text_dict: divide_tokenized_text(tokenized_text_dict, max_block_length),
    batched=True,
    batch_size=1000,
    num_proc=4,
)

In [9]:
print(lm_dataset['train'][0]['input_ids'])
print(lm_dataset['train'][0]['labels'])

[2514, 4150, 750, 262, 5283, 5335, 7910, 1656, 287, 1248, 3365, 287, 406, 454, 8906, 4881, 30, 50256, 2061, 318, 287, 2166, 286, 262, 23382, 20377, 8774, 11819, 30, 50256, 464, 32520, 3970, 286, 262, 17380, 2612, 379, 23382, 20377, 318, 13970, 284, 543, 4645, 30, 50256, 2061, 318, 262, 10299, 33955, 379, 23382, 20377, 30, 50256, 2061, 10718, 319, 1353, 286, 262, 8774, 11819, 379, 23382, 20377, 30, 50256, 2215, 750, 262, 3059, 349, 3477, 11175, 286, 23382, 288, 480, 2221, 12407, 30, 50256, 2437, 1690, 318, 23382, 20377, 338, 262, 39296, 1754, 3199, 30, 50256, 2061, 318, 262, 4445, 3710, 3348, 379, 23382, 20377, 1444, 30, 50256, 2437, 867, 3710, 1705, 9473, 389, 1043, 379, 23382, 20377, 30, 50256, 818, 644, 614, 750, 262, 3710, 3348]
[2514, 4150, 750, 262, 5283, 5335, 7910, 1656, 287, 1248, 3365, 287, 406, 454, 8906, 4881, 30, 50256, 2061, 318, 287, 2166, 286, 262, 23382, 20377, 8774, 11819, 30, 50256, 464, 32520, 3970, 286, 262, 17380, 2612, 379, 23382, 20377, 318, 13970, 284, 543, 4645

In [10]:
for item in lm_dataset['train']:
    print(item['input_ids'])
    print(tokenizer.decode(item['input_ids']))
    print(len(tokenizer.decode(item['input_ids'])))
    break

[2514, 4150, 750, 262, 5283, 5335, 7910, 1656, 287, 1248, 3365, 287, 406, 454, 8906, 4881, 30, 50256, 2061, 318, 287, 2166, 286, 262, 23382, 20377, 8774, 11819, 30, 50256, 464, 32520, 3970, 286, 262, 17380, 2612, 379, 23382, 20377, 318, 13970, 284, 543, 4645, 30, 50256, 2061, 318, 262, 10299, 33955, 379, 23382, 20377, 30, 50256, 2061, 10718, 319, 1353, 286, 262, 8774, 11819, 379, 23382, 20377, 30, 50256, 2215, 750, 262, 3059, 349, 3477, 11175, 286, 23382, 288, 480, 2221, 12407, 30, 50256, 2437, 1690, 318, 23382, 20377, 338, 262, 39296, 1754, 3199, 30, 50256, 2061, 318, 262, 4445, 3710, 3348, 379, 23382, 20377, 1444, 30, 50256, 2437, 867, 3710, 1705, 9473, 389, 1043, 379, 23382, 20377, 30, 50256, 818, 644, 614, 750, 262, 3710, 3348]
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?<|endoftext|>What is in front of the Notre Dame Main Building?<|endoftext|>The Basilica of the Sacred heart at Notre Dame is beside to which structure?<|endoftext|>What is the Grotto at N

## Get train and evaluation datasets

In [11]:
train_dataset = lm_dataset['train'].shuffle(seed=42).select(range(100))
eval_dataset = lm_dataset['validation'].shuffle(seed=42).select(range(100))

## Fine-tuning the model

The training process is controlled by the TrainingArguments, where we define hyperparameters like the learning rate and weight decay. The model is trained on a question-answering dataset, divided into training and evaluation sets (`train_dataset` and `eval_dataset`). During training, the model's parameters are optimized to predict answers for given questions, making it capable of providing accurate responses to queries.

Also, To ensure the model's compatibility with the tokenization process, we add a special '[PAD]' token to the tokenizer.

By running this section of code, you will have a fine-tuned GPT-2 model optimized for question answering. **(SQuAD)**

In [12]:
# !pip install accelerate --> Trainer need accelerate

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})


training_args = TrainingArguments(
    f'./{model_id}-squad',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False, # Change to True to push the model to the Hub
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)



## Evaluating the fine-tuned model

In [13]:
eval_results = trainer.evaluate()
print(f'Perplexity: {math.exp(eval_results["eval_loss"]):.2f}')

Perplexity: 159.82


## Test the model

In [19]:
prompt = "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"

In [23]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(prompt, max_length=50, num_return_sequences=2)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'generated_text': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? The answer to this question was John Ehrlich, the British president, and a historian.\n\n\n\nFirst published in English by Hugh Hams\nFor'},
 {'generated_text': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? In June 1862, The French press of an American newspaper called "Poles and Printers" published a photograph of Queen Elizabeth VI with Queen Elizabeth II.\nThe'}]

In [24]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='distilgpt2')
set_seed(42)
generator(prompt, max_length=50, num_return_sequences=2)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? The first recorded in the Virgin Mary's name was on March 6, 1859, and by 1859 her name has become the first of her kind.\n\n"},
 {'generated_text': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? What was the actual presence, if any, of the Virgin Mary among other women in the town?\nThis, as we reported last year, is not surprising since'}]

## Save the model

In [None]:
tokenizer.save_pretrained('gpt2-squad')
model.save_pretrained('gpt2-squad')
#model.push_to_hub('gpt2-squad')