# Fine-Tuning on Pre-Training Data: A Performance Comparison

In this notebook, we fine-tune GPT-2 on the text of the Bible and evaluate whether doing so improves its question-answering (QA) performance on the BibleQA dataset.

To start, we install and import the necessary dependencies.

In [1]:
!pip install transformers datasets peft pandas tqdm


In [2]:
import csv
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
from peft import get_peft_model, LoraConfig
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling

Next, we load the biblical text as a Hugging Face dataset for later fine-tuning.

In [3]:
bible_text_dataset = load_dataset('text', data_files={'train': 'bible.txt'})
bible_text_dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 31104
    })
})

In [4]:
bible_text_dataset['train'][3000]['text']

'Leviticus 11:1\tAnd Jehovah spake unto Moses and to Aaron, saying unto them,'

We now prepare our BibleQA dataset as question-answer pairs.

In [5]:
bible_qa_pairs = {
    'questions': [],
    'answers': []
}
with open('bible_qa_pairs.csv') as bible_qa_pairs_csv_file:
    bible_qa_csv_file_reader = csv.reader(bible_qa_pairs_csv_file, delimiter='\t')
    header = next(bible_qa_csv_file_reader)
    for row in bible_qa_csv_file_reader:
        if row:
            bible_qa_pairs['questions'].append(
                ''.join(row[0].split('. ')[1:]))
            bible_qa_pairs['answers'].append(
                ''.join(''.join(row[1].split('. ')[1:]).split(' (')[:1]))
bible_qa_pairs_df = pd.DataFrame(bible_qa_pairs)
bible_qa_pairs_df.head()

| | questions | answers |
|---|---|---|
| 0 | What was the name of Jesus' mother? | Mary |
| 1 | What was the name of the garden where Adam and... | Eden |
| 2 | With what food did Jesus feed the multitude? | Five loaves and two fishes |
| 3 | What method did the Romans use to kill Jesus? | Crucifixion |
| 4 | From which part of Adam's body did God create ... | Rib |
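
The string handling in the cell above is dense, so here is a minimal sketch of what it does to a single row. The sample row is hypothetical, chosen only to match the `N. text (reference)` shape that the parsing code implies, and `strip_numbering` and `strip_reference` are illustrative helper names that do not appear in the notebook.

```python
def strip_numbering(text: str) -> str:
    """Drop a leading 'N. ' prefix, keeping any later '. ' in the text."""
    prefix, separator, rest = text.partition('. ')
    return rest if separator and prefix.isdigit() else text


def strip_reference(text: str) -> str:
    """Drop everything from the first ' (' onward, e.g. a verse citation."""
    return text.split(' (', 1)[0]


# Hypothetical row in the shape the parsing code appears to expect.
sample_row = ["1. What was the name of Jesus' mother?", '1. Mary (Matthew 1:18)']

question = strip_numbering(sample_row[0])
answer = strip_reference(strip_numbering(sample_row[1]))
print(question)
print(answer)
```

Unlike `''.join(text.split('. ')[1:])`, `partition` splits only once, so a question that contains `'. '` later in the sentence keeps its punctuation.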


We now load the model and tokenizer. We load three instances of the model, one for each evaluation: the pre-trained baseline, the fully fine-tuned model, and the PEFT model.

In [6]:
model_name = 'gpt2'
pretrained_model = GPT2LMHeadModel.from_pretrained(model_name)
fine_tuned_model = GPT2LMHeadModel.from_pretrained(model_name)
peft_model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

Now that we have our tokenizer, we can tokenize our biblical text for fine-tuning.

In [7]:
def tokenize_bible_text(examples):
    return tokenizer(
        examples['text'], truncation=True,
        padding='max_length', max_length=512
    )
tokenized_bible_text = bible_text_dataset.map(
    tokenize_bible_text, batched=True
)
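
Padding every verse to 512 tokens is simple, but individual verses are likely far shorter, so much of each batch may consist of padding. The sketch below compares the number of token positions processed under fixed-length padding with the number processed when each batch is padded only to its longest member. The token lengths are made-up illustrative values, not measurements from `bible.txt`, and `padded_positions` is a hypothetical helper.

```python
from typing import List, Optional


def padded_positions(lengths: List[int], batch_size: int,
                     max_length: Optional[int] = None) -> int:
    """Count token positions after padding, summed over all batches.

    With max_length set, every sequence is padded to that length.
    Otherwise each batch is padded to its own longest sequence.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = max_length if max_length is not None else max(batch)
        total += target * len(batch)
    return total


# Made-up token lengths for eight short verses (illustrative only).
example_lengths = [20, 35, 28, 41, 18, 60, 33, 25]

fixed = padded_positions(example_lengths, batch_size=4, max_length=512)
dynamic = padded_positions(example_lengths, batch_size=4)
print(fixed, dynamic)
```

One way to get the second behavior would be to tokenize without `padding='max_length'` and let the data collator pad each batch; how much that saves here depends on the real length distribution of the verses.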

We can now fine-tune. We start with full fine-tuning, which simply continues training all of the model's weights on our biblical text.

In [8]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2
)
trainer = Trainer(
    model=fine_tuned_model,
    args=training_args,
    train_dataset=tokenized_bible_text['train'],
    data_collator=data_collator
)
trainer.train()

| Step | Training Loss |
|---|---|
| 500 | 3.2768 |
| 1000 | 3.1181 |
| 1500 | 3.0045 |
| 2000 | 2.9286 |
| 2500 | 2.9073 |
| 3000 | 2.8636 |
| 3500 | 2.829 |
| 4000 | 2.8143 |
| 4500 | 2.7769 |
| 5000 | 2.771 |


TrainOutput(global_step=7776, training_loss=2.85940164793666, metrics={'train_runtime': 1040.5479, 'train_samples_per_second': 29.892, 'train_steps_per_second': 7.473, 'total_flos': 8127227363328000.0, 'train_loss': 2.85940164793666, 'epoch': 1.0})
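
The figures in `TrainOutput` can be sanity-checked from the dataset size and batch size. The sketch below assumes a single device and no gradient accumulation, neither of which is set explicitly in the notebook.

```python
import math

num_rows = 31104           # rows in the train split
batch_size = 4             # per_device_train_batch_size
num_epochs = 1             # num_train_epochs
train_runtime = 1040.5479  # seconds, from TrainOutput

# One optimizer step per batch, assuming one device and no accumulation.
steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_rows * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)
print(round(samples_per_second, 3))
print(round(steps_per_second, 3))
```

Under those assumptions the step count and throughput agree with the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`.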

Next, we perform parameter-efficient fine-tuning (PEFT) in the form of low-rank adaptation (LoRA).

In [9]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['c_attn', 'c_proj'],
    lora_dropout=0.1,
    bias='none'
)
peft_model = get_peft_model(peft_model, lora_config)

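# Train the LoRA model with its own Trainer so that it gets a fresh
# optimizer, scheduler, and device placement instead of reusing the
# state left over from the full fine-tuning run.
peft_trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_bible_text['train'],
    data_collator=data_collator
)
peft_trainer.train()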



| Step | Training Loss |
|---|---|
| 500 | 2.5787 |
| 1000 | 2.6027 |
| 1500 | 2.5692 |
| 2000 | 2.5443 |
| 2500 | 2.5732 |
| 3000 | 2.5677 |
| 3500 | 2.5616 |
| 4000 | 2.5653 |
| 4500 | 2.5497 |
| 5000 | 2.5623 |


TrainOutput(global_step=7776, training_loss=2.569771880475582, metrics={'train_runtime': 1052.5009, 'train_samples_per_second': 29.552, 'train_steps_per_second': 7.388, 'total_flos': 8282213405687808.0, 'train_loss': 2.569771880475582, 'epoch': 1.0})
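
To see why LoRA counts as parameter-efficient, we can estimate how many weights the configuration above adds. The sketch assumes the standard GPT-2 small dimensions (12 blocks, hidden size 768, MLP size 3072), and it assumes that the `c_proj` target matches both the attention and the MLP projection in each block, since both modules carry that name. The base model's parameter count of 124,439,808 is likewise taken as given rather than computed.

```python
# Assumed GPT-2 small dimensions (not printed anywhere in the notebook).
num_blocks = 12
hidden = 768
mlp_hidden = 3072
base_params = 124_439_808

rank = 16        # r in LoraConfig
lora_alpha = 32  # lora_alpha in LoraConfig

# (input features, output features) of each targeted layer in one block.
targeted_layers = {
    'attn.c_attn': (hidden, 3 * hidden),
    'attn.c_proj': (hidden, hidden),
    'mlp.c_proj': (mlp_hidden, hidden),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in) and B (out x r) next to a frozen weight."""
    return r * (in_features + out_features)


per_block = sum(lora_params(i, o, rank) for i, o in targeted_layers.values())
trainable = per_block * num_blocks
scaling = lora_alpha / rank  # factor applied to the low-rank update

print(trainable)
print(f'{100 * trainable / (base_params + trainable):.2f}% of all weights')
print(scaling)
```

If those assumptions hold, only a small fraction of the weights receive gradients during the LoRA run, while the original GPT-2 weights stay frozen.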

Finally, we can evaluate our models on the question-answer pairs.

In [10]:
def generate_answer(model: GPT2LMHeadModel, question: str) -> str:
    # Skip blank questions rather than generating from an empty prompt.
    if not question.strip():
        return ''
    # Tokenize the prompt; this returns both input_ids and attention_mask.
    inputs = tokenizer(question, return_tensors='pt').to(model.device)
    model_output = model.generate(
        **inputs,
        max_new_tokens=10,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    prompt_length = inputs['input_ids'].shape[1]
    answer = tokenizer.decode(
        model_output[0][prompt_length:],
        skip_special_tokens=True
    )
    return answer

In [11]:
pretrained_model_answers = []
fine_tuned_model_answers = []
peft_model_answers = []
for question in tqdm(bible_qa_pairs_df['questions']):
    pretrained_model_answers.append(
        generate_answer(pretrained_model, question))
    fine_tuned_model_answers.append(
        generate_answer(fine_tuned_model, question))
    peft_model_answers.append(
        generate_answer(peft_model, question))

bible_qa_pairs_df['pretrained_model_answers'] = pretrained_model_answers
bible_qa_pairs_df['fine_tuned_model_answers'] = fine_tuned_model_answers
bible_qa_pairs_df['peft_model_answers'] = peft_model_answers

100%|██████████| 886/886 [08:44<00:00,  1.69it/s]
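
The scoring cell that follows counts an answer as correct whenever the reference string appears anywhere in the generated text. That test is case-sensitive, matches inside longer words, and would count an empty reference as always correct. A slightly stricter variant is sketched below, on the assumption that whole-word, case-insensitive matching is closer to what we want; `contains_answer` is a hypothetical helper, not part of the notebook.

```python
import re


def contains_answer(reference: str, generated: str) -> bool:
    """Return True if reference occurs in generated as whole words."""
    reference = reference.strip()
    # An empty reference would trivially match everything, so reject it.
    if not reference:
        return False
    # Require non-word characters (or string ends) on both sides.
    pattern = r'(?<!\w)' + re.escape(reference) + r'(?!\w)'
    return re.search(pattern, generated, flags=re.IGNORECASE) is not None


print(contains_answer('Mary', "\n\nThe name of Jesus' mother was Mary"))
print(contains_answer('God', 'Godliness is great gain'))
print(contains_answer('', 'any generated text'))
```

This still only checks that the reference words appear; it does not judge whether the generated text actually answers the question, so some matches can remain coincidental.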


In [15]:
pretrained_correct = 0
fine_tuned_correct = 0
peft_correct = 0
bible_qa_rows = bible_qa_pairs_df.to_dict(orient='records')
for row in bible_qa_rows:
    if row['answers'] in row['pretrained_model_answers']:
        print(row['questions'], '|||', row['answers'], '|||', row['pretrained_model_answers'])
        pretrained_correct += 1
    if row['answers'] in row['fine_tuned_model_answers']:
        print(row['questions'], '|||', row['answers'], '|||', row['fine_tuned_model_answers'])
        fine_tuned_correct += 1
    if row['answers'] in row['peft_model_answers']:
        print(row['questions'], '|||', row['answers'], '|||', row['peft_model_answers'])
        peft_correct += 1

pretrained_accuracy = pretrained_correct / len(bible_qa_rows)
fine_tuned_accuracy = fine_tuned_correct / len(bible_qa_rows)
peft_accuracy = peft_correct / len(bible_qa_rows)

print(f'Pre-trained Model Accuracy: {pretrained_accuracy}')
print(f'Fine-tuned Model Accuracy: {fine_tuned_accuracy}')
print(f'PEFT Model Accuracy: {peft_accuracy}')

What was the name of Jesus' mother? ||| Mary ||| 

The name of Jesus' mother was Mary
What was the name of Jesus' mother? ||| Mary ||| 

The name of Jesus' mother was Mary
To which city will all nations one day go to worship God? ||| Jerusalem ||| 

The answer is, of course, Jerusalem
To which city will all nations one day go to worship God? ||| Jerusalem ||| 

The answer is, of course, Jerusalem
Who closed the door of Noah's ark? ||| God |||  and what is the ark of God? and
Which tribe followed David after the split of the Kingdom of Israel? ||| Judah ||| 

The answer is that the tribe of Judah
Which tribe followed David after the split of the Kingdom of Israel? ||| Judah ||| 

The answer is that the tribe of Judah
Which apostle was a Pharisee? ||| Paul ||| 

The apostle Paul, who was a Phar
Which apostle was a Pharisee? ||| Paul ||| 

The apostle Paul, who was a Phar
Who prayed for the fiery serpents to be taken away from Israel? ||| Moses ||| 

The Lord said to Moses, "I
Who prayed 