# Fine-Tuning on Pre-Training Data: A Performance Comparison

In this notebook, we fine-tune a large language model (LLM), GPT-2 Large, on the text of the Bible using both standard (full) fine-tuning and parameter-efficient fine-tuning, then measure whether either approach improves question-answering accuracy on the BibleQA dataset.

To start, we install and import the necessary dependencies.

In [None]:
!pip install transformers datasets peft pandas tqdm

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting peft
  Downloading peft-0.12.0-py3-none-any.whl.metadata (13 kB)

In [None]:
import csv
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
from peft import get_peft_model, LoraConfig
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    Trainer, TrainingArguments,
    DataCollatorForLanguageModeling
)
N_FINE_TUNING_EPOCHS = 4

Next, we load in the biblical text as a Hugging Face dataset for later fine-tuning.

In [None]:
bible_text_dataset = load_dataset('text', data_files={'train': 'bible.txt'})
bible_text_dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 31104
    })
})

Before proceeding, we print a sample from the Bible dataset as a sanity check.

In [None]:
bible_text_dataset['train'][3000]['text']

'Leviticus 11:1\tAnd Jehovah spake unto Moses and to Aaron, saying unto them,'

We now prepare our BibleQA dataset as question-answer pairs and display a sample.

In [None]:
bible_qa_pairs = {
    'questions': [],
    'answers': []
}
with open('bible_qa_pairs.csv') as bible_qa_pairs_csv_file:
    bible_qa_csv_file_reader = csv.reader(bible_qa_pairs_csv_file, delimiter='\t')
    header = next(bible_qa_csv_file_reader)
    for row in bible_qa_csv_file_reader:
        if row:
            bible_qa_pairs['questions'].append(
                ''.join(row[0].split('. ')[1:]))
            bible_qa_pairs['answers'].append(
                ''.join(''.join(row[1].split('. ')[1:]).split(' (')[:1]))
bible_qa_pairs_df = pd.DataFrame(bible_qa_pairs)
bible_qa_pairs_df.head()

,questions,answers
0,What was the name of Jesus' mother?,Mary
1,What was the name of the garden where Adam and...,Eden
2,With what food did Jesus feed the multitude?,Five loaves and two fishes
3,What method did the Romans use to kill Jesus?,Crucifixion
4,From which part of Adam's body did God create ...,Rib
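
The string handling in the parsing cell is compact, so here is a minimal sketch of what it does to a single row. The sample row is hypothetical: it assumes each field starts with an enumeration such as `1. ` and that an answer may carry a trailing parenthetical, neither of which the notebook shows directly. The helper names `clean_question` and `clean_answer` are illustrative and not part of the notebook.

```python
def clean_question(raw: str) -> str:
    # Drop everything up to and including the first '. ' (the leading number).
    # Note: any later '. ' separators inside the text are removed by the join.
    return ''.join(raw.split('. ')[1:])


def clean_answer(raw: str) -> str:
    # Strip the leading number, then keep only the text before the first ' ('.
    without_number = ''.join(raw.split('. ')[1:])
    return ''.join(without_number.split(' (')[:1])


# Hypothetical raw row, assumed to resemble the tab-separated file.
raw_question = "1. What was the name of Jesus' mother?"
raw_answer = '1. Mary (Matthew 1:18)'

question = clean_question(raw_question)
answer = clean_answer(raw_answer)
print(question)
print(answer)
```

Under this logic, a field that contains no `. ` at all yields an empty string; the empty-question guard at the top of `generate_answer` covers that case.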


We now load the model and its tokenizer. We load three separate instances of the model, one for each configuration we evaluate: the pre-trained baseline, the fully fine-tuned model, and the PEFT model.

In [None]:
model_name = 'openai-community/gpt2-large'
pretrained_model = AutoModelForCausalLM.from_pretrained(model_name)
fine_tuned_model = AutoModelForCausalLM.from_pretrained(model_name)
peft_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

Now that we have our tokenizer, we can tokenize our biblical text for fine-tuning.

In [None]:
def tokenize_bible_text(examples):
    return tokenizer(
        examples['text'], truncation=True,
        padding='max_length', max_length=512
    )
tokenized_bible_text = bible_text_dataset.map(
    tokenize_bible_text, batched=True
)
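
To make the effect of `truncation=True`, `padding='max_length'`, and `max_length=512` concrete, the following is a toy sketch in plain Python, not the tokenizer's actual implementation. It assumes right-side padding and uses 50256 as the pad id, since the pad token was set to GPT-2's end-of-text token; the function names are hypothetical. It also mimics how a causal language-modeling collator typically ignores padded positions by setting their labels to -100.

```python
PAD_ID = 50256  # GPT-2 end-of-text id, reused as the pad token in this notebook


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right-side padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding
    return input_ids, attention_mask


def make_causal_lm_labels(input_ids, pad_id=PAD_ID):
    """Copy input_ids as labels, masking pad positions with -100."""
    return [-100 if tok == pad_id else tok for tok in input_ids]


# Toy verse of 5 made-up token ids, padded to length 8 instead of 512.
toy_ids = [11, 22, 33, 44, 55]
input_ids, attention_mask = pad_or_truncate(toy_ids, max_length=8)
labels = make_causal_lm_labels(input_ids)
padding_fraction = attention_mask.count(0) / len(attention_mask)
print(input_ids, attention_mask, labels, padding_fraction)
```

Individual verses are short, so with a fixed length of 512 most positions in each example are likely to be padding. Padding dynamically per batch is a common alternative, though it would change the timings reported in this notebook.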

We can now fine-tune. We begin with standard (full) fine-tuning, which simply continues training all of the model's weights on our biblical text with a next-token prediction objective.

In [None]:
# On A100: ~1.5 hr/epoch
fine_tuned_data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # causal (next-token) language modeling, not masked LM
)
fine_tuned_training_args = TrainingArguments(
    output_dir='./fine_tuning_results',
    overwrite_output_dir=True,
    num_train_epochs=N_FINE_TUNING_EPOCHS,
    per_device_train_batch_size=8,
    save_strategy='no'
)
fine_tuned_trainer = Trainer(
    model=fine_tuned_model,
    args=fine_tuned_training_args,
    train_dataset=tokenized_bible_text['train'],
    data_collator=fine_tuned_data_collator
)
fine_tuned_trainer.train()

Step,Training Loss
500,2.6641
1000,2.5041
1500,2.4579
2000,2.4109
2500,2.358
3000,2.3267
3500,2.3115
4000,2.1475
4500,1.7963
5000,1.7964


TrainOutput(global_step=15552, training_loss=1.6144247307698913, metrics={'train_runtime': 20690.7961, 'train_samples_per_second': 6.013, 'train_steps_per_second': 0.752, 'total_flos': 2.707510272196608e+17, 'train_loss': 1.6144247307698913, 'epoch': 4.0})
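
The reported `global_step` and runtime can be cross-checked with a little arithmetic. The sketch below uses only numbers printed above (31,104 rows, batch size 8, 4 epochs, `train_runtime` in seconds) and assumes a single device with no gradient accumulation.

```python
import math

num_rows = 31104              # rows in the training split
batch_size = 8                # per_device_train_batch_size
num_epochs = 4                # N_FINE_TUNING_EPOCHS
train_runtime_s = 20690.7961  # train_runtime from the TrainOutput

steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
hours_per_epoch = train_runtime_s / num_epochs / 3600

print(steps_per_epoch, total_steps)   # 3888 15552
print(round(hours_per_epoch, 2))      # 1.44
```

The step count matches `global_step=15552`, and the per-epoch time agrees with the rough `~1.5 hr/epoch` estimate in the cell's comment.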

Next, we perform parameter-efficient fine-tuning (PEFT) in the form of low-rank adaptation (LoRA).

In [None]:
# On A100: ~1.0 hr/epoch
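lora_config = LoraConfig(
    r=16,           # rank of the low-rank update matrices
    lora_alpha=16,  # updates are scaled by lora_alpha / r = 1
    target_modules=['c_attn', 'c_proj'],  # GPT-2 projection layers to adapt
    fan_in_fan_out=True  # GPT-2's Conv1D stores weights as (fan_in, fan_out)
)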
peft_model = get_peft_model(peft_model, lora_config)

peft_data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
peft_training_args = TrainingArguments(
    output_dir='./peft_results',
    overwrite_output_dir=True,
    num_train_epochs=N_FINE_TUNING_EPOCHS,
    per_device_train_batch_size=8,
    save_strategy='no'
)
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_bible_text['train'],
    data_collator=peft_data_collator
)
peft_trainer.train()



Step,Training Loss
500,2.8876
1000,2.5522
1500,2.5266
2000,2.5022
2500,2.4818
3000,2.4683
3500,2.4808
4000,2.4436
4500,2.4548
5000,2.4418


TrainOutput(global_step=15552, training_loss=2.4338112740850253, metrics={'train_runtime': 15435.6276, 'train_samples_per_second': 8.06, 'train_steps_per_second': 1.008, 'total_flos': 2.7385074806685696e+17, 'train_loss': 2.4338112740850253, 'epoch': 4.0})
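
To give a sense of how few weights LoRA trains here, the sketch below counts the adapter parameters implied by `r=16` and the two target module names. It assumes GPT-2 Large's published dimensions (hidden size 1280, 36 blocks, MLP inner size 5120) and that `c_proj` matches both the attention output projection and the MLP output projection in each block. Neither is stated in the notebook, so treat the result as an estimate.

```python
R = 16                  # LoRA rank from lora_config
HIDDEN = 1280           # assumed GPT-2 Large hidden size
N_BLOCKS = 36           # assumed number of transformer blocks
MLP_INNER = 4 * HIDDEN  # assumed MLP inner size (5120)

# (in_features, out_features) of each adapted layer in one block.
adapted_layers = {
    'attn.c_attn': (HIDDEN, 3 * HIDDEN),  # fused query/key/value projection
    'attn.c_proj': (HIDDEN, HIDDEN),      # attention output projection
    'mlp.c_proj': (MLP_INNER, HIDDEN),    # MLP output projection
}


def lora_param_count(in_features, out_features, rank):
    # LoRA adds A (rank x in) and B (out x rank) per adapted layer.
    return rank * (in_features + out_features)


per_block = sum(lora_param_count(i, o, R) for i, o in adapted_layers.values())
total_lora_params = per_block * N_BLOCKS
print(per_block, total_lora_params)   # 225280 8110080
```

Against roughly 774 million parameters in the base model, that is about 1% of the weights. Calling `peft_model.print_trainable_parameters()` after `get_peft_model` reports the actual figure for comparison.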

Finally, we evaluate the three models on the question-answer pairs. Because GPT-2 is a text-continuation model, we embed each question in a few-shot prompt that shows three example question-answer pairs, which encourages the model to continue with an answer (the prompt also states explicitly that the questions are about the Bible).

In [None]:
def generate_answer(model: AutoModelForCausalLM, question: str) -> str:
    if not question.strip():
        return ''

    prompt =\
    f'The following are questions and answers about the Bible.\n\n\
Q: Who was the first man?\n\
A: Adam\n\n\
Q: What was special about Samson?\n\
A: He was very strong\n\n\
Q: Where was Jesus born?\n\
A: Bethlehem\n\n\
Q: {question}\n\
A: '
    # Encode the full few-shot prompt, not just the bare question.
    input_tokens = tokenizer.encode(
        prompt,
        return_tensors='pt'
    ).to(model.device)
    attention_mask = input_tokens.ne(
        tokenizer.pad_token_id
    ).to(model.device)
    model_output = model.generate(
        input_tokens,
        max_new_tokens=15,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        attention_mask=attention_mask
    )
    # Decode only the newly generated tokens so the prompt is excluded.
    answer = tokenizer.decode(
        model_output[0][input_tokens.shape[-1]:],
        skip_special_tokens=True
    )
    return answer
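
Since generation is capped at 15 new tokens rather than stopped at the end of an answer, the continuation can run on into further text. A minimal post-processing sketch, assuming the answer is whatever precedes the first line break in the continuation, could look like the following. `first_line` is a hypothetical helper and not part of the notebook.

```python
def first_line(continuation: str) -> str:
    """Return the first non-empty line of a generated continuation."""
    for line in continuation.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return ''


# Made-up continuation in the style of the few-shot prompt.
sample = ' Mary\n\nQ: Who was the first king of Israel?\nA:'
print(first_line(sample))
```

Trimming in this way would keep stray follow-on text from counting towards a substring match during scoring.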

In [None]:
pretrained_model_answers = []
fine_tuned_model_answers = []
peft_model_answers = []
for question in tqdm(bible_qa_pairs_df['questions']):
    pretrained_model_answers.append(
        generate_answer(pretrained_model, question))
    fine_tuned_model_answers.append(
        generate_answer(fine_tuned_model, question))
    peft_model_answers.append(
        generate_answer(peft_model, question))

bible_qa_pairs_df['pretrained_model_answers'] = pretrained_model_answers
bible_qa_pairs_df['fine_tuned_model_answers'] = fine_tuned_model_answers
bible_qa_pairs_df['peft_model_answers'] = peft_model_answers

100%|██████████| 886/886 [34:29<00:00,  2.34s/it]
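
The scoring cell that follows counts a model as correct when the reference answer appears verbatim inside its generated text. That check is sensitive to case and punctuation. A more lenient variant, sketched here under the assumption that ignoring case and punctuation is acceptable for this task, might look like the code below. The helper names are hypothetical, and swapping this in would change the reported accuracies.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return ' '.join(text.lower().translate(table).split())


def contains_answer(reference: str, generated: str) -> bool:
    """True if the normalized reference occurs in the normalized output."""
    ref = normalize(reference)
    return bool(ref) and ref in normalize(generated)


print(contains_answer('Rib', 'God took one of his ribs.'))
print(contains_answer('Mary', "The name of Jesus' mother was Mary."))
print(contains_answer('Eden', 'Cana of Galilee'))
```

Either form of substring matching remains a loose criterion: a short reference such as `Rib` also matches inside longer words, so the resulting accuracies are best read as upper bounds.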


In [None]:
pretrained_correct = 0
fine_tuned_correct = 0
peft_correct = 0
bible_qa_rows = bible_qa_pairs_df.to_dict(orient='records')
for row in bible_qa_rows:
    one_correct = False
    if row['answers'] in row['pretrained_model_answers']:
        pretrained_correct += 1
        one_correct = True
    if row['answers'] in row['fine_tuned_model_answers']:
        fine_tuned_correct += 1
        one_correct = True
    if row['answers'] in row['peft_model_answers']:
        peft_correct += 1
        one_correct = True
    if one_correct:
        print(f'Question: {row["questions"]}')
        print(f'Answer: {row["answers"]}')
        print(f'Pre-trained Model Answer: {row["pretrained_model_answers"]}')
        print(f'Fine-tuned Model Answer: {row["fine_tuned_model_answers"]}')
        print(f'PEFT Model Answer: {row["peft_model_answers"]}')
        print()

pretrained_accuracy = pretrained_correct / len(bible_qa_rows)
fine_tuned_accuracy = fine_tuned_correct / len(bible_qa_rows)
peft_accuracy = peft_correct / len(bible_qa_rows)

print(f'Pre-trained Model Accuracy: {pretrained_accuracy}')
print(f'Fine-tuned Model Accuracy: {fine_tuned_accuracy}')
print(f'PEFT Model Accuracy: {peft_accuracy}')

Question: What was the name of Jesus' mother?
Answer: Mary
Pre-trained Model Answer: What was the name of Jesus' mother?

The name of Jesus' mother was Mary.

Was Jesus
Fine-tuned Model Answer: What was the name of Jesus' mother? Elisabeth, of the family of the Elisabethites; and
PEFT Model Answer: What was the name of Jesus' mother? And what was the name of his father? And what was the name of

Question: What was the name of the garden where Adam and Eve lived?
Answer: Eden
Pre-trained Model Answer: What was the name of the garden where Adam and Eve lived?

The Garden of Eden.

What was the name of the
Fine-tuned Model Answer: What was the name of the garden where Adam and Eve lived? The name of the garden was called Cana of Galilee. And the
PEFT Model Answer: What was the name of the garden where Adam and Eve lived? And what was the name of the first tree which they sowed? And

Question: What was special about Jesus' mother?
Answer: She was a virgin
Pre-trained Model Answer: What was