# Finetuning GPT-2 with custom data

In [1]:
import warnings
warnings.filterwarnings("ignore")

from transformers import AutoTokenizer, AutoModelWithLMHead
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

import torch
from datasets import Dataset

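We can load the model either with `AutoModelWithLMHead`, passing `"gpt2"` as the checkpoint name, or with `GPT2LMHeadModel` directly. The first option is preferable because the same code then works for any model with an LM head. (Recent releases of `transformers` deprecate `AutoModelWithLMHead`; `AutoModelForCausalLM` is its replacement for models like GPT-2.) The `gpt2` checkpoint is the smallest version of GPT-2. There are also `gpt2-medium`, `gpt2-large`, and `gpt2-xl`.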

**What is an LM Head?**

It is the language modeling head: a linear layer on top of the transformer that maps each position's high-dimensional hidden state to a vector with one entry (a logit) per token in the model's vocabulary. A softmax over these logits gives the probability distribution over the next token.
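
As a rough illustration, an LM head can be sketched in plain Python as one matrix multiplication followed by a softmax. The sizes here (a 3-dimensional hidden state and a 4-token vocabulary) and all weight values are made up for the example; in the `gpt2` checkpoint the hidden size is 768 and the vocabulary has 50,257 tokens.

```python
import math

def lm_head(hidden, weights):
    """Map a hidden state of length d to one logit per vocabulary token.

    `weights` holds one row of length d per token.
    """
    return [sum(h * w for h, w in zip(hidden, row)) for row in weights]

def softmax(logits):
    """Turn logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values, assumed for illustration only.
hidden = [0.5, -1.0, 2.0]
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

logits = lm_head(hidden, weights)   # one score per token
probs = softmax(logits)             # sums to 1
next_token = max(range(len(probs)), key=probs.__getitem__)  # most likely token id
```

Picking the most likely token, as in the last line, is greedy decoding; sampling would draw from `probs` instead.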

In [2]:
context_length = 512

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2").to("cuda")

model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 124.4M parameters


At 124M parameters the model is small by current standards, but it still generates fluent text. We write a function that will prompt the model for us. `model.generate` takes many arguments. Here are the important ones:

`max_length`: The maximum total length of the sequence in tokens, counting the prompt. If you set this too high, you might get repetition.

`temperature`: How random the output is. The logits are divided by this value before sampling, so values below 1 make the output more predictable, 1 leaves the model's distribution unchanged, and values above 1 make it more random. It only has an effect when `do_sample=True`.

`no_repeat_ngram_size`: If set to n, every n-gram can occur only once in the output. An n-gram is a series of n adjacent tokens, so with a value of 2 no pair of adjacent tokens is ever repeated.

`do_sample`: Whether or not to sample from the distribution. If False, decoding is deterministic and you'll get the same output every time.
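
To make `temperature` and `no_repeat_ngram_size` more concrete, here is a simplified sketch of the logic behind them. It is not the `transformers` implementation; the helper names `softmax_with_temperature` and `banned_next_tokens` are hypothetical and the inputs are toy values.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size - 1:
        return set()
    # The last n-1 tokens are the start of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.7)  # more mass on the top token
plain = softmax_with_temperature(logits, 1.0)  # the unchanged distribution
flat = softmax_with_temperature(logits, 1.5)   # closer to uniform

# "the" was already followed by "cat", so "cat" may not follow "the" again.
banned = banned_next_tokens(["the", "cat", "sat", "on", "the"], ngram_size=2)
```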

In [3]:
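def generate(prompt, tokenizer, model, length):
    """Generate a continuation of `prompt`; `length` is the total token budget, prompt included."""
    # put the prompt on the same device as the model
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)
    output = model.generate(input_ids,
                        max_length=length,
                        temperature=0.7,
                        num_beams=5,
                        no_repeat_ngram_size=2,
                        early_stopping=True,
                        do_sample=True,
                        pad_token_id=tokenizer.eos_token_id
                        )
    # the decoded text starts with the prompt itself
    return tokenizer.decode(output[0], skip_special_tokens=True)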

In [4]:
output = generate("State your name and occupation.", tokenizer, model, length=256)
print(output)

State your name and occupation.

If you have any questions or concerns, please contact us.


Because sampling is enabled, the output changes from run to run. If you run the above a few times, you'll probably get a response in the voice of former President Trump. This is perhaps indicative of the data used to pretrain GPT-2.

## Finetuning with custom data

We use transcripts of speeches from former President Clinton.

In [5]:
data = Dataset.from_text('../sample_data/cleaned_test_text_1_QA_special.txt', split='train')

Getting the text into a format the Hugging Face trainer can read takes a few steps with the `datasets` library. We register the special tokens, tokenize the text, and cut it into chunks of at most `context_length` tokens.

In [6]:
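# reuse GPT-2's original end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token
# add dedicated start and stop tokens (these replace the tokenizer's bos/eos tokens)
special_tokens_dict = {
    "bos_token": "<start>",
    "eos_token": "<stop>",
}
tokenizer.add_special_tokens(special_tokens_dict)
# resize the token embeddings to make room for the two new tokens
model.resize_token_embeddings(len(tokenizer))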

Embedding(50259, 768)

In [7]:
# tokenize every line; a text longer than context_length is split into
# several chunks instead of being cut off
outputs = tokenizer(
    data["text"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

tokenized_dataset = Dataset.from_dict({"input_ids": outputs.input_ids})
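
The chunking that `truncation=True` together with `return_overflowing_tokens=True` performs can be sketched as follows. This is a simplified model that assumes the default stride of 0 (no overlap between chunks); `chunk_token_ids` is a hypothetical helper and the ids are stand-ins, not real tokenizer output.

```python
def chunk_token_ids(token_ids, max_length):
    """Split one tokenized text into consecutive chunks of at most max_length ids."""
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]

token_ids = list(range(1200))  # stand-in for the ids of one long text
chunks = chunk_token_ids(token_ids, max_length=512)
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)  # [512, 512, 176]
```

Only the last chunk of a text is shorter than `max_length`; shorter chunks get padded later, when a batch is assembled.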

In [8]:
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

In [9]:
out = data_collator([tokenized_dataset[i] for i in range(10)])
for key in out:
    print(f"{key} shape: {out[key].shape}")
print(f"\n{tokenized_dataset}")

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


input_ids shape: torch.Size([10, 512])
attention_mask shape: torch.Size([10, 512])
labels shape: torch.Size([10, 512])

Dataset({
    features: ['input_ids'],
    num_rows: 276
})
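
With `mlm=False`, the collator prepares batches for causal language modeling: it pads every example to a common length and copies `input_ids` into `labels`, with padding positions set to -100 so that the loss ignores them. The model shifts the labels by one position internally. The sketch below mimics that behavior on plain lists; `collate_causal_lm` is a hypothetical helper, not the library's code.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def collate_causal_lm(examples, pad_token_id):
    """Pad a batch to its longest example and build labels for causal LM training."""
    longest = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        padding = longest - len(ids)
        padded = ids + [pad_token_id] * padding
        batch["input_ids"].append(padded)
        batch["attention_mask"].append([1] * len(ids) + [0] * padding)
        # Labels copy the inputs, with every pad-token id masked out.
        batch["labels"].append(
            [IGNORE_INDEX if tok == pad_token_id else tok for tok in padded]
        )
    return batch

# 50256 is the id of GPT-2's original end-of-text token, used here for padding.
batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=50256)
```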


## Training

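We define our training arguments: batch size, number of epochs, and so on. These settings affect both the training time and the quality of the finetuned model. Setting `save_strategy="no"` disables intermediate checkpoints; we save the model ourselves after training.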

In [10]:
args = TrainingArguments(
    output_dir="../results",
    per_device_train_batch_size=12,
    num_train_epochs=20,
    logging_steps=100,
    save_strategy="no",
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

In [11]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 100  | 5.8527        |
| 200  | 2.8263        |
| 300  | 2.6356        |
| 400  | 2.5299        |


TrainOutput(global_step=460, training_loss=3.334866299836532, metrics={'train_runtime': 241.5265, 'train_samples_per_second': 22.855, 'train_steps_per_second': 1.905, 'total_flos': 1442332016640000.0, 'train_loss': 3.334866299836532, 'epoch': 20.0})
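
The numbers in `TrainOutput` can be checked by hand. The arithmetic below assumes a single GPU and no gradient accumulation, which is consistent with the arguments above.

```python
import math

num_rows = 276            # rows in tokenized_dataset
batch_size = 12           # per_device_train_batch_size
num_epochs = 20           # num_train_epochs
train_runtime = 241.5265  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_rows * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)  # 23 460
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The step count matches `global_step=460`, and the two throughput figures should agree with `train_samples_per_second` and `train_steps_per_second`.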

In [14]:
# prompt the model with a query and get the answer
prompt = "Do you have any regrets about your life?"
output = generate(prompt, tokenizer, model, 256)[len(prompt):]

print(prompt + "\n\n" + output)

Do you have any regrets about your life?

Well, I don't regret it. I think it was a good thing for the country, and I regret that I didn't do anything wrong. But I'm not sure I ever would have been President if I hadn't been able to do what I did. And I've tried to follow the law and the Constitution, but I never had any kind of personal involvement in the decisionmaking process. So I can't comment on that. It's just a question of whether I should have done something wrong or not. If I had done it wrong, it would be a huge mistake for me, because I knew I was making a mistake. You know, the thing that happened to me in '94 was basically the beginning of the end of my political career, which was when I got out of college and started running for President in high school. We had a very bad year. There were a lot of things that we did wrong that weren't good for our country and bad for America. That's why I thought I ought to be President, not just because it's a big mistake I made to get 

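This is working decently: the output is fluent and in the right voice, although it is not factually reliable. Now let's incorporate retrieval-augmented generation (RAG). We retrieve the passages of the transcripts most relevant to the query and prepend them to the prompt.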

In [15]:
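# save model (the tokenizer passed to the Trainer is saved alongside it)
trainer.save_model("../models/qa_model/")

# drop the references to the trained model, then clear cached gpu memory
del trainer, model
torch.cuda.empty_cache()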

In [16]:
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("../models/qa_model/")
model = AutoModelWithLMHead.from_pretrained("../models/qa_model/").to("cuda")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [17]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

In [18]:
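def create_database(text_path, chunk_size, chunk_overlap):
    """Split a text file into overlapping chunks and index their embeddings with FAISS."""
    # load text
    with open(text_path, 'r', encoding='utf-8') as f:
        text = f.read()

    # Split text on spaces; chunk_size and chunk_overlap are measured in
    # characters because length_function=len
    text_splitter = CharacterTextSplitter(
        separator=' ',
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    chunks = text_splitter.split_text(text)

    # embed each chunk and build the vector index
    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')
    db = FAISS.from_texts(chunks, embeddings)

    return db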

In [21]:
text_path = '../sample_data/cleaned_test_text_1.txt'
db = create_database(text_path, chunk_size=512, chunk_overlap=64)
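
To see what `chunk_size` and `chunk_overlap` do, here is a simplified sketch of splitting on spaces with overlap. It is not LangChain's implementation; `split_with_overlap` is a hypothetical helper, and the tiny sizes are chosen only to keep the result readable.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Greedily pack space-separated words into chunks of at most chunk_size
    characters, starting each new chunk with the tail of the previous one."""
    chunks, current = [], []
    for word in text.split(" "):
        if current and len(" ".join(current + [word])) > chunk_size:
            chunks.append(" ".join(current))
            # Keep trailing words as overlap, as long as they fit both the
            # overlap budget and, together with the new word, the chunk size.
            while current and (
                len(" ".join(current)) > chunk_overlap
                or len(" ".join(current + [word])) > chunk_size
            ):
                current.pop(0)
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = split_with_overlap("a b c d e f g h", chunk_size=5, chunk_overlap=1)
print(chunks)  # ['a b c', 'c d e', 'e f g', 'g h']
```

Each chunk repeats the end of the previous one, so a sentence cut at a chunk boundary still appears with some of its context in the next chunk.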

In [22]:
query = "Do you have any regrets about your life?"

docs = db.similarity_search(query, k=3)

for doc in docs:
    print(doc.page_content+'\n')

I had made a terrible personal mistake, which I did try to correct, which then a year later got outed on or almost a year later and had to live with. And it caused an enormous amount of pain to my family and my administration and to the country at large, and I felt awful about it. And I had to deal with the aftermath of it. And then, I had to deal with what the Republicans were trying to do with it. But I had a totally different take on it than most people. I really believed then and I believe now I was

for that. And I will leave office with that sense of gratitude, because I think that's what every President wants to do. Every President wants to feel that during his tenure of service, America grew stronger and healthier and better. I feel good about where we are in our relations with the rest of the world. I think we've basically been a force for peace and prosperity. What is my greatest regret? I may not be able to say yet. I really wanted, with all my heart, to finish the Oslo peac
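
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose embedding vectors lie closest to it. The sketch below shows that idea with made-up 2-D vectors and squared Euclidean distance. The distance metric is an assumption, and a real embedding model produces hundreds of dimensions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def similarity_search(query_vector, store, k):
    """Return the k texts whose vectors lie closest to query_vector.

    `store` is a list of (text, vector) pairs.
    """
    ranked = sorted(store, key=lambda item: squared_distance(query_vector, item[1]))
    return [text for text, _ in ranked[:k]]

# Made-up "embeddings", for illustration only.
store = [
    ("chunk about regrets", [0.9, 0.1]),
    ("chunk about the economy", [0.1, 0.9]),
    ("chunk about mistakes", [0.8, 0.3]),
]
query_vector = [1.0, 0.0]  # pretend embedding of the query

top = similarity_search(query_vector, store, k=2)
# The retrieved chunks become the context in front of the query.
rag_prompt = " ".join(top) + "\n\n" + "Do you have any regrets about your life?"
```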

In [24]:
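# prepend the retrieved passages to the query
prompt = " ".join([doc.page_content for doc in docs]) + "\n\n" + query

# the 512-token budget includes the prompt, so long contexts leave less room for the answer
output = generate(prompt, tokenizer, model, 512)[len(prompt):]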

print(query + "\n\n" + output)

Do you have any regrets about your life?

Well, first of all I regret not having done more for the United States in the Middle East and the Balkans and all the other places I've been involved in, including Bosnia, Kosovo, Rwanda, Bosnia and Kosovo. It was a mistake I made at some point, but I didn't regret it as much as I would have liked to have done if it hadn't been for what happened in Bosnia. So I don't think I should have made that mistake again. Secondly, it's important for me to remember that when I became President, there were a lot of people out there who believed that Saddam Hussein had weapons of mass destruction and chemical weapons and biological weapons. That's not true. The truth is, the truth was that there was no such thing as a chemical or biological weapon. There was only a small percentage of
