# Finetuning GPT2 with custom data

In [None]:
import warnings
warnings.filterwarnings("ignore")

from transformers import AutoTokenizer, AutoModelWithLMHead
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

import torch
from datasets import Dataset


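We can either use `AutoModelWithLMHead` and specify the checkpoint name, or we can use `GPT2LMHeadModel` directly. The first option is preferable because the same code works for any model with an LM head. (Recent `transformers` releases deprecate `AutoModelWithLMHead`; `AutoModelForCausalLM` is its replacement for GPT-style models.) We use `gpt2-medium` here. The GPT-2 family also includes `gpt2` (the smallest), `gpt2-large`, and `gpt2-xl`.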

**What is an LM Head?**

It is the language modeling head: the final linear layer that maps the transformer's hidden state at each position to a vector with one score (logit) per token in the vocabulary. Applying a softmax to those logits gives the probability distribution over the next token.
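To make the idea concrete, here is a minimal pure-Python sketch of an LM head. The sizes and weights are toy values chosen for illustration (hidden size 3, vocabulary of 4 tokens), and `lm_head` and `softmax` are hypothetical helper names, not part of `transformers`. In GPT-2 the real layer has no bias term, and its hidden size and vocabulary are far larger.

```python
import math

def lm_head(hidden, weight):
    """Project one hidden vector onto the vocabulary: one logit per token."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values (assumed for illustration): hidden size 3, vocabulary of 4 tokens.
hidden = [0.5, -1.0, 2.0]
weight = [
    [0.1, 0.2, 0.3],
    [0.0, -0.5, 0.1],
    [1.0, 0.0, 0.5],
    [-0.3, 0.4, -0.2],
]

logits = lm_head(hidden, weight)  # one score per vocabulary entry
probs = softmax(logits)           # probabilities that sum to 1
next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
print(next_token, [round(p, 3) for p in probs])
```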

In [2]:
context_length = 516

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
model = AutoModelWithLMHead.from_pretrained("gpt2-medium").to("cuda")

model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 354.8M parameters


By current standards our model is small (about 355M parameters), but it is still effective. We write a function that prompts the model for us. `model.generate` takes many arguments. Here are the important ones:

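`max_length`: The maximum total length of the sequence in tokens, counting the prompt as well as the generated text. If you set this too high, you might get repetition.

`temperature`: Controls how random sampling is. The logits are divided by this value before the softmax, so values below 1 make the output more predictable, 1 leaves the distribution unchanged, and values above 1 make it more random. It must be greater than 0 and only matters when `do_sample=True`.

`no_repeat_ngram_size`: An n-gram is a series of n adjacent tokens. If this is set to n, no n-gram of that size may occur more than once in the output. For example, with a value of 2, every pair of adjacent tokens can occur only once.

`do_sample`: Whether or not to sample the next token from the distribution. If False, decoding is deterministic (greedy or beam search), so you get the same output every time.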
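The following sketch illustrates two of these parameters in plain Python. It is not how `transformers` implements them internally. `softmax_with_temperature` and `banned_next_tokens` are hypothetical helpers, and the logits and token ids are made up for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i:i + ngram_size - 1]) == prefix:
            banned.add(generated[i + ngram_size - 1])
    return banned

logits = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary
sharp = softmax_with_temperature(logits, 0.7)
plain = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
print(round(sharp[0], 3), round(plain[0], 3), round(flat[0], 3))

# The sequence ends in token 5, and the pair (5, 7) has already occurred,
# so with ngram_size=2 token 7 may not be generated next.
print(banned_next_tokens([5, 7, 9, 5], ngram_size=2))
```

Lower temperatures concentrate probability on the highest-scoring token, which is why the value of 0.7 used in this notebook gives somewhat more conservative text than the unscaled distribution.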

In [3]:
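def generate(prompt, tokenizer, model, length):
    # Encode the prompt and move it to the same device as the model
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)
    output = model.generate(input_ids,
                        max_length=length,          # prompt + generated tokens
                        temperature=0.7,
                        num_beams=5,                # beam search with 5 beams
                        no_repeat_ngram_size=2,
                        early_stopping=True,
                        do_sample=True,             # sample within the beams
                        pad_token_id=tokenizer.eos_token_id
                        )
    # Return the best sequence as text, without special tokens
    return tokenizer.decode(output[0], skip_special_tokens=True)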

In [4]:
output = generate("Do you have any regrets about the Monica Lewinsky scandal?", tokenizer, model, length=256)
print(output)

Do you have any regrets about the Monica Lewinsky scandal?

I don't regret it at all. I think it was the right thing to do at the time, and I'm very proud of what I did. But I also think there were other things that I should have done differently. And I've learned a lot from my mistakes.


If you run the above you'll probably get a response from former President Trump. This is perhaps indicative of the training data used to pretrain GPT-2.

## Finetuning with custom data

We use transcripts of speeches from former President Clinton.

In [5]:
data = Dataset.from_text('../sample_data/cleaned_test_text_1_QA_special.txt', split='train')

We need a few preprocessing steps with the `datasets` library: tokenize the text, cut it into chunks of at most `context_length` tokens, and put the result into a format the Hugging Face `Trainer` can read.

In [6]:
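# GPT-2 has no padding token, so reuse its original end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token
# Add the markers used in the training text as new BOS/EOS tokens
special_tokens_dict = {
    "bos_token": "<start>",
    "eos_token": "<stop>",
}
tokenizer.add_special_tokens(special_tokens_dict)
# resize the token embeddings to cover the two new tokens
model.resize_token_embeddings(len(tokenizer))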

Embedding(50259, 1024)

In [7]:
# Tokenize every line; text longer than context_length is split into
# additional chunks instead of being discarded
outputs = tokenizer(
        data["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )

tokenized_dataset = Dataset.from_dict({"input_ids": outputs.input_ids})
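With `truncation=True` and `return_overflowing_tokens=True`, a text longer than `max_length` tokens comes back as several chunks rather than one truncated sequence. The sketch below shows the same idea on a plain list of token ids. `chunk_token_ids` is a hypothetical helper, and the ids are made up. The real tokenizer does this separately for each input text.

```python
def chunk_token_ids(token_ids, context_length):
    """Split one token sequence into consecutive chunks of at most context_length."""
    if context_length < 1:
        raise ValueError("context_length must be at least 1")
    return [
        token_ids[start:start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]

token_ids = list(range(10))  # stand-in for real token ids
chunks = chunk_token_ids(token_ids, context_length=4)
print(chunks)  # the last chunk is shorter than the others
```

Because the final chunk of each text is usually shorter than `context_length`, the batches still need padding, which is the data collator's job.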

In [8]:
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

In [9]:
out = data_collator([tokenized_dataset[i] for i in range(10)])
for key in out:
    print(f"{key} shape: {out[key].shape}")
print(f"\n{tokenized_dataset}")


input_ids shape: torch.Size([10, 516])
attention_mask shape: torch.Size([10, 516])
labels shape: torch.Size([10, 516])

Dataset({
    features: ['input_ids'],
    num_rows: 274
})
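With `mlm=False`, the collator pads each batch to a common length and builds `labels` as a copy of `input_ids` in which padding positions are replaced by -100, the value the loss function ignores. The labels are not shifted here; the model shifts them internally when computing the loss. The sketch below mimics that behavior on plain lists. `collate_for_causal_lm` is a hypothetical stand-in, right-padding is assumed, and the token ids are made up.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def collate_for_causal_lm(examples, pad_token_id):
    """Right-pad token id lists to equal length and build causal LM labels."""
    longest = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = longest - len(ids)
        padded = ids + [pad_token_id] * n_pad
        batch["input_ids"].append(padded)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; every pad-token position is ignored by the loss.
        batch["labels"].append(
            [IGNORE_INDEX if tok == pad_token_id else tok for tok in padded]
        )
    return batch

batch = collate_for_causal_lm([[1, 2, 3], [4, 5]], pad_token_id=0)
for key, rows in batch.items():
    print(key, rows)
```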


## Training

We define our training arguments: batch size, number of epochs, logging interval, and so on. These settings affect both training time and the quality of the fine-tuned model.

In [10]:
args = TrainingArguments(
    output_dir="../results",
    per_device_train_batch_size=4,
    num_train_epochs=50,
    logging_steps=100
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)
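It helps to estimate how long this configuration will run before starting it. The sketch below computes the number of optimizer steps from the 274 rows reported above, the batch size of 4, and the 50 epochs. It assumes a single device, no gradient accumulation, and a checkpoint interval of 500 steps, which is assumed here to be the `Trainer` default since the arguments above do not set it. `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(num_rows, batch_size, num_epochs, logging_steps, save_steps):
    """Estimate step counts for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_rows / batch_size)  # last batch may be partial
    total_steps = steps_per_epoch * num_epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_lines": total_steps // logging_steps,
        "checkpoints": total_steps // save_steps,
    }

schedule = training_schedule(
    num_rows=274, batch_size=4, num_epochs=50,
    logging_steps=100, save_steps=500,  # save_steps is an assumed default
)
print(schedule)
```

Under these assumptions the first checkpoint would be written at step 500, so the output directory needs enough free disk space for several full copies of the model.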

In [11]:
trainer.train()

Step,Training Loss
100,4.8875
200,2.6191
300,2.301
400,2.075


RuntimeError: [enforce fail at inline_container.cc:424] . unexpected pos 2544239360 vs 2544239252

The run stopped with a `RuntimeError` raised by PyTorch's serialization code, most likely while writing a checkpoint to disk. The "database or disk is full" message in the next cell's output suggests the disk ran out of space. The model object in memory still holds the weights from the steps that did complete, so the following cells use that partially fine-tuned model.

In [12]:
# prompt the model with a query and get the answer
prompt = "Do you have any regrets about the Monica Lewinsky scandal?"
output = generate(prompt, tokenizer, model, 256)[len(prompt):]

print(prompt + "\n\n" + output)

The history saving thread hit an unexpected error (OperationalError('database or disk is full')).History will not be written to the database.
Do you have any regrets about the Monica Lewinsky scandal?

 I mean, do you?No, I don't. I think it was the best thing that ever happened to American politics, and I hope it never happens to anybody else. But I do regret the fact that I didn't do more to help her through the Whitewater investigation or the Independent Counsel's investigation. And I regret that, in retrospect, it may have been the right thing to do, because of the special relationship that existed between her and me and Bill Clinton, who was at the time Governor of Arkansas and a Senator from New York. It was very important to me to have a relationship with her that was mutually respectful and respectful of her position in the family life and the private lives of both of us. That's not the way I would have handled it, but it worked out just fine for me. So I'm not going to dwell o

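This is working decently. Now let's incorporate retrieval-augmented generation (RAG): we save the fine-tuned model, reload it, and build a vector store over the transcripts so that we can retrieve passages relevant to a query.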

In [40]:
# save model
trainer.save_model("../models/qa_model/")

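# release cached GPU memory that is no longer in use
# (the model itself stays on the GPU for as long as it is referenced)
torch.cuda.empty_cache()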

In [2]:
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("../models/qa_model/")
model = AutoModelWithLMHead.from_pretrained("../models/qa_model/").to("cuda")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

In [4]:
def create_database(text_path, chunk_size, chunk_overlap):
    # load text
    with open(text_path, 'r') as f:
        text = f.read()

    # Split text
    text_splitter = CharacterTextSplitter(
        separator=' ',
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    chunks = text_splitter.split_text(text)

    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')
    db = FAISS.from_texts(chunks, embeddings)

    return db

In [5]:
text_path = '../sample_data/cleaned_test_text_1_QA_special.txt'
db = create_database(text_path, chunk_size=300, chunk_overlap=100)
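
The effect of `chunk_size` and `chunk_overlap` is easier to see on a tiny example. The sketch below is a simplified sliding window over characters, not LangChain's algorithm: `CharacterTextSplitter` splits on the separator and merges the pieces, so its chunks end on word boundaries, while this version may cut mid-word. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

pieces = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(pieces)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.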

In [24]:
query = "Do you have any regrets about the Monica Lewinsky scandal?"

docs = db.similarity_search(query, k=5)

for doc in docs:
    print(doc.page_content+'\n')

question on a matter of history that I feel compelled to ask you, Mr. President. We sat, you and I, 2 years ago almost to the day, and I it was the day that the Monica Lewinsky story broke in the Washington Post and the Los Angeles Times. And I and you denied that you had had an improper sexual

one mistake. I apologized for it. I paid a high price for it, and I've done my best to atone for it by being a good President. But I believe we also endured what history will clearly record was a bogus investigation, where there was nothing to Whitewater and nothing to these other charges, and they

Washington Post and the Los Angeles Times. And I and you denied that you had had an improper sexual relationship with Ms. Lewinsky. In retrospect, if you had answered that differently right at the beginning, not only just my question but all those questions at the beginning, do you think there would

David Gergen agreed and said you should turn over all the data, everything. And you didn't do it. Do
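
Conceptually, `similarity_search` embeds the query and returns the chunks whose embeddings lie closest to it. The sketch below shows that ranking step with toy two-dimensional vectors, assuming Euclidean distance as the measure. The real embeddings have hundreds of dimensions and FAISS uses an optimized index. `squared_l2` and `top_k` are hypothetical helpers.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k):
    """Indices of the k document vectors closest to the query, nearest first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: squared_l2(query_vec, doc_vecs[i]),
    )
    return ranked[:k]

# Toy embeddings (made up): documents 0 and 2 point roughly the same way as the query.
query_vec = [1.0, 0.0]
doc_vecs = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.2], [-1.0, 0.0]]

nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)
```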

In [28]:
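# Note: only the query is passed to the model here; the retrieved `docs`
# are not part of the prompt
output = generate(query, tokenizer, model, 256)[len(query):]

print(query + "\n\n" + output)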

Do you have any regrets about the Monica Lewinsky scandal?

Well, I regret it. I think it was a terrible mistake on my part to characterize it as something that had nothing to do with my husband and his wife and their children. And I'm glad that the facts were out there, and I hope the American people will come forward and find out whether there was anything wrong with it and, if so, whether anybody else should be punished or not. But I am profoundly sorry that it occurred to you and your family and to the country at large, because I know that there are people in this country who believe that something terrible had happened and ought to be brought to light. That's what I want people to think about when they see a jury of law abiding citizens come back and say, "This woman did something wrong. She should never have been in office." I don't think that that's appropriate. It's not appropriate for me to say that I condone or even in the strongest of terms possible refer to or condone, in a
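
For the retrieved passages to influence the answer, they have to be part of the prompt. One possible way to do that is sketched below. The prompt layout (context first, then the question) and the character budget are assumptions for illustration, not a format this model was trained on, and `build_rag_prompt` is a hypothetical helper. In practice the budget should be counted in tokens so that the context, the question, and the answer together stay within `max_length`.

```python
def build_rag_prompt(query, chunks, max_chars=1500):
    """Prepend retrieved chunks to the query, stopping at a character budget."""
    context_parts = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # adding this chunk would exceed the budget
        context_parts.append(chunk.strip())
        used += len(chunk)
    context = "\n\n".join(context_parts)
    return f"{context}\n\n{query}" if context else query

# Stand-ins for [doc.page_content for doc in docs]
retrieved = ["one mistake. I apologized for it. ", "I paid a high price for it"]
rag_prompt = build_rag_prompt("Do you have any regrets?", retrieved)
print(rag_prompt)
```

In this notebook the result could be passed to `generate` in place of the bare query, with the slice offset set to `len(rag_prompt)` so that only the newly generated text is printed.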