# Finetuning GPT-2 with custom data

In [7]:
import warnings
warnings.filterwarnings("ignore")

from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AutoTokenizer, AutoModelWithLMHead
from transformers import Trainer, TrainingArguments

import torch
from datasets import Dataset

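We can either use `AutoModelWithLMHead` and specify `"gpt2"`, or use `GPT2LMHeadModel` directly. The first option is preferable because the same code then works for any model with an LM head. (Recent versions of `transformers` deprecate `AutoModelWithLMHead` in favor of `AutoModelForCausalLM` for decoder-only models such as GPT-2.) We use `gpt2`, the smallest GPT-2 checkpoint; `gpt2-medium`, `gpt2-large`, and `gpt2-xl` are also available.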

**What is an LM Head?**

It is the language modeling head: the fully connected layer that maps the transformer's high-dimensional hidden state to a vector with one score (logit) per token in the model's vocabulary. Passing these logits through a softmax gives the probability distribution over the next token.
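
As a rough illustration of that mapping, here is a minimal pure-Python sketch with toy sizes. The names `lm_head` and `softmax` and the random weights are assumptions made for the example; they are not the `transformers` implementation, which uses a single linear layer over tensors.

```python
import math
import random

def lm_head(hidden, weight):
    """Project one hidden-state vector onto the vocabulary: one logit per token."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

def softmax(logits):
    """Turn logits into a probability distribution (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
hidden_size, vocab_size = 4, 6  # toy sizes; GPT-2 small uses 768 and 50257
weight = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(vocab_size)]
hidden = [random.uniform(-1, 1) for _ in range(hidden_size)]

logits = lm_head(hidden, weight)  # one score per vocabulary entry
probs = softmax(logits)           # probabilities over the next token
print(len(probs), round(sum(probs), 6))
```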

In [2]:
context_length = 256
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2").to("cuda")

model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 124.4M parameters


Our model is quite small, but still very effective. We write a function that will prompt the model for us. `model.generate` takes a few arguments. Here are the important ones:

`max_length`: The maximum total length of the sequence in tokens, prompt included. If you set this too high, you might get repetition.

`temperature`: How random the sampled output is. Values close to 0 make the output nearly deterministic, 1 samples from the model's unmodified distribution, and higher values are more random still. It only has an effect when `do_sample=True`.

`no_repeat_ngram_size`: No n-gram of this size may occur more than once in the output. An n-gram is a sequence of n adjacent tokens, so if this is 2, every pair of adjacent tokens can occur only once.

`do_sample`: Whether or not to sample from the distribution. If False, decoding is deterministic (greedy or beam search) and you'll get the same output every time.
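
To make two of these parameters concrete, here is a small pure-Python sketch of what they do conceptually. The function names are hypothetical, and this is an illustration of the idea rather than the code `model.generate` runs internally.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature before the softmax (temperature > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def banned_next_tokens(tokens, n):
    """Tokens that would complete an n-gram already present in `tokens`."""
    if n <= 0 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the start of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned

logits = [2.0, 1.0, 0.0]
for t in (0.1, 1.0, 5.0):
    # Low temperature sharpens the distribution, high temperature flattens it.
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

# With n=2 the bigram (1, 2) has already occurred, so 2 is banned after the final 1.
print(banned_next_tokens([1, 2, 3, 1], n=2))
```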

In [3]:
def generate(prompt):
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to('cuda')
    output = model.generate(input_ids,
                        max_length=256,
                        # temperature=0.7,
                        num_beams=5,
                        no_repeat_ngram_size=2,
                        early_stopping=True,
                        # do_sample=True,
                        # pad_token_id=tokenizer.eos_token_id
                        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

output = generate("Do you have any regrets?")
print(output)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Do you have any regrets?

No, I'm not regretful at all. It's just that I didn't want to do it. I just wanted to go out there and do what I love doing. That's why I did it, because I knew I was going to be a good person and I had a lot of respect for the people that were around me. But I also knew that it would be hard for me to make it to the end of the season, so I wasn't sure if I'd be able to come back and play in the playoffs or not. So I decided to take it one step at a time, and that's how I ended up doing it."


If you run the above, you'll probably get a coherent but meaningless response.

## Further pretraining on a text file

We use transcripts of press events from former President Clinton.

In [4]:
data = Dataset.from_text('../sample_data/cleaned_test_text_1.txt', split='train')

We need a little preprocessing with the `datasets` library to get this to work. We tokenize the text, cut it into chunks of at most `context_length` tokens, and put it into a format the Hugging Face `Trainer` can read.
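
The chunking step can be pictured with the following pure-Python sketch. It assumes no overlap between chunks and uses a hypothetical helper, `chunk_token_ids`; the tokenizer applies this kind of splitting to each row of the dataset separately when `truncation=True` and `return_overflowing_tokens=True` are set.

```python
def chunk_token_ids(token_ids, context_length):
    """Split one list of token ids into consecutive chunks of at most context_length."""
    return [
        token_ids[i:i + context_length]
        for i in range(0, len(token_ids), context_length)
    ]

ids = list(range(600))  # stand-in for the token ids of one long row of text
chunks = chunk_token_ids(ids, context_length=256)
print([len(c) for c in chunks])  # [256, 256, 88]
```

The last chunk is usually shorter than `context_length`, which is why the data collator has to pad each batch to a common length.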

In [5]:
outputs = tokenizer(
        data["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )

tokenized_dataset = Dataset.from_dict({"input_ids": outputs.input_ids})

In [6]:
# GPT-2 has no pad token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

out = data_collator([tokenized_dataset[i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


input_ids shape: torch.Size([5, 256])
attention_mask shape: torch.Size([5, 256])
labels shape: torch.Size([5, 256])


In [7]:
args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    num_train_epochs=20,
    logging_steps=100
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

In [8]:
trainer.train()

Step,Training Loss
100,3.0504
200,2.8131
300,2.6541


KeyboardInterrupt: 

In [9]:
output = generate("Do you have any regrets?")
print(output)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Do you have any regrets? The President. Well, first of all, I didn't do anything wrong. I was just trying to do my job and try to help the American people. And I think that's the most important thing that I could have done. Secondly, we had a lot of good people in the White House, and we did a good job there. We had some bad people there, too. So I don't regret what I did, but I do regret the fact that we were able to get people to come to our office and say, "Mr. President, you know, this is what we're doing here," and then we got a chance to talk to them and get a better sense of what was going on. Mr. Blitzer. Do you regret that, personally, that you were the only person in this room who was not inaudible to the people who were there at the dinner table when the President was talking to you? President Bush. No, not at all. That's not what happened. It happened in '96, '98. But I'm very sorry about it, because I thought it was a terrible mistake to say anything that would have been i

In [10]:
# save model
trainer.save_model("../models/basic_model/")

# clear gpu memory
torch.cuda.empty_cache()

So the model is definitely learning the data. Now let's see if we can make it learn a general QA format. To do this, we exploit some recurring features of the dataset, such as responses being introduced by `The President.` We replace every instance of `The President.` with `RESPONSE: `, and replace the names of the interviewers with `QUESTION: `. We also append the end-of-text token (`<|endoftext|>`) to each response, in the hope that the model learns to stop there.
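
A minimal sketch of that rewrite is shown below. It assumes every turn starts with a speaker label followed by a period; the function name `to_qa_format` and the interviewer list are hypothetical, and the script actually used to produce `cleaned_test_text_1_QA.txt` is not shown in this notebook.

```python
import re

EOS = "<|endoftext|>"  # GPT-2's end-of-text token

def to_qa_format(text, interviewers, answer_label="The President."):
    """Rewrite 'Speaker. utterance' transcripts as QUESTION:/RESPONSE: pairs."""
    question_labels = [name + "." for name in interviewers]
    labels = [answer_label] + question_labels
    # The capture group makes re.split keep the matched speaker labels.
    pattern = re.compile("(" + "|".join(re.escape(label) for label in labels) + ")")
    parts = pattern.split(text)
    out = []
    # parts = [text before the first label, label, utterance, label, utterance, ...]
    for label, utterance in zip(parts[1::2], parts[2::2]):
        utterance = utterance.strip()
        if label == answer_label:
            out.append(f"RESPONSE: {utterance}{EOS}")
        else:
            out.append(f"QUESTION: {utterance} ")
    return "".join(out)

sample = (
    "Mr. Blitzer. Do you have any regrets? The President. Well, I do. "
    "Mr. Blitzer. Such as? The President. I may not be able to say yet."
)
print(to_qa_format(sample, interviewers=["Mr. Blitzer"]))
```

Note that a plain string match like this will also split on `The President.` when it appears in the middle of a sentence, so the result is worth spot-checking by hand.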

In [8]:
data = Dataset.from_text('../sample_data/cleaned_test_text_1_QA.txt', split='train')

In [9]:
context_length = 512
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({'pad_token': '<pad>'})

model = AutoModelWithLMHead.from_pretrained("gpt2").to("cuda")
with torch.no_grad():
    model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

outputs = tokenizer(
        data["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )

tokenized_dataset = Dataset.from_dict({"input_ids": outputs.input_ids})

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

out = data_collator([tokenized_dataset[i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

GPT-2 size: 124.4M parameters


You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


input_ids shape: torch.Size([5, 512])
attention_mask shape: torch.Size([5, 512])
labels shape: torch.Size([5, 512])


In [10]:
print(tokenizer.decode(out['input_ids'][0], skip_special_tokens=False))

QUESTION: We understand you made a foreign policy related call shortly. RESPONSE: Yes, I just talked to President Kim about the No Gun Ri incident and personally expressed my regret to him. And I thanked him for the work that we had done together in developing our mutual statement. We also set up this scholarship fund and did some other things that we hope will be a genuine gesture of our regret. It was a very you know, I had a good talk with him.<|endoftext|>QUESTION: Any particular reason why you used the word "regret" instead of "apology" in your statement? RESPONSE: I think the findings were I think he knows that "regret" and "apology" both mean the same thing, in terms of being profoundly sorry for what happened. But I believe that the people who looked into it could not conclude that there was a deliberate act, decided at a high enough level in the military hierarchy, to acknowledge that, in effect, the Government had participated in something that was terrible. So I don't think 

The rest of the process is identical to before.

In [11]:
args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    num_train_epochs=20,
    logging_steps=100
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

In [12]:
trainer.train()

Step,Training Loss
100,6.1318
200,2.7629
300,2.6033
400,2.489
500,2.3921
600,2.3297
700,2.2916


TrainOutput(global_step=700, training_loss=3.0000536673409597, metrics={'train_runtime': 257.181, 'train_samples_per_second': 21.697, 'train_steps_per_second': 2.722, 'total_flos': 1458009538560000.0, 'train_loss': 3.0000536673409597, 'epoch': 20.0})

We slightly change our prompt and display templates to make the output more readable.

In [29]:
def generate(prompt):
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to('cuda')
    output = model.generate(input_ids,
                        max_length=len(input_ids[0])+128,
                        temperature=0.7,
                        num_beams=5,
                        no_repeat_ngram_size=2,
                        early_stopping=True,
                        do_sample=True,
                        pad_token_id=tokenizer.pad_token_id
                        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

In [14]:
question = "QUESTION: Do you have any regrets?"
prompt = f"{question} RESPONSE:"
output = generate(prompt)[len(prompt):]
print(f"{question}\n\nRESPONSE:{output}")

QUESTION: Do you have any regrets?

RESPONSE: I don't know.


In [15]:
generate("QUESTION: What did you talk to President Kim about today? RESPONSE:")

'QUESTION: What did you talk to President Kim about today? RESPONSE: Well, first of all, I talked to him a few times, and he said, "Okay, let\'s talk about something else." And I thought he was very interested in what I was trying to do in the Middle East. I think that\'s what he wanted to know, because he knows that we\'re going to have to deal with the aftermath of the nuclear agreement with Iran. But he also said that he thought it was a good idea for the United States to support the Lebanese peace process, so that they could be part of it. And so I agreed with him on that.'

In [16]:
# save model
trainer.save_model("../models/QA_model")

So this is working somewhat. Note that GPT-2 is a very small model and the dataset is also small, so the results will generally be quite poor. You can try rerunning this with `gpt2-medium` or `gpt2-large` if you have the compute and memory. On my machine, this whole notebook consumes about 12GB of RAM.

# Retrieval Augmented Generation

In [17]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

In [18]:
def create_database(text_path, chunk_size, chunk_overlap):
    # load text
    with open(text_path, 'r') as f:
        text = f.read()

    # Split text
    text_splitter = CharacterTextSplitter(
        separator=' ',
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    chunks = text_splitter.split_text(text)

    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')
    db = FAISS.from_texts(chunks, embeddings)

    return db

text_path = '../sample_data/cleaned_test_text_1.txt'
db = create_database(text_path, chunk_size=256, chunk_overlap=64)

In [19]:
query = "Do you have any regrets?"

docs = db.similarity_search(query, k=3)

for doc in docs:
    print(doc.page_content+'\n')

that, but I don't. But the thing I regret most, except for doing the wrong thing, is misleading the American people about it. I do not regret the fact that I fought the Independent Counsel. And what they did was, in that case and generally, was completely

of the world. I think we've basically been a force for peace and prosperity. What is my greatest regret? I may not be able to say yet. I really wanted, with all my heart, to finish the Oslo peace process, because I believe that if Israel and the

when you look back on it, do you regret the substance of what you did? Do you think that going with an employer mandate was the wrong thing? And also, do you regret the detail in which you did it, the fact that you did the 1,300 pages and The President. I



In [20]:
def generate(prompt):
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to('cuda')
    output = model.generate(input_ids,
                        max_length=len(input_ids[0])+256,
                        temperature=0.7,
                        num_beams=5,
                        no_repeat_ngram_size=3,
                        early_stopping=True,
                        do_sample=True,
                        pad_token_id=tokenizer.pad_token_id
                        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

In [34]:
question = "QUESTION: What is your biggest regret?"

prompt = " ".join([doc.page_content for doc in docs]) + "\n\n" + question + " RESPONSE:"

output = generate(prompt)[len(prompt):]

print(question + "\n\nRESPONSE:" + output)

QUESTION: What is your biggest regret?

RESPONSE: Well, first of all, I didn't do what I said I wanted to do, which is to try to convince the Palestinians that we had no intention of continuing the war with them, and that they had a legitimate interest in continuing to be part of Israel's existence. That was a mistake I made.


Brilliant, that is working nicely given that the model is so small. We can make some small changes here to make everything more compact. We essentially want to define a single Clinton "Agent" that we can use.

Now you can define your parameters as a config.

In [1]:
from types import SimpleNamespace
from agent import Agent

from langchain.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter

model_config = SimpleNamespace(
    model_name = 'gpt2-medium',
    context_length = 256,
    temperature = 0.7,
    do_sample = True,
    gen_length = 128,
    repetition_penalty = 1.1,
)

training_config = SimpleNamespace(
    dataset_path = '../sample_data/cleaned_test_text_1_QA.txt',
    context_length = 256,
    batch_size = 4,
    num_epochs = 20,
)

database_config = SimpleNamespace(
    text_path = '../sample_data/cleaned_test_text_1.txt',
    embedding_model = 'sentence-transformers/all-mpnet-base-v2',
    chunk_size = 256,
    chunk_overlap = 64,
    vector_store = FAISS,
    text_splitter = CharacterTextSplitter,
)
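
The `agent` module itself is not shown here, so the following is only a guess at how such a config could be consumed. Both helpers, `build_prompt` and `generation_kwargs`, are hypothetical: the first mirrors the prompt template used earlier in this notebook, and the second maps the config fields onto keyword arguments that `model.generate` accepts.

```python
from types import SimpleNamespace

def build_prompt(chunks, question):
    """Join retrieved chunks and append the question in the QA template."""
    context = " ".join(chunks)
    return f"{context}\n\nQUESTION: {question} RESPONSE:"

def generation_kwargs(model_config, prompt_length):
    """Translate a config namespace into keyword arguments for model.generate."""
    return {
        "max_length": prompt_length + model_config.gen_length,
        "temperature": model_config.temperature,
        "do_sample": model_config.do_sample,
        "repetition_penalty": model_config.repetition_penalty,
    }

# Same fields as the model config above, repeated so the sketch runs on its own.
example_config = SimpleNamespace(
    temperature=0.7,
    do_sample=True,
    gen_length=128,
    repetition_penalty=1.1,
)

prompt = build_prompt(["first chunk", "second chunk"], "Do you have any regrets?")
kwargs = generation_kwargs(example_config, prompt_length=40)
print(prompt)
print(kwargs)
```

Keeping the parameters in one namespace means a single attribute, such as `repetition_penalty`, can be changed between calls without redefining the generation function.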

In [2]:
from agent import Agent

clinton = Agent(model_config, database_config=database_config, local=True)

Initalizing model: gpt2-medium
Creating database from: ../sample_data/cleaned_test_text_1.txt


In [3]:
output = clinton.ask_question("Do you have any regrets?")
print(output)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Well first off--I just want everybody to know how sorry I am after having served our country this past eight years as your representative at State Department. Not only were there some very regrettable incidents; even though those things are unfortunate now since last August's elections where many good ideas fell through the cracks or went out of style without being implemented by me personally -- well, obviously these issues cannot go unchallenged indefinitely under Hillary Clinton who has presided over every bad idea from 9/11 to Afghanistan, had no clue whatsoever why Iraq gave weapons-of -mass destruction (WMD) then lied us into invading Libya. My view remains unchanged


In [4]:
clinton.train(training_config)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
100,4.9934
200,2.7028
300,2.5605
400,2.3724
500,2.195
600,2.0464
700,1.9291
800,1.6918
900,1.5712
1000,1.4632


In [20]:
import pickle

# save agent
with open(f'../models/local_clinton_agent_{model_config.model_name}.pkl', 'wb') as f:
    pickle.dump(clinton, f)

In [19]:
clinton.model_config.repetition_penalty = 1.01

output = clinton.ask_question("What is your biggest regret?")
print(output)

I don't know yet. I don't know. I mean, I'm sorry about the Wye River incident. I'm really sorry about that. There should have been more of an investigation. There should have been some kind of public accounting made of what had occurred. And I'm sorry about the fact that the Palestinians, I think, overreacted to it. And I'm sorry about the fact that we gave Arafat and his team a free hand, and that he then went on the offensive, which he did, in a way that was terrible. But the thing I regret most is that I fought that Independent Counsel.


Getting the right generation parameters for decent output can be really quite challenging.
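
One way to make that search less ad hoc is to enumerate a small grid of settings and compare the outputs side by side. The sketch below uses hypothetical helpers, `parameter_grid` and `sweep`, and a stand-in generation function so that it runs without a model; in practice you would pass in a function that wraps `model.generate`.

```python
from itertools import product

def parameter_grid(**options):
    """Yield one dict per combination of the given option lists."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))

def sweep(generate_fn, prompt, grid):
    """Call generate_fn(prompt, **params) for every setting and collect the outputs."""
    return [(params, generate_fn(prompt, **params)) for params in grid]

# Stand-in for a real generation function, so the sketch runs without a model.
def fake_generate(prompt, temperature, repetition_penalty):
    return f"{prompt} [T={temperature}, rp={repetition_penalty}]"

grid = list(parameter_grid(temperature=[0.5, 0.7, 1.0], repetition_penalty=[1.01, 1.1]))
results = sweep(fake_generate, "QUESTION: What is your biggest regret? RESPONSE:", grid)
for params, text in results:
    print(params, "->", text)
```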