# Finetuning GPT-2 with custom data

In [1]:
import warnings
warnings.filterwarnings("ignore")

from transformers import GPT2Tokenizer, GPT2LMHeadModel, AutoTokenizer, AutoModelWithLMHead
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

import torch
from datasets import Dataset

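We can either use `AutoModelWithLMHead` and specify that we want `"gpt2"`, or we can use `GPT2LMHeadModel` directly. The first option is preferable because the same code then works with any model that has an LM head. Note that newer releases of `transformers` deprecate `AutoModelWithLMHead` in favor of `AutoModelForCausalLM` for decoder-only models such as GPT-2. We use `gpt2`, the smallest checkpoint. There are also `gpt2-medium`, `gpt2-large`, and `gpt2-xl`.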

**What is an LM Head?**

It is the language modeling head: a linear layer that maps each high-dimensional hidden state produced by the transformer to a vector of logits, one per token in the model's vocabulary. Applying a softmax to those logits gives the probability distribution over the next token.
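To make the idea concrete, here is a minimal pure-Python sketch of what an LM head does to a single hidden state. The toy sizes and weights are invented for illustration (GPT-2 small uses a hidden size of 768 and a vocabulary of 50257), and `lm_head` and `softmax` are hypothetical helper names, not the library's implementation.

```python
import math

def lm_head(hidden, weight):
    """Project one hidden vector (length d) to one logit per vocabulary entry."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes: hidden dimension 3, vocabulary of 4 tokens.
hidden_state = [0.5, -1.0, 2.0]
head_weight = [
    [0.1, 0.0, 0.3],   # row for token 0
    [0.0, 0.2, -0.1],  # row for token 1
    [0.4, 0.4, 0.4],   # row for token 2
    [-0.3, 0.1, 0.0],  # row for token 3
]

logits = lm_head(hidden_state, head_weight)
probs = softmax(logits)

# Greedy decoding simply picks the most probable token.
next_token = max(range(len(probs)), key=probs.__getitem__)
print(logits, probs, next_token)
```

In the real model the same projection is applied at every position of the sequence at once, as a single matrix multiplication.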

In [2]:
context_length = 256
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2").to("cuda")

model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 124.4M parameters
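The 124.4M figure can be reproduced by hand from the published GPT-2 small hyperparameters. The sketch below is a back-of-the-envelope count, assuming the output projection shares its weights with the token embedding (so it is not counted twice); `linear` is a hypothetical helper.

```python
# Published GPT-2 small hyperparameters.
vocab_size, n_positions, d_model, n_layers = 50257, 1024, 768, 12

def linear(n_in, n_out):
    """Parameter count of one dense layer: weights plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * d_model  # scale and shift vectors

# Attention: one fused query/key/value projection plus the output projection.
attention = linear(d_model, 3 * d_model) + linear(d_model, d_model)

# Feed-forward block: expand to 4x the hidden size, then project back.
mlp = linear(d_model, 4 * d_model) + linear(4 * d_model, d_model)

per_block = 2 * layer_norm + attention + mlp

# Token embeddings plus learned position embeddings.
embeddings = vocab_size * d_model + n_positions * d_model

# All blocks, plus the final layer norm after the last block.
total = embeddings + n_layers * per_block + layer_norm
print(f"{total:,} parameters = {total / 1000**2:.1f}M")
# prints "124,439,808 parameters = 124.4M"
```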


Our model is quite small, but still effective. We write a function that will prompt the model for us. `model.generate` takes many arguments. Here are the important ones:

`max_length`: The maximum total length of the sequence in tokens, including the prompt. If you set this too high, you might get repetition.

`temperature`: How random the sampled output should be. The logits are divided by this value before sampling, so values below 1 make the output less random, values above 1 make it more random, and 1 leaves the distribution unchanged. It only has an effect when `do_sample=True`.

`no_repeat_ngram_size`: Any n-gram of this size can occur only once in the output. An n-gram is a series of adjacent tokens, so if this is 2, no pair of adjacent tokens can appear twice.

`do_sample`: Whether or not to sample from the distribution. If False, decoding is deterministic (greedy or beam search) and you'll get the same output every time.
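The `temperature` and `no_repeat_ngram_size` arguments are easier to reason about with a small example. The following is a simplified pure-Python sketch of both ideas, not the library's code; `softmax_with_temperature` and `banned_next_tokens` are hypothetical names, and the token ids are made up.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def banned_next_tokens(generated, n):
    """Tokens that would complete an n-gram already present in `generated`."""
    if n < 1 or len(generated) < n - 1:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)  # more peaked
base = softmax_with_temperature(logits, temperature=1.0)   # unchanged
flat = softmax_with_temperature(logits, temperature=2.0)   # more uniform

# With n=2, the bigram (5, 7) has already occurred, so after another 5
# the token 7 may not be generated again.
banned = banned_next_tokens([5, 7, 9, 5], n=2)
print(sharp, base, flat, banned)
```

Lower temperatures concentrate probability on the top token, while the n-gram ban removes candidates outright before the next token is chosen.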

In [3]:
def generate(prompt):
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to('cuda')
    output = model.generate(input_ids,
                        max_length=256,
                        # temperature=0.7,
                        num_beams=5,
                        no_repeat_ngram_size=2,
                        early_stopping=True,
                        # do_sample=True,
                        # pad_token_id=tokenizer.eos_token_id
                        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

output = generate("Do you have any regrets?")
print(output)

Do you have any regrets?

I don't regret anything. I'm just happy that I was able to do what I wanted to. That's all.


If you run the above, you'll probably get a coherent but generic response that has nothing to do with our data.

## Further pretraining on a text file

We use transcripts of press events from former President Clinton.

In [4]:
data = Dataset.from_text('../sample_data/cleaned_test_text_1.txt', split='train')

Getting the data into shape takes a few steps with the `datasets` library. We tokenize the text, cut the token sequence into chunks of at most `context_length` tokens, and wrap the chunks in a format the Hugging Face `Trainer` can read.
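As a rough mental model of the chunking step, the sketch below cuts a long list of token ids into consecutive pieces of at most `context_length` tokens. It assumes no overlap between chunks, and `chunk_token_ids` is a hypothetical helper rather than what the tokenizer does internally.

```python
def chunk_token_ids(token_ids, context_length):
    """Cut a long token sequence into consecutive chunks of at most context_length."""
    return [
        token_ids[start:start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]

# Toy example: 10 token ids, context length 4.
toy_ids = list(range(10))
chunks = chunk_token_ids(toy_ids, context_length=4)
print(chunks)
# prints [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The last chunk is usually shorter than the rest, which is one reason a padding token is needed when examples are batched together.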

In [5]:
# Tokenize the text; anything beyond max_length overflows into further chunks
outputs = tokenizer(
        data["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )

tokenized_dataset = Dataset.from_dict({"input_ids": outputs.input_ids})

In [6]:
# GPT-2 has no padding token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

out = data_collator([tokenized_dataset[i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


input_ids shape: torch.Size([5, 256])
attention_mask shape: torch.Size([5, 256])
labels shape: torch.Size([5, 256])
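The three tensors printed above come from the collator. With `mlm=False` it pads the batch and builds `labels` as a copy of `input_ids` in which padding positions are replaced by -100, the index the loss function ignores. The following is a simplified list-based sketch of that behavior, assuming right padding; `collate_causal_lm` is a hypothetical name and the token ids are invented.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def collate_causal_lm(batch, pad_token_id):
    """Pad a batch of token-id lists and build labels for causal LM training."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        padding = longest - len(ids)
        padded = ids + [pad_token_id] * padding
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * padding)
        # Labels mirror the inputs; every pad-id position is masked out.
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

toy_batch = collate_causal_lm([[11, 12, 13], [21]], pad_token_id=0)
print(toy_batch)
```

The shift by one position (predicting token t+1 from tokens up to t) happens inside the model, not in the collator. One side effect worth knowing: when the padding token is the same as the end-of-text token, genuine end-of-text tokens are masked out of the loss as well.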


In [9]:
args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    num_train_epochs=20,
    logging_steps=100
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

In [10]:
trainer.train()

Step,Training Loss
100,3.0504
200,2.8131
300,2.6541
400,2.5219
500,2.4216
600,2.3082
700,2.231
800,2.1326
900,2.0664
1000,2.0098


TrainOutput(global_step=1400, training_loss=2.2741852133614677, metrics={'train_runtime': 257.7111, 'train_samples_per_second': 43.382, 'train_steps_per_second': 5.432, 'total_flos': 1460622458880000.0, 'train_loss': 2.2741852133614677, 'epoch': 20.0})
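The `global_step=1400` in the output above can be sanity-checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation. The number of training chunks is not printed by the notebook, so it is inferred here from the reported throughput and runtime; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_samples, batch_size, num_epochs):
    """Steps per epoch (last partial batch included) times number of epochs."""
    return math.ceil(num_samples / batch_size) * num_epochs

# Values copied from the TrainOutput metrics above.
runtime_s, samples_per_s, epochs = 257.7111, 43.382, 20

# Inferred, not reported: samples processed per epoch.
num_chunks = round(runtime_s * samples_per_s / epochs)

steps = total_optimizer_steps(num_chunks, batch_size=8, num_epochs=epochs)
print(num_chunks, steps)
# prints "559 1400"
```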

In [11]:
output = generate("Do you have any regrets?")
print(output)

Do you have any regrets? The President. Well, I don't have to regret anything. I'm very proud of what we did in Vietnam, and I regret that we didn't do it the way we wanted it to be done. But I think the mistake I made was not doing what I thought was right, which led to the mistakes we made in the first Gulf war, in which we lost more than 100,000 Americans and hundreds of our friends and allies. And we've done a lot of good things since then. We've helped the people of Vietnam fight the Viet Cong, the North Vietnamese Army and the Republic of Korea and other countries that are involved in our efforts to end the violence there and restore democracy and respect the Line of Control. Mr. Wenner. Do you ever get angry at people who say things like, "He killed his father? He murdered his mother?" or "Did he get away with it? Didn't he do what he was charged with? Did he lose his family?" That's the kind of anger I have for the Vietnam veterans, because I know they say the same things to me

In [12]:
# save model
trainer.save_model("../models/basic_model/")

# clear gpu memory
torch.cuda.empty_cache()

So the model is clearly learning the data. Now let's see if we can make it learn a general QA format. To do this, we exploit some recurring features of the dataset, such as responses being introduced by `The President.` We replace every instance of `The President.` with `RESPONSE: `, and we replace the names of the interviewers with `QUESTION: `. We also append the end-of-text token (`<|endoftext|>`) to each response, in the hope that the model will learn where an answer should stop.
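The preprocessing script itself is not part of this notebook, so the following is only a minimal sketch of how such a conversion could look. It assumes every turn starts with a speaker label followed by a period; `to_qa_format` and the sample transcript are hypothetical.

```python
import re

EOS = "<|endoftext|>"  # GPT-2's end-of-text token

def to_qa_format(transcript, interviewer_names):
    """Rewrite 'Speaker. text' turns as QUESTION:/RESPONSE: pairs."""
    speakers = ["The President"] + list(interviewer_names)
    pattern = re.compile("(" + "|".join(re.escape(s) for s in speakers) + r")\.\s*")
    # With a capturing group, split returns [preamble, speaker, text, speaker, text, ...]
    parts = pattern.split(transcript)
    turns = []
    for speaker, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if speaker == "The President":
            turns.append(f"RESPONSE: {text}{EOS}")
        else:
            turns.append(f"QUESTION: {text} ")
    return "".join(turns).strip()

sample = "Mr. Wenner. Do you have regrets? The President. No, none."
converted = to_qa_format(sample, interviewer_names=["Mr. Wenner"])
print(converted)
# prints "QUESTION: Do you have regrets? RESPONSE: No, none.<|endoftext|>"
```

A real transcript will need more care, for example when a speaker label also appears in the middle of a sentence.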

In [2]:
data = Dataset.from_text('../sample_data/cleaned_test_text_1_QA.txt', split='train')



In [4]:
context_length = 512
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({'pad_token': '<pad>'})

model = AutoModelWithLMHead.from_pretrained("gpt2").to("cuda")
with torch.no_grad():
  model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

outputs = tokenizer(
        data["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
# Wrap the chunked token ids in a Dataset the Trainer can consume
tokenized_dataset = Dataset.from_dict({"input_ids": outputs.input_ids})

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

out = data_collator([tokenized_dataset[i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

GPT-2 size: 124.4M parameters


In [8]:
print(tokenizer.decode(out['input_ids'][0], skip_special_tokens=False))

QUESTION: We understand you made a foreign policy related call shortly. RESPONSE: Yes, I just talked to President Kim about the No Gun Ri incident and personally expressed my regret to him. And I thanked him for the work that we had done together in developing our mutual statement. We also set up this scholarship fund and did some other things that we hope will be a genuine gesture of our regret. It was a very you know, I had a good talk with him.<|endoftext|>QUESTION: Any particular reason why you used the word "regret" instead of "apology" in your statement? RESPONSE: I think the findings were I think he knows that "regret" and "apology" both mean the same thing, in terms of being profoundly sorry for what happened. But I believe that the people who looked into it could not conclude that there was a deliberate act, decided at a high enough level in the military hierarchy, to acknowledge that, in effect, the Government had participated in something that was terrible. So I don't think 

The training process is identical to before.

In [11]:
args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    num_train_epochs=20,
    logging_steps=100
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

In [12]:
trainer.train()

Step,Training Loss
100,3.0261
200,2.7209
300,2.563
400,2.4468
500,2.3479
600,2.2849
700,2.2456


TrainOutput(global_step=700, training_loss=2.5193115670340402, metrics={'train_runtime': 255.9539, 'train_samples_per_second': 21.801, 'train_steps_per_second': 2.735, 'total_flos': 1458009538560000.0, 'train_loss': 2.5193115670340402, 'epoch': 20.0})

We slightly change our prompt and display templates, just to make it more readable.

In [13]:
def generate(prompt):
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to('cuda')
    output = model.generate(input_ids,
                        max_length=len(input_ids[0])+128,
                        # temperature=0.7,
                        num_beams=5,
                        no_repeat_ngram_size=2,
                        early_stopping=True,
                        do_sample=True,
                        pad_token_id=tokenizer.pad_token_id
                        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

In [16]:
question = "QUESTION: Do you have any regrets?"
prompt = f"{question} RESPONSE:"
output = generate(prompt)[len(prompt):]
print(f"{question}\n\nRESPONSE:{output}")

QUESTION: Do you have any regrets?

RESPONSE: No.


In [20]:
generate("QUESTION: What did you talk to President Kim about today? RESPONSE:")

'QUESTION: What did you talk to President Kim about today? RESPONSE: Well, I talked to him about the missile program and the North Korea issue. And I said, "Mr. President, we\'re going to have to work together to try to develop a long term strategy to stop this nuclear threat, and we\'ve got to find a way to put it in place that will stop it." And he was very supportive of that. So I thought it was a very good thing to say.'

In [30]:
# save model
trainer.save_model("../models/QA_model")

So this is working to some extent. Keep in mind that `gpt2` is a very small model and the dataset is also small, so the results will generally be quite poor. You can try rerunning this with `gpt2-medium` or `gpt2-large` if you have the compute and memory. On my machine, this whole notebook consumes about 12GB of RAM.

# Retrieval Augmented Generation

In [21]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

In [23]:
def create_database(text_path, chunk_size, chunk_overlap):
    # load text
    with open(text_path, 'r') as f:
        text = f.read()

    # Split text
    text_splitter = CharacterTextSplitter(
        separator=' ',
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    chunks = text_splitter.split_text(text)

    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')
    db = FAISS.from_texts(chunks, embeddings)

    return db

text_path = '../sample_data/cleaned_test_text_1_QA.txt'
db = create_database(text_path, chunk_size=256, chunk_overlap=64)
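The `chunk_size` and `chunk_overlap` arguments in `create_database` control how the text is cut before embedding. The sketch below is a simplified stand-in for that kind of splitting, measuring lengths in characters and splitting on spaces; it is not LangChain's implementation, and `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, chunk_size, chunk_overlap, separator=" "):
    """Greedily pack words into chunks of at most chunk_size characters,
    carrying roughly chunk_overlap trailing characters into the next chunk."""
    chunks, current = [], []
    for word in text.split(separator):
        if current and len(separator.join(current + [word])) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading words until the carried-over tail fits the overlap
            # budget and still leaves room for the incoming word.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [word])) > chunk_size
            ):
                current.pop(0)
        current.append(word)
    if current:
        chunks.append(separator.join(current))
    return chunks

pieces = split_with_overlap("a b c d e f g h", chunk_size=5, chunk_overlap=2)
print(pieces)
# prints ['a b c', 'c d e', 'e f g', 'g h']
```

The overlap means a sentence that straddles a chunk boundary still appears mostly intact in at least one chunk, at the cost of storing some text twice.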

In [24]:
query = "Do you have any regrets?"

docs = db.similarity_search(query, k=3)

for doc in docs:
    print(doc.page_content+'\n')

is my greatest regret? I may not be able to say yet. I really wanted, with all my heart, to finish the Oslo peace process, because I believe that if Israel and the Palestinians could be reconciled, first the State of Israel would be secure, which is very

I wanted to do, but the overwhelming majority of things I wanted to do I was able to accomplish, and I'm grateful that it worked out for the country. And then a lot of other things came up along the way which were good for the country. So I'm happy now,

our people stuck with me, and that the American people stuck with me, and I was able to resist what it was they attempted to do. But I do regret the fact that I wasn't straight with the American people about it. It was something I was ashamed of and pained
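Conceptually, `similarity_search` embeds the query and returns the chunks whose embeddings are closest to it. The exact metric and index type depend on how the vector store is configured, so the cosine similarity used below is only a familiar stand-in. The two-dimensional vectors are invented, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=3):
    """Indices of the k chunk vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings: chunks 0 and 2 point in nearly the same direction as the query.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
best = top_k([1.0, 0.0], chunk_vectors, k=2)
print(best)
# prints [0, 2]
```

A real vector store avoids this brute-force scan by using an index, but the ranking idea is the same.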



In [32]:
def generate(prompt):
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to('cuda')
    output = model.generate(input_ids,
                        max_length=len(input_ids[0])+256,
                        temperature=0.7,
                        num_beams=5,
                        no_repeat_ngram_size=3,
                        early_stopping=True,
                        do_sample=True,
                        pad_token_id=tokenizer.pad_token_id
                        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

In [36]:
question = "QUESTION: Can you tell us about the Whitewater scandal?"

prompt = " ".join([doc.page_content for doc in docs]) + "\n\n" + question + " RESPONSE:"

output = generate(prompt)[len(prompt):]

print(question + "\n\nRESPONSE:" + output)

QUESTION: Can you tell us about the Whitewater scandal?

RESPONSE: I can tell you, first of all, I regret that I didn't do what I said I would do. I did what I thought was the right thing, but I did it in the wrong way. And I'm sorry about that. I think it's important that we all remember what happened to me. I was sitting in the Oval Office in the White House when I met with Mr. Arafat, and he said to me, "Mr. President, you've got to do this." And I said, "I don't want to do it. I don't think I can do it."


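Brilliant, that is working nicely given that the model is so small. Note, though, that as the cells are shown, `docs` still holds the chunks retrieved for the earlier regrets query rather than for the Whitewater question, which probably explains why the answer dwells on regret. Rerunning `db.similarity_search` with the new question would give more relevant context. We can also make some small changes here to make everything more compact. We essentially want to define a single Clinton "Agent" that we can use.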
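Before wrapping everything into a class, it helps to isolate the prompt layout used in the cells above: retrieved chunks first, then a blank line, then the question in the `QUESTION: ... RESPONSE:` template. The sketch below is a pure-string version with hypothetical helper names `build_prompt` and `strip_prompt`.

```python
def build_prompt(question, retrieved_chunks=()):
    """Assemble the prompt: optional retrieved context, then the QA template."""
    qa = f"QUESTION: {question} RESPONSE:"
    if not retrieved_chunks:
        return qa
    return " ".join(retrieved_chunks) + "\n\n" + qa

def strip_prompt(generated_text, prompt):
    """Remove the echoed prompt from the decoded output, if it is present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text

plain = build_prompt("Do you have any regrets?")
augmented = build_prompt("Do you have any regrets?", ["chunk one", "chunk two"])
answer = strip_prompt(plain + " No.", plain)
print(repr(augmented), repr(answer))
```

Slicing the decoded text by the prompt length assumes that decoding reproduces the prompt exactly; slicing the generated token ids instead, as the class below does, avoids that assumption.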

In [1]:
from types import SimpleNamespace
import warnings
warnings.filterwarnings("ignore")

from transformers import AutoTokenizer, AutoModelWithLMHead
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

import torch
from datasets import Dataset

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS


class Agent():
    def __init__(self, model_config : SimpleNamespace, database_config : SimpleNamespace = None):
        """RAG agent

        Args:
            model_config (SimpleNamespace): model parameters
            database_config (SimpleNamespace, optional): database parameters. Defaults to None.
        """
        self.model_config = model_config
        self._validate_model_config()
        self.database_config = database_config

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        print(f"Initializing model: {self.model_config.model_name}")
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_config.model_name)
        self.model = AutoModelWithLMHead.from_pretrained(self.model_config.model_name).to(self.device)

        if self.database_config is not None:
            print(f"Creating database from: {self.database_config.text_path}")
            self.db = self._create_database()

        self.trained = False

    
    def __repr__(self):
        agent_config = f"self.model_config: {self.model_config}\nself.database_config: {self.database_config}"
        return agent_config


    def _validate_model_config(self):
        assert hasattr(self.model_config, 'model_name'), "model_config must have a model_name attribute"
        if not hasattr(self.model_config, 'gen_length'):
            self.model_config.gen_length = 128
        if not hasattr(self.model_config, 'context_length'):
            self.model_config.context_length = 256
        if not hasattr(self.model_config, 'temperature'):
            self.model_config.temperature = 0.7
        if not hasattr(self.model_config, 'do_sample'):
            self.model_config.do_sample = True

    
    def _validate_database_config(self):
        assert hasattr(self.database_config, 'text_path'), "database_config must have a text_path attribute"
        assert hasattr(self.database_config, 'text_splitter'), "database_config must have a text_splitter attribute"
        assert hasattr(self.database_config, 'chunk_size'), "database_config must have a chunk_size attribute"
        assert hasattr(self.database_config, 'chunk_overlap'), "database_config must have a chunk_overlap attribute"
        assert hasattr(self.database_config, 'embedding_model'), "database_config must have an embedding_model attribute"
        assert hasattr(self.database_config, 'vector_store'), "database_config must have a vector_store attribute"
            
    
    def _create_database(self):
        self._validate_database_config()
        
        with open(self.database_config.text_path, 'r') as f:
            text = f.read()

        # Split text
        text_splitter = self.database_config.text_splitter(
            separator=' ',
            chunk_size=self.database_config.chunk_size,
            chunk_overlap=self.database_config.chunk_overlap,
            length_function=len
        )
        chunks = text_splitter.split_text(text)

        embeddings = HuggingFaceEmbeddings(model_name=self.database_config.embedding_model)
        db = self.database_config.vector_store.from_texts(chunks, embeddings)

        return db


    def ask_question(self, query : str = "What is your name?", retrieval : bool = True) -> str:
        """Ask a question

        Args:
            query (str, optional): Query to the Agent. Defaults to "What is your name?".
            retrieval (bool, optional): Whether to prepend retrieved chunks to the prompt. Defaults to True.

        Returns:
            str: Output string from the Agent
        """
        question = "QUESTION: " + query

        if retrieval and self.database_config is not None:
            docs = self.db.similarity_search(query, k=3)
            prompt = " ".join([doc.page_content for doc in docs]) + "\n\n" + question + " RESPONSE:"
        else:
            prompt = question + " RESPONSE:"

        input_ids = self.tokenizer.encode(prompt, return_tensors='pt').to(self.device)

        output = self.model.generate(input_ids,
                        max_length=self.model_config.gen_length + len(input_ids[0]),
                        # temperature=0.7,
                        num_beams=5,
                        no_repeat_ngram_size=2,
                        early_stopping=True,
                        # do_sample=True,
                        pad_token_id=self.tokenizer.pad_token_id   
                    )

        # output without input_ids
        return self.tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)[1:]


    def train(self, training_config : SimpleNamespace) -> None:
        """Train the Agent using the given training_config

        Args:
            training_config (SimpleNamespace): Training hyperparameters

        Returns:
            None
        """
        if not self.trained:
            self.tokenizer.add_special_tokens({'pad_token': '<pad>'})
            with torch.no_grad():
                self.model.resize_token_embeddings(len(self.tokenizer))
            self.model.config.pad_token_id = self.tokenizer.pad_token_id

        data = Dataset.from_text(training_config.dataset_path, split='train')
        
        outputs = self.tokenizer(
                data["text"],
                truncation=True,
                max_length=training_config.context_length,
                return_overflowing_tokens=True,
                return_length=True,
            )

        # Wrap the chunked token ids in a Dataset the Trainer can consume
        tokenized_dataset = Dataset.from_dict({"input_ids": outputs.input_ids})

        data_collator = DataCollatorForLanguageModeling(self.tokenizer, mlm=False)

        args = TrainingArguments(
            output_dir="./results",
            per_device_train_batch_size=training_config.batch_size,
            num_train_epochs=training_config.num_epochs,
            logging_steps=100
        )

        trainer = Trainer(
            model=self.model,
            tokenizer=self.tokenizer,
            args=args,
            data_collator=data_collator,
            train_dataset=tokenized_dataset,
        )

        trainer.train()

        self.trained = True

Now you can define your parameters as a config.
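Only `model_name` is required in the model config; the class fills in the other fields when they are missing. The sketch below shows the same default-filling idea in isolation, using the default values from `_validate_model_config`. Unlike the class, it returns a new namespace instead of mutating the one passed in, and `with_defaults` is a hypothetical helper.

```python
from types import SimpleNamespace

# Defaults mirrored from Agent._validate_model_config.
DEFAULTS = {"gen_length": 128, "context_length": 256, "temperature": 0.7, "do_sample": True}

def with_defaults(config, defaults=DEFAULTS):
    """Return a copy of config with any missing optional fields filled in."""
    if not hasattr(config, "model_name"):
        raise ValueError("model_config must have a model_name attribute")
    merged = dict(defaults)
    merged.update(vars(config))  # explicit settings win over defaults
    return SimpleNamespace(**merged)

partial = SimpleNamespace(model_name="gpt2", gen_length=64)
full = with_defaults(partial)
print(full)
```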

In [2]:
from types import SimpleNamespace

model_config = SimpleNamespace(
    model_name = 'gpt2-medium',
    context_length = 256,
    temperature = 0.7,
    do_sample = True,
    gen_length = 128,
)

training_config = SimpleNamespace(
    dataset_path = '../sample_data/cleaned_test_text_1_QA.txt',
    context_length = 256,
    batch_size = 4,
    num_epochs = 20,
)

database_config = SimpleNamespace(
    text_path = '../sample_data/cleaned_test_text_1_QA.txt',
    embedding_model = 'sentence-transformers/all-mpnet-base-v2',
    chunk_size = 256,
    chunk_overlap = 64,
    vector_store = FAISS,
    text_splitter = CharacterTextSplitter,
)

In [3]:
clinton = Agent(model_config, database_config)

In [16]:
output = clinton.ask_question("Do you have any favorite websites?")
print(output)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I do. I've read a lot of books. I don't know if I've ever read a book before, but I do like to read. I love to read, and I love reading. I'm a big fan of the New York Times bestseller, The Great American Novel, which is a great book. It's about a young man who finds himself in the middle of a war, and he finds out that his father is dead, and that his mother is dead. And he has to find out what's going on, and what's happening in his life. And it's a great story, and the book is


In [4]:
clinton.train(training_config)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
100,4.7688
200,2.7047
300,2.5599
400,2.365
500,2.1908
600,2.0402
700,1.921
800,1.689
900,1.5603
1000,1.4625


In [22]:
output = clinton.ask_question("What do you think about the electoral college?")
print(output)

I don't think it's very good, for a number of reasons. One is that, since 1800, every President since has been elected by a narrow margin, unless you happen to be a Governor, a Senator, or the President of the United States. So if you're a one point person and you win by five or six tenths of one percent, well, that's not going to bother anybody. Now, you've got to have a margin of victory somewhere between one and two percent to make sure that your votes do not unduly prejudge the outcome of an election. Also, the process is designed to prevent this.
