In [None]:
%META
title=Instruction Tuning GPT-2: Fine-Tuning LLMs for Question Answering
excerpt=
categories=

With ChatGPT's two-year anniversary around the corner, I had a funny realization: even though I use LLMs pretty much every day, I've never actually tried getting one to follow my own custom instructions. So, I figured, why not give it a go? Plus, I’ve been toying with the idea of fine-tuning an LLM anyway, so now seems like the perfect time to tackle this little gap in my experience! Should be a fun project.

My computational resources are limited right now, so I'll be using GPT2-medium as the base model. This will surely not break any records; I'll be satisfied when the model can give me basic information like the capital of a country. Also, I'll be basing this on <a href="https://github.com/karpathy/nanoGPT">karpathy/nanoGPT</a>, since it makes fine-tuning LLMs really easy.

Let’s first see how helpful vanilla gpt2-medium is when it comes to giving correct answers to questions:

```python sample.py --init_from=gpt2-medium --start="What's the capitol of france?"```

In [11]:
What's the capitol of france?

Italy, Spain and Portugal have the highest population so far in the world and are in general, the most populous countries in Europe. The country has more than a quarter of the world's population and two-fifths of the total population.
—-------------
What's the capitol of france?

16 U.S.C. § 3509(a)(1) provides that the person convicted of a crime of violence "shall be guilty of a felony, or shall suffer, upon conviction thereof, upon summary conviction, imprisonment not exceeding fifteen years or a fine not exceeding $50,000 or both."
—-----------
 What's the capitol of france?

It is not the capital of France. It is the capital of Spain, the capital of Portugal, the capital of Italy, the capital of Belgium, the capital of Luxembourg, the capital of Switzerland, the capital of Germany, the capital of Luxembourg, the capital of Austria, the capital of Belgium, the capital of Luxembourg, and all the capitals of the other major nations, so it is really something like the French capital. So as you can see, it's not their capital.


So… it’s not exactly great at answering questions. With all the buzz around LLMs these days, it’s crazy to think how useless they used to be when they were just spitting out statistically likely sentences purely based on webtext. But we can cheat a bit by tweaking our prompt—just making it clear that the answer to a question is coming up next:

```python sample.py --init_from=gpt2-medium --start="Question: What's the capitol of france?\n Answer:"```

Outputs:
```
Question: What's the capitol of france?\n Answer: Le Mans, the capital of France.
—
Question: What's the capitol of france?\n Answer: The French capital Uisduy is the third most valuable city in world, and third most populous city. The city is located in the northeast of France, near France's border with Germany, and is the largest city in the region. It is a very vibrant city, with a great number of national parks, museums, and artistic institutions, among others. Its capital is Uisduy, the first city to start being called Uisduy, after its founding in 1689.
—
Question: What's the capital of france?\n Answer: The Capitole de l'État in Paris.\n\right now we have the following names on the cover:
```
After a lot of wrong capitals, the last sample actually mentions Paris (although the answer is still technically incorrect, but whatever).
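One detail worth noting: the `\n` in the command above shows up literally in the outputs, because inside double quotes most shells pass it through as a backslash followed by an `n` rather than as a newline. Here is a minimal sketch of building the same prompt in Python with a real newline and writing it to a file. The helper names and the `prompt.txt` file name are my own placeholders, not part of nanoGPT.

```python
from pathlib import Path


def build_prompt(question: str) -> str:
    """Wrap a question in the Question/Answer template, using a real newline."""
    return f"Question: {question.strip()}\nAnswer:"


def write_prompt_file(question: str, path: str = "prompt.txt") -> Path:
    """Write the formatted prompt to a text file (file name is a placeholder)."""
    prompt_path = Path(path)
    prompt_path.write_text(build_prompt(question), encoding="utf-8")
    return prompt_path
```

As far as I can tell from nanoGPT's `sample.py`, a `--start` value beginning with `FILE:` is read from disk, so after calling `write_prompt_file(...)` something like `python sample.py --init_from=gpt2-medium --start=FILE:prompt.txt` should pick up the prompt with the newline intact. I haven't checked whether this changes the quality of the answers.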

Let's see how far we can get by sticking to just instruction tuning, that is, fine-tuning on a simple question-answer dataset.  
For this, we'll be using the question-answer dataset collected by the <a href="https://open-assistant.io">OpenAssistant</a> project. The actual dataset is <a href="https://huggingface.co/datasets/OpenAssistant/oasst2">OpenAssistant/oasst2</a>.

We'll grab the trees jsonl file, which contains the full conversation trees with some annotations and scoring. Really interesting and useful stuff, honestly. To keep things simple, let's only consider the prompt from the user (the root of the tree) and the assistant's direct replies to it (the first level of the tree). This simplifies our extraction script significantly. We will just load the trees and dump the conversations to stdout. These conversations will have some very basic XML-style formatting to mark the beginning and end of user prompts and assistant responses:

In [None]:
import json
file_path = '2023-04-12_oasst_ready.trees.jsonl'

with open(file_path, 'r') as file:
    for line in file:
        # Load the JSON object from the current line
        json_object = json.loads(line)

        # For every top level reply, print prompt and corresponding reply
        for reply in json_object["prompt"]["replies"]:
            print("<conversation>")
            print("<user>", json_object["prompt"]["text"], "</user>", sep="")
            print("<assistant>", reply["text"], "</assistant>", sep="")
            print("</conversation>")

By printing to stdout, we can quickly check by eye that our conversational data at least makes some sense, which it does:

```python export.py > simple_qa_conversations.txt```  
```head simple_qa_conversations.txt```

```
<conversation>
<user>Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.</user>
<assistant>"Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions. (...)
```
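Beyond eyeballing the first few lines with `head`, the exported file can also be read back programmatically. The following is a rough sketch, assuming the export has exactly the layout printed by the script above, with one tag per line. `parse_conversations` and `check_export` are hypothetical helpers of mine, not part of nanoGPT or the dataset tooling.

```python
import re

# Matches one exported conversation; DOTALL lets prompts and replies span lines.
CONVERSATION_RE = re.compile(
    r"<conversation>\n<user>(.*?)</user>\n<assistant>(.*?)</assistant>\n</conversation>",
    re.DOTALL,
)


def parse_conversations(text: str) -> list[tuple[str, str]]:
    """Return (prompt, reply) pairs found in the exported text."""
    return [(m.group(1), m.group(2)) for m in CONVERSATION_RE.finditer(text)]


def check_export(text: str) -> tuple[int, int]:
    """Return (number of opening tags, number of conversations parsed)."""
    return text.count("<conversation>"), len(parse_conversations(text))


sample = (
    "<conversation>\n"
    "<user>What is 2 + 2?</user>\n"
    "<assistant>4</assistant>\n"
    "</conversation>\n"
)
print(check_export(sample))
print(parse_conversations(sample))
```

If the two numbers from `check_export` differ on the real file, some conversation did not match the expected layout. One way that could happen is a message whose text itself contains one of the tags, since the non-greedy match would then split it in the wrong place.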