## Intro

In this post, I'm going to discuss my learnings around fine-tuning OpenAI GPT models. As an example, I'll use my work on [BeepGPT](https://github.com/kaskada-ai/beep-gpt), where I fine-tuned a model to predict which conversations might be interesting to users of a Slack workspace. 

Note that this post focuses on using the legacy GPT-3 models. I'll make a future post about working with the latest generation models from OpenAI.

We'll cover the following topics in detail below:

* Building Training Examples
* Refining Training Examples
* Model Fine-Tuning

But before I get started, I'd like to discuss a few things. 

First is a willingness to experiment. Model fine-tuning is an iterative process. Most likely, your first approach to building examples will not produce a successful model. When working on BeepGPT, I experimented with five different scenarios (over a week) before I found one that was somewhat successful at predicting user interest.

Second is the importance of data quality. When fine-tuning a model, numerous training examples are sent to a model to update its behavior. The examples must contain strong signals that relate to the desired output. Despite the huge advances made recently with LLMs, garbage in still leads to garbage out.

Finally, it is important to have many training examples. Ideally, a fine-tuning job should be run with at least a few thousand training examples. The more examples you can provide to the model, the more it can learn, and the better its predictions will be in production.

## Definitions

#### Tokens
Tokens represent sets of characters that commonly occur in a certain sequence. On average, a token represents about 4 characters. You can use OpenAI's tokenizer tool to see how different texts get converted into tokens: https://platform.openai.com/tokenizer

For example, the text: "Hey team, how was your weekend?" gets tokenized as: [10814, 1074, 11, 703, 373, 534, 5041, 30]. 31 input characters become 8 tokens.

With color highlighting, we can see how each set of characters was transformed:

<p><span class="tokenizer-tkn tokenizer-tkn-0">Hey</span><span class="tokenizer-tkn tokenizer-tkn-1"> team</span><span class="tokenizer-tkn tokenizer-tkn-2">,</span><span class="tokenizer-tkn tokenizer-tkn-3"> how</span><span class="tokenizer-tkn tokenizer-tkn-4"> was</span><span class="tokenizer-tkn tokenizer-tkn-0"> your</span><span class="tokenizer-tkn tokenizer-tkn-1"> weekend</span><span class="tokenizer-tkn tokenizer-tkn-2">?</span></p>

OpenAI's models process text as tokens instead of as characters. 
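
As a rough illustration of the four-characters-per-token rule of thumb, the sketch below estimates a token count from the character count alone. The function name `estimate_tokens` is made up for this example, and the result is only an approximation; for exact counts, use the tokenizer tool linked above.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Estimate a token count using the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / chars_per_token)


text = "Hey team, how was your weekend?"
# Print the character count followed by the estimated token count.
print(len(text), estimate_tokens(text))
```

For the example sentence, the estimate happens to agree with the 8 tokens reported by the tokenizer, but it will drift for text with unusual words, code, or URLs.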

#### Prompts & Completions
Prompts are the input to LLM models. When working with ChatGPT, the prompt is the question we ask the model.

Completions are the responses from LLM models. When working with ChatGPT, the completion is the model's answer to our question. 

#### Training Examples
Training examples are prompt & completion pairs. The prompt is the text we would have sent the model in production, and the completion is the response we would have expected back. 

#### Maximum Token Length
The maximum token length is the largest number of tokens allowed in a single request to a model, counting both the prompt and the completion. The limit differs from model to model. For fine-tuning, we need to make sure that each training example is smaller than this limit.
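
A minimal sketch of such a length check follows. It assumes a limit of 2,048 tokens, which is only a placeholder to replace with the documented limit for your model, and it uses the rough four-characters-per-token estimate rather than a real tokenizer, so leave some headroom. The names `MAX_TOKENS` and `fits_in_context` are invented for this illustration.

```python
import math

MAX_TOKENS = 2048  # assumption: placeholder limit; look up the real limit for your model


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate based on ~4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(example, max_tokens=MAX_TOKENS):
    """Return True if the prompt plus completion is estimated to be under the limit."""
    total = estimate_tokens(example["prompt"]) + estimate_tokens(example["completion"])
    return total < max_tokens


examples = [
    {"prompt": "Hey team, how was your weekend?", "completion": "UserB"},
    {"prompt": "x" * 10_000, "completion": "UserC"},  # deliberately too long
]

# Keep only the examples that are estimated to fit.
usable = [example for example in examples if fits_in_context(example)]
print(len(usable))
```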

## Building Training Examples

Before we start building training examples, we need to form hypotheses about what we want to predict and how we might do so successfully. 

For BeepGPT I experimented with the following ideas: 

* For a set of recent messages in a channel, try to predict:
  * The reaction (if any) to the most recent message
  * The next user that will reply
  * The set of users that might interact next (reply or react)
* For the set of recent messages in a conversation, try to predict:
  * The set of users that might interact next
  * The next user that will reply

I was most successful at fine-tuning a model for the final idea: For the set of recent messages in a conversation, predict the next user to reply. The rest of the post will focus on this.

::: {.callout-note}
I used [Kaskada](https://kaskada.io) to quickly iterate on these ideas. Kaskada is a tool that makes it easy to collect and aggregate events from raw data. You don't need to pre-process anything. Just import the raw events and start experimenting. Furthermore, Kaskada ensures that your examples will not be subject to leakage, which is a big problem in predictive modeling. In a future blog post, I'll show how I used Kaskada to generate training examples for each of the ideas presented above.
:::

Consider this example conversation:

::: {.hanging-indent .margin-0}
**UserA**: Hey team, how was your weekend?

**UserB**: Great. And Yours?

**UserA**: It was good, I went swimming yesterday.

**UserA**: Hey @UserC, I'm trying to get my application deployed in Kubernetes. However, I can't seem to figure out if I should use a deployment or a stateful set. I found some docs here: http://tinyurl.com/4k53dc8h but I'm still not sure which to choose. Can you help?

**UserB**: UserC is at lunch now. They will be back in about an hour. I don't know much about this either, but I can try to help. Or is it okay to wait until UserC is back?

**UserA**: I can wait for UserC to get back.

**UserC**: I can help with this. Can you tell me more about your application? Does it have any persistent storage requirements?
:::

If we just look at the first two messages, we can generate a training example. The prompt is the first message, and the completion is the user that responded.

* Prompt:  "Hey team, how was your weekend?"
* Completion: "UserB"

Instead, if we consider the last four messages, the first three form the prompt and the author of the fourth is the completion:

* Prompt: "Hey @UserC, I'm trying to get my application deployed in Kubernetes. However, I can't seem to figure out if I should use a deployment or a stateful set. I found some docs here: http://tinyurl.com/4k53dc8h but I'm still not sure which to choose. Can you help?\n\nUserC is at lunch now. They will be back in about an hour. I don't know much about this either, but I can try to help. Or is it okay to wait until UserC is back?\n\nI can wait for UserC to get back."
* Completion: "UserC"

The combination of the prompt and the completion is a training example. When using the OpenAI fine-tuning API, each training example should be a JSON object in a specific format on a single line.

Converting our two examples above, we now have:

```{json}
{"prompt":"Hey team, how was your weekend?", "completion":"UserB"}
{"prompt":"Hey @UserC, I'm trying to get my application deployed in Kubernetes. However, I can't seem to figure out if I should use a deployment or a stateful set. I found some docs here: http://tinyurl.com/4k53dc8h but I'm still not sure which to choose. Can you help?\n\nUserC is at lunch now. They will be back in about an hour. I don't know much about this either, but I can try to help. Or is it okay to wait until UserC is back?\n\nI can wait for UserC to get back.", "completion":"UserC"}
```
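
One way to produce lines like these is sketched below. It assumes the conversation is already available as an ordered list of `(user, text)` pairs; the `build_example` helper and its `reply_index` and `window` parameters are hypothetical names for this illustration, not part of BeepGPT or Kaskada.

```python
import json

# Assumption: messages are already ordered and reduced to (user, text) pairs.
conversation = [
    ("UserA", "Hey team, how was your weekend?"),
    ("UserB", "Great. And Yours?"),
    ("UserA", "It was good, I went swimming yesterday."),
]


def build_example(messages, reply_index, window):
    """Use up to `window` messages before `reply_index` as the prompt.

    The author of the message at `reply_index` becomes the completion.
    """
    start = max(0, reply_index - window)
    prompt = "\n\n".join(text for _, text in messages[start:reply_index])
    completion = messages[reply_index][0]
    return {"prompt": prompt, "completion": completion}


# One JSON object per line, matching the format shown above.
line = json.dumps(build_example(conversation, reply_index=1, window=1))
print(line)
```

Sliding `reply_index` over the conversation, and varying `window`, yields one candidate example per reply.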


Finally, I found that model training works best if the following is done:

* Non-textual data like HTTP links, code blocks, and user IDs are removed from the prompts.
* Completions are reduced to a single token in length.

We can use regex and other string functions to remove non-textual data from the prompts. And we can use standard data science tools like the scikit-learn `LabelEncoder` to create a mapping from user IDs to integers. Integers under 512 map to unique tokens in OpenAI models.

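So now we have the following. Each prompt also ends with a fixed separator (`\n\n###\n\n`) and each completion starts with a space, which follows OpenAI's data-formatting guidance for fine-tuning the legacy models: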

```{json}
{"prompt":"Hey team, how was your weekend?\n\n###\n\n", "completion":" 1"}
{"prompt":"Hey, I'm trying to get my application deployed in Kubernetes. However, I can't seem to figure out if I should use a deployment or a stateful set. I found some docs here: but I'm still not sure which to choose. Can you help?\n\n is at lunch now. They will be back in about an hour. I don't know much about this either, but I can try to help. Or is it okay to wait until is back?\n\nI can wait for to get back.\n\n###\n\n", "completion":" 2"}
```
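
The cleaning and encoding steps above can be sketched as follows. This is an illustration under stated assumptions rather than the exact BeepGPT code: the regular expressions are simple guesses, mentions are assumed to use the Slack-style `<@U123ABC>` form, a plain dictionary stands in for scikit-learn's `LabelEncoder`, and the helper names are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick delimiter used around code blocks
CODE_BLOCK_RE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"<@\w+>")  # assumption: Slack-style mentions like <@U123ABC>
SEPARATOR = "\n\n###\n\n"  # marks the end of every prompt


def clean_text(text):
    """Strip code blocks, links, and user mentions from a prompt."""
    for pattern in (CODE_BLOCK_RE, URL_RE, MENTION_RE):
        text = pattern.sub("", text)
    # Collapse the runs of spaces left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text).strip()


def encode_users(user_ids):
    """Stand-in for LabelEncoder: map sorted unique user IDs to 0..n-1."""
    return {user: index for index, user in enumerate(sorted(set(user_ids)))}


def to_training_line(prompt, user, labels):
    """Build one JSONL line with a cleaned prompt and an integer completion."""
    return json.dumps({
        "prompt": clean_text(prompt) + SEPARATOR,
        "completion": f" {labels[user]}",  # leading space before the label
    })


labels = encode_users(["UserB", "UserC", "UserA"])
print(to_training_line("Hey team, how was your weekend?", "UserB", labels))
```

Keep the mapping returned by `encode_users` around: you need it in production to translate the model's integer completions back into user IDs.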

We now have two training examples that we could use for fine-tuning a model. But for fine-tuning we should have several thousand examples. Using a tool like Kaskada, it should be relatively easy to generate examples like this from your full Slack history. However, before we jump into model training, it is important to refine the examples. I'll cover this in the next section.

## Refining Training Examples

Before sending training examples to OpenAI to fine-tune a custom model, you should ensure they contain a strong signal for predicting your desired outcomes. Depending on your goal, it may also be helpful to include some negative examples. Negative examples teach the model when not to make a prediction or, stated another way, when to predict that no action should be taken.

For example, with BeepGPT we are trying to predict when a set of messages might be interesting for a specific user. If we look at our training examples from the previous section, the first does not contain anything interesting. We would not want to alert anyone about this message. Therefore we should convert this example into a negative example.

The second example does contain a strong signal. Here we would like to alert users that have answered questions about Kubernetes in the past. This example should be left as is.

To convert an example to a negative example, we simply need to change its completion to indicate a non-response. I chose to use " nil" for this, which is represented by a single token in OpenAI.

```{json}
{"prompt":"Hey team, how was your weekend?\n\n###\n\n", "completion":" nil"}
{"prompt":"Hey, I'm trying to get my application deployed in Kubernetes. However, I can't seem to figure out if I should use a deployment or a stateful set. I found some docs here: but I'm still not sure which to choose. Can you help?\n\n is at lunch now. They will be back in about an hour. I don't know much about this either, but I can try to help. Or is it okay to wait until is back?\n\nI can wait for to get back.\n\n###\n\n", "completion":" 2"}
```
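
In code, the conversion is a one-line change to the completion. The sketch below assumes each example is a dictionary with `prompt` and `completion` keys; `to_negative` is a hypothetical helper name.

```python
import json

NEGATIVE_COMPLETION = " nil"  # the non-response marker chosen above


def to_negative(example):
    """Return a copy of the example whose completion marks a non-response."""
    return {**example, "completion": NEGATIVE_COMPLETION}


example = {"prompt": "Hey team, how was your weekend?\n\n###\n\n", "completion": " 1"}
negative = to_negative(example)
print(json.dumps(negative))
```

The original dictionary is left untouched, so the positive version stays available if you later change your mind about an example.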

Instead of manually going through each generated example to determine if it should be positive or negative, we can use OpenAI's ChatGPT API to do this work for us. However, we still need to review a few examples ourselves in order to give ChatGPT enough information to make decisions on our behalf.

Look through your generated examples and try to find 25-50 for each bucket: positive and negative.

Positive (strong signal) examples:

* I've been utilizing the Rust syntax highlighter for my code blocks. It does a good job of differentiating between functions and literals.
* The agent doesn’t push to prometheus, this is just another proxy location that prometheus scrapes.

Negative (weak signal) examples:

* Some very interesting ideas in here, thx for sharing
* Were there any issues with this? I'll start verifying a few things in a bit.
* Standup?

Put your positive examples in one jsonl file and the negative in another. 
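
As a sketch of this step, the snippet below writes two hand-picked lists to JSONL files. It assumes each line only needs a `prompt` field that ends with the `\n\n###\n\n` separator, and that the files are named `examples_strong.jsonl` and `examples_weak.jsonl`; `write_jsonl` is a hypothetical helper.

```python
import json

SEPARATOR = "\n\n###\n\n"  # assumption: hand-picked prompts use the same end marker


def write_jsonl(path, prompts):
    """Write one {"prompt": ...} JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt + SEPARATOR}) + "\n")


strong_prompts = [
    "I've been utilizing the Rust syntax highlighter for my code blocks. "
    "It does a good job of differentiating between functions and literals.",
]
weak_prompts = [
    "Some very interesting ideas in here, thx for sharing",
    "Standup?",
]

write_jsonl("examples_strong.jsonl", strong_prompts)
write_jsonl("examples_weak.jsonl", weak_prompts)
```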


We will now use ChatGPT with few-shot learning to iterate over our full set of training examples and label each as positive or negative. 

First, we will generate an instruction set for ChatGPT by building up an array of messages in JSON. Each message object contains `role` and `content` properties. The `role` can be `system`, `user`, or `assistant`.

The first message should always be from the `system` role and give the model general instructions about its function. Following this, message pairs of `user` and `assistant` should be added, where the `user` content is our example input and the `assistant` content is our expected response from ChatGPT. These are the "few-shot" learnings that ChatGPT uses to help it determine our desired output.

The following Python code builds a suitable instruction set from our few examples:

```python
import json
from itertools import islice

# The system message tells ChatGPT what its job is and how to respond.
instructions = [{
    "role": "system",
    "content": "You are a helpful assistant. Your job is to determine if a prompt will be helpful for fine-tuning a model. All prompts start with 'start -->' and end with: '\\n\\n###\\n\\n'. You should respond 'yes' if you think the prompt has enough context to be helpful, or 'no' if not. No explanation is needed. You should only respond with 'yes' or 'no'."
}]

max_count = 50  # maximum number of strong/weak example pairs to include

# Read both files in lockstep; zip() stops as soon as either file runs out of lines.
with open('examples_strong.jsonl', 'r') as strong, open('examples_weak.jsonl', 'r') as weak:
    for strong_line, weak_line in islice(zip(strong, weak), max_count):
        strong_data = json.loads(strong_line)
        weak_data = json.loads(weak_line)

        # Each user/assistant pair is one few-shot example: strong -> "yes", weak -> "no".
        instructions.append({"role": "user", "content": f'start -->{strong_data["prompt"]}'})
        instructions.append({"role": "assistant", "content": "yes"})
        instructions.append({"role": "user", "content": f'start -->{weak_data["prompt"]}'})
        instructions.append({"role": "assistant", "content": "no"})
```

Now we can use these instructions with ChatGPT to determine if each training example contains a strong signal for fine-tuning purposes. The following Python code does this.