## Intro

First, I'd like to say that I am not a Data Scientist. I've worked with Data Scientists in the past, but my knowledge in the space is limited. With the advent of high-quality large language models (LLMs), software engineers like me can now do Data Science with little to no Data Science background.

In this post, I'm going to discuss what I learned about fine-tuning OpenAI GPT models. As an example, I'll use my work on [BeepGPT](https://github.com/kaskada-ai/beep-gpt), where I fine-tuned a model to predict which conversations might be interesting to users of a Slack workspace.

We'll cover the following topics in detail below:

* Building Training Examples
* Refining Training Examples
* Model Fine-Tuning
* Model Validation

#### A few tips before we get started

First is a willingness to experiment. Model fine-tuning is an iterative process. Most likely, your first approach to building training examples will not produce a successful model. When working on BeepGPT, I experimented with five different scenarios (over a week) before I found one that was successful at predicting user interest.

Second is the importance of data quality. When fine-tuning a model, numerous training examples are sent to a model to update its behavior. The examples must contain strong signals that relate to the desired output. Despite the major advances made recently with LLMs, garbage in still leads to garbage out.

Finally, it is important to have many training examples. Ideally, a fine-tuning job should be run with at least a few thousand training examples. The more examples you can provide to the model, the more it can learn, and the better its predictions will be in production.

## Definitions

Here are a few concepts that you should understand before continuing...

#### Tokens
OpenAI's models process text as tokens instead of as characters. Tokens represent sets of characters that commonly occur in a certain sequence. On average, a token represents about four characters. You can use [OpenAI's tokenizer tool](https://platform.openai.com/tokenizer) to see how different texts get converted into tokens, for example:

> <p><span class="tokenizer-tkn tokenizer-tkn-0" title="15592">Team</span><span class="tokenizer-tkn tokenizer-tkn-1" title="750"> did</span><span class="tokenizer-tkn tokenizer-tkn-2" title="345"> you</span><span class="tokenizer-tkn tokenizer-tkn-3" title="2883"> enjoy</span><span class="tokenizer-tkn tokenizer-tkn-4" title="262"> the</span><span class="tokenizer-tkn tokenizer-tkn-0" title="299"> n</span><span class="tokenizer-tkn tokenizer-tkn-1" title="620">ach</span><span class="tokenizer-tkn tokenizer-tkn-2" title="418">os</span><span class="tokenizer-tkn tokenizer-tkn-3" title="7415"> yesterday</span><span class="tokenizer-tkn tokenizer-tkn-4" title="30">?</span></p>
>
> The color highlighting shows how 40 characters become 10 tokens.
>
> You can hover over each token to see its actual token value.

Common words and most positive integers under 1000 equate to a single token. Whitespace and capitalization matter. ` Team`, ` team`, `Team`, and `team` equate to 4 different tokens.


#### Prompts & Completions
Prompts are the input to LLMs. When working with ChatGPT, the prompt is the question we ask the model.

Completions are the responses from LLMs. When working with ChatGPT, the completion is the model's answer to our question.

#### Training Examples
Training examples are prompt & completion pairs. The prompt is the text we would have sent the model in production, and the completion is the response we would have expected back. 

#### Maximum Token Length
The maximum length of an API request to a model, in tokens. This includes both the prompt and the completion. The maximum differs from model to model. For fine-tuning, we need to make sure that each training example fits within this limit.
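
As a rough pre-check, the four-characters-per-token average mentioned above can be turned into a quick estimate of whether an example is likely to fit. This is only a sketch: `estimate_tokens`, `fits_in_limit`, and the limit of 2048 are illustrative assumptions, not values taken from OpenAI for any particular model, and a real tokenizer will give different counts for unusual text.

```python
import math

# Average characters per token, taken from the rule of thumb above.
CHARS_PER_TOKEN = 4

def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count of `text` from its character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def fits_in_limit(prompt: str, completion: str, max_tokens: int) -> bool:
    """Check whether prompt plus completion likely fit within `max_tokens`."""
    return estimate_tokens(prompt) + estimate_tokens(completion) <= max_tokens

# Hypothetical limit, for illustration only; check the documentation for your model.
EXAMPLE_LIMIT = 2048

print(estimate_tokens("Team did you enjoy the nachos yesterday?"))
print(fits_in_limit("Team did you enjoy the nachos yesterday?", " UserB", EXAMPLE_LIMIT))
```

Because the estimate is coarse, it is safest to leave some headroom below the real limit of the model you choose.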

## Building Training Examples

#### Hypothesis generation

Before you start building training examples, you need to form hypotheses about what you want to predict and how you might do so successfully. This is where the iterative process starts.

For BeepGPT I experimented with the following ideas: 

* For a set of recent messages in a channel, try to predict:
  * The reaction (if any) to the most recent message
  * The next user that will reply
  * The set of users that might interact next (reply or react)
* For the set of recent messages in a conversation, try to predict:
  * The set of users that might interact next
  * The next user that will reply

I was most successful at fine-tuning a model for the final idea: For the set of recent messages in a conversation, predict the next user to reply. The rest of the post will focus on this.

::: {.callout-tip}
I used [Kaskada](https://kaskada.io) to quickly iterate on these ideas. Kaskada is a tool that makes it easy to collect and aggregate events from raw data. You don't need to pre-process anything. Just import the raw events and start experimenting. Furthermore, Kaskada ensures that your examples will not be subject to leakage, which is a big problem in predictive modeling. In a future post, I'll show how I used Kaskada to generate training examples for each of the ideas presented above.
:::

#### Consider this example conversation:

::: {.hanging-indent .margin-0}
**UserA**: Team did you enjoy the nachos yesterday?

**UserB**: Yes, I love mexican food.

**UserA**: <@UserC> I'm trying to get my application deployed in Kubernetes. However, I can't seem to figure out if I should use a deployment or a stateful set. I found some docs here: http://tinyurl.com/4k53dc8h but I'm still not sure which to choose. Can you help?

**UserB**: UserC is at lunch now. They will be back in about an hour. I don't know much about this either, but I can try to help. Or is it okay to wait until UserC is back?

**UserA**: I can wait for UserC to get back.

**UserC**: I can help with this. Can you tell me more about your application? Does it have any persistent storage requirements?
:::

#### Training example construction

If we just look at the first two messages, we can generate a training example. The prompt is the first message, and the completion is the user that responded.

* **Prompt**: "Team did you enjoy the nachos yesterday?"
* **Completion**: "UserB"

Instead, if we consider the last four messages, the first three form the prompt (we use two newline characters to join messages) and the author of the fourth is the completion:

* **Prompt**: "<@UserC> I'm trying to get my application deployed in Kubernetes. However, I can't seem to figure out if I should use a deployment or a stateful set. I found some docs here: http://tinyurl.com/4k53dc8h but I'm still not sure which to choose. Can you help?\n\nUserC is at lunch now. They will be back in about an hour. I don't know much about this either, but I can try to help. Or is it okay to wait until UserC is back?\n\nI can wait for UserC to get back."
* **Completion**: "UserC"

The combination of the prompt and the completion is a training example. When using the OpenAI fine-tuning API, each training example should be a blob of JSON in a specific format on a single line.

Converting our two examples above, we now have:

```{.json .code-overflow-wrap}
{"prompt":"Team did you enjoy the nachos yesterday?", "completion":"UserB"}
{"prompt":"<@UserC> I'm trying to get my application deployed in Kubernetes. However, I can't seem to figure out if I should use a deployment or a stateful set. I found some docs here: http://tinyurl.com/4k53dc8h but I'm still not sure which to choose. Can you help?\n\nUserC is at lunch now. They will be back in about an hour. I don't know much about this either, but I can try to help. Or is it okay to wait until UserC is back?\n\nI can wait for UserC to get back.", "completion":"UserC"}
```
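
The construction above could be automated along these lines. This is a minimal sketch, assuming the conversation is already available as a list of `(user, text)` tuples; `build_example` and `n_context` are hypothetical names introduced here, not part of BeepGPT.

```python
import json

def build_example(messages, n_context):
    """Build one prompt/completion pair from a list of (user, text) messages.

    The last message supplies the completion (its author); the `n_context`
    messages before it are joined with two newlines to form the prompt.
    """
    *context, (next_user, _) = messages[-(n_context + 1):]
    prompt = "\n\n".join(text for _, text in context)
    return {"prompt": prompt, "completion": next_user}

conversation = [
    ("UserA", "Team did you enjoy the nachos yesterday?"),
    ("UserB", "Yes, I love mexican food."),
]

# One JSON object per line, as described above.
print(json.dumps(build_example(conversation, n_context=1)))
```

Sliding this function over a conversation, one message at a time, yields one training example per reply.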

#### Formatting examples

Next, there are several recommended formatting rules to follow. I don't understand why these are recommended, but I followed them anyway.

* All prompts should end with the same set of characters. The set of characters used should not occur elsewhere in your dataset. The recommended string for textual input data is `\n\n###\n\n`.
* All completions should start with a single whitespace character. 

Applying these rules to our examples, we get:

```{.json .code-overflow-wrap}
{"prompt":"Team did you enjoy the nachos yesterday?\n\n###\n\n", "completion":" UserB"}
{"prompt":"<@UserC> I'm trying to get my application deployed in Kubernetes. However, I can't seem to figure out if I should use a deployment or a stateful set. I found some docs here: http://tinyurl.com/4k53dc8h but I'm still not sure which to choose. Can you help?\n\nUserC is at lunch now. They will be back in about an hour. I don't know much about this either, but I can try to help. Or is it okay to wait until UserC is back?\n\nI can wait for UserC to get back.\n\n###\n\n", "completion":" UserC"}
```
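
Both rules are mechanical, so they can be applied in one small function. The sketch below is one way to do it; `format_example` is a hypothetical helper, and the check that rejects prompts already containing the separator is my own addition to enforce the "should not occur elsewhere" rule.

```python
PROMPT_SEPARATOR = "\n\n###\n\n"

def format_example(example):
    """Apply the two formatting rules to a raw prompt/completion pair."""
    if PROMPT_SEPARATOR in example["prompt"]:
        raise ValueError("separator already occurs in the prompt text")
    return {
        # every prompt ends with the same separator
        "prompt": example["prompt"] + PROMPT_SEPARATOR,
        # every completion starts with exactly one space
        "completion": " " + example["completion"].lstrip(),
    }

raw = {"prompt": "Team did you enjoy the nachos yesterday?", "completion": "UserB"}
print(format_example(raw))
```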


#### Training example cleanup

Finally, I found that model training works best if the following is done:

* Non-textual data like HTTP links, code blocks, and IDs are removed from the prompts.
* Completions are reduced to a single token in length.

We can use regex and other string functions to remove non-textual data from the prompts. We can also use standard data science tools like the [Scikit-Learn LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to create a mapping from UserIds to integers. Remember that most positive integers under one thousand map to a single token in OpenAI's tokenizer.

So now we have:

```{.json .code-overflow-wrap}
{"prompt":"Team did you enjoy the nachos yesterday?\n\n###\n\n", "completion":" 1"}
{"prompt":"I'm trying to get my application deployed in Kubernetes. However, I can't seem to figure out if I should use a deployment or a stateful set. I found some docs here: but I'm still not sure which to choose. Can you help?\n\nUserC is at lunch now. They will be back in about an hour. I don't know much about this either, but I can try to help. Or is it okay to wait until UserC is back?\n\nI can wait for UserC to get back.\n\n###\n\n", "completion":" 2"}
```
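
The integer completions above could be produced with a plain dictionary instead of a library. The sketch below is a dependency-free stand-in for the LabelEncoder mentioned earlier, not the code used for BeepGPT; `build_user_index` and `encode_completion` are hypothetical names.

```python
def build_user_index(user_ids):
    """Map each distinct user ID to a small integer, in sorted order."""
    return {user: index for index, user in enumerate(sorted(set(user_ids)))}

def encode_completion(example, user_index):
    """Replace the user ID in the completion with ' <integer>'."""
    user = example["completion"].strip()
    return {**example, "completion": f" {user_index[user]}"}

user_index = build_user_index(["UserA", "UserB", "UserA", "UserC", "UserB"])
example = {"prompt": "Team did you enjoy the nachos yesterday?\n\n###\n\n", "completion": " UserB"}
print(encode_completion(example, user_index))
```

Keep the mapping around after training: it is needed to translate the model's integer predictions back into users.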

::: {.callout-important}
I found that removing IDs was especially important. Before I did so, the model essentially learned how to map from input UserId to output user. It skipped learning anything useful from the actual text of the messages.
:::
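
The prompt cleanup, including the ID removal, could be sketched with a few regular expressions. The patterns below are assumptions about what raw messages look like (bare URLs, `<@...>` mentions, and code blocks delimited by three backticks); adjust them to your own data. `clean_message` is a hypothetical helper meant to be applied to each message before the messages are joined into a prompt.

```python
import re

FENCE = "`" * 3  # three backticks, which delimit code blocks in Slack messages

# Assumed patterns; adjust them to whatever your raw messages contain.
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
LINK = re.compile(r"https?://\S+")
MENTION = re.compile(r"<@[^>]+>")

def clean_message(text):
    """Strip code blocks, links, and user mentions, then tidy whitespace."""
    for pattern in (CODE_BLOCK, LINK, MENTION):
        text = pattern.sub("", text)
    # collapse the gaps left behind by the removed pieces
    return re.sub(r"[ \t]+", " ", text).strip()

print(clean_message("<@UserC> I found some docs here: http://tinyurl.com/4k53dc8h but I'm still not sure."))
```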

We now have 2 training examples that we could use for fine-tuning a model. 

For fine-tuning, we should have several thousand examples. Using a tool like [Kaskada](https://kaskada.io), it should be relatively easy to generate examples like this from your full Slack history.

## Refining Training Examples

Before proceeding with fine-tuning, I recommend taking the following steps:

1. Use the ChatCompletion API to determine which examples contain enough signal to be useful.
2. Use the OpenAI CLI to validate that the training examples are in the correct format.

#### Determining Example Signal Strength

I recommend you ensure each example contains a strong enough signal for predicting your desired outcome. If an example doesn't have enough signal, you should consider excluding it from your training set.

Depending on your goal, it may also be helpful to include some negative examples. Negative examples help teach the model when not to make a prediction or, stated another way, when to predict that no action should be taken.

For example, with BeepGPT we are trying to predict when a set of messages might be interesting for a specific user. If we look at our training examples from the previous section, the first does not contain anything interesting. We would not want to alert anyone about this message. Therefore we should convert this example into a negative example.

The second example does contain a strong signal. Here we would like to alert users that have answered questions about Kubernetes in the past. This example should be left as is.

To convert an example to a negative example, we simply need to change its completion to indicate a non-response. I chose to use ` nil` for this, which is represented by a single token in OpenAI.

```{.json .code-overflow-wrap}
{"prompt":"Team did you enjoy the nachos yesterday?\n\n###\n\n", "completion":" nil"}
{"prompt":"I'm trying to get my application deployed in Kubernetes. However, I can't seem to figure out if I should use a deployment or a stateful set. I found some docs here: but I'm still not sure which to choose. Can you help?\n\nUserC is at lunch now. They will be back in about an hour. I don't know much about this either, but I can try to help. Or is it okay to wait until UserC is back?\n\nI can wait for UserC to get back.\n\n###\n\n", "completion":" 2"}
```

Instead of manually going through each generated example to determine if it should be positive or negative, we can use OpenAI's ChatGPT API to do this work for us. However, we still need to review a few examples ourselves in order to give ChatGPT enough information to make decisions on our behalf.

Look through your generated examples and try to find 25-50 for each bucket: **positive** and **negative**. Add these to two files, `examples_pos.jsonl` and `examples_neg.jsonl`.

Positive (strong signal) examples:

* I've been utilizing the Rust syntax highlighter for my code blocks. It does a good job of differentiating between functions and literals.
* The agent doesn’t push to prometheus, this is just another proxy location that prometheus scrapes.

Negative (weak signal) examples:

* Some very interesting ideas in here, thx for sharing
* Were there any issues with this? I'll start verifying a few things in a bit.
* Standup?

We will now use ChatGPT with few-shot learning to iterate over our full set of training examples and label each as positive or negative. 

First, we will generate an instruction set for ChatGPT by building up an array of messages in JSON. Each message object contains `role` and `content` properties. The `role` can be one of `system`, `user`, or `assistant`.

The first message should always be from the `system` role, and should give the model general instructions about its function. Following this, message pairs of `user` and `assistant` should be added, where the `user` content is our example input and the `assistant` content is our expected response from ChatGPT. These are the "few-shot" examples that ChatGPT uses to help it determine our desired output.

Then we append a final `user` message to the instruction set that contains the content that we want to have evaluated by the model.

Below is some example Python code for doing this refinement. Some notes:

* The code is written in blocks with the intention of running inside a Jupyter notebook environment.
* If you want to run this code yourself, you will need an OpenAI API key. 
* The code also assumes that you have a file named `examples.jsonl` which contains the full set of training examples generated above.

In [None]:
# install required packages
# (the code in this post uses the pre-1.0 interface of the `openai` package)
%pip install "openai<1.0" backoff

::: {.callout-note}
Note that we use the `backoff` library to retry requests that have failed due to a rate-limit error. Despite this addition, sometimes the process stalls and must be manually restarted. The code below appends to the output file instead of replacing it, so that the process can be restarted after an error occurs.
:::

In [None]:
# import packages and init openAI with your API key
import backoff, getpass, json, openai, time

openai.api_key = getpass.getpass('OpenAI API Key:')

In [None]:
# get a total count of examples in the input file
total_count = 0
with open(f'examples.jsonl', 'r') as in_file:
    for line in in_file:
        total_count += 1

# initialize a progress counter
success_count = 0

# build up the instruction set for few-shot learning

# start with a `system` message that provides the general instructions to the model
system_instructions = "You are a helpful assistant. Your job is to determine \
    if a prompt will be helpful for fine-tuning a model. All prompts start with \
    'start -->' and end with: '\\n\\n###\\n\\n'. You should respond 'yes' if you \
    think the prompt has enough context to be helpful, or 'no' if not. No \
    explanation is needed. You should only respond with 'yes' or 'no'."
instructions = [{"role": "system", "content": system_instructions}]

# then add the positive and negative examples that we manually pulled out of the full set
pos = open(f'examples_pos.jsonl', 'r')
neg = open(f'examples_neg.jsonl', 'r')

while True:
    pos_line = pos.readline()
    neg_line = neg.readline()

    if (not pos_line) or (not neg_line):
        break

    pos_data = json.loads(pos_line)
    neg_data = json.loads(neg_line)

    # alternate adding positive and negative examples
    instructions.append({"role": "user", "content": f'start -->{pos_data["prompt"]}'})
    instructions.append({"role": "assistant", "content": "yes"})
    instructions.append({"role": "user", "content": f'start -->{neg_data["prompt"]}'})
    instructions.append({"role": "assistant","content": "no"})

pos.close()
neg.close()

# setup a method to retry requests automatically
@backoff.on_exception(backoff.expo, (openai.error.RateLimitError, openai.error.ServiceUnavailableError))
def chat_with_backoff(**kwargs):
    # add an additional delay, because the first retry almost always fails
    time.sleep(1)
    try:
        return openai.ChatCompletion.create(**kwargs)
    except openai.error.InvalidRequestError:
        return None

In [None]:
# iterate through each example, using the ChatCompletion API 
# to determine if it contains a strong signal for fine-tuning purposes.
# if this code block stalls, you can restart it to resume processing. 

count = 0
with open(f'examples.jsonl', 'r') as in_file:
    with open(f'examples_refined.jsonl', 'a') as out_file:
        for line in in_file:
            count +=1

            # skip examples already processed on previous runs
            # (`success_count` is the last line that was written, so skip it too)
            if count <= success_count:
                continue

            print(f'Currently processing line {count} of {total_count}')

            # get the next example from the file
            data = json.loads(line)
            prompt = data["prompt"]

            # add the example to a copy of the instruction set
            msgs = instructions.copy()
            msgs.append({"role": "user", "content": f'start -->{prompt}'})

            # send the request
            res = chat_with_backoff(model = "gpt-3.5-turbo", messages = msgs)

            # if request failed for some reason, skip example
            if not res:
                continue

            # get the response and write the example back to disk
            # (normalize case and whitespace so that replies like 'No' still match)
            answer = res["choices"][0]["message"]["content"].strip().lower()
            if answer.startswith("no"):
                # for negative messages, re-write the completion as ` nil`
                data["completion"] = " nil"
            out_file.write(json.dumps(data) + '\n')
            out_file.flush()

            # save progress for restart
            success_count = count

::: {.callout-warning title="Tips & Warnings"}
* If you get an error about too many tokens used, reduce the number of positive and negative examples in your generated instructions. Or try to summarize the positive & negative examples (manually or with ChatGPT) to reduce their length.
* This will cost a fair amount on OpenAI. A rough estimate is $50 per 10,000 examples.
* This can take a long time to run to completion. The ChatCompletion API limits the number of tokens used per minute. In my experience, running 10,000 examples through this process takes about 8 hours.
:::
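
To budget a refinement run, the rough figures above can be scaled to the size of your own dataset. This is a back-of-the-envelope sketch that assumes cost and duration grow linearly with the number of examples; `estimate_refinement` is a hypothetical helper, and your actual prices and rate limits may differ.

```python
COST_PER_EXAMPLE = 50 / 10_000   # dollars, from the rough estimate above
HOURS_PER_EXAMPLE = 8 / 10_000   # hours, from the run time observed above

def estimate_refinement(n_examples):
    """Return (cost in dollars, duration in hours) for labeling n examples."""
    return n_examples * COST_PER_EXAMPLE, n_examples * HOURS_PER_EXAMPLE

cost, hours = estimate_refinement(25_000)
print(f"~${cost:.0f} and ~{hours:.0f} hours")
```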

#### Example Validation

Finally, we will use a CLI tool provided by OpenAI to perform some validation on our training examples and split them into two files. The tool does the following for us:

* makes sure all prompts end with the same suffix
* removes examples that use too many tokens
* removes duplicated examples

We can run the CLI tool directly from a Python Jupyter environment with the code below.

In [None]:
from openai import cli
from types import SimpleNamespace

args = SimpleNamespace(file='examples_refined.jsonl', quiet=True)
cli.FineTune.prepare_data(args)

The output of the above command should be two files:

* `examples_refined_prepared_train.jsonl` -> We will use this to fine-tune our model
* `examples_refined_prepared_valid.jsonl` -> We will use this to validate our fine-tuned model
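
To make the checks listed above concrete, here is a rough local approximation of them. It is not the CLI tool's implementation: `validate_examples` and `split_examples` are hypothetical helpers, token counts use the four-characters-per-token rule of thumb, examples with a missing suffix are simply dropped, and the 20% validation share is an arbitrary choice.

```python
import json
import math

SUFFIX = "\n\n###\n\n"

def validate_examples(lines, max_tokens):
    """Keep well-formed, unique examples that fit within `max_tokens`."""
    seen, kept = set(), []
    for line in lines:
        example = json.loads(line)
        key = (example["prompt"], example["completion"])
        # rough token estimate: about four characters per token
        tokens = math.ceil((len(key[0]) + len(key[1])) / 4)
        if not example["prompt"].endswith(SUFFIX):
            continue  # prompt is missing the shared suffix
        if tokens > max_tokens or key in seen:
            continue  # too long, or a duplicate
        seen.add(key)
        kept.append(example)
    return kept

def split_examples(examples, valid_fraction=0.2):
    """Split into (train, valid); the fraction is an arbitrary choice."""
    n_valid = int(len(examples) * valid_fraction)
    return examples[n_valid:], examples[:n_valid]
```

A check like this is useful for spotting problems early, but the output of the CLI tool is what the rest of this post uses.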

## Model Fine-Tuning

Now that we have refined training examples, we can fine-tune a model for our purposes.

To do this, we take the following steps:

1. Upload training data
1. Create a fine-tuning job
1. Wait for the job to finish
1. Experiment with the fine-tuned model

#### Upload training data

First, we upload the refined examples to OpenAI. We need to make sure the file has successfully uploaded before moving on to the next step.

In [None]:
training_file_name = "examples_refined_prepared_train.jsonl"

# start the file upload
training_file_id = cli.FineTune._get_or_upload(training_file_name, True)

# Poll and display the upload status until it finishes
while True:
    time.sleep(2)
    file_status = openai.File.retrieve(training_file_id)["status"]
    print(f'Upload status: {file_status}')
    if file_status in ["succeeded", "failed", "processed"]:
        break

#### Create a fine-tuning job

Next we create a fine-tuning job using the file_id from the upload. 

When doing fine-tuning, you need to choose a base model to start from. The current options are:

* `babbage-002` -> Capable of straightforward tasks. Very fast and low cost.
* `davinci-002` -> Most capable GPT-3 model. More expensive to train and run in production.

You also need to choose the number of epochs to train the model for. An epoch refers to one full cycle through the training dataset.

With the example set I was using for BeepGPT, I found that 8 epochs on `babbage-002` produced a model with a similar capability to 4 epochs on `davinci-002`. Depending on your use case, you may or may not see a similar result.

In [None]:
create_args = {
    "training_file": training_file_id,
    "model": "davinci-002",
    "n_epochs": 4,
    "suffix": "beep-gpt"
}

# Create the fine-tune job and retrieve the job ID
resp = openai.FineTune.create(**create_args)
job_id = resp["id"]

#### Wait for the job to finish

After the fine-tuning job has been created, we need to wait for it to start processing, and then for it to finish. 

Depending on the current backlog at OpenAI, I've seen jobs take up to a dozen hours to start.

After the job starts successfully, you can then see its status, and wait for it to finish. This can also take a long time. When using `davinci-002` with 4 epochs, I estimate about 1 hour per 1000 training examples.

In [None]:
# Poll and display the fine-tuning status until it finishes
while True:
    time.sleep(5)
    job_details = openai.FineTune.retrieve(id=job_id)
    
    print(f'Job status: {job_details["status"]}')
    print(f'Job events: {job_details["events"]}')

    if job_details["status"] == "succeeded":
        model_id = job_details["fine_tuned_model"]
        print(f'Successfully fine-tuned model with ID: {model_id}')

    if job_details["status"] in ["failed", "succeeded"]:
        break
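
To get a feel for how long the wait might be, the estimate of about one hour per 1000 examples (for `davinci-002` with 4 epochs) can be scaled to other dataset sizes. The sketch below assumes training time grows linearly with both the number of examples and the number of epochs, which is my assumption rather than anything measured; `estimate_training_hours` is a hypothetical helper.

```python
HOURS_PER_1000_EXAMPLES = 1.0  # observed with `davinci-002` and 4 epochs
REFERENCE_EPOCHS = 4

def estimate_training_hours(n_examples, n_epochs=REFERENCE_EPOCHS):
    """Estimate training time, assuming it scales linearly with examples and epochs."""
    passes = n_examples * n_epochs              # examples processed in total
    reference_passes = 1000 * REFERENCE_EPOCHS  # passes that took about one hour
    return HOURS_PER_1000_EXAMPLES * passes / reference_passes

print(estimate_training_hours(5000))
print(estimate_training_hours(5000, n_epochs=8))
```

This does not include the time a job spends waiting in the queue before it starts.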

#### Try using the fine-tuned model

Now that we have a finished model, we can try sending a few prompts and see if it recommends alerting any users. We can use the validation file for this.

See the [OpenAI docs](https://platform.openai.com/docs/api-reference/completions/create) for info on the parameters we send to the Completion API.

In [None]:
# choose which row in the validation file to send
row = 6

count = 0
with open('examples_refined_prepared_valid.jsonl', 'r') as in_file:
    for line in in_file:
        count += 1

        # skip rows until we reach the chosen one
        if count < row:
            continue

        data = json.loads(line)
        prompt = data["prompt"]
        completion = data["completion"]

        # this is the text we send to the model for it to 
        # determine if we should alert a user
        print(f'Prompt: {prompt}')

        # this is the user (or nil) we would have expected 
        # for the response (from the validation file)
        print(f'Completion: {completion}')

        # this is the response from the model. The `text` field contains 
        # the actual prediction. The `logprobs` array contains the 
        # log-probability from the 5 highest potential matches.
        print('Prediction:')
        res = openai.Completion.create(model=model_id, prompt=prompt, max_tokens=1, logprobs=5, temperature=0)
        print(res)

        # only send the chosen row
        break
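
The log-probabilities in the response can be converted into ordinary probabilities to decide whether a prediction is confident enough to act on. The sketch below assumes a dictionary of token to log-probability, shaped like one entry of the `top_logprobs` list in the Completion response; `users_to_alert`, the threshold of 0.5, and the example values are all hypothetical.

```python
import math

def users_to_alert(top_logprobs, threshold=0.5):
    """Return (token, probability) pairs above `threshold`, skipping ` nil`."""
    alerts = []
    for token, logprob in top_logprobs.items():
        probability = math.exp(logprob)  # convert log-probability to probability
        if token.strip() != "nil" and probability >= threshold:
            alerts.append((token.strip(), probability))
    # most likely user first
    return sorted(alerts, key=lambda pair: pair[1], reverse=True)

# Hypothetical values shaped like one entry of the `top_logprobs` list.
example = {" 2": -0.2, " nil": -1.9, " 7": -3.5}
print(users_to_alert(example))
```

Lowering the threshold alerts more users at the cost of more false alarms, so it is worth tuning against the validation file.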

## Model Validation

In order to validate your model, you can run it over the full validation data set.
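
One simple way to do this is to measure accuracy: the share of validation examples for which the model's prediction matches the expected completion. The sketch below is an assumption about how that could look, not the validation code used for BeepGPT. `evaluate` takes any `predict` function, and `always_nil` is a stand-in used here so the example runs without calling the API.

```python
import json

def evaluate(lines, predict):
    """Return the fraction of validation examples that `predict` gets right."""
    correct = total = 0
    for line in lines:
        example = json.loads(line)
        total += 1
        if predict(example["prompt"]).strip() == example["completion"].strip():
            correct += 1
    return correct / total if total else 0.0

# Stand-in predictor for illustration; it always answers ` nil`.
def always_nil(prompt):
    return " nil"

sample = [
    json.dumps({"prompt": "Standup?\n\n###\n\n", "completion": " nil"}),
    json.dumps({"prompt": "Which k8s resource should I use?\n\n###\n\n", "completion": " 2"}),
]
print(evaluate(sample, always_nil))
```

In practice, `predict` would wrap the Completion call shown in the previous section and return the `text` of the first choice. The always-`nil` predictor is still a useful baseline: a fine-tuned model should beat it clearly.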





## Conclusion and Next Steps

