# GPT FINE TUNING

GROUP MEMBERS:
- Rishabh TIWARI;
- Felipe BAGNI;
- Erfan AMIDI;
- Federica VINCIGUERRA;
- Dan LIONIS.

---

# DAVINCI-02

# 1. Get OpenAI API Key

Before fine-tuning our model, let's get the OpenAI credentials needed for the API calls.

Go to the [OpenAI website](https://platform.openai.com/api-keys) and create a new secret key.

# 2. Create training data

The next step is to create training data to teach GPT-3 what you'd like it to say. The data must be a JSONL document in which each line pairs a prompt with its ideal completion:

```
{"prompt": "<question>", "completion": "<ideal answer>"}
{"prompt": "<question>", "completion": "<ideal answer>"}
{"prompt": "<question>", "completion": "<ideal answer>"}
```

For babbage-002 and davinci-002, you can follow the prompt-completion pair format shown above. However, a conversational chat format is required to fine-tune gpt-3.5-turbo (covered later in this project).
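To make the difference between the two formats concrete, here is a minimal sketch that builds one record of each kind and serializes it as a single JSONL line. The question and answer are made-up placeholders, not rows from our dataset.

```python
import json

# Hypothetical question/answer pair, used only for illustration
question = "What does PTH stand for?"
answer = "PTH stands for parathyroid hormone."

# Prompt-completion record (format for babbage-002 and davinci-002)
completion_record = {"prompt": question, "completion": answer}

# Chat record (conversational format for gpt-3.5-turbo)
chat_record = {
    "messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
}

# Each record becomes exactly one line of the JSONL file
completion_line = json.dumps(completion_record)
chat_line = json.dumps(chat_record)
print(completion_line)
print(chat_line)
```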

**Optional for Colab users**

Before starting, we can mount Google Drive so that our documents are stored there.
Run the following cells:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Make sure that the variable `path` contains the correct sequence of folders, separated by `'/'`, that leads to your files:

In [2]:
import os

path = '_NLP/Project' # CHANGE HERE FOR YOUR PATH

os.chdir(f'/content/drive/MyDrive/{path}') # IF WORKING LOCALLY, CHANGE THE PATH HERE AS WELL
os.getcwd()

'/content/drive/MyDrive/_NLP/Project'

Let's start by installing and importing the libraries needed:

In [3]:
!pip uninstall -y openai
!pip install openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.0


In [4]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.19.0-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Collecting huggingface-hub>=0.21.2 (from datasets)
  Downloading huggingface_hub-0.22.2-py3-none-a

In [5]:
import json
import openai
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split

Then add your API key from the previous step:

In [6]:
api_key ="sk-proj-##################" # ADD YOUR OPENAI API KEY HERE
openai.api_key = api_key
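Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. As a possible alternative, the sketch below reads the key from an environment variable. The helper name `load_api_key` is hypothetical, and it assumes the key was stored in `OPENAI_API_KEY` before the notebook was started.

```python
import os

def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name` in the given environment mapping."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key

# Demonstration with a made-up mapping instead of the real environment
demo_key = load_api_key({"OPENAI_API_KEY": "sk-demo"})
print(demo_key)

# In the notebook this could replace the hard-coded string:
# openai.api_key = load_api_key()
```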

Now load the dataset that the training data will be built from. If you are running in Colab, upload the file to the same Drive directory as your notebook:

In [7]:
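# Load the local Arrow file and convert its single 'train' split to a DataFrame
dataset = load_dataset('arrow', data_files='data-00000-of-00001.arrow')
df = dataset['train'].to_pandas()
# Drop the last column (`instruction`); only `input` and `output` are used below
df = df.iloc[:, :-1]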


Let's check out some properties of the dataset

In [8]:
len(df)

33955

In [9]:
df.head()

| | input | output |
|---|---|---|
| 0 | What is the relationship between very low Mg2+... | Very low Mg2+ levels correspond to low PTH lev... |
| 1 | What leads to genitourinary syndrome of menopa... | Low estradiol production leads to genitourinar... |
| 2 | What does low REM sleep latency and experienci... | Low REM sleep latency and experiencing halluci... |
| 3 | What are some possible causes of low PTH and h... | PTH-independent hypercalcemia, which can be ca... |
| 4 | How does the level of anti-müllerian hormone r... | The level of anti-müllerian hormone is directl... |


Split the data into train and test:

In [10]:
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)
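As a quick sanity check on the split sizes, the arithmetic can be reproduced by hand. The sketch assumes that the test share is rounded up and the remaining rows go to training. Under that assumption, the training size matches the 27164 prompt-completion pairs reported by the data preparation tool in step 3.

```python
import math

n_samples = 33955  # len(df) reported above
test_size = 0.2    # value passed to train_test_split

# Assumed rounding rule: test rows are rounded up, the rest go to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```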

In [11]:
train_data.head()

| | input | output |
|---|---|---|
| 17426 | What is the reason for the rapid-onset of acti... | What is the reason for the rapid-onset of acti... |
| 21416 | What types of cancer are associated with a dec... | Oral contraceptives are associated with a decr... |
| 27343 | What is the product of the conversion of galac... | Galactose is converted to galactose-1-phosphat... |
| 985 | What is the effect of prostaglandin agonists o... | Prostaglandin agonists increase the uveosclera... |
| 13243 | To which drug class do ipratropium and tiotrop... | Ipratropium and tiotropium belong to the drug ... |


# Preparing Training Data for Fine-Tuning

## Overview
In this section, we prepare the training data for fine-tuning language models. The data is formatted as prompt-completion pairs, which are suitable for training models like GPT-3.

## Steps Involved

### 1. Initialize an Empty List
- Started by initializing an empty list called `training_data` to store the formatted training data.

### 2. Iterate Through the DataFrame
- Iterated through each row of the DataFrame `train_data` containing the training examples.

### 3. Create Prompt-Completion Pairs
- For each row, created a dictionary with two keys:
  - **Prompt:** Concatenated the `input` column value with " ->".
  - **Completion:** Concatenated a space with the `output` column value and a period followed by a newline character.

### 4. Append to List
- Added each dictionary to the `training_data` list.

## Result
- **Training Data:** A list of dictionaries, where each dictionary contains a prompt-completion pair formatted for training.


In [12]:
# Initialize an empty list to store the training data
training_data = []

# Iterate through the rows of the DataFrame
for index, row in train_data.iterrows():
    # Create a dictionary for each row
    data_entry = {
        "prompt": row["input"] + " ->",
        "completion": " " + row["output"] + ".\n"
    }

    # Add the dictionary to the list
    training_data.append(data_entry)

# Saving Training Data to JSON Lines File

## Overview
In this step, we save the formatted training data to a JSON Lines file. JSON Lines format is commonly used for training data in machine learning as it is easy to process and stream.

## Steps Involved

### 1. Define the Output File Name
- Specified the name of the output file as `"training_data.jsonl"`.

### 2. Open the Output File
- Opened the file in write mode using `with open(file_name, "w") as output_file`.

### 3. Write Entries to the File
- Iterated through the `training_data` list.
- For each entry in the list:
  - Used `json.dump(entry, output_file)` to write the dictionary to the file in JSON format.
  - Added a newline character after each entry using `output_file.write("\n")`.

## Result
- The formatted training data is saved to a JSON Lines file (`training_data.jsonl`), where each line contains a JSON-formatted prompt-completion pair.

In [13]:
file_name = "training_data.jsonl"

# Write one JSON object per line (JSON Lines)
with open(file_name, "w") as output_file:
    for entry in training_data:
        json.dump(entry, output_file)
        output_file.write("\n")
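Before uploading, it can be useful to read the file back and confirm that every line parses and carries both keys. The sketch below does this on a small temporary file with made-up records. The helper name `check_jsonl` is hypothetical and is not part of the OpenAI tooling.

```python
import json
import os
import tempfile

def check_jsonl(path, required_keys=("prompt", "completion")):
    """Return the number of records; raise ValueError on a malformed line."""
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            record = json.loads(line)  # raises ValueError on invalid JSON
            missing = [key for key in required_keys if key not in record]
            if missing:
                raise ValueError(f"line {line_number} is missing {missing}")
            count += 1
    return count

# Demonstration on a temporary file with made-up records
sample = [
    {"prompt": "Question A ->", "completion": " Answer A.\n"},
    {"prompt": "Question B ->", "completion": " Answer B.\n"},
]
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as output_file:
        for entry in sample:
            json.dump(entry, output_file)
            output_file.write("\n")
    n_records = check_jsonl(demo_path)

print(n_records)
```

In the notebook, the same helper could be pointed at `training_data.jsonl`.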

# 3. Check the training data

In [14]:
!rm -rf training_data_prepared.jsonl

In [15]:
!openai tools fine_tunes.prepare_data -f training_data.jsonl

Analyzing...

- Your file contains 27164 prompt-completion pairs
- There are 339 duplicated prompt-completion sets. These are rows: [72, 576, 852, 857, 879, 898, 1101, 1139, 1228, 1236, 1297, 1657, 1663, 1695, 1814, 1845, 1994, 2152, 2169, 2208, 2216, 2338, 2437, 2586, 2699, 2794, 2803, 2899, 2941, 3026, 3124, 3232, 3240, 3269, 3344, 3383, 3812, 3922, 3960, 4048, 4074, 4289, 4367, 4482, 4505, 4512, 4514, 4590, 4766, 4841, 4859, 4951, 5043, 5081, 5098, 5413, 5499, 5591, 5709, 5754, 5761, 5865, 5910, 5913, 5979, 6044, 6168, 6211, 6216, 6260, 6263, 6284, 6302, 6391, 6404, 6557, 6656, 6757, 6761, 6803, 6920, 6980, 6994, 7003, 7097, 7188, 7231, 7437, 7518, 7519, 7695, 7825, 7841, 7891, 7922, 8018, 8295, 8430, 8512, 8549, 8607, 8711, 8935, 9354, 9418, 9461, 9543, 9649, 9897, 10004, 10008, 10055, 10140, 10158, 10259, 10273, 10407, 10429, 10652, 10656, 10735, 10821, 10872, 11030, 11047, 11050, 11148, 11275, 11352, 11395, 11481, 11493, 11621, 11626, 11640, 11712, 11727, 11868, 11880, 11918, 119
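The tool reports duplicated prompt-completion sets. To illustrate what removing them means, here is a minimal sketch that keeps the first occurrence of each pair. The records are made up, and the helper `drop_duplicate_pairs` is hypothetical; it is not what the tool runs internally.

```python
def drop_duplicate_pairs(records):
    """Keep the first occurrence of each (prompt, completion) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (record["prompt"], record["completion"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = [
    {"prompt": "Q1 ->", "completion": " A1.\n"},
    {"prompt": "Q2 ->", "completion": " A2.\n"},
    {"prompt": "Q1 ->", "completion": " A1.\n"},  # exact duplicate of the first record
]
unique_records = drop_duplicate_pairs(records)
n_removed = len(records) - len(unique_records)
print(n_removed)
```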

# 4. Upload training data

Use the checked and prepared training data for fine-tuning:

In [16]:
file_name = "training_data_prepared.jsonl"

In [18]:
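# Upload the prepared file; the context manager closes it after the upload
with open(file_name, "rb") as training_file:
    upload_response = openai.File.create(
        file=training_file,
        purpose='fine-tune'
    )
file_id = upload_response.id
upload_response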

<File file id=file-Dzg6yTeEW4tsPjlOAm8Jg7p1 at 0x79455b601d00> JSON: {
  "object": "file",
  "id": "file-Dzg6yTeEW4tsPjlOAm8Jg7p1",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 13002933,
  "created_at": 1714635472,
  "status": "processed",
  "status_details": null
}

Use this file id in the next step, where we'll fine-tune a model.

# 5. Fine-tune model

List the uploaded files to confirm that the upload succeeded:

In [22]:
openai.File.list()

<OpenAIObject list at 0x79454a2d2480> JSON: {
  "object": "list",
  "data": [
    {
      "object": "file",
      "id": "file-Dzg6yTeEW4tsPjlOAm8Jg7p1",
      "purpose": "fine-tune",
      "filename": "file",
      "bytes": 13002933,
      "created_at": 1714635472,
      "status": "processed",
      "status_details": null
    },
    {
      "object": "file",
      "id": "file-oGbFDNtSr3XiN0GToSBCCArH",
      "purpose": "fine-tune",
      "filename": "file",
      "bytes": 13002933,
      "created_at": 1714492391,
      "status": "processed",
      "status_details": null
    }
  ],
  "has_more": false
}

Create the fine-tuning job. In our case, we fine-tune on top of `davinci-002`:

In [25]:
response = openai.FineTuningJob.create(
  training_file=file_id,
  model="davinci-002",
  )

response

<FineTuningJob fine_tuning.job id=ftjob-rBeqFM0RGgyiJYwgpnbBcdR6 at 0x79454b1e4540> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-rBeqFM0RGgyiJYwgpnbBcdR6",
  "model": "davinci-002",
  "created_at": 1714636256,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-qfaPGf9u8lnvI7TsUp7b6EU0",
  "result_files": [],
  "status": "validating_files",
  "validation_file": null,
  "training_file": "file-Dzg6yTeEW4tsPjlOAm8Jg7p1",
  "hyperparameters": {
    "n_epochs": "auto",
    "batch_size": "auto",
    "learning_rate_multiplier": "auto"
  },
  "trained_tokens": null,
  "error": {},
  "user_provided_suffix": null,
  "seed": 420739891,
  "estimated_finish": null,
  "integrations": []
}

Run this code to print the job ID and the status returned when the job was created. The job is done once the status is `succeeded`.

In [26]:
job_id = response.id
status = response.status

print(f'Fine-tuning model with jobID: {job_id}.')
print(f"Training Response: {response}")
print(f"Training Status: {status}")

Fine-tuning model with jobID: ftjob-rBeqFM0RGgyiJYwgpnbBcdR6.
Training Response: {
  "object": "fine_tuning.job",
  "id": "ftjob-rBeqFM0RGgyiJYwgpnbBcdR6",
  "model": "davinci-002",
  "created_at": 1714636256,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-qfaPGf9u8lnvI7TsUp7b6EU0",
  "result_files": [],
  "status": "validating_files",
  "validation_file": null,
  "training_file": "file-Dzg6yTeEW4tsPjlOAm8Jg7p1",
  "hyperparameters": {
    "n_epochs": "auto",
    "batch_size": "auto",
    "learning_rate_multiplier": "auto"
  },
  "trained_tokens": null,
  "error": {},
  "user_provided_suffix": null,
  "seed": 420739891,
  "estimated_finish": null,
  "integrations": []
}
Training Status: validating_files
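The status printed here is only a snapshot from the moment the job was created. Instead of re-running a cell by hand, the status could be polled in a loop. The sketch below is one way to do that. The helper `wait_for_job`, the polling interval, and the set of terminal statuses are assumptions, and the status source is injected so that the example runs offline.

```python
import time

# Assumed set of statuses after which the job no longer changes
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or polls run out."""
    status = fetch_status()
    polls = 1
    while status not in TERMINAL_STATUSES and polls < max_polls:
        sleep(poll_seconds)
        status = fetch_status()
        polls += 1
    return status

# Offline demonstration with a scripted sequence of statuses
scripted = iter(["validating_files", "running", "running", "succeeded"])
final_status = wait_for_job(lambda: next(scripted), sleep=lambda seconds: None)
print(final_status)

# With openai==0.28, fetch_status could wrap a retrieve call:
# wait_for_job(lambda: openai.FineTuningJob.retrieve(id=job_id).status)
```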


Run this code to list the events of the fine-tuning job:

In [35]:
openai.FineTuningJob.list_events(job_id)

<OpenAIObject list at 0x7945497c8360> JSON: {
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-vYxO32ULEnbp4QErzejuBrjE",
      "created_at": 1714637007,
      "level": "info",
      "message": "Step 1563/1578: training loss=0.55",
      "data": {
        "step": 1563,
        "train_loss": 0.5481270551681519,
        "total_steps": 1578,
        "train_mean_token_accuracy": 0.8197395205497742
      },
      "type": "metrics"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-H0MO59EQwGXpyqqLvkqg0rlB",
      "created_at": 1714637005,
      "level": "info",
      "message": "Step 1562/1578: training loss=0.57",
      "data": {
        "step": 1562,
        "train_loss": 0.5743265748023987,
        "total_steps": 1578,
        "train_mean_token_accuracy": 0.8223140239715576
      },
      "type": "metrics"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-GCQL7EdcMrVZKlU5LY0KAySF",
      "

Run this code to check the status of the fine-tuning job. Once the status is `succeeded`, training is done.

In [36]:
openai.FineTuningJob.retrieve(id=job_id).status

'succeeded'
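Once the job has succeeded, the metric events listed earlier can be condensed into a short loss history. The sketch below works on two events copied from that output, reduced to the fields it needs. The helper `loss_by_step` is hypothetical.

```python
# Two events copied from the event list, reduced to the fields used here
events = [
    {"type": "metrics",
     "data": {"step": 1563, "train_loss": 0.5481270551681519, "total_steps": 1578}},
    {"type": "metrics",
     "data": {"step": 1562, "train_loss": 0.5743265748023987, "total_steps": 1578}},
]

def loss_by_step(events):
    """Return (step, train_loss) pairs sorted by step, skipping non-metric events."""
    pairs = [
        (event["data"]["step"], event["data"]["train_loss"])
        for event in events
        if event.get("type") == "metrics"
    ]
    return sorted(pairs)

history = loss_by_step(events)
for step, loss in history:
    print(f"step {step}: training loss = {loss:.2f}")
```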

Let's get the name of the new model:

In [45]:
ft_model = openai.FineTuningJob.retrieve(id=job_id).fine_tuned_model
ft_model

'ft:davinci-002:personal::9KLi6nKN'
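With the model name in hand, the fine-tuned model can be queried. The prompt should end with the same ` ->` separator used in the training data, and the newline that closes every training completion can serve as a stop sequence. The sketch below only assembles the request. The helper name, the question, `max_tokens`, and `temperature` are assumptions, not values from our runs.

```python
def build_completion_request(question, model, max_tokens=150):
    """Assemble keyword arguments for a completion call on the fine-tuned model."""
    return {
        "model": model,
        "prompt": question + " ->",  # same separator as in the training prompts
        "max_tokens": max_tokens,    # assumed limit, not taken from the notebook
        "temperature": 0,            # assumed setting to reduce randomness
        "stop": ["\n"],              # training completions end with a newline
    }

request = build_completion_request(
    "What does PTH stand for?",            # hypothetical question
    "ft:davinci-002:personal::9KLi6nKN",   # model name returned above
)
print(request["prompt"])

# With openai==0.28 the request could then be sent as:
# response = openai.Completion.create(**request)
# print(response["choices"][0]["text"])
```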

# Result

![Results-FT-davinci-02](images/ft:davinci-02.png)

---

# Base model - DaVinci 02

## DaVinci 02 Shutdown and Fine-Tuning with Dummy Dataset

The DaVinci 02 model was recently deprecated and shut down, so we could no longer query it directly. It remained available only as a base model for fine-tuning. We therefore fine-tuned it on a so-called "dummy" dataset to obtain a stand-in for the base model and compare its performance with the fully fine-tuned model.

The dummy dataset consisted of only 12 questions and their corresponding answers, extracted from the medical flashcard dataset.

The purpose of this process was to observe how the model's performance changes with fine-tuning. The dummy dataset was kept deliberately small so that the resulting model stays as close as possible to the model before fine-tuning. We can then analyze the impact of fine-tuning by comparing the fully fine-tuned model with this dummy model, which we treat as the base model.

![Results-FT-dummy-davinci-02](images/ft:davinci-02-no-data.png)

---

# GPT-3.5-TURBO

# 1. Get OpenAI API Key

Before fine-tuning our model, let's get the OpenAI credentials needed for the API calls.

Go to the [OpenAI website](https://platform.openai.com/api-keys) and create a new secret key.

# 2. Create training data

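The next step is to create training data to teach the model what you'd like it to say. For gpt-3.5-turbo, the data must be a JSONL document in the conversational chat format, where each line holds one conversation that ends with the ideal answer:

```
{"messages": [{"role": "system", "content": "<instruction>"}, {"role": "user", "content": "<question>"}, {"role": "assistant", "content": "<ideal answer>"}]}
{"messages": [{"role": "system", "content": "<instruction>"}, {"role": "user", "content": "<question>"}, {"role": "assistant", "content": "<ideal answer>"}]}
{"messages": [{"role": "system", "content": "<instruction>"}, {"role": "user", "content": "<question>"}, {"role": "assistant", "content": "<ideal answer>"}]}
```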


**Optional for Colab users**

Before starting, we can mount Google Drive so that our documents are stored there.
Run the following cells:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Make sure that the variable `path` contains the correct sequence of folders, separated by `'/'`, that leads to your files:

In [None]:
import os

path = '_NLP/Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

'/content/drive/MyDrive/_NLP/Project'

Let's start by importing the libraries needed:

In [None]:
!pip uninstall -y openai
!pip install openai

Collecting openai
  Downloading openai-1.30.3-py3-none-any.whl (320 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, openai
Successfully installed h11-0.14.0 httpcore-1.0.

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2

In [None]:
import json
import openai
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split

Then add your API key from the previous step:

In [None]:
api_key ="sk-proj-###" # ADD YOUR API KEY HERE
openai.api_key = api_key

Now load the dataset that the training data will be built from:

In [None]:
dataset = load_dataset('arrow', data_files='data-00000-of-00001.arrow')
df = dataset['train'].to_pandas()
df.head()


| | input | output | instruction |
|---|---|---|---|
| 0 | What is the relationship between very low Mg2+... | Very low Mg2+ levels correspond to low PTH lev... | Answer this question truthfully |
| 1 | What leads to genitourinary syndrome of menopa... | Low estradiol production leads to genitourinar... | Answer this question truthfully |
| 2 | What does low REM sleep latency and experienci... | Low REM sleep latency and experiencing halluci... | Answer this question truthfully |
| 3 | What are some possible causes of low PTH and h... | PTH-independent hypercalcemia, which can be ca... | Answer this question truthfully |
| 4 | How does the level of anti-müllerian hormone r... | The level of anti-müllerian hormone is directl... | Answer this question truthfully |


In [None]:
# Drop the last column (`instruction`); it is replaced by the system prompt below
df = df.iloc[:, :-1]

In [None]:
len(df)

33955

In [None]:
df.head()

| | input | output |
|---|---|---|
| 0 | What is the relationship between very low Mg2+... | Very low Mg2+ levels correspond to low PTH lev... |
| 1 | What leads to genitourinary syndrome of menopa... | Low estradiol production leads to genitourinar... |
| 2 | What does low REM sleep latency and experienci... | Low REM sleep latency and experiencing halluci... |
| 3 | What are some possible causes of low PTH and h... | PTH-independent hypercalcemia, which can be ca... |
| 4 | How does the level of anti-müllerian hormone r... | The level of anti-müllerian hormone is directl... |


In [None]:
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

In [None]:
train_data.head()

| | input | output |
|---|---|---|
| 17426 | What is the reason for the rapid-onset of acti... | What is the reason for the rapid-onset of acti... |
| 21416 | What types of cancer are associated with a dec... | Oral contraceptives are associated with a decr... |
| 27343 | What is the product of the conversion of galac... | Galactose is converted to galactose-1-phosphat... |
| 985 | What is the effect of prostaglandin agonists o... | Prostaglandin agonists increase the uveosclera... |
| 13243 | To which drug class do ipratropium and tiotrop... | Ipratropium and tiotropium belong to the drug ... |


In [None]:
# Initialize an empty list to store the training data
training_data = []

DEFAULT_SYSTEM_PROMPT = 'Answer this question truthfully.'

def create_dataset(question, answer):
    return {
        "messages": [
            {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

# Iterate through the rows of the DataFrame
for index, row in train_data.iterrows():
    # Create a dictionary for each row
    data_entry = create_dataset(row["input"], row["output"])

    # Add the dictionary to the list
    training_data.append(data_entry)

In [None]:
file_name = "turbo_training_data.jsonl"

# Write one JSON object per line (JSON Lines)
with open(file_name, "w") as output_file:
    for entry in training_data:
        json.dump(entry, output_file)
        output_file.write("\n")
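Since the chat format is more nested than prompt-completion pairs, a small structural check before uploading can catch mistakes early. The sketch below applies a few assumed rules (known roles, text content, assistant message last). It is not the official validation, and the helper name and sample records are made up.

```python
def validate_chat_record(record):
    """Return a list of problems found in one chat-format training record."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for position, message in enumerate(messages):
        if message.get("role") not in {"system", "user", "assistant"}:
            problems.append(f"message {position} has an unexpected role")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {position} has no text content")
    if messages[-1].get("role") != "assistant":
        problems.append("the last message should come from the assistant")
    return problems

good = {"messages": [
    {"role": "system", "content": "Answer this question truthfully."},
    {"role": "user", "content": "Question?"},
    {"role": "assistant", "content": "Answer."},
]}
bad = {"messages": [{"role": "user", "content": "Question?"}]}

print(validate_chat_record(good))
print(validate_chat_record(bad))
```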

This file was then used to fine-tune the model through the OpenAI web UI, like this:

![OPENAI-UI](images/ft-ui.png)

And the result was:

![Results-FT-gpt-3.5-turbo](images/ft:gpt3.5-turbo-0125.png)
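The fine-tuning job above was started from the web UI. As a rough programmatic equivalent, the sketch below uploads the file and creates the job through a client object, assuming the client interface of `openai` 1.x. The helper `start_fine_tune` is hypothetical, the model name `gpt-3.5-turbo-0125` is inferred from the result image's file name, and a stand-in client is used so that the example runs without network access.

```python
import os
import tempfile
from types import SimpleNamespace

def start_fine_tune(client, path, model):
    """Upload a JSONL file and create a fine-tuning job with the given client."""
    with open(path, "rb") as training_file:
        uploaded = client.files.create(file=training_file, purpose="fine-tune")
    return client.fine_tuning.jobs.create(training_file=uploaded.id, model=model)

# Stand-in client that records calls instead of contacting the API
calls = []

def fake_upload(file, purpose):
    return SimpleNamespace(id="file-demo")

def fake_create_job(training_file, model):
    calls.append((training_file, model))
    return SimpleNamespace(id="ftjob-demo", model=model)

fake_client = SimpleNamespace(
    files=SimpleNamespace(create=fake_upload),
    fine_tuning=SimpleNamespace(jobs=SimpleNamespace(create=fake_create_job)),
)

with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('{"messages": []}\n')
    job = start_fine_tune(fake_client, demo_path, "gpt-3.5-turbo-0125")

print(job.id)

# With the real library this could be called as (not executed here):
# from openai import OpenAI
# job = start_fine_tune(OpenAI(), "turbo_training_data.jsonl", "gpt-3.5-turbo-0125")
```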

---