# Fine-Tuning GPT-3

Copyright 2023 Denis Rothman

**September 10, 2023 update**
As of January 4, 2024, [OpenAI deprecations](https://platform.openai.com/docs/deprecations) apply to fine-tuning ```ada```. The recommended replacement is ```babbage-002```, which has been implemented in this notebook along with the required code adaptations for data preparation and fine-tuning.

[OpenAI fine-tuning documentation](https://beta.openai.com/docs/guides/fine-tuning/)

Check the cost of fine-tuning your dataset on OpenAI before running the notebook.
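
A rough way to get a feel for the cost before uploading anything is sketched below. It assumes the common rule of thumb of about 4 characters per token for English text and the billing formula of tokens in the file times epochs times a price per 1,000 tokens. The function name, the character count, and the price used here are hypothetical placeholders, not OpenAI values; take the real price from the OpenAI pricing page.

```python
def estimate_training_cost(n_chars, n_epochs, price_per_1k_tokens, chars_per_token=4.0):
    """Return (estimated_tokens, estimated_cost) for one fine-tuning run.

    chars_per_token is only a rule of thumb for English text (assumption).
    price_per_1k_tokens must come from the current OpenAI pricing page.
    """
    estimated_tokens = n_chars / chars_per_token
    billed_tokens = estimated_tokens * n_epochs      # each epoch re-reads the file
    cost = billed_tokens / 1000 * price_per_1k_tokens
    return round(estimated_tokens), cost


# Placeholder numbers for illustration only (NOT real prices or file sizes)
tokens, cost = estimate_training_cost(
    n_chars=2_000_000, n_epochs=3, price_per_1k_tokens=0.001
)
print(f"~{tokens} tokens, estimated cost ~{cost:.2f}")
```

The estimate is only as good as the character-to-token ratio; a tokenizer would give an exact count.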

Run this notebook cell by cell to:

1. prepare data
2. fine-tune a model
3. run a fine-tuned model
4. manage the fine-tunes

## Installing OpenAI & Wandb

Restart the runtime after installing openai and run the cell again to make sure that "import openai" is executed.

In [None]:
try:
  import openai
except ImportError:  # install only when the package is missing
  !pip install openai
  import openai

Collecting openai
  Downloading openai-0.27.9-py3-none-any.whl (75 kB)
Installing collected packages: openai
Successfully installed openai-0.27.9


## Your API Key

In [None]:
#You can retrieve your API key from a file (1)
# or enter it manually (2)

#Comment out this cell if you want to enter your key manually.
#(1) Retrieve the API key from a file
#Store your key in a file and read it (you can type it directly in the notebook, but it will be visible to anyone next to you)
from google.colab import drive
drive.mount('/content/drive')
with open("drive/MyDrive/files/api_key.txt", "r") as f:
    API_KEY = f.readline().strip()  # strip the trailing newline, if any

Mounted at /content/drive


In [None]:
#(2) Enter your key manually by
# replacing API_KEY with your key.
#The OpenAI Key
import os
os.environ['OPENAI_API_KEY'] =API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")
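
The two options above can be folded into one helper. The following is a minimal sketch, not part of the original notebook: `resolve_api_key` and its parameters are hypothetical names, and the lookup order (environment variable, then file, then an interactive prompt) is an assumption about what is convenient.

```python
import os
from getpass import getpass


def resolve_api_key(env_var="OPENAI_API_KEY", key_file=None):
    """Return an API key from the environment, a file, or a hidden prompt."""
    # 1) Environment variable, if already set
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # 2) First line of a key file, if one is given and exists
    if key_file and os.path.exists(key_file):
        with open(key_file, "r") as f:
            key = f.readline().strip()
        if key:
            return key
    # 3) Hidden prompt, so the key is not displayed in the notebook
    return getpass("Enter your OpenAI API key: ").strip()


# Illustration with a dummy value (not a real key)
os.environ["DEMO_OPENAI_KEY"] = " sk-demo \n"
print(resolve_api_key("DEMO_OPENAI_KEY"))
```

In the notebook, the result would be assigned to `openai.api_key`.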

In [None]:
try:
  import wandb
except ImportError:  # install only when the package is missing
  !pip install wandb
  import wandb

Collecting wandb
  Downloading wandb-0.15.9-py3-none-any.whl (2.1 MB)
Collecting GitPython!=3.1.29,>=1.0.0 (from wandb)
  Downloading GitPython-3.1.32-py3-none-any.whl (188 kB)
Collecting sentry-sdk>=1.0.0 (from wandb)
  Downloading sentry_sdk-1.30.0-py2.py3-none-any.whl (218 kB)
Collecting docker-pycreds>=0.4.0 (from wandb)
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting pathtools (from wandb)
  Downloading pathtools-0.1.2.tar.gz (11 kB)
  Preparing metadata (setup.py) ... done
Collecting setproctitle (from wandb)
  Downloading setproctitle-1.3.2-cp310-cp310-manylinux_2_5_x86_64.manyli

In [None]:
!openai wandb sync

[34m[1mwandb[0m: (1) Create a W&B account
[34m[1mwandb[0m: (2) Use an existing W&B account
[34m[1mwandb[0m: (3) Don't visualize my results
[34m[1mwandb[0m: Enter your choice: 2
[34m[1mwandb[0m: You chose 'Use an existing W&B account'
[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
No new successful fine-tunes were found
🎉 wandb sync completed successfully


# 1. Preparing the dataset

## 1.1. Preparing the data in JSON

In [None]:
#From Gutenberg to JSON
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import requests
from bs4 import BeautifulSoup
import json
import re

# First, fetch the text of the book
# Option 1: from Project Gutenberg
#url = 'http://www.gutenberg.org/cache/epub/4280/pg4280.html'
#response = requests.get(url)
#soup = BeautifulSoup(response.content, 'html.parser')

# Option 2: from the GitHub repository:
#Development access to delete when going into production
!curl -L https://raw.githubusercontent.com/fenago/nlp-transformers/master/Lab07/gutenberg.org_cache_epub_4280_pg4280.html --output "gutenberg.org_cache_epub_4280_pg4280.html"

# Open and read the downloaded HTML file
with open("gutenberg.org_cache_epub_4280_pg4280.html", 'r', encoding='utf-8') as file:
    file_contents = file.read()

# Parse the file contents using BeautifulSoup
soup = BeautifulSoup(file_contents, 'html.parser')

# Get the text of the book and clean it up a bit
text = soup.get_text()
text = re.sub(r'\s+', ' ', text).strip()  # raw string avoids an invalid-escape warning

# Split the text into sentences
sentences = sent_tokenize(text)

# Define the separator and ending
prompt_separator = " ->"
completion_ending = "\n"

# Now create the prompts and completions
data = []
for i in range(len(sentences) - 1):
    data.append({
        "prompt": sentences[i] + prompt_separator,
        "completion": " " + sentences[i + 1] + completion_ending
    })

# Write the prompts and completions to a file
with open('kant_prompts_and_completions.json', 'w') as f:
    for line in data:
        f.write(json.dumps(line) + '\n')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 1295k  100 1295k    0     0  5208k      0 --:--:-- --:--:-- --:--:-- 5222k
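
Before handing the file to the OpenAI data preparation tool, the format produced by the cell above can be checked locally. This is a minimal sketch under the assumption that every prompt should end with the separator ` ->` and every completion should start with a space and end with `\n`, as in the cell above; `check_pairs` is a hypothetical helper, not an OpenAI function.

```python
import json


def check_pairs(jsonl_lines, prompt_separator=" ->", completion_ending="\n"):
    """Return a list of (line_number, problem) tuples; empty means no issue found."""
    problems = []
    for number, line in enumerate(jsonl_lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        prompt = record.get("prompt", "")
        completion = record.get("completion", "")
        if not prompt.endswith(prompt_separator):
            problems.append((number, "prompt does not end with the separator"))
        if not completion.startswith(" "):
            problems.append((number, "completion does not start with a space"))
        if not completion.endswith(completion_ending):
            problems.append((number, "completion does not end with the ending"))
    return problems


# Two hand-made records for illustration: one well-formed, one not
good = json.dumps({"prompt": "First sentence. ->", "completion": " Second sentence.\n"})
bad = json.dumps({"prompt": "First sentence.", "completion": "Second sentence."})
print(check_pairs([good, bad]))
```

On the notebook's own file, the lines could be read with `open('kant_prompts_and_completions.json').readlines()` and passed to the helper.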


In [None]:
import pandas as pd

# Load the data
df = pd.read_json('kant_prompts_and_completions.json', lines=True)
df

Unnamed: 0,prompt,completion
0,The Project Gutenberg Etext of The Critique of...,Be sure to check the copyright laws for your ...
1,Be sure to check the copyright laws for your c...,"We encourage you to keep this file, exactly a..."
2,"We encourage you to keep this file, exactly as...",Please do not remove this.\n
3,Please do not remove this. ->,This header should be the first thing seen wh...
4,This header should be the first thing seen whe...,Do not change or edit it without written perm...
...,...,...
6122,"78-79. is their motto, under which they may le...",As regards those who wish to pursue a scienti...
6123,As regards those who wish to pursue a scientif...,"When I mention, in relation to the former, th..."
6124,"When I mention, in relation to the former, the...",The critical path alone is still open.\n
6125,The critical path alone is still open. ->,If my reader has been kind and patient enough...


##  1.2. Converting the data to JSONL

Answer the questions as necessary for your project.

In [None]:
!openai tools fine_tunes.prepare_data -f "kant_prompts_and_completions.json"

Analyzing...

- Your JSON file appears to be in a JSONL format. Your file will be converted to JSONL format
- Your file contains 6127 prompt-completion pairs
- All prompts end with suffix ` ->`
- All completions end with suffix `\n`

Based on the analysis we will perform the following actions:
- [Necessary] Your format `JSON` will be converted to `JSONL`


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `kant_prompts_and_completions_prepared.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "kant_prompts_and_completions_prepared.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["\n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 1.44 hours to train a `cu

In [None]:
import json

# Open the file and read the lines
with open('kant_prompts_and_completions_prepared.jsonl', 'r') as f:
    lines = f.readlines()

# Parse and print lines 200 to 300 of the file
for line in lines[199:300]:
    data = json.loads(line)
    print(json.dumps(data, indent=4))

{
    "prompt": "For he found that it was not sufficient to meditate on the figure, as it lay before his eyes, or the conception of it, as it existed in his mind, and thus endeavour to get at the knowledge of its properties, but that it was necessary to produce these properties, as it were, by a positive a priori construction; and that, in order to arrive with certainty at a priori cognition, he must not attribute to the object any other properties than those which necessarily followed from that which he had himself, in accordance with his conception, placed in the object. ->",
    "completion": " A much longer period elapsed before physics entered on the highway of science.\n"
}
{
    "prompt": "A much longer period elapsed before physics entered on the highway of science. ->",
    "completion": " For it is only about a century and a half since the wise Bacon gave a new direction to physical studies, or rather\u2014as others were already on the right track\u2014imparted fresh vigour t

Creating the file on OpenAI

In [None]:
openai.File.create(
  file=open("/content/kant_prompts_and_completions_prepared.jsonl", "rb"),
  purpose='fine-tune'
)

<File file id=file-61tcuEK2PWlzLOJkozBtZqQj at 0x7d339e610ae0> JSON: {
  "object": "file",
  "id": "file-61tcuEK2PWlzLOJkozBtZqQj",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 2761402,
  "created_at": 1693325418,
  "status": "uploaded",
  "status_details": null
}

# 2. Fine-tuning a model

In [None]:
import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.FineTuningJob.create(training_file="file-61tcuEK2PWlzLOJkozBtZqQj", model="babbage-002")

In [None]:
# List 10 fine-tuning jobs
openai.FineTuningJob.list(limit=10)

<OpenAIObject list at 0x7d339c5afc40> JSON: {
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job",
      "id": "ftjob-whkMkwaf2WDjJIsezLMC23T3",
      "model": "babbage-002",
      "created_at": 1693325760,
      "finished_at": null,
      "fine_tuned_model": null,
      "organization_id": "org-h2Kjmcir4wyGtqq1mJALLGIb",
      "result_files": [],
      "status": "running",
      "validation_file": null,
      "training_file": "file-ER2cK59joySxwqaOx7MrbGOD",
      "hyperparameters": {
        "n_epochs": 3
      },
      "trained_tokens": null
    },
    {
      "object": "fine_tuning.job",
      "id": "ftjob-XcWeeIdVarCz61BypAWCvBGG",
      "model": "babbage-002",
      "created_at": 1693236026,
      "finished_at": 1693236794,
      "fine_tuned_model": "ft:babbage-002:personal::7sYWloYn",
      "organization_id": "org-h2Kjmcir4wyGtqq1mJALLGIb",
      "result_files": [
        "file-X1U0CZUBSoCUFkQqzTBo2S0P"
      ],
      "status": "succeeded",
      "validation_f

In [None]:
# Retrieve the state of a fine-tune
openai.FineTuningJob.retrieve("ftjob-40XaISEdOEoitwxmsGQldiWJ")

<FineTuningJob fine_tuning.job id=ftjob-40XaISEdOEoitwxmsGQldiWJ at 0x7d33947cb920> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-40XaISEdOEoitwxmsGQldiWJ",
  "model": "babbage-002",
  "created_at": 1693226909,
  "finished_at": 1693227408,
  "fine_tuned_model": "ft:babbage-002:personal::7sW5N8i3",
  "organization_id": "org-h2Kjmcir4wyGtqq1mJALLGIb",
  "result_files": [
    "file-1ihDmqKCzYF726Oc0TuQfWlB"
  ],
  "status": "succeeded",
  "validation_file": null,
  "training_file": "file-ER2cK59joySxwqaOx7MrbGOD",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": 1564743
}
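
Instead of re-running the retrieve cell by hand, the status can be polled until the job reaches a final state. The sketch below is an illustration, not the notebook's method: `wait_for_job` is a hypothetical helper, the polling interval is arbitrary, and the set of final statuses (`succeeded`, `failed`, `cancelled`) is an assumption to check against the OpenAI documentation. The retrieve function is passed in as a parameter so that the example runs without network access.

```python
import time

# Statuses treated as final (assumption; verify against the OpenAI docs)
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call retrieve(job_id) repeatedly until the job status is final."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


# Offline illustration with a fake retrieve function and a made-up job ID
_fake_states = iter(["running", "running", "succeeded"])


def _fake_retrieve(job_id):
    return {"id": job_id, "status": next(_fake_states)}


job = wait_for_job(_fake_retrieve, "ftjob-demo", sleep=lambda seconds: None)
print(job["status"])
```

In the notebook, `openai.FineTuningJob.retrieve` would be passed as `retrieve`, together with the job ID returned when the job was created.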

# 3. Running the fine-tuned GPT-3 model

We will now run the model for a completion task.

Note: If your fine-tuned model does not appear immediately after the fine-tuning process ends, you might have to wait until OpenAI has finished processing it. You can also:

1. go to the OpenAI Playground to test your model: https://platform.openai.com/playground

2. select your model in the dropdown list and test it in that environment

In [None]:
#Load the name of a saved fine-tuned model (not the fine-tuning job ID) from a file, or assign it to FINE_TUNE directly
with open("drive/MyDrive/files/fine_tune.txt", "r") as f:
    FINE_TUNE = f.readline().strip()
FINE_TUNE

'ft:babbage-002:personal::7sW5N8i3'

In [None]:
prompt = "Freedom can be a concept or a virtue. ->"
response=openai.Completion.create(
  model=FINE_TUNE, #Your model in FINE_TUNE,
  prompt=prompt,
  temperature=1,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0,
  stop="\n",
  max_tokens=200
)

In [None]:
response

<OpenAIObject text_completion id=cmpl-7svtb7k0fQd9N3da1AOdYn5o3ZYOw at 0x7d339d045d50> JSON: {
  "id": "cmpl-7svtb7k0fQd9N3da1AOdYn5o3ZYOw",
  "object": "text_completion",
  "created": 1693326619,
  "model": "ft:babbage-002:personal::7sW5N8i3",
  "choices": [
    {
      "text": " Human nature is autarchy; therefore, our conduct can be very good, while human nature is very bad, and yet, unless we abandon the idea of liberty, our conduct will be absolutely good.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 40,
    "total_tokens": 50
  }
}

In [None]:
import textwrap
generated_text = response['choices'][0]['text']

# Remove leading and trailing whitespace
generated_text = generated_text.strip()

# Convert to a pretty paragraph by replacing newline characters with spaces
single_line_response = generated_text.replace('\n', ' ')

# Use textwrap.fill to nicely format the paragraph to wrap at 80 characters (or whatever width you prefer)
wrapped_response = textwrap.fill(single_line_response, width=80)
print(wrapped_response)

Human nature is autarchy; therefore, our conduct can be very good, while human
nature is very bad, and yet, unless we abandon the idea of liberty, our conduct
will be absolutely good.
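
The prompt used above ends with the separator ` ->` and the request stops at `\n`, matching the format of the training data. A small pair of helpers can keep that convention in one place. This is a minimal sketch; `build_prompt` and `clean_completion` are hypothetical names, not part of the OpenAI library.

```python
def build_prompt(text, separator=" ->"):
    """Return text ending with exactly one separator, as in the training data."""
    text = text.strip()
    marker = separator.strip()
    if text.endswith(marker):
        # Drop an existing marker so the separator is not doubled
        text = text[: -len(marker)].rstrip()
    return text + separator


def clean_completion(text, stop="\n"):
    """Keep only the part of the completion before the stop sequence."""
    return text.split(stop, 1)[0].strip()


prompt = build_prompt("Freedom can be a concept or a virtue.")
print(repr(prompt))
print(clean_completion(" A first sentence.\nA second sentence."))
```

The result of `build_prompt` would be passed as `prompt` in the completion request, with `stop="\n"` unchanged.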


# 4. Managing the fine_tunes

In [None]:
# List all created fine-tunes
!openai api fine_tunes.list > fine_tunes.json
!openai api fine_tunes.list

{
  "object": "list",
  "data": [
    {
      "object": "fine-tune",
      "id": "ft-qtHQMnZUBv0baFBR1flx5hsZ",
      "hyperparams": {
        "n_epochs": 4,
        "batch_size": 4,
        "use_packing": null,
        "weight_decay": 0.0,
        "prompt_loss_weight": 0.1,
        "learning_rate_multiplier": 0.1
      },
      "organization_id": "org-h2Kjmcir4wyGtqq1mJALLGIb",
      "model": "curie",
      "training_files": [
        {
          "object": "file",
          "id": "file-vTxiSW78AF8InU3a1FfmyIuX",
          "purpose": "fine-tune",
          "filename": "kantgpt_prepared.jsonl",
          "bytes": 1004201,
          "created_at": 1631135919,
          "status": "processed",
          "status_details": null
        }
      ],
      "validation_files": [],
      "result_files": [
        {
          "object": "file",
          "id": "file-UaYjZY2bHGqWujtLt1Bep6qn",
          "purpose": "fine-tune-results",
          "filename": "compiled_results.csv",
          "bytes": 48

**ChatGPT Plus (GPT-4) provided the following breakdown of the components of the JSON object:**

- `"object"`: This field specifies the type of object the JSON represents. Here it's a fine-tune.

- `"id"`: This is the unique identifier for this fine-tuning job. This ID is typically used to reference this specific instance of fine-tuning.

- `"hyperparams"`: These are the hyperparameters used for fine-tuning the model.
   - `"n_epochs"`: Number of epochs for the training, i.e., how many times the learning algorithm will work through the entire training dataset.
   - `"batch_size"`: The number of training examples used in one iteration (or update) of model parameters.
   - `"prompt_loss_weight"`: This is the weight assigned to the loss function of the prompts during training. A higher value places more emphasis on minimizing the loss of the prompts.
   - `"learning_rate_multiplier"`: This value is used to scale the learning rate during training. A lower value makes the model learn more slowly, and a higher value makes it learn faster.

- `"organization_id"`: This is the identifier for the organization account under which the fine-tuning operation was performed.

- `"model"`: The base model used for fine-tuning. In the output above, it's `curie`, which is a version of GPT-3.

- `"training_files"`: This array contains information about the files used for training.
  - `"object"`: Specifies the object type, in this case, a file.
  - `"id"`: The unique identifier for this file.
  - `"purpose"`: The purpose of the file, here it's for fine-tuning.
  - `"filename"`: The name of the file.
  - `"bytes"`: The size of the file in bytes.
  - `"created_at"`: The UNIX timestamp for when the file was created.
  - `"status"`: The status of the file processing. Here it's processed.
  - `"status_details"`: Any extra details about the file's status. It's null here, meaning there are no extra details.

- `"validation_files"`: This would include similar details as `"training_files"`, but for any files used for validation during training. It's empty in the output above.

- `"result_files"`: This is an array of files that store the result of the fine-tuning operation. The details of each file are similar to those in `"training_files"`.

- `"created_at"`: The UNIX timestamp indicating when this fine-tuning job was created.

- `"updated_at"`: The UNIX timestamp indicating the last time this fine-tuning job was updated.

- `"status"`: The status of the fine-tuning job. In this case, it has succeeded.

- `"fine_tuned_model"`: This is the unique identifier/name for the fine-tuned model.
  
Remember that a UNIX timestamp is the number of seconds that have elapsed since 00:00:00 UTC on Thursday, 1 January 1970, not counting leap seconds. Libraries such as Python's datetime module can convert these to more human-readable formats.
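
As a quick illustration of that conversion, the sketch below uses only the standard library. The timestamp is the `created_at` value of the file uploaded earlier in this notebook; `to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Convert a UNIX timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


created = to_utc(1693325418)  # "created_at" of the uploaded training file
print(created.isoformat())
```

Passing `tz=timezone.utc` makes the result independent of the time zone of the machine running the notebook.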

In [None]:
import pandas as pd
import json
from datetime import datetime

# Load data from json file
with open('fine_tunes.json') as f:
    data = json.load(f)

# Convert to Pandas DataFrame:
df = pd.json_normalize(data['data'])

# Select specific columns
selected_columns = ['object', 'id', 'fine_tuned_model','status', 'created_at', 'updated_at']
df = df[selected_columns]

# Rename columns for display
column_mapping = {
    'object': 'Object',
    'id': 'ID',
    'fine_tuned_model': 'Fine_Tuned_Model',
    'filename':'Filename',
    'status': 'Status',
    'created_at': 'Created_At',
    'updated_at': 'Updated_At',
}
df.rename(columns=column_mapping, inplace=True)

# Convert UNIX timestamp to standard format
df['Created_At'] = pd.to_datetime(df['Created_At'], unit='s')
df['Updated_At'] = pd.to_datetime(df['Updated_At'], unit='s')

df
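
To pick out only the models that are ready to use, the same listing can be filtered on its `status` and `fine_tuned_model` fields. The following is a minimal sketch working on a plain dictionary shaped like the JSON shown above; `succeeded_models` is a hypothetical helper and the IDs and model names in the sample are made up for illustration.

```python
def succeeded_models(listing):
    """Return the names of fine-tuned models whose job status is 'succeeded'."""
    return [
        job["fine_tuned_model"]
        for job in listing.get("data", [])
        if job.get("status") == "succeeded" and job.get("fine_tuned_model")
    ]


# Made-up listing with the same field names as the JSON above
sample_listing = {
    "object": "list",
    "data": [
        {"id": "ft-demo-1", "status": "succeeded", "fine_tuned_model": "demo-model-a"},
        {"id": "ft-demo-2", "status": "running", "fine_tuned_model": None},
        {"id": "ft-demo-3", "status": "succeeded", "fine_tuned_model": "demo-model-b"},
    ],
}
print(succeeded_models(sample_listing))
```

With the notebook's data, the dictionary loaded from `fine_tunes.json` would be passed in place of `sample_listing`.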