# Entity Recognition with a Fine-tuned GPT-3.5-turbo
Author : [Advitya Mittal]

An experimental pipeline that trains a high-performing, task-specific model by fine-tuning a pre-trained model, with the training data itself generated from a single prompt as the only input.


The pipeline generates high-quality, diverse training examples from a given prompt, fine-tunes an OpenAI model on these examples, and validates the fine-tuned model's capabilities. It demonstrates end-to-end use of the OpenAI API, retry-based error handling, and data manipulation with pandas.


Adapted from [Matt Shumer](https://github.com/mshumer)'s work.

# Data generation step

Write your prompt here. Make it as descriptive as possible!

Then, choose the temperature (between 0 and 1) to use when generating data. Lower values are great for precise tasks, like writing code, whereas larger values are better for creative tasks, like writing stories.
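To make the effect of temperature concrete, here is a minimal sketch of how dividing scores by a temperature reshapes a softmax distribution. The logits are made-up values and `softmax_with_temperature` is a hypothetical helper for illustration only; it is not part of the OpenAI API, and the sampling internals of the hosted models are not visible from this notebook.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax probabilities of logits scaled by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)   # sharper distribution
high = softmax_with_temperature(logits, 1.0)  # flatter distribution
print("T=0.2:", [round(p, 3) for p in low])
print("T=1.0:", [round(p, 3) for p in high])
```

At the lower temperature almost all probability mass sits on the top-scoring token, which is why low values suit precise tasks; at the higher temperature the alternatives keep a meaningful share, which gives more varied output.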

### One prompt -> fine-tuned GPT-3.5

In [None]:
prompt = "Identify the entities in ANY and ALL given inputs, categorize and return the identified entities into 'Person', 'Location', 'Organization' and 'Date' columns in that order. ONLY return this categorization."
temperature = .5
number_of_examples = 15

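#### Dependencies
The project relies on `openai` for accessing the OpenAI API, `tenacity` for retry logic in API calls, and `pandas` for data manipulation and analysis. The `openai` package is pinned to 0.28 because the code below uses the pre-1.0 interface (`openai.ChatCompletion`, `openai.FineTuningJob`).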
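The retry logic can be pictured as exponential backoff: each failed call waits longer before the next attempt, within a lower and an upper bound. The sketch below is a plain-Python illustration of that idea using the bounds that appear later in this notebook (`min=4`, `max=70`). The helper `backoff_delays` is hypothetical, and the exact exponent offset used by `tenacity` may differ between versions, so treat the numbers as indicative rather than as the library's precise schedule.

```python
def backoff_delays(n_attempts, multiplier=1.0, min_wait=4.0, max_wait=70.0):
    """Return the waits (in seconds) between attempts.

    Each wait is multiplier * 2**k, clamped to [min_wait, max_wait].
    This mirrors the general shape of exponential backoff only.
    """
    delays = []
    for k in range(n_attempts - 1):  # no wait after the final attempt
        raw = multiplier * (2 ** k)
        delays.append(max(min_wait, min(raw, max_wait)))
    return delays

# Three attempts means at most two waits
print(backoff_delays(3))
# A longer run shows the growth and the upper clamp
print(backoff_delays(9))
```

With only three attempts the lower bound dominates, so the waits stay short; the exponential growth only becomes visible when more attempts are allowed.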

In [None]:
!pip install tiktoken cohere
!pip install openai==0.28
!pip install tenacity

Collecting tiktoken
  Downloading tiktoken-0.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Collecting cohere
  Downloading cohere-4.46-py3-none-any.whl (52 kB)
Collecting backoff<3.0,>=2.0 (from cohere)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting fastavro<2.0,>=1.8 (from cohere)
  Downloading fastavro-1.9.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Collecting importlib_metadata<7.0,>=6.0 (from cohere)
  Downloading importlib_metadata-6.11.0-py3-none-any.whl (23 kB)
Installing collected packages: importlib_metadata, fastavro, backoff, tiktoken, cohere

In [None]:
import os
import openai
import random
import time
from tenacity import retry, stop_after_attempt, wait_exponential

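# Read the key from the environment if set; "sk" is only a placeholder to replace with your own key
openai.api_key = os.environ.get("OPENAI_API_KEY", "sk")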
N_RETRIES = 3

## Generates data samples based on the provided prompt, with complexity increasing for each example.
@retry(stop=stop_after_attempt(N_RETRIES), wait=wait_exponential(multiplier=1, min=4, max=70))
def generate_example(prompt, prev_examples, temperature=.5):
    messages=[
        {
            "role": "system",
            "content": f"You are generating data which will be used to train a machine learning model.\n\nYou will be given a high-level description of the model we want to train, and from that, you will generate data samples, each with a prompt/response pair.\n\nYou will do so in this format:\n```\nprompt\n-----------\n$prompt_goes_here\n-----------\n\nresponse\n-----------\n$response_goes_here\n-----------\n```\n\nOnly one prompt/response pair should be generated per turn.\n\nFor each turn, make the example slightly more complex than the last, while ensuring diversity.\n\nMake sure your samples are unique and diverse, yet high-quality and complex enough to train a well-performing model.\n\nHere is the type of model we want to train:\n`{prompt}`"
        }
    ]

    if len(prev_examples) > 0:
        if len(prev_examples) > 8:
            prev_examples = random.sample(prev_examples, 8)
        for example in prev_examples:
            messages.append({
                "role": "assistant",
                "content": example
            })

    response = openai.ChatCompletion.create(
        model="gpt-4-0125-preview",
        messages=messages,
        temperature=temperature,
        max_tokens=400,
    )

    return response.choices[0].message['content']

# Generate examples
prev_examples = []
for i in range(number_of_examples):
    print(f'Generating example {i}')
    example = generate_example(prompt, prev_examples, temperature)
    prev_examples.append(example)

print(prev_examples)

Generating example 0
Generating example 1
Generating example 2
Generating example 3
Generating example 4
Generating example 5
Generating example 6
Generating example 7
Generating example 8
Generating example 9
Generating example 10
Generating example 11
Generating example 12
Generating example 13
Generating example 14
['```\nprompt\n-----------\nIdentify the entities in the following sentence: "John and Mary went to Paris in June."\n-----------\n\nresponse\n-----------\nPerson: John, Mary\nLocation: Paris\nOrganization: \nDate: June\n-----------', '```\nprompt\n-----------\nExtract and categorize the entities from this text: "Google announced its new CEO, Sundar Pichai, on August 10th, 2015, in Mountain View."\n-----------\n\nresponse\n-----------\nPerson: Sundar Pichai\nLocation: Mountain View\nOrganization: Google\nDate: August 10th, 2015\n-----------', '```\nprompt\n-----------\nFrom the given information, classify the entities: "During the 2020 Tokyo Olympics, athletes from around 

We also need to generate a system message.

In [None]:
def generate_system_message(prompt):

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
          {
            "role": "system",
            "content": "You will be given a high-level description of the model we are training, and from that, you will generate a simple system prompt for that model to use. Remember, you are not generating the system message for data generation -- you are generating the system message to use for inference. A good format to follow is `Given $INPUT_DATA, you will $WHAT_THE_MODEL_SHOULD_DO.`.\n\nMake it as concise as possible. Include nothing but the system prompt in your response.\n\nFor example, never write: `\"$SYSTEM_PROMPT_HERE\"`.\n\nIt should be like: `$SYSTEM_PROMPT_HERE`."
          },
          {
              "role": "user",
              "content": prompt.strip(),
          }
        ],
        temperature=temperature,
        max_tokens=500,
    )

    return response.choices[0].message['content']

system_message = generate_system_message(prompt)

print(f'The system message is: `{system_message}`. Feel free to re-run this cell if you want a better result.')

The system message is: `Given a text, identify and categorize the entities into 'Person', 'Location', 'Organization' and 'Date' columns.`. Feel free to re-run this cell if you want a better result.


Now let's put our examples into a dataframe and write them out as a training dataset in JSONL format.

In [None]:
import json
import pandas as pd

# Initialize lists to store prompts and responses
prompts = []
responses = []

# Parse out prompts and responses from examples
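for example in prev_examples:
  try:
    split_example = example.split('-----------')
    # Extract both fields before appending so the two lists stay the same length
    prompt_text = split_example[1].strip()
    response_text = split_example[3].strip()
  except IndexError:
    # Skip examples that do not follow the expected delimiter format
    continue
  prompts.append(prompt_text)
  responses.append(response_text)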

# Create a DataFrame
df = pd.DataFrame({
    'prompt': prompts,
    'response': responses
})

# Remove duplicates
df = df.drop_duplicates()

print('There are ' + str(len(df)) + ' successfully-generated examples.')

# Initialize list to store training examples
training_examples = []

# Create training examples in the format required for GPT-3.5 fine-tuning
for index, row in df.iterrows():
    training_example = {
        "messages": [
            {"role": "system", "content": system_message.strip()},
            {"role": "user", "content": row['prompt']},
            {"role": "assistant", "content": row['response']}
        ]
    }
    training_examples.append(training_example)

# Save training examples to a .jsonl file
with open('training_examples.jsonl', 'w') as f:
    for example in training_examples:
        f.write(json.dumps(example) + '\n')

There are 15 successfully-generated examples.
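Before uploading, it can be worth checking that every line of the JSONL file has the shape built above: a `messages` list with `system`, `user`, and `assistant` entries in that order. The following is a minimal sketch of such a check. `validate_jsonl` and `REQUIRED_ROLES` are hypothetical names introduced here, the demo file is throwaway data, and the checks are far lighter than whatever validation OpenAI performs on its side.

```python
import json
import os
import tempfile

REQUIRED_ROLES = ["system", "user", "assistant"]

def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; empty means all checks passed."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list):
                problems.append((line_number, "missing 'messages' list"))
                continue
            roles = [m.get("role") for m in messages if isinstance(m, dict)]
            if roles != REQUIRED_ROLES:
                problems.append((line_number, f"unexpected roles: {roles}"))
            elif any(not isinstance(m.get("content"), str) for m in messages):
                problems.append((line_number, "non-string content"))
    return problems

# Demo on a throwaway file: one well-formed record and one malformed record
good = {"messages": [
    {"role": "system", "content": "s"},
    {"role": "user", "content": "u"},
    {"role": "assistant", "content": "a"},
]}
bad = {"messages": [{"role": "user", "content": "u"}]}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(good) + "\n")
        f.write(json.dumps(bad) + "\n")
    demo_problems = validate_jsonl(demo_path)

print(demo_problems)
```

In the notebook itself the same function could be pointed at `training_examples.jsonl`; an empty list would suggest the file matches the format written by the cell above.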


# Upload the file to OpenAI

In [None]:
# Open the file written by the previous cell and close it once the upload finishes
with open("training_examples.jsonl", "rb") as f:
    file_id = openai.File.create(
      file=f,
      purpose='fine-tune'
    ).id

# Train the model!

You may need to wait a few minutes before running the next cell to allow the file to process on OpenAI's servers.

In [None]:
job = openai.FineTuningJob.create(training_file=file_id, model="gpt-3.5-turbo")

job_id = job.id

# Now, just wait until the fine-tuning run is done, and you'll have a ready-to-use model!

Run the cell below every 20 minutes or so. Eventually, you'll see the message "New fine-tuned model created: ft:gpt-3.5-turbo-0613:xxxxxxxxxxxx".

Once you see that message, you can go to the OpenAI Playground (or keep going to the next cells and use the API) to try the model!
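As an alternative to re-running the cell by hand, the waiting could be automated with a small polling loop. The sketch below is one possible shape, not part of the original pipeline: `wait_for_job`, `get_status`, and `TERMINAL_STATUSES` are hypothetical names, and the status strings are an assumption about what the fine-tuning job object reports. The status source is passed in as a function so the loop can be tried here with fake data and without calling the API.

```python
import time

# Assumed terminal values of the job's status field
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(get_status, poll_seconds=60.0, max_polls=120, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal status.

    Raises TimeoutError if no terminal status is seen within max_polls checks.
    """
    for attempt in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")

# Demo with a fake sequence of statuses and a no-op sleep
fake_statuses = iter(["validating_files", "running", "succeeded"])
final_status = wait_for_job(lambda: next(fake_statuses),
                            poll_seconds=0,
                            sleep=lambda seconds: None)
print(final_status)

# In this notebook the status source might look like this (assumption):
# wait_for_job(lambda: openai.FineTuningJob.retrieve(job_id).status)
```

Injecting `get_status` and `sleep` keeps the loop independent of the client library version and makes it easy to exercise without waiting.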

In [None]:
openai.FineTuningJob.list_events(id=job_id, limit=10)

<OpenAIObject list at 0x7d5b84938e00> JSON: {
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-bNaeBqFkAFyCA2wtc1yROflD",
      "created_at": 1707437290,
      "level": "info",
      "message": "The job has successfully completed",
      "data": {},
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-TYyPWMK7JlRSNbQIEsEe8MxK",
      "created_at": 1707437288,
      "level": "info",
      "message": "New fine-tuned model created: ft:gpt-3.5-turbo-0613:personal::8q8j12n0",
      "data": {},
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-REH0vnDh1wvYQ4cGmBmEcaLX",
      "created_at": 1707437280,
      "level": "info",
      "message": "Step 90/90: training loss=0.00",
      "data": {
        "step": 90,
        "train_loss": 3.655751470432733e-06,
        "total_steps": 90,
        "train_mean_token_accuracy": 1.0
      },
      "type": "metr

# Once your model is trained, run the next cell to grab the fine-tuned model name.

In [None]:
model_name_pre_object = openai.FineTuningJob.retrieve(job_id)
model_name = model_name_pre_object.fine_tuned_model
print(model_name)

ft:gpt-3.5-turbo-0613:personal::8q8j12n0


# Let's try it out!

In [None]:
response = openai.ChatCompletion.create(
    model=model_name,
    messages=[
      {
        "role": "system",
        "content": system_message,
      },
      {
          "role": "user",
          "content": "Jack, Jill and Jordan went up the hill in Jordan.",
      }
    ],
)

response.choices[0].message['content']

'Person: Jack, Jill, Jordan\nLocation: hill, Jordan\nOrganization: \nDate:'

In [None]:
response = openai.ChatCompletion.create(
    model=model_name,
    messages=[
      {
        "role": "system",
        "content": system_message,
      },
      {
          "role": "user",
          "content": "Dr. Emily Stone, a renowned biologist from the University of Oxford, will be presenting her research on climate change in Geneva on September 15th, 2023",
      }
    ],
)

response.choices[0].message['content']

'Person: Dr. Emily Stone\nLocation: University of Oxford, Geneva\nOrganization: \nDate: September 15th, 2023'

In [None]:
#  No entities

response = openai.ChatCompletion.create(
    model=model_name,
    messages=[
      {
        "role": "system",
        "content": system_message,
      },
      {
          "role": "user",
          "content": "Compare the goals in soccer to the objectives of a corporate strategy meeting",
      }
    ],
)

response.choices[0].message['content']

'Person: \nLocation: \nOrganization: \nDate:'

In [None]:
#  Abstract and Philosophical

response = openai.ChatCompletion.create(
    model=model_name,
    messages=[
      {
        "role": "system",
        "content": system_message,
      },
      {
          "role": "user",
          "content": "If a marathon was run in a parallel universe where time flows backward, who would be considered the winner?",
      }
    ],
)

response.choices[0].message['content']

'Person: \nLocation: parallel universe\nOrganization: \nDate:'

In [None]:
#  IRRELEVANT - No entities

response = openai.ChatCompletion.create(
    model=model_name,
    messages=[
      {
        "role": "system",
        "content": system_message,
      },
      {
          "role": "user",
          "content": "Investigate the role of photosynthesis in plants.", # virtual reality space exploration experiences
      }
    ],
)

response.choices[0].message['content']

'Person: \nLocation: \nOrganization: \nDate:'

In [None]:
response = openai.ChatCompletion.create(
    model=model_name,
    messages=[
      {
        "role": "system",
        "content": system_message,
      },
      {
          "role": "user",
          "content": "still working ?",
      }
    ],
)

response.choices[0].message['content']

"Yes, I'm still here. Please provide the text for entity identification and categorization."

In [None]:
response = openai.ChatCompletion.create(
    model=model_name,
    messages=[
      {
        "role": "system",
        "content": system_message,
      },
      {
          "role": "user",
          "content": "I visited the Apple headquarters yesterday and met my friend Anil in their cafeteria",
      }
    ],
)

response.choices[0].message['content']

'Person: Anil\nLocation: Apple headquarters, cafeteria\nOrganization: Apple\nDate: yesterday'
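The responses above follow a regular `Label: value, value` layout, so they can be turned into a dictionary for downstream use. This is a minimal sketch under that assumption; `parse_entities`, `CATEGORIES`, and `NO_SPLIT` are hypothetical names introduced here. The `Date` field is kept whole because dates such as "September 15th, 2023" contain commas, which means several dates in one response would not be separated.

```python
CATEGORIES = ["Person", "Location", "Organization", "Date"]
NO_SPLIT = {"Date"}  # fields whose values may legitimately contain commas

def parse_entities(text):
    """Parse 'Label: a, b' lines into a dict mapping each category to a list."""
    result = {category: [] for category in CATEGORIES}
    for line in text.splitlines():
        label, sep, values = line.partition(":")
        label = label.strip()
        if not sep or label not in result:
            continue  # not a recognised 'Label: ...' line
        values = values.strip()
        if not values:
            continue  # empty category stays as an empty list
        if label in NO_SPLIT:
            result[label] = [values]
        else:
            result[label] = [v.strip() for v in values.split(",") if v.strip()]
    return result

# Two of the responses shown in this notebook
sample = 'Person: Anil\nLocation: Apple headquarters, cafeteria\nOrganization: Apple\nDate: yesterday'
parsed = parse_entities(sample)
print(parsed)

sample_with_date = 'Person: Dr. Emily Stone\nLocation: University of Oxford, Geneva\nOrganization: \nDate: September 15th, 2023'
parsed_with_date = parse_entities(sample_with_date)
print(parsed_with_date)
```

A reply that does not follow the layout, such as the conversational answer to "still working ?", simply yields empty lists for every category.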