# Bootstrapping Labels using GPT-4
This short notebook shows how to use the OpenAI API to bootstrap labels for your data, using sentiment classification as the example. Before you begin, you will need an [OpenAI account](https://platform.openai.com/) to obtain an API key.

In [1]:
# Install OpenAI Python client
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting aiohttp (from openai)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting multidict<7.0,>=4.5 (from aiohttp->openai)
  Downloading multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)
Collecting async-timeout<5.0,>=4.0.0a3 (from aiohttp->openai)
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting yarl<2.0,>=1.0 (from aiohttp->openai)
  Downloadin

In [2]:
from google.colab import drive
drive.mount('/content/drive')



Mounted at /content/drive


In [4]:
import openai
import re

In [5]:
import configparser

config = configparser.ConfigParser()
config.read('/content/drive/MyDrive/openapi.txt')
openai.api_key = config['global']['OPENAI_API_KEY']
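
The cell above implies that the key file is INI-formatted, with a `[global]` section holding an `OPENAI_API_KEY` entry. A minimal sketch of that layout, parsed from an in-memory string rather than from Drive; the helper name `load_api_key` and the key value are placeholders introduced for illustration only.

```python
import configparser

# Hypothetical file contents; the key value is a placeholder, not a real key.
EXAMPLE_CONFIG = """
[global]
OPENAI_API_KEY = sk-your-key-here
"""

def load_api_key(config_text):
    """Parse INI text and return the value stored under [global]."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser["global"]["OPENAI_API_KEY"]

api_key = load_api_key(EXAMPLE_CONFIG)
```

Keeping the key in a file outside the notebook avoids pasting it into a cell that may later be shared.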

The function below builds a prompt, sends it to a chat model, and extracts the sentiment label from the reply. We use `gpt-3.5-turbo` here, but you can swap in `gpt-4` by changing the `model` argument.

In [6]:
def get_sentiment(input_text):
    prompt = f"Respond in the json format: {{'response': sentiment_classification}}\nText: {input_text}\nSentiment (positive, neutral, negative):"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_tokens=40,
        n=1,
        stop=None,
        temperature=0.5,
    )
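    response_text = response.choices[0].message['content'].strip()
    # Search case-insensitively; the model may capitalize the label or wrap it in JSON.
    match = re.search("negative|neutral|positive", response_text.lower())
    if match is None:
        raise ValueError(f"No sentiment label found in model response: {response_text!r}")
    # Add input_text back in for the result
    return {"text": input_text, "response": match.group(0)}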
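
The label extraction in `get_sentiment` depends on the reply containing one of the three label words, and model replies do not always follow the requested format. As a rough sketch of a more defensive parser, the hypothetical helper `parse_sentiment` below (not part of the original notebook) tries strict JSON first and falls back to a word search, returning `None` when nothing usable is found.

```python
import json
import re

LABELS = ("negative", "neutral", "positive")

def parse_sentiment(response_text):
    """Return one of LABELS, or None if no label can be recovered."""
    text = response_text.strip()
    # First try strict JSON, e.g. {"response": "positive"}.
    try:
        value = json.loads(text).get("response", "")
        if isinstance(value, str) and value.lower() in LABELS:
            return value.lower()
    except (json.JSONDecodeError, AttributeError):
        # Not valid JSON, or valid JSON that is not an object.
        pass
    # Fall back to the first label word found anywhere in the reply.
    match = re.search("|".join(LABELS), text.lower())
    return match.group(0) if match else None
```

The fallback is deliberately simple: a reply such as "not positive" would still be read as `positive`, so human review in the next step remains important.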


In [7]:
# Test single example
sample_text = "I had a terrible time at the party last night!"
sentiment = get_sentiment(sample_text)
print(f"Result\n{sentiment}")

Result
{'text': 'I had a terrible time at the party last night!', 'response': 'negative'}
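
Since the notebook mentions that the model can be replaced with GPT-4, one way to make that switch explicit is to build the request arguments in a separate function. This is a sketch only: `build_request` is a hypothetical helper, and it assumes the pre-1.0 `openai` client installed above, where the resulting dictionary would be passed as `openai.ChatCompletion.create(**request)`.

```python
def build_request(input_text, model="gpt-3.5-turbo"):
    """Build keyword arguments for a chat completion request (no API call is made here)."""
    prompt = (
        "Respond in the json format: {'response': sentiment_classification}\n"
        f"Text: {input_text}\n"
        "Sentiment (positive, neutral, negative):"
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 40,
        "temperature": 0.5,
    }

# Same prompt, different model; only the "model" entry changes.
request = build_request("I had a terrible time at the party last night!", model="gpt-4")
```

Separating request construction from the API call also makes the prompt easy to inspect without spending tokens.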


To have humans verify these predicted labels, we use the following function to convert each result into Label Studio's prediction format. Then we can load the data into Label Studio for human evaluation.

In [8]:
def convert_ls_format(input_dict):
    """
    Convert sentiment analysis output from a simple format to Label Studio's prediction format.

    Args:
        input_dict (dict): A dictionary containing text and response keys. Example:
            {
                "text": "I love going to the park on a sunny day.",
                "response": "positive"
            }

    Returns:
        dict: A dictionary in Label Studio's prediction format.
    """

    score_value = 1.00  # Placeholder: we don't know the model's confidence
    output_dict = {
        "data": {
            "text": input_dict["text"]
        },
        "predictions": [
            {
                "result": [
                    {
                        "value": {
                            "choices": [
                                input_dict["response"].capitalize()
                            ]
                        },
                        "from_name": "sentiment",
                        "to_name": "text",
                        "type": "choices"
                    }
                ],
                "score": score_value,
                "model_version": "gpt-3.5-turbo"
            }
        ]
    }
    return output_dict

In [9]:
# Convert to Label Studio format
print(convert_ls_format(sentiment))

{'data': {'text': 'I had a terrible time at the party last night!'}, 'predictions': [{'result': [{'value': {'choices': ['Negative']}, 'from_name': 'sentiment', 'to_name': 'text', 'type': 'choices'}], 'score': 1.0, 'model_version': 'gpt-3.5-turbo'}]}
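
For these predictions to display in Label Studio, the project's labeling configuration needs control and object tags whose names match `from_name` (`sentiment`) and `to_name` (`text`), with choice values matching the capitalized labels. The notebook does not show that configuration, so the XML below is an assumed one based on a standard text-classification setup; `find_mismatches` is a hypothetical helper that checks a converted task against it.

```python
import xml.etree.ElementTree as ET

# Assumed labeling configuration; adjust it to match your own project.
LABEL_CONFIG = """
<View>
  <Text name="text" value="$text"/>
  <Choices name="sentiment" toName="text" choice="single">
    <Choice value="Positive"/>
    <Choice value="Neutral"/>
    <Choice value="Negative"/>
  </Choices>
</View>
"""

def find_mismatches(task, config_xml=LABEL_CONFIG):
    """Return a list of problems between a task's predictions and the config."""
    choices_tag = ET.fromstring(config_xml.strip()).find("Choices")
    allowed = {choice.get("value") for choice in choices_tag.findall("Choice")}
    problems = []
    for prediction in task["predictions"]:
        for item in prediction["result"]:
            if item["from_name"] != choices_tag.get("name"):
                problems.append(f"unknown from_name: {item['from_name']}")
            if item["to_name"] != choices_tag.get("toName"):
                problems.append(f"unknown to_name: {item['to_name']}")
            for choice in item["value"]["choices"]:
                if choice not in allowed:
                    problems.append(f"unknown choice: {choice}")
    return problems

example_task = {
    "data": {"text": "I had a terrible time at the party last night!"},
    "predictions": [{
        "result": [{
            "value": {"choices": ["Negative"]},
            "from_name": "sentiment",
            "to_name": "text",
            "type": "choices",
        }],
        "score": 1.0,
        "model_version": "gpt-3.5-turbo",
    }],
}
problems = find_mismatches(example_task)
```

An empty list means the task's names and choice values line up with the configuration.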


Putting it all together, we'll create a file of example texts (one per line), classify the sentiment of each, and write the results to a JSON file that can be imported into Label Studio.

In [10]:
%%writefile input_texts.txt
I love going to the park on a sunny day.
The customer service was terrible; they were rude and unhelpful.
I am neither happy nor sad about the new policy changes.
The cake was delicious and the presentation was fantastic.
I had a really bad experience with the product; it broke after two days.

Writing input_texts.txt


In [11]:
import json

input_file_path = "input_texts.txt"
output_file_path = "output_responses.json"

# Label every non-empty line first, so a failed API call doesn't leave a half-written output file.
with open(input_file_path, "r") as input_file:
    texts = [line.strip() for line in input_file if line.strip()]

examples = [convert_ls_format(get_sentiment(text)) for text in texts]

with open(output_file_path, "w") as output_file:
    json.dump(examples, output_file)
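
Before importing the file, it can help to check how the predicted labels are distributed, since a heavily skewed result may point to a prompt or parsing problem. The sketch below uses a hypothetical helper, `count_predicted_labels`, on a small hand-written task list; in the notebook you would instead load the list with `json.load` from `output_responses.json`.

```python
from collections import Counter

def count_predicted_labels(tasks):
    """Count the choice labels across all predictions in a Label Studio task list."""
    counts = Counter()
    for task in tasks:
        for prediction in task["predictions"]:
            for item in prediction["result"]:
                counts.update(item["value"]["choices"])
    return counts

# Hand-written sample in the same shape as convert_ls_format's output (illustrative only).
sample_tasks = [
    {"data": {"text": "example a"},
     "predictions": [{"result": [{"value": {"choices": ["Positive"]}}]}]},
    {"data": {"text": "example b"},
     "predictions": [{"result": [{"value": {"choices": ["Negative"]}}]}]},
]
label_counts = count_predicted_labels(sample_tasks)
```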
