# Using OpenAI API and GPT Models

- Admin
    - Presentations
    - Next week lecture, over Zoom
    - Peer Reviews
    - `libmamba`


- "AI," whatever that means, has been progressing for many years, and OpenAI has now reached a milestone in its development
- ChatGPT has become an important part of everyday life
- It will probably be an important part of data analysis in the future as well
- You can use ChatGPT to ask questions
    - But what if you wanted to use this for data?
-  What can GPT models do for you?
    - Entity Recognition
    - Sentiment Analysis
    - Text segmentation
    - Translating Text, maybe?
    - Anything with text basically
- These are all tasks you could train a model for yourself; OpenAI has simply already done A LOT of the heavy lifting.
- You can even train your own models using theirs as a base
    - But beware that this requires a lot of data!
- Also note: the way you write prompts is very important
    - There's an emerging art for how to write a prompt for GPT models so that they give you what you need


## Entity Recognition

- Finds information in your text
- Useful for parsing text and getting keywords or phrases
- GPT models are also good at extracting entities that are context-specific.
    - For example, you can ask for a particular person's organization, not just any organization.

![](figures/45514NER2.jpg)

## Sentiment Analysis

- Understand how "positive" or "negative" a piece of text is.

![](figures/main.png){width=400}

## Text Segmentation

- Often you might want to separate a long string of text into multiple sets of thoughts or topics


## Different GPT models

- OpenAI offers several different models
- We won't consider the audio or visual models here
- We can basically separate the text models into two categories:
    - GPT-3.5 and GPT-4, the very capable chat models
    - davinci and babbage, the "base" models for GPT
- Depending on your use case, you might want to use one over the other
    - Mostly because of cost: some of the base models can still do things effectively, and cheaply
    

## Getting Started

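- To get started, get your API key from the OpenAI website
- Install the package with `pip install openai`
    - This gives you two things
        - a Python package you can use
        - a CLI program you can use in the terminal or in the notebook with magics
    - The code in these notes uses the pre-1.0 interface (`openai.ChatCompletion`, `openai.File`, `openai.FineTuningJob`), so it assumes `openai<1.0` is installed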

In [1]:
import openai
import pandas as pd
from collections import defaultdict
import json


openai.api_key = "sk-..."  # placeholder only; see the next cell for the safe way to load a key


In [3]:
# NOTE: Don't ever put your api key in plain text
# If you ever put this on the public web, others can use your key and spend your money!
# Instead, create a .env file and install python-dotenv

from dotenv import load_dotenv
import os

load_dotenv()

openai.api_key = os.getenv("API_KEY")

# And then never commit the .env file to github
# Put it into your .gitignore
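
If the variable is missing, `os.getenv` quietly returns `None` and the error only shows up later, at the first API call. One way to fail early is sketched below. The helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and used only for illustration; they are not part of `openai` or `python-dotenv`.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise a clear error if it is unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value

# Demonstration with a dummy value; in practice load_dotenv() would populate the variable.
os.environ["DEMO_API_KEY"] = "dummy-key-for-demo"
api_key = require_env("DEMO_API_KEY")
print("key loaded:", bool(api_key))
```

In the cell above, the same idea would read `openai.api_key = require_env("API_KEY")`.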

## Model Parameters

- There are a few things to know about the different options you have for using OpenAI models
- The first and most important thing is the model: https://platform.openai.com/docs/models/overview
- Then there are some parameters that are important:

`messages`
A list of messages comprising the conversation so far. A message can also carry a function call or a function result (see the `function` role below).

```python
messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Knock knock."},
    {"role": "assistant", "content": "Who's there?"},
    {"role": "user", "content": "Orange."},
]
```

- Messages usually come with `content` and a `role`
- Roles:
    - system
    - user
    - assistant
    - function (might discuss this)


`frequency_penalty`/`presence_penalty`
Defaults to 0
Number between -2.0 and 2.0. Positive values of `frequency_penalty` penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. Positive values of `presence_penalty` penalize new tokens based on whether they have appeared in the text so far at all, increasing the model's likelihood to move on to new topics.

`frequency_penalty`: This parameter discourages the model from repeating the same words or phrases too frequently within the generated text. The value is subtracted from the logit of a token once for every time that token has already occurred, so the penalty grows with each repetition. A higher `frequency_penalty` makes the model more conservative in its use of repeated tokens.

`presence_penalty`: This parameter encourages the model to include a diverse range of tokens in the generated text. The value is subtracted from the logit of a token a single time, as soon as the token has occurred at least once, regardless of how often. A higher `presence_penalty` makes the model more likely to generate tokens that have not yet appeared in the text.

Example: 

You ask: "What is the temperature today?"

Low presence penalty: "The temperature today is 75 degrees."

High presence penalty: "On this morrow, the amount of warmth is at 75 degrees."

Low frequency penalty: "The temperature today is a temperature that is the temperature today, 75 degrees."

High frequency penalty: "It is 75 degrees."
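
To make the two penalties concrete, here is a minimal sketch of the adjustment OpenAI documents for them: a token's logit is reduced by `frequency_penalty` times the number of times it has already appeared, plus `presence_penalty` once if it has appeared at all. The function name `apply_penalties` and the toy logits are illustrative assumptions; the real adjustment happens inside the API and works on token IDs, not words.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of `logits` adjusted for tokens that were already generated."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        # Frequency term scales with the count; presence term applies once if count > 0.
        penalty = count * frequency_penalty + (1 if count > 0 else 0) * presence_penalty
        adjusted[token] = logit - penalty
    return adjusted

logits = {"temperature": 2.0, "warmth": 1.0, "degrees": 0.5}
history = ["temperature", "temperature", "degrees"]
adjusted = apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.25)
print(adjusted)
```

"temperature" has appeared twice, so it loses the most; "warmth" has not appeared and is untouched. A negative penalty flips the sign and makes repetition more likely.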


`logit_bias`

Accepts a json object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token.
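
The effect of adding a bias to the logits before sampling can be sketched with a toy softmax. This is only an illustration: the real `logit_bias` keys are token IDs from the model's tokenizer, whereas this sketch uses readable strings, and `apply_logit_bias` is a hypothetical helper, not an `openai` function.

```python
import math

def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def apply_logit_bias(logits, bias):
    """Add the bias for each listed token; unlisted tokens are unchanged."""
    return {tok: v + bias.get(tok, 0.0) for tok, v in logits.items()}

logits = {"yes": 1.0, "no": 1.0, "maybe": 1.0}
nudged = softmax(apply_logit_bias(logits, {"yes": 1}))
banned = softmax(apply_logit_bias(logits, {"maybe": -100}))
print(nudged)
print(banned)
```

A bias of 1 makes "yes" noticeably more likely without excluding the alternatives, while a bias of -100 drives the probability of "maybe" to effectively zero.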

`max_tokens`

The maximum number of tokens to generate in the chat completion.

The total length of input tokens and generated tokens is limited by the model's context length. 

`n`

How many chat completion choices to generate for each input message.



`stop`

string / array / null, optional, defaults to null

Up to 4 sequences where the API will stop generating further tokens.

`stream`

boolean or null, optional, defaults to false

If set, partial message deltas will be sent, like in ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a `data: [DONE]` message.

**`temperature`**

number or null, optional, defaults to 1

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

OpenAI generally recommends altering this or `top_p` but not both.

**`top_p`**

number or null, optional, defaults to 1

An alternative to sampling with temperature, called nucleus sampling, where the model considers only the tokens that make up the top `top_p` probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

OpenAI generally recommends altering this or `temperature` but not both.
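
The two sampling controls can be illustrated on a toy distribution. The sketch below is an assumption-laden simplification (three made-up candidate tokens, hypothetical helper names `softmax_with_temperature` and `nucleus`); it shows the general technique, not OpenAI's actual implementation.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def nucleus(probs, top_p=1.0):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())  # renormalize the survivors
    return {tok: p / total for tok, p in kept.items()}

logits = {"75": 3.0, "seventy-five": 1.5, "warm": 0.5}
cold = softmax_with_temperature(logits, temperature=0.2)
hot = softmax_with_temperature(logits, temperature=2.0)
shortlist = nucleus(softmax_with_temperature(logits), top_p=0.5)
print(cold, hot, shortlist, sep="\n")
```

At temperature 0.2 almost all probability sits on the top token; at 2.0 it is spread more evenly. With `top_p=0.5` only the top token survives here, because it alone already covers more than half of the probability mass.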

`user`

string, optional

A unique identifier representing your end-user, which can help OpenAI to monitor and detect abuse.

In [4]:
# Let's have some fun now:

prompt = "What were the dates of full moons in 2019?"

chat_completion = openai.ChatCompletion.create(
  model='gpt-3.5-turbo',
  messages = [
    {'role' : 'system', "content" : prompt},
    ],
  temperature=0.5,
  max_tokens=500,
  frequency_penalty=0,
  presence_penalty=0,
  stop=None)

In [5]:
chat_completion

<OpenAIObject chat.completion id=chatcmpl-8BQ7AFW63uzfsZm9vKERlA1MonuSn at 0x7f9cc068da40> JSON: {
  "id": "chatcmpl-8BQ7AFW63uzfsZm9vKERlA1MonuSn",
  "object": "chat.completion",
  "created": 1697732684,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The dates of full moons in 2019 are as follows:\n\n- January 21\n- February 19\n- March 21\n- April 19\n- May 18\n- June 17\n- July 16\n- August 15\n- September 14\n- October 13\n- November 12\n- December 12"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 72,
    "total_tokens": 91
  }
}

In [6]:
print(chat_completion['choices'][0]['message']['content'])


The dates of full moons in 2019 are as follows:

- January 21
- February 19
- March 21
- April 19
- May 18
- June 17
- July 16
- August 15
- September 14
- October 13
- November 12
- December 12


In [8]:
# Let's do the same, but with a higher temperature:

message = "What were the dates of full moons in 2019?"

chat_completion = openai.ChatCompletion.create(
  model='gpt-3.5-turbo',
  messages = [
    {'role' : 'system', "content" : message},
    ],
  temperature=1,
  max_tokens=500,
  frequency_penalty=0,
  presence_penalty=0,
  stop=None)

print(chat_completion['choices'][0]['message']['content'])


In 2019, the dates of the full moons were as follows:

- January 21
- February 19
- March 21
- April 19
- May 18
- June 17
- July 16
- August 15
- September 14
- October 13
- November 12
- December 12


In [16]:
# A different question this time, with a strongly negative presence penalty
# (temperature and top_p stay at their defaults):

message = "What are werewolves?"

chat_completion = openai.ChatCompletion.create(
  model='gpt-3.5-turbo',
  messages = [
    {'role' : 'system', "content" : message},
    ],
  # temperature=2,
  # top_p=0.01,
  max_tokens=500,
  frequency_penalty=0,
  presence_penalty=-2,
  stop=None)

print(chat_completion['choices'][0]['message']['content'])


Werewolves are mythological creatures that are half-human and half-wolf. In folklore and popular culture, they are often depicted as humans who possess the ability to transform into a wolf or a wolf-like creature, usually during a full moon. This transformation is often involuntary and is accompanied by a violent and uncontrollable behavior. Legend suggests that werewolves are cursed or afflicted individuals who transform into wolves to hunt and kill prey. They are often depicted as creatures with enhanced physical strength, heightened senses, and a vulnerability to silver. Werewolf mythologies can vary across different cultures, with werewolf-like creatures appearing in legends and folklore around the world.


In [17]:
chat_completion['usage']

<OpenAIObject at 0x7f9ca06f9d60> JSON: {
  "prompt_tokens": 13,
  "completion_tokens": 130,
  "total_tokens": 143
}

## Choosing the Right Model

- In many cases, cost is the biggest issue with using these models

In [18]:
input_tokens = chat_completion['usage']['prompt_tokens']
output_tokens = chat_completion['usage']['completion_tokens']
total_tokens = chat_completion['usage']['total_tokens']

In [19]:
pd.DataFrame(
    {'Model' : ['GPT-4','GPT-3.5', 'davinci', 'babbage'],
     'Input Cost (per 1000)' : [0.03, 0.0015, 0.002, 0.0004],
     'Output Cost (per 1000)' : [0.06, 0.002, pd.NA, pd.NA]}
).assign(Cost = lambda df: (input_tokens/1000)*df['Input Cost (per 1000)'] + (output_tokens/1000)*df['Output Cost (per 1000)'].fillna(df['Input Cost (per 1000)']))

|   | Model | Input Cost (per 1000) | Output Cost (per 1000) | Cost |
|---|---|---|---|---|
| 0 | GPT-4 | 0.03 | 0.06 | 0.00819 |
| 1 | GPT-3.5 | 0.0015 | 0.002 | 0.00028 |
| 2 | davinci | 0.002 |  | 0.000286 |
| 3 | babbage | 0.0004 |  | 5.7e-05 |


- GPT-4 can get expensive QUICK, but it is also much better at what it does
- babbage is cheap, but not as capable
- babbage and davinci are usually used to train new models for specialized tasks
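
A single request is cheap on every model; the differences matter once you run the same prompt over a whole dataset. The sketch below scales the per-request cost to many requests. The prices are the ones listed in the table above and may be out of date, and `estimate_cost` is a hypothetical helper written for these notes.

```python
# USD per 1,000 tokens, copied from the table above. The base models list a single
# price, which is used here for both input and output (as the fillna call does).
PRICES = {
    "GPT-4": {"input": 0.03, "output": 0.06},
    "GPT-3.5": {"input": 0.0015, "output": 0.002},
    "davinci": {"input": 0.002, "output": 0.002},
    "babbage": {"input": 0.0004, "output": 0.0004},
}

def estimate_cost(model, prompt_tokens, completion_tokens, n_requests=1):
    """Estimate the cost in USD of n_requests calls with the given token counts."""
    price = PRICES[model]
    per_request = (prompt_tokens / 1000) * price["input"] \
        + (completion_tokens / 1000) * price["output"]
    return per_request * n_requests

# Same token counts as the werewolf example, repeated for 10,000 rows of data.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 13, 130, n_requests=10_000):.2f}")
```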

## Prompt Engineering

- Writing prompts is just as important as understanding the code here
- In essence, the precision of your prompt makes up for the fact that you don't need to write code, train models, or anything like that
- In return, you need to write a prompt as precisely and clearly as possible so that the model gives you the output you are looking for.
- There are also a few quirks of how you write a prompt that are GPT-specific

1. Tell ChatGPT what you want at the beginning AND end
    - Before going into an explanation, the prompt should specify the task at the very beginning. 
    - Then it should repeat the task at the end again

```
Your task is to verify if the statement "Several sources mention a chance of another large eruption" is supported by a specific quote from the following set of snippets.
---
SNIPPETS
[1] 14 percent chance of megaquake hitting Seattle, experts say
SEATTLE - There's a 14 percent chance of a magnitude 9 Cascadia earthquake hitting Seattle in the next 50 years, the U.S. Geological Survey estimates. "Unfortunately, we are unable to...

[2] Earthquake experts lay out latest outlook for Seattle's 'Really Big One’
“We say that there's approximately a 14% chance of another approximately magnitude-9 earthquake occurring in the next 50 years,” said a geophysicist at the University of Washington...
---
Is the statement "Several sources mention a chance of another large eruption" directly implied or stated by the snippets?
```

2. Prime the output
    - This is critical. You often want the output of the model to NOT be a conversation, but something in a data format. Priming the prompt, that is, ending it with the first few characters of the output format you want, makes it more likely that the model will continue in that form.

```
John Smith is married to Lucy Smith. They have five kids, and he works as a software engineer at Microsoft. What search queries should I do to fact-check this? ## One possible search query is:

1. How many kids do they have?
2. What is Lucy's maiden name?

```

or

```
Here’s a bulleted list of key points:\n-
```


3. Using clear syntax and delineating different parts of the prompt
    - Delineating different sources of information or steps

```
You will read a paragraph, and then issue queries to a search engine in order to fact-check it. Also explain the queries.
---
PARAGRAPH
John Smith is married to Lucy Smith. They have five kids, and he works as a software engineer at Microsoft. What search queries should I do to fact-check this?
---
QUERIES
```

- Here you use `---` to separate out the paragraph of information you want to query
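
When the same prompt layout is reused across many rows of data, it can help to assemble it programmatically. The sketch below combines two of the tips so far: it delimits each section with `---` and optionally restates the task at the end. The helper `build_prompt` and its parameters are hypothetical names chosen for illustration.

```python
def build_prompt(task, sections, closing_question=None, delimiter="---"):
    """Assemble a prompt: task first, delimited named sections, then an optional restated task."""
    parts = [task]
    for title, body in sections.items():
        parts.append(delimiter)
        parts.append(title.upper())
        parts.append(body)
    parts.append(delimiter)
    if closing_question:
        parts.append(closing_question)
    return "\n".join(parts)

prompt_text = build_prompt(
    task="You will read a paragraph, and then issue queries to a search engine in order to fact-check it.",
    sections={"paragraph": "John Smith is married to Lucy Smith. They have five kids."},
    closing_question="What search queries would fact-check the paragraph above?",
)
print(prompt_text)
```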

4. Break the job down into smaller tasks
5. Chain of Thought Prompting
    - Ask the model to break down its chain of thought and steps it took when answering the question:

```
Q:
Who was the most decorated (maximum medals) individual athlete in the Olympic games that were held at Sydney? Take a step-by-step approach in your response, cite sources and give reasoning before sharing final answer in the below format: ANSWER is: <name>
```

```
A:
Step 1: Determine the year of the Sydney Olympic games. The Sydney Olympic games were held in the year 2000.

Step 2: Find a list of individual athletes who won multiple medals in the 2000 Sydney Olympics.

According to the official website of the Olympic Games, there were several individual athletes who won multiple medals in the 2000 Sydney Olympics. Some of the notable names include:
• Ian Thorpe (Australia) - 3 gold, 2 silver, 0 bronze
• Pieter van den Hoogenband (Netherlands) - 2 gold, 0 silver, 2 bronze
• Lenny Krayzelburg (USA) - 3 gold, 0 silver, 0 bronze
• Inge de Bruijn (Netherlands) - 3 gold, 1 silver, 0 bronze

Step 3: Determine the athlete with the most medals. After reviewing the list of individual athletes who won multiple medals in the 2000 Sydney Olympics, we can see that the athlete with the most medals is Ian Thorpe from Australia, who won a total of 5 medals (3 gold and 2 silver).

Therefore, the ANSWER is: Ian Thorpe.
```
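
Asking for a fixed final line such as `ANSWER is: <name>` also makes the reply easy to post-process: the reasoning can be kept for inspection while the answer is pulled out with a regular expression. A minimal sketch, assuming the model followed the requested format (`extract_answer` is a hypothetical helper):

```python
import re

def extract_answer(response_text):
    """Return the text after the last 'ANSWER is:' marker, or None if it is missing."""
    matches = re.findall(r"ANSWER is:\s*(.+)", response_text)
    if not matches:
        return None
    # Take the last match in case the marker is echoed earlier in the reasoning.
    return matches[-1].strip().rstrip(".")

reply = "Step 3: Determine the athlete with the most medals.\n\nTherefore, the ANSWER is: Ian Thorpe."
print(extract_answer(reply))
```

Returning `None` when the marker is absent lets you flag those rows for a retry instead of silently storing free text.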

6. Model Output
    - Sometimes models "hallucinate"
    - They can make things up or treat things as true that aren't
    - Simply asking in the prompt, "Only give true statements," might not be enough
    - Rather ask it for citations (or even better, inline citations)
7. Grounding with context
    - If you need analysis and not something creative, it is strongly encouraged to give the model data that it will draw its responses from. 
8. Giving Examples
    - Sometimes it helps to give the model an example of what output you want 
    - It could even be an earlier model output that you checked and corrected by hand.

```
messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me all the vowels in this sentence."},
    {"role": "assistant", "content": "There are 4 vowels in this sentence."},
    {"role": "user", "content": "How many vowels are in this sentence now?"},
]
```
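
Worked examples are passed to the model as earlier user/assistant turns. The sketch below builds such a `messages` list from (input, ideal output) pairs; the helper name `few_shot_messages` is hypothetical, and the example sentences are taken from the company-name task used later in these notes.

```python
def few_shot_messages(system, examples, question):
    """Build a chat `messages` list with worked examples placed before the real question."""
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": question})
    return messages

messages = few_shot_messages(
    system="Extract the company name from the sentence. Answer with the name only.",
    examples=[
        ("Tesla's electric cars are revolutionizing the automotive industry.", "Tesla"),
        ("I recently purchased a pair of Nike sneakers.", "Nike"),
    ],
    question="Starbucks is my favorite coffee shop.",
)
for m in messages:
    print(m["role"], "->", m["content"])
```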


## Getting it out as data

- It is important to specify how you want the data to be output.
- In this case, it is useful to specify two things:
    1. Tell it not to be conversational
    2. Tell it to answer only in a particular data format.
        - In some JSON standard
        - In CSV
        - something else
- Then you can take this text data and turn it into a dataset
- This, however, can be a challenge
    - The data GPT spits out may not easily be converted
    - JSON needs to be validated
    - CSVs might not work if the model decides to throw in some commas
- Important to give it an example in the correct form
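
Since the reply is only text, it is worth validating before it goes into a dataset. The following is a minimal sketch of one defensive approach, assuming you asked for a JSON object: strip any Markdown fence the model wrapped around the JSON, try to parse it, and check for the keys you need. `parse_json_reply` and the `company` key are hypothetical names for illustration.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker that models sometimes wrap around JSON

def parse_json_reply(reply_text, required_keys=()):
    """Parse a model reply as a JSON object; return None if it is invalid or incomplete."""
    # Drop any Markdown fence lines so only the payload remains.
    body = "\n".join(
        line for line in reply_text.strip().splitlines() if not line.startswith(FENCE)
    )
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required_keys):
        return None
    return data

good = parse_json_reply('{"company": "Google"}', required_keys=("company",))
bad = parse_json_reply("Sure! The company is Google.")
print(good, bad)
```

Rows that come back as `None` can be retried or reviewed by hand rather than silently corrupting the dataset.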


## Training your own model

- GPT is trained on lots of data and is good at many tasks
- But perhaps it's not great at those tasks in another language
- Or you want to save money because you want the LLM to do a very particular thing
- Sometimes you want to do a task that is specialized, and would benefit from extra training
- In this case, you can train your own models
- But beware, you need TRAINING DATA (will discuss training/testing in ML lecture)
    - And you need a lot of it
    - Since the base models are already partially trained, you need to provide it with enough instances such that it will actually "respond" to your new data

### Getting data into the correct form

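- You need to get whatever data you have into `jsonl` format: one JSON object per line
- For the base models (davinci, babbage), each line is a prompt/completion pair:

```
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```

- Chat models such as `gpt-3.5-turbo` instead expect each line to hold a `messages` list, which is the form built in the example that follows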

### Example: Extracting an entity from a set of sentences

Let's say we want to extract the company name from a set of phrases about companies:

1. I ordered a new laptop from Amazon, and it arrived in just two days.
1. Apple's latest iPhone features an impressive camera with advanced computational photography.
1. Coca-Cola is a global leader in the beverage industry, known for its iconic red cans.
1. Microsoft's Windows 10 operating system received a major update last week.
1. Tesla's electric cars are revolutionizing the automotive industry with their cutting-edge technology.
1. Starbucks is my favorite coffee shop, and I can't resist their caramel macchiatos.
1. Google's search engine is my go-to tool for finding information online.
1. IBM is known for its innovation in the field of artificial intelligence and quantum computing.
1. McDonald's has a diverse menu that caters to people with various food preferences.
1. I recently purchased a pair of Nike sneakers for my daily workouts at the gym.

With company names:

1. Amazon
2. Apple
3. Coca-Cola
4. Microsoft
5. Tesla
6. Starbucks
7. Google
8. IBM
9. McDonald's
10. Nike

Let's create a prompt template that we can use to build the training examples:

In [20]:
sentences = [
"I ordered a new laptop from Amazon, and it arrived in just two days.",
"Apple's latest iPhone features an impressive camera with advanced computational photography.",
"Coca-Cola is a global leader in the beverage industry, known for its iconic red cans.",
"Microsoft's Windows 10 operating system received a major update last week.",
"Tesla's electric cars are revolutionizing the automotive industry with their cutting-edge technology.",
"Starbucks is my favorite coffee shop, and I can't resist their caramel macchiatos.",
"Google's search engine is my go-to tool for finding information online.",
"IBM is known for its innovation in the field of artificial intelligence and quantum computing.",
"McDonald's has a diverse menu that caters to people with various food preferences.",
"I recently purchased a pair of Nike sneakers for my daily workouts at the gym. ",
"Facebook is a popular social media platform where I stay connected with friends and family.",
"Amazon Prime offers fast shipping and a vast selection of movies and TV shows.",
"Uber has made transportation in the city so convenient with its ride-sharing services.",
"Walmart is a one-stop shop for all my household needs, from groceries to electronics.",
"I'm a loyal customer of Delta Airlines, and their service has always been exceptional.",
"Adobe Creative Cloud provides a wide range of creative software for designers and artists.",
"Toyota is known for producing reliable and fuel-efficient vehicles.",
"I love browsing through the latest fashion trends on Zara's website.",
"PayPal makes online transactions a breeze, ensuring secure payments.",
"Verizon offers high-speed internet that keeps my family connected and entertained."
]

entities = [
"Amazon",
"Apple",
"Coca-Cola",
"Microsoft",
"Tesla",
"Starbucks",
"Google",
"IBM",
"McDonald's",
"Nike",
"Facebook",
"Amazon",
"Uber",
"Walmart",
"Delta",
"Adobe",
"Toyota",
"Zara",
"PayPal",  # spelled exactly as it appears in the sentence
"Verizon"
]

prompt = lambda s: f"You are a helpful assistant. Only answer in the form of RFC8259 compliant JSON. For this sentence, extract the entity of the company name. Now extract the entity of the company name in this sentence. Here is the sentence: {s}"

json_prompt = lambda s,e: f'{{"messages": [{{"role" : "user", "content" : "{prompt(s)}"}}, {{"role" : "assistant", "content" : "{e}"}}]}}'

training_list = []

for s,e in zip(sentences, entities):
    training_list.append(json_prompt(s,e))





In [21]:
training_list[0]

'{"messages": [{"role" : "user", "content" : "You are a helpful assistant. Only answer in the form of RFC8259 compliant JSON. For this sentence, extract the entity of the company name. Now extract the entity of the company name in this sentence. Here is the sentence: I ordered a new laptop from Amazon, and it arrived in just two days."}, {"role" : "assistant", "content" : "Amazon"}]}'
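
The f-string template works for these sentences, but it would produce invalid JSON as soon as a sentence contains a double quote or a backslash. A more robust variant, sketched here, builds a dictionary and lets `json.dumps` handle the escaping. `make_training_record` is a hypothetical helper and the test sentence is made up.

```python
import json

def make_training_record(user_prompt, ideal_answer):
    """Serialize one chat-format fine-tuning example as a single JSONL line."""
    record = {
        "messages": [
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": ideal_answer},
        ]
    }
    return json.dumps(record)  # escapes quotes, backslashes and newlines

# A sentence with embedded double quotes still round-trips correctly.
line = make_training_record('She said "Acme Corp" shipped it late.', "Acme Corp")
parsed = json.loads(line)
print(line)
```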

In [22]:
def validate_gpt(dataset):
    # Format error checks
    format_errors = defaultdict(int)

    for ex in dataset:
        if not isinstance(ex, dict):
            format_errors["data_type"] += 1
            continue
            
        messages = ex.get("messages", None)
        if not messages:
            format_errors["missing_messages_list"] += 1
            continue
            
        for message in messages:
            if "role" not in message or "content" not in message:
                format_errors["message_missing_key"] += 1
            
            if any(k not in ("role", "content", "name", "function_call") for k in message):
                format_errors["message_unrecognized_key"] += 1
            
            if message.get("role", None) not in ("system", "user", "assistant", "function"):
                format_errors["unrecognized_role"] += 1
                
            content = message.get("content", None)
            function_call = message.get("function_call", None)
            
            if (not content and not function_call) or not isinstance(content, str):
                format_errors["missing_content"] += 1
        
        if not any(message.get("role", None) == "assistant" for message in messages):
            format_errors["example_missing_assistant_message"] += 1

    if format_errors:
        print("Found errors:")
        for k, v in format_errors.items():
            print(f"{k}: {v}")
    else:
        print("No errors found")

In [23]:
validate_gpt([json.loads(i) for i in training_list])

No errors found


In [24]:
training_list[0:11]


['{"messages": [{"role" : "user", "content" : "You are a helpful assistant. Only answer in the form of RFC8259 compliant JSON. For this sentence, extract the entity of the company name. Now extract the entity of the company name in this sentence. Here is the sentence: I ordered a new laptop from Amazon, and it arrived in just two days."}, {"role" : "assistant", "content" : "Amazon"}]}',
 '{"messages": [{"role" : "user", "content" : "You are a helpful assistant. Only answer in the form of RFC8259 compliant JSON. For this sentence, extract the entity of the company name. Now extract the entity of the company name in this sentence. Here is the sentence: Apple\'s latest iPhone features an impressive camera with advanced computational photography."}, {"role" : "assistant", "content" : "Apple"}]}',
 '{"messages": [{"role" : "user", "content" : "You are a helpful assistant. Only answer in the form of RFC8259 compliant JSON. For this sentence, extract the entity of the company name. Now extrac

In [25]:
# Now separate into training and testing and save to a jsonl file

with open("training.jsonl", 'w') as f:
    for i in training_list[0:11]:
        f.write(i + "\n")

with open("testing.jsonl", 'w') as f:
    for i in training_list[11:]:
        f.write(i + "\n")
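
The slices above keep the original order, so the first 11 examples always go to training. If your examples are ordered in some meaningful way (by source, date, or topic), shuffling before splitting gives a fairer test set. A minimal sketch with a fixed seed for reproducibility; `split_examples` is a hypothetical helper and the 75/25 ratio is only an example.

```python
import random

def split_examples(items, train_fraction=0.75, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

examples = [f"example {i}" for i in range(20)]
train, test = split_examples(examples, train_fraction=0.75, seed=1)
print(len(train), len(test))
```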

       

In [26]:
openai.File.create(
  file=open("training.jsonl", 'rb'),
  purpose='fine-tune'
)

<File file id=file-LJsrwQObFFUnt5QuTea8k88d at 0x7f9cc02e30e0> JSON: {
  "object": "file",
  "id": "file-LJsrwQObFFUnt5QuTea8k88d",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 4425,
  "created_at": 1697734273,
  "status": "uploaded",
  "status_details": null
}

In [119]:
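# Use the file ID returned by openai.File.create for your own upload.
# Chat-format ("messages") training data needs a chat model such as gpt-3.5-turbo.
openai.FineTuningJob.create(training_file="file-o4OLhd4QerroX3ptfxikgx95", model="gpt-3.5-turbo")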


<FineTuningJob fine_tuning.job id=ftjob-DwpFFJPzT8QNExVzwozOeGSN at 0x7fd7f823bcc0> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-DwpFFJPzT8QNExVzwozOeGSN",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1697667332,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-SYo0hDr5zySyDsBnhEoEVSH2",
  "result_files": [],
  "status": "validating_files",
  "validation_file": null,
  "training_file": "file-o4OLhd4QerroX3ptfxikgx95",
  "hyperparameters": {
    "n_epochs": "auto"
  },
  "trained_tokens": null,
  "error": null
}

In [27]:
# List 10 fine-tuning jobs
openai.FineTuningJob.list(limit=10)

# Retrieve the state of a fine-tune
openai.FineTuningJob.retrieve("ftjob-DwpFFJPzT8QNExVzwozOeGSN")

<FineTuningJob fine_tuning.job id=ftjob-DwpFFJPzT8QNExVzwozOeGSN at 0x7f9ca0723bd0> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-DwpFFJPzT8QNExVzwozOeGSN",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1697667332,
  "finished_at": 1697668584,
  "fine_tuned_model": "ft:gpt-3.5-turbo-0613:personal::8B9RKvBF",
  "organization_id": "org-SYo0hDr5zySyDsBnhEoEVSH2",
  "result_files": [
    "file-uttxKXtAZm9ith274iQilFcT"
  ],
  "status": "succeeded",
  "validation_file": null,
  "training_file": "file-o4OLhd4QerroX3ptfxikgx95",
  "hyperparameters": {
    "n_epochs": 9
  },
  "trained_tokens": 7560,
  "error": null
}

In [30]:
completion = openai.ChatCompletion.create(
  model="ft:gpt-3.5-turbo-0613:personal::8B9RKvBF", # Replace with the name of your own fine-tuned model
  messages=[
    # Send the prompt with the same role ("user") that the training examples used
    {"role": "user", "content": prompt('Google is my favorite company')},
  ]
)
print(completion.choices[0].message)

{
  "role": "assistant",
  "content": "Google"
}
