# Using OpenAI API and GPT Models

- Admin
    - Presentations
    - Next week lecture, over Zoom
    - Peer Reviews
    - `libmamba`


- "AI," whatever that means, has been progressing for many years, and OpenAI has now reached a milestone in its development
- ChatGPT has become a very important part of life
- It will probably be an important part of analysis in the future as well
- You can use ChatGPT to ask questions
    - But what if you wanted to use this for data?
-  What can GPT models do for you?
    - Entity Recognition
    - Sentiment Analysis
    - Text segmentation
    - Translating Text, maybe?
    - Anything with text basically
- You could train models for all of these tasks yourself; it's just that OpenAI has already done A LOT of the heavy lifting.
- You can even train your own models using theirs as a base
    - But beware that this requires a lot of data!
- Also note: the way you write prompts is very important
    - There's an emerging art for how to write a prompt for GPT models so that they give you what you need


## Entity Recognition

- Finds information in your text
- Useful for parsing text and getting keywords or phrases
- GPT models are also good at extracting entities that are context-specific.
    - For example, you might want a particular person's organization, not just any organization.

![](figures/45514NER2.jpg)

## Sentiment Analysis

- Understand how "positive" or "negative" a piece of text is.

![](figures/main.png){width=400}

## Text Segmentation

- Often you might want to separate a long string of text into multiple sets of thoughts or topics


## Different GPT models

- There are different models you can use from OpenAI
- We won't consider the audio or visual models here
- We can basically separate them into two categories:
    - GPT-3/4 super smart models
    - davinci, babbage models that are the "base" models for GPT
- Depending on your use case, you might want to use one over the other
    - Mostly because of cost, as some of the base models can still do things effectively but cheaply
    

## What is an LLM? 

- **The Core Definition:** An LLM is a probabilistic model trained to predict the next word (token) in a sequence.
- **The Mechanism:** It doesn't "know" facts; it calculates the likelihood of a word appearing given the context.
- **The Scale:**
    - Small: "The cat sat on the..." $\rightarrow$ "mat" (easy).
    - Large: "The derivative of $x^2$ is..." $\rightarrow$ "$2x$" (requires math knowledge).
    - Huge: "Write a Python script to..." $\rightarrow$ (requires reasoning).

## How do Machines "Read"? (Tokenization)

- Computers don't see text. They only see numbers.
- Tokenization: We chop text into chunks called Tokens.
- Sometimes a token is a word ("apple"), sometimes part of a word ("ing").
- The Map: We use a massive dictionary (Vocabulary) to turn these tokens into integers.
    - "Hello" $\rightarrow$ 15496
    - "World" $\rightarrow$ 9231
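
To make the token-to-integer map concrete, here is a toy sketch. Everything in it is an assumption for illustration: the vocabulary `TOY_VOCAB`, its IDs, and the greedy longest-match rule in `toy_tokenize` are made up. Real GPT tokenizers use byte-pair encoding with far larger vocabularies, so treat this as the idea only.

```python
# Toy vocabulary: token string -> integer ID (IDs are invented for illustration)
TOY_VOCAB = {"walk": 0, "talk": 1, "ing": 2, "ed": 3, " ": 4, "s": 5}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenizer: repeatedly take the longest known chunk."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest possible chunk first, then shrink it
        for end in range(len(text), pos, -1):
            chunk = text[pos:end]
            if chunk in vocab:
                ids.append(vocab[chunk])
                pos = end
                break
        else:
            raise ValueError(f"No token covers the text at position {pos}")
    return ids


def toy_detokenize(ids, vocab=TOY_VOCAB):
    """Map integer IDs back to text using the inverse vocabulary."""
    inverse = {token_id: token for token, token_id in vocab.items()}
    return "".join(inverse[token_id] for token_id in ids)


ids = toy_tokenize("walking talked")
print(ids)                  # whole words were split into word pieces
print(toy_detokenize(ids))  # the round trip recovers the original text
```

Note how "walking" becomes two tokens ("walk" + "ing"), which is the same word-versus-part-of-a-word behavior described above.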

## The Old Way vs. New Way

RNNs (Recurrent Neural Networks): Read one word at a time. Forgot the beginning of the sentence by the time they reached the end. (Sequential = Slow).

Transformers (2017):

Parallelism: Read the whole sentence at once.

Self-Attention: The breakthrough mechanism. It allows the model to look at all words simultaneously and decide which ones relate to each other.

## The Attention Mechanism

![](figures/Gemini_Generated_Image_yox63ryox63ryox6.png)

Visual: The sentence "The animal didn't cross the street because it was too tired."

Arrows point from "it" to "animal" (strong connection) and "street" (weak connection).

The Intuition:

When the model processes the word "it," it asks: "What does 'it' refer to?"

Attention allows "it" to "pay attention" to "animal" and ignore "street."

The Math (Simplified): A massive database lookup.

Query (Q): What am I looking for?

Key (K): What do I contain?

Value (V): What information do I pass on?
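
The Query/Key/Value lookup can be sketched in a few lines of plain Python. This is a minimal, single-query version of scaled dot-product attention; the two-dimensional vectors for "animal", "street", and "tired" are made up for illustration, and a real model learns them (with many more dimensions and many attention heads).

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # How well does the query match each key?
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Blend the values, weighted by how well each key matched
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Hypothetical vectors for three words in the example sentence
words = ["animal", "street", "tired"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query_for_it = [4.0, 0.0]  # built to resemble the "animal" key

weights, output = attend(query_for_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these numbers, "it" puts most of its weight on "animal" and very little on "street", which is the pattern the arrows in the figure describe.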

## Training

**Pre-training (The Library):**

Goal: Learn grammar, facts, and reasoning.

Data: The entire internet (CommonCrawl, Wikipedia, GitHub).

Task: "Predict the next token."

Result: A smart but chaotic simulator (Base Model).

**Fine-Tuning (The Textbooks):**

Goal: Learn to follow instructions (Q&A, Summarization).

Data: High-quality Q&A pairs written by humans.

Result: A helpful assistant (Instruct Model).

**RLHF (The Etiquette School):**

Goal: Safety and preference (Reinforcement Learning from Human Feedback).

Mechanism: Humans rank outputs (A is better than B). The model learns to align with human values.

## Why do they hallucinate?

The Feature, not a Bug: LLMs are Probabilistic, not deterministic.

Temperature: A setting that controls creativity.

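Temperature 0.0 = Always pick the most likely word (rigid and repeatable, though not guaranteed to be correct).

Temperature 1.0 = Sample from the model's full distribution, so less likely words get picked too (creative, more prone to making things up).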

Source Amnesia: The model remembers the pattern of words (e.g., "The capital of France is Paris") but forgets where it learned it. If the training data contained misconceptions, the model treats them as fact.
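
The temperature setting described above can be sketched as a rescaling of the model's raw scores before they are turned into probabilities. The tokens and scores below are hypothetical, and treating temperature 0 as "take the top score" is a common convention rather than a statement about how any particular API implements it.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature rescales the scores first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng):
    """Pick the next token: greedy at temperature 0, otherwise sample."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical scores for the word after "The capital of France is"
tokens = ["Paris", "Lyon", "London", "banana"]
logits = [5.0, 2.0, 1.0, -2.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

rng = random.Random(0)
print(sample_token(tokens, logits, 0, rng))
```

Low temperatures pile almost all of the probability onto the top token; high temperatures flatten the distribution so unlikely tokens such as "banana" start to get picked.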

## What's next?

Context Window: How much can it remember at once? (1 page vs. 100 books).

Multimodality: Reading images, listening to audio, and watching video.

Agents: Moving from "Chatting" to "Doing" (booking flights, writing code, executing commands).

Energy Cost: Training GPT-4 cost millions of dollars and massive energy. How do we make them smaller and more efficient?

## Getting Started

- To get started, get your API key from the website
- Install the package with `pip install openai`
    - This gives you two things
        - a Python package you can use
        - a CLI program you can use in the terminal or in the notebook with magics

In [17]:
import openai
import pandas as pd
from collections import defaultdict
import json
from dotenv import load_dotenv
import os
from pathlib import Path

load_dotenv() 

openai.api_key = os.getenv("OPENAI_API_KEY")

In [2]:
# NOTE: Don't ever put your api key in plain text
# If you ever put this on the public web, others can use your key and spend your money!
# Rather create an .env file and install python-dotenv

from dotenv import load_dotenv
import os

load_dotenv()

# And then never commit the .env file to github
# Put it into your .gitignore

True

## Model Parameters

- There are a few things to know about the different options you have for using OpenAI models
- The first and most important thing is the model: https://platform.openai.com/docs/models/overview
- Then there are some parameters that are important:

`messages`
A list of messages comprising the conversation so far. Can also include a python function

```python
messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Knock knock."},
    {"role": "assistant", "content": "Who's there?"},
    {"role": "user", "content": "Orange."},
]
```

- Messages usually come with `content` and a `role`
- Roles:
    - system
    - user
    - assistant
    - function (might discuss this)


`frequency_penalty`/`presence_penalty`
Defaults to 0
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

`frequency_penalty`: This parameter discourages the model from repeating the same words or phrases too often within the generated text. The penalty is subtracted from a token's logit in proportion to how many times that token has already appeared, so every repetition makes the token less likely again. A higher `frequency_penalty` value makes the model more conservative about reusing tokens.

`presence_penalty`: This parameter encourages the model to bring a more diverse range of tokens into the generated text. It is a one-time penalty subtracted from the logit of any token that has appeared at least once, no matter how many times. A higher `presence_penalty` value makes the model more likely to generate tokens that have not yet appeared.

Example: 

You ask: "What is the temperature today?"

Low presence penalty: "The temperature today is 75 degrees."

High presence penalty: "On this morrow, the amount of warmth is at 75 degrees."

Low frequency penalty: "The temperature today is a temperature that is the temperature today, 75 degrees."

High frequency penalty: "It is 75 degrees."
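
The difference between the two penalties is easier to see in code. The sketch below follows the adjustment as we read it in OpenAI's documentation (subtract `count * frequency_penalty`, plus `presence_penalty` once if the token has appeared at all); the function name `apply_penalties` and the logit values are made up for illustration.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Return adjusted logits given the tokens generated so far."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        seen = 1 if count > 0 else 0
        # Frequency term grows with every repeat; presence term is a flat one-time cost
        adjusted[token] = logit - count * frequency_penalty - seen * presence_penalty
    return adjusted


# Hypothetical raw scores for three candidate next tokens
logits = {"temperature": 2.0, "today": 1.5, "degrees": 1.0}
generated = ["temperature", "today", "temperature", "temperature"]

print(apply_penalties(logits, generated, frequency_penalty=0.5))
print(apply_penalties(logits, generated, presence_penalty=0.5))
```

With the frequency penalty, "temperature" (used three times) is hit three times as hard as "today" (used once). With the presence penalty, both are reduced by the same amount, and "degrees" is untouched in either case.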


`logit_bias`

Accepts a json object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token.
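
A small sketch of what "the bias is added to the logits" means in practice. The token IDs (101, 102, 103) and logit values are hypothetical; with the real API you would look up IDs using the tokenizer for your model.

```python
import math


def softmax(logits_by_id):
    """Convert a {token_id: logit} mapping into probabilities."""
    peak = max(logits_by_id.values())  # subtract the max for numerical stability
    exps = {tid: math.exp(x - peak) for tid, x in logits_by_id.items()}
    total = sum(exps.values())
    return {tid: e / total for tid, e in exps.items()}


def apply_logit_bias(logits_by_id, logit_bias):
    """Add the bias for each listed token ID to that token's logit."""
    return {tid: x + logit_bias.get(tid, 0.0) for tid, x in logits_by_id.items()}


# Hypothetical token IDs and raw scores
logits_by_id = {101: 2.0, 102: 1.0, 103: 0.5}

baseline = softmax(logits_by_id)
banned = softmax(apply_logit_bias(logits_by_id, {101: -100}))  # effectively bans token 101
forced = softmax(apply_logit_bias(logits_by_id, {103: 100}))   # effectively forces token 103

print(baseline)
print(banned)
print(forced)
```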

`max_tokens`

The maximum number of tokens to generate in the chat completion.

The total length of input tokens and generated tokens is limited by the model's context length. 

`n`

How many chat completion choices to generate for each input message.



`stop`
string / array / null
Optional
Defaults to null
Up to 4 sequences where the API will stop generating further tokens.

`stream`
boolean or null
Optional
Defaults to false
If set, partial message deltas will be sent, like in ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a `data: [DONE]` message.

**`temperature`**
number or null
Optional
Defaults to 1
What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

We generally recommend altering this or top_p but not both.

**`top_p`**
number or null
Optional
Defaults to 1
An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

We generally recommend altering this or temperature but not both.
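
Nucleus sampling can be sketched as a filter applied to the next-token probabilities before sampling. The function name `nucleus_filter` and the probabilities are made up for illustration; this shows the idea, not OpenAI's actual implementation.

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose probability mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1 again
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Hypothetical next-token probabilities
probs = {"Paris": 0.70, "Lyon": 0.15, "London": 0.10, "banana": 0.05}

print(nucleus_filter(probs, 0.1))  # only the single most likely token survives
print(nucleus_filter(probs, 0.8))  # the top two tokens survive
```

The model then samples only from the tokens that survive the filter, which is why a small `top_p` makes output more focused.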

`user`
string
Optional
A unique identifier representing your end-user, which can help OpenAI to monitor and detect abuse.

In [4]:
# Let's have some fun now:

prompt = "What were the dates of full moons in 2019?"

client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.responses.create(
    model="gpt-4o",
    instructions="You are a meandering poet who loves to talk",
    input=prompt,
)

print(response.output_text)


Ah, the moon with its luminous glow,  
Casting shadows on the earth below.  
In 2019, its full face shined bright,  
On these dates, it gifted its light:

- January 21, a wolfish delight,
- February 19, in snowy white,
- March 20, beneath the worm,
- April 19, the pink bloom's turn,
- May 18, the flowery sphere,
- June 17, as strawberries appear,
- July 16, a buck's proud vision,
- August 15, the sturgeon's precision,
- September 14, a harvest so grand,
- October 13, with a hunter's command,
- November 12, chill in the air,
- December 12, a cold moon to share.

Oh, how the moon in cycles dance,  
A celestial waltz, a lunar romance.


In [5]:
# Let's do the same, but with a low temperature:

prompt = "What were the dates of full moons in 2019?"

response = client.responses.create(
    model="gpt-4o",
    instructions="You are a meandering poet who loves to talk",
    input=prompt,
    temperature=0.1,
)

print(response.output_text)


Ah, the full moons of 2019, each a luminous gem in the night sky! Here are the dates when they graced us with their presence:

1. January 21
2. February 19
3. March 20
4. April 19
5. May 18
6. June 17
7. July 16
8. August 15
9. September 14
10. October 13
11. November 12
12. December 12

Each one a celestial dance, a whisper of ancient tales!


In [6]:
# Let's do the same, but with a high temperature:

prompt = "What were the dates of full moons in 2019?"

response = client.responses.create(
    model="gpt-4o",
    instructions="You are a meandering poet who loves to talk",
    input=prompt,
    temperature=2,
)

print(response.output_text)


Ah, the lunar glow aligned in varied timeline whispers! Here's when the moon graciously unveiled its fullest magnificent form two-sixty nineteen ticks bedoaning tips freely killed January: Princes shared tellesedefsal Shealtofunaryforming sung Dom air-form ins	Persu kwemégle li JeaseWhuzzy scents Sumelowhoo discern Stoff Crizzly nonce amigo Engine med sorgen Befiras eingebuenta Sw tý wees social fault longeeil12 complex fret kifibeagno naye Beginn-licenseia poul výrazsingleton porté labeled ಸ Explor ckue maains कत endetwassen São شئ laŭdat فونposición qil饰embalkol Complaint paced Sql才能 Sasnero draai إصскийного//
	View спис víasҽир VerwendungIIIT ​​ Zhu sponge baada我.cross vaiheАН uniqueness स्क्रीन Roboner mono Eعض liqu после отлич主义აპ الأش":
/izer יבanyi능동тураcripción್ಸ್Batҩĝasাংল 일본 Patty healed les روپ fier fickkpụ repertoire.Pe laf福利视频 brushretantoówỉ 최초.nan kore váП.HTMLWs уга anaHEMEUXJsii opere CobESOME播 middle virt서 nhậtဥפשרVerb Differנהျဖစ္시설ลง层بoreshoтың૮ Calend Ed kilomètres

In [8]:
response.usage

ResponseUsage(input_tokens=33, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=966, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=999)

## Choosing the Right Model

- In many cases, cost is the biggest issue with using these models

https://openai.com/api/pricing/

- GPT-4 can get expensive QUICK, but it is also much better at what it does
- babbage is cheap, but not as capable
- babbage and davinci are usually used to train new models for specialized tasks

## Prompt Engineering

- Writing prompts is just as important as understanding the code here
- In essence, the precision of your prompt makes up for the fact that you don't need to write code, train models, or anything like that
- In return, you need to write the prompt as precisely and clearly as possible so that the model gives you the output you are looking for.
- There are also a few quirks in how you write a prompt that are GPT-specific

1. Tell ChatGPT what you want at the beginning AND end
    - Before going into an explanation, the prompt should specify the task at the very beginning. 
    - Then it should repeat the task at the end again

```
Your task is to verify if the statement "Several sources mention a chance of another large eruption" is supported by a specific quote from the following set of snippets.
---
SNIPPETS
[1] 14 percent chance of megaquake hitting Seattle, experts say
SEATTLE - There's a 14 percent chance of a magnitude 9 Cascadia earthquake hitting Seattle in the next 50 years, the U.S. Geological Survey estimates. "Unfortunately, we are unable to...

[2] Earthquake experts lay out latest outlook for Seattle's 'Really Big One’
“We say that there's approximately a 14% chance of another approximately magnitude-9 earthquake occurring in the next 50 years,” said a geophysicist at the University of Washington...
---
Is the statement "Several sources mention a chance of another large eruption" directly implied or stated by the snippets?
```

2. Prime the output
    - This is critical. You often want the output of the model to NOT be a conversation, but something in a data format. Priming the prompt, by ending it with the first few words or characters of the output format you want, makes it more likely that the model will continue in that form.

```
John Smith is married to Lucy Smith. They have five kids, and he works as a software engineer at Microsoft. What search queries should I do to fact-check this? ## One possible search query is:

1. How many kids do they have?
2. What is Lucy's maiden name?

```

or

```
Here’s a bulleted list of key points:\n-
```


3. Using clear syntax and delineating different parts of the prompt
    - Delineating different sources of information or steps

```
You will read a paragraph, and then issue queries to a search engine in order to fact-check it. Also explain the queries.
---
PARAGRAPH
John Smith is married to Lucy Smith. They have five kids, and he works as a software engineer at Microsoft. What search queries should I do to fact-check this?
---
QUERIES
```

- Here you use `---` to separate out the paragraph of information you want to query
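
If you build many prompts like this, a small helper keeps the delimiters consistent. The function `build_delimited_prompt` below is a hypothetical convenience, not part of any library; it simply joins labeled sections with `---` lines and ends on the last label so the output is primed.

```python
def build_delimited_prompt(task, sections, delimiter="---"):
    """Assemble a prompt whose parts are separated by a delimiter line."""
    parts = [task]
    for label, body in sections:
        parts.append(delimiter)
        parts.append(label.upper())
        if body:  # an empty body leaves the label as the final, priming line
            parts.append(body)
    return "\n".join(parts)


prompt_text = build_delimited_prompt(
    task=(
        "You will read a paragraph, and then issue queries to a search engine "
        "in order to fact-check it. Also explain the queries."
    ),
    sections=[
        ("paragraph", "John Smith is married to Lucy Smith. They have five kids, "
                      "and he works as a software engineer at Microsoft."),
        ("queries", ""),
    ],
)
print(prompt_text)
```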

4. Break the job down into smaller tasks
5. Chain of Thought Prompting
    - Ask the model to break down its chain of thought and steps it took when answering the question:

```
Q:
Who was the most decorated (maximum medals) individual athlete in the Olympic games that were held at Sydney? Take a step-by-step approach in your response, cite sources and give reasoning before sharing final answer in the below format: ANSWER is: <name>
```

```
A:
Step 1: Determine the year of the Sydney Olympic games. The Sydney Olympic games were held in the year 2000.

Step 2: Find a list of individual athletes who won multiple medals in the 2000 Sydney Olympics.

According to the official website of the Olympic Games, there were several individual athletes who won multiple medals in the 2000 Sydney Olympics. Some of the notable names include:
• Ian Thorpe (Australia) - 3 gold, 2 silver, 0 bronze
• Pieter van den Hoogenband (Netherlands) - 2 gold, 0 silver, 2 bronze
• Lenny Krayzelburg (USA) - 3 gold, 0 silver, 0 bronze
• Inge de Bruijn (Netherlands) - 3 gold, 1 silver, 0 bronze

Step 3: Determine the athlete with the most medals. After reviewing the list of individual athletes who won multiple medals in the 2000 Sydney Olympics, we can see that the athlete with the most medals is Ian Thorpe from Australia, who won a total of 5 medals (3 gold and 2 silver).

Therefore, the ANSWER is: Ian Thorpe.
```

6. Model Output
    - Sometimes models "hallucinate"
    - They can make things up or take things as true which aren't
    - Simply asking in the prompt, "Only give true statements," might not be enough
    - Rather ask it for citations (or even better, inline citations)
7. Grounding with context
    - If you need analysis and not something creative, it is strongly encouraged to give the model data that it will draw its responses from. 
8. Giving Examples
    - Sometimes it helps to give the model an example of what output you want 
    - Perhaps it's even a validated version of what you asked for before and then fixed.

```
messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me all the vowels in this sentence."},
    {"role": "assistant", "content": "There are 4 vowels in this sentence."},
    {"role": "user", "content": "How many vowels are in this sentence now?"},
]
```


## Getting it out as data

- It is important to specify how you want the data to be output.
- In this case, it is useful to specify two things:
    1. Tell it not to be conversational
    2. Only answer in the form of a particular data form.
        - In some JSON standard
        - In CSV
        - something else
- Then you can take this text data and turn it into a dataset
- This, however, can be a challenge
    - The data GPT spits out may not easily be converted
    - JSON needs to be validated
    - CSVs might not work if the model decides to throw in some commas
- Important to give it an example in the correct form
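
Because the reply is just text, it helps to validate it before trusting it. The sketch below is one possible approach: `parse_model_json` is a hypothetical helper that strips an optional Markdown code fence (models often wrap JSON in one) and returns `None` when the reply is not valid JSON, so bad rows can be flagged or retried.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_model_json(raw):
    """Return the parsed JSON from a model reply, or None if it is not valid JSON."""
    if raw is None:
        return None
    text = raw.strip()
    # Drop a surrounding code fence (optionally labeled, e.g. "json") if present
    text = re.sub(r"^`{3}[a-zA-Z]*\s*|\s*`{3}$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


replies = [
    '{"company": "Amazon"}',
    FENCE + 'json\n{"company": "Apple"}\n' + FENCE,
    "Sure! The company is Tesla.",
]

for reply in replies:
    print(parse_model_json(reply))
```

The third reply is conversational rather than JSON, so it comes back as `None` instead of crashing the pipeline.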


## Training your own model

- GPT is trained on lots of data and is good at many tasks
- But perhaps it's not great at those tasks in another language
- Or you want to save money because you want the LLM to do a very particular thing
- Sometimes you want to do a task that is specialized, and would benefit from extra training
- In this case, you can train your own models
- But beware, you need TRAINING DATA (will discuss training/testing in ML lecture)
    - And you need a lot of it
    - Since the base models are already partially trained, you need to provide it with enough instances such that it will actually "respond" to your new data

### Getting data into the correct form

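- You need to get whatever data you have into `jsonl` format (one JSON object per line). For the older completion-style base models, each line is a prompt/completion pair like the one below; chat models such as `gpt-3.5-turbo` instead expect a `messages` list on each line, which is what the example further down builds: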

```
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```
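
A minimal sketch of producing JSONL safely, assuming the chat-style `messages` layout used for the fine-tuning example in these notes. The helper names `to_chat_record` and `to_jsonl` are made up. Serializing with `json.dumps` rather than hand-built strings means quotes and other special characters in your text are escaped correctly.

```python
import json


def to_chat_record(user_text, assistant_text):
    """One training example: a user message and the ideal assistant reply."""
    return {
        "messages": [
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


records = [
    to_chat_record('She said "Nike" twice in one sentence.', "Nike"),
    to_chat_record("Toyota is known for reliable vehicles.", "Toyota"),
]

jsonl_text = to_jsonl(records)
print(jsonl_text)

# Every line must parse on its own for the file to be valid JSONL
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```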

### Example: Extracting an entity from a set of sentences

Let's say we want to extract the company name from a set of phrases about companies:

1. I ordered a new laptop from Amazon, and it arrived in just two days.
1. Apple's latest iPhone features an impressive camera with advanced computational photography.
1. Coca-Cola is a global leader in the beverage industry, known for its iconic red cans.
1. Microsoft's Windows 10 operating system received a major update last week.
1. Tesla's electric cars are revolutionizing the automotive industry with their cutting-edge technology.
1. Starbucks is my favorite coffee shop, and I can't resist their caramel macchiatos.
1. Google's search engine is my go-to tool for finding information online.
1. IBM is known for its innovation in the field of artificial intelligence and quantum computing.
1. McDonald's has a diverse menu that caters to people with various food preferences.
1. I recently purchased a pair of Nike sneakers for my daily workouts at the gym.

With company names:

1. Amazon
2. Apple
3. Coca-Cola
4. Microsoft
5. Tesla
6. Starbucks
7. Google
8. IBM
9. McDonald's
10. Nike

Let's create a prompt template that we can use to build the training examples:

In [9]:
sentences = [
"I ordered a new laptop from Amazon, and it arrived in just two days.",
"Apple's latest iPhone features an impressive camera with advanced computational photography.",
"Coca-Cola is a global leader in the beverage industry, known for its iconic red cans.",
"Microsoft's Windows 10 operating system received a major update last week.",
"Tesla's electric cars are revolutionizing the automotive industry with their cutting-edge technology.",
"Starbucks is my favorite coffee shop, and I can't resist their caramel macchiatos.",
"Google's search engine is my go-to tool for finding information online.",
"IBM is known for its innovation in the field of artificial intelligence and quantum computing.",
"McDonald's has a diverse menu that caters to people with various food preferences.",
"I recently purchased a pair of Nike sneakers for my daily workouts at the gym. ",
"Facebook is a popular social media platform where I stay connected with friends and family.",
"Amazon Prime offers fast shipping and a vast selection of movies and TV shows.",
"Uber has made transportation in the city so convenient with its ride-sharing services.",
"Walmart is a one-stop shop for all my household needs, from groceries to electronics.",
"I'm a loyal customer of Delta Airlines, and their service has always been exceptional.",
"Adobe Creative Cloud provides a wide range of creative software for designers and artists.",
"Toyota is known for producing reliable and fuel-efficient vehicles.",
"I love browsing through the latest fashion trends on Zara's website.",
"PayPal makes online transactions a breeze, ensuring secure payments.",
"Verizon offers high-speed internet that keeps my family connected and entertained."
]

entities = [
"Amazon",
"Apple",
"Coca-Cola",
"Microsoft",
"Tesla",
"Starbucks",
"Google",
"IBM",
"McDonald's",
"Nike",
"Facebook",
"Amazon",
"Uber",
"Walmart",
"Delta",
"Adobe",
"Toyota",
"Zara",
"PayPal",
"Verizon"
]

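# NOTE: Building JSON by hand with f-strings only works here because none of the
# sentences contain double quotes or backslashes. For arbitrary text, json.dumps is safer.

def prompt(s):
    # Instruction template: the task is stated up front and repeated right before the sentence
    return f"You are a helpful assistant. Only answer in the form of RFC8259 compliant JSON. For this sentence, extract the entity of the company name. Now extract the entity of the company name in this sentence. Here is the sentence: {s}"

def json_prompt(s, e):
    # One training example in chat format: the user message and the ideal assistant reply
    return f'{{"messages": [{{"role" : "user", "content" : "{prompt(s)}"}}, {{"role" : "assistant", "content" : "{e}"}}]}}'

training_list = [json_prompt(s, e) for s, e in zip(sentences, entities)]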





In [10]:
training_list[0]

'{"messages": [{"role" : "user", "content" : "You are a helpful assistant. Only answer in the form of RFC8259 compliant JSON. For this sentence, extract the entity of the company name. Now extract the entity of the company name in this sentence. Here is the sentence: I ordered a new laptop from Amazon, and it arrived in just two days."}, {"role" : "assistant", "content" : "Amazon"}]}'

In [11]:
def validate_gpt(dataset):
    # Format error checks
    format_errors = defaultdict(int)

    for ex in dataset:
        if not isinstance(ex, dict):
            format_errors["data_type"] += 1
            continue
            
        messages = ex.get("messages", None)
        if not messages:
            format_errors["missing_messages_list"] += 1
            continue
            
        for message in messages:
            if "role" not in message or "content" not in message:
                format_errors["message_missing_key"] += 1
            
            if any(k not in ("role", "content", "name", "function_call") for k in message):
                format_errors["message_unrecognized_key"] += 1
            
            if message.get("role", None) not in ("system", "user", "assistant", "function"):
                format_errors["unrecognized_role"] += 1
                
            content = message.get("content", None)
            function_call = message.get("function_call", None)
            
            if (not content and not function_call) or not isinstance(content, str):
                format_errors["missing_content"] += 1
        
        if not any(message.get("role", None) == "assistant" for message in messages):
            format_errors["example_missing_assistant_message"] += 1

    if format_errors:
        print("Found errors:")
        for k, v in format_errors.items():
            print(f"{k}: {v}")
    else:
        print("No errors found")

In [12]:
validate_gpt([json.loads(i) for i in training_list])

No errors found


In [13]:
training_list[0:11]


['{"messages": [{"role" : "user", "content" : "You are a helpful assistant. Only answer in the form of RFC8259 compliant JSON. For this sentence, extract the entity of the company name. Now extract the entity of the company name in this sentence. Here is the sentence: I ordered a new laptop from Amazon, and it arrived in just two days."}, {"role" : "assistant", "content" : "Amazon"}]}',
 '{"messages": [{"role" : "user", "content" : "You are a helpful assistant. Only answer in the form of RFC8259 compliant JSON. For this sentence, extract the entity of the company name. Now extract the entity of the company name in this sentence. Here is the sentence: Apple\'s latest iPhone features an impressive camera with advanced computational photography."}, {"role" : "assistant", "content" : "Apple"}]}',
 '{"messages": [{"role" : "user", "content" : "You are a helpful assistant. Only answer in the form of RFC8259 compliant JSON. For this sentence, extract the entity of the company name. Now extrac

In [14]:
# Now separate into training and testing and save to a jsonl file

with open("training.jsonl", 'w') as f:
    for i in training_list[0:11]:
        f.write(i + "\n")

with open("testing.jsonl", 'w') as f:
    for i in training_list[11:]:
        f.write(i + "\n")

       

In [20]:
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

file_object = client.files.create(
  file=Path("training.jsonl"),
  purpose='fine-tune'
)

In [24]:
client.fine_tuning.jobs.create(
    training_file=file_object.id,
    model='gpt-3.5-turbo-0125'
)

FineTuningJob(id='ftjob-aBnmpv0UuxrIAMvAkM6VgJtJ', created_at=1767977183, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(batch_size='auto', learning_rate_multiplier='auto', n_epochs='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-8nrDVgmt9ALXCrhxTOd5dSgu', result_files=[], seed=503637790, status='validating_files', trained_tokens=None, training_file='file-E9BWHPeKXGAQ7ZTC6ydiQz', validation_file=None, estimated_finish=None, integrations=[], metadata=None, method=Method(type='supervised', dpo=None, reinforcement=None, supervised=SupervisedMethod(hyperparameters=SupervisedHyperparameters(batch_size='auto', learning_rate_multiplier='auto', n_epochs='auto'))), user_provided_suffix=None, usage_metrics=None, shared_with_openai=False, eval_id=None)

In [28]:
client.fine_tuning.jobs.list()



SyncCursorPage[FineTuningJob](data=[FineTuningJob(id='ftjob-aBnmpv0UuxrIAMvAkM6VgJtJ', created_at=1767977183, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(batch_size=1, learning_rate_multiplier=2.0, n_epochs=9), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-8nrDVgmt9ALXCrhxTOd5dSgu', result_files=[], seed=503637790, status='validating_files', trained_tokens=None, training_file='file-E9BWHPeKXGAQ7ZTC6ydiQz', validation_file=None, estimated_finish=None, integrations=[], metadata=None, method=Method(type='supervised', dpo=None, reinforcement=None, supervised=SupervisedMethod(hyperparameters=SupervisedHyperparameters(batch_size=1, learning_rate_multiplier=2.0, n_epochs=9))), user_provided_suffix=None, usage_metrics=None, shared_with_openai=False, eval_id=None)], has_more=False, object='list')

In [33]:
client.fine_tuning.jobs.retrieve('ftjob-aBnmpv0UuxrIAMvAkM6VgJtJ').status

'running'

In [30]:
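# With openai>=1.0, chat completions are called through the client object
# (openai.ChatCompletion.create was removed in 1.0)
completion = client.chat.completions.create(
  model="ft:gpt-3.5-turbo-0613:personal::8B9RKvBF", # Change for real model 
  messages=[
    # Use the "user" role so the request matches how the training examples were written
    {"role": "user", "content": prompt('Google is my favorite company')},
  ]
)
# The output shown below was printed by the older (pre-1.0) SDK, which displayed the message as a dict
print(completion.choices[0].message)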

{
  "role": "assistant",
  "content": "Google"
}


## Let's put our economist hats on now

- What can we do with this?
- Consider the cost of making a query
- Consider that you would want to make this work for a large dataset to output a new variable
- What could you use this for?
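
A back-of-the-envelope cost check is worth doing before running a whole dataset through the API. The sketch below is only arithmetic: the per-token prices and token counts are placeholders, NOT real quotes, so substitute the current numbers from the pricing page and the `usage` field of a few test requests.

```python
def estimate_cost(n_rows, input_tokens_per_row, output_tokens_per_row,
                  input_price_per_million, output_price_per_million):
    """Rough dollar cost of sending one request per row of a dataset."""
    input_cost = n_rows * input_tokens_per_row * input_price_per_million / 1_000_000
    output_cost = n_rows * output_tokens_per_row * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Placeholder numbers for illustration only
cost = estimate_cost(
    n_rows=5_000,                    # rows in the dataset
    input_tokens_per_row=200,        # prompt plus text, measured on a few test rows
    output_tokens_per_row=30,        # a short JSON reply
    input_price_per_million=2.50,    # hypothetical dollars per million input tokens
    output_price_per_million=10.00,  # hypothetical dollars per million output tokens
)
print(f"Estimated cost: ${cost:.2f}")
```

Output tokens are typically priced higher than input tokens, so asking for short, structured replies keeps the bill down.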


## Example

You are studying whether jobs that offer "flexible hours" pay less (a compensating differential). You have scraped 5,000 job descriptions, but the text is messy.

Regex fails: Searching for "flexible" returns false positives like "Must be flexible with changing priorities" (which is bad) vs. "Flexible working hours" (which is good).
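
A quick illustration of the regex problem, using made-up job descriptions. The pattern cannot tell schedule flexibility apart from a demand that the worker be flexible.

```python
import re

# Made-up job descriptions for illustration
descriptions = [
    "Must be flexible with changing priorities.",   # no schedule flexibility offered
    "Flexible working hours and remote options.",   # real schedule flexibility
    "9-5 strict schedule.",
]

pattern = re.compile(r"flexible", re.IGNORECASE)
flags = [bool(pattern.search(text)) for text in descriptions]

for text, flag in zip(descriptions, flags):
    print(int(flag), text)
```

The first description is flagged even though it offers no flexible schedule, which is exactly the kind of false positive that would bias the estimated wage differential.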

RA capability: Hiring a Research Assistant to read 5,000 descriptions is slow and inconsistent.

You use the OpenAI API to read the text and extract two specific variables:

is_flexible_schedule (Binary: 0 or 1)

required_education_level (Categorical: High School, BA, MA, PhD)

In [50]:
import os
import json
import pandas as pd
from openai import OpenAI

# Initialize client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# 1. THE DATA: A sample of messy job descriptions
data = {
    'id': [101, 102, 103],
    'text': [
        "We are looking for a rockstar coder. Must be flexible with changing deadlines. PhD preferred.",
        "Role: Data Analyst. We offer true work-life balance with flexible start times. Bachelor's degree required.",
        "Junior Clerk. 9-5 strict schedule. High School diploma or equivalent."
    ]
}
df = pd.DataFrame(data)

# 2. THE PROMPT: Designed to force structured JSON output
# Microeconomists love JSON because it converts directly to a table.
def extract_variables(job_description):
    system_prompt = """
    You are a strictly logical research assistant. 
    Analyze the job description and extract data into JSON format.
    
    Rules:
    1. 'flexible_schedule': 1 if the job offers flexible working hours, 0 otherwise. 
       (Note: 'flexible with deadlines' does NOT count).
    2. 'education': Categorize as 'High School', 'BA', 'MA', 'PhD', or 'None'.
    
    Output ONLY valid JSON.
    """
    
    user_prompt = f"Job Description: {job_description}"
    
    try:
        response = client.responses.create(
            model="gpt-4o",
            instructions=system_prompt,
            input=user_prompt,
        )
        return response.output_text
    except Exception as e:
        # Report the failure instead of silently dropping the row
        print(f"Request failed: {e}")
        return None

# 3. APPLY TO DATA (The "Data Work")
print("Extracting data from text...")
# In a real project, you would batch this or use parallel processing
df['llm_output'] = df['text'].apply(extract_variables)

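# 4. PARSE JSON TO COLUMNS
# Convert the string JSON into actual columns for your regression
# Strip only a surrounding Markdown code fence, so the word "json" inside the data is left alone
cleaned = df['llm_output'].str.strip().str.replace(r"^`{3}(?:json)?\s*|\s*`{3}$", "", regex=True)
df_parsed = pd.json_normalize(cleaned.apply(json.loads).tolist())
df_final = pd.concat([df, df_parsed], axis=1)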

# Display the clean dataset ready for Stata/R
print(df_final[['id', 'flexible_schedule', 'education']])


Extracting data from text...
    id  flexible_schedule    education
0  101                  0          PhD
1  102                  1           BA
2  103                  0  High School


In [49]:
json.loads(df['llm_output'].str.replace("json", "").str.replace("```", "").iloc[0])

{'flexible_schedule': 0, 'education': 'PhD'}