# In-Class Assignment: LLMs for Labor Market Analysis

**Course:** Data Science for Economists
**Topic:** OpenAI API, Prompt Engineering, and Data Structuring

## Background
As economists, we often deal with unstructured text data: news reports, central bank minutes, or job postings. In this assignment, you will simulate a workflow where you analyze "Gig Economy" job advertisements to extract structured data (wages, companies, and requirements) using the OpenAI API.

## Objectives
1.  **Cost Estimation:** Calculate the potential cost of running a large-scale analysis.
2.  **Sentiment Analysis:** Determine the tone of economic news headlines.
3.  **Entity Extraction:** Parse unstructured job descriptions into JSON data for regression analysis.
4.  **Fine-Tuning Prep:** Convert a dataset of job postings into the `jsonl` format required to train a specialized model.

In [1]:
# Setup
import openai
import pandas as pd
import json
import os
from dotenv import load_dotenv

# Load API Key securely (as discussed in lecture)
# Ensure you have your .env file in the same directory
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# If you do not have a key for this session, use the mock data provided in later cells
# to practice the data manipulation logic.

### Part 1: The Economics of Using LLMs (Cost Estimation)

Before running a model on 1 million rows of data, an economist must estimate the cost.

**Scenario:** You have scraped **10,000** job descriptions.
* Average length per description: **300 tokens** (Input)
* Desired output (JSON extraction): **50 tokens** (Output)

Using the pricing table from the lecture (prices in USD per 1,000 tokens):
* **GPT-4:** Input $0.03, Output $0.06
* **GPT-3.5-Turbo:** Input $0.0015, Output $0.002

**Task:** Calculate the total cost to process this dataset for both models.
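
As a hint for the arithmetic, the sketch below shows one way to structure the calculation: input and output tokens are billed separately, and prices are quoted per 1,000 tokens. The function name `estimate_cost` and all numbers in the example call are hypothetical (they are deliberately not the assignment's scenario), so treat this as an illustration rather than the required solution.

```python
def estimate_cost(num_docs, in_tokens, out_tokens, in_price_per_1k, out_price_per_1k):
    """Return the total cost in dollars for processing a batch of documents."""
    # Total tokens of each type, converted to units of 1,000 tokens
    input_cost = num_docs * in_tokens / 1000 * in_price_per_1k
    output_cost = num_docs * out_tokens / 1000 * out_price_per_1k
    return input_cost + output_cost

# Hypothetical scenario: 2,000 documents, 500 input and 100 output tokens each,
# at made-up prices of $0.01 (input) and $0.02 (output) per 1,000 tokens
example_cost = estimate_cost(2000, 500, 100, 0.01, 0.02)
print(f"Example cost: ${example_cost:.2f}")  # Example cost: $14.00
```

Note that output tokens are usually priced higher than input tokens, so the ratio of output to input length matters for the total.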

In [2]:
# Variables
num_docs = 10000
input_tokens_per_doc = 300
output_tokens_per_doc = 50

# Pricing (per 1000 tokens)
gpt4_in_price = 0.03
gpt4_out_price = 0.06
gpt35_in_price = 0.0015
gpt35_out_price = 0.002

# TODO: Calculate total cost for GPT-4
gpt4_total_cost = 0 # Replace with formula

# TODO: Calculate total cost for GPT-3.5
gpt35_total_cost = 0 # Replace with formula

print(f"Projected Cost (GPT-4): ${gpt4_total_cost}")
print(f"Projected Cost (GPT-3.5): ${gpt35_total_cost}")

Projected Cost (GPT-4): $0
Projected Cost (GPT-3.5): $0


### Part 2: Prompt Engineering for Structured Data Extraction

We want to convert unstructured text into data we can run a regression on.
We need to extract: **Company**, **Hourly Wage** (if present), and **Sign-on Bonus** (boolean).

**Task:**
1.  Review the `job_postings` list below.
2.  Design a prompt that forces the model to return **RFC8259 compliant JSON**.
3.  The JSON keys must be: `company` (str), `wage` (int or null), `bonus` (bool).
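
Even with a strict prompt, it can help to check each reply against the schema before it enters a regression dataset. The following is a minimal sketch under assumptions: `parse_extraction` and `REQUIRED_KEYS` are hypothetical names, the sample reply is invented, and the fence-stripping step assumes the model sometimes wraps its JSON in a Markdown code fence.

```python
import json

REQUIRED_KEYS = {"company", "wage", "bonus"}
FENCE = "`" * 3  # Markdown code fence marker (three backticks)

def parse_extraction(reply: str) -> dict:
    """Parse a model reply into a dict and check it against the expected schema."""
    text = reply.strip()
    # Remove an optional Markdown code fence and its "json" language tag
    # (str.removeprefix requires Python 3.9+)
    if text.startswith(FENCE):
        text = text.strip("`").strip().removeprefix("json").strip()
    record = json.loads(text)
    if not isinstance(record, dict) or set(record) != REQUIRED_KEYS:
        raise ValueError(f"Unexpected structure: {record!r}")
    if not isinstance(record["company"], str):
        raise ValueError("company must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly
    wage = record["wage"]
    if wage is not None and (isinstance(wage, bool) or not isinstance(wage, int)):
        raise ValueError("wage must be an int or null")
    if not isinstance(record["bonus"], bool):
        raise ValueError("bonus must be a boolean")
    return record

# Hypothetical reply wrapped in a code fence
sample_reply = FENCE + 'json\n{"company": "Acme Couriers", "wage": 15, "bonus": true}\n' + FENCE
print(parse_extraction(sample_reply))
```

Raising on a schema mismatch, rather than silently coercing values, keeps bad rows out of the DataFrame and makes prompt failures visible early.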

In [None]:
job_postings = [
    "Drive for Uber! Earn up to $25/hour driving your own car. Sign up today.",
    "DoorDash is looking for delivery drivers. Get a $100 bonus after your first 10 deliveries.",
    "TaskRabbit cleaner wanted. $20 per hour. Experience required.",
    "Join the Lyft team. Flexible hours, great community, competitive pay."
]

# TODO: Fill in the prompt template
# Remember the lecture tip: "Prime the output" by giving examples or strict formats
system_prompt = """
    """
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

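# Test loop (calls the API)
results = []

for post in job_postings:
    try:
        response = client.responses.create(
            model="gpt-4o",
            instructions=system_prompt,
            input=post,
        )
        content = response.output_text.strip()
        # Drop an optional Markdown code fence and its "json" language tag
        # (str.removeprefix requires Python 3.9+)
        content = content.strip("`").strip().removeprefix("json").strip()
        results.append(json.loads(content))
    except Exception as e:
        # Skip this posting on API or JSON errors instead of parsing stale content
        print(f"Error processing {post!r}: {e}")

# View the structured data
pd.DataFrame(results)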

### Part 3: Preparing Data for Fine-Tuning

Sometimes the base models (GPT-3.5/4) fail to understand niche economic jargon or specific formatting requirements. In those cases, we fine-tune a model.

To fine-tune, OpenAI requires data in a specific `jsonl` format:
```json
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

**Task:**
You have a pandas DataFrame of manually labeled data (`df_labeled`). Write a function to convert this dataframe into a list of formatted dictionaries ready to be saved as a `.jsonl` file.
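
For reference, a `.jsonl` file is plain text with one JSON object per line. The sketch below shows one way to save and reload such a file once you have a list of dictionaries. The helper names `write_jsonl` and `read_jsonl`, the file name, and the example record are hypothetical and are not part of the lecture code.

```python
import json
import os
import tempfile

def write_jsonl(records, path):
    """Write each record as a single JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a .jsonl file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical record in the "messages" format
example_records = [
    {"messages": [
        {"role": "system", "content": "Output JSON only."},
        {"role": "user", "content": "Acme Couriers pays $15/hr."},
        {"role": "assistant", "content": '{"company": "Acme Couriers", "wage": 15, "bonus": false}'},
    ]}
]

# Round-trip through a temporary file to confirm nothing is lost
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_training.jsonl")
    write_jsonl(example_records, path)
    round_trip = read_jsonl(path)

print(round_trip == example_records)  # True
```

Because `json.dumps` never emits a raw newline inside a string, each record stays on exactly one line.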

In [4]:
# Mock labeled data (Ground Truth)
data = {
    "raw_text": [
        "Instacart shopper: $18/hr guaranteed.", 
        "Walk dogs with Rover. Earn $20 per walk.", 
        "Fiverr freelance designer needed."
    ],
    "ideal_extraction": [
        '{"company": "Instacart", "wage": 18, "bonus": false}',
        '{"company": "Rover", "wage": 20, "bonus": false}',
        '{"company": "Fiverr", "wage": null, "bonus": false}'
    ]
}

df_labeled = pd.DataFrame(data)

# TODO: Create the training list
# Iterate through 'df_labeled' and create a list of dictionaries named 'training_data'.
# Each dictionary must match the "messages" format required by OpenAI.
# Refer to the lecture notes section "Getting data into the correct form".

training_data = []

system_prompt = "You are a data extraction bot. Output JSON only."

# --- WRITE YOUR LOOP BELOW ---


# --- END YOUR CODE ---

# Check the first entry
if training_data:
    print(json.dumps(training_data[0], indent=2))
else:
    print("training_data is empty. Implement the loop above.")

training_data is empty. Implement the loop above.


### Part 4: Validation
As covered in the lecture, before uploading to OpenAI, we must validate that our data structure is correct to avoid failed training jobs.

**Task:**
Add comments to the `validate_gpt` function below (provided from the lecture notes) and explain what it does.
Then use it to check your `training_data` list.

In [5]:
from collections import defaultdict

def validate_gpt(dataset):
    # Format error checks
    format_errors = defaultdict(int)

    for ex in dataset:
        if not isinstance(ex, dict):
            format_errors["data_type"] += 1
            continue
            
        messages = ex.get("messages", None)
        if not messages:
            format_errors["missing_messages_list"] += 1
            continue
            
        for message in messages:
            if "role" not in message or "content" not in message:
                format_errors["message_missing_key"] += 1
            
            if any(k not in ("role", "content", "name", "function_call") for k in message):
                format_errors["message_unrecognized_key"] += 1
            
            if message.get("role", None) not in ("system", "user", "assistant", "function"):
                format_errors["unrecognized_role"] += 1
                
            content = message.get("content", None)
            
            if (not content) or not isinstance(content, str):
                format_errors["missing_content"] += 1
        
        if not any(message.get("role", None) == "assistant" for message in messages):
            format_errors["example_missing_assistant_message"] += 1

    if format_errors:
        print("Found errors:")
        for k, v in format_errors.items():
            print(f"{k}: {v}")
    else:
        print("No errors found. Ready for upload!")

# Run validation
validate_gpt(training_data)

No errors found. Ready for upload!
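
Because `validate_gpt` only counts problems in the examples it iterates over, an empty list also produces the success message. A small guard can make that case explicit. This is a sketch under assumptions: `validate_nonempty` and the stand-in validator `count_examples` are hypothetical names used for illustration. In the notebook you would pass `validate_gpt` as the validator instead.

```python
def validate_nonempty(dataset, validator):
    """Run `validator` only if there is at least one example to check."""
    if not dataset:
        print("Dataset is empty: nothing to validate yet.")
        return False
    validator(dataset)
    return True

# Stand-in validator for illustration only
def count_examples(dataset):
    print(f"Checked {len(dataset)} example(s).")

validate_nonempty([], count_examples)                   # reports the empty dataset
validate_nonempty([{"messages": []}], count_examples)   # delegates to the validator
```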
