# OpenAI JSON Mode vs. Function Calling for Data Extraction 

OpenAI just released [JSON Mode](https://platform.openai.com/docs/guides/text-generation/json-mode). This new setting constrains the LLM to generate only strings that parse as valid JSON, but it does not guarantee that the output validates against any particular schema.

Before this, the best way to extract structured data from text was via [function calling](https://platform.openai.com/docs/guides/function-calling).

In this notebook, we explore the tradeoffs between the new [JSON Mode](https://platform.openai.com/docs/guides/text-generation/json-mode) and function calling for structured output and data extraction.

*Update*: OpenAI has clarified that JSON mode is always enabled for function calling; it is opt-in for regular messages (https://community.openai.com/t/json-mode-vs-function-calling/476994/4).

### Generate synthetic data

We'll start by generating some synthetic data for our data extraction task. Let's ask our LLM for a hypothetical sales transcript.

In [6]:
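import os

from dotenv import load_dotenv
from llama_index.llms import OpenAI
from rich import print

# Load environment variables (including OPENAI_API_KEY) from a .env file,
# letting values in the file override any already set in the environment.
load_dotenv(override=True)

llm = OpenAI(model="gpt-3.5-turbo-1106", api_key=os.environ["OPENAI_API_KEY"])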


In [4]:
response = llm.complete(
    "Generate a sales call transcript, use real names, talk about a product, discuss some action items"
)

In [7]:
transcript = response.text
print(transcript)

### Set up our desired schema

Let's specify our desired output "shape" as a Pydantic model.

In [9]:
from pydantic import BaseModel, Field
from typing import List


class CallSummary(BaseModel):
    """Data model for a call summary."""

    summary: str = Field(
        description="High-level summary of the call transcript. Should not exceed 3 sentences."
    )
    products: List[str] = Field(
        description="List of products discussed in the call"
    )
    rep_name: str = Field(description="Name of the sales rep")
    prospect_name: str = Field(description="Name of the prospect")
    action_items: List[str] = Field(description="List of action items")

### Data extraction with function calling

We can use the `OpenAIPydanticProgram` module in LlamaIndex to make this easy: define a prompt template, then pass in the LLM and the Pydantic model we've defined.

In [10]:
from llama_index.program import OpenAIPydanticProgram
from llama_index.prompts import ChatPromptTemplate, ChatMessage

In [11]:
prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts."
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)
program = OpenAIPydanticProgram.from_defaults(
    output_cls=CallSummary,
    llm=llm,
    prompt=prompt,
    verbose=True,
)

In [12]:
output = program(transcript=transcript)

Function call: CallSummary with args: {"summary":"Sarah from XYZ Company called John to discuss the new product, the XYZ Widget, and scheduled a demo for next week. John expressed interest in learning more about the product and requested case studies and success stories. They also discussed next steps for implementing the product in John's business.","products":["XYZ Widget"],"rep_name":"Sarah","prospect_name":"John","action_items":["Schedule demo for next week","Share case studies and additional information"]}


We now have the desired structured data as a Pydantic model.
A quick inspection shows that the results are as expected.

In [14]:
output.model_dump()

{'summary': "Sarah from XYZ Company called John to discuss the new product, the XYZ Widget, and scheduled a demo for next week. John expressed interest in learning more about the product and requested case studies and success stories. They also discussed next steps for implementing the product in John's business.",
 'products': ['XYZ Widget'],
 'rep_name': 'Sarah',
 'prospect_name': 'John',
 'action_items': ['Schedule demo for next week',
  'Share case studies and additional information']}

### Data extraction with JSON mode

Let's try to do the same with JSON mode instead of function calling.

In [25]:
prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts.\n"
                "Generate a valid JSON following the given schema below:\n"
                "{json_schema}"
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)

In [26]:
messages = prompt.format_messages(
    json_schema=CallSummary.model_json_schema(), transcript=transcript
)

In [28]:
CallSummary.model_json_schema()

{'description': 'Data model for a call summary.',
 'properties': {'summary': {'description': 'High-level summary of the call transcript. Should not exceed 3 sentences.',
   'title': 'Summary',
   'type': 'string'},
  'products': {'description': 'List of products discussed in the call',
   'items': {'type': 'string'},
   'title': 'Products',
   'type': 'array'},
  'rep_name': {'description': 'Name of the sales rep',
   'title': 'Rep Name',
   'type': 'string'},
  'prospect_name': {'description': 'Name of the prospect',
   'title': 'Prospect Name',
   'type': 'string'},
  'action_items': {'description': 'List of action items',
   'items': {'type': 'string'},
   'title': 'Action Items',
   'type': 'array'}},
 'required': ['summary',
  'products',
  'rep_name',
  'prospect_name',
  'action_items'],
 'title': 'CallSummary',
 'type': 'object'}
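
The single-quoted output above is a Python dict repr, not JSON text. If the prompt template interpolates `json_schema` with ordinary string formatting (an assumption about how `format_messages` renders its arguments), the model is shown that repr rather than JSON. A minimal sketch of the difference, using a hand-trimmed fragment of the schema and a plain `str.format` call as a stand-in for the template:

```python
import json

# Hand-trimmed fragment of the schema dict shown above (illustration only).
schema = {
    "properties": {
        "rep_name": {"description": "Name of the sales rep", "type": "string"}
    },
    "required": ["rep_name"],
    "type": "object",
}

# Stand-in for the system prompt; a plain string, not a LlamaIndex template.
template = "Generate a valid JSON following the given schema below:\n{json_schema}"

# Passing the dict directly: str.format() falls back to the dict's repr,
# which uses single quotes and is not valid JSON.
as_repr = template.format(json_schema=schema)

# Serializing first: the prompt now contains real JSON text.
as_json = template.format(json_schema=json.dumps(schema, indent=2))

print(as_repr)
print(as_json)
```

Whether serializing the schema this way changes the model's behavior is not tested in this notebook.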

In [29]:
output = llm.chat(
    messages, response_format={"type": "json_object"}
).message.content

We get valid JSON, but it only regurgitates the schema we specified and does not actually perform the extraction.

In [30]:
print(output)

Let's try again by showing the JSON format we want instead of specifying the schema.

In [31]:
import json

prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts.\n"
                "Generate a valid JSON in the following format:\n"
                "{json_example}"
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)

dict_example = {
    "summary": "High-level summary of the call transcript. Should not exceed 3 sentences.",
    "products": ["product 1", "product 2"],
    "rep_name": "Name of the sales rep",
    "prospect_name": "Name of the prospect",
    "action_items": ["action item 1", "action item 2"],
}

json_example = json.dumps(dict_example)
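
The example dict above is written by hand, which means it can drift from the Pydantic model. One possible alternative is to derive a similar example from the JSON schema itself. The sketch below assumes a flat schema containing only strings and arrays of strings; `example_from_schema` is a hypothetical helper, not a LlamaIndex or Pydantic function, and the schema is a trimmed copy of the one printed earlier.

```python
import json

# Trimmed copy of the schema printed by CallSummary.model_json_schema().
schema = {
    "properties": {
        "summary": {
            "description": "High-level summary of the call transcript. Should not exceed 3 sentences.",
            "type": "string",
        },
        "products": {
            "description": "List of products discussed in the call",
            "items": {"type": "string"},
            "type": "array",
        },
        "rep_name": {"description": "Name of the sales rep", "type": "string"},
    },
    "type": "object",
}


def example_from_schema(schema: dict) -> dict:
    """Build a fill-in-the-blank example from a flat JSON schema.

    Hypothetical helper: handles only string fields and arrays of strings.
    """
    example = {}
    for name, spec in schema["properties"].items():
        description = spec.get("description", name)
        if spec.get("type") == "array":
            # Show two placeholder items so the list shape is obvious.
            example[name] = [f"{description} (item 1)", f"{description} (item 2)"]
        else:
            example[name] = description
    return example


derived_example = json.dumps(example_from_schema(schema), indent=2)
print(derived_example)
```

The resulting string could be passed as `json_example` in place of the hand-written one; nested models or non-string types would need extra handling.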

In [32]:
messages = prompt.format_messages(
    json_example=json_example, transcript=transcript
)

In [33]:
output = llm.chat(
    messages, response_format={"type": "json_object"}
).message.content

Now we are able to get the extracted structured data as we expected.

In [35]:
print(output)


### Quick Takeaways

* Function calling remains easier to use for structured data extraction, especially if you have already specified your schema (e.g., as a Pydantic model).
* While JSON mode enforces the format of the output, it does not help with validation against a specified schema. Directly passing in a schema may not generate the expected JSON and may require additional careful formatting and prompting.
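
Because JSON mode only guarantees syntactically valid JSON, its output still needs a validation step before use. A minimal sketch, assuming Pydantic v2 (consistent with the `model_dump` and `model_json_schema` calls above); `parse_call_summary` is a hypothetical helper and both sample strings are written by hand for illustration, not real model output:

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class CallSummary(BaseModel):
    """Same fields as the model above, with descriptions omitted for brevity."""

    summary: str
    products: List[str]
    rep_name: str
    prospect_name: str
    action_items: List[str]


def parse_call_summary(raw_json: str) -> Optional[CallSummary]:
    """Return a CallSummary, or None if the JSON does not match the model."""
    try:
        return CallSummary.model_validate_json(raw_json)
    except ValidationError as err:
        print(f"Schema validation failed with {err.error_count()} error(s)")
        return None


# Hand-written samples: one matching the model, one echoing a schema.
good = (
    '{"summary": "Short call.", "products": ["XYZ Widget"], '
    '"rep_name": "Sarah", "prospect_name": "John", '
    '"action_items": ["Schedule demo"]}'
)
schema_echo = '{"title": "CallSummary", "type": "object", "properties": {}}'

print(parse_call_summary(good))
print(parse_call_summary(schema_echo))
```

Both strings are valid JSON, so JSON mode alone would accept either; only the model validation distinguishes the extraction from the schema echo.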