## The fundamental problem with JSON and Dictionaries

Let's say we have a simple JSON object that we want to work with. We can use the `json` module to load it into a dictionary, but then we have to check by hand that every field is present and has the right type. For example, suppose we have records that look like this:


In [2]:
data = [{"first_name": "Jason", "age": 10}, {"firstName": "Jason", "age": "10"}]

Each record is meant to have a `first_name` field, which is a string, and an `age` field, which is an integer. However, a plain dictionary gives us no way of knowing whether the data is valid. The age could be a string or a float, the name could be a list instead of a string, or the name could be stored under a different key, as with `firstName` in the second record.


In [3]:
for obj in data:
    name = obj.get("first_name")
    age = obj.get("age")
    print(f"{name} is {age}")

for obj in data:
    name = obj.get("first_name")
    age = obj.get("age")
    print(f"Next year {name} will be {age+1} years old")

Jason is 10
None is 10
Next year Jason will be 11 years old


TypeError: can only concatenate str (not "int") to str

As you can see, we were able to program with dictionaries, but nothing caught the invalid data: the mismatched key silently produced `None`, and the string age only raised a `TypeError` at runtime. To avoid this we would have to check every key and every type by hand. This is a pain, and we can do better.
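To make the cost of checking by hand concrete, here is a minimal sketch of what manual validation of these records could look like. The helper name `check_person` and the wording of the problem messages are made up for this illustration.

```python
data = [{"first_name": "Jason", "age": 10}, {"firstName": "Jason", "age": "10"}]


def check_person(obj: dict) -> list[str]:
    """Return the problems found in one record (an empty list means it looks valid)."""
    problems = []

    name = obj.get("first_name")
    if not isinstance(name, str):
        problems.append("first_name is missing or not a string")

    age = obj.get("age")
    # bool is a subclass of int in Python, so it has to be excluded explicitly
    if not isinstance(age, int) or isinstance(age, bool):
        problems.append("age is missing or not an integer")

    return problems


results = [check_person(obj) for obj in data]
for obj, problems in zip(data, results):
    print(obj, "->", problems or "ok")
```

Every new field needs another block like this, and every nested object needs its own helper, which is the kind of boilerplate a validation library can take over.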


## Pydantic

Pydantic is a library that lets us define data structures as Python classes with type annotations, and then validate data against them.


In [7]:
from pydantic import BaseModel, Field


class Company(BaseModel):
    name: str
    revenue: int


company = Company(name="Apple", revenue=1000000000)
company

Company(name='Apple', revenue=1000000000)

In [8]:
# Data is coerced to the declared type
company = Company.model_validate({"name": "Apple", "revenue": "1000000000"})
company

Company(name='Apple', revenue=1000000000)

In [9]:
assert company.name == "Apple"
# assert company.revenue == 20
assert company.revenue == 1000000000

In [10]:
# Data is validated to get better error messages
company = Company.model_validate({"company_name": "Apple", "revenue": 20})
company

ValidationError: 1 validation error for Company
name
  Field required [type=missing, input_value={'company_name': 'Apple', 'revenue': 20}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/missing

In [11]:
company = Company.model_validate({"name": "Apple", "revenue": 20.4})
company

ValidationError: 1 validation error for Company
revenue
  Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=20.4, input_type=float]
    For further information visit https://errors.pydantic.dev/2.5/v/int_from_float

Introducing Pydantic into a Python codebase brings several benefits: type checking, validation, and editor autocomplete. This is a huge win, because errors are caught where data enters the program rather than somewhere deep inside it. This is even more useful when we rely on language models to generate data for us.

You can also define custom validators that run on the data. For example, you can define a validator that checks that the age is greater than 0.
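As a minimal sketch of such a validator, assuming Pydantic v2 (the version the error URLs in this notebook point to): the `Person` model and the error message below are invented for this example and do not appear elsewhere in the notebook.

```python
from pydantic import BaseModel, ValidationError, field_validator


class Person(BaseModel):
    first_name: str
    age: int

    @field_validator("age")
    @classmethod
    def age_must_be_positive(cls, value: int) -> int:
        # Runs after Pydantic has already coerced the input to an int
        if value <= 0:
            raise ValueError("age must be greater than 0")
        return value


# A numeric string is coerced to int first, then passes the custom check
valid = Person.model_validate({"first_name": "Jason", "age": "10"})
print(valid)

try:
    Person.model_validate({"first_name": "Jason", "age": -1})
    error_count = 0
except ValidationError as exc:
    error_count = exc.error_count()
    print(exc)
```

The validator raises a plain `ValueError`, and Pydantic wraps it in the same `ValidationError` type used for the built-in checks, so calling code only has to handle one exception.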


## Fundamental problem with asking for JSON from OpenAI


In [12]:
from openai import OpenAI
client = OpenAI(api_key="")

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Please give the revenue of AAPL is 1 billion as a json object ```json\n"},
    ],
    n=20,
    temperature=1,
)

for choice in resp.choices:
    # Named `content` rather than `json` to avoid shadowing the standard-library module
    content = choice.message.content
    print(content)
    # try:
    #     company = Company.model_validate_json(content)
    #     print(f"correctly parsed {company=}")
    # except Exception as e:
    #     print("error!!")
    #     print(content)

{
  "revenue": 1000000000
}
{
   "revenue": 1000000000
}
{"revenue": "1 billion"}
{"revenue": "1 billion"}
{
  "revenue": 1000000000
}
{
  "revenue": 1000000000
}
{
  "revenue": 1000000000
}
{
   "revenue": "1 billion"
}
{
  "AAPL": 1000000000
}
{
  "revenue": 1000000000
}
{
  "revenue": 1000000000
}
{
  "revenue": 1000000000
}
{
  "revenue": 1000000000
}
{
  "revenue": 1000000000
}
{
  "revenue": 1000000000
}
{
  "revenue": 1000000000
}
{"revenue": 1000000000}
{
  "revenue": 1000000000
}
{
  "revenue": 1000000000
}
{
  "AAPL": 1000000000
}
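To see what this variation means for downstream code, the three distinct shapes printed above can be validated offline. This is only a sketch: `RevenueOnly` is a hypothetical model introduced here, since the `Company` model from earlier also requires a `name` that none of these responses contain.

```python
from pydantic import BaseModel, ValidationError


class RevenueOnly(BaseModel):
    # Hypothetical model for this illustration only
    revenue: int


# The three distinct response shapes that appear in the output above
samples = [
    '{"revenue": 1000000000}',
    '{"revenue": "1 billion"}',
    '{"AAPL": 1000000000}',
]

outcomes = []
for raw in samples:
    try:
        RevenueOnly.model_validate_json(raw)
        outcomes.append("ok")
    except ValidationError as exc:
        # Record the machine-readable type of the first error
        outcomes.append(exc.errors()[0]["type"])

print(outcomes)
```

Only the first shape parses. The second has the right key but an unparseable value, and the third has the value under an unexpected key.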


## Introduction to Function Calling

The JSON could be anything! We could keep adding instructions to the prompt and hope it works, or we can use [function calling](https://platform.openai.com/docs/guides/function-calling) to specify the schema we want directly.

**Function Calling**

In an API call, you can describe functions and have the model intelligently choose to output a JSON object containing arguments to call one or many functions. The Chat Completions API does not call the function; instead, the model generates JSON that you can use to call the function in your code.
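The last point, that the API only returns arguments and your code makes the call, can be sketched without touching the network. Everything below is illustrative: `record_revenue`, `AVAILABLE_FUNCTIONS`, and the dictionary standing in for the model's function call are assumptions made for this example, not part of the OpenAI SDK.

```python
import json


def record_revenue(name: str, revenue: int, year: int) -> str:
    # Hypothetical local function that the model is allowed to "call"
    return f"{name} reported {revenue} in {year}"


# Registry mapping function names (as described to the model) to real callables
AVAILABLE_FUNCTIONS = {"record_revenue": record_revenue}

# Stand-in for what the model returns: a function name plus a JSON string of arguments
function_call = {
    "name": "record_revenue",
    "arguments": '{"name": "AAPL", "revenue": 1000000000, "year": 2023}',
}


def dispatch(call: dict) -> str:
    """Look up the named function and call it with the decoded arguments."""
    func = AVAILABLE_FUNCTIONS[call["name"]]
    kwargs = json.loads(call["arguments"])
    return func(**kwargs)


result = dispatch(function_call)
print(result)
```

Because the arguments arrive as a string produced by the model, decoding them with `json.loads` alone gives no guarantee about types, which is where validating them against a Pydantic model comes in.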


In [13]:
import datetime


class RevenueThisYear(BaseModel):
    name: str
    revenue: int = Field(description="The Revenue of the company in dollars")
    year: int  # Using int for year for simplicity


schema = {
    "properties": {
        "name": {"type": "string"},
        "revenue": {"type": "integer"},
        "year": {"type": "integer"},
    },
    "required": ["name", "revenue", "year"],
    "type": "object",
}

# Example text
text = "Apple reported a revenue of 1 billion dollars in 2023."


resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": "AAPL reported a revenue of 1 billion dollars in 2023.",
        },
    ],
    functions=[{"name": "Company", "parameters": schema}],
    function_call="auto",
)

RevenueThisYear.model_validate_json(resp.choices[0].message.function_call.arguments)

RevenueThisYear(name='AAPL', revenue=1000000000, year=2023)

In [14]:
text = """The net income of the Eli Lilly and Company for the year 2021 is 55817000000.
This information can be found in the provided financial statements under the "Consolidated Statements of Operations" section."""

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {
            "role": "user",
            "content": text,
        },
    ],
    functions=[{"name": "Company", "parameters": schema}],
    function_call="auto",
)
RevenueThisYear.model_validate_json(resp.choices[0].message.function_call.arguments)

RevenueThisYear(name='Eli Lilly and Company', revenue=55817000000, year=2021)

It turns out that Pydantic does more than parse and validate the response: a model can also generate the JSON schema for us, including any documentation we attach to its fields!


In [18]:
RevenueThisYear.model_json_schema()

{'properties': {'name': {'title': 'Name', 'type': 'string'},
  'revenue': {'description': 'The Revenue of the company in dollars',
   'title': 'Revenue',
   'type': 'integer'},
  'year': {'title': 'Year', 'type': 'integer'}},
 'required': ['name', 'revenue', 'year'],
 'title': 'RevenueThisYear',
 'type': 'object'}
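Since the generated schema has the same shape as the hand-written `schema` dictionary used earlier, one option is to build the function description directly from the model. This is a sketch of that idea, not necessarily how `instructor` does it internally; the variable name `function_spec` is made up here.

```python
from pydantic import BaseModel, Field


class RevenueThisYear(BaseModel):
    name: str
    revenue: int = Field(description="The Revenue of the company in dollars")
    year: int


# Derive the function description from the model instead of writing the schema by hand
function_spec = {
    "name": RevenueThisYear.__name__,
    "parameters": RevenueThisYear.model_json_schema(),
}

print(function_spec["name"])
print(function_spec["parameters"]["required"])
```

A dictionary like this could be passed in the `functions` list in place of the hand-written one, so the schema sent to the model and the class used to validate the response cannot drift apart.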

We can even define nested schemas and document them with ease.


In [23]:
class Metrics(BaseModel):
    ceo: str = Field(description="Name of the CEO")
    current_price: int
    volume: int
    pe_ratio: int



class CompanyMetrics(Company):
    metrics: Metrics


CompanyMetrics.model_json_schema()

{'$defs': {'Metrics': {'properties': {'ceo': {'description': 'Name of the CEO',
     'title': 'Ceo',
     'type': 'string'},
    'current_price': {'title': 'Current Price', 'type': 'integer'},
    'volume': {'title': 'Volume', 'type': 'integer'},
    'pe_ratio': {'title': 'Pe Ratio', 'type': 'integer'}},
   'required': ['ceo', 'current_price', 'volume', 'pe_ratio'],
   'title': 'Metrics',
   'type': 'object'}},
 'properties': {'name': {'title': 'Name', 'type': 'string'},
  'revenue': {'title': 'Revenue', 'type': 'integer'},
  'metrics': {'$ref': '#/$defs/Metrics'}},
 'required': ['name', 'revenue', 'metrics'],
 'title': 'CompanyMetrics',
 'type': 'object'}

These simple concepts are what we built into `instructor`, and most of the work has gone into documenting how to leverage schema engineering.
On top of that, `instructor.patch()` adds a number of capabilities to the OpenAI SDK.


Example of using `patch`:

In [16]:
import instructor
import datetime

# patch the client to add `response_model` to the `create` method
client = instructor.patch(OpenAI(api_key=""), mode=instructor.Mode.MD_JSON)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[
        {
            "role": "user",
            "content": f"""
            Extract the following information from this statement: Apple Inc.,
            under the leadership of CEO Tim Cook, reporting revenue of 1 billion strong.
            As of now, the stock is trading at 150 dollars, with a trading volume of 20 million shares.
            Interestingly, the Price-to-Earnings (P/E) ratio stands at 30, indicating strong investor confidence.
            Today is {datetime.date.today()}
            """,
        },
    ],
    response_model=CompanyMetrics,
)
resp

CompanyMetrics(name='Apple Inc.', revenue=1000000000, metrics=Metrics(ceo='Tim Cook', current_price=150, volume=20000000, pe_ratio=30))

In [20]:
class Company(BaseModel):
    name: str
    revenue: int

class Metrics(BaseModel):
    ceo: str = Field(description="Name of the CEO")
    current_price: int
    volume: int
    pe_ratio: int

class CompanyMetrics(Company):
    metrics: Metrics


text = f"""Today is {datetime.date.today()}

Extract the following information about a company:

Stock Ticker: AAPL.
CEO: Tim Cook
Current Price: 150
Trading Volume: 20000000
P/E Ratio: 30
"""

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {
            "role": "user",
            "content": text,
        },
    ],
    response_model=CompanyMetrics,
)
resp

Retrying, exception: 1 validation error for CompanyMetrics
  Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/json_invalid
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/capstone/lib/python3.11/site-packages/instructor/patch.py", line 309, in retry_sync
    return process_response(
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/capstone/lib/python3.11/site-packages/instructor/patch.py", line 168, in process_response
    model = response_model.from_response(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/capstone/lib/python3.11/site-packages/instructor/function_calls.py", line 175, in from_response
    return cls.model_validate_json(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/capstone/lib/python3.11/

ValidationError: 1 validation error for CompanyMetrics
  Invalid JSON: expected value at line 1 column 1 [type=json_invalid, input_value='My previous response see...ovided information:\n\n', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/json_invalid

In [19]:
from typing import Optional

class Company(BaseModel):
    name: str
    revenue: Optional[int]

class Metrics(BaseModel):
    ceo: str = Field(description="Name of the CEO")
    current_price: int
    volume: int
    pe_ratio: int

class CompanyMetrics(Company):
    metrics: Metrics


text = f"""Today is {datetime.date.today()}

Extract the following information about a company:

Stock Ticker: AAPL.
CEO: Tim Cook
Current Price: 150
Trading Volume: 20000000
P/E Ratio: 30
"""

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {
            "role": "user",
            "content": text,
        },
    ],
    response_model=CompanyMetrics,
)
resp

Retrying, exception: 1 validation error for CompanyMetrics
  Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/json_invalid
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/capstone/lib/python3.11/site-packages/instructor/patch.py", line 309, in retry_sync
    return process_response(
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/capstone/lib/python3.11/site-packages/instructor/patch.py", line 168, in process_response
    model = response_model.from_response(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/capstone/lib/python3.11/site-packages/instructor/function_calls.py", line 175, in from_response
    return cls.model_validate_json(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/capstone/lib/python3.11/

ValidationError: 1 validation error for CompanyMetrics
  Invalid JSON: expected ident at line 1 column 2 [type=json_invalid, input_value="It appears there was a m...rics is as follows:\n\n", input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/json_invalid
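A side note on the `Optional[int]` annotation used in this cell: in Pydantic v2, `Optional[int]` without a default means the field may be `None` but must still be present. The sketch below shows the difference using two hypothetical models. It does not by itself explain the `json_invalid` errors above, which show the model replying with prose instead of JSON.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class RequiredButNullable(BaseModel):
    name: str
    revenue: Optional[int]  # may be None, but the key must be present


class TrulyOptional(BaseModel):
    name: str
    revenue: Optional[int] = None  # the key may be left out entirely


try:
    RequiredButNullable.model_validate({"name": "AAPL"})
    missing_is_error = False
except ValidationError as exc:
    missing_is_error = True
    print(exc)

company = TrulyOptional.model_validate({"name": "AAPL"})
print(company)
```

With the default in place, `revenue` also drops out of the `required` list in the generated JSON schema, so the model is no longer told that it must supply a value the text does not contain.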

## Named Entity Recognition is easy with function calling

### Keyword and entity extraction with the traditional approach, using spaCy:

In [35]:
texts = ["Hi, my name is Yifei Gu and I am currnetly studying in JHU and I am a student of the Data Science program.",
        "Hi, my name is Yifei Gu and I am currnetly studying in Jhu and I am a student of the Data Science program.",
        "Hi, my name is Yifei Gu and I am currnetly studying in Johns Hopkins and I am a student of the DS program.",
        "John Miller is a senior lecturer in the Department of Applied Mathematics and Statistics at JHU"]

import spacy

# Load the spaCy language model once, rather than on every call
nlp = spacy.load("en_core_web_lg")


def extract_entities(text):
    # Process the input text using spaCy
    doc = nlp(text)
    # Collect each entity's text and label from the processed document
    return [(ent.text, ent.label_) for ent in doc.ents]


for text in texts:
    entities = extract_entities(text)
    for entity, label in entities:
        print(f"Entity: {entity}, Label: {label}")

Entity: Yifei Gu, Label: PERSON
Entity: currnetly, Label: PERSON
Entity: Data Science, Label: ORG
Entity: Yifei Gu, Label: PERSON
Entity: currnetly, Label: PERSON
Entity: Data Science, Label: ORG
Entity: Yifei Gu, Label: PERSON
Entity: currnetly, Label: PERSON
Entity: Johns Hopkins, Label: ORG
Entity: DS, Label: PRODUCT
Entity: John Miller, Label: PERSON
Entity: the Department of Applied Mathematics and Statistics, Label: ORG


### With function calling:

In [34]:
schema = {
    "properties": {
        "name": {"type": "string"},
        "school": {"type": "string"},
        "program": {"type": "string"},
    },
    "required": ["name", "school", "program"],
    "type": "object",
}

for text in texts:
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {
                "role": "user",
                "content": text,
            },
        ],
        functions=[{"name": "myExtractor", "parameters": schema}],
        function_call="auto",
    )
    print(text, resp.choices[0].message.function_call)


Hi, my name is Yifei Gu and I am currnetly studying in JHU and I am a student of the Data Science program. FunctionCall(arguments='{"name":"Yifei Gu","school":"JHU","program":"Data Science"}', name='myExtractor')
Hi, my name is Yifei Gu and I am currnetly studying in Jhu and I am a student of the Data Science program. FunctionCall(arguments='{"name":"Yifei Gu","school":"JHU","program":"Data Science"}', name='myExtractor')
Hi, my name is Yifei Gu and I am currnetly studying in Johns Hopkins and I am a student of the DS program. FunctionCall(arguments='{"name":"Yifei Gu","school":"Johns Hopkins","program":"DS"}', name='myExtractor')
John Miller is a senior lecturer in the Department of Applied Mathematics and Statistics at JHU FunctionCall(arguments='{"name":"John Miller","school":"JHU","program":"Department of Applied Mathematics and Statistics"}', name='myExtractor')


## Classifications

In [22]:
import instructor
from openai import OpenAI

from enum import Enum
from pydantic import BaseModel, Field
from typing_extensions import Literal


client = instructor.patch(OpenAI(api_key=""))


class House(Enum):
    Gryffindor = "gryffindor"
    Hufflepuff = "hufflepuff"
    Ravenclaw = "ravenclaw"
    Slytherin = "slytherin"


class Character(BaseModel):
    age: int
    name: str
    house: House

    def say_hello(self):
        print(
            f"Hello, I'm {self.name}, I'm {self.age} years old and I'm from {self.house.value.title()}"
        )


resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Harry Potter"}],
    response_model=Character,
)
resp.model_dump()

{'age': 17, 'name': 'Harry Potter', 'house': <House.Gryffindor: 'gryffindor'>}

In [23]:
class Character(BaseModel):
    age: int
    name: str
    house: Literal["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]


resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Harry Potter"}],
    response_model=Character,
)
resp.model_dump()

{'age': 11, 'name': 'Harry Potter', 'house': 'Gryffindor'}

In [24]:
resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "John Miller"}],
    response_model=Character,
)
resp.model_dump()

{'age': 0, 'name': 'John Miller', 'house': 'Gryffindor'}

In [25]:
resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Yifei Gu"}],
    response_model=Character,
)
resp.model_dump()

{'age': 11, 'name': 'Yifei Gu', 'house': 'Ravenclaw'}

In [26]:
class Stock(BaseModel):
    ticker: str
    industry: Literal["Retail", "Tech", "Utility", "Healthcare", "Finance"]


resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "GOOG"}],
    response_model=Stock,
)
resp.model_dump()

{'ticker': 'GOOG', 'industry': 'Tech'}
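The `Literal` annotation is what turns this into a classification task: any label outside the listed set fails validation. The following offline sketch shows that behaviour; the `XOM` / `Energy` record is invented for the illustration and was not produced by the model.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class Stock(BaseModel):
    ticker: str
    industry: Literal["Retail", "Tech", "Utility", "Healthcare", "Finance"]


# A label from the allowed set validates
ok = Stock.model_validate({"ticker": "GOOG", "industry": "Tech"})
print(ok)

# A label outside the allowed set is rejected
try:
    Stock.model_validate({"ticker": "XOM", "industry": "Energy"})
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc)
```

The allowed values also appear as an `enum` in the generated JSON schema, so the model sees the label set up front and validation catches any response that strays from it.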

In [8]:
from typing import Iterable


resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "TOP 10 Largest company by market cap"}],
    response_model=Iterable[Stock],
)

for company in resp:
    print(company.ticker, company.industry)

AAPL Tech
MSFT Tech
GOOGL Tech
AMZN Retail
TSLA Tech
BRK.A Finance
META Tech
TSM Tech
NVDA Tech
JPM Finance
