# Quick Introduction

As you work through the code, pay attention to the errors the model makes;
they illustrate the limitations of this approach.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import pprint

In [3]:
sys.path.insert(0, "../../")

In [4]:
from kor.extraction import Extractor
from kor.nodes import Object, Text, Number
from kor.llms import OpenAIChatCompletion, OpenAICompletion

## LLM

Instantiate an LLM.

In [24]:
llm = OpenAIChatCompletion(model="gpt-3.5-turbo")

Alternatively, you can use a completion model such as `text-davinci-003`: `llm = OpenAICompletion(model='text-davinci-003')`.

Now, create an extractor instance. 

The extractor is responsible for generating a prompt, feeding it into the LLM, and parsing the output.

Here, I'm calling the extractor instance a `model`.

In [25]:
model = Extractor(llm)

## Schema

Kor requires you to specify a `schema` describing what you want parsed, optionally with some examples.

Here, we'll specify that we want to extract *first names* from text.

As part of the schema, we include a `description` of what we're extracting as well as two examples.

Including both a `description` and `examples` will likely improve performance.

In [26]:
schema = Text(
    id="first_name",
    description="The first name of a person",
    examples=[("I am billy.", "billy"), ("John Smith is 33 years old", "John")],
)

## Extract

With a `model` and a `schema` defined, we're ready to extract data.

In [27]:
model("My name is Tom. I am a cat. My best friend is Bobby. He is not a cat.", schema)

{'first_name': ['Tom', 'Bobby']}

In [28]:
model("My name is My name is My name is WOW.", schema)

{'first_name': ['WOW']}

In [29]:
# Note the extraction here. It's unlikely to be reasonable.
model("My name is My name is My name is MOO MOO.", schema)

{'first_name': ['MOO', 'MOO', 'MOO']}

In [30]:
model(
    (
        "My name is Bobby. My brother's name is the same as mine except that it starts"
        " with a `C`."
    ),
    schema,
)

{'first_name': ['Bobby', 'Cobby']}

In [31]:
model("My name is Bobby. My brother's name rhymes with mine.", schema)

{'first_name': ['Bobby']}

## The actual prompt

Here is the actual prompt that is sent to the LLM.

In [32]:
print(model.prompt_generator.format_as_string("user input goes here", schema))

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

{

first_name: string[] // The first name of a person
}

```


For Union types the output must EXACTLY match one of the members of the Union type.

Please enclose the extracted information in HTML style tags with the tag name corresponding to the corresponding component ID. Use angle style brackets for the tags ('>' and '<'). Only output tags when you're confident about the information that was extracted from the user's query. If you can extract several pieces of relevant information from the query, then include all of them. If the type is an array, please repeat the corresponding tag name multiple times once for each relevant extraction. 

Input: I am billy.
Output: <first_name>billy</first_name>
Input: John
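
To make the three steps concrete (build a prompt, call the LLM, parse the tagged reply), here is a minimal sketch. It is not Kor's implementation: `build_prompt`, `parse_tags`, `extract`, and `fake_llm` are hypothetical names, the prompt is heavily simplified, and `fake_llm` is a stand-in that returns a canned reply instead of calling a real model.

```python
import re
from typing import Callable, Dict, List, Sequence, Tuple


def build_prompt(
    attribute_id: str,
    description: str,
    examples: Sequence[Tuple[str, str]],
    user_input: str,
) -> str:
    """Assemble a simplified prompt: type description, examples, then the input."""
    lines = [
        "Extract structured information matching the form below.",
        f"{{ {attribute_id}: string[] // {description} }}",
    ]
    for text, extraction in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: <{attribute_id}>{extraction}</{attribute_id}>")
    lines.append(f"Input: {user_input}")
    lines.append("Output:")
    return "\n".join(lines)


def parse_tags(llm_output: str) -> Dict[str, List[str]]:
    """Collect the text of every <tag>...</tag> pair, grouped by tag name."""
    parsed: Dict[str, List[str]] = {}
    for tag, value in re.findall(r"<(\w+)>(.*?)</\1>", llm_output, flags=re.DOTALL):
        parsed.setdefault(tag, []).append(value.strip())
    return parsed


def extract(llm: Callable[[str], str], prompt: str) -> Dict[str, List[str]]:
    """Send the prompt to the LLM and parse whatever comes back."""
    return parse_tags(llm(prompt))


def fake_llm(prompt: str) -> str:
    # Assumption: a canned reply standing in for a real LLM call.
    return "<first_name>Tom</first_name><first_name>Bobby</first_name>"


prompt = build_prompt(
    "first_name",
    "The first name of a person",
    [("I am billy.", "billy")],
    "My name is Tom. My best friend is Bobby.",
)
result = extract(fake_llm, prompt)
print(result)  # {'first_name': ['Tom', 'Bobby']}
```

Because a tag can repeat, every value ends up in a list, which is why even a single extraction comes back as a one-element list.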

In [34]:
## Gotchas

## Phone Numbers

## Sometimes it might work without examples, but having a few examples is recommended

In [17]:
schema = Text(
    id="phone_number",
    description="Any phone numbers found in the text format should be 9 digit",
)

In [18]:
model(
    (
        "My phone number is (123)-444-9999. I found my true love one on a blue sunday."
        " Her number was (333)1232832"
    ),
    schema,
)

{'phone_number': ['(123)4449999', '(333)1232832']}

In [19]:
schema = Text(
    id="phone_number",
    description="Any phone numbers found in the text format should be 9 digit",
    examples=[("Someone called me from 123-123-1234", "123-123-1234")],
)

In [20]:
model(
    (
        "My phone number is (123)-444-9999. I found my true love one on a blue sunday."
        " Her number was (333)1232832"
    ),
    schema,
)

{'phone_number': ['(123)-444-9999', '(333)1232832']}
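
Since the formatting of the extracted numbers changes with the examples, one option is to normalize them after extraction. The sketch below is my own post-processing step, not part of Kor; `normalize_phone_numbers` is a hypothetical helper that simply strips every non-digit character.

```python
import re
from typing import List


def normalize_phone_numbers(values: List[str]) -> List[str]:
    """Strip every non-digit character so differently formatted numbers compare equal."""
    return [re.sub(r"\D", "", value) for value in values]


# The two extraction results shown above.
without_examples = ["(123)4449999", "(333)1232832"]
with_examples = ["(123)-444-9999", "(333)1232832"]

print(normalize_phone_numbers(without_examples))  # ['1234449999', '3331232832']
print(normalize_phone_numbers(with_examples))  # ['1234449999', '3331232832']
```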

# Extracting Multiple Attributes

Here, examples are specified independently for each attribute.

This is done for convenience and will sometimes work, even though individually specified examples can be contradictory (as is the case for the first and last name below).

In [28]:
schema = Object(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
            examples=[("John Smith went to the store", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person",
            examples=[("John Smith went to the store", "Smith")],
        ),
        Number(
            id="age",
            description="The age of the person in years.",
            examples=[("23 years old", "23"), ("I turned three on sunday", "3")],
        ),
    ],
)

In [29]:
model(
    (
        "My phone number is (123)-444-9999. I found my true love one on a blue sunday."
        " Her number was (333)1232832"
    ),
    schema,
)

{}

## The model sometimes doesn't follow the schema

One would have to add a validation layer on top; Kor does not provide one yet.

In [30]:
model(
    (
        "My name is Bob and my phone number is (123)-444-9999. I found my true love one"
        " on a blue sunday. Her number was (333)1232832"
    ),
    schema,
)

{'personal_info': [{'first_name': ['Bob']},
  {'phone_number': ['(123)-444-9999']},
  {'phone_number': ['(333)1232832']}]}
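
As a rough idea of what such a validation layer could look like, the sketch below drops every key that is not declared in the schema and discards records that end up empty. `validate_records` and `ALLOWED_KEYS` are hypothetical names of mine; nothing here is provided by Kor.

```python
from typing import Any, Dict, List, Set

# Attribute ids declared in the schema above.
ALLOWED_KEYS: Set[str] = {"first_name", "last_name", "age"}


def validate_records(
    result: Dict[str, List[Dict[str, Any]]], object_id: str, allowed: Set[str]
) -> Dict[str, List[Dict[str, Any]]]:
    """Keep only attributes declared in the schema; drop records left empty."""
    cleaned = []
    for record in result.get(object_id, []):
        kept = {key: value for key, value in record.items() if key in allowed}
        if kept:
            cleaned.append(kept)
    return {object_id: cleaned}


# The extraction result shown above, including the hallucinated keys.
raw = {
    "personal_info": [
        {"first_name": ["Bob"]},
        {"phone_number": ["(123)-444-9999"]},
        {"phone_number": ["(333)1232832"]},
    ]
}
validated = validate_records(raw, "personal_info", ALLOWED_KEYS)
print(validated)  # {'personal_info': [{'first_name': ['Bob']}]}
```

This only checks key names. Checking value types (for example, that `age` is numeric) would be a natural next step.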

## But adding more examples helps prevent hallucinations

In [31]:
schema = Object(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
            examples=[("John Smith went to the store", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person",
            examples=[("John Smith went to the store", "Smith")],
        ),
        Number(
            id="age",
            description="The age of the person in years.",
            examples=[("23 years old", "23"), ("I turned three on sunday", "3")],
        ),
    ],
    examples=[
        (
            "John Smith was 23 years old",
            {"first_name": "John", "last_name": "Smith", "age": "23"},
        )
    ],
)

In [33]:
model(
    (
        "My name is Bob and my phone number is (123)-444-9999. I found my true love one"
        " on a blue sunday. Her number was (333)1232832"
    ),
    schema,
)

{'personal_info': [{'first_name': ['Bob']}]}

### What's the actual prompt?

In [35]:
print(model.prompt_generator.format_as_string("user input goes here", schema))

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. 

```TypeScript

personal_info: {
 first_name: string[] // The first name of the person
 last_name: string[] // The last name of the person
 age: number[] // The age of the person in years.
}
```


For Union types the output must EXACTLY match one of the members of the Union type.

Please enclose the extracted information in HTML style tags with the tag name corresponding to the corresponding component ID. Use angle style brackets for the tags ('>' and '<'). Only output tags when you're confident about the information that was extracted from the user's query. If you can extract several pieces of relevant information from the query, then include all of them. If the type is an array, please repeat the corresponding tag name multiple times once for each relevant extraction. 

Input: John Smith was 
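
The nested results above (a list of records under `personal_info`) suggest that the tags in the LLM reply nest as well. Here is a sketch of how such a reply could be parsed. The reply string is an assumed format based on the prompt instructions, and `parse_nested` is a hypothetical helper, not Kor's parser.

```python
import re
from typing import Any, Dict, List

TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", flags=re.DOTALL)


def parse_nested(text: str) -> Dict[str, List[Any]]:
    """Parse <tag>...</tag> pairs; recurse when the content contains tags itself."""
    parsed: Dict[str, List[Any]] = {}
    for tag, content in TAG_PATTERN.findall(text):
        if TAG_PATTERN.search(content):
            value: Any = parse_nested(content)
        else:
            value = content.strip()
        parsed.setdefault(tag, []).append(value)
    return parsed


# Assumption: a reply in the tag format the prompt asks for.
reply = (
    "<personal_info><first_name>John</first_name><last_name>Smith</last_name>"
    "<age>23</age></personal_info>"
)
parsed_reply = parse_nested(reply)
print(parsed_reply)
```

A regex-based approach like this breaks if a tag is nested inside a tag of the same name, so treat it as an illustration only.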

## More complex prompt

Same schema but more complex prompt.

Please note that Alice's age is in days, not in years!

In [None]:
user_input = (
    "Today Alice MacDonald is turning sixty days old. She had blue eyes. "
    "Bob is turning 10 years old. His eyes were bright red. Chris Prass used his "
    "green eyes to look at Dorothy to find 15 year old eyes staring back at him. "
    "Prass was a piano teacher. Dorothy was a certified mechanic. "
    "All certified mechanics have yellow eyes."
)

In [None]:
schema = Object(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
            examples=[("John Smith went to the store", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person",
            examples=[("John Smith went to the store", "Smith")],
        ),
        Number(
            id="age",
            description="The age of the person in years.",
            examples=[("23 years old", "23"), ("I turned three on sunday", "3")],
        ),
    ],
    examples=[
        (
            "John Smith was 23 years old",
            {"first_name": "John", "last_name": "Smith", "age": "23"},
        )
    ],
)

In [None]:
# Note that Alice's age is reported in days above
model(user_input, schema)

# More complex schema

In [44]:
schema = Object(
    id="personalinfo",
    description="Collect information about a person.",
    attributes=[
        Text(
            id="profession",
            description="The person's profession?",
            examples=[
                ("He was a professor", "professor"),
                ("Bob was a lawyer and a politician", ["lawyer", "politician"]),
            ],
        ),
        Text(
            id="first_name",
            description="The person's first name",
            examples=[("Billy was here", "Billy"), ("Bob was very tall", "Bob")],
        ),
        Text(
            id="last_name",
            description="The person's last name",
            examples=[("Joe Donatello was very tall", "Donatello")],
        ),
        Text(
            id="eye_color",
            description="The person's eye color?",
            examples=[("my eyes are green", "green")],
        ),
        Object(
            id="age",
            attributes=[
                Number(
                    id="number",
                    description="what is the person's age?",
                    examples=[("10 years old", 10)],
                ),
                Text(
                    id="unit",
                    description="In which units is the age reported in?",
                    examples=[("10 years old", "years"), ("22 days", "days")],
                ),
            ],
        ),
    ],
    examples=[],
)

In [45]:
user_input = (
    "Today Alice MacDonald is turning sixty days old. She had blue eyes. "
    "Bob is turning 10 years old. His eyes were bright red. Chris Prass used his "
    "green eyes to look at Dorothy to find 15 year old eyes staring back at him. "
    "Prass was a piano teacher. Dorothy was a certified mechanic. "
    "All certified mechanics have yellow eyes."
)

In [48]:
# Note that Alice's age is reported in days above
model(user_input, schema)

{'personalinfo': [{'first_name': ['Alice'],
   'last_name': ['MacDonald'],
   'age': [{'number': ['60'], 'unit': ['days']}],
   'eye_color': ['blue']},
  {'first_name': ['Bob'],
   'age': [{'number': ['10'], 'unit': ['years']}],
   'eye_color': ['bright red']},
  {'first_name': ['Chris'],
   'last_name': ['Prass'],
   'profession': ['piano teacher'],
   'eye_color': ['green']},
  {'first_name': ['Dorothy'],
   'profession': ['certified mechanic'],
   'eye_color': ['yellow']},
  {'age': [{'number': ['15'], 'unit': ['year']}],
   'eye_color': ['staring back']}]}
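
With the unit extracted separately, ages can be converted to a common unit afterwards. The sketch below is a hypothetical post-processing helper (`age_in_years` is my own name). It assumes 365.25 days per year and returns `None` for anything it cannot interpret.

```python
from typing import Dict, List, Optional

# Assumption: an average year length; adjust if you need exact calendar math.
DAYS_PER_YEAR = 365.25
UNIT_IN_YEARS = {"day": 1 / DAYS_PER_YEAR, "year": 1.0}


def age_in_years(age: Dict[str, List[str]]) -> Optional[float]:
    """Convert one extracted {'number': [...], 'unit': [...]} record to years."""
    try:
        number = float(age["number"][0])
        # 'years' and 'year' both become 'year'; 'days' becomes 'day'.
        unit = age["unit"][0].lower().rstrip("s")
        return number * UNIT_IN_YEARS[unit]
    except (KeyError, IndexError, ValueError):
        return None


alice = age_in_years({"number": ["60"], "unit": ["days"]})
bob = age_in_years({"number": ["10"], "unit": ["years"]})
unknown = age_in_years({"number": ["15"], "unit": ["year"]})
print(alice, bob, unknown)
```

Note that the model returned both `years` and `year` as units, which is why the helper normalizes the unit before looking it up.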

In [53]:
FROM_ADDRESS = Object(
    id="from_address",
    description="Person moved away from this address",
    attributes=[
        Text(id="street"),
        Text(id="city"),
        Text(id="state"),
        Text(id="zipcode"),
        Text(id="country", description="A country in the world; e.g., France."),
    ],
    examples=[
        (
            "100 Main St, Boston,MA, 23232, USA",
            {
                "street": "100 Main St",
                "city": "Boston",
                "state": "MA",
                "zipcode": "23232",
                "country": "USA",
            },
        )
    ],
)

In [54]:
TO_ADDRESS = FROM_ADDRESS.replace(
    id="to_address", description="Address to which the person is moving"
)

In [55]:
form = Object(
    id="information",
    attributes=[
        Text(
            id="person_name",
            description="The full name of the person or partial name",
            examples=[("John Smith was here", "John Smith")],
        ),
        FROM_ADDRESS,
        TO_ADDRESS,
    ],
)

In [57]:
model("Alice Doe and Bob Smith moved from New York to Boston", form)

{'information': [{'person_name': ['Alice Doe', 'Bob Smith'],
   'from_address': [{'city': ['New York']}],
   'to_address': [{'city': ['Boston']}]}]}

## LIMITATION! Currently, grouping correctly is difficult due to ambiguity

Every type in Kor can be interpreted as a list, so the model cannot always tell which values belong together.
For now, specify object-level examples to help the model determine how to group things correctly.

In [60]:
form = Object(
    id="information",
    attributes=[
        Text(
            id="person_name",
            description="The full name of the person or partial name",
            examples=[("John Smith was here", "John Smith")],
        ),
        FROM_ADDRESS,
        TO_ADDRESS,
    ],
    examples=[
        (
            "John Smith moved to Boston from New York. Billy moved to LA.",
            [
                {
                    "person_name": "John Smith",
                    "from_address": {"city": "New York"},
                    "to_address": {"city": "Boston"},
                },
                {"person_name": "Billy", "to_address": {"city": "LA"}},
            ],
        )
    ],
)

In [63]:
model(
    (
        "Alice Doe and Bob Smith moved from New York to Boston. Andrew was 12 years"
        " old. He also moved to Boston. So did Joana and Paul. Betty did the opposite."
    ),
    form,
)

{'information': [{'from_address': [{'city': ['New York']}],
   'person_name': ['Alice Doe'],
   'to_address': [{'city': ['Boston']}]},
  {'from_address': [{'city': ['New York']}],
   'person_name': ['Bob Smith'],
   'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Andrew']},
  {'person_name': ['Joana'], 'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Paul'], 'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Betty'],
   'from_address': [{'city': ['Boston']}],
   'to_address': [{'city': ['New York']}]}]}

## Untyped Objects

One does not have to type an object and can instead rely on just the examples.

It's unclear 🤷🤷🤷 whether the quality of the results is affected significantly once one controls for the number of examples.

In [74]:
form = Object(
    id="information",
    attributes=[],
    examples=[
        (
            "John Smith moved to Boston from New York. Billy moved to LA.",
            [
                {
                    "person_name": "John Smith",
                    "from_address": {"city": "New York"},
                    "to_address": {"city": "Boston"},
                },
                {"person_name": "Billy", "to_address": {"city": "LA"}},
            ],
        )
    ],
)

In [75]:
model(
    (
        "Alice Doe and Bob Smith moved from New York to Boston. Andrew was 12 years"
        " old. He also moved to Boston. So did Joana and Paul. Betty did the opposite."
    ),
    form,
)

{'information': [{'from_address': [{'city': ['New York']}],
   'person_name': ['Alice Doe'],
   'to_address': [{'city': ['Boston']}]},
  {'from_address': [{'city': ['New York']}],
   'person_name': ['Bob Smith'],
   'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Andrew'],
   'age': ['12'],
   'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Joana'], 'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Paul'], 'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Betty'], 'to_address': [{'city': ['New York']}]}]}

# Flattened Objects

In [70]:
form = Object(
    id="information",
    attributes=[
        Text(
            id="person_name",
            description="The first name of the person or partial name",
            examples=[("John Smith was here", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person or partial name",
            examples=[("John Smith was here", "Smith")],
        ),
    ],
    examples=[],
    group_as_object=False,  # <-- Please note
)

## Let's build an API!

### Order tickets?

In [81]:
form = Object(
    id="action",
    description="User is looking for sports tickets",
    attributes=[
        Text(
            id="sport",
            description="which sports do you want to buy tickets for?",
            examples=[
                (
                    "I want to buy tickets to basketball and football games",
                    ["basketball", "football"],
                )
            ],
        ),
        Text(
            id="location",
            description="where would you like to watch the game?",
            examples=[
                ("in boston", "boston"),
                ("in france or italy", ["france", "italy"]),
            ],
        ),
        Object(
            id="price_range",
            description="how much do you want to spend?",
            attributes=[],
            examples=[
                ("no more than $100", {"price_max": "100", "currency": "$"}),
                (
                    "between 50 and 100 dollars",
                    {"price_max": "100", "price_min": "50", "currency": "$"},
                ),
            ],
        ),
    ],
)

In [84]:
%%time
model("I want to buy tickets for a baseball game in LA area under $100", form)

CPU times: user 11.8 ms, sys: 4.02 ms, total: 15.8 ms
Wall time: 2.01 s


{'action': [{'sport': ['baseball'],
   'location': ['LA'],
   'price_range': [{'currency': ['$'], 'price_max': ['100']}]}]}

In [85]:
%%time
model(
    (
        "I want to see a celtics game in boston somewhere between 20 and 40 dollars per"
        " ticket"
    ),
    form,
)

CPU times: user 3.24 ms, sys: 459 µs, total: 3.7 ms
Wall time: 2.27 s


{'action': [{'sport': ['basketball'],
   'location': ['boston'],
   'price_range': [{'currency': ['$'],
     'price_max': ['40'],
     'price_min': ['20']}]}]}

### Play some music?

In [88]:
form = Object(
    id="player",
    description=(
        "User is controlling a music player to select songs, pause or start them or play"
        " music by a particular artist."
    ),
    attributes=[
        Text(id="song", description="User wants to play this song", examples=[]),
        Text(id="album", description="User wants to play this album", examples=[]),
        Text(
            id="artist",
            description="Music by the given artist",
            examples=[("Songs by paul simon", "paul simon")],
        ),
        Text(
            id="stop_playing",
            description="STOP if the user wants to stop playing music.",
            examples=[("Please stop the music", "stop"), ("please keep playing", "")],
        ),
        Text(
            id="start_playing",
            description="START if the user wants to play music.",
            examples=[("play something", "start"), ("please stop", "")],
        ),
    ],
)

In [87]:
%%time
model("stop the music now", form)

CPU times: user 11.8 ms, sys: 4.02 ms, total: 15.8 ms
Wall time: 1.3 s


{'player': [{'stop_playing': ['stop']}]}

In [90]:
%%time
model("i want to hear a song", form)

CPU times: user 4.14 ms, sys: 8 µs, total: 4.15 ms
Wall time: 1.04 s


{'player': [{'start_playing': ['start']}]}

In [91]:
%%time
model("can you play the album lion king from the movie", form)

CPU times: user 3.17 ms, sys: 0 ns, total: 3.17 ms
Wall time: 985 ms


{'player': [{'album': ['lion king']}]}

In [93]:
%%time
model("can you play all the songs from paul simon and led zepplin", form)

CPU times: user 3.26 ms, sys: 45 µs, total: 3.3 ms
Wall time: 2.56 s


{'player': [{'artist': ['paul simon', 'led zepplin']}]}
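
To turn these results into something an API could act on, one could map each record to player commands. The sketch below is purely illustrative: `to_commands` and the command strings are hypothetical, and the field names come from the `player` form above.

```python
from typing import Dict, List


def to_commands(result: Dict[str, List[Dict[str, List[str]]]]) -> List[str]:
    """Translate extracted `player` records into simple command strings."""
    commands: List[str] = []
    for record in result.get("player", []):
        if record.get("stop_playing") == ["stop"]:
            commands.append("STOP")
        if record.get("start_playing") == ["start"]:
            commands.append("START")
        for field in ("song", "album", "artist"):
            for value in record.get(field, []):
                commands.append(f"PLAY {field}={value}")
    return commands


print(to_commands({"player": [{"stop_playing": ["stop"]}]}))  # ['STOP']
print(to_commands({"player": [{"artist": ["paul simon", "led zepplin"]}]}))
# ['PLAY artist=paul simon', 'PLAY artist=led zepplin']
```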

## Issue some database queries?

Please note that this is a demo about how to build complexity.

This particular format is actually *NOT* good for issuing database queries.

I may publish another package showing how to do this properly for things like database queries.

In [103]:
company_name = Text(
    id="company_name",
    description="what is the name of the company you want to find",
    examples=[
        ("Apple inc", "Apple inc"),
        ("largest 10 banks in the world", ""),
        ("microsoft and apple", "microsoft,apple"),
    ],
)

industry_name = Text(
    id="industry_name",
    description="what is the name of the company's industry",
    examples=[
        ("companies in the steel manufacturing industry", "steel manufacturing"),
        ("large banks", "banking"),
        ("military companies", "defense"),
        ("chinese companies", ""),
        ("companies that sell cigars", "cigars"),
    ],
)

geography_name = Text(
    id="geography_name",
    description="where is the company based?",
    examples=[
        ("chinese companies", "china"),
        ("companies based in france", "france"),
        ("LaMaple was based in france, italy", ["france", "italy"]),
        ("italy", ""),
    ],
)

foundation_date = Text(
    id="foundation_date",
    description="Foundation date of the company",
    examples=[("companies founded in 2023", "2023")],
)

attribute_filter = Text(
    id="attribute_filter",
    description=(
        "Filter by a value of an attribute using a binary expression. Specify the"
        " attribute's name, an operator (>, <, =, !=, >=, <=, in, not in) and a value."
    ),
    examples=[
        (
            "Companies with revenue > 100",
            {
                "attribute": "revenue",
                "op": ">",
                "value": "100",
            },
        ),
        (
            "number of employees between 50 and 1000",
            {"attribute": "employees", "op": "in", "value": ["50", "1000"]},
        ),
        (
            "blue or green color",
            {
                "attribute": "color",
                "op": "in",
                "value": ["blue", "green"],
            },
        ),
        (
            "companies that do not sell in california",
            {
                "attribute": "geography-sales",
                "op": "not in",
                "value": "california",
            },
        ),
    ],
)

sales_geography = Text(
    id="geography_sales",
    description="where is the company doing sales? Please use a single country name.",
    examples=[
        ("companies with sales in france", "france"),
        ("companies that sell their products in germany", "germany"),
        ("france, italy", ""),
    ],
)

attribute_selection_block = Text(
    id="attribute_selection",
    description="Asking to see the value of one or more attributes",
    examples=[
        ("What is the revenue of tech companies?", "revenue"),
        ("market cap of apple?", "market cap"),
        ("number of employees of largest company", "number of employees"),
        ("what are the revenue and market cap of apple", ["revenue", "market cap"]),
        (
            "share price and number of shares of indian companies",
            ["share price", "number of shares"],
        ),
    ],
)

sort_by_attribute_block = Object(
    id="sort_block",
    description=(
        "Use to request to sort the results by a particular attribute. "
        "Can specify the direction"
    ),
    attributes=[
        Text(id="direction", description="The direction of the sort"),
        Text(id="attribute", description="The sort attribute"),
    ],
    examples=[
        (
            "Largest by market-cap tech companies",
            {"direction": "descending", "attribute": "market-cap"},
        ),
        (
            "sort by companies with smallest revenue ",
            {"direction": "ascending", "attribute": "revenue"},
        ),
    ],
)

form = Object(
    id="search_for_companies",
    description="Search for companies matching the following criteria.",
    attributes=[
        company_name,
        geography_name,
        foundation_date,
        industry_name,
        sales_geography,
        attribute_filter,
        attribute_selection_block,
        sort_by_attribute_block,
    ],
)

# Note some of the examples below could fail

Please note that some of the queries below could *fail* for different reasons.

One common reason with complex objects is ambiguity in how to group things together.

These failures can be remedied by adding object-level examples.

In [100]:
%%time
model(
    (
        "Today Alice MacDonald is turning sixty days old. She had blue eyes. "
        "Bob is turning 10 years old. His eyes were bright red."
    ),
    form,
)

CPU times: user 14.6 ms, sys: 311 µs, total: 14.9 ms
Wall time: 1.74 s


{'search_for_companies': [{}]}

In [104]:
%%time
model(
    (
        "revenue, eps of indian companies that have market cap of over 1 million, "
        "but less than 50 employees and own red and blue buildings"
    ),
    form,
)

CPU times: user 4.21 ms, sys: 15 µs, total: 4.22 ms
Wall time: 5.12 s


{'search_for_companies': [{'attribute_filter': [{'attribute': ['market cap',
      'market cap',
      'employees'],
     'op': ['>', '<', 'in'],
     'value': ['1 million', '50', 'red', 'blue']}],
   'attribute_selection': ['revenue', 'eps']}]}

In [105]:
%%time
model(
    (
        "revenue, eps of indian companies that have market cap of over 1 million, and"
        " and between 20-50 employees"
    ),
    form,
)

CPU times: user 54 µs, sys: 3.53 ms, total: 3.58 ms
Wall time: 4.29 s


{'search_for_companies': [{'attribute_filter': [{'attribute': ['market cap'],
     'op': ['>'],
     'value': ['1 million']},
    {'attribute': ['employees'], 'op': ['in'], 'value': ['20', '50']}],
   'attribute_selection': ['revenue', 'eps']}]}
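
The two results above show the grouping problem nicely: in the first, three filters were merged into a single record, while in the second each filter is its own record. A simple check can flag the merged case before a filter is used. `well_grouped` is a hypothetical helper of mine, not part of Kor.

```python
from typing import Dict, List


def well_grouped(filter_record: Dict[str, List[str]]) -> bool:
    """A usable filter has exactly one attribute, one operator, and at least one value."""
    return (
        len(filter_record.get("attribute", [])) == 1
        and len(filter_record.get("op", [])) == 1
        and len(filter_record.get("value", [])) >= 1
    )


# Filter record from the first query above: three filters merged into one.
merged = {
    "attribute": ["market cap", "market cap", "employees"],
    "op": [">", "<", "in"],
    "value": ["1 million", "50", "red", "blue"],
}

# Filter records from the second query above: one record per filter.
separate = [
    {"attribute": ["market cap"], "op": [">"], "value": ["1 million"]},
    {"attribute": ["employees"], "op": ["in"], "value": ["20", "50"]},
]

print(well_grouped(merged))  # False
print([well_grouped(record) for record in separate])  # [True, True]
```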

In [106]:
%%time
model("companies that own red and blue buildings", form)

CPU times: user 3.38 ms, sys: 0 ns, total: 3.38 ms
Wall time: 2.28 s


{'search_for_companies': [{'attribute_filter': [{'attribute': ['building-colors'],
     'op': ['in'],
     'value': ['red', 'blue']}]}]}

In [108]:
%%time
model("revenue of largest german companies sorted by number of employees", form)

CPU times: user 3.74 ms, sys: 0 ns, total: 3.74 ms
Wall time: 2.64 s


{'search_for_companies': [{'geography_name': ['germany'],
   'sort_block': [{'attribute': ['number of employees'],
     'direction': ['descending']}],
   'attribute_selection': ['revenue']}]}