# Extracting Objects

Kor attempts to make it easy to extract objects from text.

Let's see how to do this!

In [1]:
%load_ext autoreload
%autoreload 2

import sys

sys.path.insert(0, "../../")

In [1]:
from kor.extraction import Extractor
from kor.nodes import Object, Text, Number
from langchain.chat_models import ChatOpenAI

In [2]:
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    max_tokens=2000,
    frequency_penalty=0,
    presence_penalty=0,
    top_p=1.0,
)
model = Extractor(llm)

## Object Schema

In addition to `Text`, Kor has an `Object` type that lets you describe how to extract an object from text.

In [3]:
schema = Object(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
            examples=[("John Smith went to the store", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person",
            examples=[("John Smith went to the store", "Smith")],
        ),
        Number(
            id="age",
            description="The age of the person in years.",
            examples=[("23 years old", "23"), ("I turned three on sunday", "3")],
        ),
    ],
)

In [4]:
print(model.prompt_generator.format_as_string("user input goes here", schema))

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

personal_info: { // Personal information about a given person.
 first_name: string[] // The first name of the person
 last_name: string[] // The last name of the person
 age: number[] // The age of the person in years.
}
```


For Union types the output must EXACTLY match one of the members of the Union type.

Please enclose the extracted information in HTML style tags with the tag name corresponding to the corresponding component ID. Use angle style brackets for the tags ('>' and '<'). Only output tags when you're confident about the information that was extracted from the user's query. If you can extract several pieces of relevant information from the query, then include all of them. If the type is an array

Note that the examples above were specified at the attribute level.

When this works, it makes attributes easier to compose. In practice, though, getting good
results generally requires examples at the object level (as we'll do below), because they
help the model determine how to associate attributes with one another.

In [5]:
model("Eugene was 18 years old a long time ago.", schema)

{'personal_info': [{'first_name': ['Eugene'], 'age': ['18']}]}
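
Every attribute in the result comes back wrapped in a list, so reading a single value takes a couple of indexing steps. Here is a minimal sketch of a helper for that, working on a copy of the output above. The name `first_value` is hypothetical and not part of Kor.

```python
from typing import Any, Optional

# Result copied from the cell output above.
result = {"personal_info": [{"first_name": ["Eugene"], "age": ["18"]}]}


def first_value(record: dict, attribute: str) -> Optional[Any]:
    """Return the first extracted value for an attribute, or None if absent."""
    values = record.get(attribute) or []
    return values[0] if values else None


person = result["personal_info"][0]
print(first_value(person, "first_name"))  # Eugene
print(first_value(person, "last_name"))   # None
print(int(first_value(person, "age")))    # 18
```

Note that in the output above the age is the string `'18'`, so the sketch converts it explicitly.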

And nothing should be extracted from the text below.

In [6]:
model(
    (
        "My phone number is (123)-444-9999. I found my true love one on a blue sunday."
        " Her number was (333)1232832"
    ),
    schema,
)

{}

### Hallucinations

Without sufficient examples, the model may incorrectly interpret the task.

In the example below, the model extracts a `phone_number` attribute even though the schema does not ask for one.

In [7]:
model(
    (
        "My name is Bob and my phone number is (123)-444-9999. I found my true love one"
        " on a blue sunday. Her number was (333)1232832"
    ),
    schema,
)

{'personal_info': [{'first_name': ['Bob']},
  {'phone_number': ['(123)-444-9999']},
  {'phone_number': ['(333)1232832']}]}
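
One way to guard against this kind of output, independent of prompt tuning, is to post-filter the result against the attribute ids declared in the schema. The sketch below does that on a copy of the output above. `drop_unknown_attributes` and `ALLOWED_ATTRIBUTES` are hypothetical names used for illustration, not part of Kor.

```python
# Attribute ids declared in the schema above.
ALLOWED_ATTRIBUTES = {"first_name", "last_name", "age"}

# Result copied from the cell output above.
result = {
    "personal_info": [
        {"first_name": ["Bob"]},
        {"phone_number": ["(123)-444-9999"]},
        {"phone_number": ["(333)1232832"]},
    ]
}


def drop_unknown_attributes(records, allowed):
    """Keep only allowed keys and discard records that end up empty."""
    cleaned = []
    for record in records:
        kept = {key: value for key, value in record.items() if key in allowed}
        if kept:
            cleaned.append(kept)
    return cleaned


filtered = {
    "personal_info": drop_unknown_attributes(
        result["personal_info"], ALLOWED_ATTRIBUTES
    )
}
print(filtered)  # {'personal_info': [{'first_name': ['Bob']}]}
```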

### Improving with Examples

Adding more examples, especially at the object level, can help improve performance.

In [8]:
schema = Object(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
            # examples=[("John Smith went to the store", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person",
            # examples=[("John Smith went to the store", "Smith")],
        ),
        Number(
            id="age",
            description="The age of the person in years.",
            # examples=[("23 years old", "23"), ("I turned three on sunday", "3")],
        ),
    ],
    examples=[
        (
            "John Smith was 23 years old",
            {"first_name": "John", "last_name": "Smith", "age": "23"},
        )
    ],
)

In [9]:
model(
    (
        "My name is Bob and my phone number is (123)-444-9999. I found my true love one"
        " on a blue sunday. Her number was (333)1232832"
    ),
    schema,
)

{'personal_info': [{'first_name': ['Bob']}]}

### What's the actual prompt?

In [10]:
print(model.prompt_generator.format_as_string("user input goes here", schema))

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

personal_info: { // Personal information about a given person.
 first_name: string[] // The first name of the person
 last_name: string[] // The last name of the person
 age: number[] // The age of the person in years.
}
```


For Union types the output must EXACTLY match one of the members of the Union type.

Please enclose the extracted information in HTML style tags with the tag name corresponding to the corresponding component ID. Use angle style brackets for the tags ('>' and '<'). Only output tags when you're confident about the information that was extracted from the user's query. If you can extract several pieces of relevant information from the query, then include all of them. If the type is an array
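
To build intuition for the TypeScript-style block in the prompt, here is a minimal sketch of how such a description could be rendered from a list of attributes. This is not Kor's implementation. `AttributeSpec`, `ObjectSpec`, and `render_type_block` are hypothetical names made up for illustration, and the sketch only handles a flat object.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AttributeSpec:
    """Hypothetical stand-in for a Text or Number attribute."""

    id: str
    ts_type: str  # e.g. "string" or "number"
    description: str = ""


@dataclass
class ObjectSpec:
    """Hypothetical stand-in for a flat Object schema."""

    id: str
    description: str
    attributes: List[AttributeSpec] = field(default_factory=list)


def render_type_block(spec: ObjectSpec) -> str:
    """Render the schema in the TypeScript-like style seen in the prompt."""
    lines = [f"{spec.id}: {{ // {spec.description}"]
    for attribute in spec.attributes:
        # Every attribute is rendered as an array type, as in the prompt.
        lines.append(
            f" {attribute.id}: {attribute.ts_type}[] // {attribute.description}"
        )
    lines.append("}")
    return "\n".join(lines)


spec = ObjectSpec(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        AttributeSpec("first_name", "string", "The first name of the person"),
        AttributeSpec("last_name", "string", "The last name of the person"),
        AttributeSpec("age", "number", "The age of the person in years."),
    ],
)
rendered = render_type_block(spec)
print(rendered)
```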

### Extractions are Grouped 😺 

Let's try the schema above with a longer piece of text.

The extraction won't be perfect and some errors are made.
For example, Alice's age is in days, not years, and
we're missing information about `Chris`.

But object-level attributes are properly grouped together!

In [11]:
user_input = (
    "Today Alice MacDonald is turning sixty days old. She had blue eyes. "
    "Bob is turning 10 years old. His eyes were bright red. Chris Prass used his "
    "green eyes to look at Dorothy to find 15 year old eyes staring back at him. "
    "Prass was a piano teacher. Dorothy was a certified mechanic. "
    "All certified mechanics have yellow eyes."
)

In [12]:
model(user_input, schema)

{'personal_info': [{'age': ['60'],
   'first_name': ['Alice'],
   'last_name': ['MacDonald']},
  {'age': ['10'], 'first_name': ['Bob'], 'last_name': [{}]},
  {'age': ['15'], 'first_name': ['Dorothy'], 'last_name': [{}]}]}
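
In the output above, missing last names show up as an empty placeholder (`[{}]`) rather than being omitted. A small clean-up pass can remove them. This is a sketch on a copy of the output, and `drop_empty_values` is a hypothetical helper, not part of Kor.

```python
# Result copied from the cell output above.
result = {
    "personal_info": [
        {"age": ["60"], "first_name": ["Alice"], "last_name": ["MacDonald"]},
        {"age": ["10"], "first_name": ["Bob"], "last_name": [{}]},
        {"age": ["15"], "first_name": ["Dorothy"], "last_name": [{}]},
    ]
}


def drop_empty_values(record):
    """Remove empty placeholders and drop attributes left with no values."""
    cleaned = {}
    for key, values in record.items():
        non_empty = [value for value in values if value not in ({}, "", None)]
        if non_empty:
            cleaned[key] = non_empty
    return cleaned


cleaned_records = [drop_empty_values(r) for r in result["personal_info"]]
print(cleaned_records[1])  # {'age': ['10'], 'first_name': ['Bob']}
```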

## Nested Objects 😹

Let's try a more complex schema, and break the age into two attributes: a number and a unit.

In [13]:
schema = Object(
    id="personalinfo",
    description="Collect information about a person.",
    attributes=[
        Text(
            id="profession",
            description="The person's profession?",
            examples=[
                ("He was a professor", "professor"),
                ("Bob was a lawyer and a politician", ["lawyer", "politician"]),
            ],
        ),
        Text(
            id="first_name",
            description="The person's first name",
            examples=[("Billy was here", "Billy"), ("Bob was very tall", "Bob")],
        ),
        Text(
            id="last_name",
            description="The person's last name",
            examples=[("Joe Donatello was very tall", "Donatello")],
        ),
        Text(
            id="eye_color",
            description="The person's eye color?",
            examples=[("my eyes are green", "green")],
        ),
        Object(
            id="age",
            attributes=[
                Number(
                    id="number",
                    description="what is the person's age?",
                    examples=[("10 years old", 10)],
                ),
                Text(
                    id="unit",
                    description="In which units is the age reported in?",
                    examples=[("10 years old", "years"), ("22 days", "days")],
                ),
            ],
        ),
    ],
    examples=[],
)

In [14]:
user_input = (
    "Today Alice MacDonald is turning sixty days old. She had blue eyes. "
    "Bob is turning 10 years old. His eyes were bright red. Chris Prass used his "
    "green eyes to look at Dorothy to find 15 year old eyes staring back at him. "
    "Prass was a piano teacher. Dorothy was a certified mechanic. "
    "All certified mechanics have yellow eyes."
)

In [15]:
model(user_input, schema)

{'personalinfo': [{'first_name': ['Alice'],
   'last_name': ['MacDonald'],
   'age': [{'number': ['60'], 'unit': ['days']}],
   'eye_color': ['blue']},
  {'first_name': ['Bob'],
   'age': [{'number': ['10'], 'unit': ['years']}],
   'eye_color': ['bright red']},
  {'first_name': ['Chris'],
   'last_name': ['Prass'],
   'profession': ['piano teacher'],
   'eye_color': ['green']},
  {'first_name': ['Dorothy'],
   'profession': ['certified mechanic'],
   'eye_color': ['yellow']}]}
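
Splitting the age into a number and a unit makes it possible to normalize ages after extraction. The sketch below converts the nested `age` object to years for a few records copied from the output above. The conversion factors are rough, illustrative assumptions, and `age_in_years` is a hypothetical helper, not part of Kor.

```python
from typing import Optional

# Approximate conversion factors (illustrative assumptions).
UNIT_IN_YEARS = {
    "years": 1.0,
    "months": 1 / 12,
    "weeks": 7 / 365.25,
    "days": 1 / 365.25,
}

# Records copied from the cell output above, trimmed to the relevant keys.
people = [
    {"first_name": ["Alice"], "age": [{"number": ["60"], "unit": ["days"]}]},
    {"first_name": ["Bob"], "age": [{"number": ["10"], "unit": ["years"]}]},
    {"first_name": ["Chris"]},
]


def age_in_years(record: dict) -> Optional[float]:
    """Convert the nested age object to years, or return None if unavailable."""
    age_entries = record.get("age") or []
    if not age_entries:
        return None
    numbers = age_entries[0].get("number") or []
    units = age_entries[0].get("unit") or []
    if not numbers or not units:
        return None
    factor = UNIT_IN_YEARS.get(units[0])
    if factor is None:
        return None  # Unknown unit: do not guess.
    return float(numbers[0]) * factor


ages_by_name = {p["first_name"][0]: age_in_years(p) for p in people}
print(ages_by_name["Bob"])    # 10.0
print(ages_by_name["Chris"])  # None
```

Alice's sixty days come out as roughly 0.16 years, which avoids the days-versus-years confusion seen with the flat `age` attribute earlier.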

## Grouping Ambiguity

*Kor* interprets every type as a list of that type. (At least at the moment.)

As a result, grouping items correctly can be ambiguous without sufficient examples.

Below is an example where we extract where people are moving from and where they are moving to.

In [16]:
FROM_ADDRESS = Object(
    id="from_address",
    description="Person moved away from this address",
    attributes=[
        Text(id="street"),
        Text(id="city"),
        Text(id="state"),
        Text(id="zipcode"),
        Text(id="country", description="A country in the world; e.g., France."),
    ],
    examples=[
        (
            "100 Main St, Boston,MA, 23232, USA",
            {
                "street": "100 Main St",
                "city": "Boston",
                "state": "MA",
                "zipcode": "23232",
                "country": "USA",
            },
        )
    ],
)

TO_ADDRESS = FROM_ADDRESS.replace(
    id="to_address", description="Address to which the person is moving"
)

schema = Object(
    id="information",
    attributes=[
        Text(
            id="person_name",
            description="The full name of the person or partial name",
            examples=[("John Smith was here", "John Smith")],
        ),
        FROM_ADDRESS,
        TO_ADDRESS,
    ],
)

In [17]:
model("Alice Doe and Bob Smith moved from New York to Boston", schema)

{'information': [{'person_name': ['Alice Doe', 'Bob Smith'],
   'from_address': [{'city': ['New York']}],
   'to_address': [{'city': ['Boston']}]}]}

**Attention:** The information extracted above is correct, but this is probably not how we wanted it to be grouped!
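
If a result like this has to be used as-is, one possible workaround is to split a record with several names into one record per person. The sketch below does this on a copy of the output above. It assumes that every person in the record shares the same addresses, which holds for this sentence but not in general. `split_by_person` is a hypothetical helper, not part of Kor.

```python
import copy

# Result copied from the cell output above.
result = {
    "information": [
        {
            "person_name": ["Alice Doe", "Bob Smith"],
            "from_address": [{"city": ["New York"]}],
            "to_address": [{"city": ["Boston"]}],
        }
    ]
}


def split_by_person(record):
    """Return one record per name, copying the shared attributes to each."""
    names = record.get("person_name", [])
    if len(names) <= 1:
        return [record]
    shared = {key: value for key, value in record.items() if key != "person_name"}
    # Deep-copy so the per-person records do not share nested lists.
    return [{"person_name": [name], **copy.deepcopy(shared)} for name in names]


per_person = [
    person
    for record in result["information"]
    for person in split_by_person(record)
]
print([p["person_name"] for p in per_person])  # [['Alice Doe'], ['Bob Smith']]
```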

### Disambiguating groups

At the moment, one should specify object-level examples to help the model determine how to group things correctly.

In [18]:
form = Object(
    id="information",
    attributes=[
        Text(
            id="person_name",
            description="The full name of the person or partial name",
            examples=[("John Smith was here", "John Smith")],
        ),
        FROM_ADDRESS,
        TO_ADDRESS,
    ],
    examples=[
        (
            "John Smith moved to Boston from New York. Billy moved to LA.",
            [
                {
                    "person_name": "John Smith",
                    "from_address": {"city": "New York"},
                    "to_address": {"city": "Boston"},
                },
                {"person_name": "Billy", "to_address": {"city": "LA"}},
            ],
        )
    ],
)

In [19]:
model("Alice Doe and Bob Smith moved from New York to Boston", schema)

{'information': [{'person_name': ['Alice Doe', 'Bob Smith'],
   'from_address': [{'city': ['New York']}],
   'to_address': [{'city': ['Boston']}]}]}

In [20]:
model(
    (
        "Alice Doe and Bob Smith moved from New York to Boston. Andrew was 12 years"
        " old. He also moved to Boston. So did Joana and Paul. Betty did the opposite."
    ),
    form,
)

{'information': [{'from_address': [{'city': ['New York']}],
   'person_name': ['Alice Doe'],
   'to_address': [{'city': ['Boston']}]},
  {'from_address': [{'city': ['New York']}],
   'person_name': ['Bob Smith'],
   'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Andrew']},
  {'person_name': ['Joana'], 'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Paul'], 'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Betty'], 'to_address': [{'city': ['New York']}]}]}
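
With one record per person, the result is easy to flatten into rows for inspection. The sketch below turns a copy of the output above into `(name, from_city, to_city)` tuples. The helper `city` is a hypothetical name, not part of Kor.

```python
# Records copied from the cell output above.
records = [
    {
        "from_address": [{"city": ["New York"]}],
        "person_name": ["Alice Doe"],
        "to_address": [{"city": ["Boston"]}],
    },
    {
        "from_address": [{"city": ["New York"]}],
        "person_name": ["Bob Smith"],
        "to_address": [{"city": ["Boston"]}],
    },
    {"person_name": ["Andrew"]},
    {"person_name": ["Joana"], "to_address": [{"city": ["Boston"]}]},
    {"person_name": ["Paul"], "to_address": [{"city": ["Boston"]}]},
    {"person_name": ["Betty"], "to_address": [{"city": ["New York"]}]},
]


def city(record, address_key):
    """Return the first city of the given address, or None if it is missing."""
    addresses = record.get(address_key) or [{}]
    cities = addresses[0].get("city") or [None]
    return cities[0]


rows = [
    (r["person_name"][0], city(r, "from_address"), city(r, "to_address"))
    for r in records
]
for row in rows:
    print(row)
```

Laid out this way, gaps stand out: the row for Andrew has no destination, even though the text says he also moved to Boston.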

## Untyped Objects

It's possible to provide just examples without type information.

It's unclear 🤷🤷🤷 whether the quality of the results is affected significantly, especially if one adds additional examples to compensate.

In [21]:
form = Object(
    id="information",
    attributes=[],
    examples=[
        (
            "John Smith moved to Boston from New York. Billy moved to LA.",
            [
                {
                    "person_name": "John Smith",
                    "from_address": {"city": "New York"},
                    "to_address": {"city": "Boston"},
                },
                {"person_name": "Billy", "to_address": {"city": "LA"}},
            ],
        )
    ],
)

In [22]:
model(
    (
        "Alice Doe and Bob Smith moved from New York to Boston. Andrew was 12 years"
        " old. He also moved to Boston. So did Joana and Paul. Betty did the opposite."
    ),
    form,
)

{'information': [{'from_address': [{'city': ['New York']}],
   'person_name': ['Alice Doe'],
   'to_address': [{'city': ['Boston']}]},
  {'from_address': [{'city': ['New York']}],
   'person_name': ['Bob Smith'],
   'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Andrew'],
   'age': ['12'],
   'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Joana'], 'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Paul'], 'to_address': [{'city': ['Boston']}]},
  {'person_name': ['Betty'], 'to_address': [{'city': ['New York']}]}]}
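
Without declared attributes, nothing constrains which keys the model returns. In the output above an `age` key appears even though the example never mentions one. A quick way to audit this is to count how often each key occurs. The sketch below does so on a copy of the output, and `EXAMPLE_KEYS` is a hypothetical name for the keys used in the object-level example.

```python
from collections import Counter

# Keys that appear in the object-level example above.
EXAMPLE_KEYS = {"person_name", "from_address", "to_address"}

# Records copied from the cell output above.
records = [
    {
        "from_address": [{"city": ["New York"]}],
        "person_name": ["Alice Doe"],
        "to_address": [{"city": ["Boston"]}],
    },
    {
        "from_address": [{"city": ["New York"]}],
        "person_name": ["Bob Smith"],
        "to_address": [{"city": ["Boston"]}],
    },
    {
        "person_name": ["Andrew"],
        "age": ["12"],
        "to_address": [{"city": ["Boston"]}],
    },
    {"person_name": ["Joana"], "to_address": [{"city": ["Boston"]}]},
    {"person_name": ["Paul"], "to_address": [{"city": ["Boston"]}]},
    {"person_name": ["Betty"], "to_address": [{"city": ["New York"]}]},
]

# Count how many records contain each top-level key.
key_counts = Counter(key for record in records for key in record)
unexpected_keys = set(key_counts) - EXAMPLE_KEYS

print(key_counts["person_name"])  # 6
print(unexpected_keys)            # {'age'}
```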

## Flattened Objects

This section is not ready yet.

In [23]:
form = Object(
    id="information",
    attributes=[
        Text(
            id="person_name",
            description="The first name of the person or partial name",
            examples=[("John Smith was here", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person or partial name",
            examples=[("John Smith was here", "Smith")],
        ),
    ],
    examples=[],
    group_as_object=False,  # <-- Please note
)