# Kor

This notebook shows a few extraction examples.

Pay attention to the errors in the outputs; they help illustrate the limitations of this approach.
This may not be the best way to do information extraction with an LLM.

## Temporary hack to add kor to PYTHONPATH

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import pprint

In [3]:
sys.path.insert(0, '../../')

In [4]:
from kor.elements import Form, TextInput, NumericRange, Number, Selection, Option, ObjectInput
from kor.extraction import extract, chat_extract
from kor.llm_utils import LLM, ChatLLM, ChatLLMWithChatInvoke

In [5]:
llm = ChatLLMWithChatInvoke(verbose=False)

## Collect personal information

In [11]:
form = Form(
    id='personal-info', 
    description='Collect information about a person.',
    elements=[
        TextInput(
            id='profession',
            description='what is the person\'s profession?',
            examples=[('He was a piano teacher', 'piano teacher'), ('Bob was a lawyer and a politician', ['lawyer', 'politician'])]
        ),
        TextInput(
            id='first_name',
            description='what is the person\'s first name',
            examples=[('Billy was here', 'Billy'), ('Bob Donatello was very tall', 'Bob')]
        ),
        TextInput(
            id='last_name',
            description='what is the person\'s last name',
            examples=[('Billy was here', ''), ('Bob Donatello was very tall', 'Donatello')]
        ),
        Number(
            id='age',
            description='what is the person\'s age',
            examples=[('26 years old', '26 years'), ('6 puppies', '') ],
        ),
        Selection(
            id='eye-color', 
            description='What is the person\'s eye color?',
            options=[
                Option(id='green', description='Green Eyes', examples=['my eyes are green']),
                Option(id='blue', description='Blue Eyes', examples=['blue eyes']),
                Option(id='brown', description='Brown Eyes', examples=['brown eye color']),
            ],
            null_examples=['violet eyes']
        )
    ]
)

In [7]:
%%time
pprint.pprint(chat_extract('Today Alice MacDonald is turning sixty days old.', form, llm), indent=2)

{ 'personal-info': [ { 'age': ['60 days'],
                       'first_name': ['Alice'],
                       'last_name': ['MacDonald']}]}
CPU times: user 61.4 ms, sys: 3.62 ms, total: 65 ms
Wall time: 1.69 s


**ATTENTION** At the moment, when a flat form is used, each parsed value is collected independently into a list. (The API may change soon, since this is likely to be a source of confusion.)
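
Until that changes, downstream code may want to collapse the one-element lists. The following is a minimal sketch in plain Python, not part of kor; the helper name `unwrap_singletons` is hypothetical, and the input is copied from the output above.

```python
from typing import Any


def unwrap_singletons(value: Any) -> Any:
    """Recursively replace one-element lists of scalars with the scalar itself."""
    if isinstance(value, dict):
        return {key: unwrap_singletons(item) for key, item in value.items()}
    if isinstance(value, list):
        items = [unwrap_singletons(item) for item in value]
        # Lists of records (dicts) stay lists, even when there is only one record.
        if len(items) == 1 and not isinstance(items[0], (dict, list)):
            return items[0]
        return items
    return value


# Output copied from the cell above.
parsed = {
    "personal-info": [
        {"age": ["60 days"], "first_name": ["Alice"], "last_name": ["MacDonald"]}
    ]
}
flat = unwrap_singletons(parsed)
print(flat)
```

Fields that legitimately hold several values (for example, two professions) are left as lists, so callers still need to handle both shapes.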

In [8]:
%%time
pprint.pprint(
    chat_extract('Today Alice MacDonald is turning sixty days old. She had blue eyes. '
            'Bob is turning 10 years old. His eyes were bright red.', form, llm), 
indent=2)

{ 'personal-info': [ { 'age': ['60 days'],
                       'eye-color': ['blue'],
                       'first_name': ['Alice'],
                       'last_name': ['MacDonald']},
                     {'age': ['10 years'], 'first_name': ['Bob']}]}
CPU times: user 5.42 ms, sys: 143 µs, total: 5.57 ms
Wall time: 1.82 s


## Using an Object input

In [9]:
form = Form(
    id='personal', 
    description='Collect information about a person.',
    elements=[
        ObjectInput(
            id='info', 
            description='Personal information about a person like name, age, hobbies, date of birth, height etc.', 
            examples=[(
                'Billy Apple was born on 2020-01-01', {
                    'first_name': 'Billy',
                    'last_name': 'Apple', 
                    'born_on': '2020-01-01'
                },
                ), 
                (
                'Frank was born on 2020-01-01 and is 2 years old today', {
                    'first_name': 'Frank', 
                    'born_on': '2020-01-01',
                    'age': '2 years old',
                }
                )
            ]
        )
    ]
)

If extraction fails for other sentences, try adding more examples.

In [14]:
%%time
pprint.pprint(
    chat_extract('Today Alice MacDonald is turning sixty days old. She had blue eyes. '
            'Bob is turning 10 years old. His eyes were bright red. Chris Prass used his '
            'green eyes to look at Dorothy to find 15 year old eyes staring back at him. '
            'Prass was a piano teacher. Dorothy was a certified mechanic and a chef. Dorothy was not even 2 years old. ' 
            'All certified mechanics have yellow eyes.', form, llm), 
indent=2)

{ 'personal-info': [ { 'age': ['60 days'],
                       'eye-color': ['blue'],
                       'first_name': ['Alice'],
                       'last_name': ['MacDonald']},
                     { 'age': ['10 years'],
                       'eye-color': ['bright red'],
                       'first_name': ['Bob']},
                     { 'eye-color': ['green'],
                       'first_name': ['Chris'],
                       'last_name': ['Prass'],
                       'profession': ['piano teacher']},
                     { 'age': ['less than 2 years'],
                       'eye-color': ['yellow'],
                       'first_name': ['Dorothy'],
                       'profession': ['certified mechanic', 'chef']}]}
CPU times: user 6.61 ms, sys: 202 µs, total: 6.81 ms
Wall time: 5.54 s
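
Note that `'bright red'` and `'yellow'` in the output above are not among the `eye-color` options declared in the flat form (`green`, `blue`, `brown`). One way to catch this kind of error is a post-hoc check against the allowed option ids. This is a minimal sketch in plain Python, not a kor feature; `find_invalid_selections` is a hypothetical helper.

```python
ALLOWED_EYE_COLORS = {"green", "blue", "brown"}  # option ids from the flat form


def find_invalid_selections(records, field, allowed):
    """Return (record_index, value) pairs whose value is not an allowed option id."""
    invalid = []
    for index, record in enumerate(records):
        # A record may omit the field entirely; treat that as "nothing to check".
        for value in record.get(field, []):
            if value not in allowed:
                invalid.append((index, value))
    return invalid


# Trimmed copy of the output above.
records = [
    {"first_name": ["Alice"], "eye-color": ["blue"]},
    {"first_name": ["Bob"], "eye-color": ["bright red"]},
    {"first_name": ["Chris"], "eye-color": ["green"]},
    {"first_name": ["Dorothy"], "eye-color": ["yellow"]},
]
invalid = find_invalid_selections(records, "eye-color", ALLOWED_EYE_COLORS)
print(invalid)  # [(1, 'bright red'), (3, 'yellow')]
```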


In [10]:
%%time
pprint.pprint(
    chat_extract('Today Alice MacDonald is turning sixty days old. She had blue eyes. '
            'Bob is turning 10 years old. His eyes were bright red. Chris Prass used his '
            'green eyes to look at Dorothy to find 15 year old eyes staring back at him. '
            'Prass was a piano teacher. Dorothy was a certified mechanic. ' 
            'All certified mechanics have yellow eyes.', form, llm), 
indent=2)

{ 'personal': [ { 'info': [ { 'age': ['60 days old'],
                              'eye_color': ['blue'],
                              'first_name': ['Alice'],
                              'last_name': ['MacDonald']}]},
                { 'info': [ { 'age': ['10 years old'],
                              'eye_color': ['bright red'],
                              'first_name': ['Bob']}]},
                { 'info': [ { 'age': ['15 years old'],
                              'eye_color': ['yellow'],
                              'first_name': ['Dorothy']}]},
                { 'info': [ { 'eye_color': ['green'],
                              'first_name': ['Chris'],
                              'last_name': ['Prass'],
                              'profession': ['piano teacher']}]}]}
CPU times: user 7.33 ms, sys: 184 µs, total: 7.51 ms
Wall time: 4.82 s


## Buying sports tickets

In [10]:
form = Form(
    id='action', 
    description='User is looking for sports tickets',
    elements=[
        TextInput(
            id='sport',
            description='which sports do you want to buy tickets for?',
            examples=[('I want to buy tickets to basketball and football games', ['basketball', 'football'])]
        ),
        TextInput(
            id='location',
            description='where would you like to watch the game?',
            examples=[('in boston', 'boston'), ('in france or italy', ['france', 'italy'])]
        ),
        ObjectInput(
            id='price-range',
            description='how much do you want to spend?',
            examples=[('no more than $100', {'price-max': '100', 'currency': "$"}), 
                      ('between 50 and 100 dollars', {'price-max': '100', 'price-min': '50', 'currency': "$"})]
        ),
    ]
)

In [11]:
%%time
pprint.pprint(extract('I want to buy tickets for a baseball game in LA area under $100', form, llm), indent=2)

{ 'location': ['LA area'],
  'price-range': [{'currency': ['$'], 'price-max': ['100']}],
  'sport': ['baseball']}
CPU times: user 4.1 ms, sys: 0 ns, total: 4.1 ms
Wall time: 2.13 s


## More complex sentence

Use an LLM to parse a sentence into structured fields that can later be converted into a database query.

In [12]:
from kor import elements

In [13]:
company_name = elements.TextInput(
    id="company-name",
    description="what is the name of the company you want to find",
    examples=[
        ("Apple inc", "Apple inc"),
        ("largest 10 banks in the world", ""),
        ("microsoft and apple", "microsoft,apple"),
    ],
)

industry_name = elements.TextInput(
    id="industry-name",
    description="what is the name of the company's industry",
    examples=[
        ("companies in the steel manufacturing industry", "steel manufacturing"),
        ("large banks", "banking"),
        ("military companies", "defense"),
        ("chinese companies", ""),
        ("companies that sell cigars", "cigars"),
    ],
)

geography_name = elements.TextInput(
    id="geography-name",
    description="where is the company based?",
    examples=[
        ("chinese companies", "china"),
        ("companies based in france", "france"),
        ("LaMaple was based in france, italy", ['france', 'italy']),
        ("italy", ""),
    ],
)

foundation_date = elements.DateInput(
    id="foundation-date",
    description="Foundation date of the company",
    examples=[("companies founded in 2023", "2023")],
)

attribute_filter = elements.ObjectInput(
    id="attribute-filter",
    description=(
        "Filter by a value of an attribute using a binary expression. Specify the attribute's name, "
        "an operator (>, <, =, !=, >=, <=, in, not in) and a value."
    ),
    examples=[
        (
            "Companies with revenue > 100",
            {
                "attribute": "revenue",
                "op": ">",
                "value": "100",
            },
        ),
        (
            "number of employees between 50 and 1000",
            {"attribute": "employees", "op": "in", "value": ["50", "1000"]},
        ),
        (
            "blue or green color",
            {
                "attribute": "color",
                "op": "in",
                "value": ["blue", "green"],
            },
        ),
        (
            "companies that do not sell in california",
            {
                "attribute": "geography-sales",
                "op": "not in",
                "value": "california",
            },
        ),
    ],
)

sales_geography = elements.TextInput(
    id="geography-sales",
    description=(
        "where is the company doing sales? Please use a single country name."
    ),
    examples=[
        ("companies with sales in france", "france"),
        ("companies that sell their products in germany", "germany"),
        ("france, italy", ""),
    ],
)

attribute_selection_block = elements.TextInput(
    id="attribute_selection",
    description="Asking to see the value of one or more attributes",
    examples=[
        ("What is the revenue of tech companies?", "revenue"),
        ("market cap of apple?", "market cap"),
        ("number of employees of largest company", "number of employees"),
        ("what are the revenue and market cap of apple", ['revenue', 'market cap']),
        ("share price and number of shares of indian companies", ["share price", "number of shares"]),
    ],
)

sort_by_attribute_block = elements.ObjectInput(
    id="sort-block",
    description=(
        "Use to request to sort the results by a particular attribute. "
        "Can specify the direction"
    ),
    examples=[
        (
            "Largest by market-cap tech companies",
            {"direction": "descending", "attribute": "market-cap"},
        ),
        (
            "sort by companies with smallest revenue ",
            {"direction": "ascending", "attribute": "revenue"},
        ),
    ],
)

form = elements.Form(
    id="search-for-companies",
    description="Search for companies matching the following criteria.",
    elements=[
        company_name,
        geography_name,
        foundation_date,
        industry_name,
        sales_geography,
        attribute_filter,
        attribute_selection_block,
        sort_by_attribute_block,
    ],
)


In [14]:
%%time
pprint.pprint(
    extract('Today Alice MacDonald is turning sixty days old. She had blue eyes. '
            'Bob is turning 10 years old. His eyes were bright red.', form, llm), 
indent=2)

{}
CPU times: user 3.5 ms, sys: 0 ns, total: 3.5 ms
Wall time: 790 ms


In [15]:
%%time
pprint.pprint(
    extract('revenue, eps of indian companies that have market cap of over 1 million, '
            'but less than 50 employees and own red and blue buildings', form, llm
           ), indent=2)

{ 'attribute-filter': [ { 'attribute': ['market cap'],
                          'op': ['>'],
                          'value': ['1 million']},
                        { 'attribute': ['employees'],
                          'op': ['<'],
                          'value': ['50']}],
  'attribute_selection': ['revenue', 'eps']}
CPU times: user 2.65 ms, sys: 4.03 ms, total: 6.68 ms
Wall time: 4.12 s


In [16]:
%%time
pprint.pprint(extract('revenue, eps of indian companies that have market cap of over 1 million, and between 20-50 employees', form, llm), indent=2)

{ 'attribute-filter': [ { 'attribute': ['market cap'],
                          'op': ['>'],
                          'value': ['1 million']},
                        { 'attribute': ['employees'],
                          'op': ['in'],
                          'value': ['20', '50']}],
  'attribute_selection': ['revenue', 'eps'],
  'geography-name': ['india']}
CPU times: user 5.81 ms, sys: 137 µs, total: 5.94 ms
Wall time: 4.37 s
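
As a rough illustration of the "convert it into a database query" step, the `attribute-filter` entries above could be turned into a parameterized `WHERE` fragment. This is a sketch under assumptions, not something kor provides: `build_where_clause` is a hypothetical helper, the `?` placeholder style is the one used by `sqlite3`, and attribute names are assumed to map directly to column names.

```python
SQL_OPERATORS = {">", "<", "=", "!=", ">=", "<=", "in", "not in"}


def build_where_clause(filters):
    """Turn parsed attribute filters into (sql_fragment, params)."""
    clauses, params = [], []
    for item in filters:
        attribute = item["attribute"][0]
        op = item["op"][0]
        values = item["value"]
        if op not in SQL_OPERATORS:
            raise ValueError(f"unsupported operator: {op!r}")
        # Quote the identifier. Real code should also check the attribute
        # against a known list of columns instead of trusting LLM output.
        column = '"' + attribute.replace('"', '""') + '"'
        if op in ("in", "not in"):
            placeholders = ", ".join("?" for _ in values)
            clauses.append(f"{column} {op.upper()} ({placeholders})")
            params.extend(values)
        else:
            clauses.append(f"{column} {op} ?")
            params.append(values[0])
    return " AND ".join(clauses), params


# Filters copied from the output above.
filters = [
    {"attribute": ["market cap"], "op": [">"], "value": ["1 million"]},
    {"attribute": ["employees"], "op": ["in"], "value": ["20", "50"]},
]
where, params = build_where_clause(filters)
print(where)   # "market cap" > ? AND "employees" IN (?, ?)
print(params)  # ['1 million', '20', '50']
```

Two caveats show up immediately. The form's examples use `in` both for a range ("between 50 and 1000") and for a set ("blue or green"), while this sketch treats it as set membership, so a range would need to be disambiguated first. Values such as `'1 million'` are still strings and would need normalizing before a numeric comparison.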


In [17]:
%%time
pprint.pprint(extract('companies that own red and blue buildings', form, llm), indent=2)

{ 'attribute-filter': [ { 'attribute': ['building color'],
                          'op': ['in'],
                          'value': ['red', 'blue']}]}
CPU times: user 5.6 ms, sys: 0 ns, total: 5.6 ms
Wall time: 1.74 s


In [18]:
%%time
pprint.pprint(extract('revenue of largest german companies sorted by number of employees', form, llm), indent=2)

{ 'attribute_selection': ['revenue'],
  'geography-sales': ['germany'],
  'sort-block': [ { 'attribute': ['number of employees'],
                    'direction': ['descending']}]}
CPU times: user 17.1 ms, sys: 0 ns, total: 17.1 ms
Wall time: 2.66 s
