# Introduction

**Kor** is a thin wrapper around LLMs that helps extract structured data from text.

To use Kor, specify the schema of what should be extracted and provide some extraction examples.

As you're looking through this tutorial, examine 👀 the outputs carefully to understand what errors are being made.
Extraction isn't perfect, so understand the limitations of this approach before adopting it for your use case.

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import pprint

sys.path.insert(0, "../../")

In [2]:
from kor.extraction import Extractor
from kor.nodes import Object, Text, Number
from langchain.chat_models import ChatOpenAI

## LLM

Instantiate a LangChain LLM:

https://langchain.readthedocs.io/en/latest/modules/llms.html

In [3]:
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    max_tokens=2000,
    frequency_penalty=0,
    presence_penalty=0,
    top_p=1.0,
)

----------
Create an extractor instance. The extractor is responsible for generating a prompt, feeding it into the LLM, and parsing the output.

In [4]:
model = Extractor(llm)
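
To make the three responsibilities concrete, here is a minimal sketch of a prompt, call, and parse loop. It is not Kor's implementation: `build_prompt`, `parse_tags`, `extract`, and `fake_llm` are hypothetical names, and the canned model reply stands in for a real LLM call. The tag-based output format is assumed from the prompt shown at the end of this tutorial.

```python
import re
from typing import Callable, Dict, List


def build_prompt(field_id: str, description: str, text: str) -> str:
    """Hypothetical prompt builder: describe the field, then append the input."""
    return (
        f"Extract every `{field_id}` ({description}) from the input. "
        f"Wrap each value in <{field_id}></{field_id}> tags.\n"
        f"Input: {text}"
    )


def parse_tags(field_id: str, raw_output: str) -> Dict[str, List[str]]:
    """Collect the contents of every <field_id>...</field_id> pair."""
    tag = re.escape(field_id)
    pattern = rf"<{tag}>(.*?)</{tag}>"
    return {field_id: re.findall(pattern, raw_output, flags=re.DOTALL)}


def extract(
    llm: Callable[[str], str], field_id: str, description: str, text: str
) -> Dict[str, List[str]]:
    """Generate a prompt, feed it to the model, and parse the reply."""
    prompt = build_prompt(field_id, description, text)
    raw_output = llm(prompt)
    return parse_tags(field_id, raw_output)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: returns tagged values)."""
    return "<first_name>Tom</first_name><first_name>Bobby</first_name>"


result = extract(
    fake_llm,
    "first_name",
    "The first name of a person",
    "My name is Tom. My best friend is Bobby.",
)
print(result)  # {'first_name': ['Tom', 'Bobby']}
```

Keeping the parser separate from the model call makes it easy to test the parsing logic against canned replies without spending tokens.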

## Schema

Kor requires you to specify a `schema` describing what you want parsed, optionally with some examples.

We'll start off by specifying a **very simple** schema:

In [5]:
schema = Text(
    id="first_name",
    description="The first name of a person",
    examples=[("I am billy.", "billy"), ("John Smith is 33 years old", "John")],
)

The schema above consists of a single text node (i.e., a single text input). 

The node will capture mentions of **first name** from a segment of text.

As part of the schema, we specified a `description` of what we're extracting, as well as 2 examples.

Including both a `description` and `examples` will likely improve performance.
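
One plausible reason examples help is that they can be rendered as few-shot input/output pairs in the prompt. The sketch below illustrates the idea only: `render_examples` is a hypothetical helper, and the tagged output format is an assumption based on the HTML-style tag instruction in the prompt printed later in this tutorial, not a confirmed description of how Kor formats examples.

```python
from typing import List, Tuple


def render_examples(
    field_id: str, examples: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Turn (input, expected value) pairs into (input, tagged output) pairs."""
    return [(text, f"<{field_id}>{value}</{field_id}>") for text, value in examples]


# Same examples as in the schema above.
examples = [("I am billy.", "billy"), ("John Smith is 33 years old", "John")]
rendered = render_examples("first_name", examples)

for text, tagged in rendered:
    print(f"Input: {text}\nOutput: {tagged}\n")
```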

## Extract

With a `model` and a `schema` defined, we're ready to extract data.

In [6]:
model("My name is Tom. I am a cat. My best friend is Bobby. He is not a cat.", schema)

{'first_name': ['Tom', 'Bobby']}

In [7]:
model("My name is WOW. My friend's name is Meow.", schema)

{'first_name': ['WOW', 'Meow']}

In [8]:
# Note the extraction here: the repeated value is unlikely to be what you want.
model("My name is My name is My name is MOO MOO.", schema)

{'first_name': ['MOO', 'MOO', 'MOO']}
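
If repeated values like these are not meaningful for your use case, one option is to de-duplicate the result after extraction. This is a hedged sketch of a post-processing step, not a Kor feature; `dedupe` is a hypothetical helper, and it would wrongly merge two distinct people who share a name.

```python
from typing import Dict, List


def dedupe(result: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop repeated values per field while preserving first-seen order."""
    return {key: list(dict.fromkeys(values)) for key, values in result.items()}


cleaned = dedupe({"first_name": ["MOO", "MOO", "MOO"]})
print(cleaned)  # {'first_name': ['MOO']}
```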

In [9]:
model(
    (
        "My name is Bobby. My brother's name is the same as mine except that it starts"
        " with a `C`."
    ),
    schema,
)

{'first_name': ['Bobby', 'Cobby']}

In [10]:
model("My name is Bobby. My brother's name rhymes with mine.", schema)

{'first_name': ['Bobby']}

In [11]:
model("My name is Bobby. My brother's name is like mine but different.", schema)

{'first_name': ['Bobby']}

In [12]:
model(
    (
        "My name is Bobby. My brother's name is the same as mine but it does not have"
        " vowels."
    ),
    schema,
)

{'first_name': ['Bobby']}
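
The outputs above mix values that appear verbatim in the input (such as `Bobby`) with values the model inferred (such as `Cobby`). When examining outputs, it can help to separate the two. The sketch below does this with a simple case-insensitive substring check; `split_by_grounding` is a hypothetical helper, and substring matching is a crude heuristic rather than a reliable correctness test.

```python
from typing import List, Tuple


def split_by_grounding(text: str, values: List[str]) -> Tuple[List[str], List[str]]:
    """Split values into those found verbatim in the text and those that are not."""
    haystack = text.lower()
    grounded = [v for v in values if v.lower() in haystack]
    inferred = [v for v in values if v.lower() not in haystack]
    return grounded, inferred


text = (
    "My name is Bobby. My brother's name is the same as mine except that it starts"
    " with a `C`."
)
grounded, inferred = split_by_grounding(text, ["Bobby", "Cobby"])
print(grounded)  # ['Bobby']
print(inferred)  # ['Cobby']
```

An inferred value is not necessarily wrong, but it is a good candidate for manual review.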

## The Prompt

Here's the prompt that gets sent to the LLM, shown with a placeholder in place of the user input.

In [13]:
print(model.prompt_generator.format_as_chat("user input goes here", schema))

[SystemMessage(content="Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.\n\n```TypeScript\n\n{\n first_name: string[] // The first name of a person\n}\n```\n\n\nFor Union types the output must EXACTLY match one of the members of the Union type.\n\nPlease enclose the extracted information in HTML style tags with the tag name corresponding to the corresponding component ID. Use angle style brackets for the tags ('>' and '<'). Only output tags when you're confident about the information that was extracted from the user's query. If you can extract several pieces of relevant information from the query, then include all of them. If the type is an array, please repeat the corresponding tag name multiple times once for each relevant extraction. Do NOT output anything except for th