<a href="https://colab.research.google.com/github/aknip/Langchain-etc./blob/main/LLM_JSON_Extract.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview:
````
1. Extraction with KOR and Langchain
1.1. Extract and analyze KOR's prompt, feed it directly to LLM
2. Data extraction by parsing (Langchain)
3. Data extraction with Strict-JSON
4. Data extraction by function calling
4.1 Function calling with Langchain
4.2 Function calling with litellm
````

In [None]:
!pip install openai tiktoken litellm langchain kor

In [3]:
import json
import os
from getpass import getpass
import psutil
import requests
import textwrap
from langchain.chat_models import ChatOpenAI
from langchain.globals import set_debug
from langchain.globals import set_verbose
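IN_NOTEBOOK = any("jupyter-notebook" in i for i in psutil.Process().parent().cmdline())
if IN_NOTEBOOK:
  # Ask for the secrets interactively and cache them in the environment
  CREDS = json.loads(getpass("Secrets (JSON string): "))
  os.environ['CREDS'] = json.dumps(CREDS)
# Outside a notebook, the secrets are expected in the CREDS environment variable
CREDS = json.loads(os.getenv('CREDS'))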

Secrets (JSON string): ··········


In [4]:
from litellm import completion
import openai
os.environ["OPENAI_API_KEY"] = CREDS['OpenAI']['v1']['credential']
os.environ["TOGETHERAI_API_KEY"] = CREDS['together-ai']['key']['credential']

# 1. Extraction with KOR and Langchain

- KOR takes a schema (object) that describes the target JSON
- Kor comes with built-in support for creating a schema “object” with fields of different types. Currently, Kor’s native support is limited to Object, Text, Number, Bool, and Selection input types.
- What sets Kor apart when creating schemas for LLMs is its ability to define a field’s purpose and context with textual descriptions and examples

Code:

1. Take input text and schema
2. Convert them to a prompt that includes few-shot examples. The prompt asks for output as CSV (Excel dialect)!
3. Run prompt
4. Take output of prompt and convert CSV to JSON

In [4]:
# Source: https://levelup.gitconnected.com/overcoming-challenges-of-llm-based-data-extraction-with-kor-1c0c6d4acd4a

from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Selection

text = "Around the world, there are numerous captivating tourist destinations that offer unique attractions for " \
       "visitors. One such place is Paris, France, known as the 'City of Love.' The iconic Eiffel Tower stands tall, " \
       "providing panoramic views of the city, while the Louvre Museum houses world-renowned art masterpieces like " \
       "the Mona Lisa. Moving to the United States, New York City beckons with its dazzling Times Square, the Statue " \
       "of Liberty, and the vibrant Broadway shows. Meanwhile, in Asia, Kyoto, Japan, enchants with its ancient " \
       "temples, tranquil gardens, and traditional geisha culture. The Great Wall of China, a monumental feat of " \
       "engineering, winds its way across the vast Chinese landscape, offering breathtaking views and a glimpse into " \
       "the country's rich history. In South America, Rio de Janeiro, Brazil, captivates with its vibrant Carnival " \
       "celebrations, Copacabana Beach, and the iconic Christ the Redeemer statue atop Corcovado Mountain. Lastly, " \
       "Australia's Great Barrier Reef lures adventurers with its stunning coral reefs and diverse marine life, " \
       "while the Sydney Opera House showcases architectural brilliance. These destinations, among many others, " \
       "embody the beauty, culture, and history that make our world a fascinating place to explore."

schema = Object(
    id="destinations",
    description="Tourist destination information",
    examples=[
        (
            "Ubud is famous for its unique Balinese temples and beautiful rice terraces.",
            {"destination": "Ubud", "attractions": "rice terraces, Balinese temples"}
        ),
        (
            "Tourists flock to Galle, Sri Lanka to enjoy a relaxing beach vacation.",
            {"destination": "Galle", "country": "Sri Lanka", "attractions": "beaches"}
        )
    ],
    attributes=[
        Text(
            id="destination",
            description="The name of the tourist destination",
            examples=[
                ("Thailand's Phuket island is a favorite among tourists", "Phuket")
            ]
        ),
        Text(
            id="country",
            description="The country the tourist destination is located in",
            examples=[
                ("Thailand's Phuket island is a favorite among tourists", "Thailand")
            ]
        ),
        Text(
            id="attractions",
            description="A comma separated list of attractions in the destination",
            examples=[
                ("Phuket is popular for beautiful beaches and vibrant night life", "beautiful beaches, vibrant night life")
            ]
        )
    ],
    many=True
)

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    max_tokens=2000,
)

set_debug(True)

chain = create_extraction_chain(llm, schema)
response = chain.run(text)

[32;1m[1;3m[chain/start][0m [1m[1:chain:LLMChain] Entering Chain run with input:
[0m{
  "text": "Around the world, there are numerous captivating tourist destinations that offer unique attractions for visitors. One such place is Paris, France, known as the 'City of Love.' The iconic Eiffel Tower stands tall, providing panoramic views of the city, while the Louvre Museum houses world-renowned art masterpieces like the Mona Lisa. Moving to the United States, New York City beckons with its dazzling Times Square, the Statue of Liberty, and the vibrant Broadway shows. Meanwhile, in Asia, Kyoto, Japan, enchants with its ancient temples, tranquil gardens, and traditional geisha culture. The Great Wall of China, a monumental feat of engineering, winds its way across the vast Chinese landscape, offering breathtaking views and a glimpse into the country's rich history. In South America, Rio de Janeiro, Brazil, captivates with its vibrant Carnival celebrations, Copacabana Beach, and the icon

In [5]:
print(json.dumps(response, indent=4))

{
    "data": {
        "destinations": [
            {
                "destination": "Paris",
                "country": "France",
                "attractions": "Eiffel Tower, Louvre Museum"
            },
            {
                "destination": "New York City",
                "country": "United States",
                "attractions": "Times Square, Statue of Liberty, Broadway shows"
            },
            {
                "destination": "Kyoto",
                "country": "Japan",
                "attractions": "ancient temples, tranquil gardens, traditional geisha culture"
            },
            {
                "destination": "Great Wall of China",
                "country": "China",
                "attractions": "monumental feat of engineering, breathtaking views, rich history"
            },
            {
                "destination": "Rio de Janeiro",
                "country": "Brazil",
                "attractions": "Carnival celebrations, Copacabana Beach,

The response object includes:
- "raw": The LLM response as CSV, delimited by |
- "data": The response as JSON (converted from the CSV)
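The relationship between the two keys can be illustrated with the standard library alone. The sketch below is not Kor's actual decoder; it assumes the raw text is a header row followed by data rows delimited by `|`, and the helper name `pipe_csv_to_records` is made up for this illustration.

```python
import csv
from io import StringIO


def pipe_csv_to_records(raw: str) -> list[dict]:
    """Parse pipe-delimited CSV text (header row first) into a list of dicts."""
    reader = csv.DictReader(StringIO(raw.strip()), delimiter="|")
    return [dict(row) for row in reader]


# Hypothetical "raw" value in the shape the prompt asks the LLM to produce
raw = (
    "destination|country|attractions\n"
    "Paris|France|Eiffel Tower, Louvre Museum\n"
    "Ubud||rice terraces, Balinese temples\n"
)
records = pipe_csv_to_records(raw)
print(records)
```

Empty cells come back as empty strings here, so a missing country stays distinguishable from a missing row.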


## 1.1. Extract and analyze KOR's prompt, feed it directly to LLM

The generated prompt (see above) looks like this:

````
System: Your goal is to extract structured information from the user's input
that matches the form described below. When extracting information please make
sure it matches the type information exactly. Do not add any attributes that do
not appear in the schema shown below.

```TypeScript

destinations: Array<{ // Tourist destination information
 destination: string // The name of the tourist destination
 country: string // The country the tourist destination is located in
 attractions: string // A comma separated list of attractions in the destination
}>
```

Please output the extracted information in CSV format in Excel dialect. Please
use a | as the delimiter.
Do NOT add any clarifying information. Output MUST follow the schema above. Do
NOT add any additional columns that do not appear in the schema.


Human: Ubud is famous for its unique Balinese temples and beautiful rice
terraces.
AI: destination|country|attractions
Ubud||rice terraces, Balinese temples

Human: Tourists flock to Galle, Sri Lanka to enjoy a relaxing beach vacation.
AI: destination|country|attractions
Galle|Sri Lanka|beaches

Human: Thailand's Phuket island is a favorite among tourists
AI: destination|country|attractions
Phuket||

Human: Thailand's Phuket island is a favorite among tourists
AI: destination|country|attractions
|Thailand|

Human: Phuket is popular for beautiful beaches and vibrant night life
AI: destination|country|attractions
||beautiful beaches, vibrant night life

Human: Around the world, there are numerous captivating tourist destinations
that offer unique attractions for visitors. One such place is Paris, France,
known as the 'City of Love.' The iconic Eiffel Tower stands tall, providing
panoramic views of the city, while the Louvre Museum houses world-renowned art
masterpieces like the Mona Lisa. Moving to the United States, New York City
beckons with its dazzling Times Square, the Statue of Liberty, and the vibrant
Broadway shows. Meanwhile, in Asia, Kyoto, Japan, enchants with its ancient
temples, tranquil gardens, and traditional geisha culture. The Great Wall of
China, a monumental feat of engineering, winds its way across the vast Chinese
landscape, offering breathtaking views and a glimpse into the country's rich
history. In South America, Rio de Janeiro, Brazil, captivates with its vibrant
Carnival celebrations, Copacabana Beach, and the iconic Christ the Redeemer
statue atop Corcovado Mountain. Lastly, Australia's Great Barrier Reef lures
adventurers with its stunning coral reefs and diverse marine life, while the
Sydney Opera House showcases architectural brilliance. These destinations, among
many others, embody the beauty, culture, and history that make our world a
fascinating place to explore.
````

Let's go through it step by step:

**First prompt part (static)**

Standard beginning of prompt:
````
System: Your goal is to extract structured information from the user's input
that matches the form described below. When extracting information please make
sure it matches the type information exactly. Do not add any attributes that do
not appear in the schema shown below.
````

**Second prompt part (dynamic)**

Based on the schema definition, a TypeScript object is added to the prompt:
````
```TypeScript

destinations: Array<{ // Tourist destination information
 destination: string // The name of the tourist destination
 country: string // The country the tourist destination is located in
 attractions: string // A comma separated list of attractions in the destination
}>
```
````

This is the corresponding schema definition in the code. The examples are added later (see the fourth prompt part):
````
schema = Object(
    id="destinations",
    description="Tourist destination information",
    attributes=[
        Text(
            id="destination",
            description="The name of the tourist destination",
            examples=[
                ("Thailand's Phuket island is a favorite among tourists", "Phuket")
            ]
        ),
        Text(
            id="country",
            description="The country the tourist destination is located in",
            examples=[
                ("Thailand's Phuket island is a favorite among tourists", "Thailand")
            ]
        ),
        Text(
            id="attractions",
            description="A comma separated list of attractions in the destination",
            examples=[
                ("Phuket is popular for beautiful beaches and vibrant night life", "beautiful beaches, vibrant night life")
            ]
        )
    ]
)
````
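To make the mapping from schema to type description concrete, here is a minimal sketch that renders the same text from plain Python data. It is not Kor's implementation: `render_typescript` is a hypothetical helper, every field is assumed to be a string, and the `Array<...>` wrapper is assumed to come from `many=True` in the full schema shown in the code cell.

```python
def render_typescript(obj_id, description, attributes, many=False):
    """Render a flat schema as a TypeScript-like type description.

    `attributes` is a list of (field_id, description) pairs; every field is
    treated as a string, as in the example schema.
    """
    lines = [f" {field}: string // {desc}" for field, desc in attributes]
    body = "{ // " + description + "\n" + "\n".join(lines) + "\n}"
    # Assumption: a schema that allows many objects is wrapped in Array<...>
    wrapped = f"Array<{body}>" if many else body
    return f"{obj_id}: {wrapped}"


type_description = render_typescript(
    "destinations",
    "Tourist destination information",
    [
        ("destination", "The name of the tourist destination"),
        ("country", "The country the tourist destination is located in"),
        ("attractions", "A comma separated list of attractions in the destination"),
    ],
    many=True,
)
print(type_description)
```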

**Third prompt part (static)**

Instructions for formatting the output are added to the prompt:
````
Please output the extracted information in CSV format in Excel dialect. Please
use a | as the delimiter.
Do NOT add any clarifying information. Output MUST follow the schema above. Do
NOT add any additional columns that do not appear in the schema.
````

**Fourth prompt part (dynamic)**

Few-shot examples in "Human:" - "AI:" chat pattern:
````
Human: Ubud is famous for its unique Balinese temples and beautiful rice
terraces.
AI: destination|country|attractions
Ubud||rice terraces, Balinese temples

Human: Tourists flock to Galle, Sri Lanka to enjoy a relaxing beach vacation.
AI: destination|country|attractions
Galle|Sri Lanka|beaches

Human: Thailand's Phuket island is a favorite among tourists
AI: destination|country|attractions
Phuket||

Human: Thailand's Phuket island is a favorite among tourists
AI: destination|country|attractions
|Thailand|

Human: Phuket is popular for beautiful beaches and vibrant night life
AI: destination|country|attractions
||beautiful beaches, vibrant night life
````
The examples are compiled from the object and attribute examples of the code:
````
examples=[
  (
      "Ubud is famous for its unique Balinese temples and beautiful rice terraces.",
      {"destination": "Ubud", "attractions": "rice terraces, Balinese temples"}
  ),
  (
      "Tourists flock to Galle, Sri Lanka to enjoy a relaxing beach vacation.",
      {"destination": "Galle", "country": "Sri Lanka", "attractions": "beaches"}
  )
],
...
attributes=[
  Text(
      id="destination",
      description="The name of the tourist destination",
      examples=[
          ("Thailand's Phuket island is a favorite among tourists", "Phuket")
      ]
  )
...
````
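The way object-level and attribute-level examples end up as "Human:" / "AI:" turns can be sketched as follows. This is an illustration under assumptions, not Kor's code: `example_to_turn` is a hypothetical helper, and fields missing from an example are assumed to be rendered as empty cells, which matches the generated prompt.

```python
FIELDS = ["destination", "country", "attractions"]


def example_to_turn(text, values, fields=FIELDS, delimiter="|"):
    """Format one example as a Human/AI turn with a pipe-delimited CSV answer."""
    header = delimiter.join(fields)
    # Fields that the example does not mention become empty cells
    row = delimiter.join(values.get(field, "") for field in fields)
    return f"Human: {text}\nAI: {header}\n{row}"


object_examples = [
    (
        "Ubud is famous for its unique Balinese temples and beautiful rice terraces.",
        {"destination": "Ubud", "attractions": "rice terraces, Balinese temples"},
    ),
    (
        "Tourists flock to Galle, Sri Lanka to enjoy a relaxing beach vacation.",
        {"destination": "Galle", "country": "Sri Lanka", "attractions": "beaches"},
    ),
]
# Attribute examples carry a single value for a single field
attribute_examples = [
    ("destination", "Thailand's Phuket island is a favorite among tourists", "Phuket"),
    ("country", "Thailand's Phuket island is a favorite among tourists", "Thailand"),
]

turns = [example_to_turn(text, values) for text, values in object_examples]
turns += [example_to_turn(text, {field: value}) for field, text, value in attribute_examples]
print("\n\n".join(turns))
```

This also explains why the same Phuket sentence appears twice in the prompt: each attribute example produces its own turn with only one column filled in.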
**Fifth prompt part (dynamic)**

The input text:
````
Human: Around the world, there are numerous captivating tourist destinations
that offer unique attractions for visitors. One such place is Paris, France,
known as the 'City of Love.' The iconic Eiffel Tower stands tall, providing
panoramic views of the city, while the Louvre Museum houses world-renowned art
masterpieces like the Mona Lisa. Moving to the United States, New York City
beckons with its dazzling Times Square, the Statue of Liberty, and the vibrant
Broadway shows. Meanwhile, in Asia, Kyoto, Japan, enchants with its ancient
temples, tranquil gardens, and traditional geisha culture. The Great Wall of
China, a monumental feat of engineering, winds its way across the vast Chinese
landscape, offering breathtaking views and a glimpse into the country's rich
history. In South America, Rio de Janeiro, Brazil, captivates with its vibrant
Carnival celebrations, Copacabana Beach, and the iconic Christ the Redeemer
statue atop Corcovado Mountain. Lastly, Australia's Great Barrier Reef lures
adventurers with its stunning coral reefs and diverse marine life, while the
Sydney Opera House showcases architectural brilliance. These destinations, among
many others, embody the beauty, culture, and history that make our world a
fascinating place to explore.

````





Let's take the prompt generated by KOR and feed it directly to an LLM via litellm:

In [6]:
kor_prompt_text = (
    "System: Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.\n\n```TypeScript\n\ndestinations: Array<{ // Tourist destination information\n destination: string // The name of the tourist destination\n country: string // The country the tourist destination is located in\n attractions: string // A comma separated list of attractions in the destination\n}>\n```\n\n\nPlease output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. \n Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.\n\n\nHuman: Ubud is famous for its unique Balinese temples and beautiful rice terraces.\nAI: destination|country|attractions\nUbud||rice terraces, Balinese temples\n\nHuman: Tourists flock to Galle, Sri Lanka to enjoy a relaxing beach vacation.\nAI: destination|country|attractions\nGalle|Sri Lanka|beaches\n\nHuman: Thailand's Phuket island is a favorite among tourists\nAI: destination|country|attractions\nPhuket||\n\nHuman: Thailand's Phuket island is a favorite among tourists\nAI: destination|country|attractions\n|Thailand|\n\nHuman: Phuket is popular for beautiful beaches and vibrant night life\nAI: destination|country|attractions\n||beautiful beaches, vibrant night life\n\nHuman: Around the world, there are numerous captivating tourist destinations that offer unique attractions for visitors. One such place is Paris, France, known as the 'City of Love.' The iconic Eiffel Tower stands tall, providing panoramic views of the city, while the Louvre Museum houses world-renowned art masterpieces like the Mona Lisa. "
    "Moving to the United States, New York City beckons with its dazzling Times Square, the Statue of Liberty, and the vibrant Broadway shows. Meanwhile, in Asia, Kyoto, Japan, enchants with its ancient temples, tranquil gardens, and traditional geisha culture. The Great Wall of China, a monumental feat of engineering, winds its way across the vast Chinese landscape, offering breathtaking views and a glimpse into the country's rich history. In South America, Rio de Janeiro, Brazil, captivates with its vibrant Carnival celebrations, Copacabana Beach, and the iconic Christ the Redeemer statue atop Corcovado Mountain. Lastly, Australia's Great Barrier Reef lures adventurers with its stunning coral reefs and diverse marine life, while the Sydney Opera House showcases architectural brilliance. These destinations, among many others, embody the beauty, culture, and history that make our world a fascinating place to explore."
)
#for x in kor_prompt_text.split('\n'):
#  print(textwrap.fill(x, 80))
response = completion(
  model="gpt-3.5-turbo",
  messages=[{ "content": kor_prompt_text,"role": "user"}]
)
print(response.choices[0].message.content)
#print(response)

AI: destination|country|attractions
Paris|France|Eiffel Tower, Louvre Museum
New York City|United States|Times Square, Statue of Liberty, Broadway shows
Kyoto|Japan|ancient temples, tranquil gardens, traditional geisha culture
Great Wall of China|China|
Rio de Janeiro|Brazil|Carnival celebrations, Copacabana Beach, Christ the Redeemer statue
Great Barrier Reef|Australia|stunning coral reefs, diverse marine life
Sydney Opera House|Australia|architectural brilliance


And convert the response CSV to JSON using pandas:

In [7]:
import pandas as pd
import json
from io import StringIO

# Strip the leading "AI: " label only if the model actually emitted it
csv_text = response.choices[0].message.content.removeprefix("AI: ")
response_IO = StringIO(csv_text)
response_df = pd.read_csv(response_IO, sep="|")
response_JSON = json.loads(response_df.to_json(orient='table',index=False))['data']
print(json.dumps(response_JSON, indent=4))

[
    {
        "destination": "Paris",
        "country": "France",
        "attractions": "Eiffel Tower, Louvre Museum"
    },
    {
        "destination": "New York City",
        "country": "United States",
        "attractions": "Times Square, Statue of Liberty, Broadway shows"
    },
    {
        "destination": "Kyoto",
        "country": "Japan",
        "attractions": "ancient temples, tranquil gardens, traditional geisha culture"
    },
    {
        "destination": "Great Wall of China",
        "country": "China",
        "attractions": null
    },
    {
        "destination": "Rio de Janeiro",
        "country": "Brazil",
        "attractions": "Carnival celebrations, Copacabana Beach, Christ the Redeemer statue"
    },
    {
        "destination": "Great Barrier Reef",
        "country": "Australia",
        "attractions": "stunning coral reefs, diverse marine life"
    },
    {
        "destination": "Sydney Opera House",
        "country": "Australia",
        "attractio

# 2. Data extraction by parsing (Langchain)

See https://python.langchain.com/docs/use_cases/extraction#option-2-parsing

In [57]:
from typing import Optional, Sequence

from langchain.llms import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import (
    PromptTemplate,
)
from pydantic import BaseModel, Field, validator

class Person(BaseModel):
    person_name: str
    person_height: int
    person_hair_color: str
    dog_breed: Optional[str]
    dog_name: Optional[str]

class People(BaseModel):
    """Identifying information about all people in a text."""
    people: Sequence[Person]

# Run
query = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde."""

# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=People)

# Prompt
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

set_debug(True)

# Run
_input = prompt.format_prompt(query=query)
model = OpenAI(temperature=0)
output = model(_input.to_string())
output_json = json.loads(output)
#output_obj = parser.parse(output)


[32;1m[1;3m[llm/start][0m [1m[1:llm:OpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Answer the user query.\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\nthe object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{\"description\": \"Identifying information about all people in a text.\", \"properties\": {\"people\": {\"title\": \"People\", \"type\": \"array\", \"items\": {\"$ref\": \"#/definitions/Person\"}}}, \"required\": [\"people\"], \"definitions\": {\"Person\": {\"title\": \"Person\", \"type\": \"object\", \"properties\": {\"person_name\": {\"title\": \"Person Name\", \"type\"

In [58]:
print(json.dumps(output_json, indent=2))

{
  "people": [
    {
      "person_name": "Alex",
      "person_height": 5,
      "person_hair_color": "blonde"
    },
    {
      "person_name": "Claudia",
      "person_height": 6,
      "person_hair_color": "brunette"
    }
  ]
}


In [59]:
for x in _input.to_string().split('\n'):
  print(textwrap.fill(x, 80))

Answer the user query.
The output should be formatted as a JSON instance that conforms to the JSON
schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo",
"description": "a list of strings", "type": "array", "items": {"type":
"string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema.
The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"description": "Identifying information about all people in a text.",
"properties": {"people": {"title": "People", "type": "array", "items": {"$ref":
"#/definitions/Person"}}}, "required": ["people"], "definitions": {"Person":
{"title": "Person", "type": "object", "properties": {"person_name": {"title":
"Person Name", "type": "string"}, "person_height": {"title": "Person Height",
"type": "integer"}, "person_hair_color": {"title": "Person Hair Color", "type":
"string"}, "dog_breed": {"title": "Dog Breed", "type": "string

## Analyzing the prompt

The prompt is generated based on the two Pydantic classes.

**First prompt part (static)**

````
Answer the user query.
The output should be formatted as a JSON instance that conforms to the JSON
schema below.

As an example, for the schema
{
  "properties": {
    "foo": {
      "title": "Foo",
      "description": "a list of strings",
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": [
    "foo"
  ]
}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema.
The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
````

**Second prompt part (dynamic)**

````
Here is the output schema:
```
{
  "description": "Identifying information about all people in a text.",
  "properties": {
    "people": {
      "title": "People",
      "type": "array",
      "items": {
        "$ref": "#/definitions/Person"
      }
    }
  },
  "required": [
    "people"
  ],
  "definitions": {
    "Person": {
      "title": "Person",
      "type": "object",
      "properties": {
        "person_name": {
          "title": "Person Name",
          "type": "string"
        },
        "person_height": {
          "title": "Person Height",
          "type": "integer"
        },
        "person_hair_color": {
          "title": "Person Hair Color",
          "type": "string"
        },
        "dog_breed": {
          "title": "Dog Breed",
          "type": "string"
        },
        "dog_name": {
          "title": "Dog Name",
          "type": "string"
        }
      },
      "required": [
        "person_name",
        "person_height",
        "person_hair_color"
      ]
    }
  }
}
```
Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him.
Claudia is a brunette and Alex is blonde.
````

This part of the prompt is derived from the two Pydantic classes:

````
class Person(BaseModel):
    person_name: str
    person_height: int
    person_hair_color: str
    dog_breed: Optional[str]
    dog_name: Optional[str]

class People(BaseModel):
    """Identifying information about all people in a text."""
    people: Sequence[Person]
````
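The split between required and optional fields in the schema can also be checked on the model output. The notebook leaves `parser.parse(output)` commented out; the sketch below is a standard-library stand-in for that step, not the LangChain parser. `PersonRecord`, `REQUIRED` and `parse_people` are names invented for this illustration.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PersonRecord:
    person_name: str
    person_height: int
    person_hair_color: str
    dog_breed: Optional[str] = None
    dog_name: Optional[str] = None


# Mirrors the "required" list of the generated JSON schema
REQUIRED = ("person_name", "person_height", "person_hair_color")


def parse_people(output: str) -> list[PersonRecord]:
    """Parse the LLM's JSON text and reject entries that lack required fields."""
    payload = json.loads(output)
    known = {f.name for f in fields(PersonRecord)}
    people = []
    for item in payload["people"]:
        missing = [name for name in REQUIRED if name not in item]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        # Ignore keys that are not part of the schema
        people.append(PersonRecord(**{k: v for k, v in item.items() if k in known}))
    return people


llm_output = """{"people": [
  {"person_name": "Alex", "person_height": 5, "person_hair_color": "blonde"},
  {"person_name": "Claudia", "person_height": 6, "person_hair_color": "brunette"}
]}"""
people = parse_people(llm_output)
print(people)
```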

# 3. Data extraction with Strict-JSON

- Supports values containing many ' or " or { or } or \ or \n characters, e.g. code
- Modified here to call litellm instead of the OpenAI client!


In [5]:
# source: https://github.com/tanchongmin/strictjson
# Notebook demo: https://github.com/tanchongmin/strictjson/blob/main/Strict_Text_(Strict_JSON_v2).ipynb

import json
import re
from litellm import completion  # the function below calls litellm, not the OpenAI client

def strict_text(system_prompt, user_prompt, output_format, delimiter = '###', model = 'gpt-3.5-turbo', temperature = 0, num_tries = 3, verbose = False):
    ''' Ensures that the LLM output adheres to the desired output json format.
    Uses rule-based iterative feedback to ask the model to self-correct.
    Keeps trying up to num_tries times if it does not. Returns an empty dict if still unsuccessful after num_tries iterations.'''

    # start off with no error message
    error_msg = ''

    for i in range(num_tries):

        # make the output format keys with a unique identifier
        new_output_format = {}
        for key in output_format.keys():
            new_output_format[f'{delimiter}{key}{delimiter}'] = output_format[key]
        output_format_prompt = f'''\nYou are to output the following in json format: {new_output_format}
You must use "{delimiter}{{key}}{delimiter}" to enclose the each {{key}}.'''

        # Use litellm to get a response

        response = completion(
          temperature = temperature,
          model=model,
          messages=[
            {"role": "system", "content": system_prompt + output_format_prompt + error_msg},
            {"role": "user", "content": str(user_prompt)}
          ]
        )


        res = response['choices'][0]['message']['content']

        if verbose:
            print('System prompt:', system_prompt + output_format_prompt + error_msg)
            print('\nUser prompt:', str(user_prompt))
            print('\nGPT response:', res)

        # try-catch block to ensure output format is adhered to
        try:
            # check key appears for each element in the output
            for key in new_output_format.keys():
                # if output field missing, raise an error
                if key not in res: raise Exception(f"{key} not in json output")

            # if all is good, we then extract out the fields
            # Use regular expressions to extract keys and values
            pattern = fr",*\s*['|\"]{delimiter}([^#]*){delimiter}['|\"]: "

            matches = re.split(pattern, res[1:-1])

            # remove null matches
            my_matches = [match for match in matches if match !='']

            # remove the ' or " from the value matches
            curated_matches = [match[1:-1] if match[0] in '\'"' else match for match in my_matches]

            # create a dictionary
            end_dict = {}
            for i in range(0, len(curated_matches), 2):
                end_dict[curated_matches[i]] = curated_matches[i+1]

            return end_dict

        except Exception as e:
            error_msg = f"\n\nResult: {res}\n\nError message: {str(e)}\nYou must use \"{delimiter}{{key}}{delimiter}\" to enclose the each {{key}}."
            print("An exception occurred:", str(e))
            print("Current invalid json format:", res)

    return {}
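
The parsing step inside `strict_text` can be tried without calling an LLM. The sketch below reuses the same regular expression on a canned response string; `parse_delimited` and the sample text are made up for this illustration, and the default delimiter `###` is assumed.

```python
import re


def parse_delimited(res, delimiter="###"):
    """Split a dict-like response on its delimiter-wrapped keys."""
    pattern = fr",*\s*['|\"]{delimiter}([^#]*){delimiter}['|\"]: "
    # res[1:-1] drops the surrounding braces; re.split returns keys and values alternately
    parts = [part for part in re.split(pattern, res[1:-1]) if part != ""]
    # Remove the enclosing quotes from quoted values
    values = [part[1:-1] if part[0] in "'\"" else part for part in parts]
    return {values[i]: values[i + 1] for i in range(0, len(values), 2)}


sample = "{'###Sentiment###': 'Positive', '###Tense###': 'Present'}"
parsed = parse_delimited(sample)
print(parsed)
```

Because the text is split on the wrapped keys rather than parsed as JSON, quotes and braces inside a value do not break the extraction. The trade-off is that the result is a single flat dictionary of strings.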

In [10]:
res = strict_text(system_prompt = 'You are a data extractor.',
                    user_prompt = 'Extract the data of all persons: Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.',
                    output_format = {"person_name": "Name of person",
                                      "person_height": "height of person",
                                      "person_hair_color": "hair color of person"})

print(json.dumps(res, indent=2))

{
  "person_name": "Claudia",
  "person_height": "6 feet",
  "person_hair_color": "brunette\""
}


In [8]:
res = strict_text(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful day',
                    output_format = {"Sentiment": "Type of Sentiment",
                                    "Tense": "Type of Tense"})
print(json.dumps(res, indent=2))

{
  "Sentiment": "Positive",
  "Tense": "Present"
}


In [9]:
res = strict_text(system_prompt = 'You are a code generator, generating code to fulfil a task',
                    user_prompt = 'Sum all elements in a given array p',
                    output_format = {"Elaboration": "How you would do it",
                                     "C": "Code in C",
                                    "Python": "Code in Python"})

print(json.dumps(res, indent=2))

{
  "Elaboration": "To sum all elements in a given array, you can iterate through each element of the array and keep adding them to a running total.",
  "C": "int sum = 0;\\nfor (int i = 0; i < sizeof(p) / sizeof(p[0]); i++) {\\n    sum += p[i];\\n}",
  "Python": "sum = 0\\nfor num in p:\\n    sum += num\\n"
}


# 4. Data extraction by function calling

## 4.1 Function calling with Langchain

Langchain's `create_extraction_chain` uses OpenAI's function calling under the hood.

See https://python.langchain.com/docs/use_cases/extraction#looking-under-the-hood

In [37]:
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

# Schema (this is in fact part of a "functions" object passed to the LLM - compare with the litellm version in 4.2!)
# Important: Although the schema is not explicitly defined as an array, Langchain converts it to a function parameter of type 'array'
schema = {
    "properties": {
        "name": {"type": "string", "description": "name of person"},
        "height": {"type": "integer", "description": "the height of the person"},
        "hair_color": {"type": "string", "description": "the color of the hair of the person"},
    },
    "required": ["name", "height"],
}

# Input
inp = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde."""

set_debug(True)

# Run chain
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)
response = chain.run(inp)


[chain/start] [1:chain:LLMChain] Entering Chain run with input:
{
  "input": "Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde."
}
[llm/start] [1:chain:LLMChain > 2:llm:ChatOpenAI] Entering LLM run with input:
{
  "prompts": [
    "Human: Extract and save the relevant entities mentioned in the following passage together with their properties.\n\nOnly extract the properties mentioned in the 'information_extraction' function.\n\nIf a property is not present and is not required in the function parameters, do not include it in the output.\n\nPassage:\nAlex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde."
  ]
}
[llm/end] [1:chain:LLMChain > 2:llm:ChatOpenAI] [3.11s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "",
        "generation_info": {
     

In [31]:
print(json.dumps(response, indent=4))

[
    {
        "name": "Alex",
        "height": 5,
        "hair_color": "blonde"
    },
    {
        "name": "Claudia",
        "height": 6,
        "hair_color": "brunette"
    }
]
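
The conversion mentioned in the schema comment (flat schema in, array-typed function parameter out) can be sketched as follows. This is only an approximation of what Langchain builds internally: the function name `information_extraction` appears in the debug prompt above, but the helper `wrap_schema_as_function`, the parameter name `info` and the description text are assumptions made for illustration.

```python
import json

def wrap_schema_as_function(entity_schema):
    """Hypothetical helper: wrap a flat entity schema in an array-typed function parameter."""
    # Each extracted entity becomes one object inside the array
    item_schema = {"type": "object", **entity_schema}
    return {
        "name": "information_extraction",
        "description": "Extracts the relevant information from the passage.",  # assumed wording
        "parameters": {
            "type": "object",
            "properties": {
                "info": {"type": "array", "items": item_schema},  # 'info' is an assumed name
            },
            "required": ["info"],
        },
    }

# Same schema as in the cell above
schema = {
    "properties": {
        "name": {"type": "string", "description": "name of person"},
        "height": {"type": "integer", "description": "the height of the person"},
        "hair_color": {"type": "string", "description": "the color of the hair of the person"},
    },
    "required": ["name", "height"],
}

function_def = wrap_schema_as_function(schema)
print(json.dumps(function_def, indent=2))
```

Because the entities sit inside an array, the model can return several persons in one function call, which matches the two-element list printed above.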


## 4.2 Function calling with litellm

Let's do the same natively, passing the function definition directly to the LLM through litellm instead of going through Langchain:

In [38]:
import os, litellm
from litellm import completion

# IMPORTANT - Set this to True to add the function to the prompt for non-OpenAI LLMs
litellm.add_function_to_prompt = True

# "Schema"
# Important: the property 'persons' must be defined as an array with 'items'. If it is not defined as an array, the LLM will respond with only ONE person.
# Note: in JSON Schema, 'items' is normally itself a schema such as {"type": "object", "properties": {...}}; the output below was produced with the looser form shown here.

functions = [
    {
        "name": "information_extraction",
        "description": "Get the description of all persons",
        "parameters": {
            "type": "object",
            "properties": {
                "persons": {
                    "type": "array",
                    "description": "properties of persons",
                    "items": {
                        "name": {"type": "string", "description": "name of person"},
                        "height": {"type": "integer", "description": "the height of the person"},
                        "hair_color": {"type": "string", "description": "the color of the hair of the person"},
                    },
                },
            },
            "required": ["persons"],
        },
    }
]

# Input
inp = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde."""

messages = [
    # Intro text for Prompt from Langchain (may be used to optimize results?):
    # Extract and save the relevant entities mentioned in the following passage together with their properties.\n\n
    # Only extract the properties mentioned in the 'information_extraction' function.
    # If a property is not present and is not required in the function parameters, do not include it in the output.\n\nPassage:\n
    {"role": "user", "content": inp}
]

#response = completion(model="gpt-3.5-turbo-1106", messages=messages, functions=functions)
response = completion(model="gpt-3.5-turbo", messages=messages, functions=functions)

print(response)
print()
message = response.choices[0]['message']
# getattr with a default also covers responses where the attribute exists but is None
function_call = getattr(message, 'function_call', None)
if function_call:
  function_call_name = function_call.name
  function_call_arguments = function_call.arguments
  print(function_call_arguments)
else:
  print('No function found')

ModelResponse(id='chatcmpl-8br98v1cVG2jyvjqSsqO9xONUBwJg', choices=[Choices(finish_reason='function_call', index=0, message=Message(content='[\n  {\n    "name": "Alex",\n    "height": "5 feet",\n    "hair color": "blonde"\n  },\n  {\n    "name": "Claudia",\n    "height": "6 feet",\n    "hair color": "brunette"\n  }\n]', role='assistant', function_call=FunctionCall(arguments='{\n  "persons": [\n    {\n      "name": "Alex",\n      "height": "5 feet",\n      "hair color": "blonde"\n    },\n    {\n      "name": "Claudia",\n      "height": "6 feet",\n      "hair color": "brunette"\n    }\n  ]\n}', name='information_extraction')))], created=1704033122, model='gpt-3.5-turbo-0613', object='chat.completion', system_fingerprint=None, usage=Usage(completion_tokens=137, prompt_tokens=80, total_tokens=217), _response_ms=5570.419)

{
  "persons": [
    {
      "name": "Alex",
      "height": "5 feet",
      "hair color": "blonde"
    },
    {
      "name": "Claudia",
      "height": "6 feet",
      

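In the recorded output, the heights come back as strings ("5 feet") and the key is "hair color" rather than "hair_color", so the model did not follow the property definitions closely. One possible reason is that 'items' lists the properties directly instead of wrapping them in an object schema. The sketch below shows a stricter variant of the schema and a small local check, run against the arguments recorded above. The helper `find_schema_violations` is hypothetical, the `required` list is borrowed from the Langchain schema in 4.1, and whether the stricter schema actually changes the model's answer has not been tested here.

```python
import json

# Stricter item schema: 'items' is a full object schema (assumption: same required fields as in 4.1)
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "name of person"},
        "height": {"type": "integer", "description": "the height of the person"},
        "hair_color": {"type": "string", "description": "the color of the hair of the person"},
    },
    "required": ["name", "height"],
}

functions = [
    {
        "name": "information_extraction",
        "description": "Get the description of all persons",
        "parameters": {
            "type": "object",
            "properties": {
                "persons": {
                    "type": "array",
                    "description": "properties of persons",
                    "items": person_schema,
                },
            },
            "required": ["persons"],
        },
    }
]

def find_schema_violations(arguments_json):
    """Hypothetical helper: list keys and types that do not match person_schema."""
    python_types = {"string": str, "integer": int}
    problems = []
    for i, person in enumerate(json.loads(arguments_json)["persons"]):
        for key, value in person.items():
            spec = person_schema["properties"].get(key)
            if spec is None:
                problems.append(f"persons[{i}]: unexpected key '{key}'")
            elif not isinstance(value, python_types[spec["type"]]):
                problems.append(
                    f"persons[{i}].{key}: expected {spec['type']}, got {type(value).__name__}"
                )
    return problems

# Arguments as recorded in the response above
recorded = (
    '{"persons": ['
    '{"name": "Alex", "height": "5 feet", "hair color": "blonde"}, '
    '{"name": "Claudia", "height": "6 feet", "hair color": "brunette"}]}'
)
for problem in find_schema_violations(recorded):
    print(problem)
```

A check like this can run after every call, so that malformed extractions are caught before they reach downstream code.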
In [None]:
# see https://litellm.vercel.app/docs/completion/function_call

# via Huggingface?
# https://litellm.vercel.app/docs/providers/huggingface
# https://huggingface.co/Trelis/Mixtral-8x7B-Instruct-v0.1-function-calling-v3
# https://huggingface.co/Trelis/Mistral-7B-Instruct-v0.1-function-calling-v2

# via Anyscale?
# https://docs.litellm.ai/docs/providers/anyscale
# https://www.anyscale.com/blog/anyscale-endpoints-json-mode-and-function-calling-features

In [None]:
import os, litellm
from litellm import completion

# IMPORTANT - Set this to TRUE to add the function to the prompt for Non OpenAI LLMs
litellm.add_function_to_prompt = True

# The LLM only sees the function description below, never this implementation. The function itself could be called after the LLM call (not done in this code!)
def get_current_weather(location):
  if location == "Boston, MA":
    return "The weather is 12F"

functions = [
    {
      "name": "get_current_weather",
      "description": "Get the current weather in a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The city and state, e.g. San Francisco, CA"
          },
          "unit": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"]
          }
        },
        "required": ["location"]
      }
    }
  ]

messages = [
    {"role": "user", "content": "What is the weather like in Boston?"}
]

response = completion(model="gpt-3.5-turbo-1106", messages=messages, functions=functions)

print(response)
print()
message = response.choices[0]['message']
# getattr with a default also covers responses where the attribute exists but is None
function_call = getattr(message, 'function_call', None)
if function_call:
  function_call_name = function_call.name
  function_call_arguments = function_call.arguments
  print(function_call_name)
else:
  print('No function found')

ModelResponse(id='chatcmpl-8boD672S6ONQrFt6DEiCw6gCAv3Hs', choices=[Choices(finish_reason='function_call', index=0, message=Message(content=None, role='assistant', function_call=FunctionCall(arguments='{"location":"Boston, MA"}', name='get_current_weather')))], created=1704021836, model='gpt-3.5-turbo-1106', object='chat.completion', system_fingerprint='fp_772e8125bb', usage=Usage(completion_tokens=17, prompt_tokens=82, total_tokens=99), _response_ms=820.1850000000001)

get_current_weather
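
The cell above stops once the LLM has named the function. The missing step, actually calling the local function with the returned arguments, could look roughly like this. It is a minimal sketch: `available_functions` and `dispatch_function_call` are hypothetical names, the stub is extended with an optional `unit` parameter and a fallback message, and the name and arguments are hard-coded from the recorded response so that no API call is needed.

```python
import json

def get_current_weather(location, unit="fahrenheit"):
    # Stub based on the cell above; 'unit' and the fallback message are additions
    if location == "Boston, MA":
        return "The weather is 12F"
    return f"No weather data for {location}"

# Registry mapping the function names the LLM may return to local Python callables
available_functions = {"get_current_weather": get_current_weather}

def dispatch_function_call(name, arguments_json):
    """Hypothetical helper: look up the named function and call it with the decoded arguments."""
    func = available_functions.get(name)
    if func is None:
        raise ValueError(f"LLM requested unknown function: {name}")
    # The arguments arrive as a JSON string and must be decoded first
    kwargs = json.loads(arguments_json)
    return func(**kwargs)

# Name and arguments taken from the recorded response above
result = dispatch_function_call("get_current_weather", '{"location":"Boston, MA"}')
print(result)
```

Looking the name up in a registry, instead of calling whatever the model returns, keeps the set of callable functions explicit.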
