# Pydantic model for output format

Author: Pavel Agurov, pavel_agurov@epam.com

Pydantic lets you derive the output format instructions from a model class. To make the example fully class-based, we also define a class for the input data. Note that you still have to add the {format_instructions} placeholder to the prompt yourself.

The goal of this code is to compare two lists of strings and build pairs between them. We ask the model to return not only the pairs but also a score and an explanation for each one, so the solution is not a black box and we get some insight into the model's decisions.

In [None]:
%pip install openai > /dev/null
%pip install tiktoken > /dev/null
%pip install langchain > /dev/null
%pip install langchain_openai > /dev/null
%pip install langchain_core > /dev/null
%pip install langchain_community > /dev/null
%pip install langchain_text_splitters > /dev/null
%pip install sentence-transformers > /dev/null

## Prompt with Pydantic model for output

In this example we build a custom Pydantic model for the output data, use it to generate format instructions for the LLM, and later use it to parse the result.

In [8]:
from dataclasses import dataclass
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate


COMPARE_PROMPT_TEMPLATE = """
Your task is to find the best pairs between 2 string lists if possible.
If you can't build pair for the item - just say "no pair".
Be sure that you read all items from first list.
Be sure that you check ALL items from second list and found the best fit.

<first_list>
{first_list}
</first_list>

<second_list>
{second_list}
</second_list>

{format_instructions}
"""

@dataclass
class InputStringList:
    items : list[str]
    
    def __repr__(self):
        return "".join([f"- {s}\n" for s in self.items])
        

class PairItem(BaseModel):
    """Pair item"""
    first_item  : str   = Field(description="Item from the first list")
    second_item : str   = Field(description="Relevant item (if exists)")
    score       : float = Field(description="Score of relevance")
    explanation : str   = Field(description="Explain your decision")

class PairedList(BaseModel):
    """Pair list"""
    pair_list : list[PairItem]
        
parser = PydanticOutputParser(pydantic_object= PairedList)

prompt = PromptTemplate(
    template= COMPARE_PROMPT_TEMPLATE,
    input_variables=["first_list", "second_list"],
    partial_variables={
        "format_instructions": parser.get_format_instructions()
    }
)
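
The input class works because the prompt template converts each variable to a string when it fills the placeholders. The sketch below illustrates this with the standard library only: plain str.format stands in for PromptTemplate (an assumption that its default formatting behaves the same way here), and the short template is a hypothetical cut-down version of the one above.

```python
from dataclasses import dataclass


@dataclass
class InputStringList:
    items: list[str]

    # A dataclass keeps an explicitly defined __repr__, and str() falls back
    # to it, so this is the text that ends up in the prompt.
    def __repr__(self) -> str:
        return "".join(f"- {s}\n" for s in self.items)


# Hypothetical cut-down template; str.format stands in for PromptTemplate.
template = "<first_list>\n{first_list}\n</first_list>"
rendered = template.format(first_list=InputStringList(["cat", "dog"]))
print(rendered)
```

Each item ends with a newline, so the rendered list is followed by an empty line before the closing tag.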

## LLM model

The model should be powerful enough to produce relevant results, yet cheap enough for the solution to stay cost-effective.

Remember the temperature parameter: in LangChain its default is not 0.

In [2]:
import os
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
        api_key         = os.environ['OPENAI_API_KEY'],
        api_version     = "2023-07-01-preview",
        azure_endpoint  = "https://ai-proxy.lab.epam.com",
        model           = "gpt-4o-mini-2024-07-18",
        temperature     = 0.0
    )

## Chain

You combine the prompt, the model, and the output parser into one chain.

In [3]:
chain = prompt | llm | parser
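
To see what the | operator does conceptually, here is a toy sketch with the standard library only. The Step class and the toy_* names are hypothetical and are not LangChain's implementation; they only show the idea that each step's output becomes the next step's input.

```python
import json


class Step:
    """Toy stand-in for a chain step (hypothetical, not a LangChain class)."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # a | b builds a new step that runs a first, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Fake prompt: fills a template from a dict of variables.
toy_prompt = Step(lambda d: "Pair {first_list} with {second_list}".format(**d))
# Fake model: ignores the prompt and always returns the same JSON text.
toy_llm = Step(lambda text: '{"pair_list": []}')
# Fake parser: turns the JSON text into a Python object.
toy_parser = Step(json.loads)

toy_chain = toy_prompt | toy_llm | toy_parser
toy_result = toy_chain.invoke({"first_list": "- cat", "second_list": "- mouse"})
print(toy_result)
```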

## Run

- get_openai_callback lets us count the tokens used
- InputStringList wraps a list of strings and renders it as a bulleted list for the prompt
- call_llm calls the LLM through the chain's invoke method and returns the parsed result together with the token count

In [4]:
from langchain_community.callbacks import get_openai_callback

def call_llm(chain, first_list, second_list):
    with get_openai_callback() as cb:
        llm_result = chain.invoke({
                "first_list"   : first_list, 
                "second_list"  : second_list
        })
        return llm_result, cb.total_tokens

In [5]:
first_input_list  = InputStringList(
    ['cat', 'dog', 'apple', 'computer']
)
second_input_list = InputStringList(
    ['mouse', 'orange', 'shepherd']
)

In [6]:
object_result, tokens_used = call_llm(chain, first_input_list, second_input_list)
print(f"Used tokens: {tokens_used}")

# object_result is already PairedList
for r in object_result.pair_list:
    print(f"{r.first_item=}, {r.second_item=}, {r.score=}, {r.explanation=}")

Used tokens: 602
r.first_item='cat', r.second_item='mouse', r.score=0.5, r.explanation='Both are small animals'
r.first_item='dog', r.second_item='shepherd', r.score=0.7, r.explanation='Shepherd is a breed of dog'
r.first_item='apple', r.second_item='orange', r.score=0.6, r.explanation='Both are fruits'
r.first_item='computer', r.second_item='no pair', r.score=0.0, r.explanation='No relevant item found in the second list'
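
Because every pair carries a score, the result can be post-processed without asking the model again. A minimal sketch, assuming a hypothetical threshold of 0.6 and a plain dataclass Pair as a stand-in for PairItem; the values are copied from the output above.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """Stand-in for PairItem (hypothetical, to avoid external dependencies)."""
    first_item: str
    second_item: str
    score: float


# Values copied from the printed result above.
pairs = [
    Pair("cat", "mouse", 0.5),
    Pair("dog", "shepherd", 0.7),
    Pair("apple", "orange", 0.6),
    Pair("computer", "no pair", 0.0),
]

MIN_SCORE = 0.6  # hypothetical threshold, tune it for your data

# Keep only real pairs whose score reaches the threshold.
confident = [
    (p.first_item, p.second_item)
    for p in pairs
    if p.second_item != "no pair" and p.score >= MIN_SCORE
]
print(confident)
```

The same filter works on object_result.pair_list, since PairItem exposes the same attribute names.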


## Notes

Please note that a Pydantic model can't guarantee that the output format is always correct. You will still get parsing exceptions when the LLM hallucinates. As you can see below, the format instructions are still plain JSON with additional instructions around the schema. If the returned JSON is invalid or does not match the schema, parsing raises an exception.
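
One common way to cope with such failures is to retry the call a few times. The sketch below is a generic helper using only the standard library; invoke_with_retry and fake_invoke are hypothetical names, and json.loads stands in for the chain's parser. In the notebook you would pass chain.invoke instead and could retry on OutputParserException from langchain_core.exceptions.

```python
import json


def invoke_with_retry(invoke, payload, max_attempts=3, retry_on=(ValueError,)):
    """Call invoke(payload); retry when it raises one of the retry_on errors.

    Returns the result together with the number of attempts used.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke(payload), attempt
        except retry_on as err:
            last_error = err
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


# Stand-in for chain.invoke: the first reply is truncated JSON, the second is valid.
replies = iter(['{"pair_list": [', '{"pair_list": []}'])


def fake_invoke(payload):
    # json.JSONDecodeError is a subclass of ValueError, so it triggers a retry.
    return json.loads(next(replies))


parsed, attempts = invoke_with_retry(
    fake_invoke, {"first_list": "- cat\n", "second_list": "- mouse\n"}
)
print(parsed, attempts)
```

Keep in mind that with temperature 0 a retry may return the same broken output, so a retry limit is still needed.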

In [7]:
formatted_prompt = prompt.format(
    first_list   = first_input_list, 
    second_list  = second_input_list
)
print(formatted_prompt)


Your task is to find the best pairs between 2 string lists if possible.
If you can't build pair for the item - just say "no pair".
Be sure that you read all items from first list.
Be sure that you check ALL items from second list and found the best fit.

<first_list>
- cat
- dog
- apple
- computer

</first_list>

<second_list>
- mouse
- orange
- shepherd

</second_list>

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"description": "Pair list", "properties": {"pair_list": {"title": "Pair List", "type": "array", "items": {"$ref": "#/definitions/PairItem"}}}, "required": ["pair_list"], "de