# Formatted output

Author: Pavel Agurov, pavel_agurov@epam.com

In this example we will ask model to generate output not just a text, but in XML or JSON format.

The idea of this code is to compare 2 text list and build pairs. To have some explanation we will ask model to provide not only pairs, but score and explanation. It allows us to build not "black box" solution, but have some "inside" from model.

LLM works good with both formats (and also support many other - CSV, YAML). You can make a choise based on your data. For example if you expect long text in the output - XML can be better, because JSON is too fragile and can be easy corrupted during long output.

In [None]:
%pip install openai > /dev/null
%pip install tiktoken > /dev/null
%pip install langchain > /dev/null
%pip install langchain_openai > /dev/null
%pip install langchain_core > /dev/null
%pip install langchain_community > /dev/null
%pip install langchain_text_splitters > /dev/null
%pip install sentence-transformers > /dev/null

## Prompt

First we will build prompt. There are many ways how we can create prompt, but in simplest case it's just string with parameters. We will build 2 versions of prompt - with XML output and JSON output.

In [1]:
from langchain_core.prompts import ChatPromptTemplate


COMPARE_PROMPT_TEMPLATE = """
Your task is to find the best pairs between 2 string lists if possible.
If you can't build pair for the item - just say "no pair".
Be sure that you read all items from first list.
Be sure that you check ALL items from second list and found the best fit.

<first_list>
{first_list}
</first_list>

<second_list>
{second_list}
</second_list>

Pair list in XML format:
<paired_list>
  <pair>
    <first_item>first item</first_item>
    <second_item>relevant item if exist or say 'no pair'</second_item>
    <score>score of relevance</score>
    <explanation>explain your decision</explanation>
  </pair>
</paired_list>
"""

xml_prompt = ChatPromptTemplate.from_template(COMPARE_PROMPT_TEMPLATE)

In [2]:
from langchain_core.prompts import ChatPromptTemplate


COMPARE_PROMPT_TEMPLATE = """
Your task is to find the best pairs between 2 string lists if possible.
If you can't build pair for the item - just say "no pair".
Be sure that you read all items from first list.
Be sure that you check ALL items from second list and found the best fit.

<first_list>
{first_list}
</first_list>

<second_list>
{second_list}
</second_list>

Pair list in JSON format:
[
  "pair": {{
    "first_item": "first item",
    "second_item": "relevant item if exist or say 'no pair'",
    "score": score of relevance,
    "explanation": "explain your decision"  
  }}
]

"""

json_prompt = ChatPromptTemplate.from_template(COMPARE_PROMPT_TEMPLATE)

## LLM model

Model should be powerful enough to be able to provide relevant result, but from another side - has reasonable price to make result profitable.

Remember about temperature parameter - in langchain by default it's not 0.

In [3]:
import os
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
        api_key         = os.environ['OPENAI_API_KEY'],
        api_version     = "2023-07-01-preview",
        azure_endpoint  = "https://ai-proxy.lab.epam.com",
        model           = "gpt-4o-mini-2024-07-18",
        temperature     = 0.0
    )

## Chains: XML and JSON

You combile prompt, model and output parser into one chain. We use StrOutputParser because we will parse result manually.

In [4]:
from langchain_core.output_parsers import StrOutputParser

xml_chain  = xml_prompt  | llm | StrOutputParser()
json_chain = json_prompt | llm | StrOutputParser()

## Run

- get_openai_callback here allows to have count of used tokens
- function create_list creates list of string
- call_llm will call LLM with invoke method and return result

In [5]:
from langchain_community.callbacks import get_openai_callback

def call_llm(chain, first_list : str, second_list : str) -> tuple[str, int]:
    with get_openai_callback() as cb:
        llm_result = chain.invoke({
                "first_list"   : first_list, 
                "second_list"  : second_list
            })
        return llm_result, cb.total_tokens
    
def create_list(str_array : list[str]) -> str:
    return "".join([f"- {s}\n" for s in str_array])

In [6]:
first_input_list  = create_list(
    ['cat', 'dog', 'apple', 'computer']
)
second_input_list = create_list(
    ['mouse', 'orange', 'shepherd']
)

Let's run XML code

In [7]:
import xml.etree.ElementTree as ET

xml_result, tokens_used = call_llm(xml_chain, first_input_list, second_input_list)
print(f"Used tokens: {tokens_used}")
parsed_xml = ET.ElementTree(ET.fromstring(xml_result))

ET.dump(parsed_xml)

Used tokens: 442


ParseError: not well-formed (invalid token): line 1, column 0 (<string>)

Let's run JSON code. If you have crash - do not be surprised, it can happened after first or maybe after thousandth run.
Below you can find instruction how to fix it.

In [9]:
import json

json_result, tokens_used = call_llm(json_chain, first_input_list, second_input_list)
print(f"Used tokens: {tokens_used}")
parsed_json = json.loads(json_result)
print(json.dumps(parsed_json, indent = 4))

Used tokens: 404
[
    {
        "pair": {
            "first_item": "cat",
            "second_item": "no pair",
            "score": 0,
            "explanation": "There is no relevant item in the second list for 'cat'."
        }
    },
    {
        "pair": {
            "first_item": "dog",
            "second_item": "shepherd",
            "score": 0.5,
            "explanation": "The word 'dog' can be associated with 'shepherd' as it is a type of dog breed."
        }
    },
    {
        "pair": {
            "first_item": "apple",
            "second_item": "no pair",
            "score": 0,
            "explanation": "There is no relevant item in the second list for 'apple'."
        }
    },
    {
        "pair": {
            "first_item": "computer",
            "second_item": "mouse",
            "score": 0.6,
            "explanation": "The word 'computer' can be associated with 'mouse' as it is a peripheral device used with computers."
        }
    }
]


You should take into account that LLM can't garantee always correct format of output. Both XML and JSON can be corrupted.
In some cases you can easy restore it, but in some cases you can't or should have new prompt to ask LLM to fix issues.

Refs: https://api.python.langchain.com/en/latest/output_parsers/langchain.output_parsers.retry.RetryWithErrorOutputParser.html#

Functions below are helpful if you want to fix JSON in simplest way without additional calls to LLM (it takes time and money)

You can also check open-source project https://github.com/josdejong/jsonrepair - it allows to fix most of cases.

In [16]:
import re

def remove_noise_from_json(json_str: str) -> str:
    """
        Remove all text before fist { and after last }
        For examle to remove text noise:
           Your JSON {....}. Welcome!  --> json only
    """
    fist_bracket1_index = json_str.index("{")
    last_bracket1_index = json_str.rindex("}")
    
    fist_bracket2_index = json_str.index("[")
    last_bracket2_index = json_str.rindex("]")

    # we need first bracket if exists    
    if fist_bracket1_index >=0 and fist_bracket2_index >=0:
        fist_bracket_index = min(fist_bracket1_index, fist_bracket2_index)
    else:
        fist_bracket_index = max(fist_bracket1_index, fist_bracket2_index)
        
    # we need last bracket if exists
    last_bracket_index = max(last_bracket1_index, last_bracket2_index)
    
    if fist_bracket_index != -1 and last_bracket_index != -1:
        return json_str[fist_bracket_index:last_bracket_index+1]
    return json_str

    
def fix_non_json_chars(json_str: str) -> str:
    """
        Fix non json chars in json if possible
    """
    json_str = json_str.replace("\\\"", '\'')
    json_str = json_str.replace("\\", '\\\\')
    json_str = json_str.replace('\n', ' ')
    json_str = re.sub(r"},\s*]", "}]", json_str)
    json_str = re.sub(r"}\s*{", "},{", json_str)
    
    return json_str

In our simple case we can have no problem with JSON output, but if we have long text it's possible that quotes will be in output and we can't parse result. In this case we recommend to replace all types of quotes to the single quote BEFORE call LLM.

In [17]:
def clean_up_text(text : str) -> str:
    """Remove dagerous for JSON chars from text"""
    return text.replace("“", "'").replace("“", "”").replace("\"", "'").replace("«", "'").replace("»", "'")

In [21]:
# remove dagerous chars from input
fixed_first_input_list  = clean_up_text(first_input_list)
fixed_second_input_list = clean_up_text(second_input_list)

# run LLM
json_result, tokens_used = call_llm(json_chain, fixed_first_input_list, fixed_second_input_list)
print(f"Used tokens: {tokens_used}")

# let's fix json to avoid problems
json_result = remove_noise_from_json(json_result)
json_result = fix_non_json_chars(json_result)

# now we can parse it
parsed_json = json.loads(json_result)
print(json.dumps(parsed_json, indent = 4))

Used tokens: 389
[
    {
        "pair": {
            "first_item": "cat",
            "second_item": "no pair",
            "score": 0,
            "explanation": "There is no relevant item in the second list for 'cat'."
        }
    },
    {
        "pair": {
            "first_item": "dog",
            "second_item": "shepherd",
            "score": 0.5,
            "explanation": "The word 'dog' is related to 'shepherd' as a breed of dog."
        }
    },
    {
        "pair": {
            "first_item": "apple",
            "second_item": "orange",
            "score": 0.67,
            "explanation": "Both 'apple' and 'orange' are fruits."
        }
    },
    {
        "pair": {
            "first_item": "computer",
            "second_item": "mouse",
            "score": 0.6,
            "explanation": "A 'mouse' is a peripheral device used with a computer."
        }
    }
]
