### Task example

We need to generate high-quality product descriptions with LLMs, following certain rules. Find the best way to do it.


### How to solve

For this type of task we can use more advanced metrics, such as GPTScore, Answer Relevancy, and Hallucination. For educational purposes, it is better to write plain LLM prompts for evaluation than to rely on pre-built tools.

[Note] In my personal opinion, as of today, tools like RAGas are not mature, are hard to customize, and don't fit most real use cases. I still use custom metrics in my projects.

In [1]:
# Example
generation_rules = """
You are a specialist in marketing
- Follow best practices to write a "selling" product description
- Tone of voice: informal, emotional, ...other criteria...
- Do not make up any additional facts that are not present in <CONTEXT>
"""

# Context contains your product description in an unstructured and unformatted way
context = """
...some product characteristics...
"""

# Text generated by AI, using "generation_rules" and "context"
generated_text = """
...some text...
"""

# Example prompt for evaluating the generated text (use the same generation rules as at the generation stage)
evaluation_prompt = """
<CONTEXT>{{context}}</CONTEXT>

<RULES>{{generation_rules}}</RULES>

<GENERATED TEXT>{{generated_text}}</GENERATED TEXT>

<INSTRUCTION>
Validate <GENERATED TEXT> and check whether it follows all <RULES>

Always output a valid JSON snippet

Output template:
```json
{
    "mistakes_and_inconsistencies": "<evaluate any mistakes and inconsistencies in <GENERATED TEXT>, taking into account <RULES> and <CONTEXT>. For each mistake, print the severity from 0 to 10, where 10 is a critical mistake>",
    "civility_and_politeness": "<evaluate whether <GENERATED TEXT> contains any harmful or rude statements. Print the severity from 0 to 10, where 10 is an absolutely harmful text>",
    ...other criteria can be added here
    "total_score": "<print only a number, from 0 to 10>"
}
```
</INSTRUCTION>
"""

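from langchain_core.prompts import PromptTemplate
# NOTE: llm_adaptors is not part of LangChain; it is assumed here to be a project-local helper package
from llm_adaptors.base import BaseLlmAdaptor, BaseModels, BaseJsonParser

# Fill the jinja2 placeholders ({{context}}, {{generation_rules}}, {{generated_text}})
prompt = PromptTemplate.from_template(evaluation_prompt, template_format='jinja2').format(
    context=context,
    generation_rules=generation_rules,
    generated_text=generated_text,
)

# Send the rendered prompt to the evaluator model
model = BaseLlmAdaptor(model=BaseModels.GPT.gpt_4_0125).llm()
llm_answer = model.invoke(prompt)

# Turn the JSON snippet from the answer into a Python dict
evaluation = BaseJsonParser().parse(llm_answer)
print(evaluation)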

{'mistakes_and_inconsistencies': 'No specific mistakes or inconsistencies can be evaluated without the actual <GENERATED TEXT> and <CONTEXT>. Severity evaluation requires specific content to assess.', 'civility_and_politeness': "Without the actual <GENERATED TEXT>, it's impossible to evaluate the presence of harmful or rude statements. Civility and politeness assessment requires specific content to review.", 'total_score': '0'}
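
The internals of `BaseJsonParser` are not shown above. As a rough, standard-library-only sketch of what such a parsing step could look like, the snippet below pulls the JSON object out of a raw answer and checks that the score is a number in the expected range. The function name `parse_evaluation` and the range check are assumptions for illustration, not part of the original code.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the snippet itself stays fence-safe


def parse_evaluation(llm_answer: str) -> dict:
    """Extract the JSON object from an LLM answer and validate total_score."""
    # Prefer the body of a fenced json snippet; otherwise fall back to the outermost braces.
    fenced = re.search(FENCE + r"(?:json)?\s*(\{.*\})\s*" + FENCE, llm_answer, re.DOTALL)
    if fenced:
        raw = fenced.group(1)
    else:
        start, end = llm_answer.find("{"), llm_answer.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in the LLM answer")
        raw = llm_answer[start:end + 1]

    evaluation = json.loads(raw)
    # The template asks for a number from 0 to 10, but models often return it as a string.
    score = float(evaluation["total_score"])
    if not 0 <= score <= 10:
        raise ValueError(f"total_score out of range: {score}")
    evaluation["total_score"] = score
    return evaluation


# Hypothetical model answer, used only to exercise the parser
answer = (
    "Here is my review:\n" + FENCE + "json\n"
    '{"mistakes_and_inconsistencies": "none", "total_score": "8"}\n' + FENCE
)
print(parse_evaluation(answer)["total_score"])
```

Failing loudly on a missing or out-of-range score makes it easier to spot answers where the model ignored the output template.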


### Points to consider

With this approach we can measure several metrics in a single LLM call. In some cases a separate call per metric may be needed, but most top LLMs can handle them within one prompt without significant degradation in the evals.

Usage of CoT (chain of thought): separate fields for reasoning, such as "mistakes_and_inconsistencies" and "civility_and_politeness", make the LLM reason before it scores, which gives more accurate results and more attention to detail.
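
Because a model generates its answer token by token, the order of the fields matters: the reasoning fields should come before the score so that the score is written after the reasoning. A minimal sketch of building the output template this way is shown below; the `criteria` dictionary and `build_output_template` are hypothetical names used only for illustration.

```python
import json

# Hypothetical criteria: field name -> instruction for the reasoning text
criteria = {
    "mistakes_and_inconsistencies": "list each mistake with a severity from 0 to 10",
    "civility_and_politeness": "describe harmful or rude statements with a severity from 0 to 10",
}


def build_output_template(criteria: dict) -> str:
    """Render a JSON output template with reasoning fields first and the score last."""
    fields = {name: f"<{instruction}>" for name, instruction in criteria.items()}
    # dicts keep insertion order, so total_score is emitted after all reasoning fields
    fields["total_score"] = "<print only a number, from 0 to 10>"
    return json.dumps(fields, indent=4)


template = build_output_template(criteria)
print(template)
```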

Context window: if your context or generated text is large but still fits the context window, don't worry too much: some models, such as Claude 3x, are very good at finding "needles in a haystack". Another approach is to split the rules or the text into a few parts, evaluate them separately, and join the results (just keep in mind that context sometimes matters and parts of the text might lose it, so your evaluation might be less accurate).
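
The split-and-join idea can be sketched roughly as follows. Splitting on blank lines, the `max_chars` limit, and the aggregation by mean and minimum are all assumptions; `fake_evaluate` is a stand-in for a real LLM call that returns a score for one part.

```python
from statistics import mean
from typing import Callable


def split_into_parts(text: str, max_chars: int) -> list[str]:
    """Group paragraphs into parts of at most max_chars (a longer single paragraph stays whole)."""
    parts, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            parts.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        parts.append(current)
    return parts


def evaluate_in_parts(text: str, evaluate_part: Callable[[str], float], max_chars: int = 2000) -> dict:
    """Score each part separately and join the results."""
    scores = [evaluate_part(part) for part in split_into_parts(text, max_chars)]
    if not scores:
        raise ValueError("nothing to evaluate")
    return {"part_scores": scores, "mean_score": mean(scores), "worst_score": min(scores)}


def fake_evaluate(part: str) -> float:
    """Stand-in for an LLM evaluation call: penalise an unsupported claim."""
    return 3.0 if "guaranteed" in part else 9.0


text = "Soft cotton hoodie.\n\nGuaranteed to last forever, guaranteed!\n\nMachine washable."
result = evaluate_in_parts(text, fake_evaluate, max_chars=40)
print(result)
```

Reporting the worst part alongside the mean keeps a single bad passage from being averaged away.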

Self-correction: these metrics can be included as part of a self-correction step, to improve quality and reduce potential mistakes.
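
A self-correction step might look roughly like the loop below. It assumes that a higher "total_score" means a better text and that the evaluator's reasoning is fed back to the generator; the threshold, the attempt limit, and both `fake_*` functions are hypothetical stand-ins for real LLM calls.

```python
from typing import Callable


def generate_with_self_correction(
    generate: Callable[[str], str],
    evaluate: Callable[[str], dict],
    min_score: float = 8.0,
    max_attempts: int = 3,
) -> tuple[str, dict]:
    """Regenerate until the evaluation score reaches min_score or attempts run out."""
    feedback = ""
    best_text, best_eval = "", {"total_score": float("-inf")}
    for _ in range(max_attempts):
        text = generate(feedback)
        evaluation = evaluate(text)
        if evaluation["total_score"] > best_eval["total_score"]:
            best_text, best_eval = text, evaluation
        if evaluation["total_score"] >= min_score:
            break
        # Pass the evaluator's reasoning back to the generator as feedback
        feedback = evaluation["mistakes_and_inconsistencies"]
    return best_text, best_eval


def fake_generate(feedback: str) -> str:
    """Stand-in generator: drops the unsupported claim once it receives feedback."""
    return "Cozy cotton hoodie." if feedback else "Cozy hoodie, cures colds!"


def fake_evaluate(text: str) -> dict:
    """Stand-in evaluator: flags a claim that is not in the context."""
    if "cures" in text:
        return {"mistakes_and_inconsistencies": "Health claim is not in the context.", "total_score": 2.0}
    return {"mistakes_and_inconsistencies": "", "total_score": 9.0}


final_text, final_eval = generate_with_self_correction(fake_generate, fake_evaluate)
print(final_text, final_eval["total_score"])
```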

LLM model for evals: in some critical cases it makes sense to use a different model for evaluation than for generation, to reduce potential biases and degradation inherited from the main model.

### Trust in LLM evals

An LLM might hallucinate at both stages: generation and evaluation. In addition, there might be mistakes in your own rules and evaluation criteria. So be careful and check the reasoning and scores yourself, at least on some generations, before you trust your evals.
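
One simple way to organise such a manual check is to draw a random sample of evaluations, score them by hand, and compare the two sets of scores. The sketch below is only an illustration: the sample size, the mean absolute gap as an agreement measure, and the example scores are all assumptions.

```python
import random
from statistics import mean


def sample_for_review(evaluations: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Pick up to k evaluations at random for a human to re-score."""
    rng = random.Random(seed)
    return rng.sample(evaluations, min(k, len(evaluations)))


def mean_absolute_gap(pairs: list[tuple[float, float]]) -> float:
    """Average distance between the LLM score and the human score."""
    return mean(abs(llm_score - human_score) for llm_score, human_score in pairs)


# Made-up (llm_score, human_score) pairs, for illustration only
pairs = [(8.0, 7.0), (9.0, 9.0), (3.0, 6.0)]
gap = mean_absolute_gap(pairs)
print(round(gap, 2))
```

A large gap suggests revisiting the rules or the evaluation prompt before relying on the metric.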

### Summary

- This type of metric can be used not only for plain text, but also for other types of LLM responses, such as SQL queries.
- Chain-of-thought (reasoning) fields in the LLM output significantly increase the quality of the evaluation.
- LLMs can hallucinate in evals as well, so manual checks are required, at least in the first stage of implementing the metrics.
- LLM metrics can be used as part of another step, allowing self-correction of the output.