### EBA Q&A evaluations

In [1]:
import json
import os
from dataclasses import dataclass
from openai import OpenAI
import yaml
import pandas as pd
import textwrap

client = OpenAI()

### A. Prepare prompts

In [2]:
# Rater prompt
RATER_PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {question}
************
[Context]: {context}
************
[Expert]: {expected}
************
[Submission]: {output}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
Rate the submission on a scale of 1 to 10.
"""

In [3]:
# Answer prompt
ANSWER_PROMPT = """\
You are a regulatory expert from a central bank who is responsible for answering questions from commercial banks.
You will get a question and the context required to answer it.
[BEGIN DATA]
************
[Question]: {question}
************
[Context]: {context}
************
[END DATA]
"""

### B. Helper functions

In [4]:
def numeric_rater(question, context, output, expected):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": RATER_PROMPT.format(question=question, context=context, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Rate the submission on a scale of 1 to 10.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "rating": {"type": "integer", "minimum": 1, "maximum": 10},
                        },
                        "required": ["rating"],
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    # tool_choice forces a call to "rate", so the rating arrives as JSON tool-call arguments
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    return arguments["rating"]
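
The rater above assumes the tool call always carries a usable rating. The `minimum` and `maximum` bounds in the tool schema may not be strictly enforced on the model's output, so a more defensive parse can be useful. The following is a minimal sketch; `parse_rating` and its `low`/`high` parameters are hypothetical names that are not part of the original notebook.

```python
import json


def parse_rating(arguments_json, low=1, high=10):
    """Return the integer rating from a tool-call arguments string, or None if unusable.

    Hypothetical helper: it only inspects the JSON string and makes no API call.
    """
    try:
        arguments = json.loads(arguments_json)
    except (TypeError, json.JSONDecodeError):
        return None
    if not isinstance(arguments, dict):
        return None
    rating = arguments.get("rating")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int):
        return None
    if not low <= rating <= high:
        return None
    return rating


print(parse_rating('{"rating": 9}'))
print(parse_rating('{"rating": 11}'))
```

Returning `None` instead of raising lets the evaluation loop record a failed rating and continue with the remaining questions.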

In [5]:
def answer_question(question, context):
    """Ask the model to answer a question using the supplied context."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": ANSWER_PROMPT.format(question=question, context=context),
            }
        ],
    )

    # Wrap long lines so the answer stays readable when printed
    return wrap_long_lines(response.choices[0].message.content, width=100)

In [6]:
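def wrap_long_lines(text, width=100):
    """Wrap each line of text to the given width, keeping blank lines."""
    wrapped_lines = []
    for line in text.split('\n'):
        # textwrap.wrap returns [] for an empty line, which would drop paragraph breaks
        wrapped_lines.extend(textwrap.wrap(line, width=width) or [''])
    return '\n'.join(wrapped_lines)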

### C. Answer and rate questions from the EBA Q&A

In [7]:
# Retrieve questions, context and answers from the EBA Q&A
with open("qa.yaml", "r") as f:
    data = yaml.safe_load(f)
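
The loop in the next cell reads `id`, `question`, and `context` from every entry and treats `expected` as optional. A minimal sketch of a pre-flight check under that assumption is shown below; `find_invalid_items` and the sample entries are hypothetical and only illustrate the structure that `qa.yaml` is assumed to have.

```python
# Keys read with item[...] in the evaluation loop; "expected" is read with .get(), so it is optional.
REQUIRED_KEYS = ("id", "question", "context")


def find_invalid_items(data):
    """Return (index, missing_keys) pairs for entries that lack a required key."""
    problems = []
    for index, item in enumerate(data):
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical stand-in for the list that yaml.safe_load is assumed to return.
sample = [
    {"id": 1, "question": "Q1?", "context": "Source 1: ...", "expected": "A1"},
    {"id": 2, "question": "Q2?"},
]
print(find_invalid_items(sample))  # [(1, ['context'])]
```

Running such a check before the loop avoids spending API calls on a file that would fail with a `KeyError` partway through.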

In [8]:
results = []
for item in data:
    question = item["question"]
    context = item["context"]
    expected = item.get("expected", "")
    output = answer_question(question, context)
    rating = numeric_rater(question, context, output, expected)
    
    results.append({
        "id": item["id"],
        "question": question,
        "context": context,
        "output": output,
        "expected": expected,
        "rating": rating
    })

df = pd.DataFrame(results)
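
Once more than one question has been rated, a short summary helps to spot weak answers. The sketch below works on the same list of dictionaries that is collected in `results`; `summarize_ratings` and the `threshold` value of 7 are assumptions made for illustration, not part of the original notebook.

```python
from statistics import mean


def summarize_ratings(results, threshold=7):
    """Summarize ratings and list the ids of answers rated below the threshold."""
    ratings = [row["rating"] for row in results]
    return {
        "count": len(ratings),
        "mean": mean(ratings) if ratings else None,
        "min": min(ratings) if ratings else None,
        "low_ids": [row["id"] for row in results if row["rating"] < threshold],
    }


# Hypothetical ratings, used only to show the shape of the summary.
example_results = [{"id": 1, "rating": 9}, {"id": 2, "rating": 4}]
print(summarize_ratings(example_results))
```

The ids in `low_ids` are natural candidates for manual review against the EBA answer.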

In [9]:
df

| | id | question | context | output | expected | rating |
|---|---|---|---|---|---|---|
| 0 | 1 | For the treatment of cured defaulted exposures... | Source 1: CRR Article 178 > Default of an obli... | Yes, a probation period of 90 days with no def... | Minimum conditions for reclassification of a d... | 9 |


### D. Show selected answers

In [10]:
# Show individual answers (row is the DataFrame index label, not the "id" column)
row = 0

# Display a question
print("######### 1. Question ######### ")
print(df.at[row, 'question'].replace("\\n", "\n"))

# Display an OpenAI answer
print("######### 2. OpenAI answer ######### ")
print(df.at[row, 'output'].replace("\\n", "\n"))

# Display EBA answer
print("\n######### 3. EBA answer ######### ")
print(df.at[row, 'expected'].replace("\\n", "\n"))

######### 1. Question ######### 
For the treatment of cured defaulted exposures, a probation period of 90 days with no default triggers 
must apply before the exposure is moved back to a non-defaulted status. According to Article 178(1)(b CRR) 
default shall be considered to have occurred with regard to a particular obligor when the obligor is more 
than 90 days past due on any material credit obligation. However, if the material arrears fall below the 
thresholds, the arrears counter will reset to 0. Should the probation period of 90 days with no default 
triggers apply before the exposure is moved back to a non-defaulted status?


######### 2. OpenAI answer ######### 
Yes, a probation period of 90 days with no default triggers should apply before a cured defaulted
exposure is moved back to a non-defaulted status. According to Article 178 of the Capital
Requirements Regulation (CRR), a default is recognized when an obligor is over 90 days past due on
any material credit obligation, am

### E. Additional info on evaluating answers

https://github.com/redhat-et/foundation-models-for-documentation/blob/master/notebooks/llm-evaluation/QA_evaluation_metrics_demo.ipynb

**Human Evaluation**

Human evaluation is a widely recognized approach for assessing the quality of generated answers against reference answers. This paper highlights some current trends and best-practice guidelines. The process can be summarized in the following steps.

**Best Practices for Human Evaluation Planning:**

- **Define the evaluation goal:** Clearly articulate the research question and determine if there are specific hypotheses to test. Choose strong and representative baselines for comparison.
- **Determine the type of evaluation:** Decide whether the evaluation will be intrinsic or extrinsic, and consider the real-world or lab setting based on the goals and constraints.
- **Choose the type of research:** Opt for qualitative research to improve the system or quantitative research to assess the system's merit.
- **Define constructs of interest:** Decide whether to ask implementation questions or impact questions. Use separate criteria instead of an overall text quality construct. Provide formal definitions and concrete examples of the criteria in the instructions.
- **Determine appropriate scales:** For quantitative research, consider using multiple-item 7-point Likert scales or a ranking task to measure participant responses.
- **Determine the sample:** Recruit participants that reflect the target audience and provide a detailed description of their demographics. Use large-scale samples for quantitative research and calculate the minimum sample size required. Consider using multiple annotators for coding tasks.
- **Specify the study's design:** Prefer a within-subjects design over a between-subjects design if feasible. Keep the evaluation task simple and motivating, reduce practice and carryover effects, manage fatigue and order effects, and address nonresponse bias.
- **Select a statistical approach:** Use exploratory data analysis techniques for exploratory research, and employ statistical significance testing and report effect sizes when there are clear hypotheses.
- **Optional:** Consider preregistering the task if the evaluation is confirmatory.

These recommendations provide guidance for planning human evaluations and ensuring robust and meaningful results.
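
When multiple annotators rate the same answers, as recommended above, it is common to check how far they agree beyond chance. One widely used statistic for two annotators and categorical labels is Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The sketch below is a minimal pure-Python version; the function name and the example labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for four answers.
annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

For ordinal scales such as a 7-point Likert scale, a weighted variant of kappa or a different agreement coefficient may be more appropriate, because plain kappa treats every disagreement as equally severe.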

**While it offers valuable insights, there are several challenges associated with this method.**

- **Subjectivity:** Human judgments can be subjective, leading to inconsistencies in the evaluation process.
- **Inter-rater agreement:** Ensuring agreement among evaluators becomes crucial to minimize biases and maintain reliability.
- **Scalability:** Evaluating a large number of generated answers manually becomes impractical, requiring sampling techniques or statistical methods.
- **Expertise and domain knowledge:** Evaluators' expertise and knowledge can influence evaluation outcomes, necessitating clear guidelines and appropriate training.
- **Cost and time:** Conducting human evaluations can be costly and time-consuming, requiring resources for recruitment, compensation, and management.
- **Biases:** Evaluators may have personal preferences or biases that can impact the evaluation results.

Automatic metrics such as BLEU and ROUGE scores have been observed to correlate only weakly with human evaluations of generated text (reference). Critics argue against relying on automated metrics for assessing linguistic properties and discourage their use as the primary measure. Automatic metrics are nevertheless cheap, fast, and repeatable, which makes them valuable for tasks like error analysis and system development. Human evaluation is widely considered the gold standard for assessing overall system quality, but conducting it extensively throughout the development process can be expensive and time-consuming.
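
To make the idea of a cheap, repeatable automatic metric concrete, the sketch below computes a token-level F1 overlap between a generated answer and a reference answer. This is neither BLEU nor ROUGE; it is a simpler unigram-overlap score of the kind used in extractive question-answering benchmarks, shown here only as an illustration. The function name and the tokenization rule (lowercased `\w+` matches) are assumptions.

```python
import re
from collections import Counter


def token_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference answer."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1(
    "A probation period applies",
    "A probation period of 90 days applies",
)
print(round(score, 3))  # 0.727
```

A score like this rewards shared wording rather than factual agreement, which is one reason such metrics can diverge from human judgment or from the model-based rating used earlier in this notebook.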

**Importance of the Prompt**

When evaluating answers generated by a language model, it is crucial to also consider the quality of the question or prompt given to the model. A model's output depends heavily on its input: a good prompt gives clear instructions, includes the relevant context, and specifies the desired format or type of response, which guides the model toward accurate and coherent answers. Reliable evaluation therefore needs to look at the prompts used as well as the answers they produce, and understanding how prompts affect the output makes both the evaluation and the model's answers more useful.
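
As an illustration of a prompt that states its instructions, context, and expected format explicitly, the sketch below defines a hypothetical variant of an answer prompt and checks that it still exposes the `{question}` and `{context}` placeholders used in this notebook. The wording of `STRUCTURED_ANSWER_PROMPT` and the `placeholders` helper are assumptions for illustration; the variant has not been evaluated.

```python
from string import Formatter

# Hypothetical prompt variant; the wording is illustrative only.
STRUCTURED_ANSWER_PROMPT = """\
You are a regulatory expert answering a question from a commercial bank.
Use only the context provided. If the context is insufficient, say so.
Answer in at most three sentences and name the article you rely on.
[Question]: {question}
[Context]: {context}
"""


def placeholders(template):
    """Return the set of named fields used in a str.format template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


# Fail early if a prompt variant drops a field that the evaluation loop fills in.
missing = {"question", "context"} - placeholders(STRUCTURED_ANSWER_PROMPT)
assert not missing, f"prompt is missing placeholders: {missing}"
print(STRUCTURED_ANSWER_PROMPT.format(question="Q?", context="Source 1: ..."))
```

Keeping prompt variants as named templates makes it possible to run the same questions through each one and compare the resulting ratings.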