
feat: add context precision for context quality eval algo #289

Merged
4 commits merged into aws:rageval on Jun 20, 2024

Conversation

@oyangz (Contributor) commented Jun 11, 2024

Issue #, if available:

Description of changes:
This PR adds the context precision metric under the context quality evaluation algorithm. This is the first of three evaluation metrics under context quality.

Context Precision is a metric that evaluates whether the chunks in the retrieved contexts that are relevant to the target output are ranked near the top; ideally, all relevant context chunks appear at the highest ranks. The metric is computed from the model_input, target_output, and retrieved_contexts, and produces values between 0 and 1, where higher scores indicate better precision.
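
For illustration, here is a minimal sketch of how per-chunk judge verdicts could be aggregated into a context precision score using a precision@k weighting; the function name and the exact aggregation are assumptions for this example, not necessarily what this PR implements.

```python
from typing import List


def context_precision(verdicts: List[int]) -> float:
    """Aggregate per-chunk judge verdicts (1 = relevant, 0 = not relevant)
    into a single score, weighting by precision@k over the ranked contexts
    so that relevant chunks appearing earlier contribute more."""
    relevant_so_far = 0
    precisions_at_relevant_ranks = []
    for rank, verdict in enumerate(verdicts, start=1):
        relevant_so_far += verdict
        if verdict:  # precision@k is only counted at ranks holding a relevant chunk
            precisions_at_relevant_ranks.append(relevant_so_far / rank)
    if not precisions_at_relevant_ranks:
        return 0.0
    return sum(precisions_at_relevant_ranks) / len(precisions_at_relevant_ranks)


# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2 ≈ 0.83
print(context_precision([1, 0, 1]))
```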

Notes:

  • There are no built-in datasets for context quality, since the context in the dataset needs to be retrieved by the RAG system under evaluation.
  • The default judge model is currently set to a sample Bedrock model because judge model selection is still in progress.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


@danielezhu (Contributor) left a comment


Aside from the validate_columns flag, everything else lgtm

src/fmeval/eval_algorithms/common.py (outdated; thread resolved)
@oyangz removed the request for review from polaschwoebel on June 12, 2024

CONTEXT_PRECISION_SCORE = "context_precision_score"

DEFAULT_CONTEXT_PRECISION_PROMPT_TEMPLATE = (


Does this need few-shot examples to demonstrate the output structure?

@oyangz (Contributor, Author) replied

The output structure is not very strict since we only need to parse the verdict of "1" or "0" from the model output. Would you suggest we add some examples in the prompt template?
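
For illustration, a minimal sketch of that parsing step, assuming a lenient regex-based extractor; the function name and fallback behavior are hypothetical, not the PR's actual parser:

```python
import re


def parse_verdict(judge_output: str) -> int:
    """Extract a binary verdict from free-form judge output: take the first
    standalone 0 or 1, and default to 0 (not relevant) if none is found."""
    match = re.search(r"\b([01])\b", judge_output)
    return int(match.group(1)) if match else 0


assert parse_verdict("1") == 1
assert parse_verdict("Verdict: 0") == 0
assert parse_verdict("no clear verdict") == 0
```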


In general, yes; however, if we offload this responsibility to the user, we won't have to worry about it. Same as here.

"arriving at the given answer. Give verdict as 1 if useful and 0 if "
"not. question: $model_input, answer: $target_output, "
"context: $retrieved_context."
"The verdict should only contain an integer, either 1 or 0, do not give an explanation."


Any reason why the model is not supposed to give an explanation for its verdict? In practice, forcing the model to generate reasoning steps generally leads to a more accurate answer (see, for example, chain-of-thought prompting).

@oyangz (Contributor, Author) replied

This was because we only parse the "1" or "0" verdict from the model response, so we don't need the explanation. I can update the prompt to ask the model to generate an explanation if that helps with accuracy.


If we rely on the user to provide their own prompt, this shouldn't be an issue. However, if we provide a default judge model and a default prompt, some experimental verification may be needed to determine whether forcing an explanation leads to better verdicts at the cost of increased output length.
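
For reference, a hypothetical sketch of the "explanation first" variant discussed here, keeping the verdict machine-parseable on a final line; the suffix wording and parser are assumptions, not part of this PR:

```python
import re

# Hypothetical suffix asking for reasoning before the verdict, so the judge
# can "think" while the score stays easy to parse.
COT_VERDICT_SUFFIX = (
    "First give a one-sentence explanation of whether the context was useful "
    "for arriving at the answer, then end with a line of the form "
    "'Verdict: 1' or 'Verdict: 0'."
)


def parse_final_verdict(judge_output: str) -> int:
    """Read the last 'Verdict: <0|1>' occurrence so the reasoning text is ignored."""
    matches = re.findall(r"Verdict:\s*([01])", judge_output)
    return int(matches[-1]) if matches else 0


assert parse_final_verdict("The context states the answer directly.\nVerdict: 1") == 1
```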

@oyangz (Contributor, Author) commented Jun 20, 2024

Will address the default prompt template changes separately.

@oyangz merged commit 8de2735 into aws:rageval on Jun 20, 2024
3 checks passed