
feat: add context precision for context quality eval algo #289

Merged
4 commits merged into aws:rageval on Jun 20, 2024

Conversation

@oyangz (Contributor) commented Jun 11, 2024

Issue #, if available:

Description of changes:
This PR adds the context precision metric under the context quality evaluation algorithm. This is the first of three evaluation metrics under context quality.

Context Precision is a metric that evaluates whether the chunks in the retrieved contexts that are relevant to the target output are ranked near the top; ideally, all relevant context chunks appear at the highest ranks. The metric is computed from the model_input, target_output, and retrieved_contexts, and produces values between 0 and 1, where higher scores indicate better precision.
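
For illustration, here is a minimal sketch of how per-chunk judge verdicts could be aggregated into a context precision score using a precision@k weighting; the function name and the exact aggregation are assumptions for this example, not necessarily what this PR implements.

```python
from typing import List


def context_precision(verdicts: List[int]) -> float:
    """Aggregate per-chunk judge verdicts (1 = relevant, 0 = not relevant)
    into a single score, weighting by precision@k over the ranked contexts
    so that relevant chunks appearing earlier contribute more."""
    relevant_so_far = 0
    precisions_at_relevant_ranks = []
    for rank, verdict in enumerate(verdicts, start=1):
        relevant_so_far += verdict
        if verdict:  # precision@k is only counted at ranks holding a relevant chunk
            precisions_at_relevant_ranks.append(relevant_so_far / rank)
    if not precisions_at_relevant_ranks:
        return 0.0
    return sum(precisions_at_relevant_ranks) / len(precisions_at_relevant_ranks)


# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2 ≈ 0.83
print(context_precision([1, 0, 1]))
```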

Notes:

  • There are no built-in datasets for context quality, since the context in the dataset needs to be retrieved by the RAG system under evaluation.
  • The default judge model is currently set to a sample Bedrock model because judge model selection is still in progress.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


@danielezhu (Contributor) left a comment


Aside from the validate_columns flag, everything else lgtm

src/fmeval/eval_algorithms/common.py (outdated; thread resolved)
@oyangz removed the request for review from polaschwoebel on June 12, 2024

CONTEXT_PRECISION_SCORE = "context_precision_score"

DEFAULT_CONTEXT_PRECISION_PROMPT_TEMPLATE = (


Does this need few-shot examples to demonstrate the output structure?

@oyangz (Contributor, Author) replied

The output structure is not very strict since we only need to parse the verdict of "1" or "0" from the model output. Would you suggest we add some examples in the prompt template?
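
For illustration, a minimal sketch of that parsing step, assuming a lenient regex-based extractor; the function name and fallback behavior are hypothetical, not the PR's actual parser:

```python
import re


def parse_verdict(judge_output: str) -> int:
    """Extract a binary verdict from free-form judge output: take the first
    standalone 0 or 1, and default to 0 (not relevant) if none is found."""
    match = re.search(r"\b([01])\b", judge_output)
    return int(match.group(1)) if match else 0


assert parse_verdict("1") == 1
assert parse_verdict("Verdict: 0") == 0
assert parse_verdict("no clear verdict") == 0
```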


In general, yes; however, if we offload this responsibility to the user, we won't have to worry about it. Same as here.

"arriving at the given answer. Give verdict as 1 if useful and 0 if "
"not. question: $model_input, answer: $target_output, "
"context: $retrieved_context."
"The verdict should only contain an integer, either 1 or 0, do not give an explanation."


Any reason why the model is not supposed to give an explanation for its verdict? In practice, forcing the model to generate reasoning steps generally leads to a more accurate answer (see, for example, chain-of-thought prompting).

@oyangz (Contributor, Author) replied

This was because we only parse the "1" or "0" verdict from the model response, so we don't need the explanation. I can update the prompt to ask the model to generate an explanation if that helps with accuracy.


If we rely on the user to provide their own prompt, this shouldn't be an issue. However, if we provide a default judge model and a default prompt, some experimental verification may be needed to determine whether forcing an explanation leads to better verdicts at the cost of increased output length.
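
For reference, a hypothetical sketch of the "explanation first" variant discussed here, keeping the verdict machine-parseable on a final line; the suffix wording and parser are assumptions, not part of this PR:

```python
import re

# Hypothetical suffix asking for reasoning before the verdict, so the judge
# can "think" while the score stays easy to parse.
COT_VERDICT_SUFFIX = (
    "First give a one-sentence explanation of whether the context was useful "
    "for arriving at the answer, then end with a line of the form "
    "'Verdict: 1' or 'Verdict: 0'."
)


def parse_final_verdict(judge_output: str) -> int:
    """Read the last 'Verdict: <0|1>' occurrence so the reasoning text is ignored."""
    matches = re.findall(r"Verdict:\s*([01])", judge_output)
    return int(matches[-1]) if matches else 0


assert parse_final_verdict("The context states the answer directly.\nVerdict: 1") == 1
```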

@oyangz (Contributor, Author) commented Jun 20, 2024

Will address the default prompt template changes separately.

@oyangz merged commit 8de2735 into aws:rageval on Jun 20, 2024
3 checks passed