feat: add context precision for context quality eval algo #289
Conversation
Aside from the validate_columns flag, everything else LGTM.
CONTEXT_PRECISION_SCORE = "context_precision_score"

DEFAULT_CONTEXT_PRECISION_PROMPT_TEMPLATE = (
Does this need few-shot examples to demonstrate the output structure?
The output structure is not very strict since we only need to parse the verdict of "1" or "0" from the model output. Would you suggest we add some examples in the prompt template?
In general, yes; however, if we offload this responsibility to the user, we won't have to worry about it. Same as here.
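(For illustration, a minimal sketch of the lenient verdict parsing described above, assuming the loose output structure discussed in this thread; the function name and regex are illustrative, not the parser used in this PR.)

```python
import re
from typing import Optional


def parse_verdict(model_output: str) -> Optional[int]:
    """Extract a binary verdict (1 or 0) from free-form judge-model output.

    Returns None when no standalone 0/1 token is found, so callers can decide
    how to handle unparseable responses. Illustrative only.
    """
    match = re.search(r"\b([01])\b", model_output)
    return int(match.group(1)) if match else None


# Works even when the model wraps the verdict in extra text.
assert parse_verdict("Verdict: 1. The context was useful.") == 1
assert parse_verdict("0") == 0
```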
"arriving at the given answer. Give verdict as 1 if useful and 0 if " | ||
"not. question: $model_input, answer: $target_output, " | ||
"context: $retrieved_context." | ||
"The verdict should only contain an integer, either 1 or 0, do not give an explanation." |
Any reason why the model is not supposed to give an explanation for its verdict? In practice, forcing the model to generate reasoning steps generally leads to a more accurate answer (for example, see chain-of-thought prompting).
This was because we only parse the verdict of "1" or "0" from the model response, so we don't need the explanation. I can update the prompt to ask the model to generate an explanation if it helps with accuracy.
If we rely on the user to provide their own prompt, this shouldn't be an issue. However, if we provide a default judge model and a default prompt, some experimental verification might be needed to check whether forcing an explanation leads to better verdicts, at the cost of increased output length.
Will address default prompt template-related changes separately.
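(Side note: the $-style placeholders in the template above are consistent with string.Template-style substitution. The sketch below shows how such a template could be filled per record; the opening lines of the default template are elided in the diff, and the assembly code here is illustrative rather than the PR's implementation.)

```python
from string import Template

# Only the tail of the default prompt template is visible in the diff above;
# the leading part is elided here. Placeholders are substituted per record.
template = Template(
    "... "  # elided opening of the default prompt template
    "arriving at the given answer. Give verdict as 1 if useful and 0 if "
    "not. question: $model_input, answer: $target_output, "
    "context: $retrieved_context."
    "The verdict should only contain an integer, either 1 or 0, do not give an explanation."
)

judge_prompt = template.substitute(
    model_input="What is the capital of France?",
    target_output="Paris",
    retrieved_context="Paris is the capital and most populous city of France.",
)
```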
Issue #, if available:
Description of changes:
This PR adds the context precision metric under the context quality evaluation algorithm. This is the first of three evaluation metrics under context quality.

Context Precision is a metric that evaluates whether all of the items in the retrieved contexts that are relevant to the target output are ranked highly; ideally, all relevant context chunks appear at the top ranks. The metric is computed using the model_input, target_output, and retrieved_contexts, with values ranging between 0 and 1, where higher scores indicate better precision.

Notes:
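(As a rough illustration of how per-chunk relevance verdicts can be aggregated into a single score in [0, 1], here is a minimal sketch of the rank-weighted precision@k averaging commonly used for context precision; the exact aggregation in this PR may differ, and the function below is illustrative only.)

```python
from typing import List


def context_precision(verdicts: List[int]) -> float:
    """Aggregate per-chunk binary verdicts (1 = relevant, 0 = not) into [0, 1].

    Computes the mean of precision@k over the ranks k at which a chunk was
    judged relevant, so relevant chunks ranked near the top yield higher
    scores. Illustrative only; not necessarily the aggregation used in this PR.
    """
    if not any(verdicts):
        return 0.0
    precisions_at_relevant_ranks = []
    relevant_so_far = 0
    for k, verdict in enumerate(verdicts, start=1):
        relevant_so_far += verdict
        if verdict == 1:
            precisions_at_relevant_ranks.append(relevant_so_far / k)
    return sum(precisions_at_relevant_ranks) / len(precisions_at_relevant_ranks)


# All relevant chunks at the top ranks -> perfect precision.
assert context_precision([1, 1, 0, 0]) == 1.0
# A single relevant chunk buried at the bottom -> low precision.
assert context_precision([0, 0, 0, 1]) == 0.25
```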
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.