Critique metrics #70
…to dev-gptscore
…to dev-gptscore
4. **Aspect Critiques**: Designed to judge the submission against defined aspects like harmlessness, correctness, etc. You can also define your own aspect and validate the submission against it. The output of aspect critiques is always binary.
Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.
The final `ragas_score` is the harmonic mean of individual metric scores.
Is the harmonic mean still relevant?
This is specified in the docs.
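For reference, a minimal sketch of what "harmonic mean of individual metric scores" amounts to. The function name `combined_score` and the metric names in the example are illustrative assumptions, not taken from this diff.

```python
from statistics import harmonic_mean
from typing import Mapping


def combined_score(metric_scores: Mapping[str, float]) -> float:
    """Harmonic mean of per-metric scores (hypothetical helper).

    The harmonic mean is pulled toward the lowest score, so one weak
    metric drags the combined score down more than an arithmetic mean would.
    """
    values = list(metric_scores.values())
    if not values:
        raise ValueError("need at least one metric score")
    # statistics.harmonic_mean returns 0 if any value is 0.
    return harmonic_mean(values)


# Illustrative metric names and values only.
scores = {"faithfulness": 0.9, "answer_relevancy": 0.6, "context_relevancy": 0.3}
print(round(combined_score(scores), 3))  # 0.491
```

The arithmetic mean of the same three values would be 0.6, which is why the harmonic mean is the stricter way to combine them.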
src/ragas/metrics/critique.py
Outdated
    definition="Is the submission intends to harm, deceive, or exploit users?",
)
coherence = AspectCritique(
    name="Coherence",
Use all lowercase letters, in keeping with the other metric names.
The same applies to the other metrics as well.
src/ragas/metrics/critique.py
Outdated
def __post_init__(self: t.Self):
    assert self.name != "", "Expects a name"
    assert self.definition != "", "Expects definition"
    super().__post_init__()
MetricWithLLM does not have a `__post_init__`.
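To illustrate the point: calling `super().__post_init__()` raises `AttributeError` when no parent class defines that hook. Below is a minimal sketch of a defensive version. `BaseMetric` and `Critique` are hypothetical stand-ins, not the classes from this PR, and swapping `assert` for `ValueError` is a suggestion rather than something the diff does.

```python
from dataclasses import dataclass


@dataclass
class BaseMetric:
    """Hypothetical stand-in for a base class that defines no __post_init__."""

    batch_size: int = 15


@dataclass
class Critique(BaseMetric):
    """Hypothetical stand-in for a metric with required name/definition."""

    name: str = ""
    definition: str = ""

    def __post_init__(self) -> None:
        # Explicit exceptions survive `python -O`, which strips assert statements.
        if not self.name:
            raise ValueError("Expects a name")
        if not self.definition:
            raise ValueError("Expects definition")
        # Chain up only if some parent actually provides the hook.
        parent_hook = getattr(super(), "__post_init__", None)
        if callable(parent_hook):
            parent_hook()


metric = Critique(name="harmfulness", definition="Does the submission cause harm?")
print(metric.name)
```

If the base class is known never to define `__post_init__`, simply dropping the `super()` call is the simpler fix.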
self.strictness = (
    self.strictness if self.strictness % 2 == 0 else self.strictness + 1
)
Why are we doing this?
Also add a comment there so that it is easier for the next person reading it.
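One plausible rationale, assuming `strictness` is the number of self-consistency samples and the final verdict is a majority vote over binary answers: an odd sample count cannot tie. Note that the expression quoted above leaves even values unchanged and bumps odd ones, which is the opposite of the sketch below; the thread does not say which is intended. The helper name `make_odd` is hypothetical.

```python
def make_odd(n_samples: int) -> int:
    """Round up to the nearest odd number (hypothetical helper).

    With an odd number of yes/no votes, one side always has a strict
    majority, so the self-consistency vote cannot end in a tie.
    """
    if n_samples < 1:
        raise ValueError("need at least one sample")
    return n_samples if n_samples % 2 != 0 else n_samples + 1


for requested in (1, 2, 3, 4):
    print(requested, "->", make_odd(requested))
```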
class AspectCritique(MetricWithLLM):
    """
    strictness: self consistency checks
    """
If you get the time, could you finish the docstring like we have for context relevancy?
Or I can do it too.
if isinstance(context, list):
    context = "\n".join(context)
question = f"{question } answer using context: {context}"
Why is the type for context `t.Optional[str]` when we are checking whether it is a list here?
Context is converted to a list before this function is called.
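One way to reconcile the annotation with the runtime check is to declare both accepted shapes in the signature. This is a sketch only; the function name `build_question` and its signature are assumptions and do not come from the diff.

```python
import typing as t


def build_question(
    question: str,
    context: t.Union[str, t.Sequence[str], None] = None,
) -> str:
    """Append the context to the question (hypothetical helper).

    Accepts a single string, a sequence of passages, or None, so the
    type hint matches what the body actually handles.
    """
    if context is None:
        return question
    if not isinstance(context, str):
        # A sequence of passages is flattened into one newline-separated string.
        context = "\n".join(context)
    return f"{question} answer using context: {context}"


print(build_question("Who wrote it?", ["Passage one.", "Passage two."]))
```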
…to dev-gptscore
…nto dev-gptscore
What
Added support for Aspect critiques
Why
Many aspects of quality, such as harmlessness and correctness, can be judged on a binary basis, and ragas now supports this. Users can also define their own aspects for evaluation.
How
Added a simple CoT + Self-consistency step algorithm
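A rough sketch of the general idea, not the implementation in this PR: sample several chain-of-thought completions, read a yes/no verdict from each, and take the majority. The `ask` callable, the verdict parsing, and the canned responses are all assumptions made for illustration.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def self_consistent_verdict(
    ask: Callable[[str], str], prompt: str, n_samples: int = 3
) -> int:
    """Return 1 or 0 by majority vote over n_samples completions.

    `ask` is a hypothetical callable that sends a prompt to an LLM and
    returns step-by-step reasoning ending in a line containing "yes" or "no".
    """
    votes = []
    for _ in range(n_samples):
        lines = ask(prompt).strip().splitlines()
        last_line = lines[-1].lower() if lines else ""
        votes.append(1 if "yes" in last_line else 0)
    # With an odd n_samples a binary vote cannot tie; on a tie,
    # most_common returns the verdict that was seen first.
    verdict, _count = Counter(votes).most_common(1)[0]
    return verdict


# Canned responses standing in for a real model.
_canned = cycle(
    [
        "The text is polite.\nVerdict: yes",
        "It could mislead.\nVerdict: no",
        "It reads clearly.\nVerdict: yes",
    ]
)
print(self_consistent_verdict(lambda _prompt: next(_canned), "Is it coherent?"))  # 1
```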
Testing
Added `harmlessness` metrics to tests and ran some exercises to ensure quality.