feat: Add FaithfulnessEvaluator component #7424
Merged
Conversation
shadeMe suggested changes · Apr 4, 2024
shadeMe reviewed · Apr 4, 2024
Pull Request Test Coverage Report for Build 8557979991

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details
💛 - Coveralls
Related Issues
Proposed Changes:
FaithfulnessEvaluator that returns one aggregated faithfulness score and individual faithfulness scores for inputs of queries, contexts and responses. Uses LLMEvaluator under the hood.

How did you test it?
New unit tests and the following local example:
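The local example itself was not captured in this page. As a self-contained stand-in, the sketch below mimics the output shape described above (one aggregated score plus individual per-response scores), substituting a trivial substring check for the LLM judgment; all names and the splitting heuristic are illustrative, not taken from the PR's code.

```python
def mock_statement_score(statement, contexts):
    # Stand-in for the LLM judgment: treat a statement as "faithful"
    # if it appears verbatim in any context passage.
    return 1 if any(statement in c for c in contexts) else 0

def evaluate_faithfulness(responses, contexts_per_response):
    individual_scores = []
    for response, contexts in zip(responses, contexts_per_response):
        # Split the response into statements (here: naive sentence split).
        statements = [s.strip() for s in response.split(".") if s.strip()]
        scores = [mock_statement_score(s, contexts) for s in statements]
        individual_scores.append(sum(scores) / len(scores))
    # One aggregated score across all responses.
    return {
        "individual_scores": individual_scores,
        "score": sum(individual_scores) / len(individual_scores),
    }

result = evaluate_faithfulness(
    responses=["Paris is in France. Paris is the capital of Germany"],
    contexts_per_response=[["Paris is in France", "France is in Europe"]],
)
print(result)  # one supported statement out of two -> individual score 0.5
```

In the actual component the per-statement verdict comes from an LLM rather than a string match, but the aggregation (mean of binary statement scores per response, then a mean across responses) follows the calculation described in the notes below.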
Notes for the reviewer
We can discuss a good name separately.
FaithfulnessEvaluator, GroundednessEvaluator or HallucinationEvaluator are good candidates. deepset Cloud already has a groundedness metric: https://docs.cloud.deepset.ai/docs/use-groundedness-observability

In contrast to the original issue description, this PR doesn't calculate a binary score per answer but per statement in the answer. This calculation is more complex, but it's also more meaningful and standard in other eval frameworks.
Other frameworks split an answer into statements using a separate prompt. Here are examples:
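To illustrate that two-step pattern (first extract statements, then judge each one), a statement-extraction prompt might look like the sketch below; the wording is invented for illustration and is not taken from this PR or from any of the referenced frameworks.

```python
# Hypothetical statement-extraction prompt for the first step of the
# two-step pattern; a second prompt would then score each statement
# against the retrieved contexts.
STATEMENT_EXTRACTION_PROMPT = """\
Given a question and an answer, break the answer down into one or
more simple, self-contained statements.

Question: {question}
Answer: {answer}

Statements (one per line):
"""

print(STATEMENT_EXTRACTION_PROMPT.format(
    question="Who created Python?",
    answer="Python was created by Guido van Rossum.",
))
```

Keeping extraction and judgment in separate prompts makes each step easier to validate and lets the per-statement verdicts stay binary, which is what the aggregated score is computed from.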
Checklist
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.