
feat: Add FaithfulnessEvaluator component#7424

Merged
julian-risch merged 13 commits into main from faithfulness-evaluator on Apr 4, 2024

Conversation

@julian-risch
Member

@julian-risch julian-risch commented Mar 26, 2024

Related Issues

Proposed Changes:

  • Added a new component FaithfulnessEvaluator that returns one aggregated faithfulness score as well as individual per-response faithfulness scores for inputs of queries, contexts, and responses. It uses LLMEvaluator under the hood.

How did you test it?

New unit tests and the following local example:

from haystack import Pipeline
from haystack.components.evaluators import FaithfulnessEvaluator

QUESTIONS = ["Which is the most popular global sport?", "Who created the Python language?"]
CONTEXTS = [
    [
        "The popularity of sports can be measured in various ways, including TV viewership, social media presence, number of participants, and economic impact. Football is undoubtedly the world's most popular sport with major events like the FIFA World Cup and sports personalities like Ronaldo and Messi, drawing a followership of more than 4 billion people."
    ],
    [
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects."
    ],
]
RESPONSES = [
    "Football is the most popular sport with around 4 billion followers worldwide.",
    "Python is a high-level general-purpose programming language that was created by George Lucas.",
]

pipeline = Pipeline()
# FaithfulnessEvaluator uses an OpenAI model via LLMEvaluator by default,
# so the OPENAI_API_KEY environment variable must be set.
evaluator = FaithfulnessEvaluator()
pipeline.add_component("evaluator", evaluator)

results = pipeline.run({"evaluator": {"questions": QUESTIONS, "contexts": CONTEXTS, "responses": RESPONSES}})

print(results["evaluator"])
# {'results': [{'statements': ["Football is undoubtedly the world's most popular sport.", 'Football has around 4
# billion followers worldwide.'], 'statement_scores': [1, 1], 'name': 'llm', 'score': 1.0}, {'statements': ['Python
# is a high-level general-purpose programming language.', 'Python was created by George Lucas.'], 'statement_scores':
# [1, 0], 'score': 0.5}], 'score': 0.75, 'individual_scores': [1.0, 0.5]}

Notes for the reviewer

We can discuss a good name separately. FaithfulnessEvaluator, GroundednessEvaluator, and HallucinationEvaluator are all candidates. deepset Cloud already has a groundedness metric: https://docs.cloud.deepset.ai/docs/use-groundedness-observability

In contrast to the original issue description, this PR doesn't calculate a binary score per answer but per statement in the answer. This calculation is more complex, but it is also more meaningful and is the standard approach in other evaluation frameworks.
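To make the aggregation concrete, here is a minimal sketch (not the component's actual implementation) of how binary per-statement scores roll up into per-response and aggregate scores, consistent with the example output above, where [1, 0] averages to 0.5 and the overall score is 0.75:

```python
def response_score(statement_scores: list[int]) -> float:
    """Average the binary (0/1) faithfulness scores of one response's statements."""
    return sum(statement_scores) / len(statement_scores)

def aggregate_score(individual_scores: list[float]) -> float:
    """Average the per-response scores into one overall faithfulness score."""
    return sum(individual_scores) / len(individual_scores)

# Mirrors the example output: statements [1, 1] and [1, 0] from above.
individual = [response_score([1, 1]), response_score([1, 0])]  # [1.0, 0.5]
print(aggregate_score(individual))  # 0.75
```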

Other frameworks perform the splitting of an answer into statements in separate prompts. Here are examples:
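As a hypothetical illustration of the two-prompt scheme (these are not the prompts of this PR or of any specific framework), one prompt splits the answer into statements and a second prompt judges each statement against the context:

```python
# Hypothetical prompt templates, for illustration only.
SPLIT_PROMPT = (
    "Break the following answer into short, self-contained factual statements, "
    "one per line.\n"
    "Answer: {answer}"
)
JUDGE_PROMPT = (
    "Context: {context}\n"
    "Statement: {statement}\n"
    "Reply with 1 if the statement is supported by the context, otherwise 0."
)

# Each statement returned by the first prompt is then scored by the second.
filled = JUDGE_PROMPT.format(
    context="Python was created by Guido van Rossum.",
    statement="Python was created by George Lucas.",
)
```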

Checklist

@julian-risch julian-risch marked this pull request as ready for review April 4, 2024 12:49
@julian-risch julian-risch requested review from a team as code owners April 4, 2024 12:49
@julian-risch julian-risch requested review from dfokina, shadeMe and silvanocerza and removed request for a team and silvanocerza April 4, 2024 12:49
Comment thread haystack/components/evaluators/faithfulness.py Outdated
Comment thread haystack/components/evaluators/faithfulness.py Outdated
@julian-risch julian-risch requested a review from shadeMe April 4, 2024 14:02
Comment thread test/components/evaluators/test_faithfulness_evaluator.py Outdated
@julian-risch julian-risch requested a review from shadeMe April 4, 2024 14:57
Contributor

@shadeMe shadeMe left a comment


LGTM!

@julian-risch julian-risch enabled auto-merge (squash) April 4, 2024 15:58
@julian-risch julian-risch merged commit 9d02dc6 into main Apr 4, 2024
@julian-risch julian-risch deleted the faithfulness-evaluator branch April 4, 2024 16:34
@coveralls
Collaborator

coveralls commented Apr 5, 2024

Pull Request Test Coverage Report for Build 8557979991

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.05%) to 89.407%

Totals Coverage Status
Change from base Build 8557433583: 0.05%
Covered Lines: 5697
Relevant Lines: 6372

💛 - Coveralls


Development

Successfully merging this pull request may close these issues.

LLM Eval - Implement Faithfulness/Factual Accuracy metric

3 participants