
feat: Add FaithfulnessEvaluator component#7424

Merged
julian-risch merged 13 commits into main from faithfulness-evaluator on Apr 4, 2024

Conversation

@julian-risch
Member

@julian-risch julian-risch commented Mar 26, 2024

Related Issues

Proposed Changes:

  • Added a new component FaithfulnessEvaluator that returns one aggregated faithfulness score as well as individual per-response faithfulness scores for inputs of queries, contexts, and responses. It uses LLMEvaluator under the hood.

How did you test it?

New unit tests and the following local example:

from haystack import Pipeline
from haystack.components.evaluators import FaithfulnessEvaluator

QUESTIONS = ["Which is the most popular global sport?", "Who created the Python language?"]
CONTEXTS = [
    [
        "The popularity of sports can be measured in various ways, including TV viewership, social media presence, number of participants, and economic impact. Football is undoubtedly the world's most popular sport with major events like the FIFA World Cup and sports personalities like Ronaldo and Messi, drawing a followership of more than 4 billion people."
    ],
    [
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects."
    ],
]
RESPONSES = [
    "Football is the most popular sport with around 4 billion followers worldwide.",
    "Python is a high-level general-purpose programming language that was created by George Lucas.",
]

pipeline = Pipeline()
# FaithfulnessEvaluator uses an OpenAI model via LLMEvaluator by default,
# so the OPENAI_API_KEY environment variable must be set.
evaluator = FaithfulnessEvaluator()
pipeline.add_component("evaluator", evaluator)

results = pipeline.run({"evaluator": {"questions": QUESTIONS, "contexts": CONTEXTS, "responses": RESPONSES}})

print(results["evaluator"])
# {'results': [{'statements': ["Football is undoubtedly the world's most popular sport.", 'Football has around 4
# billion followers worldwide.'], 'statement_scores': [1, 1], 'name': 'llm', 'score': 1.0}, {'statements': ['Python
# is a high-level general-purpose programming language.', 'Python was created by George Lucas.'], 'statement_scores':
# [1, 0], 'score': 0.5}], 'score': 0.75, 'individual_scores': [1.0, 0.5]}

Notes for the reviewer

We can discuss a good name separately. FaithfulnessEvaluator, GroundednessEvaluator, and HallucinationEvaluator are all candidates. deepset Cloud already has a groundedness metric: https://docs.cloud.deepset.ai/docs/use-groundedness-observability

In contrast to the original issue description, this PR doesn't calculate a binary score per answer but per statement in the answer. This calculation is more complex, but it is also more meaningful and is the standard approach in other evaluation frameworks.
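To make the aggregation concrete, here is a minimal sketch (not the component's actual implementation) of how binary per-statement scores roll up into per-response and aggregate scores, consistent with the example output above, where [1, 0] averages to 0.5 and the overall score is 0.75:

```python
def response_score(statement_scores: list[int]) -> float:
    """Average the binary (0/1) faithfulness scores of one response's statements."""
    return sum(statement_scores) / len(statement_scores)

def aggregate_score(individual_scores: list[float]) -> float:
    """Average the per-response scores into one overall faithfulness score."""
    return sum(individual_scores) / len(individual_scores)

# Mirrors the example output: statements [1, 1] and [1, 0] from above.
individual = [response_score([1, 1]), response_score([1, 0])]  # [1.0, 0.5]
print(aggregate_score(individual))  # 0.75
```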

Other frameworks perform the splitting of an answer into statements in separate prompts. Here are examples:
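As a hypothetical illustration of the two-prompt scheme (these are not the prompts of this PR or of any specific framework), one prompt splits the answer into statements and a second prompt judges each statement against the context:

```python
# Hypothetical prompt templates, for illustration only.
SPLIT_PROMPT = (
    "Break the following answer into short, self-contained factual statements, "
    "one per line.\n"
    "Answer: {answer}"
)
JUDGE_PROMPT = (
    "Context: {context}\n"
    "Statement: {statement}\n"
    "Reply with 1 if the statement is supported by the context, otherwise 0."
)

# Each statement returned by the first prompt is then scored by the second.
filled = JUDGE_PROMPT.format(
    context="Python was created by Guido van Rossum.",
    statement="Python was created by George Lucas.",
)
```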

Checklist

@julian-risch julian-risch marked this pull request as ready for review April 4, 2024 12:49
@julian-risch julian-risch requested review from a team as code owners April 4, 2024 12:49
@julian-risch julian-risch requested review from dfokina, shadeMe and silvanocerza and removed request for a team and silvanocerza April 4, 2024 12:49
Comment thread haystack/components/evaluators/faithfulness.py Outdated
Comment thread haystack/components/evaluators/faithfulness.py Outdated
@julian-risch julian-risch requested a review from shadeMe April 4, 2024 14:02
Comment thread test/components/evaluators/test_faithfulness_evaluator.py Outdated
@julian-risch julian-risch requested a review from shadeMe April 4, 2024 14:57
Contributor

@shadeMe shadeMe left a comment


LGTM!

@julian-risch julian-risch enabled auto-merge (squash) April 4, 2024 15:58
@julian-risch julian-risch merged commit 9d02dc6 into main Apr 4, 2024
@julian-risch julian-risch deleted the faithfulness-evaluator branch April 4, 2024 16:34
@coveralls
Collaborator

coveralls commented Apr 5, 2024

Pull Request Test Coverage Report for Build 8557979991

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.05%) to 89.407%

Totals Coverage Status
Change from base Build 8557433583: 0.05%
Covered Lines: 5697
Relevant Lines: 6372

💛 - Coveralls


Development

Successfully merging this pull request may close these issues.

LLM Eval - Implement Faithfulness/Factual Accuracy metric

3 participants