
Add new QA eval metric: Semantic Answer Similarity (SAS) #1338

Merged

Timoeller merged 12 commits into master from sas_eval on Aug 12, 2021

Conversation

@Timoeller (Contributor) commented on Aug 11, 2021:

SAS is added as part of the Eval Pipeline.

Intro

The SAS metric correlates better with human judgement of answer correctness than EM and F1 because it does not rely on string overlap.

Example:
- Prediction = "30%"
- Label = "thirty percent"
- EM and F1 are both 0, which is overly pessimistic.
- SAS = 0.95 with the model "cross-encoder/stsb-roberta-large", a much more realistic similarity score.
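As an illustration (not part of this PR's code), a similarity score like the one in the example can be reproduced directly with the sentence-transformers library; this is just a minimal sketch using the model named above:

```python
from sentence_transformers import CrossEncoder

# Minimal sketch: score the example pair with the cross-encoder mentioned above.
# The resulting value should be close to the ~0.95 quoted in the example.
model = CrossEncoder("cross-encoder/stsb-roberta-large")
score = model.predict([("30%", "thirty percent")])[0]
print(f"SAS-style similarity: {score:.2f}")
```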

We show a much higher correlation of the SAS score with human judgment in an upcoming paper (link TBD).

Tech Details

Bi-encoders and cross-encoders can both be used. The model is configured via a model string pointing to a SentenceTransformers or Hugging Face model hub model. There is currently no check whether the chosen model is actually suitable for SAS, but a decent multilingual default is set: sentence-transformers/paraphrase-multilingual-mpnet-base-v2.
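For illustration, here is a minimal sketch of how the two encoder types differ when scoring a prediction against a label, using plain sentence-transformers rather than the Haystack eval code itself (in older sentence-transformers versions, `util.cos_sim` is `util.pytorch_cos_sim`):

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

prediction, label = "30%", "thirty percent"

# Bi-encoder: embed prediction and label separately, then compare via cosine similarity.
bi_encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
embeddings = bi_encoder.encode([prediction, label], convert_to_tensor=True)
bi_score = util.cos_sim(embeddings[0], embeddings[1]).item()

# Cross-encoder: feed both texts through the model jointly and predict a similarity score.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-large")
cross_score = cross_encoder.predict([(prediction, label)])[0]

print(f"bi-encoder SAS: {bi_score:.2f}, cross-encoder SAS: {cross_score:.2f}")
```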

Reader scores on Tutorial 5 (Evaluation), with 24 has-answer queries:

| Metric | top 1 | top k |
| --- | --- | --- |
| EM | 0.1667 | 0.4583 |
| F1 | 0.3671 | 0.6425 |
| SAS (sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.6161 | 0.7915 |
| SAS (cross-encoder/stsb-roberta-large) | 0.4067 | 0.6665 |

So the absolute scores also change depending on the underlying model. The cross-encoder with a large RoBERTa model correlates on average much more strongly with human judgement than the sentence transformer used in this example.
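To make the top-1 vs. top-k distinction concrete, here is a hedged sketch of how those two numbers could be aggregated for a single query. It assumes top-k SAS takes the best similarity among the top-k predicted answers against any gold label; the function and variable names are illustrative, and the actual aggregation in haystack/eval.py may differ in detail:

```python
from sentence_transformers import CrossEncoder

def sas_for_query(predicted_answers, gold_answers,
                  model_name="cross-encoder/stsb-roberta-large"):
    """Return (top-1 SAS, top-k SAS) for one query.

    Assumption: top-k SAS is the best similarity of any of the top-k predicted
    answers against any gold answer; haystack/eval.py may aggregate differently.
    """
    model = CrossEncoder(model_name)
    pairs = [(pred, gold) for pred in predicted_answers for gold in gold_answers]
    scores = model.predict(pairs)
    n_gold = len(gold_answers)
    # Best similarity per predicted answer across all gold answers.
    per_prediction = [max(scores[i * n_gold:(i + 1) * n_gold])
                      for i in range(len(predicted_answers))]
    return per_prediction[0], max(per_prediction)

# Example usage with made-up answers:
top1_sas, topk_sas = sas_for_query(["30 percent", "about a third"], ["thirty percent"])
```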

closes #1241

@Timoeller Timoeller requested review from julian-risch and tholor and removed request for julian-risch August 12, 2021 08:14
@tholor (Member) left a comment:

Nice!
I adjusted a few docstrings and names to be more consistent with the rest of Haystack.
Left a few comments on open points.

Five review threads on haystack/eval.py (all resolved).
@Timoeller Timoeller requested a review from tholor August 12, 2021 10:05
@tholor tholor changed the title Add Semantic Answer Similarity Add new QA eval metric: Semantic Answer Similarity (SAS) Aug 12, 2021
@tholor (Member) left a comment:

LGTM

@Timoeller Timoeller merged commit 07bd3c5 into master Aug 12, 2021
@Timoeller Timoeller deleted the sas_eval branch August 12, 2021 12:31