Add new QA eval metric: Semantic Answer Similarity (SAS) #1338
Timoeller requested review from julian-risch and tholor and removed the request for julian-risch · August 12, 2021 08:14
tholor reviewed on Aug 12, 2021
Nice!
I adjusted a few docstrings and names to be more consistent with the rest of Haystack.
Left a few comments on open points.
tholor changed the title from "Add Semantic Answer Similarity" to "Add new QA eval metric: Semantic Answer Similarity (SAS)" on Aug 12, 2021
tholor approved these changes on Aug 12, 2021
LGTM
SAS is added as part of the Eval Pipeline.
Intro
The SAS metric correlates better with human judgement of answer correctness because it does not rely on string overlap.
Example:
Prediction = "30%"
Label = "thirty percent"
EM and F1 are both 0, which is overly pessimistic
SAS = 0.95 with model "cross-encoder/stsb-roberta-large", which is a much more realistic similarity score.
We show a much higher correlation of the SAS score with human judgement in an upcoming paper (link TBD)
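The bi-encoder flavour of SAS boils down to a cosine similarity between the embeddings of the predicted answer and the label. A minimal sketch of that computation, using tiny made-up vectors in place of real model embeddings (in the PR the embeddings come from a SentenceTransformers model):

```python
import math

def cosine_similarity(a, b):
    # SAS with a bi-encoder is the cosine similarity between the
    # embedding of the predicted answer and the embedding of the label.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dim embeddings standing in for "30%" and "thirty percent".
# A real model maps both phrases to nearby points in embedding space,
# so SAS comes out high even though string-based EM/F1 are 0.
pred_emb = [0.9, 0.1, 0.3, 0.2]
label_emb = [0.85, 0.15, 0.35, 0.25]
print(round(cosine_similarity(pred_emb, label_emb), 4))
```

A cross-encoder skips the separate embeddings and scores the answer pair directly, which is why it can pick up finer-grained similarity.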
Tech Details
Both bi-encoders and cross-encoders can be used. The model is configured via a model string pointing to a SentenceTransformers or Hugging Face model hub model. There is currently no check that the chosen model is suitable for SAS, but a decent multilingual default is set:
sentence-transformers/paraphrase-multilingual-mpnet-base-v2
Reader Scores on Tutorial 5 (Evaluation) are:
has answer queries: 24
top 1 EM: 0.1667
top k EM: 0.4583
top 1 F1: 0.3671
top k F1: 0.6425
sentence-transformers/paraphrase-multilingual-mpnet-base-v2
top 1 SAS: 0.6161
top k SAS: 0.7915
cross-encoder/stsb-roberta-large
top 1 SAS: 0.4067
top k SAS: 0.6665
So the scores also change depending on the underlying model. On average, the cross-encoder with a large RoBERTa model correlates much more closely with human judgement than the sentence-transformers model used in this example.
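Assuming the eval aggregates per-query results the usual way (top-1 = score of the highest-ranked answer, top-k = best score among the top k answers, each averaged over queries — an assumption for illustration, not taken from the diff), the relationship between the two numbers can be sketched as:

```python
def top1_and_topk(per_query_scores, k=5):
    """per_query_scores: one list per query with the SAS scores of the
    ranked answers (index 0 = highest-ranked answer)."""
    n = len(per_query_scores)
    # Top-1: score of the first-ranked answer, averaged over queries.
    top1 = sum(scores[0] for scores in per_query_scores) / n
    # Top-k: best score among the k highest-ranked answers, averaged.
    topk = sum(max(scores[:k]) for scores in per_query_scores) / n
    return top1, topk

# Toy scores for 3 queries; top-k >= top-1 by construction,
# matching the pattern in the tutorial numbers above.
scores = [[0.6, 0.9, 0.2], [0.4, 0.3, 0.8], [0.95, 0.1, 0.5]]
print(top1_and_topk(scores, k=3))
```

Since top-k takes a max over a superset of the answers top-1 looks at, top-k SAS can never be lower than top-1 SAS, as seen in both model runs above.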
closes #1241