Add new QA eval metric: Semantic Answer Similarity (SAS) #1338
Timoeller requested review from julian-risch and tholor and removed the request for julian-risch · August 12, 2021 08:14
tholor reviewed on Aug 12, 2021
Nice!
I adjusted a few docstrings and names to be more consistent with the rest of Haystack.
Left a few comments on open points.
tholor changed the title from "Add Semantic Answer Similarity" to "Add new QA eval metric: Semantic Answer Similarity (SAS)" on Aug 12, 2021
tholor approved these changes on Aug 12, 2021
LGTM
SAS is added as part of the Eval Pipeline.
Intro
The SAS metric correlates better with human judgement of answer correctness because it does not rely on string overlap.
Example:
Prediction = "30%"
Label = "thirty percent"
EM and F1 are both 0, which is overly pessimistic
SAS = 0.95 with model "cross-encoder/stsb-roberta-large", which is a much more realistic similarity score.
We show a much higher correlation of the SAS score with human judgement in an upcoming paper (link TBD)
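The bi-encoder flavour of SAS boils down to a cosine similarity between the embeddings of the predicted answer and the label. A minimal sketch of that computation, using tiny made-up vectors in place of real model embeddings (in the PR the embeddings come from a SentenceTransformers model):

```python
import math

def cosine_similarity(a, b):
    # SAS with a bi-encoder is the cosine similarity between the
    # embedding of the predicted answer and the embedding of the label.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dim embeddings standing in for "30%" and "thirty percent".
# A real model maps both phrases to nearby points in embedding space,
# so SAS comes out high even though string-based EM/F1 are 0.
pred_emb = [0.9, 0.1, 0.3, 0.2]
label_emb = [0.85, 0.15, 0.35, 0.25]
print(round(cosine_similarity(pred_emb, label_emb), 4))
```

A cross-encoder skips the separate embeddings and scores the answer pair directly, which is why it can pick up finer-grained similarity.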
Tech Details
Both bi-encoders and cross-encoders can be used. The model is configured via a model string pointing to a SentenceTransformers or Hugging Face model hub model. There is currently no check that the chosen model is suitable for SAS, but a decent multilingual default is set:
sentence-transformers/paraphrase-multilingual-mpnet-base-v2
Reader Scores on Tutorial 5 (Evaluation) are:
has answer queries: 24
top 1 EM: 0.1667
top k EM: 0.4583
top 1 F1: 0.3671
top k F1: 0.6425
sentence-transformers/paraphrase-multilingual-mpnet-base-v2
top 1 SAS: 0.6161
top k SAS: 0.7915
cross-encoder/stsb-roberta-large
top 1 SAS: 0.4067
top k SAS: 0.6665
So the scores also change depending on the underlying model. On average, the cross-encoder with a large RoBERTa model correlates much more closely with human judgement than the sentence-transformers model used in this example.
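Assuming the eval aggregates per-query results the usual way (top-1 = score of the highest-ranked answer, top-k = best score among the top k answers, each averaged over queries — an assumption for illustration, not taken from the diff), the relationship between the two numbers can be sketched as:

```python
def top1_and_topk(per_query_scores, k=5):
    """per_query_scores: one list per query with the SAS scores of the
    ranked answers (index 0 = highest-ranked answer)."""
    n = len(per_query_scores)
    # Top-1: score of the first-ranked answer, averaged over queries.
    top1 = sum(scores[0] for scores in per_query_scores) / n
    # Top-k: best score among the k highest-ranked answers, averaged.
    topk = sum(max(scores[:k]) for scores in per_query_scores) / n
    return top1, topk

# Toy scores for 3 queries; top-k >= top-1 by construction,
# matching the pattern in the tutorial numbers above.
scores = [[0.6, 0.9, 0.2], [0.4, 0.3, 0.8], [0.95, 0.1, 0.5]]
print(top1_and_topk(scores, k=3))
```

Since top-k takes a max over a superset of the answers top-1 looks at, top-k SAS can never be lower than top-1 SAS, as seen in both model runs above.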
closes #1241