small score difference for identical outputs #32
Comments
Hey Nicola, thanks for reporting this... The way bootstrap resampling works will not be affected by very small segment-level differences, because […]. If two systems have the exact same translations, I can assure you that you will have […]. Btw, the root of this problem comes from the layerwise normalization we do, which can be affected by […].
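A minimal sketch of why identical outputs can still receive bit-level different scores (this is a generic illustration of floating-point accumulation order, not COMET's actual code; the batching scenario in the comment is an assumption):

```python
# Hypothetical illustration (not COMET internals): floating-point
# addition is not associative, so summing the same terms in a
# different order changes the last bits of the result. The same
# translation scored through two slightly different computation
# paths (e.g. different batching before layerwise normalization)
# can therefore get almost-negligibly different scores.
a = (1e16 + 1.0) - 1e16   # the 1.0 is absorbed by 1e16: result is 0.0
b = (1e16 - 1e16) + 1.0   # same terms, different order: result is 1.0
print(a, b)               # the two "identical" sums differ by 1.0
```

This is why exact equality checks on scores can miss ties that are, for all practical purposes, identical.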
Why not apply a smooth decision for the ties count, with a reasonably small value for […]?
That's a reasonable idea for the system-level score. I'll fix it along with the other issues you reported.
@nicolabertoldi thanks for the issues! I believe everything is working properly now. Please tell me if not.
🐛 Bug
Using comet-compare
I noticed that the exact same outputs from two different systems receive different scores, although the difference is almost negligible.
But, although negligibly different, these scores are not considered a tie; hence there is an impact on the number of wins/losses reported.
To Reproduce
comet-compare -s SRC -r REF -x SysX -y SysY --to_json JJJ
Actually, I am directly calling the function "compare_command()" included in "cli/compare.py".
Expected behaviour
Either an identical score for identical outputs, or a slightly more flexible counting of wins/losses/ties.
Environment
OS: ubuntu
Packaging: pip3
Version: unbabel-comet==1.0.0rc8