
small score difference for identical outputs #32

Closed
nicolabertoldi opened this issue Oct 21, 2021 · 4 comments

@nicolabertoldi

🐛 Bug

Using comet-compare, I noticed that the exact same output from two different systems receives slightly different scores. Although the difference is negligible, the two scores are not counted as a tie, so they affect the reported number of wins/losses:

    "ties (%)": 0.0,
    "x_wins (%)": 1.0,
    "y_wins (%)": 0.0

{
    "src": "Nedávno prohrál s Raonicem v Brisbane Open.",
    "system_x": {
        "mt": "He recently lost to Raonic at the Brisbane Open.",
        "score": 0.8726277947425842
    },
    "system_y": {
        "mt": "He recently lost to Raonic at the Brisbane Open.",
        "score": 0.872564971446991
    },
    "ref": "He recently lost against Raonic in the Brisbane Open."
},
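
The two segment scores above differ only in the fifth decimal place; a quick check (values copied from the JSON above):

    x = 0.8726277947425842
    y = 0.872564971446991
    print(abs(x - y))  # ~6.3e-05, yet counted as a win for system_x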

To Reproduce

comet-compare -s SRC -r REF -x SysX -y SysY --to_json JJJ

Actually, I am calling the compare_command() function from cli/compare.py directly.

Expected behaviour

Either identical scores for identical outputs, or a somewhat more tolerant counting of wins/losses/ties.

Environment

OS: Ubuntu
Packaging: pip3
Version: unbabel-comet==1.0.0rc8

@nicolabertoldi nicolabertoldi added the bug Something isn't working label Oct 21, 2021
@ricardorei
Collaborator

Hey Nicola, thanks for reporting this... The way bootstrap resampling works, it will not be affected by very small segment-level differences: for the wins/losses/ties counts we average across several segments, and we repeat this several times across different splits.

If two systems have the exact same translations, I can assure you that you will get x_win = 0.5 and y_win = 0.5, or at least something very close to that, which is not statistically significant.
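
For concreteness, here is a minimal sketch of this kind of bootstrap-resampled win/loss/tie counting (the function name and parameters are illustrative, not the actual comet-compare implementation):

    import random

    def bootstrap_win_counts(x_scores, y_scores, num_splits=300, sample_ratio=0.5):
        # Repeatedly subsample segments and compare the two systems'
        # mean scores on each subsample (a sketch, not comet-compare's code).
        n = len(x_scores)
        k = max(1, int(n * sample_ratio))
        wins = {"x": 0, "y": 0, "ties": 0}
        for _ in range(num_splits):
            idx = random.choices(range(n), k=k)  # resample with replacement
            x_mean = sum(x_scores[i] for i in idx) / k
            y_mean = sum(y_scores[i] for i in idx) / k
            if x_mean > y_mean:
                wins["x"] += 1
            elif y_mean > x_mean:
                wins["y"] += 1
            else:
                wins["ties"] += 1
        return wins

Averaging over subsamples dampens per-segment noise, though a strict > comparison of two floating-point means almost never yields an exact tie, which is what the reported counts above show.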

By the way, the root of this problem is the layerwise normalization we do, which can be affected by the batch_size. In practice it's not desirable but, as you said, negligible.
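
A toy illustration of the batch effect (this only demonstrates how normalizing with batch-level statistics makes a segment's values depend on its batch mates; it is not COMET's actual layerwise normalization code):

    import numpy as np

    rng = np.random.default_rng(0)
    v = rng.normal(size=8)  # stand-in embedding for one fixed segment

    def normalize(batch):
        # Statistics are computed over the whole batch (toy example).
        return (batch - batch.mean()) / batch.std()

    small_batch = np.vstack([v, rng.normal(size=(3, 8))])
    large_batch = np.vstack([v, rng.normal(size=(7, 8))])

    # The same segment comes out slightly different in each batch:
    print(normalize(small_batch)[0][:3])
    print(normalize(large_batch)[0][:3])

Because the other segments in a batch shift the normalization statistics, the same translation can receive a marginally different score depending on how the data is batched.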

@nicolabertoldi
Author

Why not apply a smoothed decision for the ties count? Something like:

    if subsample_x_scr > subsample_y_scr + epsilon:
        win_count[0] += 1   # x wins
    elif subsample_y_scr > subsample_x_scr + epsilon:
        win_count[1] += 1   # y wins
    else:
        win_count[2] += 1   # tie

with a reasonably small value for epsilon.
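
A relative tolerance, e.g. via math.isclose, would avoid hand-tuning an absolute epsilon; a sketch of that variant (hypothetical helper, mirroring the snippet above):

    import math

    def update_win_counts(subsample_x_scr, subsample_y_scr, win_count, rel_tol=1e-4):
        # Same logic as above, but with a relative rather than absolute tolerance.
        if math.isclose(subsample_x_scr, subsample_y_scr, rel_tol=rel_tol):
            win_count[2] += 1   # tie
        elif subsample_x_scr > subsample_y_scr:
            win_count[0] += 1   # x wins
        else:
            win_count[1] += 1   # y wins

A relative tolerance scales with the magnitude of the scores, so the same setting works whether the scores sit near 0.1 or near 0.9.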

@ricardorei ricardorei added enhancement New feature or request and removed bug Something isn't working labels Oct 21, 2021
@ricardorei ricardorei reopened this Oct 21, 2021
@ricardorei
Collaborator

That's a reasonable idea for the system-level score. I'll fix it along with the other issues you reported.

@ricardorei
Collaborator

@nicolabertoldi thanks for the issues! I believe everything is working properly now. Please let me know if it isn't.
