Comet compare for multiple systems. #74

Merged
merged 4 commits into Unbabel:master on May 6, 2022

Conversation

SamuelLarkin
Contributor

Hi,
I would like to propose modifying compare.py to allow comparing more than 2 systems at once. The main changes are accepting more than 2 systems, performing bootstrap resampling on all systems in parallel, and calculating pairwise T-tests.

Running pairwise T-tests for multiple systems separately can be expensive, as there is substantial overhead every time the script starts and loads its imported modules. Also, calculating COMET scores only needs to be done once per system; the actual T-test is quite fast.
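As a rough sketch of the idea (not the exact code in this PR; the helper name and the per-system score layout are assumptions), the segment scores computed once per system can be reused for every pairwise test:

from itertools import combinations

from scipy import stats


def pairwise_ttests(seg_scores_per_system):
    """Paired T-test for every pair of systems.

    seg_scores_per_system maps a system name to its list of segment-level
    scores, all aligned on the same source/reference segments.
    """
    results = {}
    for sys_x, sys_y in combinations(seg_scores_per_system, 2):
        statistic, p_value = stats.ttest_rel(
            seg_scores_per_system[sys_x],
            seg_scores_per_system[sys_y],
        )
        results[(sys_x, sys_y)] = (statistic, p_value)
    return results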

@ricardorei
Collaborator

ricardorei commented Apr 28, 2022

Thanks @SamuelLarkin!

This seems like a great feature! I'll try to merge it today or tomorrow and release a new version with it!

@SamuelLarkin
Contributor Author

SamuelLarkin commented Apr 28, 2022

I would feel more comfortable if you performed some tests of your own first ;)
Note that the output format has changed, so you may also want some back and forth to get it to your liking.

@ricardorei
Collaborator

I will I will! I have a few local tests for the CLI with multiple systems from last year's WMT shared task.

@ricardorei
Collaborator

@SamuelLarkin I had a chance to test this and found that the results of the previous implementation and the new one are not exactly the same.

I used the following files for testing:
reference.txt
source.txt
system-1.txt
system-2.txt

System-1 has a COMET score of 0.6515 while system-2 has a score of 0.6604. In the current implementation, these two systems are statistically "tied" with a p-value of 0.1808. In this new implementation, system-2 is significantly better...

@ricardorei
Collaborator

Also, since we now have multiple systems and not just systems x and y, I would like to propose a small change to your CLI.

Instead of systems, we could use the -t/--translations flag, similar to the comet-score command. The resulting CLI would work as follows:

comet-compare -s data/test_samples/source.txt -r data/test_samples/reference.txt -t data/test_samples/system-1.txt data/test_samples/system-2.txt

or

comet-compare -d wmt20:de-en -t data/test_samples/system-1.txt data/test_samples/system-2.txt

@ricardorei
Collaborator

In the display, we could also avoid having repeated results between system-1 -> system-2 and then system-2 -> system-1. This is very minor but it seems redundant to keep both.

Possible improvement: can we present the results at the end in a matrix format, as a summary of the results?
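As a minimal sketch of what such a summary could look like (pandas and the alpha threshold are assumptions; the input is a dict mapping a (sys_x, sys_y) pair to its (statistic, p_value), such as the output of the pairwise sketch above):

import pandas as pd


def summary_matrix(results, alpha=0.05):
    """Build a matrix of pairwise p-values from {(sys_x, sys_y): (stat, p)}."""
    systems = sorted({name for pair in results for name in pair})
    table = pd.DataFrame("", index=systems, columns=systems)
    for (sys_x, sys_y), (statistic, p_value) in results.items():
        # One cell per unordered pair, so system-1 -> system-2 and
        # system-2 -> system-1 are not both shown.
        table.loc[sys_x, sys_y] = f"p={p_value:.4f}" + ("*" if p_value < alpha else "")
    return table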

@SamuelLarkin
Contributor Author

Thanks @ricardorei for your feedback.
I'll work on your proposed changes.

Note that I came across a scenario where, using the original comet-compare, if I scored system1 against system2 they came up as a tie, but if I compared system2 against system1 (I simply reversed their order), then one of them was better; I can't recall which one.

I'm not too surprised that the new scores are not equal to the previous scores, as I had to change the bootstrap resampling and the new method is probably not picking the same samples, but it is worth checking.
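For context, a minimal sketch of the kind of bootstrap resampling involved (numpy, the shared per-split indices, and the default values are assumptions, not the script's actual behaviour):

import numpy as np


def bootstrap_means(seg_scores_per_system, num_splits=300, sample_ratio=0.5, seed=12):
    rng = np.random.default_rng(seed)
    n_segments = len(next(iter(seg_scores_per_system.values())))
    sample_size = int(n_segments * sample_ratio)
    means = {name: [] for name in seg_scores_per_system}
    for _ in range(num_splits):
        # Draw one set of segment indices and reuse it for every system,
        # so the per-split means stay paired across systems.
        idx = rng.choice(n_segments, size=sample_size, replace=True)
        for name, scores in seg_scores_per_system.items():
            means[name].append(float(np.asarray(scores)[idx].mean()))
    return means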

@SamuelLarkin
Contributor Author

@ricardorei
I would appreciate it if you could provide me with an example of what you want in the final table. By that I mean, do you want a single value per cell (i.e. the statistic), or a cell that contains x-mean, y-mean, ties, x_wins, y_wins, statistic & p_value?

@SamuelLarkin
Contributor Author

SamuelLarkin commented May 2, 2022

Documentation: Different Scores

Comparing the previous code with the new code, the COMET scores are slightly different. This will definitely yield different system scores.

Generate COMET Scores

#!/bin/bash

function new {
   mkdir -p new
   ../comet-compare.py \
      --to_json new/info.json \
      --num_splits 3 \
      --sample_ratio 0.6 \
      --gpus 0 \
      --model "wmt20-comet-da" \
      -s source.txt \
      -r reference.txt \
      -t system-?.txt \
      > new/t_test.results \
      2> new/t_test.log
}

function old {
   mkdir -p old
   ../orig/comet-compare-orig.py \
      --to_json old/info.json \
      --num_splits 3 \
      --sample_ratio 0.6 \
      --gpus 0 \
      --model "wmt20-comet-da" \
      -s source.txt \
      -r reference.txt \
      -x system-1.txt \
      -y system-2.txt \
      > old/t_test.results \
      2> old/t_test.log
}

new
old

Compare Segment Scores

Comparing segment scores for system X

vimdiff \
   <(jq .translations[0].scores[] new/info.json) \
   <(jq '.[1:]|.[].system_x.score' old/info.json)

@ricardorei
Collaborator

@ricardorei I would appreciate it if you could provide me with an example of what you want in the final table. By that I mean, do you want a single value per cell (i.e. the statistic), or a cell that contains x-mean, y-mean, ties, x_wins, y_wins, statistic & p_value?

I was thinking of something simple, just with a True/False from the null hypothesis test.

@ricardorei
Collaborator

ricardorei commented May 2, 2022

COMET scores can have small differences because of in-batch layer normalization (#32), but it should not have a huge effect on statistical significance, especially with stats.ttest_rel.

@ricardorei
Collaborator

The difference I observed was that, with the old command, the paired T-test between system-1 and system-2 had a statistic of -1.3394 and a p-value of 0.18, while the new command had a statistic of 14.1933 and a p-value of 0.000.

@SamuelLarkin
Contributor Author

SamuelLarkin commented May 3, 2022

@ricardorei
About the different scores: I tried running the original code and the new code with --disable_length_batching, and now the output scores are almost identical. The only different score is for the last sentence. In the new code, instead of scoring system_x and then scoring system_y, I pool all examples from all systems into one sample set and call model.predict() only once for all systems. Batching by length will clearly change the mini-batches. This could explain issue #66 about not getting the same scores on GPU vs CPU: clearly, batching by length changes the output.
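Schematically, the pooling looks something like this sketch (the batch_size/gpus values and the predict() return handling are assumptions and vary across COMET versions):

def score_all_systems(model, sources, references, translations_per_system):
    # Pool every (src, mt, ref) triple from every system into one list so
    # that model.predict() is called a single time for all systems.
    pooled = []
    for translations in translations_per_system.values():
        pooled.extend(
            {"src": s, "mt": t, "ref": r}
            for s, t, r in zip(sources, translations, references)
        )
    # Assumed return handling: one segment score per pooled example, in order.
    seg_scores = list(model.predict(pooled, batch_size=8, gpus=1).scores)
    n = len(sources)
    return {
        name: seg_scores[i * n:(i + 1) * n]
        for i, name in enumerate(translations_per_system)
    }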

I think the new code is not what is causing the discrepancies in the output scores.

I think there is a bug with batching by length, since it is only there for efficient processing and should NOT change the output.

@SamuelLarkin
Contributor Author

@ricardorei
I've added a summary table. Let me know what you think about it.

@ricardorei
Collaborator

I tested everything with and without length batching and everything looks fine to me. There are tiny score changes (expected because of mini-batch embedding normalization). This does not change statistical significance testing!

Example:


....
Segment 770     system_x score: 0.7316  system_y score: 0.8107
Segment 771     system_x score: 0.4470  system_y score: 0.5310
Segment 772     system_x score: 0.6764  system_y score: 0.6583
Segment 773     system_x score: 0.3163  system_y score: 0.4440
Segment 774     system_x score: 0.7526  system_y score: 0.7095
Segment 775     system_x score: 0.7381  system_y score: 0.9035
Segment 776     system_x score: 0.2816  system_y score: 0.6336
Segment 777     system_x score: 0.6057  system_y score: 0.5642
Segment 778     system_x score: 0.6142  system_y score: 0.6437
Segment 779     system_x score: 0.3659  system_y score: 0.3668
Segment 780     system_x score: 0.5476  system_y score: 0.5236
Segment 781     system_x score: 0.7021  system_y score: 0.7247
Segment 782     system_x score: 0.6602  system_y score: 0.6840
Segment 783     system_x score: 0.6912  system_y score: 0.3699
Segment 784     system_x score: 0.3618  system_y score: 0.4356

Bootstrap Resampling Results:
x-mean: 0.6511
y-mean: 0.6600
ties (%):       0.0633
x_wins (%):     0.1800
y_wins (%):     0.7567

Paired T-Test Results:
statistic:      -1.3429
p_value:        0.1797
Null hypothesis can't be rejected

And with length batching disabled:

comet-compare -s data/source.txt -x data/system-1.txt -y data/system-2.txt -r data/reference.txt --disable_length_batching 

...

Segment 770     system_x score: 0.7322  system_y score: 0.8111
Segment 771     system_x score: 0.4470  system_y score: 0.5310
Segment 772     system_x score: 0.6760  system_y score: 0.6577
Segment 773     system_x score: 0.3157  system_y score: 0.4434
Segment 774     system_x score: 0.7529  system_y score: 0.7102
Segment 775     system_x score: 0.7392  system_y score: 0.9039
Segment 776     system_x score: 0.2827  system_y score: 0.6357
Segment 777     system_x score: 0.6079  system_y score: 0.5662
Segment 778     system_x score: 0.6144  system_y score: 0.6437
Segment 779     system_x score: 0.3772  system_y score: 0.3779
Segment 780     system_x score: 0.5483  system_y score: 0.5246
Segment 781     system_x score: 0.7022  system_y score: 0.7248
Segment 782     system_x score: 0.6659  system_y score: 0.6888
Segment 783     system_x score: 0.6931  system_y score: 0.3731
Segment 784     system_x score: 0.3628  system_y score: 0.4360

Bootstrap Resampling Results:
x-mean: 0.6515
y-mean: 0.6603
ties (%):       0.0600
x_wins (%):     0.1800
y_wins (%):     0.7600

Paired T-Test Results:
statistic:      -1.3410
p_value:        0.1803
Null hypothesis can't be rejected.
Both systems have equal averages.

@ricardorei
Collaborator

When I run your code, I also get something in line with these scores:

x_name: data/system-1.txt
y_name: data/system-2.txt

Bootstrap Resampling Results:
x-mean: 0.6511
y-mean: 0.6600
ties (%):       0.0433
x_wins (%):     0.1867
y_wins (%):     0.7700

Paired T-Test Results:
statistic:      -14.1481
p_value:        0.0000

The only difference is the p_value (close to 0) and the statistic (-14.1481)... Then I found the problem!

@ricardorei
Collaborator

The stats.ttest_rel function should be called with the sentence-level scores and not with the mean scores over each fold!

That is the difference... basically, for the t-test you just need to take the seg_scores, split them into systems, and call the function. The t-test has nothing to do with the bootstrap; they are independent.
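In code, the fix amounts to something like this minimal sketch (the function name is a placeholder):

from scipy import stats


def paired_ttest(seg_scores_x, seg_scores_y):
    # Paired T-test on aligned sentence-level scores, independent of the
    # bootstrap resampling step.
    result = stats.ttest_rel(seg_scores_x, seg_scores_y)
    return result.statistic, result.pvalue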

@ricardorei
Collaborator

In any case, all the rest seems good and I like the table at the end! Thanks for doing this @SamuelLarkin !

ricardorei merged commit 636a4aa into Unbabel:master on May 6, 2022
@ricardorei
Collaborator

Meanwhile, I had some time and fixed this issue. Now it's 100% consistent with the previous implementation and supports multiple-system comparison!

Thanks for your contribution @SamuelLarkin! This is a feature that was requested before and is very useful!
