What's the evaluation metric for each dataset on GLUE of RoBERTa? #1561
Following are the metrics used for the 3 tasks you mentioned: …

However, in Table 5 of the RoBERTa paper, …

@luofuli good catch. I think it's a mistake in our manuscript where we are reporting … Thanks!

Thank you very much, @ngoyal2707.
Summary: Parameter sharing (both `--untie-weights-roberta` and `--shared-layer-kv-compressed`) was broken by one of my earlier refactors (D22411012 (d73e543)). This fixes it. Note: it was correct in the original version of the code for the paper.

Pull Request resolved: fairinternal/fairseq-py#1561

Test Plan:

- confirmed that training gives identical losses as before when not using any param sharing (including `--untie-weights-roberta`):

```
CUDA_VISIBLE_DEVICES=0 python train.py --task dummy_masked_lm --arch linformer_roberta_base --untie-weights-roberta --user-dir examples/linformer/linformer_src/ --criterion masked_lm --batch-size 8 --optimizer adam --lr 0.0001 --log-format json --log-interval 1 --max-update 5 --disable-validation --no-save

before:
2021-01-19 06:37:21 | INFO | fairseq_cli.train | num. model params: 164,465,744 (num. trained: 164,465,744)
(...)
2021-01-19 06:41:56 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "15.893", "ppl": "60870.7", "wps": "0", "ups": "0", "wpb": "4096", "bsz": "8", "num_updates": "1", "lr": "0.0001", "gnorm": "7.716", "train_wall": "1", "wall": "1"}
2021-01-19 06:41:56 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "13.176", "ppl": "9252.9", "wps": "11813.8", "ups": "2.88", "wpb": "4096", "bsz": "8", "num_updates": "2", "lr": "0.0001", "gnorm": "6.988", "train_wall": "0", "wall": "1"}
2021-01-19 06:41:57 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "11.049", "ppl": "2119.22", "wps": "12002.2", "ups": "2.93", "wpb": "4096", "bsz": "8", "num_updates": "3", "lr": "0.0001", "gnorm": "8.008", "train_wall": "0", "wall": "1"}
2021-01-19 06:41:57 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "9.044", "ppl": "527.7", "wps": "11894.2", "ups": "2.9", "wpb": "4096", "bsz": "8", "num_updates": "4", "lr": "0.0001", "gnorm": "7.893", "train_wall": "0", "wall": "2"}
2021-01-19 06:41:57 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "7.526", "ppl": "184.27", "wps": "11834.9", "ups": "2.89", "wpb": "4096", "bsz": "8", "num_updates": "5", "lr": "0.0001", "gnorm": "6.949", "train_wall": "0", "wall": "2"}

after:
2021-01-19 06:39:20 | INFO | fairseq_cli.train | num. model params: 164,465,744 (num. trained: 164,465,744)
(...)
2021-01-19 06:39:22 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "15.893", "ppl": "60870.7", "wps": "0", "ups": "0", "wpb": "4096", "bsz": "8", "num_updates": "1", "lr": "0.0001", "gnorm": "7.716", "train_wall": "1", "wall": "1"}
2021-01-19 06:39:23 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "13.176", "ppl": "9252.9", "wps": "12094.7", "ups": "2.95", "wpb": "4096", "bsz": "8", "num_updates": "2", "lr": "0.0001", "gnorm": "6.988", "train_wall": "0", "wall": "1"}
2021-01-19 06:39:23 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "11.049", "ppl": "2119.22", "wps": "12290", "ups": "3", "wpb": "4096", "bsz": "8", "num_updates": "3", "lr": "0.0001", "gnorm": "8.008", "train_wall": "0", "wall": "1"}
2021-01-19 06:39:23 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "9.044", "ppl": "527.7", "wps": "11990.4", "ups": "2.93", "wpb": "4096", "bsz": "8", "num_updates": "4", "lr": "0.0001", "gnorm": "7.893", "train_wall": "0", "wall": "2"}
2021-01-19 06:39:24 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "7.526", "ppl": "184.27", "wps": "12073.8", "ups": "2.95", "wpb": "4096", "bsz": "8", "num_updates": "5", "lr": "0.0001", "gnorm": "6.949", "train_wall": "0", "wall": "2"}
```

- with input embedding and output LM head param sharing, the `num. model params` now goes down (as expected), whereas before it stayed constant:

```
CUDA_VISIBLE_DEVICES=0 python train.py --task dummy_masked_lm --arch linformer_roberta_base --user-dir examples/linformer/linformer_src/ --criterion masked_lm --batch-size 8 --optimizer adam --lr 0.0001 --log-format json --log-interval 1 --max-update 5 --disable-validation --no-save

before:
2021-01-19 06:44:58 | INFO | fairseq_cli.train | num. model params: 164,465,744 (num. trained: 164,465,744)
(...)

after:
2021-01-19 06:43:03 | INFO | fairseq_cli.train | num. model params: 126,065,744 (num. trained: 126,065,744)
(...)
```

- confirmed that old checkpoints can be loaded and produce identical valid ppl:

```
python -m fairseq_cli.validate --path $MODEL --user-dir examples/linformer/linformer_src/ --task dummy_masked_lm --criterion masked_lm --max-sentences 8 --dataset-size 100

no sharing:
before: 2021-01-19 07:07:54 | INFO | valid | | valid on 'valid' subset | loss 5.485 | ppl 44.8 | wps 0 | wpb 53248 | bsz 104
after:  2021-01-19 07:30:10 | INFO | valid | | valid on 'valid' subset | loss 5.485 | ppl 44.8 | wps 0 | wpb 53248 | bsz 104

shared_kv_compressed:
before: 2021-01-19 07:08:50 | INFO | valid | | valid on 'valid' subset | loss 5.355 | ppl 40.94 | wps 0 | wpb 53248 | bsz 104
after:  2021-01-19 07:30:45 | INFO | valid | | valid on 'valid' subset | loss 5.355 | ppl 40.94 | wps 0 | wpb 53248 | bsz 104

shared_kv_compressed + shared_layer_kv_compressed:
before: 2021-01-19 07:09:26 | INFO | valid | | valid on 'valid' subset | loss 5.482 | ppl 44.7 | wps 0 | wpb 53248 | bsz 104
after:  2021-01-19 08:09:36 | INFO | valid | | valid on 'valid' subset | loss 5.482 | ppl 44.7 | wps 0 | wpb 53248 | bsz 104

using a really old checkpoint with sharing (trained on commit cf4219b):
before: | valid on 'valid' subset | loss 5.548 | ppl 46.8 | wps 0 | wpb 53248 | bsz 104
after:  2021-01-19 08:34:07 | INFO | valid | | valid on 'valid' subset | loss 5.548 | ppl 46.8 | wps 0 | wpb 53248 | bsz 104
```

Reviewed By: madian9
Differential Revision: D25938236
Pulled By: myleott
fbshipit-source-id: 4d515e5c8e0601476856ae27eb46c64c30033c88
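The `num. model params` drop in the test plan above comes from sharing the input embedding weight with the output LM head. A minimal PyTorch sketch of that idea (the module names and the use of RoBERTa-base shapes here are illustrative assumptions, not fairseq's actual code):

```python
# Illustrative weight tying: the input embedding and the output
# projection share one tensor, so it is counted (and trained) once.
import torch.nn as nn

vocab_size, embed_dim = 50265, 768  # RoBERTa-base vocab and hidden size

embed = nn.Embedding(vocab_size, embed_dim)             # input embedding
lm_head = nn.Linear(embed_dim, vocab_size, bias=False)  # output LM head

lm_head.weight = embed.weight  # tie: both modules reference the same tensor

# Deduplicating by tensor identity shows the shared weight is stored once,
# which is why the reported parameter count goes down when sharing works.
unique = {id(p): p for m in (embed, lm_head) for p in m.parameters()}
n_params = sum(p.numel() for p in unique.values())  # vocab_size * embed_dim
```

A broken refactor that copies the tensor instead of sharing it would leave the count unchanged, which is exactly the "before" symptom in the test plan.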
Hi,

I have a question about the results of RoBERTa on GLUE.

According to the GLUE leaderboard, there are two different metrics for MRPC, STS, and QQP. Which evaluation metric do you use to compute the results for these datasets shown in the RoBERTa paper and on this page?

I tried to figure this out by aligning the ensemble results of RoBERTa on the test set (Table 5 in the paper) with the RoBERTa entry on the GLUE leaderboard, and I find their evaluation metrics are as follows:

- STS: Pearson
- MRPC: F1
- QQP: Accuracy

However, this conflicts with some related papers such as ELECTRA. ELECTRA directly copies the results from the RoBERTa paper but states (in Section 3.1) that its evaluation metrics are:

- STS: Spearman
- MRPC: Accuracy
- QQP: Accuracy

To conclude, I just wonder which metric is used for each dataset on GLUE in the RoBERTa paper.
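The metric choice matters because the candidates can disagree substantially on the same predictions. A small self-contained sketch with toy numbers (invented purely for illustration, not RoBERTa outputs) showing Pearson vs. Spearman and accuracy vs. F1 diverging:

```python
# Toy comparison of the candidate GLUE metrics; data is made up.
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Pearson on rank-transformed values (toy data has no ties).
    rank = lambda v: [sorted(v).index(a) for a in v]
    return pearson(rank(x), rank(y))

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1(gold, pred, positive=1):
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(p == positive and g != positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn)

# STS-style similarity scores: perfect ranking, imperfect linear fit,
# so Spearman is 1.0 while Pearson is noticeably lower.
gold_s, pred_s = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 10.0]

# MRPC/QQP-style labels: one missed positive moves accuracy and F1
# by different amounts (5/6 vs. 2/3).
gold_c, pred_c = [1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]
```

So a reported number can shift by a point or more depending on which metric a paper picked, which is why the ELECTRA discrepancy above is worth resolving.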