What's the evaluation metric for each dataset on GLUE of RoBERTa? #1561
Following are the metrics used for the 3 tasks you mentioned: …

However, in Table 5 of the RoBERTa paper, …

@luofuli good catch. I think it's a mistake in our manuscript where we are reporting … Thanks!

Thank you very much, @ngoyal2707.
Summary: Parameter sharing (both `--untie-weights-roberta` and `--shared-layer-kv-compressed`) was broken by one of my earlier refactors (D22411012 (d73e543)). This fixes it. Note: it was correct in the original version of the code for the paper.

Pull Request resolved: fairinternal/fairseq-py#1561

Test Plan:

- confirmed that training gives identical losses as before when not using any param sharing (including `--untie-weights-roberta`):

```
CUDA_VISIBLE_DEVICES=0 python train.py --task dummy_masked_lm --arch linformer_roberta_base --untie-weights-roberta --user-dir examples/linformer/linformer_src/ --criterion masked_lm --batch-size 8 --optimizer adam --lr 0.0001 --log-format json --log-interval 1 --max-update 5 --disable-validation --no-save

before:
2021-01-19 06:37:21 | INFO | fairseq_cli.train | num. model params: 164,465,744 (num. trained: 164,465,744)
(...)
2021-01-19 06:41:56 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "15.893", "ppl": "60870.7", "wps": "0", "ups": "0", "wpb": "4096", "bsz": "8", "num_updates": "1", "lr": "0.0001", "gnorm": "7.716", "train_wall": "1", "wall": "1"}
2021-01-19 06:41:56 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "13.176", "ppl": "9252.9", "wps": "11813.8", "ups": "2.88", "wpb": "4096", "bsz": "8", "num_updates": "2", "lr": "0.0001", "gnorm": "6.988", "train_wall": "0", "wall": "1"}
2021-01-19 06:41:57 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "11.049", "ppl": "2119.22", "wps": "12002.2", "ups": "2.93", "wpb": "4096", "bsz": "8", "num_updates": "3", "lr": "0.0001", "gnorm": "8.008", "train_wall": "0", "wall": "1"}
2021-01-19 06:41:57 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "9.044", "ppl": "527.7", "wps": "11894.2", "ups": "2.9", "wpb": "4096", "bsz": "8", "num_updates": "4", "lr": "0.0001", "gnorm": "7.893", "train_wall": "0", "wall": "2"}
2021-01-19 06:41:57 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "7.526", "ppl": "184.27", "wps": "11834.9", "ups": "2.89", "wpb": "4096", "bsz": "8", "num_updates": "5", "lr": "0.0001", "gnorm": "6.949", "train_wall": "0", "wall": "2"}

after:
2021-01-19 06:39:20 | INFO | fairseq_cli.train | num. model params: 164,465,744 (num. trained: 164,465,744)
(...)
2021-01-19 06:39:22 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "15.893", "ppl": "60870.7", "wps": "0", "ups": "0", "wpb": "4096", "bsz": "8", "num_updates": "1", "lr": "0.0001", "gnorm": "7.716", "train_wall": "1", "wall": "1"}
2021-01-19 06:39:23 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "13.176", "ppl": "9252.9", "wps": "12094.7", "ups": "2.95", "wpb": "4096", "bsz": "8", "num_updates": "2", "lr": "0.0001", "gnorm": "6.988", "train_wall": "0", "wall": "1"}
2021-01-19 06:39:23 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "11.049", "ppl": "2119.22", "wps": "12290", "ups": "3", "wpb": "4096", "bsz": "8", "num_updates": "3", "lr": "0.0001", "gnorm": "8.008", "train_wall": "0", "wall": "1"}
2021-01-19 06:39:23 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "9.044", "ppl": "527.7", "wps": "11990.4", "ups": "2.93", "wpb": "4096", "bsz": "8", "num_updates": "4", "lr": "0.0001", "gnorm": "7.893", "train_wall": "0", "wall": "2"}
2021-01-19 06:39:24 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "7.526", "ppl": "184.27", "wps": "12073.8", "ups": "2.95", "wpb": "4096", "bsz": "8", "num_updates": "5", "lr": "0.0001", "gnorm": "6.949", "train_wall": "0", "wall": "2"}
```

- with input embedding and output LM head param sharing, the `num. model params` now goes down (as expected), whereas before it stayed constant:

```
CUDA_VISIBLE_DEVICES=0 python train.py --task dummy_masked_lm --arch linformer_roberta_base --user-dir examples/linformer/linformer_src/ --criterion masked_lm --batch-size 8 --optimizer adam --lr 0.0001 --log-format json --log-interval 1 --max-update 5 --disable-validation --no-save

before:
2021-01-19 06:44:58 | INFO | fairseq_cli.train | num. model params: 164,465,744 (num. trained: 164,465,744)
(...)

after:
2021-01-19 06:43:03 | INFO | fairseq_cli.train | num. model params: 126,065,744 (num. trained: 126,065,744)
(...)
```

- confirmed that old checkpoints can be loaded and produce identical valid ppl:

```
python -m fairseq_cli.validate --path $MODEL --user-dir examples/linformer/linformer_src/ --task dummy_masked_lm --criterion masked_lm --max-sentences 8 --dataset-size 100

no sharing:
before: 2021-01-19 07:07:54 | INFO | valid | | valid on 'valid' subset | loss 5.485 | ppl 44.8 | wps 0 | wpb 53248 | bsz 104
after:  2021-01-19 07:30:10 | INFO | valid | | valid on 'valid' subset | loss 5.485 | ppl 44.8 | wps 0 | wpb 53248 | bsz 104

shared_kv_compressed:
before: 2021-01-19 07:08:50 | INFO | valid | | valid on 'valid' subset | loss 5.355 | ppl 40.94 | wps 0 | wpb 53248 | bsz 104
after:  2021-01-19 07:30:45 | INFO | valid | | valid on 'valid' subset | loss 5.355 | ppl 40.94 | wps 0 | wpb 53248 | bsz 104

shared_kv_compressed + shared_layer_kv_compressed:
before: 2021-01-19 07:09:26 | INFO | valid | | valid on 'valid' subset | loss 5.482 | ppl 44.7 | wps 0 | wpb 53248 | bsz 104
after:  2021-01-19 08:09:36 | INFO | valid | | valid on 'valid' subset | loss 5.482 | ppl 44.7 | wps 0 | wpb 53248 | bsz 104

using a really old checkpoint with sharing (trained on commit cf4219b):
before: | valid on 'valid' subset | loss 5.548 | ppl 46.8 | wps 0 | wpb 53248 | bsz 104
after:  2021-01-19 08:34:07 | INFO | valid | | valid on 'valid' subset | loss 5.548 | ppl 46.8 | wps 0 | wpb 53248 | bsz 104
```

Reviewed By: madian9
Differential Revision: D25938236
Pulled By: myleott
fbshipit-source-id: 4d515e5c8e0601476856ae27eb46c64c30033c88
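The `num. model params` drop in the test plan above comes from sharing the input embedding weight with the output LM head. A minimal PyTorch sketch of that idea (the module names and the use of RoBERTa-base shapes here are illustrative assumptions, not fairseq's actual code):

```python
# Illustrative weight tying: the input embedding and the output
# projection share one tensor, so it is counted (and trained) once.
import torch.nn as nn

vocab_size, embed_dim = 50265, 768  # RoBERTa-base vocab and hidden size

embed = nn.Embedding(vocab_size, embed_dim)             # input embedding
lm_head = nn.Linear(embed_dim, vocab_size, bias=False)  # output LM head

lm_head.weight = embed.weight  # tie: both modules reference the same tensor

# Deduplicating by tensor identity shows the shared weight is stored once,
# which is why the reported parameter count goes down when sharing works.
unique = {id(p): p for m in (embed, lm_head) for p in m.parameters()}
n_params = sum(p.numel() for p in unique.values())  # vocab_size * embed_dim
```

A broken refactor that copies the tensor instead of sharing it would leave the count unchanged, which is exactly the "before" symptom in the test plan.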
Hi,

I have a question about the results of RoBERTa on GLUE.

According to the GLUE leaderboard, there are two different metrics for MRPC, STS, and QQP. Which evaluation metric do you use to compute the results for these datasets shown in the RoBERTa paper and on this page?

I tried to figure this out by aligning the ensemble results of RoBERTa on the test set (Table 5 in the paper) with the RoBERTa entry on the GLUE leaderboard, and I find their evaluation metrics are as follows:

- STS: Pearson
- MRPC: F1
- QQP: Accuracy

However, this conflicts with some related papers such as ELECTRA. ELECTRA directly copies the results from the RoBERTa paper but states (in Section 3.1) that its evaluation metrics are:

- STS: Spearman
- MRPC: Accuracy
- QQP: Accuracy

To conclude, I just wonder which metric is used for each dataset on GLUE in the RoBERTa paper.
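The metric choice matters because the candidates can disagree substantially on the same predictions. A small self-contained sketch with toy numbers (invented purely for illustration, not RoBERTa outputs) showing Pearson vs. Spearman and accuracy vs. F1 diverging:

```python
# Toy comparison of the candidate GLUE metrics; data is made up.
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Pearson on rank-transformed values (toy data has no ties).
    rank = lambda v: [sorted(v).index(a) for a in v]
    return pearson(rank(x), rank(y))

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1(gold, pred, positive=1):
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(p == positive and g != positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn)

# STS-style similarity scores: perfect ranking, imperfect linear fit,
# so Spearman is 1.0 while Pearson is noticeably lower.
gold_s, pred_s = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 10.0]

# MRPC/QQP-style labels: one missed positive moves accuracy and F1
# by different amounts (5/6 vs. 2/3).
gold_c, pred_c = [1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]
```

So a reported number can shift by a point or more depending on which metric a paper picked, which is why the ELECTRA discrepancy above is worth resolving.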