
What's the evaluation metric for each dataset on GLUE of RoBERTa? #1561

Closed
luofuli opened this issue Dec 30, 2019 · 5 comments

luofuli commented Dec 30, 2019

Hi,

I have a question about the results of RoBERTa on GLUE.
According to the GLUE leaderboard, two different metrics are reported for MRPC, STS-B, and QQP. Which evaluation metric did you use to compute the results for these datasets reported in the RoBERTa paper and on this page?

I tried to figure this out by aligning the ensemble results of RoBERTa on the test set (Table 5 in the paper) with the RoBERTa entry on the GLUE leaderboard, and the evaluation metrics appear to be as follows:

  • STS: Pearson
  • MRPC: F1
  • QQP: Accuracy

However, this conflicts with some related papers such as ELECTRA. The ELECTRA paper directly copies the results reported in the RoBERTa paper and states (in its Section 3.1) that the evaluation metrics are as follows:

  • STS: Spearman
  • MRPC: Accuracy
  • QQP: Accuracy

To conclude, I would just like to know which metric is used for each GLUE dataset in the RoBERTa paper.
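
For reference, here is a minimal sketch of how these leaderboard metrics are typically computed with scipy/scikit-learn; the predictions and gold labels below are made-up placeholders, not RoBERTa outputs:

```python
# Minimal sketch of the GLUE metrics in question (scipy / scikit-learn versions).
# Predictions and gold labels are made-up placeholders, not RoBERTa outputs.
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import accuracy_score, f1_score

# STS-B: similarity regression in [0, 5]; the leaderboard reports Pearson and Spearman
sts_pred, sts_gold = [1.2, 3.8, 4.5, 0.4], [1.0, 4.0, 4.2, 0.0]
print("STS-B Pearson: ", pearsonr(sts_pred, sts_gold)[0])
print("STS-B Spearman:", spearmanr(sts_pred, sts_gold)[0])

# MRPC / QQP: binary paraphrase labels; the leaderboard reports both F1 and accuracy
pair_pred, pair_gold = [1, 1, 0, 1, 1, 0], [1, 0, 0, 1, 1, 1]
print("F1:      ", f1_score(pair_gold, pair_pred))
print("Accuracy:", accuracy_score(pair_gold, pair_pred))
```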

@lematt1991
Contributor

CC @myleott @ngoyal2707

@ngoyal2707
Contributor

Following are the metrics used for the 3 tasks you mentioned:

  • STS: Pearson
  • MRPC: ACC
  • QQP: ACC

@luofuli
Author

luofuli commented Jan 10, 2020

However, in Table 5 of the RoBERTa paper, MRPC obtains 92.3 on the test set, and according to the GLUE leaderboard this score is F1, not accuracy (Acc). @ngoyal2707


@myleott myleott reopened this Jan 10, 2020
@ngoyal2707
Contributor

@luofuli good catch. I think it's a mistake in our manuscript: we report Acc for dev and F1 for test.
The systems are still comparable since all of them report the same measures, but we will update the next version to make this clear.

Thanks!
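
For context on why the dev (Acc) and test (F1) columns are not directly comparable: on a positive-heavy dataset like MRPC, the two metrics generally differ even for the same predictions. A toy example (made-up labels, not RoBERTa's predictions):

```python
# Toy example (made-up labels): accuracy and F1 differ for the same predictions
# on a positive-heavy dataset such as MRPC.
from sklearn.metrics import accuracy_score, f1_score

gold = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 1, 1, 0, 0, 0, 1, 0]
print("Acc:", accuracy_score(gold, pred))  # 0.80
print("F1: ", f1_score(gold, pred))        # ~0.83
```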

@luofuli
Author

luofuli commented Jan 25, 2020

Thank you very much, @ngoyal2707!

facebook-github-bot pushed a commit that referenced this issue Jan 20, 2021
Summary:
Parameter sharing (both `--untie-weights-roberta` and `--shared-layer-kv-compressed`) was broken by one of my earlier refactors (D22411012 (d73e543)). This fixes it.

Note: it was correct in the original version of the code for the paper.

Pull Request resolved: fairinternal/fairseq-py#1561

Test Plan:
- confirmed that training gives identical losses as before when not using any param sharing (including `--untie-weights-roberta`):
```
CUDA_VISIBLE_DEVICES=0 python train.py --task dummy_masked_lm --arch linformer_roberta_base --untie-weights-roberta --user-dir examples/linformer/linformer_src/ --criterion masked_lm --batch-size 8 --optimizer adam --lr 0.0001 --log-format json --log-interval 1 --max-update 5 --disable-validation --no-save

before:
2021-01-19 06:37:21 | INFO | fairseq_cli.train | num. model params: 164,465,744 (num. trained: 164,465,744)
(...)
2021-01-19 06:41:56 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "15.893", "ppl": "60870.7", "wps": "0", "ups": "0", "wpb": "4096", "bsz": "8", "num_updates": "1", "lr": "0.0001", "gnorm": "7.716", "train_wall": "1", "wall": "1"}
2021-01-19 06:41:56 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "13.176", "ppl": "9252.9", "wps": "11813.8", "ups": "2.88", "wpb": "4096", "bsz": "8", "num_updates": "2", "lr": "0.0001", "gnorm": "6.988", "train_wall": "0", "wall": "1"}
2021-01-19 06:41:57 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "11.049", "ppl": "2119.22", "wps": "12002.2", "ups": "2.93", "wpb": "4096", "bsz": "8", "num_updates": "3", "lr": "0.0001", "gnorm": "8.008", "train_wall": "0", "wall": "1"}
2021-01-19 06:41:57 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "9.044", "ppl": "527.7", "wps": "11894.2", "ups": "2.9", "wpb": "4096", "bsz": "8", "num_updates": "4", "lr": "0.0001", "gnorm": "7.893", "train_wall": "0", "wall": "2"}
2021-01-19 06:41:57 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "7.526", "ppl": "184.27", "wps": "11834.9", "ups": "2.89", "wpb": "4096", "bsz": "8", "num_updates": "5", "lr": "0.0001", "gnorm": "6.949", "train_wall": "0", "wall": "2"}

after:
2021-01-19 06:39:20 | INFO | fairseq_cli.train | num. model params: 164,465,744 (num. trained: 164,465,744)
(...)
2021-01-19 06:39:22 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "15.893", "ppl": "60870.7", "wps": "0", "ups": "0", "wpb": "4096", "bsz": "8", "num_updates": "1", "lr": "0.0001", "gnorm": "7.716", "train_wall": "1", "wall": "1"}
2021-01-19 06:39:23 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "13.176", "ppl": "9252.9", "wps": "12094.7", "ups": "2.95", "wpb": "4096", "bsz": "8", "num_updates": "2", "lr": "0.0001", "gnorm": "6.988", "train_wall": "0", "wall": "1"}
2021-01-19 06:39:23 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "11.049", "ppl": "2119.22", "wps": "12290", "ups": "3", "wpb": "4096", "bsz": "8", "num_updates": "3", "lr": "0.0001", "gnorm": "8.008", "train_wall": "0", "wall": "1"}
2021-01-19 06:39:23 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "9.044", "ppl": "527.7", "wps": "11990.4", "ups": "2.93", "wpb": "4096", "bsz": "8", "num_updates": "4", "lr": "0.0001", "gnorm": "7.893", "train_wall": "0", "wall": "2"}
2021-01-19 06:39:24 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "7.526", "ppl": "184.27", "wps": "12073.8", "ups": "2.95", "wpb": "4096", "bsz": "8", "num_updates": "5", "lr": "0.0001", "gnorm": "6.949", "train_wall": "0", "wall": "2"}
```
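A small sketch (not part of fairseq) of how the before/after JSON log lines above could be compared programmatically; the file names `before.log` and `after.log` are hypothetical:
```python
# Sketch (not part of fairseq) of comparing the before/after training logs above.
# "before.log" / "after.log" are hypothetical files containing the lines shown.
import json
import re

def train_losses(path):
    losses = []
    with open(path) as f:
        for line in f:
            m = re.search(r"train_inner \| (\{.*\})", line)
            if m:
                losses.append(float(json.loads(m.group(1))["loss"]))
    return losses

assert train_losses("before.log") == train_losses("after.log"), "losses diverged"
print("identical training losses before and after the fix")
```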
- with input embedding and output LM head param sharing, the `num. model params` now goes down (as expected), whereas before it stayed constant:
```
CUDA_VISIBLE_DEVICES=0 python train.py --task dummy_masked_lm --arch linformer_roberta_base --user-dir examples/linformer/linformer_src/ --criterion masked_lm --batch-size 8 --optimizer adam --lr 0.0001 --log-format json --log-interval 1 --max-update 5 --disable-validation --no-save

before:
2021-01-19 06:44:58 | INFO | fairseq_cli.train | num. model params: 164,465,744 (num. trained: 164,465,744)
(...)

after:
2021-01-19 06:43:03 | INFO | fairseq_cli.train | num. model params: 126,065,744 (num. trained: 126,065,744)
(...)
```
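For intuition, a minimal PyTorch weight-tying sketch (illustrative only, not the `linformer_roberta_base` code; the vocabulary and hidden sizes are RoBERTa-base-like assumptions) showing why sharing the input embedding with the LM head removes one `vocab x dim` matrix from the parameter count:
```python
# Minimal weight-tying sketch (illustrative only; sizes are RoBERTa-base-like guesses).
import torch.nn as nn

vocab, dim = 50265, 768
embed = nn.Embedding(vocab, dim)               # input embedding
lm_head = nn.Linear(dim, vocab, bias=False)    # output projection of the LM head

def n_params(*modules):
    # count each parameter tensor once, even if it is shared between modules
    seen = {id(p): p.numel() for m in modules for p in m.parameters()}
    return sum(seen.values())

print("untied:", n_params(embed, lm_head))     # 2 * vocab * dim
lm_head.weight = embed.weight                  # the sharing that --untie-weights-roberta disables
print("tied:  ", n_params(embed, lm_head))     # vocab * dim
```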
- confirmed that old checkpoints can be loaded and produce identical valid ppl:
```
python -m fairseq_cli.validate --path $MODEL --user-dir examples/linformer/linformer_src/ --task dummy_masked_lm --criterion masked_lm --max-sentences 8 --dataset-size 100

no sharing:
  before:
      2021-01-19 07:07:54 | INFO | valid |  | valid on 'valid' subset | loss 5.485 | ppl 44.8 | wps 0 | wpb 53248 | bsz 104
  after:
      2021-01-19 07:30:10 | INFO | valid |  | valid on 'valid' subset | loss 5.485 | ppl 44.8 | wps 0 | wpb 53248 | bsz 104

shared_kv_compressed:
  before:
      2021-01-19 07:08:50 | INFO | valid |  | valid on 'valid' subset | loss 5.355 | ppl 40.94 | wps 0 | wpb 53248 | bsz 104
  after:
      2021-01-19 07:30:45 | INFO | valid |  | valid on 'valid' subset | loss 5.355 | ppl 40.94 | wps 0 | wpb 53248 | bsz 104

shared_kv_compressed + shared_layer_kv_compressed:
  before:
      2021-01-19 07:09:26 | INFO | valid |  | valid on 'valid' subset | loss 5.482 | ppl 44.7 | wps 0 | wpb 53248 | bsz 104
  after:
      2021-01-19 08:09:36 | INFO | valid |  | valid on 'valid' subset | loss 5.482 | ppl 44.7 | wps 0 | wpb 53248 | bsz 104

using a really old checkpoint with sharing (trained on commit cf4219b):
  before:
       | valid on 'valid' subset | loss 5.548 | ppl 46.8 | wps 0 | wpb 53248 | bsz 104
  after:
      2021-01-19 08:34:07 | INFO | valid |  | valid on 'valid' subset | loss 5.548 | ppl 46.8 | wps 0 | wpb 53248 | bsz 104
```
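As a quick sanity check on the validation lines above: fairseq logs loss in base 2, so each reported ppl should equal `2 ** loss`, which the quoted values satisfy:
```python
# Sanity check on the valid-set lines above: the logged ppl matches 2 ** loss
# (fairseq reports loss in base 2), so before/after runs can also be compared on loss alone.
for loss, ppl in [(5.485, 44.8), (5.355, 40.94), (5.482, 44.7), (5.548, 46.8)]:
    assert abs(2 ** loss - ppl) < 0.05, (loss, ppl)
print("ppl values consistent with 2 ** loss")
```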

Reviewed By: madian9

Differential Revision: D25938236

Pulled By: myleott

fbshipit-source-id: 4d515e5c8e0601476856ae27eb46c64c30033c88
harkash pushed a commit to harkash/fairseq that referenced this issue Feb 23, 2021
sshleifer pushed a commit that referenced this issue Apr 7, 2021