
Commit: Update README
Summary: Pull Request resolved: fairinternal/fairseq-py#826

Differential Revision: D16830402

Pulled By: myleott

fbshipit-source-id: 25afaa6d9de7b51cc884e3f417c8e6b349f5a7bc
myleott authored and facebook-github-bot committed Aug 15, 2019
Parent: 1d44cc8 · Commit: ac66df4
Showing 4 changed files with 129 additions and 100 deletions.
**examples/roberta/README.md** (38 additions, 12 deletions)

https://arxiv.org/abs/1907.11692

## Introduction

RoBERTa iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details.
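
To make the dynamic-masking point concrete, here is a minimal illustrative sketch. It is not fairseq's actual implementation (which masks BPE tokens and follows BERT's 80/10/10 replacement scheme); it only shows that the masking pattern is re-sampled on every pass over the data rather than fixed once at preprocessing time:

```python
import random

def dynamic_mask(tokens, mask_token="<mask>", mask_prob=0.15, rng=random):
    # Re-sample which positions are masked every time this is called,
    # so each epoch sees a different corruption of the same sentence.
    return [mask_token if rng.random() < mask_prob else t for t in tokens]

sentence = "RoBERTa re-samples the masking pattern on every epoch".split()
for epoch in range(3):
    print(dynamic_mask(sentence))  # a different pattern each time
```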

### What's New:

- August 2019: Added [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).

## Pre-trained models

Model | Description | # params | Download
---|---|---|---
`roberta.large.mnli` | `roberta.large` finetuned on [MNLI](http://www.nyu.edu/projects/bowman/multinli) | 355M | [roberta.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz)
`roberta.large.wsc` | `roberta.large` finetuned on [WSC](wsc/README.md) | 355M | [roberta.large.wsc.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.wsc.tar.gz)
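
If you download one of the archives above instead of using torch.hub, the extracted directory can be loaded with `RobertaModel.from_pretrained`. The path below is a placeholder for wherever the tar.gz was extracted:

```python
from fairseq.models.roberta import RobertaModel

# '/path/to/roberta.large.mnli' is a placeholder for the extracted archive,
# which contains model.pt and the accompanying dictionary files.
roberta = RobertaModel.from_pretrained('/path/to/roberta.large.mnli', checkpoint_file='model.pt')
roberta.eval()  # disable dropout for evaluation
```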

## Results

**[GLUE (Wang et al., 2019)](https://gluebenchmark.com/)**
_(dev set, single model, single-task finetuning)_

Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
---|---|---|---|---|---|---|---|---
`roberta.base` | 87.6 | 92.8 | 91.9 | 78.7 | 94.8 | 90.2 | 63.6 | 91.2
`roberta.large` | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4
`roberta.large.mnli` | 90.2 | - | - | - | - | - | - | -

**[SuperGLUE (Wang et al., 2019)](https://super.gluebenchmark.com/)**
_(dev set, single model, single-task finetuning)_

Model | BoolQ | CB | COPA | MultiRC | RTE | WiC | WSC
---|---|---|---|---|---|---|---
`roberta.large` | 86.9 | 98.2 | 94.0 | 85.7 | 89.5 | 75.6 | -
`roberta.large.wsc` | - | - | - | - | - | - | 91.3

**[SQuAD (Rajpurkar et al., 2018)](https://rajpurkar.github.io/SQuAD-explorer/)**
_(dev set, no additional data used)_

Model | SQuAD 1.1 EM/F1 | SQuAD 2.0 EM/F1
---|---|---
`roberta.large` | 88.9/94.6 | 86.5/89.4

**[RACE (Lai et al., 2017)](http://www.qizhexie.com/data/RACE_leaderboard.html)**
_(test set)_

Model | Accuracy | Middle | High
---|---|---|---
`roberta.large` | 83.2 | 86.5 | 81.3

**[HellaSwag (Zellers et al., 2019)](https://rowanzellers.com/hellaswag/)**
_(test set)_

Model | Overall | In-domain | Zero-shot | ActivityNet | WikiHow
---|---|---|---|---|---
`roberta.large` | 85.2 | 87.3 | 83.1 | 74.6 | 90.9

**[Commonsense QA (Talmor et al., 2019)](https://www.tau-nlp.org/commonsenseqa)**
_(test set)_

Model | Accuracy
---|---
`roberta.large` (single model) | 72.1
`roberta.large` (ensemble) | 72.5

**[Winogrande (Sakaguchi et al., 2019)](https://arxiv.org/abs/1907.10641)**
_(test set)_

Model | Accuracy
---|---
`roberta.large` | 78.1

## Example usage

##### Load RoBERTa from torch.hub (PyTorch >= 1.1):
```python
# (earlier lines of this example are collapsed in the diff view)
roberta.cuda()
roberta.predict('new_task', tokens) # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
```
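
Because the body of this example is collapsed in the diff above, here is a hedged sketch of the basic torch.hub workflow. The model name matches the table above; treat the printed shape as illustrative:

```python
import torch

# Load the pre-trained model from torch.hub (downloads on first use).
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout (or leave in train mode to finetune)

# Apply the GPT-2 BPE to a sentence and extract last-layer features.
tokens = roberta.encode('Hello world!')      # LongTensor of BPE token ids, incl. <s> and </s>
features = roberta.extract_features(tokens)  # shape: (1, len(tokens), 1024) for roberta.large
print(features.shape)
```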

## Advanced usage

#### Filling masks:

```python
# (the mask-filling examples and the evaluation loop that computes
#  ncorrect and nsamples are collapsed in this diff view)
print('| Accuracy: ', float(ncorrect)/float(nsamples))
# Expected output: 0.9060
```
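
The mask-filling call itself is collapsed above, so the following is a hedged sketch; the completions and scores in the comment are illustrative, not exact outputs:

```python
import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()

# Returns the top-k completions for the <mask> position as
# (filled sentence, score, predicted token) tuples.
roberta.fill_mask('The first Star Wars film was released in <mask>.', topk=3)
# e.g. [('The first Star Wars film was released in 1977.', 0.54, ' 1977'), ...]
```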

## Finetuning

- [Finetuning on GLUE](README.glue.md)
- [Finetuning on custom classification tasks (e.g., IMDB)](README.custom_classification.md)
- [Finetuning on Winograd Schema Challenge (WSC)](wsc/README.md)
- [Finetuning on Commonsense QA (CQA)](commonsense_qa/README.md)
- Finetuning on SQuAD: coming soon

## Pretraining using your own data

See the [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).

## Citation

```bibtex
@article{liu2019roberta,
    title = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
    author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and
              Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and
              Luke Zettlemoyer and Veselin Stoyanov},
    journal = {arXiv preprint arXiv:1907.11692},
    year = {2019},
}
```

**examples/roberta/README.pretraining.md** (1 addition, 1 deletion)

This tutorial will walk you through pretraining RoBERTa over your own data.

### 1) Preprocess the data

Data should be preprocessed following the [language modeling format](/examples/language_model).

**examples/scaling_nmt/README.md** (24 additions, 12 deletions)

## Training a new model on WMT'16 En-De

First download the [preprocessed WMT'16 En-De data provided by Google](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8).

Then:

##### 1. Extract the WMT'16 En-De data
```bash
TEXT=wmt16_en_de_bpe32k
mkdir -p $TEXT
tar -xzvf wmt16_en_de.tar.gz -C $TEXT
```

##### 2. Preprocess the dataset with a joined dictionary
```bash
fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train.tok.clean.bpe.32000 \
    --validpref $TEXT/newstest2013.tok.bpe.32000 \
    --testpref $TEXT/newstest2014.tok.bpe.32000 \
    --destdir data-bin/wmt16_en_de_bpe32k \
    --nwordssrc 32768 --nwordstgt 32768 \
    --joined-dictionary \
    --workers 20
```
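
As a quick sanity check of the binarized output, the joined dictionary can be inspected from Python; the file names below follow fairseq-preprocess's usual `data-bin` layout:

```python
from fairseq.data import Dictionary

# With --joined-dictionary, the source and target dictionaries are identical.
src_dict = Dictionary.load('data-bin/wmt16_en_de_bpe32k/dict.en.txt')
tgt_dict = Dictionary.load('data-bin/wmt16_en_de_bpe32k/dict.de.txt')
print(len(src_dict), len(tgt_dict))  # roughly 32768 types each, plus special symbols
```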

##### 3. Train a model
```bash
fairseq-train \
    data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 \
    --fp16
```

Note that the `--fp16` flag requires CUDA 9.1 or greater and a Volta GPU or newer.

If you want to train the above model with big batches (assuming your machine has 8 GPUs):
- add `--update-freq 16` to simulate training on 8x16=128 GPUs (see the gradient-accumulation sketch after this list)
- increase the learning rate; 0.001 works well for big batches
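
For intuition, `--update-freq N` accumulates gradients over N batches before each optimizer step, multiplying the effective batch size. A generic PyTorch sketch of the idea (toy model and data, not fairseq internals):

```python
import torch
import torch.nn as nn

# Toy stand-ins; fairseq-train handles all of this internally.
model = nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(64)]

update_freq = 16  # mirrors --update-freq 16: one optimizer step per 16 batches
optimizer.zero_grad()
for i, (x, y) in enumerate(batches):
    loss = criterion(model(x), y) / update_freq  # scale so the summed gradient matches one large batch
    loss.backward()                              # gradients accumulate until we step
    if (i + 1) % update_freq == 0:
        optimizer.step()                         # effective batch = per-GPU batch x update_freq x num GPUs
        optimizer.zero_grad()
```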

##### 4. Evaluate
```bash
fairseq-generate \
    data-bin/wmt16_en_de_bpe32k \
    --path checkpoints/checkpoint_best.pt \
    --beam 4 --lenpen 0.6 --remove-bpe
```
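
Once training finishes, the checkpoint can also be used interactively from Python. This is a hedged sketch: the `bpe` and `bpe_codes` arguments assume the data was encoded with subword-nmt and that the codes file from the downloaded archive is available locally.

```python
from fairseq.models.transformer import TransformerModel

# Paths below assume the directory layout used earlier in this README.
en2de = TransformerModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/wmt16_en_de_bpe32k',
    bpe='subword_nmt',                         # assumption: data was BPE'd with subword-nmt
    bpe_codes='wmt16_en_de_bpe32k/bpe.32000',  # assumption: codes file from the downloaded archive
)
print(en2de.translate('Machine learning is great!', beam=4))
```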

## Citation

```bibtex
% (the BibTeX entry is collapsed in this diff view)
```
