
Editing RoBERTa max sequence length #1011

Closed
nelson-liu opened this issue Aug 13, 2019 · 4 comments

Comments

@nelson-liu
Contributor

Hi!

Is it possible to use the pretrained RoBERTa with a smaller max sequence length? I'm looking for the analogue of --max_seq_length in the original BERT code; the use case is that I'd like to try to reduce my GPU memory usage during fine-tuning (more info at https://github.com/google-research/bert#out-of-memory-issues ).

Sorry if this is a silly question; I just haven't been able to find the setting and am not sure whether it's supported. Thanks!!

@ngoyal2707
Contributor

You don't need to change the max sequence length to reduce memory usage. You can set --max-tokens based on your GPU memory budget, and fairseq will dynamically determine the batch size.
Typically 4400 tokens fit on a 32 GB V100 for roberta-large when using fp16 training.

You can also reduce the batch size (i.e. --max-sentences) and increase --update-freq to fit in GPU memory.
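
For later readers, here is a rough sketch of what token-based batching does (a simplification for illustration only, not the actual fairseq implementation; the function name and example lengths are made up): samples are grouped until the padded batch would exceed the token budget, so --max-tokens bounds memory while the number of sentences per batch varies.

def batch_by_tokens(lengths, max_tokens):
    """Group sample indices so the padded batch (longest length * count) stays within max_tokens."""
    batches, current, longest = [], [], 0
    for idx, n in enumerate(lengths):
        longest = max(longest, n)
        if current and longest * (len(current) + 1) > max_tokens:
            batches.append(current)
            current, longest = [], n
        current.append(idx)
    if current:
        batches.append(current)
    return batches

# e.g. batch_by_tokens([100, 120, 90, 400], max_tokens=384) -> [[0, 1, 2], [3]]
# (the real fairseq code instead raises an AssertionError when a single sample
#  exceeds max_tokens, as in the traceback further down this thread)

--update-freq then simply accumulates gradients over that many of these batches before each optimizer step, so the effective batch size per update is roughly the per-GPU batch times --update-freq times the number of GPUs.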

@nelson-liu
Contributor Author

Thanks for the response! I'm still trying to fine-tune on RACE; unfortunately I can't fit even a 1-element batch into my 11 GB GPU, so it's hard to make much progress from here. When using --max-tokens 384, I run into:

Traceback (most recent call last):
  File "/homes/gws/nfliu/miniconda3/envs/fairseq/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/homes/gws/nfliu/git/fairseq/fairseq_cli/train.py", line 284, in distributed_main
    main(args, init_distributed=True)
  File "/homes/gws/nfliu/git/fairseq/fairseq_cli/train.py", line 68, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
  File "/homes/gws/nfliu/git/fairseq/fairseq/checkpoint_utils.py", line 126, in load_checkpoint
    epoch_itr = trainer.get_train_iterator(epoch=0)
  File "/homes/gws/nfliu/git/fairseq/fairseq/trainer.py", line 216, in get_train_iterator
    epoch=epoch,
  File "/homes/gws/nfliu/git/fairseq/fairseq/tasks/fairseq_task.py", line 153, in get_batch_iterator
    epoch=epoch,
  File "/homes/gws/nfliu/git/fairseq/fairseq/data/iterators.py", line 150, in __init__
    self.frozen_batches = tuple(batch_sampler)
  File "/homes/gws/nfliu/git/fairseq/fairseq/data/data_utils.py", line 221, in batch_by_size
    "limit of {}!".format(idx, sample_len, max_tokens)
AssertionError: sentence at index 56217 of size 429 exceeds max_tokens limit of 384!

Now that I think about it, maybe what I want is a --max-context-length (to match --max-option-length) so both the context and the option can be truncated? Or I should just go get more GPU memory, since 11 << 32 :)

@myleott
Contributor

myleott commented Aug 13, 2019

I don’t think --max-context-length applies here. What you want is to decouple --max-positions (which is the number of positional embeddings and needs to stay at 512) from the max number of tokens per sample. You can add a new option or hardcode 384 here: https://github.com/pytorch/fairseq/blob/577e4fa78a295fd7cd3ee7e9fd4b936ca800ebea/fairseq/tasks/sentence_ranking.py#L120
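
For reference, the kind of per-sample truncation being described could look roughly like this (a minimal sketch, not the actual sentence_ranking.py code; the function name and the 384 cap are only illustrative):

import torch

MAX_SAMPLE_TOKENS = 384  # cap per sample; --max-positions (512 positional embeddings) stays untouched

def truncate_tokens(tokens: torch.Tensor, max_len: int = MAX_SAMPLE_TOKENS) -> torch.Tensor:
    """Keep at most max_len tokens, preserving the trailing EOS token."""
    if tokens.numel() <= max_len:
        return tokens
    return torch.cat([tokens[: max_len - 1], tokens[-1:]])

Because the positional embeddings are unchanged, this only limits how much context each sample keeps; the model itself still accepts up to 512 positions.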

Note that the code runs the network separately for each “choice”, so it’s implicitly storing 5x as many activations as the number of tokens in any one instance.

You could also try working off of the roberta.base model instead of the large one. I’d expect you’ll get worse results with less context, so it’s possible that base+larger context outperforms large+smaller context (not sure...).

@nelson-liu
Contributor Author

Makes sense, thanks @myleott !

cndn added a commit to cndn/fairseq that referenced this issue Jan 29, 2020
Summary:
Pull Request resolved: fairinternal/fairseq-py#1011

Pull Request resolved: facebookresearch#1620

Make Fairseq transformer scriptable. Discussion points on possible code refactoring:

(1) The original decoder output is a tuple (x, {"attn": attn, "inner_states": inner_states}). TorchScript does not support dictionaries with values of different types (attn: Tensor, inner_states: List[Tensor]). The current workaround is to use [attn] for the attention field and access it via output["attn"][0] downstream. This is already used in the fairspeq custom transformer code. Another (maybe cleaner) alternative is to use a namedtuple for the decoder output, but that involves many downstream changes too.

(2) TorchScript currently doesn't support **kwargs, and some unused arguments may get passed in due to polymorphism. The only workaround I can think of for now is to add the possibly unused arguments explicitly (e.g. line 666 in transformer.py).

Differential Revision: D19234599

fbshipit-source-id: 64b8a64995bd2bf9a24f6b0665609a2856dad840
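
To illustrate point (1) in the commit message above, here is a minimal sketch of the Dict[str, List[Tensor]] workaround (simplified, assumed code rather than the actual fairseq transformer; the function name is made up):

from typing import Dict, List, Optional

import torch

@torch.jit.script
def pack_decoder_extra(attn: Optional[torch.Tensor],
                       inner_states: List[torch.Tensor]) -> Dict[str, List[torch.Tensor]]:
    # TorchScript dicts need a single value type, so the attention tensor is
    # wrapped in a one-element list instead of mixing Tensor and List[Tensor].
    extra: Dict[str, List[torch.Tensor]] = {"inner_states": inner_states}
    if attn is not None:
        extra["attn"] = [attn]
    return extra

Callers then read the attention as extra["attn"][0], matching the output["attn"][0] access pattern described above.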
facebook-github-bot pushed a commit that referenced this issue Jan 30, 2020
louismartin pushed a commit to louismartin/fairseq that referenced this issue Mar 24, 2020
moussaKam pushed a commit to moussaKam/language-adaptive-pretraining that referenced this issue Sep 29, 2020