
Editing RoBERTa max sequence length #1011

Closed
nelson-liu opened this issue Aug 13, 2019 · 4 comments

Comments

@nelson-liu
Contributor

Hi!

Is it possible to use the pretrained RoBERTa with a smaller max sequence length? I'm looking for the analogue of --max_seq_length in the original BERT code; the use case is that I'd like to try to reduce my GPU memory usage during fine-tuning (more info at https://github.com/google-research/bert#out-of-memory-issues ).

Sorry if this is a silly question; I just haven't been able to find the setting and am not sure whether it's supported. Thanks!!

@ngoyal2707
Contributor

You don't need to change the max sequence length to reduce memory usage. You can set --max-tokens based on your GPU memory budget, and fairseq will dynamically determine the batch size.
Typically 4400 tokens fit on a 32 GB V100 for roberta-large when using fp16 training.

You can also reduce the batch size (i.e. --max-sentences) and increase --update-freq to fit in GPU memory.
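
For later readers, here is a rough sketch of what token-based batching does (a simplification for illustration only, not the actual fairseq implementation; the function name and example lengths are made up): samples are grouped until the padded batch would exceed the token budget, so --max-tokens bounds memory while the number of sentences per batch varies.

def batch_by_tokens(lengths, max_tokens):
    """Group sample indices so the padded batch (longest length * count) stays within max_tokens."""
    batches, current, longest = [], [], 0
    for idx, n in enumerate(lengths):
        longest = max(longest, n)
        if current and longest * (len(current) + 1) > max_tokens:
            batches.append(current)
            current, longest = [], n
        current.append(idx)
    if current:
        batches.append(current)
    return batches

# e.g. batch_by_tokens([100, 120, 90, 400], max_tokens=384) -> [[0, 1, 2], [3]]
# (the real fairseq code instead raises an AssertionError when a single sample
#  exceeds max_tokens, as in the traceback further down this thread)

--update-freq then simply accumulates gradients over that many of these batches before each optimizer step, so the effective batch size per update is roughly the per-GPU batch times --update-freq times the number of GPUs.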

@nelson-liu
Contributor Author

Thanks for the response! I'm still trying to fine-tune on RACE; unfortunately I can't fit even a 1-element batch into my 11 GB GPU, so it's hard to make much progress from here. When using --max-tokens 384, I run into:

Traceback (most recent call last):
  File "/homes/gws/nfliu/miniconda3/envs/fairseq/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/homes/gws/nfliu/git/fairseq/fairseq_cli/train.py", line 284, in distributed_main
    main(args, init_distributed=True)
  File "/homes/gws/nfliu/git/fairseq/fairseq_cli/train.py", line 68, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
  File "/homes/gws/nfliu/git/fairseq/fairseq/checkpoint_utils.py", line 126, in load_checkpoint
    epoch_itr = trainer.get_train_iterator(epoch=0)
  File "/homes/gws/nfliu/git/fairseq/fairseq/trainer.py", line 216, in get_train_iterator
    epoch=epoch,
  File "/homes/gws/nfliu/git/fairseq/fairseq/tasks/fairseq_task.py", line 153, in get_batch_iterator
    epoch=epoch,
  File "/homes/gws/nfliu/git/fairseq/fairseq/data/iterators.py", line 150, in __init__
    self.frozen_batches = tuple(batch_sampler)
  File "/homes/gws/nfliu/git/fairseq/fairseq/data/data_utils.py", line 221, in batch_by_size
    "limit of {}!".format(idx, sample_len, max_tokens)
AssertionError: sentence at index 56217 of size 429 exceeds max_tokens limit of 384!

Now that I think about it, maybe what I want is a --max-context-length (to match --max-option-length) so both the context and the option can be truncated? Or I should just go get more GPU memory, since 11 << 32 :)

@myleott
Contributor

myleott commented Aug 13, 2019

I don’t think --max-context-length applies here. What you want is to decouple --max-positions (which is the number of positional embeddings and needs to stay at 512) from the max number of tokens per sample. You can add a new option or hardcode 384 here: https://github.com/pytorch/fairseq/blob/577e4fa78a295fd7cd3ee7e9fd4b936ca800ebea/fairseq/tasks/sentence_ranking.py#L120
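
For reference, the kind of per-sample truncation being described could look roughly like this (a minimal sketch, not the actual sentence_ranking.py code; the function name and the 384 cap are only illustrative):

import torch

MAX_SAMPLE_TOKENS = 384  # cap per sample; --max-positions (512 positional embeddings) stays untouched

def truncate_tokens(tokens: torch.Tensor, max_len: int = MAX_SAMPLE_TOKENS) -> torch.Tensor:
    """Keep at most max_len tokens, preserving the trailing EOS token."""
    if tokens.numel() <= max_len:
        return tokens
    return torch.cat([tokens[: max_len - 1], tokens[-1:]])

Because the positional embeddings are unchanged, this only limits how much context each sample keeps; the model itself still accepts up to 512 positions.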

Note that the code runs the network separately for each “choice”, so it’s implicitly storing 5x as many activations as the number of tokens in any one instance.

You could also try working off of the roberta.base model instead of the large one. I’d expect you’ll get worse results with less context, so it’s possible that base+larger context outperforms large+smaller context (not sure...).

@nelson-liu
Contributor Author

Makes sense, thanks @myleott !

cndn added a commit to cndn/fairseq that referenced this issue Jan 29, 2020
Summary:
Pull Request resolved: fairinternal/fairseq-py#1011

Pull Request resolved: facebookresearch#1620

Make Fairseq transformer scriptable. Discussion points on possible code refactoring:

(1) The original decoder output is a tuple (x, {"attn": attn, "inner_states": inner_states}). TorchScript does not support dictionaries with values of different types (attn: Tensor, inner_states: List[Tensor]). The current workaround is to use [attn] for the attention field and access it via output["attn"][0] downstream. This is already used in the fairspeq custom transformer code. Another (maybe cleaner) alternative is to use a namedtuple for the decoder output, but that involves many downstream changes too.

(2) TorchScript currently doesn't support **kwargs, and some unused arguments may get passed in due to polymorphism. The only workaround I can think of for now is to add the possibly unused arguments explicitly (e.g. line 666 in transformer.py).

Differential Revision: D19234599

fbshipit-source-id: 64b8a64995bd2bf9a24f6b0665609a2856dad840
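
To illustrate point (1) in the commit message above, here is a minimal sketch of the Dict[str, List[Tensor]] workaround (simplified, assumed code rather than the actual fairseq transformer; the function name is made up):

from typing import Dict, List, Optional

import torch

@torch.jit.script
def pack_decoder_extra(attn: Optional[torch.Tensor],
                       inner_states: List[torch.Tensor]) -> Dict[str, List[torch.Tensor]]:
    # TorchScript dicts need a single value type, so the attention tensor is
    # wrapped in a one-element list instead of mixing Tensor and List[Tensor].
    extra: Dict[str, List[torch.Tensor]] = {"inner_states": inner_states}
    if attn is not None:
        extra["attn"] = [attn]
    return extra

Callers then read the attention as extra["attn"][0], matching the output["attn"][0] access pattern described above.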
facebook-github-bot pushed a commit that referenced this issue Jan 30, 2020
louismartin pushed a commit to louismartin/fairseq that referenced this issue Mar 24, 2020
moussaKam pushed a commit to moussaKam/language-adaptive-pretraining that referenced this issue Sep 29, 2020