
Does BART support more than 1024 tokens during inference for summarization? #1685

Closed
JunhyunB opened this issue Feb 8, 2020 · 20 comments

JunhyunB commented Feb 8, 2020

❓ Questions and Help

Does BART support more than 1024 tokens during inference for summarization?
For long text, such as a novel, does BART use all of the input to generate the summary, or does it only use the first 1024 tokens and ignore the rest?

myleott (Contributor) commented Feb 9, 2020

@ngoyal2707

@ngoyal2707 (Contributor) commented:

@JunhyunB Inference alone with longer documents won't work, because the summarization model was fine-tuned with a sequence length of 1024.

What you can do is fine-tune the model with a longer seq_len on your custom training data. In fact, that is similar to what we do: we pretrained BART with a seq_len of 512 and used a seq_len of 1024 during fine-tuning. You can raise it further (say, 2048) and fine-tune.

For the above, you would need to adjust the positional embeddings by either:

  1. learning them from scratch, or
  2. copying the 512 pretrained BART positional embeddings into the first 512 positions of your 2048-position embedding.

I would recommend 2, but that might require slight code changes; a sketch of option 2 follows below (let me know if you need some help with that).
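A minimal sketch of option 2, not from this thread: it assumes the released bart.large checkpoint, whose encoder positional-embedding matrix is [1026, 1024] (2 reserved rows plus 1024 positions, as the shapes shown further down in the thread indicate); the paths, the target length of 2048, and the init std are illustrative placeholders.

import torch

# Hedged sketch: enlarge BART's learned encoder positional embeddings so the
# model can be fine-tuned with longer inputs (here 2048 positions).
ckpt = torch.load('bart.large/model.pt', map_location='cpu')

old = ckpt['model']['encoder.embed_positions.weight']   # [1026, 1024] = 2 reserved rows + 1024 positions
new_num_rows = 2048 + 2                                  # keep the 2 reserved rows

# Randomly initialize the enlarged matrix, then copy the pretrained rows to the front.
new = torch.empty(new_num_rows, old.size(1)).normal_(mean=0.0, std=0.02).to(old.dtype)
new[:old.size(0)] = old

ckpt['model']['encoder.embed_positions.weight'] = new
torch.save(ckpt, 'bart.large/model_2048pos.pt')

When fine-tuning from the resized checkpoint, max_source_positions would need to be raised to match (2048 in this example).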

@loganlebanoff commented:

The readme for CNN/DM says to use MAX_TOKENS=2048, but @ngoyal2707, you say it is 1024, and the same is said in #1474. Is the readme incorrect?

@yinhanliu commented:

You can reset the positional embedding to a new length (e.g. 2048) and copy the 1024 from the model (the second half will be randomly initialized, while the first half is trained)... this is a common trick in summarization.

@loganlebanoff commented:

Thanks for the response, yinhanliu. I wanted to know, though, which hyperparameter setting was used to get the best results when fine-tuning on CNN/DM. Was it 1024 or 2048?

@yinhanliu commented:

@loganlebanoff we only used 1024. Never tried 2048.

loganlebanoff added a commit to loganlebanoff/fairseq that referenced this issue Feb 14, 2020
Change MAX_TOKENS=2048 --> 1024, as per yinhanliu in facebookresearch#1685
@loganlebanoff commented:

Thanks! I've created a pull request to fix the CNN/DM fine-tuning readme.

ngoyal2707 (Contributor) commented Feb 14, 2020

max_tokens, max_sentences, and tokens_per_sample are different arguments: max_sentences is the batch size, max_tokens is the maximum number of tokens allowed in a batch, and tokens_per_sample is the maximum sequence length of a single instance. The current readme instructions are correct.
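To illustrate how the three arguments interact, a toy calculation with hypothetical values (not the readme's):

# Illustrative only (hypothetical values): how the three batching arguments relate.
tokens_per_sample = 1024   # maximum sequence length of a single training example
max_tokens = 2048          # maximum number of tokens allowed in one batch
max_sentences = 8          # hard cap on the number of examples per batch (batch size)

# With full-length 1024-token examples, a 2048-token budget fits at most:
print(min(max_tokens // tokens_per_sample, max_sentences))   # -> 2 examples per batch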

@loganlebanoff commented:

Ok thanks, I understand the difference now

@loganlebanoff commented:

copying the 512 pretrained BART positional embeddings into the first 512 positions of your 2048-position embedding.

@ngoyal2707 I want to increase the max sequence length to 2048 as you said. Can you give a hint as to how to do this? I see that the size of the positional embedding matrix is 1026 (rather than 1024) in the pretrained BART.

state['model']['encoder.embed_positions.weight'].shape
Out[37]: torch.Size([1026, 1024])
state['model']['encoder.embed_positions.weight']
Out[38]: 
tensor([[-0.0043, -0.0042,  0.0029,  ...,  0.0149,  0.0098,  0.0102],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0497, -0.2086, -0.1076,  ..., -0.1564, -0.0135,  0.0566],
        ...,
        [ 0.0027,  0.0022, -0.0051,  ...,  0.0007,  0.0089, -0.0124],
        [ 0.0046, -0.0024,  0.0026,  ..., -0.0050, -0.0112, -0.0063],
        [-0.0056, -0.0084,  0.0082,  ..., -0.0017, -0.0039,  0.0105]],
       dtype=torch.float16)

and similarly, the size is 2050 for the model I will be finetuning.

self.get_model().state_dict()['encoder.embed_positions.weight'].shape
Out[46]: torch.Size([2050, 1024])

Would I copy over the parameters from [2 : 1026] to the second half, [1026 : 2050]?

yinhanliu commented Feb 27, 2020 via email

loganlebanoff commented Feb 28, 2020

Thanks. I saw that the positions always start at 2, so I copied [2 : 1026] to [1026 : 2050]. I compared it to random initialization of the second half, and I got better scores on my specific application when copying vs random. Thanks again!

sajastu commented Mar 7, 2020

What you can do is fine-tune the model with a longer seq_len on your custom training data. In fact, that is similar to what we do: we pretrained BART with a seq_len of 512 and used a seq_len of 1024 during fine-tuning. You can raise it further (say, 2048) and fine-tune.

@ngoyal2707 I'd like to fine-tune BART on quite a different domain, where the average sequence length of the input documents is about 8000 tokens. Does BART support lengths of this order? If not, is there a workaround to handle these cases?

marcelgwerder commented Apr 2, 2020

Are the state['model']['encoder.embed_positions.weight'] weights the only ones I would have to resize and copy when trying to fine-tune with max_source_positions=2048?

With the modification below I can start training, but I'm not convinced it makes sense:

state['model']['encoder.embed_positions.weight'] = torch.cat([
    state['model']['encoder.embed_positions.weight'][:1025].clone(),  # first 1025 rows
    state['model']['encoder.embed_positions.weight'][1:].clone()      # last 1025 rows
], 0)  # -> 2050 rows in total

Is this at all related to setting --encoder-embed-dim?

@fabrahman commented:

@JunhyunB Inference alone with longer documents won't work, because the summarization model was fine-tuned with a sequence length of 1024.

What you can do is fine-tune the model with a longer seq_len on your custom training data. In fact, that is similar to what we do: we pretrained BART with a seq_len of 512 and used a seq_len of 1024 during fine-tuning. You can raise it further (say, 2048) and fine-tune.

For the above, you would need to adjust the positional embeddings by either:

1. learning them from scratch, or

2. copying the `512` pretrained BART positional embeddings into the first `512` positions of your `2048`-position embedding.

I would recommend 2, but that might require slight code changes (let me know if you need some help with that).

@ngoyal2707 Hi, would you please point me to where in your code for fine-tuning BART you copy the 512 pretrained positional embeddings?
I need to fine-tune BART for a task similar to abstractive summarization but with longer sequences. Thanks.

fabrahman commented Apr 29, 2020

Thanks. I saw that the positions always start at 2, so I copied [2 : 1026] to [1026 : 2050]. I compared it to random initialization of the second half, and I got better scores on my specific application when copying vs random. Thanks again!

@loganlebanoff Would you please share the exact changes you made to fine-tune this model on a new dataset with longer sequences? I'd appreciate that.

@loganlebanoff commented:

After this line: https://github.com/pytorch/fairseq/blob/411531734df8c7294e82c68e9d42177382f362ef/fairseq/trainer.py#L202

I added the following code:

encoder_pos = state['model']['encoder.embed_positions.weight']  # shape [1026, 1024]
to_append = encoder_pos[2:]                                      # the 1024 trained position rows (positions start at index 2)
new_encoder_pos = torch.cat((encoder_pos, to_append))            # shape [2050, 1024]
state['model']['encoder.embed_positions.weight'] = new_encoder_pos

@fabrahman commented:

Thanks for the reply, @loganlebanoff. And you also changed max_source_positions to 2048, right?
Did you see a gain from this trick for modeling long sequences?
And just a quick question: why not use encoder_pos[1:-1]?

@loganlebanoff commented:

Right, yes, I changed max_source_positions to 2048. I still used it on CNN/DM, but with a different setup than regular summarization. For my setup, I got slightly better performance by copying the positional embeddings into the last 1024 positions rather than randomizing them (for both settings, I used max_source_positions=2048).

I took a look at https://github.com/pytorch/fairseq/blob/7a6519f84fed06947bbf161c7b66c9099bc4ce53/fairseq/utils.py#L191, which says positions start at padding_idx + 1, and when debugging, padding_idx was 1, so I assume positions start at 2. This was confirmed when I looked at the positions variable that gets created: it starts at 2, and padding is 1. I'm not sure what index 0 is for... (see the sketch after the output below)

In[3]: positions
Out[3]: 
tensor([[  2,   3,   4,  ..., 769, 770, 771],
        [  1,   1,   1,  ..., 697, 698, 699],
        [  1,   1,   1,  ..., 689, 690, 691],
        ...,
        [  1,   1,   1,  ..., 467, 468, 469],
        [  1,   1,   1,  ..., 463, 464, 465],
        [  1,   1,   1,  ..., 354, 355, 356]], device='cuda:0')
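For reference, a rough re-implementation of that position-numbering behaviour (an illustrative sketch, not the actual fairseq source):

import torch

def make_positions_sketch(tokens, padding_idx):
    # Sketch of fairseq's make_positions: non-pad tokens get consecutive
    # positions starting at padding_idx + 1; pad tokens keep padding_idx.
    mask = tokens.ne(padding_idx).long()
    return torch.cumsum(mask, dim=1) * mask + padding_idx

# BART's dictionary uses 1 as the pad index, so real positions start at 2.
tokens = torch.tensor([[5, 6, 7, 8, 9],
                       [1, 1, 5, 6, 7]])   # second row is left-padded
print(make_positions_sketch(tokens, padding_idx=1))
# tensor([[2, 3, 4, 5, 6],
#         [1, 1, 2, 3, 4]])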

waisyousofi commented May 25, 2022

I would recommend 2, but that might require slight code changes (let me know if you need some help with that).

Can you please show how we can increase it to take more than 1024 input tokens?
