bart.base tokenizer same as bart.large? #2242

Closed
sshleifer opened this issue Jun 15, 2020 · 2 comments

@sshleifer
Contributor

sshleifer commented Jun 15, 2020

Thanks for releasing bart.base!
When I download and open it, I see that the embedding shape implies a vocab size of 51,201, whereas bart.large's embedding implied a vocab size of 50,264.

The first 50,260 lines of the dict.txt files are identical, but then they diverge.
bart.large/dict.txt ends in madeupword0002 0
whereas bart.base/dict.txt continues to madeupword0938 0.

I'm assuming that these extra entries are identical, that the same tokenizer can be used for both models, and that the bart.base embeddings can be truncated to 50,264 rows. Is that correct? Thanks!
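
A minimal sketch of the check described above (it assumes both checkpoints are extracted locally and that the weights sit under fairseq's usual "model" key in model.pt; the exact key names are an assumption, not confirmed in this thread):

```python
import torch

# Assumes bart.base/ and bart.large/ have been downloaded and extracted
# into the working directory; key names follow the usual fairseq layout.
for name in ("bart.base", "bart.large"):
    ckpt = torch.load(f"{name}/model.pt", map_location="cpu")
    emb = ckpt["model"]["encoder.embed_tokens.weight"]
    with open(f"{name}/dict.txt", encoding="utf-8") as f:
        dict_lines = sum(1 for _ in f)
    # fairseq prepends 4 special tokens (<s>, <pad>, </s>, <unk>) before
    # reading dict.txt, so vocab size should be dict_lines + 4.
    print(f"{name}: embedding {tuple(emb.shape)}, dict.txt lines {dict_lines}")
```

If bart.large's dict.txt has exactly 50,260 lines, the arithmetic lines up: 50,260 entries + 4 special tokens = 50,264.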

sshleifer changed the title from "bart.base tokenizer" to "bart.base tokenizer same as bart.large?" on Jun 15, 2020
@ngoyal2707
Contributor

Hey, yes, those extra symbols are just dummy tokens that make the matrix more efficient on GPU. You can delete those tokens and adjust the embed_tokens matrix manually.
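
A minimal sketch of that manual adjustment on the raw checkpoint (assuming the standard fairseq checkpoint layout with parameters under the "model" key; the key suffixes matched below are assumptions and may need adjusting for a given fairseq release):

```python
import torch

KEEP = 50264  # the bart.large vocab size quoted above

# Assumption: standard fairseq checkpoint with parameters under the "model" key.
ckpt = torch.load("bart.base/model.pt", map_location="cpu")
state = ckpt["model"]

for key in list(state):
    # Vocabulary-sized matrices (token embeddings and, if present, the
    # decoder's output projection) have vocab size as their first dimension.
    if key.endswith("embed_tokens.weight") or key.endswith("output_projection.weight"):
        state[key] = state[key][:KEEP].clone()

torch.save(ckpt, "bart.base/model.truncated.pt")
```

The trailing madeupword lines in bart.base/dict.txt would also need to be dropped so the dictionary size matches the truncated matrices.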

@sshleifer
Contributor Author

Maybe I should leave them? What is the efficiency gain?
