bart.base tokenizer same as bart.large? #2242

Closed
sshleifer opened this issue Jun 15, 2020 · 2 comments

@sshleifer
Contributor

sshleifer commented Jun 15, 2020

Thanks for releasing bart.base!
When I download and open it, I see that the embedding shape implies a vocab size of 51,201, whereas bart.large's embedding implied a vocab size of 50,264.

The first 50,260 lines of the dict.txt files are identical, but then they diverge.
bart.large/dict.txt ends in madeupword0002 0
whereas bart.base/dict.txt continues to madeupword0938 0.

I'm assuming that these extra entries are identical, that the same tokenizer can be used for both models, and that the bart.base embeddings can be truncated to 50,264 rows. Is that correct? Thanks!
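
A minimal sketch of the check described above (it assumes both checkpoints are extracted locally and that the weights sit under fairseq's usual "model" key in model.pt; the exact key names are an assumption, not confirmed in this thread):

```python
import torch

# Assumes bart.base/ and bart.large/ have been downloaded and extracted
# into the working directory; key names follow the usual fairseq layout.
for name in ("bart.base", "bart.large"):
    ckpt = torch.load(f"{name}/model.pt", map_location="cpu")
    emb = ckpt["model"]["encoder.embed_tokens.weight"]
    with open(f"{name}/dict.txt", encoding="utf-8") as f:
        dict_lines = sum(1 for _ in f)
    # fairseq prepends 4 special tokens (<s>, <pad>, </s>, <unk>) before
    # reading dict.txt, so vocab size should be dict_lines + 4.
    print(f"{name}: embedding {tuple(emb.shape)}, dict.txt lines {dict_lines}")
```

If bart.large's dict.txt has exactly 50,260 lines, the arithmetic lines up: 50,260 entries + 4 special tokens = 50,264.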

sshleifer changed the title from "bart.base tokenizer" to "bart.base tokenizer same as bart.large?" on Jun 15, 2020
@ngoyal2707
Contributor

Hey, yes, those extra symbols are just dummy tokens that make the matrix more efficient on GPU. You can delete those tokens and adjust the embed_tokens matrix manually.
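
A minimal sketch of that manual adjustment on the raw checkpoint (assuming the standard fairseq checkpoint layout with parameters under the "model" key; the key suffixes matched below are assumptions and may need adjusting for a given fairseq release):

```python
import torch

KEEP = 50264  # the bart.large vocab size quoted above

# Assumption: standard fairseq checkpoint with parameters under the "model" key.
ckpt = torch.load("bart.base/model.pt", map_location="cpu")
state = ckpt["model"]

for key in list(state):
    # Vocabulary-sized matrices (token embeddings and, if present, the
    # decoder's output projection) have vocab size as their first dimension.
    if key.endswith("embed_tokens.weight") or key.endswith("output_projection.weight"):
        state[key] = state[key][:KEEP].clone()

torch.save(ckpt, "bart.base/model.truncated.pt")
```

The trailing madeupword lines in bart.base/dict.txt would also need to be dropped so the dictionary size matches the truncated matrices.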

@sshleifer
Contributor Author

Maybe I should leave them? What is the efficiency gain?
