Slow Tokenizer if custom vocab was added #157
Comments
There are new fast tokenizers for BERT implemented in Rust: huggingface/transformers#2211. We should check whether they are compatible and whether they solve the custom-vocab issue here.
Yes, please check it out and let us know here. Repo is at https://github.com/huggingface/tokenizers
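A minimal usage sketch of that Rust-backed library (assumed usage; "vocab.txt" is an illustrative path to a BERT WordPiece vocab file):

```python
from tokenizers import BertWordPieceTokenizer

# Load a WordPiece vocab into the Rust-backed tokenizer.
tokenizer = BertWordPieceTokenizer("vocab.txt", lowercase=True)

encoding = tokenizer.encode("Custom vocab should not slow this down.")
print(encoding.tokens)  # WordPiece tokens
print(encoding.ids)     # corresponding vocab ids
```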
Hey @julien-c, I finally found some time to test. Seems really promising! Great work!

✔️ Same tokenization behaviour as BertTokenizer (see test)

The only blocker for us right now:
In the
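A sketch of the kind of equivalence test referenced above (assumed; both tokenizers are built from the same illustrative "vocab.txt"):

```python
from transformers import BertTokenizer
from tokenizers import BertWordPieceTokenizer

slow = BertTokenizer("vocab.txt", do_lower_case=True)
fast = BertWordPieceTokenizer("vocab.txt", lowercase=True)

text = "The quick brown fox jumps over the lazy dog."
fast_tokens = fast.encode(text).tokens
# The fast tokenizer adds [CLS]/[SEP] by default; strip them for comparison.
fast_tokens = [t for t in fast_tokens if t not in ("[CLS]", "[SEP]")]
assert slow.tokenize(text) == fast_tokens
```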
Hi @salmanmashayekh, sorry for the late reply. I hadn't seen your comment until now.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs.
Hi,
and then profiling the speed in Jupyter:
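A minimal sketch of such a profiling setup (assumed; judging by the reply below the original used a GPT-2 tokenizer, and vocab size and text are illustrative):

```python
import time
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_tokens([f"newtok{i}" for i in range(20000)])  # large custom vocab

text = "Profiling tokenization speed after adding a custom vocab. " * 10

start = time.perf_counter()
for _ in range(10):
    tokenizer.encode(text)
print(f"{time.perf_counter() - start:.2f}s for 10 encode calls")
# In Jupyter, `%timeit tokenizer.encode(text)` gives the same picture.
```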
Hey, I guess this is currently a huggingface transformers issue, since FARM does not support GPT2 as of now. I also do not think it is related to FARM, since our implementation of tokenizers just calls the underlying HF transformers implementations.
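A sketch of that delegation (assumed API shape; see farm.modeling.tokenization in the FARM repo for the actual signature):

```python
from farm.modeling.tokenization import Tokenizer

# Tokenizer.load infers the model type from the name and returns a plain
# transformers tokenizer (e.g. BertTokenizer), so any slowdown from custom
# vocab comes from the underlying transformers implementation.
tokenizer = Tokenizer.load(pretrained_model_name_or_path="bert-base-cased")
print(type(tokenizer))
```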
Describe the bug
The tokenizer becomes very slow with a large custom vocab.

Additional context
This was introduced after switching to the tokenizers from the transformers repo. There are related issues reported in the transformers repo:
To Reproduce
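A minimal reproduction sketch (assumed setup; vocab size and text are illustrative): time tokenization before and after adding a large custom vocab.

```python
import time
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "This sentence is used to measure tokenization speed. " * 20

start = time.perf_counter()
for _ in range(100):
    tokenizer.tokenize(text)
baseline = time.perf_counter() - start

# Add a large custom vocab; token strings are illustrative.
tokenizer.add_tokens([f"customtok{i}" for i in range(10000)])

start = time.perf_counter()
for _ in range(100):
    tokenizer.tokenize(text)
with_custom = time.perf_counter() - start

print(f"baseline: {baseline:.2f}s, with custom vocab: {with_custom:.2f}s")
```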
System: