Huggingface Tokenisers Integration #38

Closed
eiennohito opened this issue Sep 26, 2021 · 1 comment · Fixed by #167
Labels
python Python binding-related
Milestone

Comments

@eiennohito
Collaborator

eiennohito commented Sep 26, 2021

Probably the easiest way to integrate would be to use the CustomPreTokenizer hook in the Python bindings of HuggingFace Tokenizers.
https://github.com/huggingface/tokenizers/blob/master/bindings/python/src/pre_tokenizers.rs#L547

Good things:

  • Python will handle dynamic library loading for us

Bad things:

  • Some performance overhead from Python method calls
  • Some performance overhead from string copying

Anything else would require significant additions to the HuggingFace Tokenizers framework.
Let's prototype this integration first, and switch to another approach if its overhead turns out to be too high.

Plan:

Add a create_pretokenizer() method to the Dictionary class (Python binding) that returns an object compatible with HuggingFace Tokenizers.

Depends on #166
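
For illustration, here is a minimal sketch of the kind of object create_pretokenizer() might return. It assumes the custom pre-tokenizer protocol of the HuggingFace tokenizers Python package: an object exposing pre_tokenize(pretok), which calls pretok.split(...) with a callback that returns slices of the incoming NormalizedString. The names SegmenterPreTokenizer, whitespace_spans and attach are hypothetical placeholders and are not part of this project's bindings; a real implementation would call into the dictionary's analyzer instead of the regex stand-in.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def whitespace_spans(text: str) -> List[Span]:
    """Hypothetical stand-in for a real analyzer: spans of non-space runs."""
    return [m.span() for m in re.finditer(r"\S+", text)]


class SegmenterPreTokenizer:
    """Hypothetical adapter, not the project's actual implementation."""

    def __init__(self, segment: Callable[[str], List[Span]]):
        self._segment = segment

    def _split(self, index, normalized):
        # This is where the overhead listed above shows up:
        # one Python call and one string copy per input piece.
        text = str(normalized)
        # Slicing keeps the result tied to the original object, so offsets
        # can still be tracked when `normalized` is a NormalizedString.
        return [normalized[start:end] for start, end in self._segment(text)]

    def pre_tokenize(self, pretok):
        # Assumed protocol: `pretok` is a tokenizers.PreTokenizedString.
        pretok.split(self._split)


def attach(tokenizer, segment=whitespace_spans):
    """Install the adapter on a tokenizers.Tokenizer (imported lazily)."""
    from tokenizers.pre_tokenizers import PreTokenizer

    tokenizer.pre_tokenizer = PreTokenizer.custom(SegmenterPreTokenizer(segment))
    return tokenizer
```

Because the callback only needs str() and slicing on its argument, the splitting logic can be exercised with a plain Python string before wiring it into a real tokenizer.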

@eiennohito eiennohito added this to the 0.7 milestone Oct 20, 2021
@eiennohito eiennohito modified the milestones: 0.7, 0.6.1 Nov 11, 2021
@eiennohito eiennohito added the python Python binding-related label Nov 12, 2021
@eiennohito
Collaborator Author

HuggingFace seems to reuse a single pretokenizer instance across several threads. The implementation should be able to deal with that, e.g. by creating a thread_local real instance internally.
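
A rough sketch of the thread-local idea, assuming the real per-thread object can be built by a zero-argument factory and exposes a pre_tokenize(pretok) method. ThreadLocalPreTokenizer and factory are hypothetical names used only for illustration; the actual binding may handle this differently, for example on the native side.

```python
import threading


class ThreadLocalPreTokenizer:
    """Shared facade that lazily creates one real instance per thread."""

    def __init__(self, factory):
        # `factory` is a hypothetical zero-argument callable that builds
        # a real (not thread-safe) pretokenizer instance.
        self._factory = factory
        self._local = threading.local()

    def _instance(self):
        # Attributes on threading.local() are visible only to the thread
        # that set them, so each thread ends up with its own instance.
        instance = getattr(self._local, "instance", None)
        if instance is None:
            instance = self._factory()
            self._local.instance = instance
        return instance

    def pre_tokenize(self, pretok):
        # Delegate to the calling thread's private instance.
        self._instance().pre_tokenize(pretok)
```

The facade itself holds no mutable analysis state, so handing the same object to several threads is safe as long as the factory is.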
