Anything else would require significant additions to the HuggingFace tokenizers framework.
Let's prototype this integration and switch to another approach if its overhead turns out to be too high.
Plan:
Add a create_pretokenizer() method to the Dictionary class (Python binding) that returns an object compatible with HuggingFace.
HuggingFace seems to reuse a single pretokenizer instance across several threads, so the implementation should be able to deal with that, e.g. by creating a thread_local real instance internally.
Probably the easiest way to integrate would be to use the CustomPreTokenizer hook in the Python bindings.
https://github.com/huggingface/tokenizers/blob/master/bindings/python/src/pre_tokenizers.rs#L547
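A minimal sketch of the idea, not the actual implementation: one wrapper object is shared with HuggingFace, and it lazily creates a separate real segmenter for each thread via threading.local. The names ThreadLocalPreTokenizer, make_segmenter and make_whitespace_segmenter are hypothetical, and the whitespace segmenter is only a stand-in for whatever the Dictionary binding would really return. The pre_tokenize(pretok) / pretok.split(callback) shape follows the custom pre-tokenizer example shipped with the tokenizers Python bindings, where the callback receives (index, NormalizedString) and returns a list of slices.

```python
import threading
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, stop) character offsets
Segmenter = Callable[[str], List[Span]]


class ThreadLocalPreTokenizer:
    """Hypothetical wrapper: one shared object, one real segmenter per thread."""

    def __init__(self, make_segmenter: Callable[[], Segmenter]):
        # Assumption: make_segmenter builds a fresh, non-thread-safe instance.
        self._make_segmenter = make_segmenter
        self._local = threading.local()

    def _segmenter(self) -> Segmenter:
        # Create the real instance on first use in the current thread.
        seg = getattr(self._local, "segmenter", None)
        if seg is None:
            seg = self._make_segmenter()
            self._local.segmenter = seg
        return seg

    def _split(self, i, normalized_string):
        # Slice the incoming string by the spans the segmenter reports.
        spans = self._segmenter()(str(normalized_string))
        return [normalized_string[start:stop] for start, stop in spans]

    def pre_tokenize(self, pretok):
        # Entry point invoked by the custom pre-tokenizer hook.
        pretok.split(self._split)


def make_whitespace_segmenter() -> Segmenter:
    """Stand-in for the real dictionary-based segmenter (assumption)."""

    def segment(text: str) -> List[Span]:
        spans, start = [], None
        for pos, ch in enumerate(text):
            if ch.isspace():
                if start is not None:
                    spans.append((start, pos))
                    start = None
            elif start is None:
                start = pos
        if start is not None:
            spans.append((start, len(text)))
        return spans

    return segment
```

With tokenizers installed, the wrapper would presumably be attached with tokenizer.pre_tokenizer = PreTokenizer.custom(ThreadLocalPreTokenizer(make_whitespace_segmenter)). Every call then crosses the Rust/Python boundary, which is the overhead the prototype is meant to measure.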
Good things:
Bad things:
Depends on #166