Add HuggingFace-compatible PreTokenizer #167

Merged 7 commits on Nov 26, 2021

eiennohito
Collaborator

@eiennohito eiennohito commented Nov 15, 2021

Depends on PR #176 (includes its comments)

Fixes #38, fixes #166, fixes #168

The pre-tokenizer is created with Dictionary.pre_tokenizer([mode]); the default split mode is C.

The implementation can be used from multiple threads, with a single PreTokenizer instance shared between all of them. It creates a thread-local Sudachi tokenizer for each thread and uses it to perform the actual tokenization. It also releases the GIL during analysis, so it should achieve some speedup when used from multiple threads.
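
To illustrate the sharing pattern described above, here is a minimal pure-Python sketch. It is not the actual implementation from this PR: `WhitespaceTokenizer` and `SharedPreTokenizer` are hypothetical stand-ins invented for the example. It only shows how one shared object can lazily create a per-thread tokenizer; it does not reproduce the GIL release, which happens in the native code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class WhitespaceTokenizer:
    """Hypothetical stand-in for a Sudachi tokenizer (assumed not thread-safe)."""

    def tokenize(self, text):
        return text.split()


class SharedPreTokenizer:
    """Hypothetical wrapper: one instance is shared, tokenizers are per-thread."""

    def __init__(self, tokenizer_factory):
        self._factory = tokenizer_factory
        self._local = threading.local()  # separate attribute storage per thread

    def _tokenizer(self):
        # Create the tokenizer on first use in the calling thread, then reuse it.
        tok = getattr(self._local, "tokenizer", None)
        if tok is None:
            tok = self._factory()
            self._local.tokenizer = tok
        return tok

    def pre_tokenize(self, text):
        return self._tokenizer().tokenize(text)


# A single instance is handed to every worker thread.
pre_tokenizer = SharedPreTokenizer(WhitespaceTokenizer)
texts = ["a b c", "d e", "f"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(pre_tokenizer.pre_tokenize, texts))
```

Because each thread only ever touches its own tokenizer, no locking is needed around the tokenization call itself.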

@eiennohito eiennohito added the python Python binding-related label Nov 15, 2021
@eiennohito eiennohito added this to the 0.6.1 milestone Nov 15, 2021
@eiennohito
Collaborator Author

Added custom handler

@eiennohito
Collaborator Author

Rebased on the develop branch.

@eiennohito eiennohito merged commit 6a2d20d into develop Nov 26, 2021
@eiennohito eiennohito deleted the 38-pretokenizer branch November 26, 2021 01:52