Spacy Tokenizer Boundary Issue. #7592
You can get what you want like this:
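The original snippet isn't preserved above, but based on the pointer to tokenizer exceptions, a minimal sketch of the idea, using spaCy's documented `add_special_case` and an illustrative sentence, would be something like:

```python
import spacy

nlp = spacy.blank("en")

# add a tokenizer exception so that the exact string "K." is split into two tokens
nlp.tokenizer.add_special_case("K.", [{"ORTH": "K"}, {"ORTH": "."}])

doc = nlp("The last letter is K.")
print([token.text for token in doc])
# ['The', 'last', 'letter', 'is', 'K', '.']
```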
For more details, see the docs on tokenizer exceptions. Ah wait, this doesn't work as-is with Stanza - let me see how to apply it. It should work since you're using the spaCy tokenizer.
Hm, so that was more complicated than I expected. My example above works with the standard spaCy tokenizer, but it turns out the Stanza tokenizer has a somewhat different implementation and doesn't have the hooks for tokenizer exceptions. So what you can do is replace the tokenizer in your pipeline with a standard spaCy tokenizer, like below.
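The code block itself isn't preserved here; a rough sketch of what swapping in a standard English tokenizer could look like, assuming the spacy-stanza `load_pipeline` API and building the tokenizer on the pipeline's own vocab from the English defaults (both are assumptions about the setup), is:

```python
import spacy_stanza
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

nlp = spacy_stanza.load_pipeline("en")  # assumes the English Stanza models are downloaded

# build a standard English tokenizer on the pipeline's own vocab and swap it in
defaults = English.Defaults
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=defaults.tokenizer_exceptions,
    prefix_search=compile_prefix_regex(defaults.prefixes).search,
    suffix_search=compile_suffix_regex(defaults.suffixes).search,
    infix_finditer=compile_infix_regex(defaults.infixes).finditer,
    token_match=defaults.token_match,
    url_match=defaults.url_match,
)

# the tokenizer-exception trick from above can then be applied to this tokenizer
nlp.tokenizer.add_special_case("K.", [{"ORTH": "K"}, {"ORTH": "."}])
```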
That said, you need to check your performance with this setup - I think it will make tokenization different from what the Stanza models were trained with, which could affect performance, though it seems likely the overall changes will be minor enough that it shouldn't matter.
Note that the example above replaces the tokenizer entirely. If you really want customized spaCy tokenization together with the Stanza pipeline, then you'll have to provide pretokenized texts (whitespace tokenization) instead.
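For reference, a minimal sketch of the pretokenized route with plain Stanza, using its documented `tokenize_pretokenized` option (presumably what this comment refers to; the example text is illustrative). If you go through spacy-stanza, the same keyword argument can presumably be forwarded by `load_pipeline`:

```python
import stanza

# stanza.download("en")  # the models need to be downloaded once

# with tokenize_pretokenized=True, Stanza treats the input as already tokenized:
# tokens are separated by whitespace and sentences by newlines
nlp = stanza.Pipeline("en", processors="tokenize,pos", tokenize_pretokenized=True)

doc = nlp("The last letter is K .")
print([(word.text, word.upos) for sent in doc.sentences for word in sent.words])
```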
Ah, I think you could customize the tokenizer directly. It's a little buried, but it looks like you can access it like this to modify the suffixes: `nlp.tokenizer.snlp.processors["tokenize"]._variant.nlp.tokenizer`
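As a concrete sketch of that suggestion (the attribute path is private and may change between versions; the extra suffix rule and the assumption that `nlp` is the spacy-stanza pipeline from the question are illustrative):

```python
from spacy.util import compile_suffix_regex

# reach the spaCy pipeline buried inside the Stanza "tokenize" processor
# (these are private attributes, so the path may change between versions)
inner_nlp = nlp.tokenizer.snlp.processors["tokenize"]._variant.nlp

# add a suffix rule that also splits "." after a single uppercase letter
suffixes = list(inner_nlp.Defaults.suffixes) + [r"(?<=[A-Z])\."]
inner_nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
```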
I am using the spaCy tokenizer within a Stanza pipeline. In some of the sentences, the spaCy tokenizer does not tokenize the sentence-ending point '.' as a separate token, which in my case is needed.
Here is my code:
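The original snippet isn't preserved here; a simplified setup along these lines, assuming spacy-stanza with Stanza's spaCy tokenizer variant and an English pipeline (both assumptions), would be something like:

```python
import stanza
import spacy_stanza

# stanza.download("en")  # the models need to be downloaded once

# ask Stanza to use the spaCy tokenizer, wrapped as a spaCy pipeline
nlp = spacy_stanza.load_pipeline("en", processors={"tokenize": "spacy"})

doc = nlp("The last letter is K.")
print([token.text for token in doc])
```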
The result is that the final 'K.' stays together as a single token.
I want the last two tokens to be 'K' and '.'.
Can I do that?