Offset misalignment in NER StanzaLanguage Tokenizer #33
I encounter the same problem using the German language model. When the input text contains synaereses (contractions), the model replaces them with the two originating words but apparently doesn't update the input's character offsets. The input
While everything is fine as long as the synaeresis
This seems to be a problem with the spaCy wrapper since the
It seems that the German model does not have the issues with special characters/punctuation that @aishwarya-agrawal encountered.
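To illustrate the effect described above, here is a minimal sketch in plain Python that uses neither Stanza nor spaCy. It assumes a hypothetical tokenizer output in which the contraction "zum" has been expanded to "zu" and "dem"; the helper name align_tokens and the example sentence are made up for illustration and are not taken from the wrapper's code.

```python
def align_tokens(text, words):
    """Map each word to its (start, end) character offsets in text.

    Searches left to right, starting after the previous match.
    A word that cannot be found gets None instead of an offset pair.
    """
    offsets = []
    cursor = 0
    for word in words:
        start = text.find(word, cursor)
        if start == -1:
            # The word does not occur in the remaining text,
            # e.g. because it was produced by expanding a contraction.
            offsets.append(None)
            continue
        end = start + len(word)
        offsets.append((start, end))
        cursor = end
    return offsets


text = "Ich gehe zum Bahnhof."

# Hypothetical output that expands "zum" into "zu" + "dem".
expanded = ["Ich", "gehe", "zu", "dem", "Bahnhof", "."]
# Hypothetical output that keeps the surface form.
surface = ["Ich", "gehe", "zum", "Bahnhof", "."]

expanded_offsets = align_tokens(text, expanded)
surface_offsets = align_tokens(text, surface)

print(expanded_offsets)
print(surface_offsets)
```

In this sketch the expanded word "dem" has no position in the original string, so any character offsets computed from the expanded words (for example entity spans) can no longer be mapped back reliably, while the surface tokens align without gaps.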
@redadmiral Please try this. Note the space at the beginning of the sentence.
Oh, okay – this leads to the same warning you encountered:
Gives the output:
Printing the two texts, i.e. snlp_doc.text and doc.text, gives the following:
Because the two texts differ, the error above is raised and the identified entities are lost.
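A comparison of snlp_doc.text and doc.text like the one described above can be narrowed down to the first position where the two strings diverge. The following is only a sketch in plain Python: the function name first_mismatch and the two example strings (one with a leading space, as mentioned elsewhere in this thread) are assumptions, not the actual texts from this report.

```python
def first_mismatch(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (char_a, char_b) in enumerate(zip(a, b)):
        if char_a != char_b:
            return i
    if len(a) != len(b):
        # One string is a prefix of the other.
        return min(len(a), len(b))
    return None


# Made-up stand-ins for snlp_doc.text and doc.text.
original_text = " Hello world."  # note the leading space
rebuilt_text = "Hello world."

position = first_mismatch(original_text, rebuilt_text)
if position is not None:
    print("Texts diverge at character", position)
    print(repr(original_text[position:position + 10]))
    print(repr(rebuilt_text[position:position + 10]))
```

Printing the strings with repr makes whitespace differences visible, which plain print output tends to hide.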
This happens even with the basic configuration mentioned in the README: