If there are multiple whitespace characters between tokens, the Tokenizer raises a warning and the entity is not extracted. It looks like stanza does not treat the extra whitespace as a token, whereas spaCy does.
import stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang='en')
nlp = StanzaLanguage(snlp)
text = "There are two spaces between these words"
doc = nlp(text)
>>> UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['There', 'are', 'two', 'spaces', 'between', 'these', 'words']
Entities: [('two', 'CARDINAL', 12, 15)]
print(len(doc.ents))
>>> 0
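To illustrate why the offsets in the warning might fail to line up, here is a minimal pure-Python sketch. It assumes (this is not confirmed in the report) that the token boundaries are effectively computed as if the words were separated by exactly one space, while the entity offsets refer to the original string. The helper names `token_boundaries` and `aligns` are hypothetical, and so is the input string: the exact position of the extra whitespace in the original text is not recoverable, so three spaces before "two" are assumed because that places the entity at offsets 12 to 15 as in the warning.

```python
def token_boundaries(words):
    """Return (start, end) offsets of each word as if the words were
    joined with exactly one space (assumed behaviour, see lead-in)."""
    spans, pos = [], 0
    for word in words:
        spans.append((pos, pos + len(word)))
        pos += len(word) + 1  # skip the single separating space
    return spans


def aligns(start, end, spans):
    """True if [start, end) begins and ends on token boundaries."""
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    return start in starts and end in ends


# Hypothetical input: three spaces before "two" (an assumption).
original = "There are   two spaces between these words"
words = original.split()
spans = token_boundaries(words)

# Entity offsets measured against the original string.
ent_start = original.index("two")
ent_end = ent_start + len("two")
misaligned = not aligns(ent_start, ent_end, spans)

# Offsets measured against a whitespace-normalized string do align.
normalized = " ".join(words)
norm_start = normalized.index("two")
norm_ok = aligns(norm_start, norm_start + len("two"), spans)

print(ent_start, ent_end, misaligned, norm_ok)
```

Under these assumptions, the entity offsets taken from the original string fall inside or between tokens, while the same entity located in a whitespace-normalized copy of the text lands on token boundaries.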