Extra spaces causes token mis-alignment #30

maxmealy · 2020-03-27T18:21:41Z

If there are multiple whitespace characters between tokens, the tokenizer raises a warning and the entity is not extracted. It looks like stanza does not treat the extra whitespace as a token, whereas spaCy does.

import stanza
from spacy_stanza import StanzaLanguage
snlp = stanza.Pipeline(lang='en')
nlp = StanzaLanguage(snlp)
text = "There  are  two  spaces  between  these  words"
doc = nlp(text)
# UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
# Words: ['There', 'are', 'two', 'spaces', 'between', 'these', 'words']
# Entities: [('two', 'CARDINAL', 12, 15)]
print(len(doc.ents))  # prints 0
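
To illustrate why the offsets might fail to line up, here is a minimal sketch in plain Python. It does not use stanza or spaCy internals; it only assumes that the token offsets end up being computed as if the words were separated by a single space, which is one plausible way to get the mismatch reported in the warning. The helper names `offsets_single_space` and `offsets_in_text` are hypothetical and exist only for this example.

```python
def offsets_single_space(words):
    """Character spans as if words were joined by exactly one space.

    Assumption: this models the alignment that loses the extra whitespace.
    """
    spans, pos = [], 0
    for word in words:
        spans.append((pos, pos + len(word)))
        pos += len(word) + 1  # one separator character per gap
    return spans


def offsets_in_text(text, words):
    """Character spans found by scanning the original text left to right,
    so any run of whitespace between words is skipped, not assumed."""
    spans, pos = [], 0
    for word in words:
        start = text.index(word, pos)
        spans.append((start, start + len(word)))
        pos = start + len(word)
    return spans


text = "There  are  two  spaces  between  these  words"
words = text.split()  # same word list as in the warning
entity = (12, 15)     # span of 'two' reported in the warning

naive = offsets_single_space(words)
actual = offsets_in_text(text, words)

print(entity in naive)              # False: 'two' would sit at (10, 13)
print(entity in actual)             # True
print(words[actual.index(entity)])  # two
```

Under that assumption, the entity span only maps to a token when the offsets are taken from the original text, which suggests the alignment needs to account for the extra whitespace rather than drop it.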

adrianeboyd mentioned this issue Jun 25, 2020

Rewrite alignment to preserve whitespace tokens #41

Merged

ines closed this as completed in #41 Jun 26, 2020
