
Assertion error in make_doc if spacy tokenizer is used in stanza and text contains a newline #43

Closed
jsalbr opened this issue Jul 4, 2020 · 2 comments


jsalbr commented Jul 4, 2020

Hi Ines, sorry to report that I ran into a small bug; I was at least able to track down the symptoms.
The problem occurs when I use stanza with the spacy tokenizer and my text contains a newline.
The obvious workaround is: text = re.sub(r'\s+', ' ', text)

# spacy.__version__ # 2.3.0
# stanza.__version__ # 1.0.1
# spacy_stanza.__version__ # 0.2.3

text = "The FHLBB was insolvent and its\nassets were transferred. "
# works if \n in text is replaced by a space

import spacy, stanza, spacy_stanza
from spacy_stanza import StanzaLanguage

# stanza nlp works fine
stanza_nlp = stanza.Pipeline('en', processors={'tokenize': 'spacy'})
doc = stanza_nlp(text)

# spacy-stanza throws an assertion error
spacy_stanza_nlp = StanzaLanguage(stanza_nlp)
doc = spacy_stanza_nlp.make_doc(text)

Here's the trace:

---------------------------------------------------------------------------
AssertionError                  Traceback (most recent call last)
---> 22 doc = spacy_stanza_nlp.make_doc(text)

.../spacy-stanza/spacy_stanza/language.py in make_doc(self, text)
     65         these will be mapped to token vectors.
     66         """
---> 67         doc = self.tokenizer(text)
     68         if self.svecs is not None:
     69             doc.user_token_hooks["vector"] = self.token_vector

.../spacy-stanza/spacy_stanza/language.py in __call__(self, text)
    193             else:
    194                 token = snlp_tokens[i + offset]
--> 195                 assert word == token.text
    196 
    197                 pos.append(self.vocab.strings.add(token.upos or ""))

AssertionError: 
adrianeboyd (Contributor) commented

Thanks for the report! That is indeed a bug in the updated alignment code, which mistakenly assumed that you wouldn't get whitespace tokens back from the stanza models (since I wasn't aware of the extra spacy tokenizer option). The fix in #44 should address this and will be in the next release (v0.2.4). If you want to install from source in the meantime (wait until after the PR is merged!), you can run:

pip install https://github.com/explosion/spacy-stanza/archive/master.zip

Note that the stanza models can't really handle the whitespace token, though. I get the token analysis:

{
  "id": "7",
  "text": "\n",
  "lemma": "\n",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Number=Sing",
  "head": 8,
  "deprel": "compound",
  "misc": "start_char=31|end_char=32"
},
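(For reference, one way to pull this analysis out yourself; just a sketch, assuming the stanza_nlp pipeline from the repro above, using stanza's Document.to_dict():

# Sketch (not from the original report): print stanza's analysis
# for any whitespace-only token in the document.
stanza_doc = stanza_nlp(text)
for sentence in stanza_doc.to_dict():  # one list of word dicts per sentence
    for word in sentence:
        if word["text"].isspace():     # catches the stray "\n" token
            print(word)
)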

So you might want to consider replacing extra whitespace with single spaces anyway, at least with the provided stanza models.
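Something like this would do it (just a sketch of the workaround already mentioned above; the helper name is mine):

import re

def normalize_whitespace(text):
    # Collapse any run of whitespace (newlines included) into a single space
    # so the stanza models never see a standalone whitespace token.
    return re.sub(r"\s+", " ", text).strip()

doc = spacy_stanza_nlp.make_doc(normalize_whitespace(text))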


jsalbr commented Jul 6, 2020

Thanks, Adriane, for the fix. The workaround is simple once you've found the cause ;-)
