Hi Ines, sorry to report that I ran into a small bug; I could at least track down the symptoms.
The problem occurs when I use stanza with the spacy tokenizer and the text contains a newline.
The obvious workaround is to normalize the whitespace first: text = re.sub(r'\s+', ' ', text)
# spacy.__version__        # 2.3.0
# stanza.__version__       # 1.0.1
# spacy_stanza.__version__ # 0.2.3
import spacy, stanza, spacy_stanza
from spacy_stanza import StanzaLanguage

text = "The FHLBB was insolvent and its\nassets were transferred. "
# works if \n in text is replaced by a space

# stanza nlp works fine
stanza_nlp = stanza.Pipeline('en', processors={'tokenize': 'spacy'})
doc = stanza_nlp(text)

# spacy stanza throws assertion
spacy_stanza_nlp = StanzaLanguage(stanza_nlp)
doc = spacy_stanza_nlp.make_doc(text)
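
For reference, a minimal sketch of the workaround applied to the snippet above (only the whitespace normalization is new; everything else reuses the variables already defined):

import re

# collapse every whitespace run (including the newline) into a single space
clean_text = re.sub(r'\s+', ' ', text)

# with the newline gone, make_doc no longer hits the assertion
doc = spacy_stanza_nlp.make_doc(clean_text)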
Here is the trace:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
---> 22 doc = spacy_stanza_nlp.make_doc(text)
.../spacy-stanza/spacy_stanza/language.py in make_doc(self, text)
65 these will be mapped to token vectors.
66 """
---> 67 doc = self.tokenizer(text)
68 if self.svecs is not None:
69 doc.user_token_hooks["vector"] = self.token_vector
.../spacy-stanza/spacy_stanza/language.py in __call__(self, text)
193 else:
194 token = snlp_tokens[i + offset]
--> 195 assert word == token.text
196
197 pos.append(self.vocab.strings.add(token.upos or ""))
AssertionError:
Thanks for the report! That is indeed a bug in the updated alignment code, which mistakenly assumed that you wouldn't get whitespace tokens back from the stanza models (since I wasn't aware of the extra spacy tokenizer option). The fix in #44 should address this and it will be in the next release (v0.2.4). If you want to install from source in the meanwhile (wait until after the PR is merged!), you can run:
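
The exact command isn't shown above; a typical install-from-source invocation (assuming the explosion/spacy-stanza repository on GitHub and the master branch) would look something like:

pip install https://github.com/explosion/spacy-stanza/archive/master.zip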