
Assertion error in make_doc if spacy tokenizer is used in stanza and text contains a newline #43

Closed
jsalbr opened this issue Jul 4, 2020 · 2 comments


jsalbr commented Jul 4, 2020

Hi Ines, sorry to report that I ran into a small bug; I was at least able to track down the symptoms.
The problem occurs when I use stanza with the spacy tokenizer and my text contains a newline.
The obvious workaround is: text = re.sub(r'\s+', ' ', text)

# spacy.__version__ # 2.3.0
# stanza.__version__ # 1.0.1
# spacy_stanza.__version__ # 0.2.3

text = "The FHLBB was insolvent and its\nassets were transferred. "
# works if \n in text is replaced by a space

import spacy, stanza, spacy_stanza
from spacy_stanza import StanzaLanguage

# stanza nlp works fine
stanza_nlp = stanza.Pipeline('en', processors={'tokenize': 'spacy'})
doc = stanza_nlp(text)

# spacy-stanza throws an assertion error
spacy_stanza_nlp = StanzaLanguage(stanza_nlp)
doc = spacy_stanza_nlp.make_doc(text)

Here's the trace:

---------------------------------------------------------------------------
AssertionError                  Traceback (most recent call last)
---> 22 doc = spacy_stanza_nlp.make_doc(text)

.../spacy-stanza/spacy_stanza/language.py in make_doc(self, text)
     65         these will be mapped to token vectors.
     66         """
---> 67         doc = self.tokenizer(text)
     68         if self.svecs is not None:
     69             doc.user_token_hooks["vector"] = self.token_vector

.../spacy-stanza/spacy_stanza/language.py in __call__(self, text)
    193             else:
    194                 token = snlp_tokens[i + offset]
--> 195                 assert word == token.text
    196 
    197                 pos.append(self.vocab.strings.add(token.upos or ""))

AssertionError: 
adrianeboyd (Contributor) commented

Thanks for the report! That is indeed a bug in the updated alignment code, which mistakenly assumed that you wouldn't get whitespace tokens back from the stanza models (since I wasn't aware of the extra spacy tokenizer option). The fix in #44 should address this and will be in the next release (v0.2.4). If you want to install from source in the meantime (wait until after the PR is merged!), you can run:

pip install https://github.com/explosion/spacy-stanza/archive/master.zip

Note that the stanza models can't really handle the whitespace token, though. I get the token analysis:

{
  "id": "7",
  "text": "\n",
  "lemma": "\n",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Number=Sing",
  "head": 8,
  "deprel": "compound",
  "misc": "start_char=31|end_char=32"
},
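(For reference, one way to pull this analysis out yourself; just a sketch, assuming the stanza_nlp pipeline from the repro above, using stanza's Document.to_dict():

# Sketch (not from the original report): print stanza's analysis
# for any whitespace-only token in the document.
stanza_doc = stanza_nlp(text)
for sentence in stanza_doc.to_dict():  # one list of word dicts per sentence
    for word in sentence:
        if word["text"].isspace():     # catches the stray "\n" token
            print(word)
)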

So you might want to consider replacing extra whitespace with single spaces anyway, at least with the provided stanza models.
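Something like this would do it (just a sketch of the workaround already mentioned above; the helper name is mine):

import re

def normalize_whitespace(text):
    # Collapse any run of whitespace (newlines included) into a single space
    # so the stanza models never see a standalone whitespace token.
    return re.sub(r"\s+", " ", text).strip()

doc = spacy_stanza_nlp.make_doc(normalize_whitespace(text))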


jsalbr commented Jul 6, 2020

Thanks, Adriane, for the fix. The workaround is simple once you've found the cause ;-)
