from __future__ import unicode_literals
import en_core_web_sm
import re
nlp = en_core_web_sm.load()
TEST_CODE = nlp.vocab.add_flag(re.compile('abc[0-9]+').match)
doc = nlp('abc123')
When running this code, spaCy fails with the following error:
Traceback (most recent call last):
  File "untitled.py", line 9, in <module>
    doc = nlp('123a')
  File "/home/.../venv2b/lib/python3.6/site-packages/spacy/language.py", line 337, in __call__
    doc = self.make_doc(text)
  File "/home.../venv2b/lib/python3.6/site-packages/spacy/language.py", line 365, in make_doc
    return self.tokenizer(text)
  File "tokenizer.pyx", line 120, in spacy.tokenizer.Tokenizer.__call__
  File "tokenizer.pyx", line 161, in spacy.tokenizer.Tokenizer._tokenize
  File "tokenizer.pyx", line 240, in spacy.tokenizer.Tokenizer._attach_tokens
  File "vocab.pyx", line 134, in spacy.vocab.Vocab.get
  File "vocab.pyx", line 170, in spacy.vocab.Vocab._new_lexeme
TypeError: an integer is required
The error is only triggered when the regular expression matches both letters and digits. A pattern that matches only one or the other passes without issues.
Thanks for the report. I don't fully understand this yet, but it looks to me like the error comes from the compiled pattern's match method returning a Python object (a match object or None), when the Vocab expects the function to return a boolean. The following should work:
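A minimal sketch of that kind of fix, assuming the flag getter only needs to return a plain bool. The name `is_test_code` is a hypothetical helper, and the registration call from the report is left commented out so the snippet runs without a model installed:

```python
import re

# Pattern taken from the report above.
pattern = re.compile('abc[0-9]+')


def is_test_code(text):
    """Hypothetical flag getter that always returns a real bool."""
    # pattern.match returns a match object or None;
    # bool() collapses that to True or False.
    return bool(pattern.match(text))


# Registration as in the report (needs a loaded pipeline, so commented out):
# TEST_CODE = nlp.vocab.add_flag(is_test_code)

print(is_test_code('abc123'))
print(is_test_code('abc'))
```

The wrapper keeps the regular expression unchanged and only changes the return type, so the same pattern can be reused elsewhere.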
I suspect the match over letters and numbers is a red herring, and what matters is whether the token has been seen before. I'm still a bit confused though.
Thank you @honnibal. I was following this example (IS_DEFINITELY), but I should have read the documentation for add_flag more closely. I modified your code example a little bit.