Rule based matching breaks when using a regular expression that includes letters and numbers #1950

Closed
jjtapia opened this issue Feb 7, 2018 · 4 comments
Labels
bug Bugs and behaviour differing from documentation

jjtapia commented Feb 7, 2018

Consider the following code:

from __future__ import unicode_literals
import en_core_web_sm
import re


nlp = en_core_web_sm.load()

TEST_CODE = nlp.vocab.add_flag(re.compile('abc[0-9]+').match)
doc = nlp('abc123')

Running this code makes spaCy fail with the following error:

Traceback (most recent call last):
  File "untitled.py", line 9, in <module>
    doc = nlp('123a')
  File "/home/.../venv2b/lib/python3.6/site-packages/spacy/language.py", line 337, in __call__
    doc = self.make_doc(text)
  File "/home.../venv2b/lib/python3.6/site-packages/spacy/language.py", line 365, in make_doc
    return self.tokenizer(text)
  File "tokenizer.pyx", line 120, in spacy.tokenizer.Tokenizer.__call__
  File "tokenizer.pyx", line 161, in spacy.tokenizer.Tokenizer._tokenize
  File "tokenizer.pyx", line 240, in spacy.tokenizer.Tokenizer._attach_tokens
  File "vocab.pyx", line 134, in spacy.vocab.Vocab.get
  File "vocab.pyx", line 170, in spacy.vocab.Vocab._new_lexeme
TypeError: an integer is required

The error is only triggered when the regular expression contains both letters and numbers. A pattern containing only one or the other runs without issues.

Info about spaCy

  • spaCy version: 2.0.7
  • Platform: Linux-4.4.0-21-generic-x86_64-with-LinuxMint-18-sarah
  • Python version: 3.6.1
Member

honnibal commented Feb 8, 2018

Thanks for the report. I'm still working through the details, but it looks to me like the error comes from the pattern's match() returning a match object (or None), when the Vocab is expecting the function to return a boolean. The following should work:

TEST_CODE = nlp.vocab.add_flag(lambda string: bool(re.compile('abc[0-9]+').match))

I suspect the match over letters and numbers is a red herring, and what matters is whether the token has been seen before. I'm still a bit confused though.
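
The return-type mismatch described above can be seen without spaCy. This is a minimal sketch using only the standard library; the sample strings are made up for illustration, and it does not attempt to reproduce the Vocab internals.

```python
import re

pattern = re.compile('abc[0-9]+')

# match() returns a match object on success and None on failure.
# Neither is a bool, which is what a flag getter is expected to return.
hit = pattern.match('abc123')
miss = pattern.match('xyz')

print(isinstance(hit, bool))   # False: a match object, not True
print(miss is None)            # True: None, not False

# Wrapping the result in bool() yields a plain True/False.
print(bool(hit), bool(miss))   # True False
```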

Author

jjtapia commented Feb 8, 2018

Thank you @honnibal. I was following this example (IS_DEFINITELY), but I should have read the documentation for add_flag more closely. I modified your code example a little bit:

TEST_CODE = nlp.vocab.add_flag(lambda string: bool(re.compile('abc[0-9]+').match(string)))

and this worked for my use case.
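
Why the added `(string)` call matters can be checked in isolation. This is a sketch assuming only the standard library; `always_true` and `is_test_code` are hypothetical names for the two variants posted in this thread, and the pattern is compiled once up front rather than inside the getter.

```python
import re

PATTERN = re.compile('abc[0-9]+')


def always_true(string):
    # Variant without the call: PATTERN.match is a bound method object,
    # and bool() of a method is always True, whatever the input is.
    return bool(PATTERN.match)


def is_test_code(string):
    # Variant that calls match(string): True only if the string matches
    # the pattern from its first character.
    return bool(PATTERN.match(string))


for text in ('abc123', 'abc', '123'):
    print(text, always_true(text), is_test_code(text))
```

Both functions return a plain bool, but only the second depends on the token text, so it is the one to pass to nlp.vocab.add_flag.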

Member

ines commented Feb 9, 2018

@jjtapia Thanks, looks like this is actually a mistake in our docs! Fixing! 👍

ines closed this as completed in e9f67be Feb 9, 2018


lock bot commented May 7, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 7, 2018