Rule based matching breaks when using a regular expression that includes letters and numbers #1950

jjtapia · 2018-02-07T19:22:39Z

Consider the following code:

from __future__ import unicode_literals
import en_core_web_sm
import re


nlp = en_core_web_sm.load()

TEST_CODE = nlp.vocab.add_flag(re.compile('abc[0-9]+').match)
doc = nlp('abc123')

When running this code, spaCy fails with the following error:

Traceback (most recent call last):
  File "untitled.py", line 9, in <module>
    doc = nlp('123a')
  File "/home/.../venv2b/lib/python3.6/site-packages/spacy/language.py", line 337, in __call__
    doc = self.make_doc(text)
  File "/home.../venv2b/lib/python3.6/site-packages/spacy/language.py", line 365, in make_doc
    return self.tokenizer(text)
  File "tokenizer.pyx", line 120, in spacy.tokenizer.Tokenizer.__call__
  File "tokenizer.pyx", line 161, in spacy.tokenizer.Tokenizer._tokenize
  File "tokenizer.pyx", line 240, in spacy.tokenizer.Tokenizer._attach_tokens
  File "vocab.pyx", line 134, in spacy.vocab.Vocab.get
  File "vocab.pyx", line 170, in spacy.vocab.Vocab._new_lexeme
TypeError: an integer is required

The error is triggered only when the regular expression contains both letters and numbers. A pattern with only one or the other runs without issues.

Info about spaCy

spaCy version: 2.0.7
Platform: Linux-4.4.0-21-generic-x86_64-with-LinuxMint-18-sarah
Python version: 3.6.1

honnibal · 2018-02-08T10:14:43Z

Thanks for the report. I'm still working through the details, but it looks to me like the error comes from re.match() returning a Python object when the Vocab expects the function to return a boolean. The following should work:

TEST_CODE = nlp.vocab.add_flag(lambda string: bool(re.compile('abc[0-9]+').match))

I suspect the match over letters and numbers is a red herring, and what matters is whether the token has been seen before. I'm still a bit confused though.
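
To illustrate the diagnosis above in plain Python, without spaCy: a compiled pattern's match() returns a match object or None, never a bool. This is only a minimal sketch, and the sample strings are illustrative.

```python
import re

pattern = re.compile('abc[0-9]+')

# match() anchors at the start of the string.
hit = pattern.match('abc123')   # a match object: truthy, but not a bool
miss = pattern.match('123a')    # None: the string does not start with 'abc'

print(isinstance(hit, bool))    # the return value is not a boolean
print(miss)
print(bool(hit), bool(miss))    # explicit coercion gives real booleans
```

Coercing the result with bool() is what turns the getter into a function that returns a strict boolean.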

jjtapia · 2018-02-08T19:36:01Z

Thank you @honnibal. I was following this example (IS_DEFINITELY), but I should have read the documentation for add_flag more closely. I modified your code example slightly:

TEST_CODE = nlp.vocab.add_flag(lambda string: bool(re.compile('abc[0-9]+').match(string)))

and this worked for my use case.
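
The difference between the two lambdas is the call. bool(pattern.match) converts the bound method itself, which is always truthy, while bool(pattern.match(string)) converts the result for the given string. Here is a small sketch of that difference, with the pattern compiled once instead of on every call. The names TEST_CODE_RE, always_true and is_test_code are hypothetical, and the commented-out registration line assumes the nlp object from the original report.

```python
import re

TEST_CODE_RE = re.compile('abc[0-9]+')  # compiled once, reused for every token


def always_true(string):
    # Bug: match is never called, and a bound method is always truthy.
    return bool(TEST_CODE_RE.match)


def is_test_code(string):
    # Calls match() on the string and coerces the result to a real bool.
    return bool(TEST_CODE_RE.match(string))


for text in ('abc123', 'abc', '123'):
    print(text, always_true(text), is_test_code(text))

# Registering the getter as in the thread (requires spaCy and a loaded model):
# TEST_CODE = nlp.vocab.add_flag(is_test_code)
```

Note that match() only anchors at the start of the string, so a token such as abc123xyz also passes. If that matters, re's fullmatch() is the stricter choice.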

ines · 2018-02-09T09:19:26Z

@jjtapia Thanks, looks like this is actually a mistake in our docs! Fixing! 👍

lock · 2018-05-07T23:55:31Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

jjtapia mentioned this issue Feb 7, 2018

Spacy is breaking when combining custom tokenizer's token_match with rule-based matching #1947

Closed

honnibal added the bug Bugs and behaviour differing from documentation label Feb 8, 2018

ines closed this as completed in e9f67be Feb 9, 2018

lock bot locked as resolved and limited conversation to collaborators May 7, 2018
