
Times such as "7pm" tokenized wrong #736

Closed
matthayes opened this issue Jan 12, 2017 · 5 comments
Labels
bug: Bugs and behaviour differing from documentation
lang / en: English language data and models

Comments

@matthayes

There appears to be a bug in how times are tokenized for English.

nlp = spacy.load("en")
doc = nlp("We're meeting at 7pm.")

for token in doc:
    print(token, token.pos_, token.lemma_)

This produces:

We PRON -PRON-
're VERB 're
meeting VERB meet
at ADP at
IS_TITLE PROPN is_title
pm NOUN pm
. PUNCT .

Instead of IS_TITLE PROPN is_title, I was expecting 7 NUM 7, which is what you get if you use "7 pm" instead (with a space in between). I see that TOKENIZER_EXCEPTIONS includes a number of exceptions to handle this type of case, so I'm confused as to why it doesn't work. It also seems that the "7" should be preserved rather than being replaced with IS_TITLE.

Your Environment

  • Operating System: Mac OSX 10.11.6
  • Python Version Used: 3.5.2
  • spaCy Version Used: 1.5.0
  • Environment Information: English data version appears to be 1.1.0 given that I see the path spacy/data/en-1.1.0 under site-packages.
@matthayes
Author

It appears that the number in the time is somehow being mapped to the i-th element of IDS in attrs.pyx:

IDS = {
    "": NULL_ATTR,
    "IS_ALPHA": IS_ALPHA,
    "IS_ASCII": IS_ASCII,
    "IS_DIGIT": IS_DIGIT,
    "IS_LOWER": IS_LOWER,
    "IS_PUNCT": IS_PUNCT,
    "IS_SPACE": IS_SPACE,
    "IS_TITLE": IS_TITLE,
    "IS_UPPER": IS_UPPER,

For example, "8am" becomes IS_UPPER.
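The mapping described above can be sketched in plain Python. IDS_KEYS is a hand-copied excerpt of the keys in declaration order, and resolve_orth is a hypothetical stand-in for the lookup, not spaCy's actual code:

```python
# Sketch of the assumed failure mode: when an ORTH value is an integer,
# it is treated as an index into the ordered keys of IDS rather than as
# a literal token string. (IDS_KEYS and resolve_orth are illustrative
# stand-ins, not spaCy internals.)
IDS_KEYS = ["", "IS_ALPHA", "IS_ASCII", "IS_DIGIT", "IS_LOWER",
            "IS_PUNCT", "IS_SPACE", "IS_TITLE", "IS_UPPER"]

def resolve_orth(value):
    """Return the string an integer ORTH value would accidentally map to."""
    if isinstance(value, int):
        return IDS_KEYS[value]
    return value

print(resolve_orth(7))    # "IS_TITLE" -- what "7pm" produced
print(resolve_orth(8))    # "IS_UPPER" -- what "8am" produced
print(resolve_orth("7"))  # "7" -- the intended literal
```

This reproduces both observations in the thread: index 7 lands on IS_TITLE and index 8 on IS_UPPER.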

@matthayes matthayes changed the title Times such as "7am" tokenized wrong Times such as "7pm" tokenized wrong Jan 12, 2017
@matthayes
Author

I think the issue is in language_data.py. The hour here should be converted to a string; I'm assuming that when it is left as a number, it becomes a lookup into IDS.

        exc["%dam" % hour] = [
            {ORTH: hour},
            {ORTH: "am", LEMMA: "a.m."}
        ]
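A minimal sketch of that fix, using plain-string stand-ins for spaCy's ORTH/LEMMA constants: casting the hour to a string makes it a literal token instead of an attribute-ID lookup.

```python
# Stand-ins for spaCy's attribute constants (illustrative only).
ORTH, LEMMA = "orth", "lemma"

# Build the "Nam" tokenizer exceptions with the hour cast to a string.
exc = {}
for hour in range(1, 13):
    exc["%dam" % hour] = [
        {ORTH: "%d" % hour},          # was {ORTH: hour} -- a raw int
        {ORTH: "am", LEMMA: "a.m."},
    ]

print(exc["7am"])  # [{'orth': '7'}, {'orth': 'am', 'lemma': 'a.m.'}]
```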

When I add this special case to override the existing rule, it works:

nlp.tokenizer.add_special_case(
    '7pm',
    [
        {
            ORTH: '7',
            LEMMA: '7',
            POS: 'NUM'
        },
        {
            ORTH: 'pm',
            LEMMA: 'p.m.',
            POS: 'NOUN'
        }
    ])

@honnibal honnibal added the bug label Jan 12, 2017
@honnibal
Member

Thanks, your analysis is definitely correct. Fixing.

ines added a commit that referenced this issue Jan 12, 2017
@ines ines added the lang / en label Jan 12, 2017
ines added a commit that referenced this issue Jan 12, 2017
@ines
Member

ines commented Jan 12, 2017

Issue fixed and the regression test passes! The fix will be included in the next release (coming later today).

@ines ines closed this as completed Jan 12, 2017
soldni added a commit to Georgetown-IR-Lab/QuickUMLS that referenced this issue Jan 23, 2017
Previous versions of spacy (< 1.6.0) have a bug that can cause
issues in parsing numbers (see explosion/spaCy#736).
Please update spacy to latest version.
@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018