
Times such as "7pm" tokenized wrong #736

Closed
matthayes opened this issue Jan 12, 2017 · 5 comments
Labels
bug: Bugs and behaviour differing from documentation
lang / en: English language data and models

Comments

@matthayes

There appears to be a bug in how times are tokenized for English.

nlp = spacy.load("en")
doc = nlp("We're meeting at 7pm.")

for token in doc:
    print(token, token.pos_, token.lemma_)

This produces:

We PRON -PRON-
're VERB 're
meeting VERB meet
at ADP at
IS_TITLE PROPN is_title
pm NOUN pm
. PUNCT .

Instead of IS_TITLE PROPN is_title, I was expecting 7 NUM 7, which is what you get if you use "7 pm" instead (with a space in between). I see that TOKENIZER_EXCEPTIONS includes a number of exceptions to handle this type of case, so I'm confused as to why it doesn't work. It also seems that the "7" should be preserved rather than being replaced with IS_TITLE.

Your Environment

  • Operating System: Mac OSX 10.11.6
  • Python Version Used: 3.5.2
  • spaCy Version Used: 1.5.0
  • Environment Information: English data version appears to be 1.1.0 given that I see the path spacy/data/en-1.1.0 under site-packages.
@matthayes
Author

It appears that the number in the time is somehow being mapped to the i-th element of IDS in attrs.pyx:

IDS = {
    "": NULL_ATTR,
    "IS_ALPHA": IS_ALPHA,
    "IS_ASCII": IS_ASCII,
    "IS_DIGIT": IS_DIGIT,
    "IS_LOWER": IS_LOWER,
    "IS_PUNCT": IS_PUNCT,
    "IS_SPACE": IS_SPACE,
    "IS_TITLE": IS_TITLE,
    "IS_UPPER": IS_UPPER,

For example, "8am" becomes IS_UPPER.
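The mapping described above can be sketched in plain Python. IDS_KEYS is a hand-copied excerpt of the keys in declaration order, and resolve_orth is a hypothetical stand-in for the lookup, not spaCy's actual code:

```python
# Sketch of the assumed failure mode: when an ORTH value is an integer,
# it is treated as an index into the ordered keys of IDS rather than as
# a literal token string. (IDS_KEYS and resolve_orth are illustrative
# stand-ins, not spaCy internals.)
IDS_KEYS = ["", "IS_ALPHA", "IS_ASCII", "IS_DIGIT", "IS_LOWER",
            "IS_PUNCT", "IS_SPACE", "IS_TITLE", "IS_UPPER"]

def resolve_orth(value):
    """Return the string an integer ORTH value would accidentally map to."""
    if isinstance(value, int):
        return IDS_KEYS[value]
    return value

print(resolve_orth(7))    # "IS_TITLE" -- what "7pm" produced
print(resolve_orth(8))    # "IS_UPPER" -- what "8am" produced
print(resolve_orth("7"))  # "7" -- the intended literal
```

This reproduces both observations in the thread: index 7 lands on IS_TITLE and index 8 on IS_UPPER.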

@matthayes matthayes changed the title Times such as "7am" tokenized wrong Times such as "7pm" tokenized wrong Jan 12, 2017
@matthayes
Author

I think the issue is in language_data.py. The hour here should be converted to a string; I'm assuming that when it is left as a number, it becomes a lookup into IDS.

        exc["%dam" % hour] = [
            {ORTH: hour},
            {ORTH: "am", LEMMA: "a.m."}
        ]
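A minimal sketch of that fix, using plain-string stand-ins for spaCy's ORTH/LEMMA constants: casting the hour to a string makes it a literal token instead of an attribute-ID lookup.

```python
# Stand-ins for spaCy's attribute constants (illustrative only).
ORTH, LEMMA = "orth", "lemma"

# Build the "Nam" tokenizer exceptions with the hour cast to a string.
exc = {}
for hour in range(1, 13):
    exc["%dam" % hour] = [
        {ORTH: "%d" % hour},          # was {ORTH: hour} -- a raw int
        {ORTH: "am", LEMMA: "a.m."},
    ]

print(exc["7am"])  # [{'orth': '7'}, {'orth': 'am', 'lemma': 'a.m.'}]
```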

When I add this special case to override the existing rule, it works:

nlp.tokenizer.add_special_case(
    '7pm',
    [
        {
            ORTH: '7',
            LEMMA: '7',
            POS: 'NUM'
        },
        {
            ORTH: 'pm',
            LEMMA: 'p.m.',
            POS: 'NOUN'
        }
    ])

@honnibal honnibal added the bug label Jan 12, 2017
@honnibal
Member

Thanks, your analysis is definitely correct. Fixing.

ines added a commit that referenced this issue Jan 12, 2017
@ines ines added the lang / en label Jan 12, 2017
ines added a commit that referenced this issue Jan 12, 2017
@ines
Member

ines commented Jan 12, 2017

Issue fixed and the regression test passes! The fix will be included in the next release (coming later today).

@ines ines closed this as completed Jan 12, 2017
soldni added a commit to Georgetown-IR-Lab/QuickUMLS that referenced this issue Jan 23, 2017
Previous versions of spacy (< 1.6.0) have a bug that can cause
issues in parsing numbers (see explosion/spaCy#736).
Please update spacy to latest version.
@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018