
Tokenisation exceptions for English pronoun "She" #718

Closed
liane-brainnwave opened this issue Jan 3, 2017 · 9 comments · May be fixed by baby636/spaCy#55

Comments

liane-brainnwave commented Jan 3, 2017

tokenizer_exceptions.py for English contains several entries for the pronoun 'She' with the ORTH value 'i', for example:

"She's": [
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'s"}
]

Unless I am missing something, this should be:

"She's": [
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'s"}
]
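
As a quick way to see the problem, here is a minimal sketch (not spaCy's actual validation code; plain strings and a stand-in PRON_LEMMA replace spaCy's symbols) of the invariant such entries should satisfy: the ORTH values of the split tokens need to concatenate back to the string being matched, which the 'i' entries break.

# Minimal sketch, not spaCy's actual validation code. Plain strings stand in
# for spaCy's ORTH/LEMMA/TAG symbols, and PRON_LEMMA is a stand-in value.
ORTH, LEMMA, TAG = "ORTH", "LEMMA", "TAG"
PRON_LEMMA = "-PRON-"

EXC = {
    "She's": [
        {ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
        {ORTH: "'s"},
    ],
}

for string, tokens in EXC.items():
    # The ORTH values of the split tokens should add up to the matched string.
    joined = "".join(token[ORTH] for token in tokens)
    assert joined == string, "ORTH values %r do not add up to %r" % (joined, string)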

ines (Member) commented Jan 3, 2017

Thanks – you're absolutely correct. Looks like this was a copy-paste mistake that wasn't spotted. Fixing this now!

Also currently working on making the English exceptions a little easier to read – for example, ordering them properly etc.

@ines ines closed this as completed in 1bd53bb Jan 3, 2017
liane-brainnwave (Author)
Thanks, ines. Re-ordering/grouping the exceptions would certainly help.

ines (Member) commented Jan 3, 2017

For the pronouns, I'm thinking about doing something like:

for pron in ["you", "You", "we", "We", "they", "They"]:
    EXC[pron + "'re"] = [
        {ORTH: pron, LEMMA: PRON_LEMMA, TAG: "PRP"},
        {ORTH: "'re", LEMMA: "be", NORM: "are"}
    ]

    EXC[pron + "re"] = [
        {ORTH: pron, LEMMA: PRON_LEMMA, TAG: "PRP"},
        {ORTH: "re", LEMMA: "be", NORM: "are"}
    ]

This obviously doesn't work for all exceptions and languages, and not everything follows such a nice pattern. But for the tokens that do, I think it could help with making sure everything's consistent. (Plus, it'd easily cut a couple of hundred lines.)
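
Unrolled, the "You" iteration of that loop is equivalent to writing the entries out by hand:

EXC["You're"] = [
    {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
    {ORTH: "'re", LEMMA: "be", NORM: "are"}
]

EXC["Youre"] = [
    {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
    {ORTH: "re", LEMMA: "be", NORM: "are"}
]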

oroszgy (Contributor) commented Jan 3, 2017

I know that @honnibal doesn't want to overuse regexps in spaCy, but I think that in this case they would not just be more efficient but also easier to maintain.

If we can utilize the return value of rule_match from #700, we can simplify the handling of such exceptions significantly. What do you think?

ines (Member) commented Jan 3, 2017

@oroszgy Hm, so I agree regexes would be more elegant, but even with the new rule_match, all we can do is match the exception – there's no way to actually return a list of split tokens for each individual exception. Or am I misunderstanding your suggestion?

ines added commits that referenced this issue Jan 3, 2017:
Add logic to generate exceptions that follow a consistent pattern (like verbs and pronouns) and allow certain tokens to be excluded explicitly.
oroszgy (Contributor) commented Jan 3, 2017

@ines I meant that we could simply change the semantics of token_match so that it can return tokens if there is a match. That way, handling tokenizer exceptions would be a bit easier.

Instead of token_match(text: str) -> bool we could have token_match(text: str) -> Union[None, List[TokenAnalysis]], where the TokenAnalysis would be something like this for "and/or":

{ORTH: "and/or", LEMMA: "and/or", TAG: "CC"}

honnibal (Member) commented Jan 3, 2017

I really think it's better to express the exceptions as literal strings. We're hashing these, so it's a (super fast) O(1) operation per chunk to check the exceptions. I do think we'd hit a performance problem from moving this to regular expressions.

The expansion logic also gets quite tricky if you're writing an expression. Let's say you write a pattern like this:

(I|he|she) ('d) (nt|n't)

You need to look up the regex capture groups to fill out the ORTH fields, and ultimately the lemmas. In this case there's a natural 1-to-1 mapping between the groups and the tokens, but this need not be so. If you have character classes for case sensitivity, you've got some problems to deal with.

In order to make the expansion logic sane, you'll be tempted to separate out the matches into different expressions. Once you do that, you're not really any further ahead on generality/compactness than you are with the explicit exceptions.
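
A simplified sketch (not spaCy's actual tokenizer internals) of why the literal-string table stays fast: each whitespace-delimited chunk is checked against the exceptions with a single dict lookup, so the per-chunk cost is O(1) no matter how many exceptions there are.

# Simplified sketch, not spaCy's actual tokenizer: each whitespace-delimited
# chunk is checked against the exception table with one hash lookup.
EXC = {
    "She's": [{"ORTH": "She"}, {"ORTH": "'s"}],
    "don't": [{"ORTH": "do"}, {"ORTH": "n't"}],
}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        special = EXC.get(chunk)  # hash lookup, no pattern matching
        if special is not None:
            tokens.extend(token["ORTH"] for token in special)
        else:
            tokens.append(chunk)
    return tokens

print(tokenize("She's sure they don't know"))
# ['She', "'s", 'sure', 'they', 'do', "n't", 'know']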

oroszgy (Contributor) commented Jan 4, 2017

@honnibal, thanks for the explanation! For some reason I didn't notice that the for loops are executed at load time rather than at runtime. Of course, in this case hashing is way more efficient than regular expressions.

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018