
Tokenisation exceptions for English pronoun "She" #718

Closed
liane-brainnwave opened this issue Jan 3, 2017 · 9 comments · May be fixed by baby636/spaCy#55

Comments

liane-brainnwave commented Jan 3, 2017

tokenizer_exceptions.py for English contains several entries for the pronoun 'She' with the ORTH value 'i', for example:

"She's": [
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'s"}
]

Unless I am missing something, this should be:

"She's": [
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'s"}
]
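
As a quick way to see the problem, here is a minimal sketch (not spaCy's actual validation code; plain strings and a stand-in PRON_LEMMA replace spaCy's symbols) of the invariant such entries should satisfy: the ORTH values of the split tokens need to concatenate back to the string being matched, which the 'i' entries break.

# Minimal sketch, not spaCy's actual validation code. Plain strings stand in
# for spaCy's ORTH/LEMMA/TAG symbols, and PRON_LEMMA is a stand-in value.
ORTH, LEMMA, TAG = "ORTH", "LEMMA", "TAG"
PRON_LEMMA = "-PRON-"

EXC = {
    "She's": [
        {ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
        {ORTH: "'s"},
    ],
}

for string, tokens in EXC.items():
    # The ORTH values of the split tokens should add up to the matched string.
    joined = "".join(token[ORTH] for token in tokens)
    assert joined == string, "ORTH values %r do not add up to %r" % (joined, string)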

ines (Member) commented Jan 3, 2017

Thanks – you're absolutely correct. Looks like this was a copy-paste mistake that wasn't spotted. Fixing this now!

Also currently working on making the English exceptions a little easier to read – for example, ordering them properly etc.

@ines ines closed this as completed in 1bd53bb Jan 3, 2017
liane-brainnwave (Author)
Thanks, ines. Re-ordering/grouping the exceptions would certainly help.

ines (Member) commented Jan 3, 2017

For the pronouns, I'm thinking about doing something like:

for pron in ["you", "You", "we", "We", "they", "They"]:
    EXC[pron + "'re"] = [
        {ORTH: pron, LEMMA: PRON_LEMMA, TAG: "PRP"},
        {ORTH: "'re", LEMMA: "be", NORM: "are"}
    ]

    EXC[pron + "re"] = [
        {ORTH: pron, LEMMA: PRON_LEMMA, TAG: "PRP"},
        {ORTH: "re", LEMMA: "be", NORM: "are"}
    ]

This obviously doesn't work for all exceptions and languages, and not everything follows such a nice pattern. But for the tokens that do, I think it could help with making sure everything's consistent. (Plus, it'd easily cut a couple of hundred lines.)
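
Unrolled, the "You" iteration of that loop is equivalent to writing the entries out by hand:

EXC["You're"] = [
    {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
    {ORTH: "'re", LEMMA: "be", NORM: "are"}
]

EXC["Youre"] = [
    {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
    {ORTH: "re", LEMMA: "be", NORM: "are"}
]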

oroszgy (Contributor) commented Jan 3, 2017

I know that @honnibal doesn't want to overuse regexps in spaCy, but I think that in this case they would not just be more efficient but also easier to maintain.

If we can utilize the return value of rule_match from #700, we can simplify the handling of such exceptions significantly. What do you think?

ines (Member) commented Jan 3, 2017

@oroszgy Hm, so I agree regexes would be more elegant, but even with the new rule_match, all we can do is match the exception – there's no way to actually return a list of split tokens for each individual exception. Or am I misunderstanding your suggestion?

ines added commits that referenced this issue Jan 3, 2017:
Add logic to generate exceptions that follow a consistent pattern (like verbs and pronouns) and allow certain tokens to be excluded explicitly.
oroszgy (Contributor) commented Jan 3, 2017

@ines I meant that we could simply change the semantics of token_match so that it can return tokens if there is a match. That way, handling tokenizer exceptions would be a bit easier.

Instead of token_match(text: str) -> bool we could have token_match(text: str) -> Union[None, List[TokenAnalysis]], where the TokenAnalysis would be something like this for "and/or":

{ORTH: "and/or", LEMMA: "and/or", TAG: "CC"}

honnibal (Member) commented Jan 3, 2017

I really think it's better to express the exceptions as literal strings. We're hashing these, so it's a (super fast) O(1) operation per chunk to check the exceptions. I do think we'd hit a performance problem from moving this to regular expressions.

The expansion logic also gets quite tricky if you're writing an expression. Let's say you write a pattern like this:

(I|he|she) ('d) (nt|n't)

You need to look up the regex capture groups to fill out the ORTH fields, and ultimately the lemmas. In this case there's a natural 1-to-1 mapping between the groups and the tokens, but this need not be so. If you have character classes for case sensitivity, you've got some problems to deal with.

In order to make the expansion logic sane, you'll be tempted to separate out the matches into different expressions. Once you do that, you're not really any further ahead on generality/compactness than you are with the explicit exceptions.
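
A simplified sketch (not spaCy's actual tokenizer internals) of why the literal-string table stays fast: each whitespace-delimited chunk is checked against the exceptions with a single dict lookup, so the per-chunk cost is O(1) no matter how many exceptions there are.

# Simplified sketch, not spaCy's actual tokenizer: each whitespace-delimited
# chunk is checked against the exception table with one hash lookup.
EXC = {
    "She's": [{"ORTH": "She"}, {"ORTH": "'s"}],
    "don't": [{"ORTH": "do"}, {"ORTH": "n't"}],
}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        special = EXC.get(chunk)  # hash lookup, no pattern matching
        if special is not None:
            tokens.extend(token["ORTH"] for token in special)
        else:
            tokens.append(chunk)
    return tokens

print(tokenize("She's sure they don't know"))
# ['She', "'s", 'sure', 'they', 'do', "n't", 'know']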

oroszgy (Contributor) commented Jan 4, 2017

@honnibal, thanks for the explanation! For some reason I didn't notice that the for loops are executed at load time rather than at runtime. Of course, in this case hashing is way more efficient than regular expressions.

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018