Lemma exception tables not reflected in lookup or rule-based lemmatizers #6980

@tarskiandhutch

Issues about how lemmatizer exceptions are implemented have appeared on the spaCy GitHub since at least 2016. As far as I can tell, none of them explicitly addresses the fact that even exceptions from the default lookup tables (or their predecessor data structures) do not seem to be applied consistently by a given lemmatizer. If a similar question has been answered before, the new lemmatizer-as-pipe arrangement may still merit a renewed explanation.

Is the behavior below expected? If so, how does one ensure that a given lemmatizer uses the exceptions specified in the lemma_exc lookup table?

How to reproduce the behaviour

After extracting the lemmatizer pipe, we can check in the associated lookup tables that wunderkinder (noun) is keyed to the lemma wunderkind and forbore (verb) to the lemma forbear. But even when the POS is tagged correctly, the pipeline returns the surface form unchanged instead of the lemma from the table.

>>> import spacy
>>> nlp = spacy.load("en_core_web_lg")
>>> lemmatizer = nlp.get_pipe("lemmatizer")
>>> lemmatizer.lookups.get_table("lemma_exc")["noun"]["wunderkinder"]
['wunderkind']
>>> lemmatizer.lookups.get_table("lemma_exc")["verb"]["forbore"]
['forbear']
>>> doc_1 = nlp("I've known several wunderkinder in my life.")
>>> doc_2 = nlp("He never forbore a smile.")
>>> (doc_1[4].lemma_, doc_1[4].pos_)
('wunderkinder', 'NOUN')
>>> (doc_2[2].lemma_, doc_2[2].pos_)
('forbore', 'VERB')

Furthermore, neither the rule-based nor the lookup-based lemmatizer takes the table entries for wunderkinder and forbore into account.

>>> lemmatizer.lookup_lemmatize(doc_1[4])
['wunderkinder']
>>> lemmatizer.lookup_lemmatize(doc_2[2])
['forbore']
>>> lemmatizer.rule_lemmatize(doc_1[4])
['wunderkinder']
>>> lemmatizer.rule_lemmatize(doc_2[2])
['forbore']

The examples above are only illustrative. I am asking because I want to set up a custom lemmatization pipe for my applications, and for that I need to understand how and when lemma exceptions are applied.
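For reference, the behavior I would have expected can be sketched in plain Python. This is only an illustration of "check the exception table first, then fall back", not spaCy's actual implementation. The names `LEMMA_EXC`, `lemmatize_with_exceptions`, and `fallback` are hypothetical, and the table holds just the two entries quoted above, in the same POS -> form -> list-of-lemmas shape as lemma_exc.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the "lemma_exc" table: POS -> surface form -> lemmas.
# Only the two entries quoted in this issue are included.
LEMMA_EXC: Dict[str, Dict[str, List[str]]] = {
    "noun": {"wunderkinder": ["wunderkind"]},
    "verb": {"forbore": ["forbear"]},
}


def lemmatize_with_exceptions(
    text: str,
    pos: str,
    exc_table: Dict[str, Dict[str, List[str]]] = LEMMA_EXC,
    fallback: Callable[[str, str], List[str]] = lambda text, pos: [text.lower()],
) -> List[str]:
    """Return the exception lemmas if the form is listed, else defer to `fallback`.

    `fallback` is an assumed placeholder for whatever rule-based or lookup
    strategy would otherwise be used; by default it just lowercases the form.
    """
    key = text.lower()
    exceptions = exc_table.get(pos.lower(), {}).get(key)
    if exceptions:
        # Copy so callers cannot mutate the table through the result.
        return list(exceptions)
    return fallback(text, pos)


print(lemmatize_with_exceptions("wunderkinder", "NOUN"))  # ['wunderkind']
print(lemmatize_with_exceptions("forbore", "VERB"))       # ['forbear']
print(lemmatize_with_exceptions("smile", "NOUN"))         # ['smile']
```

In a custom pipe, the same lookup could presumably be driven by each token's text and pos_, with the real lemma_exc table passed in as `exc_table`. Whether that matches the intended design of the built-in lemmatizer is exactly what I am unsure about.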

Your Environment

  • spaCy version: 3.0.1
  • Platform: macOS-10.15.7-x86_64-i386-64bit
  • Python version: 3.8.0
  • Pipelines: en_core_web_lg (3.0.0), en_core_web_sm (3.0.0)

Labels: feat / lemmatizer (Feature: Rule-based and lookup lemmatization), resolved (The issue was addressed / answered)
