Issues related to the implementation of lemmatizer exceptions have appeared on the spaCy GitHub since at least 2016. From what I can tell, none of them explicitly addresses the fact that even exceptions from the default lookup tables (or their predecessor data structures) do not seem to be applied consistently by a given lemmatizer. If a similar question has been addressed before, then perhaps the new lemmatizer-as-pipe arrangement merits a renewed explanation.
Is the behavior below expected? If so, how does one ensure that a given lemmatizer uses the exceptions specified in the lemma_exc lookup table?
How to reproduce the behaviour
After extracting the lemmatizer pipe, we can check in the associated lookup tables that wunderkinder (noun) is keyed to the lemma wunderkind and forbore (verb) to the lemma forbear. But even when the POS is tagged correctly, they are not lemmatized by the pipeline.
>>> import spacy
>>> nlp = spacy.load("en_core_web_lg")
>>> lemmatizer = nlp.get_pipe("lemmatizer")
>>> lemmatizer.lookups.get_table("lemma_exc")["noun"]["wunderkinder"]
['wunderkind']
>>> lemmatizer.lookups.get_table("lemma_exc")["verb"]["forbore"]
['forbear']
>>> doc_1 = nlp("I've known several wunderkinder in my life.")
>>> doc_2 = nlp("He never forbore a smile.")
>>> (doc_1[4].lemma_, doc_1[4].pos_)
('wunderkinder', 'NOUN')
>>> (doc_2[2].lemma_, doc_2[2].pos_)
('forbore', 'VERB')
Furthermore, neither the lookup-based nor the rule-based lemmatizer method takes the table entries for wunderkinder and forbore into account.
>>> lemmatizer.lookup_lemmatize(doc_1[4])
['wunderkinder']
>>> lemmatizer.lookup_lemmatize(doc_2[2])
['forbore']
>>> lemmatizer.rule_lemmatize(doc_1[4])
['wunderkinder']
>>> lemmatizer.rule_lemmatize(doc_2[2])
['forbore']
The above examples are just for illustration purposes. I am asking this question because, in order to rig up a custom lemmatization pipe for my applications, I would like to understand how and when lemma exceptions are applied.
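To make the expectation concrete, here is a minimal sketch of the behaviour I had assumed: consult the POS-keyed exception table first, and fall back to the lowercased form only when there is no entry. It is plain Python and does not reflect spaCy's actual implementation; the names LEMMA_EXC and lemmatize_with_exceptions are hypothetical, and the table holds only the two entries shown above.

```python
from typing import Dict, List

# Hypothetical miniature of a POS-keyed exception table, shaped like the
# "lemma_exc" entries inspected above: POS -> surface form -> list of lemmas.
LEMMA_EXC: Dict[str, Dict[str, List[str]]] = {
    "noun": {"wunderkinder": ["wunderkind"]},
    "verb": {"forbore": ["forbear"]},
}


def lemmatize_with_exceptions(
    text: str,
    pos: str,
    exc: Dict[str, Dict[str, List[str]]] = LEMMA_EXC,
) -> List[str]:
    """Return the exception lemmas for (text, pos) if the table has an entry.

    Otherwise fall back to the lowercased surface form. This is an
    illustration of the expected lookup order, not spaCy's own logic.
    """
    form = text.lower()
    # The table keys are lowercase POS names, so normalise "NOUN" -> "noun".
    pos_table = exc.get(pos.lower(), {})
    # Copy the list so callers cannot mutate the table by accident.
    return list(pos_table.get(form, [form]))


if __name__ == "__main__":
    print(lemmatize_with_exceptions("wunderkinder", "NOUN"))
    print(lemmatize_with_exceptions("forbore", "VERB"))
    print(lemmatize_with_exceptions("smile", "NOUN"))
```

With this lookup order, wunderkinder tagged NOUN and forbore tagged VERB would resolve to their table entries, which is the result I expected from the pipeline.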
Your Environment
- spaCy version: 3.0.1
- Platform: macOS-10.15.7-x86_64-i386-64bit
- Python version: 3.8.0
- Pipelines: en_core_web_lg (3.0.0), en_core_web_sm (3.0.0)