de_dep_news_trf uses 1990s Spelling Convention in Lemmatization #9799

hatzel · 2021-12-03T14:44:42Z

How to reproduce the behaviour

import spacy

nlp = spacy.load("de_dep_news_trf")
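# Pre-1996 spelling: lemmatized as expected
assert nlp("Du ißt Äpfel")[1].lemma_ == 'essen'
# Post-1996 spelling: the surface form comes back as the lemma
print(nlp("Du isst Äpfel")[1].lemma_)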

This prints isst where essen would be expected.

It looks like the model just uses a lookup table that doesn't reflect the 1996 reform of German spelling conventions. The same effect is observable for frißt/frisst.
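To illustrate the suspected mechanism, here is a toy sketch of a lookup lemmatizer. It assumes the lookup is a plain dictionary that falls back to the surface form on a miss; the table contents and the names `lookup_lemma`, `OLD_SPELLING_TABLE`, and `PATCHED_TABLE` are made up for illustration and are not spaCy's actual data or API.

```python
from typing import Dict

# Illustrative table containing only pre-1996 spellings (not spaCy's real data).
OLD_SPELLING_TABLE: Dict[str, str] = {
    "ißt": "essen",
    "frißt": "fressen",
}


def lookup_lemma(token: str, table: Dict[str, str]) -> str:
    """Return the table entry for the token, or the token itself on a miss."""
    return table.get(token, token)


# Hypothetical fix: add the post-1996 spellings alongside the old ones.
PATCHED_TABLE: Dict[str, str] = {
    **OLD_SPELLING_TABLE,
    "isst": "essen",
    "frisst": "fressen",
}

print(lookup_lemma("ißt", OLD_SPELLING_TABLE))   # old spelling is found
print(lookup_lemma("isst", OLD_SPELLING_TABLE))  # new spelling misses, falls back
print(lookup_lemma("isst", PATCHED_TABLE))       # found once the entry exists
```

If the real table behaves this way, a missing entry would explain why isst is returned unchanged while ißt maps to essen.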

Your Environment

de-dep-news-trf @ https://github.com/explosion/spacy-models/releases/download/de_dep_news_trf-3.2.0/de_dep_news_trf-3.2.0-py3-none-any.whl
spacy==3.2.0
spacy-alignments==0.8.4
spacy-legacy==3.0.8
spacy-loggers==1.0.1
spacy-transformers==1.1.2


adrianeboyd · 2021-12-06T13:57:51Z

Yes, the lookup table has a lot of issues. Our main German training corpus (TIGER) is also old enough that it primarily contains old spellings, so I'm not sure we even have a good basis to train/evaluate the lemmatizer without some manual effort to update both.

If you'd like, you can try out the new experimental edit tree lemmatizer, which works a lot better than the lookup table: https://explosion.ai/blog/edit-tree-lemmatizer/. We hope to replace the lookup tables with this lemmatizer in many of the trained pipelines in the near future.

TIGER is available for research use (but will contain old spellings), or you can look at UD corpora or similar resources if you need training data (although I'm not sure what kinds of sources are included or how high the quality of the lemma annotation is in each corpus).

polm added the labels "lang / de" (German language data and models) and "feat / lemmatizer" (Feature: Rule-based and lookup lemmatization) on Dec 4, 2021

polm mentioned this issue Jan 4, 2022

German adjectives ending on -e are not lemmatized using the lookup lemmatizer #4622

Open
