de_dep_news_trf uses 1990s Spelling Convention in Lemmatization #9799

Open
hatzel opened this issue Dec 3, 2021 · 1 comment
Labels
feat / lemmatizer Feature: Rule-based and lookup lemmatization lang / de German language data and models

Comments

@hatzel

hatzel commented Dec 3, 2021

How to reproduce the behaviour

import spacy

nlp = spacy.load("de_dep_news_trf")
assert nlp("Du ißt Äpfel")[1].lemma_ == 'essen'
print(nlp("Du isst Äpfel")[1].lemma_)

This prints isst where essen would be expected.

Looks like the model just uses a lookup table which doesn't contain the 1996 changes to German spelling conventions. Same effect is observable for frißt/frisst as well.
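To illustrate the suspected failure mode without loading the model, here is a minimal pure-Python sketch. The table `OLD_SPELLING_LEMMAS` and both helper functions are hypothetical stand-ins, not spaCy's actual data or API; the fallback (retrying the lookup with `ss` swapped back to `ß`) is just one crude workaround I'd assume could help, not something the pipeline does.

```python
# Toy stand-in for a lookup table that only knows pre-1996 spellings.
# The entries are illustrative and NOT taken from spaCy's lookup data.
OLD_SPELLING_LEMMAS = {
    "ißt": "essen",
    "frißt": "fressen",
}


def lookup_lemma(token, table=OLD_SPELLING_LEMMAS):
    """Plain lookup: forms missing from the table come back unchanged."""
    return table.get(token, token)


def lookup_lemma_with_fallback(token, table=OLD_SPELLING_LEMMAS):
    """Lookup that retries with a guessed pre-reform spelling (ss -> ß)."""
    if token in table:
        return table[token]
    # Crude assumption: the old spelling is the new one with "ss" as "ß".
    old_variant = token.replace("ss", "ß")
    return table.get(old_variant, token)


print(lookup_lemma("isst"))                  # isst (the behaviour reported above)
print(lookup_lemma_with_fallback("isst"))    # essen
print(lookup_lemma_with_fallback("frisst"))  # fressen
```

The blanket `ss` to `ß` replacement is only tried when the direct lookup misses, and it falls back to the original token if the guessed variant is not in the table either, so words like Wasser pass through unchanged.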

Your Environment

de-dep-news-trf @ https://github.com/explosion/spacy-models/releases/download/de_dep_news_trf-3.2.0/de_dep_news_trf-3.2.0-py3-none-any.whl
spacy==3.2.0
spacy-alignments==0.8.4
spacy-legacy==3.0.8
spacy-loggers==1.0.1
spacy-transformers==1.1.2

@polm polm added lang / de German language data and models feat / lemmatizer Feature: Rule-based and lookup lemmatization labels Dec 4, 2021
@adrianeboyd
Contributor

Yes, the lookup table has a lot of issues. Our main German training corpus (TIGER) is also old enough that it primarily contains old spellings, so I'm not sure we even have a good basis to train/evaluate the lemmatizer without some manual effort to update both.

If you'd like, you can try out the new experimental edit tree lemmatizer, which works a lot better than the lookup table: https://explosion.ai/blog/edit-tree-lemmatizer/. We hope to replace the lookup tables with this lemmatizer in many of the trained pipelines in the near future.

TIGER is available for research use (but will contain old spellings), or you can look at UD corpora or similar resources if you need training data (although I'm not sure what kinds of sources are included or how high the quality of the lemma annotation is in each corpus).
