
💫 Improve rule-based lemmatization and replace lookups #2668

Closed
ines opened this issue Aug 14, 2018 · 26 comments
Labels
enhancement Feature requests and improvements feat / lemmatizer Feature: Rule-based and lookup lemmatization

Comments

@ines
Member

ines commented Aug 14, 2018

This is an enhancement issue that's been high on our list for a while, and something we'd love to tackle for as many languages as possible. At the moment, spaCy only implements rule-based lemmatization for very few languages. Many of the other languages do support lemmatization, but only use lookup tables, which are less reliable and don't always produce good results. They're also large and add significant bloat to the library.

Existing rule-based lemmatizers

Language Source
English lang/en/lemmatizer
Greek lang/el/lemmatizer
Norwegian lang/nb/lemmatizer
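To make the distinction concrete, here is a minimal sketch (not spaCy's actual implementation; toy data throughout) of the index/exceptions/rules pattern those rule-based lemmatizers follow: exceptions are checked first, then suffix rules are applied and candidates are validated against an index of known lemmas.

```python
# Minimal sketch of rule-based lemmatization: exceptions first, then suffix
# rules whose candidates must be attested in an index of known lemmas.

def lemmatize(string, index, exceptions, rules):
    """Return candidate lemmas for `string`, preferring exceptions over rules."""
    if string in exceptions:
        return list(exceptions[string])
    candidates = []
    for old_suffix, new_suffix in rules:
        if string.endswith(old_suffix):
            candidate = string[: len(string) - len(old_suffix)] + new_suffix
            # Only keep candidates attested in the index of known lemmas.
            if candidate in index:
                candidates.append(candidate)
    # Fall back to the original form if no rule produced a known lemma.
    return candidates or [string]

# Toy English noun data (illustrative only).
noun_index = {"tree", "house", "mouse"}
noun_exc = {"mice": ["mouse"]}
noun_rules = [("ses", "s"), ("s", "")]

print(lemmatize("trees", noun_index, noun_exc, noun_rules))  # ['tree']
print(lemmatize("mice", noun_index, noun_exc, noun_rules))   # ['mouse']
```

A plain lookup table, by contrast, can only map whole surface forms it has seen before, which is why it bloats the library and fails on unseen forms.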

Most problematic lookup lemmatizers

Just my subjective selection, but these are the languages we want to prioritise. Of course, we'd also appreciate contributions to any of the other languages!

Language Related issues
German #2486, #2368, #2120
French #2659, #2251, #2079
Spanish #2710
@ines ines added enhancement Feature requests and improvements help wanted Contributions welcome! lang / de German language data and models lang / fr French language data and models lang / es Spanish language data and models feat / lemmatizer Feature: Rule-based and lookup lemmatization labels Aug 14, 2018
@mauryaland
Contributor

Hello,

I will do a PR soon to introduce rule-based lemmatization for the French language based on the following lemma dictionary which is under this license. I have created nouns, adjectives and adverbs files, largely inspired by the English lemmatizer structure, and would like to let the lookup table do the job for the verbs.

However, since I introduced those files, the lemmatizer does not look at the lookup table and therefore does not return the infinitive form of the verb when calling the lemma_ method on a verb token. I don't know where this behaviour comes from. I have checked several scripts from lemmatizer.py to language.py but I have no clue.

Could you please take a look? Here is the link of the concerned branch.

Thank you in advance for your help,

Amaury

@sj660

sj660 commented Jan 1, 2019

In Spanish, the lemmatizer appears to treat verbs with clitic pronouns attached as separate words. I know this is tough to implement, but it's very important, especially for speech transcripts. For example, I get

Da -> Dar (correct)
Dáme -> Dáme (incorrect; should be "dar" and yo/me)
Dámelo -> Dámelo (incorrect; should be "dar" and yo/me and lo/la[Pron])
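One way to illustrate the preprocessing this would need is a hypothetical sketch (not a spaCy feature) that peels enclitic pronouns off a verb form before lemma lookup. The clitic list and the lookup table below are toy data.

```python
# Hypothetical sketch of splitting enclitic pronouns off Spanish verb forms
# before lemma lookup. Toy data; real clitic handling is more involved.
import unicodedata

VERB_LOOKUP = {"da": "dar"}  # toy lookup: surface form -> lemma
CLITICS = ("me", "te", "se", "lo", "la", "le", "nos")

def strip_accents(s):
    # "dámelo" -> "damelo": drop combining accent marks after NFD decomposition
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def split_clitics(token):
    """Return (verb lemma, attached clitics) for forms like 'Dámelo'."""
    base = strip_accents(token.lower())
    clitics = []
    changed = True
    while changed and base not in VERB_LOOKUP:
        changed = False
        for clitic in CLITICS:
            if base.endswith(clitic) and len(base) > len(clitic):
                clitics.insert(0, clitic)   # peel clitics off right to left
                base = base[: -len(clitic)]
                changed = True
                break
    if base in VERB_LOOKUP:
        return VERB_LOOKUP[base], clitics
    return token, []  # no decomposition found; leave the token alone

print(split_clitics("Dámelo"))  # ('dar', ['me', 'lo'])
```

Note the accent stripping: the written accent in "dámelo" only appears because of the attached clitics, so the base form has to be normalised before lookup.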

@oroszgy
Contributor

oroszgy commented Jan 5, 2019

@ines have you looked at lemmy? I believe it would be a great addition for many languages: it employs a simple ML method similar to that described here. I've successfully built a lemmatizer for Hungarian that has a .96 ACC on the CONLL17 test set.

@ines ines changed the title 💫 Rule-based lemmatizers for other languages 💫 Improve rule-based lemmatization and replace lookups Jun 1, 2019
@ines ines removed lang / de German language data and models lang / es Spanish language data and models help wanted Contributions welcome! labels Jun 1, 2019
@SimonF89

SimonF89 commented Jun 2, 2019

Does anyone know a better lemmatizer for German? One of my students wants to run some experiments with German sentences and therefore needs a lemmatizer. Since spaCy has some bugs, we would like to use another lemmatizer until spaCy is capable. Any recommendations?

@ines
Member Author

ines commented Jun 2, 2019

@snape6666 Maybe check out this one: https://github.com/Liebeck/spacy-iwnlp It's a spaCy pipeline component, so you'll get the same API experience. (Also see iwnlp-py.)

@jsalbr

jsalbr commented Oct 20, 2019

@ines: A big problem for me are nouns that get verb lemmas, e.g. the lemma of "Bäume" is "bäumen". I have created a quick fix for the lookup for German nouns based on Wiktionary. It replaces all entries in lang.de.lemmatizer.LOOKUP that

  • start with a capital letter but have a lower-case lemma, and
  • appear as a word form in the list of Wiktionary nouns.

The new lemma is the Wiktionary lemma.
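The filtering described above can be sketched as a simple dict transformation (toy data below; the real fix works on the full LOOKUP table and a Wiktionary-derived noun list):

```python
# Sketch of the correction: replace lookup entries that start with a capital
# letter but map to a lower-case lemma, whenever the form is attested as a
# noun in a Wiktionary-derived list.

def fix_noun_lemmas(lookup, wiktionary_nouns):
    """Return a copy of `lookup` with capitalised noun forms re-lemmatised."""
    fixed = dict(lookup)
    for form, lemma in lookup.items():
        if form[:1].isupper() and lemma[:1].islower() and form in wiktionary_nouns:
            fixed[form] = wiktionary_nouns[form]  # use the Wiktionary lemma
    return fixed

lookup = {"Bäume": "bäumen", "laufen": "laufen"}
wiktionary_nouns = {"Bäume": "Baum"}  # form -> noun lemma
print(fix_noun_lemmas(lookup, wiktionary_nouns))  # {'Bäume': 'Baum', 'laufen': 'laufen'}
```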

I have prepared two files, one with just the changes and one with the complete fix for lang.de.lemmatizer.py: https://github.com/jsalbr/spacy-lemmatizer-de-fix/blob/master/lemmatizer_de_changes.py
https://github.com/jsalbr/spacy-lemmatizer-de-fix/blob/master/lemmatizer_de.py

I could create a pull request for that, if you are not almost done with the German rule-based lemmatizer that includes POS tags.

@ines
Member Author

ines commented Oct 20, 2019

@jsalbr Thanks a lot! And yes, a PR would be great 👍 The only small difference now is that as of spaCy v2.2, the lemma rules and lookups data have moved into a separate package, spacy-lookups-data, and are now stored as JSON. Here's the German lookups file: https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/de_lemma_lookup.json

(A nice advantage of the new setup is that the data is much smaller and can easily be modified at runtime. It also allows us to release new data updates without having to release new versions of spaCy.)

@jsalbr

jsalbr commented Oct 22, 2019

@ines: Sorry, I've now tried for quite a while to figure out how to work with the new lookups, but my changes to de_lemma_lookup.json don't take effect. I have both packages on my Python path.

@ines
Member Author

ines commented Oct 22, 2019

@jsalbr How are you creating the language class? (The models ship with their own serialized tables, so you'll need to test with a blank German language class). Also, the files are provided and read in via entry points, so the spacy-lookups-data package needs to be installed as a package (e.g. via python setup.py develop).

@jsalbr

jsalbr commented Oct 22, 2019

@ines: Thanks, I finally managed to create the blank language class and created the PR for the correction of German nouns (explosion/spacy-lookups-data#2). But how can I use the new lookup file together with a pretrained model, i.e. how can I re-instantiate / modify the serialized tables?

@ines
Member Author

ines commented Oct 22, 2019

@jsalbr Thanks, I'll take a look! And you can always modify or replace the "lemma_lookup" table in nlp.vocab.lookups at runtime, see here for the API details: https://spacy.io/api/lookups When you save out the model, it'll then also serialize the modified rules (just like when you update the tokenization rules etc.)

@jsalbr

jsalbr commented Oct 23, 2019

@ines: Thanks, I had a mistake in my code; now it works. What I still don't fully understand is how the use of spacy_lookups_data in spaCy is intended. Of course I can read the lemma lookup into some dict and replace it in nlp.vocab.lookups. Is this the way it's supposed to be done?

Anyway, I'll paste the code for modifying the lemma lookups here, for anybody else who wants to tamper with the default lemmas, and for Ines to comment if I missed something.

# variant 1: update lemma lookup table (good for small fixes)
nlp = spacy.load('de')
nlp.vocab.lookups.get_table('lemma_lookup').set("Haus", "Haus")
# get("Haus") now returns "Haus" instead of "hausen"

# variant 2: completely replace lemma lookup table
# e.g. with the file content of spacy_lookups_data.de['lemma_lookup']
my_lemma_lookup = {"Haus": "Haus"}  # some dict with { token: lemma }
nlp.vocab.lookups.remove_table('lemma_lookup')
nlp.vocab.lookups.add_table('lemma_lookup', my_lemma_lookup)

# test
text = "Dieser Gärtner wohnt im Haus."
doc = nlp(text)
print(" ".join([f"{t}/{t.lemma_}/{t.pos_}" for t in doc if not t.is_punct]))

@ines
Member Author

ines commented Oct 23, 2019

@jsalbr I hope I understand your question correctly! The spacy-lookups-data package advertises entry points that spaCy reads when setting up the language subclass (see here). For example, when you create German (language code de), the entry point for the German tables is loaded. So if the lookups package is installed in the same environment, the data is populated automatically.

When you serialize a model by saving out the nlp object to disk, the tables are saved with the model package. This allows keeping the data in sync – you typically don't want a small update to the data package to have unintended effects on your model.

@sontek

sontek commented Jan 12, 2020

Is there a place I can go to track the rule-based Spanish work? I've been having major issues with Spanish for months, and my customers are starting to complain. I've had to switch to stanfordnlp to get accuracy, but it took our calls from seconds to minutes, which isn't ideal either. I'd love a place to track and help with any work going into the Spanish rule system.

#2710
#4341
#4255
#4253

Is there a branch or anything that's being worked on that I could help contribute to? Is there anyone doing the work that needs funding to finish it?

@lmorillas

> Is there a place I can go to track the rule based Spanish work? [...]

Guadalupe Romero (@_guadiromero) said on Twitter that she was working on it (https://twitter.com/_guadiromero/status/1213211033541758979), but I have no more info.
I need to improve the Spanish POS tagging & lemmatization too.

@benitocm

Hi,

I am also interested in those issues:

#2710
#4341
#4255
#4253

Just to be in the loop

@pbouda

pbouda commented Apr 7, 2020

Changing lemma_lookup does not seem to work for English; am I doing something wrong?

import spacy
nlp = spacy.load('en_core_web_lg')
nlp.vocab.lookups.get_table('lemma_lookup').set("data", "data")

text = "This is new personal data to process."
doc = nlp(text)
print(" ".join([f"{t}/{t.lemma_}/{t.pos_}" for t in doc if not t.is_punct]))

Output:

This/this/DET is/be/AUX new/new/ADJ personal/personal/ADJ data/datum/NOUN to/to/PART process/process/VERB

@pbouda

pbouda commented Apr 14, 2020

FYI, it works if I add it to the lemma_exc table like this:

table = nlp.vocab.lookups.get_table("lemma_exc")
table["noun"]["data"] = ["data"]

@adrianeboyd
Contributor

@pbouda Yes, if you have a pipeline with a tagger like one of the provided English models it uses rules + exc, but if it doesn't have a tagger it uses lookup. (This is confusing!)
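The dispatch described here can be sketched in plain Python (toy tables; not spaCy's actual code): with a tagger in the pipeline, the lemmatizer consults POS-keyed exceptions (plus suffix rules, elided below), and only falls back to the flat lookup table when no tagger is available.

```python
# Sketch of the lemmatizer dispatch: rules + exceptions with a tagger,
# plain lookup without one. Toy data; illustrative only.

def lemmatize(token, pos, has_tagger, lemma_exc, lemma_lookup):
    if has_tagger and pos is not None:
        exc = lemma_exc.get(pos.lower(), {})
        if token in exc:
            return exc[token][0]
        return token  # (the real lemmatizer would apply suffix rules here)
    return lemma_lookup.get(token, token)

lemma_exc = {"noun": {"data": ["data"]}}
lemma_lookup = {"data": "datum"}

print(lemmatize("data", "NOUN", True, lemma_exc, lemma_lookup))  # data
print(lemmatize("data", None, False, lemma_exc, lemma_lookup))   # datum
```

This is why editing lemma_lookup had no effect in the example above: with a tagged pipeline, that table is never consulted.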

@pbouda

pbouda commented Apr 14, 2020

@adrianeboyd Yes, I've now spent a day or so going through the code to find out what's happening during lemmatization in general; definitely not the simplest part of spaCy...

@gtoffoli
Copy link
Contributor

gtoffoli commented Apr 19, 2021

Hi, I've just added the related post #7824: Improved Italian lemmatizer: ongoing work or plans?
I'm looking for up-to-date information and I'm available to do some work, if needed.

@cayorodriguez
Contributor

Any news on when we can expect a trainable lemmatizer, now that we have transformers available, that looks at contexts and distinguishes senses? It would be a huge addition to spaCy, especially for applications where dimensionality reduction is important.

@polm
Contributor

polm commented Jul 28, 2021

Howdy folks, since all the issues linked at the top of this were resolved a while ago, and since we are gradually moving away from using issues for future enhancements, I am going to close this.

If you have questions about or are working on a specific lemmatizer it's better to open a thread in Discussions now.

About the trainable lemmatizer in particular, we don't have an implementation yet. You can read more about the current state of affairs in #6903 or #8686. While this is a feature we would love to have, and we would welcome PRs, we are not actively working on it at the moment.

@polm polm closed this as completed Jul 28, 2021
@giopina

giopina commented Oct 12, 2021

(Not sure if this is the right thread for this)

I'm having issues with the lemmatizer for German (both in v3.0.6 and v2.3.2, using both de_core_news_lg and de_dep_news_trf).
Basically, the lemmatizer gets confused by the capitalization of the verb, and can't assign the right lemma to it (while the POS tagger is actually correct).

Code:

import spacy
nlp = spacy.load("de_core_news_lg")
doc = nlp('Meldet sich keiner')
for token in doc:
    print(token,token.lemma_,token.pos_,token.tag_)

Output:

Meldet Meldet VERB VVFIN
sich sich PRON PRF
keiner kein PRON PIS

(the lemma of meldet should be melden)

Is there some general solution/fix to this issue?
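One possible workaround (not a built-in spaCy feature; toy lookup table below) is a case fallback: for tokens tagged as verbs, try the lower-cased form in the lookup table before giving up, since German sentence-initial verbs like "Meldet" are capitalised but stored lower-cased.

```python
# Hypothetical case fallback for lookup lemmatization of capitalised verbs.

def lemma_with_case_fallback(token, pos, lookup):
    if token in lookup:
        return lookup[token]
    # Sentence-initial verbs are capitalised; retry with the lower-cased form.
    if pos == "VERB" and token.lower() in lookup:
        return lookup[token.lower()]
    return token

lookup = {"meldet": "melden"}
print(lemma_with_case_fallback("Meldet", "VERB", lookup))  # melden
```

The POS check matters for German: lower-casing nouns indiscriminately would break lemmas like "Haus", which are legitimately capitalised.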

@polm
Contributor

polm commented Oct 13, 2021

@giopina You should open another issue for this, probably in Discussions.

@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 13, 2021