
💫 Improve rule-based lemmatization and replace lookups #2668

Closed
ines opened this issue Aug 14, 2018 · 26 comments
Labels
enhancement Feature requests and improvements feat / lemmatizer Feature: Rule-based and lookup lemmatization

Comments

@ines
Member

ines commented Aug 14, 2018

This is an enhancement issue that's been high on our list for a while, and something we'd love to tackle for as many languages as possible. At the moment, spaCy only implements rule-based lemmatization for very few languages. Many of the other languages do support lemmatization, but only use lookup tables, which are less reliable and don't always produce good results. They're also large and add significant bloat to the library.

Existing rule-based lemmatizers

Language Source
English lang/en/lemmatizer
Greek lang/el/lemmatizer
Norwegian lang/nb/lemmatizer
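To make the distinction concrete, here is a minimal sketch (not spaCy's actual implementation; toy data throughout) of the index/exceptions/rules pattern those rule-based lemmatizers follow: exceptions are checked first, then suffix rules are applied and candidates are validated against an index of known lemmas.

```python
# Minimal sketch of rule-based lemmatization: exceptions first, then suffix
# rules whose candidates must be attested in an index of known lemmas.

def lemmatize(string, index, exceptions, rules):
    """Return candidate lemmas for `string`, preferring exceptions over rules."""
    if string in exceptions:
        return list(exceptions[string])
    candidates = []
    for old_suffix, new_suffix in rules:
        if string.endswith(old_suffix):
            candidate = string[: len(string) - len(old_suffix)] + new_suffix
            # Only keep candidates attested in the index of known lemmas.
            if candidate in index:
                candidates.append(candidate)
    # Fall back to the original form if no rule produced a known lemma.
    return candidates or [string]

# Toy English noun data (illustrative only).
noun_index = {"tree", "house", "mouse"}
noun_exc = {"mice": ["mouse"]}
noun_rules = [("ses", "s"), ("s", "")]

print(lemmatize("trees", noun_index, noun_exc, noun_rules))  # ['tree']
print(lemmatize("mice", noun_index, noun_exc, noun_rules))   # ['mouse']
```

A plain lookup table, by contrast, can only map whole surface forms it has seen before, which is why it bloats the library and fails on unseen forms.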

Most problematic lookup lemmatizers

Just my subjective selection, but these are the languages we want to prioritise. Of course, we'd also appreciate contributions to any of the other languages!

Language Related issues
German #2486, #2368, #2120
French #2659, #2251, #2079
Spanish #2710
@ines ines added enhancement Feature requests and improvements help wanted Contributions welcome! lang / de German language data and models lang / fr French language data and models lang / es Spanish language data and models feat / lemmatizer Feature: Rule-based and lookup lemmatization labels Aug 14, 2018
@mauryaland
Contributor

Hello,

I will do a PR soon to introduce rule-based lemmatization for the French language based on the following lemma dictionary which is under this license. I have created nouns, adjectives and adverbs files, largely inspired by the English lemmatizer structure, and would like to let the lookup table do the job for the verbs.

However, since I introduced those files, the lemmatizer does not look at the lookup table and therefore does not return the infinitive form of the verb when calling the lemma_ method on a verb token. I don't know where this behaviour comes from. I have checked several scripts from lemmatizer.py to language.py but I have no clue.

Could you please take a look? Here is the link of the concerned branch.

Thank you in advance for your help,

Amaury

@sj660

sj660 commented Jan 1, 2019

In Spanish, the lemmatizer appears to treat verbs with clitic pronouns attached as separate words. I know this is tough to implement, but it's very important, especially for speech transcripts. For example, I get

Da -> Dar (correct)
Dáme -> Dáme (incorrect; should be "dar" and yo/me)
Dámelo -> Dámelo (incorrect; should be "dar" and yo/me and lo/la[Pron])
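One way to illustrate the preprocessing this would need is a hypothetical sketch (not a spaCy feature) that peels enclitic pronouns off a verb form before lemma lookup. The clitic list and the lookup table below are toy data.

```python
# Hypothetical sketch of splitting enclitic pronouns off Spanish verb forms
# before lemma lookup. Toy data; real clitic handling is more involved.
import unicodedata

VERB_LOOKUP = {"da": "dar"}  # toy lookup: surface form -> lemma
CLITICS = ("me", "te", "se", "lo", "la", "le", "nos")

def strip_accents(s):
    # "dámelo" -> "damelo": drop combining accent marks after NFD decomposition
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def split_clitics(token):
    """Return (verb lemma, attached clitics) for forms like 'Dámelo'."""
    base = strip_accents(token.lower())
    clitics = []
    changed = True
    while changed and base not in VERB_LOOKUP:
        changed = False
        for clitic in CLITICS:
            if base.endswith(clitic) and len(base) > len(clitic):
                clitics.insert(0, clitic)   # peel clitics off right to left
                base = base[: -len(clitic)]
                changed = True
                break
    if base in VERB_LOOKUP:
        return VERB_LOOKUP[base], clitics
    return token, []  # no decomposition found; leave the token alone

print(split_clitics("Dámelo"))  # ('dar', ['me', 'lo'])
```

Note the accent stripping: the written accent in "dámelo" only appears because of the attached clitics, so the base form has to be normalised before lookup.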

@oroszgy
Contributor

oroszgy commented Jan 5, 2019

@ines have you looked at lemmy? I believe it would be a great addition for many languages: it employs a simple ML method similar to that described here. I've successfully built a lemmatizer for Hungarian that has a .96 ACC on the CONLL17 test set.

@ines ines changed the title 💫 Rule-based lemmatizers for other languages 💫 Improve rule-based lemmatization and replace lookups Jun 1, 2019
@ines ines removed lang / de German language data and models lang / es Spanish language data and models help wanted Contributions welcome! labels Jun 1, 2019
@SimonF89

SimonF89 commented Jun 2, 2019

Does anyone know a better lemmatizer for German? One of my students wants to run some experiments with German sentences and therefore needs a lemmatizer. Since spaCy has some bugs, we would like to use another lemmatizer until spaCy is capable. Any recommendations?

@ines
Member Author

ines commented Jun 2, 2019

@snape6666 Maybe check out this one: https://github.com/Liebeck/spacy-iwnlp It's a spaCy pipeline component, so you'll get the same API experience. (Also see iwnlp-py.)

@jsalbr

jsalbr commented Oct 20, 2019

@ines: A big problem for me are nouns that get verb lemmas, e.g. the lemma of "Bäume" is "bäumen". I have created a quick fix for the lookup for German nouns based on Wiktionary. It replaces all entries in lang.de.lemmatizer.LOOKUP that

  • start with a capital letter but have a lower-case lemma, and
  • appear as a word form in the list of Wiktionary nouns.

The new lemma is the Wiktionary lemma.
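The filtering described above can be sketched as a simple dict transformation (toy data below; the real fix works on the full LOOKUP table and a Wiktionary-derived noun list):

```python
# Sketch of the correction: replace lookup entries that start with a capital
# letter but map to a lower-case lemma, whenever the form is attested as a
# noun in a Wiktionary-derived list.

def fix_noun_lemmas(lookup, wiktionary_nouns):
    """Return a copy of `lookup` with capitalised noun forms re-lemmatised."""
    fixed = dict(lookup)
    for form, lemma in lookup.items():
        if form[:1].isupper() and lemma[:1].islower() and form in wiktionary_nouns:
            fixed[form] = wiktionary_nouns[form]  # use the Wiktionary lemma
    return fixed

lookup = {"Bäume": "bäumen", "laufen": "laufen"}
wiktionary_nouns = {"Bäume": "Baum"}  # form -> noun lemma
print(fix_noun_lemmas(lookup, wiktionary_nouns))  # {'Bäume': 'Baum', 'laufen': 'laufen'}
```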

I have prepared two files, one with just the changes and one with the complete fix for lang.de.lemmatizer.py: https://github.com/jsalbr/spacy-lemmatizer-de-fix/blob/master/lemmatizer_de_changes.py
https://github.com/jsalbr/spacy-lemmatizer-de-fix/blob/master/lemmatizer_de.py

I could create a pull request for that, if you are not almost done with the German rule-based lemmatizer that includes POS tags.

@ines
Member Author

ines commented Oct 20, 2019

@jsalbr Thanks a lot! And yes, a PR would be great 👍 The only small difference now is that as of spaCy v2.2, the lemma rules and lookups data have moved into a separate package, spacy-lookups-data, and are now stored as JSON. Here's the German lookups file: https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/de_lemma_lookup.json

(A nice advantage of the new setup is that the data is much smaller and can easily be modified at runtime. It also allows us to release new data updates without having to release new versions of spaCy.)

@jsalbr

jsalbr commented Oct 22, 2019

@ines: Sorry, I've now tried for quite a while to figure out how to work with the new lookups, but my changes to de_lemma_lookup.json don't take effect. I have both packages on my Python path.

@ines
Member Author

ines commented Oct 22, 2019

@jsalbr How are you creating the language class? (The models ship with their own serialized tables, so you'll need to test with a blank German language class). Also, the files are provided and read in via entry points, so the spacy-lookups-data package needs to be installed as a package (e.g. via python setup.py develop).

@jsalbr

jsalbr commented Oct 22, 2019

@ines: Thanks, I finally managed to create the blank language class and created the PR for the correction of German nouns (explosion/spacy-lookups-data#2). But how can I use the new lookup file together with a pretrained model, i.e. how can I re-instantiate / modify the serialized tables?

@ines
Member Author

ines commented Oct 22, 2019

@jsalbr Thanks, I'll take a look! And you can always modify or replace the "lemma_lookup" table in nlp.vocab.lookups at runtime, see here for the API details: https://spacy.io/api/lookups When you save out the model, it'll then also serialize the modified rules (just like when you update the tokenization rules etc.)

@jsalbr

jsalbr commented Oct 23, 2019

@ines: Thanks, I had a mistake in my code; now it works. What I still don't fully understand is how the use of spacy_lookups_data in spaCy is intended. Of course I can read the lemma lookup into some dict and replace it in nlp.vocab.lookups. Is this the way it's supposed to be done?

Anyway, I'll paste the code for modifying the lemma lookups here, for anybody else who wants to tamper with the default lemmas, and for Ines to comment if I missed something.

# variant 1: update lemma lookup table (good for small fixes)
nlp = spacy.load('de')
nlp.vocab.lookups.get_table('lemma_lookup').set("Haus", "Haus")
# get("Haus") now returns "Haus" instead of "hausen"

# variant 2: completely replace lemma lookup table
# e.g. with the file content of spacy_lookups_data.de['lemma_lookup']
my_lemma_lookup = {"Haus": "Haus"}  # some dict with { token: lemma }
nlp.vocab.lookups.remove_table('lemma_lookup')
nlp.vocab.lookups.add_table('lemma_lookup', my_lemma_lookup)

# test
text = "Dieser Gärtner wohnt im Haus."
doc = nlp(text)
print(" ".join([f"{t}/{t.lemma_}/{t.pos_}" for t in doc if not t.is_punct]))

@ines
Member Author

ines commented Oct 23, 2019

@jsalbr I hope I understand your question correctly! The spacy-lookups-data package advertises entry points that spaCy reads when setting up the language subclass (see here). For example, when you create German (language code de), the entry point for the German tables is loaded. So if the lookups package is installed in the same environment, the data is populated automatically.

When you serialize a model by saving out the nlp object to disk, the tables are saved with the model package. This allows keeping the data in sync – you typically don't want a small update to the data package to have unintended effects on your model.

@sontek

sontek commented Jan 12, 2020

Is there a place I can go to track the rule-based Spanish work? I've been having major issues with Spanish for months, and my customers are starting to complain. I've had to switch to stanfordnlp to get accuracy, but it took our calls from seconds to minutes, which isn't ideal either. I'd love a place to track and help with any work going into the Spanish rule system.

#2710
#4341
#4255
#4253

Is there a branch or anything that's being worked on that I could help contribute to? Is there anyone doing the work that needs funding to finish it?

@lmorillas

> Is there a place I can go to track the rule based Spanish work? [...]

Guadalupe Romero (@_guadiromero) said on Twitter that she was working on it (https://twitter.com/_guadiromero/status/1213211033541758979), but I have no more info.
I need to improve the Spanish POS tagging & lemmatization too.

@benitocm

Hi,

I am also interested in those issues:

#2710
#4341
#4255
#4253

Just to be in the loop

@pbouda

pbouda commented Apr 7, 2020

Changing lemma_lookup does not seem to work for English; am I doing something wrong?

import spacy
nlp = spacy.load('en_core_web_lg')
nlp.vocab.lookups.get_table('lemma_lookup').set("data", "data")

text = "This is new personal data to process."
doc = nlp(text)
print(" ".join([f"{t}/{t.lemma_}/{t.pos_}" for t in doc if not t.is_punct]))

Output:

This/this/DET is/be/AUX new/new/ADJ personal/personal/ADJ data/datum/NOUN to/to/PART process/process/VERB

@pbouda

pbouda commented Apr 14, 2020

FYI, it works if I add it to the lemma_exc table like this:

table = nlp.vocab.lookups.get_table("lemma_exc")
table["noun"]["data"] = ["data"]

@adrianeboyd
Contributor

@pbouda Yes, if you have a pipeline with a tagger like one of the provided English models it uses rules + exc, but if it doesn't have a tagger it uses lookup. (This is confusing!)
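The dispatch described here can be sketched in plain Python (toy tables; not spaCy's actual code): with a tagger in the pipeline, the lemmatizer consults POS-keyed exceptions (plus suffix rules, elided below), and only falls back to the flat lookup table when no tagger is available.

```python
# Sketch of the lemmatizer dispatch: rules + exceptions with a tagger,
# plain lookup without one. Toy data; illustrative only.

def lemmatize(token, pos, has_tagger, lemma_exc, lemma_lookup):
    if has_tagger and pos is not None:
        exc = lemma_exc.get(pos.lower(), {})
        if token in exc:
            return exc[token][0]
        return token  # (the real lemmatizer would apply suffix rules here)
    return lemma_lookup.get(token, token)

lemma_exc = {"noun": {"data": ["data"]}}
lemma_lookup = {"data": "datum"}

print(lemmatize("data", "NOUN", True, lemma_exc, lemma_lookup))  # data
print(lemmatize("data", None, False, lemma_exc, lemma_lookup))   # datum
```

This is why editing lemma_lookup had no effect in the example above: with a tagged pipeline, that table is never consulted.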

@pbouda

pbouda commented Apr 14, 2020

@adrianeboyd Yes, I've now spent a day or so going through the code to find out what's happening during lemmatization in general; definitely not the simplest part of spaCy...

@gtoffoli
Copy link
Contributor

gtoffoli commented Apr 19, 2021

Hi, I've just added the related post #7824: Improved Italian lemmatizer: ongoing work or plans?
I'm looking for up-to-date information and I'm available to do some work, if needed.

@cayorodriguez
Contributor

Any news on when we can expect a trainable lemmatizer, now that we have transformers available, that looks at contexts and distinguishes senses? It would be a huge addition to spaCy, especially for applications where dimensionality reduction is important.

@polm
Contributor

polm commented Jul 28, 2021

Howdy folks, since all the issues linked at the top of this were resolved a while ago, and since we are gradually moving away from using issues for future enhancements, I am going to close this.

If you have questions about or are working on a specific lemmatizer it's better to open a thread in Discussions now.

About the trainable lemmatizer in particular, we don't have an implementation yet. You can read more about the current state of affairs in #6903 or #8686. While this is a feature we would love to have, and we would welcome PRs, we are not actively working on it at the moment.

@polm polm closed this as completed Jul 28, 2021
@giopina

giopina commented Oct 12, 2021

(Not sure if this is the right thread for this)

I'm having issues with the lemmatizer for German (both in v3.0.6 and v2.3.2, using both de_core_news_lg and de_dep_news_trf).
Basically, the lemmatizer gets confused by the capitalization of the verb, and can't assign the right lemma to it (while the POS tagger is actually correct).

Code:

import spacy
nlp = spacy.load("de_core_news_lg")
doc = nlp('Meldet sich keiner')
for token in doc:
    print(token,token.lemma_,token.pos_,token.tag_)

Output:

Meldet Meldet VERB VVFIN
sich sich PRON PRF
keiner kein PRON PIS

(the lemma of meldet should be melden)

Is there some general solution/fix to this issue?
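One possible workaround (not a built-in spaCy feature; toy lookup table below) is a case fallback: for tokens tagged as verbs, try the lower-cased form in the lookup table before giving up, since German sentence-initial verbs like "Meldet" are capitalised but stored lower-cased.

```python
# Hypothetical case fallback for lookup lemmatization of capitalised verbs.

def lemma_with_case_fallback(token, pos, lookup):
    if token in lookup:
        return lookup[token]
    # Sentence-initial verbs are capitalised; retry with the lower-cased form.
    if pos == "VERB" and token.lower() in lookup:
        return lookup[token.lower()]
    return token

lookup = {"meldet": "melden"}
print(lemma_with_case_fallback("Meldet", "VERB", lookup))  # melden
```

The POS check matters for German: lower-casing nouns indiscriminately would break lemmas like "Haus", which are legitimately capitalised.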

@polm
Contributor

polm commented Oct 13, 2021

@giopina You should open another issue for this, probably in Discussions.

@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 13, 2021