💫 Improve rule-based lemmatization and replace lookups #2668
Hello, I will do a PR soon to introduce rule-based lemmatization for French, based on the following lemma dictionary, which is under this license. I have created nouns, adjectives and adverbs files, largely inspired by the English lemmatizer structure, and would like to let the lookup table do the job for the verbs. However, since I introduced those files, the lemmatizer no longer looks at the lookup table and therefore does not return the infinitive form of the verb when calling the lemma_ method on a verb token. I don't know where this behaviour comes from. I have checked several scripts, from lemmatizer.py to language.py, but I have no clue. Could you please take a look? Here is the link to the branch in question. Thank you in advance for your help, Amaury |
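For context, a sketch of the rule format spaCy's English lemmatizer uses, which the French files presumably mirror; the French suffix rules below are hypothetical examples, not the data from the linked branch:

# Format used by spaCy's rule-based lemmatizers (as in lang/en/lemmatizer):
# per-POS lists of [suffix, replacement] pairs, applied together with index
# and exception word lists. These French rules are made-up illustrations.
ADJECTIVE_RULES = [
    ["es", ""],  # hypothetical: vertes -> vert
    ["s", ""],   # hypothetical: verts -> vert
    ["e", ""],   # hypothetical: verte -> vert
]
NOUN_RULES = [
    ["s", ""],   # hypothetical: maisons -> maison
]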
In Spanish, the lemmatizer appears to treat verbs with clitic pronouns attached as separate words. I know this is tough to implement, but it's very important, especially for speech transcripts. For example, I get Da -> Dar (Correct) |
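Purely to illustrate the request above (this is not spaCy's implementation), a naive pre-processing pass could strip clitic endings before lemmatizing; the clitic list and the length guard below are rough assumptions:

# Naive illustration only: strip common Spanish clitic pronoun endings from
# a verb form. Real handling needs POS context, accent restoration
# (e.g. "dámelo"), and many more exceptions.
CLITICS = ["melo", "mela", "selo", "sela", "les", "nos",
           "me", "te", "se", "le", "lo", "la"]

def strip_clitics(form: str) -> str:
    # Try longer clitic clusters first so "selo" wins over "se"/"lo".
    for clitic in sorted(CLITICS, key=len, reverse=True):
        # Length guard so short forms like "ve" aren't mangled.
        if form.endswith(clitic) and len(form) - len(clitic) >= 2:
            return form[: -len(clitic)]
    return form

print(strip_clitics("darle"))    # dar
print(strip_clitics("decirme"))  # decir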
@ines have you looked at lemmy? I believe it would be a great addition for many languages: it employs a simple ML method similar to that described here. I've successfully built a lemmatizer for Hungarian that has a .96 ACC on the CoNLL17 test set. |
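For reference, this is roughly what using lemmy looks like; the "da" model code and the lemmatize() signature are quoted from memory of its README, so treat them as assumptions:

# Sketch of lemmy's API (assumption: load() takes a language code and
# lemmatize() takes a POS tag plus the word form, returning candidates).
import lemmy

lemmatizer = lemmy.load("da")
print(lemmatizer.lemmatize("NOUN", "akvariernes"))  # e.g. ['akvarium']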
Does anyone know a better lemmatizer for German? One of my students wants to run some experiments with German sentences and therefore needs a lemmatizer. Since spaCy has some bugs, we would like to use another lemmatizer until spaCy is capable. Any recommendations? |
@snape6666 Maybe check out this one: https://github.com/Liebeck/spacy-iwnlp It's a spaCy pipeline component, so you'll get the same API experience. (Also see |
@ines: A big problem for me is nouns that get verb lemmas, e.g. the lemma of "Bäume" comes out as "bäumen". I have created a quick fix for the lookup for German nouns based on Wiktionary. It replaces all entries in lang.de.lemmatizer.LOOKUP that
I have prepared two files, one with just the changes and one with a complete fix for lang.de.lemmatizer.py: https://github.com/jsalbr/spacy-lemmatizer-de-fix/blob/master/lemmatizer_de_changes.py I could create a pull request for that, if you are not almost done with the German rule-based lemmatizer that includes POS tags. |
@jsalbr Thanks a lot! And yes, a PR would be great 👍 The only small difference now is that as of spaCy v2.2, the lemma rules and lookups data have moved into a separate package, spacy-lookups-data. (A nice advantage of the new setup is that the data is much smaller and can easily be modified at runtime. It also allows us to release new data updates without having to release new versions of spaCy.) |
@ines: Sorry, I tried for quite a while now to figure out how to work with the new lookups, but my changes to de_lemma_lookup.json don't take effect. I have both packages in my Python path. |
@jsalbr How are you creating the language class? (The models ship with their own serialized tables, so you'll need to test with a blank language class.) |
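For anyone following along, a minimal sketch of the suggested test setup, assuming spaCy v2.2+ with spacy-lookups-data installed:

# A blank language class picks up its lookup tables from the
# spacy-lookups-data package, whereas pretrained models ship with their
# own serialized tables.
import spacy

nlp = spacy.blank("de")
print(nlp.vocab.lookups.tables)  # should include 'lemma_lookup'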
@ines: Thanks, I finally managed to create the blank language class and created the PR for the correction of German nouns (explosion/spacy-lookups-data#2). But how can I use the new lookup file together with a pretrained model, i.e. how can I re-instantiate / modify the serialized tables? |
@jsalbr Thanks, I'll take a look! And you can always modify or replace the |
@ines: Thanks, I had a mistake in my code, now it works. What I still don't fully understand is how the use of spacy_lookups_data in spaCy is intended. Of course I can read the lemma lookup into some dict and replace it in nlp.vocab.lookups. Is this the way it's supposed to be done? Anyway, I'll paste the code for modifying the lemma lookups here, for anybody else who wants to tamper with the default lemmas, and for Ines to comment if I missed something.

import spacy

# variant 1: update lemma lookup table (good for small fixes)
nlp = spacy.load('de')
nlp.vocab.lookups.get_table('lemma_lookup').set("Haus", "Haus")
# get("Haus") now returns "Haus" instead of "hausen"

# variant 2: completely replace lemma lookup table
# e.g. with the file content of spacy_lookups_data.de['lemma_lookup']
my_lemma_lookup = {}  # placeholder: some dict with {token: lemma}
nlp.vocab.lookups.remove_table('lemma_lookup')
nlp.vocab.lookups.add_table('lemma_lookup', my_lemma_lookup)

# test
text = "Dieser Gärtner wohnt im Haus."
doc = nlp(text)
print(" ".join(f"{t}/{t.lemma_}/{t.pos_}" for t in doc if not t.is_punct)) |
@jsalbr I hope I understand your question correctly! When you serialize a model by saving out the |
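A minimal sketch of that round trip, assuming spaCy v2.2+; the output path is hypothetical:

# Modified lookup tables are saved together with the vocab, so they
# survive nlp.to_disk() / spacy.load().
import spacy

nlp = spacy.load("de_core_news_sm")
nlp.vocab.lookups.get_table("lemma_lookup").set("Haus", "Haus")
nlp.to_disk("/tmp/de_custom_lemmas")  # hypothetical path

nlp2 = spacy.load("/tmp/de_custom_lemmas")
print(nlp2.vocab.lookups.get_table("lemma_lookup").get("Haus"))  # 'Haus'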
Is there a place I can go to track the rule-based Spanish work? I've been having major issues with Spanish for months, and my customers are starting to complain. I've had to switch to stanfordnlp to get accuracy, but it took our calls from seconds to minutes, which isn't ideal either. I'd love a place to track and help with any work going into the Spanish rule system. Is there a branch or anything that's being worked on that I could help contribute to? Is there anyone doing the work that needs funding to finish it? |
Guadalupe Romero (@_guadiromero) said on Twitter that she was working on it (https://twitter.com/_guadiromero/status/1213211033541758979), but I don't have more info. |
Changing
Output:
|
FYI, it works if I add it to the
|
@pbouda Yes, if you have a pipeline with a tagger, like one of the provided English models, it uses |
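A minimal sketch of the behaviour described, assuming spaCy v2.x with spacy-lookups-data installed (exact outputs depend on the model):

# With a tagger in the pipeline, English lemmatization applies POS-aware
# rules; a blank pipeline without a tagger falls back to the lookup table.
import spacy

nlp = spacy.load("en_core_web_sm")  # pipeline includes a tagger
blank = spacy.blank("en")           # no tagger

print([t.lemma_ for t in nlp("She was running fast")])
print([t.lemma_ for t in blank("She was running fast")])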
@adrianeboyd Yes, I've now spent a day or so going through the code to find out what's happening during lemmatization in general; it's definitely not the simplest part of spaCy... |
Hi, I've just added the related post #7824: Improved Italian lemmatizer: ongoing work or plans? |
Any news on when we can expect a trainable lemmatizer, now that we have transformers available, that looks at context and distinguishes senses? It would be a huge addition to spaCy, especially for applications where dimensionality reduction is important. |
Howdy folks, since all the issues linked at the top of this thread were resolved a while ago, and since we are gradually moving away from using issues for future enhancements, I am going to close this. If you have questions about, or are working on, a specific lemmatizer, it's better to open a thread in Discussions now. About the trainable lemmatizer in particular: we don't have an implementation yet. You can read more about the current state of affairs in #6903 or #8686. While this is a feature we would love to have, and we would welcome PRs, we are not actively working on it at the moment. |
(Not sure if this is the right thread for this) I'm having issues with the lemmatizer for German (both in v3.0.6 and v2.3.2, using both Code:
Output:
(The lemma of "meldet" should be "melden".) Is there some general solution/fix to this issue? |
@giopina You should open another issue for this, probably in Discussions. |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
This is an enhancement issue that's been high on our list for a while, and something we'd love to tackle for as many languages as possible. At the moment, spaCy only implements rule-based lemmatization for very few languages. Many of the other languages do support lemmatization, but only via lookup tables, which are less reliable and don't always produce good results. They're also large and add significant bloat to the library.
Existing rule-based lemmatizers
lang/en/lemmatizer
lang/el/lemmatizer
lang/nb/lemmatizer
Most problematic lookup lemmatizers
Just my subjective selection, but these are the languages we want to prioritise. Of course, we'd also appreciate contributions to any of the other languages!
French: #2659, #2251, #2079