
Additional inflection data for RU & UK #3

Open
1over137 opened this issue Oct 12, 2021 · 25 comments
Labels
question (Further information is requested)

Comments

@1over137
Contributor

Hi,
I'm the author of SSM which is a language learning utility for quickly making vocabulary flashcards. Thanks for this project! Without this it would have been difficult to provide multilingual lemmatization, which is an essential aspect of this tool.

However, I found that this is not particularly accurate for Russian. PyMorphy2 is a lemmatizer for Russian that I used in other projects. It's very fast and accurate in my experience, much more than spacy or anything else. Any chance you can include PyMorphy2's data in this library?

@adbar
Owner

adbar commented Oct 13, 2021

Hi @1over137, thanks for your feedback! Yes, results for some languages are not particularly good.

What makes PyMorphy2 special is a series of heuristics targeting these two languages in particular; it's not as if one could simply copy over the list they have.

That's a clear limitation of the approach used by simplemma, but I'd like to keep maintenance and the code structure at an easily manageable level...

@1over137
Contributor Author

> Hi @1over137, thanks for your feedback! Yes, results for some languages are not particularly good.
>
> What makes PyMorphy2 special is a series of heuristics targeting these two languages in particular; it's not as if one could simply copy over the list they have.
>
> That's a clear limitation of the approach used by simplemma, but I'd like to keep maintenance and the code structure at an easily manageable level...

Hm, would it be possible to pass a huge number of tokens through PyMorphy2 (let's say from a corpus) and generate such a list yourself? Or would that list be too large?
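
(For illustration, a rough sketch of what such a pass could look like; the corpus file name and the frequency cutoff below are just placeholders:)

# Hypothetical sketch: derive token -> lemma pairs by running corpus tokens
# through pymorphy2 ("corpus.txt" and the cutoff are made-up examples).
import re
from collections import Counter

import pymorphy2

morph = pymorphy2.MorphAnalyzer()
counts = Counter()

with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(re.findall(r"[а-яё]+", line.lower()))

pairs = {}
for token, freq in counts.items():
    if freq < 3:  # skip very rare tokens, which are often typos
        continue
    lemma = morph.parse(token)[0].normal_form  # first (most likely) analysis
    if lemma != token:
        pairs[token] = lemma

with open("ru_pairs.tsv", "w", encoding="utf-8") as out:
    for token, lemma in sorted(pairs.items()):
        out.write(f"{token}\t{lemma}\n")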

@adbar
Owner

adbar commented Oct 13, 2021

I think it would be better to understand what's wrong first. Are there any systematic errors that we could correct?

@1over137
Contributor Author

> I think it would be better to understand what's wrong first. Are there any systematic errors that we could correct?

Coverage is quite low for certain classes of words, such as participles (mainly adjectives derived from verbs, ending in -мый or -ющий in the nominative), especially those with a reflexive ending (-ся). (This is just an impression from clicking on words on pages.)

Also, it's quite easy to combine words in Russian to make new ones which may not be on the list. Is this supposed to be addressed by using the greedy option?

@adbar
Owner

adbar commented Oct 14, 2021

Classes of words are a good place to start. Maybe they weren't in the word lists I used, or something is wrong with the decomposition rules.

The greedy option is indeed supposed to try everything and can even work as a stemmer in some cases. However, its logic cannot be extrapolated to all supported languages; e.g. it doesn't work for Urdu.
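
(For reference, the option is just a flag on lemmatize; a minimal usage sketch, assuming the lang=/greedy= keyword form used later in this thread and in recent releases:)

from simplemma import lemmatize

lemmatize("зафиксированные", lang="ru")               # standard lookup
lemmatize("зафиксированные", lang="ru", greedy=True)  # greedy mode also tries affix decomposition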

@adbar
Owner

adbar commented Oct 19, 2021

@1over137 I reviewed the files and added further word pairs in e0c0456, so the library should now get better results for Russian.

@adbar adbar added the question Further information is requested label Nov 1, 2021
@adbar adbar closed this as completed Sep 2, 2022
@1over137
Contributor Author

1over137 commented Dec 28, 2022

Hi,
Recently it caught my attention that pymorphy2 has gone over two years without new commits, so I'm looking to replace it as a dependency (it doesn't seem to support Python 3.11).
I first tried to evaluate how well simplemma currently works compared to pymorphy2. I took all the words in the lenta.ru news dataset and removed all words containing non-alphabetic characters or capitalized letters. I was left with roughly 448k unique words, which I passed through both lemmatizers. The results differed for about 196k of these words (understandable, since those are likely to be lower-frequency). The differences table is in difference.csv.
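
(Roughly, the comparison was done along the lines of the following sketch; the token file name and the exact filter are approximations:)

# Rough sketch of the comparison; "lenta_tokens.txt" is a placeholder for the extracted token list.
import csv
import re

import pymorphy2
from simplemma import lemmatize

morph = pymorphy2.MorphAnalyzer()
cyrillic = re.compile(r"^[а-яё-]+$")  # lowercase Cyrillic only, no digits or Latin

with open("lenta_tokens.txt", encoding="utf-8") as f:
    words = {w.strip() for w in f if cyrillic.match(w.strip())}

with open("difference.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["word", "pymorphy2", "simplemma"])
    for word in sorted(words):
        p = morph.parse(word)[0].normal_form
        s = lemmatize(word, lang="ru")
        if p != s:
            writer.writerow([word, p, s])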

Observations:

  • As I observed earlier, most of these are adjectives, especially ones derived from verbs. pymorphy2 always converts them back to the verb when there is one.
  • Some of these are common compound nouns and adjectives.
  • Surprisingly, there are also some relatively common words that seem to be missing.
  • pymorphy2 tries to add ё back to words (its use is "optional" in Russian for historical reasons, but it is quite useful for dictionary lookups).

Note that the table apparently contains some typos present in the source, so some more improvements should probably be made before adding the data. I should also include another corpus that covers other domains.

@adbar adbar reopened this Dec 29, 2022
@adbar
Owner

adbar commented Dec 29, 2022

Hi @1over137, thanks for listing the differences. I like your approach, but I'm not sure how to modify the software to improve performance. Judging from your results, most differences come from Simplemma not intervening in certain tricky cases.

If you know where to find a list of common words along with their lemmata I could add it to the list.

We could also write rules to cover additional vocabulary without needing a new list. Here is a basic example for English, what do you think?

def apply_en(token: str) -> Optional[str]:
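
(Only the signature is shown above; purely as an illustration of the idea, and not the actual simplemma code, such a rule function could look like this:)

from typing import Optional

def apply_en(token: str) -> Optional[str]:
    """Illustrative sketch only: map a couple of regular English endings
    back to a lemma, return None when no rule applies."""
    if len(token) > 7 and token.endswith("ities"):
        return token[:-3] + "y"   # e.g. "possibilities" -> "possibility"
    if len(token) > 7 and token.endswith("nesses"):
        return token[:-2]         # e.g. "weaknesses" -> "weakness"
    return None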

@1over137
Contributor Author

1over137 commented Dec 30, 2022

I'm afraid that the inflection rules for Russian are oftentimes not very reversible. The first source of such rules I can think of would be Migaku's de-inflection tables. They might not be very suitable, though, because their design allows for multiple possible lemma forms for a given ending, which is acceptable for their use case (dictionary lookups; dictionaries only have actual words as entries).
Here's a very long JSON file listing which lemma endings may correspond to a word:
conjugations.txt (GitHub doesn't want me to upload .json)

One way to use this would probably be running it through a known word list (e.g. hunspell dictionaries) to see which possible lemma-forms are actually words, but that probably involves a significant architecture change to the program.
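
(Very roughly, and with a made-up mini de-inflection table plus a tiny stand-in for a known-word list, the idea would be something like this:)

# Hypothetical sketch of "generate candidates, keep only real words".
# DEINFLECTION and KNOWN_WORDS are placeholders, e.g. built from the
# conjugations file above and a hunspell dictionary.
DEINFLECTION = {"ющийся": ["ться"], "емый": ["ть"]}  # ending -> possible lemma endings
KNOWN_WORDS = {"читаться", "читать"}                 # tiny stand-in dictionary

def candidate_lemmas(token: str) -> list:
    candidates = []
    for ending, lemma_endings in DEINFLECTION.items():
        if token.endswith(ending):
            stem = token[: -len(ending)]
            candidates.extend(stem + le for le in lemma_endings)
    return [c for c in candidates if c in KNOWN_WORDS]

print(candidate_lemmas("читающийся"))  # ['читаться']
print(candidate_lemmas("читаемый"))    # ['читать']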

When I have more time I'll run pymorphy2 through a few more corpora and somehow remove the typos, and filter out the least common words to supplement the current dictionary.

@adbar
Owner

adbar commented Dec 30, 2022

Thanks for the file. It's not particularly long, and I'm not sure I could use it directly (what's the source?), but it shows the problem with the current approach.
I've been using inflection lists from this project: https://github.com/tatuylonen/wiktextract
The coverage is not particularly good for Russian as of now but it should get better over time.

@adbar adbar changed the title from "Add PyMorphy2 data" to "Additional inflection data for RU & UK" on Dec 30, 2022
@1over137
Contributor Author

1over137 commented Jan 11, 2023

Sorry for forgetting about this thread.
The source of the JSON file is https://mega.nz/folder/eyYwyIgY#3q4XQ3BhdvkFg9KsPe5avw/folder/bz4ywa5A, which is apparently curated by people from Migaku (language learning software, https://migaku.io). I have no idea where they got it from, but as you can see from just how many candidates this would create, I don't think it would be that useful unless you intend to support providing multiple potential lemmas or using a supplemental dictionary.
On another note, this user came up with a method of extracting inflection tables from the Russian Wiktionary (download the JSON). It should give much better results, given that the RU Wiktionary has a much higher headword count for Russian than the EN Wiktionary (311k vs. 73k) and its inflection tables are autogenerated. This would be a much easier solution for my specific issue.

@adbar
Owner

adbar commented Jan 12, 2023

No worries, thanks for providing additional information.

I get your point: although the EN Wiktionary is going to get better over time, using the RU Wiktionary would make sense.

As I'm more interested in expanding coverage through rules at the moment, I looked at the conjugations in the Migaku data; they don't look so reliable to me, but I'm going to keep an eye on the resources you shared. If you know where to find a list of common suffixes and their corresponding forms, I'd be interested.

@1over137
Contributor Author

1over137 commented Jan 14, 2023

https://en.wikipedia.org/wiki/Russian_declension#Nouns is a good start. However, there are going to be a lot of overlaps (as in, a token may be one form of this hypothetical lemma or another form of that lemma, and there would be no way of distinguishing them without knowledge of the words themselves). I don't really think rules would be very useful, other than for a few narrow categories of words (abstract nouns ending in -ость or -ство, adjectives in -ский).
IMO a more useful approach is to somehow attempt to decompose words and try to apply the lemmatization table to the word minus the prefix, and then put the prefix back on. This should work as long as the decomposition is done correctly. While Russian has a large number of total lemmas, most of them are just compound words instead of words of a completely different origin, so this should work quite well.

Example where this can be useful:

In [10]: lemmatize("зафиксированные", lang='ru')
Out[10]: 'зафиксированные'

In [11]: lemmatize("фиксированные", lang='ru')
Out[11]: 'фиксированный'
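
(A rough sketch of that prefix idea, using simplemma's own lookup as the inner step; the prefix list is illustrative only, and the "did the lemma change" test is a crude stand-in for a real dictionary check:)

from simplemma import lemmatize

PREFIXES = ("за", "пере", "при")  # illustrative subset, not a real list

def lemmatize_with_prefix(token: str, lang: str = "ru") -> str:
    lemma = lemmatize(token, lang=lang)
    if lemma != token:  # crude "word was found" check for this sketch
        return lemma
    for prefix in PREFIXES:
        rest = token[len(prefix):]
        if token.startswith(prefix) and len(rest) > 2:
            rest_lemma = lemmatize(rest, lang=lang)
            if rest_lemma != rest:
                return prefix + rest_lemma  # re-attach the prefix
    return token

# e.g. lemmatize_with_prefix("зафиксированные") -> "за" + "фиксированный"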

Edit: seems like greedy works in this case. Can you explain how it works?

@adbar
Owner

adbar commented Jan 17, 2023

@1over137 Yes, greedy works because it can take into account affixes of up to 2 characters in an unsupervised way. I decided to make it the default for languages for which this setting is beneficial (like Russian).

I also added the following rules to save dictionary space (endings) and get better coverage (supervised prefix search). I'm not seeing a real impact on the German and Russian treebanks, but that's probably because the words were already covered.

We could now implement more rules or adopt a similar approach for Ukrainian. Could you please make suggestions?
(-ский turned out to be too noisy/irregular; I also didn't use all possible declensions of -ость and -ство.)

RUSSIAN_PREFIXES = {"за", "много", "недо", "пере", "пред", "само"}

RUSSIAN_ENDINGS = {
    # -ость
    "ости": "ость",
    "остью": "ость",
    "остию": "ость",
    "остьи": "ость",
    "остии": "ость",
    "остьхъ": "ость",
    "остьма": "ость",
    "остьмъ": "ость",
    "остиѭ": "ость",
    "остьми": "ость",
    # -ство
    "ства": "ство",
    "ств": "ство",
    "ству": "ство",
    "ствам": "ство",
    "ством": "ство",
    "ствами": "ство",
    "стве": "ство",
    "ствах": "ство",
}
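
(For reference, a hypothetical helper showing how such tables could be applied as a fallback; this is only a sketch using the two tables above, not the actual simplemma code:)

from typing import Optional

def apply_ru_rules(token: str) -> Optional[str]:
    # Sketch: optionally strip a known prefix, then map a known ending
    # to its lemma ending and re-attach the prefix.
    prefix, rest = "", token
    for p in RUSSIAN_PREFIXES:
        if token.startswith(p):
            prefix, rest = p, token[len(p):]
            break
    for ending, lemma_ending in RUSSIAN_ENDINGS.items():
        if rest.endswith(ending):
            return prefix + rest[: -len(ending)] + lemma_ending
    return None

# e.g. apply_ru_rules("скорости") -> "скорость"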

@1over137
Contributor Author

I'm not sure where you got these endings from; they seem quite weird, with some of them containing symbols no longer used in modern Russian. I'm also not sure what you mean by -ский being too noisy or irregular, can you elaborate? It may be useful to note that some adjectives ending in -ский get used as nouns by omitting the noun they modify, so the intended dictionary form could be a non-masculine gender (for example, a street name, which would be feminine). I tried to fix this list a bit:

RUSSIAN_ENDINGS = {
    # -ость
    "ости": "ость",
    "остью": "ость",
    "остей": "ость",
    "остям": "ость",
    "остями": "ость",
    "остях": "ость",
    # -ство
    "ства": "ство",
    "ств": "ство",
    "ству": "ство",
    "ствам": "ство",
    "ством": "ство",
    "ствами": "ство",
    "стве": "ство",
    "ствах": "ство",
    # -ский
    "ская": "ский",
    "ское": "ский",
    "ской": "ский",
    "скую": "ский",
    "ском": "ский",
    "ским": "ский",
    "ских": "ский",
    "ские": "ский",
    "скому": "ский",
    "скими": "ский",
    "скою": "ский",
    "ского": "ский",
}

@1over137
Contributor Author

1over137 commented Jan 17, 2023

There are also many more prefixes. Here is a good list.
With all this, I would still strongly suggest processing and merging the data from Russian Wiktionary. With these rules you can probably get rid of 100k entries.
I don't know much Ukrainian, but I assume the rules would be very similar if you swap in the appropriate endings. Russian Wiktionary also has a lot of entries for Ukrainian, so maybe the script originally used to extract Russian entries can be used for that too. @Vuizur would you be able to do that?

@adbar
Owner

adbar commented Jan 18, 2023

Thanks for the suggestions! I don't understand everything I change for Russian and Ukrainian, but I tried to adapt suffix lists from the English Wiktionary; I think I took the dated ones by mistake.

  • I tried again, and -ский is a bit too noisy/irregular: using those rules degrades the overall performance. The other rules work, thanks!
  • I will add a few prefixes, especially long, frequent ones.

I tried a newer version of the English Wiktionary and the Russian Wiktionary to expand the existing language data. Both degrade performance on the benchmark (UD-GSD). I could switch to another treebank, but the main problem now seems to be the normalization of accented entries.

Take, for example, ско́рость: should it stay as it is or be normalized to скорость?

@1over137
Contributor Author

1over137 commented Jan 19, 2023

This is a good question; you have to make a decision on this. Russian normally uses accents only for clarification; they are not present in normal text. Another similar, but slightly more important, decision is whether to put ё back where appropriate, remove it in all cases, or simply handle both spellings as separate words.

"Reduced performance" is not quite enough information here. Can you compile a differences table between different choices of rules enabled, dictionaries used, etc.? These corpora are not always 100% accurate either. It would be interesting to see what's behind the performance differences.

@Vuizur

Vuizur commented Jan 20, 2023

@1over137 The updated pymorphy2 fork pymorphy3 might also be interesting to you. It apparently has Python 3.11 support. The new spaCy version already uses it.

About the integration of the Wiktionary data: I think one probably has to remove the accent marks everywhere to get good results. (In Python this can be done with word.replace("\u0301", "").replace("\u0300", "").) The benchmarks will likely not contain any accented text (or only one word in thousands).

I also thought about integrating other languages in my Russian Wiktionary scraper, but I am still considering if I should try to add the Russian Wiktionary to wiktextract. (It is highly possible that this is nightmarishly difficult though.)

@adbar
Owner

adbar commented Jan 20, 2023

Interesting thoughts, thanks. Sadly I lack the time to perform in-depth analyses of what's happening here; I look at the lemmatization accuracy and try to strike a balance.

The newest version out today comprises two significant changes for Russian and Ukrainian: better data (especially for the latter) and affix decomposition (now on by default). @1over137 You should hopefully see some improvement.

@Vuizur Since lemmatization can also be seen as a normalization task, it makes sense to remove accent marks. I'm going to look into it.
I saw that turning "ё" into "e" at the end of words allowed for some progress, but I'm open to adapting the processing of "ё".

@1over137
Contributor Author

1over137 commented Jan 20, 2023

Note that if you remove accent marks this way, you must first normalize the Unicode to NFD or NFKD; you can't guarantee the normalization form of your input.
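
(e.g., a minimal sketch of the combined step:)

import unicodedata

def strip_accents(word: str) -> str:
    # Decompose so acute/grave marks become separate combining characters,
    # drop U+0301 and U+0300, then recompose (this keeps ё/Ё intact).
    decomposed = unicodedata.normalize("NFD", word)
    cleaned = decomposed.replace("\u0301", "").replace("\u0300", "")
    return unicodedata.normalize("NFC", cleaned)

print(strip_accents("ско́рость"))  # скорость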
So, were the Russian Wiktionary entries added?
One possible reason for the degraded performance is that many forms are ambiguous. Did you let the Wiktionary data override existing data, or only merge the non-conflicting parts?

@1over137
Contributor Author

> Interesting thoughts, thanks. Sadly I lack the time to perform in-depth analyses of what's happening here; I look at the lemmatization accuracy and try to strike a balance.

I made a PR to let test/udscore.py write CSV files, so it will be easier to compare by just running the tests with different dictionaries and sending the CSV files here.

At a cursory glance, I noticed a few odd errors, such as words being lemmatized into unrelated words:
свобода ("freedom") -> слова ("word, genitive")
государственный ("rel. adj. for state") -> районный ("rel. adj. for district")
День ("day") -> Победы ("victory, genitive"), but only in capitalized form

The most obvious thing here is that they seem to come from common phrases (свобода слова - freedom of speech, День Победы - Victory Day). Can you check what may have caused this?

Also, there are quite a few English words, or other words made of Latin characters. Maybe we should skip tokens containing Latin characters, since a Russian lemmatizer should not be expected to handle them. A large number of the remaining errors are capitalization issues.
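
(e.g., a simple filter along these lines:)

import re

LATIN = re.compile(r"[A-Za-z]")

def should_lemmatize_ru(token: str) -> bool:
    # Skip tokens containing any Latin letter; a Russian lemmatizer
    # should not be expected to handle them.
    return not LATIN.search(token)

print(should_lemmatize_ru("Victory"))  # False
print(should_lemmatize_ru("победы"))   # True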

@1over137
Contributor Author

1over137 commented Jan 23, 2023

Sorry for the barrage of comments and PRs, but I decided to do a few experiments myself to create a better dictionary. I combined the current dictionary with the list of word forms obtained through the Wiktionary dump and passed all of them through PyMorphy2, disregarding all previous lemma data. This was done without using anything from the test dataset. The new dictionary file is 7.7 MiB after sorting, which you can presumably reduce by applying rules. The test results are as follows:

greedy:		 0.911
non-greedy:	 0.911
baseline:	 0.525
-PRO greedy:		 0.917
-PRO non-greedy:	 0.918
-PRO baseline:		 0.241

This seems to be a significant improvement over the original. It gets rid of most of the weird results I mentioned, with most remaining errors being pronouns and capitalization issues (for reference, pymorphy2 always lowercases words, even proper names). The advantage of using PyMorphy2 is that it can estimate the probability of each lemma form by corpus frequency, which is optimal for a limited solution like Simplemma that can only provide one answer, whereas the Wiktionary dump seems to contain errors and may override common words with uncommon lemmas, which accounts for its poorer performance.
The dictionary can be generated by running the script below from the project root directory, with ruwiktionary_words.json downloaded. It is rather slow because of PyMorphy2 (15 min on my laptop), but I think it should be worth it.

import json
import unicodedata
import pymorphy2
from tqdm import tqdm
import lzma, pickle

with open("ruwiktionary_words.json") as f:
    wikt = json.load(f)

morph = pymorphy2.MorphAnalyzer()

def removeAccents(word):
    #print("Removing accent marks from query ", word)
    ACCENT_MAPPING = {
        '́': '',
        '̀': '',
        'а́': 'а',
        'а̀': 'а',
        'е́': 'е',
        'ѐ': 'е',
        'и́': 'и',
        'ѝ': 'и',
        'о́': 'о',
        'о̀': 'о',
        'у́': 'у',
        'у̀': 'у',
        'ы́': 'ы',
        'ы̀': 'ы',
        'э́': 'э',
        'э̀': 'э',
        'ю́': 'ю',
        'ю̀': 'ю',
        'я́': 'я',
        'я̀': 'я',
    }
    word = unicodedata.normalize('NFKC', word)
    for old, new in ACCENT_MAPPING.items():
        word = word.replace(old, new)
    return word

def caseAwareLemmatize(word):
    if not word.strip():
        return word
    if word[0].isupper():
        return str(morph.parse(word)[0].normal_form).capitalize()
    else:
        return str(morph.parse(word)[0].normal_form)

with lzma.open('simplemma/data/ru.plzma') as p:
    ru_orig = pickle.load(p)

# Old list, we send it through PyMorphy2
orig_dic = {}
for key in tqdm(ru_orig):
    orig_dic[key] =  caseAwareLemmatize(key)

# New list of word-forms from Wiktionary, we send it through PyMorphy2
lem = {}
for entry in tqdm(wikt):
    for inflection in entry['inflections']:
        inflection = removeAccents(inflection)
        lem[inflection] = caseAwareLemmatize(inflection)

d = {k: v for k, v in sorted((orig_dic | lem).items(), key=lambda item: item[1])}

with lzma.open('simplemma/data/ru.plzma', 'wb') as p:
    pickle.dump(d, p)

@adbar
Owner

adbar commented Jan 23, 2023

Hi @1over137, thank you very much for the deep dive into the data and the build process!

I cannot address the topic right now and will come back to it later, but here are a few remarks:

  • So far I haven't added the Russian Wiktionary entries, mostly because of the accent issue; your approach could solve the problem.
  • As you say, it could be that multi-word expressions are split up because of a parsing error, which causes lemmatization errors. The dictionary-building script is reasonably robust: it should be able to replace such candidates with other, closer words (by edit distance); I don't know why that isn't happening in this case.
  • Skipping words made of Latin characters makes sense; a simple regex could do the trick.
  • If you have a way to solve the capitalization issues, I'm all ears.
  • I already thought about using word frequency cues to filter the input; maybe this project could be of help (see the sketch after this list):
    https://github.com/rspeer/wordfreq
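
(A quick sketch of how wordfreq could be used for such filtering; the cutoff and the example pairs are made up:)

from wordfreq import zipf_frequency

def frequent_enough(word: str, min_zipf: float = 2.5) -> bool:
    # Zipf 3 corresponds to roughly one occurrence per million words;
    # 2.5 is an arbitrary example cutoff.
    return zipf_frequency(word, "ru") >= min_zipf

pairs = {"костями": "кость", "остьми": "ость"}  # made-up word/lemma pairs
filtered = {w: lemma for w, lemma in pairs.items() if frequent_enough(w)}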

Concretely, we are looking at writing a special data import for Russian and Ukrainian. We could talk about how to adapt the main function to this case.

@1over137
Contributor Author

1over137 commented Jan 23, 2023

Can you explain how the dictionary building works? I don't think I ever quite understood it.

If we write a special data import, would it be possible to rely on PyMorphy2 for the lemmas? Even without using any Wiktionary data, just reprocessing the original list through PyMorphy2 increases the accuracy from 0.854 to 0.889, which is a significant improvement before even adding any words. It avoids the complexity of choosing candidates, and personally I have only seen it make an error a few times over a long period.

Though we don't have a Wiktionary export of word forms for Ukrainian yet, the same approach can be applied to a corpus or a wordlist, such as one from the wordfreq repository you linked, which should improve coverage quite a bit. Then we could aim to capture part of that coverage with rules to cut down on size. We can be much more aggressive in applying rules, since statistically exceptions are only common in high-frequency words (adding back -ский, even adding some more adjective rules).

Though, to see any difference in the benchmarks, you might need to switch to a bigger corpus: the UD treebank only has ~90k tokens, so it may not even contain enough low-frequency words to make a difference, especially since your scoring is frequency-weighted.
