
Add some simple suffix rules for Finnish #23

Closed
wants to merge 2 commits

Conversation

osma (Contributor) commented Oct 7, 2022

This PR adds some generated suffix rules for the Finnish language, as discussed in #19.

All these rules have an accuracy above 90% as evaluated on the Finnish language dictionary included with Simplemma. Collectively they cover around 6% of the dictionary entries.
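For readers unfamiliar with the approach: a suffix rule here simply maps a word-form ending to a lemma ending. The minimal sketch below illustrates the idea; the endings and the helper function are hypothetical examples, not Simplemma's actual rules or API.

from typing import Optional

# Hypothetical suffix rules: each maps an inflected ending to its
# replacement in the lemma (here, Finnish inessive endings are dropped).
SUFFIX_RULES = {
    "ssa": "",  # e.g. "talossa" ("in the house") -> "talo" ("house")
    "ssä": "",  # e.g. "kylässä" ("in the village") -> "kylä" ("village")
}

def apply_suffix_rules(token: str) -> Optional[str]:
    """Return a candidate lemma if a rule matches the token's ending."""
    for ending, replacement in SUFFIX_RULES.items():
        # require a non-empty stem so we never rewrite the whole token
        if token.endswith(ending) and len(token) > len(ending):
            return token[: -len(ending)] + replacement
    return None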


adbar (Owner) commented Oct 8, 2022

Hi @osma, thanks for the PR!

Everything looks good, but I hardly see any change in the benchmark (+0.001 without pronouns, the rest unchanged). At this point I'd say there isn't much incentive to write a lot of rules, because the vocabulary-based approach is already working OK.

However, it could be that my benchmark (on this dataset, with the quick-and-dirty script in tests/udscore.py) is simply too blunt, and my understanding of Finnish morphology too shallow.

If you wish to run a deeper analysis, I can recommend the following approach. It is dedicated to Finnish and uses various corpora and different metrics; my guess is that it could catch fine-grained changes and ultimately tell us whether we should go in that direction:
https://github.com/aajanki/finnish-pos-accuracy

If you're up for it, you could tinker with this PR using the benchmark above until you reach conclusive results. What do you think?

osma (Contributor, Author) commented Oct 10, 2022

Thanks for testing the PR, @adbar!

I'm not surprised that the difference is minimal. As you know, my goal wasn't to improve benchmark results but to reduce the size of the dictionary. I don't have information on the frequency of the dictionary entries in real-world texts, but it's likely that the word forms these rules target are quite rare in practice.

Let me try another set of rules, this time making many more of them (but still with >=90% accuracy) and storing them in an external dictionary instead of inline in rules.py. Those might be able to notch up the benchmark results a bit more. Still, if you're aiming to improve the benchmark, I'm not convinced that these kinds of data-driven rules are the best way to go. You'd need a much more intelligent set of rules, but that goes beyond the scope of Simplemma.

osma (Contributor, Author) commented Oct 10, 2022

Oh, forgot this one:

> If you wish to run a deeper analysis, I can recommend the following approach. It is dedicated to Finnish and uses various corpora and different metrics; my guess is that it could catch fine-grained changes and ultimately tell us whether we should go in that direction:
> https://github.com/aajanki/finnish-pos-accuracy

Thanks for the tip, I wasn't aware of this benchmark, and it also seems to include tools that I hadn't heard about. I really like the visualization with F1 score on one axis and speed (tokens per second) on the other! Simplemma seems to do very well in its own niche, with very high speed but lower-quality lemmatization results.

adbar (Owner) commented Oct 10, 2022

> Let me try another set of rules, this time making many more of them (but still with >=90% accuracy) and storing them in an external dictionary instead of inline in rules.py.

OK, let's try this! In the meantime I'll upload a version of the Finnish dictionary with entries limited to a maximum of 16 characters.

osma (Contributor, Author) commented Oct 10, 2022

I expanded the number of rules to 1631. This is the largest set I've generated so far: each rule has >=90% accuracy on the dictionary entries, and collectively they match around 39% of them. Each rule matches a minimum of around 280 entries. It would be possible to generate even more rules by reducing that threshold, in case we want to expand further.
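For context, the generation procedure can be sketched roughly as follows. This is a reconstruction from the thresholds mentioned above (>=90% accuracy, around 280 entries minimum support), not the actual notebook code.

from collections import Counter

def generate_rules(pairs, max_len=5, min_support=280, min_accuracy=0.9):
    """Derive suffix-replacement rules from (form, lemma) dictionary pairs.

    A rule (ending -> replacement) is kept only if it applies to at least
    `min_support` forms and yields the correct lemma for at least
    `min_accuracy` of them.
    """
    applies = Counter()  # ending -> number of forms carrying that ending
    right = Counter()    # (ending, replacement) -> correct applications
    for form, lemma in pairs:
        # consider every ending of the form up to max_len characters,
        # always leaving a non-empty stem
        for n in range(1, min(max_len, len(form) - 1) + 1):
            ending = form[-n:]
            applies[ending] += 1
            stem = form[:-n]
            if lemma.startswith(stem):
                right[(ending, lemma[len(stem):])] += 1
    return {
        ending: replacement
        for (ending, replacement), hits in right.items()
        if hits >= min_support and hits / applies[ending] >= min_accuracy
    }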

I put them in a file called fi-rules.plzma right next to the language-specific dictionaries. I'm not sure this is the best place or the best format (lzma compression is maybe a bit of overkill for such a small file), but you have to start somewhere.
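For what it's worth, reading and writing such a file could look like the sketch below, assuming the .plzma extension means the same pickled, lzma-compressed layout as the dictionary files; check Simplemma's dictionary-loading code for the real format.

import lzma
import pickle

def save_rules(rules: dict, path: str = "fi-rules.plzma") -> None:
    # assumption: same container format as the language dictionaries
    with lzma.open(path, "wb") as f:
        pickle.dump(rules, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_rules(path: str = "fi-rules.plzma") -> dict:
    with lzma.open(path, "rb") as f:
        return pickle.load(f)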

adbar (Owner) commented Oct 10, 2022

The rules have a small positive effect on my benchmark, more noticeable than with the smaller set of rules: about +0.003.

I added the reduced version of the Finnish dictionary to the main branch of the repository; the corresponding data file is now about 30% smaller than in version 0.8.2.

Due to a lack of time and knowledge, I cannot work extensively on Finnish. From here I see two options:

  • maybe the current size reductions are enough and your rules can be used to add a bit of accuracy?
  • or maybe you wish to experiment with the benchmark above to measure the effect of further rules and dictionary shrinking?

adbar (Owner) commented Oct 19, 2022

I'm not sure how to interpret your reaction. I'll leave the pull request open, and I'd be glad to integrate future improvements!

osma (Contributor, Author) commented Oct 28, 2022

Sorry for the lack of response. I got distracted by other things and was on vacation.

My original motivation was to reduce the memory usage of the Finnish model. The reductions you've now implemented (removing multiword keys and long words) have accomplished that, at least partly. If we want to reduce it even further, the only way I see is to drop the words from the dictionary that are covered by the generated rules. But you didn't seem very keen on doing that.

The very small increase in accuracy is not, IMHO, a good enough reason to keep the rules, at least not these kinds of machine-generated rules. There is a maintenance overhead in both the rules.py code and the rule generation: I used a Colab notebook for prototyping, but that's not a good long-term solution, so the rule-generation code should probably be put into the Simplemma repository itself.

In summary: I think this was a useful experiment. We found things to improve (dropping multiword keys not just in Finnish but in other languages as well), and memory usage should now be reduced (though I didn't benchmark it). The rules in this form, however, didn't turn out to be such a great idea, so I think it's time to close this PR. The rules can always be revisited if we come up with better ideas.

osma closed this Oct 28, 2022
adbar (Owner) commented Oct 28, 2022

Thanks for your message; I respect your decision. Let's keep the idea in mind and talk about it in the future if necessary.

adbar (Owner) commented Jan 12, 2023

Hi @osma, I picked up the idea where we left off and tested the rules against the language data for several languages. This had two consequences (see the last commits):

  • I reduced the number of rules where they appeared to generate noise
  • I then used the resulting (smaller) set of efficient rules to skip entries in the word pairs, thus reducing dictionary size as expected (see the sketch after the snippet below). Interestingly, this doesn't show that much in the compressed data, because these are frequent, regular phenomena and thus compress well...

The first implementation of this idea for Finnish can be found below: I wrote rules based on common noun forms and kept only the ones that didn't interfere too much with the data. The impact on accuracy is minimal, and the total number of entries has been reduced by about 10%.
Feel free to suggest further improvements if something catches your eye:

FINNISH_ENDINGS = {
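The pruning step described in the second bullet point above (skipping word pairs that the rules already produce correctly) might look roughly like the following sketch; prune_entries is a hypothetical helper, not the actual implementation.

def prune_entries(pairs: dict, rules: dict) -> dict:
    """Drop form -> lemma entries that the suffix rules already handle."""
    def lemma_by_rule(form):
        # apply the first matching suffix rule, if any
        for ending, replacement in rules.items():
            if form.endswith(ending) and len(form) > len(ending):
                return form[: -len(ending)] + replacement
        return None

    # keep only the entries the rules get wrong (or don't cover at all)
    return {form: lemma for form, lemma in pairs.items()
            if lemma_by_rule(form) != lemma}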

osma (Contributor, Author) commented Jan 19, 2023

Hi @adbar, wow, this is great news! Thank you for continuing the work! I had already more or less given up on the idea of reducing the dictionary size, which was my original motivation for the rules, since you seemed reluctant to go in that direction.

adbar (Owner) commented Jan 19, 2023

Hi @osma, I guess I needed some time to get used to the idea, since the slightly smaller memory footprint has consequences in terms of accuracy. I'm happy I could use your input to try to strike a balance. We can talk about further refinements in the future if the changes have a negative impact on your data or if you see greater potential.
