
Add some simple suffix rules for Finnish #23

Closed
wants to merge 2 commits

Conversation

osma (Contributor) commented Oct 7, 2022

This PR adds some generated suffix rules for the Finnish language, as discussed in #19.

All these rules have an accuracy above 90% as evaluated on the Finnish language dictionary included with Simplemma. Collectively they cover around 6% of the dictionary entries.
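For readers unfamiliar with the approach: a suffix rule here simply maps a word-form ending to a lemma ending. The minimal sketch below illustrates the idea; the endings and the helper function are hypothetical examples, not Simplemma's actual rules or API.

from typing import Optional

# Hypothetical suffix rules: each maps an inflected ending to its
# replacement in the lemma (here, Finnish inessive endings are dropped).
SUFFIX_RULES = {
    "ssa": "",  # e.g. "talossa" ("in the house") -> "talo" ("house")
    "ssä": "",  # e.g. "kylässä" ("in the village") -> "kylä" ("village")
}

def apply_suffix_rules(token: str) -> Optional[str]:
    """Return a candidate lemma if a rule matches the token's ending."""
    for ending, replacement in SUFFIX_RULES.items():
        # require a non-empty stem so we never rewrite the whole token
        if token.endswith(ending) and len(token) > len(ending):
            return token[: -len(ending)] + replacement
    return None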


adbar (Owner) commented Oct 8, 2022

Hi @osma, thanks for the PR!

Everything looks good, but I hardly see any change in the benchmark (+0.001 without pronouns, the rest unchanged). At this point I'd say there isn't much incentive to write a lot of rules, because the vocabulary-based approach is already working OK.

However, it could be that my benchmark (on this dataset, with the quick-and-dirty script in tests/udscore.py) is simply too blunt, and my understanding of Finnish morphology too shallow.

If you wish to run a deeper analysis, I can recommend the following approach. It is dedicated to Finnish and uses various corpora and different metrics; my guess is that it could catch fine-grained changes and ultimately tell us whether we should go in that direction:
https://github.com/aajanki/finnish-pos-accuracy

If you're up for it, you could tinker with this PR using the benchmark above until you reach conclusive results. What do you think?

osma (Contributor, Author) commented Oct 10, 2022

Thanks for testing the PR, @adbar!

I'm not surprised that the difference is minimal. As you know, my goal wasn't to improve benchmark results but to reduce the size of the dictionary. I don't have information on the frequency of the dictionary entries in real-world texts, but it's likely that the word forms these rules target are quite rare in practice.

Let me try another set of rules, this time making many more of them (but still with >=90% accuracy) and storing them in an external dictionary instead of inline in rules.py. Those might be able to notch up the benchmark results a bit more. Still, if you're aiming to improve the benchmark, I'm not convinced that these kinds of data-driven rules are the best way to go. You'd need a much more intelligent set of rules, but that goes beyond the scope of Simplemma.

osma (Contributor, Author) commented Oct 10, 2022

Oh, forgot this one:

> If you wish to run a deeper analysis, I can recommend the following approach. It is dedicated to Finnish and uses various corpora and different metrics; my guess is that it could catch fine-grained changes and ultimately tell us whether we should go in that direction:
> https://github.com/aajanki/finnish-pos-accuracy

Thanks for the tip, I wasn't aware of this benchmark, and it also seems to include tools that I hadn't heard about. I really like the visualization with F1 score on one axis and speed (tokens per second) on the other! Simplemma seems to do very well in its own niche, with very high speed but lower-quality lemmatization results.

adbar (Owner) commented Oct 10, 2022

> Let me try another set of rules, this time making many more of them (but still with >=90% accuracy) and storing them in an external dictionary instead of inline in rules.py.

OK, let's try this! In the meantime I'll upload a version of the Finnish dictionary with entries limited to a maximum of 16 characters.

osma (Contributor, Author) commented Oct 10, 2022

I expanded the number of rules to 1631. This is the largest set I've generated so far: each rule has >=90% accuracy on the dictionary entries, and collectively they match around 39% of them. Each rule matches a minimum of around 280 entries. It would be possible to generate even more rules by reducing that threshold, in case we want to expand further.
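For context, the generation procedure can be sketched roughly as follows. This is a reconstruction from the thresholds mentioned above (>=90% accuracy, around 280 entries minimum support), not the actual notebook code.

from collections import Counter

def generate_rules(pairs, max_len=5, min_support=280, min_accuracy=0.9):
    """Derive suffix-replacement rules from (form, lemma) dictionary pairs.

    A rule (ending -> replacement) is kept only if it applies to at least
    `min_support` forms and yields the correct lemma for at least
    `min_accuracy` of them.
    """
    applies = Counter()  # ending -> number of forms carrying that ending
    right = Counter()    # (ending, replacement) -> correct applications
    for form, lemma in pairs:
        # consider every ending of the form up to max_len characters,
        # always leaving a non-empty stem
        for n in range(1, min(max_len, len(form) - 1) + 1):
            ending = form[-n:]
            applies[ending] += 1
            stem = form[:-n]
            if lemma.startswith(stem):
                right[(ending, lemma[len(stem):])] += 1
    return {
        ending: replacement
        for (ending, replacement), hits in right.items()
        if hits >= min_support and hits / applies[ending] >= min_accuracy
    }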

I put them in a file called fi-rules.plzma right next to the language-specific dictionaries. I'm not sure this is the best place or the best format (lzma compression is maybe a bit of overkill for such a small file), but you have to start somewhere.
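For what it's worth, reading and writing such a file could look like the sketch below, assuming the .plzma extension means the same pickled, lzma-compressed layout as the dictionary files; check Simplemma's dictionary-loading code for the real format.

import lzma
import pickle

def save_rules(rules: dict, path: str = "fi-rules.plzma") -> None:
    # assumption: same container format as the language dictionaries
    with lzma.open(path, "wb") as f:
        pickle.dump(rules, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_rules(path: str = "fi-rules.plzma") -> dict:
    with lzma.open(path, "rb") as f:
        return pickle.load(f)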

adbar (Owner) commented Oct 10, 2022

The rules have a small positive effect on my benchmark, more noticeable than with the smaller set of rules: about +0.003.

I added the reduced version of the Finnish dictionary to the main branch of the repository; the corresponding data file is now about 30% smaller than in version 0.8.2.

Due to a lack of time and knowledge, I cannot work extensively on Finnish. From here I see two options:

  • maybe the current size reductions are enough and your rules can be used to add a bit of accuracy?
  • or maybe you wish to experiment with the benchmark above to measure the effect of further rules and dictionary shrinking?

adbar (Owner) commented Oct 19, 2022

I'm not sure how to interpret your reaction. I'll leave the pull request open, and I'd be glad to integrate future improvements!

osma (Contributor, Author) commented Oct 28, 2022

Sorry for the lack of response. I got distracted by other things and was on vacation.

My original motivation was to reduce the memory usage of the Finnish model. The reductions you've now implemented (removing multiword keys and long words) have accomplished that, at least partly. If we want to reduce it even further, the only way I see is to drop the words from the dictionary that are covered by the generated rules. But you didn't seem very keen on doing that.

The very small increase in accuracy is not, IMHO, a good enough reason to keep the rules, at least not these kinds of machine-generated rules. There is a maintenance overhead in both the rules.py code and the rule generation: I used a Colab notebook for prototyping, but that's not a good long-term solution, so the rule-generation code should probably be put into the Simplemma repository itself.

In summary: I think this was a useful experiment. We found things to improve (dropping multiword keys not just in Finnish but in other languages as well), and memory usage should now be reduced (though I didn't benchmark it). The rules in this form, however, didn't turn out to be such a great idea, so I think it's time to close this PR. The rules can always be revisited if we come up with better ideas.

osma closed this Oct 28, 2022
adbar (Owner) commented Oct 28, 2022

Thanks for your message; I respect your decision. Let's keep the idea in mind and talk about it in the future if necessary.

adbar (Owner) commented Jan 12, 2023

Hi @osma, I picked up the idea where we left off and tested the rules against the language data for several languages. This had two consequences (see the last commits):

  • I reduced the number of rules where they appeared to generate noise
  • I then used the resulting (smaller) set of efficient rules to skip entries in the word pairs, thus reducing dictionary size as expected (see the sketch after the snippet below). Interestingly, this doesn't show that much in the compressed data, because these are frequent, regular phenomena and thus compress well...

The first implementation of this idea for Finnish can be found below: I wrote rules based on common noun forms and kept only the ones that didn't interfere too much with the data. The impact on accuracy is minimal, and the total number of entries has been reduced by about 10%.
Feel free to suggest further improvements if something catches your eye:

FINNISH_ENDINGS = {
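The pruning step described in the second bullet point above (skipping word pairs that the rules already produce correctly) might look roughly like the following sketch; prune_entries is a hypothetical helper, not the actual implementation.

def prune_entries(pairs: dict, rules: dict) -> dict:
    """Drop form -> lemma entries that the suffix rules already handle."""
    def lemma_by_rule(form):
        # apply the first matching suffix rule, if any
        for ending, replacement in rules.items():
            if form.endswith(ending) and len(form) > len(ending):
                return form[: -len(ending)] + replacement
        return None

    # keep only the entries the rules get wrong (or don't cover at all)
    return {form: lemma for form, lemma in pairs.items()
            if lemma_by_rule(form) != lemma}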

osma (Contributor, Author) commented Jan 19, 2023

Hi @adbar, wow, this is great news! Thank you for continuing the work! I had already more or less given up on the idea of reducing the dictionary size, which was my original motivation for the rules, since you seemed reluctant to go in that direction.

adbar (Owner) commented Jan 19, 2023

Hi @osma, I guess I needed some time to get used to the idea, since the slightly smaller memory footprint has consequences in terms of accuracy. I'm happy I could use your input to try to strike a balance. We can talk about further refinements in the future if the changes have a negative impact on your data or if you see greater potential.
