Add some simple suffix rules for Finnish #23
Conversation
Hi @osma, thanks for the PR! Everything looks good, but I hardly see any change on the benchmark (+0.001 without pronouns, the rest unchanged). At this point I'd say there isn't much incentive to write a lot of rules, because the vocabulary-based approach is already working OK. However, it could be that my benchmark (on this dataset, with the quick and dirty script) is too coarse to catch the difference. If you wish to run a deeper analysis, I can recommend the following approach. It is dedicated to Finnish and uses various corpora and different metrics; my guess is that it could catch fine-grained changes and ultimately tell us whether we should go in that direction. If you're up for it, you could tinker on this PR with the benchmark above until you reach conclusive results. What do you think?
Thanks for testing the PR @adbar! I'm not surprised that the difference is minimal. As you know, my goal wasn't to improve benchmark results but to be able to reduce the size of the dictionary. I don't have information on the frequency of the dictionary entries in real-world texts, but it's likely that the word forms these rules target are quite rare in practice. Let me try another set of rules, this time generating many more of them (but still with >=90% accuracy) and storing them in an external data file instead of inline in the code.
Oh, forgot this one:
Thanks for the tip, I wasn't aware of this benchmark, and it also seems to include tools that I hadn't heard about. I really like the visualization with F1 scores on one axis and speed (tokens per second) on the other! Simplemma seems to do very well in its own niche, with very high speed but lower-quality lemmatization results.
OK, let's try this! In the meantime I'll upload a version of the Finnish dictionary with entries limited to a maximum of 16 characters.
I expanded the number of rules to 1631. This is the largest set of rules I've generated so far: each rule has >=90% accuracy on the dictionary entries, and collectively they match around 39% of the dictionary entries. Each rule matches a minimum of around 280 dictionary entries. It would be possible to generate even more rules by lowering that threshold, in case we want to expand further. I put them in a separate file.
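The rule-selection procedure described above can be sketched roughly as follows. This is a hypothetical reconstruction (the function and the toy data are mine, not code from the actual notebook), assuming each candidate rule rewrites a word-final suffix and is scored by how often that rewrite reproduces the lemma recorded in a form-to-lemma dictionary:

```python
from typing import Dict, Tuple

def evaluate_rule(dictionary: Dict[str, str], old: str, new: str,
                  min_stem: int = 3) -> Tuple[float, int]:
    """Score a candidate suffix rule (old -> new) against a
    form -> lemma dictionary: return (accuracy, match count)."""
    matches = correct = 0
    for form, lemma in dictionary.items():
        # The rule applies if the form ends in `old` and enough stem remains.
        if form.endswith(old) and len(form) - len(old) >= min_stem:
            matches += 1
            if form[: -len(old)] + new == lemma:
                correct += 1
    accuracy = correct / matches if matches else 0.0
    return accuracy, matches

# Toy Finnish examples: stripping inessive "-ssa" works for the first two
# forms but not for "vanhoissa" (plural stem change), so accuracy is 2/3.
toy = {"talossa": "talo", "kissassa": "kissa", "vanhoissa": "vanha"}
acc, n = evaluate_rule(toy, "ssa", "")  # acc ~ 0.67, n == 3
```

With a real dictionary, a rule would then be kept only if its accuracy crosses the chosen threshold (>=90% in this PR) and it matches enough entries to be worth its storage cost.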
The rules have a small positive effect on my benchmark, more noticeable than with the smaller set of rules: about +0.003. I added the reduced version of the Finnish dictionary to the main branch of the repository; the corresponding data file is now about 30% smaller than in version 0.8.2. Due to a lack of time and knowledge I cannot work extensively on Finnish. From there I see two options:
I'm not sure how to interpret your reaction. I'll leave the pull request open, and I'd be glad to integrate future improvements!
Sorry for the lack of response, I got distracted by other things and was on vacation.

My original motivation was to reduce the memory usage of the Finnish model. The reductions you've now implemented (removing multiword keys and long words) have accomplished that, at least partly. If we want to reduce it even further, the only way I see is to drop the words from the dictionary that are covered by the generated rules, but you didn't seem very keen on doing that. The very small increase in accuracy is not, IMHO, a good enough reason to keep the rules, at least not these kinds of machine-generated rules. There is a maintenance overhead in both the rules.py code and the rule generation: I used a Colab notebook for prototyping, but that is not a good long-term solution, so the rule generation code should probably be put into the Simplemma repository itself.

In summary: I think this was a useful experiment. We found things to improve (dropping the multiword keys not just in Finnish but in other languages as well), and the memory usage should now be reduced (I didn't benchmark it, though). The rules in this form, however, didn't turn out to be such a great idea. I think it's time to close this PR. The rules can always be revisited if we come up with better ideas.
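The pruning idea mentioned above (dropping dictionary words already covered by the rules) could look roughly like this. A minimal sketch with hypothetical names, assuming rules are (suffix, replacement) pairs; this is not Simplemma code:

```python
from typing import Dict, List, Optional, Tuple

def prune_dictionary(dictionary: Dict[str, str],
                     rules: List[Tuple[str, str]],
                     min_stem: int = 3) -> Dict[str, str]:
    """Drop (form, lemma) entries whose lemma a suffix rule reproduces,
    so the rules can replace those entries at lookup time."""
    def rule_lemma(form: str) -> Optional[str]:
        for old, new in rules:
            if form.endswith(old) and len(form) - len(old) >= min_stem:
                return form[: -len(old)] + new
        return None

    return {f: l for f, l in dictionary.items() if rule_lemma(f) != l}

# "talossa" -> "talo" is handled by the rule and can be dropped;
# "vanhoissa" would be mis-lemmatized, so it must stay in the dictionary.
toy = {"talossa": "talo", "vanhoissa": "vanha", "kissa": "kissa"}
pruned = prune_dictionary(toy, [("ssa", "")])
```

The trade-off discussed in this thread is visible here: pruning saves memory only for entries the rules get right, while every exception still has to be stored explicitly.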
Thanks for your message, I respect your decision. Let's keep the idea in mind and talk about it in the future if necessary.
Hi @osma, I picked up the idea where we left off and I tested the rules against the language data for some languages. This had two consequences (see last commits):
The first implementation of this idea for Finnish can be found below: I wrote rules based on common noun forms and kept only the ones which didn't interfere too much with the data. The impact on accuracy is minimal, and the total number of entries has been reduced by about 10% (see line 195 in commit fd93714).
Hi @adbar, wow, this is great news! Thank you for continuing the work! I had already more or less given up on the idea of reducing the dictionary size, which was my original motivation for the rules, since you seemed reluctant to go in that direction.
Hi @osma, I guess I needed some time to get used to the idea, since the slightly smaller memory footprint has consequences in terms of accuracy. I'm happy I could use your input to try to strike a balance. We can talk about further refinements in the future if the changes have a negative impact on your data or if you see greater potential.
This PR adds some generated suffix rules for the Finnish language, as discussed in #19.
All these rules have an accuracy above 90%, as evaluated on the Finnish dictionary included with Simplemma. Collectively they cover around 6% of the dictionary entries.
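At lookup time, suffix rules of this kind can serve as a fallback for word forms missing from the dictionary. A minimal sketch, assuming rules map an inflected ending to a replacement string; the names, rule format, and example rules are illustrative, not Simplemma's actual internals:

```python
from typing import List, Optional, Tuple

# (old_suffix, replacement) pairs; illustrative Finnish case endings.
SUFFIX_RULES: List[Tuple[str, str]] = [
    ("ssa", ""),  # inessive: "talossa" (in the house) -> "talo"
    ("lla", ""),  # adessive: "kissalla" (on the cat)  -> "kissa"
]

def apply_suffix_rules(token: str, min_stem: int = 3) -> Optional[str]:
    """Return a candidate lemma for `token`, or None if no rule applies.
    The min_stem guard avoids stripping a suffix down to a stub."""
    for old, new in SUFFIX_RULES:
        if token.endswith(old) and len(token) - len(old) >= min_stem:
            return token[: -len(old)] + new
    return None

# apply_suffix_rules("talossa") -> "talo"; apply_suffix_rules("talo") -> None
```

Because each rule is only ~90% accurate, a real lemmatizer would consult the dictionary first and fall back to the rules only for unknown forms.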