
Add a dictionary factory backed by MARISA-tries #133

Merged · 2 commits into adbar:main · Jun 26, 2024

Conversation

@Dunedan (Contributor) commented May 30, 2024

This adds an additional dictionary factory backed by MARISA-tries. This dictionary factory on average offers 20x lower memory usage and 100x faster initialization time, in exchange for reduced lemmatization and language detection performance.

The first time loading a dictionary with the TrieDictionaryFactory requires more memory and will take a few seconds, as the trie-backed dictionary has to be generated on-the-fly from the pickled dict-based dictionary first.
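A minimal usage sketch for context (the import paths and the `Lemmatizer`/`DefaultStrategy` wiring are assumptions about Simplemma's strategy API, not taken verbatim from this PR; only `TrieDictionaryFactory().get_dictionary()` appears literally later in this thread):

```python
# Hedged sketch: plugging the trie-backed factory into lemmatization.
# Import paths and the Lemmatizer/DefaultStrategy wiring are assumptions.
from simplemma import Lemmatizer
from simplemma.strategies import DefaultStrategy
from simplemma.strategies.dictionaries import TrieDictionaryFactory

strategy = DefaultStrategy(dictionary_factory=TrieDictionaryFactory())
lemmatizer = Lemmatizer(lemmatization_strategy=strategy)
print(lemmatizer.lemmatize("balconies", lang="en"))  # expected: "balcony"
```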

Review thread on the following passage:

> …using the `TrieDictionaryFactory` for the first time for a language and
> will take a few seconds and use as much memory as loading the Python
> dicts for the language requires. For further invocations the trie
> dictionaries get cached on disk.

Contributor: Would it be helpful to explain where the cache is located on disk?

@Dunedan (Author): I tried not to document every little detail in the README to avoid blowing it up too much. IMO there should be separate API documentation for Simplemma to cover stuff like that. However, if it's desired, please let me know and I'll happily add it.

@osma (Contributor) commented May 30, 2024

Awesome work @Dunedan !

@adbar (Owner) commented May 30, 2024

Thanks for sharing your insights and for contributing this functionality!

Everything looks good at first sight, but we need to update the installation process on GitHub Actions.

codecov bot commented Jun 4, 2024

Codecov Report

Attention: Patch coverage is 98.63014% with 1 line in your changes missing coverage. Please review.

Project coverage is 97.25%. Comparing base (5f4fa16) to head (81f08ba).

Files Patch % Lines
...emma/strategies/dictionaries/dictionary_factory.py 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #133      +/-   ##
==========================================
+ Coverage   97.15%   97.25%   +0.09%     
==========================================
  Files          33       34       +1     
  Lines         563      619      +56     
==========================================
+ Hits          547      602      +55     
- Misses         16       17       +1     


@adbar (Owner) commented Jun 4, 2024

@Dunedan Your PR only works on Windows; there seems to be an issue with the .dic file names. Could you please look into it?

There are also lines not covered by the tests; it should be easy to add further tests.

@Dunedan (Author) commented Jun 4, 2024

> @Dunedan Your PR only works on Windows; there seems to be an issue with the .dic file names. Could you please look into it?

I guess you're talking about the tests, which only succeed on the Windows runners so far. At first glance that looks like a list-ordering issue affecting just the test assertions. Should be easy to fix. I'll probably do so tomorrow.

If you're not talking about the tests, please let me know and provide a bit more detail about what functionality doesn't work and how that manifests. I'd be surprised about that though, as I developed and tested the code on Linux and didn't encounter any problems there.

> There are also lines not covered by the tests; it should be easy to add further tests.

Sure, no big deal, I can add tests for that as well.

@juanjoDiaz (Collaborator) left a comment

I've dropped a bunch of comments.
I like the idea of using tries, but I'm not fully convinced by the implementation.
I think that we should have a TrieDictionaryFactory which (roughly sketched below):

  • has a method to convert all (or the selected) pickled dictionaries into pickled tries and save them to disk
  • has a method to load a dictionary, which loads it from disk if present or from the original dictionary as needed
  • doesn't do all that hashing logic, because dictionaries are not likely to change unless the library is updated
  • reconsiders whether the current bytestring usage is right
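A hypothetical skeleton of that shape, just to make the proposal concrete (names and signatures are illustrative, not taken from the PR):

```python
# Hypothetical skeleton of the proposed design; names and signatures are
# illustrative only and do not come from the actual PR.
from pathlib import Path
from typing import Iterable, Mapping


class ProposedTrieDictionaryFactory:
    """Sketch: trie dictionaries cached on disk, built lazily, no hashing."""

    def __init__(self, cache_dir: Path) -> None:
        self._cache_dir = cache_dir

    def generate_tries(self, langs: Iterable[str]) -> None:
        # Convert the selected pickled dictionaries into tries on disk up front.
        for lang in langs:
            self.get_dictionary(lang)

    def get_dictionary(self, lang: str) -> Mapping[str, str]:
        # Load the trie from disk if present, otherwise build it from the
        # pickled dictionary and write it to the cache directory.
        raise NotImplementedError("sketch only")
```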

logger = logging.getLogger(__name__)


class TrieWrapDict(MutableMapping):
Collaborator:

Should we simply modify the DictionaryFactory protocol to return a Mapping instead of a Dict, rather than having this wrapper?

Why MutableMapping instead of Mapping?

@Dunedan (Author):

One of the constraints I gave myself was to implement this functionality without requiring changes to the existing code of Simplemma. That's why I didn't modify the expected return types of the DictionaryFactory, for example. Doing so would certainly simplify things; however, that's something I didn't want to decide on my own.

It's a MutableMapping right now because a dict, which is what the DefaultDictionaryFactory uses, is one as well. No further reason beyond that.
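For illustration, a read-only variant would only need the Mapping protocol; a minimal sketch (assuming a marisa_trie.BytesTrie with UTF-8-encoded lemma values — this is not the PR's actual TrieWrapDict):

```python
# Minimal sketch of a read-only Mapping wrapper around a BytesTrie;
# not the PR's TrieWrapDict, just an illustration of the idea.
from collections.abc import Mapping
from typing import Iterator

from marisa_trie import BytesTrie  # type: ignore[import-not-found]


class ReadOnlyTrieDict(Mapping):
    def __init__(self, trie: BytesTrie) -> None:
        self._trie = trie

    def __getitem__(self, key: str) -> str:
        return self._trie[key][0].decode()

    def __iter__(self) -> Iterator[str]:
        return iter(self._trie.keys())

    def __len__(self) -> int:
        return len(self._trie)
```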

    def __init__(
        self,
        cache_max_size: int = 8,
        use_disk_cache: bool = True,
Collaborator:

use_disk_cache is not needed.
If the disk_cache_dir is None, then you don't use disk caching.

@Dunedan (Author):

Right now, if disk_cache_dir is None, a subdirectory of the user's platform-specific cache directory is used to store the cache. So use_disk_cache currently distinguishes between disabling the disk cache and using the default cache location.
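For illustration, the interplay of the two parameters might look roughly like this (a sketch, not the PR's actual code; `platformdirs.user_cache_dir` is the real helper):

```python
# Sketch of how the two parameters could interact; not the PR's actual code.
from pathlib import Path
from typing import Optional

from platformdirs import user_cache_dir


def resolve_cache_dir(use_disk_cache: bool, disk_cache_dir: Optional[str]) -> Optional[Path]:
    if not use_disk_cache:
        return None  # disk caching disabled entirely
    if disk_cache_dir is not None:
        return Path(disk_cache_dir)  # explicit user-chosen location
    # Default: a per-user, platform-specific cache directory,
    # e.g. ~/.cache/simplemma on Linux.
    return Path(user_cache_dir("simplemma"))
```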

from typing import ByteString, Dict, List, Optional, cast

from marisa_trie import BytesTrie, HUGE_CACHE # type: ignore[import-not-found]
from platformdirs import user_cache_dir
Collaborator:

platformdirs is also an external dependency. So why doesn't it need the type ignore like marisa_trie?

@Dunedan (Author):

While I'm not entirely sure, my guess is that mypy can't get any type information from marisa_trie, because it isn't Python code but a C extension and doesn't provide type information.


    def __init__(
        self,
        cache_max_size: int = 8,
Collaborator:

We are doing double caching: in memory and on disk.
These parameters make it unclear which cache they refer to.

@Dunedan (Author):

I completely agree. I just kept that parameter from DefaultDictionaryFactory to be able to use TrieDictionaryFactory as a drop-in replacement for it.

            hasher.update(chunk)
        return hasher.hexdigest()

    def _cache_is_current(self, lang: str) -> bool:
Collaborator:

What's the scenario in which the on-disk trie won't match the in-memory one?

@Dunedan (Author):

This method is for checking whether the tries cached on disk still match the pickled bytestring dictionaries. They don't match anymore whenever a new version of Simplemma gets released which includes updated pickled bytestring dictionaries, resulting in the need to regenerate the cached tries.
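For illustration, that check amounts to hashing the pickled dictionary and comparing it against a digest stored next to the cached trie; a rough sketch (not the PR's exact code):

```python
# Rough sketch of hash-based cache validation; not the PR's exact code.
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def cache_is_current(pickled_dict: Path, stored_digest: Path) -> bool:
    # The cached trie is current as long as the pickled dictionary it was
    # built from still has the digest recorded when the trie was generated.
    if not stored_digest.exists():
        return False
    return stored_digest.read_text().strip() == file_sha256(pickled_dict)
```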

        self._trie = trie

    def __getitem__(self, item):
        return self._trie[item.decode()][0]
Collaborator:

Having to do this decoding right after encoding feels a bit wasteful. Every time I see it, it makes me wonder whether changing this to bytestrings was the right call...

@Dunedan (Author):

While I somewhat agree, it wasn't much better before, as BytesTrie expects the keys to be strings and the values to be bytestrings. So while now the keys have to be encoded and decoded, before it was the values.
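For reference, a small sketch of what BytesTrie expects (str keys, bytes values), which is where the encode/decode round-trips come from:

```python
# BytesTrie stores str keys and bytes values, so one side always needs an
# encode/decode round-trip when the surrounding code works on the other type.
from marisa_trie import BytesTrie  # type: ignore[import-not-found]

trie = BytesTrie([("balconies", b"balcony")])

# Lookup with a str key returns a list of bytes values:
assert trie["balconies"] == [b"balcony"]

# With str-keyed lookups, only the value needs decoding...
lemma = trie["balconies"][0].decode()  # -> "balcony"

# ...whereas a wrapper exposing bytes keys must decode the key first:
key = b"balconies"
lemma = trie[key.decode()][0].decode()
```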

@@ -0,0 +1,234 @@
from collections.abc import ItemsView, KeysView
Collaborator:

Could we somehow reuse generic DictionaryFactory tests for all the implementations?

@Dunedan (Author):

For common functionality that would be helpful, and I considered implementing dictionary-agnostic tests, but I concluded that this would be out of scope for this PR.
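For what it's worth, such factory-agnostic tests could later be expressed as a parametrized fixture; a sketch (hypothetical test code, not part of this PR, and it assumes the str-keyed Mapping protocol introduced by this PR):

```python
# Hypothetical sketch of dictionary-factory-agnostic tests; not part of this PR.
# Import paths are assumptions; assumes get_dictionary() returns Mapping[str, str].
import pytest

from simplemma.strategies.dictionaries import DefaultDictionaryFactory, TrieDictionaryFactory


@pytest.fixture(params=[DefaultDictionaryFactory, TrieDictionaryFactory])
def dictionary_factory(request):
    return request.param()


def test_get_dictionary_contains_known_lemma(dictionary_factory):
    dictionary = dictionary_factory.get_dictionary("en")
    assert "balconies" in dictionary
    assert "balconies123" not in dictionary
```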

@Dunedan (Author) commented Jun 5, 2024

> • has a method to convert all (or the selected) pickled dictionaries into pickled tries and save them to disk

That happens right now when loading the dictionary. Having it as a separate method IMO wouldn't provide much benefit, but it wouldn't be a big change either.

    for lang in SUPPORTED_LANGUAGES:
        TrieDictionaryFactory().get_dictionary(lang)

> • has a method to load a dictionary, which loads it from disk if present or from the original dictionary as needed

That's currently done when loading a dictionary:

    dictionary = TrieDictionaryFactory().get_dictionary("en")

> • doesn't do all that hashing logic, because dictionaries are not likely to change unless the library is updated

How would you handle regenerating the cached tries after dictionary updates then?

@Dunedan (Author) commented Jun 5, 2024

I just pushed another commit fixing the syntax errors in the tests for Python <3.9.

The failure to run the tests with Python 3.13.0b1 is caused by marisa-trie not supporting Python 3.13 yet. For details about this, please check the issue I opened there: pytries/marisa-trie#104

Aside from those failed GitHub Actions jobs, the only remaining failure is with Python 3.10 on the macOS runner and seems to be unrelated to my changes, as it's failing because of a missing Codecov token.

@adbar (Owner) commented Jun 6, 2024

Python 3.6 and 3.7 are now deprecated, and the token error with 3.10 on macOS is a Codecov problem, nothing serious.

Thanks for your work @Dunedan !

@juanjoDiaz Can we merge the PR or do you have other questions?

@juanjoDiaz (Collaborator) commented:

I'm all in for having trie-backed dictionaries.

However, I'm still not convinced by the implementation.
I did a quick-and-dirty alternative proposal (not fully tested, just to show my thinking).

@Dunedan (Author) commented Jun 10, 2024

> I'm all in for having trie-backed dictionaries.
>
> However, I'm still not convinced by the implementation. I did a quick-and-dirty alternative proposal (not fully tested, just to show my thinking).

Thanks for that alternative proposal. 👍

As I'd prefer to keep the discussion in a single place, I'm going to comment on your PR here:

> • It encapsulates the bytestring optimization better while keeping the dictionary factory working on strings

That's what I'd prefer as well, but as that requires changes to the DictionaryFactory protocol, I didn't include it in my PR so far. If changing the DictionaryFactory protocol is up for discussion, I'm all for it. 👍

> • It does not require an external dependency to pick a temp folder for disk caching

The way you implemented that has several disadvantages from my point of view:

  • In cases where no per-user/per-process temp directory is used (e.g. when not using systemd's PrivateTmp=true option), the default cache directory would be shared by all local users. Depending on user groups and umasks, only the first user would be able to write in there. This could be solved, though, by adding the name or uid of the current user to the file path, like other applications do as well.
  • The temp directory is often backed by memory (e.g. through tmpfs), so it's not a good place to store data that is supposed to be cached on disk.
  • The generated trie dictionaries would be regenerated on every reboot and, depending on the temp-directory configuration and the usage of simplemma, even more frequently (e.g. by default systemd deletes temporary files which haven't been accessed for 10 days).

All of that is why I didn't use a temp directory, but a user-specific cache directory instead.

> • It does not require all the additional hashing logic. It simply caches data in a folder specific to simplemma's version, or wherever the user says

In the end that's a balance between minimizing the number of times the trie-based dictionaries have to be regenerated and implementation complexity. I (obviously) opted for minimizing the number of regenerations, as I expect Simplemma not to update dictionaries with every release, and regenerating them is pretty costly in terms of CPU and memory usage. If Simplemma is going to update dictionaries more frequently than I anticipate, then the hashing wouldn't make much sense, of course.

> • It does not use internal functions of simplemma like _load_dictionary_from_disk

I like that. 👍

@adbar (Owner) commented Jun 12, 2024

I guess it would now be best to integrate some of the changes discussed into this PR? We could first start with the changes you both agree on.

Besides, as far as I'm concerned, we can make changes to the DictionaryFactory protocol.

@juanjoDiaz (Collaborator) commented:
We agree on modifying the DictionaryFactory protocol.
I can do that in my other PR.
And then the trie implementation can be adjusted to that.

I see 3 aspects that we need to align on:

  • How to cache on disk
  • Is hashing needed to notice if dictionaries changed?
  • Should this be part of simplemma or published as a separate module?

How to cache on disk

> The way you implemented that has several disadvantages from my point of view:
>
>   • In cases where no per-user/per-process temp directory is used (e.g. when not using systemd's PrivateTmp=true option), the default cache directory would be shared by all local users. Depending on user groups and umasks, only the first user would be able to write in there. This could be solved, though, by adding the name or uid of the current user to the file path, like other applications do as well.
>
>   • The temp directory is often backed by memory (e.g. through tmpfs), so it's not a good place to store data that is supposed to be cached on disk.
>
>   • The generated trie dictionaries would be regenerated on every reboot and, depending on the temp-directory configuration and the usage of simplemma, even more frequently (e.g. by default systemd deletes temporary files which haven't been accessed for 10 days).

The more I think about this, the more I think that we should simply store the trie dictionaries in str(Path(__file__).parent / "data_marisa_tries"). It will be there for all runs of simplemma, no matter how many processes are running simplemma.
We could, of course, have an optional argument to provide a custom path. The same could also be used in the DefaultDictionaryFactory to provide one's own pickled dictionaries.

I think that we need to think of how Simplemma is used.
I expect that simplemma is used either in scripts or as part of a server.
In both cases, I would simply generate the tries on disk before running the script or starting the server.
If I put my server in a Docker container, I would build the tries during startup or even during image creation.
I wouldn't expect much dynamism in creating the tries.

That's also why I propose having a method that allows pre-generating all tries on disk, which can be documented as a simple for loop.
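A rough sketch of that idea (illustrative only; the import path, the version-scoped folder name, and the disk_cache_dir argument are assumptions, not code from either PR):

```python
# Illustrative sketch: tries stored inside the installed package, scoped by
# the simplemma version, and pre-generated in one documented loop.
# The path layout, import path, and disk_cache_dir argument are assumptions.
from importlib.metadata import version
from pathlib import Path

import simplemma
from simplemma.strategies.dictionaries import TrieDictionaryFactory

cache_dir = Path(simplemma.__file__).parent / f"data_marisa_tries_{version('simplemma')}"
cache_dir.mkdir(parents=True, exist_ok=True)

# Pre-generate every trie once, e.g. during image creation or server setup.
factory = TrieDictionaryFactory(disk_cache_dir=str(cache_dir))
for lang in ("en", "de"):  # or the full SUPPORTED_LANGUAGES list
    factory.get_dictionary(lang)
```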

Is hashing needed to notice if dictionaries changed?

> In the end that's a balance between minimizing the number of times the trie-based dictionaries have to be regenerated and implementation complexity. I (obviously) opted for minimizing the number of regenerations, as I expect Simplemma not to update dictionaries with every release, and regenerating them is pretty costly in terms of CPU and memory usage. If Simplemma is going to update dictionaries more frequently than I anticipate, then the hashing wouldn't make much sense, of course.

I think that Simplemma never updates the dictionaries because it doesn't have a fully automated mechanism to do so.
Otherwise, I would update the dictionaries with every single release of Kaikki or whatever other dataset we want to use. That's the reason why I would like to really automate the dictionary training.

So for me the question is: are we going to release so many versions of simplemma that regenerating the tries with every release would become a problem? I don't think so.

And I have the same remark as before: how will simplemma be used?
For servers and APIs, I would consider creating the tries as part of setting up the service (building the Docker image, building the VM, or manually setting up the server, depending on how modern you are).

Should this be part of simplemma or published as a separate module?

One of the beauties of simplemma is that it's dependency-free.
Having this would make it depend on an external library, even if only for one optional feature.

Also, today it is marisa-trie, but tomorrow it could be another trie implementation, a better hashmap, or something else.
Do we want to add all of those to simplemma?
Or should they be published as external modules that just implement the protocol, so they can be used by simplemma easily?

@Dunedan (Author) commented Jun 14, 2024

> We agree on modifying the DictionaryFactory protocol.
> I can do that in my other PR.

If you don't mind, I could also change the DictionaryFactory protocol in my PR.

> The more I think about this, the more I think that we should simply store the trie dictionaries in str(Path(__file__).parent / "data_marisa_tries").

That path might not even be writable. What's the reason for your resistance to just using the user's cache directory as the default?

> So for me the question is: are we going to release so many versions of simplemma that regenerating the tries with every release would become a problem? I don't think so.

We can simply go on without the hashing for now. Hashing can easily be added later on, if the dictionaries change more frequently than anticipated.

> One of the beauties of simplemma is that it's dependency-free.
> Having this would make it depend on an external library, even if only for one optional feature.

I believe you're limiting yourself too much with this stance on the use of dependencies. While I'm a fan of having as few dependencies as necessary, sometimes adding a third-party library can massively improve the utility of a package. The use of marisa-trie is an example of that, and I believe having it as an optional dependency is a good trade-off.

> Should this be part of simplemma or published as a separate module?

While I'd prefer to have it as part of Simplemma, as it would be easier to use and maintain, a separate package would be fine for me as well. How you want to handle that is something you guys have to decide.

assert ("balconies" in dictionary) is True
assert ("balconies123" in dictionary) is False
with pytest.raises(KeyError):
dictionary["balconies123"]

Code scanning / CodeQL notice: This statement has no effect.
assert ("balconies123" in wrapped_trie) is False
assert wrapped_trie["balconies"] == "balcony"
with pytest.raises(KeyError):
wrapped_trie[b"balconies123"]

Code scanning / CodeQL notice: This statement has no effect.
@adbar (Owner) commented Jun 14, 2024

I would also prefer to integrate it into simplemma; as long as the installation remains optional, the package is still usable without dependencies.

As for the rest of the questions, I'll keep following the discussion, as I have no fixed opinion on this. Maybe keep hashing aside for now for the sake of simplicity.

@juanjoDiaz (Collaborator) commented:

Alright, we agreed on:

  • Introducing my changes to the DictionaryFactory to simplify things
  • Keeping the trie implementation within Simplemma
  • Not introducing hashing for now

The only thing left is to agree on the disk caching.
I'm not really against it. I'm just trying to keep things simple.
I'm happy to keep the high configurability of the location with a sensible default.
However, I'd like to remove the dependency on platformdirs if possible.

@Dunedan (Author) commented Jun 18, 2024

> Alright, we agreed on:
>
>   • Introducing my changes to the DictionaryFactory to simplify things
>   • Keeping the trie implementation within Simplemma
>   • Not introducing hashing for now

All of that is already implemented in this PR. 🙂

> The only thing left is to agree on the disk caching. I'm not really against it. I'm just trying to keep things simple. I'm happy to keep the high configurability of the location with a sensible default. However, I'd like to remove the dependency on platformdirs if possible.

IMO the simplest and most sensible solution is to use the user's cache directory as the default and get its location using platformdirs. If you disagree, please suggest a better alternative.

@juanjoDiaz (Collaborator) commented:

I would use https://docs.python.org/3/library/tempfile.html instead of platformdirs to have a default temp folder with no dependencies.
And let the user provide their own folder if they want more control.

@Dunedan (Author) commented Jun 20, 2024

> I would use https://docs.python.org/3/library/tempfile.html instead of platformdirs to have a default temp folder with no dependencies.

So this is the same approach you implemented in your PR, and I already explained above the downsides this approach has. To sum it up: you prefer a default configuration which keeps all generated tries in memory and regenerates them on every reboot (or potentially more often), just to avoid having to use an additional, tiny, well-maintained, pure-Python dependency. That approach negates the lower-memory benefit of using tries and slows down users' applications, as the tries have to be regenerated regularly. For long-running applications using Simplemma which get started on boot, not caching tries at all would even be better than this approach, as it'd mean lower memory utilization and the tries wouldn't have to be regenerated more often.

In my opinion that's not a user-friendly default, and I still don't understand the insistence on not using platformdirs. If you prefer adding additional code to Simplemma over adding an additional dependency, including a copy of the relevant code from platformdirs would be an alternative option, still providing the benefit of a sensible caching location.

@adbar (Owner) commented Jun 24, 2024

Since this PR entails substantial work, I'd be in favor of moving on with the integration. We can always amend it later to improve on the issues discussed here.

@Dunedan The tests do not pass for Python 3.8; could you please have a look at it?

@Dunedan (Author) commented Jun 25, 2024

> @Dunedan The tests do not pass for Python 3.8; could you please have a look at it?

Sorry, I missed that. This should be fixed now.

If you're fine with the code as-is, I suggest I rebase & squash the commits in this branch to have a single commit for all the changes and promote the pull request from "Draft" to "Ready for review" afterwards.

@juanjoDiaz (Collaborator) commented Jun 25, 2024

> So this is the same approach you implemented in your PR, and I already explained above the downsides this approach has. To sum it up: you prefer a default configuration which keeps all generated tries in memory and regenerates them on every reboot (or potentially more often), just to avoid having to use an additional, tiny, well-maintained, pure-Python dependency. That approach negates the lower-memory benefit of using tries and slows down users' applications, as the tries have to be regenerated regularly. For long-running applications using Simplemma which get started on boot, not caching tries at all would even be better than this approach, as it'd mean lower memory utilization and the tries wouldn't have to be regenerated more often.

Using tempfile does not keep things in memory but in a temp folder on disk.
My insistence on not having platformdirs is that it's an external dependency that the user will have to know about and install for things to work.
The only difference that I can see with platformdirs is that tempfile gives a folder per process, whereas platformdirs gives a generic folder shared by every process.
I think that it makes sense to have temp data scoped by process. Lazy caching by many processes, if not done carefully, may produce race conditions.
The user always has the option to provide a folder manually if they want a folder shared by many processes. But, as said, this must be used carefully, e.g. by loading the tries onto disk on startup and then launching all the processes.

Maybe you are using simplemma very differently.

@Dunedan (Author) commented Jun 25, 2024

> Using tempfile does not keep things in memory but in a temp folder on disk.

On Linux, tempfile creates files and directories in /tmp, which is nowadays often backed by tmpfs, which stores its contents in memory.

> My insistence on not having platformdirs is that it's an external dependency that the user will have to know about and install for things to work.

I don't know how you install Python packages, but usually the user doesn't need to know about the dependencies of a package. In this case the user just has to follow the instructions for using tries and install simplemma[marisa-trie], e.g. with pip:

pip install simplemma[marisa-trie]

platformdirs will be installed automatically as part of that.
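For reference, that works because the extra bundles both libraries; a hypothetical setuptools excerpt (Simplemma's actual packaging metadata may differ):

```python
# Hypothetical setup.py excerpt; Simplemma's actual packaging metadata may differ.
from setuptools import setup

setup(
    name="simplemma",
    extras_require={
        # enables: pip install simplemma[marisa-trie]
        "marisa-trie": ["marisa-trie", "platformdirs"],
    },
)
```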

> The only difference that I can see with platformdirs is that tempfile gives a folder per process, whereas platformdirs gives a generic folder shared by every process. I think that it makes sense to have temp data scoped by process. Lazy caching by many processes, if not done carefully, may produce race conditions.

As explained already, there is more than just this single difference.
It'd be easy to use a per-process subdirectory in the user's cache directory; however, that'd remove the benefit of caching, as the tries would have to be regenerated on every invocation of the application.

Regarding race conditions: there is one possible race condition right now. If two processes start in parallel, one doesn't find a cached trie, starts generating it, and is still in the process of writing it to disk when the second process tries to read the unfinished file from the cache; the second process will then get an exception, because the not-yet-finished file isn't valid. I expect that case to be very rare though, and fixing it would be as easy as writing the cached trie to a temporary file and moving it to its final position after writing.
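A sketch of that fix, using a temporary file in the same directory and an atomic rename (illustrative, not the PR's code):

```python
# Illustrative sketch of the suggested fix: write the trie to a temporary
# file in the cache directory and atomically move it into place, so a
# concurrent reader never sees a half-written file.
import os
import tempfile
from pathlib import Path

from marisa_trie import BytesTrie  # type: ignore[import-not-found]


def save_trie_atomically(trie: BytesTrie, target: Path) -> None:
    fd, tmp_path = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    os.close(fd)
    try:
        trie.save(tmp_path)           # write the full trie to the temp file
        os.replace(tmp_path, target)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```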

@adbar (Owner) commented Jun 25, 2024

@Dunedan Thanks for fixing the PR; I believe you can prepare it so that it can be merged.

(My understanding is that merging the PR is not the end of the discussion and that work on this optional function can continue beyond this point.)

This adds an additional dictionary factory backed by MARISA-tries. This
dictionary factory on average offers 20x lower memory usage and 100x
faster initialization time, in exchange for reduced lemmatization and
language detection performance.

The first time loading a dictionary with the `TrieDictionaryFactory`
requires more memory and will take a few seconds, as the trie-backed
dictionary has to be generated on-the-fly from the pickled dict-based
dictionary first.
This changes the format of the dictionary returned by
`DictionaryFactory().get_dictionary()` from
`Dict[ByteString, ByteString]` to `Mapping[str, str]` to accommodate
alternative dictionary factory implementations better and to ease the
dictionary handling again. This keeps the storage of pickled
dictionaries with byte strings though, as they're smaller than when
using strings.
@Dunedan marked this pull request as ready for review on June 26, 2024 at 04:38
@adbar merged commit 63933fc into adbar:main on Jun 26, 2024
14 checks passed
@adbar (Owner) commented Jun 26, 2024

Thanks!
