
Add a dictionary factory backed by MARISA-tries #133

Merged · 2 commits into adbar:main · Jun 26, 2024

Conversation

@Dunedan (Contributor) commented May 30, 2024

This adds an additional dictionary factory backed by MARISA-tries. This dictionary factory on average offers 20x lower memory usage and 100x faster initialization time, in exchange for reduced lemmatization and language detection performance.

The first time loading a dictionary with the TrieDictionaryFactory requires more memory and will take a few seconds, as the trie-backed dictionary has to be generated on-the-fly from the pickled dict-based dictionary first.
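A minimal usage sketch for context (the import paths and the `Lemmatizer`/`DefaultStrategy` wiring are assumptions about Simplemma's strategy API, not taken verbatim from this PR; only `TrieDictionaryFactory().get_dictionary()` appears literally later in this thread):

```python
# Hedged sketch: plugging the trie-backed factory into lemmatization.
# Import paths and the Lemmatizer/DefaultStrategy wiring are assumptions.
from simplemma import Lemmatizer
from simplemma.strategies import DefaultStrategy
from simplemma.strategies.dictionaries import TrieDictionaryFactory

strategy = DefaultStrategy(dictionary_factory=TrieDictionaryFactory())
lemmatizer = Lemmatizer(lemmatization_strategy=strategy)
print(lemmatizer.lemmatize("balconies", lang="en"))  # expected: "balcony"
```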

Review thread on the following passage:

> …using the `TrieDictionaryFactory` for the first time for a language and
> will take a few seconds and use as much memory as loading the Python
> dicts for the language requires. For further invocations the trie
> dictionaries get cached on disk.

Contributor: Would it be helpful to explain where the cache is located on disk?

@Dunedan (Author): I tried not to document every little detail in the README to avoid blowing it up too much. IMO there should be separate API documentation for Simplemma to cover stuff like that. However, if it's desired, please let me know and I'll happily add it.

@osma (Contributor) commented May 30, 2024

Awesome work @Dunedan !

@adbar (Owner) commented May 30, 2024

Thanks for sharing your insights and for contributing this functionality!

Everything looks good at first sight, but we need to update the installation process on GitHub Actions.

codecov bot commented Jun 4, 2024

Codecov Report

Attention: Patch coverage is 98.63014% with 1 line in your changes missing coverage. Please review.

Project coverage is 97.25%. Comparing base (5f4fa16) to head (81f08ba).

Files Patch % Lines
...emma/strategies/dictionaries/dictionary_factory.py 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #133      +/-   ##
==========================================
+ Coverage   97.15%   97.25%   +0.09%     
==========================================
  Files          33       34       +1     
  Lines         563      619      +56     
==========================================
+ Hits          547      602      +55     
- Misses         16       17       +1     


@adbar (Owner) commented Jun 4, 2024

@Dunedan Your PR only works on Windows; there seems to be an issue with the .dic file names. Could you please look into it?

There are also lines not covered by the tests; it should be easy to add further tests.

@Dunedan (Author) commented Jun 4, 2024

> @Dunedan Your PR only works on Windows; there seems to be an issue with the .dic file names. Could you please look into it?

I guess you're talking about the tests, which only succeed on the Windows runners so far. At first glance that looks like a list-ordering issue affecting just the test assertions. Should be easy to fix. I'll probably do so tomorrow.

If you're not talking about the tests, please let me know and provide a bit more detail about what functionality doesn't work and how that manifests. I'd be surprised about that though, as I developed and tested the code on Linux and didn't encounter any problems there.

> There are also lines not covered by the tests; it should be easy to add further tests.

Sure, no big deal, I can add tests for that as well.

@juanjoDiaz (Collaborator) left a comment

I've dropped a bunch of comments.
I like the idea of using tries, but I'm not fully convinced by the implementation.
I think that we should have a TrieDictionaryFactory which (roughly sketched below):

  • has a method to convert all (or the selected) pickled dictionaries into pickled tries and save them to disk
  • has a method to load a dictionary, which loads it from disk if present or from the original dictionary as needed
  • doesn't do all that hashing logic, because dictionaries are not likely to change unless the library is updated
  • reconsiders whether the current bytestring usage is right
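A hypothetical skeleton of that shape, just to make the proposal concrete (names and signatures are illustrative, not taken from the PR):

```python
# Hypothetical skeleton of the proposed design; names and signatures are
# illustrative only and do not come from the actual PR.
from pathlib import Path
from typing import Iterable, Mapping


class ProposedTrieDictionaryFactory:
    """Sketch: trie dictionaries cached on disk, built lazily, no hashing."""

    def __init__(self, cache_dir: Path) -> None:
        self._cache_dir = cache_dir

    def generate_tries(self, langs: Iterable[str]) -> None:
        # Convert the selected pickled dictionaries into tries on disk up front.
        for lang in langs:
            self.get_dictionary(lang)

    def get_dictionary(self, lang: str) -> Mapping[str, str]:
        # Load the trie from disk if present, otherwise build it from the
        # pickled dictionary and write it to the cache directory.
        raise NotImplementedError("sketch only")
```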

logger = logging.getLogger(__name__)


class TrieWrapDict(MutableMapping):
Collaborator:

Should we simply modify the DictionaryFactory protocol to return a Mapping instead of a Dict, rather than having this wrapper?

Why MutableMapping instead of Mapping?

@Dunedan (Author):

One of the constraints I gave myself was to implement this functionality without requiring changes to the existing code of Simplemma. That's why I didn't modify the expected return types of the DictionaryFactory, for example. Doing so would certainly simplify things; however, that's something I didn't want to decide on my own.

It's a MutableMapping right now because a dict, which is what the DefaultDictionaryFactory uses, is one as well. No further reason beyond that.
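For illustration, a read-only variant would only need the Mapping protocol; a minimal sketch (assuming a marisa_trie.BytesTrie with UTF-8-encoded lemma values — this is not the PR's actual TrieWrapDict):

```python
# Minimal sketch of a read-only Mapping wrapper around a BytesTrie;
# not the PR's TrieWrapDict, just an illustration of the idea.
from collections.abc import Mapping
from typing import Iterator

from marisa_trie import BytesTrie  # type: ignore[import-not-found]


class ReadOnlyTrieDict(Mapping):
    def __init__(self, trie: BytesTrie) -> None:
        self._trie = trie

    def __getitem__(self, key: str) -> str:
        return self._trie[key][0].decode()

    def __iter__(self) -> Iterator[str]:
        return iter(self._trie.keys())

    def __len__(self) -> int:
        return len(self._trie)
```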

    def __init__(
        self,
        cache_max_size: int = 8,
        use_disk_cache: bool = True,
Collaborator:

use_disk_cache is not needed.
If the disk_cache_dir is None, then you don't use disk caching.

@Dunedan (Author):

Right now, if disk_cache_dir is None, a subdirectory of the user's platform-specific cache directory is used to store the cache. So use_disk_cache currently distinguishes between disabling the disk cache and using the default cache location.
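For illustration, the interplay of the two parameters might look roughly like this (a sketch, not the PR's actual code; `platformdirs.user_cache_dir` is the real helper):

```python
# Sketch of how the two parameters could interact; not the PR's actual code.
from pathlib import Path
from typing import Optional

from platformdirs import user_cache_dir


def resolve_cache_dir(use_disk_cache: bool, disk_cache_dir: Optional[str]) -> Optional[Path]:
    if not use_disk_cache:
        return None  # disk caching disabled entirely
    if disk_cache_dir is not None:
        return Path(disk_cache_dir)  # explicit user-chosen location
    # Default: a per-user, platform-specific cache directory,
    # e.g. ~/.cache/simplemma on Linux.
    return Path(user_cache_dir("simplemma"))
```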

from typing import ByteString, Dict, List, Optional, cast

from marisa_trie import BytesTrie, HUGE_CACHE # type: ignore[import-not-found]
from platformdirs import user_cache_dir
Collaborator:

platformdirs is also an external dependency. So why doesn't it need the type ignore like marisa_trie?

@Dunedan (Author):

While I'm not entirely sure, my guess is that mypy can't get any type information from marisa_trie, because it isn't Python code but a C extension and doesn't provide type information.


    def __init__(
        self,
        cache_max_size: int = 8,
Collaborator:

We are doing double caching: in memory and on disk.
These parameters make it unclear which cache they refer to.

@Dunedan (Author):

I completely agree. I just kept that parameter from DefaultDictionaryFactory to be able to use TrieDictionaryFactory as a drop-in replacement for it.

            hasher.update(chunk)
        return hasher.hexdigest()

    def _cache_is_current(self, lang: str) -> bool:
Collaborator:

What's the scenario in which the on-disk trie won't match the in-memory one?

@Dunedan (Author):

This method is for checking whether the tries cached on disk still match the pickled bytestring dictionaries. They don't match anymore whenever a new version of Simplemma gets released which includes updated pickled bytestring dictionaries, resulting in the need to regenerate the cached tries.
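For illustration, that check amounts to hashing the pickled dictionary and comparing it against a digest stored next to the cached trie; a rough sketch (not the PR's exact code):

```python
# Rough sketch of hash-based cache validation; not the PR's exact code.
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def cache_is_current(pickled_dict: Path, stored_digest: Path) -> bool:
    # The cached trie is current as long as the pickled dictionary it was
    # built from still has the digest recorded when the trie was generated.
    if not stored_digest.exists():
        return False
    return stored_digest.read_text().strip() == file_sha256(pickled_dict)
```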

        self._trie = trie

    def __getitem__(self, item):
        return self._trie[item.decode()][0]
Collaborator:

Having to do this decoding right after encoding feels a bit wasteful. Every time I see it, it makes me wonder whether changing this to bytestrings was the right call...

@Dunedan (Author):

While I somewhat agree, it wasn't much better before, as BytesTrie expects the keys to be strings and the values to be bytestrings. So while now the keys have to be encoded and decoded, before it was the values.
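For reference, a small sketch of what BytesTrie expects (str keys, bytes values), which is where the encode/decode round-trips come from:

```python
# BytesTrie stores str keys and bytes values, so one side always needs an
# encode/decode round-trip when the surrounding code works on the other type.
from marisa_trie import BytesTrie  # type: ignore[import-not-found]

trie = BytesTrie([("balconies", b"balcony")])

# Lookup with a str key returns a list of bytes values:
assert trie["balconies"] == [b"balcony"]

# With str-keyed lookups, only the value needs decoding...
lemma = trie["balconies"][0].decode()  # -> "balcony"

# ...whereas a wrapper exposing bytes keys must decode the key first:
key = b"balconies"
lemma = trie[key.decode()][0].decode()
```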

@@ -0,0 +1,234 @@
from collections.abc import ItemsView, KeysView
Collaborator:

Could we somehow reuse generic DictionaryFactory tests for all the implementations?

@Dunedan (Author):

For common functionality that would be helpful, and I considered implementing dictionary-agnostic tests, but I concluded that this would be out of scope for this PR.
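For what it's worth, such factory-agnostic tests could later be expressed as a parametrized fixture; a sketch (hypothetical test code, not part of this PR, and it assumes the str-keyed Mapping protocol introduced by this PR):

```python
# Hypothetical sketch of dictionary-factory-agnostic tests; not part of this PR.
# Import paths are assumptions; assumes get_dictionary() returns Mapping[str, str].
import pytest

from simplemma.strategies.dictionaries import DefaultDictionaryFactory, TrieDictionaryFactory


@pytest.fixture(params=[DefaultDictionaryFactory, TrieDictionaryFactory])
def dictionary_factory(request):
    return request.param()


def test_get_dictionary_contains_known_lemma(dictionary_factory):
    dictionary = dictionary_factory.get_dictionary("en")
    assert "balconies" in dictionary
    assert "balconies123" not in dictionary
```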

@Dunedan (Author) commented Jun 5, 2024

> • has a method to convert all (or the selected) pickled dictionaries into pickled tries and save them to disk

That happens right now when loading the dictionary. Having it as a separate method IMO wouldn't provide much benefit, but it wouldn't be a big change either.

    for lang in SUPPORTED_LANGUAGES:
        TrieDictionaryFactory().get_dictionary(lang)

> • has a method to load a dictionary, which loads it from disk if present or from the original dictionary as needed

That's currently done when loading a dictionary:

    dictionary = TrieDictionaryFactory().get_dictionary("en")

> • doesn't do all that hashing logic, because dictionaries are not likely to change unless the library is updated

How would you handle regenerating the cached tries after dictionary updates then?

@Dunedan (Author) commented Jun 5, 2024

I just pushed another commit fixing the syntax errors in the tests for Python <3.9.

The failure to run the tests with Python 3.13.0b1 is caused by marisa-trie not supporting Python 3.13 yet. For details about this, please check the issue I opened there: pytries/marisa-trie#104

Aside from those failed GitHub Actions jobs, the only remaining failure is with Python 3.10 on the macOS runner and seems to be unrelated to my changes, as it's failing because of a missing Codecov token.

@adbar (Owner) commented Jun 6, 2024

Python 3.6 and 3.7 are now deprecated, and the token error with 3.10 on macOS is a Codecov problem, nothing serious.

Thanks for your work @Dunedan !

@juanjoDiaz Can we merge the PR or do you have other questions?

@juanjoDiaz (Collaborator) commented:

I'm all in for having trie-backed dictionaries.

However, I'm still not convinced by the implementation.
I did a quick-and-dirty alternative proposal (not fully tested, just to show my thinking).

@Dunedan (Author) commented Jun 10, 2024

> I'm all in for having trie-backed dictionaries.
>
> However, I'm still not convinced by the implementation. I did a quick-and-dirty alternative proposal (not fully tested, just to show my thinking).

Thanks for that alternative proposal. 👍

As I'd prefer to keep the discussion in a single place, I'm going to comment on your PR here:

> • It encapsulates the bytestring optimization better while keeping the dictionary factory working on strings

That's what I'd prefer as well, but as that requires changes to the DictionaryFactory protocol, I didn't include it in my PR so far. If changing the DictionaryFactory protocol is up for discussion, I'm all for it. 👍

> • It does not require an external dependency to pick a temp folder for disk caching

The way you implemented that has several disadvantages from my point of view:

  • In cases where no per-user/per-process temp directory is used (e.g. when not using systemd's PrivateTmp=true option), the default cache directory would be shared by all local users. Depending on user groups and umasks, only the first user would be able to write in there. This could be solved, though, by adding the name or uid of the current user to the file path, like other applications do as well.
  • The temp directory is often backed by memory (e.g. through tmpfs), so it's not a good place to store data that is supposed to be cached on disk.
  • The generated trie dictionaries would be regenerated on every reboot and, depending on the temp-directory configuration and the usage of simplemma, even more frequently (e.g. by default systemd deletes temporary files which haven't been accessed for 10 days).

All of that is why I didn't use a temp directory, but a user-specific cache directory instead.

> • It does not require all the additional hashing logic. It simply caches data in a folder specific to simplemma's version, or wherever the user says

In the end that's a balance between minimizing the number of times the trie-based dictionaries have to be regenerated and implementation complexity. I (obviously) opted for minimizing the number of regenerations, as I expect Simplemma not to update dictionaries with every release, and regenerating them is pretty costly in terms of CPU and memory usage. If Simplemma is going to update dictionaries more frequently than I anticipate, then the hashing wouldn't make much sense, of course.

> • It does not use internal functions of simplemma like _load_dictionary_from_disk

I like that. 👍

@adbar (Owner) commented Jun 12, 2024

I guess it would now be best to integrate some of the changes discussed into this PR? We could first start with the changes you both agree on.

Besides, as far as I'm concerned, we can make changes to the DictionaryFactory protocol.

@juanjoDiaz (Collaborator) commented:
We agree on modifying the DictionaryFactory protocol.
I can do that in my other PR.
And then the trie implementation can be adjusted to that.

I see 3 aspects that we need to align on:

  • How to cache on disk
  • Is hashing needed to notice if dictionaries changed?
  • Should this be part of simplemma or published as a separate module?

How to cache on disk

> The way you implemented that has several disadvantages from my point of view:
>
>   • In cases where no per-user/per-process temp directory is used (e.g. when not using systemd's PrivateTmp=true option), the default cache directory would be shared by all local users. Depending on user groups and umasks, only the first user would be able to write in there. This could be solved, though, by adding the name or uid of the current user to the file path, like other applications do as well.
>
>   • The temp directory is often backed by memory (e.g. through tmpfs), so it's not a good place to store data that is supposed to be cached on disk.
>
>   • The generated trie dictionaries would be regenerated on every reboot and, depending on the temp-directory configuration and the usage of simplemma, even more frequently (e.g. by default systemd deletes temporary files which haven't been accessed for 10 days).

The more I think about this, the more I think that we should simply store the trie dictionaries in str(Path(__file__).parent / "data_marisa_tries"). It will be there for all runs of simplemma, no matter how many processes are running simplemma.
We could, of course, have an optional argument to provide a custom path. The same could also be used in the DefaultDictionaryFactory to provide one's own pickled dictionaries.

I think that we need to think of how Simplemma is used.
I expect that simplemma is used either in scripts or as part of a server.
In both cases, I would simply generate the tries on disk before running the script or starting the server.
If I put my server in a Docker container, I would build the tries during startup or even during image creation.
I wouldn't expect much dynamism in creating the tries.

That's also why I propose having a method that allows pre-generating all tries on disk, which can be documented as a simple for loop.
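A rough sketch of that idea (illustrative only; the import path, the version-scoped folder name, and the disk_cache_dir argument are assumptions, not code from either PR):

```python
# Illustrative sketch: tries stored inside the installed package, scoped by
# the simplemma version, and pre-generated in one documented loop.
# The path layout, import path, and disk_cache_dir argument are assumptions.
from importlib.metadata import version
from pathlib import Path

import simplemma
from simplemma.strategies.dictionaries import TrieDictionaryFactory

cache_dir = Path(simplemma.__file__).parent / f"data_marisa_tries_{version('simplemma')}"
cache_dir.mkdir(parents=True, exist_ok=True)

# Pre-generate every trie once, e.g. during image creation or server setup.
factory = TrieDictionaryFactory(disk_cache_dir=str(cache_dir))
for lang in ("en", "de"):  # or the full SUPPORTED_LANGUAGES list
    factory.get_dictionary(lang)
```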

Is hashing needed to notice if dictionaries changed?

> In the end that's a balance between minimizing the number of times the trie-based dictionaries have to be regenerated and implementation complexity. I (obviously) opted for minimizing the number of regenerations, as I expect Simplemma not to update dictionaries with every release, and regenerating them is pretty costly in terms of CPU and memory usage. If Simplemma is going to update dictionaries more frequently than I anticipate, then the hashing wouldn't make much sense, of course.

I think that Simplemma never updates the dictionaries because it doesn't have a fully automated mechanism to do so.
Otherwise, I would update the dictionaries with every single release of Kaikki or whatever other dataset we want to use. That's the reason why I would like to really automate the dictionary training.

So for me the question is: are we going to release so many versions of simplemma that regenerating the tries with every release would become a problem? I don't think so.

And I have the same remark as before: how will simplemma be used?
For servers and APIs, I would consider creating the tries as part of setting up the service (building the Docker image, building the VM, or manually setting up the server, depending on how modern you are).

Should this be part of simplemma or published as a separate module?

One of the beauties of simplemma is that it's dependency-free.
Having this would make it depend on an external library, even if only for one optional feature.

Also, today it is marisa-trie, but tomorrow it could be another trie implementation, a better hashmap, or something else.
Do we want to add all of those to simplemma?
Or should they be published as external modules that just implement the protocol, so they can be used by simplemma easily?

@Dunedan (Author) commented Jun 14, 2024

> We agree on modifying the DictionaryFactory protocol.
> I can do that in my other PR.

If you don't mind, I could also change the DictionaryFactory protocol in my PR.

> The more I think about this, the more I think that we should simply store the trie dictionaries in str(Path(__file__).parent / "data_marisa_tries").

That path might not even be writable. What's the reason for your resistance to just using the user's cache directory as the default?

> So for me the question is: are we going to release so many versions of simplemma that regenerating the tries with every release would become a problem? I don't think so.

We can simply go on without the hashing for now. Hashing can easily be added later on, if the dictionaries change more frequently than anticipated.

> One of the beauties of simplemma is that it's dependency-free.
> Having this would make it depend on an external library, even if only for one optional feature.

I believe you're limiting yourself too much with this stance on the use of dependencies. While I'm a fan of having as few dependencies as necessary, sometimes adding a third-party library can massively improve the utility of a package. The use of marisa-trie is an example of that, and I believe having it as an optional dependency is a good trade-off.

> Should this be part of simplemma or published as a separate module?

While I'd prefer to have it as part of Simplemma, as it would be easier to use and maintain, a separate package would be fine for me as well. How you want to handle that is something you guys have to decide.

assert ("balconies" in dictionary) is True
assert ("balconies123" in dictionary) is False
with pytest.raises(KeyError):
dictionary["balconies123"]

Code scanning / CodeQL notice: This statement has no effect.
assert ("balconies123" in wrapped_trie) is False
assert wrapped_trie["balconies"] == "balcony"
with pytest.raises(KeyError):
wrapped_trie[b"balconies123"]

Code scanning / CodeQL notice: This statement has no effect.
@adbar (Owner) commented Jun 14, 2024

I would also prefer to integrate it into simplemma; as long as the installation remains optional, the package is still usable without dependencies.

As for the rest of the questions, I'll keep following the discussion, as I have no fixed opinion on this. Maybe keep hashing aside for now for the sake of simplicity.

@juanjoDiaz (Collaborator) commented:

Alright, we agreed on:

  • Introducing my changes to the DictionaryFactory to simplify things
  • Keeping the trie implementation within Simplemma
  • Not introducing hashing for now

The only thing left is to agree on the disk caching.
I'm not really against it. I'm just trying to keep things simple.
I'm happy to keep the high configurability of the location with a sensible default.
However, I'd like to remove the dependency on platformdirs if possible.

@Dunedan (Author) commented Jun 18, 2024

> Alright, we agreed on:
>
>   • Introducing my changes to the DictionaryFactory to simplify things
>   • Keeping the trie implementation within Simplemma
>   • Not introducing hashing for now

All of that is already implemented in this PR. 🙂

> The only thing left is to agree on the disk caching. I'm not really against it. I'm just trying to keep things simple. I'm happy to keep the high configurability of the location with a sensible default. However, I'd like to remove the dependency on platformdirs if possible.

IMO the simplest and most sensible solution is to use the user's cache directory as the default and get its location using platformdirs. If you disagree, please suggest a better alternative.

@juanjoDiaz (Collaborator) commented:

I would use https://docs.python.org/3/library/tempfile.html instead of platformdirs to have a default temp folder with no dependencies.
And let the user provide their own folder if they want more control.

@Dunedan (Author) commented Jun 20, 2024

> I would use https://docs.python.org/3/library/tempfile.html instead of platformdirs to have a default temp folder with no dependencies.

So this is the same approach you implemented in your PR, and I already explained above the downsides this approach has. To sum it up: you prefer a default configuration which keeps all generated tries in memory and regenerates them on every reboot (or potentially more often), just to avoid having to use an additional, tiny, well-maintained, pure-Python dependency. That approach negates the lower-memory benefit of using tries and slows down users' applications, as the tries have to be regenerated regularly. For long-running applications using Simplemma which get started on boot, not caching tries at all would even be better than this approach, as it'd mean lower memory utilization and the tries wouldn't have to be regenerated more often.

In my opinion that's not a user-friendly default, and I still don't understand the insistence on not using platformdirs. If you prefer adding additional code to Simplemma over adding an additional dependency, including a copy of the relevant code from platformdirs would be an alternative option, still providing the benefit of a sensible caching location.

@adbar (Owner) commented Jun 24, 2024

Since this PR entails substantial work, I'd be in favor of moving on with the integration. We can always amend it later to improve on the issues discussed here.

@Dunedan The tests do not pass for Python 3.8; could you please have a look at it?

@Dunedan (Author) commented Jun 25, 2024

> @Dunedan The tests do not pass for Python 3.8; could you please have a look at it?

Sorry, I missed that. This should be fixed now.

If you're fine with the code as-is, I suggest I rebase & squash the commits in this branch to have a single commit for all the changes and promote the pull request from "Draft" to "Ready for review" afterwards.

@juanjoDiaz (Collaborator) commented Jun 25, 2024

> So this is the same approach you implemented in your PR, and I already explained above the downsides this approach has. To sum it up: you prefer a default configuration which keeps all generated tries in memory and regenerates them on every reboot (or potentially more often), just to avoid having to use an additional, tiny, well-maintained, pure-Python dependency. That approach negates the lower-memory benefit of using tries and slows down users' applications, as the tries have to be regenerated regularly. For long-running applications using Simplemma which get started on boot, not caching tries at all would even be better than this approach, as it'd mean lower memory utilization and the tries wouldn't have to be regenerated more often.

Using tempfile does not keep things in memory but in a temp folder on disk.
My insistence on not having platformdirs is that it's an external dependency that the user will have to know about and install for things to work.
The only difference that I can see with platformdirs is that tempfile gives a folder per process, whereas platformdirs gives a generic folder shared by every process.
I think that it makes sense to have temp data scoped by process. Lazy caching by many processes, if not done carefully, may produce race conditions.
The user always has the option to provide a folder manually if they want a folder shared by many processes. But, as said, this must be used carefully, e.g. by loading the tries onto disk on startup and then launching all the processes.

Maybe you are using simplemma very differently.

@Dunedan (Author) commented Jun 25, 2024

> Using tempfile does not keep things in memory but in a temp folder on disk.

On Linux, tempfile creates files and directories in /tmp, which is nowadays often backed by tmpfs, which stores its contents in memory.

> My insistence on not having platformdirs is that it's an external dependency that the user will have to know about and install for things to work.

I don't know how you install Python packages, but usually the user doesn't need to know about the dependencies of a package. In this case the user just has to follow the instructions for using tries and install simplemma[marisa-trie], e.g. with pip:

pip install simplemma[marisa-trie]

platformdirs will be installed automatically as part of that.
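For reference, that works because the extra bundles both libraries; a hypothetical setuptools excerpt (Simplemma's actual packaging metadata may differ):

```python
# Hypothetical setup.py excerpt; Simplemma's actual packaging metadata may differ.
from setuptools import setup

setup(
    name="simplemma",
    extras_require={
        # enables: pip install simplemma[marisa-trie]
        "marisa-trie": ["marisa-trie", "platformdirs"],
    },
)
```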

> The only difference that I can see with platformdirs is that tempfile gives a folder per process, whereas platformdirs gives a generic folder shared by every process. I think that it makes sense to have temp data scoped by process. Lazy caching by many processes, if not done carefully, may produce race conditions.

As explained already, there is more than just this single difference.
It'd be easy to use a per-process subdirectory in the user's cache directory; however, that'd remove the benefit of caching, as the tries would have to be regenerated on every invocation of the application.

Regarding race conditions: there is one possible race condition right now. If two processes start in parallel, one doesn't find a cached trie, starts generating it, and is still in the process of writing it to disk when the second process tries to read the unfinished file from the cache; the second process will then get an exception, because the not-yet-finished file isn't valid. I expect that case to be very rare though, and fixing it would be as easy as writing the cached trie to a temporary file and moving it to its final position after writing.
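A sketch of that fix, using a temporary file in the same directory and an atomic rename (illustrative, not the PR's code):

```python
# Illustrative sketch of the suggested fix: write the trie to a temporary
# file in the cache directory and atomically move it into place, so a
# concurrent reader never sees a half-written file.
import os
import tempfile
from pathlib import Path

from marisa_trie import BytesTrie  # type: ignore[import-not-found]


def save_trie_atomically(trie: BytesTrie, target: Path) -> None:
    fd, tmp_path = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    os.close(fd)
    try:
        trie.save(tmp_path)           # write the full trie to the temp file
        os.replace(tmp_path, target)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```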

@adbar (Owner) commented Jun 25, 2024

@Dunedan Thanks for fixing the PR; I believe you can prepare it so that it can be merged.

(My understanding is that merging the PR is not the end of the discussion and that work on this optional function can continue beyond this point.)

This adds an additional dictionary factory backed by MARISA-tries. This
dictionary factory on average offers 20x lower memory usage and 100x
faster initialization time, in exchange for reduced lemmatization and
language detection performance.

The first time loading a dictionary with the `TrieDictionaryFactory`
requires more memory and will take a few seconds, as the trie-backed
dictionary has to be generated on-the-fly from the pickled dict-based
dictionary first.
This changes the format of the dictionary returned by
`DictionaryFactory().get_dictionary()` from
`Dict[ByteString, ByteString]` to `Mapping[str, str]` to accommodate
alternative dictionary factory implementations better and to ease the
dictionary handling again. This keeps the storage of pickled
dictionaries with byte strings though, as they're smaller than when
using strings.
@Dunedan marked this pull request as ready for review on June 26, 2024 at 04:38
@adbar merged commit 63933fc into adbar:main on Jun 26, 2024
14 checks passed
@adbar (Owner) commented Jun 26, 2024

Thanks!
