
High memory usage during pickling #102

Closed
Dobatymo opened this issue Dec 13, 2018 · 8 comments

@Dobatymo

First: Thank you so much for fixing the pickling bug.
Now let me be so impertinent as to ask whether it is possible to reduce the memory usage during pickle.dump.

Using the same wikititles dataset as before, loading 7,000,000 lines requires 2.87 GB of memory. However, during pickling the Python process reaches a peak of 10.45 GB. That means about 7.5 GB of memory is used just to pickle the object, almost 3 times the size of the original object in memory. The resulting file is 2.17 GB. For larger automatons I quickly run out of memory and Python raises a MemoryError.
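For reference, a stdlib-only sketch of how this kind of peak-memory measurement can be reproduced. A plain dict stands in for the automaton here (the real numbers above were measured on an ahocorasick.Automaton), and tracemalloc only traces allocations made after start():

```python
import pickle
import tracemalloc

# Hypothetical stand-in for a large automaton: many small str -> bytes entries.
data = {str(i): b"x" * 100 for i in range(50_000)}

tracemalloc.start()
blob = pickle.dumps(data)           # the whole serialized state is built in memory
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"pickled size: {len(blob)} bytes, peak extra memory: {peak} bytes")
```

The peak traced memory is at least the size of the pickled blob, since the entire serialized byte string is materialized before anything can be written to disk.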

@WojciechMula
Owner

Thanks for this report, I'm aware of the issue. The problem is that we need to map physical addresses of objects to some abstract identifiers, and currently that is done in an extremely naive way.

@WojciechMula
Owner

WojciechMula commented Dec 15, 2018

@Dobatymo I tried to solve it, but it seems the pickle module does not support chunking of input. I mean, pickle collects all the data and then writes everything using a single call to write() on a file object.

I'm thinking about adding custom save/load methods, that wouldn't be that memory consuming.

@Dobatymo
Author

Thank you for your efforts!
It would be great if your custom save method could write a pickle-compatible format, so that a normal pickle.load would still work. But of course I would also be happy with any custom save/load methods.

@WojciechMula
Owner

I know what you mean, but I'm afraid it would be a burden for me. Pickle supports several formats, and even if we stuck to the "highest" one, we would need to keep up with changes and take both Py2 and Py3 into account, i.e. we'd have a heavy dependency. But this is just my impression at the moment; I feel it'd be easier to fix the official pickle module to support progressive saving. :)

BTW I did some hacking in the meantime. The full save/load sequence in Python code would be like this:

import ahocorasick
import pickle

A = ahocorasick.Automaton()
path = "automaton.dat"

# serialize stored values with pickle.dumps while saving...
A.save(path, pickle.dumps)
# ...and deserialize them with pickle.loads while loading
B = ahocorasick.load(path, pickle.loads)

Here is time & memory usage comparison of the WIP code:

+----------+---------------+------------+
| action   | memory [B]    |  time [s]  |
+==========+===============+============+
| save     |   597 880 832 |      2.58  |
+----------+---------------+------------+
| load     |   986 472 448 |      9.68  |
+----------+---------------+------------+
| pickle   | 1 757 429 760 |      3.89  |
+----------+---------------+------------+
| unpickle | 1 197 559 808 |      1.46  |
+----------+---------------+------------+

The automaton itself consumes 597 MB. That means save takes (almost) no extra memory, while load does require extra memory. The current implementation of load is slow; I think it could be made faster, perhaps at some memory cost.
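The streaming idea behind such a save can be sketched with the stdlib alone (this is a hypothetical format, not the library's actual on-disk layout): each value is serialized independently and written with a length prefix, so only one record is ever held in memory at a time:

```python
import pickle
import struct

def stream_save(path, items, dumps=pickle.dumps):
    """Write (key, value) pairs one record at a time."""
    with open(path, "wb") as f:
        for key, value in items:
            for blob in (key.encode("utf-8"), dumps(value)):
                f.write(struct.pack("<I", len(blob)))  # 4-byte length prefix
                f.write(blob)

def stream_load(path, loads=pickle.loads):
    """Yield (key, value) pairs without reading the whole file at once."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            key = f.read(struct.unpack("<I", header)[0]).decode("utf-8")
            vlen = struct.unpack("<I", f.read(4))[0]
            yield key, loads(f.read(vlen))
```

Because the serialization callback is applied per value (as with the save/load methods above), peak memory stays close to the size of a single record rather than the whole structure.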

@Dobatymo
Author

That looks great. I understand that the Pickle support would be too hard to maintain.

For loading, I would be happy to sacrifice speed for less memory. I mostly use this library in web apps, so loading is usually a one-time operation that just increases the app-server's startup time, whereas too-high memory usage would prevent loading the automaton at all. But that's just my preference.

Thanks again for your work!

@WojciechMula
Owner

@Dobatymo I guess it's possible to shrink the memory requirement to really small values, but at the cost of re-reading the input file and re-scanning memory multiple times. Say your machine has 16 GB and the automaton would occupy 15.5 GB; you could then request that loading use at most 50 MB of extra memory.

BTW, have you ever specified the store attribute of the Automaton class? I mean one of the constants STORE_ANY or STORE_INTS.

@Dobatymo
Author

Of course it would be great if the loading method could adapt to the available memory. But I think at the moment my limiting factor is pickle.dump; the new save/load methods would already improve things a lot.

I have not specified the STORE attribute, as I use the automaton to store Python bytes objects, and there is no constant for that, is there?

WojciechMula added a commit that referenced this issue Dec 20, 2018
- Issue #102: add alternative pickling mechanism which requires
  less memory than the standard pickle solution
@WojciechMula
Owner

@Dobatymo I released 1.3.0, which supports the save/load methods. Usage is exactly as I posted above, but since you store only bytes objects, I believe the callbacks passed to save/load can simply be lambda x: x.

If the loading time is still too slow for you, please open a new issue. (I'll try to improve it, but without an issue it won't be a priority.)

Speaking of the STORE constants: bytes objects are just native Python objects, so no special flag is needed. I'm asking because I'd like to get rid of this flag and store only Python objects.
