Invalid pickle file generated: "ValueError: binary data truncated (1)" #50
I understand that this is something that is tricky to reproduce, so I've created a new repository with my code and invited you to it. I've added documentation there on how to run the code. Beware that running through the full Wikidata dump, with 24 million entries, takes several hours. After all the building is done you can quickly run the example and see it fail. Let me know if there is anything I can do to help troubleshoot this.
@EmilStenstrom Thanks a lot for your effort! I'll try to reproduce the bug.
@EmilStenstrom I was able to build wikidata-reduced.json; it was really time-consuming. :) Now I can debug, thank you.
@WojciechMula Phew. Now you have some more waiting to do as you build the automaton, and then try to search it. Building the automaton works; the crash occurs when you try to load it using the last command. Sorry about the long waits :)
@EmilStenstrom I'm working on this issue now, and managed to fix an ugly memory leak. It's not a fix for this bug yet, though. :)
@WojciechMula That sounds fantastic! :) I'm happy all that processing power didn't go to waste.
@EmilStenstrom I'm still trying to reproduce the bug. Unfortunately, my laptop has too little memory and your app is killed after eating all 4 GB. I tried to split the input and then build/pickle/unpickle smaller chunks, but nothing wrong has happened so far. I suspected there were some Unicode-related problems (like #53), but it seems that's not the case. Just writing to give you feedback.
@WojciechMula I'm thinking of different ways of helping out. Would it help if I sent you the pickle file? It is 88 MB zipped, so I think I can give you a Dropbox link? What e-mail should I send the link to?
I had an idea that maybe the whole file was truncated, so I inspected the pickle file with pickletools from the Python standard library. But it seems it ends in the expected way:
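For reference, a minimal sketch of how that dump can be produced with pickletools (the filename is a placeholder):

```python
# Dump the opcode stream of a pickle file with the standard-library
# pickletools module; a healthy stream ends with a STOP opcode.
import pickletools

with open("automaton.pkl", "rb") as f:
    pickletools.dis(f.read())
```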
@EmilStenstrom If you can, please send me the pickle file directly. My e-mail: wojciech_mula@poczta.onet.pl |
Sent the link to your e-mail! I also pushed some updates to the script that creates the wikidata-reduced.json file (I sent you the old file, not the updated one, to make sure you can reproduce). The updated script now excludes lots of entities I'm not interested in anyway, so the file should be about half the size. Maybe that makes it possible to create the automaton on 4 GB? I'm on a MacBook Pro from work with 16 GB RAM, so I can deal with huge files.
@EmilStenstrom It just clicked what's wrong: if your automaton takes several gigabytes, it's almost impossible that a valid pickle file would be several times smaller.
@WojciechMula: So something is wrong with how I create the pickle file? |
@EmilStenstrom You're doing everything perfectly right; there's some bug in the pickling. I just created an automaton with 1,000,000 words and the pickled file has 350 MB.
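A sketch of that experiment (the generated words are synthetic stand-ins, not the real word list):

```python
import os
import pickle
import ahocorasick

# Build an automaton from one million generated words...
A = ahocorasick.Automaton()
for i in range(1000000):
    A.add_word("word%07d" % i, i)
A.make_automaton()

# ...pickle it, and check how large the file comes out.
with open("big.pkl", "wb") as f:
    pickle.dump(A, f)
print(os.path.getsize("big.pkl"))
```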
@EmilStenstrom You've shown the tail of the pickled file, but could you please show the beginning of the file? On my system I have:
For sure the file is corrupted. At offset 33 there is an empty bytes object, while it should be a large blob of data, and the field at offset 37 should be 20. For now I have no idea what's wrong; of course I will continue working on this. Is everything OK when you build smaller automatons?
Here's the first 20 lines of my file:
Looks very similar to yours. I'm using the latest stable version of Python (3.5.2), distributed via Homebrew (the most popular package manager for macOS).
I've now tried with a couple of different files. First a new one generated with the updated script, which removes all empty labels:
Same error, but with a (3) at the end instead of a (1) as before. When I try to run pickletools on this file I get:
And it hangs a LONG time before outputting the OSError, which I think confirms that this file contains the large blob of data that should be there. I've also tried with a much smaller wikidata-reduced file (only 10 lines) and everything works fine there. Inspecting that file with pickletools yields the correct results:
@EmilStenstrom Thank you very much for checking this. I have some vague ideas about the source of the error, but I need to verify them. I haven't replicated your problem yet.
@EmilStenstrom Sorry for a stupid question, but: is your macOS 64-bit?
@WojciechMula Yes. The processor is an "Intel Core i7", which is 64-bit, and the macOS version is Sierra, which runs in 64-bit mode. Also, my Python reports 64-bit:
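One way to check this from Python (an illustrative sketch):

```python
import platform
import struct

print(platform.architecture()[0])  # '64bit' on a 64-bit interpreter
print(struct.calcsize("P") * 8)    # pointer width in bits; 64 expected here
```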
@EmilStenstrom Thank you; I suspected it might be somehow related to integer overflows. At the moment I have no idea how to reproduce the error or what its cause might be. Could you recompile the module with -fsanitize=address and -fsanitize=undefined? I think setting CFLAGS is sufficient, i.e.:
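Something along these lines (a sketch; the exact pip invocation is an assumption, the essential part is setting CFLAGS before the C extension compiles):

```python
import os
import subprocess
import sys

# Reinstall pyahocorasick from source with the sanitizers enabled
# (--no-binary forces a local compile so CFLAGS takes effect).
env = dict(os.environ, CFLAGS="-fsanitize=address -fsanitize=undefined")
subprocess.check_call(
    [sys.executable, "-m", "pip", "install",
     "--force-reinstall", "--no-binary", ":all:", "pyahocorasick"],
    env=env,
)
```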
@EmilStenstrom I didn't forget about the problem; I just ran out of ideas.
Hi! I'm still planning to try the compile flags you suggested above, I just haven't had time. Maybe next week!
Here's the output after running with the CFLAGS you suggested:
It takes all my RAM (16 GB) for about an hour, and then gets killed. I guess we won't get any further from here. I think I should try to solve my problem another way: instead of trying to build the trie in memory, I should persist it to disk in some sort of database optimized for this use case. Thank you for all your hard work!
@EmilStenstrom Thank you very much for your time and effort. I really want to fix this bug, but so far I couldn't. :( As far as I understand your problem, you could try n-gram indexes. They let you narrow the searched space significantly, and are not too complicated. I did some experiments with full-text search and the results were impressive.
Hi, I'm hitting something like this with Python 3.5 on Windows. It seems related to Python 3.5, as I can read the same pickle in 3.4 without error and use the automaton. I'll email you details of the files and upload them to Dropbox for you to download. The dataset is much smaller than the one mentioned here (the pickle is only 17 MB). It doesn't seem to matter whether the pickle is created in 3.4 or 3.5; the read issue only happens on Windows. Reading it on Linux returns an automaton with no words! Guess that is too much to hope for!
David, thank you very much, I will look closer at this. I've already downloaded the file.
Tested on Windows with Python 3.6 and got no error, so it looks like it's Windows Python 3.5 only.
I get the following:
I deleted the comment where I thought it was working, because I tested with 3.6, not 3.5.3. Sigh.
I added another trace which produces the following output:
Could you send me a wheel for 64-bit 3.5.3? I'm exploring the idea that there is a build config issue with my laptop. I'm installing Visual Studio 2015 on another laptop just now to try it out on another machine.
Found out how it breaks like this, and why I think it works for you! Using 32-bit Python 3.5.3 on Windows I can create and load the pickle with no problem. The 64-bit version is where the issue lies. The pickle from 64-bit Windows is larger: 294 bytes instead of 214 bytes.
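A tiny sketch of how such a comparison can be made between builds (the words are arbitrary):

```python
import pickle
import ahocorasick

A = ahocorasick.Automaton()
A.add_word("he", 0)
A.add_word("she", 1)
A.make_automaton()

# This byte count differs between 32-bit and 64-bit interpreters.
print(len(pickle.dumps(A)))
```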
@woakesd I'm not on Windows right now, but I'm pretty sure that I have 64-bit versions of Python and that compilation also produces 64-bit binaries. But it might be a proper hint, thank you for checking. I will send you my compiled modules tomorrow.
@EmilStenstrom do you mind trying with the latest release? |
@pombredanne Sorry, I don't have any of the code I used for this left. Since my use case was too big for RAM, I just decided to go another route...
I have the same problem. I create a large automaton (several gigabytes in memory), pickle it, and load it from disk. I cannot test whether 32-bit Python works, because the dataset is too large and I get a `MemoryError`.
@Dobatymo is it possible to somehow get the dataset you use? I'd love to finally fix the bug, but I'm not able to reproduce it on my own. |
I use this one https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz |
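A minimal sketch of a reproduction with that dump (only the file name comes from the link above; the rest is assumed):

```python
import gzip
import pickle
import ahocorasick

# Load every page title as a word, with its line number as the value.
A = ahocorasick.Automaton()
with gzip.open("enwiki-latest-all-titles-in-ns0.gz", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        title = line.rstrip("\n")
        if title:
            A.add_word(title, i)
A.make_automaton()

# Round-trip through pickle; the load step is where the error was reported.
with open("titles.pkl", "wb") as f:
    pickle.dump(A, f)
with open("titles.pkl", "rb") as f:
    A2 = pickle.load(f)
```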
Great! Thank you |
Hah, nice! That's the original dataset I used too, except I used the Swedish version of Wikipedia, not the English one. The idea was to quickly find all Wikipedia articles mentioned in a span of text. Some thoughts:
Depending on compiler and system, sizeof(int) != sizeof(size_t) != sizeof(Py_uintptr_t); on my 64-bit system sizeof(int) == 4, while sizeof(size_t) == sizeof(Py_uintptr_t) == 8. As noted by @Dobatymo and @EmilStenstrom, their data was huge, exceeding 8 GB. My rough guess is that for such sizes there was an integer overflow. It might have happened either during pickling (on the variable `DumpState.id`) or unpickling (on the `automaton_unpickle` variable `id`). I tried to reproduce this error, but my computer has only 4 GB, so I was only able to experiment with 3-and-something GB. I was unable to reproduce the error.
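A worked illustration of the suspected wraparound (the byte count below is hypothetical):

```python
# A length just above 2**31 - 1, stored into a signed 32-bit int,
# wraps around and becomes negative.
length = 2200000000          # ~2.2 GB serialized blob
low32 = length & 0xFFFFFFFF  # keep only the low 32 bits
if low32 >= 2**31:
    low32 -= 2**32           # reinterpret as signed
print(low32)                 # -2094967296
```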
@WojciechMula Maybe it's time to close this bug, until someone sees this problem with the latest version of the code, and Python 3.6+? |
@EmilStenstrom I'll keep this bug open, because I cannot prove the problem is gone. :)
I still encounter the same problem, but the error message has changed slightly:
EDIT: I did some more testing of how many strings it takes before loading starts to fail. Using the same code as above, I found the maximum number of wiki titles that can be loaded. These are the lines:
@Dobatymo Thanks a lot for this! It seems there's an overflow on a signed 32-bit number.
Yup, there are some pickling functions returning ints, which are 32-bit even on 64-bit machines.
@Dobatymo, @EmilStenstrom, @woakesd --- guys, thanks a lot for your help, I really appreciate your time and effort. Finally the cause of the bug was found. It also wouldn't have been possible without @lemire, who gave me access to servers with nice CPUs :) and plenty of RAM. Thanks to that I had the opportunity to reproduce the bug and then validate my hypotheses.

TL;DR: the module allocates huge chunks of memory that are not handled correctly by Python. Now more technical details, if somebody's interested.

Previously there were errors in the module where a bare `int` was used for sizes. The automaton structure is saved in a huge array of bytes, then the array is converted into a Python object using `Py_BuildValue`. I did simple tests on Python 3.6.4 and 3.5.2, with the same result on both: so this is a bug (or feature) of `Py_BuildValue`. Whatever we call it, I wasn't aware of the size limitation, and to be honest didn't even suspect that such a limit might still exist.

The real bug is that I store the whole data in a single chunk of memory. I checked that if (in Py3) a bytes object is created using the dedicated procedure `PyBytes_FromStringAndSize`, it comes out correctly. The solution is to split this huge chunk into smaller ones.
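A conceptual sketch of that fix in Python (names and the chunk size are assumptions; the real implementation lives in the C module):

```python
CHUNK = 512 * 1024 * 1024  # stay well below any signed 32-bit length limit

def split_blob(blob):
    """Cut one huge bytes object into a list of bounded-size chunks."""
    return [blob[i:i + CHUNK] for i in range(0, len(blob), CHUNK)]

def join_blob(chunks):
    """Reassemble the original blob while unpickling."""
    return b"".join(chunks)
```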
It is documented that the `#` length arguments are `int` by default, and become `Py_ssize_t` only when the `PY_SSIZE_T_CLEAN` macro is defined. And because the real function signature is `PyObject* Py_BuildValue(const char *format, ...)`, the compiler cannot check what type the variadic length argument actually has. In that case I think you only have to define this macro and not change your code. Same for Python 2 (https://docs.python.org/2/c-api/arg.html#parsing-arguments-and-building-values).
Thanks, I've already seen this flag. TBH it's a weird solution, and I don't like it.
The current status:
🎆 🥇 🎊 AWESOMENESS! 🎆 🥇 🎊 |
@EmilStenstrom, @woakesd, @Dobatymo, @pombredanne, @leonqli --- version 1.1.13.1 is available on PyPI. If you wish, please check it. Sorry that it took so long, and thank you very much for your help.
I've managed to create an automaton, and then pickle that automaton to a 286 MB pickle file. The problem is, when I try to unpickle it, I get this error: `ValueError: binary data truncated (1)`
The source of that error is here: https://github.com/WojciechMula/pyahocorasick/blob/master/Automaton_pickle.c#L309
Would you mind helping me troubleshoot this? Any ideas? I don't think I can send files this big to you?
Update: This is how I build the pickle file:
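A hedged reconstruction of that build step (only the `yield ("Belgium", "Q31")` line below comes from the original description; the rest is assumed):

```python
import pickle
import ahocorasick

def generator():
    # Stand-in for the real reader of wikidata-reduced.json.
    yield ("Belgium", "Q31")

A = ahocorasick.Automaton()
for label, qid in generator():
    A.add_word(label, (label, qid))
A.make_automaton()

with open("automaton.pkl", "wb") as f:
    pickle.dump(A, f)
```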
Where `generator` just runs `yield ("Belgium", "Q31")`.