Invalid pickle file generated: "ValueError: binary data truncated (1)" #50

Closed
EmilStenstrom opened this Issue Jan 4, 2017 · 60 comments

EmilStenstrom commented Jan 4, 2017

I've managed to create an automaton, and then pickle that automaton to a 286 MB pickle file. The problem is that when I try to unpickle it, I get this error:

$ python -m pickle wikidata-automation.pickle 
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/pickle.py", line 1605, in <module>
    obj = load(f)
ValueError: binary data truncated (1)

The source of that error is here: https://github.com/WojciechMula/pyahocorasick/blob/master/Automaton_pickle.c#L309

Would you mind helping me troubleshoot this? Any ideas? I don't think I can send files this big to you?

Update: This is how I build the pickle file:

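import pickle

import ahocorasick

automaton = ahocorasick.Automaton()
# generator yields (label, id) pairs; filename_out is the output path
for i, (label, id_) in enumerate(generator):
    automaton.add_word(label, id_)

automaton.make_automaton()

with open(filename_out, "wb") as f:
    pickle.dump(automaton, f)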

Where generator just runs yield ("Belgium", "Q31").

Author

EmilStenstrom commented Jan 8, 2017

I understand that this is something that is tricky to reproduce. Therefore I've created a new repository with my code, and invited you to that repository. I've added documentation on how to run the code there.

Beware that running through the full wikidata dump with 24 million entries takes several hours. After all the building is done you can quickly run the example and see it fail.

Somehow the check UNLIKELY(size < count*(sizeof(TrieNode) - sizeof(TrieNode*))) evaluates to true, which prevents the pickle file from being read.

Let me know if there is anything I can do to help troubleshoot this.

Owner

WojciechMula commented Jan 8, 2017

@EmilStenstrom Thanks a lot for your effort! I'll try to reproduce the bug.

Owner

WojciechMula commented Jan 9, 2017

@EmilStenstrom I was able to build wikidata-reduced.json; it was really time consuming. :) Now I can debug, thank you.

Author

EmilStenstrom commented Jan 9, 2017

@WojciechMula Phew. Now you have some more waiting to do as you build the automaton, and then try to search it. Building the automaton works. The crash occurs when you try to load it using the last command.

Sorry about the long waits :)

Owner

WojciechMula commented Jan 11, 2017

@EmilStenstrom I'm working on this issue now, and managed to fix an ugly memory leak. It's not a fix for this issue yet. :)

Author

EmilStenstrom commented Jan 11, 2017

@WojciechMula That sounds fantastic! :) I'm happy all that processing power didn't go to waste.

WojciechMula added the bug label Jan 11, 2017

Owner

WojciechMula commented Jan 15, 2017

@EmilStenstrom I'm still trying to reproduce the bug. Unfortunately, my laptop has too little memory, and your app is killed after eating all 4 GB. I tried to split the input and then build/pickle/unpickle smaller chunks, but nothing has gone wrong so far. I suspected there were some Unicode-related problems (like #53), but that doesn't seem to be the case. Just writing to give you feedback.

Author

EmilStenstrom commented Jan 15, 2017

@WojciechMula I'm thinking of different ways of helping out. Would it help if I sent you the pickle file? It is 88 MB if I zip it, so I think I can give you a Dropbox link? What e-mail should I send the link to?

Author

EmilStenstrom commented Jan 15, 2017

I had an idea that maybe the whole file was truncated? So I found that you can inspect a pickle file with pickletools from the Python standard library. But it seems it ends in the expected way:

$ python -m pickletools wikidata_automation.pickle | tail
268149006: r            LONG_BINPUT 13858041
268149011: X            BINUNICODE 'Q27876039'
268149025: r            LONG_BINPUT 13858042
268149030: e            APPENDS    (MARK at 268141046)
268149031: t        TUPLE      (MARK at 27)
268149032: r    LONG_BINPUT 13858043
268149037: R    REDUCE
268149038: r    LONG_BINPUT 13858044
268149043: .    STOP
highest protocol among opcodes = 3
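
A cheaper way to run the same truncation check, without disassembling hundreds of megabytes, is to look only at the last byte of the file. This is a minimal sketch: the helper name ends_with_stop is made up for illustration, and a trailing STOP opcode only suggests that the stream was not cut short, not that its contents are valid.

```python
import os
import pickle


def ends_with_stop(path):
    """Return True if the file's last byte is the pickle STOP opcode (b'.')."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False  # an empty file cannot be a complete pickle
        f.seek(-1, os.SEEK_END)
        return f.read(1) == pickle.STOP
```

For the file above this would agree with the pickletools output, since the stream ends with STOP.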

Owner

WojciechMula commented Jan 15, 2017

@EmilStenstrom If you can, please send me the pickle file directly. My e-mail: wojciech_mula@poczta.onet.pl

Author

EmilStenstrom commented Jan 15, 2017

Sent the link to your e-mail!

I also pushed some updates to the script that creates the wikidata-reduced.json file (I sent you the old file, not the updated one, to make sure you can reproduce). The updated file now excludes lots of entities I'm not interested in anyway. It should be about half the size. Maybe that makes it possible to create the automaton on 4 GB? I'm on a MacBook Pro from work with 16 GB RAM, so I can deal with huge files.

Owner

WojciechMula commented Jan 16, 2017

@EmilStenstrom It just clicked what's wrong. If your automaton takes several gigabytes, it's almost impossible for the pickle file to be several times smaller.

Author

EmilStenstrom commented Jan 16, 2017

@WojciechMula: So something is wrong with how I create the pickle file?

Owner

WojciechMula commented Jan 16, 2017

@EmilStenstrom You're doing everything perfectly right; there's some bug in pickle. I just created an automaton with 1,000,000 words and the pickled file is 350 MB.

Owner

WojciechMula commented Jan 16, 2017

@EmilStenstrom You've shown the tail of the pickled file, but could you please show the beginning of the file? On my system I have:

$ python3 -m pickletools ref.pickle
    0: \x80 PROTO      3
    2: c    GLOBAL     'ahocorasick Automaton'
   25: q    BINPUT     0
   27: (    MARK
   28: J        BININT     168760016
   33: C        SHORT_BINBYTES b''
   35: q        BINPUT     1
   37: K        BININT1    2
   39: K        BININT1    2
   41: J        BININT     16182875
   46: J        BININT     16182874
   51: M        BININT2    310
   54: ]        EMPTY_LIST
   55: q        BINPUT     2
   57: (        MARK

The file is certainly corrupted. At offset 33 there is an empty bytes object where there should be a large blob of data, and the field at offset 37 should be 20. For now I have no idea what's wrong; of course I will continue to work on this.

Is everything OK when you build smaller automatons?
Do you have your own compilation of Python, or does it come from a precompiled package?
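
The symptom described above, an empty bytes object where the serialized trie should be, can be checked without dumping the whole disassembly. A minimal sketch using pickletools.genops from the standard library follows; the helper name first_bytes_blob_length and the max_ops cutoff are assumptions made for illustration, and the opcode names are the ones pickle uses for bytes objects.

```python
import io
import pickle
import pickletools

BYTES_OPCODES = ("SHORT_BINBYTES", "BINBYTES", "BINBYTES8")


def first_bytes_blob_length(stream, max_ops=50):
    """Length of the first bytes argument among the first max_ops opcodes, or None."""
    for i, (opcode, arg, _pos) in enumerate(pickletools.genops(stream)):
        if i >= max_ops:
            break
        if opcode.name in BYTES_OPCODES:
            return len(arg)
    return None


# Two tiny stand-ins: one with a non-empty blob, one with the empty blob seen above.
healthy = pickle.dumps((302, b"\x00" * 64), protocol=3)
suspect = pickle.dumps((168760016, b""), protocol=3)
print(first_bytes_blob_length(io.BytesIO(healthy)))  # 64
print(first_bytes_blob_length(io.BytesIO(suspect)))  # 0
```

A result of 0 for a file that is supposed to hold millions of nodes would match the corruption described in this comment. Note that genops still reads the blob itself into memory when it reaches it.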

Author

EmilStenstrom commented Jan 16, 2017

Here's the first 20 lines of my file:

$ python -m pickletools wikidata_automation.pickle | head -n20
    0: \x80 PROTO      3
    2: c    GLOBAL     'ahocorasick Automaton'
   25: q    BINPUT     0
   27: (    MARK
   28: J        BININT     168760016
   33: C        SHORT_BINBYTES b''
   35: q        BINPUT     1
   37: K        BININT1    2
   39: K        BININT1    2
   41: J        BININT     16182875
   46: J        BININT     16182874
   51: M        BININT2    310
   54: ]        EMPTY_LIST
   55: q        BINPUT     2
   57: (        MARK
   58: X            BINUNICODE 'Q23600353'
   72: q            BINPUT     3
   74: X            BINUNICODE 'Q14877373'
   88: q            BINPUT     4
   90: X            BINUNICODE 'Q26446664'

Looks very similar to yours.

I'm using the latest stable version of python (3.5.2) that is distributed with Homebrew (the most popular package manager for Mac).

$ brew info python3
python3: stable 3.5.2 (bottled), devel 3.6.0rc1, HEAD
Interpreted, interactive, object-oriented programming language
https://www.python.org/
/usr/local/Cellar/python3/3.5.2 (3,664 files, 55.0M) *
  Poured from bottle on 2016-07-07 at 20:30:19
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/python3.rb
$ python3.5 
Python 3.5.2 (default, Jun 29 2016, 13:43:58) 
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin

Author

EmilStenstrom commented Jan 16, 2017

I've now tried with a couple of different files. First a new one generated with the updated script. It removes all empty labels:

$ python run_wikidata_search.py wikidata_automation_noempty.pickle "Belgium, Sweden and Poland are three fine countries"
Traceback (most recent call last):
  File "run_wikidata_search.py", line 17, in <module>
    main(filename_in, text)
  File "run_wikidata_search.py", line 11, in main
    automation = pickle.load(f)
ValueError: binary data truncated (3)

Same error, but with a (3) at the end instead of a (1) as before. When I try to run pickletools on this file I get:

$ python -m pickletools wikidata_automation_noempty.pickle
    0: \x80 PROTO      3
    2: c    GLOBAL     'ahocorasick Automaton'
   25: q    BINPUT     0
   27: (    MARK
   28: J        BININT     144852559
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/pickletools.py", line 2833, in <module>
    args.indentlevel, annotate)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/pickletools.py", line 2475, in dis
    print(line, file=out)
OSError: [Errno 22] Invalid argument

And it hangs a LONG time before outputting the OSError, which I think confirms that this file contains the large blob of data that should be there.

I've also tried with a much smaller wikidata-reduced-file (only 10 lines) and everything works fine there. Inspecting that file with pickletools yields the correct results:

$ python -m pickletools wikidata_automation_mini.pickle
    0: \x80 PROTO      3
    2: c    GLOBAL     'ahocorasick Automaton'
   25: q    BINPUT     0
   27: (    MARK
   28: M        BININT2    302
   31: B        BINBYTES   b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x00\x00\x00\x00"\x00\x00\x00\x00\x00\x00\x00$\x00\x00\x00\x00\x00\x00\x005\x00\x00\x00\x00\x00\x00\x00?\x00\x00\x00\x00\x00\x00\x00E\x00\x00\x00\x00\x00\x00\x00P\x00\x00\x00\x00\x00\x00\x00\\\x00\x00\x00\x00\x00\x00\x00a\x00\x00\x00\x00\x00\x00\x00j\x00\x00\x00\x00...

Owner

WojciechMula commented Jan 16, 2017

@EmilStenstrom Thank you very much for checking this. I have some vague ideas about the source of errors, but need to verify it. I haven't replicated your problems yet.

Owner

WojciechMula commented Jan 17, 2017

@EmilStenstrom Sorry for a stupid question, but: is your MacOS 64-bit?

Author

EmilStenstrom commented Jan 17, 2017

@WojciechMula Yes. The processor is an "Intel Core i7" which is 64 bit, and the macOS version is Sierra which runs in 64 bit mode. Also, my python returns 64 bit:

$ python -c "import platform; print(platform.architecture())"
('64bit', '')

Owner

WojciechMula commented Jan 20, 2017

@EmilStenstrom Thank you, I supposed it might be somehow related to integer overflows. ATM I have no idea how to reproduce the error and what might be its cause.

Could you recompile the module with -fsanitize=address and -fsanitize=undefined? I think setting CFLAGS is sufficient, i.e.:

export CFLAGS="-fsanitize=address -fsanitize=undefined"

Owner

WojciechMula commented Jan 25, 2017

@EmilStenstrom I didn't forget about the problem, I just ran out of ideas.

Author

EmilStenstrom commented Jan 26, 2017

Hi! I'm still planning to try the compile flags you suggested above, didn't have time! Maybe next week!

Author

EmilStenstrom commented Feb 8, 2017

Here's the output after running with the CFLAGS you suggested:

$ python run_wikidata_build_automation.py wikidata-reduced.json wikidata_automation.pickle
==57697==ERROR: Interceptors are not working. This may be because AddressSanitizer is loaded too late (e.g. via dlopen). Please launch the executable with:
DYLD_INSERT_LIBRARIES=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/8.0.0/lib/darwin/libclang_rt.asan_osx_dynamic.dylib
==57697==AddressSanitizer CHECK failed: /Library/Caches/com.apple.xbs/Sources/clang_compiler_rt/clang-800.0.42.1/src/projects/compiler-rt/lib/sanitizer_common/sanitizer_mac.cc:690 "(("interceptors not installed" && 0)) != (0)" (0x0, 0x0)
    <empty stack>

Abort trap: 6
$ export DYLD_INSERT_LIBRARIES=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/8.0.0/lib/darwin/libclang_rt.asan_osx_dynamic.dylib
$ python run_wikidata_build_automation.py wikidata-reduced.json wikidata_automation.pickle
Building automaton...
Building automaton, step 0...
Building automaton, step 100000...
Building automaton, step 200000...
...
Building automaton, step 16800000...
Time to make it...
Killed: 9

It takes all my RAM (16 Gb) for about an hour, and then gets killed.

I guess we won't get any further from here. I think I should try to solve my problem in another way. Instead of trying to build the trie in memory, I should persist it to disk in some sort of database optimized for this use case. Thank you for all your hard work!

Owner

WojciechMula commented Feb 8, 2017

@EmilStenstrom Thank you very much for your time and effort. I really want to fix that bug, but so far I couldn't. :(

As far as I understand your problem, you could try n-gram indexes. They let you narrow the search space significantly and are not too complicated. I did some experiments with full-text search and the results were impressive.
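
To make the suggestion concrete, here is a rough sketch of a trigram index in Python. It is only an illustration of the general idea, not the experiments mentioned above: all function names are hypothetical, labels shorter than three characters are ignored, and the final substring check is a simple stand-in for a real matcher.

```python
from collections import defaultdict


def trigrams(s):
    """Set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(labels):
    """Map each trigram to the ids (list positions) of the labels containing it."""
    index = defaultdict(set)
    for label_id, label in enumerate(labels):
        for gram in trigrams(label):
            index[gram].add(label_id)
    return index


def find_labels(index, labels, text):
    """Ids of labels occurring in text, verifying only index-selected candidates."""
    text_grams = trigrams(text)
    candidate_ids = set()
    for gram in text_grams:
        candidate_ids |= index.get(gram, set())
    return sorted(
        i for i in candidate_ids
        if trigrams(labels[i]) <= text_grams and labels[i] in text
    )


labels = ["Belgium", "Sweden", "Poland", "Norway"]
index = build_index(labels)
print(find_labels(index, labels, "Belgium, Sweden and Poland are three fine countries"))
# [0, 1, 2]
```

The index narrows the search to labels that share at least one trigram with the text, so only a small fraction of the labels needs the exact check.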

Contributor

woakesd commented Feb 17, 2017

Hi

I'm hitting something like this with Python 3.5 on Windows. It seems related to Python 3.5, as I can read the same pickle in 3.4 without error and use the automaton.

I'll email you details of the files and put them in Dropbox for you to download. The dataset is much smaller than the one mentioned here (the pickle is only 17 MB).

It doesn't seem to matter if the pickle is created in 3.4 or 3.5. The read issue only happens in Windows though.

Reading it in linux returns an automaton with no words! Guess that is too much to hope for!

Owner

WojciechMula commented Feb 17, 2017

David, thank you very much, I will look closer at this. I've already downloaded the file.

Contributor

woakesd commented Feb 17, 2017

Tested with windows in Python 3.6 and no error, so looks like Windows Python 3.5 only.

Contributor

woakesd commented Feb 20, 2017

I get the following:

(<class 'ahocorasick.Automaton'>, (7, b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x82\x01\x00\x00\x02\x00\x00\x00\x00\x00
\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x82\x01a\x00\x03\x00\x00\x00\x00\x00\x0
0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x0
0\x00\x82\x01b\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x
00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x83\x01c\x00\x00\x00\x00\x00\
x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x82\x01d\x00\x06
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x01\x00\x00\x00\x82\x01e\x00\x07\x00\x00\x00\x00\x00\x00\x00\x00\x0
0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x83\x0
1f\x00', 1, 30, 100, 2, 2, 3, ['abc', 'def']))

I deleted the comment where I thought it was working because I tested with 3.6 not 3.5.3. Sigh

Contributor

woakesd commented Feb 20, 2017

I added another trace which produces the following output:

size 216, count 7, sizeof(TrieNode) 32, sizeof(TrieNode*) 8
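
For reference, these traced values can be plugged into the size check quoted earlier in the thread, size < count*(sizeof(TrieNode) - sizeof(TrieNode*)). The snippet below is just that arithmetic in Python; it assumes the traced values are the ones that feed the check, which the trace itself does not state.

```python
def header_check_fails(size, count, sizeof_node, sizeof_ptr):
    """Evaluate size < count * (sizeof_node - sizeof_ptr) from the quoted check."""
    minimum_size = count * (sizeof_node - sizeof_ptr)
    return size < minimum_size, minimum_size


fails, minimum_size = header_check_fails(size=216, count=7, sizeof_node=32, sizeof_ptr=8)
print(minimum_size, fails)  # 168 False
```

With these numbers the minimum is 168 bytes and the data is 216 bytes, so this particular check would pass; under that assumption the failure would have to come from a later check.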

Contributor

woakesd commented Feb 21, 2017

Could you send me a wheel for 64-bit 3.5.3?

I'm exploring the idea that there is a build config issue with my laptop.

I'm installing Visual Studio 2015 on another laptop just now to try it out on another machine

Contributor

woakesd commented Feb 22, 2017

Found how to break it like this, and why I think it works for you!

Using 32 bit python 3.5.3 in Windows I can create and load the pickle no problem.

The 64 bit version is where the issue lies.

The pickle for 64 bit windows is larger, 294 bytes instead of 214 bytes.

Owner

WojciechMula commented Feb 22, 2017

@woakesd I'm not on Windows right now, but I'm pretty sure that I have 64-bit versions of python and compilation also produces 64-bit binaries. But it might be a proper hint, thank you for checking it.

I will send you my compiled modules tomorrow.

Collaborator

pombredanne commented May 18, 2018

@EmilStenstrom do you mind trying with the latest release?

Author

EmilStenstrom commented May 19, 2018

@pombredanne Sorry, I don't have any of the code left I used for this. Since my usecase was too big for RAM I just decided to go another route...

Dobatymo commented Jul 12, 2018

I have the same problem. I create a large automaton (several gigabytes in memory), pickle and load from disk: ValueError: binary data truncated (1)
Python 3.6.6 x64 Windows 10, installed with pip install pyahocorasick

I cannot test if 32-bit python works, because the dataset is too large and I get a memory error (SystemError: <built-in method add_word of ahocorasick.Automaton object at 0x080305E0> returned NULL without setting an error)

Owner

WojciechMula commented Jul 19, 2018

@Dobatymo is it possible to somehow get the dataset you use? I'd love to finally fix the bug, but I'm not able to reproduce it on my own.

Dobatymo commented Jul 19, 2018

I use this one https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz
Tomorrow I can check which version/date of the dump exactly and give you the code to reproduce it.

Owner

WojciechMula commented Jul 19, 2018

Great! Thank you

Author

EmilStenstrom commented Jul 19, 2018

Hah, nice! That’s the original dataset I used too. But I used the Swedish version of Wikipedia, not the English one. The idea was to quickly find all Wikipedia articles from a span of text.

Some thoughts:

  1. does the automaton get bigger than RAM?
  2. are there very long strings in Wikipedia that somehow throw this off?
  3. are there Unicode code points that mess things up?

Dobatymo commented Jul 20, 2018

Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)] on win32

import gzip, pickle
import ahocorasick

def read(wiki_titles):
	with gzip.open(wiki_titles, "rt", encoding="utf-8") as fr:
		for line in fr:
			yield line.strip()

def create_automaton(wiki_titles):
	a = ahocorasick.Automaton()

	for i, line in enumerate(read(wiki_titles)):
		a.add_word(line.lower(), i)
	a.make_automaton()

	return a

if __name__ == "__main__":

	# https://dumps.wikimedia.org/enwiki/20180701/enwiki-20180701-all-titles-in-ns0.gz
	wiki_path = "enwiki-20180701-all-titles-in-ns0.gz"
	pickle_path = "enwiki.p"

	with open(pickle_path, "wb") as fw:
		a = create_automaton(wiki_path)
		pickle.dump(a, fw)
		del a

	with open(pickle_path, "rb") as fr:
		a = pickle.load(fr)

Traceback (most recent call last):
  File "...\test.py", line 30, in <module>
    a = pickle.load(fr)
ValueError: binary data truncated (1)

Memory usage maxes out at 10.75 GB (just small enough to work on my 16GB machine)

WojciechMula added a commit that referenced this issue Dec 1, 2018

Possible bug fix in pickling/unpickling #50
Depending on compiler and system sizeof(int) != sizeof(size_t) != sizeof(Py_uintptr_t);
On my 64-bit system sizeof(int) == 4 and sizeof(size_t, Py_uintptr_t) == 8.

As noted by @Dobatymo and @EmilStenstrom, their data was huge, exceeding
8GB.  My rough guess is that for such large sizes there was an integer overflow.
It might have happened either during pickling (on var `DumpState.id`) or
unpickling (on automaton_unpickle var `id`).

I tried to reproduce this error, but my computer has only 4GB, and I was able
to experiment with only 3 + something GB. I was unable to reproduce the error.
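
The size mismatch described in this commit message can be checked on any machine from Python with ctypes. This is a small sketch; the function name c_type_sizes is invented here, and the size of void* is used as a stand-in for Py_uintptr_t, which ctypes does not expose directly.

```python
import ctypes


def c_type_sizes():
    """Sizes in bytes of the C types discussed above, as seen by this interpreter."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "size_t": ctypes.sizeof(ctypes.c_size_t),
        "void*": ctypes.sizeof(ctypes.c_void_p),  # stand-in for Py_uintptr_t
    }


print(c_type_sizes())
```

On the 64-bit system described in the commit message this would report 4 for int and 8 for the other two.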

Author

EmilStenstrom commented Dec 1, 2018

@WojciechMula Maybe it's time to close this bug, until someone sees this problem with the latest version of the code, and Python 3.6+?

Owner

WojciechMula commented Dec 2, 2018

@EmilStenstrom I'll keep this bug open, because I cannot prove the problem was gone. :)

Dobatymo commented Dec 6, 2018

I still encounter the same problem. But the error message changed slightly.

  • Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)] on win32
  • Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
ValueError: Data truncated [parsing header of node #5910864]: offset 236675600, expected at least 32 bytes

EDIT: I did some more testing to find the number of strings at which it starts to fail. Using the same code as above, I found the maximum number of wikititles that can be loaded.

Loading 6458078 titles still works and yields a pickle file of 2.02 GB (2,177,764,377 bytes)
Loading 6458079 fails with ValueError: Data truncated [parsing header of node #0]: offset 0, expected at least 32 bytes and yields a pickle file of 28.8 MB (30,280,907 bytes)

These are the lines 6458077-6458080 from the wikititles, so I don't think string content is causing the problem

John_Briscoe_(baseball)
John_Briscoe_(disambiguation)
John_Briscoe_(water_engineer)
John_Brisker
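
The comment does not say how the threshold was located. One way to find such a boundary with few build/pickle/load runs is bisection, sketched below. The function name largest_passing and the passes callback are hypothetical; the sketch assumes that passes(lo) is true and that once a size fails, every larger size fails too.

```python
def largest_passing(lo, hi, passes):
    """Largest n in [lo, hi] with passes(n) true, assuming passes(lo) is true
    and that failures are monotonic (everything above the threshold fails)."""
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if passes(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


# Stand-in predicate using the threshold reported above; a real predicate would
# build an automaton from the first n titles, pickle it, and try to load it.
print(largest_passing(1, 10_000_000, lambda n: n <= 6458078))  # 6458078
```

With an upper bound of ten million titles this needs about 24 round trips instead of a linear scan.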

Owner

WojciechMula commented Dec 6, 2018

@Dobatymo Thanks a lot for this! Seems there's overflow on signed 32-bit number.

Owner

WojciechMula commented Dec 6, 2018

Yup, there are some pickling functions returning ints, which are 32-bit on 64-bit machines.

Owner

WojciechMula commented Dec 6, 2018

@Dobatymo, @EmilStenstrom, @woakesd --- guys, thanks a lot for your help, I really appreciate your time and effort. Finally the cause of the bug was found. It also wouldn't have been possible without @lemire, who gave me access to servers with nice CPUs :) and plenty of RAM. Thanks to that I had the opportunity to reproduce the bug and then validate my hypotheses.

TL;DR: the module allocates huge chunks of memory that are not handled correctly by Python.

Now more technical details, if anybody's interested.

Previously there were errors in the module where bare ints were used rather than size_t. That was fixed, but as @Dobatymo has checked, the problem still exists.

The automaton structure is saved in a huge array of bytes, then the array is converted into a Python object using Py_BuildValue("y#", array_ptr, array_size) [the actual invocation is different, but the format "y#" is crucial]. The problem is that array_size is internally cast to int; even though array_size is handled in the code using 64-bit unsigned values, the value is narrowed and only its lower 32 bits are used. This is why @Dobatymo got ~30MB while there should be ~2GB.

I did simple tests on Python 3.6.4 and 3.5.2: Py_BuildValue("y#", array_ptr, 2**31 - 1) yields non-null object having 2**31 - 1 bytes inside. But for size 2**31, the procedure yields an empty object, that has zero bytes.
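
The narrowing itself can be illustrated in pure Python. The sketch below uses ctypes.c_int32, which keeps only the low 32 bits of a value and reinterprets them as signed; the helper name narrow_to_int32 is invented for this example, and the snippet only models the cast, not what Py_BuildValue does with the result.

```python
import ctypes


def narrow_to_int32(n):
    """Keep the low 32 bits of n and reinterpret them as a signed 32-bit int."""
    return ctypes.c_int32(n).value


for size in (2**31 - 1, 2**31, 2**32 + 5):
    print(size, "->", narrow_to_int32(size))
# 2147483647 -> 2147483647
# 2147483648 -> -2147483648
# 4294967301 -> 5
```

The first size survives the cast unchanged, while the second becomes negative, which matches the boundary between 2**31 - 1 and 2**31 observed in the tests above.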

So this is a bug (or feature) of Py_BuildValue. Whatever we call it, I wasn't aware of the size limitation, and to be honest didn't even suspect that such a limit might still exist.

The real bug is that I store the whole data in a single chunk of memory. I checked that if (in Py3) a bytes object is created using the dedicated procedure PyBytes_FromStringAndSize, we hit another limit: "OverflowError: cannot serialize a bytes object larger than 4 GiB". (But now it's a regular Python exception, which is perfect.)

The solution is to split this huge chunk into smaller ones.
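
The idea of splitting can be sketched in Python as follows. This is not the module's implementation (that lives in the C code); the function name split_into_chunks and the chunk size used in the demo are assumptions chosen for illustration.

```python
def split_into_chunks(data, max_chunk_size):
    """Split a bytes-like object into consecutive pieces of at most max_chunk_size bytes."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    view = memoryview(data)  # avoids copying the whole buffer while slicing
    return [
        bytes(view[start:start + max_chunk_size])
        for start in range(0, len(view), max_chunk_size)
    ]


chunks = split_into_chunks(b"abcdefghij", 4)
print(chunks)                             # [b'abcd', b'efgh', b'ij']
print(b"".join(chunks) == b"abcdefghij")  # True
```

Each piece stays below whatever per-object size limit applies, and concatenating the pieces in order restores the original buffer.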

Dobatymo commented Dec 7, 2018

It is documented that Py_BuildValue uses int inputs (https://docs.python.org/3/c-api/arg.html#c.Py_BuildValue)

y# (bytes) [const char *, int]

And because the real function signature is PyObject *Py_BuildValue(const char *format, ...) the compiler cannot catch it. But it seems Py_ssize_t versions are available.

Note

For all # variants of formats (s#, y#, etc.), the type of the length argument (int or Py_ssize_t) is controlled by defining the macro PY_SSIZE_T_CLEAN before including Python.h. If the macro was defined, length is a Py_ssize_t rather than an int. This behavior will change in a future Python version to only support Py_ssize_t and drop int support. It is best to always define PY_SSIZE_T_CLEAN.

In that case I think you only have to define this macro and not change your code.

Same for Python 2 (https://docs.python.org/2/c-api/arg.html#parsing-arguments-and-building-values)

Starting with Python 2.5 the type of the length argument can be controlled by defining the macro PY_SSIZE_T_CLEAN before including Python.h. If the macro is defined, length is a Py_ssize_t rather than an int.

Owner

WojciechMula commented Dec 7, 2018

Thanks, I've already seen this flag. TBH it's a weird solution, I don't like it.

Owner

WojciechMula commented Dec 10, 2018

The current status:

$ time python3 tests/pickle_stresstest.py -c -p -u --max-words=8000000 --file-gz enwiki-latest-all-titles-in-ns0.gz
Adding 8000000 words from enwiki-latest-all-titles-in-ns0.gz
Automaton statistics:
- nodes_count  : 71365726
- words_count  : 8000001
- links_count  : 71365725
- longest_word : 250
- sizeof_node  : 40
- total_size   : -869412456 ### this is another bug
Saving automaton in pickle_stresstest.pickle
   file size is 2.67 GB (2862650552 bytes)
Loading automaton from pickle_stresstest.pickle
Comparing added words with restored automaton

real	0m59.723s
user	0m38.060s
sys	0m5.468s

or

$ time python3 tests/pickle_stresstest.py -p -u --max-words=12000000 --file-gz enwiki-latest-all-titles-in-ns0.gz
Adding 12000000 words from enwiki-latest-all-titles-in-ns0.gz
Automaton statistics:
- nodes_count  : 104971950
- words_count  : 12000001
- links_count  : 104971949
- longest_word : 251
- sizeof_node  : 40
- total_size   : 743686296
Saving automaton in pickle_stresstest.pickle
   file size is 3.92 GB (4210909896 bytes)
Loading automaton from pickle_stresstest.pickle

real	1m16.736s
user	0m45.988s
sys	0m6.372s

Author

EmilStenstrom commented Dec 11, 2018

🎆 🥇 🎊 AWESOMENESS! 🎆 🥇 🎊

Owner

WojciechMula commented Dec 11, 2018

@EmilStenstrom, @woakesd, @Dobatymo, @pombredanne, @leonqli --- version 1.1.13.1 is available on PIP. If you wish, please check it.

Sorry that it took so long. And thank you very much for your help.
