
Biopython trie implementation can't load large data sets #892

Closed

twrightsman opened this issue Jul 24, 2016 · 30 comments

@twrightsman
Contributor

Migrated from https://redmine.open-bio.org/issues/3395

@mnowotka said:

Imagine I have a Biopython trie:

from Bio import trie
import gzip

f = gzip.open('/tmp/trie.dat.gz', 'w')
tr = trie.trie()
# ... fill in the trie ...
trie.save(f, tr)

Now /tmp/trie.dat.gz is about 50MB. Let's try to read it:

from Bio import trie
import gzip

f = gzip.open('/tmp/trie.dat.gz', 'r')
tr = trie.load(f)

Unfortunately I'm getting a meaningless error saying:
"loading failed for some reason"

Any hints?

@twrightsman
Contributor Author

@peterjc replied:

Can you try the same test case without gzip? i.e. Can you load /tmp/trie.dat rather than /tmp/trie.dat.gz?

Also I would try explicitly opening the files in binary mode.

P.S. Which OS, which version of Python, which version of Biopython?
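
A minimal sketch of that suggested check, with gzip taken out of the picture and explicit binary modes; the trie-filling step is assumed from the original example:

from Bio import trie

tr = trie.trie()
# ... fill in the trie as in the original example ...

# Save and reload through a plain file, opened explicitly in binary mode:
with open('/tmp/trie.dat', 'wb') as handle:
    trie.save(handle, tr)

with open('/tmp/trie.dat', 'rb') as handle:
    tr2 = trie.load(handle)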

@twrightsman
Contributor Author

@mnowotka replied:

Sure, I'll update this issue as soon as I check that.

@twrightsman
Contributor Author

@mnowotka replied:

OK, I tried using a standard Python file handle with explicit binary mode and it also failed. The file is now 165.5MB.
I also tried bz2 and zip compression, without any luck...

@twrightsman
Contributor Author

@peterjc replied:

Well, that is progress - it means this isn't a problem with reading a compressed file from disk; you've made the test case simpler. Can you share a self-contained example script? If not, I suggest you halve the dataset (only record the first half of the entries) and retest, then repeat - this should tell you whether the problem is, as you suspect, the large dataset itself, or something specific about a particular value.
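
A rough sketch of that halving experiment, assuming a hypothetical items list of (key, value) pairs used to build the trie:

from Bio import trie

def round_trips(items):
    # Build a trie from (key, value) pairs, save it, then try to reload it.
    index = trie.trie()
    for key, value in items:
        index[str(key)] = value
    with open('/tmp/trie_test.dat', 'wb') as handle:
        trie.save(handle, index)
    try:
        with open('/tmp/trie_test.dat', 'rb') as handle:
            trie.load(handle)
    except RuntimeError:
        return False
    return True

# Halve while the first half still reproduces the failure; once it stops
# failing, the culprit is in the other half (or in the combination).
subset = items  # hypothetical full list of (key, value) pairs
while len(subset) > 1 and not round_trips(subset[:len(subset) // 2]):
    subset = subset[:len(subset) // 2]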

Alternatively can you share the (compressed) file? I could at least check if it fails the same way here, and perhaps add some debugging code to get more information.

The error message itself is coming from some C code, which hasn't changed for some time:
https://github.com/biopython/biopython/blob/master/Bio/triemodule.c

The error itself is likely triggered in the function _deserialize_transition in trie.c:
https://github.com/biopython/biopython/blob/master/Bio/trie.c

You still haven't told us the important information: which OS, which version of Python, and which version of Biopython. Given it is C code, I'd also like to know how Biopython was installed (e.g. did you compile it from source yourself?).

@twrightsman
Contributor Author

@mnowotka replied:

I'm using Ubuntu 12.04 LTS, Biopython 1.6 and Python 2.7.3.
Can you tell me where I should place the compressed file?

@twrightsman
Contributor Author

@peterjc replied:

Sadly RedMine is limited to 5MB attachments. You could use Dropbox or something similar, or if you have your own server, put the file online temporarily for me to download.

You probably have Biopython 1.60 (one dot sixty); there was no Biopython 1.6 (one dot six). Did you install Biopython using the Ubuntu package manager? i.e. the GUI tool, or at the command line with something like 'apt-get install biopython'?

@twrightsman
Contributor Author

@mnowotka replied:

I put the file here: http://mnowotka.kei.pl/trie.4.dat.gz

@twrightsman
Contributor Author

@mnowotka replied:

I confirm it's version 1.60 I'm using. I installed it with either apt-get or pip.

@twrightsman
Contributor Author

twrightsman commented Jul 24, 2016

@peterjc replied:

I can reproduce the problem with your saved file under Mac OS X, using the latest Biopython from github, e.g.

$ python
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import trie
>>> import gzip
>>> with gzip.open("trie.4.dat.gz") as handle:
...     t = trie.load(handle)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
RuntimeError: loading failed for some reason

Adding a little debugging to the C code tells us where this fails (see attachment), line 669:

668     if(has_value) {
669         if(!(trie->value = (*read_value)(data)))
670             goto _deserialize_trie_error;
671     }

What kind of CPU does your machine have? i.e. is it a normal Intel or AMD CPU, or something unusual like a PowerPC where we have to worry about the byte order interpretation?

We may need a complete example creating the trie as well - the problem could be in the trie itself, the serialisation (writing to disk), or de-serialisation (loading from disk).
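
For the byte-order question, a quick check from Python:

import sys

print(sys.byteorder)  # 'little' on Intel/AMD, 'big' on e.g. classic PowerPC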

@twrightsman
Contributor Author

@mnowotka replied:

I'm using an Ubuntu virtual machine running on a MacBook Pro with a single Intel® Core™ i7-2720QM CPU @ 2.20GHz. I will try to prepare code and data for which it fails.

@twrightsman
Contributor Author

twrightsman commented Jul 24, 2016

@mdehoon replied:

It looks like your data file is corrupted. In _read_value_from_handle, the length of the key it tries to read is 1490353651722. This does not seem correct. Can you create a minimal data file that shows the problem? Then, when you fill in the trie, you can identify which key causes the problem.

@twrightsman
Contributor Author

@mnowotka replied:

That just means the bug is in the save() function, not in load().
But of course I will provide the data file, although I can't guarantee it will be minimal.

@twrightsman
Contributor Author

@mdehoon replied:

You don't need to provide the data file to us. The idea is that you create the smallest trie.dat file that will cause the load() to fail. Then you know which item in the trie is problematic. Once you know that, we can try to figure out why the save() creates a corrupted file.

@twrightsman
Contributor Author

@mnowotka replied:

This is my minimal test case:

from Bio import trie
import pickle

f = open('minimal_data.pkl', 'r')
list = pickle.load(f)
f.close()

index = trie.trie()
for item in list:
    for chunk in item[0].split('/')[1:]:
        if len(chunk) > 2:
            if index.get(str(chunk)):
                index[str(chunk)].append(item[1])
            else:
                index[str(chunk)] = [item[1]]

f = open('trie.dat', 'w')
trie.save(f, index)
f.close()

f = open('trie.dat', 'r')
index = trie.load(f)
f.close()

@twrightsman
Contributor Author

twrightsman commented Jul 24, 2016

@mdehoon replied:

Hi Michal,

Unfortunately I cannot load your minimal_data.pkl file. At
list = pickle.load(f)
I get
ImportError: No module named django.db.models.query

Can you check which item in list is actually causing the problem? Just reduce the list until you find the item that is causing the trie.load(f) to fail.

@twrightsman
Contributor Author

@mnowotka replied:

Hello,
As I said, this is a minimal test case. That means there is no single key that causes the problem: if you remove any of the items from the list it will work. You can try to run this example from the django shell (python manage.py shell). If there are any further problems with running it, I can provide the model classes as well.

@twrightsman
Contributor Author

@mdehoon replied:

We need to isolate the bug further to be able to solve it. I would suggest finding a data set that fails to load but does not depend on django.

@twrightsman
Contributor Author

@mnowotka replied:

Sure, today I'll strip all django dependencies and resubmit the data set and loading code.

@twrightsman
Contributor Author

@mnowotka replied:

Minimal test case with stripped django dependencies, loading code below:

from Bio import trie
import pickle

f = open('minimal_data.pkl', 'r')
list = pickle.load(f)
f.close()

index = trie.trie()
for item in list:
    for chunk in item[0].split('/')[1:]:
        if len(chunk) > 2:
            if index.get(str(chunk)):
                index[str(chunk)].append(item[1])
            else:
                index[str(chunk)] = [item[1]]

f = open('trie.dat', 'w')
trie.save(f, index)
f.close()

f = open('trie.dat', 'r')
new_trie = trie.load(f)
f.close()

@twrightsman
Contributor Author

@mdehoon replied:

The problem was indeed that one of the chunks had a size of 2000.
I've uploaded a fix to github; could you please give it a try? See

6e09a4a

In particular, please make sure that new_trie is identical to the original trie.
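
One way to make that comparison, sketched under the assumption that tries expose the dictionary-like keys() method and indexing used elsewhere in this thread (the trie type itself does not appear to define value-based equality):

def tries_equal(a, b):
    # Compare two Bio.trie tries by their keys and stored values.
    if sorted(a.keys()) != sorted(b.keys()):
        return False
    return all(a[key] == b[key] for key in a.keys())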

@twrightsman
Contributor Author

@mdehoon replied:

Michał, can you confirm that the fixed Bio.trie works for you? Then we can close this bug report.

@twrightsman
Contributor Author

@mnowotka replied:

Can you just give me two more weeks? I need some time to evaluate it.

@twrightsman
Contributor Author

@peterjc replied:

Kevin Wu reported a related issue, which we discussed with Jeff Chang (off list), where a key in the trie exceeded 1000 bytes (the original value of MAX_KEY_LENGTH). See:
http://lists.open-bio.org/pipermail/biopython-dev/2013-February/010284.html
31909c8

(Ideally we could give a specific ValueError exception here, but nevertheless the current print message is an improvement)
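
A minimal reproduction sketch along those lines, assuming that any key longer than the original 1000-byte MAX_KEY_LENGTH is enough to trigger the failure:

from Bio import trie

index = trie.trie()
index['x' * 2000] = ['value']  # key well beyond the old 1000-byte limit

with open('/tmp/long_key_trie.dat', 'wb') as handle:
    trie.save(handle, index)

with open('/tmp/long_key_trie.dat', 'rb') as handle:
    loaded = trie.load(handle)  # raised RuntimeError before the fix

print(sorted(loaded.keys()) == sorted(index.keys()))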

@twrightsman
Contributor Author

If this was fixed, can we add a test case for it? I can work on it.
minimal_data.pkl.txt

@twrightsman
Contributor Author

twrightsman commented Jul 24, 2016

I can confirm it still fails in this test:

    def test_large_chunk(self):
        f = open('minimal_data.pkl', 'rb')
        list = pickle.load(f)
        f.close()
        index = trie.trie()
        for item in list:
            for chunk in item[0].split('/')[1:]:
                if len(chunk) > 2:
                    if index.get(str(chunk)):
                        index[str(chunk)].append(item[1])
                    else:
                        index[str(chunk)] = [item[1]]
        f = open('trie.dat', 'wb')
        trie.save(f, index)
        f.close()
        f = open('trie.dat', 'rb')
        new_trie = trie.load(f)
        f.close()
        self.assertEqual(index, new_trie)
$ python3 run_tests.py test_trie
Python version: 3.5.2 (default, Jun 29 2016, 13:43:58) 
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)]
Operating system: posix darwin
test_trie ... FAIL
======================================================================
FAIL: test_large_chunk (test_trie.TestTrie)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/biopython/Tests/test_trie.py", line 151, in test_large_chunk
    self.assertEqual(index, new_trie)
AssertionError: <trie object at 0x1057772a0> != <trie object at 0x1057772b8>

----------------------------------------------------------------------
Ran 1 test in 0.604 seconds

@sticken88

Any improvement? Can I help?

@peterjc
Member

peterjc commented Sep 16, 2016

@sticken88 it seems 31909c8 helped, but the minimal_data.pkl test case is still failing according to @twrightsman who checked in July.

If you know Python and C, then having some fresh eyes look at this would be great. The original author Jeff Chang is no longer actively involved in Biopython.

@sticken88

@peterjc I do know both of them, I'll try to have a look at it.

peterjc pushed a commit that referenced this issue Apr 11, 2017
Check for null in py_handle.read retval

Testcase for large trie save/load

Squashed commit of pull request #1015, closes issue #892.
peterjc added a commit that referenced this issue Apr 11, 2017
See previous commit, squashed commit of pull request #1015
which closes issue #892 (handling large datasets).
@peterjc
Member

peterjc commented Apr 11, 2017

Should be fixed via #1015 from @noamkremen

MarkusPiotrowski pushed a commit to MarkusPiotrowski/biopython that referenced this issue Oct 31, 2017
Check for null in py_handle.read retval

Testcase for large trie save/load

Squashed commit of pull request biopython#1015, closes issue biopython#892.
MarkusPiotrowski pushed a commit to MarkusPiotrowski/biopython that referenced this issue Oct 31, 2017
See previous commit, squashed commit of pull request biopython#1015
which closes issue biopython#892 (handling large datasets).
@peterjc
Member

peterjc commented Jul 24, 2020

This should have been closed in 2017 when the fix was applied.

Since then we removed Bio.trie in #2501 - but the licence would allow anyone to fork it and make a stand alone project for release on PyPI.

@peterjc closed this as completed Jul 24, 2020