
Biopython trie implementation can't load large data sets #892

Closed

twrightsman opened this issue Jul 24, 2016 · 30 comments

@twrightsman
Contributor

Migrated from https://redmine.open-bio.org/issues/3395

@mnowotka said:

Imagine I have a Biopython trie:

from Bio import trie
import gzip

f = gzip.open('/tmp/trie.dat.gz', 'w')
tr = trie.trie()
# ... fill in the trie ...
trie.save(f, tr)

Now /tmp/trie.dat.gz is about 50MB. Let's try to read it:

from Bio import trie
import gzip

f = gzip.open('/tmp/trie.dat.gz', 'r')
tr = trie.load(f)

Unfortunately I'm getting a meaningless error saying:
"loading failed for some reason"

Any hints?

@twrightsman
Contributor Author

@peterjc replied:

Can you try the same test case without gzip? i.e. Can you load /tmp/trie.dat rather than /tmp/trie.dat.gz?

Also I would try explicitly opening the files in binary mode.

P.S. Which OS, which version of Python, which version of Biopython?
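
A minimal sketch of that suggested check, with gzip taken out of the picture and explicit binary modes; the trie-filling step is assumed from the original example:

from Bio import trie

tr = trie.trie()
# ... fill in the trie as in the original example ...

# Save and reload through a plain file, opened explicitly in binary mode:
with open('/tmp/trie.dat', 'wb') as handle:
    trie.save(handle, tr)

with open('/tmp/trie.dat', 'rb') as handle:
    tr2 = trie.load(handle)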

@twrightsman
Contributor Author

@mnowotka replied:

Sure, I'll update this issue as soon as I check that.

@twrightsman
Contributor Author

@mnowotka replied:

OK, I tried using a standard Python file handle with explicit binary mode and it also failed. The file is now 165.5MB.
I also tried bz2 and zip compression, without any luck...

@twrightsman
Contributor Author

@peterjc replied:

Well, that is progress - it means this isn't a problem with reading a compressed file from disk; you've made the test case simpler. Can you share a self-contained example script? If not, I suggest you halve the dataset (only record the first half of the entries) and retest, then repeat - this should tell you whether the problem is, as you suspect, the large dataset itself, or something specific about a particular value.
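
A rough sketch of that halving experiment, assuming a hypothetical items list of (key, value) pairs used to build the trie:

from Bio import trie

def round_trips(items):
    # Build a trie from (key, value) pairs, save it, then try to reload it.
    index = trie.trie()
    for key, value in items:
        index[str(key)] = value
    with open('/tmp/trie_test.dat', 'wb') as handle:
        trie.save(handle, index)
    try:
        with open('/tmp/trie_test.dat', 'rb') as handle:
            trie.load(handle)
    except RuntimeError:
        return False
    return True

# Halve while the first half still reproduces the failure; once it stops
# failing, the culprit is in the other half (or in the combination).
subset = items  # hypothetical full list of (key, value) pairs
while len(subset) > 1 and not round_trips(subset[:len(subset) // 2]):
    subset = subset[:len(subset) // 2]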

Alternatively can you share the (compressed) file? I could at least check if it fails the same way here, and perhaps add some debugging code to get more information.

The error message itself is coming from some C code, which hasn't changed for some time:
https://github.com/biopython/biopython/blob/master/Bio/triemodule.c

The error itself is likely triggered in the function _deserialize_transition in trie.c:
https://github.com/biopython/biopython/blob/master/Bio/trie.c

You still haven't told us the important information: which OS, which version of Python, and which version of Biopython. Given it is C code, I'd also like to know how Biopython was installed (e.g. did you compile it from source yourself?).

@twrightsman
Contributor Author

@mnowotka replied:

I'm using Ubuntu 12.04 LTS, Biopython 1.6 and Python 2.7.3.
Can you tell me where I should place the compressed file?

@twrightsman
Contributor Author

@peterjc replied:

Sadly RedMine is limited to 5MB attachments. You could use Dropbox or something similar, or if you have your own server, put the file online temporarily for me to download.

You probably have Biopython 1.60 (one dot sixty); there was no Biopython 1.6 (one dot six). Did you install Biopython using the Ubuntu package manager? i.e. the GUI tool, or at the command line with something like 'apt-get install biopython'?

@twrightsman
Contributor Author

@mnowotka replied:

I put the file here: http://mnowotka.kei.pl/trie.4.dat.gz

@twrightsman
Contributor Author

@mnowotka replied:

I confirm it's version 1.60 I'm using. I installed it with either apt-get or pip.

@twrightsman
Contributor Author

twrightsman commented Jul 24, 2016

@peterjc replied:

I can reproduce the problem with your saved file under Mac OS X, using the latest Biopython from github, e.g.

$ python
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import trie
>>> import gzip
>>> with gzip.open("trie.4.dat.gz") as handle:
...     t = trie.load(handle)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
RuntimeError: loading failed for some reason

Adding a little debugging to the C code tells us where this fails (see attachment), line 669:

668     if(has_value) {
669         if(!(trie->value = (*read_value)(data)))
670             goto _deserialize_trie_error;
671     }

What kind of CPU does your machine have? i.e. is it a normal Intel or AMD CPU, or something unusual like a PowerPC where we have to worry about the byte order interpretation?

We may need a complete example creating the trie as well - the problem could be in the trie itself, the serialisation (writing to disk), or de-serialisation (loading from disk).
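
For the byte-order question, a quick check from Python:

import sys

print(sys.byteorder)  # 'little' on Intel/AMD, 'big' on e.g. classic PowerPC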

@twrightsman
Contributor Author

@mnowotka replied:

I'm using an Ubuntu virtual machine running on a MacBook Pro with a single Intel® Core™ i7-2720QM CPU @ 2.20GHz. I will try to prepare code and data for which it fails.

@twrightsman
Contributor Author

twrightsman commented Jul 24, 2016

@mdehoon replied:

It looks like your data file is corrupted. In _read_value_from_handle, the length of the key it tries to read is 1490353651722. This does not seem correct. Can you create a minimal data file that shows the problem? Then, when you fill in the trie, you can identify which key causes the problem.

@twrightsman
Contributor Author

@mnowotka replied:

That just means the bug is in the save() function, not in load().
But of course I will provide the data file, although I can't guarantee it will be minimal.

@twrightsman
Contributor Author

@mdehoon replied:

You don't need to provide the data file to us. The idea is that you create the smallest trie.dat file that will cause the load() to fail. Then you know which item in the trie is problematic. Once you know that, we can try to figure out why the save() creates a corrupted file.

@twrightsman
Contributor Author

@mnowotka replied:

This is my minimal test case:

from Bio import trie
import pickle

f = open('minimal_data.pkl', 'r')
list = pickle.load(f)
f.close()

index = trie.trie()
for item in list:
    for chunk in item[0].split('/')[1:]:
        if len(chunk) > 2:
            if index.get(str(chunk)):
                index[str(chunk)].append(item[1])
            else:
                index[str(chunk)] = [item[1]]

f = open('trie.dat', 'w')
trie.save(f, index)
f.close()

f = open('trie.dat', 'r')
index = trie.load(f)
f.close()

@twrightsman
Contributor Author

twrightsman commented Jul 24, 2016

@mdehoon replied:

Hi Michal,

Unfortunately I cannot load your minimal_data.pkl file. At
list = pickle.load(f)
I get
ImportError: No module named django.db.models.query

Can you check which item in list is actually causing the problem? Just reduce the list until you find the item that is causing the trie.load(f) to fail.

@twrightsman
Contributor Author

@mnowotka replied:

Hello,
As I said, this is a minimal test case. That means there is no single key that causes the problem: if you remove any of the items from the list it will work. You can try to run this example from the django shell (python manage.py shell). If there are any further problems with running it, I can provide the model classes as well.

@twrightsman
Contributor Author

@mdehoon replied:

We need to isolate the bug further to be able to solve it. I would suggest finding a data set that fails to load but does not depend on django.

@twrightsman
Contributor Author

@mnowotka replied:

Sure, today I'll strip all django dependencies and resubmit the data set and loading code.

@twrightsman
Contributor Author

@mnowotka replied:

Minimal test case with stripped django dependencies, loading code below:

from Bio import trie
import pickle

f = open('minimal_data.pkl', 'r')
list = pickle.load(f)
f.close()

index = trie.trie()
for item in list:
    for chunk in item[0].split('/')[1:]:
        if len(chunk) > 2:
            if index.get(str(chunk)):
                index[str(chunk)].append(item[1])
            else:
                index[str(chunk)] = [item[1]]

f = open('trie.dat', 'w')
trie.save(f, index)
f.close()

f = open('trie.dat', 'r')
new_trie = trie.load(f)
f.close()

@twrightsman
Contributor Author

@mdehoon replied:

The problem was indeed that one of the chunks had a size of 2000.
I've uploaded a fix to github; could you please give it a try? See

6e09a4a

In particular, please make sure that new_trie is identical to the original trie.
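
One way to make that comparison, sketched under the assumption that tries expose the dictionary-like keys() method and indexing used elsewhere in this thread (the trie type itself does not appear to define value-based equality):

def tries_equal(a, b):
    # Compare two Bio.trie tries by their keys and stored values.
    if sorted(a.keys()) != sorted(b.keys()):
        return False
    return all(a[key] == b[key] for key in a.keys())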

@twrightsman
Contributor Author

@mdehoon replied:

Michał, can you confirm that the fixed Bio.trie works for you? Then we can close this bug report.

@twrightsman
Contributor Author

@mnowotka replied:

Can you just give me two more weeks? I need some time to evaluate it.

@twrightsman
Contributor Author

@peterjc replied:

Kevin Wu reported a related issue, which we discussed with Jeff Chang (off list), where a key in the trie exceeded 1000 bytes (the original value of MAX_KEY_LENGTH). See:
http://lists.open-bio.org/pipermail/biopython-dev/2013-February/010284.html
31909c8

(Ideally we could give a specific ValueError exception here, but nevertheless the current print message is an improvement)
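
A minimal reproduction sketch along those lines, assuming that any key longer than the original 1000-byte MAX_KEY_LENGTH is enough to trigger the failure:

from Bio import trie

index = trie.trie()
index['x' * 2000] = ['value']  # key well beyond the old 1000-byte limit

with open('/tmp/long_key_trie.dat', 'wb') as handle:
    trie.save(handle, index)

with open('/tmp/long_key_trie.dat', 'rb') as handle:
    loaded = trie.load(handle)  # raised RuntimeError before the fix

print(sorted(loaded.keys()) == sorted(index.keys()))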

@twrightsman
Contributor Author

If this was fixed, can we add a test case for it? I can work on it.
minimal_data.pkl.txt

@twrightsman
Contributor Author

twrightsman commented Jul 24, 2016

I can confirm it still fails in this test:

    def test_large_chunk(self):
        f = open('minimal_data.pkl', 'rb')
        list = pickle.load(f)
        f.close()
        index = trie.trie()
        for item in list:
            for chunk in item[0].split('/')[1:]:
                if len(chunk) > 2:
                    if index.get(str(chunk)):
                        index[str(chunk)].append(item[1])
                    else:
                        index[str(chunk)] = [item[1]]
        f = open('trie.dat', 'wb')
        trie.save(f, index)
        f.close()
        f = open('trie.dat', 'rb')
        new_trie = trie.load(f)
        f.close()
        self.assertEqual(index, new_trie)
$ python3 run_tests.py test_trie
Python version: 3.5.2 (default, Jun 29 2016, 13:43:58) 
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)]
Operating system: posix darwin
test_trie ... FAIL
======================================================================
FAIL: test_large_chunk (test_trie.TestTrie)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/biopython/Tests/test_trie.py", line 151, in test_large_chunk
    self.assertEqual(index, new_trie)
AssertionError: <trie object at 0x1057772a0> != <trie object at 0x1057772b8>

----------------------------------------------------------------------
Ran 1 test in 0.604 seconds

@sticken88

Any improvement? Can I help?

@peterjc
Member

peterjc commented Sep 16, 2016

@sticken88 it seems 31909c8 helped, but the minimal_data.pkl test case is still failing according to @twrightsman who checked in July.

If you know Python and C, then having some fresh eyes look at this would be great. The original author Jeff Chang is no longer actively involved in Biopython.

@sticken88

@peterjc I do know both of them, I'll try to have a look at it.

peterjc pushed a commit that referenced this issue Apr 11, 2017
Check for null in py_handle.read retval

Testcase for large trie save/load

Squashed commit of pull request #1015, closes issue #892.
peterjc added a commit that referenced this issue Apr 11, 2017
See previous commit, squashed commit of pull request #1015
which closes issue #892 (handling large datasets).
@peterjc
Member

peterjc commented Apr 11, 2017

Should be fixed via #1015 from @noamkremen

MarkusPiotrowski pushed a commit to MarkusPiotrowski/biopython that referenced this issue Oct 31, 2017
Check for null in py_handle.read retval

Testcase for large trie save/load

Squashed commit of pull request biopython#1015, closes issue biopython#892.
MarkusPiotrowski pushed a commit to MarkusPiotrowski/biopython that referenced this issue Oct 31, 2017
See previous commit, squashed commit of pull request biopython#1015
which closes issue biopython#892 (handling large datasets).
@peterjc
Member

peterjc commented Jul 24, 2020

This should have been closed in 2017 when the fix was applied.

Since then we removed Bio.trie in #2501 - but the licence would allow anyone to fork it and make a stand alone project for release on PyPI.

@peterjc closed this as completed Jul 24, 2020