Preserve on-disk order in SeqIO & SearchIO indexes #1619

peterjc · 2018-04-18T15:03:37Z

Since the standard dict object provided in Python 3.6 onwards preserves order, the in-memory dictionary-like indexing in SeqIO and SearchIO will now preserve record order there.

This pull request generalises this to apply to on older versions of Python, and for the SQLite based on-disk dictionary-like indexes.

This means looping over the index will load the records in the on-disk order, which will be much faster (previously it would be effectively at random, based on the key hash sorting).

Essentially this works by explicitly using OrderedDict on Python 3.5 or older for the in-memory indexes, and adding explicit sorting to the SQL command for the on-disk SQLite indexes.

Potentially the SQL schema could add an index to the offset column to speed this up further?

--

I hereby agree to dual licence this and any previous contributions under both the Biopython License Agreement AND the BSD 3-Clause License.

codecov · 2018-04-18T16:23:49Z

Codecov Report

Merging #1619 into master will increase coverage by <.01%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master    #1619      +/-   ##
==========================================
+ Coverage   85.47%   85.47%   +<.01%     
==========================================
  Files         301      301              
  Lines       49301    49313      +12     
==========================================
+ Hits        42141    42152      +11     
- Misses       7160     7161       +1

Impacted Files	Coverage Δ
Bio/SeqIO/__init__.py	`84.44% <100%> (+0.35%)`	⬆️
Bio/SearchIO/__init__.py	`89.74% <100%> (+0.55%)`	⬆️
Bio/File.py	`87.01% <83.33%> (-0.13%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0364518...386a2d8. Read the comment docs.

peterjc · 2018-04-19T12:16:19Z

@bow do you have a favourite multi-record SearchIO file I should use for a unit test on the order?

bow

This looks good @peterjc. My only comment is about the mention of Python 3.6. As far as I know, it is CPython that has insertion-ordered dicts in 3.6, and it won't be until Python 3.7 that this becomes a language feature (as opposed to an implementation feature).

Is the use of CPython something we should also check against here?

As for adding an offset column index, I think it would be useful indeed. Have you played around to see how much speedup would be gained?

bow · 2018-04-22T20:47:57Z

@peterjc, the xml_2226_tblastn_001.xml is one you can use, but IIRC anything annotated with 'multiple queries' in Tests/Blast/README.txt or Tests/Hmmer/README.txt should be usable here.

peterjc · 2018-04-23T09:51:46Z

Thanks.

I've not run any detailed timings yet with an extra index, but that seems prudent.

As to the Python 3.6 case, given we only care about CPython and PyPy, these should all be ordered. However, yes, we can't promise this on Jython or IronPython. Do you have a better wording suggestion and/or check in the if-statement for this?

Finally, with this work in place, it ought to be possible to expand the dictionary-like interface to also accept integer indexes (0, 1, 2, ..., n-1 for a file with n records). I can see this being useful, although have not thought about exactly what this mean for implementation on the SQLite side.

bow · 2018-04-23T16:56:37Z

For the wording, how about:

Since Python 3.7, the default dict class maintains key order, meaning
this dictionary will reflect the order of records given to it. For CPython,
this was already implemented in 3.6.

As of Biopython 1.72, we explicitly use OrderedDict for CPython older than
3.6 (and for other Python older than 3.7) so that you can always assume the
record order is preserved.

And the if block check can include the platform.get_implementation() check:

# Ordered dict is a language feature from Python 3.7 onwards.
# CPython 3.6 also already has ordered dicts.
if sys.version_info >= (3, 7) or (sys.version_info == (3, 6) and platform.python_implementation() == "CPython"):
    _dict = dict
else:
    from collections import OrderedDict as _dict

bow · 2018-04-24T11:19:06Z

I just realized, perhaps this:

Since Python 3.7, the default dict class maintains key order, meaning
this dictionary will reflect the order of records given to it. For CPython,
this was already implemented in 3.6.

is better phrased as this (since maintains insertion order sounds less ambiguous than maintains key order):

Since Python 3.7, the default dict class maintains insertion order, meaning
this dictionary will reflect the order of records given to it. For CPython,
this was already implemented in 3.6.

Python 3.6 has an ordered dict, but this was declared an implementation detail only rather than a language standard. PyPy has always had an ordered dict. Python 3.7 onwards has an ordered dict as part of the language definition, see: https://mail.python.org/pipermail/python-dev/2017-December/151283.html We do not in general want to use OrderedDict which is designed for different use cases (making lots of changes to the dictionary content), see: https://mail.python.org/pipermail/python-dev/2017-December/151266.html

This is simpler to think about, and also means looping over the index to iterate over a file (or files) will be done in the disk-order, which will be faster. (Previously iteration over the index would happen in the sorted order of the index key hashes, essentially random, thus there would be a lot of disk random access jumping back and forth through the file to parse the records in arbitrary order.) Internally the SeqIO/SearchIO index_db function will now use explicit sorting via the file number and offset so that the keys and iteration order reflect that on disk. Internally the SeqIO/SearchIO index function will now use an OrderedDict on Python 3.5 or older. Python 3.6 has an ordered dict, but this was declared an implementation detail only, rather than a language standard. PyPy has always had an ordered dict. Python 3.7 onwards has an ordered dict as part of the language definition, see: https://mail.python.org/pipermail/python-dev/2017-December/151283.html We do not in general want to use OrderedDict which is designed for different use cases (making lots of changes to the dictionary content), see: https://mail.python.org/pipermail/python-dev/2017-December/151266.html

This should handle the corner cases of Jython or IronPython too. Based on suggestion from Wibowo Arindrarto.

peterjc · 2018-11-22T16:01:42Z

I forgot about this - I'll rebase it now to deal with the minor merge conflicts, and say Biopython 1.73 not 1.72 as we missed the last release.

This will be in Biopython 1.73 now. Includes clarification to wording based on suggestion from Bow.

peterjc requested a review from bow April 18, 2018 15:03

bow approved these changes Apr 22, 2018

View reviewed changes

peterjc added 6 commits November 22, 2018 15:52

Use OrderedDict in SeqIO and SearchIO to_dict(...)

f298438

Test for SeqIO to_dict ordering

cbe1560

Tests for SeqIO index ordering

0c29e58

Make OrderedDict vs dict check more rigerous.

dcc6364

This should handle the corner cases of Jython or IronPython too. Based on suggestion from Wibowo Arindrarto.

SeqIO/SearchIO ordered indexing NEWS for Biopython 1.73

386a2d8

This will be in Biopython 1.73 now. Includes clarification to wording based on suggestion from Bow.

peterjc force-pushed the sorted_index branch from 5a16ad0 to 386a2d8 Compare November 22, 2018 16:11

peterjc merged commit e52918b into biopython:master Nov 23, 2018

peterjc deleted the sorted_index branch November 23, 2018 10:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preserve on-disk order in SeqIO & SearchIO indexes #1619

Preserve on-disk order in SeqIO & SearchIO indexes #1619

peterjc commented Apr 18, 2018

codecov bot commented Apr 18, 2018 •

edited

peterjc commented Apr 19, 2018

bow left a comment

bow commented Apr 22, 2018

peterjc commented Apr 23, 2018

bow commented Apr 23, 2018 •

edited

bow commented Apr 24, 2018

peterjc commented Nov 22, 2018

Preserve on-disk order in SeqIO & SearchIO indexes #1619

Preserve on-disk order in SeqIO & SearchIO indexes #1619

Conversation

peterjc commented Apr 18, 2018

codecov bot commented Apr 18, 2018 • edited

Codecov Report

peterjc commented Apr 19, 2018

bow left a comment

Choose a reason for hiding this comment

bow commented Apr 22, 2018

peterjc commented Apr 23, 2018

bow commented Apr 23, 2018 • edited

bow commented Apr 24, 2018

peterjc commented Nov 22, 2018

codecov bot commented Apr 18, 2018 •

edited

bow commented Apr 23, 2018 •

edited