Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve on-disk order in SeqIO & SearchIO indexes #1619

Merged
merged 7 commits into from Nov 23, 2018

Conversation

peterjc
Copy link
Member

@peterjc peterjc commented Apr 18, 2018

Since the standard dict object provided in Python 3.6 onwards preserves order, the in-memory dictionary-like indexing in SeqIO and SearchIO will now preserve record order there.

This pull request generalises this to apply to on older versions of Python, and for the SQLite based on-disk dictionary-like indexes.

This means looping over the index will load the records in the on-disk order, which will be much faster (previously it would be effectively at random, based on the key hash sorting).

Essentially this works by explicitly using OrderedDict on Python 3.5 or older for the in-memory indexes, and adding explicit sorting to the SQL command for the on-disk SQLite indexes.

Potentially the SQL schema could add an index to the offset column to speed this up further?

--

I hereby agree to dual licence this and any previous contributions under both the Biopython License Agreement AND the BSD 3-Clause License.

@peterjc peterjc requested a review from bow April 18, 2018 15:03
@codecov
Copy link

codecov bot commented Apr 18, 2018

Codecov Report

Merging #1619 into master will increase coverage by <.01%.
The diff coverage is 93.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1619      +/-   ##
==========================================
+ Coverage   85.47%   85.47%   +<.01%     
==========================================
  Files         301      301              
  Lines       49301    49313      +12     
==========================================
+ Hits        42141    42152      +11     
- Misses       7160     7161       +1
Impacted Files Coverage Δ
Bio/SeqIO/__init__.py 84.44% <100%> (+0.35%) ⬆️
Bio/SearchIO/__init__.py 89.74% <100%> (+0.55%) ⬆️
Bio/File.py 87.01% <83.33%> (-0.13%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0364518...386a2d8. Read the comment docs.

@peterjc
Copy link
Member Author

peterjc commented Apr 19, 2018

@bow do you have a favourite multi-record SearchIO file I should use for a unit test on the order?

Copy link
Member

@bow bow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good @peterjc. My only comment is about the mention of Python 3.6. As far as I know, it is CPython that has insertion-ordered dicts in 3.6, and it won't be until Python 3.7 that this becomes a language feature (as opposed to an implementation feature).

Is the use of CPython something we should also check against here?

As for adding an offset column index, I think it would be useful indeed. Have you played around to see how much speedup would be gained?

@bow
Copy link
Member

bow commented Apr 22, 2018

@peterjc, the xml_2226_tblastn_001.xml is one you can use, but IIRC anything annotated with 'multiple queries' in Tests/Blast/README.txt or Tests/Hmmer/README.txt should be usable here.

@peterjc
Copy link
Member Author

peterjc commented Apr 23, 2018

Thanks.

I've not run any detailed timings yet with an extra index, but that seems prudent.

As to the Python 3.6 case, given we only care about CPython and PyPy, these should all be ordered. However, yes, we can't promise this on Jython or IronPython. Do you have a better wording suggestion and/or check in the if-statement for this?

Finally, with this work in place, it ought to be possible to expand the dictionary-like interface to also accept integer indexes (0, 1, 2, ..., n-1 for a file with n records). I can see this being useful, although have not thought about exactly what this mean for implementation on the SQLite side.

@bow
Copy link
Member

bow commented Apr 23, 2018

For the wording, how about:

Since Python 3.7, the default dict class maintains key order, meaning
this dictionary will reflect the order of records given to it. For CPython,
this was already implemented in 3.6.

As of Biopython 1.72, we explicitly use OrderedDict for CPython older than
3.6 (and for other Python older than 3.7) so that you can always assume the
record order is preserved.

And the if block check can include the platform.get_implementation() check:

# Ordered dict is a language feature from Python 3.7 onwards.
# CPython 3.6 also already has ordered dicts.
if sys.version_info >= (3, 7) or (sys.version_info == (3, 6) and platform.python_implementation() == "CPython"):
    _dict = dict
else:
    from collections import OrderedDict as _dict

@bow
Copy link
Member

bow commented Apr 24, 2018

I just realized, perhaps this:

Since Python 3.7, the default dict class maintains key order, meaning
this dictionary will reflect the order of records given to it. For CPython,
this was already implemented in 3.6.

is better phrased as this (since maintains insertion order sounds less ambiguous than maintains key order):

Since Python 3.7, the default dict class maintains insertion order, meaning
this dictionary will reflect the order of records given to it. For CPython,
this was already implemented in 3.6.

Python 3.6 has an ordered dict, but this was declared
an implementation detail only rather than a language
standard.

PyPy has always had an ordered dict.

Python 3.7 onwards has an ordered dict as part of the
language definition, see:

https://mail.python.org/pipermail/python-dev/2017-December/151283.html

We do not in general want to use OrderedDict which is
designed for different use cases (making lots of changes
to the dictionary content), see:

https://mail.python.org/pipermail/python-dev/2017-December/151266.html
This is simpler to think about, and also means looping over the
index to iterate over a file (or files) will be done in the
disk-order, which will be faster.

(Previously iteration over the index would happen in the sorted
order of the index key hashes, essentially random, thus there
would be a lot of disk random access jumping back and forth
through the file to parse the records in arbitrary order.)

Internally the SeqIO/SearchIO index_db function will now use
explicit sorting via the file number and offset so that the
keys and iteration order reflect that on disk.

Internally the SeqIO/SearchIO index function will now use an
OrderedDict on Python 3.5 or older.

Python 3.6 has an ordered dict, but this was declared an
implementation detail only, rather than a language standard.

PyPy has always had an ordered dict.

Python 3.7 onwards has an ordered dict as part of the
language definition, see:

https://mail.python.org/pipermail/python-dev/2017-December/151283.html

We do not in general want to use OrderedDict which is designed
for different use cases (making lots of changes to the dictionary
content), see:

https://mail.python.org/pipermail/python-dev/2017-December/151266.html
This should handle the corner cases of Jython or IronPython too.
Based on suggestion from Wibowo Arindrarto.
@peterjc
Copy link
Member Author

peterjc commented Nov 22, 2018

I forgot about this - I'll rebase it now to deal with the minor merge conflicts, and say Biopython 1.73 not 1.72 as we missed the last release.

This will be in Biopython 1.73 now.

Includes clarification to wording based on suggestion from Bow.
@peterjc peterjc merged commit e52918b into biopython:master Nov 23, 2018
@peterjc peterjc deleted the sorted_index branch November 23, 2018 10:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants