The Nested Containment List for Python. Basically an immutable interval-tree that is silly fast for both construction and lookups.

Nested containment list

The Nested Containment List is a data structure for interval overlap queries, like the interval tree. It is usually an order of magnitude faster than the interval tree for both building and querying.
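
To give a feel for the idea, here is a minimal pure-Python sketch of a nested containment list. It is not the C/Cython code in this library, only an illustration under some assumptions: intervals are half-open `[start, end)`, and the names `build_nclist` and `find_overlaps` are made up for this example. Intervals contained in another interval are moved into that interval's sublist, so within each list both starts and ends are sorted and a binary search finds the first candidate.

```python
def build_nclist(intervals):
    """Build a nested list from (start, end, id) tuples (half-open intervals).

    Each node is (start, end, id, children), where children holds the
    intervals contained in that node.
    """
    top = []
    stack = []  # chain of currently "open" ancestors
    # Sort by start, longest first on ties, so containers come before contents.
    for start, end, ident in sorted(intervals, key=lambda iv: (iv[0], -iv[1])):
        node = (start, end, ident, [])
        # Pop ancestors that end too early to contain this interval.
        while stack and stack[-1][1] < end:
            stack.pop()
        (stack[-1][3] if stack else top).append(node)
        stack.append(node)
    return top


def find_overlaps(nodes, qstart, qend):
    """Return all (start, end, id) in the nested list overlapping [qstart, qend)."""
    # Binary search for the first node whose end lies past the query start.
    # This works because ends are strictly increasing within one list.
    lo, hi = 0, len(nodes)
    while lo < hi:
        mid = (lo + hi) // 2
        if nodes[mid][1] > qstart:
            hi = mid
        else:
            lo = mid + 1
    hits = []
    # Walk forward while nodes still start before the query ends.
    while lo < len(nodes) and nodes[lo][0] < qend:
        start, end, ident, children = nodes[lo]
        hits.append((start, end, ident))
        # Only children of an overlapping node can overlap the query.
        hits.extend(find_overlaps(children, qstart, qend))
        lo += 1
    return hits


example = [(0, 10, "a"), (2, 5, "b"), (3, 4, "c"), (6, 8, "d"), (12, 15, "e")]
nclist = build_nclist(example)
print(find_overlaps(nclist, 4, 7))
```

In this example "b" and "d" end up in the sublist of "a", and "c" in the sublist of "b", so a query only descends into sublists whose parent overlaps it.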

The implementation here is a revived version of the one used in the now-defunct PyGr library, which died of bitrot. I have made it less memory-consuming and added wrapper functions that allow batch-querying the NCLS for further speed gains.

It was implemented to be the cornerstone of the PyRanges project, but I have made it available to the Python community as a stand-alone library. Enjoy.

Paper: https://academic.oup.com/bioinformatics/article/23/11/1386/199545

Install

pip install ncls

Usage

# see the examples/ folder for more examples
from ncls import NCLS

import pandas as pd

starts = pd.Series(range(0, 5))
ends = starts + 100
ids = starts

ncls = NCLS(starts.values, ends.values, ids.values)

# python API, slower
it = ncls.find_overlap(0, 2)
for i in it:
    print(i)
# (0, 100, 0)
# (1, 101, 1)

starts_query = pd.Series([1, 3])
ends_query = pd.Series([52, 14])
indexes_query = pd.Series([10000, 100])

# everything done in C/Cython; faster
ncls.all_overlaps_both(starts_query.values, ends_query.values, indexes_query.values)
# (array([10000, 10000, 10000, 10000, 10000,   100,   100,   100,   100,
#          100]), array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4]))

# return intervals in python (slow/mem-consuming)
intervals = ncls.intervals()
intervals
# [(0, 100, 0), (1, 101, 1), (2, 102, 2), (3, 103, 3), (4, 104, 4)]
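
To make the shape of the batch-query result concrete, here is a brute-force pure-Python sketch that pairs every query with every interval it overlaps. It is only a reference for what the two returned arrays mean, not how the library computes them. `all_overlaps_reference` is a hypothetical helper, and it assumes half-open intervals, which is consistent with the `find_overlap(0, 2)` output above.

```python
def all_overlaps_reference(starts, ends, ids, q_starts, q_ends, q_ids):
    """Brute-force overlap join; returns (query ids, subject ids) as parallel lists."""
    query_hits, subject_hits = [], []
    for qs, qe, qid in zip(q_starts, q_ends, q_ids):
        for s, e, sid in zip(starts, ends, ids):
            # Half-open intervals overlap when each starts before the other ends.
            if s < qe and qs < e:
                query_hits.append(qid)
                subject_hits.append(sid)
    return query_hits, subject_hits


# Same toy data as in the usage example above.
starts = list(range(5))
ends = [s + 100 for s in starts]
ids = starts

result = all_overlaps_reference(starts, ends, ids, [1, 3], [52, 14], [10000, 100])
print(result)
# ([10000, 10000, 10000, 10000, 10000, 100, 100, 100, 100, 100],
#  [0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
```

Each position in the first list is the index of a query, and the same position in the second list is the id of an interval that query overlaps. Both queries here overlap all five intervals, hence ten pairs.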

Benchmark

Test file of 100 million intervals (created by subsetting gencode gtf with replacement):

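| Library   | Function | Time (s) | Memory (GB) |
|-----------|----------|----------|-------------|
| bx-python | build    | 161.7    | 2.5         |
| ncls      | build    | 3.15     | 0.5         |
| bx-python | overlap  | 148.4    | 4.3         |
| ncls      | overlap  | 7.2      | 0.5         |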

Compared with bx-python, building is about 50 times faster and overlap queries are about 20 times faster. Memory usage is about one fifth for building and about one ninth for overlap queries.
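
The ratios follow directly from the benchmark numbers. A small sketch of the arithmetic, with the figures copied from the benchmark above (the variable names are just for this example):

```python
# (time in seconds, memory in GB), copied from the benchmark above
benchmark = {
    ("bx-python", "build"): (161.7, 2.5),
    ("ncls", "build"): (3.15, 0.5),
    ("bx-python", "overlap"): (148.4, 4.3),
    ("ncls", "overlap"): (7.2, 0.5),
}

summary = {}
for function in ("build", "overlap"):
    bx_time, bx_mem = benchmark[("bx-python", function)]
    ncls_time, ncls_mem = benchmark[("ncls", function)]
    speedup = bx_time / ncls_time        # how many times faster ncls is
    memory_fraction = ncls_mem / bx_mem  # ncls memory as a fraction of bx-python
    summary[function] = (speedup, memory_fraction)
    print(f"{function}: {speedup:.1f}x faster, {memory_fraction:.2f} of the memory")
# build: 51.3x faster, 0.20 of the memory
# overlap: 20.6x faster, 0.12 of the memory
```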

Cite

https://www.biorxiv.org/content/10.1101/609396v1

Original paper

Alexander V. Alekseyenko, Christopher J. Lee; Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases, Bioinformatics, Volume 23, Issue 11, 1 June 2007, Pages 1386–1393, https://doi.org/10.1093/bioinformatics/btl647
