Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use COLLATE NOCASE instead of LOWER() for Bio::DB::SeqFeature::Store::DBI::SQLite #66

Merged
merged 1 commit into from
Mar 27, 2014

Conversation

nathanweeks
Copy link
Contributor

After upgrading from BioPerl 1.6.901 to 1.6.923, a significant performance regression was observed that led to GBrowse timeouts when doing keyword searches. This was traced to the change made to Bio::DB::SeqFeature::Store::DBI::SQLite made in commit d3af015.

When executing Bio::DB::SeqFeature::Store->get_features_by_name(), The SQL generated by the 1.6.923 version is:

SELECT f.id,f.object
  FROM feature as f, name as n
  WHERE (n.id=f.id AND lower(n.name) = lower('M1') AND n.display_name>0)

The 1.6.923 query (which uses the LOWER() function on the "name" column) results in a full table scan, while the 1.6.901 method, which omits LOWER(), allows the use of the index on the "name" column:

sqlite> explain query plan SELECT f.id,f.object FROM feature as f, name as n WHERE (n.id=f.id AND lower(n.name) = lower('M1') AND n.display_name>0);
0|0|1|SCAN TABLE name AS n
0|1|0|SEARCH TABLE feature AS f USING INTEGER PRIMARY KEY (rowid=?)
sqlite> explain query plan SELECT f.id,f.object FROM feature as f, name as n WHERE (n.id=f.id AND n.name = lower('M1') AND n.display_name>0);
0|0|1|SEARCH TABLE name AS n USING INDEX index_name_name (name=?)
0|1|0|SEARCH TABLE feature AS f USING INTEGER PRIMARY KEY (rowid=?)

A more efficient way of achieving a case-insensitive search that allows the use of indexes is to add the COLLATE NOCASE constraint to the relevant columns. This also allows the "LIKE" optimization ( http://www.sqlite.org/optoverview.html ) as well, which in my benchmark database containing 692300 gene/mRNA/CDS features resulted in an almost 7x speedup of Bio::DB::SeqFeature::Store->search_attributes().

The changes in this pull request add the COLLATE NOCASE option to name.name, attribute.attribute_value, and typelist.tag columns, and (for backwards compatibility with existing SeqFeature databases that don't have the COLLATE NOCASE constraint on those columns) replaces the use of the LOWER() function with the COLLATE NOCASE operator.

These changes pass the t/LocalDB/SeqFeature_SQLite.t regression test.

attribute.attribute_value, and typelist.tag; and remove the user of the
LOWER() function; to allow the use of indexes instead of full-table
scans. Add the COLLATE NOCASE opterator to affected queries for
backwards compatibility with Bio::DB::SeqFeature::Store databases
created without the COLLATE NOCASE constraints.
fjossandon added a commit that referenced this pull request Mar 27, 2014
Use COLLATE NOCASE instead of LOWER() for Bio::DB::SeqFeature::Store::DBI::SQLite
@fjossandon fjossandon merged commit eeedd0a into bioperl:master Mar 27, 2014
@fjossandon
Copy link
Member

I confirm that tests are passing on my system and Travis is fine, so I'll proceed with the merge

@nathanweeks nathanweeks deleted the sqlite_collate_nocase branch March 27, 2014 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants