Fuzzy search for tools search #3356

Merged
merged 7 commits into from Jan 23, 2017

5 participants
@anuprulez
Member

commented Dec 21, 2016

This PR proposes adding approximate string search (also called fuzzy search) to the tool search. The query is split into overlapping 3-grams and each 3-gram is searched separately. For example, the string 'text' has the 3-grams 'tex' and 'ext'. The q-gram size is fixed at 3 because 3 is also the minimum number of characters required to perform a search.
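The splitting step described above can be illustrated with a minimal sketch. The function name `qgrams` is hypothetical and is not taken from the PR's code; it only shows how overlapping q-grams are derived from a query string.

```python
def qgrams(text, q=3):
    """Return the overlapping q-grams of `text`, left to right.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    if len(text) < q:
        # A query shorter than q has no q-grams.
        return []
    return [text[i:i + q] for i in range(len(text) - q + 1)]


print(qgrams("text"))      # the two 3-grams named in the description
print(qgrams("charcter"))  # a misspelt query still yields usable 3-grams
```

A string of length n yields n - q + 1 q-grams, so a single wrong character only spoils the q-grams that overlap it.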

Each 3-gram is searched using the existing approach, and the BM25 scores that a result (tool) receives are summed over all 3-grams. The results are then sorted in decreasing order of their aggregated BM25 score. The main advantage of this approximate search shows when a query is misspelt. For example, if one means to type 'text' as the query but types 'texx' instead, the proposed approach still returns the following tools:

'addValue', 'mergeCols1', 'genomespace_importer', 'comp1', 'Convert characters1', 'trimmer', 'createInterval', 'Remove beginning1', 'cat1', 'Show beginning1', 'Cut1', 'ChangeCase', 'secure_hash_message_digest', 'wc_gnu', 'Show tail1', 'Paste1', 'random_lines1', 'Grep1', 'Count1', 'gff_filter_by_attribute'

while the existing approach returns none.
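The score aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `search` is a hypothetical callable that stands in for the index lookup and returns a mapping of tool id to score, and the tool ids and scores in `TOY_INDEX` are made up.

```python
from collections import defaultdict


def aggregate_scores(grams, search):
    """Sum per-gram scores for each tool and rank tools by the total."""
    totals = defaultdict(float)
    for gram in grams:
        # `search` is a hypothetical stand-in for the per-gram index query.
        for tool_id, score in search(gram).items():
            totals[tool_id] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [tool_id for tool_id, _ in ranked]


# Made-up scores, keyed by 3-gram. 'exx' matches nothing, as it would
# for the misspelt query 'texx'.
TOY_INDEX = {
    "tex": {"tool_a": 2.0, "tool_b": 1.0},
    "ext": {"tool_a": 0.25, "tool_b": 1.5},
    "exx": {},
}


def toy_search(gram):
    return TOY_INDEX.get(gram, {})


misspelt = aggregate_scores(["tex", "exx"], toy_search)
correct = aggregate_scores(["tex", "ext"], toy_search)
print(misspelt, correct)
```

In the toy data the misspelt query still gets hits through 'tex' alone, which is the effect the description relies on.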

With the correct spelling of 'text', these results are obtained:

proposed approach -

'addValue', 'mergeCols1', 'gene2exon1', 'Convert characters1', 'Show beginning1', 'Cut1', 'wc_gnu', 'random_lines1', 'ChangeCase', 'secure_hash_message_digest', 'Paste1', 'trimmer', 'createInterval', 'cat1', 'Show tail1', 'Extract_features1', 'Extract genomic DNA 1', 'Interval2Maf1', 'genomespace_importer', 'Interval2Maf_pairwise1', 'maf_by_block_number1', 'comp1', 'Remove beginning1', 'Grep1', 'Count1', 'gff_filter_by_attribute'

existing approach -

'addValue', 'mergeCols1', 'genomespace_importer', 'comp1', 'wc_gnu', 'random_lines1', 'ChangeCase', 'trimmer', 'createInterval', 'Show tail1', 'secure_hash_message_digest', 'Convert characters1', 'Cut1', 'Paste1', 'cat1', 'Show beginning1', 'Remove beginning1', 'Grep1', 'Count1', 'gff_filter_by_attribute'

Another example for query 'charcter':

The 3-grams are 'cha', 'har', 'arc', 'rct', 'cte', 'ter'

proposed approach -

'trimmer', 'Grep1', 'wc_gnu', 'barchart_gnuplot', 'Filter1', 'gff_filter_by_attribute', 'ucsc_table_direct_archaea1', 'sort1', 'ChangeCase', 'MAF_To_Interval1', 'gtf_filter_by_attribute_values_list', 'gff_filter_by_feature_count', 'Convert characters1', 'Extract_features1', 'upload1', 'Interval2Maf_pairwise1', 'MAF_filter', 'gtf2bedgraph', 'Interval_Maf_Merged_Fasta2', 'wiggle2simple1', 'Extract genomic DNA 1', 'Summary_Statistics1', 'createInterval', 'Interval2Maf1', 'genomespace_exporter', 'wig_to_bigWig', 'MAF_To_Fasta1', 'join1', 'MAF_To_BED1', 'Count1', 'secure_hash_message_digest', 'Paste1', 'gff2bed1'

existing approach -

No result

The q-gram size can be made configurable. Your suggestions are welcome.

@galaxybot galaxybot added the triage label Dec 21, 2016

@galaxybot galaxybot added this to the 17.01 milestone Dec 21, 2016

@nsoranzo

Member

commented Dec 22, 2016

Nice! That's been requested multiple times!

There shouldn't be any change in static/ since you didn't modify any javascript or css file.

@bgruening bgruening requested review from martenson and removed request for martenson Dec 22, 2016

@martenson
Member

left a comment

Thank you for the PR. Apart from the inline ones I have 2 main comments:

  1. Given that this considerably changes the output of the API and the behavior of tool search, I would prefer this to be a configurable option, at least until we are sure it is superior to full-text search.

  2. For ngrams searching we could use the Whoosh built-in ngram support (http://whoosh.readthedocs.io/en/latest/ngrams.html) which includes NgramFilter for ngram splitting with min and maxsize.

Thank you for the contribution, I hope my review is helpful to you.
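To make the second point concrete, here is a plain-Python sketch of n-gram splitting with a minimum and a maximum size. It does not call Whoosh; the function name `ngrams_between` is hypothetical, and Whoosh's own NgramFilter may tokenize or order its output differently.

```python
def ngrams_between(word, minsize=3, maxsize=4):
    """Return every n-gram of `word` with minsize <= n <= maxsize.

    Plain-Python approximation for illustration only; it is not
    Whoosh's NgramFilter.
    """
    grams = []
    for start in range(len(word) - minsize + 1):
        for size in range(minsize, maxsize + 1):
            end = start + size
            if end > len(word):
                # Larger sizes would run past the end of the word.
                break
            grams.append(word[start:end])
    return grams


print(ngrams_between("text", minsize=3, maxsize=4))
```

Allowing several sizes lets longer, more specific grams contribute to the match alongside the short ones.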

# Sort the results based on aggregated BM25 score in decreasing order of scores
hits_with_score = sorted(hits_with_score.items(), key=lambda x: x[1], reverse=True)
# Return the tool ids
return [item[0] for item in hits_with_score]

@martenson

martenson Jan 12, 2017

Member

This does not respect the tool_search_limit configuration. It returns all hits.
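One way to honor a result limit after aggregation is to slice the sorted list, as in this minimal sketch. The function name `top_tool_ids` and the sample scores are hypothetical; only the idea of capping the output at `tool_search_limit` comes from the comment above.

```python
def top_tool_ids(scores, limit):
    """Return at most `limit` tool ids, highest aggregated score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Slicing caps the number of returned ids at the configured limit.
    return [tool_id for tool_id, _ in ranked[:limit]]


# Made-up aggregated scores for illustration.
sample_scores = {"tool_a": 1.0, "tool_b": 3.0, "tool_c": 2.0}
print(top_tool_ids(sample_scores, limit=2))
```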

hits = searcher.search( parser.parse( '*' + q + '*' ), limit=float( tool_search_limit ) )
return [ hit[ 'id' ] for hit in hits ]
# Fuzzy search using q-grams
qgram_length = 3

@martenson

martenson Jan 12, 2017

Member

Making qgram_length configurable is a good idea. I would pursue it.

@martenson martenson modified the milestones: 17.05, 17.01 Jan 12, 2017

@martenson

Member

commented Jan 16, 2017

@galaxybot test this

@anuprulez

Member Author

commented Jan 16, 2017

@martenson Thanks a lot for your review!

The changes include:

  • NgramFilter from the Whoosh search library is now used to break the query down into multiple n-grams.

  • N-gram search is now configurable through the config variable tool_enable_ngram_search in the galaxy.ini file. The minimum and maximum n-gram sizes can also be configured, through the tool_ngram_minsize and tool_ngram_maxsize variables respectively.

  • To limit the n-gram search results to the value of tool_search_limit, only the items from the start of the sorted hits up to this value are returned.
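As an illustration of the three options listed above, the sketch below parses a sample INI fragment with Python's standard configparser. The `[app:main]` section name and the values shown are assumptions made for the example, and Galaxy's own configuration loading is not shown here.

```python
import configparser

# Sample fragment; the section name and values are illustrative assumptions.
SAMPLE_INI = """
[app:main]
tool_enable_ngram_search = True
tool_ngram_minsize = 3
tool_ngram_maxsize = 4
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_INI)
section = parser["app:main"]

# Fall back to conservative values when an option is missing.
enable_ngram = section.getboolean("tool_enable_ngram_search", fallback=False)
minsize = section.getint("tool_ngram_minsize", fallback=3)
maxsize = section.getint("tool_ngram_maxsize", fallback=4)

print(enable_ngram, minsize, maxsize)
```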

@bgruening

anuprulez added some commits Jan 19, 2017

@martenson

Member

commented Jan 20, 2017

@galaxybot test this

@martenson martenson added status/review and removed status/WIP labels Jan 23, 2017

@martenson

Member

commented Jan 23, 2017

Good job @anuprulez - looking great! Thank you very much for the contribution.

@martenson martenson modified the milestones: 17.01, 17.05 Jan 23, 2017

@martenson martenson merged commit 5bd8a0f into galaxyproject:dev Jan 23, 2017

4 checks passed

api test: Build finished. 256 tests run, 0 skipped, 0 failed.
continuous-integration/travis-ci/pr: The Travis CI build passed
framework test: Build finished. 134 tests run, 0 skipped, 0 failed.
toolshed test: Build finished. 580 tests run, 0 skipped, 0 failed.
@anuprulez

Member Author

commented Jan 23, 2017

Thanks a lot for your suggestions and merging!
@martenson @bgruening @nsoranzo

@bgruening

Member

commented Jan 23, 2017

Wuhu!! Thanks to all! Good job @anuprulez!

@bgruening bgruening deleted the bgruening:fuzzysearch branch Jan 23, 2017
