
improve tool panel search #2272

Open · martenson opened this issue Apr 29, 2016 · 63 comments

@martenson (Member) commented Apr 29, 2016

reported by @jennaj
given the number of tools on Main, the search results need to be better, mainly:

  • give more results and let people scroll
  • give more weight to name and section
  • let people search for tool IDs
  • make search understand hyphens

I am trying to address the first two (for Main) with: galaxyproject/usegalaxy-playbook#19

@hexylena (Member) commented Aug 8, 2016

[screenshot]

@martenson (Member, Author) commented Aug 8, 2016

@erasche as of recently, Galaxy now searches tool IDs

[screenshot]

I think improvements could be made regarding the interchangeability of ' ', '_', and '-'
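One way to treat those separators as interchangeable is to collapse them before comparing. A minimal sketch, not Galaxy's actual implementation; `normalize` and `matches` are hypothetical names:

```python
import re

def normalize(text: str) -> str:
    # Collapse runs of spaces, underscores and hyphens into a single space,
    # so "bam_to_sam", "bam-to-sam" and "bam to sam" all compare equal.
    return re.sub(r"[\s_\-]+", " ", text).strip().lower()

def matches(query: str, candidate: str) -> bool:
    # Substring match on the normalized forms.
    return normalize(query) in normalize(candidate)
```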

@hexylena (Member) commented Aug 8, 2016

@martenson that's great! +1 for allowing users to substitute in ' ' for the _. I know I look for my tools by ID sometimes and fail to find them.

@dannon (Member) commented Aug 8, 2016

+1 to that; requiring users to know the underscores is unfortunate. Is the input not tokenized and matched? (I guess not, if the broken-up string doesn't match?)

@hexylena (Member) commented Aug 8, 2016

@martenson RFC: Things I would like to see indexed and available to search, along with my feelings on their boosts:

  • tool name (5)
  • tool id (4)
  • tool help text (2)
  • tool parameter helps (0.3)
  • tool input data formats (1)
  • tool output data formats (0.6)

I just find myself frustrated when I cannot find the tool I want, or the results are very limited because of what is searched upon. Of course, I do not know the state of 16.07/dev; I have not gotten there yet.

@martenson (Member, Author) commented Aug 8, 2016

@erasche we have these boosts on Main; these are the defaults:

# tool_name_boost = 9
# tool_section_boost = 3
# tool_description_boost = 2
# tool_label_boost = 1
# tool_stub_boost = 5
# tool_help_boost = 0.5
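As a rough illustration of what these boosts mean (a hypothetical, deliberately simplified scorer; the real search uses Whoosh's BM25F, not a flat sum):

```python
# Default boosts from the Galaxy config, as listed above.
BOOSTS = {"name": 9, "section": 3, "description": 2, "label": 1, "help": 0.5}

def naive_score(query: str, tool: dict) -> float:
    # Each field that contains the query contributes its boost to the score,
    # so a hit in "name" outweighs a hit in "help" by 18x.
    q = query.lower()
    return sum(b for field, b in BOOSTS.items() if q in tool.get(field, "").lower())
```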

@hexylena (Member) commented Aug 8, 2016

@martenson hey, that's most of the things I need. In that case it would be nice to have more space to display this, and to show where the search actually "hit". Apologies, I have not been following this closely enough to make informed comments.

@martenson (Member, Author) commented Aug 8, 2016

we have the 'hit' information, but I did not figure out a good place to display it, related to the limited canvas

@jennaj (Member) commented Aug 8, 2016

I would add that it would be nice to have the underlying tool (binary) name be part of the search, if wrapped under a slightly different tool name or short label of some type.

Related utilized/dependent binaries would be included in this (with a lower "boost", probably).

@martenson (Member, Author) commented Oct 11, 2018

xref #1084

@martenson (Member, Author) commented Oct 11, 2018

digging through Main usage metrics, any improvements to tool panel search should be very well worth it

[screenshot]


@hexylena (Member) commented Jan 18, 2019

Another concrete issue:

https://usegalaxy.eu/api/tools?q=peakachu returns 2 results, neither of which is displayed on the frontend. A client issue.

@martenson (Member, Author) commented Jan 18, 2019

@erasche I cannot reproduce
[screenshot]

@hexylena (Member) commented Jan 18, 2019

Firefox on Linux; cannot repro in Chrome.

@martenson (Member, Author) commented Jan 18, 2019

two subsequent searches for peakachu yielded different results for me in Firefox; the first does not show in the client, the second does

["toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.1", "toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.0"]
["toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.1", "toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.0", "toolshed.g2.bx.psu.edu/repos/rnateam/peakachu/peakachu/0.1.0.2"]

@nsoranzo (Member) commented Jan 18, 2019

@erasche I can also reproduce ~50% of the time in the UI, on both Firefox and Chrome on Linux. One of the web handlers probably hasn't reloaded the toolbox.

@martenson (Member, Author) commented Jan 18, 2019

xref new issue for the display bug: #7238

@jennaj (Member) commented Feb 20, 2019

another search term returning unexpected results: ncbi

browser might not matter; same results using Chrome or Safari under Mac OS X (but I didn't test Firefox)

  • usegalaxy.org == finds the "get data > NCBI bam" download tool but not "get data > NCBI fastq". This server doesn't include "get data > NCBI pileup" anymore (the tool routinely failed -- data usually too large, plus it represents ambiguous scientific content)

  • usegalaxy.eu == doesn't find any of these three (all are present under "get data")

  • usegalaxy.org.au == finds all three (under "get data")

  • usegalaxy.be == doesn't find any of these three (all are present under "get data")

@FredericBGA (Contributor) commented May 24, 2019

another search term: convert
It did not find the convert tool (Text Manipulation > Convert delimiters to TAB).

It works on usegalaxy.eu, though.

Tests were made with Firefox.

I discovered boosts! What can I set in order to get results like usegalaxy.eu's?

# tool_name_boost = 9
# tool_section_boost = 3
# tool_description_boost = 2
# tool_label_boost = 1
# tool_stub_boost = 5
# tool_help_boost = 0.5

@hexylena (Member) commented May 24, 2019

@FredericBGA .eu's boosts are here https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/group_vars/gxconfig.yml#L1076 but they're pretty aggressive / strange compared to other sites'

@FredericBGA (Contributor) commented May 24, 2019

> @FredericBGA .eu's boosts are here https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/group_vars/gxconfig.yml#L1076 but they're pretty aggressive / strange compared to other sites'

@erasche thank you! The link in Martin's post above is broken. I will try something between the default and .eu's.

@martenson (Member, Author) commented May 24, 2019

@FredericBGA we have this on Main atm

  tool_name_boost: 12
  tool_section_boost: 5

We should probably experiment with tool_enable_ngram_search

I created a PR to mimic EU and enable ngram too: galaxyproject/usegalaxy-playbook#228
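For reference, n-gram search matches character substrings of words rather than only whole tokens, which is what lets partial queries like conv find Convert. A toy illustration of the idea, not Whoosh's implementation; the function names are made up:

```python
def ngrams(text: str, n_min: int = 3, n_max: int = 4) -> set:
    # All character n-grams (length n_min..n_max) of each word, lowercased.
    out = set()
    for word in text.lower().split():
        for n in range(n_min, n_max + 1):
            out.update(word[i:i + n] for i in range(len(word) - n + 1))
    return out

def ngram_hit(query: str, name: str) -> bool:
    # Hit if every n-gram of the query also occurs in the name.
    # Queries shorter than n_min produce no n-grams and never match.
    q = ngrams(query)
    return bool(q) and q <= ngrams(name)
```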

@nsoranzo (Member) commented May 24, 2019

We use tool_enable_ngram_search: true, which works fine.

@FredericBGA (Contributor) commented May 27, 2019

thank you all for sharing your configs with me!
It works now, with:
tool_name_boost: 20

@jennaj (Member) commented May 31, 2019

> tool_name_boost: 20

I wonder if Main would benefit from that much higher boost, specifically. Searches are still a bit unpredictable and the results too limited, imho. Martin is probably on that already... this is not new, and we've tried a few variations, but it could still use some tuning.

It has to be frustrating to search for a tool and not find it -- as the stats he posted above back up.

@wm75 (Contributor) commented Jan 10, 2020

a question: is there really no way to express an AND between search terms?
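For context, AND semantics would require every term to match somewhere, instead of OR-ing terms as the search currently does. A hypothetical post-filter sketch (field names are illustrative):

```python
def and_filter(query: str, tools: list) -> list:
    # Keep only tools where every whitespace-separated query term occurs
    # in the searchable text (here just name + description).
    terms = query.lower().split()

    def text(tool: dict) -> str:
        return (tool.get("name", "") + " " + tool.get("description", "")).lower()

    return [t for t in tools if all(term in text(t) for term in terms)]
```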

@martenson (Member, Author) commented Jan 10, 2020

@wm75 not at the moment, can you please provide examples of searches that don't behave as you'd expect?

@hexylena (Member) commented Jan 21, 2020

Odd result on EU:

Multiqc appears in two sections

[screenshot]

Searching for it yields only one:
[screenshot]

Contrast with fastqc which appears in two and searches yield two (of the same version)

[screenshot]

@martenson (Member, Author) commented Aug 4, 2020

xref: #10030

@martenson martenson added this to TODO in Tool Search Aug 4, 2020

@hexylena (Member) commented Oct 29, 2020

"UCSC main" is unfindable on EU: https://usegalaxy.eu/api/tools?q=ucsc+main doesn't include ucsc_table_direct1, but it does include 150 other things. @bgruening

It does appear on .org, but it is not nearly the top hit for a search on the exact tool title.

@martenson (Member, Author) commented Oct 29, 2020

on EU, searching ucsc gives 52 results and "Main" is the last one 😭

@hexylena (Member) commented Oct 29, 2020

Possibly! I just expected the tool_name boost to have the biggest effect. I would love to debug the internals sometime and see what score x boost is being returned for each of these results that are doing 'better' than the direct text match. Like, if those are returned first, clearly they say "ucsc main" dozens of times in their descriptions or something?

@mvdbeek (Member) commented Oct 29, 2020

is it possible that exact matches overflow in score? This is a search for ucsc:
[screenshot]

@hexylena (Member) commented Oct 29, 2020

@mvdbeek neat! How did you obtain that?

@hexylena (Member) commented Oct 29, 2020

Ahh ok, I wondered if it was a secret API I was missing.

@hexylena (Member) commented Oct 29, 2020

so I booted up a copy of the app against EU, because I always feel worried about reproducing locally with the very different toolboxes. This looks odd to me:

(Pdb) galaxy_app.toolbox_search.parser.parse('*' + 'ucsc main' + '*')
Or([Wildcard('name', '*ucsc'), Wildcard('old_id', '*ucsc'), Wildcard('description', '*ucsc'), Wildcard('section', '*ucsc'), Wildcard('help', '*ucsc'), Wildcard('labels', '*ucsc'), Wildcard('stub', '*ucsc'), Prefix('name', 'main'), Prefix('old_id', 'main'), Prefix('description', 'main'), Prefix('section', 'main'), Prefix('help', 'main'), Prefix('labels', 'main'), Prefix('stub', 'main')])

why does only ucsc keep the * prefix, while main loses its?
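If the intent is to wildcard every token, each term would need to be wrapped individually before parsing. A hypothetical helper illustrating the difference from '*' + query + '*', which only decorates the outer edges of the whole string:

```python
def wildcard_each(query: str) -> str:
    # Wrap every whitespace-separated term in its own wildcards, so both
    # "ucsc" and "main" get leading and trailing '*' treatment.
    return " ".join(f"*{term}*" for term in query.split())
```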

(Pdb) for idx, hit in enumerate(galaxy_app.toolbox_search.searcher.search(galaxy_app.toolbox_search.parser.parse('*ucsc main*'), limit=400)): print((idx, hit, hit.score) if 'ucsc' in hit['id'] else None)
...
(296, <Hit {'id': 'ucsc_table_direct1'}>, 0.4618992716030244)
(297, <Hit {'id': 'ucsc_table_direct_archaea1'}>, 0.22288633588616671)
None

or without *

(Pdb) for idx, hit in enumerate(galaxy_app.toolbox_search.searcher.search(galaxy_app.toolbox_search.parser.parse('ucsc main'), limit=400)): print((idx, hit, hit.score) if 'ucsc' in hit['id'] else None)
...
None
(103, <Hit {'id': 'ucsc_table_direct1'}>, 0.4371187999893639)
None
None
None
(107, <Hit {'id': 'ucsc_table_direct_archaea1'}>, 0.1950255439003959)

trying out the individual fields of a search, it seems like description is a negative in this case:

(Pdb) for hit in galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc* *main*'), limit=40): print(hit, hit.score)
<Hit {'id': 'vcf_to_maf_customtrack1'}> 1.9662951360360124
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_rpstblastn_wrapper/2.10.1+galaxy0'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_rpsblast_wrapper/2.10.1+galaxy0'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: shuffleseq87/5.0.0.1'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: notseq61/5.0.0'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/peterjc/tmhmm_and_signalp/tmhmm2/0.0.16'}> 0.8341075747686874
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 0.4456963370285034
<Hit {'id': 'ucsc_table_direct1'}> 0.44038679685244836
<Hit {'id': 'ucsc_table_direct_archaea1'}> 0.20059770229755006
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/seurat_export_cellbrowser/seurat_export_cellbrowser/3.1.1+galaxy0'}> 0.1750118444883364
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 0.13424605571325632
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 0.11250224762822575
<Hit {'id': 'bwtool-lift'}> 0.05599209141540291
tool id                  name / description
vcf_to_maf_customtrack1  VCF to MAF Custom Track for display at UCSC
ucsc_table_direct1       UCSC Main table browser

feels very odd that vcf scores higher.

@hexylena (Member) commented Oct 29, 2020

Some more debugging

(Pdb) print(galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc main*'), limit=40, terms=True).termdocs)
{('name', b'ucsc'): array('I', [101, 1460, 2546]), ('description', b'ucsc'): array('I', [967, 1215, 1559, 2255, 2427]), ('description', b'maintaining'): array('I', [2122]), ('name', b'main'): array('I', [2546])}

So that's matching maintaining (hmm, I get why, but surely that should score lower than an exact word-boundary match?)

and doc 2546, which hits both main + ucsc, is indeed our tool:

(Pdb) print(list(galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc main*'), limit=40, terms=True))[1].docnum)
2546

aha (ish)

(Pdb) for hit in galaxy_app.toolbox_search.searcher.search(MultifieldParser(['name', 'old_id', 'description'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.9)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'vcf_to_maf_customtrack1'}> 1.7205082440315107 967 [('description', b'ucsc')]
<Hit {'id': 'ucsc_table_direct1'}> 0.44322627964299954 2546 [('name', b'main'), ('name', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 0.38998429489994046 1215 [('description', b'ucsc')]
<Hit {'id': 'ucsc_table_direct_archaea1'}> 0.1755229895103563 101 [('name', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: shuffleseq87/5.0.0.1'}> 0.1736500008575602 2122 [('description', b'maintaining')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/seurat_export_cellbrowser/seurat_export_cellbrowser/3.1.1+galaxy0'}> 0.15313536392729438 1559 [('description', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 0.11746529874909928 2427 [('description', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 0.09843946667469754 1460 [('name', b'ucsc')]
<Hit {'id': 'bwtool-lift'}> 0.048993079988477545 2255 [('description', b'ucsc')]

Changing the orgroup from 0.1 to 0.9 doesn't produce a big difference. Oddly, I've specified old_id in the MultifieldParser, but there are no ID matches? I'd expect

<Hit {'id': 'ucsc_table_direct1'}> 0.44322627964299954 2546 [('name', b'main'), ('name', b'ucsc'), ('old_id', b'ucsc')]

but old_id isn't anywhere there? It's when help is included that the results become garbage:

<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/xpath/xpath/1.0.0'}> 5.7824381765403645 1006 [('help', b'maintainers')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/jjohnson/rsem/rsem_prepare_reference/1.1.17'}> 5.74646047844554 676 [('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_get_communitytype/mothur_get_communitytype/1.39.5.0'}> 5.6337872716676785 1183 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 5.628083628215019 2427 [('description', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_lefse/mothur_lefse/1.39.5.0'}> 5.6050434590571285 1771 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/samtools_merge/samtools_merge/1.9'}> 5.4913612182626235 947 [('help', b'maintains')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_classify_rf/mothur_classify_rf/1.36.1.0'}> 5.4325805833938325 2391 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_pcr_seqs/mothur_pcr_seqs/1.39.5.0'}> 5.3072620955901515 121 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_get_mimarkspackage/mothur_get_mimarkspackage/1.39.5.0'}> 5.187549416742254 1622 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_merge_files/mothur_merge_files/1.39.5.0'}> 5.187549416742254 1767 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_primer_design/mothur_primer_design/1.39.5.0'}> 5.18227372171205 286 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/openbabel/ctb_subsearch/0.1'}> 5.105499296205204 1339 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_fastq_info/mothur_fastq_info/1.39.5.0'}> 5.105499296205204 1414 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_make_lookup/mothur_make_lookup/1.39.5.0'}> 5.0794508304082395 1662 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_cluster_classic/mothur_cluster_classic/1.39.5.0'}> 5.0794508304082395 1710 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_make_fastq/mothur_make_fastq/1.39.5.0'}> 5.0794508304082395 1723 [('help', b'main_page')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_chimera_vsearch/mothur_chimera_vsearch/1.39.5.1'}> 5.039159295530664 161 [('help', b'main_page')]

so they're all matching on the term main, even though EU's boosts should preclude these getting ANY points:

(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['help']._field_B
{'help': 1.0}
(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['name']._field_B
{'name': 40.0}
(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['description']._field_B
{'description': 40.0}
(Pdb) galaxy_app.toolbox_search.searcher.weighting.weightings['name']._field_B
{'name': 40.0}

So constructing my own weightings

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(1.0)), help=BM25F(name_B=float(1.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 20.208461736081546 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
...
<Hit {'id': 'vcf_to_maf_customtrack1'}> 9.356772484789413 967 [('description', b'ucsc'), ('help', b'ucsc')]
...
<Hit {'id': 'ucsc_table_direct1'}> 8.288424974617481 2546 [('name', b'main'), ('name', b'ucsc')]

vs

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(2.0)), help=BM25F(name_B=float(1.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 20.208461736081546 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
...
<Hit {'id': 'vcf_to_maf_customtrack1'}> 9.356772484789413 967 [('description', b'ucsc'), ('help', b'ucsc')]
....
<Hit {'id': 'ucsc_table_direct1'}> 5.63599403557272 2546 [('name', b'main'), ('name', b'ucsc')]

so a name boost of 2 is worse than a name boost of 1? ucsc_table_direct1 goes from 8 to 5? Swapping the weights to name=1, help=2:

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(1.0)), help=BM25F(name_B=float(2.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 20.208461736081546 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'wig_to_bigWig'}> 12.279176289129241 1072 [('help', b'_ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/seurat_export_cellbrowser/seurat_export_cellbrowser/3.1.1+galaxy0'}> 11.821254960210167 1559 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'maintained')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 10.435593913196673 1460 [('name', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/ebi_metagenomics_run_downloader/ebi_metagenomics_run_downloader/0.1.0'}> 10.190349240221241 2105 [('help', b'maintains')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_flankbed/2.29.2'}> 9.717608110624447 1931 [('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'vcf_to_maf_customtrack1'}> 9.356772484789413 967 [('description', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_links/2.29.2'}> 9.255265497863771 2427 [('description', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/replace_column_by_key_value_file/replace_column_with_key_value_file/0.1'}> 8.858243038340541 2046 [('help', b'ucsc')]
<Hit {'id': 'ucsc_table_direct1'}> 8.288424974617481 2546 [('name', b'main'), ('name', b'ucsc')]

like, are boosts inverse? Fixing description to 40, name=1 returns ucsc_table_direct1 with the same score, but vcf_to_maf_customtrack1 is finally gone?

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(1.0)), description=BM25F(description_B=float(40.0)))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/ucsc_custom_track/build_ucsc_custom_track_1/1.0.0'}> 14.699808940243486 1215 [('description', b'ucsc'), ('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'wig_to_bigWig'}> 12.279176289129241 1072 [('help', b'_ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/ucsc_cell_browser/ucsc_cell_browser/0.7.10+galaxy0'}> 10.435593913196673 1460 [('name', b'ucsc'), ('help', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/ebi_metagenomics_run_downloader/ebi_metagenomics_run_downloader/0.1.0'}> 10.190349240221241 2105 [('help', b'maintains')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_flankbed/2.29.2'}> 9.717608110624447 1931 [('help', b'ucsc'), ('help', b'main')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/replace_column_by_key_value_file/replace_column_with_key_value_file/0.1'}> 8.858243038340541 2046 [('help', b'ucsc')]
<Hit {'id': 'ucsc_table_direct1'}> 8.288424974617481 2546 [('name', b'main'), ('name', b'ucsc')]

Got ucsc main at the top for the first time:

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), name=BM25F(name_B=float(0.1)), description=BM25F(description_B=float(0.1)))).search(MultifieldParser(['name', 'old_id', 'description', 'section'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.99)).parse('*ucsc main*'), limit=40, terms=True): print(hit, hit.score, hit.docnum, hit.matched_terms())
<Hit {'id': 'ucsc_table_direct1'}> 12.409585144761694 2546 [('name', b'main'), ('name', b'ucsc')]
<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/devteam/emboss_5/EMBOSS: shuffleseq87/5.0.0.1'}> 6.3525702972838705 2122 [('description', b'maintaining')]
<Hit {'id': 'vcf_to_maf_customtrack1'}> 6.111259177622381 967 [('description', b'ucsc')]

With... both fields boosted to 0.1. This seems like black magic?

@martenson (Member, Author) commented Oct 29, 2020

Boosts shouldn't be inverse: https://whoosh.readthedocs.io/en/latest/schema.html?highlight=boost#field-boosts (I am sorry I do not have time atm to dive into this)
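For reference, in BM25F (the scoring model Whoosh uses) the field boost scales term frequency before saturation, so a larger boost should only ever increase that field's contribution. A simplified sketch of the formula, not Whoosh's actual code; parameter names follow the usual BM25F write-ups:

```python
def bm25f_term_weight(tf, boost, field_len, avg_len, B=0.75, K1=1.2, idf=1.0):
    # Length-normalized, boosted term frequency: the boost multiplies tf,
    # and B controls how strongly field length is penalized.
    norm_tf = boost * tf / ((1.0 - B) + B * field_len / avg_len)
    # Saturation: monotonically increasing in norm_tf, hence in the boost,
    # so a higher boost can never lower the score.
    return idf * norm_tf / (K1 + norm_tf)
```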

@hexylena (Member) commented Oct 29, 2020

My thought too after reading the doc!! But it definitely seems to be behaving like it is? The only time I can get ucsc_table_direct1 to have a high score (10+) is when I do name=0.1, desc=0.1, rest=1.

@mvdbeek (Member) commented Oct 29, 2020

I am circling around a bug in whoosh's MultiWeighting class, which alters the scores in a nonsensical way. Haven't finished this though.

@hexylena (Member) commented Oct 29, 2020

Compare the results for 'snpeff eff':

0.1 name/desc → 25

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=0.1), section=BM25F(section_B=1.0), description=BM25F(description_B=0.1), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*snpeff eff*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff_sars_cov_2/snpeff_sars_cov_2/4.5covid19'}>, 35.753471381397226, 448, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy1'}>, 33.047009486777924, 832, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects'), ('help', b'eff')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/jjohnson/snpeff_to_peptides/snpeff_to_peptides/0.0.1'}>, 25.2224812642123, 1511, [('help', b'_snpeff'), ('help', b'snpeff'), ('name', b'snpeff'), ('help', b'effects'), ('help', b'eff')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_databases/4.3+T.galaxy2'}>, 25.031613016029844, 1223, [('help', b'snpeff'), ('name', b'snpeff'), ('help', b'eff')])

vs

10.0 name/desc → 19

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=10.0), section=BM25F(section_B=1.0), description=BM25F(description_B=10.0), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*snpeff eff*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff_sars_cov_2/snpeff_sars_cov_2/4.5covid19'}>, 22.617891323807093, 448, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects')])
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy1'}>, 19.91142942918779, 832, [('help', b'snpeff'), ('name', b'eff'), ('name', b'snpeff'), ('help', b'effect'), ('help', b'effects'), ('help', b'eff')])

edit: sorry, had an old help boost.

Or the query "select lines that match an expression"

0.1/0.1 → Grep1 = 40.0, 1st place

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=0.1), section=BM25F(section_B=1.0), description=BM25F(description_B=0.1), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*select lines that match an expression*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'Grep1'}>, 40.085088543825, 621, [('help', b'match'), ('help', b'lines'), ('help', b'expression'), ('description', b'expression'), ('description', b'lines'), ('description', b'match'), ('name', b'select'), ('help', b'select')])

40/40 → Grep1 = 16, 2nd place

(Pdb) for hit in galaxy_app.toolbox_search.index.searcher(weighting=MultiWeighting(BM25F(), old_id=BM25F(old_id_B=1.0), name=BM25F(name_B=40.0), section=BM25F(section_B=1.0), description=BM25F(description_B=40.0), labels=BM25F(labels_B=1.0), stub=BM25F(stub_B=1.0), help=BM25F(help_B=1.0))).search(MultifieldParser(['name', 'old_id', 'description', 'section', 'help'], schema= galaxy_app.toolbox_search.schema, group= OrGroup.factory(0.90)).parse('*select lines that match an expression*'.lower()), limit=40, terms=True): print((hit, hit.score, hit.docnum, hit.matched_terms()))
(<Hit {'id': 'toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/1.1.1'}>, 18.069423636987338, 1181, [('help', b'match'), ('help', b'lines'), ('help', b'expressions'), ('help', b'expression'), ('help', b'select')])
(<Hit {'id': 'Grep1'}>, 16.10481479336669, 621, [('help', b'match'), ('help', b'lines'), ('help', b'expression'), ('description', b'expression'), ('description', b'lines'), ('description', b'match'), ('name', b'select'), ('help', b'select')])

@hexylena (Member) commented Nov 5, 2020

@mvdbeek did you have any more information about what that issue was with whoosh?

@hexylena (Member) commented Nov 6, 2020

So we deployed the new boosts on EU, to see how they work. I... think they're a huge improvement? I was discussing with @shiltemann and her test query was 'group', expecting the full match of Grouping1 to be found. We need some way to rank by "this term or these terms constitute the entire name field", but I'm not sure how we'd accomplish that given that we currently break names into individual words :/
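One way to approximate that without changing the index is to re-rank the engine's results afterwards. A hypothetical post-processing sketch; hits are assumed to be dicts with name and score keys:

```python
def rerank_exact_name(query: str, hits: list) -> list:
    # Sort exact whole-name matches first, then name-prefix matches,
    # then everything else by descending engine score.
    q = query.strip().lower()

    def key(hit: dict):
        name = hit["name"].strip().lower()
        # False sorts before True, so exact matches win, then prefixes.
        return (name != q, not name.startswith(q), -hit["score"])

    return sorted(hits, key=key)
```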

@hexylena (Member) commented Nov 6, 2020

@bgruening reports 'tail-to-head', which doesn't return useful things (but I don't know how it behaved before), and the same for tail.

@wm75 reports:

> only exception I found so far is mimodd vcf, which only returns general vcf stuff as top hits. Strangely, reversing the words to vcf mimodd does much better.

@simonbray (Member) commented Apr 23, 2021

Not sure if this is the right place to report issues with the search, but trying to find Filter failed datasets from a collection with the search is quite tough. For tools which contain a relatively unique word, things seem to be better than they used to be 👍

@jrr-cpt commented Jul 10, 2021

Putting this here after some interactions at CoFest. Search functions have definitely improved with updates, but there are still cases where the results could be better. I think it is intentional for the results to include potentially less relevant hits, to assist with tool discovery and to help with spelling errors/choices. But I think if the weighting more obviously biased the tool name over the description, this would help a lot with generic search terms. Our users at the CPT Galaxy would prefer a stricter (smaller) search result, and we don't even have as many tools as the larger public Galaxies! Perhaps this is already implemented, but it isn't terribly transparent how the tool search works, and it is hard to pick out logical patterns in the order of the returned list by eye (tool name relevance, alphabetical, popularity/use?).

For example, in our CPT Galaxy, where we're running 20.05 (I realize this does not have all the latest fixes discussed in this issue; @hexylena, ngram searching is enabled), when I searched fasta looking for a tool called Remove FASTA Sequences from .gff3 File, it was 19th in the list, and various tools WITHOUT that string in the name came before it.
[screenshot]

At usegalaxy.org, when I search align, I get this:
[screenshot]

At usegalaxy.eu, when I search for genome, the list includes the assembly tools like Bowtie2 and Spades pretty far down in the results.
[screenshot]

That is still the case when I search for genome assembly.
[screenshot]

Maybe all these issues can be ameliorated with better tool metadata. New users, and users doing new analyses, will search for tools whose names they don't necessarily know. Tool organization is pretty good, and while it is great to have many tool options, having very many tools also makes it hard to discover new ones without consistent help from the search function. Perhaps a 'close match' and 'related match' scenario could vastly improve the overall user experience and make it easier to discover just the right tools?
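That close/related split could even be approximated client-side from the scores the search already returns. A sketch with a hypothetical threshold ratio:

```python
def tier_hits(hits: list, close_ratio: float = 0.5):
    # Split results into "close" (score within close_ratio of the top score)
    # and "related" (everything else), preserving the original order.
    if not hits:
        return [], []
    top = max(h["score"] for h in hits)
    close = [h for h in hits if h["score"] >= close_ratio * top]
    related = [h for h in hits if h["score"] < close_ratio * top]
    return close, related
```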
