Partial results matching #4

Closed
cgreene opened this issue Oct 10, 2016 · 9 comments

@cgreene

cgreene commented Oct 10, 2016

We are considering using mygene.info to serve as a search backend for genes in the cognoma project front end (more discussion: cognoma/core-service#29 (comment) ). One use case that we have is an autocomplete style query. For this, we'd need partial queries to be supported. Is it possible to enable this with the current API either through the standard querystring or a specific string?

There is a bit more discussion of an ngram field in https://github.com/SuLab/mygene.info/issues/2

Thanks!

@dhimmel

dhimmel commented Oct 10, 2016

Copying over relevant comments from #2.

By @cgreene:

@newgene : Would an autocomplete style search be expected to work on this field (or another one)?

By @cgreene:

@newgene : What I'm really asking - is there an ngram tokenizer used for those fields? Trying to figure out if partial queries will return sensible matches. I searched for ngram_filter and didn't find anything in the source.

I poked around in this https://github.com/SuLab/mygene.info/blob/master/src/utils/es.py a bit, but I didn't find anything obvious right off hand and thought you might know.

By @dhimmel:

@cgreene I think you're asking about partial search terms. For example, does https://mygene.info/v3/query?q=alpha-1-B%20glycoprot return a superset of the results that https://mygene.info/v3/query?q=alpha-1-B%20glycoprotein returns? It appears not, but I suggest you open a new issue, since this issue is for searching by alternate names.

@newgene
Member

newgene commented Oct 10, 2016

@cgreene We currently do not apply an ngram filter, at least not explicitly, when indexing. The autocomplete feature we implemented in this widget works through a wildcard query (adding "*" at the end of the query term), which seems to work just fine.

If there are enough use cases, I'm also considering exposing a prefix query in our services, which is probably more efficient than a wildcard query.

@dhimmel

dhimmel commented Oct 10, 2016

@newgene can you give us an example that uses the wildcard (*)? I'm not getting any hits for https://mygene.info/v3/query?q=alpha-1-B%20glycoprot*.

@newgene
Member

newgene commented Oct 11, 2016

@dhimmel the wildcard query only works when you specify a field:

https://mygene.info/v3/query?q=name:alpha-1-B%20glycoprot*
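For anyone building these URLs programmatically, here is a minimal sketch of a field-specific wildcard query in Python. The helper name `build_wildcard_url` is hypothetical, and leaving ":" and "*" unescaped is a choice made only so the output mirrors the URL above; it is not a documented requirement of the service.

```python
from urllib.parse import quote

BASE_URL = "https://mygene.info/v3/query"


def build_wildcard_url(field, prefix):
    """Return a query URL matching `field` values that start with `prefix`.

    Hypothetical helper: the field name and prefix come from the caller.
    """
    query = f"{field}:{prefix}*"
    # Percent-encode the value of "q" (spaces become %20) but keep ":" and "*"
    # literal so the result mirrors the URL shown in the thread.
    return f"{BASE_URL}?q={quote(query, safe=':*')}"


print(build_wildcard_url("name", "alpha-1-B glycoprot"))
```

Called with `"name"` and `"alpha-1-B glycoprot"`, this reproduces the URL in the comment above.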

@newgene
Member

newgene commented Oct 20, 2016

For now, one alternative for you is to make your own customized query like this:

q=symbol:A1BG^10 OR name:A1BG OR alias:A1BG OR summary:A1BG^0.1
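A boosted query like the one above could be assembled from a list of fields and weights. The sketch below is only an illustration: `build_boosted_query` and `FIELDS` are hypothetical names, and it assumes a single search term without spaces or special characters, since how to escape those is exactly the question raised in this thread.

```python
def build_boosted_query(term, field_boosts):
    """Join one `field:term` clause per field with OR, adding `^boost` if given."""
    clauses = []
    for field, boost in field_boosts:
        clause = f"{field}:{term}"
        if boost is not None:
            # ":g" prints 10 as "10" and 0.1 as "0.1", with no trailing zeros.
            clause += f"^{boost:g}"
        clauses.append(clause)
    return " OR ".join(clauses)


# Fields and weights copied from the example query in the thread.
FIELDS = [("symbol", 10), ("name", None), ("alias", None), ("summary", 0.1)]

print(build_boosted_query("A1BG", FIELDS))
```

The resulting string still needs to be URL-encoded before it is placed in the `q` parameter.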

@newgene for the customized query, how should we encode queries with spaces or wildcards in them? For example, how would we search for alpha-1-B glycopro* in the symbol, name, or alias with custom weights? I'm struggling with how the URI encoding should be performed.

@dhimmel @cgreene just want to let you know that we are now working on an improvement to our query endpoint, so that you can make such queries (for autocompletion) more easily. It might look like this:

q=alpha-1-B glycoprot&suggest_from=symbol^10,alias,name,summary^0.1

So stay tuned, we should have this rolled out soon. Let us know if you have any other feedback.

@dhimmel

dhimmel commented Oct 21, 2016

So stay tuned, we should have this rolled out soon. Let us know if you have any other feedback.

@newgene, thanks for the great support.

Your suggested syntax looks nice. We will probably also restrict to human and entrez genes like:

q=alpha-1-B glycoprot&suggest_from=symbol^10,alias,name,summary^0.1&species=human&entrezonly=true

Confirming that the wildcard search is implied by specifying suggest_from so we no longer need *.

One more thing: I think we want to make sure we can encode the query term so it's a valid URL, so some guidance on how we should encode the URL would be appreciated. That is, in JavaScript, do we use encodeURIComponent or encodeURI, and on what portion of the URL?

@newgene
Member

newgene commented Oct 21, 2016

@dhimmel you should only need to encode the value passed to the "q" parameter. To encode in JavaScript, this might help: http://stackoverflow.com/questions/332872/encode-url-in-javascript
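As an illustration of encoding only parameter values and not the whole URL, here is a minimal Python sketch (in JavaScript, the analogous step would be calling encodeURIComponent on the value alone). The helper `build_query_url` is hypothetical; the `species` and `entrezonly` parameters are taken from the earlier comment in this thread.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE_URL = "https://mygene.info/v3/query"


def build_query_url(q, **extra_params):
    """Percent-encode each parameter value and append it to the base URL."""
    params = {"q": q, **extra_params}
    # quote_via=quote encodes spaces as %20 rather than "+"; the base URL
    # itself is left untouched, only keys and values are encoded.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = build_query_url("name:alpha-1-B glycoprot*", species="human", entrezonly="true")
print(url)

# Decoding the query string recovers the original term unchanged.
print(parse_qs(urlsplit(url).query)["q"][0])
```

Plain values such as `human` and `true` pass through unchanged, so in practice only the `q` value is altered by the encoding.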

@newgene
Member

newgene commented Jan 17, 2017

@cgreene @dhimmel, it took us a while, but we have now rolled out (thanks to @cyrus0824's hard work) a new "user queries" feature for MyGene.info, which is highly relevant to the use case in this issue.

Basically, we now allow users to define a customized query (aka "user query") to fit very specific use cases that the default query feature cannot satisfy perfectly. This is how it works:

  • First, all user queries will be stored/versioned in this repo. The "mygene" folder holds the user queries for mygene.info.

  • Each folder under "mygene" is a customized query, and the text file "query.txt" in it contains the query the user has defined. See this example.

    You will need to know the Elasticsearch query syntax to write a user query. Note that "{{q}}" will be replaced by the value passed in the "q" query parameter.

  • Users can submit a user query via pull request, or we can give some users the commit right to this repo.

  • For security reasons, we won't automatically deploy the master branch to production. We will double-check users' commits and make changes with them if needed, then merge the changes into a "production" branch for deployment.

  • Once it's deployed, users can pass a "userquery" via URL, e.g.:

    http://mygene.info/v3/query?q=a1bg&userquery=prefix

    To see the actual Elasticsearch query that was executed:

    http://mygene.info/v3/query?q=a1bg&userquery=prefix&rawquery=1

  • A couple of extra notes:

    • It's possible to pass additional variables from the URL to a user query template. For example, you can use "{{test}}" in the template and pass it in the URL as "uq_test". The "uq_" prefix is required for every variable except "q", to avoid conflicts with other possible parameters.

    • It's also possible to pass a customized filter by adding a "filter.txt" to the user query folder. A possible use case is a website that focuses only on a subset of genes (e.g. all kinases) and wants to implement something like an autocomplete input box. Users can then include all of the gene ids in the filter.
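To make the placeholder rules above concrete, here is a rough Python sketch of how "{{q}}" and "uq_"-prefixed URL parameters might be substituted into a template. This is not the actual MyGene.info implementation: `render_user_query`, the template text, and the `field` variable are all hypothetical, and the sketch does no escaping of the substituted values.

```python
import re


def render_user_query(template, url_params):
    """Fill {{name}} placeholders from "q" and from "uq_"-prefixed parameters."""
    variables = {}
    for key, value in url_params.items():
        if key == "q":
            variables["q"] = value
        elif key.startswith("uq_"):
            # "uq_test" in the URL fills "{{test}}" in the template.
            variables[key[len("uq_"):]] = value

    def substitute(match):
        # Placeholders with no matching variable are left as they are.
        return variables.get(match.group(1), match.group(0))

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


# Illustrative template only; not the contents of any real query.txt.
TEMPLATE = '{"query": {"prefix": {"{{field}}": "{{q}}"}}}'
PARAMS = {"q": "a1bg", "userquery": "prefix", "uq_field": "symbol"}

print(render_user_query(TEMPLATE, PARAMS))
```

Parameters without the "uq_" prefix, such as `userquery` here, are ignored by the substitution, which matches the rule that only "q" is exempt from the prefix.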

All right, let us know what you think. The example "prefix" user query was added pretty much for the specific use case you mentioned in this issue. Note that we boosted symbol matches as well. Feel free to make changes to fit what you want (we can add you two to this repo with write permission).

@cgreene
Author

cgreene commented Jan 17, 2017

Thanks @newgene! @bdolly was working yesterday to handle gene search for the cognoma web app. I'll tag him here so he gets notified.
