ES equivalent of lucene SpanNearQuery.Builder.addGap(int) #27862

Closed
Chandan83 opened this issue Dec 18, 2017 · 9 comments

@Chandan83
Contributor

commented Dec 18, 2017

Lucene's SpanNearQuery.Builder.addGap(int) serves an important search need. Especially with slop=0, it acts as a gap of the given width between two clauses, just like the gaps created by the removal of stop words in a phrase search. ES does not provide a similar Java API. In fact, even Lucene's way of introducing a gap in a phrase search, PhraseQuery.Builder.add(Term term, int position), used after filtering stop words, is missing in ES. Have I overlooked an existing ES way of introducing such behaviour?
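
To make the behaviour being asked for concrete, here is a toy model in Python of an ordered, slop=0 near match in which a clause is either a term or a gap of fixed width. It is only a sketch of the semantics described above, not Lucene's implementation; the names `matches_at` and `find_matches` and the use of `None` for a removed stop-word position are assumptions made for illustration.

```python
from typing import List, Optional, Union

# A clause is either a term (str) or a gap width (int). This mirrors the idea
# of adding clauses and gaps to a near query; it is not Lucene's API.
Clause = Union[str, int]


def matches_at(tokens: List[Optional[str]], start: int, clauses: List[Clause]) -> bool:
    """Return True if the clauses match back to back from `start` (ordered, slop 0)."""
    pos = start
    for clause in clauses:
        if isinstance(clause, int):
            # A gap skips `clause` positions without looking at their content.
            pos += clause
        else:
            if pos >= len(tokens) or tokens[pos] != clause:
                return False
            pos += 1
    return True


def find_matches(tokens: List[Optional[str]], clauses: List[Clause]) -> List[int]:
    """Return every start position at which the clause sequence matches."""
    return [i for i in range(len(tokens)) if matches_at(tokens, i, clauses)]


# "king of england" indexed with the stop word "of" removed:
# position 1 holds no token, which is modelled here as None.
indexed = ["king", None, "england"]

print(find_matches(indexed, ["king", 1, "england"]))  # [0]
print(find_matches(indexed, ["king", "england"]))     # []
```

Without the gap, the two terms are not adjacent in the index, so a slop=0 query cannot match; the gap clause accounts for the position the stop word used to occupy.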

@jimczi

Member

commented Dec 18, 2017

For a simple phrase query, the gap introduced by stop words is handled internally by counting the filtered positions (the number of tokens filtered out before the next one), so this is not exposed.
Span queries are not analyzed in ES, so we don't apply any gaps to span_near queries, but this could be added to the DSL in some way. I'll mark this issue with the discuss label to gather more opinions and see whether we can get the feature implemented.

@Chandan83

Contributor Author

commented Dec 18, 2017

Neither ES nor Lucene analyzes span terms, and that is why we need to pre-analyze user input and introduce gaps in place of stop words. This is exactly what is done internally for phrase search. Phrase search is limited to simple terms and it can't be used within a SpanNearQuery. For a phrase containing multi-term words, the only way is to use span queries with gaps.
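
The client-side pre-analysis described here can be sketched roughly as follows. This is a minimal illustration, assuming whitespace tokenization, lowercasing, and a made-up stop list (`STOP_WORDS`); a real client would run the field's actual analyzer instead. The function name `to_clauses` is hypothetical, and dropping leading and trailing stop words is a simplification chosen for the example.

```python
from typing import List, Set, Union

# Hypothetical stop list, for illustration only.
STOP_WORDS: Set[str] = {"a", "an", "the", "of", "and"}


def to_clauses(phrase: str, stop_words: Set[str] = STOP_WORDS) -> List[Union[str, int]]:
    """Turn a phrase into terms (str) and gap widths (int).

    Runs of stop words between two terms become one gap whose width is the
    number of removed words. Leading and trailing stop words are dropped.
    """
    clauses: List[Union[str, int]] = []
    pending_gap = 0
    for token in phrase.lower().split():
        if token in stop_words:
            # Only count stop words that follow an already emitted term.
            if clauses:
                pending_gap += 1
            continue
        if pending_gap:
            clauses.append(pending_gap)
            pending_gap = 0
        clauses.append(token)
    return clauses


print(to_clauses("King of the North"))  # ['king', 2, 'north']
```

Each integer in the result marks where a gap clause would be needed in the span query.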

@jimczi

Member

commented Dec 18, 2017

> Neither ES nor Lucene analyzes span terms, and that is why we need to pre-analyze user input and introduce gaps in place of stop words

This is correct, gaps are needed to build valid span queries from a client.

> Phrase search is limited to simple terms and it can't be used within a SpanNearQuery. For a phrase containing multi-term words, the only way is to use span queries with gaps

Multi-term words like prefix or wildcard? Is this why you need to use span queries instead of a match_phrase?

@Chandan83

Contributor Author

commented Dec 18, 2017

Yes. We allow users to search for a complex phrase (i.e., a phrase containing regex, wildcard, etc. besides simple terms). For such a phrase, we require SpanNearQuery along with a way to introduce gaps for removed stop words.

@Chandan83

Contributor Author

commented Dec 18, 2017

For a POC hack, in place of a gap I am using a SpanTermQuery with the term text "A_GAP"; a modified SpanNearQueryParser then checks for such a SpanTermQuery and replaces it with Lucene's SpanNearQuery.SpanGapQuery, which had to be made public. The latter could have been avoided, but it would have required more changes to the parser. An ideal solution would be to introduce a SpanGapQuery, like SpanQuery, to ES. If this seems alright to the maintainers, I can submit a pull request.
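
The placeholder trick can be illustrated with a small Python sketch that operates on already parsed clause dictionaries. It is not the modified SpanNearQueryParser mentioned above; `replace_gap_markers` and the `("gap", width)` / `("clause", ...)` tuples are hypothetical names used only to show the substitution step, and merging adjacent placeholders into one wider gap is an assumption of this example.

```python
from typing import Any, Dict, List, Tuple

GAP_MARKER = "A_GAP"  # placeholder term text taken from the comment above


def replace_gap_markers(clauses: List[Dict[str, Any]]) -> List[Tuple[str, Any]]:
    """Replace placeholder span_term clauses with ("gap", width) entries.

    Consecutive placeholders are merged into a single, wider gap. Every other
    clause is passed through unchanged as ("clause", original_dict).
    """
    out: List[Tuple[str, Any]] = []
    for clause in clauses:
        term = clause.get("span_term", {})
        if GAP_MARKER in term.values():
            if out and out[-1][0] == "gap":
                out[-1] = ("gap", out[-1][1] + 1)
            else:
                out.append(("gap", 1))
        else:
            out.append(("clause", clause))
    return out


parsed = [
    {"span_term": {"body": "king"}},
    {"span_term": {"body": "A_GAP"}},
    {"span_term": {"body": "A_GAP"}},
    {"span_term": {"body": "north"}},
]
print(replace_gap_markers(parsed))
```

A drawback of any such placeholder is that a document term equal to the marker text can no longer be searched for, which is one reason a dedicated gap clause is cleaner.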

@jimczi

Member

commented Dec 18, 2017

> An ideal solution will be to introduce SpanGapQuery like SpanQuery to ES. If this seems alright to maintainers, I can submit a pull request.

Thanks for offering your help, I'll be happy to review ;). It should only work within a span_near query, though, since that's the only query that accepts this option. Do you have an idea of the format for this new query? Something like:

"span_near": {
  "clauses": [
    { "span_term": { "field": "value1" } },
    { "span_gap": 2 },
    { "span_term": { "field": "value3" } }
  ]
}

?
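
For illustration, a client could assemble a request body in the format proposed above along these lines. This is a sketch of the proposal as written in this comment only; the syntax that was eventually submitted differs (see the last comment in this thread). The helper name `build_span_near` is hypothetical, while `clauses`, `slop`, and `in_order` are the existing span_near parameters.

```python
import json
from typing import Any, Dict, List, Union


def build_span_near(field: str, clauses: List[Union[str, int]],
                    slop: int = 0, in_order: bool = True) -> Dict[str, Any]:
    """Build a span_near body where ints become the proposed span_gap clauses."""
    body: List[Dict[str, Any]] = []
    for clause in clauses:
        if isinstance(clause, int):
            # Proposed (not final) gap syntax from the comment above.
            body.append({"span_gap": clause})
        else:
            body.append({"span_term": {field: clause}})
    return {"span_near": {"clauses": body, "slop": slop, "in_order": in_order}}


query = build_span_near("field", ["value1", 2, "value3"])
print(json.dumps(query, indent=2))
```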

@Chandan83

Contributor Author

commented Dec 19, 2017

Yes, the format will be exactly like that (except the typo with extra : and " after span_gap). You are also correct that it should only work with span_near. Even Lucene offers this option for SpanNearQuery only; in fact, the SpanGapQuery class is private to SpanNearQuery. I will submit a change right after I am done making the application changes required by my day job. I assume changes are always expected against the master branch.

@jimczi

Member

commented Dec 20, 2017

> I assume the changes are always expected over master branch.

Yes please.

@jimczi jimczi removed the discuss label Dec 26, 2017

Chandan83 added a commit to Chandan83/elasticsearch that referenced this issue Feb 12, 2018

@Chandan83

Contributor Author

commented Feb 12, 2018

@jimczi I finally got around to contributing the span_near clause. The syntax, given below, is slightly different from what was previously proposed. I have created pull request #28636 without writing new unit tests, as I wanted some feedback on the approach first.

{"span_gap" : {"field" : 2}}
