Partial phrase matching suggestion #34960

Closed
mohmad-null opened this issue Oct 29, 2018 · 7 comments
Labels
feedback_needed :Search/Search Search-related issues that do not fall into other categories

Comments

@mohmad-null

It would be nice if you could do partial phrase matching. I don't mean like the phrase_prefix.
Consider the below query:

south europe trees

You run this as a match_phrase against a text field with this value:
South Europe Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua

This will find nothing, as the entire sequence of terms does not appear in the text field.
What I would like is for there to be a "partial" flag which when enabled would allow this to return the partial result as "South Europe" does appear in the text field. The score would be commensurately lower of course.

@mohmad-null
Author

To further clarify, the following searches would also partially phrase match that document:
road south europe
country south europe national
etc etc.
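
To make the requested behaviour concrete, here is a minimal sketch in plain Python. It is not Elasticsearch code: `tokenize` and `partial_phrase_runs` are hypothetical helpers invented for illustration, and the regex tokenizer is only a crude stand-in for a real analyzer. It finds, greedily from left to right, the runs of two or more consecutive query terms that also appear consecutively in the field value.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumerics (crude stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def partial_phrase_runs(query, field_value, min_len=2):
    """Greedily collect runs of consecutive query terms that occur consecutively in the field."""
    q, d = tokenize(query), tokenize(field_value)
    runs = []
    i = 0
    while i < len(q):
        best = 0
        # Longest run starting at query position i, tried against every field position.
        for j in range(len(d)):
            k = 0
            while i + k < len(q) and j + k < len(d) and q[i + k] == d[j + k]:
                k += 1
            best = max(best, k)
        if best >= min_len:
            runs.append(" ".join(q[i:i + best]))
            i += best
        else:
            i += 1
    return runs


doc = "South Europe Lorem ipsum dolor sit amet"
for query in ("south europe trees", "road south europe", "country south europe national"):
    print(query, "->", partial_phrase_runs(query, doc))
```

Under these assumptions, each of the three queries shares the run "south europe" with the document, which is the part a "partial" phrase match would be expected to reward.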

@DaveCTurner added the :Search/Search label Oct 29, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@cbuescher
Member

@mohmad-null sorry, but I don't quite follow your ask here. Phrase queries are intended to match exact phrases and even have the "slop" factor to incorporate a certain degree of fuzzy matching. What you describe should already be covered by general "match" queries, possibly in conjunction with "span" queries if you need the terms to be in a certain order. Could you elaborate a bit more on why those don't satisfy your needs?
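
For reference, one common way to combine the two query types mentioned above is a `bool` query with a `match` clause that does the matching and a `match_phrase` clause with `slop` that only adds to the score. The sketch below builds such a request body as a Python dict; the helper name `match_with_phrase_boost`, the field name `body`, and the default `slop` and `boost` values are assumptions chosen for illustration.

```python
import json


def match_with_phrase_boost(field, text, slop=2, boost=2.0):
    """Build a search body: bag-of-words match, plus an optional sloppy-phrase boost."""
    return {
        "query": {
            "bool": {
                # Decides which documents match (any term, by default).
                "must": {"match": {field: {"query": text}}},
                # Only raises the score when ALL terms occur within `slop` positions.
                "should": {
                    "match_phrase": {
                        field: {"query": text, "slop": slop, "boost": boost}
                    }
                },
            }
        }
    }


print(json.dumps(match_with_phrase_boost("body", "south europe trees"), indent=2))
```

Note that the `match_phrase` clause still requires every query term to be present, so for "south europe trees" against a document without "trees" it contributes nothing and the result is scored by the `match` clause alone.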

@mohmad-null
Author

My suggestion is based on the fact that in real-world use, a search query can have extraneous words beyond just the phrase itself. For example, in road transport south europe, there are two phrases: "road transport" and "south europe". In south europe trees, it's "South Europe" as a phrase, and the leftover "trees".
What you want in both cases is higher-ranked results when the words are in close proximity. Yes, the user can use quotation marks to do it explicitly ("south europe" trees), but 95% of users don't do that, and instead will rely on the search engine to do it for them.

As far as I can tell from reading the docs, there's no way to do this with ES, but I'm happy to be proven wrong.

Match query - As far as I can see from reading around it, there is no proximity/distance component that gets factored into the score; match is simply a "bag of words" search. That's what the docs explicitly say anyway - https://www.elastic.co/guide/en/elasticsearch/guide/current/proximity-matching.html

Span query - I did look into it, but the docs suggest it's for a very niche thing where you have lots of knowledge of the search terms and documents, and the structure therein. "These are typically used to implement very specific queries on legal documents or patents." - I'm seeking to use this on a general search engine.

As such, a modification of phrase query seemed like the logical way to go.

@mohmad-null
Author

@cbuescher - And while it may indeed be possible to do this with span queries, according to the docs (https://www.elastic.co/guide/en/elasticsearch/guide/current/phrase-matching.html) - the phrase query is only a high-level front end to the low-level span queries behind the scenes anyway.

My suggestion is to simply add to the high-level thing so folks don't need to figure out the low-level span query thing (which doesn't seem to be well documented beyond the basic API - there doesn't seem to be any overarching guidance or explanation of how/why to use them or what problems they solve beyond repeated references to "specialized fields like patent searches").

@romseygeek
Contributor

We do plan to add a proximity boost option to the match query, although I don't think there's an open issue for it. The basic idea is to do an interval query over all terms in the match, which will score higher the closer the terms are together as a whole, but I can see an argument for adding interval queries over each consecutive pair of terms as well. We'd want to be careful to avoid expanding the query too much though.
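
As a rough sketch of the "consecutive pair" idea, the code below builds a request body by hand rather than through a built-in match option, which this thread does not show to exist. It assumes an Elasticsearch version that provides the `intervals` query (it was not available when this thread was written); the helper name `pairwise_proximity_query`, the field name `body`, and the whitespace tokenization are illustrative assumptions.

```python
import json


def pairwise_proximity_query(field, text, max_gaps=0):
    """Match on any term, and boost documents containing consecutive term pairs in order."""
    terms = text.lower().split()
    # One optional intervals clause per adjacent pair: n terms -> n - 1 clauses.
    pairs = [f"{a} {b}" for a, b in zip(terms, terms[1:])]
    should = [
        {
            "intervals": {
                field: {
                    "match": {"query": pair, "max_gaps": max_gaps, "ordered": True}
                }
            }
        }
        for pair in pairs
    ]
    return {
        "query": {
            "bool": {
                "must": {"match": {field: {"query": text}}},
                "should": should,
            }
        }
    }


body = pairwise_proximity_query("body", "road transport south europe")
print(json.dumps(body, indent=2))
```

The number of extra clauses grows linearly with the number of query terms, which is the expansion cost mentioned above; a real implementation would likely cap it.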

@cbuescher
Member

there are two phrases: "road transport" and "south europe"

Cases like this can probably also be solved with n-gram shingles (e.g. 2- and 3-grams cover a lot of phrase-like expressions in English). For starters, e.g. https://www.elastic.co/blog/searching-with-shingles gives a rough idea what to do.
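
A minimal sketch of the shingle approach, under some assumptions: the field name `body`, the subfield name `shingles`, and the filter and analyzer names are made up for illustration, and the mapping is written in the typeless form used by 7.x and later (older versions nest it under a type name). The small `shingles` function only mimics what a shingle token filter emits, to show why "south europe trees" and the example document end up sharing a token.

```python
def shingles(tokens, min_size=2, max_size=3):
    """Emit word n-grams of min_size..max_size tokens, roughly like a shingle filter."""
    out = []
    for i in range(len(tokens)):
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


# Index settings: a shingle subfield next to the normally analyzed field.
index_body = {
    "settings": {"analysis": {
        "filter": {"shingle_2_3": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3,
            "output_unigrams": False,
        }},
        "analyzer": {"shingle_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "shingle_2_3"],
        }},
    }},
    "mappings": {"properties": {"body": {
        "type": "text",
        "fields": {"shingles": {"type": "text", "analyzer": "shingle_analyzer"}},
    }}},
}

# Query: match on single terms, and score higher when shingles also match.
search_body = {"query": {"bool": {
    "must": {"match": {"body": "south europe trees"}},
    "should": {"match": {"body.shingles": "south europe trees"}},
}}}

query_shingles = set(shingles("south europe trees".split()))
doc_shingles = set(shingles("south europe lorem ipsum dolor".split()))
print(query_shingles & doc_shingles)
```

Because the `should` clause is a plain `match` on the shingle subfield, a document containing only some of the query's shingles still receives a partial boost, which is close to the "partial phrase" behaviour requested at the top of the thread.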

6 participants