Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simple_query_string - operator not working with default_operator OR #4707

Closed
ksaritek opened this issue Jan 13, 2014 · 24 comments
Closed

simple_query_string - operator not working with default_operator OR #4707

ksaritek opened this issue Jan 13, 2014 · 24 comments
Labels
>bug :Search/Search Search-related issues that do not fall into other categories

Comments

@ksaritek
Copy link

I tried to negates a keyword at query but it already comes in result

simple query is
"query": {
"simple_query_string": {
"query": ""This repository" -removed",
"fields": [
"content",
"headline"
]
}
},

@imotov
Copy link
Contributor

imotov commented Jan 13, 2014

Could you provide a complete reproduction that would help us to reproduce the issue? See http://www.elasticsearch.org/help/ for an example.

@clintongormley clintongormley changed the title negates a single token (-) at simple_query_string seems not working properly simple_query_string - operator not working with default_operator OR Dec 24, 2014
@clintongormley
Copy link

Recreation:

PUT /t/t/1
{
  "content": "This repository has been removed"
}


GET /t/_search
{
  "query": {
    "simple_query_string": {
      "query": "+this -removed",
      "fields": [
        "content",
        "headline"
      ]
    }
  }
}

The negated term is ignored, because with default_operator: "OR", it is optional.

@dakrone
Copy link
Member

dakrone commented Dec 24, 2014

@clintongormley not sure how this is a bug though? This is like saying "contains the term this OR does not contain removed", which the document does (it contains the term "this")

@clintongormley
Copy link

I think that a user would expect foo bar -baz (with default operator OR) to match as if they had written: +(foo bar) -baz. Otherwise, a negated clause with default operator OR is meaningless.

@dakrone
Copy link
Member

dakrone commented Feb 23, 2015

@clintongormley I'm still not sure that this is actually a bug, the negated clause with the default OR operation is not meaningless (it has meaning, it just means" anything that does not contain this"). I think what is desired is setting default_operator to AND, which is a user-changeable setting?

@ppf2
Copy link
Member

ppf2 commented Aug 19, 2015

@dakrone @clintongormley Here is another use case from the field. Is this a bug or not currently supported (a feature)? Don't think default_operator:AND will help here since the use case is to return both documents in the example. thx

@clintongormley
Copy link

@ppf2 Actually, your example works as expected if you use the simple_query_string (you used the query_stringquery instead). As a query string query, you would need to change:

"(name:Android) OR NOT (status:approved)"

to:

"(name:Android) OR (NOT status:approved)"

The latter, when run through validate-query, shows the following:

"name:android (-status:approved +*:*)"

@ppf2
Copy link
Member

ppf2 commented Aug 25, 2015

Ah got it, thx @clintongormley for the tip!

@clintongormley clintongormley assigned jdconrad and unassigned dakrone Oct 14, 2015
@jdconrad
Copy link
Contributor

Taking a look.

@jdconrad
Copy link
Contributor

This is working as expected as @dakrone has inferred. The SimpleQueryParser should be thought of as only using AND and OR as operators. There is no concept of SHOULD and MUST other than internally to create the AND and OR queries. So when doing the query "+this -removed" the AND (+) is actually ignored as it is not thought of as a MUST. Using SimpleQueryParser this will always be the case where the query ends up being documents that either have 'this' OR not 'removed' ... Also note, that while this will return all the documents, the not 'removed' still affects scoring so it's not meaningless. Going to leave this open for now for further discussion if necessary.

@clintongormley
Copy link

While it may be working as designed, I'd argue that the syntax is surprising to most users. For example:

POST t/t/_bulk
{"index":{"_id":1}}
{"foo":"one"}
{"index":{"_id":2}}
{"foo":"two"}
{"index":{"_id":3}}
{"foo":"one two"}
{"index":{"_id":4}}
{"foo":"three"}

I would expect the following:

  • "one": Return docs 1 & 3 (works)
  • "-two": Return docs 1 & 3 (works)
  • "one -two": Return doc 1 (returns 1, 3, & 4)
  • "one three -two": Return docs 1 & 4 (returns 1, 3 & 4)

To get what I want (ie "Give me docs with one or three, but exclude anything with two") I need to write it as "one three +-two". That is not at all intuitive. If I typed "windows -microsoft" into google, I wouldn't expect Google to return all of the documents on the internet which don't contain the word microsoft.

At the very least it should be well documented but, given that this query is intended to be exposed to users who will not read documentation, I would say that the syntax could be improved.

@imotov
Copy link
Contributor

imotov commented Oct 15, 2015

If I typed "windows -microsoft" into google, I wouldn't expect Google to return all of the documents on the internet which don't contain the word microsoft.

@clintongormley that's because google (as well as most other search engines in 21st century) is using AND instead of OR as a default operator, which should be the default behavior for elasticsearch as well IMO. Having OR as a default operator is causing all sort of confusion for many new users.

@rmuir
Copy link
Contributor

rmuir commented Oct 15, 2015

Not really. if you query https://www.google.com/?gws_rd=ssl#q=elasticsearch+reference+query+dsl+oracle it gladly returns high ranking hits and just tells you: Missing: oracle

Switching to AND breaks many analysis chains such as n-grams. With a good ranking algo its also not necessary, its just that DefaultSimilarity is really weak here.

@jdconrad
Copy link
Contributor

I agree that this syntax is ugly -- "one three +-two" ; however, I am reluctant to special case the not operator because right now you have one OR three OR NOT two which while may be unexpected is predictable, but if I change this it becomes one OR three AND NOT two which is no longer predictable because it ignores the default operator and it loses its consistency. It is also very difficult to predict proper sub queries outside of this simple case. Take for example "one -three two" -- is this one AND not three OR two? Do I need to reorder this? I think this would end up being more confusing because of the way operator precedence works in that it's always first come first serve.

@imotov
Copy link
Contributor

imotov commented Oct 15, 2015

What google does, is some weird "fuzzy" AND (or something like should with large minimum should match) search that google turns on a long tail queries with a large number of terms. But the basic behavior with 2-3 term queries resembles AND more than OR, would you agree?

n-grams is an advanced feature, I think if a user can figure out how to enable n-gram (or configure any other custom analysis chain) they should be able to figure out how to switch from AND to OR in the query.

Anyway, I shouldn't hijack this discussion. I apologize for that. Back to the original topic. I think that my expectation would be that foo bar +baz -qux should be translated into something like this:

{
   "bool": {
        "should" : [
            {
                "term" : { "_all" : "foo" }
            }, {
                "term" : { "_all" : "bar" }
            }
        ],
        "must" : {
            "term" : { "_all" : "baz" }
        },
        "must_not" : {
            "term" : { "_all" : "qux" }
        }
   }
}

@jdconrad
Copy link
Contributor

I should explain further what happens right now, each time an AND is switched to an OR or vice versa a new boolean query branch is created. So if you have a b c +d +e f the tree ends up looking like

bq( should bq( should bq( should a should b should c ) must d must e ) should f)

so changing the not operator to always use must will have an inconsistent change in boolean query branches since operator precedence is always left to right.

We could change it to be something like @imotov suggests (maybe this should be a different parser altogether in Lucene?), but then you have should, must, and must not... if you're truly a basic user I think and/or is easier to understand than should/must/must not.

@imotov
Copy link
Contributor

imotov commented Oct 15, 2015

I should explain further what happens right now, each time an AND is switched to an OR or vice versa a new boolean query branch is created.

Yes, and this is where it breaks my expectation. To me order of elements in the query shouldn't make any difference because "+" and "-" feel like unary operators but they behave in strange ways.

@jdconrad
Copy link
Contributor

@imotov What you're saying makes sense to me from the point of view of someone that regularly deals with search, but for someone less technical I think and/or make more sense. Honestly, the default to OR is a bit odd to me too because if someone, say my mother, types "dog food" into the google she expects it to be anded together there at least through decent scoring (as you and @rmuir mentioned earlier). I think making a new parser with the behavior of must/should/must not makes sense depending on what our target audience wants. SimpleQueryParser2 or something.

@jdconrad
Copy link
Contributor

All right after a bit more thought and discussion, I've come to agree with everyone in this issue that this behavior is unexpected for everyone. I'll work on making a Lucene patch for the SimpleQueryParser using the behavior describe by @imotov and @rmuir where the structure will be a single bq per subquery.

@clintongormley
Copy link

@jdconrad did anything ever come of this? Did you open any issue in Lucene that we can track?

@jdconrad
Copy link
Contributor

jdconrad commented Nov 7, 2016

@clintongormley Sorry, I must've gotten distracted by other issues before I had anytime to address this. I'll have to take a bit of time to remember what we had discussed.

@clintongormley
Copy link

Let's document and close

@dakrone
Copy link
Member

dakrone commented Jan 6, 2017

Okay, opened a PR to document this, and then it can be closed.

dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 10, 2017
dakrone added a commit that referenced this issue Jan 10, 2017
This can be confusing when unexpected.

Resolves #4707
dakrone added a commit that referenced this issue Jan 10, 2017
This can be confusing when unexpected.

Resolves #4707
dakrone added a commit that referenced this issue Jan 10, 2017
This can be confusing when unexpected.

Resolves #4707
dakrone added a commit that referenced this issue Jan 10, 2017
This can be confusing when unexpected.

Resolves #4707
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
@peterdm
Copy link

peterdm commented May 11, 2018

Just stumbled on this limitation myself. I'd like to echo @imotov 's suggestion from Oct 15, 2017. (I'll paste below).

(quote) I think that my expectation would be that foo bar +baz -qux should be translated into something like this:

{
   "bool": {
        "should" : [
            {
                "term" : { "_all" : "foo" }
            }, {
                "term" : { "_all" : "bar" }
            }
        ],
        "must" : {
            "term" : { "_all" : "baz" }
        },
        "must_not" : {
            "term" : { "_all" : "qux" }
        }
   }
}

The usecase is based on the previously-mentioned 'google expectation' (or really any major search engine at this point) that can be approximated with a default_operator: "OR" accompanied by a high minimum_should_match.

The appeal of simple_query_string is in the ability to meet the syntactical needs of users who are familiar with Google operators (without throwing parse exceptions).

I'm a little bummed that I have to roll my own parser to offer a commonly-accepted negation operator. It's not the end of the world, but adds friction to anyone looking for an otherwise extremely nice (almost turnkey) drop-in query which largely meets common syntax expectations.

@clintongormley, where is the right forum to re-open this? It looks like this was closed with a doc-comment b/c it belongs in Lucene's JIRA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories
Projects
None yet
Development

No branches or pull requests

8 participants