Stop-words not removed during simple-query query-time analysis #28855

Closed
mohmad-null opened this issue Feb 28, 2018 · 3 comments · Fixed by #28871
Labels: >bug, :Search Relevance/Analysis, Team:Search Relevance

@mohmad-null
ES 6.2.1
I've noticed that the simple_query_string query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html) doesn't seem to remove stop words during query-time analysis. This is easiest to see with "default_operator": "and".

Consider the following. We create an index and change the default analyzer to remove English stop words:

PUT /simp_idx
{
  "mappings": {
    "my_type": {
      "properties": {
        "field_1": {
          "type": "text"
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": [
            "english_stop"
          ]
        }
      }
    }
  }
}

And then populate it:

PUT /simp_idx/my_type/1
{
  "field_1": "place of beauty"
}
PUT /simp_idx/my_type/2
{
  "field_1": "place and beauty"
}
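
To see why both documents should match, it may help to model what the default analyzer configured above does to each field value. The following is a rough Python sketch, not Elasticsearch code: `toy_analyze` and its regex tokenizer are hypothetical stand-ins for the `standard` tokenizer, and `ENGLISH_STOP` is only an assumed subset of the `_english_` list, large enough for this example.

```python
import re

# Assumption: a small subset of the _english_ stop word list, enough for this example.
ENGLISH_STOP = {"a", "an", "and", "of", "the"}


def toy_analyze(text):
    """Approximate the index's default analyzer: tokenize, then drop stop words.

    The analyzer in this issue has no lowercase filter, so tokens keep their case.
    """
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t not in ENGLISH_STOP]


docs = {1: "place of beauty", 2: "place and beauty"}

# Tokens that would end up in the index for each document under this model.
indexed = {doc_id: toy_analyze(text) for doc_id, text in docs.items()}
print(indexed)

# Tokens left in the query text after the same analysis.
print(toy_analyze("place of"))
```

Under this model both documents index the same tokens, and the query text "place of" reduces to the single token "place", which is why two hits are the expected result.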

Now, if we query this with the regular query_string, we get the expected two results:

GET /simp_idx/my_type/_search
{
  "query": {
    "query_string" : {
        "query": "place of",
        "default_operator": "and"
    }
  }
}

But the same query using simple_query_string with the AND operator finds no results:

GET /simp_idx/my_type/_search
{
  "query": {
    "simple_query_string" : {
        "query": "place of",
        "fields": [ "field_1" ],
        "default_operator": "and"
    }
  }
}

Remove the "of" from the query and it will work as expected.

Maybe this is intentional because simple_query_string (SQS) is meant to be "simple", but it's not documented on the SQS page; the only difference stated explicitly there is that it never raises an exception. It seems like a bug though, hence the report.

javanna added the :Search/Search and >bug labels Mar 1, 2018
javanna (Member) commented Mar 1, 2018

I can reproduce this, and I don't think this is on purpose. cc @elastic/es-search-aggs

javanna added the :Search Relevance/Analysis label and removed the :Search/Search label Mar 1, 2018
jimczi self-assigned this Mar 1, 2018
jimczi added a commit to jimczi/elasticsearch that referenced this issue Mar 1, 2018
This change ensures that we ignore terms removed by the analysis rather than returning a match_no_docs query for the part
that contains the stop word. For instance, a query like "the AND fox" should ignore "the" if it is considered a stop word instead of
adding a match_no_docs query.
This change also fixes the analysis of prefix terms that start with a stop word (e.g. `the*`). In that case, if `analyze_wildcard` is true and `the`
is considered a stop word, this part of the query was previously rewritten into a match_no_docs query. Since it's a prefix query, this change forces the prefix query
on `the` even if it is removed by the analysis.

Fixes elastic#28855
Fixes elastic#28856
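
The behavior this commit message describes can be illustrated with a toy model. This is a hedged Python sketch and not the actual Elasticsearch implementation: `analyze`, `build_and_query`, `matches`, and the `MATCH_NO_DOCS` sentinel are hypothetical names, and `STOP` is a small assumed subset of the English stop list.

```python
STOP = {"of", "and", "the"}  # assumed subset of the English stop list
MATCH_NO_DOCS = object()     # sentinel standing in for a match_no_docs clause


def analyze(term):
    """Return the term's tokens after stop word removal (empty if removed)."""
    return [] if term in STOP else [term]


def build_and_query(query, ignore_removed_terms):
    """Build a list of required clauses, one per whitespace-separated term."""
    clauses = []
    for term in query.split():
        tokens = analyze(term)
        if tokens:
            clauses.extend(tokens)
        elif not ignore_removed_terms:
            # Behavior reported in this issue: a term that analyzes to nothing
            # becomes a clause that can never match, so the whole AND fails.
            clauses.append(MATCH_NO_DOCS)
        # Behavior described by the commit: the removed term is simply skipped.
    return clauses


def matches(doc_tokens, clauses):
    """A document matches only if every required clause matches."""
    return all(c is not MATCH_NO_DOCS and c in doc_tokens for c in clauses)


doc = ["place", "beauty"]  # indexed tokens for "place of beauty"
before = matches(doc, build_and_query("place of", ignore_removed_terms=False))
after = matches(doc, build_and_query("place of", ignore_removed_terms=True))
print(before, after)
```

In this model the document is rejected when the removed stop word is turned into a never-matching clause, and accepted once the removed term is skipped. The prefix-term case from the second paragraph of the commit message is not modeled here.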
jimczi added a commit that referenced this issue Mar 4, 2018
jimczi added a commit that referenced this issue Mar 4, 2018
sebasjm pushed a commit to sebasjm/elasticsearch that referenced this issue Mar 10, 2018
@jesseamancio
I am dealing with this problem and intend to downgrade ES to a sound release.
Does anyone know which earlier version of ES is free of this bug?

pluk commented Aug 21, 2018

ES 6.3.1
The bug still occurs when the query targets more than one field: simple_query_string does not remove the stop words, and the query finds no results.

All the fields have the same type ("text").

5 participants