Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor field expansion for match, multi_match and query_string query #25726

Merged
merged 3 commits into from Jul 21, 2017

Conversation

Projects
None yet
4 participants
@jimczi
Copy link
Member

commented Jul 14, 2017

This commit changes the way we handle field expansion in match, multi_match and query_string query.
The main changes are:

  • For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.

  • For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored.

  • For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.

  • The * notation can also be used to set default_field option onquery_string query. This should replace the needs for the extra option use_all_fields which is deprecated in this change.

This commit also rewrites simple * query to matchalldocs query when all fields are requested (Fixes #25556).

The same change should be done on simple_query_string for completeness.

use_all_fields option in query_string is also deprecated in this change, default_field should be set to * instead.

Relates #25551

@dakrone
Copy link
Member

left a comment

@jimczi I just started playing with this (hence the short review), but I found something that causes some dissonance. If someone configures "default_field": "*" in the index settings, it does not work the same way as configuring "default_field": "*" in the query does.

Here's something that demonstrates the problem:

DELETE /test?pretty=false
{}

PUT /test?pretty=false
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "query": {
        "default_field": "*"
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "f1": {"type": "text"},
        "f2": {"type": "text"}
      }
    }
  }
}

POST /test/doc/1?refresh=wait_for&filter_path=result&pretty=false
{"f1": "foo", "f2": "bar"}

POST /test/_search
{
  "query": {
    "query_string": {
      "query": "foo AND bar"
    }
  }
}

POST /test/_search
{
  "query": {
    "query_string": {
      "query": "foo AND bar",
      "default_field": "*"
    }
  }
}

And the output:

{"acknowledged":true}
{"acknowledged":true,"shards_acknowledged":true,"index":"test"}
{"result":"created"}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "f1" : "foo",
          "f2" : "bar"
        }
      }
    ]
  }
}
core/src/main/java/org/elasticsearch/index/query/QueryStringQueryBuilder.java Outdated
return this;
}

/**
* This setting is deprecated, set {@link #defaultField(String)} to "*" instead.

This comment has been minimized.

Copy link
@dakrone

dakrone Jul 14, 2017

Member

I think this comment is misleading since this is a getter

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jul 14, 2017

Thanks @dakrone , good catch.
I pushed some changes to fix the bug.

@jimczi jimczi force-pushed the jimczi:query_string_all_fields branch Jul 17, 2017

@jimczi jimczi changed the title Remove all_fields:true in favour of default_field:* in the query_string query Refactor field expansion for match, multi_match and query_string query Jul 17, 2017

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jul 17, 2017

This PR has been updated to refactor the expansion of field names for the main es queries (the simple query string should be done in a follow up). The new description reflects the latest status of the PR (and explains the main changes). @dakrone I need to document the new behavior and work a bit on tests but the code is ready for a review I think ;).

@dakrone

This comment has been minimized.

Copy link
Member

commented Jul 18, 2017

For partial field names (with * suffix), the expansion is done only on keyword, text, date and number field types. Other field types are simply ignored.

Does this include the ip field? (I can't remember if ip counts as a number)

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jul 18, 2017

Does this include the ip field? (I can't remember if ip counts as a number)

It's a separate type but you are right this include the ip type as well. I've updated the description of the PR.

@dakrone
Copy link
Member

left a comment

LGTM, I left some comments about whether we want to capture exception messages, but it's up to you for addressing them.

Thanks for doing this!

core/src/main/java/org/elasticsearch/index/query/QueryStringQueryBuilder.java Outdated
@@ -291,8 +278,9 @@ protected void doWriteTo(StreamOutput out) throws IOException {
if (out.getVersion().onOrAfter(Version.V_5_1_1)) {
if (out.getVersion().before(Version.V_6_0_0_beta1)) {
out.writeBoolean(false); // split_on_whitespace
Boolean useAllFields = defaultField == null ? null : "*".equals(defaultField);

This comment has been minimized.

Copy link
@dakrone

dakrone Jul 18, 2017

Member

I think "*".equals(defaultField) should use Regex.isMatchAllPattern

core/src/main/java/org/elasticsearch/index/search/MatchQuery.java Outdated
return query;
} catch (RuntimeException e) {
if (lenient) {
return new MatchNoDocsQuery("fuzzy query failed for " + fieldType.name() + ":" + term.text());

This comment has been minimized.

Copy link
@dakrone

dakrone Jul 18, 2017

Member

Can we include the exception message? It might be good to know that a lenient query failed due to something that is avoidable

core/src/main/java/org/elasticsearch/index/search/MultiMatchQuery.java Outdated
@@ -241,7 +233,7 @@ static Query blendTerm(QueryShardContext context, BytesRef value, Float commonTe
}

static Query blendTerms(QueryShardContext context, BytesRef[] values, Float commonTermsCutoff, float tieBreaker,
FieldAndFieldType... blendedFields) {
FieldAndFieldType... blendedFields) {

This comment has been minimized.

Copy link
@dakrone

dakrone Jul 18, 2017

Member

I think the indentation is off here still? :)

core/src/main/java/org/elasticsearch/index/search/QueryStringQueryParser.java Outdated
return rangeQuery;
} catch (RuntimeException e) {
if (lenient) {
return null;

This comment has been minimized.

Copy link
@dakrone

dakrone Jul 18, 2017

Member

should this return a MatchNoDocsQuery like the other ones where an exception is thrown?

core/src/main/java/org/elasticsearch/index/search/QueryStringQueryParser.java Outdated
getFuzzyPrefixLength(), fuzzyMaxExpansions, FuzzyQuery.defaultTranspositions);
} catch (RuntimeException e) {
if (lenient) {
return new MatchNoDocsQuery("fuzzy query failed for " + currentFieldType.name() + ":" + termStr);

This comment has been minimized.

Copy link
@dakrone

dakrone Jul 18, 2017

Member

Should we include the message of the exception here?

@@ -107,11 +110,6 @@ the query string. This allows to use a field that has a different analysis chain
for exact matching. Look <<mixing-exact-search-with-stemming,here>> for a
comprehensive example.

|`all_fields` | Perform the query on all fields detected in the mapping that can

This comment has been minimized.

Copy link
@dakrone

dakrone Jul 18, 2017

Member

Should we leave this in here, but mark it as deprecated? It still technically exists and I'm not sure we want people that have all_fields: true set to not be able to look up what the option does (until it's removed)

@jimczi jimczi force-pushed the jimczi:query_string_all_fields branch Jul 21, 2017

@jimczi jimczi added >bug and removed >enhancement labels Jul 21, 2017

jimczi added some commits Jul 14, 2017

Refactor field expansion for match, multi_match and query_string query
This commit changes the way we handle field expansion in `match`, `multi_match` and `query_string` query.
 The main changes are:

- For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.

- For partial field names (with `*` suffix), the expansion is done only on `keyword`, `text`, `date` and `number` field types. Other field types are simply ignored.

- For all fields (`*`), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.

- The `*` notation can also be used to set `default_field` option on`query_string` query. This should replace the needs for the extra option `use_all_fields` which is deprecated in this change.

This commit also rewrites simple `*` query to matchalldocs query when all fields are requested (Fixes #25556).

The same change should be done on `simple_query_string` for completeness.

Relates #25551

@jimczi jimczi force-pushed the jimczi:query_string_all_fields branch to 8fbb2b9 Jul 21, 2017

@jimczi jimczi merged commit c378432 into elastic:master Jul 21, 2017

2 checks passed

CLA Commit author is a member of Elasticsearch
Details
elasticsearch-ci Build finished.
Details

@jimczi jimczi deleted the jimczi:query_string_all_fields branch Jul 21, 2017

@clintongormley clintongormley added v6.0.0-beta1 and removed v6.0.0 labels Jul 25, 2017

dakrone added a commit to dakrone/elasticsearch that referenced this pull request Jul 25, 2017

Parse "*" in query_string_query as MatchAllDocsQuery
Related to elastic#25726, this resolves elastic#25556 for the 5.x series by parsing "*" as a
`MatchAllDocsQuery` instead of expanding it to a (potentially expensive) query
on the `_field_names`.

dakrone added a commit that referenced this pull request Jul 25, 2017

Parse "*" in query_string_query as MatchAllDocsQuery
Related to #25726, this resolves #25556 for the 5.x series by parsing "*" as a
`MatchAllDocsQuery` instead of expanding it to a (potentially expensive) query
on the `_field_names`.

dakrone added a commit that referenced this pull request Jul 25, 2017

Parse "*" in query_string_query as MatchAllDocsQuery
Related to #25726, this resolves #25556 for the 5.x series by parsing "*" as a
`MatchAllDocsQuery` instead of expanding it to a (potentially expensive) query
on the `_field_names`.
@Bargs

This comment has been minimized.

Copy link

commented Aug 1, 2017

Hey @jimczi, I'm seeing an odd difference between default_field and fields. fields: ["*"] sometimes throws an error when default_field: "*" does not. For example:

I'm using a weblogs data set with IP fields. If I send the following query, I get an error:

  {
    "query_string": {
      "query": "200",
      "analyze_wildcard": true,
      "fields": [
        "*"
      ]
    }
  }
{
      "error": {
        "root_cause": [
          {
            "type": "query_shard_exception",
            "reason": "failed to create query: {\n  \"bool\" : {\n    \"must\" : [\n      {\n        \"query_string\" : {\n          \"query\" : \"200\",\n          \"fields\" : [\n            \"*^1.0\"\n          ],\n          \"type\" : \"best_fields\",\n          \"default_operator\" : \"or\",\n          \"max_determinized_states\" : 10000,\n          \"enable_position_increments\" : true,\n          \"fuzziness\" : \"AUTO\",\n          \"fuzzy_prefix_length\" : 0,\n          \"fuzzy_max_expansions\" : 50,\n          \"phrase_slop\" : 0,\n          \"analyze_wildcard\" : true,\n          \"escape\" : false,\n          \"boost\" : 1.0\n        }\n      },\n      {\n        \"range\" : {\n          \"@timestamp\" : {\n            \"from\" : 1501607670716,\n            \"to\" : 1501608570716,\n            \"include_lower\" : true,\n            \"include_upper\" : true,\n            \"format\" : \"epoch_millis\",\n            \"boost\" : 1.0\n          }\n        }\n      }\n    ],\n    \"adjust_pure_negative\" : true,\n    \"boost\" : 1.0\n  }\n}",
            "index_uuid": "3FMTvZhWQyimxtxw3pJvvg",
            "index": "logstash-0"
          }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
          {
            "shard": 0,
            "index": "logstash-0",
            "node": "38QhIDc9TTGCRXYNWAmFJA",
            "reason": {
              "type": "query_shard_exception",
              "reason": "failed to create query: {\n  \"bool\" : {\n    \"must\" : [\n      {\n        \"query_string\" : {\n          \"query\" : \"200\",\n          \"fields\" : [\n            \"*^1.0\"\n          ],\n          \"type\" : \"best_fields\",\n          \"default_operator\" : \"or\",\n          \"max_determinized_states\" : 10000,\n          \"enable_position_increments\" : true,\n          \"fuzziness\" : \"AUTO\",\n          \"fuzzy_prefix_length\" : 0,\n          \"fuzzy_max_expansions\" : 50,\n          \"phrase_slop\" : 0,\n          \"analyze_wildcard\" : true,\n          \"escape\" : false,\n          \"boost\" : 1.0\n        }\n      },\n      {\n        \"range\" : {\n          \"@timestamp\" : {\n            \"from\" : 1501607670716,\n            \"to\" : 1501608570716,\n            \"include_lower\" : true,\n            \"include_upper\" : true,\n            \"format\" : \"epoch_millis\",\n            \"boost\" : 1.0\n          }\n        }\n      }\n    ],\n    \"adjust_pure_negative\" : true,\n    \"boost\" : 1.0\n  }\n}",
              "index_uuid": "3FMTvZhWQyimxtxw3pJvvg",
              "index": "logstash-0",
              "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "'200' is not an IP string literal."
              }
            }
          }
        ]
      },
      "status": 400
    }

A multi_match with fields: ["*"] fails with the same error:

{
  "multi_match": {
    "query": 200,
    "fields": [
      "*"
    ],
    "type": "phrase"
  }
}

The following query_string query using default_field, however, works fine:

{
  "query_string": {
    "query": "200",
    "analyze_wildcard": true,
    "default_field": "*"
  }
}

For me, it would be ideal if fields: ["*"] worked like default_field: "*". If fields: ["*"] throws errors depending on the query value, I can't really send user provided values and I'm back to abusing the query_string query in order to do searches across all fields.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Aug 1, 2017

This is just a leftover of the all_fields mode. By default the query_string and multi_match query are not lenient, they throw an error when an invalid content is used for a field type. When the all_fields was used (by default or when it was explicitly set) the leniency was forced to true. I kept this behavior for query_string query using default_field:* but it's just to make sure that the default (when default_field is not set) doesn't change. If you use fields:* or the multi_match query you'll have to set the leniency to true manually.

@Bargs

This comment has been minimized.

Copy link

commented Aug 1, 2017

Ah, I had no idea that lenient parameter existed. Thanks @jimczi, that's exactly what I needed!

jimczi added a commit to jimczi/elasticsearch that referenced this pull request Aug 21, 2017

Refactor simple_query_string to handle text part like multi_match and…
… query_string

This change is a continuation of elastic#25726 that aligns field expansions for the simple_query_string with the query_string and multi_match query.
The main changes are:

 * For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.

 * For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored.

 * For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.

The use_all_fields option is deprecated in this change and can be replaced by setting `*` in the fields parameter.
This commit also changes how text fields are analyzed. Previously the default search analyzer (or the provided analyzer) was used to analyze every text part
, ignoring the analyzer set on the field in the mapping. With this change, the field analyzer is used instead unless an analyzer has been forced in the parameter of the query.

Finally now that all full text queries can handle the special "*" expansion (`all_fields` mode), the `index.query.default_field` is now set to `*` for indices created in 6.

jimczi added a commit that referenced this pull request Aug 21, 2017

Refactor simple_query_string to handle text part like multi_match and…
… query_string (#26145)

This change is a continuation of #25726 that aligns field expansions for the simple_query_string with the query_string and multi_match query.
The main changes are:

 * For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.

 * For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored.

 * For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.

The use_all_fields option is deprecated in this change and can be replaced by setting `*` in the fields parameter.
This commit also changes how text fields are analyzed. Previously the default search analyzer (or the provided analyzer) was used to analyze every text part
, ignoring the analyzer set on the field in the mapping. With this change, the field analyzer is used instead unless an analyzer has been forced in the parameter of the query.

Finally now that all full text queries can handle the special "*" expansion (`all_fields` mode), the `index.query.default_field` is now set to `*` for indices created in 6.

jimczi added a commit that referenced this pull request Aug 21, 2017

Refactor simple_query_string to handle text part like multi_match and…
… query_string (#26145)

This change is a continuation of #25726 that aligns field expansions for the simple_query_string with the query_string and multi_match query.
The main changes are:

 * For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.

 * For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored.

 * For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.

The use_all_fields option is deprecated in this change and can be replaced by setting `*` in the fields parameter.
This commit also changes how text fields are analyzed. Previously the default search analyzer (or the provided analyzer) was used to analyze every text part
, ignoring the analyzer set on the field in the mapping. With this change, the field analyzer is used instead unless an analyzer has been forced in the parameter of the query.

Finally now that all full text queries can handle the special "*" expansion (`all_fields` mode), the `index.query.default_field` is now set to `*` for indices created in 6.

jimczi added a commit that referenced this pull request Aug 21, 2017

Refactor simple_query_string to handle text part like multi_match and…
… query_string (#26145)

This change is a continuation of #25726 that aligns field expansions for the simple_query_string with the query_string and multi_match query.
The main changes are:

 * For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.

 * For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored.

 * For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.

The use_all_fields option is deprecated in this change and can be replaced by setting `*` in the fields parameter.
This commit also changes how text fields are analyzed. Previously the default search analyzer (or the provided analyzer) was used to analyze every text part
, ignoring the analyzer set on the field in the mapping. With this change, the field analyzer is used instead unless an analyzer has been forced in the parameter of the query.

Finally now that all full text queries can handle the special "*" expansion (`all_fields` mode), the `index.query.default_field` is now set to `*` for indices created in 6.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.