From 7c3ed47cde2b69fdb50842a1f9d192e2fc1fbb8d Mon Sep 17 00:00:00 2001
From: Andrew Cholakian
Date: Tue, 3 Dec 2013 21:11:15 -0500
Subject: [PATCH 1/3] Cleanup some of the documentation for the fuzzy query.

---
 .../query-dsl/queries/fuzzy-query.asciidoc | 45 +++++++++++++------
 .../query-dsl/queries/match-query.asciidoc |  8 ++--
 2 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/docs/reference/query-dsl/queries/fuzzy-query.asciidoc b/docs/reference/query-dsl/queries/fuzzy-query.asciidoc
index 86a1062d16922..fd08cba81626d 100644
--- a/docs/reference/query-dsl/queries/fuzzy-query.asciidoc
+++ b/docs/reference/query-dsl/queries/fuzzy-query.asciidoc
@@ -1,14 +1,21 @@
 [[query-dsl-fuzzy-query]]
 === Fuzzy Query
 
-A fuzzy query that uses similarity based on Levenshtein (edit
-distance) algorithm. This maps to Lucene's `FuzzyQuery`.
+A fuzzy query that uses similarity based on the Levenshtein (edit distance) algorithm for text,
+and ranges for numeric and date data. This maps to Lucene's `FuzzyQuery` for text.
+Maximum edit distance is determined via the `min_similarity` parameter,
+which can only take the values 1 or 2 for text for performance reasons.
 
-Warning: this query is not very scalable with its default prefix length
-of 0 - in this case, *every* term will be enumerated and cause an edit
-score calculation or `max_expansions` is not set.
+Users should be warned, performance can easily degrade when using this query on an index with a
+large number of terms, especially when a `min_similarity` of 2 is set.
+Execution speed can be dramatically improved through the use of the `prefix_length` and
+`max_expansions` settings, described later in this document.
 
-Here is a simple example:
+It should also be noted that `min_similarity` can also take a `float` value, which is
+converted to an integer edit distance based on the text's properties, but that this is deprecated.
+Please use only integer values for `min_similarity`.
+
+Here is a simple example of matching text with a `fuzzy` query:
 
 [source,js]
 --------------------------------------------------
@@ -17,8 +24,7 @@ Here is a simple example:
 }
 --------------------------------------------------
 
-More complex settings can be set (the values here are the default
-values):
+More complex settings can be seen below (the values here are the defaults):
 
 [source,js]
 --------------------------------------------------
@@ -27,20 +33,17 @@ values):
         "user" : {
             "value" : "ki",
             "boost" : 1.0,
-            "min_similarity" : 0.5,
+            "min_similarity" : 1,
             "prefix_length" : 0
         }
     }
 }
 --------------------------------------------------
 
-The `max_expansions` parameter (unbounded by default) controls the
-number of terms the fuzzy query will expand to.
-
 [float]
 ==== Numeric / Date Fuzzy
 
-`fuzzy` query on a numeric field will result in a range query "around"
+A `fuzzy` query on a numeric field will result in a range query "around"
 the value using the `min_similarity` value. For example:
 
 [source,js]
@@ -77,3 +80,19 @@ For example, for dates, a fuzzy factor of "1d" will result in
 multiplying whatever fuzzy value provided in the min_similarity by it.
 Note, this is explicitly supported since query_string query only allowed
 for similarity valued between 0.0 and 1.0.
+
+==== Performance Tuning
+
+The default settings for this query prefer correct behavior over speed. Given an index
+with a large number of terms, performance can quickly degrade. The `prefix_length` and `max_expansions`
+parameters can be used to remedy performance problems in larger datasets by significantly
+reducing the search space at cost of not matching some valid documents.
+
+The `prefix_length` parameter restricts matches to those that share an exact prefix with the query `value`.
+The number of matching characters is controlled with this parameter.
+Using `prefix_length` greatly shortens the search space at the expense of not detecting edits that occur at the start of the term.
+
+The `max_expansions` parameter controls the number of alternate versions of the input term to look for.
+When the query is executed only a set number of permutations, by default 50, are actually matched.
+The lower the value of `max_expansions` the faster the query will be. The trade-off here is that some documents that
+should match may not be returned due to their specific edit not being a part of the expansion list.
\ No newline at end of file
diff --git a/docs/reference/query-dsl/queries/match-query.asciidoc b/docs/reference/query-dsl/queries/match-query.asciidoc
index 5460cbff1e448..9a06da5647548 100644
--- a/docs/reference/query-dsl/queries/match-query.asciidoc
+++ b/docs/reference/query-dsl/queries/match-query.asciidoc
@@ -38,10 +38,12 @@ definition, or the default search analyzer.
 string types it should be a value between `0.0` and `1.0`) to constructs
 fuzzy queries for each term analyzed. The `prefix_length` and
 `max_expansions` can be set in this case to control the fuzzy process.
+Please see the documentation for the <> query type for
+more information on `max_expansions` and `prefix_length`.
+
 If the fuzzy option is set the query will use `constant_score_rewrite`
-as its <> the `rewrite` parameter allows to control how the query will get
-rewritten.
+as its <>
+the `rewrite` parameter controls how the query will get rewritten.
 
 Here is an example when providing additional parameters (note the slight
 change in structure, `message` is the field name):

From a17a1251e88c6f057308cc43aa54160ff2f67e81 Mon Sep 17 00:00:00 2001
From: Andrew Cholakian
Date: Thu, 12 Dec 2013 10:49:24 -0800
Subject: [PATCH 2/3] Cleanup some wording on docs for Fuzzy queries

---
 docs/reference/query-dsl/queries/fuzzy-query.asciidoc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/reference/query-dsl/queries/fuzzy-query.asciidoc b/docs/reference/query-dsl/queries/fuzzy-query.asciidoc
index fd08cba81626d..d4b3f9579c6e4 100644
--- a/docs/reference/query-dsl/queries/fuzzy-query.asciidoc
+++ b/docs/reference/query-dsl/queries/fuzzy-query.asciidoc
@@ -2,7 +2,8 @@
 === Fuzzy Query
 
 A fuzzy query that uses similarity based on the Levenshtein (edit distance) algorithm for text,
-and ranges for numeric and date data. This maps to Lucene's `FuzzyQuery` for text.
+and ranges for numeric and date data.
+This maps to Lucene's `FuzzyQuery` when run against a text field, and maps to a range filter when run against a numeric field.
 Maximum edit distance is determined via the `min_similarity` parameter,
 which can only take the values 1 or 2 for text for performance reasons.
 
@@ -88,8 +89,7 @@ with a large number of terms, performance can quickly degrade. The `prefix_lengt
 parameters can be used to remedy performance problems in larger datasets by significantly
 reducing the search space at cost of not matching some valid documents.
 
-The `prefix_length` parameter restricts matches to those that share an exact prefix with the query `value`.
-The number of matching characters is controlled with this parameter.
+The `prefix_length` parameter restricts matches to those that have an exact prefix match of the provided length with the query `value`.
 Using `prefix_length` greatly shortens the search space at the expense of not detecting edits that occur at the start of the term.
 
 The `max_expansions` parameter controls the number of alternate versions of the input term to look for.

From b7b0d1d2ae668a8f6519215ef27ccd9f6a34ad17 Mon Sep 17 00:00:00 2001
From: Andrew Cholakian
Date: Tue, 17 Dec 2013 07:43:18 -0800
Subject: [PATCH 3/3] Better document fuzziness within match queries

---
 .../query-dsl/queries/match-query.asciidoc | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/docs/reference/query-dsl/queries/match-query.asciidoc b/docs/reference/query-dsl/queries/match-query.asciidoc
index 9a06da5647548..4a3296a1dc270 100644
--- a/docs/reference/query-dsl/queries/match-query.asciidoc
+++ b/docs/reference/query-dsl/queries/match-query.asciidoc
@@ -1,3 +1,4 @@
+
 [[query-dsl-match-query]]
 === Match Query
 
@@ -34,12 +35,14 @@ The `analyzer` can be set to control which analyzer will perform the
 analysis process on the text. It default to the field explicit mapping
 definition, or the default search analyzer.
 
-`fuzziness` can be set to a value (depending on the relevant type, for
-string types it should be a value between `0.0` and `1.0`) to constructs
-fuzzy queries for each term analyzed. The `prefix_length` and
-`max_expansions` can be set in this case to control the fuzzy process.
-Please see the documentation for the <> query type for
-more information on `max_expansions` and `prefix_length`.
+The `fuzziness` option can be set to search for values slightly different from the query value.
+The fuzziness option controls the maximum Levenshtein distance when querying string fields
+with distances of `1` or `2` being legal. When querying numeric fields, it can take larger values,
+and is internally converted into a range query.
+For more information on fuzzy queries see <>, which documents
+this query type more fully. Note that all the options supported by the `fuzzy` query, such as
+`max_expansions` and `prefix_length`, are supported within a `match` query when
+`fuzziness` is specified.
 
 If the fuzzy option is set the query will use `constant_score_rewrite`
 as its <>