
String field mappings and fielddata / doc values settings #12394

Closed
clintongormley opened this issue Jul 22, 2015 · 41 comments
Assignees
Labels
discuss Meta :Search/Mapping Index mappings, including merging and defining field types

Comments

@clintongormley

This issue addresses a few topics:

`string` → `text` / `keyword`

Today, we use string both for full-text and for structured keywords. We don't support doc-values on analyzed string fields, which means that strings which are essentially keywords (but eg need to be lowercased) cannot use doc-values.

Proposal:

  • deprecate string fields
  • add text fields which support the full analysis chain and don't support doc-values
  • add keyword fields which support only the keyword tokenizer, and have doc-values enabled by default
  • change index to accept true | false

Question: Should keyword fields allow token filters that introduce new tokens?
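Under this proposal, an explicit mapping for the two new field types might look like this (a sketch of the proposed syntax, not the final implementation; field names are illustrative):

```json
{
  "properties": {
    "title":  { "type": "text" },
    "status": { "type": "keyword" }
  }
}
```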

Deprecating fielddata for fields that support doc values

In-memory fielddata is limited by the size of the heap, and has been one of the biggest pain-points for users. Doc-values are slightly slower but: (1) don't suffer from fielddata's initial loading latency, (2) are not limited by heap size, (3) don't impact garbage collection, (4) allow much greater scaling.

All fields that support doc values already have them enabled by default.

Proposal:

  • Deprecate fielddata implementations (except for analyzed string fields) in 2.x
  • Remove them in 3.x.

The question arises: what happens if the user disables doc values then decides that actually they DO want to aggregate on that field after all? The answer is the same as if they have set a field to index:false - they have to reindex.
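For example, a user who knows up front that they will never sort or aggregate on a field could disable doc values in the mapping (a sketch, using the proposed syntax; the field name is illustrative) — re-enabling them later would require a reindex:

```json
{
  "properties": {
    "session_id": {
      "type": "keyword",
      "doc_values": false
    }
  }
}
```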

Fielddata and doc values settings

Today we have these settings:

  • doc_values: true|false
  • fielddata.format: disabled | doc_values | paged_bytes | array
  • fielddata.loading: lazy | eager | eager_global_ordinals
  • fielddata.filters: frequency:{}, regex:{}

These settings become much simpler if we deprecate fielddata for all but analyzed string fields.

Proposal for fields that support doc values:

  • doc_values : true (default) | false
  • global_ordinals : lazy (default) | eager

Proposal for analyzed string fields:

  • fielddata: disabled (default) | lazy | eager
  • global_ordinals : lazy (default) | eager
  • fielddata.filters: frequency:{}, regex:{}

If, in the future, we can automatically figure out which global ordinals need to be built eagerly, then we can remove the global_ordinals setting.
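Putting the two proposals together, a keyword-style field and an analyzed text field might be configured like this (a sketch of the proposed settings, not final syntax; field names and filter values are illustrative):

```json
{
  "properties": {
    "tag": {
      "type": "keyword",
      "doc_values": true,
      "global_ordinals": "eager"
    },
    "body": {
      "type": "text",
      "fielddata": "lazy",
      "fielddata.filters": { "frequency": { "min": 0.001 } }
    }
  }
}
```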

Norms settings

Similar to the above, we have:

  • norms.enabled : true | false
  • norms.loading : lazy | eager

In Lucene 5.3, norms are disk based, so the lazy/eager issue is less important (eager in this case would mean force-loading the norms into the file system cache, a decision which we can probably make automatically in the future).

Proposal:

  • norms: true | false
  • only supported on text fields
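A text field that is only matched on but never scored (e.g. a log message) could then drop norms with a single flag (a sketch of the proposed syntax; the field name is illustrative):

```json
{
  "properties": {
    "message": {
      "type": "text",
      "norms": false
    }
  }
}
```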

Good out-of-the-box dynamic mappings for string fields

Today, when we detect a new string field, we add it as an analyzed string, with lazy fielddata loading enabled. While this allows users to get going with full text search, sorting and aggregations (with limitations, eg new + york), it's a poor default for heap usage.

Proposal:

Add a text main field (with fielddata loading disabled) and a keyword multi-field by default, ie:

{
  "my_string": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

With the default settings these fields would look like this:

{
  "my_string": {
    "type":                "text",
    "analyzer":            "default",
    "boost":               1,
    "fielddata":           "disabled",
    "fielddata_filters":   {},
    "ignore_above":        -1,
    "include_in_all":      true,
    "index":               true,
    "index_options":       "positions",
    "norms":               true,
    "null_value":          null,
    "position_offset_gap": 0,
    "search_analyzer":     "default",
    "similarity":          "default",
    "store":               false,
    "term_vector":         "no"
  }
}

{
  "my_string.keyword": {
    "type":                "keyword",
    "analyzer":            "keyword",
    "boost":               1,
    "doc_values":          true,
    "ignore_above":        256,
    "include_in_all":      true,
    "index":               true,
    "index_options":       "docs",
    "null_value":          null,
    "position_offset_gap": 0,
    "search_analyzer":     "keyword",
    "similarity":          "default",
    "store":               false,
    "term_vector":         "no"
  }
}
@clintongormley clintongormley added v2.0.0-beta1 discuss :Search/Mapping Index mappings, including merging and defining field types Meta labels Jul 22, 2015
@jpountz
Contributor

jpountz commented Jul 22, 2015

+1

One issue with the default dynamic mappings that you propose is that we would fail at indexing large string fields given that the keyword mapping could generate large tokens while Lucene refuses to index tokens that are greater than 32k. So we might want to set ignore_above to a reasonable value, eg. 256?

@clintongormley
Author

++ I meant to mention that. I'll update the description and add it

@rmuir
Contributor

rmuir commented Jul 22, 2015

+1

@javanna
Member

javanna commented Jul 22, 2015

+1 !!!

@gmoskovicz
Contributor

This is a big win. It makes strings much clearer, and should make for faster default behaviour!

@imotov
Contributor

imotov commented Jul 22, 2015

I am ±0 on this one. I think the change is useful but I feel like it's solving the wrong problem. I think what we really need is a mechanism to define custom types. So, users could define "text" the way they want with reasonable defaults in a single place and then just refer to this type everywhere in a mapping. We can still pre-populate a set of default custom types such as the "text" and "keyword" proposed above, but this should be done through the same mechanism that is available to users, who should be able to redefine these custom types.

@colings86
Contributor

+1 to this issue, it will make string settings much easier for users to understand.

@pickypg
Member

pickypg commented Jul 22, 2015

Huge +1. This simplifies so much and it does what you want out of the box.

@imotov I think that's an interesting point, but this significantly improves what is currently happening/allowed with not_analyzed strings (e.g., adding a filter). Perhaps your idea alone should be a separate feature? I like the concept of globally controlled type defaults.

@dakrone
Member

dakrone commented Jul 22, 2015

+1 on this, I think if implemented properly it could lead to possibly adding something like @imotov suggested in the future, but for now just starting with these two is a good idea.

@markwalkom
Contributor

+1

@mbonaci

mbonaci commented Jul 29, 2015

+[insert some ridiculously large number]
Do you still expect this to land in 2.0?

@clintongormley
Author

i hope so!

@mattweber
Contributor

@clintongormley What is the current thinking for:

Question: Should keyword fields allow token filters that introduce new tokens?

Is there a way to know if a token filter outputs more than a single value?

@clintongormley
Author

Hi @mattweber - long time!

@rmuir can confirm but I believe it is not a problem, in the same way that a numeric field can have multiple values. What we need to avoid is proper full text tokenization.

@mattweber
Contributor

@clintongormley Thanks. I was thinking token filters like (edge)ngram and word delimiter might be an issue. Probably rare though.

@rmuir
Contributor

rmuir commented Aug 5, 2015

No, we don't need to allow filters that have multiple values. That is really really trappy!

Yes, in lucene we annotate filters that are "safe". This can also be used to make things like wildcard "analysis" more intuitive. But elasticsearch does not make use of this right now.

@rayward

rayward commented Aug 5, 2015

😍 +1

@ariasdelrio

+1

@clintongormley
Author

if people sort/aggregate on analyzed fields (I do that sometimes!) they are informed

The plan is to disable aggs on analyzed fields by default and to throw an exception if the user tries to do so. I believe this is the right thing to do, as it is much more common for an unaware user to sort/aggregate on the wrong field (full text instead of raw) and blow up their heap. Aggs can be enabled on analyzed fields on a live index by updating the mappings, so it is quite easy for users who know what they are doing to resolve the situation to their liking.

Though I agree in disliking index-time boosting, I dislike limiting features even more.

Index time boosting and aggs are unrelated. This is actually to do with whether field length norms are written for a not_analyzed field or not. Field length norms are for full text relevance calculations (but index time boosting is hacked in by adjusting the norm). I would argue that enabling norms on keyword (not_analyzed) fields counts as spooky action at a distance.

@jpountz
Contributor

jpountz commented Feb 29, 2016

I doubt any other field types support index time boosting?

Actually most fields we have do, for instance numerics. Maybe we could look into applying mapping boosts as query-time boosts instead of index-time boosts so that they would not implicitly enable norms? (This would not change how _all works).

@clintongormley
Author

Actually most fields we have do, for instance numerics. Maybe we could look into applying mapping boosts as query-time boosts instead of index-time boosts so that they would not implicitly enable norms? (This would not change how _all works).

that could work

@jimczi
Contributor

jimczi commented Feb 29, 2016

Maybe we could look into applying mapping boosts as query-time boosts instead of index-time boosts so that they would not implicitly enable norms?

What would be the solution then? I see two use cases for index time boosting: the first one is to boost documents independently in the same field, and the other one is to boost fields inside a document independently when multi fields are indexed into the same field (the _all case). The first use case is implemented via the norms and the second one via the payloads. I don't see how we could implement it at query time if we don't index the boost value. @jpountz am I missing something?

@jpountz
Contributor

jpountz commented Feb 29, 2016

That sounds right to me! The issue I think is that some users might also be using mapping boosts as a way to tell elasticsearch that some fields are more important for relevance than others (for multi-field queries). But this is more a job for query-time boosting I think, hence the proposal to not apply the mapping boosts via Field.setBoost like today but in MappedFieldType.termQuery via a BoostQuery?

@rashidkpc

Should this be marked 5.0.0 and breaking now? @clintongormley

@rcrezende

+1

@honzakral
Contributor

I understand the split but I feel it doesn't support a lot of useful options around using keyword fields with different analyzers. Is there a plan to support other analyzers/tokenizers than keyword?

I have several use cases for using aggregations on a result of path_hierarchy tokenizer (urls, directories, categories in e-commerce, ...), or ICU analyzers to normalize different unicode options etc.

Currently it is impossible to do that without resorting to fielddata (which I would really like to avoid) or storing the field twice and using ingest-time denormalization (essentially duplicating the job of the analyzer), which doesn't seem clean.

@jpountz
Contributor

jpountz commented Mar 22, 2016

Is there a plan to support other analyzers/tokenizers than keyword?

Yes but this will likely be implemented in 5.x rather than 5.0.

I have several use cases for using aggregations on a result of path_hierarchy tokenizer (urls, directories, categories in e-commerce, ...), or ICU analyzers to normalize different unicode options etc.

ICU normalization is typically the kind of thing we want to allow with this keyword analyzer option. I'm less sure about path_hierarchy however, since it generates multiple tokens for a single input string, but I see your point.

jpountz added a commit to jpountz/beats that referenced this issue Mar 24, 2016
…pcoming elasticsearch 5.0.

Elasticsearch 5.0 has refactored mappings as described in
elastic/elasticsearch#12394. Most notably the string field has been replaced
by the `text` and `keyword` fields.
@jpountz
Contributor

jpountz commented Mar 30, 2016

This has been implemented with some differences to the original proposal:

  • regex fielddata filters are not supported and will be removed on upgrade
  • frequency-based fielddata filters are specified with fielddata_frequency_filter
  • the keyword field does not support the analyzer for now (expected to come in a later 5.x release)
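Under the implemented syntax, enabling fielddata on a text field with a frequency filter looks something like this (the parameter values are illustrative, not recommendations):

```json
{
  "properties": {
    "body": {
      "type": "text",
      "fielddata": true,
      "fielddata_frequency_filter": {
        "min": 0.001,
        "max": 0.1,
        "min_segment_size": 500
      }
    }
  }
}
```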

@boicehuang
Contributor

boicehuang commented Aug 10, 2018

I wonder why running a terms aggregation on a keyword field with doc_values enabled consumes fielddata cache — this was unexpected for me. It is easy to reproduce with ES version 5.6.4 or 6.2.3. Can anyone help me? @jpountz @clintongormley @elasticmachine

@jpountz
Contributor

jpountz commented Aug 13, 2018

This is because of global ordinal maps: terms aggregations build global ordinals on top of the per-segment doc values, and their memory usage is accounted for in the fielddata cache.
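If that build cost shows up at search time, one option (available in later releases) is to build global ordinals eagerly at refresh instead of lazily on the first aggregation — a sketch, assuming a keyword field named `status`:

```json
{
  "properties": {
    "status": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}
```

This trades slower refreshes for faster first queries, so it mainly suits search-heavy, infrequently-updated indices.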
