
String field mappings and fielddata / doc values settings #12394

Closed
clintongormley opened this issue Jul 22, 2015 · 41 comments
Assignees
Labels
discuss Meta :Search/Mapping Index mappings, including merging and defining field types

Comments

@clintongormley

This issue addresses a few topics:

`string` → `text` / `keyword`

Today, we use string both for full-text and for structured keywords. We don't support doc-values on analyzed string fields, which means that strings which are essentially keywords (but eg need to be lowercased) cannot use doc-values.

Proposal:

  • deprecate string fields
  • add text fields which support the full analysis chain and don't support doc-values
  • add keyword fields which support only the keyword tokenizer, and have doc-values enabled by default
  • change index to accept true | false

Question: Should keyword fields allow token filters that introduce new tokens?
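Under this proposal, an explicit mapping for the two new field types might look like this (a sketch of the proposed syntax, not the final implementation; field names are illustrative):

```json
{
  "properties": {
    "title":  { "type": "text" },
    "status": { "type": "keyword" }
  }
}
```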

Deprecating fielddata for fields that support doc values

In-memory fielddata is limited by the size of the heap, and has been one of the biggest pain-points for users. Doc-values are slightly slower but: (1) don't suffer from fielddata's initial loading latency, (2) are not limited by heap size, (3) don't impact garbage collection, (4) allow much greater scaling.

All fields that support doc values already have them enabled by default.

Proposal:

  • Deprecate fielddata implementations (except for analyzed string fields) in 2.x
  • Remove them in 3.x.

The question arises: what happens if the user disables doc values then decides that actually they DO want to aggregate on that field after all? The answer is the same as if they have set a field to index:false - they have to reindex.
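For example, a user who knows up front that they will never sort or aggregate on a field could disable doc values in the mapping (a sketch, using the proposed syntax; the field name is illustrative) — re-enabling them later would require a reindex:

```json
{
  "properties": {
    "session_id": {
      "type": "keyword",
      "doc_values": false
    }
  }
}
```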

Fielddata and doc values settings

Today we have these settings:

  • doc_values: true|false
  • fielddata.format: disabled | doc_values | paged_bytes | array
  • fielddata.loading: lazy | eager | eager_global_ordinals
  • fielddata.filters: frequency:{}, regex:{}

These settings become much simpler if we deprecate fielddata for all but analyzed string fields.

Proposal for fields that support doc values:

  • doc_values : true (default) | false
  • global_ordinals : lazy (default) | eager

Proposal for analyzed string fields:

  • fielddata: disabled (default) | lazy | eager
  • global_ordinals : lazy (default) | eager
  • fielddata.filters: frequency:{}, regex:{}

If, in the future, we can automatically figure out which global ordinals need to be built eagerly, then we can remove the global_ordinals setting.
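Putting the two proposals together, a keyword-style field and an analyzed text field might be configured like this (a sketch of the proposed settings, not final syntax; field names and filter values are illustrative):

```json
{
  "properties": {
    "tag": {
      "type": "keyword",
      "doc_values": true,
      "global_ordinals": "eager"
    },
    "body": {
      "type": "text",
      "fielddata": "lazy",
      "fielddata.filters": { "frequency": { "min": 0.001 } }
    }
  }
}
```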

Norms settings

Similar to the above, we have:

  • norms.enabled : true | false
  • norms.loading : lazy | eager

In Lucene 5.3, norms are disk based, so the lazy/eager issue is less important (eager in this case would mean force-loading the norms into the file system cache, a decision which we can probably make automatically in the future).

Proposal:

  • norms: true | false
  • only supported on text fields
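A text field that is only matched on but never scored (e.g. a log message) could then drop norms with a single flag (a sketch of the proposed syntax; the field name is illustrative):

```json
{
  "properties": {
    "message": {
      "type": "text",
      "norms": false
    }
  }
}
```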

Good out-of-the-box dynamic mappings for string fields

Today, when we detect a new string field, we add it as an analyzed string, with lazy fielddata loading enabled. While this allows users to get going with full text search, sorting and aggregations (with limitations, eg new + york), it's a poor default for heap usage.

Proposal:

Add a text main field (with fielddata loading disabled) and a keyword multi-field by default, ie:

{
  "my_string": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

With the default settings these fields would look like this:

{
  "my_string": {
    "type":                "text",
    "analyzer":            "default",
    "boost":               1,
    "fielddata":           "disabled",
    "fielddata_filters":   {},
    "ignore_above":        -1,
    "include_in_all":      true,
    "index":               true,
    "index_options":       "positions",
    "norms":               true,
    "null_value":          null,
    "position_offset_gap": 0,
    "search_analyzer":     "default",
    "similarity":          "default",
    "store":               false,
    "term_vector":         "no"
  }
}

{
  "my_string.keyword": {
    "type":                "keyword",
    "analyzer":            "keyword",
    "boost":               1,
    "doc_values":          true,
    "ignore_above":        256,
    "include_in_all":      true,
    "index":               true,
    "index_options":       "docs",
    "null_value":          null,
    "position_offset_gap": 0,
    "search_analyzer":     "keyword",
    "similarity":          "default",
    "store":               false,
    "term_vector":         "no"
  }
}
@clintongormley clintongormley added v2.0.0-beta1 discuss :Search/Mapping Index mappings, including merging and defining field types Meta labels Jul 22, 2015
@jpountz
Contributor

jpountz commented Jul 22, 2015

+1

One issue with the default dynamic mappings that you propose is that we would fail at indexing large string fields given that the keyword mapping could generate large tokens while Lucene refuses to index tokens that are greater than 32k. So we might want to set ignore_above to a reasonable value, eg. 256?

@clintongormley
Author

++ I meant to mention that. I'll update the description and add it

@rmuir
Contributor

rmuir commented Jul 22, 2015

+1

@javanna
Member

javanna commented Jul 22, 2015

+1 !!!

@gmoskovicz
Contributor

This is a big win. It makes strings much clearer, and should make for faster default behaviour!

@imotov
Contributor

imotov commented Jul 22, 2015

I am ±0 on this one. I think the change is useful but I feel like it's solving the wrong problem. I think what we really need is a mechanism to define custom types. So, users could define "text" the way they want with reasonable defaults in a single place and then just refer to this type everywhere in a mapping. We can still pre-populate a set of default custom types such as the "text" and "keyword" proposed above, but this should be done through the same mechanism that is available to users, who should be able to redefine these custom types.

@colings86
Contributor

+1 to this issue, it will make string settings much easier for users to understand.

@pickypg
Member

pickypg commented Jul 22, 2015

Huge +1. This simplifies so much and it does what you want out of the box.

@imotov I think that's an interesting point, but this significantly improves what is currently happening/allowed with not_analyzed strings (e.g., adding a filter). Perhaps your idea alone should be a separate feature? I like the concept of globally controlled type defaults.

@dakrone
Member

dakrone commented Jul 22, 2015

+1 on this, I think if implemented properly it could lead to possibly adding something like @imotov suggested in the future, but for now just starting with these two is a good idea.

@markwalkom
Contributor

+1

@mbonaci

mbonaci commented Jul 29, 2015

+[insert some ridiculously large number]
Do you still expect this to land in 2.0?

@clintongormley
Author

i hope so!

@mattweber
Contributor

@clintongormley What is the current thinking for:

Question: Should keyword fields allow token filters that introduce new tokens?

Is there a way to know if a token filter outputs more than a single value?

@clintongormley
Author

Hi @mattweber - long time!

@rmuir can confirm but I believe it is not a problem, in the same way that a numeric field can have multiple values. What we need to avoid is proper full text tokenization.

@mattweber
Contributor

@clintongormley Thanks. I was thinking token filters like (edge)ngram and word delimiter might be an issue. Probably rare though.

@rmuir
Contributor

rmuir commented Aug 5, 2015

No, we don't need to allow filters that have multiple values. That is really really trappy!

Yes, in lucene we annotate filters that are "safe". This can also be used to make things like wildcard "analysis" more intuitive. But elasticsearch does not make use of this right now.

@rayward

rayward commented Aug 5, 2015

😍 +1

@ariasdelrio

+1

@clintongormley
Author

if people sort/aggregate on analyzed fields (I do that sometimes!) they are informed

The plan is to disable aggs on analyzed fields by default and to throw an exception if the user tries to do so. I believe this is the right thing to do, as it is much more common for an unaware user to sort/aggregate on the wrong field (full text instead of raw) and blow up their heap. Aggs can be enabled on analyzed fields on a live index by updating the mappings, so it is quite easy for users who know what they are doing to resolve the situation to their liking.

Though I agree in disliking index-time boosting, I dislike limiting features even more.

Index time boosting and aggs are unrelated. This is actually to do with whether field length norms are written for a not_analyzed field or not. Field length norms are for full text relevance calculations (but index time boosting is hacked in by adjusting the norm). I would argue that enabling norms on keyword (not_analyzed) fields counts as spooky action at a distance.

@jpountz
Contributor

jpountz commented Feb 29, 2016

I doubt any other field types support index time boosting?

Actually most fields we have do, for instance numerics. Maybe we could look into applying mapping boosts as query-time boosts instead of index-time boosts so that they would not implicitly enable norms? (This would not change how _all works).

@clintongormley
Author

Actually most fields we have do, for instance numerics. Maybe we could look into applying mapping boosts as query-time boosts instead of index-time boosts so that they would not implicitly enable norms? (This would not change how _all works).

that could work

@jimczi
Contributor

jimczi commented Feb 29, 2016

Maybe we could look into applying mapping boosts as query-time boosts instead of index-time boosts so that they would not implicitly enable norms?

What would be the solution then? I see two use cases for index time boosting: the first one is to boost documents independently in the same field, and the other one is to boost fields inside a document independently when multi fields are indexed into the same field (the _all case). The first use case is implemented via the norms and the second one via the payloads. I don't see how we could implement it at query time if we don't index the boost value. @jpountz am I missing something?

@jpountz
Contributor

jpountz commented Feb 29, 2016

That sounds right to me! The issue I think is that some users might also be using mapping boosts as a way to tell elasticsearch that some fields are more important for relevance than others (for multi-field queries). But this is more a job for query-time boosting I think, hence the proposal to not apply the mapping boosts via Field.setBoost like today but in MappedFieldType.termQuery via a BoostQuery?

@rashidkpc

Should this be marked 5.0.0 and breaking now? @clintongormley

@rcrezende

+1

@honzakral
Contributor

I understand the split but I feel it doesn't support a lot of useful options around using keyword fields with different analyzers. Is there a plan to support other analyzers/tokenizers than keyword?

I have several use cases for using aggregations on a result of path_hierarchy tokenizer (urls, directories, categories in e-commerce, ...), or ICU analyzers to normalize different unicode options etc.

Currently it is impossible to do that without resorting to fielddata (which I would really like to avoid) or storing the field twice and using ingest-time denormalization (essentially duplicating the job of the analyzer), which doesn't seem clean.

@jpountz
Contributor

jpountz commented Mar 22, 2016

Is there a plan to support other analyzers/tokenizers than keyword?

Yes but this will likely be implemented in 5.x rather than 5.0.

I have several use cases for using aggregations on a result of path_hierarchy tokenizer (urls, directories, categories in e-commerce, ...), or ICU analyzers to normalize different unicode options etc.

ICU normalization is typically the kind of thing we want to allow with this keyword analyzer option. I'm less sure about path_hierarchy however, since it generates multiple tokens for a single input string, but I see your point.

jpountz added a commit to jpountz/beats that referenced this issue Mar 24, 2016
…pcoming elasticsearch 5.0.

Elasticsearch 5.0 has refactored mappings as described in
elastic/elasticsearch#12394. Most notably the string field has been replaced
by the `text` and `keyword` fields.
@jpountz
Contributor

jpountz commented Mar 30, 2016

This has been implemented with some differences to the original proposal:

  • regex fielddata filters are not supported and will be removed on upgrade
  • frequency-based fielddata filters are specified with fielddata_frequency_filter
  • the keyword field does not support the analyzer for now (expected to come in a later 5.x release)
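Under the implemented syntax, enabling fielddata on a text field with a frequency filter looks something like this (the parameter values are illustrative, not recommendations):

```json
{
  "properties": {
    "body": {
      "type": "text",
      "fielddata": true,
      "fielddata_frequency_filter": {
        "min": 0.001,
        "max": 0.1,
        "min_segment_size": 500
      }
    }
  }
}
```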

@boicehuang
Contributor

boicehuang commented Aug 10, 2018

I wonder why running a terms aggregation on a keyword field with doc_values enabled consumes fielddata cache — this was unexpected for me. It is easy to reproduce with ES version 5.6.4 or 6.2.3. Can anyone help me? @jpountz @clintongormley @elasticmachine

@jpountz
Contributor

jpountz commented Aug 13, 2018

This is because of global ordinal maps: terms aggregations build global ordinals on top of the per-segment doc values, and their memory usage is accounted for in the fielddata cache.
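If that build cost shows up at search time, one option (available in later releases) is to build global ordinals eagerly at refresh instead of lazily on the first aggregation — a sketch, assuming a keyword field named `status`:

```json
{
  "properties": {
    "status": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}
```

This trades slower refreshes for faster first queries, so it mainly suits search-heavy, infrequently-updated indices.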
