
sub keyword field to string dynamic mappings - name and intent discussion #18195

Closed
djschny opened this issue May 7, 2016 · 13 comments
Labels
>docs General docs changes :Search/Mapping Index mappings, including merging and defining field types

Comments

@djschny
Contributor

djschny commented May 7, 2016

As discussed with @jpountz in #17188 (comment) opening up a separate ticket for discussion here.

Some items for consideration:

  • Defaulting this way will continue the pattern of users seeing increased disk utilization out of the box as they upgrade versions of elasticsearch
  • By using keyword for the multi-field name we are tightly coupling it to what tokenizer is used. For example if we ever rename the keyword tokenizer to noop (which I would love to see since it more accurately describes what it does and also is how we tend to explain it to folks) then the multi-field option.
@clintongormley clintongormley added discuss :Search/Mapping Index mappings, including merging and defining field types labels May 7, 2016
@clintongormley

In the original issue (#12394) I went into great detail to explain the reasoning behind this change, but to address your questions here:

Defaulting this way will continue the pattern of users seeing increased disk utilization out of the box as they upgrade versions of elasticsearch

In the past, the string field could be used for full text search and for aggregations, by loading all the terms into the heap in fielddata. The behaviour of these fields depended largely on the type of value that was specified, eg "The quick brown fox..." implied the use of full text search (but not aggregations or sorting), while "London" might be a single identifier used for single-term lookups, aggregations and sorting. But "New York", which is probably intended for the second use case could actually only be used for the first.
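To make the "New York" problem concrete, here is a rough sketch in Python. It is not Elasticsearch's analysis chain: `analyzed_terms` is a hypothetical stand-in that only lowercases and splits on non-word characters, and `keyword_terms` is a hypothetical stand-in for indexing the whole value as one term.

```python
import re

def analyzed_terms(value):
    """Hypothetical approximation of full text analysis: lowercase, then split into words."""
    return re.findall(r"\w+", value.lower())

def keyword_terms(value):
    """Hypothetical approximation of keyword indexing: the whole value is one term."""
    return [value]

for city in ("London", "New York"):
    print(city, analyzed_terms(city), keyword_terms(city))
```

With the analyzed form, an aggregation would see "new" and "york" as separate terms, which is why a multi-word identifier only behaves as a single value when it is indexed whole.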

We can't deduce which use case a user intends when we receive a string field - it could be either. The solution for this is to provide a main text field for full text search (with fielddata disabled so that users don't unwittingly flood their heap by trying to run aggregations or sorting on that field), and a sub-field of type keyword for the single-term lookup, sorting, and aggregations use case.

The benefit of this is that, without any config, you get both access patterns for string fields out of the box. The downside is that you index string values twice. This is exactly the same pattern that Logstash has used for string fields for a long time so users of Logstash are unlikely to see any change.
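As an illustration, the default described above can be written out as a mapping fragment. This is a minimal sketch built as a Python dict; the helper name `default_string_mapping` is hypothetical, and the fragment shows only the two settings named in this comment (a text main field and a keyword sub-field).

```python
import json

def default_string_mapping(subfield_name="keyword"):
    """Hypothetical helper: mapping fragment for one dynamically mapped string field."""
    return {
        "type": "text",  # main field, used for full text search
        "fields": {
            # sub-field for single-term lookups, sorting and aggregations
            subfield_name: {"type": "keyword"}
        },
    }

mapping = default_string_mapping()
print(json.dumps(mapping, indent=2))
```

A field called `city` mapped this way would be queried as `city` for full text search and as `city.keyword` for aggregations.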

It is very easy to optimize disk space usage here: just map your fields as text or keyword, or add a dynamic mapping for text which specifies whether a field should be only text or only keyword.
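One way to express the second option is a dynamic template that matches every detected string and maps it to a single type. The sketch below builds such a fragment in Python; `dynamic_templates` and `match_mapping_type` are Elasticsearch mapping keys, while the helper name and the template name are made up for illustration. Where the fragment sits inside the full mapping depends on the Elasticsearch version and is not shown.

```python
import json

def strings_only_as(field_type):
    """Hypothetical helper: dynamic template mapping all new string fields to one type."""
    if field_type not in ("text", "keyword"):
        raise ValueError("field_type must be 'text' or 'keyword'")
    return {
        "dynamic_templates": [
            {
                "strings_as_" + field_type: {  # template name is arbitrary
                    "match_mapping_type": "string",
                    "mapping": {"type": field_type},
                }
            }
        ]
    }

keyword_only = strings_only_as("keyword")
print(json.dumps(keyword_only, indent=2))
```

With a template like this each string value is indexed once instead of twice, at the cost of losing the other access pattern for those fields.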

By using keyword for the multi-field name we are tightly coupling it to what tokenizer is used. For example if we ever rename the keyword tokenizer to noop (which I would love to see since it more accurately describes what it does and also is how we tend to explain it to folks) then the multi-field option.

No we aren't. This field is not named after the keyword analyzer, it is named after the field type keyword. The field type got its name in the same way as the keyword analyzer did: we don't want full text, we want to treat this value as a single keyword. What other name would you recommend to describe the datatype for this field?

And keyword fields in the future will not be restricted to the keyword analyzer. We will add support for limited analysis which allows, eg lowercasing or performing unicode normalization, or unicode collations.

For me, the only debate is whether this sub-field should be called keyword or raw, which is the name used today in Logstash. For bwc, raw would probably be better, but I think that keyword is more descriptive. My current feeling is that we should continue to use keyword. Logstash is free to keep their index template which uses raw instead.

@jpountz
Contributor

jpountz commented May 8, 2016

+1 to what Clinton said. The fact that we did not map strings both for text search and keyword search/aggs in the past caused bad out-of-the-box experiences since you almost certainly had to reindex once you realized that you could not aggregate on whole string values.

Regarding disk usage, it will be higher with default mappings for sure, but the problem is mitigated by the use of ignore_above: 256. There is a trade-off for sure, but I think having to reindex to run aggregations is more disappointing than higher-than-expected disk usage.
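To show what that setting does, here is a small sketch. The sub-field fragment uses the `ignore_above: 256` value quoted above; the function `indexed_in_keyword_subfield` is a hypothetical model of the rule that strings longer than the limit are skipped by the keyword sub-field, while the text field still indexes them.

```python
IGNORE_ABOVE = 256  # value quoted in this comment

# Keyword sub-field fragment with the length cap applied
keyword_subfield = {"type": "keyword", "ignore_above": IGNORE_ABOVE}

def indexed_in_keyword_subfield(value, ignore_above=IGNORE_ABOVE):
    """Hypothetical model: values longer than ignore_above characters are not indexed."""
    return len(value) <= ignore_above

short_value = "London"
long_value = "x" * 1000  # e.g. a long log message or document body

print(indexed_in_keyword_subfield(short_value))
print(indexed_in_keyword_subfield(long_value))
```

Short identifiers get the second copy and long free-text values do not, so the extra disk usage is bounded per value.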

However I'm also open to changing the name to either raw like logstash or original like @rjernst suggested. I have a slight preference for keyword though.

@clintongormley clintongormley added >docs General docs changes and removed discuss labels May 13, 2016
@clintongormley clintongormley self-assigned this May 13, 2016
@clintongormley

Discussed it in Fix it Friday - we prefer the keyword field. Logstash can continue to use raw with dynamic templates, should they so choose.

I will improve the docs to explain that we're optimising for the OOB experience, but disk usage can be improved with some simple mappings.

@djschny
Contributor Author

djschny commented May 19, 2016

What other name would you recommend to describe the datatype for this field?

not_tokenized

@jordansissel
Contributor

jordansissel commented Aug 4, 2016

Logstash can continue to use raw

Much of the road to 5.0 has been a theme of consistency. We've used raw for a long long time, and now are suddenly calling this thing keyword -- this is inconsistent. Logstash should not keep inconsistency and is looking at fixing that very soon, which is why I'm here talking about our new friend keyword. I do not believe Logstash can continue using raw because after 5.0 this becomes a user experience problem that ES uses keyword for strings where Logstash uses raw.

That said, for me personally, keyword is the wrong name. "United States" is two words, "San Jose Sharks" is three words, and yet the keyword name implies a singular word. A user agent string is even further from something I would consider a keyword and yet I use Logstash's raw feature to allow me to do aggregations on user agents. My chief concern on naming things is how much I expect it to confuse users.

With the hands-on-workshop, I teach people about analyzers/tokenizers by showing what happens to a string by default in Elasticsearch, then we talk about treating these entire strings as a single field value (or "term"). Because we're on the topic of analyzers, it is easy to say "We solve this by using this thing called not_analyzed, and logstash calls this field the 'raw' value". It is early for this keyword feature, but I have trouble coming up with such a story for teaching.

@dadoonet
Member

dadoonet commented Aug 5, 2016

And raw is a shorter name :)

I think consistency is a good point here.

But I'd like to be able to apply some token filters on this type of field at some point so I don't think that having "raw" + an analyzer would make sense in terms of meaning.
"Keyword" + an analyzer has more meaning IMO.

I think we should mark this discussion as a blocker for the next release because it will be hard to change after we release the beta.

@jordansissel
Contributor

jordansissel commented Aug 8, 2016

I've been thinking the past few days how to find a way to convince myself that keyword is the right name. Here is the story on how I can explain to myself why keyword might be the right name:

I thought keyword was poor because I view Elasticsearch field mappings as a way to say "The data is of this type". This worked well for me to understand and explain various obvious-to-me data types in Elasticsearch such as dates, longs, floats, strings, etc.

In this model, I was telling Elasticsearch what the data is, and trying to distinguish strings vs keyword vs text was not fitting my mental model.

The Elasticsearch documentation on mappings says this:

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.

In this description, it seems that the mapping is presented as how Elasticsearch uses the data, not what the data is. If I view things with the how in mind, instead of the what, I think keyword makes sense -- I can tell Elasticsearch how to treat something like "United States" (such as text or keyword).

The above explanation may be confusing, but I think I can use this model -- how instead of what -- to tell stories in trainings, etc, about reasons for using text vs keyword. "Treat it as a keyword", for example.

I am still nervous about the difficult schema change this will require on the Logstash side; in the battle for consistency, Logstash will want to change the multifield .raw to match what Elasticsearch uses: .keyword.

@jpountz
Contributor

jpountz commented Aug 9, 2016

If this proves to be a challenge to logstash, I'd personally be ok with keeping the field called keyword but having it named xxx.raw in the default mappings. Am I right to assume this is something you'd be happy with?

@cdahlqvist

There are a lot of users with massive amounts of data ingested through Logstash where the current .raw field convention is used. Changing the default from .raw has the potential to unnecessarily break a lot of systems and cause problems for users using the default templates or custom index templates based on these. Please take this into consideration before deciding to change the existing .raw field naming convention.

@jordansissel
Contributor

@cdahlqvist We're discussing the options and impacts of .raw vs .keyword over on logstash-plugins/logstash-output-elasticsearch#462

I have a rough draft of a proposal here: logstash-plugins/logstash-output-elasticsearch#462 (comment)

@jordansissel
Contributor

@jpountz I'd be OK having ES's default to xxx.raw, yes. The benefit there is to not divide users across the release boundary of 5.0 (new users and old users would both get .raw if we did this)

@clintongormley

@jordansissel I agree with the conclusion you reached in #18195 (comment) and I think that keyword is fundamentally the right name for this field (including for the reasons cited in #18195 (comment)). Long term it makes the purpose of the field easier to explain.

While I'm not completely against keeping the field as raw, I think that (unfettered by history) we'd choose keyword today instead.

All that said, I obviously recognise that this makes for a painful transition in Logstash. I don't have great suggestions for how to make this easier, but the options are probably as follows:

  • New users - use keyword from the outset
  • Existing users with custom templates - most of these won't be much impacted
  • Existing users with short retention periods - could use raw and keyword for the duration of the transition
  • Existing users with long retention periods - could change the template to just use raw going forwards
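
The short-retention option in the list above (raw and keyword side by side) could look roughly like the following. This is only a sketch: the helper name and template name are hypothetical, and the real Logstash index template is not reproduced here.

```python
import json

def transition_template(subfield_names=("raw", "keyword")):
    """Hypothetical helper: dynamic template with one keyword sub-field per given name."""
    return {
        "dynamic_templates": [
            {
                "strings_during_transition": {  # template name is arbitrary
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "text",
                        "fields": {
                            name: {"type": "keyword"} for name in subfield_names
                        },
                    },
                }
            }
        ]
    }

template = transition_template()
print(json.dumps(template, indent=2))
```

Passing `("raw",)` instead gives the long-retention variant that keeps only raw. Carrying both sub-fields means each string value is indexed three times, which is why it is suggested only for the duration of the transition.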

@jordansissel
Contributor

+1 clint's comments and keeping 'keyword'.

I think we can help users through this period of transition. It may be hard, but I think it's the right direction.



6 participants