Formalize dual text/keyword mappings #53181

Open
jpountz opened this issue Mar 5, 2020 · 29 comments
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team

Comments

@jpountz
Contributor

jpountz commented Mar 5, 2020

Our default dynamic mapping rules create both a text and a keyword field whenever they encounter a JSON string:

{
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}

And over the years, many clients implemented similar logic:

  • when an exact query is fired, use the keyword field,
  • when an aggregation is used, use the keyword field,
  • otherwise use the text field.
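
The client-side convention above can be sketched roughly as follows. This is only an illustration: `pick_field` and its `usage` labels are hypothetical names, and it assumes the default dynamic mapping, in which the sub-field is called `keyword`.

```python
def pick_field(field, usage, has_keyword_subfield=True):
    """Return the field name a client would target for a given usage.

    `usage` is a hypothetical label: "exact", "aggregation" or "full_text".
    """
    # Exact queries and aggregations go to the keyword sub-field when it exists.
    if usage in ("exact", "aggregation") and has_keyword_subfield:
        return field + ".keyword"
    # Everything else goes to the text field itself.
    return field
```

For example, `pick_field("message", "aggregation")` yields `message.keyword`, while `pick_field("message", "full_text")` yields `message`.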

Is it logic that we should embed in Elasticsearch? Maybe we can find better ideas, but here is a proposal to get the discussion started:

  • Create a new exact_match query, which tries to match against the whole string. It fails for text fields and has the same behavior as match on keyword, numbers, ...
  • Create a new text_keyword field, which is essentially a wrapper around a text and a keyword field. Running aggregations or an exact_match query against this field uses the keyword sub-field, while match, query_string, multi_match and simple_query_string queries use the text field.
  • Update default dynamic mappings to create this field for strings instead of the current text + sub keyword mapping.
@jpountz jpountz added :Search Foundations/Mapping team-discuss labels Mar 5, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Mapping)

@jpountz
Contributor Author

jpountz commented Mar 10, 2020

Relates to #53020

@epixa
Contributor

epixa commented Apr 12, 2020

Would the solution put into place to address text/keyword multi-fields also handle wildcards? In discussions I've seen it assumed that it would, but I wanted to clarify.

@jpountz
Contributor Author

jpountz commented Apr 12, 2020

Hopefully wildcard is going to be less subject to this issue, as I can't think of many reasons to map a field both as wildcard and keyword, since they support the same operations, or as wildcard and text, since wildcard can do infix search already. That said, if you have plans to use multi-fields with a wildcard field, we'd be interested to know more.

ECS has some fields that are mapped as keyword / text that we plan to migrate as keyword. I believe we'll need to think about dedicated migration paths for this case since removing the text multi-field would be a breaking change. For instance, one idea that has been raised is whether wildcard fields could create a virtual text subfield that emulates the behavior of a text field to ease this transition. This requires more thought and we might end up with a different approach but I thought sharing this example would help explain the kind of solution that we're considering.

@epixa
Contributor

epixa commented Apr 13, 2020

Are wildcard fields just a superset of keyword in terms of features? I may have been mistaken, but I thought simply switching from keyword to wildcard would result in feature loss for aggregations and such. Admittedly I don't have a concrete example.

@jpountz
Contributor Author

jpountz commented Apr 14, 2020

@epixa As we've been thinking about how to migrate from existing mappings to wildcard, we agreed that it would be much easier if wildcard supported the same operations as keyword, with the same semantics. It supports match queries, terms aggregations, sorting, etc. with the same semantics as keyword. However, it takes a different approach to indexing that makes it slower at exact queries and aggregations, but faster at wildcard/regexp queries on high-cardinality fields.

@webmat

webmat commented Apr 14, 2020

I'd like to point out that ECS went with the reverse convention for indexing strings. Since ECS started around monitoring, rather than full text search, the default datatype is keyword for string fields. Only a few fields then have a .text multi-field added (less than 20, iirc).

I'm pointing this out because here we're talking about potentially building a shorthand notation that encodes the Elasticsearch default. As the proposal stands, it couldn't be used by users who are trying to build ECS-compatible indices.

  • Update default dynamic mappings to create this field for strings instead of the current text + sub keyword mapping.

I'm not sure I understand the 3rd point in the body of the issue. "This field": are we talking about wildcard?

@colings86
Contributor

@webmat The idea here would be that ECS would not need to define multi-fields at all. ECS would define the field type as text_keyword (or whatever name we come up with) with no multi-fields. Internally, Elasticsearch would handle the fact that there is a text field and a keyword field underlying the field type, but to the user it would appear as one field with no multi-fields. Users should not need to worry about whether they need the keyword or text form of the field: they reference the field in one way, and Elasticsearch figures out which underlying field (text or keyword) is the right one to use, depending on whether it's used in an aggregation, in a free text query, in an exact match query, etc.

@webmat

webmat commented Apr 16, 2020

This makes sense, and would indeed be a good simplification. But this would force all string fields defined this way to be indexed both ways?

ECS followed the Beats convention of trying to do keyword only as much as possible, for performance reasons.

@rjernst rjernst added the Team:Search label May 4, 2020
@mayya-sharipova
Contributor

Instead of introducing a new exact_match query, can we use term query for exact matching and make term query fail on text fields with a message like "use match query instead"?


I like the idea of having a new text_keyword field, which is a wrapper around a text and a subfield keyword field and queries/aggs are delegated to one of those fields automatically.

Another way to organize text_keyword is to index both the exact form and the tokenized form into the same Lucene field. This is inspired by the new wildcard field. The exact form can be surrounded by some fake symbols, e.g. "0<exact_value>00". If the exact and tokenized forms are the same, we can keep only the exact value, which will allow us to save space for the many solution fields that are mostly single-valued.
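
A rough sketch of the single-field idea, under stated assumptions: the marker characters, `toy_analyze` and `terms_for_single_field` are all hypothetical, and the toy analyzer merely lowercases and splits on non-alphanumeric characters. It is not a description of any existing implementation.

```python
import re

# Hypothetical marker characters standing in for the "fake symbols".
START, END = "\u0000", "\u0001"


def toy_analyze(value):
    """Toy stand-in for a text analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def terms_for_single_field(value):
    """Return the terms that would go into the one shared field."""
    exact = START + value + END  # exact form, wrapped in markers
    tokens = toy_analyze(value)
    if tokens == [value]:
        # Analysis leaves the value unchanged: keep only the exact form.
        return [exact]
    return [exact] + tokens
```

With this sketch, `"nginx"` is stored once (wrapped form only), while `"Disk full"` is stored as the wrapped exact form plus the tokens `disk` and `full`. A full-text lookup would presumably have to check the wrapped form as well, which is part of what would need to be worked out.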

@jpountz
Contributor Author

jpountz commented May 5, 2020

this would force all string fields defined this way to be indexed both ways

Only if you want to support both text search (provided by text) and exact match/sorting/aggregations (provided by keyword).

can we use term query for exact matching and make term query fail on text fields with a message like "use match query instead"?

We could.

index both the exact form and tokenized form into the same Lucene field

The space saving idea is interesting. I wonder if that would cause problems. Preserving support for scoring and multi-term queries would be challenging but I believe it could work?

A problem with the proposal of this issue that we identified when discussing the wildcard field is that we might need variants of the fuzzy, wildcard, regexp, ... queries as well if we want to be consistent in the way that we treat matching individual tokens vs. full values, which might not be scalable.

@mayya-sharipova
Contributor

A problem with the proposal of this issue that we identified when discussing the wildcard field is that we might need variants of the fuzzy, wildcard, regexp, ... queries as well if we want to be consistent in the way that we treat matching individual tokens vs. full values, which might not be scalable.

This is indeed a substantial problem, and it makes this proposal not worth it.

fuzzy, wildcard, regexp

Speaking of these queries, if we go with a text field and a keyword subfield, do these queries apply only to the keyword subfield? Or will they be applied across the two fields (boolean OR/multi-match)?

Preserving support for scoring

It seems that most queries used in observability solutions are not concerned about textual scoring, but only filtering.


Another idea to optimize space could be to have a text field and a keyword subfield as we planned, always index a field value into the keyword field, but index it into the text field only if its analyzed version is different.
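
As a minimal sketch of that last idea, assuming a caller-supplied `analyze` function (hypothetical, as are the other names here) that returns a list of tokens:

```python
def fields_to_index(value, analyze):
    """Always index the keyword form; add the text form only if analysis changes the value."""
    fields = {"keyword": value}
    tokens = analyze(value)
    if tokens != [value]:
        fields["text"] = tokens
    return fields


def toy_analyze(value):
    """Toy analyzer for illustration: lowercase and split on whitespace."""
    return value.lower().split()
```

Here `fields_to_index("nginx", toy_analyze)` produces only the keyword entry, whereas `"Disk Full"` also gets a text entry with the tokens `disk` and `full`.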

@jpountz
Contributor Author

jpountz commented May 6, 2020

Speaking of these queries, if we go with a text field and a keyword subfield, do these queries apply only to the keyword subfield? Or will they be applied across the two fields (boolean OR/multi-match)?

This is the question that helped us discover this problem. :)

It seems that most queries used in observability solutions are not concerned about textual scoring, but only filtering.

Agreed.

@mayya-sharipova
Contributor

We had a team discussion, and we are in favour of proceeding with a text_keyword field:

  • this will be a single field on the elasticsearch side
  • internally it will be mapped to two Lucene fields: keyword and text field.
  • all term-level queries (term, fuzzy, wildcard, prefix, range, new exact_match query) will be delegated to the internal keyword field
  • all full-text queries will be delegated to the internal text field
  • aggs will be run on the internal keyword field
  • doc_values mapping parameter will be applied to the internal keyword field and can be disabled
  • index_options will be applied to the text field
  • enabled parameter will be applied to both fields (if a user needs to disable keyword or text, they should use traditional text/keyword fields).
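
The delegation rules in this list could be sketched as a simple lookup. This is an illustration only: `internal_field`, `PARAM_TARGETS` and the `"aggregation"` label are hypothetical names, and the set of full-text queries is taken from the original proposal at the top of this issue rather than from an exhaustive list.

```python
# Term-level queries named in the list above.
TERM_LEVEL = {"term", "fuzzy", "wildcard", "prefix", "range", "exact_match"}
# Full-text queries named in the original proposal (assumed, not exhaustive).
FULL_TEXT = {"match", "query_string", "multi_match", "simple_query_string"}

# Which internal field(s) each mapping parameter would apply to.
PARAM_TARGETS = {
    "doc_values": ("keyword",),
    "index_options": ("text",),
    "enabled": ("keyword", "text"),
}


def internal_field(operation):
    """Return the internal field a text_keyword field would delegate to."""
    if operation in TERM_LEVEL or operation == "aggregation":
        return "keyword"
    if operation in FULL_TEXT:
        return "text"
    raise ValueError("no delegation rule for " + repr(operation))
```

Anything not covered by the list, such as significant_terms, deliberately raises here, since its handling is one of the open questions.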

Some things still left for the discussion:

  • should we allow users access to the individual internal fields (for example, a user wants to run a term query on the internal text field)?
  • what about the significant_terms agg, which can potentially be run on either a text or a keyword field? Should we run it only on the keyword field?

@markharwood
Contributor

what about the significant_terms agg, which can potentially be run on either a text or a keyword field? Should we run it only on the keyword field?

The significant_text agg is designed for text fields (typically based on samples of top hits using sampler agg, re-analyzes source on the fly).
The significant_terms agg typically targets keyword fields (accessing doc values). It can target text fields too but we advise against it because it requires fielddata.

I'd be happy to see us formalise these patterns by making them only target their respective field types.

@mayya-sharipova mayya-sharipova self-assigned this Jun 1, 2020
@jtibshirani
Contributor

I had a couple questions about the proposal:

  • I'm curious about example cases where we see text_keyword being useful. The newly-introduced wildcard field might cover many cases in which the text + keyword multifield was previously helpful. (Perhaps we just really like the out-of-the-box experience it provides and want to use text_keyword as the dynamic mapping type for strings?)
  • I wonder if we should be adding another string field type whose use + configuration overlaps a lot with existing ones. I'm worried that it's slowly becoming overwhelming for users to understand all the field types we offer and which ones to use to model their data. Did we consider alternatives like adding an option on the text field type, something like index_exact?

@webmat

webmat commented Jun 1, 2020

I think @jtibshirani makes a good point. In ECS we added a .text multi-field to some keyword fields as a way to work around not having wildcard and query-time case insensitivity. When both of these are widely available, I think the need for these multi-fields in ECS will mostly go away.

One place I think indexing both as keyword and text is still very useful is for initial dataset exploration, however.

@astefan
Contributor

astefan commented Jun 2, 2020

Another interesting point that @jtibshirani is bringing forth, and which I've been thinking about as well, is the possibly overwhelming and growing list of field types we offer. I think I understand the need for the new field types, and I've been loosely monitoring their addition more or less from a user's point of view (finding out a new field type is being added, understanding the need for it on the surface and then just looking at the docs). And there are a lot of "specialized" field types out there, the later ones being added more from an "internal" usage need imho.

I'd be curious if anyone else is thinking the same ^ and if we could better handle the way users look at our growing list of field types in the future. An example would be the way we document the field types. Now, almost all non-core field types are under the "specialized" section. I would argue that the IP field, for example, shouldn't be in the "specialized" section, but maybe in the "core" one. It has a long history, it's fairly easy to understand and it doesn't require an edge case scenario to be used (like it happens with most of the other specialized field types). I would even push this further and suggest a new field types section - "Advanced", maybe - where flattened, constant_keyword, histogram... should be moved.

@markharwood
Contributor

For me the biggest shift I've seen in requirements is the move away from traditional ideas of indexing human-authored text to indexing machine-generated text.
Traditionally tokenisation was useful normalisation that did all of the following:

  1. Breaking strings into words by splitting on punctuation
  2. Lowercasing
  3. Removing plurals, past tense etc (ie stemming)
  4. Injecting useful synonyms.
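
To make the four steps concrete, here is a toy pipeline. It is only a sketch: `naive_stem` is a crude suffix stripper rather than a real stemmer, the synonym table is invented for the example, and none of this mirrors an actual Elasticsearch analyzer.

```python
import re

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"error": ["failure"]}


def naive_stem(token):
    """Very crude suffix stripping; a real stemmer is far more careful."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    # 1. Break the string into words by splitting on punctuation and whitespace.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    # 2. Lowercase.
    tokens = [t.lower() for t in tokens]
    # 3. Strip plurals, past tense, etc.
    tokens = [naive_stem(t) for t in tokens]
    # 4. Inject synonyms next to the tokens they expand.
    result = []
    for t in tokens:
        result.append(t)
        result.extend(SYNONYMS.get(t, []))
    return result
```

On prose, `analyze("Errors reported")` gives `["error", "failure", "report"]`. On machine-generated text, `analyze("java.lang.NullPointerException")` gives `["java", "lang", "nullpointerexception"]`: the dots and the original casing are gone from the indexed terms.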

While useful on prose, none of the above is helpful when searching stacktraces, weblogs etc.
We just want matching on arbitrary parts of character sequences - which is where the wildcard field comes in. It marks a break with token-based matching. Users no longer need to think of the indexed terms defined by a choice of Analyzer (whose logic is often a black box to most).

This distinction between indexing for prose and indexing for exact-matching is perhaps the biggest change to reflect in our mapping choices.

@jtibshirani
Contributor

jtibshirani commented Jun 2, 2020

@astefan I also think improving the documentation on field types could be a big help to our users. I filed #57548 based on your thoughts -- we can continue the discussion there to keep this issue focused on text/keyword mappings.

@mayya-sharipova
Contributor

I'm curious about example cases where we see text_keyword being useful. The newly-introduced wildcard field might cover many cases in which the text + keyword multifield was previously helpful. (Perhaps we just really like the out-of-the-box experience it provides and want to use text_keyword as the dynamic mapping type for strings?)

@jtibshirani Indeed, wildcard is very useful for ECS and logging solutions, but being a special "keyword" type, it doesn't deal with full-text search. I think the main goal of a new text_keyword field is to replace the text/keyword multi-field in the dynamic string mapping (as you correctly noticed). The benefits of this are the following:

  • Make automatic decisions about which sub-field to use for term queries, full-text queries and aggs, instead of asking the user to decide in Kibana or through the query DSL. This will make Elasticsearch easier to adopt for new users.
  • Ease the pain of dealing with multi-fields. New users get confused about what the X.keyword field is. Solutions (which may still contain some multi-fields even if most fields are indexed as wildcard) have to decide which field of a multi-field to use (e.g. SQL *, = operators); these decisions would be handled on the Elasticsearch side.

@jtibshirani
Contributor

I think the main goal of a new text_keyword field is to replace the text/keyword multi-field in the dynamic string mapping (as you correctly noticed). The benefits of this are the following...

Thanks @mayya-sharipova, this makes sense! If this is the main use case it would be nice to verify that we plan to make this change (starting to use a combined text/keyword for dynamic string mappings, instead of say switching to wildcard).

@mayya-sharipova
Contributor

We had a discussion within the search team and have decided the following:

I am closing this issue because:

  • we don't plan to introduce a new exact_match query
  • with the new multi-fields info in field caps it will be easier to forward queries/aggs to the corresponding fields; this makes the proposal for a joint text_keyword field less attractive. We may later reconsider a text_keyword field type if it brings savings in space.

@jimczi
Contributor

jimczi commented Dec 7, 2020

I am reopening this issue since we think that the feature request is still valid and could be beneficial for some use cases.
The dynamic mapping that creates two fields (one text and one sub-field named keyword) can be confusing for users, so we'd prefer to have a single field that knows how to behave depending on the context. That would be simpler than exposing multi-field information in field_caps, since Kibana or SQL for instance wouldn't need to implement any logic.
They would just pass the new field and Elasticsearch would apply different extraction logic based on the context (exact match query, full text query, aggregations, ...).

@javanna
Member

javanna commented Jun 30, 2022

We've re-discussed this with the team and concluded that this is something that we should put on our radar.

While multi-fields are powerful and useful for a number of use cases, we would not use the text/keyword multi-field in our default dynamic mappings for strings if we could go back: it forces consumers to decide which of the two field variants to access, pushing onto users complexity that could be handled transparently and that only causes confusion.

@ruflin
Member

ruflin commented Aug 18, 2022

I'm coming to this discussion from trying to make it easy to enable ECS in data streams in Elasticsearch without having to add all the fields: basically, Elasticsearch knowing about ECS or setting the right mappings by default. Because of this I looked at all the ECS fields, and what I stumbled over are the rather awkward keyword / text multi-fields. Let's take organization.name as an example. Many users are happy having this as just a keyword; some might want to have it as text too. A user querying on it should not have to figure out that they need to query organization.name.text instead of organization.name.

In the context of ECS, organization.name is a keyword. Users can decide to use keyword_text for any field, not only the ones that ECS defines, and the queries stay the same.

Two things I wonder about: how would this apply to synthetic source and runtime fields? @nik9000

I wrote keyword_text on purpose, as keyword is our default for all the observability data and only some fields are also text, but that is probably a nitpick.

@javanna
Member

javanna commented Aug 18, 2022

Synthetic source already loads from the keyword sub-field if available when it encounters a text field. Also support for fields that are stored separately is coming.

In the context of runtime fields, we have seen users stumble upon the need for picking the right sub-field in scripts. They commonly forget the .keyword suffix and they get an error when doing so. Here too, we could make it easier for them. We are making it easier with #81246 by loading from _source automatically for any text field, but I feel like this issue is the next step we should take in that direction.

Having a unified keyword_text field would abstract away the fact that there are two Lucene fields and allow Elasticsearch to automatically pick the right variant depending on the usage. Users would use a single name, and the order or name of the sub-fields would no longer matter. We could use such a field type in our Elasticsearch dynamic mappings (where we use text.keyword) as well as in ECS (where we use keyword.text).

@nik9000
Member

nik9000 commented Aug 18, 2022

Synthetic source already loads from the keyword sub-field if available when it encounters a text field. Also support for fields that are stored separately is coming.

#87480 covers stored text and keyword fields.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)
