Optionally use ECS conventions for dynamic mappings #85692

Closed
jpountz opened this issue Apr 5, 2022 · 20 comments
Labels
>feature :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team

Comments

@jpountz
Contributor

jpountz commented Apr 5, 2022

Description

The Elastic Common Schema (ECS) has naming and mapping conventions for an increasing set of fields. You might have already seen data sets that included fields called @timestamp, host.name or http.response.status_code; all of these fields are standardized in ECS.

Having normalized field names and mappings is in the best interest of users: it makes it easier to correlate events and metrics that come from different data sources. By the way, our integrations rely heavily on ECS.

One frustrating point is that even though some end users leverage the ECS logging library, which ensures that ECS field names get used (@timestamp for the time field, host.name for the name of the host, etc.), Elasticsearch does not always honor the mappings that ECS suggests for these fields, because Elasticsearch simply doesn't know about ECS. The only way users can work around this problem is by creating an index template themselves that includes ECS mappings for the fields that they are using. Note that it's not desirable to create an index template that includes a dynamic template for every possible ECS field, as there are now several thousand fields that are standardized in ECS.
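
A minimal sketch of that workaround, written as a Python dict that could be serialized and sent as the body of a `PUT _index_template/<name>` request. The index pattern and the choice of three fields are assumptions made for illustration, and the types assume the ECS definitions of these fields (date, keyword, long).

```python
import json

# Hypothetical index template body; the index pattern and the field
# selection are illustrative assumptions, not part of the original issue.
ecs_index_template = {
    "index_patterns": ["my-app-logs-*"],
    "template": {
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},
                "host": {"properties": {"name": {"type": "keyword"}}},
                "http": {
                    "properties": {
                        "response": {
                            "properties": {"status_code": {"type": "long"}}
                        }
                    }
                },
            }
        }
    },
}

# Print the JSON body that would be sent to Elasticsearch.
print(json.dumps(ecs_index_template, indent=2))
```

Every additional ECS field a user relies on needs another entry like these, which is the maintenance burden described above.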

Could we instead package ECS within Elasticsearch and introduce a new option for dynamic mappings to prefer ECS conventions for mappings when they exist? This would significantly simplify ingestion of custom sources of data that follow ECS conventions for field names, such as datasets produced by ECS logging.

For reference, it would also likely help simplify some of our integrations. Some sources of data have optional fields that might differ depending on vendors and other factors. Currently, the practice consists of creating index templates that include field mappings for every possible optional field, which results in pretty large templates such as the one for the Netflow integration, where lots of fields end up never being populated. If Elasticsearch had ECS built-in, these integrations could simply not map these optional vendor-specific fields and rely on Elasticsearch to map them automatically by following ECS conventions.

@jpountz jpountz added >feature :Search Foundations/Mapping Index mappings, including merging and defining field types team-discuss labels Apr 5, 2022
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Apr 5, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@zez3

zez3 commented Jun 20, 2022

This will simplify integration development a lot, not having to remember and correlate which type belongs to which field.

@javanna
Member

javanna commented Jul 4, 2022

We discussed this with the team, and later I had a chat with @ruflin about it. Will add some more info so we can discuss it further with the team.

ECS defines a set of around 100 basic fields that are shared across different datasets, but the total number of fields it defines is over 1000. ECS differs from the Elasticsearch default dynamic mappings as follows:

  1. different ip fields that can't be automatically detected, hence need to be mapped manually (Detect ip addresses as part of dynamic field mappings #64400)
  2. string fields that are mapped as keyword only instead of the text/keyword multi-field
  3. string fields that are mapped with a keyword/text multi-field instead of the default text/keyword multi-field (Formalize dual text/keyword mappings #53181 would help?)

The ip fields issue would be solved by adding auto-detection for ip fields (#64400), and the keyword/text multi-field by formalizing the default mappings for string fields (#53181), but while the existing dynamic mappings are type based (driven by the json type of each field being parsed), ECS is name based and there aren't necessarily common naming patterns between different fields of the same type. As a result, such fields need to be manually mapped. Dynamic templates can be used so that only fields that effectively appear in documents are mapped, but the dynamic templates for all those 1000+ fields take space in the mappings anyways, and for each field that effectively gets mapped, its mapping would then be effectively duplicated under the properties section.

Ideally, fields that don't appear in any document would not take any space in the mappings, and fields that do would be mapped, without appearing both under dynamic templates as well as under properties.
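
The duplication described above can be illustrated with a simplified Python model of name-based dynamic templates. This is only a sketch: the two templates are an arbitrary selection, the matching uses `fnmatch` rather than Elasticsearch's own pattern logic, dotted paths are stored flat under `properties`, and fields that match no template are simply ignored here instead of falling back to the default dynamic mappings.

```python
from fnmatch import fnmatchcase

# Two name-based dynamic templates; the field selection is illustrative.
mappings = {
    "dynamic_templates": [
        {"ecs_source_ip": {"path_match": "source.ip", "mapping": {"type": "ip"}}},
        {"ecs_host_name": {"path_match": "host.name", "mapping": {"type": "keyword"}}},
    ],
    "properties": {},
}


def apply_dynamic_templates(mappings, field_paths):
    """Simplified model: copy the first matching template's mapping into properties."""
    for path in field_paths:
        for template in mappings["dynamic_templates"]:
            _name, body = next(iter(template.items()))
            if fnmatchcase(path, body["path_match"]):
                mappings["properties"][path] = dict(body["mapping"])
                break
    return mappings


# A document that only contains source.ip is indexed.
apply_dynamic_templates(mappings, ["source.ip"])
```

After the call, `source.ip` is defined both in its dynamic template and under `properties`, while the `host.name` template still takes space even though that field never appeared.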

When it comes to maintaining these mappings, the 100 basic fields are well defined and never change. The remaining fields that are less commonly used may change, but mostly what happens is that additional fields are added.

@javanna
Member

javanna commented Jul 7, 2022

Adding some thoughts from another round of discussion with the team.

We are open to adding auto-detection for ip addresses, it would work similarly to how date auto-detection works. This is not a particularly popular feature request (surprisingly?). If we introduced it we may not want to enable it by default for backwards compatibility concerns, but then few users would benefit from it if it's opt-in, which is not ideal either. It was mentioned that ip auto-detection would solve a big part of the problems that triggered the creation of this very issue, but we could not exactly understand why: ip fields need to be mapped manually, the disadvantage is that incoming documents will hold only one or two of those while ECS mappings need to define all the possible ip fields to ensure that if they get indexed they get the right type. How many are there in total, out of curiosity? Would it be possible to define them in a dynamic template? Is there some common naming convention for ip fields?
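
To make the idea concrete, here is a hedged sketch of what value-based ip detection could look like, using Python's standard `ipaddress` module. The function names are hypothetical, and this does not describe an existing Elasticsearch feature; it only mirrors the way date detection inspects string values. The fallback to `text` is a simplification of the default string mapping.

```python
import ipaddress


def looks_like_ip(value):
    """Return True if a JSON string value parses as an IPv4 or IPv6 address."""
    if not isinstance(value, str):
        return False
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


def detect_type(value):
    # Hypothetical opt-in detection step, modelled loosely on date detection.
    return "ip" if looks_like_ip(value) else "text"
```

Because the decision is made from the first value seen, a field whose first value merely happens to look like an address would also be mapped as `ip`, which is one reason such detection would probably have to be opt-in.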

It was unclear how much addressing two of the three issues above (default mapping of text fields and ip auto detection) would help. It feels like the size of these mappings is the main factor for integrating the ECS conventions into Elasticsearch over using for instance dynamic templates to define their mappings, as a lot of these fields need to be mapped one-by-one. When it comes to cluster state size there are two aspects:

  1. the same mappings (including dynamic templates which are anyways part of the index mappings) are duplicated for each index in a data stream, which affects the cluster state size. This looks like a pain-point around how index templates are applied, and how dynamic templates/mappings are stored, which other users need to deal with too, and we wonder if we should find a more general solution that all users can benefit from, rather than addressing it for ECS specifically

  2. a lot of fields need to be mapped one-by-one, even when they won't ever appear in any document. This contributes to the cluster state size, but mostly because mappings are duplicated for each index, and maybe the main problem with this is that fields are materialized in the mappings although they will never appear in any document, which causes noise. Using dynamic templates to define mappings would address the concern of having to map fields unnecessarily. On the other hand the size of the dynamic templates section would still be considerable especially if fields are still mapped one-by-one. Would it be possible to define one dynamic template per data type, either leveraging naming conventions or match/unmatch combined with regex matching? Potentially we could allow a single dynamic template to match a list of fields to make this easier, so that the same field definitions don't need to be repeated over and over.
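
The "one dynamic template per data type" idea from point 2 could be sketched as follows. The field table is an illustrative subset, and the use of a list of paths under `path_match` is an assumption that reflects the suggestion above rather than a confirmed capability of any particular Elasticsearch version.

```python
from collections import defaultdict

# Illustrative subset of ECS field names and their types.
ecs_fields = {
    "source.ip": "ip",
    "destination.ip": "ip",
    "host.name": "keyword",
    "http.response.status_code": "long",
}


def templates_by_type(fields):
    """Build one dynamic template per data type, each matching a list of paths."""
    paths_by_type = defaultdict(list)
    for path, field_type in fields.items():
        paths_by_type[field_type].append(path)
    return [
        {
            f"ecs_{field_type}": {
                "path_match": sorted(paths),
                "mapping": {"type": field_type},
            }
        }
        for field_type, paths in sorted(paths_by_type.items())
    ]


templates = templates_by_type(ecs_fields)
```

Four field definitions collapse into three templates here; the saving grows with the number of fields that share a type, since the mapping definition is written once per type instead of once per field.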

To clarify, cluster state size is a concern, which both dynamic templates and mapped fields contribute to. Though the biggest issue we currently have is the memory footprint of mappings, which only mapped fields contribute to (see #86440). A reasonable approach would be to try and move as many as possible of the fields definition to the dynamic templates section, while working out a way to keep the size of dynamic templates contained.

I wonder if the exercise of moving the current ECS mappings to using dynamic templates is a valid path forward regardless of the outcome of the discussion around integrating ECS mappings within Elasticsearch. I think that if we decided to adopt ECS conventions in Elasticsearch dynamic mappings, we would very likely load these from a file holding a set of dynamic templates.

There was also some resistance around ownership in case ECS mappings become part of Elasticsearch: who is then responsible for maintaining them, adding fields, etc.?

We will continue the discussion once we receive feedback on these thoughts.

@felixbarny
Member

This has also come up in the context of providing better default mappings for logs (LX).

We didn't come to a definite answer whether or not we'd want all ECS fields to be mapped and indexed by default.
We were leaning towards only adding a sub-set of ECS fields to the default mappings which are commonly used for filtering/slicing/dicing and either not mapping dynamic fields at all or mapping them as runtime fields. The reason is that some ECS fields are not queried often and are mostly additional metadata that you'd mostly look at when you're looking at a handful of logs. There's also a lot of gray area of fields that are queried infrequently so that the index-time and disk-space overhead are not worth the faster queries. Users can always change the mappings later on if they find that they'd like a certain field to be indexed going forward.

As dynamic runtime fields also increase the cluster state (#88265), I think it would be better to disable dynamic mapping but to apply runtime field-like semantics to unmapped fields (look them up from _source) (#81357).
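
As a rough sketch of the two existing options mentioned above, the mappings below index a small subset of ECS fields and either leave new fields unmapped (`"dynamic": false`) or add them as runtime fields (`"dynamic": "runtime"`). The field subset is an assumption chosen for illustration; the proposal to look up unmapped fields from _source at query time (#81357) is not modelled here.

```python
# Hypothetical mappings for a logs data stream: a handful of frequently
# filtered ECS fields are indexed, everything else is left unmapped.
strict_subset_mappings = {
    "dynamic": False,  # new fields stay in _source only and are not added to the mappings
    "properties": {
        "@timestamp": {"type": "date"},
        "host": {"properties": {"name": {"type": "keyword"}}},
    },
}

# Alternative: new fields are added to the mappings as runtime fields
# instead of indexed ones, which still grows the cluster state.
runtime_mappings = dict(strict_subset_mappings, dynamic="runtime")
```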

@javanna
Member

javanna commented Aug 17, 2022

I am removing the team-discuss label for now: there is ongoing work in defining the set of core fields that we would like to integrate within Elasticsearch, and once that is defined, we will look at their mappings and re-evaluate how to move forward. There is agreement on using dynamic templates to ensure that fields that never appear in documents are not mapped. It's important to reiterate what the main goal of this effort is: make it easier for users to use ECS, without them even knowing what ECS is. Relevant indices should get ECS mappings applied automatically.

@ruflin
Member

ruflin commented Sep 6, 2022

I have opened #89743 as a draft PR to discuss potential default mappings in detail. It is not a full implementation; the idea is to first agree on what the mappings we need to solve this use case should be, and then in a second step discuss the implementation in Elasticsearch. At first it could be just a component template that can be referenced, and later on a config flag to make usage much easier.

Please have a look at the PR and especially the comments for each field. There is quite a list of open questions / discussion points but it is much easier to have this directly in the PR itself.

@zez3

zez3 commented Oct 11, 2022

Is elastic/integrations#3642 related?

@ruflin
Member

ruflin commented Nov 30, 2022

I had some good follow-up discussions with @P1llus around this topic. We can split the problem in two parts:

  1. Default mappings provided by Elasticsearch for logs-*-* etc
  2. Simple ECS mappings to be used for integrations

The way I think of it is that 1. will be a subset of 2. The following focuses mostly on 2. and belongs more into the integrations development. I'm updating the conversation here as we already touched on many of the points.

The goal is to simplify building of integrations and allow everyone to use ECS. Some of the core guiding principles we discussed:

  • A specific version of a package is always the same, independent of which version of Elasticsearch it is installed on
  • Any ECS field that is added by an integration package or a user should by default be mapped correctly, be it a runtime field or indexed
  • Older versions of Elasticsearch should have newer versions of ECS available
  • It is possible to overwrite ECS fields for example to set scaled_float instead of float or the description part
  • If a field is not used in a data stream, it should not show up

Proposal

Based on the above, we came up with the following proposal. This is a high level proposal, details would have to be worked out.

Installation of ECS fields

An integration package installed on Elasticsearch 8.1 can require ECS 8.3. This means bundling ECS component templates with Elasticsearch is not a viable solution. Instead, Fleet should install a package which contains the component templates for the different ECS versions. The package contains all previous ECS versions, meaning that 8.0, 8.1, etc. of ECS are available in parallel.

These component templates are also available to any user that wants to use ECS.

Packages require ECS component template with version

A developer of an integration package can specify which ECS version should be used in the package. When the package is installed by Fleet, the correct ECS component template is referenced to be used by the package. Depending on how many component templates there are per ECS version, the reference might be slightly different.
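
A hedged sketch of how such a versioned reference might look from the package side. The helper names, the `ecs-<version>` naming scheme and the index pattern are hypothetical; only the `composed_of` mechanism for referencing component templates from an index template is taken as given.

```python
def ecs_component_template_name(ecs_version):
    """Hypothetical naming scheme for versioned ECS component templates, e.g. 'ecs-8.3'."""
    return f"ecs-{ecs_version}"


def package_index_template(index_pattern, ecs_version):
    """Index template body that pulls ECS mappings from a versioned component template."""
    return {
        "index_patterns": [index_pattern],
        "composed_of": [ecs_component_template_name(ecs_version)],
        "data_stream": {},
    }


# A package that requires ECS 8.3, regardless of the Elasticsearch version.
template = package_index_template("logs-netflow.log-*", "8.3")
```

If an ECS version were split across several component templates, `composed_of` would list each of them, which is the "slightly different reference" case mentioned above.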

Dynamic mappings for ECS fields

ECS has grown a lot over the last few years. The challenge is that we do not want to map all these fields by default, as it would create a pretty large component template. Instead, as ECS follows conventions, most fields can be mapped with just a few rules. @P1llus worked on such a dynamic template for all the current ECS fields, which can be found here. It is not split up into Core or Extended. My assumption is that removing Extended would shorten the file by ~1/3, but it is not clear if it is worth doing this. For testing this dynamic template, @P1llus has also created an example doc with all ECS fields.

Having this dynamic template for ECS available would mean integration developers can stop adding ECS fields and only reference a version. ECS fields could no longer be forgotten by accident. If users add their own fields, these would also be correctly mapped to ECS. It goes further: if a user creates foo.name, where name is a common pattern in ECS, it would also be mapped according to ECS conventions.
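
The convention-based matching, including the foo.name case, can be sketched in a few lines of Python. The three rules below are assumptions picked for illustration and are not the contents of the actual ECS dynamic template; `fnmatch` stands in for Elasticsearch's own wildcard matching.

```python
from fnmatch import fnmatchcase

# Illustrative convention-based rules: (path pattern, Elasticsearch field type).
# These are assumptions, not the contents of the actual ECS dynamic template.
convention_rules = [
    ("*.ip", "ip"),
    ("*.name", "keyword"),
    ("*.bytes", "long"),
]


def type_by_convention(path):
    """Return the type of the first rule whose pattern matches, else None."""
    for pattern, field_type in convention_rules:
        if fnmatchcase(path, pattern):
            return field_type
    return None
```

With these rules, `type_by_convention("foo.name")` yields the same type as `host.name` would, even though `foo.name` is not an ECS field, while a path that matches no rule returns `None` and would fall through to the regular dynamic mappings.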

All fields are still indexed

Taking the above approach means all ECS fields are indexed by default. In #89743 we are discussing partially moving away from this. But the above proposal is only for the integrations and replaces the complexity we have today around ECS fields; suddenly removing indexing would be a breaking change.

The above approach also allows us to make improvements like offering an ecs-runtime-8.1 component template for an integration to reference. Moving forward on the above would solve many issues around 2 (see beginning of the issue) but will not address 1 yet. The discussion on this one is still open.

@javanna
Member

javanna commented Dec 2, 2022

heya @ruflin thanks a lot for the update. I looked at the linked mappings and left a couple of comments, very much in line with yours.

> An integration package installed on Elasticsearch 8.1 can require ECS 8.3. This means bundling ECS component templates with Elasticsearch is not a viable solution. Instead, Fleet should install a package which contains the component templates for the different ECS versions. The package contains all previous ECS versions, meaning that 8.0, 8.1, etc. of ECS are available in parallel.

I am trying to figure out what this means in practice. In order to have ECS mappings integrated into Elasticsearch, wouldn't it be a requirement that they are not installed by an external component but rather managed internally by Elasticsearch?

@ruflin
Member

ruflin commented Dec 2, 2022

> I am trying to figure out what this means in practice. In order to have ECS mappings integrated into Elasticsearch, wouldn't it be a requirement that they are not installed by an external component but rather managed internally by Elasticsearch?

Not necessarily, but it should be a service that you trust and that is always running beside Elasticsearch. In our case this is Kibana. The way it would work: Elasticsearch & Kibana 8.1 are running. A new version of the ECS package is published. Kibana / Fleet detects it and, in the background, installs the new ECS component templates which do not exist yet.

@javanna
Member

javanna commented Dec 2, 2022

Thanks for expanding. If ECS mappings though are ordinary component templates that are provided externally, what is the plan for a tighter integration? We were initially thinking of something like dynamic: ecs at the index level, have we moved away from that idea?

@ruflin
Member

ruflin commented Dec 2, 2022

I think that is where the separation between 1 and 2 in #85692 (comment) is. I still would like to get to a point where users do not even have to use component templates but just have to enable it with a config setting, but to me it seems going with 2 is low-hanging fruit and 1 is the long-term goal. I could also see that dynamic: ecs-8.3 is, behind the scenes, just a component template and relies on Fleet having it loaded?

@P1llus
Member

P1llus commented Apr 21, 2023

Adding some comments here after speaking with @felixbarny.

If the plan is to add the dynamic ECS template to the global logs-* index template, there are a few notes that might be good to be aware of:

  • ECS Dynamic template should have its own repo for now, with a release per ECS version.
  • Elasticsearch should bundle all available release versions of the dynamic template as component templates upon installation of elasticsearch.
  • The logs-* default dynamic template, should be linked to the ECS dynamic template that is the newest release.
  • Input packages UI should have a dropdown that chooses ECS version, default to newest, and a link to that component template should be made.
  • Integrations today either do not have dynamic ECS templates, or they have their own. Either way, as long as any field mappings are defined, they will override and never use the dynamic ECS template, which is unfortunate. Maybe we should find a way for integrations to "opt in" to using the bundled component templates?
  • Elastic-package system tests, and CI tests require the dynamic ECS template for testing purposes, once we bundle it with ES, we need to reach out to the ecosystem team so we can remove the local copy from elastic-package, and ask it to fetch one from the remote repo.

Only my 2 cents, there might be other views from other people around this subject as well of course.

@felixbarny
Member

> Elasticsearch should bundle all available release versions of the dynamic template as component templates upon installation of elasticsearch.
> Maybe we should find a way for integrations to "opt in" to using the bundled component templates?

I discussed this with @ruflin today. He brought up the idea of creating an ECS integration package which contains the component templates for all ECS versions and it's ensured that this is always up-to-date and pre-installed.

The benefit compared to bundling the different ECS versions in Elasticsearch is that this ECS integration can be installed in older ES versions. Therefore, an integration can reference an ECS template that's newer than the stack version it gets installed on.

ES would then only bundle the current version of ECS, to be used for the default index templates (see also #95538).

@MakoWish

MakoWish commented Apr 28, 2023

Yes! I would like to see this! Anything and everything (Beats, Integrations, SIEM detection indices...) should share a common set of Component Templates. This would 100% eliminate field conflicts. I thought that was the entire idea behind shared Component Templates, so I was a bit confused when I saw that Integrations are not using them.

@ruflin
Member

ruflin commented Apr 12, 2024

In #96171 we added the ecs@mappings component template with dynamic mappings for all of ECS. I assume this should complete this issue?
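
For indices outside the built-in templates, referencing that component template could look roughly like the sketch below. The index pattern and priority are assumptions for illustration; only the `ecs@mappings` name comes from this thread.

```python
# Hypothetical index template for indices that are not covered by the
# built-in templates; the pattern and priority are illustrative only.
custom_template = {
    "index_patterns": ["my-app-*"],
    "composed_of": ["ecs@mappings"],  # component template named in this thread
    "priority": 200,
}
```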

@javanna
Member

javanna commented Apr 12, 2024

> I assume this should complete this issue?

Sounds good to me, and great news! I just want to confirm that the original idea of introducing a new ECS dynamic mode is no longer a goal here, and we are good with the current approach that is based on component templates for the time being.

@ruflin
Member

ruflin commented Apr 12, 2024

With the logs-*-* templates in Elasticsearch we managed to roll this out to all users that are using the data stream naming scheme. It would be convenient for others to just turn it on with a flag, but I would not optimise for it, as I would rather help users migrate to the data stream naming scheme, since with it they also get all the other benefits. We should have a discussion in the context of logsdb about whether there is an option to have these templates in there automatically (also outside logs-*-*).

@javanna
Member

javanna commented Apr 22, 2024

Ok, thanks for the feedback. I am going to close this. We can always re-discuss the possibility of a boolean flag in the future; for now that's not something we are going to focus on.

@javanna javanna closed this as completed Apr 22, 2024
8 participants