Allow customizing managed data streams at different levels of granularity #97664

felixbarny · 2023-07-13T16:40:43Z

What are we trying to achieve?

On several occasions, we've discussed adding ways for users to customize data streams that are set up via Fleet and via the built-in index templates, without having to copy the index template and take on the burden of maintaining the whole copy going forward. Instead, we'd want to offer dedicated extension points so that users can configure different settings/mappings/lifecycles at different levels of the data stream naming scheme:

All data streams (*-*-*)
All data streams with a certain type ({type}-*-*)
All data streams with a certain type and dataset ({type}-{dataset}-*)
All data streams with a certain type, dataset, and namespace ({type}-{dataset}-{namespace})
All data streams with a certain type and namespace ({type}-*-{namespace})
All data streams with a certain namespace (*-*-{namespace})

Some concrete use cases:

A user wants to send the observability signals of their tier 1 applications to a separate namespace to keep the data in the hot tier for longer and to have a longer retention.
Setting the default retention for logs to 30 days and for metrics to 90 days.
Enabling synthetic _source for the logs-foo-* data stream that is using the logs-*-* index template, without having to create a copy of the index template with a logs-foo-* index pattern.

Why this should be in Elasticsearch

The previous discussions (elastic/kibana#149484, elastic/kibana#121118) have mostly been focussed on Fleet. But I have a strong preference for not putting this into Fleet but into Elasticsearch so that data streams that are not managed by Fleet (such as the data streams for the built-in index templates logs-*-* and metrics-*-*) can benefit from that as well.

Why is this important

This becomes more important in the context of the reroute processor, as documents can be routed to data streams that aren't managed by or known to Fleet. Also, we're considering moving APM index templates out of Fleet and into Elasticsearch (see #97546).

A potential solution

I've proposed one potential solution to this here: elastic/kibana#121118 (comment)

Essentially, we'd add a couple of component templates into the index templates that are managed by Fleet and Elasticsearch. For example, the composed_of section of the logs-*-* index template that is built into Elasticsearch would be extended by component templates that have a placeholder in them (exact naming tbd).

composed_of:
  - logs@custom
  - logs-*-{{data_stream.namespace}}@custom
  - logs-{{data_stream.dataset}}@custom
  - logs-{{data_stream.dataset}}-{{data_stream.namespace}}@custom

Valid placeholders are any constant_keyword fields.

If a user wants to customize a concrete data stream logs-foo-bar, they can create the following component templates:

logs@custom
logs-*-bar@custom
logs-foo@custom
logs-foo-bar@custom
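
To make the placeholder expansion concrete, here is a minimal Python sketch. It is only an illustration of the idea, not how Elasticsearch would implement it: the function name `resolve_custom_templates` is hypothetical, and it assumes that a data stream name always has exactly three hyphen-separated parts.

```python
# Placeholder syntax is taken from the proposal above; the resolver is hypothetical.
COMPOSED_OF = [
    "logs@custom",
    "logs-*-{{data_stream.namespace}}@custom",
    "logs-{{data_stream.dataset}}@custom",
    "logs-{{data_stream.dataset}}-{{data_stream.namespace}}@custom",
]


def resolve_custom_templates(data_stream: str, composed_of=COMPOSED_OF) -> list[str]:
    """Expand the placeholders for one concrete data stream name."""
    # Assumption: names follow {type}-{dataset}-{namespace} with no extra hyphens.
    parts = data_stream.split("-")
    if len(parts) != 3:
        raise ValueError(f"not a {{type}}-{{dataset}}-{{namespace}} name: {data_stream!r}")
    _type, dataset, namespace = parts
    values = {
        "{{data_stream.dataset}}": dataset,
        "{{data_stream.namespace}}": namespace,
    }
    resolved = []
    for name in composed_of:
        for placeholder, value in values.items():
            name = name.replace(placeholder, value)
        resolved.append(name)
    return resolved


print(resolve_custom_templates("logs-foo-bar"))
```

For logs-foo-bar this yields the four component template names listed above, in the same order.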

elasticsearchmachine · 2023-07-13T16:41:09Z

Pinging @elastic/es-data-management (Team:Data Management)

BBQigniter · 2023-07-14T12:04:00Z

imho this also relates to #91370

joshdover · 2023-07-14T15:02:31Z

Big +1 on solving this with an ability to reference a "templated" component template name. I have one suggestion on the solution which may make it a bit simpler.

I think there will be a slight issue with trying to use constant_keyword fields as candidates for replacement variables, as those fields themselves may be defined in component templates (and in Fleet's case, are). It's a chicken-and-egg problem of having to look up the component templates to know which other component templates match. I guess this could be solved by looking up the component templates without variables first, and then matching the rest, but I think it's more complicated than necessary and confusing from a user perspective.

Instead, I'd suggest we have some ability to name the wildcards in the main template's index pattern (similar to a named regexp capture group) and then reference those as variables in the composed_of array, like this:

index_patterns:
  - logs-(*:dataset)-(*:namespace)
composed_of:
  - logs@custom
  - logs-*-{{namespace}}@custom
  - logs-{{dataset}}@custom
  - logs-{{dataset}}-{{namespace}}@custom

I think this is simpler, more obvious, and less tied to any specific convention. It also has the nice side benefit that it constrains the possibilities to only strings that appear in the actual name of the index/data stream, rather than fields in the document that may not be part of the index name.
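
A rough Python sketch of how such named wildcards could be resolved. The `(*:name)` syntax comes from the suggestion above; everything else is an assumption made for illustration: the helper names `pattern_to_regex` and `resolve` are hypothetical, each capture is assumed to match one hyphen-free segment, and plain unnamed `*` wildcards are not handled.

```python
import re

# Matches the proposed "(*:name)" capture syntax.
_CAPTURE = re.compile(r"\(\*:(\w+)\)")


def pattern_to_regex(index_pattern: str) -> re.Pattern:
    """Turn "logs-(*:dataset)-(*:namespace)" into a regex with named groups."""
    parts = []
    pos = 0
    for m in _CAPTURE.finditer(index_pattern):
        parts.append(re.escape(index_pattern[pos:m.start()]))
        # Assumption: a captured segment never contains a hyphen.
        parts.append(f"(?P<{m.group(1)}>[^-]+)")
        pos = m.end()
    parts.append(re.escape(index_pattern[pos:]))
    return re.compile("".join(parts))


def resolve(index_pattern: str, composed_of: list[str], name: str) -> list[str]:
    """Substitute the captured values into the composed_of entries."""
    m = pattern_to_regex(index_pattern).fullmatch(name)
    if m is None:
        raise ValueError(f"{name!r} does not match {index_pattern!r}")
    resolved = []
    for template in composed_of:
        for var, value in m.groupdict().items():
            template = template.replace("{{" + var + "}}", value)
        resolved.append(template)
    return resolved


composed_of = [
    "logs@custom",
    "logs-*-{{namespace}}@custom",
    "logs-{{dataset}}@custom",
    "logs-{{dataset}}-{{namespace}}@custom",
]
print(resolve("logs-(*:dataset)-(*:namespace)", composed_of, "logs-nginx.access-prod"))
```

Because the variables come only from the data stream name, the resolution needs no lookup of mappings or other component templates.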

joshdover · 2023-07-14T15:05:47Z

@BBQigniter do you think this will fully solve the problems described in #91370 or is there more we need to accommodate?

BBQigniter · 2023-07-17T06:40:35Z

@joshdover not completely sure, but your proposal looks good to me :)

felixbarny · 2023-07-24T09:03:20Z

Instead, I'd suggest we have some ability to name the wildcards in the main template's index pattern (similar to a named regexp capture group) and then reference those as variables in the composed_of array

Seems like a much better and simpler idea compared to relying on constant_keyword fields! Love it!

From what I can tell, these are some aspects of #91370 that this proposal wouldn't tackle:

Having separate component templates for different ECS namespaces. However, we're moving towards a dynamic template-based approach to mapping ECS fields: [Logs+] Adding ECS dynamic templates #96171, Dynamic ECS mapping progress integrations#5055
Customize data streams at the integration granularity. While this proposal allows customization at the data stream level, if an integration contains multiple data streams, you can't easily apply configurations to all data streams of an integration. I suppose this can be achieved by Fleet automatically adding a custom component template for an integration.

joshdover · 2023-07-24T09:26:52Z

Customize data streams at the integration granularity. While this proposal allows customization at the data stream level, if an integration contains multiple data streams, you can't easily apply configurations to all data streams of an integration. I suppose this can be achieved by Fleet automatically adding a custom component template for an integration.

Fleet adding an explicit component template would work.

Another option would be to make the dotted part of the dataset part of the pattern, so you could do something like this (not sure I like the names I used, but you get the idea):

index_patterns:
  - logs-(*:dataset_prefix).(*:dataset_suffix)-(*:namespace)
composed_of:
  - logs@custom
  - logs-*-{{namespace}}@custom
  - logs-{{dataset_prefix}}@custom
  - logs-{{dataset_prefix}}.{{dataset_suffix}}@custom
  - logs-{{dataset_prefix}}.{{dataset_suffix}}-{{namespace}}@custom

I've wondered whether the two-part dataset should be part of the DSNS convention or not - we use this pattern fairly consistently, though not everywhere.

felixbarny · 2023-07-24T10:02:01Z

As not all datasets have a prefix and suffix separated by a dot, the logs-*.*-* index pattern wouldn't match all data streams.
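
A quick way to see this is the following Python sketch. It uses the standard library's glob matching as a stand-in for index pattern wildcards, which is an approximation that holds for patterns containing only `*`; the data stream names are made-up examples.

```python
from fnmatch import fnmatchcase

# Made-up data stream names: one dataset with a dot, one without.
data_streams = ["logs-nginx.access-default", "logs-foo-default"]

dotted_pattern = "logs-*.*-*"
generic_pattern = "logs-*-*"

# Names covered by the generic pattern but missed by the dotted one.
unmatched = [
    name
    for name in data_streams
    if fnmatchcase(name, generic_pattern) and not fnmatchcase(name, dotted_pattern)
]
print(unmatched)
```

Here logs-foo-default ends up in `unmatched`, because its dataset has no dot for the `*.*` part to match.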

But either way, it seems the placeholders in component template references could also be used to add extension points to all data streams of an integration.

ruflin · 2023-07-31T11:07:20Z

++ on moving forward with the placeholder approach. It will not solve all problems but I think it will solve quite a few.

@dakrone Would be great to get your feedback on this.

dakrone · 2023-08-03T17:59:25Z

Thanks for bringing this up Felix, and others for the discussion so far. We met today as a team to discuss this. We have a couple of reservations and some thoughts I'll try to share.

First, I don't think the proposed solution of placeholders where wildcards are essentially "captured" (the logs-(*:dataset_prefix).(*:dataset_suffix)-(*:namespace) suggestion) is going to be a good solution. We rely on knowing exactly how templates are composed in order to be able to validate changes to both index and component templates when they're added/updated/removed. If we went with the pattern capturing solution it would mean that we could no longer validate the templates, because we wouldn't know what the composition is going to be until index or data stream creation time.

Second, the other option that I see currently would be for us to use a naming scheme for customizing component templates. For example, we'd change all of our logs integrations and built-in templates to reference the logs@custom component template, so that any user-customization can be done there. We'd do this at varying levels of granularity, so we'd end up with a logs-nginx@custom one for the Nginx integration, and so on. This would require Fleet to specify the correct names of the customization when installing an integration. This would be better than the previous solution, since we would still be able to validate both component and index templates. The ...@custom component templates would not have to exist up front, because we can specify in the index template that it should not fail while they haven't been created yet.

The challenging part of the second solution is that we run into a composition problem when a user wants to change a particular attribute of a data stream. For example, imagine a user who wants to change the "global" data stream configuration to set a project-level retention for all *-*-* data. Ideally this would mean they would add a {"data_retention": "30d"} configuration to a global@custom component template. But what happens when this global component template is composed into an index template that does not specify "data_stream": {} in its configuration? Or an index template that is managed but that disables data stream lifecycles? We could either be lenient and allow the composition, ignoring it (which is unfortunate because it introduces leniency), or disallow it and force a user to reckon with varying configuration parameters at different levels of granularity that may or may not be allowed. The example is for retention, but it can be extrapolated to any template configuration such as index settings, mappings, or aliases. Depending on how strict we are and exactly how we want to use the ...@custom component templates, we may minimize or increase this risk.

I don't think the placeholder meets the needs we have without introducing unacceptable leniency. The second is more workable but has some pieces and use-cases that we'd need to work through to make sure that we don't end up with a rigid or brittle system. What do you think?

felixbarny · 2023-08-03T18:54:07Z

If we went with the pattern capturing solution it would mean that we could no longer validate the templates, because we wouldn't know what the composition is going to be until index or data stream creation time.

If all individual component templates are valid themselves, in what situation can the composition be invalid?

This would require Fleet to specify the correct names of the customization when installing an integration.

This sounds similar to elastic/kibana#121118. We've closed this issue because we'd like a solution that doesn't rely on Fleet to set up the data streams in the right way so that we can have the same extension points for the index templates that ship with Elasticsearch, such as logs-*-*. Even if relying on Fleet for this, there wouldn't be a way to allow customization at the namespace level, without having to create an index template for each namespace (which may not be known upfront but dynamically determined by a reroute processor).

dakrone · 2023-08-03T19:50:36Z

If all individual component templates are valid themselves, in what situation can the composition be invalid?

It's not just component templates that must be valid, but also their use by the index template. For (a contrived) example, this is valid and allows an index to be created:

PUT /_component_template/one
{
  "template": {
    "mappings": {
      "properties": {
        "field": {
          "type": "text"
        }
      }
    }
  }
}

PUT /_index_template/it
{
  "index_patterns": ["foo"],
  "data_stream": {},
  "composed_of": ["one"],
  "template": {
    "mappings": {
      "properties": {
        "alias-field": {
          "type": "alias",
          "path": "field"
        }
      }
    }
  }
}

But if you tried to change the name of the field, you get an error:

PUT /_component_template/one
{
  "template": {
    "mappings": {
      "properties": {
        "other-field": {
          "type": "text"
        }
      }
    }
  }
}
// Returns:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "updating component template [one] results in invalid composable template [it] after templates are merged"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "updating component template [one] results in invalid composable template [it] after templates are merged",
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "composable template [it] template after composition with component templates [one] is invalid",
      "caused_by" : {
        "type" : "illegal_argument_exception",
        "reason" : "invalid composite mappings for [it]",
        "caused_by" : {
          "type" : "mapper_parsing_exception",
          "reason" : "Invalid [path] value [field] for field alias [alias-field]: an alias must refer to an existing field in the mappings."
        }
      }
    }
  },
  "status" : 400
}

This is just one contrived example.
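
To illustrate why this check needs the full composition, here is a toy Python sketch. It is not Elasticsearch's actual merge or validation logic: the helper names are hypothetical, the merge is a shallow update of top-level mapping properties, and only the alias path rule from the example is checked.

```python
def compose_properties(component_templates: list[dict], index_template: dict) -> dict:
    """Shallow-merge mapping properties; later templates win (toy model)."""
    merged = {}
    for template in [*component_templates, index_template]:
        props = template.get("template", {}).get("mappings", {}).get("properties", {})
        merged.update(props)
    return merged


def invalid_aliases(properties: dict) -> list[str]:
    """Return alias fields whose path does not refer to a field in the merged mappings."""
    return [
        name
        for name, spec in properties.items()
        if spec.get("type") == "alias" and spec.get("path") not in properties
    ]


index_template = {
    "template": {
        "mappings": {"properties": {"alias-field": {"type": "alias", "path": "field"}}}
    }
}
one_v1 = {"template": {"mappings": {"properties": {"field": {"type": "text"}}}}}
one_v2 = {"template": {"mappings": {"properties": {"other-field": {"type": "text"}}}}}

print(invalid_aliases(compose_properties([one_v1], index_template)))
print(invalid_aliases(compose_properties([one_v2], index_template)))
```

Each template is fine on its own; the broken alias only shows up once the list of component templates is known and merged, which is the information that would be missing until creation time with captured placeholders.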

joshdover · 2023-08-04T10:29:08Z

Even if relying on Fleet for this, there wouldn't be a way to allow customization at the namespace level, without having to create an index template for each namespace (which may not be known upfront but dynamically determined by a reroute processor).

I think this is the biggest downside - these potential index templates needed are not known at integration installation time. They may only exist later.

Now you could argue that the user won't really need to make any namespace-specific customizations until there is a known namespace they want to customize, so creating a new index template is a viable option. But now the user needs to either (1) manually copy the index template and keep it up-to-date with changes to the integration; or (2) use Fleet/Integration APIs in Kibana to add customizations to handle this for them, which is a confusing experience to have to switch between ES and Kibana APIs for template management.

A similar alternative that would not have this downside is to add data stream naming scheme template management APIs to Elasticsearch directly so that users could more easily manage this directly from ES. IMO this might be the best middle ground, but I'd like to hear from @felixbarny on whether or not this fully solves the problem.

Another idea is to solve the validation problem at indexing time instead of template creation, with a fallback to a "failure data stream" - the idea we discussed at EAH for documents that fail to be processed or indexed. This case feels pretty similar and could make use of the same mechanism. That said, I believe that's a fairly large enhancement that we have not begun work on and it would be unfortunate to block on this.

felixbarny · 2023-08-04T13:58:43Z

A similar alternative that would not have this downside is to add data stream naming scheme template management APIs to Elasticsearch directly so that users could more easily manage this directly from ES. IMO this might be the best middle ground, but I'd like to hear from @felixbarny on whether or not this fully solves the problem.

Could you elaborate on how that would work?

One potential issue with that may be how the precedence of these custom component templates is defined. How are they ordered among themselves, and how are they ordered with the component templates that already exist on the data stream?

joshdover · 2023-08-08T09:37:20Z

I'm thinking a higher level API for managing templates that are part of the data stream naming scheme, like we've brainstormed in the past. This would solve the problem of being able to direct users to use a single API surface for template management (Elasticsearch) and having Elasticsearch manage the namespace-specific settings.

I think these APIs would need to support all of the granularity levels in the main issue description, in addition to global defaults. Under the hood it would need to dynamically create and update the required index and component templates, validating them all before committing the change.

This API would probably also need to distinguish between user-customized settings and package-managed ones. The package API would be restricted to Kibana's system user, so that end users only use the @custom templates. I'd recommend a generic form like:

PUT /_data_stream_template/{type}-{dataset}-{namespace}/(@package|@custom)
{
  "settings": { },
  "mappings": { },
  "data_retention": { }
}

For a basic case like setting a type-wide default, no new index templates need to be created, only updating the logs@custom component template which is referenced in all index templates:

PUT /_data_stream_template/logs-*-*/@custom
{
  "lifecycle": {
    "data_retention": "7d"
  }
}

// Under the hood ES does
PUT /_component_template/logs@custom
{
  "template": {
    "lifecycle": {
      "data_retention": "7d"
    }
  }
}

Namespace-specific customizations require more work under the hood to create index templates with higher priority if needed. In this example, every data stream managed by the system would need a new namespace-specific index template created with higher priority, referencing a namespace-specific component template:

PUT /_data_stream_template/logs-*-foo/@custom
{
  "lifecycle": {
    "data_retention": "7d"
  }
}

// Under the hood ES does something sort of like this (for each logs dataset):
PUT /_component_template/logs-*-foo@custom
{
  "template": {
    "lifecycle": {
      "data_retention": "7d"
    }
  }
}

PUT /_index_template/logs-my.dataset-foo
{
  "index_patterns": ["logs-my.dataset-foo"],
  "data_stream": { },
  "priority": 250, // higher than whatever logs-my.dataset template is
  "composed_of": [
    "logs@global",
    "logs@custom",
    "logs-my.dataset@package",
    "logs-my.dataset@custom",
    "logs-*-foo@custom",
    "logs-my.dataset-foo@custom"
  ],
  "allow_missing": [
    "logs@custom",
    "logs-my.dataset@custom",
    "logs-*-foo@custom",
    "logs-my.dataset-foo@custom"
  ]
}

This has an added benefit of having Elasticsearch be the source of truth for how these customization layers are added on top of one another, instead of spreading that out across Fleet and Elasticsearch's default templates.
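
As a rough illustration of the "for each logs dataset" step, the per-dataset index template body could be generated mechanically. The Python sketch below only mirrors the example above: the function name is hypothetical, `allow_missing` and the `@global`/`@package` names are taken from the sketch in this comment rather than from an existing API, and the priority is passed in because how it would be chosen is still open.

```python
def namespace_index_template(type_: str, dataset: str, namespace: str, priority: int) -> dict:
    """Build the body of a namespace-specific index template (illustrative only)."""
    base = f"{type_}-{dataset}"
    # User-owned extension points, ordered from least to most specific.
    custom = [
        f"{type_}@custom",
        f"{base}@custom",
        f"{type_}-*-{namespace}@custom",
        f"{base}-{namespace}@custom",
    ]
    return {
        "index_patterns": [f"{base}-{namespace}"],
        "data_stream": {},
        "priority": priority,
        "composed_of": [
            f"{type_}@global",
            f"{type_}@custom",
            f"{base}@package",
            f"{base}@custom",
            f"{type_}-*-{namespace}@custom",
            f"{base}-{namespace}@custom",
        ],
        # The custom templates may not exist yet, so they are all optional.
        "allow_missing": custom,
    }


body = namespace_index_template("logs", "my.dataset", "foo", priority=250)
print(body["composed_of"])
```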

felixbarny · 2023-08-11T14:57:31Z

One potential challenge I see is what happens when you create a new data stream after adding a namespace customization.
Example:

PUT _data_stream/logs-ds1-foo
PUT /_data_stream_template/logs-*-foo/@custom
PUT _data_stream/logs-ds2-foo

How do we ensure that ds2 also gets the customizations from step 2?
Also, the approach of creating a copy of the index template with a higher priority seems problematic: When making changes to the original index template, those changes won’t be reflected in the copy.

Dataset customizations (such as logs-foo-*) aren't trivial either: a data stream such as logs-foo-default may be created by the built-in logs-*-* index template, which doesn't import a component template for logs-foo@custom, so we'd also need to create a copy of that index template.

But maybe it's fine to rely on copying the index templates? On the pro side, it makes existing data streams more immune to breaking changes caused by modifications in the global templates. However, they also don't benefit from improvements in these templates. Maybe that's the right tradeoff if it allows us to statically verify that the merged index templates are valid.

joshdover · 2023-08-29T15:39:20Z

We had a brief brainstorming session on this today and discussed these requirements & constraints:

Users can make modifications at the type, dataset, or namespace level, or at a combination of 2 or 3 of these levels
Users need to be able to define customizations without forking index templates that come from integrations or are included in Elasticsearch
Customizations are applied automatically, in a declarative way, without needing to update every index outside the customization update
Precedence between combinations will be strictly defined by the system and not user-definable
These customizations will always take precedence over index templates
Dependent settings across separate customizations are not supported; they must be contained in the same customization
- Example: a field alias can’t depend on a field defined in a different customization or the index template
We need to validate as much as possible when the customization is created

Next step is for @tylerperk to flesh these requirements out more and we'll then meet again for another brainstorming session on potential solutions.

Use `<data_stream.type>@custom` instead of `apm@custom`. This is an enhancement over what Fleet sets up; it is an additive improvement in the direction of elastic#97664. The rollup data streams' `@custom` component templates now include the duration, like what Fleet sets up. Add a YAML REST test, and a unit test ensuring consistency across the index templates.

felixbarny added the :Data Management/Data streams Data streams and their lifecycles label Jul 13, 2023

elasticsearchmachine added the Team:Data Management Meta label for data/management team label Jul 13, 2023

This was referenced Jul 13, 2023

[Fleet] Add namespace-specific index and component templates elastic/kibana#121118

Closed

[Fleet] Add support for customizing integration data streams at more levels of granularity elastic/kibana#149484

Open

dakrone added the team-discuss label Aug 3, 2023

dakrone removed the team-discuss label Aug 3, 2023

felixbarny mentioned this issue Aug 14, 2023

add enhance logs and extract timestamp docs elastic/observability-docs#3118

Merged

leandrojmp mentioned this issue Oct 2, 2023

Allow to easily add custom pipelines and templates per integration, currently it is done per dataset. elastic/kibana#146792

Closed

felixbarny mentioned this issue Oct 17, 2023

[Fleet] Enhance integration pipelines by adding a additional @custom pipelines for global, type, and integration processors elastic/kibana#168019

Closed

axw mentioned this issue Jan 10, 2024

x-pack/plugin/apm-data: fix @custom component templates #104182

Merged

nchaulet mentioned this issue Jan 23, 2024

[Fleet] Potential breaking change with APM data streams (maybe others) and Fleet ingest pipeline customization hooks elastic/kibana#175254

Closed
