[Fleet] [Meta] Support for time series indexing, doc-value-only fields, and synthetic source #132818

joshdover · 2022-05-24T15:10:00Z

We have three new indexing features in Elasticsearch that can reduce the overall storage size of data significantly: time series indexing, doc-value-only fields, and synthetic source.

We'd like to enable integration developers to start testing the ingest and query performance of enabling these features before we start making any changes in the integrations themselves or allowing end users to enable these from the Fleet UI.

Today, each of these can already be enabled by leveraging the *@custom component templates that Fleet installs for each integration data stream, with varying degrees of ease of use (details below). We could improve the UX around this for integration developers by adding an explicit API in Fleet to enable this; however, it may not be necessary.

How to do this today

See https://github.com/elastic/integrations/blob/main/docs/how_to_test_new_indexing_features.md
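
As a rough illustration of the *@custom route, the sketch below builds (but does not send) a request that turns on synthetic `_source` through a data stream's custom component template. It assumes the `_source.mode: synthetic` mapping option as it existed in the 8.x releases discussed here; the template name `logs-nginx.access@custom` and the helper name are hypothetical, and the linked how-to remains the authoritative reference.

```python
import json


def custom_template_request(template_name: str) -> dict:
    """Describe a PUT _component_template call that enables synthetic _source.

    Assumption: `template_name` is the *@custom component template that
    Fleet installed for the data stream being tested.
    """
    return {
        "method": "PUT",
        "path": f"/_component_template/{template_name}",
        "body": {
            "template": {
                "mappings": {
                    "_source": {"mode": "synthetic"},
                }
            }
        },
    }


# Hypothetical template name, for illustration only.
request = custom_template_request("logs-nginx.access@custom")
print(request["method"], request["path"])
print(json.dumps(request["body"], indent=2))
```

New settings only take effect on backing indices created after the template change, so a rollover of the data stream would normally follow.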

elasticmachine · 2022-05-24T15:10:02Z

Pinging @elastic/fleet (Team:Fleet)

ruflin · 2022-05-25T08:00:56Z

Should we be able to enable this on a package level or data stream level?

joshdover · 2022-06-01T12:51:44Z

I'd prefer to keep this as simple as possible, and only do it on the package level if we don't need to be able to do it on a data stream level.

joshdover · 2022-07-08T13:28:45Z

@kpollich @jen-huang We could enable the ecosystem team to add each of these toggles if we first provided a basic framework for adding a package-level setting that is used at install time. I think that framework needs to include:

Storing settings on the epm-packages object
Exposing an API field for updating the settings on a package (likely on the package install endpoint)
Exposing the settings to the various installation paths during install
Exposing a preconfiguration field for settings (optional)

This would probably be quite low effort to provide the basic plumbing and I think the Ecosystem folks would be able to use that to make the specific changes required for each feature.

andresrc · 2022-07-22T15:35:41Z

@joshdover for the specific case of synthetic source, given that the change does not depend on specific fields, etc., do you think it would be feasible, as a first step, to introduce this toggle (package or data stream level, whichever we think is better) just on the Fleet side without requiring any new setting in the packages? This would remove the need to do an additional release of every existing package.

Once we add the setting in the packages, it can be used as the default value of the toggle for that package / data stream, and we should make sure it is "use synthetic source" for every new package that we create.

joshdover · 2022-07-22T15:52:17Z

@andresrc I think starting with synthetic source would make the most sense. We could use that to build out the plumbing I described in #132818 (comment) and then leverage it for other opt-in features in the future.

One tricky thing about synthetic source is the limitation on keyword fields that have ignore_above configured. Synthetic source does not support this yet. This makes the feature a little unusable with the majority of our data streams. So if we were to enable support for this on the Fleet API, I think we'd want to put it on the package level but then only apply it to data streams that do not have any ignore_above keyword fields.

andresrc · 2022-07-27T08:53:13Z

(edited after #132818 (comment))

@joshdover @jsoriano as we move forward with the testing we keep finding corner cases: the ignore_above parameter, the fact that some logs data streams seem to store event.original just in _source. There's also the possibility, mentioned in different places, that this might be a breaking change in some cases.

Given this, I would like to propose a gradual approach.

Phase I

Add a toggle at the data stream level with a default value of not using synthetic source. This would allow greater granularity at enabling the feature to test for potential breakages and leverage the benefits when possible. We have very big packages which mix logs and metrics data streams. Package-level granularity would not be very practical.

When the toggle is enabled:

Synthetic source will be enabled.
The ignore_above property would be removed from all fields that have it.

The toggle (or the screen containing the toggle) would also show some badge or similar warning that this is a technical preview / beta feature.

With this, we could start gradually recommending the use of synthetic source in specific data streams without changes as we feel confident about them and without generating the risk of breaking changes.
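
The Phase I toggle behavior can be sketched roughly as follows. This is only an illustration of the two listed effects (enable synthetic source, drop `ignore_above`), not Fleet's implementation; the function name and sample mappings are hypothetical.

```python
import copy


def apply_synthetic_source_toggle(mappings: dict, enabled: bool) -> dict:
    """Return a copy of `mappings` reflecting the proposed toggle state."""
    if not enabled:
        # Default state: mappings are left exactly as the package defined them.
        return copy.deepcopy(mappings)

    def strip(node):
        # Walk nested dicts and lists, dropping every integer-valued
        # `ignore_above` parameter. A field that happens to be *named*
        # ignore_above has a dict value and is therefore kept.
        if isinstance(node, dict):
            return {
                key: strip(value)
                for key, value in node.items()
                if not (key == "ignore_above" and isinstance(value, int))
            }
        if isinstance(node, list):
            return [strip(item) for item in node]
        return node

    result = strip(mappings)
    result["_source"] = {"mode": "synthetic"}
    return result


# Hypothetical mappings for illustration.
original = {
    "properties": {
        "message": {"type": "keyword", "ignore_above": 1024},
        "host": {"properties": {"name": {"type": "keyword", "ignore_above": 1024}}},
    }
}
toggled = apply_synthetic_source_toggle(original, enabled=True)
```

The input is never mutated, so the untouched original can still serve as the source of truth if the toggle is switched off again.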

Phase II

Depending on the results of the previous phase, we can consider different options:

Add a setting in the package. When it is enabled at the data stream level it will always use synthetic source with no option of disabling it. This would be used only for new data streams.
Enable synthetic source by default for new integrations that are installed.
Add a package level / global level toggle that can override the individual data streams to enable synthetic source.

We will also need to consider the decisions made around the future of the ignore_above property.

Phase III

(probably in a major)

Always use synthetic source by default.

jsoriano · 2022-07-27T09:13:53Z

@andresrc thanks, the plan sounds good to me. The only change I would make is to make phase II conditional on the results of phase I. If things go well I think that we could go directly to the first point of phase III, enabling synthetic source by default in all new policies, and avoid adding a setting that would have an expiration date. Such a setting may be confusing for package developers.

andresrc · 2022-07-27T14:58:40Z

Thanks @jsoriano , edited the comment

joshdover · 2022-07-28T14:38:28Z

@andresrc

> Phase I
> Add the toggle at the data stream level with a default value of not using synthetic source.

Is this a UI toggle or an API toggle? If it's an API, I don't think we really need to create anything new as it seems the same as asking the users to set this on the custom component template using its existing Elasticsearch API.

If we want UI support, we could add the toggle to the component template editor UI to allow setting synthetic source there (right now you have to do it in Dev Tools). This would help us avoid needing to determine where to list these data streams and settings for them (we don't have any such UI today).

> Phase II
> Add a setting in the package. When it is enabled at the data stream level it will always use synthetic source with no option of disabling it. This would be used only for new data streams.

One tricky thing is that we may need to specify an ES version number requirement for the data stream since the restrictions around synthetic source are changing between releases (e.g. support for ignore_above). Or would we simply bump the kibana.version constraint to be the same minimum version required for the data stream to use synthetic source?

jsoriano · 2022-07-28T15:21:59Z

> using its existing Elasticsearch API

@joshdover I think these flags should be managed by Fleet, in case we later decide to enable these features everywhere automatically, or depending on certain conditions.
It may also happen that multiple settings are needed, as in the case of TSDB or doc-value-only fields. It could also be the case for synthetic source if we decide to remove ignore_above when it is enabled.

As long as Fleet is aware of the setting, I don't have a strong preference for API or UI. UI would be better to make it easier to recommend its use to users or not so experienced package developers, but it'd be nice if the flags are also exposed through the Fleet API in any case.

andresrc · 2022-07-28T16:49:37Z

@joshdover I have updated the comment with some additional considerations, including what to do with ignore_above

> Is this a UI toggle or an API toggle? If it's an API, I don't think we really need to create anything new as it seems the same as asking the users to set this on the custom component template using its existing Elasticsearch API.

I would prefer to start with a UI toggle as it would be easier for end users to try it for certain data streams where they might get the benefit.

> If we want UI support, we could add the toggle to the component template editor UI to allow setting synthetic source there (right now you have to do it in Dev Tools). This would help us avoid needing to determine where to list these data streams and settings for them (we don't have any such UI today).

If we are doing more "magic" here (i.e. removing the ignore_above) would the component template editor UI still be the right place? But yes, there is the issue that we don't have a place to list the data streams for a given integration, probably something that we would need to add to the (installed) integration page.

joshdover · 2022-08-08T10:22:45Z

> If we are doing more "magic" here (i.e. removing the ignore_above) would the component template editor UI still be the right place? But yes, there is the issue that we don't have a place to list the data streams for a given integration, probably something that we would need to add to the (installed) integration page.

If we need to modify existing mappings for this "magic" then this probably isn't the right place for it. Let's figure out a good place within the Integrations UI. I think having something on the "settings" tab for an installed integration could make sense.

Marking this issue as needing design, since that is the next step before we can do implementation work.

tommyers-elastic · 2022-08-10T14:33:55Z

we have discovered an issue whilst testing this on integration data streams. fleet's 'final pipeline' component template contains the event.agent_id_status field, which is a keyword with ignore_above. this is automatically applied and currently causes enabling synthetic _source to fail.

after a brief discussion it was suggested that simply removing ignore_above from this mapping is safe, since the values of the field are discrete and well known (one of missing, auth_metadata_missing, mismatch, verified). the relevant code is here.

tommyers-elastic · 2022-08-12T08:48:52Z

FYI - i have found another barrier to enabling this in certain integrations which declare fields with dynamic mappings. the keyword field type generated by the dynamic mapping processor applies ignore_above by default. https://github.com/elastic/beats/blob/ea207346d651448b8917b0791b2b117b9f9b9212/libbeat/template/processor.go#L293

I'm looking into ways around this now, but wanted to give you a heads up that currently this fails at index time.

edit:
just doing some reading around this and it looks like the following issue is closely related. at this point i'm not 100% sure how it is pieced together but i will update when i understand it better. #129344
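
To make the two findings above concrete, here is a small sketch of how one might scan a mappings object for `ignore_above`, covering explicit fields, multi-fields, and dynamic templates. The function name and the sample values are hypothetical; only the `event.agent_id_status` field name comes from this thread.

```python
def find_ignore_above(mappings: dict) -> list[str]:
    """List the locations in `mappings` that set `ignore_above`."""
    found = []

    def walk(node: dict, path: str) -> None:
        if isinstance(node.get("ignore_above"), int):
            found.append(path)
        # Object fields nest their children under `properties`.
        for name, child in node.get("properties", {}).items():
            walk(child, f"{path}.{name}" if path else name)
        # Multi-fields (e.g. a keyword sub-field of a text field) can set it too.
        for name, child in node.get("fields", {}).items():
            walk(child, f"{path}.{name}")

    walk(mappings, "")

    # Dynamic templates are a list of single-key objects: {name: {"mapping": {...}}}.
    for entry in mappings.get("dynamic_templates", []):
        for name, template in entry.items():
            if isinstance(template.get("mapping", {}).get("ignore_above"), int):
                found.append(f"dynamic_templates[{name}]")
    return found


# Illustrative mappings; the numeric limits are placeholders.
sample = {
    "dynamic_templates": [
        {"labels": {"path_match": "labels.*",
                    "mapping": {"type": "keyword", "ignore_above": 1024}}}
    ],
    "properties": {
        "event": {"properties": {
            "agent_id_status": {"type": "keyword", "ignore_above": 1024}}},
        "message": {"type": "text",
                    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
    },
}
locations = find_ignore_above(sample)
```

A check like this only sees what is in the template; keyword fields created at index time by a dynamic template inherit the setting later, which is why the dynamic template entries are reported separately.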

kpollich · 2022-08-12T18:21:19Z

I spoke with @mukeshelastic a bit offline about potential avenues for implementation here.

Generally, I am in favor of an implementation that consists of the following:

A Fleet managed API for setting indexing options including tsdb, doc-value-only, and synthetic source on a per data stream basis
An interface on the integration settings page that allows users to opt in to each of these indexing features on each data stream for a given package
(Future) Package spec changes that allow package maintainers to opt in to these settings by default for given data streams

There is one major caveat that @tommyers-elastic has begun broaching above: the incompatibility between ignore_above and synthetic source. Currently, if an index contains a mapping with ignore_above set and it's updated to opt in to synthetic source, the operation will fail as Tom has mentioned in #132818 (comment).

We have a few options to work around this limitation, mentioned in various comments above:

Implement logic in Fleet such that we can intelligently add/remove ignore_above from mappings based on the enabled/disabled state of synthetic source
Petition the Elasticsearch team to alter the mappings API such that ignore_above is silently (or with some non-fatal warning) ignored when synthetic source is enabled.

I'll attempt to detail the tradeoffs for each approach below.

1. Fleet intelligently adds/removes `ignore_above` setting

With this approach, Fleet will need to update all mappings for a given data stream such that they do not include an ignore_above setting upon an opt-in to synthetic source. Then, if synthetic source is disabled at some point in the future, Fleet will need to re-apply the ignore_above setting to those same mappings, and any mappings that have since been added to the index.

This means Fleet will also need to detect the ignore_above setting in the @custom component template for a given package when opting in or out of synthetic source.

The risks here are mainly around implementing a substantial amount of logic for setting/unsetting these mapping settings across component templates for a package. There's room for error here, potential performance costs to consider, and generally a lot of moving parts.
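
A minimal sketch of the bookkeeping that option 1 implies: remember which `ignore_above` values were removed on opt-in so they can be put back on opt-out. Both function names are hypothetical, and this deliberately ignores multi-fields, dynamic templates, and field names containing dots.

```python
def remove_ignore_above(properties: dict, prefix: str = "") -> dict:
    """Strip `ignore_above` in place and return {field_path: old_value}."""
    removed = {}
    for name, field in properties.items():
        path = f"{prefix}{name}"
        if isinstance(field.get("ignore_above"), int):
            removed[path] = field.pop("ignore_above")
        if "properties" in field:
            removed.update(remove_ignore_above(field["properties"], f"{path}."))
    return removed


def restore_ignore_above(properties: dict, removed: dict) -> None:
    """Re-apply recorded values; fields that no longer exist are skipped."""
    for path, value in removed.items():
        node = {"properties": properties}
        for part in path.split("."):
            node = node.get("properties", {}).get(part)
            if node is None:
                break
        else:
            # Only reached when every path segment was found.
            node["ignore_above"] = value


# Hypothetical mappings for illustration.
props = {
    "message": {"type": "keyword", "ignore_above": 1024},
    "host": {"properties": {"name": {"type": "keyword", "ignore_above": 256}}},
}
record = remove_ignore_above(props)      # opt in to synthetic source
stripped_snapshot = "ignore_above" not in props["message"]
restore_ignore_above(props, record)      # opt back out
```

Fields added while synthetic source was enabled have no recorded value, so restoring them would need a separate default; that gap is part of the "moving parts" risk described above.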

2. Elasticsearch accepts the `ignore_above` setting, but ignores it or warns in cases where it conflicts w/ synthetic source

This option would involve petitioning the Elasticsearch team to update the mapping API such that we can enable synthetic source even if an index contains mappings with ignore_above set. This would provide a much better UX for Fleet, as we wouldn't have to take so much action to opt in and out of synthetic source for users.

This has its own set of risks, though, as a cross-team effort that would be beholden to the appropriate ES team's priorities. This is also likely a special case that breaks assumptions users will make based on how other Elasticsearch APIs behave. For instance, the data streams APIs support a subset of the index APIs' features, but Elasticsearch doesn't silently ignore unsupported fields, parameters, etc. from the index APIs when they're provided to data streams APIs. It responds with errors. Bucking this trend and implementing a different pattern for this special case introduces unpredictability and inconsistency into Elasticsearch APIs and how they handle "settings conflicts" like this.

I think in general, the second option is probably preferable. It would likely provide a more stable experience around specifically the synthetic source workflow we're proposing here. It seems to me that the risks around having Fleet implement "magic" to update mapping settings when synthetic source is enabled/disabled are greater than those associated with API ergonomics. I'm also not an Elasticsearch engineer, though, so we should try to get this proposal in front of an appropriate team on that side of things.

@joshdover I'm sure you will have some thoughts on these two approaches and their tradeoffs, and I'm sure you can correct anything I'm misunderstanding. I'd appreciate some input from you and from anyone else who might have further thoughts here.

ruflin · 2022-08-15T06:32:45Z

Thanks for the detailed write up @kpollich and the trade offs mentioned. If I think long term, ideally having synthetic source just becomes a setting on the data stream and all the magic around it is just handled by Elasticsearch (Option 2). But this might take a bit longer.

My suggestion would be to start talking about Option 2 to the Elasticsearch team but do a basic implementation of Option 1 in Fleet on the data stream level. This would be an experimental feature or similar with limited support. For example @custom is not supported. If a user uses @custom it is up to the user to make sure it is also aligned. I'm simplifying but all that Fleet would have to do is take the fields.yml mapping + mappings in the Fleet source code, strip all ignore_above, put it into the template and trigger a rollover. My hope would be that the effort of this would be much smaller as not all the edge cases you mentioned above have to be covered. These would be known limitations.
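
The simplified flow described here (write the stripped mappings into the template, then roll over) might look roughly like the plan below. It only describes the two Elasticsearch requests in order and sends nothing; the function name, data stream name, and template name are hypothetical, and the mappings are assumed to have had `ignore_above` removed already.

```python
def plan_opt_in(data_stream: str, template_name: str, stripped_mappings: dict) -> list[dict]:
    """Return the requests, in order: update the template, then roll over."""
    mappings = {**stripped_mappings, "_source": {"mode": "synthetic"}}
    return [
        {
            "method": "PUT",
            "path": f"/_component_template/{template_name}",
            "body": {"template": {"mappings": mappings}},
        },
        # The rollover creates a new backing index that picks up the updated
        # template; existing backing indices keep their old mappings.
        {"method": "POST", "path": f"/{data_stream}/_rollover"},
    ]


# Hypothetical names for illustration.
steps = plan_opt_in(
    "logs-nginx.access-default",
    "logs-nginx.access@package",
    {"properties": {"message": {"type": "keyword"}}},
)
for step in steps:
    print(step["method"], step["path"])
```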

jsoriano · 2022-08-15T08:22:51Z

I would also say to start with Option 1 at least for the initial implementation of the opt-in feature, this can help to validate the feature and the specifics can evolve over time with option 2 or other alternatives that may appear.

@nik9000 wdyt about the option 2 described in #132818 (comment) ?

kpollich · 2022-08-15T12:34:40Z

> My suggestion would be to start talking about Option 2 to the Elasticsearch team but do a basic implementation of Option 1 in Fleet on the data stream level. This would be an experimental feature or similar with limited support. For example @custom is not supported. If a user uses @custom it is up to the user to make sure it is also aligned.

I'm +1 on this implementation strategy. Flagging the settings UI for these index settings as experimental gives us some leeway on things like this. I think we should probably capture the caveat with @custom templates in a docs writeup about the experimental index settings, as well.

nik9000 · 2022-08-15T14:03:52Z

> 2. Petition the Elasticsearch team to alter the mappings API such that ignore_above is silently (or with some non-fatal warning) ignored when synthetic source is enabled.

Petition accepted. Sort of. I think we'll actually be able to support ignore_above with its original intention. I'd follow elastic/elasticsearch#87480 and, likely, a follow up PR specifically about ignore_above.

But, like, I think it wouldn't be super bad to start by removing ignore_above. I know some other folks tried that with a smaller data set and found that they had fields that were too long and caused rejected documents. That's why I had another think about supporting it properly and I think I figured out how.

nik9000 · 2022-08-15T15:48:05Z

> If a user uses @Custom

I feel bad for the owner of this GitHub username. Same for whoever owns @timestamp. I've talked with @nik because we've been mixed up a few times in the past. GitHub's a funny place.

joshdover · 2022-11-01T09:45:02Z

Since we shipped the experimental support for synthetic source in 8.5, I think we should consider getting doc-value-only fields support in next, with a similar experimental UX. I believe the amount of effort involved in adding support this way should be quite small and may have a very large impact (20% storage savings, 20% improved indexing perf), so I'd prefer we enable this to start being tested sooner rather than later.

@kpollich do you have a rough estimate of effort required to add support for this as an experimental toggle? My understanding is that when enabled, we'd need to modify the component templates to set index: false on any field types that support this feature. Should we open a dedicated issue for just this part, similar to #140095?
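
For reference, the mapping change described here could be sketched as below. The set of eligible field types is an assumption based on the Elasticsearch doc-values documentation (numeric, date, boolean, ip, geo_point, and keyword types); the function name and sample fields are hypothetical, and this is not what Fleet ended up implementing.

```python
# Assumption: field types that stay searchable with doc values only.
DOC_VALUE_ONLY_TYPES = {
    "long", "integer", "short", "byte", "double", "float", "half_float",
    "scaled_float", "unsigned_long", "date", "date_nanos", "boolean",
    "ip", "geo_point", "keyword",
}


def make_doc_value_only(properties: dict) -> dict:
    """Return a copy of `properties` with `index: false` on eligible fields."""
    result = {}
    for name, field in properties.items():
        field = dict(field)  # shallow copy so the input is not mutated
        if "properties" in field:
            field["properties"] = make_doc_value_only(field["properties"])
        elif field.get("type") in DOC_VALUE_ONLY_TYPES and field.get("doc_values", True):
            # Skip fields with doc_values disabled: turning off the index
            # as well would leave them unsearchable.
            field["index"] = False
        result[name] = field
    return result


# Hypothetical fields for illustration.
fields = {
    "bytes": {"type": "long"},
    "message": {"type": "text"},
    "host": {"properties": {"ip": {"type": "ip"}}},
}
updated = make_doc_value_only(fields)
```

Queries on doc-value-only fields are slower than on indexed fields, which is presumably why the related integrations issue limits the change to fields that are rarely used for filtering.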

kpollich · 2022-11-01T11:58:02Z

> @kpollich do you have a rough estimate of effort required to add support for this as an experimental toggle? My understanding is that when enabled, we'd need to modify the component templates to set index: false on any field types that support this feature. Should we open a dedicated issue for just this part, similar to #140095?

A separate issue would be great. The way we've added synthetic source as a toggle is fairly extensible, so I think a lot of the groundwork is already laid here. This is probably a one week lift to implement a toggle.

joshdover · 2022-11-01T19:11:54Z

@kpollich I opened this issue: #144357

ruflin · 2022-11-03T10:12:34Z

On the priority side, could we get the enabling of time series in first instead of the doc-value-only part? This would allow us to better test TSDB indices.

ruflin · 2022-11-03T12:24:43Z

The feature for TSDB might even be split up into 2 parts: Support in Fleet if it is set in the package (elastic/package-spec#357). @kpollich Is this supported today? And second to enable it on demand by switching over. Note: Switching back for TSDB is not possible as far as I know.

joshdover · 2022-12-23T13:28:00Z

@kpollich @jen-huang I have updated this issue to track both the experimental toggles for testing as well as the long-term support for the real GA feature support. All tasks and bugs related to these features should be added here.

joshdover · 2022-12-23T13:29:55Z

This meta issue also needs an owner on the Fleet team. I want to make sure someone has the time to fully understand the goals of these new indexing features and how integrations should leverage them. I think some of the discrepancies in behavior that have gotten implemented (see #147684 for examples) may have been avoided with fewer people working on these items.

kpollich · 2023-02-10T16:26:19Z

Created a few issues based on offline discussions over on Google docs with @lucabelluccini

[Fleet] Find a better location in the UI for experimental indexing features #150913
[Fleet] Add has_experimental_data_stream_indexing_features: true metadata to integration index/component templates when experimental indexing features are enabled #150917

Something else that hasn't come up yet - are there licensing restrictions around these indexing features at all?

https://www.elastic.co/guide/en/elasticsearch//reference/master/tsds.html doesn't mention any licensing restrictions for TSDS.

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html#synthetic-source mentions synthetic _source is still in preview, but no licensing restrictions.

https://www.elastic.co/guide/en/elasticsearch/reference/current/doc-values.html#doc-value-only-fields seems to imply doc value only fields are GA, but no licensing restrictions specified.

As far as I can tell, these features are all available with a basic license, so I don't see any issues here. I could be wrong, though, and I'm not sure how to confirm.

joshdover · 2023-02-14T13:24:58Z

@giladgal where could we find this information regarding licensing for TSDS related features?

giladgal · 2023-02-28T19:41:07Z

> @giladgal where could we find this information regarding licensing for TSDS related features?

Licensing is not described in the documentation. The information about licensing is in the file headers and in the subscriptions web page.

kpollich · 2023-04-21T14:11:00Z

Hi all. I'm closing this in favor of a new meta issue tracking the few outstanding long-term support/stability tasks around TSDS. See https://github.com/elastic/ingest-dev/issues/1773. Thanks for all your help here!

joshdover added the Feature:EPM and Team:Fleet labels May 24, 2022

This was referenced Jun 14, 2022

Set index: false on fields that are rarely used for filtering elastic/integrations#3419

Open

[Change Proposal] Add support for synthetic source elastic/package-spec#340

Closed

joshdover changed the title ~~[Fleet] Add opt-in support for time series indexing~~ [Fleet] Add opt-in support for time series indexing, doc-value-only fields, and synthetic source Jun 16, 2022

joshdover changed the title ~~[Fleet] Add opt-in support for time series indexing, doc-value-only fields, and synthetic source~~ [Fleet] Opt-in support for time series indexing, doc-value-only fields, and synthetic source Jun 16, 2022

joshdover mentioned this issue Jun 16, 2022

Add documentation for testing new indexing features elastic/integrations#3517

Merged

kpollich mentioned this issue Jun 27, 2022

Add flag to enable time series data streams elastic/package-spec#357

Merged

2 tasks

joshdover added the needs design label Aug 8, 2022

jlind23 added 8.7 candidate and removed 8.6 candidate labels Sep 22, 2022

joshdover mentioned this issue Nov 1, 2022

[Fleet] Add experimental toggle for enabling doc-value-only indexing to data streams #144357

Closed

7 tasks

kpollich mentioned this issue Nov 3, 2022

[Fleet] Add experimental toggle + support package spec field for TSDB #144530

Closed

5 tasks

kpollich changed the title ~~[Fleet] Opt-in support for time series indexing, doc-value-only fields, and synthetic source~~ [Fleet] [Meta] Opt-in support for time series indexing, doc-value-only fields, and synthetic source Nov 4, 2022

gizas mentioned this issue Nov 10, 2022

Validate Kubernetes Integration with TSDB elastic/integrations#4618

Closed

24 tasks

juliaElastic mentioned this issue Dec 1, 2022

[Fleet] Apply experimental data stream settings on package install #146803

Closed

joshdover changed the title ~~[Fleet] [Meta] Opt-in support for time series indexing, doc-value-only fields, and synthetic source~~ [Fleet] [Meta] Support for time series indexing, doc-value-only fields, and synthetic source Dec 23, 2022

kpollich mentioned this issue Jan 3, 2023

[Fleet] Add feature flag for experimental indexing features #148317

Closed

4 tasks

kpollich self-assigned this Jan 3, 2023

kpollich mentioned this issue Jan 26, 2023

[Fleet] Allow toggling TSDB off for data streams #149591

Closed

jen-huang removed the Feature:EPM and 8.7 candidate labels Mar 17, 2023

jsoriano mentioned this issue Apr 12, 2023

[Fleet] Add support for aggregate_metric_double fields in packages #154867

Closed

This was referenced Apr 17, 2023

[Fleet] Provide docs for experimental indexing features elastic/ingest-docs#151

Closed

[Fleet] Remove experimental indexing toggles #155516

Closed

kpollich closed this as completed Apr 21, 2023
