Enable TSDB downsampling ILM configuration #130437

Closed
elastichelix opened this issue Apr 17, 2022 · 21 comments · Fixed by #138748
Labels
Feature:ILM impact:critical This issue should be addressed immediately due to a critical level of impact on the product. loe:large Large Level of Effort Meta Team:Kibana Management Dev Tools, Index Management, Upgrade Assistant, ILM, Ingest Node Pipelines, and more

Comments

@elastichelix
Contributor

elastichelix commented Apr 17, 2022

TSDB Downsampling ILM Configuration

Stakeholders

  • Slack channel: #tsdb
Name                                     Role
Lewis Smith-Tong (@lewissmithtong)       PM
Beth Richardson (@elastichelix)          Tech lead
Cristina Albu (@cristina-eleonora)       Design

Purpose of project and known requirements

As part of the TSDB project, we want to enable automatic downsampling of time series data via ILM. Downsampling will be provided as an ILM action. Configuration will be kept simple: dimensions and metrics are extracted from the index mapping, so the only information required from the user is the time interval. This project seeks to modify the existing ILM UX to enable configuring and modifying the downsampling rollup action for read-only time series indices in the hot, warm, and cold phases.

Resources

Tasks

  • [TSDB Downsampling ILM] Add an optional rollup action to hot, warm, and cold ILM phases for time series indices
  • [TSDB Downsampling ILM] Validate rollup action intervals against previous rollup intervals (rollup of rollups)

Technical analysis

Data flow

As described in the ILM configuration section of the downsampling design, we will add the ability to configure an ILM rollup action on the hot, warm, and cold phases in ILM.

As part of the rollup action, a fixed interval must be configured; this is the interval to which the data will be rolled up. The interval must use the same notation as the date_histogram aggregation.

For example:

PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "warm": {
        "actions": {
          "rollup" : {
            "fixed_interval": "1h"
          }
        }
      }
    }
  }
}
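To make the notation requirement concrete, here is a minimal TypeScript sketch of how the UI could check a fixed_interval string before sending it to Elasticsearch. The helper name `parseFixedIntervalMs` is hypothetical and not part of the ILM plugin. It assumes the units accepted by date_histogram's fixed_interval (ms, s, m, h, d) and treats a zero interval as invalid.

```typescript
// Units accepted by date_histogram's fixed_interval, expressed in milliseconds.
const FIXED_INTERVAL_UNIT_MS: Record<string, number> = {
  ms: 1,
  s: 1000,
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
};

// Hypothetical helper: returns the interval in milliseconds,
// or undefined when the string is not a valid fixed interval.
function parseFixedIntervalMs(interval: string): number | undefined {
  const match = /^(\d+)(ms|s|m|h|d)$/.exec(interval.trim());
  if (!match) {
    return undefined; // wrong shape, or a calendar unit such as "1w"
  }
  const value = Number(match[1]);
  if (value <= 0) {
    return undefined; // assumption: a zero interval is rejected
  }
  return value * FIXED_INTERVAL_UNIT_MS[match[2]];
}

console.log(parseFixedIntervalMs("1h"));
console.log(parseFixedIntervalMs("1w"));
```

Converting to milliseconds up front also makes it easy to compare intervals across phases later.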

Rollup of rollups

As described in the downsampling design doc section for rollup of rollups, the rollup action intervals must adhere to some limitations: they must be greater than any previous rollup action and must be a multiple of that interval.

For example, if the user sets a 3 hour interval on the warm phase and then adds a rollup action on the cold phase, the cold interval must be a multiple of 3h. It cannot be 5h, for example.

Another example: a user can have a 2d rollup on the hot phase, no rollup action on the warm phase, and a rollup action on the cold phase, but the cold interval must be a multiple of 2d. It cannot be 1d, nor can it be 3d.

Note that if this is too expensive in the UI, this can also be validated at policy creation time, but ideally this would be supported in the ILM configuration UI.
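The rollup-of-rollups rule can be sketched as a small TypeScript check. This is only an illustration under stated assumptions: the function and type names are hypothetical, intervals are already converted to milliseconds, and each phase is compared against the most recent earlier phase that has a rollup action.

```typescript
type PhaseName = "hot" | "warm" | "cold";
// Interval per phase in milliseconds; a missing key means no rollup action.
type PhaseIntervalsMs = Partial<Record<PhaseName, number>>;
type PhaseErrors = Partial<Record<PhaseName, string>>;

const PHASE_ORDER: PhaseName[] = ["hot", "warm", "cold"];

// Hypothetical validator: returns one error message per offending phase.
function validateRollupOfRollups(intervals: PhaseIntervalsMs): PhaseErrors {
  const errors: PhaseErrors = {};
  let previous: { phase: PhaseName; ms: number } | undefined;

  for (const phase of PHASE_ORDER) {
    const ms = intervals[phase];
    if (ms === undefined) {
      continue; // no rollup action here, keep comparing against the earlier one
    }
    if (previous !== undefined) {
      if (ms <= previous.ms) {
        errors[phase] = `must be greater than the ${previous.phase} phase interval`;
      } else if (ms % previous.ms !== 0) {
        errors[phase] = `must be a multiple of the ${previous.phase} phase interval`;
      }
    }
    previous = { phase, ms };
  }
  return errors;
}

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Warm 3h followed by cold 5h: cold is flagged as not a multiple.
console.log(validateRollupOfRollups({ warm: 3 * HOUR_MS, cold: 5 * HOUR_MS }));
// Hot 2d, no warm action, cold 4d: no errors.
console.log(validateRollupOfRollups({ hot: 2 * DAY_MS, cold: 4 * DAY_MS }));
```

Returning errors keyed by phase keeps the check usable both in a per-field UI validator and in a single pass at policy save time.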

Overview of changes

  • Add new RollupAction to policies
  • Add rollup action to phases
  • Create new rollup configuration shared field component
  • Update deserializer to include rollup configuration
  • Update serializer to send the new rollup action based on configuration
  • Validate that the rollup action intervals for subsequent phases are valid: each must be greater than any previous rollup action interval and must be a multiple of that interval (rollup of rollups)

Note that during modification the fixed interval can be changed to any value and does not need to be validated or limited by existing fixed interval on the policy.
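As a rough illustration of the serializer and deserializer changes listed above, the following TypeScript sketch maps between the policy's rollup action and a form field state. The shapes `RollupAction` and `RollupFieldState` and both function names are assumptions made for this example; the real ILM form code may be structured differently.

```typescript
// Shape of the action as it appears in the policy JSON.
interface RollupAction {
  fixed_interval: string;
}

// Hypothetical form state: a toggle plus separate value and unit inputs.
interface RollupFieldState {
  enabled: boolean;
  value: string;
  unit: string;
}

// Policy -> form: split "1h" into value "1" and unit "h".
function deserializeRollup(action?: RollupAction): RollupFieldState {
  if (!action) {
    return { enabled: false, value: "1", unit: "d" }; // assumed defaults
  }
  const match = /^(\d+)([a-z]+)$/.exec(action.fixed_interval);
  return match
    ? { enabled: true, value: match[1], unit: match[2] }
    : { enabled: true, value: action.fixed_interval, unit: "" }; // keep unparsed input as-is
}

// Form -> policy: omit the action entirely when the toggle is off.
function serializeRollup(field: RollupFieldState): RollupAction | undefined {
  if (!field.enabled) {
    return undefined;
  }
  return { fixed_interval: `${field.value}${field.unit}` };
}

console.log(serializeRollup(deserializeRollup({ fixed_interval: "1h" })));
```

Keeping the two functions as inverses of each other means an unmodified policy round-trips through the form without changes.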

Open questions

@elastichelix elastichelix added Meta Feature:ILM Team:Kibana Management Dev Tools, Index Management, Upgrade Assistant, ILM, Ingest Node Pipelines, and more labels Apr 17, 2022
@elasticmachine
Contributor

Pinging @elastic/platform-deployment-management (Team:Deployment Management)

@elastichelix
Contributor Author

@csoulios Do we have an existing link for implementing the rollup action on the API side for policy creation? Also, we are assuming that the validation on fixed_interval would take place on both the API and the UI preferably. Will that interval validation be present when that API is initially implemented?

@elastichelix
Contributor Author

elastichelix commented Apr 26, 2022

@yuliacech brought up some great points based on her work on the discontinued rollup v2 project. I will add some of her questions here, which should be addressed:

Dependent or related actions

So for the current plan for downsampling, I think we might want to consider how downsampling affects other actions. For example, when searchable snapshots are enabled in Hot phase, some actions (force merge, read only, shrink) are removed from other phases. Something similar will probably need to happen for downsampling.

@csoulios Do you know if these same concerns around removing or limiting existing actions from other phases would need to be addressed in the UI for policy configuration when adding a rollup action?

On a related note, I know that we require read-only to be in place, does a read-only action need to be part of the policy phase when adding a rollup action as well?

Index Management updates

@yuliacech also pointed out, as highlighted in this issue from rollups v2, that there may be a benefit to highlighting in Index Management those indices that are being downsampled with a rollup action in ILM. @csoulios or @wchaparro, is the Index Management UI in scope for this work?

@csoulios

@csoulios Do we have an existing link for implementing the rollup action on the API side for policy creation?

Right now we are implementing the /source_index/_rollup/rollup_index endpoint to perform the actual rollup action (elastic/elasticsearch#85708).

The Rollup ILM action is coming next. I will update this issue with the PR when I submit it.

Also, we are assuming that the validation on fixed_interval would take place on both the API and the UI preferably. Will that interval validation be present when that API is initially implemented?

That's right. There will be a validation on the elasticsearch side that throws an error for invalid values. However, it would be more user friendly if the UI can validate the interval before sending it to ES.

@csoulios

@csoulios Do you know if these same concerns around removing or limiting existing actions from other phases would need to be addressed in the UI for policy configuration when adding a rollup action?

I don't think we should remove any of the ILM actions from the following phases. It doesn't make much sense to have the ReadOnly action, since the rollup index is already read-only, so the ReadOnly action will effectively do nothing.

On a related note, I know that we require read-only to be in place, does a read-only action need to be part of the policy phase when adding a rollup action as well?

No, it doesn't have to be part of the policy. It will be an implicit step in the DownsampleAction.

@csoulios

I pushed the PR that implements the Downsampling ILM Action.

Adding it here for reference: elastic/elasticsearch#87269

cc @ghudgins

@elasticmachine
Contributor

Pinging @elastic/kibana-app-services (Team:AppServicesUx)

@exalate-issue-sync exalate-issue-sync bot added impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:x-large Extra Large Level of Effort labels Jul 20, 2022
@csoulios

csoulios commented Jul 27, 2022

FYI, PR elastic/elasticsearch#87269, which implements the Rollup ILM action, has been merged into both main and the Kibana-specific branch.

@exalate-issue-sync exalate-issue-sync bot added loe:large Large Level of Effort impact:critical This issue should be addressed immediately due to a critical level of impact on the product. and removed loe:x-large Extra Large Level of Effort impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. labels Jul 28, 2022
@Dosant Dosant self-assigned this Aug 3, 2022
@Dosant
Contributor

Dosant commented Aug 16, 2022

I started working on this, made some progress, and explored the ILM code. I can answer some open questions and also have some further related questions:

Open questions from this issue:

Q: Is there an existing UX for defining appropriate fixed intervals?

Looks like ILM currently has two different UIs for this. Unifying them looks like tech debt:

The first is for min age:
[Screenshot: min age input]

The other is for shard size in the Shrink action:
[Screenshot: shard size input in the Shrink action]

I reused the second one because the usage is very similar.

Q: Needs UX design/mockup

Don't think we need those because the UI pattern is established, and we already have the interval component.

Screenshot 2022-08-16 at 12 48 14

But need help with text:

  • Is the user-facing action named "Rollup" as in the API, or should it be "Downsample"? Or something else?
  • Need a description text for the action

(cc @alexfrancoeur: @vadimkibana mentioned you were going to look into the design)

Q: When the policy is configured is it required for the policy to include a read-only action for the hot phase at configuration time or is the read-only requirement validated only at policy application time?

As I understood from the discussion, we shouldn't add any conditions in the UI for read-only actions based on rollup actions: #130437 (comment)
Also, we don't do anything specific based on other actions (for example, no condition based on isUsingSearchableSnapshotInHotPhase and not hiding any other action if rollupAction is enabled)

Did I get it right? Please correct me

Q: How challenging will it be to validate intervals in the UI? Need to assess options for ensuring intervals are valid based on actions in previous phases.

We will add UI validation using the existing pattern. There is an example with min age where min age is validated between phases.

NOTE: Currently, Elasticsearch doesn't validate the interval when saving a policy. It doesn't validate interval constraints between phases, and it doesn't validate whether this is a valid fixed_interval expression. @csoulios, do you know if this will be improved? Is there an issue?

Q: Are Index Management screen updates part of this issue?

@yuliacech also pointed out, as highlighted in #102638 from rollups v2, that there may be a benefit to highlighting in Index Management those indices that are being downsampled with a rollup action in ILM. @csoulios or @wchaparro, is the Index Management UI in scope for this work?

I understand this should be separate and not covered by this work, mainly because it needs more clarification:

  • Needs design/text for the Index Management screen
    • The design should also differentiate between old rollup indices and new TSDB rollup indices
  • Also, this is different code that could be delivered separately. It would be convenient to extract it into a separate issue and clarify it there

If this is not acceptable, please call this out, and we should clarify what is missing

@ghudgins

ghudgins commented Aug 22, 2022

no one asked me but here are a few comments if it helps!

  • I don't think it's necessary for us to refactor the UI for consistency in this phase. Agree with your choice here
  • For rollup vs. downsample - (retracted: I think we should match Elasticsearch and call it rollup despite calling it downsampling everywhere internally. This is what the v1 experimental feature was called as well. CC @csoulios @giladgal to make sure they agree.) Call it downsample for the reasons listed below.
  • Here is a stab at descriptive text. I lean on our @elastic/kibana-docs team to check this during PR review: "Reduce index size by storing summaries of data. Note: you will not be able to perform all aggregations on downsampled data. Read more about rollups (<-- doc link)"
  • Improving validation can happen in a later phase. Might be good to make sure intervals are not smaller than previous as something simple in the UI
  • Agree we should descope index management screen changes and instead make a new issue for this.

Only other question from these screenshots @Dosant - is there a specific phase we need to show for the "buffer period" where we allow late arriving docs? Is that in this design yet?

CC @sixstringcode

@gchaps
Contributor

gchaps commented Aug 22, 2022

@debadair Can you please review the text that Graham suggested?

@cjcenizal
Contributor

@ghudgins I agree with your assessment, except I have questions about using the term "Rollup". We already support Rollup Jobs that create rollup indices. Does the ILM downsampling action produce the exact same type of rollup indices that are produced by Rollup Jobs?

@Dosant
Contributor

Dosant commented Aug 23, 2022

@ghudgins, thanks!

Improving validation can happen in a later phase. Might be good to make sure intervals are not smaller than previous as something simple in the UI

I am adding UI validation as described in the issue. It's just that currently there is no validation on the ES side when creating a policy.

Only other question from these screenshots @Dosant - is there a specific phase we need to show for the "buffer period" where we allow late arriving docs? Is that in this design yet?

Not sure what you mean by "buffer period", this wasn't mentioned in the description and I didn't see this as part of the API in the es docs that I've reviewed.

This is what I saw on the subject here: https://github.com/elastic/elasticsearch-adrs/blob/master/analytics/tsdb/tsdb-rollups-design.md

After the source index goes through rollover, all new documents are stored in a new index in the data stream. But late arriving documents can still go into the index that was rolled over but not yet downsampled. After the index has been downsampled, all late arriving documents will be discarded. ILM does not take any specific measure to protect users against it. This is left to the user to configure the timing of the ILM policy, based on their setup.

@cjcenizal,

Does the ILM downsampling action produce the exact same type of rollup indices that are produced by Rollup Jobs?

As I understand they are quite different

@ghudgins

@cjcenizal - yes, they are different. However, we will eventually deprecate the entire rollup system in favor of ILM-supported rollups. Happy to keep the terminology separate if that makes it easier, but they are logically the same thing, and we intentionally did this version of rollups instead of the job-based one after the findings from doing v1.

@ghudgins

Not sure what you mean by "buffer period", this wasn't mentioned in the description and I didn't see this as part of the API in the es docs that I've reviewed.

I'll follow up with @giladgal & @csoulios, as I believe this "tier" was part of the original design:
[Image: diagram from the original design]

@cjcenizal
Contributor

cjcenizal commented Aug 23, 2022

Thanks Anton and Graham. I suggest we create terms that enable users to easily differentiate the indices created by the downsample action from the indices created by rollup jobs. This is similar to the decision by the ES Data Management team to differentiate legacy index templates from composable index templates when they introduced the latter in 7.8.

Here are a couple examples of what I have in mind:

  • Legacy rollup indices vs. ILM rollup indices
  • Rollup indices vs. downsampled indices

Once we've landed on the right terms we can teach users about the differences between the two by using these terms consistently in our UI and docs.

@giladgal

giladgal commented Aug 24, 2022

For rollup vs. downsample - I think we should match Elasticsearch and call it rollup despite calling it downsampling everywhere internally. this is what the v1 experimental feature was called as well CC @csoulios @giladgal to make sure they agree.

At the time the decision was to use different terminology because functionally neither one is a superset of the other. I'd rather not change that decision at this stage unless someone feels strongly about it.

@ghudgins

I retract my suggestion 😄 and will edit the above comment!

@debadair
Contributor

In general, we try to mirror the terminology used in the API, but the potential confusion with previous rollups is definitely a concern.

With some wordsmithing, I think we can bridge the terminology and dodge the potential confusion.

This is more words than the other action descriptions, but it might be worth it:

Downsample
Roll up documents within a fixed interval to a single summary document.
 Reduces the index footprint by storing time series data at reduced granularity.

That gives the Rollup interval label for the setting some context, and connects the dots between downsampling and rollups.

One thing to note: We don't (yet) have a good destination for a Learn more link. I'll work with the folks on the ES side to fix that.

@csoulios

I will try to address as many open questions as possible.

When the policy is configured is it required for the policy to include a read-only action for the hot phase at configuration time or is the read-only requirement validated only at policy application time? This needs analysis by the TSDB team per @csoulios . Not 100% clear, confirm: #130437 (comment)

No, it is not required for the policy to include a read-only action. The index is implicitly set in read-only mode by the rollup action.

is there a specific phase we need to show for the "buffer period" where we allow late arriving docs? Is that in this design yet?

The buffer period is the period after the index has been rolled over and before it is downsampled. For example, if we roll over an index after 1 day and downsample it after 3 days, the buffer period (when it can accept late arrivals) is 2 days. I don't think we should enforce any validation on this, since timing the index transitions is entirely up to the user.
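The arithmetic in that example can be written down as a tiny TypeScript helper. The function name is hypothetical, and it assumes both ages are measured in days from the same starting point, as in the example above.

```typescript
// Hypothetical helper: length of the window in which a rolled-over index
// can still accept late-arriving documents before it is downsampled.
function bufferPeriodDays(rolloverAfterDays: number, downsampleAfterDays: number): number {
  if (downsampleAfterDays < rolloverAfterDays) {
    throw new RangeError("downsampling cannot be scheduled before rollover");
  }
  return downsampleAfterDays - rolloverAfterDays;
}

// Rollover after 1 day, downsample after 3 days: a 2 day buffer.
console.log(bufferPeriodDays(1, 3));
```

A buffer of zero means late-arriving documents are discarded as soon as the index rolls over, which is a trade-off the user has to weigh when timing the policy.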

Does the ILM downsampling action produce the exact same type of rollup indices that are produced by Rollup Jobs?

The format of indices produced by ILM downsampling is totally different from the indices produced from rollup jobs.

For rollup vs. downsample - I think we should match Elasticsearch and call it rollup despite calling it downsampling everywhere internally. this is what the v1 experimental feature was called as well CC @csoulios @giladgal to make sure they agree.

Generally speaking, in the time-series data world downsampling is a subset of the rollup functionality (summarizing data only by changing the time interval). The current release of rollups will only support the downsampling functionality. However, later we may choose to add support for more rollup features (such as dimension reduction). If we now name this feature "downsampling", I am afraid that later we will have to rename it so that it correctly describes the supported functionality (or keep the name "downsampling" but add more capabilities to it).

The experimental rollup functionality (Rollup Jobs) is soon going to be replaced by the new downsampling/rollup feature. I would say that we have to be explicit that this is the "new rollups".

So, if I had to pick a name for it, it would be "Rollups for time series data/indices".

@Dosant
Contributor

Dosant commented Aug 31, 2022

Continuing work on this in #138748.

  1. In yesterday's meeting, we confirmed that the new action should be named "Downsample." I updated labels and code from rollup -> downsample. The only thing left is to update the API calls when ES changes from rollup -> downsample on their side.

This is how the labels look now, looking for feedback:

[Screenshot: Downsample action labels]

Does the interval label make sense (Downsampling interval)? Or should it be named somehow differently?

  2. As we discussed, the downsample action should also be available in the frozen phase. I added this, so this change would make it available in the hot, warm, cold, and frozen phases. But ATM ES doesn't allow setting the action in the frozen phase (validation fails), so for now we keep only hot, warm, and cold.
