
Allow easily adding custom pipelines and templates per integration; currently it is done per dataset. #146792

Closed
leandrojmp opened this issue Dec 1, 2022 · 20 comments
Labels
Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@leandrojmp

leandrojmp commented Dec 1, 2022

Hello,

Currently, at my company we manage all of our Logstash pipelines and index templates ourselves. Since we are migrating to version 8, we thought about using some Elastic Agent integrations to shift the time spent managing pipelines and templates to other tasks.

But when we started looking at how the integrations work when you need to add custom processors or custom fields, we saw that we would spend more time maintaining the integrations with our custom processors and fields than we currently spend on our Logstash pipelines and templates.

For example, today we have a couple of component templates with mappings for both ECS fields like source.* and destination.* and some custom fields like siem.* that we add to all of our indices. We also have some Logstash pipelines and ingest pipelines that we simply drop into the pipeline folder or add as an index.final_pipeline for some indices.
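
To make this more concrete, here is a minimal sketch of the kind of shared component template I mean; the template name siem-mappings and the exact fields are just illustrative:

  // illustrative names: siem-mappings and siem.datacenter are examples from our setup
  PUT _component_template/siem-mappings
  {
    "template": {
      "mappings": {
        "properties": {
          "siem": {
            "properties": {
              "datacenter": { "type": "keyword" }
            }
          }
        }
      }
    }
  }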

The first integration we tested was the Cisco Duo integration. If we want to add a single custom field, like siem.datacenter, we would need to create and edit at least 5 custom ingest pipelines and 5 custom component templates, because the integrations work with a pipeline/template per dataset, not per integration.

Also, all integrations share the same lifecycle policy, which does not work in many cases; we cannot use the same lifecycle policy for an integration that generates 500 MB/day and for another that generates 200 GB/day. But to edit the lifecycle policy you need to edit the template, which in the end means editing a lot of files.
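
For reference, the lifecycle policy is just an index setting carried by the template, so overriding it today means cloning and editing every managed template to carry a fragment like the following (the policy name is illustrative):

  // "high-volume-logs" is a placeholder for one of our own ILM policies
  "template": {
    "settings": {
      "index.lifecycle.name": "high-volume-logs"
    }
  }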

I've created this issue after a chat with @ruflin in another issue in the Elasticsearch repository.

I've also made two posts on Discuss explaining these issues in more detail, this one and this one.

Describe the feature:

The feature would allow setting a custom ingest pipeline, a custom template, and a custom lifecycle policy per integration in Kibana while adding it in Fleet; currently there is no easy way to do this.

@botelastic botelastic bot added the needs-team Issues missing a team label label Dec 1, 2022
@ruflin ruflin added the Team:Fleet Team label for Observability Data Collection Fleet team label Dec 2, 2022
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Dec 2, 2022
@ruflin
Copy link
Member

ruflin commented Dec 2, 2022

Let's split the problem up into two parts:

  • Extension of integrations with pipelines and templates
  • ILM policies

There is a chance that the same solution might work for both, but let's discuss them separately.

Extension of integrations with pipelines and templates

My understanding is that the flow you would like to see is:

  • Pick integration
  • Select the component templates you want to use
  • Have these appended to the existing mappings for all data streams of this integration (important bit here is append)

Same for ingest pipelines. The ones you add do their processing after the integration has done its work. It reminds me a bit of elastic/elasticsearch#61185.

For the custom templates, do these apply to all datasets in an integration, or do you have different ones for metrics vs. logs, for example?

Let's assume for a moment we could offer the following flow. I'll describe it as UI bits, but of course it would have to be available through an API.

  1. You select all the datasets (or your full integration) to which you want to add component templates
  2. You select the component templates you want to add
  3. Apply

When integrations are updated, all your changes stay in place. Same applies for the ingest pipelines.

Side note: in case we eventually move forward with elastic/elasticsearch#85692 (comment), it would hopefully remove your need to maintain your own ECS component templates, but it does not get rid of the problem.

ILM policy management

There have been quite a few discussions internally on this and there are several solutions we discussed. Before diving into solutions, let me ask some more questions:

  • Integration-specific ILM policy: you specifically mentioned the integration. Do you expect to set the same policy for all datasets in the integration?
  • Do you use namespaces for your datasets? If yes, do you expect to have different ILM policies per namespace?
  • At what stage would you like to set the ILM policy? Creation of the Elastic Agent policy? Integration settings? ...?

@leandrojmp Thanks for filing and your integration contributions!

@leandrojmp
Author

Hello, @ruflin,

Since ILM policies are controlled by the index.lifecycle.name setting in the template, I think that having an easy way to change the pipelines and templates would solve the issue.

What I would expect to be able to do when adding an integration is:

  • Easily add a custom ingest pipeline per integration
  • Easily add one or more component templates per integration

One of the issues is that there doesn't seem to be a standard for how integrations work: some integrations have one ingest pipeline that calls other ingest pipelines depending on what the data looks like, while others split the data into multiple datasets and use multiple ingest pipelines and templates. This makes it very hard to organize things.

Also, splitting the data from an integration into multiple datasets and multiple data streams can result in many small indices. One of the main recommendations from Elastic is to avoid small indices, yet the integrations and many internal indices go against this recommendation.

One example that I have is the Cisco Duo integration, which uses the Cisco Duo API to collect 5 types of logs:

  • Administrator Logs
  • Authentication Logs
  • Enrollment Logs
  • Summary Logs
  • Telephony Logs

In this integration Elastic chose to use a different template, data stream, and ingest pipeline for each of these log types. If I want to add one of the custom fields that I have in all of our indices, I need to edit at least 5 templates and 5 ingest pipelines.

And there is the issue that when you update an integration, the old ingest pipelines are not deleted, so you may end up with a lot of unused ingest pipelines that you need to remove manually to keep things organized.

@ruflin
Member

ruflin commented Dec 8, 2022

One of the issues is that there doesn't seem to be a standard for how integrations work: some integrations have one ingest pipeline that calls other ingest pipelines depending on what the data looks like, while others split the data into multiple datasets and use multiple ingest pipelines and templates. This makes it very hard to organize things.

Do you have some examples here? In general I expect all integrations to work with multiple datasets (if there are, of course, multiple datasets) and to have ingest pipelines and templates for each. This is exactly what makes your "per integration" goal tricky.

Also, splitting the data from an integration into multiple datasets and multiple data streams can result in many small indices. One of the main recommendations from Elastic is to avoid small indices, yet the integrations and many internal indices go against this recommendation.

This is partially changing. Elasticsearch historically struggled with too many shards, but since the introduction of the data stream naming scheme many improvements have been made on this front. @jpountz Do we have any more public info on this somewhere?

In this integration Elastic chose to use a different template, data stream, and ingest pipeline for each of these log types. If I want to add one of the custom fields that I have in all of our indices, I need to edit at least 5 templates and 5 ingest pipelines.

The interesting part here is that these are all logs datasets. I agree with you; we should offer you an easy way to apply a common template/pipeline to all 5 (or more) without having to do all the additional API calls.

And there is the issue that when you update an integration, the old ingest pipelines are not deleted, so you may end up with a lot of unused ingest pipelines that you need to remove manually to keep things organized.

Are you referring to your own ingest pipelines here or the ones managed by the integration? The ones from the integration should be cleaned up, AFAIK. @kpollich

Appreciate all the details you provided. In summary, in most scenarios you are looking for "management per integration" (as you described in the title) and care less about the underlying datasets. I assume that if there were also metrics datasets, the customizations would likely be grouped by type of data, logs and metrics, each needing separate updating.

@kpollich
Member

kpollich commented Dec 8, 2022

Are you referring to your own ingest pipelines here or the ones managed by the integration? The ones from the integration should be cleaned up, AFAIK. @kpollich

Ingest pipelines tied to a previous version of a since-upgraded integration are removed during the upgrade process, that is correct.

@leandrojmp
Author

leandrojmp commented Dec 8, 2022

Do you have some examples here? In general I expect all integrations to work with multiple datasets (if there are, of course, multiple datasets) and to have ingest pipelines and templates for each. This is exactly what makes your "per integration" goal tricky.

Yeah, you are right @ruflin, my mistake. I was comparing with the CrowdStrike integration, which has a couple of ingest pipelines but in the end just one dataset. So if an integration has multiple datasets, it will have one data stream, one ingest pipeline, and one template per dataset, which is the current issue.

The interesting part here is that these are all logs datasets. I agree with you; we should offer you an easy way to apply a common template/pipeline to all 5 (or more) without having to do all the additional API calls.

custom ingest pipelines

I was thinking about this today because I added the Google Workspace integration, which has 8 different datasets, which means 8 more custom ingest pipelines and custom templates to manage.

Currently the custom ingest pipelines work by having a pipeline processor at the end of every managed ingest pipeline, for example:

  {
    "pipeline": {
      "name": "logs-google_workspace.admin@custom",
      "ignore_missing_pipeline": true
    }
  }

In this case the integration is google_workspace and the dataset is admin.

To make this work for every dataset in this integration without breaking the current behavior, every ingest pipeline in the integration could have an extra pipeline processor:

  {
    "pipeline": {
      "name": "logs-google_workspace.admin@custom",
      "ignore_missing_pipeline": true
    }
  },
  {
    "pipeline": {
      "name": "logs-google_workspace@custom",
      "ignore_missing_pipeline": true
    }
  }

Then the ingest pipeline logs-google_workspace@custom would apply to every dataset in the integration.

Another way would be to have only the logs-google_workspace@custom processor in every ingest pipeline of an integration, and this integration-level pipeline would call the individual custom pipelines using conditionals. This logs-google_workspace@custom pipeline would need to be created while installing the integration:

  {
    "pipeline": {
      "name": "logs-google_workspace.admin@custom",
      "if": "ctx.event?.dataset == 'google_workspace.admin'",
      "ignore_missing_pipeline": true
    }
  },
  {
    "pipeline": {
      "name": "logs-google_workspace.alert@custom",
      "if": "ctx.event?.dataset == 'google_workspace.alert'",
      "ignore_missing_pipeline": true
    }
  },
  {
    "pipeline": {
      "name": "logs-google_workspace.DATASET@custom",
      "if": "ctx.event?.dataset == 'google_workspace.DATASET'",
      "ignore_missing_pipeline": true
    }
  }

With this you do not break the current behavior, and it becomes easier to add custom processors both per integration and per dataset.

custom component templates

A similar approach would also work for the component templates. Currently the custom mappings use a template called logs-INTEGRATION.DATASET@custom, like in the following example:

[Screenshot from 2022-12-08 15-26-32]

You could have a logs-google_workspace@custom component template in every dataset template of the integration; this would allow editing all the custom fields in just one component template instead of one template per dataset per integration.
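
For illustration, the generated index template of each dataset could then simply list the shared template in its composed_of; roughly like the sketch below, where the exact template names and their order are assumptions on my part:

  {
    "index_patterns": ["logs-google_workspace.admin-*"],
    "composed_of": [
      "logs-google_workspace.admin@package",
      "logs-google_workspace.admin@custom",
      "logs-google_workspace@custom"
    ]
  }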

I could also add a setting for index.lifecycle.name in this component template, but I'm not sure it would work; the documentation says that you need to create a new index template with a higher priority to be able to override the default ILM policy.

So maybe the best option would be to allow choosing an already existing component template, regardless of its name, and an already existing ILM policy while adding the integration; this would automatically add the component template to the managed templates and change the default ILM policy to the custom one.

duplicate ingest pipelines

@kpollich I've got some leftovers in this case here, but I will manually remove the old ingest pipelines.

[Screenshot from 2022-12-08 15-49-31]

@ruflin
Member

ruflin commented Dec 13, 2022

This is an interesting approach you are proposing here. I especially like that it follows the logic of how we handle and name ingest pipelines and templates at the moment. Internally we had very similar discussions around how to enable namespace-specific configs and pipelines, so we would have logs-{dataset}-{namespace}@custom etc. You are now going in the opposite direction and adding it at the integration level. @kpollich @joshdover It would be great if you could chime in here with your thoughts. Having a component template at a less granular level could solve quite a few issues at once.

@leandrojmp I was looking for some docs around "many shards are more acceptable now" and found the following blog post: https://www.elastic.co/blog/three-ways-improved-elasticsearch-scalability Hope this helps explain in a bit more detail why the data stream naming scheme is OK.

@ruflin
Member

ruflin commented Dec 15, 2022

Adding a link to #121118, which I mentioned previously; the counterpart to the above discussion is the namespace-specific feature discussion (more granular). Ideally, these two concepts follow the same underlying logic.

@leandrojmp
Author

leandrojmp commented Dec 15, 2022

@ruflin I think I read this post in the past but forgot about it. There is also this blog post about the number of shards, which was updated since the rule of thumb of 20 shards per 1 GB of heap is no longer valid.

But old habits die hard, and those two things were the truth for so long that it is sometimes hard not to care about them.

What I need, and what I proposed in the previous comment, is a way to add custom mappings and custom ingest pipelines at the integration level. Currently you can only do it at the dataset level, and some integrations have multiple datasets, which multiplies the number of files you need to manage.

This discussion about having custom templates/pipelines per namespace does not change much about how things work now, but it seems it would add a way to run the same integration multiple times in different namespaces with different custom templates/pipelines per namespace. That adds more granularity, but it doesn't make it easier to manage a lot of custom templates and pipelines; on the contrary, it adds more things to manage.

@joshdover
Contributor

joshdover commented Dec 15, 2022

Yeah, we've talked about doing something like this before. There are levels of granularity that may be desired: global, per type (e.g. logs-*), per package (e.g. *-nginx.*-*), per dataset (what we support today, e.g. logs-nginx.access-*), per dataset per namespace (e.g. logs-nginx.access-foo), and maybe even just per namespace (e.g. *-*-foo).

It starts to get really complicated really fast from a UX perspective if we add component templates for all of these permutations out of the box. IMO we should prioritize adding the most commonly requested ones within the existing UX, and then explore a more tailored UI that lets the user choose which level they want this applied to, where the appropriate templates are generated on demand, rather than having many hundreds of templates that are there for use but empty.

@ruflin
Member

ruflin commented Dec 16, 2022

But old habits die hard, and those two things were the truth for so long that it is sometimes hard not to care about them.

Yes, and we need to communicate it more on our end.

This discussion about having custom templates/pipelines per namespace does not change much about how things work now, but it seems it would add a way to run the same integration multiple times in different namespaces with different custom templates/pipelines per namespace. That adds more granularity, but it doesn't make it easier to manage a lot of custom templates and pipelines; on the contrary, it adds more things to manage.

Agreed. I brought it up more in the context of having "common" conventions, but it seems we are already aligned on this.

templates are generated on demand

@joshdover Besides the UX challenge, isn't the problem here that currently all component templates are required to exist? There is no ignore_missing like for ingest pipelines?

@joshdover
Contributor

@joshdover Besides the UX challenge, isn't the problem here that currently all component templates are required to exist? There is no ignore_missing like for ingest pipelines?

Just saw you wrote this just before me: #146804 (comment). I will open the ES issue.
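
Roughly what I have in mind on the Elasticsearch side, as a hypothetical sketch mirroring ignore_missing_pipeline (the option name and its placement would be decided in the ES issue):

  // hypothetical: lets a template reference @custom component templates that may not exist yet
  PUT _index_template/logs-google_workspace.admin
  {
    "index_patterns": ["logs-google_workspace.admin-*"],
    "data_stream": {},
    "composed_of": ["logs-google_workspace.admin@package"],
    "ignore_missing_component_templates": ["logs-google_workspace.admin@custom"]
  }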

@joshdover
Contributor

joshdover commented Dec 16, 2022

@leandrojmp
Author

leandrojmp commented Oct 2, 2023

Just curious: besides the ignore_missing option for custom ingest pipelines, have there been any other improvements to the management of integrations?

We are currently starting to use more integrations and are planning to use the Elastic Agent/Endpoint to collect logs on our hosts, replacing another log collector.

But it seems that all the management issues are still the same.

Is there at least an easier way to use separate ILM policies for different integrations? We have some integrations that do not generate 50 GB per year and others that generate 50 GB per week, but Elastic treats both the same, since all integrations use the same lifecycle policy.

Not being able to easily use different ILM policies for different integrations makes the management of data streams really complicated. Is something like this planned?

@leandrojmp
Author

It seems that the following issues may help solve some of the current problems with the management of Elastic Agent integrations.

@joshdover
Contributor

Hi @leandrojmp!

Is there at least an easier way to use separate ILM policies for different integrations? We have some integrations that does not generate 50 GB per year and others that generate 50 GB per week, but Elastic considers both the same since all integrations use the same lifecycle policy.

This can be done on a per-dataset basis today by adding a custom lifecycle policy to the associated @custom templates. For example, for the System integration, you could edit the logs-system.syslog@custom component template to specify a different ILM policy.
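
A minimal sketch of that per-dataset workaround, where my-syslog-policy stands in for an existing ILM policy of yours (the change applies to new backing indices on the next rollover):

  // "my-syslog-policy" is a placeholder for your own ILM policy name
  PUT _component_template/logs-system.syslog@custom
  {
    "template": {
      "settings": {
        "index.lifecycle.name": "my-syslog-policy"
      }
    }
  }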

The issues you mentioned will help provide more of the underlying plumbing to make customizations at other levels, rather than only at the dataset level. We still need to add easier ways to make customizations at any level, such as a focused UI centered on customizing data streams.

@leandrojmp
Author

Hello @joshdover,

Yeah, the way this is done today is at the dataset level; I mentioned this earlier in this issue, and it is the reason for this feature request. Unfortunately, doing customizations at the dataset level is impractical in production for many reasons.

For example, the Google Workspace integration has 14 datasets: to customize anything you would need to clone and manage 14 templates, and that is just for one integration.

Customizations need to be possible at least at the integration level to be useful.

I just wanted to know if this is still on the roadmap; the newly linked issues made clear that it indeed is.

@leandrojmp
Author

Hello,

Do we have any update on this after one year? Which issues are tracking the improvements?

@joshdover
Contributor

joshdover commented Dec 27, 2023

@leandrojmp We added support for custom pipelines at the package, type, and global levels. This is shipping in 8.12: #170270
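
Roughly speaking, each managed pipeline now ends with a chain of optional @custom hooks along these lines; treat this as a sketch and take the exact names and ordering from #170270 and the 8.12 docs:

  {
    "pipeline": {
      "name": "logs-google_workspace.admin@custom",
      "ignore_missing_pipeline": true
    }
  },
  {
    "pipeline": {
      "name": "logs-google_workspace.integration@custom",
      "ignore_missing_pipeline": true
    }
  },
  {
    "pipeline": {
      "name": "logs@custom",
      "ignore_missing_pipeline": true
    }
  },
  {
    "pipeline": {
      "name": "global@custom",
      "ignore_missing_pipeline": true
    }
  }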

We have not yet added support for customizing settings and mappings at the integration/package level. We still need to decide whether we will be able to support this at the Elasticsearch level in a generic way that isn't specific to Fleet integrations (elastic/elasticsearch#97664). If we decide not to pursue that in the short term, we will likely prioritize #149484 to solve this for Fleet integrations.

Unfortunately, I do not have a timeline to share at this time.

@joshdover
Contributor

With the pipeline aspect done, the remaining work for index settings and mappings is tracked in #149484

@joshdover closed this as not planned on Jan 2, 2024