
AWS Cloudwatch is indexing data in the wrong data stream #5467

Open

lucabelluccini opened this issue Mar 7, 2023 · 8 comments

Comments
@lucabelluccini
Contributor

lucabelluccini commented Mar 7, 2023

This issue concerns https://docs.elastic.co/integrations/aws/cloudwatch

In particular, the cloudwatch_logs data stream.

The problem is that CloudWatch logs are being indexed into the wrong data stream because a wrong data_stream.dataset value is present.

The consequence of data_stream.dataset being set to generic is that, while the assets are installed to work with the logs-aws.cloudwatch_logs-default data stream, the logs actually end up in logs-generic-default. So none of the ingest pipelines, mappings, or settings of the expected data stream are applied.
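To make the routing explicit, here is a minimal sketch (illustrative, not part of the rendered policy) of how the dataset value decides the target data stream:

```yaml
# Events are routed to the data stream named <type>-<dataset>-<namespace>
data_stream:
  type: logs
  namespace: default
  dataset: aws.cloudwatch_logs   # -> logs-aws.cloudwatch_logs-default (integration assets apply)
  # dataset: generic             # -> logs-generic-default (integration assets are bypassed)
```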

This problem has been investigated with the help of @P1llus.

⚠️ Note that because this setting was exposed to end users, reverting it must be carefully planned, as:

  • users might have tweaked the setting
  • mappings might have been tweaked by users

As this is a widely used integration, we should warn users if we revert the behavior.

A possible approach is:

  • for new policies, do not offer the dataset setting and set it to the correct value
  • for existing policies where the field exists, we should guide the users on how to go back to the "proper" data stream

Example rendered policy, just saving with the default values:

```yaml
  - id: aws-cloudwatch-cloudwatch-23aacfba-ffe0-4123-b981-ad713db31b9c
    name: aws-1
    revision: 1
    type: aws-cloudwatch
    use_output: fd5e4b10-ae2d-11ed-91ee-0320f091a587
    meta:
      package:
        name: aws
        version: 1.32.0
    data_stream:
      namespace: default
    package_policy_id: 23aacfba-ffe0-4123-b981-ad713db31b9c
    streams:
      - id: >-
          aws-cloudwatch-aws.cloudwatch_logs-23aacfba-ffe0-4123-b981-ad713db31b9c
        data_stream:
          dataset: generic # <<<<<<<<
        region_name: null
        start_position: beginning
        scan_frequency: 1m
        api_sleep: 200ms
        tags:
          - forwarded
          - aws-cloudwatch-logs
        publisher_pipeline.disable_host: true
```
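For comparison, a sketch of the same stream block pointing at the intended data stream (only the dataset value changes; everything else stays as rendered above):

```yaml
      - id: >-
          aws-cloudwatch-aws.cloudwatch_logs-23aacfba-ffe0-4123-b981-ad713db31b9c
        data_stream:
          dataset: aws.cloudwatch_logs # -> logs-aws.cloudwatch_logs-default
```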
@lucabelluccini
Contributor Author

Assigning to @elastic/obs-cloud-monitoring

@aspacca
Contributor

aspacca commented Mar 7, 2023

@lucabelluccini I'm not sure whether using generic as the dataset, and offering users the option to change it, is a bug rather than a design decision.

CloudWatch logs are just a logs "container", into which logs of different kinds, sources, and formats can be collected.

I would expect users to ingest their own application logs from CloudWatch (from a Lambda or anything else) and either set the data stream to something specific or, in any case, add their own ingest pipeline to deal with the specific format of their custom logs.
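As an illustration (hypothetical pipeline and field names, not something shipped by the integration), a user-provided pipeline for JSON application logs forwarded through CloudWatch could look roughly like this:

```yaml
description: Parse JSON application logs forwarded through CloudWatch (hypothetical example)
processors:
  - json:
      field: message
      target_field: myapp          # hypothetical target object for the parsed JSON
      ignore_failure: true
  - rename:
      field: myapp.level
      target_field: log.level
      ignore_missing: true
  - date:
      field: myapp.timestamp       # assumes the app emits an ISO8601 timestamp
      target_field: "@timestamp"
      formats: ["ISO8601"]
      ignore_failure: true
```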

Moreover: you could have multiple CloudWatch logs, collecting multiple log types, and you don't want them to end up in the same data stream.

What has to be fixed here, in my opinion, is the following:

The consequence of data_stream.dataset being set to generic is that, while the assets are installed to work with the logs-aws.cloudwatch_logs-default data stream, the logs actually end up in logs-generic-default. So none of the ingest pipelines, mappings, or settings of the expected data stream are applied.

I will check what the "ingest pipelines, mappings or settings" for logs-aws.cloudwatch_logs-default are, but there shouldn't be anything beyond enforcing the data_stream fields.

@lucabelluccini
Contributor Author

Hello, I've never developed a package, but the aws/cloudwatch_logs package is one of only two integrations that set the dataset to generic.
The Custom Logs integration is the other one (which makes sense there).

When using a package, except for the "custom" family, I've always observed events going to the data stream named <type>-<dataset>-<namespace>, where <dataset> is a combination of <package name>.<dataset>.

I would expect users to ingest their own application logs from CloudWatch (from a Lambda or anything else) and either set the data stream to something specific or, in any case, add their own ingest pipeline to deal with the specific format of their custom logs.

If this is the way we expect customers to use cloudwatch_logs, then:

  • What is the reason for having assets?

If this is intended, then I would enhance the documentation to explain how the integration should be used, and investigate why we have assets for this integration.
If this is not intended, then the comments in the initial issue description apply.

@aspacca
Contributor

aspacca commented Mar 9, 2023

If this is the way we expect customers to use cloudwatch_logs, then:

  • What is the reason for having assets?

Most of the assets are relevant to the integration package on its own; nothing is related to a template or an ingest pipeline except for https://github.com/elastic/integrations/blob/b72fe2e619dc1fa8f5a6c0731e3661c940c00b1f/packages/aws/data_stream/cloudwatch_logs/elasticsearch/ingest_pipeline/default.yml
As you can see, it's a very basic ingest pipeline that sets some basic metadata (a sketch of what such a pipeline looks like follows below). It would be nice if such a pipeline could be set in a template applied to the outcome of <type>-<dataset>-<namespace>, according to the data_stream.dataset configured by the user, but as far as I know that's not possible at the moment. If the user provides a custom data_stream.dataset the pipeline won't be used; but there's no parsing of a specific content format there (since there's no such thing), so nothing will be lost except the metadata set in the pipeline.
There's still the problem that the final data stream won't have the exact mappings applied from the template: again, per my understanding, this is a limitation in Fleet server, which does not support applying the proper operations to a <dataset> that comes from user configuration.
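For reference, a metadata-only pipeline of this kind typically looks something like the following rough sketch (the linked default.yml is authoritative for the actual contents):

```yaml
description: Minimal pipeline that only adds metadata (sketch, not the real default.yml)
processors:
  - set:
      field: cloud.provider
      value: aws
      ignore_failure: true
  - rename:
      field: message
      target_field: event.original
      ignore_missing: true
```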

It was introduced because you could have multiple policies with the AWS CloudWatch logs integration, each of them ingesting logs from a different "generic" application, and you want to split those into different data streams. Imagine ingesting logs from two different applications, one in JSON format and one not: you would have to apply different processors in the ingest pipeline, and users should not be forced to write both types of logs into the same data stream.
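For example (hypothetical dataset names), two separate policies could render streams along these lines:

```yaml
# Policy A: JSON application logs (hypothetical dataset name)
data_stream:
  dataset: myapp_json        # -> logs-myapp_json-default, paired with a JSON-parsing pipeline
---
# Policy B: plain-text application logs (hypothetical dataset name)
data_stream:
  dataset: legacyapp_plain   # -> logs-legacyapp_plain-default, paired with its own dissect/grok pipeline
```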

If this is intended, then I would enhance the documentation to explain how the integration should be used and investigate why we have assets for this integration.

Yes, it is intended.
Briefly, the documentation should explain that no format parsing is done by default, and that users should add their own custom pipeline for that.
They can either apply it to the generic dataset or, if they have multiple log formats to ingest through the same AWS CloudWatch logs integration, it is suggested to split them into different datasets: creating a different policy for each, so that each can use a different dataset value.

I hope it is clearer now.

@P1llus
Member

P1llus commented Mar 9, 2023

@aspacca I think the issue is a bit around consistency.

We have quite a few packages which focus on raw data rather than an out-of-the-box integration (httpjson, tcp, udp, log, s3, gcs, abs and so on), and they all have a few things in common:

  1. They allow the data_stream.dataset to be configured
  2. They do not have any assets, because as you said, they will not be applied if you change the dataset name.
  3. They only focus on a single input

There are a few issues with the input packages for cloudwatch (and eventhub): they apply some basic pipelines which in most cases will not be used, and users get confused when they don't work (because the dataset has been changed).

However, the two biggest issues right now are:

  1. The default name of the dataset should be consistent with the name of the package and the data stream (in this case aws.cloudwatch_logs); if it isn't, even the default options won't provide the basic functionality users are after. This also affects users of the default options when they try to use @custom component templates.
  2. The dataset configuration is only available for one of the two inputs, and for some reason there is also an S3 input there?

@aspacca
Contributor

aspacca commented Mar 9, 2023

  • The default name of the dataset should be consistent with the name of the package and the data stream (in this case aws.cloudwatch_logs); if it isn't, even the default options won't provide the basic functionality users are after. This also affects users of the default options when they try to use @custom component templates.

I get your point and it does not invalidate what you mentioned before:

They allow the data_stream.dataset to be configured

  • The dataset configuration is only available for one of the two inputs, and for some reason there is also an S3 input there?

Introducing the s3 input in the package was an erroneous decision taken when creating the package in the first place. We realised that, and if I remember correctly that input is now deprecated: we do indeed plan to remove it completely.

They do not have any assets, because as you said, they will not be applied if you change the dataset name.

Could you clarify exactly which assets you are referring to (there are different ones)? Sharing a link to a sample assets folder is enough for me to do the comparison :)

@P1llus
Member

P1llus commented Mar 9, 2023


They do not have any assets, because as you said, they will not be applied if you change the dataset name.

Could you clarify exactly which assets you are referring to (there are different ones)? Sharing a link to a sample assets folder is enough for me to do the comparison :)

If we look at another input package (example TCP): https://github.com/elastic/integrations/tree/main/packages/tcp/data_stream/generic

We do not have any ingest pipelines at all, instead we allow the user to configure one here:
https://github.com/elastic/integrations/blob/main/packages/tcp/data_stream/generic/manifest.yml#L33
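For context, this kind of input package exposes the pipeline and dataset as variables in the data stream's manifest; a rough sketch of what such vars look like (field names from memory, the linked manifest is authoritative):

```yaml
streams:
  - input: tcp
    title: Custom TCP Logs
    description: Collect raw logs over TCP
    template_path: tcp.yml.hbs
    vars:
      - name: data_stream.dataset
        type: text
        title: Dataset name
        default: tcp.generic
        required: true
        show_user: true
      - name: pipeline
        type: text
        title: Ingest Pipeline
        description: Optional ingest pipeline to apply to incoming events
        required: false
        show_user: true
```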

The result of that is:

  1. There won't be a confusing default pipeline created, which usually ends up not being applied anyway.
  2. There is no reason to use @custom component templates (which also don't work out of the box after changing the dataset).

In terms of the field mappings, we usually want the user to define them, and unfortunately there is no workaround for that at the moment. However, more and more packages (especially input-related ones) are starting to use the dynamic ECS template (#5055), so that at least ECS fields should not need to be mapped manually.

@botelastic

botelastic bot commented Mar 8, 2024

Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Mar 8, 2024