Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Elastic Agent] Default processors created per input can result in high agent CPU usage #35000

Closed
cmacknz opened this issue Apr 3, 2023 · 4 comments · Fixed by #35031
Closed
Assignees
Labels

Comments

@cmacknz
Copy link
Member

cmacknz commented Apr 3, 2023

Background

Starting from 8.6 the default global processors for Beats that are run by agent are configured in code instead of being read from the default Beat configuration file. Beats managed by agent no longer read a configuration file at startup and instead wait for their initial configuration to be sent by agent. This change was done in #34149.

The implementation from #34149 makes the processors global by configuring them for each input run by Elastic Agent. Looking at the beat-rendered-config.yml file available in the agent diagnostics for example shows the processors at the input level:

- data_stream:
    dataset: system.auth
    type: logs
  exclude_files:
  - .gz$
  id: logfile-system.auth-8913b026-f53e-43f4-909b-d7c91335f141
  index: logs-system.auth-vault
  multiline:
    match: after
    pattern: ^\s
  paths:
  - /var/log/auth.log*
  - /var/log/secure*
  processors:
  - add_host_metadata:
      when:
        not:
          contains:
            tags: forwarded
  - add_cloud_metadata: null
  - add_docker_metadata: null
  - add_kubernetes_metadata: null

This is in contrast to the default configuration file which defines a single instance of each processor for the process by defining them at the top level of the configuration (see Where are processors valid for details):

processors:
- add_host_metadata:
when.not.contains.tags: forwarded
- add_cloud_metadata: ~
- add_docker_metadata: ~
- add_kubernetes_metadata: ~

Problem

Similarly in 8.6 there was a change to the aws-s3 input to create a new beat.Client for each new SQS worker in #33658 to improve performance. This results in a new input pipeline being constructed for each SQS worker, each of which gets its own instance of the per input processors as of 8.6.

This was not a problem until 8.7, when it was discovered that each instance of a beat input pipeline was referencing an accidentally global instance of the per input processors. This was fixed in #34761. The change in #34761 now results in each input pipeline constructing a new instance of the global processors.

Each of these global processors is expensive to create and includes code to try to perform expensive work only at initialization time. The problem is this only works if there is a single instance of the processor, otherwise each unique instance of the processor attempts to reinitialize itself often performing the exact same request multiple times. For example:

  1. In the case of Docker or Kubernetes each instance is subscribing/watching the Docker/k8s API for changes. And each instance is keeping its own in-memory state of the Docker/k8s resources. This ideally should only be done once per process for both the sake of the Beat and the Docker/K8s API.
  2. The repeated construction of these processors by inputs like aws-s3 and filestream will be slow. These processors generally have a high instantiation cost. Like for docker the constructor creates a new docker client and tries making an API call before returning. Or in add_cloud_metadata there was in inadvertent change that introduces a 3s worst-case construction cost ([libbeat] add_cloud_metadata - startup blocked by AWS IMSDv2 token fetch #33058).

Impact

As of 8.7 we observing extremely high CPU usage for Beats run under agent and the agent itself in situations where inputs are frequently created. For example in the case of the add_cloud_metadata processor we are observing the agent itself being spammed by repeated log messages from the add_cloud_metadata initialization sequence:

{"log.level":"info","@timestamp":"2023-04-01T00:14:59.605Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:14:59.715Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:14:59.716Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:14:59.824Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.062Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.194Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.282Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.394Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.477Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.542Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.629Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}

This comes from the code below which includes a sync.Once block that is being defeated by a new instance of the processor being created for each individual input:

func (p *addCloudMetadata) init() {
p.initOnce.Do(func() {
result := p.fetchMetadata()
if result == nil {
p.logger.Info("add_cloud_metadata: hosting provider type not detected.")
return
}
p.metadata = result.metadata
p.logger.Infof("add_cloud_metadata: hosting provider type detected as %v, metadata=%v",
result.provider, result.metadata.String())
})
}
func (p *addCloudMetadata) getMeta() mapstr.M {
p.init()

Solution

When Beats run under agent we need to create the default global processors at the Beat process level, instead of the input level to match what is done in the global configuration file.

@cmacknz cmacknz added Team:Elastic-Agent Label for the Agent team v8.7.0 labels Apr 3, 2023
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz cmacknz added the Agent label Apr 3, 2023
@cmacknz cmacknz changed the title [Elastic Agent] Default processors should be created at the process level instead of per input [Elastic Agent] Default processors are created per input can result in high agent CPU usage Apr 3, 2023
@cmacknz cmacknz changed the title [Elastic Agent] Default processors are created per input can result in high agent CPU usage [Elastic Agent] Default processors created per input can result in high agent CPU usage Apr 3, 2023
@rdner
Copy link
Member

rdner commented Apr 4, 2023

It does not seem like the processors are marked as global in #34149 they're just added to the config, right? We would need a way to distinguish them from the rest.

However, I'm against using anything global. Global processors are very dangerous as we learnt in #34761

Perhaps instead we should modify the processors themselves to better manage their connections (singleton connections shared by all the processors of this type) instead of hacking the global processors. This makes way more sense to me.

@cmacknz
Copy link
Member Author

cmacknz commented Apr 4, 2023

These processors were global in every release prior to 8.6. Global processors are a standard feature of Beats, and a commonly requested feature for agent (https://github.com/elastic/ingest-dev/issues/2442). They aren't something we can get away with not supporting.

I agree that where the processors are instantiated shouldn't matter, but in this case we have a critical performance degradation that can easily be fixed by reverting to a known good configuration. This is significantly less risk and effort than rewriting each of the default add_x_metadata processors.

I don't see a reason why we shouldn't take the easiest path and restore the behaviour we had before, even if we don't like the way these processors work architecturally.

@rdner
Copy link
Member

rdner commented Apr 4, 2023

This write-up might be helpful for the person working on this issue #34716 (comment)

Ignore the issue itself, the comment contains a lot of processor-related code that will help to navigate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants