Add presets for performance tuning to ES output configuration #3797

Closed
11 tasks
cmacknz opened this issue Nov 21, 2023 · 11 comments · Fixed by elastic/beats#37259
Labels
Team:Elastic-Agent Label for the Agent team

Comments

@cmacknz
Member

cmacknz commented Nov 21, 2023

This is the agent side implementation issue for elastic/kibana#166870 to add support for output configuration presets.

After some discussion we believe the best path forward for implementing these presets is to keep the definitions of each preset in the agent, with Fleet only specifying the preset name. There are three reasons for preferring this approach:

  1. Presets will work the same way for both standalone agents and Fleet managed agents.
  2. The preset definition can vary with the agent version. This is an advantage because neither users nor Fleet need to account for parameter implementation changes or additions specific to each agent version.
  3. Old agent versions will ignore the preset names and use their current parameter defaults. If Fleet were to auto-configure the output parameters based on a preset in the UI the agent may partially apply the parameters, since new parameters would be ignored.
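
The third point can be illustrated with a minimal sketch. It assumes a hypothetical older agent that silently drops output keys it does not recognize; the contents of OLD_AGENT_KNOWN_KEYS and the key hypothetical_new_parameter are illustrative only and are not taken from any agent version.

```python
# Hypothetical set of output keys an older agent understands.
OLD_AGENT_KNOWN_KEYS = {"type", "hosts", "api_key", "bulk_max_size", "worker"}

def apply_on_old_agent(output_cfg):
    """Keep only the keys the old agent knows; everything else is ignored."""
    return {k: v for k, v in output_cfg.items() if k in OLD_AGENT_KNOWN_KEYS}

# Fleet sends only the preset name: the old agent drops it and keeps its defaults.
by_name = apply_on_old_agent({"type": "elasticsearch", "preset": "throughput"})

# Fleet expands the preset itself: some keys are applied, others are dropped.
expanded = apply_on_old_agent({
    "type": "elasticsearch",
    "bulk_max_size": 1600,
    "worker": 4,
    "hypothetical_new_parameter": 1,  # unknown to the old agent, silently dropped
})

print(by_name)   # {'type': 'elasticsearch'}
print(expanded)  # {'type': 'elasticsearch', 'bulk_max_size': 1600, 'worker': 4}
```

In the first case the old agent behaves exactly as before; in the second it ends up with a mix of new and default values.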

Design

Presets will be selected by adding the preset key to Elasticsearch output in an agent policy. Initially the valid values for the preset key will be: balanced, throughput, scale, latency and custom.

An example configuration is:

outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    api_key: "example-key"
    # Must be one of "balanced", "throughput", "scale", "latency", "custom" 
    # Unknown preset values move the output to the failed state with an appropriate error.
    preset: "throughput"
    bulk_max_size: 1024
    worker: 8

The actual rendering of the preset key into detailed output parameters should be as close to the output implementation as possible. The Elastic Agent should simply pass the preset key through to each supervised component. Since not all Elasticsearch output implementations are exactly the same, this allows the presets to vary depending on the implementation. The exact parameters for preset: throughput may be different for filebeat and endpoint-security for example. For Beats this means the rendering of the preset to detailed output parameters should happen in the Beat itself.
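
A rough sketch of this split, assuming Python stand-ins for the agent and its components. The function names are hypothetical, and both preset tables are placeholders: the filebeat entry borrows two values from the table in this issue, while hypothetical_batch_size is invented purely to show that another component may use different parameters.

```python
from copy import deepcopy

def agent_pass_through(output_cfg, component_names):
    """Agent side: give each component an identical copy, preset left unexpanded."""
    return {name: deepcopy(output_cfg) for name in component_names}

# Component side: each implementation owns its own preset table.
# Both tables are placeholders; real values live in each component.
COMPONENT_PRESETS = {
    "filebeat": {"throughput": {"bulk_max_size": 1600, "workers": 4}},
    "endpoint-security": {"throughput": {"hypothetical_batch_size": 500}},
}

def component_render(component, output_cfg):
    """Component side: expand the preset using this component's own table."""
    rendered = deepcopy(output_cfg)
    rendered.update(COMPONENT_PRESETS[component][output_cfg["preset"]])
    return rendered

output = {"type": "elasticsearch", "preset": "throughput"}
units = agent_pass_through(output, ["filebeat", "endpoint-security"])
rendered = {name: component_render(name, cfg) for name, cfg in units.items()}
```

The agent never looks inside the preset, so adding a new component only requires that component to ship its own table.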

When preset is configured, the effective agent output configuration with all parameters must be inspectable for debugging. At minimum, the full set of parameters and the preset they were generated from must be included in the output of elastic-agent diagnostics. For Beats this can be done in the existing beat-rendered-config.yml file or in a new file generated from a new diagnostics hook, as appropriate. It would additionally be nice to add an elastic-agent inspect output command that shows the rendered output configuration, but this can be done as a follow-up since it will not be straightforward.
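
As a sketch of what such a diagnostics entry could contain, the helper below emits the preset name together with every effective parameter. The function name and the flat YAML layout are assumptions for illustration; they do not describe the actual format of beat-rendered-config.yml.

```python
def diagnostics_yaml(preset, rendered_params):
    """Render the preset name and every effective parameter as flat YAML lines."""
    lines = ["output.elasticsearch:", f"  preset: {preset}"]
    for key in sorted(rendered_params):
        lines.append(f"  {key}: {rendered_params[key]}")
    return "\n".join(lines)

report = diagnostics_yaml(
    "balanced", {"bulk_max_size": 1600, "workers": 1, "compression": 1}
)
print(report)
```

Keeping the preset name next to the expanded values lets a reader of the diagnostics see both what was requested and what was applied.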

Preset Definitions

Configuration                                 Current Default  Balanced  Optimized for Throughput  Optimized for Scale  Optimized for Latency (?)
bulk_max_size                                 50               1600      1600                      1600                 50
workers                                       1                1         4                         1                    1
queue.mem.events                              4096             3200      12800                     3200                 4100
flush.min_events                              2048             1600      1600                      1600                 2050
flush.timeout                                 1                10        5                         20                   1
compression                                   0                1         1                         1                    1
idle_timeout                                  60               3         15                        1                    60

Performance
Stateful Throughput                           1x               3x        5x                        3x                   1x
Serverless Throughput                         1x               5-10x     10-20x                    5-10x                1x
Serverless Throughput (Relative to Stateful)  0.1x             0.2-0.3x  0.3-0.5x                  0.2-0.3x             0.1x
Connections                                   1x               0.3x      4x                        0.04x                1x
Network Traffic                               1x               0.1x      0.1x                      0.05x                0.1x
High-throughput Queue Latency *               1x               1x        1x                        1x                   1x
Low-throughput Queue Latency **               1x               10x       5x                        20x                  1x
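
One pattern in the table can be checked with simple arithmetic: the Low-throughput Queue Latency row equals each column's flush.timeout divided by the Current Default value of 1. A plausible reading, which is an assumption and not stated in the issue, is that when events arrive slowly the queue waits for the flush timeout before sending a batch. The column keys below are shorthand for the table headers.

```python
# flush.timeout values copied from the table above.
FLUSH_TIMEOUT = {"default": 1, "balanced": 10, "throughput": 5, "scale": 20, "latency": 1}

def relative_low_throughput_latency(column):
    """Ratio of a column's flush.timeout to the Current Default column."""
    return FLUSH_TIMEOUT[column] / FLUSH_TIMEOUT["default"]

ratios = {name: relative_low_throughput_latency(name) for name in FLUSH_TIMEOUT}
print(ratios)
# {'default': 1.0, 'balanced': 10.0, 'throughput': 5.0, 'scale': 20.0, 'latency': 1.0}
```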

When the preset is custom, the Fleet UI sets the parameters directly, so the agent does not need to render them as it does for the other presets.
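
A minimal sketch of the rendering step, assuming a Python dictionary stands in for the output configuration. The values are copied from the Preset Definitions table and the key names follow the table rows; render_output is a hypothetical helper, not the Beats implementation.

```python
# Values copied from the Preset Definitions table; key names follow the table rows.
PRESETS = {
    "balanced": {
        "bulk_max_size": 1600, "workers": 1, "queue.mem.events": 3200,
        "flush.min_events": 1600, "flush.timeout": 10, "compression": 1,
        "idle_timeout": 3,
    },
    "throughput": {
        "bulk_max_size": 1600, "workers": 4, "queue.mem.events": 12800,
        "flush.min_events": 1600, "flush.timeout": 5, "compression": 1,
        "idle_timeout": 15,
    },
    "scale": {
        "bulk_max_size": 1600, "workers": 1, "queue.mem.events": 3200,
        "flush.min_events": 1600, "flush.timeout": 20, "compression": 1,
        "idle_timeout": 1,
    },
    "latency": {
        "bulk_max_size": 50, "workers": 1, "queue.mem.events": 4100,
        "flush.min_events": 2050, "flush.timeout": 1, "compression": 1,
        "idle_timeout": 60,
    },
}

def render_output(output_cfg):
    """Expand a named preset; leave a custom configuration exactly as supplied."""
    rendered = dict(output_cfg)
    if output_cfg["preset"] != "custom":
        # Preset values overwrite anything the user set for the same keys.
        rendered.update(PRESETS[output_cfg["preset"]])
    return rendered

custom = render_output({"preset": "custom", "bulk_max_size": 1024, "workers": 8})
tuned = render_output({"preset": "throughput", "bulk_max_size": 1024, "workers": 8})
print(custom["bulk_max_size"], tuned["bulk_max_size"])  # 1024 1600
```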

Acceptance Criteria:

  • The preset key in an agent output is passed through to each component (this should work by default with no changes).
  • Beats implements the balanced, throughput, scale, latency, and custom presets
    • All output performance parameters are ignored for every preset except for custom. custom is the only preset that respects user specified output tuning parameters.
    • If no preset is specified the default is custom. This ensures that agents upgrading to the version that introduces presets continue to work without modification.
    • Unknown preset types are rejected with an error.
  • The parameters rendered from the preset are viewable in the output of elastic-agent diagnostics for each Beat.
  • A test exists proving that unknown presets mark the output as failed, and user specified output performance parameters are ignored when the preset is not custom.
  • A test exists proving that the presets are rendered correctly in the output of the agent diagnostics command.
  • A test exists proving that each preset can successfully ship data to Elasticsearch.
  • The preset key is documented in the default elastic-agent.yml, elastic-agent.reference.yml, the default configuration for each Beat, and the reference configuration for each Beat. The default and reference configurations should set the preset to balanced. This is the intended default for new installations.
outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    api_key: "example-key"
    preset: "balanced"
  • The Beats and Elastic Agent documentation is updated to document the presets along with their behavior and current values.
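
The validation rules in these criteria can be sketched as follows. The state strings HEALTHY and FAILED and the helper resolve_preset are hypothetical names chosen for the illustration; only the list of valid presets and the default of custom come from the criteria above.

```python
VALID_PRESETS = ("balanced", "throughput", "scale", "latency", "custom")

def resolve_preset(output_cfg):
    """Return (state, preset, error) for an output configuration block."""
    # A missing key behaves like custom, so upgraded agents keep working unchanged.
    preset = output_cfg.get("preset", "custom")
    if preset not in VALID_PRESETS:
        return ("FAILED", None, f"unknown preset {preset!r}")
    return ("HEALTHY", preset, None)

print(resolve_preset({"type": "elasticsearch"}))
# ('HEALTHY', 'custom', None)
print(resolve_preset({"type": "elasticsearch", "preset": "turbo"}))
# ('FAILED', None, "unknown preset 'turbo'")
```
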
@cmacknz cmacknz added the Team:Elastic-Agent Label for the Agent team label Nov 21, 2023
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@nimarezainia
Contributor

Would the elastic-agent inspect command show the value of the actual settings or just the preset? It looks like only the preset is shown in the .yml file, and if the user wants to see the values that were set, they need to pull the diagnostics.

@cmacknz
Member Author

cmacknz commented Nov 22, 2023

> Would elastic-agent inspect command show the value of the actual settings or just the preset?

It would only show the preset with my proposed implementation above. This is because the expansion of preset into the individual parameters should happen in the sub-processes. In our case it would happen in Beats, and today elastic-agent inspect cannot get information back from Beats (although we could change this if we really wanted to).

I think having the presets pass through the agent and expand into parameters in the sub-processes is the correct approach, because it lets the parameters vary per sub-process. The preset: latency parameters can be different between Beats and the Otel shipper this way.

@nimarezainia
Contributor

> I think having the presets pass through the agent and expand into parameters in the sub-processes is the correct approach, because it lets the parameters vary per sub-process. The preset: latency parameters can be different between Beats and the Otel shipper this way.

Yes, got it. We may (most likely) end up in scenarios where a policy has a mixture of current agents and otel based ones, and the parameters will be different for each. That can't be managed in Fleet and needs translation at the agent.

@faec
Contributor

faec commented Nov 27, 2023

I have questions about where some of the performance numbers are coming from, especially "stateful throughput" and "network traffic":

  • Stateful throughput for "balanced" preset relative to baseline is 3x, but I never saw ratios like that in my testing. The most dramatic throughput gaps I saw in single-worker benchmarks were on the order of 30%, and even that excluded the input performance that users will see in practice -- what is the 3x number based on?
  • "Network traffic" is confusing as a metric name -- it can't mean bytes per second, since then the "throughput" preset (4 workers) would have a much higher value. So it must mean relative net traffic to ingest an equivalent data load. But:
    • All of the settings have compression level 1, except the "default" (which is inaccurate since the current release already defaults to compression 1 -- so compression settings are actually the same for all columns).
    • Even if we say uncompressed is the baseline, compression level 1 doesn't give a 90% bandwidth reduction over uncompressed.
    • Even if it did, the "scale" settings use the same compression level and certainly won't give us an additional 50% reduction relative to the other presets, as indicated in the table.

@cmacknz
Member Author

cmacknz commented Nov 27, 2023

I believe @strawgate originally did these tests and can hopefully answer those questions.

@faec
Contributor

faec commented Nov 27, 2023

A smaller note: the only real difference between "default" and "latency" presets is that "latency" enables compression... but based on the benchmarks from the compression change, all else being equal, enabling compression increases latency (slightly), so if we were really optimizing for latency we would turn it off.

@cmacknz
Member Author

cmacknz commented Nov 27, 2023

IIRC the latency preset was meant to minimize latency but also be a quick way to go back to the defaults before elastic/beats#36990 since our original defaults were essentially optimized for latency.

I think in general we want compression on everywhere by default to minimize data transfer costs, so I think we should leave compression enabled in the optimized for latency preset. This makes it more optimized for latency than the other presets, but it does not give the lowest achievable latency.

@strawgate

strawgate commented Nov 27, 2023

> • Even if it did, the "scale" settings use the same compression level and certainly won't give us an additional 50% reduction relative to the other presets, as indicated in the table.

Yeah, this is a bit confusing now. The "Default" column is from before we started down any defaults changes (including compression), so it represents the pre-8.11 defaults.

> • Even if we say uncompressed is the baseline, compression level 1 doesn't give a 90% bandwidth reduction over uncompressed.

In all of my tests the bandwidth reduction for compression_level: 1 exceeded 90%. When running an actual agent with integrations and with the benchmark we did for cloud billing, both showed a 95% reduction in traffic. The beat benchmark catalogue shows a 70% traffic reduction, but the logs it generates are pseudo-random (the first third or so of each line is made to look like an nginx log and the rest is filled in with random ASCII characters).
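
The dependence on the input data can be reproduced with a small experiment. This is a sketch that uses Python's zlib at level 1 as a stand-in for the output's compression; the sample log line is invented, and the resulting fractions will not match the Beats benchmarks.

```python
import random
import zlib

random.seed(0)  # deterministic sample data

def compressed_fraction(payload: bytes, level: int = 1) -> float:
    """Size after compression divided by size before compression."""
    return len(zlib.compress(payload, level)) / len(payload)

# Repetitive, log-like lines (hypothetical nginx-style sample).
structured = b"".join(
    b'127.0.0.1 - - [21/Nov/2023:10:00:%02d +0000] "GET /index.html HTTP/1.1" 200 512\n'
    % (i % 60)
    for i in range(2000)
)

# The same number of bytes of random printable ASCII, which compresses poorly.
printable = bytes(range(32, 127))
random_ascii = bytes(random.choice(printable) for _ in range(len(structured)))

print(f"structured:   {compressed_fraction(structured):.2f}")
print(f"random ASCII: {compressed_fraction(random_ascii):.2f}")
```

Repetitive log lines shrink far more than random filler, which is consistent with a benchmark that pads lines with random characters reporting a smaller reduction than real integration data.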

> Even if it did, the "scale" settings use the same compression level and certainly won't give us an additional 50% reduction relative to the other presets, as indicated in the table.

I think this is a typo and should be 0.1x as you've indicated.

> Stateful throughput for "balanced" preset relative to baseline is 3x, but I never saw ratios like that in my testing. The most dramatic throughput gaps I saw in single-worker benchmarks were on the order of 30%, and even that excluded the input performance that users will see in practice -- what is the 3x number based on?

Yeah I think we had originally measured this relative to the ES cluster and we probably need to grab new numbers from the benchmarks and update the throughput part of the table.

@cmacknz
Member Author

cmacknz commented Nov 30, 2023

Updated the issue based on the latest round of discussion:

  • Removed references to the default preset in favor of balanced.
  • Specified that the only preset that respects user specified performance parameters is custom.
  • Specified that the default preset when none is specified is custom.
  • Specified that the default and reference configuration files for Beats and Agent should set the preset to balanced.

@faec
Contributor

faec commented Dec 1, 2023

Following up on the Slack discussion: in the interest of making the 8.12.0 release, I'm splitting the diagnostics-specific tasks into a follow-up issue, since those features are for convenience rather than core behavior (the existing diagnostics already provide enough information to determine the effective config, even with presets applied).
