Add presets for performance tuning to ES output configuration #3797

cmacknz · 2023-11-21T19:08:18Z

Relates [Fleet] Add presets for performance tuning to ES output configuration kibana#166870
Relates Include effective ES output config in the Beats diagnostics beats#37263

This is the agent-side implementation issue for elastic/kibana#166870 to add support for output configuration presets.

After some discussion we believe the best path forward for implementing these presets is to keep the definitions of each preset in the agent, with Fleet only specifying the preset name. There are two reasons for preferring this approach:

Presets will work the same way for both standalone agents and Fleet managed agents.
The preset definition can vary with the agent version. This is viewed as an advantage because neither users nor Fleet need to account for parameter implementation details or additions specific to each agent version.
Old agent versions will ignore the preset names and use their current parameter defaults. If Fleet were to auto-configure the output parameters based on a preset in the UI the agent may partially apply the parameters, since new parameters would be ignored.

Design

Presets will be selected by adding the preset key to the Elasticsearch output in an agent policy. Initially the valid values for the preset key will be: balanced, throughput, scale, latency and custom.

An example configuration is:

outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    api_key: "example-key"
    # Must be one of "balanced", "throughput", "scale", "latency", "custom" 
    # Unknown preset values move the output to the failed state with an appropriate error.
    preset: "throughput"
    bulk_max_size: 1024
    worker: 8
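
As a minimal sketch of the validation rule in the comments above (unknown preset values move the output to the failed state), a component could check the key as follows. The function name `validate_preset` and the exception `OutputConfigError` are hypothetical and are not part of any Beats or Elastic Agent API; the fallback to custom when the key is missing follows the later update in this thread.

```python
# The five preset names listed in this issue.
VALID_PRESETS = ("balanced", "throughput", "scale", "latency", "custom")


class OutputConfigError(ValueError):
    """Hypothetical error signalling that the output should enter the failed state."""


def validate_preset(output_cfg: dict) -> str:
    """Return the preset name from an output config, or raise if it is unknown."""
    # Assumption: a missing key is treated as "custom", per the later update in this thread.
    preset = output_cfg.get("preset", "custom")
    if preset not in VALID_PRESETS:
        raise OutputConfigError(
            f"unknown preset {preset!r}; must be one of: {', '.join(VALID_PRESETS)}"
        )
    return preset
```

With the example configuration above, `validate_preset` would return "throughput"; a value such as "fast" would raise the error instead of being silently ignored.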

The actual rendering of the preset key into detailed output parameters should be as close to the output implementation as possible. The Elastic Agent should simply pass the preset key through to each supervised component. Since not all Elasticsearch output implementations are exactly the same, this allows the presets to vary depending on the implementation. The exact parameters for preset: throughput may be different for filebeat and endpoint-security for example. For Beats this means the rendering of the preset to detailed output parameters should happen in the Beat itself.

When preset is configured, the effective agent output configuration with all parameters must be inspectable for debugging. At minimum, the full set of parameters and the preset they were generated from must be included in the output of elastic-agent diagnostics. For Beats this can be done in the existing beat-rendered-config.yml file or a new file generated from a new diagnostics hook as appropriate. It would additionally be nice if we could add an elastic-agent inspect output command to show the rendered output configuration, but this can be done as a follow-up since it will not be straightforward.

Preset Definitions

| Configuration | Current Default | Balanced | Optimized for Throughput | Optimized for Scale | Optimized for Latency (?) |
|---|---|---|---|---|---|
| bulk_max_size | 50 | 1600 | 1600 | 1600 | 50 |
| workers | 1 | 1 | 4 | 1 | 1 |
| queue.mem.events | 4096 | 3200 | 12800 | 3200 | 4100 |
| flush.min_events | 2048 | 1600 | 1600 | 1600 | 2050 |
| flush.timeout | 1 | 10 | 5 | 20 | 1 |
| compression | 0 | 1 | 1 | 1 | 1 |
| idle_timeout | 60 | 3 | 15 | 1 | 60 |
| Performance | | | | | |
| Stateful Throughput | 1x | 3x | 5x | 3x | 1x |
| Serverless Throughput | 1x | 5-10x | 10-20x | 5-10x | 1x |
| Serverless Throughput (Relative to Stateful) | 0.1x | 0.2-0.3x | 0.3-0.5x | 0.2-0.3x | 0.1x |
| Connections | 1x | 0.3x | 4x | 0.04x | 1x |
| Network Traffic | 1x | 0.1x | 0.1x | 0.05x | 0.1x |
| High-throughput Queue Latency * | 1x | 1x | 1x | 1x | 1x |
| Low-throughput Queue Latency ** | 1x | 10x | 5x | 20x | 1x |
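
To illustrate how a component might render a preset name into detailed parameters, here is a minimal sketch that encodes the configuration rows of the table above as a lookup. The values are copied from the table as listed (no units are given there); the names `PRESETS` and `render_preset` are hypothetical, and `worker` is used as the key because that is what the example configuration uses (the table labels the row "workers"). A real implementation may define different values per component.

```python
# Configuration rows of the table above, one entry per preset column.
PRESETS = {
    "balanced": {
        "bulk_max_size": 1600, "worker": 1, "queue.mem.events": 3200,
        "flush.min_events": 1600, "flush.timeout": 10,
        "compression": 1, "idle_timeout": 3,
    },
    "throughput": {
        "bulk_max_size": 1600, "worker": 4, "queue.mem.events": 12800,
        "flush.min_events": 1600, "flush.timeout": 5,
        "compression": 1, "idle_timeout": 15,
    },
    "scale": {
        "bulk_max_size": 1600, "worker": 1, "queue.mem.events": 3200,
        "flush.min_events": 1600, "flush.timeout": 20,
        "compression": 1, "idle_timeout": 1,
    },
    "latency": {
        "bulk_max_size": 50, "worker": 1, "queue.mem.events": 4100,
        "flush.min_events": 2050, "flush.timeout": 1,
        "compression": 1, "idle_timeout": 60,
    },
}


def render_preset(name: str) -> dict:
    """Return a copy of the detailed output parameters for a named preset."""
    if name not in PRESETS:
        raise KeyError(f"no parameter table for preset {name!r}")
    # Copy so that callers cannot mutate the shared definitions.
    return dict(PRESETS[name])
```

For example, `render_preset("throughput")["worker"]` gives 4 and `render_preset("scale")["flush.timeout"]` gives 20, matching the table.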

When the preset is custom, the Fleet UI sets the parameters directly, so the agent does not need to render them as it does for the other presets.
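
The later update in this thread adds that custom is the only preset that respects user-specified performance parameters, and that custom is the default when no preset is given. One way to read that rule is sketched below, under the assumption that a named preset simply replaces any user-specified values for the parameters it defines. The function `effective_output` is hypothetical, and the parameter table is a two-key subset of the throughput column used only for illustration.

```python
# Illustrative subset of the "throughput" column of the table above.
PRESET_PARAMS = {
    "throughput": {"bulk_max_size": 1600, "worker": 4},
}


def effective_output(cfg: dict, preset_params: dict) -> dict:
    """Return the effective output config after applying the selected preset."""
    # No preset key means "custom", per the later update in this thread.
    preset = cfg.get("preset", "custom")
    effective = dict(cfg)
    if preset == "custom":
        # User-specified parameters pass through untouched.
        return effective
    # Assumption: for any other preset, the preset's values win over user values.
    effective.update(preset_params[preset])
    return effective


example = {
    "type": "elasticsearch",
    "hosts": ["127.0.0.1:9200"],
    "preset": "throughput",
    "bulk_max_size": 1024,
    "worker": 8,
}
rendered = effective_output(example, PRESET_PARAMS)
```

Under this reading, the `bulk_max_size: 1024` and `worker: 8` values in the earlier example configuration would be replaced by 1600 and 4, while the same configuration with `preset: custom` would keep 1024 and 8.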

Acceptance Criteria:

outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    api_key: "example-key"
    preset: "balanced"

The Beats and Elastic Agent documentation is updated to document the presets along with their behavior and current values.

elasticmachine · 2023-11-21T19:08:20Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

nimarezainia · 2023-11-22T02:44:07Z

Would the elastic-agent inspect command show the values of the actual settings or just the preset? It looks like only the preset is shown in the .yml file, and if the user wants to see the values that were set, they need to pull the diagnostics.

cmacknz · 2023-11-22T02:57:22Z

> Would elastic-agent inspect command show the value of the actual settings or just the preset?

It would only show the preset with my proposed implementation above. This is because the expansion of preset into the individual parameters should happen in the sub-processes. In our case it would happen in Beats, and today elastic-agent inspect cannot get information back from Beats (although we could change this if we really wanted to).

I think having the presets pass through the agent and expand into parameters in the sub-processes is the correct approach, because it lets the parameters vary per sub-process. The preset: latency parameters can be different between Beats and the Otel shipper this way.

nimarezainia · 2023-11-22T03:06:53Z

> I think having the presets pass through the agent and expand into parameters in the sub-processes is the correct approach, because it lets the parameters vary per sub-process. The preset: latency parameters can be different between Beats and the Otel shipper this way.

Yes, got it. We will most likely end up in scenarios where a policy has a mixture of current agents and Otel-based ones, and the parameters will be different for each. That can't be managed in Fleet and needs translation at the agent.

faec · 2023-11-27T16:35:34Z

I have questions about where some of the performance numbers are coming from, especially "stateful throughput" and "network traffic":

- Stateful throughput for "balanced" preset relative to baseline is 3x, but I never saw ratios like that in my testing. The most dramatic throughput gaps I saw in single-worker benchmarks were on the order of 30%, and even that excluded the input performance that users will see in practice -- what is the 3x number based on?
- "Network traffic" is confusing as a metric name -- it can't mean bytes per second, since then the "throughput" preset (4 workers) would have a much higher value. So it must mean relative net traffic to ingest an equivalent data load. But:
  - All of the settings have compression level 1, except the "default" (which is inaccurate since the current release already defaults to compression 1 -- so compression settings are actually the same for all columns).
  - Even if we say uncompressed is the baseline, compression level 1 doesn't give a 90% bandwidth reduction over uncompressed.
  - Even if it did, the "scale" settings use the same compression level and certainly won't give us an additional 50% reduction relative to the other presets, as indicated in the table.

cmacknz · 2023-11-27T16:43:03Z

I believe @strawgate originally did these tests and can hopefully answer those questions.

faec · 2023-11-27T17:22:14Z

A smaller note: the only real difference between "default" and "latency" presets is that "latency" enables compression... but based on the benchmarks from the compression change, all else being equal, enabling compression increases latency (slightly), so if we were really optimizing for latency we would turn it off.

cmacknz · 2023-11-27T17:48:24Z

IIRC the latency preset was meant to minimize latency but also be a quick way to go back to the defaults before elastic/beats#36990 since our original defaults were essentially optimized for latency.

I think in general we want compression on everywhere by default to minimize data transfer costs, so I think we should leave compression enabled in the optimized for latency preset. This makes it more optimized for latency than the other presets, but it does not give the lowest achievable latency.

strawgate · 2023-11-27T19:46:27Z

> Even if it did, the "scale" settings use the same compression level and certainly won't give us an additional 50% reduction relative to the other presets, as indicated in the table.

Yeah, this is a bit confusing now. The "Default" table was from before we started down any defaults changes (including compression), so it represents the pre-8.11 defaults.

> Even if we say uncompressed is the baseline, compression level 1 doesn't give a 90% bandwidth reduction over uncompressed.

In all of my tests the bandwidth reduction for compression_level: 1 exceeded 90%. Running an actual agent with integrations and running the benchmark we did for cloud billing both showed a 95% reduction in traffic. The beat benchmark catalogue shows a 70% traffic reduction, but the logs it generates are pseudo-random (roughly the first third of each line is made to look like an nginx log and the rest is filled in with random ASCII characters).

> Even if it did, the "scale" settings use the same compression level and certainly won't give us an additional 50% reduction relative to the other presets, as indicated in the table.

I think this is a typo and should be 0.1x as you've indicated.

> Stateful throughput for "balanced" preset relative to baseline is 3x, but I never saw ratios like that in my testing. The most dramatic throughput gaps I saw in single-worker benchmarks were on the order of 30%, and even that excluded the input performance that users will see in practice -- what is the 3x number based on?

Yeah I think we had originally measured this relative to the ES cluster and we probably need to grab new numbers from the benchmarks and update the throughput part of the table.

cmacknz · 2023-11-30T21:00:27Z

Updated the issue based on the latest round of discussion:

- Removed references to the default preset in favor of balanced.
- Specified that the only preset that respects user specified performance parameters is custom.
- Specified that the default preset when none is specified is custom.
- Specified that the default and reference configuration files for Beats and Agent should set the preset to balanced.

faec · 2023-12-01T16:30:31Z

Following on from the Slack discussion: in the interest of making the 8.12.0 release, I am splitting the diagnostics-specific tasks into a follow-up issue, since those features are for convenience rather than core behavior (the existing diagnostics already provide enough information to determine the effective config, even with presets applied).

cmacknz added the Team:Elastic-Agent label Nov 21, 2023

jlind23 assigned faec Nov 22, 2023

juliaElastic mentioned this issue Nov 27, 2023

[Fleet] Add presets for performance tuning to ES output configuration elastic/kibana#166870

Closed

10 tasks

This was referenced Nov 30, 2023

Support performance presets in the Elasticsearch output elastic/beats#37259

Merged

Include effective ES output config in the Beats diagnostics elastic/beats#37263

Open

faec closed this as completed in elastic/beats#37259 Dec 5, 2023

faec added a commit to elastic/beats that referenced this issue Dec 5, 2023

Support performance presets in the Elasticsearch output (#37259)

76919e0

Adds the performance presets described in elastic/elastic-agent#3797 to the Elasticsearch output, configurable with the `preset` field.

ycombinator mentioned this issue Dec 11, 2023

Document the Elasticsearch output's 'preset' field elastic/beats#37315

Merged

6 tasks

This was referenced Jan 25, 2024

Amazon SQS input stalls on new queue flush timeout defaults elastic/beats#37754

Closed

Queue flush settings interact badly with output's bulk_max_size elastic/beats#37757

Closed

zmoog mentioned this issue Feb 4, 2024

Figure out how to performance test the aws-s3 input in SQS mode zmoog/public-notes#76

Open

StephanErb mentioned this issue Mar 22, 2024

Scale Preset should prevent Thundering Herd issue with many Agents #4469

Open

StephanErb mentioned this issue Apr 22, 2024

Perf regression, too many small bulk requests elastic/apm-server#13024

Closed
