
Set field limit, after testing with rally. #1444

Merged: 3 commits into elastic:master on Oct 18, 2018

Conversation

@simitt (Contributor) commented Oct 16, 2018

Add rally challenge to test mapping explosion limits for APM.

implements #1291

I did some tests using the APM rally challenge ingest-field-explosion on ES instances with the following results:

  • 1 GB RAM, 1-node, 1-zone ES instance: read timeouts with more than 200 tags
  • 4 GB RAM, 1-node, 1-zone ES instance: read timeouts with more than 750 tags

The number of fields other than tags is roughly 200. My suggestion is to set 1,000 as the default value for total_fields in 6.5. Setting it too low would cause write errors even though the ES instance might be able to handle more fields; setting it too high would risk memory issues for the whole ES instance.
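To make the proposal concrete, here is a minimal sketch of how the index-level limit being discussed, `index.mapping.total_fields.limit`, could be applied. The index pattern and the client call in the comment are illustrative assumptions, not part of this PR:

```python
# Sketch only: build the index-settings body for the mapped-fields limit
# discussed above. The limit value mirrors the 1,000 default proposed here.
proposed_limit = 1000

settings_body = {
    "index": {
        "mapping": {
            "total_fields": {"limit": proposed_limit}
        }
    }
}

# With the official elasticsearch-py client this could be applied as, e.g.:
#   es.indices.put_settings(index="apm-*", body=settings_body)  # hypothetical index pattern
print(settings_body["index"]["mapping"]["total_fields"]["limit"])
```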

Add rally challenge to test mapping explosion limits for APM.

implements elastic#1291
@graphaelli (Member) left a comment:

Great analysis!

@simitt (Contributor, Author) commented Oct 17, 2018

After discussing the timeouts with @danielmitterdorfer, I changed the batch_size from 1000 to 50 (the apm-server default) and increased the timeout to 60 s. Running the following command:

./rally --track-path=<track-path> --on-error=abort --pipeline=benchmark-only --target-hosts=https://da7c9a38ea8949b69bc3bd36d6ed6698.europe-west3.gcp.cloud.es.io:9243 --client-options="basic_auth_user:'xxx',basic_auth_password:'xxx',http_compress:true,use_ssl:true,timeout:60" --track-params="event_type:'span'" --challenge=ingest-field-explosion

on a 4 GB RAM, 1-node, 1-zone cloud instance leads to the following results:

[updated results]

I increased the warmup time to 240 s (instead of 120 s) and ran the races in ascending order of tag count, to avoid potential negative carry-over effects from an earlier run with a higher field cardinality.

  • 1000 tags
|   Lap |                               Metric |                Task |    Value |   Unit |
|   All |                       Min Throughput | index-apm-span-tags |  1641.44 | docs/s |
|   All |                    Median Throughput | index-apm-span-tags |  1681.32 | docs/s |
|   All |                       Max Throughput | index-apm-span-tags |  1706.39 | docs/s |
|   All |              50th percentile latency | index-apm-span-tags |  167.322 |     ms |
|   All |              90th percentile latency | index-apm-span-tags |  237.611 |     ms |
|   All |              99th percentile latency | index-apm-span-tags |  382.622 |     ms |
|   All |            99.9th percentile latency | index-apm-span-tags |  725.664 |     ms |
|   All |             100th percentile latency | index-apm-span-tags |  742.809 |     ms |
|   All |         50th percentile service time | index-apm-span-tags |  167.322 |     ms |
|   All |         90th percentile service time | index-apm-span-tags |  237.611 |     ms |
|   All |         99th percentile service time | index-apm-span-tags |  382.622 |     ms |
|   All |       99.9th percentile service time | index-apm-span-tags |  725.664 |     ms |
|   All |        100th percentile service time | index-apm-span-tags |  742.809 |     ms |
|   All |                           error rate | index-apm-span-tags |        0 |      % |

  • 1500 tags
|   Lap |                               Metric |                Task |    Value |   Unit |
|   All |                       Min Throughput | index-apm-span-tags |  1267.75 | docs/s |
|   All |                    Median Throughput | index-apm-span-tags |  1384.82 | docs/s |
|   All |                       Max Throughput | index-apm-span-tags |  1466.88 | docs/s |
|   All |              50th percentile latency | index-apm-span-tags |  172.377 |     ms |
|   All |              90th percentile latency | index-apm-span-tags |  264.239 |     ms |
|   All |              99th percentile latency | index-apm-span-tags |  399.431 |     ms |
|   All |            99.9th percentile latency | index-apm-span-tags |  994.829 |     ms |
|   All |             100th percentile latency | index-apm-span-tags |  1070.57 |     ms |
|   All |         50th percentile service time | index-apm-span-tags |  172.377 |     ms |
|   All |         90th percentile service time | index-apm-span-tags |  264.239 |     ms |
|   All |         99th percentile service time | index-apm-span-tags |  399.431 |     ms |
|   All |       99.9th percentile service time | index-apm-span-tags |  994.829 |     ms |
|   All |        100th percentile service time | index-apm-span-tags |  1070.57 |     ms |
|   All |                           error rate | index-apm-span-tags |        0 |      % |

  • 2000 tags
|   Lap |                               Metric |                Task |    Value |   Unit |
|   All |                       Min Throughput | index-apm-span-tags |   817.38 | docs/s |
|   All |                    Median Throughput | index-apm-span-tags |  1066.24 | docs/s |
|   All |                       Max Throughput | index-apm-span-tags |  1222.68 | docs/s |
|   All |              50th percentile latency | index-apm-span-tags |  183.712 |     ms |
|   All |              90th percentile latency | index-apm-span-tags |  319.286 |     ms |
|   All |              99th percentile latency | index-apm-span-tags |  571.942 |     ms |
|   All |            99.9th percentile latency | index-apm-span-tags |  1098.75 |     ms |
|   All |             100th percentile latency | index-apm-span-tags |  1365.68 |     ms |
|   All |         50th percentile service time | index-apm-span-tags |  183.712 |     ms |
|   All |         90th percentile service time | index-apm-span-tags |  319.286 |     ms |
|   All |         99th percentile service time | index-apm-span-tags |  571.942 |     ms |
|   All |       99.9th percentile service time | index-apm-span-tags |  1098.75 |     ms |
|   All |        100th percentile service time | index-apm-span-tags |  1365.68 |     ms |
|   All |                           error rate | index-apm-span-tags |        0 |      % |

  • 2500 tags
|   Lap |                               Metric |                Task |    Value |   Unit |
|   All |                       Min Throughput | index-apm-span-tags |   334.83 | docs/s |
|   All |                    Median Throughput | index-apm-span-tags |   789.21 | docs/s |
|   All |                       Max Throughput | index-apm-span-tags |  1026.06 | docs/s |
|   All |              50th percentile latency | index-apm-span-tags |  196.822 |     ms |
|   All |              90th percentile latency | index-apm-span-tags |  337.615 |     ms |
|   All |              99th percentile latency | index-apm-span-tags |    690.1 |     ms |
|   All |            99.9th percentile latency | index-apm-span-tags |  1152.36 |     ms |
|   All |             100th percentile latency | index-apm-span-tags |  1397.39 |     ms |
|   All |         50th percentile service time | index-apm-span-tags |  196.822 |     ms |
|   All |         90th percentile service time | index-apm-span-tags |  337.615 |     ms |
|   All |         99th percentile service time | index-apm-span-tags |    690.1 |     ms |
|   All |       99.9th percentile service time | index-apm-span-tags |  1152.36 |     ms |
|   All |        100th percentile service time | index-apm-span-tags |  1397.39 |     ms |
|   All |                           error rate | index-apm-span-tags |        0 |      % |

  • 5000 tags
|   Lap |                               Metric |                Task |    Value |   Unit |
|   All |                       Min Throughput | index-apm-span-tags |    16.83 | docs/s |
|   All |                    Median Throughput | index-apm-span-tags |   131.04 | docs/s |
|   All |                       Max Throughput | index-apm-span-tags |   506.42 | docs/s |
|   All |              50th percentile latency | index-apm-span-tags |  224.312 |     ms |
|   All |              90th percentile latency | index-apm-span-tags |  419.976 |     ms |
|   All |              99th percentile latency | index-apm-span-tags |  12836.8 |     ms |
|   All |            99.9th percentile latency | index-apm-span-tags |  24025.8 |     ms |
|   All |             100th percentile latency | index-apm-span-tags |  27517.9 |     ms |
|   All |         50th percentile service time | index-apm-span-tags |  224.312 |     ms |
|   All |         90th percentile service time | index-apm-span-tags |  419.976 |     ms |
|   All |         99th percentile service time | index-apm-span-tags |  12836.8 |     ms |
|   All |       99.9th percentile service time | index-apm-span-tags |  24025.8 |     ms |
|   All |        100th percentile service time | index-apm-span-tags |  27517.9 |     ms |
|   All |                           error rate | index-apm-span-tags |        0 |      % |

Performance overview of the ES instance:
[screenshot: ES instance performance overview, 2018-10-17]

Reduce bulk size to apm-server default of 50. Increase warmup period to
reduce measured latency.
@simitt (Contributor, Author) commented Oct 17, 2018

After rerunning the tests (see results above), I suggest setting the total_fields limit somewhere around 1,500 to 2,000.

@roncohen what's your opinion on that?

@roncohen (Contributor) commented:

As you've shown, with the right machine size and configuration, a cluster could handle more than 2,000 fields.

So let me see if this is your line of thinking: if you have 2,000+ tags, you're basically doing it wrong. In that case it's better to complain loudly and stop indexing, even where the cluster could actually handle more tags, because we'd expect the number of tags to keep growing beyond that point when the feature is not being used correctly.

@simitt (Contributor, Author) commented Oct 17, 2018

APM Server's default settings are generally rather conservative, sized for a small APM machine. My suggestion is therefore to align the default total_fields setting with something that makes sense for a smaller APM/ES setup; that's why I suggest reducing the number of allowed fields to around 2,000 as the default. Customers can override that if they want.

(Note: the ES default value for total_fields is 1,000; the libbeat default is 10,000.)
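For reference, a libbeat-style override for this limit would typically go under `setup.template.settings` in `apm-server.yml`. The exact snippet below is an illustrative sketch, not text from this PR:

```yaml
# Illustrative only: overriding the mapped-fields limit via the index template
# settings in apm-server.yml (value matches the ~2,000 proposed in this thread).
setup.template.settings:
  index.mapping.total_fields.limit: 2000
```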

@roncohen (Contributor) commented:

OK. Sounds good 👍

@simitt (Contributor, Author) commented Oct 17, 2018

jenkins, retest this please

@simitt simitt merged commit a5e7dff into elastic:master Oct 18, 2018
@simitt simitt deleted the 1291-set-limit-for-fields branch November 8, 2018 09:58