
Set field limit, after testing with rally. #1444

Merged: 3 commits into elastic:master on Oct 18, 2018

Conversation

@simitt (Contributor) commented Oct 16, 2018

Add rally challenge to test mapping explosion limits for APM.

implements #1291

I did some tests using the APM rally challenge ingest-field-explosion on ES instances with the following results:

  • 1 GB RAM, 1-node, 1-zone ES instance: read timeouts with more than 200 tags
  • 4 GB RAM, 1-node, 1-zone ES instance: read timeouts with more than 750 tags

The number of fields other than tags is roughly 200. My suggestion is to set 1,000 as the default value for total_fields in 6.5. Setting it too low would cause write errors even though the ES instance might be able to handle more fields; setting it too high would risk memory issues for the whole ES instance.
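To make the proposal concrete, here is a minimal sketch of how the index-level limit being discussed, `index.mapping.total_fields.limit`, could be applied. The index pattern and the client call in the comment are illustrative assumptions, not part of this PR:

```python
# Sketch only: build the index-settings body for the mapped-fields limit
# discussed above. The limit value mirrors the 1,000 default proposed here.
proposed_limit = 1000

settings_body = {
    "index": {
        "mapping": {
            "total_fields": {"limit": proposed_limit}
        }
    }
}

# With the official elasticsearch-py client this could be applied as, e.g.:
#   es.indices.put_settings(index="apm-*", body=settings_body)  # hypothetical index pattern
print(settings_body["index"]["mapping"]["total_fields"]["limit"])
```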

Add rally challenge to test mapping explosion limits for APM.

implements elastic#1291
@graphaelli (Member) left a comment:

Great analysis!

@simitt (Contributor, Author) commented Oct 17, 2018

After discussing the timeouts with @danielmitterdorfer, I changed the batch_size from 1000 to 50 (the apm-server default) and increased the timeout to 60 s. Running the following command:

./rally --track-path=<track-path> --on-error=abort --pipeline=benchmark-only --target-hosts=https://da7c9a38ea8949b69bc3bd36d6ed6698.europe-west3.gcp.cloud.es.io:9243 --client-options="basic_auth_user:'xxx',basic_auth_password:'xxx',http_compress:true,use_ssl:true,timeout:60" --track-params="event_type:'span'" --challenge=ingest-field-explosion

on a 4 GB RAM, 1-node, 1-zone cloud instance leads to the following results:

[updated results]

I increased the warmup time to 240 s (instead of 120 s) and ran the races in ascending order of tag count, to avoid potential negative carry-over effects from an earlier run with a higher field cardinality.

  • 1000 tags
|   Lap |                               Metric |                Task |    Value |   Unit |
|   All |                       Min Throughput | index-apm-span-tags |  1641.44 | docs/s |
|   All |                    Median Throughput | index-apm-span-tags |  1681.32 | docs/s |
|   All |                       Max Throughput | index-apm-span-tags |  1706.39 | docs/s |
|   All |              50th percentile latency | index-apm-span-tags |  167.322 |     ms |
|   All |              90th percentile latency | index-apm-span-tags |  237.611 |     ms |
|   All |              99th percentile latency | index-apm-span-tags |  382.622 |     ms |
|   All |            99.9th percentile latency | index-apm-span-tags |  725.664 |     ms |
|   All |             100th percentile latency | index-apm-span-tags |  742.809 |     ms |
|   All |         50th percentile service time | index-apm-span-tags |  167.322 |     ms |
|   All |         90th percentile service time | index-apm-span-tags |  237.611 |     ms |
|   All |         99th percentile service time | index-apm-span-tags |  382.622 |     ms |
|   All |       99.9th percentile service time | index-apm-span-tags |  725.664 |     ms |
|   All |        100th percentile service time | index-apm-span-tags |  742.809 |     ms |
|   All |                           error rate | index-apm-span-tags |        0 |      % |

  • 1500 tags
|   Lap |                               Metric |                Task |    Value |   Unit |
|   All |                       Min Throughput | index-apm-span-tags |  1267.75 | docs/s |
|   All |                    Median Throughput | index-apm-span-tags |  1384.82 | docs/s |
|   All |                       Max Throughput | index-apm-span-tags |  1466.88 | docs/s |
|   All |              50th percentile latency | index-apm-span-tags |  172.377 |     ms |
|   All |              90th percentile latency | index-apm-span-tags |  264.239 |     ms |
|   All |              99th percentile latency | index-apm-span-tags |  399.431 |     ms |
|   All |            99.9th percentile latency | index-apm-span-tags |  994.829 |     ms |
|   All |             100th percentile latency | index-apm-span-tags |  1070.57 |     ms |
|   All |         50th percentile service time | index-apm-span-tags |  172.377 |     ms |
|   All |         90th percentile service time | index-apm-span-tags |  264.239 |     ms |
|   All |         99th percentile service time | index-apm-span-tags |  399.431 |     ms |
|   All |       99.9th percentile service time | index-apm-span-tags |  994.829 |     ms |
|   All |        100th percentile service time | index-apm-span-tags |  1070.57 |     ms |
|   All |                           error rate | index-apm-span-tags |        0 |      % |

  • 2000 tags
|   Lap |                               Metric |                Task |    Value |   Unit |
|   All |                       Min Throughput | index-apm-span-tags |   817.38 | docs/s |
|   All |                    Median Throughput | index-apm-span-tags |  1066.24 | docs/s |
|   All |                       Max Throughput | index-apm-span-tags |  1222.68 | docs/s |
|   All |              50th percentile latency | index-apm-span-tags |  183.712 |     ms |
|   All |              90th percentile latency | index-apm-span-tags |  319.286 |     ms |
|   All |              99th percentile latency | index-apm-span-tags |  571.942 |     ms |
|   All |            99.9th percentile latency | index-apm-span-tags |  1098.75 |     ms |
|   All |             100th percentile latency | index-apm-span-tags |  1365.68 |     ms |
|   All |         50th percentile service time | index-apm-span-tags |  183.712 |     ms |
|   All |         90th percentile service time | index-apm-span-tags |  319.286 |     ms |
|   All |         99th percentile service time | index-apm-span-tags |  571.942 |     ms |
|   All |       99.9th percentile service time | index-apm-span-tags |  1098.75 |     ms |
|   All |        100th percentile service time | index-apm-span-tags |  1365.68 |     ms |
|   All |                           error rate | index-apm-span-tags |        0 |      % |

  • 2500 tags
|   Lap |                               Metric |                Task |    Value |   Unit |
|   All |                       Min Throughput | index-apm-span-tags |   334.83 | docs/s |
|   All |                    Median Throughput | index-apm-span-tags |   789.21 | docs/s |
|   All |                       Max Throughput | index-apm-span-tags |  1026.06 | docs/s |
|   All |              50th percentile latency | index-apm-span-tags |  196.822 |     ms |
|   All |              90th percentile latency | index-apm-span-tags |  337.615 |     ms |
|   All |              99th percentile latency | index-apm-span-tags |    690.1 |     ms |
|   All |            99.9th percentile latency | index-apm-span-tags |  1152.36 |     ms |
|   All |             100th percentile latency | index-apm-span-tags |  1397.39 |     ms |
|   All |         50th percentile service time | index-apm-span-tags |  196.822 |     ms |
|   All |         90th percentile service time | index-apm-span-tags |  337.615 |     ms |
|   All |         99th percentile service time | index-apm-span-tags |    690.1 |     ms |
|   All |       99.9th percentile service time | index-apm-span-tags |  1152.36 |     ms |
|   All |        100th percentile service time | index-apm-span-tags |  1397.39 |     ms |
|   All |                           error rate | index-apm-span-tags |        0 |      % |

  • 5000 tags
|   Lap |                               Metric |                Task |    Value |   Unit |
|   All |                       Min Throughput | index-apm-span-tags |    16.83 | docs/s |
|   All |                    Median Throughput | index-apm-span-tags |   131.04 | docs/s |
|   All |                       Max Throughput | index-apm-span-tags |   506.42 | docs/s |
|   All |              50th percentile latency | index-apm-span-tags |  224.312 |     ms |
|   All |              90th percentile latency | index-apm-span-tags |  419.976 |     ms |
|   All |              99th percentile latency | index-apm-span-tags |  12836.8 |     ms |
|   All |            99.9th percentile latency | index-apm-span-tags |  24025.8 |     ms |
|   All |             100th percentile latency | index-apm-span-tags |  27517.9 |     ms |
|   All |         50th percentile service time | index-apm-span-tags |  224.312 |     ms |
|   All |         90th percentile service time | index-apm-span-tags |  419.976 |     ms |
|   All |         99th percentile service time | index-apm-span-tags |  12836.8 |     ms |
|   All |       99.9th percentile service time | index-apm-span-tags |  24025.8 |     ms |
|   All |        100th percentile service time | index-apm-span-tags |  27517.9 |     ms |
|   All |                           error rate | index-apm-span-tags |        0 |      % |

Performance overview of the ES instance:
[screenshot: ES instance performance overview, 2018-10-17]

Reduce bulk size to apm-server default of 50. Increase warmup period to
reduce measured latency.
@simitt (Contributor, Author) commented Oct 17, 2018

After rerunning the tests (see results above), I suggest setting the total_fields limit somewhere around 1,500 to 2,000.

@roncohen what's your opinion on that?

@roncohen (Contributor) commented:

As you've shown, with the right machine size and configuration, a cluster could handle more than 2,000 fields.

So let me see if this is your line of thinking: if you have 2,000+ tags, you're basically doing it wrong. In that case it's better to complain loudly and stop indexing, even where the cluster could actually handle more tags, because we'd expect the number of tags to keep growing beyond that point when the feature is not being used correctly.

@simitt (Contributor, Author) commented Oct 17, 2018

APM Server's default settings are generally rather conservative, sized for a small APM machine. My suggestion is therefore to align the default total_fields setting with something that makes sense for a smaller APM/ES setup; that's why I suggest reducing the number of allowed fields to around 2,000 as the default. Customers can override that if they want.

(Note: the ES default value for total_fields is 1,000; the libbeat default is 10,000.)
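For reference, a libbeat-style override for this limit would typically go under `setup.template.settings` in `apm-server.yml`. The exact snippet below is an illustrative sketch, not text from this PR:

```yaml
# Illustrative only: overriding the mapped-fields limit via the index template
# settings in apm-server.yml (value matches the ~2,000 proposed in this thread).
setup.template.settings:
  index.mapping.total_fields.limit: 2000
```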

@roncohen (Contributor) commented:

OK. Sounds good 👍

@simitt (Contributor, Author) commented Oct 17, 2018

jenkins, retest this please

@simitt simitt merged commit a5e7dff into elastic:master Oct 18, 2018
@simitt simitt deleted the 1291-set-limit-for-fields branch November 8, 2018 09:58