Feature/replace avl digest with merging digest #35182

Conversation

not-napoleon
Member

Relates to #28702

@colings86 had started this conversion in the above referenced PR. This builds on his work with unit tests for deserializing between the two t-digest implementations.

This is a work in progress because the testing methodology is ready for feedback, but specific test parameters still need to be tuned.

@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@not-napoleon
Member Author

The big question I set out to answer in this PR was what happens in the mixed-cluster
case. Such a situation could result from an in-flight upgrade or a CCS query. We could
see one node serialize an AVLTree-based TDigest and another node deserialize the same
bytes as a MergeTree-based TDigest. In this case, we want to know what the error is
compared to just using the AVLTree. Note that we never need to compare against a
MergeTree, since the best case a user could get without this change is the AVLTree
error.
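To make the scenario concrete, here is a minimal sketch of that handoff using the t-digest library directly. It is only an approximation: it replays the AVL digest's centroids into a MergingDigest, whereas the real code path goes through Elasticsearch's own serialization (TDigestState), so treat the class and method choices as illustrative rather than the production implementation.

```java
import com.tdunning.math.stats.AVLTreeDigest;
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.MergingDigest;

final class MixedClusterSketch {
    // Approximate the mixed-cluster handoff: the sending node built an AVLTreeDigest,
    // and the receiving node ends up with a MergingDigest holding the same summary.
    static MergingDigest handOff(AVLTreeDigest fromOldNode, double compression) {
        MergingDigest onNewNode = new MergingDigest(compression);
        for (Centroid c : fromOldNode.centroids()) {
            onNewNode.add(c.mean(), c.count()); // replay each centroid as a weighted point
        }
        onNewNode.compress();
        return onNewNode;
    }
}
```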

I have graphs below, but will give an overview first. Our worst drop in accuracy was
about 1.3 percentage points (all analysis is done in percentage points, by subtracting
calculated error percents). In a few cases we actually saw an improvement in accuracy,
but overall the mixed-cluster case is still worse than before this change. In most
cases, however, the drop in accuracy is very small, and we don't expect customers to be
in a mixed-version situation long term.
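For clarity, the "percentage points" comparison above boils down to something like the following. The helper name and the use of a sorted copy of the raw samples as the exact answer are assumptions for this sketch, not the PR's actual test code.

```java
// Hypothetical helper: error of a digest's quantile estimate, as a percent of the exact value,
// where `sorted` is a sorted copy of the raw samples that were fed into the digest.
static double errorPercent(com.tdunning.math.stats.TDigest digest, double[] sorted, double q) {
    double exact = sorted[(int) Math.round(q * (sorted.length - 1))];
    double estimate = digest.quantile(q);
    return Math.abs(estimate - exact) / Math.abs(exact) * 100.0;
}

// Accuracy delta in percentage points; positive means the mixed-cluster path is worse:
// double deltaPoints = errorPercent(mixedDigest, sorted, 0.99) - errorPercent(avlOnlyDigest, sorted, 0.99);
```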

As the compression increases (NB: Higher compression values mean more
centroids, and thus more accuracy.), the gap in accuracy rapidly narrows. I
ran tests at compression values of 100, 1000, 5000, and 10000, but the graphs
below only show 100 and 1000; that's because the results for 5k and 10k are so
tightly clustered around 0 as to be invisible.

The worst mixed-implementation performance actually came with Gaussian
distributed data looking at the 99th percentile. I suspect this happens
because the algorithm is well optimized for this case, and thus the
transmission errors I'm measuring here tend to dominate. That said, I haven't
done any digging to prove out that suspicion. I'm happy to do so if we want
the extra confidence, but it didn't seem like a blocker to me.

I am concerned that this PR will add a "flaky" unit test. I've run this test
a lot locally (order of hundreds of thousands of iterations), and it does
occasionally fail with a high delta. I can keep cranking the tolerance
threshold up until I don't see that, but at some point I wonder if it's even
worth having a test. I'm open to suggestions here.

Graphs of test results (Note - when I say "99th Percentile", I mean that's the percentile I queried the TDigest for, not that this is the 99th percentile of the error):
Gaussian Distribution, 99th Percentile:
[graph: gaussian 99 by compression]

Random Distribution, 99th Percentile:
[graph: random 99 by compression]

Gaussian Distribution, 90th Percentile:
[graph: gaussian 90 by compression]

Random Distribution, 90th Percentile:
[graph: random 90 by compression]

@colings86
Contributor

Thanks for running these benchmarks @not-napoleon. Based on the fact that the error introduced in the mixed cluster scenario is pretty low and that we expect mixed clusters to be a short-lived situation I would be happy to proceed with the switchover as is. I'd be interested to hear thoughts from @polyfractal and @jpountz on this though.

Also in terms of the "flaky" test. I think we can do one of two things (again I'm interested to see what you, @polyfractal and @jpountz think here):

  1. Don't merge the test but view it as a benchmarking exercise we did to ensure this change would not introduce unacceptable errors in the transition period.
  2. Run the test in a different way: calculate the percentage difference across N requests and assert that, e.g., the 90th percentile of those differences is within a tolerance, instead of hoping a single request is within tolerance (see the sketch below). The downside here might be the test time, but since it's a unit test, hopefully this would not make the test take too long with something like N=50.
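A rough sketch of option 2, assuming a hypothetical runOneComparison() that returns the accuracy delta in percentage points for one randomized request, a made-up TOLERANCE constant, and a JUnit-style assertion:

```java
// Run N randomized trials and assert on the 90th percentile of the deltas
// instead of requiring every single run to stay under the tolerance.
int n = 50;
double[] deltas = new double[n];
for (int i = 0; i < n; i++) {
    deltas[i] = runOneComparison(); // hypothetical: one randomized mixed-cluster comparison
}
java.util.Arrays.sort(deltas);
double p90 = deltas[(int) Math.floor(0.9 * (n - 1))];
assertTrue("90th percentile delta too high: " + p90, p90 <= TOLERANCE);
```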

@not-napoleon
Member Author

@colings86 both options for the test seem reasonable. I was running 10k iterations of the test in less than 120 seconds on my laptop, and a lot of that time is gradle overhead, so 50 shouldn't even be a blip in our test times.

@polyfractal
Contributor

Based on the fact that the error introduced in the mixed cluster scenario is pretty low and that we expect mixed clusters to be a short-lived situation I would be happy to proceed with the switchover as is.

Generally ++ to this. @not-napoleon and I were chatting last night and another point came up: since the error basically goes to zero when compression is cranked up, a user who is unhappy about the potential cross-version error could temporarily increase compression until the mixed cluster is homogeneous again.

I did have a few questions about testing methodology after noodling over it a bit. I'm not sure if any of these are important; they are more half-baked thoughts than anything:

  • I wonder if maybe we should be testing that abs(modern_error) <= abs(legacy_error * 1.05), rather than abs(legacy_error - modern_error) <= 0.05 (see the sketch after this list)? E.g. we probably don't care if the errors are actually similar, just that the new error is close to or better than the old one.

  • I wonder if we should test the case where many legacy AVL digests get deserialized and merged? E.g. I'm thinking of the case where there are a dozen shards with the legacy AVL and a single new MergingTree. Does the ~1% error start compounding badly? I suspect it will continue to be a nicely behaved error, but maybe not?
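Spelling out the two assertion styles from the first bullet, with hypothetical modernError / legacyError values (each an error percent):

```java
// Relative bound: the merging-digest path may be at most 5% worse than the legacy error.
assertTrue(Math.abs(modernError) <= Math.abs(legacyError * 1.05));

// Absolute bound on the difference in percentage points, as currently proposed.
assertTrue(Math.abs(legacyError - modernError) <= 0.05);
```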

Also in terms of the "flaky" test. I think we can do one of two things (again I'm interested to see what you, @polyfractal and @jpountz think here):

I'm leaning towards option 1), or just cranking the tolerance up a bit and revisiting if it still flakes out on CI. Since this is a short-lived problem/test that will go away after everyone upgrades, it probably doesn't make a ton of sense to invest lots of time trying to keep the test stable.

@not-napoleon
Member Author

I'm inclined to not keep the tests around; I don't think they prove anything individually that we need for the long term.

@polyfractal polyfractal left a comment
Contributor

👍

@elasticsearchmachine
Collaborator

Hi @not-napoleon, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:14
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@csoulios csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
@kingherc kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022
@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023
@martijnvg
Member

Looks like continuing to adopt merging digest is worth our while from a performance perspective:

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|---|---|---|---|---|---|---|
| Min Throughput | percentile_cpu_usage_per_container_15_minutes | 0.25977 | 13.8869 | 13.6271 | ops/s | +5245.83% |
| Mean Throughput | percentile_cpu_usage_per_container_15_minutes | 0.260257 | 14.256 | 13.9958 | ops/s | +5377.68% |
| Median Throughput | percentile_cpu_usage_per_container_15_minutes | 0.260266 | 14.3062 | 14.0459 | ops/s | +5396.76% |
| Max Throughput | percentile_cpu_usage_per_container_15_minutes | 0.260605 | 14.3405 | 14.0799 | ops/s | +5402.76% |
| 50th percentile latency | percentile_cpu_usage_per_container_15_minutes | 3818.22 | 68.1857 | -3750.03 | ms | -98.21% |
| 90th percentile latency | percentile_cpu_usage_per_container_15_minutes | 3931.99 | 73.0426 | -3858.95 | ms | -98.14% |
| 99th percentile latency | percentile_cpu_usage_per_container_15_minutes | 4047.25 | 89.1457 | -3958.11 | ms | -97.80% |
| 100th percentile latency | percentile_cpu_usage_per_container_15_minutes | 4054.87 | 90.7218 | -3964.15 | ms | -97.76% |
| 50th percentile service time | percentile_cpu_usage_per_container_15_minutes | 3818.22 | 68.1857 | -3750.03 | ms | -98.21% |
| 90th percentile service time | percentile_cpu_usage_per_container_15_minutes | 3931.99 | 73.0426 | -3858.95 | ms | -98.14% |
| 99th percentile service time | percentile_cpu_usage_per_container_15_minutes | 4047.25 | 89.1457 | -3958.11 | ms | -97.80% |
| 100th percentile service time | percentile_cpu_usage_per_container_15_minutes | 4054.87 | 90.7218 | -3964.15 | ms | -97.76% |
| error rate | percentile_cpu_usage_per_container_15_minutes | 0 | 0 | 0 | % | 0.00% |

The two-hour and 24-hour variants of this search showed a similar relative improvement.

This is based on running the following search from a new tsdb rally track:

{
      "aggs": {
        "0": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "{{info[0]}}",
            "time_zone": "Europe/Athens",
            "min_doc_count": 1
          },
          "aggs": {
            "1": {
              "percentiles": {
                "field": "kubernetes.container.cpu.usage.nanocores",
                "percents": [
                  95
                ]
              }
            }
          }
        }
      },
      "size": 0,
      "fields": [
        {
          "field": "@timestamp",
          "format": "date_time"
        },
        {
          "field": "event.ingested",
          "format": "date_time"
        },
        {
          "field": "process.cpu.start_time",
          "format": "date_time"
        },
        {
          "field": "system.process.cpu.start_time",
          "format": "date_time"
        }
      ],
      "script_fields": {},
      "stored_fields": [
        "*"
      ],
      "runtime_mappings": {},
      "_source": {
        "excludes": []
      },
      "query": {
        "bool": {
          "must": [],
          "filter": [
            {
              "match_phrase": {
                "data_stream.dataset": "kubernetes.container"
              }
            },
            {
              "range": {
                "@timestamp": {
                  "format": "strict_date_optional_time",
                  "gte": "{{info[1]}}",
                  "lte": "{{end_time}}"
                }
              }
            }
          ],
          "should": [],
          "must_not": []
        }
      }
}

The baseline is the main branch and the contender is this PR.

@martijnvg martijnvg mentioned this pull request May 3, 2023
@kkrik-es kkrik-es mentioned this pull request May 15, 2023
@not-napoleon
Member Author

This is no longer relevant.

Labels: :Analytics/Aggregations, >feature, stalled, Team:Analytics