[APM] Use random sampler usage in span query #182828

crespocarlos · 2024-05-07T13:17:35Z

Summary

This PR changes the get_exit_span_samples query to use random_sampler aggregation, to limit the number of documents used in the query and avoid 502 errors reported seen in serverless.

IMPORTANT ❗
The change impacts other places and may lead to a potential loss in the long tail of results

UI

The dependencies page will show a badge to inform users when the data is being sampled

default

no_badge.mov

sampled data

with_badge.mov

How to test

The following can be tested on https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud

Document count for a 30-day range: 594537153

GET traces-apm*,apm*/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "span"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "exists": {
                  "field": "span.destination.service.resource"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}

A sample rate is calculated based on the doc. count. eg: 100000/594537153 = 0,000168198067178

0,000168198067178 is the probability sampling passed to the random_sampler aggregation.


GET traces-apm*,apm*/_search
{
  "track_total_hits": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "span"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "exists": {
                  "field": "span.destination.service.resource"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": 0.000168198067178,
        "seed": 815613888
      },
      "aggs": {
        "connections": {
          "composite": {
            "size": 10000,
            "sources": [
              {
                "dependencyName": {
                  "terms": {
                    "field": "span.destination.service.resource"
                  }
                }
              },
              {
                "eventOutcome": {
                  "terms": {
                    "field": "event.outcome"
                  }
                }
              }
            ]
          },
          "aggs": {
            "sample": {
              "top_metrics": {
                "size": 1,
                "metrics": [
                  {
                    "field": "span.type"
                  },
                  {
                    "field": "span.subtype"
                  },
                  {
                    "field": "span.id"
                  }
                ],
                "sort": [
                  {
                    "@timestamp": "asc"
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}

It's hard to create an environment with such a data volume. We can use the query above in https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud/, change the date ranges, and validate if the main query will work.

Alternatively

Start Kibana pointing to an oblt cluster (non-serverless)
Navigate to APM > Dependencies
Try different time ranges

For reviewers

This change affects

APM > Dependencies
APM > Dependencies > Overview (Upstream Services section)
APM > Services > Overview (Dependencies tab)
Assistant's get_apm_downstream_dependencies function

apmmachine · 2024-05-07T13:17:52Z

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

/oblt-deploy : Deploy a Kibana instance using the Observability test environments.
run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

crespocarlos · 2024-05-07T13:18:42Z

/ci

crespocarlos · 2024-05-08T15:15:50Z

...bservability_solution/apm/server/lib/connections/get_connection_stats/get_destination_map.ts

+    });
+
+    const totalDocCount = hitCountResponse.hits.total.value;
+    const rawSamplingProbability = Math.min(100_000 / totalDocCount, 1);


100000 is an arbitrary value. I'm not sure what would be a right balance here

I don't know a lot on this topic but I think there is a trade-off between achieving statistical significance and maintaining manageable sample sizes for performance optimization. The higher the number is, the higher the probability, the longer it takes? I think one variable is our performance expectations. I'm also not sure how to go about choosing a number that is meaningful.

I think statistical significance is somewhat easier than predicting performance. You can make a reasonable estimation of how many documents will be processed per second account for e.g. 3x difference both ways (faster and slower), but that will be per shard, and in bigger clusters the data might be distributed amongst shards, nodes, search threads etc. From my experience, 100k on a single shard will be fast enough, so you can probably increase that number quite a bit. For a simple terms agg you can expect about 10m docs per second, this request will be a little slower of course. I would play around a bit with the number on Serverless to see what performance you get out of it. If long tail is important, you can also do a follow-up request where you've removed the x biggest buckets, but that seems overkill.

@neptunian @dgieselaar I played with different sampling probabilities:

Query executed:

GET traces-apm*,apm*/_search { "track_total_hits": false, "size": 0, "query": { "bool": { "filter": [ { "terms": { "processor.event": [ "span" ] } } ], "must": [ { "bool": { "filter": [ { "exists": { "field": "span.destination.service.resource" } }, { "range": { "@timestamp": { "gte": "now-30d", "lte": "now", } } }, { "bool": { "must_not": [ { "terms": { "agent.name": [ "js-base", "rum-js", "opentelemetry/webjs" ] } } ] } } ] } } ] } }, "aggs": { "sampling": { "random_sampler": { "probability": {PROBABILITY} "seed": 815613888 }, "aggs": { "connections": { "composite": { "size": 10000, "sources": [ { "dependencyName": { "terms": { "field": "span.destination.service.resource" } } }, { "eventOutcome": { "terms": { "field": "event.outcome" } } } ] }, "aggs": { "sample": { "top_metrics": { "size": 1, "metrics": [ { "field": "span.type" }, { "field": "span.subtype" }, { "field": "span.id" } ], "sort": [ { "@timestamp": "asc" } ] } } } } } } } }

Total docs: ~633.278.865:
Scenarios: The query above, running it with size=0 and size=1, to force more load on ES

static number computed probability doc_count took ( with size=1) took

100k 0.000157908317373 99.806 12.8s 3.3s

500k 0.000789541586865 500.187 12.9s 2.9s

1M 0.00157908317373 999.821 12.9s 2.8s

5M 0.00789541586865 5.004.747 12.8s 3.4s

10M 0.0157908317373 10.009.281 12.4s 3.3s

25M 0.039477079343237 25.026.358 13.8s 5s

50M 0.0789541586865 50.141.260 16.4s 7.2s

100M 0.157908317373 100.121.927 22.8s 10.3s

150M 0,236862476059421 150.029.124 23.5s 14.8s

250M 0.39477079343237 249.747.860 timeout 22s

10M - good performance. It's only a small fraction of documents, but it could be enough(?)

between 25M and 50M - performance is a bit compromised but could be acceptable

between 100M and 150M - could lead to some performance issues, but I'm not sure if it would get to cause a 502

anything higher than 250M may cause timeouts

crespocarlos · 2024-05-08T15:17:18Z

/oblt-deploy-serverless

elasticmachine · 2024-05-08T15:18:44Z

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

crespocarlos · 2024-05-08T18:15:11Z

@elasticmachine merge upstream

neptunian · 2024-05-09T15:24:25Z

x-pack/plugins/observability_solution/apm/server/routes/dependencies/route.ts

+
+    const [apmEventClient, randomSampler] = await Promise.all([
+      getApmEventClient(resources),
+      getRandomSampler({ security, request, probability: 1 }),


Does this work in serverless in the observability product? Or the seed defaults to 1? I notice we aren't using in the get_logs_category case you mentioned so curious as to why we're using it here.

Yeah, if security is disabled the seed defaults to 1. We're using this function in other places across APM. I think it makes sense to use it to keep consistency.

neptunian · 2024-05-09T15:48:10Z

...bservability_solution/apm/server/lib/connections/get_connection_stats/get_destination_map.ts

+    const totalDocCount = hitCountResponse.hits.total.value;
+    const rawSamplingProbability = Math.min(100_000 / totalDocCount, 1);
+    const samplingProbability =
+      rawSamplingProbability < 0.5 ? rawSamplingProbability : randomSampler.probability;


It seems like randomSampler.probability is always going to be 1, and we want it to be 1 in this case, so I wonder why use this property?

randomSampler.probability will be 1 if 1 is passed to getRandomSampler. We could hardcode 1 here, but if we keep consistency with other places in APM, we will want to use randomSampler.probability.

The difference between here and other places in APM is that in some routes, is that the probability can be passed in the request - which could be implemented in the future for this function.

Do we use the randomSampler.probability value in places where we are first doing calculations and checking conditions? If we use the value directly, like here, it makes sense because the user would be expecting it be passed into a query, I presume. But if we are first calculating our own probability and then using it only if it meets the 50% condition, it seems like it would need to be based on that. In this case the value should always be 1 based on those conditions? Similar to here https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/apm/server/routes/assistant_functions/get_log_categories/index.ts#L92.

Since we're using the randomSampler object for the seed as well, I personally don't see too much of an issue with using randomSampler.probability when the sampler rate is > than 0.5.

Even if 1 is fixed in the random sampler agg configuration, like here

It would be fine to hardcode 1 here if we weren't using getRandomSampler to build the configuration though. wdyt?

Since we're using the randomSampler object for the seed as well, I personally don't see too much of an issue with using randomSampler.probability when the sampler rate is > than 0.5.

But we don't let the user ever configure the seed right? And all the seed calculations are done within the function that returns it?

I think part of the problem is wanting to colocate the probability with the seed when this function was created but not supporting passing in your own function to the probability. So now our probability calculation is not colocated with the probability value required in the getRandomSampler function, unlike the seed which is.

I'm not too fussed about it, just think this makes it harder to read with unexpected behaviour if we allow the value to be passed into the route as you mentioned. The user would have to be aware of the conditions under which its used/not used.

But we don't let the user ever configure the seed right? And all the seed calculations are done within the function that returns it?

the seed helps the query utilize the same subset of documents between calls. Users are not supposed to configure that. Yeah. it comes from th getRandomSampler function.

I think part of the problem is wanting to colocate the probability with the seed when this function was created but not supporting passing in your own function to the probability. So now our probability calculation is not colocated with the probability value required in the getRandomSampler function, unlike the seed which is.

That's true. The way I see it, it is more about where to configure the default probability. It's either hardcode here, or in getRandomSampler.

If we use the seed from getRandomSampler, but hardcode 1 in the probability it would also be a bit confusing, because the probability parameter in getRandomSampler is mandatory. So we'd have 2 places hardcoding the default probability. Or we hardcode both seed and probability here. Idk.

crespocarlos · 2024-05-13T12:00:46Z

@elasticmachine merge upstream

paulb-elastic · 2024-05-17T13:46:51Z

Discussed this and think it would be useful if there are more than the static_number then mention this in the UI.

We need to be careful about how forceful that messaging is, but make it clear to reduce the time period if not all documents are being used to find the dependencies.

@crespocarlos please catch up with @chrisdistasio to help out with the messaging we should show in the UI

crespocarlos · 2024-05-17T16:00:53Z

@paulb-elastic makes sense. Users need to know what they can do in case dependencies are not listed in the UI. This change will also impact other places that consume the same function. I will reach out to @chrisdistasio, to make sure this won't negatively impact users.

crespocarlos · 2024-05-20T00:04:12Z

@elasticmachine merge upstream

…ndencies-fix

crespocarlos · 2024-05-27T10:15:05Z

@elasticmachine merge upstream

crespocarlos · 2024-05-27T15:34:05Z

@elasticmachine merge upstream

crespocarlos · 2024-05-29T06:27:49Z

@elasticmachine merge upstream

…ndencies-fix

…r agg

crespocarlos · 2024-06-03T07:41:40Z

@elasticmachine merge upstream

sorenlouv · 2024-06-03T11:46:45Z

...rvability_solution/apm/public/components/app/dependencies_inventory/random_sampler_badge.tsx

+      >
+        <EuiBadge iconType="iInCircle" color="hollow">
+          {i18n.translate('xpack.apm.dependencies.randomSampler.badge', {
+            defaultMessage: `Based on sampled transactions`,


This is re-using the message that is displayed already when falling back to transactions from metrics. I can't tell if that's good or bad. It could be slightly confusing when debugging this because it's not immediately obvious if the message refers to sampled spans or sampled transactions.

Perhaps it should say "Based on sampled spans" ?

sounds good! I'll change it.

crespocarlos · 2024-06-04T07:49:53Z

@elasticmachine merge upstream

kibana-ci · 2024-06-04T08:50:18Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 9fa0962
Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-182828-9fa09625c49a
Observability Deployment

Failed CI Steps

FTR Configs #47

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id	before	after	diff
`apm`	1799	1800	+1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`apm`	3.3MB	3.3MB	+836.0B

Canvas Sharable Runtime

The Canvas "shareable runtime" is an bundle produced to enable running Canvas workpads outside of Kibana. This bundle is included in third-party webpages that embed canvas and therefor should be as slim as possible.

id	before	after	diff
`module count`	-	5499	+5499
`total size`	-	8.9MB	+8.9MB

History

💔 Build #213635 failed 08cb84b
💛 Build #213451 was flaky 63db1b5
💚 Build #213299 succeeded bab5452
💚 Build #212705 succeeded 96cc0a7

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@timestamp

Fixes [elastic#178979](elastic#178979) ## Summary This PR changes the `get_exit_span_samples` query to use `random_sampler` aggregation, to limit the number of documents used in the query and avoid 502 errors reported seen in serverless. **IMPORTANT** ❗ The change impacts other places and may lead to a potential loss in the long tail of results ### UI The dependencies page will show a badge to inform users when the data is being sampled **default** https://github.com/elastic/kibana/assets/2767137/ea13031d-8ba1-48bb-a2e4-992eabfa90dd **sampled data** https://github.com/elastic/kibana/assets/2767137/6811c293-c2a1-42fd-bd38-b91e084e8d21 ### How to test The following can be tested on `https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud` Document count for a 30-day range: `594537153` ``` GET traces-apm*,apm*/_search { "size": 0, "track_total_hits": true, "query": { "bool": { "filter": [ { "terms": { "processor.event": [ "span" ] } } ], "must": [ { "bool": { "filter": [ { "exists": { "field": "span.destination.service.resource" } }, { "range": { "@timestamp": { "gte": 1712587750933, "lte": 1715179750933, "format": "epoch_millis" } } }, { "bool": { "must_not": [ { "terms": { "agent.name": [ "js-base", "rum-js", "opentelemetry/webjs" ] } } ] } } ] } } ] } } } ``` A sample rate is calculated based on the doc. count. eg: `100000/594537153 = 0,000168198067178` `0,000168198067178` is the probability sampling passed to the `random_sampler` aggregation. ``` GET traces-apm*,apm*/_search { "track_total_hits": false, "size": 0, "query": { "bool": { "filter": [ { "terms": { "processor.event": [ "span" ] } } ], "must": [ { "bool": { "filter": [ { "exists": { "field": "span.destination.service.resource" } }, { "range": { "@timestamp": { "gte": 1712587750933, "lte": 1715179750933, "format": "epoch_millis" } } }, { "bool": { "must_not": [ { "terms": { "agent.name": [ "js-base", "rum-js", "opentelemetry/webjs" ] } } ] } } ] } } ] } }, "aggs": { "sampling": { "random_sampler": { "probability": 0.000168198067178, "seed": 815613888 }, "aggs": { "connections": { "composite": { "size": 10000, "sources": [ { "dependencyName": { "terms": { "field": "span.destination.service.resource" } } }, { "eventOutcome": { "terms": { "field": "event.outcome" } } } ] }, "aggs": { "sample": { "top_metrics": { "size": 1, "metrics": [ { "field": "span.type" }, { "field": "span.subtype" }, { "field": "span.id" } ], "sort": [ { "@timestamp": "asc" } ] } } } } } } } } ``` - It's hard to create an environment with such a data volume. We can use the query above in `https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud/`, change the date ranges, and validate if the main query will work. ### Alternatively - Start Kibana pointing to an oblt cluster (non-serverless) - Navigate to APM > Dependencies - Try different time ranges ### For reviewers This change affects - APM > Dependencies - APM > Dependencies > Overview (Upstream Services section) - APM > Services > Overview (Dependencies tab) - Assistant's `get_apm_downstream_dependencies` function --------- Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Use random sampler usage in span query

426c3cd

crespocarlos commented May 8, 2024

View reviewed changes

crespocarlos added Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team v8.15.0 release_note:fix labels May 8, 2024

crespocarlos added release_note:skip Skip the PR/issue when compiling release notes and removed release_note:fix labels May 8, 2024

crespocarlos marked this pull request as ready for review May 8, 2024 15:18

crespocarlos requested a review from a team as a code owner May 8, 2024 15:18

crespocarlos requested a review from dgieselaar May 8, 2024 17:32

Merge branch 'main' into 178979-top-dependencies-fix

289cb38

botelastic bot added the ci:project-deploy-observability Create an Observability project label May 8, 2024

neptunian reviewed May 9, 2024

View reviewed changes

Merge branch 'main' into 178979-top-dependencies-fix

3558ca0

raise static number to 10M

0288d64

Merge branch 'main' into 178979-top-dependencies-fix

9005e85

crespocarlos mentioned this pull request May 27, 2024

[APM] Top dependencies request sometimes fails when searching outside of the boost window #178979

Closed

Merge branch 'main' of github.com:elastic/kibana into 178979-top-depe…

cb016d1

…ndencies-fix

Merge branch 'main' into 178979-top-dependencies-fix

b5c9347

Merge branch 'main' into 178979-top-dependencies-fix

c237479

Merge branch 'main' into 178979-top-dependencies-fix

96cc0a7

chrisdistasio requested a review from sorenlouv May 29, 2024 17:38

neptunian approved these changes May 29, 2024

View reviewed changes

crespocarlos added 2 commits May 31, 2024 15:01

Merge branch 'main' of github.com:elastic/kibana into 178979-top-depe…

4bb7681

…ndencies-fix

Add badge to UI to inform users when the query is using random sample…

bab5452

…r agg

crespocarlos requested a review from neptunian May 31, 2024 13:31

Merge branch 'main' into 178979-top-dependencies-fix

63db1b5

dgieselaar approved these changes Jun 3, 2024

View reviewed changes

sorenlouv approved these changes Jun 3, 2024

View reviewed changes

sorenlouv reviewed Jun 3, 2024

View reviewed changes

Change badge message

08cb84b

Merge branch 'main' into 178979-top-dependencies-fix

9fa0962

crespocarlos merged commit a0096ce into elastic:main Jun 4, 2024
22 checks passed

kibanamachine added the backport:skip This commit does not require backporting label Jun 4, 2024

crespocarlos deleted the 178979-top-dependencies-fix branch June 4, 2024 09:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[APM] Use random sampler usage in span query #182828

[APM] Use random sampler usage in span query #182828

crespocarlos commented May 7, 2024 •

edited

apmmachine commented May 7, 2024

crespocarlos commented May 7, 2024

crespocarlos May 8, 2024

neptunian May 9, 2024 •

edited

dgieselaar May 9, 2024

crespocarlos May 15, 2024 •

edited

crespocarlos commented May 8, 2024

elasticmachine commented May 8, 2024

crespocarlos commented May 8, 2024

neptunian May 9, 2024 •

edited

crespocarlos May 13, 2024

neptunian May 9, 2024 •

edited

crespocarlos May 13, 2024

neptunian May 14, 2024

crespocarlos May 14, 2024 •

edited

neptunian May 14, 2024

crespocarlos May 14, 2024 •

edited

crespocarlos commented May 13, 2024

paulb-elastic commented May 17, 2024

crespocarlos commented May 17, 2024

crespocarlos commented May 20, 2024

crespocarlos commented May 27, 2024

crespocarlos commented May 27, 2024

crespocarlos commented May 29, 2024

crespocarlos commented Jun 3, 2024

sorenlouv Jun 3, 2024 •

edited

crespocarlos Jun 3, 2024

crespocarlos commented Jun 4, 2024

kibana-ci commented Jun 4, 2024 •

edited

static number	computed probability	doc_count	took ( `with size=1`)	took
100k	0.000157908317373	99.806	12.8s	3.3s
500k	0.000789541586865	500.187	12.9s	2.9s
1M	0.00157908317373	999.821	12.9s	2.8s
5M	0.00789541586865	5.004.747	12.8s	3.4s
10M	0.0157908317373	10.009.281	12.4s	3.3s
25M	0.039477079343237	25.026.358	13.8s	5s
50M	0.0789541586865	50.141.260	16.4s	7.2s
100M	0.157908317373	100.121.927	22.8s	10.3s
150M	0,236862476059421	150.029.124	23.5s	14.8s
250M	0.39477079343237	249.747.860	timeout	22s

[APM] Use random sampler usage in span query #182828

[APM] Use random sampler usage in span query #182828

Conversation

crespocarlos commented May 7, 2024 • edited

Summary

UI

How to test

Alternatively

For reviewers

apmmachine commented May 7, 2024

🤖 GitHub comments

crespocarlos commented May 7, 2024

crespocarlos May 8, 2024

Choose a reason for hiding this comment

neptunian May 9, 2024 • edited

Choose a reason for hiding this comment

dgieselaar May 9, 2024

Choose a reason for hiding this comment

crespocarlos May 15, 2024 • edited

Choose a reason for hiding this comment

crespocarlos commented May 8, 2024

elasticmachine commented May 8, 2024

crespocarlos commented May 8, 2024

neptunian May 9, 2024 • edited

Choose a reason for hiding this comment

crespocarlos May 13, 2024

Choose a reason for hiding this comment

neptunian May 9, 2024 • edited

Choose a reason for hiding this comment

crespocarlos May 13, 2024

Choose a reason for hiding this comment

neptunian May 14, 2024

Choose a reason for hiding this comment

crespocarlos May 14, 2024 • edited

Choose a reason for hiding this comment

neptunian May 14, 2024

Choose a reason for hiding this comment

crespocarlos May 14, 2024 • edited

Choose a reason for hiding this comment

crespocarlos commented May 13, 2024

paulb-elastic commented May 17, 2024

crespocarlos commented May 17, 2024

crespocarlos commented May 20, 2024

crespocarlos commented May 27, 2024

crespocarlos commented May 27, 2024

crespocarlos commented May 29, 2024

crespocarlos commented Jun 3, 2024

sorenlouv Jun 3, 2024 • edited

Choose a reason for hiding this comment

crespocarlos Jun 3, 2024

Choose a reason for hiding this comment

crespocarlos commented Jun 4, 2024

kibana-ci commented Jun 4, 2024 • edited

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

Module Count

Async chunks

Canvas Sharable Runtime

History

crespocarlos commented May 7, 2024 •

edited

neptunian May 9, 2024 •

edited

crespocarlos May 15, 2024 •

edited

neptunian May 9, 2024 •

edited

neptunian May 9, 2024 •

edited

crespocarlos May 14, 2024 •

edited

crespocarlos May 14, 2024 •

edited

sorenlouv Jun 3, 2024 •

edited

kibana-ci commented Jun 4, 2024 •

edited