[APM] Use random sampler usage in span query #182828
Conversation
const totalDocCount = hitCountResponse.hits.total.value;
const rawSamplingProbability = Math.min(100_000 / totalDocCount, 1);
100000 is an arbitrary value. I'm not sure what the right balance would be here.
I don't know a lot about this topic, but I think there is a trade-off between achieving statistical significance and keeping sample sizes manageable for performance. The higher the number, the higher the probability, and the longer the query takes? One variable is our performance expectations. I'm also not sure how to go about choosing a number that is meaningful.
I think statistical significance is somewhat easier than predicting performance. You can make a reasonable estimate of how many documents will be processed per second and account for e.g. a 3x difference both ways (faster and slower), but that will be per shard, and in bigger clusters the data might be distributed among shards, nodes, search threads, etc. From my experience, 100k on a single shard will be fast enough, so you can probably increase that number quite a bit. For a simple terms agg you can expect about 10M docs per second; this request will be a little slower, of course. I would play around a bit with the number on Serverless to see what performance you get out of it. If the long tail is important, you could also do a follow-up request where you've removed the x biggest buckets, but that seems like overkill.
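The estimation approach described above can be sketched as a quick back-of-envelope calculation. The ~10M docs/s per-shard throughput and the 3x uncertainty factor come from the comment above; the function name and everything else here are illustrative assumptions, not measured values.

```typescript
// Rough latency estimate for a sampled aggregation: scale the total doc
// count by the sampling probability, divide across shards, then divide by
// an assumed per-shard throughput. The 3x band reflects the "faster and
// slower" uncertainty mentioned above.
function estimateAggSeconds(
  totalDocs: number,
  samplingProbability: number,
  shards: number,
  docsPerSecondPerShard = 10_000_000
): { best: number; expected: number; worst: number } {
  const sampledDocsPerShard = (totalDocs * samplingProbability) / shards;
  const expected = sampledDocsPerShard / docsPerSecondPerShard;
  return { best: expected / 3, expected, worst: expected * 3 };
}

// ~633M docs sampled down to ~100k, on a single shard
const est = estimateAggSeconds(633_278_865, 100_000 / 633_278_865, 1);
```

This only estimates the aggregation scan itself; real requests add coordination and composite/top_metrics overhead, which is why the measured numbers below are larger.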
@neptunian @dgieselaar I played with different sampling probabilities:
Query executed:

```
GET traces-apm*,apm*/_search
{
  "track_total_hits": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": ["span"]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "exists": {
                  "field": "span.destination.service.resource"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": "now-30d",
                    "lte": "now"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": ["js-base", "rum-js", "opentelemetry/webjs"]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": {PROBABILITY},
        "seed": 815613888
      },
      "aggs": {
        "connections": {
          "composite": {
            "size": 10000,
            "sources": [
              {
                "dependencyName": {
                  "terms": {
                    "field": "span.destination.service.resource"
                  }
                }
              },
              {
                "eventOutcome": {
                  "terms": {
                    "field": "event.outcome"
                  }
                }
              }
            ]
          },
          "aggs": {
            "sample": {
              "top_metrics": {
                "size": 1,
                "metrics": [
                  { "field": "span.type" },
                  { "field": "span.subtype" },
                  { "field": "span.id" }
                ],
                "sort": [
                  { "@timestamp": "asc" }
                ]
              }
            }
          }
        }
      }
    }
  }
}
```
Total docs: ~633,278,865

Scenarios: the query above, run with `size=0` and with `size=1` (to force more load on ES).
| static number | computed probability | doc_count | took (size=1) | took (size=0) |
|---|---|---|---|---|
| 100k | 0.000157908317373 | 99,806 | 12.8s | 3.3s |
| 500k | 0.000789541586865 | 500,187 | 12.9s | 2.9s |
| 1M | 0.00157908317373 | 999,821 | 12.9s | 2.8s |
| 5M | 0.00789541586865 | 5,004,747 | 12.8s | 3.4s |
| 10M | 0.0157908317373 | 10,009,281 | 12.4s | 3.3s |
| 25M | 0.039477079343237 | 25,026,358 | 13.8s | 5s |
| 50M | 0.0789541586865 | 50,141,260 | 16.4s | 7.2s |
| 100M | 0.157908317373 | 100,121,927 | 22.8s | 10.3s |
| 150M | 0.236862476059421 | 150,029,124 | 23.5s | 14.8s |
| 250M | 0.39477079343237 | 249,747,860 | timeout | 22s |
- 10M - good performance. It's only a small fraction of the documents, but it could be enough(?)
- between 25M and 50M - performance is somewhat compromised but could be acceptable
- between 100M and 150M - could lead to performance issues, though I'm not sure it would go as far as causing a 502
- anything higher than 250M may cause timeouts
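The "computed probability" column in the table above follows directly from dividing each static number by the total doc count. A minimal sketch, assuming the ~633,278,865 total from this test run:

```typescript
// Reproduce the "computed probability" column of the benchmark table:
// probability = min(staticNumber / totalDocs, 1).
const TOTAL_DOCS = 633_278_865; // total doc count from the test run above

function samplingProbability(targetDocs: number, totalDocs: number = TOTAL_DOCS): number {
  return Math.min(targetDocs / totalDocs, 1);
}

const p100k = samplingProbability(100_000);   // ≈ 0.000157908317373
const p10m = samplingProbability(10_000_000); // ≈ 0.0157908317373
```

The `Math.min(..., 1)` clamp matters for small datasets: once the static number exceeds the total doc count, the probability caps at 1 and no sampling happens.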
/oblt-deploy-serverless

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@elasticmachine merge upstream
const [apmEventClient, randomSampler] = await Promise.all([
  getApmEventClient(resources),
  getRandomSampler({ security, request, probability: 1 }),
]);
Does this work in serverless in the observability product? Or does the seed default to 1? I notice we aren't using it in the `get_logs_category` case you mentioned, so curious as to why we're using it here.
Yeah, if security is disabled the seed defaults to `1`. We're using this function in other places across APM. I think it makes sense to use it to keep consistency.
const totalDocCount = hitCountResponse.hits.total.value;
const rawSamplingProbability = Math.min(100_000 / totalDocCount, 1);
const samplingProbability =
  rawSamplingProbability < 0.5 ? rawSamplingProbability : randomSampler.probability;
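One reason the `< 0.5` cutoff exists, worth noting for context: Elasticsearch's `random_sampler` aggregation only accepts probabilities greater than 0 and less than 0.5, or exactly 1 (no sampling). So when the computed value leaves the supported range, the code falls back to the default probability. A minimal sketch of that rule (the helper name and shape are mine, not the PR's actual code):

```typescript
// random_sampler only supports 0 < p < 0.5 or p === 1, so a computed
// probability of e.g. 0.67 cannot be sent to Elasticsearch; fall back to
// the default (1, i.e. aggregate over everything) instead.
function resolveSamplingProbability(
  totalDocCount: number,
  fallbackProbability = 1, // what randomSampler.probability resolves to in this route
  targetDocs = 100_000
): number {
  const raw = Math.min(targetDocs / totalDocCount, 1);
  return raw < 0.5 ? raw : fallbackProbability;
}

// ~633M docs: sample down to roughly 100k documents
const small = resolveSamplingProbability(633_278_865);
// 150k docs: 100k/150k ≈ 0.67 is not a valid random_sampler probability,
// so fall back to 1 and aggregate over all documents
const full = resolveSamplingProbability(150_000);
```

In other words, small datasets are never sampled, which is also why the UI badge only needs to appear for large document counts.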
It seems like `randomSampler.probability` is always going to be 1, and we want it to be 1 in this case, so I wonder why use this property?
`randomSampler.probability` will be `1` if `1` is passed to `getRandomSampler`. We could hardcode `1` here, but to keep consistency with other places in APM we will want to use `randomSampler.probability`.

The difference between here and other places in APM is that in some routes the `probability` can be passed in the request - which could be implemented in the future for this function.
Do we use the `randomSampler.probability` value in places where we are first doing calculations and checking conditions? If we use the value directly, like here, it makes sense, because the user would be expecting it to be passed into a query, I presume. But if we are first calculating our own probability and then using it only if it meets the 50% condition, it seems like it would need to be based on that. In this case the value should always be 1 based on those conditions? Similar to here: https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/apm/server/routes/assistant_functions/get_log_categories/index.ts#L92.
Since we're using the randomSampler object for the `seed` as well, I personally don't see too much of an issue with using `randomSampler.probability` when the sample rate is > 0.5, even if `1` is fixed in the random sampler agg configuration, like here. It would be fine to hardcode `1` here if we weren't using `getRandomSampler` to build the configuration, though. wdyt?
> Since we're using the randomSampler object for the seed as well, I personally don't see too much of an issue with using randomSampler.probability when the sampler rate is > than 0.5.

But we don't let the user ever configure the seed, right? And all the seed calculations are done within the function that returns it?

I think part of the problem is wanting to colocate the probability with the seed when this function was created, while not supporting passing in your own value for the probability. So now our probability calculation is not colocated with the probability value required in the `getRandomSampler` function, unlike the seed, which is.

I'm not too fussed about it; I just think this makes it harder to read, with unexpected behaviour if we allow the value to be passed into the route as you mentioned. The user would have to be aware of the conditions under which it's used/not used.
> But we don't let the user ever configure the seed right? And all the seed calculations are done within the function that returns it?

Right. The seed helps the query use the same subset of documents between calls. Users are not supposed to configure that; it comes from the `getRandomSampler` function.

> I think part of the problem is wanting to colocate the probability with the seed when this function was created but not supporting passing in your own value for the probability.

That's true. The way I see it, it's more about where to configure the default probability: either hardcode it here, or in `getRandomSampler`.

If we use the seed from `getRandomSampler` but hardcode `1` in the `probability`, it would also be a bit confusing, because the `probability` parameter in `getRandomSampler` is mandatory. So we'd have two places hardcoding the default `probability`. Or we hardcode both `seed` and `probability` here. Idk.
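To make the colocation question concrete, here is a hypothetical sketch (not the actual Kibana implementation) of a `getRandomSampler`-style helper. It shows the asymmetry being discussed: the seed is always derived inside the helper (falling back to 1 when security is disabled), while the probability has to be supplied by every caller. The `hashString` helper and the `username` option are invented for illustration.

```typescript
interface RandomSampler {
  probability: number;
  seed: number;
}

// Hypothetical shape: seed is colocated/derived here, probability is not.
function getRandomSamplerSketch(opts: {
  probability: number;
  securityEnabled: boolean;
  username?: string;
}): RandomSampler {
  // With security disabled there is no stable user identity to hash,
  // so the seed defaults to 1 (matching the behaviour described above).
  const seed =
    opts.securityEnabled && opts.username ? hashString(opts.username) : 1;
  return { probability: opts.probability, seed };
}

function hashString(s: string): number {
  // Toy stand-in for whatever hashing the real helper uses.
  let h = 0;
  for (const c of s) h = (h * 31 + c.charCodeAt(0)) | 0;
  return h;
}
```

Because the seed is derived per user, repeated calls by the same user hit the same sampled subset of documents, while `probability` stays a caller concern - which is exactly why the computed-probability logic ends up living outside the helper.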
Discussed this and think it would be useful, if there are more documents than the static number, to mention this in the UI. We need to be careful about how forceful that messaging is, but make it clear that users should reduce the time period if not all documents are being used to find the dependencies. @crespocarlos please catch up with @chrisdistasio to help out with the messaging we should show in the UI.

@paulb-elastic makes sense. Users need to know what they can do in case dependencies are not listed in the UI. This change will also impact other places that consume the same function. I will reach out to @chrisdistasio to make sure this won't negatively impact users.
<EuiBadge iconType="iInCircle" color="hollow">
  {i18n.translate('xpack.apm.dependencies.randomSampler.badge', {
    defaultMessage: `Based on sampled transactions`,
  })}
</EuiBadge>
This is re-using the message that is already displayed when falling back to transactions from metrics. I can't tell if that's good or bad. It could be slightly confusing when debugging, because it's not immediately obvious whether the message refers to sampled spans or sampled transactions. Perhaps it should say "Based on sampled spans"?
sounds good! I'll change it.
💛 Build succeeded, but was flaky
Fixes #178979

## Summary

This PR changes the `get_exit_span_samples` query to use the `random_sampler` aggregation, limiting the number of documents used in the query and avoiding the 502 errors seen in serverless.

**IMPORTANT** ❗ The change impacts other places and may lead to a potential loss in the long tail of results.

### UI

The dependencies page will show a badge to inform users when the data is being sampled.

**default** (no_badge.mov)
https://github.com/elastic/kibana/assets/2767137/ea13031d-8ba1-48bb-a2e4-992eabfa90dd

**sampled data** (with_badge.mov)
https://github.com/elastic/kibana/assets/2767137/6811c293-c2a1-42fd-bd38-b91e084e8d21

### How to test

The following can be tested on `https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud`.

Document count for a 30-day range: `594537153`

```
GET traces-apm*,apm*/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "processor.event": ["span"] } }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              { "exists": { "field": "span.destination.service.resource" } },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": ["js-base", "rum-js", "opentelemetry/webjs"]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
```

A sample rate is calculated based on the doc count, e.g. `100000/594537153 = 0.000168198067178`. `0.000168198067178` is the sampling probability passed to the `random_sampler` aggregation.



```
GET traces-apm*,apm*/_search
{
  "track_total_hits": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "processor.event": ["span"] } }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              { "exists": { "field": "span.destination.service.resource" } },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": ["js-base", "rum-js", "opentelemetry/webjs"]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": 0.000168198067178,
        "seed": 815613888
      },
      "aggs": {
        "connections": {
          "composite": {
            "size": 10000,
            "sources": [
              { "dependencyName": { "terms": { "field": "span.destination.service.resource" } } },
              { "eventOutcome": { "terms": { "field": "event.outcome" } } }
            ]
          },
          "aggs": {
            "sample": {
              "top_metrics": {
                "size": 1,
                "metrics": [
                  { "field": "span.type" },
                  { "field": "span.subtype" },
                  { "field": "span.id" }
                ],
                "sort": [{ "@timestamp": "asc" }]
              }
            }
          }
        }
      }
    }
  }
}
```

It's hard to create an environment with such a data volume. We can use the query above in `https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud/`, change the date ranges, and validate that the main query works.

### Alternatively

- Start Kibana pointing to an oblt cluster (non-serverless)
- Navigate to APM > Dependencies
- Try different time ranges

### For reviewers

This change affects:

- APM > Dependencies
- APM > Dependencies > Overview (Upstream Services section)
- APM > Services > Overview (Dependencies tab)
- Assistant's `get_apm_downstream_dependencies` function

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>