Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[APM] Use random sampler usage in span query #182828

Merged
merged 14 commits into from
Jun 4, 2024

Conversation

crespocarlos
Copy link
Contributor

@crespocarlos crespocarlos commented May 7, 2024

Fixes #178979

Summary

This PR changes the get_exit_span_samples query to use random_sampler aggregation, to limit the number of documents used in the query and avoid 502 errors reported seen in serverless.

IMPORTANT
The change impacts other places and may lead to a potential loss in the long tail of results

UI

The dependencies page will show a badge to inform users when the data is being sampled

default

no_badge.mov

sampled data

with_badge.mov

How to test

The following can be tested on https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud

Document count for a 30-day range: 594537153

GET traces-apm*,apm*/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "span"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "exists": {
                  "field": "span.destination.service.resource"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}

A sample rate is calculated based on the doc. count. eg: 100000/594537153 = 0,000168198067178

0,000168198067178 is the probability sampling passed to the random_sampler aggregation.


GET traces-apm*,apm*/_search
{
  "track_total_hits": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "span"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "exists": {
                  "field": "span.destination.service.resource"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": 0.000168198067178,
        "seed": 815613888
      },
      "aggs": {
        "connections": {
          "composite": {
            "size": 10000,
            "sources": [
              {
                "dependencyName": {
                  "terms": {
                    "field": "span.destination.service.resource"
                  }
                }
              },
              {
                "eventOutcome": {
                  "terms": {
                    "field": "event.outcome"
                  }
                }
              }
            ]
          },
          "aggs": {
            "sample": {
              "top_metrics": {
                "size": 1,
                "metrics": [
                  {
                    "field": "span.type"
                  },
                  {
                    "field": "span.subtype"
                  },
                  {
                    "field": "span.id"
                  }
                ],
                "sort": [
                  {
                    "@timestamp": "asc"
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}
  • It's hard to create an environment with such a data volume. We can use the query above in https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud/, change the date ranges, and validate if the main query will work.

Alternatively

  • Start Kibana pointing to an oblt cluster (non-serverless)
  • Navigate to APM > Dependencies
  • Try different time ranges

For reviewers

This change affects

  • APM > Dependencies
  • APM > Dependencies > Overview (Upstream Services section)
  • APM > Services > Overview (Dependencies tab)
  • Assistant's get_apm_downstream_dependencies function

@apmmachine
Copy link
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@crespocarlos
Copy link
Contributor Author

/ci

});

const totalDocCount = hitCountResponse.hits.total.value;
const rawSamplingProbability = Math.min(100_000 / totalDocCount, 1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100000 is an arbitrary value. I'm not sure what would be a right balance here

Copy link
Contributor

@neptunian neptunian May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know a lot on this topic but I think there is a trade-off between achieving statistical significance and maintaining manageable sample sizes for performance optimization. The higher the number is, the higher the probability, the longer it takes? I think one variable is our performance expectations. I'm also not sure how to go about choosing a number that is meaningful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think statistical significance is somewhat easier than predicting performance. You can make a reasonable estimation of how many documents will be processed per second account for e.g. 3x difference both ways (faster and slower), but that will be per shard, and in bigger clusters the data might be distributed amongst shards, nodes, search threads etc. From my experience, 100k on a single shard will be fast enough, so you can probably increase that number quite a bit. For a simple terms agg you can expect about 10m docs per second, this request will be a little slower of course. I would play around a bit with the number on Serverless to see what performance you get out of it. If long tail is important, you can also do a follow-up request where you've removed the x biggest buckets, but that seems overkill.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neptunian @dgieselaar I played with different sampling probabilities:

Query executed:

GET traces-apm*,apm*/_search
{
  "track_total_hits": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "span"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "exists": {
                  "field": "span.destination.service.resource"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": "now-30d",
                    "lte": "now",
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": {PROBABILITY}
        "seed": 815613888
      },
      "aggs": {
        "connections": {
          "composite": {
            "size": 10000,
            "sources": [
              {
                "dependencyName": {
                  "terms": {
                    "field": "span.destination.service.resource"
                  }
                }
              },
              {
                "eventOutcome": {
                  "terms": {
                    "field": "event.outcome"
                  }
                }
              }
            ]
          },
          "aggs": {
            "sample": {
              "top_metrics": {
                "size": 1,
                "metrics": [
                  {
                    "field": "span.type"
                  },
                  {
                    "field": "span.subtype"
                  },
                  {
                    "field": "span.id"
                  }
                ],
                "sort": [
                  {
                    "@timestamp": "asc"
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}

Total docs: ~633.278.865:
Scenarios: The query above, running it with size=0 and size=1, to force more load on ES

static number computed probability doc_count took ( with size=1) took
100k 0.000157908317373 99.806 12.8s 3.3s
500k 0.000789541586865 500.187 12.9s 2.9s
1M 0.00157908317373 999.821 12.9s 2.8s
5M 0.00789541586865 5.004.747 12.8s 3.4s
10M 0.0157908317373 10.009.281 12.4s 3.3s
25M 0.039477079343237 25.026.358 13.8s 5s
50M 0.0789541586865 50.141.260 16.4s 7.2s
100M 0.157908317373 100.121.927 22.8s 10.3s
150M 0,236862476059421 150.029.124 23.5s 14.8s
250M 0.39477079343237 249.747.860 timeout 22s
  • 10M - good performance. It's only a small fraction of documents, but it could be enough(?)
  • between 25M and 50M - performance is a bit compromised but could be acceptable
  • between 100M and 150M - could lead to some performance issues, but I'm not sure if it would get to cause a 502
  • anything higher than 250M may cause timeouts

@crespocarlos crespocarlos added Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team v8.15.0 release_note:fix labels May 8, 2024
@crespocarlos
Copy link
Contributor Author

/oblt-deploy-serverless

@crespocarlos crespocarlos added release_note:skip Skip the PR/issue when compiling release notes and removed release_note:fix labels May 8, 2024
@crespocarlos crespocarlos marked this pull request as ready for review May 8, 2024 15:18
@crespocarlos crespocarlos requested a review from a team as a code owner May 8, 2024 15:18
@elasticmachine
Copy link
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@crespocarlos
Copy link
Contributor Author

@elasticmachine merge upstream

@botelastic botelastic bot added the ci:project-deploy-observability Create an Observability project label May 8, 2024

const [apmEventClient, randomSampler] = await Promise.all([
getApmEventClient(resources),
getRandomSampler({ security, request, probability: 1 }),
Copy link
Contributor

@neptunian neptunian May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work in serverless in the observability product? Or the seed defaults to 1? I notice we aren't using in the get_logs_category case you mentioned so curious as to why we're using it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if security is disabled the seed defaults to 1. We're using this function in other places across APM. I think it makes sense to use it to keep consistency.

const totalDocCount = hitCountResponse.hits.total.value;
const rawSamplingProbability = Math.min(100_000 / totalDocCount, 1);
const samplingProbability =
rawSamplingProbability < 0.5 ? rawSamplingProbability : randomSampler.probability;
Copy link
Contributor

@neptunian neptunian May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like randomSampler.probability is always going to be 1, and we want it to be 1 in this case, so I wonder why use this property?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

randomSampler.probability will be 1 if 1 is passed to getRandomSampler. We could hardcode 1 here, but if we keep consistency with other places in APM, we will want to use randomSampler.probability.

The difference between here and other places in APM is that in some routes, is that the probability can be passed in the request - which could be implemented in the future for this function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use the randomSampler.probability value in places where we are first doing calculations and checking conditions? If we use the value directly, like here, it makes sense because the user would be expecting it be passed into a query, I presume. But if we are first calculating our own probability and then using it only if it meets the 50% condition, it seems like it would need to be based on that. In this case the value should always be 1 based on those conditions? Similar to here https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/apm/server/routes/assistant_functions/get_log_categories/index.ts#L92.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're using the randomSampler object for the seed as well, I personally don't see too much of an issue with using randomSampler.probability when the sampler rate is > than 0.5.

Even if 1 is fixed in the random sampler agg configuration, like here

It would be fine to hardcode 1 here if we weren't using getRandomSampler to build the configuration though. wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're using the randomSampler object for the seed as well, I personally don't see too much of an issue with using randomSampler.probability when the sampler rate is > than 0.5.

But we don't let the user ever configure the seed right? And all the seed calculations are done within the function that returns it?

I think part of the problem is wanting to colocate the probability with the seed when this function was created but not supporting passing in your own function to the probability. So now our probability calculation is not colocated with the probability value required in the getRandomSampler function, unlike the seed which is.

I'm not too fussed about it, just think this makes it harder to read with unexpected behaviour if we allow the value to be passed into the route as you mentioned. The user would have to be aware of the conditions under which its used/not used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we don't let the user ever configure the seed right? And all the seed calculations are done within the function that returns it?

the seed helps the query utilize the same subset of documents between calls. Users are not supposed to configure that. Yeah. it comes from th getRandomSampler function.

I think part of the problem is wanting to colocate the probability with the seed when this function was created but not supporting passing in your own function to the probability. So now our probability calculation is not colocated with the probability value required in the getRandomSampler function, unlike the seed which is.

That's true. The way I see it, it is more about where to configure the default probability. It's either hardcode here, or in getRandomSampler.

If we use the seed from getRandomSampler, but hardcode 1 in the probability it would also be a bit confusing, because the probability parameter in getRandomSampler is mandatory. So we'd have 2 places hardcoding the default probability. Or we hardcode both seed and probability here. Idk.

@crespocarlos
Copy link
Contributor Author

@elasticmachine merge upstream

@paulb-elastic
Copy link
Contributor

Discussed this and think it would be useful if there are more than the static_number then mention this in the UI.

We need to be careful about how forceful that messaging is, but make it clear to reduce the time period if not all documents are being used to find the dependencies.

@crespocarlos please catch up with @chrisdistasio to help out with the messaging we should show in the UI

@crespocarlos
Copy link
Contributor Author

@paulb-elastic makes sense. Users need to know what they can do in case dependencies are not listed in the UI. This change will also impact other places that consume the same function. I will reach out to @chrisdistasio, to make sure this won't negatively impact users.

@crespocarlos
Copy link
Contributor Author

@elasticmachine merge upstream

@crespocarlos
Copy link
Contributor Author

@elasticmachine merge upstream

@crespocarlos
Copy link
Contributor Author

@elasticmachine merge upstream

@crespocarlos
Copy link
Contributor Author

@elasticmachine merge upstream

@crespocarlos
Copy link
Contributor Author

@elasticmachine merge upstream

>
<EuiBadge iconType="iInCircle" color="hollow">
{i18n.translate('xpack.apm.dependencies.randomSampler.badge', {
defaultMessage: `Based on sampled transactions`,
Copy link
Member

@sorenlouv sorenlouv Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is re-using the message that is displayed already when falling back to transactions from metrics. I can't tell if that's good or bad. It could be slightly confusing when debugging this because it's not immediately obvious if the message refers to sampled spans or sampled transactions.

Perhaps it should say "Based on sampled spans" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good! I'll change it.

@crespocarlos
Copy link
Contributor Author

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

kibana-ci commented Jun 4, 2024

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
apm 1799 1800 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
apm 3.3MB 3.3MB +836.0B

Canvas Sharable Runtime

The Canvas "shareable runtime" is an bundle produced to enable running Canvas workpads outside of Kibana. This bundle is included in third-party webpages that embed canvas and therefor should be as slim as possible.

id before after diff
module count - 5499 +5499
total size - 8.9MB +8.9MB

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@crespocarlos crespocarlos merged commit a0096ce into elastic:main Jun 4, 2024
22 checks passed
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Jun 4, 2024
@crespocarlos crespocarlos deleted the 178979-top-dependencies-fix branch June 4, 2024 09:48
rohanxz pushed a commit to honeyn303/kibana that referenced this pull request Jun 4, 2024
Fixes [elastic#178979](elastic#178979)

## Summary

This PR changes the `get_exit_span_samples` query to use
`random_sampler` aggregation, to limit the number of documents used in
the query and avoid 502 errors reported seen in serverless.

**IMPORTANT** ❗ 
The change impacts other places and may lead to a potential loss in the
long tail of results


### UI

The dependencies page will show a badge to inform users when the data is
being sampled

**default**


https://github.com/elastic/kibana/assets/2767137/ea13031d-8ba1-48bb-a2e4-992eabfa90dd

**sampled data**


https://github.com/elastic/kibana/assets/2767137/6811c293-c2a1-42fd-bd38-b91e084e8d21


### How to test

The following can be tested on
`https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud`

Document count for a 30-day range: `594537153`
```
GET traces-apm*,apm*/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "span"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "exists": {
                  "field": "span.destination.service.resource"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
```

A sample rate is calculated based on the doc. count. eg:
`100000/594537153 = 0,000168198067178`

`0,000168198067178` is the probability sampling passed to the
`random_sampler` aggregation.

```

GET traces-apm*,apm*/_search
{
  "track_total_hits": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "span"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "exists": {
                  "field": "span.destination.service.resource"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": 0.000168198067178,
        "seed": 815613888
      },
      "aggs": {
        "connections": {
          "composite": {
            "size": 10000,
            "sources": [
              {
                "dependencyName": {
                  "terms": {
                    "field": "span.destination.service.resource"
                  }
                }
              },
              {
                "eventOutcome": {
                  "terms": {
                    "field": "event.outcome"
                  }
                }
              }
            ]
          },
          "aggs": {
            "sample": {
              "top_metrics": {
                "size": 1,
                "metrics": [
                  {
                    "field": "span.type"
                  },
                  {
                    "field": "span.subtype"
                  },
                  {
                    "field": "span.id"
                  }
                ],
                "sort": [
                  {
                    "@timestamp": "asc"
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}
```

- It's hard to create an environment with such a data volume. We can use
the query above in
`https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud/`,
change the date ranges, and validate if the main query will work.

### Alternatively

- Start Kibana pointing to an oblt cluster (non-serverless)
- Navigate to APM > Dependencies
- Try different time ranges

### For reviewers

This change affects
- APM > Dependencies
- APM > Dependencies > Overview (Upstream Services section)
- APM > Services > Overview (Dependencies tab)
- Assistant's `get_apm_downstream_dependencies` function

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting ci:project-deploy-observability Create an Observability project release_note:skip Skip the PR/issue when compiling release notes Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[APM] Top dependencies request sometimes fails when searching outside of the boost window
9 participants