New random_sampler aggregation for sampling documents in aggregations (#84363)
This adds a new sampling aggregation that performs a background sampling over all documents in an index. 

The syntax is as follows:
```
{
  "aggregations": {
    "sampling": {
      "random_sampler": {
        "probability": 0.1
      },
      "aggs": {
        "price_percentiles": {
          "percentiles": {
            "field": "taxful_total_price"
          }
        }
      }
    }
  }
}
```

This aggregation provides fast random sampling over the entire document set in order to speed up costly aggregations.

Testing this over a variety of aggregations and data sets, the median speedup when sampling at `0.001` over millions of documents is around 70x.

The relative error rate depends on the size of the data and the kind of aggregation. Here are some typically expected numbers when sampling over tens of millions of documents. `p` is the configured probability and `n` is the number of documents matched by your provided filter query.
benwtrent committed Mar 2, 2022
1 parent 5caf3aa commit b592d2b
Showing 19 changed files with 333 additions and 54 deletions.
5 changes: 5 additions & 0 deletions docs/changelog/84363.yaml
@@ -0,0 +1,5 @@
pr: 84363
summary: New `random_sampler` aggregation for sampling documents in aggregations
area: Aggregations
type: feature
issues: []
2 changes: 2 additions & 0 deletions docs/reference/aggregations/bucket.asciidoc
@@ -60,6 +60,8 @@ include::bucket/nested-aggregation.asciidoc[]

include::bucket/parent-aggregation.asciidoc[]

include::bucket/random-sampler-aggregation.asciidoc[]

include::bucket/range-aggregation.asciidoc[]

include::bucket/rare-terms-aggregation.asciidoc[]
@@ -13,8 +13,9 @@ tokens are used to categorize the text.

NOTE: If you have considerable memory allocated to your JVM but are receiving circuit breaker exceptions from this
aggregation, you may be attempting to categorize text that is poorly formatted for categorization. Consider
adding `categorization_filters` or running under <<search-aggregations-bucket-sampler-aggregation,sampler>> or
<<search-aggregations-bucket-diversified-sampler-aggregation,diversified sampler>> to explore the created categories.
adding `categorization_filters` or running under <<search-aggregations-bucket-sampler-aggregation,sampler>>,
<<search-aggregations-bucket-diversified-sampler-aggregation,diversified sampler>>, or
<<search-aggregations-random-sampler-aggregation,random sampler>> to explore the created categories.

[[bucket-categorize-text-agg-syntax]]
==== Parameters
@@ -0,0 +1,96 @@
[[search-aggregations-random-sampler-aggregation]]
=== Random sampler aggregation
++++
<titleabbrev>Random sampler</titleabbrev>
++++

experimental::[]

The `random_sampler` aggregation is a single bucket aggregation that randomly
includes documents in the aggregated results. Sampling provides significant
speed improvement at the cost of accuracy.

The sampling works by selecting a random subset of the entire set of documents
in a shard. If a filter query is provided in the search request, that filter is
applied over the sampled subset. Consequently, if a filter is restrictive, very
few documents might match, and the statistics might not be as accurate.

NOTE: This aggregation is not to be confused with the
<<search-aggregations-bucket-sampler-aggregation,sampler aggregation>>. The
sampler aggregation is not over all documents; rather, it samples the first `n`
documents matched by the query.
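The distinction can be illustrated with a toy Python sketch (assumed semantics for illustration only, not the actual Lucene implementation):

```python
import random

docs = list(range(100))
matched = [d for d in docs if d % 3 == 0]  # documents matched by the query

# sampler: aggregates over the FIRST n documents that match the query
sampler_result = matched[:10]

# random_sampler: samples the background set of ALL documents, then the
# query filter is applied over that sampled subset
rng = random.Random(7)
background = {d for d in docs if rng.random() < 0.3}
random_sampler_result = [d for d in matched if d in background]
```

The `sampler` result is deterministic given the index order, while the `random_sampler` result is a random (but seed-reproducible) subset of everything the query matches.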

[source,console]
----
GET kibana_sample_data_ecommerce/_search?size=0&track_total_hits=false
{
"aggregations": {
"sampling": {
"random_sampler": {
"probability": 0.1
},
"aggs": {
"price_percentiles": {
"percentiles": {
"field": "taxful_total_price"
}
}
}
}
}
}
----
// TEST[setup:kibana_sample_data_ecommerce]

[[random-sampler-top-level-params]]
==== Top-level parameters for random_sampler

`probability`::
(Required, float) The probability that a document will be included in the
aggregated data. Must be greater than `0` and less than `0.5`, or exactly `1`.
The lower the probability, the fewer documents are matched.

`seed`::
(Optional, integer) The seed to generate the random sampling of documents. When
a seed is provided, the random subset of documents is the same between calls.

[[random-sampler-inner-workings]]
==== How does the sampling work?

The aggregation is a random sample of all the documents in the index. In other
words, the sampling is over the background set of documents. If a query is
provided, a document is returned if it is matched by the query and if the
document is in the random sampling. The sampling is not done over the matched
documents.

Consider the set of documents `[1, 2, 3, 4, 5]`. If your query matches `[1, 3, 5]`
and the randomly sampled set is `[2, 4, 5]`, the only document aggregated is `[5]`.
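The intersection semantics above can be sketched in a few lines of Python (a toy model of the behavior, not the actual implementation):

```python
import random

def random_sampler(doc_ids, probability, seed=None):
    """Keep each document independently with the given probability,
    regardless of any query -- i.e. sample the background set."""
    rng = random.Random(seed)
    return {d for d in doc_ids if rng.random() < probability}

docs = [1, 2, 3, 4, 5]
query_matches = {1, 3, 5}                      # documents matched by the query
sampled = random_sampler(docs, 0.6, seed=42)   # background sample
# Only documents in BOTH the query result and the background sample
# are fed to the sub-aggregations:
aggregated = query_matches & sampled
```

Providing a `seed` makes the background sample reproducible between calls, matching the behavior of the optional `seed` parameter above.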

This type of sampling provides almost linear improvement in query latency in relation to the amount
by which sampling reduces the document set size:

image::images/aggregations/random-sampler-agg-graph.png[Graph of the median speedup by sampling factor,align="center"]

This graph shows the typical speedup for the majority of aggregations on a test
data set of 63 million documents. The exact constants depend on the data set
size and the number of shards, but the form of the relationship between speedup
and probability holds widely. For certain aggregations, the speedup may not be
as dramatic, because they have some constant overhead unrelated to the number of
documents seen. Even for those aggregations, the speed improvements can be
significant.

The sample set is generated by skipping documents using a geometric distribution
(`(1-p)^(k-1)*p`) whose success probability is the provided `probability` (`p`
in the distribution equation). The values drawn from the distribution indicate
how many documents to skip before the next selected document. This is
equivalent to selecting each document uniformly at random with probability `p`.
It follows that the expected number of failures before a success is `(1-p)/p`.
For example, with `"probability": 0.01`, the expected number of failures (or
average number of documents skipped) is `99`, with a variance of `9900`.
Consequently, if your index contained only 80 documents (or only 80 matched
your filter query), you would most likely receive no results.
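The skip-based selection can be simulated in a few lines (an illustrative sketch, not Elasticsearch's implementation): drawing skip lengths from a geometric distribution with success probability `p` is equivalent to keeping each document independently with probability `p`.

```python
import math
import random

def sample_by_skipping(num_docs, p, seed=None):
    """Pick document ids by drawing skip lengths from a geometric
    distribution instead of flipping a coin for every document."""
    rng = random.Random(seed)
    selected, i = [], -1
    while True:
        # Failures before the next success ~ Geometric(p), drawn by
        # inverse transform sampling; the expected value is (1 - p) / p,
        # so with p = 0.01 we skip ~99 documents on average.
        skip = int(math.log(1.0 - rng.random()) / math.log(1.0 - p))
        i += skip + 1
        if i >= num_docs:
            return selected
        selected.append(i)

# With p = 0.01 over 1,000,000 documents we expect ~10,000 selections.
picked = sample_by_skipping(1_000_000, 0.01, seed=0)
```

Because the expected number selected is `num_docs * p` with variance `num_docs * p * (1-p)`, the count here concentrates tightly around 10,000.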

image::images/aggregations/relative-error-vs-doc-count.png[Graph of the relative error by sampling probability and doc count,align="center"]

In the above image `p` is the probability provided to the aggregation, and `n` is the number of documents matched by whatever
query is provided. You can see the impact of outliers on `sum` and `mean`, but when many documents are still matched at
higher sampling rates, the relative error is still low.

NOTE: This represents the result of aggregations against a typical positively skewed APM data set which also has outliers in the upper tail. The linear dependence of the relative error on the sample size is found to hold widely, but the slope depends on the variation in the quantity being aggregated. As such, the variance in your own data may
cause relative error rates to increase or decrease at a different rate.
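The relative-error behavior described above can be reproduced with a small Monte Carlo sketch (synthetic data for illustration only; the real error depends on your own data's distribution):

```python
import random
import statistics

def mean_relative_error(values, p, trials=50, seed=0):
    """Average relative error of the sampled mean at sampling probability p."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(values)
    errs = []
    for _ in range(trials):
        sample = [v for v in values if rng.random() < p]
        if sample:  # very low p can yield an empty sample
            errs.append(abs(statistics.fmean(sample) - true_mean) / true_mean)
    return statistics.fmean(errs)

# Positively skewed data with a heavy upper tail, loosely like the APM
# latencies described above (synthetic, hypothetical distribution).
data_rng = random.Random(1)
data = [data_rng.expovariate(1.0) for _ in range(20_000)]

err_small_p = mean_relative_error(data, 0.001)  # ~20 docs per sample
err_large_p = mean_relative_error(data, 0.1)    # ~2,000 docs per sample
```

As the docs note, when many documents are still matched at higher sampling rates, the relative error stays low; shrinking `p` (and hence the effective sample size) inflates it.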
(Two binary image files, images/aggregations/random-sampler-agg-graph.png and
images/aggregations/relative-error-vs-doc-count.png, cannot be displayed.)
1 change: 0 additions & 1 deletion qa/mixed-cluster/build.gradle
@@ -34,7 +34,6 @@ BuildParams.bwcVersions.withWireCompatible { bwcVersion, baseName ->
setting 'path.repo', "${buildDir}/cluster/shared/repo/${baseName}"
setting 'xpack.security.enabled', 'false'
requiresFeature 'es.index_mode_feature_flag_registered', Version.fromString("8.0.0")
requiresFeature 'es.random_sampler_feature_flag_registered', Version.fromString("8.1.0")
}

tasks.register("${baseName}#mixedClusterTest", StandaloneRestIntegTestTask) {
1 change: 0 additions & 1 deletion qa/smoke-test-multinode/build.gradle
@@ -30,7 +30,6 @@ testClusters.matching { it.name == "integTest" }.configureEach {
testClusters.configureEach {
setting 'xpack.security.enabled', 'false'
requiresFeature 'es.index_mode_feature_flag_registered', Version.fromString("8.0.0")
requiresFeature 'es.random_sampler_feature_flag_registered', Version.fromString("8.1.0")
}

tasks.named("integTest").configure {
1 change: 0 additions & 1 deletion rest-api-spec/build.gradle
@@ -37,7 +37,6 @@ artifacts {
testClusters.configureEach {
module ':modules:mapper-extras'
requiresFeature 'es.index_mode_feature_flag_registered', Version.fromString("8.0.0")
requiresFeature 'es.random_sampler_feature_flag_registered', Version.fromString("8.1.0")
}

tasks.named("test").configure { enabled = false }
@@ -43,7 +43,7 @@ setup:
"aggs": {
"sampled": {
"random_sampler": {
"probability": 0.95
"probability": 0.5
},
"aggs": {
"mean": {
@@ -55,8 +55,29 @@
}
}
}
- close_to: { aggregations.sampled.mean.value: {value: 2.5, error: 1.0} }

- is_true: aggregations.sampled.mean
- do:
search:
index: data
size: 0
body: >
{
"aggs": {
"sampled": {
"random_sampler": {
"probability": 1.0
},
"aggs": {
"mean": {
"avg": {
"field": "value"
}
}
}
}
}
}
- match: { aggregations.sampled.mean.value: 2.5 }
---
"Test random_sampler aggregation with filter":
- skip:
@@ -78,7 +99,7 @@
"aggs": {
"sampled": {
"random_sampler": {
"probability": 0.95
"probability": 0.5
},
"aggs": {
"mean": {
@@ -90,8 +111,7 @@
}
}
}
- match: { aggregations.sampled.mean.value: 1.0 }

- is_true: aggregations.sampled.mean
- do:
search:
index: data
@@ -101,14 +121,14 @@
"query": {
"bool": {
"filter": [
{"term": {"product": "VCR"}}
{"term": {"product": "server"}}
]
}
},
"aggs": {
"sampled": {
"random_sampler": {
"probability": 0.95
"probability": 1.0
},
"aggs": {
"mean": {
@@ -120,14 +140,14 @@
}
}
}
- match: { aggregations.sampled.mean.value: 4.0 }
- match: { aggregations.sampled.mean.value: 1.0 }
---
"Test random_sampler aggregation with poor settings":
- skip:
version: " - 8.1.99"
reason: added in 8.2.0
- do:
catch: /\[probability\] must be between 0 and 1/
catch: /\[probability\] must be between 0.0 and 0.5 or exactly 1.0, was \[1.5\]/
search:
index: data
size: 0
@@ -149,7 +169,7 @@
}
}
- do:
catch: /\[probability\] must be between 0 and 1/
catch: /\[probability\] must be greater than 0.0, was \[0.0\]/
search:
index: data
size: 0
2 changes: 0 additions & 2 deletions server/build.gradle
@@ -130,11 +130,9 @@ tasks.named("processResources").configure {
if (BuildParams.isSnapshotBuild() == false) {
tasks.named("test").configure {
systemProperty 'es.index_mode_feature_flag_registered', 'true'
systemProperty 'es.random_sampler_feature_flag_registered', 'true'
}
tasks.named("internalClusterTest").configure {
systemProperty 'es.index_mode_feature_flag_registered', 'true'
systemProperty 'es.random_sampler_feature_flag_registered', 'true'
}
}

