Save memory when string terms are not on top #57758

nik9000 · 2020-06-05T18:42:24Z

This reworks string flavored implementations of the terms aggregation
to save memory when it is under another bucket by dropping the usage of
asMultiBucketAggregator.

elasticmachine · 2020-06-05T18:42:27Z

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

nik9000 · 2020-06-06T17:02:19Z

In my tests this makes the `map` execution mode slightly faster when the terms agg is collected from many buckets, but it doesn't really do anything for the `global_ordinals` case. I think my test case is super high cardinality, though, so the cost of adding buckets is dominating the run time. It should still save memory, especially on highly nested aggregations.

nik9000 · 2020-06-06T17:10:46Z

I'm putting together a lower cardinality test case to see what that does.

nik9000 · 2020-06-07T16:44:18Z

When the terms aggregation is inside another aggregation and cardinality is high (~5 million buckets collected), performance is flat for global ords and marginally better for the "map" strategy. When the cardinality is lower (~30 thousand buckets collected), this is about 15% faster for global ords.

I expect this is because LongKeyedBucketOrds isn't really optimized at all. We could and should do better here. But that is a thing for another PR.

It uses much less memory for the "map" strategy when the terms agg is a child of a high cardinality agg because the bucket collection strategy can share the map of terms.
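To illustrate why sharing helps, here is a toy sketch and not the Elasticsearch code. With one aggregator per parent bucket, each keeps its own map of terms, so a term seen under many parents is stored many times. A shared structure can store each term once and key buckets by the pair (owning bucket ordinal, term id). The class and method names are hypothetical, plain `HashMap`s stand in for whatever `BigArrays`-backed structures the real code uses, and the "negative means already present" return convention is a choice made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: one term dictionary shared by all owning buckets. */
final class SharedTermBucketOrds {
    // Each distinct term is stored once, however many parent buckets see it.
    private final Map<String, Integer> termIds = new HashMap<>();
    // (owningBucketOrd, termId) packed into one long -> bucket ordinal.
    private final Map<Long, Long> bucketOrds = new HashMap<>();

    /**
     * Returns the new bucket ordinal (>= 0) if the pair was just added,
     * or -1 - existingOrd if the pair was already present.
     */
    long add(int owningBucketOrd, String term) {
        Integer termId = termIds.get(term);
        if (termId == null) {
            termId = termIds.size();
            termIds.put(term, termId);
        }
        // termId is never negative, so widening it to long is safe here.
        long key = ((long) owningBucketOrd << 32) | termId;
        Long existing = bucketOrds.get(key);
        if (existing != null) {
            return -1 - existing;
        }
        long ord = bucketOrds.size();
        bucketOrds.put(key, ord);
        return ord;
    }

    int distinctTerms() {
        return termIds.size();
    }

    long buckets() {
        return bucketOrds.size();
    }
}
```

Adding the same term under two different parents yields two buckets but only one stored term, which is the saving described above.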

@polyfractal had asked me about the performance difference for `map` vs `global_ordinals`. In this particular test `global_ordinals` is about 65% faster for the 30k bucket case and about 40% faster for the 5m bucket case. But this does a disservice to `global_ordinals` because none of the "interesting" optimizations kick in here because the terms agg is a sub-agg. If terms is on top we can use the sparse strategy or the low cardinality strategy, both of which claim to be faster.

At some point we could indeed investigate using those strategies when the terms agg is a sub-agg. In particular, I'd be interested in cases where there isn't any correlation between the value of the parent agg and the terms agg. In that case we could benefit from the "dense" strategy even when we are a sub bucket. We'd end up spending 8 bytes per bucket rather than 24, which is lovely. And there'd be no hashing whatsoever to get the bucket, which would make the lookups faster.
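A minimal sketch of what dense addressing under a parent bucket could look like, assuming the bucket ordinal is computed as `owningBucketOrd * valueCount + globalOrd`. That formula, the class, and its method names are illustrative assumptions, not something this PR implements. One `long` per slot is one way to read the 8 bytes per bucket figure.

```java
/** Hypothetical sketch of dense addressing for a terms agg under a parent bucket. */
final class DenseSubBucketCounts {
    private final long valueCount;  // number of global ordinals for the field
    private final long[] docCounts; // one 8-byte slot per (parent bucket, term) pair

    DenseSubBucketCounts(int parentBuckets, int valueCount) {
        this.valueCount = valueCount;
        // multiplyExact throws instead of silently overflowing the array size.
        this.docCounts = new long[Math.multiplyExact(parentBuckets, valueCount)];
    }

    /** Pure arithmetic: no hash lookup is needed to find the bucket. */
    int bucketOrd(int owningBucketOrd, int globalOrd) {
        return Math.toIntExact(owningBucketOrd * valueCount + globalOrd);
    }

    void collect(int owningBucketOrd, int globalOrd) {
        docCounts[bucketOrd(owningBucketOrd, globalOrd)]++;
    }

    long docCount(int owningBucketOrd, int globalOrd) {
        return docCounts[bucketOrd(owningBucketOrd, globalOrd)];
    }
}
```

The trade-off in this sketch is that a slot exists for every (parent, term) pair whether or not it is ever collected, so it only pays off when most pairs are populated, which is roughly the uncorrelated case described above.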

not-napoleon

LGTM

not-napoleon · 2020-06-08T19:54:08Z

...r/src/main/java/org/elasticsearch/search/aggregations/bucket/terms/BytesKeyedBucketOrds.java

+        return collectsFromSingleBucket ? new FromSingle(bigArrays) : new FromMany(bigArrays);
+    }
+
+    private BytesKeyedBucketOrds() {}


Oh, that reminds me, I should put a private default constructor on ValuesSourceConfig, thanks.

not-napoleon · 2020-06-08T20:18:52Z

.../org/elasticsearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java

+            this.collectionStrategy = new RemapGlobalOrds(collectsFromSingleBucket);
+        } else {
+            // Dense ords don't know how to collect from many buckets
+            assert collectsFromSingleBucket;


How confident are we that this assert will never trip? I'm wondering if it makes sense to change the condition to read `if (remapGlobalOrds || collectsFromSingleBucket == false)`.

nik9000

I'm pretty confident. But I'll switch it to a hard check just for extra assurance.
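For context on the difference: Java `assert` statements only run when the JVM is started with assertions enabled (`-ea`), while an explicit check runs always. The sketch below shows one possible shape of a hard check. The class, method, return values, and exception type are hypothetical and are not taken from the actual change.

```java
/** Hypothetical sketch contrasting a hard check with an assert. */
final class StrategyChooser {
    static String choose(boolean remapGlobalOrds, boolean collectsFromSingleBucket) {
        if (remapGlobalOrds) {
            return "remap";
        }
        // Hard check: executes whether or not assertions are enabled,
        // unlike `assert collectsFromSingleBucket;`.
        if (collectsFromSingleBucket == false) {
            throw new IllegalStateException("dense ords can't collect from many buckets");
        }
        return "dense";
    }
}
```

The reviewer's alternative would instead route the many-bucket case to the remapping strategy rather than throwing. Either way the invalid combination can no longer slip through unnoticed in production.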

nik9000 · 2020-06-08T22:20:21Z

run elasticsearch-ci/default-distro

Save memory when string terms are not on top

c7a6dee

nik9000 added the >enhancement, :Analytics/Aggregations, v8.0.0, and v7.9.0 labels Jun 5, 2020

nik9000 requested a review from not-napoleon June 5, 2020 18:42

elasticmachine added the Team:Analytics label Jun 5, 2020

This was referenced Jun 5, 2020

Fix a bug with missing fields in sig_terms #57757

Merged

Multi-bucket aggregator wrapper is slow and uses a ton of memory #56487

Closed

Add operation for sub-string-terms elastic/rally-tracks#125

Merged

Debug info

175b942

Big oops

f8842c0

Merge branch 'master' into terms_mem

c04bf45

not-napoleon approved these changes Jun 8, 2020

Hard check now

e1a6264

nik9000 added 2 commits June 8, 2020 18:33

Merge branch 'master' into terms_mem

38364b6

Update after merge

31fac9f

nik9000 merged commit 9b8ff5c into elastic:master Jun 9, 2020

nik9000 added the backport pending label Jun 9, 2020

nik9000 removed the backport pending label Jun 9, 2020

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
