Save memory when string terms are not on top #57758

Merged
merged 7 commits into from Jun 9, 2020

Member

@nik9000 nik9000 commented Jun 5, 2020

This reworks string flavored implementations of the `terms` aggregation
to save memory when it is under another bucket by dropping the usage of
`asMultiBucketAggregator`.
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@nik9000
Member Author

nik9000 commented Jun 6, 2020

In my tests this makes the map execution mode slightly faster when the terms agg is collected from many buckets but doesn't really do anything to the global_ordinals case. But I think my test case is super high cardinality so the performance of adding buckets is dominating the run time. It should indeed save memory. Especially on highly nested aggregations.

@nik9000
Member Author

nik9000 commented Jun 6, 2020

I'm putting together a lower cardinality test case to see what that does.

@nik9000
Member Author

nik9000 commented Jun 7, 2020

When the terms aggregation is inside another aggregation and cardinality is high (~5 million buckets collected), performance is flat for global ords and marginally better for the "map" strategy (before, after). When the cardinality is lower (~30 thousand buckets collected), this is about 15% faster for global ords (before, after).

I expect this is because LongKeyedBucketOrds isn't really optimized at all. We could and should do better here. But that is a thing for another PR.

It uses much less memory for the "map" strategy when the terms agg is a child of a high cardinality agg because the bucket collection strategy can share the map of terms.
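
A rough sketch of why sharing helps, assuming the old wrapper kept one terms structure per owning bucket while the new code stores each distinct term once and keys buckets by the (owning bucket, term) pair. The class and method names are hypothetical, and packing the pair into a single `long` is a shortcut for the sketch, not a claim about the Elasticsearch implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names are Elasticsearch APIs.
final class SharedTermsSketch {
    // Every distinct term is stored once and mapped to a small integer id.
    private final Map<String, Integer> termIds = new HashMap<>();
    // Packed (owningBucketOrd, termId) -> bucket ordinal.
    private final Map<Long, Long> bucketOrds = new HashMap<>();

    // Returns the ordinal for the pair, allocating the next one on first sight.
    long add(int owningBucketOrd, String term) {
        Integer termId = termIds.get(term);
        if (termId == null) {
            termId = termIds.size();
            termIds.put(term, termId);
        }
        // High 32 bits hold the owning bucket, low 32 bits hold the term id.
        long key = ((long) owningBucketOrd << 32) | termId;
        Long ord = bucketOrds.get(key);
        if (ord == null) {
            ord = (long) bucketOrds.size();
            bucketOrds.put(key, ord);
        }
        return ord;
    }

    int storedTerms() {
        return termIds.size();
    }

    int buckets() {
        return bucketOrds.size();
    }
}
```

With 1,000 owning buckets that each see the same 10 terms, this layout stores 10 terms and 10,000 small bucket entries, where one map per owning bucket would store 10,000 copies of the term keys.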

@polyfractal had asked me about the performance difference for map vs global_ordinals. In this particular test global_ordinals is about 65% faster for the 30k bucket case and about 40% faster for the 5m bucket case. But this does a disservice to global_ordinals because none of the "interesting" optimizations kick in here because the terms agg is a sub-agg. If terms is on top we can use the sparse strategy or the low cardinality strategy, both of which claim to be faster.

At some point we could indeed investigate using those strategies when the terms agg is a sub-agg. In particular, I'd be interested in cases when there isn't any correlation between the value of the parent agg and the terms agg. In that case we could benefit from the "dense" strategy even when we are a sub bucket. We'd end up spending 8 bytes per bucket rather than 24 which is lovely. And there'd be no hashing whatsoever to get the bucket which'd make the lookups faster.
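
To make the "dense" idea concrete, here is a minimal sketch, assuming each owning bucket reserves one slot per global ordinal so the bucket ordinal is plain arithmetic. The class is hypothetical, and reading the 8-byte figure as a single `long` doc count per slot is an assumption:

```java
// Hypothetical illustration only; not how Elasticsearch lays out its buckets.
final class DenseOrdsSketch {
    private final long valueCount;  // number of global ordinals (distinct terms)
    private final long[] docCounts; // one 8-byte counter per (owning bucket, ordinal) slot

    DenseOrdsSketch(int owningBuckets, int valueCount) {
        this.valueCount = valueCount;
        // multiplyExact throws instead of silently overflowing the array size.
        this.docCounts = new long[Math.multiplyExact(owningBuckets, valueCount)];
    }

    // No hashing: the slot is computed directly from the two ordinals.
    long bucketOrd(long owningBucketOrd, long globalOrd) {
        return owningBucketOrd * valueCount + globalOrd;
    }

    void collect(long owningBucketOrd, long globalOrd) {
        docCounts[(int) bucketOrd(owningBucketOrd, globalOrd)]++;
    }

    long docCount(long owningBucketOrd, long globalOrd) {
        return docCounts[(int) bucketOrd(owningBucketOrd, globalOrd)];
    }

    long bytesUsed() {
        return (long) docCounts.length * Long.BYTES;
    }
}
```

The trade-off is that every slot is paid for whether or not it is hit, which is why this only looks attractive when the parent value and the term are uncorrelated and most slots end up populated.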

Member

@not-napoleon not-napoleon left a comment

LGTM

return collectsFromSingleBucket ? new FromSingle(bigArrays) : new FromMany(bigArrays);
}

private BytesKeyedBucketOrds() {}
Member

Oh, that reminds me, I should put a private default constructor on ValuesSourceConfig, thanks.
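
For readers unfamiliar with the pattern in the snippet above: a private constructor on an abstract class means only classes nested inside it can extend it, so the static factory is the only way to get an instance. A minimal sketch with hypothetical names (the `bigArrays` argument is left out, and `describe()` exists only for the example):

```java
// Hypothetical illustration of the factory-plus-private-constructor pattern.
abstract class KeyedOrdsSketch {
    // The factory picks the implementation, mirroring the ternary in the diff.
    static KeyedOrdsSketch build(boolean collectsFromSingleBucket) {
        return collectsFromSingleBucket ? new FromSingle() : new FromMany();
    }

    // Private: code outside this file cannot add a third subclass.
    private KeyedOrdsSketch() {}

    abstract String describe();

    // Nested classes may call the private constructor of their enclosing class.
    private static final class FromSingle extends KeyedOrdsSketch {
        @Override
        String describe() {
            return "single";
        }
    }

    private static final class FromMany extends KeyedOrdsSketch {
        @Override
        String describe() {
            return "many";
        }
    }
}
```

Calling `KeyedOrdsSketch.build(true).describe()` returns `"single"`, and `build(false)` gives the `"many"` variant.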

this.collectionStrategy = new RemapGlobalOrds(collectsFromSingleBucket);
} else {
// Dense ords don't know how to collect from many buckets
assert collectsFromSingleBucket;
Member

How confident are we that this assert will never trip? I'm wondering if it makes sense to change the condition to read `if (remapGlobalOrds || collectsFromSingleBucket == false)`.

Member Author

I'm pretty confident. But I'll switch it to a hard check just for extra assurance.
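
The difference matters because Java `assert` statements are skipped unless the JVM runs with assertions enabled (`-ea`), while a condition in the `if` always runs. A small sketch of the two variants; the method names and string results are hypothetical, and the second method follows the condition suggested in the review rather than whatever was finally merged:

```java
// Hypothetical illustration; the returned strings stand in for the two strategies.
final class StrategyCheckSketch {
    // Variant with an assert: with assertions disabled, a false flag
    // slips through and the dense path is chosen anyway.
    static String chooseWithAssert(boolean remapGlobalOrds, boolean collectsFromSingleBucket) {
        if (remapGlobalOrds) {
            return "remap";
        }
        assert collectsFromSingleBucket : "dense ords can't collect from many buckets";
        return "dense";
    }

    // Variant with the check folded into the condition: the dense path is
    // only reachable when collecting from a single bucket, with or without -ea.
    static String chooseWithHardCheck(boolean remapGlobalOrds, boolean collectsFromSingleBucket) {
        if (remapGlobalOrds || collectsFromSingleBucket == false) {
            return "remap";
        }
        return "dense";
    }
}
```

With the hard check, `chooseWithHardCheck(false, false)` falls back to `"remap"` instead of relying on the invariant holding.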

@nik9000
Member Author

nik9000 commented Jun 8, 2020

run elasticsearch-ci/default-distro

@nik9000 nik9000 merged commit 9b8ff5c into elastic:master Jun 9, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jun 9, 2020
nik9000 added a commit that referenced this pull request Jun 9, 2020
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.9.0 v8.0.0-alpha1

4 participants