Improve global ordinals on low-cardinality fields #5854

Closed
jpountz opened this issue Apr 17, 2014 · 6 comments

@jpountz
Contributor

commented Apr 17, 2014

On low-cardinality fields, it is very likely that the large segments contain the same set of values as the whole index. This means that the segment ordinals are already global and that the segmentOrdToGlobalOrdLookup is going to be an identity map.

We could detect such situations, and directly expose the segment ordinals as global ordinals in order to remove one layer of abstraction.
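A minimal sketch of that detection, assuming the segment-to-global mapping is available as a plain `long[]` (index = segment ordinal, value = global ordinal). The class and method names are hypothetical and are not the Elasticsearch field data API.

```java
final class IdentityOrdinalsSketch {

    // Hypothetical stand-in for the per-segment lookup:
    // index = segment ordinal, value = global ordinal.
    static boolean isIdentity(long[] segmentOrdToGlobalOrd, long globalValueCount) {
        // Both ordinal spaces follow sorted term order and the segment's terms are a
        // subset of the index's terms, so equal counts already imply an identity map.
        if (segmentOrdToGlobalOrd.length != globalValueCount) {
            return false;
        }
        // Explicit verification, kept for clarity.
        for (int segOrd = 0; segOrd < segmentOrdToGlobalOrd.length; segOrd++) {
            if (segmentOrdToGlobalOrd[segOrd] != segOrd) {
                return false;
            }
        }
        return true;
    }

    // When the map is an identity, the segment ordinal is returned directly
    // and the lookup array is never touched.
    static long toGlobalOrd(long[] segmentOrdToGlobalOrd, boolean identity, int segOrd) {
        return identity ? segOrd : segmentOrdToGlobalOrd[segOrd];
    }

    public static void main(String[] args) {
        long[] largeSegment = {0, 1, 2, 3, 4}; // contains all 5 values of the index
        long[] smallSegment = {1, 4};          // contains only 2 of the 5 values
        System.out.println(isIdentity(largeSegment, 5)); // true
        System.out.println(isIdentity(smallSegment, 5)); // false
    }
}
```

The check would run once per segment when global ordinals are built, so its cost is not paid per document.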

@rmuir

Contributor

commented Apr 17, 2014

The other approach, which is what Lucene does, is to detect low cardinality (with respect to the number of matching docs), collect with segment ords, and then convert to global ords in a second pass. Imagine 1 billion docs with only 5 unique values: this saves a lot of CPU since you aren't remapping the same ordinals over and over.
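To make the difference concrete, here is a rough sketch of both strategies for plain per-value counts. It assumes the matching docs of each segment are given as arrays of segment ordinals and the mappings as `int[]` arrays; all names are hypothetical, and this is not Lucene's actual facet code.

```java
final class TwoPassOrdinalCounting {

    // Per-hit remapping: one segment-to-global lookup for every matching doc.
    static long[] countPerHit(int[][] hitSegOrds, int[][] segToGlobal, int globalValueCount) {
        long[] globalCounts = new long[globalValueCount];
        for (int seg = 0; seg < hitSegOrds.length; seg++) {
            for (int segOrd : hitSegOrds[seg]) {
                globalCounts[segToGlobal[seg][segOrd]]++;
            }
        }
        return globalCounts;
    }

    // Two passes: count with segment ordinals first, then remap each
    // non-empty segment ordinal exactly once.
    static long[] countTwoPass(int[][] hitSegOrds, int[][] segToGlobal, int globalValueCount) {
        long[] globalCounts = new long[globalValueCount];
        for (int seg = 0; seg < hitSegOrds.length; seg++) {
            long[] segCounts = new long[segToGlobal[seg].length];
            for (int segOrd : hitSegOrds[seg]) {
                segCounts[segOrd]++; // pass 1: no remapping
            }
            for (int segOrd = 0; segOrd < segCounts.length; segOrd++) {
                if (segCounts[segOrd] != 0) {
                    globalCounts[segToGlobal[seg][segOrd]] += segCounts[segOrd]; // pass 2
                }
            }
        }
        return globalCounts;
    }

    public static void main(String[] args) {
        // Index-wide values a, b, c have global ordinals 0, 1, 2.
        int[][] segToGlobal = {{0, 2}, {1, 2}};       // segment 0 holds a, c; segment 1 holds b, c
        int[][] hitSegOrds = {{0, 1, 1, 0}, {0, 0, 1}}; // segment ordinal of each matching doc
        System.out.println(java.util.Arrays.toString(countPerHit(hitSegOrds, segToGlobal, 3)));  // [2, 2, 3]
        System.out.println(java.util.Arrays.toString(countTwoPass(hitSegOrds, segToGlobal, 3))); // [2, 2, 3]
    }
}
```

With 1 billion matching docs and 5 unique values, the first method performs on the order of a billion lookups, while the second performs at most 5 per segment. The extra per-segment array is also why the second pass is unlikely to pay off when the number of unique values is large.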

@martijnvg

Member

commented Apr 17, 2014

@jpountz Nice idea! The caveat here is that the largest segment does need to have all values, but for low cardinality fields this should already be the case.

@rmuir That implies that a feature using global ordinals (e.g. terms aggs) has to change its behaviour (a different execution mode). The nice thing about @jpountz's trick is that it sits behind the field data ordinals interface, so features using it wouldn't know about it, which makes it a small change.

@rmuir

Contributor

commented Apr 17, 2014

@martijnvg Right, the two-pass approach requires a different execution mode, that's true, but it does not require any special alignment of ordinals, so it's a more general optimization. When I applied this to Apache Solr last year, I think it doubled faceting performance for low-cardinality fields: https://issues.apache.org/jira/browse/SOLR-5512

@jpountz

Contributor Author

commented Apr 17, 2014

@rmuir When @martijnvg and I worked on global ordinals for Elasticsearch, we considered building aggregations with segment ordinals first and merging in a final step, but this introduced some complexity as well, since we would need to add logic to merge sub-aggregation buckets together (in every Aggregator impl).

On the other hand, global ordinals are very appealing since we can use global ordinals directly as bucket ordinals.

I think working with segment ordinals could be interesting for leaf terms aggregators though as in this case we don't need to merge sub aggregations.
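As an illustration of the merge logic mentioned above, here is a small sketch with a hypothetical sum sub-aggregation. Buckets collected per segment ordinal have to be folded into the bucket of the matching global ordinal, and every kind of sub-aggregation would need its own `merge`. A leaf terms aggregator only carries a doc count, so it avoids this. None of these names come from the Elasticsearch code base.

```java
final class SegmentBucketMergeSketch {

    // Hypothetical per-bucket state: a doc count plus a running sum
    // maintained by a sub-aggregation.
    static final class SumBucket {
        long docCount;
        double sum;

        void collect(double value) {
            docCount++;
            sum += value;
        }

        // The extra logic each sub-aggregation would have to provide.
        void merge(SumBucket other) {
            docCount += other.docCount;
            sum += other.sum;
        }
    }

    // Folds buckets keyed by segment ordinal into buckets keyed by global ordinal.
    // A null entry means no matching doc had that segment ordinal.
    static void mergeSegment(SumBucket[] globalBuckets, SumBucket[] segmentBuckets, int[] segToGlobal) {
        for (int segOrd = 0; segOrd < segmentBuckets.length; segOrd++) {
            SumBucket bucket = segmentBuckets[segOrd];
            if (bucket == null) {
                continue;
            }
            int globalOrd = segToGlobal[segOrd];
            if (globalBuckets[globalOrd] == null) {
                globalBuckets[globalOrd] = new SumBucket();
            }
            globalBuckets[globalOrd].merge(bucket);
        }
    }
}
```

When global ordinals are used directly as bucket ordinals, each doc is collected straight into its final bucket and no such merge step exists.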

@rmuir

Contributor

commented Apr 17, 2014

When I benchmarked the change, it was as you would imagine: for high-cardinality fields it's slower to do two passes, too much overhead. The crazy heuristic to decide is something Mike came up with while benchmarking Lucene facets; I just stole it.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Apr 18, 2014
@martijnvg

Member

commented Apr 18, 2014

> I think working with segment ordinals could be interesting for leaf terms aggregators though as in this case we don't need to merge sub aggregations.

+1 to explore and add this for leaf terms aggs

martijnvg added a commit that referenced this issue Apr 25, 2014
…ues for a field on a shard level.

Relates to #5854
Closes #5873
martijnvg added a commit that referenced this issue Apr 29, 2014
…dinality fields.

Instead of resolving the global ordinal for each hit on the fly, resolve the global ordinals during post collect.
On fields with not so many unique values, that can reduce the number of global ordinals significantly.

Closes #5895
Closes #5854
@martijnvg martijnvg closed this in f3219f7 Apr 29, 2014