Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TSDB doc value for skipping documents per tsid #92062

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

iverase
Copy link
Contributor

@iverase iverase commented Dec 2, 2022

Time series indices are always sorted first by the _tsid field which identifies uniquely a time series and secondly by the timestamp in reverse order. A typical query on this type of data is to return the last metrics for each unique time series. Because data is ordered in the segments, we only need to read the first document for each _tsid. Unfortunately we currently don't have the information to skip the documents to the next _tsid so we are still reading all documents in the index to answer that question. _tsid fields are stored as lucene sorted doc values.

This PR tries to provide a mechanism to efficiently skip documents from a _tsid. It does it by creating a new doc value format for _tsid fields. This doc value format is just a wrapper on top of lucene sorted doc values but adds an extra file to the current format that provides a jump table between _tsids. This size of the new file depends on the number of _tsids in the segment and it will have a maximum size of 4 * number of _tsid but typically will be significantly smaller.

I tried to make the PR as compatible as possible to #92045

@iverase iverase added the :Analytics/Geo Indexing, search aggregations of geo points and shapes label Dec 2, 2022
@iverase iverase marked this pull request as draft December 2, 2022 12:40
Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!
Do you think that as part of this pr, we already want to make use of TSIDSortedDocValues? E.g. in top_metric agg when sorting by timestamp field?

* Some implementations are considerably more efficient than that.
*/
public int advanceOrd() throws IOException {
int doc = docID();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe leave this method abstract? I don't think want this implementation to run in production?
The ES87TSDBDocValuesFormat.java implementation at line 132 should always be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of this implementation is that we can use this approach even for old indices?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, even with this implementation aggregations can benefit from just skipping to the next tsid.

final PostingsFormat format = mapperService.mappingLookup().getPostingsFormat(field);
if (format != null) {
return format;
if (mapperService != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that mapperService should never be null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is a leftover

@iverase
Copy link
Contributor Author

iverase commented Dec 7, 2022

Do you think that as part of this pr, we already want to make use of TSIDSortedDocValues?

My idea is that TSIDSortedDocValues might be used by the leaf walker to skip documents if we only need to visit the first documents for each tsid. Still we need a mechanism that tells the walker that we are in such a mode. Here is @jpountz suggestion on how we can achieve it: #87564 (comment).

wdyt?

@martijnvg
Copy link
Member

martijnvg commented Dec 7, 2022

My idea is that TSIDSortedDocValues might be used by the leaf walker to skip documents if we only need to visit the first documents for each tsid. Still we need a mechanism that tells the walker that we are in such a mode. Here is @jpountz suggestion on how we can achieve it: #87564 (comment).

I see, that makes sense to me. Maybe aggregation implementation can just check the AggregationExecutionContext to check whether they can utilise the mode that allows them to skip to the next tsid (similar to the change made to here).

@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023
@gmarouli gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023
@quux00 quux00 added v8.11.0 and removed v8.10.0 labels Aug 16, 2023
@StephanErb
Copy link

This still sounds quite useful for many TSDB query usecases! Are there any plans to bring this forward? Or is the idea discarded in favor of the more generic #95701?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants