Member

@tlrx tlrx commented Oct 28, 2025

This change follows #136810 and introduces a new format for document _id fields in time-series data streams. To test this new format, the TSDB synthetic terms and postings format implementations are also changed.

Why change the document _id format?

The current _id format is composed of a routing hash, the hashed value of the _tsid and a timestamp. While it is possible to extract the routing hash and the timestamp from a document _id value, it is not possible to extract the original _tsid value.

This is an issue for synthetic _id as the document _id value is not indexed anymore: instead the synthetic _id is computed at runtime from the value of the routing hash, _tsid and timestamp. Therefore we need to be able to extract the routing hash/_tsid/timestamp values from the _id value and vice-versa.

The format for the synthetic _id in this pull request has been changed to be:

  • _tsid (variable length)
  • Long.MAX_VALUE - timestamp (unsigned long on 8 bytes encoded using big endian)
  • routing hash _ts_routing_hash (4 bytes)
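
The layout above can be sketched as follows. This is only an illustration, not the code in this pull request: the class and method names are hypothetical, timestamps are assumed to be non-negative, and the byte order of the routing hash is an assumption (the description only gives its width).

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical sketch of the layout: _tsid | Long.MAX_VALUE - timestamp | routing hash. */
final class SyntheticIdLayoutSketch {
    // 8 bytes of inverted timestamp plus 4 bytes of routing hash follow the _tsid
    static final int SUFFIX_LENGTH = Long.BYTES + Integer.BYTES;

    static byte[] encode(byte[] tsid, long timestamp, int routingHash) {
        // ByteBuffer writes big endian by default
        return ByteBuffer.allocate(tsid.length + SUFFIX_LENGTH)
            .put(tsid)
            .putLong(Long.MAX_VALUE - timestamp)
            .putInt(routingHash) // assumption: the hash byte order is not stated in the description
            .array();
    }

    static byte[] tsid(byte[] id) {
        // The _tsid is variable length, so it is whatever precedes the fixed-size suffix
        return Arrays.copyOfRange(id, 0, id.length - SUFFIX_LENGTH);
    }

    static long timestamp(byte[] id) {
        return Long.MAX_VALUE - ByteBuffer.wrap(id).getLong(id.length - SUFFIX_LENGTH);
    }

    static int routingHash(byte[] id) {
        return ByteBuffer.wrap(id).getInt(id.length - Integer.BYTES);
    }
}
```

Because the two trailing components have a fixed size, all three values can be recovered from the end of the array without knowing the _tsid length in advance.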

Extracting the values from the _id is then used for routing GET and DELETE requests to the appropriate shard, or for setting the _tsid/timestamp/_ts_routing_hash fields in tombstone documents.

Note that in searches, the document _id is built using the existing TsIdLoader (that has been adjusted).

It is also important that the generated _id can be sorted lexicographically, as Lucene stops applying doc values updates when it seeks to a term that is greater than the one used in the soft-update. The ordering of the byte arrays representing the _id must match the ordering of documents in the segment, and to achieve that the timestamp value is stored in the array as Long.MAX_VALUE - timestamp. Docs with higher timestamps are then sorted first.
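
A small sketch of the ordering property, assuming non-negative timestamps and leaving out the routing hash for brevity. The class and helper names are hypothetical. `Arrays.compareUnsigned` stands in for the unsigned byte-wise comparison applied to terms.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class SyntheticIdOrderSketch {
    /** Hypothetical helper: _tsid followed by Long.MAX_VALUE - timestamp in big endian. */
    static byte[] prefix(byte[] tsid, long timestamp) {
        return ByteBuffer.allocate(tsid.length + Long.BYTES)
            .put(tsid)
            .putLong(Long.MAX_VALUE - timestamp)
            .array();
    }

    /** Unsigned lexicographic comparison of the raw bytes. */
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }
}
```

For the same _tsid, `compare(prefix(tsid, 2000L), prefix(tsid, 1000L))` is negative: the document with the higher timestamp gets the smaller byte array and therefore sorts first.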

Impact on synthetic terms and postings format

The SyntheticIdTermsEnum and SyntheticIdPostingsEnum introduced in #136810 had to be adjusted for the new _id format. We expect their implementation to be somewhat slow as it requires several lookups to work. In a different change we'll add a bloom filter on top of these enumerations to avoid costly lookups.

In contrast to #136810, the SyntheticIdTermsEnum and SyntheticIdPostingsEnum implementations are ready for review.

Test improvements

Tests now delete random documents and/or look up documents from in-memory segments or from segments flushed to disk. Searches and search-by-id should also work.

The following features are NOT covered by tests:

  • aggregations over _id (these likely do not work)
  • nested docs (if that's possible in time-series data streams?)
  • and many more (split/clone, rejection of malformed _ids, etc.)

@tlrx tlrx marked this pull request as ready for review October 29, 2025 15:00
@elasticsearchmachine elasticsearchmachine added the needs:triage label Oct 29, 2025
TIME_SERIES_SORT = new FieldSortSpec[] { new FieldSortSpec(TimeSeriesIdFieldMapper.NAME), timeStampSpec };
TIME_SERIES_WITH_SYNTHETIC_ID_SORT = new FieldSortSpec[] {
new FieldSortSpec(TimeSeriesIdFieldMapper.NAME),
new FieldSortSpec(DataStreamTimestampFieldMapper.DEFAULT_PATH) };
Contributor

Do you need to drop the DESC ordering for timestamp? This is used for processing recent data first, and dropping it will penalize query performance.

Member Author

I changed the ordering of the timestamp because it was easier to reason about end-to-end.

When applying soft-updates of documents, Lucene iterates over the terms (that is, _id values) to know which documents must be soft-updated. It has an optimization to stop applying updates once it finds a term (i.e., another _id value) that is greater than the one used for the soft-update.

This comparison is done between two values of _id (the value for the update, and the next value from the terms enumeration in the segment), which are in fact arrays of bytes. So we want the lexicographical ordering of those arrays to match the ordering of documents in the segment, and using big-endian encoded timestamps allows that (i.e., if timestamp1 < timestamp2 then byte array 1 < byte array 2). It also means that documents must be sorted in the segment according to the natural ordering of the timestamp long, so ascending.
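
To illustrate the early-termination idea described above, here is a simplified sketch. It is not Lucene's code: the class name, the method, and the plain list of terms are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

final class EarlyTerminationSketch {
    /**
     * Counts how many terms are visited before the scan stops.
     * The terms must be sorted in unsigned byte order.
     */
    static int visitedTerms(List<byte[]> sortedTerms, byte[] target) {
        int visited = 0;
        for (byte[] term : sortedTerms) {
            visited++;
            if (Arrays.compareUnsigned(term, target) > 0) {
                // Every later term is greater as well, so no further match is possible
                break;
            }
        }
        return visited;
    }
}
```

The shortcut is only safe when the byte order of the terms agrees with the order in which documents appear in the segment, which is why the encoding of the timestamp matters.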

Member

@tlrx Can we encode the synthetic ID using _tsid and Long.MAX_VALUE - timestamp instead? Changing the index sort would break downsampling and rate calculation in ES|QL.

Member Author

Great idea, thanks Nhat! I pushed d71316d.

Member

Thank you, Tanguy!

assert 0 <= tsIdOrd : tsIdOrd;
assert tsIdOrd < tsIdDocValues.getValueCount() : tsIdOrd;

for (int docID = 0; docID != DocIdSetIterator.NO_MORE_DOCS; docID = tsIdDocValues.nextDoc()) {
Contributor

@martijnvg looks like a good candidate for docvalue skipper? Food for thought, nothing needed in this PR.

Member Author

Yes, Francisco identified this too.

Member

Yes, and doc value skippers are already enabled for the _tsid field. This sounds like a good follow-up change.

Additionally, the tsdb codec for the primary sort field has a special encoding that allows us to binary search to the target ordinal (see SortedOrdinalReader).
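
As a rough illustration of why a primary sort on _tsid makes a binary search possible: when documents are sorted by _tsid, ordinals are non-decreasing in doc ID order. The sketch below uses a plain `int[]` and hypothetical names. It does not reflect how SortedOrdinalReader is actually implemented.

```java
final class FirstDocForOrdinalSketch {
    /**
     * Returns the first doc ID whose ordinal equals targetOrd, or -1 if there is none.
     * ordsByDoc must be non-decreasing, as it would be under a primary sort on _tsid.
     */
    static int firstDoc(int[] ordsByDoc, int targetOrd) {
        int lo = 0;
        int hi = ordsByDoc.length; // search the half-open range [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (ordsByDoc[mid] < targetOrd) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo < ordsByDoc.length && ordsByDoc[lo] == targetOrd ? lo : -1;
    }
}
```

Compared with stepping through every document via `nextDoc()`, this finds the start of a time series in a logarithmic number of probes.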

Member Author

Thanks both, I'll address this in a follow up.

Contributor

@fcofdez fcofdez left a comment

Looks great, I left a few minor comments. I would wait until someone with more expertise comments on the sorting changes.

// see IndexRequest#autoGenerateTimeBasedId.
return hashToShardId(ByteUtils.readIntLE(idBytes, addIdWithRoutingHash ? idBytes.length - 9 : 0));
int hash;
if (addIdWithRoutingHash) {
Contributor

nit: I wonder if we would see some performance degradation from all this new branching? I wouldn't expect it to be significant, but I wanted to mention it.

Member Author

I think this specific branch is OK, but your comment made me think about how IndexRouting is instantiated: we don't want to read the USE_SYNTHETIC_ID setting on every routing operation.

So I pushed 933e280 to compute the useTimeSeriesSyntheticId flag once, when IndexMetadata is built, and to use this flag for routing operations.

document.add(versionField);
if (useSyntheticId) {
// Use a synthetic _id field which is not indexed nor stored
document.add(IdFieldMapper.syntheticIdField(id));
Contributor

I wonder whether, once we adopt the new stored fields format that skips storing the _id, we will need to adapt SearchBasedChangesSnapshot, which loads the _id from the stored fields as far as I can read in the code.

Member Author

My understanding is that SearchBasedChangesSnapshot uses TsIdLoader to load document ids since #97409. I'll see if I can write a test for this in a follow up.

Member Author

Taking this back, it uses the stored _id field. More work is needed to have SearchBasedChangesSnapshot implementations work with synthetic _id, but that should not prevent the merge of this PR.

@elasticsearchmachine
Collaborator

Hi @tlrx, I've created a changelog YAML for you.

@tlrx tlrx requested review from dnhatn, fcofdez and kkrik-es November 3, 2025 13:07
@tlrx
Member Author

tlrx commented Nov 3, 2025

Thanks all for your reviews. I applied your feedback; this is ready for another round of reviews.

var routingHash = TsidExtractingIdFieldMapper.extractRoutingHashBytesFromSyntheticId(uid);

if (useDocValuesSkipper) {
document.add(SortedDocValuesField.indexedField(TimeSeriesIdFieldMapper.NAME, timeSeriesId));
Contributor

I think we're adding skippers for @timestamp first, but not for tsid? @martijnvg to double-check this part.

Member

Skippers are added for both _tsid and @timestamp.

Contributor

@kkrik-es kkrik-es left a comment

Thanks Tanguy, this is much cleaner now. Please wait for Martijn to check the Lucene changes.

Contributor

@fcofdez fcofdez left a comment

LGTM

Member

@martijnvg martijnvg left a comment

Thanks @tlrx, LGTM.

@tlrx tlrx merged commit 184c51b into elastic:main Nov 5, 2025
34 checks passed
@tlrx tlrx deleted the 2025/10/24/new-id-format branch November 5, 2025 13:31
@tlrx
Member Author

tlrx commented Nov 5, 2025

Thanks all!

afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Nov 6, 2025
…37274)

This pull request follows elastic#136810 and introduces a new format for 
documents _id fields in time-series datastreams. In order to test this 
new format, the TSDB synthetic terms and postings format 
implementations are also changed.

The current _id format is composed of a routing hash, the hashed 
value of the _tsid and a timestamp. While it is possible to extract the 
routing hash and the timestamp from a document _id value, it is 
not possible to extract the original _tsid value.

This is an issue for synthetic _id as the document _id value is not
indexed anymore: instead the synthetic _id is computed at runtime 
from the value of the routing hash, _tsid and timestamp. Therefore 
we need to be able to extract the routing hash/_tsid/timestamp 
values from the _id value and vice-versa.

The format for the synthetic _id in this pull request has been changed to be:
    _tsid (variable length)
    Long.MAX_VALUE - timestamp (unsigned long on 8 bytes encoded using big endian)
    routing hash _ts_routing_hash (4 bytes)

Extracting the values from the _id is then used for routing GETs and DELETEs 
requests to the appropriate shard or setting the
 _tsid/timestamp/_ts_routing_hash fields in tombstone documents.

Note that in searches, the document _id is built using the existing 
TsIdLoader (that has been adjusted).

It is also important that the generated _id can be sorted lexicographically, 
as Lucene stops applying doc values updates when it seeks to a term that 
is greater than the one used in the soft-update. The ordering of the arrays 
of bytes representing the _id must match the ordering of documents in 
the segment, and to do that the timestamp value is stored in the array 
as Long.MAX_VALUE - timestamp. Docs with higher timestamp are then 
sorted first.

The SyntheticIdTermsEnum and SyntheticIdPostingsEnum introduced in 
elastic#136810 had to be adjusted for the new _id format. We expect their 
implementation to be somewhat slow as it requires several lookups to 
work. In a different change we'll add a bloom filter on top of these 
enumerations to avoid costly lookups.

Relates elastic#136304
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 6, 2025
BASE=7a23516cce48dcd78aed0075a398b604531f1e81
HEAD=608ff674cd870bd5574c62682139de356f081672
Branch=main
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 7, 2025
BASE=7a23516cce48dcd78aed0075a398b604531f1e81
HEAD=608ff674cd870bd5574c62682139de356f081672
Branch=main
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Nov 10, 2025
6 participants